Meta's AI research team released a paper introducing ImageBind, a model that binds six modalities into a single embedding space. It outperforms prior specialist models that were each trained for a particular modality.
It binds data from:
- Images
- Text
- Audio
- Depth
- Thermal
- IMU data
It builds on large-scale vision-language models and extends their zero-shot capabilities to new modalities simply by using each modality's natural pairing with images, such as (video, audio) and (image, depth) data, to learn a single joint embedding space.
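A minimal sketch of what such a pairwise alignment objective can look like: a symmetric InfoNCE-style contrastive loss that pulls an image embedding and its naturally paired embedding from another modality (audio, depth, etc.) together in the shared space. The function name and temperature value here are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def infonce_loss(image_emb: torch.Tensor, other_emb: torch.Tensor,
                 temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss aligning image embeddings with another
    modality's embeddings (e.g. audio or depth) in a shared space.

    Both inputs have shape (batch, dim); row i of each tensor comes from
    the same naturally paired sample (e.g. a video frame and its audio).
    """
    # L2-normalize so the dot product is a cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    other_emb = F.normalize(other_emb, dim=-1)

    # Pairwise similarities: logits[i, j] = sim(image_i, other_j) / temperature
    logits = image_emb @ other_emb.t() / temperature

    # Matching pairs sit on the diagonal
    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    # Contrast in both directions (image -> other and other -> image)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```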
The paper shows that not all combinations of paired modalities are needed to train the joint embedding; image-paired data alone is sufficient to bind the modalities together, so modalities that were never directly paired (e.g. audio and text) still end up aligned.
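To illustrate what this emergent binding enables, here is a toy sketch of zero-shot text-to-audio retrieval through the shared space. The encoders below are hypothetical stand-ins (simple linear layers over random features), not the paper's actual encoders; the point is only that two modalities trained solely against images can still be compared directly.

```python
import torch
import torch.nn.functional as F

dim = 512

# Toy stand-ins for per-modality encoders: each maps raw features into
# the shared embedding space (the real encoders are transformers).
text_encoder = torch.nn.Linear(300, dim)    # e.g. over token features
audio_encoder = torch.nn.Linear(128, dim)   # e.g. over mel-spectrogram features

text_emb = F.normalize(text_encoder(torch.randn(1, 300)), dim=-1)      # (1, dim) query
audio_embs = F.normalize(audio_encoder(torch.randn(16, 128)), dim=-1)  # (16, dim) candidates

# Text and audio were never paired during training, yet they can be
# compared directly because both were aligned to image embeddings.
scores = text_emb @ audio_embs.t()   # (1, 16) cosine similarities
best_clip = scores.argmax(dim=-1)    # index of the best-matching audio clip
print(best_clip)
```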