Friday, August 18, 2023

ImageBind


Meta's AI research team released a paper introducing ImageBind, a model that binds six modalities into a single joint embedding space. It outperforms prior specialist models trained for a single modality.

It joins data from:

  • Images
  • Text
  • Audio
  • Depth
  • Thermal
  • IMU data

It builds on large-scale vision-language models and extends their zero-shot capabilities to new modalities simply by using each modality's natural pairing with images, such as video-audio and image-depth data, to learn a single joint embedding space.

The paper shows that not every combination of paired modalities is required to train a joint embedding; image-paired data alone is sufficient to bind all the modalities together.
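The binding idea above can be sketched as a contrastive (InfoNCE-style) objective that pulls each image embedding toward the embedding of its paired sample from another modality (e.g. the matching audio clip) while pushing it away from non-matching ones. A minimal NumPy illustration, with all function names and shapes hypothetical, not the paper's actual implementation:

```python
import numpy as np

def l2_normalize(x):
    # Project embeddings onto the unit sphere, as is standard for
    # cosine-similarity contrastive losses.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def infonce_loss(img_emb, other_emb, temperature=0.07):
    """Symmetric InfoNCE loss between image embeddings and the
    embeddings of a paired modality; row i of each array is a
    matched pair, so positives sit on the diagonal."""
    img = l2_normalize(img_emb)
    other = l2_normalize(other_emb)
    logits = img @ other.T / temperature  # (n, n) similarity matrix
    n = logits.shape[0]

    def xent(lg):
        # Cross-entropy with the diagonal as the target class.
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    # Average the image->other and other->image directions.
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 16))
aligned = img + 0.01 * rng.normal(size=(8, 16))  # paired modality, near its image
unrelated = rng.normal(size=(8, 16))             # shuffled, unpaired embeddings

# Well-aligned pairs should incur a much lower loss than random ones.
print(infonce_loss(img, aligned) < infonce_loss(img, unrelated))
```

Because every modality is aligned to images through losses of this shape, two modalities that were never trained together (say, audio and depth) end up comparable through the shared image-anchored space.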

Ref: https://lnkd.in/gXNndk63
