Meta AI launches an “omnivorous” model for image, video, and 3D classification tasks: on January 24 it was reported that Meta AI recently introduced “Omnivore”, a model that can classify data from different visual modalities, including images, videos, and 3D data.
For example, given the leftmost image in the original figure, it can retrieve the best-matching results from depth-map, single-view 3D, and video datasets.
Previously, this had to be implemented in different models; now one model does it.
Moreover, Omnivore is easy to train: using off-the-shelf standard datasets, its performance matches or exceeds that of the corresponding single-modality models.
Experimental results show that Omnivore achieves 86.0% accuracy on the ImageNet image-classification dataset, 84.1% on the Kinetics action-recognition dataset, and 67.1% on the SUN RGB-D single-view 3D scene-classification dataset.
In addition, Omnivore performs all of this cross-modal recognition without needing access to correspondences between modalities.
An “omnivorous eater” that can digest different visual modalities
Omnivore is based on the Transformer architecture, whose flexibility allows it to be jointly trained on classification tasks from different modalities.
The model architecture is as follows:
Omnivore converts input images, videos, and single-view 3D images into embeddings and feeds them into the Transformer.
Although any vision transformer architecture could be used for the patch embeddings, given the Swin Transformer's strong performance on image and video tasks, it is used here as the base model.
Specifically, Omnivore converts images into patches, videos into spatio-temporal tubes, and single-view 3D images into RGB patches and depth patches.
The patches are then mapped into embeddings using linear layers, with the same linear layer used for RGB patches and a separate one for depth patches.
In general, Omnivore converts all visual modalities into a common format through embeddings and then uses a series of spatio-temporal attention operations to build a unified representation of the different visual modalities.
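The patchification-and-projection step described above can be sketched in numpy as follows. The patch size, tube length, embedding width, and the way depth embeddings are fused with RGB embeddings (summed per patch here) are illustrative assumptions, not Meta's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim = 96   # illustrative width (Swin-T also uses 96)
patch = 4        # 4x4 spatial patches
tube_t = 2       # temporal extent of a video tube

# Shared projection for RGB patches and a separate one for depth patches.
w_rgb = rng.normal(size=(patch * patch * 3, embed_dim))
w_depth = rng.normal(size=(patch * patch * 1, embed_dim))

def patchify_image(img):
    """(H, W, C) array -> (N, patch*patch*C) flattened patches."""
    H, W, C = img.shape
    p = img.reshape(H // patch, patch, W // patch, patch, C)
    return p.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)

def embed_image(img):
    """Image -> one embedding per spatial patch."""
    return patchify_image(img) @ w_rgb

def embed_video(vid):
    """(T, H, W, 3) clip -> spatio-temporal tubes -> embeddings.
    Here each tube simply averages the per-frame patch embeddings of
    tube_t consecutive frames (an illustrative simplification)."""
    per_frame = np.stack([patchify_image(f) @ w_rgb for f in vid])
    T = vid.shape[0]
    tubes = per_frame.reshape(T // tube_t, tube_t, *per_frame.shape[1:])
    return tubes.mean(axis=1).reshape(-1, embed_dim)

def embed_rgbd(rgb, depth):
    """Single-view 3D: RGB patches use the shared layer, depth patches
    a separate one; the two embeddings are summed per patch (one
    plausible fusion, assumed here)."""
    d = depth[..., None]  # (H, W, 1)
    return patchify_image(rgb) @ w_rgb + patchify_image(d) @ w_depth

tokens_img = embed_image(rng.normal(size=(32, 32, 3)))
tokens_vid = embed_video(rng.normal(size=(4, 32, 32, 3)))
tokens_rgbd = embed_rgbd(rng.normal(size=(32, 32, 3)),
                         rng.normal(size=(32, 32)))
print(tokens_img.shape, tokens_vid.shape, tokens_rgbd.shape)
```

After this step every modality is just a sequence of `embed_dim`-wide tokens, which is what lets a single Transformer consume all of them.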
The researchers jointly trained various Omnivore models on the ImageNet-1K dataset, Kinetics-400 dataset, and SUN RGB-D dataset.
This approach is similar to multi-task learning and cross-modal alignment, with two important differences:
- It does not assume that input observations are aligned (i.e., it does not assume any correspondence between images, videos, and 3D data);
- Nor does it assume that these datasets share the same label space.
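A minimal sketch of what such unaligned joint training can look like: each step draws a batch from a single dataset, a shared encoder produces the representation, and each dataset keeps its own classification head and label space. The toy encoder, head shapes, and loss routing are assumptions for illustration; the real model trains a Swin transformer with gradient updates:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 16  # shared embedding width (illustrative)

# Three toy datasets with unrelated samples and disjoint label spaces:
# no pairing between an image, a video, and a 3D scan is assumed.
datasets = {
    "IN1K": (rng.normal(size=(8, D)), rng.integers(0, 1000, 8), 1000),
    "K400": (rng.normal(size=(8, D)), rng.integers(0, 400, 8), 400),
    "SUN":  (rng.normal(size=(8, D)), rng.integers(0, 19, 8), 19),
}

# One shared encoder (a single linear map here) ...
w_shared = rng.normal(size=(D, D))
# ... and a separate classification head per label space.
heads = {name: rng.normal(size=(D, n_cls)) * 0.01
         for name, (_, _, n_cls) in datasets.items()}

def cross_entropy(logits, labels):
    logits = logits - logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

# Round-robin over datasets: each step uses a batch from ONE dataset,
# so no cross-modal correspondence is ever required.
for name, (x, y, _) in datasets.items():
    feats = np.tanh(x @ w_shared)            # shared representation
    loss = cross_entropy(feats @ heads[name], y)
    print(f"{name}: loss {loss:.3f}")        # gradient step omitted
```

The key point the sketch mirrors is that the loss for each batch only ever touches one dataset's labels, so nothing in training requires aligned triplets of image, video, and 3D data.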
Super SOTA performance
In the experiments, the researchers first compared Omnivore with a modality-specific model (referred to as Specific in the table below) for each visual modality.
There are three different model sizes: T, S, and B.
The pre-trained model is fine-tuned on all seven downstream tasks.
Image-specific models are pre-trained on IN1K, while the video-specific and single-view 3D-specific models are initialized by inflating the pre-trained image-specific model and then fine-tuned on K400 and SUN RGB-D, respectively.
Omnivore was found to perform on par with or better than each specific model on almost all downstream tasks.
Among them, Swin-B, the largest variant, achieves SOTA on all tasks.
Comparing Omnivore to a specific model with the same model architecture and number of parameters also yields the same result.
Here, Omnivore is jointly trained from scratch on the IN1K, K400, and SUN datasets, while the modality-specific models are trained separately for each dataset: the ImageSwin model is trained from scratch, and the VideoSwin and DepthSwin models are fine-tuned from it.
The researchers next compared Omnivore with SOTA models on image, video, and 3D data classification tasks.
The results remain strong: Omnivore outperforms the SOTA models on all pre-training tasks (image, video, and 3D data, from top to bottom in the table).
Furthermore, when retrieving depth maps for a given RGB image on the ImageNet-1K dataset, Omnivore was able to return semantically similar, correct answers, even though it was never trained on ImageNet-1K depth maps.
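Such retrieval reduces to nearest-neighbor search in the shared embedding space: embed the RGB query, embed the depth-map gallery, and rank by similarity. A toy sketch with made-up embedding vectors (not actual Omnivore outputs):

```python
import numpy as np

rng = np.random.default_rng(2)
D = 8  # illustrative embedding width

# Hypothetical embeddings in the shared space: one RGB query and a
# small gallery of depth-map embeddings.
query = rng.normal(size=D)
gallery = rng.normal(size=(5, D))
# Plant one semantically close depth map near the query.
gallery[3] = query + 0.01 * rng.normal(size=D)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Rank the depth gallery by similarity to the RGB query.
sims = np.array([cosine(query, g) for g in gallery])
print("nearest depth map:", int(sims.argmax()))
```

Because both modalities land in the same embedding space, no RGB-to-depth pairing is needed at training time for this comparison to be meaningful.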
Finally, the authors note that while this “omnivore” represents a clear advance over traditional modality-specific models, it still has some limitations.
For example, it currently only works with single-view 3D images, not with other 3D representations such as voxels or point clouds.
Paper address: click to open
The code is open source: click to open