The recent advent of Large Language Models (LLMs) has ushered sophisticated reasoning capabilities into the realm of video through Video Large Language Models (VideoLLMs). However, VideoLLMs currently rely on a single vision encoder for all of their visual processing, which limits the amount and type of visual information that can be conveyed to the LLM. Our method, MERV, Multi-Encoder Representation of Videos, instead leverages multiple frozen visual encoders to create a unified representation of a video, providing the VideoLLM with a comprehensive set of specialized visual knowledge. Spatio-temporally aligning the features from each encoder allows us to tackle a wider range of open-ended and multiple-choice video understanding questions and outperform prior state-of-the-art works on their data mixes. MERV is up to 3.79% better in accuracy than Video-LLaVA across the standard suite of video understanding benchmarks, while also achieving a better Video-ChatGPT score. We also improve upon SeViLA, the previous best on zero-shot Perception Test accuracy, by 2.21%. MERV introduces minimal extra parameters and trains faster than equivalent single-encoder approaches. Finally, we provide qualitative evidence that our model captures domain knowledge from each encoder simultaneously, such as on the motion classification tasks found in Something-Something v2. Our results offer promising directions for future research in utilizing multiple vision encoders for comprehensive video understanding.
Prior Vision-Language Models (VLMs) and Video-Language Models (VidLMs) have used only a single vision encoder to process the visual input, generally picking CLIP for its dual vision-text understanding. However, this limits the amount and type of visual information that can be conveyed to the language model, as shown both by prior works like Tong et al. (2023) and by our own investigations (e.g., the figure above and later experiments).
We propose MERV, a Multi-Encoder Representation of Videos, as a new method for integrating multiple visual encoders (DINOv2, ViViT, SigLIP, and LanguageBind) into a single VideoLLM using a cross-attentive encoder mixer for fusing representations. We introduce a spatio-temporally aligned representation for mixing the information from multiple types of visual encoders. Given the computational complexity of video tasks, we carefully experiment with optimization strategies and the system implementation, allowing us to combine four distinct visual encoders with minimal computational overhead.
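To give a rough sense of how such a fusion module can work, below is a minimal PyTorch sketch of spatio-temporally aligning features from several frozen encoders and fusing them with a cross-attentive mixer. The tensor shapes, hidden sizes, target grid, and learnable-query design here are illustrative assumptions, not the exact MERV implementation described in the paper.

```python
# Illustrative sketch of a cross-attentive encoder mixer. Shapes, hidden
# sizes, and the learnable-query design are assumptions for clarity, not
# the authors' exact implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossAttentiveMixer(nn.Module):
    def __init__(self, encoder_dims, d_model=1024, num_queries=256, num_heads=8):
        super().__init__()
        # Project each encoder's features into a shared embedding space.
        self.projs = nn.ModuleList(nn.Linear(d, d_model) for d in encoder_dims)
        # Learnable queries attend over the concatenated, aligned features.
        self.queries = nn.Parameter(torch.randn(num_queries, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    @staticmethod
    def align(feats, target_thw):
        # feats: (B, T, H, W, C); resize to a common (T', H', W') grid so all
        # encoders' features are spatio-temporally aligned before fusion.
        x = feats.permute(0, 4, 1, 2, 3)  # (B, C, T, H, W)
        x = F.interpolate(x, size=target_thw, mode="trilinear", align_corners=False)
        return x.permute(0, 2, 3, 4, 1)   # (B, T', H', W', C)

    def forward(self, encoder_feats, target_thw=(8, 14, 14)):
        # encoder_feats: list of (B, T_i, H_i, W_i, C_i) tensors, one per encoder.
        tokens = []
        for proj, feats in zip(self.projs, encoder_feats):
            x = self.align(feats, target_thw)   # spatio-temporal alignment
            x = proj(x).flatten(1, 3)           # (B, T'*H'*W', d_model)
            tokens.append(x)
        kv = torch.cat(tokens, dim=1)           # concatenate all encoders' tokens
        q = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)
        fused, _ = self.attn(q, kv, kv)         # cross-attention fusion
        return self.norm(fused)                 # (B, num_queries, d_model)
```

In a full VideoLLM, the fused tokens would then be projected into the language model's embedding space and passed alongside the text tokens; the sketch above only covers the alignment and mixing step.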
We find that our method, which generates a video representation using multiple visual encoders that specialize in different skills of video understanding, outperforms Video-LLaVA across nearly all of the benchmarks, with a 3.23% gain on MSVD and a 3.79% gain on ActivityNet. Both of our variants perform better overall than Video-LLaVA, even when using less data with just Stage 2 training, as shown by the MERV numbers. While MERV-Full is not a strict improvement over MERV, it still improves on some difficult benchmarks thanks to its additional video-language alignment. Compared to LLaMA-VID, which uses a different training mix, we also do better on nearly all benchmarks, by up to around 4.5% across Perception Test, ActivityNet, and TVQA. Even when our accuracy numbers are similar to previous methods, such as on TGIF, our score metrics are significantly higher, increasing from 3.0 to 3.28 compared to Video-ChatGPT.
MERV-Full (using full Stage 1 + 2 finetuning) outperforms the previous state of the art on zero-shot Perception Test with 48.41% accuracy, compared to SeViLA's 46.2%. Overall, our design shows a significant improvement over the base Video-LLaVA and prior methods as a whole.
We first analyze two key questions about our choice of visual encoders: 1) do we benefit from using more than one encoder, and 2) do we need all four, with each encoder providing a meaningful contribution towards the final performance? The answer to both is a resounding yes, as can be seen in the plot below.
We also conduct a detailed qualitative study of our model's capabilities on the Something-Something v2 dataset. We show that MERV accurately captures both the contrastive encoders' strengths on general vision-language understanding and ViViT's specialization on temporally sensitive tasks (e.g., distinguishing pushing left vs. right), without trading off performance between these specializations as single-encoder models do.
@misc{chung2024unifying,
title={Unifying Specialized Visual Encoders for Video Language Models},
author={Jihoon Chung and Tyler Zhu and Max Gonzalez Saez-Diez and Juan Carlos Niebles and Honglu Zhou and Olga Russakovsky},
year={2024},
eprint={2306.XXXX},
archivePrefix={arXiv},
primaryClass={cs.LG}
}