The recent advent of Large Language Models (LLMs) has ushered sophisticated reasoning capabilities into the realm of video through Video Large Language Models (VideoLLMs). However, VideoLLMs currently rely on a single vision encoder for all of their visual processing, which limits the amount and type of visual information that can be conveyed to the LLM. Our method, MERV, Multi-Encoder Representation of Videos, instead leverages multiple frozen visual encoders to create a unified representation of a video, providing the VideoLLM with a comprehensive set of specialized visual knowledge. Spatio-temporally aligning the features from each encoder allows us to tackle a wider range of open-ended and multiple-choice video understanding questions and outperform prior state-of-the-art works on their data mixes. MERV is up to 3.79% better in accuracy than Video-LLaVA across the standard suite of video understanding benchmarks, while also achieving a better Video-ChatGPT score. We also improve upon SeViLA, the previous best on zero-shot Perception Test accuracy, by 2.21%. MERV introduces minimal extra parameters and trains faster than equivalent single-encoder approaches. Finally, we provide qualitative evidence that our model captures domain knowledge from each encoder simultaneously, such as on the motion classification tasks found in Something-Something v2. Our results offer promising directions for future research in utilizing multiple vision encoders for comprehensive video understanding.
Prior Vision-Language Models (VLMs) and Video-Language Models (VidLMs) have used only a single vision encoder to process the visual input, generally picking CLIP for its dual vision-text understanding. However, this limits the amount and type of visual information that can be conveyed to the language model, as shown both by prior works like Tong et al. (2023) and by our own investigations (e.g., the figure above and later experiments).
We propose MERV, a Multi-Encoder Representation of Videos, as a new method for integrating multiple visual encoders (DINOv2, ViViT, SigLIP, and LanguageBind) into a single VideoLLM using a cross-attentive encoder mixer for fusing representations. We introduce a spatio-temporally aligned representation for mixing the information from multiple types of visual encoders. Given the computational complexity of video tasks, we carefully experiment with optimization strategies and the system implementation, allowing us to combine four distinct visual encoders with minimal computational overhead.
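To give a rough sense of how such a fusion module can work, below is a minimal PyTorch sketch of spatio-temporally aligning features from several frozen encoders and fusing them with a cross-attentive mixer. The tensor shapes, hidden sizes, target grid, and learnable-query design here are illustrative assumptions, not the exact MERV implementation described in the paper.

```python
# Illustrative sketch of a cross-attentive encoder mixer. Shapes, hidden
# sizes, and the learnable-query design are assumptions for clarity, not
# the authors' exact implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossAttentiveMixer(nn.Module):
    def __init__(self, encoder_dims, d_model=1024, num_queries=256, num_heads=8):
        super().__init__()
        # Project each encoder's features into a shared embedding space.
        self.projs = nn.ModuleList(nn.Linear(d, d_model) for d in encoder_dims)
        # Learnable queries attend over the concatenated, aligned features.
        self.queries = nn.Parameter(torch.randn(num_queries, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    @staticmethod
    def align(feats, target_thw):
        # feats: (B, T, H, W, C); resize to a common (T', H', W') grid so all
        # encoders' features are spatio-temporally aligned before fusion.
        x = feats.permute(0, 4, 1, 2, 3)  # (B, C, T, H, W)
        x = F.interpolate(x, size=target_thw, mode="trilinear", align_corners=False)
        return x.permute(0, 2, 3, 4, 1)   # (B, T', H', W', C)

    def forward(self, encoder_feats, target_thw=(8, 14, 14)):
        # encoder_feats: list of (B, T_i, H_i, W_i, C_i) tensors, one per encoder.
        tokens = []
        for proj, feats in zip(self.projs, encoder_feats):
            x = self.align(feats, target_thw)   # spatio-temporal alignment
            x = proj(x).flatten(1, 3)           # (B, T'*H'*W', d_model)
            tokens.append(x)
        kv = torch.cat(tokens, dim=1)           # concatenate all encoders' tokens
        q = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)
        fused, _ = self.attn(q, kv, kv)         # cross-attention fusion
        return self.norm(fused)                 # (B, num_queries, d_model)
```

In a full VideoLLM, the fused tokens would then be projected into the language model's embedding space and passed alongside the text tokens; the sketch above only covers the alignment and mixing step.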
We find that our method, which generates a video representation using multiple visual encoders that specialize in different skills of video understanding, outperforms Video-LLaVA across nearly all of the benchmarks, with a 3.23% gain on MSVD and a 3.79% gain on ActivityNet. Both of our variants perform better overall than Video-LLaVA, even when using less data with just Stage 2 training, as shown by the MERV numbers. While MERV-Full is not a strict improvement over MERV, it still improves on some difficult benchmarks thanks to its additional video-language alignment. Compared to LLaMA-VID, which uses a different training mix, we also do better on nearly all benchmarks, by up to around 4.5% across Perception Test, ActivityNet, and TVQA. Even when our accuracy numbers are similar to previous methods, such as on TGIF, our score metrics are significantly higher, increasing from 3.0 to 3.28 compared to Video-ChatGPT.
MERV-Full (using full Stage 1 + 2 finetuning) outperforms the previous state of the art on zero-shot Perception Test with 48.41% accuracy, compared to SeViLA's 46.2%. Overall, our design shows a significant improvement over the base Video-LLaVA and prior methods as a whole.
We first analyze two key questions about our choice of visual encoders: 1) do we benefit from using more than one encoder, and 2) do we need all four, with each encoder providing a meaningful contribution towards the final performance? The answer to both is a resounding yes, as can be seen in the plot below.
We also conduct a detailed qualitative study of our model's capabilities on the Something-Something v2 dataset. We show that MERV accurately captures both the contrastive encoders' strengths on general vision-language understanding and ViViT's specialization on temporally sensitive tasks (e.g., distinguishing pushing left vs. right), without trading off performance between these specializations as single-encoder models do.
@misc{chung2024unifying,
title={Unifying Specialized Visual Encoders for Video Language Models},
author={Jihoon Chung and Tyler Zhu and Max Gonzalez Saez-Diez and Juan Carlos Niebles and Honglu Zhou and Olga Russakovsky},
year={2024},
eprint={2306.XXXX},
archivePrefix={arXiv},
primaryClass={cs.LG}
}