CoPE-VideoLM: Codec Primitives For Efficient Video Language Models

Generative AI & LLMs
Published on arXiv: 2602.13191v1
Authors

Sayan Deb Sarkar, Rémi Pautrat, Ondrej Miksik, Marc Pollefeys, Iro Armeni, Mahdi Rad, Mihai Dusmanu

Abstract

Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. To fit within the maximum context window, current methods use keyframe sampling, which can miss both macro-level events and micro-level details due to its sparse temporal coverage. Furthermore, processing full images and their tokens for each frame incurs substantial computational overhead. To address these limitations, we propose to leverage video codec primitives (specifically motion vectors and residuals), which natively encode video redundancy and sparsity without requiring expensive full-image encoding for most frames. To this end, we introduce lightweight transformer-based encoders that aggregate codec primitives and align their representations with image encoder embeddings through a pre-training strategy that accelerates convergence during end-to-end fine-tuning. Our approach reduces the time-to-first-token by up to $86\%$ and token usage by up to $93\%$ compared to standard VideoLMs. Moreover, by varying the keyframe and codec primitive densities, we are able to maintain or exceed performance on $14$ diverse video understanding benchmarks spanning general question answering, temporal reasoning, long-form understanding, and spatial scene understanding.
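
The codec primitives referenced above (motion vectors, plus the residuals left over after motion compensation) are already present in any H.264/HEVC bitstream and can be surfaced at decode time without re-encoding. Below is a minimal sketch of reading motion vectors with the PyAV bindings for FFmpeg; the `+export_mvs` decoder flag is standard FFmpeg behaviour, but the exact way the option is passed, the side-data lookup key, and the input filename are assumptions about the tooling, not details taken from the paper.

```python
import av  # PyAV: Python bindings for FFmpeg


def iter_motion_vectors(path):
    """Yield (frame_index, motion_vector_side_data) for frames that carry MVs.

    Assumes an FFmpeg build that honours the `+export_mvs` decoder flag
    (H.264/HEVC/MPEG-4 streams); keyframes typically carry no motion vectors.
    """
    with av.open(path) as container:
        stream = container.streams.video[0]
        # Ask the decoder to export motion vectors as frame side data.
        # How decoder options are set may differ across PyAV versions.
        stream.codec_context.options = {"flags2": "+export_mvs"}
        for i, frame in enumerate(container.decode(stream)):
            # Side-data key name is an FFmpeg/PyAV convention and may vary by version.
            mvs = frame.side_data.get("MOTION_VECTORS")
            if mvs is not None:
                yield i, mvs


if __name__ == "__main__":
    for idx, mvs in iter_motion_vectors("example.mp4"):  # hypothetical input file
        print(idx, mvs)
```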

Paper Summary

Problem
Video Language Models (VideoLMs) are a major advancement in multi-modal AI, enabling systems to understand how visual narratives, objects, actions, and relationships evolve across video sequences. However, VideoLMs have a maximum context window that limits how much of a video can be provided as input. Current methods cope by sampling a sparse set of keyframes, which can miss both macro-level events and fine-grained details, and encoding every sampled frame as a full image incurs substantial computational overhead, limiting their use in real-world, latency-sensitive scenarios.
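To make the context-window pressure concrete, the back-of-the-envelope calculation below uses assumed numbers (a ViT-style encoder emitting roughly 196 tokens per frame and an 8K-token context window, neither of which is reported in the paper) to show how quickly full-frame tokens exhaust the budget, and how a roughly 93% smaller token stream stretches it.

```python
# Back-of-the-envelope token budget for a VideoLM.
# All numbers are illustrative assumptions, not values reported in the paper.
tokens_per_frame = 196   # e.g. a 14x14 patch grid from a ViT-style image encoder
context_window = 8192    # assumed LLM context length
frames_per_second = 1    # assumed sampling rate

max_frames = context_window // tokens_per_frame
print(f"Frames that fit before any text tokens: {max_frames}")          # ~41 frames
print(f"Video coverage at {frames_per_second} fps: ~{max_frames} s")    # well under a minute

# Applying the up-to-93% token reduction reported in the abstract as a rough
# per-frame factor (an illustrative simplification) stretches the same budget
# over far more temporal coverage.
compressed_tokens_per_frame = int(tokens_per_frame * (1 - 0.93))
print(f"Equivalent frames with compressed tokens: {context_window // compressed_tokens_per_frame}")
```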
Key Innovation
The researchers propose a novel way to encode videos for VideoLMs by leveraging the standardized compressed representation already produced by video codecs. The approach, called CoPE-VideoLM, uses codec primitives such as motion vectors and residuals to skip redundant RGB information, reducing Time-To-First-Token (TTFT) by up to 86% and token usage by up to 93% compared to standard VideoLMs. It also introduces two lightweight transformer-based architectures for encoding codec primitives, whose representations are aligned with the image encoder's embeddings through a pre-training strategy, achieving substantially higher compression rates and lower token counts than traditional image-based encoders.
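The paper's two encoder architectures are not reproduced here, but the PyTorch sketch below illustrates the general shape such a component could take: a small transformer that aggregates per-macroblock motion-vector features and pools them into a handful of tokens projected into the image encoder's embedding space. Every dimension, the query-pooling scheme, and the 4-value motion-vector featurisation are illustrative assumptions rather than the paper's design.

```python
import torch
import torch.nn as nn


class CodecPrimitiveEncoder(nn.Module):
    """Illustrative (not the paper's) lightweight encoder for codec primitives.

    Input: per-macroblock motion-vector features of shape (B, N, mv_dim),
    e.g. (dx, dy, source, block size). Output: (B, num_out_tokens, embed_dim)
    tokens in the same space as the image encoder's embeddings.
    """

    def __init__(self, mv_dim=4, d_model=256, embed_dim=1024,
                 num_layers=2, num_out_tokens=8):
        super().__init__()
        self.input_proj = nn.Linear(mv_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Learned query tokens pool the variable-length MV sequence into a
        # fixed, small number of output tokens (keeps token usage low).
        self.queries = nn.Parameter(torch.randn(num_out_tokens, d_model))
        self.output_proj = nn.Linear(d_model, embed_dim)

    def forward(self, mv_feats):                      # (B, N, mv_dim)
        x = self.input_proj(mv_feats)                 # (B, N, d_model)
        q = self.queries.unsqueeze(0).expand(x.size(0), -1, -1)
        x = self.encoder(torch.cat([q, x], dim=1))    # prepend query tokens
        x = x[:, : self.queries.size(0)]              # keep pooled queries only
        return self.output_proj(x)                    # (B, num_out_tokens, embed_dim)


# Example: 1,200 motion vectors from one inter-coded frame -> 8 LLM-ready tokens.
tokens = CodecPrimitiveEncoder()(torch.randn(2, 1200, 4))
print(tokens.shape)  # torch.Size([2, 8, 1024])
```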
Practical Impact
By reducing TTFT and token usage, CoPE-VideoLM lets VideoLMs process video content far more efficiently, making them suitable for latency-sensitive, real-world applications such as video question answering, robotics, and human-computer interaction. It also opens new directions for video understanding, positioning codec-based methods as a practical and efficient foundation for future VideoLMs.
Analogy / Intuitive Explanation
Think of a video as a long, complex story with many frames. Traditional VideoLMs try to understand this story by looking at every single frame, which is like trying to read a book by looking at every single word. CoPE-VideoLM, on the other hand, uses the "compressed" version of the story, where only the most important frames and changes between them are encoded. This allows the model to understand the story more efficiently, like a reader who can quickly grasp the main plot and characters by looking at the chapter headings and summaries.
Paper Information
Categories: cs.CV, cs.AI, cs.CL
Published Date:
arXiv ID: 2602.13191v1
