arXiv:2512.10942 · cs.CV

VL-JEPA: Joint Embedding Predictive Architecture for Vision-language

VL-JEPA is a vision-language model that predicts continuous semantic embeddings of target texts instead of autoregressively generating discrete tokens. This makes it markedly more efficient: it uses 50% fewer trainable parameters than classical VLMs and speeds up inference through selective decoding, while achieving stronger performance across a range of vision-language tasks.


Key Takeaways

VL-JEPA predicts continuous text embeddings, not discrete tokens, for vision-language tasks.

This embedding-space prediction simplifies learning, making it more efficient and robust.

The model uses 50% fewer trainable parameters than classical VLMs while performing better.

It natively supports 'selective decoding', reducing decoding operations by ~2.85x for real-time applications.

VL-JEPA offers a unified architecture for generation, classification, retrieval, and VQA.

It outperforms CLIP, SigLIP2, and Perception Encoder on classification/retrieval benchmarks.
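One way to picture how a single embedding predictor could serve classification and retrieval is nearest-neighbour matching in embedding space. The sketch below is a minimal illustration under that assumption, not the paper's implementation: the `classify` helper, the label names, and the three-dimensional vectors are all hypothetical toy values.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify(predicted_embedding, label_embeddings):
    """Return the candidate label whose embedding is closest to the prediction."""
    return max(
        label_embeddings,
        key=lambda label: cosine_similarity(predicted_embedding, label_embeddings[label]),
    )

# Hypothetical text embeddings for three candidate labels (toy values).
label_embeddings = {
    "pouring water": [0.9, 0.1, 0.0],
    "cutting bread": [0.0, 0.8, 0.6],
    "opening door": [0.1, 0.0, 0.95],
}

# Hypothetical embedding predicted from a video clip (toy values).
predicted = [0.85, 0.2, 0.05]

best_label = classify(predicted, label_embeddings)
print(best_label)
```

Retrieval would follow the same pattern with the roles reversed: rank candidate videos by the similarity of their predicted embeddings to a query text embedding.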

Core Concepts

Joint Embedding Predictive Architecture (JEPA)

JEPA learns by predicting abstract meanings, not raw data, making it efficient and robust.

Embedding Space Prediction

Predicting meaning-rich embeddings is more efficient and robust than predicting individual words.
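The toy sketch below shows why scoring in embedding space can be more forgiving than scoring individual words: two phrasings that share no tokens can still sit close together as embeddings. Everything here is an assumption made for illustration. The example sentences, the hand-picked three-dimensional vectors, and the cosine-distance loss are not taken from the paper, which this summary does not describe at that level of detail.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for identical directions, up to 2 for opposite ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def token_overlap(text_a, text_b):
    """Jaccard overlap between the word sets of two strings."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)

target_text = "the light turns off"
paraphrase = "room goes dark"
unrelated = "a dog barks"

# Hand-picked toy embeddings (assumption): the paraphrase is placed near the target.
toy_embeddings = {
    target_text: [0.7, 0.7, 0.1],
    paraphrase: [0.6, 0.75, 0.2],
    unrelated: [-0.5, 0.1, 0.85],
}

overlap = token_overlap(target_text, paraphrase)
loss_paraphrase = cosine_distance(toy_embeddings[paraphrase], toy_embeddings[target_text])
loss_unrelated = cosine_distance(toy_embeddings[unrelated], toy_embeddings[target_text])

print(f"token overlap: {overlap:.2f}")
print(f"embedding loss, paraphrase: {loss_paraphrase:.3f}")
print(f"embedding loss, unrelated:  {loss_unrelated:.3f}")
```

A token-level comparison treats the paraphrase as a complete miss, while the embedding-space loss stays small for it and large for the unrelated sentence.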

Autoregressive Token Generation

Autoregressive generation is slow and expensive, which VL-JEPA aims to circumvent.

Selective Decoding

Selective decoding makes VL-JEPA highly efficient for real-time applications by only generating text when necessary.
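A minimal sketch of selective decoding follows, assuming that "necessary" means the predicted embedding has drifted noticeably from the last one that was decoded. The cosine threshold rule, the `selective_decode` helper, the `toy_decode` stand-in for a text decoder, and the two-dimensional stream are all hypothetical and chosen only to make the idea concrete.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def selective_decode(embedding_stream, decode, threshold=0.9):
    """Call `decode` only when the embedding drifts from the last decoded one.

    Returns a list of (step_index, decoded_text) pairs.
    """
    outputs = []
    last_decoded = None
    for step, embedding in enumerate(embedding_stream):
        changed = (
            last_decoded is None
            or cosine_similarity(embedding, last_decoded) < threshold
        )
        if changed:
            outputs.append((step, decode(embedding)))
            last_decoded = embedding
    return outputs

# Toy stream (assumption): three near-identical embeddings, then a new event.
stream = [
    [1.0, 0.0], [0.99, 0.05], [0.98, 0.1],
    [0.1, 1.0], [0.05, 0.99], [0.0, 1.0],
]

decode_calls = []

def toy_decode(embedding):
    """Stand-in for an expensive text decoder; records each call."""
    decode_calls.append(embedding)
    return "event A" if embedding[0] > embedding[1] else "event B"

results = selective_decode(stream, toy_decode)
print(results)
print(f"decoder calls: {len(decode_calls)} of {len(stream)} steps")
```

In this toy stream the decoder runs only at the first step and at the step where the scene changes, so the per-step cost is mostly the cheap similarity check rather than text generation.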

Why It Matters

VL-JEPA's shift to embedding prediction and selective decoding makes it more feasible to deploy advanced vision-language models in real-time, resource-constrained environments. Systems in wearable devices, robotics, and live video analytics can run with lower latency and computational cost, which makes them more responsive, power-efficient, and practical for continuous interaction with the physical world. Lower hardware requirements for inference also put powerful multimodal models within reach of more users.

Real-time action tracking in smart glasses for procedural assistance.

Online scene recognition and monitoring for autonomous robots.

Low-latency visual question answering for interactive AI agents.

Efficient text-to-video retrieval in large video databases.

Adaptive video summarization by decoding only significant events.