arXiv:2512.109421 PaperLens breakdown cs.CV
VL-JEPA: Joint Embedding Predictive Architecture for Vision-language
VL-JEPA is a vision-language model that predicts continuous semantic embeddings of target texts instead of autoregressively generating discrete tokens. It uses 50% fewer trainable parameters than classical VLMs, supports faster inference through selective decoding, and achieves stronger performance across a range of vision-language tasks.
VL-JEPA predicts continuous text embeddings, not discrete tokens, for vision-language tasks.
Predicting in embedding space simplifies the learning target, which makes learning more efficient and robust.
The model uses 50% fewer trainable parameters than classical VLMs while achieving better performance.
It natively supports 'selective decoding', reducing decoding operations by ~2.85x for real-time applications.
VL-JEPA offers a unified architecture for generation, classification, retrieval, and VQA.
It outperforms CLIP, SigLIP2, and Perception Encoder on classification/retrieval benchmarks.
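One plausible reading of the classification and retrieval use cases listed above is nearest-neighbor search in a shared embedding space: the model predicts one embedding, and the answer is whichever candidate text embedding lies closest to it. The sketch below illustrates that idea only. The function names, the 3-dimensional vectors, and the choice of cosine similarity are assumptions made for illustration, not details taken from the paper.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify(predicted, candidates):
    """Return the candidate label whose embedding is closest to `predicted`."""
    return max(candidates, key=lambda label: cosine_similarity(predicted, candidates[label]))

def retrieve(query, items, top_k=2):
    """Return the `top_k` item keys ranked by similarity to `query`."""
    ranked = sorted(items, key=lambda key: cosine_similarity(query, items[key]), reverse=True)
    return ranked[:top_k]

# Hypothetical, hand-picked embeddings; a real system would obtain them from a text encoder.
label_embeddings = {
    "pouring water": [0.9, 0.1, 0.0],
    "cutting bread": [0.1, 0.9, 0.1],
    "opening door": [0.0, 0.2, 0.9],
}

# Hypothetical embedding predicted from a video clip.
predicted_embedding = [0.8, 0.2, 0.1]

predicted_label = classify(predicted_embedding, label_embeddings)
top_matches = retrieve(predicted_embedding, label_embeddings, top_k=2)
```

In this reading, classification and retrieval share the same similarity computation and differ only in what is ranked, which is one way a single architecture could cover both tasks without generating any text.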
Joint Embedding Predictive Architecture (JEPA)
JEPA learns by predicting abstract meanings, not raw data, making it efficient and robust.
Embedding Space Prediction
Predicting meaning-rich embeddings is more efficient and robust than predicting individual words.
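To illustrate why an embedding target can be more forgiving than a token target, the toy sketch below compares two paraphrases. The embeddings are hand-picked rather than produced by a real encoder, and both the cosine-distance objective and the positional token mismatch rate are assumptions chosen for illustration. This summary does not state which loss the paper actually uses.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def token_mismatch_rate(predicted_tokens, target_tokens):
    """Fraction of positions where the tokens differ (a crude token-level penalty)."""
    length = max(len(predicted_tokens), len(target_tokens))
    matches = sum(p == t for p, t in zip(predicted_tokens, target_tokens))
    return 1.0 - matches / length

# Two answers with the same meaning but no shared tokens at any position.
target_tokens = "the light is off".split()
predicted_tokens = "it went dark".split()

# Hypothetical embeddings: paraphrases are assumed to land close together.
target_embedding = [0.70, 0.70, 0.10]
predicted_embedding = [0.68, 0.72, 0.12]

token_penalty = token_mismatch_rate(predicted_tokens, target_tokens)
embedding_penalty = cosine_distance(predicted_embedding, target_embedding)
```

Under these assumptions the token-level comparison treats the paraphrase as entirely wrong, while the embedding-level comparison treats it as nearly correct. That is the intuition behind embedding-space prediction being the easier target to learn.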
Autoregressive Token Generation
Autoregressive generation produces text one token at a time, which is slow and expensive; VL-JEPA aims to avoid this cost.
Selective Decoding
Selective decoding makes VL-JEPA highly efficient for real-time applications by only generating text when necessary.
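A minimal sketch of the selective decoding idea follows, assuming the decision rule is a threshold on how far the current embedding has drifted from the last decoded one. The threshold value, the `decode_fn` callback, and the toy 2-dimensional stream are all hypothetical; the paper's actual criterion may differ.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def selective_decode(embedding_stream, decode_fn, threshold=0.2):
    """Call `decode_fn` only when the embedding has drifted past `threshold`."""
    outputs = []
    last_decoded = None
    for step, embedding in enumerate(embedding_stream):
        if last_decoded is None or cosine_distance(embedding, last_decoded) > threshold:
            outputs.append((step, decode_fn(embedding)))
            last_decoded = embedding
    return outputs

def toy_decoder(embedding):
    """Hypothetical stand-in for a text decoder."""
    return "event A" if embedding[0] > embedding[1] else "event B"

# Hypothetical per-frame embeddings: three frames of one event, then three of another.
stream = [
    [1.00, 0.00], [0.98, 0.05], [0.95, 0.10],
    [0.10, 0.95], [0.05, 0.98], [0.00, 1.00],
]

decoded = selective_decode(stream, toy_decoder)
decode_calls = len(decoded)
```

In this toy stream the decoder runs twice for six frames, once at the first frame and once when the scene changes. The embedding is still computed for every frame; only the text decoding step is skipped.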
VL-JEPA's shift to embedding prediction and selective decoding makes it more feasible to deploy advanced vision-language models in real-time, resource-constrained environments. Systems in wearable devices, robotics, and live video analytics can operate with lower latency and computational cost, making them more responsive, power-efficient, and practical for continuous interaction with the physical world. Lower hardware requirements for inference also widen access to powerful multimodal AI.
Real-time action tracking in smart glasses for procedural assistance.
Online scene recognition and monitoring for autonomous robots.
Low-latency visual question answering for interactive AI agents.
Efficient text-to-video retrieval in large video databases.
Adaptive video summarization by decoding only significant events.