2512.10942 - VL-JEPA: Joint Embedding Predictive Architecture for Vision-language

Machine Learning, Data Science

Video description

title: VL-JEPA: Joint Embedding Predictive Architecture for Vision-language
author: Delong Chen, Mustafa Shukor, Theo Moutakanni, Willy Chung, Jade Yu, Tejaswi Kasarla, Allen Bolourchi, Yann LeCun, Pascale Fung
arXiv:2512.10942 - https://arxiv.org/abs/2512.10942

We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance while having 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA's predicted embeddings into text. We show that VL-JEPA natively supports selective decoding, which reduces the number of decoding operations by 2.85x while maintaining performance similar to non-adaptive uniform decoding. Beyond generation, VL-JEPA's embedding space naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA without any architecture modification. On eight video classification and eight video retrieval datasets, the average performance of VL-JEPA surpasses that of CLIP, SigLIP2, and Perception Encoder. At the same time, the model achieves performance comparable to classical VLMs (InstructBLIP, QwenVL) on four VQA datasets (GQA, TallyQA, POPE, and POPEv2), despite having only 1.6B parameters.
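
To make the embedding-space ideas in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes that predicted embeddings and candidate text embeddings are plain vectors compared by cosine similarity, and that selective decoding can be approximated by calling the decoder only when the predicted embedding drifts away from the last decoded one. The abstract does not state the actual decoding criterion, so the function names (`cosine`, `classify`, `select_decode_steps`) and the `threshold` parameter are all hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(pred: Sequence[float], label_embs: Dict[str, Sequence[float]]) -> str:
    """Open-vocabulary classification sketch: return the label whose
    text embedding is most similar to the predicted embedding."""
    return max(label_embs, key=lambda name: cosine(pred, label_embs[name]))


def select_decode_steps(
    stream: List[Sequence[float]], threshold: float = 0.9
) -> List[int]:
    """Return the indices at which a text decoder would be invoked.

    Assumed rule (not from the paper): decode the first step, then decode
    again only when similarity to the last decoded embedding drops below
    `threshold`.
    """
    steps: List[int] = []
    last = None
    for i, emb in enumerate(stream):
        if last is None or cosine(emb, last) < threshold:
            steps.append(i)
            last = emb
    return steps


# Toy 2-D embeddings standing in for real text and predicted embeddings.
labels = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
predicted_label = classify([0.9, 0.1], labels)

# Four time steps; the semantic content changes only at index 2.
stream = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
decode_at = select_decode_steps(stream)
```

In this toy stream the decoder would run at two of the four steps, which illustrates the kind of saving selective decoding aims for; the 2.85x figure in the abstract comes from the authors' experiments and is not reproduced by this sketch.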
