Visual Instruction Tuning using LLaVA

Video description: Visual Instruction Tuning using LLaVA

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, the authors present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, they introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding. Experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%.
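To make the "relative score compared with GPT-4" figure concrete, here is a minimal sketch of how such a number could be computed. It assumes (this description does not spell it out) that a judge assigns each answer a numeric rating and that the relative score is the candidate model's total rating as a percentage of the reference model's total. The function name `relative_score` and the ratings below are hypothetical, chosen only for illustration; they are not data from the paper.

```python
def relative_score(candidate_ratings, reference_ratings):
    """Return the candidate's total rating as a percentage of the reference's total.

    Assumption: one judge rating per question for each model, paired by position.
    """
    if len(candidate_ratings) != len(reference_ratings):
        raise ValueError("need one rating pair per question")
    reference_total = sum(reference_ratings)
    if reference_total == 0:
        raise ValueError("reference ratings sum to zero")
    return 100.0 * sum(candidate_ratings) / reference_total


# Hypothetical judge ratings for three questions (illustrative only).
candidate = [8, 7, 9]    # ratings given to the model under evaluation
reference = [9, 9, 10]   # ratings given to the reference model
print(f"{relative_score(candidate, reference):.1f}%")
```

Under this reading, a score below 100% means the judge rated the candidate lower than the reference overall, not that the candidate answered that fraction of the questions correctly.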

In this video, I will talk about the following: a multimodal chatbot and Science QA using LLaVA. How is the multimodal instruction-following dataset created? How is LLaVA trained? How does LLaVA perform?

For more details, please look at https://arxiv.org/pdf/2304.08485.pdf

Liu, Haotian, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. "Visual Instruction Tuning." arXiv preprint arXiv:2304.08485 (2023).
