SANE2023 | Yuan Gong - Audio Large Language Models: From Sound Perception to Understanding

Yuan Gong, research scientist at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), presents his work on audio large language models at the SANE 2023 workshop at New York University, New York, on October 26, 2023.
More info on the SANE workshop series: http://www.saneworkshop.org/

Abstract: Our cognitive abilities enable us not only to perceive and identify sounds but also to comprehend their implicit meaning. While significant advances have been made in general audio event recognition in recent years, models trained with discrete sound label sets have limited reasoning and understanding capabilities; e.g., a model may recognize that a clock chimes 6 times, but not know that this indicates a time of 6 o'clock. Can we build an AI model that has both audio perception and reasoning ability?
In this talk, I will share our recent progress in audio large language model (LLM) development. Specifically, I will first introduce a novel GPT-assisted method for generating OpenAQA, our large-scale open-ended audio question-answering dataset. I will then discuss the key design choices and the model architecture of our audio large language model. Finally, I will discuss how to connect an automatic speech recognition model with an audio large language model for joint audio and speech understanding.
