Can Whisper be used for real-time streaming ASR?

Try Voice Writer - speak your thoughts and let AI handle the grammar: https://voicewriter.io

Whisper is a robust automatic speech recognition (ASR) model by OpenAI, but can it handle real-time streaming ASR, where the transcript has to appear within a few seconds of the speech? It turns out this is not too difficult with the open-source whisper-streaming project, which turns Whisper into a streaming ASR system. It works by feeding progressively longer audio buffers into the Whisper model, using the LocalAgreement algorithm to confirm output as soon as two consecutive iterations agree on it, and then scrolling the buffer forward to the start of the next sentence.
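To make the confirmation step concrete, here is a minimal sketch of the two-iteration agreement idea in Python. It is not the whisper-streaming project's actual code: the class name `LocalAgreement2`, the helper `common_prefix`, and the word lists are hypothetical. It also assumes each hypothesis is a list of words covering the same buffer from the same starting point, and that already confirmed words reappear unchanged at the front of every later hypothesis. Buffer scrolling and the Whisper call itself are left out.

```python
from typing import List


def common_prefix(a: List[str], b: List[str]) -> List[str]:
    """Return the longest prefix shared by two word lists."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return prefix


class LocalAgreement2:
    """Confirm a word once two consecutive hypotheses agree on it (illustrative only)."""

    def __init__(self) -> None:
        self.confirmed: List[str] = []  # words already emitted to the user
        self.previous: List[str] = []   # unconfirmed tail of the last hypothesis

    def update(self, hypothesis: List[str]) -> List[str]:
        """Take the hypothesis for the current buffer; return newly confirmed words."""
        # Assumption: the first len(confirmed) words repeat what was confirmed earlier.
        tail = hypothesis[len(self.confirmed):]
        agreed = common_prefix(self.previous, tail)
        self.confirmed.extend(agreed)
        # Whatever was not agreed on waits for the next iteration.
        self.previous = tail[len(agreed):]
        return agreed


# Hypothetical outputs from four successive, progressively longer buffers.
hypotheses = [
    ["the", "cat"],
    ["the", "cat", "sat", "on"],
    ["the", "cat", "sat", "in", "the"],
    ["the", "cat", "sat", "in", "the", "hat"],
]

agreement = LocalAgreement2()
emitted = [agreement.update(h) for h in hypotheses]
```

In this run nothing is confirmed on the first buffer, because there is no earlier hypothesis to compare against. The unstable guess "on" is never emitted, since the next iteration replaces it with "in". The cost of this design is roughly one extra iteration of latency per word, in exchange for output that does not have to be retracted.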

0:00 - Introduction
0:35 - Batch vs Streaming ASR
1:55 - Why is this difficult?
2:58 - Whisper-streaming demo
3:38 - Processing consecutive audio buffers
4:36 - Confirming tokens with LocalAgreement
6:05 - Prompting previous context
7:01 - Limitations vs other streaming ASR models

References:

https://github.com/ufal/whisper_strea...

Macháček, Dominik, Raj Dabre, and Ondřej Bojar. "Turning Whisper into Real-Time Transcription System." IJCNLP-AACL 2023.

Chen, Xie, et al. "Developing real-time streaming transformer transducer for speech recognition on large-scale dataset." ICASSP 2021.
