The KV Cache: Memory Usage in Transformers

Try Voice Writer - speak your thoughts and let AI handle the grammar: https://voicewriter.io

The KV cache takes up the bulk of GPU memory during inference for large language models like GPT-4. Learn how the KV cache works in this video!

0:00 - Introduction
1:15 - Review of self-attention
4:07 - How the KV cache works
5:55 - Memory usage and example
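As a back-of-the-envelope sketch of the memory arithmetic, the KV cache stores one key vector and one value vector per token, per attention head, per layer. A minimal calculation (the model configuration below is illustrative, not taken from the video):

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Size of the KV cache in bytes.

    Factor of 2 = one key and one value vector per token;
    bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return 2 * n_layers * n_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Hypothetical 7B-class model: 32 layers, 32 heads of dim 128, 2048-token context
size = kv_cache_bytes(n_layers=32, n_heads=32, head_dim=128, seq_len=2048)
print(f"{size / 2**30:.1f} GiB")  # 1.0 GiB for a single sequence
```

Note that the cache grows linearly with both sequence length and batch size, which is why long contexts and large batches quickly dominate GPU memory.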

Further reading:
Speeding up the GPT - KV cache (https://www.dipkumar.dev/becoming-the...)
Transformer Inference Arithmetic (https://kipp.ly/transformer-inference...)
Efficiently Scaling Transformer Inference (https://arxiv.org/pdf/2211.05102.pdf)
