Deep Dive: Optimizing LLM inference

Open-source LLMs are great for conversational applications, but they can be difficult to scale in production, and their latency and throughput may be incompatible with your cost-performance objectives.

In this video, we zoom in on optimizing LLM inference and study the key mechanisms that help reduce latency and increase throughput: the KV cache, continuous batching, and speculative decoding, including the state-of-the-art Medusa approach.
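
For readers who want to experiment alongside the video, here is a minimal, illustrative sketch (not taken from the video or the slides) of the first idea, the KV cache, using Hugging Face transformers. The model name and prompt are placeholders, and the timings only illustrate the gap between recomputing all key/value projections at every step and reusing the cached ones.

```python
# Illustrative sketch (not from the video): effect of the KV cache on
# autoregressive generation. "gpt2" is a placeholder; any decoder-only model works.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

inputs = tokenizer("The key to faster LLM inference is", return_tensors="pt")

def timed_generate(use_cache: bool) -> float:
    """Greedy-decode 128 tokens and return the elapsed time in seconds."""
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(
            **inputs,
            max_new_tokens=128,
            do_sample=False,
            use_cache=use_cache,  # reuse cached key/value projections when True
        )
    return time.perf_counter() - start

print(f"without KV cache: {timed_generate(False):.2f}s")
print(f"with KV cache:    {timed_generate(True):.2f}s")
```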

Slides: https://fr.slideshare.net/slideshow/j...

⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. Follow me on Medium at /julsimon or Substack at https://julsimon.substack.com. ⭐️⭐️⭐️

00:00 Introduction
01:15 Decoder-only inference
06:05 The KV cache
11:15 Continuous batching
16:17 Speculative decoding
25:28 Speculative decoding: small off-the-shelf model
26:40 Speculative decoding: n-grams
30:25 Speculative decoding: Medusa
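
As a companion to the speculative-decoding chapters above, here is a second illustrative sketch (again, not taken from the video) using assisted generation in Hugging Face transformers: a small off-the-shelf draft model proposes tokens and the larger target model verifies them. The model names are placeholders, and the two models must share a tokenizer.

```python
# Illustrative sketch (not from the video): speculative decoding with a small
# off-the-shelf draft model, via assisted generation in transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "facebook/opt-1.3b"  # placeholder target model
draft_id = "facebook/opt-125m"   # placeholder draft model from the same family

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id)
draft = AutoModelForCausalLM.from_pretrained(draft_id)

inputs = tokenizer("Speculative decoding speeds up inference by", return_tensors="pt")

# The draft model proposes a few tokens per step; the target model verifies them
# in a single forward pass and keeps the longest accepted prefix, so the output
# matches what the target model alone would produce with greedy decoding.
outputs = target.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=False,
    assistant_model=draft,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Recent transformers releases also expose the n-gram variant ("prompt lookup decoding") through a `prompt_lookup_num_tokens` argument to `generate()`, which needs no draft model at all; check the version you are running before relying on it.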
