Challenges with Ultra-low Latency LLM Inference at Scale | Haytham Abuelfutuh

  • @Scale
  • 2025-05-08

Video description

In this talk, we will discuss the challenges of running ultra-low-latency Large Language Model (LLM) inference at scale. We will cover the challenges unique to LLM inference, such as large model sizes and KV caching.
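
As a concrete illustration of why KV caching matters, here is a minimal decode-loop sketch with a growing key/value cache. This is my own illustration, not code from the talk; the dimensions and projection matrices are hypothetical stand-ins:

```python
# Minimal KV-cache sketch (illustrative only, not the speaker's code).
# Each decode step projects ONLY the new token and appends its key/value
# to the cache; without the cache, keys/values for every past token
# would be recomputed at every step.
import numpy as np

d = 64                               # head dimension (assumed)
W_k = np.random.randn(d, d) * 0.02   # stand-in key projection
W_v = np.random.randn(d, d) * 0.02   # stand-in value projection
k_cache, v_cache = [], []            # grows by one entry per generated token

def decode_step(x_new):
    """Attention output for one new token against all cached keys/values."""
    k_cache.append(x_new @ W_k)      # O(1) projection work per step
    v_cache.append(x_new @ W_v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ x_new / np.sqrt(d)  # logits over the whole history
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs @ V

for _ in range(8):                   # toy decode loop
    decode_step(np.random.randn(d))
```

The cache trades memory for compute: it grows linearly with sequence length and batch size, which is why, at scale, KV-cache memory competes directly with the weights of an already large model.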

We will also discuss the challenges of scaling LLM inference to handle large volumes of requests, including hardware requirements, efficient scale-up, and new routing architectures.
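
To make "large volumes of requests" concrete, a back-of-envelope capacity calculation shows how quickly GPU counts add up. Every constant below is an assumption for illustration, not a figure from the talk:

```python
# Hypothetical capacity math; all numbers are assumptions.
target_rps = 200             # incoming requests per second
tokens_per_request = 512     # average generated tokens per request
decode_tps_per_gpu = 2_000   # sustained decode tokens/sec one GPU can serve

required_tps = target_rps * tokens_per_request      # 102,400 tokens/sec
gpus = -(-required_tps // decode_tps_per_gpu)       # ceiling division -> 52

print(f"{required_tps:,} tok/s total -> {gpus} GPU replicas")
```

Numbers like these are why efficient scale-up and routing matter: a few percent of wasted capacity per replica multiplies across dozens of GPUs.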

Finally, we will present some of our recent work on addressing these challenges, including our development of inference infrastructure at Union.

Upcoming Events for 2025:
AI & Data - June 25, 2025
Networking - August 13, 2025
Product - October 22, 2025

Learn more about the @Scale conference here: https://atscaleconference.com/

@Scale is a technical conference series for engineers who build or maintain systems designed for scale. New for 2025, in-person and virtual attendance options will be available at all four of our programs, which bring together complementary themes to create event communities and spark cross-discipline collaboration.

Key Points:
Building an inference service for large language models (LLMs) at scale is much more complex than it may seem when running on a local machine. [04:09] Several challenges need to be addressed, such as container image optimization, efficient model loading, caching, and distributed scaling. [06:32]
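
One common way to attack the model-loading piece is a node-local weights cache, so replicas pay the object-store download only once per node. A hedged sketch under that assumption; the cache path and the fetch stub are hypothetical, and this is not Union's implementation:

```python
# Node-local weights-cache sketch; CACHE_DIR and the fetch stub are
# hypothetical. A real deployment would use an S3/GCS client here.
import os
import shutil
import tempfile

CACHE_DIR = os.path.expanduser("~/.cache/models")   # assumed local SSD path

def fetch_from_object_store(uri: str, dest: str) -> None:
    """Stand-in for a real object-store download."""
    with open(dest, "wb") as f:
        f.write(b"fake weights for " + uri.encode())

def weights_path(model_uri: str) -> str:
    """Return a local path to the weights, downloading only on a cache miss."""
    local = os.path.join(CACHE_DIR, model_uri.replace("/", "_"))
    if not os.path.exists(local):                   # cold start: fetch once
        os.makedirs(CACHE_DIR, exist_ok=True)
        fd, tmp = tempfile.mkstemp(dir=CACHE_DIR)
        os.close(fd)
        fetch_from_object_store(model_uri, tmp)
        shutil.move(tmp, local)                     # publish atomically
    return local                                    # warm start: reuse
```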

The speaker outlines the architecture and optimizations used in Union's inference service, which is designed to be faster and more cost-effective than existing API-based solutions. [21:38] Key components include split prefill/decode stages, a smart router that directs each request to the appropriate machine, and a distributed caching system. [17:48]
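
The routing idea can be sketched in a few lines: hash a prompt prefix so requests that share context land on the replica whose cache is already warm. This is a simplified illustration of cache-affinity routing, not Union's actual router; the replica names and prefix length are made up:

```python
# Cache-affinity routing sketch; REPLICAS and PREFIX_CHARS are hypothetical.
import hashlib

REPLICAS = ["decode-0", "decode-1", "decode-2", "decode-3"]
PREFIX_CHARS = 64            # route on a fixed-size prompt prefix

def route(prompt: str) -> str:
    """Requests sharing a prefix hash to the same decode replica."""
    digest = hashlib.sha256(prompt[:PREFIX_CHARS].encode()).digest()
    return REPLICAS[int.from_bytes(digest[:8], "big") % len(REPLICAS)]

shared = "System: you are a concise assistant. " * 3   # common system prompt
print(route(shared + "Summarize the design doc."))     # same replica...
print(route(shared + "Draft a follow-up email."))      # ...cache hit likely
```

A production router would also weigh live load and actual cache contents; pure hashing can hotspot one replica when a single prefix dominates traffic.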

The speaker believes that more and more companies will be deploying their own LLM-based models in production, and that systems innovation and democratizing the process of taking LLMs to production will be crucial. [23:15] Union's goal is to provide a platform that makes it seamless for everyone to deploy their own models at scale. [21:38]
