Fast LLM Serving with vLLM and PagedAttention

LLMs promise to fundamentally change how we use AI across all industries. However, serving these models is challenging and can be surprisingly slow even on expensive hardware. To address this problem, we are developing vLLM, an open-source library for fast LLM inference and serving. vLLM uses PagedAttention, our new attention algorithm that efficiently manages attention keys and values. Equipped with PagedAttention, vLLM achieves up to 24x higher throughput than HuggingFace Transformers without requiring any changes to the model architecture. vLLM was developed at UC Berkeley and has been deployed for Chatbot Arena and Vicuna Demo for the past three months. In this talk, we will discuss the motivation, features, and implementation of vLLM in depth, and present our future plans.
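
The abstract does not spell out how PagedAttention manages keys and values, so the following is only a toy sketch of the general paging idea the name suggests: the key/value cache is split into fixed-size blocks, and each sequence keeps a block table mapping its logical token positions to physical blocks. Every name here (`BlockAllocator`, `Sequence`, `BLOCK_SIZE`) is hypothetical and chosen for illustration; none of it is vLLM's actual API or implementation.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV block (hypothetical value, for illustration only)


@dataclass
class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared pool."""
    num_blocks: int
    free_blocks: list = field(init=False)

    def __post_init__(self):
        self.free_blocks = list(range(self.num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("no free KV blocks")
        return self.free_blocks.pop()

    def release(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


@dataclass
class Sequence:
    """Tracks one request's logical-to-physical block mapping."""
    allocator: BlockAllocator
    num_tokens: int = 0
    block_table: list = field(default_factory=list)

    def append_token(self) -> tuple:
        # A new physical block is claimed only when the last one is full,
        # so at most BLOCK_SIZE - 1 slots per sequence sit unused.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        slot = self.num_tokens % BLOCK_SIZE
        self.num_tokens += 1
        return self.block_table[-1], slot  # where this token's K/V would live

    def release_all(self) -> None:
        # Return every block to the pool once the request finishes.
        for block_id in self.block_table:
            self.allocator.release(block_id)
        self.block_table.clear()
        self.num_tokens = 0
```

In this sketch, memory is claimed on demand as a sequence grows rather than reserved up front for the maximum length, which is the kind of saving that would let more requests share one GPU at a time.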


About Anyscale
---
Anyscale is the AI Application Platform for developing, running, and scaling AI.

https://www.anyscale.com/

If you're interested in a managed Ray service, check out:
https://www.anyscale.com/signup/

About Ray
---
Ray is the most popular open source framework for scaling and productionizing AI workloads. From Generative AI and LLMs to computer vision, Ray powers the world’s most ambitious AI workloads.
https://docs.ray.io/en/latest/


#llm #machinelearning #ray #deeplearning #distributedsystems #python #genai
