Sailing Ray Workloads with KubeRay and Kueue in Kubernetes - Jason Hu, Volcano Engine & Kante Yin

Compute demands for machine learning are growing rapidly. Ray, a unified computing framework, lets ML engineers scale their workloads effortlessly without building complex computing infrastructure. Kubernetes, a popular open-source container orchestration platform, can in turn manage a wide range of workloads with ease through KubeRay, an operator for Ray workloads. At ByteDance, thousands of jobs are submitted daily to Ray clusters created by KubeRay. With the ability to debug programs on long-running clusters and launch regular jobs through RayJob custom resources, users benefit from a streamlined workflow. Meanwhile, efficiently managing concurrent Ray jobs poses challenges such as job starvation and resource allocation. Kueue, a Kubernetes-native job queueing system offering capabilities like resource management, multi-tenant support, and fair sharing of resources, addresses these challenges for Ray jobs in Kubernetes.
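The RayJob-plus-Kueue integration described above can be sketched as a RayJob custom resource carrying Kueue's queue-name label, so Kueue admits it against a LocalQueue's quota before KubeRay starts the cluster. This is a minimal illustrative sketch, not taken from the talk; the queue name, image tag, entrypoint path, and resource sizes are assumptions:

```yaml
# Sketch of a RayJob queued through Kueue (names and sizes are illustrative).
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: sample-ray-job
  labels:
    # Kueue watches for this label and admits the job against
    # the named LocalQueue's quota.
    kueue.x-k8s.io/queue-name: team-a-queue
spec:
  # Created suspended; Kueue unsuspends it once quota is available.
  suspend: true
  entrypoint: python /home/ray/samples/train.py  # hypothetical script
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.9.0
            resources:
              requests:
                cpu: "2"
                memory: 4Gi
    workerGroupSpecs:
    - groupName: workers
      replicas: 2
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-worker
            image: rayproject/ray:2.9.0
            resources:
              requests:
                cpu: "4"
                memory: 8Gi
```

Because the job is created suspended and only unsuspended when the LocalQueue has capacity, concurrent RayJobs wait in queue order instead of partially allocating pods, which is how Kueue avoids the starvation and resource-fragmentation problems mentioned above.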
