Unlocking Heterogeneous AI Infrastructure K8s Cluster: Leveraging the Po... Xiao Zhang & Mengxuan Li

Unlocking Heterogeneous AI Infrastructure K8s Cluster: Leveraging the Power of HAMi - Xiao Zhang, DaoCloud & Mengxuan Li, The 4th Paradigm

With AI's growing popularity, Kubernetes has become the de facto AI infrastructure. However, the increasing number of clusters with diverse AI devices (e.g., NVIDIA, Intel, Huawei Ascend) presents a major challenge. AI devices are expensive, so how can resource utilization be improved? How can these devices be better integrated with K8s clusters? Managing heterogeneous AI devices consistently, supporting flexible scheduling policies, and providing observability all bring their own challenges. The HAMi project was born for this purpose. This session covers:

* How K8s manages heterogeneous AI devices (unified scheduling, observability)
* How to improve device utilization through GPU sharing
* How to ensure the QoS of high-priority tasks in GPU-sharing scenarios
* Support for flexible GPU scheduling strategies (NUMA affinity/anti-affinity, binpack/spread, etc.)
* Integration with other projects (such as Volcano, scheduler-plugin, etc.)
* Real-world case studies from production-level users
* Remaining challenges and the roadmap
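
The binpack and spread strategies mentioned above can be illustrated with a minimal sketch. This is not HAMi's implementation or API: the `Gpu` class, the `pick_gpu` function, and the memory figures are hypothetical names and values chosen only to show the general idea, assuming device memory is the only resource being shared.

```python
from dataclasses import dataclass


@dataclass
class Gpu:
    """Hypothetical model of a shareable GPU (not a HAMi type)."""
    name: str
    total_mem_mib: int
    used_mem_mib: int = 0

    @property
    def free_mem_mib(self) -> int:
        return self.total_mem_mib - self.used_mem_mib


def pick_gpu(gpus, request_mib, policy="binpack"):
    """Return the GPU chosen for a memory request, or None if nothing fits."""
    candidates = [g for g in gpus if g.free_mem_mib >= request_mib]
    if not candidates:
        return None
    if policy == "binpack":
        # Prefer the fullest GPU that still fits, leaving other GPUs empty.
        return min(candidates, key=lambda g: g.free_mem_mib)
    if policy == "spread":
        # Prefer the emptiest GPU, balancing load across devices.
        return max(candidates, key=lambda g: g.free_mem_mib)
    raise ValueError(f"unknown policy: {policy}")


gpus = [
    Gpu("gpu-0", 16384, used_mem_mib=12288),  # 4096 MiB free
    Gpu("gpu-1", 16384, used_mem_mib=2048),   # 14336 MiB free
]
print(pick_gpu(gpus, 4096, "binpack").name)  # gpu-0
print(pick_gpu(gpus, 4096, "spread").name)   # gpu-1
```

Binpack tends to keep whole devices free for large jobs, while spread reduces contention between tasks sharing a device; which one suits a cluster depends on the workload mix.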
