How I Tamed 2 × RTX 5090 + 2 × 4090 with Llama.cpp fork

AI, LLAMA.CPP, ik_llama.cpp, 5090, 4090, inference, prompt processing

Video description: How I Tamed 2 × RTX 5090 + 2 × 4090 with Llama.cpp fork

In this video, I set up a heterogeneous multi-GPU system with two NVIDIA RTX 5090s and two RTX 4090s (100GB+ VRAM total) and run 200B+ parameter models such as DeepSeek R1 and Qwen3 on two frameworks:
🦙 llama.cpp (82k stars)
🦙 ik_llama.cpp (fork with insane multi-GPU support)

Key Highlights:
ik_llama.cpp Setup: How to clone, build, and configure for mixed GPUs (CUDA arch flags, VRAM allocation).
Performance Benchmarks:
  700 tokens/sec prompt processing with ik_llama.cpp (vs 400-450 on vanilla llama.cpp).
  10-23 tokens/sec generation across frameworks.
  80K context length support (vs 24K on K-Transformers).
Multi-GPU Layer Offloading: Custom scripts to distribute model layers across RTX 5090s/4090s.
Live Crash Demo: Lessons on VRAM limits and avoiding OOM errors.
Benchmarking Tools: Use llama-bench to test your config.
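
One simple starting point for the layer distribution mentioned above is to split layers in proportion to each card's VRAM. The sketch below is not the script from the video: `split_layers` is a hypothetical helper, the 94-layer count is only an illustrative number, and 32 GB / 24 GB are the stock VRAM sizes of the RTX 5090 / RTX 4090. It ignores KV cache and compute buffers, so treat the result as a first guess to refine by trial, much like the crash demo does.

```python
def split_layers(n_layers, vram_gb):
    """Distribute n_layers across GPUs in proportion to VRAM.

    Uses the largest-remainder method with integer arithmetic so the
    per-GPU counts always add up to n_layers exactly.
    """
    total = sum(vram_gb)
    # Whole layers each GPU gets from its proportional share.
    counts = [n_layers * v // total for v in vram_gb]
    # Leftover fraction of a layer (scaled by `total`) per GPU.
    remainders = [n_layers * v % total for v in vram_gb]
    leftover = n_layers - sum(counts)
    # Hand the remaining layers to the GPUs with the largest remainders.
    order = sorted(range(len(vram_gb)), key=lambda i: remainders[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


# Assumed setup: 2 x 32 GB (RTX 5090) + 2 x 24 GB (RTX 4090), 94 layers (illustrative).
counts = split_layers(94, [32, 32, 24, 24])
print(",".join(str(c) for c in counts))  # prints 27,27,20,20
```

The resulting comma-separated proportions are the kind of value that llama.cpp's `--tensor-split` option accepts; in practice the numbers usually need adjusting once context length and buffers are taken into account.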

Timestamps:
0:00 Intro & hardware overview
1:17 Why multi-GPU with mixed cards is painful in K-Transformers
2:25 Llama.cpp vs ik_llama.cpp at a glance (stars aren’t everything)
3:55 Live VRAM read-out: 2×5090 + 2×4090 (more than 100 GB)
7:23 First speed test: 120 TPS → 700 TPS after tuning
14:09 Building ik_llama.cpp for Ada Lovelace & Blackwell (-DCMAKE_CUDA_ARCHITECTURES=86;89;120)
18:00 Regex-based layer off-loading explained (-ot "blk\+\.ffn=CUDA")
29:40 Crash & recover: finding the VRAM sweet spot
38:02 llama-sweep-bench: automate prompt/gen benchmarks
41:55 Context length show-down: 24 K (K-Trans) vs 40 K / 80 K / 128 K (IK/Llama.cpp)
48:10 Single-GPU fallback test (one 4090)
51:15 Community resources & my startup scripts
53:14 Final thoughts & when to stick with vanilla Llama.cpp (function calling)
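
The regex-based off-loading covered at 18:00 relies on the `-ot` (`--override-tensor`) option, which takes `regex=buffer` pairs that are matched against tensor names. The sketch below imitates that matching in Python so you can preview where tensors would land before launching. The patterns, tensor names, and the `assign_buffers` helper are illustrative assumptions, not the rules used in the video, and the sketch assumes that the first matching rule wins.

```python
import re


def assign_buffers(tensor_names, rules, default="default"):
    """Map each tensor name to the buffer of the first rule whose regex matches.

    `rules` is an ordered list of (regex, buffer_name) pairs; tensors that
    match no rule keep the `default` placement.
    """
    compiled = [(re.compile(pattern), buf) for pattern, buf in rules]
    placement = {}
    for name in tensor_names:
        placement[name] = default
        for pattern, buf in compiled:
            if pattern.search(name):  # substring search, not a full match
                placement[name] = buf
                break
    return placement


# Illustrative rules: FFN tensors of blocks 0-3 on the first GPU,
# blocks 4-7 on the second, all remaining expert tensors on the CPU.
rules = [
    (r"blk\.[0-3]\.ffn", "CUDA0"),
    (r"blk\.[4-7]\.ffn", "CUDA1"),
    (r"ffn_.*_exps", "CPU"),
]

# Hypothetical tensor names in the usual GGUF "blk.<n>.<tensor>" style.
names = [
    "blk.0.ffn_up_exps.weight",
    "blk.5.ffn_down_exps.weight",
    "blk.20.ffn_gate_exps.weight",
    "blk.20.attn_q.weight",
]

placement = assign_buffers(names, rules)
for name, buf in placement.items():
    print(f"{name} -> {buf}")
```

Note the escaped dot after the block number: a pattern like `blk\.1` without the trailing `\.` would also match `blk.10`, `blk.11`, and so on, which is an easy way to overfill one GPU.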

Resources:
ik_llama.cpp GitHub: https://github.com/ikawrakow/ik_llama...
HuggingFace Models: https://huggingface.co/ubergarm/Qwen3...
My GPU Layer Offloading Strategy: https://github.com/ikawrakow/ik_llama...

Tags: #AI #MachineLearning #MultiGPU #RTX5090 #llama.cpp #ikllama #LargeLanguageModels #DL #TechTutorial
