Ollama Llama3-8b Speed Comparison with different NVIDIA GPUs and FP16/q8_0 quantization

Note: The table at the end of the video should read tokens/s (tokens per second), not s (seconds).

This video compares five differently priced NVIDIA graphics cards running Ollama and Meta Llama3-8B: RTX 4090 24GB, Tesla P40 24GB, RTX A6000 48GB, RTX 6000 Ada 48GB, and A100 SXM4 80GB.

The comparison always uses the same question to the LLM, and each GPU runs both Llama3-8B-FP16 (full precision) and the Llama3-8B-q8_0 quantization (roughly half the size). The q8_0 quantization speeds up the system but leads to somewhat lower response quality.
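A minimal benchmark sketch of this kind of measurement, using Ollama's local REST API with Python: the model tags and the prompt below are assumptions (the video does not show its exact setup), and Ollama must already be running with both models pulled. With stream=False, the response includes eval_count (generated tokens) and eval_duration (nanoseconds), from which tokens/s follows directly.

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint
MODELS = ["llama3:8b-instruct-fp16", "llama3:8b-instruct-q8_0"]  # tags assumed from the Ollama library
PROMPT = "Explain the difference between FP16 and q8_0 quantization."  # stand-in for the video's question

for model in MODELS:
    # stream=False returns a single JSON object including timing counters
    r = requests.post(OLLAMA_URL, json={"model": model, "prompt": PROMPT, "stream": False}, timeout=600)
    data = r.json()
    # eval_count = generated tokens, eval_duration = generation time in nanoseconds
    tps = data["eval_count"] / (data["eval_duration"] / 1e9)
    print(f"{model}: {tps:.1f} tokens/s")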

To keep the boundary conditions identical despite the different VRAM sizes, all tests were carried out on a single GPU and always with the same two Meta Llama3-8B models. The smallest GPU has 24 GB of VRAM, so Llama3-8B-FP16 fits comfortably into memory: 8B parameters at 2 bytes each need roughly 16 GB for the weights alone, leaving headroom for the KV cache and runtime overhead.
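As a rough back-of-the-envelope check (the parameter count and bits-per-weight below are approximations, not figures from the video), the weight memory of the two variants can be estimated like this:

params = 8.03e9  # approximate parameter count of Llama3-8B
fp16_gb = params * 2 / 1024**3          # FP16: 2 bytes per weight, ~15 GiB
q8_0_gb = params * (8.5 / 8) / 1024**3  # GGUF q8_0: ~8.5 bits per weight incl. block scales, ~8 GiB
print(f"FP16 weights: ~{fp16_gb:.1f} GiB")
print(f"q8_0 weights: ~{q8_0_gb:.1f} GiB")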

The video is intended to show that even a relatively inexpensive Tesla P40 or a gaming graphics card is well suited to running simple but still powerful LLMs with Ollama. The P40 is sometimes available for under $300 / 300 €. But be careful: the P40 has no active cooler, so you have to build your own cooling solution, preferably a 3D-printed shroud with a suitable fan. Even with good ventilation, GPU temperatures often reach 80-90 °C while generating with an LLM; a small temperature-monitoring sketch follows below.
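For keeping an eye on a passively cooled card like the P40, a minimal monitoring sketch using NVML via the nvidia-ml-py package (my suggestion, not something shown in the video) could look like this:

import time
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU; adjust the index on multi-GPU systems

try:
    while True:
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        print(f"GPU temperature: {temp} °C")
        time.sleep(5)  # poll every 5 seconds while the LLM is generating
except KeyboardInterrupt:
    pynvml.nvmlShutdown()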
