Towards Reliable Evaluation of Large Language Models (LLMs)


Large Language Models (LLMs) have become ubiquitous in today's technology landscape due to their remarkable ability to "understand" and generate human-like text. They are used in a wide array of applications, from chatbots to content creation. But how do you properly evaluate the quality of such models?
This talk gives an overview of current approaches to evaluating LLMs and their respective shortcomings. It then presents a statistical framework, developed by researchers at the ZHAW Datalab, to determine how reliable an evaluation method is and how much data (human-annotated vs. automatically generated) is needed. Finally, the talk shows how this framework can be used to implement trustworthy real-world evaluation settings for LLMs.
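The abstract does not spell out the framework itself, but the basic question (how reliable is an evaluation method, and how much human-annotated data is needed) can be illustrated with a generic bootstrap estimate of the agreement between an automatic metric and human judgments. The sketch below is a minimal, hypothetical example with synthetic data; it is not the ZHAW Datalab framework, and all names and numbers are made up for illustration.

```python
import numpy as np

def bootstrap_ci(metric_scores, human_scores, n_boot=2000, alpha=0.05, seed=0):
    """Bootstrap confidence interval for the Pearson correlation between
    an automatic metric and human judgments on the same set of outputs.
    (Illustrative only; the talk's actual framework may differ.)"""
    rng = np.random.default_rng(seed)
    n = len(metric_scores)
    corrs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)  # resample outputs with replacement
        corrs.append(np.corrcoef(metric_scores[idx], human_scores[idx])[0, 1])
    lo, hi = np.quantile(corrs, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Synthetic example: human scores plus a noisy automatic metric.
rng = np.random.default_rng(42)
human = rng.normal(size=500)
metric = 0.7 * human + rng.normal(scale=0.7, size=500)

# The interval narrows as more human-annotated examples are used,
# which is one way to reason about "how much data is needed".
for n in (50, 200, 500):
    lo, hi = bootstrap_ci(metric[:n], human[:n])
    print(f"n={n:3d}: 95% CI for correlation = [{lo:.2f}, {hi:.2f}]")
```

Running this shows the confidence interval for the metric-human correlation shrinking as the number of annotated examples grows, which is the kind of trade-off between annotation cost and evaluation reliability the talk addresses.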

Speaker:

Mark Cieliebak
Professor
Zurich University of Applied Sciences
