Keynote Talk: Sanmi Koyejo - Beyond Benchmarks; Building a Science of AI Measurement

Video description

This session explores a critical question in modern AI research: How do we measure AI systems reliably, meaningfully, and responsibly?

Moving beyond traditional benchmarks, the talk presents a scientific framework for evaluating AI models in ways that truly reflect their capabilities and limitations.

Using case studies such as the Graduate-Level Google-Proof Q&A (GPQA) benchmark and examples of model collapse, the speaker introduces new proposals for improving AI evaluation, including claim-focused measurement, probabilistic modelling, adaptive testing, and the role of civil society in shaping evaluation targets.

This session is essential for researchers, practitioners, and policymakers interested in the future of AI safety, reliability, and assessment.

Timestamps:
00:00:00 – Intro
00:01:44 – Measurement in AI
00:05:07 – Example: The Graduate-Level GPQA
00:06:51 – Evaluation: Suppose a model scores 97% on the GPQA benchmark
00:08:02 – What to do about benchmarks?
00:09:29 – Measurement crisis
00:11:22 – Vignettes in AI measurement science
00:12:10 – Proposal 1: Focus AI measurement on the validity of specific claims
00:13:23 – An evaluation should support a specific claim
00:16:21 – Case Study: Graduate-Level Google-Proof Q&A (GPQA) Benchmark
00:17:55 – Validity
00:21:41 – Summary: Validity is tied to specific claims
00:22:50 – Proposal 2: Adapt and advance probabilistic models for AI measurement
00:23:35 – Reliable & efficient amortized model-based evaluation
00:27:47 – Item Response Theory (IRT) basics
00:29:09 – Data & Results
00:31:38 – Active learning of model capability; evaluation
00:32:50 – Predicting question difficulty (amortized calibration)
00:33:44 – Adaptive testing with question generator
00:35:01 – Item response theory for benchmarks
00:36:16 – Proposal 3: Learn to predict the effects of model interventions
00:37:24 – Model collapse: what happens when AI models are trained on their outputs?
00:38:24 – Case study: Model collapse
00:41:31 – Predicting the effects of model interventions: a case study of model collapse
00:42:01 – Open Problem 1: Understanding the links between upstream and downstream measurement
00:43:58 – Open Problem 2: Reduce dependence on multiple choice question answering
00:45:25 – Open Problem 3: Beyond human measurement priors
00:46:43 – Open Problem 4: Scaling institutions & participation of civil society in AI evaluation targets and critique
00:47:40 – A path forward
00:49:07 – Summary
00:50:00 – Q&A

#DeepLearningIndaba2025 #AIMeasurement #AIEvaluation #MachineLearning #Benchmarks #GPQA #AIResearch
