Data Heterogeneity Analysis for Out-of-Distribution Generalization - CoLLAs 2024 - PART 1

Описание к видео Data Heterogeneity Analysis for Out-of-Distribution Generalization - CoLLAs 2024 - PART 1

Speaker: Peng Cui & Jiashuo Liu


Abstract: Data heterogeneity is a key determinant of the performance of ML systems. Standard algorithms that optimize for average-case performance do not consider the presence of diversity within data. As a result, variations in data sources, data generation mechanisms, and sub-populations lead to unreliable decision-making, poor generalization, unfairness, and false scientific discoveries. Carefully modeling data heterogeneity is a necessary step in building reliable data-driven systems. Its rigorous study is a nascent field of research spanning several disciplines, including statistics, causal inference, machine learning, economics, and operations research. In this tutorial, we develop a unified view of the disparate intellectual threads developed by different communities. We aim to foster interdisciplinary research by providing a unified view based on a shared language. Drawing upon several separate literatures, we establish a taxonomy of heterogeneity and present quantitative measures and learning algorithms that consider heterogeneous data. To spur empirical progress, we conclude by discussing validation protocols and benchmarking practices.

Комментарии

Информация по комментариям в разработке