Jason Gross: "Compact Proofs: Measuring Quality of Understanding with a Compression-Based Metric"

Topos Institute Colloquium, 24th of October 2024.
———
The field of mechanistic interpretability – techniques for reverse-engineering model weights into human-interpretable algorithms – seeks to compress explanations of model behavior. By studying tiny transformers trained to perform algorithmic tasks, we can make rigorous the extent to which various understandings of a model permit compressing an explanation of its behavior.

In this talk, I’ll discuss how we prototyped this approach in our paper, where we formally proved lower bounds on the accuracy of 151 small transformers trained on a Max-of-K task, creating 102 different computer-assisted proof strategies and assessing the length and tightness of bound of each strategy on each of our models. Using quantitative metrics, we found that shorter proofs seem to require and provide more mechanistic understanding. Moreover, we found that more faithful mechanistic understanding leads to tighter performance bounds.
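To make the setting concrete, here is a minimal Python sketch (my own illustration, not the paper's code) of the Max-of-K task and the brute-force baseline proof that compact proofs aim to compress; the names D_VOCAB, K, and brute_force_accuracy_bound are hypothetical, chosen for this example.

```python
# Toy sketch of the Max-of-K setting: the model reads K tokens from a
# vocabulary of size D_VOCAB and must output their maximum. The baseline
# "proof" of an accuracy lower bound simply enumerates all D_VOCAB**K
# inputs -- sound, but maximally long; a compact proof replaces this
# enumeration with a mechanistic argument about the weights.

import itertools

D_VOCAB = 8   # hypothetical vocabulary size
K = 3         # hypothetical sequence length

def max_of_k(tokens):
    """Ground-truth label for the Max-of-K task."""
    return max(tokens)

def brute_force_accuracy_bound(model):
    """Baseline proof strategy: check every input exhaustively.

    `model` is any callable from a K-tuple of tokens to a predicted token.
    The returned fraction is a sound lower bound on accuracy because every
    case is checked, but verifying it costs O(D_VOCAB**K) forward passes --
    the proof length that this line of work tries to shrink.
    """
    correct = 0
    total = 0
    for tokens in itertools.product(range(D_VOCAB), repeat=K):
        correct += model(tokens) == max_of_k(tokens)
        total += 1
    return correct / total

if __name__ == "__main__":
    # Stand-in "model" that happens to be exactly correct.
    bound = brute_force_accuracy_bound(lambda toks: max(toks))
    print(f"Proved accuracy >= {bound:.3f} by exhaustive check")
```

Under this framing, a better mechanistic understanding of the trained transformer lets the verifier skip most of the enumeration, so proof length becomes a quantitative proxy for quality of understanding.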

We identified compounding structureless noise as the leading obstacle to generating more compact proofs of tighter performance bounds. I plan to discuss ongoing work to address this challenge, either by relaxing the worst-case constraint that proofs enforce, or by fine-tuning partially-interpreted models to align more closely with our explanations.
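The following toy sketch (again my own, not from the talk) shows why structureless noise compounds under a worst-case proof, and what relaxing that constraint buys: a proof must assume all unexplained error terms align, while a sampling-based estimate can use the typical case.

```python
# Toy illustration of compounding structureless noise. Suppose an
# explanation leaves N_TERMS unexplained residual terms, each bounded in
# magnitude by WORST_CASE_EPS. Both constants are hypothetical.

import random

random.seed(0)
N_TERMS = 20          # hypothetical number of unexplained noise terms
WORST_CASE_EPS = 0.1  # per-term worst-case magnitude

# Worst-case (proof-style) bound: the terms could all align, so the
# bound on their sum compounds linearly in N_TERMS.
worst_case_bound = N_TERMS * WORST_CASE_EPS

# Relaxed (heuristic) estimate: sample the terms as if independent and
# centered; the typical total is much smaller than the worst case.
samples = [
    sum(random.uniform(-WORST_CASE_EPS, WORST_CASE_EPS) for _ in range(N_TERMS))
    for _ in range(10_000)
]
typical = max(abs(s) for s in samples)

print(f"worst-case bound on total noise: {worst_case_bound:.2f}")
print(f"empirical max over 10k samples:  {typical:.2f}")
```

The gap between the two numbers is the looseness that worst-case proofs pay for noise the explanation does not capture; relaxing the proof requirement, or fine-tuning the model to remove the noise, are two ways to close it.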

I’ll conclude by discussing the roadmap I see to scaling the compact proofs approach to rigorous mech interp up to frontier models.
