Is finetuning GPT4o worth it?

Meet Cosine’s Genie: https://www.latent.space/p/cosine

SWE-Bench has been the most successful agent benchmark of the year, receiving honors at ICLR (our interview here) and recently being verified by OpenAI. Cognition (Devin) was valued at $2b after reaching 14% on it. So it is very, very big news when a new agent appears to beat all other solutions by a wide margin.

While this number is self-reported, it seems to be corroborated by OpenAI, who also awards it the clear highest marks on SWE-Bench Verified.

The secret is GPT-4o finetuning on billions of tokens of synthetic data.

Finetuning: As OpenAI says:

Genie is powered by a fine-tuned GPT-4o model trained on examples of real software engineers at work, enabling the model to learn to respond in a specific way. The model was also trained to be able to output in specific formats, such as patches that could be committed easily to codebases.
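For context, here is roughly what kicking off such a job looks like with OpenAI's fine-tuning API. The training file name and its patch-formatted contents are illustrative, not Cosine's actual data:

```python
# Illustrative sketch of starting a GPT-4o fine-tuning job via OpenAI's API.
# The JSONL file and its contents are hypothetical stand-ins.
from openai import OpenAI

client = OpenAI()

# Each JSONL line is a chat example, e.g.
# {"messages": [{"role": "user", "content": "<issue + code context>"},
#               {"role": "assistant", "content": "<a committable patch>"}]}
training_file = client.files.create(
    file=open("genie_style_patches.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-2024-08-06",  # GPT-4o snapshot that supports fine-tuning
)
print(job.id, job.status)
```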

Due to the scale of Cosine’s finetuning, OpenAI worked closely with them to figure out the size of the LoRA:

“They have to decide how big your LoRA adapter is going to be… because if you had a really sparse, large adapter, you’re not going to get any signal in that at all. So they have to dynamically size these things.”
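For intuition on why adapter sizing matters, here is some back-of-the-envelope LoRA arithmetic. The hidden size, layer count, and ranks below are purely illustrative, since GPT-4o's architecture is not public:

```python
# Rough LoRA parameter-count arithmetic. A LoRA adapter learns a low-rank
# update to a frozen d_out x d_in weight matrix as two factors,
# B (d_out x r) and A (r x d_in), adding r * (d_in + d_out) parameters.
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    return rank * (d_in + d_out)

# Hypothetical dimensions -- GPT-4o's real shape is undisclosed.
hidden = 8192
layers = 64
matrices_per_layer = 4  # e.g. attention Q/K/V/O projections

for rank in (8, 64, 512):
    total = layers * matrices_per_layer * lora_params(hidden, hidden, rank)
    print(f"rank {rank:>3}: ~{total / 1e6:.0f}M adapter parameters")
```

At a fixed amount of training data, a higher-rank adapter spreads the gradient signal across many more parameters, which is why a "really sparse, large adapter" can end up learning nothing at all.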

Synthetic data: we need to finetune on the process of making code work instead of only training on working code.

“…we synthetically generated runtime errors. Where we would intentionally mess with the AST to make stuff not work, or index out of bounds, or refer to a variable that doesn't exist, or errors that the foundational models just make sometimes that you can't really avoid, you can't expect it to be perfect.”
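One mechanical way to do this is to parse working code and deliberately corrupt it. Here is a minimal sketch in the spirit of the quote, not Cosine's actual pipeline:

```python
# Sketch of synthesizing a runtime error by perturbing the AST:
# rename one variable read to an undefined name, yielding a NameError.
import ast

class BreakVariableRefs(ast.NodeTransformer):
    """Rename the first variable read to an undefined name."""
    def __init__(self):
        self.done = False

    def visit_Name(self, node):
        if not self.done and isinstance(node.ctx, ast.Load):
            self.done = True
            return ast.copy_location(
                ast.Name(id=node.id + "_undefined", ctx=ast.Load()), node)
        return node

source = "total = 0\nfor x in range(10):\n    total += x\nprint(total)\n"
tree = BreakVariableRefs().visit(ast.parse(source))
broken = ast.unparse(ast.fix_missing_locations(tree))

# Executing `broken` raises NameError -- pairing the broken code and its
# traceback with the original fix gives one synthetic training example.
print(broken)
```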

Genie also has a 4-stage workflow built on the standard LLM OS tooling stack that lets it solve problems iteratively; a rough sketch follows below.
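This is a hypothetical outline of such a loop, with the stage names taken from the episode's chapter list (code retrieval, planning, writing, running code). Every helper here is a stub; none of this is Cosine's actual API:

```python
# Hypothetical iterative agent loop -- all helpers are stubs for illustration.
def retrieve_code(repo: str, issue: str) -> str:
    return "<relevant files>"        # stub: real system does code retrieval

def make_plan(issue: str, context: str) -> str:
    return "<step-by-step plan>"     # stub: an LLM call in the real system

def write_patch(plan: str, context: str) -> str:
    return "<unified diff>"          # stub: model emits a committable patch

def run_tests(repo: str, patch: str) -> tuple[bool, str]:
    return True, ""                  # stub: real system executes the code

def solve(issue: str, repo: str, max_iters: int = 5) -> str | None:
    """Retrieve -> plan -> write -> run, feeding failures back in."""
    for _ in range(max_iters):
        context = retrieve_code(repo, issue)   # 1. code retrieval
        plan = make_plan(issue, context)       # 2. planning
        patch = write_patch(plan, context)     # 3. write a patch
        ok, trace = run_tests(repo, patch)     # 4. run the code
        if ok:
            return patch
        issue += f"\nPrevious attempt failed:\n{trace}"
    return None

print(solve("fix the off-by-one in pagination", "example/repo"))
```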

Timestamps

[00:00:00] Alistair and Cosine intro
[00:11:34] GPT4o finetuning
[00:15:18] Genie Data Mix
[00:18:09] Customizing for Customers
[00:20:37] Genie Workflow
[00:22:41] Code Retrieval
[00:30:20] Planning
[00:37:29] Language Mix
[00:38:46] Running Code
[00:41:19] Finetuning with OpenAI
[00:44:32] Synthetic Code Data
[00:47:54] SynData in Llama 3
[00:48:33] SWE-Bench Submission Process
[00:53:20] Future Plans
[00:54:36] Ecosystem Trends
[00:55:55] Founder Lessons
[00:57:58] CTA: Hiring & Customers
