
AI Magic: Shipping 1000s of successful products with no managers and a team of 12 — Jeremy Howard of Answer.ai

08/16/24 • 58 min

Latent Space: The AI Engineer Podcast

Disclaimer: We recorded this episode ~1.5 months ago, timed for the FastHTML release. It then got bottlenecked by the Llama 3.1, Winds of AI Winter, and SAM 2 episodes, so we’re a little late. Since then FastHTML has been released, swyx is building an app in it for AINews, and Anthropic has also released their prompt caching API.

Remember when Dylan Patel of SemiAnalysis coined the GPU Rich vs GPU Poor war? (if not, see our pod with him). The idea was that if you’re GPU poor you shouldn’t waste your time trying to solve GPU-rich problems (i.e. pre-training large models) and are better off working on fine-tuning, optimized inference, etc. Jeremy Howard (see our “End of Finetuning” episode to catch up on his background) and Eric Ries founded Answer.AI to do exactly that: “Practical AI R&D”, which is very much in line with GPU-poor needs. For example, one of their first releases was a system based on FSDP + QLoRA that lets anyone train a 70B model on two NVIDIA 4090s (see the sketch after this list). Since then, they have come out with a long list of super useful projects (in no particular order, and non-exhaustive):

FSDP QDoRA: this is just as memory efficient and scalable as FSDP/QLoRA, and critically is also as accurate for continued pre-training as full weight training.

Cold Compress: a KV cache compression toolkit that lets you scale sequence length without impacting speed.

colbert-small: state of the art retriever at only 33M params

JaColBERTv2.5: a new state-of-the-art retriever across all Japanese benchmarks.

gpu.cpp: portable GPU compute for C++ with WebGPU.

Claudette: a better Anthropic API SDK.
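To give a flavor of the FSDP + QLoRA recipe mentioned above: the QLoRA half loads the 70B weights in 4-bit and trains only small LoRA adapters, while Answer.AI's fsdp_qlora training script adds the FSDP sharding that splits the quantized model across the two consumer GPUs. A minimal sketch of the QLoRA half (the model id and hyperparameters here are illustrative, not their exact config):

```python
# QLoRA-style setup: 4-bit base weights + small trainable LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                      # NF4 quantization shrinks the 70B weights ~4x
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",            # illustrative model id
    quantization_config=bnb,
)
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
))
model.print_trainable_parameters()          # only the LoRA matrices are trainable
```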

They also recently released FastHTML, a new way to create modern interactive web apps. Jeremy also put out a 1-hour “Getting started” tutorial on YouTube; while this isn’t AI-related per se, it’s close to home for any AI Engineer looking to iterate quickly on new products:
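For reference, the hello-world is tiny; a minimal sketch along the lines of the project's published quickstart:

```python
# A minimal FastHTML app: one route returning HTML built from Python objects.
from fasthtml.common import *

app, rt = fast_app()

@rt("/")
def get():
    return Titled("Hello", P("Hello from FastHTML!"))

serve()  # runs a local uvicorn dev server
```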

In this episode we broke down: 1) how they recruit, 2) how they organize what to research, and 3) how the community comes together.

At the end, Jeremy gave us a sneak peek at something new that he’s working on that he calls dialogue engineering:

So I've created a new approach. It's not called prompt engineering. I'm creating a system for doing dialogue engineering. It's currently called AI magic. I'm doing most of my work in this system and it's making me much more productive than I was before I used it.

He explains it a bit more at ~44:53 in the pod, but we’ll just have to wait for the public release to find out exactly what he means.

Timestamps

[00:00:00] ...


Previous Episode


Segment Anything 2: Demo-first Model Development

Because of the nature of SAM, this episode is more video-heavy than usual. See our YouTube!

Because vision is first among equals in multimodality, and yet SOTA vision language models are closed, we’ve always had an interest in learning what’s next in vision.

Our first viral episode was Segment Anything 1, and we have since covered LLaVA, IDEFICS, Adept, and Reka. But just like with Llama 3, FAIR holds a special place in our hearts as the New Kings of Open Source AI.

The list of sequels better than the originals is usually very short, but SAM 2 delighted us: it is not only a better image segmentation model than SAM 1, it also conclusively and inexpensively solved video segmentation in just as elegant a way as SAM 1 did for images, with everything released to the community under Apache 2.0/CC-BY 4.0.

“In video segmentation, we observe better accuracy, using 3x fewer interactions than prior approaches. In image segmentation, our model is more accurate and 6x faster than the Segment Anything Model (SAM).”

Surprisingly Efficient

The paper reports that SAM 2 was trained on 256 A100 GPUs for 108 hours (59% more than SAM 1). Taking the upper-end $2/hour A100 price off gpulist.ai, SAM 2 would have cost ~$55k to train at external market rates, which is surprisingly cheap for adding video understanding!
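The arithmetic behind that estimate, for anyone who wants to check it (the $2/hour rate is the assumption):

```python
# Back-of-envelope SAM 2 training cost at external market rates.
gpus, hours, usd_per_gpu_hour = 256, 108, 2.00
print(f"${gpus * hours * usd_per_gpu_hour:,.0f}")  # $55,296 -> ~$55k
```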

The newly released SA-V dataset is also the largest video segmentation dataset to date, with careful attention given to scene/object/geographical diversity, including that of annotators. In some ways, we are surprised that SOTA video segmentation can be done on only ~50,000 videos (and 640k masklet annotations).

Model-in-the-loop Data Engine for Annotations and Demo-first Development

Similar to SAM 1, a 3-phase Data Engine helped greatly in bootstrapping this dataset. As Nikhila says in the episode, the demo you see wasn’t just for show: they actually used the same tool to annotate data for the model that is now demoed in that tool:

“With the original SAM, we put a lot of effort in building a high-quality demo. And the other piece here is that the demo is actually the annotation tool. So we actually use the demo as a way to improve our annotation tool. And so then it becomes very natural to invest in building a good demo because it speeds up your annotation and improves the data quality, and that will improve the model quality. With this approach, we found it to be really successful.”

This virtuous cycle produced an incredible 90% speedup in annotation, which is what helped SA-V reach such scale.

Building the demo also helped the team live the context that their own downstream users, like Roboflow, would experience, and forced them to make choices accordingly.

As Nikhila says:

“It's a really encouraging trend for not thinking about only the new model capability, but what sort of applications folks want to build with models as a result of that downstream.

“I think it also really forces you to think about many things that you might postpone. For example, efficiency. For a good demo experience, making it real time is super important. No one wants to wait. And so it really forces you to think about these things much sooner, and actually makes us think about what kind of image encoder we want to use, or other things like hardware efficiency improvements. So those kinds of things, I think, become first-class...

Next Episode


Is finetuning GPT4o worth it? — with Alistair Pullen, Cosine (Genie)

Betteridge's law says no: with seemingly infinite flavors of RAG, and >2 million token context windows plus prompt caching from Anthropic/DeepMind/DeepSeek, it's reasonable to believe that "in-context learning is all you need".
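Prompt caching in particular makes the "stuff everything in context" strategy economical across turns. A hedged sketch using Anthropic's beta as of this writing (the file name is our placeholder):

```python
# Sketch of Anthropic prompt caching: mark the big static context as cacheable
# so subsequent turns reuse it instead of paying full price to re-send it.
import anthropic

client = anthropic.Anthropic()
resp = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=512,
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},  # beta header
    system=[{
        "type": "text",
        "text": open("repo_dump.txt").read(),      # hypothetical long context
        "cache_control": {"type": "ephemeral"},    # mark this block cacheable
    }],
    messages=[{"role": "user", "content": "Where is the auth middleware wired up?"}],
)
print(resp.content[0].text)
```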

But then there’s Cosine Genie, the first to make a huge bet using OpenAI’s new GPT4o fine-tuning for code at the largest scale it has ever been used externally, resulting in what is now the #1 coding agent in the world according to SWE-Bench Full, Lite, and Verified:

SWE-Bench has been the most successful agent benchmark of the year, receiving honors at ICLR (our interview here) and recently being verified by OpenAI. Cognition (Devin) was valued at $2b after reaching 14% on it. So it is very, very big news when a new agent appears to beat all other solutions, by a lot:

While this number is self-reported, it seems to be corroborated by OpenAI, who also awards it the clear highest marks on SWE-Bench Verified:

The secret is GPT-4o finetuning on billions of tokens of synthetic data.

Finetuning: As OpenAI says:

Genie is powered by a fine-tuned GPT-4o model trained on examples of real software engineers at work, enabling the model to learn to respond in a specific way. The model was also trained to be able to output in specific formats, such as patches that could be committed easily to codebases.
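Mechanically, a run like this goes through the standard fine-tuning endpoints (though at Cosine's scale, OpenAI worked with them directly). A minimal sketch, with the file name and model snapshot as assumptions:

```python
# Sketch: launching a GPT-4o fine-tune via the OpenAI API.
from openai import OpenAI

client = OpenAI()
train = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    model="gpt-4o-2024-08-06",   # fine-tunable GPT-4o snapshot
    training_file=train.id,
)
print(job.id, job.status)
```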

Due to the scale of Cosine’s finetuning, OpenAI worked closely with them to figure out the size of the LoRA:

“They have to decide how big your LoRA adapter is going to be... because if you had a really sparse, large adapter, you’re not going to get any signal in that at all. So they have to dynamically size these things.”
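For intuition on what "sizing" means here: a rank-r LoRA adapter on a d_in × d_out weight matrix adds only r · (d_in + d_out) trainable parameters, so the rank has to be chosen large enough to absorb signal from billions of training tokens without being wastefully sparse (numbers below are purely illustrative):

```python
# Illustrative LoRA sizing: trainable params per adapted matrix vs full weights.
d_in = d_out = 8192                        # assumed hidden size
full = d_in * d_out
for r in (8, 64, 512):
    print(f"r={r:>3}: {r * (d_in + d_out):>12,} LoRA params ({full:,} full)")
```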

Synthetic data: we need to finetune on the process of making code work instead of only training on working code.

“...we synthetically generated runtime errors. Where we would intentionally mess with the AST to make stuff not work, or index out of bounds, or refer to a variable that doesn't exist, or errors that the foundational models just make sometimes that you can't really avoid, you can't expect it to be perfect.”
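As a toy illustration of that idea (ours, not Cosine's actual pipeline), here is one way to mutate a program's AST so that it still parses but fails at runtime:

```python
# Toy synthetic-bug generator: rewrite one variable read so it no longer resolves.
import ast, builtins

class BreakOneName(ast.NodeTransformer):
    def __init__(self):
        self.done = False

    def visit_Name(self, node):
        # Corrupt only the first non-builtin variable *read* (ast.Load context).
        if not self.done and isinstance(node.ctx, ast.Load) and not hasattr(builtins, node.id):
            self.done = True
            node.id += "_undefined"
        return node

src = "total = 0\nfor x in range(3):\n    total += x\nprint(total)"
print(ast.unparse(BreakOneName().visit(ast.parse(src))))
# prints code containing 'total += x_undefined', which raises NameError when run
```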

Genie also has a four-stage workflow built on the standard LLM OS tooling stack that lets it solve problems iteratively:

Full Video Pod

like and subscribe etc!

Show Notes

Alistair Pullen - Twitter, Linkedin

Cosine Genie launch, technical report

OpenAI GPT-4o finetuning GA

Llama 3 backtranslation

Cursor episode and Aman + SWEBench at ICLR episode

Timestamps

[00:00:00] Suno Intro

[00:05:01] Alistair and Cosine intro

[00:16:34] GPT4o finetuning

[00:20:18] Genie Data Mix

[00:23:09] Customizing for Customers

[00:25:37] Genie Workflow

[00:27:41] Code Retrieval

[00:35:20] Planning

[00:42:29] Language Mix

[00:43:46] Running Code

[00:46:19] Finetuning with...
