Why AI Agents Don't Work (yet) - with Kanjun Qiu of Imbue

10/14/23 • 65 min

Latent Space: The AI Engineer Podcast

Thanks to the over 11,000 people who joined us for the first AI Engineer Summit! A full recap is coming, but you can 1) catch up on the fun and videos on Twitter and YouTube, 2) help us reach 1000 people for the first comprehensive State of AI Engineering survey and 3) submit projects for the new AI Engineer Foundation.

See our Community page for upcoming meetups in SF, Paris, NYC, and Singapore.

This episode had good interest on Twitter.

Last month, Imbue was crowned as AI’s newest unicorn foundation model lab, raising a $200m Series B at a >$1 billion valuation. As “stealth” foundation model companies go, Imbue (f.k.a. Generally Intelligent) has remained an enigmatic group, since they have no publicly released models to try out. However, ever since their $20m Series A last year, their goal has been to “develop generally capable AI agents with human-like intelligence in order to solve problems in the real world”.

From RL to Reasoning LLMs

Along with their Series A, they announced Avalon, “A Benchmark for RL Generalization Using Procedurally Generated Worlds”. Avalon is built on top of the open source Godot game engine and is ~100x faster than Minecraft, enabling fast RL benchmarking with a clear reward signal and adjustable game difficulty.
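
To make this concrete, below is a hypothetical, simplified sketch (in Python, not Avalon’s actual API) of the loop such a benchmark enables: every episode procedurally generates a fresh world at a chosen difficulty and exposes a single clear reward signal.

import random

# ProceduralWorldEnv is an illustrative stand-in for an Avalon-style environment.
class ProceduralWorldEnv:
    def __init__(self, difficulty: float):
        self.difficulty = difficulty          # 0.0 (trivial) .. 1.0 (hard)

    def reset(self, seed: int):
        self.rng = random.Random(seed)        # new world layout per seed
        self.steps_left = int(100 * (1 + self.difficulty))
        return [self.rng.random() for _ in range(4)]   # toy observation

    def step(self, action: int):
        self.steps_left -= 1
        reward = 1.0 if action == 0 else 0.0  # single, unambiguous reward
        done = self.steps_left <= 0
        return [self.rng.random() for _ in range(4)], reward, done

def evaluate(policy, difficulty=0.5, episodes=10):
    # Average return across freshly generated worlds at one difficulty level.
    total = 0.0
    for ep in range(episodes):
        env = ProceduralWorldEnv(difficulty)
        obs, done = env.reset(seed=ep), False
        while not done:
            obs, reward, done = env.step(policy(obs))
            total += reward
    return total / episodes

print(evaluate(lambda obs: 0))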

After a while, they realized that pure RL isn’t a good path to teach reasoning and planning. The agents were able to learn mechanical skills like opening complex doors and climbing, but couldn’t progress to higher-level tasks. A pure RL world also doesn’t include a language explanation of the agent’s reasoning, which made it hard to understand why it made certain decisions. That pushed the team more towards the “models for reasoning” path:

“The second thing we learned is that pure reinforcement learning is not a good vehicle for planning and reasoning. So these agents were able to learn all sorts of crazy things: They could learn to climb like hand over hand in VR climbing, they could learn to open doors like very complicated, like multiple switches and a lever open the door, but they couldn't do any higher level things. And they couldn't do those lower level things consistently necessarily. And as a user, I do not want to interact with a pure reinforcement learning end to end RL agent. As a user, like I need much more control over what that agent is doing.”

Inspired by Chelsea Finn’s work on SayCan at Stanford, the team pivoted to have their agents do the reasoning in natural language instead. This development parallels the large leaps in reasoning that humans themselves made with the development of the scientific method:

“We are better at reasoning now than we were 3000 years ago. An example of a reasoning strategy is noticing you're confused. Then when I notice I'm confused, I should ask:

What was the original claim that was made?

What evidence is there for this claim?

Does the evidence support the claim?

Is the claim correct?

This is like a reasoning strategy that was developed in like the 1600s, you know, with like the advent of science. So that's an example of a reasoning strategy. There are tons of them that we employ all the time, lots of heuristics that help us be better at reasoning. And we can generate data that's much more specific to them.”
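
As a purely hypothetical illustration (not Imbue’s actual system), a strategy like this can be handed to a language-model agent as a natural-language scaffold: whenever the agent notices it is confused, it works through the same four questions before acting.

# Hypothetical sketch: turning the "notice confusion, then verify" strategy
# into a prompt template for whatever LLM the agent uses.
VERIFICATION_QUESTIONS = [
    "What was the original claim that was made?",
    "What evidence is there for this claim?",
    "Does the evidence support the claim?",
    "Is the claim correct?",
]

def build_verification_prompt(claim: str, context: str) -> str:
    steps = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(VERIFICATION_QUESTIONS))
    return (
        "You noticed you are confused about the following claim.\n"
        f"Claim: {claim}\n"
        f"Context: {context}\n"
        "Work through these questions one at a time, showing your reasoning:\n"
        f"{steps}\n"
        "Finish with a one-sentence verdict."
    )

print(build_verification_prompt(
    "The door needs three switches to open.",
    "Only two switches are visible in the room."))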

The Full Stack Model Lab

One year later, it would seem that the pivot to reasoning has had tremendous success, and Imbue has now reached a >$1B valuation, with participation ...


Previous Episode

[AIE Summit Preview #2] The AI Horcrux — Swyx on Cognitive Revolution

This is a special double weekend crosspost of AI podcasts, helping attendees prepare for the AI Engineer Summit next week. After our first friendly feedswap with the Cognitive Revolution pod, swyx was invited for a full episode to go over the state of AI Engineering and to preview the AI Engineer Summit Schedule, where we share many former CogRev guests as speakers.

For those seeking to understand how two top AI podcasts think about the major top-of-mind AI Engineering topics, this should be the perfect place to get up to speed. It also previews many of the conversations taking place during the topic-table sessions on the night of Monday, October 9 at the AI Engineer Summit.

While you are listening, there are two things you can do to be part of the AI Engineer experience. One, join the AI Engineer Summit Slack. Two, take the State of AI Engineering survey and help us get to 1000 respondents!

Links

AI Engineer Summit (Join livestream and Slack community)

State of AI Engineering Survey (please help us fill this out to represent you!)

Cognitive Revolution full episode with Nathan

swyx’s ai-notes (featuring Communities in README.md)

We referenced The Eleuther AI Mafia

This podcast intro voice was AI Anna again, from our Wondercraft pod!

Timestamps

(00:00:49) AI Nathan’s intro

(00:03:14) What is an AI engineer?

(00:05:56) What backgrounds do AI engineers typically have?

(00:17:13) Swyx’s Discord AI project

(00:20:41) Key tools for AI engineers

(00:23:42) HumanLoop, Guardrails, Langchain

(00:27:01) Criteria for identifying capable AI engineers when hiring

(00:30:59) Skepticism around AI being a fad and doubts about contributing to AI

(00:34:03) AI Engineer Conference speaker lineup

(00:41:14) AI agents and two years to AGI

(00:46:04) Expectations and disagreement around what AI agent capabilities will work soon

(00:50:12) Swyx’s OpenAI thesis

(00:53:03) AI safety considerations and the role of AI engineers

(00:56:24) Disagreement on whether AI will soon be able to generate code pull requests

(01:01:07) AI helping non-technical people to code

(01:01:49) Multi-modal Chat-GPT and the future implications


Next Episode

The End of Finetuning — with Jeremy Howard of Fast.ai

Thanks to the over 17,000 people who have joined the first AI Engineer Summit! A full recap is coming. Last call to fill out the State of AI Engineering survey! See our Community page for upcoming meetups in SF, Paris and NYC.

This episode had good interest on Twitter and was discussed on the Vanishing Gradients podcast.

Fast.ai’s “Practical Deep Learning” courses have been watched by over 6,000,000 people, and the fastai library has over 25,000 stars on GitHub. Jeremy Howard, one of the creators of Fast.ai, is now one of the most prominent and respected voices in the machine learning industry; but that wasn’t always the case.

Being non-consensus and right

In 2018, Jeremy and Sebastian Ruder published a paper on ULMFiT (Universal Language Model Fine-tuning), a 3-step transfer learning technique for NLP tasks:

The paper demonstrated that pre-trained language models could be fine-tuned on a specific task with a relatively small amount of data to achieve state-of-the-art results. They trained a 24M-parameter model on WikiText-103, which beat most benchmarks.
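
For readers who want to see the shape of that recipe, here is a minimal sketch of the three ULMFiT steps using today’s fastai API on the small IMDB sample dataset; the epoch counts and learning rates are illustrative choices, not the paper’s settings.

# Minimal sketch of the three-step ULMFiT recipe with the fastai library.
from fastai.text.all import *

path = untar_data(URLs.IMDB_SAMPLE)  # small sample dataset shipped with fastai

# Step 1 is implicit: AWD_LSTM weights pre-trained on WikiText-103 are downloaded.
# Step 2: fine-tune the language model on the target-domain text.
dls_lm = TextDataLoaders.from_csv(path, 'texts.csv', text_col='text', is_lm=True)
learn_lm = language_model_learner(dls_lm, AWD_LSTM, metrics=Perplexity())
learn_lm.fine_tune(1, 1e-2)
learn_lm.save_encoder('ft_encoder')

# Step 3: train a classifier on top of the fine-tuned encoder, gradually unfreezing.
dls_clas = TextDataLoaders.from_csv(path, 'texts.csv', text_col='text',
                                    label_col='label', text_vocab=dls_lm.vocab)
learn_clas = text_classifier_learner(dls_clas, AWD_LSTM, metrics=accuracy)
learn_clas.load_encoder('ft_encoder')
learn_clas.fit_one_cycle(1, 2e-2)
learn_clas.freeze_to(-2)
learn_clas.fit_one_cycle(1, slice(1e-2 / 2.6**4, 1e-2))
learn_clas.unfreeze()
learn_clas.fit_one_cycle(1, slice(1e-3 / 2.6**4, 1e-3))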

While the paper had great results, the methods behind it weren’t taken seriously by the community:

“Everybody hated fine tuning. Everybody hated transfer learning. I literally did tours trying to get people to start doing transfer learning and nobody was interested, particularly after GPT showed such good results with zero shot and few shot learning [...] which I was convinced was not the right direction, but who's going to listen to me, cause as you said, I don't have a PhD, not at a university... I don't have a big set of computers to fine tune huge transformer models.”

Five years later, fine-tuning is at the center of most major discussion topics in AI (we covered some of them, like fine-tuning vs RAG and small-model fine-tuning), and we might have gotten here earlier if Jeremy had had OpenAI-level access to compute and distribution. At heart, Jeremy has always been “GPU poor”:

“I've always been somebody who does not want to build stuff on lots of big computers because most people don't have lots of big computers and I hate creating stuff that most people can't use.”

This story is a good reminder of how some of the best ideas are hiding in plain sight; we recently covered RWKV and will continue to highlight the most interesting research that isn’t being done in the large labs.

Replacing fine-tuning with continued pre-training

Even though fine-tuning is now mainstream, we still have a lot to learn. The issue of “catastrophic forgetting” and potential solutions have been brought up in many papers: at the fine-tuning stage, the model can forget tasks it previously knew how to solve in favor of new ones.

The other issue is apparent memorization of the dataset even after a single epoch, which Jeremy covered in “Can LLMs learn from a single example?” but which we still don’t have an answer to.

Despite being the creator of ULMFiT, Jeremy still professes that there are a lot of open questions on finetuning:

“So I still don't know how to fine tune language models properly and I haven't found anybody who feels like they do.”

He now advocates for "continued pre-training" - maintaining a diversity of data throughout the training process rather than separate pre-training and fine-tuning stages....
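
As a rough, hypothetical illustration of that idea, the sketch below mixes a stream of general pre-training documents with new domain documents at a fixed ratio, so the model never sees only the narrow new data; the mixing ratio and document names are made up for the example.

# Hypothetical sketch of the data-mixing idea behind "continued pre-training":
# keep sampling from the original, diverse corpus alongside the new domain data
# instead of switching entirely to a separate fine-tuning stage.
import random
from itertools import islice

def mixed_stream(general_docs, domain_docs, domain_fraction=0.3, seed=0):
    # Yield documents, drawing from the domain corpus domain_fraction of the time.
    rng = random.Random(seed)
    general, domain = iter(general_docs), iter(domain_docs)
    while True:
        source = domain if rng.random() < domain_fraction else general
        try:
            yield next(source)
        except StopIteration:
            return

general = (f"general doc {i}" for i in range(1000))
domain = (f"medical note {i}" for i in range(1000))
for doc in islice(mixed_stream(general, domain), 10):
    print(doc)  # these would feed the usual next-token-prediction training loop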
