Latent Space: The AI Engineer Podcast

swyx + Alessio

The podcast by and for AI Engineers! In 2024, over 2 million readers and listeners came to Latent Space to hear about news, papers and interviews in Software 3.0. We cover Foundation Models changing every domain in Code Generation, Multimodality, AI Agents, GPU Infra and more, directly from the founders, builders, and thinkers involved in pushing the cutting edge. Striving to give you both the definitive take on the Current Thing down to the first introduction to the tech you'll be using in the next 3 months! We break news and exclusive interviews from OpenAI, Anthropic, Gemini, Meta (Soumith Chintala), Sierra (Bret Taylor), tiny (George Hotz), Databricks/MosaicML (Jon Frankle), Modular (Chris Lattner), Answer.ai (Jeremy Howard), et al. Full show notes always on https://latent.space

Top 10 Latent Space: The AI Engineer Podcast Episodes

Goodpods has curated a list of the 10 best Latent Space: The AI Engineer Podcast episodes, ranked by the number of listens and likes each episode has garnered from our listeners. If you are listening to Latent Space: The AI Engineer Podcast for the first time, there's no better place to start than with one of these standout episodes. If you are a fan of the show, vote for your favorite Latent Space: The AI Engineer Podcast episode by adding your comments to the episode page.

Latent Space: The AI Engineer Podcast - Open Operator, Serverless Browsers and the Future of Computer-Using Agents

02/28/25 • 61 min

Today's episode is with Paul Klein, founder of Browserbase. We talked about building browser infrastructure for AI agents, the future of agent authentication, and their open source framework Stagehand.

[00:00:00] Introductions

[00:04:46] AI-specific challenges in browser infrastructure

[00:07:05] Multimodality in AI-Powered Browsing

[00:12:26] Running headless browsers at scale

[00:18:46] Geolocation when proxying

[00:21:25] CAPTCHAs and Agent Auth

[00:28:21] Building “User take over” functionality

[00:33:43] Stagehand: AI web browsing framework

[00:38:58] OpenAI's Operator and computer use agents

[00:44:44] Surprising use cases of Browserbase

[00:47:18] Future of browser automation and market competition

[00:53:11] Being a solo founder

Transcript

Alessio [00:00:04]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai.

swyx [00:00:12]: Hey, and today we are very blessed to have our friend, Paul Klein the Fourth, CEO of Browserbase. Welcome.

Paul [00:00:21]: Thanks guys. Yeah, I'm happy to be here. I've been lucky to know both of you for like a couple of years now, I think. So it's just like we're hanging out, you know, with three ginormous microphones in front of our face. It's totally normal hangout.

swyx [00:00:34]: Yeah. We've actually mentioned you on the podcast, I think, more often than any other Solaris tenant. Just because like you're one of the, you know, best performing, I think, LLM tool companies that have started up in the last couple of years.

Paul [00:00:50]: Yeah, I mean, it's been a whirlwind of a year, like Browserbase is actually pretty close to our first birthday. So we are one year old. And going from, you know, starting a company as a solo founder to... To, you know, having a team of 20 people, you know, a series A, but also being able to support hundreds of AI companies that are building AI applications that go out and automate the web. It's just been like, really cool. It's been happening a little too fast. I think like collectively as an AI industry, let's just take a week off together. I took my first vacation actually two weeks ago, and Operator came out on the first day, and then a week later, DeepSeek came out. And I'm like on vacation trying to chill. I'm like, we got to build with this stuff, right? So it's been a breakneck year. But I'm super happy to be here and like talk more about all the stuff we're seeing. And I'd love to hear kind of what you guys are excited about too, and share with it, you know?

swyx [00:01:39]: Where to start? So people, you've done a bunch of podcasts. I think I strongly recommend Jack Bridger's Scaling DevTools, as well as Turner Novak's The Peel. And, you know, I'm sure there's others. So you covered your Twilio story in the past, talked about StreamClub, you got acquired by Mux, and then you left to start Browserbase. So maybe we just start with what is Browserbase? Yeah.

Paul [00:02:02]: Browserbase is the web browser for your AI. We're building headless browser infrastructure, which are browsers that run in a server environment that's accessible to developers via APIs and SDKs. It's really hard to run a web browser in the cloud. You guys are probably running Chrome on your computers, and that's using a lot of resources, right? So if you want to run a web browser or thousands of web browsers, you can't just spin up a bunch of lambdas. You actually need to use a secure containerized environment. You have to scale it up and down. It's a stateful system. And that infrastructure is, like, super painful. And I know that firsthand, because at my last company, StreamClub, I was CTO, and I was building our own internal headless browser infrastructure. That's actually why we sold the company, is because Mux really wanted to buy our headless browser infrastructure that we'd built. And it's just a super hard problem. And I actually told my co-founders, I would never start another company unless it was a browser infrastructure company. And it turns out that's really necessary in the age of AI, when AI can actually go out and interact with websites, click on buttons, fill in forms. You need AI to do all of that work in an actual browser running somewhere on a server. And Browserbase powers that.
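As a rough sketch of what "browsers that run in a server environment, accessible via APIs and SDKs" means in practice, here is a minimal Playwright example that connects to a remotely hosted Chromium instance over the Chrome DevTools Protocol. The websocket URL and environment variable are placeholders for illustration, not Browserbase's actual endpoint.

```python
# A rough sketch of driving a remotely hosted headless browser with Playwright.
# The websocket URL is a placeholder, not any provider's real API endpoint.
import os
from playwright.sync_api import sync_playwright

REMOTE_BROWSER_WS = os.environ.get(
    "REMOTE_BROWSER_WS",
    "wss://browser-provider.example/session/abc123",  # placeholder endpoint
)

with sync_playwright() as p:
    # Connect to an already-running Chromium over the Chrome DevTools Protocol
    # instead of launching a local browser process.
    browser = p.chromium.connect_over_cdp(REMOTE_BROWSER_WS)
    context = browser.contexts[0] if browser.contexts else browser.new_context()
    page = context.new_page()
    page.goto("https://news.ycombinator.com")
    print(page.title())
    browser.close()
```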

swyx [00:03:08]: While you're talking about it, it occurred to me, not that you're going to be acquired or anything, but it occurred to me that it would be really funny if you became the Nikita Beer of headless browser companies. You just have one trick, ...

Latent Space: The AI Engineer Podcast - Agent Engineering with Pydantic + Graphs — with Samuel Colvin

02/06/25 • 64 min

Did you know that adding a simple Code Interpreter took o3 from 9.2% to 32% on FrontierMath? The Latent Space crew is hosting a hack night Feb 11th in San Francisco focused on CodeGen use cases, co-hosted with E2B and Edge AGI; watch E2B’s new workshop and RSVP here!

We’re happy to announce that today’s guest Samuel Colvin will be teaching his very first Pydantic AI workshop at the newly announced AI Engineer NYC Workshops day on Feb 22! 25 tickets left.

If you’re a Python developer, it’s very likely that you’ve heard of Pydantic. Every month, it’s downloaded >300,000,000 times, making it one of the top 25 PyPI packages. OpenAI uses it in its SDK for structured outputs, it’s at the core of FastAPI, and if you’ve followed our AI Engineer Summit conference, Jason Liu of Instructor has given two great talks about it: “Pydantic is all you need” and “Pydantic is STILL all you need”.
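For readers who haven’t used it, here is a minimal sketch of why Pydantic is a natural fit for structured LLM outputs: declare a schema once, then validate untrusted JSON against it. The Invoice model below is purely illustrative.

```python
# Minimal sketch of the pattern behind "structured outputs": declare a schema
# once with Pydantic, then validate untrusted (e.g. LLM-generated) JSON against it.
from pydantic import BaseModel, ValidationError


class Invoice(BaseModel):
    vendor: str
    total_usd: float
    line_items: list[str]


raw = '{"vendor": "Acme", "total_usd": 1299.5, "line_items": ["GPU", "rack"]}'
try:
    invoice = Invoice.model_validate_json(raw)  # parse + type-check in one step
    print(invoice.total_usd)
except ValidationError as err:
    # Malformed or mistyped JSON fails loudly here, which is the guarantee
    # structured-output SDKs build on.
    print(err)
```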

Now, Samuel Colvin has raised $17M from Sequoia to turn Pydantic from an open source project into a full stack AI engineer platform with Logfire, their observability platform, and PydanticAI, their new agent framework.

Logfire: bringing OTEL to AI

OpenTelemetry recently merged Semantic Conventions for LLM workloads, which provide standard definitions to track performance metrics like gen_ai.server.time_per_output_token. In Sam’s view, at least 80% of new apps being built today have some sort of LLM usage in them, and just like web observability platforms got replaced by cloud-first ones in the 2010s, Logfire wants to do the same for AI-first apps.
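A hedged sketch of what emitting that kind of telemetry can look like with the OpenTelemetry Python API: the gen_ai.server.time_per_output_token name comes from the conventions mentioned above, while the tracer name, span name, and token counting below are purely illustrative.

```python
# Hedged sketch: recording an LLM call as an OpenTelemetry span and attaching
# the gen_ai.server.time_per_output_token figure mentioned above. Tracer name,
# span name, and the crude token count are illustrative, not a standard.
import time
from opentelemetry import trace

tracer = trace.get_tracer("my-llm-app")  # hypothetical instrumentation name


def fake_llm(prompt: str) -> str:
    return "stubbed model response"  # stand-in for a real model client call


def call_model(prompt: str) -> str:
    with tracer.start_as_current_span("chat my-model") as span:
        start = time.perf_counter()
        completion = fake_llm(prompt)
        elapsed = time.perf_counter() - start
        output_tokens = max(len(completion.split()), 1)  # crude proxy for tokens
        span.set_attribute("gen_ai.server.time_per_output_token", elapsed / output_tokens)
        return completion
```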

If you’re interested in the technical details, Logfire migrated away from ClickHouse to DataFusion for their backend. We spent some time on the importance of picking open source tools you understand and can actually contribute to upstream, rather than the more popular ones; listen in at ~43:19 for that part.

Agents are the killer app for graphs

Pydantic AI is their attempt at taking a lot of the learnings that LangChain and the other early LLM frameworks had, and putting Python best practices into it. At an API level, it’s very similar to the other libraries: you can call LLMs, create agents, do function calling, do evals, etc.

They define an “Agent” as a container with a system prompt, tools, structured result, and an LLM. Under the hood, each Agent is now a graph of function calls that can orchestrate multi-step LLM interactions. You can start simple, then move toward fully dynamic graph-based control flow if needed.

“We were compelled enough by graphs once we got them right that our agent implementation [...] is now actually a graph under the hood.”
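A minimal sketch of that “container” idea in PydanticAI: an LLM plus a system prompt plus a structured result type. The names below (Agent, result_type, run_sync, .data) reflect the early-2025 API and may have changed since, so treat this as illustrative rather than canonical.

```python
# Hedged sketch of "Agent = LLM + system prompt + tools + structured result"
# in PydanticAI. Names reflect the early-2025 API and may have changed since.
from pydantic import BaseModel
from pydantic_ai import Agent


class Triage(BaseModel):
    answer: str
    escalate: bool


agent = Agent(
    "openai:gpt-4o",                              # the LLM
    system_prompt="You are a support triage assistant.",
    result_type=Triage,                           # the structured result
)

result = agent.run_sync("My invoice total looks wrong, who should I contact?")
print(result.data.escalate)
```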

Why Graphs?

More natural for complex or multi-step AI workflows.

Easy to visualize and debug with mermaid diagrams.

Potential for distributed runs, or “waiting days” between steps in certain flows.

In parallel, you see folks like Emil Eifrem of Neo4j talk about GraphRAG as another place where graphs fit really well in the AI stack, so it might be time for more people to take them seriously.

Full Video Episode

Like and subscribe!

Chapters

00:00:00 Introductions

00:00:24 Origins of Pydantic

...

Welcome to the almost 3k Latent Space explorers who joined us last month! We’re holding our first SF listener meetup with Practical AI next Monday; join us if you want to meet past guests and put faces to voices! All events are in /community.

Who among you regularly clicks the ubiquitous 👍/👎 buttons in ChatGPT/Bard/etc.?

Anyone? I don’t see any hands up.

OpenAI has told us how important reinforcement learning from human feedback (RLHF) is to creating the magic that is ChatGPT, but we know from our conversation with Databricks’ Mike Conover just how hard it is to collect even 15,000 explicit, high-quality human responses.

We are shockingly reliant on good human feedback. Andrej Karpathy’s recent keynote at Microsoft Build on the State of GPT demonstrated just how much of the training process relies on contractors to supply the millions of items of human feedback needed to make a ChatGPT-quality LLM.

But the collection of good feedback is an incredibly messy problem. First of all, if you have contractors paid by the datapoint, they are incentivized to blast through as many as possible without much thought. So you hire more contractors and double, maybe triple, your costs. Ok, you say, let's recruit missionaries, not mercenaries. People should volunteer their data! Then you run into the same problem we and any consumer review platform run into - the vast majority of people send nothing at all, and those who do disproportionately represent negative reactions. More subtle problems emerge when you try to capture subjective human responses - the reason ChatGPT responses tend to be inhumanly verbose is that humans have a well documented “longer = better” bias when classifying responses in a “laboratory setting”.

The fix for this, of course, is to get out of the lab and learn from real human behavior, not artificially constructed human feedback. You don’t see a thumbs up/down button in GitHub Copilot, Codeium, or Codium. Instead, they work an implicit accept/reject event into the product workflow, such that you cannot help but give feedback while you use the product. This way you hear from all your users, in their natural environments, doing valuable tasks they are familiar with. The prototypical example of this is Midjourney, which unobtrusively collects 1 of 9 types of feedback from every user as part of the workflow, in exchange for much faster first-draft image generations.

The best-known public example of AI product telemetry is in the Copilot-Explorer writeup: Copilot checks for the presence of generated code after 15-600 second intervals, which enables GitHub to claim that 40% of code is generated by Copilot.
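A hedged sketch of what that kind of implicit telemetry might look like in code: log the accepted suggestion, then re-check at later intervals whether the generated code is still present in the buffer. Everything here (names, the interval list, the blocking loop) is illustrative, not GitHub's actual implementation.

```python
# Hedged sketch of Copilot-style implicit feedback: record the accepted
# suggestion, then re-check at later intervals whether the code survived.
# Names and mechanics are hypothetical, not any vendor's telemetry code.
import time
from dataclasses import dataclass, field

CHECK_DELAYS_S = [15, 60, 300, 600]  # intervals roughly matching the writeup


@dataclass
class SuggestionEvent:
    suggestion: str
    accepted_at: float
    retention: dict[int, bool] = field(default_factory=dict)


def track_retention(event: SuggestionEvent, read_buffer) -> None:
    """Re-check the editor buffer at each delay and record whether the
    generated code is still there -- the implicit 'accept' signal."""
    for delay in CHECK_DELAYS_S:
        time.sleep(delay)  # in a real client this would be a scheduled callback
        event.retention[delay] = event.suggestion in read_buffer()
```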

This is fantastic and “obviously” the future of productized AI. Every AI application should figure out how to learn from all of its real users, not some contractors in a foreign country. Most prompt engineers and prompt engineering tooling tend to focus on pre-production prototyping, but they could also benefit from A/B testing their prompts in the real world.

In short, AI may need Analytics more than Analytics needs AI.

Amplitude’s Month of AI

This is why Amplitude is going hard on AI - and why we recently spent a weekend talking to Jeffrey Wang, cofounder and chief architect at Amplitude, and Joe Reeve, head of AI, recording a li...

Latent Space: The AI Engineer Podcast - Commoditizing the Petaflop — with George Hotz of the tiny corp

06/20/23 • 72 min

We are now launching our dedicated new YouTube and Twitter! Any help in amplifying our podcast would be greatly appreciated, and of course, tell your friends!

Notable follow-on discussions collected on Twitter, Reddit, Reddit, Reddit, HN, and HN. Please don’t obsess too much over the GPT4 discussion as it is mostly rumor; we spent much more time on tinybox/tinygrad, on which George is the foremost authority!

We are excited to share the world’s first interview with George Hotz on the tiny corp!

If you don’t know George, he was the first person to unlock the iPhone and jailbreak the PS3; he went on to start Comma.ai, and briefly “interned” at the Elon Musk-run Twitter.

Tinycorp is the company behind the deep learning framework tinygrad, as well as the recently announced tinybox, a new $15,000 “luxury AI computer” aimed at local model training and inference, aka your “personal compute cluster”:

738 FP16 TFLOPS

144 GB GPU RAM

5.76 TB/s RAM bandwidth

30 GB/s model load bandwidth (big llama loads in around 4 seconds)

AMD EPYC CPU

1600W (one 120V outlet)

Runs 65B FP16 LLaMA out of the box (using tinygrad, subject to software development risks)

(In the episode, we also talked about the future of the tinybox as the intelligence center of every home that will help run models, at-home robots, and more. Make sure to check the timestamps 👀 )

The tiny corp manifesto

There are three main theses to tinycorp:

If XLA/PrimTorch are CISC, tinygrad is RISC: CISC (Complex Instruction Set Computing) refers to more complex instruction sets where a single instruction can execute many low-level operations. RISC (Reduced Instruction Set Computing) instruction sets are smaller and only let you execute a single low-level operation per instruction, leading to faster and more efficient instruction execution. If you’ve used an Apple Silicon M1/M2 or a Raspberry Pi, you’ve used a RISC computer.

If you can’t write a fast ML framework for GPU, you can’t write one for your own chip: there are many “AI chips” companies out there, and they all started from taping out the chip. Some of them like Cerebras are still building, while others like Graphcore seem to be struggling. But building chips with higher TFLOPS isn’t enough: “There’s a great chip already on the market. For $999, you get a 123 TFLOP card with 24 GB of 960 GB/s RAM. This is the best FLOPS per dollar today, and yet...nobody in ML uses it.”, referring to the AMD RX 7900 XTX. NVIDIA’s lead is not only thanks to high-performing cards, but also thanks to a great developer platform in CUDA. Starting with the chip development rather than the dev toolkit is much more cost-intensive, so tinycorp is starting by writing a framework for off-the-shelf hardware rather than taping out their own chip.
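For a back-of-envelope sense of that FLOPS-per-dollar claim, here is the arithmetic using only numbers quoted in this post (123 FP16 TFLOPS at $999 for the RX 7900 XTX, 738 FP16 TFLOPS at $15,000 for the tinybox):

```python
# Back-of-envelope FLOPS-per-dollar using only numbers quoted in this post.
cards = {
    "AMD RX 7900 XTX": (123.0, 999),     # FP16 TFLOPS (as quoted), USD
    "tinybox":         (738.0, 15_000),  # FP16 TFLOPS, USD
}
for name, (tflops, price) in cards.items():
    print(f"{name}: {tflops / price:.3f} TFLOPS per dollar")
# ~0.123 TFLOPS/$ for the 7900 XTX vs ~0.049 TFLOPS/$ for the tinybox: the
# tinybox trades raw FLOPS-per-dollar for memory, bandwidth, and integration.
```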

Turing completeness considered harmful: Once you call in to Turing complete kernels, you can

Latent Space: The AI Engineer Podcast - AGI is Being Achieved Incrementally (DevDay Recap - cleaned audio)

11/08/23 • 141 min

We left a high amount of background audio in the DevDay podcast, which many of you loved, but we definitely understand that some of you may have had trouble with it. Listener Klaus Breyer ran it through Auphonic with speech isolation and we figured we’d upload it as a backdated pod for people who prefer this. Of course it means that our speakers sound out of place since they now sound like they are talking loudly in a quiet room. Let us know in the comments what you think!

Timestamps

The cleaned audio covers only Part II:

[00:55:09] Part II: Spot Interviews

[00:55:59] Jim Fan (Nvidia) - High Level Takeaways

[01:05:19] Raza Habib (Humanloop) - Foundation Model Ops

[01:13:32] Surya Dantuluri (Stealth) - RIP Plugins

[01:20:53] Reid Robinson (Zapier) - AI Actions for GPTs

[01:30:45] Div Garg (MultiOn) - GPT4V for Agents

[01:36:42] Louis Knight-Webb (Bloop.ai) - AI Code Search

[01:48:36] Shreya Rajpal (Guardrails) - Guardrails for LLMs

[01:59:00] Alex Volkov (Weights & Biases, ThursdAI) - "Keeping AI Open"

[02:09:39] Rahul Sonwalkar (Julius AI) - Advice for Founders


Get full access to Latent.Space at www.latent.space/subscribe
Latent Space: The AI Engineer Podcast - The Creators of Model Context Protocol

04/03/25 • 79 min

Today’s guests, David Soria Parra and Justin Spahr-Summers, are the creators of Anthropic’s Model Context Protocol (MCP). When we first wrote Why MCP Won, we had no idea how quickly it was about to win.

In the past 4 weeks, OpenAI and now Google have announced MCP support, effectively confirming our prediction that MCP was the presumptive winner of the agent standard wars. MCP has now overtaken OpenAPI, the incumbent option and most direct alternative, in GitHub stars (3 months ahead of the conservative trendline).

For protocol and history nerds, we also asked David and Justin to tell the origin story of MCP, which we leave to the reader to enjoy (you can also skim the transcripts, or the changelogs of a certain favored IDE). It’s incredible the impact that individual engineers solving their own problems can have on an entire industry.

Timestamps

00:00 Introduction and Guest Welcome

00:37 What is MCP?

02:00 The Origin Story of MCP

05:18 Development Challenges and Solutions

08:06 Technical Details and Inspirations

29:45 MCP vs OpenAPI

32:48 Building MCP Servers

40:39 Exploring Model Independence in LLMs

41:36 Building Richer Systems with MCP

43:13 Understanding Agents in MCP

45:45 Nesting and Tool Confusion in MCP

49:11 Client Control and Tool Invocation

52:08 Authorization and Trust in MCP Servers

01:01:34 Future Roadmap and Stateless Servers

01:10:07 Open Source Governance and Community Involvement

01:18:12 Wishlist and Closing Remarks

Latent Space: The AI Engineer Podcast - How AI is eating Finance — with Mike Conover of Brightwave

06/11/24 • 54 min

In April 2023 we released an episode named “Mapping the future of *truly* open source models” to talk about Dolly, the first open, commercial LLM.

Mike was leading the OSS models team at Databricks at the time. Today, Mike is back on the podcast to give us the “one year later” update on the evolution of large language models and how he’s been using them to build Brightwave, an AI research assistant for investment professionals.

Today they are announcing a $6M seed round (led by Alessio and Decibel!), and sharing some of the learnings from serving customers with >$120B of assets under management in the 4 months since launch.

Losing faith in long context windows

In our recent “Llama3 1M context window” episode we talked about the amazing progress we have made in context window size, but it’s good to remember that Dolly’s original context size was 1,024 tokens, and this was only 14 months ago.

But while the context length models can understand has increased, they are still not able to generate very long answers. His empirical intuition (which matches ours while building smol-podcaster) is that most commercial LLMs, as well as Llama, tend to generate responses of <=1,200 tokens most of the time. While Needle in a Haystack tests will pass with flying colors at most context sizes, the granularity of the summary decreases as the context increases, because the model tries to fit the answer into the same token range rather than returning something close to the 4,096-token max_output, for example.

Recently Rob Mulla from Dreadnode highlighted how LMSYS Arena results favor longer responses by a large margin, so both LLMs and humans have a well documented length bias which doesn’t necessarily track the quality of the answer.

The way Mike and team solved this is by breaking down the task into multiple subtasks, and then merging them back together. For example, have a book summarized chapter by chapter to preserve more details, and then put those summaries together. In Brightwave’s case, this means creating multiple subsystems that accomplish different tasks on a large corpus of text separately, and then bringing them all together in a report. For example: understanding the intent of the question, extracting relations between companies, figuring out whether sentiment is positive or negative, etc.
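A minimal sketch of that decompose-then-merge pattern; call_llm is a hypothetical stand-in for whatever model client you use, not Brightwave's actual pipeline.

```python
# Hedged sketch of the decompose-then-merge pattern described above:
# summarize each chapter separately, then synthesize the partial summaries.
# `call_llm` is a hypothetical helper standing in for any chat-completion client.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")


def summarize_book(chapters: list[str]) -> str:
    # Map step: each chapter gets its own, detail-preserving summary.
    partials = [
        call_llm(f"Summarize this chapter, keeping key facts and names:\n\n{ch}")
        for ch in chapters
    ]
    # Reduce step: merge the partial summaries instead of asking one model call
    # to compress the whole book into ~1,200 tokens.
    joined = "\n\n".join(f"Chapter {i + 1}: {p}" for i, p in enumerate(partials))
    return call_llm(f"Combine these chapter summaries into one report:\n\n{joined}")
```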

Mike’s question is whether or not we’ll be able to imbue better synthesis capabilities into the models: can you have synthesis-oriented demonstrations at training time rather than single-token prediction?

“LLMs as Judges” Strategies

In our David Luan episode he mentioned they don’t use any benchmarks for their models, because the benchmarks don’t reflect their customer needs. Brightwave shared some tips on leveraging LLMs as Judges:

Human vs LLM reviews: while they work with human annotators to create high-quality datasets, that data isn’t just used to fine-tune models but also as a reference basis for future LLM reviews. Having a set of trusted data to use as calibration helps you trust the LLM judgement even more.

Ensemble consistency checking: rather than using an LLM as judge for one output, you use different LLMs to generate a result for the same task, and then use another LLM to highlight where those generations differ. Do the two outputs differ meaningfully? Do they have different beliefs about the implications of something? If there are a lot of discrepancies between generations coming from different models, you then do additional passes to try and resolve them.
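A hedged sketch of that ensemble consistency check; generate is a hypothetical helper and the model names are placeholders, not Brightwave's actual setup.

```python
# Hedged sketch of ensemble consistency checking as described above. The
# `generate` helper is hypothetical; any two different model backends work.
def generate(model: str, prompt: str) -> str:
    raise NotImplementedError("call your preferred LLM client here")


def consistency_check(task: str) -> str:
    answer_a = generate("model-a", task)
    answer_b = generate("model-b", task)
    # A third pass acts as the judge, asked only to surface disagreements,
    # which would then trigger additional resolution passes.
    judge_prompt = (
        "Two analysts answered the same question. List every place where their "
        f"claims or implications differ.\n\nQuestion: {task}\n\n"
        f"Answer A: {answer_a}\n\nAnswer B: {answer_b}"
    )
    return generate("judge-model", judge_prompt)
```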

Entailment verification: for each unique insight that they generate, they take the output and separately ask LLMs to verify the factuality of the information based on the original sources. In the actual product, users can then highlight any piece of text and ask it to 1) “Tell Me More” 2) “Show Sources”. Since there’s no way to guarantee factuality of 100% of outputs, and humans have good intuition for things that look out of the ordinary, giving the user access to the review tool helps them build trust in it.

It’s all about the data

Latent Space: The AI Engineer Podcast - Unsupervised Learning x Latent Space Crossover Special

03/29/25

Unsupervised Learning is a podcast that interviews the sharpest minds in AI about what’s real today, what will be real in the future and what it means for businesses and the world - helping builders, researchers and founders deconstruct and understand the biggest breakthroughs.

Top guests: Noam Shazeer, Bob McGrew, Noam Brown, Dylan Patel, Percy Liang, David Luan

https://www.latent.space/p/unsupervised-learning

Timestamps

00:00 Introduction and Excitement for Collaboration

00:27 Reflecting on Surprises in AI Over the Past Year

01:44 Open Source Models and Their Adoption

06:01 The Rise of GPT Wrappers

06:55 AI Builders and Low-Code Platforms

09:35 Overhyped and Underhyped AI Trends

22:17 Product Market Fit in AI

28:23 Google's Current Momentum

28:33 Customer Support and AI

29:54 AI's Impact on Cost and Growth

31:05 Voice AI and Scheduling

32:59 Emerging AI Applications

34:12 Education and AI

36:34 Defensibility in AI Applications

40:10 Infrastructure and AI

47:08 Challenges and Future of AI

52:15 Quick Fire Round and Closing Remarks

Chapters

  • 00:00:00 Introduction and Collab Excitement
  • 00:00:58 Open Source and Model Adoption
  • 00:01:58 Enterprise Use of Open Source Models
  • 00:02:57 The Competitive Edge of Closed Source Models
  • 00:03:56 DeepSeek and Open Source Model Releases
  • 00:04:54 Market Narrative and DeepSeek Impact
  • 00:05:53 AI Engineering and GPT Wrappers
  • 00:06:53 AI Builders and Low-Code Platforms
  • 00:07:50 Innovating Beyond Existing Paradigms
  • 00:08:50 Apple and AI Product Development
  • 00:09:48 Overhyped and Underhyped AI Trends
  • 00:10:46 Frameworks and Protocols in AI Development
  • 00:11:45 Emerging Opportunities in AI
  • 00:12:44 Stateful AI and Memory Innovation
  • 00:13:44 Challenges with Memory in AI Agents
  • 00:14:44 The Future of Model Training Companies
  • 00:15:44 Specialized Use Cases for AI Models
  • 00:16:44 Vertical Models vs General Purpose Models
  • 00:17:42 General Purpose vs Domain-Specific Models
  • 00:18:42 Reflections on Model Companies
  • 00:19:39 Model Companies Entering Product Space
  • 00:20:38 Competition in AI Model and Product Sectors
  • 00:21:35 Coding Agents and Market Dynamics
  • 00:22:35 Defensibility in AI Applications
  • 00:23:35 Investing in Underappreciated AI Ventures
  • 00:24:32 Analyzing Market Fit in AI
  • 00:25:31 AI Applications with Product Market Fit
  • 00:26:31 OpenAI's Impact on the Market
  • 00:27:31 Google and OpenAI Competition
  • 00:28:31 Exploring Google's Advancements
  • 00:29:29 Customer Support and AI Applications
  • 00:30:27 The Future of AI in Customer Support
  • 00:31:26 Cost-Cutting vs Growth in AI
  • 00:32:23 Voice AI and Real-World Applications
  • 00:33:23 Scaling AI Applications for Demand
  • 00:34:22 Summarization and Conversational AI
  • 00:35:20 Future AI Use Cases and Market Fit
  • 00:36:20 AI Education and Model Capabilities
  • 00:37:17 Reforming Education with AI
  • 00:38:15 Defensibility in AI Apps
  • 00:39:13 Network Effects and AI
  • 00:40:12 AI Brand and Market Positioning
  • 00:41:11 AI Application Defensibility
  • 00:42:09 LLM OS and AI Infrastructure
  • 00:43:06 Security and AI Application
  • 00:44:06 OpenAI's Role in AI Infrastructure
  • 00:45:02 The Balance of AI Applications and Infrastructure
  • 00:46:02 Capital Efficiency in AI Infrastructure
  • 00:47:01 Challenges in AI DevOps and Infrastructure
  • 00:47:59 AI SRE and Monitoring
  • 00:48:59 Scaling AI and Hardware Challenges
  • 00:49:58 Reliability and Compute in AI
  • 00:50:57 Nvidia's Dominance and AI Hardware
  • 00:51:57 Emerging Competition in AI Silicon
  • 00:52:54 Agent Authenticatio...
Latent Space: The AI Engineer Podcast - Why AI Agents Don't Work (yet) - with Kanjun Qiu of Imbue

10/14/23 • 65 min

Thanks to the over 11,000 people who joined us for the first AI Engineer Summit! A full recap is coming, but you can 1) catch up on the fun and videos on Twitter and YouTube, 2) help us reach 1000 people for the first comprehensive State of AI Engineering survey and 3) submit projects for the new AI Engineer Foundation.

See our Community page for upcoming meetups in SF, Paris, NYC, and Singapore.

This episode had good interest on Twitter.

Last month, Imbue was crowned as AI’s newest unicorn foundation model lab, raising a $200m Series B at a >$1 billion valuation. As “stealth” foundation model companies go, Imbue (f.k.a. Generally Intelligent) has stood as an enigmatic group given they have no publicly released models to try out. However, ever since their $20m Series A last year their goal has been to “develop generally capable AI agents with human-like intelligence in order to solve problems in the real world”.

From RL to Reasoning LLMs

Along with their Series A, they announced Avalon, “A Benchmark for RL Generalization Using Procedurally Generated Worlds”. Avalon is built on top of the open source Godot game engine and is ~100x faster than Minecraft, enabling fast RL benchmarking with a clear reward signal and adjustable game difficulty.

After a while, they realized that pure RL isn’t a good path to teach reasoning and planning. The agents were able to learn mechanical things like opening complex doors and climbing, but couldn’t progress to higher-level tasks. A pure RL world also doesn’t include a language explanation of the agent’s reasoning, which made it hard to understand why it made certain decisions. That pushed the team more towards the “models for reasoning” path:

“The second thing we learned is that pure reinforcement learning is not a good vehicle for planning and reasoning. So these agents were able to learn all sorts of crazy things: They could learn to climb like hand over hand in VR climbing, they could learn to open doors like very complicated, like multiple switches and a lever open the door, but they couldn't do any higher level things. And they couldn't do those lower level things consistently necessarily. And as a user, I do not want to interact with a pure reinforcement learning end to end RL agent. As a user, like I need much more control over what that agent is doing.”

Inspired by Chelsea Finn’s work on SayCan at Stanford, the team pivoted to have their agents do the reasoning in natural language instead. This development parallels the large leaps in reasoning that humans have developed as the scientific method:

“We are better at reasoning now than we were 3000 years ago. An example of a reasoning strategy is noticing you're confused. Then when I notice I'm confused, I should ask:

What was the original claim that was made?

What evidence is there for this claim?

Does the evidence support the claim?

Is the claim correct?

This is like a reasoning strategy that was developed in like the 1600s, you know, with like the advent of science. So that's an example of a reasoning strategy. There are tons of them. We employ all the time, lots of heuristics that help us be better at reasoning. And we can generate data that's much more specific to them.”

The Full Stack Model Lab

One year later, it would seem that the pivot to reasoning has had tremendous success, and Imbue has now reached a >$1B valuation, with participation ...

Latent Space: The AI Engineer Podcast - SF Compute: Commoditizing Compute

04/11/25 • 72 min

Evan Conrad, co-founder of SF Compute, joined us to talk about how they started as an AI lab that avoided bankruptcy by selling GPU clusters, why CoreWeave financials look like a real estate business, and how GPUs are turning into a commodities market.

Chapters:

00:00:05 - Introductions

00:00:12 - Introduction of guest Evan Conrad from SF Compute

00:00:12 - CoreWeave Business Model Discussion

00:05:37 - CoreWeave as a Real Estate Business

00:08:59 - Interest Rate Risk and GPU Market Strategy Framework

00:16:33 - Why Together and DigitalOcean will lose money on their clusters

00:20:37 - SF Compute's AI Lab Origins

00:25:49 - Utilization Rates and Benefits of SF Compute Market Model

00:30:00 - H100 GPU Glut, Supply Chain Issues, and Future Demand Forecast

00:34:00 - P2P GPU networks

00:36:50 - Customer stories

00:38:23 - VC-Provided GPU Clusters and Credit Risk Arbitrage

00:41:58 - Market Pricing Dynamics and Preemptible GPU Pricing Model

00:48:00 - Future Plans for Financialization?

00:52:59 - Cluster auditing and quality control

00:58:00 - Futures Contracts for GPUs

01:01:20 - Branding and Aesthetic Choices Behind SF Compute

01:06:30 - Lessons from Previous Startups

01:09:07 - Hiring at SF Compute

Chapters

  • 00:00:00 Introduction and Background
  • 00:00:58 Analysis of GPU Business Models
  • 00:01:53 Challenges with GPU Pricing
  • 00:02:48 Revenue and Scaling with GPUs
  • 00:03:46 Customer Sensitivity to GPU Pricing
  • 00:04:44 CoreWeave's Business Strategy
  • 00:05:41 CoreWeave's Market Perception
  • 00:06:40 Hyperscalers and GPU Market Dynamics
  • 00:07:37 Financial Strategies for GPU Sales
  • 00:08:35 Interest Rates and GPU Market Risks
  • 00:09:30 Optimal GPU Contract Strategies
  • 00:10:27 Risks in GPU Market Contracts
  • 00:11:25 Price Sensitivity and Market Competition
  • 00:12:21 Market Dynamics and GPU Contracts
  • 00:13:18 Hyperscalers and GPU Market Strategies
  • 00:14:15 Nvidia and Market Competition
  • 00:15:12 Microsoft's Role in GPU Market
  • 00:16:10 Challenges in GPU Market Dynamics
  • 00:17:07 Economic Realities of the GPU Market
  • 00:18:03 Real Estate Model for GPU Clouds
  • 00:18:59 Price Sensitivity and Chip Design
  • 00:19:55 SF Compute's Beginnings and Challenges
  • 00:20:54 Navigating the GPU Market
  • 00:21:54 Pivoting to a GPU Cloud Provider
  • 00:22:53 Building a GPU Market
  • 00:23:52 SF Compute as a GPU Marketplace
  • 00:24:49 Market Liquidity and GPU Pricing
  • 00:25:47 Utilization Rates in GPU Markets
  • 00:26:44 Brokerage and Market Flexibility
  • 00:27:42 H100 Glut and Market Cycles
  • 00:28:40 Supply Chain Challenges and GPU Glut
  • 00:29:35 Future Predictions for the GPU Market
  • 00:30:33 Speculations on Test Time Inference
  • 00:31:29 Market Demand and Test Time Inference
  • 00:32:26 Open Source vs. Closed AI Demand
  • 00:33:24 Future of Inference Demand
  • 00:34:24 Peer-to-Peer GPU Markets
  • 00:35:17 Decentralized GPU Market Skepticism
  • 00:36:15 Redesigning Architectures for New Markets
  • 00:37:14 Supporting Grad Students and Startups
  • 00:38:11 Successful Startups Using SF Compute
  • 00:39:11 VCs and GPU Infrastructure
  • 00:40:09 VCs as GPU Credit Transformators
  • 00:41:06 Market Timing and GPU Infrastructure
  • 00:42:02 Understanding GPU Pricing Dynamics
  • 00:43:01 Market Pricing and Preemptible Compute
  • 00:43:55 Price Volatility and Market Optimization
  • 00:44:52 Customizing Compute Contracts
  • 00:45:50 Creating Flexible Compute Guarantees
  • 00:46:45 Financialization of GPU Markets
  • 00:47:44 Building a Spot Market for GPUs
  • 00:48:40 Auditing and Standardizing Clusters
  • 00:49:40 Ensuring Cluster Reliability
  • 00:50:36 Active Mo...


FAQ

How many episodes does Latent Space: The AI Engineer Podcast have?

Latent Space: The AI Engineer Podcast currently has 127 episodes available.

What topics does Latent Space: The AI Engineer Podcast cover?

The podcast is about Entrepreneurship, Software, Podcasts, Technology, Business and Engineering.

What is the most popular episode on Latent Space: The AI Engineer Podcast?

The episode title 'Commoditizing the Petaflop — with George Hotz of the tiny corp' is the most popular.

What is the average episode length on Latent Space: The AI Engineer Podcast?

The average episode length on Latent Space: The AI Engineer Podcast is 76 minutes.

How often are episodes of Latent Space: The AI Engineer Podcast released?

Episodes of Latent Space: The AI Engineer Podcast are typically released every 6 days, 14 hours.

When was the first episode of Latent Space: The AI Engineer Podcast?

The first episode of Latent Space: The AI Engineer Podcast was released on Feb 23, 2023.
