
Latent Space: The AI Engineer Podcast
swyx + Alessio

Top 10 Latent Space: The AI Engineer Podcast Episodes
Goodpods has curated a list of the 10 best Latent Space: The AI Engineer Podcast episodes, ranked by the number of listens and likes each episode has garnered from our listeners. If you are listening to Latent Space: The AI Engineer Podcast for the first time, there's no better place to start than with one of these standout episodes. If you are a fan of the show, vote for your favorite Latent Space: The AI Engineer Podcast episode by adding your comments to the episode page.

Open Operator, Serverless Browsers and the Future of Computer-Using Agents
Latent Space: The AI Engineer Podcast
02/28/25 • 61 min
Today's episode is with Paul Klein, founder of Browserbase. We talked about building browser infrastructure for AI agents, the future of agent authentication, and their open source framework Stagehand.
[00:00:00] Introductions
[00:04:46] AI-specific challenges in browser infrastructure
[00:07:05] Multimodality in AI-Powered Browsing
[00:12:26] Running headless browsers at scale
[00:18:46] Geolocation when proxying
[00:21:25] CAPTCHAs and Agent Auth
[00:28:21] Building “User take over” functionality
[00:33:43] Stagehand: AI web browsing framework
[00:38:58] OpenAI's Operator and computer use agents
[00:44:44] Surprising use cases of Browserbase
[00:47:18] Future of browser automation and market competition
[00:53:11] Being a solo founder
Transcript
Alessio [00:00:04]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai.
swyx [00:00:12]: Hey, and today we are very blessed to have our friend Paul Klein the Fourth, CEO of Browserbase. Welcome.
Paul [00:00:21]: Thanks guys. Yeah, I'm happy to be here. I've been lucky to know both of you for like a couple of years now, I think. So it's just like we're hanging out, you know, with three ginormous microphones in front of our face. It's totally normal hangout.
swyx [00:00:34]: Yeah. We've actually mentioned you on the podcast, I think, more often than any other Solaris tenant. Just because like you're one of the, you know, best performing, I think, LLM tool companies that have started up in the last couple of years.
Paul [00:00:50]: Yeah, I mean, it's been a whirlwind of a year, like Browserbase is actually pretty close to our first birthday. So we are one years old. And going from, you know, starting a company as a solo founder to... To, you know, having a team of 20 people, you know, a series A, but also being able to support hundreds of AI companies that are building AI applications that go out and automate the web. It's just been like, really cool. It's been happening a little too fast. I think like collectively as an AI industry, let's just take a week off together. I took my first vacation actually two weeks ago, and Operator came out on the first day, and then a week later, DeepSeek came out. And I'm like on vacation trying to chill. I'm like, we got to build with this stuff, right? So it's been a breakneck year. But I'm super happy to be here and like talk more about all the stuff we're seeing. And I'd love to hear kind of what you guys are excited about too, and share with it, you know?
swyx [00:01:39]: Where to start? So people, you've done a bunch of podcasts. I think I strongly recommend Jack Bridger's Scaling DevTools, as well as Turner Novak's The Peel. And, you know, I'm sure there's others. So you covered your Twilio story in the past, talked about StreamClub, you got acquired to Mux, and then you left to start Browserbase. So maybe we just start with what is Browserbase? Yeah.
Paul [00:02:02]: Browserbase is the web browser for your AI. We're building headless browser infrastructure, which are browsers that run in a server environment that's accessible to developers via APIs and SDKs. It's really hard to run a web browser in the cloud. You guys are probably running Chrome on your computers, and that's using a lot of resources, right? So if you want to run a web browser or thousands of web browsers, you can't just spin up a bunch of lambdas. You actually need to use a secure containerized environment. You have to scale it up and down. It's a stateful system. And that infrastructure is, like, super painful. And I know that firsthand, because at my last company, StreamClub, I was CTO, and I was building our own internal headless browser infrastructure. That's actually why we sold the company, is because Mux really wanted to buy our headless browser infrastructure that we'd built. And it's just a super hard problem. And I actually told my co-founders, I would never start another company unless it was a browser infrastructure company. And it turns out that's really necessary in the age of AI, when AI can actually go out and interact with websites, click on buttons, fill in forms. You need AI to do all of that work in an actual browser running somewhere on a server. And BrowserBase powers that.
swyx [00:03:08]: While you're talking about it, it occurred to me, not that you're going to be acquired or anything, but it occurred to me that it would be really funny if you became the Nikita Beer of headless browser companies. You just have one trick, ...


Agent Engineering with Pydantic + Graphs — with Samuel Colvin
Latent Space: The AI Engineer Podcast
02/06/25 • 64 min
Did you know that adding a simple Code Interpreter took o3 from 9.2% to 32% on FrontierMath? The Latent Space crew is hosting a hack night Feb 11th in San Francisco focused on CodeGen use cases, co-hosted with E2B and Edge AGI; watch E2B’s new workshop and RSVP here!
We’re happy to announce that today’s guest Samuel Colvin will be teaching his very first Pydantic AI workshop at the newly announced AI Engineer NYC Workshops day on Feb 22! 25 tickets left.
If you’re a Python developer, it’s very likely that you’ve heard of Pydantic. Every month, it’s downloaded >300,000,000 times, making it one of the top 25 PyPI packages. OpenAI uses it in its SDK for structured outputs, it’s at the core of FastAPI, and if you’ve followed our AI Engineer Summit conference, Jason Liu of Instructor has given two great talks about it: “Pydantic is all you need” and “Pydantic is STILL all you need”.
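If you haven't used it, here is a minimal sketch of the core idea (assuming Pydantic v2; the Invoice model and its fields are invented for illustration): declare a schema as a class and let Pydantic validate untrusted data, such as JSON emitted by an LLM, against it.

```python
from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):
    vendor: str
    total_usd: float
    paid: bool = False  # default applied when the field is missing

raw = '{"vendor": "Acme Corp", "total_usd": "19.99"}'  # e.g. JSON emitted by an LLM

try:
    invoice = Invoice.model_validate_json(raw)  # coerces "19.99" -> 19.99, fills the default
    print(invoice)
except ValidationError as err:
    print(err.errors())  # structured errors you could feed back to the model
```

This validate-or-explain loop is a big part of why structured-output libraries like Instructor build on Pydantic.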
Now, Samuel Colvin has raised $17M from Sequoia to turn Pydantic from an open source project to a full stack AI engineer platform with Logfire, their observability platform, and PydanticAI, their new agent framework.
Logfire: bringing OTEL to AI
OpenTelemetry recently merged Semantic Conventions for LLM workloads, which provide standard definitions for tracking performance metrics like gen_ai.server.time_per_output_token. In Sam’s view, at least 80% of new apps being built today have some sort of LLM usage in them, and just like web observability platforms got replaced by cloud-first ones in the 2010s, Logfire wants to do the same for AI-first apps.
If you’re interested in the technical details, Logfire migrated away from ClickHouse to DataFusion for their backend. We spent some time on the importance of picking open source tools you understand and that you can actually contribute to upstream, rather than the more popular ones; listen in at ~43:19 for that part.
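As a rough sketch of what “OTEL for LLM workloads” looks like, here is a span annotated with gen_ai.* attributes using the OpenTelemetry Python SDK; the attribute keys follow the semantic conventions mentioned above but exact names vary across semconv versions, and a real setup would export to a backend like Logfire rather than the console.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Console exporter for demonstration only; an OTLP exporter pointed at your backend would replace it.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm-app")

def traced_chat(prompt: str) -> str:
    with tracer.start_as_current_span("chat") as span:
        span.set_attribute("gen_ai.system", "openai")         # which provider was called
        span.set_attribute("gen_ai.request.model", "gpt-4o")   # requested model
        completion = "...model output..."                       # placeholder for the real API call
        span.set_attribute("gen_ai.usage.input_tokens", 42)     # token counts from the response
        span.set_attribute("gen_ai.usage.output_tokens", 128)
        return completion

traced_chat("Summarize this document")
```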
Agents are the killer app for graphs
Pydantic AI is their attempt at taking a lot of the learnings that LangChain and the other early LLM frameworks had, and putting Python best practices into it. At an API level, it’s very similar to the other libraries: you can call LLMs, create agents, do function calling, do evals, etc.
They define an “Agent” as a container with a system prompt, tools, structured result, and an LLM. Under the hood, each Agent is now a graph of function calls that can orchestrate multi-step LLM interactions. You can start simple, then move toward fully dynamic graph-based control flow if needed.
“We were compelled enough by graphs once we got them right that our agent implementation [...] is now actually a graph under the hood.”
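This is not the PydanticAI API itself, just a minimal sketch of the “graph under the hood” idea: each node is a function that does some work on shared state (in practice, an LLM call) and returns the name of the next node, so multi-step control flow becomes an explicit, inspectable graph.

```python
from typing import Callable, Optional

# A node mutates shared state and names the next node (or None to stop).
Node = Callable[[dict], Optional[str]]

def plan(state: dict) -> Optional[str]:
    state["plan"] = f"answer: {state['question']}"  # stand-in for an LLM call
    return "answer"

def answer(state: dict) -> Optional[str]:
    state["answer"] = state["plan"].upper()         # stand-in for a second LLM step
    return "check"

def check(state: dict) -> Optional[str]:
    return None if state["answer"] else "plan"      # loop back if the answer came up empty

GRAPH: dict[str, Node] = {"plan": plan, "answer": answer, "check": check}

def run(start: str, state: dict) -> dict:
    node: Optional[str] = start
    while node is not None:
        node = GRAPH[node](state)
    return state

print(run("plan", {"question": "What is Pydantic AI?"}))
```

Once the graph is explicit like this, rendering it as a mermaid diagram or pausing between nodes (persist the state plus the next node name) falls out naturally, which is what the next section is about.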
Why Graphs?
More natural for complex or multi-step AI workflows.
Easy to visualize and debug with mermaid diagrams.
Potential for distributed runs, or “waiting days” between steps in certain flows.
In parallel, you see folks like Emil Eifrem of Neo4j talk about GraphRAG as another place where graphs fit really well in the AI stack, so it might be time for more people to take them seriously.
Full Video Episode
Chapters
00:00:00 Introductions
00:00:24 Origins of Pydantic
...

From RLHF to RLHB: The Case for Learning from Human Behavior - with Jeffrey Wang and Joe Reeve of Amplitude
Latent Space: The AI Engineer Podcast
06/08/23 • 49 min
Welcome to the almost 3k Latent Space explorers who joined us last month! We’re holding our first SF listener meetup with Practical AI next Monday; join us if you want to meet past guests and put faces to voices! All events are in /community.
Who among you regularly click the ubiquitous 👍 /👎 buttons in ChatGPT/Bard/etc?
Anyone? I don’t see any hands up.
OpenAI has told us how important reinforcement learning from human feedback (RLHF) is to creating the magic that is ChatGPT, but we know from our conversation with Databricks’ Mike Conover just how hard it is to get even 15,000 pieces of explicit, high-quality human responses.
We are shockingly reliant on good human feedback. Andrej Karpathy’s recent keynote at Microsoft Build on the State of GPT demonstrated just how much of the training process relies on contractors to supply the millions of items of human feedback needed to make a ChatGPT-quality LLM (highlighted by us in red).
But the collection of good feedback is an incredibly messy problem. First of all, if you have contractors paid by the datapoint, they are incentivized to blast through as many as possible without much thought. So you hire more contractors and double, maybe triple, your costs. Ok, you say, let’s recruit missionaries, not mercenaries. People should volunteer their data! Then you run into the same problem we and any consumer review platform run into: the vast majority of people send nothing at all, and those who do disproportionately represent negative reactions. More subtle problems emerge when you try to capture subjective human responses: the reason ChatGPT responses tend to be inhumanly verbose is that humans have a well-documented “longer = better” bias when classifying responses in a “laboratory setting”.
The fix for this, of course, is to get out of the lab and learn from real human behavior, not artificially constructed human feedback. You don’t see a thumbs up/down button in GitHub Copilot, Codeium, or Codium. Instead, they work an implicit accept/reject event into the product workflow, such that you cannot help but give feedback while you use the product. This way you hear from all your users, in their natural environments, doing valuable tasks they are familiar with. The prototypical example of this is Midjourney, which unobtrusively collects 1 of 9 types of feedback from every user as part of their workflow, in exchange for much faster first-draft image generations.
The best-known public example of AI product telemetry is in the Copilot-Explorer writeup, which checks for the presence of generated code at intervals of 15-600 seconds after a suggestion, enabling GitHub to claim that 40% of code is generated by Copilot.
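As a hedged sketch of what such implicit telemetry can look like (the event name, threshold, and retention heuristic are invented for illustration, not Copilot's or Amplitude's actual implementation): some interval after showing a completion, check how much of it survives in the user's buffer and log that as an accept/reject signal.

```python
import difflib

def suggestion_retention(suggested: str, buffer_later: str) -> float:
    """Rough share of the suggestion still present in the editor buffer after a delay."""
    matcher = difflib.SequenceMatcher(None, suggested, buffer_later)
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return kept / max(len(suggested), 1)

def implicit_feedback_event(suggestion_id: str, suggested: str, buffer_later: str) -> dict:
    retention = suggestion_retention(suggested, buffer_later)
    return {
        "event": "completion_retained",  # hypothetical event name
        "suggestion_id": suggestion_id,
        "retention": round(retention, 2),
        "accepted": retention > 0.5,     # arbitrary threshold, for illustration only
    }

# 15-600 seconds after showing a suggestion, sample the buffer and log the event.
print(implicit_feedback_event(
    "s_123",
    "def add(a, b):\n    return a + b",
    "def add(a, b):\n    return a + b  # tweaked by the user",
))
```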
This is fantastic and “obviously” the future of productized AI. Every AI application should figure out how to learn from all their real users, not some contractors in a foreign country. Most prompt engineers and prompt engineering tooling tend to focus on pre-production prototyping, but they could also benefit from A/B testing their prompts in the real world.
In short, AI may need Analytics more than Analytics needs AI.
Amplitude’s Month of AI
This is why Amplitude is going hard on AI - and why we recently spent a weekend talking to Jeffrey Wang, cofounder and chief architect at Amplitude, and Joe Reeve, head of AI, recording a li...

Commoditizing the Petaflop — with George Hotz of the tiny corp
Latent Space: The AI Engineer Podcast
06/20/23 • 72 min
We are now launching our dedicated new YouTube and Twitter! Any help in amplifying our podcast would be greatly appreciated, and of course, tell your friends!
Notable follow-on discussions collected on Twitter, Reddit, Reddit, Reddit, HN, and HN. Please don’t obsess too much over the GPT-4 discussion as it is mostly rumor; we spent much more time on tinybox/tinygrad, on which George is the foremost authority!
We are excited to share the world’s first interview with George Hotz on the tiny corp!
If you don’t know George, he was the first person to unlock the iPhone, jailbreak the PS3, went on to start Comma.ai, and briefly “interned” at the Elon Musk-run Twitter.
Tinycorp is the company behind the deep learning framework tinygrad, as well as the recently announced tinybox, a new $15,000 “luxury AI computer” aimed at local model training and inference, aka your “personal compute cluster”:
738 FP16 TFLOPS
144 GB GPU RAM
5.76 TB/s RAM bandwidth
30 GB/s model load bandwidth (big llama loads in around 4 seconds; see the quick check below)
AMD EPYC CPU
1600W (one 120V outlet)
Runs 65B FP16 LLaMA out of the box (using tinygrad, subject to software development risks)
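A quick back-of-envelope check of the load-time claim above, assuming 2 bytes per FP16 parameter:

```python
params = 65e9                              # 65B-parameter LLaMA
bytes_per_param = 2                        # FP16
model_gb = params * bytes_per_param / 1e9  # ~130 GB of weights
load_seconds = model_gb / 30               # 30 GB/s model load bandwidth
print(f"{model_gb:.0f} GB / 30 GB/s = {load_seconds:.1f} s")  # ~4.3 s, consistent with "around 4 seconds"
```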
(In the episode, we also talked about the future of the tinybox as the intelligence center of every home that will help run models, at-home robots, and more. Make sure to check the timestamps 👀 )
The tiny corp manifesto
There are three main theses to tinycorp:
If XLA/PrimTorch are CISC, tinygrad is RISC: CISC (Complex Instruction Set Computing) architectures use more complex instruction sets, where a single instruction can execute many low-level operations. RISC (Reduced Instruction Set Computing) instruction sets are smaller and only let you execute a single low-level operation per instruction, leading to faster and more efficient instruction execution. If you’ve used Apple Silicon M1/M2 or a Raspberry Pi, you’ve used a RISC computer.
If you can’t write a fast ML framework for GPU, you can’t write one for your own chip: there are many “AI chip” companies out there, and they all started from taping out the chip. Some of them, like Cerebras, are still building, while others, like Graphcore, seem to be struggling. But building chips with higher TFLOPS isn’t enough: “There’s a great chip already on the market. For $999, you get a 123 TFLOP card with 24 GB of 960 GB/s RAM. This is the best FLOPS per dollar today, and yet...nobody in ML uses it.”, referring to the AMD RX 7900 XTX. NVIDIA’s lead is not only thanks to high-performing cards, but also thanks to a great developer platform in CUDA. Starting with chip development rather than the dev toolkit is much more cost-intensive, so tinycorp is starting by writing a framework for off-the-shelf hardware rather than taping out their own chip.
Turing completeness considered harmful: Once you call in to Turing complete kernels, you can ...

AGI is Being Achieved Incrementally (DevDay Recap - cleaned audio)
Latent Space: The AI Engineer Podcast
11/08/23 • 141 min
We left a high amount of background audio in the DevDay podcast, which many of you loved, but we definitely understand that some of you may have had trouble with it. Listener Klaus Breyer ran it through Auphonic with speech isolation and we figured we’d upload it as a backdated pod for people who prefer this. Of course it means that our speakers sound out of place, since they now sound like they are talking loudly in a quiet room. Let us know in the comments what you think.
Timestamps
The cleaned part is only Part II:
[00:55:09] Part II: Spot Interviews
[00:55:59] Jim Fan (Nvidia) - High Level Takeaways
[01:05:19] Raza Habib (Humanloop) - Foundation Model Ops
[01:13:32] Surya Dantuluri (Stealth) - RIP Plugins
[01:20:53] Reid Robinson (Zapier) - AI Actions for GPTs
[01:30:45] Div Garg (MultiOn) - GPT4V for Agents
[01:36:42] Louis Knight-Webb (Bloop.ai) - AI Code Search
[01:48:36] Shreya Rajpal (Guardrails) - Guardrails for LLMs
[01:59:00] Alex Volkov (Weights & Biases, ThursdAI) - "Keeping AI Open"
[02:09:39] Rahul Sonwalkar (Julius AI) - Advice for Founders
Get full access to Latent.Space at www.latent.space/subscribe

The Creators of Model Context Protocol
Latent Space: The AI Engineer Podcast
04/03/25 • 79 min
Today’s guests, David Soria Parra and Justin Spahr-Summers, are the creators of Anthropic’s Model Context Protocol (MCP). When we first wrote Why MCP Won, we had no idea how quickly it was about to win.
In the past 4 weeks, OpenAI and now Google have announced MCP support, effectively confirming our prediction that MCP was the presumptive winner of the agent standard wars. MCP has now overtaken OpenAPI, the incumbent option and most direct alternative, in GitHub stars (3 months ahead of the conservative trendline).
For protocol and history nerds, we also asked David and Justin to tell the origin story of MCP, which we leave to the reader to enjoy (you can also skim the transcripts, or the changelogs of a certain favored IDE). It’s incredible what an impact individual engineers solving their own problems can have on an entire industry.
Timestamps
00:00 Introduction and Guest Welcome
00:37 What is MCP?
02:00 The Origin Story of MCP
05:18 Development Challenges and Solutions
08:06 Technical Details and Inspirations
29:45 MCP vs Open API
32:48 Building MCP Servers
40:39 Exploring Model Independence in LLMs
41:36 Building Richer Systems with MCP
43:13 Understanding Agents in MCP
45:45 Nesting and Tool Confusion in MCP
49:11 Client Control and Tool Invocation
52:08 Authorization and Trust in MCP Servers
01:01:34 Future Roadmap and Stateless Servers
01:10:07 Open Source Governance and Community Involvement
01:18:12 Wishlist and Closing Remarks

How AI is eating Finance — with Mike Conover of Brightwave
Latent Space: The AI Engineer Podcast
06/11/24 • 54 min
In April 2023 we released an episode named “Mapping the future of *truly* open source models” to talk about Dolly, the first open, commercial LLM.
Mike was leading the OSS models team at Databricks at the time. Today, Mike is back on the podcast to give us the “one year later” update on the evolution of large language models and how he’s been using them to build Brightwave, an AI research assistant for investment professionals.
Today they are announcing a $6M seed round (led by Alessio and Decibel!) and sharing some of the learnings from serving customers with >$120B of assets under management in production over the 4 months since launch.
Losing faith in long context windows
In our recent “Llama3 1M context window” episode we talked about the amazing progress made on context window sizes, but it’s good to remember that Dolly’s original context size was 1,024 tokens, and that was only 14 months ago.
But while the length models can understand has increased, they are still not able to generate very long answers. His empirical intuition (which matches ours while building smol-podcaster) is that most commercial LLMs, as well as Llama, tend to generate responses of <=1,200 tokens most of the time. While Needle in a Haystack tests will pass with flying colors at most context sizes, the granularity of the summary decreases as the context grows, because the model tries to fit the answer into the same token range rather than returning something close to the 4,096-token max_output, for example.
Recently Rob Mulla from Dreadnode highlighted how LMSys Arena results prefer longer responses by a large margin, so both LLMs and humans have a well-documented length bias which doesn’t necessarily track the quality of the answer.
The way Mike and team solved this is by breaking down the task into multiple subtasks, and then merging them back together. For example, have a book summarized chapter by chapter to preserve more details, and then put those summaries together. In Brightwave’s case, it’s creating multiple subsystems that accomplish different tasks on a large corpus of text separately, and then bringing them all together in a report: for example, understanding the intent of the question, extracting relations between companies, figuring out whether sentiment is positive or negative, etc.
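A minimal sketch of that decomposition pattern (the toy summarize function stands in for an LLM call; Brightwave's actual subsystems are more elaborate): summarize each chunk separately, then merge the partial results in a final synthesis pass.

```python
def summarize(text: str, max_words: int = 30) -> str:
    """Toy summarizer; in practice this would be an LLM call with a summarization prompt."""
    return " ".join(text.split()[:max_words])

def summarize_book(chapters: list[str]) -> str:
    # Step 1: summarize each chapter separately so details aren't squeezed
    # into a single context-limited pass.
    chapter_summaries = [summarize(ch) for ch in chapters]
    # Step 2: merge the per-chapter summaries in one final synthesis pass.
    return summarize("\n".join(chapter_summaries), max_words=120)

book = ["Chapter one text ...", "Chapter two text ...", "Chapter three text ..."]
print(summarize_book(book))
```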
Mike’s question is whether or not we’ll be able to imbue better synthesis capabilities in the models: can you have synthesis-oriented demonstrations at training time rather than single token prediction?
“LLMs as Judges” Strategies
In our David Luan episode he mentioned they don’t use any benchmarks for their models, because the benchmarks don’t reflect their customer needs. Brightwave shared some tips on leveraging LLMs as Judges:
Human vs LLM reviews: while they work with human annotators to create high-quality datasets, that data isn’t just used to fine-tune models but also as a reference basis for future LLM reviews. Having a set of trusted data to use as calibration helps you trust the LLM judgement even more.
Ensemble consistency checking: rather than using an LLM as judge for one output, you use different LLMs to generate a result for the same task, and then use another LLM to highlight where those generations differ. Do the two outputs differ meaningfully? Do they have different beliefs about the implications of something? If there are a lot of discrepancies between generations coming from different models, you then do additional passes to try and resolve them (a rough sketch follows below).
Entailment verification: for each unique insight that they generate, they take the output and separately ask LLMs to verify the factuality of the information based on the original sources. In the actual product, users can then highlight any piece of text and ask it to 1) “Tell Me More” 2) “Show Sources”. Since there’s no way to guarantee the factuality of 100% of outputs, and humans have good intuition for things that look out of the ordinary, giving the user access to the review tool helps them build trust in it.
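As a rough sketch of the ensemble consistency checking strategy above (the llm helper, model names, and prompts are placeholders, not Brightwave's implementation): generate the same task with different models, then have a judge model flag meaningful disagreements for further passes.

```python
from itertools import combinations

def llm(model: str, prompt: str) -> str:
    """Stand-in for your actual LLM client; returns a canned string so the sketch runs."""
    return f"[{model}] response to: {prompt[:40]}..."

MODELS = ["model-a", "model-b"]  # hypothetical model names
JUDGE = "model-judge"            # hypothetical judge model

def ensemble_check(task: str) -> dict:
    # 1. Generate the same task with different models.
    generations = {m: llm(m, task) for m in MODELS}
    # 2. Ask another LLM to highlight where the generations meaningfully differ.
    discrepancies = {}
    for a, b in combinations(MODELS, 2):
        prompt = (
            "List any meaningful factual or interpretive disagreements between "
            f"these two answers to the same task.\n\nA:\n{generations[a]}\n\nB:\n{generations[b]}"
        )
        discrepancies[f"{a} vs {b}"] = llm(JUDGE, prompt)
    # 3. Anything flagged here gets additional passes (or entailment checks
    #    against the original sources) before it reaches the report.
    return {"generations": generations, "discrepancies": discrepancies}

print(ensemble_check("Summarize Q4 guidance changes for ACME"))
```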
It’s all about the data

Unsupervised Learning x Latent Space Crossover Special
Latent Space: The AI Engineer Podcast
03/29/25
Unsupervised Learning is a podcast that interviews the sharpest minds in AI about what’s real today, what will be real in the future and what it means for businesses and the world - helping builders, researchers and founders deconstruct and understand the biggest breakthroughs.
Top guests: Noam Shazeer, Bob McGrew, Noam Brown, Dylan Patel, Percy Liang, David Luan
https://www.latent.space/p/unsupervised-learning
Timestamps
00:00 Introduction and Excitement for Collaboration
00:27 Reflecting on Surprises in AI Over the Past Year
01:44 Open Source Models and Their Adoption
06:01 The Rise of GPT Wrappers
06:55 AI Builders and Low-Code Platforms
09:35 Overhyped and Underhyped AI Trends
22:17 Product Market Fit in AI
28:23 Google's Current Momentum
28:33 Customer Support and AI
29:54 AI's Impact on Cost and Growth
31:05 Voice AI and Scheduling
32:59 Emerging AI Applications
34:12 Education and AI
36:34 Defensibility in AI Applications
40:10 Infrastructure and AI
47:08 Challenges and Future of AI
52:15 Quick Fire Round and Closing Remarks
Chapters
- 00:00:00 Introduction and Collab Excitement
- 00:00:58 Open Source and Model Adoption
- 00:01:58 Enterprise Use of Open Source Models
- 00:02:57 The Competitive Edge of Closed Source Models
- 00:03:56 DeepSeek and Open Source Model Releases
- 00:04:54 Market Narrative and DeepSeek Impact
- 00:05:53 AI Engineering and GPT Wrappers
- 00:06:53 AI Builders and Low-Code Platforms
- 00:07:50 Innovating Beyond Existing Paradigms
- 00:08:50 Apple and AI Product Development
- 00:09:48 Overhyped and Underhyped AI Trends
- 00:10:46 Frameworks and Protocols in AI Development
- 00:11:45 Emerging Opportunities in AI
- 00:12:44 Stateful AI and Memory Innovation
- 00:13:44 Challenges with Memory in AI Agents
- 00:14:44 The Future of Model Training Companies
- 00:15:44 Specialized Use Cases for AI Models
- 00:16:44 Vertical Models vs General Purpose Models
- 00:17:42 General Purpose vs Domain-Specific Models
- 00:18:42 Reflections on Model Companies
- 00:19:39 Model Companies Entering Product Space
- 00:20:38 Competition in AI Model and Product Sectors
- 00:21:35 Coding Agents and Market Dynamics
- 00:22:35 Defensibility in AI Applications
- 00:23:35 Investing in Underappreciated AI Ventures
- 00:24:32 Analyzing Market Fit in AI
- 00:25:31 AI Applications with Product Market Fit
- 00:26:31 OpenAI's Impact on the Market
- 00:27:31 Google and OpenAI Competition
- 00:28:31 Exploring Google's Advancements
- 00:29:29 Customer Support and AI Applications
- 00:30:27 The Future of AI in Customer Support
- 00:31:26 Cost-Cutting vs Growth in AI
- 00:32:23 Voice AI and Real-World Applications
- 00:33:23 Scaling AI Applications for Demand
- 00:34:22 Summarization and Conversational AI
- 00:35:20 Future AI Use Cases and Market Fit
- 00:36:20 AI Education and Model Capabilities
- 00:37:17 Reforming Education with AI
- 00:38:15 Defensibility in AI Apps
- 00:39:13 Network Effects and AI
- 00:40:12 AI Brand and Market Positioning
- 00:41:11 AI Application Defensibility
- 00:42:09 LLM OS and AI Infrastructure
- 00:43:06 Security and AI Application
- 00:44:06 OpenAI's Role in AI Infrastructure
- 00:45:02 The Balance of AI Applications and Infrastructure
- 00:46:02 Capital Efficiency in AI Infrastructure
- 00:47:01 Challenges in AI DevOps and Infrastructure
- 00:47:59 AI SRE and Monitoring
- 00:48:59 Scaling AI and Hardware Challenges
- 00:49:58 Reliability and Compute in AI
- 00:50:57 Nvidia's Dominance and AI Hardware
- 00:51:57 Emerging Competition in AI Silicon
- 00:52:54 Agent Authenticatio...

Why AI Agents Don't Work (yet) - with Kanjun Qiu of Imbue
Latent Space: The AI Engineer Podcast
10/14/23 • 65 min
Thanks to the over 11,000 people who joined us for the first AI Engineer Summit! A full recap is coming, but you can 1) catch up on the fun and videos on Twitter and YouTube, 2) help us reach 1000 people for the first comprehensive State of AI Engineering survey and 3) submit projects for the new AI Engineer Foundation.
See our Community page for upcoming meetups in SF, Paris, NYC, and Singapore.
This episode had good interest on Twitter.
Last month, Imbue was crowned as AI’s newest unicorn foundation model lab, raising a $200m Series B at a >$1 billion valuation. As “stealth” foundation model companies go, Imbue (f.k.a. Generally Intelligent) has stood as an enigmatic group given they have no publicly released models to try out. However, ever since their $20m Series A last year their goal has been to “develop generally capable AI agents with human-like intelligence in order to solve problems in the real world”.
From RL to Reasoning LLMs
Along with their Series A, they announced Avalon, “A Benchmark for RL Generalization Using Procedurally Generated Worlds”. Avalon is built on top of the open source Godot game engine, and is ~100x faster than Minecraft, enabling fast RL benchmarking with a clear reward and adjustable game difficulty.
After a while, they realized that pure RL isn’t a good path to teach reasoning and planning. The agents were able to learn mechanical things like opening complex doors and climbing, but couldn’t move on to higher-level tasks. A pure RL world also doesn’t include a language explanation of the agent’s reasoning, which made it hard to understand why it made certain decisions. That pushed the team more towards the “models for reasoning” path:
“The second thing we learned is that pure reinforcement learning is not a good vehicle for planning and reasoning. So these agents were able to learn all sorts of crazy things: They could learn to climb like hand over hand in VR climbing, they could learn to open doors like very complicated, like multiple switches and a lever open the door, but they couldn't do any higher level things. And they couldn't do those lower level things consistently necessarily. And as a user, I do not want to interact with a pure reinforcement learning end to end RL agent. As a user, like I need much more control over what that agent is doing.”
Inspired by Chelsea Finn’s work on SayCan at Stanford, the team pivoted to have their agents do the reasoning in natural language instead. This development parallels the large leaps in reasoning that humans made with the development of the scientific method:
“We are better at reasoning now than we were 3000 years ago. An example of a reasoning strategy is noticing you're confused. Then when I notice I'm confused, I should ask:
What was the original claim that was made?
What evidence is there for this claim?
Does the evidence support the claim?
Is the claim correct?
This is like a reasoning strategy that was developed in like the 1600s, you know, with like the advent of science. So that's an example of a reasoning strategy. There are tons of them. We employ all the time, lots of heuristics that help us be better at reasoning. And we can generate data that's much more specific to them.“
The Full Stack Model Lab
One year later, it would seem that the pivot to reasoning has had tremendous success, and Imbue has now reached a >$1B valuation, with participation ...

SF Compute: Commoditizing Compute
Latent Space: The AI Engineer Podcast
04/11/25 • 72 min
Evan Conrad, co-founder of SF Compute, joined us to talk about how they started as an AI lab that avoided bankruptcy by selling GPU clusters, why CoreWeave’s financials look like a real estate business, and how GPUs are turning into a commodities market.
Chapters:
00:00:05 - Introductions
00:00:12 - Introduction of guest Evan Conrad from SF Compute
00:00:12 - CoreWeave Business Model Discussion
00:05:37 - CoreWeave as a Real Estate Business
00:08:59 - Interest Rate Risk and GPU Market Strategy Framework
00:16:33 - Why Together and DigitalOcean will lose money on their clusters
00:20:37 - SF Compute's AI Lab Origins
00:25:49 - Utilization Rates and Benefits of SF Compute Market Model
00:30:00 - H100 GPU Glut, Supply Chain Issues, and Future Demand Forecast
00:34:00 - P2P GPU networks
00:36:50 - Customer stories
00:38:23 - VC-Provided GPU Clusters and Credit Risk Arbitrage
00:41:58 - Market Pricing Dynamics and Preemptible GPU Pricing Model
00:48:00 - Future Plans for Financialization?
00:52:59 - Cluster auditing and quality control
00:58:00 - Futures Contracts for GPUs
01:01:20 - Branding and Aesthetic Choices Behind SF Compute
01:06:30 - Lessons from Previous Startups
01:09:07 - Hiring at SF Compute
Chapters
- 00:00:00 Introduction and Background
- 00:00:58 Analysis of GPU Business Models
- 00:01:53 Challenges with GPU Pricing
- 00:02:48 Revenue and Scaling with GPUs
- 00:03:46 Customer Sensitivity to GPU Pricing
- 00:04:44 CoreWeave's Business Strategy
- 00:05:41 CoreWeave's Market Perception
- 00:06:40 Hyperscalers and GPU Market Dynamics
- 00:07:37 Financial Strategies for GPU Sales
- 00:08:35 Interest Rates and GPU Market Risks
- 00:09:30 Optimal GPU Contract Strategies
- 00:10:27 Risks in GPU Market Contracts
- 00:11:25 Price Sensitivity and Market Competition
- 00:12:21 Market Dynamics and GPU Contracts
- 00:13:18 Hyperscalers and GPU Market Strategies
- 00:14:15 Nvidia and Market Competition
- 00:15:12 Microsoft's Role in GPU Market
- 00:16:10 Challenges in GPU Market Dynamics
- 00:17:07 Economic Realities of the GPU Market
- 00:18:03 Real Estate Model for GPU Clouds
- 00:18:59 Price Sensitivity and Chip Design
- 00:19:55 SF Compute's Beginnings and Challenges
- 00:20:54 Navigating the GPU Market
- 00:21:54 Pivoting to a GPU Cloud Provider
- 00:22:53 Building a GPU Market
- 00:23:52 SF Compute as a GPU Marketplace
- 00:24:49 Market Liquidity and GPU Pricing
- 00:25:47 Utilization Rates in GPU Markets
- 00:26:44 Brokerage and Market Flexibility
- 00:27:42 H100 Glut and Market Cycles
- 00:28:40 Supply Chain Challenges and GPU Glut
- 00:29:35 Future Predictions for the GPU Market
- 00:30:33 Speculations on Test Time Inference
- 00:31:29 Market Demand and Test Time Inference
- 00:32:26 Open Source vs. Closed AI Demand
- 00:33:24 Future of Inference Demand
- 00:34:24 Peer-to-Peer GPU Markets
- 00:35:17 Decentralized GPU Market Skepticism
- 00:36:15 Redesigning Architectures for New Markets
- 00:37:14 Supporting Grad Students and Startups
- 00:38:11 Successful Startups Using SF Compute
- 00:39:11 VCs and GPU Infrastructure
- 00:40:09 VCs as GPU Credit Transformators
- 00:41:06 Market Timing and GPU Infrastructure
- 00:42:02 Understanding GPU Pricing Dynamics
- 00:43:01 Market Pricing and Preemptible Compute
- 00:43:55 Price Volatility and Market Optimization
- 00:44:52 Customizing Compute Contracts
- 00:45:50 Creating Flexible Compute Guarantees
- 00:46:45 Financialization of GPU Markets
- 00:47:44 Building a Spot Market for GPUs
- 00:48:40 Auditing and Standardizing Clusters
- 00:49:40 Ensuring Cluster Reliability
- 00:50:36 Active Mo...

FAQ
How many episodes does Latent Space: The AI Engineer Podcast have?
Latent Space: The AI Engineer Podcast currently has 127 episodes available.
What topics does Latent Space: The AI Engineer Podcast cover?
The podcast is about Entrepreneurship, Software, Podcasts, Technology, Business and Engineering.
What is the most popular episode on Latent Space: The AI Engineer Podcast?
The episode title 'Commoditizing the Petaflop — with George Hotz of the tiny corp' is the most popular.
What is the average episode length on Latent Space: The AI Engineer Podcast?
The average episode length on Latent Space: The AI Engineer Podcast is 76 minutes.
How often are episodes of Latent Space: The AI Engineer Podcast released?
Episodes of Latent Space: The AI Engineer Podcast are typically released every 6 days, 14 hours.
When was the first episode of Latent Space: The AI Engineer Podcast?
The first episode of Latent Space: The AI Engineer Podcast was released on Feb 23, 2023.