#002: The secrets to building secure & scalable OTA infrastructure with Nick Sinas

12/18/24 • 57 min

In today’s Coredump Session, the team dives deep into the world of over-the-air (OTA) updates—why they matter, how they break, and what it takes to get them right. From horror stories involving IR updates in a snowstorm to best practices for deploying secure firmware across medical devices, this conversation covers the full stack of OTA: device, cloud, process, and people. It's equal parts cautionary tale and technical masterclass.

Key Takeaways:

OTA is essential for modern hardware—without it, even small bugs can require massive field operations.
Good OTA starts early, ideally at the product design and architecture phase.
Bootloaders, memory maps, and security keys must be carefully planned to avoid long-term issues.
Staged rollouts and cohorts help mitigate fleet-wide disasters.
Signing keys and root certificates should be treated like firmware—versioned, updatable, and secure.
Real-world constraints (medical, smart home, etc.) make OTA more complex—but not optional.
Testing both the update and the update mechanism itself is critical before going live.
When OTA fails, fallback plans (like dual banks or A/B slots) can be the difference between a patch and a catastrophe.
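
Two of the takeaways above, staged rollouts and A/B fallback, can be sketched in a few lines. This is a minimal illustration rather than anything shown in the episode: the function names, the `Slot` fields, and the `MAX_ATTEMPTS` limit are all hypothetical, and a real bootloader would implement the slot logic in firmware, not Python.

```python
import hashlib
from dataclasses import dataclass


def in_rollout(device_id: str, release: str, percent: int) -> bool:
    """Return True if this device is inside the first `percent` of the fleet.

    Hashing the release together with the device ID gives each device a
    stable bucket (0..99) per release, so raising `percent` only ever adds
    devices to the rollout and never removes one.
    """
    digest = hashlib.sha256(f"{release}:{device_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


@dataclass
class Slot:
    name: str
    version: int
    valid: bool        # image passed its integrity/signature check
    confirmed: bool    # image has booted and marked itself healthy
    boot_attempts: int = 0


MAX_ATTEMPTS = 3  # hypothetical limit on unconfirmed boots


def select_boot_slot(a: Slot, b: Slot) -> Slot:
    """Prefer the newest valid slot; skip one that keeps failing to confirm."""
    candidates = [
        s for s in (a, b)
        if s.valid and (s.confirmed or s.boot_attempts < MAX_ATTEMPTS)
    ]
    if not candidates:
        raise RuntimeError("no bootable slot")
    return max(candidates, key=lambda s: s.version)
```

In this sketch, a new image that never confirms itself is abandoned after `MAX_ATTEMPTS` boots and the device returns to the older confirmed slot, which is the kind of fallback the last takeaway describes.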

Chapters:

00:00 Episode Teasers & Intro

03:29 Meet the Guests + OTA Gut Reactions

05:33 Why OTA Is Non-Negotiable

03:29 The OTA Wake-Up Call: Why You Need It

09:31 Building OTA into Hardware from Day One

16:49 Cloud-Side OTA: Cohorts, Load, and Timing

21:53 OTA in Regulated Industries

30:10 When OTA Breaks Itself

34:44 Minimizing OTA Risk: The Defensive Playbook

41:18 OTA and the Matter Standard

47:17 Networking Stacks, Constraints, and Reliability

51:11 Security, Scale, and the OTA Future

Join the Interrupt Slack

Watch this episode on YouTube

Follow Memfault

Other ways to listen:

Visit our website

Previous Episode

#001: The future of Bluetooth connectivity with Blecon Founder, Simon Ford

In today’s Coredump Session, we unpack the full story of Bluetooth—from its PDA-era beginnings to its rising role in cloud-connected devices. With insights from Memfault’s Chris Coleman and François Baldassari, along with Blecon’s Simon Ford, this wide-ranging conversation explores how Bluetooth Low Energy has evolved, where it thrives (and doesn’t), and why it’s often the right tool, even if it’s not a perfect one. Expect history, hot takes, and practical guidance for building better Bluetooth-powered products.

Key Takeaways:

Bluetooth Low Energy (BLE) and Bluetooth Classic are fundamentally different—and BLE was never just a “lite” version.
BLE's strength lies in its low power consumption and quick connection setup, making it ideal for peripheral devices that sleep most of the time.
Use cases like audio, asset tracking, and cloud sync continue to shape BLE’s evolution, and new specs like LE Audio and PAwR are expanding its reach.
Bluetooth wins not because it’s perfect—but because it’s practical: globally adopted, low-cost, and well-supported.
Debugging Bluetooth at scale requires collecting connection parameters, analyzing retries, and understanding phone ecosystem quirks.
BLE Mesh adoption has been underwhelming, with real-world complexity often outweighing its theoretical benefits.
Expect to see BLE turn up in more places, including MEMS sensors and energy-harvesting devices, not just consumer gadgets.
Designers should understand trade-offs in connection intervals, latency, and power draw when choosing Bluetooth for cloud or local connectivity.
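
The trade-off in the last takeaway can be made concrete with a little arithmetic. This is a rough sketch, not a model from the episode: the function names are hypothetical, and the per-event charge and sleep current are placeholder inputs that would have to come from measurements on a real radio.

```python
def average_current_ua(sleep_ua: float, event_charge_uc: float,
                       interval_s: float) -> float:
    """Approximate average current in microamps.

    Each connection event costs a fixed charge (microcoulombs); spreading
    that charge over the connection interval (seconds) gives microamps,
    which is added to the sleep floor.
    """
    return sleep_ua + event_charge_uc / interval_s


def worst_case_latency_s(interval_s: float, peripheral_latency: int) -> float:
    """Longest gap before the peripheral must listen again.

    With peripheral latency N, the peripheral may skip up to N consecutive
    connection events, so it wakes at least every (N + 1) intervals.
    """
    return interval_s * (1 + peripheral_latency)


# Placeholder numbers, chosen only to show the direction of the trade-off:
# a longer interval lowers average current but raises worst-case latency.
for interval in (0.03, 0.5, 1.0):
    print(interval,
          average_current_ua(sleep_ua=2.0, event_charge_uc=3.0,
                             interval_s=interval),
          worst_case_latency_s(interval, peripheral_latency=0))
```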

Chapters:

00:00 Episode Teasers & Intro

01:10 Meet the Guests: Bluetooth Roots at Pebble, Fitbit, and Blecon

06:51 BLE’s Breakthrough: The iPhone 4S Moment

10:22 BLE vs Classic: Why It Took Off

14:39 Specs That Shifted Everything: Packet Length, Coded PHY & LE Audio

21:41 Is BLE Still Interoperable? And Does It Matter?

28:22 The BLE Cloud Puzzle: Gateways, Phones & Golden Gate

38:40 BLE’s Sweet Spot: Power, Latency & When It Just Works

47:12 Operating BLE Devices in the Wild: What to Track & Why

57:40 Mesh Ambitions vs Reality

Join the Interrupt Slack

Watch this episode on YouTube

Follow Memfault

Other ways to listen:

Visit our website

Next Episode

#003: Pebble's Code is Free: Three Former Pebble Engineers Discuss Why It's Important (PART 1/2)

In this episode of Coredump, three former Pebble engineers reunite to dive deep into the technical quirks, philosophies, and brilliant hacks behind Pebble OS. From crashing on purpose to building a single codebase that powered every watch, they share war stories, bugs, and what made Pebble’s firmware both rare and remarkable. If you love embedded systems, software-forward thinking, or startup grit, this one’s for you.

Key topics:

Pebble intentionally crashed devices to collect core dumps and improve reliability.
All Pebble devices ran on a single codebase, which simplified development and updates.
The open-sourcing of Pebble OS is a rare opportunity to study real, commercial firmware.
A platform mindset—supporting all devices and apps consistently—shaped major engineering decisions.
Pebble’s app sandbox isolated bad code without crashing the OS, improving developer experience.
The team built a custom NOR flash file system to overcome constraints in size and endurance.
Core dumps and analytics were essential for tracking bugs, deadlocks, and field issues.
Collaborations between hardware and firmware engineers led to better debugging tools and smoother development.

Chapters:

00:00 Episode Teasers & Intro

01:10 Meet the Team: Pebble Engineers Reunite

01:13 Meet the Hosts + Why Pebble Still Matters

03:47 Why Open-Sourcing Pebble OS Is a Big Deal

06:20 The Startup Firmware Mentality

08:44 One OS, All Devices: Pebble’s Platform Bet

12:30 App Compatibility and the QEMU Emulator

14:51 Sandboxing, Syscalls, and Crashing with Grace

20:25 Pebble File System: Built from Scratch (and Why)

23:32 From Dumb to Smart: The Iterative Codebase Ethos

26:09 Core Dumps: Crashing Is a Feature

30:45 How Firmware Shaped Hardware Decisions

33:56 Rust, Easter Eggs, and Favorite Bugs

36:09 Wear-Level Failures, Security Exploits & Font Hacks

39:42 Why We Chose WAF (and Regret Nothing?)

42:41 What We’d Do Differently Next Time

47:00 Final Q&A: Open Hardware, Protocols, and Part Two?

Join the Interrupt Slack

Watch this episode on YouTube

Follow Memfault

Other ways to listen:

Visit our website