Kaizen! The day half the internet went down

07/15/21 • 68 min

Ship It! Cloud, SRE, Platform Engineering

Kaizen means “change for the better”, or continuous improvement in this context. Failure is essential to learning, but how do we learn as a team? The simplest thing is to regularly dedicate time for taking a step back, talking about what works & what doesn’t, maybe writing some of it down, and eventually deciding what we should improve next. I intend to make every 10th Ship It! episode a Kaizen one. This is the first one, in which we talk with Adam and Jerod about the things that we want to improve in our setup over the next few months. We talk about how the June Fastly outage affected changelog.com, how we responded that day, and what we could do better. We discuss multi-cloud, multi-CDN, and the next sensible and obvious improvements for our app. Let us know via Slack or Twitter what learnings are valuable to you so that we can produce the best content for you.

Previous Episode

What is good release engineering?

This week we talk with Jean-Sébastien Pedron, RabbitMQ and FreeBSD contributor, about the importance of good release engineering for core infrastructure. Both Jean-Sébastien and I have been part of the Core RabbitMQ team for many years now. We have built some of the biggest CI/CD pipelines (check the show notes for one example), and written and shipped some great code together, while breaking and fixing many things in the process. We have been wrestling with today’s topic since 2016. Jean-Sébastien has some great FreeBSD stories to share, as well as an interesting perspective on shipping graphics card drivers. Oh, and by the way, it’s probably our fault that your remote car key stopped working that afternoon. It will all make sense after you listen to this episode.

Next Episode

Honeycomb's secret to high-performing teams

Gerhard talks with Charity Majors, ops engineer and accidental startup founder at honeycomb.io about high-performing teams, why “15 minutes or bust,” and how we should start using Honeycomb in our own monolithic Phoenix app that runs changelog.com. There is just one step, and it’s actually really simple!

They also talk about how Honeycomb uses Honeycomb to learn about Honeycomb, which is one of Gerhard’s favorite questions. As for key take-aways, deploying straight into production is really important, but not as important as optimising for humans, who are not replaceable cogs but people who learn and share their learnings continuously. That is the secret to making things easy and happy for everyone.


Changelog++ members save 5 minutes on this episode because they made the ads disappear. Join today!

Sponsors:

Fly – Deploy your apps and databases close to your users. In minutes you can run your Ruby, Go, Node, Deno, Python, or Elixir app (and databases!) all over the world. No ops required. Learn more at fly.io/changelog and check out the speedrun in their docs.
Armory – Empower your development teams to deploy code with increased safety, resilience, velocity, and compliance – to any production target on-prem or in the cloud using Armory’s enterprise-grade distribution of Spinnaker. Learn more at armory.io/shipit.
LaunchDarkly – Ship fast. Rest easy. Deploy code at any time, even if a feature isn’t ready to be released to your users. Wrap code in feature flags to get the safety to test new features and infrastructure in prod without impacting the wrong end users.
Fastly – Our bandwidth partner. Fastly powers fast, secure, and scalable digital experiences. Move beyond your content delivery network to their powerful edge cloud platform. Learn more at fastly.com.