Make it Work - Keep Alert Chaos in Check

01/26/25 • 41 min

Make it Work

Today we talk with Matvey Kukuy and Tal Borenstein, co-founders of Keep, a startup focused on helping companies manage and make sense of their alert systems. The discussion comes three years after Matvey's previous appearance - https://shipit.show/36 - where he talked about Grafana Labs' acquisition of his previous startup Amixr (now Grafana OnCall).

Keep tackles a significant challenge in modern tech infrastructure: managing the overwhelming volume of alerts that companies receive from their various monitoring systems. Some enterprises deal with up to 70,000 alerts daily, making it crucial to identify which ones represent actual incidents requiring attention.

We explore real-world examples of major incidents, including the CrowdStrike outage in July 2024, which caused widespread system crashes and resulted in an estimated $10 billion in worldwide damages. The incident highlighted how critical it is to quickly identify and respond to serious issues among numerous alerts. Matvey also tells us about his own "most black swan" incident.

The episode concludes with a hint that some of Keep's AI features may eventually be released as open source once they're sufficiently polished.

LINKS

EPISODE CHAPTERS

  • (00:00) - What is new after three years?
  • (02:58) - Take us through the last memorable incident
  • (07:16) - My most black swan
  • (08:50) - How would Keep have made the CrowdStrike experience different?
  • (12:38) - How do companies end up in that place?
  • (15:29) - Keep name origin
  • (17:40) - Why would someone pick Keep?
  • (23:22) - Let's think about our use case
  • (25:03) - Demo ends
  • (28:21) - Reporting capabilities?
  • (30:25) - Deploying & running Keep
  • (33:12) - 2025 for Keep
  • (38:50) - Until next time

Previous Episode

Let's build a CDN - Part 2

This is a follow-up to Let's build a CDN - Part 1

A new friend joins us. We talk about the high-level, including why Varnish and why we are doing this in the first place. We go through the plan for this session, and then just make it happen. The video in the show notes captures most of this pairing session.

If you enjoyed this podcast and the YouTube video, you can now watch the full movie in 4k on 📺 makeitwork.tv. Offline download is available.

LINKS

EPISODE CHAPTERS

  • (00:00) - Who is James
  • (01:15) - Who is Matt?
  • (02:26) - Why Varnish?
  • (06:01) - Would you still choose Varnish today?
  • (10:10) - Did you do a typo?
  • (11:04) - Why are we doing this?
  • (17:21) - Where did we stop in part 1?
  • (21:40) - What are we trying to achieve today?
  • (24:03) - Outro

Next Episode

Fast Infrastructure

Hugo Santos, founder & CEO of Namespace Labs, joins us today to share his passion for fast infrastructure. We start with childhood stories and his experiences wiring phone lines for dial-up modems, then speed test Hugo's current home internet connection: 25 gigabit FTTP.

We shift focus to Namespace and talk about how it evolved from software-defined storage to an application platform that starts Kubernetes clusters in seconds. The underlying infrastructure is fast, custom-built, and able to:

  • Spin up thousands of isolated, virtual machine-based Kubernetes clusters
  • Run millions of jobs concurrently
  • Control everything from CPU/RAM allocation to networking setup
  • Deliver exceptionally low latency at high concurrency

A significant portion of the conversation centres on a major service degradation Namespace experienced in October 2024. Hugo shares the full story, including:

  1. How a hardware delivery delay combined with network issues from a third-party provider created problems
  2. The difficult decision to rebuild the network setup rather than depend on unreliable components
  3. The emotional toll of not meeting self-imposed high standards despite working around the clock
  4. The surprising customer loyalty, with no customers leaving despite an impact on their build system

Hugo emphasizes taking full responsibility for this incident: "That's on us. We decide which companies we work with..."

The episode concludes with Hugo sharing his philosophy on excellence: "I find that it's usually some kind of unrelenting curiosity that really propels people beyond just being good to being excellent... When we approach how we build our products, it's with that same level of unrelenting curiosity and willingness to break through and change things."

🍿 This entire conversation, including all three YouTube videos, is available for members only as a 1h+ long movie at makeitwork.tv/fast-infrastructure

LINKS

EPISODE CHAPTERS

  • (00:33) - Weekend projects
  • (03:16) - Love for all things infrastructure
  • (09:58) - Hugo's 25 gigabit home internet connection
  • (13:33) - How does this love for infrastructure translate to Namespace.so?
  • (15:28) - What does it mean for a Kubernetes cluster to spin up fast?
  • (20:24) - What does a job mean in infrastructure terms?
  • (23:12) - Let's talk about your last major outage
  • (37:15) - What does Namespace.so look like in practice?
  • (39:51) - Namespace Foundation - Open-source Kubernetes app platform
  • (40:54) - Complex preview scenarios
  • (42:37) - One last thought
