
High Volume Event Processing with John-Daniel Trask
11/16/17 • 57 min
A popular software application serves billions of user requests. These requests could be for many different things. These requests need to be routed to the correct destination, load balanced across different instances of a service, and queued for processing. Processing a request might require generating a detailed response to the user, or making a write to a database, or the creation of a new file on a file system.
As a software product grows in popularity, it will need to scale these different parts of infrastructure at different rates. You may not need to grow your database cluster at the same pace that you grow the number of load balancers at the front of your infrastructure. Your users might start making 70% of their requests to one specific part of your application, and you might need to scale up the services that power that portion of the infrastructure.
Today’s episode is a case study of a high-volume application: a monitoring platform called Raygun.
Raygun’s software runs on client applications and delivers monitoring data and crash reports back to Raygun’s servers. If I have a podcast player application on my iPhone that runs the Raygun software, and that application crashes, Raygun takes a snapshot of the system state and reports that information along with the exception, so that the developer of that podcast player application can see the full picture of what was going on in the user’s device, along with the exception that triggered the application crash.
Throughout the day, applications all around the world are crashing and sending requests to Raygun's servers. Even when crashes are not occurring, Raygun is receiving monitoring and health data from those applications. Raygun's infrastructure routes those different types of requests to different services, queues them up, and writes the data to multiple storage layers: Elasticsearch, a relational SQL database, and a custom file server built on top of S3.
John-Daniel Trask is the CEO of Raygun, and he joins the show to describe the end-to-end architecture of Raygun's request processing and storage system. We also explore specific refactoring changes that were made to save costs at the worker layer of the architecture. This is a useful memory management strategy for anyone working in a garbage-collected language. If you would like to see diagrams that explain the architecture and other technical decisions, the show notes have a video that explains what we talk about in this show. Full disclosure: Raygun is a sponsor of Software Engineering Daily.
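One common way to cut allocation costs in a garbage-collected runtime is object pooling: reusing a fixed set of buffers instead of allocating a fresh one for every incoming event. The sketch below is illustrative only (the `BufferPool` class, counts, and sizes are hypothetical, not Raygun's actual code), but it shows the basic idea behind this kind of worker-layer refactoring.

```python
class BufferPool:
    """Reuse fixed-size byte buffers across requests to reduce the
    number of short-lived allocations the garbage collector must track."""

    def __init__(self, count, size):
        self.size = size
        # Pre-allocate the pool up front, once.
        self._free = [bytearray(size) for _ in range(count)]

    def acquire(self):
        # Hand out a pooled buffer if one is free; fall back to allocating.
        return self._free.pop() if self._free else bytearray(self.size)

    def release(self, buf):
        buf[:] = bytes(self.size)  # zero the buffer before reuse
        self._free.append(buf)


pool = BufferPool(count=4, size=1024)
buf = pool.acquire()
buf[:5] = b"event"  # a worker fills the buffer with an incoming payload
pool.release(buf)   # instead of discarding it, return it to the pool
```

In a high-volume pipeline, the savings come from replacing millions of per-event allocations (and the GC pauses they trigger) with a small, stable working set.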
The post High Volume Event Processing with John-Daniel Trask appeared first on Software Engineering Daily.
Previous Episode

Fiverr Engineering with Gil Sheinfeld
As the gig economy grows, that growth necessitates innovations in the online infrastructure powering these new labor markets.
In our previous episodes about Uber, we explored the systems that balance server load and gather geospatial data. In our coverage of Lyft, we studied Envoy, the service proxy that standardizes communications and load balancing among services. In shows about Airbnb, we talked about the data engineering pipeline that powers economic calculations, user studies, and everything else that requires a MapReduce.
In today’s episode, we explore the business and engineering behind another online labor platform: Fiverr.
Fiverr is a marketplace for digital services. On Fiverr, I have purchased podcast editing, logo creation, music lyrics, videos, and sales leads. I have found people who will work for cheap, and quickly finish a job to my exact specification. I have discovered visual artists who worked with me to craft a music video for a song I wrote.
Workers on Fiverr post "gigs"–jobs that they can perform. Most of the workers on Fiverr specialize in knowledge work, like proofreading or gathering sales leads. The workers are all over the world. I have worked through Fiverr with people from Germany, the Philippines, and several African countries.
Fiverr has become the leader in digital freelancing. The staggering growth of Fiverr’s marketplace has put the company in a position similar to an early Amazon. There is room for strategic expansion, but there is also an urgency to improve the infrastructure and secure the market lead.
Gil Sheinfeld is the CTO at Fiverr, and he joins the show to explain how the teams at Fiverr are organized to fulfill the two goals of strategic, creative growth and continuous improvement to the platform.
One engineering topic we discussed at length was event sourcing. Event sourcing is a pattern for modeling each change to your application as an event. Each event is placed on a pub/sub messaging queue, and made available to the different systems within your company. Event sourcing creates a centralized place to listen to all of the changes that are occurring within your company.
For example, you might be working on a service that allows a customer to make a payment to a worker. The payment becomes an event. Several different systems might want to listen for that event. Fiverr needs to call out to a credit card processing system. Fiverr also needs to send an email to the worker, to let them know they have been paid. Fiverr ALSO needs to update internal accounting records.
Event sourcing is useful because the creator of the event is decoupled from all of the downstream consumers. As the platform engineering team works to build out event sourcing, communications between different service owners will become more efficient.
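The payment example above can be sketched as a minimal in-process event bus. This is only an illustration of the pattern (the `EventBus` class and event names are hypothetical; a production system like Fiverr's would publish to a durable pub/sub queue such as Kafka rather than an in-memory list):

```python
from collections import defaultdict


class EventBus:
    """Minimal in-process stand-in for a pub/sub messaging queue."""

    def __init__(self):
        self.subscribers = defaultdict(list)
        self.log = []  # the append-only event log: every change is recorded here

    def subscribe(self, event_type, handler):
        self.subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.log.append((event_type, payload))
        for handler in self.subscribers[event_type]:
            handler(payload)


bus = EventBus()
actions = []

# Three independent consumers, each decoupled from the payment service.
bus.subscribe("payment.completed", lambda e: actions.append(f"charge card: {e['amount']}"))
bus.subscribe("payment.completed", lambda e: actions.append(f"email worker: {e['worker']}"))
bus.subscribe("payment.completed", lambda e: actions.append(f"ledger entry: {e['amount']}"))

# The payment service emits one event; all three consumers react to it.
bus.publish("payment.completed", {"worker": "designer42", "amount": 5})
```

The producer knows nothing about its consumers: adding a fourth downstream system is one more `subscribe` call, with no change to the payment code.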
The post Fiverr Engineering with Gil Sheinfeld appeared first on Software Engineering Daily.
Next Episode

Run Less Software with Rich Archbold
There is a quote from Jeff Bezos: “70% of the work of building a business today is undifferentiated heavy lifting. Only 30% is creative work. Things will be more exciting when those numbers are inverted.”
That quote is from 2006, before Amazon Web Services had built most of their managed services. In 2006, you had no choice but to manage your own database, data warehouse, and search cluster. If your server crashed in the middle of the night, you had to wake up and fix it. And you had to deal with these engineering problems in addition to building your business.
Technology today evolves much faster than in 2006. That is partly because managed cloud services make operating a software company so much smoother. You can build faster, iterate faster, and there are fewer outages.
If you are an insurance company or a t-shirt manufacturing company or an online education platform, software engineering is undifferentiated heavy lifting. Your customers are not paying you for your expertise in databases or your ability to configure load balancers. As a business, you should be focused on what the customers are paying you for, and spending the minimal amount of time on rebuilding software that is available as a commodity cloud service.
Rich Archbold is the director of engineering at Intercom, a rapidly growing software company that allows for communication between customers and businesses. At Intercom, the engineering teams have adopted a philosophy called Run Less Software.
Running less software means reducing choices among engineering teams, and standardizing on technologies wherever possible.
When Intercom was in its early days, the systems were more heterogeneous. Different teams could choose whatever relational database they wanted–MySQL or Postgres. They could choose whatever key/value store they were most comfortable with.
The downside of all this choice was that engineers who moved between teams might not know how to use the new team's tools. After switching teams, you had to spend time onboarding with those tools, and that time was not spent on work that impacted the business.
By reducing the number of different choices that engineering teams have, and opting for managed services wherever possible, Intercom ships code at an extremely fast pace with very few outages. In our conversation, Rich contrasts his experience at Intercom with his experiences working at Amazon Web Services and Facebook.
Amazon and Facebook were built in a time where there was not a wealth of managed services to choose from, and this discussion was a reminder of how much software engineering has changed because of cloud computing.
To learn more about Intercom, you can check out the Inside Intercom podcast.
The post Run Less Software with Rich Archbold appeared first on Software Engineering Daily.