Presentation: Spotify's Reliable Event Delivery System

Duration: 1:40pm - 2:30pm

Key Takeaways

  • Learn how Spotify re-architected its existing on-premises data streaming system into a cloud-based Google Cloud Pub/Sub environment
  • Hear some of the war stories and lessons that Spotify engineers learned through trial and error
  • Understand the approaches to reliability used at Spotify, such as measuring each event at the Pub/Sub input and output components and leveraging Kibana, PagerDuty, and Jira to respond to SLA issues immediately (a minimal sketch of this check follows the list)
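
Since the last takeaway describes a concrete reliability mechanism, here is a minimal sketch of the idea, assuming hypothetical counter names and an arbitrary alerting threshold; in the real system the counts come from the Pub/Sub input and output components and alerts flow through Kibana, PagerDuty, and Jira rather than a print statement.

```python
# Illustrative sketch of the completeness check described above: count events
# on the Pub/Sub input and output sides and flag an SLA breach when the gap
# grows too large. Names and the threshold are hypothetical, not Spotify's code.
from dataclasses import dataclass


@dataclass
class DeliveryCounters:
    published: int = 0   # events accepted by the Pub/Sub input component
    delivered: int = 0   # events persisted by the output component

    def missing_ratio(self) -> float:
        """Fraction of published events that have not (yet) been delivered."""
        if self.published == 0:
            return 0.0
        return (self.published - self.delivered) / self.published


def check_sla(counters: DeliveryCounters, max_missing_ratio: float = 0.001) -> None:
    """Alert (here: print) when too many events are still undelivered."""
    ratio = counters.missing_ratio()
    if ratio > max_missing_ratio:
        # In production this would page on-call and open a ticket instead.
        print(f"SLA breach: {ratio:.4%} of published events not yet delivered")


if __name__ == "__main__":
    counters = DeliveryCounters(published=1_000_000, delivered=998_500)
    check_sla(counters)  # -> SLA breach: 0.1500% of published events not yet delivered
```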

Abstract

Spotify's event delivery system is one of the foundational pieces of Spotify's data infrastructure. It has a key requirement to reliably deliver complete data with predictable latency and make it available to Spotify developers via a well-defined interface. The delivered data is then used to produce Discover Weekly, Fresh Finds, Spotify Party and many other Spotify features. Currently, 1M events are delivered through Spotify's event delivery system every second. To scale the system seamlessly, we designed it as a set of microservices. The system uses Google Cloud Pub/Sub to transfer vast amounts of data between Spotify's data centres. This talk covers the design and operational aspects of Spotify's reliable event delivery system.
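
To make the Pub/Sub part of that design concrete, the following is a minimal sketch of publishing a single event with the standard google-cloud-pubsub Python client; the project name, topic name, and payload are illustrative assumptions, not Spotify's actual services.

```python
# Minimal sketch of publishing one event to Google Cloud Pub/Sub with the
# official google-cloud-pubsub Python client. Project, topic, and payload are
# illustrative; Spotify's own event delivery services are not shown here.
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Hypothetical project and topic names.
topic_path = publisher.topic_path("my-gcp-project", "delivered-events")

event = {"type": "song_played", "track_id": "abc123", "timestamp_ms": 1700000000000}

# publish() returns a future; result() blocks until Pub/Sub acknowledges the
# message and returns the server-assigned message ID.
future = publisher.publish(topic_path, data=json.dumps(event).encode("utf-8"))
print("published message id:", future.result())
```

In a system at this scale, reliability comes from retrying, batching, and acknowledging at every hop rather than from a single blocking publish like this one.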

Interview

Question: 
What does your reliable event delivery system look like at Spotify?
Answer: 

Spotify's data is increasing at a rate of 60 billion events per day. The previous event delivery system, which was based on Kafka 0.7, slowly reached its limits, forcing us to find a new solution. To be able to scale the event delivery system seamlessly with Spotify's growth, we decided to base the new event delivery system on Google Cloud Pub/Sub and Google Cloud Dataflow. Spotify's new event delivery system is one of the foundational pieces of Spotify's data infrastructure. One of the key requirements was to deliver complete data with predictable latency while making the whole process approachable to Spotify developers.
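
As a rough sanity check of how that daily figure relates to the per-second rate quoted in the abstract (the arithmetic below is back-of-the-envelope, not a quoted measurement):

```python
# Back-of-the-envelope check relating 60 billion events/day to the abstract's
# figure of roughly 1M events/second.
events_per_day = 60_000_000_000
seconds_per_day = 24 * 60 * 60  # 86,400

average_rate = events_per_day / seconds_per_day
print(f"average rate: {average_rate:,.0f} events/second")  # ~694,444
# The ~1M events/second quoted in the abstract sits above this daily average,
# which is consistent with it reflecting peak rather than average load.
```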

Question: 
What’s the goal for this talk?
Answer: 

Neville Li and I spoke about the original evolution to the new architecture at a previous QCon. We've now been running on Google Cloud Pub/Sub for the last year or so. This talk is about the experience of having the system up and running and what we've learned. I will talk a bit about the system itself, but I'll spend more time on some of the really fun incidents we ran into while building and running it.

Question: 
I won’t hold you to this answer, but can you give us an example of an incident you may discuss in the talk?
Answer: 

Sure. One example would be how we abused the autoscaler. We had a huge incident that impacted lots of things, all because we made our components too greedy. We kept scaling our system until we literally ate all the quota we had. This impacted other services at Spotify for a short period of time too, since people couldn't provision new machines for them. We ended up having to go into scale-down mode really fast. That was a fun experience. I plan to go into some of the lessons we learned from incidents like this and how we learned to prevent them in the future.
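
The prevention side of this lesson can be illustrated with a small, hypothetical sketch: bound how aggressively a component may scale by capping its target instance count with both a per-service ceiling and the quota left for everyone else. This is an illustration of the "too greedy" lesson under assumed names and numbers, not Spotify's actual autoscaler.

```python
# Hypothetical illustration of the "too greedy" lesson: cap autoscaling targets
# with a per-service ceiling and the remaining shared machine quota so a single
# component cannot consume the whole project's quota. Not Spotify's actual code.
import math


def desired_instances(current_qps: float, qps_per_instance: float) -> int:
    """Naive scaling target: enough instances to absorb the current load."""
    return max(1, math.ceil(current_qps / qps_per_instance))


def capped_instances(desired: int, per_service_cap: int, quota_remaining: int) -> int:
    """Never exceed the service's own ceiling or the quota left for other teams."""
    return min(desired, per_service_cap, quota_remaining)


if __name__ == "__main__":
    desired = desired_instances(current_qps=900_000, qps_per_instance=5_000)   # 180
    allowed = capped_instances(desired, per_service_cap=120, quota_remaining=150)
    print(f"want {desired} instances, scaling to {allowed}")  # want 180, scaling to 120
```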

Speaker: Igor Maravic

Software Engineer @Spotify

As a part of the band, he has worked on developing and maintaining Spotify's gateways, migrating mobile clients from a custom TLV protocol to HTTP, designing and developing continuous delivery infrastructure, stress testing services, and more. Currently he's living and breathing event delivery.
