Conference: March 6-8, 2017
Workshops: March 9-10, 2017
Presentation: Spotify's Reliable Event Delivery System
Location:
- Fleming, 3rd flr.
Day of week:
- Wednesday
Level:
- Intermediate
Persona:
- Architect
Key Takeaways
- Learn how Spotify moved its existing on-premises data streaming architecture to a cloud-based Google Cloud Pub/Sub environment
- Hear some of the war stories and lessons that Spotify engineers learned through trial and error
- Understand approaches to reliability used at Spotify, such as measuring each event at the Pub/Sub input and output components and then using Kibana, PagerDuty, and Jira to respond immediately to SLA issues
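The idea of measuring events at both the input and the output can be sketched roughly as follows. This is only an illustration, not Spotify's implementation: the function names, the hourly bucket width, and the use of plain Unix timestamps are all assumptions made for the example.

```python
from collections import Counter


def bucket(timestamp_s, width_s=3600):
    """Round a Unix timestamp down to the start of its bucket (hypothetical helper)."""
    return timestamp_s - timestamp_s % width_s


def incomplete_buckets(received_ts, delivered_ts, width_s=3600):
    """Return {bucket_start: missing_count} for buckets where fewer events
    were seen at the output than at the input.

    received_ts  -- timestamps of events counted at the input component
    delivered_ts -- timestamps of the same events counted at the output component
    """
    received = Counter(bucket(t, width_s) for t in received_ts)
    delivered = Counter(bucket(t, width_s) for t in delivered_ts)
    return {
        b: received[b] - delivered[b]
        for b in sorted(received)
        if delivered[b] < received[b]
    }


# One event from the first hour never reached the output.
missing = incomplete_buckets([0, 10, 3599, 3600, 7300], [0, 10, 3600, 7300])
```

A non-empty result is the kind of signal that could feed a dashboard or page an on-call engineer when a bucket stays incomplete past its deadline.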
Abstract
Spotify’s event delivery system is one of the foundational pieces of Spotify’s data infrastructure. Its key requirement is to reliably deliver complete data with predictable latency and make it available to Spotify developers via a well-defined interface. Delivered data is then used to produce Discover Weekly, Fresh Finds, Spotify Party and many other Spotify features. Currently, 1M events are delivered via Spotify's event delivery system every second. To scale the system seamlessly, we designed it as a set of microservices. The system uses Google Cloud Pub/Sub to transfer vast amounts of data between Spotify's data centres. This talk covers the design and operational aspects of Spotify’s reliable event delivery system.
Interview
Spotify’s data is increasing at a rate of 60 billion events per day. The previous event delivery system, which was based on Kafka 0.7, slowly reached its limits, forcing us to find a new solution. To be able to scale the event delivery system seamlessly with Spotify’s growth, we decided to base a new event delivery system on Google Cloud Pub/Sub and Google Cloud Dataflow. Spotify’s new event delivery system is one of the foundational pieces of Spotify’s data infrastructure. One of the key requirements was to deliver complete data with predictable latency while making the whole process approachable to Spotify developers.
Neville Li and I spoke about the original evolution to the new architecture at a previous QCon. We’ve now been running on Google Cloud Pub/Sub for the last year or so. This talk is about the experience of having the system up and running and what we’ve learned. I will talk a bit about the system, but I’ll spend more time talking about some of the really fun incidents that we ran into while building and running this system.
Sure. One example would be how we abused the autoscaler. We had a huge incident that impacted lots of things, all because we made our components too greedy. We kept scaling our system until we literally ate all the quota we had. This also impacted other services at Spotify for a short period of time, since people couldn’t order new machines for them. We ended up having to go into a scaling-down mode really fast. That was a fun experience. I plan to go into some of the lessons we learned from incidents like this and how we learned to prevent them in the future.
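One common way to keep an autoscaler from consuming a shared quota is to cap its target at a ceiling that leaves headroom for everyone else. The sketch below illustrates that general idea only. It is not the fix Spotify applied, which the interview does not describe, and every name and number in it (`desired_instances`, `reserve_percent`, the quota figures) is hypothetical.

```python
def desired_instances(current, cpu_percent, target_percent,
                      quota_limit, quota_used_by_others, reserve_percent=20):
    """Return how many instances a component should run, capped so that
    a share of the shared quota always stays free (all parameters hypothetical).

    current              -- instances running now
    cpu_percent          -- observed average CPU utilisation, 0-100
    target_percent       -- utilisation the autoscaler aims for, 1-100
    quota_limit          -- total machines the whole organisation may run
    quota_used_by_others -- machines already used by other services
    reserve_percent      -- share of the quota this component must never touch
    """
    # Proportional scaling: integer ceiling of current * cpu / target.
    wanted = -(-current * cpu_percent // target_percent)
    # Upper bound that keeps the reserve and other services' machines free.
    ceiling = quota_limit * (100 - reserve_percent) // 100 - quota_used_by_others
    # Never drop below one instance, never exceed the ceiling.
    return max(1, min(wanted, ceiling))


# Load suggests 150 instances, but the cap holds the component at 110.
capped = desired_instances(100, 90, 60, quota_limit=200, quota_used_by_others=50)
```

With a cap like this, a greedy component saturates at its own ceiling instead of exhausting the quota that other teams need to order machines.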