Presentation: Spotify's Reliable Event Delivery System

Duration: 1:40pm - 2:30pm

Key Takeaways

  • Learn how Spotify re-architected its existing on-premises data streaming system into a cloud-based Google Cloud Pub/Sub environment
  • Hear some of the war stories and lessons that Spotify engineers learned through trial and error
  • Understand the approaches to reliability used at Spotify, such as measuring each event at the Pub/Sub input and output components and leveraging Kibana, PagerDuty, and Jira to respond to SLA issues immediately (a minimal sketch of this check follows the list)
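
Since the last takeaway describes a concrete reliability mechanism, here is a minimal sketch of the idea, assuming hypothetical counter names and an arbitrary alerting threshold; in the real system the counts come from the Pub/Sub input and output components and alerts flow through Kibana, PagerDuty, and Jira rather than a print statement.

```python
# Illustrative sketch of the completeness check described above: count events
# on the Pub/Sub input and output sides and flag an SLA breach when the gap
# grows too large. Names and the threshold are hypothetical, not Spotify's code.
from dataclasses import dataclass


@dataclass
class DeliveryCounters:
    published: int = 0   # events accepted by the Pub/Sub input component
    delivered: int = 0   # events persisted by the output component

    def missing_ratio(self) -> float:
        """Fraction of published events that have not (yet) been delivered."""
        if self.published == 0:
            return 0.0
        return (self.published - self.delivered) / self.published


def check_sla(counters: DeliveryCounters, max_missing_ratio: float = 0.001) -> None:
    """Alert (here: print) when too many events are still undelivered."""
    ratio = counters.missing_ratio()
    if ratio > max_missing_ratio:
        # In production this would page on-call and open a ticket instead.
        print(f"SLA breach: {ratio:.4%} of published events not yet delivered")


if __name__ == "__main__":
    counters = DeliveryCounters(published=1_000_000, delivered=998_500)
    check_sla(counters)  # -> SLA breach: 0.1500% of published events not yet delivered
```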

Abstract

Spotify's event delivery system is one of the foundational pieces of Spotify's data infrastructure. It has a key requirement to reliably deliver complete data with predictable latency and make it available to Spotify developers via a well-defined interface. The delivered data is then used to produce Discover Weekly, Fresh Finds, Spotify Party and many other Spotify features. Currently, 1M events are delivered through Spotify's event delivery system every second. To scale the system seamlessly, we designed it as a set of microservices. The system uses Google Cloud Pub/Sub to transfer vast amounts of data between Spotify's data centres. This talk covers the design and operational aspects of Spotify's reliable event delivery system.
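
To make the Pub/Sub part of that design concrete, the following is a minimal sketch of publishing a single event with the standard google-cloud-pubsub Python client; the project name, topic name, and payload are illustrative assumptions, not Spotify's actual services.

```python
# Minimal sketch of publishing one event to Google Cloud Pub/Sub with the
# official google-cloud-pubsub Python client. Project, topic, and payload are
# illustrative; Spotify's own event delivery services are not shown here.
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Hypothetical project and topic names.
topic_path = publisher.topic_path("my-gcp-project", "delivered-events")

event = {"type": "song_played", "track_id": "abc123", "timestamp_ms": 1700000000000}

# publish() returns a future; result() blocks until Pub/Sub acknowledges the
# message and returns the server-assigned message ID.
future = publisher.publish(topic_path, data=json.dumps(event).encode("utf-8"))
print("published message id:", future.result())
```

In a system at this scale, reliability comes from retrying, batching, and acknowledging at every hop rather than from a single blocking publish like this one.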

Interview

Question: 
What does your reliable event delivery system look like at Spotify?
Answer: 

Spotify's data is increasing at a rate of 60 billion events per day. The previous event delivery system, which was based on Kafka 0.7, slowly reached its limits, forcing us to find a new solution. To be able to scale the event delivery system seamlessly with Spotify's growth, we decided to base the new event delivery system on Google Cloud Pub/Sub and Google Cloud Dataflow. Spotify's new event delivery system is one of the foundational pieces of Spotify's data infrastructure. One of the key requirements was to deliver complete data with predictable latency while making the whole process approachable to Spotify developers.
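
As a rough sanity check of how that daily figure relates to the per-second rate quoted in the abstract (the arithmetic below is back-of-the-envelope, not a quoted measurement):

```python
# Back-of-the-envelope check relating 60 billion events/day to the abstract's
# figure of roughly 1M events/second.
events_per_day = 60_000_000_000
seconds_per_day = 24 * 60 * 60  # 86,400

average_rate = events_per_day / seconds_per_day
print(f"average rate: {average_rate:,.0f} events/second")  # ~694,444
# The ~1M events/second quoted in the abstract sits above this daily average,
# which is consistent with it reflecting peak rather than average load.
```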

Question: 
What’s the goal for this talk?
Answer: 

Neville Li and I spoke about the original evolution to the new architecture at a previous QCon. We've now been running on Google Cloud Pub/Sub for the last year or so. This talk is about the experience of having the system up and running and what we've learned. I will talk a bit about the system itself, but I'll spend more time on some of the really fun incidents we ran into while building and running it.

Question: 
I won’t hold you to this answer, but can you give us an example of an incident you may discuss in the talk?
Answer: 

Sure. One example would be how we abused the autoscaler. We had a huge incident that impacted lots of things, all because we made our components too greedy. We kept scaling our system until we literally ate all the quota we had. This impacted other services at Spotify for a short period of time too, since people couldn't provision new machines for them. We ended up having to go into scale-down mode really fast. That was a fun experience. I plan to go into some of the lessons we learned from incidents like this and how we learned to prevent them in the future.
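
The prevention side of this lesson can be illustrated with a small, hypothetical sketch: bound how aggressively a component may scale by capping its target instance count with both a per-service ceiling and the quota left for everyone else. This is an illustration of the "too greedy" lesson under assumed names and numbers, not Spotify's actual autoscaler.

```python
# Hypothetical illustration of the "too greedy" lesson: cap autoscaling targets
# with a per-service ceiling and the remaining shared machine quota so a single
# component cannot consume the whole project's quota. Not Spotify's actual code.
import math


def desired_instances(current_qps: float, qps_per_instance: float) -> int:
    """Naive scaling target: enough instances to absorb the current load."""
    return max(1, math.ceil(current_qps / qps_per_instance))


def capped_instances(desired: int, per_service_cap: int, quota_remaining: int) -> int:
    """Never exceed the service's own ceiling or the quota left for other teams."""
    return min(desired, per_service_cap, quota_remaining)


if __name__ == "__main__":
    desired = desired_instances(current_qps=900_000, qps_per_instance=5_000)   # 180
    allowed = capped_instances(desired, per_service_cap=120, quota_remaining=150)
    print(f"want {desired} instances, scaling to {allowed}")  # want 180, scaling to 120
```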

Speaker: Igor Maravic

Software Engineer @Spotify

As a part of the band, he has worked on developing and maintaining Spotify's gateways, migrating mobile clients from a custom TLV protocol to HTTP, designing and developing continuous delivery infrastructure, stress testing services, and more. Currently he's living and breathing event delivery.
