You are viewing content from a past/completed QCon

Presentation: From Batch to Streaming to Both

Track: Streaming Data Architectures

Location: Churchill, G flr.

Duration: 2:55pm - 3:45pm

Day of week: Monday

What You’ll Learn

  1. Hear about Skyscanner’s journey to implement their data platform to stream and store millions of events per second.
  2. Learn why you might need a streaming data platform and get advice on crafting a plan to implement one.

Abstract

In this talk I walk through how the streaming data platform at Skyscanner evolved over time. This platform now processes hundreds of billions of events per day, including all our application logs, metrics and business events. But streaming platforms are hard, and we did not get it right on day one. In fact, it's still evolving as we learn more.

Our story is a case study in developing a streaming data platform in an agile fashion, and evidence that with data platforms, small decisions can have outsized effects. We went from a batch-driven system in a data center, to a streaming platform that processes events in real time, to something in between. I will explain what got us here, our current plans and why you may want to skip some of the steps along the way.

Choosing the right mix of batch and real-time for your problem is critical. I hope the war story I share here will help you make the right call for your organisation. And if nothing else, it will show you that it's never too late to correct course.

Question: 

What is the work you're doing today?

Answer: 

I am a Principal Software Engineer at Skyscanner working on the data platform. This is the central data platform that powers all of Skyscanner's events, metrics and logs. My primary role there is making sure that the 2 million or so events we receive every second arrive safely and securely in long-term storage, which is in S3, and that they are auditable and reliable. We also need to capture metadata about these events and be able to trace their lineage. That's what I'm working on, and that's what my talk is about as well.

Question: 

What are the goals you have for the talk?

Answer: 

My main goal is to share the story of how we used the Agile method to build a data platform, and how there are some fundamental tensions between delivering in an Agile fashion and the long-term planning that a data platform needs to succeed. I want to share how that happened, how we got to where we are and what we're doing about it now, and hopefully pass on a number of lessons we've learned along the way to help my audience avoid the same mistakes. Hopefully that means skipping some steps and going straight to a solution (I wouldn't say the final solution) that was learned after a couple of hard years of iterating on the problem.

Question: 

Can you tell me a bit about Skyscanner's streaming stack?

Answer: 

The main component would be Kafka. We have a proxy in front of it that all services write to, and that's deployed in a highly available, multi-region fashion. Then we have a number of things reading from Kafka. Just to throw some names out there: Elasticsearch, Logstash and so on. We use OpenTSDB for our metrics. We're also using a number of AWS components: Firehose, Kinesis, Kinesis Analytics and then also Flink, which is the component that we're using for transporting things to the archive.
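As a rough illustration of the "proxy in front of Kafka" layout described in this answer, here is a minimal Python sketch. It is not Skyscanner's implementation: `InMemoryLog` is a stand-in for a Kafka cluster, and `EventProxy`, `publish`, the envelope fields and the topic and service names are all hypothetical, chosen only to show services writing through one entry point while readers consume from the log.

```python
from collections import defaultdict
from dataclasses import dataclass, field
import json
import time


@dataclass
class InMemoryLog:
    """Stand-in for a Kafka cluster: one append-only list per topic."""
    topics: dict = field(default_factory=lambda: defaultdict(list))

    def append(self, topic: str, record: bytes) -> int:
        self.topics[topic].append(record)
        return len(self.topics[topic]) - 1  # offset of the new record

    def read(self, topic: str, from_offset: int = 0) -> list:
        return self.topics[topic][from_offset:]


class EventProxy:
    """Single entry point that every service writes to (hypothetical API)."""

    def __init__(self, log: InMemoryLog):
        self.log = log

    def publish(self, service: str, topic: str, payload: dict) -> int:
        # Rejecting anonymous writes here is an assumption of this sketch.
        if not service:
            raise ValueError("every event must name its producing service")
        envelope = {
            "service": service,
            "received_at": time.time(),
            "payload": payload,
        }
        return self.log.append(topic, json.dumps(envelope).encode("utf-8"))


log = InMemoryLog()
proxy = EventProxy(log)
proxy.publish("search-api", "business-events", {"type": "search"})
proxy.publish("search-api", "business-events", {"type": "click"})

# A downstream reader (an indexer or archiver, say) consumes from an offset.
records = [json.loads(r) for r in log.read("business-events", from_offset=0)]
```

Routing every write through one component gives the platform a single place to attach producer identity to each event, which is one possible way to get the visibility discussed later in the interview.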

Question: 

I don't want to give away too much about the talk, but what's the motivation for that technical shift?

Answer: 

The main motivation is that we don't have much visibility into what goes on in Kafka, and we want to be able to trace lineage, for example, to understand how data flows and who owns it, and to do data governance. We found that quite difficult in a full streaming pipeline that's open to every team in the company. I think it's possible to do this in streaming with Kafka, but we didn't think about it from day one. So now we're changing tack and trying a different approach. That's why we're making this transition, this time fully cognizant of the problems that lie down the road if you don't think about user access, lineage and metadata upfront.

Question: 

What do you want people to leave the talk with?

Answer: 

The main thing is the realization that there are many twists and turns along the way to building a data platform, especially a streaming one. There are some fundamental problems that are really inherent to streaming platforms, and I will share our experience of them. The takeaways I would like to give are about the design decisions that need to go into delivering data and making it as useful as possible to data scientists, machine learning practitioners and analysts. And there are things you should really be thinking about upfront: if you are not doing proper metadata tracking right now, if you are not tracking lineage, or you don't know the intended usage of data, then you need to take ownership of that as a data platform owner, both to help yourself and your users. If there's one thing I want everyone to leave the talk with, it's the recognition that you need to go and think about these things right now and start putting a plan into place to get this visibility, taking some inspiration from how we did it at Skyscanner.
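To make the three things named in this answer concrete (metadata, lineage and intended usage), here is a small Python sketch of a dataset catalog. It is an illustration under assumptions, not the approach taken at Skyscanner: `DatasetMetadata`, `Catalog`, and the dataset and team names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetMetadata:
    name: str
    owner: str            # team accountable for the data
    intended_usage: str   # why the data exists
    upstream: tuple = ()  # names of datasets this one is derived from


class Catalog:
    """Hypothetical registry that refuses datasets with unknown parents."""

    def __init__(self):
        self._entries = {}

    def register(self, meta: DatasetMetadata) -> None:
        missing = [u for u in meta.upstream if u not in self._entries]
        if missing:
            raise KeyError(f"unregistered upstream datasets: {missing}")
        self._entries[meta.name] = meta

    def owner_of(self, name: str) -> str:
        return self._entries[name].owner

    def lineage(self, name: str) -> list:
        """Return every ancestor of `name`, nearest first, without duplicates."""
        seen, queue = [], list(self._entries[name].upstream)
        while queue:
            current = queue.pop(0)
            if current not in seen:
                seen.append(current)
                queue.extend(self._entries[current].upstream)
        return seen


catalog = Catalog()
catalog.register(DatasetMetadata("raw_clicks", "web-team", "click tracking"))
catalog.register(
    DatasetMetadata("sessions", "data-team", "sessionised clicks", ("raw_clicks",))
)
catalog.register(
    DatasetMetadata("daily_report", "analytics", "daily KPIs", ("sessions",))
)

ancestors = catalog.lineage("daily_report")  # ["sessions", "raw_clicks"]
```

Requiring upstream datasets to be registered first is a design choice of this sketch: it forces ownership and intended usage to be declared before data can be derived from a source, rather than reconstructed afterwards.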

Speaker: Herman Schaaf

Senior Software Engineer @Skyscanner

Herman Schaaf is a senior software engineer at Skyscanner, where he works primarily on building the central data platform. Before this he worked on applications in machine learning and machine translation, including an offline mobile application that can recognize and translate Chinese to English. In his free time he loves reading and traveling, but even then is known to scribble down new ideas about software, data structures, algorithms and distributed systems.
