
Track: Streaming Data Architectures

Location: Churchill, Ground floor

Day of week: Monday

Today’s systems generate and process data faster, and in larger volumes, than ever before. Whether it’s IoT sensor data, business or social media feeds, or internal system logs, data often arrives as a continuous stream of events needing attention. Whilst there is no one-size-fits-all solution, it’s crucial to have a solid streaming architecture in place, capable of handling everything thrown at it: one which can scale, is resilient to failures and data surges, and likewise allows intelligent, timely decision-making processes (including AI/ML models) to be embedded and executed at the most appropriate time.

Join us on this track to learn from innovators and engineers in the trenches how they are designing systems and leveraging modern data stream processing platforms to deal with such challenges. Hear war stories and gain insights into the concrete approaches and solutions applied, as well as the pitfalls to avoid. Finally, peer into the future world of the event streaming database: an emerging new category of database that focuses on ‘data that moves’.

Track Host: Nicki Watt

Chief Technology Officer @OpenCredo

Nicki Watt currently serves as Chief Technology Officer at OpenCredo, a pragmatic, hands-on software consultancy specialising in data engineering, ML and cloud native solutions. Her technical career has seen her wear many hats, from Engineer and Systems & Technical Architect to Consultant and now CTO. She is a techie at heart, involved in the development, delivery and leadership of large-scale platform and application development projects. Nicki is also co-author of the graph database book Neo4j in Action.

10:35am - 11:25am

Internet of Tomatoes: Building a Scalable Cloud Architecture

Five years ago we started on a journey of building a website monitoring tool. Little did I know that this would end up morphing into a full IoT-based agriculture platform. Discussing whether tomatoes need dark hours to sleep was not the type of question I had anticipated having to answer. But don’t underestimate how you can innovate in the agriculture world with your technology. At 30MHz we’re building a data platform for the agriculture sector. It provides full insight into the climatic conditions of horticultural and agricultural produce for all stakeholders in the sector. This includes ingesting all kinds of data sources and analysing the information interactively, enabling the continuous improvement of the production process for crops, plants, seeds, and bulbs.
 
In this talk I’ll tell the story of our platform and how we ended up helping growers in 30 countries, deploying 3.5K sensors, and processing data at 4K events per minute. I’ll share our architecture, how it grew, the challenges, and how we are continuing to transform it - for example, to learn how to grow the best tomatoes!
 
Key takeaways:  
  • Gain insight into a concrete solution for gathering, storing and accessing large amounts of real-time, time-based data (see the sketch after this list).
  • Understand some of the problems that you could encounter building such a platform.
  • Get inspiration for embarking on projects related to IoT, (big) data collection or even getting into the agriculture industry.
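To make the first takeaway concrete, here is a minimal, hypothetical sketch of the core data shape such a platform deals with: time-stamped sensor readings, appended as they arrive and queried by time range. A real platform backs this with a scalable time-series store; the in-memory list, sensor IDs and values below are purely illustrative.

```python
from datetime import datetime, timedelta

# Append-only log of readings: (timestamp, sensor_id, value).
# In production this would live in a time-series store, not a list.
readings = []

def ingest(ts, sensor_id, value):
    """Record one sensor reading as it arrives."""
    readings.append((ts, sensor_id, value))

def window(start, end):
    """Fetch all readings in a time range, e.g. for interactive analysis."""
    return [r for r in readings if start <= r[0] <= end]

now = datetime(2020, 3, 2, 12, 0)
ingest(now, "greenhouse-7/temperature", 21.4)                    # hypothetical sensor IDs
ingest(now + timedelta(minutes=1), "greenhouse-7/humidity", 63.0)
print(window(now, now + timedelta(minutes=5)))
```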

Flavia Paganelli, CTO and Founder @30Mhz

11:50am - 12:40pm

Streaming a Million Likes/Second: Real-Time Interactions on Live Video

When a broadcaster like BBC streams a live video on LinkedIn, tens of thousands of viewers will watch it concurrently. Typically, hundreds of likes on the video will be streamed in real-time to all of these viewers. That amounts to a million likes/second streamed to viewers per live video. How do we make this massive real-time interaction possible across the globe? In this talk, I’ll do a technical deep-dive into how we use the Play/Akka Framework and a scalable distributed system to enable live interactions like likes/comments at massive scale at extremely low costs across multiple data centers.

Topics I will cover include:

  • Server-side and client-side frameworks for persistent connections.
  • Managing persistent connections with millions of active clients.
  • Pub/sub architecture for real-time streaming with less than 100ms end-to-end latency to millions of connected clients (a minimal sketch of this pattern follows this list). Hint: No Kafka!
  • Leveraging the same platform for other dynamic experiences like Presence.
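To illustrate the fanout at the heart of the pub/sub pattern, here is a minimal, hypothetical sketch: each viewer holds a persistent connection (modeled as an asyncio queue), and a like published on a live video’s topic is pushed to every subscriber. This only illustrates the general idea; LinkedIn’s production system (Play/Akka, multiple data centers, connection management at scale) is far more involved, and every name below is made up.

```python
import asyncio

class LikeBroker:
    """Toy in-memory pub/sub broker: topic -> set of per-connection queues."""

    def __init__(self):
        self.subscribers = {}

    def subscribe(self, topic):
        # Each persistent client connection is modeled as a queue.
        queue = asyncio.Queue()
        self.subscribers.setdefault(topic, set()).add(queue)
        return queue

    def publish(self, topic, event):
        # Fan the event out to every live connection on this topic.
        for queue in self.subscribers.get(topic, ()):
            queue.put_nowait(event)

async def viewer(name, broker, topic, n_events):
    queue = broker.subscribe(topic)
    for _ in range(n_events):
        print(f"{name} received: {await queue.get()}")

async def main():
    broker = LikeBroker()
    topic = "live-video-42"  # hypothetical: one topic per live video
    viewers = [asyncio.create_task(viewer(f"viewer-{i}", broker, topic, 2))
               for i in range(3)]
    await asyncio.sleep(0)  # let viewers subscribe before publishing
    broker.publish(topic, "like from user-a")
    broker.publish(topic, "like from user-b")
    await asyncio.gather(*viewers)

asyncio.run(main())
```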

Akhilesh Gupta, Sr. Staff Software Engineer @LinkedIn

1:40pm - 2:30pm

Databases and Stream Processing: A Future of Consolidation

Are databases and stream processors wholly different things, or are they really two sides of the same coin? Certainly, stream processors feel very different from traditional databases when you use them. In this talk, we’ll explore why this is true, but maybe more importantly why it's likely to be less true in the future: a future where consolidation seems inevitable.  

So what advantage is there to be found in merging these two fields? To understand this, we will dig into why both stream processors and databases are necessary from a technical standpoint, and also explore the industry trends that make consolidation far more likely in the future. Finally, we’ll examine how these trends map onto common approaches, from active databases like MongoDB to streaming solutions like Flink, Kafka Streams or ksqlDB.

By the end of this talk, you should have a clear idea of how stream processors and databases relate and why there is an emerging new category of databases that focus on data that moves.
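The “two sides of the same coin” framing is often explained via stream-table duality, which a few lines of Python can sketch (a general illustration, not any talk-specific implementation): a table is the latest value per key materialized from an ordered stream of events, and a stream is the changelog of updates to that table.

```python
# Stream-table duality in miniature: replaying an ordered changelog
# stream materializes a table; updating a table emits a changelog stream.
events = [
    ("video-1", {"likes": 1}),
    ("video-2", {"likes": 5}),
    ("video-1", {"likes": 2}),  # a later event overwrites the earlier value
]

table = {}
for key, value in events:  # replaying the stream...
    table[key] = value     # ...materializes the table

print(table)  # {'video-1': {'likes': 2}, 'video-2': {'likes': 5}}
```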

Benjamin Stopford, Author of “Designing Event-Driven Systems” & Senior Director @confluentinc

2:55pm - 3:45pm

From Batch to Streaming to Both

In this talk I walk through how the streaming data platform at Skyscanner evolved over time. This platform now processes hundreds of billions of events per day, including all our application logs, metrics and business events. But streaming platforms are hard, and we did not get it right on day one. In fact, it’s still evolving as we learn more.

Our story is a case study of developing a streaming data platform in an agile fashion, and evidence that with data platforms, small decisions can have outsized effects. We went from a batch-driven system in a data center, to a streaming platform that processes events in real time, to something in between. I will explain what got us here, our current plans, and why you may want to skip some of the steps along the way.

Choosing the right mix of batch and real-time processing for your problem is critical. I hope the war story I share here will help you make the right call for your organisation. And if nothing else, it will show you that it’s never too late to correct course.
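One way to picture the “both” end state the title hints at is to keep the transformation logic identical and feed it either a bounded (batch) or an unbounded (streaming) source. The sketch below is a general illustration of that idea, not Skyscanner’s actual pipeline; all names are hypothetical.

```python
def enrich(event):
    # The business logic stays the same whichever feed the event came from.
    return {**event, "processed": True}

def run(source):
    # Works for any iterable: a bounded batch or an unbounded stream.
    for event in source:
        yield enrich(event)

batch_source = [{"id": 1}, {"id": 2}]  # e.g. replayed from object storage

def stream_source():                   # e.g. consumed from a live topic
    yield {"id": 3}
    yield {"id": 4}

print(list(run(batch_source)))         # batch run
print(list(run(stream_source())))      # streaming run (finite here for demo)
```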

Herman Schaaf, Senior Software Engineer @Skyscanner

4:10pm - 5:00pm

Streaming Data Architectures Open Space

Details to follow.

5:25pm - 6:15pm

Machine Learning Through Streaming at Lyft

Uses of machine learning are pervasive in today’s world, from recommendation systems to ad serving. In the world of ride sharing we use machine learning to make many decisions in real time. For example, supply/demand curves are used to produce an accurate ETA (estimated time of arrival) and fair pricing, and patterns of user behaviour are used to detect fraudulent activity on the platform. Insights drawn from raw data become less relevant over time, so it is important to generate features in near real time to aid decision making with the most recent view of the world.

At Lyft, client applications and services generate many millions of events every second, which are ingested and streamed through a network of Kinesis and Kafka streams with latencies on the order of milliseconds. There is a need to leverage this streaming data to produce real-time features for model scoring and execution, and to do it efficiently and in a scalable manner.

To make it really easy for our Research Scientists and Data Scientists to write stream processing programs that produce data for ML models, we built a fully managed, self-service platform for stream processing using Flink. This platform completely abstracts away the complexities of writing and managing a distributed data processing engine, including data discovery, lineage, schematization, deployment, fault tolerance, fan-out to different sinks, bootstrapping, and backfills, among other things. Users specify their execution logic in SQL and run programs with a short, simple declarative configuration; the time to write and launch an application is under an hour.
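As a flavour of what SQL-defined execution logic can look like on Flink, here is a hypothetical sketch using PyFlink’s Table API: raw events read from a Kafka topic and aggregated into a windowed, near-real-time feature. The topic, schema, connector settings and feature below are invented for illustration; they are not Lyft’s actual platform code.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: raw ride events from a (made-up) Kafka topic, with event-time
# watermarks so windows can close correctly despite late data.
t_env.execute_sql("""
    CREATE TABLE ride_events (
        region     STRING,
        event_time TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'ride-events',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json',
        'scan.startup.mode' = 'latest-offset'
    )
""")

# Sink: a real platform would fan out to a feature store; the 'print'
# connector keeps this sketch self-contained.
t_env.execute_sql("""
    CREATE TABLE demand_features (
        region       STRING,
        window_start TIMESTAMP(3),
        ride_count   BIGINT
    ) WITH ('connector' = 'print')
""")

# Feature: per-region demand over one-minute tumbling windows, the kind
# of near-real-time signal a pricing or ETA model might consume.
t_env.execute_sql("""
    INSERT INTO demand_features
    SELECT region,
           TUMBLE_START(event_time, INTERVAL '1' MINUTE),
           COUNT(*)
    FROM ride_events
    GROUP BY region, TUMBLE(event_time, INTERVAL '1' MINUTE)
""").wait()
```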

The audience will learn about:

  1. The challenges of building and scaling such a platform, best practices, and common pitfalls. I will go into the details of how our system evolved over the last couple of years, as well as the design tradeoffs we made.
  2. Bootstrapping state: why it is needed and how we built it.
  3. Our solution for dealing with data skew.
  4. How moving to a Kubernetes-based architecture allowed us to scale fast.
  5. Our solutions to common problems such as data recovery and backfills, maintaining data freshness in the face of downtime, providing platform-level reliability guarantees, and prototyping.

Sherin Thomas, Senior Software Engineer @Lyft
