
Presentation: Machine Learning Through Streaming at Lyft

Track: Streaming Data Architectures

Location: Whittle, 3rd flr.

Duration: 5:25pm - 6:15pm

Day of week: Monday



This presentation is now available to view on InfoQ.com


What You’ll Learn

  1. Hear about Lyft’s self-service streaming framework, how it is used in the Machine Learning workflow, the challenges the team faced, and how they solved them.
  2. Learn about the importance of bootstrapping state, backfills, and recovery.
  3. Find out about data skew, how it can affect performance, and how to deal with it.

Abstract

Uses of Machine Learning are pervasive in today’s world, from recommendation systems to ads serving. In the world of ride sharing, we use Machine Learning to make a lot of decisions in real time. For example, supply/demand curves are used to compute an accurate ETA (estimated time of arrival) and fair pricing, and patterns of user behaviour are used to detect fraudulent activity on the platform. Insights drawn from raw data become less relevant over time, so it is important to generate features in near real time to aid decision making with the most recent view of the world.

At Lyft, all client applications as well as services generate many millions of events every second, which are ingested and streamed through a network of Kinesis and Kafka streams with latencies on the order of milliseconds. There is a need to leverage this streaming data to produce realtime features for model scoring and execution, and to do it efficiently and in a scalable manner.

In order to make it really easy for our Research Scientists and Data Scientists to write stream processing programs that produce data for ML models, we built a fully managed, self-service platform for stream processing using Flink. This platform completely abstracts away the complexities of writing and managing a distributed data processing engine, including data discovery, lineage, schematization, deployments, fault tolerance, fanout to various different sinks, bootstrapping, and backfills, among other things. Users can specify execution logic in SQL and run programs with a short, simple declarative configuration. Time to write and launch an application is under one hour.
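To make the idea concrete, the kind of logic a user would express in SQL on such a platform is a windowed aggregation over an event stream. The sketch below is illustrative only (the SQL dialect shown in the comment and all names are assumptions, not Lyft's actual API); it computes in plain Python what a tumbling-window count query would produce:

```python
from collections import defaultdict

# Illustrative only: the kind of feature a user might declare in SQL, e.g.
#   SELECT region, COUNT(*) AS ride_requests
#   FROM events
#   GROUP BY region, TUMBLE(rowtime, INTERVAL '1' MINUTE)
# Below, the same tumbling-window count over (timestamp_sec, region) events.

WINDOW_SEC = 60  # one-minute tumbling windows

def tumbling_counts(events):
    """Count events per (region, window_start_sec)."""
    counts = defaultdict(int)
    for ts, region in events:
        window_start = (ts // WINDOW_SEC) * WINDOW_SEC  # align to window
        counts[(region, window_start)] += 1
    return dict(counts)

events = [(5, "sf"), (30, "sf"), (65, "sf"), (10, "nyc")]
print(tumbling_counts(events))
# {('sf', 0): 2, ('sf', 60): 1, ('nyc', 0): 1}
```

A real deployment would run this continuously in Flink with event-time semantics and watermarks; the point is that the user only declares the aggregation, not the machinery around it.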

The audience will learn about:

  1. The challenges of building and scaling such a platform, best practices, and common pitfalls. I will go into the details of how our system evolved over the last couple of years, as well as the design tradeoffs we made.
  2. Bootstrapping state: why it is needed and how we built it.
  3. Our solution for dealing with data skew.
  4. How moving to a Kubernetes-based architecture allowed us to scale fast.
  5. Our solutions to common problems such as data recovery and backfills, maintaining data freshness in the face of downtime, providing platform-level reliability guarantees, prototyping, etc.
Question: 

What is the work you are doing today?

Answer: 

I work for a ride sharing company called Lyft. Our mission is to improve people's lives through the world's best transportation. For the last two and a half years, I have been working on the streaming platform team. In the ride sharing world, it is imperative to build the most recent state of the world and make decisions based on this information, things like determining prices based on supply and demand. We also want to fight fraud in real time, and we want to determine optimal routes, things like that. My work here is aimed at making all of that possible and easy. We want to enable streaming use cases across Lyft through Kafka, Kinesis, Apache Beam, and Apache Flink. One of the problems we are trying to solve is making streaming available for machine learning use cases. We are building a self-service platform to make it easy for our Research Scientists and Data Scientists to spin up streaming programs declaratively, with no code, deploy them, and get them running in less than one hour, with results arriving as new events come in, all in real time.

Question: 

What are the goals for your talk?

Answer: 

One of my goals is to convey the power of event-driven programming and how it ties into the whole machine learning workflow. I also want to talk about some of the challenges we faced when we were building this self-serve streaming feature generation service. Things like bootstrapping state. How do you deal with data skew? How do you build a platform and provide reliability guarantees across the board when all the use cases are so different? And also things like schema evolution, and managing high-throughput computation while maintaining low latency. I will also talk about how our platform evolved over the last two years and what others can learn from our experience.

Question: 

Can you give us a preview of the challenges and pitfalls you had?

Answer: 

One thing I have realized from talking to people at conferences and meetups is that everyone is trying to solve the problem of bootstrapping state, especially in the streaming world. For example, if your state involves counting some numbers over a period of two days, and you start your program now, do you wait those two days for it to accumulate enough data to build that state, or do you bootstrap it? If you are limited by things like stream retention periods, bootstrapping state can be challenging. That is one of the things I will talk about: how we solved that problem at Lyft. The other is the operational aspect of building a self-service streaming system. How do you scale such a big system? How do you manage deploys, with the main goal of abstracting away all those complexities from our users? I will talk about how Kubernetes came to our rescue and how we incorporated it into our deployment workflow.
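The bootstrapping idea described above can be sketched in a few lines: fold archived events into the state first, then let the live stream take over. This is a minimal illustration, not Lyft's implementation; the function and variable names are hypothetical, and `historical` stands in for a replay from long-term storage (e.g. an S3 archive) covering the window the stream's retention no longer holds:

```python
def bootstrap_then_stream(historical, live):
    """Build state from archived events, then update it from the live stream.

    `historical`: (key, value) pairs replayed from long-term storage,
    covering the lookback window the stream's retention cannot provide.
    `live`: (key, value) pairs arriving from the real-time stream.
    Yields (key, running_total) for each live event.
    """
    state = {}
    for key, value in historical:          # bootstrap phase: batch replay
        state[key] = state.get(key, 0) + value
    # State now reflects the full lookback window; switch to streaming.
    for key, value in live:                # steady state: incremental updates
        state[key] = state.get(key, 0) + value
        yield key, state[key]

historical = [("rides_sf", 3), ("rides_sf", 2)]   # replayed archive data
live = [("rides_sf", 1)]                          # new stream events
print(list(bootstrap_then_stream(historical, live)))
# [('rides_sf', 6)]
```

The hard parts in a real system, which this sketch glosses over, are stitching the batch and stream sources at a consistent offset and avoiding double-counting events present in both.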

Question: 

What are the key takeaways you would like to provide to the audience?

Answer: 

The main things I hope people take away from this talk are ways to tackle some common problems in the streaming world, things like bootstrapping state: why it is required and why it is important to solve. Also, dealing with things like data skew. These are common problems that almost everyone faces. Data skew can bring up interesting challenges, like exponentially increasing state size, among other things. Data recovery and backfills are another common challenge that almost everyone is trying to solve, as is maintaining feature freshness and correctness when outages inevitably happen. I hope people can learn from our experiences and find some valuable takeaways in our approach.
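One standard way to handle the data skew mentioned above is key salting: spread a hot key across several partitions with a random salt, aggregate the salted keys in parallel, then merge the few partial results per original key. The sketch below is a generic illustration of that technique in plain Python, not the talk's specific solution; `SALT_BUCKETS` and the function names are assumptions:

```python
import random
from collections import defaultdict

SALT_BUCKETS = 4  # number of partitions a hot key is spread across

def salted_key(key):
    # Attach a random salt so one hot key maps to several partitions.
    return (key, random.randrange(SALT_BUCKETS))

def aggregate_with_salting(events):
    """Two-stage count: partial counts per salted key, then merge per key."""
    partial = defaultdict(int)
    for key in events:
        partial[salted_key(key)] += 1   # stage 1: parallel partial aggregation
    final = defaultdict(int)
    for (key, _salt), count in partial.items():
        final[key] += count             # stage 2: cheap merge of few partials
    return dict(final)

print(aggregate_with_salting(["hot"] * 10 + ["cold"]))
# {'hot': 10, 'cold': 1}
```

In a distributed engine like Flink, stage 1 would run as a keyed pre-aggregation on the salted key, so no single task holds the entire hot key's state; the final totals are unchanged because the salt is stripped before the merge.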

Speaker: Sherin Thomas

Senior Software Engineer @Lyft

Sherin is a Senior Software Engineer at Lyft. In her career spanning 8 years, she has worked on most parts of the tech stack, but enjoys the challenges in Data and Machine Learning the most. Most recently she has been focused on building products that facilitate advances in Artificial Intelligence and Machine Learning through Streaming. She is passionate about increasing the representation of women and URMs in tech and has been involved in several Diversity and Inclusion efforts at her company. She has also been sharing her work with the tech community through meetups and conference talks and has spoken at several renowned conferences such as Flink Forward, Scale by the Bay, and Women Who Code. In her free time she loves to read, paint, and play with dogs. She is the president of the Russian Hill book club in San Francisco and also loves to organize events for her local library.

