Presentation: Straggler Free Data Processing in Cloud Dataflow



4:10pm - 5:00pm

Day of week:



Key Takeaways

  • Learn that it's important to optimize for acceptable and predictable worst-case behavior in complex systems
  • Understand that being able to rely on a system automatically "just doing the right thing" in all cases is a qualitative change that often enables entirely new technology
  • Explore how applying techniques such as auto-scaling and dynamic work balancing limits the negative impact of information


One of the main causes of performance problems in distributed data processing systems (from the original MapReduce to modern Spark and Flink) is "stragglers." Stragglers are parts of the input that take an unexpectedly long time to process, delaying the completion of the whole job, and wasting resources that stay idle. Stragglers can happen due to imbalance of data distribution or processing complexity, hardware/networking anomalies, and a variety of other factors.

Google Cloud Dataflow is the first system to address the problem of stragglers in a fully general way. By dynamically redistributing parts of already launched work from straggler workers onto idle workers to maximize utilization, Google Cloud Dataflow is able to preserve data consistency and minimizing re-execution.

This talk describes the theory and practice behind Cloud Dataflow's approach to straggler elimination, as well as the associated non-obvious challenges, benefits, and implications of the technique.


What is the focus of your work today?

Right now I'm working on introducing a new primitive into the Apache Beam programming model, It is only marginally related to the topic of the current talk, but it enables a large number of new use cases and greatly simplifies some existing ones, in particular, making it much easier to integrate Apache Beam with new data sources.

I am also staying involved in work related to auto-scaling and dynamic work rebalancing, in particular, we are planning on exploring more into applying these techniques in the face of noisy information about the progress of shards, limiting the negative impact of missing or even completely inaccurate information.

What’s the motivation for your talk?

It is twofold.

First, I want to advocate for the general practical software design principles on which our solution rests - for example, the importance of looking for a perfect, "holy grail" type solution, rather than settling for solutions addressing individual common cases; the importance of optimizing the worst-case behavior; the challenges of optimizing dynamic system behavior in an unpredictable environment.

Second, I would like to emphasize how important is the problem of stragglers and how enormous is the impact of having a good solution for it. I really would like to see authors of other data processing systems realize this and address the problem in their systems as well.

How you you describe the persona of the target audience of this talk?

It describes a high-level architecture and its underlying principles, so it's targeted at tech leads who may be able to apply similar principles to their architectures. The talk is not specific to any particular language.

Can you give me a glimpse of what you’ll dive into with your talk?

To get an idea of what I plan to discuss, a good place to start is a blog post we wrote. It’s called No Shard Left Behind. The basic idea sounds really simple in principle. We repartition data as it’s processed (Dynamic Work Rebalancing). It’s really that simple. This talk dives into why it’s also challenging and worth the effort. It’s also important to note that the talk also explores other approaches that just don’t work in all use cases.

How would you rate the level of this talk?

Intermediate - anyone marginally familiar with parallel or distributed processing should be able to enjoy the talk.

QCon targets advanced architects and sr development leads, what do you feel will be the actionable that type of persona will walk away from your talk with?

I hope they will walk away with an inspiration to keep looking for "holy grail"-type solutions to architectural problems they face, and an appreciation for systems that implement such solutions, such as Cloud Dataflow.

What do you feel is the most disruptive tech in IT right now?

I have to admit I'm in a bit of a bubble, having been so excited about my project for the past few years that I haven't looked around very much. But from what I see, two directions look particularly promising: - The "cloudification" of everything - the move from libraries and on-premises hardware to high-level, zero-configuration cloud services (not just allocation of resources), serverless apps etc., hiding all the operational and scaling complexity from the developer. I'm hopeful that the entire concepts of "deployment" and "configuration" will become a thing of the past. - Tying into the previous point, the recent strides made in deep learning. I'm especially excited about how cloud services offering very high-quality ML will greatly lower the bar to entry into the market for smart apps integrated with the physical world: using sensor data, images, video, sound, etc.

Speaker: Eugene Kirpichov

Cloud Dataflow Sr SE @Google

Eugene is a Senior Software Engineer on the Cloud Dataflow team at Google, working primarily on the autoscaling and work rebalancing secret sauce as well as the Apache Beam programming model. He is also very interested in functional programming languages, data visualization (especially visualization of behavior of parallel/distributed systems), and machine learning.

Find Eugene Kirpichov at

Similar Talks

CTO who understands the science around helping people do their best
Senior Software Engineer @IBM, Committer on Apache Aries
Distributed Systems Engineer Working on Cache @Twitter
Gold Badges Java, JVM, Memory, & Performance @StackOverflow / Lead developer of the OpenHFT project
Research Lead, Software Correctness @Galois


Conference for Professional Software Developers