Conference:March 6-8, 2017
Workshops:March 9-10, 2017
Presentation: Straggler Free Data Processing in Cloud Dataflow
Location:
- Fleming, 3rd flr.
Duration
Day of week:
- Wednesday
Level:
- Intermediate
Persona:
- Developer
Key Takeaways
- Learn that it's important to optimize for acceptable and predictable worst-case behavior in complex systems
- Understand that being able to rely on a system automatically "just doing the right thing" in all cases is a qualitative change that often enables entirely new technology
- Explore how applying techniques such as auto-scaling and dynamic work balancing limits the negative impact of information
Abstract
One of the main causes of performance problems in distributed data processing systems (from the original MapReduce to modern Spark and Flink) is "stragglers." Stragglers are parts of the input that take an unexpectedly long time to process, delaying the completion of the whole job, and wasting resources that stay idle. Stragglers can happen due to imbalance of data distribution or processing complexity, hardware/networking anomalies, and a variety of other factors.
Google Cloud Dataflow is the first system to address the problem of stragglers in a fully general way. By dynamically redistributing parts of already launched work from straggler workers onto idle workers to maximize utilization, Google Cloud Dataflow is able to preserve data consistency and minimizing re-execution.
This talk describes the theory and practice behind Cloud Dataflow's approach to straggler elimination, as well as the associated non-obvious challenges, benefits, and implications of the technique.
Interview
Right now I'm working on introducing a new primitive into the Apache Beam programming model, https://s.apache.org/splittable-do-fn. It is only marginally related to the topic of the current talk, but it enables a large number of new use cases and greatly simplifies some existing ones, in particular, making it much easier to integrate Apache Beam with new data sources.
I am also staying involved in work related to auto-scaling and dynamic work rebalancing, in particular, we are planning on exploring more into applying these techniques in the face of noisy information about the progress of shards, limiting the negative impact of missing or even completely inaccurate information.
It is twofold.
First, I want to advocate for the general practical software design principles on which our solution rests - for example, the importance of looking for a perfect, "holy grail" type solution, rather than settling for solutions addressing individual common cases; the importance of optimizing the worst-case behavior; the challenges of optimizing dynamic system behavior in an unpredictable environment.
Second, I would like to emphasize how important is the problem of stragglers and how enormous is the impact of having a good solution for it. I really would like to see authors of other data processing systems realize this and address the problem in their systems as well.
It describes a high-level architecture and its underlying principles, so it's targeted at tech leads who may be able to apply similar principles to their architectures. The talk is not specific to any particular language.
To get an idea of what I plan to discuss, a good place to start is a blog post we wrote. It’s called No Shard Left Behind. The basic idea sounds really simple in principle. We repartition data as it’s processed (Dynamic Work Rebalancing). It’s really that simple. This talk dives into why it’s also challenging and worth the effort. It’s also important to note that the talk also explores other approaches that just don’t work in all use cases.
Intermediate - anyone marginally familiar with parallel or distributed processing should be able to enjoy the talk.
I hope they will walk away with an inspiration to keep looking for "holy grail"-type solutions to architectural problems they face, and an appreciation for systems that implement such solutions, such as Cloud Dataflow.
I have to admit I'm in a bit of a bubble, having been so excited about my project for the past few years that I haven't looked around very much. But from what I see, two directions look particularly promising: - The "cloudification" of everything - the move from libraries and on-premises hardware to high-level, zero-configuration cloud services (not just allocation of resources), serverless apps etc., hiding all the operational and scaling complexity from the developer. I'm hopeful that the entire concepts of "deployment" and "configuration" will become a thing of the past. - Tying into the previous point, the recent strides made in deep learning. I'm especially excited about how cloud services offering very high-quality ML will greatly lower the bar to entry into the market for smart apps integrated with the physical world: using sensor data, images, video, sound, etc.
Similar Talks






Tracks
-
Architecting for Failure
Building fault tolerate systems that are truly resilient
-
Architectures You've Always Wondered about
QCon classic track. You know the names. Hear their lessons and challenges.
-
Modern Distributed Architectures
Migrating, deploying, and realizing modern cloud architecture.
-
Fast & Furious: Ad Serving, Finance, & Performance
Learn some of the tips and technicals of high speed, low latency systems in Ad Serving and Finance
-
Java - Performance, Patterns and Predictions
Skills embracing the evolution of Java (multi-core, cloud, modularity) and reenforcing core platform fundamentals (performance, concurrency, ubiquity).
-
Performance Mythbusting
Performance myths that need busting and the tools & techniques to get there
-
Dark Code: The Legacy/Tech Debt Dilemma
How do you evolve your code and modernize your architecture when you're stuck with part legacy code and technical debt? Lessons from the trenches.
-
Modern Learning Systems
Real world use of the latest machine learning technologies in production environments
-
Practical Cryptography & Blockchains: Beyond the Hype
Looking past the hype of blockchain technologies, alternate title: Weaselfree Cryptography & Blockchain
-
Applied JavaScript - Atomic Applications and APIs
Angular, React, Electron, Node: The hottest trends and techniques in the JavaScript space
-
Containers - State Of The Art
What is the state of the art, what's next, & other interesting questions on containers.
-
Observability Done Right: Automating Insight & Software Telemetry
Tools, practices, and methods to know what your system is doing
-
Data Engineering : Where the Rubber meets the Road in Data Science
Science does not imply engineering. Engineering tools and techniques for Data Scientists
-
Modern CS in the Real World
Applied, practical, & real-world dive into industry adoption of modern CS ideas
-
Workhorse Languages, Not Called Java
Workhorse languages not called Java.
-
Security: Lessons Learned From Being Pwned
How Attackers Think. Penetration testing techniques, exploits, toolsets, and skills of software hackers
-
Engineering Culture @{{cool_company}}
Culture, Organization Structure, Modern Agile War Stories
-
Softskills: Essential Skills for Developers
Skills for the developer in the workplace