Presentation: Why Distributed Systems Are Hard

Track: Next Generation Microservices: Building Distributed Systems the Right Way

Location: Fleming, 3rd flr. & Simulcast in Abbey, 4th flr.

Duration: 11:50am - 12:40pm

Day of week: Monday

What You’ll Learn

  1. Hear about the complexity of distributed systems.
  2. Learn why one needs to account for the human factor when designing a complex system.

Abstract

Every company that has adopted microservices architecture operates a complex distributed system. It's basically a full-time endeavor to keep up with the ever-changing landscape of technologies and tools to build, maintain, and scale these towering production systems, but the fundamentals of distributed computing theory have remained relatively constant in the last few decades. So, why are distributed systems known for being notoriously difficult to wrangle?

This talk will cover a brief history of distributed computing, present a survey of key academic contributions to distributed systems theory including the CAP theorem and the FLP impossibility result, and dig into why network partitions are inevitable today. Though operating in a distributed fashion is full of unknowns, mathematics (consensus algorithms) and engineering (designing for observability) can work together to mitigate these risks. We'll also take a look at how to design systems for greater resilience by studying human factors, which can help reduce the impact of programmatic uncertainty when you're at the helm of a sprawling ecosystem of microservices.
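
To make the point about inevitable partitions concrete, here is a minimal sketch (not from the talk; the host and port are placeholders) of why a caller can never tell a crashed peer apart from a slow peer or a lossy network: all it ever observes locally is a timeout.

```python
import socket

def check_health(host: str, port: int, timeout: float = 1.0) -> str:
    """Probe a remote service and report only what we can actually observe."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "reachable"
    except socket.timeout:
        # A timeout is ambiguous: the node may be down, merely overloaded,
        # or the network may have dropped our packets. We cannot tell which.
        return "unknown (timed out)"
    except OSError:
        # Connection refused / host unreachable -- still only a local view.
        return "unreachable from here"

if __name__ == "__main__":
    # Placeholder endpoint for illustration only.
    print(check_health("example.com", 443))
```

This ambiguity is exactly what consensus algorithms are designed to work around, and what observability tooling tries to make visible to the humans operating the system.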

Question: 

What is the work that you are doing today?

Answer: 

I work as a senior software engineer at GitHub on the Community and Safety team. The purpose of my team is to help GitHub as a platform become a more welcoming, inclusive, and productive place for open source communities to thrive. My team is largely responsible for building things like moderation tools, and for giving advice about how UI, for example, can be modified or redesigned to encourage positive interactions and discourage negative ones. There's a lot of design thinking that goes into it, and a bit of behavioral psychology goes into how we make decisions. On the whole, it's a team that I recently joined, and I'm very excited to be working on this mission because harassment on the Internet is a problem I care deeply about solving.

Question: 

What is the goal of your talk?

Answer: 

This talk is basically going to center on the idea that distributed systems today are extremely complex, that they have so many moving parts that no one person can understand 100% of what's happening. It would just literally be impossible because of the number of libraries that we use, because of the number of things that are flying over networks, because of the number of tools written by people that we don't know. I think it's just impossible to have complete ownership and complete knowledge from top to bottom of your stack. My talk explores the history of how we got here. I don't want to say that it's a negative thing that things are complex; I think it's just reality. So the more productive question to ask today is: given that our systems are always going to be complex, how do we accept that reality and start reframing our approaches to managing that complexity?

There are many historical reasons for this complexity, and a lot of papers written about its evolution. John Allspaw talks a lot about things that are above the line and below the line. Above the line are things that are within the realm of human cognition, things that we can reason about mostly accurately. Below the line are things that we believe are true, but we need proxies to experiment and test whether those beliefs about the system hold or not. I think there is definitely a bit of faith involved in reasoning about systems that are this big. But it's not all randomness, it's not all chaos. There are tools and ways that we can productively frame the problem and come up with mitigation strategies, so that we humans, with our limited capacity to reason about complex things, can still make enough sense of these systems.

Question: 

You also mentioned in your abstract that you can design systems for resilience by studying human factors. Can you give us a little preview of what that means?

Answer: 

Human factors is a term that's been tossed around more and more in the past few years, especially with the rise of Site Reliability Engineering as a discipline and as a job title. But if I were to trace the lineage of this term back, I would say that Richard Cook was one of the first people talking about this, and then John Allspaw, as I mentioned, also did a lot of work on it. Human factors in that sense means acknowledging that humans are part of the technical system. The term isn't borrowed from software; it comes from emergency response and disaster preparedness: responding to natural disasters, firefighting, responding to floods and earthquakes, hospital emergency rooms, that sort of thing.

What all of those fields have in common is that they account for things like fatigue, and for the fact that humans are going to make mistakes. It's the observation that Richard Cook and John Allspaw have been making in the past few years: in software we often think that automation is the answer. You just write a Bash script, or you set up some task that runs every few hours, and then you can get rid of the humans. But that's just not true, because the opinions, biases, and assumptions of the humans who designed the system are baked into the system. If you remove the humans, you don't remove that humanness. Instead, by zooming out and focusing on human factors, you acknowledge that the humans who build, design, and run the systems are part of those systems, and that they are also a point of failure. Calling a person a point of failure, or a risk vector, sounds very pessimistic, but that might be the language that is useful for some people to think about it. Once you acknowledge that humans will make mistakes, you can have a more productive conversation, like: how do we redesign, for example, our alerting system so that it's understandable to humans? What would it mean to optimize your middle-of-the-night pages so that, if someone wakes up, the alert message has everything they need to find, diagnose, and solve the problem?
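
As a rough illustration of that last point (a hypothetical sketch, not something from the talk; every field name and URL below is made up), a page designed for a groggy human at 3 a.m. might carry its own context instead of just a metric name:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    """A page written so the on-call human can act on it immediately."""
    summary: str         # what is broken, in plain language
    impact: str          # who or what is affected right now
    dashboard_url: str   # where to look first
    runbook_url: str     # the documented next steps
    recent_changes: str  # deploys or config changes in the last few hours

page = Alert(
    summary="Checkout API error rate above 5% for 10 minutes",
    impact="Roughly 1 in 20 checkout attempts are failing",
    dashboard_url="https://grafana.example.com/d/checkout",
    runbook_url="https://wiki.example.com/runbooks/checkout-errors",
    recent_changes="payments-service v2.3.1 deployed 40 minutes ago",
)
```

The design choice here is simply that the alert, not the half-awake responder, is responsible for assembling the context needed to diagnose the problem.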

Question: 

What do you want people to leave the talk with?

Answer: 

When you are building complex systems today, design first for the humans who are operating the systems and using the systems. The software stuff, the tools you choose, whatever hosting provider you pick, that's all secondary. What matters is: can a person make sense of the dashboards and monitoring alerts? Can a person reason about the health of their system? Can an engineer of median experience on your team find a bug at 3:00 a.m. and understand what a reasonable next step is?

Speaker: Denise Yu

Senior Software Engineer @GitHub

Denise is a Senior Software Engineer at GitHub, currently working to help make the platform a safer and more inclusive place, as part of the Community & Safety Team. She speaks and runs workshops frequently at conferences in North America and Europe on topics ranging from scaling organizational culture, to reliability engineering, to sketchnoting. She lives in Toronto, Canada with her partner, along with their fluffy orange terror-cat named Sam.
