Presentation: Why Distributed Systems Are Hard

Track: Next Generation Microservices: Building Distributed Systems the Right Way

Location: Fleming, 3rd flr.

Duration: 11:50am - 12:40pm

Day of week: Monday


What You’ll Learn

  1. Hear about the complexity of distributed systems.
  2. Learn why one needs to account for the human factor when designing a complex system.


Every company that has adopted a microservices architecture operates a complex distributed system. It's basically a full-time endeavor to keep up with the ever-changing landscape of technologies and tools to build, maintain, and scale these towering production systems, but the fundamentals of distributed computing theory have remained relatively constant over the last few decades. So, why are distributed systems notoriously difficult to wrangle?

This talk will cover a brief history of distributed computing, present a survey of key academic contributions to distributed systems theory, including the CAP theorem and the FLP impossibility result, and dig into why network partitions are inevitable today. Though operating in a distributed fashion is full of unknowns, mathematics (consensus algorithms) and engineering (designing for observability) can work together to mitigate the resulting risks. We'll also take a look at how to design systems for greater resilience by studying human factors, which can help reduce the impact of programmatic uncertainty when you're at the helm of a sprawling ecosystem of microservices.


What is the work that you are doing today?


I work as a senior software engineer at GitHub on the community and safety team. The purpose of my team is to help GitHub as a platform become a more welcoming, inclusive, and productive place where open source communities can thrive. My team is largely responsible for building things like moderation tools, and for giving advice about how the UI, for example, can be modified or redesigned to encourage positive interactions and discourage negative ones. There's a lot of design thinking that goes into it, and a bit of behavioral psychology that goes into how we make decisions. It's a team that I recently joined, and I'm very excited to be working on this mission because harassment on the Internet is a problem that I care deeply about solving.


What is the goal of your talk?


This talk is basically going to center on the idea that distributed systems today are extremely complex, that they have so many moving parts that no one person can understand 100% of what's happening. It would literally be impossible because of the number of libraries that we use, because of the number of things that are flying over networks, and because of the number of tools written by people that we don't know. I think it's just impossible to have complete ownership and complete knowledge of your stack from top to bottom.

My talk explores the history of how we got here. I don't want to say that it's a negative thing that things are complex. I think it's just reality. So the more productive approach today is to accept that our systems are always going to be complex and to start reframing how we manage that complexity. There are many historical reasons for this complexity, and a lot of papers have been written about its evolution.

John Allspaw talks a lot about things that are above the line and below the line. Above the line means things that are within the realm of human cognition, things that we can reason about mostly accurately. Below the line are things that we think are true about the system, but we need proxies to experiment, test, and see whether our beliefs hold. I think there definitely is a bit of faith involved in reasoning about systems that are this big. But it's not all randomness. It's not all chaos. There are tools, and there are ways that we can productively come up with mitigation strategies so that we humans, with our limited capacity to reason about complex things, can still make enough sense of these systems.


You also mentioned in your abstract that you can design the system for resilience by studying human factors. Can you give us a little preview of what that means?


Human factors is a term that's been tossed around more and more in the past few years, especially with the rise of Site Reliability Engineering as a discipline and as a job title. But if I were to trace the lineage of this term back, I would say that Richard Cook was one of the first people talking about this, and then John Allspaw, as I mentioned, also did a lot of work on it. Human factors in that sense means acknowledging that humans are part of the technical system. The term is not borrowed from software. It comes from emergency response and disaster preparedness: responding to natural emergencies, firefighting, responding to floods and earthquakes, hospital emergency rooms, and that sort of thing.

What all of those fields have in common is that they account for things like fatigue, and for the fact that humans are going to make mistakes. The observation that Richard Cook and John Allspaw have been making in the past few years is that in software we often think that automation is the answer. You just write a Bash script or you set up some task that you run every few hours, and then you can get rid of the humans. But that's just not true, because the opinions, biases, and assumptions of the humans who designed the system are baked into the system. So if you remove the humans, you don't remove that humanness. Zooming out and focusing on human factors means acknowledging that the humans who design, build, and run the systems are part of those systems, and that they are also a point of failure. "Point of failure" or "risk vector" sounds very pessimistic, but that might be the language that is useful for some people to think about it. Once you acknowledge that humans will make mistakes, you can have a more productive conversation. How do we redesign, for example, our alert system so that it's understandable to humans? What would it mean to optimize your middle-of-the-night pages so that if someone wakes up, the alert message has everything they need to go and find, diagnose, and solve the problem?


What do you want the people to leave the talk with?


When you are building complex systems today, design first for the humans who are operating and using the systems. The software stuff, the tools you choose to use, whatever hosting, whatever provider you choose, that's all secondary. What matters is: can a person make sense of the dashboards, the monitoring, and the alerts? Can a person reason about the health of their system? Can an engineer with a median level of experience on your team find a bug at 3:00 a.m. and understand what a reasonable next step is?

Speaker: Denise Yu

Senior Software Engineer @GitHub

Denise is a Senior Software Engineer at GitHub, currently working to help make the platform a safer and more inclusive place, as part of the Community & Safety Team. She speaks and runs workshops frequently at conferences in North America and Europe on topics ranging from scaling organizational culture, to reliability engineering, to sketchnoting. She lives in Toronto, Canada with her partner, along with their fluffy orange terror-cat named Sam.

