Presentation: Using Randomized Communication for Robust, Scalable Systems

Track: Modern CS in the Real World

Location: Windsor, 5th flr.

Duration: 10:35am - 11:25am

Day of week: Wednesday

What You’ll Learn

  1. Learn about randomized communication.
  2. Listen to how HashiCorp uses SWIM and other academic research in Consul.
  3. Hear about using research in a real system, including handling situations the authors couldn't have planned for.


Three key needs that any distributed system must address are discovery, fault detection, and load balancing among its components. Satisfying these needs in a robust and scalable manner is challenging, but it turns out randomized communication can help with each of them. In this talk, we will examine the evolving use of randomized communication within HashiCorp’s Consul, a popular service mesh solution. Along the way we will consider how to evaluate academic research for production use, and what to do when your real-world deployment goes beyond the researchers’ assumptions. Our experience with Consul and other HashiCorp tools is that the overhead of consuming research is worthwhile, and that practitioners can engage the research community and make a meaningful contribution to advancing the state of the art.


What's the real meat of what you're gonna be talking about?


I'm going to talk about SWIM and other academic research based on randomization that we have applied in Consul. I'll cover the concrete details of how the randomization helps with scalability and robustness, but this was also a learning process for us - it was a journey that took us a number of iterations. So I also want to show people how to engage with academic research successfully. It can have a huge impact on the quality of your product, but there are a lot of tricks to mining the research publications, understanding how the research community works and evaluating a paper. Then there's the issue of how to actually translate that into the real world of product development, because the academic work is not necessarily done at scale or with all of the constraints that we have in the real world. And of course if you're using research from the past, bear in mind this is an area that's moving very quickly: with cloud scale, public cloud and hybrid cloud, there's a lot of research that wasn't targeted at these domains but is highly applicable if you know how to translate it.


What is a randomized communication protocol?


In a randomized communication protocol you are not doing a full mesh, with everybody communicating with everybody else, nor is everybody always communicating with the same fixed subset of peers, as you might have in, say, a token ring.

The randomized component means that over the lifetime of each process, the peers it chooses to communicate with are selected pseudo-randomly. Doing this often allows you to reduce the total amount of communication, and it also reduces the probability of a correlated failure within the protocol itself. So it improves the robustness of the solution while requiring less communication, and that communication is more evenly balanced across the members of the system. A nice set of properties to have simultaneously.
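To make that concrete, here is a toy sketch (not Consul's implementation; the function names are invented for the example) contrasting the per-round message count of a full mesh with randomized peer selection, and showing that a rumor spread by random gossip still reaches every node in roughly O(log n) rounds:

```python
import random

def full_mesh_messages(n):
    """Every node messages every other node each round: O(n^2) messages."""
    return n * (n - 1)

def gossip_round(nodes, rng):
    """Each node sends to one pseudo-randomly chosen peer: O(n) messages."""
    return [(s, rng.choice([p for p in nodes if p != s])) for s in nodes]

def rounds_to_spread(n, rng):
    """Rounds until a rumor started by node 0 reaches all n nodes,
    with every informed node forwarding to one random peer per round."""
    informed, rounds = {0}, 0
    while len(informed) < n:
        informed |= {rng.randrange(n) for _ in informed}
        rounds += 1
    return rounds
```

With 100 nodes, a full mesh costs 9,900 messages per round while one gossip round costs 100, and no node sends or receives markedly more than any other - the even load balancing mentioned above.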


What is SWIM?


SWIM is a solution for group membership. It allows a group of peers to discover one another and monitor one another's health. So it can be used to deliver both service discovery and availability checks for the service instances. It was developed at Cornell University and published in 2002. This was a different era: they only had 55 computers available for experiments, but that was very respectable then. Hot applications included reliable multicast and peer-to-peer file sharing, as well as low-power sensor networks. The datacenter settings we use it in weren't explicitly on their radar, but SWIM turns out to work well there too, and it has enjoyed a healthy life in data centers.
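As a rough illustration, one SWIM protocol period works like this: pick a random member, ping it directly, and if that fails, ask k other random members to ping it on your behalf before marking it suspect. The sketch below is a simplification (the Member class and return values are invented for the example, and real SWIM uses timeouts, suspicion timers and gossip-piggybacked dissemination):

```python
import random

class Member:
    """Toy stand-in for a cluster member; a dead member never acks."""
    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive

    def ack(self):
        return self.alive

def probe(prober, members, k=3, rng=random):
    """One simplified SWIM protocol period from the prober's point of view."""
    peers = [m for m in members if m is not prober]
    target = rng.choice(peers)            # randomized target selection
    if target.ack():                      # direct ping succeeded
        return target.name, "alive"
    # Direct ping failed: ask k other random peers to ping the target
    # indirectly, guarding against a bad link between prober and target.
    helpers = rng.sample([m for m in peers if m is not target], k)
    if any(h.alive and target.ack() for h in helpers):
        return target.name, "alive"
    return target.name, "suspect"         # SWIM suspects before declaring dead
```

Because every member runs this loop with its own random target each period, the monitoring load is spread evenly and no single link failure produces a false positive.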


You picked SWIM. Why not Raft?


Well, actually we use Raft as well. SWIM and Raft are complementary technologies, and there was an evolution here. Before we created Consul we had Serf (which we still have, and which Consul is built on top of). Serf uses SWIM and offers a weak, eventually consistent view of group membership. Consul adds Raft on top of that, for a consistent view of the group. Consul also exposes the Raft-based consistent view as a key-value store. So if there are parts of your application where it is important to have a consistent view of state as different processes or nodes come up and go down, you can achieve that by writing and reading against the Consul KV store. Raft takes care of replicating that data between the servers, so that consistent view is highly available. But when you don't need it, you can get weaker consistency with performance benefits.
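For example, Consul's HTTP KV API lets you choose the consistency mode per read. Assuming a local agent on the default address `localhost:8500` (the key name here is just an illustration):

```shell
# Write a key; writes always go through the Raft leader.
curl -X PUT -d 'primary' http://localhost:8500/v1/kv/db/leader

# Default read: served by the leader, strongly consistent in the common case.
curl http://localhost:8500/v1/kv/db/leader

# ?consistent: the leader additionally confirms its leadership with a quorum,
# trading an extra round trip for the strongest guarantee.
curl 'http://localhost:8500/v1/kv/db/leader?consistent'

# ?stale: any server may answer without consulting the leader;
# fastest and most available, but possibly out of date.
curl 'http://localhost:8500/v1/kv/db/leader?stale'
```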


What do you want somebody who comes to your talk to leave with?


This is twofold: concretely, to understand how randomized communication can deliver multiple benefits in a system, including the necessary caveats and how to debug it. But also, stepping back, an appreciation of the benefits of applying academic research to challenging problems in your real-world systems, along with specific techniques and practices that can help you pick and apply that research successfully. The meta-level learning is how to engage with academic research. It's not a passive thing: we are not in the academic setting, so you need to develop a whole pipeline, from discovering and evaluating the most relevant research, through translating it into your real-world environment and debugging it. Things get interesting when you pass the limits of what the researchers could attend to, because of things like scale and the passage of time. But with some tried and tested practices, this can be a rewarding phase of the process too.

Speaker: Jon Currey

Director of Research @HashiCorp

Jon leads HashiCorp's research initiatives, with the mandate to impact their open source tools and enterprise products, while contributing back to the community with novel work and pragmatic whitepapers. Prior to HashiCorp, Jon conducted research at Microsoft Research, Samsung Research, and Nortel. He has shipped production systems at Apple, Oracle and several startups.
