Presentation: CRDTs and the Quest for Distributed Consistency

Track: Modern CS in the Real World

Location: Mountbatten, 6th flr.

Duration: 2:55pm - 3:45pm

Day of week: Monday

Level: Advanced

Share this on:

What You’ll Learn

  • Learn about CRDTs, what they are and what they are useful for.

  • Hear how to build distributed applications that keep data in sync without the need of a centralized service.

  • Find out what applications CRDTs are good for, and other situations for which you may want a different solution.

Abstract

We all know how to build applications that rely on a central server. However, such centralisation is not always desirable, and recently there has been new interest in developing decentralised applications. Blockchains inevitably come up in that conversation, but when you examine them critically, not every problem is best solved by a blockchain.

In this talk we will explore how to ensure data consistency in distributed systems, especially in systems that don't have an authoritative leader. We will see how to sync data between your phone and your laptop without sending it via a remote server. We will explore algorithms that allow several people to collaborate on a shared document, communicating via a peer-to-peer network.

Conflict-Free Replicated Datatypes (CRDTs) are a set of algorithms that ensure data consistency in such settings. Recent research on CRDTs has enabled us to better understand their consistency guarantees and design richer datatypes. On the practical side, CRDTs are making their way into more and more applications. This talk will examine that research and its uses, showing where we are now and where we are heading in the future.

Question: 

What are you working on lately?

Answer: 

I've been at the University of Cambridge, doing research full time now, which is really cool, I get to go deep into interesting topics. Most of my work there has been on CRDTs, which is what I shall talk about in the talk. That is, data structures which can be modified on several different devices at the same time by several different users, and they automatically ensure a good degree of consistency. If people are editing stuff at the same time, they ensure that everyone ends up with the same document on their screen at the end. The structural integrity of the document is maintained.

Question: 

Is this an introductory talk? Intermediate talk? What's the level of expectation walking into the talk?

Answer: 

I'm not expecting people to know much about CRDTs already, but I don't want to make it just a pure introductory talk because there have been loads of those already. I'll give a quick introduction so we are all on the same level, and then move on to some recent developments. We have been working quite hard on this topic for over two years now and have some new results that haven't yet been published in papers, and that haven't yet made their way into the software engineering community because some of the theoretical papers are quite forbidding. I want to take the core ideas of what we've figured out more recently and communicate that to a broader audience.

Question: 

Can you give me an example that you might go through that a developer might gain some experience with?

Answer: 

A lot of what I've been thinking about is how would you take various different kinds of applications and model them on top of CRDTs. If you want to make a spreadsheet or a CAD application for designing industrial power plants or whatever, how would you model that kind of data in a way that several people can work on it at the same time using CRDTs? That's something that the existing implementations haven't really looked at.

You usually get some CRDTs that model sets and maps and counters, and that's about it, which is not enough if you want to build a spreadsheet. Some do text editing, but they don't really go beyond text editing. Most of our work has been on richer data structures. JSON, for example, where you can have combinations of lists and maps nested inside each other.

One that I have been working on for months, and now finally I have a solution to, is the following: if you have something like JSON, which you can represent as a tree, you can imagine wanting to move a sub-tree around. You take some object tree that is in one place and move it somewhere else. In a graphics application you get that kind of operation all the time. Even in a rich text word processor you have those kind of tree operations: for example, you can imagine taking up a paragraph and turning it into a bulleted list. What you are doing is you're creating a new bulleted list sub-tree, and then moving that paragraph to be a child of that list. These kinds of operations are really useful, but they are really difficult to make consistent. Users can move things around arbitrarily at the same time, and you can get tricky edge-cases. Imagine you have a tree with A and B as siblings, and that one user moves A to be a child of B. And another user concurrently moves B to be a child of A. Each tree is moved to be a child of the other one. If you’re not careful, you end up with a loop where two nodes become dependent on each other, and detached from the rest of the tree, so that they just disappear. Those are the kinds of problems that arise.

Question: 

Are you going to talk about methods to solve problems like what you just described? Are you going to talk about abstractions that allow you to solve such problems?

Answer: 

What I want to get across is how one might build applications on top of this stuff. These tree structures, for example, are just to show there are tricky problems that arise, and some approaches we're trying to use to figure it out. We've got a JavaScript implementation of this stuff which we've actually been using to build real apps. For example, we built a clone of Trello, the project management tool, which works in a complete peer-to-peer way without any servers. And it's actually reasonably usable, even for fairly quickly hacked-together code. That's just one concrete example. I'm working with some folks who are trying to port a graphics application over to CRDTs, just to try something completely different. We're at the point now where we'd actually like to get a wider community of developers interested in this. To be clear, this is still hacky research software, it is not production ready, but I think it's getting to the point where we can have more people playing around with it and trying to build real applications. That way, we can work out what problems need to be ironed out.

Question: 

What do you want someone to walk away with?

Answer: 

I think that's the motivation for why we want to build applications in this way in the first place. At the moment it's really simple to build centralized applications, where you just have a single server in charge of making the decisions. But for business purposes it would actually be really valuable to have applications where the data is more controlled by the end-user devices. What I'm thinking here is — let's say some software service application is developed by a startup. If a business wants to start using this application then they're totally dependent on that startup’s continued existence. And I like to remind people that there is no such thing as the cloud — it’s just using somebody else's computer. With CRDTs we can start building applications where more happens on the client side.

There may still be a cloud component, but we can make it optional, we can reduce the dependence on it. The recent CPU bug, Meltdown, showed us that the isolation between different users in multi-tenant cloud systems is maybe not that perfect. I think that the fewer secrets we can put in the cloud, the better. CRDTs work nicely with end-to-end encryption, so we can have encrypted data stored in cloud services without any plain text being visible to the cloud services. They only store encrypted data. The security angle is not the main thrust of this talk but I think it's another good additional motivation.

Question: 

Are there use cases that are not appropriate for CRDTs?

Answer: 

The best applications I can see for them are those where you have some kind of a document, which is like a unit on which several users collaborate. It might be text, or graphic or something completely different, but it's a well-defined thing. On the other hand, if you are a bank and you have a database of accounts for your customers, it makes no sense to use CRDTs because in the end, what you want is an authoritative database that contains the transactions that happened. And that is best done in a centralized way. Some people with blockchains are trying to decentralize those types of applications as well, but that's not what I'm talking about.

Speaker: Martin Kleppmann

Software Engineer, Author, & Samza and Avro Committer

Martin Kleppmann is a distributed systems researcher at the University of Cambridge, and author of the acclaimed O'Reilly book Designing Data-Intensive Applications (http://dataintensive.net/). Previously he was a software engineer and entrepreneur, co-founding and selling two startups, and working on large-scale data infrastructure at LinkedIn.

Find Martin Kleppmann at

Tracks