Presentation: GoshawkDB: Making time with Vector Clocks

Duration: 5:25pm - 6:15pm

Key Takeaways

  • Understand that data stores are still an exciting area of technology, and there are still ways to make them better.
  • Understand the architectural and algorithmic choices when attempting to solve challenging issues with distributed data stores. 
  • Hear about GoshawkDB’s innovative approach to modelling causality.

Abstract

It's well understood how logical clocks can be used to capture the order in which events occur. By extension, vector clocks encode when an event occurred across a distributed system. GoshawkDB reverses this idea: by analysing the dependencies between transactions, each participant calculates a vector clock that captures the constraints necessary to achieve strong serializability. The vector clocks from the different participants in a transaction can then be combined safely using the same techniques as CRDTs. This allows GoshawkDB to achieve strong serializability without imposing a total global ordering on transactions.

In this talk I will demonstrate these algorithms: which dependencies between transactions we care about, how we can capture them with Vector Clocks, how we can treat Vector Clocks as a CRDT, and how GoshawkDB uses all these ideas and more to achieve a scalable distributed data store with strong serializability, general transactions and fault tolerance.
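
To make the "Vector Clocks as a CRDT" idea concrete, here is a minimal sketch of the standard vector clock join (element-wise maximum). It is not taken from GoshawkDB: the dictionary representation and the names `merge` and `happened_before` are assumptions made for illustration. The join is commutative, associative and idempotent, which is what lets clocks from different participants be combined safely in any order.

```python
from typing import Dict

# Hypothetical representation: participant id -> counter.
# Missing entries are treated as 0.
VectorClock = Dict[str, int]


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Join two clocks by taking the element-wise maximum."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def happened_before(a: VectorClock, b: VectorClock) -> bool:
    """True if a <= b on every entry and a < b on at least one."""
    keys = a.keys() | b.keys()
    no_greater = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    some_less = any(a.get(k, 0) < b.get(k, 0) for k in keys)
    return no_greater and some_less


if __name__ == "__main__":
    x = {"a": 2, "b": 1}
    y = {"b": 3, "c": 1}
    print(merge(x, y) == merge(y, x))       # order of merging does not matter
    print(happened_before(x, merge(x, y)))  # the join dominates its inputs
    print(happened_before(x, y))            # x and y themselves are concurrent
```

Two clocks where neither happened before the other are concurrent; nothing forces an order between the events they describe.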

Interview

Question: 
QCon: What’s the motivation for your talk?
Answer: 
Matthew: Over the past 10 years there has been huge innovation in data-stores, mainly due to the increasing desire for distributed stores and the ability to continue operation in the face of failures. This work is still on-going and there is much more to achieve: data-stores are by no means a finished branch of technology.
GoshawkDB is another step forwards, offering a unique set of features, and using some new algorithms and ideas. The purpose of this talk is to explain some of these algorithms in detail: how does GoshawkDB achieve its features?
Question: 
QCon: Is this a product talk or do you focus on the problem set and only discuss GoshawkDB in context of the problem?
Answer: 
Matthew: The talk gives a quick overview of the features of GoshawkDB, and then focusses on a key algorithm within GoshawkDB which is an essential part of how GoshawkDB achieves both its performance and strong serializability.
Question: 
QCon: What are your key takeaways for this talk?
Answer: 
Matthew: Causality is very important in distributed systems and there have been papers on causality in computer systems going back more than 30 years. The mechanisms used to model causality (logical clocks, and by extension, vector clocks) are robust and well-understood, but they can be inverted, allowing causality to be determined rather than captured.
GoshawkDB’s use of Vector Clocks appears to be unique: despite much searching, I’ve been unable to find another system or paper that uses Vector Clocks in the same way. This in itself is quite exciting!
More generally, data-stores like GoshawkDB are distributed but still offer intuitive and powerful semantics such as general purpose transactions and strong serializability without substantial performance penalties. There are proofs that stores like GoshawkDB must do more work than AP stores, but that does not require a global total ordering of transactions, nor a primary/secondaries architecture.
Question: 
QCon: Can you elaborate on the tricky bits you'll discuss?
Answer: 
Matthew: GoshawkDB looks at the transactions, works out what dependencies (if any) exist between them, and then calculates the necessary vector clock. So, rather than capturing the order in which things occurred, it calculates the order in which things must occur to satisfy the dependencies between different transactions and achieve strong serializability.
So that is really the focus: a novel algorithm that uses vector clocks to model a safe ordering of events which achieves strong serializability; and that allows vector clocks to grow and shrink as necessary, avoiding the traditional pitfall of vector clocks (that they are very wide, and so have a high cost to serialize).
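
The inversion described above, calculating an order from dependencies rather than recording one, can be illustrated with a small sketch. This is not GoshawkDB's algorithm: the function names, the one-entry-per-transaction clock layout, and the example dependency graph are all assumptions chosen for illustration, and the sketch makes no attempt to shrink clocks. It derives a clock for each transaction from the transactions it must follow, so that dependent transactions are ordered while unrelated ones stay concurrent.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def assign_clocks(deps):
    """deps maps a transaction id to the set of transactions it must follow.

    Returns a sparse vector clock (dict) per transaction: the join of the
    clocks of its dependencies, plus its own entry.
    """
    clocks = {}
    # static_order() yields every transaction after all of its dependencies.
    for txn in TopologicalSorter(deps).static_order():
        clock = {}
        for dep in deps.get(txn, ()):
            for key, value in clocks[dep].items():
                clock[key] = max(clock.get(key, 0), value)
        clock[txn] = 1  # hypothetical: each transaction adds one entry
        clocks[txn] = clock
    return clocks


def precedes(a, b):
    """True if clock a is strictly dominated by clock b."""
    return a != b and all(v <= b.get(k, 0) for k, v in a.items())


if __name__ == "__main__":
    # Hypothetical dependencies: t4 must follow t2 and t3, which follow t1.
    deps = {"t2": {"t1"}, "t3": {"t1"}, "t4": {"t2", "t3"}, "t5": set()}
    clocks = assign_clocks(deps)
    print(precedes(clocks["t1"], clocks["t4"]))  # ordered via dependencies
    print(precedes(clocks["t2"], clocks["t3"]))  # no dependency, no order
    print(precedes(clocks["t3"], clocks["t2"]))
```

In this sketch `t5` has no dependencies, so its clock is concurrent with every other clock: no global total ordering is imposed on it.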
Question: 
QCon: Do you get into the implementation details? ...or stay at an architectural level?
Answer: 
Matthew: Architectural and algorithmic. Pseudocode may appear, but I doubt actual code from GoshawkDB itself will. Lots of diagrams and even animations!
Question: 
QCon: What is the main thing you want people to leave with from your talk?
Answer: 
Matthew: That data-stores are still a very exciting area of technology and there are still ways to make them better. Even if you’ve been burnt in the past by unreliable stores with weird semantics, you shouldn’t give up hope: there is a lot of interest and study going on right now on distributed data stores and they’re still getting better!
