Presentation: Observability, Event Sourcing and State Machines



5:25pm - 6:15pm

Day of week:



Key Takeaways

  • Learn how to handle terabytes of data with gigabytes of heap in a JVM.
  • Discuss architect applications to use event sourcing for reliability and persistence.
  • Compare the differences between log monitoring and event sourcing.


What is a way to have complete transparency of the state of a service? Ideally we would record everything - the inputs, outputs and timings - in order to capture highly reproducible and transparent state changes. However, is it possible to record every event or message in and out of a service without hurting performance? Join Peter for an exploration of the use cases and practicalities for having downstream services consuming all of the state changes of an upstream service in order to provide automated, real-time insight of services.


What’s the focus of the talk?

It’s a call to arms for event sourcing and a record everything model - you can do it if you have aggressive low latency requirements, so there isn’t a performance reason not to do it. If you have performance problems you’re doing it wrong.

The talk will be presented in general terms; and is applicable to a number of different problems and products, so it will be applicable to Kafka, JMS, Aeron, but when referring to implementation details the product that will be used as an example will be our company’s open-source product, Chronicle Queue.

What are you going to talk about?

Two clients have over 100TB of buffers, going back three years - and they can replay anything from that time.

It’s like data driven tests on steroids. It’s possible to replay a day’s worth of events in a specific order that came in. This allows performance tests to be investigated and debugged in case a specific ordering of events happens.

What are some of the challenges of logging in production?

Logging presents a challenge in your application: you don’t want to slow down the critical path but you want to record what is happening. However, if you log everything you can slow down your application. The key is to be able to log the right amount, so that the application doesn’t get slowed down, but enough to be useful.

So it is necessary to decide up front that you can record enough information to reproduce any problem, and that it’s being recorded by default.

Why use event sourcing as a means of driving data through the application?

Event sourcing allows the application to be viewed as a series of state machines.

With event sourcing, you can have confidence that you can reproduce any problem as it won’t run unless there is enough information and the confidence to know you can fix the bug. If you can reproduce the problem you can demonstrate whether the fix works as expected with that data, rather than making a change and hoping that it works.

The queue is one or more files. Effectively the application is in replay mode the whole time; when the application starts, it follows the file and is playing the log.

Being able to record a stream of events is separate from being able to replay those events afterwards. Applications that are put into production are often not tested for replay. In addition the application may need supporting infrastructure that isn’t necessarily easy to set up on a development machine.

By writing events to a log and having the application tail that log for listening to new events, it’s easy to set up both in production and on a development machine.

How does event sourcing relate to backpressure?

Back-pressure is worth talking about: one of the differentiators of the messaging systems is how they implement back-pressure and this can determine if it is appropriate for your problem or not. In particular as a contrast a very good solution is Reactive Streams, which is what Akka supports. It’s been added to Java 9 as the Flow API, and one of the key elements about this is that the consumer pushes back on the producer in a fairly simply yet elegant way where it says give me another 5, another 2 etc. While it’s receiving data it’s giving the producer permission to give more data while applying back pressure, instead of polling for more information but without the latency downsides of blocking. It works particularly well with GUI clients, because they are running on different machines and with different networks - you don’t have a level of control that allows you to send (or consume) messages at a fixed rate.

With reactive streams you continually say whether how much you can handle, which you can then adjust over time.

In Chronicle Queue we have a different approach, by focussing on upstream, like market data or compliance. Instead of utilising back-pressure, we have an enormous buffer. The assumption is that you can run for a week without it filling, and as a part of your weekly maintenance cycle you can rotate it and delete, compress or archive it..

What performance can you drive with this architectural pattern?

So for example on a machine with 128 GB of memory we test what happens if you give it a terabyte of data in 3 hours. The difference between the first 64 GB and last 64 GB is only about 20%, so while it slows down, it doesn’t have a dramatic impact. It takes 0.9s to write the first GB and 1.1s to write the last GB.

Some clients are taking peaks of 30 million messages per second without a push back on the market data provider. Compliance is also another compelling case; regulatory requirements are placed on existing systems that don’t have development teams any more. They may not be in a situation to add back-pressure support.

This allows the pattern to be used in a number of back-end systems, although GUI clients will still need something reactive to be able to display updates.

Speaker: Peter Lawrey

Gold Badges Java, JVM, Memory, & Performance @StackOverflow / Lead developer of the OpenHFT project

Peter Lawrey likes to inspire developers to improve the craftsmanship of their solutions, engineer their systems for simplicity and performance, and enjoy their work more by being creative and innovative. He has a popular blog “Vanilla Java” which gets 120K page views per months, is 3rd on for [Java] and 2nd for [concurrency], and is lead developer of the OpenHFT project which includes support for off heap memory, thread pinning and low latency persistence and IPC (as low as 100 nano-seconds).

Find Peter Lawrey at

Similar Talks


Conference for Professional Software Developers