Deconstructing an Abstraction to Reconstruct an Outage

Abstractions are what allow us to build the complex applications we all use day to day. For example, we rarely need to care about the precise details of on-disk storage when building an application; that's why databases exist!

Debugging is different, though. It forces us to break through those abstractions in order to understand what the computer is really doing.

In this talk, we'll explore the aftermath of a complex outage in a Postgres cluster. We'll retrace the steps we took to reliably reproduce the failure in a local environment and pull out lessons about debugging complex systems along the way. At one point, we'll dive into the depths of how Postgres represents data on disk, and realise that even unfamiliar layers of a system don't need to be scary.

What's the focus of your work these days?

I work as an infrastructure engineer at a company called PlanetScale, where we build a managed MySQL database platform. Specifically, my work focuses on building the infrastructure underneath the database that keeps it running reliably and automatically, with minimal manual intervention.

What's the motivation for your talk at QCon London 2023?

The motivation for my talk is to help people get better at debugging really complex problems. I think it's really easy to become very skilled in other areas of writing software while lagging behind on debugging. It's something I feel very passionate about as an infrastructure engineer, because a lot of my job is figuring out why things are going wrong. I actually don't think debugging is that hard; I think people often find it tricky because it's outside their comfort zone. In my talk I go through a really complex example of a failure, but show that solving it is really just about applying the same step-by-step approach.

How would you describe your main persona and target audience for this session?

Anyone who builds and runs software in production. In terms of level, I'd say probably mid-level to senior and above. I try to avoid assuming any domain knowledge in the talk, though I do assume a base level of programming knowledge.

Is there anything specific that you'd like people to walk away with after watching your session?

I'd like people to believe that they can do things they're not so familiar with. The specific example I use in the talk is a database clustering outage. It's about following a methodical debugging process, in which we had to dive all the way down into the database's binary on-disk format: a scary place that you don't normally go to.


Speaker

Chris Sinjakli

Infra Engineer @planetscaledata

Chris enjoys working on the strange parts of computing where software and systems meet. He especially likes the challenges of databases and distributed systems.

All his programs are made from organic, hand-picked, artisanal keypresses.

Date

Tuesday Mar 28 / 10:35AM BST (50 minutes)

Location

Churchill (Ground Fl.)

Topics

debugging, database, best practices

From the same track

Session Java

Your Java Application Is Slow? Check Out These Open-Source Profilers

Tuesday Mar 28 / 04:10PM BST

Profilers help to analyze performance bottlenecks of your application - if you know which to use and how to work with them. There are many open-source profilers, like async-profiler or JMC. This talk will give you insights into these tools, focusing on:

Johannes Bechberger

Software Developer @SAP

Session application security

Celebrity Vulnerabilities: Effective Response to Critical Production Threats

Tuesday Mar 28 / 11:50AM BST

Log4Shell, Spring4Shell, are you tired of being told to drop everything and respond to the next critical vulnerability in an open-source package? Chances are, if you work in the engineering team of any software development organization, the answer is yes.

Alyssa Miller

Chief Information Security Officer @EpiqGlobal

Session web development

Observable Frontends

Tuesday Mar 28 / 01:40PM BST

As an industry, we’ve made big strides in working within complexity in microservices: we build in observability with OpenTelemetry standards. But what about client-side? This is the most inscrutable part of our system, because it runs on anyone’s computer.

Jessica Kerr

Principal Developer Evangelist @honeycombio

Session

Unconference: Debugging in Production

Tuesday Mar 28 / 02:55PM BST

What is an unconference? An unconference is a participant-driven meeting. Attendees come together, bringing their challenges and relying on the experience and know-how of their peers for solutions.

Shane Hastie

Global Delivery Lead @SoftEd, Lead Editor for Culture & Methods @InfoQ

Session

No Instrumentation Observability With eBPF - Are We There Yet?

Tuesday Mar 28 / 05:25PM BST

Having gained interest over the past few years, eBPF promises zero-instrumentation observability with low performance overhead. Sounds like a dream, but are we there already?

Anna Kapuścińska

Software Engineer @Isovalent