From Monitoring to Observability: eBPF Chaos


Collecting Observability data with eBPF aims to help Dev, Ops, and SRE teams debug and troubleshoot incidents. That data requires storage, visualization, and verification: Do the Service Level Objectives (SLOs) match, do dashboards visualize useful data correlations, do network service maps make sense, and what about security policies?

Simulating a production incident is challenging. Chaos engineering enables teams to break things in a controlled environment and verify alerts, SLOs, and data accuracy. Which data retention cycle is best, and which dashboards reduce the mean time to respond? Anomaly detection and forecasting would be great.
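As a rough illustration of what "verifying SLOs" after a chaos experiment can mean, here is a minimal Python sketch. It is not taken from the talk: the `Window` type, the function names, the 99% target, and the request counts are all hypothetical and chosen only for illustration. It checks whether the availability measured across a baseline, a fault-injection, and a recovery window still meets a target, and how much of the error budget the experiment consumed.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed during one measurement window (hypothetical type)."""
    total: int
    failed: int


def availability(windows):
    """Share of successful requests across all windows (1.0 if no traffic)."""
    total = sum(w.total for w in windows)
    failed = sum(w.failed for w in windows)
    if total == 0:
        return 1.0
    return (total - failed) / total


def meets_slo(windows, slo_target):
    """True if the measured availability is at or above the SLO target."""
    return availability(windows) >= slo_target


def error_budget_consumed(windows, slo_target):
    """Fraction of the error budget used; values above 1.0 mean it is overspent."""
    total = sum(w.total for w in windows)
    failed = sum(w.failed for w in windows)
    allowed_failures = total * (1 - slo_target)
    if allowed_failures == 0:
        return 0.0 if failed == 0 else float("inf")
    return failed / allowed_failures


# Made-up numbers: baseline, chaos injection, and recovery windows.
windows = [Window(1000, 2), Window(1000, 30), Window(1000, 3)]

if __name__ == "__main__":
    print(f"availability: {availability(windows):.4f}")
    print(f"meets 99% SLO: {meets_slo(windows, 0.99)}")
    print(f"error budget consumed: {error_budget_consumed(windows, 0.99):.2f}")
```

In practice the counts would come from whatever metrics backend stores the collected data; comparing the result against what the alerts and dashboards reported during the same experiment is one way to check data accuracy.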

This talk dives into the learning steps with eBPF and discusses traditional metrics monitoring and future Observability data collection, storage and visualization. Learn from hands-on examples with chaos experiments that attempt to break eBPF probes, data collection, and policies in unexpected ways - and bring new perspectives into cloud-native reliability.

What's the focus of your work these days?

In my previous roles, I was an operations person and OSS monitoring tool developer, with a passion for educating everyone about Git, GitLab, and CI/CD. This led me to become a Developer Evangelist, sharing what I’m learning in public and ensuring that everyone can contribute. New, complex technology that needs to be explained often leads to pitching a new talk or story, or thinking about tutorial articles or workshops. What are the benefits? What are the risks? What is the best way to try it together in a live session?

The challenge is to explain a complex topic in a way that lets you see why it matters without needing to understand all the technical details, which is similar to what brought me to eBPF. Everyone kept talking about it, and I asked myself “What is that?”. The monitoring methods at the kernel level felt familiar, but I only understood certain parts. I took on the challenge to learn and to document everything in public GitLab projects: notes, demos, code. Complex technology needs practical examples that go beyond the talk slides, so that people can learn asynchronously later.

What's the motivation for your talk at QCon London 2023?

The main motivation for this talk is that I found learning this technology really hard. I wanted to learn about eBPF when I focussed more on Observability and Chaos Engineering in 2022. There are so many resources: websites, social media accounts, newsletters, webinars, events - whatever you want to consume. It may take you hours or weeks to learn, and there is probably too much.

With eBPF, I'm seeing that everyone talks about it. There are great ideas that can help improve our way of working with cloud-native technologies. For example, it can help with debugging in production. It's also a new way to collect observability data. I'm asking the critical questions about what happens in the background. What are the benefits? What are the risks? I will also pitch new ideas, which might sound a little crazy, but I want to inspire thought leadership in technology and encourage thinking beyond the current state of observability.

How would you describe your main persona and target audience for this session?

I'm aiming for everyone who wants to get started and go deeper with eBPF and Observability. For example, a DevOps engineer or SRE who wants to learn about the tools that could help debug a problem or incident in production. It is important that we can sleep at night and don't get paged too often, and when we do get paged, we need fast problem resolution that reduces the mean time to respond or mean time to resolve (MTTR).

Developers can benefit from getting an inside look into their applications without code instrumentation: get traces, analyze the performance, and identify possibilities to improve. For anyone getting started with writing eBPF program code, the developer experience is also crucial, whether going high level or deep down, along with understanding why eBPF helps now and will help improve observability in the future.

Users familiar with chaos engineering will also learn how to test eBPF-based tools, and how to break them in a controlled way. This talk should also help team leaders understand the benefits and risks of eBPF, and allow them to plan ahead for evaluating the technology for production environments.

Is there anything specific that you'd like people to walk away with after watching your session?

One of the main objectives is for attendees to learn from my learning experience: how to get started with eBPF, focus on the tools and libraries, and define a use case for yourself. Then, keep going, because when you see the small network data packet on your screen you might get addicted, and this is what keeps you learning.

Developers should also be able to write eBPF programs using the examples shown in the talk, and come away inspired by ideas on how to improve DevSecOps workflows, make debugging easier, and benefit from Observability.


Michael Friedrich

Senior Developer Evangelist @GitLab

Michael Friedrich is a Senior Developer Evangelist at GitLab, focussing on Observability, SRE, Ops. He loves to educate everyone and regularly speaks at events and meetups. Michael co-founded the #EveryoneCanContribute cafe meetup group to learn cloud-native & DevSecOps. Michael created as an Observability learning platform, and shares technology trends and insights into day-2-ops, Chaos Engineering, eBPF, OpenTelemetry and AI/MLOps in his newsletter.



Tuesday Mar 28 / 05:25PM BST ( 50 minutes )


Fleming (3rd Fl.)


ebpf monitoring chaos engineering


From the same track

Session Java

Your Java Application Is Slow? Check Out These Open-Source Profilers

Tuesday Mar 28 / 04:10PM BST

Profilers help to analyze performance bottlenecks of your application - if you know which to use and how to work with them. There are many open-source profilers, like async-profiler or JMC. This talk will give you insights into these tools, focusing on:

Johannes Bechberger

Software Developer @SAP

Session application security

Celebrity Vulnerabilities: Effective Response to Critical Production Threats

Tuesday Mar 28 / 11:50AM BST

Log4Shell, Spring4Shell, are you tired of being told to drop everything and respond to the next critical vulnerability in an open-source package? Chances are, if you work in the engineering team of any software development organization, the answer is yes.

Alyssa Miller

Chief Information Security Officer @EpiqGlobal

Session debugging

Deconstructing an Abstraction to Reconstruct an Outage

Tuesday Mar 28 / 10:35AM BST

Abstractions are what allow us to build the complex applications that we all use day-to-day. For example, it's rare for us to care about the precise details of on-disk storage when building an application — that's why databases exist!

Chris Sinjakli

Infra Engineer @planetscaledata

Session web development

Observable Frontends

Tuesday Mar 28 / 01:40PM BST

As an industry, we’ve made big strides in working within complexity in microservices: we build in observability with OpenTelemetry standards. But what about client-side? This is the most inscrutable part of our system, because it runs on anyone’s computer.

Jessica Kerr

Principal Developer Evangelist @honeycombio


Unconference: Debugging in Production

Tuesday Mar 28 / 02:55PM BST

What is an unconference? An unconference is a participant-driven meeting. Attendees come together, bringing their challenges and relying on the experience and know-how of their peers for solutions.

Shane Hastie

Global Delivery Lead @SoftEd, Lead Editor for Culture & Methods @InfoQ