RECORDING: From Monitoring to Observability: eBPF Chaos

Collecting Observability data with eBPF aims to help Dev, Ops, and SRE teams debug and troubleshoot incidents. The data requires storage, visualization, and verification: Do the Service Level Objectives (SLOs) match, do dashboards visualize useful data correlations, do network service maps make sense, and what about security policies?

Simulating a production incident is challenging. Chaos engineering enables teams to break things in a controlled environment and verify alerts, SLOs, and data accuracy. Which data retention cycle is best, and which dashboards reduce the mean time to respond? Anomaly detection and forecasting would be great.

This talk dives into the learning steps with eBPF and discusses traditional metrics monitoring and future Observability data collection, storage and visualization. Learn from hands-on examples with chaos experiments that attempt to break eBPF probes, data collection, and policies in unexpected ways - and bring new perspectives into cloud-native reliability.

What's the focus of your work these days?

In my previous roles, I was an operations person and OSS monitoring tool developer, with a passion for educating everyone about Git, GitLab, and CI/CD. This led me to become a Developer Evangelist, sharing what I’m learning in public and ensuring that everyone can contribute. New complex technology that needs to be explained often leads to pitching a new talk/story or thinking about tutorial articles or workshops. What are the benefits? What are the risks? What is the best way to try it together in a live session?

The challenge is to explain a complex topic so that people don't need to understand all the technical details but still consider it important. That is similar to what brought me to eBPF. Everyone kept talking about it, and I asked myself “What is that?”. The monitoring methods on the kernel level felt familiar, but I only understood certain parts. I took on the challenge to learn, and I document everything in public GitLab projects - notes, demos, code. Complex technology needs practical examples that go beyond the talk slides, so that people can learn asynchronously later.

What's the motivation for your talk at QCon London 2023?

The main motivation for this talk is that I found learning this technology really hard. I wanted to learn about eBPF when I focussed more on Observability and Chaos Engineering in 2022. There are so many resources: websites, social media accounts, newsletters, webinars, events - whatever you want to consume. It may take you hours or weeks to learn, and there is probably too much.

With eBPF, I'm seeing that everyone talks about it. There are great ideas that can help improve our way of working with cloud-native technologies. For example, it can help with debugging in production. It's also a new way to collect observability data. I'm asking the critical questions of what happens in the background. What are the benefits? What are the risks? I also will pitch new ideas, which might sound a little crazy, but I want to inspire thought leadership in technology and think beyond what the current state of observability is.

How would you describe your main persona and target audience for this session?

I'm aiming for everyone who wants to get started and go deeper with eBPF and Observability. For example, a DevOps engineer or SRE who wants to learn about tools that could help debug a problem or incident in production. It is important that we can sleep at night and don't get paged too often. When we do get paged, we need fast problem resolution, reducing the mean time to respond or mean time to resolve (MTTR).

Developers can benefit from getting an inside look into their applications without code instrumentation: get traces, analyze performance, and identify possibilities to improve. For anyone getting started with writing eBPF program code, the developer experience is also crucial - going high level or deep down, and understanding why eBPF helps and will help improve observability.

Users familiar with chaos engineering will also learn how to test eBPF-based tools, and how to break them in a controlled way. This talk should also encourage team leaders to understand the benefits and risks of eBPF, and allow them to plan ahead for evaluating the technology for production environments.

Is there anything specific that you'd like people to walk away with after watching your session?

One of the main objectives is for people to learn from my learning experience: how to get started with eBPF, focus on the tools, focus on the libraries, and define a use case for yourself. Then, keep going - because when you see the small networking data packet on your screen, you might get addicted, and this is what keeps you learning.

Developers should also be able to write eBPF programs using the examples shown in the talk, and get inspired by ideas on how to improve DevSecOps workflows, make debugging easier, and benefit from Observability.


Speaker

Michael Friedrich

Senior Developer Advocate @GitLab

Michael Friedrich is a Senior Developer Advocate at GitLab, focussing on DevSecOps, AI, and Observability. He loves to educate everyone and regularly speaks at events and meetups. Michael created o11y.love as an Observability learning platform, and shares technology trends and insights into day-2-ops, eBPF, OpenTelemetry, and AI/MLOps in his opsindev.news newsletter. When not traveling and working remotely, he enjoys building LEGO models.

Date

Tuesday Mar 28 / 05:25PM BST (50 minutes)

Topics

eBPF, monitoring, chaos engineering
