Effective and Efficient Observability with OpenTelemetry

Modern architectures require effective observability to monitor their health and to understand how changes affect operations distributed across multiple services. Using telemetry signals such as metrics or logs in their own silos is no longer enough to provide the holistic view needed to efficiently operate the increasingly complex systems that we, as engineers, run at scale.

Since its 1.0 releases in early 2021 and its move to CNCF incubating status in August of the same year, OpenTelemetry's popularity has risen steadily. Now supported by most major open source telemetry tooling and observability vendors, it is changing the playing field in how we instrument, collect and consume telemetry from our services. Most importantly, it is also changing how we think about debugging distributed systems.

In this talk, Dan shares his experience leading a large-scale observability initiative at Skyscanner, based on the adoption of OpenTelemetry across hundreds of services, along with the motivation for, and value gained from, adopting open standards across the entire organisation. The presentation details some of the lessons and challenges encountered as his platform engineering team helps roll out OpenTelemetry with minimum friction, including default SDK configuration and telemetry transport pipelines, and explains how to encourage the change in debugging workflows needed to maximise the value of telemetry data.

Interview:

What's the focus of your work these days?

I'm a Principal Engineer at Skyscanner and technical lead in the area of operational monitoring and observability. My area of focus these days is rolling out open standards for telemetry instrumentation across the whole organization, and reducing the toil involved for all teams at Skyscanner in instrumenting their services and understanding their applications and systems, so that they can detect regressions and debug them faster than before.

What's the motivation for your talk at QCon London 2023?

The motivation for my talk is to give attendees an idea of the benefits of open standards in telemetry and the value of OpenTelemetry as a CNCF project. I'll cover how and why we adopted it at Skyscanner, along with some of the best practices, caveats, and challenges that we found while rolling it out, including advice for anyone wanting to adopt open standards in their organization.

How would you describe your main persona and target audience for this session?

I think it's for anybody who is in charge of operating systems at scale and supporting production workloads, and who is interested in speeding up incident detection and debugging regressions faster to reduce the time their systems are down for end users. It's also for anybody who wants to simplify telemetry across an organization by applying open standards, and to benefit from out-of-the-box instrumentation for distributed systems.

Is there anything specific that you'd like people to walk away with after watching your session?

I would like people to understand the value that open standards can bring to an organization, as well as the need for observability, and how the practices that we've been following for the last 10 or 20 years are no longer enough to effectively operate systems in production.


Speaker

Daniel Gomez Blanco

Principal Engineer @Skyscanner

Dan is a Principal Engineer working at Skyscanner within the Production Platform tribe, and author of "Practical OpenTelemetry: Adopting Open Observability Standards Across Your Organization". Throughout his career, his main focus has been building solutions to reduce the cognitive load required to operate systems at scale. At Skyscanner, rolling out a standard approach to observability across hundreds of components, which provide a reliable service for nearly 100 million unique monthly users, presents him and his team with unique challenges: the sort that manifest themselves more clearly with scale, and that are fascinating to tackle and to share learnings from.

Date

Monday Mar 27 / 02:55PM BST (50 minutes)

Location

Churchill (Ground Fl.)

Topics

OpenTelemetry, case study, best practices

From the same track

Session consistency

Eventual Consistency – Don’t Be Afraid!

Monday Mar 27 / 10:35AM BST

Distributed data-intensive systems are increasingly designed to be only eventually consistent.

Susanne Braun

Principal Tech Lead @SAPSignavio

Session cloud native

The Commoditization of the Software Stack: How Application-first Cloud Services are Changing the Game

Monday Mar 27 / 11:50AM BST

The runtime boundaries between applications and the cloud are shifting from virtual machines to containers and functions. The integration boundaries are moving away from pure data access to one where the mechanical parts of the application are running within the cloud.

Bilgin Ibryam

Principal Product Manager @Diagrid, Co-author of “Kubernetes Patterns”, Previously Architect @RedHat

Session Data

What is Derived Data? (And do You Already Have Any?)

Monday Mar 27 / 05:25PM BST

There is a growing trend of databases specializing in derived data ingestion and serving. They complement more traditional “primary data” (or “source of truth”) systems.

Felix GV

Principal Staff Engineer @LinkedIn

Session api

Connecting the Dots: API Design in a Distributed World

Monday Mar 27 / 04:10PM BST

As we’ve gone from building monoliths to building microservices, the number of APIs we’ve got to manage has gone from just the database and front end, to at least one per service. 

Ben Gamble

Adviser, Architect & Speaker About Interactive Technology, Startups & Event Driven Systems

Session

Unconference: Building Modern Backends

Monday Mar 27 / 01:40PM BST

What is an unconference? An unconference is a participant-driven meeting. Attendees come together, bringing their challenges and relying on the experience and know-how of their peers for solutions.

Shane Hastie

Global Delivery Lead @SoftEd, Lead Editor for Culture & Methods @InfoQ