Modern architectures require effective observability solutions to monitor their health and to understand how system changes affect operations distributed across multiple services. Using telemetry signals, like metrics or logs, in their own silos is no longer enough to provide the holistic view required to efficiently operate the increasingly complex systems that we, as engineers, run at scale.
Since its 1.0 releases in early 2021, and after reaching CNCF incubating status in August of the same year, OpenTelemetry has steadily grown in popularity. Now supported by most major open source telemetry tooling and observability vendors, it is changing the playing field in how we instrument, collect and consume telemetry from our services. Most importantly, it is also changing how we think about debugging distributed systems.
In this talk, Dan shares his experience leading a large-scale observability initiative at Skyscanner, based on the adoption of OpenTelemetry across hundreds of services, and the motivation for, and value gained from, adopting open standards across the entire organisation. The presentation details some of the lessons and challenges from the platform engineering team he is part of, which is helping to roll out OpenTelemetry with minimum friction. This includes default SDK configuration and telemetry transport pipelines, as well as how to encourage the change in debugging workflows needed to maximise the value of telemetry data.
Interview:
What's the focus of your work these days?
I'm a Principal Engineer at Skyscanner and technical lead in the area of operational monitoring and observability. My focus these days is rolling out open standards for telemetry instrumentation across the whole organization, and reducing the toil involved for all teams at Skyscanner to instrument their services, understand their applications and systems, and detect and debug regressions faster than before.
What's the motivation for your talk at QCon London 2023?
The motivation for my talk is to give attendees an idea of the benefits of open standards in telemetry, and the value of OpenTelemetry as a CNCF project. I'll cover how and why we adopted it at Skyscanner, along with some of the best practices, caveats and challenges that we found while rolling out OpenTelemetry, including advice for anyone wanting to adopt open standards in their organization.
How would you describe your main persona and target audience for this session?
I think it's for anybody in charge of operating systems at scale and supporting production workloads who is interested in speeding up incident detection within their systems, and in debugging regressions faster to reduce the time that their systems are down for their end users. It's also for anybody who wants to simplify telemetry across an organization by applying open standards, and to benefit from out-of-the-box instrumentation for their distributed systems.
Is there anything specific that you'd like people to walk away with after watching your session?
I would like people to understand the value that open standards can bring to an organization, to understand the need for observability, and to see how the practices that we've been following for the last 10 to 20 years are no longer enough to effectively operate systems in production.
Speaker
Daniel Gomez Blanco
Principal Engineer @Skyscanner
Dan is a Principal Engineer working at Skyscanner within the Production Platform tribe, and author of "Practical OpenTelemetry: Adopting Open Observability Standards Across Your Organization". Throughout his career, his main focus has been building solutions to reduce the cognitive load required to operate systems at scale. At Skyscanner, rolling out a standard approach to observability across hundreds of components, which provide a reliable service for nearly 100 million unique monthly users, presents him and his team with unique challenges: the sort that manifest themselves more clearly with scale, and that are fascinating to tackle and to share learnings from.