Uncorking Queueing Bottlenecks with OpenTelemetry

Summary

Disclaimer: This summary has been generated by AI. It is experimental, and feedback is welcomed. Please reach out to info@qconlondon.com with any comments or concerns.

This presentation discusses how Gearset utilizes OpenTelemetry to enhance visibility in asynchronous workflows, addressing the complexities and bottlenecks of distributed, event-driven systems.

Key Points:

  • Background: OpenTelemetry was adopted to address queue bottlenecks that were not easily detectable with traditional metrics and logs.
  • Importance of Distributed Tracing: It is crucial for event-driven systems to connect the whole workflow from producers to consumers.
  • Implementation Challenges: At Gearset, implementing OpenTelemetry involved customizing middleware to propagate trace contexts across systems.
  • Benefits Gained: Greater insight into system behavior, discovery of previously unknown dependencies, and the ability to visualize the full lifetime of a request contributed significantly to operational improvements.
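
To make the idea of propagating trace context through a queue concrete, here is a minimal sketch, not Gearset's implementation. It assumes the W3C Trace Context `traceparent` header format (version, trace ID, parent span ID, flags) and uses only the Python standard library; the helper names (`inject`, `extract`, `publish`, `consume`) and the in-memory list standing in for a queue are hypothetical. A real system would more likely rely on the propagators shipped with an OpenTelemetry SDK than on hand-rolled parsing.

```python
import re
import secrets

# W3C Trace Context: version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers, trace_id, span_id, sampled=True):
    """Write the producer's trace context into the message headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers):
    """Return (trace_id, parent_span_id, sampled), or None if absent or invalid."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_span_id, flags = match.groups()
    # All-zero IDs are invalid in the W3C format.
    if trace_id == "0" * 32 or parent_span_id == "0" * 16:
        return None
    return trace_id, parent_span_id, bool(int(flags, 16) & 1)


def publish(queue, body):
    """Hypothetical producer: start a trace and attach its context to the message."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    headers = {}
    inject(headers, trace_id, span_id)
    queue.append({"headers": headers, "body": body})
    return trace_id, span_id


def consume(queue):
    """Hypothetical consumer: continue the producer's trace if context is present."""
    message = queue.pop(0)
    context = extract(message["headers"])
    if context is None:
        # No usable context: start a fresh trace rather than failing.
        trace_id, parent_span_id = secrets.token_hex(16), None
    else:
        trace_id, parent_span_id, _sampled = context
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "span_id": secrets.token_hex(8),
    }
```

Because the consumer's span reuses the producer's trace ID and records the producer's span as its parent, a tracing backend can stitch both sides of the queue into a single trace.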

Lessons Learned:

  • Cultural Shift: Moving from dashboard metrics to a trace-driven debugging approach proved crucial for precise problem-solving.
  • Operational Transparency: Visualization of service interconnectivity and latency through tracing exposed inefficiencies and allowed for optimization.
  • Proactive Alerting: Incorporating service level objectives (SLOs) based on rich tracing data allows for more effective and meaningful alerts that are linked to customer experience rather than simply infrastructure metrics.

Conclusion: The integration of OpenTelemetry significantly improved Gearset’s ability to diagnose and preemptively address queueing bottlenecks, providing richer data insights and transforming their incident response strategy.

This is the end of the AI-generated content.


Abstract

Queues are the backbone of scalable, asynchronous systems, but they can easily create a tangled web of complexity. When things slow down, the bottleneck could be anywhere, from producer lag to consumer exhaustion, and standard metrics often fail to show the full picture.

In this session, we’ll explore how Gearset uses OpenTelemetry to bring visibility to our asynchronous workflows. We’ll share the lessons learned at Gearset while scaling our message-driven systems, and dive into:

  • Connecting the dots: Why distributed tracing is non-negotiable for event-driven systems.
  • Implementation: Using OTel standards to bridge the gap between producers and consumers.
  • The hard parts: Tackling long-running traces, measuring true end-to-end duration, and creating lasting cultural change.
  • Outcomes: How we shifted from "drowning in dashboards" to precise, trace-driven debugging, with real-world examples.
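
As an illustration of what measuring true end-to-end duration can involve, the sketch below splits a message's lifetime into time spent waiting in the queue and time spent being processed. It is an assumption-laden example rather than the approach described in the talk: the `enqueued_at` header, the `Message` type, and the `enqueue` and `handle` functions are all hypothetical, and the clock is injectable so the behavior can be checked deterministically.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Message:
    body: str
    headers: dict = field(default_factory=dict)


def enqueue(body, clock=time.time):
    """Hypothetical producer side: stamp the message with its enqueue time."""
    return Message(body=body, headers={"enqueued_at": clock()})


def handle(message, work, clock=time.time):
    """Hypothetical consumer side: run the work and report where the time went."""
    started = clock()
    work(message.body)
    finished = clock()
    enqueued_at = message.headers["enqueued_at"]
    return {
        "queue_wait_s": started - enqueued_at,   # time spent sitting in the queue
        "processing_s": finished - started,      # time spent in the consumer
        "end_to_end_s": finished - enqueued_at,  # what the requester experiences
    }
```

A consumer-side duration metric alone would report only `processing_s`; a backed-up queue shows up in `queue_wait_s`. Note that when producer and consumer run on different hosts, wall-clock skew between them can distort the queue wait figure.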

Speaker

Julian Wreford

Team Lead of Operability Team @Gearset, Software Engineer Turned Accidental SRE

Julian Wreford is an engineering team lead at Gearset where he leads the team responsible for all things site reliability. After starting as a developer, he quickly became interested in operability and has helped lead the growth of observability culture and incident response at Gearset as the company has scaled from small teams to large enterprises. He is passionate about developer ownership throughout the software lifecycle and enjoys empowering developers to better understand and debug the code they write when it is running at scale.


Speaker

Oli Lane

Engineering Team Lead @Gearset, Focusing on Engineering Culture, Observability, and Platform Reliability

Oli is an Engineering Team Lead and self-described "Jack of at least some trades." A fixture at Gearset for over ten years, he has ridden the wave from a scrappy 7-person startup to a 350+ employee scale-up.

Along the way, he has gained deep experience across both product and infrastructure teams, with a particular interest in the sociotechnical side of engineering. Currently, Oli focuses on platform engineering and observability, building the culture and tools needed for high-performing teams and reliable systems.


From the same track

Session Architecture

From Fan-Out to Fast: Sub-100ms API Design in Distributed Systems

Monday Mar 16 / 10:35AM GMT

A “simple” API request rarely stays simple. In distributed systems, one call quickly turns into fan-out across gateways, services, caches, and databases — and your p99 becomes the sum of every hop and every flaky dependency.

Saranya Vedagiri

Senior Staff Engineer @eBay

Session Platform Engineering

APIs for Agents: Rethinking API Programs in the MCP Era

Monday Mar 16 / 01:35PM GMT

As API programs mature, a familiar gap emerges: some teams operate with strong standards, reusable platforms, and clear governance, while others rely on informal guidance and best-effort consistency.

Jim Gough

Distinguished Engineer, API Platform Lead Architect @Morgan Stanley, Co-Author of Optimizing Java

Andreea Niculcea

Vice President @Morgan Stanley

Session Architecture

Managing Asynchronous APIs at Scale

Monday Mar 16 / 05:05PM GMT

When event-driven architectures are small, teams can reason about events through word-of-mouth. They know who publishes what, who consumes it, and how messages flow through the system. Teams manage their own infrastructure or raise tickets to request changes.

Ian Cooper

Senior Principal Engineer @Just Eat Takeaway

Session AI

Enchant Your AI and APIs with eBPF Magic 🪄

Monday Mar 16 / 03:55PM GMT

It is a common occurrence to see applications thrown over the fence, landing somewhere in production without a second thought about their lifecycle or how they may need maintaining in the future to connect to more efficient API endpoints.

Dan Finneran

Principal Community Advocate at Isovalent @Cisco

Session

Unconference: Connecting Systems

Monday Mar 16 / 02:45PM GMT