Summary
Disclaimer: This summary has been generated by AI. It is experimental, and feedback is welcomed. Please reach out to info@qconlondon.com with any comments or concerns.
This presentation discusses how Gearset utilizes OpenTelemetry to enhance visibility in asynchronous workflows, addressing the complexities and bottlenecks of distributed, event-driven systems.
Key Points:
- Background: OpenTelemetry was adopted to address queue bottlenecks that were not easily detectable with traditional metrics and logs.
- Importance of Distributed Tracing: It is crucial for event-driven systems to connect the whole workflow from producers to consumers.
- Implementation Challenges: At Gearset, implementing OpenTelemetry involved customizing middleware to propagate trace contexts across systems.
- Benefits Gained: Greater insight into system behavior, the discovery of previously unknown dependencies, and the ability to visualize the full lifetime of a request contributed significantly to operational improvements.
Lessons Learned:
- Cultural Shift: Moving from dashboard metrics to a trace-driven debugging approach proved crucial for precise problem-solving.
- Operational Transparency: Visualization of service interconnectivity and latency through tracing exposed inefficiencies and allowed for optimization.
- Proactive Alerting: Incorporating service level objectives (SLOs) based on rich tracing data allows for more effective and meaningful alerts that are linked to customer experience rather than simply infrastructure metrics.
Conclusion: The integration of OpenTelemetry significantly improved Gearset’s ability to diagnose and preemptively address queueing bottlenecks, providing richer data insights and transforming their incident response strategy.
This is the end of the AI-generated content.
Abstract
Queues are the backbone of scalable, asynchronous systems, but they can easily create a tangled web of complexity. When things slow down, the bottleneck could be anywhere, from producer lag to consumer exhaustion, and standard metrics often fail to show the full picture.
In this session, we’ll explore how Gearset uses OpenTelemetry to bring visibility to our asynchronous workflows. We’ll share the lessons learned at Gearset while scaling our message-driven systems, and dive into:
- Connecting the dots: Why distributed tracing is non-negotiable for event-driven systems.
- Implementation: Using OTel standards to bridge the gap between producers and consumers.
- The hard parts: Tackling long-running traces, measuring true end-to-end duration, and creating lasting cultural change.
- Outcomes: How we shifted from "drowning in dashboards" to precise, trace-driven debugging, with real-world examples.
Speaker
Julian Wreford
Team Lead of Operability Team @Gearset, Software Engineer Turned Accidental SRE
Julian Wreford is an engineering team lead at Gearset where he leads the team responsible for all things site reliability. After starting as a developer, he quickly became interested in operability and has helped lead the growth of observability culture and incident response at Gearset as the company has scaled from small teams to large enterprises. He is passionate about developer ownership throughout the software lifecycle and enjoys empowering developers to better understand and debug the code they write when it is running at scale.
Speaker
Oli Lane
Engineering Team Lead @Gearset, Focusing on Engineering Culture, Observability, and Platform Reliability
Oli is an Engineering Team Lead and self-described "Jack of at least some trades." A fixture at Gearset for over ten years, he has ridden the wave from a scrappy 7-person startup to a 350+ employee scale-up.
Along the way, he has gained deep experience across both product and infrastructure teams, with a particular interest in the sociotechnical side of engineering. Currently, Oli focuses on platform engineering and observability, building the culture and tools needed for high-performing teams and reliable systems.