You are viewing content from a past/completed QCon

Presentation: Preparing for the Unexpected

Track: Chaos and Resilience: Architecting for Success

Location: Fleming, 3rd flr.

Duration: 11:50am - 12:40pm

Day of week: Wednesday

This presentation is now available to view on InfoQ.com

What You’ll Learn

  1. Hear how the FT manages incidents and what they are doing to make it a sustainable process.
  2. Learn how to benefit from past incidents and encourage engineers to get involved.

Abstract

Convincing engineers to be on-call isn’t always straightforward. In 2019 the Customer Products group at the Financial Times set out to make their out-of-hours support process more sustainable after losing a number of people from their on-call team.

In this talk you’ll discover how to continuously learn from past incidents by applying your team’s most recent operational experience, increase the confidence of your team in handling live incidents away from the pressures of production, and convince them that, actually, joining the on-call team is a great idea!

Hear how the Financial Times is using incident workshops to prepare for the unexpected and make incident management a more consistent process by sharing the group’s wide range of operational knowledge and architectural insights.

Question: 

What is the work you are doing today?

Answer: 

I work at the Financial Times as a Principal Engineer. I support the development of FT.com, the website and mobile applications. There are two things going on in our department at the moment that I'm supporting. One is that we're relaunching a whole bunch of teams, getting all of those teams kicked off and started at the beginning of this year. Quite a lot of energy is going into that, but it's all quite exciting. As part of that, we're trying to address the issue of ownership in a microservices world. So the outcome should be that we have a whole bunch of teams with full ownership of everything.

The other side is recruitment. We're starting a big new year, opening 40 new positions. So we're hiring quite a lot! That's exciting. It's a scale I haven't worked with before. It's all about scaling the recruitment process, making it possible for us to interview a lot of people quickly and fairly.

Question: 

What are your goals for the talk?

Answer: 

One of them is to use this as a reason to dive deep into something I'm quite interested in: incident management and reliability engineering. I'm really interested in telling the story of how we've handled incidents at the FT over the last year. And I want to get across that it's possible for engineers on the ground to make space for incident management and training, and to get people interested in the operational side of running systems. I think it's quite interesting. The FT is similar to a lot of companies where engineers have many different responsibilities, and sometimes you have to jump into incident management, taking it all the way to producing an incident report.

I want to get across that engineers can put on the incident management hat. And I'm using this also internally as a launchpad to become a team that excels at incident management within the FT.

The final goal is to publish a framework for how to run incident management tabletop exercises, a lightweight workshop for sharing knowledge and making new connections between people so that we can better handle incidents.

Question: 

What are the core personas for the talk?

Answer: 

This talk is for engineers, and any other discipline that would get value out of learning from incidents.

Question: 

Could you share a few key takeaways?

Answer: 

I want to get across that your company's previous incidents are a treasure trove for preparing for what's to come. We keep a record of all of our incidents at the FT and we review them regularly. There are always new things to learn, even from incidents that happened in the past. And new people bring fresh eyes to those previous incidents and to things that we didn't know at the time.

As engineers, you can carve out time and make space for running these workshops without too much effort. You don't have to have a dedicated role to run something like this. And you can really lift up your team by running these workshops.

The other important bit I want people to take away is an awareness of the barrier to entry that can exist when getting into incident management for the first time. This talk will cover a really good way to lower that barrier and get junior engineers, or anyone who's interested in incident management, involved without the pressures of production and of working on an active incident. It's also a way of getting more people motivated and engaged with looking after the systems running in production.

Speaker: Samuel Parkinson

Principal Engineer @FinancialTimes

Sam is a Principal Engineer at the Financial Times, supporting the development of FT.com and the mobile apps. Previously he worked at Graze, a start-up that sends snacks through the post. He has worked in the industry for six years as a software engineer, and has also spent time on the operational side as an integration engineer.

At the FT he has recently supported the Operations & Reliability group with their rebuild of the company-wide monitoring platform and is doing his best to convince people that joining the on-call team is definitely a good idea.
