SESSION + Live Q&A

Preparing for the Unexpected

Convincing engineers to be on-call isn’t always straightforward. In 2019 the Customer Products group at the Financial Times set out to make their out-of-hours support process more sustainable after losing a number of people from their on-call team.

In this talk you’ll discover how to continuously learn from past incidents by applying your team’s most recent operational experience, increase the confidence of your team in handling live incidents away from the pressures of production, and convince them that, actually, joining the on-call team is a great idea!

Hear how the Financial Times is using incident workshops to prepare for the unexpected and make incident management a more consistent process by sharing the group’s wide range of operational knowledge and architectural insights.


What is the work you are doing today?

I work at the Financial Times as a Principal Engineer. I support the development of FT.com, the website and mobile applications. There are two things going on in our department at the moment that I'm supporting. One is we're relaunching a whole bunch of teams, getting all of those teams kicked off and started at the beginning of this year. Quite a lot of energy is going into that, but it's all quite exciting. As part of that, we're trying to address the issue of ownership in a microservices world. So the outcome should be that we have a whole bunch of teams with full ownership of everything.

What are your goals for the talk?

One of them is to use this as a reason to deep dive into something I'm quite interested in: incident management and reliability engineering. I'm really interested in telling the story that we have had at the FT over the last year about how we've handled incidents. And I want to get across that it's possible for engineers on the ground to make space for incident management and training and get people interested in the operational side of running systems. I think it's quite interesting. The FT is similar to a lot of companies where engineers have many different responsibilities and sometimes you have to jump into incident management, taking it all the way to producing an incident report.

What are the core personas for the talk?

This talk is for engineers, and any other discipline that would get value out of learning from incidents.

Could you share a few key takeaways?

I want to get across that your company's previous incidents are a treasure trove for preparing for what's to happen. We keep a record of all of our incidents at the FT and we review them regularly. There are always new things to learn even if they've happened in the past. And new people provide new eyes on those previous incidents and things that we didn't know at the time.


Speaker

Samuel Parkinson

Principal Engineer @FinancialTimes

Sam is a Principal Engineer at the Financial Times, supporting the development of FT.com and the mobile apps. Previously he’s worked at Graze, a start-up that sends snacks through the post. Working in the industry for six years as a software engineer, he’s also spent time on the...

Location

Fleming, 3rd flr.

Track

Chaos and Resilience: Architecting for Success

Topics

Interview Available, London, Incident Management, Site Reliability Engineering, Resilient Systems

From the same track

SESSION + Live Q&A Interview Available

Better Resilience Adoption through UX

Too often, attempts to bring resilience engineering to an organization fall flat. Perhaps there’s some initial interest, but that wavers under the crushing weight of JIRA queues and sprint reviews. The tools are there but no one’s using them. This session will go over three case...

Randall Koutnik

UI Engineer

SESSION + Live Q&A Incident Management

Growing Resilience: Serving Half a Billion Users Monthly at Condé Nast

Serving over half a billion monthly customers while keeping service availability high is a monumental task. Condé Nast operates in nearly 40 countries and is better known for its portfolio of household brands such as Vogue, Wired, Vanity Fair, and The New Yorker. Our globally distributed...

Crystal Hirschorn

VP Engineering, Global Strategy & Operations @CondeNast

SESSION + Live Q&A Incident Management

How Many Is Too Much? Exploring Costs of Coordination During Outages

Service outages can attract a lot of attention from a wide range of participants - particularly when the service is for a business critical function. These ‘stakeholders’ represent multiple roles with different experience, responsibilities, expertise and knowledge about how the system...

Laura Maguire

Cognitive Systems Engineer & Researcher

SESSION + Live Q&A Incident Management

Rethinking How the Industry Approaches Chaos Engineering:

In order to determine and envision how to achieve reliability and resilience that drive our businesses forward, organizations must be able to look back at past blunders unobscured by hindsight bias. Resilient organizations don’t take past successes as a reason for confidence. Instead, they...

Nora Jones

Senior Developer/ Engineer
