Session + Live Q&A

The Scientific Method for Testing System Resilience

Do you remember the Scientific Method from elementary school science class? It's time to dust off that knowledge and use it to your advantage to test your IT systems! In this session, you'll be re-introduced to the Scientific Method, and learn how Vanguard's software engineers and IT architects draw inspiration from it in their resilience testing efforts. We’ll do a deep dive into the "Failure Modes and Effects Analysis" technique, in which engineers examine complex architecture diagrams, asking themselves questions about the failure modes of various technical components and developing hypotheses based on their expectations of how the system would behave. Then, we’ll discuss how the engineers use these conjectures as inputs into experimentation, selecting and executing chaos experiments accordingly to validate (or disprove!) their hypotheses. We’ll even take a look behind the curtain at how some of these fault injection tests are implemented at Vanguard.

Main Takeaways

1. Hear about how Vanguard deals with issues in their software systems.

2. Learn how to use Failure Modes and Effects Analysis.

Christina, what is the focus of your work these days?

Right now, my primary focus is the staffing, onboarding and subsequent education of site reliability engineers for Vanguard. So I handle everything from what it means to be a site reliability engineer day to day, to what tools and technologies they'll need to be familiar with and how best to get them up to speed. I also work on where we are going to find these SREs, how many we need, and where we should place them within our organization.

What is the motivation for your talk?

I want to share the story of a practice we've adopted across Vanguard, particularly in areas where we have site reliability engineering staff to do the work: failure modes and effects analysis. I'll cover where the practice started, which involved many challenges and lots of bumps in the road, the various iterations that made it better, and the value we've derived from making it a step for the majority of applications going into production at Vanguard. I'll also share all of the frameworks so that attendees can repeat the successes we've seen at Vanguard in their own systems at their companies.

And how would you describe the persona and the level for the target audience?

I think the right audience for this talk is anyone in a role like technical lead for a software system. Oftentimes, the people involved in the conversations that make up a failure modes and effects analysis are technical leads, architects or senior engineers who can look at an architecture diagram, ask the right questions, interpret what they're seeing and suggest how to make the architecture more resilient.

What do you want this persona to walk away with from your presentation?

I hope that anyone attending my presentation will feel confident taking what they've learned about the failure modes and effects analysis technique, and even chaos experimentation, back to their organizations and applying it to the software systems they are building, so that they can reap the same benefits I have.


Christina Yakomin

Senior Site Reliability Engineering Specialist @Vanguard_Group

Christina is a Senior Site Reliability Engineering Specialist in Vanguard's Chief Technology Office. She has worked at the company's Malvern, PA headquarters since graduating from Villanova University with an undergraduate degree in Computer Science. Throughout her career, she...



Tuesday Apr 5 / 01:40PM BST (50 minutes)


Fleming, 3rd flr.


Resilient Architectures


Resilient Systems, Chaos Engineering



From the same track

Session + Live Q&A Resilient Systems

Practical Resilience - The Core Stuff

Tuesday Apr 5 / 02:55PM BST

This panel will aim to explore, share ideas and provide pragmatic insight around some key areas related to designing, running and maintaining resilient architectures.

Liz Rice

Chief Open Source Officer @Isovalent

Christina Yakomin

Senior Site Reliability Engineering Specialist @Vanguard_Group

Jason Barto

Principal Solutions Architect @AWS

Kai Waehner

Field CTO @Confluentinc

Session + Live Q&A Resilient Systems

How to Test Your Fault Isolation Boundaries in the Cloud

Tuesday Apr 5 / 04:10PM BST

Will my system keep working when a server fails? When a data center goes offline? When a service dependency is unavailable? Availability calculations for redundant components require that those components are independent and autonomous of each other. But modern day systems are complex, exhibiting...

Jason Barto

Principal Solutions Architect @AWS

Session + Live Q&A Resilient Systems

Resilient Real-Time Data Streaming Across the Edge and Hybrid Cloud

Tuesday Apr 5 / 05:25PM BST

Hybrid cloud architectures are the new black for most companies. A cloud-first strategy is evident for many new enterprise architectures, but some use cases require resiliency across edge sites and multiple cloud regions. Data streaming with the Apache Kafka ecosystem is a perfect technology for...

Kai Waehner

Field CTO @Confluentinc


Unconference: Resilient Architectures

Tuesday Apr 5 / 11:50AM BST

Details coming soon.

Session + Live Q&A eBPF

Resiliency Superpowers with eBPF

Tuesday Apr 5 / 10:35AM BST

eBPF is a powerful technology that allows us to run custom programs in the kernel. It’s enabling a whole new generation of tools for networking, security and observability. Let’s explore how it can help us build resilient architectures. This talk - with demos - considers...

Liz Rice

Chief Open Source Officer @Isovalent
