Track: Chaos and Resilience: Architecting for Success

Location: Fleming, 3rd flr.

Day of week: Wednesday

Understanding the complex socio-technical systems found in software today is paramount to the future success of organizations and of the industry in general. In this track, you will learn from the stories of practitioners who are bringing change to their organizations in new and previously unheard-of ways through Chaos Engineering, Resilience Engineering, and critical reflection on Cognitive Systems Engineering and Human Factors techniques. The practitioners in this track will show you how bringing these disciplines to software helps organizations learn and grow in beneficial ways, such as making better architecture decisions, growing and distilling technical expertise, and having more confidence when disaster strikes. You will leave this track having explored more effective approaches and techniques for building the adaptive capacity, in both people and technology, to manage the consequences of failure successfully.

Track Host: Nora Jones

Senior Developer/Engineer

Nora is a dedicated and driven technology leader and software engineer with a passion for people and reliable software, as well as the intersection between those two worlds. She truly believes that safety is pivotal to software development today. She co-wrote two O’Reilly books on Chaos Engineering and on how a product’s availability can be improved through intentional failure experimentation.

She has also shared her experiences helping organizations large and small reach crucial availability, and in November 2017 she keynoted at AWS re:Invent to share these experiences with an audience of ~40,000 people, helping kick off the Chaos Engineering movement we see today. Since then she has keynoted at several other conferences throughout the world, highlighting topics such as Resilience Engineering, Chaos Engineering, Human Factors, Site Reliability, and more from her work at Netflix, Slack, and Jet.com.

10:35am - 11:25am

Better Resilience Adoption through UX

Too often, attempts to bring resilience engineering to an organization fall flat. Perhaps there’s some initial interest, but that wavers under the crushing weight of JIRA queues and sprint reviews. The tools are there but no one’s using them.

This session will go over three case studies where teams achieved success (and a few where they didn't!) by focusing on the human element of engineering tooling. In each one, we’ll look at a specific UX technique the team employed to put their company on a path to resilience.

Randall Koutnik, UI Engineer

11:50am - 12:40pm

Preparing for the Unexpected

Convincing engineers to be on-call isn’t always straightforward. In 2019 the Customer Products group at the Financial Times set out to make their out-of-hours support process more sustainable after losing a number of people from their on-call team.

In this talk you’ll discover how to continuously learn from past incidents by applying your team’s most recent operational experience, increase the confidence of your team in handling live incidents away from the pressures of production, and convince them that, actually, joining the on-call team is a great idea!

Hear how the Financial Times is using incident workshops to prepare for the unexpected and make incident management a more consistent process by sharing the group’s wide range of operational knowledge and architectural insights.

Samuel Parkinson, Principal Engineer @FinancialTimes

1:40pm - 2:30pm

How Many Is Too Much? Exploring Costs of Coordination During Outages

Service outages can attract a lot of attention from a wide range of participants, particularly when the service supports a business-critical function. These ‘stakeholders’ represent multiple roles with different experience, responsibilities, expertise and knowledge about how the system functions, whether they are users, management, engineers from other dependent services or the incident responders paged in to help with the response. Each stakeholder brings important contributions that are necessary for maintaining reliable operations. However, smoothly and effectively integrating those contributions, or sufficiently meeting stakeholders' needs for updates, task delegation or decisions, requires elaborate coordination, often under extreme time pressure. Prior research has shown that these coordinative efforts represent a significant cognitive cost (Klein et al., 2005; Klinger & Klein, 1999; Klein, 2006) and require a distinct set of skills (Woods, 2017) to manage in concert with the demands of diagnosing and resolving the incident itself.

Presenting findings from her doctoral research and her experience working with site reliability engineers responsible for critical digital infrastructure (CDI), Laura will uncover the hidden costs of coordination, highlight how the challenges of modern IT infrastructure will continue to impede hitting four 9s of service reliability, and show how resilient performance is directly tied to coordination. Along the way, she will examine problematic elements of an Incident Command System, use case study examples to describe helpful and harmful patterns of coordination, and offer some promising directions for how to control the costs of coordination in your incident response practices. You will never look at incident response the same way!

Laura Maguire, Cognitive Systems Engineer & Researcher

2:55pm - 3:45pm

Learning From Incidents: How Things Went Right

"At some level of analysis, systems are human systems since it is people who create, operate, and modify that system for human purposes, and since people, not machines, gain or suffer from the operation of that system." JC Le Coze

When things go wrong, we tend to focus on mistakes, miscalculations, and deficiencies in design. By limiting our investigations to the details of what went wrong, we ignore a far richer and more interesting source of learning: how things went right.

Research across numerous safety-critical industries such as aviation and medicine is changing what we know about how to build systems and organizations which are resilient to failure. We will look into the findings of that research and discover how we can avoid falling into common traps of investigation which curtail our ability to learn. This research shows us that the best results come when we are able to answer questions such as:

How does the system normally work?

How did we recover?

How do teams adapt to surprising circumstances?

Where did we bring expertise to the incident, and what worse outcomes did we avoid?

We will share stories from beyond the boundaries of our own industry in order to show how powerful some of these new investigative techniques can be. We will move beyond a shallow analysis of root causes and remediation items in an effort to build truly resilient engineered systems for the future.

Jessica DeVita, Sr. Resilience Engineering Advocate @Netflix

4:10pm - 5:00pm

Chaos and Resilience

Details to follow.

Tracks

  • Architectures You've Always Wondered About

    Hard-earned lessons from the names you know on scalability, reliability, security, and performance.

  • Machine Learning: The Latest Innovations

    AI and machine learning are more approachable than ever. Discover how ML, deep learning, and other modern approaches are being used in practice.

  • Kubernetes and Cloud Architectures

    Learn about cloud native architectural approaches from the leading industry experts who have operated Kubernetes and FaaS at scale, and explore the associated modern DevOps practices.

  • Evolving Java

    JVM futures, JIT directions and improvements to the runtime stack are the themes of this year’s JVM track.

  • Next Generation Microservices: Building Distributed Systems the Right Way

    Microservice-based applications are everywhere, but well-built distributed systems are not so common. Early adopters of microservices share their insights on how to design systems the right way.

  • Chaos and Resilience: Architecting for Success

    Making systems resilient involves people and tech. Learn about strategies being used, from cognitive systems engineering to chaos engineering.

  • The Future of the API: REST, gRPC, GraphQL and More

    The humble web-based API is evolving. This track provides the what, how, and why of future APIs.

  • Streaming Data Architectures

    Today's systems move huge volumes of data. Hear how the innovators in this space are designing systems and leveraging modern data stream processing platforms.

  • Modern Compilation Targets

    Learn about the innovation happening in the compilation target space. WebAssembly is only the tip of the iceberg.

  • Leaving the Ivory Tower: Modern CS Research in the Real World

    Thoughts pushing software forward, including consensus, CRDTs, formal methods & probabilistic programming.

  • Bare Knuckle Performance

    Crushing latency and getting the most out of your hardware.

  • Leading Distributed Teams

    Remote and distributed working are increasing in popularity, but many organisations underestimate the leadership challenges. Learn from those who are doing this effectively.

  • Driving Full Cycle Engineering Teams at Every Level

    "Full cycle developers" is not just another catchphrase; it's about engineers taking ownership and delivering value, and doing so with the support of their entire organisation. Learn more from the pioneers.

  • JavaScript: Pushing the Client Beyond the Browser

    JavaScript is not just the language of the web. Join this track to learn how the innovators are pushing the boundaries of this classic language and ecosystem.

  • When Things Go Wrong: GDPR, Ethics, & Politics

    Privacy, confidentiality, safety and security: learning from the frontlines, from both good and bad experiences.

  • Growing Unicorns in the EU: Building, Leading and Scaling Financial Tech Start Ups

    Learn how EU FinTech innovators have designed, built, and led both their technologies and organisations.

  • Building High Performing Teams

    There are many discussions outlining the secret sauce of high-performing teams. Learn how to balance the essential ingredients of high performing teams such as trust and delegation, as well as recognising the pitfalls and problems that will ruin any recipe.

  • Scaling Security, from Device to Cloud

    Implementing effective security is vitally important, regardless of where you are deploying software applications.