Making systems resilient involves people and tech. Learn about the strategies being used, from chaos testing to distributed system clustering.
Track: Architecting for Failure: Chaos, Complexity, and Resilience
Location: Whittle, 3rd flr.
Day of week: Wednesday

Track Host: Nicki Watt
Nicki Watt currently serves as Chief Technology Officer at OpenCredo, a pragmatic, hands-on software consultancy with specialisms in data engineering, ML and cloud native solutions. Her technical career has seen her wear many hats, from Engineer and Systems & Technical Architect to Consultant and now CTO. She is a techie at heart, with involvement in the development, delivery and leadership of large-scale platform and application development projects. Nicki is also co-author of the graph database book Neo4j in Action.
10:35am - 11:25am
Building Resilient Serverless Systems
In this brave new world of serverless, we entrust our vendors with keeping the infrastructure up and running. However, when even cloud behemoths like Amazon Web Services and Google Cloud have outages and failures, how can we build resilient systems?
John Chapin explains how to use serverless technologies and an infrastructure-as-code approach to architect, build, and operate large-scale systems that are resilient to vendor failures, even while taking advantage of fully managed vendor services and platforms. He then leads an end-to-end demo of the resilience of a well-architected serverless system in the face of massive simulated failure. He further demonstrates how the system not only provides resilience to failure, but also has the side effect of improving the end-user experience.
Finally, John discusses some of the drawbacks and idiosyncrasies of the approach. All source code, infrastructure templates, and slides will be available for the audience to download and explore. While the examples largely focus on AWS (including API Gateway, CloudFormation, DynamoDB, Lambda, and Route 53), the techniques discussed are broadly applicable across cloud vendors.
11:50am - 12:40pm
Learning From Chaos: Architecting for Resilience
In this talk, Russ Miles, CEO of ChaosIQ, will share how leading organisations are successfully adopting chaos engineering to encourage a mindset of "architecting for resilience". Through chaos engineering, architects are able to establish a true "learning system" where everyone is involved in exploring how their systems can improve through embracing failure.
Drawing from a collection of real-world examples and experience reports, Russ will show how you can set up your systems to learn from controlled failure and make resilience an important competitive edge for your organisation.
1:40pm - 2:30pm
An Engineer's Guide to a Good Night's Sleep
As organisations look to empower engineers more and embrace DevOps practices, we have seen the support role change quite a bit too. Developers are moving from being purely third-line support to working more collaboratively with engineers and operational staff. Also, as we move to cloud native microservice solutions, the increased complexity and diversity of our production landscape means operational staff may well rely more heavily on the engineers, in particular out of hours.
I have spent the last 18 years working across a plethora of industries, utilising a myriad of technologies and approaches. Having worked on everything from trading applications to content enrichment APIs, I have seen many approaches and processes that try to minimise operational support for developers.
In this talk, I will explore and discuss some of my top approaches and techniques to help reduce the risk of that dreaded 3am call! You will gain some practical insight into how to handle failure in today's more complex distributed microservice systems. This will include looking at approaches to resiliency, understanding your system, understanding the requirements for fault tolerance, and the developer mindset necessary for this. I will be peppering this talk with real-world examples, and an occasional war story along the way too.
2:55pm - 3:45pm
How Condé Nast Succeeds by a Culture That Embraces Failure
Systems architectures are increasingly diverse in order to serve growing demands for scalability, fault tolerance, isolation, and extensibility. The trade-off is ever more complex software to operate and maintain, often with no single shared view of the entire design. This is especially true given the prevalence of microservices architectures and a growing reliance on vendor capabilities that are largely out of our control. While errors and incidents cannot be completely eradicated from our systems, we can at least build for resilience and adaptability. Rigorous experimentation, as a cultural practice and habit, can identify constraints in the current design and inform predictions about the emergence of newer patterns for handling failures gracefully, such as preventing failure cascades. Another important benefit is aligning people’s mental models of how the software is designed and operated.
Crystal will walk through the lessons learned from building a culture that embraces failure through Chaos Engineering practices as a daily routine, and what her teams have learned and adapted for their platforms at Condé Nast International, which currently serve in excess of 220 million unique users every month across the globe.
4:10pm - 5:00pm
Amplifying Sources of Resilience: What Research Says
Building robust software systems means anticipating how failures may occur with components and subsystems and developing answers to the question:
“What is needed for the design of systems that prevents or limits catastrophic failure?”
Investing in, developing, and sustaining the adaptive capacity to cope with unexpected situations is at the core of Resilience Engineering. In the software community, this means developing (continually!) ever-better answers to the question:
“When our preventative designs fail us, what are ways that teams of engineers successfully anticipate, resolve, and learn from those catastrophes?”
The Resilience Engineering community has been studying how people in high-consequence/high-tempo domains answer this latter question. Applying Resilience Engineering thinking and paradigms to the world of software engineering and operations is still in its infancy, but we have some promising routes for making progress. This talk will outline productive avenues to locate, amplify, support, and build this capacity that exists (sometimes invisibly) in the expertise of your organization. Spoiler: looking closely at the origins, handling, and perception of incidents is part of this story.
Last Year's Tracks
Monday, 2 March
-
Next Generation Microservices: Building Distributed Systems the Right Way
Microservice-based applications are everywhere, but well-built distributed systems are not so common. Early adopters of microservices share their insights on how to design systems the right way.
-
Streaming Data Architectures
Today's systems process huge volumes of continuously changing data. Hear how the innovators in this space are designing systems and leveraging modern data stream processing platforms.
-
Driving Full Cycle Engineering Teams at Every Level
"Full cycle developers" is not just another catch phrase; it's about engineers taking ownership and delivering value, and doing so with the support of their entire organisation. Learn more from the pioneers.
-
When Things Go Wrong: GDPR, Ethics, & Politics
Privacy, confidentiality, safety and security: learning from the frontlines, from both good and bad experiences.
-
JavaScript: Pushing the Client Beyond the Browser
JavaScript is not just the language of the web. Join this track to learn how the innovators are pushing the boundaries of this classic language and ecosystem.
-
Modern CS in the Real World
Head back to academia to solve today's problems in software engineering.
Tuesday, 3 March
-
Architectures You've Always Wondered About
Hard-earned lessons from the names you know on scalability, reliability, security, and performance.
-
The Future of the API: REST, gRPC, GraphQL and More
The humble web-based API is evolving. This track provides the what, how, and why of future APIs.
-
Building High Performing Teams
There are many discussions outlining the secret sauce of high-performing teams. Learn how to balance the essential ingredients of high-performing teams, such as trust and delegation, and how to recognise the pitfalls and problems that will ruin any recipe.
-
Machine Learning: The Latest Innovations
AI and machine learning are more approachable than ever. Discover how ML, deep learning, and other modern approaches are being used in practice.
-
Bare Knuckle Performance
Crushing latency and getting the most out of your hardware.
-
Modern Compilation Targets
Learn about the innovation happening in the compilation target space. WebAssembly is only the tip of the iceberg.
Wednesday, 4 March
-
Growing Unicorns in the EU: Building, Leading and Scaling Financial Tech Start Ups
Learn how EU FinTech innovators have designed, built, and led both their technologies and organisations.
-
Kubernetes and Cloud Architectures
Learn about cloud native architectural approaches from the leading industry experts who have operated Kubernetes and FaaS at scale, and explore the associated modern DevOps practices.
-
Chaos and Resilience: Architecting for Success
Making systems resilient involves people and tech. Learn about strategies being used, from cognitive systems engineering to chaos engineering.
-
Leading Distributed Teams
Remote and distributed working are increasing in popularity, but many organisations underestimate the leadership challenges. Learn from those who are doing this effectively.
-
Scaling Security, from Device to Cloud
Implementing effective security is vitally important, regardless of where you are deploying software applications.
-
Evolving Java
JVM futures, JIT directions and improvements to the runtime stack are the themes of this year’s JVM track.