You are viewing content from a past/completed QCon

Track: Architecting for Failure: Chaos, Complexity, and Resilience

Location: Whittle, 3rd flr.

Day of week: Wednesday

Making systems resilient involves people and tech. Learn about the strategies being used, from chaos testing to distributed system clustering.

Track Host: Nicki Watt

Chief Technology Officer @OpenCredo

Nicki Watt currently serves as Chief Technology Officer of OpenCredo, a pragmatic, hands-on software consultancy with specialisms in data engineering, ML and cloud native solutions. Her technical career has seen her wear many hats, from Engineer and Systems & Technical Architect to Consultant and now CTO. She is a techie at heart, with involvement in the development, delivery and leadership of large-scale platform and application development projects. Nicki is also co-author of the graph database book Neo4j in Action.

10:35am - 11:25am

Building Resilient Serverless Systems

In this brave new world of serverless, we entrust our vendors with keeping the infrastructure up and running. However, when even cloud behemoths like Amazon Web Services and Google Cloud have outages and failures, how can we build resilient systems?

 

John Chapin explains how to use serverless technologies and an infrastructure-as-code approach to architect, build, and operate large-scale systems that are resilient to vendor failures, even while taking advantage of fully managed vendor services and platforms. He then leads an end-to-end demo of the resilience of a well-architected serverless system in the face of massive simulated failure. He further demonstrates how the system not only provides resilience to failure, but also has the side effect of improving the end-user experience.

 

Finally, John discusses some of the drawbacks and idiosyncrasies of the approach. All source code, infrastructure templates, and slides will be available for the audience to download and explore. While the examples largely focus on AWS (including API Gateway, CloudFormation, DynamoDB, Lambda, and Route 53), the techniques discussed are broadly applicable across cloud vendors.

Johnathan Chapin, Cloud Technology Consultant with expertise in Serverless Computing

11:50am - 12:40pm

Learning From Chaos: Architecting for Resilience

In this talk Russ Miles, CEO of ChaosIQ, will share how leading organisations are successfully adopting chaos engineering to encourage a mindset of "architecting for resilience". Through chaos engineering, architects are able to establish a true "learning system" where everyone is involved in exploring how their systems can improve through embracing failure.

 

Drawing from a collection of real-world examples and experience reports, Russ will show how you can set up your systems to learn from controlled failure and make resilience an important competitive edge for your organisation.

Russell Miles, CEO of @chaosiqio

1:40pm - 2:30pm

An Engineer's Guide to a Good Night's Sleep

As organisations look to empower engineers more and embrace DevOps practices, we have seen the support role change quite a bit too. Developers are moving from being purely third-line support to working more collaboratively with engineers and operational staff. Also, as we move to cloud native microservice solutions, the increased complexity and diversity of our production landscape means operational staff may well rely more heavily on the engineers, in particular out of hours.

 

I have spent the last 18 years working across a plethora of industries, utilising a myriad of technologies and approaches. Having worked on everything from trading applications to content enrichment APIs, I have seen many approaches and processes that try to minimise operational support for developers.

 

In this talk, I will explore and discuss some of my top approaches and techniques to help reduce the risk of that dreaded 3am call! You will gain some practical insight into how to handle failure in today's more complex distributed microservice systems. This will include looking at approaches to resiliency, understanding your system, understanding the requirements for fault tolerance, and the developers' mindset necessary for this. I will be peppering this talk with real-world examples, and an occasional war story along the way too.

Nicky Wrightson, Principal Engineer @riverisland

2:55pm - 3:45pm

How Condé Nast Succeeds by a Culture That Embraces Failure

Systems architectures are increasingly diverse to serve the growing demands for scalability, fault tolerance, isolation, and extensibility. But the compromise is ever more complex software to operate and maintain, often with no single shared view of the entire design. This is especially true with the prevalence of microservices architectures and a growing reliance on vendor capabilities, which are largely out of our control.

While errors and incidents themselves cannot be completely eradicated from our systems, we can at least build for resilience and adaptability. Experimentation rigour as a cultural practice and habit can identify constraints in the current design, with predictions about the emergence of newer patterns to handle failures gracefully, such as preventing failure cascades. Another important benefit is aligning people's mental models of how the software is designed and operated.

Crystal will walk through learnings from building a culture that embraces failure through Chaos Engineering practices as a daily routine, and what her teams have learned and adapted for their platforms at Condé Nast International, which currently serve in excess of 220 million unique users every month across the globe.

Crystal Hirschorn, Director of Engineering and Cloud Platforms @CondeNast

4:10pm - 5:00pm

Amplifying Sources of Resilience: What Research Says

Building robust software systems means anticipating how failures may occur with components and subsystems and developing answers to the question:

“What is needed for the design of systems that prevents or limits catastrophic failure?”

Investing in, developing, and sustaining the adaptive capacity to cope with unexpected situations is at the core of Resilience Engineering. In the software community, this means developing (continually!) ever-better answers to the question:

“When our preventative designs fail us, what are ways that teams of engineers successfully anticipate, resolve, and learn from those catastrophes?”

The Resilience Engineering community has been studying how people in high-consequence/high-tempo domains answer this latter question. Applying Resilience Engineering thinking and paradigms to the world of software engineering and operations is still in its infancy, but we have some promising routes for making progress. This talk will outline productive avenues to locate, amplify, support, and build this capacity that exists (sometimes invisibly) in the expertise of your organization. Spoiler: looking closely at the origins, handling, and perception of incidents is part of this story.

John Allspaw, DevOps/Resilience Engineering Thought Leader, Previously CTO @Etsy & Co-founder of @AdaptiveCLabs

Tracks

  • Architectures You've Always Wondered About

    Hard-earned lessons from the names you know on scalability, reliability, security, and performance.

  • Machine Learning: The Latest Innovations

    AI and machine learning are more approachable than ever. Discover how ML, deep learning, and other modern approaches are being used in practice.

  • Kubernetes and Cloud Architectures

    Practical approaches and lessons learned for deploying systems into Kubernetes, cloud, and FaaS platforms.

  • Evolving Java

    JVM futures, JIT directions and improvements to the runtime stack are the themes of this year’s JVM track.

  • Next Generation Microservices: Building Distributed Systems the Right Way

    Microservice-based applications are everywhere, but well-built distributed systems are not so common. Early adopters of microservices share their insights on how to design systems the right way.

  • Chaos and Resilience: Architecting for Success

    Making systems resilient involves people and tech. Learn about strategies being used, from cognitive systems engineering to chaos engineering.

  • The Future of the API: REST, gRPC, GraphQL and More

    The humble web-based API is evolving. This track provides the what, how, and why of future APIs.

  • Streaming Data Architectures

    Today's systems move huge volumes of data. Hear how the innovators in this space are designing systems and leveraging modern data stream processing platforms.

  • Modern Compilation Targets

    Learn about the innovation happening in the compilation target space. WebAssembly is only the tip of the iceberg.

  • Leaving the Ivory Tower: Modern CS Research in the Real World

    Thoughts pushing software forward, including consensus, CRDTs, formal methods & probabilistic programming.

  • Bare Knuckle Performance

    Crushing latency and getting the most out of your hardware.

  • Leading Distributed Teams

    Remote and distributed working are increasing in popularity, but many organisations underestimate the leadership challenges. Learn from those who are doing this effectively.

  • Full Cycle Developers: Lead the People, Manage the Process & Systems

    "Full cycle developers" is not just another catch phrase; it's about engineers taking ownership and delivering value, and doing so with the support of their entire organisation. Learn more from the pioneers.

  • JavaScript: Pushing the Client Beyond the Browser

    JavaScript is not just the language of the web. Join this track to learn how the innovators are pushing the boundaries of this classic language and ecosystem.

  • When Things Go Wrong: GDPR, Ethics, & Politics

    Privacy, confidentiality, safety and security: learning from the frontlines, from both good and bad experiences.

  • Growing Unicorns in the EU: Building, Leading and Scaling Financial Tech Start Ups

    Learn how EU FinTech innovators have designed, built, and led both their technologies and organisations.

  • Building High Performing Teams

    To have a high-performing team, everybody on it has to feel and act like an owner. Learn about cultivating culture, creating psychological safety, sharing the vision effectively, and more.

  • Scaling Security, from Device to Cloud

    Implementing effective security is vitally important, regardless of where you are deploying software applications.