You are viewing content from a past/completed QCon

Presentation: How Condé Nast Succeeds by a Culture That Embraces Failure

Track: Architecting for Failure: Chaos, Complexity, and Resilience

Location: Churchill, G flr.

Duration: 2:55pm - 3:45pm

Day of week: Wednesday


This presentation is now available to view on InfoQ.com


What You’ll Learn

  1. Find out how Condé Nast developed a culture that embraces possible failure.
  2. Learn some of the arguments to use to advocate for Chaos Engineering.
  3. Hear about some of the tooling to leverage in Chaos Engineering.

Abstract

Systems architectures are increasingly diverse to serve the growing demands for scalability, fault tolerance, isolation, and extensibility. But the trade-off is ever more complex software to operate and maintain, often with no single shared view of the entire design. This is especially true with the prevalence of microservices architectures and a growing reliance on vendor capabilities that are largely out of our control. While errors and incidents cannot be completely eradicated from our systems, we can at least build for resilience and adaptability. Rigorous experimentation, as a cultural practice and habit, can identify constraints in the current design and inform predictions about the emergence of newer patterns that handle failures gracefully, such as preventing failure cascades. Another important benefit is aligning people’s mental models of how the software is designed and operated. Crystal will walk through what her teams have learned and adapted by building a culture that embraces failure, with Chaos Engineering practices as a daily routine, for their platforms at Condé Nast International, which currently serve in excess of 220 million unique users every month across the globe.

Question: 

Tell me a bit about the work that you do.

Answer: 

I'm the Director of Engineering and Cloud Platforms. I oversee the whole software engineering function at Condé Nast International, which is better known for its portfolio of magazines such as Vogue, GQ, Wired, Vanity Fair and Glamour. It's an international company: we have operations in 11 different countries around the world in Asia, Europe and South America. We also have many licensee countries running the same publications, making 28 countries in total. We have distributed engineering teams in all 11 countries, ranging from 7 to 30 engineers in each location, as well as a recently established headquarters in London. The London HQ began just over two years ago, and I have grown the Engineering function there from 4 engineers to 65, which is sizeable growth over two years. I have nearly 20 years in the software industry. Before my current role as VP, I oversaw multiple teams as a Technical Lead and Principal Engineer at the BBC for many years. I've worked in almost all types of engineering, including back-end, front-end, operations, and platform engineering. I believe this holistic experience has given me a deep understanding of the practice and of how to manage large interacting systems, teams and disciplines.

Question: 

What's the TLDR for your talk?

Answer: 

The talk shares practical experience of first understanding Resilience Engineering, and then of getting practices and techniques such as Chaos Engineering adopted in your own workplace. Adoption is hard, but what’s harder is establishing this culture for the long term, so I’ll also discuss strategies for doing this. I will explain what we have done at Condé Nast, using real-world examples from a media / publishing company that has managed to adopt these practices. Along the way we'll talk about how we've used certain techniques around Chaos Engineering and what we've done in terms of setting up our observability practices to match.

Question: 

Was Chaos Engineering a top-down or bottom-up push?

Answer: 

I would say that I advocated heavily internally, especially for sponsorship from the executive level. I had used some of these practices and techniques at a previous company and had seen the huge benefits we reaped, so I was keen to establish this culture early on at Condé Nast. I like to use a technique called “nudging”, such as sending out links and interesting talks that people can go and watch. I started speaking to the software engineers, to people in Technology Service Operations, and to other parts of Technology to ensure a good foundational understanding was being applied across the business. There were a couple of other people who were really enthusiastic and passionate about it as well, particularly within engineering, and who also did the advocacy once the ball got rolling. This is ideal. I would prefer it to be more grassroots, but often those efforts will only have limited influence without sponsorship from above.

Question: 

Can you give me an example of how you got consensus on moving towards Chaos Engineering?

Answer: 

When we got to a point where we were beginning to launch new websites and services, it became apparent that we needed to strengthen our resilience both within our systems and our teams. We also wanted to ensure we were effectively communicating, planning, architecting and operating with other parts of Technology and the wider business.

Having some experience in resilience engineering made it easier for me to influence my peers and those above to build in time, and invest in, ensuring we could recover from failure. Failure is inevitable after all. How you mitigate and respond to it organisationally is what can set companies apart.

We also did data analysis over a period of about three months to see how our applications and services were performing, so that we had some hard evidence. We put a heavy emphasis on making sure all of our runbooks followed a standard template. We began doing role-playing Game Days to simulate failures hypothetically, and then tried to follow the processes, the artifacts such as runbooks and diagrams, the metrics, and the communications to reach a speedy resolution. We did this many times before actually breaking anything, and I would suggest this to anyone starting out, as these steps alone have huge benefits.

Finally, we moved on to running Chaos Experiments, not just on staging but in production as well.
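As an illustration only, the general shape of a chaos experiment (confirm normal behaviour, inject a fault, check whether normal behaviour holds, then always roll back) might be sketched as below. This is a minimal sketch and not Condé Nast's tooling: `run_experiment`, `ExperimentResult`, and the toy `replicas` service are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    """Outcome of one hypothetical chaos experiment."""
    name: str
    steady_before: bool   # was the system healthy before the fault?
    steady_during: bool   # did it stay healthy while the fault was active?
    rolled_back: bool     # was the fault removed afterwards?

    @property
    def hypothesis_held(self) -> bool:
        return self.steady_before and self.steady_during


def run_experiment(
    name: str,
    steady_state: Callable[[], bool],
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    """Run one experiment: check, break, re-check, and always roll back."""
    # Never inject a fault into a system that is already unhealthy.
    if not steady_state():
        return ExperimentResult(name, False, False, False)
    inject_fault()
    try:
        steady_during = steady_state()
    finally:
        # Roll back even if the steady-state check itself raises.
        rollback()
    return ExperimentResult(name, True, steady_during, True)


# Toy stand-in for a real service: two replicas, healthy if any one is up.
replicas = {"a": True, "b": True}


def healthy() -> bool:
    return any(replicas.values())


def kill_a() -> None:
    replicas["a"] = False


def restore_a() -> None:
    replicas["a"] = True


result = run_experiment("kill one replica", healthy, kill_a, restore_a)
print(result.name, "-> hypothesis held:", result.hypothesis_held)
```

In a real setting the steady-state check would read from production metrics and the fault would be injected by dedicated chaos tooling; the callables here only stand in for those pieces.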

Question: 

Who are you targeting in the talk?

Answer: 

This talk is aimed at people who want to do the advocacy and get the buy-in, but perhaps don't know how. There will be practical advice on how to prove that this is a good idea for the company. I will also reveal some of the instrumentation, metrics and chaos tooling we use.

Question: 

What are some other key takeaways that you think the talk will offer?

Answer: 

It will show the tooling that we have implemented, what worked and what didn't, and the way that we've extended some of the tooling to work in our particular environment and context. I'll talk about the observability tools and practices that we've implemented, because things like tracing are not well implemented in a lot of companies.

Speaker: Crystal Hirschorn

VP Engineering, Global Strategy & Operations @CondeNast

Crystal Hirschorn is currently VP Engineering, Global Strategy & Operations at Condé Nast, which is best known for its portfolio of global brands including Vogue, Wired, Vanity Fair, The New Yorker and many more. She oversees a globally distributed engineering organisation and leads the technical strategy for building unified technology platforms deployed across the globe to meet the demands of more than 450 million monthly users. She led the teams at Condé Nast which built and deployed a Kubernetes platform that now operates globally.
 
She has nearly two decades in the Technology industry, spending the majority of her career in the Media, Technology and Publishing sectors. Most of that time has been as a hands-on software engineer, with more than 15 years' experience tackling the challenges of building complex system architectures and scaling software and infrastructure. She’s a resilience engineering advocate and a long-time practitioner of Lean, XP and DevOps practices to create a successful engineering culture.

