Track: Architecting for Failure



Complex systems fail in spectacular ways. Failure isn’t a question of if, but when. Resilient systems recover from failure; robust systems resist failure. In this track we’ll hear from experts who have designed systems that shifted from fragility to resilience and robustness in the face of failure. Attendees will learn which architectural patterns and approaches worked and which didn’t, with takeaways that can be applied to their own systems.

Track Host:
Werner Schuster
InfoQ Lead Editor for Functional Programming
Werner Schuster (@murphee) sometimes writes software, sometimes interviews folks about software. His recent interests are languages, performance optimisation, monitoring, and how to make software suck less using computer science research.

10:35am - 11:25am

by Gavin Stevenson
Technology R&D Engineering Lead @WilliamHill

How do you design a system that handles 7,000,000+ product price changes per day, 160TB of data flowing through your network, a peak of 460 transactions per second, and 3 billion transactions per year? We trade globally, taking and settling the millions of bets placed during the Grand National, World Cup, Euros, Cheltenham, Melbourne Cup, and Super Bowl, from the smallest local event to the biggest global one.

How do you make such a system resilient to failure, robust enough to route around slow...

11:50am - 12:40pm

by Richard Kasperowski
Author of The Core Protocols: A Guide to Greatness

Open Space
1:40pm - 2:30pm

by Sid Anand
Data Architect @AgariInc, previously Engineering VP @Etsy, Search Architect @LinkedIn, and Cloud Architect @Netflix

Big Data companies (e.g. LinkedIn, Facebook, Google, and Twitter) have historically built custom data pipelines over bare metal in custom-designed data centers. In order to meet strict requirements on data security, fault-tolerance, cost control, job scalability, and uptime, they need to closely manage their core technology. Like serving systems (e.g. web application servers and OLTP databases) that need to be up 24x7 to display content to users, data pipelines...

2:55pm - 3:45pm

by Sankalp Kohli
Engineer/Lead Cassandra Storage @Apple

Apple runs Cassandra at a very large scale, which leads to some interesting challenges. This talk will cover several of them, including corruption detection during gossip, distributed deletes coupled with corrupt data, and consistent host replacement.

4:10pm - 5:00pm

by Martin Kleppmann
Software Engineer, Author, & Committer to Samza and Avro

For the very simplest applications, a single database is sufficient, and then life is pretty good. But as your application needs to do more, you often find that no single technology can do everything you need to do with your data. And so you end up having to combine several databases, caches, search indexes, message queues, analytics tools, machine learning systems, and so on, into a heterogeneous infrastructure...

Now you have a new problem: your data is stored in several different...

5:25pm - 6:15pm

by Sadek Drobi
Co-founder & CEO

For a service built to handle millions of requests/hour, it's not enough to rely on the latest trendy components or datastores to save you from system failures. Instead, it's necessary to deeply understand the properties and mechanics of your system, and to partition its different dimensions to avoid a domino-style failure cascade.

Partitioning time is about uncoupling subsystems that don't absolutely need to be updated in sync, whereas partitioning space is achieved by separating...



Monday, 7 March

Tuesday, 8 March

Wednesday, 9 March