You are viewing content from a past/completed QCon

Presentation: Monitoring All the Things: Keeping Track of a Mixed Estate

Track: Next Generation Microservices: Building Distributed Systems the Right Way

Location: Fleming, 3rd flr.

Duration: 4:10pm - 5:00pm

Day of week: Monday

This presentation is now available to view on InfoQ.com

What You’ll Learn

  1. Hear about monitoring mixed environments where newer systems run alongside legacy ones.
  2. Learn how to discuss monitoring issues with the various teams across the organization.

Abstract

Monitoring all of a team’s systems can be tricky when you have a microservice architecture. But what happens when you have many teams, each building systems using totally different technology stacks? Add in decades of legacy systems and a sprinkling of third-party tools and you’ve got plenty of fun in store. Discover how to approach monitoring an estate of many technologies and find out what the Financial Times did to improve visibility across systems built by all its teams.

Question: 

What is the work you're doing today?

Answer: 

I'm a Principal Engineer on the FT's reliability engineering team. Our main goal is to help the other teams around the business build stuff that is secure and reliable. That involves building tools for them, as well as a lot of talking to people and giving them advice on what approaches to take. I do a mixture of coding, tech leading and having discussions with other teams about what we build.

Question: 

Do you work with monitoring, using specific tools there, or is it about coding integrations?

Answer: 

We have a range of tools across the FT, including some older tools like Nagios, which we still support for the older systems. Newer stuff tends to use things like CloudWatch and Graphite/Grafana and also Pingdom. We also have some internal tools.

Question: 

What can people expect from this talk?

Answer: 

I've been to talks before about monitoring, and often they focus very much on a single consistent estate: be that running in the same container platform or all using the same programming language. There are lots of nice, neat tricks for monitoring things when they're all consistent. But the problem I've often faced is that in an organization that has more than one team, especially where each team has its own autonomy, you end up with vastly different estates. We're not a startup. We've been around for a hundred and fifty years or more. We have legacy tech systems that we still need to support, that are still critical to the business. I want to talk about how you bring those different things together so that you can support the old and the new and the variety that you get in a real working organization.

Question: 

When you say legacy technology, are you talking about mainframes or the older stuff?

Answer: 

We're talking about some stuff that's been deployed to physical racks that are sitting in a data center that we run ourselves. Even up until a few months back, we had stuff running in the office, but a recent office move means we finally migrated all that stuff off. There's older stuff that isn't the best understood throughout the company, but it's still very important to our operations. A variety of different systems in different languages, and I don't really know what the oldest one is, to be honest.

Question: 

What are some of the challenges that you encounter?

Answer: 

One of the biggest challenges is looking at old monitoring systems and trying to understand the nuances. People tend to understand "this is working" and "this is broken", but a lot of things have interesting failure states, and you need to understand which failure states you should alert on and which you shouldn't. What does it mean when something says it's a warning instead of an error? Different systems have a different understanding of those things, and you're trying to bring them all together. You want the same user experience regardless of which monitoring system an alert came from. And I think that's actually the tricky bit: talking to all the different teams to understand what they mean by good and what they mean by a failure.
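
As an illustration of bringing different systems' states together, here is a minimal Python sketch. It is not the FT's implementation. The shared Status vocabulary, the MAPPINGS table and the normalise function are all hypothetical names, and the choice to treat CloudWatch's INSUFFICIENT_DATA as "unknown" is an assumption that each team would need to agree on.

```python
from enum import Enum


class Status(Enum):
    """Hypothetical shared vocabulary used for every monitoring source."""
    OK = "ok"
    WARNING = "warning"
    ERROR = "error"
    UNKNOWN = "unknown"


# Nagios plugin exit codes: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
NAGIOS = {
    0: Status.OK,
    1: Status.WARNING,
    2: Status.ERROR,
    3: Status.UNKNOWN,
}

# CloudWatch alarm states. Mapping INSUFFICIENT_DATA to UNKNOWN is an
# assumption; some teams might prefer to treat it as a warning.
CLOUDWATCH = {
    "OK": Status.OK,
    "ALARM": Status.ERROR,
    "INSUFFICIENT_DATA": Status.UNKNOWN,
}

# One table per source keeps each team's definition of "good" and
# "failure" explicit and easy to review.
MAPPINGS = {"nagios": NAGIOS, "cloudwatch": CLOUDWATCH}


def normalise(source, raw_state):
    """Translate a source-specific state into the shared Status.

    Unrecognised sources or states become UNKNOWN rather than raising,
    so a new or misconfigured check is visible instead of silently lost.
    """
    return MAPPINGS.get(source, {}).get(raw_state, Status.UNKNOWN)
```

With this in place, normalise("nagios", 2) and normalise("cloudwatch", "ALARM") both come out as Status.ERROR, so a dashboard built on top can present them identically. The difficult part remains the conversation with each team about what belongs in the tables.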

Question: 

What do you want people to take away?

Answer: 

I want them to take away ideas about how they can approach these problems. There isn't one solution that's going to fit every organization. But I want them to have an idea of what they need to think about, the things that might trip them up, and useful techniques that they can apply to multiple systems, so they can start to bring these things together and have a meaningful conversation with people around the organization about what they want to do.

Speaker: Luke Blaney

Principal Engineer Operations and Reliability Programme @FT

Luke has worked for the Financial Times since 2012 as a Developer and then Platform Architect. He is now a Principal Engineer on their Reliability Engineering team, tasked with improving operational resilience and reducing duplication of tech effort across the company.
