Presentation: Pragmatic Resiliency: Super 6 & Sky Bet Evolution

Track: Architecting for Failure

Location: Fleming, 3rd floor

Duration: 1:40pm - 2:30pm

Day of week: Wednesday

Level: Intermediate - Advanced


What You’ll Learn

  1. Hear how Sky Bet deals with scalability and resilience issues.

  2. Learn how to improve a system to make it more resilient and cope with failure.

  3. Find out about some of the tradeoffs involved in improving a platform while staying competitive.

Abstract

Sky Sports Super 6 is a free football results prediction game, launched in 2008. It's extremely popular, with over a million entries per week, and drives a substantial proportion of our traffic at peak time, putting heavy load on our login/single sign-on systems. This talk will focus on the reality of adapting a complex set of interacting, highly coupled applications to make them more resilient and better able to cope with failure. I'll also discuss how to manage the trade-offs of improving a technology platform while delivering to a business in a hugely competitive environment with ever growing customer demand.

Question: 

What is your talk about?

Answer: 

You go to a lot of conferences and you hear people from Google or Netflix talking about reactive architectures or the Simian Army or whatever, and it all feels quite unattainable for a lot of people. It's like this big complicated thing, and most systems aren't much like those systems. Sky Bet has changed quite a lot over the last few years; it has scaled constantly. But it's still an architecture that a lot of people will recognize. It's got a big beefy database in the middle, some older code and stored procedures, and a bunch of shiny stuff around the edges that we've built over the years. But it wouldn't know a Simian Army if it tripped over one, to be honest.

How do you bring that idea of architecting for failure to a fairly organically grown architecture with a lot of historical baggage? I suspect more people in the room will recognize our architecture because it's more similar to what they work with. And how do you make incremental steps to improve that architecture?

So, I'll look at some specific situations and scenarios where we've been worried about particular failures happening, and where we've made changes to try to mitigate their impact or avoid them happening altogether. I'll talk about how we've done that, and then finish off with how organizational structure, strategy and process play into this. Unless you have an organizational approach to these things, the technology side of it probably is not going to bail you out. You've got to have people taking ownership, you've got to have people caring about it, and not just on the technology side: across the whole business there has to be clarity on what you're trying to achieve, or you probably are not going to achieve it. I'll probably finish off with a little bit more on engineering culture, but from the perspective of how you drive awareness of the need to think about failure. How do you identify what you should be working on when there are probably a hundred things you could work on that would make the system better? How do you choose those things and manage them in a more rational way?

Question: 

Can you give me an example of one of the things that you might discuss?

Answer: 

We have an account system that's part of our monolith at the bottom, and several times we experienced failures caused by resource starvation at the API layer, because its worker processes were being tied up by slow-running requests to payment providers. What was happening was that slow payment requests were stacking up and consuming all of the processes in that API layer, and eventually that cascades up into the higher levels, consuming their resources, until your website stops working. What I'm going to talk about is how, despite the fact that we have this monolith and database in the middle, there are still ways to segment the resource usage between these different use cases. Yes, your payments might fall over, but the rest runs OK; people can still use the rest of the services. I'll talk about how we did that and how we fixed it.
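To make that idea of segmenting resource usage concrete, here is a minimal bulkhead sketch in Python. The class, slot counts and handler names are assumptions made for this write-up, not Sky Bet's actual code; the point is simply that slow payment-provider calls can only ever occupy their own bounded partition, so they can't starve the capacity serving everything else.

    import threading

    # Illustrative bulkhead (names and sizes are assumptions, not Sky Bet's
    # implementation). Each class of work gets its own bounded set of slots,
    # so slow payment-provider calls can saturate their own partition without
    # consuming the workers that serve login, betting and account pages.

    class Bulkhead:
        def __init__(self, name, max_concurrent):
            self.name = name
            self._slots = threading.BoundedSemaphore(max_concurrent)

        def call(self, fn, *args, **kwargs):
            # Fail fast when the partition is saturated instead of queueing
            # behind a slow downstream dependency - that queueing is what
            # cascaded up through the higher layers.
            if not self._slots.acquire(blocking=False):
                raise RuntimeError(f"{self.name} bulkhead full")
            try:
                return fn(*args, **kwargs)
            finally:
                self._slots.release()

    payments = Bulkhead("payments", max_concurrent=5)  # deliberately small

    def deposit(call_payment_provider):
        try:
            # The downstream call should also carry its own client timeout so
            # a single slot is never held indefinitely by a hung provider.
            return payments.call(call_payment_provider)
        except RuntimeError:
            # Payments degrade, but the rest of the site keeps its capacity.
            return {"status": "payments_unavailable", "retry_later": True}

In a pre-fork web tier the same effect can come from routing payment traffic to its own small pool of workers; the code form matters less than having a hard cap per use case.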

It's very busy beforehand, the busiest time of the year beforehand, and there was a challenge to scale into that. It's very, very busy, but it's handleable. But what happens at the end of the race is different. A huge proportion of the people that placed a bet comes back all at once at the end of the race. These people aren't regular bettors for start, they don't even know whether they won or not. It's a big race with 40 horses in it. It's fun to come and have a look. And they all hit the same point of the site, which is really an expensive part of the site, pulling out their transaction history from the database. All at the same time. The site is struggling to cope at that point in the day.

I'm going to talk about that we don't try to cope with that. We actually incrementally banner parts of the service because during the first couple of minutes they can go to their account history, but the bet hasn't even been settled yet. So there is no information there that's useful to them. So we just put a banner up on that element at the service that says, we're still evaluating the race, come back in a few minutes. Then we incrementally start letting people through that banner. We take the banner down gradually. That protects that slow database service. There's a few reasons that we do it this way, one of which is that when it does fail,recovering all of the services, all of the processes of these different layers, you have to restart a whole bunch of stuff, and it takes a while to bring everything back up under the pressure of all the traffic. If you don't let it fall over, it becomes accessible much more quickly to people. But there's a flip side to it, if you're not careful with the service that's holding the banner people can go away until they get results. What we've seen is on some occasions we've held that banner for too long, and actually makes the situation worse. So it's a delicate balance between when you let people back in vs, not.

We could choose to scale out sufficiently to handle this. There are ways of doing that: we could have a separate database read projection, or use AWS. It's not that difficult a problem to solve. But it's only for three minutes; it's just not worth it. Sometimes you don't need to do anything terribly complicated. Don't get obsessed with some fancy technology solution.

Speaker: Michael Maibaum

Chief Architect @SkyBet

 

Michael Maibaum is chief architect at Sky Betting & Gaming. Michael started out as a geneticist and molecular biologist, moving from wet-lab experiments to bioinformatics, then manufacturing systems, and telecoms in various engineering and architecture roles. He is interested in solving problems, big data, scalable systems, company culture and ways of working, open source, Agile, science, genetics, and photography.
 

