Presentation: Pragmatic Resiliency: Super 6 & Sky Bet Evolution
What You’ll Learn
- Hear how Sky Bet deals with scalability and resilience issues.
- Learn how to improve a system to make it more resilient and cope with failure.
- Find out what some of the tradeoffs can be when improving a platform while staying competitive.
Abstract
Sky Sports Super 6 is a free football results prediction game, launched in 2008. It’s extremely popular, with over a million entries per week, and drives a substantial proportion of our traffic at peak time, putting heavy load on our login/single sign-on systems. This talk will focus on the reality of adapting a complex set of interacting, highly coupled applications to make them more resilient and better able to cope with failure. I’ll also discuss how to manage the trade-offs of improving a technology platform while delivering to a business in a hugely competitive environment with ever-growing customer demand.
What is your talk about?
You go to a lot of conferences and you hear people from Google or Netflix talking about reactive architectures or the Simian Army or whatever, and it all feels quite unattainable for a lot of people. It's this big, complicated thing, and not many systems look like theirs. Sky Bet has changed quite a lot over the last few years, and it has scaled constantly. But it's still an architecture that a lot of people will recognize. It's got a big beefy database in the middle of some older code and stored procedures, and a bunch of shiny stuff around the edges that we've added over the years. But it wouldn't know a Simian Army if it tripped over one, to be honest.
How do you bring that idea of architecting for failure to a fairly organically grown architecture with a lot of historical baggage? I suspect more people in the room will recognize our architecture, because it's more similar to what they work with. And how do you make incremental steps to improve the architecture?
So I'll look at some specific situations and scenarios where we've had particular failures that we've been worried about, and where we've made changes to mitigate their impact or to avoid them happening at all. I will talk about how we've done that, and then finish off with how organizational structure, strategy, and process play into this. Unless you have an organizational approach to these things, the technology side of it is probably not going to bail you out. You've got to have people taking ownership, you've got to have people caring about it, and not just on the technology side: across the whole business there has got to be clarity on what you're trying to achieve, or you are probably not going to achieve it. I'm probably going to finish off with a little bit more on engineering culture, but from the perspective of how you drive awareness of the need to think about failure. How do you identify what you should be working on when there are probably a hundred things you could work on that would make the system better? How do you choose those things and manage them in a more rational way?
Can you give me an example of one of the things that you might discuss?
We have an account system that's part of our monolith at the bottom, and several times we experienced a set of failures with resource starvation at the API layer, because worker processes were being consumed by slow-running requests that were interacting with payment providers. What was happening was that slow payment requests were stacking up and consuming all of the processes in that API layer, and eventually that cascades up into the higher levels, consuming the resources of the upper levels until your website stops working. What we're going to talk about is how, despite the fact that we have this monolith and database in the middle, there are still ways to segment the resource usage between these different use cases. Yes, your payments might fall over, but the rest runs OK, and people can still use the rest of the services. I'll talk about how we did that and how we fixed it. The other one I was thinking of talking about is around very large events. The biggest horse race in the country happens every year, it's an absolutely huge event for us, and we see a huge stampede of traffic, specifically at the end of the race when people come in to see the result.
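As an illustration of the payment isolation described above, here is a minimal sketch of the general bulkhead idea: each use case gets its own capped share of workers, so slow payment calls cannot starve everything else. This is not Sky Bet's implementation, which the interview does not detail. The `Bulkhead` class, the `handle` helper, and the 3/7 split of a 10-worker pool are all hypothetical names and numbers chosen for the example.

```python
import threading
from contextlib import contextmanager


class BulkheadFull(Exception):
    """Raised when one workload has used up its own share of workers."""


class Bulkhead:
    """Caps how many requests of one kind may run at the same time."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self):
        # Fail fast instead of queueing: a queued request would still tie
        # up a worker, which is the starvation we are trying to avoid.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull(self.name)
        try:
            yield
        finally:
            self._slots.release()


# Hypothetical split of a 10-worker API layer (illustrative numbers only).
payments = Bulkhead("payments", max_concurrent=3)
everything_else = Bulkhead("other", max_concurrent=7)


def handle(bulkhead, work):
    """Run work inside its bulkhead, or return a degraded response."""
    try:
        with bulkhead.slot():
            return work()
    except BulkheadFull:
        return f"{bulkhead.name} temporarily unavailable"
```

With this shape, a stalled payment provider can occupy at most three workers. A fourth payment request is rejected immediately, while `handle(everything_else, ...)` keeps serving the rest of the site.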
It's very busy beforehand, the busiest time of the year, and it's a challenge to scale into that. It's very, very busy, but it's handleable. What happens at the end of the race is different. A huge proportion of the people who placed a bet come back all at once at the end of the race. These people aren't regular bettors for a start, and they don't even know whether they've won or not. It's a big race with 40 horses in it. It's fun to come and have a look. And they all hit the same point of the site, which is a really expensive part of the site, pulling their transaction history out of the database. All at the same time. The site struggles to cope at that point in the day.
I'm going to talk about how we don't try to cope with that. We actually incrementally banner parts of the service, because during the first couple of minutes they can go to their account history, but the bet hasn't even been settled yet, so there is no information there that's useful to them. So we just put a banner up on that element of the service that says we're still evaluating the race, come back in a few minutes. Then we incrementally start letting people through that banner. We take the banner down gradually. That protects that slow database service. There are a few reasons that we do it this way, one of which is that when it does fail, recovering all of the services and all of the processes in these different layers means you have to restart a whole bunch of stuff, and it takes a while to bring everything back up under the pressure of all the traffic. If you don't let it fall over, it becomes accessible to people much more quickly. But there's a flip side: if you're not careful with the service that's holding the banner, people can go away until they get results. What we've seen is that on some occasions we've held that banner for too long, and it actually makes the situation worse. So it's a delicate balance between when you let people back in versus not.
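One way to picture "taking the banner down gradually" is a time-based ramp combined with a stable per-customer bucket, sketched below. The interview does not say how the banner is actually controlled, so everything here is an assumption: the function names, the hash-based bucketing, and the 60-second hold and 120-second ramp are illustrative only.

```python
import hashlib


def admit_fraction(seconds_since_finish, hold_seconds=60, ramp_seconds=120):
    """Fraction of customers let past the banner at a given moment."""
    if seconds_since_finish < hold_seconds:
        # Bets are not settled yet, so everyone sees the banner.
        return 0.0
    elapsed = seconds_since_finish - hold_seconds
    return min(1.0, elapsed / ramp_seconds)


def bucket(customer_id):
    """Map a customer to a stable value in [0, 1)."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 2**32


def sees_banner(customer_id, seconds_since_finish):
    """True while this customer should still be held behind the banner."""
    return bucket(customer_id) >= admit_fraction(seconds_since_finish)
```

Because each customer's bucket never changes and the admitted fraction only grows, a customer who has been let through is not pushed back behind the banner on their next refresh. Shortening `hold_seconds` or `ramp_seconds` is the lever for the trade-off described above between protecting the database and holding people for too long.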
We could choose to scale out sufficiently to do this. There are ways of doing that. We could have a separate database read projection or use AWS. It's not that difficult a problem to solve. But it's only for three minutes. It's just not worth it. Sometimes you don't need to do anything terribly complicated. Don't get obsessed with some fancy technology solution.