Presentation: Microsoft Cloud's Frontdoor: Building a Global API

Duration: 1:40pm - 2:30pm

Key Takeaways

  • Learn about the scale and architecture of Microsoft’s Azure Management Gateway.
  • Follow the life of a request as it moves through Azure, and learn the strategies in place to prevent downtime for the services.
  • Understand how Azure APIs are built for high availability and for data sovereignty, from a key Microsoft Azure architectural resource.

Abstract

All of Microsoft Azure’s management API requests pass through a single “frontdoor” service that handles routing and common functionality. The frontdoor service proxies requests to over 50 different Microsoft products (SQL, Virtual Machines, Dynamics, Active Directory, etc.) and is a mainline dependency for all service provisioning and management. As a result, downtime must be avoided at all costs, and the API needs points of presence around the globe. Previously, this service was deployed to only a single geographic region, which caused three problems: (1) it hurt performance for our large international customer base; (2) it introduced a single point of failure; and (3) it forced metadata from other regulatory regions to pass through the United States.

Over the past few years, Microsoft has transitioned to a fully geo-distributed deployment for this frontdoor service (present in 20+ datacenters in North & South America, Europe, Asia and Australia). In each major area, there are multiple active datacenters that can automatically handle failovers due to physical or software faults. All metadata is also replicated using a simple table store that handles multi-master writes and asynchronous replication. This data store uses simple, common components (key/value store + queues) to appear consistent to our customers but is incredibly resilient to any number of common faults.
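The replication design described above (a key/value table per region, multi-master writes, and queues that ship changes asynchronously to peers) can be illustrated in a few lines. The sketch below is a minimal illustration of that pattern, not the actual Azure store; the class names, timestamps, and last-writer-wins conflict rule are assumptions.

import time
from collections import deque


class RegionalTableStore:
    """A per-region key/value table plus an outbound replication queue."""

    def __init__(self, region):
        self.region = region
        self.rows = {}           # key -> (commit timestamp, value)
        self.outbox = deque()    # replication events waiting to ship to peers

    def write(self, key, value):
        # Commit locally first (multi-master: every region accepts writes),
        # then enqueue the change for asynchronous replication.
        event = (time.time(), self.region, key, value)
        self.rows[key] = (event[0], value)
        self.outbox.append(event)

    def apply_replicated(self, event):
        # Last-writer-wins: only apply the event if it is newer than what
        # this region already holds for the key.
        ts, _origin, key, value = event
        current = self.rows.get(key)
        if current is None or ts > current[0]:
            self.rows[key] = (ts, value)


def replicate(source, targets):
    """Drain one region's outbox into its peers (in practice a background job)."""
    while source.outbox:
        event = source.outbox.popleft()
        for target in targets:
            target.apply_replicated(event)

In a real deployment the outbox would be a durable queue and conflict resolution would have to tolerate clock skew; the point of the abstract's claim is that each piece is a simple, well-understood component.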

This session will cover the process of moving to this geo-distributed architecture, as well as relatable techniques for achieving high availability (via active-active-active deployments). The techniques should be transferable to public or private clouds.

Interview

Question: 
QCon: What is your role at Microsoft? ...and what was your path to join?
Answer: 
Charles: I founded a company called MetricsHub, a big data analytics startup around cost management and monitoring of data centers. It very quickly got to over 1,000 customers. That company was acquired by Microsoft, and I was then brought in to integrate the platform. So that is actually how I joined Microsoft.
Then from there, I bounced through a few different teams helping bring the Azure management capabilities up to snuff. I developed the Azure equivalents of AWS’s CloudFormation, IAM, and EC2 tags/billing capabilities. I basically did a bunch of the management capabilities. The latest thing I’ve worked on is Azure Resource Manager, which is the management API for all of our cloud offerings.
Question: 
QCon: What is a bad day at Microsoft when you are operating Azure?
Answer: 
Charles: The day is bad when I get a phone call at 1:00am saying some region is down, and there may be some customer impact or some kind of issue like that. Then we have to log in to an internal bridge, which can have a lot of people on it. I mean, that bridge can have dozens of people, up to 50, until the issue is worked through. Basically, each team that may be impacted always wants to have someone on the bridge.
I would say my biggest priority, in terms of engineering strength, is to make it so that if a region goes down, I don’t have to join that bridge. We do that by completely automating failover of a region. So if a region goes down, it takes about 2 mins for us to reach consensus that the region is down (or unhealthy), but after that, it cuts over. There isn’t even a phone call any more.
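A rough sketch of the “reach consensus that the region is down” step described above: declare a region unhealthy only once a quorum of independent probers has reported failures within a sustained window (roughly two minutes in Charles’s telling). The probe format, quorum size, and window are illustrative assumptions, not the actual Azure mechanism.

import time


def region_is_down(probe_results, window_seconds=120, quorum=3):
    """probe_results: list of (timestamp, prober_id, healthy) tuples for one region."""
    cutoff = time.time() - window_seconds
    recent = [p for p in probe_results if p[0] >= cutoff]
    # Count distinct probers that saw a failure inside the window; a single
    # flaky probe should not pull a whole region out of rotation.
    failing_probers = {prober for _, prober, healthy in recent if not healthy}
    return len(failing_probers) >= quorum

Once this returns true, traffic is cut over automatically, which is what removes the need for the 1:00am phone call.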
I would say my favorite thing to do, from an engineering point of view, is to avoid bad days as much as possible.
Question: 
QCon: Let’s talk about Microsoft’s Azure Management Gateway. What does Microsoft Azure’s Management API mean? I envision this huge architecture.
Answer: 
Charles: I will give you a customer view and then an architecture view.
So from a customer’s point of view, when I interact with any of the enterprise cloud products from Microsoft, say through management operations like creating a virtual machine or purchasing an Intune license, I talk to a single API surface. We pursue that single API surface so that API developers and partners can target one thing. The idea is that one thing works across the breadth of Microsoft products.
On the architecture side, what that means is we have to have points of presence in all 20 of those different regions, because all of these different products have presence in all the Microsoft data centers. It also means we have to have a way that we can route requests to internal microservices.
So say you come to the API frontdoor and ask me for a virtual machine (VM). Our component doesn’t understand how to allocate a VM in our compute fabric down at the lower levels, but we have the awareness and routing capabilities to bring that request to the microservice responsible for allocating a new VM. Before we bring it to that service, we do things like authentication (making sure that your token is valid and that you look like a good user), check authorization to perform operations on the resource (or the billing on the Azure subscription), and make sure you aren’t doing any bad behavior.
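The steps Charles lists (authenticate, authorize, then route to the owning microservice) amount to a thin pipeline in front of a routing table. The sketch below is a hypothetical, simplified version of that flow; the route table, request type, and helper names are illustrative, not Azure Resource Manager’s code.

from dataclasses import dataclass, field


@dataclass
class Request:
    token: str
    action: str              # e.g. "create"
    resource_type: str       # e.g. "virtualMachines"
    region: str              # e.g. "australiaeast"
    body: dict = field(default_factory=dict)


# (resource type, region) -> the internal microservice that owns that resource.
ROUTES = {
    ("virtualMachines", "australiaeast"): "https://compute.australiaeast.internal",
    ("virtualMachines", "westus"): "https://compute.westus.internal",
}


def authenticate(token):
    # Stand-in for real token validation against the identity service.
    if not token:
        raise PermissionError("invalid token")
    return {"caller": token}


def authorize(user, action, resource_type):
    # Stand-in for the role / subscription / billing checks done before routing.
    return True


def handle(request):
    user = authenticate(request.token)                       # is the token valid?
    authorize(user, request.action, request.resource_type)   # may this caller do this?

    # The frontdoor cannot allocate a VM itself; it only knows which
    # microservice can, proxies the request there, and relays the outcome.
    backend = ROUTES[(request.resource_type, request.region)]
    return {"routed_to": backend, "status": "accepted"}


print(handle(Request(token="bearer-abc", action="create",
                     resource_type="virtualMachines", region="australiaeast")))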
Basically, there is a litany of capabilities we provide for all those management APIs across Microsoft. Then we turn around and expose those capabilities back to the customer in a single, consistent way for all management operations.
Question: 
QCon: What is the scale you’re dealing with?
Answer: 
Charles: To give you an idea of the scale of Azure Resource Manager, it does something on the order of 5,000 requests per second globally across 20 regions and a lot of different geos. On our backend, I think we replicate somewhere between 10,000 and 25,000 database commits per second (through all 20 regions).
Question: 
QCon: When you have that kind of scale going on with 20 data centers, and something goes wrong and you lose a geo or lose several geos, how do you catch up? What is your strategy?
Answer: 
Charles: We track our capacity very closely. For instance, our system can do 300,000 database commits per second globally. So if we are down for 2 hours, we can go to that full set of capacity and, in say 20 minutes, drain all the backlog. But if you are operating at 90% of your capacity and you go down for 6 hours, it is going to take a long, long time to catch up.
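As a back-of-the-envelope check on those numbers (the formula and exact figures below are illustrative, not Charles’s own math): the backlog grows at the steady-state write rate while a region is down, and drains at the spare capacity once it is back.

def catch_up_minutes(outage_minutes, steady_rate, max_rate):
    """Minutes needed to drain the backlog once running again at full capacity."""
    # Backlog accumulates at steady_rate during the outage and drains at the
    # spare capacity (max_rate - steady_rate) afterwards; the rate units cancel.
    return outage_minutes * steady_rate / (max_rate - steady_rate)


# A 2-hour outage at ~25,000 commits/sec against a 300,000 commits/sec ceiling
# clears in roughly 11 minutes:
print(catch_up_minutes(120, 25_000, 300_000))

# At ~90% utilisation (say 270,000 commits/sec), a 6-hour outage leaves only a
# sliver of spare capacity, and catching up takes more than two days:
print(catch_up_minutes(360, 270_000, 300_000))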
Question: 
QCon: So what is your focus for this talk?
Answer: 
Charles: My plan in the talk is to basically map out the life of a request. I want to show what it looks like when a request comes in.
Say I’m a client out there, and I want to talk to a management API; what happens? I go to our traffic manager infrastructure (translation… DNS in our terminology) to resolve the nearest data center. The request goes to the API front door. The API front door evaluates all these subsystems to make sure that the request is valid (authentication, authorization and so on). Then it looks at its routing information (which endpoints should I go to, based on the contents of this request?). If you say "hey, I want to create a virtual machine in Australia", the API gateway has to know it has to talk to the Australia-based virtual machine service to commission this resource, and handle all that routing logic. Then we make the request, wait for its response, and return success or failure to the user.
I'll also talk about how we built different elements for high availability and for data sovereignty as appropriate. For example, the DNS resolution uses traffic manager, which takes whole regions out of rotation if they are unhealthy. That is how we ensure the customer doesn’t really see the problem if there is some kind of compute, network or storage issue in that region. We also have regional microservices behind the scenes (behind the API): our API gateway and its routing capabilities are fully global, but the compute service in Australia is bound just to Australia. It can’t do anything except in Australia. So we are basically responsible for bridging that global understanding, and that way the compute services don’t have any kind of cross-region failure issues. That's what I am going to describe in the talk.
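The DNS-level behaviour described here (send the caller to the nearest frontdoor, but skip regions that health monitoring has pulled out of rotation) can be sketched as below. Region names, latencies, and the function itself are illustrative assumptions, not Traffic Manager’s implementation.

def resolve_frontdoor(region_latencies, in_rotation):
    """region_latencies: {region: latency from this client}; in_rotation: healthy regions."""
    candidates = {r: ms for r, ms in region_latencies.items() if r in in_rotation}
    if not candidates:
        raise RuntimeError("no frontdoor region in rotation")
    return min(candidates, key=candidates.get)


# A Sydney client normally lands on the local frontdoor; if that region is
# unhealthy, the same lookup silently resolves to the next nearest one, while
# region-bound services (e.g. compute in Australia) are only ever reached
# through the gateway's global routing table.
latencies = {"australiaeast": 20, "southeastasia": 95, "westus": 190}
print(resolve_frontdoor(latencies, in_rotation={"southeastasia", "westus"}))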
