Presentation: Microsoft Cloud's Frontdoor: Building a Global API

Duration: 1:40pm - 2:30pm

Key Takeaways

  • Learn about the scale and architecture of Microsoft’s Azure Management Gateway.
  • Follow the life of a request as it moves through Azure, and learn the strategies in place to prevent downtime for the services.
  • Understand how Azure APIs are built for high availability and for data sovereignty, from a key Microsoft Azure architectural resource.

Abstract

All of Microsoft Azure’s management API requests pass through a single “frontdoor” service that handles routing and common functionality. The frontdoor service proxies requests to over 50 different Microsoft products (SQL, Virtual Machines, Dynamics, Active Directory, etc.) and is a mainline dependency for all service provisioning and management. As a result, downtime must be avoided at all costs, and the API needs points of presence around the globe. Previously, this service was deployed to only a single geographic region, which caused three problems: (1) it hurt performance for our large international customer base; (2) it introduced a single point of failure; and (3) it forced metadata from other regulatory regions to pass through the United States.

Over the past few years, Microsoft has transitioned to a fully geo-distributed deployment for this frontdoor service (present in 20+ datacenters in North & South America, Europe, Asia and Australia). In each major area, there are multiple active datacenters that can automatically handle failovers due to physical or software faults. All metadata is also replicated using a simple table store that handles multi-master writes and asynchronous replication. This data store uses simple, common components (key/value store + queues) to appear consistent to our customers but is incredibly resilient to any number of common faults.
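The replication design described above (a key/value table per region, multi-master writes, and queues that ship changes asynchronously to peers) can be illustrated in a few lines. The sketch below is a minimal illustration of that pattern, not the actual Azure store; the class names, timestamps, and last-writer-wins conflict rule are assumptions.

import time
from collections import deque


class RegionalTableStore:
    """A per-region key/value table plus an outbound replication queue."""

    def __init__(self, region):
        self.region = region
        self.rows = {}           # key -> (commit timestamp, value)
        self.outbox = deque()    # replication events waiting to ship to peers

    def write(self, key, value):
        # Commit locally first (multi-master: every region accepts writes),
        # then enqueue the change for asynchronous replication.
        event = (time.time(), self.region, key, value)
        self.rows[key] = (event[0], value)
        self.outbox.append(event)

    def apply_replicated(self, event):
        # Last-writer-wins: only apply the event if it is newer than what
        # this region already holds for the key.
        ts, _origin, key, value = event
        current = self.rows.get(key)
        if current is None or ts > current[0]:
            self.rows[key] = (ts, value)


def replicate(source, targets):
    """Drain one region's outbox into its peers (in practice a background job)."""
    while source.outbox:
        event = source.outbox.popleft()
        for target in targets:
            target.apply_replicated(event)

In a real deployment the outbox would be a durable queue and conflict resolution would have to tolerate clock skew; the point of the abstract's claim is that each piece is a simple, well-understood component.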

This session will cover the process of moving to this geo-distributed architecture, as well as relatable techniques for achieving high availability (via active-active-active deployments). The techniques should be transferable to public or private clouds.

Interview

Question: 
QCon: What is your role at Microsoft? ...and what was your path to join?
Answer: 
Charles: I founded a company called MetricsHub, a big data analytics startup around cost management and monitoring of data centers. It very quickly got to over 1,000 customers. That company was acquired by Microsoft, and I was then brought in to integrate the platform. So that is actually how I joined Microsoft.
Then from there, I bounced through a few different teams helping bring the Azure management capabilities up to snuff. I developed the Azure equivalents of AWS’s CloudFormation, IAM, and EC2 tags/billing capabilities. I basically did a bunch of the management capabilities. The latest thing I’ve worked on is Azure Resource Manager, which is the management API for all of our cloud offerings.
Question: 
QCon: What is a bad day at Microsoft when you are operating Azure?
Answer: 
Charles: The day is bad when I get a phone call at 1:00am saying some region is down, and there may be some customer impact or some kind of issue like that. Then we have to log in to an internal bridge, which can have a lot of people on it. I mean, that bridge can have dozens of people, up to 50, until the issue is worked through. Basically, each team that may be impacted always wants to have someone on the bridge.
I would say my biggest priority, in terms of engineering strength, is to make it so that if a region goes down, I don’t have to join that bridge. We do that by completely automating failover of a region. So if a region goes down, it takes about 2 mins for us to reach consensus that the region is down (or unhealthy), but after that, it cuts over. There isn’t even a phone call any more.
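A rough sketch of the “reach consensus that the region is down” step described above: declare a region unhealthy only once a quorum of independent probers has reported failures within a sustained window (roughly two minutes in Charles’s telling). The probe format, quorum size, and window are illustrative assumptions, not the actual Azure mechanism.

import time


def region_is_down(probe_results, window_seconds=120, quorum=3):
    """probe_results: list of (timestamp, prober_id, healthy) tuples for one region."""
    cutoff = time.time() - window_seconds
    recent = [p for p in probe_results if p[0] >= cutoff]
    # Count distinct probers that saw a failure inside the window; a single
    # flaky probe should not pull a whole region out of rotation.
    failing_probers = {prober for _, prober, healthy in recent if not healthy}
    return len(failing_probers) >= quorum

Once this returns true, traffic is cut over automatically, which is what removes the need for the 1:00am phone call.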
I would say my favorite thing to do, from an engineering point of view, is to avoid bad days as much as possible.
Question: 
QCon: Let’s talk about Microsoft’s Azure Management Gateway. What does Microsoft Azure’s Management API mean? I envision this huge architecture.
Answer: 
Charles: I will give you a customer view and then an architecture view.
So from a customer’s point of view, when I interact with any of the enterprise cloud products from Microsoft, say through management operations like creating a virtual machine or purchasing an Intune license, I talk to a single API surface. We pursue that single API surface so that API developers and partners can target one thing. The idea is that one thing works across the breadth of Microsoft products.
On the architecture side, what that means is we have to have points of presence in all 20 of those different regions, because all of these different products have presence in all the Microsoft data centers. It also means we have to have a way that we can route requests to internal microservices.
So say you come to the API frontdoor and ask me for a virtual machine (VM). Our component doesn’t understand how to allocate a VM in our compute fabric down at the lower levels, but we have the awareness and routing capabilities to bring that request to the microservice responsible for allocating a new VM. Before we bring it to that service, we do things like authentication (making sure that your token is valid and that you look like a good user), check authorization to perform operations on the resource (or the billing on the Azure subscription), and make sure you aren’t doing any bad behavior.
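The steps Charles lists (authenticate, authorize, then route to the owning microservice) amount to a thin pipeline in front of a routing table. The sketch below is a hypothetical, simplified version of that flow; the route table, request type, and helper names are illustrative, not Azure Resource Manager’s code.

from dataclasses import dataclass, field


@dataclass
class Request:
    token: str
    action: str              # e.g. "create"
    resource_type: str       # e.g. "virtualMachines"
    region: str              # e.g. "australiaeast"
    body: dict = field(default_factory=dict)


# (resource type, region) -> the internal microservice that owns that resource.
ROUTES = {
    ("virtualMachines", "australiaeast"): "https://compute.australiaeast.internal",
    ("virtualMachines", "westus"): "https://compute.westus.internal",
}


def authenticate(token):
    # Stand-in for real token validation against the identity service.
    if not token:
        raise PermissionError("invalid token")
    return {"caller": token}


def authorize(user, action, resource_type):
    # Stand-in for the role / subscription / billing checks done before routing.
    return True


def handle(request):
    user = authenticate(request.token)                       # is the token valid?
    authorize(user, request.action, request.resource_type)   # may this caller do this?

    # The frontdoor cannot allocate a VM itself; it only knows which
    # microservice can, proxies the request there, and relays the outcome.
    backend = ROUTES[(request.resource_type, request.region)]
    return {"routed_to": backend, "status": "accepted"}


print(handle(Request(token="bearer-abc", action="create",
                     resource_type="virtualMachines", region="australiaeast")))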
Basically, there is a litany of capabilities we provide for all those management APIs across Microsoft. Then we turn around and expose those capabilities back to the customer in a single, consistent way for all management operations.
Question: 
QCon: What is the scale you’re dealing with?
Answer: 
Charles: To give you an idea of the scale of Azure Resource Manager, it does something on the order of 5,000 requests per second globally across 20 regions and a lot of different geos. On our backend, I think we replicate somewhere between 10,000 and 25,000 database commits per second (through all 20 regions).
Question: 
QCon: When you have that kind of scale going on with 20 data centers, and something goes wrong and you lose a geo or lose several geos, how do you catch up? What is your strategy?
Answer: 
Charles: We track our capacity very closely. For instance, our system can do 300,000 database commits per second globally. So if we are down for 2 hours, we can go to that full set of capacity and, in say 20 minutes, drain all the backlog. But if you are operating at 90% of your capacity and you go down for 6 hours, it is going to take a long, long time to catch up.
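As a back-of-the-envelope check on those numbers (the formula and exact figures below are illustrative, not Charles’s own math): the backlog grows at the steady-state write rate while a region is down, and drains at the spare capacity once it is back.

def catch_up_minutes(outage_minutes, steady_rate, max_rate):
    """Minutes needed to drain the backlog once running again at full capacity."""
    # Backlog accumulates at steady_rate during the outage and drains at the
    # spare capacity (max_rate - steady_rate) afterwards; the rate units cancel.
    return outage_minutes * steady_rate / (max_rate - steady_rate)


# A 2-hour outage at ~25,000 commits/sec against a 300,000 commits/sec ceiling
# clears in roughly 11 minutes:
print(catch_up_minutes(120, 25_000, 300_000))

# At ~90% utilisation (say 270,000 commits/sec), a 6-hour outage leaves only a
# sliver of spare capacity, and catching up takes more than two days:
print(catch_up_minutes(360, 270_000, 300_000))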
Question: 
QCon: So what is your focus for this talk?
Answer: 
Charles: My plan in the talk is to basically map out the life of a request. I want to show what it looks like when a request comes in.
Say I’m a client out there, and I want to talk to a management API; what happens? I go to our traffic manager infrastructure (translation… DNS in our terminology) to resolve the nearest data center. The request goes to the API front door. The API front door evaluates all these subsystems to make sure that the request is valid (authentication, authorization and so on). Then it looks at its routing information (which endpoints should I go to, based on the contents of this request?). If you say "hey, I want to create a virtual machine in Australia", the API gateway has to know it has to talk to the Australia-based virtual machine service to commission this resource, and handle all that routing logic. Then we make the request, wait for its response, and return success or failure to the user.
I'll also talk about how we built different elements for high availability and for data sovereignty as appropriate. For example, the DNS resolution uses traffic manager, which takes whole regions out of rotation if they are unhealthy. That is how we ensure the customer doesn’t really see the problem if there is some kind of compute, network or storage issue in that region. We also have regional microservices behind the scenes (behind the API): our API gateway and its routing capabilities are fully global, but the compute service in Australia is bound just to Australia. It can’t do anything except in Australia. So we are basically responsible for bridging that global understanding, and that way the compute services don’t have any kind of cross-region failure issues. That's what I am going to describe in the talk.
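The DNS-level behaviour described here (send the caller to the nearest frontdoor, but skip regions that health monitoring has pulled out of rotation) can be sketched as below. Region names, latencies, and the function itself are illustrative assumptions, not Traffic Manager’s implementation.

def resolve_frontdoor(region_latencies, in_rotation):
    """region_latencies: {region: latency from this client}; in_rotation: healthy regions."""
    candidates = {r: ms for r, ms in region_latencies.items() if r in in_rotation}
    if not candidates:
        raise RuntimeError("no frontdoor region in rotation")
    return min(candidates, key=candidates.get)


# A Sydney client normally lands on the local frontdoor; if that region is
# unhealthy, the same lookup silently resolves to the next nearest one, while
# region-bound services (e.g. compute in Australia) are only ever reached
# through the gateway's global routing table.
latencies = {"australiaeast": 20, "southeastasia": 95, "westus": 190}
print(resolve_frontdoor(latencies, in_rotation={"southeastasia", "westus"}))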
