Presentation: Cloud-Native and Scalable Kafka Architecture

Track: Distributed Stateful Systems

Location: Churchill, G flr.

Duration: 1:40pm - 2:30pm

Day of week: Wednesday

Level: Intermediate - Advanced

Share this on:

Abstract

Kafka as a distributed stateful service faces serious stability and scalability challenges in cloud environment which favors stateless services. As cluster size grows with traffic, it faces issues of data balancing, high consumer data fan out and time consuming process to scale up or update. Failover is necessary to deal with cluster disasters but is hard to do right.

At Netflix, we address these issues by having many smaller and mostly “immutable” Kafka clusters which have limited state changes. We will prove the merit of this architecture in mathematical terms and illustrate how this architecture and additional tooling helps us to improve availability, scale and failover. Our Kafka service, which is composed of over 3000 brokers globally, is capable of processing over one trillion messages and petabytes of data per day with over 99.99% availability.

To make this multi-cluster architecture feasible, we also developed smart clients that conform to the standard Kafka producer/consumer interface but are capable of interacting with multiple clusters at the same time. Additional services are also created to orchestrate cluster/topic changes. The talk will go over the design principles of such clients and services.

Question: 

What is the focus of your work today??

Answer: 

My focus is on the scalability and operability of Kafka ecosystem (both brokers and clients) on AWS for Netflix. Kafka is heavily used inside Netflix for data buffering/collection for its unified data pipeline and also as a pub-sub/queue service for various applications. I need to make sure that

  • We maintain high availability (currently five nines) for our Kafka service at a reasonable cost
  • Changes/scaling out is easily achievable as other services in our cloud environment
  • We can react effectively and in timely manner for Kafka outages
Question: 

What’s the motivation for this talk?

Answer: 

Kafka as a stateful service faces a lot of challenges in AWS which is more friendly to stateless services. In addition, the data oriented culture in Netflix drives extremely high data volume for Kafka and adds its own unique scalability challenges. In a past two years, we explored ways to make Kafka quickly and infinitely scalable in AWS. We would like to share our thoughts and experiences with other Kafka users.

Question: 

What do you recommend for thinking about how you should deal with state in the cloud?

Answer: 

Cloud environment poses challenges for distributed stateful service. On the other hand, the elasticity of the cloud provide unique ways to scale these services which may not be feasible in traditional data center. It is also important to think “out of the box” and not to be bound by traditional concept like “partitions of a topic can be only in one cluster”. Providing higher level of abstraction with multi-cluster aware logic in clients/proxies will help to scale the service transparently.

Question: 

How you you describe the persona and level of the target audience?

Answer: 

The target audience are the application developers and SREs that have interacted with Kafka and anyone is interested in running distributed stateful services in cloud. They should be familiar with distributed stateful services like Cassandra and ElasticSearch. Basic Kafka concepts like producer, consumer, topic and partition will definitely help since this talk will expand these concepts.

Question: 

What trend in the next 12 months would you recommend an early adopter/early majority SWE to pay particular attention to?

Answer: 

Data is playing an ever important role in software development. Stream processing is replacing batch processing at a fast pace as Lambda architecture is giving way to Kappa architecture.

Speaker: Allen Wang

Senior Software Engineer - Cloud Platform @Netflix

Allen Wang is currently with Netflix Real Time Data Infrastructure team where he made significant contribution to Kafka and data infrastructure in AWS. He is a contributor to both Apache Kafka and NetflixOSS and the author of Kafka's rack aware partition assignment. He spoke in 2016 and 2017 Kafka Summit and other meet-ups/conferences.

Find Allen Wang at

Last Year's Tracks

Monday, 5 March

Tuesday, 6 March

Wednesday, 7 March