Cloud-Native and Scalable Kafka Architecture | Software Development Conference QCon London 2018

Abstract

Kafka as a distributed stateful service faces serious stability and scalability challenges in cloud environment which favors stateless services. As cluster size grows with traffic, it faces issues of data balancing, high consumer data fan out and time consuming process to scale up or update. Failover is necessary to deal with cluster disasters but is hard to do right.

At Netflix, we address these issues by having many smaller and mostly “immutable” Kafka clusters which have limited state changes. We will prove the merit of this architecture in mathematical terms and illustrate how this architecture and additional tooling helps us to improve availability, scale and failover. Our Kafka service, which is composed of over 3000 brokers globally, is capable of processing over one trillion messages and petabytes of data per day with over 99.99% availability.

To make this multi-cluster architecture feasible, we also developed smart clients that conform to the standard Kafka producer/consumer interface but are capable of interacting with multiple clusters at the same time. Additional services are also created to orchestrate cluster/topic changes. The talk will go over the design principles of such clients and services.

Question:

What is the focus of your work today??

Answer:

My focus is on the scalability and operability of Kafka ecosystem (both brokers and clients) on AWS for Netflix. Kafka is heavily used inside Netflix for data buffering/collection for its unified data pipeline and also as a pub-sub/queue service for various applications. I need to make sure that

We maintain high availability (currently five nines) for our Kafka service at a reasonable cost
Changes/scaling out is easily achievable as other services in our cloud environment
We can react effectively and in timely manner for Kafka outages

Question:

What’s the motivation for this talk?

Answer:

Kafka as a stateful service faces a lot of challenges in AWS which is more friendly to stateless services. In addition, the data oriented culture in Netflix drives extremely high data volume for Kafka and adds its own unique scalability challenges. In a past two years, we explored ways to make Kafka quickly and infinitely scalable in AWS. We would like to share our thoughts and experiences with other Kafka users.

Question:

What do you recommend for thinking about how you should deal with state in the cloud?

Answer:

Cloud environment poses challenges for distributed stateful service. On the other hand, the elasticity of the cloud provide unique ways to scale these services which may not be feasible in traditional data center. It is also important to think “out of the box” and not to be bound by traditional concept like “partitions of a topic can be only in one cluster”. Providing higher level of abstraction with multi-cluster aware logic in clients/proxies will help to scale the service transparently.

Question:

How you you describe the persona and level of the target audience?

Answer:

The target audience are the application developers and SREs that have interacted with Kafka and anyone is interested in running distributed stateful services in cloud. They should be familiar with distributed stateful services like Cassandra and ElasticSearch. Basic Kafka concepts like producer, consumer, topic and partition will definitely help since this talk will expand these concepts.

Question:

What trend in the next 12 months would you recommend an early adopter/early majority SWE to pay particular attention to?

Answer:

Data is playing an ever important role in software development. Stream processing is replacing batch processing at a fast pace as Lambda architecture is giving way to Kappa architecture.

Speaker: Allen Wang

Senior Software Engineer - Cloud Platform @Netflix

Allen Wang is currently with Netflix Real Time Data Infrastructure team where he made significant contribution to Kafka and data infrastructure in AWS. He is a contributor to both Apache Kafka and NetflixOSS and the author of Kafka's rack aware partition assignment. He spoke in 2016 and 2017 Kafka Summit and other meet-ups/conferences.

Find Allen Wang at

Speaker page

Last Year's Tracks

Monday, 5 March

Leading Edge Backend Languages

Code the future! How cutting-edge programming languages and their more-established forerunners can help solve today and tomorrow’s server-side technical problems.
Security: Red XOR Blue Team

Security from the defender's AND the attacker's point of view
Microservices/ Serverless: Patterns and Practices

Stories of success and failure building modern service and function-based applications, including event sourcing, reactive, decomposition, & more.
Stream Processing in the Modern Age

Compelling applications of stream processing & recent advances in the field
DevEx: The Next Evolution of DevOps

Removing friction from the developer experience.
Modern CS in the Real World

Applied trends in Computer Science that are likely to affect Software Engineers today.
Speaker AMAs (Ask Me Anything)

Tuesday, 6 March

Next Gen Banking: It’s not all Blockchains and ICOs

Great technologies like Blockchain, smartphones and biometrics must not be limited to just faster banking, but better banking.
Observability: Logging, Alerting and Tracing

Observability in modern large distributed computer systems
Building Great Engineering Cultures & Organizations

Stories of cultural change in organizations
Architectures You've Always Wondered About

Topics like next-gen architecture mixed with applied use cases found in today's large-scale systems, self-driving cars, network routing, scale, robotics, cloud deployments, and more.
The Practice & Frontiers of AI

Learn about machine learning in practice and on the horizon
JavaScript and Beyond: The Future of the Frontend

Exploring the great frontend frameworks that make JavaScript so popular and theg JavaScript-based languages revolutionising frontend development.
Speaker AMAs (Ask Me Anything)

Wednesday, 7 March

Distributed Stateful Systems

Architecting and leveraging NoSQL revisitied
Operating Systems: LinuxKit, Unikernels, & Beyond

Applied, practical, & real-world deep-dive into industry adoption of OS, containers and virtualisation, including Linux on Windows, LinuxKit, and Unikernels
Architecting for Failure

If you're not architecting for failure you're heading for failure
Evolving Java and the JVM: Mobile, Micro and Modular

Although the Java language is holding strong as a developer favourite, new languages and paradigms are being embraced on JVM.
Tech Ethics in Action

Learning from the experiences of real-world companies driving technology decisions from ethics as much as technology.
Bare Knuckle Performance

Killing latency and getting the most out of your hardware
Speaker AMAs (Ask Me Anything)

This Year's Schedule

Track: Distributed Stateful Systems

Location: Churchill, G flr.

Duration: 1:40pm - 2:30pm

Day of week: Wednesday

Level: Intermediate - Advanced

Abstract

Find Allen Wang at

Last Year's Tracks

Monday, 5 March

Tuesday, 6 March

Wednesday, 7 March

Learn trends from innovator and early adopter companies that you can bring home to your team

Presentation: Cloud-Native and Scalable Kafka Architecture

Track: Distributed Stateful Systems

Location: Churchill, G flr.

Duration: 1:40pm - 2:30pm

Day of week: Wednesday

Level: Intermediate - Advanced

More talks on:

Share this on:

Abstract

Find Allen Wang at

Last Year's Tracks

Monday, 5 March

Tuesday, 6 March

Wednesday, 7 March

Learn trends from innovator and early adopter companies that you can bring home to your team