Presentation: Cloud-Native and Scalable Kafka Architecture
Share this on:
Abstract
Kafka as a distributed stateful service faces serious stability and scalability challenges in cloud environment which favors stateless services. As cluster size grows with traffic, it faces issues of data balancing, high consumer data fan out and time consuming process to scale up or update. Failover is necessary to deal with cluster disasters but is hard to do right.
At Netflix, we address these issues by having many smaller and mostly “immutable” Kafka clusters which have limited state changes. We will prove the merit of this architecture in mathematical terms and illustrate how this architecture and additional tooling helps us to improve availability, scale and failover. Our Kafka service, which is composed of over 3000 brokers globally, is capable of processing over one trillion messages and petabytes of data per day with over 99.99% availability.
To make this multi-cluster architecture feasible, we also developed smart clients that conform to the standard Kafka producer/consumer interface but are capable of interacting with multiple clusters at the same time. Additional services are also created to orchestrate cluster/topic changes. The talk will go over the design principles of such clients and services.
What is the focus of your work today??
My focus is on the scalability and operability of Kafka ecosystem (both brokers and clients) on AWS for Netflix. Kafka is heavily used inside Netflix for data buffering/collection for its unified data pipeline and also as a pub-sub/queue service for various applications. I need to make sure that
- We maintain high availability (currently five nines) for our Kafka service at a reasonable cost
- Changes/scaling out is easily achievable as other services in our cloud environment
- We can react effectively and in timely manner for Kafka outages
What’s the motivation for this talk?
Kafka as a stateful service faces a lot of challenges in AWS which is more friendly to stateless services. In addition, the data oriented culture in Netflix drives extremely high data volume for Kafka and adds its own unique scalability challenges. In a past two years, we explored ways to make Kafka quickly and infinitely scalable in AWS. We would like to share our thoughts and experiences with other Kafka users.
What do you recommend for thinking about how you should deal with state in the cloud?
Cloud environment poses challenges for distributed stateful service. On the other hand, the elasticity of the cloud provide unique ways to scale these services which may not be feasible in traditional data center. It is also important to think “out of the box” and not to be bound by traditional concept like “partitions of a topic can be only in one cluster”. Providing higher level of abstraction with multi-cluster aware logic in clients/proxies will help to scale the service transparently.
How you you describe the persona and level of the target audience?
The target audience are the application developers and SREs that have interacted with Kafka and anyone is interested in running distributed stateful services in cloud. They should be familiar with distributed stateful services like Cassandra and ElasticSearch. Basic Kafka concepts like producer, consumer, topic and partition will definitely help since this talk will expand these concepts.
What trend in the next 12 months would you recommend an early adopter/early majority SWE to pay particular attention to?
Data is playing an ever important role in software development. Stream processing is replacing batch processing at a fast pace as Lambda architecture is giving way to Kappa architecture.
Last Year's Tracks
Monday, 5 March
-
Leading Edge Backend Languages
Code the future! How cutting-edge programming languages and their more-established forerunners can help solve today and tomorrow’s server-side technical problems.
-
Security: Red XOR Blue Team
Security from the defender's AND the attacker's point of view
-
Microservices/ Serverless: Patterns and Practices
Stories of success and failure building modern service and function-based applications, including event sourcing, reactive, decomposition, & more.
-
Stream Processing in the Modern Age
Compelling applications of stream processing & recent advances in the field
-
DevEx: The Next Evolution of DevOps
Removing friction from the developer experience.
-
Modern CS in the Real World
Applied trends in Computer Science that are likely to affect Software Engineers today.
-
Speaker AMAs (Ask Me Anything)
Tuesday, 6 March
-
Next Gen Banking: It’s not all Blockchains and ICOs
Great technologies like Blockchain, smartphones and biometrics must not be limited to just faster banking, but better banking.
-
Observability: Logging, Alerting and Tracing
Observability in modern large distributed computer systems
-
Building Great Engineering Cultures & Organizations
Stories of cultural change in organizations
-
Architectures You've Always Wondered About
Topics like next-gen architecture mixed with applied use cases found in today's large-scale systems, self-driving cars, network routing, scale, robotics, cloud deployments, and more.
-
The Practice & Frontiers of AI
Learn about machine learning in practice and on the horizon
-
JavaScript and Beyond: The Future of the Frontend
Exploring the great frontend frameworks that make JavaScript so popular and theg JavaScript-based languages revolutionising frontend development.
-
Speaker AMAs (Ask Me Anything)
Wednesday, 7 March
-
Distributed Stateful Systems
Architecting and leveraging NoSQL revisitied
-
Operating Systems: LinuxKit, Unikernels, & Beyond
Applied, practical, & real-world deep-dive into industry adoption of OS, containers and virtualisation, including Linux on Windows, LinuxKit, and Unikernels
-
Architecting for Failure
If you're not architecting for failure you're heading for failure
-
Evolving Java and the JVM: Mobile, Micro and Modular
Although the Java language is holding strong as a developer favourite, new languages and paradigms are being embraced on JVM.
-
Tech Ethics in Action
Learning from the experiences of real-world companies driving technology decisions from ethics as much as technology.
-
Bare Knuckle Performance
Killing latency and getting the most out of your hardware
-
Speaker AMAs (Ask Me Anything)