Presentation: Patterns of reliable in-stream processing @ Scale

Duration: 10:35am - 11:25am
Key Takeaways

  • Hear the reasoning behind, and the capabilities of, deploying a Kappa architecture.
  • Understand how one company maintains data precision in a global, high-volume (1.5M msg/sec) streaming environment.
  • Understand implementation patterns and use cases for solving high-volume streaming problems with Apache Storm and Kafka.

Abstract

Modern data streaming systems process millions of messages per second. To extract value from their data, organizations employ horizontally scalable distributed event processors such as Apache Storm. Such architectures are frequently designed under the assumption that data loss and calculation errors are acceptable.

In other cases, a Kappa architecture is used to fulfill performance requirements without sacrificing consistency and reliability. Given typical data consistency and performance requirements, external state and reliance on the wall clock become a taxing and hard-to-maintain choice.

In this talk, we will discuss how we handled the challenges of building a 1.5M msg/sec global processing system with Apache Storm and Apache Kafka at Integral Ad Science. We will inspect technology-agnostic patterns that reemerge across multiple applications, including the stream rewind technique, volatile in-memory state, derived logical time and synchronization, the benefits of component collocation, and precision/performance trade-offs.

Interview

QCon: Can you tell me about the work you are doing today?
Alexey: I am leading development of real-time infrastructure for Integral Ad Science, plus a few projects in adjacent areas like internet crawling, data science, and data warehousing. I take care of the technical architecture and vision.
QCon: So what were some of the issues that were challenging with your architecture?
Alexey: When people talk about real time, they implicitly assume you can lose precision or lose data during processing. That is not the case for us.
Integral is a media quality measurement company, so data precision is very important to us. While we had to make a lot of trade-offs, we couldn’t afford to lose data. So we had to come up with something different from a Lambda architecture. Some people call it a Kappa architecture.
The idea behind a Kappa architecture is to keep your data in Kafka (a stream broker) and then process it in volatile memory. Kappa architectures avoid any sort of intermediate storage.
Frequently, when people use stream processing systems, the idea is to use Cassandra, HBase, or something like that, so the state of the system is maintained in persistent storage. With our workload that was not possible, so we had to maintain everything in memory. Because of that, we reused some patterns from financial data processing.
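As a rough illustration of the volatile in-memory state pattern (all names here are hypothetical, not Integral’s actual code): per-key state lives entirely in process memory, and durability comes from replaying the retained stream rather than from writing state to an external store like Cassandra or HBase.

```python
from collections import defaultdict

class InMemoryAggregator:
    """Hypothetical sketch: per-key counters kept in process memory only."""

    def __init__(self):
        self.counts = defaultdict(int)  # volatile state: lost on crash

    def process(self, message):
        key, value = message
        self.counts[key] += value

    def rebuild(self, log):
        """Recover after a crash by replaying the retained log."""
        self.counts.clear()
        for message in log:
            self.process(message)

# A plain list stands in for a retained Kafka topic.
log = [("ad-1", 1), ("ad-2", 1), ("ad-1", 1)]

agg = InMemoryAggregator()
for m in log:
    agg.process(m)

# Simulated crash: a fresh process rebuilds identical state from the log.
recovered = InMemoryAggregator()
recovered.rebuild(log)
assert recovered.counts == agg.counts
```

The design point is that the log, not the processor, is the source of durability: losing the in-memory state is acceptable because it is always reconstructible.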
Another challenge was that we cannot rely on wall clocks for our use cases. The idea is that you store all the data in Kafka or some persistent storage, and if you need to recover from a failure, you replay the data from Kafka. But in this case you cannot use the wall clock, so you have to rely on some sort of logical time.
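A hypothetical sketch of that idea (not the talk’s actual code): logical time is derived from timestamps carried inside the messages, never from the machine’s clock, so replaying the same messages reproduces the same timeline and the same window assignments.

```python
def process_stream(messages, window=60):
    """Advance a logical clock and bucket payloads into fixed time windows.

    Each message is a (event_time, payload) pair; event_time is the
    message's own timestamp, the only notion of time ever consulted.
    """
    logical_now = 0
    windows = {}
    for event_time, payload in messages:
        logical_now = max(logical_now, event_time)  # monotonic logical clock
        bucket = event_time // window * window      # window by event time
        windows.setdefault(bucket, []).append(payload)
    return logical_now, windows

messages = [(10, "a"), (75, "b"), (70, "c"), (130, "d")]
first_now, first = process_stream(messages)
replay_now, replayed = process_stream(messages)  # recovery replay

# Wall-clock-based windowing could not make this guarantee after a replay.
assert first == replayed and first_now == replay_now
```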
QCon: Can you explain a bit more about a Kappa architecture, and how it’s different from Lambda architectures?
Alexey: Lambda architecture is basically the idea that you have a real-time view and an offline view of the data.
At the end of the day, you take the same data, process it through your streaming system, and also run it through an offline job. You use the offline results as a more precise view than the real-time view. This means that if your real-time system goes down, you can always recover by running this offline batch processing.
With a Kappa architecture, you actually rely on your message broker as the system of record, so you don’t have an offline backup system. All your data is in Kafka, and if your real-time pipeline fails, you rewind your stream in Kafka and redo the processing.
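A minimal sketch of that rewind-and-reprocess recovery, with a plain Python list standing in for a retained Kafka partition (class and function names are illustrative, not a real Kafka API):

```python
class Log:
    """Stand-in for a Kafka topic partition with retained messages."""

    def __init__(self, messages):
        self.messages = list(messages)

    def read_from(self, offset):
        """Yield (offset, message) pairs starting at a given offset."""
        return enumerate(self.messages[offset:], start=offset)

def process(log, start_offset=0):
    """Consume the log from an offset, returning a total and last offset."""
    total, last_offset = 0, start_offset
    for offset, value in log.read_from(start_offset):
        total += value
        last_offset = offset
    return total, last_offset

log = Log([5, 3, 7, 2])
total, checkpoint = process(log)                 # initial run over the stream
rewound_total, _ = process(log, start_offset=0)  # pipeline "fails": rewind to
                                                 # offset 0 and reprocess
assert rewound_total == total
```

Because the broker retains the messages, recovery is just resetting the consumer offset and re-running the same processing, rather than restoring from a separate batch system.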
