Streaming Really Large Data With Flink and Fluss

Modern data workloads are pushing the limits of our streaming systems. Row-oriented streaming pipelines often send every event (and every field) over the network, even if consumers only need a fraction of that data. With data volumes skyrocketing in the AI era – where machine learning and real-time analytics consume more data than ever – these inefficiencies become a serious bottleneck. In this talk, we explore a new approach that treats all data as tables, seamlessly integrating streams as continuously updating tables to make streaming more efficient and scalable.

What you'll learn:
Filtering Data at the Source: See how combining Apache Paimon with Apache Arrow (a columnar format) enables predicate pushdown and columnar storage. This allows data to be filtered and pruned at the source, cutting unnecessary network traffic by up to 50%. By avoiding the row-by-row transfer of irrelevant data, we reduce network overhead and boost throughput.

Unified Batch and Stream Queries: Learn how treating streams as tables lets you run the same queries across static data lakes and live streaming data without special-case code. We’ll discuss how Apache Arrow’s in-memory columnar capabilities, integrated with Apache Paimon’s table format, make it possible to query historical data and real-time events in a unified way. This streaming-lakehouse approach simplifies architecture by using one table-oriented model for both batch and streaming workloads.

Scalable Pipeline Architecture: Discover strategies for chaining multiple processing jobs (with engines like Apache Flink and even Apache Spark) while maintaining data integrity and low latency. We’ll cover how the open-source Fluss project serves as a real-time streaming storage layer that works hand-in-hand with Flink. Drawing on real-world use cases – including how Alibaba’s platforms handle massive, continuous data streams – we’ll illustrate how this architecture supports billions of events without compromising performance.

This session is designed for data engineers and architects looking to build scalable, cost-effective data pipelines that blend streaming and batch processing. You’ll come away with practical insights into why our data architectures must evolve to support AI-driven demand, and how reimagining streams as tables can simplify your stack while delivering substantial performance gains and cost savings.
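
To make the "filtering data at the source" idea concrete, here is a minimal pure-Python sketch of column projection plus predicate pushdown over a toy columnar batch. It does not use the Fluss, Paimon, or Arrow APIs; the names `ColumnBatch`, `scan`, and `cell_count`, and the sample `events` data, are hypothetical and exist only for illustration.

```python
from typing import Any, Callable, Dict, List

# A toy columnar batch: column name -> list of values (all columns same length).
# This is an illustrative stand-in, not an Arrow record batch.
ColumnBatch = Dict[str, List[Any]]


def scan(batch: ColumnBatch,
         columns: List[str],
         predicate_column: str,
         predicate: Callable[[Any], bool]) -> ColumnBatch:
    """Source-side scan: evaluate the predicate on a single column, then
    return only the requested columns for the rows that matched."""
    keep = [i for i, value in enumerate(batch[predicate_column]) if predicate(value)]
    return {name: [batch[name][i] for i in keep] for name in columns}


def cell_count(batch: ColumnBatch) -> int:
    """Number of values that would have to cross the network for this batch."""
    return sum(len(values) for values in batch.values())


# Hypothetical sample data: four events with a wide, rarely needed payload column.
events: ColumnBatch = {
    "user_id": [1, 2, 3, 4],
    "country": ["UK", "DE", "UK", "FR"],
    "amount": [10.0, 25.5, 7.25, 99.0],
    "payload": ["a" * 100, "b" * 100, "c" * 100, "d" * 100],
}

# The consumer only needs user_id and amount for UK events.
pruned = scan(events, columns=["user_id", "amount"],
              predicate_column="country", predicate=lambda c: c == "UK")

print(pruned)               # only the two requested columns, UK rows only
print(cell_count(events))   # values shipped when every row and field is sent
print(cell_count(pruned))   # values shipped when the source filters and prunes
```

The reduction seen on this toy data is specific to the sample and says nothing about real workloads; the point is only that with a columnar layout the source can check the predicate against one column and skip the columns the consumer never asked for.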


Speaker

Ben Gamble

Field CTO @Ververica

A long-time builder of AI-powered games, simulations, and collaborative user experiences, Ben has previously built a global logistics company, large-scale online games, and augmented reality apps. He currently works to make fast data and AI a reality for everyone as the Field CTO of Ververica.

Session Sponsored By

The Unified Streaming Data Platform powered by VERA, from the original creators of Apache Flink®

Date

Tuesday Apr 8 / 02:45PM BST (50 minutes)

Location

Westminster (4th Fl.)

From the same track

Session

Beyond Code: Building a Personal Brand To Boost Your Career

Tuesday Apr 8 / 01:35PM BST

In an increasingly competitive field, software expertise alone may not be enough to stand out and drive your career forward.

Roland Meertens

InfoQ Editor, Machine Learning Engineer @Wayve, Previously @Bumble Inc, @Annotell, and @Autonomous Intelligent Driving

Steef-Jan Wiggers

Cloud Queue Lead Editor @InfoQ, Principal Consultant Cloud/DevOps @Team Rockstars IT

Session

From Concept to Code: Navigating Agentic AI Services

Tuesday Apr 8 / 11:45AM BST

Those who embrace agentic AI will reap the rewards. Building on the strategic insights from the first session (“A Blueprint for Agentic AI Services”), this presentation delves into the technical intricacies of harnessing agentic AI.

Alan Klikic

Senior Solutions Architect @Akka

Session

Engineering Excellence at ING: Balance Autonomy with Standardization

Tuesday Apr 8 / 10:35AM BST

ING is committed to empowering its engineers to maximize their impact and create more value for customers. To achieve this, ING continuously seeks innovative ways to accelerate development and enhance productivity.

Daniele Tonella

CTO @ING

Session

AI-Enabled Delivery - ICSAET Cohort Only

Tuesday Apr 8 / 03:55PM BST

Only available to attendees with a “Conference (3 days) + Certification (half day)” ticket. AI isn't just about fancy code completion anymore – it's shaking up the entire software lifecycle, from design to deployment and beyond! 🤯

Wes Reisz

Technical Principal @EqualExperts, ex-Thoughtworker & ex-VMWare, 16-Time QCon Chair, Creator/Co-host of The InfoQ Podcast

Session

Unlock Continuous Testing with AI Test Agent Workflows

Tuesday Apr 8 / 05:05PM BST

This talk will explore how agentic AI is revolutionizing continuous testing workflows, with a practical demonstration of Diffblue Cover's autonomous AI-powered unit testing capabilities.

Animesh Mishra

Senior Solutions Engineer