The Rise of the Streamhouse: Idea, Trade-Offs, and Evolution

Abstract

Over the last decade, streaming architectures have largely been built around topic-centric primitives—logs, streams, and event pipelines—then stitched together with databases, caches, OLAP engines, and (increasingly) new serving systems. This approach scales, but it also accumulates architectural debt: duplicated data, fractured "truth," inconsistent guarantees, and rising operational overhead as batch, streaming, and product analytics diverge.

In this talk we introduce an emerging shift we call the Streamhouse: a table-centric streaming architecture that treats tables as the primary primitive, and models "real-time" as freshness tiers rather than separate systems. Conceptually, it extends the lakehouse by making continuous ingestion + continuous maintenance the default, so one copy of data can serve both low-latency and historical workloads with straightforward SQL access.

We then move from idea to practice, walking through how platforms evolve from batch refreshes, to "near-real-time", to hot/warm/cold SLAs. We'll show a pattern we built first for analytics: an OLAP serving layer with federated SQL over tiered data. That solves the initial problem, until you notice the same shape repeating across other workloads that also want "fresh + queryable" canonical data, not just dashboards. The talk follows how this pushes teams to generalize from "OLAP over the lake" to "tiered access as a platform primitive," and which trade-offs you must get right: where state lives, how tier boundaries are defined, how continuous maintenance is paid for, and what consistency guarantees you can realistically promise.
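To make "tiered access" concrete, here is a minimal Python sketch of the routing decision a federated query layer makes: pick the cheapest (coldest) tier whose worst-case staleness still satisfies the query's freshness requirement. The tier names and staleness bounds are hypothetical illustrations, not anyone's production SLAs.

```python
# Hypothetical tiers, coldest (cheapest) first, with each tier's
# worst-case staleness in seconds. Real boundaries come from SLAs.
TIERS = [
    ("cold", 86400),  # e.g. daily-compacted historical lakehouse data
    ("warm", 300),    # e.g. recently committed table files
    ("hot", 5),       # e.g. log-backed / in-memory fresh data
]

def route(max_staleness_s: float) -> str:
    """Return the cheapest tier whose worst-case lag meets the query's bound.

    A federated SQL layer would then scan that tier (plus any colder tiers
    the query's time range spans) as one logical table.
    """
    for name, worst_lag_s in TIERS:  # coldest first: prefer cheap storage
        if worst_lag_s <= max_staleness_s:
            return name
    return "hot"  # strictest queries fall through to the freshest tier

print(route(1))       # hot
print(route(600))     # warm
print(route(100000))  # cold
```

The point of the sketch is that "real-time" becomes a per-query freshness parameter rather than a separate system: the same canonical table answers all three queries, at different cost.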

The session is practical but forward-looking: a working mental model for tiered real-time lakehouse systems, plus reference architecture patterns and a decision framework for applying the Streamhouse idea as repeatable architecture rather than a custom stack.

Interview:

What is your session about, and why should senior developers care?

It's about replacing the patchwork of topics, databases, warehouses, and feature stores with a table-centric streaming architecture, the Streamhouse: a unified, continuously updated dataset that serves real-time pipelines, analytics, and AI.

Senior developers should care because it cuts system complexity while improving correctness and maintainability.

Why is this topic especially important right now?

Data now needs to be fresh, consistent, and immediately usable for both analytics and AI. Traditional streaming architectures scale but at the cost of duplicated data, heavy state management, and fragile pipelines.

As real-time intelligence goes mainstream, those costs become blockers. It's time to fix the foundation, not keep layering tools on top.

What are the most common challenges teams face today?

Keeping streaming and batch in sync. Managing large state inside stream processors. Maintaining multiple systems that all claim to be the source of truth. AI often amplifies the problem: separate feature stores, vector systems, and analytics pipelines all operating on slightly different data. The result is complexity that slows teams down and makes systems hard to reason about.

What's one practical idea attendees can try immediately?

Pick one pipeline and ask: can this be modeled as a continuously updated table instead of a stream plus a database? Even a small shift toward a shared, table-backed model often simplifies joins, speeds up recovery, and clarifies data ownership.
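As a toy illustration of that shift, here is a minimal Python sketch (event shape and names are hypothetical): rather than a consumer applying events as side effects to an external store, each event folds into a continuously updated table keyed by entity, and that table is the single structure everything queries and recovers from.

```python
from collections import defaultdict

# Hypothetical event stream: (user_id, purchase_amount) events.
events = [("u1", 30), ("u2", 10), ("u1", 5), ("u3", 20), ("u2", 15)]

def apply_event(table, user_id, amount):
    """Fold one event into the canonical table (an upsert).

    Recovery is just replaying the event log through this function;
    there is no separate database to reconcile against.
    """
    table[user_id] += amount
    return table

table = defaultdict(int)
for user_id, amount in events:
    apply_event(table, user_id, amount)

# Joins, dashboards, and features all read the same table.
print(dict(table))  # {'u1': 35, 'u2': 25, 'u3': 20}
```

In a real system the fold runs continuously in a stream processor and the table lives in an open table format, but the ownership question is the same: the table, not the pipeline plus a store, is the source of truth.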

What makes QCon a great place for this discussion?

QCon attracts engineers who care about architectural fundamentals: primitives and trade-offs, not just tools. It's the right environment to have an honest conversation about what's working, what's not, and how to build simpler, more open data foundations for real-time intelligent systems.


Speaker

Giannis Polyzos

Principal Streaming Architect @Ververica

Giannis Polyzos is a Principal Streaming Architect working on large-scale data infrastructure and real-time systems. He has designed and operated streaming platforms used in production by high-scale organizations. He is a PPMC member of Apache Fluss and has been deeply involved in Apache Flink and the broader streaming ecosystem. His work focuses on unifying batch and streaming architectures, simplifying data primitives, and enabling streaming analytics and stateful workloads at scale.


Speaker

Anton Borisov

Principal Data Architect @Fresha

Anton Borisov is a Principal Data Architect building real-time data platforms for customer-facing analytics. His work spans zero-downtime Postgres migrations, CDC-driven streaming pipelines, and architectures that combine stream processing with open table formats and high-performance analytics engines. He’s a well-known voice in the streaming community, writing technical deep-dives on Apache Flink, Fluss, Iceberg, and StarRocks, with a focus on turning cutting-edge ideas into reliable production systems.


Date

Tuesday Mar 17 / 11:45AM GMT (50 minutes)

Location

Fleming (3rd Fl.)

Topics

streaming, lakehouse architecture, AI/ML


From the same track

Session

Introducing Tansu.io: Rethinking Kafka for Lean Operations

Tuesday Mar 17 / 10:35AM GMT

What if Kafka brokers were ephemeral, stateless and leaderless with durability delegated to a pluggable storage layer?


Peter Morgan

Founder @tansu.io

Session: Machine Learning Infrastructure

From S3 to GPU in One Copy: Rethinking Data Loading for ML Training

Tuesday Mar 17 / 01:35PM GMT

ML training pipelines treat data as static. Teams spend weeks preprocessing datasets into WebDataset or TFRecords, and when they want to experiment with curriculum learning or data mixing, they reprocess everything from scratch.


Onur Satici

Staff Engineer @SpiralDB & Core Maintainer of Vortex (LF AI & Data), Previously Building Distributed Systems @Palantir

Session: Generative AI

Ontology-Driven Observability: Building the E2E Knowledge Graph at Netflix Scale

Tuesday Mar 17 / 02:45PM GMT

As Netflix scales hundreds of client platforms, microservices, and infrastructure components, correlating user experience with system performance has become a hard data problem, not just an observability one.


Prasanna Vijayanathan

Engineer @Netflix


Renzo Sanchez-Silva

Engineer @Netflix

Session

Building a Control Plane for Production AI

Tuesday Mar 17 / 03:55PM GMT

Details coming soon.

Session

Unconference: Modern Data Engineering

Tuesday Mar 17 / 05:05PM GMT