Speed of Apache Pinot at the Cost of Cloud Object Storage with Tiered Storage

For real-time analytics, you need systems that can provide ultra-low latency (milliseconds) and extremely high throughput (hundreds of thousands of queries per second). One example of such a system is Apache Pinot, which is excellent for real-time analytics use cases like user-facing analytics and personalization.

Pinot users love its speed and experience, and want to use Pinot for all their use cases, be it internal analytics, ad hoc analytics, reporting and much more. Such use cases typically require storing data with long retention periods.

You can of course do that today, but it can get expensive to store large amounts of data in a system like Pinot, because such systems have tightly coupled storage and compute. As the total data volume grows, more resources (compute + storage) need to be provisioned, whether or not the additional compute is actually utilized, resulting in a high cost to serve. In addition, fresh and recent data is often more valuable than historical data and is typically queried more frequently, so beyond a certain retention period, users are often okay with trading slightly higher latencies for reduced cost.

One option for users is to introduce decoupled systems for historical data analytics. Such systems use cloud object storage, which reduces the cost. But that pushes your latencies into the tens of seconds and also introduces the overhead of maintaining and operating a new system and federating queries.

To address these challenges, we added Tiered Storage for Apache Pinot in StarTree Cloud, which gives you the speed of Apache Pinot at the cost of cloud storage! In this talk, we will dive deep into how we built an abstraction in Apache Pinot to make it agnostic of where the data is located. We'll talk about how we're able to query data on the cloud directly (without downloading the entire data, as lazy loading does) with sub-second latencies, diving very deep into all the data fetch and optimization strategies, the challenges faced and the lessons learned. We'll talk about the various ways you can configure and customize which portion of your data resides locally as tightly coupled storage and which moves to the cloud, giving the best of both worlds.
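
As a rough illustration of the idea of keeping recent data local and moving older data to the cloud, here is a minimal Python sketch of age-based tier selection. It is not Pinot's or StarTree's actual configuration or API: the Tier class, the pick_tier function, the tier names, and the 30-day threshold are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Tier:
    """Hypothetical storage tier; not a Pinot or StarTree type."""
    name: str
    min_age: timedelta  # data at least this old is eligible for the tier


def pick_tier(data_end_time: datetime, now: datetime, tiers: list[Tier]) -> Tier:
    """Return the eligible tier with the largest age threshold.

    If no tier is eligible (for example, the end time lies in the future),
    fall back to the tier with the smallest threshold.
    """
    age = now - data_end_time
    eligible = [t for t in tiers if age >= t.min_age]
    youngest = min(tiers, key=lambda t: t.min_age)
    return max(eligible, key=lambda t: t.min_age, default=youngest)


# Assumed policy for illustration: data older than 30 days moves to the cloud.
TIERS = [
    Tier("local", timedelta(0)),
    Tier("cloud", timedelta(days=30)),
]

if __name__ == "__main__":
    now = datetime(2024, 1, 31, tzinfo=timezone.utc)
    for days_old in (3, 45):
        tier = pick_tier(now - timedelta(days=days_old), now, TIERS)
        print(f"{days_old} days old: {tier.name}")
```

With this assumed policy, data that is 3 days old stays on the local tier and data that is 45 days old is assigned to the cloud tier. A real deployment would likely offer more selection criteria than age alone.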


Speaker

Neha Pawar

Founding Engineer @StarTree

Neha Pawar is a Founding Engineer at StarTree (https://www.startree.ai/), which aims to democratize data for all users by providing real-time, user-facing analytics. Prior to this, she was part of LinkedIn's Data Analytics Infrastructure org for 5 years, working on Apache Pinot & ThirdEye. She is passionate about big data technologies and real-time analytics databases.

Neha is an Apache Pinot PMC member and committer. She has made numerous impactful contributions to Apache Pinot, with a focus on storage optimizations, tiered storage, real-time streaming integrations and ingestion. She actively fosters the growing Apache Pinot community & loves to evangelize Pinot by making entertaining video tutorials & illustrations.

When not sipping Pinot, you can find Neha painting or hiking with her dogs.

From the same track

Session

Building High-Fidelity Data Streams

Low latency data streaming technology and practices remain a hot and trending topic among data engineers today. At its core, it promises to deliver data in near real time in order to provide snappy data-driven user experiences.

Sid Anand

Chief Architect and Head of Engineering @Datazoom

Session

Change Data Capture for Microservices

Microservices represent complex business domains in the form of loosely coupled systems, but these don't exist in isolation: services need to propagate data changes amongst each other, in a reliable and scalable way.

Gunnar Morling

Senior Staff Software Engineer @Decodableco

Session

DynamoDB Transactions

NoSQL cloud database services are popular for their simple key-value operations, high availability, high scalability, and predictable performance.

Akshat Vig

Principal Engineer NoSQL databases @awscloud