Speed of Apache Pinot at the Cost of Cloud Object Storage with Tiered Storage

For real-time analytics, you need systems that can provide ultra-low latency (milliseconds) and extremely high throughput (hundreds of thousands of queries per second). One example of such a system is Apache Pinot, which is excellent for real-time analytics use cases like user-facing analytics and personalization.

Pinot users love its speed and experience, and want to use it for all their use cases, be it internal analytics, ad hoc analytics, reporting, and much more. Such use cases typically require retaining data for much longer periods.

You can of course do that today, but it can get expensive to store large amounts of data in a system like Pinot, because such systems have tightly coupled storage and compute. As the total data volume grows, more resources (compute + storage) need to be provisioned, whether or not the corresponding compute resources are utilized, resulting in a high cost to serve. Plus, the fresh and recent data is often more valuable than the historical data, and typically queried more frequently, so beyond a certain retention, users are often okay with trading off slightly higher latencies in exchange for reduced cost.

One option for users is to introduce decoupled systems for historical data analytics. Such systems use cloud object storage, which reduces the cost. But that will push your latencies into the tens of seconds and also introduce the overhead of maintaining and operating a new system and federating queries.

To address these challenges, we added Tiered Storage for Apache Pinot in StarTree Cloud, which gives you the speed of Apache Pinot at the cost of cloud storage! In this talk, we will dive deep into how we built an abstraction in Apache Pinot to make it agnostic of where the data is located. We'll talk about how we're able to query data in cloud storage directly (rather than downloading entire segments, as lazy-loading does) with sub-second latencies, diving deep into the data fetch and optimization strategies, the challenges we faced, and what we learned. We'll also talk about the various ways you can configure and customize which portion of your data resides locally, tightly coupled with compute, and which moves to the cloud, giving you the best of both worlds.


Speaker

Neha Pawar

Founding Engineer @StarTree

Neha Pawar is a Founding Engineer at StarTree (https://www.startree.ai/), which aims to democratize data for all users by providing real-time, user-facing analytics. Prior to this, she was part of LinkedIn's Data Analytics Infrastructure org for 5 years, working on Apache Pinot & ThirdEye. She is passionate about big data technologies and real-time analytics databases.

Neha is an Apache Pinot PMC member and committer. She has made numerous impactful contributions to Apache Pinot, with a focus on storage optimizations, tiered storage, real-time streaming integrations, and ingestion. She actively fosters the growing Apache Pinot community and loves to evangelize Pinot by making entertaining video tutorials and illustrations.

When not sipping Pinot, you can find Neha painting or hiking with her dogs.

Date

Monday Mar 27 / 11:50AM BST (50 minutes)

Location

Windsor (5th Fl.)

Topics

Apache Pinot, data access, real-time analytics, storage

From the same track

Session Microservices

Change Data Capture for Microservices

Monday Mar 27 / 01:40PM BST

Microservices represent complex business domains in the form of loosely coupled systems, but these don't exist in isolation: services need to propagate data changes amongst each other, in a reliable and scalable way.

Gunnar Morling

Senior Staff Software Engineer @Decodableco

Session transactions

Amazon DynamoDB Distributed Transactions at Scale

Monday Mar 27 / 02:55PM BST

NoSQL databases are popular for their high availability, high scalability, and predictable performance.

Akshat Vig

Senior Principal Engineer NoSQL databases @awscloud

Session processing techniques

In-Process Analytical Data Management with DuckDB

Monday Mar 27 / 05:25PM BST

Analytical data management systems have long been monolithic monsters far removed from the action by ancient protocols. Redesigning them to move into the application process greatly streamlines data transfer, deployment, and management.

Hannes Mühleisen

Co-founder and CEO @duckdblabs

Session raft

Multi-Region Data Streaming with Redpanda

Monday Mar 27 / 04:10PM BST

Real-time data streaming platforms such as Redpanda have become a mission-critical component in enterprise infrastructure. Multi-region deployments of streaming applications can provide important benefits, such as improved resiliency, better performance, and cost reduction.

Michał Maślanka

Software Engineer @Redpanda

Session

A New Era for Database Design with TigerBeetle

Monday Mar 27 / 10:35AM BST

The pre-recorded video of this presentation will become available within the next few hours.  

Joran Greef

Founder and CEO @TigerBeetle