Real-time analytics calls for systems that deliver ultra-low latency (milliseconds) and extremely high throughput (hundreds of thousands of queries per second). One example of such a system is Apache Pinot, which excels at real-time use cases like user-facing analytics and personalization.
Pinot users love its speed and experience, and want to use it for all their use cases: internal analytics, ad hoc analytics, reporting, and much more. Such use cases typically require storing data with very long retention.
You can of course do that today, but storing large amounts of data in a system like Pinot can get expensive, because such systems have tightly coupled storage and compute. As the total data volume grows, more resources (compute + storage) need to be provisioned, whether or not the added compute is actually utilized, resulting in a high cost to serve. Moreover, fresh and recent data is often more valuable than historical data and is typically queried more frequently, so beyond a certain retention period, users are often okay with trading slightly higher latencies for reduced cost.
One option is to introduce a separate, decoupled system for historical data analytics. Such systems use cloud object storage, which reduces the cost. But that pushes your latencies into the tens of seconds and adds the overhead of maintaining and operating a new system and federating queries across the two.
To address these challenges, we added Tiered Storage for Apache Pinot in StarTree Cloud, which gives you the speed of Apache Pinot at the cost of cloud storage! In this talk, we will dive deep into how we built an abstraction in Apache Pinot to make it agnostic of where the data is located. We'll talk about how we're able to query data directly on the cloud with sub-second latencies, rather than downloading the entire data as a lazy-loading approach would, and dive very deep into the data fetch and optimization strategies, the challenges we faced, and what we learned. We'll also talk about the various ways you can configure and customize which portion of your data stays local and tightly coupled and which moves to the cloud, giving you the best of both worlds.
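To make the idea of splitting data between local and cloud storage concrete, here is a minimal sketch of an age-based placement rule, where recent segments stay on local disk and older ones move to cloud object storage. It is an illustration only and not StarTree's or Apache Pinot's actual implementation or configuration format: the `Tier` class, the `pick_tier` function, the tier names, and the 30-day threshold are all assumptions made up for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Tier:
    """Hypothetical storage tier; not a Pinot or StarTree type."""
    name: str            # illustrative tier label
    min_age: timedelta   # segments at least this old qualify for the tier


# Assumed policy for illustration: data older than 30 days moves to the cloud.
TIERS = [
    Tier("cloud_object_storage", timedelta(days=30)),
    Tier("local_disk", timedelta(0)),
]


def pick_tier(segment_end_time, now, tiers=TIERS):
    """Return the name of the tier with the largest age threshold the segment meets."""
    age = now - segment_end_time
    # Check the oldest threshold first so the most specific tier wins.
    for tier in sorted(tiers, key=lambda t: t.min_age, reverse=True):
        if age >= tier.min_age:
            return tier.name
    raise ValueError("no tier matches segment age")


if __name__ == "__main__":
    now = datetime(2024, 1, 31, tzinfo=timezone.utc)
    print(pick_tier(now - timedelta(days=2), now))
    print(pick_tier(now - timedelta(days=45), now))
```

A rule like this captures the trade-off described above: frequently queried recent data keeps its low latency on local storage, while rarely queried historical data accepts slightly higher latency in exchange for cheaper storage.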
Speaker
Neha Pawar
Founding Engineer @StarTree
Neha Pawar is a Founding Engineer at StarTree (https://www.startree.ai/), which aims to democratize data for all users by providing real-time, user-facing analytics. Prior to this, she was part of LinkedIn's Data Analytics Infrastructure org for 5 years, working on Apache Pinot & ThirdEye. She is passionate about big data technologies and real-time analytics databases.
Neha is an Apache Pinot PMC member and committer. She has made numerous impactful contributions to Apache Pinot, with a focus on storage optimizations, tiered storage, real-time streaming integrations, and ingestion. She actively fosters the growing Apache Pinot community and loves to evangelize Pinot by making entertaining video tutorials and illustrations.
When not sipping Pinot, you can find Neha painting or hiking with her dogs.