Building High-Fidelity Data Streams

Low-latency data streaming remains a hot topic among data engineers. At its core, it promises to deliver data in near real time in order to provide snappy, data-driven user experiences. These experiences come in many forms, including low-latency updates to social news feeds, near-real-time payment fraud prevention, time-relevant recommender systems used in flash sales, self-driving car route planning, and more. Our need to stay engaged has made low-latency data streams a critical part of modern data architectures.

While it may seem trivial to get a data streaming POC up and running, productionizing such a system under strict SLAs with a lean engineering team requires not only making the right choices but also learning from mistakes along the way. At Datazoom, we built a lossless streaming data system that guarantees sub-second (p95) event delivery at scale with better than three nines of availability, where availability is measured in terms of the on-time delivery of events. Come to this talk to learn how you can build such a system from soup to nuts.
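To make the two metrics in the abstract concrete, here is a minimal Python sketch of how a p95 delivery latency and an on-time-delivery availability figure might be computed from per-event latencies. It is not Datazoom's implementation: the function names `p95` and `on_time_availability`, the nearest-rank percentile method, and the `deadline_ms` parameter are all assumptions made for illustration, and the talk abstract does not state which deadline defines "on time".

```python
def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies (ms)."""
    ordered = sorted(latencies_ms)
    # ceil(0.95 * n) computed with integers to avoid floating-point rounding
    rank = (95 * len(ordered) + 99) // 100  # 1-based rank
    return ordered[rank - 1]


def on_time_availability(latencies_ms, deadline_ms):
    """Fraction of events delivered within deadline_ms (hypothetical definition)."""
    on_time = sum(1 for latency in latencies_ms if latency <= deadline_ms)
    return on_time / len(latencies_ms)


# Example with made-up end-to-end delivery latencies, in milliseconds.
sample = [120, 180, 240, 310, 450, 520, 640, 700, 880, 1500]
print("p95 latency (ms):", p95(sample))
print("on-time fraction:", on_time_availability(sample, deadline_ms=1000))
```

Under this reading, "better than three nines" would mean an on-time fraction above 0.999 over the measurement window, for whatever deadline the SLA specifies.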

What is the focus of your work these days?

I currently serve as the Chief Architect and Head of Engineering at Datazoom, a company that offers a video data platform that captures video playback telemetry data. This data can be used to understand how customers experience and interact with video. At Datazoom, we build both client SDKs and a cloud-based analytics platform.

What’s the motivation for your talk?

In this talk, I explain how engineers can build a low-latency, high-fidelity data streaming system using open source software and public cloud technologies combined with recommended best practices. My talk focuses on the non-functional requirements (i.e., the "-ilities") of such a system, including scalability, performance, reliability, observability, and availability.

How would you describe the persona and level of the target audience?

This talk will take a ground-up approach to building such a system. It requires little background knowledge beyond basic familiarity with various AWS technologies and Apache Kafka. The ideal audience is engineers, from beginner to intermediate, who are interested in building a high-fidelity streaming system.

What do you want this persona to walk away with from your presentation?

This talk will serve as an architect’s guide to building a high-fidelity streaming system. While it may leave out specific details for lack of time, it will provide enough information to get an architect 80% of the way to building a similar system.

What do you think is the next big disruption in software?

AI-managed data infrastructure. It is sorely needed to reduce the burden of operating data infrastructure at scale.


Speaker

Sid Anand

Chief Architect and Head of Engineering @Datazoom

Sid Anand currently serves as the Chief Architect and Head of Engineering for Datazoom, where he and his team build autonomous streaming data systems for Datazoom's high-fidelity, low-latency streaming analytics needs. Prior to joining Datazoom, Sid served as PayPal's Chief Data Engineer, focusing on ways to realize the value of PayPal's hundreds of petabytes of data. Prior to joining PayPal, he held several positions, including Data Architect at Agari, Technical Lead in Search at LinkedIn, Cloud Data Architect at Netflix, VP of Engineering at Etsy, and several technical roles at eBay. Sid earned his BS and MS degrees in CS from Cornell University, where he focused on Distributed Systems. Outside of work, Sid is a maintainer/committer on Apache Airflow and advises early-stage companies and several conferences (QCon, Data Council, and conferences under Skills Matter).

From the same track

Session

Change Data Capture for Microservices

Microservices represent complex business domains in the form of loosely coupled systems, but these don't exist in isolation: services need to propagate data changes amongst each other, in a reliable and scalable way.

Gunnar Morling

Senior Staff Software Engineer @Decodableco

Session

DynamoDB Transactions

NoSQL cloud database services are popular for their simple key-value operations, high availability, high scalability, and predictable performance.

Akshat Vig

Principal Engineer NoSQL databases @awscloud

Session

Speed of Apache Pinot at the Cost of Cloud Object Storage with Tiered Storage

For real-time analytics, you need systems that can provide ultra low latency (milliseconds) and extremely high throughput (hundreds of thousands of queries per second).

Neha Pawar

Founding Engineer @StarTree