Building High-Fidelity Data Streams

Low-latency data streaming remains a hot topic among data engineers today. At its core, streaming promises to deliver data in near real time in order to power snappy, data-driven user experiences. These experiences come in many forms, including low-latency updates to social news feeds, near-real-time payment fraud prevention, time-relevant recommender systems used in flash sales, self-driving car route planning, and more. Our need to stay engaged has made low-latency data streams a critical part of modern data architectures.

While it may seem trivial to get a data streaming POC up and running, productionizing such a system under strict SLAs with a lean engineering team requires making the right choices and learning from mistakes along the way. At Datazoom, we built a lossless streaming data system that guarantees sub-second (p95) event delivery at scale with better than three-nines availability; we measure availability in terms of the on-time delivery of events. Come to this talk to learn how you can build such a system, soup to nuts.
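
To make that SLA concrete, here is a minimal sketch (in Python; an illustration, not Datazoom's actual code) of how on-time delivery can be measured: the p95 end-to-end latency over a window of events, plus the fraction of events delivered within a one-second budget. "Better than three nines" means that fraction stays above 0.999.

    import math
    from dataclasses import dataclass

    @dataclass
    class EventRecord:
        ingest_ts_ms: int   # when the event entered the pipeline
        deliver_ts_ms: int  # when it reached its destination

    def p95_latency_ms(events: list[EventRecord]) -> int:
        """95th-percentile end-to-end latency (nearest-rank method)."""
        latencies = sorted(e.deliver_ts_ms - e.ingest_ts_ms for e in events)
        return latencies[math.ceil(0.95 * len(latencies)) - 1]

    def on_time_availability(events: list[EventRecord], sla_ms: int = 1000) -> float:
        """Fraction of events delivered within the SLA budget."""
        on_time = sum(1 for e in events if e.deliver_ts_ms - e.ingest_ts_ms <= sla_ms)
        return on_time / len(events)

    # Example: of 1,000 events, 999 arrive in 800 ms and one straggler takes
    # 1.5 s -> p95 = 800 ms (sub-second) and availability = 0.999.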

Interview:

What is the focus of your work these days?

I currently serve as the Chief Architect and Head of Engineering at Datazoom, a company that offers a video data platform for capturing video playback telemetry. This data can be used to understand how customers experience and interact with video. At Datazoom, we build both client SDKs and a cloud-based analytics platform.

What’s the motivation for your talk?

In this talk, I explain how engineers can build a low-latency, high-fidelity data streaming system using open-source software and public cloud technologies, combined with recommended best practices. My talk focuses on the non-functional requirements (the "-ilities") of such a system, including scalability, performance, reliability, observability, and availability.

How would you describe the persona and level of the target audience?

This talk takes a ground-up approach to building such a system and requires little background knowledge beyond basic familiarity with AWS technologies and Apache Kafka. The ideal audience is beginner-to-intermediate engineers interested in building a high-fidelity streaming system.

What do you want this persona to walk away with from your presentation?

This talk will serve as an architect’s guide to building a high-fidelity streaming system. While it may leave out specific details for lack of time, it will provide enough information to get an architect 80% of the way to building a similar system.

What do you think is the next big disruption in software?

AI-managed data infrastructure. It is sorely needed to ease the onerous task of operating data systems at scale.


Speaker

Sid Anand

Fellow, Cloud & Data Platform @Walmart, Apache Airflow Committer/PMC, Ex-Netflix, LinkedIn, eBay, Etsy, & PayPal

Sid recently joined Walmart (i.e. Walmart Global Tech) as a Fellow to work on all things data. Prior to joining Walmart Global Tech, Sid served as the Chief Architect and Head of Engineering for Datazoom, where he and his team built high-fidelity, low-latency data streaming systems. Prior to that, Sid served as PayPal's Chief Data Engineer, where he helped build systems, platforms, teams, and processes, all with the aim of providing access to the hundreds of petabytes of data under PayPal's management. Earlier, Sid held senior technical positions at Netflix, LinkedIn, eBay, and Etsy. He earned his BS and MS degrees in CS from Cornell University, focusing on distributed systems.

Outside of work, Sid advises early-stage companies and several conferences. Once an active committer on Apache Airflow, he is now mostly a fan.

Sid's body of work includes but is not limited to:

  • The world's first cloud-based streaming video service: he was the first engineer to work on the cloud at Netflix
  • LinkedIn's Federated Search Typeahead (a.k.a. auto-complete)
  • LinkedIn's (Big Data) Self-service Marketing Analytics tool
  • PayPal's DBaaS - an internal self-service system to provision & manage heterogeneous databases
  • PayPal's CDC - an internal self-service CDC system to stream DB updates to nearline applications
  • eBay-over-Skype: following the Skype acquisition, he built a P2P version of eBay offers
  • eBay's Best Match Search Ranking Engine powered by an In-Memory Database
  • eBay's Fuzzy-match name/email Search
  • Agari's Data Platform: Batch & Streaming Predictive Data Platform as a Service
  • Datazoom's Platform: High-fidelity, Low-latency Streaming Data Platform as a Service

From the same track

Session: Microservices

Banking on Thousands of Microservices

Monday Mar 27 / 05:25PM BST

Monzo has built an entire banking platform from scratch, composed of many microservices; it serves over 7 million customers daily with a lean engineering organisation. All aspects of the bank are deployed hundreds of times a day (even on Fridays!).


Suhail Patel

Staff Engineer @Monzo Focused on Designing and Operating Distributed Systems, Previously @Citymapper

Session: Scalability

Scaling Google's Global Cloud L7 Load Balancer

Monday Mar 27 / 10:35AM BST

We'll take a look at Google's Global Cloud L7 Load Balancer: how it's put together, and how we've scaled it to meet the reliability and performance demands of our Cloud customers.


James Spooner

Principal Engineer, Load Balancing @Google

Session: Scalability

Zoom: Why Does It Work?

Monday Mar 27 / 04:10PM BST

During the pandemic, Zoom had to scale massively to support the big move from working in the office every day to meeting online for both business and private use. How did Zoom manage this scaling challenge? And when you join a Zoom call, how does that actually work?


Ian Sleebe

Senior Solutions Architect @Zoom

Session: Microservices

Tales of Kafka @Cloudflare: Lessons Learnt on the Way to 1 Trillion Messages

Monday Mar 27 / 02:55PM BST

Cloudflare uses Kafka to decouple microservices and communicate the creation, change or deletion of various resources via a common data format in a fault-tolerant manner.
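
As a concrete illustration of that pattern, here is a minimal Python sketch of such a shared event envelope (the field names are illustrative assumptions, not Cloudflare's actual schema):

    import json
    import time
    import uuid
    from dataclasses import asdict, dataclass

    @dataclass
    class ResourceEvent:
        event_id: str       # unique per event; lets consumers deduplicate retries
        resource_type: str  # e.g. "dns_record"
        resource_id: str
        action: str         # "created" | "changed" | "deleted"
        occurred_at_ms: int
        payload: dict       # snapshot or delta of the resource

    def encode_event(resource_type: str, resource_id: str, action: str, payload: dict) -> bytes:
        """Serialize a resource lifecycle event, ready to hand to a Kafka producer.

        Keying the Kafka message by resource_id keeps all events for one
        resource on the same partition, preserving per-resource ordering.
        """
        event = ResourceEvent(
            event_id=str(uuid.uuid4()),
            resource_type=resource_type,
            resource_id=resource_id,
            action=action,
            occurred_at_ms=int(time.time() * 1000),
            payload=payload,
        )
        return json.dumps(asdict(event)).encode("utf-8")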


Andrea Medda

Senior Systems Engineer @Cloudflare


Matt Boyle

Engineering Manager @Cloudflare

Session

Unconference: Architectures You've Always Wondered About

Monday Mar 27 / 11:50AM BST

What is an unconference? An unconference is a participant-driven meeting. Attendees come together, bringing their challenges and relying on the experience and know-how of their peers for solutions.


Shane Hastie

Global Delivery Lead @SoftEd, Lead Editor for Culture & Methods @InfoQ