For autonomous vehicle companies, finding valuable insights within millions of hours of video data is essential yet challenging. This talk explores how we at Wayve are leveraging foundation models and embeddings to build scalable search tools that make data discovery faster and labeling more efficient.
Attendees will learn how we use vision-language models (VLMs) to retrieve relevant scenarios at scale, which is invaluable for pinpointing the scenes needed to meet safety standards or to evaluate specific driving behaviors. Using embeddings, we train classifiers to detect specific driving competencies, then refine them through an active learning loop until they can efficiently label similar scenarios across the entire dataset. This embedding-based approach is fast and scalable, and it also helps us spot “bad data” clusters, like images with droplets on the lens or scenes from test tracks.
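To make that loop concrete, here is a minimal sketch of an embedding-based classifier refined by uncertainty sampling. It is illustrative only: the random embeddings and the `human_labels` helper are stand-ins for precomputed CLIP features and a human-in-the-loop labelling tool, included so the example runs end to end.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins so the sketch runs; replace with real per-clip
# embeddings (e.g., mean-pooled CLIP features) and real labels.
embeddings = rng.normal(size=(10_000, 512))
labels = np.full(len(embeddings), -1)  # -1 == unlabelled

def human_labels(idx: np.ndarray) -> np.ndarray:
    """Hypothetical human-in-the-loop step; returns 0/1 labels."""
    return rng.integers(0, 2, size=len(idx))

# Seed the loop with a small hand-labelled set.
seed = rng.choice(len(embeddings), size=100, replace=False)
labels[seed] = human_labels(seed)

for _ in range(5):
    known = labels != -1
    clf = LogisticRegression(max_iter=1000).fit(embeddings[known], labels[known])

    # Uncertainty sampling: route the clips nearest the decision
    # boundary back to a human for labelling.
    probs = clf.predict_proba(embeddings[~known])[:, 1]
    uncertain = np.argsort(np.abs(probs - 0.5))[:50]
    query_idx = np.flatnonzero(~known)[uncertain]
    labels[query_idx] = human_labels(query_idx)

# The refined classifier then labels the full dataset cheaply.
predicted = clf.predict(embeddings)
```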
The presentation will delve into the technical infrastructure behind these tools, from vector databases that enable rapid similarity search to Flyte workflows that orchestrate scalable processing across distributed systems. We’ll also explore how query generation compensates for the weak positional awareness of text embeddings (for example, distinguishing a pedestrian on the left from one on the right), allowing for more precise search across video datasets. Finally, the talk will close with a look toward future possibilities, such as on-device edge filtering, which would use embeddings to reduce storage costs by capturing only the most interesting scenarios in real time.
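As a rough sketch of the retrieval path, the snippet below embeds a natural-language query with an open CLIP model and searches a FAISS flat index standing in for a production vector database. The model name, synthetic embeddings, and flat index are illustrative assumptions, not the exact stack used at Wayve.

```python
import faiss
import numpy as np
import torch
import open_clip

# Stand-in for precomputed, L2-normalised frame embeddings;
# in practice these come from the offline perception pipeline.
frames = np.random.default_rng(0).normal(size=(100_000, 512)).astype("float32")
frames /= np.linalg.norm(frames, axis=1, keepdims=True)

# Inner product on unit vectors == cosine similarity. A real deployment
# would use a vector database / ANN index, not an exhaustive flat index.
index = faiss.IndexFlatIP(frames.shape[1])
index.add(frames)

model, _, _ = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

def search(query: str, k: int = 10):
    """Embed a natural-language query and return the k nearest frames."""
    with torch.no_grad():
        q = model.encode_text(tokenizer([query]))
        q = (q / q.norm(dim=-1, keepdim=True)).cpu().numpy().astype("float32")
    scores, ids = index.search(q, k)
    return list(zip(ids[0].tolist(), scores[0].tolist()))

print(search("pedestrian crossing at night in heavy rain"))
```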
Designed for engineers and data scientists, this session provides a deep dive into the power of embeddings and VLMs for labeling and retrieving data at scale, making it possible to unlock insights and drive advancements in autonomous vehicle technology.
Interview:
What is the focus of your work?
My work focuses on building scalable pipelines for running perception models at scale (e.g., segmentation, cuboids, CLIP) and enhancing semantic search capabilities to enable users to search and retrieve relevant video data using natural language. I’m also experimenting with the latest vision-language models (VLMs) for video understanding to address these challenges and perform data mining at scale.
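As a toy illustration of what such a pipeline can look like with Flyte (the orchestrator mentioned in the abstract), here is a minimal flytekit sketch that fans an embedding task out over video clips and indexes the results. The task bodies, names, and types are placeholders, not production code.

```python
from typing import List
from flytekit import task, workflow, map_task

@task
def embed_clip(clip_uri: str) -> List[float]:
    """Placeholder: load frames from clip_uri, run a CLIP-style image
    encoder, and mean-pool into one embedding per clip."""
    return [0.0] * 512  # stand-in vector

@task
def index_embeddings(embeddings: List[List[float]]) -> str:
    """Placeholder: upsert embeddings into the vector store and
    return an index identifier."""
    return "embeddings-index-v1"

@workflow
def embedding_pipeline(clip_uris: List[str]) -> str:
    # map_task fans embed_clip out across clips; Flyte handles
    # scheduling, retries, and caching on the cluster.
    vectors = map_task(embed_clip)(clip_uri=clip_uris)
    return index_embeddings(embeddings=vectors)
```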
What’s the motivation for your talk?
The motivation stems from the growing need for efficient data mining and discovery in the autonomous vehicle industry. With petabytes of data collected from our fleet, we need to surface valuable insights across diverse teams—from safety validation to offline evaluation—while addressing specific challenges like dataset coverage, behavioural evaluation, and bad data removal. This talk aims to highlight how we leverage foundation models and embeddings to solve these challenges, showcasing how scalable search and retrieval tools can transform data understanding and accelerate innovation.
Who is your talk for?
This talk is designed for data scientists, machine learning engineers, and anyone working on video understanding, large-scale ML pipelines, search and retrieval, or autonomous driving technology. It’s particularly relevant for teams dealing with vast datasets and looking to leverage open foundation models for smarter data processing and retrieval.
What do you want someone to walk away with from your presentation?
I want attendees to leave with a clear understanding of how foundation models and embeddings can revolutionise data retrieval and labelling at scale. They’ll learn how to design robust pipelines for large-scale data processing, use VLMs for scene retrieval, and apply active learning loops to scale classifier training. Additionally, I hope to inspire exploration of future possibilities, like on-device edge filtering, to optimise storage and processing costs.
What do you think is the next big disruption in software?
The next big disruption will likely come from the intersection of edge computing and foundation models. Real-time, on-device filtering and decision-making will enable systems to process and prioritise data at the source, dramatically reducing storage and compute costs while enhancing the speed and accuracy of downstream tasks. This will be particularly transformative in fields like autonomous vehicles, IoT, and video understanding, where real-time insights are critical.
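As a thought experiment, an on-device filter might look something like the sketch below: keep a clip when its embedding either matches a curated bank of interesting scenarios or is novel relative to what has already been kept. Every name, threshold, and shape here is a hypothetical assumption, not a description of a real system.

```python
import numpy as np

class EdgeFilter:
    """Hypothetical on-device filter: keep a clip if its embedding matches
    a curated bank of 'interesting' scenario embeddings, or if it is novel
    relative to everything the device has already kept."""

    def __init__(self, interesting: np.ndarray,
                 match_thresh: float = 0.8, novelty_thresh: float = 0.75):
        # `interesting` is an (M, D) array of L2-normalised embeddings
        # curated offline (e.g., near-misses, rare weather).
        self.interesting = interesting
        self.kept: list[np.ndarray] = []  # unbounded here; bound it in practice
        self.match_thresh = match_thresh
        self.novelty_thresh = novelty_thresh

    def keep(self, emb: np.ndarray) -> bool:
        emb = emb / np.linalg.norm(emb)
        # Close to a target scenario: always keep.
        if (self.interesting @ emb).max() >= self.match_thresh:
            self.kept.append(emb)
            return True
        # Far from everything kept so far: treat as novel and keep.
        if not self.kept or max(float(k @ emb) for k in self.kept) < self.novelty_thresh:
            self.kept.append(emb)
            return True
        return False

# Toy usage with random data in place of real embeddings.
rng = np.random.default_rng(0)
bank = rng.normal(size=(8, 512))
bank /= np.linalg.norm(bank, axis=1, keepdims=True)
filt = EdgeFilter(bank)
print(filt.keep(rng.normal(size=512)))
```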
Speaker
Kyra Mozley
Machine Learning Engineer @Wayve
With a background in computer vision and deep learning, Kyra leads the development of tools that leverage foundation models and embeddings to efficiently process and understand vast amounts of autonomous vehicle driving data.