Deploy MultiModal RAG Systems with vLLM

While text-based RAG systems have been everywhere for the past year and a half, data is so much more than text. Images, audio, and documents often need to be processed together to provide meaningful insights, yet most RAG implementations focus solely on text. Think of automated visual inspection systems that understand both manufacturing logs and production-line images, or robotics systems that correlate sensor data with visual feedback. These multimodal scenarios demand RAG systems that go beyond text-only processing.

In this talk, we'll cover how to build a MultiModal RAG system that solves this problem. We'll explore the architecture that makes such a system possible and demonstrate how to build one using Milvus, LlamaIndex, and vLLM to deploy open-source LLMs on your own infrastructure.

Through a live demo, we'll showcase a real-world application processing both images and text queries. Whether you're looking to reduce API costs, maintain data privacy, or simply gain more control over your AI infrastructure, this session will provide you with actionable insights to implement MultiModal RAG in your organization.
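As a rough sketch of the kind of stack the abstract describes: vLLM can serve an open-source multimodal model behind an OpenAI-compatible HTTP API, and the RAG client then sends a retrieved image together with the user's text question in a single chat request. The payload below follows the OpenAI chat-completions format that vLLM accepts; the server URL, model name, and helper function are illustrative assumptions, not code from the talk.

```python
import base64
import json

# vLLM can serve an open-source vision-language model behind an
# OpenAI-compatible API, e.g.:  vllm serve llava-hf/llava-1.5-7b-hf
# The URL and model name below are illustrative assumptions.
VLLM_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "llava-hf/llava-1.5-7b-hf"

def build_multimodal_request(question: str, image_bytes: bytes) -> dict:
    """Build an OpenAI-style chat request mixing text and an inline image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": MODEL,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        # Images can be passed inline as data URLs.
                        "image_url": {"url": f"data:image/png;base64,{b64}"},
                    },
                ],
            }
        ],
    }

# Example: a production-line inspection query (placeholder image bytes).
payload = build_multimodal_request(
    "Does this frame show a defective weld?", b"\x89PNG..."
)
print(json.dumps(payload)[:60])
```

In the full system, Milvus would store the image and text embeddings and LlamaIndex would handle retrieval; the request above is only the final generation step, POSTed to the vLLM server with any HTTP client.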

Interview:

What is the focus of your work?

I focus on GenAI usage, from simple RAG systems to fully agentic ones. I also highlight how search works at scale. I am a big open-source fan, so most of my work is focused around that.

What’s the motivation for your talk?

To showcase that you can deploy open-source applications that perform very well. The idea is to show people that they can stay in control rather than depend on closed-source systems.

Who is your talk for?

People interested in moving away from OpenAI who want to control their GenAI stack, as well as people interested in multimodality.

What do you want someone to walk away with from your presentation?

Learn how specific open-source tools like vLLM can match or exceed proprietary solutions while giving you full control over your AI stack and SLAs.

What do you think is the next big disruption in software?

I believe that the combination of open source AI models and rapid development tools will enable more customized, sovereign AI solutions.

People will have their own unique versions of their software and will likely rely less on the typical off-the-shelf apps we use today.


Speaker

Stephen Batifol

Developer Advocate @Zilliz, Founding Member of the MLOps Community Berlin, Previously Machine Learning Engineer @Wolt, and Data Scientist @Brevo

Stephen Batifol is a Developer Advocate at Zilliz. He previously worked as a Machine Learning Engineer at Wolt, where he created and worked on the ML Platform, and previously as a Data Scientist at Brevo. Stephen studied Computer Science and Artificial Intelligence.

He is a founding member of the MLOps.community Berlin group, where he organizes meetups and hackathons. He enjoys boxing and surfing.


From the same track

Session AI/ML

How to Unlock Insights and Enable Discovery Within Petabytes of Autonomous Driving Data

Tuesday Apr 8 / 11:45AM BST

For autonomous vehicle companies, finding valuable insights within millions of hours of video data is essential yet challenging.


Kyra Mozley

Machine Learning Engineer @Wayve

Session AI/ML

AI for Food Image Generation in Production: How & Why

Tuesday Apr 8 / 01:35PM BST

In this talk, we will conduct a technical overview of a client-facing Food Image Generation solution developed at Delivery Hero.


Iaroslav Amerkhanov

Senior Data Scientist @Delivery Hero

Session

Foundation Models for Recommenders: Challenges, Successes, and Lessons Learned

Tuesday Apr 8 / 02:45PM BST

Recommender systems are an integral part of most products nowadays and are often a key driver of discovery for users of the product.


Moumita Bhattacharya

Senior Research Scientist @Netflix, Previously @Etsy

Session

Building Embedding Models for Large-Scale Real-World Applications

Tuesday Apr 8 / 03:55PM BST

Embedding models are at the core of search, recommendation, and retrieval-augmented generation (RAG) systems, transforming data into meaningful representations.


Sahil Dua

Senior Software Engineer, Machine Learning @Google, Stanford AI, Co-Author of “The Kubernetes Workshop”, Open-Source Enthusiast

Session

Lessons Learned From Building LinkedIn’s First Agent: Hiring Assistant

Tuesday Apr 8 / 05:05PM BST

In October 2024, we announced LinkedIn’s first agent, Hiring Assistant, to a select group of LinkedIn customers.


Karthik Ramgopal

Distinguished Engineer & Tech Lead of the Product Engineering Team @LinkedIn, 15+ Years of Experience in Full-Stack Software Development


Daniel Hewlett

Principal AI Engineer & Technical Lead for AI @LinkedIn, 12+ Years of Expierence in ML and AI Engineering, Previously @Google