Abstract
Everyone is building RAG systems now. But few are building reliable retrieval systems.
Drawing from our real-world production RAG system built on 10K+ documents and used by 300+ users across the organization, we found that most issues can be traced back to two things: indexing and retrieval. This talk shares what matters when building production retrieval systems, from effective document parsing and chunking strategies to a reliable retrieval pipeline. I will present specific implementation details, tooling, and how we solved the challenges we encountered.
Here is one truth: you know who the real boss of NLP is? A PDF! Retrieval is only as good as the documents you index, so the first important step is how you prepare the data. We will cover how to handle nightmares such as multi-page tables, merged cells, and infographics, and present a decision framework for selecting the right tools. Next, we turn to chunking strategies, which depend on specific constraints such as the embedding model's context window, document structure, and information layout. Your system might not even need chunking, and adding it could hurt performance. I will cover when to chunk and how to find the “chunking sweet spot.”
Search is at the heart of almost every AI system. Many retrieval systems see the entire world as just strings. But real user queries carry non-textual signals, such as time, which also need to be encoded. We demonstrate how hybrid search can be combined with temporal scoring to capture the user’s time intent using (1) timeline boosting, (2) time-decay functions, and (3) encoding time as a vector, and we discuss when to use which.
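As a rough illustration of option (2), a time-decay function can be blended with a hybrid relevance score. The sketch below is a minimal Python example under stated assumptions, not the implementation presented in the talk: the names time_decay and temporal_score, the exponential half-life form, the 30-day half-life, and the 0.3 recency weight are all hypothetical choices, and the hybrid score is assumed to be already normalized to [0, 1].

```python
import math
from datetime import datetime, timedelta, timezone


def time_decay(doc_time, query_time, half_life_days=30.0):
    """Exponential recency score: 1.0 at zero age, 0.5 after one half-life.

    Documents dated after the query time are treated as having zero age.
    """
    age_days = max((query_time - doc_time).total_seconds() / 86400.0, 0.0)
    return math.exp(-math.log(2) * age_days / half_life_days)


def temporal_score(hybrid_score, doc_time, query_time,
                   weight=0.3, half_life_days=30.0):
    """Blend a relevance score in [0, 1] with the recency score.

    `weight` is the (assumed) share of the final score given to recency.
    """
    decay = time_decay(doc_time, query_time, half_life_days)
    return (1.0 - weight) * hybrid_score + weight * decay


# Hypothetical example: an older, slightly more relevant document
# versus a fresh, slightly less relevant one.
now = datetime(2025, 1, 1, tzinfo=timezone.utc)
old_doc = temporal_score(0.80, now - timedelta(days=365), now)
new_doc = temporal_score(0.70, now - timedelta(days=1), now)
print(f"old: {old_doc:.3f}  new: {new_doc:.3f}")
```

With these assumed settings, the day-old document scoring 0.70 outranks the year-old document scoring 0.80. A shorter half-life or a larger weight pushes results further toward recency, which is only appropriate when the query actually carries a time intent.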
Finally, let’s not forget evals. If you can't trust your evals, how can you trust your AI system? But creating hundreds or thousands of eval samples is expensive. We will see how a powerful technique called bootstrapping can help determine how many samples you need for your evals!
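One way bootstrapping can inform eval set size is sketched below in Python. This is an assumed approach for illustration, not necessarily the procedure covered in the talk: starting from a small labeled pilot set of pass/fail outcomes, it resamples with replacement at several candidate sizes and reports how wide the 95% interval of the accuracy estimate is at each size. The function names, the pilot data, and the 0.10 width target are hypothetical.

```python
import random


def bootstrap_ci_width(outcomes, n, n_resamples=2000, alpha=0.05, seed=0):
    """Width of the percentile interval of mean accuracy for eval sets of size n.

    `outcomes` is a list of 0/1 results from a pilot eval run. Each resample
    draws n outcomes with replacement and records their mean.
    """
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return upper - lower


def smallest_sufficient_size(outcomes, candidate_sizes, max_width=0.10, **kwargs):
    """Return the smallest candidate size whose interval is narrow enough."""
    for n in sorted(candidate_sizes):
        if bootstrap_ci_width(outcomes, n, **kwargs) <= max_width:
            return n
    return None  # no candidate met the target


# Hypothetical pilot run: 80 passes and 20 failures.
pilot = [1] * 80 + [0] * 20
for size in (50, 200, 1000):
    print(size, round(bootstrap_ci_width(pilot, size), 3))
```

The interval narrows as the candidate size grows, so the smallest size that meets the width target is a reasonable stopping point for labeling. The estimate is only as good as the pilot set, which should be representative of real queries.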
Participants will leave with real insights and practical solutions for building reliable retrieval systems.
Speaker
Lan Chu
AI Tech Lead and Senior Data Scientist
Lan Chu is an AI Tech Lead and Senior Data Scientist with 7+ years of experience building production data and machine learning pipelines and 3+ years of experience building GenAI products. She specializes in designing and implementing data and AI pipelines and in responsible AI practices. Lan has a background in Data Science and deep expertise in Natural Language Processing. She works with AI production systems powered by 10,000+ documents.