Evals
Session
AI/ML
Beyond Benchmarks: How Evaluations Ensure Safety at Scale in LLM Applications
Wednesday Mar 18 / 11:45AM GMT
As LLM systems move from prototypes to production, the gap between benchmark performance and real-world reliability becomes impossible to ignore. Models that score well on benchmarks can still fail unpredictably when facing the complexity, ambiguity, and edge cases of real users.
Clara Matos
Director of Applied AI @Sword Health, focused on building and scaling machine learning systems