Evals

Session: AI/ML

Beyond Benchmarks: How Evaluations Ensure Safety at Scale in LLM Applications

Wednesday Mar 18 / 11:45AM GMT

As LLM systems move from prototypes to production, the gap between benchmark performance and real-world reliability becomes impossible to ignore. Models that score well on benchmarks can still fail unpredictably when facing the complexity, ambiguity, and edge cases of real users.

Clara Matos

Director of Applied AI @Sword Health, Focused on Building and Scaling Machine Learning Systems