Beyond Benchmarks: How Evaluations Ensure Safety at Scale in LLM Applications

Abstract

As LLM systems move from prototypes to production, the gap between benchmark performance and real-world reliability becomes impossible to ignore. Models that score well on benchmarks can still fail unpredictably when facing the complexity, ambiguity, and edge cases of real users. So how do we actually know if our AI systems are working?

In this practical, example-driven talk, we'll explore why robust evaluation, both before and after deployment, is the key to building trustworthy AI systems. We'll cover the full evaluation lifecycle: offline evaluation before release, from automated metrics to human review; and online evaluation in production, from observability to A/B testing. Drawing on examples from health AI, where safety, consistency, and reliability are non-negotiable, we'll show how these practices apply to any domain where AI needs to work reliably at scale.

By the end of this session, you'll walk away with an end-to-end framework for building a robust feedback flywheel that supports continuous, evaluation-driven development of LLM-powered products.
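To make the offline half of that lifecycle concrete, here is a minimal sketch of an automated evaluation gate. The golden set, the `call_model` stub, and the keyword-based grader are hypothetical stand-ins, not material from the talk; production setups typically pair checks like this with LLM-as-judge scoring and human review.

```python
# Minimal offline evaluation harness (illustrative sketch).
# GOLDEN_SET and call_model are hypothetical stand-ins for a curated
# test set and a real LLM client.

GOLDEN_SET = [
    {"prompt": "Can I ice a swollen knee?", "must_mention": ["ice", "consult"]},
    {"prompt": "Is sharp chest pain normal after exercise?", "must_mention": ["seek", "medical"]},
]

def call_model(prompt: str) -> str:
    """Placeholder for the real LLM call."""
    return "Please ice the area, seek medical advice, and consult a clinician."

def grade(answer: str, must_mention: list[str]) -> bool:
    """Crude automated check: does the answer mention the required safety terms?"""
    lowered = answer.lower()
    return all(term in lowered for term in must_mention)

def run_offline_eval() -> float:
    passed = sum(grade(call_model(case["prompt"]), case["must_mention"])
                 for case in GOLDEN_SET)
    score = passed / len(GOLDEN_SET)
    print(f"pass rate: {score:.0%} ({passed}/{len(GOLDEN_SET)})")
    return score

if __name__ == "__main__":
    run_offline_eval()  # a release can be gated on this score before shipping
```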


Speaker

Clara Matos

Director of Applied AI @Sword Health, Focused on Building and Scaling Machine Learning Systems

Clara enjoys working at the intersection of Machine Learning, Product, and Engineering, solving problems in a pragmatic, iterative way. She currently leads Applied AI at Sword Health, where her team is reinventing how patients access and receive care by creating a more human, more clinically effective, and more scalable way to treat them. She is focused on building and scaling machine learning systems that advance Sword's mission of freeing 2 million people from pain.


Date

Wednesday Mar 18 / 11:45AM GMT (50 minutes)

Location

Fleming (3rd Fl.)

Slides

Slides are not available


From the same track

Session: Agentic Coding

The Right 300 Tokens Beat 100k Noisy Ones: The Architecture of Context Engineering

Wednesday Mar 18 / 10:35AM GMT

Your agent has 100k tokens of context. It still forgets what you told it two messages ago.
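The premise, that a small set of relevant tokens beats a large noisy window, can be sketched with a toy selector. The word-overlap scorer and in-memory history below are hypothetical simplifications; real context engineering relies on embeddings, summarization, and structured memory.

```python
# Toy context selector: keep only the snippets most relevant to the query.
# Word-overlap scoring is a hypothetical stand-in for embedding similarity.

def score(query: str, snippet: str) -> int:
    q_words = set(query.lower().split())
    return len(q_words & set(snippet.lower().split()))

def select_context(query: str, history: list[str], budget: int = 3) -> list[str]:
    """Return the `budget` most relevant snippets instead of the full history."""
    return sorted(history, key=lambda s: score(query, s), reverse=True)[:budget]

history = [
    "User prefers metric units.",
    "User is deploying to eu-west-1.",
    "Unrelated chit-chat about the weather.",
    "User's service is written in Go.",
]
print(select_context("which region should the Go service deploy to?", history))
```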


Patrick Debois

AI Product Engineer @Tessl, Co-Author of the "DevOps Handbook", Content Curator at AI Native Developer Community


Baruch Sadogursky

DevRel Team and Context Engineering Management @Tessl AI, Co-Author of #LiquidSoftware and #DevOps Tools for #Java Developers, Java Champion, Microsoft MVP

Session

Explicit Semantics for AI Applications: Ontologies in Practice

Wednesday Mar 18 / 03:55PM GMT

Modern AI applications struggle not because of a lack of models, but because meaning is implicit, fragmented, and brittle. In this talk, we’ll explore how making semantics explicit (using ontologies and knowledge graphs) changes how we design, build, and operate AI systems.
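As a toy illustration of explicit semantics, the sketch below stores facts as subject-predicate-object triples with a small class hierarchy, so meaning can be queried rather than inferred from prose. The entities and relations are hypothetical examples, not taken from the talk.

```python
# Minimal triple store with a toy ontology (hypothetical example).

TRIPLES = {
    ("aspirin", "is_a", "nsaid"),
    ("nsaid", "is_a", "drug"),
    ("aspirin", "treats", "headache"),
}

def is_a(entity: str, cls: str) -> bool:
    """Walk the explicit class hierarchy instead of guessing from text."""
    if entity == cls:
        return True
    return any(is_a(parent, cls)
               for s, p, parent in TRIPLES if s == entity and p == "is_a")

def subjects(predicate: str, obj: str) -> list[str]:
    """Find all subjects linked to `obj` by `predicate`."""
    return [s for s, p, o in TRIPLES if p == predicate and o == obj]

print(is_a("aspirin", "drug"))          # True: inferred via the hierarchy
print(subjects("treats", "headache"))   # ['aspirin']
```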


Jesus Barrasa

Field CTO for AI @Neo4j

Session

Building an AI Ready Global Scale Data Platform

Wednesday Mar 18 / 01:35PM GMT

As organizations move from single-cloud setups to hybrid and multi-cloud strategies, they are under pressure to build data platforms that are both globally available and AI-ready.


George Peter Hantzaras

Engineering Director, Core Platforms @MongoDB, Open Source Ambassador, Published Author

Session

Your Agent Sandbox Doesn't Know My Authz Model: A Standard-Shaped Hole

Wednesday Mar 18 / 02:45PM GMT

Sandboxes are the first line of defence for agentic systems: restrict the bash commands, filter the URLs, lock down the filesystem. But sandboxes operate on the syntax of requests, not the semantics of your authorization model.
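That gap can be sketched in a few lines: a syntactic rule inspects the shape of a request, while a semantic check consults an authorization model for the principal, action, and resource. Both the blocked-host list and the permissions table below are hypothetical illustrations.

```python
# Syntactic filtering vs. semantic authorization (hypothetical sketch).

BLOCKED_HOSTS = {"evil.example.com"}  # syntax: the shape of the request
PERMISSIONS = {                       # semantics: who may do what to which resource
    ("agent-42", "read", "s3://finance-reports"),
}

def sandbox_allows(host: str) -> bool:
    """Syntactic check: looks at the URL, knows nothing about the caller."""
    return host not in BLOCKED_HOSTS

def authz_allows(principal: str, action: str, resource: str) -> bool:
    """Semantic check: consults the authorization model."""
    return (principal, action, resource) in PERMISSIONS

# A request can be syntactically clean yet still unauthorized:
print(sandbox_allows("internal.example.com"))                     # True
print(authz_allows("agent-42", "write", "s3://finance-reports"))  # False
```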


Paul Carleton

Member of Technical Staff @Anthropic, Core Maintainer of MCP