Beyond Benchmarks: How Evaluations Ensure Safety at Scale in LLM Applications

Abstract

As LLM systems move from prototypes to production, the gap between benchmark performance and real-world reliability becomes impossible to ignore. Models that score well on benchmarks can still fail unpredictably when facing the complexity, ambiguity, and edge cases of real users. So how do we actually know if our AI systems are working?

In this practical, example-driven talk, we'll explore why robust evaluation, both before and after deployment, is the key to building trustworthy AI systems. We'll cover the full evaluation lifecycle: offline evaluation before release, from automated metrics to human review; and online evaluation in production, from observability to A/B testing. Drawing on examples from health AI, where safety, consistency, and reliability are non-negotiable, we'll show how these practices apply to any domain where AI needs to work reliably at scale.

By the end of this session, you'll walk away with an end-to-end framework for building a robust feedback flywheel that supports continuous, evaluation-driven development of LLM-powered products.
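To make the offline half of that lifecycle concrete, here is a minimal sketch of an automated evaluation gate. The golden set, the `call_model` stub, and the keyword-based grader are hypothetical stand-ins, not material from the talk; production setups typically pair checks like this with LLM-as-judge scoring and human review.

```python
# Minimal offline evaluation harness (illustrative sketch).
# GOLDEN_SET and call_model are hypothetical stand-ins for a curated
# test set and a real LLM client.

GOLDEN_SET = [
    {"prompt": "Can I ice a swollen knee?", "must_mention": ["ice", "consult"]},
    {"prompt": "Is sharp chest pain normal after exercise?", "must_mention": ["seek", "medical"]},
]

def call_model(prompt: str) -> str:
    """Placeholder for the real LLM call."""
    return "Please ice the area, seek medical advice, and consult a clinician."

def grade(answer: str, must_mention: list[str]) -> bool:
    """Crude automated check: does the answer mention the required safety terms?"""
    lowered = answer.lower()
    return all(term in lowered for term in must_mention)

def run_offline_eval() -> float:
    passed = sum(grade(call_model(case["prompt"]), case["must_mention"])
                 for case in GOLDEN_SET)
    score = passed / len(GOLDEN_SET)
    print(f"pass rate: {score:.0%} ({passed}/{len(GOLDEN_SET)})")
    return score

if __name__ == "__main__":
    run_offline_eval()  # a release can be gated on this score before shipping
```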


Speaker

Clara Matos

Director of Applied AI @Sword Health, Focused on Building and Scaling Machine Learning Systems

Clara enjoys working at the intersection of Machine Learning, Product, and Engineering, solving problems in a pragmatic, iterative way. She currently leads Applied AI at Sword Health, where her team is reinventing how patients access and receive care by creating a more human, more clinically effective, and more scalable way to treat them. She is focused on building and scaling machine learning systems that advance Sword's mission of freeing 2 million people from pain.


Date

Wednesday Mar 18 / 11:45AM GMT (50 minutes)

Location

Fleming (3rd Fl.)

Slides

Slides are not available


From the same track

Session: Agentic Coding

The Right 300 Tokens Beat 100k Noisy Ones: The Architecture of Context Engineering

Wednesday Mar 18 / 10:35AM GMT

Your agent has 100k tokens of context. It still forgets what you told it two messages ago.
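The premise, that a small set of relevant tokens beats a large noisy window, can be sketched with a toy selector. The word-overlap scorer and in-memory history below are hypothetical simplifications; real context engineering relies on embeddings, summarization, and structured memory.

```python
# Toy context selector: keep only the snippets most relevant to the query.
# Word-overlap scoring is a hypothetical stand-in for embedding similarity.

def score(query: str, snippet: str) -> int:
    q_words = set(query.lower().split())
    return len(q_words & set(snippet.lower().split()))

def select_context(query: str, history: list[str], budget: int = 3) -> list[str]:
    """Return the `budget` most relevant snippets instead of the full history."""
    return sorted(history, key=lambda s: score(query, s), reverse=True)[:budget]

history = [
    "User prefers metric units.",
    "User is deploying to eu-west-1.",
    "Unrelated chit-chat about the weather.",
    "User's service is written in Go.",
]
print(select_context("which region should the Go service deploy to?", history))
```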


Patrick Debois

AI Product Engineer @Tessl, Co-Author of the "DevOps Handbook", Content Curator at AI Native Developer Community


Baruch Sadogursky

DevRel Team and Context Engineering Management @Tessl AI, Co-Author of #LiquidSoftware and #DevOps Tools for #Java Developers, Java Champion, Microsoft MVP

Session

Explicit Semantics for AI Applications: Ontologies in Practice

Wednesday Mar 18 / 03:55PM GMT

Modern AI applications struggle not because of a lack of models, but because meaning is implicit, fragmented, and brittle. In this talk, we’ll explore how making semantics explicit (using ontologies and knowledge graphs) changes how we design, build, and operate AI systems.
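As a toy illustration of explicit semantics, the sketch below stores facts as subject-predicate-object triples with a small class hierarchy, so meaning can be queried rather than inferred from prose. The entities and relations are hypothetical examples, not taken from the talk.

```python
# Minimal triple store with a toy ontology (hypothetical example).

TRIPLES = {
    ("aspirin", "is_a", "nsaid"),
    ("nsaid", "is_a", "drug"),
    ("aspirin", "treats", "headache"),
}

def is_a(entity: str, cls: str) -> bool:
    """Walk the explicit class hierarchy instead of guessing from text."""
    if entity == cls:
        return True
    return any(is_a(parent, cls)
               for s, p, parent in TRIPLES if s == entity and p == "is_a")

def subjects(predicate: str, obj: str) -> list[str]:
    """Find all subjects linked to `obj` by `predicate`."""
    return [s for s, p, o in TRIPLES if p == predicate and o == obj]

print(is_a("aspirin", "drug"))          # True: inferred via the hierarchy
print(subjects("treats", "headache"))   # ['aspirin']
```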


Jesus Barrasa

Field CTO for AI @Neo4j

Session

Building an AI Ready Global Scale Data Platform

Wednesday Mar 18 / 01:35PM GMT

As organizations move from single-cloud setups to hybrid and multi-cloud strategies, they are under pressure to build data platforms that are both globally available and AI-ready.


George Peter Hantzaras

Engineering Director, Core Platforms @MongoDB, Open Source Ambassador, Published Author

Session

Your Agent Sandbox Doesn't Know My Authz Model: A Standard-Shaped Hole

Wednesday Mar 18 / 02:45PM GMT

Sandboxes are the first line of defence for agentic systems: restrict the bash commands, filter the URLs, lock down the filesystem. But sandboxes operate on the syntax of requests, not the semantics of your authorization model.
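That gap can be sketched in a few lines: a syntactic rule inspects the shape of a request, while a semantic check consults an authorization model for the principal, action, and resource. Both the blocked-host list and the permissions table below are hypothetical illustrations.

```python
# Syntactic filtering vs. semantic authorization (hypothetical sketch).

BLOCKED_HOSTS = {"evil.example.com"}  # syntax: the shape of the request
PERMISSIONS = {                       # semantics: who may do what to which resource
    ("agent-42", "read", "s3://finance-reports"),
}

def sandbox_allows(host: str) -> bool:
    """Syntactic check: looks at the URL, knows nothing about the caller."""
    return host not in BLOCKED_HOSTS

def authz_allows(principal: str, action: str, resource: str) -> bool:
    """Semantic check: consults the authorization model."""
    return (principal, action, resource) in PERMISSIONS

# A request can be syntactically clean yet still unauthorized:
print(sandbox_allows("internal.example.com"))                     # True
print(authz_allows("agent-42", "write", "s3://finance-reports"))  # False
```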


Paul Carleton

Member of Technical Staff @Anthropic, Core Maintainer of MCP