The Air Canada example really drives the point home. It's one thing to have a chatbot give a suboptimal answer, but when it starts making promises that contradict company policy, you're essentially giving customers ammunition for legal action. The three-layer eval system makes a lot of sense for catching these kinds of issues before they escalate.
The "ammunition" part is spot on. I couldn't have described it better. Even if some countries have more severe laws than others, as a customer, I'd be furious if I were told "yes" when it's actually "no."
We need better methods for catching all of these before things burn, as you said. I completely agree!
This article strikes the right balance between user focus and technical accuracy. Very helpful!
Thanks, Benedikt, glad you found it useful! Hope it helps others too :)
Yes, I've actually been through this... not fun, but fixing it ASAP is the key... and not making the same mistakes.
I love it! Learn from your mistakes and document them so you don't repeat them, for everything not just AI.
Documentation isn't fun, either, but it's certainly better than having errors like this in production!
Amen to that. Making the same mistakes is a killer of progress.
Great insight, Elena.
AI Evals are especially important since they bridge the gap between what engineering prioritizes (MMLU scores) and what users expect (quality, reliability, safety, and performance). And AI PMs can decide the next step based on them: whether to train the model, pivot, or put it on hold.
Yes, Vishal! That's a perfect summary of everything. You put it perfectly! 👏