As a VP of Product Management in the AI space, I find that Evals come up constantly in my conversations with fellow product managers, and for good reason. If you're building products with Large Language Models (LLMs), you know their non-deterministic nature is a mixed blessing. It enables creative, human-like conversations but also means unpredictable results.
This unpredictability is a major hurdle for ensuring accuracy—which can mean factual correctness, appropriate tone, or successful task completion—and a positive user experience. This impacts our release speed and iteration capacity. In AI's fast-paced world, where rapid iteration is vital, understanding Evals is key to achieving desired velocity.
So, how do most of us currently handle accuracy for a great user experience? It often starts with manual tests.
During development, we manually check conversations and outputs. When issues arise, prompt engineering is a common solution. Crafting and refining prompts is becoming a core skill for every product manager. If an LLM “misbehaves,” improving prompts is often the quickest way to fix inconsistencies or hallucinations.
As AI applications grow more capable, comprehensive manual testing for consistency becomes incredibly difficult. A small prompt change can unintentionally affect other behaviors, sometimes unnoticed until a product is live. Moreover, manual sample tests paired with prompt engineering don't offer enough coverage for the ever-expanding range of user interactions.
This is where a more systematic, ongoing approach is essential. We need an automated way to track and understand AI performance, spot negative effects quickly, and course-correct.
Evals are not a one-time setup; they are a continuous process that evolves with your product. The evaluation process typically involves three steps (a minimal code sketch of the full loop follows this list):
Collecting Data: Gathering sample conversations or "traces" (user-LLM dialogues) from live systems or existing human-to-human conversations for accuracy review.
Evaluating Performance: Setting up reviews, done by humans or, increasingly, an LLM "judge," to assess AI response quality during development and production.
Iterative Improvement ("Hill Climbing"): Progressively improving the AI application to boost performance. For example, establish a baseline by measuring current performance, say 30% accuracy, and then gradually improve toward a goal such as 90% accuracy.
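To make this loop concrete, here is a minimal sketch in Python. The Trace shape and the llm_judge helper are hypothetical stand-ins for however you store conversations and score them (a second LLM, a human reviewer, or both); the point is getting a measurable baseline you can hill-climb against.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    """One captured user/LLM exchange plus the outcome we expect."""
    user_input: str
    model_output: str
    expected: str

def llm_judge(trace: Trace) -> bool:
    """Hypothetical judge: stands in for a human reviewer or an LLM 'judge' call.
    Here it simply checks that the expected fact appears in the model output."""
    return trace.expected.lower() in trace.model_output.lower()

def run_evals(traces: list[Trace]) -> float:
    """Score every trace and return overall accuracy -- the baseline to improve on."""
    if not traces:
        return 0.0
    return sum(llm_judge(t) for t in traces) / len(traces)

traces = [
    Trace("What are your opening hours?", "We are open 9am-5pm, Monday to Friday.", "9am-5pm"),
    Trace("Do you ship overseas?", "Sorry, I can only help with billing questions.", "international shipping"),
]
print(f"Baseline accuracy: {run_evals(traces):.0%}")  # e.g. 50% here -- now hill-climb toward 90%
```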
Well-designed Evals are crucial. It's not just whether the AI can respond, but whether it responds correctly and consistently (a sketch of these checks follows the list below):
Accuracy of Deterministic Outputs: If your AI uses Retrieval Augmented Generation (RAG), does it consistently provide correct information (e.g., correct business hours)?
Consistency of Functional Calls: If your AI performs actions (like inventory checks), are these actions and results consistent (e.g., warning about low stock if the inventory is below a certain threshold)?
Adherence to Rules: Does the AI consistently follow predefined rules (e.g., capturing user details in the correct format)?
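As an illustration only, here is how those three kinds of checks might look as plain assertions. The response fields, the 10-unit stock threshold, and the phone-number format are hypothetical placeholders for your own product's rules.

```python
import re

# Hypothetical captured response from the AI application under test.
response = {
    "answer": "We are open 9am-5pm, Monday to Friday.",
    "tool_calls": [{"name": "check_inventory", "result": {"stock": 3, "warned_low_stock": True}}],
    "captured_phone": "+1-555-010-2345",
}

# 1. Accuracy of deterministic outputs: the RAG answer must contain the known business hours.
assert "9am-5pm" in response["answer"]

# 2. Consistency of function calls: the low-stock warning must fire exactly when stock is
#    below the (hypothetical) threshold.
inventory = response["tool_calls"][0]["result"]
LOW_STOCK_THRESHOLD = 10
assert inventory["warned_low_stock"] == (inventory["stock"] < LOW_STOCK_THRESHOLD)

# 3. Adherence to rules: captured user details must match the expected format.
assert re.fullmatch(r"\+\d{1,3}-\d{3}-\d{3}-\d{4}", response["captured_phone"])
```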
Manually creating comprehensive Evals can be time-consuming and requires resources, especially without much customer data. However, the long-term payoff in product quality and development speed justifies this investment. Your LLM can be a great starting point for generating realistic, synthetic Evals with expected outputs. Always review these synthetic tests to ensure they meet your expectations and adjust as needed.
If your AI handles data (e.g., an email assistant needing contacts), use an LLM to create sample database records. Then, have the LLM generate Evals testing various ways to reference these records. If a name lookup could yield multiple contacts, an Eval should ensure all options are presented.
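Here is a rough sketch of that idea, assuming the OpenAI Python client (any LLM SDK works in a similar way); the prompt, the model name, and the JSON shape are illustrative, not prescriptive, and every generated case should still be reviewed by a human.

```python
import json
from openai import OpenAI  # assumed dependency; any LLM client works similarly

client = OpenAI()

prompt = """Generate 3 sample contact records (name, email) and 5 eval cases for an email
assistant. Each eval case needs: a user request that references a contact in a different way
(full name, first name only, misspelling, ambiguous name) and the expected behavior. If a name
matches multiple contacts, the expected behavior must be to present all matching options.
Return JSON with keys "contacts" and "eval_cases"."""

completion = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},
)

synthetic = json.loads(completion.choices[0].message.content)
# Review the generated records and eval cases before adding them to your suite.
print(json.dumps(synthetic, indent=2))
```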
Evals should be an integral, evolving part of development:
Automated Testing: Run Evals as automated tests with each prompt revision via a CI/CD system, just like other routine tests (see the pytest sketch after this list).
Version Control: Version prompts (like code, using Git) and review them via pull requests that include Evals.
Track Progress Over Time: Collect Eval data over time, not just from the latest run, to see if user interaction quality is improving. Use any reporting tool that works for you. Remember, Evals are a living system that should adapt.
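As one hedged example, here is how the eval suite could gate a prompt change under pytest in CI; the evals module, the trace loader, and the 90% bar are hypothetical placeholders for your own setup, reusing the Trace/run_evals helpers sketched earlier.

```python
# test_evals.py -- collected by pytest in your CI pipeline on every prompt or model change.
from evals import load_eval_traces, run_evals  # hypothetical module built from the earlier sketch

ACCURACY_TARGET = 0.90  # the accuracy bar you are currently hill-climbing toward

def test_eval_suite_meets_accuracy_target():
    traces = load_eval_traces("evals/cases.json")  # hypothetical loader for stored eval cases
    accuracy = run_evals(traces)
    assert accuracy >= ACCURACY_TARGET, f"Eval accuracy dropped to {accuracy:.0%}"
```

Failing this test blocks the pull request, so a prompt tweak that quietly breaks another behavior is caught before it ships.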
Once your AI application is live, evaluation is ongoing and vital:
Collect Live Traces: Capture traces from live conversations to ensure your app is performing as expected, especially for user scenarios you didn't anticipate during development and for data drift caused by an evolving user base.
Identify Performance Gaps: Evals running on these captured conversations in production will reveal how accurately your AI application is performing in the real world.
Enhance Test Scenarios: These live traces are also an invaluable source for extending your test scenarios, helping you build a more complete set of evaluations for ongoing development and keeping your Evals relevant and effective (a short sketch of this loop follows the list).
Prioritize Privacy: Always remember to consider user privacy. Ask for permission before collecting data, or ensure all data is anonymized.
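One possible way to close that loop, sketched with hypothetical helpers: score a sample of captured (and anonymized) traces against a rubric, treat the failures as performance gaps, and promote them into your eval suite.

```python
# In production there is usually no fixed "expected" answer, so the judge scores each
# exchange against a rubric instead. Both functions below are hypothetical sketches.
def rubric_judge(user_input: str, model_output: str) -> bool:
    """Stand-in for an LLM-as-judge call: did the response answer the request,
    stay on policy, and avoid hallucinated facts? Stubbed here."""
    return bool(model_output.strip())

def review_production_traces(live_traces: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Score a sample of captured (user_input, model_output) pairs and return the
    failures so they can be reviewed and added to the eval suite."""
    failures = [(q, a) for q, a in live_traces if not rubric_judge(q, a)]
    if live_traces:
        print(f"Production accuracy on this sample: {1 - len(failures) / len(live_traces):.0%}")
    return failures
```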
Numerous tools can help capture production traces and run Evals, or you can roll your own. Some tools are specialized: AI Voice Agent Eval solutions, for instance, focus on audio and can generate realistic voice conversation tests, including sentiment analysis.
💡 Here’s a critical insight. Even with Evals, expecting AI to handle every situation perfectly is unrealistic.
Evals measure effectiveness and may get you to 90%+ accuracy, but complex AI applications also need well-designed "escape hatches" for the edge cases: transfers to human agents or other follow-ups when the AI hits its limits. This hybrid human-AI approach is key for reliable service as the technology matures.
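To illustrate the "escape hatch" idea, here is one hedged sketch of a routing check; the confidence score, the threshold, and the handoff function are all hypothetical placeholders for however your product detects that the AI is out of its depth.

```python
CONFIDENCE_THRESHOLD = 0.7  # hypothetical cutoff, tuned from your eval results

def generate_with_confidence(user_message: str) -> tuple[str, float]:
    """Hypothetical: return the AI's draft reply plus a self-assessed confidence score."""
    return "Our store is open 9am-5pm, Monday to Friday.", 0.55

def hand_off_to_human_agent(user_message: str) -> str:
    """Hypothetical escalation path: queue the conversation for a human agent."""
    return "I'm connecting you with a colleague who can help with that."

def respond_or_escalate(user_message: str) -> str:
    """Use the AI reply only when it clears the confidence bar; otherwise escalate."""
    draft, confidence = generate_with_confidence(user_message)
    return draft if confidence >= CONFIDENCE_THRESHOLD else hand_off_to_human_agent(user_message)

print(respond_or_escalate("Can you refund an order from 2019?"))  # falls back to the human path
```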
Furthermore, insights from Evals should inform your broader product strategy. Consistent underperformance in certain areas might signal a need to refine a feature, invest in better data, or even reconsider aspects of your AI's design.
For product managers in this exciting AI landscape, robust Evals are essential for building high-quality, reliable, and continuously improving AI-powered products.
How do you evaluate your AI applications' accuracy? I’d love to hear your experiences!
👉 Tell us how we’re doing! CLICK HERE to complete our reader survey.
Filip Szymanski has over 20 years of experience as a product leader and is passionate about leveraging AI to power a new generation of product managers. Follow his journey at productpath.ai or connect with him on LinkedIn.
Want to learn more about the many ways AI can supercharge product management and development? Subscribe to Filip’s newsletter for AI-powered PMs at GenAI Works.
🚀 Boost your business with us—advertise where 10M+ AI leaders engage
🌟 Sign up for the first AI Hub in the world.