Testing Your AI Peers with the Evaluation System
Building an AI peer is just the first step. Ensuring it consistently delivers high-quality responses is an ongoing challenge. That's why we built the Evaluation System - a comprehensive testing framework that helps you validate, measure, and improve your peers' performance.
Why Evaluate Your AI Peers?
AI peers, powered by large language models, can sometimes produce inconsistent results. Without systematic testing, you might:
- Miss quality issues until users report them
- Struggle to measure improvements after making changes
- Lack confidence in deploying peers to production
- Have no baseline to compare different configurations
The Evaluation System solves these problems by providing:
✅ Automated Testing: Run comprehensive test suites automatically
✅ Multiple Evaluation Methods: Choose from exact match, semantic similarity, or AI-powered judging
✅ Detailed Metrics: Get quantitative scores for peer performance
✅ AI-Powered Insights: Receive actionable improvement suggestions
✅ Version Comparison: Track performance across different configurations
Getting Started: Your First Evaluation Suite
Let's walk through creating your first evaluation suite for a customer support peer.
Step 1: Create the Suite
Navigate to the "Evaluations" section in your workspace and click "Create New Suite".
Suite Name: Customer Support Quality Test
Description: Tests accuracy and helpfulness of support responses
Tags: customer-support, quality-assurance
Note: Evaluation suites are workspace-level resources. You'll select which peer to test when running the evaluation.
Step 2: Add Test Questions
You can add questions manually or import them from a CSV file. Here's an example:
Question 1:
Question: What are your business hours?
Expected Answer: We are open Monday-Friday, 9 AM to 6 PM EST, and Saturday 10 AM to 4 PM EST.
Tags: hours, availability
Question 2:
Question: How do I reset my password?
Expected Answer: Click 'Forgot Password' on the login page, enter your email, and follow the link sent to your inbox.
Tags: account, password, troubleshooting
Pro Tip: Start with 20-30 questions covering your most common use cases.
Step 3: Configure Evaluators
The Evaluation System offers three types of evaluators:
1. Exact Match Evaluator
Best for factual questions with specific answers:
{
  "caseSensitive": false,
  "ignoreWhitespace": true,
  "normalizeNumbers": true
}
Use for: Hours, prices, product specifications, policy statements.
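If you're curious what those options mean in practice, here's a minimal sketch of the kind of normalization they imply. The function and the number-normalization rule are illustrative, not the evaluator's actual code:
// Illustrative sketch only, not the platform's actual implementation.
function exactMatchScore(expected, actual, options = {}) {
  const { caseSensitive = false, ignoreWhitespace = true, normalizeNumbers = true } = options;
  const normalize = (text) => {
    let s = String(text);
    if (!caseSensitive) s = s.toLowerCase();
    if (ignoreWhitespace) s = s.replace(/\s+/g, ' ').trim();
    if (normalizeNumbers) s = s.replace(/(\d),(?=\d{3}\b)/g, '$1'); // "1,000" -> "1000"
    return s;
  };
  return normalize(expected) === normalize(actual) ? 1 : 0;
}

console.log(exactMatchScore('Mon-Fri 9-6 EST', '  mon-fri 9-6 est ')); // 1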
2. Semantic Similarity Evaluator
Measures semantic closeness using embeddings:
{
  "threshold": 0.8,
  "modelId": "text-embedding-3-small"
}
Use for: Paraphrased answers, conceptually similar responses.
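Evaluators like this typically embed both the expected answer and the peer's response, then compare the two vectors; cosine similarity is the usual metric, though the platform's exact scoring is internal. A minimal sketch of that comparison:
// Illustrative sketch: score two embedding vectors with cosine similarity.
// How the platform obtains embeddings and combines scores is internal; this only shows the math.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function semanticPass(expectedVec, responseVec, threshold = 0.8) {
  const score = cosineSimilarity(expectedVec, responseVec);
  return { score, pass: score >= threshold };
}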
3. LLM Judge Evaluator
AI-powered evaluation for quality and relevance:
{
  "modelId": "chatgpt-4o-mini",
  "criteria": ["accuracy", "helpfulness", "clarity"],
  "strictness": "moderate"
}
Use for: Open-ended questions, customer service scenarios, creative responses.
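Conceptually, an LLM judge prompts a model with the question, the expected answer, the peer's response, and your criteria, then asks it for a numeric score. The sketch below shows that pattern; it is not the exact prompt the Evaluation System sends:
// Illustrative sketch of the LLM-judge pattern; the actual prompt used by the system is internal.
function buildJudgePrompt({ question, expectedAnswer, peerResponse, criteria, strictness }) {
  return [
    `You are a ${strictness} evaluator. Score the response between 0.0 and 1.0.`,
    `Criteria: ${criteria.join(', ')}.`,
    `Question: ${question}`,
    `Expected answer: ${expectedAnswer}`,
    `Peer response: ${peerResponse}`,
    'Reply with JSON: {"score": <number>, "reason": "<one sentence>"}',
  ].join('\n');
}

const prompt = buildJudgePrompt({
  question: 'What are your business hours?',
  expectedAnswer: 'Monday-Friday, 9 AM to 6 PM EST',
  peerResponse: 'We are open weekdays 9-6 EST.',
  criteria: ['accuracy', 'helpfulness', 'clarity'],
  strictness: 'moderate',
});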
Recommendation: Enable all three evaluators for comprehensive testing.
Step 4: Run Your First Evaluation
Click "Run Evaluation", select the peer you want to test (e.g., "Support Assistant"), and watch the progress in real-time. The system will:
- Send each question to the selected peer
- Collect the responses
- Evaluate using your configured evaluators
- Calculate scores and metrics
Results typically complete in 2-5 minutes for 20-30 questions.
Tip: You can run the same suite against multiple peers or different versions of the same peer to compare performance.
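If you'd rather script those comparison runs than click through the UI, the run endpoint used in the CI/CD example later in this post can be called directly. The peerId field in the request body below is an assumption; check the API reference for the actual schema:
// Hedged sketch (Node 18+ built-in fetch): trigger an evaluation run over HTTP.
// The "peerId" body field is assumed; consult the API docs for the real parameter names.
async function runEvaluation(suiteId, peerId, apiKey) {
  const response = await fetch(`https://api.cognipeer.com/v1/evaluation/${suiteId}/run`, {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${apiKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ peerId }),
  });
  if (!response.ok) throw new Error(`Evaluation run failed: ${response.status}`);
  return response.json();
}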
Understanding Your Results
Once complete, you'll see:
Overall Metrics
Average Score: 0.85 (85%)
Pass Rate: 88% (22/25 questions)
Duration: 3 minutes 42 seconds
Per-Question Breakdown
Question: "What are your business hours?"
Peer Response: "Our support team is available Monday through Friday,
from 9am to 6pm EST, and on Saturdays from 10am to 4pm EST."
Expected: "We are open Monday-Friday, 9 AM to 6 PM EST,
and Saturday 10 AM to 4 PM EST."
Scores:
✗ Exact Match: 0.0 (wording differs)
✓ LLM Judge: 0.95 (accurate and clear)
✓ Semantic Similarity: 0.98 (same meaning)
Overall: PASS (0.93)
Evaluator Insights
Each evaluator provides different perspectives:
- Exact Match catches literal deviations
- Semantic Similarity ensures conceptual accuracy
- LLM Judge assesses quality and user experience
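The platform computes the overall score for you, so treat the sketch below purely as a mental model: one plausible way to combine per-evaluator scores is a weighted average that de-emphasizes strict checks like Exact Match for open-ended answers. The weights here are invented and will not reproduce the exact 0.93 shown above:
// Mental-model sketch only; the weights are made up, not the system's actual formula.
function combineScores(scores, weights = { exactMatch: 0.1, semanticSimilarity: 0.45, llmJudge: 0.45 }) {
  let weighted = 0;
  let totalWeight = 0;
  for (const [evaluator, score] of Object.entries(scores)) {
    const w = weights[evaluator] ?? 0;
    weighted += score * w;
    totalWeight += w;
  }
  return totalWeight > 0 ? weighted / totalWeight : 0;
}

console.log(combineScores({ exactMatch: 0.0, semanticSimilarity: 0.98, llmJudge: 0.95 })); // ~0.87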
Real-World Example: Improving a Support Peer
Let's look at a real scenario:
Initial Results (Baseline)
Overall Score: 0.72 (72%)
Pass Rate: 68%
Common Issues:
- Generic responses lacking specifics
- Missing product details
- Inconsistent tone
Analysis Phase
After running the evaluation, we used the AI Analysis feature (more on this in our next blog post), which suggested:
- Add product documentation datasource - Score improvement: +15%
- Enhance system prompt with tone guidelines - Score improvement: +8%
- Lower temperature from 0.8 to 0.3 - Score improvement: +5%
Implementation
We applied the suggestions and re-ran the evaluation:
Overall Score: 0.91 (91%)
Pass Rate: 92%
Improvements:
✓ Specific, accurate product information
✓ Consistent professional tone
✓ Reduced hallucinations
Net improvement: +19 percentage points in just one iteration!
Best Practices for Effective Evaluation
1. Start with Real User Questions
Don't invent questions - use actual queries from your users:
# Export conversation logs
# Extract common questions
# Add to evaluation suite
This ensures your tests reflect real usage patterns.
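As a concrete starting point, the sketch below pulls the most frequent user questions out of an exported log. It assumes a JSON export shaped as an array of { role, content } messages; adapt it to whatever format your conversation export actually uses:
// Hedged sketch: find the most common user questions in an exported conversation log.
// Assumes conversations.json is an array like [{ "role": "user", "content": "..." }, ...].
const fs = require('fs');

const messages = JSON.parse(fs.readFileSync('conversations.json', 'utf8'));
const counts = new Map();

for (const message of messages) {
  if (message.role !== 'user') continue;
  const question = message.content.trim().toLowerCase();
  counts.set(question, (counts.get(question) || 0) + 1);
}

const topQuestions = [...counts.entries()]
  .sort((a, b) => b[1] - a[1])
  .slice(0, 30);

console.log(topQuestions); // seed these into your evaluation suite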
2. Cover Edge Cases
Include challenging scenarios:
- Ambiguous questions
- Multi-part questions
- Questions outside your peer's scope
- Questions requiring reasoning
- Questions with multiple valid answers
3. Define Clear Expected Answers
Vague expectations lead to unreliable scores:
❌ Bad: "Something about business hours"
✅ Good: "Monday-Friday, 9 AM to 6 PM EST"
4. Use Multiple Evaluators
Different evaluators catch different issues:
- Exact Match: Catches factual errors
- Semantic Similarity: Ensures meaning is preserved
- LLM Judge: Assesses overall quality
5. Run Regularly
Make evaluation part of your workflow:
- Before deployment: Validate changes
- After updates: Ensure no regressions
- Weekly: Monitor ongoing performance
- After feedback: Test reported issues
6. Track Improvements Over Time
Keep a log of evaluation scores:
Version 1.0: 72% (baseline)
Version 1.1: 85% (added datasources)
Version 1.2: 91% (improved prompt)
Version 2.0: 94% (new model)
This helps justify optimizations and demonstrate ROI.
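One lightweight way to keep that log is a JSON file in version control that each run appends to; the field names below are illustrative:
// Illustrative score history: append each run's summary to a JSON file you keep in git.
const fs = require('fs');

function logEvaluation(version, averageScore, note, file = 'eval-history.json') {
  const history = fs.existsSync(file) ? JSON.parse(fs.readFileSync(file, 'utf8')) : [];
  history.push({ version, averageScore, note, date: new Date().toISOString() });
  fs.writeFileSync(file, JSON.stringify(history, null, 2));
}

logEvaluation('1.2', 0.91, 'improved prompt');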
Advanced Techniques
Bulk Importing Questions
For large test suites, use CSV import:
question,expectedAnswer,context,tags
"What are your hours?","Mon-Fri 9-6 EST, Sat 10-4 EST","","hours"
"How to reset password?","Click 'Forgot Password' on login page","","account"Import 100+ questions in seconds.
Conditional Testing
Create different suites for different scenarios:
- Basic Functionality: Core features
- Advanced Features: Complex scenarios
- Error Handling: Edge cases and errors
- Multilingual: Different languages
- Performance: High-volume scenarios
Automated Regression Testing
Set up automated evaluation runs:
// Run evaluation after each deployment.
// peer, evaluation, notifications, and rollback are placeholders for your own SDK/client objects.
await peer.deploy();
const result = await evaluation.run(suiteId);
if (result.averageScore < 0.85) {
  await notifications.alert('Evaluation failed! Score dropped.');
  await rollback.toPreviousVersion();
}
A/B Testing with Evaluation
Compare different configurations:
// Test two versions (evaluate() and deploy() stand in for your own helpers or SDK calls).
const resultA = await evaluate(peerA, testSuite);
const resultB = await evaluate(peerB, testSuite);
console.log(`Version A: ${resultA.score}`);
console.log(`Version B: ${resultB.score}`);
// Deploy the winner
if (resultB.score > resultA.score) {
  await deploy(peerB);
}
Common Pitfalls to Avoid
❌ Testing Too Few Questions
Problem: 5-10 questions don't provide statistical confidence.
Solution: Aim for at least 20-30 questions, preferably 50+.
❌ Unrealistic Expected Answers
Problem: Expecting exact LLM output leads to false failures.
Solution: Focus on correctness and completeness, not exact wording.
❌ Not Using Multiple Evaluators
Problem: Single evaluator misses nuances.
Solution: Enable at least two evaluators for comprehensive coverage.
❌ Ignoring Failed Tests
Problem: Low scores without investigation.
Solution: Review each failed question, understand why, and iterate.
❌ Testing Only Happy Paths
Problem: Real users ask unexpected questions.
Solution: Include edge cases, errors, and ambiguous scenarios.
Integration with Your Workflow
CI/CD Pipeline Integration
# .github/workflows/test-peer.yml
name: Test AI Peer
on: [push]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - name: Run Evaluation
        env:
          SUITE_ID: ${{ secrets.SUITE_ID }}
          API_KEY: ${{ secrets.API_KEY }}
        run: |
          result=$(curl -X POST \
            https://api.cognipeer.com/v1/evaluation/$SUITE_ID/run \
            -H "Authorization: Bearer $API_KEY")
          score=$(echo $result | jq '.data.averageScore')
          if (( $(echo "$score < 0.85" | bc -l) )); then
            echo "Evaluation failed: $score"
            exit 1
          fi
Monitoring Dashboard
Track key metrics:
- Average scores over time
- Pass rates by category
- Most common failures
- Response time trends
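If the dashboard doesn't slice things the way you need, per-question results can be aggregated yourself. The sketch below assumes each result exposes its question's tags and a pass flag; the exact field names depend on the API response:
// Hedged sketch: pass rate per tag, assuming each result has { tags: [...], passed: boolean }.
function passRateByTag(results) {
  const byTag = {};
  for (const result of results) {
    for (const tag of result.tags) {
      byTag[tag] = byTag[tag] || { passed: 0, total: 0 };
      byTag[tag].total += 1;
      if (result.passed) byTag[tag].passed += 1;
    }
  }
  return Object.fromEntries(
    Object.entries(byTag).map(([tag, { passed, total }]) => [tag, passed / total])
  );
}

console.log(passRateByTag([
  { tags: ['hours'], passed: true },
  { tags: ['account', 'password'], passed: false },
]));
// { hours: 1, account: 0, password: 0 }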
Slack Notifications
Get alerted when scores drop:
if (newScore < previousScore - 0.05) {
  await slack.notify({
    channel: '#ai-alerts',
    message: `⚠️ Peer performance dropped from ${previousScore} to ${newScore}`
  });
}
What's Next?
Now that you understand evaluation basics, check out our next blog post: "Optimizing Peers with AI-Powered Analysis", where we dive into using AI to automatically generate improvement suggestions based on evaluation results.
In the meantime, try these exercises:
- Create your first evaluation suite with 10 questions
- Run an evaluation and review the results
- Make one improvement based on the findings
- Re-run the evaluation and measure the improvement
Conclusion
Systematic evaluation is the key to building reliable, high-quality AI peers. The Evaluation System provides the tools you need to:
- Validate peer performance objectively
- Measure improvements quantitatively
- Iterate confidently with data-driven decisions
- Deploy with confidence knowing your peer is ready
Start testing your peers today and see the quality improvements for yourself!
Have questions about evaluation? Join our community forum or contact our team.
Want to learn more? Read our next post: Optimizing Peers with AI-Powered Analysis

