
Testing Your AI Peers with the Evaluation System

Building an AI peer is just the first step. Ensuring it consistently delivers high-quality responses is an ongoing challenge. That's why we built the Evaluation System: a comprehensive testing framework that helps you validate, measure, and improve your peers' performance.

Why Evaluate Your AI Peers?

AI peers, powered by large language models, can sometimes produce inconsistent results. Without systematic testing, you might:

  • Miss quality issues until users report them
  • Struggle to measure improvements after making changes
  • Lack confidence in deploying peers to production
  • Have no baseline to compare different configurations

The Evaluation System solves these problems by providing:

  • Automated Testing: Run comprehensive test suites automatically
  • Multiple Evaluation Methods: Choose from exact match, semantic similarity, or AI-powered judging
  • Detailed Metrics: Get quantitative scores for peer performance
  • AI-Powered Insights: Receive actionable improvement suggestions
  • Version Comparison: Track performance across different configurations

Getting Started: Your First Evaluation Suite

Let's walk through creating your first evaluation suite for a customer support peer.

Step 1: Create the Suite

Navigate to the "Evaluations" section in your workspace and click "Create New Suite".

Suite Name: Customer Support Quality Test
Description: Tests accuracy and helpfulness of support responses
Tags: customer-support, quality-assurance

Note: Evaluation suites are workspace-level resources. You'll select which peer to test when running the evaluation.

Step 2: Add Test Questions

You can add questions manually or import them from a CSV file. Here's an example:

Question 1:

Question: What are your business hours?
Expected Answer: We are open Monday-Friday, 9 AM to 6 PM EST, and Saturday 10 AM to 4 PM EST.
Tags: hours, availability

Question 2:

Question: How do I reset my password?
Expected Answer: Click 'Forgot Password' on the login page, enter your email, and follow the link sent to your inbox.
Tags: account, password, troubleshooting

Pro Tip: Start with 20-30 questions covering your most common use cases.

Step 3: Configure Evaluators

The Evaluation System offers three types of evaluators:

1. Exact Match Evaluator

Best for factual questions with specific answers:

```json
{
  "caseSensitive": false,
  "ignoreWhitespace": true,
  "normalizeNumbers": true
}
```

Use for: Hours, prices, product specifications, policy statements.
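
To make these options concrete, here is a rough sketch of the kind of normalization they imply before two strings are compared. This is only an illustration of the idea, not the platform's actual implementation.

```javascript
// Illustrative only: approximates what caseSensitive, ignoreWhitespace,
// and normalizeNumbers might do before comparing response and expected answer.
function exactMatch(response, expected, options = {}) {
  const { caseSensitive = false, ignoreWhitespace = true, normalizeNumbers = true } = options;

  const normalize = (text) => {
    let result = text;
    if (!caseSensitive) result = result.toLowerCase();
    if (ignoreWhitespace) result = result.replace(/\s+/g, ' ').trim();
    if (normalizeNumbers) result = result.replace(/(\d),(\d{3})/g, '$1$2'); // "1,000" -> "1000"
    return result;
  };

  return normalize(response) === normalize(expected) ? 1.0 : 0.0;
}
```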

2. Semantic Similarity Evaluator

Measures semantic closeness using embeddings:

```json
{
  "threshold": 0.8,
  "modelId": "text-embedding-3-small"
}
```

Use for: Paraphrased answers, conceptually similar responses.
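
Conceptually, this style of evaluator embeds both texts and compares the resulting vectors against the threshold. The sketch below shows the core comparison; `embed` is a hypothetical placeholder for a call to an embedding model, and the real evaluator's details may differ.

```javascript
// Cosine similarity between two embedding vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// embed() stands in for a call to an embedding model such as text-embedding-3-small.
async function semanticMatch(response, expected, threshold = 0.8) {
  const [a, b] = await Promise.all([embed(response), embed(expected)]);
  const score = cosineSimilarity(a, b);
  return { score, pass: score >= threshold };
}
```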

3. LLM Judge Evaluator

AI-powered evaluation for quality and relevance:

```json
{
  "modelId": "chatgpt-4o-mini",
  "criteria": ["accuracy", "helpfulness", "clarity"],
  "strictness": "moderate"
}
```

Use for: Open-ended questions, customer service scenarios, creative responses.

Recommendation: Enable all three evaluators for comprehensive testing.

Step 4: Run Your First Evaluation

Click "Run Evaluation", select the peer you want to test (e.g., "Support Assistant"), and watch the progress in real-time. The system will:

  1. Send each question to the selected peer
  2. Collect the responses
  3. Evaluate using your configured evaluators
  4. Calculate scores and metrics

A run of 20-30 questions typically completes in 2-5 minutes.

Tip: You can run the same suite against multiple peers or different versions of the same peer to compare performance.
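
If you prefer to trigger runs from code rather than the UI, the evaluation endpoint shown in the CI/CD example later in this post can be called directly. The sketch below assumes a request body with a `peerId` field for selecting the peer; check the API reference for the exact parameters.

```javascript
// Minimal sketch: start an evaluation run via the REST API.
// The peerId body field is an assumption; consult the API docs for the exact schema.
async function runEvaluation(suiteId, peerId, apiKey) {
  const response = await fetch(
    `https://api.cognipeer.com/v1/evaluation/${suiteId}/run`,
    {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${apiKey}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({ peerId })
    }
  );
  return response.json();
}
```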

Understanding Your Results

Once complete, you'll see:

Overall Metrics

Average Score: 0.85 (85%)
Pass Rate: 88% (22/25 questions)
Duration: 3 minutes 42 seconds

Per-Question Breakdown

Question: "What are your business hours?"
Peer Response: "Our support team is available Monday through Friday, 
                from 9am to 6pm EST, and on Saturdays from 10am to 4pm EST."
Expected: "We are open Monday-Friday, 9 AM to 6 PM EST, 
           and Saturday 10 AM to 4 PM EST."

Scores:
✗ Exact Match: 0.0 (wording differs)
✓ LLM Judge: 0.95 (accurate and clear)
✓ Semantic Similarity: 0.98 (same meaning)

Overall: PASS (0.93)

Evaluator Insights

Each evaluator provides different perspectives:

  • Exact Match catches literal deviations
  • Semantic Similarity ensures conceptual accuracy
  • LLM Judge assesses quality and user experience
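
How these individual scores roll up into an overall pass/fail is handled by the platform; conceptually it amounts to a weighted combination checked against a pass threshold, roughly like the sketch below. The weights and threshold here are made up for illustration and do not reproduce the platform's own aggregation.

```javascript
// Illustrative aggregation: weights and threshold are arbitrary, not the platform's.
function aggregateScores(scores, weights = { exactMatch: 0.1, semantic: 0.4, llmJudge: 0.5 }, passThreshold = 0.8) {
  let total = 0, weightSum = 0;
  for (const [name, weight] of Object.entries(weights)) {
    if (scores[name] !== undefined) {
      total += scores[name] * weight;
      weightSum += weight;
    }
  }
  const overall = weightSum > 0 ? total / weightSum : 0;
  return { overall, pass: overall >= passThreshold };
}

// Example usage with per-evaluator scores:
// aggregateScores({ exactMatch: 0.0, semantic: 0.98, llmJudge: 0.95 })
```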

Real-World Example: Improving a Support Peer

Let's look at a real scenario:

Initial Results (Baseline)

Overall Score: 0.72 (72%)
Pass Rate: 68%

Common Issues:
- Generic responses lacking specifics
- Missing product details
- Inconsistent tone

Analysis Phase

After running the evaluation, we used the AI Analysis feature (more on this in our next blog post), which suggested:

  1. Add product documentation datasource - Score improvement: +15%
  2. Enhance system prompt with tone guidelines - Score improvement: +8%
  3. Lower temperature from 0.8 to 0.3 - Score improvement: +5%

Implementation

We applied the suggestions and re-ran the evaluation:

Overall Score: 0.91 (91%)
Pass Rate: 92%

Improvements:
✓ Specific, accurate product information
✓ Consistent professional tone
✓ Reduced hallucinations

Net improvement: +19 percentage points in just one iteration!

Best Practices for Effective Evaluation

1. Start with Real User Questions

Don't invent questions; use actual queries from your users:

```bash
# Export conversation logs
# Extract common questions
# Add to evaluation suite
```

This ensures your tests reflect real usage patterns.
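
For example, if your exported conversation logs end up in a JSON file, a small script can surface the most frequent user questions to seed the suite. The log format below is assumed; adapt the field names to whatever your export actually contains.

```javascript
import { readFileSync } from 'node:fs';

// Assumed export format: an array of { role, content } messages.
// Counts how often each user question appears and prints the top candidates.
const messages = JSON.parse(readFileSync('conversation-logs.json', 'utf8'));

const counts = new Map();
for (const message of messages) {
  if (message.role !== 'user') continue;
  const question = message.content.trim().toLowerCase();
  counts.set(question, (counts.get(question) ?? 0) + 1);
}

const topQuestions = [...counts.entries()]
  .sort((a, b) => b[1] - a[1])
  .slice(0, 30);

for (const [question, count] of topQuestions) {
  console.log(`${count}x  ${question}`);
}
```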

2. Cover Edge Cases

Include challenging scenarios:

  • Ambiguous questions
  • Multi-part questions
  • Questions outside your peer's scope
  • Questions requiring reasoning
  • Questions with multiple valid answers

3. Define Clear Expected Answers

Vague expectations lead to unreliable scores:

Bad: "Something about business hours"
Good: "Monday-Friday, 9 AM to 6 PM EST"

4. Use Multiple Evaluators

Different evaluators catch different issues:

  • Exact Match: Catches factual errors
  • Semantic Similarity: Ensures meaning is preserved
  • LLM Judge: Assesses overall quality

5. Run Regularly

Make evaluation part of your workflow:

  • Before deployment: Validate changes
  • After updates: Ensure no regressions
  • Weekly: Monitor ongoing performance
  • After feedback: Test reported issues

6. Track Improvements Over Time

Keep a log of evaluation scores:

Version 1.0: 72% (baseline)
Version 1.1: 85% (added datasources)
Version 1.2: 91% (improved prompt)
Version 2.0: 94% (new model)

This helps justify optimizations and demonstrate ROI.
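
One lightweight way to keep that log is a small JSON file next to your peer's configuration, appended to after each run. The file name and shape here are just a suggestion.

```javascript
import { readFileSync, writeFileSync, existsSync } from 'node:fs';

// Append a run's result to a simple score history file (format is a suggestion).
function recordScore(version, averageScore, note, file = 'evaluation-history.json') {
  const history = existsSync(file) ? JSON.parse(readFileSync(file, 'utf8')) : [];
  const previous = history.at(-1);
  history.push({ version, averageScore, note, date: new Date().toISOString() });
  writeFileSync(file, JSON.stringify(history, null, 2));
  if (previous) {
    const delta = (averageScore - previous.averageScore) * 100;
    console.log(`Change vs ${previous.version}: ${delta >= 0 ? '+' : ''}${delta.toFixed(1)} points`);
  }
}

// recordScore('1.1', 0.85, 'added datasources');
```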

Advanced Techniques

Bulk Importing Questions

For large test suites, use CSV import:

```csv
question,expectedAnswer,context,tags
"What are your hours?","Mon-Fri 9-6 EST, Sat 10-4 EST","","hours"
"How to reset password?","Click 'Forgot Password' on login page","","account"
```

Import 100+ questions in seconds.
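
If your questions already live in code or another system, generating the CSV programmatically keeps the suite in sync. This sketch writes the same four columns shown above; the quoting is deliberately simplistic, so use a proper CSV library for fields that contain double quotes.

```javascript
import { writeFileSync } from 'node:fs';

// Build an import file with the question,expectedAnswer,context,tags columns.
const questions = [
  { question: 'What are your hours?', expectedAnswer: 'Mon-Fri 9-6 EST, Sat 10-4 EST', context: '', tags: 'hours' },
  { question: 'How to reset password?', expectedAnswer: "Click 'Forgot Password' on login page", context: '', tags: 'account' }
];

const header = 'question,expectedAnswer,context,tags';
const rows = questions.map(q =>
  [q.question, q.expectedAnswer, q.context, q.tags].map(field => `"${field}"`).join(',')
);
writeFileSync('evaluation-questions.csv', [header, ...rows].join('\n'));
```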

Conditional Testing

Create different suites for different scenarios:

  • Basic Functionality: Core features
  • Advanced Features: Complex scenarios
  • Error Handling: Edge cases and errors
  • Multilingual: Different languages
  • Performance: High-volume scenarios

Automated Regression Testing

Set up automated evaluation runs:

```javascript
// Run evaluation after each deployment
await peer.deploy();
const result = await evaluation.run(suiteId);

if (result.averageScore < 0.85) {
  await notifications.alert('Evaluation failed! Score dropped.');
  await rollback.toPreviousVersion();
}
```

A/B Testing with Evaluation

Compare different configurations:

```javascript
// Test two versions
const resultA = await evaluate(peerA, testSuite);
const resultB = await evaluate(peerB, testSuite);

console.log(`Version A: ${resultA.score}`);
console.log(`Version B: ${resultB.score}`);

// Deploy winner
if (resultB.score > resultA.score) {
  await deploy(peerB);
}
```

Common Pitfalls to Avoid

❌ Testing Too Few Questions

Problem: 5-10 questions don't provide statistical confidence.

Solution: Aim for at least 20-30 questions, preferably 50+.

❌ Unrealistic Expected Answers

Problem: Expecting exact LLM output leads to false failures.

Solution: Focus on correctness and completeness, not exact wording.

❌ Not Using Multiple Evaluators

Problem: Single evaluator misses nuances.

Solution: Enable at least two evaluators for comprehensive coverage.

❌ Ignoring Failed Tests

Problem: Low scores without investigation.

Solution: Review each failed question, understand why, and iterate.

❌ Testing Only Happy Paths

Problem: Real users ask unexpected questions.

Solution: Include edge cases, errors, and ambiguous scenarios.

Integration with Your Workflow

CI/CD Pipeline Integration

```yaml
# .github/workflows/test-peer.yml
name: Test AI Peer
on: [push]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - name: Run Evaluation
        run: |
          result=$(curl -X POST \
            https://api.cognipeer.com/v1/evaluation/$SUITE_ID/run \
            -H "Authorization: Bearer $API_KEY")

          score=$(echo $result | jq '.data.averageScore')

          if (( $(echo "$score < 0.85" | bc -l) )); then
            echo "Evaluation failed: $score"
            exit 1
          fi
```

Monitoring Dashboard

Track key metrics:

  • Average scores over time
  • Pass rates by category
  • Most common failures
  • Response time trends
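
If you pull raw results from the API, the per-category numbers are straightforward to compute yourself. The result shape below (a list of per-question entries with `score`, `passed`, and `tags` fields) is an assumption, so map it to whatever the API actually returns.

```javascript
// Group per-question results by tag and compute average score and pass rate.
// The { score, passed, tags } shape is assumed, not the documented API response.
function metricsByTag(results) {
  const groups = new Map();
  for (const result of results) {
    for (const tag of result.tags ?? ['untagged']) {
      const group = groups.get(tag) ?? { total: 0, passed: 0, scoreSum: 0 };
      group.total += 1;
      group.passed += result.passed ? 1 : 0;
      group.scoreSum += result.score;
      groups.set(tag, group);
    }
  }
  return [...groups.entries()].map(([tag, g]) => ({
    tag,
    averageScore: g.scoreSum / g.total,
    passRate: g.passed / g.total
  }));
}
```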

Slack Notifications

Get alerted when scores drop:

```javascript
if (newScore < previousScore - 0.05) {
  await slack.notify({
    channel: '#ai-alerts',
    message: `⚠️ Peer performance dropped from ${previousScore} to ${newScore}`
  });
}
```

What's Next?

Now that you understand evaluation basics, check out our next blog post: "Optimizing Peers with AI-Powered Analysis", where we dive into using AI to automatically generate improvement suggestions based on evaluation results.

In the meantime, try these exercises:

  1. Create your first evaluation suite with 10 questions
  2. Run an evaluation and review the results
  3. Make one improvement based on the findings
  4. Re-run the evaluation and measure the improvement

Conclusion

Systematic evaluation is the key to building reliable, high-quality AI peers. The Evaluation System provides the tools you need to:

  • Validate peer performance objectively
  • Measure improvements quantitatively
  • Iterate confidently with data-driven decisions
  • Deploy with confidence knowing your peer is ready

Start testing your peers today and see the quality improvements for yourself!


Have questions about evaluation? Join our community forum or contact our team.

Want to learn more? Read our next post: Optimizing Peers with AI-Powered Analysis
