Testing Your AI Peers with the Evaluation System
Building an AI peer is just the first step. Ensuring it consistently delivers high-quality responses is an ongoing challenge. That's why we built the Evaluation System - a comprehensive testing framework that helps you validate, measure, and improve your peers' performance.
Why Evaluate Your AI Peers?
AI peers, powered by large language models, can sometimes produce inconsistent results. Without systematic testing, you might:
- Miss quality issues until users report them
- Struggle to measure improvements after making changes
- Lack confidence in deploying peers to production
- Have no baseline to compare different configurations
The Evaluation System solves these problems by providing:
✅ Automated Testing: Run comprehensive test suites automatically
✅ Multiple Evaluation Methods: Choose from exact match, semantic similarity, or AI-powered judging
✅ Detailed Metrics: Get quantitative scores for peer performance
✅ AI-Powered Insights: Receive actionable improvement suggestions
✅ Version Comparison: Track performance across different configurations
Getting Started: Your First Evaluation Suite
Let's walk through creating your first evaluation suite for a customer support peer.
Step 1: Create the Suite
Navigate to the "Evaluations" section in your workspace and click "Create New Suite".
Suite Name: Customer Support Quality Test
Description: Tests accuracy and helpfulness of support responses
Tags: customer-support, quality-assurance
Note: Evaluation suites are workspace-level resources. You'll select which peer to test when running the evaluation.
Step 2: Add Test Questions
You can add questions manually or import them from a CSV file. Here's an example:
Question 1:
Question: What are your business hours?
Expected Answer: We are open Monday-Friday, 9 AM to 6 PM EST, and Saturday 10 AM to 4 PM EST.
Tags: hours, availability
Question 2:
Question: How do I reset my password?
Expected Answer: Click 'Forgot Password' on the login page, enter your email, and follow the link sent to your inbox.
Tags: account, password, troubleshooting
Pro Tip: Start with 20-30 questions covering your most common use cases.
Step 3: Configure Evaluators
The Evaluation System offers three types of evaluators:
1. Exact Match Evaluator
Best for factual questions with specific answers:
{
  "caseSensitive": false,
  "ignoreWhitespace": true,
  "normalizeNumbers": true
}
Use for: Hours, prices, product specifications, policy statements.
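If you're curious what those options mean in practice, here's a minimal sketch of the kind of normalization they imply. The function and the number-normalization rule are illustrative, not the evaluator's actual code:
// Illustrative sketch only, not the platform's actual implementation.
function exactMatchScore(expected, actual, options = {}) {
  const { caseSensitive = false, ignoreWhitespace = true, normalizeNumbers = true } = options;
  const normalize = (text) => {
    let s = String(text);
    if (!caseSensitive) s = s.toLowerCase();
    if (ignoreWhitespace) s = s.replace(/\s+/g, ' ').trim();
    if (normalizeNumbers) s = s.replace(/(\d),(?=\d{3}\b)/g, '$1'); // "1,000" -> "1000"
    return s;
  };
  return normalize(expected) === normalize(actual) ? 1 : 0;
}

console.log(exactMatchScore('Mon-Fri 9-6 EST', '  mon-fri 9-6 est ')); // 1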
2. Semantic Similarity Evaluator
Measures semantic closeness using embeddings:
{
  "threshold": 0.8,
  "modelId": "text-embedding-3-small"
}
Use for: Paraphrased answers, conceptually similar responses.
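Evaluators like this typically embed both the expected answer and the peer's response, then compare the two vectors; cosine similarity is the usual metric, though the platform's exact scoring is internal. A minimal sketch of that comparison:
// Illustrative sketch: score two embedding vectors with cosine similarity.
// How the platform obtains embeddings and combines scores is internal; this only shows the math.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function semanticPass(expectedVec, responseVec, threshold = 0.8) {
  const score = cosineSimilarity(expectedVec, responseVec);
  return { score, pass: score >= threshold };
}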
3. LLM Judge Evaluator
AI-powered evaluation for quality and relevance:
{
  "modelId": "chatgpt-4o-mini",
  "criteria": ["accuracy", "helpfulness", "clarity"],
  "strictness": "moderate"
}
Use for: Open-ended questions, customer service scenarios, creative responses.
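Conceptually, an LLM judge prompts a model with the question, the expected answer, the peer's response, and your criteria, then asks it for a numeric score. The sketch below shows that pattern; it is not the exact prompt the Evaluation System sends:
// Illustrative sketch of the LLM-judge pattern; the actual prompt used by the system is internal.
function buildJudgePrompt({ question, expectedAnswer, peerResponse, criteria, strictness }) {
  return [
    `You are a ${strictness} evaluator. Score the response between 0.0 and 1.0.`,
    `Criteria: ${criteria.join(', ')}.`,
    `Question: ${question}`,
    `Expected answer: ${expectedAnswer}`,
    `Peer response: ${peerResponse}`,
    'Reply with JSON: {"score": <number>, "reason": "<one sentence>"}',
  ].join('\n');
}

const prompt = buildJudgePrompt({
  question: 'What are your business hours?',
  expectedAnswer: 'Monday-Friday, 9 AM to 6 PM EST',
  peerResponse: 'We are open weekdays 9-6 EST.',
  criteria: ['accuracy', 'helpfulness', 'clarity'],
  strictness: 'moderate',
});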
Recommendation: Enable all three evaluators for comprehensive testing.
Step 4: Run Your First Evaluation
Click "Run Evaluation", select the peer you want to test (e.g., "Support Assistant"), and watch the progress in real-time. The system will:
- Send each question to the selected peer
- Collect the responses
- Evaluate using your configured evaluators
- Calculate scores and metrics
Results typically complete in 2-5 minutes for 20-30 questions.
Tip: You can run the same suite against multiple peers or different versions of the same peer to compare performance.
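If you'd rather script those comparison runs than click through the UI, the run endpoint used in the CI/CD example later in this post can be called directly. The peerId field in the request body below is an assumption; check the API reference for the actual schema:
// Hedged sketch (Node 18+ built-in fetch): trigger an evaluation run over HTTP.
// The "peerId" body field is assumed; consult the API docs for the real parameter names.
async function runEvaluation(suiteId, peerId, apiKey) {
  const response = await fetch(`https://api.cognipeer.com/v1/evaluation/${suiteId}/run`, {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${apiKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ peerId }),
  });
  if (!response.ok) throw new Error(`Evaluation run failed: ${response.status}`);
  return response.json();
}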
Understanding Your Results
Once complete, you'll see:
Overall Metrics
Average Score: 0.85 (85%)
Pass Rate: 88% (22/25 questions)
Duration: 3 minutes 42 seconds
Per-Question Breakdown
Question: "What are your business hours?"
Peer Response: "Our support team is available Monday through Friday,
from 9am to 6pm EST, and on Saturdays from 10am to 4pm EST."
Expected: "We are open Monday-Friday, 9 AM to 6 PM EST,
and Saturday 10 AM to 4 PM EST."
Scores:
✗ Exact Match: 0.0 (wording differs)
✓ LLM Judge: 0.95 (accurate and clear)
✓ Semantic Similarity: 0.98 (same meaning)
Overall: PASS (0.93)
Evaluator Insights
Each evaluator provides different perspectives:
- Exact Match catches literal deviations
- Semantic Similarity ensures conceptual accuracy
- LLM Judge assesses quality and user experience
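The platform computes the overall score for you, so treat the sketch below purely as a mental model: one plausible way to combine per-evaluator scores is a weighted average that de-emphasizes strict checks like Exact Match for open-ended answers. The weights here are invented and will not reproduce the exact 0.93 shown above:
// Mental-model sketch only; the weights are made up, not the system's actual formula.
function combineScores(scores, weights = { exactMatch: 0.1, semanticSimilarity: 0.45, llmJudge: 0.45 }) {
  let weighted = 0;
  let totalWeight = 0;
  for (const [evaluator, score] of Object.entries(scores)) {
    const w = weights[evaluator] ?? 0;
    weighted += score * w;
    totalWeight += w;
  }
  return totalWeight > 0 ? weighted / totalWeight : 0;
}

console.log(combineScores({ exactMatch: 0.0, semanticSimilarity: 0.98, llmJudge: 0.95 })); // ~0.87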
Real-World Example: Improving a Support Peer
Let's look at a real scenario:
Initial Results (Baseline)
Overall Score: 0.72 (72%)
Pass Rate: 68%
Common Issues:
- Generic responses lacking specifics
- Missing product details
- Inconsistent tone
Analysis Phase
After running the evaluation, we used the AI Analysis feature (more on this in our next blog post), which suggested:
- Add product documentation datasource - Score improvement: +15%
- Enhance system prompt with tone guidelines - Score improvement: +8%
- Lower temperature from 0.8 to 0.3 - Score improvement: +5%
Implementation
We applied the suggestions and re-ran the evaluation:
Overall Score: 0.91 (91%)
Pass Rate: 92%
Improvements:
✓ Specific, accurate product information
✓ Consistent professional tone
✓ Reduced hallucinations
Net improvement: +19 percentage points in just one iteration!
Best Practices for Effective Evaluation
1. Start with Real User Questions
Don't invent questions - use actual queries from your users:
# Export conversation logs
# Extract common questions
# Add to evaluation suite
This ensures your tests reflect real usage patterns.
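As a concrete starting point, the sketch below pulls the most frequent user questions out of an exported log. It assumes a JSON export shaped as an array of { role, content } messages; adapt it to whatever format your conversation export actually uses:
// Hedged sketch: find the most common user questions in an exported conversation log.
// Assumes conversations.json is an array like [{ "role": "user", "content": "..." }, ...].
const fs = require('fs');

const messages = JSON.parse(fs.readFileSync('conversations.json', 'utf8'));
const counts = new Map();

for (const message of messages) {
  if (message.role !== 'user') continue;
  const question = message.content.trim().toLowerCase();
  counts.set(question, (counts.get(question) || 0) + 1);
}

const topQuestions = [...counts.entries()]
  .sort((a, b) => b[1] - a[1])
  .slice(0, 30);

console.log(topQuestions); // seed these into your evaluation suite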
2. Cover Edge Cases
Include challenging scenarios:
- Ambiguous questions
- Multi-part questions
- Questions outside your peer's scope
- Questions requiring reasoning
- Questions with multiple valid answers
3. Define Clear Expected Answers
Vague expectations lead to unreliable scores:
❌ Bad: "Something about business hours"
✅ Good: "Monday-Friday, 9 AM to 6 PM EST"
4. Use Multiple Evaluators
Different evaluators catch different issues:
- Exact Match: Catches factual errors
- Semantic Similarity: Ensures meaning is preserved
- LLM Judge: Assesses overall quality
5. Run Regularly
Make evaluation part of your workflow:
- Before deployment: Validate changes
- After updates: Ensure no regressions
- Weekly: Monitor ongoing performance
- After feedback: Test reported issues
6. Track Improvements Over Time
Keep a log of evaluation scores:
Version 1.0: 72% (baseline)
Version 1.1: 85% (added datasources)
Version 1.2: 91% (improved prompt)
Version 2.0: 94% (new model)
This helps justify optimizations and demonstrate ROI.
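One lightweight way to keep that log is a JSON file in version control that each run appends to; the field names below are illustrative:
// Illustrative score history: append each run's summary to a JSON file you keep in git.
const fs = require('fs');

function logEvaluation(version, averageScore, note, file = 'eval-history.json') {
  const history = fs.existsSync(file) ? JSON.parse(fs.readFileSync(file, 'utf8')) : [];
  history.push({ version, averageScore, note, date: new Date().toISOString() });
  fs.writeFileSync(file, JSON.stringify(history, null, 2));
}

logEvaluation('1.2', 0.91, 'improved prompt');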
Advanced Techniques
Bulk Importing Questions
For large test suites, use CSV import:
question,expectedAnswer,context,tags
"What are your hours?","Mon-Fri 9-6 EST, Sat 10-4 EST","","hours"
"How to reset password?","Click 'Forgot Password' on login page","","account"Import 100+ questions in seconds.
Conditional Testing
Create different suites for different scenarios:
- Basic Functionality: Core features
- Advanced Features: Complex scenarios
- Error Handling: Edge cases and errors
- Multilingual: Different languages
- Performance: High-volume scenarios
Automated Regression Testing
Set up automated evaluation runs:
// Run evaluation after each deployment.
// peer, evaluation, notifications, and rollback are placeholders for your own SDK/client objects.
await peer.deploy();
const result = await evaluation.run(suiteId);
if (result.averageScore < 0.85) {
  await notifications.alert('Evaluation failed! Score dropped.');
  await rollback.toPreviousVersion();
}
A/B Testing with Evaluation
Compare different configurations:
// Test two versions (evaluate() and deploy() stand in for your own helpers or SDK calls).
const resultA = await evaluate(peerA, testSuite);
const resultB = await evaluate(peerB, testSuite);
console.log(`Version A: ${resultA.score}`);
console.log(`Version B: ${resultB.score}`);
// Deploy the winner
if (resultB.score > resultA.score) {
  await deploy(peerB);
}
Common Pitfalls to Avoid
❌ Testing Too Few Questions
Problem: 5-10 questions don't provide statistical confidence.
Solution: Aim for at least 20-30 questions, preferably 50+.
❌ Unrealistic Expected Answers
Problem: Expecting exact LLM output leads to false failures.
Solution: Focus on correctness and completeness, not exact wording.
❌ Not Using Multiple Evaluators
Problem: Single evaluator misses nuances.
Solution: Enable at least two evaluators for comprehensive coverage.
❌ Ignoring Failed Tests
Problem: Low scores without investigation.
Solution: Review each failed question, understand why, and iterate.
❌ Testing Only Happy Paths
Problem: Real users ask unexpected questions.
Solution: Include edge cases, errors, and ambiguous scenarios.
Integration with Your Workflow
CI/CD Pipeline Integration
# .github/workflows/test-peer.yml
name: Test AI Peer
on: [push]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - name: Run Evaluation
        env:
          SUITE_ID: ${{ secrets.SUITE_ID }}
          API_KEY: ${{ secrets.API_KEY }}
        run: |
          result=$(curl -X POST \
            https://api.cognipeer.com/v1/evaluation/$SUITE_ID/run \
            -H "Authorization: Bearer $API_KEY")
          score=$(echo $result | jq '.data.averageScore')
          if (( $(echo "$score < 0.85" | bc -l) )); then
            echo "Evaluation failed: $score"
            exit 1
          fi
Monitoring Dashboard
Track key metrics:
- Average scores over time
- Pass rates by category
- Most common failures
- Response time trends
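If the dashboard doesn't slice things the way you need, per-question results can be aggregated yourself. The sketch below assumes each result exposes its question's tags and a pass flag; the exact field names depend on the API response:
// Hedged sketch: pass rate per tag, assuming each result has { tags: [...], passed: boolean }.
function passRateByTag(results) {
  const byTag = {};
  for (const result of results) {
    for (const tag of result.tags) {
      byTag[tag] = byTag[tag] || { passed: 0, total: 0 };
      byTag[tag].total += 1;
      if (result.passed) byTag[tag].passed += 1;
    }
  }
  return Object.fromEntries(
    Object.entries(byTag).map(([tag, { passed, total }]) => [tag, passed / total])
  );
}

console.log(passRateByTag([
  { tags: ['hours'], passed: true },
  { tags: ['account', 'password'], passed: false },
]));
// { hours: 1, account: 0, password: 0 }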
Slack Notifications
Get alerted when scores drop:
if (newScore < previousScore - 0.05) {
  await slack.notify({
    channel: '#ai-alerts',
    message: `⚠️ Peer performance dropped from ${previousScore} to ${newScore}`
  });
}
What's Next?
Now that you understand evaluation basics, check out our next blog post: "Optimizing Peers with AI-Powered Analysis", where we dive into using AI to automatically generate improvement suggestions based on evaluation results.
In the meantime, try these exercises:
- Create your first evaluation suite with 10 questions
- Run an evaluation and review the results
- Make one improvement based on the findings
- Re-run the evaluation and measure the improvement
Conclusion
Systematic evaluation is the key to building reliable, high-quality AI peers. The Evaluation System provides the tools you need to:
- Validate peer performance objectively
- Measure improvements quantitatively
- Iterate confidently with data-driven decisions
- Deploy with confidence knowing your peer is ready
Start testing your peers today and see the quality improvements for yourself!
Have questions about evaluation? Join our community forum or contact our team.
Want to learn more? Read our next post: Optimizing Peers with AI-Powered Analysis

