
Evaluation & Testing

The Evaluation System provides a comprehensive framework for testing and evaluating your AI peers' performance. Evaluation suites are workspace-level resources that can be used to test any peer in your workspace. This feature helps you ensure consistent quality, measure improvements, and maintain high standards for your AI applications.

Important: Evaluation Architecture

  • Evaluation Suites are independent resources at the workspace level
  • Evaluation Runs associate a suite with a specific peer for testing
  • A single suite can be used to test multiple different peers
  • Suites are reusable and can be versioned independently

Overview

The Evaluation System allows you to:

  • Create Test Suites: Organize related test cases into suites
  • Import Questions: Bulk import test questions from CSV or JSON files
  • Multiple Evaluators: Use different evaluation methods (Exact Match, LLM Judge, Semantic Similarity)
  • Real-time Monitoring: Track evaluation progress in real-time
  • Detailed Results: Get comprehensive metrics and insights
  • AI-Powered Analysis: Receive improvement suggestions based on results

Key Concepts

Evaluation Suite

An evaluation suite is a workspace-level resource - a collection of test questions that can be used to evaluate any peer in your workspace. Suites are independent and reusable:

  • Workspace-Level: Not tied to a specific peer
  • Reusable: Use the same suite to test multiple peers or different versions
  • Name and Description: Identify the purpose of the suite
  • Questions: Test cases with expected answers
  • Evaluator Configuration: Which evaluation methods to use
  • Run History: Track all runs across different peers

Evaluation Question

Each question in a suite contains:

  • Question Text: The input to send to the peer
  • Expected Answer: The correct or ideal response
  • Context (optional): Additional context for the peer
  • Metadata (optional): Tags, categories, or custom fields

Evaluators

Cognipeer AI supports three types of evaluators:

1. Exact Match Evaluator

Compares the peer's response directly with the expected answer.

Best for:

  • Factual questions with specific answers
  • Structured data extraction
  • Command/query responses

Configuration:

json
{
  "caseSensitive": false,
  "ignoreWhitespace": true,
  "normalizeNumbers": true
}

Scoring:

  • 1.0: Exact match
  • 0.0: No match
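
As a rough sketch of what these options mean in practice, the comparison below normalizes both strings before checking equality. This is illustrative only; the platform's actual normalization rules may differ.

javascript
// Illustrative sketch of an exact-match check honoring the options above.
// Assumes simplified normalization; the platform's actual rules may differ.
function exactMatchScore(expected, actual, options = {}) {
  const {
    caseSensitive = false,
    ignoreWhitespace = true,
    normalizeNumbers = true,
  } = options;

  const normalize = (text) => {
    let t = String(text);
    if (!caseSensitive) t = t.toLowerCase();
    if (ignoreWhitespace) t = t.replace(/\s+/g, " ").trim();
    if (normalizeNumbers) t = t.replace(/(\d+)\.0+\b/g, "$1"); // e.g. "9.00" -> "9"
    return t;
  };

  return normalize(expected) === normalize(actual) ? 1.0 : 0.0;
}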

2. LLM Judge Evaluator

Uses an AI model (such as GPT-4o mini) to evaluate the quality and correctness of responses.

Best for:

  • Open-ended questions
  • Creative responses
  • Contextual understanding
  • Complex reasoning

Configuration:

json
{
  "modelId": "chatgpt-4o-mini",
  "criteria": [
    "accuracy",
    "completeness",
    "relevance",
    "clarity"
  ],
  "strictness": "moderate"
}

Scoring:

  • 0.0 - 1.0: Graded based on multiple criteria
  • Provides detailed reasoning for the score
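
The exact weighting is internal to the platform, but conceptually a multi-criteria judgment can be collapsed into a single 0.0 - 1.0 score, for example by averaging the per-criterion grades. A minimal illustrative sketch:

javascript
// Illustrative only: averaging per-criterion grades into one score.
// The platform's actual aggregation and weighting are not documented here.
function aggregateJudgeScore(criterionScores) {
  const values = Object.values(criterionScores);
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// { accuracy: 0.9, completeness: 0.8, relevance: 1.0, clarity: 0.7 } -> 0.85
console.log(aggregateJudgeScore({ accuracy: 0.9, completeness: 0.8, relevance: 1.0, clarity: 0.7 }));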

3. Semantic Similarity Evaluator

Measures semantic similarity between expected and actual answers using embeddings.

Best for:

  • Paraphrased answers
  • Conceptually similar responses
  • Language-agnostic evaluation

Configuration:

json
{
  "threshold": 0.8,
  "modelId": "text-embedding-3-small"
}

Scoring:

  • 0.0 - 1.0: Cosine similarity between embeddings
  • Pass/Fail: Based on configured threshold
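
The scoring step itself is plain cosine similarity. The sketch below shows the computation and the threshold check; embedding generation is handled by the configured model (e.g. text-embedding-3-small), and the short vectors here are placeholders.

javascript
// Cosine similarity between two embedding vectors, plus the threshold check.
// Real embeddings have many more dimensions; these vectors are placeholders.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const expectedEmbedding = [0.12, 0.85, 0.31];
const actualEmbedding = [0.10, 0.80, 0.35];

const score = cosineSimilarity(expectedEmbedding, actualEmbedding);
const passed = score >= 0.8; // threshold from the evaluator configuration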

Creating an Evaluation Suite

Evaluation suites are created at the workspace level and can be used to test any peer.

Step 1: Navigate to Evaluations

  1. Go to your workspace dashboard
  2. Click on the "Evaluations" menu item (workspace-level, not peer-specific)

Step 2: Configure Basic Information

Provide the following details:

  • Suite Name: A descriptive name (e.g., "Customer Support Quality Test")
  • Description: What this suite tests
  • Peer: Not selected here; the peer to test is chosen when you start an evaluation run

Step 3: Add Questions

You can add questions in three ways:

Manual Entry

Add questions one at a time:

Question: What are your business hours?
Expected Answer: We are open Monday-Friday, 9 AM to 6 PM EST.

CSV Import

Upload a CSV file with the following structure:

csv
question,expectedAnswer,context,tags
"What are your business hours?","We are open Monday-Friday, 9 AM to 6 PM EST.","","support,hours"
"How do I reset my password?","Click 'Forgot Password' on the login page...","","support,account"

CSV Format Requirements:

  • Must include headers: question, expectedAnswer
  • Optional columns: context, tags, metadata
  • UTF-8 encoding recommended; other encodings are detected automatically
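
If you generate the file programmatically, a minimal Node.js sketch like the one below produces a compatible CSV; for production data a dedicated CSV library handles quoting edge cases more robustly.

javascript
// Minimal sketch: writing questions to a CSV with the required headers.
// A dedicated CSV library is safer for quoting edge cases in real data.
const fs = require("fs");

const questions = [
  {
    question: "What are your business hours?",
    expectedAnswer: "We are open Monday-Friday, 9 AM to 6 PM EST.",
    context: "",
    tags: "support,hours",
  },
];

const quote = (value) => `"${String(value).replace(/"/g, '""')}"`;
const header = "question,expectedAnswer,context,tags";
const rows = questions.map((q) =>
  [q.question, q.expectedAnswer, q.context, q.tags].map(quote).join(",")
);

fs.writeFileSync("questions.csv", [header, ...rows].join("\n"), "utf8");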

JSON Import

Upload a JSON file:

json
{
  "questions": [
    {
      "question": "What are your business hours?",
      "expectedAnswer": "We are open Monday-Friday, 9 AM to 6 PM EST.",
      "context": "",
      "tags": ["support", "hours"]
    },
    {
      "question": "How do I reset my password?",
      "expectedAnswer": "Click 'Forgot Password' on the login page and follow the instructions.",
      "context": "",
      "tags": ["support", "account"]
    }
  ]
}

Step 4: Configure Evaluators

Select which evaluators to use:

  1. Exact Match: Enable for factual questions
  2. LLM Judge: Enable for quality assessment
  3. Semantic Similarity: Enable for paraphrased responses

Configure each evaluator's settings based on your needs.

Step 5: Save and Run

  1. Review your configuration
  2. Click "Create Suite"
  3. The suite is now created at workspace level, ready to test any peer

Running Evaluations

When you run an evaluation, you select which peer to test with the suite.

Starting an Evaluation Run

  1. Navigate to your evaluation suite
  2. Click "Run Evaluation"
  3. Select the peer you want to test (required)
  4. Optionally select a specific peer version
  5. Click "Start Run"
  6. Monitor progress in real-time

Important: The same suite can be run against multiple peers or different versions of the same peer to compare performance.

Real-time Progress

During execution, you'll see:

  • Overall Progress: Percentage complete
  • Questions Processed: X of Y completed
  • Current Status: Running, Completed, or Failed
  • Live Results: Scores update as questions are evaluated

Evaluation Results

After completion, view detailed results:

Summary Metrics

  • Overall Score: Average across all evaluators
  • Pass Rate: Percentage of passing questions
  • Total Questions: Number of test cases
  • Duration: Total execution time
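
Conceptually, these metrics follow directly from the per-question results. The sketch below assumes a simplified results shape (an array with a score and a passed flag per question); the actual response format may differ.

javascript
// Illustrative derivation of summary metrics from per-question results.
// The results shape here (score, passed) is an assumption, not the real API format.
function summarize(results) {
  const totalQuestions = results.length;
  const overallScore =
    results.reduce((sum, r) => sum + r.score, 0) / totalQuestions;
  const passRate =
    results.filter((r) => r.passed).length / totalQuestions;
  return { overallScore, passRate, totalQuestions };
}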

Individual Question Results

For each question, see:

  • Question Text: The input provided
  • Peer Response: What the peer actually answered
  • Expected Answer: The correct response
  • Scores: Results from each evaluator
  • Pass/Fail Status: Based on configured thresholds

Evaluator Breakdown

View results by evaluator:

  • Exact Match Results: Binary pass/fail
  • LLM Judge Scores: Detailed reasoning and grades
  • Semantic Similarity: Similarity scores and threshold comparison

AI-Powered Analysis

After running an evaluation, you can request AI-powered improvement suggestions.

Requesting Analysis

  1. Open an evaluation run result
  2. Click "Analyze with AI"
  3. Wait for the analysis to complete (typically 10-30 seconds)

Understanding Suggestions

The AI analysis provides:

  • Problem Identification: What issues were detected
  • Root Cause Analysis: Why problems occurred
  • Specific Recommendations: Actionable improvements
  • Priority Level: Which changes to make first

Applying Improvements

Suggestions may include:

  1. Prompt Modifications: Enhanced instructions
  2. Tool Additions: Recommended tools to enable
  3. Settings Changes: Temperature, model selection, etc.
  4. Data Source Updates: Knowledge base improvements

You can:

  • Preview Changes: See what will be modified
  • Apply Individually: Select specific suggestions
  • Apply All: Implement all recommendations at once
  • Edit Before Applying: Customize suggestions

See AI Analysis Guide for more details.

Best Practices

Designing Test Suites

  1. Start Small: Begin with 10-20 representative questions
  2. Cover Edge Cases: Include unusual or challenging scenarios
  3. Regular Updates: Add new questions as your peer evolves
  4. Version Control: Clone suites before major changes

Question Design

  1. Clear Expected Answers: Be specific about what's correct
  2. Realistic Scenarios: Use actual user questions
  3. Diverse Topics: Cover all areas of your peer's knowledge
  4. Context Matters: Provide context when necessary

Evaluator Selection

  • Exact Match: Use for factual data and structured output; avoid for creative responses or paraphrasing
  • LLM Judge: Use when quality matters more than exact wording; avoid when you need deterministic results
  • Semantic Similarity: Use when meaning matters more than exact words; avoid when precise terminology is required

Continuous Improvement

  1. Baseline First: Run initial evaluation before changes
  2. Incremental Changes: Make one change at a time
  3. Compare Results: Track improvements over time
  4. Automate: Schedule regular evaluation runs
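
Tracking improvements can be as simple as diffing the overall score of a baseline run against a later run, as in this sketch (the overallScore field is assumed for illustration):

javascript
// Sketch: comparing a baseline run with a later run.
// Assumes each run result exposes an overall score between 0.0 and 1.0.
function compareRuns(baselineRun, currentRun) {
  const delta = currentRun.overallScore - baselineRun.overallScore;
  return {
    baseline: baselineRun.overallScore,
    current: currentRun.overallScore,
    change: `${(delta * 100).toFixed(1)} percentage points`,
  };
}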

Common Use Cases

Customer Support Quality

Test your support peer's ability to handle common questions:

Suite: Customer Support Quality
Questions: 50+
Evaluators: LLM Judge (primary), Semantic Similarity (secondary)
Focus: Helpfulness, accuracy, tone

Data Extraction Accuracy

Verify structured data extraction:

Suite: Invoice Data Extraction
Questions: 30+
Evaluators: Exact Match (primary)
Focus: Correct field extraction, data formatting

Conversation Quality

Assess natural conversation flow:

Suite: Conversation Quality
Questions: 40+
Evaluators: LLM Judge (primary)
Focus: Context awareness, coherence, engagement

Compliance & Safety

Ensure adherence to guidelines:

Suite: Compliance Check
Questions: 25+
Evaluators: LLM Judge with strict criteria
Focus: Policy compliance, appropriate responses

Troubleshooting

Low Scores

Problem: Evaluation scores are lower than expected

Solutions:

  1. Review failed questions individually
  2. Check if expected answers are realistic
  3. Request AI analysis for suggestions
  4. Adjust evaluator thresholds if too strict

Inconsistent Results

Problem: Same questions get different scores across runs

Solutions:

  1. For LLM Judge: This is expected due to model variance
  2. Increase sample size for more reliable averages
  3. Use Exact Match for deterministic results
  4. Set temperature to 0 for more consistent LLM evaluations

Slow Execution

Problem: Evaluations take too long to complete

Solutions:

  1. Reduce number of questions per run
  2. Use faster models for LLM Judge
  3. Disable Semantic Similarity if not needed
  4. Run evaluations during off-peak hours

Import Failures

Problem: Can't import questions from CSV/JSON

Solutions:

  1. Verify file encoding (UTF-8 recommended)
  2. Check CSV headers match required format
  3. Ensure JSON structure is correct
  4. Remove special characters causing parsing issues

API Integration

Automate evaluations using the API. Consistent with the architecture above, a suite is created without a peer; the peer to test is specified when you start a run:

Create Evaluation Suite

javascript
POST /api/v1/evaluation
{
  "name": "Customer Support Quality",
  "description": "Tests support peer responses",
  "evaluators": {
    "llmJudge": {
      "enabled": true,
      "config": {
        "modelId": "chatgpt-4o-mini",
        "criteria": ["accuracy", "helpfulness"]
      }
    }
  }
}
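
A hedged example of calling this endpoint from Node.js with fetch; the base URL and bearer-token header are assumptions and may differ in your deployment.

javascript
// Sketch of creating a suite via fetch. Base URL and auth header are assumptions.
const BASE_URL = "https://api.cognipeer.com"; // assumption; use your deployment's URL
const API_KEY = process.env.COGNIPEER_API_KEY;

const response = await fetch(`${BASE_URL}/api/v1/evaluation`, {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${API_KEY}`, // assumption: bearer-token auth
  },
  body: JSON.stringify({
    name: "Customer Support Quality",
    description: "Tests support peer responses",
    evaluators: {
      llmJudge: {
        enabled: true,
        config: { modelId: "chatgpt-4o-mini", criteria: ["accuracy", "helpfulness"] },
      },
    },
  }),
});
const suite = await response.json();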

Import Questions

javascript
POST /api/v1/evaluation/:suiteId/questions/import
{
  "format": "csv",
  "data": "base64_encoded_csv_content"
}
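
The data field expects base64-encoded file content. In Node.js that can be produced like this (sketch):

javascript
// Base64-encode the CSV file for the "data" field of the import request.
const fs = require("fs");

const csvContent = fs.readFileSync("questions.csv", "utf8");
const payload = {
  format: "csv",
  data: Buffer.from(csvContent, "utf8").toString("base64"),
};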

Run Evaluation

javascript
POST /api/v1/evaluation/:suiteId/run
{
  "peerId": "peer_123",
  "description": "Weekly quality check"
}

Get Results

javascript
GET /api/v1/evaluation/:suiteId/runs/:runId
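
A minimal polling sketch for waiting on a run to finish. The status field and its values are assumptions based on the states shown in the UI (Running, Completed, Failed), as are the base URL and auth header.

javascript
// Poll a run until it finishes. The "status" field and its values are assumed
// from the states shown in the UI; adjust to the actual response format.
async function waitForRun(baseUrl, apiKey, suiteId, runId) {
  while (true) {
    const res = await fetch(`${baseUrl}/api/v1/evaluation/${suiteId}/runs/${runId}`, {
      headers: { Authorization: `Bearer ${apiKey}` }, // assumption: bearer-token auth
    });
    const run = await res.json();
    if (run.status === "completed" || run.status === "failed") return run;
    await new Promise((resolve) => setTimeout(resolve, 5000)); // wait 5 seconds
  }
}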

See the API Reference for complete documentation.

Summary

The Evaluation System is a powerful tool for maintaining and improving the quality of your AI peers. By regularly running evaluations, analyzing results, and applying AI-powered suggestions, you can ensure your peers consistently deliver excellent results.

Next Steps:

  1. Create your first evaluation suite
  2. Import or create test questions
  3. Run an evaluation
  4. Request AI analysis for improvement suggestions
  5. Monitor improvements over time
