Evaluation & Testing

The Evaluation System provides a comprehensive framework for testing and evaluating your AI peers' performance. Evaluation suites are workspace-level resources that can be used to test any peer in your workspace. This feature helps you ensure consistent quality, measure improvements, and maintain high standards for your AI applications.

Important: Evaluation Architecture

Evaluation Suites are independent resources at the workspace level
Evaluation Runs associate a suite with a specific peer for testing
A single suite can be used to test multiple different peers
Suites are reusable and can be versioned independently

Overview

The Evaluation System allows you to:

Create Test Suites: Organize related test cases into suites
Import Questions: Bulk import test questions from CSV or JSON files
Multiple Evaluators: Use different evaluation methods (Exact Match, LLM Judge, Semantic Similarity)
Real-time Monitoring: Track evaluation progress in real-time
Detailed Results: Get comprehensive metrics and insights
AI-Powered Analysis: Receive improvement suggestions based on results

Key Concepts

Evaluation Suite

An evaluation suite is a workspace-level resource: a collection of test questions that can be used to evaluate any peer in your workspace. Suites are independent and reusable:

Workspace-Level: Not tied to a specific peer
Reusable: Use the same suite to test multiple peers or different versions
Name and Description: Identify the purpose of the suite
Questions: Test cases with expected answers
Evaluator Configuration: Which evaluation methods to use
Run History: Track all runs across different peers

Evaluation Question

Each question in a suite contains:

Question Text: The input to send to the peer
Expected Answer: The correct or ideal response
Context (optional): Additional context for the peer
Metadata (optional): Tags, categories, or custom fields
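As an illustration, the fields above could be modeled as a small record type. This is a sketch only: the class name and Python attribute names are hypothetical and are not the platform's actual schema.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class EvaluationQuestion:
    """One test case in a suite (illustrative model, not the platform's schema)."""
    question: str                # input sent to the peer
    expected_answer: str         # correct or ideal response
    context: str = ""            # optional extra context for the peer
    metadata: dict[str, Any] = field(default_factory=dict)  # optional tags, categories, custom fields


q = EvaluationQuestion(
    question="What are your business hours?",
    expected_answer="We are open Monday-Friday, 9 AM to 6 PM EST.",
    metadata={"tags": ["support", "hours"]},
)
```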

Evaluators

Cognipeer AI supports three types of evaluators:

1. Exact Match Evaluator

Compares the peer's response directly with the expected answer.

Best for:

Factual questions with specific answers
Structured data extraction
Command/query responses

Configuration:

json

{
  "caseSensitive": false,
  "ignoreWhitespace": true,
  "normalizeNumbers": true
}

Scoring:

1.0: Exact match
0.0: No match
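The following sketch shows one way the three options could affect a comparison. The exact normalization rules are not specified here, so the behavior below is an assumption: whitespace runs are collapsed, case is folded, and numbers lose thousands separators and trailing decimal zeros. The function name `exact_match` is hypothetical.

```python
import re

# Matches numbers such as "1000", "1,000" or "1,000.50".
_NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def _canonical_number(match: re.Match) -> str:
    """Assumed normalization: '1,000.50' -> '1000.5', '3.0' -> '3'."""
    digits = match.group(0).replace(",", "")
    if "." in digits:
        digits = digits.rstrip("0").rstrip(".")
    return digits


def exact_match(response: str, expected: str, *, case_sensitive: bool = False,
                ignore_whitespace: bool = True, normalize_numbers: bool = True) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    def normalize(text: str) -> str:
        if normalize_numbers:
            text = _NUMBER.sub(_canonical_number, text)
        if ignore_whitespace:
            text = " ".join(text.split())  # collapse runs, trim both ends
        if not case_sensitive:
            text = text.casefold()
        return text

    return 1.0 if normalize(response) == normalize(expected) else 0.0


print(exact_match("  We are   OPEN ", "we are open"))
print(exact_match("Total: 1,000.50", "total: 1000.5"))
```

With `case_sensitive=True`, "Paris" and "paris" would no longer match, which is why the default configuration above is the more forgiving choice for short factual answers.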

2. LLM Judge Evaluator

Uses an AI model (such as GPT-4o mini, as in the configuration below) to evaluate the quality and correctness of responses.

Best for:

Open-ended questions
Creative responses
Contextual understanding
Complex reasoning

Configuration:

json

{
  "modelId": "chatgpt-4o-mini",
  "criteria": [
    "accuracy",
    "completeness",
    "relevance",
    "clarity"
  ],
  "strictness": "moderate"
}

Scoring:

0.0 - 1.0: Graded based on multiple criteria
Provides detailed reasoning for the score

3. Semantic Similarity Evaluator

Measures semantic similarity between expected and actual answers using embeddings.

Best for:

Paraphrased answers
Conceptually similar responses
Language-agnostic evaluation

Configuration:

json

{
  "threshold": 0.8,
  "modelId": "text-embedding-3-small"
}

Scoring:

0.0 - 1.0: Cosine similarity between embeddings
Pass/Fail: Based on configured threshold
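To make the scoring concrete, here is a minimal sketch of cosine similarity with a pass threshold. The embedding vectors would come from the configured embedding model; the tiny vectors below are placeholders. Clamping negative similarities to 0.0 and treating a score equal to the threshold as a pass are both assumptions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def semantic_result(expected_vec: Sequence[float], actual_vec: Sequence[float],
                    threshold: float = 0.8) -> dict:
    """Score in [0, 1] plus a pass/fail flag (clamping and >= are assumptions)."""
    score = max(0.0, cosine_similarity(expected_vec, actual_vec))
    return {"score": score, "passed": score >= threshold}


same = semantic_result([1.0, 0.0], [1.0, 0.0])
partial = semantic_result([1.0, 1.0], [1.0, 0.0])
```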

Creating an Evaluation Suite

Evaluation suites are created at the workspace level and can be used to test any peer.

Step 1: Navigate to Evaluations

Go to your workspace dashboard
Click on the "Evaluations" menu item (workspace-level, not peer-specific)

Step 2: Configure Basic Information

Provide the following details:

Suite Name: A descriptive name (e.g., "Customer Support Quality Test")
Description: What this suite tests
Peer: Not set here. Suites are workspace-level, so you select the peer when you start an evaluation run

Step 3: Add Questions

You can add questions in three ways:

Manual Entry

Add questions one at a time:

Question: What are your business hours?
Expected Answer: We are open Monday-Friday, 9 AM to 6 PM EST.

CSV Import

Upload a CSV file with the following structure:

csv

question,expectedAnswer,context,tags
"What are your business hours?","We are open Monday-Friday, 9 AM to 6 PM EST.","","support,hours"
"How do I reset my password?","Click 'Forgot Password' on the login page...","","support,account"

CSV Format Requirements:

Must include headers: question, expectedAnswer
Optional columns: context, tags, metadata
UTF-8 encoding supported
Automatic encoding detection
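Before uploading, it can help to check a file locally against the requirements above. The sketch below uses Python's standard `csv` module; the function name and the rule that both required fields must be non-empty are assumptions, not documented platform behavior.

```python
import csv
import io

REQUIRED_COLUMNS = {"question", "expectedAnswer"}

SAMPLE = """question,expectedAnswer,context,tags
"What are your business hours?","We are open Monday-Friday, 9 AM to 6 PM EST.","","support,hours"
"How do I reset my password?","Click 'Forgot Password' on the login page...","","support,account"
"""


def validate_questions_csv(text: str) -> list[dict]:
    """Parse CSV text and raise ValueError if the required columns are unusable."""
    reader = csv.DictReader(io.StringIO(text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    rows = []
    for index, row in enumerate(reader, start=1):
        for column in sorted(REQUIRED_COLUMNS):
            if not (row[column] or "").strip():
                raise ValueError(f"data row {index}: '{column}' is empty")
        rows.append(row)
    return rows


rows = validate_questions_csv(SAMPLE)
print(len(rows), "questions look importable")
```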

JSON Import

Upload a JSON file:

json

{
  "questions": [
    {
      "question": "What are your business hours?",
      "expectedAnswer": "We are open Monday-Friday, 9 AM to 6 PM EST.",
      "context": "",
      "tags": ["support", "hours"]
    },
    {
      "question": "How do I reset my password?",
      "expectedAnswer": "Click 'Forgot Password' on the login page and follow the instructions.",
      "context": "",
      "tags": ["support", "account"]
    }
  ]
}
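The CSV and JSON formats carry the same fields, so one can be converted to the other. The sketch below assumes that the comma-separated `tags` column corresponds to the `tags` array, which matches the examples in this section but is not stated explicitly. The function name is hypothetical.

```python
import csv
import io
import json


def csv_to_import_json(csv_text: str) -> str:
    """Convert question CSV text into the JSON import structure shown above."""
    questions = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw_tags = row.get("tags") or ""
        questions.append({
            "question": row["question"],
            "expectedAnswer": row["expectedAnswer"],
            "context": row.get("context") or "",
            # Assumption: "support,hours" in CSV means ["support", "hours"] in JSON.
            "tags": [tag.strip() for tag in raw_tags.split(",") if tag.strip()],
        })
    return json.dumps({"questions": questions}, indent=2)


csv_text = (
    "question,expectedAnswer,context,tags\n"
    '"What are your business hours?","We are open Monday-Friday, 9 AM to 6 PM EST.","","support,hours"\n'
)
document = json.loads(csv_to_import_json(csv_text))
```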

Step 4: Configure Evaluators

Select which evaluators to use:

Exact Match: Enable for factual questions
LLM Judge: Enable for quality assessment
Semantic Similarity: Enable for paraphrased responses

Configure each evaluator's settings based on your needs.

Step 5: Save and Run

Review your configuration
Click "Create Suite"
The suite is now created at workspace level, ready to test any peer

Running Evaluations

When you run an evaluation, you select which peer to test with the suite.

Starting an Evaluation Run

Navigate to your evaluation suite
Click "Run Evaluation"
Select the peer you want to test (required)
Optionally select a specific peer version
Click "Start Run"
Monitor progress in real-time

Important: The same suite can be run against multiple peers or different versions of the same peer to compare performance.

Real-time Progress

During execution, you'll see:

Overall Progress: Percentage complete
Questions Processed: X of Y completed
Current Status: Running, Completed, or Failed
Live Results: Scores update as questions are evaluated
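When monitoring a run from a script rather than the dashboard, a simple polling loop is usually enough. The sketch below is generic: `fetch_status` is a hypothetical callable that you would implement to return the run's current status, and it is assumed that the status strings match the ones listed above.

```python
import time
from typing import Callable

TERMINAL_STATUSES = ("Completed", "Failed")


def wait_for_run(fetch_status: Callable[[], str], interval_seconds: float = 2.0,
                 max_polls: int = 150, sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the run reaches a terminal status or the poll budget is used up."""
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(interval_seconds)
    raise TimeoutError("evaluation run did not finish within the polling budget")


# Simulated status source; a real one would query the run.
statuses = iter(["Running", "Running", "Completed"])
final_status = wait_for_run(lambda: next(statuses), sleep=lambda seconds: None)
```

Passing `sleep` as a parameter keeps the loop testable without real delays.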

Evaluation Results

After completion, view detailed results:

Summary Metrics

Overall Score: Average across all evaluators
Pass Rate: Percentage of passing questions
Total Questions: Number of test cases
Duration: Total execution time
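The summary metrics can be reproduced from per-question results. The sketch below is one plausible aggregation, not the platform's documented formula: it assumes the overall score is the mean of each question's average score across its evaluators. The result structure and key names are hypothetical.

```python
from statistics import mean


def summarize(results: list[dict], duration_seconds: float) -> dict:
    """Aggregate per-question results into summary metrics."""
    if not results:
        raise ValueError("no results to summarize")
    # Average the evaluator scores for each question, then average across questions.
    per_question = [mean(list(r["scores"].values())) for r in results]
    return {
        "overallScore": mean(per_question),
        "passRate": 100 * sum(1 for r in results if r["passed"]) / len(results),
        "totalQuestions": len(results),
        "durationSeconds": duration_seconds,
    }


summary = summarize(
    [
        {"scores": {"exactMatch": 1.0, "llmJudge": 0.8}, "passed": True},
        {"scores": {"exactMatch": 0.0, "llmJudge": 0.4}, "passed": False},
    ],
    duration_seconds=42.0,
)
```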

Individual Question Results

For each question, see:

Question Text: The input provided
Peer Response: What the peer actually answered
Expected Answer: The correct response
Scores: Results from each evaluator
Pass/Fail Status: Based on configured thresholds

Evaluator Breakdown

View results by evaluator:

Exact Match Results: Binary pass/fail
LLM Judge Scores: Detailed reasoning and grades
Semantic Similarity: Similarity scores and threshold comparison

AI-Powered Analysis

After running an evaluation, you can request AI-powered improvement suggestions.

Requesting Analysis

Open an evaluation run result
Click "Analyze with AI"
Wait for the analysis to complete (typically 10-30 seconds)

Understanding Suggestions

The AI analysis provides:

Problem Identification: What issues were detected
Root Cause Analysis: Why problems occurred
Specific Recommendations: Actionable improvements
Priority Level: Which changes to make first

Applying Improvements

Suggestions may include:

Prompt Modifications: Enhanced instructions
Tool Additions: Recommended tools to enable
Settings Changes: Temperature, model selection, etc.
Data Source Updates: Knowledge base improvements

You can:

Preview Changes: See what will be modified
Apply Individually: Select specific suggestions
Apply All: Implement all recommendations at once
Edit Before Applying: Customize suggestions

See AI Analysis Guide for more details.

Best Practices

Designing Test Suites

Start Small: Begin with 10-20 representative questions
Cover Edge Cases: Include unusual or challenging scenarios
Regular Updates: Add new questions as your peer evolves
Version Control: Clone suites before major changes

Question Design

Clear Expected Answers: Be specific about what's correct
Realistic Scenarios: Use actual user questions
Diverse Topics: Cover all areas of your peer's knowledge
Context Matters: Provide context when necessary

Evaluator Selection

Evaluator	Use When	Avoid When
Exact Match	Factual data, structured output	Creative responses, paraphrasing
LLM Judge	Quality matters more than exact wording	Need deterministic results
Semantic Similarity	Meaning is more important than exact words	Require precise terminology

Continuous Improvement

Baseline First: Run initial evaluation before changes
Incremental Changes: Make one change at a time
Compare Results: Track improvements over time
Automate: Schedule regular evaluation runs
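Comparing a baseline run with a later run is easier when regressions are listed per question. The following is a minimal sketch that assumes you have exported each run as a mapping from a question identifier to its score; the function name, the identifiers, and the 0.05 tolerance are illustrative choices.

```python
from statistics import mean


def compare_runs(baseline: dict[str, float], candidate: dict[str, float],
                 tolerance: float = 0.05) -> dict:
    """Report questions whose score moved by more than `tolerance` between two runs."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("the runs have no questions in common")
    return {
        "regressions": [q for q in shared if candidate[q] < baseline[q] - tolerance],
        "improvements": [q for q in shared if candidate[q] > baseline[q] + tolerance],
        "meanDelta": mean(candidate[q] for q in shared) - mean(baseline[q] for q in shared),
    }


report = compare_runs(
    baseline={"hours": 0.9, "reset": 0.5, "refund": 0.7},
    candidate={"hours": 0.6, "reset": 0.9, "refund": 0.72},
)
```

A positive `meanDelta` can still hide individual regressions, which is why the per-question lists are reported alongside it.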

Common Use Cases

Customer Support Quality

Test your support peer's ability to handle common questions:

Suite: Customer Support Quality
Questions: 50+
Evaluators: LLM Judge (primary), Semantic Similarity (secondary)
Focus: Helpfulness, accuracy, tone

Data Extraction Accuracy

Verify structured data extraction:

Suite: Invoice Data Extraction
Questions: 30+
Evaluators: Exact Match (primary)
Focus: Correct field extraction, data formatting

Conversation Quality

Assess natural conversation flow:

Suite: Conversation Quality
Questions: 40+
Evaluators: LLM Judge (primary)
Focus: Context awareness, coherence, engagement

Compliance & Safety

Ensure adherence to guidelines:

Suite: Compliance Check
Questions: 25+
Evaluators: LLM Judge with strict criteria
Focus: Policy compliance, appropriate responses

Troubleshooting

Low Scores

Problem: Evaluation scores are lower than expected

Solutions:

Review failed questions individually
Check if expected answers are realistic
Request AI analysis for suggestions
Adjust evaluator thresholds if too strict

Inconsistent Results

Problem: Same questions get different scores across runs

Solutions:

For LLM Judge: This is expected due to model variance
Increase sample size for more reliable averages
Use Exact Match for deterministic results
Set temperature to 0 for more consistent LLM evaluations
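To judge whether run-to-run variation is noise or a real change, it helps to look at the mean and spread of a question's score over several runs. A minimal sketch, assuming you have collected the scores yourself:

```python
from statistics import mean, stdev


def score_stability(scores: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) for one question's repeated scores."""
    if not scores:
        raise ValueError("need at least one score")
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread


# LLM Judge scores for the same question across three runs (made-up values).
average, spread = score_stability([0.8, 0.7, 0.9])
print(f"mean={average:.2f} spread={spread:.2f}")
```

A difference between two peers that is smaller than the spread is probably not meaningful on its own.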

Slow Execution

Problem: Evaluations take too long to complete

Solutions:

Reduce number of questions per run
Use faster models for LLM Judge
Disable Semantic Similarity if not needed
Run evaluations during off-peak hours

Import Failures

Problem: Can't import questions from CSV/JSON

Solutions:

Verify file encoding (UTF-8 recommended)
Check CSV headers match required format
Ensure JSON structure is correct
Remove special characters causing parsing issues

API Integration

Automate evaluations using the API:

Create Evaluation Suite

javascript

POST /api/v1/evaluation
{
  "name": "Customer Support Quality",
  "description": "Tests support peer responses",
  "peerId": "peer_123",
  "evaluators": {
    "llmJudge": {
      "enabled": true,
      "config": {
        "modelId": "chatgpt-4o-mini",
        "criteria": ["accuracy", "helpfulness"]
      }
    }
  }
}

Import Questions

javascript

POST /api/v1/evaluation/:suiteId/questions/import
{
  "format": "csv",
  "data": "base64_encoded_csv_content"
}
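The `data` field above is a placeholder for encoded file content. The sketch below builds such a body, assuming the field expects standard Base64 of the UTF-8 encoded CSV text; the function name is hypothetical.

```python
import base64
import json


def build_import_payload(csv_text: str) -> str:
    """Build the JSON body for a CSV import (assumes standard Base64 of UTF-8 bytes)."""
    encoded = base64.b64encode(csv_text.encode("utf-8")).decode("ascii")
    return json.dumps({"format": "csv", "data": encoded})


csv_text = 'question,expectedAnswer\n"What are your business hours?","Monday-Friday, 9 AM to 6 PM EST."\n'
payload = build_import_payload(csv_text)
```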

Run Evaluation

javascript

POST /api/v1/evaluation/:suiteId/run
{
  "description": "Weekly quality check"
}

Get Results

javascript

GET /api/v1/evaluation/:suiteId/runs/:runId
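The requests above can be assembled with Python's standard library. This sketch only constructs the request objects and does not send them. The host name, the Bearer token header, and the `suite_123` and `run_456` identifiers are placeholders and assumptions; check the API Reference for the actual authentication scheme.

```python
import json
from typing import Optional
from urllib.request import Request

BASE_URL = "https://example.invalid"  # placeholder host, replace with your API host


def api_request(method: str, path: str, body: Optional[dict] = None,
                token: str = "YOUR_API_TOKEN") -> Request:
    """Build (but do not send) a JSON request for an evaluation endpoint."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return Request(
        BASE_URL + path,
        data=data,
        method=method,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
    )


run_request = api_request("POST", "/api/v1/evaluation/suite_123/run",
                          {"description": "Weekly quality check"})
results_request = api_request("GET", "/api/v1/evaluation/suite_123/runs/run_456")
```

To send a request, pass it to `urllib.request.urlopen` and decode the JSON response.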

See the API Reference for complete documentation.

Related Documentation

AI Analysis Guide - AI-powered improvement suggestions
Peer Settings - Configure your peer for optimal performance
Topics - Organize knowledge for better responses
API Reference - Evaluation API documentation

Summary

The Evaluation System helps you maintain and improve the quality of your AI peers. By regularly running evaluations, analyzing results, and applying AI-powered suggestions, you can keep your peers' responses at a consistent standard.

Next Steps:

Create your first evaluation suite
Import or create test questions
Run an evaluation
Request AI analysis for improvement suggestions
Monitor improvements over time
