Evaluation & Testing

The Evaluation System lets you test and measure the performance of your AI peers. Evaluation suites are workspace-level resources that can be used to test any peer in your workspace. Use them to check that quality stays consistent and to measure the effect of each change you make to a peer.

Important: Evaluation Architecture

  • Evaluation Suites are independent resources at the workspace level
  • Evaluation Runs associate a suite with a specific peer for testing
  • A single suite can be used to test multiple different peers
  • Suites are reusable and can be versioned independently

Overview

The Evaluation System allows you to:

  • Create Test Suites: Organize related test cases into suites
  • Import Questions: Bulk import test questions from CSV or JSON files
  • Multiple Evaluators: Use different evaluation methods (Exact Match, LLM Judge, Semantic Similarity)
  • Real-time Monitoring: Track evaluation progress in real time
  • Detailed Results: Get comprehensive metrics and insights
  • AI-Powered Analysis: Receive improvement suggestions based on results

Key Concepts

Evaluation Suite

An evaluation suite is a workspace-level resource: a collection of test questions that can be used to evaluate any peer in your workspace. Each suite has the following properties and components:

  • Workspace-Level: Not tied to a specific peer
  • Reusable: Use the same suite to test multiple peers or different versions
  • Name and Description: Identify the purpose of the suite
  • Questions: Test cases with expected answers
  • Evaluator Configuration: Which evaluation methods to use
  • Run History: Track all runs across different peers

Evaluation Question

Each question in a suite contains:

  • Question Text: The input to send to the peer
  • Expected Answer: The correct or ideal response
  • Context (optional): Additional context for the peer
  • Metadata (optional): Tags, categories, or custom fields

Evaluators

Cognipeer AI supports three types of evaluators:

1. Exact Match Evaluator

Compares the peer's response directly with the expected answer.

Best for:

  • Factual questions with specific answers
  • Structured data extraction
  • Command/query responses

Configuration Options: Admins can toggle settings in the dashboard to control matching:

  • Case Sensitivity: Choose whether matching distinguishes uppercase from lowercase letters (default: off, so case differences are ignored).
  • Whitespace Handling: Choose whether extra spaces are ignored (default: on).
  • Number Normalization: Choose whether differences in number formatting are normalized before comparison.

Scoring:

  • 1.0: Exact match
  • 0.0: No match
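
The three options above can be illustrated with a minimal Python sketch. This is not the product's implementation: the function names, the parameter names, and the number-normalization rule (dropping thousands separators and trailing zeros) are assumptions made for illustration. The defaults mirror the documented ones for case sensitivity (off) and whitespace handling (on); the default for number normalization is not documented, so it is assumed to be off here.

```python
import re


def normalize_numbers(text: str) -> str:
    """Rewrite numeric tokens canonically, e.g. '1,000.50' -> '1000.5' (assumed rule)."""
    def canonical(match: re.Match) -> str:
        value = float(match.group(0).replace(",", ""))
        return str(int(value)) if value.is_integer() else repr(value)

    # Digits, optional thousands groups, optional decimal part.
    return re.sub(r"\d+(?:,\d{3})*(?:\.\d+)?", canonical, text)


def exact_match_score(
    expected: str,
    actual: str,
    *,
    case_sensitive: bool = False,     # documented default: off
    ignore_whitespace: bool = True,   # documented default: on
    normalize_nums: bool = False,     # default is an assumption
) -> float:
    """Return 1.0 when the prepared strings are identical, otherwise 0.0."""
    def prepare(text: str) -> str:
        if normalize_nums:
            text = normalize_numbers(text)
        if ignore_whitespace:
            text = " ".join(text.split())  # trim and collapse runs of whitespace
        if not case_sensitive:
            text = text.casefold()
        return text

    return 1.0 if prepare(expected) == prepare(actual) else 0.0
```

With these assumed defaults, "Paris" and "  paris " match, while enabling case sensitivity makes the same pair fail.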

2. LLM Judge Evaluator

Uses an AI model (for example, GPT-4o) to evaluate the quality and correctness of responses.

Best for:

  • Open-ended questions
  • Creative responses
  • Contextual understanding
  • Complex reasoning

Configuration Options:

  • AI Model Selection: Select a model optimized for evaluating quality.
  • Criteria List: Select parameters to evaluate, such as accuracy, completeness, relevance, or clarity.
  • Strictness Level: Choose how strictly the AI judge grades the response (e.g., moderate or strict).

Scoring:

  • 0.0 - 1.0: Graded based on multiple criteria
  • Provides detailed reasoning for the score
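
To make the idea of criteria-based grading concrete, here is a hedged Python sketch of how a judge prompt could be assembled and how a judge's reply could be reduced to a single 0.0 - 1.0 score. The prompt wording, the JSON reply format, the function names, and the equal-weight average are all assumptions for illustration; the actual prompt and aggregation used by the LLM Judge are not described in this guide. No model is called in the sketch.

```python
import json


def build_judge_prompt(question: str, expected: str, actual: str,
                       criteria: list[str], strictness: str = "moderate") -> str:
    """Assemble a grading prompt (hypothetical wording)."""
    criteria_lines = "\n".join(f"- {name}" for name in criteria)
    return (
        f"You are grading an answer. Strictness: {strictness}.\n"
        f"Question: {question}\n"
        f"Expected answer: {expected}\n"
        f"Actual answer: {actual}\n"
        f"Rate each criterion from 0.0 to 1.0:\n{criteria_lines}\n"
        'Reply with JSON: {"scores": {"<criterion>": <number>}, "reasoning": "<text>"}'
    )


def parse_judge_reply(reply: str, criteria: list[str]) -> tuple[float, str]:
    """Average the per-criterion scores (equal weights assumed) and return the reasoning."""
    data = json.loads(reply)
    # Clamp each score to the 0.0 - 1.0 range before averaging.
    scores = [min(1.0, max(0.0, float(data["scores"][name]))) for name in criteria]
    return sum(scores) / len(scores), data.get("reasoning", "")
```

For example, a reply that grades accuracy at 1.0 and clarity at 0.5 yields an overall score of 0.75 under the equal-weight assumption.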

3. Semantic Similarity Evaluator

Measures semantic similarity between expected and actual answers using embeddings.

Best for:

  • Paraphrased answers
  • Conceptually similar responses
  • Language-agnostic evaluation

Configuration Options:

  • Similarity Threshold: Set the minimum similarity score (e.g., 0.8) that counts as a pass.
  • AI Embedding Model: Choose the model that converts text into numeric vectors for semantic comparison.

Scoring:

  • 0.0 - 1.0: Similarity between expected and actual answers
  • Pass/Fail: Based on configured threshold
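
As a rough illustration of threshold-based scoring, the sketch below compares two embedding vectors with cosine similarity and applies a pass threshold. Cosine similarity is a common choice for comparing embeddings, but this guide does not state which measure the evaluator uses, so treat it as an assumption. The function names are hypothetical and the vectors are toy values, not real embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as having no similarity
    return dot / (norm_a * norm_b)


def semantic_result(expected_vec: list[float], actual_vec: list[float],
                    threshold: float = 0.8) -> tuple[float, bool]:
    """Return (score clamped to 0.0 - 1.0, passed) for one question."""
    score = max(0.0, cosine_similarity(expected_vec, actual_vec))
    return score, score >= threshold
```

With the toy vectors [3, 4] and [4, 3] the similarity is 24 / 25 = 0.96, which passes a 0.8 threshold.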

Creating an Evaluation Suite

Evaluation suites are created at the workspace level and can be used to test any peer.

Step 1: Navigate to Evaluations

  1. Go to your workspace dashboard
  2. Click on the "Evaluations" menu item (workspace-level, not peer-specific)

Step 2: Configure Basic Information

Provide the following details:

  • Suite Name: A descriptive name (e.g., "Customer Support Quality Test")
  • Description: What this suite tests
  • Peer: Not selected at this step. Suites are workspace-level, so you choose the peer when you start an evaluation run

Step 3: Add Questions

You can add questions in three ways:

Manual Entry

Add questions one at a time:

Question: What are your business hours?
Expected Answer: We are open Monday-Friday, 9 AM to 6 PM EST.

CSV Import

You can upload a CSV file containing your test questions. The CSV file must include headers for the question and the expected answer (optionally, you can include columns for context and tags). UTF-8 encoding is supported.
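
Before uploading, it can help to check a CSV file locally. The Python sketch below parses CSV text and verifies that the required columns are present and non-empty. The column names (question, expected_answer, context, tags) and the semicolon-separated tag format are assumptions for illustration only; the header names the importer actually expects are not listed in this guide, so confirm them in the Developer Hub.

```python
import csv
import io

# Hypothetical column names; confirm the real headers in the Developer Hub.
REQUIRED_COLUMNS = ("question", "expected_answer")


def load_questions_csv(text: str) -> list[dict]:
    """Parse CSV text into question dicts, raising ValueError on missing data."""
    reader = csv.DictReader(io.StringIO(text))
    headers = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in headers]
    if missing:
        raise ValueError(f"missing required columns: {', '.join(missing)}")

    questions = []
    for row_number, row in enumerate(reader, start=1):
        question = (row["question"] or "").strip()
        expected = (row["expected_answer"] or "").strip()
        if not question or not expected:
            raise ValueError(f"data row {row_number}: question and expected answer are required")
        tags = [t.strip() for t in (row.get("tags") or "").split(";") if t.strip()]
        questions.append({
            "question": question,
            "expected_answer": expected,
            "context": (row.get("context") or "").strip(),
            "tags": tags,  # assumed format: semicolon-separated
        })
    return questions


SAMPLE_CSV = (
    "question,expected_answer,context,tags\n"
    'What are your business hours?,"We are open Monday-Friday, 9 AM to 6 PM EST.",,support;hours\n'
)
```

Note that the expected answer in the sample is quoted because it contains a comma; unquoted commas inside a field are a common cause of import failures.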

JSON Import

You can upload a JSON file containing a list of test questions. Each item in the list must specify the question text and the expected answer, with optional fields for context and tags. For the exact schema format and example JSON structure, please refer to the Developer Hub.
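
As a purely illustrative sketch, the Python snippet below shows one shape such a file might take and a local check for the required fields. The field names (question, expected_answer, context, tags) are assumptions, not the documented schema; the Developer Hub remains the authoritative source for the exact format.

```python
import json

# Hypothetical field names; the real schema is defined in the Developer Hub.
REQUIRED_FIELDS = ("question", "expected_answer")


def load_questions_json(text: str) -> list[dict]:
    """Parse a JSON list of questions, raising ValueError on structural problems."""
    items = json.loads(text)
    if not isinstance(items, list):
        raise ValueError("top-level JSON value must be a list of questions")

    questions = []
    for index, item in enumerate(items):
        if not isinstance(item, dict):
            raise ValueError(f"item {index}: expected an object")
        for field in REQUIRED_FIELDS:
            value = item.get(field)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"item {index}: missing or empty '{field}'")
        questions.append({
            "question": item["question"].strip(),
            "expected_answer": item["expected_answer"].strip(),
            "context": item.get("context", ""),   # optional
            "tags": list(item.get("tags", [])),   # optional
        })
    return questions


SAMPLE_JSON = """
[
  {
    "question": "What are your business hours?",
    "expected_answer": "We are open Monday-Friday, 9 AM to 6 PM EST.",
    "tags": ["support", "hours"]
  }
]
"""
```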

Step 4: Configure Evaluators

Select which evaluators to use:

  1. Exact Match: Enable for factual questions
  2. LLM Judge: Enable for quality assessment
  3. Semantic Similarity: Enable for paraphrased responses

Configure each evaluator's settings based on your needs.

Step 5: Save and Run

  1. Review your configuration
  2. Click "Create Suite"
  3. The suite is now created at workspace level, ready to test any peer

Running Evaluations

When you run an evaluation, you select which peer to test with the suite.

Starting an Evaluation Run

  1. Navigate to your evaluation suite
  2. Click "Run Evaluation"
  3. Select the peer you want to test (required)
  4. Optionally select a specific peer version
  5. Click "Start Run"
  6. Monitor progress in real time

Important: The same suite can be run against multiple peers or different versions of the same peer to compare performance.

Real-time Progress

During execution, you'll see:

  • Overall Progress: Percentage complete
  • Questions Processed: X of Y completed
  • Current Status: Running, Completed, or Failed
  • Live Results: Scores update as questions are evaluated

Evaluation Results

After completion, view detailed results:

Summary Metrics

  • Overall Score: Average across all evaluators
  • Pass Rate: Percentage of passing questions
  • Total Questions: Number of test cases
  • Duration: Total execution time
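
The arithmetic behind the first two metrics can be sketched as follows. This is one plausible reading, not the documented formula: it assumes each question's score is the mean of its evaluator scores, the overall score is the mean of those per-question scores, and a question passes when its score meets a single threshold. The function name and the 0.7 threshold are hypothetical.

```python
from statistics import mean


def summarize(results: list[dict[str, float]], pass_threshold: float = 0.7) -> dict:
    """Summarize a run where each item maps evaluator name -> score for one question."""
    if not results:
        raise ValueError("no results to summarize")
    # Assumed rule: a question's score is the mean of its evaluator scores.
    per_question = [mean(scores.values()) for scores in results]
    passed = sum(score >= pass_threshold for score in per_question)
    return {
        "overall_score": mean(per_question),
        "pass_rate": 100.0 * passed / len(per_question),  # percentage
        "total_questions": len(per_question),
    }
```

For two questions scored {exact_match: 1.0, llm_judge: 0.8} and {exact_match: 0.0, llm_judge: 0.6}, the per-question scores are 0.9 and 0.3, giving an overall score of 0.6 and a pass rate of 50% at the assumed threshold.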

Individual Question Results

For each question, see:

  • Question Text: The input provided
  • Peer Response: What the peer actually answered
  • Expected Answer: The correct response
  • Scores: Results from each evaluator
  • Pass/Fail Status: Based on configured thresholds

Evaluator Breakdown

View results by evaluator:

  • Exact Match Results: Binary pass/fail
  • LLM Judge Scores: Detailed reasoning and grades
  • Semantic Similarity: Similarity scores and threshold comparison

AI-Powered Analysis

After running an evaluation, you can request AI-powered improvement suggestions.

Requesting Analysis

  1. Open an evaluation run result
  2. Click "Analyze with AI"
  3. Wait for the analysis to complete (typically 10-30 seconds)

Understanding Suggestions

The AI analysis provides:

  • Problem Identification: What issues were detected
  • Root Cause Analysis: Why problems occurred
  • Specific Recommendations: Actionable improvements
  • Priority Level: Which changes to make first

Applying Improvements

Suggestions may include:

  1. Prompt Modifications: Enhanced instructions
  2. Tool Additions: Recommended tools to enable
  3. Settings Changes: Temperature, model selection, etc.
  4. Data Source Updates: Knowledge base improvements

You can:

  • Preview Changes: See what will be modified
  • Apply Individually: Select specific suggestions
  • Apply All: Implement all recommendations at once
  • Edit Before Applying: Customize suggestions

See AI Analysis Guide for more details.

Best Practices

Designing Test Suites

  1. Start Small: Begin with 10-20 representative questions
  2. Cover Edge Cases: Include unusual or challenging scenarios
  3. Regular Updates: Add new questions as your peer evolves
  4. Version Control: Clone suites before major changes

Question Design

  1. Clear Expected Answers: Be specific about what's correct
  2. Realistic Scenarios: Use actual user questions
  3. Diverse Topics: Cover all areas of your peer's knowledge
  4. Context Matters: Provide context when necessary

Evaluator Selection

| Evaluator | Use When | Avoid When |
| --- | --- | --- |
| Exact Match | Factual data, structured output | Creative responses, paraphrasing |
| LLM Judge | Quality matters more than exact wording | Need deterministic results |
| Semantic Similarity | Meaning is more important than exact words | Require precise terminology |

Continuous Improvement

  1. Baseline First: Run initial evaluation before changes
  2. Incremental Changes: Make one change at a time
  3. Compare Results: Track improvements over time
  4. Automate: Schedule regular evaluation runs
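
Comparing a baseline run with a later run can be sketched in a few lines of Python. The sketch assumes you have exported per-question scores for both runs as dictionaries keyed by a question identifier; the function name, the data shape, and the 0.05 tolerance are assumptions for illustration, not part of the product.

```python
def compare_runs(baseline: dict[str, float], candidate: dict[str, float],
                 tolerance: float = 0.05) -> dict[str, list[str]]:
    """Group questions present in both runs by how their score changed."""
    report: dict[str, list[str]] = {"improved": [], "regressed": [], "unchanged": []}
    for question_id in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[question_id] - baseline[question_id]
        if delta > tolerance:
            report["improved"].append(question_id)
        elif delta < -tolerance:
            report["regressed"].append(question_id)
        else:
            report["unchanged"].append(question_id)  # within the tolerance band
    return report
```

Reviewing the regressed list after each single change makes it easier to attribute a drop in score to that change, which is the point of changing one thing at a time.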

Common Use Cases

Customer Support Quality

Test your support peer's ability to handle common questions:

Suite: Customer Support Quality
Questions: 50+
Evaluators: LLM Judge (primary), Semantic Similarity (secondary)
Focus: Helpfulness, accuracy, tone

Data Extraction Accuracy

Verify structured data extraction:

Suite: Invoice Data Extraction
Questions: 30+
Evaluators: Exact Match (primary)
Focus: Correct field extraction, data formatting

Conversation Quality

Assess natural conversation flow:

Suite: Conversation Quality
Questions: 40+
Evaluators: LLM Judge (primary)
Focus: Context awareness, coherence, engagement

Compliance & Safety

Ensure adherence to guidelines:

Suite: Compliance Check
Questions: 25+
Evaluators: LLM Judge with strict criteria
Focus: Policy compliance, appropriate responses

Troubleshooting

Low Scores

Problem: Evaluation scores are lower than expected

Solutions:

  1. Review failed questions individually
  2. Check if expected answers are realistic
  3. Request AI analysis for suggestions
  4. Adjust evaluator thresholds if too strict

Inconsistent Results

Problem: Same questions get different scores across runs

Solutions:

  1. For LLM Judge: This is expected due to model variance
  2. Increase sample size for more reliable averages
  3. Use Exact Match for deterministic results
  4. Set temperature to 0 for more consistent LLM evaluations
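
To judge whether a score change is real or just run-to-run variance, you can repeat the same run several times and look at the spread. The Python sketch below is a simple heuristic under stated assumptions: the function names are hypothetical, and the rule of treating a change as meaningful only when it exceeds twice the standard deviation is a rough convention, not a product feature.

```python
from statistics import mean, stdev


def score_stability(run_scores: list[float]) -> tuple[float, float]:
    """Return (average, sample standard deviation) of overall scores from repeated runs."""
    if not run_scores:
        raise ValueError("at least one run score is required")
    spread = stdev(run_scores) if len(run_scores) > 1 else 0.0
    return mean(run_scores), spread


def is_meaningful_change(delta: float, spread: float, factor: float = 2.0) -> bool:
    """Heuristic: a change counts only if it exceeds `factor` times the observed spread."""
    return abs(delta) > factor * spread
```

For example, three repeated runs scoring 0.8, 0.9, and 0.7 average 0.8 with a spread of about 0.1, so a later change of 0.05 would fall within normal variance under this heuristic.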

Slow Execution

Problem: Evaluations take too long to complete

Solutions:

  1. Reduce number of questions per run
  2. Use faster models for LLM Judge
  3. Disable Semantic Similarity if not needed
  4. Run evaluations during off-peak hours

Import Failures

Problem: Can't import questions from CSV/JSON

Solutions:

  1. Verify file encoding (UTF-8 recommended)
  2. Check CSV headers match required format
  3. Ensure JSON structure is correct
  4. Remove special characters causing parsing issues

Automation

Most teams create, import, run, and review evaluations from Studio. If your team needs to automate evaluation runs or connect evaluation results to another system, use the Developer Hub for API and SDK details.

Summary

The Evaluation System is a powerful tool for maintaining and improving your AI peer's quality. By regularly running evaluations, analyzing results, and applying AI-powered suggestions, you can ensure your peers consistently deliver excellent results.

Next Steps:

  1. Create your first evaluation suite
  2. Import or create test questions
  3. Run an evaluation
  4. Request AI analysis for improvement suggestions
  5. Monitor improvements over time

Studio · Pulse — Cognipeer product documentation