Evaluation & Testing 
The Evaluation System provides a comprehensive framework for testing and evaluating your AI peers' performance. Evaluation suites are workspace-level resources that can be used to test any peer in your workspace. This feature helps you ensure consistent quality, measure improvements, and maintain high standards for your AI applications.
Important: Evaluation Architecture 
- Evaluation Suites are independent resources at the workspace level
- Evaluation Runs associate a suite with a specific peer for testing
- A single suite can be used to test multiple different peers
- Suites are reusable and can be versioned independently
Overview 
The Evaluation System allows you to:
- Create Test Suites: Organize related test cases into suites
- Import Questions: Bulk import test questions from CSV or JSON files
- Multiple Evaluators: Use different evaluation methods (Exact Match, LLM Judge, Semantic Similarity)
- Real-time Monitoring: Track evaluation progress in real-time
- Detailed Results: Get comprehensive metrics and insights
- AI-Powered Analysis: Receive improvement suggestions based on results
Key Concepts 
Evaluation Suite 
An evaluation suite is a workspace-level resource - a collection of test questions that can be used to evaluate any peer in your workspace. Suites are independent and reusable:
- Workspace-Level: Not tied to a specific peer
- Reusable: Use the same suite to test multiple peers or different versions
- Name and Description: Identify the purpose of the suite
- Questions: Test cases with expected answers
- Evaluator Configuration: Which evaluation methods to use
- Run History: Track all runs across different peers
Evaluation Question 
Each question in a suite contains:
- Question Text: The input to send to the peer
- Expected Answer: The correct or ideal response
- Context (optional): Additional context for the peer
- Metadata (optional): Tags, categories, or custom fields
Evaluators 
Cognipeer AI supports three types of evaluators:
1. Exact Match Evaluator 
Compares the peer's response directly with the expected answer.
Best for:
- Factual questions with specific answers
- Structured data extraction
- Command/query responses
Configuration:
```json
{
  "caseSensitive": false,
  "ignoreWhitespace": true,
  "normalizeNumbers": true
}
```

Scoring:
- 1.0: Exact match
- 0.0: No match
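
As a rough illustration of what these options imply, the sketch below normalizes both strings before comparing them. The exact normalization rules used by the platform are not documented here, so treat the details (whitespace collapsing, thousands-separator removal) as assumptions:

```typescript
// Illustrative sketch only: one plausible interpretation of the Exact Match options.
interface ExactMatchConfig {
  caseSensitive: boolean;
  ignoreWhitespace: boolean;
  normalizeNumbers: boolean;
}

function exactMatchScore(expected: string, actual: string, cfg: ExactMatchConfig): number {
  const normalize = (s: string): string => {
    let out = s;
    if (!cfg.caseSensitive) out = out.toLowerCase();
    if (cfg.ignoreWhitespace) out = out.replace(/\s+/g, " ").trim();
    // Assumption: number normalization strips thousands separators, e.g. "1,000" -> "1000".
    if (cfg.normalizeNumbers) out = out.replace(/(\d),(?=\d{3}\b)/g, "$1");
    return out;
  };
  return normalize(expected) === normalize(actual) ? 1.0 : 0.0;
}

// Example: scores 1.0 with the default (case-insensitive) config.
console.log(exactMatchScore("Paris", "paris", { caseSensitive: false, ignoreWhitespace: true, normalizeNumbers: true }));
```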
2. LLM Judge Evaluator 
Uses an AI model (e.g., GPT-4o or GPT-4o mini) to evaluate the quality and correctness of responses.
Best for:
- Open-ended questions
- Creative responses
- Contextual understanding
- Complex reasoning
Configuration:
```json
{
  "modelId": "chatgpt-4o-mini",
  "criteria": [
    "accuracy",
    "completeness",
    "relevance",
    "clarity"
  ],
  "strictness": "moderate"
}
```

Scoring:
- 0.0 - 1.0: Graded based on multiple criteria
- Provides detailed reasoning for the score
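
Conceptually, the judge model receives the question, the expected answer, the peer's actual response, and your criteria, then returns a score with reasoning. The sketch below shows one way such a judge prompt could be assembled; it is illustrative only and not the prompt Cognipeer actually uses:

```typescript
// Illustrative sketch: assembling a judge prompt from the configured criteria and strictness.
// The actual prompt and response format used by the platform are not documented here.
function buildJudgePrompt(
  question: string,
  expectedAnswer: string,
  peerResponse: string,
  criteria: string[],
  strictness: string
): string {
  return [
    `You are a ${strictness} evaluator. Score the response from 0.0 to 1.0.`,
    `Criteria: ${criteria.join(", ")}.`,
    `Question: ${question}`,
    `Expected answer: ${expectedAnswer}`,
    `Actual response: ${peerResponse}`,
    `Return JSON: {"score": <number>, "reasoning": "<short explanation>"}`,
  ].join("\n");
}
```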
3. Semantic Similarity Evaluator 
Measures semantic similarity between expected and actual answers using embeddings.
Best for:
- Paraphrased answers
- Conceptually similar responses
- Language-agnostic evaluation
Configuration:
```json
{
  "threshold": 0.8,
  "modelId": "text-embedding-3-small"
}
```

Scoring:
- 0.0 - 1.0: Cosine similarity between embeddings
- Pass/Fail: Based on configured threshold
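
In practice this means embedding both answers and comparing the vectors with cosine similarity against the configured threshold. A minimal sketch of that comparison (how the embeddings are produced, e.g. via text-embedding-3-small, is omitted):

```typescript
// Cosine similarity between two embedding vectors, plus the threshold check.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function semanticSimilarityResult(expectedEmbedding: number[], actualEmbedding: number[], threshold = 0.8) {
  const score = cosineSimilarity(expectedEmbedding, actualEmbedding);
  return { score, passed: score >= threshold };
}
```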
Creating an Evaluation Suite 
Evaluation suites are created at the workspace level and can be used to test any peer.
Step 1: Navigate to Evaluations 
- Go to your workspace dashboard
- Click on the "Evaluations" menu item (workspace-level, not peer-specific)
Step 2: Configure Basic Information 
Provide the following details:
- Suite Name: A descriptive name (e.g., "Customer Support Quality Test")
- Description: What this suite tests
Note: You do not select a peer here. Because suites are workspace-level, the peer to evaluate is chosen when you start an evaluation run.
Step 3: Add Questions 
You can add questions in three ways:
Manual Entry 
Add questions one at a time:
```text
Question: What are your business hours?
Expected Answer: We are open Monday-Friday, 9 AM to 6 PM EST.
```

CSV Import
Upload a CSV file with the following structure:
```csv
question,expectedAnswer,context,tags
"What are your business hours?","We are open Monday-Friday, 9 AM to 6 PM EST.","","support,hours"
"How do I reset my password?","Click 'Forgot Password' on the login page...","","support,account"
```

CSV Format Requirements:
- Must include headers: question,expectedAnswer
- Optional columns: context,tags,metadata
- UTF-8 encoding supported
- Automatic encoding detection
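
If you generate the CSV programmatically, unescaped quotes or commas inside a field are the most common cause of import failures. The helper below is a small illustrative sketch of standard CSV quoting; it is not a Cognipeer API:

```typescript
// Standard CSV quoting: wrap a field in double quotes when it contains a
// comma, quote, or newline, and double any embedded quotes.
function csvField(value: string): string {
  return /[",\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

function csvRow(question: string, expectedAnswer: string, context = "", tags: string[] = []): string {
  return [question, expectedAnswer, context, tags.join(",")].map(csvField).join(",");
}

// Produces a row matching the header: question,expectedAnswer,context,tags
console.log(csvRow("What are your business hours?", "We are open Monday-Friday, 9 AM to 6 PM EST.", "", ["support", "hours"]));
```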
JSON Import 
Upload a JSON file:
```json
{
  "questions": [
    {
      "question": "What are your business hours?",
      "expectedAnswer": "We are open Monday-Friday, 9 AM to 6 PM EST.",
      "context": "",
      "tags": ["support", "hours"]
    },
    {
      "question": "How do I reset my password?",
      "expectedAnswer": "Click 'Forgot Password' on the login page and follow the instructions.",
      "context": "",
      "tags": ["support", "account"]
    }
  ]
}
```

Step 4: Configure Evaluators
Select which evaluators to use:
- Exact Match: Enable for factual questions
- LLM Judge: Enable for quality assessment
- Semantic Similarity: Enable for paraphrased responses
Configure each evaluator's settings based on your needs.
Step 5: Save and Run 
- Review your configuration
- Click "Create Suite"
- The suite is now created at workspace level, ready to test any peer
Running Evaluations 
When you run an evaluation, you select which peer to test with the suite.
Starting an Evaluation Run 
- Navigate to your evaluation suite
- Click "Run Evaluation"
- Select the peer you want to test (required)
- Optionally select a specific peer version
- Click "Start Run"
- Monitor progress in real-time
Important: The same suite can be run against multiple peers or different versions of the same peer to compare performance.
Real-time Progress 
During execution, you'll see:
- Overall Progress: Percentage complete
- Questions Processed: X of Y completed
- Current Status: Running, Completed, or Failed
- Live Results: Scores update as questions are evaluated
Evaluation Results 
After completion, view detailed results:
Summary Metrics 
- Overall Score: Average across all evaluators
- Pass Rate: Percentage of passing questions
- Total Questions: Number of test cases
- Duration: Total execution time
Individual Question Results 
For each question, see:
- Question Text: The input provided
- Peer Response: What the peer actually answered
- Expected Answer: The correct response
- Scores: Results from each evaluator
- Pass/Fail Status: Based on configured thresholds
Evaluator Breakdown 
View results by evaluator:
- Exact Match Results: Binary pass/fail
- LLM Judge Scores: Detailed reasoning and grades
- Semantic Similarity: Similarity scores and threshold comparison
AI-Powered Analysis 
After running an evaluation, you can request AI-powered improvement suggestions.
Requesting Analysis 
- Open an evaluation run result
- Click "Analyze with AI"
- Wait for the analysis to complete (typically 10-30 seconds)
Understanding Suggestions 
The AI analysis provides:
- Problem Identification: What issues were detected
- Root Cause Analysis: Why problems occurred
- Specific Recommendations: Actionable improvements
- Priority Level: Which changes to make first
Applying Improvements 
Suggestions may include:
- Prompt Modifications: Enhanced instructions
- Tool Additions: Recommended tools to enable
- Settings Changes: Temperature, model selection, etc.
- Data Source Updates: Knowledge base improvements
You can:
- Preview Changes: See what will be modified
- Apply Individually: Select specific suggestions
- Apply All: Implement all recommendations at once
- Edit Before Applying: Customize suggestions
See AI Analysis Guide for more details.
Best Practices 
Designing Test Suites 
- Start Small: Begin with 10-20 representative questions
- Cover Edge Cases: Include unusual or challenging scenarios
- Regular Updates: Add new questions as your peer evolves
- Version Control: Clone suites before major changes
Question Design 
- Clear Expected Answers: Be specific about what's correct
- Realistic Scenarios: Use actual user questions
- Diverse Topics: Cover all areas of your peer's knowledge
- Context Matters: Provide context when necessary
Evaluator Selection 
| Evaluator | Use When | Avoid When | 
|---|---|---|
| Exact Match | Factual data, structured output | Creative responses, paraphrasing | 
| LLM Judge | Quality matters more than exact wording | Need deterministic results | 
| Semantic Similarity | Meaning is more important than exact words | Require precise terminology | 
Continuous Improvement 
- Baseline First: Run initial evaluation before changes
- Incremental Changes: Make one change at a time
- Compare Results: Track improvements over time
- Automate: Schedule regular evaluation runs
Common Use Cases 
Customer Support Quality 
Test your support peer's ability to handle common questions:
```text
Suite: Customer Support Quality
Questions: 50+
Evaluators: LLM Judge (primary), Semantic Similarity (secondary)
Focus: Helpfulness, accuracy, tone
```

Data Extraction Accuracy
Verify structured data extraction:
```text
Suite: Invoice Data Extraction
Questions: 30+
Evaluators: Exact Match (primary)
Focus: Correct field extraction, data formatting
```

Conversation Quality
Assess natural conversation flow:
```text
Suite: Conversation Quality
Questions: 40+
Evaluators: LLM Judge (primary)
Focus: Context awareness, coherence, engagement
```

Compliance & Safety
Ensure adherence to guidelines:
```text
Suite: Compliance Check
Questions: 25+
Evaluators: LLM Judge with strict criteria
Focus: Policy compliance, appropriate responses
```

Troubleshooting
Low Scores 
Problem: Evaluation scores are lower than expected
Solutions:
- Review failed questions individually
- Check if expected answers are realistic
- Request AI analysis for suggestions
- Adjust evaluator thresholds if too strict
Inconsistent Results 
Problem: Same questions get different scores across runs
Solutions:
- For LLM Judge: This is expected due to model variance
- Increase sample size for more reliable averages
- Use Exact Match for deterministic results
- Set temperature to 0 for more consistent LLM evaluations
Slow Execution 
Problem: Evaluations take too long to complete
Solutions:
- Reduce number of questions per run
- Use faster models for LLM Judge
- Disable Semantic Similarity if not needed
- Run evaluations during off-peak hours
Import Failures 
Problem: Can't import questions from CSV/JSON
Solutions:
- Verify file encoding (UTF-8 recommended)
- Check CSV headers match required format
- Ensure JSON structure is correct
- Remove special characters causing parsing issues
API Integration 
Automate evaluations using the API:
Create Evaluation Suite 
```http
POST /api/v1/evaluation
{
  "name": "Customer Support Quality",
  "description": "Tests support peer responses",
  "peerId": "peer_123",
  "evaluators": {
    "llmJudge": {
      "enabled": true,
      "config": {
        "modelId": "chatgpt-4o-mini",
        "criteria": ["accuracy", "helpfulness"]
      }
    }
  }
}
```

Import Questions
```http
POST /api/v1/evaluation/:suiteId/questions/import
{
  "format": "csv",
  "data": "base64_encoded_csv_content"
}
```

Run Evaluation
```http
POST /api/v1/evaluation/:suiteId/run
{
  "description": "Weekly quality check"
}
```

Get Results
```http
GET /api/v1/evaluation/:suiteId/runs/:runId
```

See the API Reference for complete documentation.
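
To tie these endpoints together, here is a hedged end-to-end sketch in TypeScript (Node 18+). The request paths and bodies mirror the examples above; the base URL, the bearer-token Authorization header, and the response shapes (a suite id, a run id, and a status field) are assumptions and may differ from the actual API.

```typescript
import { readFileSync } from "node:fs";

const BASE_URL = "https://api.cognipeer.com"; // assumption: replace with your actual API host
const headers = {
  "Authorization": `Bearer ${process.env.COGNIPEER_API_KEY}`, // assumption: bearer-token auth
  "Content-Type": "application/json",
};

async function runWeeklyQualityCheck(): Promise<void> {
  // 1. Create the suite, using the same body as the example above.
  const suiteRes = await fetch(`${BASE_URL}/api/v1/evaluation`, {
    method: "POST",
    headers,
    body: JSON.stringify({
      name: "Customer Support Quality",
      description: "Tests support peer responses",
      peerId: "peer_123", // as in the example above
      evaluators: {
        llmJudge: { enabled: true, config: { modelId: "chatgpt-4o-mini", criteria: ["accuracy", "helpfulness"] } },
      },
    }),
  });
  const suite = await suiteRes.json(); // assumption: response includes the suite id

  // 2. Import questions from a local CSV, base64-encoded as the import endpoint expects.
  const csv = readFileSync("questions.csv");
  await fetch(`${BASE_URL}/api/v1/evaluation/${suite.id}/questions/import`, {
    method: "POST",
    headers,
    body: JSON.stringify({ format: "csv", data: csv.toString("base64") }),
  });

  // 3. Start a run.
  const runRes = await fetch(`${BASE_URL}/api/v1/evaluation/${suite.id}/run`, {
    method: "POST",
    headers,
    body: JSON.stringify({ description: "Weekly quality check" }),
  });
  const run = await runRes.json(); // assumption: response includes the run id

  // 4. Poll the run until it is no longer running (statuses per this page:
  // Running, Completed, or Failed; the exact casing in the API is an assumption).
  let result: any;
  do {
    await new Promise((r) => setTimeout(r, 5000));
    const res = await fetch(`${BASE_URL}/api/v1/evaluation/${suite.id}/runs/${run.id}`, { headers });
    result = await res.json();
  } while (String(result.status).toLowerCase() === "running");

  console.log("Run finished:", result);
}

runWeeklyQualityCheck().catch(console.error);
```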
Related Documentation 
- AI Analysis Guide - AI-powered improvement suggestions
- Peer Settings - Configure your peer for optimal performance
- Topics - Organize knowledge for better responses
- API Reference - Evaluation API documentation
Summary 
The Evaluation System is a powerful tool for maintaining and improving your AI peer's quality. By regularly running evaluations, analyzing results, and applying AI-powered suggestions, you can ensure your peers consistently deliver excellent results.
Next Steps:
- Create your first evaluation suite
- Import or create test questions
- Run an evaluation
- Request AI analysis for improvement suggestions
- Monitor improvements over time

