
Optimizing Peers with AI-Powered Analysis

You've built your AI peer, run evaluations, and identified areas for improvement. But what comes next? Manually analyzing results and figuring out the right changes can be time-consuming and requires deep expertise.

Enter AI-Powered Analysis: a feature that uses GPT-4o to automatically analyze your peer's evaluation results and generate specific, actionable improvement suggestions. It's like having an expert consultant review your peer's performance and recommend exactly what to fix.

The Challenge of Manual Optimization

Improving an AI peer traditionally requires:

  1. Analyzing Failed Questions: Understanding why specific questions failed
  2. Identifying Patterns: Finding common themes across failures
  3. Root Cause Analysis: Determining underlying configuration issues
  4. Generating Solutions: Deciding what changes to make
  5. Implementation: Actually making the changes
  6. Validation: Testing to ensure improvements worked

This process can take hours or even days, especially for complex peers with large evaluation suites.

How AI Analysis Works

AI-Powered Analysis automates this entire workflow:

Evaluation Results → AI Analysis → Actionable Suggestions → One-Click Application

Here's what happens behind the scenes:

1. Data Collection

The system gathers:

  • Failed and low-scoring questions
  • Peer's current configuration (prompt, tools, settings)
  • Evaluation metrics and patterns
  • Peer's purpose and context

2. Pattern Recognition

GPT-4o identifies:

  • Common failure types
  • Missing capabilities
  • Configuration issues
  • Knowledge gaps

3. Recommendation Generation

The AI creates:

  • Specific, concrete suggestions
  • Expected impact estimates
  • Priority rankings
  • Implementation details

4. Application

You can:

  • Preview each suggestion
  • Apply individually or in bulk
  • Edit suggestions before applying
  • Revert changes if needed
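
To make this concrete, here's a sketch of what a returned suggestion might look like and how you could preview and apply it programmatically. The field names and the `applySuggestion` call are illustrative assumptions consistent with the examples later in this guide, not a documented schema:

```javascript
// Hypothetical shape of a single suggestion returned by the analysis
// (field names are illustrative, not a documented schema)
const suggestion = {
  id: 'sugg_01',
  title: 'Add Product Documentation Datasource',
  priority: 'high',            // high | medium | low
  expectedImpact: 0.18,        // estimated score gain on affected questions
  changes: { datasources: ['product-knowledge-base'] }
};

// Preview the proposed changes before committing them
console.log(suggestion.changes);

// Apply once you're satisfied (assumed API, mirroring later examples)
await peer.applySuggestion(suggestion.id);
```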

Real-World Example

Let's see AI Analysis in action with a real customer support peer.

The Scenario

Support Bot 3000 helps customers with product questions, but evaluation shows poor performance:

Overall Score: 68%
Pass Rate: 64%
Failed Questions: 18/50

Sample Failures:
❌ "What's the warranty period?" - Generic answer, no specifics
❌ "How do I activate my product?" - Missing step-by-step instructions
❌ "Is product X compatible with Y?" - Incorrect information

Running AI Analysis

After evaluation completes, click "Analyze with AI". Within 30 seconds, you receive:

Analysis Summary

```markdown
## Overall Assessment

Your peer is struggling primarily with product-specific questions due to:

1. **Missing Product Knowledge**: No access to product documentation
2. **Vague System Prompt**: Lacks specific instructions for support scenarios
3. **High Temperature (0.8)**: Causing creative but inaccurate responses
4. **Missing Tools**: Cannot access real-time product data

Expected Improvement: +22-28% overall score
```

Suggested Improvements

Suggestion 1: Add Product Documentation Datasource (Priority: HIGH)

Current Issue:

The peer cannot answer product-specific questions about warranties, 
specifications, compatibility, or features because it has no access 
to product documentation.

Recommendation:

Enable the "Product Knowledge Base" datasource that contains:
- Product specifications
- Warranty information
- Compatibility charts
- User manuals
- Troubleshooting guides

This will allow the peer to retrieve accurate, up-to-date information 
when answering product questions.

Expected Impact: +18% on product-related questions

Implementation:

  • One-click to enable datasource
  • No prompt changes needed
  • Automatic integration with peer
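
If you prefer to script this rather than click through the UI, enabling a datasource might look like the following one-liner. The `enableDatasource` method and the datasource identifier are assumptions for illustration:

```javascript
// Assumed helper for attaching a datasource to a peer;
// the method name and datasource ID are hypothetical
await peer.enableDatasource('product-knowledge-base');
```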

Suggestion 2: Enhance System Prompt (Priority: HIGH)

Current Issue:

The peer's system prompt lacks specific guidance for handling support 
scenarios, leading to vague and unhelpful responses.

Recommendation:

Add these instructions to the system prompt:

"When answering customer support questions:

1. **Product Questions**: Always check the product knowledge base first
2. **Step-by-Step**: Provide clear, numbered instructions for how-to questions
3. **Warranty/Policy**: Reference exact policy terms, don't paraphrase
4. **Unknown Information**: Admit when you don't know and offer to escalate
5. **Empathy First**: Acknowledge the customer's issue before solving it

Example Response Format:
'I understand you're having trouble with [issue]. Here's how to resolve it:
1. [First step]
2. [Second step]
...
Let me know if you need any clarification!'"

Expected Impact: +12% on support interaction quality

Preview Changes:

```diff
  You are a helpful customer support assistant.
+ When answering customer support questions:
+ 1. Product Questions: Always check product knowledge base first
+ 2. Step-by-Step: Provide clear, numbered instructions
+ ...
```

Suggestion 3: Lower Temperature Setting (Priority: MEDIUM)

Current Issue:

Temperature of 0.8 encourages creative responses, but support 
scenarios require accuracy and consistency over creativity.

Recommendation:

Reduce temperature from 0.8 to 0.3

This will:
- Reduce hallucinations and made-up information
- Increase consistency across similar questions
- Make responses more factual and deterministic
- Improve accuracy on policy/procedure questions

Expected Impact: +6% on factual accuracy

Implementation:

  • One-click apply
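
Applied via the API, this is a single setting change. The `update` method and `temperature` field are assumed names:

```javascript
// Assumed configuration update; method and field names are illustrative
await peer.update({ temperature: 0.3 }); // was 0.8
```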

Suggestion 4: Enable Search Tool (Priority: MEDIUM)

Current Issue:

For questions about current events, pricing changes, or new 
product releases, the peer cannot access real-time information.

Recommendation:

Enable web search tool for queries about:
- Current promotions and pricing
- Recent product announcements
- Competitor comparisons
- Industry news affecting products

Configure with these restrictions:
- Only use for explicitly time-sensitive queries
- Always cite sources
- Verify information before presenting to user

Expected Impact: +4% on time-sensitive questions
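
Expressed as configuration, those restrictions might look like this sketch. The tool name and option keys are assumptions, not a documented schema:

```javascript
// Hypothetical web-search tool configuration with the suggested guardrails
await peer.enableTool('web-search', {
  useFor: 'time-sensitive-queries-only', // skip searches for static product facts
  citeSources: true,                     // always attach source references
  verifyBeforeAnswering: true            // cross-check results before responding
});
```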

Applying Suggestions

You have several options:

Option 1: Apply All

Click "Apply All Suggestions" to implement all recommendations at once:

✓ Added Product Knowledge Base datasource
✓ Updated system prompt with support guidelines
✓ Changed temperature from 0.8 to 0.3
✓ Enabled web search tool with restrictions

Changes saved. Ready to re-evaluate.
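
The same bulk action is available programmatically; `applyAllSuggestions` here is an assumed counterpart to the per-suggestion call shown in Option 2:

```javascript
// Assumed bulk-apply helper mirroring the "Apply All Suggestions" button
await peer.applyAllSuggestions(analysis.id);
```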

Option 2: Selective Application

Review and apply suggestions individually:

```javascript
// Preview changes
const preview = suggestions[0].changes;

// Apply if satisfied
await peer.applySuggestion(suggestions[0].id);
```

Option 3: Edit Before Applying

Customize suggestions to fit your needs:

Original Suggestion:
"Lower temperature to 0.3"

Your Edit:
"Lower temperature to 0.4"
(Reasoning: We still want some creativity for greeting messages)
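
In code, editing before applying could amount to overriding a field in the suggestion payload. The `overrides` parameter is an assumption for illustration:

```javascript
// Apply a suggestion but override one of its proposed values
// (the second `overrides` argument is hypothetical)
await peer.applySuggestion(suggestion.id, {
  overrides: { temperature: 0.4 } // keep some creativity for greeting messages
});
```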

Results After Implementation

Re-run the evaluation:

Overall Score: 91% (+23%)
Pass Rate: 94% (+30%)
Failed Questions: 3/50 (-15)

Improvements:
✓ Product questions now accurate and detailed
✓ Step-by-step instructions clear and complete
✓ Consistent professional tone
✓ No more hallucinations

ROI: 30 minutes of work for 23% performance improvement!

Advanced AI Analysis Features

Custom Analysis Focus

Optimize for specific goals:

```http
POST /api/v1/evaluation/:runId/suggest-improvements
{
  "focus": "accuracy",  // or "speed", "cost", "tone"
  "constraints": {
    "maxTemperature": 0.5,
    "preferredTools": ["datasource"],
    "budgetLimit": 1000
  }
}
```

Focus Options:

  • Accuracy: Prioritize correctness over speed
  • Speed: Optimize response time
  • Cost: Minimize credit/token usage
  • Tone: Improve communication style

Iterative Optimization

Run multiple analysis rounds:

Round 1: Basic improvements → 68% to 85%
Round 2: Fine-tuning → 85% to 91%
Round 3: Edge case handling → 91% to 94%
Round 4: Optimization → 94% to 96%

Each round focuses on progressively smaller issues.
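
A simple loop captures this iterative process. The sketch below reuses the assumed `evaluation` and `peer` helpers from the other examples in this guide and stops once a target score is reached or improvement plateaus:

```javascript
// Iterative optimization loop: evaluate, analyze, apply, repeat.
// evaluation.execute / evaluation.analyze / peer.applySuggestions are
// assumed APIs, consistent with the other sketches in this guide.
let run = await evaluation.execute(suiteId);

while (run.averageScore < 0.95) {
  const analysis = await evaluation.analyze(run.id);
  if (analysis.suggestions.length === 0) break; // nothing left to improve

  await peer.applySuggestions(peerId, analysis.suggestions);
  const next = await evaluation.execute(suiteId);

  if (next.averageScore <= run.averageScore) break; // plateaued
  run = next;
}
```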

A/B Testing with AI Suggestions

Compare suggestion effectiveness:

```javascript
// Create two peer variants
const peerA = await peer.clone();
const peerB = await peer.clone();

// Apply different suggestion combinations
await peerA.applySuggestions([1, 2]);    // Datasource + prompt
await peerB.applySuggestions([1, 3, 4]); // Datasource + temperature + search

// Evaluate both
const scoreA = await evaluate(peerA);
const scoreB = await evaluate(peerB);

// Deploy winner
await deploy(scoreB.score > scoreA.score ? peerB : peerA);
```

Historical Tracking

View all past analyses:

Analysis History:

Oct 15: +18% (added datasources)
Oct 12: +8% (improved prompt)
Oct 8: +5% (adjusted temperature)
Oct 1: +12% (enabled tools)

Total Improvement: +43% over two weeks

Best Practices

1. Start with Comprehensive Evaluation

AI Analysis quality depends on evaluation quality:

❌ Bad: 10 questions, narrow scope
✅ Good: 50+ questions, diverse scenarios

2. Run Analysis After Each Major Change

Track incremental improvements:

Change → Evaluate → Analyze → Apply → Repeat

3. Don't Apply Everything Blindly

Review suggestions for your specific context:

  • Some suggestions may not fit your use case
  • Business requirements might override AI recommendations
  • Test critical changes in staging first

4. Combine with Human Expertise

AI analysis + human judgment = best results:

AI Suggestion: "Add web search for pricing questions"
Your Decision: "Good idea, but restrict to official sources only"

5. Monitor After Application

Ensure suggestions actually improved performance:

```javascript
const beforeScore = 0.68;
await applySuggestions();
const afterScore = await evaluate();

if (afterScore < beforeScore) {
  console.log('Suggestions made things worse!');
  await revert();
}
```

6. Document Your Changes

Keep track of what works:

```markdown
## Optimization Log

### 2025-10-15: Added Product Datasource
- Suggestion: AI Analysis #12
- Expected: +18%
- Actual: +21%
- Notes: Exceeded expectations, huge impact

### 2025-10-12: Updated System Prompt
- Suggestion: AI Analysis #11
- Expected: +12%
- Actual: +9%
- Notes: Good but slightly lower than predicted
```

Common Patterns in AI Suggestions

Through thousands of analyses, we've identified common patterns:

Pattern 1: Missing Knowledge

Symptom: Low scores on factual questions

AI Suggests:

  • Add relevant datasources
  • Enable web search
  • Expand knowledge base

Success Rate: 85%

Pattern 2: Poor Instruction Following

Symptom: Correct information, wrong format

AI Suggests:

  • Add explicit formatting instructions
  • Include examples in prompt
  • Adjust temperature

Success Rate: 78%

Pattern 3: Inconsistent Responses

Symptom: Same question, different answers

AI Suggests:

  • Lower temperature
  • Add deterministic constraints
  • Use more specific prompts

Success Rate: 92%

Pattern 4: Scope Creep

Symptom: Answering outside intended domain

AI Suggests:

  • Add scope restrictions to prompt
  • Enable guardrails
  • Define clear boundaries

Success Rate: 88%

Limitations and Considerations

What AI Analysis Can't Do

  • Fix fundamental design flaws: Wrong model, wrong approach
  • Create missing data: Can suggest datasources, can't create content
  • Override business logic: Can't change your policies
  • Guarantee specific scores: Estimates are educated guesses

When to Ignore Suggestions

  • Conflicts with business requirements: Your rules take priority
  • Suggests proprietary tools you don't have: Skip or find alternatives
  • Recommends massive prompt changes: Iterate gradually instead
  • Pushes changes you've already tried: Trust your experience

Privacy and Security

AI Analysis:

  • ✅ Never shares your data externally
  • ✅ Uses only evaluation results and config
  • ✅ Doesn't store sensitive customer data
  • ✅ Complies with data retention policies

Measuring ROI

Track the value of AI Analysis:

Time Savings

Manual Analysis: 2-3 hours per iteration
AI Analysis: 30 seconds + 15 minutes review
Time Saved: ~90%

Performance Improvement

Average Improvement: +15-25% per analysis round
Time to 90%+ Score:
- Manual: 2-3 weeks
- With AI: 3-5 days

Cost Efficiency

AI Analysis Cost: ~10 credits
Value of 20% Performance Improvement: Priceless
ROI: Immediate

Integration Examples

Automated Optimization Pipeline

```javascript
// Nightly optimization job
cron.schedule('0 2 * * *', async () => {
  const peers = await peer.listAll();

  for (const p of peers) {
    // Run evaluation
    const run = await evaluation.execute(p.defaultSuite);

    // If score below threshold, get suggestions
    if (run.averageScore < 0.85) {
      const analysis = await evaluation.analyze(run.id);

      // Auto-apply low-risk suggestions
      const safeChanges = analysis.suggestions
        .filter(s => s.risk === 'low' && s.priority === 'high');

      await peer.applySuggestions(p.id, safeChanges);

      // Notify team
      await slack.notify({
        message: `${p.name} auto-optimized: ${safeChanges.length} changes applied`
      });
    }
  }
});
```

Continuous Improvement Dashboard

```javascript
// Track optimization trends
const dashboard = {
  totalAnalyses: 47,
  totalSuggestionsApplied: 156,
  averageImprovement: 0.18,
  topSuggestion: 'Add datasources (45% of cases)',
  lastOptimization: '2 hours ago',
  currentScore: 0.94
};
```

What's Next?

AI-Powered Analysis is just one tool in your optimization toolkit. Combine it with:

  • Regular Evaluation: Keep testing continuously
  • User Feedback: Listen to real users
  • A/B Testing: Validate improvements scientifically
  • Manual Review: Apply human judgment

Coming soon:

  • Multi-variant optimization: Test multiple suggestions simultaneously
  • Automated A/B testing: Deploy and compare automatically
  • Predictive analysis: Suggest improvements before problems occur
  • Custom optimization goals: Define your own success metrics

Try It Yourself

Ready to optimize your peers with AI?

  1. Run an evaluation on your peer
  2. Click "Analyze with AI" on the results page
  3. Review the suggestions
  4. Apply what makes sense for your use case
  5. Re-evaluate to measure improvement
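
If you'd rather drive the whole flow from code, the five steps map onto the assumed APIs used throughout this guide:

```javascript
// End-to-end sketch of the optimization flow (assumed APIs)
const run = await evaluation.execute(suiteId);        // 1. evaluate
const analysis = await evaluation.analyze(run.id);    // 2. analyze with AI
console.log(analysis.suggestions);                    // 3. review
await peer.applySuggestions(peerId, analysis.suggestions
  .filter(s => s.priority === 'high'));               // 4. apply what fits
const rerun = await evaluation.execute(suiteId);      // 5. re-evaluate
console.log(`Score: ${run.averageScore} -> ${rerun.averageScore}`);
```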

Most users see significant improvements within their first analysis session!

Conclusion

AI-Powered Analysis transforms peer optimization from an art into a science. Instead of guessing what might improve performance, you get:

✅ Data-driven recommendations
✅ Concrete implementation steps
✅ Expected impact estimates
✅ One-click application

The result? Better peers, faster optimization, and more time to focus on what matters: building great AI experiences.


Questions? Join our community forum or reach out to our team.

Read more: Testing Your AI Peers with the Evaluation System
