Optimizing Peers with AI-Powered Analysis
You've built your AI peer, run evaluations, and identified areas for improvement. But what comes next? Manually analyzing results and figuring out the right changes can be time-consuming and requires deep expertise.
Enter AI-Powered Analysis - a feature that uses GPT-4o to automatically analyze your peer's evaluation results and generate specific, actionable improvement suggestions. It's like having an AI expert consultant reviewing your peer's performance and recommending exactly what to fix.
The Challenge of Manual Optimization
Improving an AI peer traditionally requires:
- Analyzing Failed Questions: Understanding why specific questions failed
- Identifying Patterns: Finding common themes across failures
- Root Cause Analysis: Determining underlying configuration issues
- Generating Solutions: Deciding what changes to make
- Implementation: Actually making the changes
- Validation: Testing to ensure improvements worked
This process can take hours or even days, especially for complex peers with large evaluation suites.
How AI Analysis Works
AI-Powered Analysis automates this entire workflow:
Evaluation Results → AI Analysis → Actionable Suggestions → One-Click Application
Here's what happens behind the scenes:
1. Data Collection
The system gathers:
- Failed and low-scoring questions
- Peer's current configuration (prompt, tools, settings)
- Evaluation metrics and patterns
- Peer's purpose and context
2. Pattern Recognition
GPT-4o identifies:
- Common failure types
- Missing capabilities
- Configuration issues
- Knowledge gaps
3. Recommendation Generation
The AI creates:
- Specific, concrete suggestions
- Expected impact estimates
- Priority rankings
- Implementation details
4. Application
You can:
- Preview each suggestion
- Apply individually or in bulk
- Edit suggestions before applying
- Revert changes if needed
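The suggestions produced in steps 3–4 can also be triaged in code before anything is applied. A minimal sketch, assuming each suggestion carries `id`, `priority`, and `expectedImpact` fields (the exact shape depends on your API response):

```javascript
// Order suggestions so high-priority, high-impact items come first.
// The field names here are illustrative assumptions, not a documented schema.
const PRIORITY_ORDER = { HIGH: 0, MEDIUM: 1, LOW: 2 };

function triageSuggestions(suggestions) {
  return [...suggestions].sort((a, b) => {
    const byPriority = PRIORITY_ORDER[a.priority] - PRIORITY_ORDER[b.priority];
    // Within the same priority, prefer the larger expected impact
    return byPriority !== 0 ? byPriority : b.expectedImpact - a.expectedImpact;
  });
}

const ordered = triageSuggestions([
  { id: 3, priority: 'MEDIUM', expectedImpact: 0.06 },
  { id: 1, priority: 'HIGH', expectedImpact: 0.18 },
  { id: 2, priority: 'HIGH', expectedImpact: 0.12 },
]);
// ordered.map(s => s.id) → [1, 2, 3]
```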
Real-World Example
Let's see AI Analysis in action with a real customer support peer.
The Scenario
Support Bot 3000 helps customers with product questions, but evaluation shows poor performance:
Overall Score: 68%
Pass Rate: 64%
Failed Questions: 18/50
Sample Failures:
❌ "What's the warranty period?" - Generic answer, no specifics
❌ "How do I activate my product?" - Missing step-by-step instructions
❌ "Is product X compatible with Y?" - Incorrect information
Running AI Analysis
After evaluation completes, click "Analyze with AI". Within 30 seconds, you receive:
Analysis Summary
## Overall Assessment
Your peer is struggling primarily with product-specific questions due to:
1. **Missing Product Knowledge**: No access to product documentation
2. **Vague System Prompt**: Lacks specific instructions for support scenarios
3. **High Temperature (0.8)**: Causing creative but inaccurate responses
4. **Missing Tools**: Cannot access real-time product data
Expected Improvement: +22-28% overall score
Suggested Improvements
Suggestion 1: Add Product Documentation Datasource (Priority: HIGH)
Current Issue:
The peer cannot answer product-specific questions about warranties,
specifications, compatibility, or features because it has no access
to product documentation.
Recommendation:
Enable the "Product Knowledge Base" datasource that contains:
- Product specifications
- Warranty information
- Compatibility charts
- User manuals
- Troubleshooting guides
This will allow the peer to retrieve accurate, up-to-date information
when answering product questions.
Expected Impact: +18% on product-related questions
Implementation:
- One-click to enable datasource
- No prompt changes needed
- Automatic integration with peer
Suggestion 2: Enhance System Prompt (Priority: HIGH)
Current Issue:
The peer's system prompt lacks specific guidance for handling support
scenarios, leading to vague and unhelpful responses.
Recommendation:
Add these instructions to the system prompt:
"When answering customer support questions:
1. **Product Questions**: Always check the product knowledge base first
2. **Step-by-Step**: Provide clear, numbered instructions for how-to questions
3. **Warranty/Policy**: Reference exact policy terms, don't paraphrase
4. **Unknown Information**: Admit when you don't know and offer to escalate
5. **Empathy First**: Acknowledge the customer's issue before solving it
Example Response Format:
'I understand you're having trouble with [issue]. Here's how to resolve it:
1. [First step]
2. [Second step]
...
Let me know if you need any clarification!'"
Expected Impact: +12% on support interaction quality
Preview Changes:
You are a helpful customer support assistant.
+ When answering customer support questions:
+ 1. Product Questions: Always check product knowledge base first
+ 2. Step-by-Step: Provide clear, numbered instructions
+ ...
Suggestion 3: Lower Temperature Setting (Priority: MEDIUM)
Current Issue:
Temperature of 0.8 encourages creative responses, but support
scenarios require accuracy and consistency over creativity.
Recommendation:
Reduce temperature from 0.8 to 0.3
This will:
- Reduce hallucinations and made-up information
- Increase consistency across similar questions
- Make responses more factual and deterministic
- Improve accuracy on policy/procedure questions
Expected Impact: +6% on factual accuracy
One-click apply
Suggestion 4: Enable Search Tool (Priority: MEDIUM)
Current Issue:
For questions about current events, pricing changes, or new
product releases, the peer cannot access real-time information.
Recommendation:
Enable web search tool for queries about:
- Current promotions and pricing
- Recent product announcements
- Competitor comparisons
- Industry news affecting products
Configure with these restrictions:
- Only use for explicitly time-sensitive queries
- Always cite sources
- Verify information before presenting to user
Expected Impact: +4% on time-sensitive questions
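Those restrictions boil down to a routing decision: only invoke the search tool when a query is explicitly time-sensitive. A rough keyword-based sketch (illustrative only; in practice this belongs in the platform's tool-routing or guardrail configuration):

```javascript
// Keywords that mark a query as time-sensitive; illustrative, not exhaustive
const TIME_SENSITIVE = ['pricing', 'price', 'promotion', 'release', 'announcement', 'news'];

function shouldUseSearch(query) {
  const q = query.toLowerCase();
  return TIME_SENSITIVE.some((keyword) => q.includes(keyword));
}

shouldUseSearch('What is the current pricing for Pro?'); // true
shouldUseSearch('How do I reset my password?');          // false
```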
Applying Suggestions
You have several options:
Option 1: Apply All
Click "Apply All Suggestions" to implement all recommendations at once:
✓ Added Product Knowledge Base datasource
✓ Updated system prompt with support guidelines
✓ Changed temperature from 0.8 to 0.3
✓ Enabled web search tool with restrictions
Changes saved. Ready to re-evaluate.
Option 2: Selective Application
Review and apply suggestions individually:
```javascript
// Preview the changes for a suggestion
const preview = suggestions[0].changes;

// Apply it if you're satisfied
await peer.applySuggestion(suggestions[0].id);
```

Option 3: Edit Before Applying
Customize suggestions to fit your needs:
Original Suggestion:
"Lower temperature to 0.3"
Your Edit:
"Lower temperature to 0.4"
(Reasoning: We still want some creativity for greeting messages)
Results After Implementation
Re-run the evaluation:
Overall Score: 91% (+23%)
Pass Rate: 94% (+30%)
Failed Questions: 3/50 (-15)
Improvements:
✓ Product questions now accurate and detailed
✓ Step-by-step instructions clear and complete
✓ Consistent professional tone
✓ No more hallucinations
ROI: 30 minutes of work for 23% performance improvement!
Advanced AI Analysis Features
Custom Analysis Focus
Optimize for specific goals:
```
POST /api/v1/evaluation/:runId/suggest-improvements
{
  "focus": "accuracy",  // or "speed", "cost", "tone"
  "constraints": {
    "maxTemperature": 0.5,
    "preferredTools": ["datasource"],
    "budgetLimit": 1000
  }
}
```

Focus Options:
- Accuracy: Prioritize correctness over speed
- Speed: Optimize response time
- Cost: Minimize credit/token usage
- Tone: Improve communication style
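From code, the same endpoint can be driven with a focus and constraints. This sketch only assembles the request; `buildAnalysisRequest` is a hypothetical helper, and the path simply mirrors the example above:

```javascript
// Assemble a custom-focus analysis request (hypothetical helper)
function buildAnalysisRequest(runId, focus, constraints = {}) {
  return {
    method: 'POST',
    url: `/api/v1/evaluation/${runId}/suggest-improvements`,
    body: { focus, constraints },
  };
}

const req = buildAnalysisRequest('run_123', 'cost', { budgetLimit: 500 });
// req.url → '/api/v1/evaluation/run_123/suggest-improvements'
```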
Iterative Optimization
Run multiple analysis rounds:
Round 1: Basic improvements → 68% to 85%
Round 2: Fine-tuning → 85% to 91%
Round 3: Edge case handling → 91% to 94%
Round 4: Optimization → 94% to 96%
Each round focuses on progressively smaller issues.
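The round-by-round loop above amounts to a stopping rule: keep applying suggestions until the score reaches a target or a round stops helping. `evaluate`, `analyze`, and `apply` are stand-ins for your platform's calls; they are stubbed here so the control flow runs on its own:

```javascript
// Iterate: evaluate → analyze → apply, until target reached or no improvement
async function optimize({ evaluate, analyze, apply }, target = 0.95, maxRounds = 4) {
  let score = await evaluate();
  for (let round = 1; round <= maxRounds && score < target; round++) {
    await apply(await analyze());
    const next = await evaluate();
    if (next <= score) break; // a round that doesn't help ends the loop
    score = next;
  }
  return score;
}

// Stubbed scores mirroring the rounds above: 68% → 85% → 91% → 94% → 96%
const scores = [0.68, 0.85, 0.91, 0.94, 0.96];
let call = 0;
const stubs = {
  evaluate: async () => scores[Math.min(call++, scores.length - 1)],
  analyze: async () => [],
  apply: async () => {},
};
optimize(stubs).then((final) => console.log(final)); // logs 0.96
```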
A/B Testing with AI Suggestions
Compare suggestion effectiveness:
```javascript
// Create two peer variants
const peerA = await peer.clone();
const peerB = await peer.clone();

// Apply different suggestion combinations
await peerA.applySuggestions([1, 2]);    // Datasource + prompt
await peerB.applySuggestions([1, 3, 4]); // Datasource + temperature + tools

// Evaluate both
const scoreA = await evaluate(peerA);
const scoreB = await evaluate(peerB);

// Deploy the winner
await deploy(scoreB.score > scoreA.score ? peerB : peerA);
```

Historical Tracking
View all past analyses:
Analysis History:
Oct 15: +18% (added datasources)
Oct 12: +8% (improved prompt)
Oct 8: +5% (adjusted temperature)
Oct 1: +12% (enabled tools)
Total Improvement: +43% over 20 days
Best Practices
1. Start with Comprehensive Evaluation
AI Analysis quality depends on evaluation quality:
❌ Bad: 10 questions, narrow scope
✅ Good: 50+ questions, diverse scenarios
2. Run Analysis After Each Major Change
Track incremental improvements:
Change → Evaluate → Analyze → Apply → Repeat
3. Don't Apply Everything Blindly
Review suggestions for your specific context:
- Some suggestions may not fit your use case
- Business requirements might override AI recommendations
- Test critical changes in staging first
4. Combine with Human Expertise
AI analysis + human judgment = best results:
AI Suggestion: "Add web search for pricing questions"
Your Decision: "Good idea, but restrict to official sources only"
5. Monitor After Application
Ensure suggestions actually improved performance:
```javascript
// Capture the score before applying suggestions
const beforeScore = await evaluate();

await applySuggestions();

// Re-evaluate and compare
const afterScore = await evaluate();
if (afterScore < beforeScore) {
  console.log('Suggestions made things worse!');
  await revert();
}
```

6. Document Your Changes
Keep track of what works:
## Optimization Log
### 2025-10-15: Added Product Datasource
- Suggestion: AI Analysis #12
- Expected: +18%
- Actual: +21%
- Notes: Exceeded expectations, huge impact
### 2025-10-12: Updated System Prompt
- Suggestion: AI Analysis #11
- Expected: +12%
- Actual: +9%
- Notes: Good but slightly lower than predicted
Common Patterns in AI Suggestions
Through thousands of analyses, we've identified common patterns:
Pattern 1: Missing Knowledge
Symptom: Low scores on factual questions
AI Suggests:
- Add relevant datasources
- Enable web search
- Expand knowledge base
Success Rate: 85%
Pattern 2: Poor Instruction Following
Symptom: Correct information, wrong format
AI Suggests:
- Add explicit formatting instructions
- Include examples in prompt
- Adjust temperature
Success Rate: 78%
Pattern 3: Inconsistent Responses
Symptom: Same question, different answers
AI Suggests:
- Lower temperature
- Add deterministic constraints
- Use more specific prompts
Success Rate: 92%
Pattern 4: Scope Creep
Symptom: Answering outside intended domain
AI Suggests:
- Add scope restrictions to prompt
- Enable guardrails
- Define clear boundaries
Success Rate: 88%
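These patterns can be treated as a simple lookup from symptom to typical remediation, which is roughly what the analysis does internally. Purely illustrative; the pattern names and fixes below just restate the lists above:

```javascript
// Map each failure pattern to its typical fixes (from the patterns above)
const PATTERN_FIXES = {
  missingKnowledge: ['add datasources', 'enable web search', 'expand knowledge base'],
  poorInstructionFollowing: ['add formatting instructions', 'include prompt examples', 'adjust temperature'],
  inconsistentResponses: ['lower temperature', 'add deterministic constraints', 'use more specific prompts'],
  scopeCreep: ['add scope restrictions', 'enable guardrails', 'define clear boundaries'],
};

function suggestedFixes(pattern) {
  return PATTERN_FIXES[pattern] ?? [];
}

suggestedFixes('inconsistentResponses');
// → ['lower temperature', 'add deterministic constraints', 'use more specific prompts']
```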
Limitations and Considerations
What AI Analysis Can't Do
❌ Fix fundamental design flaws: Wrong model, wrong approach
❌ Create missing data: Can suggest datasources, can't create content
❌ Override business logic: Can't change your policies
❌ Guarantee specific scores: Estimates are educated guesses
When to Ignore Suggestions
- Conflicts with business requirements: Your rules take priority
- Suggests proprietary tools you don't have: Skip or find alternatives
- Recommends massive prompt changes: Iterate gradually instead
- Pushes changes you've already tried: Trust your experience
Privacy and Security
AI Analysis:
- ✅ Never shares your data externally
- ✅ Uses only evaluation results and config
- ✅ Doesn't store sensitive customer data
- ✅ Complies with data retention policies
Measuring ROI
Track the value of AI Analysis:
Time Savings
Manual Analysis: 2-3 hours per iteration
AI Analysis: 30 seconds + 15 minutes review
Time Saved: ~90%
Performance Improvement
Average Improvement: +15-25% per analysis round
Time to 90%+ Score:
- Manual: 2-3 weeks
- With AI: 3-5 days
Cost Efficiency
AI Analysis Cost: ~10 credits
Value of 20% Performance Improvement: Priceless
ROI: Immediate
Integration Examples
Automated Optimization Pipeline
```javascript
// Nightly optimization job
cron.schedule('0 2 * * *', async () => {
  const peers = await peer.listAll();

  for (const p of peers) {
    // Run evaluation
    const run = await evaluation.execute(p.defaultSuite);

    // If the score is below threshold, get suggestions
    if (run.averageScore < 0.85) {
      const analysis = await evaluation.analyze(run.id);

      // Auto-apply low-risk, high-priority suggestions
      const safeChanges = analysis.suggestions
        .filter(s => s.risk === 'low' && s.priority === 'high');
      await peer.applySuggestions(p.id, safeChanges);

      // Notify the team
      await slack.notify({
        message: `${p.name} auto-optimized: ${safeChanges.length} changes applied`,
      });
    }
  }
});
```

Continuous Improvement Dashboard
```javascript
// Track optimization trends
const dashboard = {
  totalAnalyses: 47,
  totalSuggestionsApplied: 156,
  averageImprovement: 0.18,
  topSuggestion: 'Add datasources (45% of cases)',
  lastOptimization: '2 hours ago',
  currentScore: 0.94,
};
```

What's Next?
AI-Powered Analysis is just one tool in your optimization toolkit. Combine it with:
- Regular Evaluation: Keep testing continuously
- User Feedback: Listen to real users
- A/B Testing: Validate improvements scientifically
- Manual Review: Apply human judgment
Coming soon:
- Multi-variant optimization: Test multiple suggestions simultaneously
- Automated A/B testing: Deploy and compare automatically
- Predictive analysis: Suggest improvements before problems occur
- Custom optimization goals: Define your own success metrics
Try It Yourself
Ready to optimize your peers with AI?
- Run an evaluation on your peer
- Click "Analyze with AI" on the results page
- Review the suggestions
- Apply what makes sense for your use case
- Re-evaluate to measure improvement
Most users see significant improvements within their first analysis session!
Conclusion
AI-Powered Analysis transforms peer optimization from an art into a science. Instead of guessing what might improve performance, you get:
✅ Data-driven recommendations
✅ Concrete implementation steps
✅ Expected impact estimates
✅ One-click application
The result? Better peers, faster optimization, and more time to focus on what matters: building great AI experiences.
Questions? Join our community forum or reach out to our team.

