
RAG Best Practices: Building Production-Ready AI Knowledge Systems

Retrieval Augmented Generation (RAG) has become the cornerstone of building AI systems that provide accurate, verifiable, and up-to-date information. However, implementing RAG effectively requires more than just connecting a vector database to an LLM. This comprehensive guide covers advanced RAG techniques and best practices learned from real-world production deployments.

Why RAG Matters

Large Language Models (LLMs) are powerful but have fundamental limitations:

  • Knowledge Cutoff: Training data becomes outdated
  • Hallucinations: Models can confidently generate false information
  • No Source Attribution: Users can't verify where information comes from
  • Generic Responses: Lack domain-specific or proprietary knowledge

RAG solves these problems by grounding AI responses in your actual data while maintaining the conversational capabilities of modern LLMs.

The Evolution of RAG in Cognipeer AI

Our platform has evolved through multiple iterations of RAG implementation, each addressing real production challenges:

October 2025 Updates

Recent enhancements have dramatically improved RAG reliability and accuracy:

  1. Metadata Enrichment - Rich context from data sources
  2. Final Answer Validation - LLM-powered fact checking
  3. Strict Knowledge Base Mode - Force answers from known data only
  4. Advanced Query Controls - Fine-tuned retrieval parameters
  5. Hybrid Search - Combine semantic and keyword matching

Let's dive deep into each area with practical examples.


1. Metadata Enrichment: Context is King

The Problem

Traditional RAG systems only pass the text content to the LLM. But documents have rich metadata that provides crucial context:

❌ Without Metadata:
"The Q4 revenue was $2.5M"

✅ With Metadata:
"The Q4 revenue was $2.5M"
Source: Financial Report 2024
Author: Jane Smith (CFO)
Last Updated: 2025-10-15
Department: Finance
Classification: Internal

Implementation

Enable metadata enrichment in your Peer configuration:

javascript
// Peer Settings
{
  "ragIncludeMetadata": true,
  "ragIncludeConversationSources": true
}

What Gets Included

Cognipeer AI automatically enriches RAG context with:

Dataset Items

  • Item Identifier: Title, name, or displayField
  • Source Dataset: Which dataset the information came from
  • Collection Metadata: Custom fields from your schema
  • Relationships: Connected items and references

Documents

  • File Metadata: Filename, size, type, upload date
  • Author Information: Who uploaded/owns the document
  • Version Tracking: Last modified date and revision history
  • Classifications: Tags, categories, access levels

External Sources

  • URL and Domain: Source website information
  • Crawl Date: When the data was retrieved
  • Page Structure: Headings, sections, hierarchy
  • Link Context: How pages relate to each other
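
How these fields surface in the prompt is easiest to see with a small sketch. The formatting below is illustrative only (the real renderer lives in peer/helpers/ragMetadata.js, covered later in this section); it simply appends the available metadata lines under each retrieved chunk.

javascript
// Illustrative only: turn one retrieved chunk plus its metadata into the
// text block appended to the LLM context.
function renderContextEntry(entry) {
  const lines = [entry.content.trim()];
  if (entry.metadata.itemName) lines.push(`Source item: ${entry.metadata.itemName}`);
  if (entry.metadata.datasetName) lines.push(`Dataset: ${entry.metadata.datasetName}`);
  if (entry.metadata.filename) lines.push(`File: ${entry.metadata.filename}`);
  if (entry.metadata.uploadDate) lines.push(`Last updated: ${entry.metadata.uploadDate}`);
  if (entry.metadata.source) lines.push(`Source: ${entry.metadata.source}`);
  return lines.join("\n");
}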

Real-World Example: Contract Review

Scenario: AI assistant helping lawyers review contracts

javascript
// Without Metadata
Context: "Payment terms are net 30 days."
Question: "What are the payment terms?"
Answer: "The payment terms are net 30 days."
Problem: Which contract? When was this agreed? Who signed it?

// With Metadata
Context: 
"Payment terms are net 30 days.
📄 Source: Vendor Agreement - Acme Corp
📅 Signed: 2025-08-15
👤 Signatory: John Doe (CFO)
🔄 Status: Active
⚠️ Renewal: 2026-08-15"

Question: "What are the payment terms for Acme Corp?"
Answer: "According to the Vendor Agreement signed on August 15, 2025 by 
CFO John Doe, the payment terms with Acme Corp are net 30 days. This 
agreement is currently active and comes up for renewal on August 15, 2026."

Technical Deep Dive

The metadata enrichment happens in peer/helpers/ragMetadata.js:

javascript
// Core metadata enrichment flow
const enrichedEntries = await Promise.all(
  retrievedDocs.map(async (doc) => {
    const metadata = {
      source: doc.metadata?.source,
      type: doc.metadata?.type,
      itemId: doc.metadata?.itemId,
    };

    // Resolve dataset item names
    if (doc.metadata?.datasetId && doc.metadata?.itemId) {
      const dataset = await getDataset(doc.metadata.datasetId);
      const itemName = await resolveItemName(dataset, doc.metadata.itemId);
      metadata.itemName = itemName;
      metadata.datasetName = dataset.name;
    }

    // Enrich document metadata
    if (doc.metadata?.type === 'document') {
      const fileDoc = await getDocument(doc.metadata.documentId);
      metadata.filename = fileDoc.originalName;
      metadata.uploadDate = fileDoc.createdAt;
      metadata.author = fileDoc.uploadedBy;
    }

    return {
      content: doc.pageContent,
      metadata: metadata,
      score: doc.score,
    };
  })
);

Best Practices

DO:

  • Enable metadata for production Peers
  • Include source attribution in responses
  • Use metadata for filtering and access control
  • Display sources in the UI for verification

DON'T:

  • Include sensitive metadata in public-facing Peers
  • Overwhelm the context window with excessive metadata
  • Expose internal system identifiers to end users
  • Forget to update metadata when source data changes
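
To act on the last two DON'Ts, one option is a small allowlist filter applied before metadata reaches the prompt. This is a minimal sketch; the field names are illustrative, not Cognipeer's internal schema.

javascript
// Keep only fields that are safe to show end users; drop internal IDs
const PUBLIC_METADATA_FIELDS = ["source", "datasetName", "itemName", "filename", "uploadDate"];

function sanitizeMetadata(metadata = {}) {
  return Object.fromEntries(
    Object.entries(metadata).filter(([key]) => PUBLIC_METADATA_FIELDS.includes(key))
  );
}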

2. Final Answer Validation: Preventing Hallucinations

The Problem

Even with RAG, LLMs can still hallucinate or misinterpret the retrieved context:

Retrieved Context: "Product X is available in red and blue."
User Question: "What colors does Product X come in?"
Bad Answer: "Product X is available in red, blue, green, and yellow."

The model "helpfully" added colors that weren't in the source data.

The Solution

Enable Final Answer Validation to have a second LLM review the answer against the context:

javascript
// Peer Configuration
{
  "ragValidateFinalAnswer": true,
  "ragValidateFinalAnswerInstructions": "Ensure all color options mentioned exist in the product catalog. Do not suggest colors that aren't explicitly listed."
}

How It Works

  1. Primary LLM generates answer using RAG context
  2. Validator LLM receives:
    • Original question
    • Generated answer
    • RAG context
    • Evidence sources
    • Custom validation instructions
  3. Validator judges if answer is supported by context
  4. If invalid, validator provides:
    • Detailed feedback on what's wrong
    • Corrected answer using only verified information
  5. System returns either original (if valid) or revised answer

Architecture

┌─────────────────────────────────────────────────────────┐
│ User Question                                           │
└─────────────────┬───────────────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────────────────┐
│ RAG Pipeline: Retrieve Relevant Context                │
└─────────────────┬───────────────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────────────────┐
│ Primary LLM: Generate Answer                            │
└─────────────────┬───────────────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────────────────┐
│ Validator LLM: Check Answer vs Context                 │
│ ├─ Is answer supported by evidence?                    │
│ ├─ Does it contradict any source?                      │
│ ├─ Are all facts verifiable?                           │
│ └─ Does it follow custom instructions?                 │
└─────────────────┬───────────────────────────────────────┘

            ┌─────┴─────┐
            │           │
         Valid?      Invalid?
            │           │
            ▼           ▼
    Return Original  Return Revised
       Answer          Answer

Real-World Example: Healthcare Information

Scenario: Medical information chatbot for patient questions

javascript
// Configuration
{
  "ragValidateFinalAnswer": true,
  "ragValidateFinalAnswerInstructions": `
    Medical safety rules:
    1. Only provide information explicitly stated in verified medical sources
    2. Never extrapolate or assume symptoms, treatments, or side effects
    3. If information is incomplete, clearly state what is known vs unknown
    4. Always recommend consulting healthcare professionals for medical decisions
  `
}

// Interaction Example
User: "What are the side effects of medication X?"

Retrieved Context:
"Medication X common side effects: nausea, headache.
Source: FDA Drug Label 2024"

Primary LLM Answer:
"Medication X can cause nausea, headache, and dizziness. Some 
patients also experience fatigue and dry mouth."

Validator Analysis:
{
  "isValid": false,
  "feedback": "The answer includes side effects (dizziness, fatigue, 
  dry mouth) that are NOT mentioned in the FDA label. Only nausea and 
  headache are documented.",
  "revisedAnswer": "According to the FDA drug label from 2024, the 
  common side effects of Medication X are nausea and headache. For a 
  complete list of side effects and personalized medical advice, please 
  consult your healthcare provider."
}

Final Response: [Revised Answer]

Validation Prompt Engineering

The validator uses a carefully crafted prompt:

javascript
// From peer/helpers/agents/smart/postprocessors/final-answer-validator.js
const prompt = ChatPromptTemplate.fromMessages([
  [
    "system",
    `You are a precise answer auditor.
Use ONLY the supplied knowledge context, evidence list, and instructions 
to judge the candidate answer.

If the answer contains information that is unsupported or contradicts 
the context, mark it invalid and explain the mismatch.

When invalid, craft a revised answer that corrects the issues using 
only supported information.`,
  ],
  [
    "human",
    `Primary question: {question}
Candidate answer: {answer}
Knowledge context: {context}
Detected evidence entries: {evidence}
Language requirement: {languageHint}
Additional instructions: {extraInstructions}`,
  ],
]);
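
Wiring that prompt into a validation call might look like the sketch below. This is an assumed shape rather than the exact production code: the zod schema, the model choice, and the variable names (candidateAnswer, ragContext, evidenceEntries) are illustrative.

javascript
import { z } from "zod";
import { ChatOpenAI } from "@langchain/openai";

// Expected verdict shape from the validator
const verdictSchema = z.object({
  isValid: z.boolean(),
  feedback: z.string(),
  revisedAnswer: z.string(), // empty when the answer is valid
});

const validatorModel = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });
const validatorChain = prompt.pipe(validatorModel.withStructuredOutput(verdictSchema));

const verdict = await validatorChain.invoke({
  question,
  answer: candidateAnswer,
  context: ragContext,
  evidence: JSON.stringify(evidenceEntries),
  languageHint: "Answer in the user's language",
  extraInstructions: peer.ragValidateFinalAnswerInstructions ?? "",
});

// Keep the original answer if it passed; otherwise return the correction
const finalAnswer = verdict.isValid ? candidateAnswer : verdict.revisedAnswer;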

Performance Considerations

Latency Impact:

  • Adds ~500-2000ms depending on model speed
  • Consider async validation for non-critical flows
  • Use faster models (GPT-4o-mini) for validation
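
For latency-sensitive flows, the async option mentioned above can run validation after the response has already been returned. The sketch below assumes hypothetical generateAnswer / validateAnswer helpers and only logs failed verdicts instead of blocking the user.

javascript
async function respondWithBackgroundValidation(question, ragContext) {
  const answer = await generateAnswer(question, ragContext); // primary LLM call

  // Fire-and-forget: don't await the validator before responding
  validateAnswer({ question, answer, context: ragContext })
    .then((verdict) => {
      if (!verdict.isValid) {
        logger.warn("Post-hoc validation flagged an answer", {
          question,
          feedback: verdict.feedback,
        });
      }
    })
    .catch((err) => logger.error("Background validation failed", err));

  return answer;
}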

Cost Impact:

  • Doubles LLM calls per response
  • Validator typically uses shorter context
  • Consider enabling only for high-stakes applications

When to Enable:

  • Healthcare and medical information
  • Legal and financial advice
  • Product specifications and pricing
  • Compliance and regulatory responses
  • Any domain where accuracy > speed

When to Skip:

  • Casual conversations
  • General knowledge questions
  • Speed-critical applications
  • Confirmed reliable sources

3. Strict Knowledge Base Mode: No Hallucinations Allowed

The Problem

Sometimes you need the AI to ONLY answer from your knowledge base. If the information isn't in your data, the AI should say "I don't know" rather than guessing.

Use Cases

  • Customer Support: Only provide documented solutions
  • Product Information: Don't invent features or capabilities
  • Internal Policies: Stick to official company guidelines
  • Regulated Industries: Compliance requires source-backed answers

Configuration

javascript
{
  "ragStrictKnowledgebaseAnswers": true,
  "ragValidateFinalAnswer": true, // Recommended companion setting
  "additionalPrompt": "If the answer is not found in the provided context, respond: 'I don't have that information in my knowledge base. Please contact support for assistance.'"
}

How It Works

When enabled, the system prompt enforces strict adherence:

javascript
// Injected into system prompt
if (peer.ragStrictKnowledgebaseAnswers) {
  instructions.push(
    "CRITICAL: You must ONLY provide information that is explicitly " +
    "present in the retrieved context. If the context does not contain " +
    "the answer, you MUST state that you don't have that information. " +
    "DO NOT use general knowledge, DO NOT make assumptions, and " +
    "DO NOT extrapolate beyond what is explicitly stated."
  );
}

Example: Technical Support Bot

Scenario: Software product support chatbot

javascript
// Knowledge Base Contains:
- Installation guide for Windows 10/11
- Troubleshooting network connectivity
- License activation steps

// With Strict Mode DISABLED
User: "How do I install on macOS?"
Answer: "To install on macOS, download the .dmg file, open it, 
and drag the app to your Applications folder..."
Problem: ❌ This info isn't in the knowledge base! Pure hallucination.

// With Strict Mode ENABLED  
User: "How do I install on macOS?"
Answer: "I don't have installation instructions for macOS in my 
knowledge base. Our documentation currently covers Windows 10 and 
Windows 11. Please contact our support team at support@company.com 
for macOS installation help."
Result: ✅ Honest, helpful, and doesn't mislead users.

Advanced Pattern: Confidence Scoring

Combine strict mode with confidence indicators:

javascript
{
  "ragStrictKnowledgebaseAnswers": true,
  "additionalPrompt": `
When answering, indicate your confidence:
- 🟢 HIGH: Answer is directly stated in multiple sources
- 🟡 MEDIUM: Answer is implied or from single source
- 🔴 LOW: Information is partially related but incomplete

Format: [CONFIDENCE] Answer with source citations.
  `
}

// Example Response
User: "What's the warranty period?"
AI: "🟢 HIGH: The warranty period is 2 years from purchase date. 
Source: Product Warranty Policy, Section 3.1 (Updated Oct 2025)"

Balancing Strictness vs Usability

Too Strict:

User: "How do I reset my password?"
AI: "I don't have that information."
Problem: Knowledge base has "password recovery" but not "password reset"

Solution: Semantic Search + Strict Mode

javascript
{
  "ragStrictKnowledgebaseAnswers": true,
  "ragScoreThreshold": 0.7, // Lower threshold for broader matching
  "ragMaxResults": 10,       // Retrieve more candidates
  "additionalPrompt": "If the exact terminology doesn't match but 
  related information exists, use that and acknowledge the terminology 
  difference. For example, if asked about 'reset' but you have 'recovery' 
  information, answer with: 'Regarding password reset (also called 
  password recovery in our docs)...'"
}

4. Advanced Query Controls: Fine-Tuning Retrieval

Key Parameters

Cognipeer AI provides granular control over the RAG retrieval process:

javascript
{
  // How many chunks to retrieve
  "ragMaxResults": 10,
  
  // Minimum similarity score (0-1)
  "ragScoreThreshold": 0.75,
  
  // Search strategy
  "ragAllItemMode": "hybrid", // or "semantic" or "keyword"
  
  // Include metadata enrichment
  "ragIncludeMetadata": true,
  
  // Include past conversation context
  "ragIncludeConversationSources": true
}

Understanding ragMaxResults

What It Does: Controls how many document chunks to retrieve before ranking.

Impact:

  • Too Low (1-3): Miss relevant information
  • Optimal (5-10): Balance relevance and cost
  • Too High (20+): Noise, token waste, slower responses

Recommendations by Use Case:

javascript
// FAQ Chatbot - Simple, focused answers
{
  "ragMaxResults": 3,
  "ragScoreThreshold": 0.8
}

// Research Assistant - Comprehensive analysis
{
  "ragMaxResults": 20,
  "ragScoreThreshold": 0.65
}

// Technical Documentation - Precise code examples
{
  "ragMaxResults": 5,
  "ragScoreThreshold": 0.75
}

// General Q&A - Balanced
{
  "ragMaxResults": 10,
  "ragScoreThreshold": 0.7
}

Score Threshold Tuning

How Similarity Scoring Works:

Vector embeddings represent text as points in high-dimensional space. The similarity score measures how close the query and a chunk are in that space (higher means more similar):

Score Range: 0.0 (unrelated) to 1.0 (identical)

Typical Distributions:
0.9-1.0: Exact matches, duplicates
0.8-0.9: Highly relevant, same topic
0.7-0.8: Related, conceptually similar
0.6-0.7: Loosely related
<0.6:    Different topics
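
Under the hood the score is typically a cosine similarity between the query embedding and each chunk embedding. A minimal sketch (the vector database computes this for you; it is shown here only to make the 0-1 scale concrete):

javascript
// Cosine similarity between two embedding vectors of equal length
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}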

Tuning Guide:

javascript
// High Precision (Strict Relevance)
{
  "ragScoreThreshold": 0.85,
  "ragMaxResults": 3
}
Use when: Accuracy > recall, technical/medical/legal domains

// Balanced (Default)
{
  "ragScoreThreshold": 0.7,
  "ragMaxResults": 10
}
Use when: General Q&A, customer support

// High Recall (Broad Coverage)
{
  "ragScoreThreshold": 0.6,
  "ragMaxResults": 15
}
Use when: Research, exploratory queries, brainstorming

Real-World Tuning Example

Scenario: E-commerce product search chatbot

Initial Configuration (Too Strict):

javascript
{
  "ragScoreThreshold": 0.9,
  "ragMaxResults": 3
}

User: "Do you have wireless headphones?"
Retrieved: 0 results (threshold too high)
Answer: "I don't have information about wireless headphones."
Problem: ❌ Products exist but slight wording differences filtered them out

After Tuning (Optimal):

javascript
{
  "ragScoreThreshold": 0.72,
  "ragMaxResults": 8,
  "ragAllItemMode": "hybrid" // Key addition!
}

User: "Do you have wireless headphones?"
Retrieved: 5 products
- "Bluetooth Over-Ear Headphones" (score: 0.84)
- "Wireless Noise-Cancelling Headset" (score: 0.78)
- "True Wireless Earbuds" (score: 0.76)
- "Sport Bluetooth Earphones" (score: 0.74)
- "Gaming Wireless Headset" (score: 0.73)

Answer: "Yes! We have several wireless headphone options:
1. Bluetooth Over-Ear Headphones - $149
2. Wireless Noise-Cancelling Headset - $199
[... full list ...]"
Result: ✅ Found all relevant products

5. Hybrid Search: Best of Both Worlds

The Limitation of Semantic Search Alone

Pure vector/semantic search has blind spots:

Query: "What's our PTO policy?"
Vector Search Result: Documents about "vacation days", "time off", "leave"
Problem: Misses exact matches if someone used "PTO" in docs

Query: "Product SKU ABC-123"
Vector Search Result: Random documents mentioning products
Problem: Semantic similarity doesn't help with exact IDs/codes

The Hybrid Approach

Hybrid search combines two retrieval strategies:

  1. Semantic Search: Vector similarity for conceptual matching
  2. Keyword Search: Full-text search for exact terms

Results are merged and re-ranked for optimal relevance.

Configuration

javascript
{
  "ragAllItemMode": "hybrid",  // Enable hybrid search
  "ragMaxResults": 10,         // Total results across both methods
  "ragScoreThreshold": 0.7     // Applied after merging
}

How It Works

javascript
// Simplified hybrid search logic
async function hybridSearch(query, options) {
  // Parallel retrieval
  const [semanticResults, keywordResults] = await Promise.all([
    vectorSearch(query, { limit: options.ragMaxResults }),
    fullTextSearch(query, { limit: options.ragMaxResults }),
  ]);

  // Merge and deduplicate
  const merged = mergeResults(semanticResults, keywordResults);

  // Re-rank using Reciprocal Rank Fusion (RRF)
  const reranked = reciprocalRankFusion(merged);

  // Raw RRF scores are tiny (≈1/k), so rescale to 0-1 before applying
  // the similarity threshold
  const topScore = reranked[0]?.score || 1;
  return reranked
    .map(r => ({ ...r, score: r.score / topScore }))
    .filter(r => r.score >= options.ragScoreThreshold);
}

function reciprocalRankFusion(results, k = 60) {
  // RRF formula: score = Σ(1 / (k + rank))
  const scores = new Map();
  
  for (const result of results) {
    const existingScore = scores.get(result.id) || 0;
    const rrfScore = 1 / (k + result.rank);
    scores.set(result.id, existingScore + rrfScore);
  }
  
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id, score]) => ({ id, score }));
}

Real-World Example: Technical Documentation

Scenario: Developer searching internal API docs

Query: "How do I authenticate with JWT token?"

Semantic Search Results:

  1. "Authentication Overview" (score: 0.82)
  2. "OAuth2 Implementation" (score: 0.78)
  3. "User Session Management" (score: 0.71)

Keyword Search Results:

  1. "JWT Token Validation Guide" (score: 0.95) ← Exact term match!
  2. "Authentication Overview" (score: 0.85)
  3. "API Security Best Practices" (score: 0.72)

Hybrid (Merged & Reranked):

  1. "JWT Token Validation Guide" (0.93) ✅ Best match!
  2. "Authentication Overview" (0.87)
  3. "OAuth2 Implementation" (0.76)
  4. "API Security Best Practices" (0.74)
  5. "User Session Management" (0.70)

Result: User gets the exact JWT guide first, with related auth docs as context.

When to Use Each Mode

Pure Semantic ("ragAllItemMode": "semantic"):

  • ✅ Natural language queries
  • ✅ Conceptual searches
  • ✅ Multilingual content
  • ✅ Synonym-rich domains

Pure Keyword ("ragAllItemMode": "keyword"):

  • ✅ Code search
  • ✅ Product SKUs/IDs
  • ✅ Exact phrase matching
  • ✅ Structured data

Hybrid ("ragAllItemMode": "hybrid"):

  • ✅ Technical documentation (our recommendation)
  • ✅ Mixed content types
  • ✅ Unknown query patterns
  • ✅ General-purpose chatbots

6. Context Window Management

The Challenge

LLMs have token limits. With RAG, you're consuming tokens for:

  • System prompt
  • RAG context
  • Conversation history
  • User message
  • Generated response

Token Budget Breakdown

Typical GPT-4 conversation with RAG:

Total Available: 128,000 tokens

Allocation:
- System Prompt: 500 tokens
- RAG Context: 8,000 tokens (10 chunks × 800 tokens avg)
- Conversation History: 2,000 tokens (last 10 messages)
- User Message: 50 tokens
- Reserved for Response: 2,000 tokens
- Buffer: 1,000 tokens
──────────────────────────────
Used: 13,550 tokens
Remaining: 114,450 tokens ✅
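
A rough budget check can be done with a simple character-based estimate. This sketch assumes roughly 4 characters per token for English text (use a real tokenizer such as tiktoken for exact numbers); the estimateTokens helper is also what the pruning sketch later in this section relies on.

javascript
// Very rough approximation: ~4 characters per token for English text
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

function fitsTokenBudget(parts, contextWindow, responseReserve = 2000) {
  const used = parts.reduce((sum, part) => sum + estimateTokens(part), 0);
  return used + responseReserve <= contextWindow;
}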

Optimization Strategies

1. Dynamic Context Sizing

javascript
function calculateOptimalChunks(modelContextWindow, conversationLength) {
  const systemPromptTokens = 500;
  const responseReserve = 2000;
  const conversationTokens = conversationLength * 100; // rough estimate
  const buffer = 1000;
  
  const availableForRAG = modelContextWindow 
    - systemPromptTokens 
    - responseReserve 
    - conversationTokens 
    - buffer;
  
  const avgChunkSize = 800;
  const optimalChunks = Math.floor(availableForRAG / avgChunkSize);
  
  return Math.min(optimalChunks, 15); // Cap at 15 for quality
}

// Usage
const ragMaxResults = calculateOptimalChunks(128000, conversationHistory.length);

2. Chunk Size Optimization

javascript
// Document chunking strategy
{
  "chunkSize": 800,        // Characters per chunk
  "chunkOverlap": 200,     // Overlap between chunks
  "strategy": "semantic"   // Respect sentence boundaries
}

// Recommendations by content type:

// Code Documentation
{
  "chunkSize": 1000,
  "chunkOverlap": 100,
  "strategy": "code-aware" // Preserve function/class boundaries
}

// Legal Documents
{
  "chunkSize": 600,
  "chunkOverlap": 150,
  "strategy": "paragraph" // Keep paragraphs intact
}

// Conversational FAQs
{
  "chunkSize": 400,
  "chunkOverlap": 50,
  "strategy": "qa-pair" // Each Q&A as one chunk
}
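
For intuition, a bare-bones character-window chunker with overlap might look like the sketch below. It is a hypothetical helper (Cognipeer's chunking is configured, not hand-rolled) that prefers to break at a sentence boundary inside each window.

javascript
function chunkText(text, { chunkSize = 800, chunkOverlap = 200 } = {}) {
  const chunks = [];
  let start = 0;

  while (start < text.length) {
    let end = Math.min(start + chunkSize, text.length);

    // Prefer a sentence boundary in the second half of the window
    const lastPeriod = text.lastIndexOf(". ", end);
    if (lastPeriod > start + chunkSize / 2) end = lastPeriod + 1;

    chunks.push(text.slice(start, end).trim());
    if (end === text.length) break;

    start = end - chunkOverlap; // step back to create the overlap
  }

  return chunks;
}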

3. Metadata-Driven Pruning

javascript
// Intelligently remove less important chunks
async function pruneContextByPriority(chunks, maxTokens, query) {
  // Score each chunk by multiple factors
  const scored = chunks.map(chunk => ({
    ...chunk,
    priority: calculatePriority(chunk)
  }));
  
  function calculatePriority(chunk) {
    let score = chunk.similarityScore * 100;
    
    // Boost recent documents
    const ageInDays = (Date.now() - chunk.metadata.updatedAt) / (1000 * 60 * 60 * 24);
    if (ageInDays < 30) score += 20;
    
    // Boost official sources
    if (chunk.metadata.isVerified) score += 15;
    
    // Boost exact keyword matches
    if (chunk.content.includes(query)) score += 10;
    
    return score;
  }
  
  // Sort by priority and keep within token budget
  scored.sort((a, b) => b.priority - a.priority);
  
  let totalTokens = 0;
  const selected = [];
  
  for (const chunk of scored) {
    const chunkTokens = estimateTokens(chunk.content);
    if (totalTokens + chunkTokens <= maxTokens) {
      selected.push(chunk);
      totalTokens += chunkTokens;
    }
  }
  
  return selected;
}

7. Evaluation and Monitoring

Key Metrics to Track

Retrieval Quality

javascript
{
  "avgSimilarityScore": 0.82,      // How well chunks match queries
  "retrievalLatency": 145,          // ms to retrieve from vector DB
  "chunksRetrieved": 8.5,           // avg per query
  "chunksUsed": 6.2,                // avg after filtering
  "cacheHitRate": 0.68              // % of cached embeddings
}
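
Capturing the retrieval-side numbers per request can be as simple as wrapping the search call. The sketch below reuses the hybridSearch function from section 5; the metrics.emit sink is an assumption (swap in whatever telemetry you already use).

javascript
async function trackedRetrieve(query, options) {
  const started = Date.now();
  const chunks = await hybridSearch(query, options);

  metrics.emit("rag.retrieval", {
    retrievalLatency: Date.now() - started,
    chunksRetrieved: chunks.length,
    avgSimilarityScore:
      chunks.reduce((sum, c) => sum + c.score, 0) / (chunks.length || 1),
  });

  return chunks;
}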

Answer Quality

javascript
{
  "validationPassRate": 0.94,       // % passing final validation
  "hallucination Rate": 0.03,       // detected hallucinations
  "sourceAttribution": 0.97,        // % with proper citations
  "avgResponseLength": 287,         // tokens
  "userSatisfaction": 4.3           // out of 5
}

Performance

javascript
{
  "totalLatency": 2340,             // ms end-to-end
  "breakdown": {
    "retrieval": 145,               // vector search
    "llmInference": 1890,           // generation
    "validation": 280,              // answer checking
    "overhead": 25                  // system processing
  },
  "tokensUsed": {
    "prompt": 3420,
    "completion": 287,
    "total": 3707
  }
}

A/B Testing RAG Configurations

Example test comparing strict vs lenient modes:

javascript
// Configuration A: Strict
const configA = {
  "ragStrictKnowledgebaseAnswers": true,
  "ragValidateFinalAnswer": true,
  "ragScoreThreshold": 0.8,
  "ragMaxResults": 5
};

// Configuration B: Lenient  
const configB = {
  "ragStrictKnowledgebaseAnswers": false,
  "ragValidateFinalAnswer": false,
  "ragScoreThreshold": 0.65,
  "ragMaxResults": 10
};

// Results after 1000 queries each:
const results = {
  configA: {
    answerRate: 0.73,              // 73% could answer
    accuracy: 0.96,                // 96% accurate when answering
    avgLatency: 2840,              // slower (validation)
    userSatisfaction: 4.5
  },
  configB: {
    answerRate: 0.94,              // 94% could answer
    accuracy: 0.87,                // 87% accurate
    avgLatency: 1950,              // faster
    userSatisfaction: 4.1
  }
};

// Decision: Use Config A for compliance-critical, Config B for general use

Cognipeer AI Evaluation System

Built-in tools for RAG testing:

javascript
// Create evaluation dataset
const evalDataset = await createEvaluation({
  name: "RAG Accuracy Test Q4 2025",
  peer: peer._id,
  testCases: [
    {
      input: "What's the return policy?",
      expectedSources: ["return-policy.pdf"],
      expectedAnswer: "30-day money-back guarantee",
      evaluationCriteria: ["accuracy", "source_attribution"]
    },
    // ... more test cases
  ]
});

// Run evaluation
const results = await runEvaluation(evalDataset._id);

// Analyze results
console.log(`
Evaluation Results:
  Total Tests: ${results.total}
  Passed: ${results.passed}
  Failed: ${results.failed}
  
  Accuracy: ${results.accuracy}%
  Avg Score: ${results.avgScore}
  
  Issues Found:
  - Missing sources: ${results.issues.missingSources}
  - Hallucinations: ${results.issues.hallucinations}
  - Wrong answers: ${results.issues.wrongAnswers}
`);

8. Production Architecture Patterns

Pattern 1: Multi-Tier RAG

Different quality tiers for different use cases:

javascript
// Tier 1: High-Stakes (Legal, Medical, Financial)
const tier1Config = {
  "ragValidateFinalAnswer": true,
  "ragStrictKnowledgebaseAnswers": true,
  "ragScoreThreshold": 0.85,
  "ragMaxResults": 5,
  "ragIncludeMetadata": true,
  "modelId": "gpt-4o" // Most capable model
};

// Tier 2: Standard (Customer Support, Internal Q&A)
const tier2Config = {
  "ragValidateFinalAnswer": false,
  "ragStrictKnowledgebaseAnswers": false,
  "ragScoreThreshold": 0.75,
  "ragMaxResults": 8,
  "ragIncludeMetadata": true,
  "modelId": "gpt-4o-mini"
};

// Tier 3: Casual (General Chat, Exploratory)
const tier3Config = {
  "ragValidateFinalAnswer": false,
  "ragStrictKnowledgebaseAnswers": false,
  "ragScoreThreshold": 0.65,
  "ragMaxResults": 10,
  "ragIncludeMetadata": false,
  "modelId": "gpt-4o-mini"
};
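
A request-time router can then pick the tier. The routing criteria below are illustrative (the domain and channel fields are assumptions about your request object):

javascript
function selectRagConfig(request) {
  // High-stakes domains always get the validated, strict tier
  if (["legal", "medical", "financial"].includes(request.domain)) return tier1Config;

  // Customer support and internal Q&A use the standard tier
  if (request.channel === "support") return tier2Config;

  // Everything else falls back to the casual tier
  return tier3Config;
}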

Pattern 2: Fallback Chain

Progressive degradation for reliability:

javascript
async function answerWithFallback(question, context) {
  // Try 1: Strict RAG
  try {
    const strictAnswer = await peer.ask(question, {
      ragStrictKnowledgebaseAnswers: true,
      ragScoreThreshold: 0.8
    });
    
    if (strictAnswer && !strictAnswer.includes("don't have")) {
      return { answer: strictAnswer, confidence: "high", source: "knowledge-base" };
    }
  } catch (err) {
    logger.warn("Strict RAG failed", err);
  }
  
  // Try 2: Relaxed RAG
  try {
    const relaxedAnswer = await peer.ask(question, {
      ragStrictKnowledgebaseAnswers: false,
      ragScoreThreshold: 0.65
    });
    
    return { answer: relaxedAnswer, confidence: "medium", source: "knowledge-base-fuzzy" };
  } catch (err) {
    logger.warn("Relaxed RAG failed", err);
  }
  
  // Try 3: General knowledge (with disclaimer)
  const generalAnswer = await peer.ask(question, {
    enableRagPipeline: false
  });
  
  return {
    answer: `⚠️ This answer is from general knowledge, not our knowledge base:\n\n${generalAnswer}`,
    confidence: "low",
    source: "general-knowledge"
  };
}

Pattern 3: Cached Embeddings

Optimize performance for frequently accessed data:

javascript
// Embedding caching strategy
const embeddingCache = new Map();

async function getOrCreateEmbedding(text, modelName) {
  const cacheKey = `${modelName}:${hashText(text)}`;
  
  if (embeddingCache.has(cacheKey)) {
    return embeddingCache.get(cacheKey);
  }
  
  const embedding = await generateEmbedding(text, modelName);
  
  // Cache with TTL
  embeddingCache.set(cacheKey, embedding);
  setTimeout(() => embeddingCache.delete(cacheKey), 3600000); // 1 hour
  
  return embedding;
}

// Persistent cache for common queries
await redis.setex(
  `embedding:${queryHash}`,
  86400, // 24 hours
  JSON.stringify(embedding)
);

Pattern 4: Progressive Context Loading

Load context incrementally for long conversations:

javascript
async function progressiveRAG(conversation) {
  const recentMessages = conversation.slice(-5);
  const olderMessages = conversation.slice(0, -5);
  
  // Initial response with recent context
  const quickResponse = await peer.ask(recentMessages, {
    ragMaxResults: 5,
    timeoutMs: 3000
  });
  
  // Stream initial response to user
  streamResponse(quickResponse);
  
  // Background: Load full context and refine
  if (olderMessages.length > 0) {
    const fullResponse = await peer.ask(conversation, {
      ragMaxResults: 15,
      ragIncludeConversationSources: true
    });
    
    // If significantly different, offer updated answer
    if (responseQuality(fullResponse) > responseQuality(quickResponse) + 0.2) {
      await sendFollowUp("I found additional relevant information. Would you like a more comprehensive answer?");
    }
  }
}

9. Common Pitfalls and Solutions

Pitfall 1: Over-Chunking

Problem: Documents split into too many tiny chunks

javascript
// Bad
{
  "chunkSize": 200,
  "chunkOverlap": 50
}
Result: "Our return policy..." (incomplete sentence)

Solution: Use semantic chunking with minimum size

javascript
// Good
{
  "chunkSize": 800,
  "chunkOverlap": 200,
  "minChunkSize": 400,
  "strategy": "semantic"
}

Pitfall 2: Stale Embeddings

Problem: Content updated but embeddings not regenerated

Solution: Automatic re-indexing on changes

javascript
// When document updated
async function onDocumentUpdate(documentId) {
  await vectorDB.deleteEmbeddings({ documentId });
  await regenerateEmbeddings(documentId);
  await clearResponseCache(documentId);
}

// Scheduled full re-indexing
cron.schedule('0 2 * * *', async () => {
  const staleDocuments = await findDocumentsUpdatedSince(lastIndexTime);
  for (const doc of staleDocuments) {
    await reindexDocument(doc._id);
  }
});

Pitfall 3: Ignoring Query Intent

Problem: Treating all queries the same

javascript
// User: "What's the weather?"
// System: Searches knowledge base about company policies
// Result: ❌ "I don't have weather information"

Solution: Intent classification before RAG

javascript
async function smartRouting(query) {
  const intent = await classifyIntent(query);
  
  switch (intent.type) {
    case 'knowledge-base':
      return await ragSearch(query);
    
    case 'conversational':
      return await generalChat(query);
    
    case 'action':
      return await executeAction(query, intent.action);
    
    default:
      return await fallbackHandler(query);
  }
}
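
The classifyIntent helper above is left abstract on purpose. A production version would usually be an LLM call or a trained classifier; the keyword-based sketch below is only a placeholder to show the expected return shape.

javascript
const INTENT_RULES = [
  { type: "action", pattern: /\b(create|update|delete|schedule|send|cancel)\b/i },
  { type: "knowledge-base", pattern: /\b(policy|pricing|how do i|where|spec|install)\b/i },
];

async function classifyIntent(query) {
  for (const rule of INTENT_RULES) {
    if (rule.pattern.test(query)) {
      return { type: rule.type, action: rule.type === "action" ? query : undefined };
    }
  }
  return { type: "conversational" };
}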

Pitfall 4: Poor Error Messages

Problem: Generic "I don't know" responses

Solution: Helpful, actionable errors

javascript
// Bad
"I don't have that information."

// Good
"I couldn't find information about macOS installation in our current 
documentation. However, I can help with:
• Windows 10/11 installation
• Linux (Ubuntu/Debian) setup
• Docker deployment

For macOS support, please contact support@company.com or check our 
community forum at forum.company.com"

10. Implementation Checklist

Use this checklist when implementing RAG in production:

Data Preparation

  • [ ] Documents cleaned and preprocessed
  • [ ] Optimal chunk size determined for content type
  • [ ] Metadata fields defined and populated
  • [ ] Embedding model selected based on language/domain
  • [ ] Initial indexing completed
  • [ ] Quality check on sample embeddings

Configuration

  • [ ] ragMaxResults tuned based on use case
  • [ ] ragScoreThreshold optimized through testing
  • [ ] ragAllItemMode set (semantic/keyword/hybrid)
  • [ ] Metadata inclusion enabled
  • [ ] Strict knowledge base mode configured
  • [ ] Final answer validation enabled (if needed)

Monitoring

  • [ ] Retrieval quality metrics tracked
  • [ ] Answer accuracy monitored
  • [ ] Latency and performance logged
  • [ ] User feedback collection implemented
  • [ ] A/B testing framework ready
  • [ ] Alerts for degraded performance

Testing

  • [ ] Evaluation dataset created
  • [ ] Common queries tested
  • [ ] Edge cases identified and tested
  • [ ] Hallucination detection validated
  • [ ] Source attribution verified
  • [ ] Load testing completed

Production Readiness

  • [ ] Caching strategy implemented
  • [ ] Fallback handling in place
  • [ ] Error messages user-friendly
  • [ ] Documentation updated
  • [ ] Team trained on configuration
  • [ ] Rollback plan prepared

Conclusion

Building production-ready RAG systems requires careful attention to:

  1. Metadata Enrichment: Provide rich context beyond just text
  2. Answer Validation: Prevent hallucinations with LLM-powered fact checking
  3. Strict Mode: Enforce knowledge-base-only answers when accuracy matters
  4. Query Controls: Fine-tune retrieval for your specific use case
  5. Hybrid Search: Combine semantic and keyword approaches
  6. Context Management: Optimize token usage efficiently
  7. Evaluation: Continuously measure and improve quality
  8. Architecture: Design for scalability and reliability

The recent enhancements in Cognipeer AI make it easier than ever to build accurate, trustworthy AI systems that users can rely on. Start with sensible defaults, measure real-world performance, and iterate based on data.

Next Steps

  1. Enable Metadata: Set ragIncludeMetadata: true on your Peers
  2. Test Validation: Try ragValidateFinalAnswer on high-stakes Peers
  3. Run Evaluations: Create test datasets in the Evaluation system
  4. Monitor Metrics: Track retrieval quality and answer accuracy
  5. Share Learnings: Join our community to discuss RAG strategies

Questions or feedback? Join our Discord community or reach out at support@cognipeer.com.
