
RAG Best Practices: Building Production-Ready AI Knowledge Systems

Retrieval Augmented Generation (RAG) has become the cornerstone of building AI systems that provide accurate, verifiable, and up-to-date information. However, implementing RAG effectively requires more than just connecting a vector database to an LLM. This comprehensive guide covers advanced RAG techniques and best practices learned from real-world production deployments.

Why RAG Matters

Large Language Models (LLMs) are powerful but have fundamental limitations:

  • Knowledge Cutoff: Training data becomes outdated
  • Hallucinations: Models can confidently generate false information
  • No Source Attribution: Users can't verify where information comes from
  • Generic Responses: Lack domain-specific or proprietary knowledge

RAG solves these problems by grounding AI responses in your actual data while maintaining the conversational capabilities of modern LLMs.

The Evolution of RAG in Cognipeer AI

Our platform has evolved through multiple iterations of RAG implementation, each addressing real production challenges:

October 2025 Updates

Recent enhancements have dramatically improved RAG reliability and accuracy:

  1. Metadata Enrichment - Rich context from data sources
  2. Final Answer Validation - LLM-powered fact checking
  3. Strict Knowledge Base Mode - Force answers from known data only
  4. Advanced Query Controls - Fine-tuned retrieval parameters
  5. Hybrid Search - Combine semantic and keyword matching

Let's dive deep into each area with practical examples.


1. Metadata Enrichment: Context is King

The Problem

Traditional RAG systems only pass the text content to the LLM. But documents have rich metadata that provides crucial context:

❌ Without Metadata:
"The Q4 revenue was $2.5M"

✅ With Metadata:
"The Q4 revenue was $2.5M"
Source: Financial Report 2024
Author: Jane Smith (CFO)
Last Updated: 2025-10-15
Department: Finance
Classification: Internal

Implementation

Enable metadata enrichment in your Peer configuration:

javascript
// Peer Settings
{
  "ragIncludeMetadata": true,
  "ragIncludeConversationSources": true
}

What Gets Included

Cognipeer AI automatically enriches RAG context with:

Dataset Items

  • Item Identifier: Title, name, or displayField
  • Source Dataset: Which dataset the information came from
  • Collection Metadata: Custom fields from your schema
  • Relationships: Connected items and references

Documents

  • File Metadata: Filename, size, type, upload date
  • Author Information: Who uploaded/owns the document
  • Version Tracking: Last modified date and revision history
  • Classifications: Tags, categories, access levels

External Sources

  • URL and Domain: Source website information
  • Crawl Date: When the data was retrieved
  • Page Structure: Headings, sections, hierarchy
  • Link Context: How pages relate to each other
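
How these fields surface in the prompt is easiest to see with a small sketch. The formatting below is illustrative only (the real renderer lives in peer/helpers/ragMetadata.js, covered later in this section); it simply appends the available metadata lines under each retrieved chunk.

javascript
// Illustrative only: turn one retrieved chunk plus its metadata into the
// text block appended to the LLM context.
function renderContextEntry(entry) {
  const lines = [entry.content.trim()];
  if (entry.metadata.itemName) lines.push(`Source item: ${entry.metadata.itemName}`);
  if (entry.metadata.datasetName) lines.push(`Dataset: ${entry.metadata.datasetName}`);
  if (entry.metadata.filename) lines.push(`File: ${entry.metadata.filename}`);
  if (entry.metadata.uploadDate) lines.push(`Last updated: ${entry.metadata.uploadDate}`);
  if (entry.metadata.source) lines.push(`Source: ${entry.metadata.source}`);
  return lines.join("\n");
}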

Real-World Example: Contract Review

Scenario: AI assistant helping lawyers review contracts

javascript
// Without Metadata
Context: "Payment terms are net 30 days."
Question: "What are the payment terms?"
Answer: "The payment terms are net 30 days."
Problem: Which contract? When was this agreed? Who signed it?

// With Metadata
Context: 
"Payment terms are net 30 days.
📄 Source: Vendor Agreement - Acme Corp
📅 Signed: 2025-08-15
👤 Signatory: John Doe (CFO)
🔄 Status: Active
⚠️ Renewal: 2026-08-15"

Question: "What are the payment terms for Acme Corp?"
Answer: "According to the Vendor Agreement signed on August 15, 2025 by 
CFO John Doe, the payment terms with Acme Corp are net 30 days. This 
agreement is currently active and comes up for renewal on August 15, 2026."

Technical Deep Dive

The metadata enrichment happens in peer/helpers/ragMetadata.js:

javascript
// Core metadata enrichment flow
const enrichedEntries = await Promise.all(
  retrievedDocs.map(async (doc) => {
    const metadata = {
      source: doc.metadata?.source,
      type: doc.metadata?.type,
      itemId: doc.metadata?.itemId,
    };

    // Resolve dataset item names
    if (doc.metadata?.datasetId && doc.metadata?.itemId) {
      const dataset = await getDataset(doc.metadata.datasetId);
      const itemName = await resolveItemName(dataset, doc.metadata.itemId);
      metadata.itemName = itemName;
      metadata.datasetName = dataset.name;
    }

    // Enrich document metadata
    if (doc.metadata?.type === 'document') {
      const fileDoc = await getDocument(doc.metadata.documentId);
      metadata.filename = fileDoc.originalName;
      metadata.uploadDate = fileDoc.createdAt;
      metadata.author = fileDoc.uploadedBy;
    }

    return {
      content: doc.pageContent,
      metadata: metadata,
      score: doc.score,
    };
  })
);

Best Practices

DO:

  • Enable metadata for production Peers
  • Include source attribution in responses
  • Use metadata for filtering and access control
  • Display sources in the UI for verification

DON'T:

  • Include sensitive metadata in public-facing Peers
  • Overwhelm the context window with excessive metadata
  • Expose internal system identifiers to end users
  • Forget to update metadata when source data changes
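
To act on the last two DON'Ts, one option is a small allowlist filter applied before metadata reaches the prompt. This is a minimal sketch; the field names are illustrative, not Cognipeer's internal schema.

javascript
// Keep only fields that are safe to show end users; drop internal IDs
const PUBLIC_METADATA_FIELDS = ["source", "datasetName", "itemName", "filename", "uploadDate"];

function sanitizeMetadata(metadata = {}) {
  return Object.fromEntries(
    Object.entries(metadata).filter(([key]) => PUBLIC_METADATA_FIELDS.includes(key))
  );
}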

2. Final Answer Validation: Preventing Hallucinations

The Problem

Even with RAG, LLMs can still hallucinate or misinterpret the retrieved context:

Retrieved Context: "Product X is available in red and blue."
User Question: "What colors does Product X come in?"
Bad Answer: "Product X is available in red, blue, green, and yellow."

The model "helpfully" added colors that weren't in the source data.

The Solution

Enable Final Answer Validation to have a second LLM review the answer against the context:

javascript
// Peer Configuration
{
  "ragValidateFinalAnswer": true,
  "ragValidateFinalAnswerInstructions": "Ensure all color options mentioned exist in the product catalog. Do not suggest colors that aren't explicitly listed."
}

How It Works

  1. Primary LLM generates answer using RAG context
  2. Validator LLM receives:
    • Original question
    • Generated answer
    • RAG context
    • Evidence sources
    • Custom validation instructions
  3. Validator judges if answer is supported by context
  4. If invalid, validator provides:
    • Detailed feedback on what's wrong
    • Corrected answer using only verified information
  5. System returns either original (if valid) or revised answer

Architecture

┌─────────────────────────────────────────────────────────┐
│ User Question                                           │
└─────────────────┬───────────────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────────────────┐
│ RAG Pipeline: Retrieve Relevant Context                │
└─────────────────┬───────────────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────────────────┐
│ Primary LLM: Generate Answer                            │
└─────────────────┬───────────────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────────────────┐
│ Validator LLM: Check Answer vs Context                 │
│ ├─ Is answer supported by evidence?                    │
│ ├─ Does it contradict any source?                      │
│ ├─ Are all facts verifiable?                           │
│ └─ Does it follow custom instructions?                 │
└─────────────────┬───────────────────────────────────────┘

            ┌─────┴─────┐
            │           │
         Valid?      Invalid?
            │           │
            ▼           ▼
    Return Original  Return Revised
       Answer          Answer

Real-World Example: Healthcare Information

Scenario: Medical information chatbot for patient questions

javascript
// Configuration
{
  "ragValidateFinalAnswer": true,
  "ragValidateFinalAnswerInstructions": `
    Medical safety rules:
    1. Only provide information explicitly stated in verified medical sources
    2. Never extrapolate or assume symptoms, treatments, or side effects
    3. If information is incomplete, clearly state what is known vs unknown
    4. Always recommend consulting healthcare professionals for medical decisions
  `
}

// Interaction Example
User: "What are the side effects of medication X?"

Retrieved Context:
"Medication X common side effects: nausea, headache.
Source: FDA Drug Label 2024"

Primary LLM Answer:
"Medication X can cause nausea, headache, and dizziness. Some 
patients also experience fatigue and dry mouth."

Validator Analysis:
{
  "isValid": false,
  "feedback": "The answer includes side effects (dizziness, fatigue, 
  dry mouth) that are NOT mentioned in the FDA label. Only nausea and 
  headache are documented.",
  "revisedAnswer": "According to the FDA drug label from 2024, the 
  common side effects of Medication X are nausea and headache. For a 
  complete list of side effects and personalized medical advice, please 
  consult your healthcare provider."
}

Final Response: [Revised Answer]

Validation Prompt Engineering

The validator uses a carefully crafted prompt:

javascript
// From peer/helpers/agents/smart/postprocessors/final-answer-validator.js
const prompt = ChatPromptTemplate.fromMessages([
  [
    "system",
    `You are a precise answer auditor.
Use ONLY the supplied knowledge context, evidence list, and instructions 
to judge the candidate answer.

If the answer contains information that is unsupported or contradicts 
the context, mark it invalid and explain the mismatch.

When invalid, craft a revised answer that corrects the issues using 
only supported information.`,
  ],
  [
    "human",
    `Primary question: {question}
Candidate answer: {answer}
Knowledge context: {context}
Detected evidence entries: {evidence}
Language requirement: {languageHint}
Additional instructions: {extraInstructions}`,
  ],
]);
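
Wiring that prompt into a validation call might look like the sketch below. This is an assumed shape rather than the exact production code: the zod schema, the model choice, and the variable names (candidateAnswer, ragContext, evidenceEntries) are illustrative.

javascript
import { z } from "zod";
import { ChatOpenAI } from "@langchain/openai";

// Expected verdict shape from the validator
const verdictSchema = z.object({
  isValid: z.boolean(),
  feedback: z.string(),
  revisedAnswer: z.string(), // empty when the answer is valid
});

const validatorModel = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });
const validatorChain = prompt.pipe(validatorModel.withStructuredOutput(verdictSchema));

const verdict = await validatorChain.invoke({
  question,
  answer: candidateAnswer,
  context: ragContext,
  evidence: JSON.stringify(evidenceEntries),
  languageHint: "Answer in the user's language",
  extraInstructions: peer.ragValidateFinalAnswerInstructions ?? "",
});

// Keep the original answer if it passed; otherwise return the correction
const finalAnswer = verdict.isValid ? candidateAnswer : verdict.revisedAnswer;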

Performance Considerations

Latency Impact:

  • Adds ~500-2000ms depending on model speed
  • Consider async validation for non-critical flows
  • Use faster models (GPT-4o-mini) for validation
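
For latency-sensitive flows, the async option mentioned above can run validation after the response has already been returned. The sketch below assumes hypothetical generateAnswer / validateAnswer helpers and only logs failed verdicts instead of blocking the user.

javascript
async function respondWithBackgroundValidation(question, ragContext) {
  const answer = await generateAnswer(question, ragContext); // primary LLM call

  // Fire-and-forget: don't await the validator before responding
  validateAnswer({ question, answer, context: ragContext })
    .then((verdict) => {
      if (!verdict.isValid) {
        logger.warn("Post-hoc validation flagged an answer", {
          question,
          feedback: verdict.feedback,
        });
      }
    })
    .catch((err) => logger.error("Background validation failed", err));

  return answer;
}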

Cost Impact:

  • Doubles LLM calls per response
  • Validator typically uses shorter context
  • Consider enabling only for high-stakes applications

When to Enable:

  • Healthcare and medical information
  • Legal and financial advice
  • Product specifications and pricing
  • Compliance and regulatory responses
  • Any domain where accuracy > speed

When to Skip:

  • Casual conversations
  • General knowledge questions
  • Speed-critical applications
  • Confirmed reliable sources

3. Strict Knowledge Base Mode: No Hallucinations Allowed

The Problem

Sometimes you need the AI to ONLY answer from your knowledge base. If the information isn't in your data, the AI should say "I don't know" rather than guessing.

Use Cases

  • Customer Support: Only provide documented solutions
  • Product Information: Don't invent features or capabilities
  • Internal Policies: Stick to official company guidelines
  • Regulated Industries: Compliance requires source-backed answers

Configuration

javascript
{
  "ragStrictKnowledgebaseAnswers": true,
  "ragValidateFinalAnswer": true, // Recommended companion setting
  "additionalPrompt": "If the answer is not found in the provided context, respond: 'I don't have that information in my knowledge base. Please contact support for assistance.'"
}

How It Works

When enabled, the system prompt enforces strict adherence:

javascript
// Injected into system prompt
if (peer.ragStrictKnowledgebaseAnswers) {
  instructions.push(
    "CRITICAL: You must ONLY provide information that is explicitly " +
    "present in the retrieved context. If the context does not contain " +
    "the answer, you MUST state that you don't have that information. " +
    "DO NOT use general knowledge, DO NOT make assumptions, and " +
    "DO NOT extrapolate beyond what is explicitly stated."
  );
}

Example: Technical Support Bot

Scenario: Software product support chatbot

javascript
// Knowledge Base Contains:
- Installation guide for Windows 10/11
- Troubleshooting network connectivity
- License activation steps

// With Strict Mode DISABLED
User: "How do I install on macOS?"
Answer: "To install on macOS, download the .dmg file, open it, 
and drag the app to your Applications folder..."
Problem: ❌ This info isn't in the knowledge base! Pure hallucination.

// With Strict Mode ENABLED  
User: "How do I install on macOS?"
Answer: "I don't have installation instructions for macOS in my 
knowledge base. Our documentation currently covers Windows 10 and 
Windows 11. Please contact our support team at support@company.com 
for macOS installation help."
Result: ✅ Honest, helpful, and doesn't mislead users.

Advanced Pattern: Confidence Scoring

Combine strict mode with confidence indicators:

javascript
{
  "ragStrictKnowledgebaseAnswers": true,
  "additionalPrompt": `
When answering, indicate your confidence:
- 🟢 HIGH: Answer is directly stated in multiple sources
- 🟡 MEDIUM: Answer is implied or from single source
- 🔴 LOW: Information is partially related but incomplete

Format: [CONFIDENCE] Answer with source citations.
  `
}

// Example Response
User: "What's the warranty period?"
AI: "🟢 HIGH: The warranty period is 2 years from purchase date. 
Source: Product Warranty Policy, Section 3.1 (Updated Oct 2025)"

Balancing Strictness vs Usability

Too Strict:

User: "How do I reset my password?"
AI: "I don't have that information."
Problem: Knowledge base has "password recovery" but not "password reset"

Solution: Semantic Search + Strict Mode

javascript
{
  "ragStrictKnowledgebaseAnswers": true,
  "ragScoreThreshold": 0.7, // Lower threshold for broader matching
  "ragMaxResults": 10,       // Retrieve more candidates
  "additionalPrompt": "If the exact terminology doesn't match but 
  related information exists, use that and acknowledge the terminology 
  difference. For example, if asked about 'reset' but you have 'recovery' 
  information, answer with: 'Regarding password reset (also called 
  password recovery in our docs)...'"
}

4. Advanced Query Controls: Fine-Tuning Retrieval

Key Parameters

Cognipeer AI provides granular control over the RAG retrieval process:

javascript
{
  // How many chunks to retrieve
  "ragMaxResults": 10,
  
  // Minimum similarity score (0-1)
  "ragScoreThreshold": 0.75,
  
  // Search strategy
  "ragAllItemMode": "hybrid", // or "semantic" or "keyword"
  
  // Include metadata enrichment
  "ragIncludeMetadata": true,
  
  // Include past conversation context
  "ragIncludeConversationSources": true
}

Understanding ragMaxResults

What It Does: Controls how many document chunks to retrieve before ranking.

Impact:

  • Too Low (1-3): Miss relevant information
  • Optimal (5-10): Balance relevance and cost
  • Too High (20+): Noise, token waste, slower responses

Recommendations by Use Case:

javascript
// FAQ Chatbot - Simple, focused answers
{
  "ragMaxResults": 3,
  "ragScoreThreshold": 0.8
}

// Research Assistant - Comprehensive analysis
{
  "ragMaxResults": 20,
  "ragScoreThreshold": 0.65
}

// Technical Documentation - Precise code examples
{
  "ragMaxResults": 5,
  "ragScoreThreshold": 0.75
}

// General Q&A - Balanced
{
  "ragMaxResults": 10,
  "ragScoreThreshold": 0.7
}

Score Threshold Tuning

How Similarity Scoring Works:

Vector embeddings represent text as points in high-dimensional space. The similarity score measures how close the query and a chunk are in that space (higher means more similar):

Score Range: 0.0 (unrelated) to 1.0 (identical)

Typical Distributions:
0.9-1.0: Exact matches, duplicates
0.8-0.9: Highly relevant, same topic
0.7-0.8: Related, conceptually similar
0.6-0.7: Loosely related
<0.6:    Different topics
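
Under the hood the score is typically a cosine similarity between the query embedding and each chunk embedding. A minimal sketch (the vector database computes this for you; it is shown here only to make the 0-1 scale concrete):

javascript
// Cosine similarity between two embedding vectors of equal length
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}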

Tuning Guide:

javascript
// High Precision (Strict Relevance)
{
  "ragScoreThreshold": 0.85,
  "ragMaxResults": 3
}
Use when: Accuracy > recall, technical/medical/legal domains

// Balanced (Default)
{
  "ragScoreThreshold": 0.7,
  "ragMaxResults": 10
}
Use when: General Q&A, customer support

// High Recall (Broad Coverage)
{
  "ragScoreThreshold": 0.6,
  "ragMaxResults": 15
}
Use when: Research, exploratory queries, brainstorming

Real-World Tuning Example

Scenario: E-commerce product search chatbot

Initial Configuration (Too Strict):

javascript
{
  "ragScoreThreshold": 0.9,
  "ragMaxResults": 3
}

User: "Do you have wireless headphones?"
Retrieved: 0 results (threshold too high)
Answer: "I don't have information about wireless headphones."
Problem: ❌ Products exist but slight wording differences filtered them out

After Tuning (Optimal):

javascript
{
  "ragScoreThreshold": 0.72,
  "ragMaxResults": 8,
  "ragAllItemMode": "hybrid" // Key addition!
}

User: "Do you have wireless headphones?"
Retrieved: 5 products
- "Bluetooth Over-Ear Headphones" (score: 0.84)
- "Wireless Noise-Cancelling Headset" (score: 0.78)
- "True Wireless Earbuds" (score: 0.76)
- "Sport Bluetooth Earphones" (score: 0.74)
- "Gaming Wireless Headset" (score: 0.73)

Answer: "Yes! We have several wireless headphone options:
1. Bluetooth Over-Ear Headphones - $149
2. Wireless Noise-Cancelling Headset - $199
[... full list ...]"
Result: ✅ Found all relevant products

5. Hybrid Search: Best of Both Worlds

The Limitation of Semantic Search Alone

Pure vector/semantic search has blind spots:

Query: "What's our PTO policy?"
Vector Search Result: Documents about "vacation days", "time off", "leave"
Problem: Misses exact matches if someone used "PTO" in docs

Query: "Product SKU ABC-123"
Vector Search Result: Random documents mentioning products
Problem: Semantic similarity doesn't help with exact IDs/codes

The Hybrid Approach

Hybrid search combines two retrieval strategies:

  1. Semantic Search: Vector similarity for conceptual matching
  2. Keyword Search: Full-text search for exact terms

Results are merged and re-ranked for optimal relevance.

Configuration

javascript
{
  "ragAllItemMode": "hybrid",  // Enable hybrid search
  "ragMaxResults": 10,         // Total results across both methods
  "ragScoreThreshold": 0.7     // Applied after merging
}

How It Works

javascript
// Simplified hybrid search logic
async function hybridSearch(query, options) {
  // Parallel retrieval
  const [semanticResults, keywordResults] = await Promise.all([
    vectorSearch(query, { limit: options.ragMaxResults }),
    fullTextSearch(query, { limit: options.ragMaxResults }),
  ]);

  // Merge and deduplicate
  const merged = mergeResults(semanticResults, keywordResults);

  // Re-rank using Reciprocal Rank Fusion (RRF)
  const reranked = reciprocalRankFusion(merged);

  // Raw RRF scores are tiny (≈1/k), so rescale to 0-1 before applying
  // the similarity threshold
  const topScore = reranked[0]?.score || 1;
  return reranked
    .map(r => ({ ...r, score: r.score / topScore }))
    .filter(r => r.score >= options.ragScoreThreshold);
}

function reciprocalRankFusion(results, k = 60) {
  // RRF formula: score = Σ(1 / (k + rank))
  const scores = new Map();
  
  for (const result of results) {
    const existingScore = scores.get(result.id) || 0;
    const rrfScore = 1 / (k + result.rank);
    scores.set(result.id, existingScore + rrfScore);
  }
  
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id, score]) => ({ id, score }));
}

Real-World Example: Technical Documentation

Scenario: Developer searching internal API docs

Query: "How do I authenticate with JWT token?"

Semantic Search Results:

  1. "Authentication Overview" (score: 0.82)
  2. "OAuth2 Implementation" (score: 0.78)
  3. "User Session Management" (score: 0.71)

Keyword Search Results:

  1. "JWT Token Validation Guide" (score: 0.95) ← Exact term match!
  2. "Authentication Overview" (score: 0.85)
  3. "API Security Best Practices" (score: 0.72)

Hybrid (Merged & Reranked):

  1. "JWT Token Validation Guide" (0.93) ✅ Best match!
  2. "Authentication Overview" (0.87)
  3. "OAuth2 Implementation" (0.76)
  4. "API Security Best Practices" (0.74)
  5. "User Session Management" (0.70)

Result: User gets the exact JWT guide first, with related auth docs as context.

When to Use Each Mode

Pure Semantic ("ragAllItemMode": "semantic"):

  • ✅ Natural language queries
  • ✅ Conceptual searches
  • ✅ Multilingual content
  • ✅ Synonym-rich domains

Pure Keyword ("ragAllItemMode": "keyword"):

  • ✅ Code search
  • ✅ Product SKUs/IDs
  • ✅ Exact phrase matching
  • ✅ Structured data

Hybrid ("ragAllItemMode": "hybrid"):

  • ✅ Technical documentation (our recommendation)
  • ✅ Mixed content types
  • ✅ Unknown query patterns
  • ✅ General-purpose chatbots

6. Context Window Management

The Challenge

LLMs have token limits. With RAG, you're consuming tokens for:

  • System prompt
  • RAG context
  • Conversation history
  • User message
  • Generated response

Token Budget Breakdown

Typical GPT-4 conversation with RAG:

Total Available: 128,000 tokens

Allocation:
- System Prompt: 500 tokens
- RAG Context: 8,000 tokens (10 chunks × 800 tokens avg)
- Conversation History: 2,000 tokens (last 10 messages)
- User Message: 50 tokens
- Reserved for Response: 2,000 tokens
- Buffer: 1,000 tokens
──────────────────────────────
Used: 13,550 tokens
Remaining: 114,450 tokens ✅
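
A rough budget check can be done with a simple character-based estimate. This sketch assumes roughly 4 characters per token for English text (use a real tokenizer such as tiktoken for exact numbers); the estimateTokens helper is also what the pruning sketch later in this section relies on.

javascript
// Very rough approximation: ~4 characters per token for English text
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

function fitsTokenBudget(parts, contextWindow, responseReserve = 2000) {
  const used = parts.reduce((sum, part) => sum + estimateTokens(part), 0);
  return used + responseReserve <= contextWindow;
}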

Optimization Strategies

1. Dynamic Context Sizing

javascript
function calculateOptimalChunks(modelContextWindow, conversationLength) {
  const systemPromptTokens = 500;
  const responseReserve = 2000;
  const conversationTokens = conversationLength * 100; // rough estimate
  const buffer = 1000;
  
  const availableForRAG = modelContextWindow 
    - systemPromptTokens 
    - responseReserve 
    - conversationTokens 
    - buffer;
  
  const avgChunkSize = 800;
  const optimalChunks = Math.floor(availableForRAG / avgChunkSize);
  
  return Math.min(optimalChunks, 15); // Cap at 15 for quality
}

// Usage
const ragMaxResults = calculateOptimalChunks(128000, conversationHistory.length);

2. Chunk Size Optimization

javascript
// Document chunking strategy
{
  "chunkSize": 800,        // Characters per chunk
  "chunkOverlap": 200,     // Overlap between chunks
  "strategy": "semantic"   // Respect sentence boundaries
}

// Recommendations by content type:

// Code Documentation
{
  "chunkSize": 1000,
  "chunkOverlap": 100,
  "strategy": "code-aware" // Preserve function/class boundaries
}

// Legal Documents
{
  "chunkSize": 600,
  "chunkOverlap": 150,
  "strategy": "paragraph" // Keep paragraphs intact
}

// Conversational FAQs
{
  "chunkSize": 400,
  "chunkOverlap": 50,
  "strategy": "qa-pair" // Each Q&A as one chunk
}
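
For intuition, a bare-bones character-window chunker with overlap might look like the sketch below. It is a hypothetical helper (Cognipeer's chunking is configured, not hand-rolled) that prefers to break at a sentence boundary inside each window.

javascript
function chunkText(text, { chunkSize = 800, chunkOverlap = 200 } = {}) {
  const chunks = [];
  let start = 0;

  while (start < text.length) {
    let end = Math.min(start + chunkSize, text.length);

    // Prefer a sentence boundary in the second half of the window
    const lastPeriod = text.lastIndexOf(". ", end);
    if (lastPeriod > start + chunkSize / 2) end = lastPeriod + 1;

    chunks.push(text.slice(start, end).trim());
    if (end === text.length) break;

    start = end - chunkOverlap; // step back to create the overlap
  }

  return chunks;
}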

3. Metadata-Driven Pruning

javascript
// Intelligently remove less important chunks
async function pruneContextByPriority(chunks, maxTokens, query) {
  // Score each chunk by multiple factors
  const scored = chunks.map(chunk => ({
    ...chunk,
    priority: calculatePriority(chunk)
  }));
  
  function calculatePriority(chunk) {
    let score = chunk.similarityScore * 100;
    
    // Boost recent documents
    const ageInDays = (Date.now() - chunk.metadata.updatedAt) / (1000 * 60 * 60 * 24);
    if (ageInDays < 30) score += 20;
    
    // Boost official sources
    if (chunk.metadata.isVerified) score += 15;
    
    // Boost exact keyword matches
    if (chunk.content.includes(query)) score += 10;
    
    return score;
  }
  
  // Sort by priority and keep within token budget
  scored.sort((a, b) => b.priority - a.priority);
  
  let totalTokens = 0;
  const selected = [];
  
  for (const chunk of scored) {
    const chunkTokens = estimateTokens(chunk.content);
    if (totalTokens + chunkTokens <= maxTokens) {
      selected.push(chunk);
      totalTokens += chunkTokens;
    }
  }
  
  return selected;
}

7. Evaluation and Monitoring

Key Metrics to Track

Retrieval Quality

javascript
{
  "avgSimilarityScore": 0.82,      // How well chunks match queries
  "retrievalLatency": 145,          // ms to retrieve from vector DB
  "chunksRetrieved": 8.5,           // avg per query
  "chunksUsed": 6.2,                // avg after filtering
  "cacheHitRate": 0.68              // % of cached embeddings
}
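
Capturing the retrieval-side numbers per request can be as simple as wrapping the search call. The sketch below reuses the hybridSearch function from section 5; the metrics.emit sink is an assumption (swap in whatever telemetry you already use).

javascript
async function trackedRetrieve(query, options) {
  const started = Date.now();
  const chunks = await hybridSearch(query, options);

  metrics.emit("rag.retrieval", {
    retrievalLatency: Date.now() - started,
    chunksRetrieved: chunks.length,
    avgSimilarityScore:
      chunks.reduce((sum, c) => sum + c.score, 0) / (chunks.length || 1),
  });

  return chunks;
}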

Answer Quality

javascript
{
  "validationPassRate": 0.94,       // % passing final validation
  "hallucination Rate": 0.03,       // detected hallucinations
  "sourceAttribution": 0.97,        // % with proper citations
  "avgResponseLength": 287,         // tokens
  "userSatisfaction": 4.3           // out of 5
}

Performance

javascript
{
  "totalLatency": 2340,             // ms end-to-end
  "breakdown": {
    "retrieval": 145,               // vector search
    "llmInference": 1890,           // generation
    "validation": 280,              // answer checking
    "overhead": 25                  // system processing
  },
  "tokensUsed": {
    "prompt": 3420,
    "completion": 287,
    "total": 3707
  }
}

A/B Testing RAG Configurations

Example test comparing strict vs lenient modes:

javascript
// Configuration A: Strict
const configA = {
  "ragStrictKnowledgebaseAnswers": true,
  "ragValidateFinalAnswer": true,
  "ragScoreThreshold": 0.8,
  "ragMaxResults": 5
};

// Configuration B: Lenient  
const configB = {
  "ragStrictKnowledgebaseAnswers": false,
  "ragValidateFinalAnswer": false,
  "ragScoreThreshold": 0.65,
  "ragMaxResults": 10
};

// Results after 1000 queries each:
const results = {
  configA: {
    answerRate: 0.73,              // 73% could answer
    accuracy: 0.96,                // 96% accurate when answering
    avgLatency: 2840,              // slower (validation)
    userSatisfaction: 4.5
  },
  configB: {
    answerRate: 0.94,              // 94% could answer
    accuracy: 0.87,                // 87% accurate
    avgLatency: 1950,              // faster
    userSatisfaction: 4.1
  }
};

// Decision: Use Config A for compliance-critical, Config B for general use

Cognipeer AI Evaluation System

Built-in tools for RAG testing:

javascript
// Create evaluation dataset
const evalDataset = await createEvaluation({
  name: "RAG Accuracy Test Q4 2025",
  peer: peer._id,
  testCases: [
    {
      input: "What's the return policy?",
      expectedSources: ["return-policy.pdf"],
      expectedAnswer: "30-day money-back guarantee",
      evaluationCriteria: ["accuracy", "source_attribution"]
    },
    // ... more test cases
  ]
});

// Run evaluation
const results = await runEvaluation(evalDataset._id);

// Analyze results
console.log(`
Evaluation Results:
  Total Tests: ${results.total}
  Passed: ${results.passed}
  Failed: ${results.failed}
  
  Accuracy: ${results.accuracy}%
  Avg Score: ${results.avgScore}
  
  Issues Found:
  - Missing sources: ${results.issues.missingSources}
  - Hallucinations: ${results.issues.hallucinations}
  - Wrong answers: ${results.issues.wrongAnswers}
`);

8. Production Architecture Patterns

Pattern 1: Multi-Tier RAG

Different quality tiers for different use cases:

javascript
// Tier 1: High-Stakes (Legal, Medical, Financial)
const tier1Config = {
  "ragValidateFinalAnswer": true,
  "ragStrictKnowledgebaseAnswers": true,
  "ragScoreThreshold": 0.85,
  "ragMaxResults": 5,
  "ragIncludeMetadata": true,
  "modelId": "gpt-4o" // Most capable model
};

// Tier 2: Standard (Customer Support, Internal Q&A)
const tier2Config = {
  "ragValidateFinalAnswer": false,
  "ragStrictKnowledgebaseAnswers": false,
  "ragScoreThreshold": 0.75,
  "ragMaxResults": 8,
  "ragIncludeMetadata": true,
  "modelId": "gpt-4o-mini"
};

// Tier 3: Casual (General Chat, Exploratory)
const tier3Config = {
  "ragValidateFinalAnswer": false,
  "ragStrictKnowledgebaseAnswers": false,
  "ragScoreThreshold": 0.65,
  "ragMaxResults": 10,
  "ragIncludeMetadata": false,
  "modelId": "gpt-4o-mini"
};
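
A request-time router can then pick the tier. The routing criteria below are illustrative (the domain and channel fields are assumptions about your request object):

javascript
function selectRagConfig(request) {
  // High-stakes domains always get the validated, strict tier
  if (["legal", "medical", "financial"].includes(request.domain)) return tier1Config;

  // Customer support and internal Q&A use the standard tier
  if (request.channel === "support") return tier2Config;

  // Everything else falls back to the casual tier
  return tier3Config;
}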

Pattern 2: Fallback Chain

Progressive degradation for reliability:

javascript
async function answerWithFallback(question, context) {
  // Try 1: Strict RAG
  try {
    const strictAnswer = await peer.ask(question, {
      ragStrictKnowledgebaseAnswers: true,
      ragScoreThreshold: 0.8
    });
    
    if (strictAnswer && !strictAnswer.includes("don't have")) {
      return { answer: strictAnswer, confidence: "high", source: "knowledge-base" };
    }
  } catch (err) {
    logger.warn("Strict RAG failed", err);
  }
  
  // Try 2: Relaxed RAG
  try {
    const relaxedAnswer = await peer.ask(question, {
      ragStrictKnowledgebaseAnswers: false,
      ragScoreThreshold: 0.65
    });
    
    return { answer: relaxedAnswer, confidence: "medium", source: "knowledge-base-fuzzy" };
  } catch (err) {
    logger.warn("Relaxed RAG failed", err);
  }
  
  // Try 3: General knowledge (with disclaimer)
  const generalAnswer = await peer.ask(question, {
    enableRagPipeline: false
  });
  
  return {
    answer: `⚠️ This answer is from general knowledge, not our knowledge base:\n\n${generalAnswer}`,
    confidence: "low",
    source: "general-knowledge"
  };
}

Pattern 3: Cached Embeddings

Optimize performance for frequently accessed data:

javascript
// Embedding caching strategy
const embeddingCache = new Map();

async function getOrCreateEmbedding(text, modelName) {
  const cacheKey = `${modelName}:${hashText(text)}`;
  
  if (embeddingCache.has(cacheKey)) {
    return embeddingCache.get(cacheKey);
  }
  
  const embedding = await generateEmbedding(text, modelName);
  
  // Cache with TTL
  embeddingCache.set(cacheKey, embedding);
  setTimeout(() => embeddingCache.delete(cacheKey), 3600000); // 1 hour
  
  return embedding;
}

// Persistent cache for common queries
await redis.setex(
  `embedding:${queryHash}`,
  86400, // 24 hours
  JSON.stringify(embedding)
);

Pattern 4: Progressive Context Loading

Load context incrementally for long conversations:

javascript
async function progressiveRAG(conversation) {
  const recentMessages = conversation.slice(-5);
  const olderMessages = conversation.slice(0, -5);
  
  // Initial response with recent context
  const quickResponse = await peer.ask(recentMessages, {
    ragMaxResults: 5,
    timeoutMs: 3000
  });
  
  // Stream initial response to user
  streamResponse(quickResponse);
  
  // Background: Load full context and refine
  if (olderMessages.length > 0) {
    const fullResponse = await peer.ask(conversation, {
      ragMaxResults: 15,
      ragIncludeConversationSources: true
    });
    
    // If significantly different, offer updated answer
    if (responseQuality(fullResponse) > responseQuality(quickResponse) + 0.2) {
      await sendFollowUp("I found additional relevant information. Would you like a more comprehensive answer?");
    }
  }
}

9. Common Pitfalls and Solutions

Pitfall 1: Over-Chunking

Problem: Documents split into too many tiny chunks

javascript
// Bad
{
  "chunkSize": 200,
  "chunkOverlap": 50
}
Result: "Our return policy..." (incomplete sentence)

Solution: Use semantic chunking with minimum size

javascript
// Good
{
  "chunkSize": 800,
  "chunkOverlap": 200,
  "minChunkSize": 400,
  "strategy": "semantic"
}

Pitfall 2: Stale Embeddings

Problem: Content updated but embeddings not regenerated

Solution: Automatic re-indexing on changes

javascript
// When document updated
async function onDocumentUpdate(documentId) {
  await vectorDB.deleteEmbeddings({ documentId });
  await regenerateEmbeddings(documentId);
  await clearResponseCache(documentId);
}

// Scheduled full re-indexing
cron.schedule('0 2 * * *', async () => {
  const staleDocuments = await findDocumentsUpdatedSince(lastIndexTime);
  for (const doc of staleDocuments) {
    await reindexDocument(doc._id);
  }
});

Pitfall 3: Ignoring Query Intent

Problem: Treating all queries the same

javascript
// User: "What's the weather?"
// System: Searches knowledge base about company policies
// Result: ❌ "I don't have weather information"

Solution: Intent classification before RAG

javascript
async function smartRouting(query) {
  const intent = await classifyIntent(query);
  
  switch (intent.type) {
    case 'knowledge-base':
      return await ragSearch(query);
    
    case 'conversational':
      return await generalChat(query);
    
    case 'action':
      return await executeAction(query, intent.action);
    
    default:
      return await fallbackHandler(query);
  }
}
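
The classifyIntent helper above is left abstract on purpose. A production version would usually be an LLM call or a trained classifier; the keyword-based sketch below is only a placeholder to show the expected return shape.

javascript
const INTENT_RULES = [
  { type: "action", pattern: /\b(create|update|delete|schedule|send|cancel)\b/i },
  { type: "knowledge-base", pattern: /\b(policy|pricing|how do i|where|spec|install)\b/i },
];

async function classifyIntent(query) {
  for (const rule of INTENT_RULES) {
    if (rule.pattern.test(query)) {
      return { type: rule.type, action: rule.type === "action" ? query : undefined };
    }
  }
  return { type: "conversational" };
}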

Pitfall 4: Poor Error Messages

Problem: Generic "I don't know" responses

Solution: Helpful, actionable errors

javascript
// Bad
"I don't have that information."

// Good
"I couldn't find information about macOS installation in our current 
documentation. However, I can help with:
• Windows 10/11 installation
• Linux (Ubuntu/Debian) setup
• Docker deployment

For macOS support, please contact support@company.com or check our 
community forum at forum.company.com"

10. Implementation Checklist

Use this checklist when implementing RAG in production:

Data Preparation

  • [ ] Documents cleaned and preprocessed
  • [ ] Optimal chunk size determined for content type
  • [ ] Metadata fields defined and populated
  • [ ] Embedding model selected based on language/domain
  • [ ] Initial indexing completed
  • [ ] Quality check on sample embeddings

Configuration

  • [ ] ragMaxResults tuned based on use case
  • [ ] ragScoreThreshold optimized through testing
  • [ ] ragAllItemMode set (semantic/keyword/hybrid)
  • [ ] Metadata inclusion enabled
  • [ ] Strict knowledge base mode configured
  • [ ] Final answer validation enabled (if needed)

Monitoring

  • [ ] Retrieval quality metrics tracked
  • [ ] Answer accuracy monitored
  • [ ] Latency and performance logged
  • [ ] User feedback collection implemented
  • [ ] A/B testing framework ready
  • [ ] Alerts for degraded performance

Testing

  • [ ] Evaluation dataset created
  • [ ] Common queries tested
  • [ ] Edge cases identified and tested
  • [ ] Hallucination detection validated
  • [ ] Source attribution verified
  • [ ] Load testing completed

Production Readiness

  • [ ] Caching strategy implemented
  • [ ] Fallback handling in place
  • [ ] Error messages user-friendly
  • [ ] Documentation updated
  • [ ] Team trained on configuration
  • [ ] Rollback plan prepared

Conclusion

Building production-ready RAG systems requires careful attention to:

  1. Metadata Enrichment: Provide rich context beyond just text
  2. Answer Validation: Prevent hallucinations with LLM-powered fact checking
  3. Strict Mode: Enforce knowledge-base-only answers when accuracy matters
  4. Query Controls: Fine-tune retrieval for your specific use case
  5. Hybrid Search: Combine semantic and keyword approaches
  6. Context Management: Optimize token usage efficiently
  7. Evaluation: Continuously measure and improve quality
  8. Architecture: Design for scalability and reliability

The recent enhancements in Cognipeer AI make it easier than ever to build accurate, trustworthy AI systems that users can rely on. Start with sensible defaults, measure real-world performance, and iterate based on data.

Next Steps

  1. Enable Metadata: Set ragIncludeMetadata: true on your Peers
  2. Test Validation: Try ragValidateFinalAnswer on high-stakes Peers
  3. Run Evaluations: Create test datasets in the Evaluation system
  4. Monitor Metrics: Track retrieval quality and answer accuracy
  5. Share Learnings: Join our community to discuss RAG strategies

Questions or feedback? Join our Discord community or reach out at support@cognipeer.com.
