
AWS S3

The AWS S3 Datasource connects an Amazon S3 bucket to Cognipeer as a knowledge source. Cognipeer lists the objects in the bucket (up to 5,000 per sync), downloads their content, and indexes it so your Peer can search and retrieve information from files stored in S3.

This datasource is ideal for organizations that store documents, reports, configuration files, or any text-based content in S3 buckets.


Use Cases

  • Index company documents stored in S3 (policies, SOPs, contracts).
  • Build a knowledge base from data exports or generated reports uploaded to S3.
  • Search across large file archives without moving them to a separate system.
  • Combine with CI/CD pipelines to automatically index newly generated artifacts.

Prerequisites

  • An AWS account with an S3 bucket.
  • An IAM user or role with read permissions (s3:ListBucket and s3:GetObject) on the target bucket.
  • Access Key ID and Secret Access Key credentials.

For details on least-privilege IAM policies, refer to the Developer Hub or consult your cloud administrator.
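
As a rough illustration of the read permissions listed above, the sketch below builds an IAM policy document that allows only s3:ListBucket and s3:GetObject on a single bucket. The helper name build_read_only_policy and the bucket name are placeholders for illustration, not part of Cognipeer. The policy follows the standard IAM JSON policy format, and your organization may require extra conditions.

```python
import json


def build_read_only_policy(bucket: str) -> dict:
    """Return an IAM policy granting list and read access to one bucket.

    Hypothetical helper for illustration; adapt it to your own conventions.
    """
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # s3:ListBucket is a bucket-level action, so it targets the bucket ARN.
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": bucket_arn,
            },
            {
                # s3:GetObject is an object-level action, so it targets the keys in the bucket.
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


policy = build_read_only_policy("my-company-docs")  # example bucket name
print(json.dumps(policy, indent=2))
```

The two statements are kept separate because the bucket ARN and the object ARN pattern are different resources; putting both actions on only one of them is a common cause of AccessDenied errors.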


Setting Up an AWS S3 Datasource

  1. Navigate to Datasources and click Add Datasource.
  2. Select AWS S3 as the datasource type.
  3. Fill in:
    • Access Key ID: Your IAM user's Access Key ID.
    • Secret Access Key: Your IAM user's Secret Access Key.
    • Bucket: The name of the S3 bucket (e.g., my-company-docs).
  4. Click Save.
  5. Click Sync to begin indexing the bucket contents.
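
Before entering credentials in the form above, it can help to confirm that they can actually list and read the bucket. The following is a minimal sketch of such a check, assuming the boto3 library is available on your machine. The helper names check_read_access and make_client are hypothetical, and nothing here is something Cognipeer runs itself.

```python
def check_read_access(client, bucket: str) -> dict:
    """Try one list call and one object read, reporting which succeeded.

    `client` is expected to behave like a boto3 S3 client.
    """
    result = {"can_list": False, "can_read": None}  # None means no object was available to test
    try:
        page = client.list_objects_v2(Bucket=bucket, MaxKeys=1)
    except Exception:
        # With boto3 this is typically botocore.exceptions.ClientError (e.g. AccessDenied).
        return result
    result["can_list"] = True

    contents = page.get("Contents", [])
    if not contents:
        return result  # empty bucket: listing works, reading cannot be verified

    try:
        response = client.get_object(Bucket=bucket, Key=contents[0]["Key"])
        response["Body"].close()  # close the stream without downloading the whole object
        result["can_read"] = True
    except Exception:
        result["can_read"] = False
    return result


def make_client(access_key_id: str, secret_access_key: str):
    """Build a boto3 S3 client from the same two values the datasource form asks for."""
    import boto3  # imported lazily so check_read_access can be exercised without boto3

    return boto3.client(
        "s3",
        aws_access_key_id=access_key_id,
        aws_secret_access_key=secret_access_key,
    )
```

A call such as check_read_access(make_client(key_id, secret), "my-company-docs") returning can_list and can_read both True suggests the credentials carry the two permissions named in the prerequisites.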

What Gets Indexed

Cognipeer calls the S3 ListObjectsV2 API to list the objects in the bucket, up to 5,000 per sync. Each listed object is downloaded and passed to the document loader, which supports:

  • PDF (.pdf)
  • Word (.docx)
  • Spreadsheets and tabular data (.xlsx, .csv)
  • Plain text (.txt, .md, .json, .yaml)
  • Source code and configuration files

Folder placeholder objects (keys ending with /) are skipped automatically.
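
To preview roughly which keys a sync would pick up, the selection rules above can be approximated on the client side. The sketch below is an assumption-laden approximation, not Cognipeer's actual implementation: the extension set is copied from the list above (the real loader also accepts source code and configuration files), the 5,000 cap is assumed to count listed objects before filtering, and the helper names select_keys and list_pages are hypothetical. It assumes boto3 for the listing step.

```python
from pathlib import PurePosixPath
from typing import Iterable, Iterator

# Assumption: extensions named in the documentation; the real loader may accept more.
INDEXABLE_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".csv", ".txt", ".md", ".json", ".yaml"}
MAX_OBJECTS = 5000  # per-sync limit stated in the documentation


def select_keys(pages: Iterable[dict], limit: int = MAX_OBJECTS) -> Iterator[str]:
    """Yield indexable keys from ListObjectsV2-shaped pages.

    Stops after `limit` listed objects (assumed to be counted before filtering).
    """
    seen = 0
    for page in pages:
        for obj in page.get("Contents", []):
            if seen >= limit:
                return
            seen += 1
            key = obj["Key"]
            if key.endswith("/"):
                continue  # folder placeholder, skipped
            if PurePosixPath(key).suffix.lower() in INDEXABLE_EXTENSIONS:
                yield key


def list_pages(bucket: str, access_key_id: str, secret_access_key: str):
    """Return an iterable of ListObjectsV2 pages using boto3's built-in paginator."""
    import boto3  # imported lazily so select_keys can be used without boto3

    client = boto3.client(
        "s3",
        aws_access_key_id=access_key_id,
        aws_secret_access_key=secret_access_key,
    )
    return client.get_paginator("list_objects_v2").paginate(Bucket=bucket)
```

For example, list(select_keys(list_pages("my-company-docs", key_id, secret))) gives a quick estimate of what would be indexed and whether the bucket is close to the object limit.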


Keeping the Datasource Up to Date

The datasource does not automatically detect new or changed files. To re-index:

  • Click Sync in the datasource settings.
  • To automate syncs, see the Developer Hub for implementation guidance.
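
Because changes are not detected automatically, one option is to track bucket changes yourself and re-sync only when something differs. The sketch below compares two listings by key and ETag. It is an illustration under assumptions: the helper names snapshot, diff_snapshots, and needs_sync are hypothetical, the pages are assumed to be ListObjectsV2 responses (for example from a boto3 paginator), and triggering the sync itself is left to the manual Sync button or whatever the Developer Hub documents.

```python
from typing import Dict, Iterable, List


def snapshot(pages: Iterable[dict]) -> Dict[str, str]:
    """Map each object key to its ETag from ListObjectsV2-shaped pages."""
    return {
        obj["Key"]: obj["ETag"]
        for page in pages
        for obj in page.get("Contents", [])
    }


def diff_snapshots(previous: Dict[str, str], current: Dict[str, str]) -> Dict[str, List[str]]:
    """Report keys that were added, changed (different ETag), or removed."""
    shared = previous.keys() & current.keys()
    return {
        "added": sorted(current.keys() - previous.keys()),
        "changed": sorted(k for k in shared if previous[k] != current[k]),
        "removed": sorted(previous.keys() - current.keys()),
    }


def needs_sync(previous: Dict[str, str], current: Dict[str, str]) -> bool:
    """True when any key was added, changed, or removed since the last snapshot."""
    return any(diff_snapshots(previous, current).values())
```

Storing the last snapshot (for example as a JSON file) and running the comparison on a schedule keeps re-indexing limited to the times when the bucket has actually changed.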

Best Practices

  • Use a dedicated bucket or prefix: Avoid mixing indexable documents with other assets (images, videos) to reduce noise.
  • Follow least-privilege: Grant only the s3:ListBucket and s3:GetObject permissions needed, scoped to the specific bucket.
  • Rotate credentials: Rotate IAM access keys on a regular schedule, and update the datasource with the new keys after each rotation.
  • Organize with prefixes: Use folder-like prefixes (e.g., docs/, reports/) to keep content structured.

Limitations

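  • A maximum of 5,000 objects are processed per sync. ListObjectsV2 returns at most 1,000 keys per request, so Cognipeer pages through the results automatically, up to that limit.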
  • Only text-based and document files are indexed. Binary assets (images, executables) are skipped.
  • The datasource connects to standard AWS S3. Other S3-compatible services (MinIO, DigitalOcean Spaces) may work if their API is fully S3-compatible, but are not officially tested.

Studio · Pulse — Cognipeer product documentation