Dataset Integration

Bring your own knowledge into Noema with Retrieval-Augmented Generation (RAG). This guide explains how to prepare sources, ingest them, and keep them up to date so your assistant answers from your latest material.

Concept overview

How RAG enhances your chats

Retrieval-Augmented Generation lets Noema search your indexed documents whenever you ask a question. Relevant snippets are appended to the model prompt so answers stay grounded in your own material instead of generic training data.

Supported inputs

File types that index cleanly

Documents

PDF, Markdown, and TXT notes
Word documents and lecture decks exported as PDF
Research papers or briefs

Structured data

CSV tables for metrics, logs, or glossaries
JSON exports from knowledge bases
Curated text bundles from other tools

Adding sources

Ingest datasets in two ways

Local files

Open Explore and press the Import icon.
Select Import from Files and pick documents from your device or iCloud.
Keep Noema active while the importer chunks and embeds each page.
Enable the dataset in chat settings to make it available to every conversation.

Open Textbook Library

Head to the Explore tab and browse curated academic titles.
Add the books that match your domain; they index in the background.
Review the summary once processing finishes to confirm availability.

What happens inside

The three stages of retrieval

Document processing

Files are chunked into passages and converted into embeddings so Noema can search semantically instead of relying on keywords.
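
To make the chunking step concrete, here is a minimal sketch of splitting text into overlapping passages. The word-window approach, the `chunk_text` name, and the default sizes are assumptions for illustration; this guide does not state the chunk size or embedding model Noema actually uses.

```python
def chunk_text(text: str, chunk_words: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping word-window passages (illustrative only)."""
    if chunk_words <= 0 or not 0 <= overlap < chunk_words:
        raise ValueError("need chunk_words > 0 and 0 <= overlap < chunk_words")
    words = text.split()
    step = chunk_words - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(str(i) for i in range(10))
    for passage in chunk_text(sample, chunk_words=4, overlap=1):
        print(passage)
```

The overlap keeps a sentence that straddles a boundary retrievable from either neighboring passage. Each passage would then be passed to an embedding model to produce the vector that gets indexed.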

Query matching

Each prompt generates its own embedding. The system compares it against your dataset vectors to surface the closest matches.
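
The comparison can be sketched as a nearest-neighbor search over vectors. The example below assumes cosine similarity as the distance measure and uses tiny hand-written vectors; the guide does not say which metric Noema applies, and `top_matches` is a hypothetical helper, not part of the app.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_matches(query_vec, passages, k=3):
    """Return the k (score, text) pairs most similar to the query vector.

    `passages` is a list of (text, vector) pairs standing in for an index.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    index = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
    for score, text in top_matches([1.0, 0.0], index, k=2):
        print(f"{text}: {score:.3f}")
```

Real embeddings have hundreds of dimensions, and large indexes typically use an approximate search structure rather than scoring every passage, but the ranking idea is the same.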

Context injection

Relevant excerpts are appended to the model input, which lets responses cite the sections that informed them.
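
One way to picture this stage is a function that assembles the final prompt from the retrieved passages. The template wording, the numbering scheme, and the `build_prompt` name below are assumptions for illustration; Noema's actual prompt layout is not described in this guide.

```python
def build_prompt(question: str, excerpts: list[tuple[str, str]]) -> str:
    """Combine retrieved (source, text) excerpts with the user's question."""
    lines = [
        "Use the excerpts below to answer. Cite sources by their bracketed number.",
        "",
    ]
    for number, (source, text) in enumerate(excerpts, start=1):
        # Label each excerpt so the answer can point back to its source file.
        lines.append(f"[{number}] ({source}) {text}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt(
        "What is osmosis?",
        [("biology-ch3.pdf", "Osmosis is the movement of water across a membrane.")],
    ))
```

Labeling each excerpt with its file name is what makes clear file naming pay off: the citation in the answer maps directly to a source you can open.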

Best practices

Keep datasets focused on a single subject for sharp retrieval results.
Name files clearly so you can trace answers back to the right source.
Break massive PDFs into chapters to speed up processing.
Refresh datasets as your material changes to avoid stale context.
Periodically run sample questions to confirm the citations look correct.
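
The last practice, spot-checking with sample questions, can be thought of as a small regression check. The sketch below is conceptual: Noema is used through its interface, so `retrieve` is a hypothetical stand-in for whatever returns the source names cited for a question, and the file names are made up.

```python
from typing import Callable


def check_citations(
    retrieve: Callable[[str], list[str]],
    expectations: dict[str, str],
) -> list[str]:
    """Return the questions whose expected source is missing from the citations."""
    failures = []
    for question, expected_source in expectations.items():
        if expected_source not in retrieve(question):
            failures.append(question)
    return failures


if __name__ == "__main__":
    # Stub retriever used only to make the example runnable.
    def fake_retrieve(question: str) -> list[str]:
        return ["biology.pdf"] if "cell" in question else ["notes.md"]

    samples = {
        "What is a cell?": "biology.pdf",
        "Define entropy": "physics.pdf",
    }
    print(check_citations(fake_retrieve, samples))
```

A question that lands in the failure list usually points to a missing or stale source, or to a dataset that mixes too many subjects.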
