Dataset Integration
Enhance your AI with custom knowledge through dataset integration and Retrieval-Augmented Generation (RAG).
What is RAG?
Retrieval-Augmented Generation (RAG) allows the AI model to access and use information from your custom documents and datasets. Instead of relying only on its training data, the model can search through your provided materials to give more accurate, personalized, and up-to-date responses.
Supported File Types
Documents
- PDF files
- Text files (.txt)
- Markdown files (.md)
- Word documents (.docx)
Structured Data
- CSV files
- JSON datasets
- Custom text collections
Adding Datasets
Method 1: Local Files
1. Tap the "+" button in the dataset section
2. Select "Import Local Files"
3. Choose files from your device or iCloud
4. Wait for processing and indexing
5. Enable the dataset in your chat settings
Method 2: Open Textbook Library
1. Navigate to the Explore page in the app
2. Browse the Open Textbook Library section
3. Select textbooks relevant to your needs
4. Download and integrate automatically
5. Access the knowledge in your chats
How RAG Works
1. Document Processing
When you import documents, Noema breaks them into chunks and creates embeddings (vector representations) of the content. These embeddings make semantic search possible: a query can match a chunk by meaning rather than by exact keywords.
💡 Tip: Keep your device plugged in during this process, as document processing and embedding generation can be battery-intensive.
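To make the chunk-and-embed step concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Noema's actual implementation: the function names `chunk_text` and `toy_embed`, the word-based chunk size, the overlap, and the hashed word-count "embedding" are all hypothetical stand-ins. A real system would use a learned embedding model.

```python
import hashlib
import math


def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into word-based chunks that share `overlap` words with the next chunk."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once this chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


def toy_embed(text: str, dims: int = 64) -> list[float]:
    """Hypothetical stand-in for an embedding model: hash each word into a
    bucket, count occurrences, then L2-normalize the vector."""
    vec = [0.0] * dims
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec
```

Overlapping chunks are a common choice because a sentence that straddles a chunk boundary still appears whole in at least one chunk.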
2. Query Processing
When you ask a question, the system searches through your dataset embeddings to find the most relevant chunks of information related to your query.
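The search step can be sketched as ranking chunks by cosine similarity to the query. The example below is a simplified illustration, not the app's actual retrieval code: it uses plain word-count vectors in place of learned embeddings (so it matches on shared words, not meaning), and the names `bag_of_words`, `cosine_similarity`, and `top_k_chunks` are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Word-count vector; a stand-in for a real embedding."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query: str, chunks: list[str], k: int = 2) -> list[tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    q = bag_of_words(query)
    scored = [(cosine_similarity(q, bag_of_words(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


chunks = [
    "photosynthesis converts light into chemical energy",
    "the french revolution began in 1789",
    "plants use chlorophyll to absorb light",
]
results = top_k_chunks("how do plants absorb light", chunks, k=2)
```

Here `results` ranks the chlorophyll chunk first because it shares the most words with the query. With real embeddings, the ranking would reflect similarity of meaning even when no words overlap.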
3. Context Injection
The relevant information is injected into the AI model's context, allowing it to reference your specific documents when generating responses.
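Context injection usually amounts to placing the retrieved chunks in the prompt ahead of the question. The sketch below shows one possible layout. The prompt wording, the `build_prompt` name, and the character budget `max_chars` are assumptions for illustration and do not describe the template the app actually uses.

```python
def build_prompt(question: str, retrieved_chunks: list[str], max_chars: int = 2000) -> str:
    """Assemble a prompt that places retrieved chunks before the question.

    Chunks are added in order until the (hypothetical) character budget
    would be exceeded, since a model's context window is limited.
    """
    context_parts = []
    used = 0
    for i, chunk in enumerate(retrieved_chunks, start=1):
        entry = f"[{i}] {chunk}"
        if used + len(entry) > max_chars:
            break
        context_parts.append(entry)
        used += len(entry)
    context = "\n".join(context_parts)
    return (
        "Answer the question using the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Because the budget cuts off lower-ranked chunks first, passing chunks in order of relevance keeps the most useful material in the prompt.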
Best Practices
Optimization Tips
- Keep datasets focused on specific topics for better retrieval
- Use clear, descriptive filenames and dataset names
- Break large documents into logical sections
- Regularly update datasets with new information
- Test queries to ensure good retrieval performance
- Monitor storage usage as datasets can be large
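As a rough guide to why storage grows, the size of the embedding index alone can be estimated from the number of chunks. The figures below are assumptions for illustration (384-dimensional vectors stored as 4-byte floats), not values taken from the app, and the estimate excludes the original files and any index overhead.

```python
def embedding_storage_bytes(num_chunks: int, dims: int = 384, bytes_per_value: int = 4) -> int:
    """Estimate raw embedding storage: one vector of `dims` values per chunk.

    `dims` and `bytes_per_value` are assumed defaults, not app settings.
    """
    return num_chunks * dims * bytes_per_value


# 10,000 chunks x 384 values x 4 bytes = 15,360,000 bytes (about 14.6 MiB)
size_bytes = embedding_storage_bytes(10_000)
size_mib = size_bytes / (1024 * 1024)
```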