Noema

    Dataset Integration

    Enhance your AI with custom knowledge through dataset integration and Retrieval-Augmented Generation (RAG).

    What is RAG?

    Retrieval-Augmented Generation (RAG) allows the AI model to access and use information from your custom documents and datasets. Instead of relying only on its training data, the model can search through your provided materials to give more accurate, personalized, and up-to-date responses.

    Supported File Types

    Documents

    • PDF files
    • Text files (.txt)
    • Markdown files (.md)
    • Word documents (.docx)

    Structured Data

    • CSV files
    • JSON datasets
    • Custom text collections

    Adding Datasets

    Method 1: Local Files

    1. Tap the "+" button in the dataset section
    2. Select "Import Local Files"
    3. Choose files from your device or iCloud
    4. Wait for processing and indexing
    5. Enable the dataset in your chat settings

    Method 2: Open Textbook Library

    1. Navigate to the Explore page in the app
    2. Browse the Open Textbook Library section
    3. Select textbooks relevant to your needs
    4. Download and integrate automatically
    5. Access the knowledge in your chats

    How RAG Works

    1. Document Processing

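    When you import documents, Noema breaks them into chunks and creates embeddings (vector representations) of the content. Because passages with similar meaning map to nearby vectors, this enables semantic search: your documents can be matched by meaning rather than by exact keywords.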
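
    The chunk-and-embed step can be illustrated with a minimal sketch. This is not Noema's actual pipeline: the chunk size, the overlap, and the hashed bag-of-words `toy_embed` function are all assumptions chosen to keep the example self-contained. A real system would use a trained embedding model instead.

```python
import hashlib
import math


def chunk_text(text, chunk_size=40, overlap=10):
    """Split text into word-based chunks that overlap by `overlap` words.

    chunk_size and overlap are illustrative defaults, not Noema's settings.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def toy_embed(text, dim=64):
    """Hashed bag-of-words vector, L2-normalized.

    A hypothetical stand-in for a real embedding model.
    """
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


# Build a tiny in-memory index: one (chunk, vector) pair per chunk.
document = " ".join(f"word{i}" for i in range(100))
index = [(chunk, toy_embed(chunk)) for chunk in chunk_text(document)]
```

    The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.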

    💡 Tip: Keep your device plugged in during this process, because document processing and embedding generation can be battery-intensive.

    2. Query Processing

    When you ask a question, the system searches through your dataset embeddings to find the most relevant chunks of information related to your query.
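
    A common way to implement this kind of search is to rank chunks by cosine similarity between the query vector and each chunk vector. The sketch below assumes that approach; the source does not say which similarity measure Noema uses. The three-dimensional vectors are made-up values for illustration, and `top_k_chunks` is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, indexed_chunks, k=3):
    """Return the k highest-scoring chunks as (score, text) pairs.

    indexed_chunks is a list of (chunk_text, vector) pairs.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed_chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up vectors standing in for real embeddings.
index = [
    ("Photosynthesis converts light into chemical energy.", [0.9, 0.1, 0.0]),
    ("Mitochondria produce ATP.", [0.1, 0.9, 0.0]),
    ("The French Revolution began in 1789.", [0.0, 0.0, 1.0]),
]
query_vec = [1.0, 0.2, 0.0]  # pretend embedding of a question about photosynthesis
results = top_k_chunks(query_vec, index, k=2)
```

    Here the photosynthesis chunk ranks first and the unrelated history chunk is left out of the top two.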

    3. Context Injection

    The relevant information is injected into the AI model's context, allowing it to reference your specific documents when generating responses.
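
    Context injection usually amounts to placing the retrieved passages in the prompt ahead of the user's question. The following is a minimal sketch under that assumption; the prompt wording, the `build_prompt` name, and the character budget are hypothetical and not taken from Noema.

```python
def build_prompt(question, retrieved_chunks, max_context_chars=2000):
    """Assemble a prompt that places retrieved chunks ahead of the user's question.

    max_context_chars is an illustrative budget; real systems usually count tokens.
    """
    context_parts = []
    used = 0
    for i, chunk in enumerate(retrieved_chunks, start=1):
        entry = f"[{i}] {chunk}"
        if used + len(entry) > max_context_chars:
            break  # stop once the context budget is spent
        context_parts.append(entry)
        used += len(entry)
    context = "\n".join(context_parts) if context_parts else "(no relevant passages found)"
    return (
        "Answer the question using the reference passages below. "
        "If they do not contain the answer, say so.\n\n"
        f"Reference passages:\n{context}\n\n"
        f"Question: {question}\n"
    )


prompt = build_prompt(
    "What does photosynthesis produce?",
    ["Photosynthesis converts light into chemical energy.", "Mitochondria produce ATP."],
)
```

    The budget matters because a model's context window is limited: the more passages are injected, the less room remains for the conversation itself.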

    Best Practices

    Optimization Tips

    • Keep datasets focused on specific topics for better retrieval
    • Use clear, descriptive filenames and dataset names
    • Break large documents into logical sections
    • Regularly update datasets with new information
    • Test queries to ensure good retrieval performance
    • Monitor storage usage as datasets can be large
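
    One way to make "test queries" concrete is to keep a short list of questions whose answers you know live in a particular file, then measure how often that file shows up in the top results. The sketch below is an assumption about how such a check could look, not a Noema feature: `hit_rate_at_k`, `keyword_retrieve`, and the toy corpus are all hypothetical, and the keyword-overlap retriever merely stands in for real embedding search.

```python
# Hypothetical corpus: source name -> a few representative words.
TOY_CORPUS = {
    "biology.pdf": "cells photosynthesis mitochondria energy",
    "history.md": "revolution empire treaty war",
    "cooking.txt": "recipe oven flour yeast",
}


def keyword_retrieve(query, k):
    """Stand-in retriever: rank sources by the number of words shared with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        TOY_CORPUS,
        key=lambda name: len(query_words & set(TOY_CORPUS[name].split())),
        reverse=True,
    )
    return ranked[:k]


def hit_rate_at_k(test_cases, retrieve, k=3):
    """Fraction of test queries whose expected source appears in the top-k results."""
    if not test_cases:
        return 0.0
    hits = 0
    for query, expected_source in test_cases:
        if expected_source in retrieve(query, k):
            hits += 1
    return hits / len(test_cases)


test_cases = [
    ("how does photosynthesis store energy", "biology.pdf"),
    ("when was the treaty signed", "history.md"),
    ("how long to proof bread dough", "cooking.txt"),  # shares no words with any source
]
score = hit_rate_at_k(test_cases, keyword_retrieve, k=1)
```

    In this toy run the third query misses at k=1 because it shares no words with the cooking entry, which is the kind of gap a test set is meant to surface.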