Noema

    Running LLMs Locally

    Learn how Noema brings the power of large language models directly to your device.

    What are Local LLMs?

    Local Large Language Models (LLMs) are AI models that run entirely on your device rather than in the cloud. This means all processing happens on your iPhone or iPad, ensuring complete privacy and offline functionality.

    Cloud AI

    • Your text is sent to remote servers
    • Requires a constant internet connection
    • Data may be stored and analyzed
    • Usage limits and costs
    • Fast processing (powerful servers)

    Local AI (Noema)

    • All processing on your device
    • Works completely offline
    • Zero data transmission
    • No usage limits
    • Moderate speed (mobile hardware)

    How It Works

    1. Model Storage

    AI models are downloaded once and stored locally on your device. These files contain all the "knowledge" and capabilities the AI needs to understand and respond to your queries.
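As a small illustration of local model storage, here is a minimal sketch of listing downloaded model files and their on-disk sizes. The directory name is hypothetical and not Noema's actual storage path:

```python
from pathlib import Path

def list_local_models(models_dir: str) -> list[tuple[str, float]]:
    """Return (filename, size in GB) for each downloaded .gguf model file."""
    results = []
    for path in sorted(Path(models_dir).glob("*.gguf")):
        size_gb = path.stat().st_size / 1e9  # bytes -> gigabytes
        results.append((path.name, round(size_gb, 2)))
    return results
```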

    2. On-Device Processing

    When you type a message, your device's processor (CPU/GPU) runs the AI model to generate responses. Modern Apple Silicon chips are surprisingly capable at this task.
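Conceptually, the model generates a response one token at a time, feeding each output back in as input. The toy sketch below illustrates that autoregressive loop; the lookup table is a hypothetical stand-in for the neural network that llama.cpp actually runs on your device's CPU/GPU:

```python
def toy_next_token(context: list[str]) -> str:
    # Hypothetical stand-in for one forward pass of the model on-device.
    table = {"Hello": "there", "there": "!", "!": "<eos>"}
    return table.get(context[-1], "<eos>")

def generate(prompt: list[str], max_tokens: int = 8) -> list[str]:
    """Autoregressive generation: append one predicted token per step."""
    tokens = list(prompt)
    for _ in range(max_tokens):
        nxt = toy_next_token(tokens)  # each step runs entirely on-device
        if nxt == "<eos>":            # stop at the end-of-sequence marker
            break
        tokens.append(nxt)
    return tokens
```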

    3. Zero Network Usage

    Once a model is downloaded, no internet connection is required for basic AI chat functionality. Your conversations never leave your device.

    Model Sizes & Performance

    Model Size | Quality   | Speed    | Memory Usage | Best For
    1B-3B      | Good      | Fast     | 1-3GB        | Quick tasks, older devices (SLMs)
    7B-8B      | Very Good | Moderate | 6-10GB       | General use on newer devices
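A rough rule of thumb for the memory column: the weights alone take roughly parameters × bits-per-weight ÷ 8 bytes. The sketch below computes that estimate; actual runtime usage is higher than the weights-only figure because of the context (KV cache) and runtime overhead, which is why the table's ranges are larger:

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone (excludes KV cache and overhead)."""
    return num_params * bits_per_weight / 8 / 1e9

# Example: a 7B-parameter model quantized to ~4 bits per weight
# needs about 3.5 GB for its weights, before context overhead.
```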

    Hardware Warning

    Devices with chips older than A13 Bionic have limited or no GPU offload. GGUF models run significantly slower and MLX is unsupported on these devices. We recommend using small language models (SLMs) such as 1B–3B for best results.

    Technical Foundation

    Noema is built on proven open-source technologies:

    • llama.cpp: Optimized C++ inference engine for running LLMs efficiently on consumer hardware
    • GGUF Format: Modern quantized model format that balances quality with file size
    • Apple Metal: Leverages Apple's GPU acceleration for faster inference
    • Quantization: Reduces model size while maintaining quality through advanced compression
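As a small illustration of the GGUF format mentioned above, every GGUF file starts with a 4-byte magic number spelling "GGUF". The sketch below checks that magic; it is an illustrative check, not Noema's actual model loader:

```python
import struct

GGUF_MAGIC = 0x46554747  # little-endian uint32 of the ASCII bytes "GGUF"

def looks_like_gguf(path: str) -> bool:
    """Check the 4-byte magic that every GGUF model file begins with."""
    with open(path, "rb") as f:
        header = f.read(4)
    if len(header) < 4:
        return False
    (magic,) = struct.unpack("<I", header)
    return magic == GGUF_MAGIC
```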

    ⚡ Performance Tips

    • Close background apps to free up memory for larger models
    • Enabling Low Power Mode can improve stability
    • • Newer devices with more RAM can handle larger, better models
    • • GPU acceleration is automatically used when available