Running LLMs Locally
Learn how Noema brings the power of large language models directly to your device.
What are Local LLMs?
Local Large Language Models (LLMs) are AI models that run entirely on your device rather than in the cloud. This means all processing happens on your iPhone or iPad, ensuring complete privacy and offline functionality.
Cloud AI
- Your text is sent to remote servers
- Requires a constant internet connection
- Data may be stored and analyzed
- Usage limits and costs
- Fast processing (powerful servers)
Local AI (Noema)
- All processing on your device
- Works completely offline
- Zero data transmission
- No usage limits
- Moderate speed (mobile hardware)
How It Works
1. Model Storage
AI models are downloaded once and stored locally on your device. These files contain all the "knowledge" and capabilities the AI needs to understand and respond to your queries.
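As a rough sketch, model storage can be as simple as a one-time download into the app's sandbox. The URL, file name, and flow below are illustrative placeholders, not Noema's actual implementation:

```swift
import Foundation

// Sketch: fetch a GGUF model once and keep it in the Documents directory.
// The remote URL is a placeholder; Noema's real download flow may differ.
func downloadModelIfNeeded() async throws -> URL {
    let docs = try FileManager.default.url(
        for: .documentDirectory, in: .userDomainMask,
        appropriateFor: nil, create: true)
    let destination = docs.appendingPathComponent("model.gguf")

    // Reuse an existing copy -- after the first download, no network is needed.
    if FileManager.default.fileExists(atPath: destination.path) {
        return destination
    }

    let remote = URL(string: "https://example.com/models/model.gguf")! // placeholder
    let (tempFile, _) = try await URLSession.shared.download(from: remote)
    try FileManager.default.moveItem(at: tempFile, to: destination)
    return destination
}
```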
2. On-Device Processing
When you type a message, your device's processor (CPU/GPU) runs the AI model to generate responses. Modern Apple Silicon chips are surprisingly capable at this task.
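To make this concrete, here is a minimal sketch of on-device token generation. `LlamaModel` is a hypothetical interface standing in for llama.cpp bindings; real bindings differ in detail, but the shape of the loop is the same: the model predicts one token at a time, entirely on-device.

```swift
// Hypothetical interface standing in for real llama.cpp bindings.
protocol LlamaModel {
    func tokenize(_ text: String) -> [Int32]
    func detokenize(_ tokens: [Int32]) -> String
    func sampleNextToken(context: [Int32]) -> Int32  // one forward pass on CPU/GPU
    var endOfSequenceToken: Int32 { get }
}

// Generate a reply one token at a time; nothing leaves the device.
func generateReply(model: LlamaModel, prompt: String, maxTokens: Int = 256) -> String {
    var context = model.tokenize(prompt)  // text -> token IDs
    var reply = ""
    for _ in 0..<maxTokens {
        let next = model.sampleNextToken(context: context)
        if next == model.endOfSequenceToken { break }
        context.append(next)
        reply += model.detokenize([next])
    }
    return reply
}
```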
3. Zero Network Usage
Once a model is downloaded, no internet connection is required for basic AI chat functionality. Your conversations never leave your device.
Model Sizes & Performance
| Model Size | Quality | Speed | Memory Usage | Best For |
|---|---|---|---|---|
| 1B–3B (SLMs) | Good | Fast | 1–3 GB | Quick tasks, older devices |
| 7B–8B | Very Good | Moderate | 6–10 GB | General use on newer devices |
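A back-of-the-envelope way to read the Memory Usage column: a model's weight footprint is roughly its parameter count times bits-per-weight divided by eight, plus runtime overhead for the KV cache and buffers. The 20% overhead factor below is an illustrative assumption:

```swift
// Rough memory estimate for a quantized model.
// weights ≈ parameters x bitsPerWeight / 8; the overhead factor is an assumption.
func estimatedMemoryGiB(parameters: Double, bitsPerWeight: Double) -> Double {
    let weightBytes = parameters * bitsPerWeight / 8
    let withOverhead = weightBytes * 1.2   // KV cache + buffers, rough allowance
    return withOverhead / 1_073_741_824    // bytes -> GiB
}

// A 7B model at 4 bits per weight comes out near 3.9 GiB; larger contexts
// and higher-bit quants push it toward the table's 6-10 GB range.
let sevenBAtQ4 = estimatedMemoryGiB(parameters: 7e9, bitsPerWeight: 4)
```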
Hardware Warning
Devices with chips older than the A13 Bionic offer limited or no GPU offload: GGUF models run significantly slower, and MLX is unsupported. On such hardware we recommend small language models (SLMs) in the 1B–3B range for best results.
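One way an app can act on this is to gate model suggestions on installed RAM. `ProcessInfo.physicalMemory` is a real Foundation API; the tier thresholds below are illustrative assumptions, not Noema's actual logic:

```swift
import Foundation

// Pick a model tier from installed RAM. Thresholds are illustrative.
func recommendedModelTier() -> String {
    let ramGiB = Double(ProcessInfo.processInfo.physicalMemory) / 1_073_741_824
    switch ramGiB {
    case ..<6:  return "1B-3B (SLM)"  // older devices, limited GPU offload
    case ..<8:  return "3B-7B"
    default:    return "7B-8B"
    }
}
```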
Technical Foundation
Noema is built on proven open-source technologies:
- llama.cpp: Optimized C++ inference engine for running LLMs efficiently on consumer hardware
- GGUF Format: Quantized model file format that balances quality with file size
- Apple Metal: Leverages Apple's GPU acceleration for faster inference
- Quantization: Shrinks models by storing weights at reduced numeric precision (e.g., 4-bit instead of 16-bit) with minimal quality loss
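To make the quantization point concrete, here is a simplified sketch in the spirit of GGUF's Q4_0 scheme, where 32 weights share one scale and each weight is stored as a 4-bit integer. Real GGUF kernels pack two 4-bit values per byte and use fp16 scales; this version unpacks them for readability:

```swift
// Simplified Q4_0-style block: one scale per 32 weights, 4-bit quants.
struct Q4Block {
    var scale: Float      // per-block scale factor
    var quants: [UInt8]   // 32 values in 0...15 (4 bits of information each)
}

// Dequantize: recenter each 4-bit value and apply the block scale.
func dequantize(_ block: Q4Block) -> [Float] {
    block.quants.map { q in block.scale * (Float(q) - 8) }
}
```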
⚡ Performance Tips
- Close background apps to free up memory for larger models (see the memory-check sketch below)
- Enabling Low Power Mode can improve thermal stability during long sessions
- Newer devices with more RAM can handle larger, better models
- GPU acceleration is used automatically when available
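Related to the memory tip above, iOS exposes `os_proc_available_memory()` for checking how much the app can still allocate. A minimal sketch, assuming a 1.5x headroom factor that is purely illustrative:

```swift
import os

// Check whether a model of the given size can plausibly be loaded.
// The 1.5x headroom for KV cache and buffers is an illustrative assumption.
func canLoad(modelBytes: UInt64) -> Bool {
    let available = UInt64(os_proc_available_memory())
    return available > modelBytes * 3 / 2
}
```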