Running LLMs Locally
Learn how Noema brings the power of large language models directly to your device.
What are Local LLMs?
Local Large Language Models (LLMs) are AI models that run entirely on your device rather than in the cloud. This means all processing happens on your iPhone or iPad, ensuring complete privacy and offline functionality.
Cloud AI
- Your text is sent to remote servers
- Requires a constant internet connection
- Data may be stored and analyzed
- Usage limits and costs
- Fast processing (powerful servers)
Local AI (Noema)
- All processing on your device
- Works completely offline
- Zero data transmission
- No usage limits
- Moderate speed (mobile hardware)
How It Works
1. Model Storage
AI models are downloaded once and stored locally on your device. These files contain all the "knowledge" and capabilities the AI needs to understand and respond to your queries.
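As a rough sketch, model storage can be as simple as a one-time download into the app's sandbox. The URL, file name, and flow below are illustrative placeholders, not Noema's actual implementation:

```swift
import Foundation

// Sketch: fetch a GGUF model once and keep it in the Documents directory.
// The remote URL is a placeholder; Noema's real download flow may differ.
func downloadModelIfNeeded() async throws -> URL {
    let docs = try FileManager.default.url(
        for: .documentDirectory, in: .userDomainMask,
        appropriateFor: nil, create: true)
    let destination = docs.appendingPathComponent("model.gguf")

    // Reuse an existing copy -- after the first download, no network is needed.
    if FileManager.default.fileExists(atPath: destination.path) {
        return destination
    }

    let remote = URL(string: "https://example.com/models/model.gguf")! // placeholder
    let (tempFile, _) = try await URLSession.shared.download(from: remote)
    try FileManager.default.moveItem(at: tempFile, to: destination)
    return destination
}
```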
2. On-Device Processing
When you type a message, your device's processor (CPU/GPU) runs the AI model to generate responses. Modern Apple Silicon chips are surprisingly capable at this task.
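To make this concrete, here is a minimal sketch of on-device token generation. `LlamaModel` is a hypothetical interface standing in for llama.cpp bindings; real bindings differ in detail, but the shape of the loop is the same: the model predicts one token at a time, entirely on-device.

```swift
// Hypothetical interface standing in for real llama.cpp bindings.
protocol LlamaModel {
    func tokenize(_ text: String) -> [Int32]
    func detokenize(_ tokens: [Int32]) -> String
    func sampleNextToken(context: [Int32]) -> Int32  // one forward pass on CPU/GPU
    var endOfSequenceToken: Int32 { get }
}

// Generate a reply one token at a time; nothing leaves the device.
func generateReply(model: LlamaModel, prompt: String, maxTokens: Int = 256) -> String {
    var context = model.tokenize(prompt)  // text -> token IDs
    var reply = ""
    for _ in 0..<maxTokens {
        let next = model.sampleNextToken(context: context)
        if next == model.endOfSequenceToken { break }
        context.append(next)
        reply += model.detokenize([next])
    }
    return reply
}
```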
3. Zero Network Usage
Once a model is downloaded, no internet connection is required for basic AI chat functionality. Your conversations never leave your device.
Model Sizes & Performance
| Model Size | Quality | Speed | Memory Usage | Best For |
|---|---|---|---|---|
| 1B–3B (SLMs) | Good | Fast | 1–3 GB | Quick tasks, older devices |
| 7B–8B | Very Good | Moderate | 6–10 GB | General use on newer devices |
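A back-of-the-envelope way to read the Memory Usage column: a model's weight footprint is roughly its parameter count times bits-per-weight divided by eight, plus runtime overhead for the KV cache and buffers. The 20% overhead factor below is an illustrative assumption:

```swift
// Rough memory estimate for a quantized model.
// weights ≈ parameters x bitsPerWeight / 8; the overhead factor is an assumption.
func estimatedMemoryGiB(parameters: Double, bitsPerWeight: Double) -> Double {
    let weightBytes = parameters * bitsPerWeight / 8
    let withOverhead = weightBytes * 1.2   // KV cache + buffers, rough allowance
    return withOverhead / 1_073_741_824    // bytes -> GiB
}

// A 7B model at 4 bits per weight comes out near 3.9 GiB; larger contexts
// and higher-bit quants push it toward the table's 6-10 GB range.
let sevenBAtQ4 = estimatedMemoryGiB(parameters: 7e9, bitsPerWeight: 4)
```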
Hardware Warning
Devices with chips older than the A13 Bionic offer limited or no GPU offload: GGUF models run significantly slower, and MLX is unsupported. On such hardware we recommend small language models (SLMs) in the 1B–3B range for best results.
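One way an app can act on this is to gate model suggestions on installed RAM. `ProcessInfo.physicalMemory` is a real Foundation API; the tier thresholds below are illustrative assumptions, not Noema's actual logic:

```swift
import Foundation

// Pick a model tier from installed RAM. Thresholds are illustrative.
func recommendedModelTier() -> String {
    let ramGiB = Double(ProcessInfo.processInfo.physicalMemory) / 1_073_741_824
    switch ramGiB {
    case ..<6:  return "1B-3B (SLM)"  // older devices, limited GPU offload
    case ..<8:  return "3B-7B"
    default:    return "7B-8B"
    }
}
```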
Technical Foundation
Noema is built on proven open-source technologies:
- llama.cpp: Optimized C++ inference engine for running LLMs efficiently on consumer hardware
- GGUF Format: Quantized model file format that balances quality with file size
- Apple Metal: Leverages Apple's GPU acceleration for faster inference
- Quantization: Shrinks models by storing weights at reduced numeric precision (e.g., 4-bit instead of 16-bit) with minimal quality loss
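To make the quantization point concrete, here is a simplified sketch in the spirit of GGUF's Q4_0 scheme, where 32 weights share one scale and each weight is stored as a 4-bit integer. Real GGUF kernels pack two 4-bit values per byte and use fp16 scales; this version unpacks them for readability:

```swift
// Simplified Q4_0-style block: one scale per 32 weights, 4-bit quants.
struct Q4Block {
    var scale: Float      // per-block scale factor
    var quants: [UInt8]   // 32 values in 0...15 (4 bits of information each)
}

// Dequantize: recenter each 4-bit value and apply the block scale.
func dequantize(_ block: Q4Block) -> [Float] {
    block.quants.map { q in block.scale * (Float(q) - 8) }
}
```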
⚡ Performance Tips
- Close background apps to free up memory for larger models (see the memory-check sketch below)
- Enabling Low Power Mode can improve thermal stability during long sessions
- Newer devices with more RAM can handle larger, better models
- GPU acceleration is used automatically when available
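Related to the memory tip above, iOS exposes `os_proc_available_memory()` for checking how much the app can still allocate. A minimal sketch, assuming a 1.5x headroom factor that is purely illustrative:

```swift
import os

// Check whether a model of the given size can plausibly be loaded.
// The 1.5x headroom for KV cache and buffers is an illustrative assumption.
func canLoad(modelBytes: UInt64) -> Bool {
    let available = UInt64(os_proc_available_memory())
    return available > modelBytes * 3 / 2
}
```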