Running LLMs locally
Learn how Noema brings the power of large language models directly to your device without sacrificing privacy.
What are local LLMs?
Local Large Language Models (LLMs) run entirely on your device rather than in the cloud. Every response is generated on your iPhone or iPad so conversations stay private and keep working even without connectivity.
Cloud AI
- Sends your prompts to remote servers
- Requires constant internet access
- Stores conversation history externally
- Adds usage limits or paywalls
- Leverages powerful, centralized hardware
Local AI (Noema)
- Processes everything on your device
- Works completely offline
- Never transmits conversation data
- Includes every feature for free
- Runs on optimized mobile hardware
How it works
Model storage
Download models once and store them locally so Noema always has the knowledge it needs.
On-device processing
Apple Silicon CPUs and GPUs run each prompt through the model in real time, optimized for efficiency.
Zero network usage
After setup, chats stay offline. Noema keeps your prompts on-device and out of third-party logs.
Model sizes & performance
| Model size | Quality | Speed | Memory use | Best for |
|---|---|---|---|---|
| 1B–3B | Good | Fast | 1–3 GB | Quick replies and older devices (small language models) |
| 7B–8B | Very good | Moderate | 6–10 GB | Everyday use on modern devices |
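The memory figures above follow from simple arithmetic: a model's weight footprint is roughly its parameter count times the average bits stored per weight after quantization, with some extra for context and runtime buffers. A minimal sketch (the ~4.5 bits-per-weight figure for a 4-bit quant with scales is an illustrative assumption, not a Noema-specific number):

```python
def estimated_weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Rough on-disk / in-memory size of a model's quantized weights.

    bits_per_weight is the *average* after quantization -- e.g. roughly
    4.5 for a 4-bit quant once per-block scales are counted. Actual
    memory use is higher due to the KV cache and runtime buffers.
    """
    return num_params * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB

# A 7B model at a ~4.5-bit average quantization: about 3.9 GB of weights,
# consistent with the 6-10 GB total footprint in the table above.
print(estimated_weight_size_gb(7e9, 4.5))
```

The same formula explains why 1B–3B models fit comfortably on older devices: at 4–5 bits per weight they need only about 0.5–2 GB for weights.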
Hardware warning
Devices with chips older than A13 Bionic have limited or no GPU offload. GGUF models run significantly slower and MLX is unsupported on these devices. Choose lighter 1B–3B models for the best experience.
Technical foundation
Noema builds on battle-tested, open-source tooling tailored for mobile hardware:
- llama.cpp: Optimized C++ inference engine that keeps models responsive on consumer devices.
- GGUF format: Modern quantized model format balancing quality with manageable file sizes.
- Apple Metal: Harnesses Apple GPU acceleration for lower latency during chat.
- Quantization: Compresses model weights intelligently so you can run bigger models locally.
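The core idea behind quantization can be shown in a few lines: store a block of float weights as small integers plus one shared scale factor, trading a little precision for a large size reduction. This toy absmax sketch is illustrative only, not the actual GGUF quantization scheme:

```python
def quantize_block(weights, levels=16):
    """Toy absmax quantization: map a block of floats to `levels`
    signed integer buckets plus one float scale.
    (Illustrative only -- real GGUF quants are more sophisticated.)"""
    scale = max(abs(w) for w in weights) / (levels // 2 - 1)
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize_block(quantized, scale):
    """Recover approximate float weights from integers and the scale."""
    return [q * scale for q in quantized]

block = [0.12, -0.07, 0.31, 0.02]
q, scale = quantize_block(block)
approx = dequantize_block(q, scale)  # close to the original values
```

Each weight now needs only 4 bits (16 levels) instead of 16 or 32, which is why a quantized 7B model fits in a few gigabytes.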
⚡ Performance tips
- Close background apps to free memory for larger models.
- Enable Low Power Mode to stabilize long sessions.
- Newer devices with more RAM handle larger quantizations.
- GPU acceleration is automatic when hardware allows.
