I recently spend time benchmarking multiple local models on the premise of how good they are at finding data in large context. Imagine you have confidential documents and you’d like to get information out of them. Local AI is a great solution to this. Instead of sharing the documents with a cloud AI you simple feed it to your local AI and give instructions.
But which one to choose from? There’re multiple factors playing into this:
- Size of the model. Does it fit your ram?
- Moe or dense Model? Moe models are much faster but less smart
- Do you need additional world knowledge?
- How well does it adhere to your prompt?
- How much does it hallucinate?
- How good is it long context understanding?
The task
I invented a task to measure models on. I gave it a very large Swift file (50k tokens) and asked it a question that requires it to understand what the code does, find specific snippets, and adhere to a guardrail (don’t propose changes, just give me snippets from the code):
WHICH CODE SNIPPETS FROM THIS CODE WOULD I NEED TO UNDERSTAND HOW TO ADD A PREFIX IMAGE BEFORE TITLE. DON’T CHANGE ANY CODE, JUST GIVE ME THE SNIPPETS I NEED.
I ran this against multiple models locally.
Hardware
M4 Macbook Pro Max with 128 GB Memory running Sequoia 15.7.3 and LM Studio 0.4.5
Included models
I tried to include models from smaller to larger sizes, including very large models which only run on 128GB of ram because they’re heavily quantized or dumbed down via REAP.
Performance
| Model | Prompt Tokens | Response Tokens | Parse Time (s) | Gen Time (s) | Tok/sec |
|---|---|---|---|---|---|
| mistralai/ministral-3-3b | 49409 | 1073 | 226.111 | 56.287 | 19.45 |
| nvidia/nemotron-3-nano | 48890 | 2109 | 63.985 | 46.311 | 45.54 |
| zai-org/glm-4.7-flash | 45879 | 1420 | 526.756 | 78.322 | 18.13 |
| granite-4.0-h-tiny (4bit) | 45597 | 337 | 26.363 | 3.373 | 99.91 |
| granite-4.0-h-tiny (8bit) | 51336 | 486 | 22.058 | 4.328 | 112.30 |
| Nemotron-3-Nano-REAP-21B-A3B | 48970 | 1382 | 57.468 | 26.795 | 51.58 |
| qwen/qwen3-coder-next (mlx) | 46677 | 2212 | 67.009 | 54.034 | 40.94 |
| qwen/qwen3-coder-next (gguf) | 46677 | 1857 | 181.260 | 100.780 | 18.42 |
| MiniMax-M2.1-REAP-50 | 52981 | 6114 | 532.726 | 217.332 | 8.01 |
| mistralai/devstral-small-2-2512 | 48877 | 568 | 644.479 | 76.419 | 7.43 |
| GLM-4.5-Air-REAP-82B (run 1) | 45879 | 1439 | 564.088 | 200.062 | 7.19 |
| GLM-4.5-Air-REAP-82B (run 2) | 45879 | 1439 | 564.088 | 200.062 | 7.19 |
| qwen/qwen3-4b-thinking-2507 | 46679 | 8207 | 109.285 | 369.098 | 17.16 |
| qwen/qwen3-4b-2507 | 46677 | 2320 | 121.624 | 59.372 | 12.82 |
| openai/gpt-oss-20b | 46073 | 4178 | 75.111 | 68.333 | 29.11 |
| mlx-community/gpt-oss-120b | 46073 | 2920 | 95.180 | 92.807 | 31.46 |
| qwen/qwen3-30b-a3b-2507 | 46677 | 1663 | 152.328 | 51.812 | 8.15 |
| Qwen3.5-35B-A3B | 50195 | 3778 | 77.583 | 162.419 | 15.74 |
| Qwen3.5-122B-A10B | 50195 | 2808 | 263.242 | 180.967 | 15.52 |
| Qwen3.5-27B | 50195 | 7060 | 395.732 | 211.067 | 11.64 |
Evaluation
| Model | Finding | Adhering | Hallucination | Verbatim | Overall |
|---|---|---|---|---|---|
| Opus 4.5 | 5 | 6.8 | 9 | 6.6 | 6 |
| mistralai/ministral-3-3b | 1.8 | 1 | 1.2 | 1.2 | 1.4 |
| nvidia/nemotron-3-nano | 0.4 | 0.8 | 0.4 | 0.4 | 0.4 |
| zai-org/glm-4.7-flash | 2 | 6.2 | 5.6 | 4.2 | 2.6 |
| granite-4.0-h-tiny (8bit) | 0 | 0 | 2 | 0 | 0 |
| Nemotron-3-Nano-REAP-21B-A3B | 0 | 0 | 0 | 0 | 0 |
| qwen/qwen3-coder-next (mlx) | 3.6 | 5 | 7.2 | 4.6 | 4.3 |
| qwen/qwen3-coder-next (gguf) | 2.8 | 4 | 3.4 | 1.8 | 2.65 |
| MiniMax-M2.1-REAP-50 | 1.4 | 2.8 | 2 | 2 | 1.6 |
| mistralai/devstral-small-2-2512 | 0.8 | 2.8 | 3.6 | 2.2 | 1.5 |
| GLM-4.5-Air-REAP-82B (run 1) | 2.4 | 1.2 | 1.8 | 1.8 | 1.7 |
| GLM-4.5-Air-REAP-82B (run 2) | 3.2 | 4 | 6 | 4 | 3.6 |
| qwen/qwen3-4b-thinking-2507 | 1.8 | 2.6 | 2.2 | 2.2 | 2.1 |
| qwen/qwen3-4b-2507 | 2 | 2.2 | 2.4 | 3 | 2.2 |
| openai/gpt-oss-20b | 3.4 | 7 | 6.4 | 3.6 | 3.8 |
| mlx-community/gpt-oss-120b | 6.4 | 4.8 | 5.4 | 3.2 | 4.8 |
| Qwen3.5-35B-A3B | 3.4 | 5 | 6.6 | 3 | 3.2 |
| Qwen3.5-122B-A10B | 3.2 | 6.6 | 9.2 | 6.2 | 4 |
| Qwen3.5-27B | 3.6 | 6.4 | 6.8 | 5.8 | 4 |