Can you process large texts with local LLMs?

I recently spend time benchmarking multiple local models on the premise of how good they are at finding data in large context. Imagine you have confidential documents and you’d like to get information out of them. Local AI is a great solution to this. Instead of sharing the documents with a cloud AI you simple feed it to your local AI and give instructions.

But which one to choose from? There’re multiple factors playing into this:

Size of the model. Does it fit your ram?
Moe or dense Model? Moe models are much faster but less smart
Do you need additional world knowledge?
How well does it adhere to your prompt?
How much does it hallucinate?
How good is it long context understanding?

The task

I invented a task to measure models on. I gave it a very large Swift file (50k tokens) and asked it a question that requires it to understand what the code does, find specific snippets, and adhere to a guardrail (don’t propose changes, just give me snippets from the code):

WHICH CODE SNIPPETS FROM THIS CODE WOULD I NEED TO UNDERSTAND HOW TO ADD A PREFIX IMAGE BEFORE TITLE. DON’T CHANGE ANY CODE, JUST GIVE ME THE SNIPPETS I NEED.

I ran this against multiple models locally.

Hardware

M4 Macbook Pro Max with 128 GB Memory running Sequoia 15.7.3 and LM Studio 0.4.5

Included models

I tried to include models from smaller to larger sizes, including very large models which only run on 128GB of ram because they’re heavily quantized or dumbed down via REAP.

Performance

Model	Prompt Tokens	Response Tokens	Parse Time (s)	Gen Time (s)	Tok/sec
mistralai/ministral-3-3b	49409	1073	226.111	56.287	19.45
nvidia/nemotron-3-nano	48890	2109	63.985	46.311	45.54
zai-org/glm-4.7-flash	45879	1420	526.756	78.322	18.13
granite-4.0-h-tiny (4bit)	45597	337	26.363	3.373	99.91
granite-4.0-h-tiny (8bit)	51336	486	22.058	4.328	112.30
Nemotron-3-Nano-REAP-21B-A3B	48970	1382	57.468	26.795	51.58
qwen/qwen3-coder-next (mlx)	46677	2212	67.009	54.034	40.94
qwen/qwen3-coder-next (gguf)	46677	1857	181.260	100.780	18.42
MiniMax-M2.1-REAP-50	52981	6114	532.726	217.332	8.01
mistralai/devstral-small-2-2512	48877	568	644.479	76.419	7.43
GLM-4.5-Air-REAP-82B (run 1)	45879	1439	564.088	200.062	7.19
GLM-4.5-Air-REAP-82B (run 2)	45879	1439	564.088	200.062	7.19
qwen/qwen3-4b-thinking-2507	46679	8207	109.285	369.098	17.16
qwen/qwen3-4b-2507	46677	2320	121.624	59.372	12.82
openai/gpt-oss-20b	46073	4178	75.111	68.333	29.11
mlx-community/gpt-oss-120b	46073	2920	95.180	92.807	31.46
qwen/qwen3-30b-a3b-2507	46677	1663	152.328	51.812	8.15
Qwen3.5-35B-A3B	50195	3778	77.583	162.419	15.74
Qwen3.5-122B-A10B	50195	2808	263.242	180.967	15.52
Qwen3.5-27B	50195	7060	395.732	211.067	11.64

Evaluation

Model	Finding	Adhering	Hallucination	Verbatim	Overall
Opus 4.5	5	6.8	9	6.6	6
mistralai/ministral-3-3b	1.8	1	1.2	1.2	1.4
nvidia/nemotron-3-nano	0.4	0.8	0.4	0.4	0.4
zai-org/glm-4.7-flash	2	6.2	5.6	4.2	2.6
granite-4.0-h-tiny (8bit)	0	0	2	0	0
Nemotron-3-Nano-REAP-21B-A3B	0	0	0	0	0
qwen/qwen3-coder-next (mlx)	3.6	5	7.2	4.6	4.3
qwen/qwen3-coder-next (gguf)	2.8	4	3.4	1.8	2.65
MiniMax-M2.1-REAP-50	1.4	2.8	2	2	1.6
mistralai/devstral-small-2-2512	0.8	2.8	3.6	2.2	1.5
GLM-4.5-Air-REAP-82B (run 1)	2.4	1.2	1.8	1.8	1.7
GLM-4.5-Air-REAP-82B (run 2)	3.2	4	6	4	3.6
qwen/qwen3-4b-thinking-2507	1.8	2.6	2.2	2.2	2.1
qwen/qwen3-4b-2507	2	2.2	2.4	3	2.2
openai/gpt-oss-20b	3.4	7	6.4	3.6	3.8
mlx-community/gpt-oss-120b	6.4	4.8	5.4	3.2	4.8
Qwen3.5-35B-A3B	3.4	5	6.6	3	3.2
Qwen3.5-122B-A10B	3.2	6.6	9.2	6.2	4
Qwen3.5-27B	3.6	6.4	6.8	5.8	4