Benchmarking Local Embeddings
For one of my side projects I started investigating different embedding models to determine which one to use. There are already a variety of embedding benchmarks, such as the MTEB leaderboard, but I had slightly different requirements, which is why I ran my own benchmark. I'm sharing the results here.
Requirements
- I wanted to test models supported by Rust crates such as Embed Anything or Fastembed
- The embedding models should be fast, as they will run privately on local hardware as part of a desktop app
- The generated embeddings should not be huge, as I will store many embeddings in a database
- Their primary use case is RAG search, so it is more important to get good results on “contains” and “does not contain” tests than to fully capture the semantics of the content.
- My content is primarily in English, so I don’t need multilingual embeddings
- As the embeddings are created locally, I’d like to make the chunks as large as possible while still preserving meaning. For example, when a document is split (with a small chunk size) into 50 chunks, it takes a long time to process, whereas generating only 10 embeddings is much faster. A sketch of this kind of word-based chunking with overlap follows this list.
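Here is a minimal sketch of the word-based chunking I have in mind; the function and parameter names are my own and not tied to any particular crate:

```rust
/// Split `text` into chunks of at most `chunk_words` words,
/// where consecutive chunks share `overlap_words` words.
fn chunk_words(text: &str, chunk_words: usize, overlap_words: usize) -> Vec<String> {
    let words: Vec<&str> = text.split_whitespace().collect();
    // Advance by (chunk size - overlap) words per chunk, at least one word.
    let step = chunk_words.saturating_sub(overlap_words).max(1);
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < words.len() {
        let end = (start + chunk_words).min(words.len());
        chunks.push(words[start..end].join(" "));
        if end == words.len() {
            break;
        }
        start += step;
    }
    chunks
}

// Example: the 1000-word chunks with 100-word overlap used in the benchmark below.
// let chunks = chunk_words(&document, 1000, 100);
```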
As an example of this trade-off: a model like Qwen3 Embedding 8B generates 4096-dimensional embeddings, which makes it slow and its embeddings large, but it also captures the detailed meaning of the content, as its embeddings are designed for LLM usage. Much simpler models such as BGE Small cannot really capture deep semantics, but they are pretty good at figuring out whether a piece of text appears (in some variation) in a haystack or not.
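All of the models in the tables below are available through the fastembed crate under essentially the names used there (the Q-suffixed variants are the quantized versions). As a rough sketch of how the embeddings are generated, assuming a recent fastembed release (the exact `InitOptions` API differs between versions, so treat the details as illustrative):

```rust
use fastembed::{EmbeddingModel, InitOptions, TextEmbedding};

fn main() {
    // Load one of the benchmarked models. The builder-style InitOptions shown here
    // matches recent fastembed releases; older releases configure a plain struct instead.
    let model = TextEmbedding::try_new(
        InitOptions::new(EmbeddingModel::AllMiniLML6V2).with_show_download_progress(true),
    )
    .expect("failed to initialize the embedding model");

    // Embed a batch of chunks in one call; the second argument is an optional batch size.
    let chunks = vec!["first chunk of a document", "second chunk of a document"];
    let embeddings = model.embed(chunks, None).expect("embedding failed");

    println!("{} embeddings of dimension {}", embeddings.len(), embeddings[0].len());
}
```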
The benchmark
I took two sample documents from my stash. For each document I created 3 queries about content that appears in the document and 2 queries about content that does not appear in it. I then calculated the cosine similarity of each query against the document’s chunks: the first 3 queries should have a high similarity, the last 2 a low similarity. These 5 values go into a formula that calculates a final score (a sketch of the underlying similarity computation follows the list below). I chunk the documents into different chunk sizes:
- 6000 words per chunk w/ 300 words overlap
- 3000 words per chunk w/ 300 words overlap
- 1000 words per chunk w/ 100 words overlap
- 500 words per chunk w/ 50 words overlap
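The building block behind these numbers is plain cosine similarity between a query embedding and each chunk embedding. A minimal, simplified sketch (taking the best-matching chunk per query is one reasonable aggregation; a mean over chunks would also work):

```rust
/// Cosine similarity between two embedding vectors of equal length.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// Similarity of a query to a chunked document: the best-matching chunk wins.
fn query_similarity(query: &[f32], chunks: &[Vec<f32>]) -> f32 {
    chunks
        .iter()
        .map(|chunk| cosine_similarity(query, chunk))
        .fold(f32::NEG_INFINITY, f32::max)
}
```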
Results
Chunk Config: 1000/100
Model | Score | Contains Score (Q1-Q3, higher is better) | Does Not Contain Score (Q4-Q5, lower is better) | Avg Time (ms) |
---|---|---|---|---|
AllMiniLML12V2Q | 0.4510 | 0.4908 | 0.0814 | 202.0 |
AllMiniLML12V2 | 0.4464 | 0.4873 | 0.0841 | 406.9 |
Qwen3-Embedding-0.6B | 0.4376 | 0.5818 | 0.2404 | 5364.3 |
AllMiniLML6V2Q | 0.4306 | 0.4812 | 0.1048 | 99.3 |
AllMiniLML6V2 | 0.4285 | 0.4861 | 0.1063 | 39.1 |
NomicEmbedTextV15 | 0.3564 | 0.6624 | 0.4595 | 11309.8 |
BGELargeENV15 | 0.3313 | 0.7086 | 0.5288 | 4526.8 |
BGESmallENV15 | 0.3164 | 0.7278 | 0.5624 | 407.1 |
JINAV2BASEEN | 0.2526 | 0.7889 | 0.6776 | 11524.7 |
Chunk Config: 3000/300
Model | Score | Contains Score (Q1-Q3, higher is better) | Does Not Contain Score (Q4-Q5, lower is better) | Avg Time (ms) |
---|---|---|---|---|
AllMiniLML12V2 | 0.3720 | 0.3897 | 0.0386 | 171.7 |
Qwen3-Embedding-0.6B | 0.3719 | 0.4902 | 0.2248 | 2248.4 |
AllMiniLML6V2 | 0.3702 | 0.4017 | 0.0571 | 20.9 |
AllMiniLML12V2Q | 0.3685 | 0.3818 | 0.0293 | 91.9 |
AllMiniLML6V2Q | 0.3520 | 0.3716 | 0.0487 | 45.3 |
BGELargeENV15 | 0.3466 | 0.6628 | 0.4650 | 2001.8 |
BGESmallENV15 | 0.3231 | 0.6901 | 0.5226 | 171.4 |
NomicEmbedTextV15 | 0.3218 | 0.5887 | 0.4517 | 18337.8 |
JINAV2BASEEN | 0.2485 | 0.7682 | 0.6739 | 19155.1 |
Chunk Config: 500/50
Model | Score | Contains Score (Q1-Q3, higher is better) | Does Not Contain Score (Q4-Q5, lower is better) | Avg Time (ms) |
---|---|---|---|---|
AllMiniLML12V2 | 0.4591 | 0.5014 | 0.0846 | 760.0 |
AllMiniLML12V2Q | 0.4571 | 0.5024 | 0.0907 | 399.8 |
Qwen3-Embedding-0.6B | 0.4411 | 0.5875 | 0.2426 | 10469.9 |
AllMiniLML6V2Q | 0.4389 | 0.4915 | 0.1066 | 188.6 |
AllMiniLML6V2 | 0.4218 | 0.4861 | 0.1241 | 69.3 |
NomicEmbedTextV15 | 0.3584 | 0.6678 | 0.4610 | 8336.4 |
BGELargeENV15 | 0.3316 | 0.7102 | 0.5292 | 8675.8 |
BGESmallENV15 | 0.3132 | 0.7278 | 0.5672 | 791.3 |
JINAV2BASEEN | 0.2459 | 0.8016 | 0.6909 | 8228.1 |
Chunk Config: 6000/300
Model | Score | Contains Score (Q1-Q3, higher is better) | Does Not Contain Score (Q4-Q5, lower is better) | Avg Time (ms) |
---|---|---|---|---|
Qwen3-Embedding-0.6B | 0.3499 | 0.4463 | 0.2074 | 1624.1 |
BGELargeENV15 | 0.3376 | 0.6295 | 0.4564 | 1346.2 |
BGESmallENV15 | 0.3272 | 0.6530 | 0.4955 | 113.8 |
NomicEmbedTextV15 | 0.3204 | 0.5432 | 0.4104 | 16924.8 |
AllMiniLML12V2 | 0.3091 | 0.3135 | 0.0143 | 112.8 |
AllMiniLML6V2Q | 0.3069 | 0.3237 | 0.0498 | 33.5 |
AllMiniLML12V2Q | 0.3025 | 0.3065 | 0.0130 | 65.4 |
AllMiniLML6V2 | 0.2755 | 0.2854 | 0.0310 | 16.0 |
JINAV2BASEEN | 0.2530 | 0.7590 | 0.6649 | 22370.4 |
Some notes:
- In the largest chunk configuration, the Qwen3 embedding wins handily, as Qwen3 has a much larger context size (32768 tokens) than the other models
- The AllMini models are very lightweight and surprisingly good both at finding information that is present and at rejecting information that is not there
- The Qwen model is noticeably weaker at rejecting queries about content that does not appear in the document (its Q4-Q5 similarities stay relatively high)
AllMiniLML6V2 was apparently specifically trained on sentence similarity tasks using contrastive learning. It’s optimized to push similar things together and dissimilar things apart.
Bigger isn’t always better for specific tasks. For retrieval and similarity tasks, specialized smaller models often outperform general-purpose larger ones.
The results make sense: Qwen3 and Jina are more powerful for general understanding but less precise for binary relevance decisions.