Benchmarking Local Embeddings
For one of my side projects I started investigating different embedding models to determine which one to use. There are already a variety of embedding benchmarks, such as the MTEB leaderboard, but I had slightly different requirements, which is why I ran my own benchmark. I'm sharing the results here.
Requirements
- I wanted to test models supported by Rust crates such as Embed Anything or Fastembed
- The embedding models should be fast, as they will run privately on local hardware as part of a desktop app
- The generated embeddings should not be huge, as I will store many embeddings in a database
- Their primary use case is RAG search, so it is more important to get good results on “contains” and “does not contain” tests than to fully capture the semantics of the content.
- My content is primarily in English, so I don’t need multilingual embeddings
- As the embeddings are created locally, I’d like to make the chunks as large as possible while still preserving meaning. For example, when a document is split (with a small chunk size) into 50 chunks, it takes a long time to process, whereas generating only 10 embeddings is much faster. A sketch of this kind of word-based chunking with overlap follows this list.
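Here is a minimal sketch of the word-based chunking I have in mind; the function and parameter names are my own and not tied to any particular crate:

```rust
/// Split `text` into chunks of at most `chunk_words` words,
/// where consecutive chunks share `overlap_words` words.
fn chunk_words(text: &str, chunk_words: usize, overlap_words: usize) -> Vec<String> {
    let words: Vec<&str> = text.split_whitespace().collect();
    // Advance by (chunk size - overlap) words per chunk, at least one word.
    let step = chunk_words.saturating_sub(overlap_words).max(1);
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < words.len() {
        let end = (start + chunk_words).min(words.len());
        chunks.push(words[start..end].join(" "));
        if end == words.len() {
            break;
        }
        start += step;
    }
    chunks
}

// Example: the 1000-word chunks with 100-word overlap used in the benchmark below.
// let chunks = chunk_words(&document, 1000, 100);
```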
As an example of this trade-off: a model like Qwen3 Embedding 8B generates 4096-dimensional embeddings, which makes it slow and its embeddings large, but it also captures the detailed meaning of the content, as its embeddings are designed for LLM usage. Much simpler models such as BGE Small cannot really capture deep semantics, but they are pretty good at figuring out whether a piece of text appears (in some variation) in a haystack or not.
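All of the models in the tables below are available through the fastembed crate under essentially the names used there (the Q-suffixed variants are the quantized versions). As a rough sketch of how the embeddings are generated, assuming a recent fastembed release (the exact `InitOptions` API differs between versions, so treat the details as illustrative):

```rust
use fastembed::{EmbeddingModel, InitOptions, TextEmbedding};

fn main() {
    // Load one of the benchmarked models. The builder-style InitOptions shown here
    // matches recent fastembed releases; older releases configure a plain struct instead.
    let model = TextEmbedding::try_new(
        InitOptions::new(EmbeddingModel::AllMiniLML6V2).with_show_download_progress(true),
    )
    .expect("failed to initialize the embedding model");

    // Embed a batch of chunks in one call; the second argument is an optional batch size.
    let chunks = vec!["first chunk of a document", "second chunk of a document"];
    let embeddings = model.embed(chunks, None).expect("embedding failed");

    println!("{} embeddings of dimension {}", embeddings.len(), embeddings[0].len());
}
```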
The benchmark
I took two sample documents from my stash. For each document I created 3 queries about content that appears in the document and 2 queries about content that does not appear in it. I then calculated the cosine similarity of each query against the document’s chunks: the first 3 queries should have a high similarity, the last 2 a low similarity. These 5 values go into a formula that calculates a final score (a sketch of the underlying similarity computation follows the list below). I chunk the documents into different chunk sizes:
- 6000 words per chunk w/ 300 words overlap
- 3000 words per chunk w/ 300 words overlap
- 1000 words per chunk w/ 100 words overlap
- 500 words per chunk w/ 50 words overlap
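The building block behind these numbers is plain cosine similarity between a query embedding and each chunk embedding. A minimal, simplified sketch (taking the best-matching chunk per query is one reasonable aggregation; a mean over chunks would also work):

```rust
/// Cosine similarity between two embedding vectors of equal length.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// Similarity of a query to a chunked document: the best-matching chunk wins.
fn query_similarity(query: &[f32], chunks: &[Vec<f32>]) -> f32 {
    chunks
        .iter()
        .map(|chunk| cosine_similarity(query, chunk))
        .fold(f32::NEG_INFINITY, f32::max)
}
```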
Results
Chunk Config: 1000/100
Model | Score | Contains Score (Q1-Q3, higher is better) | Does Not Contain Score (Q4-Q5, lower is better) | Avg Time (ms) |
---|---|---|---|---|
AllMiniLML12V2Q | 0.4510 | 0.4908 | 0.0814 | 202.0 |
AllMiniLML12V2 | 0.4464 | 0.4873 | 0.0841 | 406.9 |
Qwen3-Embedding-0.6B | 0.4376 | 0.5818 | 0.2404 | 5364.3 |
AllMiniLML6V2Q | 0.4306 | 0.4812 | 0.1048 | 99.3 |
AllMiniLML6V2 | 0.4285 | 0.4861 | 0.1063 | 39.1 |
NomicEmbedTextV15 | 0.3564 | 0.6624 | 0.4595 | 11309.8 |
BGELargeENV15 | 0.3313 | 0.7086 | 0.5288 | 4526.8 |
BGESmallENV15 | 0.3164 | 0.7278 | 0.5624 | 407.1 |
JINAV2BASEEN | 0.2526 | 0.7889 | 0.6776 | 11524.7 |
Chunk Config: 3000/300
Model | Score | Contains Score (Q1-Q3, higher is better) | Does Not Contain Score (Q4-Q5, lower is better) | Avg Time (ms) |
---|---|---|---|---|
AllMiniLML12V2 | 0.3720 | 0.3897 | 0.0386 | 171.7 |
Qwen3-Embedding-0.6B | 0.3719 | 0.4902 | 0.2248 | 2248.4 |
AllMiniLML6V2 | 0.3702 | 0.4017 | 0.0571 | 20.9 |
AllMiniLML12V2Q | 0.3685 | 0.3818 | 0.0293 | 91.9 |
AllMiniLML6V2Q | 0.3520 | 0.3716 | 0.0487 | 45.3 |
BGELargeENV15 | 0.3466 | 0.6628 | 0.4650 | 2001.8 |
BGESmallENV15 | 0.3231 | 0.6901 | 0.5226 | 171.4 |
NomicEmbedTextV15 | 0.3218 | 0.5887 | 0.4517 | 18337.8 |
JINAV2BASEEN | 0.2485 | 0.7682 | 0.6739 | 19155.1 |
Chunk Config: 500/50
Model | Score | Contains Score (Q1-Q3, higher is better) | Does Not Contain Score (Q4-Q5, lower is better) | Avg Time (ms) |
---|---|---|---|---|
AllMiniLML12V2 | 0.4591 | 0.5014 | 0.0846 | 760.0 |
AllMiniLML12V2Q | 0.4571 | 0.5024 | 0.0907 | 399.8 |
Qwen3-Embedding-0.6B | 0.4411 | 0.5875 | 0.2426 | 10469.9 |
AllMiniLML6V2Q | 0.4389 | 0.4915 | 0.1066 | 188.6 |
AllMiniLML6V2 | 0.4218 | 0.4861 | 0.1241 | 69.3 |
NomicEmbedTextV15 | 0.3584 | 0.6678 | 0.4610 | 8336.4 |
BGELargeENV15 | 0.3316 | 0.7102 | 0.5292 | 8675.8 |
BGESmallENV15 | 0.3132 | 0.7278 | 0.5672 | 791.3 |
JINAV2BASEEN | 0.2459 | 0.8016 | 0.6909 | 8228.1 |
Chunk Config: 6000/300
Model | Score | Contains Score (Q1-Q3, higher is better) | Does Not Contain Score (Q4-Q5, lower is better) | Avg Time (ms) |
---|---|---|---|---|
Qwen3-Embedding-0.6B | 0.3499 | 0.4463 | 0.2074 | 1624.1 |
BGELargeENV15 | 0.3376 | 0.6295 | 0.4564 | 1346.2 |
BGESmallENV15 | 0.3272 | 0.6530 | 0.4955 | 113.8 |
NomicEmbedTextV15 | 0.3204 | 0.5432 | 0.4104 | 16924.8 |
AllMiniLML12V2 | 0.3091 | 0.3135 | 0.0143 | 112.8 |
AllMiniLML6V2Q | 0.3069 | 0.3237 | 0.0498 | 33.5 |
AllMiniLML12V2Q | 0.3025 | 0.3065 | 0.0130 | 65.4 |
AllMiniLML6V2 | 0.2755 | 0.2854 | 0.0310 | 16.0 |
JINAV2BASEEN | 0.2530 | 0.7590 | 0.6649 | 22370.4 |
Some notes:
- In the largest chunk configuration, the Qwen3 embedding wins handily, as Qwen3 has a much larger context size (32768 tokens) than the other models
- The AllMini models are very lightweight and surprisingly good both at finding information that is present and at rejecting information that is not there
- The Qwen model is noticeably weaker at rejecting queries about content that does not appear in the document (its Q4-Q5 similarities stay relatively high)
AllMiniLML6V2 was apparently specifically trained on sentence similarity tasks using contrastive learning. It’s optimized to push similar things together and dissimilar things apart.
Bigger isn’t always better for specific tasks. For retrieval and similarity tasks, specialized smaller models often outperform general-purpose larger ones.
The results make sense: Qwen3 and Jina are more powerful for general understanding but less precise for binary relevance decisions.