Ben Terhechte — journal

LLM-Powered Programming: A Language Matrix Revealed

Comparing LLMs and programming languages to see which combination is the best for AI-driven development.

Feb 4, 2025 · 21 min read · 4,546 words

TLDR: I’ve tested 18 LLMs against 4 typed programming languages to see which excels at what.

The biggest takeaway is that the same LLM can perform very differently depending on the language. Take Sonnet 3.5, for example:

  • Python: 81% success rate
  • Typescript: 81% success rate
  • Swift: 71% success rate
  • Rust: 62% success rate

For the details of how this was measured, read on.

Intro

I have lately been thinking about the role of LLMs in software development. I use LLMs every day during my work and I couldn’t fathom doing my job as a software engineer without them. However, I do work with multiple different programming languages and I have noticed that the quality of the LLM’s output varies greatly between languages. This was just a feeling I had, not grounded in any empirical evidence. So I decided to do something about it.

The Idea

The amount of source code out there differs greatly between languages. Github publishes the state of the Octoverse every year, and while they don’t show the amount of code per language, they show the number of distinct users contributing code per language, which we can take as a proxy for the amount of code out there.

By that metric, there is far more Python and JavaScript code than there is Rust or Swift code. As LLMs are being trained on code, and become better with more data, we can assume that they’re better at the languages with more code.

In addition, some languages - such as Rust - have a high complexity stemming from a trait-based type system where the correctness of the code depends not just on the currently visible code (e.g. a function), but on a potpourri of other parts of the codebase. I assume that this is more difficult for an LLM to reason about.

With that in mind, I set out to compare four typed languages across multiple LLMs.

The Language x LLM Benchmark

I implemented a small Python project together with a database of programming problems. Then I iterated over a list of LLMs and asked them to solve the problems. The outcome can then be used to answer the following questions:

  • Which LLM is the best for a given programming language?
  • Which programming language is the best for AI-supported programming?

The second point in particular is something I’d like to stress:

LLMs might enforce a language gap

If LLMs are far better at solving problems in a certain language, developers might be incentivized to use that language. This could lead to a language gap, where some languages are used far more often than others. There is a chance that future, better LLMs will handle the poorly supported languages better, but by then the languages that are already well supported will be supported even better.

The tasks

I’m keeping the database of LLM tasks private for now to decrease the likelihood that LLMs are trained on this data. Each task consists of the prompt as well as a description of how to evaluate the solution:

  • Check: I’m just running a type check on the code.
  • Eval: The code is run with a given input, and I’m expecting a given output.
  • Contains: I’m just checking if the code contains a given string (or multiple strings).
  • Run: The code should implement a server and I’m doing http requests to it.

I’ve included the Contains task to also test the knowledge an LLM has about the given ecosystem, e.g. “How do I do the following in this third-party library?”

The tasks are split up into general tasks, which can be solved by any language, and language specific tasks which are only solvable in a certain language. One example is a task that requires the solution of a complex Rust lifetime issue. This is only solvable in Rust.

The set of specific tasks exists in order to answer question 1: Which LLM is the best for a given programming language?

An example of a general task is the following:


Can you write a {{lang}} function that receives
a {{string_type}} as a parameter and returns the
same {{string_type}} uppercased?

In this case, {{lang}} is a placeholder for the programming language, and {{string_type}} is a placeholder for that language’s string type.
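Rendering such a template per language is a simple substitution. A minimal sketch, with a hypothetical mapping of languages to their string types:

```python
# Hypothetical mapping for the {{string_type}} placeholder, for illustration.
STRING_TYPES = {
    "Python": "str",
    "Typescript": "string",
    "Swift": "String",
    "Rust": "String",
}

TEMPLATE = (
    "Can you write a {{lang}} function that receives "
    "a {{string_type}} as a parameter and returns the "
    "same {{string_type}} uppercased?"
)


def render(template: str, lang: str) -> str:
    """Fill in the language and its string type for one concrete prompt."""
    return (template
            .replace("{{lang}}", lang)
            .replace("{{string_type}}", STRING_TYPES[lang]))
```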

I would love to support more languages, but these are the ones I’m most familiar with.

Each task is executed three times to see how stable the results are for the given case (this is important, as LLMs are not deterministic).
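Because of that non-determinism, each task yields not one boolean but three. Aggregating those repeated attempts into an overall success rate could look like this (a sketch, not the project’s actual code):

```python
def success_rate(results: list[list[bool]]) -> float:
    """results[i][j]: did task i succeed on run j? Rate over all attempts."""
    attempts = [ok for task_runs in results for ok in task_runs]
    return sum(attempts) / len(attempts)


# Two tasks, three runs each: 3 of the 6 attempts succeeded.
rate = success_rate([[True, True, False], [True, False, False]])  # 0.5
```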

The contenders

In order to simplify the whole process, I’m using OpenRouter to access the LLMs. This allows me to use the same API for all LLMs.

  • amazon/nova-pro-v1
  • anthropic/claude-3.5-haiku-20241022:beta
  • anthropic/claude-3.5-sonnet:beta
  • deepseek/deepseek-chat
  • deepseek/deepseek-r1
  • deepseek/deepseek-r1-distill-llama-70b
  • meta-llama/llama-3.3-70b-instruct
  • microsoft/phi-4
  • mistralai/codestral-2501
  • mistralai/mistral-large-2411
  • mistralai/mistral-small-24b-instruct-2501
  • openai/gpt-4o-2024-11-20
  • openai/gpt-4o-mini
  • openai/o3-mini
  • qwen/qwen-max
  • qwen/qwen-plus
  • qwen/qwen-turbo
  • qwen/qwq-32b-preview
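OpenRouter exposes an OpenAI-compatible chat completions endpoint, so querying any model in the list boils down to swapping the model id. A minimal sketch of how such a request could be assembled (nothing is sent here; the `OPENROUTER_API_KEY` environment variable is assumed):

```python
import os

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"


def build_request(model: str, prompt: str) -> tuple[dict, dict]:
    """Build headers and JSON body for an OpenRouter chat completion call."""
    headers = {
        "Authorization": f"Bearer {os.environ.get('OPENROUTER_API_KEY', '')}",
        "Content-Type": "application/json",
    }
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return headers, body


headers, body = build_request("openai/o3-mini", "Write a Python hello world.")
# e.g. send with: requests.post(OPENROUTER_URL, headers=headers, json=body)
```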

Results

Now for the most interesting part: the results. I will try to answer various questions based on the data, including the two original ones:

  1. Which LLM is the best for a given programming language?
  2. Which programming language is the best for AI-supported programming?

Compare languages across LLMs

| Name | Python | Rust | Typescript | Swift |
| --- | --- | --- | --- | --- |
| amazon/nova-pro-v1 | 76.2% | 35.7% | 40.5% | 42.9% |
| anthropic/claude-3.5-haiku-20241022:beta | 76.2% | 45.2% | 71.4% | 61.9% |
| anthropic/claude-3.5-sonnet:beta | 81.0% | 61.9% | 81.0% | 71.4% |
| deepseek/deepseek-chat | 92.9% | 61.9% | 88.1% | 64.3% |
| deepseek/deepseek-r1 | 69.0% | 54.8% | 59.5% | 52.4% |
| deepseek/deepseek-r1-distill-llama-70b | 61.9% | 38.1% | 45.2% | 28.6% |
| meta-llama/llama-3.3-70b-instruct | 73.8% | 33.3% | 57.1% | 35.7% |
| microsoft/phi-4 | 64.3% | 26.2% | 50.0% | 40.5% |
| mistralai/codestral-2501 | 73.8% | 59.5% | 71.4% | 66.7% |
| mistralai/mistral-large-2411 | 78.6% | 50.0% | 69.0% | 64.3% |
| mistralai/mistral-small-24b-instruct-2501 | 64.3% | 21.4% | 31.0% | 28.6% |
| openai/gpt-4o-mini | 71.4% | 54.8% | 71.4% | 61.9% |
| openai/gpt-4o-2024-11-20 | 78.6% | 45.2% | 64.3% | 59.5% |
| openai/o3-mini | 97.6% | 76.2% | 95.2% | 92.9% |
| qwen/qwen-max | 85.7% | 59.5% | 54.8% | 38.1% |
| qwen/qwen-plus | 38.1% | 42.9% | 16.7% | 47.6% |
| qwen/qwen-turbo | 76.2% | 45.2% | 61.9% | 35.7% |
| qwen/qwq-32b-preview | 61.9% | 23.8% | 38.1% | 38.1% |

Here you can see the success rate of each LLM for each language. o3-mini is really good at Swift. What I find particularly interesting is that some much cheaper LLMs perform almost as well as far more expensive ones. Let’s home in on this by taking costs and duration into account.

Compare costs and duration across LLMs and languages

| Name | Python | Python Duration | Python Costs | Rust | Rust Duration | Rust Costs | Typescript | Typescript Duration | Typescript Costs | Swift | Swift Duration | Swift Costs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| amazon/nova-pro-v1 | 76.2% | 78.9s | $0.0224 | 35.7% | 115.9s | $0.0333 | 40.5% | 106.1s | $0.0282 | 42.9% | 165.3s | $0.0392 |
| anthropic/claude-3.5-haiku-20241022:beta | 76.2% | 135.1s | $0.0373 | 45.2% | 179.4s | $0.0489 | 71.4% | 169.2s | $0.0450 | 61.9% | 199.9s | $0.0571 |
| anthropic/claude-3.5-sonnet:beta | 81.0% | 109.6s | $0.1215 | 61.9% | 156.6s | $0.1792 | 81.0% | 139.8s | $0.1539 | 71.4% | 180.9s | $0.2109 |
| deepseek/deepseek-chat | 92.9% | 1130.1s | $0.0087 | 61.9% | 1289.4s | $0.0124 | 88.1% | 1069.6s | $0.0103 | 64.3% | 1953.8s | $0.0130 |
| deepseek/deepseek-r1 | 69.0% | 11049.6s | $0.1499 | 54.8% | 15493.3s | $0.1795 | 59.5% | 12264.8s | $0.1599 | 52.4% | 14094.4s | $0.1594 |
| deepseek/deepseek-r1-distill-llama-70b | 61.9% | 1501.1s | $0.0209 | 38.1% | 2021.6s | $0.0237 | 45.2% | 1292.6s | $0.0225 | 28.6% | 2005.8s | $0.0323 |
| meta-llama/llama-3.3-70b-instruct | 73.8% | 367.7s | $0.0024 | 33.3% | 436.0s | $0.0033 | 57.1% | 339.4s | $0.0027 | 35.7% | 550.8s | $0.0039 |
| microsoft/phi-4 | 64.3% | 183.9s | $0.0015 | 26.2% | 325.8s | $0.0025 | 50.0% | 192.7s | $0.0017 | 40.5% | 342.9s | $0.0023 |
| mistralai/codestral-2501 | 73.8% | 37.0s | $0.0063 | 59.5% | 55.9s | $0.0094 | 71.4% | 54.1s | $0.0077 | 66.7% | 88.1s | $0.0108 |
| mistralai/mistral-large-2411 | 78.6% | 111.6s | $0.0512 | 50.0% | 188.3s | $0.0811 | 69.0% | 158.3s | $0.0661 | 64.3% | 238.7s | $0.0888 |
| mistralai/mistral-small-24b-instruct-2501 | 64.3% | 55.0s | $0.0023 | 21.4% | 74.5s | $0.0036 | 31.0% | 74.3s | $0.0030 | 28.6% | 106.5s | $0.0041 |
| openai/gpt-4o-mini | 71.4% | 83.7s | $0.0696 | 54.8% | 133.8s | $0.1071 | 71.4% | 120.5s | $0.0919 | 61.9% | 190.7s | $0.1287 |
| openai/gpt-4o-2024-11-20 | 78.6% | 85.2s | $0.0033 | 45.2% | 127.5s | $0.0052 | 64.3% | 116.1s | $0.0043 | 59.5% | 216.5s | $0.0067 |
| openai/o3-mini | 97.6% | 486.8s | $0.1960 | 76.2% | 410.3s | $0.2757 | 95.2% | 443.1s | $0.1905 | 92.9% | 364.7s | $0.2395 |
| qwen/qwen-max | 85.7% | 138.4s | $0.0367 | 59.5% | 249.3s | $0.0611 | 54.8% | 157.3s | $0.0371 | 38.1% | 267.4s | $0.0579 |
| qwen/qwen-plus | 38.1% | 95.8s | $0.0052 | 42.9% | 216.8s | $0.0119 | 16.7% | 80.9s | $0.0035 | 47.6% | 284.5s | $0.0128 |
| qwen/qwen-turbo | 76.2% | 100.3s | $0.0013 | 45.2% | 161.8s | $0.0022 | 61.9% | 131.2s | $0.0017 | 35.7% | 193.3s | $0.0022 |
| qwen/qwq-32b-preview | 61.9% | 1496.0s | $0.0108 | 23.8% | 1401.2s | $0.0106 | 38.1% | 1439.4s | $0.0101 | 38.1% | 1514.7s | $0.0118 |

The big surprise here is just how slow the deepseek models are. These values are real. Since the whole suite ran 3 times, deepseek-r1 needed more than 9 hours in total just for the Python tasks (3 × 11,049.6s). Compare that to 110 seconds per run for Claude Sonnet or 487 seconds for o3-mini.

Stability

Another thing I tested for is the stability of the results: if I ask the same question repeatedly, how often do I get a correct answer? The whole benchmark suite ran 3 times; the next table shows the number of successfully solved tasks per run and language.

  • R1 means first run
  • R2 means second run
  • R3 means third run
| Name | Python | Python R1 | Python R2 | Python R3 | Rust | Rust R1 | Rust R2 | Rust R3 | Typescript | Typescript R1 | Typescript R2 | Typescript R3 | Swift | Swift R1 | Swift R2 | Swift R3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| amazon/nova-pro-v1 | 76.2% | 14 | 14 | 16 | 35.7% | 9 | 7 | 11 | 40.5% | 5 | 7 | 8 | 42.9% | 6 | 8 | 10 |
| anthropic/claude-3.5-haiku-20241022:beta | 76.2% | 15 | 14 | 15 | 45.2% | 12 | 12 | 11 | 71.4% | 13 | 12 | 11 | 61.9% | 10 | 12 | 12 |
| anthropic/claude-3.5-sonnet:beta | 81.0% | 17 | 15 | 16 | 61.9% | 13 | 13 | 14 | 81.0% | 11 | 14 | 13 | 71.4% | 14 | 15 | 14 |
| deepseek/deepseek-chat | 92.9% | 17 | 18 | 19 | 61.9% | 16 | 15 | 12 | 88.1% | 14 | 14 | 15 | 64.3% | 11 | 11 | 13 |
| deepseek/deepseek-r1 | 69.0% | 14 | 16 | 14 | 54.8% | 13 | 12 | 15 | 59.5% | 9 | 10 | 12 | 52.4% | 13 | 8 | 11 |
| deepseek/deepseek-r1-distill-llama-70b | 61.9% | 13 | 13 | 12 | 38.1% | 10 | 10 | 10 | 45.2% | 7 | 6 | 9 | 28.6% | 4 | 7 | 7 |
| meta-llama/llama-3.3-70b-instruct | 73.8% | 16 | 16 | 11 | 33.3% | 12 | 12 | 9 | 57.1% | 10 | 9 | 9 | 35.7% | 5 | 9 | 7 |
| microsoft/phi-4 | 64.3% | 14 | 13 | 12 | 26.2% | 9 | 7 | 10 | 50.0% | 7 | 7 | 9 | 40.5% | 7 | 5 | 7 |
| mistralai/codestral-2501 | 73.8% | 14 | 14 | 14 | 59.5% | 13 | 13 | 12 | 71.4% | 11 | 10 | 14 | 66.7% | 11 | 9 | 12 |
| mistralai/mistral-large-2411 | 78.6% | 15 | 15 | 15 | 50.0% | 13 | 15 | 11 | 69.0% | 11 | 9 | 11 | 64.3% | 11 | 12 | 10 |
| mistralai/mistral-small-24b-instruct-2501 | 64.3% | 10 | 13 | 14 | 21.4% | 6 | 8 | 7 | 31.0% | 8 | 4 | 4 | 28.6% | 4 | 6 | 7 |
| openai/gpt-4o-mini | 71.4% | 14 | 14 | 15 | 54.8% | 13 | 16 | 12 | 71.4% | 10 | 10 | 12 | 61.9% | 12 | 11 | 11 |
| openai/gpt-4o-2024-11-20 | 78.6% | 16 | 15 | 14 | 45.2% | 9 | 12 | 13 | 64.3% | 10 | 11 | 11 | 59.5% | 9 | 9 | 11 |
| openai/o3-mini | 97.6% | 19 | 19 | 18 | 76.2% | 19 | 18 | 20 | 95.2% | 15 | 14 | 15 | 92.9% | 16 | 15 | 17 |
| qwen/qwen-max | 85.7% | 17 | 16 | 15 | 59.5% | 17 | 14 | 12 | 54.8% | 11 | 3 | 12 | 38.1% | 10 | 9 | 5 |
| qwen/qwen-plus | 38.1% | 6 | 16 | 6 | 42.9% | 14 | 12 | 12 | 16.7% | 3 | 3 | 3 | 47.6% | 9 | 10 | 7 |
| qwen/qwen-turbo | 76.2% | 16 | 14 | 14 | 45.2% | 12 | 11 | 12 | 61.9% | 10 | 10 | 7 | 35.7% | 4 | 6 | 6 |
| qwen/qwq-32b-preview | 61.9% | 12 | 13 | 14 | 23.8% | 6 | 8 | 5 | 38.1% | 4 | 7 | 7 | 38.1% | 7 | 5 | 6 |
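One way to quantify stability from the table above is the spread between the best and worst of the three runs. A quick sketch, using deepseek-r1’s Swift run counts (13, 8, 11) as the example:

```python
def run_spread(runs: list[int]) -> int:
    """Difference between the best and worst of the repeated runs."""
    return max(runs) - min(runs)


# deepseek-r1 on Swift: 13, 8 and 11 tasks solved across the three runs.
spread = run_spread([13, 8, 11])  # 5
```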

Best language for LLMs

Back to the original question: which language is best for LLMs? To answer that, we’re going beyond the general success rate and will look at the overall success rate, which also includes the language-specific questions about the ecosystem and tricky type errors.

“Python (Gen)” means the general questions only, whereas “Python” means all questions, including the language-specific ones.

| Name | Python | Python (Gen) | Rust | Rust (Gen) | Typescript | Typescript (Gen) | Swift | Swift (Gen) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| amazon/nova-pro-v1 | 69.8% | 76.2% | 31.0% | 35.7% | 33.3% | 40.5% | 34.8% | 42.9% |
| anthropic/claude-3.5-haiku-20241022:beta | 69.8% | 76.2% | 40.2% | 45.2% | 60.0% | 71.4% | 49.3% | 61.9% |
| anthropic/claude-3.5-sonnet:beta | 76.2% | 81.0% | 46.0% | 61.9% | 63.3% | 81.0% | 62.3% | 71.4% |
| deepseek/deepseek-chat | 85.7% | 92.9% | 49.4% | 61.9% | 71.7% | 88.1% | 50.7% | 64.3% |
| deepseek/deepseek-r1 | 69.8% | 69.0% | 46.0% | 54.8% | 51.7% | 59.5% | 46.4% | 52.4% |
| deepseek/deepseek-r1-distill-llama-70b | 60.3% | 61.9% | 34.5% | 38.1% | 36.7% | 45.2% | 26.1% | 28.6% |
| meta-llama/llama-3.3-70b-instruct | 68.3% | 73.8% | 37.9% | 33.3% | 46.7% | 57.1% | 30.4% | 35.7% |
| microsoft/phi-4 | 61.9% | 64.3% | 29.9% | 26.2% | 38.3% | 50.0% | 27.5% | 40.5% |
| mistralai/codestral-2501 | 66.7% | 73.8% | 43.7% | 59.5% | 58.3% | 71.4% | 46.4% | 66.7% |
| mistralai/mistral-large-2411 | 71.4% | 78.6% | 44.8% | 50.0% | 51.7% | 69.0% | 47.8% | 64.3% |
| mistralai/mistral-small-24b-instruct-2501 | 58.7% | 64.3% | 24.1% | 21.4% | 26.7% | 31.0% | 24.6% | 28.6% |
| openai/gpt-4o-mini | 68.3% | 71.4% | 47.1% | 54.8% | 53.3% | 71.4% | 49.3% | 61.9% |
| openai/gpt-4o-2024-11-20 | 71.4% | 78.6% | 39.1% | 45.2% | 53.3% | 64.3% | 42.0% | 59.5% |
| openai/o3-mini | 88.9% | 97.6% | 65.5% | 76.2% | 73.3% | 95.2% | 69.6% | 92.9% |
| qwen/qwen-max | 76.2% | 85.7% | 49.4% | 59.5% | 43.3% | 54.8% | 34.8% | 38.1% |
| qwen/qwen-plus | 44.4% | 38.1% | 43.7% | 42.9% | 15.0% | 16.7% | 37.7% | 47.6% |
| qwen/qwen-turbo | 69.8% | 76.2% | 40.2% | 45.2% | 45.0% | 61.9% | 23.2% | 35.7% |
| qwen/qwq-32b-preview | 61.9% | 61.9% | 21.8% | 23.8% | 30.0% | 38.1% | 26.1% | 38.1% |

Here are the languages sorted by the success rate of the highest-performing LLM (in each case o3-mini):

  • Python: 89%
  • Typescript: 73%
  • Swift: 69%
  • Rust: 65%

A couple of other observations

  • Qwen-Max is as good at Python as Sonnet but costs only about 30% as much. It is also only slightly slower.
  • Deepseek-chat is really good and really cheap, but also really slow.
  • Deepseek-r1 is far worse than o3-mini but only slightly cheaper (and unbelievably slow).
  • o3-mini is the best in all categories.
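These cost observations can be made concrete as “success per dollar”, using the Python column of the cost table above (a sketch over a subset of the models):

```python
# (success_rate, cost_usd) for the Python column, taken from the cost table.
python_results = {
    "anthropic/claude-3.5-sonnet:beta": (0.810, 0.1215),
    "qwen/qwen-max": (0.857, 0.0367),
    "deepseek/deepseek-chat": (0.929, 0.0087),
    "openai/o3-mini": (0.976, 0.1960),
}


def success_per_dollar(rate: float, cost: float) -> float:
    """How much success rate one dollar buys for this model."""
    return rate / cost


ranked = sorted(python_results.items(),
                key=lambda kv: success_per_dollar(*kv[1]),
                reverse=True)
# deepseek-chat comes out far ahead on success per dollar.
```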

All Data

Here you can see all data in one huge table.

| Name | Python | Python (Gen) | Python Duration | Python Costs | Python Quality | Rust | Rust (Gen) | Rust Duration | Rust Costs | Rust Quality | Typescript | Typescript (Gen) | Typescript Duration | Typescript Costs | Typescript Quality | Swift | Swift (Gen) | Swift Duration | Swift Costs | Swift Quality | Swift R1 | Swift R2 | Swift R3 | Rust R1 | Rust R2 | Rust R3 | Typescript R1 | Typescript R2 | Typescript R3 | Python R1 | Python R2 | Python R3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| amazon/nova-pro-v1 | 69.8% | 76.2% | 78.9s | $0.0224 | 68.0% | 31.0% | 35.7% | 115.9s | $0.0333 | 30.0% | 33.3% | 40.5% | 106.1s | $0.0282 | 30.0% | 34.8% | 42.9% | 165.3s | $0.0392 | 31.1% | 6 | 8 | 10 | 9 | 7 | 11 | 5 | 7 | 8 | 14 | 14 | 16 |
| anthropic/claude-3.5-haiku-20241022:beta | 69.8% | 76.2% | 135.1s | $0.0373 | 70.1% | 40.2% | 45.2% | 179.4s | $0.0489 | 40.9% | 60.0% | 71.4% | 169.2s | $0.0450 | 62.1% | 49.3% | 61.9% | 199.9s | $0.0571 | 47.2% | 10 | 12 | 12 | 12 | 12 | 11 | 13 | 12 | 11 | 15 | 14 | 15 |
| anthropic/claude-3.5-sonnet:beta | 76.2% | 81.0% | 109.6s | $0.1215 | 77.6% | 46.0% | 61.9% | 156.6s | $0.1792 | 45.3% | 63.3% | 81.0% | 139.8s | $0.1539 | 60.7% | 62.3% | 71.4% | 180.9s | $0.2109 | 62.1% | 14 | 15 | 14 | 13 | 13 | 14 | 11 | 14 | 13 | 17 | 15 | 16 |
| deepseek/deepseek-chat | 85.7% | 92.9% | 1130.1s | $0.0087 | 83.7% | 49.4% | 61.9% | 1289.4s | $0.0124 | 52.2% | 71.7% | 88.1% | 1069.6s | $0.0103 | 70.7% | 50.7% | 64.3% | 1953.8s | $0.0130 | 49.1% | 11 | 11 | 13 | 16 | 15 | 12 | 14 | 14 | 15 | 17 | 18 | 19 |
| deepseek/deepseek-r1 | 69.8% | 69.0% | 11049.6s | $0.1499 | 69.4% | 46.0% | 54.8% | 15493.3s | $0.1795 | 44.8% | 51.7% | 59.5% | 12264.8s | $0.1599 | 48.6% | 46.4% | 52.4% | 14094.4s | $0.1594 | 49.1% | 13 | 8 | 11 | 13 | 12 | 15 | 9 | 10 | 12 | 14 | 16 | 14 |
| deepseek/deepseek-r1-distill-llama-70b | 60.3% | 61.9% | 1501.1s | $0.0209 | 61.2% | 34.5% | 38.1% | 2021.6s | $0.0237 | 34.5% | 36.7% | 45.2% | 1292.6s | $0.0225 | 35.0% | 26.1% | 28.6% | 2005.8s | $0.0323 | 23.0% | 4 | 7 | 7 | 10 | 10 | 10 | 7 | 6 | 9 | 13 | 13 | 12 |
| meta-llama/llama-3.3-70b-instruct | 68.3% | 73.8% | 367.7s | $0.0024 | 72.8% | 37.9% | 33.3% | 436.0s | $0.0033 | 39.9% | 46.7% | 57.1% | 339.4s | $0.0027 | 47.9% | 30.4% | 35.7% | 550.8s | $0.0039 | 28.0% | 5 | 9 | 7 | 12 | 12 | 9 | 10 | 9 | 9 | 16 | 16 | 11 |
| microsoft/phi-4 | 61.9% | 64.3% | 183.9s | $0.0015 | 63.9% | 29.9% | 26.2% | 325.8s | $0.0025 | 29.6% | 38.3% | 50.0% | 192.7s | $0.0017 | 36.4% | 27.5% | 40.5% | 342.9s | $0.0023 | 28.0% | 7 | 5 | 7 | 9 | 7 | 10 | 7 | 7 | 9 | 14 | 13 | 12 |
| mistralai/codestral-2501 | 66.7% | 73.8% | 37.0s | $0.0063 | 66.7% | 43.7% | 59.5% | 55.9s | $0.0094 | 44.3% | 58.3% | 71.4% | 54.1s | $0.0077 | 55.7% | 46.4% | 66.7% | 88.1s | $0.0108 | 46.0% | 11 | 9 | 12 | 13 | 13 | 12 | 11 | 10 | 14 | 14 | 14 | 14 |
| mistralai/mistral-large-2411 | 71.4% | 78.6% | 111.6s | $0.0512 | 71.4% | 44.8% | 50.0% | 188.3s | $0.0811 | 45.8% | 51.7% | 69.0% | 158.3s | $0.0661 | 52.1% | 47.8% | 64.3% | 238.7s | $0.0888 | 48.4% | 11 | 12 | 10 | 13 | 15 | 11 | 11 | 9 | 11 | 15 | 15 | 15 |
| mistralai/mistral-small-24b-instruct-2501 | 58.7% | 64.3% | 55.0s | $0.0023 | 54.4% | 24.1% | 21.4% | 74.5s | $0.0036 | 23.2% | 26.7% | 31.0% | 74.3s | $0.0030 | 31.4% | 24.6% | 28.6% | 106.5s | $0.0041 | 21.7% | 4 | 6 | 7 | 6 | 8 | 7 | 8 | 4 | 4 | 10 | 13 | 14 |
| openai/gpt-4o-mini | 68.3% | 71.4% | 83.7s | $0.0696 | 67.3% | 47.1% | 54.8% | 133.8s | $0.1071 | 47.3% | 53.3% | 71.4% | 120.5s | $0.0919 | 51.4% | 49.3% | 61.9% | 190.7s | $0.1287 | 50.3% | 12 | 11 | 11 | 13 | 16 | 12 | 10 | 10 | 12 | 14 | 14 | 15 |
| openai/gpt-4o-2024-11-20 | 71.4% | 78.6% | 85.2s | $0.0033 | 73.5% | 39.1% | 45.2% | 127.5s | $0.0052 | 36.0% | 53.3% | 64.3% | 116.1s | $0.0043 | 52.1% | 42.0% | 59.5% | 216.5s | $0.0067 | 40.4% | 9 | 9 | 11 | 9 | 12 | 13 | 10 | 11 | 11 | 16 | 15 | 14 |
| openai/o3-mini | 88.9% | 97.6% | 486.8s | $0.1960 | 89.8% | 65.5% | 76.2% | 410.3s | $0.2757 | 65.0% | 73.3% | 95.2% | 443.1s | $0.1905 | 73.6% | 69.6% | 92.9% | 364.7s | $0.2395 | 68.9% | 16 | 15 | 17 | 19 | 18 | 20 | 15 | 14 | 15 | 19 | 19 | 18 |
| qwen/qwen-max | 76.2% | 85.7% | 138.4s | $0.0367 | 78.2% | 49.4% | 59.5% | 249.3s | $0.0611 | 53.2% | 43.3% | 54.8% | 157.3s | $0.0371 | 30.0% | 34.8% | 38.1% | 267.4s | $0.0579 | 39.1% | 10 | 9 | 5 | 17 | 14 | 12 | 11 | 3 | 12 | 17 | 16 | 15 |
| qwen/qwen-plus | 44.4% | 38.1% | 95.8s | $0.0052 | 42.2% | 43.7% | 42.9% | 216.8s | $0.0119 | 45.3% | 15.0% | 16.7% | 80.9s | $0.0035 | 15.0% | 37.7% | 47.6% | 284.5s | $0.0128 | 39.1% | 9 | 10 | 7 | 14 | 12 | 12 | 3 | 3 | 3 | 6 | 16 | 6 |
| qwen/qwen-turbo | 69.8% | 76.2% | 100.3s | $0.0013 | 72.1% | 40.2% | 45.2% | 161.8s | $0.0022 | 40.4% | 45.0% | 61.9% | 131.2s | $0.0017 | 47.9% | 23.2% | 35.7% | 193.3s | $0.0022 | 21.1% | 4 | 6 | 6 | 12 | 11 | 12 | 10 | 10 | 7 | 16 | 14 | 14 |
| qwen/qwq-32b-preview | 61.9% | 61.9% | 1496.0s | $0.0108 | 59.9% | 21.8% | 23.8% | 1401.2s | $0.0106 | 22.2% | 30.0% | 38.1% | 1439.4s | $0.0101 | 26.4% | 26.1% | 38.1% | 1514.7s | $0.0118 | 27.3% | 7 | 5 | 6 | 6 | 8 | 5 | 4 | 7 | 7 | 12 | 13 | 14 |

Adding more languages

If you’re curious about adding a language, let me know on GitHub. Note that the actual tasks are kept private to prevent LLMs from training on them.