LLM-Powered Programming: A Language Matrix Revealed

Comparing LLMs and programming languages to see which combination is the best for AI-driven development.

TLDR: I’ve tested 18 LLMs against 4 typed programming languages to see which excels at what.

The biggest takeaway is that the same LLM can perform very differently depending on the language. Take Sonnet 3.5, for example:

  • Python: 81% success rate
  • TypeScript: 81% success rate
  • Swift: 71% success rate
  • Rust: 62% success rate

For the details of how this was measured, read on.

Intro

I have lately been thinking about the role of LLMs in software development. I use LLMs every day during my work and couldn’t fathom doing my job as a software engineer without them. However, I work with several different programming languages, and I have noticed that the quality of an LLM’s output varies greatly between them. This was just a feeling, not grounded in any empirical evidence, so I decided to do something about it.

The Idea

The amount of publicly available source code varies widely between languages. GitHub publishes the State of the Octoverse every year; while it doesn’t show the amount of code per language, it does show the number of distinct users contributing code per language, and we can take that as an indicator for the amount of code out there.

By that metric, there is far more Python and JavaScript code than there is Rust or Swift code. Since LLMs are trained on code and become better with more data, we can assume that they’re better at the languages with more code.

In addition, some languages - such as Rust - have a high complexity stemming from a trait-based type system, where the correctness of the code depends not just on the currently visible code (e.g. a function) but on a potpourri of other parts of the codebase. I assume that this is more difficult for an LLM to reason about.

With that in mind, I set out to compare four typed languages across multiple LLMs.

The Language x LLM Benchmark

I implemented a small Python project together with a database of programming problems. Then, I iterated over a list of LLMs and asked them to solve the problems. The outcome of this can then be used to answer the following questions:

  • Which LLM is best for a given programming language?
  • Which programming language is best for AI-supported programming?

The second point in particular is something I’d like to stress:

LLMs might enforce a language gap

If LLMs are far better at solving problems in a certain language, developers might be incentivized to use that language. This might lead to a language gap, where certain languages are used more often than others. There is a chance that future, better LLMs will improve their support for the currently underserved languages, but by then the languages that are already supported very well will be supported even better.

The tasks

I’m keeping the database of LLM tasks private for now to decrease the likelihood that LLMs are being trained on this data. Each task consists of a prompt as well as a description of how to evaluate the solution:

  • Check: I’m just running a type check on the code.
  • Eval: The code is run with a given input, and I’m expecting a given output.
  • Contains: I’m just checking if the code contains a given string (or multiple strings).
  • Run: The code should implement a server, and I’m making HTTP requests against it.

I’ve included the Contains task to also test for the knowledge an LLM has about the given ecosystem, e.g. “How do I do the following in this third-party library?”

The tasks are split up into general tasks, which can be solved in any language, and language-specific tasks, which are only solvable in a certain language. One example is a task that requires solving a complex Rust lifetime issue. This is only solvable in Rust.

The set of specific tasks exists to answer question 2: Which programming language is best for AI-supported programming?
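
To make this a bit more concrete, here is a minimal sketch of what a task record could look like. The field names and the exact structure are my own illustration; the real (private) task database may be organized differently.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Task:
    """One benchmark task (illustrative structure, not the real schema)."""
    prompt: str                          # may contain {{lang}} / {{string_type}} placeholders
    check: str                           # one of "check", "eval", "contains", "run"
    expected: Optional[str] = None       # expected output / substring, depending on the check
    only_language: Optional[str] = None  # set for language-specific tasks, e.g. "rust"

# A language-specific task only makes sense in one language:
rust_lifetime_task = Task(
    prompt="<private>",   # the actual prompts are kept private
    check="check",        # a type check is enough to verify the lifetime issue is solved
    only_language="rust",
)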

An example of a general task is the following:


Can you write a {{lang}} function that receives 
a {{string_type}} as a parameter and returns the 
same {{string_type}} uppercased?

In this case, {{lang}} is a placeholder for the programming language and {{string_type}} is a placeholder for the string type in that language.
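
Filling in those placeholders could look roughly like this; the LANGUAGES mapping and the render_prompt helper are hypothetical names I’m using for illustration, not code from the actual project.

# Hypothetical mapping from language to the name of its string type.
LANGUAGES = {
    "Python": "str",
    "TypeScript": "string",
    "Swift": "String",
    "Rust": "String",
}

def render_prompt(template: str, lang: str) -> str:
    """Substitute the {{lang}} and {{string_type}} placeholders for one language."""
    return (template
            .replace("{{lang}}", lang)
            .replace("{{string_type}}", LANGUAGES[lang]))

template = ("Can you write a {{lang}} function that receives a {{string_type}} "
            "as a parameter and returns the same {{string_type}} uppercased?")
print(render_prompt(template, "Rust"))
# Can you write a Rust function that receives a String as a parameter
# and returns the same String uppercased?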

I would love to support more languages, but these are the ones I’m most familiar with.

Each task is executed three times to see how stable the results are for the given case (this is important as LLMs are not deterministic).
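
In code, the repetition boils down to a small helper like the one below, where attempt stands in for “send the prompt to the model and evaluate the answer”; this is a sketch, not the actual harness.

from typing import Callable

RUNS = 3  # every task/model combination is attempted three times

def success_rate(attempt: Callable[[], bool], runs: int = RUNS) -> float:
    """Fraction of attempts that pass the task's check (LLMs are not deterministic)."""
    passed = sum(1 for _ in range(runs) if attempt())
    return passed / runs

# Usage, with hypothetical solve_with_llm/evaluate helpers:
# success_rate(lambda: evaluate(task, solve_with_llm(model, prompt)))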

The contenders

In order to simplify the whole process, I’m using OpenRouter to access the LLMs. This allows me to use the same API for all of them; a minimal example request is sketched after the list below.

  • amazon/nova-pro-v1
  • anthropic/claude-3.5-haiku-20241022:beta
  • anthropic/claude-3.5-sonnet:beta
  • deepseek/deepseek-chat
  • deepseek/deepseek-r1
  • deepseek/deepseek-r1-distill-llama-70b
  • meta-llama/llama-3.3-70b-instruct
  • microsoft/phi-4
  • mistralai/codestral-2501
  • mistralai/mistral-large-2411
  • mistralai/mistral-small-24b-instruct-2501
  • openai/gpt-4o-2024-11-20
  • openai/gpt-4o-mini
  • openai/o3-mini
  • qwen/qwen-max
  • qwen/qwen-plus
  • qwen/qwen-turbo
  • qwen/qwq-32b-preview
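
For reference, OpenRouter exposes an OpenAI-compatible endpoint, so a single request looks roughly like this (sketched here with the openai Python package; not necessarily how my harness is implemented):

import os
from openai import OpenAI

# One client for every model in the list above, because OpenRouter speaks the OpenAI API.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

def solve_with_llm(model: str, prompt: str) -> str:
    """Send one benchmark prompt to one model and return its raw answer."""
    response = client.chat.completions.create(
        model=model,  # e.g. "openai/o3-mini" or "anthropic/claude-3.5-sonnet:beta"
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content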

Results

Now for the most interesting part: the results. I will try to answer various questions based on the data, including the two original questions:

  1. Which LLM is the best for a given programming language?
  2. Which programming language is the best for AI-supported programming?

Compare languages across LLMs

Here you can see the success rate of each LLM for each language. o3-mini is really good at Swift. What I find particularly interesting is that some much cheaper LLMs perform almost as well as more expensive ones. Let’s hone in on this by taking costs and duration into account.

Compare costs and duration across LLMs and languages

The big surprise here is just how slow the DeepSeek models are. These values are real: I ran the whole suite 3 times, and it took DeepSeek R1 more than 9 hours to complete, compared to 110 seconds for Claude Sonnet or 487 seconds for o3-mini.

Stability

Another thing I tested for is the stability of the results: if I ask the same question, how often do I get a correct answer? This is why the whole benchmark suite ran 3 times. The next table shows the successful runs per language.

  • R1 means first run
  • R2 means second run
  • R3 means third run

Best language for LLMs

Back to the original question: which language is best for LLMs? To answer that, we’re going beyond the general success rate and will look at the specific success rate, which includes language-specific questions about the ecosystem and tricky type errors.

“Python (Gen)” refers to the general questions, whereas “Python” refers to the language-specific questions.

Here are the languages sorted by the success rate of the highest-performing LLM (always o3-mini).

  • Python: 89%
  • TypeScript: 73%
  • Swift: 69%
  • Rust: 65%

A couple of other observations

  • Qwen-Max is as good at Python as Sonnet but costs only 30% as much. It is also only slightly slower.
  • DeepSeek-chat is really good and really cheap, but also really slow.
  • DeepSeek-R1 is far worse than o3-mini but only slightly cheaper (and so unbelievably slow).
  • o3-mini is the best in all categories.

All Data

Here you can see all data in one huge table.

Adding more languages

If you’re curious about adding a language, let me know on GitHub. Note that the actual tasks are kept private to prevent LLMs from being trained on them.