I've been working on building out "evals for your repo," based on the theory that commonly used benchmarks like SWE-bench are broken: they aren't testing the right/valuable things, and they are baked into the training data (see OpenAI's research on this here https://openai.com/index/why-we-no-longer-evaluate-swe-bench...)
Interestingly, I had a similar finding: on the 3 open-source repos I ran evals on, the models (5.1-codex-mini, 5.3-codex, 5.4) all had relatively similar test scores, but on other metrics, such as code quality or equivalence to the original PR the task was based on, they had massive differences. Posted results here if anyone is curious: https://www.stet.sh/leaderboard
This matches what I see daily. The AI-generated code that worries me most isn't the code that fails tests, it's the code that passes everything but adds complexity no one asked for.
Last week my agent refactored a simple API route into three abstraction layers "for future extensibility." Tests still passed. Build still passed. But now I have three files to maintain instead of one, and the "extensibility" will never be used.
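For illustration, here's a hypothetical sketch of the kind of over-abstraction I mean. All names are invented; the point is the shape, not the specifics:

```python
# Hypothetical before/after, illustrating the over-abstraction pattern.

# Before: one function, one file. Easy to read, easy to change.
def get_user(user_id: int) -> dict:
    return {"id": user_id, "name": "alice"}


# After: three layers that exist mostly to call each other,
# "for future extensibility" that never arrives.
class UserRepository:
    def fetch(self, user_id: int) -> dict:
        return {"id": user_id, "name": "alice"}


class UserService:
    def __init__(self, repo: UserRepository):
        self.repo = repo

    def get_user(self, user_id: int) -> dict:
        return self.repo.fetch(user_id)


def get_user_route(user_id: int) -> dict:
    # The route now just forwards through the service layer.
    return UserService(UserRepository()).get_user(user_id)
```

Both versions return identical results and pass identical tests, which is exactly why a test-based eval can't see the difference.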
SWE-bench measures "can it solve the problem." It doesn't measure "did it solve only the problem." That gap is where most of my debugging time goes.
makes sense! we wrote something yesterday about the weaknesses of test-based evals like swe-bench [1]
they are definitely useful but they miss the things that are hard to encode in tests, like spec/intent alignment, scope creep, adherence to codebase patterns, team preferences (risk tolerance, etc)
and those factors are really important. which means that test-evals should be relied upon more as weak/directional priors than as definitive measures of real-world usefulness
For the most part, I think the tests AI has been given have been appropriately designed. At release, many AIs do poorly on them; the models then rapidly catch up until a new test is needed.
They should be measuring close to the limits of ability like that.
There will be some that try to steal headlines by targeting the specific nature of the test, but that is not a long-term winning strategy; the tests keep getting harder. And if they make a model good at every test it has seen without regression, then with enough tests, that too ceases to be a problem.
Perhaps there should be an aggregate AI test score that evaluates all of the tests released in a given year. If a model passes the latest test really well but does worse at TestSet2024 than the models before, it would perhaps indicate the model being trained to pass the latest cool test.
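A minimal sketch of that heuristic, with invented scores and test-set names, just to show the idea:

```python
# Sketch: flag a model that aces the newest benchmark but regresses
# on older ones. All numbers and names below are invented.

def flag_regressions(model_scores: dict, baseline_scores: dict) -> list:
    """Return the test sets where the model scores below the best
    earlier model, a hint that it was tuned for the newest test."""
    return [
        test for test, score in model_scores.items()
        if score < baseline_scores.get(test, 0.0)
    ]


baseline = {"TestSet2024": 0.71, "TestSet2025": 0.55}   # best prior models
new_model = {"TestSet2024": 0.63, "TestSet2025": 0.90}  # suspicious pattern

print(flag_regressions(new_model, baseline))  # ['TestSet2024']
```

A real version would need to control for genuine capability trade-offs, but even this crude check would catch the "great at the latest cool test, worse at everything else" signature.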
There is a problem with people interpreting an AI that passes a test of X, Y, or Z as having the abilities of a human who passes X, Y, or Z. You should tell people who say that: Kasparov makes a nice coffee.
LLM-written code passed SWE Bench even back then. This may just say that SWE Bench is an inadequate test, and should not be used for serious evaluation.
Really interesting note. That echoes thoughts I’ve had about how much automated benchmark scores really reflect production‑ready code.
For me the big takeaway is that passing doesn't automatically mean the code is maintainable, follows established patterns/conventions, or is free of unexpected side effects that real reviewers care about.
I think a far greater problem is the human psychological and prejudice factor itself. When we hear there was AI assistance on a PR, we usually go straight to thinking "oh my god, is it another LLM slop?" (for example: https://github.com/jneem/imbl/pull/149#pullrequestreview-370...). I do use AI, but I review the code before I push it; most people don't. Once there is a trend, it is easy to form a prejudice and hard to go back, unless there is a substantial improvement in both quality and quantity.
Also, some people will speak up and outright reject any AI code, but most maintainers employ silent-treatment tactics. And then when you ask them to review, they either close the PR or say "I'm too busy" as an argument. I would call this one of the biggest dick moves, because it hurts the most, yet you can't find anything wrong with them until they reveal their motives.
Do these benchmarks make any sense? I tried a few local models that seem to score well on SWE-bench, but the results were pure rubbish. (For instance, MiniMax-M2.5 at 128 GB from Unsloth was completely unusable.)
Which quant? I find folks running lower quants complaining when they should be running a higher quant. Qwen3CoderNext is great, even at Q6. I mistakenly had it loaded for an agentic workflow and was surprised at how well it did.
What is "lower quant"? What is "higher quant"? I mean, I know what they are, but the very people you intend to reach don't know the difference between Q4_K_M and Q6_K and blog posts like [1] have nuggets like "For tests of the type ran here, there appear to be major diminishing returns past Q4".
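For anyone lost in the jargon: the quant label roughly tells you how many bits each weight keeps, which sets both the file size and (loosely) the quality floor. A back-of-envelope sketch, using approximate bits-per-weight figures commonly quoted for llama.cpp-style quants (actual values vary by model and quant recipe):

```python
# Rough estimate of weight-file size at different GGUF quant levels.
# Bits-per-weight values are approximate, not exact for any given model.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q6_K": 6.56, "Q8_0": 8.50, "F16": 16.0}


def approx_size_gb(n_params_billion: float, quant: str) -> float:
    bits = BITS_PER_WEIGHT[quant] * n_params_billion * 1e9
    return bits / 8 / 1e9  # bits -> bytes -> GB (decimal)


for q in ("Q4_K_M", "Q6_K", "Q8_0"):
    print(f"70B at {q}: ~{approx_size_gb(70, q):.0f} GB")
```

So the practical question "which quant?" is really "how many bits per weight can you afford to keep in memory?", and "lower quant" means fewer bits and more quality loss.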
SWE-bench scores well in the narrow task of making tests pass, which means models get good at exactly that. Real codebases have style constraints, architecture choices, and maintainability concerns that don't show up in any test suite. Not surprised at all that the PRs wouldn't get merged; you'd expect that from an eval that can't measure what reviewers actually care about.
They might have tried, but this would be pretty hard to achieve for real, especially for the older/worse models. For changes that do more than alter a couple of lines, LLM output can be very obvious. Stripping all comments from the changeset might go a long way toward making it more blind, but then you're missing context that you kinda need to review the code properly.
I feel like I don't have the context for this conversation. If slop is obvious as slop, I feel like we should block it.
If you look at the comment, it says what the code following the comment does. It doesn't matter whether a human or a machine wrote it: it is useless. It is actually worse than useless, because if someone needs to change the code, now they need to change two things. In that sense, you have just doubled the work for anyone who touches the code after you, and for what benefit?
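A hypothetical illustration of the difference (both functions and the rationale are invented):

```python
# The kind of comment that restates the code. It adds nothing and
# goes stale the moment the line below it changes.
def deactivate_user(user: dict) -> None:
    # set the user's "active" field to False
    user["active"] = False


# A comment worth keeping explains *why*, which the code can't say.
def soft_delete_user(user: dict) -> None:
    # Keep the row: billing reconciles against historical users,
    # so we flag instead of deleting. (Invented rationale.)
    user["active"] = False
```

Same code, but only one of those comments would survive a change to the implementation without becoming a lie.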
The point is that AI models do these kinds of things all the time. They're not really all that smart or intelligent; they just replicate patterns or boilerplate and then iterate until it sort of appears to work properly.
[1] https://voratiq.com/blog/test-evals-are-not-enough/
Is this a post about AI archeology?
[1] https://big-stupid-jellyfish.github.io/GFMath/pages/llm-quan...
In any case, the blinding didn't stop Reviewer #2 from calling out obvious AI slop. (Figure 5)
That "appears" is doing a lot of heavy lifting.
The code working isn't what's being selected for.
The code looking convincing IS what is being selected for.
That distinction is massive.