GLM 5.2 beats Claude in our benchmarks

(semgrep.dev)

228 points | by jms703 4 hours ago

18 comments

pimeys 35 minutes ago
I have taken another look at these open models after the fiasco of Fable and GPT 5.6 this weekend and... GLM-5.2 truly is a good workhorse model for daily programming. I consider myself a heavy user of LLMs and a seasoned developer. A typical session for me with GPT is usually over a hundred dollars...
This weekend I programmed a Matrix bot with encryption and a Rust agent with some tools. Because I need one and OpenClaw just felt... not what I wanted. Two days later and 20 dollars poorer I have what I need: a multimodal agent written in Rust that has access to my homelab.
Nothing felt off with GLM. It did what I wanted, was fast, had a decent, not very annoying personality and was much cheaper than Opus or GPT.
I used it unquantized through Fireworks, but there are multiple other providers too.
- shostack 27 minutes ago
  If you're using Matrix, consider Hermes as a harness if you haven't already. Native gateway support. I've been primarily using mine through Element and it has largely been great.
  - pimeys 17 minutes ago
    Oh interesting. I basically chose Matrix because setting anything up with WhatsApp or Signal was kind of painful and Telegram doesn't make it easy to use encryption with bots.
    I kind of wanted to see if I can make a Matrix agent from scratch with Rust with GLM and it was surprisingly easy. Just make something for myself how I want it. Maybe I'll take a look at Hermes later...
- dist-epoch 27 minutes ago
  $20 on API pricing or on subscription?
  - pimeys 23 minutes ago
    API, pay per token.
- HKCM852 28 minutes ago
  Which harness did u use?
  - pimeys 23 minutes ago
    Opencode and Zed about 40/60.
bArray 1 hour ago
Apparently GLM 5.2 is 753B parameters [1], what kind of hardware are people using to run this locally?
[1] https://huggingface.co/zai-org/GLM-5.2
- kccqzy 4 minutes ago
  Run quantized versions. https://unsloth.ai/docs/models/glm-5.2
- crocowhile 1 hour ago
  follow antirez - https://x.com/antirez/status/2071173841175363905?s=20
  - JamesSwift 55 minutes ago
    That's quantized
- dakolli 39 minutes ago
  8 X RTX6000. It will run you around 80-100k to get started with a model at this size with decent tps..
  Don't worry though, open source evangelists will tell you that these will be running on your phone in the next 3 years.
  For $100k you could run this model 24/7 through OpenRouter with 10 concurrent sessions at 50tps for a decade and have money left over for a vacation. There's no point in investing this type of money in local models unless you have a business where you're already paying for many employees' individual token usage.
  - InvertedRhodium 2 minutes ago
    Depends how much you value privacy and running uncensored models.
    Personally, I’m waiting for hardware to hit the secondary market before I buy something to run unquantized models like GLM. But I have no doubt that I will, at some point.
  - Aurornis 23 minutes ago
    > 8 X RTX6000. It will run you around 80-100k to get started
    8 x RTX6000 GPUs cost $100,000 alone. You then need to build a system that can support those GPUs with enough PCIe lanes through a PCIe switch.
    It's going to be $120K to $150K to build or buy a system to run this.
    - CamperBob2 13 minutes ago
      You can run the NVFP4 quant with 8x RTX6000 cards at 50-75 tps output, but not (practically speaking) the OEM FP8 version. You will learn more about PCIe than you ever wanted to know.
      The real gangstas are running 16x RTX6000s. Too rich for my blood, and the NVFP4 quant doesn't seem to be that much worse.
  - wonnage 0 minutes ago
    Yeah, the neoclouds and hyperscalers are taking massive losses right now, self hosting is basically signing yourself up to do the same. There are philosophical reasons to do so but it’s a terrible economic decision
  - 8note 36 minutes ago
    you can however, have fun with it.
    oil workers buy 100k trucks they do not much with. why not 100k in a computer?
    - afavour 2 minutes ago
      Because car loans can’t be used to buy computers
    - Ken_At_EM 32 minutes ago
      I can't help but ask where this comment came from, you must have some exposure..
      - CamperBob2 12 minutes ago
        It is so easy to spend $100K on a pickup truck these days, it's not even funny.
    - dakolli 34 minutes ago
      Sure, if you want to light money on fire for entertainment, more power to you. There are probably worse ways to light 100k on fire. If I have an extra 100k lying around it's going to my family though.
  - krackers 27 minutes ago
    Would you be better off pooling that money with some hackerspace group and then setting up shared inference infra, so that way you at least get better utilization?
  - rekttrader 35 minutes ago
    Or you have data covered by HIPAA or GDPR, or containing PII, or you have to care about others training on your data.
    - dakolli 33 minutes ago
      That too.
  - dist-epoch 26 minutes ago
    > 50tps for a decade
    assuming demand doesn't keep on increasing. even google has trouble having enough capacity apparently.
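The figures traded in this subthread can be sanity-checked with some rough arithmetic. The sketch below only uses numbers quoted above (753B parameters, 8 cards, $100k, 10 sessions at 50 tps for a decade), plus one assumption that is not stated in the thread: 96 GB of VRAM per RTX 6000 card. It counts weights only, ignoring KV cache, activations and quantization scale overhead, and the function names are made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000; leap days ignored


def weights_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return params_billion * bits_per_param / 8


def breakeven_usd_per_million_tokens(
    budget_usd: float, sessions: int, tokens_per_second: float, years: float
) -> float:
    """API price per million tokens at which the budget is exactly used up."""
    total_tokens = sessions * tokens_per_second * SECONDS_PER_YEAR * years
    return budget_usd / (total_tokens / 1_000_000)


if __name__ == "__main__":
    vram_gb = 8 * 96  # ASSUMPTION: 96 GB per card, not stated in the thread
    for label, bits in (("8-bit", 8), ("4-bit", 4)):
        size = weights_gb(753, bits)
        print(f"{label}: ~{size:.1f} GB of weights, {vram_gb - size:.1f} GB left of {vram_gb} GB")

    price = breakeven_usd_per_million_tokens(100_000, 10, 50, 10)
    print(f"break-even API price: ~${price:.2f} per million tokens")
```

Under these assumptions an 8-bit copy of the weights nearly fills all eight cards, which is consistent with the remark that only the 4-bit quant is practical on that setup. The "$100k for a decade" claim holds only if the blended API price stays below the printed break-even figure; actual provider pricing is not given in the thread.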
WithinReason 1 hour ago
> [...] beating Claude Code (32%) at roughly $0.17 per vulnerability found
Claude Code is an agent harness, not an LLM.
Claude is a brand (or group of LLMs), not an LLM.
- raincole 58 minutes ago
  Yes, and the article author is fully aware of that. Thank you for pointing out this small mistake though.
  - mkagenius 11 minutes ago
    It looks like the author is specifically avoiding the models' names, because the results are really weird.
```
  Opus 4.8/4.7 scored 28%

  Opus 4.6 scored 37%
```
    So the author thought: let's not get into that, just write Claude.
    - andriy_koval 4 minutes ago
      many people think opus 4.6 was the best
- tills13 26 minutes ago
  It costs nothing to not be pedantic.
- Onavo 1 hour ago
  Claude Code is the only way to get access to the actual amortized cost of running a Claude-scale model. The consumer non-enterprise API is extremely expensive (with increasing marginal costs for the user and fat profit margins for Anthropic). If you want to approximate a state-level attacker's cost where they can have the model on their own hardware, Claude Code is probably the best guess at the amortized cost.
himata4113 2 hours ago
These numbers seem pretty low compared to what I was able to achieve specifically around the Windows kernel, win32k<->win32u to be exact. It honestly wouldn't surprise me anymore if China started surpassing the models that the US makes public, at least in specific categories such as cyber.
GLM 5.2 is already capable enough to assist in self-training, which is similar to what we saw happen with frontier models, and they appear to be getting there at a significantly lower cost than OpenAI/Anthropic.
solenoid0937 2 hours ago
GLM export controls incoming? I predict Commerce will force OpenRouter, HuggingFace to take some open models down within the next few months.
Not that it would make any sense.
- rgbrenner 1 hour ago
  If that happens it'll be an absolute disaster. Imagine a scenario where Anthropic and OpenAI prohibit most US companies from using their latest models because of safety.. And meanwhile attackers use equivalent open source models to attack US companies.
  Any prohibition on open source models will do nothing to fix the problem.. since attackers will never feel bound to the law. All advanced models must be available for defensive purposes.
  - andy99 1 hour ago
    Right, but is there any evidence of intelligence behind any of these (government) decisions? It’s just regulatory capture + marketing (plus some people living out an imaginary fantasy that they’re in Neuromancer or something), absolutely no reason to think they won’t try and target open models as part of this.
    - popalchemist 1 hour ago
      There's at least one reason: much harder to make a profit in policing non-American companies and open-source models without huge (or even any) MRR.
      If the real motive is profit, then open source models are likely simply not a viable means to that end.
  - solenoid0937 1 hour ago
    > since attackers will never feel bound to the law.
    But that's the whole point.
    Fall out of favor with the admin and you lose access to the good American models, aren't allowed to use Chinese ones, and fall prey to the attackers and behind your competitors.
- skissane 11 minutes ago
  I’m sceptical they could find the legal framework to do this even if they wanted to
  They have legal authority to (a) prevent export of US goods/services; (b) ban imports of physical goods; (c) ban transactions (including purchasing services or license agreements) with foreign firms
  But I’m not aware of any legal authority which lets them ban US firms from running a Chinese-developed open source AI model in the United States, if they are at arms length from the vendor, and aren’t using it for government contracts or regulated applications
- aussiegreenie 58 minutes ago
  The Americans may ban the use of the Chinese models in America. But like the Chinese car ban, everyone else will use them.
- gruez 2 hours ago
  >GLM export controls incoming?
  US imposing export restrictions on a model from China?
  - mcintyre1994 2 hours ago
    It’d be restrictions on Americans and American companies, and probably also pressure on America’s allies.
    - mkagenius 30 minutes ago
      Token smuggler sounds like a profession coming soon. For distillation and stuff.
  - manquer 2 hours ago
    While unlikely, it is not without precedent: there are restrictions on ASML, a Dutch company, selling EUV machines
    - throwup238 59 minutes ago
      That’s because the Department of Energy originally funded and contributed IP to the EUV Corp joint venture between several semiconductor companies (including ASML and Intel). Their ability to export control EUV was part of that original agreement that the entire technology is built on.
    - verdverm 2 hours ago
      ASML complies as an ally, why would China comply?
      The weights are already available and downloaded, is it going to be a crime to have them, run them, make them available? Constitutional rights still exist (I hope)
      - solenoid0937 2 hours ago
        > is it going to be a crime to have them, run them, make them available?
        Now you're getting it! Commerce will call it a munition and those harboring it as harboring illegal/foreign munitions.
        No business will take the hit, so they will quickly deplatform the models.
        No end user has the GPU capacity to use GLM 5.2 or similar models at full precision so the government will call the problem "mostly solved." But they might choose to "make examples" out of a few people using p2p software to download the weights if they choose to.
        - verdverm 1 hour ago
          Or we use the models to work on fixing vulns and stop over-blowing the doom scenarios. Gotta save the kids and kill the terrorists though!
          I'm for making software better instead of banning it based on what the rich and powerful claim.
          I suspect the real fear is that open weight models undermine the financials and token prices they thought were going to pay off their ludicrous spending because they have all raced and raised hardware prices.
          - hadlock 54 minutes ago
            > making software better instead of banning it
            We're still in the middle of the cambrian explosion.
            If Anthropic was capable of developing Opus 4.49-4.5 2H 2025.... then any company with a research team capable of reading all the papers and press releases will be capable of producing Opus 4.8 by the end of 2027, either raw model competency, or in a harness like claude code (or better with both). I guess what I am trying to say is that Opus 4.5 does not represent the edge of agentic capability, merely somewhere in the thick meaty layer of "functional and achievable".
            We can draw the line at Sonnet 4.6 in the US but much like encryption export restrictions in the 1980s, the line drawn will be laughably low within a few years and simply unthinkable in a decade.
          - solenoid0937 1 hour ago
            > making software better instead of banning it
            That would be the rational thing to do.
            > financials and token prices
            I do not think the government thinks this deeply. Market manipulation might be a rational, if unethical reason to ban open source models.
            But this admin banned Anthropic models to "own the libs." They will continue to ban what they want for whatever reason they want. I don't think those reasons will be particularly coherent.
            - verdverm 1 hour ago
              Yeah, the current admin is reactionary, they appear to put little thought in, or at least disregard input they dislike. I don't think Ant's ban was about "owning the libs" as much as it was asserting dominance over someone who spoke up counter to the admin's aims and claims. They do listen to money, which is where I see Big Ai paying for executive orders (because the admin forgot what it means to compromise as part of legislating for all americans).
      - matheusmoreira 1 hour ago
        > is it going to be a crime to have them, run them, make them available?
        Yeah. Illegal numbers.
  - fph 1 hour ago
    How would that even work for an open-weight model?
- djeastm 59 minutes ago
  I think state-of-the-art AI is going to be defense industry only from now on. We can have our toy drones but not the Predators and Reapers.
  - Gigachad 48 minutes ago
    Turns out toy drones are more useful in war than multi million dollar planes anyway.
    - techpression 37 minutes ago
      Reaper and Predator are both drones and there’s really no comparison to toy drones in terms of sheer destruction and capabilities in general, the comparison is actually quite apt imo.
  - serf 53 minutes ago
    the things that empower modern toy drones were export restricted for years beforehand.
- dakolli 36 minutes ago
  Cool then everyone will just change their config to route through a provider overseas for an added 50-100ms latency. Who cares.
g42gregory 49 minutes ago
If only the "cybersecurity" crowd were focused on patching the vulnerabilities.
Instead of shilling for the LLM providers.
- _factor 46 minutes ago
  The robot figured out how to bump the lock. The obvious solution is to ban the robot.
- __MatrixMan__ 26 minutes ago
  But if we patch all of the vulnerabilities, who will pay for our vulnerability scanner?
theteapot 1 hour ago
> Constant: the IDOR dataset (the same real, open-source applications we've used in prior research) ...
What were they? Also, wouldn't one expect a more recently released coding agent (with a more recent knowledge cutoff) to perform better because it has access to more knowledge about vulns in these OSS projects, and possibly even has knowledge of your own "prior research"?
- mkagenius 2 minutes ago
  One would. But then the results are even weirder as opus 4.6 scored more than opus 4.8 by a huge margin
laybak 21 minutes ago
how representative are Semgrep's benchmarks? everyone seems to have their own benchmark these days (guess it's good "content marketing") I'm honestly losing track
danslo 2 hours ago
It reads like an ad.
Secondly these are "just" IDORs, arguably the easiest class of vulnerabilities.
Thirdly it compares to GPT 5.5 and Opus 4.8.
No, we don't have Mythos at home.
- vlian2088 2 hours ago
  >Thirdly it compares to GPT 5.5
  mythos is <10% ahead of gpt 5.5 on all benchmarks, which it gains by being several times the size of opus. had it been economical to provide, it would've been released to the public on day one instead of the marketing circus those effective altruism clowns had exhibited. admitting that it costs >1000% to run inference on a <10% better model would've been very damning.
  - oa335 13 minutes ago
    > it costs >1000% to run inference
    do you have a source for this claim? i thought LLM providers earn high margins from inference (charged by token). is this no longer the case?
- InsideOutSanta 2 hours ago
  In my experience, GLM 5.2 is extremely good at finding vulnerabilities, and more importantly, unlike Opus, I've never seen it refuse a command. It genuinely is a very strong model for finding and fixing vulnerabilities.
- NitpickLawyer 1 hour ago
  > Thirdly it compares to GPT 5.5 and Opus 4.8.
  > No, we don't have Mythos at home.
  That's still useful. To paraphrase the kids these days, GLM5.2 is in the room with us, today. Mythos is not. And for us in the EU, it's even more complicated, as Mythos might be with us in the room one day, and go poof the next day, on the whims of political entities that we have 0 control over.
  Knowing where open, accessible, local models are is important. We know they're behind. But there comes a time when "good enough" is useful. Even if they're "just IDORs" today, and even if they're behind SotA today.
  As someone else said above, GLM5.2 (and other models in the same tier like kimi, dsv4, etc) is / are slowly becoming "good enough" to assist in automated repo preparation work (download, install, test, edit, re-test, etc). And that translates into RL traces ready to be trained into the next generations. That might be more important than being x% behind on benchmarks.
- sanid 1 hour ago
  Technically we don't have Mythos at all? You guys have access. This tells me we have Opus at home (open weights).
- jimbob45 1 hour ago
  Yeah they straight up say that their criteria is narrow and primarily important for their specific use case. Never let rationality cause your pitchfork to be cast away though!
veselin 2 hours ago
Here, it appears they compare a single prompt "find IDOR", against a multi-agent system. However, one can also start far more sophisticated skills that spin up subagents and mostly do the same in Claude Code, Codex, OpenCode, Pi, etc.
Which I guess makes what semgrep sells obsolete. Unless they have built a pareto-optimal point in terms of capabilities and token usage maybe?
- blazespin 2 hours ago
  I think the point is less "how can we throw shade on the OP" and more "a harness can enable a lot of models to do very serious cybersec, glm 5.2 is one of them"
  - s3p 2 hours ago
    Are you replying to a response to the original comment? I looked but i didn't see anyone saying he's throwing shade.
    - BikiniPrince 49 minutes ago
      You have to forgive the GLM bot. It's not very good.
csjh 7 minutes ago
I found it to spiral into complete nonsense a few times when I tested it out, but it's possible that was a bug in the provider
kordlessagain 4 hours ago
You can launch GLM-5.2 in Opencode using Nemesis8: https://github.com/DeepBlueDynamics/nemesis8#nemesis-8
After installing, do a `n8 build` to build the image, then `n8 --danger --provider opencode interactive` to launch it in a container.
Sign up for GLM-5.2 here: https://z.ai
- sanid 1 hour ago
  One can also try https://neuralwatt.com using it in opencode.
  I think they give $5 trial credits to test with any of the open weight models.
admax88qqq 2 hours ago
> beats Claude in our Cyber Benchmarks
Beats which model in Claude? Whenever a "benchmark" doesn't put precise model numbers in their headlines I am immediately skeptical. Either they don't know the difference (bad) or they are benchmarking against weaker models (misleading, also bad).
It's like when studies say "AI is bad at X" and they used GPT-3.5 in current year.
- InsideOutSanta 1 hour ago
  They say "Claude Opus 4.8" in the first paragraph.
  - crm9125 47 minutes ago
    We're supposed to read the article?
    How are we supposed to stay skeptical of everything if we read anything!?
- ls612 2 hours ago
  Opus 4.8 according to TFA. Whether or not the safety guardrails were responsible for the difference is an open question but for a dev who wants to secure their software who doesn’t work at one of the blessed Glasswing companies it doesn’t really matter why, it matters what the best tool you actually have is.
dist-epoch 28 minutes ago
Anthropic is saying other models were good at detecting vulnerabilities, where Mythos excelled was in creating functional exploits for them.
This article only talks about detecting vulnerabilities, so it's unclear if it's a true Mythos equivalent.
rode1974 1 hour ago
Hopefully I get a MacBook Pro soon enough to run some small or medium sized LLMs
- paperterminal 1 hour ago
  Same, but so much $$
BikiniPrince 42 minutes ago
This is a joke right? I wouldn't install this in a sandbox.