On the topic of local models, is there a good equivalent to something like Claude's chat interface? I've recently started transitioning to open models after getting fed up with Claude's usage limits (I'm not in a position to drop $200/month), and for coding tasks Kimi 2.6 has been about the same as Sonnet in my experience. The only thing I've found myself missing is a nice interface to ask it questions and have it help me with my math assignments.
I've been mostly using LM Studio for this recently. Ollama has an OK chat UI now too. 'brew install llama.cpp' gets you 'llama-server' which provides quite a good web UI.
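For anyone trying the llama.cpp route: llama-server also exposes an OpenAI-compatible HTTP API alongside its web UI, so you can script against it. A minimal sketch in Python (the port and endpoint path are llama-server defaults; adjust for your setup):

```python
import json
import urllib.request

def chat_payload(prompt, system="You are a helpful assistant."):
    """Build an OpenAI-style chat completion request body."""
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
        "temperature": 0.7,
    }

def ask(prompt, url="http://localhost:8080/v1/chat/completions"):
    """POST the payload to a locally running llama-server and return the reply text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

The same two functions should work against LM Studio or Ollama as well, since both speak roughly the same OpenAI-compatible dialect, though the base URLs differ.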
I test drove it yesterday. It's pretty impressive at 8b. Runs on commodity hardware quickly.
Qwen3.6 35b a3b is still my local champion but I may use this for auto complete and small tasks. Granite has recent training data which is nice. If the other small models got fine tuned on recent data I don't know if I would use this at all, but that alone makes it pretty decent.
The 4b they released was not good for my needs but could probably handle tool calls or something
Have you tried the Gemma 4 series, out of curiosity? I haven’t run a local model in a while, but the benchmarks look good. I’d take a free local tool-use model if it was relatively consistent.
Qwen 3.6 burns it to the ground; it wasn't even a challenge. Gemma 4 seriously fails at tool calls and agentic work. It got all messed up after 2-3 turns of vibecoding.
Counter-point: I built an agent that can only interface with Kakoune, a much less common and more challenging situation for an LLM to find itself in, and Gemma4-A4B quantized to 8 bits does remarkably better than Qwen3.6-35B-A3B (a model in a similar class) at actually figuring out how to get text into buffers.
Now, is this the usual use case? No, it's a benchmark I created specifically in order to put LLMs in situations where they can't just blast out their bash commands without having to interface with something else and adapt.
Gemma 4 31b was working OK for me, but it was consuming tons of memory on SWA checkpoints and I had to turn them way down, and as a 31b dense model it's fairly slow on a Strix Halo. I did have a lot of tool calling issues with 26b-a4b, though.
I have tested Gemma4-26B against Qwen3.6-35B. Gemma beats Qwen on structured data extraction and instruction following; it is far more precise in these tasks, while Qwen gets a bit more creative, verbose, and imprecise. However, Qwen has far more general smarts and higher token throughput. Qwen could precisely pinpoint issues in data quality and code where Gemma had no clue. On coding skills, Qwen appears to have an edge over Gemma, but this could depend on the agent you use; in direct chat (llama.cpp UI), both models show the same coding skills.
I tried the Gemma 4 2b and 4b, I think. The 2b was not useful for me at all; a little too weak for my use cases.
The 4b was okay. It didn't get all of my small math questions right and it didn't know about some of the libraries I use, but it was able to do some basic autocomplete-type stuff. For microscopic models I like Llama 3.2 3b more right now; it's a little faster and seems a little stronger for what I do. But everyone is different, and I don't think I'll use it anymore; this past month has been crazy for local model releases.
For me, I use them for quick auto complete or small questions. I am not a vibe/agentic coder. I know I am a relic and a Luddite because of this.
Instead of hitting Stack Overflow and Google I will ask questions like "can you give me an example of how to do x in library y?", or "this error is appearing; what might be happening, given that I checked a, b, and c?", or "please write unit tests for this function". Or code autocomplete.
I am not looking for the world's best answer from a 3b model. I am looking for a super fast answer that reminds me of things I already know, or maybe, just maybe, gives me a fast idea to stub something out while I focus on something more important; I am going to refactor anyway. Think of it as a low-quality rubber duck.
I mostly use 7-9b models for this now, but Llama 3.2 3b is pretty decent for not hogging resources while, say, I have other compute-heavy operations happening on a weak computer.
Probably half the questions people ask ChatGPT could get roughly the same quality of answer from a small model, in my opinion. You can't fully trust an LLM anyway, so the difference between 60% and 70% accuracy isn't as big as marketing makes it sound. That said, the quality of a good 7-9b model is worth it compared to a 3b if your machine can run it. Furthermore, the quality of Qwen 3.6 is crazy and makes me wonder if I will ever need an AI provider again, if the trend continues.
No comparison with competitor models other than the previous Granite version strongly implies that it does not compete well with other comparable models. At least, that is the most reasonable assumption until data comes out to the contrary.
Qwen scores above sonnet in coding benchmarks. Runs locally. In personal use it's really good. Anecdotally others have used it to vibe code or agentic code successfully. Not toy problems. Not a toy model.
Qwen3.6 raises the bar for models of its size. There really isn't a comparison in my opinion.
College SAT scores do not tell you how the dev applying for your open back end systems engineering job is going to do once they're in your workplace harness.
Nor do class standings, nor hackerrank and the like.
What will tell you is asking them to fix a thing in your codebase. Once you ask an LLM to do that, a dozen times, I'd argue it's no longer "just your opinion man", it's a context-engineered performance x applicability assessment.
And it is very predictive.
But it's also why someone doing well at job A isn't necessarily going to be great at B, or bad at A doesn't mean will necessarily be bad at B.
I've often felt we should normalize a sort of mutual try-before-you-buy period, where the job-change seeker and the company can spend a series of days together without harming anyone's existing employment, to derisk the mutual learning. ESPECIALLY to derisk the career change for the applicant, who only gets one timeline to manage, as opposed to the company, which considers the applicant fungible.
But back to the LLM: yeah, the only valid opinion on whether it works for you is not a benchmark, it's an informed opinion from using it in anger.
That is how you empirically evaluate tools: not by reading stupid benchmarks, but by actually using the tools, for hours and hours, doing real work.
Did you try using it? For hours? Do you use qwen?
How about you tell us about your experience with these great 8B models that you use daily. What coding agent harness do you have them hooked up to? What context size can you get before they lose track of what's happening? Do you swap between models for different coding tasks?
Or have you not actually tried any of this stuff yourself?
Qwen3-Coder-Next seems to be the perfect size for coding. I tried the new one and just found the verbosity not really useful for coding, but probably fine for more analytical tasks or writing docs.
People complain a lot about LLM-written articles, but the human comments here on HN are far worse: mostly a bunch of people extremely proud of themselves for not reading an LLM-written article, then a bunch of people who take it at face value and make the model seem almost useful, and one comment that actually looked at other benchmarks. Good ol' humanity: good at being emotional, not so good at doing analysis.
The article makes some good points about model design (how different size models within a family can get similar results, how to filter out hallucination, math result reinforcement), so that's worth understanding. It's analyzing a paper, which only discussed 3 sizes of the same model family. But what the article doesn't say is, compared to other model families, Granite 4.1 8B sucks. The only benchmark it does well at compared to other models is non-hallucination and instruction following. Qwen 3.5 4B (among other models) easily outclass it on every other metric.
This article teaches a valuable lesson about reading articles in general. You can take useful information away from them (yes, despite being written by LLM). But you should also use critical thinking skills and be proactive to see if the article missed anything you might find relevant.
The pro-LLM rant is weird. LLMs "hallucinate" by creating detailed, elaborate lies, and the frontier models still do this egregiously. An LLM-written article has zero value by default, since every single line could be true or could be a convincingly crafted lie; every line has to be fact-checked.
I'm using Gemini 3.1 Pro to help me research my thesis. Even with search enabled and in pro mode, it still invents entire papers that don't exist, and lies about the contents of existing papers to relate them to the context or to appease me. If I submitted an LLM-written article based on the results it's given me, 80% of the article would be lies.
Commenting to point out that the article is LLM-written is helpful too, since some people aren't able to tell the difference.
If you are asking an LLM to cite its sources, you are wasting your time and degrading the quality of the response. LLMs have no inherent mechanism for "knowledge source tracking", because that isn't at all how they work. We're trying to get there with agentic stacks, but it's still too new.
For sparse knowledge tasks, where you know the model can't possibly have much training data because even humans don't have much knowledge there, use it as a brainstorming partner, not as a source. Or put relevant papers in its context to help you evaluate those papers in relation to your work. But it's just going to hurt itself in confusion trying to tie fuzzy ideas to sparse sources embedded in pages upon pages of mildly related Google search results.
No, you're being weird (and why are you calling people weird anyway? Not helpful).
You're complaining about facts that have been true since words were first written on paper. If you read the article with the same criticality you'd read any other article, you won't have the problem you complain about.
The reality is, you're only complaining because you hate AI. Cool, but don't dress it up and resort to name-calling to browbeat the other guy.
If I read something and cannot tell that it is AI generated, then there's no problem.
If it has AI tells then I won't bother to continue reading, because it was either written by an AI or written by someone who can't tell the difference.
If they can't distinguish LLM text, then why should they care?
Anti-AI people like to bring up hallucination as if everything AI generates is false.
I can write pages of text, with my own content, and then use AI to improve my writing and clarity. Then I review and edit. It might have some LLM markers in there, which I remove sometimes because it's distracting. But the final, AI assisted writing is easier to read and better organized. But all the ideas are mine. Hallucinations are not remotely a problem in this case.
If it's used to create a false narrative (like a deep fake), sure, you should care. But if it's used as an alternative to a stock photo, or as an easy way to make an infographic then no, I don't think you should care.
The problem is the signal-to-noise ratio in these articles. If an AI has written the article, then this same info could have been generated by my own AI, but tailored to my needs. So what, exactly, is the new info in this article that I can use to consult with my AI? That's what I want to get out of this interaction.
Maybe my point is something along the lines of "Just send me the prompt"[0]
The prompt, plus all the other bits of information the context was seeded with before the output was created (documents, web searches, other sources), in which case it might be more efficient to just consume the final deliverable (yourself or via LLM).
You expect people to read every single excretion, which can be generated faster than I can read, just to find the rare gem that might exist?
The problem is that in the past it took multiple times more effort and hours to write something than it took to read. That served two purposes:
1. Lazy people just looking for an audience were effectively gatekept from drowning the world with their every vapid thought.
2. Because supply was many times slower than consumption it was viable to give most articles a chance: the author could not drown me in a deluge even if they wanted to.
Having the criterion that the author should spend at least as much effort creating the piece as they expect the reader to spend reading it is a damn useful bar: instead of reading 1000 AI articles just to find the one good one, I can simply read 10 human-authored articles and be fairly certain that 9 of them have something worthwhile.
>> The only benchmark it does well at compared to other models is non-hallucination and instruction following.
I think instruction following is going to be the most useful thing these models do. Add a voice interface and access to a bunch of simple, straight-forward devices or APIs and you have a mildly useful assistant. If that can be done in 8B parameters it will soon run on edge devices. That's solid usefulness.
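That assistant pattern is mostly plumbing once the model follows instructions reliably: ask it to emit a JSON tool call, then route the call to a handler. A toy sketch (the tool names and schema here are invented for illustration, not any real device API):

```python
import json

# Hypothetical device handlers; a real assistant would call actual device APIs.
TOOLS = {
    "set_light": lambda args: f"light set to {args['level']}%",
    "get_temp": lambda args: "21C",
}

def dispatch(model_output: str) -> str:
    """Parse a JSON tool call emitted by the model and run the matching handler."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return "error: model did not emit valid JSON"
    handler = TOOLS.get(call.get("tool"))
    if handler is None:
        return f"unknown tool: {call.get('tool')}"
    return handler(call.get("args", {}))
```

The whole value of a strong instruction-following model here is that the invalid-JSON and unknown-tool branches almost never trigger.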
>Mostly a bunch of people extremely proud of themselves for not reading an LLM-written article
I'm not sure it's pride so much as people voicing displeasure with the uncertainty about what went into the LLM prompt. It may have been a one-sentence prompt, or it may have been some well-researched background that it simply reformatted. Why waste minutes or hours on verifying it if it's possible someone spent 10 seconds on it? It's very easy to see their point.
People lately seem to paint anyone they disagree with voicing an opinion as some kind of auto-fellatio; I wonder what causes them to think this way.
The thing is, it's just a bunch of other original content that has been chewed up and regurgitated into something "new". Just show us the original content instead. This is, by definition, slop. https://huggingface.co/blog/ibm-granite/granite-4-1
I don’t know how many different little models this uses under the hood, but I was shocked at how good it was at the couple of document extraction tasks I threw at it.
It's looking to me like running your own mini ecosystem is the way of the future. No data centers, just a decent GPU with 16-24GB of VRAM, a CPU, and 32GB of RAM.
I'm pretty sure there's someone somewhere who'll create a proper harness that's equivalent to one giant model. The difficulty is mostly that local hardware has a lot of memory constraints. Targeting 128GB would seem to be the current sweet spot. If we could get out from under the corporate market movers buying up all the memory, we could maybe have more.
Regardless, the kind of pruning people did in the 80s to fit programs on small devices is likely happening now. I'd bet most of the Chinese firms are doing it because of the US's silly GPU games, among other constraints.
Interesting to see a pivot away from MoE by both IBM and Mistral, while the larger classes of SOTA models all seem to be sticking with it.
Quick vibe check of it (8B @ Q6): seems promising. Bit of a clinical tone, but I can see that being useful for data processing and similar. Sometimes you really don't want an LLM that spams you with emojis...
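For sizing a quant like that on local hardware, a rough rule of thumb is parameter count times bits-per-weight divided by 8, ignoring KV cache and runtime overhead. A back-of-the-envelope helper (the ~6.56 bits/weight figure for Q6_K is approximate):

```python
def weight_footprint_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate quantized weight size in GiB.

    Ignores KV cache, activations, and runtime overhead, so treat the
    result as a floor, not the real memory requirement.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

# An 8B model at ~6.56 bits/weight (roughly Q6_K) is ~6.1 GiB of weights,
# which is why it fits on a 12-16GB GPU with room left for context.
print(f"{weight_footprint_gib(8, 6.56):.1f} GiB")
```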
Makes sense: dense for small models, dense or MoE for larger ones. They end up fitting various hardware setups pretty neatly; there's no need for MoE at smaller scale, and dense is too heavy at large scale.
Nah, I ain't reading that. If they can't be bothered to get a human to write it, it can't be that important. I'm glad for them though. Or sorry that happened.
Third line in to the article: "But there’s one result in the benchmarks I keep coming back to."
I hear this sort of thing all the time now on YouTube from media/news personalities:
“And that’s the part nobody seems to be talking about.”
"And here's what keeps me up at night."
“This is where the story gets complicated.”
“Here’s the piece that doesn’t quite fit.”
“And this is where the usual explanation starts to break down.”
“Here’s what I can’t stop thinking about.”
“The part that should worry us is not the obvious one.”
“And that’s where the real problem begins.”
“But the more interesting question is the one no one is asking.”
“And this is where things stop being simple.”
It doesn't really worry me, but I think it's interesting that LLM-speak sounds so distinctive, and how willing these media personalities are to be so obvious in reading out on TV what the LLM spat out.
I've never studied in depth what LLMs say, so it is interesting that my brain recognises the speech pattern so easily.
I think this kind of language predates widespread LLM use, and has been picked up from that kind of writing. It's a "and here's where it gets interesting" pattern that people like Malcolm Gladwell and Freakonomics have used, even if the same thing could be said in a way that makes it sound much less intriguing.
Isn't this the format of "hook-driven media": a constant stream of "second-act pivots", where some new twist is added to a story to re-engage the reader and keep them reading?
BuzzFeed and Upworthy etc pioneered this for web 'news stories', then it got used in linkedin, twitter, and everywhere where views are more important than the content.
The language of drama and import without meaningful substance. Words statistically likely to be used in a segue, regardless of the preceding or subsequent point. Particularly effective when it seems like you’re getting let in on a secret. Really fatiguing to read
A writing teacher once excoriated me for saying that something was important. “Don’t tell me it’s important, show me, and let me decide, and if you do your job I’ll agree”
I don’t know how a completion can tell when it needs to do this. Mostly so far it doesn’t seem capable
I notice this very often in LinkedIn posts, and it's annoying, but I had not realized it was LLM-speak? Isn't it possible that people write like this naturally?
I don't really see a reason to complain about tool use, so long as the result is cohesive and accurate, and that ultimately means a human has at least read their own output before publishing. It's a bit like receiving a supposedly personal letter that starts with "Dear [INSERT_FIRST_NAME_FIELD],". Are you really going to read such a thing?
My opinion is that literature and art will continue pushing the envelope in the places they always pushed the envelope. LLMs will not change this, humans love making art, and they love doing it in new ways.
Corporate announcements were never the places that literature and art were pushing the envelope. They were slop before, and they're slop now.
The most salient thing about these models is that they're non-reasoning models. This makes them very token-efficient and particularly well suited for local inference, where decoding is usually slower than on datacenter GPUs.
The 8B class closing the gap with 32B is the real story of 2026 for anyone running models locally. I've been using smaller models for agent tool-use and the progress this year is real.
The gap that still matters most isn't intelligence; it's consistency on structured output. When you chain 5+ tool calls in sequence, even a small per-call reliability difference compounds fast. I'd love to see Granite 4.1 benchmarked specifically on multi-step function calling rather than just general benchmarks.
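The compounding claim is easy to make concrete: if each tool call succeeds independently with probability p, an n-step chain succeeds with probability p^n, so small per-call gaps widen fast. A quick illustration (the 98% and 90% figures are made up, not measured numbers for any model):

```python
def chain_success(p_per_call: float, steps: int) -> float:
    """Probability an n-step chain completes, assuming independent per-call success."""
    return p_per_call ** steps

# An 8-point per-call gap becomes a ~31-point gap over five chained calls.
for p in (0.98, 0.90):
    print(f"p={p}: 5-step chain succeeds {chain_success(p, 5):.1%} of the time")
```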
If you really think about why MoE came into existence, it's to save significant cost during training. I don't think there was ever concrete evidence of performance gains for comparable MoE vs dense models. Over the years, I believe all the new techniques employed in post-training are what have made the models better.
I think you mean inference compute? I believe all expert weights are updated in each backward pass during MoE training. The first benefit was getting a sort of structured pruning of weights through the mechanism of expert selection so that the model didn’t need to go through ‘unnecessary’ parts of the model for a given token. This then let inference use memory more efficiently in memory constrained environments, where non-hot or less common experts could be put into slow RAM, or sometimes even streamed off storage.
But I don’t think it necessarily saved training cost; if it did, I’d be interested to learn how!
Each token is only routed through a few chosen (top-k) experts during training, so not all expert weights are updated in the backward pass. OTOH, you may need more training to ensure all experts see enough tokens!
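The top-k routing being described can be sketched in a few lines: a gating layer scores every expert per token, only the k best scores are kept, and their weights are renormalized. A pure-Python toy (not any specific model's router):

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of gate scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(gate_scores, k=2):
    """Keep the top-k experts for this token and renormalize their weights.

    Only these k experts run (and receive gradient) for this token;
    the remaining experts are skipped entirely.
    """
    probs = softmax(gate_scores)
    topk = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = sum(probs[i] for i in topk)
    return {i: probs[i] / kept for i in topk}
```

This is also where the undertrained-expert worry comes from: an expert that rarely lands in the top-k rarely gets gradient, which is why real routers typically add load-balancing losses.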
I doubt MoE is actually worth it, given how complicated high-performance expert routing and training is. But who knows, I don't.
MoE models will have far more world knowledge than dense models with the same amount of active parameters. MoE is a no-brainer if your inference setup is ultimately limited by compute or memory throughput - not total memory footprint - or alternately if it has fast, high-bandwidth access to lower-tier storage to fetch cold model weights from on demand.
Qwen3.5 9b outperforms Granite 4.1 30b by a huge amount (32 vs 15 on the Artificial Analysis benchmark)... I have no idea what made the writer of this article say so many demonstrably incorrect things.
It's strange that they don't include reasoning training (RLVR). Their justification doesn't sound convincing:
> While reasoning models have grown in popularity in recent years, their abilities aren’t always the most efficient way to get a result. In enterprise settings, token costs and speed are often as important as performance. That is why turning to less expensive, non-reasoning models with similar benchmark performance for select tasks like instruction following and tool calling makes sense for enterprise users.
I guess they currently don't have the ability to do proper RLVR.
Apache 2.0 License. Did you not click the link to the project? They even list it in the article.
> Apache 2.0 across the board, so commercial use is clean.
Did you just stop when you saw "open source" and come post this here because you couldn't be bothered to look at the project and see it's cleanly and clearly listed?
Edit: Like. I get it. It's fine to question open source. But this isn't hidden. It's repeated and made clear multiple times. They even link to the license: https://www.apache.org/licenses/LICENSE-2.0
It wasn't hidden, it wasn't in some weird, out-of-the-way place. In fact, I found it so easily that I genuinely questioned whether it was real because of your comment. Like, why would anyone post what you posted if it was this easy to find?
* https://docs.ollama.com/integrations/claude-code
Can you share some of the parameters with which you enable tool calling and agentic usage?
Or, higher level, some philosophies on what approaches you are using for tuning to get better tool calling and/or agentic usage?
I'm having surprisingly good success with unsloth/Qwen3.6-27B-GGUF:Q4_K_M (love unsloth guys) on my RTX3090/24GB using opencode as the orchestrator.
It concocts some misleading paths, but the code often compiles, and I consider that a victory.
You have to watch it like you would watch a 14 year old boy who says he is doing his homework but you hear the sound effects of explosions.
The Qwen models are quite solid though.
Can you share your switches and approach for using tools?
Curious how people are leveraging these models.
Qwen is really good.
Also, generally, it makes sense. 8B models are generally not very good^.
That this 8B model is decent is impressive, but that it could perform on par with a good model 4 times as large is a daydream.
^ - To be polite. Small models + tool use for coding agents are almost universally ass. Proof: my personal experience. I've tried many of them.
edit: It was a play on The Big Lebowski, folks.
I ran it in LM Studio and got a pleasingly abstract pelican on a bicycle (genuinely not bad for a tiny 3B model - it can at least output valid SVG): https://gist.github.com/simonw/5f2df6093885a04c9573cf5756d34...
Original article on IBM research
Hugging face weights: https://huggingface.co/collections/ibm-granite/granite-41-la...
Either way it's probably a poor piece of writing.
[0] https://blog.gpkb.org/posts/just-send-me-the-prompt/
But how can I tell if those are good points or not?
I don't want to invest time in reading something if the presence of those "good points" depends on a roll of the dice.
The problem is that in the past it took multiple times more effort and hours to write something than it took to read. That served two purposes:
1. Lazy people just looking for an audience were effectively gatekept from drowning the world with their every vapid thought.
2. Because supply was many times slower than consumption it was viable to give most articles a chance: the author could not drown me in a deluge even if they wanted to.
Having the criteria now that the author should spend at least as much effort creating the piece as they expect the reader expend reading it is a damn useful bar: instead of reading 1000 AI articles just to find the one good one, I can simply read 10 human authored articles and be certain that 9 of them have something worthwhile.
I think instruction following is going to be the most useful thing these models do. Add a voice interface and access to a bunch of simple, straightforward devices or APIs and you have a mildly useful assistant. If that can be done in 8B parameters it will soon run on edge devices. That's solid usefulness.
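For illustration, a minimal sketch of the dispatch side of such an assistant. The device names and handlers here are hypothetical; in practice the tool-call JSON would come from the model's output, and the point is that a small instruction-following model only has to emit well-formed JSON for this to work:

```python
import json

# Hypothetical device registry: tool name -> handler.
# An assistant loop would map the model's tool-call JSON onto these.
DEVICES = {
    "set_light": lambda args: f"light set to {args['level']}%",
    "get_temperature": lambda args: "21.5 C",
}

def dispatch(tool_call_json: str) -> str:
    """Execute one tool call emitted by the model (as a JSON string)."""
    call = json.loads(tool_call_json)
    handler = DEVICES.get(call["name"])
    if handler is None:
        return f"unknown tool: {call['name']}"
    return handler(call.get("arguments", {}))

# All the model needs to produce reliably is JSON like this:
print(dispatch('{"name": "set_light", "arguments": {"level": 40}}'))
# prints "light set to 40%"
```

Everything hard (speech, planning) stays in the model; the device side stays dumb and auditable.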
It's mind-boggling how bad current voice assistants sometimes are when you ask them fairly easy questions.
Right. This just says that Granite 4.1 8B is better than a previous version, Granite 4.0-H-Small, which has 32B, 9B active.
So, they made a less bad model than before. But that doesn't tell you anything about how it compares with other models.
I'm not sure it's pride as much as people voicing displeasure with the uncertainty about what went into the LLM prompt. This may have been a one-sentence prompt, or it may have been some well-researched background that the model simply reformatted. Why waste minutes to hours on verifying it if it's possible someone spent 10 seconds on it? It's very easy to see their point.
People lately seem to suggest that anyone they disagree with voicing an opinion about anything is engaged in some auto-fellatio; I wonder what causes them to think this way.
I already assume some comments here are LLM written.
I assume some people here have never programmed a single useful thing even once in their lives.
I don’t know how many different little models this uses under the hood, but I was shocked at how good it was at the couple of document extraction tasks I threw at it.
Training purpose-specific miniature models lets you have a lot of tasks you can run with high confidence on consumer hardware.
Regardless, the kind of work people did in the 80s to prune programs to fit on small devices is likely happening now. I'd bet most of the Chinese firms are doing it because of the US's silly GPU games, among other constraints.
Quick vibe check of it (8B @ Q6): seems promising. Bit of a clinical tone, but I can see that being useful for data processing and similar. You don't really want an LLM that spams you with emojis sometimes...
But yeah, I dislike that style where each heading and bullet point gets an emoji.
It is not the researchers' fault that some slop got posted here instead.
Why don't people edit out the obvious sloppification if they expect to still have readers left?
I hear this sort of thing all the time now on YouTube from media/news personalities:
“And that’s the part nobody seems to be talking about.”
"And here's what keeps me up at night."
“This is where the story gets complicated.”
“Here’s the piece that doesn’t quite fit.”
“And this is where the usual explanation starts to break down.”
“Here’s what I can’t stop thinking about.”
“The part that should worry us is not the obvious one.”
“And that’s where the real problem begins.”
“But the more interesting question is the one no one is asking.”
“And this is where things stop being simple.”
It doesn't really worry me, but I think it's interesting that LLM speak sounds so distinctive, and how willing these media personalities are to be so obvious in reading out on TV what the LLM spat out.
I've never studied in depth what LLMs say, but it is interesting that my brain recognises the speech pattern so easily.
BuzzFeed and Upworthy etc pioneered this for web 'news stories', then it got used in linkedin, twitter, and everywhere where views are more important than the content.
A writing teacher once excoriated me for saying that something was important. “Don’t tell me it’s important, show me, and let me decide, and if you do your job I’ll agree”
I don’t know how a completion can tell when it needs to do this. Mostly so far it doesn’t seem capable
This is to say: marketers and spammers repeat the same things over and over, and these models are built on coalescing that repetition into their basis.
So yeah, of course people talked like this before, but it was always in some known context like linked in or a spam website.
No point creating busywork for yourself just shuffling words around when the information is there, no?
I guess it depends on what you want out of the article. Substance, or style?
Corporate announcements were never the places that literature and art were pushing the envelope. They were slop before, and they're slop now.
Link to HF collection: https://huggingface.co/collections/ibm-granite/granite-41-la...
The gap that still matters most isn't intelligence — it's consistency on structured output. When you chain 5+ tool calls in sequence, even a small per-call reliability difference compounds fast. Would love to see Granite 4.1 benchmarked specifically on multi-step function calling rather than just general benchmarks.
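A rough illustration of how per-call reliability compounds over a chain of tool calls (the 98% and 95% figures here are made up for the example, not measured):

```python
def chain_success(per_call: float, steps: int) -> float:
    """Probability that every call in an n-step chain succeeds,
    assuming independent per-call reliability."""
    return per_call ** steps

# A 3-point per-call gap becomes a 13-point gap over 5 chained calls:
print(round(chain_success(0.98, 5), 3))  # 0.904
print(round(chain_success(0.95, 5), 3))  # 0.774
```

This is why a benchmark on single-call accuracy can understate the difference between models on multi-step agentic work.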
But I don’t think it necessarily saved training cost; if it did, I’d be interested to learn how!
I doubt MoE is actually worth it, given how complicated high-performance expert routing and training is. But who knows, I don't.
I have been using it with their chunkless RAG concept and it fits very well! (For the curious: https://github.com/scub-france/Docling-Studio)
I'm convinced that SLMs are a real part of the solution for truly integrated AI in processes...
https://huggingface.co/collections/ibm-granite/granite-embed...
311M and 97M versions.
edit: I just realised they do actually have a 30b release alongside this. Haven't tried it yet.
An interesting choice
> While reasoning models have grown in popularity in recent years, their abilities aren’t always the most efficient way to get a result. In enterprise settings, token costs and speed are often as important as performance. That is why turning to less expensive, non-reasoning models with similar benchmark performance for select tasks like instruction following and tool calling makes sense for enterprise users.
I guess they currently don't have the ability to do proper RLVR.
show me.
> Apache 2.0 across the board, so commercial use is clean.
Did you just stop when you saw "open source" and come post this here because you couldn't be bothered to... look at the project and see that it's cleanly and clearly listed?
Edit: Like. I get it. It's fine to question open source. But this isn't hidden. It's repeated and made clear multiple times. They even link to the license: https://www.apache.org/licenses/LICENSE-2.0
It wasn't hidden, it wasn't in some weird, out-of-the-way place. In fact, I found it so easily that I genuinely questioned whether it was real because of your comment. Like, why would anyone post what you posted if it was this easy to find?
NOPE! It was right there.