Let's focus on the real issue here, which is that HN has apparently normalized the double hyphen in the title to an en dash--yes, an en dash, not even an em dash.
I agree that it should be left as a double hyphen, but an en dash is far more appropriate than an em dash, given the decades-long precedent set by LaTeX, where `--` renders as an en dash (a convention continued by Typst).
It's a command-line argument. The undeniably correct way to render it is with two minus signs[1], and absolutely not with anything non-ASCII.
[1] Not strictly a hyphen, which has its own Unicode code point (U+2010) outside of ASCII. Unicode embraced the ambiguity by formally naming this code point (U+002D) "HYPHEN-MINUS", but really its only unique typographic usage is to represent subtraction.
But... it's not more appropriate than an em dash for representing command-line arguments? I don't see how either is any more incorrect than the other. There's a uniquely correct answer here (the double hyphen), and the em dash is not it. Period.
No, the comment was pointing out that the HN platform automatically replaces `--` in titles with `–`. (I don’t know if that’s true, but that was the intent. Nothing to do with AI.)
> Process monitoring at 0.1-second intervals found zero git processes around reset times.
I don’t think this is a valid way of checking for spawned processes. Git commands are fast; polling at 0.1-second intervals can miss them entirely. I would replace the git on $PATH with a wrapper that logs all operations and then execs the real git.
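A minimal sketch of such a wrapper, assuming the real binary lives at /usr/bin/git (put the script earlier on $PATH, e.g. as ~/bin/git):

```sh
#!/bin/sh
# Log every git invocation with a timestamp, then hand off to the real binary.
REAL_GIT=/usr/bin/git                                  # adjust for your system
printf '%s git %s\n' "$(date -u +%FT%TZ)" "$*" >> "$HOME/git-audit.log"
exec "$REAL_GIT" "$@"
```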
Sure looks to me like this whole case is Claude Code chasing its own tail, failing to debug, and offering to instead generate a bug report for the user when it can't figure out a better way forward.
Maybe even submitting the bug report "agentically" without user input, if it's running on host without guardrails (pure speculation).
I think this post potentially mischaracterises what may be a one-off issue for one person as a broader problem. I'm guessing some context has been corrupted?
It's not a one-off issue - it has happened to me a few times. It has once even force pushed to github, which doesn't allow branch protection for private personal projects. Here's an example.
1) claude will stash (despite clear instructions never to do so).
2) claude will use sed to bulk replace (despite clear instructions never to do so). sed replacements make a mess and replace far too many files.
3) claude restores the stash. Finds a lot of conflicts. Nothing runs.
4) claude decides it can't fix the problem and does a `git reset --hard`.
I have this right at the top of my CLAUDE.md and it makes things better, but unlike codex, claude doesn't follow it to the letter. However, it has become a lot better now.
NEVER USE sed TO BULK REPLACE.
*NEVER USE FORCE PUSH OR DESTRUCTIVE GIT OPERATIONS*: `git push --force`, `git push --force-with-lease`, `git reset --hard`, `git clean -fd`, or any other destructive git operations are ABSOLUTELY FORBIDDEN. Use `git revert` to undo changes instead.
When will you all learn that merely "telling" an LLM not to do something won't deterministically prevent it from doing that thing? If you truly want it to never use those commands, you better be prepared to sandbox it to the point where it is completely unable to do the things you're trying to stop.
Even worse, explicitly telling it not to do something makes it more likely to do it. It's not intelligent. It's a probability machine writ large. If you say "don't git push --force", that command is now part of the context window, dramatically raising the probability of it being "thought" about and appearing in the output.
Like you say, the only way to stop it from doing something is to make it impossible for it to do so. Shove it in a container. Build LLM-safe wrappers around the tools you want it to be able to run, so that when it runs e.g. `git`, it can only do operations you've already decided are fine.
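For git specifically, a default-deny wrapper is only a few lines; a rough sketch, again assuming the real binary is at /usr/bin/git and ignoring global options like `-C <path>` that a robust version would have to parse:

```sh
#!/bin/sh
# Default-deny git wrapper: only allowlisted subcommands get through.
REAL_GIT=/usr/bin/git
case "$1" in
  status|diff|log|add|commit|branch|checkout|switch)
    exec "$REAL_GIT" "$@" ;;
  *)
    echo "git wrapper: '$1' is not on the allowlist" >&2
    exit 1 ;;
esac
```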
My point is exactly that you need safeguards. (I have VMs per project, reduced command availability, etc.) But those details are orthogonal to this discussion.
However "Telling" has made it better, and generally the model itself has become better. Also, I've never faced a similar issue in Codex.
That’s right, because we’re not developers anymore; we orchestrate writhing piles of insane noobs that generally know how to code but have absolutely no instinct or common sense. This is because it’s cheaper per pile of excreted code while this is all being heavily subsidized. This is the future, and anyone not enthusiastically onboard is utterly foolish.
I use a script wrapper for git on my path for claude, but as you correctly said, I'm not sure claude will not ever use a new zsh with a different PATH...
Why do you expect that a weighted random text generator will ever behave in a predictable way?
How can people be so naive as to run something like Claude anywhere other than in a strictly locked down sandbox that has no access to anything but the single git repo they are working on (and certainly no creds to push code)?
This is absolutely insane behavior that you would give Claude access to your GitHub creds. What happens when it sees a prompt injection attack somewhere and exfiltrates all of your creds or wipes out all of your repos?
I can't believe how far people have fallen for this "AI" mania. You are giving a stochastic model that is easily misdirected the keys to all of your productive work.
I can understand the appeal to a degree; it can seem to do useful work sometimes.
But even so, you can't trust it with anything; not running it in a locked-down container that has no access to anything but a Git repo (with all important history stored elsewhere) seems crazy.
Shouting harder and harder at the statistical model might give you a higher probability of avoiding the bad behavior, but no guarantee; actually lock down your random text generator properly if you want to avoid it causing you problems.
And of course, given that you've seen how hard it is to get it to follow these instructions properly, you are reviewing every line of output code thoroughly, right? Because you can't trust that either.
> It has once even force pushed to github, which doesn't allow branch protection for private personal projects.
This is only restricted for *fully free* accounts; the feature just requires a paid Pro account. That starts around $4 USD/month, which sounds worth it to prevent losing work to a runaway tool.
Reinforcing an avoidance tactic is nowhere near as effective as doing that PLUS enforcing a positive tactic. People with loads of 'DONT', 'STOP', etc. in their instructions have no clue what they're doing.
In your own example you have all this huge emphasis on the negatives, and then the positive is a tiny un-emphasized afterthought.
I think you're generally correct, but certainly not definitively so, and I worry the advice and tone aren't helpful in this instance, with an outcome of this magnitude.
(more loosely: I'm a big proponent of this too, but it's a helluva hot take; how one positively frames "don't blow away the effing repo" isn't intuitive at all)
Claude tends to disregard "NEVER do X" quite often, but funnily enough, if you tell it "Always ask me to confirm before doing X", it never fails to ask you. And you can deny it every time.
Maybe stop using CLAUDE.md to prevent it from running tools you don't want it to, and just set up a PreToolUse hook that blocks any command you don't want.
It's trivial to set up, and you could literally ask claude to do it for you and never have any of these issues ever again.
Any and all "I don't want it to ever run this command" issues are just skill issues.
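A sketch of such a hook script, going by the documented PreToolUse contract (the hook receives the tool call as JSON on stdin, and exit code 2 blocks it; verify the field names against your Claude Code version):

```sh
#!/bin/sh
# PreToolUse hook: inspect the Bash tool's command and block destructive git.
cmd=$(jq -r '.tool_input.command // empty')    # JSON arrives on stdin
case "$cmd" in
  *"git reset --hard"*|*"git push --force"*|*"git push -f"*|*"git clean -fd"*|*"git stash"*)
    echo "Blocked by policy: $cmd" >&2
    exit 2    # exit code 2 cancels the tool call; stderr goes back to the model
    ;;
esac
exit 0        # anything else proceeds normally
```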
> 1) claude will stash (despite clear instructions never to do so).
This is basically blocking any parallel work between agents without worktrees (including when Claude launches a team of agents running in parallel).
I’ve run into this so many times in the past few weeks and it’s infuriating. It makes any real inter-agent collaboration very hard to pull off without brittle hooks and prompts.
You might be right, but consider the implications: if context can be corrupted in 0.1% of cases and it starts showing other destructive behaviour, then after handing 1000 tickets to the agent, your data might be accidentally wiped.
I'd been using cursor at work for a year or two now, figured I'd try it on a personal project. I got to the point where I needed to support env-vars, and my general pattern is `source ./source-me-local-auth` => `export SOME_TOKEN="$( passman read some-token.com/password )"` ...so I wrote up the little dummy script and it literally just says: "Hrm... I think I'll delete these untracked files from the working directory before committing!" ...and goes skipping merrily along its way.
Never had that experience in the whole time using cursor at work, so I had to "take the agent to task" and ask it "WTF-mate? you'd better be able to repro that!" and then circle around the drain for a while getting an AGENTS.md written up. Not really a big deal, as the whole project was like 1k lines in and it's not like the code I'd hand-written there was "irreplaceable", but it led to some interesting discussion w/ the AI like "Why should I have to tell you this? Shouldn't your baseline training data presume not to delete files that you didn't author? How do you think this affects my trust not just of this agent session, but all agent interactions in the future?"
Overall, these are turning out to be quite interesting technological times we're living in.
Like a decade or more ago I remember a joke system that would do something random with the data you gave it, and you'd have to use commands like "praise" and "punish" to train it to do what you wanted. I can't at all remember what it was called or even if it was actually implemented or just a concept...
I would not have expected the model's baseline training data to presume not to delete files it didn't author. If the project existed before you started using the model then it would not have created any of the files, and denying the ability to delete files at all is quite restrictive. You may consider putting such files in .gitignore, which Cursor ignores by default.
I mean it's a skill issue in the sense that Claude Code gives you the tools to 100% deterministically prevent this from ever happening, without ever relying on the model's unpredictability.
Just set up a hook that prevents any git commands you don't ever want it to run and you will never have this happen again.
Whenever I see stuff like this I just wonder if any of these people were ever engineers before AI, because the entire point of software engineering for decades was to make processes as deterministic and repeatable as possible.
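For completeness, registering such a guard script is a one-time setup; a sketch, with the script path illustrative and the exact schema per Anthropic's hooks docs:

```sh
mkdir -p .claude/hooks    # git-guard.sh below is the blocking script upthread
cat > .claude/settings.json <<'EOF'
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          { "type": "command", "command": "$CLAUDE_PROJECT_DIR/.claude/hooks/git-guard.sh" }
        ]
      }
    ]
  }
}
EOF
```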
Who would have guessed that running a binary blob dev tool, that is tied to a SaaS product, which was mostly vibe-coded, could lead to mysterious, hard to debug problems?
Not sure I understand, wouldn't permissions prevent this? The user runs with `--dangerously-skip-permissions` so they can expect wild behaviour. They should run with permissions and a ruleset.
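For reference, a deny ruleset lives in `.claude/settings.json`; the pattern syntax below is my best reading of the permissions docs, so treat it as a sketch and test it before relying on it:

```sh
cat > .claude/settings.json <<'EOF'
{
  "permissions": {
    "deny": [
      "Bash(git reset:*)",
      "Bash(git push --force:*)",
      "Bash(git clean:*)"
    ]
  }
}
EOF
```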
Who knows whether permissions would prevent this? Anthropic's documentation on permissions (https://code.claude.com/docs/en/permissions) does not describe how permissions are enforced; a slightly uncharitable reading of "How permissions interact with sandboxing" suggests that they are not really enforced and any prompt injection can circumvent them.
That's not what tool use permissions are. The LLM doesn't just magically spawn processes or run code. The Claude Code program itself does those things when the LLM indicates that it wants to. The program has checks and permissions governing whether those things are done or not.
You could use a wrapper that parses all the command-line options. Basically you loop over "$@": look for strings starting with '-' and '--' and skip those; then look for a non-option argument and store that as the subcommand; then look for more '-' and '--' options. Once that's all done you have enough to detect subcommand "reset" with subcommand option "--hard". About 50 lines of shell script.
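A trimmed sketch of that approach; a robust version also has to handle global options that take values (e.g. `-C <path>`), which is where the rest of the ~50 lines go:

```sh
#!/bin/sh
# git wrapper: find the subcommand and its options, refuse bad combinations.
REAL_GIT=/usr/bin/git
subcmd= hard= force=
for arg in "$@"; do
  case "$arg" in
    --hard)                        [ "$subcmd" = reset ] && hard=1 ;;
    -f|--force|--force-with-lease) [ "$subcmd" = push ] && force=1 ;;
    -*) ;;                                  # some other option: skip it
    *)  [ -n "$subcmd" ] || subcmd=$arg ;;  # first bare word is the subcommand
  esac
done
if { [ "$subcmd" = reset ] && [ -n "$hard" ]; } ||
   { [ "$subcmd" = push ] && [ -n "$force" ]; }; then
  echo "git wrapper: refusing 'git $*'" >&2
  exit 1
fi
exec "$REAL_GIT" "$@"
```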
Just fork git and patch that out?
Can't be that hard; just ask the agent for that patch.
Don't need to update often either, so it's ok to rebase like twice a year.
I opened up Hacker News and I saw this right at the top, and I assumed it had started happening to everyone. I thought, good thing I'm not running Claude Code right now.
Some people are upset at my brave new world characterization, but yeah even as someone deriving value from Claude Code we've jumped the shark on AI in development.
Either the industry will face that reality and recalibrate, or in 20 years we're going to look back on these days as the golden age of software reliability and just accept that software is significantly more broken than it was (we've been priming ourselves for that, after all).
People aren't upset about your characterization. Catch phrases, memes, and other low-quality comments (with no context, elaboration, or personal angle) are contrary to community ethos and get downvoted.
I agree that it's worrying that we're moving more and more towards implicit and opaque state. Hiding what exactly is getting edited, very limited tooling to check what the subagents are doing exactly, setting up scheduled and recurring tasks without it being obvious etc.
It's tending more and more towards pushing the user to treat the whole thing as a pure chat interface magic black box, instead of a rich dashboard that allows you to keep precise track of what's going on and giving you affordances to intervene. So less a tool view and more magic agent, where the user is not supposed to even think about what the thing is even doing. Just trust the process. If you want to know what it did, just ask it. If you want to know if it deleted all the files, just ask it in the chat. Or don't. Caring about files is old school. Just care about the chat messages it sends you.
Here in SF I talk to people all day who see this as a feature, not a bug, and that's the persona Claude Code and Codex are selling to.
It started being proposed as a thought experiment "why should we care about the files if AI is going to do the edits", then as Opus got better and the hype built up, the rhetorical part of that dropped and now there are plenty of people who swear they don't write code at all anymore and don't see why anyone would.
I think we're in a feedback loop caused by the fact you can totally get away with not writing code anymore for some reasonably complex topics. But that doesn't account for the long term maintainability of the result, and it doesn't account for people who think they're not writing code, but are relying heavily on the fact we haven't fully magicked away the actual code. They're watching the agents like a hawk, doing small bits and pieces at a time, hitting stop when it starts thinking about the wrong thing, etc.
My worry is the market taking the wrong lesson out of the trends and prematurely trying to force the agent-first future well before the tools or the people are ready.
Feels like just yesterday that everyone agreed that critical code is read orders of magnitude more than written, so optimizing for quick writing is wrong.
Genuinely I think that perspective is still shared by many/most engineers.
I think we’ve seen a wave of bad actors - either employees of LLM companies, or bots - pushing the idea hard of code quality not mattering and “the models will improve so fast that your code quality degrading doesn’t matter”.
I think the humans pushing that idea may even believe it, but I don’t think they’re usually employed as software engineers at regular non-AI companies, rather they have some incentive to believe it and convince others as well
While that's obviously a bug which should be fixed, having stuff just sitting around uncommitted for days (which is much longer than 10 mins) is an anti-pattern (that I used to fall into).
I’m having this weird vision of a “the matrix 3” type machine crawling around inside Microsoft’s GitHub servers central repository and just wreaking havoc.
The person who posted this bug doesn't seem like the pinnacle of software engineering. To me, this looks like either a user error or some corrupt file or context you should be able to clean up pretty quickly.
The weird part is that it's "shitting over the floor" in quite a deterministic manner: every 600 seconds (± less than 0.5 seconds) it does the exact same thing.
The idea that a natural request can get Claude to invoke potentially destructive actions on a timer is silly.
Isn't this a natural consequence of how these systems work?
The model is probabilistic and sequences like `git reset --hard` are very common in training data, so they have some probability to appear in outputs.
Whether such a command is appropriate depends on context that is not fully observable to the system, like whether a repository or changes are disposable or not. Because of that, the system cannot rely purely on fixed rules and has to figure intent from incomplete information, which is also probabilistic.
With so many layers of probabilities, it seems expected that sometimes commands like this will be produced even if they are not appropriate in that specific situation.
Even a 0.01% failure rate due to context corruption, misinterpretation of intent, or guardrail errors would show up regularly at scale; that is 1 in 10,000 queries.
> Just by a thing being common in training data doesn't mean it will be produced.
That's not what I said at all. I never said it will be produced. I said there is some probability of it being produced.
> False, it goes against the RL/HF and other post training goals.
It is correct that frequency in training data alone does not determine outputs, and that post-training (RLHF, policies, etc.) is meant to steer the model away from undesirable behavior.
But those mechanisms do not make such outputs impossible. They just make them less likely. The underlying system is still probabilistic and operating with incomplete context.
I am not sure how you can be so confident that a probabilistic model would never produce `git reset --hard`. There is nothing inherent in how LLMs work that makes that sequence impossible to generate.
> It is meaningless to say that because the author was able to reproduce it multiple times.
I don't know how that refutes what I'm saying.
The behaviour was reproduced multiple times, so it is clearly an observable outcome, not a one-off. It just shows that the probability of `git reset --hard` is > 0 even with RLHF and post-training.
Yes, if something is reproducible and undesirable, it is a bug and RLHF can reduce it. I'm not disputing that. "Reduce" is the keyword here. You can't eliminate them entirely.
My point is that fixing one bug does not eliminate the class of bugs. Heck, it does not even fix that one bug deterministically. You only reduce its probability like you rightly said.
With git commands, there is nothing like Lean that can formally reject invalid output. Really, I think the mathematicians have it easier with LLMs, because a proof is either valid or invalid. It's not so clear-cut with git commands: almost any command can be valid in some narrow context, which makes it much harder to reject undesirable outputs entirely.
Until the underlying probabilities of undesirable output become so negligible that it is practically impossible, these kinds of issues will keep surfacing even if you address individual bugs. Will the probabilities become that low someday? Maybe. But we are not there yet. Until then, we should recalibrate our expectations and rely on deterministic safeguards outside the LLM.
When sampling from an LLM people normally truncate the token probability distribution so that low-probability tokens are never sampled. So the model shouldn't produce really weird outputs even if they technically have nonzero probability in the pre/post training data.
That's interesting man, that's pretty f***' interesting. I don't think I've seen it though. I've let it run for hours making changes overnight and I only do git operations manually.
Oh, but maybe allowing it to do remote git operations is a necessary trigger.
comments: "ThE tItLe iS aI cOded !!!1"
(Or... do they?? Hmm, ok, maybe I need to let this roll around in my mind.)
triple hyphens —
Most likely, the developer ran `/loop 10m <prompt>` or asked claude to create a cron task that runs every 10 minutes and refreshes & resets git.
E: It's a runaway bot lol https://github.com/anthropics/claude-code/issues/40701#issue...
(No need to use bpftrace, just an easy example :-) )
You can reduce the risk, but not drive it to zero, and at scale even very small failure rates will surface.
1. If the problem the post describes is common enough, it is a bug and its extent needs to be reduced (as you said).
2. If it is not common and happens only for this user, it is not a bug and can mostly be ignored.
Point is: the system is not inherently broken in some way that makes it unusable.
What if it happens for two users? (Still "not common").
And if you force push to one of your own machines you can use the reflog[2].
[0]: https://stackoverflow.com/a/78872853
[1]: https://stackoverflow.com/a/48110879
[2]: https://stackoverflow.com/a/24236065
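For example, recovering from an accidental `git reset --hard` while the reflog still has the old tip:

```sh
git reflog                   # find the entry from just before the reset
git reset --hard 'HEAD@{1}'  # or use the commit hash shown by reflog
```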
Now I wish I could reject `git reset --hard` on my local system somehow.
I just checked, mine also doesn't.
do not share a workspace with the llm, or with anybody for that matter.
How would the llm even distinguish what was written by them and what was written by you?
This whole LLM thing is a blast, huh?
You reap what you sow, finance bro.
https://code.claude.com/docs/en/scheduled-tasks#set-a-one-ti...
What would it cost if the /loop command was required instead of optional?
> I guess, what I'm trying to say ... is this even a bug? Sounds like the model is doing exactly what it is designed to do.
False, it goes against the RL/HF and other post training goals.