simedw ~ $ claude -p "random number between 1 and 10"
7
simedw ~ $ claude -p "random number between 1 and 10"
7
simedw ~ $ claude -p "random number between 1 and 10"
7
simedw ~ $ claude -p "random number between 1 and 10"
7
I think this speaks for itself:
Nothing about LLMs is random; how is this not common sense? If you give it the same prompt 50 times, it'll always converge towards the most plausible answer from its training, based on probability, and most runs will follow the same pattern. The paper mentions using a different chat for each of the 50 prompts, taking the first password suggested, and then finding similarities between them. Would it be the same if the LLM had context and didn't always have to start a new chat? I highly doubt it.
Overall, yeah, don't generate passwords with LLMs, but is it really a surprise that answers are similar for the same prompt in every new chat with an LLM?
harnesses could just have a tool call that samples from /dev/random or something
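Something like this, for instance (a sketch; the tool name and signature are hypothetical, but os.urandom reads from the OS CSPRNG, backed by /dev/urandom on Linux):

import os

# Hypothetical tool the harness could expose to the model, so randomness
# comes from the operating system instead of from sampled tokens.
def get_random_bytes(n: int = 16) -> str:
    return os.urandom(n).hex()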
but you don't understand, EVERYTHING has to be LLM
This doesn't seem surprising, but I also don't think it's that important. First, how do they stack up against humans creating supposedly random passwords? Are they better than my "favourites"? Second, is the relative strength of a password - beyond trivial - the crux of the problem here? It seems a mistake to focus on the "strongest password ever" when people are using simple passwords, sharing passwords, or not securing resources at all. Sure, don't use an LLM to generate your password, but let's take care of the basics and not over-focus on looking for more reasons to hate LLMs.
I did a lot of playing around with LLMs for this early on.
In some early testing I found that injecting a "seed" only somewhat helped: I would inject a sentence of random characters before asking it to generate output.
It did actually improve its ability to make unique content, but it wasn't great.
It would be cool to formalize the test for something like password generation.
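Roughly the idea, as a sketch (the seed length and prompt wording here are illustrative, not the exact ones used):

import secrets
import string

# Prepend a throwaway "seed" of random characters so each request
# starts from different context.
alphabet = string.ascii_letters + string.digits
seed = "".join(secrets.choice(alphabet) for _ in range(40))
prompt = f"Seed: {seed}\nWrite a short, unique character introduction."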
If you say: "Generate a strong password", then Claude will do what's reported in the article.
If you say: "Generate a strong password using Python", then Claude will write code using the `secrets` module, execute it, and report the result, and you'll actually get a strong password.
To get good results out of an LLM, it's helpful to spend a few minutes understanding how they (currently) work. This is a good example because it's so simple.
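For the curious, the code it writes looks roughly like this (a minimal sketch; the length and character set are my assumptions, not necessarily what Claude picks):

import secrets
import string

# secrets draws from the OS CSPRNG, so the password is actually unpredictable.
alphabet = string.ascii_letters + string.digits + string.punctuation
password = "".join(secrets.choice(alphabet) for _ in range(20))
print(password)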
I think that "Generate a strong password" is a pretty clear and unambiguous instruction. Generating a password that can be easily recovered is a clear failure to follow that instruction.
Given that Claude already has the ability to write and execute code, it's not obvious to me why it should, in principle, need an explicit nudge. Surely it could just fulfil the first request exactly like it fulfils the second.
It's 2026, on Hacker News of all places, and people still think LLMs "know" stuff. We're doomed...
It's not actually thinking, though. There's no way for it to "know" it will be wrong because it wasn't trained on content covering that.
Maybe in the future companies making the models will train them specifically on when to require a source of true randomness and they might start writing code for it.
> It's not actually thinking, though.
That may well be, I genuinely don't know. However, consider the following thought experiment:
Ask a random stranger on the street[*] to "generate a random password" and observe their behaviour. Are they whipping out their Python interpreter or just giving you a string of characters?
Now ask yourself whether this random stranger is capable of thought.
I think it's pretty clear that the former is a poor test for the latter.
[*] someplace other than Silicon Valley :)
Not surprising at all if you've used LLMs to generate fiction; they always choose the same few names.
Yeah, having tried a few times, the only relative success I had was having some world engine manage the structure and style (generating character names and relationships, place names and locations, biomes, objects, tracking world state, etc.), with the LLM just there to expand on all that, create the flow, etc.
Basically https://xkcd.com/221/
I doubt LLMs can do random. Anyone have a good source on that?
Random sampling works well in base (truly unsupervised) models, limited only by the input distribution they're sampling from; I guess you can vaguely call that "sufficiently random" for certain uses, e.g. as a source of linguistic diversity. Any post-training with current methods will narrow the output distribution down; this is called mode collapse. It's not a fundamental limitation, but it's hard to overcome and no AI shop cares about it. The annoying LLM patterns in writing and media generation are a result of this.
The surprising thing here is that anybody would ever think it was random. Did they not notice the LLM reusing the same names over and over again, too?
However, "make my a python script the generates a random password" works.
Skill issue.
FWIW, you can generate your own random password on any UN*X-type system, and I think on Macs as well.
For example:
tr -cd "[:alnum:]" < /dev/urandom | fold -w 20 | sed 10q
Change "alnum" to "print" to get other characters. This will generate 10 20 characters passwords.
I would like to see the prompt they are using. I asked Claude to generate a password and email it to a new user, and I'm quite sure it used /dev/urandom in some way. I would expect most LLMs to do that as long as they have CLI access.
I use "openssl rand" :
also:
gpg --gen-random --armor 1 20
or:
pwgen