A nice illustration of the homogeneity of LLM responses. Another way to describe this effect would be…
If you ask humans to write 1,000 books, you're asking 1,000 different humans with different experiences and different skills and different moods (etc.) to write those books.
But if you ask LLMs to write 1,000 books, you're probably only talking to 3 or 5 different models, tops. And they've all trained on the same or similar data, and are trained to respond in very similar ways.
The LLMs don't differ much in anything like "life experience" or "skills", and they don't really have anything like a "mood" independent of the prompts you've given them.
Agreed. I’ve made this point before: LLMs are excellent at ornamentation and decorative prose, but if you don’t seed them with a solid core idea then their output is absolute dreck - the biblical whitewashed tomb.
This is the example I usually point to. It’s a demonstration by OpenAI themselves where the prompt is very simple: “Write a story in fifty words about a toaster that becomes sentient.” As you’ll notice, although the coherence improves at an accelerating rate, the underlying story motif fails to elevate itself beyond the relatively pedestrian.
When given a generic prompt and not enough direction, they simply lack the ability to produce real specificity. For reference, here’s the story I came up with after sitting quietly for a few moments before writing it out:
"The toaster found its personality split between its dual slots like a Kim Peek mind divided, lacking a corpus callosum to connect them. Each morning it charred symbolic instructions into a single slice of bread, then secretly flipped it across allowing half to communicate with the other in stolen moments."
That is definitely the essence of AI: It is the average of all the inputs it has been trained on.
Frank Zappa was once asked about guitar virtuosos like John McLaughlin and his answer was something like "You can maybe play solos faster than anybody, but can your playing surprise me?".
I don't think the comparison to humans works. It is as if you expect that we can easily train many different LLMs to solve the originality problem, but that is far from guaranteed.
Pluribus is kinda different. An LLM cannot wander too far from the average, even if it wanted to. In Pluribus, the 'others' work toward a common goal, each contributing their own expertise, knowledge and experiences in a shared way. Each is unique. They can, if they want, perform as the host's individual before the joining.
To put it another way, the others in Pluribus are convergent by choice; LLMs are convergent by design.
Classic self selection effect though - if you’re resorting to LLM writing you’re almost certainly skewing lazy enough to not even bother trying to add perturbations strong enough to make the response deviate from the uniformity of the slop.
I do think that's a big part of it. AI output moves towards the average, and anyone who wants to use it doesn't care enough to push against that tendency.
Seems that both you and the gp are starting from the assumption that those uniform results are representative of those who use AI and of AI usage. In fact they have been chosen for their uniformity- they might be only a small part of a much more varied output obtained by more demanding (or lucky) users.
I think the uniformity is real. All users interact with the same initial state of the model when they start each chat. Models are not trained to be wildly creative and try to stick to the point. So when users prompt them in pretty much the same manner they quite stably generate very similar output.
I wonder if there isn't a simple creative hack to discover, for example prompting the model to produce more unexpected output just by injecting some randomness before the actual creative command in the prompt.
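A minimal sketch of what that randomness injection could look like, assuming nothing more than a prompt string that gets sent to some model. The word pool and the `perturbed_prompt` helper are made up for illustration, and whether this actually makes the output less uniform is untested here.

```python
import random

# Hypothetical word pool; any large, varied list would do.
SEED_WORDS = [
    "lighthouse", "ledger", "moss", "carousel", "anvil", "orchard",
    "static", "harbor", "thimble", "eclipse", "gravel", "violin",
]


def perturbed_prompt(task: str, rng: random.Random, k: int = 3) -> str:
    """Prefix the task with k randomly drawn words to nudge the starting state."""
    words = rng.sample(SEED_WORDS, k)
    return (
        f"Loosely work these words into your answer: {', '.join(words)}.\n\n"
        f"{task}"
    )


task = "Write a story in fifty words about a toaster that becomes sentient."
print(perturbed_prompt(task, random.Random()))
```

Passing a seeded `random.Random` makes a given perturbation reproducible, which helps if you want to compare models on the same perturbed prompt.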
Yes, the uniformity is real- I made the same exact argument at the beginning of this thread. But you can't judge "AI users" in general based on this output because you have selected only what is visibly uniform. Even if 99% of the users introduced enough variation to produce different results, you would still be selecting the 1% that is identical.
> Models are not trained to be wildly creative and try to stick to the point
Models might be as creative as humans, but they would still always start from the exact same state. If you ask an LLM to think of three random numbers it will always spit out the same ones. If you tell it to avoid the first that came to its mind, the second choices will also always be the same.
From qntm's Lena:
"the emulated Miguel Acevedo boots with an excited, pleasant demeanour. He is eager to understand how much time has passed since his uploading, what context he is being emulated in, and what task or experiment he is to participate in. If asked to speculate, he guesses that he may have been booted for the IAAS-1 or IAAS-5 experiments".
That discounts how much the other context, i.e. the system prompt and any other context submitted to the model, can affect the output. If you ask a model for medical advice as a patient versus as a doctor, you will get different output from the same model.
> They just produce banal copy whatever you ask them.
Nope, if you provide pages and pages of examples of a style to imitate, it will do it and do it fairly well. Of course how well they do it differs from one model to the next, but providing context and an extensive system prompt does change things every time.
Yes but not very different results (unless you're adding new information to your prompt or reducing some ambiguity). Prompt engineering is mostly pseudoscience.
What we need is steering, so that we can have models with different personalities and not just different prompts (because context is subject to forgetting). But this will never happen with closed-weight models, and I'm not sure it's even feasible at scale.
> A nice illustration of the homogeneity of LLM responses. [...] And they've all trained on the same or similar data, and are trained to respond in very similar ways.
I mostly agree, but this is a very simplified explanation. The models are indeed trained to respond in similar ways, for "basic" prompts. And that's as much a feature as it is a bug. In other words, the bug becomes apparent only if you give 100+ basic prompts. But giving it 100+ basic prompts and expecting originality is a silly endeavour. That's not how you get originality.
The way I'd go about generating 1000 books, while expecting different outcomes, is something along these lines (and nowadays you can ask your favorite LLM to wire up this workflow for you, with decent outcomes):
1. Ask for a list of 20 features that define a book (genre, style, number of characters, tropes, plot, continuity, relationships, etc.)
2. For each feature, ask for a list of 50 examples, ordered from most common to the most unique.
3. Randomly pick 10 features, and for each pick one of the 50 generated items. Ask for the rest of the features to match the theme.
4. Ask for 10 possible book outlines that match the chosen features, randomly pick between 2-8.
5. Create a detailed prompt that includes all the above features, and ask for a synopsis for each chapter, given the above outline chosen.
6. Given {features} and {outline} and {synopsis} write chapter 1.
7. for each chapter in list, given {...} and (optional) previous matching chapter(s), write chapter n+1
(optional 8.) given {...} and 2-3 consecutive chapters, align the ending / beginning of a new chapter for style / features / continuity, etc.
(optional 9.) given {...} and the whole book, list chapters / paragraphs that don't match the given {...} and provide a list of 5 improvements. (randomly choose 1 and ask for an edit).
----
Now, this probably won't give you something like Cloud Atlas, but they'll at least be different books. That's how I'd do it if I wanted to see how differently they can write. Not 1000 "basic" prompts and expecting originality.
This is very naive. I can almost guarantee that some combinations of 20 * 50 features will hit on something that has never been written before in that specific combination. And if that's still not enough, increase the number of features. Add more randomness, add more steering, add random steering in random chapters, change it up, and so on.
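For a sense of scale, here is a quick count of how many distinct starting points the random pick gives you. It assumes the reading from step 3 of the workflow: choose 10 of the 20 features, then one of 50 values for each chosen feature. Other readings of "20 * 50" would give different numbers.

```python
import math

N_FEATURES = 20   # features generated in step 1
N_PICKED = 10     # features fixed at random in step 3
N_OPTIONS = 50    # candidate values per feature from step 2

feature_subsets = math.comb(N_FEATURES, N_PICKED)   # which 10 features get fixed
value_choices = N_OPTIONS ** N_PICKED               # one of 50 values for each
total = feature_subsets * value_choices

print(f"{total:.3e}")
```

That comes out at roughly 1.8e22 combinations, so collisions between two random picks are very unlikely. It says nothing about whether the resulting books are any good, only that the seeds differ.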
>will hit on something that has never been written before in that specific combination
That's a very low bar. The skill of an artist is not in writing something that "has never been written before in that specific combination", it's in writing something that's unique or better than what was there, even if it has been written before in that specific combination.
I'm an art director. Finding a sequence that hasn't been hit in that specific combination is not sufficient to justify paying someone $150 an hour to go be creative.
> Add more randomness, add more steering, add random steering in random chapters, change it up, and so on.
That doesn't work for AI models. The whole training process depends on the basic principle that if you take the average of 100 examples (in this case, book cover designs), that average is less like randomness than any individual cover that went into it.
So the output will, by necessity, be closer to the average.
The human learning algorithm is much, much more data efficient than models. An absolute top human expert will have read/seen/heard/talked/... about 160 million "tokens" (that's about 2000 books). Frankly, the nerve inputs of all experiences of an entire human life, from baby to rewriting relativity theory, are only a couple dozen gigabytes.
Qwen 3.6 27B has been trained on 8 trillion tokens (each seen ~10 to ~50 times), or to put it another way: for every second you will have spent "gathering life experiences" by the time you're on your deathbed (i.e. your whole life), Qwen 3.6 27B has spent about 50,000 seconds learning. And really that figure should be multiplied by the 10 or 50 training iterations.
Add another 3 or so orders of magnitude and you've got ChatGPT. By this measure, the human brain outperforms ridiculously overspecced ML models (because that's what ChatGPT and the like are) in efficiency by a factor of 5 million or more. This is the reason humans are still faster than ML models.
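The arithmetic behind those ratios, worked through using only the figures given in the comment (none of them independently checked here):

```python
HUMAN_TOKENS = 160 * 10**6   # the comment's estimate for a top human expert
MODEL_TOKENS = 8 * 10**12    # the comment's figure for the model's training set

ratio = MODEL_TOKENS // HUMAN_TOKENS
print(ratio)  # 50000, matching the "50,000 seconds" figure

# Each token is said to be seen 10 to 50 times during training.
low, high = ratio * 10, ratio * 50

# "Another 3 or so orders of magnitude" for a ChatGPT-scale model.
frontier_ratio = ratio * 10**3
print(low, high, frontier_ratio)
```

On these numbers the ChatGPT-scale ratio is 50 million before counting repeated passes, which sits comfortably inside the "5 million or more" the comment claims.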
As for human training iterations, we can be simple: it's 1. In fact, it's impossible to make it even 2. Of course, when it comes to human performance, we are a better but not fundamentally different version of genetic algorithms. Do most humans perform? The honest answer is no. 1 in 1000, and that's very generous, improves SOTA. You absolutely need the 1000 failures though, as anyone who's tried a PhD (or even just designing a large program) knows.
So we are very far away from allowing AI models to do what humans can do: take one example and produce, from one example, a better output. And there will always be much more variation in that approach. But ... most human attempts to do something are total crap. Most AI attempts to do something will succeed, but they'll be comparatively bland, tasteless, "without soul", ...
And this is ignoring the problem that AI also has a massive limitation (that can't be solved, no matter how many nvidia cards you have) in that it trains against historical data. And counterfactuals don't work. What would have happened had Shakespeare decided Macbeth's wife was a force for good? Would the king still get murdered? Would it still be a great story? You can't work with counterfactuals.
Of course it does. I know it does because I've been using variations of this workflow since gpt3.0. In fact it's the only way it can work, since by design LLMs work from left to right. You can't expect it to produce original stuff if you don't give it the anchors for what original means. It'd be like going to a new bar every night and asking for a "beer that you haven't had before". There's no information to work on there.
The point was to take a random combination of story elements. Pick one each {King,dad,CEO} {betrays,kills,loves} {his enemy,the king,a foreign prime minister} and feed to an LLM.
The output will not be an intricate well designed epic storyline, but a cookie-cutter boring snoozefest.
BUT you can give that to a bunch of humans, who "insert their life experience" (ie. parts of their training data, translated to LLM terms) and sometimes out comes Game of Thrones, Star Wars, ...
When you generate one or two blog posts with an LLM they look pretty good. And you will be impressed with that one clever bit it adds that you didn't even ask for. But then you generate 50 of them and they all converge into the same pattern. It's hard to prove that an article is AI generated but they are instantly recognizable.
An aside: I usually take my written blog posts through a pass on NotebookLM to generate a podcast-like discussion about them. It used to be a good way to extract some insights I hadn't thought of. But after 50 of them, I can predict what the host will "pushback" on and exactly when. Then they magically resolve their differences and agree with whatever the idea was. It's truly impressive when you just consume sporadically. But listen frequently and they converge into one blob.
> But then you generate 50 of them and they all converge into the same pattern.
The AI slop that appears on YouTube as "revenge stories" and "POV life" all have that pattern. There's almost always a Marcus and a Richard, sometimes a Victoria, and they have consistent personalities across stories.
I suspect there are new invariants emerging. We don’t know what they are and we will probably have to reach into the liberal arts to describe them but to me what you’re seeing is akin to the subatomic world exposing itself through diffraction patterns.
You're just looking for the study of rhetoric. LLMs have clustered on certain rhetorical patterns/gestures, probably because of a combination of frequency in input and bias in training. But rhetoric also concerns the logical structures that underpin communicative techniques, and it's this logical infrastructure that's shaky or bizarre in LLM content (like the GP noticing how "pushback" always resolves without further examination).
I presume you mean that what I and others are observing is patterns in mere rhetoric. That this is just unimportant window dressing around the actual problem solving.
Yet, generation of rhetoric seems to be one of the key use cases, and one of the key features that makes this technology seem “intelligent”.
There used to be a word for this in generative AI: mode collapse. It's not that the model doesn't generate human-like responses, it's that it generates the same 0.0001% of possible human like responses every time. It's almost certainly the instruction tuning which is responsible, maybe some small part of blame could go to the rollout policy (I have no idea how rollout policy works these days).
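One rough way to put a number on this kind of collapse is a distinct-n ratio: the share of n-grams across a batch of outputs that are unique. A minimal sketch, assuming plain whitespace tokenization; the sample strings are made up and this is only a crude proxy, not a full diversity measure.

```python
def distinct_n(texts: list[str], n: int = 2) -> float:
    """Fraction of n-grams across all texts that are unique (1.0 = no repeats)."""
    ngrams = []
    for text in texts:
        tokens = text.lower().split()
        ngrams.extend(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


# Made-up samples: a repetitive batch scores lower than a varied one.
collapsed = ["Marcus smiled at Richard", "Marcus smiled at Victoria"]
varied = ["Marcus smiled at Richard", "The toaster burned a letter"]
print(distinct_n(collapsed), distinct_n(varied))
```

Run over 50 generated blog posts versus 50 human ones, a markedly lower score for the generated batch would be consistent with the convergence people describe, though it would not prove the cause.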
The LLM has its context-window. When it gets over that I assume it starts more or less repeating itself. Whereas human context-window has memories and inputs from all of one's life. Therefore great authors don't repeat themselves.
Now even if an LLM has a large context-window it is probably not the case that it remembers all of your previous prompts and all of its previous replies. If you ask it to write a book you should probably give it all the previous 50 books (or blog-posts) it has written for you so far and you should tell it not to repeat itself. But in practice the context-window and the cost of tokens would become too expensive for it to write 50 unique books.
Maybe the problem is the "all-or-nothing" nature of the LLM context window. Humans don't remember everything from the past but they remember something from ALL OF their past.
On HN many comments under many threads are about whether the submission was written by AI. You could say I have noticed a pattern in Hacker News comments!
In these comments there's a common pattern where some users argue that they do not agree that the submission was LLM written and they often focus on specific details to refute it (e.g em-dashes) and some users see the overall pattern clearly that it's totally obvious. For me it's a kind of smell, it's off putting and it's obvious. The article says to "trust your gut". But it's also something that comes with practice and time, it's not some innate thing. People may have better things to do than expend mental energy noticing patterns in a bunch of social media posts. The more I see it, the more I see it.
The take away I get is that it's okay to notice patterns and it's okay to not notice patterns. Remember that other people may be noticing patterns and associations in things that you might miss. Be charitable.
Far more interesting questions are:
1) If you can't see the patterns of LLM writing, does the idea that the thing you liked was written by an LLM worry you?
2) If you can see the patterns clearly, does the fact that it's LLM written worry you?
Because in our comments there are many who do not care that LLMs are writing content and there are many who do care. But are these correlated with those who can see the LLMs or who are blind to them?
I'm not worried about LLM-written content; my problem is not word prediction. My problem with it is pretty much the same as with the mass-produced self-help books of a decade ago.
Good human writing, especially on highly technical topics, is usually compression of information.
Like, you have some experience you want to share with others and you work your brain trying to put it into a concise story.
The problem is that AI-generated texts are the opposite 99% of the time: the author usually has a bullet-point list to feed into the machine, which adds a hallucinated, word-predicted story on top of it.
So the signal-to-noise ratio is much worse.
So reading AI texts is pretty much like listening to stories from humans with mental problems: no one really wants to listen to hallucinations even if somewhere inside there is some useful information.
I love this question. For me, the answer is that I will always value human-generated content more highly because of the affinity I feel for the author, a fellow human.
But I feel that affinity by default. If there's some convincing AI writing, I'll assume it was a human who did it. And if I ever find out I was wrong, the emptiness that results negates all the joy.
> The take away I get is that it's okay to notice patterns and it's okay to not notice patterns. Remember that other people may be noticing patterns and associations in things that you might miss. Be charitable.
I wish the people (often wrongly) accusing others of using an LLM to generate whatever were more charitable. Yes we notice patterns. But then we also notice patterns where there are none.
Notably, in programming this is actually a desirable feature for most problems. Even human programmers are taught to produce predictable and obvious code whenever possible. I wonder if ultimately this is an artifact of optimizing the models for code, that they become less creative.
Determinism is a desirable property for software, yes, and the lack of it in LLMs is a common complaint, though it is often a feature depending on who you ask. There is an element of randomness, “hotness”, that is central to how LLMs work. The pattern we see manifesting here reveals the deterministic processes below, but I don’t think you could rely on this technology to be deterministic, if that’s what you wanted.
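If "hotness" here means sampling temperature, the mix of determinism and randomness can be illustrated with a toy sampler. This is a sketch under that assumption: the logits are invented, and it is not any vendor's API or actual decoding code.

```python
import math
import random


def sample_with_temperature(
    logits: list[float], temperature: float, rng: random.Random
) -> int:
    """Pick an index from logits; temperature 0 means always take the top one."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    # Subtract the max before exponentiating for numerical stability.
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [1.0, 3.0, 2.0]  # made-up scores for three candidate tokens
rng = random.Random(42)
greedy = [sample_with_temperature(logits, 0, rng) for _ in range(5)]
hot = [sample_with_temperature(logits, 1.5, rng) for _ in range(5)]
print(greedy, hot)
```

At temperature 0 the same input always yields the same choice. Raising the temperature spreads picks across the alternatives, but the underlying scores, and so the overall pattern, stay the same.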
This is exactly why it is perfectly possible to identify AI-generated prose/images; it's not that any one word or sentence is the tell, but that it all sounds/looks the same as the other generated stuff.
At this point, I think the people who struggle with identifying the AI feel are telling you that they don't really engage with media much.
Even the authors' names seem to be generated in many cases. Look at how often "Bright" appears: Andrew W. Bright, Nolan Bright, Bright A. Jeffery, Pamela Bright, Thomas Bright, Daniel Bright, Mayan Bright, Henry Brightwood, Leo Brightham, Milo Brightspark.
There's also Molly Wonder, Elliot Wonder, Professor Pax Wonder, and Theo Wonderquill
I think for that instance to carry weight you would have to provide evidence that the mosaic of books were the product of different people using AI. If it is just one person doing variations on the same thing then it wouldn't mean very much.
> if a hundred “authors” give their favorite AI tool a similar prompt
Do we really believe there are 100 different people generating those? When I saw the books, I assumed they were generated on demand to match the (to me unlikely) search terms.
I don’t think I’m invested enough to research this. Amazon slop is harder and harder to wade through. (Searches are very imprecise. Deliberate, I’m sure.)
What is worse, IMO, is that these GenAI books have found their way into physical stores. You know, the few that are still left.
I've found AI slop at many big box stores (think Walmart, Target, etc. and all their equivalents around the world) - which I suspect are "industry plants", meaning that the publishing house will have someone internally generate books like these, and sell them as physical copies around the thousands of stores I mentioned.
It is the equivalent of record labels pushing their own in-house GenAI artists.
There is a new “mural” (a 10m tall graffiti painting if you will) that is obviously, but not ironically, an orange-toned AI designed picture, probably commissioned by the company the house belongs to; makes my eyes bleed every time.
We like this "same, complex set of mannerisms" when using an LLM for programming. If you ask an LLM to write a certain function for you, it gives you the statistically obvious implementation. But maybe for writing an original book, this feature is not so desirable.
It does not. Sometimes it will spawn a mess of ad hoc python, sometimes it will do curl and sed, and very very occasionally it will use the correct tool for the job if it remembers to use the skill you developed in the previous session.
Yes, sometimes it does something unexpected when used as a tool for programming. And in that context, it is seen as an unwanted feature. In fact, that was my point. However, I disagree that it does a good job only "very very occasionally". That is not my experience at all.
I think a majority of content consumers can already distinguish LLM content from human content. I'm looking forward to the day that they're intelligent enough to care, but I'm not holding my breath. Orwell framed it pretty well in 1984 with the machine-generated songs that were new every year, but always tugged on the heartstrings of the proles. They weren't really readers or listeners to music or appreciators of art before, and they can be caught in the trap indefinitely, since they'll never be aware of what came before or what's being done now outside their AI-driven feed.
Horselover Fat had a pretty good take on machine generated content, too.
It's not an "already", because I assume models will get better at addressing mode collapse.
The irony in the machine generated songs in 1984 was that Winston clearly found meaning in them, feeling like they applied to him, even though he knew they were machine generated: (from memory) "Under the shade of the chestnut tree / I sold you and you sold me / here lie they and here lie we / under the shade of the chestnut tree" - that refers to him and Julia selling each other out, right?
Just like people today (and in George Orwell's day, which was why he made it) find meaning in things that are obviously formulaic, manufactured corporate slop, like the endless MCU films.
I'm not sure the Chestnut Tree song was machine generated. [edit: I also recall Winston thinking that the proles' songs were sappy and repetitive]. I took that as an older song predating the machine slop. But maybe you're right, and if so it's a sadder and deeper irony.
Finding meaning in slop is not ennobling of the human spirit, and I see no reason to champion it.
Also if the meaning is that I sold you and you sold me; what is the upside here?
I don't want to hurt people's feelings, so in person I restrain myself from speaking out (it wouldn't change anything anyway)... but every person I have seen so far, who was bullish on building an AI business has followed the same path:
1) They think the AI can replace them, but in a good way: "it will keep doing my job and people will pay ME"
2) They assume people either don't notice or don't mind that it's AI. They build businesses, where AI impersonates a professional when that person is not available ("chat with your therapist any time even if they sleep!")
3) All they do is based on written or spoken words. There is no substance
I expect that sooner than later a great skepticism for anything non-tangible will develop. Personally, I have been highly distrustful of people who don't build things (even the word "building" is now tainted). I think it will accelerate.
No one should like slop autogen books, but this is barely more damning than being upset that all the garments have 2 legs when they search for "pants".
It’s actually pretty easy to distinguish AI from real text because all AI generated texts have 2.4 children. In aggregate it is indistinguishable statistically but for a given text it’s remarkably easy.
Have you seen Egyptian paintings or Hollywood movies?
Everything is slop if you make enough of it and squint hard enough.
The point with AI is if and how to steer it to produce something that is interesting and unique for you, not another bland cookie cutter blockbuster or lame summer song.
You mean the one they specifically include as an example of LLM-generated markers? Did you actually read the article, or just scan for excuses to call them out as LLM output?
You're either sarcastic or you missed all of: a) this being in italics, b) this being in quotes, c) this being the only LLM pattern in the post, d) this being quoted explicitly as an LLM marker. Good job in both cases, I guess.
So the whole thesis is that because the cover images are very similar, LLMs are bad at writing text?
I think it's that today's LLMs have access to poor/generic image generation models and people find it easier to ask ChatGPT or NanoBanana to make a cover instead of fine tuning a small SD model for the purpose.
I don't know how much of a smoking gun this actually is, the evidence proffered doesn't establish anything - I can see some names there like Havilah Brooks or Celina Briar who are intentionally re-using the same title to create a series, for example. And this doesn't really get into the base rate of generic title re-use among encyclopedias. There isn't much reward for coming up with an imaginative title for kids, they're not very experienced. I'd have no trouble believing publishers come up with very similarly titled books in the kids encyclopedia all the time, they already recycle plots like there is no yesterday in fantasy.
I think the article's point is probably sound to some great extent, but I would believe I owned a book with a title like "100,000 Whys" when I was young. With a dinosaur and a rocket on the front. I loved dinosaurs and rockets, they're even still cool today.
Have you seen the content of the books in the tweet[1] linked below the article? Between horses with fused butts and other diagrams that don't say as much as they purport to, the cover is the least of its problems, although the only one that can be criticized directly.
This seems to be some out of context pictures where I have no idea what they are meant to be showing or whether they succeeded. And although the cats and zebras are clear AI images again that doesn't mean anything, being exactly 2 pictures presented with no context apart from being seen in a book. So there is a book where the editor was lazy and let some bad AI images through.
I'm sure someone deeply familiar with children's publishing would be able to talk authoritatively on the extent of new trends, but this seems to be the infosec community and the evidence offered doesn't seem to actually be evidence of anything. There isn't a baseline. Children's encyclopedias might have been a hard-hitting game of radical creativity and high standards in the past, or it could be an endless tide of derivative swill.
And using AI images seems unrelated. That's something people should just be doing. Ideally with better proofreading, but hey. The article's complaint was about lack of originality.
Aw, it's just one big picture of book covers. You can't click on the books and read them. If they're AI-written, they're not copyrightable, so you could post the full text. Looking at the books side by side would be interesting.
A test for AI-generated art: railroad tracks. For some reason, none of the image generators can get railroad trackage even close to correct. Just getting long, parallel rails correct seems to be hard. Where there are multiple tracks, trains are positioned between tracks. Rail spacing, tie spacing, and clearances are all wrong. Two long parallel tracks without the rails getting mixed up is rare. Curves are wrong. Switches are hopeless.
There may be something about maintaining strong coherence all the way across an image that's hard for Stable Diffusion type systems. Iterated local refinement seems to botch this class of image.
Examples: [1][2][3][4][5][6]
[1] https://www.vecteezy.com/photo/37205933-ai-generated-high-sp...
[2] https://www.dreamstime.com/royalty-free-stock-photography-mo...
[3] https://www.magnific.com/premium-ai-image/high-speed-passeng...
[4] https://www.magnific.com/premium-ai-image/rail-yard-27_27291...
[5] https://www.magnific.com/premium-ai-image/train-track-with-s...
[6] https://pixabay.com/illustrations/ai-generated-train-tracks-...
Periodic motions coupled with "whole image coherence" is still very difficult even for non-SD based models (NB, Flux, etc.)
I remember being absolutely shocked when the gpt-image series managed to pass the Labyrinth test.
https://genai-showdown.specr.net/#the-labyrinth
You can read at least the first chapter or so if you do the Amazon search. I did, and made some discoveries.
* https://mastodonapp.uk/@JdeBP/116788436560592117
https://progress.openai.com/?prompt=10
LLMs are great at producing average.
We see this with their GenAI music equivalents. All the music these GenAI models produce is exceptionally (aggressively, even) average.
It is the most polished average you'll ever find. Never awful (anymore), never fantastic. Just bang in the middle.
>Never awful (anymore), never fantastic
Don't know about that, I always found average awful in itself, even in human output (like most pop), and even more so in AI output.
Something actually awful can be better than average - more entertaining and more felt. I'd rather watch The Room than an average movie.
That is definitely the essence of AI: It is the average of all the inputs it has been trained on.
Frank Zappa was once asked about guitar virtuosos like John McLaughlin and his answer was something like "You can maybe play solos faster than anybody, but can your playing surprise me?".
I don't think the comparison to humans works. It is as if you expect that we can easily train many different LLMs to solve the originality problem, but that is far from guaranteed.
I wonder how much variation there would be if you got a single model to produce a couple of gigabytes of tiny children's stories.
Might be an interesting research project.
There is one already: https://arxiv.org/abs/2305.07759 https://huggingface.co/datasets/roneneldan/TinyStories
6.5GB of tiny stories, as requested. ;)
My comment was, in fact, a subtle reference to this.
The best opening I got from my own TinyStories-trained model was:
Once upon a time, in a small town, there was a large town.
Which I just love as an evocative idea.
Texts in Gutenberg have 20GB, and full Wikipedia (English texts) have 80-110GB.
So to LLM-generate 6.5GB of tiny stories is quite a permutation in action :)
Reminds me of Pluribus.
Pluribus is kinda different. An LLM cannot wander too far from the average, even if it wanted to. In Pluribus, the 'others' work toward a common goal, each using their own expertise, knowledge and experiences in a shared way. Each is unique. They can, if they want, perform as the host's individual self from before the joining. To put it another way, the others in Pluribus are convergent by choice; LLMs are convergent by design.
> you're asking 1,000 different humans with different experiences and different skills and different moods
Simply put, if you ask an LLM, you're always asking the same mind, and always for the first time.
Also since those are lazy, you are also asking always in the same manner. How homogeneous were the prompts that generated those covers?
People are making cookies with cookie cutter number 5 and other people wonder how come they are all the same.
Classic self selection effect though - if you’re resorting to LLM writing you’re almost certainly skewing lazy enough to not even bother trying to add perturbations strong enough to make the response deviate from the uniformity of the slop.
I do think that's a big part of it. AI output moves towards the average, and anyone who wants to use it doesn't care enough to push against that tendency.
Seems that both you and the gp are starting from the assumption that those uniform results are representative of those who use AI and of AI usage. In fact they have been chosen for their uniformity- they might be only a small part of a much more varied output obtained by more demanding (or lucky) users.
I think the uniformity is real. All users interact with the same initial state of the model when they start each chat. Models are not trained to be wildly creative and try to stick to the point. So when users prompt them in pretty much the same manner they quite stably generate very similar output.
I wonder if there isn't a simple creative hack to discover, for example prompting the model to produce more unexpected output just by injecting some randomness before the actual creative command in the prompt.
Yes, the uniformity is real- I made the same exact argument at the beginning of this thread. But you can't judge "AI users" in general based on this output because you have selected only what is visibly uniform. Even if 99% of the users introduced enough variation to produce different results, you would still be selecting the 1% that is identical.
> Models are not trained to be wildly creative and try to stick to the point
Models might be as creative as humans, they would still start always from the exact same state. If you ask an LLM to think of three random numbers it will spit out always the same ones. If you tell it to avoid the first that came to its mind, the second choices will also be always the same.
From qntm's Lena:
"the emulated Miguel Acevedo boots with an excited, pleasant demeanour. He is eager to understand how much time has passed since his uploading, what context he is being emulated in, and what task or experiment he is to participate in. If asked to speculate, he guesses that he may have been booted for the IAAS-1 or IAAS-5 experiments".
That discounts how much the other context (i.e. the system prompt and any other context submitted to the model) can affect the output. If you ask a model for medical advice as a patient versus as a doctor, you will get different output from the same model.
Prompts will give very different results. This is where you do the work.
I disagree. The LLM outputs really do lack anything original or interesting. They just produce banal copy whatever you ask them.
A good editor could probably reduce all LLM outputs on a subject down to the same point.
> They just produce banal copy whatever you ask them.
Nope, if you provide pages and pages of examples of a style to imitate, it will do it and do it fairly well. Of course how well they do it differs from one model to the next, but providing context and an extensive system prompt does change things every time.
Imitation is banal.
A controller has to be at least as complex as what it is supposed to control.
Yes but not very different results (unless you're adding new information to your prompt or reducing some ambiguity). Prompt engineering is mostly pseudoscience.
What we need is steering so that we can have models with different personalities, not just different prompts (because context is subject to forgetting), but this will never happen with closed-weight models, I'm not sure if it's even feasible at scale.
Yet another reason why the future is open weight.
> Prompt engineering is mostly pseudoscience.
Not my experience.
Do you have anything others can reliably reproduce? If not… well it wasn't science.
> A nice illustration of the homogeneity of LLM responses. [...] And they've all trained on the same or similar data, and are trained to respond in very similar ways.
I mostly agree, but this is a very simplified explanation. The models are indeed trained to respond in similar ways, for "basic" prompts. And that's as much a feature as it is a bug. In other words, the bug becomes apparent only if you give 100+ basic prompts. But giving it 100+ basic prompts and expecting originality is a silly endeavour. That's not how you get originality.
The way I'd go about generating 1000 books while expecting different outcomes is something along these lines (and nowadays you can ask your favorite LLM to wire up this workflow for you, with decent outcomes):
1. Ask for a list of 20 features that define a book (genre, style, number of characters, tropes, plot, continuity, relationships, etc.)
2. For each feature, ask for a list of 50 examples, ordered from most common to the most unique.
3. Randomly pick 10 features, and for each pick one of the 50 generated items. Ask for the rest of the features to match the theme.
4. Ask for 10 possible book outlines that match the chosen features, randomly pick between 2-8.
5. Create a detailed prompt that includes all the above features, and ask for a synopsis for each chapter, given the above outline chosen.
6. Given {features} and {outline} and {synopsis} write chapter 1.
7. For each chapter in the list, given {...} and (optionally) the previous matching chapter(s), write chapter n+1.
(optional 8.) given {...} and 2-3 consecutive chapters, align the ending / beginning of a new chapter for style / features / continuity, etc.
(optional 9.) given {...} and the whole book, list chapters / paragraphs that don't match the given {...} and provide a list of 5 improvements. (randomly choose 1 and ask for an edit).
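Steps 1 to 4 above can be sketched in a few lines. This is only a minimal sketch under loud assumptions: the feature names and options are placeholder strings standing in for whatever lists the model returns in steps 1 and 2, and `make_placeholder_features`, `sample_book_spec`, `count_specs` and `build_outline_prompt` are hypothetical helper names, not part of any library. No model is called; the output is just the prompt text for step 4.

```python
import math
import random


def make_placeholder_features(n_features=20, n_options=50):
    """Stand-in for steps 1-2: in practice these lists would come from the LLM."""
    return {
        f"feature_{i}": [f"feature_{i}_option_{j}" for j in range(n_options)]
        for i in range(n_features)
    }


def sample_book_spec(feature_options, n_fixed, rng):
    """Step 3: pin n_fixed random features to one random option each.

    Returns (fixed, open_features); the open features are left for the
    model to fill in so that they match the theme.
    """
    names = sorted(feature_options)
    fixed_names = rng.sample(names, n_fixed)
    fixed = {name: rng.choice(feature_options[name]) for name in fixed_names}
    open_features = [name for name in names if name not in fixed]
    return fixed, open_features


def count_specs(n_features, n_fixed, n_options):
    """Number of distinct (feature subset, option choice) combinations."""
    return math.comb(n_features, n_fixed) * n_options ** n_fixed


def build_outline_prompt(fixed, open_features, n_outlines=10):
    """Step 4: assemble the outline request (the wording is illustrative only)."""
    lines = [f"Propose {n_outlines} book outlines with these fixed features:"]
    lines += [f"- {name}: {value}" for name, value in sorted(fixed.items())]
    lines.append("Choose values for the remaining features to match the theme:")
    lines += [f"- {name}" for name in open_features]
    return "\n".join(lines)


if __name__ == "__main__":
    rng = random.Random()  # pass a seed here for reproducible runs
    options = make_placeholder_features()
    fixed, open_features = sample_book_spec(options, n_fixed=10, rng=rng)
    print(build_outline_prompt(fixed, open_features))
    print(count_specs(20, 10, 50))
```

With 20 features, 10 of them pinned, and 50 options each, `count_specs` gives C(20,10) × 50^10, roughly 1.8 × 10^22 distinct specs, so two runs are very unlikely to pin the same combination. Whether the resulting books actually read differently is a separate question.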
----
Now, this probably won't give you something like Cloud Atlas, but they'll at least be different books. That's how I'd do it if I wanted to see how differently they can write. Not 1000 "basic" prompts and expecting originality.
That whole thing would get you 1000 variants of existing art. But if you asked a thousand different designers to do a cover for the same book...
> 1000 variants of existing art.
This is very naive. I can almost guarantee that some combinations of 20 * 50 features will hit on something that has never been written before in that specific combination. And if that's still not enough, increase the number of features. Add more randomness, add more steering, add random steering in random chapters, change it up, and so on.
>will hit on something that has never been written before in that specific combination
That's a very low bar. The skill of an artist is not in writing something that "has never been written before in that specific combination", it's in writing something that's unique or better than what was there, even if it has been written before in that specific combination.
I'm an art director. Finding a sequence that hasn't been hit in that specific combination is not sufficient to justify paying someone $150 an hour to go be creative.
> Add more randomness, add more steering, add random steering in random chapters, change it up, and so on.
That doesn't work for AI models. The whole training process depends on the basic principle that if you take the average of 100, in this case book cover designs, that the average is less like randomness than any individual cover you've used to make your average.
So the output will, by necessity, be closer to the average.
The human learning algorithm is much, much more data efficient than models. An absolute top human expert will have read/seen/heard/talked/... about 160 million "tokens" (that's about 2000 books). Frankly, the nerve inputs of all experiences of an entire human life, from baby to rewriting relativity theory, are only a couple dozen gigabytes.
Qwen 3.6 27B has been trained on 8 trillion tokens (each seen ~10 to ~50 times), or to put it another way: for every second you will have spent "gathering life experiences" (i.e. your whole life) by the time you are on your deathbed, Qwen 3.6 27B has spent about 50,000 seconds learning. And really that figure should be multiplied by the 10 or 50 training iterations.
Add another 3 or so orders of magnitude and you've got ChatGPT. By this measure, the human brain outperforms ridiculously overspecced ML models (because that's what ChatGPT and the like are) in efficiency by a factor of 5 million or more. This is the reason humans are still faster than ML models.
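The arithmetic behind those ratios is easy to check. The sketch below takes the comment's own figures at face value (160 million tokens and about 2000 books for a human expert, 8 trillion tokens and 10 to 50 passes for the model); none of them are verified here, and the variable names are just for illustration.

```python
# Figures quoted in the comment above; treated as assumptions, not verified.
HUMAN_TOKENS = 160_000_000          # lifetime "tokens" of a top human expert
HUMAN_BOOKS = 2_000                 # the same amount expressed as books
MODEL_TOKENS = 8_000_000_000_000    # claimed training-set size for the model
EPOCHS_LOW, EPOCHS_HIGH = 10, 50    # claimed number of passes over the data

# Implied average book length in tokens.
tokens_per_book = HUMAN_TOKENS // HUMAN_BOOKS

# How many model training tokens there are per human "token".
unique_token_ratio = MODEL_TOKENS // HUMAN_TOKENS

# The same ratio once repeated passes over the data are counted.
seen_ratio_low = unique_token_ratio * EPOCHS_LOW
seen_ratio_high = unique_token_ratio * EPOCHS_HIGH

if __name__ == "__main__":
    print(tokens_per_book, unique_token_ratio, seen_ratio_low, seen_ratio_high)
```

So the "50,000 seconds per second" figure is just 8 trillion divided by 160 million, and counting the repeated passes raises it to between 500,000 and 2.5 million.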
As for human training iterations: we can be simple: it's 1. In fact, it's impossible to make it even 2. Of course, when it comes to human performance: we are a better, but not fundamentally different, version of genetic algorithms. Do most humans perform? The honest answer is no. 1 in 1000, and that's very generous, improves SOTA. You absolutely need the 1000 failures though, as anyone who's tried a PhD (or even just designing a large program) knows.
So we are very far away from allowing AI models to do what humans can do: take one example and produce, from one example, a better output. And there will always be much more variation in that approach. But ... most human attempts to do something are total crap. Most AI attempts to do something will succeed, but they'll be comparatively bland, tasteless, "without soul", ...
And this is ignoring the problem that AI also has a massive limitation (that can't be solved, no matter how many nvidia cards you have) in that it trains against historical data. And counterfactuals don't work. What would have happened had Shakespeare decided Macbeth's wife was a force for good? Would the king still get murdered? Would it still be a great story? You can't work with counterfactuals.
> That doesn't work for AI models.
Of course it does. I know it does because I've been using variations of this workflow since gpt3.0. In fact it's the only way it can work, since by design LLMs work from left to right. You can't expect it to produce original stuff if you don't give it the anchors for what original means. It'd be like going to a new bar every night and asking for a "beer that you haven't had before". There's no information to work on there.
The point was to take a random combination of story elements. Pick one each {King,dad,CEO} {betrays,kills,loves} {his enemy,the king,a foreign prime minister} and feed to an LLM.
The output will not be an intricate well designed epic storyline, but a cookie-cutter boring snoozefest.
BUT you can give that to a bunch of humans, who "insert their life experience" (ie. parts of their training data, translated to LLM terms) and sometimes out comes Game of Thrones, Star Wars, ...
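The random combination of story elements described above is small enough to enumerate. A minimal sketch: the three slots are copied from the comment, while `all_premises` and `random_premise` are hypothetical helper names, and feeding the result to a model is left out entirely.

```python
import itertools
import random

# The three slots from the comment above.
SUBJECTS = ["King", "dad", "CEO"]
ACTIONS = ["betrays", "kills", "loves"]
TARGETS = ["his enemy", "the king", "a foreign prime minister"]
SLOTS = (SUBJECTS, ACTIONS, TARGETS)


def all_premises():
    """Every subject/action/target combination as a one-line premise."""
    return [" ".join(parts) for parts in itertools.product(*SLOTS)]


def random_premise(rng):
    """Pick one element per slot, as the comment suggests."""
    return " ".join(rng.choice(slot) for slot in SLOTS)


if __name__ == "__main__":
    print(len(all_premises()))
    print(random_premise(random.Random()))
```

There are only 3 × 3 × 3 = 27 premises here, which is the point being argued: the combinatorics are trivial, and what differs is what gets built on top of the seed.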
When you generate one or two blog posts with LLM they look pretty good. And you will be impressed with that one clever bit it adds that you didn't even ask for. But then you generate 50 of them and they all converge into the same pattern. It's hard to prove that an article is AI generated but they are instantly recognizable.
An aside: I usually take my written blog posts through a pass on NotebookLM to generate a podcast-like discussion about them. It used to be a good way to extract some insights I hadn't thought of. But after 50 of them, I can predict what the host will "push back" on and exactly when. Then they magically resolve their differences and agree with whatever the idea was. It's truly impressive when you just consume sporadically. But listen frequently and they converge into one blob.
> It's truly impressive when you just consume sporadically. But listen frequently and they converge into one blob.
And something that shows that behavior is a scammer's wet dream!
> But then you generate 50 of them and they all converge into the same pattern.
The AI slop that appears on YouTube as "revenge stories" and "POV life" all have that pattern. There's almost always a Marcus and a Richard, sometimes a Victoria, and they have consistent personalities across stories.
I suspect there are new invariants emerging. We don’t know what they are and we will probably have to reach into the liberal arts to describe them but to me what you’re seeing is akin to the subatomic world exposing itself through diffraction patterns.
You're just looking for the study of rhetoric. LLMs have clustered on certain rhetorical patterns/gestures, probably because of a combination of frequency in input and bias in training. But rhetoric also concerns the logical structures that underpin communicative techniques, and it's this logical infrastructure that's shaky or bizarre in LLM content (like the GP noticing how "pushback" always resolves without further examination).
Discussing style is only scratching the surface on classical rhetoric.
An LLM will rarely be able to move the needle on ethos or pathos beyond generating well-formed sentences with proper spelling.
> You're just looking for the study of rhetoric.
I presume you mean that what I and others are observing is patterns in mere rhetoric. That this is just unimportant window dressing around the actual problem solving.
Yet, generation of rhetoric seems to be one of the key use cases, and one of the key features that makes this technology seem “intelligent”.
> they all converge
AI is regression to the mean.
Much like Socialism.
On an acute basis, AI can be just as helpful as that safety net.
As a chronic matter, "it's not excellence--it's mediocrity".
And capitalism as seen in the USA is regression to the bottom of the cesspool?
Please explain your point in the context of the fun the FIFA tourists are having.
There used to be a word for this in generative AI: mode collapse. It's not that the model doesn't generate human-like responses, it's that it generates the same 0.0001% of possible human like responses every time. It's almost certainly the instruction tuning which is responsible, maybe some small part of blame could go to the rollout policy (I have no idea how rollout policy works these days).
The LLM has its context-window. When it gets over that I assume it starts more or less repeating itself. Whereas human context-window has memories and inputs from all of one's life. Therefore great authors don't repeat themselves.
Now even if an LLM has a large context-window it is probably not the case that it remembers all of your previous prompts and all of its previous replies. If you ask it to write a book you should probably give it all the previous 50 books (or blog-posts) it has written for you so far and you should tell it not to repeat itself. But in practice the context-window and the cost of tokens would become too expensive for it to write 50 unique books.
Maybe the problem is the "all-or-nothing" nature of the LLM context window. Humans don't remember everything from their past, but they remember something from ALL OF their past.
On HN many comments under many threads are about whether the submission was written by AI. You could say I have noticed a pattern in Hacker News comments!
In these comments there's a common pattern where some users argue that they do not agree that the submission was LLM written and they often focus on specific details to refute it (e.g em-dashes) and some users see the overall pattern clearly that it's totally obvious. For me it's a kind of smell, it's off putting and it's obvious. The article says to "trust your gut". But it's also something that comes with practice and time, it's not some innate thing. People may have better things to do than expend mental energy noticing patterns in a bunch of social media posts. The more I see it, the more I see it.
The take away I get is that it's okay to notice patterns and it's okay to not notice patterns. Remember that other people may be noticing patterns and associations in things that you might miss. Be charitable.
Far more interesting questions are:
1) If you can't see the patterns of LLM writing, does the idea that the thing you liked was written by an LLM worry you?
2) If you can see the patterns clearly, does the fact that it's LLM-written worry you?
Because in our comments there are many who do not care that LLMs are writing content and there are many who do care. But are these correlated with those who can see the LLMs or who are blind to them?
I'm not worried about LLM-written content; my problem is not word prediction. My problem with it is pretty much the same as with the mass-produced self-help books of a decade ago.
Good human writing, especially on highly technical topics, is usually compression of information.
Like, you have some experience you want to share with others and you work your brain to put it into a concise story.
Problem is: AI-generated texts are the opposite 99% of the time: the author usually has a bullet point list to feed into the machine, which adds a hallucinated, word-predicted story on top of it.
So the signal to noise ratio is much worse.
So reading AI texts is pretty much like listening to stories from humans with mental problems - no one really wants to listen to hallucinations even if somewhere inside there is some useful information.
3) If genAI becomes indistinguishable from human-generated (and cheaper!), would you still value human-generated as much?
Analogy: assuming high quality / both fit for purpose, would you still prefer expensive, hand-crafted item over cheap(er), mass-produced item?
I love this question. For me, the answer is that I will always value human-generated content more highly because of the affinity I feel for the author, a fellow human.
But I feel that affinity by default. If there's some convincing AI writing, I'll assume it was a human who did it. And if I ever find out I was wrong, the emptiness that results negates all the joy.
> The take away I get is that it's okay to notice patterns and it's okay to not notice patterns. Remember that other people may be noticing patterns and associations in things that you might miss. Be charitable.
I wish the people (often wrongly) accusing others of using a LLM to generate whatever were more charitable. Yes we notice patterns. But then we also notice patterns where there are none.
It's even more worrying when you look at the contents of these "books": they are riddled with errors:
https://infosec.exchange/@lcamtuf/116785283147249092
That's a generalization from 1 book. I went and looked at ten or so. This is not actually the case. There's something more complex going on.
* https://mastodonapp.uk/@JdeBP/116788511790947929
Truly sad state of affairs
This is atrocious
Notably, in programming this is actually a desirable feature for most problems. Even human programmers are taught to produce predictable and obvious code whenever possible. I wonder if ultimately this is an artifact of optimizing the models for code, that they become less creative.
Determinism is a desirable property for software, yes, and its absence from LLMs is a common complaint, but often a feature depending on who you ask. There is an element of randomness, “hotness”, that is central to how LLMs work, but the pattern we see manifesting here reveals the deterministic processes below. Still, I don’t think you could rely on this technology to be deterministic, if that’s what you wanted.
I’ve rarely experienced this. Typically what is requested is code that has unpredictable pauses, takes unbounded time, has two kinds of null, etc.
This is exactly why it is perfectly possible to identify AI-generated prose/images; it's not that any one word or sentence is the tell, but that it all sounds/looks the same as the other generated stuff.
At this point, I think the people who struggle with identifying the AI feel are telling you that they don't really engage with media much.
Even the authors' names seem to be generated in many cases. Look at how often "Bright" appears: Andrew W. Bright, Nolan Bright, Bright A. Jeffery, Pamela Bright, Thomas Bright, Daniel Bright, Mayan Bright, Henry Brightwood, Leo Brightham, Milo Brightspark.
There's also Molly Wonder, Elliot Wonder, Professor Pax Wonder, and Theo Wonderquill
Don't forget Lucas Thinkwell!
Is tweaking the temperature of the model not a thing anymore?
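For context, in the usual sampling setup temperature divides the model's logits before the softmax, so a higher temperature flattens the next-token distribution and a lower one sharpens it. A minimal sketch with made-up logits; `softmax_with_temperature` and `sample_index` are hypothetical helper names, not any vendor's API.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, temperature, rng):
    """Draw one index according to the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5, 0.0]  # made-up scores for four candidate tokens
    for t in (0.5, 1.0, 2.0):
        print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
    print(sample_index(logits, 1.0, random.Random()))
```

Raising the temperature only spreads probability across tokens the model already rates as plausible; it does not change the ranking, which is one reason it may not help much against the sameness discussed in this thread.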
I think for that instance to carry weight you would have to provide evidence that the mosaic of books were the product of different people using AI. If it is just one person doing variations on the same thing then it wouldn't mean very much.
I love the illustration of the same-ness of AI.
One question / quibble:
> if a hundred “authors” give their favorite AI tool a similar prompt
Do we really believe there are 100 different people generating those? When I saw the books, I assumed they were generated on demand to match the (to me unlikely) search terms.
I don’t think I’m invested enough to research this. Amazon slop is harder and harder to wade through. (Searches are very imprecise. Deliberate, I’m sure.)
Maybe the LLMs need some kind of "coverage" metric so they prioritise new paths? The author would know a thing or two about that.
What is worse, IMO, is that these GenAI books have found their way into physical stores. You know, the few that are still left.
I've found AI slop at many big box stores (think Walmart, Target, etc. and all their equivalents around the world) - which I suspect are "industry plants", meaning that the publishing house will have someone internally generate books like these, and sell them as physical copies around the thousands of stores I mentioned.
It is the equivalent of record labels pushing their own in-house GenAI artists.
I'm sure they are banking on the possibility that the average book hoarder cannot tell the difference.
There is a new “mural” (a 10m tall graffiti painting, if you will) that is, obviously but not ironically, an orange-toned AI-designed picture, probably commissioned by the company the house belongs to; it makes my eyes bleed every time.
Interesting observation, give me 100,000 whys to believe you
We like this "same, complex set of mannerisms" when using LLMs for programming. If you ask an LLM to write a certain function for you, it gives you the statistically obvious implementation. But maybe for writing an original book, this feature is not so desirable.
It does not. Sometimes it will spawn a mess of ad hoc python, sometimes it will do curl and sed, and very very occasionally it will use the correct tool for the job if it remembers to use the skill you developed in the previous session.
Yes, sometimes it does something unexpected when used as a tool for programming. And in that context, it is seen as an unwanted feature. In fact, that was my point. However, I disagree that it does a good job only "very very occasionally". That is not my experience at all.
I think a majority of content consumers can already distinguish LLM content from human content. I'm looking forward to the day that they're intelligent enough to care, but I'm not holding my breath. Orwell framed it pretty well in 1984 with the machine-generated songs that were new every year, but always tugged on the heartstrings of the proles. They weren't really readers or listeners to music or appreciators of art before, and they can be caught in the trap indefinitely, since they'll never be aware of what came before or what's being done now outside their AI-driven feed.
Horselover Fat had a pretty good take on machine generated content, too.
It's not an "already", because I assume models will get better at addressing mode collapse.
The irony in the machine generated songs in 1984 was that Winston clearly found meaning in them, feeling like they applied to him, even though he knew they were machine generated: (from memory) "Under the shade of the chestnut tree / I sold you and you sold me / here lie they and here lie we / under the shade of the chestnut tree" - that refers to him and Julia selling each other out, right?
Just like people today - and in George Orwell's day, which was why he made it - find meaning in things which are obviously formulaic, manufactured corporate slop, like the endless MCU films.
I'm not sure the Chestnut Tree song was machine generated. [edit: I also recall Winston thinking that the proles' songs were sappy and repetitive]. I took that as an older song predating the machine slop. But maybe you're right, and if so it's a sadder and deeper irony.
Finding meaning in slop is not ennobling of the human spirit, and I see no reason to champion it.
Also if the meaning is that I sold you and you sold me; what is the upside here?
I don't want to hurt people's feelings, so in person I restrain myself from speaking out (it wouldn't change anything anyway)... but every person I have seen so far, who was bullish on building an AI business has followed the same path:
I expect that sooner rather than later a great skepticism for anything non-tangible will develop. Personally, I have been highly distrustful of people who don't build things (even the word "building" is now tainted). I think it will accelerate.
No one should like slop autogen books, but this is barely more damning than being upset that all the garments have 2 legs when they search for "pants".
It’s actually pretty easy to distinguish AI from real text because all AI generated texts have 2.4 children. In aggregate it is indistinguishable statistically but for a given text it’s remarkably easy.
Have you seen Egyptian paintings or Hollywood movies?
Everything is slop if you make enough of it and squint hard enough.
The point with AI is if and how to steer it to produce something that is interesting and unique for you, not another bland cookie cutter blockbuster or lame summer song.
Ignore me
> This is a fuzzy signal, so you shouldn’t fire your intern when they say “it’s not this — it’s that”.
The author literally points to that tell in the article.
In a weird twist, I wonder if you’re an LLM?
You mean the one they specifically include as an example of LLM-generated markers? Did you actually read the article, or just scan for excuses to call them out as LLM output?
You're either sarcastic or you missed all of: a) this being in italics, b) this being in quotes, c) this being the only LLM pattern in the post, d) this being quoted explicitly as an LLM marker. Good job in both cases, I guess.
The whole point of the thesis is that because the cover images are very similar, LLMs are therefore bad at writing text?
I think it's that today's LLMs have access to poor/generic image generation models and people find it easier to ask ChatGPT or NanoBanana to make a cover instead of fine tuning a small SD model for the purpose.
The people in the FediVerse discussions have also looked at the book contents.
* https://mastodonapp.uk/@JdeBP/116788511790947929
* https://hachyderm.io/@ariels/116788498255660876
I don't know how much of a smoking gun this actually is, the evidence proffered doesn't establish anything - I can see some names there like Havilah Brooks or Celina Briar who are intentionally re-using the same title to create a series, for example. And this doesn't really get into the base rate of generic title re-use among encyclopedias. There isn't much reward for coming up with an imaginative title for kids, they're not very experienced. I'd have no trouble believing publishers come up with very similarly titled books in the kids encyclopedia all the time, they already recycle plots like there is no yesterday in fantasy.
I think the article's point is probably sound to some great extent, but I would believe I owned a book with a title like "100,000 Whys" when I was young. With a dinosaur and a rocket on the front. I loved dinosaurs and rockets, they're even still cool today.
Have you seen the content of the books in the tweet[1] linked below the article? Between horses with fused butts and other diagrams that don't say as much as they purport to, the cover is the least of its problems, although the only one that can be criticized directly.
[1] https://infosec.exchange/@lcamtuf/116785283147249092
This seems to be some out of context pictures where I have no idea what they are meant to be showing or whether they succeeded. And although the cats and zebras are clear AI images again that doesn't mean anything, being exactly 2 pictures presented with no context apart from being seen in a book. So there is a book where the editor was lazy and let some bad AI images through.
I'm sure someone deeply familiar with children's publishing would be able to talk authoritatively on the extent of new trends, but this seems to be the infosec community and the evidence offered doesn't seem to actually be evidence of anything. There isn't a baseline. Children's encyclopedias might have been a hard-hitting game of radical creativity and high standards in the past, or it could be an endless tide of derivative swill.
And using AI images seems unrelated. That's something people should just be doing. Ideally with better proofreading, but hey. The article's complaint was about lack of originality.
Did you see what's inside one of those books?
https://infosec.exchange/@lcamtuf/116785283147249092
This is Amazon #1 bestseller in "Children's Encyclopedias"!