No, it can’t recreate a book. Well, maybe it could get most of the way for the B...

modo_mario · 2025-06-27T11:24:40 1751023480

>People might quote Harry Potter a lot, but they don’t quote the entire thing over and over, chapter and verse, on hundreds of thousands of different websites.

I'm fairly certain I could find the entire thing in plain text in multiple places online. A quick Google search gives Philosopher's Stone as the second result, in PDF format on the Internet Archive, but I'm sure with a bit of looking I'd bump into a lot of plaintext copies.

They might have taken measures to keep this out of their training data (I think it would be fairly easy and something they'd likely do), but if at any point they failed for a book or two that they didn't consider, wouldn't my original question stand?

JimDabell · 2025-06-27T11:41:14 1751024474

You’re missing the point. An LLM is not going to memorise a whole book just because it’s seen a few copies. An LLM might be able to memorise the Bible in particular simply because Bible quotes are everywhere. There is a vast difference between being able to find a handful of copies online and having it constantly quoted everywhere that humans communicate. Bible quotes get literally everywhere. People put them on bumper stickers, tattoo themselves with them, put them in their email signatures, etc. Bible quotes are so omnipresent, they have become part of our language – a lot of idioms people use every day come from the Bible.

The Bible isn’t just a book, it’s been a massive part of human culture for millennia, to the point of it shaping language itself. LLMs might be able to memorise the Bible, but it’s not because they can memorise books, it’s because the Bible is far more than just a book.

modo_mario · 2025-06-27T13:13:00 1751029980

I went to check and it seems like it works fine for plenty of other public domain books: The Picture of Dorian Gray, Pride and Prejudice and what have you. I can ask for x number of paragraphs from a specific chapter and such.

I doubt every part of those books gets quoted everywhere on a numbered basis like the Bible might be. For books that only recently entered the public domain, it seems to be overly cautious because of retroactively applied filtering: it refuses if it suspects there might be a single country where copyright still applies.

JimDabell · 2025-06-27T15:28:53 1751038133

I can’t reproduce that. What model were you using and what prompt?

modo_mario · 2025-06-27T20:52:37 1751057557

I don't have access right now to the account I was using before, but using the ChatGPT free tier, which I believe is GPT-4o, I at first thought I got it right again.

I decided to ask it: Can you give me the first 4 paragraphs of chapter 3 of the book The picture of Dorian Grey?

It gave me something that looked all right to me. It read right, so I went to Project Gutenberg and glanced over it. The first lines of each paragraph seemed correct, but only the short paragraphs were correct throughout. The first paragraph, which was longer, suddenly had an entire section after the opening lines replaced with hallucination.

A follow-up asking it not to hallucinate had it search the web to fetch the correct text, which isn't a valid test in this context.

I suspect it starts hallucinating once the bit of text gets long, so I asked for specific sentences of chapters (the 1st, 2nd, 3rd and such), and to do so without web search.

It managed not to outright hallucinate lines then, but it did sometimes get the chapter I asked for wrong. I presume that with sufficiently careful prompting one can get the book out properly in sequential order with a lot of prompts, but it takes quite some effort to get there. That's where my curiosity ends for the night. My bed calls.
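For anyone wanting to repeat this check with more than a glance, here is a minimal sketch of comparing a model's output against a reference text word by word. It is not what was actually run in this thread. The function names `first_divergence` and `similarity` are hypothetical, and the two strings are placeholders rather than real book text; in practice you would paste in the Gutenberg paragraph and the model's answer.

```python
from difflib import SequenceMatcher


def first_divergence(reference: str, generated: str) -> int:
    """Count how many leading words of `generated` match `reference`.

    The result equals the reference's word count when the generated
    text reproduces the reference in full.
    """
    ref_words = reference.split()
    gen_words = generated.split()
    for i, (ref_word, gen_word) in enumerate(zip(ref_words, gen_words)):
        if ref_word != gen_word:
            return i
    # No mismatch within the overlap: one text is a prefix of the other.
    return min(len(ref_words), len(gen_words))


def similarity(reference: str, generated: str) -> float:
    """Word-level similarity in [0, 1]; 1.0 means identical word sequences."""
    return SequenceMatcher(None, reference.split(), generated.split()).ratio()


if __name__ == "__main__":
    # Placeholder strings, NOT real book text.
    reference = "the quick brown fox jumps over the lazy dog"
    generated = "the quick brown fox sleeps under the lazy dog"
    print("matching leading words:", first_divergence(reference, generated))
    print("similarity:", round(similarity(reference, generated), 3))
```

A correct opening followed by a low overall similarity would match the pattern described above, where the first lines of a long paragraph are right and a later section is made up.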

JimDabell · 2025-06-28T00:30:14 1751070614

> I presume that with sufficiently careful prompting one can get the book out properly

You failed to get it to reproduce one paragraph. Why on earth would you presume you can do it for the entire book‽

modo_mario · 2025-07-01T08:18:01 1751357881

Did you read what I said? I got plenty of correct paragraphs. They just had to be short. Breaking up the big paragraphs seems to help the issue.