
Am I misunderstanding what a parquet file is, or are all of the HN posts along with the embedding metadata a total of 55GB?


I imagine that's mostly embeddings actually. My database has all the posts and comments from Hacker News, and the table takes up 17.68 GB uncompressed and 5.67 GB compressed.
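A rough back-of-envelope makes that size plausible. The item count and average text size below are ballpark assumptions, not figures from this thread:

```python
# Back-of-envelope for raw HN text size.
# Assumptions (not from the thread): ~40M items total, ~450 bytes of
# text + metadata per item on average.
items = 40_000_000
avg_bytes = 450

total_gb = items * avg_bytes / 1e9
print(f"~{total_gb:.0f} GB uncompressed")  # ~18 GB, in line with 17.68 GB above
```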


Wow! That's a really great point of reference. I always knew text-based social media(ish) stuff should be "small", but I never had any idea whether that meant a site like HN could store its content in 1-2 TB, or if it was more like a few hundred gigs. Learning that it's really only tens of gigs is very surprising!


Scraped reddit text archives (~23B items according to their corporate info page) are ~4 TB of compressed json, which includes metadata and not just the actual comment text.
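For scale, those figures work out to a small per-item cost; a quick sketch of the arithmetic, using the numbers quoted above:

```python
# Per-item compressed size implied by ~23B items in ~4 TB of compressed JSON.
items = 23_000_000_000
total_bytes = 4e12  # 4 TB

print(f"~{total_bytes / items:.0f} bytes per item, compressed")  # ~174 bytes
```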


I suspect the text alone would be a lot smaller. Embeddings add a lot: roughly 4 KB or more per item, regardless of the size of the text.
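The "4 KB or more" follows directly from vector size: a float32 embedding costs 4 bytes per dimension. The dimensions below are common examples, not figures from the thread:

```python
# Storage per embedding, assuming float32 vectors (4 bytes per dimension).
# The dimension values are illustrative examples only.
for dims in (768, 1024, 1536, 3072):
    print(f"{dims:>4} dims -> {dims * 4 / 1024:.1f} KB per item")
# 1024 dims is exactly 4 KB; 1536 dims (e.g. text-embedding-3-small) is 6 KB.
```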


That’s crazy small. So is it fair to say that words are actually the best compression algorithm we have? You can explain complex ideas in just a few hundred words.

Yes, a picture is worth a thousand words, but imagine how much information is in those 17GB of text.


I don’t think I would really consider it compression if it’s not very reversible. Whatever people “uncompress” from my words isn’t necessarily what I was imagining or thinking about when I encoded them. I guess it’s more like a symbolic shorthand for meaning which relies on the second party to build their own internal model out of their own (shared public interface, but internal implementation is relatively unique…) symbols.


It is compression, but it's lossy. Just like its digital counterparts, mp3 and jpeg, in some cases the final message still contains all the information you need.


But what’s getting reproduced in your head when you read what I’ve written isn’t what’s in my head at all. You have your own entire context, associations, and language.


how much?


Thanks, that's really helpful for someone like me wanting to start my own database. BTW, which database did you choose for it?


It's on my personal ClickHouse server.
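For anyone curious how numbers like 17.68 GB / 5.67 GB are obtained, here's a minimal sketch that reads compressed vs. uncompressed sizes from ClickHouse's system.parts table via the clickhouse_connect client. The table name 'hackernews' is a placeholder assumption, not necessarily what the commenter uses:

```python
# Sketch: report compressed vs. uncompressed size of a ClickHouse table.
# 'hackernews' is a placeholder table name (assumption).
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")
result = client.query(
    """
    SELECT
        formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
        formatReadableSize(sum(data_compressed_bytes))   AS compressed
    FROM system.parts
    WHERE active AND table = 'hackernews'
    """
)
print(result.result_rows)  # e.g. [('17.68 GiB', '5.67 GiB')]
```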

You'd be surprised. I have a lot of text data, and Parquet files with brotli compression can achieve impressive file sizes.

Around 4 million web pages as markdown is like 1-2 GB.
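Brotli is one of the codecs pyarrow's Parquet writer supports out of the box; a minimal sketch of writing text that way (column and file names are illustrative, not from the thread):

```python
# Sketch: write text (e.g. scraped pages as markdown) to Parquet with brotli.
# Column and file names below are illustrative assumptions.
import pyarrow as pa
import pyarrow.parquet as pq

pages = ["# Page 1\n\nSome markdown...", "# Page 2\n\nMore markdown..."]
table = pa.table({
    "url": ["https://example.com/1", "https://example.com/2"],
    "markdown": pages,
})

pq.write_table(table, "pages.parquet", compression="brotli")
```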


based on the table they show, that would be my inclination

I wanted to do this for my own upvotes so I can see the kinds of things I like, or find them again more easily when relevant.


Compressed, pretty believable.



