
The dumbest part of this is that all Wikimedia projects already export a dump for bulk downloading: https://dumps.wikimedia.org/

So it's not like you need to crawl the sites to get content for training your models...



I tried using one of these dumps a year ago (wanted to play around and see what visualizations I could come up with based on text and the links between pages) and it was an incredibly unintuitive process.

It's not clear which files you need, and the site itself is (or at least was when I tried) "shipped" as gigantic SQL scripts that rebuild the database. They had so many lines that the SQL servers I tried gave up reading them, and I needed another script just to split them into chunks.

Then when you finally do have the database, you still don't have a local copy of Wikipedia. You're missing several more files; for example, category information is in a separate dump. You also need the wiki software itself to use the dump and host the site. After a weekend of fucking around with SQL, this is the point where I gave up and just curled the 200 or so pages I was interested in.

I'm pretty sure they want you to "just" download the database dump and go to town, but it's such a pain in the ass that I can see why someone else would just crawl it.


> I tried using one of these dumps a year ago (wanted to play around and see what visualizations I could come up with based on text and the links between pages) and it was an incredibly unintuitive process.

More recently they started putting the data up on Kaggle in a format that is supposed to be easier to ingest.

https://enterprise.wikimedia.com/blog/kaggle-dataset/


"More recently" is very recently; there hasn't been enough time yet for data collectors to evaluate changing their processes.


Good timing to learn about this, given that it's Friday. Thanks! I'll check it out


I think there are engineers working for crawler companies who are paid well enough to figure out how to do this without kneecapping the most well-known noncommercial projects still surviving on the capitalized internet.


Have you tried any of the ZIM file exports?

https://dumps.wikimedia.org/kiwix/zim/wikipedia/


Yeah, it's a bit confusing at first to navigate. Luckily, they offer XML dumps that aren't too bad to work with:

1. Go to https://dumps.wikimedia.org/enwiki/latest/ (or a date of your choice in /enwiki)

2. Download https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page... and https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page.... The first file is a bz2-multistream-compressed dump of an XML file containing all of English Wikipedia's text, while the second file is an index that makes it easier to find specific articles.

3. You can then do one of the following:

  a. unpack the first file
  b. use the second file to locate specific articles within the first file; it maps page title -> file offset for the relevant bz2 stream
  c. use a streaming decoder to process the entire Wiki without ever decompressing it wholly
4. Once you have the XML, getting at the actual text isn't too difficult; you should use a streaming XML decoder to avoid as much allocation as possible when processing this much data.

The XML contains pages like this:

    <page>
      <title>AccessibleComputing</title>
      <ns>0</ns>
      <id>10</id>
      <redirect title="Computer accessibility" />
      <revision>
        <id>1219062925</id>
        <parentid>1219062840</parentid>
        <timestamp>2024-04-15T14:38:04Z</timestamp>
        <contributor>
          <username>Asparagusus</username>
          <id>43603280</id>
        </contributor>
        <comment>Restored revision 1002250816 by [[Special:Contributions/Elli|Elli]] ([[User talk:Elli|talk]]): Unexplained redirect breaking</comment>
        <origin>1219062925</origin>
        <model>wikitext</model>
        <format>text/x-wiki</format>
        <text bytes="111" sha1="kmysdltgexdwkv2xsml3j44jb56dxvn" xml:space="preserve">#REDIRECT [[Computer accessibility]]

    {{rcat shell|
    {{R from move}}
    {{R from CamelCase}}
    {{R unprintworthy}}
    }}</text>
        <sha1>kmysdltgexdwkv2xsml3j44jb56dxvn</sha1>
      </revision>
    </page>
so all you need to do is get at the `text`.
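Here's a minimal sketch in Python of option (b) from step 3: use the index to find the bz2 stream holding a given title, decompress just that stream, and pull the `text` out of the XML. It assumes you grabbed the multistream dump and its companion index under their usual enwiki-latest-pages-articles-multistream* names (adjust the paths to whatever you actually downloaded):

    import bz2
    import xml.etree.ElementTree as ET

    # Assumed (standard) filenames for the multistream dump and its index.
    DUMP = "enwiki-latest-pages-articles-multistream.xml.bz2"
    INDEX = "enwiki-latest-pages-articles-multistream-index.txt.bz2"

    def find_offset(title):
        # Each index line is "offset:page_id:title"; the offset is where a bz2
        # stream (holding up to ~100 pages) starts inside the dump file.
        with bz2.open(INDEX, "rt", encoding="utf-8") as index:
            for line in index:
                offset, _, page_title = line.rstrip("\n").split(":", 2)
                if page_title == title:
                    return int(offset)
        return None

    def read_stream(offset):
        # Decompress only the single bz2 stream that starts at `offset`.
        with open(DUMP, "rb") as dump:
            dump.seek(offset)
            decompressor = bz2.BZ2Decompressor()
            chunks = []
            while not decompressor.eof:
                chunk = dump.read(256 * 1024)
                if not chunk:
                    break
                chunks.append(decompressor.decompress(chunk))
            return b"".join(chunks)

    def page_text(stream, title):
        # The stream is a bare run of <page> elements, so wrap it in a root.
        root = ET.fromstring(b"<pages>" + stream + b"</pages>")
        for page in root.iter("page"):
            if page.findtext("title") == title:
                return page.findtext("./revision/text")
        return None

    title = "AccessibleComputing"
    offset = find_offset(title)
    if offset is not None:
        print(page_text(read_stream(offset), title))

For option (c), note that bz2.open on the dump itself reads straight through all the concatenated streams, so you can hand it to a streaming parser like ET.iterparse; just mind the xmlns on the root element when matching tag names.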


The bigger problem is that this is wikitext markup. It would be helpful if they also provided HTML and/or plain text.

I know there are now a couple of pretty good wikitext parsers, but for years this was a bigger problem: the only "official" one was the huge PHP app itself.
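In Python, for example, mwparserfromhell is one of those parsers. A minimal sketch of stripping wikitext down to plain text (the snippet of markup here is made up for illustration):

    import mwparserfromhell

    # Made-up wikitext with bold markup, a piped link, and a template.
    wikitext = "'''Hello''' [[world|World]] of wikis.{{citation needed}}"

    parsed = mwparserfromhell.parse(wikitext)
    print(parsed.strip_code())        # markup stripped: roughly "Hello World of wikis."
    for template in parsed.filter_templates():
        print(template.name)          # "citation needed"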


Oh, it's godawful; the format is a crime against all things structured. I use `parse-wiki-text-2` [0], which is a fork of `parse-wiki-text`, a Rust library by an author who has now disappeared into the wind. (Every day that I parse Wikipedia, I thank him for his contributions, wherever he may be.)

I wrote another Rust library [1] that wraps around `parse-wiki-text-2` and offers a simplified AST that takes care of matching tags for you. It's designed to be bound to WASM [2], which is how I'm pretty reliably parsing Wikitext for my web application. (The existing JS libraries aren't fantastic, if I'm being honest.)

[0]: https://github.com/soerenmeier/parse-wiki-text-2

[1]: https://github.com/philpax/wikitext_simplified

[2]: https://github.com/genresinspace/genresinspace.github.io/blo...


What they need to do is have 'major edits' push out an updated static rendered file, the way old-school publishing processes did. Then either host those somewhere as-is, or also in a compressed format (e.g. a compressed weekly snapshot retained for a year?).

Also make a CNAME from bots.wikipedia.org to that site.


This is probably about on-demand search, not about gathering training data.

Crawling is more general, plus you get to consume the content in its reconstituted form instead of deriving it yourself.

Hooking up a data dump for special-cased websites is much more complicated than letting LLM bots do a generalized on-demand web search.

Just think of how that logic would work: the LLM wants to do a web search to answer your question, and some Wikimedia site is the top candidate. Instead of just going to the site, it has to use a special code path that knows how to map https://{site}/{path} to wherever {path} lives in {site}'s data dump.
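To make it concrete, here's a rough sketch of what that routing could look like (everything below is hypothetical, and the dump-lookup stub is exactly where all the hard work from the comments above would have to live):

    from urllib.parse import urlparse
    from urllib.request import urlopen

    # Hypothetical set of hosts that should be answered from local dumps.
    WIKIMEDIA_HOSTS = {"en.wikipedia.org", "commons.wikimedia.org", "en.wiktionary.org"}

    def fetch_for_llm(url: str) -> str:
        parts = urlparse(url)
        if parts.hostname in WIKIMEDIA_HOSTS:
            # Special path: turn the live URL into a page title, then find that
            # title in a locally maintained dump and render the wikitext.
            title = parts.path.removeprefix("/wiki/")
            return lookup_in_local_dump(parts.hostname, title)
        # General path: every other site is just an HTTP request.
        with urlopen(url) as response:
            return response.read().decode("utf-8", errors="replace")

    def lookup_in_local_dump(host: str, title: str) -> str:
        # Placeholder: the dump downloading, indexing, freshness tracking, and
        # wikitext rendering discussed above would all have to live here.
        raise NotImplementedError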


Yeah. Much easier to tragedy-of-the-commons the hell out of what is arguably one of the only consistently great achievements on the web...


> This increase is not coming from human readers, but largely from automated programs that scrape the Wikimedia Commons image catalog of openly licensed images to feed images to AI models.

Sounds like the problem is not the crawling itself but downloading multimedia files.

The article also explains that these requests are much more likely to request resources that aren't cached, so they generate more expensive traffic.


I need to work with the dump to extract geographic information. Most mirrors are not functioning, take weeks to catch up, block you, or only mirror English Wikipedia. Every other month I find a workaround. It's not easy to work with the full dumps, but I guess (and hope) it's still easier than crawling the Wikipedia website itself.


Why use a screwdriver when you have a sledgehammer and everything is a nail?


nAIl™ - the network AI library. For sledgehammering all your screwdriver needs.


I thought that as well, but maybe this is more for search-engine indexing? In which case you'd want more real-time updates?


I don't see an obvious option to download all images from Wikimedia Commons. As the post clearly indicates, the text is not the issue here, it's the images.


It seems like the Wikimedia Foundation has always been protective of image downloads. Many a drunken midnight scripter or newly minted undergrad CEO discovers that they can download cool images fairly quickly. AFAIK there has always been some kind of text corpus available in bulk, because that is part of Wikipedia's mission. But the image gallery is big on disk and big on bandwidth compared to text, and a low-hanging target for the uninformed, the greedy, etc.


The Wikimedia nonfree image limitations have been a pain in my ass for years.

For those unfamiliar: images that are marked non-free must be smaller than 1 megapixel (1155 × 866). In practice, 1024 × 768 is around the maximum size.


This is what torrents are built for.

A torrent of all images updated once a year would probably do quite well.


Provided you have enough seed nodes, which isn't free.


I have excess bandwidth in various places; I'd be happy to seed.



