Hacker News

I think this is an interesting case because scraping this is easy (just one page), whereas the Wikidata query requires dealing with modifiers, which is a bit more complex.


(It requires the birth dates, so it is more than one page)

The HTML structure may change over time: if the request is executed a few times over a long period, the scraper will likely require more maintenance than the SPARQL query.

For example, the same wikipedia page 3 years ago is slightly different: https://en.wikipedia.org/w/index.php?title=Turing_Award&oldi...
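To make the maintenance trade-off concrete, here is a minimal sketch of the scraping side using only the standard library. The HTML fragment is illustrative, not the real markup of the Turing Award page (which, as noted above, differs between revisions), so the parser is deliberately tied to the table shape it assumes:

```python
from html.parser import HTMLParser

# Illustrative fragment standing in for the real winners table; the actual
# markup on the Wikipedia page differs and changes over time.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Year</th><th>Recipient</th></tr>
  <tr><td>1966</td><td><a href="/wiki/Alan_Perlis">Alan Perlis</a></td></tr>
  <tr><td>1967</td><td><a href="/wiki/Maurice_Wilkes">Maurice Wilkes</a></td></tr>
</table>
"""

class WinnersParser(HTMLParser):
    """Collect (year, name) pairs from rows of a simple two-column table."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []   # text of <td> cells in the current row
        self.rows = []    # completed (year, name) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False
        elif tag == "tr":
            if len(self.cells) >= 2:
                self.rows.append((self.cells[0].strip(), self.cells[1].strip()))
            self.cells = []

    def handle_data(self, data):
        if self.in_td:
            self.cells[-1] += data

parser = WinnersParser()
parser.feed(SAMPLE_HTML)
print(parser.rows)  # [('1966', 'Alan Perlis'), ('1967', 'Maurice Wilkes')]
```

The fragility is visible in the code itself: any change to the column order or the table structure silently breaks the extraction, which is exactly the maintenance cost the parent comments are weighing.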


"The HTML structure may change over time..."

A very common argument in HN comments that discuss the merits of so-called web APIs.

Fair balance:

Web APIs can change (e.g., v1 -> v2), they can be discontinued, their terms of use can change, quotas can be enforced, etc.

A public web page does not suffer from those drawbacks. Changes that require me to rewrite scripts are generally infrequent. What happens more often is websites that provide good data/information sources simply go offline.

There is nothing wrong with web APIs per se, I welcome them (I use the same custom HTTP generator and TCP/TLS clients for both), but the way "APIs" are presented, as some sort of "special privilege", requiring "sign up", an email address and often more personal information, maybe even payment, is for the user, cf. developer, inferior to a public webpage, IMHO. As a user, not a developer, HTTP pipelining works better for me than many web APIs. I can get large quantities of data/information in one or a small number of TCP connections (I never have to use proxies, nor do I ever get banned); it requires no disclosure of personal details and is not subject to arbitrary limits.
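For readers unfamiliar with the technique being described: "pipelining" here means writing several HTTP/1.1 requests back-to-back on one connection before reading any response. A minimal sketch of generating such a batch (the host and paths are illustrative, and the actual send over TLS is shown only in comments to avoid a live network call):

```python
# Sketch of HTTP/1.1 pipelining: build several GET requests to be written
# in one go over a single TCP (or TLS) connection.

def pipelined_requests(host, paths):
    """Build a batch of HTTP/1.1 GETs for a single connection."""
    batch = b""
    for i, path in enumerate(paths):
        # keep the connection alive between requests, close after the last
        conn = "close" if i == len(paths) - 1 else "keep-alive"
        batch += (
            f"GET {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            f"Connection: {conn}\r\n"
            "\r\n"
        ).encode("ascii")
    return batch

batch = pipelined_requests("en.wikipedia.org",
                           ["/wiki/Turing_Award", "/wiki/Alan_Perlis"])

# The whole batch could then be written once over a TLS socket, e.g.:
#   import socket, ssl
#   ctx = ssl.create_default_context()
#   with socket.create_connection(("en.wikipedia.org", 443)) as sock:
#       with ctx.wrap_socket(sock, server_hostname="en.wikipedia.org") as tls:
#           tls.sendall(batch)

print(batch.count(b"GET "))  # 2
```

Note that many servers and intermediaries handle pipelining poorly, which is part of why browsers abandoned it; it tends to work best against servers one has tested it with.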

What's interesting about this Wikidata/Wikipedia case is that the term chosen was "user" not "developer". It appears we cannot assume that the only persons who will use this "API" are ones who intend to insert the retrieved data/information into some other webpage or "app" that probably contains advertising and/or tracking. It is for everyone, not just "developers".


The semantics of RDF identifiers drift at least as often as HTML format changes.

For example, at one point I was doing a similar thing against DBPedia (a sort-of predecessor to WikiData).

I was querying leaders of countries. But it turned out "leader" used to mean constitutional leadership roles, and at some point someone had decided this included the US Supreme Court Chief Justice (as the leader of the judicial branch).

So I had to go and rewrite all my queries to avoid that. But most major countries had similar semantic drift, and it turned out easier to parse Wikipedia itself.
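One way to limit this kind of drift is to query an explicit, narrow property rather than a loose "leader" notion. A hedged sketch against Wikidata's SPARQL endpoint (the property and class choices, P6 "head of government" and Q6256 "country", are assumptions about the current vocabulary; DBPedia's ontology differs):

```python
# Sketch only: pinning the query to an explicit property (P6, "head of
# government") instead of a generic "leader" predicate narrows what can
# silently change underneath the query.
QUERY = """
SELECT ?countryLabel ?headLabel WHERE {
  ?country wdt:P31 wd:Q6256 ;   # instance of: country
           wdt:P6  ?head .      # head of government
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

# Sending it (commented out to avoid a live network call):
# import requests
# r = requests.get("https://query.wikidata.org/sparql",
#                  params={"query": QUERY, "format": "json"})
# rows = r.json()["results"]["bindings"]

print("wdt:P6" in QUERY)  # True
```

Even so, the semantics of the item behind an identifier can still be edited, so such queries are narrower but not immune to the drift described above.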


DBPedia extracts data from Wikipedia (infoboxes, tables) and other sources (Wikidata). The circle is complete.

http://mappings.dbpedia.org/index.php/Main_Page

https://github.com/dbpedia/extraction-framework/tree/master/...



