
This is a great post, which also happens to serve as a good illustration of the "curse of knowledge" and the typical blind-spots of enthusiasts. Consider the timeline of events:

• The blog post on scraping Wikipedia appears (https://billpg.com/data-mining-wikipedia/ ; HN discussion 4 days ago: https://news.ycombinator.com/item?id=28234122 , which mentions Wikidata as an alternative, etc.).

• The author of this post, a Wikidata person, finds this an "extremely surprising discussion", and posts a Twitter thread ( https://web.archive.org/web/20210820105621/https://twitter.c... ) ending with

> I don't want to argue or disagree, I am just completely surprised by that statement. Are the docs so bad? Is the API design of Wikidata so weird or undiscoverable? There are plenty of libraries for getting Wikidata data, are they all so hard to use? I am really curious.

This curiosity is a great attitude! (But…)

• After seeing the HN discussion and the responses on Twitter/Facebook, he writes the post linked here. In this post, he does mention what he learned from potential users:

> And there were some very interesting stories about the pain of using Wikidata, and I very much expect us to learn from them and hopefully make things easier. The number of API queries one has to make in order to get data […], the learning curve about SPARQL and RDF (although, you can ignore both, unless you want to use them explicitly - you can just use JSON and the Wikidata API), the opaqueness of the identifiers (wdt:P25 wd:Q9682 instead of “mother” and “Queen Elizabeth II”) were just a few. The documentation seems hard to find, there seem to be a lack of libraries and APIs that are easy to use. And yet, comments like "if you've actually tried getting data from wikidata/wikipedia you very quickly learn the HTML is much easier to parse than the results wikidata gives you" surprised me a lot. […] I am not here to fight. I am here to listen and to learn, in order to help figuring out what needs to be made better.

Again, very commendable! Almost an opening to really understanding the perspective of casual potential users. But then: the entire rest of the post does not really address "the other side", and instead completely focuses on the kinds of things Wikidata enthusiasts care about: comparing Wikipedia and Wikidata quality in this example, etc.

I mean, sure, this query he presents is short:

    select * { wd:Q9682 (wdt:P25|wdt:P22)* ?p . ?p wdt:P25|wdt:P22 ?q } 
but when he says:

> I would claim that I invested far less work than Bill in creating my graph data. No data cleansing, no scraping, no crawling, no entity reconciliation, no manual checking.

he's ignoring the work he invested in learning that query language (and where to query it), for instance. And this post would have been a perfect opportunity to teach readers how to go from the question "all ancestors of Queen Elizabeth" to that query (and in trying to teach it, he might have discovered more precisely what is hard about it), but he just squanders the opportunity (just as when he says "plenty of libraries" without inviting exploration by linking to the easiest one): this is a typical thing enthusiasts do, which is unfortunate IMO.

When scraping HTML from Wikipedia, one is using general-purpose well-known tools. You'll get slightly better at whatever general-purpose programming language and libraries you were using, learn something that may be useful the next time you need to scrape something else. And most importantly, you know that you'll finish, you can see a path to success. When exploring something "alternative" like Wikidata, you aren't sure if it will work, so the alternative path needs to work harder to convince potential users of success.

---

Personal story: I actually know about the existence of Wikidata. Yet the one time I tried to use it, I couldn't figure out how.

This is what I was trying to do: plot a graph of the average age of Turing Award winners by year. (Reproduce the first figure from here: http://hagiograffiti.blogspot.com/2009/01/when-will-singular... just for fun.) One would think this is a perfect use case for Wikidata: presumably it has a way of going from Turing Award → list of winners → each winner's date of birth.

But I was stymied at the very first step. Despite knowing of the existence of Wikidata, and being able to go from the Wikipedia page that lists all recipients (current version: https://en.wikipedia.org/w/index.php?title=Turing_Award&oldi... ) to the Wikidata item for "Turing Award" (look for "Wikidata item" in the sidebar on the left) https://www.wikidata.org/wiki/Q185667 , I could not quickly find a way of getting a list of recipients from there. Tantalizingly, the data does seem to exist: e.g. if I go to one of the recipients like Leslie Valiant https://www.wikidata.org/wiki/Q93154 , I see a "statement" award received → Turing Award with "property" point in time → 2010. Even after coming so close, and being interested in using Wikidata, it was not easy enough for me to get to the next step (which I still imagine is possible, maybe with tens of minutes of effort), until I just decided "screw this, I'll just scrape the Wikipedia page" (I scraped the wiki source rather than the HTML). And if one is going to have to scrape anyway, then one might as well do the rest too (dates of birth) with scraping.



Thank you. I am the author of the post, and appreciate your comments, and I agree with them.

I have to say that it indeed wasn't my intention to show how to get to the query - that is a form of tutorial that would be great to write too, agreed, and maybe I should have. What I wanted to write was just a comparison of the results of the two approaches.

Having said that, yes, again, I agree: a tutorial describing how to get to that data would be great too, and maybe I should write it, or maybe someone else should. I agree that it is not trivial at all to get to the query (and that is a particularly tricky query, certainly not what I would begin with).

Thank you again for your comment, it made me think and mull over the whole thing more. I will talk tomorrow with the lead of the Wikidata team, and I will bring these points (and many others that were mentioned in the last few days) with me. It will take a while, but I hope we can improve the situation.


There's a trick companies like Facebook use to try to protect users from copy-pasting malicious scripts into devtools: when they detect it opening (probably via a keyboard event), they print a big scary warning using console.log/error [1]

Assuming the first thing most scrapers do is open the site in devtools, this would be a great place to print some text with a page-specific Wikidata query that pulls in the exact same information as the current page, along with a link to a really good hacker-style tutorial plus an appendix of how-to guides. Even better would be an option to turn on some sort of dev mode with mouseover tooltips that show queries for every bit of info on the page. Anything that breaks the feedback loop between the code and the browser will decrease the probability that the scraper will use Wikidata. Think of it as a weird inverse user-retention problem.

[1] https://imgur.com/a/0Xn1qIb


Thank you! And hope there was nothing in my comments that came off the wrong way. A few more comments, since you seem so receptive. :-)

• I do understand why you wouldn't have wanted to bother writing a tutorial (it's too much work, there are enough tutorials already, etc.). But still, it might have helped to link to one or two, just to catch the curious crowd.

• Specifically: later yesterday I looked around, and I found this tutorial the most inviting (big font, short pages, enough pictures and examples, and interactive querying right on the page): https://wdqs-tutorial.toolforge.org/ — but I couldn't find it linked from Wikidata or from the Wikipedia page on Wikidata; I actually found it in the "See also" section of the Wikipedia page on SPARQL. (After reading that one, the tutorial at https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial also looks OK to me, but that's the "curse of knowledge" already: I know I wasn't enthused the first time I saw it…)

• In fact, after taking a few (tens of?) minutes to skim through these tutorials, I found that the query here isn't particularly tricky after all! So it may not be that the query language is "hard" or "difficult"; the challenge is just to get people over that initial bump of unfamiliarity.

• The Wikidata query page (e.g. https://w.wiki/3vrd) already has a big blue button on the left edge, but somehow, the first time I loaded the page, it still wasn't prominent enough for me to realize I should click it. It might be nice if the button were even more prominent, or if loading the page (for shared links) automatically displayed the query results (possibly cached). (Or the big white area where the results appear could say "click to see results here" or something.)

• It may be worth considering making labelled output the default and raw ids something to explicitly ask for, at least in the beginner's version of the query engine.

• In your blog post, even if not writing a tutorial, IMO it would have helped to just explain the query in a line or two, i.e. translate each of the statements into English. (This is less work than teaching someone to arrive at the query themselves; I've taken a stab at it myself after this list.)

• Even if neither writing a tutorial nor explaining the query, IMO it would have helped to just mention something like "Yes, this query is in an unfamiliar language, but it takes only a few minutes to learn: see <here> and <here>" — basically, just acknowledge that there may be some barrier here (however small) for people who don't already know.

• Such things are exactly our blind spots when writing, so it's not easy. The only way I know is to show the writing to some people in the target audience and get feedback. Fortunately, you don't have to ask too many people: these researchers in usability testing say "You Only Need to Test with 5 Users": https://www.nngroup.com/articles/why-you-only-need-to-test-w...
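For what it's worth, here is my own attempt at that plain-English translation, made after skimming the tutorials above, so take it with a grain of salt (wd:Q9682 is Queen Elizabeth II and wdt:P25 is <mother>, as your post says; wdt:P22 is, I gather, <father>):

    select * {
      # Start from Queen Elizabeth II (wd:Q9682) and follow <mother> or <father>
      # zero or more times ("*" makes this a property path), binding every person
      # reached along the way to ?p - i.e. herself and all of her ancestors.
      wd:Q9682 (wdt:P25|wdt:P22)* ?p .
      # For each such person ?p, also fetch their mother or father ?q, which
      # gives the parent-child edges of the whole ancestor graph.
      ?p wdt:P25|wdt:P22 ?q
    }

(If I've misread any of it, corrections welcome.)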

Thanks for your post: ultimately, as a result of reading it, commenting on it, and being shown a solution to my problem, I'm now more likely, and better equipped, to try Wikidata in the future.


Thank you for the follow-up. I updated my post a little, mostly with a link to this discussion, as it contains an explanation of the query, and it now also links to a tutorial.

I agree with some of your suggestions on making the system easier to use. It's open source, and I hope someone will be motivated enough to give it a try - the development team can only do so many things, unfortunately.

Thanks again for the constructive comments!


About the Turing Award: after some trial and error, I think this is the query: https://w.wiki/3wmY

Disclaimer: I follow https://www.youtube.com/channel/UCp2i8QpLDnWge8wZGKizVVw / https://www.twitch.tv/belett (mostly in French, sometimes in English).

Without these courses, I wouldn't have been able to write this query.


Thank you, that was educational! At the time I'd have been happy with just getting the data out, so to encourage others, here's a simpler version of the query: https://w.wiki/3x8t

Short version:

    SELECT ?awardYearLabel ?winnerLabel ?dateOfBirthLabel WHERE {
      SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
      ?statement ps:P166 wd:Q185667.
      ?winner p:P166 ?statement.
      ?statement pq:P585 ?awardYear.
      ?winner wdt:P569 ?dateOfBirth.
    }
    ORDER BY (?awardYearLabel)
Annotated version with comments:

    SELECT ?awardYearLabel ?winnerLabel ?dateOfBirthLabel WHERE {
      # Boilerplate: Provides, for every "?foo" variable, a corresponding "?fooLabel"
      SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
      
      # "Statements" of the form "<subject> <predicate> <object>."
      # also known as "<item> <property> <value>."
      # Variable names start with "?" and  we can think of them as placeholders.
      
      # For example, a straightforward query that lists winners
      # ("P166" means <award received> and "Q185667" means <Turing Award>):
      # ?winner wdt:P166 wd:Q185667.   # <?winner> <received award> <Turing Award>
      
      # "Qualifiers" on statements: See 
      #    https://wdqs-tutorial.toolforge.org/index.php/simple-queries/qualifiers/statements-with-qualifiers/
      #    or https://en.wikibooks.org/wiki/SPARQL/WIKIDATA_Qualifiers,_References_and_Ranks
      # A **statement** of the form "<somebody> <received award> <Turing Award>"
      ?statement ps:P166 wd:Q185667.
      # In that statement, the <somebody> we shall call "?winner".
      ?winner p:P166 ?statement.
      # That statement has <point in time> qualifier of "?awardYear".
      # ("P585" means <point in time>)
      ?statement pq:P585 ?awardYear.
    
      # The ?winner has a <date of birth> of ?dateOfBirth. 
      # ("P569" means <date of birth>)
      ?winner wdt:P569 ?dateOfBirth.
    }
    ORDER BY ?awardYearLabel


?awardYear and ?dateOfBirth are literals, so you don't need to take *Label of them (that's only useful for Qnnn nodes).

Below I use a blank node (since you don't need the URL of ?statement) to simplify the query, and calculate the age as a difference of the two years:

    SELECT ?awardYear ?age ?winnerLabel WHERE {
      SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
      ?winner p:P166 [ # award won
          ps:P166 wd:Q185667; # Turing award
          pq:P585 ?awardDate]; # point in time
        wdt:P569 ?birthDate.
      bind(year(?awardDate) as ?awardYear)
      bind(?awardYear-year(?birthDate) as ?age)
    }
    ORDER BY ?age


And today I learned Donald Knuth was the youngest Turing award winner at the age of 36. I'm going to have to go learn SPARQL.


I think this is an interesting case because scraping this is easy (just one page), whereas the Wikidata query requires dealing with qualifiers, which is a bit more complex.


(It requires the birth dates, so it is more than one page)

The HTML structure may change over time: if the query is run a few times over a long period, the scraper may (or will) require more maintenance than the SPARQL query.

For example, the same Wikipedia page from 3 years ago is slightly different: https://en.wikipedia.org/w/index.php?title=Turing_Award&oldi...


"The HTML structure may change over time..."

A very common argument in HN comments that discuss the merits of so-called web APIs.

Fair balance:

Web APIs can change (e.g., v1 -> v2), they can be discontinued, their terms of use can change, quotas can be enforced, etc.

A public web page does not suffer from those drawbacks. Changes that require me to rewrite scripts are generally infrequent. What happens more often is that websites providing good data/information sources simply go offline.

There is nothing wrong with web APIs per se, and I welcome them (I use the same custom HTTP generator and TCP/TLS clients for both), but the way "APIs" are presented, as some sort of "special privilege" requiring "sign up", an email address, and often more personal information, maybe even payment, is, for the user (as opposed to the developer), inferior to a public webpage, IMHO. As a user, not a developer, I find HTTP pipelining works better for me than many web APIs. I can get large quantities of data/information in one or a small number of TCP connections (I never have to use proxies, nor do I ever get banned); it requires no disclosure of personal details and is not subject to arbitrary limits.

What's interesting about this Wikidata/Wikipedia case is that the term chosen was "user" not "developer". It appears we cannot assume that the only persons who will use this "API" are ones who intend to insert the retrieved data/information into some other webpage or "app" that probably contains advertising and/or tracking. It is for everyone, not just "developers".


The semantics of RDF identifiers drift at least as often as HTML format changes.

For example, at one point I was doing a similar thing against DBPedia (a sort-of predecessor to WikiData).

I was doing leaders of countries. But it turns out "leader" used to mean constitutional leadership roles, and at some point someone had decided this included US Supreme Court Chief Justice (as the leader of the judicial branch).

So I had to go and rewrite all my queries to avoid that. But most major countries had similar semantic drift, and it turned out easier to parse Wikipedia itself.


DBPedia extracts data from Wikipedia (infoboxes, tables) and other sources (Wikidata). The circle is complete.

http://mappings.dbpedia.org/index.php/Main_Page

https://github.com/dbpedia/extraction-framework/tree/master/...


I also had a horrible experience using the recommended SPARQL interface to query Wikidata. The queries were inscrutable, the documentation was poor, and even after I wrote the correct queries, they timed out after scanning a tiny fraction of the data I needed, making the query engine useless to me.

However, I had great success querying Wikidata via the "plain old" MediaWiki Query API: https://www.mediawiki.org/wiki/API:Query. That API was a joy to work with.

Wikidata (as a backing store for Wikipedia and a knowledge graph engine) is a very powerful concept. It's a key platform technology for Wikipedia and hopefully they'll prioritize its usability going forward.


The WD SPARQL editor has auto-complete (e.g. type "wdt:award" and press Ctrl-Space) and a readout on hover.

To make the query more readable, use some comments (see my query above).

Yes, WD SPARQL has a firm timeout of 1 minute and may even cut off the response halfway. I think it's falling victim to its own popularity (the API is IMHO much less popular).

There are optimization techniques that one can use, but they take some experience and patience. One good way is to use a federated SPARQL INSERT into a local repo (assuming you want to selectively copy and reshape RDF data); e.g. our GraphDB repo has batching of federated queries that avoids the timeout.
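To make that concrete, here is a minimal sketch of such a federated INSERT, copying Turing Award winners and their birth dates (the local graph name is made up, and it assumes a store that accepts SERVICE inside updates, as GraphDB does):

    PREFIX wd: <http://www.wikidata.org/entity/>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>

    INSERT {
      # Copy the fetched triples into a local named graph (name is illustrative).
      GRAPH <http://example.org/turing-award> {
        ?winner wdt:P166 wd:Q185667 ;
                wdt:P569 ?birthDate .
      }
    }
    WHERE {
      # Federated call to the public Wikidata endpoint.
      SERVICE <https://query.wikidata.org/sparql> {
        ?winner wdt:P166 wd:Q185667 ;   # <award received> <Turing Award>
                wdt:P569 ?birthDate .   # <date of birth>
      }
    }

If you run this in batches (e.g. restricted per award or per year), each federated call stays well under the remote endpoint's timeout, and you can then query and reshape the local copy at leisure.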


> When scraping HTML from Wikipedia, one is using general-purpose well-known tools. You'll get slightly better at whatever general-purpose programming language and libraries you were using, learn something that may be useful the next time you need to scrape something else. And most importantly, you know that you'll finish, you can see a path to success. When exploring something "alternative" like Wikidata, you aren't sure if it will work, so the alternative path needs to work harder to convince potential users of success.

I'm not sure it's that clear. Scrapping is pretty generic, but SPARQL is hardly a proprietary query language - other things use it. If what you're into is obtaining data, sparql might more generically apply than scrapping would. It really depends on what you are doing in the future. At the very least if you do scrapping a lot, you're probably going to reinvent the parsing wheel a lot. To each their own.

> he's ignoring the work he invested in learning that query language (and where to query it), for instance

And Bill is ignoring the work of learning how to program. None of us start from nothing, and it's not like any of this is trivial to learn if you've never touched a computer before.

And to be clear I'm not objecting - there is nothing wrong with using the skills you currently have to solve the problem you currently have. Whatever gets you the solution. If you're querying wikidata (or similar things) every day, learning sparql is probably a good investment. If you're interested in sparql, then by all means learn it. But if those don't apply, then scrapping makes sense if you already know how to do that.


> [Scraping] is pretty generic, but SPARQL is hardly a proprietary query language - other things use it. If what you're into is obtaining data, sparql might more generically apply than [scraping] would. It really depends on what you are doing in the future.

Yes, exactly my point! Even when trying to consider the perspective of people different from us, we can end up writing for (and from the perspective of) people who are "into" the same things as us. Casual users like the one in the original scraping post are not really "into" obtaining data as such, which can be a blind spot for enthusiasts who are. The challenge and opportunity in such cases is really communication with those outside the field, rather than competition within it.


"Scrapping" is like nails scrapping on a chalkboard for me.


> he's ignoring the work he invested in learning that query language (and where to query it), for instance

>And Bill is ignoring the work of learning how to program.

I suppose if you didn't know how to program you wouldn't learn Sparql. So the investment in learning how to program has already been made.


There are plenty of Wikidata users who have learned some SPARQL without being programmers.


Why not? People sometimes learn SQL without learning to program, why not sparql?


Because SQL looks simpler and is simpler: plain English words that are easily recognized, with basic queries (SELECT ... FROM) that can be taught in less than an hour and then built on. Now let's look at SPARQL: everything screams technicality. Curly braces (I'm not sure a non-programmer even knows how to type them). Then the variable names prefixed with "?". Then the need to understand what a URI is and how and why prefixes are declared, not to mention the sheer fact of using URIs instead of simple names like the ones found for database columns. But even that isn't enough knowledge to start writing the simplest query: one also needs to be taught about RDF triples.

So no, not all query languages are born equal. SPARQL is overly technical and requires a lot of knowledge to do even the simplest things.
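To make that concrete, here is roughly the minimum a newcomer has to absorb just to list Turing Award winners (reusing the IDs that appear elsewhere in this thread):

    # Curly braces, ?-variables, prefixed URIs and an RDF triple, all at once:
    SELECT ?winner ?winnerLabel WHERE {
      ?winner wdt:P166 wd:Q185667 .    # <?winner> <award received> <Turing Award>
      # Plus this boilerplate, just to turn the opaque Q-ids into readable labels:
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }

Every one of those pieces needs to be explained before the query even makes sense to read, let alone to write.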


I like your reason more than mine.


There are non-technical people who just learn SPARQL and the principles of Wikidata, and extract data. For them, SQL, REST, and JSON are much too technical.


Well one reason why someone might learn SQL without learning how to program is that you can get jobs for it.

Ah, but the response might go, lots of people learned SQL when there weren't a lot of jobs for people who knew SQL.

Yes, my response would be, but that was a long time ago; the incentives for people to learn technologies have changed, and I do not think a significant number of people will learn SQL without learning to program henceforth, at least not in numbers significant enough that anyone will say "Well, look at that trend!"

Here there can be several responses, so I won't go through all the branches, but in the end I don't think there is going to be much interest in learning SPARQL among people who are not programmers or at least in programming-adjacent professions, and from what I see there hasn't been that much interest even from people who are programmers.


Absolutely spot-on. It makes me think of my own experience.

I've worked for a few niche search engines. Some sites have APIs available so that you don't have to scrape their data. But oftentimes, since we were already used to scraping sites, we wouldn't even notice that an API was available. In some cases, an API _was_ available, but it was more restrictive or complicated to use than just scraping the page. That's not to say that we never used them, because we certainly did. Just that we were often not aware they were an option, since they were not very common in our cases.


Not to mention that APIs come with registration, credentials, rate limiting, throttling, etc.


Wikidata's API doesn't require registration, credentials, etc.


Mine is one of the comments quoted in that chain of tweets, heh. Here's my specific example. This was years ago, so I don't remember much anymore and things may have changed. But I did just now give it a basic attempt, and it still seems Wikipedia is easier than Wikidata. (I did put more effort into using Wikidata when I tried years ago, but all I really remember is that it wasn't as fruitful as just fetching Wikipedia.)

My goal: a list of every airport on Wikipedia with an IATA code and the city it is attached to. There is a perfect Wikipedia page to start this off from, while as far as I can tell, Wikidata does not have any of the data from the table on that page?

https://en.wikipedia.org/wiki/List_of_airports_by_IATA_airpo...

https://www.wikidata.org/w/api.php?action=wbgetentities&form...


I like that geospatial join you have there. Really it should be two query tabs and an interactive map.

I have often wanted a geofilter around my Wikipedia search, especially when I am on vacation. Basically: give me every Wikipedia page that ever talked about anything within 50 km of here. And then one could filter down, or have a personal recommendation system boost stuff you like.
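From skimming the query service's examples, the "within 50 km" part at least seems expressible already; a rough sketch of what I mean (hard-coding Paris as "here", and P625 is, I believe, <coordinate location>):

    SELECT ?place ?placeLabel ?coord WHERE {
      # Radius search around a fixed point ("Point(longitude latitude)", radius in km).
      SERVICE wikibase:around {
        ?place wdt:P625 ?coord .
        bd:serviceParam wikibase:center "Point(2.3522 48.8566)"^^geo:wktLiteral .
        bd:serviceParam wikibase:radius "50" .
      }
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }

The filtering-down and recommendation-boosting parts are, of course, another matter.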


I hope this helps with getting started: https://w.wiki/3x3n

And here's a visualization on a map, using geocoordinates: https://w.wiki/3x3g


Thanks, the queries are very powerful, but it still seems like this data is not as usable as the data in the HTML table. Any airports that don't have wikipedia links for the airport or city don't get picked up, and there are disagreeing duplicates in the wikidata that the HTML does not have.

For example, (AKG) Anguganak Airport and the city Anguganak don't have an article, so they don't appear in the Wikidata results. ALZ doesn't appear in the data because Lazy Bay does not have an article page. There are some duplicate entries with different cities or airport names, like AAL, AAU, ABC; ABQ has 4 different entries. The data is also out of date in some instances: "Opa-locka Airport" was renamed to "Miami-Opa Locka Executive Airport" in 2014, for example. In the HTML table all these issues are solved.


Thanks for the answer!

I got the query wrong (reason: https://twitter.com/vrandezo/status/1430206988177219593 )

Here's the corrected query: https://w.wiki/3x8u

This includes a few more thousand results.

AKG does show up (but has indeed no connection to Anguganak), ALZ shows up (again, without a connection to a city). Article pages are not a requirement for the data to be in Wikidata.

I see your point. The duplicate entries can often be explained (e.g. ABQ is indeed the IATA code for both the Albuquerque Sunport and Kirtland Air Force Base, which are adjacent to each other), but that's already a lot of detail.

If a single table provides the form of clean data one is looking for, that's great and should be used (and that's a slightly different situation from the original question that triggered this, where we had to go through many different pages and fuse data from thousands of them together). Different tasks benefit from different inputs!


> no entity reconciliation

On the other hand there are still duplicates. I queried Wikidata once and every date result was duplicated because they existed in a slightly different format (7-7-2000 vs 07-07-2000; both were declared as xsd:date). Very "semantic" and powerful data model indeed. In fact the technology should be renamed stringly typed web, because this is what it really is.


That would be a bug and should not be the case. I just tried it and couldn't replicate it. There is no difference between 7-7-2000 and 07-07-2000 in xsd, nor in the SPARQL query endpoint.

Here are the people in Wikidata born on 07-07-2000: https://w.wiki/3wrj

And here the people born on 7-7-2000: https://w.wiki/3wrk

The results are identical.

(This doesn't mean we have no duplicates at all in Wikidata - the post actually mentions five discovered duplicates among Queen Elizabeth II's ancestors. But those are duplicate entities, not duplicates within the datatypes.)


IMHO WD SPARQL should reject invalid literals: https://phabricator.wikimedia.org/T253718



