Wikidata is a free linked database that serves as a central storage for the structured data in Wikipedia and other Wikimedia projects. Their query service is officially live.
This service allows you to execute SPARQL queries for answering questions like:
For more query examples see this page.
I adapted this notebook from Ramiro Gómez's notebook on querying data about U.S. presidents using Wikidata.
Wikidata is trying to build a structured database of every claim about every entity on Wikipedia -- and in every language. The data model gets complex, but the most basic distinction is that there are:
Everything that would get its own Wikipedia article is an entity, and the Wikidata project is about importing all the unstructured statements from those articles into a database. Because Wikidata was built to be language-independent, everything has a unique alphanumeric identifier. So California is Q99,
The best way to get a feel for Wikidata's data model is to browse an individual entity. So let's look at the entry for Mount Tamalpais, to see what is there. You can search for any entity or relation in the search bar at wikidata.org, or you can click the "Wikidata item" link on the lefthand sidebar of any Wikipedia article. The URL for Mount Tamalpais is linked to the unique identifier Q785665.
We see that the first statement is one of the most common and foundational statements in Wikidata: instance of. If you hover over the 'instance of' link, you can see that it links to Property P31, which is the structured identifier for this kind of relation between entities and data. Mount Tam is an instance of a mountain, and a mountain is also an entity in Wikidata.
For many statements in Wikidata, the data in the statement is another Wikidata entity, which has its own kinds of statements. One of Mount Tam's other statements is the property 'located in the administrative territorial entity' (or P131), with the data for that statement being the Wikidata entity 'California' (or Q99). Other Wikidata statements have raw data, like the 'coordinate location' (Property P625) statement.
To query these data, you can use a structured querying language called SPARQL, which is an extention of SQL. The pseudoquery for this would be something like:
We then have to translate these statements and entites into language-neutral identifiers, which becomes:
The way we do this in SPARQL is:
SELECT ?mountain ?coord
WHERE {
# define ?mountain as all entities that are instances of (P31) mountains (Q8502)
?mountain wdt:P31 wd:Q8502 .
# that are in the administrative territorial entity (P131) 'California' (Q99)
?mountain wdt:P131 wd:Q99 .
# for ?mountain, return all coordinate statements (P625) in the variable ?coord
?mountain wdt:P625 ?coord
}
(we also have to put in a bunch of declarations, which are similar to importing a library)
There is a great way to test out your queries in the browser at https://query.wikidata.org. Here is the above SPARQL query in the web query service. One of the great things about the web query service is that you can hover over every property or entity and see what it is. You can also directly download the data to a number of formats.
The first thing you'll notice is that the ?mountain variable is the unique identifier for each mountain, not the English name (or Spanish or Japanese or Arabic...). To get that, you have to add another block to the SPARQL query.
# Out of the following query, select the variables: ?mountain ?mountainLabel? ?coord
SELECT ?mountain ?mountainLabel ?coord
WHERE {
# define ?mountain as all entities that are instances of (P31) mountains (Q8502)
?mountain wdt:P31 wd:Q8502 .
# that are in the administrative territorial entity (P131) 'California' (Q99)
?mountain wdt:P131 wd:Q99 .
# Then for every ?mountain, return data for all coordinate
# statements (P625) in the variable ?coord
?mountain wdt:P625 ?coord .
# Then for every ?mountain, return data for all labels (rdfs:label)
# into the variable ?mountainLabel, but filter for only english language labels
?mountain rdfs:label ?mountainLabel filter (lang(?mountainLabel) = "en")
}
This query in the query explorer.
Looking at the Mount Tam Wikidata page, we can see there is a property called "elevation above sea level." How would we extend the SPARQL query above to also return this data?
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
# Out of the following query, select the variables:
# ?mountain ?mountainLabel? ?coord ?elevation
SELECT ?mountain ?mountainLabel ?coord ?elevation
WHERE {
# Define ?mountain as all entities that are instances of (P31) mountains (Q8502)
?mountain wdt:P31 wd:Q8502 .
# that are in the administrative territorial entity (P131) 'California' (Q99).
?mountain wdt:P131 wd:Q99 .
# Then for every ?mountain, return data for all coordinate statements (P625)
# in the variable ?coord
?mountain wdt:P625 ?coord .
# Then for every ?mountain, return data for all
# 'elevation above sea level' statements (P2044) in the variable ?elevation
?mountain wdt:P2044 ?elevation .
# Then for every ?mountain, return data for all labels (rdfs:label)
# into the variable ?mountainLabel, but filter for only english language labels
?mountain rdfs:label ?mountainLabel filter (lang(?mountainLabel) = "en")
}
We can also filter on returned variables by adding the filter() function:
# Out of the following query, select the variables:
# ?mountain ?mountainLabel? ?coord ?elevation
SELECT ?mountain ?mountainLabel ?coord ?elevation
WHERE {
# Define ?mountain as all entities that are instances of (P31) mountains (Q8502)
?mountain wdt:P31 wd:Q8502 .
# that are in the administrative territorial entity (P131) 'California' (Q99).
# And have an elevation > 0
?mountain wdt:P131 wd:Q99 .
# Then for every ?mountain, return data for all coordinate statements (P625)
# in the variable ?coord
?mountain wdt:P625 ?coord .
# Then for every ?mountain, return data for all
# 'elevation above sea level' statements (P2044) in the variable ?elevation
# that have an elevation > 0
?mountain wdt:P2044 ?elevation filter(?elevation > 0) .
# Then for every ?mountain, return data for all labels (rdfs:label)
# into the variable ?mountainLabel, but filter for only english language labels
?mountain rdfs:label ?mountainLabel filter (lang(?mountainLabel) = "en")
}
You can make any variable optional, which means the row will be returned even if there is no row for it.
You also sometimes need to use subclasses, because Wikidataians love to categorize and subcategorize!
SELECT ?city ?cityLabel ?population
WHERE {
# Define ?city as all entities that are instances of (P31)
# or subclasses of (/P279*) city (Q515)
?city wdt:P31/wdt:P279* wd:Q515 .
# that are in the country (P17) 'United States' (Q30).
?city wdt:P17 wd:Q30 .
OPTIONAL {
# Optionally, for every ?city, return data for population (P1082) in ?population
?city wdt:P1082 ?population .
}
# Then for every ?city , return data for all labels (rdfs:label)
# into the variable ?cityLabel, but filter for only english language labels
?city rdfs:label ?cityLabel filter (lang(?cityLabel) = "en")
}
</code>
In [ ]:
In [2]:
import requests
header = {
'User-Agent': 'Wikidata query script by you@you.com'
}
import pandas as pd
In [3]:
query = '''
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX v: <http://www.wikidata.org/prop/statement/>
PREFIX q: <http://www.wikidata.org/prop/qualifier/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?mountain ?mountainLabel ?elevation
WHERE {
?mountain wdt:P31 wd:Q8502 .
?mountain wdt:P131 wd:Q99 .
?mountain wdt:P2044 ?elevation .
?mountain rdfs:label ?mountainLabel filter (lang(?mountainLabel) = "en")
}
'''
In [ ]:
Next we send an HTTP request to the SPARQL endpoint providing the query as a URL parameter, we also specify that we want the result encoded as JSON rather than the default XML. Thanks to the requests library this is practically self-explaining code.
Now we iterate through the result, creating a list of dictionaries, each of which contains values for the query variables defined above. Then we create a Pandas DataFrame from this list, print its length and the first few rows.
In [4]:
url = 'https://query.wikidata.org/bigdata/namespace/wdq/sparql'
data = requests.get(url, headers=header, params={'query': query, 'format': 'json'}).json()
In [5]:
data
Out[5]:
In [ ]:
languages = []
for item in data['results']['bindings']:
languages.append({
'name': item['mountainLabel']['value'],
'elevation': item['elevation']['value']
})
In [ ]:
df = pd.DataFrame(languages)
print(len(df))
df.head()
Let's also see the data types of the columns.
In [ ]:
df.dtypes
In [ ]: