Querying Wikidata

by R. Stuart Geiger, released CC-BY 4.0

Wikidata is a free linked database that serves as a central storage for the structured data in Wikipedia and other Wikimedia projects. Their query service is officially live.

This service allows you to execute SPARQL queries for answering questions like:

What are the heights of all the mountains in California?
What are the most populated cities whose mayors are women?
For each country, how many ministers are alive who are themselves children of a minister?

For more query examples see this page.

I adapted this notebook from Ramiro Gómez's notebook on querying data about U.S. presidents using Wikidata.

Wikidata's data model

Wikidata is trying to build a structured database of every claim about every entity on Wikipedia -- and in every language. The data model gets complex, but the most basic distinction is that there are:

entities (things in the world: California, Mount Tamalpais, George Washington, Harry Potter, carbon-14, python)
properties (types of claims: 'instance of', 'coordinate location', 'cause of death', 'population')
statements (an entity-property-data relation: 'python is an instance of a programming language', 'California has a population of 39,144,818').

Everything that would get its own Wikipedia article is an entity, and the Wikidata project is about importing all the unstructured statements from those articles into a database. Because Wikidata was built to be language-independent, everything has a unique alphanumeric identifier. So California is Q99,

The best way to get a feel for Wikidata's data model is to browse an individual entity. So let's look at the entry for Mount Tamalpais, to see what is there. You can search for any entity or relation in the search bar at wikidata.org, or you can click the "Wikidata item" link on the lefthand sidebar of any Wikipedia article. The URL for Mount Tamalpais is linked to the unique identifier Q785665.

Mount Tam's statements

We see that the first statement is one of the most common and foundational statements in Wikidata: instance of. If you hover over the 'instance of' link, you can see that it links to Property P31, which is the structured identifier for this kind of relation between entities and data. Mount Tam is an instance of a mountain, and a mountain is also an entity in Wikidata.

For many statements in Wikidata, the data in the statement is another Wikidata entity, which has its own kinds of statements. One of Mount Tam's other statements is the property 'located in the administrative territorial entity' (or P131), with the data for that statement being the Wikidata entity 'California' (or Q99). Other Wikidata statements have raw data, like the 'coordinate location' (Property P625) statement.

Querying Wikidata

To query these data, you can use a structured querying language called SPARQL, which is an extention of SQL. The pseudoquery for this would be something like:

Return all statements about coordinate locations
For all entities that are instances of mountains
That are located in the administrative territorial entity 'California'

We then have to translate these statements and entites into language-neutral identifiers, which becomes:

For all entities that are instances of (P31) mountains (Q8502)
That are located in the administrative territorial entity (P131) 'California' (Q99)
Return all statements about coordinate locations (P625)

The way we do this in SPARQL is:

SELECT ?mountain ?coord 
WHERE {

    # define ?mountain as all entities that are instances of (P31) mountains (Q8502)
    ?mountain wdt:P31 wd:Q8502 .     

    # that are in the administrative territorial entity (P131) 'California' (Q99)
    ?mountain wdt:P131 wd:Q99 .      

    # for ?mountain, return all coordinate statements (P625) in the variable ?coord 
    ?mountain wdt:P625 ?coord        
}

(we also have to put in a bunch of declarations, which are similar to importing a library)

Using Wikidata's web query service

There is a great way to test out your queries in the browser at https://query.wikidata.org. Here is the above SPARQL query in the web query service. One of the great things about the web query service is that you can hover over every property or entity and see what it is. You can also directly download the data to a number of formats.

Labels

The first thing you'll notice is that the ?mountain variable is the unique identifier for each mountain, not the English name (or Spanish or Japanese or Arabic...). To get that, you have to add another block to the SPARQL query.

# Out of the following query, select the variables: ?mountain ?mountainLabel? ?coord

SELECT ?mountain ?mountainLabel ?coord 
WHERE {

    # define ?mountain as all entities that are instances of (P31) mountains (Q8502)
    ?mountain wdt:P31 wd:Q8502 .     

    # that are in the administrative territorial entity (P131) 'California' (Q99)
    ?mountain wdt:P131 wd:Q99 .      

    # Then for every ?mountain, return data for all coordinate 
    # statements (P625) in the variable ?coord
    ?mountain wdt:P625 ?coord .      

    # Then for every ?mountain, return data for all labels (rdfs:label)
    # into the variable ?mountainLabel, but filter for only english language labels
    ?mountain rdfs:label ?mountainLabel filter (lang(?mountainLabel) = "en")
}

This query in the query explorer.

Extend it!

Looking at the Mount Tam Wikidata page, we can see there is a property called "elevation above sea level." How would we extend the SPARQL query above to also return this data?



In [ ]:



In [ ]:



In [ ]:



In [ ]:



In [ ]:



In [ ]:



In [ ]:



In [ ]:

# Out of the following query, select the variables: 
# ?mountain ?mountainLabel? ?coord ?elevation
SELECT ?mountain ?mountainLabel ?coord ?elevation

WHERE {

    # Define ?mountain as all entities that are instances of (P31) mountains (Q8502) 
    ?mountain wdt:P31 wd:Q8502 .     

    # that are in the administrative territorial entity (P131) 'California' (Q99).
    ?mountain wdt:P131 wd:Q99 .      

    # Then for every ?mountain, return data for all coordinate statements (P625)
    # in the variable ?coord
    ?mountain wdt:P625 ?coord .       

    # Then for every ?mountain, return data for all 
    # 'elevation above sea level' statements (P2044) in the variable ?elevation
    ?mountain wdt:P2044 ?elevation .

    # Then for every ?mountain, return data for all labels (rdfs:label)
    # into the variable ?mountainLabel, but filter for only english language labels
    ?mountain rdfs:label ?mountainLabel filter (lang(?mountainLabel) = "en")
}

This query in the query explorer..

We can also filter on returned variables by adding the filter() function:

# Out of the following query, select the variables: 
# ?mountain ?mountainLabel? ?coord ?elevation
SELECT ?mountain ?mountainLabel ?coord ?elevation

WHERE {

    # Define ?mountain as all entities that are instances of (P31) mountains (Q8502) 
    ?mountain wdt:P31 wd:Q8502 .     

    # that are in the administrative territorial entity (P131) 'California' (Q99).
    # And have an elevation > 0
    ?mountain wdt:P131 wd:Q99 .      

    # Then for every ?mountain, return data for all coordinate statements (P625)
    # in the variable ?coord
    ?mountain wdt:P625 ?coord .       

    # Then for every ?mountain, return data for all 
    # 'elevation above sea level' statements (P2044) in the variable ?elevation
    # that have an elevation > 0
    ?mountain wdt:P2044 ?elevation filter(?elevation > 0) .

    # Then for every ?mountain, return data for all labels (rdfs:label)
    # into the variable ?mountainLabel, but filter for only english language labels
    ?mountain rdfs:label ?mountainLabel filter (lang(?mountainLabel) = "en")
}

</code> This query in the query explorer.

You can make any variable optional, which means the row will be returned even if there is no row for it.

You also sometimes need to use subclasses, because Wikidataians love to categorize and subcategorize!

SELECT ?city ?cityLabel ?population

WHERE {

    # Define ?city as all entities that are instances of (P31)
    # or subclasses of (/P279*) city (Q515)
    ?city wdt:P31/wdt:P279* wd:Q515 .     

    # that are in the country (P17) 'United States' (Q30).
    ?city wdt:P17 wd:Q30 .      

    OPTIONAL {
    # Optionally, for every ?city, return data for population (P1082) in ?population
        ?city wdt:P1082 ?population .
    }
    # Then for every ?city , return data for all labels (rdfs:label)
    # into the variable ?cityLabel, but filter for only english language labels
    ?city rdfs:label ?cityLabel filter (lang(?cityLabel) = "en")
}

</code>

This query in the query explorer.

Make your own query

For simple queries, it is best to first filter by a fundamental property, like 'instance of', 'occupation',

Querying in Python



In [ ]:



In [2]:

    
import requests

header = {
    'User-Agent': 'Wikidata query script by you@you.com'
}

import pandas as pd



In [3]:

    
query = '''
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX v: <http://www.wikidata.org/prop/statement/>
PREFIX q: <http://www.wikidata.org/prop/qualifier/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?mountain ?mountainLabel ?elevation 
WHERE {
    ?mountain wdt:P31 wd:Q8502 .
    ?mountain wdt:P131 wd:Q99 .
    ?mountain wdt:P2044 ?elevation .
    ?mountain rdfs:label ?mountainLabel filter (lang(?mountainLabel) = "en")
}
'''



In [ ]:

Get and process the data

Next we send an HTTP request to the SPARQL endpoint providing the query as a URL parameter, we also specify that we want the result encoded as JSON rather than the default XML. Thanks to the requests library this is practically self-explaining code.

Now we iterate through the result, creating a list of dictionaries, each of which contains values for the query variables defined above. Then we create a Pandas DataFrame from this list, print its length and the first few rows.



In [4]:

    
url = 'https://query.wikidata.org/bigdata/namespace/wdq/sparql'
data = requests.get(url, headers=header, params={'query': query, 'format': 'json'}).json()



In [5]:

    
data









    Out[5]:





{'head': {'vars': ['mountain', 'mountainLabel', 'elevation']},
 'results': {'bindings': [{'elevation': {'datatype': 'http://www.w3.org/2001/XMLSchema#decimal',
     'type': 'literal',
     'value': '4322'},
    'mountain': {'type': 'uri',
     'value': 'http://www.wikidata.org/entity/Q748355'},
    'mountainLabel': {'type': 'literal',
     'value': 'Mount Shasta',
     'xml:lang': 'en'}},
   {'elevation': {'datatype': 'http://www.w3.org/2001/XMLSchema#decimal',
     'type': 'literal',
     'value': '1155'},
    'mountain': {'type': 'uri',
     'value': 'http://www.wikidata.org/entity/Q371103'},
    'mountainLabel': {'type': 'literal',
     'value': 'Loma Prieta',
     'xml:lang': 'en'}},
   {'elevation': {'datatype': 'http://www.w3.org/2001/XMLSchema#decimal',
     'type': 'literal',
     'value': '3069'},
    'mountain': {'type': 'uri',
     'value': 'http://www.wikidata.org/entity/Q128657'},
    'mountainLabel': {'type': 'literal',
     'value': 'Mount San Antonio',
     'xml:lang': 'en'}},
   {'elevation': {'datatype': 'http://www.w3.org/2001/XMLSchema#decimal',
     'type': 'literal',
     'value': '3750'},
    'mountain': {'type': 'uri',
     'value': 'http://www.wikidata.org/entity/Q163232'},
    'mountainLabel': {'type': 'literal',
     'value': 'Mount Morrison',
     'xml:lang': 'en'}},
   {'elevation': {'datatype': 'http://www.w3.org/2001/XMLSchema#decimal',
     'type': 'literal',
     'value': '4421'},
    'mountain': {'type': 'uri',
     'value': 'http://www.wikidata.org/entity/Q235539'},
    'mountainLabel': {'type': 'literal',
     'value': 'Mount Whitney',
     'xml:lang': 'en'}},
   {'elevation': {'datatype': 'http://www.w3.org/2001/XMLSchema#decimal',
     'type': 'literal',
     'value': '4007'},
    'mountain': {'type': 'uri',
     'value': 'http://www.wikidata.org/entity/Q573205'},
    'mountainLabel': {'type': 'literal',
     'value': 'Boundary Peak',
     'xml:lang': 'en'}},
   {'elevation': {'datatype': 'http://www.w3.org/2001/XMLSchema#decimal',
     'type': 'literal',
     'value': '939'},
    'mountain': {'type': 'uri',
     'value': 'http://www.wikidata.org/entity/Q4951563'},
    'mountainLabel': {'type': 'literal',
     'value': 'Box Springs Mountain',
     'xml:lang': 'en'}},
   {'elevation': {'datatype': 'http://www.w3.org/2001/XMLSchema#decimal',
     'type': 'literal',
     'value': '3326'},
    'mountain': {'type': 'uri',
     'value': 'http://www.wikidata.org/entity/Q5052261'},
    'mountainLabel': {'type': 'literal',
     'value': 'Cathedral Peak',
     'xml:lang': 'en'}},
   {'elevation': {'datatype': 'http://www.w3.org/2001/XMLSchema#decimal',
     'type': 'literal',
     'value': '486'},
    'mountain': {'type': 'uri',
     'value': 'http://www.wikidata.org/entity/Q5179749'},
    'mountainLabel': {'type': 'literal',
     'value': 'Cowles Mountain',
     'xml:lang': 'en'}},
   {'elevation': {'datatype': 'http://www.w3.org/2001/XMLSchema#decimal',
     'type': 'literal',
     'value': '1986'},
    'mountain': {'type': 'uri',
     'value': 'http://www.wikidata.org/entity/Q5197090'},
    'mountainLabel': {'type': 'literal',
     'value': 'Cuyamaca Peak',
     'xml:lang': 'en'}},
   {'elevation': {'datatype': 'http://www.w3.org/2001/XMLSchema#decimal',
     'type': 'literal',
     'value': '4013'},
    'mountain': {'type': 'uri',
     'value': 'http://www.wikidata.org/entity/Q1405920'},
    'mountainLabel': {'type': 'literal',
     'value': 'Red Slate Mountain',
     'xml:lang': 'en'}},
   {'elevation': {'datatype': 'http://www.w3.org/2001/XMLSchema#decimal',
     'type': 'literal',
     'value': '1319'},
    'mountain': {'type': 'uri',
     'value': 'http://www.wikidata.org/entity/Q1534757'},
    'mountainLabel': {'type': 'literal',
     'value': 'Mount Saint Helena',
     'xml:lang': 'en'}},
   {'elevation': {'datatype': 'http://www.w3.org/2001/XMLSchema#decimal',
     'type': 'literal',
     'value': '1327'},
    'mountain': {'type': 'uri',
     'value': 'http://www.wikidata.org/entity/Q1950508'},
    'mountainLabel': {'type': 'literal',
     'value': 'Mount Hamilton',
     'xml:lang': 'en'}},
   {'elevation': {'datatype': 'http://www.w3.org/2001/XMLSchema#decimal',
     'type': 'literal',
     'value': '3735'},
    'mountain': {'type': 'uri',
     'value': 'http://www.wikidata.org/entity/Q2412852'},
    'mountainLabel': {'type': 'literal',
     'value': 'Minarets',
     'xml:lang': 'en'}},
   {'elevation': {'datatype': 'http://www.w3.org/2001/XMLSchema#decimal',
     'type': 'literal',
     'value': '784'},
    'mountain': {'type': 'uri',
     'value': 'http://www.wikidata.org/entity/Q785665'},
    'mountainLabel': {'type': 'literal',
     'value': 'Mount Tamalpais',
     'xml:lang': 'en'}},
   {'elevation': {'datatype': 'http://www.w3.org/2001/XMLSchema#decimal',
     'type': 'literal',
     'value': '-86'},
    'mountain': {'type': 'uri',
     'value': 'http://www.wikidata.org/entity/Q799720'},
    'mountainLabel': {'type': 'literal',
     'value': 'Badwater Basin',
     'xml:lang': 'en'}},
   {'elevation': {'datatype': 'http://www.w3.org/2001/XMLSchema#decimal',
     'type': 'literal',
     'value': '4390'},
    'mountain': {'type': 'uri',
     'value': 'http://www.wikidata.org/entity/Q822964'},
    'mountainLabel': {'type': 'literal',
     'value': 'Mount Williamson',
     'xml:lang': 'en'}},
   {'elevation': {'datatype': 'http://www.w3.org/2001/XMLSchema#decimal',
     'type': 'literal',
     'value': '4272'},
    'mountain': {'type': 'uri',
     'value': 'http://www.wikidata.org/entity/Q831100'},
    'mountainLabel': {'type': 'literal',
     'value': 'Mount Muir',
     'xml:lang': 'en'}},
   {'elevation': {'datatype': 'http://www.w3.org/2001/XMLSchema#decimal',
     'type': 'literal',
     'value': '4010'},
    'mountain': {'type': 'uri',
     'value': 'http://www.wikidata.org/entity/Q841110'},
    'mountainLabel': {'type': 'literal',
     'value': 'Mount Ritter',
     'xml:lang': 'en'}},
   {'elevation': {'datatype': 'http://www.w3.org/2001/XMLSchema#decimal',
     'type': 'literal',
     'value': '3189'},
    'mountain': {'type': 'uri',
     'value': 'http://www.wikidata.org/entity/Q859354'},
    'mountainLabel': {'type': 'literal',
     'value': 'Lassen Peak',
     'xml:lang': 'en'}},
   {'elevation': {'datatype': 'http://www.w3.org/2001/XMLSchema#decimal',
     'type': 'literal',
     'value': '3366'},
    'mountain': {'type': 'uri',
     'value': 'http://www.wikidata.org/entity/Q916811'},
    'mountainLabel': {'type': 'literal',
     'value': 'Telescope Peak',
     'xml:lang': 'en'}},
   {'elevation': {'datatype': 'http://www.w3.org/2001/XMLSchema#decimal',
     'type': 'literal',
     'value': '2693'},
    'mountain': {'type': 'uri',
     'value': 'http://www.wikidata.org/entity/Q925252'},
    'mountainLabel': {'type': 'literal',
     'value': 'Half Dome',
     'xml:lang': 'en'}},
   {'elevation': {'datatype': 'http://www.w3.org/2001/XMLSchema#decimal',
     'type': 'literal',
     'value': '0'},
    'mountain': {'type': 'uri',
     'value': 'http://www.wikidata.org/entity/Q1124852'},
    'mountainLabel': {'type': 'literal',
     'value': 'El Capitan',
     'xml:lang': 'en'}},
   {'elevation': {'datatype': 'http://www.w3.org/2001/XMLSchema#decimal',
     'type': 'literal',
     'value': '0'},
    'mountain': {'type': 'uri',
     'value': 'http://www.wikidata.org/entity/Q8524849'},
    'mountainLabel': {'type': 'literal',
     'value': 'Santiago Peak',
     'xml:lang': 'en'}},
   {'elevation': {'datatype': 'http://www.w3.org/2001/XMLSchema#decimal',
     'type': 'literal',
     'value': '405'},
    'mountain': {'type': 'uri',
     'value': 'http://www.wikidata.org/entity/Q8525771'},
    'mountainLabel': {'type': 'literal',
     'value': 'Mount Rubidoux',
     'xml:lang': 'en'}},
   {'elevation': {'datatype': 'http://www.w3.org/2001/XMLSchema#decimal',
     'type': 'literal',
     'value': '0'},
    'mountain': {'type': 'uri',
     'value': 'http://www.wikidata.org/entity/Q13359059'},
    'mountainLabel': {'type': 'literal',
     'value': 'Mount Diablo',
     'xml:lang': 'en'}},
   {'elevation': {'datatype': 'http://www.w3.org/2001/XMLSchema#decimal',
     'type': 'literal',
     'value': '3302'},
    'mountain': {'type': 'uri',
     'value': 'http://www.wikidata.org/entity/Q7414344'},
    'mountainLabel': {'type': 'literal',
     'value': 'San Jacinto Peak',
     'xml:lang': 'en'}}]}}



In [ ]:

    
languages = []
for item in data['results']['bindings']:
    languages.append({
        'name': item['mountainLabel']['value'],
        'elevation': item['elevation']['value']
    })



In [ ]:

    
df = pd.DataFrame(languages)
print(len(df))
df.head()

Let's also see the data types of the columns.



In [ ]:

    
df.dtypes



In [ ]: