SPARQL from Python

SPARQLWrapper is a simple Python wrapper around a SPARQL service for remote query execution. Not only does it enable us to write more complex queries to extract information from RDF than those exposed through a library like rdflib, it can also convert query results into other formats like JSON and CSV!

First, what is SPARQL?

SPARQL ("SPARQL Protocol and RDF Query Language") is a W3C standard for querying RDF. It can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware. SPARQL can query required and optional graph patterns along with their conjunctions and disjunctions. It also supports extensible value testing and constraining queries by source RDF graph. The results of SPARQL queries can be result sets or RDF graphs.

SPARQL allows us to express queries as three-part statements:

"""
PREFIX ... # identifies & nicknames namespace URIs of desired variables
SELECT ... # lists variables to be returned (each starts with a ?)
WHERE  ... # contains restrictions on variables expressed as triples
"""

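To make the three parts concrete, here is a minimal sketch that assembles a query string from a prefix map, a list of variables, and a list of graph patterns. The helper name build_select_query is hypothetical and not part of any library; the example values mirror the DBpedia query used later in this section.

```python
def build_select_query(prefixes, variables, patterns):
    """Assemble a SELECT query from its three parts (hypothetical helper)."""
    # PREFIX: one line per namespace nickname
    prefix_lines = [f"PREFIX {name}: <{iri}>" for name, iri in prefixes.items()]
    # SELECT: the variables to return, each starting with "?"
    select_line = "SELECT " + " ".join(f"?{v}" for v in variables)
    # WHERE: the triple patterns and filters that constrain the variables
    where_body = "\n".join(f"  {p}" for p in patterns)
    return "\n".join(prefix_lines + [select_line, "WHERE {", where_body, "}"])


query = build_select_query(
    prefixes={"rdfs": "http://www.w3.org/2000/01/rdf-schema#"},
    variables=["comment"],
    patterns=[
        "<http://dbpedia.org/resource/Capsaicin> rdfs:comment ?comment .",
        "FILTER (LANG(?comment) = 'en')",
    ],
)
print(query)
```

In practice it is usually simpler to write the query as a single string, as the cells in this section do; the helper only shows where each part ends up.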


SPARQLWrapper

The Python library SPARQLWrapper (which can be installed via pip) lets us use the SPARQL query language to interact with remote or local SPARQL endpoints, such as DBpedia:


In [1]:
from SPARQLWrapper import SPARQLWrapper, JSON

# Specify the DBpedia endpoint
sparql = SPARQLWrapper("http://dbpedia.org/sparql")

# Query for the description of "Capsaicin", filtered by language
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?comment
    WHERE {
        <http://dbpedia.org/resource/Capsaicin> rdfs:comment ?comment .
        FILTER (LANG(?comment) = 'en')
    }
""")

# Convert results to JSON format
sparql.setReturnFormat(JSON)
result = sparql.query().convert()

# The return data contains "bindings" (a list of dictionaries)
for hit in result["results"]["bindings"]:
    # We want the "value" attribute of the "comment" field
    print(hit["comment"]["value"])


Capsaicin (/kæpˈseɪ.ᵻsɪn/ (INN); 8-methyl-N-vanillyl-6-nonenamide) is an active component of chili peppers, which are plants belonging to the genus Capsicum. It is an irritant for mammals, including humans, and produces a sensation of burning in any tissue with which it comes into contact. Capsaicin and several related compounds are called capsaicinoids and are produced as secondary metabolites by chili peppers, probably as deterrents against certain mammals and fungi. Pure capsaicin is a volatile, hydrophobic, colorless, odorless, crystalline to waxy compound.
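The nested lookups above follow the SPARQL JSON results layout: a "head" listing the variables, and "results" holding a list of "bindings". As a rough sketch, the snippet below flattens that layout into plain dictionaries. The sample_result data is hand-written for illustration and is not actual output from DBpedia, and binding_values is a hypothetical helper.

```python
# Illustrative sample only: hand-written in the SPARQL JSON results shape,
# not real output from an endpoint.
sample_result = {
    "head": {"vars": ["comment"]},
    "results": {
        "bindings": [
            {
                "comment": {
                    "type": "literal",
                    "xml:lang": "en",
                    "value": "Capsaicin is an active component of chili peppers.",
                }
            }
        ]
    },
}


def binding_values(result):
    """Flatten each binding into a {variable: value} dict (hypothetical helper)."""
    variables = result["head"]["vars"]
    rows = []
    for binding in result["results"]["bindings"]:
        # A variable may be unbound in a given row (e.g. under OPTIONAL),
        # so fall back to None instead of raising KeyError.
        rows.append({var: binding.get(var, {}).get("value") for var in variables})
    return rows


rows = binding_values(sample_result)
print(rows)
```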

Querying Wikidata

We can also use the Wikidata Query Service (WDQS) endpoint to query Wikidata.

Let's say we want to continue our research into spicy things by searching for information about hot sauces in Wikidata. The first step is to find the unique identifier that Wikidata uses for "hot sauce", which we can do by searching on Wikidata. It turns out to be "Q522171". This identifier names an item (an "entity"), so in queries it takes the "wd" prefix.

To get back all of the kinds of hot sauce cataloged in Wikidata, we query for items that are linked to hot sauce by the direct property (or "wdt" in Wikidata prefix speak) "subclass of", which is encoded as "P279" in Wikidata.

NOTE: For simple WDQS triples, items should be prefixed with wd:, and properties with wdt:. We don't need to explicitly alias any prefixes in this case because WDQS already knows many shortcut abbreviations commonly used externally (e.g. rdf, skos, owl, schema, etc.) as well as ones internal to Wikidata, such as:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wds: <http://www.wikidata.org/entity/statement/>
PREFIX wdv: <http://www.wikidata.org/value/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX bd: <http://www.bigdata.com/rdf#>

More on prefixes here.
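A prefixed name is just shorthand for a full IRI: the prefix is replaced by its namespace and the local part is appended. The sketch below illustrates that expansion for two of the prefixes listed above; the expand helper is hypothetical and only mimics what the endpoint does when it parses a query.

```python
# Two of the WDQS prefixes listed above
WDQS_PREFIXES = {
    "wd": "http://www.wikidata.org/entity/",
    "wdt": "http://www.wikidata.org/prop/direct/",
}


def expand(prefixed_name, prefixes=WDQS_PREFIXES):
    """Expand a name like 'wd:Q522171' into a full IRI (hypothetical helper)."""
    prefix, _, local = prefixed_name.partition(":")
    if prefix not in prefixes:
        raise KeyError(f"unknown prefix: {prefix!r}")
    return prefixes[prefix] + local


print(expand("wd:Q522171"))  # the "hot sauce" item
print(expand("wdt:P279"))    # the direct "subclass of" property
```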


In [20]:
sparql = SPARQLWrapper("https://query.wikidata.org/sparql")

# Below we SELECT both the hot sauce items & their labels.
# In the WHERE clause, the label service fills in ?itemLabel for each ?item.
sparql.setQuery("""
SELECT ?item ?itemLabel 

WHERE {
  ?item wdt:P279 wd:Q522171.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
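As an aside, a SPARQL endpoint is queried over plain HTTP. The sketch below builds, but does not send, a GET request in the style of the SPARQL 1.1 Protocol using only the standard library. It is not SPARQLWrapper's actual implementation. The helper name build_sparql_request and the User-Agent string are hypothetical; public endpoints such as WDQS generally expect a descriptive User-Agent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_sparql_request(endpoint, query, user_agent):
    """Build (without sending) a SPARQL-over-HTTP GET request (hypothetical helper)."""
    # The query text travels URL-encoded in the "query" parameter.
    url = endpoint + "?" + urlencode({"query": query})
    headers = {
        # Ask for the SPARQL JSON results format.
        "Accept": "application/sparql-results+json",
        "User-Agent": user_agent,
    }
    return Request(url, headers=headers)


hot_sauce_query = "SELECT ?item WHERE { ?item wdt:P279 wd:Q522171 . }"
request = build_sparql_request(
    "https://query.wikidata.org/sparql",
    hot_sauce_query,
    user_agent="example-notebook/0.1 (placeholder contact details)",
)
print(request.get_method())
print(request.full_url)
```

SPARQLWrapper handles this kind of request construction, plus response parsing, so we only need to supply the endpoint and the query.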

Let's use pandas to review the results as a dataframe:


In [17]:
import pandas as pd

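# json_normalize flattens nested keys into dotted columns such as "item.value"
# (pd.io.json.json_normalize is the older, deprecated spelling)
results_df = pd.json_normalize(results['results']['bindings'])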
results_df[['item.value', 'itemLabel.value']]


Out[17]:
    item.value                                itemLabel.value
0   http://www.wikidata.org/entity/Q249114    salsa
1   http://www.wikidata.org/entity/Q335016    Tabasco sauce
2   http://www.wikidata.org/entity/Q360459    Adobo
3   http://www.wikidata.org/entity/Q460439    Blair's 16 Million Reserve
4   http://www.wikidata.org/entity/Q966327    harissa
5   http://www.wikidata.org/entity/Q1026822   Chili oil
6   http://www.wikidata.org/entity/Q1392674   sriracha sauce
7   http://www.wikidata.org/entity/Q2227032   mojo
8   http://www.wikidata.org/entity/Q2279518   Shito
9   http://www.wikidata.org/entity/Q2402909   Valentina
10  http://www.wikidata.org/entity/Q3273096   Doubanjiang
11  http://www.wikidata.org/entity/Q3474141   sauce samouraï
12  http://www.wikidata.org/entity/Q3474250   Q3474250
13  http://www.wikidata.org/entity/Q4922876   Nam phrik
14  http://www.wikidata.org/entity/Q5104402   Cholula Hot Sauce
15  http://www.wikidata.org/entity/Q6961170   Nam chim
16  http://www.wikidata.org/entity/Q16628511  Q16628511
17  http://www.wikidata.org/entity/Q16642516  Q16642516
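Rows 12, 16, and 17 show an identifier in place of a name, which appears to be the label service falling back to the Q-number when no label exists in the requested languages. As a small sketch, assuming that fallback behavior, we can spot such rows by comparing each label with the identifier at the end of the item URI. The rows below are copied from the output above, and the qid helper is hypothetical.

```python
ENTITY_PREFIX = "http://www.wikidata.org/entity/"

# A few (item URI, label) pairs copied from the results above
rows = [
    ("http://www.wikidata.org/entity/Q335016", "Tabasco sauce"),
    ("http://www.wikidata.org/entity/Q3474250", "Q3474250"),
    ("http://www.wikidata.org/entity/Q5104402", "Cholula Hot Sauce"),
    ("http://www.wikidata.org/entity/Q16628511", "Q16628511"),
]


def qid(uri):
    """Strip the entity namespace to leave the bare Q-number (hypothetical helper)."""
    return uri[len(ENTITY_PREFIX):] if uri.startswith(ENTITY_PREFIX) else uri


# A label equal to the Q-number suggests no label was found for that item
unlabeled = [qid(uri) for uri, label in rows if label == qid(uri)]
print(unlabeled)
```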

More on SPARQL & SPARQL Endpoints