SPARQLWrapper is a simple Python wrapper around a SPARQL service for remote query execution. Not only does it enable us to write more complex queries to extract information from RDF than those exposed through a library like rdflib, it can also convert query results into other formats like JSON and CSV!
SPARQL ("SPARQL Protocol And RDF Query Language") is a W3C standard for querying RDF and can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware. SPARQL contains capabilities for querying required and optional graph patterns along with their conjunctions and disjunctions. SPARQL also supports extensible value testing and constraining queries by source RDF graph. The results of SPARQL queries can be result sets or RDF graphs.
SPARQL allows us to express queries as three-part statements:
"""
PREFIX ... # identifies & nicknames namespace URIs of desired variables
SELECT ... # lists variables to be returned (each starts with a ?)
WHERE ... # contains restrictions on variables expressed as triples
"""
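To make the three-part shape concrete, here is a minimal sketch that assembles a query string from its PREFIX, SELECT, and WHERE parts. The helper `build_select_query` is hypothetical (it is not part of SPARQLWrapper or rdflib), and it only concatenates text; it does not validate the resulting SPARQL.

```python
def build_select_query(prefixes, variables, patterns):
    """Assemble a SELECT query from its three parts (hypothetical helper)."""
    # PREFIX: one line per nickname -> namespace URI pair
    prefix_lines = [f"PREFIX {name}: <{uri}>" for name, uri in prefixes.items()]
    # SELECT: every returned variable starts with a "?"
    select_line = "SELECT " + " ".join(f"?{v}" for v in variables)
    # WHERE: triple patterns and filters, one per line
    where_body = "\n".join(f"  {p}" for p in patterns)
    return "\n".join(prefix_lines + [select_line, "WHERE {", where_body, "}"])


query = build_select_query(
    prefixes={"rdfs": "http://www.w3.org/2000/01/rdf-schema#"},
    variables=["comment"],
    patterns=[
        "<http://dbpedia.org/resource/Capsaicin> rdfs:comment ?comment .",
        "FILTER (LANG(?comment) = 'en')",
    ],
)
print(query)
```

In practice it is usually simpler to write the query as a single triple-quoted string; the helper is only meant to show which text belongs to which part.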
SPARQLWrapper
The Python library SPARQLWrapper (which can be installed via pip) enables us to use the SPARQL query language to interact with remote or local SPARQL endpoints, such as DBPedia:
In [1]:
from SPARQLWrapper import SPARQLWrapper, JSON
# Specify the DBPedia endpoint
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
# Query for the description of "Capsaicin", filtered by language
sparql.setQuery("""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?comment
WHERE { <http://dbpedia.org/resource/Capsaicin> rdfs:comment ?comment
FILTER (LANG(?comment)='en')
}
""")
# Convert results to JSON format
sparql.setReturnFormat(JSON)
result = sparql.query().convert()
# The return data contains "bindings" (a list of dictionaries)
for hit in result["results"]["bindings"]:
    # We want the "value" attribute of the "comment" field
    print(hit["comment"]["value"])
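The nesting that the loop walks through follows the standard SPARQL JSON results layout: a "head" listing the variables and a "results" object holding the "bindings". The sketch below runs offline against a hand-written sample, so the comment text is a placeholder and not real DBpedia output; `binding_values` is a hypothetical helper name.

```python
def binding_values(result, var):
    """Collect the "value" of one variable from each binding.

    Bindings that leave the variable unbound (for example, from an
    OPTIONAL pattern) are skipped instead of raising a KeyError.
    """
    return [
        binding[var]["value"]
        for binding in result["results"]["bindings"]
        if var in binding
    ]


# Hand-written stand-in for a converted JSON result (placeholder text only)
sample_result = {
    "head": {"vars": ["comment"]},
    "results": {
        "bindings": [
            {
                "comment": {
                    "type": "literal",
                    "xml:lang": "en",
                    "value": "placeholder comment text",
                }
            }
        ]
    },
}

print(binding_values(sample_result, "comment"))
```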
We can also use the Wikidata Query Service (WDQS) endpoint to query Wikidata.
Let's say we want to continue our research into spicy things by searching for information about hot sauces in Wikidata. The first step is to find the unique identifier that Wikidata uses to reference "hot sauce", which we can do by searching on Wikidata. It turns out to be "Q522171". This identifier is an "entity", so it takes the "wd" prefix in Wikidata.
If we want to get back results for all of the kinds of hot sauces cataloged in Wikidata, we query for the items that have the direct property ("wdt" in Wikidata prefix speak) "subclass of", which is encoded as "P279" in Wikidata.
NOTE: For simple WDQS triples, items should be prefixed with wd:, and properties with wdt:. We don't need to explicitly alias any prefixes in this case because WDQS already knows many shortcut abbreviations commonly used externally (e.g. rdf, skos, owl, schema, etc.) as well as ones internal to Wikidata, such as:
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wds: <http://www.wikidata.org/entity/statement/>
PREFIX wdv: <http://www.wikidata.org/value/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX bd: <http://www.bigdata.com/rdf#>
More on prefixes here.
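A prefix is just shorthand for a namespace URI, so a prefixed name such as wd:Q522171 expands by simple concatenation. The following sketch illustrates that expansion with two of the prefixes listed above; `expand` and `WDQS_PREFIXES` are hypothetical names used only for illustration, and the endpoint performs this expansion itself when it parses a query.

```python
# Two of the WDQS prefixes listed above, as nickname -> namespace URI
WDQS_PREFIXES = {
    "wd": "http://www.wikidata.org/entity/",
    "wdt": "http://www.wikidata.org/prop/direct/",
}


def expand(prefixed_name, prefixes=WDQS_PREFIXES):
    """Expand a prefixed name like 'wd:Q522171' into its full URI."""
    prefix, _, local_name = prefixed_name.partition(":")
    return prefixes[prefix] + local_name


print(expand("wd:Q522171"))  # the "hot sauce" entity
print(expand("wdt:P279"))    # the direct "subclass of" property
```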
In [20]:
sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
# Below we SELECT both the hot sauce items & their labels
# in the WHERE clause we specify that we want labels as well as items
sparql.setQuery("""
SELECT ?item ?itemLabel
WHERE {
?item wdt:P279 wd:Q522171.
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
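The converted `results` object is a nested dictionary in which each binding maps a selected variable (here `item` and `itemLabel`) to a small dictionary with a "value" key. As a rough sketch of how those bindings can be flattened into plain rows, the example below uses a hand-written sample rather than real query output: the single row reuses the hot sauce identifier only to have a well-formed entity URI. The helper names are hypothetical.

```python
def rows_from_bindings(results):
    """Flatten SPARQL JSON bindings into one plain dict per result row."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in results["results"]["bindings"]
    ]


def qid_from_uri(uri):
    """Return the trailing identifier of an entity URI."""
    return uri.rsplit("/", 1)[-1]


# Hand-written stand-in for the converted JSON (not actual WDQS output)
sample_results = {
    "head": {"vars": ["item", "itemLabel"]},
    "results": {
        "bindings": [
            {
                "item": {
                    "type": "uri",
                    "value": "http://www.wikidata.org/entity/Q522171",
                },
                "itemLabel": {
                    "type": "literal",
                    "xml:lang": "en",
                    "value": "hot sauce",
                },
            }
        ]
    },
}

rows = rows_from_bindings(sample_results)
for row in rows:
    print(qid_from_uri(row["item"]), row["itemLabel"])
```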
Let's use pandas to review the results as a dataframe:
In [17]:
import pandas as pd
# pd.json_normalize is the top-level name (pandas 1.0+); the older pd.io.json path is deprecated
results_df = pd.json_normalize(results['results']['bindings'])
results_df[['item.value', 'itemLabel.value']]
Out[17]: