In our last episode, I did a number of queries against the DBpedia Ontology to map out the information available. In that notebook, I gave myself the restriction that I would only do queries against a copy of the DBpedia Ontology that is stored with the notebook.
Because the Ontology contains roughly 740 types and 2700 properties (more than 250 for Person alone), this turned out to be a serious limitation -- unless I know how much information is available for these properties, I can't know which ones are important, and thus can't make a visualization that makes sense.
Gastrodon is capable of querying the public DBpedia SPARQL endpoint, but that endpoint has some limitations; in particular, it returns at most 10,000 results for a query, and complex queries can time out. I could certainly write a series of smaller queries to compute statistics, but then I'd face a balancing act between too many small queries (which take a long time to run) and queries that are too large (and sometimes time out).
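For example, one way to work within the 10,000-row limit is to page through results with ORDER BY, LIMIT and OFFSET. The following is a rough, untested sketch against the public endpoint (the ORDER BY is needed because OFFSET is only meaningful over a stable ordering, and each page costs a separate round trip -- which is exactly the slowness I want to avoid):

import pandas as pd
from gastrodon import RemoteEndpoint

public=RemoteEndpoint("http://dbpedia.org/sparql")
pages=[]
offset=0
while True:
    # fetch one page of (type,count) rows; ORDER BY makes OFFSET deterministic
    page=public.select("""
        SELECT ?type (COUNT(*) AS ?cnt) { ?s a ?type }
        GROUP BY ?type ORDER BY ?type
        LIMIT 10000 OFFSET %d
    """ % offset)
    if len(page)==0:
        break
    pages.append(page)
    offset+=10000
type_counts=pd.concat(pages)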
Fortunately I have a product in the AWS Marketplace, the Ontology2 Edition of DBpedia 2016-04, which is a private SPARQL endpoint already loaded with data from DBpedia. By starting this product and waiting about an hour for it to initialize, I can run as many SPARQL queries of arbitrary complexity as I like, and shut it down when I'm through.
In this notebook, I use this private SPARQL endpoint to count the prevalence of types, properties, and datatypes. I use SPARQL Construct to save this information into an RDF graph that I'll later be able to combine with the DBpedia Ontology RDF graph to better explore the schema.
I start with the usual preliminaries, importing Python modules and defining prefixes.
In [30]:
%load_ext autotime
import sys
from os.path import expanduser
from gastrodon import RemoteEndpoint,QName,ttl,URIRef,inline
import pandas as pd
import json
pd.options.display.width=120
pd.options.display.max_colwidth=100
In [2]:
prefixes=inline("""
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix summary: <http://rdf.ontology2.com/summary/> .
""").graph
It wouldn't be safe for me to check database connection information into Git, so I store it in a file in my home directory named ~/.dbpedia/config.json, which looks like this:
{
"url":"http://130.21.14.234:8890/sparql-auth",
"user":"dba",
"passwd":"vKUcW1eSVkruDOtT",
"base_uri":"http://dbpedia.org/resource/"
}
(Note that this is not my real IP address and password. If you want to reproduce this, put in the IP address and password for your own server and save the file to ~/.dbpedia/config.json.)
In [3]:
connection_data=json.load(open(expanduser("~/.dbpedia/config.json")))
connection_data["prefixes"]=prefixes
In [4]:
endpoint=RemoteEndpoint(**connection_data)
The Ontology2 Edition of DBpedia 2016-04 is divided into a number of different named graphs, one for each dataset described here.
It's important to pay attention to this for two reasons.
One of them is that facts can appear in the output of a SPARQL query more than once if the query covers multiple graphs and the same facts are repeated in those graphs. This can throw off the accuracy of our counts.
The other is that some queries take a long time to run if they are run over all graphs; this particularly affects queries that filter on a prefix in the predicate field, for example:
FILTER(STRSTARTS(STR(?p),"http://dbpedia.org/ontology/"))
Considering both of these factors, it is wise to know which graphs hold the facts we want, so I start exploring:
In [5]:
endpoint.select("""
select ?g (COUNT(*) AS ?cnt) {
GRAPH ?g { ?a <http://dbpedia.org/ontology/Person/height> ?b } .
} GROUP BY ?g
""")
Out[5]:
Thus I find one motherlode of properties right away; I save this graph name in a variable so I can use it later.
In [6]:
pgraph=URIRef("http://downloads.dbpedia.org/2016-04/core-i18n/en/specific_mappingbased_properties_en.ttl.bz2")
Looking up types, I find a number of graphs and choose the transitive types:
In [8]:
endpoint.select("""
select ?g (COUNT(*) AS ?cnt) {
GRAPH ?g { ?a a dbo:Person } .
} GROUP BY ?g
""")
Out[8]:
In [9]:
tgraph=URIRef("http://downloads.dbpedia.org/2016-04/core-i18n/en/instance_types_transitive_en.ttl.bz2")
In [10]:
endpoint.select("""
SELECT ?type (COUNT(*) AS ?cnt) {
GRAPH ?_tgraph { ?s a ?type . }
FILTER(STRSTARTS(STR(?type),"http://dbpedia.org/ontology/"))
} GROUP BY ?type
""")
Out[10]:
I can store these facts in an RDF graph (instead of a Pandas DataFrame) by using a CONSTRUCT query (instead of a SELECT query). To capture the results of a GROUP BY query, however, I have to use a subquery -- this is because SPARQL does not allow expressions in the CONSTRUCT template, only variables and constants, so I have to evaluate expressions (such as COUNT(*)) somewhere else.
The resulting query is straightforward, even if it looks a little awkward with all the braces: roughly, I cut and pasted the above SELECT query into a CONSTRUCT query that defines the facts that will be emitted.
In [11]:
t_counts=endpoint.construct("""
CONSTRUCT {
?type summary:count ?cnt .
} WHERE {
{
SELECT ?type (COUNT(*) AS ?cnt) {
GRAPH ?_tgraph { ?s a ?type . }
FILTER(STRSTARTS(STR(?type),"http://dbpedia.org/ontology/"))
} GROUP BY ?type
}
}
""")
I can count the facts in the resulting graph (the same as the number of rows in the SELECT query):
In [31]:
len(t_counts)
Out[31]:
And here is a sample fact:
In [40]:
next(iter(t_counts))
Out[40]:
Note that the DBpedia Ontology contains a number of other facts about dbo:Book, so if I add the above fact to my copy of the DBpedia Ontology, SPARQL queries will be able to pick up the count together with all the other facts.
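For example (a sketch, not something run in this notebook): assuming the DBpedia Ontology from the last notebook has been loaded into an rdflib graph named ontology, merging the two graphs and querying them locally might look like this. Gastrodon's LocalEndpoint queries an in-memory rdflib graph the same way RemoteEndpoint queries a server.

from gastrodon import LocalEndpoint

# `ontology` is assumed to hold the DBpedia Ontology; rdflib graphs
# can be merged with `+`, just like `all_counts` later in this notebook
combined=ontology+t_counts
local=LocalEndpoint(combined)
local.select("""
    SELECT ?label ?cnt {
        <http://dbpedia.org/ontology/Book>
            <http://www.w3.org/2000/01/rdf-schema#label> ?label ;
            <http://rdf.ontology2.com/summary/count> ?cnt .
    }
""")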
In [12]:
endpoint.select("""
SELECT ?p (COUNT(*) AS ?cnt) {
GRAPH ?_pgraph { ?s ?p ?o . }
} GROUP BY ?p
""")
Out[12]:
In [13]:
sp_count=endpoint.construct("""
CONSTRUCT {
?p summary:count ?cnt .
} WHERE { {
SELECT ?p (COUNT(*) AS ?cnt) {
GRAPH ?_pgraph { ?s ?p ?o . }
} GROUP BY ?p
} }
""")
A search for dbo:birthDate turns up datatype properties (which point to literal values):
In [14]:
endpoint.select("""
select ?g (COUNT(*) AS ?cnt) {
GRAPH ?g { ?a dbo:birthDate ?b } .
} GROUP BY ?g
""")
Out[14]:
A search for dbo:child turns up object properties (which point to a URI reference):
In [15]:
endpoint.select("""
select ?g (COUNT(*) AS ?cnt) {
GRAPH ?g { ?a dbo:child ?b } .
} GROUP BY ?g
""")
Out[15]:
In [16]:
lgraph=URIRef("http://downloads.dbpedia.org/2016-04/core-i18n/en/mappingbased_literals_en.ttl.bz2")
ograph=URIRef("http://downloads.dbpedia.org/2016-04/core-i18n/en/mappingbased_objects_en.ttl.bz2")
In [17]:
endpoint.select("""
SELECT ?p (COUNT(*) AS ?cnt) {
{
GRAPH ?_pgraph {
?s ?p ?o .
}
} UNION {
GRAPH ?_ograph {
?s ?p ?o .
}
} UNION {
GRAPH ?_lgraph {
?s ?p ?o .
}
}
} GROUP BY ?p
""")
Out[17]:
In [18]:
p_counts=endpoint.construct("""
CONSTRUCT {
?p summary:count ?cnt .
} WHERE {
{
SELECT ?p (COUNT(*) AS ?cnt) {
{
GRAPH ?_pgraph {
?s ?p ?o .
}
} UNION {
GRAPH ?_ograph {
?s ?p ?o .
}
} UNION {
GRAPH ?_lgraph {
?s ?p ?o .
}
}
} GROUP BY ?p
}
}
""")
In [19]:
len(p_counts)
Out[19]:
In RDF, a Class is a kind of type which represents a "Thing" in the world. Datatypes, on the other hand, are types that represent literal values. The best-known datatypes in RDF come from XML Schema and represent things such as integers, dates, and strings.
RDF also allows us to define custom datatypes, which are specified with URIs, like most things in RDF.
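In rdflib, for instance, a literal with one of DBpedia's custom datatypes can be written like this (the value shown is made up):

from rdflib import Literal, URIRef

# a height of 1.85 metres, typed with DBpedia's custom `metre` datatype
height=Literal("1.85",datatype=URIRef("http://dbpedia.org/datatype/metre"))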
A GROUP BY query reveals the prevalence of various datatypes, which I then dump to a graph.
There are still some big questions to research, such as "does the same property turn up with different units?" For instance, it is quite possible that a length could be represented in kilometers, centimeters, feet, or furlongs. You won't get the right answer, however, if you try to add multiple lengths in different units that are all represented as floats. Thus it may be necessary at some point to build a bridge to a package like numericalunits, or alternatively to build something that canonicalizes units, as sketched below.
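A canonicalizer could be as simple as a lookup table keyed on the datatype URI. The sketch below converts lengths to metres; the conversion table is illustrative, not a complete inventory of DBpedia's length datatypes.

# hypothetical conversion factors from DBpedia length datatypes to metres
TO_METRES={
    "http://dbpedia.org/datatype/metre": 1.0,
    "http://dbpedia.org/datatype/centimetre": 0.01,
    "http://dbpedia.org/datatype/kilometre": 1000.0,
    "http://dbpedia.org/datatype/foot": 0.3048,
}

def to_metres(value,datatype_uri):
    # refuse to guess when we see a datatype we don't know how to convert
    factor=TO_METRES.get(str(datatype_uri))
    if factor is None:
        raise ValueError("unknown length datatype: %s" % datatype_uri)
    return float(value)*factor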
In [20]:
endpoint.select("""
SELECT ?datatype (COUNT(*) AS ?cnt) {
{
GRAPH ?_pgraph {
?s ?p ?o .
}
} UNION {
GRAPH ?_lgraph {
?s ?p ?o .
}
}
BIND(DATATYPE(?o) AS ?datatype)
} GROUP BY ?datatype
""")
Out[20]:
In [21]:
dt_counts=endpoint.construct("""
CONSTRUCT {
?datatype summary:count ?cnt .
} WHERE {
SELECT ?datatype (COUNT(*) AS ?cnt) {
{
GRAPH ?_pgraph {
?s ?p ?o .
}
} UNION {
GRAPH ?_lgraph {
?s ?p ?o .
}
}
BIND(DATATYPE(?o) AS ?datatype)
} GROUP BY ?datatype
}
""")
In [25]:
all_counts = t_counts + p_counts + dt_counts
I add a few prefix declarations for (human) readability, then write the data to disk in Turtle format. I was tempted to write it to a relative path which would put this file in its final destination (underneath the local notebook directory, where it could be found by the notebooks), but decided against it, since I don't want to take the chance of me (or you) trashing the project by mistake. Instead I'll copy the file into place later.
In [28]:
all_counts.bind("datatype","http://dbpedia.org/datatype/")
all_counts.bind("dbo","http://dbpedia.org/ontology/")
all_counts.bind("summary","http://rdf.ontology2.com/summary/")
all_counts.serialize("/data/schema_counts.ttl",format='ttl',encoding='utf-8')
In [22]:
dimensions=endpoint.select("""
select ?p ?height ?weight {
GRAPH ?_pgraph {
?p <http://dbpedia.org/ontology/Person/weight> ?weight .
?p <http://dbpedia.org/ontology/Person/height> ?height .
}
}
""")
In [23]:
dimensions
Out[23]:
The data looks a bit messy. Most noticeably, I see quite a few facts which, instead of pointing to DBpedia concepts, point to synthetic URLs (such as <Ron_Clarke__2>) which are supposed to represent 'topics', such as the time that a particular employee worked for a particular employer. (See this notebook for some discussion of the phenomenon.)
Filtering these out will not be hard, as these synthetic URLs all contain two consecutive underscores.
I also think it's suspicious that a few people have a height of 0.0, which might be in the underlying data, or might be because Gastrodon is not properly handling a missing data value.
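Both cleanups fit in a couple of lines of Pandas. This sketch assumes the ?p values come back in a column named p; depending on the Gastrodon version, the first variable may instead land in the index, in which case you would filter on dimensions.index.

# drop the synthetic '__' topic URLs, then the implausible zero heights
clean=dimensions[~dimensions["p"].astype(str).str.contains("__")]
clean=clean[clean["height"]>0]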
It would certainly be possible to serialize these results into an RDF graph, but instead I write them to a CSV file for simplicity.
In [24]:
dimensions.to_csv("/data/people_weight.csv.gz",compression="gzip",encoding="utf-8")
To continue the analysis I began here, I needed a count of how often various classes, properties, and datatypes were used in DBpedia. API limits could make getting this data from the public SPARQL endpoint challenging, so I decided to run queries against my own private SPARQL endpoint powered by the Ontology2 Edition of DBpedia.
After setting up connection information, connecting to this private endpoint turned out to be as simple as connecting to a public endpoint, and I was able to efficiently get the data I needed into an RDF graph, ready to merge with the DBpedia Ontology graph to make a more meaningful analysis of the data in DBpedia, towards the goal of producing interesting and attractive visualizations.