Fetching Summary Data

Introducing the Ontology2 Edition of DBpedia

In our last episode, I ran a number of queries against the DBpedia Ontology to map out the information available. In that notebook, I restricted myself to querying a copy of the DBpedia Ontology that is stored with the notebook.

Because the Ontology contains roughly 740 types and 2700 properties (more than 250 for Person alone), this turned out to be a serious limitation -- unless I know how much data is available for each of these properties, I can't tell which ones are important, and thus can't make a visualization that makes sense.

Gastrodon is capable of querying the DBpedia Public SPARQL endpoint, but that endpoint has some limitations; in particular, it returns at most 10,000 results for a query, and complex queries can time out. Certainly I could write a series of smaller queries to compute statistics, but then I face a balancing act between too many small queries (which will take a long time to run) and queries that get too large (and sometimes time out).
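To make the "series of smaller queries" idea concrete, here is a minimal sketch of LIMIT/OFFSET paging. It only builds query strings and does not contact any endpoint; the function name `paged_queries` and the sample query are hypothetical, and the default page size of 10,000 simply mirrors the result cap mentioned above.

```python
def paged_queries(base_query, page_size=10000, max_pages=3):
    """Yield copies of base_query with LIMIT/OFFSET appended, one per page."""
    for page in range(max_pages):
        yield f"{base_query}\nLIMIT {page_size} OFFSET {page * page_size}"


# Hypothetical query; a stable ORDER BY is needed so pages do not overlap.
base = "SELECT ?s { ?s a <http://dbpedia.org/ontology/Person> } ORDER BY ?s"

for query in paged_queries(base, max_pages=2):
    print(query)
    print("---")
```

Smaller pages mean more round trips, larger pages risk timeouts, which is exactly the balancing act described above.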

Fortunately I have a product in the AWS Marketplace, the Ontology2 Edition of DBpedia 2016-04, which is a private SPARQL endpoint already loaded with data from DBpedia. By starting this product and waiting about an hour for it to initialize, I can run as many SPARQL queries as I like, of arbitrary complexity, and shut it down when I'm through.

In this notebook, I use this private SPARQL endpoint to count the prevalence of types, properties, and datatypes. I use SPARQL Construct to save this information into an RDF graph that I'll later be able to combine with the DBpedia Ontology RDF graph to better explore the schema.

I start with the usual preliminaries, importing Python modules and prefix definitions:


In [30]:
%load_ext autotime
import sys
from os.path import expanduser
from gastrodon import RemoteEndpoint,QName,ttl,URIRef,inline
import pandas as pd
import json
pd.options.display.width=120
pd.options.display.max_colwidth=100


The autotime extension is already loaded. To reload it, use:
  %reload_ext autotime
time: 4.5 ms

In [2]:
prefixes=inline("""
    @prefix dbo: <http://dbpedia.org/ontology/> .
    @prefix summary: <http://rdf.ontology2.com/summary/> .
""").graph


time: 8 ms

It wouldn't be safe for me to check database connection information into Git, so I store it in a file in my home directory named ~/.dbpedia/config.json, which looks like this:

{
    "url":"http://130.21.14.234:8890/sparql-auth",
    "user":"dba",
    "passwd":"vKUcW1eSVkruDOtT",
    "base_uri":"http://dbpedia.org/resource/"
}

(Note that these are not my real IP address and password. If you want to reproduce this, put in the IP address and password for your own server and save the file to ~/.dbpedia/config.json.)
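If you are setting this up yourself, a small loader that fails early on an incomplete file can save some head-scratching. This is only a sketch: the helper name `load_connection_config` is hypothetical, and treating all four keys from the example above as required is my assumption, not something Gastrodon documents.

```python
import json
from os.path import expanduser

# Assumption: every key shown in the example config is required.
REQUIRED_KEYS = ("url", "user", "passwd", "base_uri")


def load_connection_config(path="~/.dbpedia/config.json"):
    """Read the JSON config and fail early if a required key is missing."""
    with open(expanduser(path), encoding="utf-8") as f:
        config = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config is missing keys: {', '.join(missing)}")
    return config
```

Using a `with` block also closes the file handle promptly.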


In [3]:
connection_data=json.load(open(expanduser("~/.dbpedia/config.json")))
connection_data["prefixes"]=prefixes


time: 4 ms

In [4]:
endpoint=RemoteEndpoint(**connection_data)


time: 12.5 ms

Counting Properties and Classes

Finding the right graphs

The Ontology2 Edition of DBpedia 2016-04 is divided into a number of different named graphs, one for each dataset described here.

It's important to pay attention to this for two reasons.

One of them is that a fact can appear in the output of a SPARQL query more than once if the query covers multiple graphs and the fact is repeated in those graphs. This can throw off the accuracy of our counts.

The other is that some queries seem to take a long time to run if they are run over all graphs; this particularly affects queries that involve filtering over a prefix in the predicate field, for example:

FILTER(STRSTARTS(STR(?p),"http://dbpedia.org/ontology/"))
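The double-counting problem is easy to see in miniature. The sketch below uses plain Python tuples as stand-ins for quads; the graph names, subjects, and values are all made up for illustration.

```python
# Hypothetical quads: (graph, subject, predicate, object).
quads = [
    ("g1", "Alice", "height", "170"),
    ("g2", "Alice", "height", "170"),  # same fact repeated in a second graph
    ("g1", "Bob", "height", "180"),
]

# Counting over all graphs counts the repeated fact twice.
count_all_graphs = sum(1 for g, s, p, o in quads if p == "height")

# Restricting the count to one graph counts each fact once.
count_one_graph = sum(1 for g, s, p, o in quads if g == "g1" and p == "height")

print(count_all_graphs, count_one_graph)  # prints: 3 2
```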

Considering both of these factors, it is wise to know which graphs the facts we want are stored in, thus I start exploring:


In [5]:
endpoint.select("""
    select ?g (COUNT(*) AS ?cnt) {
       GRAPH ?g { ?a <http://dbpedia.org/ontology/Person/height> ?b } .
    } GROUP BY ?g
""")


Out[5]:
cnt
g
http://downloads.dbpedia.org/2016-04/core-i18n/en/citedFacts_en.ttl.bz2 5502
http://downloads.dbpedia.org/2016-04/core-i18n/en/specific_mappingbased_properties_en.ttl.bz2 148105
time: 2.07 s

Thus I find one mother lode of properties right away. I save this graph's URI in a variable so I can use it later.


In [6]:
pgraph=URIRef("http://downloads.dbpedia.org/2016-04/core-i18n/en/specific_mappingbased_properties_en.ttl.bz2")


time: 999 µs

Looking up types, I find a number of graphs and choose the transitive types:


In [8]:
endpoint.select("""
    select ?g (COUNT(*) AS ?cnt) {
       GRAPH ?g { ?a a dbo:Person } .
    } GROUP BY ?g
""")


Out[8]:
cnt
g
http://downloads.dbpedia.org/2016-04/core-i18n/en/instance_types_transitive_en.ttl.bz2 1014819
http://downloads.dbpedia.org/2016-04/core-i18n/en/instance_types_en.ttl.bz2 502997
http://downloads.dbpedia.org/2016-04/core-i18n/en/instance_types_sdtyped_dbo_en.ttl.bz2 212295
http://downloads.dbpedia.org/2016-04/core-i18n/en/instance_types_lhd_dbo_en.ttl.bz2 834547
time: 332 ms

In [9]:
tgraph=URIRef("http://downloads.dbpedia.org/2016-04/core-i18n/en/instance_types_transitive_en.ttl.bz2")


time: 1 ms

Counting Classes

It is now straightforward to pull up a list of types (classes), noting that these are not mutually exclusive. (You can be both a dbo:Actor and a dbo:Politician.)


In [10]:
endpoint.select("""
   SELECT ?type (COUNT(*) AS ?cnt) {
       GRAPH ?_tgraph { ?s a ?type . }
       FILTER(STRSTARTS(STR(?type),"http://dbpedia.org/ontology/"))
    } GROUP BY ?type
""")


Out[10]:
cnt
type
dbo:WinterSportPlayer 22373
dbo:Project 11
dbo:PopulatedPlace 505557
dbo:Actor 1969
dbo:Document 23799
dbo:Genre 1229
dbo:Group 33681
dbo:Politician 19569
dbo:Station 2300
dbo:Venue 789
dbo:Animal 219648
dbo:Comic 4097
dbo:GridironFootballPlayer 18908
dbo:MusicalArtist 480
dbo:RacingDriver 1765
dbo:Software 20419
dbo:Song 1155
dbo:EducationalInstitution 53057
dbo:MusicalWork 199355
dbo:NaturalEvent 1191
dbo:RaceTrack 242
dbo:Gene 90
dbo:Cartoon 6373
dbo:Cleric 12842
dbo:Database 358
http://dbpedia.org/ontology/%3Chttp://purl.org/dc/terms/Jurisdiction%3E 24409
dbo:Instrumentalist 151
dbo:LegalCase 2724
dbo:Name 4361
dbo:OrganisationMember 323111
... ...
dbo:Device 25748
dbo:Engine 18829
dbo:FictionalCharacter 7990
dbo:MotorcycleRider 701
dbo:Olympics 4032
dbo:Person 1014819
dbo:Plant 5062
dbo:Royalty 799
dbo:Settlement 230132
dbo:Species 295514
dbo:SportsEvent 23588
dbo:SportsTeam 30031
dbo:TimePeriod 922533
dbo:Wrestler 470
dbo:WrittenWork 62761
dbo:ArchitecturalStructure 188172
dbo:FootballLeagueSeason 3348
dbo:SportsTeamSeason 35360
dbo:GeneLocation 86
dbo:Athlete 298681
dbo:FloweringPlant 381
dbo:Stream 28289
dbo:ClericalAdministrativeRegion 3250
dbo:Coach 6985
dbo:Horse 3855
dbo:Location 816252
dbo:Region 24409
dbo:Satellite 2137
dbo:SportsManager 17654
dbo:Tower 1868

108 rows × 1 columns

time: 1min 39s

I can store these facts in an RDF graph (instead of a Pandas DataFrame) by using a CONSTRUCT query (instead of a SELECT query). To capture the results of a GROUP BY query, however, I have to use a subquery. This is because SPARQL does not allow expressions in the CONSTRUCT template, only terms and variables, so I have to evaluate expressions (such as COUNT(*)) somewhere else.

The resulting query is straightforward, even if it looks a little awkward with all the braces: roughly I cut and pasted the above SELECT query into a CONSTRUCT query that defines the facts that will be emitted.


In [11]:
t_counts=endpoint.construct("""
   CONSTRUCT {
      ?type summary:count ?cnt .
   } WHERE { 
       {
            SELECT ?type (COUNT(*) AS ?cnt) {
                GRAPH ?_tgraph { ?s a ?type . }
                FILTER(STRSTARTS(STR(?type),"http://dbpedia.org/ontology/"))
            } GROUP BY ?type
       } 
   }
""")


time: 1min 40s
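The two-step shape of that query (aggregate in a subquery, then emit one triple per result row) can be mimicked in plain Python. This is an analogy, not what the endpoint does internally; the sample facts are hypothetical.

```python
from collections import Counter

SUMMARY_COUNT = "http://rdf.ontology2.com/summary/count"

# Hypothetical rdf:type facts as (subject, type) pairs.
type_facts = [
    ("Alice", "http://dbpedia.org/ontology/Person"),
    ("Bob", "http://dbpedia.org/ontology/Person"),
    ("Alice", "http://dbpedia.org/ontology/Actor"),
]

# Step 1 (the inner SELECT ... GROUP BY): compute one count per type.
counts = Counter(rdf_type for _, rdf_type in type_facts)

# Step 2 (the CONSTRUCT template): emit one triple per result row.
triples = sorted((rdf_type, SUMMARY_COUNT, n) for rdf_type, n in counts.items())

for triple in triples:
    print(triple)
```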

I can count the facts in the resulting graph (the same as the number of rows in the SELECT query):


In [31]:
len(t_counts)


Out[31]:
108
time: 3 ms

And here is a sample fact:


In [40]:
next(iter(t_counts))


Out[40]:
(rdflib.term.URIRef('http://dbpedia.org/ontology/Book'),
 rdflib.term.URIRef('http://rdf.ontology2.com/summary/count'),
 rdflib.term.Literal('22', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#integer')))
time: 4 ms

Note that in the DBpedia Ontology there are a number of other facts about dbo:Book, so if I add the above fact to my copy of the DBpedia Ontology, SPARQL queries will be able to pick up the count together with all the other facts.

Counting "Specific Properties"

If I count properties in the "specific mappingbased properties" graph, I find that these are all properties that have the class name baked in:


In [12]:
endpoint.select("""
   SELECT ?p (COUNT(*) AS ?cnt) {
       GRAPH ?_pgraph { ?s ?p ?o . }
    } GROUP BY ?p
""")


Out[12]:
cnt
p
http://dbpedia.org/ontology/Canal/originalMaximumBoatLength 5
http://dbpedia.org/ontology/Engine/length 16
http://dbpedia.org/ontology/Engine/powerOutput 193
http://dbpedia.org/ontology/Planet/averageSpeed 650
http://dbpedia.org/ontology/Planet/density 123
http://dbpedia.org/ontology/Infrastructure/length 23124
http://dbpedia.org/ontology/Lake/volume 1401
http://dbpedia.org/ontology/Person/weight 66144
http://dbpedia.org/ontology/PopulatedPlace/area 41718
http://dbpedia.org/ontology/PopulatedPlace/populationDensity 75629
http://dbpedia.org/ontology/Software/fileSize 816
http://dbpedia.org/ontology/SpaceShuttle/distance 5
http://dbpedia.org/ontology/SpaceShuttle/timeInSpace 7
http://dbpedia.org/ontology/Planet/apoapsis 3018
http://dbpedia.org/ontology/Engine/displacement 335
http://dbpedia.org/ontology/Canal/maximumBoatBeam 79
http://dbpedia.org/ontology/MeanOfTransportation/diameter 219
http://dbpedia.org/ontology/Planet/volume 62
http://dbpedia.org/ontology/PopulatedPlace/populationUrbanDensity 162
http://dbpedia.org/ontology/Stream/dischargeAverage 148
http://dbpedia.org/ontology/Stream/maximumDischarge 975
http://dbpedia.org/ontology/Engine/cylinderBore 400
http://dbpedia.org/ontology/Engine/height 12
http://dbpedia.org/ontology/Engine/pistonStroke 372
http://dbpedia.org/ontology/Engine/torqueOutput 67
http://dbpedia.org/ontology/GrandPrix/course 2635
http://dbpedia.org/ontology/Planet/maximumTemperature 45
http://dbpedia.org/ontology/School/campusSize 1039
http://dbpedia.org/ontology/Stream/discharge 2133
http://dbpedia.org/ontology/Weapon/width 1416
... ...
http://dbpedia.org/ontology/Weapon/weight 3387
http://dbpedia.org/ontology/Planet/orbitalPeriod 3197
http://dbpedia.org/ontology/Planet/surfaceArea 41
http://dbpedia.org/ontology/SpaceStation/volume 27
http://dbpedia.org/ontology/Stream/minimumDischarge 915
http://dbpedia.org/ontology/Automobile/fuelCapacity 11
http://dbpedia.org/ontology/Canal/originalMaximumBoatBeam 7
http://dbpedia.org/ontology/Engine/width 12
http://dbpedia.org/ontology/Engine/weight 48
http://dbpedia.org/ontology/GrandPrix/distance 2595
http://dbpedia.org/ontology/PopulatedPlace/areaMetro 766
http://dbpedia.org/ontology/Rocket/lowerEarthOrbitPayload 55
http://dbpedia.org/ontology/Weapon/diameter 666
http://dbpedia.org/ontology/Planet/meanTemperature 59
http://dbpedia.org/ontology/Astronaut/timeInSpace 419
http://dbpedia.org/ontology/Automobile/wheelbase 5331
http://dbpedia.org/ontology/Lake/shoreLength 1592
http://dbpedia.org/ontology/MeanOfTransportation/weight 3173
http://dbpedia.org/ontology/MeanOfTransportation/width 5721
http://dbpedia.org/ontology/Planet/meanRadius 45
http://dbpedia.org/ontology/PopulatedPlace/areaTotal 162857
http://dbpedia.org/ontology/Rocket/mass 188
http://dbpedia.org/ontology/Weapon/length 3415
http://dbpedia.org/ontology/Work/runtime 261729
http://dbpedia.org/ontology/Weapon/height 1412
http://dbpedia.org/ontology/Canal/maximumBoatLength 77
http://dbpedia.org/ontology/Planet/minimumTemperature 41
http://dbpedia.org/ontology/Planet/periapsis 3031
http://dbpedia.org/ontology/PopulatedPlace/areaUrban 655
http://dbpedia.org/ontology/PopulatedPlace/populationMetroDensity 246

69 rows × 1 columns

time: 418 ms

In [13]:
sp_count=endpoint.construct("""
    CONSTRUCT {
      ?p summary:count ?cnt .
    } WHERE { {
        SELECT ?p (COUNT(*) AS ?cnt) {
           GRAPH ?_pgraph { ?s ?p ?o . }
        } GROUP BY ?p
   } }
""")


time: 416 ms

Other Ontology properties

That raises the question of which graphs the other properties are stored in. Searching for dbo:birthDate, I find the location of ordinary literal properties (whose values could be dates, numbers, or strings):


In [14]:
endpoint.select("""
    select ?g (COUNT(*) AS ?cnt) {
       GRAPH ?g { ?a dbo:birthDate ?b } .
    } GROUP BY ?g
""")


Out[14]:
cnt
g
http://downloads.dbpedia.org/2016-04/core-i18n/en/mappingbased_literals_en.ttl.bz2 819371
http://downloads.dbpedia.org/2016-04/core-i18n/en/persondata_en.ttl.bz2 730541
http://downloads.dbpedia.org/2016-04/core-i18n/en/citedFacts_en.ttl.bz2 6658
time: 275 ms

A search for dbo:child turns up object properties (which point to a URI reference):


In [15]:
endpoint.select("""
    select ?g (COUNT(*) AS ?cnt) {
       GRAPH ?g { ?a dbo:child ?b } .
    } GROUP BY ?g
""")


Out[15]:
cnt
g
http://downloads.dbpedia.org/2016-04/core-i18n/en/mappingbased_objects_en.ttl.bz2 14456
http://downloads.dbpedia.org/2016-04/core-i18n/en/mappingbased_objects_disjoint_range_en.ttl.bz2 112
http://downloads.dbpedia.org/2016-04/core-i18n/en/citedFacts_en.ttl.bz2 91
http://downloads.dbpedia.org/2016-04/core-i18n/en/mappingbased_objects_uncleaned_en.ttl.bz2 14568
time: 247 ms

In [16]:
lgraph=URIRef("http://downloads.dbpedia.org/2016-04/core-i18n/en/mappingbased_literals_en.ttl.bz2")
ograph=URIRef("http://downloads.dbpedia.org/2016-04/core-i18n/en/mappingbased_objects_en.ttl.bz2")


time: 1 ms

Counting All Properties

By taking a UNION I can count the "specific", object, and literal properties. The DataFrame looks OK, so I decide to save these counts into a graph.


In [17]:
endpoint.select("""
   SELECT ?p (COUNT(*) AS ?cnt) {
       { 
           GRAPH ?_pgraph { 
               ?s ?p ?o .      
           }
       } UNION { 
           GRAPH ?_ograph {
               ?s ?p ?o .
           }
       } UNION {
           GRAPH ?_lgraph {
               ?s ?p ?o .
           }
       }
    } GROUP BY ?p
""")


Out[17]:
cnt
p
dbo:film 20
dbo:headteacher 1
dbo:poleDriverCountry 84
dbo:numberOfClassrooms 2
dbo:conservationStatus 48437
dbo:dateOfBurial 12
dbo:established 2463
dbo:firstPublicationYear 6390
dbo:isniId 50
dbo:lastElectionDate 844
dbo:numberOfVineyards 60
dbo:plays 3720
dbo:testaverage 69
dbo:acquirementDate 5244
dbo:bedCount 1933
dbo:buildingStartYear 2251
dbo:ceeb 1152
dbo:centuryBreaks 356
dbo:chairmanTitle 5001
dbo:closingDate 1638
dbo:closingYear 3655
dbo:configuration 17234
dbo:diameter 1306
dbo:discharge 2141
dbo:electionDateLeader 542
dbo:elevationQuote 2
dbo:fees 159
dbo:firstAirDate 13743
dbo:firstGame 59
dbo:formerCallsign 19894
... ...
dbo:jstor 451
dbo:whaDraft 357
dbo:areaRural 20
dbo:productionEndDate 13
dbo:mgiid 9
dbo:classes 390
dbo:schoolNumber 580
dbo:startDate 9444
dbo:fansgroup 29
dbo:internationally 4303
dbo:lccn 3000
dbo:nutsCode 20
dbo:rocketStages 192
dbo:routeTypeAbbreviation 17595
dbo:tournamentRecord 831
dbo:characterInPlay 5373
dbo:digitalSubChannel 4440
dbo:frequencyOfPublication 6082
dbo:recommissioningDate 925
dbo:successfulLaunches 258
dbo:originalMaximumBoatBeam 7
dbo:argueDate 13
dbo:lastFlightStartDate 7
dbo:volumeQuote 1
dbo:distanceToEdinburgh 140
dbo:issDockings 3
dbo:maximumDepthQuote 1
dbo:reservations 58
dbo:statisticValue 8
dbo:throwingSide 655

1438 rows × 1 columns

time: 3.16 s

In [18]:
p_counts=endpoint.construct("""
    CONSTRUCT {
       ?p summary:count ?cnt .
    } WHERE {
        {
            SELECT ?p (COUNT(*) AS ?cnt) {
               { 
                   GRAPH ?_pgraph { 
                       ?s ?p ?o .      
                   }
               } UNION { 
                   GRAPH ?_ograph {
                       ?s ?p ?o .
                   }
               } UNION {
                   GRAPH ?_lgraph {
                       ?s ?p ?o .
                   }
               }
            } GROUP BY ?p
        }
    }
""")


time: 3.09 s

In [19]:
len(p_counts)


Out[19]:
1438
time: 2.99 ms

Counting datatypes

In RDF, a Class is a kind of type that represents a "Thing" in the world. Datatypes, on the other hand, are types that represent literal values. The most famous datatypes in RDF come from the XML Schema Datatypes and represent things such as integers, dates, and strings.

RDF also allows us to define custom datatypes, which are specified with URIs, like most things in RDF.

A GROUP BY query reveals the prevalence of various datatypes, which I then dump to a graph.

There still are some big questions to research such as "does the same property turn up with different units?" For instance, it is very possible that a length could be represented in kilometers, centimeters, feet, or furlongs. You won't get the right answer, however, if you try to add multiple lengths in different units that are all represented as floats. Thus it may be necessary at some point to build a bridge to a package like numericalunits or alternately build something that canonicalizes them.
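As a rough idea of what "something that canonicalizes them" might look like, here is a minimal sketch that converts lengths to metres based on the datatype URI. The function name `to_metres` is hypothetical, and the table only covers the four metric length datatypes that show up in this dataset's counts; anything else is rejected rather than guessed at.

```python
DATATYPE_NS = "http://dbpedia.org/datatype/"

# Conversion factors to metres for a handful of DBpedia length datatypes.
TO_METRES = {
    DATATYPE_NS + "kilometre": 1000.0,
    DATATYPE_NS + "metre": 1.0,
    DATATYPE_NS + "centimetre": 0.01,
    DATATYPE_NS + "millimetre": 0.001,
}


def to_metres(value, datatype):
    """Convert a length to metres, refusing datatypes we do not know."""
    try:
        return value * TO_METRES[datatype]
    except KeyError:
        raise ValueError(f"not a known length datatype: {datatype}") from None


print(to_metres(2.0, DATATYPE_NS + "kilometre"))  # prints: 2000.0
```

Raising on unknown datatypes is a deliberate choice here: silently treating a value in an unrecognized unit as metres is exactly the mistake this is meant to prevent.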


In [20]:
endpoint.select("""
   SELECT ?datatype (COUNT(*) AS ?cnt) {
       { 
           GRAPH ?_pgraph { 
               ?s ?p ?o .      
           }
       } UNION {
           GRAPH ?_lgraph {
               ?s ?p ?o .
           }
       }
       BIND(DATATYPE(?o) AS ?datatype)
    } GROUP BY ?datatype
""")


Out[20]:
cnt
datatype
http://dbpedia.org/datatype/kilometre 36045
http://dbpedia.org/datatype/kelvin 559
http://dbpedia.org/datatype/millimetre 59730
http://dbpedia.org/datatype/centimetre 148105
http://dbpedia.org/datatype/metre 387
http://dbpedia.org/datatype/litre 11
http://dbpedia.org/datatype/newtonMetre 67
xsd:nonNegativeInteger 824348
xsd:string 3571269
http://dbpedia.org/datatype/usDollar 56793
http://dbpedia.org/datatype/norwegianKrone 549
http://dbpedia.org/datatype/russianRouble 92
http://dbpedia.org/datatype/swissFranc 251
http://dbpedia.org/datatype/indianRupee 118
xsd:integer 1248777
http://dbpedia.org/datatype/tanzanianShilling 35
http://dbpedia.org/datatype/southKoreanWon 63
http://dbpedia.org/datatype/nicaraguanCórdoba 18
http://dbpedia.org/datatype/iranianRial 6
http://dbpedia.org/datatype/rwandaFranc 24
http://dbpedia.org/datatype/mauritianRupee 6
http://dbpedia.org/datatype/ukrainianHryvnia 28
http://dbpedia.org/datatype/renminbi 37
http://dbpedia.org/datatype/moldovanLeu 6
http://dbpedia.org/datatype/australianDollar 39
http://dbpedia.org/datatype/trinidadAndTobagoDollar 1
http://dbpedia.org/datatype/peruvianNuevoSol 5
http://dbpedia.org/datatype/gambianDalasi 2
http://dbpedia.org/datatype/bulgarianLev 5
http://dbpedia.org/datatype/maldivianRufiyaa 3
... ...
http://dbpedia.org/datatype/indonesianRupiah 26
http://dbpedia.org/datatype/qatariRial 7
http://dbpedia.org/datatype/thaiBaht 40
http://dbpedia.org/datatype/jordanianDinar 8
http://dbpedia.org/datatype/icelandKrona 35
http://dbpedia.org/datatype/lithuanianLitas 10
http://dbpedia.org/datatype/turkishLira 8
http://dbpedia.org/datatype/malawianKwacha 6
http://dbpedia.org/datatype/ghanaianCedi 11
http://dbpedia.org/datatype/hungarianForint 18
http://dbpedia.org/datatype/romanianNewLeu 15
http://dbpedia.org/datatype/bangladeshiTaka 25
http://dbpedia.org/datatype/nepaleseRupee 12
http://dbpedia.org/datatype/myanmaKyat 1
http://dbpedia.org/datatype/sierraLeoneanLeone 1
http://dbpedia.org/datatype/brazilianReal 3
http://dbpedia.org/datatype/newZealandDollar 12
http://dbpedia.org/datatype/estonianKroon 2
http://dbpedia.org/datatype/latvianLats 11
http://dbpedia.org/datatype/bahrainiDinar 1
http://dbpedia.org/datatype/honduranLempira 1
http://dbpedia.org/datatype/chileanPeso 1
http://dbpedia.org/datatype/iraqiDinar 1
http://dbpedia.org/datatype/guineaFranc 1
http://dbpedia.org/datatype/newTaiwanDollar 2
http://dbpedia.org/datatype/papuaNewGuineanKina 2
http://dbpedia.org/datatype/israeliNewSheqel 2
xsd:anyURI 25
http://dbpedia.org/datatype/fuelType 457
http://dbpedia.org/datatype/valvetrain 495

129 rows × 1 columns

time: 1min 36s

In [21]:
dt_counts=endpoint.construct("""
    CONSTRUCT {
       ?datatype summary:count ?cnt .
    } WHERE {
       SELECT ?datatype (COUNT(*) AS ?cnt) {
           { 
               GRAPH ?_pgraph { 
                   ?s ?p ?o .      
               }
           } UNION {
               GRAPH ?_lgraph {
                   ?s ?p ?o .
               }
           }
           BIND(DATATYPE(?o) AS ?datatype)
        } GROUP BY ?datatype
    }
""")


time: 1min 35s

Writing to disk

RDFlib overloads the '+' operator so that we can easily merge the type, property and datatype counts into one (modestly sized) graph.
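Conceptually, merging graphs this way is a set union of triples, so a fact present in more than one input appears only once in the result. The toy class below illustrates that idea with operator overloading; it is a stand-in I made up for illustration and is not RDFlib's implementation.

```python
class TripleSet:
    """Toy stand-in for a graph: an unordered set of (s, p, o) tuples."""

    def __init__(self, triples=()):
        self.triples = set(triples)

    def __add__(self, other):
        # Return a new object, leaving both operands unchanged.
        return TripleSet(self.triples | other.triples)

    def __len__(self):
        return len(self.triples)


a = TripleSet([("dbo:Person", "summary:count", 10)])
b = TripleSet([("dbo:child", "summary:count", 3)])
c = TripleSet([("dbo:child", "summary:count", 3)])  # duplicate of the fact in b

merged = a + b + c
print(len(merged))  # prints: 2
```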


In [25]:
all_counts = t_counts + p_counts + dt_counts


time: 182 ms

I add a few prefix declarations for (human) readability, then write the data to disk in Turtle format. I was tempted to write it to a relative path that would put this file in its final destination (underneath the local notebook directory, where it could be found by notebooks), but decided against it, since I don't want to take the chance of me (or you) trashing the project by mistake. Instead I'll have to copy the file into place later.


In [28]:
all_counts.bind("datatype","http://dbpedia.org/datatype/")
all_counts.bind("dbo","http://dbpedia.org/ontology/")
all_counts.bind("summary","http://rdf.ontology2.com/summary/")
all_counts.serialize("/data/schema_counts.ttl",format='ttl',encoding='utf-8')


time: 507 ms

Bonus File: Human Dimensions

While I had my copy of DBpedia running, I thought I'd gather a data set that would be worth making visualizations of. Quite a lot of data exists in DBpedia concerning people's body dimensions, so I decided to run a query and save the data for future use.


In [22]:
dimensions=endpoint.select("""
select ?p ?height ?weight {
    GRAPH ?_pgraph {
        ?p <http://dbpedia.org/ontology/Person/weight> ?weight .
        ?p <http://dbpedia.org/ontology/Person/height> ?height .
    }
}
""")


time: 51 s

In [23]:
dimensions


Out[23]:
p height weight
0 <Alexander_Hug_(rugby_union)> 188.00 91.000000
1 <Anderson_Silva> 187.96 83.916000
2 <Andrew_Gee> 183.00 102.000000
3 <Bernard_Ackah> 185.42 90.720000
4 <Billy_Brandt> 177.80 74.844000
5 <Bob_Beamon> 191.00 70.000000
6 <Caleb_Moore> 177.80 72.576000
7 <Charles_Hamelin> 175.00 71.000000
8 <Charmaine_Sinclair> 172.72 57.152639
9 <Christina_Von_Eerie> 162.56 55.792800
10 <Daniela_Hantuchová> 181.00 62.000000
11 <Denice_Klarskov> 170.18 50.000000
12 <Eamon_Sullivan> 191.00 74.000000
13 <Folke_Jansson> 187.00 80.000000
14 <Forrest_Towns> 188.00 75.000000
15 <Franco_Columbu> 164.00 88.000000
16 <Frank_Zane> 175.26 83.916000
17 <Frederique_van_der_Wal> 177.80 61.236000
18 <Félix_Sánchez> 178.00 73.000000
19 <Georg_Lammers> 178.00 84.000000
20 <Gloria_Leonard> 172.72 70.761600
21 <Gory_Guerrero> 175.00 95.000000
22 <Guillaume_LeBlanc> 183.00 74.000000
23 <Habiba_Ghribi> 174.00 49.000000
24 <Haile_Gebrselassie> 165.00 56.000000
25 <Hans-Joachim_Reske> 184.00 80.000000
26 <Harald_Andersson> 191.00 99.000000
27 <Heinz-Joachim_Rothenburg> 185.00 118.000000
28 <Ivan_Ivančić> 193.04 127.915200
29 <Jan_Henne> 152.40 63.957600
... ... ... ...
41633 <Brett_McDermott__Brett_McDermott__1> 180.00 93.000000
41634 <Attila_Czene__Attila_Czene__1> 185.00 76.000000
41635 <Adam_Braidwood__Adam_Braidwood__1> 193.04 122.472000
41636 <Clarence_Childs__1> 183.00 102.000000
41637 <Eugena_Washington__Eugena_Washington__1> 152.40 58.514400
41638 <Niall_Breslin__Niall_Breslin__1> 198.00 100.000000
41639 <Matt_Ghaffari__Matt_Ghaffari__1> 182.88 127.008000
41640 <Paul_Kelly_(fighter)__Paul_Kelly__1> 175.26 70.308000
41641 <Tünde_Szabó__Tünde_Szabó__1> 175.00 60.000000
41642 <Muhammed_Lawal__Muhammed_Lawal__1> 0.00 92.988000
41643 <Rodney_Glunder__Rodney_Glunder__1> 185.42 113.400000
41644 <Garth_Wood__Garth_Wood__1> 179.00 80.000000
41645 <Achim_Albrecht__Achim_Albrecht__1> 180.34 125.647200
41646 <Anna_Bogomazova__Anna_Bogomazova__1> 185.42 70.308000
41647 <Elisabetta_Dessy__Elisabetta_Dessy__1> 180.00 58.000000
41648 <Paul_Schaus__Paul_Schaus__1> 152.40 70.308000
41649 <Arnold_Jackson_(British_Army_officer)__Arnold_Jackson_1912.jpg__1> 176.00 67.000000
41650 <Chris_Laidlaw__1> 175.00 78.000000
41651 <Chyna__Chyna__1> 177.80 81.648000
41652 <Ernie_Ladd__Ernie_Ladd__1> 205.74 145.152000
41653 <Herschel_Walker__Herschel_Walker__1> 185.42 99.792000
41654 <Ken_Shamrock__Ken_Shamrock__1> 185.42 110.224800
41655 <Ron_Clarke__2> 183.00 72.000000
41656 <Adam_Jones_(American_football)__Adam_Jones__1> 0.00 83.916000
41657 <Christi_Wolf__Christi_Wolf__1> 160.02 68.040000
41658 <Don_Frye__Don_Frye__1> 185.00 110.000000
41659 <Shaggy_2_Dope__Shaggy_2_Dope__1> 187.96 104.328000
41660 <Victoria_Zdrok__Victoria_Nika_Zdrok__1> 175.26 54.432000
41661 <Violent_J__Violent_J__1> 190.50 127.008000
41662 <Katsuyori_Shibata__Katsuyori_Shibata__1> 183.00 103.000000

41663 rows × 3 columns

time: 25 ms

The data looks a bit messy. Most noticeably, I see quite a few facts which, instead of pointing to DBpedia concepts, point to synthetic URLs (such as <Ron_Clarke__2>) which are supposed to represent 'topics' such as the time that a particular employee worked for a particular employer. (See this notebook for some discussion of the phenomenon.)

Filtering these out will not be hard, as these synthetic URLs all contain two consecutive underscores.

I also think it's suspicious that a few people have a height of 0.0, which might be in the underlying data, or might be because Gastrodon is not properly handling a missing data value.
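A later cleaning pass could look something like the sketch below, which works on plain tuples rather than the DataFrame. The first three rows are taken from the output above; the last row and the helper name `is_clean` are hypothetical. Dropping zero heights is my assumption about how to handle them, since their cause is not yet known.

```python
# (name, height, weight) rows; the last one is a made-up example.
rows = [
    ("Bob_Beamon", 191.00, 70.0),
    ("Ron_Clarke__2", 183.00, 72.0),
    ("Muhammed_Lawal__Muhammed_Lawal__1", 0.00, 92.988),
    ("Example_Person", 0.00, 80.0),  # hypothetical: real name, zero height
]


def is_clean(name, height, weight):
    """Keep rows with an ordinary resource name and positive measurements."""
    return "__" not in name and height > 0 and weight > 0


clean = [row for row in rows if is_clean(*row)]
print(clean)  # prints: [('Bob_Beamon', 191.0, 70.0)]
```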

It would certainly be possible to serialize these results into an RDF graph, but instead I write them into a CSV for simplicity.


In [24]:
dimensions.to_csv("/data/people_weight.csv.gz",compression="gzip",encoding="utf-8")


time: 504 ms

Conclusion

To continue the analysis I began here, I needed a count of how often various classes, properties, and datatypes were used in DBpedia. API limits could make getting this data from the public SPARQL endpoint challenging, so I decided to run queries against my own private SPARQL endpoint powered by the Ontology2 Edition of DBpedia.

After setting up connection information, connecting to this private endpoint turned out to be as simple as connecting to a public endpoint, and I was able to efficiently get the data I needed into an RDF graph. That graph is ready to merge with the DBpedia Ontology graph to make a more meaningful analysis of the data in DBpedia, towards the goal of producing interesting and attractive visualizations.

