In this webinar we will use the Python programming language to manipulate MARC21 and MARCXML into BIBFRAME, Dublin Core, and Schema.org graphs made up of RDF triples, and then expose this linked data as RDF/XML, RDF N-Triples, HTML5 RDFa, HTML5 microdata, and JSON-LD (JSON Linked Data).
In [1]:
from IPython.display import Image
lita = Image(filename="static/img/lita-logo.png")
lita
Out[1]:
MARC21 is a binary format developed by the Library of Congress. Dealing with raw MARC21 can be challenging because MARC21 combines both fixed-length and variable-length fields. Fortunately, a number of open-source libraries exist for manipulating MARC21 records that hide some of the complexity of the format. First we will use Ed Summers's PyMARC module for our coding experiments.
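The fixed-length portion just mentioned can be sketched with plain string slicing: every MARC21 record opens with a 24-character leader whose positions have fixed meanings. The leader value below is synthetic, for illustration only.

```python
# A synthetic 24-character MARC21 leader; each position has a fixed meaning.
leader = "00714cam a2200205 a 4500"

record_length = int(leader[0:5])   # total record length in bytes
record_status = leader[5]          # 'c' = corrected or revised
type_of_record = leader[6]         # 'a' = language material
bib_level = leader[7]              # 'm' = monograph
base_address = int(leader[12:17])  # offset where the variable fields begin

print(record_length, type_of_record, bib_level, base_address)
```

PyMARC reads this leader (plus the directory and variable fields) for us, which is why we rarely slice it by hand.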
For this session our initial dataset is made up of random MARC21 records exported from Colorado College's ILS.
In [2]:
import pymarc

marc_records = []
# MARC21 is a binary format, so open the file in binary mode
with open("static/marc/marc-sample.mrc", "rb") as marc_file:
    marc_reader = pymarc.MARCReader(marc_file)
    for rec in marc_reader:
        marc_records.append(rec)
In [ ]:
In [ ]:
In [3]:
print(marc_records[1905])
[Write Answer here]
In [4]:
for i, rec in enumerate(marc_records[0:100]):
    print("{0} {1} {2}".format(i,
                               rec.title(),
                               rec.author()))
In [5]:
# Unicode values in these records
print(marc_records[13].author())
These MARC21 records are not encoded correctly for Unicode; São Paulo, Brazil should be displayed as:
In [6]:
print(u"São Paulo, Brazil")
In [7]:
for row in dir(marc_records[8842]):
    if row.startswith("_"):  # Filter out internal properties
        continue
    print(row)
In [ ]:
In [ ]:
In [9]:
def classify_marc21_schema(record):
    """Classifies a MARC21 record as a specific Work class based on the BIBFRAME website"""
    leader = record.leader
    field007 = record['007']
    field336 = record['336']
    work_class = None
    if leader[6] == 'a':
        if field007 is not None:
            test_value = field007.data[0]
            if test_value == 'a' or test_value == 'd':
                work_class = 'Map'  # http://schema.org/Map
            elif test_value == 'h':  # Microfilm
                work_class = 'Photograph'  # http://schema.org/Photograph
            elif test_value in ['m', 'v']:
                work_class = 'VideoObject'  # http://schema.org/VideoObject
        else:
            # Book is the default for Language Material
            work_class = 'Book'
    elif leader[6] == 'e' or leader[6] == 'f':
        # Map is the default
        work_class = 'Map'
        if field007 is not None:
            if field007.data[0] == 'r':
                work_class = 'Dataset'
    elif leader[6] == 'g':
        work_class = 'Photograph'
    elif leader[6] == 'i':
        work_class = 'AudioObject'  # http://schema.org/AudioObject
    elif leader[6] == 'j':
        work_class = 'MusicRecording'
    elif leader[6] == 'k':
        work_class = 'Photograph'
    elif leader[6] == 'm':
        work_class = 'SoftwareApplication'
    if work_class is None:
        work_class = 'CreativeWork'
    return work_class
Using the classify_marc21_schema function, we can now create some summary statistics about our MARC21 dataset by looping through our list of MARC21 records.
In [10]:
class_counters = {}
for i, record in enumerate(marc_records):
    result = classify_marc21_schema(record)
    if result not in class_counters:
        class_counters[result] = 1
    else:
        class_counters[result] += 1
print(class_counters)
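The tally above can also be written with the standard library's collections.Counter; the labels list below is a stand-in for the values classify_marc21_schema would return for each record.

```python
from collections import Counter

# Stand-in list of class labels; in the notebook these would be
# classify_marc21_schema(record) for each record in marc_records.
labels = ['Book', 'Map', 'Book', 'VideoObject', 'Book']
tally = Counter(labels)
print(dict(tally))
```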
In [1]:
import matplotlib.pyplot as plt
%matplotlib inline
schema_org_fig = plt.figure()
axes = schema_org_fig.add_axes([0.1, 0.1, 0.8, 0.8])
axes.pie(class_counters.values(), labels=class_counters.keys(), autopct='%.2f')
axes.set_title('Number of Schema.org Classes in MARC21 Records');
In [ ]:
In [12]:
from rdflib import Namespace
from lxml import etree

MARC_NS = Namespace('http://www.loc.gov/MARC21/slim')
marc_xml = etree.parse('static/marc/marc-sample.xml')
# findall paths are relative to the root element, so no leading slash
xml_marc_records = marc_xml.findall('{{{0}}}record'.format(MARC_NS))
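The Clark-notation {namespace}tag syntax used in findall can be sketched against a tiny inline MARCXML fragment; the standard-library ElementTree used here shares this findall syntax with lxml.

```python
import xml.etree.ElementTree as ET

marc_ns = 'http://www.loc.gov/MARC21/slim'
# A tiny inline MARCXML fragment standing in for the sample file
xml_fragment = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record><leader>00000nam a2200000 a 4500</leader></record>
  <record><leader>00000nam a2200000 a 4500</leader></record>
</collection>"""
root = ET.fromstring(xml_fragment)
# Every element is namespaced, so the {namespace}tag form is required
records = root.findall('{{{0}}}record'.format(marc_ns))
print(len(records))
```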
In [13]:
print(etree.tostring(xml_marc_records[13], pretty_print=True))
[Answer here]
In [ ]:
len
In [ ]:
In [14]:
xpath_string = "{{{0}}}record/{{{0}}}datafield[@tag='245']/".format(MARC_NS)
title_elements = marc_xml.findall(xpath_string)
for i, element in enumerate(title_elements[0:100]):
    print("{0} subfield {1}: {2}".format(i, element.attrib.get('code'), element.text))
The title_elements list includes all of the subfields from the 245 field.
In [15]:
rec_245s = xml_marc_records[13].findall("{{{0}}}datafield[@tag='245']/".format(MARC_NS))
for element in rec_245s:
    print(etree.tostring(element))
We are now going to create a function that takes either a MARC21 or MARCXML record and returns the Library of Congress or local call number for the record.
In [16]:
def get_call_number(record):
    """Function retrieves an LOC or local call number from the 090 MARC field of either
    a MARC21 or MARC XML record

    :param record: MARC record
    """
    if type(record) == pymarc.Record:
        all_090s = record.get_fields('090')
        return ''.join([''.join(x.get_subfields('a')) for x in all_090s])
    else:
        all_090s = record.findall("{{{0}}}datafield[@tag='090']/{{{0}}}subfield[@code='a']".format(MARC_NS))
        return ''.join([x.text for x in all_090s])
In [17]:
print(get_call_number(marc_records[13]))
print(get_call_number(xml_marc_records[78]))
What is returned when get_call_number is given a MARC record that does not have a MARC field of 090? [Write answer here]
In [ ]:
[Write answer here]
In this task we will use the schema.org vocabulary to create JSON linked data from our MARC records in both MARC21 and MARC XML formats. Below is an example of schema.org JSON-LD (short for JSON Linked Data) for one of the resources used in this presentation, an article titled Linking Things on the Web: A Pragmatic Examination of Linked Data for Libraries, Archives and Museums by Ed Summers and Dorothea Salo.
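The JSON-LD for this article can be sketched as follows. This is an illustrative reconstruction: only the title and authors come from the text above, and the ScholarlyArticle @type choice is an assumption.

```python
import json

# Illustrative schema.org JSON-LD for the Summers/Salo article;
# the @type is an assumption, not taken from the original notebook output.
article_jsonld = {
    "@context": {"@vocab": "http://schema.org/"},
    "@type": "ScholarlyArticle",
    "name": ("Linking Things on the Web: A Pragmatic Examination of "
             "Linked Data for Libraries, Archives and Museums"),
    "author": [
        {"@type": "Person", "name": "Ed Summers"},
        {"@type": "Person", "name": "Dorothea Salo"},
    ],
}
print(json.dumps(article_jsonld, indent=2))
```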
We first import the json Python module to work with JSON objects. The json module allows us to easily load JSON objects as native Python data structures like dict and list.
import json
Next we create a Python dict for MARC record 4190 in our dataset. We will use the MARC21 records, but we could have just as easily used the MARC XML records. The first line sets a @context property for the JSON, asserting that the vocabulary is from schema.org. Record 4190 is a book of poems for young people, so we classify the new JSON-LD graph as the schema.org type Book.
In [19]:
print(marc_records[4189])
record4190_json = {"@context": {"@vocab": "http://schema.org/"},
                   "@type": "Book"}
For the JSON-LD graph, we will map the MARC title to the schema.org name property.
In [20]:
record4190_json['name'] = marc_records[4189].title()
print(record4190_json)
In the JSON-LD graph for Between two Junes is a forest: a journal of everything, we create a Python dict for the author, assigning a schema.org type of Person.
In [22]:
record4190_json['author'] = {"@type": "Person", "name": marc_records[4189].author()}
print(record4190_json)
In [ ]:
In [ ]:
record9_json = json.loads(marc_records[9].as_json())
In [ ]:
print(json.dumps(record9_json, indent=2, sort_keys=True))
In [ ]:
print(json.dumps(record4190_json, indent=2, sort_keys=True))
In [23]:
DCTERMS = Namespace("http://purl.org/dc/terms/")
BIBFRAME = Namespace("http://bibframe.org/vocab/")
SCHEMA_ORG = Namespace("http://schema.org/")
from rdflib import Graph, BNode, Literal

bib_graph = Graph()
entities = Namespace('http://intro2libsys.info/lita-webinar-2014/entities/')
entity = entities.one
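Behind the entities.one expression, rdflib's Namespace is essentially a string subclass whose attribute access appends the attribute name. A minimal sketch of the idea (not rdflib's actual class):

```python
# A toy stand-in for rdflib's Namespace: attribute access builds a URI string.
class MiniNamespace(str):
    def __getattr__(self, name):
        return MiniNamespace(self + name)

ns = MiniNamespace('http://intro2libsys.info/lita-webinar-2014/entities/')
print(ns.one)  # http://intro2libsys.info/lita-webinar-2014/entities/one
```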
In [ ]:
First we will add schema.org properties to our first entity in the bib_graph.
In [25]:
from rdflib import URIRef

bib_graph.add((entity,
               SCHEMA_ORG.type,
               URIRef("http://schema.org/Book")))
Second, we will add the schema.org copyrightYear to the first entity.
In [26]:
bib_graph.add((entity,
               SCHEMA_ORG.copyrightYear,
               Literal(marc_records[8]['260']['c'][1:])))
In [27]:
bib_graph.add((entity,
               DCTERMS.title,
               Literal(marc_records[8].title())))
bib_graph.add((entity,
               DCTERMS.creator,
               Literal(marc_records[8].author())))
for subject, predicate, obj in bib_graph:
    print("Subject: {0}\nPredicate: {1}\nObject: {2}".format(subject, predicate, obj))
    print("===")
if (entity, None, None) in bib_graph:
    print("Graph contains triples about the entity")
In [43]:
from rdflib.serializer import Serializer
from rdflib import plugin
bib_graph.namespace_manager.reset()
bib_graph.namespace_manager.bind("dc", "http://purl.org/dc/terms/")
bib_graph.namespace_manager.bind("schema", SCHEMA_ORG)
print(bib_graph.serialize(format='pretty-xml', indent=4))
In [29]:
# Print in Notation3 (N3) syntax
print(bib_graph.serialize(format='n3'))
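N-Triples itself is simple enough to sketch by hand: one triple per line, each terminated by " .". A toy serializer, assuming URI subjects and predicates and either URI or plain-literal objects:

```python
def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples as N-Triples lines."""
    lines = []
    for s, p, o in triples:
        # URIs go in angle brackets; anything else is treated as a plain literal
        obj = '<{0}>'.format(o) if o.startswith('http') else '"{0}"'.format(o)
        lines.append('<{0}> <{1}> {2} .'.format(s, p, obj))
    return '\n'.join(lines)

triples = [('http://example.org/e1', 'http://purl.org/dc/terms/title', 'Moby Dick')]
print(to_ntriples(triples))
```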
In [30]:
import urllib2
from rdflib import URIRef

def get_marc_lang_uri(marc_record):
    """Function returns the Library of Congress language URI based on the 008 MARC field
    of either a MARC21 or MARC XML record

    :param marc_record: MARC record
    """
    loc_lang_base = 'http://id.loc.gov/vocabulary/languages'
    if type(marc_record) == pymarc.Record:
        lang_code = marc_record['008'].data[35:38]  # language code is chars 35-37
    else:
        lang_code = marc_record.find("{{{0}}}controlfield[@tag='008']".format(MARC_NS)).text[35:38]
    loc_lang_uri = "{0}/{1}".format(loc_lang_base, lang_code).strip()
    try:
        if urllib2.urlopen(loc_lang_uri).code != 200:
            return
    except urllib2.HTTPError:
        # urlopen raises HTTPError for non-200 responses
        return
    return URIRef(loc_lang_uri)
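The 008 slicing above depends on the MARC21 rule that the language code occupies character positions 35-37 of the 40-character field. A sketch with a synthetic 008 value:

```python
# Build a synthetic 40-character 008 field with 'eng' at positions 35-37
field_008 = '850423s1985' + ' ' * 24 + 'eng' + ' d'
lang_code = field_008[35:38]
loc_lang_uri = 'http://id.loc.gov/vocabulary/languages/{0}'.format(lang_code)
print(loc_lang_uri)
```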
In [33]:
get_marc_lang_uri(marc_records[78])
Out[33]:
In [ ]:
In [33]:
entity2
In [ ]:
In [ ]:
In [ ]:
In [34]:
template = """<div vocab="http://purl.org/dc/terms/" typeof="{{ dc_type }}">
  <h2 property="title">{{ dc_title }}</h2>
  by <span property="creator">{{ dc_creator }}</span>
</div>"""
from jinja2 import Template

rdfa_template = Template(template)
print(rdfa_template.render(dc_type='text',
                           dc_creator=bib_graph.value(entity, DCTERMS.creator),
                           dc_title=bib_graph.value(entity, DCTERMS.title)))
Validate this MARC RDFa by copying the HTML and RDFa into http://www.w3.org/2012/pyRdfa/Validator.html#distill_by_input
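If jinja2 is not available, the same RDFa snippet can be rendered with str.format; the values below are hard-coded stand-ins for the bib_graph lookups.

```python
# Render the RDFa snippet with str.format; values are illustrative stand-ins.
rdfa = ('<div vocab="http://purl.org/dc/terms/" typeof="{dc_type}">\n'
        '  <h2 property="title">{dc_title}</h2>\n'
        '  by <span property="creator">{dc_creator}</span>\n'
        '</div>').format(dc_type='Text',
                         dc_title='Moby Dick',
                         dc_creator='Herman Melville')
print(rdfa)
```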
In [ ]:
In [35]:
english_json = json.load(urllib2.urlopen('http://id.loc.gov/vocabulary/languages/eng.skos.json'))
print(english_json[u'<http://id.loc.gov/vocabulary/languages/eng>'][u'<http://www.w3.org/2004/02/skos/core#prefLabel>'])
In [36]:
def get_marc_language_label(loc_language_uri):
    """Function returns the preferred label based on the Library of Congress MARC language code

    :param loc_language_uri: URI of Library of Congress Linked Data service
    """
    loc_language_uri = loc_language_uri.strip()
    prefLabel_uri = u'<http://www.w3.org/2004/02/skos/core#prefLabel>'
    loc_skos_uri = "{0}.skos.json".format(loc_language_uri)
    loc_language_key = u"<{0}>".format(loc_language_uri)
    lang_json = json.load(urllib2.urlopen(loc_skos_uri))
    return lang_json.get(loc_language_key).get(prefLabel_uri)[0].get("value", None)

print(get_marc_language_label('http://id.loc.gov/vocabulary/languages/eng'))
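The key juggling in get_marc_language_label is easier to see against an inline sample that mimics the shape of the LOC .skos.json response (URIs wrapped in angle brackets as keys, a list of value dicts per predicate):

```python
import json

# Inline sample mimicking the shape of the LOC .skos.json response
sample = json.loads("""{
  "<http://id.loc.gov/vocabulary/languages/eng>": {
    "<http://www.w3.org/2004/02/skos/core#prefLabel>": [
      {"value": "English", "type": "literal", "lang": "en"}
    ]
  }
}""")
uri = 'http://id.loc.gov/vocabulary/languages/eng'
label = sample['<{0}>'.format(uri)]['<http://www.w3.org/2004/02/skos/core#prefLabel>'][0]['value']
print(label)  # English
```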
In [ ]:
In [38]:
micro_template = """<div itemscope itemtype="{{ itemType }}">
  <h2 itemprop="name">{{ itemName }}</h2>
  by <span itemprop="author">{{ author }}</span>
</div>"""
In [39]:
micro_data_template = Template(micro_template)
print(micro_data_template.render(itemType=bib_graph.value(entity, SCHEMA_ORG.type),
                                 itemName=bib_graph.value(entity, DCTERMS.title),
                                 author=bib_graph.value(entity, DCTERMS.creator)))
Test how Google will extract the data using its rich snippets testing tools.
In [ ]:
In [ ]:
So far we've manipulated MARC records first into linked data triples and then into RDFa, RDF XML, and JSON-LD representations of those graphs. We have also started using Library of Congress Linked Data to associate the Dublin Core language with our entities. All of these critical steps allow us to publish MARC records as linked data. Next, we're going to switch up our tasks and look into how we can use other linked data resources to enhance the bibliographic graphs we created earlier in this session.
In [40]:
print(marc_records[191].title())
In [41]:
moby_dict_dbpedia = 'http://dbpedia.org/data/Moby-Dick'
moby_dict_url = "{0}.json".format(moby_dict_dbpedia)
print(moby_dict_url)
moby_dict_json = json.load(urllib2.urlopen(moby_dict_url))
In [42]:
print(moby_dict_json["http://dbpedia.org/resource/Moby-Dick"])
In [ ]: