LITA Linked Data Python Notebook

In this webinar we will use the Python programming language to transform MARC21 and MARCXML records into BIBFRAME, Dublin Core, and Schema.org graphs made up of RDF triples, and then expose this linked data as RDF/XML, RDF N-Triples, HTML5 RDFa, HTML5 microdata, and JSON-LD (JSON Linked Data).


In [1]:
from IPython.display import Image
lita = Image(filename="static/img/lita-logo.png")
lita


Out[1]:

Task #1: Reading MARC21 Files

MARC21 is a binary format developed by the Library of Congress. Working with raw MARC21 can be challenging because the format combines fixed-length and variable-length fields. Fortunately, a number of open-source libraries for manipulating MARC21 records hide some of that complexity. We will start our coding experiments with Ed Summers's pymarc module.

For this session, our initial dataset is made up of random MARC21 records exported from Colorado College's ILS.


In [2]:
import pymarc
marc_records = []
# MARC21 is a binary format, so open the file in binary mode ("rb")
with open("static/marc/marc-sample.mrc", "rb") as marc_file:
    marc_reader = pymarc.MARCReader(marc_file)
    for rec in marc_reader:
        marc_records.append(rec)
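
To get a feel for what pymarc hides, the sketch below builds and then re-reads a tiny synthetic record by hand. It assumes only the basic MARC21 layout (a 24-character leader, a directory of 12-character entries, then the field data); the helper names build_record and parse_record are hypothetical, and the sketch ignores indicators and subfields, which real records also carry.

```python
# Minimal sketch of the MARC21 layout: leader, directory, then field data.
FIELD_TERMINATOR = b"\x1e"
RECORD_TERMINATOR = b"\x1d"
LEADER_LENGTH = 24           # the leader is always 24 characters
DIRECTORY_ENTRY_LENGTH = 12  # tag (3) + field length (4) + start offset (5)


def build_record(fields):
    """Builds a toy MARC21 record from (tag, bytes) pairs (hypothetical helper)."""
    directory = b""
    data = b""
    for tag, value in fields:
        value += FIELD_TERMINATOR
        entry = "{0}{1:04d}{2:05d}".format(tag, len(value), len(data))
        directory += entry.encode("ascii")
        data += value
    directory += FIELD_TERMINATOR
    base_address = LEADER_LENGTH + len(directory)
    record_length = base_address + len(data) + len(RECORD_TERMINATOR)
    # Leader positions 0-4 hold the record length, 12-16 the base address of data
    leader = "{0:05d}nam a22{1:05d} a 4500".format(record_length, base_address)
    return leader.encode("ascii") + directory + data + RECORD_TERMINATOR


def parse_record(raw):
    """Reads the leader and directory, then slices each field out of the data."""
    leader = raw[:LEADER_LENGTH].decode("ascii")
    base_address = int(leader[12:17])
    # The directory ends one byte before the base address (its terminator)
    directory = raw[LEADER_LENGTH:base_address - 1]
    fields = []
    for offset in range(0, len(directory), DIRECTORY_ENTRY_LENGTH):
        entry = directory[offset:offset + DIRECTORY_ENTRY_LENGTH].decode("ascii")
        tag, length, start = entry[0:3], int(entry[3:7]), int(entry[7:12])
        begin = base_address + start
        # Drop the trailing field terminator from each value
        fields.append((tag, raw[begin:begin + length - 1]))
    return leader, fields


sample = build_record([("001", b"32460661"), ("003", b"OCoLC")])
leader, fields = parse_record(sample)
print(leader)
print(fields)
```

The fixed-length part (leader and directory entries) tells the reader where each variable-length field starts and ends, which is the bookkeeping pymarc does for us.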

Badge Assessment

  • How many MARC records are in the list marc_records?

In [ ]:

  • How would you substitute your own MARC21 file for marc-sample.mrc using the above Python code?

In [ ]:

  • Describe what happens when you execute the following code snippet:

In [3]:
print(marc_records[1905])


=LDR  01843nam  2200397 a 4504
=001  32460661
=003  OCoLC
=005  19950512094931.0
=008  950512s1995\\\\ksuab\\\\b\\\f000\0\eng\d
=035  \\$a.b12436422$btd$c-
=035  \\$atmp95128051
=040  \\$aGPO$cGPO$dDLC
=043  \\$an-us-ks$an-us-nb
=049  \\$aCOCU
=074  \\$a0624-B
=086  \\$aI 19.42/4:94-4187
=086  0\$aI 19.42/4:94-4187
=100  1\$aJordan, Paul Robert,$d1933-
=245  10$aSurface-water-quality assessment of the lower Kansas River Basin, Kansas and Nebraska :$bsuspended-sediment conditions, May 1987 through April 1990, and trends, 1963 through April 1990 /$cby P.R. Jordan.
=246  3\$aSurface water quality assessment of the lower Kansas River Basin, Kansas and Nebraska.
=246  30$aSuspended-sediment conditions, May 1987 through April 1990, and trends, 1963 through April 1990.
=260  \\$aLawrence, Kan. :$bU.S. Geological Survey, National Water-Quality Assessment Program ;$aDenver, CO :$bEarth Science Information Center, Open-File Reports Section [distributor],$c1995.
=300  \\$aiv, 36 p. :$bill., maps ;$c28 cm.
=440  \0$aWater-resources investigations report ;$v94-4187.
=500  \\$aShipping list no.: 95-0164-P.
=504  \\$aIncludes bibliographical references (p. 33-36)
=650  \0$aWater quality management$zKansas.
=650  \0$aWater quality management$zNebraska.
=650  \0$aSuspended sediments$xPhysiological effect$zKansas$zKansas River Watershed.
=651  \0$aKansas River Watershed (Kan.)$xEnvironmental conditions.
=710  2\$aNational Water-Quality Assessment Program (U.S.)
=907  \\$a.b12436422
=902  \\$a130106
=999  \\$b1$c950703$dm$ea$f-$g0
=994  \\$atd
=945  \\$aI 19.42/4:94-4187$g1$i33027003091141$j0$ltd   $h0$o-$p$0.00$r-$s-$t4$u0$v0$w0$x0$y.i12982362$z950517

[Write Answer here]

Task #2: Retrieving MARC21 Data

In this task, we'll start by displaying the title and author of the first 100 MARC records using two pymarc Record methods, title() and author().


In [4]:
for i, rec in enumerate(marc_records[0:100]):
    print("{0} {1} {2}".format(i,
                               rec.title(),
                               rec.author()))


0 Power, gender, values / None
1 Men in arms;a history of warfare and its interrelationships with Western society Preston, Richard Arthur.
2 Perspectives in ornithology :essays presented for the centennial of the American Ornithologists' Union / None
3 The youth of Andre Gide. Delay, Jean, 1907-
4 College :the undergraduate experience in America / Boyer, Ernest L.
5 On the incomprehensible nature of God / John Chrysostom, Saint, d. 407.
6 Commentary on the Metaphysics of Aristotle. Thomas, Aquinas, Saint, 1225?-1274.
7 Energy sources :conservation and renewables (APS, Washington, DC, 1985 / None
8 Walking on air :an informal history of inflight service of seven U.S. airlines / McLaughlin, Helen E.
9 Distribution of racial and ethnic groups in Colorado public schools, 1970-71 / None
10 Hockey: bantam to pro Kelley, Jack.
11 Guidebook of Vermejo Park, northeastern New Mexico :Twenty- seventh Field Conference, September 30, October 1 and 2, 1976 / New Mexico Geological Society.
12 Battered women, shattered lives / Hofeller, Kathleen H.
13 Fertility of the sea. Symposium on Fertility of the Sea (1969 : S�ao Paulo, Brazil)
14 Elements of meteorology Miller, Albert, 1923-
15 Ancient Australia;the story of its past geography and life Laseron, Charles.
16 Calibration of hominoid evolution;recent advances in isotopic and other dating methods applicable to the origin of man: [proceedings of the symposium held at Burg Wartenstein, Austria, 3rd-12th July, 1971]; None
17 Raphael / Jones, Roger, 1947-
18 A Passage to India :essays in interpretation / None
19 The Renewal of preaching:theory and practice, None
20 Liturgy in transition. None
21 Time and space relationships of the Taconic allochthon and autochthon. Zen, E-an, 1928-
22 The doctrine of judicial review, its legal and historical basis, and other essays, Corwin, Edward Samuel, 1878-1963.
23 The spotted hyena;a study of predation and social behavior. Kruuk, H. (Hans)
24 The Roots of urban unrest / None
25 Genetics, environment, and behavior;implications for educational policy. None
26 The errors of evolution.An examination of the nebular theory, geological evolution, the origin of life, and Darwinism. Patterson, Robert, 1829-1885.
27 Probing plant structure;a scanning electron microscope study of some anatomical features in plants and the relationship of these structures to physiological processes Troughton, John.
28 Immunochemistry and the biosynthesis of antibodies. Haurowitz, Felix, 1896-1987.
29 Biology of earthworms Edwards, C. A. (Clive Arthur), 1925-
30 "We the people" and others :duality and America's treatment of its racial minorities / Ringer, Benjamin B. (Benjamin Bernard), 1920-
31 Identifying and estimating the genetic impact of chemical mutagens / None
32 The Nuclear power controversy. None
33 The Confucian persuasion. Wright, Arthur F., 1913-1976.
34 Shawnee! :The ceremonialism of a native Indian tribe and its cultural background / Howard, James H. (James Henri), 1925-1982.
35 The earth in decay:a history of British geomorphology, 1578-1878 Herries Davies, G. L.
36 Authors on film. Geduld, Harry M.
37 Plant speciation. Grant, Verne.
38 Contemporary approaches to moral education :an annotated bibliography and guide to research / Leming, James S., 1941-
39 Alienation and absence in the novels of Marguerite Duras / Murphy, Carol J.
40 International management and economic development: with particular reference to India and other developing countries Richman, Barry M.
41 Will they ever finish Bruckner Boulevard? Huxtable, Ada Louise.
42 Late modern: the visual arts since 1945. Lucie-Smith, Edward.
43 Catholicism in English-speaking lands. Carthy, M. P. (Mary Peter)
44 The origins of the Bible, Soares, Theodore Gerald, b. 1869.
45 Memories of Lenin / Krupskaya, Nadezhda Konstantinovna, 1869-1939.
46 The gift :imagination and the erotic life of property / Hyde, Lewis, 1945-
47 The China reader, None
48 French primitive photography;[exhibition, Nov. 17th through Dec. 28th, 1969] Alfred Stieglitz Center.
49 Aftershock;the story of a psychotic episode. Wolfe, Ellen.
50 Paintbox on the frontier:the life and times of George Caleb Bingham. Constant, Alberta Wilson.
51 Elgar orchestral music. Kennedy, Michael, 1926-
52 Judy / Frank, Gerold, 1907-1998.
53 Passages about earth;an exploration of the new planetary culture. Thompson, William Irwin.
54 Water resource planning in Colorado / Crawford, Ivan C.
55 Syndicalism in France, Lorwin, Lewis Levitzki, 1883-1970.
56 The Rocky Mountain bench;the territorial supreme courts of Colorado, Montana, and Wyoming, 1861-1890, Guice, John D. W.
57 A quiet revolution, British sculpture since 1965 / None
58 The second partition of Poland;a study in diplomatic history, Lord, Robert Howard, 1885-1954.
59 The influence of the enlightenment on the French Revolution. Church, William Farr, 1912-
60 The city as a work of art :London, Paris, Vienna / Olsen, Donald J.
61 The letters of the Tsaritsa to the Tsar, 1914-1916. Alexandra, Empress, consort of Nicholas II, Emperor of Russia, 1872-1918.
62 Jean-Jacques Rousseau;a critical study of his life and writings. Green, F. C. (Frederick Charles), 1891-1964.
63 The Analects of Confucius; Confucius.
64 The Jews: their history. Finkelstein, Louis, 1895-1991.
65 Roman imperialism in the late republic, Badian, E.
66 The saga of the buffalo. Martin, Cy.
67 The Soviet Germans :past and present / Fleischhauer, Ingeborg.
68 Das Schicksal der Deutschen in Rumanien. None
69 Images of Victorian womanhood in English art / Casteras, Susan P.
70 Numerology / Bell, Eric Temple, 1883-1960.
71 The fateful years;memoirs of a French ambassador in Berlin, 1931-1938. Fran�cois-Poncet, Andr�e, 1887-1978.
72 Willa, the life of Willa Cather / Robinson, Phyllis C.
73 Improving American innovation / Mehlhaff, Carol J.
74 The city of God against the pagans. Augustine, Saint, Bishop of Hippo.
75 Springs of scientific creativity :essays on founders of modern science / None
76 From the other side of the river :a self-portrait of China today / Fan, Kuang Huan, 1932-
77 Iran, past and present / Wilber, Donald Newton.
78 Christopher Marlowe; Marlowe, Christopher, 1564-1593.
79 America confronts a revolutionary world, 1776-1976 / Williams, William Appleman.
80 From the Greeks to Darwin / Osborn, Henry Fairfield, 1857-1935.
81 George Herbert / Stewart, Stanley, 1931-
82 Women in LC's terms :a thesaurus of Library of Congress subject headings relating to women / Dickstein, Ruth.
83 Spiritual narratives / None
84 Harold Pinter / Almansi, Guido, 1931-
85 White-jacket;or, The world in a man-of-war. Melville, Herman, 1819-1891.
86 El viaje en el jardin / Fern�andez Santos, Jes�us.
87 The wagonmasters;high plains freighting from the earliest days of the Santa Fe trail to 1880. Walker, Henry P. (Henry Pickering)
88 Emma Goldman / Watson, Martha, 1941-
89 Valley of the spirits;the Upper Skagit Indians of western Washington Collins, June M.
90 Women and criminality :the woman as victim, offender, and practitioner / Flowers, R. Barri (Ronald Barri)
91 Traditional medicine in modern China;science, nationalism, and the tensions of cultural change Croizier, Ralph C.
92 State and community governments in the federal system / Press, Charles.
93 The territorial growth of the United States  / National Geographic Society (U.S.). Cartographic Division.
94 Tradition and revolt in Latin America, and other essays Humphreys, R. A. (Robert Arthur), 1907-1999.
95 Historical essay on the colony of Surinam, 1788. None
96 Spies of the Confederacy Bakeless, John, 1894-1978.
97 Great Britain and the Confederate Navy, 1861-1865, Merli, Frank J., 1929-2000.
98 3000 years of black poetry;an anthology, Lomax, Alan, 1915-2002.
99 Counterpoint; debates about debate Kruger, Arthur N.

In [5]:
# Unicode values in these records
print(marc_records[13].author())


Symposium on Fertility of the Sea (1969 : S�ao Paulo, Brazil)

These MARC21 records are not encoded correctly for Unicode. "São Paulo, Brazil" should be displayed as:


In [6]:
print(u"São Paulo, Brazil")


São Paulo, Brazil
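
Even when the bytes are decoded correctly, an accented letter can arrive in decomposed form: a base letter followed by a separate combining mark. The sketch below uses only the standard library and a hand-typed string (not the sample file) to show how Unicode normalization produces the single-character form printed above. Whether the sample records need this step is an assumption.

```python
import unicodedata

# "a" followed by COMBINING TILDE (U+0303): the decomposed spelling
decomposed = u"Sa\u0303o Paulo, Brazil"

# NFC normalization composes base letter + combining mark where possible
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))
print(composed == u"S\u00e3o Paulo, Brazil")
```

Normalizing to one form before comparing or indexing strings avoids treating the two spellings as different names.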

Methods of the MARC Record class

The pymarc Record class has a number of convenience methods that we will be using from now on. You can always see what methods are available for any Python class by running the following:


In [7]:
for row in dir(marc_records[8842]):
    if row.startswith("_"): # Filter out internal properties
        continue
    print(row)


add_field
add_grouped_field
add_ordered_field
addedentries
as_dict
as_json
as_marc
as_marc21
author
decode_marc
fields
force_utf8
get_fields
isbn
leader
location
next
notes
physicaldescription
pos
publisher
pubyear
remove_field
subjects
title
uniformtitle
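
The listing above mixes methods (such as title) with plain attributes (such as leader). A minimal sketch of how the two could be separated is shown below; ToyRecord and public_members are hypothetical names used only for illustration and are not part of pymarc.

```python
class ToyRecord(object):
    """Hypothetical stand-in for a record class; not part of pymarc."""

    def __init__(self):
        self.leader = "00000nam a2200000 a 4500"
        self._cache = {}  # leading underscore: treated as internal

    def title(self):
        return "Fertility of the sea."


def public_members(obj):
    """Splits the public names of obj into (methods, attributes)."""
    methods, attributes = [], []
    for name in dir(obj):
        if name.startswith("_"):  # Filter out internal properties
            continue
        if callable(getattr(obj, name)):
            methods.append(name)
        else:
            attributes.append(name)
    return methods, attributes


methods, attributes = public_members(ToyRecord())
print("methods:", methods)
print("attributes:", attributes)
```

The same helper should work on any Python object, so it could be pointed at one of the entries in marc_records.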

Badge Assessment:

  • What is the title of record 8 in marc_records? (Remember that Python lists start at index 0.)

In [ ]:

  • What is the isbn for record 8?

In [ ]:

Adding a new function

Now we will create a function that classifies MARC21 records as Schema.org CreativeWork classes.


In [9]:
def classify_marc21_schema(record):
    "Classifies a MARC21 record as a specific Work class based on the BIBFRAME website"
    leader = record.leader
    field007 = record['007']
    work_class = None
    if leader[6] == 'a':
        if field007 is not None:
            test_value = field007.data[0]
            if test_value == 'a' or test_value == 'd':
                work_class = 'Map' # http://schema.org/Map
            elif test_value == 'h': # Microfilm
                work_class = 'Photograph' # http://schema.org/Photograph
            elif test_value in ('m', 'v'):
                work_class = 'VideoObject' # http://schema.org/VideoObject
        else:
            # Book is the default for Language Material
            work_class = 'Book'
    elif leader[6] == 'e' or leader[6] == 'f':
        # Map is the default
        work_class = 'Map'
        if field007 is not None:
            if field007.data[0] == 'r':
                work_class = 'Dataset'
    elif leader[6] == 'g':
        work_class = 'Photograph'
    elif leader[6] == 'i':
        work_class = 'AudioObject' # http://schema.org/AudioObject
    elif leader[6] == 'j':
        work_class = 'MusicRecording'
    elif leader[6] == 'k':
        work_class = 'Photograph' 
    elif leader[6] == 'm':
        work_class = 'SoftwareApplication'
    if work_class is None:
        work_class = 'CreativeWork'
    return work_class
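
The core of the function is a lookup on leader position 6, the MARC type-of-record code. The sketch below restates just those defaults as a table; it deliberately ignores the 007 refinements, and default_work_class is a hypothetical helper name. The first leader string is taken from the record printed earlier, while the second is made up for illustration.

```python
# Defaults that classify_marc21_schema applies for leader position 6
LEADER_DEFAULTS = {
    "a": "Book",
    "e": "Map",
    "f": "Map",
    "g": "Photograph",
    "i": "AudioObject",
    "j": "MusicRecording",
    "k": "Photograph",
    "m": "SoftwareApplication",
}


def default_work_class(leader):
    """Returns the default schema.org class for a leader string (hypothetical helper)."""
    return LEADER_DEFAULTS.get(leader[6], "CreativeWork")


print(default_work_class("01843nam  2200397 a 4504"))  # leader from the record shown earlier
print(default_work_class("00000njm a2200000 a 4500"))  # made-up leader with type code "j"
```

A table like this is easy to extend, but the if/elif form in the notebook is what allows the extra 007 checks for maps, microforms, and video.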

Using the classify_marc21_schema function, we can now create some summary statistics about our MARC21 dataset by looping through our list of MARC21 records.


In [10]:
class_counters = {}
for record in marc_records:
    result = classify_marc21_schema(record)
    # Start from 0 for a class we have not seen yet, then add one
    class_counters[result] = class_counters.get(result, 0) + 1
print(class_counters)


{'AudioObject': 1, 'Map': 158, 'Photograph': 387, 'VideoObject': 3, 'MusicRecording': 299, 'Book': 5319, 'SoftwareApplication': 126, 'CreativeWork': 3660}
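
The same tally can be written with collections.Counter from the standard library. This is a sketch on a short stand-in list of labels, not on the sample file; in the notebook the labels would come from classify_marc21_schema.

```python
from collections import Counter

# Stand-in labels for illustration only
labels = ["Book", "Map", "Book", "CreativeWork", "Book", "Map"]

toy_counters = Counter(labels)

# most_common() returns (label, count) pairs, largest count first
print(toy_counters.most_common())
```

A Counter behaves like a dict, so code that reads the counts by key should not need to change.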

In [11]:
# Run the class_counters cell above first; without it this cell raises a NameError
%matplotlib inline
import matplotlib.pyplot as plt
schema_org_fig = plt.figure()
axes = schema_org_fig.add_axes([0.1, 0.1, 0.8, 0.8])
# Wrap the dict views in list() so the call also works on Python 3
axes.pie(list(class_counters.values()), labels=list(class_counters.keys()), autopct='%.2f')
axes.set_title('Number of Schema.org Classes in MARC21 Records');

Badge Assessment:

  • How would you loop through the marc_records list and get the titles for the three VideoObjects?

In [ ]:

Task #3: Parsing MARC XML

In this task, we'll start with an XML file derived from the same MARC21 records we used earlier.


In [12]:
from rdflib import Namespace
from lxml import etree
MARC_NS = Namespace('http://www.loc.gov/MARC21/slim')
marc_xml = etree.parse('static/marc/marc-sample.xml')
xml_marc_records = marc_xml.findall('/{{{0}}}record'.format(MARC_NS))
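
The '{{{0}}}record'.format(MARC_NS) expression builds a tag name of the form {namespace}record, which is how ElementTree-style APIs refer to namespaced elements. The sketch below shows this on a small inline document. It uses the standard library's xml.etree.ElementTree rather than lxml, and a plain string rather than an rdflib Namespace, so that it runs without extra packages; both substitutions are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"

# A trimmed-down MARC XML collection with one record
SAMPLE = """<marc:collection xmlns:marc="http://www.loc.gov/MARC21/slim">
  <marc:record>
    <marc:leader>01103cam a2200301   4500</marc:leader>
    <marc:controlfield tag="001">lccn74132383</marc:controlfield>
    <marc:datafield tag="245" ind1="1" ind2="0">
      <marc:subfield code="a">Fertility of the sea.</marc:subfield>
      <marc:subfield code="c">Edited by John D. Costlow, Jr.</marc:subfield>
    </marc:datafield>
  </marc:record>
</marc:collection>"""

root = ET.fromstring(SAMPLE)

# Doubled braces in the format string produce literal braces around the namespace
record_tag = "{{{0}}}record".format(MARC_NS)
records = root.findall(record_tag)

print(record_tag)
print(len(records))
```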

Badge Assessment:

  • Describe what happens when you execute the following code snippet:

In [13]:
print(etree.tostring(xml_marc_records[13], pretty_print=True))


<marc:record xmlns:marc="http://www.loc.gov/MARC21/slim" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><marc:leader>01103cam a2200301   4500</marc:leader>
<marc:controlfield tag="001">lccn74132383</marc:controlfield>
<marc:controlfield tag="008">711116s1971    nyua     b    100 0 eng  </marc:controlfield><marc:datafield tag="010" ind1=" " ind2=" "><marc:subfield code="a">74132383</marc:subfield></marc:datafield><marc:datafield tag="020" ind1=" " ind2=" "><marc:subfield code="a">0677149603</marc:subfield></marc:datafield><marc:datafield tag="035" ind1=" " ind2=" "><marc:subfield code="a">.b10021978</marc:subfield><marc:subfield code="b">tb</marc:subfield><marc:subfield code="c">-</marc:subfield></marc:datafield><marc:datafield tag="035" ind1=" " ind2=" "><marc:subfield code="a">(CoCC)102640</marc:subfield></marc:datafield><marc:datafield tag="041" ind1="0" ind2=" "><marc:subfield code="a">eng</marc:subfield><marc:subfield code="b">spapor</marc:subfield></marc:datafield><marc:datafield tag="090" ind1=" " ind2=" "><marc:subfield code="a">GC2.S83 1969</marc:subfield></marc:datafield><marc:datafield tag="111" ind1="2" ind2=" "><marc:subfield code="a">Symposium on Fertility of the Sea</marc:subfield><marc:subfield code="d">(1969 :</marc:subfield><marc:subfield code="c">Sa&#771;o Paulo, Brazil)</marc:subfield></marc:datafield><marc:datafield tag="245" ind1="1" ind2="0"><marc:subfield code="a">Fertility of the sea.</marc:subfield><marc:subfield code="c">Edited by John D. 
Costlow, Jr.</marc:subfield></marc:datafield><marc:datafield tag="260" ind1=" " ind2=" "><marc:subfield code="a">New York,</marc:subfield><marc:subfield code="b">Gordon and Breach</marc:subfield><marc:subfield code="c">[1971]</marc:subfield></marc:datafield><marc:datafield tag="300" ind1=" " ind2=" "><marc:subfield code="a">2 v.</marc:subfield><marc:subfield code="b">illus.</marc:subfield><marc:subfield code="c">24 cm.</marc:subfield></marc:datafield><marc:datafield tag="500" ind1=" " ind2=" "><marc:subfield code="a">In English; summaries in English and Spanish or Portuguese.</marc:subfield></marc:datafield><marc:datafield tag="504" ind1=" " ind2=" "><marc:subfield code="a">Includes bibliographies.</marc:subfield></marc:datafield><marc:datafield tag="650" ind1=" " ind2="0"><marc:subfield code="a">Oceanography</marc:subfield><marc:subfield code="v">Congresses.</marc:subfield></marc:datafield><marc:datafield tag="650" ind1=" " ind2="0"><marc:subfield code="a">Marine ecology</marc:subfield><marc:subfield code="v">Congresses.</marc:subfield></marc:datafield><marc:datafield tag="700" ind1="1" ind2=" "><marc:subfield code="a">Costlow, John D.,</marc:subfield><marc:subfield code="d">1927-</marc:subfield></marc:datafield><marc:datafield tag="907" ind1=" " ind2=" "><marc:subfield code="a">.b10021978</marc:subfield></marc:datafield><marc:datafield tag="902" ind1=" " ind2=" "><marc:subfield code="a">130106</marc:subfield></marc:datafield><marc:datafield tag="999" ind1=" " ind2=" "><marc:subfield code="b">1</marc:subfield><marc:subfield code="c">940803</marc:subfield><marc:subfield code="d">m</marc:subfield><marc:subfield code="e">a</marc:subfield><marc:subfield code="f">-</marc:subfield><marc:subfield code="g">0</marc:subfield></marc:datafield><marc:datafield tag="994" ind1=" " ind2=" "><marc:subfield code="a">tb</marc:subfield></marc:datafield><marc:datafield tag="945" ind1=" " ind2=" "><marc:subfield code="a">GC2.S83 1969</marc:subfield><marc:subfield code="c">v. 
2</marc:subfield><marc:subfield code="g">1</marc:subfield><marc:subfield code="i">33027000341267</marc:subfield><marc:subfield code="j">0</marc:subfield><marc:subfield code="l">tb   </marc:subfield><marc:subfield code="h">0</marc:subfield><marc:subfield code="o">-</marc:subfield><marc:subfield code="p">$0.00</marc:subfield><marc:subfield code="r">-</marc:subfield><marc:subfield code="s">-</marc:subfield><marc:subfield code="t">1</marc:subfield><marc:subfield code="u">0</marc:subfield><marc:subfield code="v">0</marc:subfield><marc:subfield code="w">0</marc:subfield><marc:subfield code="x">0</marc:subfield><marc:subfield code="y">.i10026526</marc:subfield><marc:subfield code="z">940804</marc:subfield></marc:datafield><marc:datafield tag="945" ind1=" " ind2=" "><marc:subfield code="a">GC2.S83 1969</marc:subfield><marc:subfield code="c">v. 1</marc:subfield><marc:subfield code="g">1</marc:subfield><marc:subfield code="i">33027000341275</marc:subfield><marc:subfield code="j">0</marc:subfield><marc:subfield code="l">tb   </marc:subfield><marc:subfield code="h">0</marc:subfield><marc:subfield code="o">-</marc:subfield><marc:subfield code="p">$0.00</marc:subfield><marc:subfield code="r">-</marc:subfield><marc:subfield code="s">-</marc:subfield><marc:subfield code="t">1</marc:subfield><marc:subfield code="u">0</marc:subfield><marc:subfield code="v">0</marc:subfield><marc:subfield code="w">0</marc:subfield><marc:subfield code="x">0</marc:subfield><marc:subfield code="y">.i10026538</marc:subfield><marc:subfield code="z">940804</marc:subfield></marc:datafield></marc:record>


[Answer here]

  • How many <marc:record> elements are in the xml_marc_records list?

In [ ]:

  • Do a quick check to see whether all of the MARC records in our initial list were converted correctly to XML.

In [ ]:

Task #4: Retrieving Data with XPath

With MARC XML, we can use XML tools and technologies like XPath to filter and select elements.


In [14]:
xpath_string = "/{{{0}}}record/{{{0}}}datafield[@tag='245']/".format(MARC_NS)
title_elements = marc_xml.findall(xpath_string)
for i, element in enumerate(title_elements[0:100]):
    print("{0} subfield {1}: {2}".format(i, element.attrib.get('code'), element.text))


0 subfield a: Power, gender, values /
1 subfield c: Judith Genova, editor.
2 subfield a: Men in arms;
3 subfield b: a history of warfare and its interrelationships with Western society
4 subfield c: [by] Richard A. Preston, Sydney F. Wise and Herman O. Werner.
5 subfield a: Perspectives in ornithology :
6 subfield b: essays presented for the centennial of the American Ornithologists' Union /
7 subfield c: edited by Alan H. Brush and George A. Clark, Jr. ; sponsored by the American Ornithologists' Union.
8 subfield a: The youth of Andre Gide.
9 subfield c: Abridged and translated by June Guicharnaud.
10 subfield a: College :
11 subfield b: the undergraduate experience in America /
12 subfield c: Ernest L. Boyer.
13 subfield a: On the incomprehensible nature of God /
14 subfield c: St. John Chrysostom ; translated by Paul W. Harkins.
15 subfield a: Commentary on the Metaphysics of Aristotle.
16 subfield c: Translated by John P. Rowan.
17 subfield a: Energy sources :
18 subfield b: conservation and renewables (APS, Washington, DC, 1985 /
19 subfield c: edited by David Hafemeister, Henry Kelly, Barbara Levi.
20 subfield a: Walking on air :
21 subfield b: an informal history of inflight service of seven U.S. airlines /
22 subfield c: by Helen E. McLaughlin.
23 subfield a: Distribution of racial and ethnic groups in Colorado public schools, 1970-71 /
24 subfield c: Prepared by Earl W. Phillips for Youth-Community Relations Unit, Office of Continuing Education, Colorado Department of Education.
25 subfield a: Hockey: bantam to pro
26 subfield c: [by] Jack Kelley [and] Milt Schmidt, with Al Hirshberg.
27 subfield a: Guidebook of Vermejo Park, northeastern New Mexico :
28 subfield b: Twenty- seventh Field Conference, September 30, October 1 and 2, 1976 /
29 subfield c: editors : Rodney C. Ewing, Barry S. Kues.
30 subfield a: Battered women, shattered lives /
31 subfield c: by Kathleen H. Hofeller.
32 subfield a: Fertility of the sea.
33 subfield c: Edited by John D. Costlow, Jr.
34 subfield a: Elements of meteorology
35 subfield c: [by] Albert Miller and Jack C. Thompson.
36 subfield a: Ancient Australia;
37 subfield b: the story of its past geography and life
38 subfield c: [by] Charles Laseron.
39 subfield a: Calibration of hominoid evolution;
40 subfield b: recent advances in isotopic and other dating methods applicable to the origin of man: [proceedings of the symposium held at Burg Wartenstein, Austria, 3rd-12th July, 1971];
41 subfield c: scientific editors: W. W. Bishop and J. A. Miller; assistant editor: Sonia Cole.
42 subfield a: Raphael /
43 subfield c: Roger Jones and Nicholas Penny.
44 subfield a: A Passage to India :
45 subfield b: essays in interpretation /
46 subfield c: edited by John Beer.
47 subfield a: The Renewal of preaching:
48 subfield b: theory and practice,
49 subfield c: edited by Karl Rahner.
50 subfield a: Liturgy in transition.
51 subfield c: Edited by Herman Schmidt.
52 subfield a: Time and space relationships of the Taconic allochthon and autochthon.
53 subfield a: The doctrine of judicial review, its legal and historical basis, and other essays,
54 subfield c: by Edward S. Corwin .
55 subfield a: The spotted hyena;
56 subfield b: a study of predation and social behavior.
57 subfield a: The Roots of urban unrest /
58 subfield c: edited by John Benyon and John Solomos.
59 subfield a: Genetics, environment, and behavior;
60 subfield b: implications for educational policy.
61 subfield c: Edited by Lee Ehrman, Gilbert S. Omenn [and] Ernst Caspari. Contributors: V. Elving Anderson [and others]
62 subfield a: The errors of evolution.
63 subfield b: An examination of the nebular theory, geological evolution, the origin of life, and Darwinism.
64 subfield c: By Robert Patterson ... edited, with an introduction, by H. L. Hastings.
65 subfield a: Probing plant structure;
66 subfield b: a scanning electron microscope study of some anatomical features in plants and the relationship of these structures to physiological processes
67 subfield c: [by] John Troughton and Lesley A. Donaldson.
68 subfield a: Immunochemistry and the biosynthesis of antibodies.
69 subfield a: Biology of earthworms
70 subfield c: [by] C. A. Edwards [and] J. R. Lofty.
71 subfield a: "We the people" and others :
72 subfield b: duality and America's treatment of its racial minorities /
73 subfield c: Benjamin B. Ringer.
74 subfield a: Identifying and estimating the genetic impact of chemical mutagens /
75 subfield c: Committee on Chemical Environmental Mutagens, Board on Toxicology and Environmental Health Hazards, Commission on Life Sciences, National Research Council.
76 subfield a: The Nuclear power controversy.
77 subfield a: The Confucian persuasion.
78 subfield a: Shawnee! :
79 subfield b: The ceremonialism of a native Indian tribe and its cultural background /
80 subfield c: James H. Howard.
81 subfield a: The earth in decay:
82 subfield b: a history of British geomorphology, 1578-1878
83 subfield c: [by] Gordon L. Davies.
84 subfield a: Authors on film.
85 subfield c: Edited by Harry M. Geduld.
86 subfield a: Plant speciation.
87 subfield a: Contemporary approaches to moral education :
88 subfield b: an annotated bibliography and guide to research /
89 subfield c: James S. Leming.
90 subfield a: Alienation and absence in the novels of Marguerite Duras /
91 subfield c: Carol J. Murphy.
92 subfield a: International management and economic development: with particular reference to India and other developing countries
93 subfield c: [by] Barry M. Richman [and] Melvyn Copen.
94 subfield a: Will they ever finish Bruckner Boulevard?
95 subfield c: Pref. by Daniel P. Moynihan.
96 subfield a: Late modern: the visual arts since 1945.
97 subfield a: Catholicism in English-speaking lands.
98 subfield a: The origins of the Bible,
99 subfield c: by Theodore Gerald Soares.

Because the path ends with a trailing slash, it selects every child of each 245 datafield, so the title_elements list includes all of the subfields from the 245 fields.
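
An alternative to spelling out the namespace in braces is to pass a prefix map to findall. The sketch below uses the standard library's xml.etree.ElementTree and a small inline document as stand-ins for lxml and the sample file; it selects the 245 datafield itself and then walks its children.

```python
import xml.etree.ElementTree as ET

# Prefix map: the key "marc" is our own choice, the URI is the MARC XML namespace
NAMESPACES = {"marc": "http://www.loc.gov/MARC21/slim"}

SAMPLE = """<marc:collection xmlns:marc="http://www.loc.gov/MARC21/slim">
  <marc:record>
    <marc:datafield tag="245" ind1="1" ind2="0">
      <marc:subfield code="a">Fertility of the sea.</marc:subfield>
      <marc:subfield code="c">Edited by John D. Costlow, Jr.</marc:subfield>
    </marc:datafield>
  </marc:record>
</marc:collection>"""

root = ET.fromstring(SAMPLE)
datafields = root.findall("marc:record/marc:datafield[@tag='245']", NAMESPACES)

for datafield in datafields:
    # Iterating an element yields its direct children, here the subfields
    for subfield in datafield:
        print(subfield.get("code"), subfield.text)
```

Selecting the datafield rather than its children keeps the subfields of one title grouped together, which the flat title_elements list does not.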


In [15]:
rec_245s = xml_marc_records[13].findall("{{{0}}}datafield[@tag='245']/".format(MARC_NS))
for element in rec_245s:
    print(etree.tostring(element))


<marc:subfield xmlns:marc="http://www.loc.gov/MARC21/slim" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" code="a">Fertility of the sea.</marc:subfield>
<marc:subfield xmlns:marc="http://www.loc.gov/MARC21/slim" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" code="c">Edited by John D. Costlow, Jr.</marc:subfield>

We are now going to create a function that takes either a MARC21 or a MARC XML record and returns the Library of Congress or local call number for the record.


In [16]:
def get_call_number(record):
    """Retrieves an LOC or local call number from the 090 MARC field of either
    a MARC21 or MARC XML record

    :param record: MARC record
    """
    if isinstance(record, pymarc.Record):
        all_090s = record.get_fields('090')
        return ''.join([''.join(x.get_subfields('a')) for x in all_090s])
    else:
        all_090s = record.findall("{{{0}}}datafield[@tag='090']/{{{0}}}subfield[@code='a']".format(MARC_NS))
        return ''.join([x.text for x in all_090s])

In [17]:
print(get_call_number(marc_records[13]))
print(get_call_number(xml_marc_records[78]))


GC2.S83 1969
PR2661.E5 1903

Badge Assessment:

  • What happens when the function get_call_number is given a MARC record that does not have a MARC field of 090?

[Write answer here]

  • How could you improve the XPath "/{{{0}}}record/{{{0}}}datafield[@tag='245']/" to just select the a subfields of the 245 field?

In [ ]:

[Write answer here]

Task #5: Creating JSON Linked Data from MARC

In this task we will use the schema.org vocabulary to create JSON linked data from our MARC records, in both MARC21 and MARC XML formats. Below is an example of schema.org JSON-LD (short for JSON Linked Data) for one of the resources used in this presentation, an article titled "Linking Things on the Web: A Pragmatic Examination of Linked Data for Libraries, Archives and Museums" by Ed Summers and Dorothea Salo.

{
  "@context": {
    "@vocab": "http://schema.org/",
    "bf": "http://bibframe.org/vocab"
  },
  "@id": "http://intro2libys.info/Article/linking-things-on-the-web",
  "@type": "ScholarlyArticle",
  "author": [
    {
      "@context": {
        "@vocab": "http://schema.org/",
        "bf": "http://bibframe.org/vocab/"
      },
      "@id": "http://intro2libsys.info/Person/summers-ed",
      "@type": "Person",
      "bf:adminInfo": {
        "bf:creationDate": "2013-12-29T06:48:02.515163",
        "bf:descriptionConventions": "Using schema.org for descriptive metadata",
        "bf:descriptionLanguage": "English"
      },
      "familyName": "Summers",
      "givenName": "Ed",
      "name": "Ed Summers",
      "url": "http://intro2libsys.info/Person/EdSummers"
    },
    {
      "@context": {
        "@vocab": "http://schema.org/",
        "bf": "http://bibframe.org/vocab/"
      },
      "@id": "http://intro2libsys.info/Person/salo-dorothea",
      "@type": "Person",
      "bf:adminInfo": {
        "bf:creationDate": "2013-12-29T06:50:22.868813",
        "bf:descriptionConventions": "Using schema.org for descriptive metadata",
        "bf:descriptionLanguage": "English"
      },
      "familyName": "Salo",
      "givenName": "Dorothea",
      "name": "Dorothea Salo",
      "url": "http://intro2libsys.info/Person/DorotheaSalo"
    }
  ],
  "bf:adminInfo": {
    "bf:creationDate": "2013-12-12T23:22:13.566000",
    "bf:descriptionConventions": "Using schema.org for descriptive metadata",
    "bf:descriptionLanguage": "English"
  },
  "bookFormat": "EBook",
  "description": "Short Ebook on Linked Data in the cultural heritage sector: libraries, archives and museums",
  "headline": "Linking Things on the Web: A Pragmatic Examination of Linked Data for Libraries, Archives and Museums",
  "name": "linking-things-on-the-web-a-pragmatic-examination-of-linked-data-for-libraries-archives-and-museums",
  "url": "http://arxiv.org/abs/1302.4591"
}

We first import the json Python module to work with JSON objects. The json module allows us to easily load JSON objects as native Python data structures such as dict and list.


In [18]:
import json
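
As a quick illustration of that round trip, the sketch below parses a small hand-written JSON-LD string into a dict, changes it, and serializes it again. The string and the added keywords values are made up for the example.

```python
import json

# A hand-written JSON-LD string, used only for illustration
raw = ('{"@context": {"@vocab": "http://schema.org/"}, '
       '"@type": "Book", "name": "Fertility of the sea."}')

graph = json.loads(raw)  # JSON object -> Python dict
print(type(graph).__name__, graph["@type"])

# A JSON array maps to a Python list
graph["keywords"] = ["Oceanography", "Marine ecology"]

# indent and sort_keys make the serialized output easier to read and compare
print(json.dumps(graph, indent=2, sort_keys=True))
```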

Next we create a Python dict for MARC record 4190 in our dataset (index 4189 in marc_records, because Python lists are zero-indexed). We will use the MARC21 records, but we could just as easily have used the MARC XML records. The first line of the dict sets a @context property for the JSON and asserts that the vocabulary is schema.org. Record 4190 is a book of poems for young people, so we classify the new JSON-LD graph as the schema.org type Book.


In [19]:
print(marc_records[4189])
record4190_json =  {"@context": {"@vocab": "http://schema.org/"},
                    "@type": "Book"}


=LDR  00958cam  2200301Ia 4500
=001  52415485
=003  OCoLC
=005  20030717113143.0
=008  030611s2003\\\\cau\\\\\\\\\\\000\0\eng\d
=020  \\$a189322483X :
=035  \\$a.b1585050x$btbp$cc
=040  \\$aCKE$cCKE$dCOC
=049  \\$aCOCA
=090  \\$aPS3604.I54$bB4 2003
=090  \\$aPS3604.I54$bB4 2003
=100  1\$aDilenschneider, Geoffrey.
=245  10$aBetween two Junes is a forest :$ba journal of everything /$cGeoffrey Dilenschneider.
=260  \\$aBeverly Hills, CA :$bNew Millennium Press,$cc2003.
=300  \\$a285 p. ;$c24 cm.
=500  \\$aPoems.
=650  \0$aYoung adult poetry, American.
=650  \0$aTeenagers$vPoetry.
=650  \0$aAmerican poetry$y21st century.
=907  \\$a.b1585050x
=902  \\$a130106
=999  \\$b1$c030717$dm$ea$fc$g0
=994  \\$atbp
=945  \\$aPS3604.I54$bB4 2003$g1$i33027004621177$j0$ltbp  $h0$oc$p$0.00$q $r-$s-$t1$u1$v0$w0$x0$y.i15788337$z030717

For the JSON-LD graph, we will map the MARC title to the schema.org name property.


In [20]:
record4190_json['name'] = marc_records[4189].title()
print(record4190_json)


{'@context': {'@vocab': 'http://schema.org/'}, '@type': 'Book', 'name': 'Between two Junes is a forest :a journal of everything /'}

In the JSON-LD graph for Between two Junes is a forest :a journal of everything, we create a Python dict for the author and assign it the schema.org type Person.


In [22]:
record4190_json['author'] = {"@type": "Person", "name": marc_records[4189].author()}
print(record4190_json)


{'@context': {'@vocab': 'http://schema.org/'}, 'author': {'@type': 'Person', 'name': 'Dilenschneider, Geoffrey.'}, '@type': 'Book', 'name': 'Between two Junes is a forest :a journal of everything /'}

Badge Assessment


In [ ]:

  • Compare the outputs of the following code snippets:

In [ ]:
record9_json = json.loads(marc_records[9].as_json())

In [ ]:
print(json.dumps(record9_json, indent=2, sort_keys=True))

In [ ]:
print(json.dumps(record4190_json, indent=2, sort_keys=True))

Task #6: Create RDF Triple


In [23]:
from rdflib import Graph, BNode, Literal, Namespace

# Namespaces for the vocabularies used in the graph
DCTERMS = Namespace("http://purl.org/dc/terms/")
BIBFRAME = Namespace("http://bibframe.org/vocab/")
SCHEMA_ORG = Namespace("http://schema.org/")

bib_graph = Graph()
# Attribute access on a Namespace builds a URI, here .../entities/one
entities = Namespace('http://intro2libsys.info/lita-webinar-2014/entities/')
entity = entities.one

Badge Assessment

  • Create a second entity for the 3490 MARCXML document

In [ ]:

Task #7: Extract and Add DC Title and Creator Triples

First we will add schema.org properties to the first entity in the bib_graph.


In [25]:
from rdflib import URIRef
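# Note: RDF normally states a resource's type with rdf:type (rdflib's RDF.type);
# this notebook uses schema:type, and later cells read it back under that name.
bib_graph.add((entity,
               SCHEMA_ORG.type,
               URIRef("http://schema.org/Book")))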

Second, we will add the schema.org copyrightYear to the first entity.


In [26]:
bib_graph.add((entity,
               SCHEMA_ORG.copyrightYear,
               Literal(marc_records[8]['260']['c'][1:])))
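
The slice above drops the leading "c" but leaves any trailing punctuation in place, which is why the graph later shows 1986. with a period. One possible alternative, sketched here with a hypothetical helper named extract_year that is not part of PyMARC, is to pull out the first four-digit run with a regular expression.

```python
import re

def extract_year(date_statement):
    """Return the first four-digit year found in a 260$c value, or None.

    extract_year is a hypothetical helper for this sketch.
    """
    match = re.search(r"\d{4}", date_statement)
    return match.group(0) if match else None

print(extract_year("c1986."))
print(extract_year("c2003."))
```

The helper returns None when no year is present, so a caller can decide whether to skip the copyrightYear triple for that record.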

In [27]:
bib_graph.add((entity, 
               DCTERMS.title, 
               Literal(marc_records[8].title())))
bib_graph.add((entity,
               DCTERMS.creator,
               Literal(marc_records[8].author())))
for subject,predicate,obj in bib_graph:
    print("Subject: {0}\nPredicate: {1}\nObject: {2}".format(subject,predicate,obj))
    print("===")

if (entity, None, None) in bib_graph:
    print("Graph contains triples about the entity")


Subject: http://intro2libsys.info/lita-webinar-2014/entities/one
Predicate: http://schema.org/type
Object: http://schema.org/Book
===
Subject: http://intro2libsys.info/lita-webinar-2014/entities/one
Predicate: http://schema.org/copyrightYear
Object: 1986.
===
Subject: http://intro2libsys.info/lita-webinar-2014/entities/one
Predicate: http://purl.org/dc/terms/title
Object: Walking on air :an informal history of inflight service of seven U.S. airlines /
===
Subject: http://intro2libsys.info/lita-webinar-2014/entities/one
Predicate: http://purl.org/dc/terms/creator
Object: McLaughlin, Helen E.
===
Graph contains triples about the entity
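
In the membership test above, None acts as a wildcard in the triple pattern. The toy model below uses a plain Python set of tuples rather than rdflib and is only meant to illustrate that idea; the matches function is hypothetical and says nothing about how rdflib implements the lookup.

```python
# Toy stand-in for a graph: a set of (subject, predicate, object) tuples.
ENTITY = "http://intro2libsys.info/lita-webinar-2014/entities/one"
toy_graph = {
    (ENTITY, "http://purl.org/dc/terms/title",
     "Walking on air :an informal history of inflight service of seven U.S. airlines /"),
    (ENTITY, "http://purl.org/dc/terms/creator", "McLaughlin, Helen E."),
}

def matches(graph, pattern):
    """Return the triples that fit the pattern; None matches anything."""
    return [triple for triple in graph
            if all(want is None or want == have
                   for want, have in zip(pattern, triple))]

about_entity = matches(toy_graph, (ENTITY, None, None))
creators = matches(toy_graph, (None, "http://purl.org/dc/terms/creator", None))
print(len(about_entity), creators[0][2])
```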

In [43]:
from rdflib.serializer import Serializer
from rdflib import plugin
bib_graph.namespace_manager.reset()
bib_graph.namespace_manager.bind("dc", "http://purl.org/dc/terms/")
bib_graph.namespace_manager.bind("schema", SCHEMA_ORG) 
print(bib_graph.serialize(format='pretty-xml', indent=4))

In [29]:
# Print in RDF Notation3 (N3) syntax
print(bib_graph.serialize(format='n3'))


@prefix dc: <http://purl.org/dc/terms/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <http://schema.org/> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://intro2libsys.info/lita-webinar-2014/entities/one> dc:creator "McLaughlin, Helen E." ;
    dc:title "Walking on air :an informal history of inflight service of seven U.S. airlines /" ;
    schema:copyrightYear "1986." ;
    schema:type schema:Book .


Creating and asserting a linked URI for a 008 MARC language code


In [30]:
import urllib2
from rdflib import URIRef

def get_marc_lang_uri(marc_record):
    """Function returns the Library of Congress language URI for the language code
    in the 008 field of either a MARC21 or MARC XML record

    :param marc_record: MARC record
    """
    loc_lang_base = 'http://id.loc.gov/vocabulary/languages'
    # The language code occupies character positions 35-37 of the 008 field
    if isinstance(marc_record, pymarc.Record):
        lang_code = marc_record['008'].data[35:38]
    else:
        lang_code = marc_record.find("{{{0}}}controlfield[@tag='008']".format(MARC_NS)).text[35:38]
    loc_lang_uri = "{0}/{1}".format(loc_lang_base, lang_code.strip())
    # Only return the URI if id.loc.gov resolves it
    if urllib2.urlopen(loc_lang_uri).code != 200:
        return
    else:
        return URIRef(loc_lang_uri)
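
The language code sits at character positions 35 to 37 of the 008 field, so the URI construction can be tried without a network call. The sketch below uses a hypothetical helper, language_uri_from_008, and rebuilds the 008 value of record 4190 shown earlier, with blanks in place of the backslashes that PyMARC prints.

```python
LOC_LANG_BASE = "http://id.loc.gov/vocabulary/languages"

def language_uri_from_008(field_data):
    """Build an id.loc.gov language URI from the text of an 008 field.

    Hypothetical helper for this sketch; it does not check that the URI resolves.
    """
    lang_code = field_data[35:38].strip()
    if not lang_code:
        return None
    return "{0}/{1}".format(LOC_LANG_BASE, lang_code)

# 008 of record 4190: 40 characters, with blanks instead of backslashes.
sample_008 = "030611s2003" + " " * 4 + "cau" + " " * 11 + "000 0 eng d"
print(language_uri_from_008(sample_008))
```

Keeping the string handling separate from the HTTP check makes the slicing easy to verify against a known record before any request is sent.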

Badge Assessment


In [33]:
get_marc_lang_uri(marc_records[78])


Out[33]:
rdflib.term.URIRef(u'http://id.loc.gov/vocabulary/languages/eng')

  • How would you add a Dublin Core language triple to the entity?

In [ ]:

  • Using the second entity you created, add the title and the creator from the 3490 MARCXML document to the bib_graph

In [33]:

  • Add a Dublin Core Language triple to entity2

In [ ]:


In [ ]:


In [ ]:

Task #8: Create HTML Template using MARC RDFa


In [34]:
template = """<div vocab="http://purl.org/dc/terms/" typeof="{{ dc_type }}">
 <h2 property="title">{{ dc_title }}</h2>
by  <span property="creator">{{ dc_creator }}</span>
</div>"""
from jinja2 import Template
rdfa_template = Template(template)
print(rdfa_template.render(dc_type='text',
                           dc_creator=bib_graph.value(entity, DCTERMS.creator),
                           dc_title=bib_graph.value(entity, DCTERMS.title)))


<div vocab="http://purl.org/dc/terms/" typeof="text">
 <h2 property="title">Walking on air :an informal history of inflight service of seven U.S. airlines /</h2>
by  <span property="creator">McLaughlin, Helen E.</span>
</div>

Validate this RDFa by copying the HTML and RDFa into the validator at http://www.w3.org/2012/pyRdfa/Validator.html#distill_by_input

Badge Assessment

  • Add the Dublin Core language triple to the RDFa template

In [ ]:

Python function for retrieving a language label based on the language URI


In [35]:
english_json = json.load(urllib2.urlopen('http://id.loc.gov/vocabulary/languages/eng.skos.json'))
print(english_json[u'<http://id.loc.gov/vocabulary/languages/eng>'][ u'<http://www.w3.org/2004/02/skos/core#prefLabel>'])


[{u'lang': u'en', u'type': u'literal', u'value': u'English'}]
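
The printed value suggests how the SKOS JSON is organized: a dict keyed by the subject URI in angle brackets, then by predicate, holding a list of literals. The sketch below walks that structure offline; sample_skos is a hand-built stand-in modeled on the output above, and pref_labels is a hypothetical helper.

```python
PREF_LABEL = "<http://www.w3.org/2004/02/skos/core#prefLabel>"
ENG_URI = "http://id.loc.gov/vocabulary/languages/eng"

# Hand-built stand-in shaped like the output printed above.
sample_skos = {
    "<{0}>".format(ENG_URI): {
        PREF_LABEL: [{"lang": "en", "type": "literal", "value": "English"}]
    }
}

def pref_labels(skos_json, uri):
    """Return all preferred-label strings for uri, or an empty list."""
    subject = skos_json.get("<{0}>".format(uri), {})
    return [entry["value"] for entry in subject.get(PREF_LABEL, [])]

print(pref_labels(sample_skos, ENG_URI))
```

Returning a list rather than the first element keeps the sketch usable if a term carries more than one preferred label, and it avoids an exception when the URI is missing.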

In [36]:
def get_marc_language_label(loc_language_uri):
    """Function returns the preferred label for a Library of Congress MARC language URI

    :param loc_language_uri: URI of the language in the Library of Congress Linked Data service
    """
    loc_language_uri = loc_language_uri.strip()
    prefLabel_uri = u'<http://www.w3.org/2004/02/skos/core#prefLabel>'

    loc_skos_uri = "{0}.skos.json".format(loc_language_uri)
    # Keys in the SKOS JSON are URIs wrapped in angle brackets
    loc_language_key = u"<{0}>".format(loc_language_uri)
    lang_json = json.load(urllib2.urlopen(loc_skos_uri))
    return lang_json.get(loc_language_key).get(prefLabel_uri)[0].get("value", None)

print(get_marc_language_label('http://id.loc.gov/vocabulary/languages/eng'))


English

  • Render and print the RDFa of the entity you created from the MARC XML using the rdfa.html template.

In [ ]:

Task #9: Create HTML5 marked up with schema.org Microdata using Jinja


In [38]:
micro_template = """<div itemscope itemtype="{{ itemType }}">
 <h2 itemprop="name">{{ itemName }}</h2>
    by <span itemprop="author">{{ author }}</span> 
</div>"""

In [39]:
micro_data_template = Template(micro_template)
print(micro_data_template.render(itemType= bib_graph.value(entity, SCHEMA_ORG.type),
                                 itemName= bib_graph.value(entity, DCTERMS.title),
                                 author= bib_graph.value(entity, DCTERMS.creator)))


<div itemscope itemtype="http://schema.org/Book">
 <h2 itemprop="name">Walking on air :an informal history of inflight service of seven U.S. airlines /</h2>
    by <span itemprop="author">McLaughlin, Helen E.</span> 
</div>

Test how Google will extract the data using its rich snippets testing tools.

Badge Assessment

  • Add and bind a new <span> element with an itemprop using the schema.org copyrightYear

In [ ]:


In [ ]:

Task #10: Putting the "linked" into Linked Data

Up to this point we have been turning MARC records into linked data triples and then creating RDFa, RDF XML, and JSON-LD representations of those graphs. We have also started using Library of Congress Linked Data to associate a Dublin Core language with our entities. These steps allow us to publish MARC records as linked data. Next, we switch tasks and look at how other linked-data resources can enhance the bibliographic graphs we created earlier in this session.


In [40]:
print(marc_records[191].title())


Moby-Dick, or, The whale /

In [41]:
moby_dict_dbpedia = 'http://dbpedia.org/data/Moby-Dick'
moby_dict_url = "{0}.json".format(moby_dict_dbpedia)
print(moby_dict_url)
moby_dict_json = json.load(urllib2.urlopen(moby_dict_url))


http://dbpedia.org/data/Moby-Dick.json

In [42]:
print(moby_dict_json["http://dbpedia.org/resource/Moby-Dick"])
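
The DBpedia JSON appears to follow a similar subject, predicate, list-of-values layout, here with plain URIs as keys. Assuming that layout, the sketch below collects the values of one predicate; sample_dbpedia is a made-up fragment for illustration (the category URI in it is a placeholder, not real DBpedia data) and values_for is a hypothetical helper.

```python
MOBY_DICK = "http://dbpedia.org/resource/Moby-Dick"
DCT_SUBJECT = "http://purl.org/dc/terms/subject"

# Made-up fragment; the object URI is a placeholder, not real DBpedia data.
sample_dbpedia = {
    MOBY_DICK: {
        DCT_SUBJECT: [
            {"type": "uri", "value": "http://example.org/category/Placeholder"}
        ]
    }
}

def values_for(rdf_json, subject, predicate):
    """Return the value of every object recorded for subject and predicate."""
    return [obj["value"] for obj in rdf_json.get(subject, {}).get(predicate, [])]

print(values_for(sample_dbpedia, MOBY_DICK, DCT_SUBJECT))
```

The same helper could be pointed at moby_dict_json once it has been loaded, provided the downloaded data really has this shape.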

Badge Assessment

  • Compare the subjects from record 192 in marc_records with the subjects from DBpedia.

In [ ]: