Find missing data in UCLDC couchdb

This notebook is a starting point for QA against the CouchDB database currently used to store UCLDC harvested records. Add cells to check other data values as needed. For each field you check, the functions below require a "<field_path>_value" view in the "qa_reports" design document in CouchDB.
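
To make the "missing" versus "null" distinction concrete, here is a minimal Python sketch of the key such a view is assumed to emit for one document. The key shape `[value, collection_id, doc_id]` is taken from the row output later in this notebook. The helper names, and the idea that the collection id is the part of the document id before `--`, are assumptions for illustration only; the real view is defined in CouchDB and may differ.

```python
MISSING = "__MISSING__"  # sentinel seen as the first key element for absent fields


def resolve_field(doc, field_path):
    """Walk a dotted path such as 'sourceResource.title' through nested dicts.

    Returns MISSING when any step of the path is absent; returns the stored
    value (which may be None or '') when the field exists.
    """
    current = doc
    for part in field_path.split('.'):
        if not isinstance(current, dict) or part not in current:
            return MISSING
        current = current[part]
    return current


def emit_key(doc, field_path):
    """Build the assumed view key [value, collection_id, doc_id] for one document."""
    # Assumption: ids look like '<collection_id>--<identifier>', as in the output below.
    collection_id = doc['_id'].split('--', 1)[0]
    return [resolve_field(doc, field_path), collection_id, doc['_id']]
```

A document with no `sourceResource.title` at all would get a key starting with `"__MISSING__"`, while one with `"title": null` would get a key starting with `None`.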


In [12]:
#import some basic useful libraries and functions
from __future__ import print_function
import sys
import os

Connect to the couchdb database. The database name is 'ucldc' for our system.


In [13]:
import couchdb
s=couchdb.Server('https://127.0.0.1/couchdb')
db=s['ucldc']

Create functions for finding missing field data.


In [14]:
def get_missing_for_field(field_path, collection_id=None):
    '''Calls the couchdb view corresponding to <field_path>_value, using start/end key values to get only
    documents which are MISSING the field entirely.
    If no collection_id is given, it will work on all documents in the DB and could be very slow.
    '''
    viewname = 'qa_reports/{0}_value'.format(field_path)
    start_key = ["__MISSING__"]
    end_key = ["__MISSING__"]
    if collection_id:
        start_key.append(str(collection_id))
        end_key.append(str(collection_id))
    end_key.append({}) # empty dict
    view=db.view(viewname, startkey=start_key, endkey=end_key,group_level=3)
    missing_list = [r for r in view]
    return missing_list

def get_missing_for_field_in_collections(field_path, collection_ids):
    '''Get missings for a number of collections'''
    missing_list = []
    for cid in collection_ids:
        missing_list.extend(get_missing_for_field(field_path, collection_id=cid))
    return missing_list
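
The start/end keys used above can be inspected without touching the database. The sketch below only rebuilds the two lists the same way `get_missing_for_field` does; `missing_key_range` is a hypothetical helper name, not part of the notebook's code.

```python
def missing_key_range(collection_id=None):
    """Return the (start_key, end_key) pair used to select '__MISSING__' rows."""
    start_key = ["__MISSING__"]
    end_key = ["__MISSING__"]
    if collection_id:
        # Narrow the range to one collection; ids are stored as strings in the key.
        start_key.append(str(collection_id))
        end_key.append(str(collection_id))
    # An empty object sorts after every string, so it closes the range.
    end_key.append({})
    return start_key, end_key
```

For example, `missing_key_range(1675)` gives `(['__MISSING__', '1675'], ['__MISSING__', '1675', {}])`, which brackets every row whose key starts with `"__MISSING__", "1675"`.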

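Unfortunately, blank or "null" values need to be handled differently: they appear in the view under their own keys (null or an empty string) rather than under "__MISSING__".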


In [15]:
def get_null_for_field(field_path, collection_id=None):
    '''Calls the couchdb view corresponding to <field_path>_value, using start/end key values to get only
    documents which have null or blank string values for the field.
    If no collection_id is given, it will work on all documents in the DB and could be very slow.
    '''
    viewname = 'qa_reports/{0}_value'.format(field_path)
    if not collection_id:
        # [] sorts before every key, so one range covers both the
        # [null, ...] rows and the ["", ...] rows.
        key_ranges = [([], ["", {}])]
    else:
        # A single range cannot pin the collection id across two different
        # first elements, so query the null rows and the blank rows separately.
        cid = str(collection_id)
        key_ranges = [([blank, cid], [blank, cid, {}]) for blank in (None, "")]
    null_list = []
    for start_key, end_key in key_ranges:
        print("SKEY:{} EKEY:{}".format(start_key, end_key))
        view = db.view(viewname, startkey=start_key, endkey=end_key, group_level=3)
        null_list.extend(r for r in view)
    return null_list

def get_null_for_field_in_collections(field_path, collection_ids):
    '''Get null or blank for a number of collections'''
    null_list = []
    for cid in collection_ids:
        null_list.extend(get_null_for_field(field_path, collection_id=cid))
    return null_list
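
Why a range from `[]` to `["", {}]` picks up null and blank values depends on how CouchDB orders view keys: null, then false, true, numbers, strings, arrays and objects, with arrays compared element by element. The sketch below imitates that ordering in plain Python. It is a simplification, not CouchDB's implementation: in particular, real CouchDB compares strings with ICU collation, whereas this uses Python's default string comparison, so only the position of the empty string should be relied on. `collation_key` and `in_range` are hypothetical helper names.

```python
try:
    string_types = (basestring,)  # Python 2
except NameError:
    string_types = (str,)         # Python 3


def collation_key(value):
    """Map a JSON value to a tuple that sorts in (simplified) CouchDB view order."""
    if value is None:
        return (0,)
    if value is False:
        return (1,)
    if value is True:
        return (2,)
    if isinstance(value, (int, float)):
        return (3, value)
    if isinstance(value, string_types):
        return (4, value)  # simplification: no ICU collation
    if isinstance(value, list):
        # Arrays compare element by element; a shorter prefix sorts first.
        return (5, tuple(collation_key(v) for v in value))
    if isinstance(value, dict):
        return (6, tuple(sorted((k, collation_key(v)) for k, v in value.items())))
    raise TypeError('not a JSON value: {!r}'.format(value))


def in_range(key, start_key, end_key):
    """True if key falls between start_key and end_key (inclusive)."""
    return collation_key(start_key) <= collation_key(key) <= collation_key(end_key)
```

With `start_key = []` and `end_key = ["", {}]`, keys beginning with `None` or `""` fall inside the range, while a key beginning with `"__MISSING__"` or any other non-empty string falls outside it.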

The function below prints every record that matches the criteria, which can be a very long list.


In [16]:
# Now flag problems: certain fields should never be missing.
def report_missing_for_field(field, collection_id=None):
    '''Convenience function for reporting'''
    missing = get_missing_for_field(field, collection_id=collection_id)
    for row in missing:
        # Writing to stderr gives the output a red background in the notebook.
        print('Missing {0}: {1}'.format(field, row['key'][2]), file=sys.stderr)
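
Because the full list can be very long, it may help to cap how many lines are reported. The following is a minimal sketch, not part of the original notebook: `format_missing_report` and its `limit` parameter are hypothetical, and it works on any sequence of rows that support `row['key']`, such as the list returned by `get_missing_for_field`.

```python
from itertools import islice


def format_missing_report(field, rows, limit=20):
    """Return at most `limit` report lines, plus a summary line if rows were cut off."""
    rows = list(rows)
    lines = ['Missing {0}: {1}'.format(field, row['key'][2])
             for row in islice(rows, limit)]
    hidden = len(rows) - len(lines)
    if hidden > 0:
        lines.append('... and {0} more'.format(hidden))
    return lines
```

The lines can then be printed to stderr in a loop, in the same way as `report_missing_for_field` does.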

In [17]:
report_missing_for_field('dataProvider', collection_id=1675)

In [21]:
missing = get_missing_for_field('sourceResource.identifier')
print('Number missing identifier:{}'.format(len(missing)))
collection_ids = [row['key'][1] for row in missing]
collection_ids = set(collection_ids) # this will give unique collection ids for missing id docs
print('Collection ids:{}'.format(collection_ids))
print('First 10:{}'.format([row['key'][2] for row in missing[:10]]), file=sys.stderr)


Number missing identifier:63836
Collection ids:set([u'19', u'26094'])
First 10:[u'19--0041a951-b9bc-4a16-8842-b55250d7b42b', u'19--00474908-7317-48c6-90c7-7948861b920f', u'19--00910bd5-8b79-4a7f-abe6-1064cc04bc4f', u'19--00e714ef-1e7a-4b8e-a946-36fcfb542911', u'19--0122dc58-cc5d-4522-a16a-673fcabe019b', u'19--0186b438-e755-43ab-94e2-1eabf1506382', u'19--01a4a686-de46-457c-a9db-8ce3ba9b86a8', u'19--03734322-47e9-46a7-b008-08dc814d75bd', u'19--03b2dd11-997c-411d-ac54-5f7eb717cacd', u'19--03fe02c1-26dc-46b3-8bb3-7360ed413427']
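
The set of collection ids shows which collections are affected but not how badly. A per-collection count is a small extension, sketched here with `collections.Counter`. `count_by_collection` is a hypothetical helper; it assumes each row's key has the collection id as its second element, as in the output above.

```python
from collections import Counter


def count_by_collection(rows):
    """Count rows per collection id (the second element of each row key)."""
    return Counter(row['key'][1] for row in rows)
```

Calling `count_by_collection(missing).most_common()` would list the collections with the most missing identifiers first.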

In [22]:
report_missing_for_field('sourceResource.title')


Missing sourceResource.title: 4731--http://ark.cdlib.org/ark:/13030/c84m92sh
Missing sourceResource.title: 4731--http://ark.cdlib.org/ark:/13030/c88c9tgd
Missing sourceResource.title: 4731--http://ark.cdlib.org/ark:/13030/c8d21vtd

In [23]:
report_missing_for_field('isShownBy', collection_id=1675)


Missing isShownBy: 1675--http://ark.cdlib.org/ark:/13030/kt1t1nd9zd

In [24]:
report_missing_for_field('isShownBy')


Missing isShownBy: 10130--http://ark.cdlib.org/ark:/13030/kt3j49s0dk
Missing isShownBy: 1675--http://ark.cdlib.org/ark:/13030/kt1t1nd9zd
Missing isShownBy: 1750--http://ark.cdlib.org/ark:/13030/kt496nf26r
Missing isShownBy: 23105--http://ark.cdlib.org/ark:/13030/c84j0h2c
Missing isShownBy: 23107--http://ark.cdlib.org/ark:/13030/c8057hgc
Missing isShownBy: 23107--http://ark.cdlib.org/ark:/13030/c8154jk9
Missing isShownBy: 23107--http://ark.cdlib.org/ark:/13030/c8251kp7
Missing isShownBy: 23107--http://ark.cdlib.org/ark:/13030/c83x886s
Missing isShownBy: 23107--http://ark.cdlib.org/ark:/13030/c84x599q
Missing isShownBy: 23107--http://ark.cdlib.org/ark:/13030/c86q1zsw
Missing isShownBy: 23107--http://ark.cdlib.org/ark:/13030/c87p90x3
Missing isShownBy: 23107--http://ark.cdlib.org/ark:/13030/c88k7bmq
Missing isShownBy: 23107--http://ark.cdlib.org/ark:/13030/c8bg2qhm
Missing isShownBy: 23107--http://ark.cdlib.org/ark:/13030/c8cf9rm0
Missing isShownBy: 23107--http://ark.cdlib.org/ark:/13030/c8db83cw
Missing isShownBy: 23107--http://ark.cdlib.org/ark:/13030/c8g73g7b
Missing isShownBy: 23107--http://ark.cdlib.org/ark:/13030/c8h41sz0
Missing isShownBy: 23107--http://ark.cdlib.org/ark:/13030/c8j38v2s
Missing isShownBy: 23107--http://ark.cdlib.org/ark:/13030/c8kw5hkg
Missing isShownBy: 23107--http://ark.cdlib.org/ark:/13030/c8mw2jnz
Missing isShownBy: 23107--http://ark.cdlib.org/ark:/13030/c8nv9krb
Missing isShownBy: 23107--http://ark.cdlib.org/ark:/13030/c8qn6896
Missing isShownBy: 23107--http://ark.cdlib.org/ark:/13030/c8rn39cp
Missing isShownBy: 23107--http://ark.cdlib.org/ark:/13030/c8sn0bgm
Missing isShownBy: 23107--http://ark.cdlib.org/ark:/13030/c8wd423d
Missing isShownBy: 23107--http://ark.cdlib.org/ark:/13030/c8xd136b
Missing isShownBy: 24760--http://ark.cdlib.org/ark:/13030/hb9489p040
Missing isShownBy: 25471--http://ark.cdlib.org/ark:/13030/hb2779n5n6
Missing isShownBy: 25471--http://ark.cdlib.org/ark:/13030/hb4p3003p3
Missing isShownBy: 25471--http://ark.cdlib.org/ark:/13030/hb8x0nb3jw
Missing isShownBy: 25471--http://ark.cdlib.org/ark:/13030/hb958006wx
Missing isShownBy: 25471--http://ark.cdlib.org/ark:/13030/hb9v19p0sd
Missing isShownBy: 25597--http://ark.cdlib.org/ark:/28722/bk001533m4t
Missing isShownBy: 25597--http://ark.cdlib.org/ark:/28722/bk001533m5c
Missing isShownBy: 25597--http://ark.cdlib.org/ark:/28722/bk001533m6x
Missing isShownBy: 25597--http://ark.cdlib.org/ark:/28722/bk001533m7g
Missing isShownBy: 25597--http://ark.cdlib.org/ark:/28722/bk001533m81
Missing isShownBy: 25597--http://ark.cdlib.org/ark:/28722/bk001533m9k
Missing isShownBy: 25597--http://ark.cdlib.org/ark:/28722/bk001533n04
Missing isShownBy: 25597--http://ark.cdlib.org/ark:/28722/bk001533n1p
Missing isShownBy: 25597--http://ark.cdlib.org/ark:/28722/bk001533n27
Missing isShownBy: 25597--http://ark.cdlib.org/ark:/28722/bk001533n3s
Missing isShownBy: 25597--http://ark.cdlib.org/ark:/28722/bk001533n4b
Missing isShownBy: 25597--http://ark.cdlib.org/ark:/28722/bk001533n5w
Missing isShownBy: 25597--http://ark.cdlib.org/ark:/28722/bk001533n6f
Missing isShownBy: 25597--http://ark.cdlib.org/ark:/28722/bk001533n70
Missing isShownBy: 25597--http://ark.cdlib.org/ark:/28722/bk001533n8j
Missing isShownBy: 25597--http://ark.cdlib.org/ark:/28722/bk001533n93
Missing isShownBy: 25597--http://ark.cdlib.org/ark:/28722/bk001533p0n
Missing isShownBy: 25597--http://ark.cdlib.org/ark:/28722/bk001533p16
Missing isShownBy: 25597--http://ark.cdlib.org/ark:/28722/bk001533p2r
Missing isShownBy: 25597--http://ark.cdlib.org/ark:/28722/bk001533p39
Missing isShownBy: 25597--http://ark.cdlib.org/ark:/28722/bk001533p4v
Missing isShownBy: 25597--http://ark.cdlib.org/ark:/28722/bk001533p5d
Missing isShownBy: 4863--http://ark.cdlib.org/ark:/13030/kt6j49r9rk
Missing isShownBy: 5022--http://ark.cdlib.org/ark:/13030/kt2w1002n4
Missing isShownBy: 5022--http://ark.cdlib.org/ark:/13030/kt4w1003xc
Missing isShownBy: 6083--http://ark.cdlib.org/ark:/13030/hb638nb1hj
Missing isShownBy: 8622--http://ark.cdlib.org/ark:/13030/kt4m3nf33n
Missing isShownBy: 8622--http://ark.cdlib.org/ark:/13030/tf7v19p5mm
Missing isShownBy: 9836--http://ark.cdlib.org/ark:/13030/ft829006r8

In [25]:
missing = get_missing_for_field_in_collections('isShownBy', (1675, 1750))
for row in missing:
    print('Missing "isShownBy" for {}'.format(row['key'][2]), file=sys.stderr)


Missing "isShownBy" for 1675--http://ark.cdlib.org/ark:/13030/kt1t1nd9zd
Missing "isShownBy" for 1750--http://ark.cdlib.org/ark:/13030/kt496nf26r

In [26]:
### missing = get_missing_for_field('sourceResource.collection.description')
### print('MISSING collection descrip:{}'.format(len(missing)))
null = get_null_for_field('sourceResource.collection.description')
print('null collection descrip:{}'.format(len(null)))
for i in null[:10]:
    print(i)


SKEY:[] EKEY:['', {}]
null collection descrip:77669
<Row key=[None, u'1412', u'1412--http://ark.cdlib.org/ark:/20775/bb0001238p'], value=1>
<Row key=[None, u'1412', u'1412--http://ark.cdlib.org/ark:/20775/bb00012396'], value=1>
<Row key=[None, u'1412', u'1412--http://ark.cdlib.org/ark:/20775/bb00012401'], value=1>
<Row key=[None, u'1412', u'1412--http://ark.cdlib.org/ark:/20775/bb0001241j'], value=1>
<Row key=[None, u'1412', u'1412--http://ark.cdlib.org/ark:/20775/bb00012422'], value=1>
<Row key=[None, u'1412', u'1412--http://ark.cdlib.org/ark:/20775/bb0001243k'], value=1>
<Row key=[None, u'1412', u'1412--http://ark.cdlib.org/ark:/20775/bb00012443'], value=1>
<Row key=[None, u'1412', u'1412--http://ark.cdlib.org/ark:/20775/bb0001245m'], value=1>
<Row key=[None, u'1412', u'1412--http://ark.cdlib.org/ark:/20775/bb00012464'], value=1>
<Row key=[None, u'1412', u'1412--http://ark.cdlib.org/ark:/20775/bb0001247n'], value=1>
