Calling the SHARE API


Here are some working examples of how to query the current SHARE database for individual results, metrics, and statistics.

These particular queries are just examples, and the data is open for anyone to use, so feel free to make your own and experiment!

First, we'll need a URL to access the SHARE Search API:

If you want to learn more about Python, the language that we are using to access the API and play with the data, these are a few great guides:

And a quick introduction to Jupyter Notebooks:


In [1]:
SHARE_API = 'https://staging-share.osf.io/api/search/abstractcreativework/_search'

The SHARE Search Schema

The SHARE search API is built on a tool called elasticsearch. It lets you search a subset of SHARE's normalized metadata in a simple format.

Here are the fields available in SHARE's elasticsearch endpoint:

- 'title'
- 'language'
- 'subject'
- 'description'
- 'date'
- 'date_created'
- 'date_modified'
- 'date_updated'
- 'date_published'
- 'tags'
- 'links'
- 'awards'
- 'venues'
- 'sources'
- 'contributors'

You can see a formatted version of the base results from the API by visiting the SHARE Search API URL.
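
For example, here's a minimal sketch of fetching those base results programmatically, using the SHARE_API variable we set in cell 1 (the 'hits' structure is standard elasticsearch):

import requests

# fetch the default first page of results from the search endpoint
base_results = requests.get(SHARE_API).json()

# elasticsearch reports the total number of matching documents under 'hits'
print(base_results['hits']['total'])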

Service Names for Reference


Each provider that SHARE harvests from has a specific internal short name. Let's make an API call to generate a table of all of those "internal" names, along with the official name of the repository each represents.

The SHARE API has several endpoints. One of them returns a list of all of the providers that SHARE is harvesting from, along with their internal names, official names, links to their homepages, and a simple icon representing each service, in a parsable format called JSON.

Let's make a call to that API endpoint using the requests library, get the JSON data, and print out all of the short names and long names.


In [2]:
# Requests library allows you to send organic, grass-fed HTTP/1.1 requests, no need to manually add query strings 
    # to your URLs, or to form-encode your POST data. Docs: http://docs.python-requests.org/en/master/
import requests

# Json library parses JSON from strings or files. The library parses JSON into a Python dictionary or list. 
    # It can also convert Python dictionaries or lists into JSON strings. 
    # https://docs.python.org/3/library/json.html
import json

# This takes the URL and puts it into a variable (so we only need to ever reference this variable, 
    # and so we don't have to repeat adding this URL when we want to work with the data)
SHARE_PROVIDERS = 'https://staging-share.osf.io/api/providers/'

# this requests the data from SHARE_PROVIDERS and uses requests' built-in json() method to parse 
    # the response into a Python dictionary
data = requests.get(SHARE_PROVIDERS).json()

# literally prints out the sentence in the quotes on the screen
print('Here are the first 10 Providers:')

# this is a for loop (https://wiki.python.org/moin/ForLoop) to repeat the same tasks for each of the items in the 
    # list of our SHARE providers (what we put into the variable "data").
# for every item (called 'source' below) in the list, we print out the title, website, and provider name, 
    # formatted so each is on a new line (\n)
for source in data['results']:
    print(
        '{}\n{}\n{}\n'.format(
            source['long_title'],
            source['home_page'],
            source['provider_name']
        )
    )


Here are the first 10 Providers:
Research Online @ University of Wollongong
http://ro.uow.edu.au
au.uow

Ghent University Academic Bibliography
https://biblio.ugent.be/
be.ghent

Pontifical Catholic University of Rio de Janeiro
http://www.maxwell.vrac.puc-rio.br
br.pcurio

Lake Winnipeg Basin Information Network
http://130.179.67.140
ca.lwbin

PAPYRUS - Dépôt institutionnel de l'Université de Montréal
http://papyrus.bib.umontreal.ca
ca.umontreal

Western University
http://ir.lib.uwo.ca
ca.uwo

BioMed Central
http://www.springer.com/us/
com.biomedcentral

Social Science Research Network
http://papers.ssrn.com/
com.dailyssrn

figshare
https://figshare.com/
com.figshare

Nature Publishing Group
http://www.nature.com/
com.nature

SHARE Schema

You can make queries against any of the fields defined in the SHARE Schema. If we were able to harvest the information from the original source, it should appear in SHARE. However, not all fields are required for every document.

Required fields include:

  • title
  • contributors
  • uris
  • providerUpdatedDateTime

We add some information after each document is harvested inside the field shareProperties, including:

  • source (where the document was originally harvested)
  • docID (a unique identifier for that object from that source)

These two fields can be combined to make a unique document identifier.
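
As a quick illustration, here's a minimal sketch of building such an identifier (make_doc_id is a hypothetical helper of ours, not part of the SHARE API):

# combine the shareProperties source and docID into one unique identifier
# make_doc_id is a hypothetical helper, not something SHARE provides
def make_doc_id(share_properties):
    return '{}|{}'.format(share_properties['source'], share_properties['docID'])

# e.g. make_doc_id({'source': 'providers.org.datacite', 'docID': 'some-doc-id'})
# would return 'providers.org.datacite|some-doc-id'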

Simple Queries

Let's get the first 3 results from the most basic query - the first page of the most recently updated research release events in SHARE.

We'll use the URL parsing library furl to keep track of all of our arguments to the URL, because we'll be modifying them as we go along. We'll print the URL as we go to take a look at it, so we know what we're requesting.

We'll print out the result's title and sources where it appears.


In [3]:
# furl is a Python library that allows you to easily manipulate URLs. https://github.com/gruns/furl
import furl

# In cell 1, we put the URL for the SHARE API into the variable SHARE_API. We can use it even down here!
    # We are parsing it using furl and putting it into a new variable called search_url
search_url = furl.furl(SHARE_API)

# The 'size' argument limits how many results the API returns -- here, the first 3 entries
search_url.args['size'] = 3

# We request the information from search_url and parse the JSON that comes back 
    # (requests.get!) like we did in cell 2, storing the parsed response in a variable
recent_results = requests.get(search_url.url).json()

# The parsed response is a nested structure -- a dictionary whose values can be other 
    # dictionaries and lists. Elasticsearch puts the actual list of results under 
    # ['hits']['hits'], so we pull that list out here
recent_results = recent_results['hits']['hits']

#We are printing out on the screen the variable search_url
print('The request URL is {}'.format(search_url.url))

#This is just so we have a nice visual cue between the url that we searched and the actual data we are grabbing
print('----------')

#Another for loop! For all of the items in the results we just grabbed, we print out each result's 
    # title and the sources it came from
for result in recent_results:
    print(
        '{} -- from {}'.format(
            result['_source']['title'],
            result['_source']['sources']
        )
    )


The request URL is https://staging-share.osf.io/api/search/abstractcreativework/_search?size=3
----------
LEDAPS corrected Landsat Enhanced Thematic Mapper image data for Shortgrass Steppe collected on 2011-06-17 -- from ['providers.org.datacite']
Test entry from ezid service for identifier: doi:10.6085//TEST/20152611351448160.0758426051207266 -- from ['providers.org.datacite']
LEDAPS corrected Landsat Enhanced Thematic Mapper image data for Shortgrass Steppe collected on 1990-04-20 -- from ['providers.org.datacite']

Now let's limit that query to only documents mentioning "giraffes" somewhere in the title, description, or in any of the metadata. We'd do that by adding a query search term.


In [4]:
# we are reusing that variable search_url from the cell above! We are querying (hence the args['q']) the API to try 
    # and get items that have the word 'giraffes' in them
search_url.args['q'] = 'giraffes'

# We request the information from search_url and parse the JSON that comes back 
    # (requests.get!) like we did in cell 2, storing the parsed response in a variable
recent_results = requests.get(search_url.url).json()

# As before, the parsed response nests the actual list of results under 
    # ['hits']['hits'], so we pull that list out here
recent_results = recent_results['hits']['hits']

# We are printing out on the screen the variable search_url
print('The request URL is {}'.format(search_url.url))

# This is just so we have a nice visual cue between the url that we searched and the actual data we are grabbing
print('---------')

# Another for loop! For all of the items in the results we just grabbed, we are printing out the entries that have 
    # the keyword 'giraffes'
for result in recent_results:
    print(
        '{} -- from {}'.format(
            result['_source']['title'],
            result['_source']['sources']
        )
    )


The request URL is https://staging-share.osf.io/api/search/abstractcreativework/_search?size=3&q=giraffes
---------
Genome reveals why giraffes have long necks -- from ['providers.org.crossref']
Odd creature was ancient ancestor of today’s giraffes -- from ['providers.org.crossref']
Genome reveals why giraffes have long necks -- from ['providers.com.nature']

Let's search for documents from the source CrossRef.


In [5]:
# we are reusing that variable search_url from the cell above! We are querying (see that arg again) for everything 
    # from the source provider CrossRef
search_url.args['q'] = 'sources:providers.org.crossref'

# We request the information from search_url and parse the JSON that comes back 
    # (requests.get!) like we did in cell 2, storing the parsed response in a variable
recent_results = requests.get(search_url.url).json()

# As before, the parsed response nests the actual list of results under 
    # ['hits']['hits'], so we pull that list out here
recent_results = recent_results['hits']['hits']

# We are printing out on the screen the variable search_url
print('The request URL is {}'.format(search_url.url))

# This is just so we have a nice visual cue between the url that we searched and the actual data we are grabbing
print('---------')

# Another for loop! For all of the items in the results we just grabbed, we are printing out the entries that are 
    # from CrossRef
for result in recent_results:
    print(
        '{} -- from {}'.format(
            result['_source']['title'],
            result['_source']['sources']
        )
    )


The request URL is https://staging-share.osf.io/api/search/abstractcreativework/_search?size=3&q=sources:providers.org.crossref
---------
Communicating Accessibility Resources Benefits Everyone -- from ['providers.org.crossref']
The Devil Is in the Details -- from ['providers.org.crossref']
Progression of coronary artery calcification by cardiac computed tomography -- from ['providers.org.crossref']

Let's combine the two and find documents from CrossRef that mention giraffes.


In [6]:
# we are reusing that variable search_url from the cell above! We are querying (see that arg again) for entries 
    # that use the keyword "giraffes" from the source provider CrossRef
search_url.args['q'] = 'sources:providers.org.crossref AND giraffes'

# We request the information from search_url and parse the JSON that comes back 
    # (requests.get!) like we did in cell 2, storing the parsed response in a variable
recent_results = requests.get(search_url.url).json()

# As before, the parsed response nests the actual list of results under 
    # ['hits']['hits'], so we pull that list out here
recent_results = recent_results['hits']['hits']

# We are printing out on the screen the variable search_url
print('The request URL is {}'.format(search_url.url))

# This is just so we have a nice visual cue between the url that we searched and the actual data we are grabbing
print('---------')

# Another for loop! For all of the items in the results we just grabbed, we are printing out the entries that are 
    # from CrossRef that are about giraffes
for result in recent_results:
    print(
        '{} -- from {}'.format(
            result['_source']['title'],
            result['_source']['sources']
        )
    )


The request URL is https://staging-share.osf.io/api/search/abstractcreativework/_search?size=3&q=sources:providers.org.crossref+AND+giraffes
---------
Genome reveals why giraffes have long necks -- from ['providers.org.crossref']
Odd creature was ancient ancestor of today’s giraffes -- from ['providers.org.crossref']
Of Caucasians, Asians, and Giraffes: The Influence of Categorization and Target Valence on Social Projection -- from ['providers.org.crossref']

Complex Queries

The SHARE Search API runs on elasticsearch - meaning that it can accept complicated queries that give you a wide variety of information.

Here are some examples of how to make more complex queries using raw elasticsearch results. You can read a lot more about queries in the elasticsearch Query DSL documentation.
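
For instance, here's a sketch of a raw elasticsearch query blob that counts documents per source with a terms aggregation (standard elasticsearch syntax; this assumes the sources field is aggregatable in this index). We'll define a helper for sending queries like this in a moment:

# a sketch: count documents per source using a terms aggregation
sources_aggregation = {
    "size": 0,  # return only the aggregation, not individual hits
    "aggs": {
        "sources": {
            "terms": {"field": "sources"}
        }
    }
}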


In [7]:
# reset the args so that we remove our old query arguments.
search_url.args = None  

# Show the URL that we'll be requesting to make sure the args were cleared
search_url.url


Out[7]:
'https://staging-share.osf.io/api/search/abstractcreativework/_search'

Query Setup

We can define a helper function that we can reuse to make querying simpler. Elasticsearch queries are passed through as JSON blobs specifying how to return the information you want.


In [8]:
#just like the json library from cell 2
import json

#this is called a function -- we DEFINE it (def) and name it something useful for us. It makes it easy for us to 
    #reuse this later on by just calling the function by typing its name, and adding the appropriate values in the 
    #parentheses (called arguments). To learn more about functions: 
        #http://www.tutorialspoint.com/python/python_functions.htm
        #http://docs.python-guide.org/en/latest/writing/style/?highlight=function
        
# This is a helper function that will use the requests library, pass along the correct headers, and make the query
    # we want
def query_share(url, query):
    headers = {'Content-Type': 'application/json'}
    data = json.dumps(query)
    return requests.post(url, headers=headers, data=data).json()
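
As a quick sanity check of the helper, a match_all query (standard elasticsearch syntax) should return the same total as an unfiltered request:

# usage sketch: the simplest possible query through our helper
everything = query_share(search_url.url, {'query': {'match_all': {}}})
print(everything['hits']['total'])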

Some Queries

The SHARE schema has many fields, and many of the original sources do not provide information for all of them. We can do a query to find out whether a certain field exists within certain records. The SHARE API is set up to show an empty list if the field is empty.

Let's query for the counts of documents that have content in their tags field.


In [9]:
# this query matches all the items that have tags
tags_query = {
    "query": {
        "exists": {
            "field": "tags"
        }
    }
}


# this query matches all the items that do NOT have tags
missing_tags_query = {
    "query": {
        "bool": {
            "must_not": {
                "exists": {
                    "field": "tags"
                }
            }
        }      
    }
}

In [10]:
# we are running the query for items with tags and storing the parsed response
with_tags = query_share(search_url.url, tags_query)

# we are running the query for items without tags and storing the parsed response
missing_tags = query_share(search_url.url, missing_tags_query)

#Gets the total number of documents in SHARE (an unfiltered request matches everything)
total_results = requests.get(search_url.url).json()['hits']['total']

# getting the percentages of hits with and without tags, respectively, using the built-in float function 
    # so the division isn't truncated to a whole number
    # read more about the float function: https://docs.python.org/3/library/functions.html#float
with_tags_percent = (float(with_tags['hits']['total'])/total_results)*100
missing_tags_percent = (float(missing_tags['hits']['total'])/total_results)*100


# this prints out how many results have tags, along with what percentage of the total 
    # that is, formatted to two decimal places
print(
    '{} results out of {}, or {}%, have tags.'.format(
        with_tags['hits']['total'],
        total_results,
        format(with_tags_percent, '.2f')
    )
)

# this prints out how many results do NOT have tags, along with what percentage of the 
    # total that is, formatted to two decimal places
print(
    '{} results out of {}, or {}%, do NOT have tags.'.format(
        missing_tags['hits']['total'],
        total_results,
        format(missing_tags_percent, '.2f')
    )
)

# Visual cue, printing the percentage of results with tags + percent with no tags (we make sure it equals 100)
print('------------')
print('As a little sanity check....')
print('{} + {} = {}%'.format(with_tags_percent, missing_tags_percent, format(with_tags_percent + missing_tags_percent, '.2f')))


2443294 results out of 4914457, or 49.72%, have tags.
2471163 results out of 4914457, or 50.28%, do NOT have tags.
------------
As a little sanity check....
49.71645901062925 + 50.28354098937074 = 100.00%

Using SHAREPA for SHARE Parsing and Analysis

While you can always pass raw elasticsearch queries to the SHARE API, there is also a pip-installable Python library that makes elasticsearch aggregations a little simpler. This library is called sharepa - short for SHARE Parsing and Analysis.
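
If you don't already have it, sharepa is available on PyPI (in a notebook, the leading ! runs a shell command):

!pip install sharepa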

Basic Actions

A basic search will provide access to all documents in SHARE in slices of 10 documents.

Count

You can use sharepa and the basic search to get the total number of documents in SHARE.


In [11]:
# Sharepa is a python client for browsing and analyzing SHARE data specifically using elasticsearch querying.
    # We can use this to aggregate, graph, and analyze the data. 
    # Helpful Links:
        # https://github.com/CenterForOpenScience/sharepa
        # https://pypi.python.org/pypi/sharepa
    # here, we import basic_search, a ready-made search object from sharepa
from sharepa import basic_search

# this performs a basic search over the SHARE dataset and returns a count of the items in it
basic_search.count()


Out[11]:
4914457

Iterating Through Results

Executing the basic search will send the actual basic query to the SHARE API and then let you iterate through results, 10 at a time.


In [12]:
# executing the basic search sends the query to the SHARE API and returns the first 10 hits
results = basic_search.execute()

# this prints out the title of each of the 10 hits in results (above)
for hit in results:
    print(hit.title)


LEDAPS corrected Landsat Enhanced Thematic Mapper image data for Shortgrass Steppe collected on 2011-06-17
Test entry from ezid service for identifier: doi:10.6085//TEST/20152611351448160.0758426051207266
LEDAPS corrected Landsat Enhanced Thematic Mapper image data for Shortgrass Steppe collected on 1990-04-20
Chemical composition of essential oils of three Pistacia cultivars in Khorasan Razavi, Iran
Test entry from ezid service for identifier: doi:10.6085//TEST/20152611351448190.5563383046761334
Test entry from ezid service for identifier: doi:10.6085//TEST/20152611351448200.9788127569002799
Test entry from ezid service for identifier: doi:10.6085//TEST/20152611351448150.3484667860365027
Compiled Tree-ring Dates from the Southwestern United States (Unrestricted)
None
Area-based Amino Acid Composition for three types of interactions in the BNCP-CS dataset

If we don't want 10 results, or we want to offset the results, we can use slices.


In [13]:
# this performs a basic search for results at offsets 20 through 24 -- Python slices start at 0 and exclude 
    # the end index -- which is useful for paging through the SHARE dataset
results = basic_search[20:25].execute()

# prints out the provider for each of the items captured in the list, results (above)
for hit in results:
    print(hit.sources)


['providers.org.datacite']
['providers.org.datacite']
['providers.org.datacite']
['providers.org.datacite']
['providers.org.datacite']

Advanced Search with sharepa

You can make your own search object, which allows you to pass in custom queries for certain terms or SHARE fields. Queries are formed using lucene query syntax, just like we used in the above examples.

This type of query accepts an exists field. Other options include a query_string, a match query, a multi-match query, a bool query, and any other query structure available in the elasticsearch API.

We can see the query that we're about to send to elasticsearch by using the pretty_print helper function. You'll see that it looks very similar to the queries we defined by hand earlier.


In [14]:
# Sharepa is a python client for browsing and analyzing SHARE data specifically using elasticsearch querying.
    # We can use this to aggregate, graph, and analyze the data. 
    # Helpful Links:
        # https://github.com/CenterForOpenScience/sharepa
        # https://pypi.python.org/pypi/sharepa
    # here, we import the ShareSearch class and the pretty_print helper from sharepa
from sharepa import ShareSearch
from sharepa.helpers import pretty_print

# we create a new ShareSearch object for us to use
my_search = ShareSearch()

# Lucene supports fielded data. When performing a search you can either specify a field, or use the default field. 
my_search = my_search.query(
    'exists', # Type of query
    field='tags', # This query will find all documents that DO have a tags field
)

# this prints out (prettily!) our search, transformed into a dictionary data type
    # read more about dictionaries here: http://learnpythonthehardway.org/book/ex39.html
pretty_print(my_search.to_dict())


{
    "query": {
        "exists": {
            "field": "tags"
        }
    }
}

When you execute that query, you can then iterate through the results the same way that you could with the simple search query.


In [15]:
# we are taking the my_search variable from the cell above and executing the search, storing the 
    # response in new_results
new_results = my_search.execute()

# this for loop prints out the tags for each item in the results we gathered 
for hit in new_results:
    print(hit.tags)


['CDL.LTERNET', 'CDL', 'dataPackage', 'Dataset']
['CDL.PISCO', 'CDL']
['CDL.LTERNET', 'CDL', 'dataPackage', 'Dataset']
['CDL.DIGSCI', 'CDL', 'Paper', 'Dataset']
['CDL.PISCO', 'CDL']
['CDL.PISCO', 'CDL']
['CDL.PISCO', 'CDL']
['CDL.DIGANT', 'CDL', 'Dataset']
['TIB.R-GATE', 'TIB']
['CDL.DIGSCI', 'CDL', 'Image']
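
The same .query() interface accepts the other query types mentioned above. For example, here's a quick sketch of a match query against the title field (assuming ShareSearch passes query types through to elasticsearch the same way the exists example did):

# a sketch: match documents whose title mentions giraffes
title_search = ShareSearch()
title_search = title_search.query(
    'match',          # type of query, as with 'exists' above
    title='giraffes'  # field and value to match
)
print(title_search.count())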

Debugging and Problem Solving

Not everything always goes as planned when querying an unfamiliar API. Here are some debugging and problem solving strategies for when you're querying the SHARE API.

Schema issues

The SHARE schema has a lot of parts, and much of the information is nested within sections. Making a query isn't always as straightforward as you might think if you're not looking in the right part of the schema.

Let's say you were trying to query for all SHARE documents that specify the language as not being in English.

We'll guess as to what that query might be, and try to make it using sharepa.


In [16]:
# this creates a new search for us to use!
language_search = ShareSearch()

# this sets the search query we are using for our new search! all the items that aren't in english
language_search = language_search.query(
    'query_string', # Type of query, will accept a lucene query string
    query='NOT languages=english', # This lucene query string is meant to find all documents whose language isn't english
)

In [17]:
# execute the search, which sends the query to the API and returns the first 10 results
results = language_search.execute()

# for each item in results, print out the language it is in
for hit in results:
    print(hit.languages)


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/Users/erin/miniconda3/envs/share_tutorials/lib/python3.5/site-packages/elasticsearch_dsl/utils.py in __getattr__(self, attr_name)
    119         try:
--> 120             return _wrap(self._d_[attr_name])
    121         except KeyError:

KeyError: 'languages'

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
<ipython-input-17-f7e8b99a9f64> in <module>()
      2 
      3 for hit in results:
----> 4     print(hit.languages)

/Users/erin/miniconda3/envs/share_tutorials/lib/python3.5/site-packages/elasticsearch_dsl/utils.py in __getattr__(self, attr_name)
    121         except KeyError:
    122             raise AttributeError(
--> 123                 '%r object has no attribute %r' % (self.__class__.__name__, attr_name))
    124 
    125     def __delattr__(self, attr_name):

AttributeError: 'Result' object has no attribute 'languages'

So the result does not have an attribute called languages! Let's try to figure out what went wrong here.

One problem could be that we are trying to find something that does NOT match a given parameter. Since languages is not required, this query is returning results that do not include a languages field at all!

So let's fix this up a bit to make sure that we're querying for items that specify language in the first place.


In [ ]:
# let's try that again! creating a new search from the ShareSearch() function
language_search = ShareSearch()

# this sets up our new query: if the field called 'language' exists, grab those results
language_search = language_search.filter(
    'exists',
    field="language"
)

# count the number of entries that have a language field
language_search.count()

In [ ]:
# grab the results of the search for a language field!
results = language_search.execute()

# Let's see how many documents have language results.
print('There are {} documents with languages specified'.format(language_search.count()))

print('Here are the languages for the first 10 results:')

# for each item in results, print out the language it is in
for hit in results:
    print(hit.language)

So now we're better equipped to add on to this filter, and then narrow down to results that are not in English.

When we printed out the first few results, we might have noticed a second problem with our query -- going back to the SHARE Schema, we might notice that there is a restriction on how languages are captured: as a three-letter lowercase representation. Instead of "english", let's look for the three-letter abbreviation "eng".

We can modify our new and improved language query by adding another query onto our existing language_search. We'll use the elasticsearch-dsl query object Q, invert it with the ~ symbol, and search for the term "eng".


In [ ]:
# Elasticsearch DSL is a high-level library whose aim is to help with writing and running queries against Elasticsearch.
    # Read more about elasticsearch here:
        # http://elasticsearch-dsl.readthedocs.io/en/latest/search_dsl.html
        # https://pypi.python.org/pypi/elasticsearch-dsl
        # https://github.com/elastic/elasticsearch-dsl-py
    # this imports the function Q from the library 
from elasticsearch_dsl import Q

# add a query that excludes results with 'eng' in their language field (the ~ inverts the term query)
language_search = language_search.query(~Q("term", language="eng"))

# execute our search and store the response
results = language_search.execute()

# Let's see how many documents have language results that aren't eng
print('There are {} documents that do not have "eng" listed.'.format(language_search.count()))

print('Here are the languages for the first 10 results:')

# Check out the first few results, make sure "eng" isn't in there
for hit in results:
    print(hit.language)
    print(hit.title)
