SHARE Data in the Wide World

This notebook focuses on how to export SHARE data into different formats, and how to query SHARE for information specific to your institution, say from a list of names, or from a list of emails or ORCIDs that act as researcher identifiers.

Exporting a DataFrame to csv and Excel

When doing an aggregation on SHARE data, it can be useful to export the results to a format that is easy to distribute widely, such as a csv file or an Excel file.

First, we'll run a SHARE aggregation query for documents from each source that have a description, turn the results into a pandas DataFrame, and export the data to both csv and Excel formats.


In [ ]:
# Pandas is a Python library used for data manipulation and analysis -- good for numbers + time series.
    # Pandas gives us some extra data structures (arrays are data structures, for example), which is nice.
    # We are calling pandas pd by using "as" -- locally, we know pandas as pd.
    # Helpful Links:
        # https://en.wikipedia.org/wiki/Pandas_(software)
        # http://pandas.pydata.org/ 
import pandas as pd

# Sharepa is a Python client for browsing and analyzing SHARE data using elasticsearch queries.
    # We can use this to aggregate, graph, and analyze the data. 
    # Helpful Links:
        # https://github.com/CenterForOpenScience/sharepa
        # https://pypi.python.org/pypi/sharepa
from sharepa import ShareSearch

# When we say "from X import Y", we are saying "of all the things in this Python library, import only this one"
from sharepa.helpers import pretty_print

description_search = ShareSearch()

# exists -- a type of query that matches documents where a given field is present
    # Elasticsearch supports fielded data. When performing a search you can either specify a field, or use the default field.
    # The field names and the default field are implementation specific.
# field='description' -- this query will find all documents that DO have a description
description_search = description_search.query(
    'exists', 
    field='description',
)

# here we are aggregating all the entries by source
description_search.aggs.bucket(
    'sources',  # Every aggregation needs a name
    'significant_terms',  # There are many kinds of aggregations
    field='sources',  # The source of a document is stored in its sources field, so this will aggregate by source
    min_doc_count=0,
    percentage={}, # Will make the score value the percentage of all results (doc_count/bg_count)
    size=0
)

description_results = description_search.execute()
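
Before reshaping the results, we can use the pretty_print helper we imported earlier to take a look at the raw aggregation response. This is just a sanity check and can be skipped.


In [ ]:
# Peek at the raw buckets elasticsearch returned for the sources aggregation
pretty_print(description_results.aggregations.sources.to_dict())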

In [ ]:
# Create a DataFrame using pandas (what we call pd) from the aggregation buckets
description_dataframe = pd.DataFrame(description_results.aggregations.sources.to_dict()['buckets'])

# We will add our own "percent" column to make things clearer
description_dataframe['percent'] = (description_dataframe['score'] * 100)

# Let's set the source name as the index, and then drop the old column
description_dataframe = description_dataframe.set_index(description_dataframe['key'])
description_dataframe = description_dataframe.drop(columns='key')

# Finally, we'll show the results!
description_dataframe
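
If you want the sources with the highest description coverage first, pandas can sort the table by the percent column we just added.


In [ ]:
# Sort sources by the percent of documents that have a description, highest first
description_dataframe.sort_values('percent', ascending=False)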

Let's export this pandas DataFrame to a csv file and to an Excel file.

The next cell will work when running locally!


In [ ]:
# Note: these lines write files to your current working directory, so run them locally:

description_dataframe.to_csv('SHARE_Counts_with_Descriptions.csv')

# to_excel needs an Excel writer engine installed (e.g. the openpyxl package)
description_dataframe.to_excel('SHARE_Counts_with_Descriptions.xlsx')
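
To double-check the export, we can read the csv file straight back into pandas. This assumes the cell above actually ran and wrote the file locally.


In [ ]:
# Read the csv back in to confirm the round trip worked
pd.read_csv('SHARE_Counts_with_Descriptions.csv', index_col='key')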

Working with outside data

Let's say we had a list of names of researchers from a particular university. We're interested in seeing if their full names appear in any sources across the SHARE data set.


In [ ]:
# A simple list of researcher full names
names = ["Susan Jones", "Ravi Patel"]
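
In practice, a list like this often comes from a spreadsheet kept by your institution. As a sketch, if you had a file named researchers.csv with a full_name column (both names here are hypothetical), pandas could load it into the same kind of list:


In [ ]:
# Hypothetical file and column names -- uncomment and adjust to match your own data:
# names = pd.read_csv('researchers.csv')['full_name'].tolist()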

In [ ]:
# Create a new search object for the name lookup
name_search = ShareSearch()

# We search the entire SHARE dataset with one "should" clause per name in the list,
# i.e. Susan Jones or Ravi Patel -- a document matches if ANY of the names appears
# among its contributors. (Chaining .query() once per name would AND the queries
# together and require every name to appear in the same document.)
name_search = name_search.query(
    {
        "bool": {
            "should": [
                {
                    "match": {
                        "contributors.full_name": {
                            "query": name,
                            "operator": "and",
                            "type": "phrase"
                        }
                    }
                }
                for name in names
            ]
        }
    }
)

# Run the search and store the response in name_results
# name_search is our search object, and .execute() is a built-in method (one the library provides, so we
    # don't have to write it) that sends the query to SHARE and returns the results
name_results = name_search.execute()

# Print out the number of documents with contributors matching any of those names
print('There are {} documents with contributors who have any of those names.'.format(name_search.count()))

# Just visual cues for us to make the output more readable
print('Here are the first 10:')
print('---------')

# Loop over name_results and print each hit -- elasticsearch returns the first 10 by default
for result in name_results:
    print(
        '{} -- with contributors {}'.format(
            result.title,
            [contributor.full_name for contributor in result.contributors]
        )
    )
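
By default, elasticsearch only returns the first 10 hits. ShareSearch builds on elasticsearch-dsl, which supports Python-style slicing to ask for a different number of results; this sketch assumes that behavior carries through to sharepa.


In [ ]:
# Slicing sets the from/size parameters on the query -- here, the first 25 hits
more_results = name_search[0:25].execute()
print('Fetched {} hits'.format(len(more_results.hits)))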

If we're interested in an analysis of which sources these names came from, we can add an aggregation.


In [ ]:
name_search.aggs.bucket(
    'sources',  # Every aggregation needs a name
    'terms',  # There are many kinds of aggregations, terms is a pretty useful one though
    field='sources',  # The source of a document is stored in its sources field, so this will aggregate by source
    size=0,  # These are just to make sure we get numbers for all the sources, to make it easier to combine graphs
    min_doc_count=1
)

# Re-run the search, now with the aggregation attached
# .execute() sends the query to SHARE again and returns both the hits and the aggregation results
name_results = name_search.execute()

# Load the aggregation buckets into a pandas DataFrame (pd)
pd.DataFrame(name_results.aggregations.sources.to_dict()['buckets'])

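Finally, the same pattern extends to the other researcher identifiers mentioned at the start, like emails or ORCIDs. The sketch below is hypothetical: 'contributors.identifiers' is a placeholder, since the exact field name depends on the SHARE schema -- check the index mapping and substitute the real field before running it.


In [ ]:
# Hypothetical sketch -- 'contributors.identifiers' is a placeholder field name,
    # not necessarily the real one in the SHARE schema
orcids = ['0000-0002-1825-0097']  # ORCID's own documented sample identifier

orcid_search = ShareSearch()
orcid_search = orcid_search.query(
    {
        "bool": {
            "should": [
                {"match": {"contributors.identifiers": orcid}}
                for orcid in orcids
            ]
        }
    }
)
# orcid_results = orcid_search.execute()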