SHARE Data in the Wide World

This notebook focuses on how to export SHARE data into different formats, and how to query SHARE for information specific to your institution, say from a list of names, or from a list of emails or ORCIDs that act as researcher identifiers.

Exporting a DataFrame to csv and Excel

When doing an aggregation on SHARE data, it can be useful to export the results to a format that is easy to distribute widely, such as a csv file or an Excel file.

First, we'll run a SHARE aggregation query for documents from each source that have a description, turn the results into a pandas DataFrame, and export the data to both csv and Excel formats.


In [ ]:
# Pandas is a Python library used for data manipulation and analysis -- good for numbers + time series.
    # Pandas gives us some extra data structures (arrays are data structures, for example), which is nice.
    # We are calling pandas pd by using "as" -- locally, we know pandas as pd.
    # Helpful Links:
        # https://en.wikipedia.org/wiki/Pandas_(software)
        # http://pandas.pydata.org/ 
import pandas as pd

# Sharepa is a Python client for browsing and analyzing SHARE data using elasticsearch queries.
    # We can use this to aggregate, graph, and analyze the data. 
    # Helpful Links:
        # https://github.com/CenterForOpenScience/sharepa
        # https://pypi.python.org/pypi/sharepa
from sharepa import ShareSearch

# When we say "from X import Y", we are saying "of all the things in this Python library, import only this one"
from sharepa.helpers import pretty_print

description_search = ShareSearch()

# exists -- a type of query that matches documents where a given field is present
    # Elasticsearch supports fielded data. When performing a search you can either specify a field, or use the default field.
    # The field names and the default field are implementation specific.
# field='description' -- this query will find all documents that DO have a description
description_search = description_search.query(
    'exists', 
    field='description',
)

# here we are aggregating all the entries by source
description_search.aggs.bucket(
    'sources',  # Every aggregation needs a name
    'significant_terms',  # There are many kinds of aggregations
    field='sources',  # The source of a document is stored in its sources field, so this will aggregate by source
    min_doc_count=0,
    percentage={}, # Will make the score value the percentage of all results (doc_count/bg_count)
    size=0
)

description_results = description_search.execute()
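
Before reshaping the results, we can use the pretty_print helper we imported earlier to take a look at the raw aggregation response. This is just a sanity check and can be skipped.


In [ ]:
# Peek at the raw buckets elasticsearch returned for the sources aggregation
pretty_print(description_results.aggregations.sources.to_dict())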

In [ ]:
# Create a DataFrame using pandas (what we call pd) from the aggregation buckets
description_dataframe = pd.DataFrame(description_results.aggregations.sources.to_dict()['buckets'])

# We will add our own "percent" column to make things clearer
description_dataframe['percent'] = (description_dataframe['score'] * 100)

# Let's set the source name as the index, and then drop the old column
description_dataframe = description_dataframe.set_index(description_dataframe['key'])
description_dataframe = description_dataframe.drop(columns='key')

# Finally, we'll show the results!
description_dataframe
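
If you want the sources with the highest description coverage first, pandas can sort the table by the percent column we just added.


In [ ]:
# Sort sources by the percent of documents that have a description, highest first
description_dataframe.sort_values('percent', ascending=False)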

Let's export this pandas DataFrame to a csv file and to an Excel file.

The next cell will work when running locally!


In [ ]:
# Note: these lines write files to your current working directory, so run them locally:

description_dataframe.to_csv('SHARE_Counts_with_Descriptions.csv')

# to_excel needs an Excel writer engine installed (e.g. the openpyxl package)
description_dataframe.to_excel('SHARE_Counts_with_Descriptions.xlsx')
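
To double-check the export, we can read the csv file straight back into pandas. This assumes the cell above actually ran and wrote the file locally.


In [ ]:
# Read the csv back in to confirm the round trip worked
pd.read_csv('SHARE_Counts_with_Descriptions.csv', index_col='key')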

Working with outside data

Let's say we had a list of names of researchers from a particular university. We're interested in seeing if their full names appear in any sources across the SHARE data set.


In [ ]:
# A simple list of researcher full names
names = ["Susan Jones", "Ravi Patel"]
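
In practice, a list like this often comes from a spreadsheet kept by your institution. As a sketch, if you had a file named researchers.csv with a full_name column (both names here are hypothetical), pandas could load it into the same kind of list:


In [ ]:
# Hypothetical file and column names -- uncomment and adjust to match your own data:
# names = pd.read_csv('researchers.csv')['full_name'].tolist()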

In [ ]:
# Create a new search object for the name lookup
name_search = ShareSearch()

# We search the entire SHARE dataset with one "should" clause per name in the list,
# i.e. Susan Jones or Ravi Patel -- a document matches if ANY of the names appears
# among its contributors. (Chaining .query() once per name would AND the queries
# together and require every name to appear in the same document.)
name_search = name_search.query(
    {
        "bool": {
            "should": [
                {
                    "match": {
                        "contributors.full_name": {
                            "query": name,
                            "operator": "and",
                            "type": "phrase"
                        }
                    }
                }
                for name in names
            ]
        }
    }
)

# Run the search and store the response in name_results
# name_search is our search object, and .execute() is a built-in method (one the library provides, so we
    # don't have to write it) that sends the query to SHARE and returns the results
name_results = name_search.execute()

# Print out the number of documents with contributors matching any of those names
print('There are {} documents with contributors who have any of those names.'.format(name_search.count()))

# Just visual cues for us to make the output more readable
print('Here are the first 10:')
print('---------')

# Loop over name_results and print each hit -- elasticsearch returns the first 10 by default
for result in name_results:
    print(
        '{} -- with contributors {}'.format(
            result.title,
            [contributor.full_name for contributor in result.contributors]
        )
    )
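
By default, elasticsearch only returns the first 10 hits. ShareSearch builds on elasticsearch-dsl, which supports Python-style slicing to ask for a different number of results; this sketch assumes that behavior carries through to sharepa.


In [ ]:
# Slicing sets the from/size parameters on the query -- here, the first 25 hits
more_results = name_search[0:25].execute()
print('Fetched {} hits'.format(len(more_results.hits)))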

If we're interested in an analysis of which sources these names came from, we can add an aggregation.


In [ ]:
name_search.aggs.bucket(
    'sources',  # Every aggregation needs a name
    'terms',  # There are many kinds of aggregations, terms is a pretty useful one though
    field='sources',  # The source of a document is stored in its sources field, so this will aggregate by source
    size=0,  # These are just to make sure we get numbers for all the sources, to make it easier to combine graphs
    min_doc_count=1
)

# Re-run the search, now with the aggregation attached
# .execute() sends the query to SHARE again and returns both the hits and the aggregation results
name_results = name_search.execute()

# Load the aggregation buckets into a pandas DataFrame (pd)
pd.DataFrame(name_results.aggregations.sources.to_dict()['buckets'])

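Finally, the same pattern extends to the other researcher identifiers mentioned at the start, like emails or ORCIDs. The sketch below is hypothetical: 'contributors.identifiers' is a placeholder, since the exact field name depends on the SHARE schema -- check the index mapping and substitute the real field before running it.


In [ ]:
# Hypothetical sketch -- 'contributors.identifiers' is a placeholder field name,
    # not necessarily the real one in the SHARE schema
orcids = ['0000-0002-1825-0097']  # ORCID's own documented sample identifier

orcid_search = ShareSearch()
orcid_search = orcid_search.query(
    {
        "bool": {
            "should": [
                {"match": {"contributors.identifiers": orcid}}
                for orcid in orcids
            ]
        }
    }
)
# orcid_results = orcid_search.execute()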