This notebook will go through how we match up students to real scientists based on their science interests. This code is heavily based on collaboratr, a project developed at Astro Hack Week.

Check it out here: github.com/benelson/collaboratr

Here, we will use real Letters to a Prescientist form data.


In [2]:
!pip install nxpd


Requirement already satisfied: nxpd in /home/christina/anaconda2/lib/python2.7/site-packages
Requirement already satisfied: networkx>=1.6 in /home/christina/anaconda2/lib/python2.7/site-packages (from nxpd)
Requirement already satisfied: pyparsing>=2.0.1 in /home/christina/anaconda2/lib/python2.7/site-packages (from nxpd)
Requirement already satisfied: decorator>=3.4.0 in /home/christina/anaconda2/lib/python2.7/site-packages (from networkx>=1.6->nxpd)

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd
import numpy as np
from operator import truediv
from collections import Counter
import itertools
import random
import collaboratr

#from nxpd import draw
#import nxpd



#reload(collaboratr)

Step 1 Create a Google Form with these questions:

1. What is your name? [text entry]
2. What is your gender? [multiple choice]
3. What are your general science interests? [checkboxes]

I can ask for other information from the students (e.g., grade, school name) and scientists (email).

After receiving the responses, load up the CSV of responses from the Google Form by running the cell below (you'll have to change the path to your own CSV).


In [2]:
def format_name(data):
    first_name = ['-'.join(list(map(str.capitalize,d))) for d in data['Name'].str.replace(" ", "-").str.split('-')]
    last_name = ['-'.join(list(map(str.capitalize,d))) for d in data['Last'].str.replace(" ", "-").str.split('-')]
    full_name = pd.Series([m+" "+n for m,n in zip(first_name,last_name)])
    
    return full_name

In [10]:
# Retrieve data from Google Sheet and parse using pandas dataframe
student_data = pd.read_csv("students.csv")
student_data = student_data.replace(np.nan,' ', regex=True)

# Store student information in variables.
#
# Collaboratr divided people into "learners" and "teachers" based on what they wanted to "learn" and "teach."
# Here, students are always "learners" by default and the scientists are always "teachers."
# To maintain the structure of the pandas dataframe,
# I've created blank values for what students want to "teach" and what scientists want to "learn."

### write a function that would format names (including hyphens)
student_data['Full Name'] = format_name(student_data) 
student_names = student_data['Full Name']
nStudents = len(student_names)

student_learn = student_data['If I could be any type of scientist when I grow up, I would want to study:']
student_teach = pd.Series(["" for i in range (nStudents)], index=[i for i in range(nStudents)])
student_email = pd.Series(["" for i in range (nStudents)], index=[i for i in range(nStudents)])

# Store scientist information in variables.
scientist_data = pd.read_csv("scientists_1.csv")
scientist_data = scientist_data.replace(np.nan,' ', regex=True)

#drop any duplicate email entries in the data frame
drop = np.where(scientist_data.duplicated('Email')==True)[0]
temp = scientist_data.drop(scientist_data.index[drop])
scientist_data = temp

scientist_data['Full Name'] = format_name(scientist_data) 
scientist_names = scientist_data['Full Name']
nScientists = len(scientist_names)

scientist_learn = pd.Series(["" for i in range (nScientists)], index=[i for i in range(nScientists)])
scientist_teach = scientist_data['We will match you with a pen pal who has expressed an interest in at least one of the following subjects. Which topic is most relevant to your work?']
scientist_email = scientist_data['Email']

In [181]:
#drop any duplicate email entries in the data frame
drop = np.where(scientist_data.duplicated('Full Name')==True)[0]
temp = scientist_data.drop(scientist_data.index[drop])
scientist_data = temp


[]

Step 2: Merge the student and scientist dataframes


In [8]:
names = student_names.append(scientist_names, ignore_index=True)
learn = student_learn.append(scientist_learn, ignore_index=True)
teach = student_teach.append(scientist_teach, ignore_index=True)
emails = student_email.append(scientist_email, ignore_index=True)

In [9]:
G = nx.DiGraph()

Step 3: Assign scientists to students

I thought about several ways to do this. Each student has a "pool" of scientists to be assigned to based on their interests. This was a non-trivial problem. I try to have no more than 2 students assigned to each scientist, working with a limited dataset of roughly 20 scientists and 30 students. Most scientists come from astronomy/physics or psychology/neuroscience. Here are my attempts to do just that:

  1. For each student, randomly draw from their "pool" of scientists with matching interests. This typically caused the more "underrepresented" scientists to get oversubscribed quickly, e.g., having one biologist and having many students interested in biology. This didn't help for students who had limited interests. If I couldn't match everyone up, I'd try again with different random draws. Couldn't find a solution for the conditions listed above. Maybe this would work better if we had a nScientists > nStudents.

  2. Start with the "least popular" topic, that is the topic where the student-to-scientist ratio is smallest. Loop through the students with those interests and try to match them to a scientist. Then, we work are way up the list until we get to the most popular topic. This approach worked much better.


In [43]:
# Insert users in graphs
for n,e,l,t in zip(names, emails, learn, teach):
    collaboratr.insert_node(G,n, email=e, learn=l.split(';'), teach=t.split(';'))

In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [74]:
def sort_things(stu_data, sci_data):
    num_interests = {}
    
    for i,r in stu_data.iterrows():
        name = r['Name'].capitalize() + " " + r['Last'].capitalize()
        num_interests = { name: 1 }

    print(num_interests)
    stu_names_sorted = sorted(num_interests, key=num_interests.get)
    print(stu_names_sorted)
    
    interests_stu = Counter(list(itertools.chain.from_iterable(\
                [ i.split(';') for i in stu_data['If I could be any type of scientist when I grow up, I would want to study:'] ])))
    interests_sci = Counter(list(itertools.chain.from_iterable(\
                [ i.split(';') for i in sci_data['We will match you with a pen pal who has expressed an interest in at least one of the following subjects. Which topic is most relevant to your work?'] ])))

    interests_rel = { key: interests_stu[key]/interests_sci[key] for key in interests_sci.keys() }
    interests_rel_sorted = sorted(interests_rel, key=interests_rel.get)
    
    return interests_rel_sorted, stu_names_sorted

def assigner(assign, stu_data, sci_data, max_students=2):
    assign_one = {}
    subscriptions = { n: 0 for n in sci_data['What is your name?'] }

    interests_rel_sorted, stu_names_sorted = sort_things(stu_data, sci_data)
    
    for key in interests_rel_sorted:
        for name in stu_names_sorted:
            if name not in assign_one:
                if key in assign[name].keys():
                    try:
                        scientist = np.random.choice(assign[name][key])
                    except ValueError:
                        scientist = np.random.choice(scientist_data['What is your name?'])
                    assign_one[name] = scientist

                    subscriptions[scientist] += 1

                    if subscriptions[scientist]>=max_students:
                        for kk,vv in assign.items():
                            if vv:
                                for k,v in vv.items():
                                    if scientist in v:
                                        v.remove(scientist)
                                        
    for name in stu_names_sorted:
        if name not in assign_one:
            scientist = np.random.choice([ k for k,v in subscriptions.items() if v < max_students ])
            assign_one[name] = scientist
            
    return assign_one

In [47]:
assign_one = None
max_students = 2

while assign_one is None:
    try:
        participants = G.nodes(data=True)
        assign = collaboratr.assign_users(G,participants)
        assign_one = assigner(assign, student_data, scientist_data, max_students=max_students)
        if max(Counter([v for k,v in assign_one.items()]).values())>max_students:
            assign_one = None

    except ValueError:
#        print("error")
        pass
            

print(assign_one)
print(Counter([v for k,v in assign_one.items()]))


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
//anaconda/lib/python3.5/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2392             try:
-> 2393                 return self._engine.get_loc(key)
   2394             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5239)()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5085)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20405)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20359)()

KeyError: 'What is your name?'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-47-9dc85487fce1> in <module>()
      6         participants = G.nodes(data=True)
      7         assign = collaboratr.assign_users(G,participants)
----> 8         assign_one = assigner(assign, student_data, scientist_data, max_students=max_students)
      9         if max(Counter([v for k,v in assign_one.items()]).values())>max_students:
     10             assign_one = None

<ipython-input-46-c1880aec49ec> in assigner(assign, stu_data, sci_data, max_students)
     13 def assigner(assign, stu_data, sci_data, max_students=2):
     14     assign_one = {}
---> 15     subscriptions = { n: 0 for n in sci_data['What is your name?'] }
     16 
     17     interests_rel_sorted, stu_names_sorted = sort_things(stu_data, sci_data)

//anaconda/lib/python3.5/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2060             return self._getitem_multilevel(key)
   2061         else:
-> 2062             return self._getitem_column(key)
   2063 
   2064     def _getitem_column(self, key):

//anaconda/lib/python3.5/site-packages/pandas/core/frame.py in _getitem_column(self, key)
   2067         # get column
   2068         if self.columns.is_unique:
-> 2069             return self._get_item_cache(key)
   2070 
   2071         # duplicate columns & possible reduce dimensionality

//anaconda/lib/python3.5/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
   1532         res = cache.get(item)
   1533         if res is None:
-> 1534             values = self._data.get(item)
   1535             res = self._box_item_values(item, values)
   1536             cache[item] = res

//anaconda/lib/python3.5/site-packages/pandas/core/internals.py in get(self, item, fastpath)
   3588 
   3589             if not isnull(item):
-> 3590                 loc = self.items.get_loc(item)
   3591             else:
   3592                 indexer = np.arange(len(self.items))[isnull(self.items)]

//anaconda/lib/python3.5/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2393                 return self._engine.get_loc(key)
   2394             except KeyError:
-> 2395                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2396 
   2397         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5239)()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5085)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20405)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20359)()

KeyError: 'What is your name?'

In [8]:
items = []

for k,v in assign_one.items():
    items.append(str(v.ljust(22) + "-> " + k.ljust(22) + "who is interested in " \
                     + student_data.loc[student_data['What is your name?'] == k]\
                     ['What general science fields are you interested in?'].tolist()[0] ))
    
for i in sorted(items):
    print(i)


Adam Miller           -> David Jakubczak       who is interested in Astronomy
Adam Miller           -> Jose Flores           who is interested in Astronomy;Biology;Chemistry
Alex Gurvich          -> Adam                  who is interested in Astronomy;Biology
Alex Gurvich          -> Daniel Pesch          who is interested in Biology;Chemistry;Engineering (designing, city planning);Physics;Tecnology
Alicia McGeachy       -> James Brenka          who is interested in Chemistry;Volcanic activity/ interactive chemical reactions
Alicia McGeachy       -> Kate Padilla          who is interested in Chemistry;Zoology
Alissa Baker-Oglesbee -> Mary Grace Guidi      who is interested in Biology;Chemistry;Engineering (designing, city planning);Engineering (factories, industry);Physics;Psychology/neuroscience
Ben Nelson            -> Dallas Thurman        who is interested in Astronomy;Chemistry
Ben Nelson            -> Kristina              who is interested in Astronomy
Eve Chase             -> Daniel Perez          who is interested in Astronomy
Eve Chase             -> Sarah                 who is interested in Astronomy;Engineering (designing, city planning);Physics
Hollen Reischer       -> Leila Barszcz         who is interested in Biology;Chemistry;Physics;Psychology/neuroscience
Jackie Ng             -> Christian             who is interested in Engineering (designing, city planning);Engineering (factories, industry)
Jackie Ng             -> Natalia Kowalewska    who is interested in Astronomy;Chemistry;Engineering (designing, city planning)
Katie Breivik         -> Eloise Park           who is interested in Astronomy;Biology;Chemistry;Physics
Katie Breivik         -> Josh Schmidt          who is interested in Astronomy;Biology;Physics
Kyle Kremer           -> Burak Agar            who is interested in Astronomy;Engineering (designing, city planning)
Kyle Kremer           -> Noah Padilla          who is interested in Astronomy;Biology;Geology
Laura Shanahan        -> Dawn Pendon           who is interested in Astronomy;Engineering (designing, city planning);Geology
Michael Katz          -> Konrad Lukasiewicz    who is interested in Astronomy;Physics
Michael Katz          -> Leo Thompson          who is interested in Astronomy;Chemistry;Physics
Mike Hyland           -> Maribella Espino      who is interested in Engineering (designing, city planning)
Mike Hyland           -> Marissa Sanchez       who is interested in Engineering (designing, city planning);Geology
Mike Zevin            -> Alexander Yabes       who is interested in Astronomy;Engineering (designing, city planning)
Mike Zevin            -> Matthew Gomez         who is interested in Astronomy;Chemistry;Engineering (designing, city planning);Geology;Physics
Rachel Watson         -> Julia Wodzien         who is interested in Biology;Geology;ecology
Schnaude Dorizan      -> Aaquib Mohsin         who is interested in Paleontology 
Schnaude Dorizan      -> Miles                 who is interested in Biology;Marine science
Shi Ye                -> Aleksandar Dale       who is interested in Astronomy;Biology
Shi Ye                -> Daniela Salazar       who is interested in Astronomy;Chemistry

In [75]:
a, b = sort_things(student_data, scientist_data)
print(a, b)


{'Angelena Depalma': 1}
['Angelena Depalma']
['genes', 'the environment', 'cells', 'oceans', 'the brain', 'medicine', 'rocks', 'chemicals', 'computers', 'animals'] ['Angelena Depalma']

In [ ]: