manipulate_regonline_output

This notebook reads the RegOnline output into a pandas DataFrame and reworks it to have each row contain the attendee, the Doppler Primer Session, the Monday Breakout session, and the Tuesday breakout session in each row.


In [220]:
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import matplotlib

In [2]:
#%matplotlib inline

Read the RegOnline output into a pandas DataFrame


In [221]:
df = pd.read_excel('/Users/matt/projects/EPRV/data/missings2.xls', encoding='utf-8')

In [222]:
df.columns


Out[222]:
Index([u'AgendaItem', u'RegId', u'GroupId', u'FirstName', u'LastName', u'Company'], dtype='object')

In [223]:
df.loc[16:17]


Out[223]:
AgendaItem RegId GroupId FirstName LastName Company

Extract the Sunday Sessions

RegOnline outputs multiple entries for each person, and each entry differs by the AgendaItem. AgendaItems exist for all sessions happening on all days. In this section, we extract the sessions happening on Sunday, which are all prefixed by "Doppler Primer: ".


In [224]:
#df['AgendaItem'].str.contains('Doppler Primer:')

In [225]:
sundf = df[df['AgendaItem'].str.contains('Doppler Primer:')].copy()
len(sundf)


Out[225]:
3

Let's create two new columns in our DataFrame: the Primer, and the PrimerID. The Primer column will contain the name of the Doppler Primer session (minus the Doppler Primer: prefix), and the PrimerID will be a session identifier that will later be used in plotting.


In [226]:
sundf['PrimerID'] = 0

In [227]:
sundf['Primer'] = [re.search(r'(.*):\s(.*)$', item).group(2) for item in sundf['AgendaItem']]

In [228]:
sundf[['AgendaItem', 'Primer']].head(3)


Out[228]:
AgendaItem Primer
0 Doppler Primer: Not Attending Not Attending
1 Doppler Primer: Not Attending Not Attending
2 Doppler Primer: Not Attending Not Attending

In [229]:
sundf['Primer'].unique()


Out[229]:
array([u'Not Attending'], dtype=object)

Now loop through the five unique sessions, updating the PrimerID column for each participant:


In [230]:
dopID = 0
for agItem in sundf['Primer'].unique():
    sundf.loc[sundf['Primer'] == agItem, 'PrimerID'] = dopID
    dopID += 1

Create an abbreviated code for each session. This will be added to the nametag to spark conversation among participants.


In [231]:
sun_ses = ['NA', 'SA', 'IC', 'DC', 'SM']

A quick preview of the first few rows to see the result:


In [232]:
sundf[['AgendaItem', 'Primer', 'PrimerID']].head(4)


Out[232]:
AgendaItem Primer PrimerID
0 Doppler Primer: Not Attending Not Attending 0
1 Doppler Primer: Not Attending Not Attending 0
2 Doppler Primer: Not Attending Not Attending 0

Extract the Monday Sessions

Now to do the same for the Monday sessions.


In [233]:
mondf = df[df['AgendaItem'].str.contains('Monday Break-out:')].copy()
len(mondf)


Out[233]:
3

In [234]:
mondf['MonID'] = 0

mondf['Monday'] = [re.search(r'(.*):\s(.*)$', item).group(2) for item in mondf['AgendaItem']]

mondf['Monday'].unique()

monID = 0
for agItem in mondf['Monday'].unique():
    mondf.loc[mondf['Monday'] == agItem, 'MonID'] = monID
    monID += 1

In [235]:
mondf['Monday'].unique()


Out[235]:
array([u'Not attending'], dtype=object)

In [236]:
mon_ses = ['NA', 'FS', 'TC', 'BC', 'FC']

In [237]:
mondf[['AgendaItem', 'Monday', 'MonID']].head(4)


Out[237]:
AgendaItem Monday MonID
3 Monday Break-out: Not attending Not attending 0
4 Monday Break-out: Not attending Not attending 0
5 Monday Break-out: Not attending Not attending 0

Extract Tuesday Sessions


In [238]:
tuedf = df[df['AgendaItem'].str.contains('Tuesday Break-out:')].copy()
len(tuedf)


Out[238]:
3

In [239]:
tuedf['TueID'] = 0

tuedf['Tuesday'] = [re.search(r'(.*):\s(.*)$', item).group(2) for item in tuedf['AgendaItem']]

tuedf['Tuesday'].unique()

tuesID = 0
for agItem in tuedf['Tuesday'].unique():
    tuedf.loc[tuedf['Tuesday'] == agItem, 'TueID'] = tuesID
    tuesID += 1

In [240]:
tuedf['Tuesday'].unique()


Out[240]:
array([u'Not attending'], dtype=object)

In [241]:
tue_ses = ['NA', 'ST', 'DC', 'LB', 'PS']

In [242]:
tuedf[['AgendaItem', 'Tuesday', 'TueID']].head(4)


Out[242]:
AgendaItem Tuesday TueID
6 Tuesday Break-out: Not attending Not attending 0
7 Tuesday Break-out: Not attending Not attending 0
8 Tuesday Break-out: Not attending Not attending 0

Combine the DataFrames

We only need to join on one field. However, pandas does something weird, where it creates multiple GroupId_x columns when joining multiple times. The simple solution is just to join on multiple columns since we know they're all consistent.


In [243]:
fulldf = df[['RegId', 'GroupId', 'FirstName', 'LastName', 'Company']]

In [244]:
print(len(fulldf))
fulldf = fulldf.drop_duplicates()
print(len(fulldf))
print(len(sundf))
print(len(mondf))
print(len(tuedf))


9
3
3
3
3

In [245]:
fulldf.columns


Out[245]:
Index([u'RegId', u'GroupId', u'FirstName', u'LastName', u'Company'], dtype='object')

In [246]:
sundf.columns


Out[246]:
Index([u'AgendaItem', u'RegId', u'GroupId', u'FirstName', u'LastName', u'Company', u'PrimerID', u'Primer'], dtype='object')

In [247]:
newdf = pd.merge(fulldf, sundf, on=['RegId', 'GroupId', 'FirstName', 'LastName', 'Company'], how='left')
print(len(newdf))

newdf = pd.merge(newdf, mondf, on=['RegId', 'GroupId', 'FirstName', 'LastName', 'Company'], how='left')
print(len(newdf))

newdf = pd.merge(newdf, tuedf, on=['RegId', 'GroupId', 'FirstName', 'LastName', 'Company'], how='left')
print(len(newdf))


3
3
3

In [248]:
newdf.head(5)


Out[248]:
RegId GroupId FirstName LastName Company AgendaItem_x PrimerID Primer AgendaItem_y MonID Monday AgendaItem TueID Tuesday
0 79762587 79762587 Cheryl Provost FiberTech Optica Doppler Primer: Not Attending 0 Not Attending Monday Break-out: Not attending 0 Not attending Tuesday Break-out: Not attending 0 Not attending
1 79762588 79762588 Abhijit Chakroborty Physical Research Lab Doppler Primer: Not Attending 0 Not Attending Monday Break-out: Not attending 0 Not attending Tuesday Break-out: Not attending 0 Not attending
2 79762589 79762589 Jose Dias Do Nascimento CfA Doppler Primer: Not Attending 0 Not Attending Monday Break-out: Not attending 0 Not attending Tuesday Break-out: Not attending 0 Not attending

In [249]:
newdf.columns


Out[249]:
Index([u'RegId', u'GroupId', u'FirstName', u'LastName', u'Company', u'AgendaItem_x', u'PrimerID', u'Primer', u'AgendaItem_y', u'MonID', u'Monday', u'AgendaItem', u'TueID', u'Tuesday'], dtype='object')

Now create a new DataFrame that is a subset of the newdf with only the columns of interest. Also, make sure the DataFrame is sorted by lastname, the index is reset, and it's a copy of newdf instead of a pointer to newdf.


In [250]:
finaldf = newdf[['FirstName', 'LastName', 'Company', 'Primer', 'PrimerID', 'Monday', 'MonID', 'Tuesday', 'TueID']].sort('LastName').reset_index().copy()

In [251]:
finaldf.head(5)


Out[251]:
index FirstName LastName Company Primer PrimerID Monday MonID Tuesday TueID
0 1 Abhijit Chakroborty Physical Research Lab Not Attending 0 Not attending 0 Not attending 0
1 2 Jose Dias Do Nascimento CfA Not Attending 0 Not attending 0 Not attending 0
2 0 Cheryl Provost FiberTech Optica Not Attending 0 Not attending 0 Not attending 0

In [252]:
len(finaldf)


Out[252]:
3

In [253]:
finaldf.columns


Out[253]:
Index([u'index', u'FirstName', u'LastName', u'Company', u'Primer', u'PrimerID', u'Monday', u'MonID', u'Tuesday', u'TueID'], dtype='object')

Now replace all empty cells for "Company" to a very general location:


In [254]:
finaldf.Company = ['Earth' if pd.isnull(company_el) else company_el for company_el in finaldf.Company]

Replace NaNs for PrimerID with the "Not Attending" ID:


In [255]:
finaldf.PrimerID = [4 if pd.isnull(primerid_el) else primerid_el for primerid_el in finaldf.PrimerID]

Check for NaNs in the Monday ID:


In [256]:
len(finaldf[pd.isnull(finaldf['MonID'])])


Out[256]:
0

Replace NaNs for the MonID with the "Not Attending" ID:


In [257]:
finaldf.MonID = [4 if pd.isnull(monid_el) else monid_el for monid_el in finaldf.MonID]

In [258]:
len(finaldf[pd.isnull(finaldf['MonID'])])


Out[258]:
0

Replace NaNs for the TueID with the "Not Attending" ID:


In [259]:
len(finaldf[pd.isnull(finaldf['TueID'])])


Out[259]:
0

In [260]:
finaldf.TueID = [4 if pd.isnull(tueid_el) else tueid_el for tueid_el in finaldf.TueID]

In [261]:
len(finaldf[pd.isnull(finaldf['TueID'])])


Out[261]:
0

Test out the wrap-around text for institute for participants that have long institution names. This regular expression will look for institutions (or Companies, as RegOnline refers to them), and find items that have a '/', and if no '/', either a '-', ',', or 'at' in the text. If so, add a newline character to make the text wrap around to the next line.

We'll first test the output on a participant's institution that contains both a '/' and a '-':


In [262]:
p = re.compile ('(/|^(?!.*/).*-|^(?!.*/).*,|^(?!.*/).*\sat\s)')
p.subn(r'\1\n', finaldf.loc[2].Company)[0]


Out[262]:
u'FiberTech Optica'

And test a cell that is long, contains at, but at is part of a longer word:


In [263]:
#p.subn(r'\1\n', finaldf.loc[53].Company)[0]

And a quick test on a few more institutions:


In [264]:
[p.sub(r'\1\n', company_el) if len(company_el) > 30 else company_el for company_el in finaldf.head(5).Company.values]


Out[264]:
[u'Physical Research Lab', u'CfA', u'FiberTech Optica']

Now update the full Company column of the DataFrame:


In [265]:
finaldf.Company = [p.sub(r'\1\n', company_el) if len(company_el) > 30 else company_el for company_el in finaldf.Company.values]

Plot Labels

Now that we have our DataFrame cleaned up the way we want it we can print the data to the Avery 5392 format. This format contains 6 4"x3" nametags per sheet.


In [266]:
png = mpimg.imread('/Users/matt/projects/EPRV/images/NameTag2.png')

In [267]:
png.shape


Out[267]:
(900, 1200, 4)

In [268]:
import matplotlib.font_manager as mfm
fontpaths = fontpaths=['/System/Library/Fonts/',
                       '/Library/Fonts',
                       '/Library/Fonts/Microsoft',
                       '/usr/X11/lib/X11/fonts',
                       '/opt/X11/share/fonts',
                       '/Users/matt/Library/Fonts']

blaa = mfm.findSystemFonts(fontpaths=fontpaths)

In [269]:
colors = ['#FFE2A9', '#4BA4D8', '#768085', '#BF5338', '#335B8F']
colors2 = ['#335B8F', '#BF5338', '#768085',  '#4BA4D8', '#FFE2A9']
colors3 = ['#4BA4D8', '#FFE2A9', '#BF5338', '#768085', '#335B8F']

circ_ypos = 775
name_dict = {'family': 'YaleNew-Roman',
             'color': '#D6E8E1',
             'weight': 'bold',
             'size': 28
             }

company_dict = {'family': 'YaleNew-Roman',
                'color': '#D6E8E1',
                'weight': 'bold',
                'size': 16
                }

circle_dict = {'family': 'YaleNew-Roman',
               'color': '#1D2523',
               'weight': 'normal',
               'size': 20
               }


def change_name_size(name, name_dict):
    if len(name) < 16:
        name_dict['size'] = 28
    elif ((len(name) >= 16) and (len(name) < 19)):
        name_dict['size'] = 24
    elif ((len(name) >= 19) and (len(name) < 24)):
        name_dict['size'] = 20
    elif ((len(name) >= 24) and (len(name) < 30)):
        name_dict['size'] = 17
    else:
        name_dict['size'] = 16
    return name_dict
        

def change_company_size(company, company_dict):
    newlines = len(re.findall(r'\n', finaldf.loc[0].Company))
    if newlines == 0:
        if len(company) < 15:
            company_dict['size'] = 18
        elif ((len(company) >= 15) and (len(company) < 30)):
            company_dict['size'] = 14
        elif ((len(company) >= 30) and (len(company) < 40)):
            company_dict['size'] = 12
        elif ((len(company) >= 40) and (len(company) < 50)):
            company_dict['size'] = 10
        else:
            company_dict['size'] = 8
    else:
        if len(company) < 15:
            company_dict['size'] = 18
        elif ((len(company) >= 15) and (len(company) < 40)):
            company_dict['size'] = 14
        elif ((len(company) >= 40) and (len(company) < 50)):
            company_dict['size'] = 12
        else:
            company_dict['size'] = 10
    return company_dict
    

# The HP Color LaserJet CP4020 offsets things by 1/16th of an inch left-to-right.
# This fudge factor should fix that:
hrz_fdg = 1. / 16./ 8.5
leftarr = np.array([0.0294, 0.5, 0.0294, 0.5, 0.0294, 0.5]) + hrz_fdg
bottomarr = [0.091, 0.091, 0.364,  0.364, 0.637, 0.637]
width = 0.4706
height = 0.273

# loop through the total number of pages:
for page in range(int(np.ceil((len(finaldf))/6.))):
    print('Now on page: {}'.format(page))
    fig = plt.figure(figsize=(8.5, 11))
    for indx in range(6):
        # add an if statement to handle the last page if there are less than
        # six participants remaining:
        if ((page*6 + indx) < len(finaldf)):
            rect = [leftarr[indx], bottomarr[indx], width, height]
            ax = fig.add_axes(rect)
            ax.imshow(png)
            ax.get_xaxis().set_visible(False)
            ax.get_yaxis().set_visible(False)

            print(u'Now making name tag for: {} {}'.format(finaldf.loc[page*6 + indx].FirstName, finaldf.loc[page*6 + indx].LastName))
            
            #add name text:
            name = finaldf.loc[page*6 + indx].FirstName + ' ' + finaldf.loc[page*6 + indx].LastName 
            this_name_dict = change_name_size(name, name_dict)
            ax.text(600, 500, name, fontdict=this_name_dict, horizontalalignment='center')

            #add company text:
            company = finaldf.loc[page*6 + indx].Company
            this_co_dict = change_company_size(company, company_dict)
            ax.text(600, 625, company, fontdict=this_co_dict, horizontalalignment='center')

            #add circles for sessions:
            circ1 = plt.Circle((750, circ_ypos), 70, color=colors[int(finaldf.loc[page*6 + indx].PrimerID)])
            fig.gca().add_artist(circ1)
            ax.text(750, circ_ypos + 27.5, sun_ses[int(finaldf.loc[page*6 + indx].PrimerID)], fontdict=circle_dict, horizontalalignment='center')

            circ2 = plt.Circle((925, circ_ypos), 70, color=colors2[int(finaldf.loc[page*6 + indx].MonID)])
            fig.gca().add_artist(circ2)
            ax.text(925, circ_ypos + 27.5, mon_ses[int(finaldf.loc[page*6 + indx].MonID)], fontdict=circle_dict, horizontalalignment='center')

            circ3 = plt.Circle((1100, circ_ypos), 70, color=colors3[int(finaldf.loc[page*6 + indx].TueID)])
            fig.gca().add_artist(circ3)
            ax.text(1100, circ_ypos + 27.5, tue_ses[int(finaldf.loc[page*6 + indx].TueID)], fontdict=circle_dict, horizontalalignment='center')

    plt.savefig('../nametags/more_missing_nameTags_bold_p'+str(page)+'.png', dpi=300)


Now on page: 0
Now making name tag for: Abhijit Chakroborty
Now making name tag for: Jose Dias Do Nascimento
Now making name tag for: Cheryl Provost

In [270]:
finaldf.columns


Out[270]:
Index([u'index', u'FirstName', u'LastName', u'Company', u'Primer', u'PrimerID', u'Monday', u'MonID', u'Tuesday', u'TueID'], dtype='object')

In [271]:
finaldf.FirstName.values


Out[271]:
array([u'Abhijit', u'Jose', u'Cheryl'], dtype=object)

In [272]:
finaldf.LastName.values


Out[272]:
array([u'Chakroborty', u'Dias Do Nascimento', u'Provost'], dtype=object)

In [273]:
hrz_fdg = 1. / 16./ 8.5
leftarr = np.array([0.0294, 0.5, 0.0294, 0.5, 0.0294, 0.5])

In [274]:
leftarr + hrz_fdg


Out[274]:
array([ 0.03675294,  0.50735294,  0.03675294,  0.50735294,  0.03675294,
        0.50735294])

In [ ]: