In this section we will load the grant data, clean it, and aggregate the approved amounts per institution.
At the end of this phase we want a DataFrame indicating the total amount of grants each institution has received, based on the given data.
In [1]:
# imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pyplot import show
%matplotlib inline
sns.set_context('notebook')
pd.options.mode.chained_assignment = None # default='warn'
Add a function to control writes to CSV files.
In [2]:
# set to true if you want to write the data to csv
do_persist = True
In [3]:
def write_to_csv(df, file_name):
    if do_persist:
        df.to_csv(file_name)
        print('...done')
        return True
    else:
        return False
Load the data and check the types of the columns
In [4]:
data = pd.read_csv('data/GrantExport.csv', delimiter=';')
data.dtypes
Out[4]:
Take a look at the DataFrame:
In [5]:
data.head()
Out[5]:
We only need the columns 'University' and 'Approved Amount'. All other columns are not relevant for this homework.
In [6]:
# take only the relevant cols and give them nicer names
grants = data[['University', 'Approved Amount']]
grants.rename(columns={'University': 'university', 'Approved Amount' : 'amount'}, inplace=True)
grants.dtypes
Out[6]:
If one of the two values is NaN, we can't use the entry: a grant without an institution is as meaningless as an institution without a grant, so we drop it. Note that there are entries with 'Nicht zuteilbar - NA' ('not assignable'), which are essentially also NaN values.
We drop almost 25% of all entries, which seems like a lot, but there is not much else we can do with incomplete data.
In [7]:
nbr_entries = len(grants)
grants = grants.replace(to_replace='Nicht zuteilbar - NA', value=np.nan)
grants = grants.dropna()
print('Dropped '+str((100/nbr_entries) * (nbr_entries - len(grants)))+'% of all entries.')
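The ~25% figure can be broken down further. A quick optional sketch, assuming the 'Nicht zuteilbar - NA' placeholder only appears in the 'University' column (note that the two counts may overlap):
In [ ]:
# optional sketch: split the unusable entries into the two cases discussed above
# (counts may overlap if an entry has both the placeholder and a missing amount)
n_placeholder = (data['University'] == 'Nicht zuteilbar - NA').sum()
n_missing = data[['University', 'Approved Amount']].isnull().any(axis=1).sum()
print('Placeholder "Nicht zuteilbar - NA":', n_placeholder)
print('Missing university or amount:', n_missing)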
Then convert the 'amount' column to a numeric type to make further processing easier.
In [8]:
grants['amount'] = pd.to_numeric(grants.amount, errors='coerce')
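Because errors='coerce' silently turns unparseable strings into NaN, it is worth checking how many values were affected; a small sketch:
In [ ]:
# sketch: amounts that could not be parsed are now NaN
print('Amounts coerced to NaN:', grants['amount'].isnull().sum())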
And finally, group by institution and sum the grants. We show the sorted DataFrame below. It is interesting that neither ETHZ nor EPFL has received the most grants.
In [9]:
universities = grants.groupby(by='university', axis=0, as_index=False).sum()
universities.sort_values('amount', ascending=False)
Out[9]:
In this section we will map each institution to the canton it belongs to, with the following steps: split each entry into a full name and an abbreviation, build a search query from them, and look up the canton via the APIs wrapped in 'map_universities.py'.
With that method we can map 61 out of the 76 institutions. The rest we mapped manually.
One institution is present in 3 cantons; we simply split its grants equally among them. Other institutions are spread all over the place; those we did not match, as it would be too difficult to find all the cantons they have a presence in.
Note that there are several 'institutions' that can't be mapped, such as 'Weitere Spitäler' (English: 'other hospitals'). But they only account for ~4% of all grants, which is low enough for us to ignore.
Let's start:
Split the names into the full name and the abbreviation:
In [10]:
delim = ' - '
universities['university_name'] = [fn.split(delim)[0].strip() for fn in universities['university'].values]
universities['abbrev'] = [fn.split(delim)[1].strip() if len(fn.split(delim)) > 1 else np.nan for fn in universities['university'].values]
universities.set_index('university', inplace=True)
universities.head(1)
Out[10]:
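For instance, an entry such as 'AO Research Institute - AORI' (it appears in the manual mapping further down) is split like this:
In [ ]:
# sketch: how one of the full names splits into name and abbreviation
'AO Research Institute - AORI'.split(' - ')
# -> ['AO Research Institute', 'AORI']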
The functions for using the APIs are defined in the file 'map_universities.py', so we run it.
In [11]:
# run the python file that defines the functions to access the api's
%run map_universities.py
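The helper used below is canton_for_university(query); judging from its use, it takes a free-text query string and returns a canton abbreviation (or a null value if nothing is found). A minimal illustrative stub, only an assumption about its interface and not the actual implementation:
In [ ]:
# illustrative stub only -- the real canton_for_university in map_universities.py queries external APIs
def canton_for_university_stub(query):
    # hypothetical lookup table; the real function resolves the query via web APIs
    known = {'Universität Bern BE Switzerland': 'BE'}
    return known.get(query, None)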
Then match each institution:
In [ ]:
# create a new column with the canton in it
def canton_for_university_query(uni):
    # create the query
    q = str(uni.university_name) + ' ' + str(uni.abbrev) + ' Switzerland'
    # remove some special characters from the query
    to_remove = ['(', ')', ',', '.', '-', '+', '&']
    for ch in to_remove:
        q = q.replace(ch, ' ')
    # execute the query
    return canton_for_university(q)

universities['canton'] = universities.apply(canton_for_university_query, axis=1)
How many did we match?
In [13]:
len(universities[~pd.isnull(universities['canton'])])
Out[13]:
Show the ones we did not match:
In [14]:
universities[pd.isnull(universities['canton'])]
Out[14]:
We found the following information for the unmapped institutions.
Map them and show the remaining unmapped institutions:
In [15]:
# do the manual mapping
manual_map = {
    'Schweizer Kompetenzzentrum Sozialwissensch. - FORS' : 'VD',
    'Pädagogische Hochschule Nordwestschweiz - PHFHNW' : 'AG',
    'Physikal.-Meteorolog. Observatorium Davos - PMOD' : 'GR',
    'Staatsunabh. Theologische Hochschule Basel - STHB' : 'BS',
    'AO Research Institute - AORI' : 'GR',
    'Zürcher Fachhochschule (ohne PH) - ZFH' : 'ZH'
}
for uni_index, ctn in manual_map.items():
    if pd.isnull(universities.at[uni_index, 'canton']):
        universities.at[uni_index, 'canton'] = ctn
universities[pd.isnull(universities['canton'])]
Out[15]:
In [16]:
canton_grants = universities.groupby(by='canton', axis=0).sum()
canton_grants.sort_values('amount', ascending=False)
Out[16]:
The institution EMPA has a presence in 3 cantons: BE, ZH and SG. So we split the grants for EMPA and add one third to each of the 3 cantons.
In [17]:
grants_empa = universities.at['Eidg. Material und Prüfungsanstalt - EMPA', 'amount']
grants_empa_third = grants_empa / 3
empa_cantons = ['BE', 'ZH', 'SG']
for c in empa_cantons:
    canton_grants.at[c, 'amount'] = canton_grants.at[c, 'amount'] + grants_empa_third
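To see the effect of the redistribution, one can display the three updated rows; a quick sketch:
In [ ]:
# sketch: inspect the cantons that received a share of the EMPA grants
canton_grants.loc[empa_cantons]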
How many institutions did we match?
In [18]:
# the 1+len(...) accounts for the mapping of EMPA
print(str(round((100 / len(universities)) * (1 + len(universities[~pd.isnull(universities['canton'])])))) + '%')
Which is how many percent of all grants?
In [19]:
total_grants = universities.amount.sum()
matched_grants = canton_grants.amount.sum()
matched_percent = (100/total_grants) * matched_grants
print(str(round(matched_percent, 2) )+ '%')
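The complement of this figure is the share of grants we could not attribute to any canton (the roughly 4% mentioned earlier); as a one-line sketch:
In [ ]:
# sketch: share of grants going to institutions without a canton mapping
print(str(round(100 - matched_percent, 2)) + '%')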
In [20]:
write_to_csv(canton_grants, 'all_canton_grants.csv')
Out[20]:
In [21]:
# display all of them
pd.set_option('display.max_rows', None)
universities[['university_name', 'canton']]
Out[21]: