In [1]:
# imports (may be more than needed)
import pandas as pd
import numpy as np
import glob # to find all files in folder
from datetime import datetime
from datetime import date, time
from dateutil.parser import parse
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pyplot import show
import itertools
%matplotlib inline
sns.set_context('notebook')
pd.options.mode.chained_assignment = None # default='warn'
Obtain all the data for the Bachelor students, starting from 2007. Keep only the students for which you have an entry for both Bachelor semestre 1 and Bachelor semestre 6. Compute how many months it took each student to go from the first to the sixth semester. Partition the data between male and female students, and compute the average -- is the difference in average statistically significant?
2) Perform a similar operation to what described above, this time for Master students. Notice that this data is more tricky, as there are many missing records in the IS-Academia database. Therefore, try to guess how much time a master student spent at EPFL by at least checking the distance in months between Master semestre 1 and Master semestre 2. If the Mineur field is not empty, the student should also appear registered in Master semestre 3. Last but not the least, don't forget to check if the student has an entry also in the Projet Master tables. Once you can handle well this data, compute the "average stay at EPFL" for master students. Now extract all the students with a Spécialisation and compute the "average stay" per each category of that attribute -- compared to the general average, can you find any specialization for which the difference in average is statistically significant?
In [2]:
all_data = pd.read_csv('all_data.csv', usecols=['Civilité', 'Nom_Prénom', 'title', 'periode_acad', 'periode_pedago','Orientation_Master', 'Spécialisation', 'Filière_opt.', 'Mineur', 'Statut', 'Type_Echange', 'Ecole_Echange', 'No_Sciper'])
In [3]:
all_data.sort_values(by='No_Sciper', axis=0).head(10)
Out[3]:
In [4]:
len(all_data)
Out[4]:
In [5]:
all_data['periode_pedago'].unique()
Out[5]:
In [6]:
all_data['title'].unique()
Out[6]:
checkout what the Semester printemps and automne are
In [7]:
all_data[all_data['periode_pedago'].isin(['Semestre printemps', 'Semestre automne'])]['title'].unique()
Out[7]:
This are the students exchange (students from other universities in exchange at EPFL) and students doing the Passerelle HES. Since the exchange students don't do a full master at EPFL we ignore them. Also we won't count students in Passerelle HES as beeing in the Master just yet because they have to succeed the passerelle to optain a master. So if they do a master they are inscribed in Master anyways.
In [8]:
master_periode_pedago = ['Master semestre 1', 'Master semestre 2', 'Projet Master printemps', 'Master semestre 3', 'Projet Master automne']
master_data = all_data[all_data['periode_pedago'].isin(master_periode_pedago)]
remove the columns with only NaN (in case there is one)
In [9]:
master_data.dropna(axis=1, how='all', inplace=True)
Rename the columns (remove the é and give shorter names)
In [10]:
master_data.rename(columns={'Civilité': 'Sex', 'Nom_Prénom': 'Name', 'Spécialisation': 'Specialisation'}, inplace=True)
In [11]:
master_data['periode_pedago'].unique()
Out[11]:
add some columns that make the use of the data easier later
In [12]:
master_data['periode_acad'].unique()
Out[12]:
In [13]:
# store the year of the entry
def start_year(student):
return int (student['periode_acad'].split('-')[0])
master_data['start_year'] = master_data.apply(start_year, axis=1)
master_data['end_year'] = master_data.apply(lambda st: start_year(st)+1, axis=1)
In [14]:
# make the indicator columns
new_cols_map = {
'Master1': ['Master semestre 1'],
'Master2': ['Master semestre 2'],
'Master3': ['Master semestre 3'],
'Project_Master' : ['Projet Master printemps', 'Projet Master automne']
}
for (new_col_name, match_list) in new_cols_map.items():
master_data[new_col_name] = master_data.apply(lambda student: student['periode_pedago'] in match_list, axis=1)
In [15]:
# show the new master_data (sample randomly to keep it managable but still informative)
master_data.sample(n = 10, axis=0, replace=False)
Out[15]:
We will remove all students that did certainly not finish the master. That is, we remove students that:
We also remove students that already have been studing in 2007 (the start of our data) because we can not tell how long they already have studied before.
In [16]:
# find all students that have not done the 'master 1' semester -> have not studied enough to finish the master.
grouped = master_data.groupby(by='No_Sciper')
no_master_1 = pd.DataFrame(columns=['Civilité', 'Nom_Prénom', 'Orientation_Bachelor', 'Orientation_Master',
'Spécialisation', 'Filière_opt.', 'Mineur', 'Statut', 'Type_Echange',
'Ecole_Echange', 'No_Sciper', 'title', 'periode_acad',
'periode_pedago'])
for scip, group in grouped:
if (group.periode_pedago != 'Master semestre 1').all():
no_master_1 = pd.concat([no_master_1, group])
len(no_master_1.No_Sciper.unique())
Out[16]:
In [17]:
# all the students that already studied in the '2007-2008' year:
df_2006 = no_master_1[no_master_1.periode_acad == '2007-2008']
len(df_2006.No_Sciper.unique())
Out[17]:
remove the found students. The others will be removed later.
In [18]:
sciper_to_remove = no_master_1['No_Sciper'].unique()
master_data = master_data[~master_data['No_Sciper'].isin(sciper_to_remove)]
Find all the students that are registered in the current semester and filter them.
In [19]:
study_now = pd.DataFrame(columns=['Civilité', 'Nom_Prénom', 'Orientation_Bachelor', 'Orientation_Master',
'Spécialisation', 'Filière_opt.', 'Mineur', 'Statut', 'Type_Echange',
'Ecole_Echange', 'No_Sciper', 'title', 'periode_acad',
'periode_pedago'])
for scip, group in grouped:
if (group.periode_acad == '2016-2017').any():
study_now = pd.concat([study_now, group])
len(study_now.No_Sciper.unique())
Out[19]:
... and remove them to
In [20]:
sciper_to_remove = study_now['No_Sciper'].unique()
master_data = master_data[~master_data['No_Sciper'].isin(sciper_to_remove)]
In [21]:
def group_master_data(grouped_entries):
# check that there are no two students with the same sciper number and different names or sex.
must_be_unique_list = ['Sex', 'Name'] #No_Scyper also but we group by it -> unique by construction
for unique_col in must_be_unique_list:
if(len(grouped_entries[unique_col].unique()) > 1): raise ValueError('Two students of different '+unique_col+' with same No_Sciper')
#aggregate the cols
first_entry = grouped_entries.head(1)
df_map = {
'No_Sciper' : first_entry['No_Sciper'].values[0],
'Name' : first_entry['Name'].values[0],
'Sex' : first_entry['Sex'].values[0],
'Specialisation' : grouped_entries['Specialisation'].dropna().unique(), # all the spcs the student was inscribed to
'Mineur' : grouped_entries['Mineur'].dropna().unique(),# all the minors the student was inscribed to
'first_year': grouped_entries['start_year'].min(), # smallest start year entry
'last_year' : grouped_entries['end_year'].max(), # highest year a studend appears
#some brainfuck lines ;)
'first_semestre' : grouped_entries.sort_values(by=['start_year', 'periode_pedago'], axis=0, ascending=True)['periode_pedago'].values[0], # eg master1
'last_semestre' : grouped_entries.sort_values(by=['end_year', 'periode_pedago'], axis=0, ascending=False)['periode_pedago'].values[0], # the name of the last semester (eg. master 3)
'semesters_done' : grouped_entries.sort_values(by=['end_year'])['periode_pedago'].values,
'nombre_semestres' : len(grouped_entries), # how many different semesters the student did at epfl
'project_master' : grouped_entries['Project_Master'].sum() > 0 # if student did the master project
}
# if the student did any minor or spec
df_map['mineur_or_spe'] = len(df_map['Specialisation']) + len(df_map['Mineur']) > 0
# True if the student has not (yet) finished the master
df_map['to_remove'] = df_map['nombre_semestres'] <= 1 or (not df_map['mineur_or_spe'] and df_map['nombre_semestres'] <= 2)
# if there are two spe, take the latest one
if len(df_map['Specialisation']) > 1:
df_map['Specialisation'] = grouped_entries[grouped_entries['end_year'] == df_map['last_year']]['Specialisation'].values[0]
elif len(df_map['Specialisation']) == 1:
df_map['Specialisation'] = df_map['Specialisation'][0] # take the spe out of the array
# set correct NaNs
if len(df_map['Mineur']) == 0:
df_map['Mineur'] = np.nan
if len(df_map['Specialisation']) == 0:
df_map['Specialisation'] = np.nan
# make Dataframe
for (k, v) in df_map.items():
df_map[k] = [v]
return pd.DataFrame.from_dict(df_map)
grouped_master = master_data.groupby(by='No_Sciper', as_index=False, sort=True).apply(group_master_data)
grouped_master.head()
Out[21]:
remove the students that have not finished their master (described above)
In [22]:
students_to_remove = grouped_master[grouped_master['to_remove']]
len(students_to_remove)
Out[22]:
In [23]:
grouped_master = grouped_master[~grouped_master['to_remove']]
Set the Sciper number as index
In [24]:
grouped_master.set_index('No_Sciper', inplace=True)
len(grouped_master)
Out[24]:
First without any further cleaning and thinking
In [25]:
master_epfl = grouped_master.copy()
In [26]:
plt.hist(master_epfl['nombre_semestres'], bins=8, range=[1, 9])
Out[26]:
In [27]:
master_epfl.describe()
Out[27]:
The mean of the number of semesters is around 3.6 semesters. However not all students are registered in the the Master Project (missing data?), but in the plot above we assume everyone finished their master. So we add to all students that are not inscribed in a Master Project one semester:
In [28]:
def add_semester_if_no_PM(row):
t = row.nombre_semestres
if(not row.project_master):
return t + 1
else:
return t
master_added_sem = master_epfl.copy()
master_added_sem['nombre_semestres'] = master_epfl.apply(add_semester_if_no_PM, axis=1)
In [29]:
master_added_sem.describe()
Out[29]:
It made a big difference. But that seems reasonable since most students do a Mineur/Spe which makes already (at least) 4 semesters (MP included)
In [30]:
plt.hist(master_added_sem['nombre_semestres'], bins=8, range=[1, 9])
Out[30]:
In [31]:
master_with_spe = master_added_sem[~pd.isnull(master_added_sem['Specialisation'])]
master_with_spe.describe()
Out[31]:
almost 50% of all students do a minor or a specialisation:
In [32]:
# how many percent of students do a specialisation
100/len(master_added_sem) * len(master_with_spe)
Out[32]:
In [33]:
# how many percent of students do a minor
100/len(master_added_sem) * len(master_added_sem[~pd.isnull(master_added_sem['Mineur'])])
Out[33]:
What are the different specialisations and how popular are they?
In [34]:
# The different specialisations
master_with_spe['Specialisation'].unique()
Out[34]:
A nice plot showing that there is a huge difference in the number of people taking the different Specialisations
In [35]:
sns.countplot(y="Specialisation", data=master_with_spe);
In the time taken to colmplete their studies they don't differ a lot. (The red dotted line is the averarge of all students, the blue line the average of the students with specialisations, the black bar is the confidence interval)
In [36]:
ax = sns.barplot(x='nombre_semestres', y='Specialisation', data=master_with_spe);
plt.axvline(master_with_spe['nombre_semestres'].mean(), color='b', linestyle='dashed', linewidth=2)
plt.axvline(master_added_sem['nombre_semestres'].mean(), color='r', linestyle='dashed', linewidth=2)
Out[36]:
Compared to all master students, it seems that only the first two specs actually inply a longer stay at EPFL (their confidence interval just barely don't touch the general mean)
In [37]:
def calc_diff_from_average(col, average):
return (lambda row: row[col] - average)
master_with_spe['diff_from_average'] = master_with_spe.apply(calc_diff_from_average('nombre_semestres', master_added_sem['nombre_semestres'].mean()), axis=1)
In [38]:
sns.barplot(x='diff_from_average', y='Specialisation', data=master_with_spe);
In [39]:
import scipy.stats as stats
We want to see if the difference of the average number of semesters for a particular specialisation and for all the master students in informatique are statistically significant with a threshold of 95%
We use a Welch's T-Test (which does not assume equal population variance): it measures whether the average value differs significantly across samples.
In [40]:
specialisation = master_with_spe.Specialisation.unique()
for spec in specialisation:
print(spec)
data_spec = master_with_spe[master_with_spe.Specialisation == spec]
print (stats.ttest_ind(a = data_spec.nombre_semestres, b= master_added_sem.nombre_semestres, equal_var=False))
print('\n')
Only Signals, Images and Interfaces and Internet computing have a pvalue < 0.05 (pvalue = 0.044 and 0.028 resp.) thus we can reject the null hypothesis and tell that the difference is significant.
For all other specialisation we cannot reject the null hypothesis of identical average scores (pvalue > 0.05), which means: for those specialisation we cannot say that the difference is in average statistically significant
Except for the number of students, male and females are quite similar
In [41]:
Female = master_added_sem[master_added_sem.Sex == 'Madame']
Female.describe()
Out[41]:
In [42]:
Male = master_added_sem[master_added_sem.Sex == 'Monsieur']
Male.describe()
Out[42]:
We want to see if the difference of the average years for female and male are statistically significant with a threshold of 95%
We use a Welch's T-Test (which does not assume equal population variance): it measures whether the average value differs significantly across samples.
In [43]:
stats.ttest_ind(a = Female.nombre_semestres, b= Male.nombre_semestres, equal_var=False)
Out[43]:
Since the pvalue is > 0.05, we cannot reject the null hypothesis of identical average scores which means: we cannot say that the difference is in average statistically significant, in fact it is quite similar (p value close to 1)
Also over time nothing much changed. The females might have a small upwards trend, but the confidence intervall is quite big.
In [44]:
sns.regplot(x=Male.last_year, y=Male.nombre_semestres, marker="o",ci=95, label='Male', scatter_kws={'s':100})
ax =sns.regplot(x=Female.last_year, y=Female.nombre_semestres, marker="d", ci= 95, label='Female', color='r')
ax.legend(loc="best")
Out[44]:
In [ ]: