Wednesday October 25, 2017 at 11:59PM
In this homework we will extract interesting information from www.toptop_universities.com and www.timeshighereducation.com, two platforms that maintain a global ranking of worldwide top_universities. This ranking is not offered as a downloadable dataset, so you will have to find a way to scrape the information we need! You are not allowed to download manually the entire ranking -- rather you have to understand how the server loads it in your browser. For this task, Postman with the Interceptor extension can help you greatly. We recommend that you watch this brief tutorial to understand quickly how to use it.
Plot your data using bar charts and describe briefly what you observed.
Obtain the 200 top-ranking top_universities in www.timeshighereducation.com (ranking 2018). Repeat the analysis of the previous point and discuss briefly what you observed.
Merge the two DataFrames created in questions 1 and 2 using university names. Match top_universities' names as well as you can, and explain your strategy. Keep track of the original position in both rankings.
Find useful insights in the data by performing an exploratory analysis. Can you find a strong correlation between any pair of variables in the dataset you just created? Example: when a university is strong in its international dimension, can you observe a consistency both for students and faculty members?
Can you find the best university taking in consideration both rankings? Explain your approach.
Hints:
In [1]:
import requests, re, html
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
from tqdm import tqdm_notebook
import warnings
warnings.filterwarnings('ignore')
NUM_OBS = 200
In [2]:
root_url_1 = 'https://www.topuniversities.com'
# we use the link to the API from where the website fetches its data instead of BeautifulSoup
# much much cleaner
list_url_1 = root_url_1 + '/sites/default/files/qs-rankings-data/357051_indicators.txt'
r = requests.get(list_url_1)
top_universities = pd.DataFrame()
top_universities = top_universities.from_dict(r.json()['data'])[['uni', 'overall_rank', 'location', 'region']]
# get the university name and details URL with a regex
top_universities['name'] = top_universities['uni'].apply(lambda name: html.unescape(re.findall('<a[^>]+href=\"(.*?)\"[^>]*>(.*)?</a>', name)[0][1]))
top_universities['url'] = top_universities['uni'].apply(lambda name: html.unescape(re.findall('<a[^>]+href=\"(.*?)\"[^>]*>(.*)?</a>', name)[0][0]))
top_universities.drop('uni', axis=1, inplace=True)
top_universities['overall_rank'] = top_universities['overall_rank'].astype(int)
In [3]:
# selects the top N rows based on the colum_name of the dataframe df
def select_top_N(df, column_name, N):
df = df.sort_values(by=column_name)
df = df[df[column_name] <= N]
return df
In [57]:
# get only the first top-200 universities by overall rank
top_universities = select_top_N(top_universities, 'overall_rank', NUM_OBS)
top_universities.head()
Out[57]:
In [58]:
students_total = []
students_inter = []
faculty_total = []
faculty_inter = []
def get_num(soup, selector):
scraped = soup.select(selector)
# Some top_universities don't have stats, return NaN for these case
if scraped:
return int(scraped[0].contents[0].replace(',', ''))
else:
return np.NaN
for details_url in tqdm_notebook(top_universities['url']):
soup = BeautifulSoup(requests.get(root_url_1 + details_url).text, 'html.parser')
students_total.append(get_num(soup, 'div.total.student div.number'))
students_inter.append(get_num(soup, 'div.total.inter div.number'))
faculty_total.append(get_num(soup, 'div.total.faculty div.number'))
faculty_inter.append(get_num(soup, 'div.inter.faculty div.number'))
top_universities['students_total'] = students_total
top_universities['students_international'] = students_inter
top_universities['students_national'] = top_universities['students_total'] - top_universities['students_international']
top_universities['faculty_total'] = faculty_total
top_universities['faculty_international'] = faculty_inter
top_universities['faculty_national'] = top_universities['faculty_total'] - top_universities['faculty_international']
top_universities.head()
Out[58]:
In [6]:
#defining colors for each type of plot
colors_1 = ['#FF9F9A', '#D0BBFF']
colors_2 = ['#92C6FF', '#97F0AA']
plt.style.use('ggplot')
In [63]:
top = 10
top_universities_ratio = select_top_N(top_universities, 'overall_rank', top)
In [64]:
top_universities_ratio_sf = top_universities_ratio[['name', 'students_total', 'faculty_total']]
top_universities_ratio_sf = top_universities_ratio_sf.set_index(['name'])
top_universities_ratio_sf.index.name = None
fig, axes = plt.subplots(1, 1, figsize=(10,5), sharey=True)
top_universities_ratio_sf.plot.bar(stacked=True, color=colors_1, ax=axes)
axes.set_title('Top 10 ratio\'s between students and faculty members among universities')
axes.legend(labels=['students', 'faculty members'], bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
fig.autofmt_xdate()
plt.show()
Comments: We see that it is rather difficult to compare the ratios of the different universities. This is due to the different sizes of the population. In order to draw more precise information about it, we need to normalize the data with repect to each university.
In [65]:
# normalize the data to be able to make a good comparison
top_universities_ratio_normed = top_universities_ratio_sf.div(top_universities_ratio_sf.sum(1), axis=0).sort_values(by='faculty_total', ascending=False)
top_universities_ratio_normed.index.name = None
fig, axes = plt.subplots(1, 1, figsize=(10,5), sharey=True)
top_universities_ratio_normed.plot.bar(stacked=True, color=colors_1, ax=axes)
axes.set_title('Top 10 ratio\'s between students and faculty members among universities')
axes.legend(labels=['students', 'faculty members'], bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
# we can restrict the range on the y axis to avoid displaying unnecessary content
axes.set_ylim([0.7,1])
fig.autofmt_xdate()
plt.show()
Comments: You noticed that the y-axis ranges from 0.7 to 1. We limited the visualization to this interval because the complete interval does not add meaningful insight about the data. Analyzing the results, we see that the Caltech university is the university in the top 10 offering more faculty members to its students. ETHZ is in the last position.
In [66]:
top_universities_ratio_s = top_universities_ratio[['name', 'students_international', 'students_national']]
top_universities_ratio_s = top_universities_ratio_s.set_index(['name'])
top_universities_ratio_s_normed = top_universities_ratio_s.div(top_universities_ratio_s.sum(1), axis=0).sort_values(by='students_international', ascending=False)
top_universities_ratio_s_normed.index.name = None
fig, axes = plt.subplots(1, 1, figsize=(10, 5))
top_universities_ratio_s_normed.plot.bar(stacked=True, color=colors_2, ax=axes)
axes.set_title('Top 10 ratio\'s of international and national students among universities')
axes.legend(labels=['international students', 'national students'], bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
axes.set_ylim([0, 0.6])
fig.autofmt_xdate()
plt.show()
Comments: The most international university, by its students, among the top 10 universities is the Imperial College London. Notice that ETHZ is in the third position.
In [67]:
ratio_country_sf = top_universities.groupby(['location'])['students_total', 'faculty_total'].sum()
ratio_country_sf_normed = ratio_country_sf.div(ratio_country_sf.sum(1), axis=0).sort_values(by='faculty_total', ascending=False)
ratio_country_sf_normed.index.name = None
fig, axes = plt.subplots(1, 1, figsize=(15, 5))
ratio_country_sf_normed.plot.bar(stacked=True, color=colors_1, ax=axes)
axes.set_title('Ratio of students and faculty members by country')
axes.legend(labels=['students', 'faculty members'], bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
axes.set_ylim([0.8,1])
fig.autofmt_xdate()
plt.show()
In [68]:
ratio_country_s = top_universities.groupby(['location'])['students_international', 'students_national'].sum()
ratio_country_s_normed = ratio_country_s.div(ratio_country_s.sum(1), axis=0).sort_values(by='students_international', ascending=False)
ratio_country_s_normed.index.name = None
fig, axes = plt.subplots(1, 1, figsize=(15, 5))
ratio_country_s_normed.plot.bar(stacked=True, color=colors_2, ax=axes)
axes.set_title('Ratio of international and national students by country')
axes.legend(labels=['international students', 'national students'], bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
axes.set_ylim([0, 0.4])
fig.autofmt_xdate()
plt.show()
Comments: Aggregating the data by country, we see that Russia is the country offering more faculty members for its student, followed by Danemark and Saudi Arabia. The most international university in terms of students is Australia, followed by United Kingdom and Hong Kong. Switzerland is in the fifth position and India is the country with the lowest ratio of international students.
In [69]:
ratio_region_s = top_universities.groupby(['region'])['students_total', 'faculty_total'].sum()
ratio_region_s_normed = ratio_region_s.div(ratio_region_s.sum(1), axis=0).sort_values(by='faculty_total', ascending=False)
ratio_region_s_normed.index.name = None
fig, axes = plt.subplots(1, 1, figsize=(10,5), sharey=True)
ratio_region_s_normed.plot.bar(stacked=True, color=colors_1, ax=axes)
axes.set_title('Ratio of students and faculty members by region')
axes.legend(labels=['students', 'faculty members'], bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
axes.set_ylim([0.8,1])
axes.yaxis.grid(True)
fig.autofmt_xdate()
plt.show()
In [70]:
ratio_region_s = top_universities.groupby(['region'])['students_international', 'students_national'].sum()
ratio_region_s_normed = ratio_region_s.div(ratio_region_s.sum(1), axis=0).sort_values(by='students_international', ascending=False)
ratio_region_s_normed.index.name = None
fig, axes = plt.subplots(1, 1, figsize=(10,5), sharey=True)
ratio_region_s_normed.plot.bar(stacked=True, color=colors_2, ax=axes)
axes.set_title('Ratio of international and national students by region')
axes.legend(labels=['international students', 'national students'], bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
axes.set_ylim([0,0.4])
axes.yaxis.grid(True)
fig.autofmt_xdate()
plt.show()
Comments: Asia is the region offering more faculty members to its students. It is followed by North America and Europe. The most international university in terms of students is Oceania. Europe is second.
In [71]:
# we repeat the same procedure as for www.topuniversities.com
root_url_2 = 'https://www.timeshighereducation.com'
list_url_2 = root_url_2 + '/sites/default/files/the_data_rankings/world_university_rankings_2018_limit0_369a9045a203e176392b9fb8f8c1cb2a.json'
r = requests.get(list_url_2)
times_higher_education = pd.DataFrame()
times_higher_education = times_higher_education.from_dict(r.json()['data'])[['rank', 'location', 'location', 'name', 'url', 'stats_number_students', 'stats_pc_intl_students', 'stats_student_staff_ratio']]
# rename columns as is the first dataframe
times_higher_education.columns = ['overall_rank', 'location', 'region', 'name', 'url', 'students_total', 'ratio_inter', 'student_staff_ratio']
# as the ranks have different represetation we had to delete the '=' in front of universities that have the same rank,
# rewrite the rank when it is represented as an interval (ex: 201-250) and finally delete the '+' in the end for the last ones
times_higher_education['overall_rank'] = times_higher_education['overall_rank'].apply(lambda rank: re.sub('[=]', '', rank))
times_higher_education['overall_rank'] = times_higher_education['overall_rank'].apply(lambda rank: rank.split('–')[0])
times_higher_education['overall_rank'] = times_higher_education['overall_rank'].apply(lambda rank: re.sub('[+]', '', rank)).astype(int)
# remaps a ranking in order to make selection by ranking easier
# ex: 1,2,3,3,5,6,7 -> 1,2,3,3,4,5,6
def remap_ranking(rank):
last=0
for i in range(len(rank)):
if last == rank[i]-1:
#no problem
last = rank[i]
elif last != rank[i]:
last = last+1
rank[i] = last
rank[(i+1):] = rank[(i+1):]-1
return rank
times_higher_education['overall_rank'] = remap_ranking(times_higher_education['overall_rank'].copy())
# in the following lines we make the necessary transformation in order to get the right type or numbers for each column
times_higher_education['students_total'] = times_higher_education['students_total'].apply(lambda x: re.sub('[^0-9]','', x)).astype(int)
times_higher_education['ratio_inter'] = times_higher_education['ratio_inter'].apply(lambda x: re.sub('[^0-9]','', x)).astype(float)
times_higher_education['student_staff_ratio'] = times_higher_education['student_staff_ratio'].astype(float)
times_higher_education['students_international'] = (times_higher_education['students_total'] * (times_higher_education['ratio_inter']/100)).astype(int)
times_higher_education['students_national'] = times_higher_education['students_total'] - times_higher_education['students_international']
times_higher_education['faculty_total'] = (times_higher_education['students_total'] / times_higher_education['student_staff_ratio']).astype(int)
times_higher_education['faculty_international'] = np.NaN
times_higher_education['faculty_national'] = np.NaN
times_higher_education['region'] = np.NaN
# resolve ties
times_higher_education['overall_rank'] = np.arange(1, times_higher_education.shape[0]+1)
# resolve N/A region
loc_to_reg = top_universities[['location', 'region']]
loc_to_reg = set(loc_to_reg.apply(lambda x: '{}_{}'.format(x['location'], x['region']), axis=1).values)
loc_to_reg = {x.split('_')[0]: x.split('_')[1] for x in loc_to_reg}
from collections import defaultdict
loc_to_reg = defaultdict(lambda: 'N/A', loc_to_reg)
def resolve_uni(x):
x['region'] = loc_to_reg[x['location']]
return x
times_higher_education = times_higher_education.apply(resolve_uni, axis=1)
del times_higher_education['ratio_inter']
del times_higher_education['student_staff_ratio']
In [72]:
# get only the first top-200 universities by overall rank
times_higher_education = select_top_N(times_higher_education, 'overall_rank', NUM_OBS)
times_higher_education.head()
Out[72]:
In [73]:
top = 10
times_higher_education_ratio = select_top_N(times_higher_education, 'overall_rank', top)
In [87]:
times_higher_education_ratio_sf = times_higher_education_ratio[['name', 'students_total', 'faculty_total']]
times_higher_education_ratio_sf = times_higher_education_ratio_sf.set_index(['name'])
times_higher_education_ratio_normed = times_higher_education_ratio_sf.div(times_higher_education_ratio_sf.sum(1), axis=0).sort_values(by='faculty_total', ascending=False)
times_higher_education_ratio_normed.index.name = None
fig, axes = plt.subplots(1, 1, figsize=(10,5), sharey=True)
times_higher_education_ratio_normed.plot.bar(stacked=True, color=colors_1, ax=axes)
axes.set_title('Top 10 ratio\'s between students and faculty members among universities')
axes.legend(labels=['students', 'faculty members'], bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
axes.set_ylim([0.8,1])
fig.autofmt_xdate()
plt.show()
Comments: The university of Chicago is the faculty with the more faculty members by students. It is closely followed by the California Institute of Technology.
In [75]:
times_higher_education_ratio_s = times_higher_education_ratio[['name', 'students_international', 'students_national']]
times_higher_education_ratio_s = times_higher_education_ratio_s.set_index(['name'])
times_higher_education_ratio_s_normed = times_higher_education_ratio_s.div(times_higher_education_ratio_s.sum(1), axis=0).sort_values(by='students_international', ascending=False)
times_higher_education_ratio_s_normed.index.name = None
fig, axes = plt.subplots(1, 1, figsize=(10, 5))
times_higher_education_ratio_s_normed.plot.bar(stacked=True, color=colors_2, ax=axes)
axes.set_title('Top 10 ratio\'s of international and national students among universities')
axes.legend(labels=['international students', 'national students'], bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
axes.set_ylim([0.2, 0.6])
fig.autofmt_xdate()
plt.show()
Comments: The Imperial College Longon university has a strong lead in the internationalization of its student. Oxford and ETHZ are following bunched together.
In [76]:
ratio_country_sf = times_higher_education.groupby(['location'])['students_total', 'faculty_total'].sum()
ratio_country_sf_normed = ratio_country_sf.div(ratio_country_sf.sum(1), axis=0).sort_values(by='faculty_total', ascending=False)
ratio_country_sf_normed.index.name = None
fig, axes = plt.subplots(1, 1, figsize=(15, 5))
ratio_country_sf_normed.plot.bar(stacked=True, color=colors_1, ax=axes)
axes.set_title('Ratio of students and faculty members by country')
axes.legend(labels=['students', 'faculty members'], bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
axes.set_ylim([0.8,1])
fig.autofmt_xdate()
plt.show()
Comments: Denmark is in the first position. We find the Russian Federation in the second place. This is the same result obtained with the top universities website the other way around. This shows that either the universities of each country have different ranking in each website or each website has different information about each university.
In [77]:
ratio_country_s = times_higher_education.groupby(['location'])['students_international', 'students_national'].sum()
ratio_country_s_normed = ratio_country_s.div(ratio_country_s.sum(1), axis=0).sort_values(by='students_international', ascending=False)
ratio_country_s_normed.index.name = None
fig, axes = plt.subplots(1, 1, figsize=(15, 5))
ratio_country_s_normed.plot.bar(stacked=True, color=colors_2, ax=axes)
axes.set_title('Ratio of international and national students by country')
axes.legend(labels=['international students', 'national students'], bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
axes.set_ylim([0, 0.6])
fig.autofmt_xdate()
plt.show()
Comments: Luxembourg has more international than national students which allows it to be in first position without difficulty. Switzerland is in the sixth position (versus fifth for top university website).
In [78]:
# Some countries have their field 'region' filled with 'N/A': this is due to the technique we used to write the
# correct region for each university. In the sample we are considering, let's see how many universities are concerned:
times_higher_education[times_higher_education['region'] == 'N/A']
Out[78]:
In [85]:
# As there is only two universities concerned, we can rapidly write it by hand. Of course we should have develop a
# more generalized manner to do it, if we had a much larger sample.
times_higher_education.set_value(178, 'region', 'Europe')
times_higher_education.set_value(193, 'region', 'Europe')
Out[85]:
In [80]:
ratio_region_s = times_higher_education.groupby(['region'])['students_total', 'faculty_total'].sum()
ratio_region_s_normed = ratio_region_s.div(ratio_region_s.sum(1), axis=0).sort_values(by='faculty_total', ascending=False)
ratio_region_s_normed.index.name = None
fig, axes = plt.subplots(1, 1, figsize=(10,5), sharey=True)
ratio_region_s_normed.plot.bar(stacked=True, color=colors_1, ax=axes)
axes.set_title('Ratio of students and faculty members by region')
axes.legend(labels=['students', 'faculty members'], bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
axes.set_ylim([0.8,1])
axes.yaxis.grid(True)
fig.autofmt_xdate()
plt.show()
In [81]:
ratio_region_s = times_higher_education.groupby(['region'])['students_international', 'students_national'].sum()
ratio_region_s_normed = ratio_region_s.div(ratio_region_s.sum(1), axis=0).sort_values(by='students_international', ascending=False)
ratio_region_s_normed.index.name = None
fig, axes = plt.subplots(1, 1, figsize=(10,5), sharey=True)
ratio_region_s_normed.plot.bar(stacked=True, color=colors_2, ax=axes)
axes.set_title('Ratio of international and national students by region')
axes.legend(labels=['international students', 'national students'], bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
axes.set_ylim([0,0.4])
axes.yaxis.grid(True)
fig.autofmt_xdate()
plt.show()
Comments: In the first plot, we see that Africa is the region where there is more faculty members by students. The two following regions are very close to each other. In the second plot, Oceania is the more internationalized school in terms of its students and Europe is second. We had similar results by the other website concerning this last outcome.
In [82]:
# Detects same universities with different names in the two dataframes before merging
# using Jaccard similarity and same location rule (seems to keep matching entry)
def t(x):
# Compute Jaccard score (intersection over union)
def jaccard(a, b):
u = set(a.split(' '))
v = set(b.split(' '))
return len(u.intersection(v)) / len(u.union(v))
names = top_universities['name'].tolist()
locations = top_universities['location'].tolist()
scores = np.array([jaccard(x['name'], n) for n in names])
m = scores.max()
i = scores.argmax()
# Jaccard score for similarity and location match to filter out name with different locations
if m > 0.5 and x['location'] == locations[i]:
x['name'] = names[i]
return x
# Match universities name in both dataframes
times_higher_education = times_higher_education.apply(t, axis=1)
# Intersection on the name column of the two datasets
merged = pd.merge(top_universities, times_higher_education, on='name', how='inner')
merged.head()
Out[82]:
Here we first proceed by creating the correlation matrix (since it's a symetric matrix we only kept the lower triangle). We then plot it using a heatmap to see correlation between columns of the dataframe. We also made another heatmap with only correlation whose absolute value is greater than 0.5. Finally we averaged the features when they were available for the two websites (except rankings).
Some correlations bring interesting information:
In [83]:
merged_num = merged.select_dtypes(include=[np.number])
merged_num.dropna(how='all', axis=1)
merged_num.dropna(how='any', axis=0)
def avg_feature(x):
cols = set([x for x in x.index if 'overall' not in x])
cols_common = set([x[0:-2] for x in cols])
for cc in cols_common:
cc_x = '{}_x'.format(cc)
cc_y = '{}_y'.format(cc)
if cc_y in cols:
x['{}_avg'.format(cc)] = (x[cc_x] + x[cc_y]) / 2
else:
x['{}_avg'.format(cc)] = x[cc_x] / 2
for c in cols:
del x[c]
return x
merged_num_avg = merged_num.apply(avg_feature, axis=1)
merged_num.head()
corr = merged_num.corr()
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(corr, ax=ax, mask=mask, annot=True, square=True)
plt.show()
# Keep only correlation with and absolute value superior to 0.5
corr[(corr < 0.5) & (corr > -0.5)] = 0
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(corr, ax=ax, mask=mask, annot=True, square=True)
plt.show()
# Keep only correlation with and absolute value superior to 0.5 for averaged features
corr = merged_num_avg.corr()
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
corr[(corr < 0.5) & (corr > -0.5)] = 0
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(corr, ax=ax, mask=mask, annot=True, square=True)
plt.show()
First we have to transform ranking in some score. Here we assume a linear relation for the score given the ranking, so we gave a score of 1 for the best ranking and 0 for the worst ranking with linear mapping between these two. We did it for each of the ranking (the two websites). Also we don't really know if a website is most trustworthy than the other, so a good merging for the ranking would be to take the average of the two scores with equal weights for each score. Finally we also took into account the ratio of staff member per students:
$finalScore = mean(score1, score2, staff per studiants)$
After computing these values, we found that Caltech is the best university (according to our assumptions).
Per Website ranking:
Caltech: top_universities -> 4 | times_higher_education = -> 3 | staff per student ratio -> 0.15 | => final score: 0.71
In [84]:
r = merged[['name', 'overall_rank_x', 'overall_rank_y']]
r.head()
def lin(df):
best_rank = df.min()
worst_rank = df.max()
a = 1 / (best_rank - worst_rank)
b = 1 - a*best_rank
return df.apply(lambda x: a*x + b)
r['stud_staff_ratio'] = merged[['faculty_international_x', 'faculty_international_y']].mean(axis=1) / \
merged[['students_total_x', 'students_total_y']].mean(axis=1)
r['score_x'] = lin(r['overall_rank_x'])
r['score_y'] = lin(r['overall_rank_y'])
r['overall_score'] = r[['score_x', 'score_y', 'stud_staff_ratio']].mean(axis=1)
r = r.dropna()
r[r['overall_score'] == r['overall_score'].max()]
Out[84]: