Obtain the 200 top-ranking universities from www.topuniversities.com (2018 ranking). In particular, extract the following fields for each university: name, rank, country and region, number of faculty members (international and total), and number of students (international and total). Some of this information is not available in the main list, so you have to find it on each university's details page. Store the resulting dataset in a pandas DataFrame and answer the following questions:
Which are the best universities in terms of: (a) the ratio between faculty members and students? (b) the ratio of international students?
Answer the previous question aggregating the data by (c) country and (d) region.
In [1]:
#importing the libraries
import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import re
import numpy as np
from collections import defaultdict
from IPython.core.display import display, HTML
In [2]:
# Define a function that fetches the attributes of a university from the URL of its details page
def Get_data(HTML_tag):
    # Use BeautifulSoup to parse the HTML of the given page
    Page_Soup = BeautifulSoup(requests.get(HTML_tag).text, 'html.parser')
    name = Page_Soup.find('h1').text
    country = Page_Soup.find('span', class_='country').text
    # We extract the number, strip the thousands separator, then drop the surrounding text
    # (there were spaces and text we had no use for)
    # Get the total number of faculty members
    if Page_Soup.find('div', class_='total faculty') is not None:
        faculty_total = re.findall(r"\d+", Page_Soup.find('div', class_='total faculty').text.replace(",", ""))
        number_faculty_total = int(faculty_total[0])
    else:
        number_faculty_total = 'Unknown'
    # Get the number of international faculty members
    if Page_Soup.find('div', class_='inter faculty') is not None:
        faculty_international = re.findall(r"\d+", Page_Soup.find('div', class_='inter faculty').text.replace(",", ""))
        number_faculty_international = int(faculty_international[0])
    else:
        number_faculty_international = 'Unknown'
    # Get the total number of students
    if Page_Soup.find('div', class_='total student') is not None:
        students_total = re.findall(r"\d+", Page_Soup.find('div', class_='total student').text.replace(",", ""))
        number_students_total = int(students_total[0])
    else:
        number_students_total = 'Unknown'
    # Get the number of students coming from another country
    if Page_Soup.find('div', class_='total inter') is not None:
        students_international = re.findall(r"\d+", Page_Soup.find('div', class_='total inter').text.replace(",", ""))
        number_students_international = int(students_international[0])
    else:
        number_students_international = 'Unknown'
    # Return the results as a one-row DataFrame with all the extracted fields
    Data = pd.DataFrame({'Name': [name], 'Country': [country], 'Total_Faculty': [number_faculty_total],
                         'Inter_Faculty': [number_faculty_international], 'Total_Students': [number_students_total],
                         'Inter_Students': [number_students_international], 'Rank': np.nan, 'Region': np.nan})
    return Data
In [3]:
# We run a test on the function Get_data
# and display the result
Test = Get_data('https://www.topuniversities.com/universities/university-michigan#wurs')
display(Test)
Not all of the data for each university is on the main ranking page, but the page does contain a link to every university. We will now extract the URLs of all the universities' details pages.
In [4]:
# Extract the request URL from the JSON file and use it to fetch the remaining data via Get_data()
# We used Postman to request the HTML file, then searched for <tbody>; it was empty, so we looked
# for the JSON/text documents that the page's JavaScript (jQuery/Ajax) loads asynchronously
r = requests.get('https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051.txt?_=1508507889334')
req = r.json()
data = req['data']
QSframe = pd.DataFrame.from_dict(data)
QSframe.url = "https://www.topuniversities.com" + QSframe.url
# We now have a DataFrame with the information for the whole ranking;
# we only keep the data for the top 200 schools.
# We also keep the full frame to use in exercise 2 to look up the region of every country
QS = QSframe.head(200).copy()
# We create a new DataFrame to apply our changes
QSdf = pd.DataFrame()
# Loop to update the region and the rank for every school;
# we use the counter i as the rank to avoid ties in the ranking
for i, url in enumerate(QS.url):
    data = Get_data(url)
    data['Rank'] = i + 1
    data['Region'] = QS.loc[i, 'region']
    QSdf = QSdf.append(data)
# Put the final DataFrame in the variable QS
QS = QSdf.copy()
# Display the whole DataFrame with all the relevant information
print('DataFrame of the ranking from topuniversities.com:')
display(QS)
In [5]:
# Convert the columns we need into numeric types so we can apply arithmetic operations to them
QS.Total_Faculty = pd.to_numeric(QS.Total_Faculty, errors='coerce')
QS.Total_Students = pd.to_numeric(QS.Total_Students, errors='coerce')
QS.Inter_Faculty = pd.to_numeric(QS.Inter_Faculty, errors='coerce')
QS.Inter_Students = pd.to_numeric(QS.Inter_Students, errors='coerce')
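Because Get_data fills missing fields with the string 'Unknown', `errors='coerce'` turns those entries into NaN, so the numeric operations below simply skip them. A minimal sketch of that behaviour on toy data (the values are illustrative):

```python
import numpy as np
import pandas as pd

# A toy column mixing real counts with the 'Unknown' placeholder used by Get_data
counts = pd.Series([1500, 'Unknown', 320])
converted = pd.to_numeric(counts, errors='coerce')

print(converted.tolist())  # the 'Unknown' placeholder becomes NaN
```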
We define a function for each question. We will reuse them on the second DataFrame, acquired from another website.
In [6]:
# Function answering question (a). It will be used for both exercises 1 and 2
def Question_a(Ranking):
    Ranking = Ranking.reset_index()
    # Calculate the ratio between faculty members and students
    # by dividing the Total_Faculty column by the number of students
    Ranking['Faculty_over_Students'] = Ranking.Total_Faculty.div(Ranking.Total_Students, axis=0)
    Ranking = Ranking.set_index('Faculty_over_Students')
    Ranking = Ranking.sort_index(ascending=False)
    # Show the 10 best universities in terms of faculty members over students ratio
    print('10 best universities in terms of faculty members over students ratio:')
    display(Ranking[['Name']].head(10))
    # Plot the answer to question (a)
    print('Plot:')
    fac_stud_plot = Ranking.head(10).reset_index().plot(x='Name', y='Faculty_over_Students', kind='bar')
    fac_stud_plot.set_title('10 best universities in terms of faculty members over students ratio')
    fac_stud_plot.set_ylabel('Faculty members / students')
    fac_stud_plot.set_xlabel('Universities')
    plt.show()
    return Ranking
In [7]:
# Apply the function on the first dataframe
QS = Question_a(QS)
In [8]:
def Question_b(Ranking):
    Ranking = Ranking.reset_index()
    # Calculate the ratio of international students
    # by dividing the number of international students by the total number of students
    Ranking['Inter_Students_over_Total'] = Ranking.Inter_Students.div(Ranking.Total_Students, axis=0)
    Ranking = Ranking.set_index('Inter_Students_over_Total')
    Ranking = Ranking.sort_index(ascending=False)
    display(Ranking[['Name']].head(10))
    # Plot the answer to question (b)
    print('Plot:')
    interstud_total_plot = Ranking.head(10).reset_index().plot(x='Name', y='Inter_Students_over_Total', kind='bar')
    interstud_total_plot.set_title('10 best universities in terms of international students over total students ratio')
    interstud_total_plot.set_ylabel('International students / students')
    interstud_total_plot.set_xlabel('Universities')
    plt.show()
    return Ranking
In [9]:
QS = Question_b(QS)
In order to aggregate the results by country and region, we will use the groupby function.
C. By Country
1. Ratio between faculty members and students
2. Ratio of international students
D. By Region
1. Ratio between faculty members and students
2. Ratio of international students
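The aggregation itself is a plain groupby followed by a mean; a small sketch of the pattern on toy data (the countries and ratios are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    'Country': ['Switzerland', 'Switzerland', 'United Kingdom'],
    'Faculty_over_Students': [0.10, 0.20, 0.12],
})
# Mean ratio per country, best countries first
per_country = (toy.groupby('Country')['Faculty_over_Students']
                  .mean()
                  .sort_values(ascending=False))
print(per_country)  # Switzerland: 0.15, United Kingdom: 0.12
```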
In [10]:
def Region_and_Country_Stat(Ranking):
    Ranking = Ranking.reset_index()
    C1 = pd.DataFrame(Ranking.groupby('Country').Faculty_over_Students.mean())
    C2 = pd.DataFrame(Ranking.groupby('Country').Inter_Students_over_Total.mean())
    D1 = pd.DataFrame(Ranking.groupby('Region').Faculty_over_Students.mean())
    D2 = pd.DataFrame(Ranking.groupby('Region').Inter_Students_over_Total.mean())
    C1 = C1.sort_values('Faculty_over_Students', ascending=False).head(10)
    print('Question C1')
    display(C1)
    # Plot the answer to question C1
    print('Plot:')
    C1_plot = C1.reset_index().plot(x='Country', y='Faculty_over_Students', kind='bar')
    C1_plot.set_title('10 best countries in terms of faculty members over students ratio')
    C1_plot.set_ylabel('Faculty members / students')
    C1_plot.set_xlabel('Countries')
    plt.show()
    C2 = C2.sort_values('Inter_Students_over_Total', ascending=False).head(10)
    print('Question C2')
    display(C2)
    # Plot the answer to question C2
    print('Plot:')
    C2_plot = C2.reset_index().plot(x='Country', y='Inter_Students_over_Total', kind='bar')
    C2_plot.set_title('10 best countries in terms of international students over total students ratio')
    C2_plot.set_ylabel('International students / students')
    C2_plot.set_xlabel('Countries')
    plt.show()
    D1 = D1.sort_values('Faculty_over_Students', ascending=False)
    print('Question D1')
    display(D1)
    # Plot the answer to question D1
    print('Plot:')
    D1_plot = D1.reset_index().plot(x='Region', y='Faculty_over_Students', kind='bar')
    D1_plot.set_title('Regions ranked by faculty members over students ratio')
    D1_plot.set_ylabel('Faculty members / students')
    D1_plot.set_xlabel('Region')
    plt.show()
    D2 = D2.sort_values('Inter_Students_over_Total', ascending=False)
    print('Question D2')
    display(D2)
    # Plot the answer to question D2
    print('Plot:')
    D2_plot = D2.reset_index().plot(x='Region', y='Inter_Students_over_Total', kind='bar')
    D2_plot.set_title('Regions ranked by international students over total students ratio')
    D2_plot.set_ylabel('International students / students')
    D2_plot.set_xlabel('Region')
    plt.show()
    return
In [11]:
#display(QS)
Region_and_Country_Stat(QS)
DISCUSSION OF THE RESULTS:
We observe that universities with a high number of faculty members are usually high in the ranking. We also observe that our dear beloved school, EPFL, has almost the highest ratio of international students, just behind the London School of Economics. Oceania is the region with the highest ratio of international students, followed by the European countries, and the United Kingdom is the country with the highest ratio of international students. The North American universities are the ones with the most faculty members.
(2) Obtain the 200 top-ranking universities in www.timeshighereducation.com (ranking 2018). Repeat the analysis of the previous point and discuss briefly what you observed.
In [12]:
# As in question 1, we used Postman to request the HTML file, then searched for <tbody>; it was empty,
# so we looked for the JSON/text documents that the page's JavaScript (jQuery/Ajax) loads asynchronously
r = requests.get('https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2018_limit0_369a9045a203e176392b9fb8f8c1cb2a.json')
req = r.json()
data = req['data']
THframe = pd.DataFrame.from_dict(data)
THframe = THframe.head(200)
# copy THframe for exercice 5
TH5 = THframe.copy()
display(TH5)
In [13]:
# Create the three columns: total students, international students and total faculty members.
# A bit of data cleaning: we must remove the "," and "%" and make sure the numbers are not strings.
# We also compute the number of international students from the percentage,
# and the number of faculty members from the student/staff ratio
THframe['Total_Students'] = pd.to_numeric(THframe.stats_number_students.str.replace(',', ''), errors='coerce')
THframe['Inter_Students'] = round(pd.to_numeric(THframe.stats_pc_intl_students.str.replace('%', ''), errors='coerce').div(100) * THframe.Total_Students)
THframe['Total_Faculty'] = round(THframe.Total_Students.div(pd.to_numeric(THframe.stats_student_staff_ratio, errors='coerce')))
# Overwrite the rank to remove ties
THframe['Rank'] = THframe.index.values + 1
# Because we will reuse the functions from question 1, the column names must be the same, which is why we rename them.
# We also keep only the meaningful columns
THframe['Country'] = THframe['location']
THframe['Name'] = THframe['name']
THframe = THframe[['Country', 'Name', 'Rank', 'Total_Students', 'Inter_Students', 'Total_Faculty']]
# We add a Region column to our DataFrame and fill it using the DataFrame from question 1
THframe['Region'] = 'N/A'
# Resolve the 'N/A' regions.
# Because TH does not provide the region directly,
# we look it up in QS by country, so we can map every country to its region.
# We need all of the QS data so we don't miss any country's region
whole_QS = QSframe
whole_QS['Country'] = whole_QS['country']
whole_QS['Region'] = whole_QS['region']
# We manually found the three exceptions where the country names did not match
whole_QS['Country'] = whole_QS['Country'].str.replace('Hong Kong S.A.R., China', 'Hong Kong')
whole_QS['Country'] = whole_QS['Country'].str.replace("Russia", "Russian Federation")
# QS does not have Luxembourg on its list, so as an easy fix we reuse the Belarus row;
# this is valid since Belarus is not in TH and we only want the region
whole_QS['Country'] = whole_QS['Country'].str.replace("Belarus", "Luxembourg")
# Then we build a dictionary from the QS data
loc_to_reg = whole_QS[['Country', 'Region']]
loc_to_reg = set(loc_to_reg.apply(lambda x: '{}_{}'.format(x['Country'], x['Region']), axis=1).values)
loc_to_reg = {x.split('_')[0]: x.split('_')[1] for x in loc_to_reg}
loc_to_reg = defaultdict(lambda: 'N/A', loc_to_reg)
def resolve_uni(x):
    x['Region'] = loc_to_reg[x['Country']]
    return x
THframe = THframe.apply(resolve_uni, axis=1)
# Now that the Region column is filled in, we keep a copy of the cleaned frame
TH = THframe.copy()
print('DataFrame of the ranking from timeshighereducation.com:')
display(TH)
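The defaultdict built above guarantees an 'N/A' fallback for any country missing from QS; a minimal sketch of that behaviour (the sample countries are illustrative):

```python
from collections import defaultdict

# Country -> region mapping with an 'N/A' fallback for unknown keys
loc_to_reg = defaultdict(lambda: 'N/A', {'Switzerland': 'Europe', 'Japan': 'Asia'})

print(loc_to_reg['Switzerland'])  # Europe
print(loc_to_reg['Atlantis'])     # N/A: a missing key falls back to the default
```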
Answering question (a), the faculty members over students ratio:
In [14]:
# We use the same function as in question 1
THframe = Question_a(THframe)
Answering question (b), the international students over total ratio:
In [15]:
# We use the same function as in question 1
THframe = Question_b(THframe)
In [16]:
# We use the same function as in question 1
Region_and_Country_Stat(THframe)
DISCUSSION OF THE RESULTS:
Vanderbilt University, in the USA, is the university with the highest ratio of faculty members, whereas Denmark is the country with the highest ratio of faculty members. North America is still the region with the highest ratio of faculty members. Once again the London School of Economics has the highest ratio of international students, far above the other schools and universities. This time Luxembourg is the country with the highest ratio of international students, which is quite logical for such a small country. And once again Oceania is the region with the highest ratio of international students.
Merge the two DataFrames created in questions 1 and 2 using university names. Match universities' names as well as you can, and explain your strategy. Keep track of the original position in both rankings.
Our strategy for matching names is to use N-grams: each university name is split into overlapping strings of length N. By counting how many N-grams two names have in common, we can rate how similar two universities' names are. This handles most cases very well, but has issues when universities are from the same city, like MIT and the University of Massachusetts. Therefore we filter out some stopwords so that the top universities merge correctly. A good number of universities will be filtered out, because some appear in only one list and some have names so different that we cannot match them. We assume it is fine to analyse the 140+ universities that remain from here on out. This is how we understood the task.
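The idea can be sketched on two spellings of the same university: split each cleaned name into overlapping character n-grams and score the overlap against the smaller set. This is a self-contained sketch with n = 3 and made-up names; the actual matching below uses the N_Gram function with n = 5:

```python
def ngrams(text, n=3):
    """Return the set of overlapping character n-grams of text, padded with spaces."""
    pad = ' ' * (n - 1)
    text = pad + text + pad
    return {text[i:i + n] for i in range(len(text) - (n - 1))}

def similarity(a, b, n=3):
    """Shared n-grams divided by the size of the smaller n-gram set."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    return len(grams_a & grams_b) / min(len(grams_a), len(grams_b))

# Two spellings of the same institution score high...
print(similarity('StanfordUniversity', 'Stanford'))  # 0.8
# ...while unrelated names score low
print(similarity('Harvard', 'Stanford'))             # ~0.22
```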
In [17]:
# To merge the data on university name, some cleaning is required before building the n-grams,
# in order to reduce false-positive matches
def Cleaning_for_merging(Ranking):
    Ranking = Ranking.reset_index()
    RankName = Ranking['Name']
    stopWords = ['North', ' ', '(', ')']
    for word in stopWords:
        RankName = RankName.str.replace(word, '')
    return RankName
In [18]:
THframe = THframe.reset_index()
THframe = THframe.set_index('Rank')
THframe = THframe.sort_index()
THname = Cleaning_for_merging(THframe)
QSname = Cleaning_for_merging(QSdf)
Unchanged_QS_name = QSdf['Name']
Unchanged_TH_name = THframe['Name']
To match names when merging the two DataFrames, we will compare overlapping character n-grams of the cleaned names.
In [19]:
# An Ngram implementation to rate the similarity of two words by parsing them up in N long pieces
def N_Gram(N,text):
NList = set() # start with an empty list
if N> 1:
space = " " * (N-1) # add N - 1 spaces
text = space + text + space # add both in front and back
# append the slices [i:i+N] to NList
for i in range( len(text) - (N - 1) ):
NList.add(text[i:i+N])
return NList
In [20]:
# Match the names of the universities across the two DataFrames
# Initialize the result DataFrame
Matched = pd.DataFrame(columns=['rank_TH', 'Actual_TH', 'Name_1_(TH)', 'Name_2_(QS)', 'Actual_QS', 'rank_QS', 'Matching_%'])
i = 0
count = 0
# Iterate through all the names from TH
while i < len(THname):
    # Get the set of 5-grams for the TH name
    nList1 = N_Gram(5, THname.iloc[i])
    j = 0
    list_of_possible_matchings = np.zeros((len(QSname)))
    # Iterate through all the names from QS
    while j < len(QSname):
        Counter = 0
        # Get the set of 5-grams for the QS name
        nList2 = N_Gram(5, QSname.iloc[j])
        matching = nList1.intersection(nList2)
        if matching is not None:
            Counter = len(matching)
        # Estimate how good the match is
        MatchingPerc = Counter / min(len(nList1), len(nList2))
        list_of_possible_matchings[j] = MatchingPerc
        j += 1
    # Decide which match is optimal
    bestIndex = np.argmax(list_of_possible_matchings)
    MatchingPerc = np.max(list_of_possible_matchings)
    Matched.loc[i] = [i + 1, Unchanged_TH_name.iloc[i], THname.iloc[i], QSname.iloc[bestIndex],
                      Unchanged_QS_name.iloc[bestIndex], bestIndex + 1, MatchingPerc]
    i += 1
display(Matched.sort_values('Matching_%', ascending=False).head(160))
# Threshold on how closely the names must match
percentMatchKeep = 0.8
Matched = Matched[Matched['Matching_%'] > percentMatchKeep]
print(Matched.shape)
In [ ]:
# QS and TH DataFrames that we will merge
QS_merge = QSdf
TH_merge = THframe
# Make sure QS_merge is sorted by rank
QS_merge = QS_merge.reset_index(drop=True)
QS_merge = QS_merge.sort_values('Rank')
# Sort Matched by TH rank
Matched = Matched.sort_values('rank_TH')
# Rename the QS_merge universities so they are identical to the ones we matched in the TH frame
for index, matched_TH_name in enumerate(Matched.loc[:, 'Actual_TH']):
    TH_rank = index + 1
    QS_rank = Matched.loc[:, 'rank_QS'].iloc[index]
    QS_merge.iloc[int(QS_rank) - 1, QS_merge.columns.get_loc('Name')] = matched_TH_name
# Merge on names; an "inner" join simultaneously gets rid of the universities
# that appear in only one of the two lists, keeping just the intersection
merged_frame = pd.merge(TH_merge, QS_merge, on='Name', how='inner', suffixes=('_TH', '_QS'))
# The index is shifted by +1 so that it matches the TH ranking
merged_frame['index'] += 1
# Rename 'index' -> 'Rank_TH' for clarity, and 'Rank' -> 'Rank_QS'
merged_frame = merged_frame.rename(columns={'index': 'Rank_TH', 'Rank': 'Rank_QS', 'Country_TH': 'Country'})
# Put the rank on the left of the DataFrame.
# We drop the duplicated country and region columns, since they appear twice after the merge
merged_frame = merged_frame.set_index('Rank_TH')
merged_frame['Region'] = merged_frame['Region_TH']
merged_frame = merged_frame[['Rank_QS', 'Name', 'Country', 'Region', 'Total_Students_TH', 'Inter_Students_TH',
                             'Total_Faculty_TH', 'Inter_Faculty', 'Inter_Students_QS', 'Total_Faculty_QS', 'Total_Students_QS']]
In [22]:
display(merged_frame)
Find useful insights in the data by performing an exploratory analysis. Can you find a strong correlation between any pair of variables in the dataset you just created? Example: when a university is strong in its international dimension, can you observe a consistency both for students and faculty members?
In [23]:
# Drop the rows of the merged DataFrame that contain unknown data
merged_frame = merged_frame[merged_frame.Total_Students_QS != 'Unknown']
# Plot the total number of students of every university, QS against TH.
# We also plot a red y = x reference line
students = merged_frame.plot(kind='scatter', x='Total_Students_QS', y='Total_Students_TH')
x = np.linspace(0, 80000, 80000)
plt.plot(x, x, c='r')
plt.show()
Faculty = merged_frame.plot(kind='scatter', x='Total_Faculty_QS', y='Total_Faculty_TH')
x = np.linspace(0, 8000, 8000)
plt.plot(x, x, c='r')
plt.show()
We plotted two things.
First, a scatter plot with the total number of students from one website on one axis and from the other website on the other axis. On the same plot we drew a red line, y = x, which is where all the points should lie if the data from both websites were identical. We repeated this operation for the number of faculty members.
In the first plot we observe a strong correlation: the data are approximately the same.
In the second plot we clearly see that the number of faculty members from the QS website is well above the number from the other website.
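The visual impression can be quantified with a Pearson correlation coefficient; a sketch on toy figures (the student totals are illustrative, not taken from the rankings):

```python
import numpy as np

# Toy student totals as reported by two sources for the same four universities
qs_students = np.array([21000, 15000, 38000, 11000])
th_students = np.array([20500, 15400, 37000, 11300])

r = np.corrcoef(qs_students, th_students)[0, 1]
print(round(r, 3))  # close to 1 when the two sources agree
```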
In [24]:
merged_frame['International_Students_mean'] = np.mean([merged_frame.Inter_Students_QS,merged_frame.Inter_Students_TH],axis=0)
merged_frame['Total_Students_mean'] = np.mean([merged_frame.Total_Students_QS,merged_frame.Total_Students_TH],axis=0)
merged_frame['Total_Faculty_mean'] = np.mean([merged_frame.Total_Faculty_QS,merged_frame.Total_Faculty_TH],axis=0)
x = merged_frame['Inter_Students_QS'].div(merged_frame['Total_Students_QS'])
y = merged_frame.Inter_Faculty.div(merged_frame['Total_Faculty_QS'])
plt.scatter(x, y)
plt.title('International faculty / total faculty versus international students / total students')
plt.xlabel('International students / total students (QS)')
plt.ylabel('International faculty / total faculty members')
line = np.linspace(0, 0.8, 1000)
plt.plot(line, line, c='r')
plt.show()
Observation: it is possible that there is a positive correlation between the ratio of international faculty members and the ratio of international students at a university.
In [25]:
x = merged_frame.Total_Faculty_mean.div(merged_frame.Total_Students_mean)
y = merged_frame.International_Students_mean.div(merged_frame.Total_Students_mean)
plt.scatter(x, y)
plt.title('Faculty/student ratio versus international student/student ratio')
plt.xlabel('Faculty members / total students')
plt.ylabel('International students / students')
plt.show()
Observation: these results are inconclusive. We do not observe a clear pattern between the two ratios; there does not seem to be a linear correlation between faculty members and international students. There might be a negative correlation, but we cannot conclude that from this graph.
In order to establish which university is the best, we consider the scores in the different categories that were given in the rankings. The QS ranking has the following items (they will be named QS(number) afterwards):
1 - Overall
2 - Academic reputation
3 - Citations per faculty
4 - Employer reputation
5 - Faculty/Student
6 - International students
7 - International faculty
The Times Higher ranking has (they will be named TH(number) afterwards):
1 - Overall
2 - Teaching
3 - Research
4 - Citations
5 - Industry income
6 - International outlook
In order to be able to make a scoring system that considers both rankings, we made the following assumptions:
- QS(1) & TH(1) are removed because we want to make another overall ranking, it would make no sense to take them into account.
- The average of TH(2) and TH(3) will correspond to QS(2), because to us teaching and research taken together amount to academic reputation.
- QS(3) = TH(4)
- QS(5) corresponds to the faculty/student ratio that we computed in question 2.
- QS(4) and TH(5) are removed because they have no equivalent in the other ranking.
- QS(6) and QS(7) together form TH(6), because international students and international faculty are both part of the international outlook.
In the end, our criteria are:
1 - Academic reputation
2 - Citations
3 - Faculty over student ratio
4 - International outlook
When looking at the merged rankings, one can notice that in the top 10, apart from number 7 (which is UCL for QS and Princeton University for Times Higher), both rankings share the same universities at the top, even though they are not in the same order. It seemed highly unlikely that a university which is not in the top 10 of both rankings could end up being the best university after merging. This is why we removed the two number-7 entries and decided to work with only 9 universities.
In [26]:
# Remove the number 7 and work with the rest, putting the scores in a DataFrame
THscores = pd.DataFrame()
# Extract the data from TH5. In order to work with a small amount of data, we only use the first 10 universities
TH5 = TH5.head(10)
THscores['Name'] = TH5.aliases
#Taking the average between TH(2) and TH(3):
THscores['Academic_Reputation'] = (pd.to_numeric(TH5.scores_teaching, errors='coerce') +
pd.to_numeric(TH5.scores_research, errors='coerce')).div(2)
THscores['Citations'] = pd.to_numeric(TH5.scores_citations, errors='coerce')
#THscores['Faculty_over_Student'] = THframe['Faculty_over_Students']
THscores['International_Outlook'] = pd.to_numeric(TH5.scores_international_outlook, errors='coerce')
# We use the cleaned TH frame to compute the faculty over students ratio
THscores['Faculty_over_Students'] = TH['Total_Faculty'].head(10).div(TH['Total_Students'].head(10))
display(THscores)
In order to have the same standards, we create a function to normalize the scoring values: we set the maximum of each column to 100 and scale the other values accordingly.
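Concretely, every value is multiplied by 100 and divided by the column maximum, so the best score becomes 100 and the ratios between scores are preserved. A toy check (the values and the helper name normalize_to_100 are illustrative):

```python
def normalize_to_100(values):
    """Scale a list of scores so its maximum becomes 100, preserving ratios."""
    top = max(values)
    return [v * 100 / top for v in values]

print(normalize_to_100([80, 90, 60]))  # [88.88..., 100.0, 66.66...]
```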
In [27]:
def Normalizing(ScoreFrame, Column):
    MaxValue = max(ScoreFrame[Column])
    ScoreFrame[Column] = (ScoreFrame[Column] * 100) / MaxValue
    return ScoreFrame
In [28]:
THscores = Normalizing(THscores, 'Academic_Reputation')
THscores = Normalizing(THscores, 'Faculty_over_Students')
THscores = Normalizing(THscores, 'International_Outlook')
display(THscores)
In [29]:
THscores = THscores.drop(6)
THscores = THscores.set_index('Name')
In [30]:
display(THscores)
We believe that the international dimension is less important for a university to be the best. It still matters; however, we think that teaching, citations and the faculty over student ratio are more important than the rest. This is why every item apart from the international dimension is given a factor of 2 in our grading system.
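With a weight of 2 on academic reputation, citations and the faculty ratio, and a weight of 1 on international outlook, the grade is a weighted mean over a total weight of 7. A worked toy example (the scores and the helper name grade are illustrative):

```python
def grade(academic, citations, outlook, faculty_ratio):
    """Weighted mean: double weight on everything except international outlook."""
    return (2 * academic + 2 * citations + outlook + 2 * faculty_ratio) / 7

print(grade(90, 80, 70, 100))  # (180 + 160 + 70 + 200) / 7 ≈ 87.14
```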
In [31]:
def Grade(Score):
    Score['Grade'] = (2 * Score['Academic_Reputation'] + 2 * Score['Citations'] + Score['International_Outlook'] +
                      2 * Score['Faculty_over_Students']) / 7
    return Score
In [32]:
THscores = Grade(THscores)
display(THscores.sort_values('Grade', ascending=False))
Let's work with QS now. We will use a function similar to Get_data from question 1, but extracting only the items we need for the score.
In [33]:
def Get_score_data(HTML):
    # Use BeautifulSoup to extract the data from the fetched HTML
    Page_Soup = BeautifulSoup(requests.get(HTML).text, 'html.parser')
    name = Page_Soup.find('h1').text
    # We extract the numbers, dropping the commas and the surrounding text
    Array = re.findall(r'[+-]?[0-9.]+', Page_Soup.find('ul', class_='score').text)
    Score = pd.DataFrame({'Name': [name], 'Academic_Reputation': Array[1], 'Citation_per_Faculty': Array[2],
                          'Inter_Faculty': Array[5], 'InterStudents': Array[6]})
    return Score
In [34]:
#Using the same process as in question 1, but with a different function
r = requests.get('https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051.txt?_=1508507889334')
req = r.json()
data = req['data']
QSframe = pd.DataFrame.from_dict(data)
QSframe.url = "https://www.topuniversities.com" + QSframe.url
QS = QSframe.head(10)
QSscores = pd.DataFrame()
for i, url in enumerate(QS.url):
    QSscores = QSscores.append(Get_score_data(url))
In [35]:
#Reindexing and getting rid of the data we do not want to work with.
QSscores = QSscores.set_index('Name')
QSscores = QSscores.drop('UCL (University College London)')
In [36]:
display(QSscores)
In [ ]:
QSscores['International_Outlook'] = (pd.to_numeric(QSscores['InterStudents']) + pd.to_numeric(QSscores['Inter_Faculty'])) / 2
QSscores['Citations'] = pd.to_numeric(QSscores['Citation_per_Faculty'])
QSscores['Academic_Reputation'] = pd.to_numeric(QSscores['Academic_Reputation'])
# We reset the index of the QS frame and sort it by rank to compute the faculty over students ratio
QS = QSdf
QS = QS.reset_index()
QS = QS.sort_values('Rank')
# Create an empty column to hold the results
QSscores['Faculty_over_Students'] = np.nan
i = 0
while i < len(QSscores['Academic_Reputation']):
    QSscores['Faculty_over_Students'][i] = QS['Total_Faculty'][i] / QS['Total_Students'][i]
    i += 1
In [38]:
display(QSscores)
In [39]:
# Delete the columns that are now duplicated
QSscores = QSscores.drop(['Citation_per_Faculty', 'InterStudents' ,'Inter_Faculty'] ,axis=1)
One can notice that the "Citations" and "Faculty over Students" columns aren't normalized.
In [40]:
QSscores = Normalizing(QSscores, 'Citations')
QSscores = Normalizing(QSscores, 'Faculty_over_Students')
# We now compute the grade
QSscores = Grade(QSscores)
# In order to find the best university, we put the universities in the same order in the two DataFrames.
# To do so, we make sure that their names are identical.
QSscores = QSscores.rename(index={'Massachusetts Institute of Technology (MIT) ':'Massachusetts Institute of Technology',
'California Institute of Technology (Caltech)' : 'California Institute of Technology caltech',
'ETH Zurich - Swiss Federal Institute of Technology' :'ETH Zurich – Swiss Federal Institute of Technology Zurich'})
Now that both rankings contain the same universities, we can average the two grades. To do so we sort the indexes and then obtain a final DataFrame.
In [41]:
THscores = THscores.sort_index()
QSscores = QSscores.sort_index()
display(QSscores)
display(THscores)
In [42]:
FinalGrading = pd.DataFrame()
FinalGrading['Overall_Score'] = (THscores['Grade'] + QSscores['Grade'])/2
FinalGrading = FinalGrading.sort_values('Overall_Score', ascending=False)
In [43]:
display(FinalGrading)
With respect to our ranking system merging the two previous ones, Caltech comes first, which is not that surprising because it is well ranked in both QS and TH. The weights we applied to each parameter are probably not far from the ones used by the ranking sites, because our results are quite clear-cut.