Homework 2

QUESTION 1

Obtain the 200 top-ranking universities in www.topuniversities.com (ranking 2018). In particular, extract the following fields for each university: name, rank, country and region, number of faculty members (international and total) and number of students (international and total). Some information is not available in the main list and you have to find them in the details page. Store the resulting dataset in a pandas DataFrame and answer the following questions:

Which are the best universities in term of: (a) ratio between faculty members and students? (b) ratio of international students?

Answer the previous question aggregating the data by (c) country and (d) region


In [1]:
#importing the libraries
import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import re
import numpy as np
from collections import defaultdict
from IPython.core.display import display,HTML

In [2]:
#Defining a function that will get the attribute of the universiy using the url of the page of the university

def Get_data( HTML_tag ) :
    #Using BeautifulSoup to extract the data from the given HTML
    Page_Soup = BeautifulSoup(requests.get(HTML_tag).text, 'html.parser')
    name = Page_Soup.find('h1').text
    country = Page_Soup.find('span' , class_='country').text

    #We get the number, delete the comma and then the rest of the text (there was spaces and text we had no use of)
    # get the infos : number of faculty member
    if Page_Soup.find('div', class_='total faculty') is not None:
        faculty_total = re.findall("\d+", Page_Soup.find('div', class_='total faculty').text.replace("," , ""))
        number_faculty_total = int(faculty_total[0])
    else:
        number_faculty_total = 'Unknown'
        
    # get the infos : number of international faculty member
    if Page_Soup.find('div', class_='inter faculty') is not None:
        faculty_international = re.findall("\d+", Page_Soup.find('div', class_='inter faculty').text.replace("," , ""))
        number_faculty_international = int(faculty_international[0])
    else:
        number_faculty_international = 'Unknown'
        
    # get the infos : number of students    
    if Page_Soup.find('div', class_='total student') is not None:
        students_total = re.findall("\d+", Page_Soup.find('div', class_='total student').text.replace("," , ""))
        number_students_total = int(students_total[0])
    else:
        number_students_total = 'Unknown'
    # get the infos : number of students coming from another country        
    if Page_Soup.find('div', class_ = 'total inter') is not None:
        students_international = re.findall("\d+",Page_Soup.find('div', class_ = 'total inter').text.replace("," , ""))
        number_students_international = int(students_international[0])
    else:
        number_students_international = 'Unknown'
    
    # We return the results in the form of a dataframe with all the infos
    Data = pd.DataFrame({'Name': [name], 'Country': [country], 'Total_Faculty': [number_faculty_total], 
                         'Inter_Faculty': [number_faculty_international], 'Total_Students': [number_students_total], 
                         'Inter_Students': [number_students_international],'Rank': np.nan, 'Region':np.nan})
    return Data

In [3]:
# We run a test on the function Get_data
# and display the result
Test = Get_data('https://www.topuniversities.com/universities/university-michigan#wurs')
display(Test)


Country Inter_Faculty Inter_Students Name Rank Region Total_Faculty Total_Students
0 United States 1920 7527 University of Michigan NaN NaN 6809 43147

All the data from the universities are not on the main page. However all the links are on the page. We will create a function to exctract all the URLs of each universitiy.

Extracting data from topuniversities.com


In [4]:
# extract request url from jsonfile and uses them to extract other data via Get_Data()

# Used postman to request htmlfile. then searched for <tbody> it was empty so we searched for json/txt to find where
# the documents that JQuery/Ajax or whatever loads in after a while
r = requests.get('https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051.txt?_=1508507889334')
req = r.json()
data = req['data']
QSframe = pd.DataFrame.from_dict(data)
QSframe.url = "https://www.topuniversities.com" + QSframe.url

# We nw have a dataframe with all the infos for the wholle ranking
# we only keep the datas for the first 200 top school
# We also keep track of the full frame to use it in the exercice 2 to get the region for every country
QS = QSframe.head(200).copy()

# We create a new dataframe do apply our changes
QSdf = pd.DataFrame()
# Create a loop to update the Region and the rank for every School
# we use the compteur i to update the rank to avoid tie in the ranking
for i, url in enumerate(QS.url):
    data = Get_data(url)
    data['Rank'] = i+1
    data['Region']=QS.loc[i,'region']
    QSdf = QSdf.append(data)

#  We Put the final dataframe in the variable QS
QS = QSdf.copy()    
# We display the whole dataframe with all the relavants infos
print('DataFrame of the Ranking from the topuniversities.com :')
display(QS)


DataFrame of the Ranking from the topuniversities.com :
Country Inter_Faculty Inter_Students Name Rank Region Total_Faculty Total_Students
0 United States 1679 3717 Massachusetts Institute of Technology (MIT) 1 North America 2982 11067
0 United States 2042 3611 Stanford University 2 North America 4285 15878
0 United States 1311 5266 Harvard University 3 North America 4350 22429
0 United States 350 647 California Institute of Technology (Caltech) 4 North America 953 2255
0 United Kingdom 2278 6699 University of Cambridge 5 Europe 5490 18770
0 United Kingdom 2964 7353 University of Oxford 6 Europe 6750 19720
0 United Kingdom 2554 14854 UCL (University College London) 7 Europe 6345 31080
0 United Kingdom 2071 8746 Imperial College London 8 Europe 3930 16090
0 United States 635 3379 University of Chicago 9 North America 2449 13557
0 Switzerland 1886 7563 ETH Zurich - Swiss Federal Institute of Techno... 10 Europe 2477 19815
0 Singapore 2993 7251 Nanyang Technological University, Singapore (NTU) 11 Asia 4338 25738
0 Switzerland 1300 5896 Ecole Polytechnique Fédérale de Lausanne (EPFL) 12 Europe 1695 10343
0 United States 246 1793 Princeton University 13 North America 1007 8069
0 United States 970 5411 Cornell University 14 North America 2718 21904
0 Singapore 3086 8917 National University of Singapore (NUS) 15 Asia 5106 32728
0 United States 1708 2469 Yale University 16 North America 4940 12402
0 United States 1061 4105 Johns Hopkins University 17 North America 4462 16146
0 United States 913 8105 Columbia University 18 North America 6189 25045
0 United States 1383 4250 University of Pennsylvania 19 North America 5499 20639
0 Australia 927 5551 The Australian National University 20 Oceania 1600 14442
0 United States 1920 7527 University of Michigan 21 North America 6809 43147
0 United States 226 2864 Duke University 22 North America 2938 15320
0 United Kingdom 1553 10551 The University of Edinburgh 23 Europe 4075 28040
0 United Kingdom 1513 9451 King's College London 24 Europe 3650 23780
0 China 932 4072 Tsinghua University 25 Asia 5506 36300
0 Hong Kong S.A.R., China 2085 8230 The University of Hong Kong 26 Asia 3012 20214
0 United States 1395 6086 University of California, Berkeley (UCB) 27 North America 3321 36703
0 Japan 254 2639 The University of Tokyo 28 Asia 4514 27279
0 United States 1141 3422 Northwestern University 29 North America 4231 18675
0 Hong Kong S.A.R., China 835 2921 The Hong Kong University of Science and Techno... 30 Asia 1150 10375
... ... ... ... ... ... ... ... ...
0 Germany 413 3897 Albert-Ludwigs-Universitaet Freiburg 171 Europe 1966 23214
0 India 4 80 Indian Institute of Technology Delhi (IITD) 172 Asia 466 7477
0 Saudi Arabia 665 989 King Fahd University of Petroleum & Minerals 173 Asia 1062 6040
0 United Kingdom 717 7502 University of Liverpool 174 Europe 2390 22065
0 United States 487 1899 University of Virginia 175 North America 2509 21560
0 Australia 546 7769 University of Technology Sydney 176 Oceania 1328 24207
0 France 137 1132 CentraleSupélec 177 Europe 529 3877
0 United States 504 4597 University of Florida 178 North America 5419 45485
0 India 19 93 Indian Institute of Technology Bombay (IITB) 179 Asia 876 9402
0 Netherlands 299 2235 University of Twente 180 Europe 910 9332
0 Germany 641 3736 University of Göttingen 181 Europe 3678 30402
0 Belgium 515 1853 Vrije Universiteit Brussel (VUB) 182 Europe 1792 9284
0 Brazil 109 966 Universidade Estadual de Campinas (Unicamp) 183 Latin America 1968 26572
0 United States 566 2679 University of Colorado Boulder 184 North America 3685 30219
0 Austria 172 5199 Vienna University of Technology 185 Europe 706 17479
0 United States 488 2805 University of Rochester 186 North America 2569 9636
0 Spain 181 2976 Universidad Autónoma de Madrid 187 Europe 2738 26511
0 Italy 153 4195 Alma Mater Studiorum - University of Bologna 188 Europe 2990 63399
0 United Kingdom 422 4077 University of Reading 189 Europe 1280 13330
0 India Unknown 47 Indian Institute of Science (IISc) Bangalore 190 Asia 423 4071
0 South Africa 379 3325 University of Cape Town 191 Africa 1733 19593
0 Italy 1 48 Scuola Normale Superiore di Pisa 192 Europe 86 532
0 Italy 1 135 Scuola Superiore Sant'Anna Pisa di Studi Unive... 193 Europe 119 789
0 Japan 261 2178 Keio University CEMS MIM 194 Asia 3905 33500
0 Sweden 646 2636 Stockholm University 195 Europe 2154 28281
0 Spain 230 3848 Universitat Autònoma de Barcelona 196 Europe 2187 31986
0 United States 206 4900 Texas A&M University 197 North America 3446 60294
0 Mexico 821 1412 Instituto Tecnológico y de Estudios Superiores... 198 Latin America 1822 13376
0 Netherlands 502 8234 Maastricht University 199 Europe 1277 16385
0 Chile 127 2134 Universidad de Chile 200 Latin America 2256 38848

200 rows × 8 columns


In [5]:
#converting the columns we need to have as integers to be able to apply some operations upon them
QS.Total_Faculty = pd.to_numeric(QS.Total_Faculty, errors='coerce')
QS.Total_Students = pd.to_numeric(QS.Total_Students, errors='coerce')
QS.Inter_Faculty  = pd.to_numeric(QS.Inter_Faculty, errors='coerce')
QS.Inter_Students = pd.to_numeric(QS.Inter_Students, errors='coerce')

We define Fonctions for every Questions. We will use them again for the second dataframe acquired on another website


In [6]:
# Method resolving the Question (a). It will be use for both exercice 1 and 2
def Question_a(Ranking) :
    Ranking = Ranking.reset_index()
    # We divide the columns total_faculty by the number of students
    # Calculating the ratio between faculty members and students
    Ranking['Faculty_over_Students'] = Ranking.Total_Faculty.div(Ranking.Total_Students, axis= 0)
    Ranking = Ranking.set_index('Faculty_over_Students')
    Ranking = Ranking.sort_index(ascending=False)
    #Showing the 10 best universities in terms of Faculty members over Students Ratio
    print('10 best universities in terms of Faculty members over Students Ratio:')
    display((Ranking[['Name']].head(10)))
    #plot
    # Plotting answer to Question a
    print('Plot:')
    fac_stud_plot = Ranking.head(10).reset_index().plot(x='Name', y='Faculty_over_Students', kind='bar')
    fac_stud_plot.set_title('10 best universities in terms of Faculty members over Students Ratio:')
    fac_stud_plot.set_ylabel('Faculty members / students')
    fac_stud_plot.set_xlabel('Universities')
    plt.show(fac_stud_plot)
    return Ranking

In [7]:
# Apply the function on the first dataframe
QS = Question_a(QS)


10 best universities in terms of Faculty members over Students Ratio:
Name
Faculty_over_Students
0.422616 California Institute of Technology (Caltech)
0.398323 Yale University
0.342292 University of Oxford
0.292488 University of Cambridge
0.276353 Johns Hopkins University
0.269870 Stanford University
0.269450 Massachusetts Institute of Technology (MIT)
0.266604 University of Rochester
0.266437 University of Pennsylvania
0.247115 Columbia University
Plot:

In [8]:
def Question_b(Ranking) :
    Ranking = Ranking.reset_index()
    # we calculate the  ratio of international students
    # by dividing the number of international student by the number of student
    Ranking['Inter_Students_over_Total'] = Ranking.Inter_Students.div(Ranking.Total_Students, axis= 0)
    Ranking = Ranking.set_index('Inter_Students_over_Total')
    Ranking = Ranking.sort_index(ascending=False)
    display(Ranking[['Name']].head(10))
    # Plotting answer to Question b
    print('Plot:')
    interstud_total_plot = Ranking.head(10).reset_index().plot(x='Name', y='Inter_Students_over_Total', kind='bar')
    interstud_total_plot.set_title('10 best universities in terms of International students over Total students ratio:')
    interstud_total_plot.set_ylabel(' International students / students')
    interstud_total_plot.set_xlabel('Universities')
    plt.show(interstud_total_plot)
    return Ranking

In [9]:
QS = Question_b(QS)


Name
Inter_Students_over_Total
0.691393 London School of Economics and Political Scien...
0.570047 Ecole Polytechnique Fédérale de Lausanne (EPFL)
0.543567 Imperial College London
0.502533 Maastricht University
0.478062 Carnegie Mellon University
0.477928 UCL (University College London)
0.457955 University of St Andrews
0.427434 The University of Melbourne
0.421816 Queen Mary University of London
0.407144 The University of Hong Kong
Plot:

In order to aggregate the results by country, we will use the groupBy function.

C. By Country

1. Ratio between faculty members and students
2. Ratio of international students

D. By Region

1. Ratio between faculty members and students
2. Ratio of international students

In [10]:
def Region_and_Country_Stat(Ranking) :
    Ranking = Ranking.reset_index()
    C1 = pd.DataFrame(Ranking.groupby('Country').Faculty_over_Students.mean())
    C2 = pd.DataFrame(Ranking.groupby('Country').Inter_Students_over_Total.mean())

    D1 = pd.DataFrame(Ranking.groupby('Region').Faculty_over_Students.mean())
    D2 = pd.DataFrame(Ranking.groupby('Region').Inter_Students_over_Total.mean())

    C1 = C1.sort_values('Faculty_over_Students', ascending=False).head(10)
    print('Question C1')
    display(C1)
    # Plotting answer to Question C1
    print('Plot:')
    C1_plot = C1.reset_index().plot(x='Country', y='Faculty_over_Students', kind='bar')
    C1_plot.set_title('10 best countries in terms of Faculty members over Students Ratio:')
    C1_plot.set_ylabel('Faculty members / students')
    C1_plot.set_xlabel('Countries')
    plt.show(C1_plot)
    
    C2 = C2.sort_values('Inter_Students_over_Total', ascending=False).head(10)
    print('Question C2')
    display(C2)
    # Plotting answer to Question C2
    print('Plot:')
    C2_plot = C2.reset_index().plot(x='Country', y='Inter_Students_over_Total', kind='bar')
    C2_plot.set_title('10 best countries in terms of International students over Total students ratio:')
    C2_plot.set_ylabel('International students / students')
    C2_plot.set_xlabel('Countries')
    plt.show(C2_plot)
    
    D1 = D1.sort_values('Faculty_over_Students', ascending=False)
    print('Question D1')
    display(D1)
    # Plotting answer to Question D1
    print('Plot:')
    D1_plot = D1.reset_index().plot(x='Region', y='Faculty_over_Students', kind='bar')
    D1_plot.set_title('10 best regions in terms of Faculty members over Students Ratio:')
    D1_plot.set_ylabel('Faculty members / students')
    D1_plot.set_xlabel('Region')
    plt.show(D1_plot)
    
    D2 = D2.sort_values('Inter_Students_over_Total', ascending=False)
    print('Question D2')
    display(D2)
    # Plotting answer to Question D1
    print('Plot:')
    D2_plot = D2.reset_index().plot(x='Region', y='Inter_Students_over_Total', kind='bar')
    D2_plot.set_title('10 best regions in terms of International students over Total students ratio:')
    D2_plot.set_ylabel('International students / students')
    D2_plot.set_xlabel('Region')
    plt.show(D2_plot)
    return

In [11]:
#display(QS)
Region_and_Country_Stat(QS)


Question C1
Faculty_over_Students
Country
Russia 0.221910
Denmark 0.186580
Saudi Arabia 0.175828
Singapore 0.162279
Japan 0.155840
Malaysia 0.153893
United States 0.151151
South Korea 0.149356
France 0.144006
Israel 0.136047
Plot:
Question C2
Inter_Students_over_Total
Country
United Kingdom 0.351308
Australia 0.346878
Switzerland 0.313816
Hong Kong S.A.R., China 0.312148
Austria 0.306095
Singapore 0.277091
Canada 0.252604
New Zealand 0.248971
Netherlands 0.245456
Ireland 0.241791
Plot:
Question D1
Faculty_over_Students
Region
North America 0.144831
Asia 0.134673
Europe 0.120003
Latin America 0.096779
Africa 0.088450
Oceania 0.075003
Plot:
Question D2
Inter_Students_over_Total
Region
Oceania 0.329077
Europe 0.245932
North America 0.202437
Africa 0.169703
Asia 0.132394
Latin America 0.071751
Plot:

DISCUSSION OF THE RESULTS:

We observe that the universities with a high number of faculty members usually are high in the ranking. We also observe that our dear beloving school, the EPFL, is alsmost the school with the highest ratio of international Students behind the London School of Economics. Oceania is the Region with the highest ratio of international student followed by the european countries. And the United Kingdom is the country with the most highest ratio of international student The North American universities are the one with the most of faculty members.  

QUESTION 2

(2) Obtain the 200 top-ranking universities in www.timeshighereducation.com (ranking 2018). Repeat the analysis of the previous point and discuss briefly what you observed.

In [12]:
# Used postman to request htmlfile. then searched for <tbody> it was empty so we searched for json/txt to find where
# the documents that JQuery/Ajax or whatever loads in after a while

r = requests.get('https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2018_limit0_369a9045a203e176392b9fb8f8c1cb2a.json')
req = r.json()
data = req['data']
THframe = pd.DataFrame.from_dict(data)
THframe = THframe.head(200)
# copy THframe for exercice 5
TH5 = THframe.copy()
display(TH5)


aliases location member_level name nid rank rank_order record_type scores_citations scores_citations_rank ... scores_research scores_research_rank scores_teaching scores_teaching_rank stats_female_male_ratio stats_number_students stats_pc_intl_students stats_student_staff_ratio subjects_offered url
0 University of Oxford United Kingdom 0 University of Oxford 468 1 10 master_account 99.1 15 ... 99.5 1 86.7 5 46 : 54 20,409 38% 11.2 Archaeology,Art, Performing Arts & Design,Biol... /world-university-rankings/university-oxford
1 University of Cambridge United Kingdom 0 University of Cambridge 470 2 20 master_account 97.5 29 ... 97.8 3 87.8 3 45 : 55 18,389 35% 10.9 Archaeology,Architecture,Art, Performing Arts ... /world-university-rankings/university-cambridge
2 California Institute of Technology caltech United States 0 California Institute of Technology 128779 =3 30 private 99.5 10 ... 97.5 4 90.3 1 31 : 69 2,209 27% 6.5 Architecture,Biological Sciences,Business & Ma... /world-university-rankings/california-institut...
3 Stanford University United States 11 Stanford University 467 =3 40 private 99.9 4 ... 96.7 5 89.1 2 42 : 58 15,845 22% 7.5 Archaeology,Architecture,Art, Performing Arts ... /world-university-rankings/stanford-university
4 Massachusetts Institute of Technology United States 0 Massachusetts Institute of Technology 471 5 50 private 100.0 1 ... 91.9 9 87.3 4 37 : 63 11,177 34% 8.7 Architecture,Art, Performing Arts & Design,Bio... /world-university-rankings/massachusetts-insti...
5 Harvard University United States 0 Harvard University 466 6 60 private 99.7 8 ... 98.4 2 84.2 9 None 20,326 26% 8.9 Agriculture & Forestry,Archaeology,Art, Perfor... /world-university-rankings/harvard-university
6 Princeton University United States 0 Princeton University 469 7 70 private 99.6 9 ... 93.9 6 85.7 7 45 : 55 7,955 24% 8.3 Architecture,Art, Performing Arts & Design,Bio... /world-university-rankings/princeton-university
7 Imperial College London United Kingdom 0 Imperial College London 472 8 80 master_account 96.7 40 ... 88.7 12 81.7 13 37 : 63 15,857 55% 11.4 Biological Sciences,Chemical Engineering,Chemi... /world-university-rankings/imperial-college-lo...
8 University of Chicago United States 0 University of Chicago 473 9 90 private 99.4 12 ... 90.1 10 85.3 8 44 : 56 13,525 25% 6.2 Archaeology,Art, Performing Arts & Design,Biol... /world-university-rankings/university-chicago
9 ETH Zurich – Swiss Federal Institute of Techno... Switzerland 0 ETH Zurich – Swiss Federal Institute of Techno... 479 =10 100 master_account 94.3 60 ... 92.0 8 76.4 21 31 : 69 19,233 38% 14.6 Agriculture & Forestry,Architecture,Biological... /world-university-rankings/eth-zurich-swiss-fe...
10 University of Pennsylvania United States 0 University of Pennsylvania 480 =10 110 private 98.5 22 ... 90.1 11 83.7 10 50 : 50 20,361 20% 6.5 Accounting & Finance,Archaeology,Architecture,... /world-university-rankings/university-pennsylv...
11 Yale University United States 0 Yale University 475 12 120 private 98.4 24 ... 87.0 17 86.7 6 49 : 51 12,155 21% 4.3 Archaeology,Architecture,Art, Performing Arts ... /world-university-rankings/yale-university
12 Johns Hopkins University United States 0 Johns Hopkins University 478 13 130 private 98.4 23 ... 88.1 16 76.1 23 52 : 48 15,498 24% 4.3 Agriculture & Forestry,Archaeology,Architectur... /world-university-rankings/johns-hopkins-unive...
13 Columbia University United States 0 Columbia University 476 14 140 private 98.8 19 ... 83.3 26 82.2 12 None 26,587 32% 6.1 Archaeology,Architecture,Art, Performing Arts ... /world-university-rankings/columbia-university
14 University of California, Los Angeles United States 0 University of California, Los Angeles 477 15 150 public 97.9 27 ... 88.1 15 80.7 14 53 : 47 39,279 17% 9.6 Architecture,Art, Performing Arts & Design,Bio... /world-university-rankings/university-californ...
15 University College London ucl United Kingdom 0 University College London 481 16 160 master_account 94.6 56 ... 88.2 13 74.4 25 56 : 44 30,304 49% 10.5 Archaeology,Architecture,Art, Performing Arts ... /world-university-rankings/university-college-...
16 Duke University United States 0 Duke University 487 17 170 private 98.3 25 ... 80.6 27 80.7 15 49 : 51 15,256 22% 4.5 Art, Performing Arts & Design,Biological Scien... /world-university-rankings/duke-university
17 University of California, Berkeley United States 0 University of California, Berkeley 474 18 180 public 99.8 5 ... 84.5 24 77.4 18 52 : 48 36,182 17% 13.1 Agriculture & Forestry,Archaeology,Architectur... /world-university-rankings/university-californ...
18 Cornell University United States 0 Cornell University 484 19 190 private 97.6 28 ... 86.6 19 76.2 22 49 : 51 21,850 24% 9.8 Agriculture & Forestry,Archaeology,Architectur... /world-university-rankings/cornell-university
19 Northwestern University United States 0 Northwestern University 490 20 200 private 96.9 36 ... 86.7 18 72.6 28 48 : 52 17,466 18% 12.8 Archaeology,Architecture,Art, Performing Arts ... /world-university-rankings/northwestern-univer...
20 University of Michigan United States 0 University of Michigan 482 21 210 public 95.7 47 ... 86.3 20 77.2 20 48 : 52 41,818 16% 8.6 Archaeology,Architecture,Art, Performing Arts ... /world-university-rankings/university-michigan
21 National University of Singapore nus Singapore 6 National University of Singapore 504 =22 220 master_account 81.3 193 ... 88.2 14 77.4 19 51 : 49 30,602 30% 17.0 Architecture,Biological Sciences,Business & Ma... /world-university-rankings/national-university...
22 University of Toronto Canada 0 University of Toronto 483 =22 230 master_account 92.6 75 ... 84.8 23 74.6 24 None 69,427 17% 18.7 Accounting & Finance,Agriculture & Forestry,Ar... /world-university-rankings/university-toronto
23 Carnegie Mellon University United States 0 Carnegie Mellon University 485 24 240 private 99.7 7 ... 83.7 25 65.8 36 39 : 61 12,676 45% 13.5 Accounting & Finance,Archaeology,Architecture,... /world-university-rankings/carnegie-mellon-uni...
24 London School of Economics and Political Science United Kingdom 0 London School of Economics and Political Science 511 =25 250 master_account 94.9 53 ... 72.0 43 71.8 30 52 : 48 10,065 71% 12.2 Accounting & Finance,Business & Management,Eco... /world-university-rankings/london-school-econo...
25 University of Washington United States 0 University of Washington 489 =25 260 public 99.0 17 ... 79.9 28 67.9 34 52 : 48 44,945 16% 11.4 Accounting & Finance,Agriculture & Forestry,Ar... /world-university-rankings/university-washington
26 University of Edinburgh United Kingdom 0 University of Edinburgh 500 =27 270 master_account 97.0 35 ... 74.2 36 66.8 35 59 : 41 26,759 40% 12.5 Accounting & Finance,Archaeology,Architecture,... /world-university-rankings/university-edinburgh
27 New York University United States 0 New York University 508 =27 280 private 96.5 43 ... 77.4 34 73.7 27 56 : 44 43,860 26% 9.6 Accounting & Finance,Archaeology,Art, Performi... /world-university-rankings/new-york-university
28 Peking University China 0 Peking University 514 =27 290 master_account 74.2 274 ... 85.1 22 83.0 11 47 : 53 42,136 16% 8.9 Accounting & Finance,Archaeology,Art, Performi... /world-university-rankings/peking-university
29 Tsinghua University Qinghua University Tsing h... China 6 Tsinghua University 535 30 300 master_account 71.4 306 ... 93.2 7 80.2 16 32 : 68 42,089 9% 13.7 Accounting & Finance,Architecture,Art, Perform... /world-university-rankings/tsinghua-university
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
170 University of Cape Town South Africa 0 University of Cape Town 567 171 1710 master_account 87.0 133 ... 36.2 225 30.5 372 53 : 47 20,775 18% 11.7 Accounting & Finance,Archaeology,Architecture,... /world-university-rankings/university-cape-town
171 Rutgers, the State University of New Jersey United States 0 Rutgers, the State University of New Jersey 546 172 1720 public 79.7 214 ... 45.3 143 43.0 148 52 : 48 43,439 10% 11.1 Accounting & Finance,Agriculture & Forestry,Ar... /world-university-rankings/rutgers-state-unive...
172 KTH Royal Institute of Technology Sweden 0 KTH Royal Institute of Technology 651 =173 1730 master_account 67.0 363 ... 46.3 138 42.3 158 33 : 67 12,951 17% 16.0 Architecture,Biological Sciences,Chemistry,Com... /world-university-rankings/kth-royal-institute...
173 University of Münster Germany 0 University of Münster 756 =173 1740 master_account 81.2 195 ... 42.1 161 40.3 181 53 : 47 42,287 7% 42.6 Chemistry,Communication & Media Studies,Comput... /world-university-rankings/university-munster
174 Université Libre de Bruxelles Belgium 0 Université Libre de Bruxelles 669 =175 1750 master_account 80.4 205 ... 42.2 160 32.4 332 55 : 45 25,343 35% 43.4 Archaeology,Architecture,Art, Performing Arts ... /world-university-rankings/universite-libre-de...
175 Newcastle University United Kingdom 6 Newcastle University 610 =175 1760 master_account 80.6 201 ... 40.8 176 33.4 304 49 : 51 20,952 30% 15.6 Accounting & Finance,Agriculture & Forestry,Ar... /world-university-rankings/newcastle-university
176 University of Liverpool United Kingdom 11 University of Liverpool 646 =177 1770 master_account 86.8 136 ... 34.7 245 32.7 318 54 : 46 19,815 35% 11.7 Accounting & Finance,Archaeology,Architecture,... /world-university-rankings/university-liverpool
177 Zhejiang University ZJU Zhejiang University Ch... China 11 Zhejiang University 815 =177 1780 master_account 45.1 564 ... 63.7 60 57.0 63 22 : 78 50,051 6% 14.1 Accounting & Finance,Agriculture & Forestry,Ar... /world-university-rankings/zhejiang-university
178 University of Luxembourg Luxembourg 6 University of Luxembourg 131698 =179 1790 master_account 91.9 78 ... 32.4 275 26.8 475 50 : 50 4,969 57% 14.6 Accounting & Finance,Biological Sciences,Busin... /world-university-rankings/university-luxembourg
179 University of Twente Netherlands 6 University of Twente 664 =179 1800 master_account 68.5 342 ... 47.0 134 35.5 257 37 : 63 6,697 27% 12.8 Art, Performing Arts & Design,Business & Manag... /world-university-rankings/university-twente
180 Paris-Sud University France 0 Paris-Sud University 867 181 1810 master_account 86.8 138 ... 33.1 265 40.4 180 55 : 45 27,221 16% 14.5 Business & Management,Chemistry,Economics & Ec... /world-university-rankings/paris-sud-university
181 École Normale Supérieure de Lyon France 0 École Normale Supérieure de Lyon 605 =182 1820 master_account 80.9 197 ... 37.8 210 40.1 185 47 : 53 2,221 13% 8.5 Chemistry,Computer Science,Geology, Environmen... /world-university-rankings/ecole-normale-super...
182 Hong Kong Polytechnic University hkpu hong kon... Hong Kong 11 Hong Kong Polytechnic University 723 =182 1830 master_account 67.7 356 ... 48.1 126 39.1 197 51 : 49 21,773 25% 26.9 Accounting & Finance,Art, Performing Arts & De... /world-university-rankings/hong-kong-polytechn...
183 Scuola Normale Superiore di Pisa Italy 0 Scuola Normale Superiore di Pisa 1019 184 1840 master_account 75.4 255 ... 33.3 263 53.7 74 30 : 70 515 7% 5.2 Biological Sciences,Chemistry,Computer Science... /world-university-rankings/scuola-normale-supe...
184 University of Aberdeen United Kingdom 0 University of Aberdeen 615 185 1850 master_account 87.1 131 ... 33.6 258 30.3 377 55 : 45 12,055 36% 13.9 Archaeology,Art, Performing Arts & Design,Biol... /world-university-rankings/university-aberdeen
185 University of Miami United States 0 University of Miami 636 186 1860 private 90.3 91 ... 24.4 418 42.5 157 51 : 49 15,860 16% 5.9 Accounting & Finance,Architecture,Art, Perform... /world-university-rankings/university-miami
186 University of Dundee United Kingdom 11 University of Dundee 640 187 1870 master_account 94.5 57 ... 32.0 284 26.3 491 62 : 38 10,893 24% 14.9 Accounting & Finance,Architecture,Art, Perform... /world-university-rankings/university-dundee
187 University of East Anglia United Kingdom 0 University of East Anglia 609 =188 1880 master_account 90.5 89 ... 31.6 290 29.9 395 60 : 40 13,521 30% 13.6 Accounting & Finance,Archaeology,Biological Sc... /world-university-rankings/university-east-anglia
188 Shanghai Jiao Tong University China 0 Shanghai Jiao Tong University 801 =188 1890 master_account 44.1 572 ... 62.5 67 53.5 76 41 : 59 37,288 6% 12.4 Accounting & Finance,Agriculture & Forestry,Ar... /world-university-rankings/shanghai-jiao-tong-...
189 Aalto University Finland 0 Aalto University 767 190 1900 master_account 79.5 215 ... 35.8 232 37.8 210 34 : 66 12,744 20% 20.2 Accounting & Finance,Architecture,Art, Perform... /world-university-rankings/aalto-university
190 University of Massachusetts United States 0 University of Massachusetts 528 191 1910 master_account 88.8 112 ... 32.2 280 36.3 237 49 : 51 59,230 14% 13.3 Accounting & Finance,Agriculture & Forestry,Ar... /world-university-rankings/university-massachu...
191 University of Auckland New Zealand 11 University of Auckland 637 192 1920 master_account 73.8 282 ... 40.7 177 32.5 327 54 : 46 30,348 29% 18.8 Accounting & Finance,Architecture,Art, Perform... /world-university-rankings/university-auckland
192 Northeastern University United States 0 Northeastern University 681 193 1930 private 91.7 79 ... 26.8 370 35.6 254 49 : 51 18,780 32% 14.0 Accounting & Finance,Architecture,Art, Perform... /world-university-rankings/northeastern-univer...
193 Lomonosov Moscow State University Russian Federation 0 Lomonosov Moscow State University 753 194 1940 master_account 15.8 932 ... 61.6 71 74.2 26 48 : 52 29,236 22% 7.3 Art, Performing Arts & Design,Biological Scien... /world-university-rankings/lomonosov-moscow-st...
194 Tilburg University Netherlands 0 Tilburg University 734 195 1950 master_account 61.3 422 ... 50.1 112 38.4 206 48 : 52 7,738 13% 22.5 Business & Management,Economics & Econometrics... /world-university-rankings/tilburg-university
195 Paris-Sorbonne University – Paris 4 France 0 Paris-Sorbonne University – Paris 4 924 196 1960 master_account 77.1 240 ... 34.3 251 44.6 132 68 : 32 22,042 15% 30.2 Archaeology,Art, Performing Arts & Design,Geog... /world-university-rankings/paris-sorbonne-univ...
196 Royal Holloway, University of London United Kingdom 0 Royal Holloway, University of London 571 197 1970 master_account 85.0 156 ... 27.9 345 33.9 294 59 : 41 8,931 40% 15.4 Art, Performing Arts & Design,Biological Scien... /world-university-rankings/royal-holloway-univ...
197 University of California, Riverside United States 0 University of California, Riverside 607 =198 1980 public 92.3 76 ... 30.9 294 30.0 391 51 : 49 20,740 14% 20.7 Archaeology,Architecture,Art, Performing Arts ... /world-university-rankings/university-californ...
198 University of Gothenburg Sweden 0 University of Gothenburg 676 =198 1990 master_account 87.5 129 ... 39.2 199 26.7 477 None 18,378 13% 10.0 Art, Performing Arts & Design,Biological Scien... /world-university-rankings/university-gothenburg
199 National Taiwan University Taiwan 0 National Taiwan University 618 =198 2000 master_account 53.4 495 ... 54.7 93 50.4 87 40 : 60 31,758 8% 11.5 Accounting & Finance,Agriculture & Forestry,Ar... /world-university-rankings/national-taiwan-uni...

200 rows × 26 columns


In [13]:
#Create the three colomns total students/ International student and Total faculty member
#A bit of data cleaning. We must remove the "=" and "%" and make sure that the numbers are not strings.
# We also calculate the number of Inter_students with the poucentage 
# and the number of member faculty
THframe['Total_Students'] = pd.to_numeric(THframe.stats_number_students.str.replace(',' , ''), errors='coerce') 
THframe['Inter_Students'] = round(pd.to_numeric(THframe.stats_pc_intl_students.str.replace('%' , ''), errors='coerce').div(100) * THframe.Total_Students)
THframe['Total_Faculty'] = round(THframe.Total_Students.div(pd.to_numeric(THframe.stats_student_staff_ratio, errors='coerce')))
#change the rank to delete the Ties
THframe['Rank']= THframe.index.values+1

#Because we will apply the functions from question 1, the columns's name must be the same, thus why we need to change them.
#We also select only the meaningful datas
THframe['Country'] = THframe['location']
THframe['Name'] = THframe['name']
THframe = THframe[['Country', 'Name', 'Rank', 'Total_Students', 'Inter_Students', 'Total_Faculty']]

#We add the column Region to our data frame! We must also fill the column with the data using the DF of question 1
THframe['Region'] = 'N/A'
# Resolve N/A
# Because TH don't provide the information of region directly,
# We will merge it with QS on country so we can extract country and region data
# We need all the QS data so we dont miss any country's region
whole_QS = QSframe
whole_QS['Country']=whole_QS['country']
whole_QS['Region']=whole_QS['region']

# Manually found the three exceptions that did not match in country names.
whole_QS['Country'] = whole_QS['Country'].str.replace('Hong Kong S.A.R., China', 'Hong Kong')
whole_QS['Country'] = whole_QS['Country'].str.replace("Russia", "Russian Federation")
# QS did not have Luxembourg on its list. So easyfix, it is valid since Belarus is not in TH and we just want the region
whole_QS['Country'] = whole_QS['Country'].str.replace("Belarus", "Luxembourg")

# Then we create a dictonary with the data from QS
loc_to_reg = whole_QS[['Country', 'Region']]
loc_to_reg = set(loc_to_reg.apply(lambda x: '{}_{}'.format(x['Country'], x['Region']), axis=1).values)
loc_to_reg = {x.split('_')[0]: x.split('_')[1] for x in loc_to_reg}
loc_to_reg = defaultdict(lambda: 'N/A', loc_to_reg)
def resolve_uni(x):
    x['Region'] = (loc_to_reg[x['Country']])
    return x
THframe = THframe.apply(resolve_uni, axis=1)

# Now that the region columns is updated and clear we can 

TH = THframe.copy()
print('DataFrame of the Ranking from the timeshighereducation.com :')
display(TH)


DataFrame of the Ranking from the timeshighereducation.com :
Country Name Rank Total_Students Inter_Students Total_Faculty Region
0 United Kingdom University of Oxford 1 20409 7755.0 1822.0 Europe
1 United Kingdom University of Cambridge 2 18389 6436.0 1687.0 Europe
2 United States California Institute of Technology 3 2209 596.0 340.0 North America
3 United States Stanford University 4 15845 3486.0 2113.0 North America
4 United States Massachusetts Institute of Technology 5 11177 3800.0 1285.0 North America
5 United States Harvard University 6 20326 5285.0 2284.0 North America
6 United States Princeton University 7 7955 1909.0 958.0 North America
7 United Kingdom Imperial College London 8 15857 8721.0 1391.0 Europe
8 United States University of Chicago 9 13525 3381.0 2181.0 North America
9 Switzerland ETH Zurich – Swiss Federal Institute of Techno... 10 19233 7309.0 1317.0 Europe
10 United States University of Pennsylvania 11 20361 4072.0 3132.0 North America
11 United States Yale University 12 12155 2553.0 2827.0 North America
12 United States Johns Hopkins University 13 15498 3720.0 3604.0 North America
13 United States Columbia University 14 26587 8508.0 4359.0 North America
14 United States University of California, Los Angeles 15 39279 6677.0 4092.0 North America
15 United Kingdom University College London 16 30304 14849.0 2886.0 Europe
16 United States Duke University 17 15256 3356.0 3390.0 North America
17 United States University of California, Berkeley 18 36182 6151.0 2762.0 North America
18 United States Cornell University 19 21850 5244.0 2230.0 North America
19 United States Northwestern University 20 17466 3144.0 1365.0 North America
20 United States University of Michigan 21 41818 6691.0 4863.0 North America
21 Singapore National University of Singapore 22 30602 9181.0 1800.0 Asia
22 Canada University of Toronto 23 69427 11803.0 3713.0 North America
23 United States Carnegie Mellon University 24 12676 5704.0 939.0 North America
24 United Kingdom London School of Economics and Political Science 25 10065 7146.0 825.0 Europe
25 United States University of Washington 26 44945 7191.0 3943.0 North America
26 United Kingdom University of Edinburgh 27 26759 10704.0 2141.0 Europe
27 United States New York University 28 43860 11404.0 4569.0 North America
28 China Peking University 29 42136 6742.0 4734.0 Asia
29 China Tsinghua University 30 42089 3788.0 3072.0 Asia
... ... ... ... ... ... ... ...
170 South Africa University of Cape Town 171 20775 3740.0 1776.0 Africa
171 United States Rutgers, the State University of New Jersey 172 43439 4344.0 3913.0 North America
172 Sweden KTH Royal Institute of Technology 173 12951 2202.0 809.0 Europe
173 Germany University of Münster 174 42287 2960.0 993.0 Europe
174 Belgium Université Libre de Bruxelles 175 25343 8870.0 584.0 Europe
175 United Kingdom Newcastle University 176 20952 6286.0 1343.0 Europe
176 United Kingdom University of Liverpool 177 19815 6935.0 1694.0 Europe
177 China Zhejiang University 178 50051 3003.0 3550.0 Asia
178 Luxembourg University of Luxembourg 179 4969 2832.0 340.0 Europe
179 Netherlands University of Twente 180 6697 1808.0 523.0 Europe
180 France Paris-Sud University 181 27221 4355.0 1877.0 Europe
181 France École Normale Supérieure de Lyon 182 2221 289.0 261.0 Europe
182 Hong Kong Hong Kong Polytechnic University 183 21773 5443.0 809.0 Asia
183 Italy Scuola Normale Superiore di Pisa 184 515 36.0 99.0 Europe
184 United Kingdom University of Aberdeen 185 12055 4340.0 867.0 Europe
185 United States University of Miami 186 15860 2538.0 2688.0 North America
186 United Kingdom University of Dundee 187 10893 2614.0 731.0 Europe
187 United Kingdom University of East Anglia 188 13521 4056.0 994.0 Europe
188 China Shanghai Jiao Tong University 189 37288 2237.0 3007.0 Asia
189 Finland Aalto University 190 12744 2549.0 631.0 Europe
190 United States University of Massachusetts 191 59230 8292.0 4453.0 North America
191 New Zealand University of Auckland 192 30348 8801.0 1614.0 Oceania
192 United States Northeastern University 193 18780 6010.0 1341.0 North America
193 Russian Federation Lomonosov Moscow State University 194 29236 6432.0 4005.0 Europe
194 Netherlands Tilburg University 195 7738 1006.0 344.0 Europe
195 France Paris-Sorbonne University – Paris 4 196 22042 3306.0 730.0 Europe
196 United Kingdom Royal Holloway, University of London 197 8931 3572.0 580.0 Europe
197 United States University of California, Riverside 198 20740 2904.0 1002.0 North America
198 Sweden University of Gothenburg 199 18378 2389.0 1838.0 Europe
199 Taiwan National Taiwan University 200 31758 2541.0 2762.0 Asia

200 rows × 7 columns

Answering question (a), the faculty members over students ratio:


In [14]:
# We use the same fonction than for the question 1
THframe = Question_a(THframe)


10 best universities in terms of Faculty members over Students Ratio:
Name
Faculty_over_Students
0.303056 Vanderbilt University
0.243889 University of Copenhagen
0.232579 Yale University
0.232565 University of Rochester
0.232546 Johns Hopkins University
0.227291 Emory University
0.222208 Duke University
0.196141 École Polytechnique
0.192233 Scuola Normale Superiore di Pisa
0.169492 University of Pittsburgh
Plot:

Answering question (b), the international students over total ratio:


In [15]:
# We use the same fonction than for the question 1
THframe = Question_b(THframe)


Name
Inter_Students_over_Total
0.709985 London School of Economics and Political Science
0.569934 University of Luxembourg
0.549978 Imperial College London
0.549960 École Polytechnique Fédérale de Lausanne
0.500030 Maastricht University
0.490001 University College London
0.479956 University of St Andrews
0.449984 Carnegie Mellon University
0.449984 Queen Mary University of London
0.420007 University of Hong Kong
Plot:

In [16]:
# We use the same fonction than for the question 1

Region_and_Country_Stat(THframe)


Question C1
Faculty_over_Students
Country
Denmark 0.160176
Italy 0.138870
Russian Federation 0.136989
Japan 0.132105
United States 0.106393
France 0.099340
China 0.087419
Taiwan 0.086970
South Africa 0.085487
Switzerland 0.083433
Plot:
Question C2
Inter_Students_over_Total
Country
Luxembourg 0.569934
United Kingdom 0.365478
Hong Kong 0.327989
Switzerland 0.314280
Australia 0.307502
Singapore 0.305013
New Zealand 0.290003
Ireland 0.269976
Austria 0.260014
Canada 0.225000
Plot:
Question D1
Faculty_over_Students
Region
North America 0.101658
Africa 0.085487
Asia 0.080347
Europe 0.065951
Oceania 0.039893
Plot:
Question D2
Inter_Students_over_Total
Region
Oceania 0.305558
Europe 0.244658
North America 0.185585
Africa 0.180024
Asia 0.165242
Plot:

DISCUSSION OF THE RESULTS:

Vanderbilt University is the University with the highest ratio of faculty member which is in the USA. Whereas the Danmark is the country the the highest ratio of faculty member. North America is still the region with the highest ratio of faculty member. Once again the London school of Economics has the highest ratio of international student far above the other schools and Universities. This time the Luxembourg is the country with the highest ratio of international student which is pretty logical for a small country like this one. And once again Oceania is the Region with the highest ratio of international Students

QUESTION 3

Merge the two DataFrames created in questions 1 and 2 using university names. Match universities' names as well as you can, and explain your strategy. Keep track of the original position in both rankings.

The strategy of merging names is to use a technique called N-gram to parse up each university name into N long strings that are overlapping each other. When counting how many N-grams they have in common one can rate how much to Universities have in common. This handles most case very well. But have issues when Universities are from the same city like MIT and University of Massachusetts. Therefore we will use some stopwords that we filter out to merge correctly for the top universities. A good deal of the universities will be filtered out because some simply are just in one list, and some have so different names that we can't match them. We assume it is fine with analysing 140+ universities from here on out. This was how understood the task


In [17]:
#To merge the data on University name, some data cleaning is required. We will do n-grams. In order to diminish false 
#positive matching

def Cleaning_for_merging(Ranking):
    Ranking = Ranking.reset_index()
    RankName = Ranking['Name']
    stopWords = ['North',' ', '(',')']
    
    for word in stopWords:
        RankName = RankName.str.replace(word,'')
    return RankName

In [18]:
THframe = THframe.reset_index()
THframe = THframe.set_index('Rank')
THframe = THframe.sort_index()

THname = Cleaning_for_merging(THframe)
QSname = Cleaning_for_merging(QSdf)
Unchanged_QS_name = QSdf['Name']
Unchanged_TH_name = THframe['Name']

In order to have a matching names for merging the two dataframes, we will use Unigrams and Bigrams.


In [19]:
# An Ngram implementation to rate the similarity of two words by parsing them up in N long pieces
def N_Gram(N,text):
    NList = set()                      # start with an empty list
    if N> 1:
        space = " " * (N-1)         # add N - 1 spaces
        text = space + text + space # add both in front and back
# append the slices [i:i+N] to NList
    for i in range( len(text) - (N - 1) ):
        NList.add(text[i:i+N])
    return NList

In [20]:
#Matching the names of the universities across dataframes

#initializing the dataframe
Matched = pd.DataFrame(columns=['rank_TH','Actual_TH','Name_1_(TH)','Name_2_(QS)', 'Actual_QS','rank_QS', 'Matching_%'])

i = 0
count = 0
#iterate through all the names from TH
while i <len(THname):
    # get the list of partial words/sentences from Ngram
    nList1 = N_Gram(5,THname.iloc[i])
    j=0
    list_of_possible_matchings = np.zeros((len(QSname)))
    
    # iterate through all the names from TH
    while j <len(QSname):
        Counter = 0
        
        # get the list of partial words/sentences from Ngram
        nList2 = N_Gram(5,QSname.iloc[j])
        matching = nList1.intersection(nList2)
        if (matching is not None) :
            Counter = len(matching)
        
        # give an estimate of how good the matching is
        MatchingPerc = Counter/(min(len(nList1),len(nList2)))
        list_of_possible_matchings[j] = MatchingPerc
        j+=1
    
    # deside which matches that are the optimal
    bestIndex = np.argmax(list_of_possible_matchings)
    MatchingPerc = np.max(list_of_possible_matchings)
    Matched.loc[i] = [i+1, Unchanged_TH_name.iloc[i], THname.iloc[i], QSname.iloc[bestIndex],
                      Unchanged_QS_name.iloc[bestIndex], bestIndex+1, MatchingPerc]
           
    i+=1
    
display(Matched.sort_values('Matching_%', ascending=False).head(160))

# how much must the matchings match
percentMatchKeep = 0.8
Matched = Matched[(Matched['Matching_%'] > percentMatchKeep)]

print(Matched.shape)


rank_TH Actual_TH Name_1_(TH) Name_2_(QS) Actual_QS rank_QS Matching_%
0 1.0 University of Oxford UniversityofOxford UniversityofOxford University of Oxford 6.0 1.000000
106 107.0 Ghent University GhentUniversity GhentUniversity Ghent University 126.0 1.000000
97 98.0 Emory University EmoryUniversity EmoryUniversity Emory University 148.0 1.000000
98 99.0 University of California, Irvine UniversityofCalifornia,Irvine UniversityofCalifornia,Irvine University of California, Irvine 165.0 1.000000
1 2.0 University of Cambridge UniversityofCambridge UniversityofCambridge University of Cambridge 5.0 1.000000
101 102.0 University of Pittsburgh UniversityofPittsburgh UniversityofPittsburgh University of Pittsburgh 142.0 1.000000
102 103.0 Maastricht University MaastrichtUniversity MaastrichtUniversity Maastricht University 199.0 1.000000
104 105.0 University of Bern UniversityofBern UniversityofBern University of Bern 167.0 1.000000
108 109.0 Aarhus University AarhusUniversity AarhusUniversity Aarhus University 119.0 1.000000
119 120.0 City University of Hong Kong CityUniversityofHongKong CityUniversityofHongKong City University of Hong Kong 49.0 1.000000
109 110.0 University of Copenhagen UniversityofCopenhagen UniversityofCopenhagen University of Copenhagen 73.0 1.000000
110 111.0 Sungkyunkwan University (SKKU) SungkyunkwanUniversitySKKU SungkyunkwanUniversitySKKU Sungkyunkwan University (SKKU) 108.0 1.000000
112 113.0 University of Göttingen UniversityofGöttingen UniversityofGöttingen University of Göttingen 181.0 1.000000
113 114.0 University of Virginia UniversityofVirginia UniversityofVirginia University of Virginia 175.0 1.000000
115 116.0 Fudan University FudanUniversity FudanUniversity Fudan University 40.0 1.000000
117 118.0 Trinity College Dublin TrinityCollegeDublin TrinityCollegeDublin,TheUniversityofDublin Trinity College Dublin, The University of Dublin 88.0 1.000000
96 97.0 Durham University DurhamUniversity DurhamUniversity Durham University 78.0 1.000000
94 95.0 University of Basel UniversityofBasel UniversityofBasel University of Basel 149.0 1.000000
92 93.0 Lund University LundUniversity LundUniversity Lund University 79.0 1.000000
89 90.0 University of Helsinki UniversityofHelsinki UniversityofHelsinki University of Helsinki 103.0 1.000000
88 89.0 Dartmouth College DartmouthCollege DartmouthCollege Dartmouth College 169.0 1.000000
86 87.0 Uppsala University UppsalaUniversity UppsalaUniversity Uppsala University 112.0 1.000000
85 86.0 Rice University RiceUniversity RiceUniversity Rice University 89.0 1.000000
83 84.0 Michigan State University MichiganStateUniversity MichiganStateUniversity Michigan State University 150.0 1.000000
82 83.0 University of Groningen UniversityofGroningen UniversityofGroningen University of Groningen 113.0 1.000000
80 81.0 Monash University MonashUniversity MonashUniversity Monash University 60.0 1.000000
79 80.0 University of Glasgow UniversityofGlasgow UniversityofGlasgow University of Glasgow 65.0 1.000000
78 79.0 RWTH Aachen University RWTHAachenUniversity RWTHAachenUniversity RWTH Aachen University 141.0 1.000000
77 78.0 McMaster University McMasterUniversity McMasterUniversity McMaster University 140.0 1.000000
76 77.0 Pennsylvania State University PennsylvaniaStateUniversity PennsylvaniaStateUniversity Pennsylvania State University 94.0 1.000000
... ... ... ... ... ... ... ...
26 27.0 University of Edinburgh UniversityofEdinburgh TheUniversityofEdinburgh The University of Edinburgh 23.0 0.840000
191 192.0 University of Auckland UniversityofAuckland TheUniversityofAuckland The University of Auckland 82.0 0.833333
63 64.0 Wageningen University & Research WageningenUniversity&Research WageningenUniversity Wageningen University 124.0 0.833333
39 40.0 University of Hong Kong UniversityofHongKong TheUniversityofHongKong The University of Hong Kong 26.0 0.833333
133 134.0 University of Adelaide UniversityofAdelaide TheUniversityofAdelaide The University of Adelaide 111.0 0.833333
70 71.0 Ohio State University OhioStateUniversity TheOhioStateUniversity The Ohio State University 86.0 0.826087
90 91.0 University of Warwick UniversityofWarwick TheUniversityofWarwick The University of Warwick 57.0 0.826087
60 61.0 University of Sydney UniversityofSydney TheUniversityofSydney The University of Sydney 50.0 0.818182
129 130.0 University of Exeter UniversityofExeter TheUniversityofExeter The University of Exeter 159.0 0.818182
45 46.0 University of Tokyo UniversityofTokyo TheUniversityofTokyo The University of Tokyo 28.0 0.809524
27 28.0 New York University NewYorkUniversity NewYorkUniversityNYU New York University (NYU) 52.0 0.809524
136 137.0 Pohang University of Science and Technology PohangUniversityofScienceandTechnology TheHongKongUniversityofScienceandTechnology The Hong Kong University of Science and Techno... 30.0 0.809524
9 10.0 ETH Zurich – Swiss Federal Institute of Techno... ETHZurich–SwissFederalInstituteofTechnologyZurich ETHZurich-SwissFederalInstituteofTechnology ETH Zurich - Swiss Federal Institute of Techno... 10.0 0.808511
35 36.0 King’s College London King’sCollegeLondon King'sCollegeLondon King's College London 24.0 0.782609
57 58.0 Chinese University of Hong Kong ChineseUniversityofHongKong TheUniversityofHongKong The University of Hong Kong 26.0 0.777778
161 162.0 University of California, Santa Cruz UniversityofCalifornia,SantaCruz UniversityofCalifornia,SantaBarbaraUCSB University of California, Santa Barbara (UCSB) 134.0 0.777778
37 38.0 École Polytechnique Fédérale de Lausanne ÉcolePolytechniqueFédéraledeLausanne EcolePolytechniqueFédéraledeLausanneEPFL Ecole Polytechnique Fédérale de Lausanne (EPFL) 12.0 0.775000
114 115.0 École Polytechnique ÉcolePolytechnique EcolePolytechnique Ecole Polytechnique 59.0 0.772727
197 198.0 University of California, Riverside UniversityofCalifornia,Riverside UniversityofCalifornia,Irvine University of California, Irvine 165.0 0.727273
128 129.0 Université Catholique de Louvain UniversitéCatholiquedeLouvain UniversitécatholiquedeLouvainUCL Université catholique de Louvain (UCL) 153.0 0.727273
84 85.0 University of New South Wales UniversityofNewSouthWales TheUniversityofNewSouthWalesUNSWSydney The University of New South Wales (UNSW Sydney) 45.0 0.724138
192 193.0 Northeastern University easternUniversity westernUniversity Northwestern University 29.0 0.714286
93 94.0 University of Tübingen UniversityofTübingen UniversityofGroningen University of Groningen 113.0 0.708333
156 157.0 Ulm University UlmUniversity StockholmUniversity Stockholm University 195.0 0.705882
99 100.0 University of Bonn UniversityofBonn UniversityofBergen University of Bergen 164.0 0.700000
173 174.0 University of Münster UniversityofMünster UniversityofRochester University of Rochester 186.0 0.695652
144 145.0 University of Cologne UniversityofCologne UniversityofColoradoBoulder University of Colorado Boulder 184.0 0.695652
122 123.0 Georgetown University GeorgetownUniversity BrownUniversity Brown University 53.0 0.684211
158 159.0 University of Leicester UniversityofLeicester UniversityofRochester University of Rochester 186.0 0.680000
154 155.0 Scuola Superiore Sant’Anna ScuolaSuperioreSant’Anna ScuolaSuperioreSant'AnnaPisadiStudiUniversitar... Scuola Superiore Sant'Anna Pisa di Studi Unive... 193.0 0.678571

160 rows × 7 columns

(143, 7)

In [ ]:
# QS anf TH Dataframe that we will merge
QS_merge = QSdf
TH_merge = THframe


# Make sure QS_merge is sorted by rank
QS_merge.reset_index()
QS_merge.sort_values('Rank')
 
# sort Matched by TH_rank
Matched.sort_values('rank_TH')
 
# change name of the QS_merge to be identical to the ones we matched with the TH_frame
for index, matched_TH_name in enumerate(Matched.loc[:,'Actual_TH']):
    TH_rank = index + 1
    QS_rank = Matched.loc[:,'rank_QS'].iloc[index]
    QS_merge['Name'].iloc[ int(QS_rank) - 1] = matched_TH_name
 
# Set 'Name' as index before merging
QS_merge.set_index('Name')
TH_merge.set_index('Name')
 
# merge on names
# simultanously get rid of universities not in the list with just keeping the intersection: "inner" join
merged_frame = pd.merge(TH_merge, QS_merge, on='Name', how='inner', suffixes=('_TH', '_QS'))
 
# The index is +1 so it replaces the ranking in TH
merged_frame['index'] += 1
# set 'index' -> 'Rank_TH' for clarity and Rank -> Rank_QS
merged_frame = merged_frame.rename(columns={'index': 'Rank_TH', 'Rank': 'Rank_QS', 'Country_TH': 'Country'})
 
# PUt the rank on the left part of the data frame 
# We delete the country and the region beacause they are twice in the dataframe
merged_frame = merged_frame.set_index('Rank_TH') 
merged_frame['Region'] = merged_frame['Region_TH']
merged_frame = merged_frame[['Rank_QS','Name','Country','Region', 'Total_Students_TH', 'Inter_Students_TH',
       'Total_Faculty_TH', 'Inter_Faculty', 'Inter_Students_QS', 'Total_Faculty_QS', 'Total_Students_QS']]

Merged Data Frame with rank and datas from both websites

Here is the two Dataframes based on university names


In [22]:
display(merged_frame)


Rank_QS Name Country Region Total_Students_TH Inter_Students_TH Total_Faculty_TH Inter_Faculty Inter_Students_QS Total_Faculty_QS Total_Students_QS
Rank_TH
1 6 University of Oxford United Kingdom Europe 20409 7755.0 1822.0 2964 7353 6750 19720
2 5 University of Cambridge United Kingdom Europe 18389 6436.0 1687.0 2278 6699 5490 18770
3 4 California Institute of Technology United States North America 2209 596.0 340.0 350 647 953 2255
4 2 Stanford University United States North America 15845 3486.0 2113.0 2042 3611 4285 15878
5 1 Massachusetts Institute of Technology United States North America 11177 3800.0 1285.0 1679 3717 2982 11067
6 3 Harvard University United States North America 20326 5285.0 2284.0 1311 5266 4350 22429
7 13 Princeton University United States North America 7955 1909.0 958.0 246 1793 1007 8069
8 8 Imperial College London United Kingdom Europe 15857 8721.0 1391.0 2071 8746 3930 16090
9 9 University of Chicago United States North America 13525 3381.0 2181.0 635 3379 2449 13557
10 10 ETH Zurich – Swiss Federal Institute of Techno... Switzerland Europe 19233 7309.0 1317.0 1886 7563 2477 19815
11 19 University of Pennsylvania United States North America 20361 4072.0 3132.0 1383 4250 5499 20639
12 16 Yale University United States North America 12155 2553.0 2827.0 1708 2469 4940 12402
13 17 Johns Hopkins University United States North America 15498 3720.0 3604.0 1061 4105 4462 16146
14 18 Columbia University United States North America 26587 8508.0 4359.0 913 8105 6189 25045
15 33 University of California, Los Angeles United States North America 39279 6677.0 4092.0 662 6887 2859 40114
16 7 University College London United Kingdom Europe 30304 14849.0 2886.0 2554 14854 6345 31080
17 22 Duke University United States North America 15256 3356.0 3390.0 226 2864 2938 15320
18 27 University of California, Berkeley United States North America 36182 6151.0 2762.0 1395 6086 3321 36703
19 14 Cornell University United States North America 21850 5244.0 2230.0 970 5411 2718 21904
20 29 Northwestern University United States North America 17466 3144.0 1365.0 1141 3422 4231 18675
21 21 University of Michigan United States North America 41818 6691.0 4863.0 1920 7527 6809 43147
22 15 National University of Singapore Singapore Asia 30602 9181.0 1800.0 3086 8917 5106 32728
23 31 University of Toronto Canada North America 69427 11803.0 3713.0 3905 21910 9581 72207
24 48 Carnegie Mellon University United States North America 12676 5704.0 939.0 425 6385 1342 13356
25 35 London School of Economics and Political Science United Kingdom Europe 10065 7146.0 825.0 687 6748 1088 9760
26 61 University of Washington United States North America 44945 7191.0 3943.0 550 8373 2484 56973
27 23 University of Edinburgh United Kingdom Europe 26759 10704.0 2141.0 1553 10551 4075 28040
28 52 New York University United States North America 43860 11404.0 4569.0 Unknown Unknown Unknown Unknown
29 38 Peking University China Asia 42136 6742.0 4734.0 1038 7090 5185 42136
30 25 Tsinghua University China Asia 42089 3788.0 3072.0 932 4072 5506 36300
... ... ... ... ... ... ... ... ... ... ... ...
139 101 University of Leeds United Kingdom Europe 27995 7279.0 1904.0 870 7534 3025 29645
142 104 Eindhoven University of Technology Netherlands Europe 7637 1298.0 411.0 322 1275 801 7636
143 178 University of Florida United States North America 46346 4171.0 2648.0 504 4597 5419 45485
144 92 University of St Andrews United Kingdom Europe 9005 4322.0 693.0 485 4030 1140 8800
146 143 University of Oslo Norway Europe 27730 5269.0 1491.0 711 3667 2509 24304
148 85 University of Nottingham United Kingdom Europe 29176 8753.0 2084.0 1135 7993 3360 29165
150 135 Lancaster University United Kingdom Europe 11637 4422.0 888.0 435 4464 1245 12300
152 146 University of Lausanne Switzerland Europe 13065 3266.0 961.0 1031 3337 2173 13435
153 117 Technical University of Denmark Denmark Europe 8548 2052.0 1425.0 966 2098 2117 8878
154 186 University of Rochester United States North America 9636 2794.0 2241.0 488 2805 2569 9636
160 197 Texas A&M University United States North America 56245 5062.0 2383.0 206 4900 3446 60294
163 137 Cardiff University United Kingdom Europe 24186 6288.0 1792.0 705 6090 2755 24565
165 154 University of Vienna Austria Europe 35375 9198.0 1701.0 1400 14468 3411 45967
168 84 University of Alabama at Birmingham United States North America 15703 942.0 1033.0 1028 7795 2962 28660
169 115 Nanjing University China Asia 33456 3346.0 2356.0 617 2960 2449 32495
171 191 University of Cape Town South Africa Africa 20775 3740.0 1776.0 379 3325 1733 19593
173 99 KTH Royal Institute of Technology Sweden Europe 12951 2202.0 809.0 494 1937 1500 12658
176 161 Newcastle University United Kingdom Europe 20952 6286.0 1343.0 835 5717 2428 21673
177 174 University of Liverpool United Kingdom Europe 19815 6935.0 1694.0 717 7502 2390 22065
178 87 Zhejiang University China Asia 50051 3003.0 3550.0 1913 4625 4507 40869
180 180 University of Twente Netherlands Europe 6697 1808.0 523.0 299 2235 910 9332
182 157 École Normale Supérieure de Lyon France Europe 2221 289.0 261.0 91 221 382 2020
183 96 Hong Kong Polytechnic University Hong Kong Asia 21773 5443.0 809.0 1275 5251 2447 20972
184 192 Scuola Normale Superiore di Pisa Italy Europe 515 36.0 99.0 1 48 86 532
185 158 University of Aberdeen United Kingdom Europe 12055 4340.0 867.0 508 4246 1310 12005
189 62 Shanghai Jiao Tong University China Asia 37288 2237.0 3007.0 887 2858 3544 38931
190 139 Aalto University Finland Europe 12744 2549.0 631.0 370 1831 1257 12147
192 82 University of Auckland New Zealand Oceania 30348 8801.0 1614.0 645 8568 2047 29641
194 95 Lomonosov Moscow State University Russian Federation Europe 29236 6432.0 4005.0 373 5098 6709 30233
200 77 National Taiwan University Taiwan Asia 31758 2541.0 2762.0 220 3152 2807 32195

141 rows × 11 columns

QUESTION 4

Find useful insights in the data by performing an exploratory analysis. Can you find a strong correlation between any pair of variables in the dataset you just created? Example: when a university is strong in its international dimension, can you observe a consistency both for students and faculty members?


In [23]:
# We drop the row in the merged dataframe where there are some unknows datas
merged_frame = merged_frame[merged_frame.Total_Students_QS != 'Unknown']

#We plot the number of Students for everey Universities between QS and TH website
# We also plot a red line regression 

students = merged_frame.plot(kind='scatter', x = 'Total_Students_QS',y = 'Total_Students_TH')
x = np.linspace(0, 80000, 80000)
plt.plot(x,x,c='r')
plt.show(students)

Faculty = merged_frame.plot(kind='scatter', x = 'Total_Faculty_QS',y = 'Total_Faculty_TH')
x = np.linspace(0, 8000, 8000)
plt.plot(x,x,c='r')
plt.show(students)


We plotted 2 things.

First a plot with the Total number of students from one website on one axis and from the other website on the other axis. Secondly on the same plot we put a red line: the linear function which is the line where all the datas should be if the datas from both websites were the same. We Repeated this operation for the Total number of Students and for the number of faculty members.

For The first plot we observe that there is a strong correlation and the datas are approximatly the same.

For the second plot we clearly see that the number of faculty member from the QS website is way ABOVE the number from the other website.


In [24]:
merged_frame['International_Students_mean'] = np.mean([merged_frame.Inter_Students_QS,merged_frame.Inter_Students_TH],axis=0)
merged_frame['Total_Students_mean'] = np.mean([merged_frame.Total_Students_QS,merged_frame.Total_Students_TH],axis=0)
merged_frame['Total_Faculty_mean'] = np.mean([merged_frame.Total_Faculty_QS,merged_frame.Total_Faculty_TH],axis=0)

x = merged_frame['Inter_Students_QS'].div(merged_frame['Total_Students_QS'])
y = merged_frame.Inter_Faculty.div(merged_frame['Total_Faculty_QS'])  
                                                
#Inter = merged_frame.plot(kind='scatter', x  ,y )
plt.scatter(x,y)
plt.title('Interational Faculty / Total Faculty members versus International Student/ Total Student')
plt.xlabel('International Student/ Total Student (QS)')
plt.ylabel('Interational Faculty / Total Faculty members')
line = np.linspace(0, 0.8, 1000)
plt.plot(line,line,c='r')
plt.show()


observation : It is possible that there is a positive correlation bewtween international faculty members and universities with high ratio of International Students.


In [25]:
x = merged_frame.Total_Faculty_mean.div(merged_frame.Total_Students_mean)
y = merged_frame.International_Students_mean.div(merged_frame.Total_Students_mean)  
                                                
#Inter = merged_frame.plot(kind='scatter', x  ,y )
plt.scatter(x,y)
plt.title('Faculty/Student ratio versus International student/student ratio')
plt.xlabel('Faculty members / total students')
plt.ylabel('International student / student')
plt.show()


observation: These results are inconclusive. We do not observe a clear pattern between both ratios. It don't not seem to have a linear correlation between faculty members and international students. It might be that we have a negative correlation but cannot conclude this from this graph.

QUESTION 5

In order to establish which university is the best, we consider the scores in the different categories that were given in the rankings. The QS ranking has the following items (they will be named QS(number) afterwards): 1 - Overall 2 - Academic reputation 3 - Citation per faculty 4 - Employer reputation 5 - Faculty Student 6 - International Students 7 - International faculty

The Times higher ranking has (they will be named TH(number) afterwards): 1 - Overall 2 - Teaching 3 - Research 4 - Citation 5 - Industry income 6 - International outlook

In order to be able to make a scoring system that considers both rankings, we made the following assumptions:

 - QS(1) & TH(1) are removed because we want to make another overall ranking, it would make no sense to take them into account. 
 - The average between TH(2) and TH(3) will correspond to QS(2) because to us, teaching and research put together are academic reputation.
 - QS(3) = TH(4)
 - QS(5) correspond to the Faculty/Student ratio that we computed in question 2.
 - QS(4) and TH(5) will be removed because they don't have an equivalent in the other ranking.
 - QS(6) and QS(7) are forming TH(6) because the international students and faculty are both parts of the international outlook.

In the end, our criterias are: 1 - Academic 2 - Citations 3 - Faculty over student ratio 4 - International outlook

When looking at the merged rankings, one can notice that in the top 10, apart from number 7 (which is UCL for the QS and Princeton University for Times Higher), both ranking share the same universities at the top even though they are not in the same order. It seemed highly unlikely that a university whihch is not in the top 10 of both ranking could in the end be the best universy when merging. This is why we removed the two number 7 and decided to work with only 9 universities.


In [26]:
#Remove the number 7 and work with the rest. Puting the Scores in a DataFrame.
THscores = pd.DataFrame()
#Extracting the Data from TH5. In order to handle a small amount of data we will use only the first 10 universities.
TH5 = TH5.head(10)
THscores['Name' ] = TH5.aliases

#Taking the average between TH(2) and TH(3):
THscores['Academic_Reputation'] = (pd.to_numeric(TH5.scores_teaching, errors='coerce') +
                        pd.to_numeric(TH5.scores_research, errors='coerce')).div(2)
THscores['Citations'] = pd.to_numeric(TH5.scores_citations, errors='coerce')
#THscores['Faculty_over_Student'] = THframe['Faculty_over_Students']
THscores['International_Outlook'] = pd.to_numeric(TH5.scores_international_outlook, errors='coerce')

#We reset the index of the THframe and then we use it to have the Faculty over student ratio.

THscores['Faculty_over_Students'] = TH['Total_Faculty'].head(10).div(TH['Total_Students'].head(10))


display(THscores)


Name Academic_Reputation Citations International_Outlook Faculty_over_Students
0 University of Oxford 93.1 99.1 95.0 0.089274
1 University of Cambridge 92.8 97.5 93.0 0.091740
2 California Institute of Technology caltech 93.9 99.5 59.7 0.153916
3 Stanford University 92.9 99.9 77.6 0.133354
4 Massachusetts Institute of Technology 89.6 100.0 87.6 0.114968
5 Harvard University 91.3 99.7 79.7 0.112368
6 Princeton University 89.8 99.6 78.7 0.120427
7 Imperial College London 85.2 96.7 96.6 0.087722
8 University of Chicago 87.7 99.4 69.6 0.161257
9 ETH Zurich – Swiss Federal Institute of Techno... 84.2 94.3 98.1 0.068476

In order to have the same standards, we create a function to normalize scoring values. We will set the maximum of each column at 100 and and then set the other values taking this into account.


In [27]:
def Normalizing(ScoreFrame,Column):
    MaxValue = max(ScoreFrame[Column])
    ScoreFrame[Column] = (ScoreFrame[Column] * 100) / MaxValue
    return ScoreFrame

In [28]:
THscores = Normalizing(THscores, 'Academic_Reputation')
THscores = Normalizing(THscores, 'Faculty_over_Students')
THscores = Normalizing(THscores, 'International_Outlook')

display(THscores)


Name Academic_Reputation Citations International_Outlook Faculty_over_Students
0 University of Oxford 99.148030 99.1 96.839959 55.361552
1 University of Cambridge 98.828541 97.5 94.801223 56.890346
2 California Institute of Technology caltech 100.000000 99.5 60.856269 95.447555
3 Stanford University 98.935037 99.9 79.102956 82.696830
4 Massachusetts Institute of Technology 95.420660 100.0 89.296636 71.295068
5 Harvard University 97.231097 99.7 81.243629 69.682831
6 Princeton University 95.633653 99.6 80.224261 74.680451
7 Imperial College London 90.734824 96.7 98.470948 54.398599
8 University of Chicago 93.397231 99.4 70.948012 100.000000
9 ETH Zurich – Swiss Federal Institute of Techno... 89.669862 94.3 100.000000 42.463946

In [29]:
THscores = THscores.drop(6)
THscores = THscores.set_index('Name')

In [30]:
display(THscores)


Academic_Reputation Citations International_Outlook Faculty_over_Students
Name
University of Oxford 99.148030 99.1 96.839959 55.361552
University of Cambridge 98.828541 97.5 94.801223 56.890346
California Institute of Technology caltech 100.000000 99.5 60.856269 95.447555
Stanford University 98.935037 99.9 79.102956 82.696830
Massachusetts Institute of Technology 95.420660 100.0 89.296636 71.295068
Harvard University 97.231097 99.7 81.243629 69.682831
Imperial College London 90.734824 96.7 98.470948 54.398599
University of Chicago 93.397231 99.4 70.948012 100.000000
ETH Zurich – Swiss Federal Institute of Technology Zurich 89.669862 94.3 100.000000 42.463946

We believe that the international dimension is less important for a university to be the best. It still matters, however we think that teaching, citations and faculty over student ratio is more important than the rest. This is why every other item apart from the international dimension is given a 2 factor in our grading system.


In [31]:
def Grade(Score):
    Score['Grade'] = (2*Score['Academic_Reputation'] + 2*Score['Citations'] + Score['International_Outlook'] + 
                      2*Score['Faculty_over_Students'])/7
    return Score

In [32]:
THscores = Grade(THscores)
display(THscores.sort_values('Grade', ascending=False))


Academic_Reputation Citations International_Outlook Faculty_over_Students Grade
Name
University of Chicago 93.397231 99.4 70.948012 100.000000 93.791782
California Institute of Technology caltech 100.000000 99.5 60.856269 95.447555 92.964483
Stanford University 98.935037 99.9 79.102956 82.696830 91.738099
Massachusetts Institute of Technology 95.420660 100.0 89.296636 71.295068 88.961156
Harvard University 97.231097 99.7 81.243629 69.682831 87.781641
University of Oxford 99.148030 99.1 96.839959 55.361552 86.294160
University of Cambridge 98.828541 97.5 94.801223 56.890346 85.891285
Imperial College London 90.734824 96.7 98.470948 54.398599 83.162542
ETH Zurich – Swiss Federal Institute of Technology Zurich 89.669862 94.3 100.000000 42.463946 78.981088

Let's work with QS now, we will use a similar function as the Get_data function in question 1, but extracting only the items we need for the score.


In [33]:
def Get_score_data(HTML):
#Using BeautifulSoup to extract the data from the given HTML
    Page_Soup = BeautifulSoup(requests.get(HTML).text, 'html.parser')
    name = Page_Soup.find('h1').text
   
    #We get the number, delete the comma and then the rest of the text (there was spaces and text we had no use of)
    
    Array =  re.findall(r'[+-]?[0-9.]+', Page_Soup.find('ul', class_='score').text)
    
    Score = pd.DataFrame({'Name': [name], 'Academic_Reputation': Array[1], 'Citation_per_Faculty': Array[2], 
                         'Inter_Faculty': Array[5], 'InterStudents': Array[6]})
    return Score

In [34]:
#Using the same process as in question 1, but with a different function
r = requests.get('https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051.txt?_=1508507889334')
req = r.json()
data = req['data']
QSframe = pd.DataFrame.from_dict(data)
QSframe.url = "https://www.topuniversities.com" + QSframe.url

QS = QSframe.head(10)
QSscores = pd.DataFrame()
for i,url in enumerate(QS.url):
    QSscores = QSscores.append(Get_score_data(url))

In [35]:
#Reindexing and getting rid of the data we do not want to work with.
QSscores = QSscores.set_index('Name')
QSscores = QSscores.drop('UCL (University College London)')

In [36]:
display(QSscores)


Academic_Reputation Citation_per_Faculty InterStudents Inter_Faculty
Name
Massachusetts Institute of Technology (MIT) 100 99.9 96.1 100
Stanford University 100 99.4 72.7 99.6
Harvard University 100 99.9 75.2 96.5
California Institute of Technology (Caltech) 99.5 100 89.2 93.4
University of Cambridge 100 78.3 97.7 97.4
University of Oxford 100 76.3 98.5 98.6
Imperial College London 99.4 68.7 100 100
University of Chicago 99.9 85.9 79.8 71.9
ETH Zurich - Swiss Federal Institute of Technology 99.6 98.7 98.8 100

In [ ]:
QSscores['International_Outlook'] = (pd.to_numeric(QSscores['InterStudents']) + pd.to_numeric(QSscores['Inter_Faculty']))/2
QSscores['Citations'] = pd.to_numeric(QSscores['Citation_per_Faculty'])
QSscores['Academic_Reputation'] = pd.to_numeric(QSscores['Academic_Reputation'])


#We reset the index of the QS and then we use it to have the Faculty over student ratio.
QS = QSdf
QS = QS.reset_index()
QS = QS.sort_values('Rank')

#Creating an empty column to contain the results:
QSscores['Faculty_over_Students'] = 'NaN'

i = 0
while i < len(QSscores['Academic_Reputation']) :
    
    QSscores['Faculty_over_Students'][i] = QS['Total_Faculty'][i]/QS['Total_Students'][i]
    i+=1

In [38]:
display(QSscores)


Academic_Reputation Citation_per_Faculty InterStudents Inter_Faculty International_Outlook Citations Faculty_over_Students
Name
Massachusetts Institute of Technology (MIT) 100.0 99.9 96.1 100 98.05 99.9 0.26945
Stanford University 100.0 99.4 72.7 99.6 86.15 99.4 0.26987
Harvard University 100.0 99.9 75.2 96.5 85.85 99.9 0.193945
California Institute of Technology (Caltech) 99.5 100 89.2 93.4 91.30 100.0 0.422616
University of Cambridge 100.0 78.3 97.7 97.4 97.55 78.3 0.292488
University of Oxford 100.0 76.3 98.5 98.6 98.55 76.3 0.342292
Imperial College London 99.4 68.7 100 100 100.00 68.7 0.204151
University of Chicago 99.9 85.9 79.8 71.9 75.85 85.9 0.244251
ETH Zurich - Swiss Federal Institute of Technology 99.6 98.7 98.8 100 99.40 98.7 0.180645

In [39]:
#Deleting the columsn that are now in two.
QSscores = QSscores.drop(['Citation_per_Faculty', 'InterStudents' ,'Inter_Faculty'] ,axis=1)
One can notice that the "Citations" and "Faculty over Students" columns aren't normalized.

In [40]:
QSscores = Normalizing(QSscores, 'Citations')
QSscores = Normalizing(QSscores, 'Faculty_over_Students')
#We now get the grade
QSscores =Grade(QSscores)

#In order to find the best university, we put the in the same order inside the 2 dataframes.
#To do so we make sure that their names are the same.

QSscores = QSscores.rename(index={'Massachusetts Institute of Technology (MIT) ':'Massachusetts Institute of Technology',
                                 'California Institute of Technology (Caltech)' : 'California Institute of Technology caltech',
                                 'ETH Zurich - Swiss Federal Institute of Technology' :'ETH Zurich – Swiss Federal Institute of Technology Zurich'})

Now that we have the both rankings with the same universities, we are able to make the average of the the two grades. To do so we sort the indexes and then obtain a final DataFrame.


In [41]:
THscores = THscores.sort_index()
QSscores = QSscores.sort_index()
display(QSscores)
display(THscores)


Academic_Reputation International_Outlook Citations Faculty_over_Students Grade
Name
California Institute of Technology caltech 99.5 91.30 100.0 100 98.6143
ETH Zurich – Swiss Federal Institute of Technology Zurich 99.6 99.40 98.7 42.7444 83.0698
Harvard University 100.0 85.85 99.9 45.8916 82.4905
Imperial College London 99.4 100.00 68.7 48.3064 76.1161
Massachusetts Institute of Technology 100.0 98.05 99.9 63.7575 89.3379
Stanford University 100.0 86.15 99.4 63.857 87.5234
University of Cambridge 100.0 97.55 78.3 69.2089 84.6525
University of Chicago 99.9 75.85 85.9 57.795 80.4343
University of Oxford 100.0 98.55 76.3 80.9936 87.591
Academic_Reputation Citations International_Outlook Faculty_over_Students Grade
Name
California Institute of Technology caltech 100.000000 99.5 60.856269 95.447555 92.964483
ETH Zurich – Swiss Federal Institute of Technology Zurich 89.669862 94.3 100.000000 42.463946 78.981088
Harvard University 97.231097 99.7 81.243629 69.682831 87.781641
Imperial College London 90.734824 96.7 98.470948 54.398599 83.162542
Massachusetts Institute of Technology 95.420660 100.0 89.296636 71.295068 88.961156
Stanford University 98.935037 99.9 79.102956 82.696830 91.738099
University of Cambridge 98.828541 97.5 94.801223 56.890346 85.891285
University of Chicago 93.397231 99.4 70.948012 100.000000 93.791782
University of Oxford 99.148030 99.1 96.839959 55.361552 86.294160

In [42]:
FinalGrading = pd.DataFrame()
FinalGrading['Overall_Score'] = (THscores['Grade'] + QSscores['Grade'])/2

FinalGrading = FinalGrading.sort_values('Overall_Score', ascending=False)

In [43]:
display(FinalGrading)


Overall_Score
Name
California Institute of Technology caltech 95.7894
Stanford University 89.6308
Massachusetts Institute of Technology 89.1495
University of Chicago 87.113
University of Oxford 86.9426
University of Cambridge 85.2719
Harvard University 85.136
ETH Zurich – Swiss Federal Institute of Technology Zurich 81.0255
Imperial College London 79.6393

With respect to our ranking system merging the two previous ones, CalTech arrives first, which not that surprising because it is well ranked in both QS and TH. The weight we applied to each parameters should not be that far from the ones that were used by the ranking sites because our results are quite clear.