This notebook is dedicated to getting information about universities. Universities are a major structural component of academic institutions. Here we will try to find lists of universities and gather data about them. We are interested in the disciplines that these universities deal with. We can guess those from the programs, departments, and professors that a university has. We can also infer some things from the projects in which they are involved, as well as from their connections to other universities. In particular, we will look at:

  • departments
  • permanent and temporary staff
  • study programs
  • research projects they are involved in

We will also apply some basic scraping techniques to look for keywords of disciplines mentioned on university websites.

Univ.cc

For our first task we found the univ.cc website, which lists a large number of universities. We will also get into university ratings later.


In [15]:
import requests
from bs4 import BeautifulSoup
import pandas
import time
import pickle
#print(soup.prettify()) #this shows us the structure of the website



Let's start by defining a function that downloads all the links we need. In the next cell the call

all_universities = get_all_universities()

is commented out, because we do not want to repeat a download we have already done.


In [55]:
def get_all_universities():
    all_links = []
    try:
        for start in range(1, 7229, 50):  # results are paginated, 50 per page
            payload = {'dom': 'world', 'key': '', 'start': '{}'.format(start)}
            r = requests.get('http://univ.cc/search.php', params=payload)
            soup = BeautifulSoup(r.content, 'html.parser')
            for line in soup.find_all('li'):  # each <li> holds one university link
                all_links.append(line.find('a'))
            time.sleep(0.2)  # be polite to the server
        return all_links
    except Exception:
        return all_links  # keep whatever was collected before the failure
#all_universities = get_all_universities()

Also, let's define some functions that will allow us to save and load the data. For now we simply write the links to a text file. In the future, we would like to keep all addresses in one place.


In [43]:
def write_txt():
    with open('all_univs.txt', 'w') as thefile:
        for university_link in all_universities:
            thefile.write("{}\n".format(university_link))  # one <a> tag per line

def read_txt():
    with open('all_univs.txt') as thefile:
        return thefile.read().split('\n')

all_universities = [x for x in read_txt() if x]  # drop empty lines
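
Since pickle is already imported, a pickle-based pair is a natural alternative for plain Python data such as lists of strings (BeautifulSoup tags themselves may not pickle cleanly, so we would store the extracted strings). A minimal sketch; the filename all_univs.pkl is our own choice:

def write_pickle(data):
    with open('all_univs.pkl', 'wb') as f:
        pickle.dump(data, f)

def read_pickle():
    with open('all_univs.pkl', 'rb') as f:
        return pickle.load(f)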

Was our load successful?


In [44]:
len(set(all_universities))


Out[44]:
7216

We need to structure our data a bit more.


In [63]:
university_list = []
for s in all_universities:
    # Each saved line looks like '<a href="http://...">University Name</a>'.
    # lstrip/rstrip strip character *sets*, not prefixes, so re-parsing each
    # tag with BeautifulSoup is a safer way to pull out the url and the name.
    tag = BeautifulSoup(s, 'html.parser').find('a')
    university_list.append((tag['href'], tag.get_text()))

import pandas as pd
df = pd.DataFrame(university_list, columns=['url', 'name'])
df.head()


Out[63]:
url name
0 http://www.uab.ro/ 1 December University of Alba Iulia
1 http://www.smmu.edu.cn/ 2nd Military Medical University
2 http://www.tmmu.edu.cn/ 3rd Military Medical University
3 http://www.fmmu.edu.cn/ 4th Military Medical University
4 http://www.ah.dk/ Aalborg Business College

So now we have a bit more than 7,000 links to universities. We can use them in many ways. First of all, we can look for mentions of disciplines, and the simplest way to do that is to search for discipline names. We can use the web_of_science classifications.


In [71]:
from disciplines.theory import web_of_science_categories
discipline_names = web_of_science_categories.categories
discipline_names[5:10]


Out[71]:
(u'Agronomy',
 u'Allergy',
 u'Anatomy & Morphology',
 u'Andrology',
 u'Anesthesiology')
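
With the category list loaded, a first naive pass could count exact, case-insensitive mentions of each discipline name on a page. This is only a sketch built on the imports above; count_discipline_mentions is our own helper name:

def count_discipline_mentions(url, names):
    r = requests.get(url, timeout=10)
    text = BeautifulSoup(r.content, 'html.parser').get_text().lower()
    counts = {name: text.count(name.lower()) for name in names}
    return {name: n for name, n in counts.items() if n > 0}  # keep only hits

#count_discipline_mentions(df['url'][0], discipline_names)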

Now we need to handle variation in word forms, so we can catch not only "Agronomy" but also "Agronomical" and other variants. NLTK will help us with that. We would also like to catch items such as "anesthesiologist", plural forms included.


In [88]:
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
print(wnl.lemmatize('sociological', 'n'))
# Still not working properly: with the noun POS tag, the WordNet lemmatizer
# leaves the adjective 'sociological' unchanged instead of mapping it to 'sociology'


sociological
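
A stemmer may be a better fit than the lemmatizer here, since stemming collapses related forms onto a common stem. A sketch with NLTK's SnowballStemmer (the resulting stems should be checked per discipline before relying on this):

from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')
# If both forms share a stem (e.g. 'sociolog'), we can match stemmed
# discipline names against stemmed page text.
print('{} {}'.format(stemmer.stem('sociology'), stemmer.stem('sociological')))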

But we will need pages to work with. The first way is to download all the pages; the second is to scrape them on the fly as we visit them. We chose the first option, because it will let us do more work with the data offline.
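
A sketch of that downloading step, following the same polite crawling pattern as get_all_universities above (the pages folder and per-row filenames are our own choices):

import os

def download_all_pages(df, folder='pages'):
    if not os.path.isdir(folder):
        os.makedirs(folder)
    for i, url in enumerate(df['url']):
        try:
            r = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip unreachable sites; we can retry them later
        with open(os.path.join(folder, '{}.html'.format(i)), 'wb') as f:
            f.write(r.content)  # save the raw front page for offline analysis
        time.sleep(0.2)  # be polite to the servers

#download_all_pages(df)  # commented out, like the crawl above, to avoid re-downloading everything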