1. Introduction

Data science used to be called data mining, and for good reasons. It is a dirty, soul-wrenching job in which many die. Machine learning is the only enjoyable part, but whatever happens before you can deploy a learning algorithm is drudgery. The renaming of the subject was a PR move, and it is about as pretentious as calling coal mining "carbon extraction science".

Today we will focus on the drudgery part, which is inevitable if you want to obtain any non-trivial result with machine learning. My hope is that the topics covered translate to other workloads you might encounter in your daily work. The topics will be:

  • I/O operations, which include pulling files off the net and elementary loading/saving. We will not work at scale, as the IT policy explicitly bans this.

  • Lots of text processing. This alone is a compelling reason to use Python 3: you will avoid most of the problems associated with the rabbit hole known as UTF-8. You'd also better make friends with regular expressions.

  • Database-like operations, dataframes.

  • Visual inspection of the data to get a feel for it.

The modules we import reflect the above: neither NumPy nor TensorFlow has a place here. We will scrape stuff from arXiv to analyze correlation patterns between authors, metadata, and Impact Factor. There are high-level libraries for scraping (e.g. Scrapy) and for working with arXiv (e.g. arxiv.py), but we will build things bottom-up, so we import low-level I/O, networking, text processing, and parsing libraries.


In [ ]:
from __future__ import print_function
import matplotlib.pyplot as plt
import os
import pandas as pd
import re
import seaborn as sns
try:
    from urllib2 import Request, urlopen
except ImportError:
    from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
%matplotlib inline

2. Scraping

The amount of insight you can gain by scraping free information from the net is astonishing, especially given how easy it is to scrape. Whatever you want to learn about, there is free information out there (and probably a Python package to help).

We will work on the simplest scenario: pulling files that do not need authentication or cookies. We will ignore the robots.txt rules and pretend that we are a Firefox browser. For instance, this will get you BIST's main site:


In [ ]:
url = "http://bist.eu/"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
content = urlopen(req).read()

If we inspect what we got, it looks like HTML:


In [ ]:
content[:50]

The way we read it, the content is a byte string:


In [ ]:
type(content)

Decode it from UTF-8 to see what it contains in a nicer way:


In [ ]:
print(content.decode("utf-8")[:135])

You can save it to a file so that you do not have to scrape it again. This is important when you scrape millions of files, and the scraping has to be restarted. Have mercy on the webserver, or you risk getting blacklisted.


In [ ]:
with open("bist.html", 'wb') as file:
    file.write(content)
    file.close()
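
If you ever loop over many pages, part of that mercy is to pause between requests. A minimal sketch, with a hypothetical list of URLs and an arbitrary one-second delay:


In [ ]:
import time

urls = ["http://bist.eu/"]  # hypothetical list of pages to fetch
for url in urls:
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    content = urlopen(req).read()
    time.sleep(1)  # wait a little between requests so we do not hammer the server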

You can technically open this in a browser, but it is only the HTML part of the site: the style files, images, and a whole lot of other things that make a page work are missing. Lazy people can launch a browser from Python to see what the page looks like:


In [ ]:
from subprocess import run
run(["firefox", "-new-tab", "bist.html"])

Exercise 1. Copy an image location from the BIST website. Scrape it and display it. For the display, you can import Image from IPython.display, and then use Image(data=content). Technically you could use imshow from Matplotlib, but then you would have to decompress the image first.


In [ ]:

We put everything together in a function that only attempts to download a file if it is not present locally. If you do not specify a filename, it tries to extract it from the URL by taking everything after the last "/" character as the filename.


In [ ]:
def get_file(url, filename=None):
    if filename is None:
        slash = url.rindex("/")
        filename = url[slash+1:]
    if os.path.isfile(filename):
        with open(filename, 'rb') as file:
            content = file.read()
    else:
        req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        content = urlopen(req).read()
        with open(filename, 'wb') as file:
            file.write(content)
    return content
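
As a quick check, calling it on the page we already saved should not hit the network at all (assuming bist.html from the earlier cell is still in the working directory):


In [ ]:
content = get_file("http://bist.eu/", "bist.html")
len(content)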

3. Cleaning structured data

We will integrate several data sources, as this is the most common pattern you encounter in day-to-day data science. We will have two structured sources of data and one semi-structured:

  1. A table of journal titles and matching abbreviations. This will help us standardize journal titles in the data, as some will be given as full titles, others as abbreviations.

  2. A table of journals and matching Impact Factors.

  3. Search results in HTML format that contain the metadata of the papers we want to study.

We start with the easier part, obtaining and cleaning the structured data. Let us get the journal abbreviations:


In [ ]:
abbreviations = get_file("https://github.com/JabRef/abbrv.jabref.org/raw/master/journals/journal_abbreviations_webofscience.txt")

Let's see what it contains:


In [ ]:
print(abbreviations.decode("utf-8")[:1000])

This is fairly straightforward: the file starts with a bunch of comments, then two columns containing the full title and the abbreviation, separated by the "=" character. We can load it with Pandas without thinking much:


In [ ]:
abbs = pd.read_csv("journal_abbreviations_webofscience.txt", sep='=',
                   comment='#', names=["Full Journal Title", "Abbreviation"])

In principle, it looks okay:


In [ ]:
abbs.head()

If we take a closer look at a particular element, we will notice that something is off:


In [ ]:
abbs.iloc[0, 1]

The string starts with an unnecessary white space. We also do not want to miss any possible match because of differences in capitalization. So we strip leading and trailing white space and convert every entry to upper case:


In [ ]:
abbs = abbs.applymap(lambda x: x.upper().strip())

This is done: it was an easy and clean data set.

Let us go for the next one. The Journal Citation Reports (JCR) reduced scientific contributions to a single number, allowing incompetent people to make unfounded judgements about the quality of your research output. So when you encounter a question like this: "I want to know which JCR journal is paid and easy to get publication", you do not even blink an eye. Fortunately, the same thread has a download link for the 2015 report in an XLS file. We go for it:


In [ ]:
get_file("https://www.researchgate.net/file.PostFileLoader.html?id=558730995e9d9735688b4631&assetKey=AS%3A273803718922244%401442291301717",
         "2014_SCI_IF.xlsx");

We can open it in LibreOffice or OpenOffice, using the lazy way again:


In [ ]:
run(["soffice", "2014_SCI_IF.xlsx"])

The first two lines are junk, and there is an index column. Let us load it in Pandas:


In [ ]:
ifs = pd.read_excel("2014_SCI_IF.xlsx", skiprows=2, index_col=0)

If this fails, you do not have the package for reading Excel files. Install it with pip install xlrd (for example, run !pip install xlrd in a notebook cell).

Now this table definitely looks fishy:


In [ ]:
ifs.head()

What are those empty columns? They do not seem to be in the file. The answer lies at the end of the table:


In [ ]:
ifs.tail()

Some stupid multi-column copyright notice at the bottom distorts the entire collection. This is why spreadsheets are the most hated data format for structured data, as the uncontrolled blend of content and formatting lets ignorant people give you a headache.

The journals are listed according to their ranking, in decreasing order. We know that we do not care about journals that do not have an Impact Factor. We take this as a cut-off for rows, and re-read only the rows and columns we are interested in:


In [ ]:
skip = len(ifs) - ifs["Journal Impact Factor"].last_valid_index() + 1
ifs = pd.read_excel("2014_SCI_IF.xlsx", skiprows=2, index_col=0,
                    skip_footer=skip, parse_cols="A,B,E")

This looks much better:


In [ ]:
ifs.head()

But not perfect:


In [ ]:
ifs.sample(n=10)

Now you see why we converted everything to upper case before.

Exercise 2. Convert everything in the column "Full Journal Title" to upper case. You will encounter a charming inconsistency in Pandas.


In [ ]:

4. Cleaning and filtering of semi-structured data

The horror starts when even more control is given to humans who create the data. We move on to studying the search results for the authors Lewenstein and Acín on arXiv. The metadata is entered by humans, which makes it inconsistent and chaotic.

The advanced search on arXiv is simple and handy. The search URL has a well-defined structure that we can exploit. We create a wrapper function around our scraping engine, which gets the first 400 results for a requested author and returns a parsed HTML file. The library BeautifulSoup does the parsing: it understands the hierarchical structure of the file and allows you to navigate the hierarchy with a handful of convenience functions.


In [ ]:
def get_search_result(author):
    filename = author + ".html"
    url = "https://arxiv.org/find/all/1/au:+" + author + "/0/1/0/all/0/1?per_page=400"
    page = get_file(url, filename)
    return BeautifulSoup(page, "html.parser")

It is a common pattern to exploit search URLs to get what you want. Many sites do not let you pull too many results in a single page, so you have to step through the "Next" links over and over again to get all the results you want. This is not difficult either, but it has the extra complication that you have to extract the "Next" link and follow it, typically recursively.
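
Below is a minimal sketch of that recursive pattern. It is not tailored to arXiv's markup: it assumes the next-page link can be recognized by its link text containing "Next", which you would have to adapt to the site at hand.


In [ ]:
try:
    from urlparse import urljoin           # Python 2
except ImportError:
    from urllib.parse import urljoin       # Python 3


def get_all_pages(url, pages=None):
    # Collect the current page, then follow the "Next" link recursively.
    if pages is None:
        pages = []
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(urlopen(req).read(), "html.parser")
    pages.append(soup)
    next_link = soup.find("a", text=re.compile("Next"))  # assumption: the link text says "Next"
    if next_link is not None and next_link.get("href"):
        return get_all_pages(urljoin(url, next_link["href"]), pages)
    return pages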

Let us study the results:


In [ ]:
lewenstein = get_search_result("Lewenstein_M")

We can view the source in a browser:


In [ ]:
run(["firefox", "-new-tab", "view-source:file://" + os.getcwd() + "/Lewenstein_M.html"])

Skipping the boring header part, the source code of the search result looks like this:

<h3>Showing results 1 through 389 (of 389 total) for 
<a href="/find/all/1/au:+Lewenstein_M/0/1/0/all/0/1?skip=0&amp;query_id=ff0631708b5d0dd5">au:Lewenstein_M</a></h3>
<dl>
<dt>1.  <span class="list-identifier"><a href="/abs/1703.09814" title="Abstract">arXiv:1703.09814</a> [<a href="/pdf/1703.09814" title="Download PDF">pdf</a>, <a href="/ps/1703.09814" title="Download PostScript">ps</a>, <a href="/format/1703.09814" title="Other formats">other</a>]</span></dt>
<dd>
<div class="meta">
<div class="list-title mathjax">
<span class="descriptor">Title:</span> Efficient Determination of Ground States of Infinite Quantum Lattice  Models in Three Dimensions
</div>
<div class="list-authors">
<span class="descriptor">Authors:</span> 
<a href="/find/cond-mat/1/au:+Ran_S/0/1/0/all/0/1">Shi-Ju Ran</a>, 
<a href="/find/cond-mat/1/au:+Piga_A/0/1/0/all/0/1">Angelo Piga</a>, 
<a href="/find/cond-mat/1/au:+Peng_C/0/1/0/all/0/1">Cheng Peng</a>, 
<a href="/find/cond-mat/1/au:+Su_G/0/1/0/all/0/1">Gang Su</a>, 
<a href="/find/cond-mat/1/au:+Lewenstein_M/0/1/0/all/0/1">Maciej Lewenstein</a>
</div>
<div class="list-comments">
<span class="descriptor">Comments:</span> 11 pages, 9 figures
</div>
<div class="list-subjects">
<span class="descriptor">Subjects:</span> <span class="primary-subject">Strongly Correlated Electrons (cond-mat.str-el)</span>; Computational Physics (physics.comp-ph)

</div>
</div>
</dd>

This is the entire first result. It might look intimidating, but as long as you know that in HTML an element starts with an opening tag <whatever> and ends with </whatever>, you will find regular and hierarchical patterns. If you stare hard enough, you will see that the <dd> tag contains most of the information we want: it has the authors, the title, the journal reference if there is one (not in the example shown), and the primary subject. It does not actually matter what <dd> means: we are not writing a browser, we are scraping data.

As a quick sanity check, we can easily extract the titles and verify that their number matches the number of search results. The lewenstein object is an instance of the BeautifulSoup class, which has methods to find all occurrences of a given tag. We use this to find the titles:


In [ ]:
titles = []
for dd in lewenstein.find_all("dd"):
    titles.append(dd.find("div", class_="list-title mathjax"))
len(titles)

So far so good, although the titles do not really look like what we expect:


In [ ]:
titles[0]

We define a helper function to extract the title:


In [ ]:
def extract_title(title):
    start = title.index(" ")
    return title[start+1:-1]

titles = []
for dd in lewenstein.find_all("dd"):
    titles.append(extract_title(dd.find("div", class_ = "list-title mathjax").text))
titles[0]

The next problem we face is that not all of these papers belong to Maciej Lewenstein: some impostors have the same abbreviated name, M. Lewenstein. They are easy to detect if they use their non-abbreviated names. Let us run through the page again, noting which subjects the impostors publish in. For this, let us introduce another auxiliary function that extracts the short name of the subject. We also note the primary subject when the abbreviated form of the name appears. We use another function to drop "." and collapse runs of white space into a single space. We will use a simple regular expression to find the candidate Lewensteins.


In [ ]:
def extract_subject(long_subject):
    start = long_subject.index("(")
    return long_subject[start+1:-1]


def drop_punctuation(string):
    result = string.replace(".", " ")
    return " ".join(result.split())

true_lewenstein = ["Maciej Lewenstein", "M Lewenstein"]
impostors = set()
primary_subjects = set()
for dd in lewenstein.find_all("dd"):
    div = dd.find("div", class_="list-authors")
    subject = extract_subject(dd.find("span", class_ = "primary-subject").text)
    names = [drop_punctuation(a.text) for a in div.find_all("a")]
    for name in names:
        if re.search("M.* Lewenstein", name):
            if name not in true_lewenstein:
                impostors.add(name + " " + subject)
            elif "Maciej" not in name:
                primary_subjects.add(subject)
print(impostors)
print(primary_subjects)

So it is only one person, and we can be reasonably confident that Maciej Lewenstein is unlikely to publish in these subjects. The other good news is that all the short forms of the name belong to physics papers, and not computer science. Armed with this knowledge, we can filter out the correct manuscripts.

We need to filter one more thing: we are only interested in papers for which the journal reference is given. Further digging in the HTML code lets us find the correct tag. While we are putting together the correct records, we also normalize his name.

Annoyingly, the <dd> tag we have been focusing on does not contain the arXiv ID or the year. We zip the main loop's iterator with the <dt> tags, because only these contain the arXiv ID, from which the year can be extracted. Each <dt> always comes paired with a <dd> tag, completing the metadata of the manuscripts.

We define yet another set of auxiliary functions to extract everything we need. The routine for extracting the journal title already performs the stripping and converting to upper case. We assume that the name of the journal is everything in the journal reference before the first digit (which is probably the volume or some other bibliographic information).


In [ ]:
def extract_journal(journal):
    start = journal.index(" ")
    raw = journal[start+1:-1]
    m = re.search(r"\d", raw)
    return drop_punctuation(raw[:m.start()]).strip().upper()


def extract_title(title):
    start = title.index(" ")
    return title[start+1:-1]


def extract_id_and_year(arXiv):
    start = arXiv.index(":")
    if "/" in arXiv:
        year_index = arXiv.index("/")
    else:
        year_index = start
    year = arXiv[year_index+1:year_index+3]
    if year[0] == "9":
        year = int("19" + year)
    else:
        year = int("20" + year)
    return arXiv[start+1:], year

papers = []
for dd, dt in zip(lewenstein.find_all("dd"), lewenstein.find_all("dt")):
    id_, year = extract_id_and_year(dt.find("a", attrs={"title": "Abstract"}).text)
    div = dd.find("div", class_="list-authors")
    subject = extract_subject(dd.find("span", class_ = "primary-subject").text)
    journal = dd.find("div", class_ = "list-journal-ref")
    if journal:
        names = [drop_punctuation(a.text) for a in div.find_all("a")]
        for i, name in enumerate(names):
            if re.search("M.* Lewenstein", name):
                if name not in true_lewenstein:
                    break
                else:
                    names[i] = "Maciej Lewenstein"
        else:
            papers.append([id_, extract_title(dd.find("div", class_ = "list-title mathjax").text),
                           names, subject, year, extract_journal(journal.text)])

We would be almost done if journal names were all entered the same way. Of course they were not. Let us try to standardize them: this is where we first combine two sources of data. We use the journal abbreviations table to replace abbreviations with the full journal title in our paper collection.


In [ ]:
for i, paper in enumerate(papers):
    journal = paper[-1]
    long_name = abbs[abbs["Abbreviation"] == journal]
    if len(long_name) > 0:
        papers[i][-1] = long_name["Full Journal Title"].values[0]

There will still be some rotten apples:


In [ ]:
def find_rotten_apples(paper_list):
    rotten_apples = []
    for paper in paper_list:
        match = ifs[ifs["Full Journal Title"] == paper[-1]]
        if len(match) == 0:
            rotten_apples.append(paper[-1])
    return sorted(rotten_apples)

rotten_apples = find_rotten_apples(papers)
rotten_apples

Now you start to feel the pain of being a data scientist. The sloppiness of manual data entry is unbounded. Your duty is to clean up this mess. A quick fix is to tinker with the drop_punctuation function so that our journal strings better match the JCR naming conventions. Then go through the creation of the papers array and the standardization again.


In [ ]:
def drop_punctuation(string):
    result = string.replace(".", " ")
    result = result.replace(",", " ")
    result = result.replace("(", " ")
    result = result.replace(": ", "-")
    return " ".join(result.split())

Exercise 3. Cut the number of rotten apples in half by defining a replacement dictionary and doing another round of standardization.


In [ ]:
len(rotten_apples)

It is the same drill with our other contender, except that the short version of his name is uniquely his. On the other hand, that single accent in the surname introduces N+1 spelling variants, which we should standardize.


In [ ]:
acin = get_search_result("Acin_A")
for dd, dt in zip(acin.find_all("dd"), acin.find_all("dt")):
    id_, year = extract_id_and_year(dt.find("a", attrs={"title": "Abstract"}).text)
    div = dd.find("div", class_="list-authors")
    subject = extract_subject(dd.find("span", class_ = "primary-subject").text)
    journal = dd.find("div", class_ = "list-journal-ref")
    if journal:
        names = [drop_punctuation(a.text) for a in div.find_all("a")]
        journal = extract_journal(journal.text)
        long_name = abbs[abbs["Abbreviation"] == journal]
        if len(long_name) > 0:
            journal = long_name["Full Journal Title"].values[0]
        papers.append([id_, extract_title(dd.find("div", class_ = "list-title mathjax").text),
                       names, subject, year, journal])
for paper in papers:
    names = paper[2]
    for i, name in enumerate(names):
        if re.search("A.* Ac.n", name):
            names[i] = "Antonio Acín"

In [ ]:
rotten_apples = find_rotten_apples(papers)
rotten_apples

Finally, we do another combination of sources. We merge the data set with the table of Impact Factors. The merge will be done based on the full journal titles.


In [ ]:
db = pd.merge(pd.DataFrame(papers, columns=["arXiv", "Title", "Authors", "Primary Subject", "Year", "Full Journal Title"]),
              ifs, how="inner", on=["Full Journal Title"])

Since the two of them co-authored papers, there are duplicates. They are easy to drop, since they share the same arXiv ID:


In [ ]:
db = db.drop_duplicates(subset="arXiv")

Notice that we lost a few papers:


In [ ]:
print(len(papers), len(db))

This can happen for one of two reasons: the papers were published in journals that are not in the JCR (very unlikely), or we failed to match the full journal name (very likely). It is getting tedious, so we leave it here for now.
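
If you are curious exactly which journal titles failed to match, a quick diagnostic is to compare the titles in our papers list against the merged table. A sketch:


In [ ]:
matched = set(db["Full Journal Title"])
sorted(set(paper[-1] for paper in papers if paper[-1] not in matched))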

5. Visual analysis

Since we only focus on two authors, we can add an extra column to identify who the key author is. We also care about the co-authored papers.


In [ ]:
def identify_key_authors(authors):
    if "Maciej Lewenstein" in authors and "Antonio Acín" in authors:
        return "AAML"
    elif "Maciej Lewenstein" in authors:
        return "ML"
    else:
        return "AA"

db["Group"] = db["Authors"].apply(lambda x: identify_key_authors(x))

Let's start plotting distributions:


In [ ]:
groups = ["AA", "ML", "AAML"]
fig, ax = plt.subplots(ncols=1)
for group in groups:
    data = db[db["Group"] == group]["Journal Impact Factor"]
    sns.distplot(data, kde=False, label=group)
ax.legend()
ax.set_yscale("log")
plt.show()

The logarithmic scale makes the raw number of papers appear more balanced, which is fair given the difference in age between the two authors. A single Nature paper makes a great outlier:


In [ ]:
db[db["Journal Impact Factor"] == db["Journal Impact Factor"].max()]

Actually, Toni has another Nature paper, but that is not on arXiv yet. Not all of Maciej's papers are on arXiv either, especially not the old ones.

We can do the same plots with subjects:


In [ ]:
subjects = db["Primary Subject"].drop_duplicates()
fig, ax = plt.subplots(ncols=1)
for subject in subjects:
    data = db[db["Primary Subject"] == subject]["Journal Impact Factor"]
    sns.distplot(data, kde=False, label=subject)
ax.legend()
ax.set_yscale("log")
plt.show()

You are safe with quant-ph and quantum gases, but steer clear of atomic physics. It is amusing to restrict the histogram to Professor Acín's subset:


In [ ]:
fig, ax = plt.subplots(ncols=1)
for subject in subjects:
    data = db[(db["Primary Subject"] == subject) & (db["Group"] != "ML")]
    if len(data) > 1:
        sns.distplot(data["Journal Impact Factor"], kde=False, label=subject)
ax.legend()
ax.set_yscale("log")
plt.show()

His topics are somewhat predictable.

Let's add one more column to indicate the number of authors:


In [ ]:
db["#Authors"] = db["Authors"].apply(lambda x: len(x))
sns.stripplot(x="#Authors", y="Journal Impact Factor", data=db)

How about the length of the title? Or the number of words in the title?


In [ ]:
db["Length of Title"] = db["Title"].apply(lambda x: len(x))
db["Number of Words in Title"] = db["Title"].apply(lambda x: len(x.split()))
fig, axes = plt.subplots(ncols=2, figsize=(12, 5))
sns.stripplot(x="Length of Title", y="Journal Impact Factor", data=db, ax=axes[0])
sns.stripplot(x="Number of Words in Title", y="Journal Impact Factor", data=db, ax=axes[1])
plt.show()

Do our authors maintain IF over time?


In [ ]:
fig, axes = plt.subplots(nrows=2, figsize=(10, 5))
data = db[db["Group"] != "ML"]
sns.stripplot(x="Year", y="Journal Impact Factor", data=data, ax=axes[0])
axes[0].set_title("AA")
data = db[db["Group"] != "AA"].sort_values(by="Year")
sns.stripplot(x="Year", y="Journal Impact Factor", data=data, ax=axes[1])
axes[1].set_title("ML")
plt.tight_layout()
plt.show()

If anything, the IF improved over time. Perhaps joining ICFO had something to do with it.

Completely absurd exercise 1. Find out if there is any correlation between the non-date part of the arXiv ID and the Impact Factor. For IDs that contain a "/", it is the last three digits. For newer IDs, it is everything after the ".".


In [ ]:

Homework. Extend the analysis to the citation numbers of individual papers. Scrape the data from Google Scholar (there is a package for that: scholar.py). To make it more accurate, you should extend the data by scraping the DOI from the paper metadata. This means cycling over links that you can generate from the arXiv IDs to get the matching page from arXiv.
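
As a starting point, the abstract-page links are easy to generate from the arXiv IDs we already collected (the /abs/ path shows up in the search results above); scraping the DOIs and citation counts out of those pages is the actual homework:


In [ ]:
abs_urls = ["https://arxiv.org/abs/" + paper[0] for paper in papers]
abs_urls[:3]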