In this notebook you will create a corpus of documents for further processing with LDA routines. To do so, you will follow these steps:
Before implementing this section, you should carry out the following tasks in your web browser:
Now you are ready to download a feed with information of your choice. The following fragment of code builds a URL for your query and retrieves the desired data into a string. Fill in the required information, and make sure that a single request downloads 1000 documents of the category of your choice. Note that, when accessing the API, the documents are sorted according to relevance.
In [8]:
import urllib
# The following line should be used to define your query. You can download
# from a particular category, all documents that contain a particular word, etc ...
s_q = 'search_query=cat:cs.CV'
m_r = 'max_results=1000'
url = 'http://export.arxiv.org/api/query?' + s_q + '&' + m_r
#Uncomment the following line if you want to check the correctness of your url in a web browser
#print url
#Uncomment the following line when you are done with the design of the query, and want to download
#the query to your python variable 'data'
#data = urllib.urlopen(url).read()
print str(len(data)) + ' characters retrieved'
The following piece of code shows how we can access the relevant information. First, we create a feed object so that the relevant fields are easier to reach. With the provided code, you can check that there are 1000 entries, each one corresponding to one paper, and that the publication date, title, and abstract of each entry can also be easily accessed.
In [17]:
import feedparser
feed = feedparser.parse(data)
#in feed.entries we have one object per each retrieved paper
print str(len(feed.entries))
#We now show how to access the title, publication date, and abstract of the paper
print feed.entries[0].title
print feed.entries[0].summary
print feed.entries[0].published
To complete this section, you are requested to create two lists, 'dates' and 'abstracts', keeping in them the dates and abstracts of the papers retrieved by the query. Note that Python has other structures (e.g., tuples) that could store this information in a more organized way; for simplicity, here we will just use lists.
Complete the required piece of code. The solution is provided in the next block, but try not to cheat and come up with a working implementation yourself.
In [18]:
dates = list()
abstracts = list()
#for ...
In [36]:
## Solution ####
## Do not cheat ####
#dates = list()
#abstracts = list()
#for entry in feed.entries:
#    dates.append(entry.published[0:4])
#    abstracts.append(entry.summary)
Before proceeding further, it is necessary to preprocess the text in the abstracts to perform the following tasks:
To perform these tasks we will use the functions implemented in the 'Natural Language Toolkit' (nltk) module. In the following we show how each of these tasks can be carried out for a single abstract using the available functions and dictionaries. Please note that the code below is not very 'pythonic': the implementations are kept as simple as possible and rely only on the basic functionality introduced in the previous notebook, although smarter Python implementations are possible and preferred.
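Before running the examples below, note that the tokenizer model and the stopword lists used by nltk must be downloaded once. If they are not yet available on your machine, the following cell (a minimal sketch assuming a standard nltk installation) will fetch them:
In [ ]:
import nltk
#Download the tokenizer model used by word_tokenize and the stopword lists (one-time step)
nltk.download('punkt')
nltk.download('stopwords')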
We start by splitting the text into words (tokens), keeping only those that contain just alphanumeric characters:
In [54]:
text = abstracts[0]
#From NLTK we import a function that splits the text into words (tokens)
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
#Next, we create a list keeping only alphanumeric tokens and removing capital letters
tokenize_text = list()
for token in tokens:
    if token.isalnum():
        tokenize_text.append(token.lower())
print tokenize_text
To show how concise Python can be, note that the following 'one-liner' implements exactly the same functionality as the previous for loop and if condition:
In [59]:
tokenize_text = [token.lower() for token in tokens if token.isalnum()]
print tokenize_text
We can also use nltk resources to perform stemming and to remove stopwords:
In [60]:
import nltk.stem
s = nltk.stem.SnowballStemmer('english')
from nltk.corpus import stopwords
eng_stopwords = stopwords.words('english')
clean_text = list()
for token in tokenize_text:
    stem_token = s.stem(token)
    if stem_token not in eng_stopwords:
        clean_text.append(stem_token)
print clean_text
Finally, we can convert the list of words back to a single string using the 'join' command as follows:
In [61]:
clean_text = ' '.join(clean_text)
print clean_text
This representation has lost all grammatical information, but keeps the semantic meaning of the original text through the concatenation of tokens. In other words, we assume that the semantic meaning of the abstract can somehow be decomposed into the sum of the semantic meanings of the individual tokens, neglecting other information such as the order of the words.
This representation is known as 'bag of words', and is the basis of most algorithms for topic modeling.
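To make the idea concrete, a bag of words simply records each distinct token together with the number of times it appears, discarding word order. The following cell is a minimal sketch (using the 'clean_text' string obtained above and the standard collections module) that counts the token occurrences of the processed abstract:
In [ ]:
from collections import Counter
#Count how many times each token appears in the processed abstract
bow = Counter(clean_text.split())
#Show the five most frequent tokens and their counts
print bow.most_common(5)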
Apply all previous steps to process the abstracts of all papers in the list 'abstracts'. Save the processed abstracts in a new list named 'clean_abstracts'
In [62]:
clean_abstracts = list()
In [64]:
## Solution ####
## Do not cheat ####
from nltk.tokenize import word_tokenize
import nltk.stem
s = nltk.stem.SnowballStemmer('english')
from nltk.corpus import stopwords
eng_stopwords = stopwords.words('english')
clean_abstracts = list()
for abstract in abstracts:
    tokens = word_tokenize(abstract)
    tokens = [s.stem(token.lower()) for token in tokens if token.isalnum()]
    tokens = [token for token in tokens if token not in eng_stopwords]
    clean_abstracts.append(' '.join(tokens))
In [66]:
print clean_abstracts[0:3]
We are almost ready to start using tools for the automatic detection of topics. Before doing so, however, we need to create the text files required by the software we will use to perform Latent Dirichlet Allocation (LDA). Among the different available implementations, we will use the C code provided by the group of Dr. David Blei: https://www.cs.princeton.edu/~blei/topicmodeling.html
To start with, we will create the following two files:
A file 'corpusname_corpus.txt' that contains one line per document in the corpus; each line contains the text of the document ('\n' marks the end of each document and is therefore not allowed inside the text of a document, but we have already taken care of this).
If we want to train dynamic models, it is also necessary to create a second file, 'corpusname-seq.dat'. Its first line contains the total number of documents, and each subsequent line indicates the number of documents in each time slot (here, one slot per publication year). This implies that the sum of the numbers from the second line to the end must equal the total number of documents. Note that, when using dynamic models, we assume that the documents appear in 'corpusname_corpus.txt' in chronological order. An illustrative example of this file is shown below.
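For illustration, with made-up numbers, if the 1000 retrieved papers were spread over three publication years with 180, 350, and 470 papers respectively, 'corpusname-seq.dat' would contain:
1000
180
350
470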
Assuming that you have already created the lists 'dates' and 'clean_abstracts', the following piece of code will create the required files. Go through the code and make sure you understand the whole fragment. Again, note that the code is not very efficient; much more efficient code could be written by using the full potential of the Python language.
In [87]:
corpusname = 'arxiv_v0'
# We first sort the lists according to dates by zipping the lists together, and then unzipping after sorting
zipped_list = zip(dates, clean_abstracts)
zipped_list.sort()
dates = [el[0] for el in zipped_list]
clean_abstracts = [el[1] for el in zipped_list]
# We create the file with the corpus
f = open(corpusname+'_corpus.txt', 'wb')
for abstract in clean_abstracts:
    f.write(abstract+'\n')
f.close()
# We create the file for the dynamic model
sorted_unique_dates = sorted(list(set(dates)))
f = open(corpusname+'-seq.dat','wb')
f.write(str(len(clean_abstracts))+'\n')
for date in sorted_unique_dates:
    f.write(str(dates.count(date))+'\n')
f.close()
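As a quick sanity check (a minimal sketch, assuming the two files were just written in the current working directory), you can verify that the per-slot counts in 'corpusname-seq.dat' add up to the declared total, and that the corpus file contains one line per document:
In [ ]:
#Read back the seq file and check that the per-slot counts sum to the declared total
lines = open(corpusname+'-seq.dat').read().split()
print 'Total declared: ' + lines[0]
print 'Sum of slots: ' + str(sum(int(n) for n in lines[1:]))
#Check that the corpus file contains one line per document
print 'Lines in corpus file: ' + str(sum(1 for line in open(corpusname+'_corpus.txt')))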
Actually, the format of the input files for Dr. Blei's software is slightly more complicated:
The data is a file where each line is of the form:
[M] [term_1]:[count] [term_2]:[count] ... [term_N]:[count]
where [M] is the number of unique terms in the document, and the
[count] associated with each term is how many times that term appeared
in the document. Note that [term_1] is an integer which indexes the
term; it is not a string.
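As a small worked example (with made-up vocabulary indices), suppose a document reduces to the tokens 'imag imag segment network imag', and that 'imag', 'segment' and 'network' have indices 0, 7 and 12 in the vocabulary. The document then contains 3 unique terms, and its line in the data file would be:
3 0:3 7:1 12:1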
The following function works on the file 'corpusname_corpus.txt' and generates the following two files: 'corpusname_vocab.txt', with one vocabulary term per line, and 'corpusname-mult.dat', with the corpus in the format described above:
In [ ]:
def format_conversion(corpusname):
    from gensim import corpora
    #We start by creating the vocabulary file
    dictionary = corpora.Dictionary(
        line.lower().split()
        for line in open(corpusname+'_corpus.txt'))
    #Remove words that appear in less than no_below documents, or in more than
    #no_above (as a fraction of the corpus), and keep at most the keep_n most frequent terms
    dictionary.filter_extremes(no_below=4, no_above=0.5, keep_n=10000)
    #We generate the vocabulary file, with one term per line
    with open(corpusname + '_vocab.txt','wb') as f:
        for idx in range(len(dictionary)):
            f.write(dictionary[idx]+'\n')
    #We create now an iterable corpus (memory friendly)
    class MyCorpus(object):
        def __iter__(self):
            for line in open(corpusname+'_corpus.txt'):
                yield dictionary.doc2bow(line.lower().split())
    corpus_gensim = MyCorpus()
    #And generate the file with the format required by Blei's LDA implementation
    with open(corpusname + '-mult.dat','wb') as f:
        for docbow in corpus_gensim:
            docstr = ' '.join([str(el[0])+':'+str(el[1]) for el in docbow])
            f.write(str(len(docbow))+' '+docstr+'\n')
To use the function, execute the following command:
In [2]:
format_conversion(corpusname)
Verify that you can find the following files in your working directory: 'corpusname_corpus.txt', 'corpusname-seq.dat', 'corpusname_vocab.txt', and 'corpusname-mult.dat' (with 'corpusname' replaced by the value you used, e.g., 'arxiv_v0').
Check also that the format of these files is as expected. Once you are done, you can proceed to the next section.
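As a final check, the following cell (a minimal sketch, assuming the files were generated in the current working directory and that 'corpusname' still holds the value used above) prints the first line of each generated file so that you can inspect their formats:
In [ ]:
#Print the first line of each generated file to inspect its format
for suffix in ['_corpus.txt', '-seq.dat', '_vocab.txt', '-mult.dat']:
    filename = corpusname + suffix
    with open(filename) as f:
        print filename + ': ' + f.readline().strip()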
In [ ]: