Notebook for preprocessing NYT op-ed data

Goal: Emily & Greg go through the NLP preprocessing pipeline for two data sets in parallel.


In [1]:
import pandas as pd
import numpy as np

import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer

from sklearn.feature_extraction.text import CountVectorizer

1. Read in the Data


In [3]:
names = range(1,14)
df_list = []

for name in names:
    csvfile = '/Users/emilyhalket/Desktop/NLP_NYT/datafiles/{0}_100.csv'.format(name)
    df = pd.read_csv(csvfile)
    df_list.append(df)

article_df = pd.concat(df_list)
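
As a quick check on the loop above: `range(1, 14)` stops before 14, so it covers thirteen input files. The sketch below only builds the file names and reads nothing from disk. The `datafiles/` prefix is a shortened, illustrative stand-in for the full directory path used above.

```python
# Build the same file names as the loop above, without touching the disk.
names = range(1, 14)  # 1 through 13; the upper bound is exclusive

# 'datafiles/' is an illustrative stand-in for the full directory path
csvfiles = ['datafiles/{0}_100.csv'.format(name) for name in names]

print(len(csvfiles))   # number of files the loop would read
print(csvfiles[0])     # first file name
print(csvfiles[-1])    # last file name
```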

In [3]:
article_df = article_df[pd.notnull(article_df['full_text'])]

In [4]:
article_df.shape


Out[4]:
(11570, 11)

After dropping rows with missing text, this dataset has 11,570 op-eds from the NY Times. We have additional information for each article (title, author, number of comments, etc.), but for now we will focus on the text data.

2. Tokenize

For my analysis, I treat each article as a separate document. I do not need to retain punctuation information, so I remove punctuation during preprocessing.
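
To see what removing punctuation means in practice, here is a minimal sketch using only the standard library. It assumes that `re.findall(r'\w+', ...)` is a fair stand-in for the `RegexpTokenizer(r'\w+')` used in the preprocessing function. The helper name `simple_tokenize` and the sample sentence are made up for illustration.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and return runs of word characters, dropping punctuation."""
    return re.findall(r'\w+', text.lower())

sample = "Mr. Rajoy's party won 40% of the vote."  # made-up example sentence
tokens = simple_tokenize(sample)
print(tokens)
```

Note that the apostrophe splits "Rajoy's" into two tokens, `rajoy` and `s`, and that numbers such as `40` survive tokenization. They are only dropped by the later filtering steps.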


In [5]:
def preprocess_article_content(text_df):
    """Tokenize, filter, and stem each article in text_df['full_text']."""

    print('preprocessing article text...')

    # text_df is the article data frame; column 'full_text' holds the text of each article
    article_list = []

    tokenizer = RegexpTokenizer(r'\w+')  # keeps runs of word characters, drops punctuation
    stop_words = set(stopwords.words('english'))  # can add more stop words to this set

    stemmer = SnowballStemmer('english')

    kept_rows = []  # positions of the rows whose articles were processed

    for row, article in enumerate(text_df['full_text']):

        # byte strings (as read under Python 2) are decoded to unicode first
        if isinstance(article, bytes):
            article = article.decode('utf-8')

        cleaned_tokens = []

        tokens = tokenizer.tokenize(article.lower())

        for token in tokens:

            if token not in stop_words:

                if len(token) < 20:  # drops very long tokens, which are rarely real words

                    # drops tokens that start or end with a digit
                    if not token[0].isdigit() and not token[-1].isdigit():

                        stemmed_token = stemmer.stem(token)
                        cleaned_tokens.append(stemmed_token)

        article_list.append(' '.join(cleaned_tokens))
        kept_rows.append(row)

    print('preprocessed content for %d articles' % len(article_list))

    return article_list, kept_rows
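
The three token filters in the function can be tried on their own. The sketch below is illustrative only: `TOY_STOP_WORDS` is a tiny hypothetical stand-in for NLTK's English stop-word list, `keep_token` is a helper name introduced here, and stemming is left out so the example runs with the standard library alone.

```python
# Tiny hypothetical stop-word set; the notebook uses NLTK's English list instead.
TOY_STOP_WORDS = {'the', 'of', 'a', 'in'}

def keep_token(token, stop_words=TOY_STOP_WORDS, max_len=20):
    """Return True if a token passes the stop-word, length, and digit filters."""
    if token in stop_words:
        return False
    if len(token) >= max_len:  # very long tokens are rarely real words
        return False
    if token[0].isdigit() or token[-1].isdigit():  # starts or ends with a digit
        return False
    return True

tokens = ['the', 'vote', 'of', '2015', 'g20', 'catalonia', 'x' * 25]
kept = [t for t in tokens if keep_token(t)]
print(kept)
```

A token with a digit only in the middle, such as `a1b`, passes the digit filter, since only the first and last characters are checked.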

In [10]:
article_list, kept_rows = preprocess_article_content(article_df)


preprocessing article text...
preprocessed content for 11570 articles

In [12]:
len(article_list)


Out[12]:
11570

In [14]:
article_list[2000]


Out[14]:
u'long deni referendum independ akin held scotland quebec catalan separatist proclaim sunday elect region parliament would de facto plebiscit secess spain result howev muddl issu separatist parti win major seat fail win major vote would requir referendum amount mandat creat new nation still vote tell conserv govern prime minist mariano rajoy better start heed catalan like scottish quebec separatist catalan separatist convinc prosper region would better far certain especi sinc membership european union alreadi face major challeng uniti hard guarante also certainti european central bank would continu fund catalan bank hit euro debt crisi given clear choic whether go alon scot quebec pull back scot year ago quebec catalan howev abl make choic spain constitut enshrin indissolubl uniti mr rajoy use block discuss self determin catalonia deni referendum separatist leader artur mas pledg start process toward independ parti won elect togeth yes coalit fail win major seat join forc far left separatist parti oppos mr mas leadership form major region parliament even togeth separatist parti poll percent vote result empow separatist unilater break spain deni catalan choic matter certain deepen nationalist feel best cours separatist use strength seek control affair madrid mr rajoy recogn even within constraint spanish constitut plenti room discuss accommod catalan yearn'
