Text Analysis

Notebook by: Juan Shishido

In this notebook, I'll start cleaning the text columns and, more importantly, thinking about how to classify and group the data within them. Consider using n-grams for word occurrence (see the sketch in the Words to Features section below).

Imports


In [1]:
import re
import random
import lda  # used in the LDA section below
import csv
import numpy as np
import pandas as pd
from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

Load


In [2]:
df = pd.read_csv('../../data/cleaned/UCB_dept_merge.csv')

In [3]:
df['product_line'] = df['supplier_name'] + ' ' + \
                     df['product_description'] + ' ' + \
                     df['manufacturer']

Account for NaNs here: if any of the three component columns is NaN, the concatenated product_line is also NaN (see row 4 below). The Transform step replaces these with empty strings.
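A minimal sketch of an alternative, assuming the goal is simply to avoid NaN product_line rows up front rather than fixing them afterwards:

# Hypothetical alternative: fill NaNs before concatenating, so product_line
# is never NaN even when a component column is missing.
parts = ['supplier_name', 'product_description', 'manufacturer']
df['product_line'] = df[parts].fillna('').apply(' '.join, axis=1)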

Random Sample: 40,000


In [4]:
random.seed(8675309)
rows = random.sample(df.index, 40000)
df = df.ix[rows].reset_index()

In [5]:
df.head()


Out[5]:
index po_id po_num creation_date supplier_name item_type product_description manufacturer quantity unit_price department buyer__first_name buyer__last_name po_closed_date department_name spend product_line
0 245817 35963640 BB00319075 2014-02-11 00:00:00 FISHER SCIENTIFIC SQ Hosted Product Molecular BioProducts Electroporation Cuvettes... Molecular BioProducts 2 109.07 NaN Jon Kuchenreuther 2014-07-16 00:00:00 OOEBI Depolymerization 767.03 FISHER SCIENTIFIC Molecular BioProducts Electr...
1 311817 27067475 BB00134494 2013-02-04 00:00:00 OFFICE MAX SQ Hosted Product Tablemate - Balloons, Color Assorted TABLE MATE PRODUCTS INC 1 16.81 NaN William Wolf 2013-04-22 00:00:00 OOEBI Depolymerization 18.32 OFFICE MAX Tablemate - Balloons, Color Assorte...
2 405620 30901583 BB00218407 2013-07-16 00:00:00 GRAINGER INC PunchOut Product CS2460120 Split Lock Washer, Spring Steel, Zin... GRAINGER APPROVED VENDOR 1 2.46 NaN MIKE COURTER 2013-10-02 00:00:00 FJPPS BM Plumbing Shop 1890.47 GRAINGER INC CS2460120 Split Lock Washer, Spri...
3 525992 24759795 BB00085067 2012-10-11 00:00:00 Thorlabs Inc NonCatalog Product SM05 Threaded Kinematic Cage Mount O1/2" Optics. Thorlabs 1 82.00 NaN Michael Hohensee 2013-03-01 00:00:00 PHYSI SS Atomic & Molecular 4783.54 Thorlabs Inc SM05 Threaded Kinematic Cage Moun...
4 171686 35510783 BB00309841 2014-01-24 00:00:00 FISHER SCIENTIFIC NonCatalog Product 10 mL serological pipettes 200/case NaN 3 29.20 NaN Philip Nguyen 2014-10-30 00:00:00 IMMCB BH Research 745.20 NaN

Transform

This code:

  • replaces NaNs with empty strings
  • converts all text to lowercase
  • removes URLs
  • replaces forward slashes not associated with fractions with a single space
  • replaces characters that are not alphanumeric, periods, percent signs, or forward slashes with a single space
  • removes periods (the pattern matches runs of one or more)
  • removes single characters, letters or digits, with spaces on either side
  • collapses runs of whitespace into a single space and trims the ends

In a later step, numbers not associated with a percent sign are removed.


In [6]:
cols = ['supplier_name', 'item_type', 'product_description', 'manufacturer', 
        'buyer__first_name', 'buyer__last_name', 'department_name', 'product_line']

In [7]:
for col in cols:
    df[col] = df[col].replace(np.nan, '', regex=True)                          \
                     .apply(lambda x: x.lower())                               \
                     .apply(lambda x: re.sub(r'(http\S*|www\S*)', '', x))      \
                     .apply(lambda x: re.sub(r'((?<=\D)/|/(?=\D))', ' ', x))   \
                     .apply(lambda x: re.sub(r'[^A-Za-z0-9.%/]+', ' ', x))     \
                     .apply(lambda x: re.sub(r'\.+', '', x))                   \
                     .apply(lambda x: re.sub(r'(?<=\s)\w(?=\s)', '', x))       \
                     .apply(lambda x: re.sub(r'\s+', ' ', x).strip())
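As a sanity check, applying the same chain to a single string, the Thorlabs description from row 3 above, reproduces the cleaned value seen in the next output (standalone sketch of the cell above):

def clean(x):
    x = x.lower()
    x = re.sub(r'(http\S*|www\S*)', '', x)      # URLs
    x = re.sub(r'((?<=\D)/|/(?=\D))', ' ', x)   # slashes outside fractions
    x = re.sub(r'[^A-Za-z0-9.%/]+', ' ', x)     # other punctuation
    x = re.sub(r'\.+', '', x)                   # periods
    x = re.sub(r'(?<=\s)\w(?=\s)', '', x)       # single characters
    return re.sub(r'\s+', ' ', x).strip()

clean('SM05 Threaded Kinematic Cage Mount O1/2" Optics.')
# 'sm05 threaded kinematic cage mount o1/2 optics'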

In [8]:
df.head()


Out[8]:
index po_id po_num creation_date supplier_name item_type product_description manufacturer quantity unit_price department buyer__first_name buyer__last_name po_closed_date department_name spend product_line
0 245817 35963640 BB00319075 2014-02-11 00:00:00 fisher scientific sq hosted product molecular bioproducts electroporation cuvettes... molecular bioproducts 2 109.07 NaN jon kuchenreuther 2014-07-16 00:00:00 ooebi depolymerization 767.03 fisher scientific molecular bioproducts electr...
1 311817 27067475 BB00134494 2013-02-04 00:00:00 office max sq hosted product tablemate balloons color assorted table mate products inc 1 16.81 NaN william wolf 2013-04-22 00:00:00 ooebi depolymerization 18.32 office max tablemate balloons color assorted t...
2 405620 30901583 BB00218407 2013-07-16 00:00:00 grainger inc punchout product cs2460120 split lock washer spring steel zinc ... grainger approved vendor 1 2.46 NaN mike courter 2013-10-02 00:00:00 fjpps bm plumbing shop 1890.47 grainger inc cs2460120 split lock washer sprin...
3 525992 24759795 BB00085067 2012-10-11 00:00:00 thorlabs inc noncatalog product sm05 threaded kinematic cage mount o1/2 optics thorlabs 1 82.00 NaN michael hohensee 2013-03-01 00:00:00 physi ss atomic molecular 4783.54 thorlabs inc sm05 threaded kinematic cage moun...
4 171686 35510783 BB00309841 2014-01-24 00:00:00 fisher scientific noncatalog product 10 ml serological pipettes 200 case 3 29.20 NaN philip nguyen 2014-10-30 00:00:00 immcb bh research 745.20

Bag of Words

Tokenize


In [9]:
tokenized_pd = [word_tokenize(line) for line in df.product_line]
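Note that word_tokenize depends on NLTK's punkt models, and the stop-word list in the next section on the stopwords corpus; both ship separately from the library, so a one-time download may be needed first:

import nltk
nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('stopwords')  # stop-word lists used in the next section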

Stop Words

Removing English stopwords from NLTK.


In [10]:
stop_words = stopwords.words('english') + \
             [u'ea', u'per', u'item', u'description', u'quote', u'pk', u'pack',
              'give', 'something', 'inc', 'corporation', 'quantity', 'back',
              'products', 'co', 'officemax', 'unit', 'corp']

This code:

  • removes English stop words
  • removes numbers (excludes numbers with non-numeric characters attached)
  • removes single characters

In [11]:
tokenized_pd_clean = []

for entry in tokenized_pd:
    entry_list = []
    for word in entry:
        if (word not in stop_words and
                not unicode(word).isnumeric() and
                len(word) > 1):
            entry_list.append(word)
    tokenized_pd_clean.append(entry_list)
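Since `word not in stop_words` is a linear scan over a list, this loop speeds up considerably if the stop words are held in a set. An equivalent comprehension (a minor optimization, not what was run here):

stop_set = set(stop_words)

tokenized_pd_clean = [[word for word in entry
                       if word not in stop_set
                       and not unicode(word).isnumeric()
                       and len(word) > 1]
                      for entry in tokenized_pd]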

To DF


In [12]:
df['tokenized_pd_clean'] = tokenized_pd_clean

Words to Features


In [13]:
pd_list_clean = []

for item in tokenized_pd_clean:
    pd_list_clean.append(' '.join(item))

In [14]:
vectorizer = CountVectorizer(analyzer = "word", 
                             tokenizer = None, 
                             preprocessor = None, 
                             stop_words = None)
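Per the note in the intro, this is also where n-grams would come in: CountVectorizer accepts an ngram_range parameter, so a variant that counts unigrams and bigrams together would look like this (sketch, not used for the features below):

bigram_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 2))
bigram_features = bigram_vectorizer.fit_transform(pd_list_clean)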

In [15]:
word_features = vectorizer.fit_transform(pd_list_clean).toarray()

In [16]:
word_features.shape


Out[16]:
(40000, 23542)
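A dense 40,000 x 23,542 integer array runs to several gigabytes, while fit_transform already returns a memory-friendly scipy sparse matrix; the lda package appears to accept sparse input directly, so skipping .toarray() is a reasonable alternative (untested sketch):

# Keep the document-term matrix sparse to avoid the dense-memory cost.
word_features_sparse = vectorizer.fit_transform(pd_list_clean)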

In [17]:
word_features[0:5,:]


Out[17]:
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [18]:
vocab = vectorizer.get_feature_names()
print vocab[:15]


[u'0001g', u'0001mm', u'0002mm', u'0003mm', u'00067m', u'000bp', u'000mw', u'000u', u'000ug', u'000v', u'000w', u'000x', u'000xg', u'0010in', u'0011in']

In [19]:
vocab_map = vectorizer.vocabulary_

LDA


In [23]:
X = word_features

In [24]:
model = lda.LDA(n_topics=15, n_iter=1500, random_state=8675309)
model.fit(X)


WARNING:lda:all zero row in document-term matrix found
Out[24]:
<lda.lda.LDA instance at 0x11a7609e0>
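The warning refers to documents whose product_line was empty after cleaning, so their rows in the document-term matrix are all zeros. One hypothetical guard is to fit on the non-empty rows only, keeping the mask to map topic assignments back to df:

# Hypothetical filter: keep only documents with at least one token.
nonzero_mask = word_features.sum(axis=1) > 0
X_nonzero = word_features[nonzero_mask]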

In [25]:
topic_word = model.topic_word_  # topic-word distributions, shape (n_topics, n_vocab)
n_top_words = 21

with open('../../results/topic_definitions.csv', 'wb') as to_:
    writer = csv.writer(to_, delimiter=',', quotechar='"')
    for i, topic_dist in enumerate(topic_word):
        # argsort is ascending, so the reversed slice takes the 20 highest-weight words
        topic_words = np.array(vocab)[np.argsort(topic_dist)][:-n_top_words:-1]
        writer.writerow([i, ' '.join(topic_words)])

In [26]:
with open('../../results/pd_topics.csv', 'wb') as to_:
    writer = csv.writer(to_, delimiter=',', quotechar='"')
    doc_topic = model.doc_topic_  # document-topic distributions, shape (n_docs, n_topics)
    for i in range(len(tokenized_pd_clean)):
        # assign each document its single most likely topic
        writer.writerow([tokenized_pd_clean[i], doc_topic[i].argmax()])

In [27]:
words = [w.strip().split(' ') for w in pd_list_clean]
word_list = [i for word in words for i in word]
word_counts = Counter(word_list)
top_100_words = word_counts.most_common(100)
for word in top_100_words:
    print word


('', 13900)
('office', 7333)
('max', 7150)
('inc', 6584)
('scientific', 5587)
('fisher', 4595)
('size', 4276)
('back', 4237)
('color', 4137)
('give', 4106)
('something', 4098)
('vwr', 4015)
('pack', 3809)
('sigma', 3507)
('aldrich', 3507)
('black', 2980)
('1/2', 2816)
('white', 2543)
('type', 2296)
('officemax', 2177)
('grainger', 2161)
('label', 2132)
('quantity', 2084)
('international', 2025)
('private', 2009)
('products', 1956)
('unit', 1847)
('paper', 1752)
('corporation', 1680)
('length', 1475)
('box', 1416)
('blue', 1376)
('cs', 1361)
('ea', 1356)
('hp', 1332)
('pk', 1191)
('corp', 1174)
('life', 1143)
('dell', 1122)
('3m', 1073)
('assorted', 1049)
('material', 1048)
('thermo', 1012)
('lb', 1002)
('cartridge', 975)
('toner', 967)
('capacity', 962)
('tape', 957)
('boise', 936)
('sterile', 935)
('1/4', 921)
('technologies', 920)
('steel', 913)
('laboratories', 904)
('use', 899)
('3/4', 898)
('cdw', 882)
('free', 872)
('packard', 855)
('hewlett', 854)
('ink', 829)
('point', 828)
('clear', 819)
('sanford', 813)
('high', 811)
('medium', 810)
('avery', 803)
('yellow', 800)
('tip', 800)
('universal', 783)
('plastic', 778)
('chemical', 773)
('style', 757)
('llc', 753)
('tube', 744)
('carton', 728)
('head', 727)
('glass', 724)
('recycled', 706)
('sheets', 701)
('dia', 691)
('yield', 688)
('bd', 682)
('cap', 677)
('standard', 662)
('oz', 660)
('gloves', 656)
('tips', 638)
('co', 632)
('corning', 632)
('praxair', 629)
('cell', 617)
('bottle', 611)
('letter', 609)
('width', 587)
('grade', 578)
('disposable', 570)
('per', 565)
('bio', 563)
('page', 560)

Topics

Merge in the topic assignments from a separate run with 10 topics.


In [20]:
topics = pd.read_csv('../../results/pd_topics_10.csv', header=None)

In [21]:
topics.columns = ['tpc', 'topic']  # tokenized product line, assigned topic number

In [22]:
df['tpc'] = topics.tpc

In [23]:
df['topic'] = topics.topic

In [24]:
depts = pd.DataFrame({'count' : df.groupby('department_name')['department_name'].count()}).reset_index()
depts.sort('count', ascending=False, inplace=True)
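The same count-and-sort can be expressed in one step with value_counts, which sorts in descending order by default (equivalent sketch):

dept_counts = df['department_name'].value_counts()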

In [25]:
top15 = depts['department_name'][:15].tolist()

In [26]:
df_top15 = df[df.department_name.isin(top15)]

In [27]:
df_top15 = df_top15[df_top15['product_line'] != '']

In [28]:
topics_by_dept = pd.DataFrame({'count' : df_top15.groupby(['department_name', 'topic'])['topic'].count()}).reset_index()

In [29]:
topic_def = pd.read_csv('../../results/topics_definitions_10.csv', header=None)

In [30]:
topic_def.columns = ['topic', 'words']
topic_def['words'] = topic_def['words'].apply(lambda x: ', '.join(x.split()[:10]))

In [31]:
df_top15_final = pd.merge(topics_by_dept, topic_def, on='topic')

In [32]:
df_top15_final.to_csv('../../results/topic_count_10.csv', index=False)
