Text Analysis

Notebook by: Juan Shishido

In this notebook, I'll start cleaning the text columns and, more importantly, thinking about how to classify and group the data within them. Consider using n-grams for word occurrence (see the sketch in the Words to Features section below).

Imports


In [1]:
import re
import random
import lda  # used in the LDA section below
import csv
import numpy as np
import pandas as pd
from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

Load


In [2]:
df = pd.read_csv('../../data/cleaned/UCB_dept_merge.csv')

In [3]:
df['product_line'] = df['supplier_name'] + ' ' + \
                     df['product_description'] + ' ' + \
                     df['manufacturer']

Account for NaNs here: if any of the three component columns is NaN, the concatenated product_line is also NaN (see row 4 below). The Transform step replaces these with empty strings.
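A minimal sketch of an alternative, assuming the goal is simply to avoid NaN product_line rows up front rather than fixing them afterwards:

# Hypothetical alternative: fill NaNs before concatenating, so product_line
# is never NaN even when a component column is missing.
parts = ['supplier_name', 'product_description', 'manufacturer']
df['product_line'] = df[parts].fillna('').apply(' '.join, axis=1)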

Random Sample: 40,000


In [4]:
random.seed(8675309)
rows = random.sample(df.index, 40000)
df = df.ix[rows].reset_index()

In [5]:
df.head()


Out[5]:
index po_id po_num creation_date supplier_name item_type product_description manufacturer quantity unit_price department buyer__first_name buyer__last_name po_closed_date department_name spend product_line
0 245817 35963640 BB00319075 2014-02-11 00:00:00 FISHER SCIENTIFIC SQ Hosted Product Molecular BioProducts Electroporation Cuvettes... Molecular BioProducts 2 109.07 NaN Jon Kuchenreuther 2014-07-16 00:00:00 OOEBI Depolymerization 767.03 FISHER SCIENTIFIC Molecular BioProducts Electr...
1 311817 27067475 BB00134494 2013-02-04 00:00:00 OFFICE MAX SQ Hosted Product Tablemate - Balloons, Color Assorted TABLE MATE PRODUCTS INC 1 16.81 NaN William Wolf 2013-04-22 00:00:00 OOEBI Depolymerization 18.32 OFFICE MAX Tablemate - Balloons, Color Assorte...
2 405620 30901583 BB00218407 2013-07-16 00:00:00 GRAINGER INC PunchOut Product CS2460120 Split Lock Washer, Spring Steel, Zin... GRAINGER APPROVED VENDOR 1 2.46 NaN MIKE COURTER 2013-10-02 00:00:00 FJPPS BM Plumbing Shop 1890.47 GRAINGER INC CS2460120 Split Lock Washer, Spri...
3 525992 24759795 BB00085067 2012-10-11 00:00:00 Thorlabs Inc NonCatalog Product SM05 Threaded Kinematic Cage Mount O1/2" Optics. Thorlabs 1 82.00 NaN Michael Hohensee 2013-03-01 00:00:00 PHYSI SS Atomic & Molecular 4783.54 Thorlabs Inc SM05 Threaded Kinematic Cage Moun...
4 171686 35510783 BB00309841 2014-01-24 00:00:00 FISHER SCIENTIFIC NonCatalog Product 10 mL serological pipettes 200/case NaN 3 29.20 NaN Philip Nguyen 2014-10-30 00:00:00 IMMCB BH Research 745.20 NaN

Transform

This code:

  • replaces NaNs with empty strings
  • converts all text to lowercase
  • removes URLs
  • replaces forward slashes not associated with fractions with a single space
  • replaces characters that are not alphanumeric, periods, percent signs, or forward slashes with a single space
  • removes periods (the pattern matches runs of one or more)
  • removes single characters, letters or digits, with spaces on either side
  • collapses runs of whitespace into a single space and trims the ends

In a later step, numbers not associated with a percent sign are removed.


In [6]:
cols = ['supplier_name', 'item_type', 'product_description', 'manufacturer', 
        'buyer__first_name', 'buyer__last_name', 'department_name', 'product_line']

In [7]:
for col in cols:
    df[col] = df[col].replace(np.nan, '', regex=True)                          \
                     .apply(lambda x: x.lower())                               \
                     .apply(lambda x: re.sub(r'(http\S*|www\S*)', '', x))      \
                     .apply(lambda x: re.sub(r'((?<=\D)/|/(?=\D))', ' ', x))   \
                     .apply(lambda x: re.sub(r'[^A-Za-z0-9.%/]+', ' ', x))     \
                     .apply(lambda x: re.sub(r'\.+', '', x))                   \
                     .apply(lambda x: re.sub(r'(?<=\s)\w(?=\s)', '', x))       \
                     .apply(lambda x: re.sub(r'\s+', ' ', x).strip())
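As a sanity check, applying the same chain to a single string, the Thorlabs description from row 3 above, reproduces the cleaned value seen in the next output (standalone sketch of the cell above):

def clean(x):
    x = x.lower()
    x = re.sub(r'(http\S*|www\S*)', '', x)      # URLs
    x = re.sub(r'((?<=\D)/|/(?=\D))', ' ', x)   # slashes outside fractions
    x = re.sub(r'[^A-Za-z0-9.%/]+', ' ', x)     # other punctuation
    x = re.sub(r'\.+', '', x)                   # periods
    x = re.sub(r'(?<=\s)\w(?=\s)', '', x)       # single characters
    return re.sub(r'\s+', ' ', x).strip()

clean('SM05 Threaded Kinematic Cage Mount O1/2" Optics.')
# 'sm05 threaded kinematic cage mount o1/2 optics'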

In [8]:
df.head()


Out[8]:
index po_id po_num creation_date supplier_name item_type product_description manufacturer quantity unit_price department buyer__first_name buyer__last_name po_closed_date department_name spend product_line
0 245817 35963640 BB00319075 2014-02-11 00:00:00 fisher scientific sq hosted product molecular bioproducts electroporation cuvettes... molecular bioproducts 2 109.07 NaN jon kuchenreuther 2014-07-16 00:00:00 ooebi depolymerization 767.03 fisher scientific molecular bioproducts electr...
1 311817 27067475 BB00134494 2013-02-04 00:00:00 office max sq hosted product tablemate balloons color assorted table mate products inc 1 16.81 NaN william wolf 2013-04-22 00:00:00 ooebi depolymerization 18.32 office max tablemate balloons color assorted t...
2 405620 30901583 BB00218407 2013-07-16 00:00:00 grainger inc punchout product cs2460120 split lock washer spring steel zinc ... grainger approved vendor 1 2.46 NaN mike courter 2013-10-02 00:00:00 fjpps bm plumbing shop 1890.47 grainger inc cs2460120 split lock washer sprin...
3 525992 24759795 BB00085067 2012-10-11 00:00:00 thorlabs inc noncatalog product sm05 threaded kinematic cage mount o1/2 optics thorlabs 1 82.00 NaN michael hohensee 2013-03-01 00:00:00 physi ss atomic molecular 4783.54 thorlabs inc sm05 threaded kinematic cage moun...
4 171686 35510783 BB00309841 2014-01-24 00:00:00 fisher scientific noncatalog product 10 ml serological pipettes 200 case 3 29.20 NaN philip nguyen 2014-10-30 00:00:00 immcb bh research 745.20

Bag of Words

Tokenize


In [9]:
tokenized_pd = [word_tokenize(line) for line in df.product_line]
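Note that word_tokenize depends on NLTK's punkt models, and the stop-word list in the next section on the stopwords corpus; both ship separately from the library, so a one-time download may be needed first:

import nltk
nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('stopwords')  # stop-word lists used in the next section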

Stop Words

Removing English stopwords from NLTK.


In [10]:
stop_words = stopwords.words('english') + \
             [u'ea', u'per', u'item', u'description', u'quote', u'pk', u'pack',
              'give', 'something', 'inc', 'corporation', 'quantity', 'back',
              'products', 'co', 'officemax', 'unit', 'corp']

This code:

  • removes English stop words
  • removes numbers (excludes numbers with non-numeric characters attached)
  • removes single characters

In [11]:
tokenized_pd_clean = []

for entry in tokenized_pd:
    entry_list = []
    for word in entry:
        if (word not in stop_words and
                not unicode(word).isnumeric() and
                len(word) > 1):
            entry_list.append(word)
    tokenized_pd_clean.append(entry_list)
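Since `word not in stop_words` is a linear scan over a list, this loop speeds up considerably if the stop words are held in a set. An equivalent comprehension (a minor optimization, not what was run here):

stop_set = set(stop_words)

tokenized_pd_clean = [[word for word in entry
                       if word not in stop_set
                       and not unicode(word).isnumeric()
                       and len(word) > 1]
                      for entry in tokenized_pd]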

To DF


In [12]:
df['tokenized_pd_clean'] = tokenized_pd_clean

Words to Features


In [13]:
pd_list_clean = []

for item in tokenized_pd_clean:
    pd_list_clean.append(' '.join(item))

In [14]:
vectorizer = CountVectorizer(analyzer = "word", 
                             tokenizer = None, 
                             preprocessor = None, 
                             stop_words = None)
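Per the note in the intro, this is also where n-grams would come in: CountVectorizer accepts an ngram_range parameter, so a variant that counts unigrams and bigrams together would look like this (sketch, not used for the features below):

bigram_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 2))
bigram_features = bigram_vectorizer.fit_transform(pd_list_clean)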

In [15]:
word_features = vectorizer.fit_transform(pd_list_clean).toarray()

In [16]:
word_features.shape


Out[16]:
(40000, 23542)
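A dense 40,000 x 23,542 integer array runs to several gigabytes, while fit_transform already returns a memory-friendly scipy sparse matrix; the lda package appears to accept sparse input directly, so skipping .toarray() is a reasonable alternative (untested sketch):

# Keep the document-term matrix sparse to avoid the dense-memory cost.
word_features_sparse = vectorizer.fit_transform(pd_list_clean)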

In [17]:
word_features[0:5,:]


Out[17]:
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [18]:
vocab = vectorizer.get_feature_names()
print vocab[:15]


[u'0001g', u'0001mm', u'0002mm', u'0003mm', u'00067m', u'000bp', u'000mw', u'000u', u'000ug', u'000v', u'000w', u'000x', u'000xg', u'0010in', u'0011in']

In [19]:
vocab_map = vectorizer.vocabulary_

LDA


In [23]:
X = word_features

In [24]:
model = lda.LDA(n_topics=15, n_iter=1500, random_state=8675309)
model.fit(X)


WARNING:lda:all zero row in document-term matrix found
Out[24]:
<lda.lda.LDA instance at 0x11a7609e0>
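The warning refers to documents whose product_line was empty after cleaning, so their rows in the document-term matrix are all zeros. One hypothetical guard is to fit on the non-empty rows only, keeping the mask to map topic assignments back to df:

# Hypothetical filter: keep only documents with at least one token.
nonzero_mask = word_features.sum(axis=1) > 0
X_nonzero = word_features[nonzero_mask]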

In [25]:
topic_word = model.topic_word_  # topic-word distributions, shape (n_topics, n_vocab)
n_top_words = 21

with open('../../results/topic_definitions.csv', 'wb') as to_:
    writer = csv.writer(to_, delimiter=',', quotechar='"')
    for i, topic_dist in enumerate(topic_word):
        # argsort is ascending, so the reversed slice takes the 20 highest-weight words
        topic_words = np.array(vocab)[np.argsort(topic_dist)][:-n_top_words:-1]
        writer.writerow([i, ' '.join(topic_words)])

In [26]:
with open('../../results/pd_topics.csv', 'wb') as to_:
    writer = csv.writer(to_, delimiter=',', quotechar='"')
    doc_topic = model.doc_topic_  # document-topic distributions, shape (n_docs, n_topics)
    for i in range(len(tokenized_pd_clean)):
        # assign each document its single most likely topic
        writer.writerow([tokenized_pd_clean[i], doc_topic[i].argmax()])

In [27]:
words = [w.strip().split(' ') for w in pd_list_clean]
word_list = [i for word in words for i in word]
word_counts = Counter(word_list)
top_100_words = word_counts.most_common(100)
for word in top_100_words:
    print word


('', 13900)
('office', 7333)
('max', 7150)
('inc', 6584)
('scientific', 5587)
('fisher', 4595)
('size', 4276)
('back', 4237)
('color', 4137)
('give', 4106)
('something', 4098)
('vwr', 4015)
('pack', 3809)
('sigma', 3507)
('aldrich', 3507)
('black', 2980)
('1/2', 2816)
('white', 2543)
('type', 2296)
('officemax', 2177)
('grainger', 2161)
('label', 2132)
('quantity', 2084)
('international', 2025)
('private', 2009)
('products', 1956)
('unit', 1847)
('paper', 1752)
('corporation', 1680)
('length', 1475)
('box', 1416)
('blue', 1376)
('cs', 1361)
('ea', 1356)
('hp', 1332)
('pk', 1191)
('corp', 1174)
('life', 1143)
('dell', 1122)
('3m', 1073)
('assorted', 1049)
('material', 1048)
('thermo', 1012)
('lb', 1002)
('cartridge', 975)
('toner', 967)
('capacity', 962)
('tape', 957)
('boise', 936)
('sterile', 935)
('1/4', 921)
('technologies', 920)
('steel', 913)
('laboratories', 904)
('use', 899)
('3/4', 898)
('cdw', 882)
('free', 872)
('packard', 855)
('hewlett', 854)
('ink', 829)
('point', 828)
('clear', 819)
('sanford', 813)
('high', 811)
('medium', 810)
('avery', 803)
('yellow', 800)
('tip', 800)
('universal', 783)
('plastic', 778)
('chemical', 773)
('style', 757)
('llc', 753)
('tube', 744)
('carton', 728)
('head', 727)
('glass', 724)
('recycled', 706)
('sheets', 701)
('dia', 691)
('yield', 688)
('bd', 682)
('cap', 677)
('standard', 662)
('oz', 660)
('gloves', 656)
('tips', 638)
('co', 632)
('corning', 632)
('praxair', 629)
('cell', 617)
('bottle', 611)
('letter', 609)
('width', 587)
('grade', 578)
('disposable', 570)
('per', 565)
('bio', 563)
('page', 560)

Topics

Merge in the topic assignments from a separate run with 10 topics.


In [20]:
topics = pd.read_csv('../../results/pd_topics_10.csv', header=None)

In [21]:
topics.columns = ['tpc', 'topic']  # tokenized product line, assigned topic number

In [22]:
df['tpc'] = topics.tpc

In [23]:
df['topic'] = topics.topic

In [24]:
depts = pd.DataFrame({'count' : df.groupby('department_name')['department_name'].count()}).reset_index()
depts.sort('count', ascending=False, inplace=True)
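The same count-and-sort can be expressed in one step with value_counts, which sorts in descending order by default (equivalent sketch):

dept_counts = df['department_name'].value_counts()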

In [25]:
top15 = depts['department_name'][:15].tolist()

In [26]:
df_top15 = df[df.department_name.isin(top15)]

In [27]:
df_top15 = df_top15[df_top15['product_line'] != '']

In [28]:
topics_by_dept = pd.DataFrame({'count' : df_top15.groupby(['department_name', 'topic'])['topic'].count()}).reset_index()

In [29]:
topic_def = pd.read_csv('../../results/topics_definitions_10.csv', header=None)

In [30]:
topic_def.columns = ['topic', 'words']
topic_def['words'] = topic_def['words'].apply(lambda x: ', '.join(x.split()[:10]))

In [31]:
df_top15_final = pd.merge(topics_by_dept, topic_def, on='topic')

In [32]:
df_top15_final.to_csv('../../results/topic_count_10.csv', index=False)
