Computing Word Frequency for Brand Perception Interviews amongst key stakeholders NYCAASC

Importing Libraries



In [64]:

    
import re
import nltk
import math
import string
from collections import Counter
import pandas as pd



In [71]:

    
from __future__ import unicode_literals



In [68]:

    
# only need to run once
nltk.download()









    



showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml






    Out[68]:





True

Reading in Data



In [72]:

    
df = pd.read_csv("/Users/chuamelia/Downloads/Brand_IDI_Qual.tsv",sep="\t")

Looking at Structure of Data



In [73]:

    
df.head(1)









    Out[73]:






  
    
      
      id
      a_position_toNYCAASC
      b_position_toNYCAASC
      embody_Mission
      future_Direction
      first_Time
      logo
    
  
  
    
      0
      1
      Umm, I guess I do a lot of the upper level man...
      I see it as, well I see my role as kinda givin...
      I think that a good, I talked about this earli...
      I wanna see it reaching more people and being ...
      I guess, NYCAASC is kind of divided between th...
      The spelling is a little confusing because of ...

Prepare Functions for Analysis

Table of Word Types



In [168]:

    
#Write function to append all tokens to one list.
def stuff_tokenizer(column, list):
    discard = ['IN', ',','.','TO', 'DT', 'PRP', 'CC']
    end_num = len(column)
    temp = []
    for i in range(end_num): #append to one list
        temp.extend(nltk.word_tokenize(column[i].decode('utf-8')))
    temp2 = nltk.pos_tag(temp) #tag words
    for i in temp2: #discard prepositions, articles, etc.
        if i[1] not in discard: 
            list.append(i)



In [74]:

    
#add decode('utf-8') bc "\xe2\x80\x99" interprtation
#ascii' codec can't decode byte
#example: df['first_Time'][0]









    Out[74]:





'I guess, NYCAASC is kind of divided between the event and then also the group. So we\xe2\x80\x99re both kinda this thing that happens once a year and we\xe2\x80\x99re also this thing that is ongoing like this discussion between almost 50 people. So, I guess I want people to see NYCAASC as this thing that fosters dialogue, fosters active conversations, fosters debate, and encourages personal growth and community growth. Basically, pushing for people to be more comfortable with each other'

Create empty lists for stuffing.



In [163]:

    
q1a = []
q1b = []
q2 = []
q3 = []
q4 = []
q5 = []

Stuff Away!



In [170]:

    
stuff_tokenizer(df['a_position_toNYCAASC'],q1a)
stuff_tokenizer(df['b_position_toNYCAASC'],q1b)
stuff_tokenizer(df['embody_Mission'],q2)
stuff_tokenizer(df['future_Direction'],q3)
stuff_tokenizer(df['first_Time'],q4)
stuff_tokenizer(df['logo'],q5)

Checking if stuffing worked...



In [131]:

    
print q1a[:3]









    



[u'Umm', u',', u'I']



In [152]:

    
dataset = [q1a,q1b,q2,q3,q4,q5]



In [153]:

    
for i in dataset:
    common = Counter(i)
    print common.most_common(5)
# Need to remove prepositions
# How can we control for one person repeating the same word?
# select distinct words: my_list = list(set(my_list))
# compare word counts









    



[]
[]
[]
[]
[]
[]
[((u'is', 'VBZ'), 19), ((u'are', 'VBP'), 17), ((u'people', 'NNS'), 15), ((u'be', 'VB'), 14), ((u'know', 'VBP'), 13)]

References

Peter Norvig Everything I Need to Know About NLP I learned from Sesame Street



In [ ]: