Computing Word Frequency for Brand Perception Interviews amongst key stakeholders NYCAASC

Importing Libraries


In [64]:
import re
import nltk
import math
import string
from collections import Counter
import pandas as pd

In [71]:
from __future__ import unicode_literals

In [68]:
# only need to run once
nltk.download()


showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
Out[68]:
True

Reading in Data


In [72]:
df = pd.read_csv("/Users/chuamelia/Downloads/Brand_IDI_Qual.tsv",sep="\t")

Looking at Structure of Data


In [73]:
df.head(1)


Out[73]:
id a_position_toNYCAASC b_position_toNYCAASC embody_Mission future_Direction first_Time logo
0 1 Umm, I guess I do a lot of the upper level man... I see it as, well I see my role as kinda givin... I think that a good, I talked about this earli... I wanna see it reaching more people and being ... I guess, NYCAASC is kind of divided between th... The spelling is a little confusing because of ...

Prepare Functions for Analysis


In [168]:
#Write function to append all tokens to one list.
def stuff_tokenizer(column, list):
    discard = ['IN', ',','.','TO', 'DT', 'PRP', 'CC']
    end_num = len(column)
    temp = []
    for i in range(end_num): #append to one list
        temp.extend(nltk.word_tokenize(column[i].decode('utf-8')))
    temp2 = nltk.pos_tag(temp) #tag words
    for i in temp2: #discard prepositions, articles, etc.
        if i[1] not in discard: 
            list.append(i)

In [74]:
#add decode('utf-8') bc "\xe2\x80\x99" interprtation
#ascii' codec can't decode byte
#example: df['first_Time'][0]


Out[74]:
'I guess, NYCAASC is kind of divided between the event and then also the group. So we\xe2\x80\x99re both kinda this thing that happens once a year and we\xe2\x80\x99re also this thing that is ongoing like this discussion between almost 50 people. So, I guess I want people to see NYCAASC as this thing that fosters dialogue, fosters active conversations, fosters debate, and encourages personal growth and community growth. Basically, pushing for people to be more comfortable with each other'

Create empty lists for stuffing.


In [163]:
q1a = []
q1b = []
q2 = []
q3 = []
q4 = []
q5 = []

Stuff Away!


In [170]:
stuff_tokenizer(df['a_position_toNYCAASC'],q1a)
stuff_tokenizer(df['b_position_toNYCAASC'],q1b)
stuff_tokenizer(df['embody_Mission'],q2)
stuff_tokenizer(df['future_Direction'],q3)
stuff_tokenizer(df['first_Time'],q4)
stuff_tokenizer(df['logo'],q5)

Checking if stuffing worked...


In [131]:
print q1a[:3]


[u'Umm', u',', u'I']

In [152]:
dataset = [q1a,q1b,q2,q3,q4,q5]

In [153]:
for i in dataset:
    common = Counter(i)
    print common.most_common(5)
# Need to remove prepositions
# How can we control for one person repeating the same word?
# select distinct words: my_list = list(set(my_list))
# compare word counts


[]
[]
[]
[]
[]
[]
[((u'is', 'VBZ'), 19), ((u'are', 'VBP'), 17), ((u'people', 'NNS'), 15), ((u'be', 'VB'), 14), ((u'know', 'VBP'), 13)]

In [ ]: