This notebook demonstrates two methods of building n-grams with Python

  • The first uses "pure Python"
  • The second uses the machine learning library scikit-learn

Using Jupyter Notebooks and Pandas for data science is common practice

  • These tools provide excellent methods for cleaning, or munging, data
    • The cell-based division of code and the web interface give interactive feedback
  • They allow full access to scientific libraries, visualization libraries, and the Python language
  • They let you save results as HTML files for easy sharing at conferences or meetings

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
# Standard imports

%matplotlib inline

In [3]:
social_results = pd.read_csv("social_messages.csv")
social_results.head()


Out[3]:
   years gender social_site                                               text
0    5.0      F          FB        The new router I bought broke far too soon
1    2.0      M      Linked                                      Give me a job!
2    2.5      M        Twit  The experience at your store was happy and rew...
3    7.0      F        Twit      salemen was very helpful and great experience
4    2.0      F          FB  Your salesmen need to develop better communica...

In [4]:
from collections import Counter
import re
# quick bag of words (this will be done for several fields; the scikit-learn n-gram version follows below)

In [14]:
def check_text(txt):
    '''
    Checks whether the text is a NaN (not a number) value and, if so, returns ['none'].
    Otherwise strips non-alphanumeric characters, lowercases the text, and removes stop words.
    returns: either ['none'] or a list of cleaned tokens
    '''
    pat = re.compile(r'([^\s\w]|_)+')
    # pattern matching anything that is not whitespace or an alphanumeric character
    STOP = ['and', 'to', 'of', 'with', 'for', 'but', 'or', 'in', 'at', 'it', 'was', 'the', 'are', 'a', 'i']
    # stop words to ignore

    try:
        # NaN values arrive as floats, so a successful float() call means there is no real text
        float(txt)
        return ['none']
    except ValueError:
        return [x for x in pat.sub('', txt.lower()).split() if x not in STOP]

counter_results = Counter()

for words in social_results['text']:
    counter_results += Counter(check_text(words))

total_counts = sum(counter_results.values())
print("Total frequencies: {}\nMost Common:".format(total_counts))
for word, freq in counter_results.most_common(10):
    print("The word '{}' occurred {} times or {:.2f} percent".format(word, freq, (freq / total_counts) * 100))


Total frequencies: 209
Most Common:
The word 'great' occurred 12 times or 5.74 percent
The word 'guys' occurred 10 times or 4.78 percent
The word 'you' occurred 10 times or 4.78 percent
The word 'businesses' occurred 9 times or 4.31 percent
The word 'small' occurred 8 times or 3.83 percent
The word 'is' occurred 7 times or 3.35 percent
The word 'interface' occurred 7 times or 3.35 percent
The word 'use' occurred 6 times or 2.87 percent
The word 'rule' occurred 6 times or 2.87 percent
The word 'hard' occurred 6 times or 2.87 percent
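
The Counter above only tallies single words. Since the goal is an n-gram, a minimal "pure Python" bigram count can be built the same way by pairing adjacent tokens. This cell is a sketch added for illustration (it reuses check_text; no output from the original run is shown):

In [ ]:
bigram_results = Counter()

for words in social_results['text']:
    tokens = check_text(words)
    bigram_results += Counter(zip(tokens, tokens[1:]))  # adjacent token pairs form the bigrams

for pair, freq in bigram_results.most_common(5):
    print("The bigram '{} {}' occurred {} times".format(pair[0], pair[1], freq))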

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
# build a quick n-gram to identify some areas to look for themes

In [7]:
social_results.fillna('none', inplace=True) # in case I missed any
social_results['text'].head()


Out[7]:
0           The new router I bought broke far too soon
1                                       Give me a job!
2    The experience at your store was happy and rew...
3        salemen was very helpful and great experience
4    Your salesmen need to develop better communica...
Name: text, dtype: object
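
The fillna above matters for the vectorizer step below: CountVectorizer expects strings, and a float NaN would fail during its internal lowercasing step. A small sketch of the idea on toy data (added for illustration):

In [ ]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

dirty = ["great service", np.nan, "hard to use"]                 # toy column with a missing value
clean = ["none" if isinstance(x, float) else x for x in dirty]   # same idea as fillna('none')
CountVectorizer().fit_transform(clean)                           # works once the NaN is a string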

In [15]:
vect = CountVectorizer(ngram_range=(1, 2), stop_words='english')
X = vect.fit_transform(social_results['text'].tolist())
# fit_transform learns the vocabulary of unigrams and bigrams from the documents
# and returns a sparse document-term matrix of token counts
# (one row per document, one column per term in the learned vocabulary)


terms = vect.get_feature_names()  # map the numeric feature columns back to their terms
                                  # (newer scikit-learn versions use get_feature_names_out())
freqs = X.sum(axis=0).A1          # total count of each term across all documents
result = dict(zip(terms, freqs))  # creates a dictionary of term -> frequency for processing

for k in sorted(result, key=result.get, reverse=True):
    print("{},{}".format(k,result[k]))


great,12
guys,10
businesses,9
small,8
small businesses,8
interface,7
rule small,6
hard use,6
use,6
hard,6
guys rule,6
rule,6
interface hard,5
mobile,5
events,5
use mobile,4
bought,3
experience,3
need,3
events great,3
activites,3
discounts,3
helpful,2
activites events,2
guys great,2
salesmen,2
helpful great,2
broke,2
great experience,2
need discounts,2
needs,2
enjoyed,2
enjoyed events,2
company,2
soon,2
great small,2
new router,1
develop,1
equipment bought,1
bought wack,1
forward great,1
skills explain,1
attending,1
salemen helpful,1
salesman helpful,1
happy,1
veterans great,1
hate,1
rewarding,1
explain difference,1
fair,1
hosting,1
difference ethernet,1
wack,1
salemen,1
develop better,1
wish,1
time event,1
interface needs,1
immediately,1
improvement hard,1
liked activites,1
helped improve,1
great mobile,1
bought broke,1
salesman,1
event,1
happy rewarding,1
use needs,1
bought works,1
career,1
stuff,1
business guys,1
attending career,1
social interface,1
forward,1
guys hosting,1
activities,1
discounts students,1
offered discounts,1
teachers military,1
online,1
freezes lot,1
needs improvement,1
interface great,1
great time,1
wish guys,1
site freezes,1
lot,1
job,1
time,1
career fair,1
great jobs,1
needs work,1
difference,1
really liked,1
work phones,1
better communication,1
online interface,1
broke far,1
types,1
need develop,1
communication skills,1
fair soon,1
great need,1
ethernet,1
students,1
better,1
experience store,1
events guys,1
router bought,1
social,1
wack broke,1
really,1
looking,1
company attending,1
liked,1
work,1
looking forward,1
events activities,1
military veterans,1
improve,1
teachers,1
new,1
mobile site,1
tabletop,1
tabletop game,1
explain,1
router,1
works,1
veterans,1
guys offered,1
salesmen great,1
ethernet types,1
military,1
improve business,1
business,1
store,1
equipment,1
far,1
jobs,1
rewarding equipment,1
site,1
discounts teachers,1
game events,1
hate company,1
students businesses,1
skills,1
freezes,1
offered,1
improvement,1
far soon,1
phones,1
broke immediately,1
store happy,1
stuff bought,1
communication,1
game,1
helped,1
salesmen need,1
great activites,1
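
To make concrete what fit_transform returns, here is a tiny sketch on toy documents (added for illustration, not part of the original analysis):

In [ ]:
from sklearn.feature_extraction.text import CountVectorizer

docs = ["great interface", "interface hard to use", "great guys"]
demo = CountVectorizer(ngram_range=(1, 2), stop_words='english')
counts = demo.fit_transform(docs)   # sparse document-term count matrix

print(demo.get_feature_names())     # the learned unigram/bigram vocabulary
print(counts.toarray())             # one row per document, one column per term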

Created the n-gram and saved the results to a file
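
The save step itself isn't shown in this notebook. A sketch of what it could look like, assuming the id/term/freq columns the load code below expects (the filename and the top-10 cutoff are assumptions, the cutoff keeping the scatter plot readable):

In [ ]:
ngram_df = pd.DataFrame(sorted(result.items(), key=lambda kv: kv[1], reverse=True),
                        columns=['term', 'freq'])
ngram_df.index.name = 'id'
ngram_df.head(10).to_csv('social_results.csv')  # written columns: id, term, freq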

Loading the saved results and plotting

Also going to look at the male vs. female counts and plot the years column


In [9]:
gen = Counter(social_results.gender)
tot_gen = sum(gen.values())

sns.set(style="darkgrid")
sns.countplot(x="gender", data=social_results)

print('Total Females = {0} : {1:.2f} percent'.format(gen['F'], (gen['F'] / tot_gen) * 100))
print('Total Males = {0} : {1:.2f} percent'.format(gen['M'], (gen['M'] / tot_gen) * 100))


Total Females = 15 : 42.86 percent
Total Males = 20 : 57.14 percent
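
As an aside, pandas can produce the same tally without Counter (an equivalent sketch):

In [ ]:
print(social_results['gender'].value_counts())                       # raw counts per gender
print(social_results['gender'].value_counts(normalize=True) * 100)   # as percentages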

In [10]:
sns.countplot(x="years", data=social_results) # Gotta love seaborn


Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7179020a90>

In [13]:
ngram_results = pd.read_csv('social_results.csv')

x = ngram_results.id.tolist()
y = ngram_results.freq.tolist()

fig, ax = plt.subplots()
ax.scatter(x, y, s=ngram_results.freq * 200, c=np.cos(x))  # change the s multiplier to change marker size (50, 200, 400)

col = ['black', 'red', 'blue', 'green', 'orange']
for i, txt in enumerate(ngram_results.term):  # label each dot, cycling colors so labels stay readable
    ax.annotate(txt, (x[i],y[i]), color=col[i % 5])


