OkNLP

This notebook walks through the method we used in our project. It shows how we clustered essays with Nonnegative Matrix Factorization (NMF). We manually inspected the NMF output to choose the number of clusters for each essay. Then we created word clouds for specific groups and demographic splits.

Imports and Settings


In [1]:
import random
import warnings

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import StratifiedKFold, permutation_test_score

from utils.clean_up import *
from utils.categorize_demographics import *
from utils.nonnegative_matrix_factorization import nmf_labels
from utils.distinctive_tokens import log_odds_ratio
from utils.wc_colors import *
from utils.splits import *
from utils.plotting import *

warnings.filterwarnings('ignore')
%matplotlib inline

In [2]:
mpl.rc('savefig', dpi=300)
params = {'figure.dpi' : 300,
          'axes.axisbelow' : True,
          'lines.antialiased' : True}

for (k, v) in params.items():
    plt.rcParams[k] = v

sns.set_style("dark")

In [3]:
# Keeping track of the names of the essays
essay_dict = {'essay0' : 'My self summary',
              'essay1' : 'What I\'m doing with my life',
              'essay2' : 'I\'m really good at',
              'essay3' : 'The first thing people notice about me',
              'essay4' : 'Favorite books, movies, tv, food',
              'essay5' : 'The six things I could never do without',
              'essay6' : 'I spend a lot of time thinking about',
              'essay7' : 'On a typical Friday night I am',
              'essay8' : 'The most private thing I am willing to admit',
              'essay9' : 'You should message me if'}

Data Cleaning

First, we read in the data and recategorize some of the demographic information. This gives two separate data frames, one for essay0 and one for essay4.


In [4]:
df = pd.read_csv('data/profiles.20120630.csv')

essay_list = ['essay0', 'essay4']
df_0, df_4 = clean_up(df, essay_list)

df_0 = recategorize(df_0)
df_4 = recategorize(df_4)

Clustering

For each essay, we convert the users' responses into a TF-IDF matrix and then use NMF to cluster the data points, using 25 clusters per essay. Be warned that the cells below take a while to run.


In [5]:
K = 25
count_matrix, tfidf_matrix, vocab = col_to_data_matrix(df_0, 'essay0')
df_0['group'] = nmf_labels(tfidf_matrix, K)




In [6]:
K = 25
count_matrix, tfidf_matrix, vocab = col_to_data_matrix(df_4, 'essay4')
df_4['group'] = nmf_labels(tfidf_matrix, K)



Note: count_matrix, tfidf_matrix, and vocab correspond to the data found in df_4.
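
As a rough illustration of the clustering step above, here is a minimal sketch. It assumes that nmf_labels factors the TF-IDF matrix into nonnegative factors and assigns each essay to its highest-weight component; the actual helper in utils/nonnegative_matrix_factorization.py may work differently. The function name nmf_cluster_labels, the multiplicative-update solver, and the toy matrix are all hypothetical.

```python
import numpy as np

def nmf_cluster_labels(X, k, n_iter=500, seed=0, eps=1e-9):
    """Approximate X (docs x terms, nonnegative) as W @ H and
    label each document by the column of W with the largest weight."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = X.shape
    W = rng.random((n_docs, k)) + eps
    H = rng.random((k, n_terms)) + eps
    for _ in range(n_iter):
        # Multiplicative updates for the squared reconstruction error;
        # they keep every entry of W and H nonnegative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W.argmax(axis=1), W, H

# Toy matrix (an assumption): two documents use terms 0-1, two use terms 2-3.
X_toy = np.array([[3.0, 2.0, 0.0, 0.0],
                  [2.0, 3.0, 0.0, 0.0],
                  [0.0, 0.0, 4.0, 1.0],
                  [0.0, 0.0, 1.0, 4.0]])
labels, W, H = nmf_cluster_labels(X_toy, k=2)
print(labels)
```

The rows of H can be read as topics (weights over the vocabulary), which is what makes it possible to inspect the top words per cluster by hand.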

Figures

Examples

Difference in Proportion


In [7]:
counts = counts_by_class(count_matrix, df_4, 'group', one_vs_one=False, vals=15)
diffs = diff_prop(counts, vocab)
t, _ = wf(diffs, 100)
wcloud(t, cyan)
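
The idea behind the difference in proportion can be sketched as follows, assuming diff_prop compares each token's share of all tokens in one class against its share in the other; the real implementation in utils may differ. The function diff_in_proportion and the toy counts are hypothetical.

```python
import numpy as np

def diff_in_proportion(counts_a, counts_b):
    """Share of each token among all tokens in class A minus its share in class B."""
    prop_a = counts_a / counts_a.sum()
    prop_b = counts_b / counts_b.sum()
    return prop_a - prop_b

# Toy per-token totals (assumptions): one cluster versus everyone else.
vocab_demo = ['beach', 'books', 'music']
group_counts = np.array([6, 3, 1])
rest_counts = np.array([10, 20, 20])

diffs_demo = diff_in_proportion(group_counts, rest_counts)
# Tokens ordered from most over-represented to most under-represented in the group.
ranked = [vocab_demo[i] for i in np.argsort(-diffs_demo)]
print(ranked)
```

Tokens at the top of this ranking are the ones a word cloud for the group would emphasize.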


Lollipops


In [8]:
demog = 'ethnicity'
subset = subset_df(df_4, demog, ['white', 'black'])
grouped = group_pct(subset, demog)
lollipop(grouped, demog)



In [9]:
demog = 'drugs'
subset = subset_df(df_4, demog, ['yes','no','unknown'])
grouped = group_pct(subset, demog)
lollipop(grouped, demog)


Log Odds Ratio


In [10]:
counts = counts_by_class(count_matrix, df_4, 'drugs', one_vs_one=True, vals=['yes', 'no'])
log_odds = log_odds_ratio(counts, vocab, use_variance=True)
t, b = wf(log_odds, 100)
wcloud(t, blue)
wcloud(b, red)



Distinctive Tokens

Log Odds Ratio

From Monroe et al.


In [14]:
counts = counts_by_class(count_matrix, df_4, 'drugs', ['yes', 'no'])

In [15]:
log_odds = log_odds_ratio(np.array(counts), vocab, use_variance=True)
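
For reference, here is a minimal sketch of the log odds ratio with an informative Dirichlet prior described by Monroe et al. It assumes that use_variance=True means each log odds ratio is divided by its estimated standard deviation (a z-score); the actual log_odds_ratio helper may differ. The function log_odds_z, the toy counts, and the choice of pooled counts as the prior are hypothetical.

```python
import numpy as np

def log_odds_z(counts_i, counts_j, prior):
    """Z-scored log odds ratio of token use in class i versus class j."""
    n_i, n_j, a0 = counts_i.sum(), counts_j.sum(), prior.sum()
    # Log odds of each token within each class, smoothed by the prior counts.
    log_odds_i = np.log((counts_i + prior) / (n_i + a0 - counts_i - prior))
    log_odds_j = np.log((counts_j + prior) / (n_j + a0 - counts_j - prior))
    delta = log_odds_i - log_odds_j
    # Approximate variance of the difference.
    variance = 1.0 / (counts_i + prior) + 1.0 / (counts_j + prior)
    return delta / np.sqrt(variance)

# Toy token counts for two classes (assumptions).
yes_counts = np.array([30.0, 5.0, 15.0])
no_counts = np.array([10.0, 25.0, 15.0])
prior_counts = yes_counts + no_counts  # pooled counts used as the prior here

z = log_odds_z(yes_counts, no_counts, prior_counts)
print(z)
```

Positive values mark tokens favored by the first class, negative values tokens favored by the second, and dividing by the standard deviation downweights rare tokens whose ratios are noisy.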

In [10]:
colors = ['#348ABD', '#A60628', '#7A68A6', '#467821', '#D55E00', '#CC79A7',
          '#56B4E9', '#009E73', '#F0E442', '#0072B2', '#A500FF', '#FFA500']

tmp = log_odds.sort_values('log_odds_ratio', ascending=False)
tmp = tmp.set_index('features')
top = tmp.iloc[:15].copy()      # 15 most positive log odds ratios
top['group'] = 0
bottom = tmp.iloc[-15:].copy()  # 15 most negative log odds ratios
bottom['group'] = 1
tmp = pd.concat([top, bottom])

f, ax = plt.subplots()
tmp['log_odds_ratio'].plot(kind = 'bar', ax = ax, color=[colors[i] for i in tmp['group']])
ax.set_ylim([-17,17])
ax.set_xlabel('')
ax.set_ylabel('log odds ratio')
plt.legend([])

fs = 14
for label in (ax.get_xticklabels() + ax.get_yticklabels()):
    label.set_fontweight('bold')
    label.set_fontsize(fs)
    label.set_color('lightgray')

ax.xaxis.label.set_color('lightgray')
ax.xaxis.label.set_fontweight('bold')
ax.xaxis.label.set_fontsize(fs)

ax.yaxis.label.set_color('lightgray')
ax.yaxis.label.set_fontweight('bold')
ax.yaxis.label.set_fontsize(fs)



Logistic Regression

Feature importance based on model coefficients for the tfidf features.


In [8]:
X = np.array(tfidf_matrix.todense())
y = df_4.drugs.values

In [9]:
logistic = LogisticRegression()

In [10]:
b_hats_logistic = betas(logistic, X, y, test_size=0.25)

In [11]:
rf = RandomForestClassifier()

In [12]:
b_hats_rf = betas(rf, X, y, test_size=0.25)

In [12]:
b_hats_logistic


Out[12]:
array([[ 0.60576971,  1.37155678, -0.56513165, ..., -0.80721075,
         0.34325163, -0.41766705],
       [ 0.00822375, -0.79456131, -0.08514868, ...,  0.13465736,
        -0.38123083,  0.54348128],
       [-1.08996236, -1.60586957,  1.03745574, ...,  0.96022083,
         0.10072112,  0.00530885]])

In [13]:
b_hats_rf


Out[13]:
array([ 0.00047461,  0.00124444,  0.00070737, ...,  0.00109653,
        0.00039806,  0.00058587])

For the random forest classifier, the returned values are not coefficient estimates (beta hats); they come from the fitted model's feature_importances_ attribute. If this isn't useful, we could edit utils/classification.py.

Also, we might want to turn these into binary classification problems. For the drugs feature, for example, we could exclude unknown values.

In either case, this is just a placeholder for now.
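
The difference between the two outputs above can be seen in a small sketch on synthetic data. It does not use the betas helper, whose internals are not shown here; it only illustrates the scikit-learn attributes coef_ and feature_importances_. The data and variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption): 60 samples, 5 features, 3 classes.
rng = np.random.default_rng(0)
X_demo = rng.random((60, 5))
y_demo = np.array(['yes', 'no', 'unknown'] * 20)

logit = LogisticRegression().fit(X_demo, y_demo)
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# One row of signed coefficients per class.
print(logit.coef_.shape)
# One nonnegative importance per feature, summing to 1.
print(forest.feature_importances_.shape)
```

With three classes, the logistic coefficients form one row per class, while the random forest yields a single vector of importances that carries no sign and no per-class breakdown.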

Permutation testing

To test whether a classification score is significant, we repeat the classification procedure after randomly permuting the labels. The p-value is then the fraction of runs in which the score on permuted labels is at least as high as the score obtained on the original labels.

See: http://scikit-learn.org/stable/auto_examples/feature_selection/plot_permutation_test_for_classification.html


In [15]:
cv = StratifiedKFold(y, 2)

score, permutation_scores, pvalue = permutation_test_score(
    logistic, X, y, scoring="accuracy", cv=cv, n_permutations=100, n_jobs=1)

print("Classification score %s (pvalue : %s)" % (score, pvalue))


Classification score 0.621778552722 (pvalue : 0.00990099009901)
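
The p-value printed above equals 1/101, which is what scikit-learn's formula (C + 1) / (n_permutations + 1) gives when C, the number of permutations scoring at least as well as the original labels, is 0 out of 100. The procedure itself can be sketched in plain Python. The scoring function threshold_accuracy and the toy data are hypothetical stand-ins for the cross-validated classifier.

```python
import random

def permutation_p_value(score_fn, X, y, n_permutations=100, seed=0):
    """Score the original labels, then count how often shuffled labels do as well."""
    rng = random.Random(seed)
    observed = score_fn(X, y)
    at_least_as_good = 0
    for _ in range(n_permutations):
        shuffled = list(y)
        rng.shuffle(shuffled)
        if score_fn(X, shuffled) >= observed:
            at_least_as_good += 1
    # The +1 terms count the original labeling as one of the permutations.
    return observed, (at_least_as_good + 1) / (n_permutations + 1)

def threshold_accuracy(X, y):
    """Hypothetical stand-in for a classifier score: predict 1 when x > 0.5."""
    predictions = [1 if x > 0.5 else 0 for x in X]
    return sum(p == t for p, t in zip(predictions, y)) / len(y)

# Toy data (an assumption): the feature separates the two classes.
X_demo = [i / 19 for i in range(20)]
y_demo = [0] * 10 + [1] * 10
observed, p = permutation_p_value(threshold_accuracy, X_demo, y_demo)
print(observed, p)
```

Because of the +1 terms, the smallest p-value that 100 permutations can report is 1/101, so a lower bound on the p-value is set by the number of permutations rather than by the data.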

In [29]:
# View histogram of permutation scores
n_classes = np.unique(y).size

plt.hist(permutation_scores, label='Permutation scores')
ylim = plt.ylim()  # current y-axis limits, used to draw the vertical score line
plt.plot(2 * [score], ylim, '--g', linewidth=3,
         label='Classification Score'
         ' (pvalue %s)' % pvalue)
plt.legend()
plt.xlabel('Score')


Out[29]:
<matplotlib.text.Text at 0x1119a2210>

Examining Clusters

Here, for some demographics of interest, we plot the share of each demographic's users that falls into each cluster.
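
The normalization used for these plots divides each (demographic, cluster) count by that demographic's total, so the percentages for one demographic sum to 100% across clusters. A compact way to compute the same table is sketched below with pd.crosstab, on a hypothetical toy data frame.

```python
import pandas as pd

# Toy data (an assumption): four women and two men assigned to clusters 0-2.
demo = pd.DataFrame({
    'sex':   ['F', 'F', 'F', 'F', 'M', 'M'],
    'group': [0, 0, 1, 2, 1, 1],
})

# Rows are demographics, columns are clusters; each row sums to 1.
pct = pd.crosstab(demo['sex'], demo['group'], normalize='index')
print(pct)
```

Normalizing within each demographic keeps a large demographic from dominating the plot simply because it has more users.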


In [7]:
# Takes in an essay, data frame, and demographic list. Graphs the
# percentage each demographic is present within each cluster. Assumes that
# for each demographic listed, there is a filename.

def plot_bars(df, essay, demog_list, filenames):
    colors = ['#348ABD', '#A60628', '#7A68A6', '#467821', '#D55E00', '#CC79A7',
              '#56B4E9', '#009E73', '#F0E442', '#0072B2', '#A500FF', '#FFA500' ]  
    sns.set_style("dark")
    fs = 28
    for f, demog in enumerate(demog_list):
        this = pd.DataFrame({'count' :
                         df.groupby([demog, 'group'])['group'].count()}).reset_index()
        that = this.groupby(demog, as_index=False)['count'].sum()
        this = pd.merge(this, that, on=demog)
        this['pct'] = this.count_x / this.count_y

        fig, ax = plt.subplots(figsize=(12, 8))
    
        # lines
        lineval = this.groupby('group')['pct'].max()
        for i, g in enumerate(lineval):
            plt.plot([i, i], [0, g],
                     linewidth=10,
                     color='lightgray',
                     zorder=1)

        # markers
        for i, d in enumerate(this[demog].unique()):
            tdf = this[this[demog]==d]
            plt.scatter(range(len(tdf)), tdf.pct,
                    s=400,
                    color=colors[i],edgecolor = 'lightgray', lw = 4,
                    zorder=2, label=d.capitalize())
    
        plt.xlim(-0.5, len(tdf)-0.5)
        plt.ylim(0)
    
        plt.gca().get_yaxis().set_major_formatter(
                mpl.ticker.FuncFormatter(lambda y, p: format(y, '.0%')))

        plt.xlabel('Group')
        plt.ylabel('Normalized Percentage of Users')

        #plt.title(essay_dict[essay_list[0]], fontsize = 18, fontweight = 'bold', color = 'lightgray')
        lg = ax.legend(title=demog.title(), fontsize = fs, loc = 'upper right', bbox_to_anchor = (1.15, 1))
    
        for text in lg.get_texts():
            plt.setp(text, color = 'lightgray', weight = 'bold')
        lg.get_title().set_fontweight('bold')
        lg.get_title().set_color('lightgray')
        lg.get_title().set_fontsize(fs)

    
        for label in (ax.get_xticklabels() + ax.get_yticklabels()):
            label.set_fontweight('bold')
            label.set_fontsize(fs)
            label.set_color('lightgray')
        
        ax.xaxis.label.set_color('lightgray')
        ax.xaxis.label.set_fontweight('bold')
        ax.xaxis.label.set_fontsize(fs)
    
        ax.yaxis.label.set_color('lightgray')
        ax.yaxis.label.set_fontweight('bold')
        ax.yaxis.label.set_fontsize(fs)
    
        plt.savefig(filenames[f], transparent=True)

In [24]:
df_0_simple = df_0[df_0.gender_orientation.isin(['M straight','M gay', 'F straight','F gay'])]
plot_bars(df_0_simple, 'essay0', ['gender_orientation'], ['essay0_gender_orientation.png'])



In [25]:
df_0_simple = df_0[df_0.ethnicity.isin(['white','black','asian','hispanic', 'multi'])]
plot_bars(df_0_simple,'essay0', ['ethnicity'], ['essay0_ethnicity.png'])



In [26]:
plot_bars(df_4, 'essay4', ['sex'], ['essay4_sex.png'])



In [27]:
df_4_simple = df_4[df_4.ethnicity.isin(['white','black','asian','hispanic', 'multi'])]
plot_bars(df_4_simple,'essay4', ['ethnicity'], ['essay4_ethnicity.png'])


Word Clouds

The code for how we visualized our word clouds is below.

Essay 0 Word Clouds

Group Word Clouds

For a few groups of interest, we visualized the words with the largest difference in frequency between the group and the rest of the data set. We found that these words mapped perfectly to the words that Nonnegative Matrix Factorization generated for each group.


In [43]:
count_0 = count_matrix[np.array(df_0.group==2), :]
count_1 = count_matrix[np.array(df_0.group!=2), :]

wcloud(count_0, count_1, vocab, n, yellow, 'group2.png')


(2990, 1021) (49966, 1021)

In [44]:
count_0 = count_matrix[np.array(df_0.group==6), :]
count_1 = count_matrix[np.array(df_0.group!=6), :]

wcloud(count_0, count_1, vocab, n, cyan, 'group6.png')


(2319, 1021) (50637, 1021)

In [45]:
count_0 = count_matrix[np.array(df_0.group==11), :]
count_1 = count_matrix[np.array(df_0.group!=11), :]

wcloud(count_0, count_1, vocab, n, yellow, 'group11.png')


(2321, 1021) (50635, 1021)

In [46]:
count_0 = count_matrix[np.array(df_0.group==16), :]
count_1 = count_matrix[np.array(df_0.group!=16), :]

wcloud(count_0, count_1, vocab, n, cyan, 'group16.png')


(1684, 1021) (51272, 1021)

Demographic Split Word Clouds

We visualized a few of the interesting demographic splits below. For example, since gay men dominate group 1 when we look at gender crossed with orientation and group 1 discusses location, we wanted to show what kinds of words gay men tended to use in that group.

Ethnicity Split

Asians in Group 4


In [33]:
count_0 = count_matrix[np.array((df_clean.group==4) &
                                (df_clean.ethnicity=='asian')), :]
count_1 = count_matrix[np.array((df_clean.group!=4) &
                                (df_clean.ethnicity.isin(['black',
                                                          'hispanic / latin',
                                                          'multi',
                                                          'white']))), :]
wcloud(count_0, count_1, vocab, n, blue, 'essay0_group4_asian.png')


(570, 1021) (36744, 1021)

Hispanics in Group 9


In [35]:
count_0 = count_matrix[np.array((df_clean.group==9) &
                                (df_clean.ethnicity=='hispanic / latin')), :]
count_1 = count_matrix[np.array((df_clean.group!=9) &
                                (df_clean.ethnicity.isin(['black',
                                                          'asian',
                                                          'multi',
                                                          'white']))), :]

wcloud(count_0, count_1, vocab, n, purple, 'essay0_group9_hispanic.png')


(204, 1021) (41764, 1021)

Gender Orientation Split

Gay Men in Group 1


In [36]:
count_0 = count_matrix[np.array((df_clean.group==1) & (df_clean.gender_orientation=='M gay')), :]
count_1 = count_matrix[np.array((df_clean.group!=1) & (df_clean.gender_orientation!='M gay')), :]

wcloud(count_0, count_1, vocab, n, purple, 'essay0_group1_mgay.png')


(353, 1021) (45946, 1021)

Women in Group 2


In [37]:
count_0 = count_matrix[np.array((df_clean.group==2) & (df_clean.sex=='F')), :]
count_1 = count_matrix[np.array((df_clean.group!=2) & (df_clean.sex!='F')), :]

wcloud(count_0, count_1, vocab, n, red_blue, 'essay0_group2_f.png')


(1692, 1021) (30337, 1021)

Gay Men in Group 6


In [39]:
count_0 = count_matrix[np.array((df_clean.group==6) & (df_clean.gender_orientation=='M gay')), :]
count_1 = count_matrix[np.array((df_clean.group!=6) & (df_clean.gender_orientation!='M gay')), :]

wcloud(count_0, count_1, vocab, n, purple, 'essay0_group6_mgay.png')


(239, 1021) (47256, 1021)

Gay Women in Group 7


In [40]:
count_0 = count_matrix[np.array((df_clean.group==7) & (df_clean.gender_orientation=='F gay')), :]
count_1 = count_matrix[np.array((df_clean.group!=7) & (df_clean.gender_orientation!='F gay')), :]

wcloud(count_0, count_1, vocab, n, blue, 'essay0_group7_fgay.png')


(138, 1021) (48015, 1021)

Essay 4 Word Clouds


In [47]:
n=100

Group Word Clouds


In [48]:
count_0 = count_matrix[np.array(df_4.group==1), :]
count_1 = count_matrix[np.array(df_4.group!=1), :]

wcloud(count_0, count_1, vocab, n, yellow, 'group1.png')


(2314, 1396) (45791, 1396)

In [49]:
count_0 = count_matrix[np.array(df_clean.group==8), :]
count_1 = count_matrix[np.array(df_clean.group!=8), :]

wcloud(count_0, count_1, vocab, n, cyan, 'group8.png')


(2846, 1396) (45259, 1396)

In [50]:
count_0 = count_matrix[np.array(df_clean.group==15), :]
count_1 = count_matrix[np.array(df_clean.group!=15), :]

wcloud(count_0, count_1, vocab, n, yellow, 'group15.png')


(1702, 1396) (46403, 1396)

In [51]:
count_0 = count_matrix[np.array(df_clean.group==22), :]
count_1 = count_matrix[np.array(df_clean.group!=22), :]

wcloud(count_0, count_1, vocab, n, cyan, 'group22.png')


(3785, 1396) (44320, 1396)

Demographic Split Word Clouds

African Americans in Group 7


In [52]:
count_0 = count_matrix[np.array((df_clean.group==7) &
                                (df_clean.ethnicity=='black')), :]
count_1 = count_matrix[np.array((df_clean.group!=7) &
                                (df_clean.ethnicity.isin(['asian',
                                                          'hispanic / latin',
                                                          'multi',
                                                          'white']))), :]

wcloud(count_0, count_1, vocab, n, red, 'essay4_group7_black.png')


(189, 1396) (38436, 1396)

White/Multi people in Group 22


In [53]:
count_0 = count_matrix[np.array((df_clean.group==22) &
                                (df_clean.ethnicity.isin(['white',
                                                          'multi']))), :]
count_1 = count_matrix[np.array((df_clean.group!=22) &
                                (df_clean.ethnicity.isin(['asian',
                                                          'black',
                                                          'hispanic / latin']))), :]

wcloud(count_0, count_1, vocab, n, green_orange, 'essay4_group22_whitemulti.png')


(2783, 1396) (7994, 1396)

Asians in Group 12


In [54]:
count_0 = count_matrix[np.array((df_clean.group==12) & (df_clean.ethnicity=='asian')), :]
count_1 = count_matrix[np.array((df_clean.group!=12) & (df_clean.ethnicity!='asian')), :]

wcloud(count_0, count_1, vocab, n, blue, 'essay4_group12_asian.png')


(272, 1396) (42317, 1396)

White people in Group 1


In [55]:
count_0 = count_matrix[np.array((df_clean.group==1) & (df_clean.ethnicity=='white')), :]
count_1 = count_matrix[np.array((df_clean.group!=1) & (df_clean.ethnicity!='white')), :]

wcloud(count_0, count_1, vocab, n, orange, 'essay4_group1_white.png')


(1505, 1396) (20054, 1396)

Women in Groups 2, 15, and 24


In [57]:
count_0 = count_matrix[np.array((df_clean.group.isin([2, 15, 24])) & (df_clean.sex=='F')), :]
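# Baseline: users in the same groups who are not women. The shapes recorded below
# come from a filter on ethnicity != 'F', which matched every user in these groups.
count_1 = count_matrix[np.array((df_clean.group.isin([2, 15, 24])) & (df_clean.sex!='F')), :]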

wcloud(count_0, count_1, vocab, n, blue, 'essay4_movies_women.png')


(1812, 1396) (4669, 1396)

Men in Groups 2, 15, 24


In [59]:
count_0 = count_matrix[np.array((df_clean.group.isin([2, 15, 24])) & (df_clean.sex=='M')), :]
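# Baseline: users in the same groups who are not men. The shapes recorded below
# come from a filter on ethnicity != 'M', which matched every user in these groups.
count_1 = count_matrix[np.array((df_clean.group.isin([2, 15, 24])) & (df_clean.sex!='M')), :]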

wcloud(count_0, count_1, vocab, n, red, 'essay4_movies_men.png')


(2857, 1396) (4669, 1396)

In [ ]: