Results
Here we reproduce Figures 1 & 3 and Table 3 from the AAAI 2015 paper entitled "Using Matched Samples to Estimate the Effects of Exercise on Mental Health from Twitter".
This notebook reads in the final mood classifications for users in three different groups: the exercise group, a matched control group, and a random control group.
We compare the differences in aggregate mood classifications between groups using a Wilcoxon signed-rank test.
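For orientation, here is a minimal sketch of that test on synthetic paired data (the numbers below are made up purely for illustration; the actual test on the classifier output appears in a later cell):

# Illustrative only: a Wilcoxon signed-rank test on synthetic paired proportions.
import numpy as np
from scipy.stats import wilcoxon

np.random.seed(42)
match_prop = np.random.beta(2, 10, size=50)                  # hypothetical matched-group proportions
exercise_prop = np.clip(match_prop - 0.02 +
                        np.random.normal(0, 0.01, size=50), 0, 1)  # paired, slightly lower
stat, pval = wilcoxon(match_prop, exercise_prop)
print('W = %.1f, p = %.3g' % (stat, pval))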
Note: Since our annotated data is somewhat sensitive (e.g., linking Twitter accounts to mood), we have elected not to share a public link to the data. Please contact the authors to discuss possible data sharing agreements.
In [1]:
# Download and extract data (see note above).
import tarfile
import urllib
DATA_URL_PATH = 'http://tapi.cs.iit.edu/data/aaai-2015-matching/'
DATA_FILE = 'aaai-2015-matching-data.tgz'
DATA_URL = DATA_URL_PATH + DATA_FILE
print 'downloading %s' % (DATA_URL)
urllib.urlretrieve(DATA_URL, DATA_FILE)
print 'extracting %s' % (DATA_FILE)
tar = tarfile.open(DATA_FILE)
tar.extractall()
tar.close()
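The cell above uses Python 2 syntax (print statements, urllib.urlretrieve). If you are running Python 3, a roughly equivalent sketch, with the same URL and archive name, would be:

# Python 3 sketch of the download/extract step above (same URL and archive name).
import tarfile
from urllib.request import urlretrieve

DATA_URL = 'http://tapi.cs.iit.edu/data/aaai-2015-matching/aaai-2015-matching-data.tgz'
DATA_FILE = 'aaai-2015-matching-data.tgz'
print('downloading %s' % DATA_URL)
urlretrieve(DATA_URL, DATA_FILE)
print('extracting %s' % DATA_FILE)
with tarfile.open(DATA_FILE) as tar:
    tar.extractall()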
The main file contains classifier output for each user, their matched pair, and a random pair for each of the three mood classes.
In [2]:
!head -2 classifications.csv
The columns of this file are as follows:
u
: User id (exercise group)
u_AH
: Proportion of tweets predicted as Anger/Hostility
u_DD
: Proportion of tweets predicted as Depression/Dejection
u_TA
: Proportion of tweets predicted as Tension/Anxiety
u_avg
: Average proportion of AH/DD/TA
m
: User id (matched group)
m_AH
: Proportion of tweets predicted as Anger/Hostility
m_DD
: Proportion of tweets predicted as Depression/Dejection
m_TA
: Proportion of tweets predicted as Tension/Anxiety
m_avg
: Average proportion of AH/DD/TA
diff_AH
: (exercise - matched) proportion for AH
diff_DD
: (exercise - matched) proportion for DD
diff_TA
: (exercise - matched) proportion for TA
diff_avg
: (exercise - matched) proportion for avg
r
: User id (random control)
r_AH
: Proportion of tweets predicted as Anger/Hostility
r_DD
: Proportion of tweets predicted as Depression/Dejection
r_TA
: Proportion of tweets predicted as Tension/Anxiety
r_avg
: Average proportion of AH/DD/TA
A similar file was generated using a classifier trained on half of the training data. The columns are the same, but there are no columns for a random control.
In [3]:
!head -2 half_classifications.csv
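As a quick sanity check (a sketch, assuming the archive from the first cell has been extracted into the working directory), one can verify that the diff_* columns in classifications.csv equal the exercise-minus-matched differences described above:

# Sanity check: diff_* should equal u_* - m_* (up to rounding in the file).
import csv

with open('classifications.csv', 'rt') as f:
    rows = list(csv.DictReader(f))
print('%d rows' % len(rows))
for col in ['AH', 'DD', 'TA', 'avg']:
    max_err = max(abs(float(r['diff_' + col]) -
                      (float(r['u_' + col]) - float(r['m_' + col])))
                  for r in rows)
    print('diff_%s vs. u_%s - m_%s: max deviation %g' % (col, col, col, max_err))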
Finally, there are three files in stats/ containing profile information for each user in the exercise group (sport_users_stats), matched control group (nosport_users_stats), and random control group (random_users_stats).
In [4]:
!head -2 stats/sport_users_stats
The columns for these files are:
id
: user id
gender
: estimated gender (based on first name and Census data)
city
: estimated city of origin, from the location field
state
: estimated state of origin, from the location field
statuses_count
: number of tweets
followers_count
: number of followers
friends_count
: number of friends
With this data, we will perform hypothesis tests to measure the differences in estimated mood across groups.
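Before running the tests, here is a small sketch (assuming the stats/ files were extracted by the first cell) of how one of these profile files can be summarized; the 'f' gender code matches what the plotting cell below uses for % Female:

# Sketch: summarize one profile file (assumes stats/sport_users_stats exists locally).
import csv
from collections import Counter

with open('stats/sport_users_stats', 'rt') as f:
    sport = list(csv.DictReader(f))
print('%d exercise users' % len(sport))
genders = Counter(r['gender'] for r in sport)
print('%% female: %.1f' % (100.0 * genders['f'] / sum(genders.values())))
statuses = sorted(float(r['statuses_count']) for r in sport)
print('median statuses_count: %.0f' % statuses[len(statuses) // 2])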
In [5]:
# Read classifications.csv and half_classifications.csv
# Note that half_classifications includes rows without a random match, thus the counts differ.
from numpy import array as npa
def read_results(fname):
    header = None
    results = []
    for line in open(fname, 'rt'):
        parts = line.strip().split(',')
        if not header:
            header = parts
        else:
            results.append(npa([float(x) for x in parts]))
    return header, npa(results)
fields_all, results_all = read_results('classifications.csv')
fields_half, results_half = read_results('half_classifications.csv')
print 'read %d results from classifications.csv' % (len(results_all))
print 'read %d results from half_classifications.csv' % (len(results_half))
In [9]:
# Generate boxplots of the mood predictions from classifications.csv (Figure 3 in the paper).
import matplotlib.pyplot as plt
import numpy as np
def get_labels(fields, random_users=False):
    labels = []
    for i, (label, pretty_label) in enumerate([('AH', 'Hostility'), ('DD', 'Dejection'),
                                               ('TA', 'Anxiety')]):  # , ('avg', 'Average')]):
        sporty_idx = fields.index('u_' + label)
        nonsporty_idx = fields.index('m_' + label)
        labels_tuple = [label, nonsporty_idx, sporty_idx, pretty_label]
        if random_users:
            random_idx = fields.index('r_' + label)
            labels_tuple.append(random_idx)  # (label, nonsporty_idx, sporty_idx, pretty_label, random_idx)
        labels.append(labels_tuple)
    return labels

def plot_results(labels, results, random_users=False):
    f, axes = plt.subplots(1, 3, sharex=True, sharey=True, figsize=(5, 3))
    xticklabels = ['match', 'exercise']
    if random_users:
        xticklabels.append('random')
    for i, label in enumerate(labels):
        boxplots = [results[:, label[1]], results[:, label[2]]]
        if random_users:
            boxplots.append(results[:, label[4]])
        axes[i].boxplot(boxplots, showfliers=False, widths=.7)
        axes[i].set_ylabel('P(' + label[3] + ')', size=10)
        axes[i].yaxis.grid(True, linestyle='-', which='major', color='lightgrey',
                           alpha=0.5)
        axes[i].set_xticklabels(xticklabels, rotation=90)
    f.tight_layout()
    f.show()
    plt.savefig('classifications.pdf', bbox_inches='tight')
# classifier trained on all tweets
labels_all = get_labels(fields_all, random_users=False)
plot_results(labels_all, results_all, random_users=False)
In [10]:
# Perform Wilcoxon signed-rank test of significance (Table 3 from the paper).
import numpy as np
from scipy.stats import wilcoxon
def pct_reduction(before, after):
    return 100. * (after - before) / before

def test_significance(labels, results, idx1=1, idx2=2, diff_legend='% Change (vs. match)'):  # 1: nosport, 2: sport, 4: random
    print '%10s\t%15s\t%10s' % ('Category', diff_legend, 'p-value')
    for i, label in enumerate(labels):
        match = results[:, label[idx1]]
        exercise = results[:, label[idx2]]
        wil = wilcoxon(match, exercise)
        print '%10s\t%2.1f\t%10.2g' % (label[3],
                                       pct_reduction(np.mean(match), np.mean(exercise)),
                                       wil[1])
print 'all the labeled tweets'
labels_all = get_labels(fields_all, random_users=True)
test_significance(labels_all, results_all)
test_significance(labels_all, results_all, idx1=4, diff_legend='% Change (vs. random)')
print 'half of the labeled tweets'
labels_half = get_labels(fields_half)
test_significance(labels_half, results_half)
In [11]:
# Plot distribution of samples (Figure 1 from paper).
import csv
from collections import Counter
import random
random.seed(1234567)
def read_stats(fname):
    with open(fname, 'rb') as csvfile:
        reader = csv.DictReader(csvfile, delimiter=',')
        return [r for r in reader]

def get_col(stats, label, value):
    if value:
        counts = Counter([x[label] for x in stats])
        return 1. * counts[value] / sum(counts.values())
    else:
        return [float(x[label]) for x in stats]

# Plot boxplots for each stat.
def plot_stats(nosport, sport, rnd):
    f, axes = plt.subplots(2, 3, sharex=True, figsize=(9, 5))
    labels = [('statuses_count', '# Statuses', None),
              ('followers_count', '# Followers', None),
              ('friends_count', '# Friends', None),
              ('gender', '% Female', 'f'),
              ('state', '% from California', 'California'),
              ]
    for i, (label, pretty_label, value) in enumerate(labels):
        if i < 3:
            j = 0
        else:
            j = 1
        i = i % 3
        data = [get_col(nosport, label, value),
                get_col(sport, label, value),
                get_col(rnd, label, value)]
        if not value:
            axes[j, i].boxplot(data, showfliers=False, widths=.7)
        else:
            print data
            axes[j, i].bar(1 + np.arange(3), data, align='center', width=.7)
        axes[j, i].set_ylabel(pretty_label, size=10)
        axes[j, i].yaxis.grid(True, linestyle='-', which='major', color='lightgrey',
                              alpha=0.5)
        xticklabels = ['match', 'exercise', 'random']
        axes[j, i].set_xticklabels(xticklabels, rotation=90)
    axes[1, 2].axis('off')
    f.tight_layout()
    plt.savefig('matches.pdf', bbox_inches='tight')
random_stats = read_stats('stats/random_users_stats')
nosport_stats = read_stats('stats/nosport_users_stats')
sport_stats = read_stats('stats/sport_users_stats')
plot_stats(nosport_stats, sport_stats, random_stats)