Community: SuperUser

Initialization and data import are at the end of this notebook. They were placed at the bottom to make the analysis easier to read, but they must be run first for the analysis to work as expected.

Data Summary

Women


In [4]:
females.describe()


Out[4]:
       answers_total  comments_total  contributions_total  questions_total    reputation
count     788.000000      788.000000            788.00000       788.000000    788.000000
mean        5.181472        9.354061             17.28934         2.753807    238.482234
std        32.435824       41.087909             71.01479         7.462421    727.567127
min         0.000000        0.000000              1.00000         0.000000     51.000000
25%         0.000000        0.000000              2.00000         0.000000    101.000000
50%         1.000000        2.000000              4.00000         1.000000    116.000000
75%         2.000000        5.000000             10.00000         3.000000    156.000000
max       761.000000      612.000000           1299.00000       102.000000  15164.000000

In [5]:
females.median()


Out[5]:
answers_total            1
comments_total           2
contributions_total      4
questions_total          1
reputation             116
dtype: float64

Men


In [6]:
males.describe()


Out[6]:
       answers_total  comments_total  contributions_total  questions_total    reputation
count   18264.000000    18264.000000         18264.000000     18264.000000  18264.000000
mean        5.264838       10.483081            18.270039         2.522120    267.312308
std        42.707281       79.985537           118.000162         6.882908   1473.342909
min         0.000000        0.000000             1.000000         0.000000     50.000000
25%         0.000000        0.000000             2.000000         0.000000    101.000000
50%         1.000000        2.000000             4.000000         1.000000    118.000000
75%         2.000000        5.000000            10.000000         2.000000    170.000000
max      2717.000000     5176.000000          6571.000000       249.000000  99647.000000

In [7]:
males.median()


Out[7]:
answers_total            1
comments_total           2
contributions_total      4
questions_total          1
reputation             118
dtype: float64

Top contributors


In [8]:
pyplot.close('all')
histogram(females["reputation"], males["reputation"], 100, "Reputation")
histogram(females[females["reputation"]<= 1000]["reputation"], males[males["reputation"]<= 1000]["reputation"], 100, "Reputation")
histogram(females[females["reputation"]<= 400]["reputation"], males[males["reputation"]<= 400]["reputation"], 100, "Reputation")
pyplot.show()


Top Women


In [9]:
top_females = females[females["reputation"]> 400]
top_females.describe()


Out[9]:
       answers_total  comments_total  contributions_total  questions_total    reputation
count      61.000000       61.000000            61.000000        61.000000     61.000000
mean       49.967213       76.065574           138.262295        12.229508   1487.803279
std       107.280313      128.510294           219.763123        20.407020   2274.934327
min         0.000000        0.000000             1.000000         0.000000    411.000000
25%         6.000000       10.000000            29.000000         0.000000    517.000000
50%        18.000000       22.000000            52.000000         4.000000    726.000000
75%        45.000000       77.000000           138.000000        16.000000   1285.000000
max       761.000000      612.000000          1299.000000       102.000000  15164.000000

In [10]:
top_females.median()


Out[10]:
answers_total           18
comments_total          22
contributions_total     52
questions_total          4
reputation             726
dtype: float64

Top Men


In [11]:
top_males = males[males["reputation"]> 400]
top_males.describe()


Out[11]:
       answers_total  comments_total  contributions_total  questions_total    reputation
count    1525.000000     1525.000000          1525.000000      1525.000000   1525.000000
mean       46.017705       85.359344           142.239344        10.862295   1690.312787
std       141.280692      264.724019           385.911769        19.735440   4874.398334
min         0.000000        0.000000             1.000000         0.000000    401.000000
25%         7.000000       11.000000            26.000000         1.000000    511.000000
50%        16.000000       26.000000            53.000000         4.000000    713.000000
75%        38.000000       64.000000           118.000000        13.000000   1346.000000
max      2717.000000     5176.000000          6571.000000       249.000000  99647.000000

In [12]:
top_males.median()


Out[12]:
answers_total           16
comments_total          26
contributions_total     53
questions_total          4
reputation             713
dtype: float64

Common women contributors


In [13]:
common_females = females[females["reputation"] <= 400]
common_females.describe()


Out[13]:
       answers_total  comments_total  contributions_total  questions_total    reputation
count     727.000000      727.000000           727.000000       727.000000    727.000000
mean        1.423659        3.756534             7.138927         1.958735    133.656121
std         2.578656        7.738664            11.943679         4.215732     63.030995
min         0.000000        0.000000             1.000000         0.000000     51.000000
25%         0.000000        0.000000             1.000000         0.000000    101.000000
50%         1.000000        1.000000             3.000000         1.000000    111.000000
75%         2.000000        4.000000             8.000000         2.000000    141.000000
max        24.000000      109.000000           150.000000        64.000000    400.000000

In [14]:
common_females.median()


Out[14]:
answers_total            1
comments_total           1
contributions_total      3
questions_total          1
reputation             111
dtype: float64

Common men contributors


In [15]:
common_males = males[males["reputation"] <= 400]
common_males.describe()


Out[15]:
       answers_total  comments_total  contributions_total  questions_total    reputation
count   16739.000000    16739.000000         16739.000000     16739.000000  16739.000000
mean        1.552064        3.661509             6.975865         1.762292    137.670410
std         2.757249        6.522414            10.244380         3.051703     62.310956
min         0.000000        0.000000             1.000000         0.000000     50.000000
25%         0.000000        0.000000             1.000000         0.000000    101.000000
50%         1.000000        1.000000             3.000000         1.000000    115.000000
75%         2.000000        4.000000             8.000000         2.000000    150.000000
max        45.000000      135.000000           165.000000        47.000000    400.000000

In [16]:
common_males.median()


Out[16]:
answers_total            1
comments_total           1
contributions_total      3
questions_total          1
reputation             115
dtype: float64
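
The top/common split used in the cells above can be wrapped in a small helper. This is only a sketch and not part of the original notebook: the name `split_by_reputation` and the toy data frame are hypothetical, and the threshold of 400 is the one used above.

```python
from __future__ import print_function
import pandas

def split_by_reputation(users, threshold=400):
    """Return (top, common): users above the threshold and everyone else."""
    top = users[users["reputation"] > threshold]
    common = users[users["reputation"] <= threshold]
    return top, common

# Toy rows for illustration only; they are not taken from the database.
toy_users = pandas.DataFrame({
    "reputation": [51, 116, 400, 401, 15164],
    "answers_total": [0, 1, 2, 18, 761],
})

toy_top, toy_common = split_by_reputation(toy_users)
print(len(toy_top), len(toy_common))
```

Because the two masks are complementary, every user lands in exactly one of the two groups.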

First Question: Is the amount of contribution made by both genders equal?

Hypothesis 1: The number of questions posted per user is the same for both genders.

H0: questionsMadeBy(Males) = questionsMadeBy(Females);

H1: questionsMadeBy(Males) != questionsMadeBy(Females).

Data


In [17]:
females_questions = females['questions_total']
males_questions = males['questions_total']

The data's shape


In [18]:
show_data_shape(females_questions, males_questions, "expon", 100, "Questions")


Levene's test:  0.433575871571
Skewness for Females:  7.73840254123
Skewness for Males:  11.874721034
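
The two kinds of numbers printed above can be reproduced on any pair of samples. The sketch below uses synthetic exponential data (an assumption made purely for illustration) to show what is being reported: the value labelled "Levene's test" is the p-value of `stats.levene`, where a large value gives no evidence against equal variances, and a positive skewness indicates a long right tail.

```python
from __future__ import print_function
import numpy
from scipy import stats

# Synthetic, right-skewed samples of unequal size; not the notebook's data.
rng = numpy.random.RandomState(0)
sample_a = rng.exponential(scale=2.0, size=500)
sample_b = rng.exponential(scale=2.0, size=5000)

levene_p = stats.levene(sample_a, sample_b)[1]  # index 1 is the p-value
skew_a = stats.skew(sample_a)
skew_b = stats.skew(sample_b)

print("Levene's test: ", levene_p)
print("Skewness A: ", skew_a)
print("Skewness B: ", skew_b)
```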

Hypothesis test


In [19]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(females_questions, males_questions)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(females_questions, males_questions)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(females_questions, males_questions)[1]


Two-sample Kolmogorov-Smirnov test:  0.910725743006
Two-sample unpaired t-test:  0.35662752299
Two-sample Mann Whitney U test:  0.312338189818
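
For readers unfamiliar with the two-sample Kolmogorov-Smirnov test, its statistic is the largest vertical gap between the two empirical cumulative distribution functions. The following sketch computes that gap by hand and compares it with `stats.ks_2samp`. The helper name `ks_statistic` and the Poisson toy samples are assumptions introduced here, not part of the original analysis.

```python
from __future__ import division, print_function
import numpy
from scipy import stats

def ks_statistic(sample1, sample2):
    """Largest absolute difference between the two empirical CDFs."""
    s1 = numpy.sort(numpy.asarray(sample1))
    s2 = numpy.sort(numpy.asarray(sample2))
    pooled = numpy.concatenate([s1, s2])
    # Fraction of each sample that is <= every pooled value.
    cdf1 = numpy.searchsorted(s1, pooled, side="right") / len(s1)
    cdf2 = numpy.searchsorted(s2, pooled, side="right") / len(s2)
    return numpy.max(numpy.abs(cdf1 - cdf2))

# Synthetic count data, loosely resembling per-user totals.
rng = numpy.random.RandomState(1)
toy_a = rng.poisson(2.0, size=200)
toy_b = rng.poisson(2.5, size=300)

print("By hand: ", ks_statistic(toy_a, toy_b))
print("SciPy:   ", stats.ks_2samp(toy_a, toy_b)[0])
```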

Looking at the top contributors


In [20]:
top_females_questions = top_females['questions_total']
top_males_questions = top_males['questions_total']

The data's shape


In [21]:
show_data_shape(top_females_questions, top_males_questions, "expon", 50, "Questions")


Levene's test:  0.531855091358
Skewness for Females:  2.62798009912
Skewness for Males:  4.54530262722

Hypothesis test


In [22]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(top_females_questions, top_males_questions)[1]
# print "Two-sample unpaired t-test: ", stats.ttest_ind(top_females_questions, top_males_questions, equal_var=False)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(top_females_questions, top_males_questions)[1]


Two-sample Kolmogorov-Smirnov test:  0.860292442484
Two-sample Mann Whitney U test:  0.963905907159
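
The commented-out line in the cell above passes `equal_var=False`, which asks `stats.ttest_ind` for Welch's t-test, the variant that does not assume equal variances. The notebook does not say why that line was disabled. A minimal sketch on synthetic data follows; the sample sizes mirror the top-contributor counts, but the values themselves are made up.

```python
from __future__ import print_function
import numpy
from scipy import stats

# Synthetic samples with very different sizes; not the notebook's data.
rng = numpy.random.RandomState(2)
small_group = rng.exponential(scale=10.0, size=61)
large_group = rng.exponential(scale=10.0, size=1525)

# Student's t-test pools the variances; Welch's t-test does not.
student_p = stats.ttest_ind(small_group, large_group)[1]
welch_p = stats.ttest_ind(small_group, large_group, equal_var=False)[1]

print("Student's t-test: ", student_p)
print("Welch's t-test:   ", welch_p)
```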

Looking at the common contributors


In [23]:
common_females_questions = common_females['questions_total']
common_males_questions = common_males['questions_total']

The data's shape


In [24]:
show_data_shape(common_females_questions, common_males_questions, "expon", 50, "Questions")


Levene's test:  0.17236195401
Skewness for Females:  8.63149019094
Skewness for Males:  4.43712441964

Hypothesis test


In [25]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(common_females_questions, common_males_questions)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(common_females_questions, common_males_questions)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(common_females_questions, common_males_questions)[1]


Two-sample Kolmogorov-Smirnov test:  0.753405792677
Two-sample unpaired t-test:  0.0953455808063
Two-sample Mann Whitney U test:  0.177732574962
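
The factor of 2 in the Mann-Whitney lines converts a one-sided p-value into a two-sided one, which fits older SciPy releases where `stats.mannwhitneyu` reported a one-sided value by default. In releases that accept the `alternative` argument, the two-sided value can be requested directly and the doubling should then be dropped. A sketch on synthetic data (the Poisson samples are an assumption for illustration):

```python
from __future__ import print_function
import numpy
from scipy import stats

# Synthetic count data; not the notebook's data.
rng = numpy.random.RandomState(3)
toy_x = rng.poisson(2.0, size=300)
toy_y = rng.poisson(2.0, size=3000)

# Ask for the two-sided p-value explicitly instead of doubling.
two_sided_p = stats.mannwhitneyu(toy_x, toy_y, alternative="two-sided")[1]
print("Two-sample Mann Whitney U test: ", two_sided_p)
```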

Hypothesis 2: The number of answers posted per user is the same for both genders.

H0: answersMadeBy(Males) = answersMadeBy(Females);

H1: answersMadeBy(Males) != answersMadeBy(Females).

Data


In [26]:
females_answers = females['answers_total']
males_answers = males['answers_total']

The data's shape


In [27]:
show_data_shape(females_answers, males_answers, "expon", 80, "Answers")


Levene's test:  0.985904508936
Skewness for Females:  17.6451454612
Skewness for Males:  38.2529535683

Hypothesis test


In [28]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(females_answers, males_answers)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(females_answers, males_answers)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(females_answers, males_answers)[1]


Two-sample Kolmogorov-Smirnov test:  0.019018474395
Two-sample unpaired t-test:  0.956835255629
Two-sample Mann Whitney U test:  0.013862029385
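
The t-test compares means, while the Kolmogorov-Smirnov and Mann-Whitney tests work on distributions and ranks, so they can disagree on data as skewed as this. A p-value also says nothing about the size of a difference. One rank-based effect size is the probability that a random value from one group exceeds a random value from the other, with ties counted as one half. The sketch below is an addition to the notebook; `probability_of_superiority` is a hypothetical helper name and the toy samples are made up.

```python
from __future__ import division, print_function
import numpy
from scipy import stats

def probability_of_superiority(sample1, sample2):
    """Estimate P(X > Y) + 0.5 * P(X == Y) from the pooled ranks."""
    n1, n2 = len(sample1), len(sample2)
    ranks = stats.rankdata(numpy.concatenate([sample1, sample2]))
    # U statistic of the first sample: its rank sum minus the minimum possible.
    u1 = ranks[:n1].sum() - n1 * (n1 + 1) / 2
    return u1 / (n1 * n2)

# Toy values for illustration only.
toy_first = [0, 0, 1, 2, 5]
toy_second = [0, 1, 1, 3, 8, 13]
print(probability_of_superiority(toy_first, toy_second))
```

A value of 0.5 means neither group tends to have larger values; values near 0 or 1 mean one group dominates.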

Looking at the top contributors


In [29]:
top_females_answers = top_females['answers_total']
top_males_answers = top_males['answers_total']

The data's shape


In [30]:
show_data_shape(top_females_answers, top_males_answers, "expon", 30, "Answers")


Levene's test:  0.786871999544
Skewness for Females:  5.10907891089
Skewness for Males:  11.7765167222

Hypothesis test


In [31]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(top_females_answers, top_males_answers)[1]
# print "Two-sample unpaired t-test: ", stats.ttest_ind(top_females_answers, top_males_answers, equal_var=False)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(top_females_answers, top_males_answers)[1]


Two-sample Kolmogorov-Smirnov test:  0.846315564594
Two-sample Mann Whitney U test:  0.949291586374

Looking at the common contributors


In [32]:
common_females_answers = common_females['answers_total']
common_males_answers = common_males['answers_total']

The data's shape


In [33]:
show_data_shape(common_females_answers, common_males_answers, "expon", 30, "Answers")


Levene's test:  0.8520735041
Skewness for Females:  3.66586123076
Skewness for Males:  4.40464278479

Hypothesis test


In [34]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(common_females_answers, common_males_answers)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(common_females_answers, common_males_answers)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(common_females_answers, common_males_answers)[1]


Two-sample Kolmogorov-Smirnov test:  0.0259031006837
Two-sample unpaired t-test:  0.217790337014
Two-sample Mann Whitney U test:  0.0187332817762

Hypothesis 3: The number of comments posted per user is the same for both genders.

H0: commentsMadeBy(Males) = commentsMadeBy(Females);

H1: commentsMadeBy(Males) != commentsMadeBy(Females).

Data


In [35]:
females_comments = females['comments_total']
males_comments = males['comments_total']

The data's shape


In [36]:
show_data_shape(females_comments, males_comments, "expon", 80, "Comments")


Levene's test:  0.717713953021
Skewness for Females:  10.2071753555
Skewness for Males:  31.8754512154

Hypothesis test


In [37]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(females_comments, males_comments)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(females_comments, males_comments)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(females_comments, males_comments)[1]


Two-sample Kolmogorov-Smirnov test:  0.20186716981
Two-sample unpaired t-test:  0.693591431965
Two-sample Mann Whitney U test:  0.159336434041

Looking at the top contributors


In [38]:
top_females_comments = top_females['comments_total']
top_males_comments = top_males['comments_total']

The data's shape


In [39]:
show_data_shape(top_females_comments, top_males_comments, "expon", 50, "Comments")


Levene's test:  0.794347858966
Skewness for Females:  2.68491199573
Skewness for Males:  9.66810171654

Hypothesis test


In [40]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(top_females_comments, top_males_comments)[1]
# print "Two-sample unpaired t-test: ", stats.ttest_ind(top_females_comments, top_males_comments, equal_var=False)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(top_females_comments, top_males_comments)[1]


Two-sample Kolmogorov-Smirnov test:  0.996797794232
Two-sample Mann Whitney U test:  0.95997571275

Looking at the common contributors


In [41]:
common_females_comments = common_females['comments_total']
common_males_comments = common_males['comments_total']

The data's shape


In [42]:
show_data_shape(common_females_comments, common_males_comments, "expon", 80, "Comments")


Levene's test:  0.458013411564
Skewness for Females:  6.58191543776
Skewness for Males:  5.09206392428

Hypothesis test


In [43]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(common_females_comments, common_males_comments)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(common_females_comments, common_males_comments)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(common_females_comments, common_males_comments)[1]


Two-sample Kolmogorov-Smirnov test:  0.179683975535
Two-sample unpaired t-test:  0.702954884432
Two-sample Mann Whitney U test:  0.198998385768

Hypothesis 4: The total number of contributions per user is the same for both genders.

H0: contributionsBy(Males) = contributionsBy(Females);

H1: contributionsBy(Males) != contributionsBy(Females).

Data


In [44]:
females_contributions = females['contributions_total']
males_contributions = males['contributions_total']

The data's shape


In [45]:
show_data_shape(females_contributions, males_contributions, "expon", 80, "Contributions")


Levene's test:  0.824069398374
Skewness for Females:  11.264325847
Skewness for Males:  28.342789687

Hypothesis test


In [46]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(females_contributions, males_contributions)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(females_contributions, males_contributions)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(females_contributions, males_contributions)[1]


Two-sample Kolmogorov-Smirnov test:  0.99979981089
Two-sample unpaired t-test:  0.816932385132
Two-sample Mann Whitney U test:  0.623274917356

Looking at the top contributors


In [47]:
top_females_contributions = top_females['contributions_total']
top_males_contributions = top_males['contributions_total']

The data's shape


In [48]:
show_data_shape(top_females_contributions, top_males_contributions, "expon", 50, "Contributions")


Levene's test:  0.937212824873
Skewness for Females:  3.18240690367
Skewness for Males:  8.71568561288

Hypothesis test


In [49]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(top_females_contributions, top_males_contributions)[1]
# print "Two-sample unpaired t-test: ", stats.ttest_ind(top_females_contributions, top_males_contributions, equal_var=False)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(top_females_contributions, top_males_contributions)[1]


Two-sample Kolmogorov-Smirnov test:  0.944560191699
Two-sample Mann Whitney U test:  0.736329270594

Looking at the common contributors


In [50]:
common_females_contributions = common_females['contributions_total']
common_males_contributions = common_males['contributions_total']

The data's shape


In [51]:
show_data_shape(common_females_contributions, common_males_contributions, "expon", 80, "Contributions")


Levene's test:  0.635409476558
Skewness for Females:  5.82117296757
Skewness for Males:  4.13451965706

Hypothesis test


In [52]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(common_females_contributions, common_males_contributions)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(common_females_contributions, common_males_contributions)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(common_females_contributions, common_males_contributions)[1]


Two-sample Kolmogorov-Smirnov test:  0.999976407842
Two-sample unpaired t-test:  0.676649790493
Two-sample Mann Whitney U test:  0.800998946094
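
With four hypotheses and three groups, the results are easier to compare side by side. The sketch below collects the Mann-Whitney p-values printed above into one table and flags those under a significance level. The level of 0.05 is an assumption, since the notebook does not state one, and the names `mwu_p`, `ALPHA` and `rejected` are introduced here for illustration.

```python
from __future__ import print_function
import pandas

# Mann-Whitney U p-values copied from the outputs above.
mwu_p = pandas.DataFrame(
    {
        "all": [0.312338189818, 0.013862029385, 0.159336434041, 0.623274917356],
        "top": [0.963905907159, 0.949291586374, 0.95997571275, 0.736329270594],
        "common": [0.177732574962, 0.0187332817762, 0.198998385768, 0.800998946094],
    },
    index=["questions", "answers", "comments", "contributions"],
    columns=["all", "top", "common"],
)

ALPHA = 0.05  # assumed significance level; not stated in the notebook
rejected = mwu_p < ALPHA  # True where H0 would be rejected at ALPHA

print(mwu_p)
print(rejected)
```

At this assumed level, only the answers row is flagged, and only for all users and for common contributors.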

Initialization

This section contains the data import and some utility functions used in the analysis. Run it first; otherwise the analysis will not work.

Importing the data from the MongoDB database and loading it into a pandas DataFrame for easy manipulation.


In [1]:
from __future__ import division
import pymongo, time, pylab, numpy, pandas, vincent
from scipy import stats
import matplotlib as mpl
from matplotlib import pyplot

%matplotlib inline

client = pymongo.MongoClient('localhost', 27017)

community = 'superuser'
stats_db = client[community].statistics

cursor = stats_db.find({'$or': [{'questions_total':{'$gt':0}}, {'answers_total':{'$gt':0}}, {'comments_total':{'$gt':0}}] }, 
                       {u'_id': False, u'questions_total': True, u'reputation': True, u'contributions_total':True,
                        u'comments_total': True, u'answers_total': True, u'gender':True})

df =  pandas.DataFrame(list(cursor))

males = df[df['gender']=='Male']
females = df[df['gender']=='Female']
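
Without access to the MongoDB instance, the same frame layout can be built from a list of dictionaries, since that is what `list(cursor)` yields. The records below are hypothetical and exist only to show how the gender filters behave; the `toy_` names are chosen so they do not overwrite the notebook's own variables.

```python
from __future__ import print_function
import pandas

# Hypothetical records shaped like the documents returned by the query above.
toy_records = [
    {"gender": "Female", "questions_total": 1, "answers_total": 0,
     "comments_total": 2, "contributions_total": 3, "reputation": 116},
    {"gender": "Male", "questions_total": 0, "answers_total": 4,
     "comments_total": 1, "contributions_total": 5, "reputation": 170},
    {"gender": "Male", "questions_total": 2, "answers_total": 0,
     "comments_total": 0, "contributions_total": 2, "reputation": 101},
    {"gender": "Unknown", "questions_total": 1, "answers_total": 1,
     "comments_total": 1, "contributions_total": 3, "reputation": 118},
]

toy_df = pandas.DataFrame(toy_records)

# Same boolean-mask filters as above; rows with any other gender value drop out.
toy_males = toy_df[toy_df["gender"] == "Male"]
toy_females = toy_df[toy_df["gender"] == "Female"]

print(len(toy_males), len(toy_females))
```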

Utility functions for plotting.


In [2]:
pyplot.rcdefaults()
mpl.style.use('ggplot')

def histogram(sample1, sample2, bins, aspect):
    fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))
    axes[0].hist(list(sample1), bins)
    axes[0].set_title(aspect + " by Females - Histogram")
    axes[1].hist(list(sample2), bins)
    axes[1].set_title(aspect + " by Males - Histogram")

def pmf_plot(sample1, sample2, aspect):
    fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))
    
    aux1 = sample1.value_counts()
    pmf1 = aux1[aux1 >1].sort_index() / len(sample1)
#     pmf1 = sample1.value_counts().sort_index() / len(sample1)
    pmf1.plot(kind="bar", ax=axes[0])
    
    aux2 = sample2.value_counts()
    pmf2 = aux2[aux2 >1].sort_index() / len(sample2)
#     pmf2 = sample2.value_counts().sort_index() / len(sample2)
    pmf2.plot(kind="bar", ax=axes[1])
    
    axes[0].set_title(aspect + " by Females - Density")
    axes[1].set_title(aspect + " by Males - Density")
    axes[0].get_xaxis().set_visible(False)
    axes[1].get_xaxis().set_visible(False)    
    

    
def boxplot(sample1, sample2, aspect):
    fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))
    axes[0].boxplot(list(sample1))
    axes[0].set_title(aspect + " by Females - Boxplot")
    axes[1].boxplot(list(sample2))
    axes[1].set_title(aspect + " by Males - Boxplot")
    
def qq_plot(sample1, sample2, distribution, aspect):
    fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

    pyplot.subplot(121)
    stats.probplot(list(sample1), dist=distribution, plot=pyplot)
    axes[0].set_title(aspect + " by Females - QQPlot "+ distribution)

    pyplot.subplot(122)
    stats.probplot(list(sample2), dist=distribution, plot=pyplot)
    axes[1].set_title(aspect + " by Males - QQPlot "+ distribution)
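
The numbers behind `pmf_plot` can be inspected without drawing anything. The sketch below computes the same relative frequencies; the helper name `pmf` and its `min_count` parameter are assumptions added here. Setting `min_count=2` corresponds to the `aux1 > 1` filter above, which hides values that occur only once.

```python
from __future__ import division, print_function
import pandas

def pmf(sample, min_count=1):
    """Relative frequency of each value seen at least `min_count` times."""
    counts = sample.value_counts()
    return counts[counts >= min_count].sort_index() / len(sample)

# Toy sample for illustration only.
toy_sample = pandas.Series([0, 0, 0, 1, 1, 2, 5])

full_pmf = pmf(toy_sample)                  # every observed value
trimmed_pmf = pmf(toy_sample, min_count=2)  # drops values seen only once

print(full_pmf)
print(trimmed_pmf)
```

Note that the trimmed version no longer sums to one, because the denominator is still the full sample size.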

Utility functions for describing the data.


In [3]:
def describe(sample1, sample2):
    print sample1.describe()
    print "Median: ", sample1.median()
    print 
    print sample2.describe()
    print "Median: ", sample2.median()
    
def show_data_shape(sample1, sample2, dist, bins, aspect):
    pyplot.close('all')
    #histogram
    histogram(sample1, sample2, bins, aspect)

    #PMF
    pmf_plot(sample1, sample2, aspect)

    #QQPlot
    qq_plot(sample1, sample2, dist, aspect)

    #boxplot
    boxplot(sample1,sample2, aspect)
    pyplot.show()

    #Levene
    print "Levene's test: ", stats.levene(sample1, sample2)[1]
    
    #skewness
    print "Skewness for Females: ", stats.skew(sample1)
    print "Skewness for Males: ", stats.skew(sample2)
