Community: StackOverflow

Initialization and data import are at the end of this notebook. They were placed at the bottom to keep the analysis easy to read, but they must be run first for the analysis to work as expected. See the Initialization section at the end.

Data Summary

Women


In [4]:
females.describe()


Out[4]:
       answers_total  comments_total  contributions_total  questions_total     reputation
count    7897.000000     7897.000000           7897.00000      7897.000000    7897.000000
mean       22.341775       54.324807             90.21008        13.543624     694.373939
std       102.089690      208.542119            309.56576        31.424080    3526.726330
min         0.000000        0.000000              1.00000         0.000000      50.000000
25%         1.000000        3.000000             10.00000         1.000000      75.000000
50%         4.000000       14.000000             28.00000         5.000000     132.000000
75%        13.000000       40.000000             71.00000        14.000000     394.000000
max      3249.000000     6199.000000           8771.00000       728.000000  141184.000000

In [5]:
females.median()


Out[5]:
answers_total            4
comments_total          14
contributions_total     28
questions_total          5
reputation             132
dtype: float64

Men


In [6]:
males.describe()


Out[6]:
       answers_total  comments_total  contributions_total  questions_total     reputation
count  125340.000000   125340.000000        125340.000000    125340.000000  125340.000000
mean       34.937674       69.636533           116.008545        11.434690    1169.622946
std       195.043927      421.617094           603.472731        27.684822    6567.785536
min         0.000000        0.000000             1.000000         0.000000      50.000000
25%         2.000000        3.000000             9.000000         1.000000      86.000000
50%         6.000000       13.000000            28.000000         4.000000     176.000000
75%        20.000000       42.000000            78.000000        12.000000     606.000000
max     28137.000000    48476.000000         76642.000000      1327.000000  640237.000000

In [7]:
males.median()


Out[7]:
answers_total            6
comments_total          13
contributions_total     28
questions_total          4
reputation             176
dtype: float64

Top contributors


In [8]:
pyplot.close('all')
histogram(females["reputation"], males["reputation"], 100, "Reputation")
histogram(females[females["reputation"]<= 1000]["reputation"], males[males["reputation"]<= 1000]["reputation"], 100, "Reputation")
# histogram(females[females["reputation"]<= 450]["reputation"], males[males["reputation"]<= 450]["reputation"], 100, "Reputation")
pyplot.show()


Top Women


In [9]:
top_females = females[females["reputation"]> 1000]
top_females.describe()


Out[9]:
       answers_total  comments_total  contributions_total  questions_total     reputation
count     868.000000      868.000000           868.000000       868.000000     868.000000
mean      144.726959      295.945853           479.347926        38.675115    4651.165899
std       277.750081      563.258726           821.172839        77.020379    9763.712921
min         0.000000        0.000000             1.000000         0.000000    1002.000000
25%        32.000000       62.000000           135.750000         3.000000    1330.750000
50%        70.000000      134.000000           237.500000        12.500000    2027.500000
75%       133.250000      279.000000           464.500000        38.000000    3837.250000
max      3249.000000     6199.000000          8771.000000       728.000000  141184.000000

In [10]:
top_females.median()


Out[10]:
answers_total            70.0
comments_total          134.0
contributions_total     237.5
questions_total          12.5
reputation             2027.5
dtype: float64

Top Men


In [11]:
top_males = males[males["reputation"]> 1000]
top_males.describe()


Out[11]:
       answers_total  comments_total  contributions_total  questions_total     reputation
count   21572.000000    21572.000000         21572.000000     21572.000000   21572.000000
mean      161.496755      306.075932           496.715372        29.144029    5655.511218
std       448.405582      979.629931          1388.689957        56.172743   15036.391721
min         0.000000        0.000000             1.000000         0.000000    1001.000000
25%        38.000000       53.000000           114.000000         3.000000    1425.000000
50%        73.000000      118.000000           224.000000        10.000000    2272.500000
75%       146.000000      263.000000           451.000000        31.000000    4612.250000
max     28137.000000    48476.000000         76642.000000      1327.000000  640237.000000

In [12]:
top_males.median()


Out[12]:
answers_total            73.0
comments_total          118.0
contributions_total     224.0
questions_total          10.0
reputation             2272.5
dtype: float64

Common women contributors


In [13]:
common_females = females[females["reputation"] <= 1000]
common_females.describe()


Out[13]:
       answers_total  comments_total  contributions_total  questions_total     reputation
count    7029.000000     7029.000000          7029.000000      7029.000000    7029.000000
mean        7.228624       24.487409            42.156068        10.440176     205.756011
std        10.690553       40.272687            58.886717        17.028885     203.109675
min         0.000000        0.000000             1.000000         0.000000      50.000000
25%         1.000000        3.000000             8.000000         1.000000      71.000000
50%         3.000000       11.000000            23.000000         5.000000     116.000000
75%         9.000000       29.000000            52.000000        13.000000     271.000000
max       109.000000      696.000000           851.000000       205.000000     999.000000

In [14]:
common_females.median()


Out[14]:
answers_total            3
comments_total          11
contributions_total     23
questions_total          5
reputation             116
dtype: float64

Common men contributors


In [15]:
common_males = males[males["reputation"] <= 1000]
common_males.describe()


Out[15]:
       answers_total  comments_total  contributions_total  questions_total     reputation
count  103768.000000   103768.000000        103768.000000    103768.000000  103768.000000
mean        8.627708       20.483897            36.864611         7.753151     237.065878
std        11.440576       34.370571            51.056267        13.823759     223.953725
min         0.000000        0.000000             1.000000         0.000000      50.000000
25%         1.000000        2.000000             7.000000         0.000000      77.000000
50%         5.000000        9.000000            20.000000         3.000000     133.000000
75%        11.000000       25.000000            47.000000         9.000000     340.000000
max       157.000000     1206.000000          1382.000000       519.000000    1000.000000

In [16]:
common_males.median()


Out[16]:
answers_total            5
comments_total           9
contributions_total     20
questions_total          3
reputation             133
dtype: float64

First Question: Is the amount of contribution made by both genders equal?

Hypothesis 1: The amount of questions each user posted is the same between genders.

H0: questionsMadeBy(Males) = questionsMadeBy(Females);

H1: questionsMadeBy(Males) != questionsMadeBy(Females).
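
The rank-based test behind a hypothesis pair like this one can be sketched in a few lines. The block below is a minimal, standard-library-only Python 3 illustration of a two-sided Mann-Whitney U test using the normal approximation with a tie correction. It is not the code this notebook runs (the notebook calls stats.mannwhitneyu from SciPy), the function name mann_whitney_u is hypothetical, and no continuity correction is applied, so the p-values will differ slightly from SciPy's.

```python
from collections import Counter
from math import sqrt
from statistics import NormalDist


def mann_whitney_u(sample1, sample2):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Hypothetical helper for illustration only.
    Returns (U statistic of sample1, two-sided p-value).
    """
    n1, n2 = len(sample1), len(sample2)
    pooled = sorted(list(sample1) + list(sample2))

    # Give every distinct value the average of the ranks it occupies.
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j

    rank_sum1 = sum(ranks[x] for x in sample1)
    u1 = rank_sum1 - n1 * (n1 + 1) / 2

    # Variance of U under H0, reduced by the tie correction term.
    n = n1 + n2
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u1, 1.0  # all observations identical: no evidence against H0

    z = (u1 - n1 * n2 / 2) / sqrt(var_u)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u1, min(p_value, 1.0)


u, p = mann_whitney_u([1, 2, 3], [4, 5, 6])
print(u, p)
```

The hypothesis-test cells in this notebook multiply the SciPy p-value by 2, presumably because the SciPy version in use reported a one-sided p-value. The sketch above returns the two-sided value directly.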

Data


In [17]:
females_questions = females['questions_total']
males_questions = males['questions_total']

The data's shape


In [18]:
show_data_shape(females_questions, males_questions, "expon", 100, "Questions")


Levene's test:  1.64914205061e-06
Skewness for Females:  9.62029608923
Skewness for Males:  10.2143145184
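
To make the skewness figures above concrete, here is a minimal standard-library Python 3 sketch of the biased Fisher-Pearson coefficient, m3 / m2^1.5, which is the quantity stats.skew reports with its default settings. The function name skewness is hypothetical and the sample values are made up for illustration.

```python
def skewness(sample):
    """Biased Fisher-Pearson skewness: m3 / m2**1.5 (hypothetical helper)."""
    values = list(sample)
    n = len(values)
    mean = sum(values) / n
    # Second and third central moments, both divided by n (biased estimators).
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    if m2 == 0:
        return float("nan")  # skewness is undefined for constant data
    return m3 / m2 ** 1.5


# A symmetric sample has zero skewness; a long right tail gives a positive value.
print(skewness([1, 2, 3]))
print(skewness([0, 0, 0, 10]))
```

Large positive values such as the ones printed above indicate a long right tail: most users post few questions and a small number post very many.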

Hypothesis test


In [19]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(females_questions, males_questions)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(females_questions, males_questions, equal_var=False)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(females_questions, males_questions)[1]


Two-sample Kolmogorov-Smirnov test:  4.56135162741e-43
Two-sample unpaired t-test:  5.97727398308e-09
Two-sample Mann Whitney U test:  1.49099984806e-46

Looking at the top contributors


In [20]:
top_females_questions = top_females['questions_total']
top_males_questions = top_males['questions_total']

The data's shape


In [21]:
show_data_shape(top_females_questions, top_males_questions, "expon", 50, "Questions")


Levene's test:  5.04410356281e-06
Skewness for Females:  4.61297163874
Skewness for Males:  5.80204298277

Hypothesis test


In [22]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(top_females_questions, top_males_questions)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(top_females_questions, top_males_questions, equal_var=False)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(top_females_questions, top_males_questions)[1]


Two-sample Kolmogorov-Smirnov test:  0.0243552171711
Two-sample unpaired t-test:  0.000326185825131
Two-sample Mann Whitney U test:  0.000437244380937

Looking at the common contributors


In [23]:
common_females_questions = common_females['questions_total']
common_males_questions = common_males['questions_total']

The data's shape


In [24]:
show_data_shape(common_females_questions, common_males_questions, "expon", 50, "Questions")


Levene's test:  3.2106371363e-37
Skewness for Females:  4.27473910057
Skewness for Males:  5.3660608091

Hypothesis test


In [25]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(common_females_questions, common_males_questions)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(common_females_questions, common_males_questions, equal_var=False)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(common_females_questions, common_males_questions)[1]


Two-sample Kolmogorov-Smirnov test:  2.5088346718e-62
Two-sample unpaired t-test:  6.3758247538e-38
Two-sample Mann Whitney U test:  4.02101842182e-74

Hypothesis 2: The amount of answers each user posted is the same between genders.

H0: answersMadeBy(Males) = answersMadeBy(Females);

H1: answersMadeBy(Males) != answersMadeBy(Females).
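
When the two groups have unequal variances, the unpaired t-test is usually run in its Welch form, which is what equal_var=False selects in stats.ttest_ind. The following standard-library Python 3 sketch computes the Welch statistic and the Welch-Satterthwaite degrees of freedom. The name welch_t is hypothetical, and the p-value uses a normal approximation that is only reasonable for large samples, so it is an assumption of this sketch rather than what SciPy computes.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_t(sample1, sample2):
    """Welch's unequal-variance t statistic (hypothetical helper).

    Returns (t, degrees of freedom, large-sample two-sided p-value).
    """
    n1, n2 = len(sample1), len(sample2)
    # Squared standard error contributed by each group.
    se1 = variance(sample1) / n1
    se2 = variance(sample2) / n2

    t = (mean(sample1) - mean(sample2)) / sqrt(se1 + se2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    # Normal approximation to the t distribution (assumes df is large).
    p_approx = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, df, p_approx


t, df, p = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(t, df, p)
```

Because the t-test compares means, a few extreme contributors in heavily skewed data can dominate its result, which is one reason to read it alongside the rank-based tests.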

Data


In [26]:
females_answers = females['answers_total']
males_answers = males['answers_total']

The data's shape


In [27]:
show_data_shape(females_answers, males_answers, "expon", 80, "Answers")


Levene's test:  1.4276443818e-07
Skewness for Females:  16.3315430624
Skewness for Males:  50.6136951313

Hypothesis test


In [28]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(females_answers, males_answers)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(females_answers, males_answers, equal_var=False)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(females_answers, males_answers)[1]


Two-sample Kolmogorov-Smirnov test:  3.66827738146e-78
Two-sample unpaired t-test:  5.85506023238e-23
Two-sample Mann Whitney U test:  2.27954784175e-113

Looking at the top contributors


In [29]:
top_females_answers = top_females['answers_total']
top_males_answers = top_males['answers_total']

The data's shape


In [30]:
show_data_shape(top_females_answers, top_males_answers, "expon", 30, "Answers")


Levene's test:  0.388124663613
Skewness for Females:  5.98312376517
Skewness for Males:  23.3354710256

Hypothesis test


In [31]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(top_females_answers, top_males_answers)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(top_females_answers, top_males_answers)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(top_females_answers, top_males_answers)[1]


Two-sample Kolmogorov-Smirnov test:  0.03902926158
Two-sample unpaired t-test:  0.274224209117
Two-sample Mann Whitney U test:  0.0375535587907

Looking at the common contributors


In [32]:
common_females_answers = common_females['answers_total']
common_males_answers = common_males['answers_total']

The data's shape


In [33]:
show_data_shape(common_females_answers, common_males_answers, "expon", 30, "Answers")


Levene's test:  1.80455626133e-10
Skewness for Females:  3.01758328072
Skewness for Males:  2.89820550601

Hypothesis test


In [34]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(common_females_answers, common_males_answers)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(common_females_answers, common_males_answers, equal_var=False)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(common_females_answers, common_males_answers)[1]


Two-sample Kolmogorov-Smirnov test:  1.39170241474e-48
Two-sample unpaired t-test:  6.05122469436e-26
Two-sample Mann Whitney U test:  2.78000666163e-64

Hypothesis 3: The amount of comments each user posted is the same between genders.

H0: commentsMadeBy(Males) = commentsMadeBy(Females);

H1: commentsMadeBy(Males) != commentsMadeBy(Females).
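
The two-sample Kolmogorov-Smirnov test compares whole distributions rather than a single location parameter. As a rough standard-library Python 3 sketch, the statistic D is the largest gap between the two empirical CDFs, and an asymptotic p-value follows from the Kolmogorov series. The name ks_two_sample is hypothetical, and the asymptotic formula assumes continuous data, so for tied count data like these it is only an approximation and will not match stats.ks_2samp exactly.

```python
from bisect import bisect_right
from math import exp, sqrt


def ks_two_sample(sample1, sample2):
    """Two-sample KS statistic with an asymptotic p-value (hypothetical helper)."""
    a, b = sorted(sample1), sorted(sample2)
    n1, n2 = len(a), len(b)

    # Largest absolute difference between the two empirical CDFs,
    # evaluated at every observed value.
    d = max(
        abs(bisect_right(a, x) / n1 - bisect_right(b, x) / n2)
        for x in set(a) | set(b)
    )

    lam = sqrt(n1 * n2 / (n1 + n2)) * d
    if lam < 0.1:
        return d, 1.0  # the series converges too slowly here; p is essentially 1
    # Kolmogorov series: Q(lam) = 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 lam^2)
    p = 2 * sum((-1) ** (k - 1) * exp(-2 * k * k * lam * lam) for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


d, p = ks_two_sample([1, 2, 3], [4, 5, 6])
print(d, p)
```

A significant KS result with a non-significant Mann-Whitney result can occur when the distributions differ in shape or spread more than in typical rank.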

Data


In [35]:
females_comments = females['comments_total']
males_comments = males['comments_total']

The shape of the data


In [36]:
show_data_shape(females_comments, males_comments, "expon", 80, "Comments")


Levene's test:  0.000967877093454
Skewness for Females:  14.7461095991
Skewness for Males:  41.3093174939

Hypothesis test


In [37]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(females_comments, males_comments)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(females_comments, males_comments, equal_var=False)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(females_comments, males_comments)[1]


Two-sample Kolmogorov-Smirnov test:  0.00472290408769
Two-sample unpaired t-test:  6.08808658197e-09
Two-sample Mann Whitney U test:  0.361923909908

Looking at the top contributors


In [38]:
top_females_comments = top_females['comments_total']
top_males_comments = top_males['comments_total']

The data's shape


In [39]:
show_data_shape(top_females_comments, top_males_comments, "expon", 50, "Comments")


Levene's test:  0.56936408459
Skewness for Females:  5.44666245336
Skewness for Males:  18.3978061987

Hypothesis test


In [40]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(top_females_comments, top_males_comments)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(top_females_comments, top_males_comments)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(top_females_comments, top_males_comments)[1]


Two-sample Kolmogorov-Smirnov test:  0.0123055248553
Two-sample unpaired t-test:  0.762162402885
Two-sample Mann Whitney U test:  0.0108233814633

Looking at the common contributors


In [41]:
common_females_comments = common_females['comments_total']
common_males_comments = common_males['comments_total']

The data's shape


In [42]:
show_data_shape(common_females_comments, common_males_comments, "expon", 80, "Comments")


Levene's test:  3.49169361756e-16
Skewness for Females:  4.65617555894
Skewness for Males:  5.25141853108

Hypothesis test


In [43]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(common_females_comments, common_males_comments)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(common_females_comments, common_males_comments, equal_var=False)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(common_females_comments, common_males_comments)[1]


Two-sample Kolmogorov-Smirnov test:  4.40498956759e-17
Two-sample unpaired t-test:  4.71675054517e-16
Two-sample Mann Whitney U test:  4.9514669955e-21

Hypothesis 4: The total amount of contributions by each user is the same between genders.

H0: contributionsBy(Males) = contributionsBy(Females);

H1: contributionsBy(Males) != contributionsBy(Females).

Data


In [45]:
females_contributions = females['contributions_total']
males_contributions = males['contributions_total']

The data's shape


In [46]:
show_data_shape(females_contributions, males_contributions, "expon", 80, "Contributions")


Levene's test:  0.000114867174326
Skewness for Females:  13.9445368647
Skewness for Males:  41.935908188

Hypothesis test


In [47]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(females_contributions, males_contributions)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(females_contributions, males_contributions)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(females_contributions, males_contributions)[1]


Two-sample Kolmogorov-Smirnov test:  4.08269419041e-05
Two-sample unpaired t-test:  0.000164719188465
Two-sample Mann Whitney U test:  0.279834924476

Looking at the top contributors


In [48]:
top_females_contributions = top_females['contributions_total']
top_males_contributions = top_males['contributions_total']

The data's shape


In [49]:
show_data_shape(top_females_contributions, top_males_contributions, "expon", 50, "Contributions")


Levene's test:  0.520734254079
Skewness for Females:  5.27518913557
Skewness for Males:  19.1550328263

Hypothesis test


In [50]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(top_females_contributions, top_males_contributions)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(top_females_contributions, top_males_contributions, equal_var=False)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(top_females_contributions, top_males_contributions)[1]


Two-sample Kolmogorov-Smirnov test:  0.00402861711924
Two-sample unpaired t-test:  0.555261815852
Two-sample Mann Whitney U test:  0.041843615685

Looking at the common contributors


In [51]:
common_females_contributions = common_females['contributions_total']
common_males_contributions = common_males['contributions_total']

The data's shape


In [52]:
show_data_shape(common_females_contributions, common_males_contributions, "expon", 80, "Contributions")


Levene's test:  2.13099816828e-12
Skewness for Females:  3.9857155094
Skewness for Males:  4.25754446783

Hypothesis test


In [53]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(common_females_contributions, common_males_contributions)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(common_females_contributions, common_males_contributions)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(common_females_contributions, common_males_contributions)[1]


Two-sample Kolmogorov-Smirnov test:  2.43169439394e-13
Two-sample unpaired t-test:  8.73057151448e-17
Two-sample Mann Whitney U test:  2.71049911262e-15

Initialization

Here you can find the data import and some utility functions used for analysing the data. Run this section first; otherwise the analysis will not work.

Importing the data from the MongoDB database and loading it into a pandas DataFrame for easy manipulation.


In [1]:
from __future__ import division
import pymongo, time, pylab, numpy, pandas, vincent
from scipy import stats
import matplotlib as mpl
from matplotlib import pyplot

%matplotlib inline

client = pymongo.MongoClient('localhost', 27017)

community = 'stackoverflow'
stats_db = client[community].statistics

cursor = stats_db.find({'$or': [{'questions_total':{'$gt':0}}, {'answers_total':{'$gt':0}}, {'comments_total':{'$gt':0}}] }, 
                       {u'_id': False, u'questions_total': True, u'reputation': True, u'contributions_total':True,
                        u'comments_total': True, u'answers_total': True, u'gender':True})

df =  pandas.DataFrame(list(cursor))

males = df[df['gender']=='Male']
females = df[df['gender']=='Female']

Utility functions for plotting.


In [2]:
pyplot.rcdefaults()
mpl.style.use('ggplot')

def histogram(sample1, sample2, bins, aspect):
    fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))
    axes[0].hist(list(sample1), bins)
    axes[0].set_title(aspect + " by Females - Histogram")
    axes[1].hist(list(sample2), bins)
    axes[1].set_title(aspect + " by Males - Histogram")

def pmf_plot(sample1, sample2, aspect):
    fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))
    
    aux1 = sample1.value_counts()
    pmf1 = aux1[aux1 >1].sort_index() / len(sample1)
#     pmf1 = sample1.value_counts().sort_index() / len(sample1)
    pmf1.plot(kind="bar", ax=axes[0])
    
    aux2 = sample2.value_counts()
    pmf2 = aux2[aux2 >1].sort_index() / len(sample2)
#     pmf2 = sample2.value_counts().sort_index() / len(sample2)
    pmf2.plot(kind="bar", ax=axes[1])
    
    axes[0].set_title(aspect + " by Females - Density")
    axes[1].set_title(aspect + " by Males - Density")
    axes[0].get_xaxis().set_visible(False)
    axes[1].get_xaxis().set_visible(False)    
    

    
def boxplot(sample1, sample2, aspect):
    fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))
    axes[0].boxplot(list(sample1))
    axes[0].set_title(aspect + " by Females - Boxplot")
    axes[1].boxplot(list(sample2))
    axes[1].set_title(aspect + " by Males - Boxplot")
    
def qq_plot(sample1, sample2, distribution, aspect):
    fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

    pyplot.subplot(121)
    stats.probplot(list(sample1), dist=distribution, plot=pyplot)
    axes[0].set_title(aspect + " by Females - QQPlot "+ distribution)

    pyplot.subplot(122)
    stats.probplot(list(sample2), dist=distribution, plot=pyplot)
    axes[1].set_title(aspect + " by Males - QQPlot "+ distribution)

Utility functions for describing the data.


In [3]:
def describe(sample1, sample2):
    print sample1.describe()
    print "Median: ", sample1.median()
    print 
    print sample2.describe()
    print "Median: ", sample2.median()
    
def show_data_shape(sample1, sample2, dist, bins, aspect):
    pyplot.close('all')
    #histogram
    histogram(sample1, sample2, bins, aspect)

    #PMF
    pmf_plot(sample1, sample2, aspect)

    #QQPlot
    qq_plot(sample1, sample2, dist, aspect)

    #boxplot
    boxplot(sample1,sample2, aspect)
    pyplot.show()

    #Levene
    print "Levene's test: ", stats.levene(sample1, sample2)[1]
    
    #skewness
    print "Skewness for Females: ", stats.skew(sample1)
    print "Skewness for Males: ", stats.skew(sample2)
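
For readers who want to see what the Levene's test line measures, the following standard-library Python 3 sketch computes the median-centered (Brown-Forsythe) variant of the statistic for two groups, which is the centering stats.levene uses by default. The name levene_two_groups is hypothetical, and the p-value relies on a large-sample normal approximation (with two groups, W behaves like a squared t statistic), so it is an assumption of this sketch and not SciPy's exact F-based value.

```python
from math import sqrt
from statistics import NormalDist, mean, median


def levene_two_groups(sample1, sample2):
    """Median-centered Levene statistic for two groups (hypothetical helper).

    Returns (W, large-sample approximate p-value).
    """
    # Absolute deviations of each observation from its own group's median.
    deviations = []
    for group in (list(sample1), list(sample2)):
        center = median(group)
        deviations.append([abs(x - center) for x in group])

    n_total = sum(len(g) for g in deviations)
    grand_mean = sum(sum(g) for g in deviations) / n_total
    group_means = [mean(g) for g in deviations]

    between = sum(len(g) * (m - grand_mean) ** 2
                  for g, m in zip(deviations, group_means))
    within = sum((x - m) ** 2
                 for g, m in zip(deviations, group_means) for x in g)
    if within == 0:
        return float("nan"), float("nan")  # no spread in the deviations

    # With k = 2 groups, (N - k) / (k - 1) reduces to N - 2.
    w = (n_total - 2) * between / within
    p_approx = 2 * (1 - NormalDist().cdf(sqrt(w)))
    return w, p_approx


w, p = levene_two_groups([1, 2, 3, 4, 5], [10, 20, 30, 40, 50])
print(w, p)
```

A small p-value suggests the two groups have different variances, which is the usual reason to pass equal_var=False to stats.ttest_ind.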
