Community: Programmers

Initialization and data import are at the end of this notebook. They were placed at the bottom to keep the analysis easy to read, but they must be run first for the analysis to work as expected (see the last section).

Data Summary

Women


In [4]:
females.describe()


Out[4]:
answers_total comments_total contributions_total questions_total reputation
count 383.000000 383.000000 383.000000 383.000000 383.000000
mean 5.007833 13.929504 20.336815 1.402089 472.519582
std 19.049466 65.861013 82.806650 3.761919 1833.622104
min 0.000000 0.000000 1.000000 0.000000 51.000000
25% 0.000000 0.000000 1.000000 0.000000 101.000000
50% 1.000000 2.000000 3.000000 1.000000 127.000000
75% 2.000000 5.000000 8.000000 1.000000 213.500000
max 173.000000 911.000000 1080.000000 42.000000 24391.000000

In [5]:
females.median()


Out[5]:
answers_total            1
comments_total           2
contributions_total      3
questions_total          1
reputation             127
dtype: float64

Men


In [6]:
males.describe()


Out[6]:
answers_total comments_total contributions_total questions_total reputation
count 8734.000000 8734.000000 8734.000000 8734.000000 8734.000000
mean 5.112778 12.272956 18.316808 0.931074 421.200939
std 25.708283 70.775710 93.016729 2.419398 1759.699954
min 0.000000 0.000000 1.000000 0.000000 50.000000
25% 0.000000 1.000000 1.000000 0.000000 103.000000
50% 1.000000 2.000000 3.000000 0.000000 133.000000
75% 2.000000 5.000000 9.000000 1.000000 239.000000
max 796.000000 3125.000000 3905.000000 76.000000 62314.000000

In [7]:
males.median()


Out[7]:
answers_total            1
comments_total           2
contributions_total      3
questions_total          0
reputation             133
dtype: float64

Top contributors


In [8]:
pyplot.close('all')
histogram(females["reputation"], males["reputation"], 100, "Reputation")
histogram(females[females["reputation"]<= 1000]["reputation"], males[males["reputation"]<= 1000]["reputation"], 100, "Reputation")
histogram(females[females["reputation"]<= 450]["reputation"], males[males["reputation"]<= 450]["reputation"], 100, "Reputation")
pyplot.show()


Top Women


In [9]:
top_females = females[females["reputation"]> 450]
top_females.describe()


Out[9]:
answers_total comments_total contributions_total questions_total reputation
count 46.000000 46.000000 46.000000 46.000000 46.000000
mean 33.934783 96.347826 135.978261 5.717391 2842.021739
std 45.645690 169.532326 205.828892 9.427296 4687.361966
min 0.000000 1.000000 2.000000 0.000000 454.000000
25% 6.500000 15.500000 26.250000 0.000000 685.000000
50% 16.000000 33.500000 55.500000 1.500000 1223.000000
75% 32.250000 110.000000 179.000000 7.000000 2505.250000
max 173.000000 911.000000 1080.000000 42.000000 24391.000000

In [10]:
top_females.median()


Out[10]:
answers_total            16.0
comments_total           33.5
contributions_total      55.5
questions_total           1.5
reputation             1223.0
dtype: float64

Top Men


In [11]:
top_males = males[males["reputation"]> 450]
top_males.describe()


Out[11]:
answers_total comments_total contributions_total questions_total reputation
count 1113.000000 1113.000000 1113.000000 1113.000000 1113.000000
mean 31.483378 73.737646 108.030548 2.809524 2234.580413
std 66.073985 186.354871 241.459931 5.647596 4528.080814
min 0.000000 0.000000 1.000000 0.000000 451.000000
25% 5.000000 7.000000 16.000000 0.000000 606.000000
50% 12.000000 21.000000 36.000000 1.000000 910.000000
75% 28.000000 57.000000 90.000000 3.000000 1815.000000
max 796.000000 3125.000000 3905.000000 76.000000 62314.000000

In [12]:
top_males.median()


Out[12]:
answers_total           12
comments_total          21
contributions_total     36
questions_total          1
reputation             910
dtype: float64

Common women contributors


In [13]:
common_females = females[females["reputation"] <= 450]
common_females.describe()


Out[13]:
answers_total comments_total contributions_total questions_total reputation
count 337.000000 337.000000 337.000000 337.000000 337.000000
mean 1.059347 2.679525 4.551929 0.813056 149.086053
std 1.823142 5.034816 6.403901 1.135497 79.347610
min 0.000000 0.000000 1.000000 0.000000 51.000000
25% 0.000000 0.000000 1.000000 0.000000 101.000000
50% 0.000000 1.000000 2.000000 1.000000 121.000000
75% 1.000000 3.000000 5.000000 1.000000 159.000000
max 15.000000 61.000000 70.000000 6.000000 450.000000

In [14]:
common_females.median()


Out[14]:
answers_total            0
comments_total           1
contributions_total      2
questions_total          1
reputation             121
dtype: float64

Common men contributors


In [15]:
common_males = males[males["reputation"] <= 450]
common_males.describe()


Out[15]:
answers_total comments_total contributions_total questions_total reputation
count 7621.000000 7621.000000 7621.000000 7621.000000 7621.000000
mean 1.261514 3.296418 5.214670 0.656738 156.368062
std 1.984646 6.365563 7.767629 1.209652 79.177591
min 0.000000 0.000000 1.000000 0.000000 50.000000
25% 0.000000 0.000000 1.000000 0.000000 101.000000
50% 1.000000 1.000000 3.000000 0.000000 123.000000
75% 1.000000 4.000000 6.000000 1.000000 178.000000
max 23.000000 160.000000 169.000000 26.000000 449.000000

In [16]:
common_males.median()


Out[16]:
answers_total            1
comments_total           1
contributions_total      3
questions_total          0
reputation             123
dtype: float64
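
The top and common subsets above split each group at a reputation of 450, so together they should cover every user exactly once (46 + 337 = 383 women, 1113 + 7621 = 8734 men). A minimal sketch of that split on made-up data, written for Python 3; the helper name `split_by_reputation` and the toy values are illustrative assumptions, not part of the notebook:

```python
import pandas as pd

THRESHOLD = 450  # reputation cut-off used in this notebook


def split_by_reputation(frame, threshold=THRESHOLD):
    """Return (top, common): rows above the threshold and rows at or below it."""
    top = frame[frame["reputation"] > threshold]
    common = frame[frame["reputation"] <= threshold]
    return top, common


# Invented toy data, only to exercise the split.
toy = pd.DataFrame({
    "reputation": [51, 127, 450, 451, 24391],
    "questions_total": [0, 1, 1, 2, 42],
})

top, common = split_by_reputation(toy)

# The two subsets are disjoint and together cover the whole frame.
print(len(top), len(common), len(toy))
```

Because one subset uses `>` and the other `<=`, a user with a reputation of exactly 450 falls into the common group.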

First Question: Is the amount of contribution made by both genders equal?

Hypothesis 1: The number of questions posted per user is the same for both genders.

H0: questionsMadeBy(Males) = questionsMadeBy(Females);

H1: questionsMadeBy(Males) != questionsMadeBy(Females).

Data


In [17]:
females_questions = females['questions_total']
males_questions = males['questions_total']

The data's shape


In [18]:
show_data_shape(females_questions, males_questions, "expon", 100, "Questions")


Levene's test:  0.00111848500992
Skewness for Females:  7.08622496842
Skewness for Males:  10.0671926058
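
Levene's p-value above is well below 0.05, which suggests the two groups do not share a variance; this appears to be why the t-test for questions is run with `equal_var=False` (Welch's test), while the cells for answers, comments, and contributions, where Levene's p-value is large, keep the default. A minimal Python 3 sketch of that decision on synthetic data; the function name `welch_or_student`, the 0.05 cut-off, and the samples are assumptions made for illustration:

```python
import numpy as np
from scipy import stats


def welch_or_student(sample1, sample2, alpha=0.05):
    """Return (levene_p, equal_var, t_test_p) for two samples."""
    levene_p = float(stats.levene(sample1, sample2)[1])
    # Treat the variances as equal only when Levene's test does not reject equality.
    equal_var = bool(levene_p >= alpha)
    t_test_p = float(stats.ttest_ind(sample1, sample2, equal_var=equal_var)[1])
    return levene_p, equal_var, t_test_p


# Synthetic, skewed samples with clearly different spreads.
rng = np.random.RandomState(0)
narrow = rng.exponential(scale=1.0, size=400)
wide = rng.exponential(scale=4.0, size=400)

levene_p, equal_var, t_test_p = welch_or_student(narrow, wide)
print("Levene's test: ", levene_p)
print("equal_var: ", equal_var)
print("t-test: ", t_test_p)
```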

Hypothesis test


In [19]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(females_questions, males_questions)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(females_questions, males_questions, equal_var=False)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(females_questions, males_questions)[1]


Two-sample Kolmogorov-Smirnov test:  0.000299055340249
Two-sample unpaired t-test:  0.0156098841524
Two-sample Mann Whitney U test:  0.000164843365657
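
The factor of 2 in the Mann-Whitney line is presumably there because the SciPy version used here returned a one-sided p-value from `mannwhitneyu`. Assuming a SciPy release in which `mannwhitneyu` accepts the `alternative` argument, the two-sided value can be requested directly and the doubling is no longer needed. A Python 3 sketch of the same three tests on synthetic count data (the samples are invented, not taken from the database):

```python
import numpy as np
from scipy import stats

# Synthetic per-user counts for two groups with slightly different rates.
rng = np.random.RandomState(1)
group_a = rng.poisson(lam=1.0, size=300)
group_b = rng.poisson(lam=1.5, size=300)

# Kolmogorov-Smirnov compares the whole empirical distributions.
ks_p = stats.ks_2samp(group_a, group_b).pvalue
# Welch's t-test compares means without assuming equal variances.
welch_p = stats.ttest_ind(group_a, group_b, equal_var=False).pvalue
# Mann-Whitney U compares ranks; ask for the two-sided p-value explicitly.
mwu_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided").pvalue

print("Two-sample Kolmogorov-Smirnov test: ", ks_p)
print("Two-sample unpaired t-test: ", welch_p)
print("Two-sample Mann Whitney U test: ", mwu_p)
```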

Looking at the top contributors


In [20]:
top_females_questions = top_females['questions_total']
top_males_questions = top_males['questions_total']

The data's shape


In [21]:
show_data_shape(top_females_questions, top_males_questions, "expon", 50, "Questions")


Levene's test:  0.00108475383997
Skewness for Females:  2.36537834892
Skewness for Males:  4.92983109131

Hypothesis test


In [22]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(top_females_questions, top_males_questions)[1]
# print "Two-sample unpaired t-test: ", stats.ttest_ind(top_females_questions, top_males_questions, equal_var=False)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(top_females_questions, top_males_questions)[1]


Two-sample Kolmogorov-Smirnov test:  0.163788311325
Two-sample Mann Whitney U test:  0.0253877277412

Looking at the common contributors


In [23]:
common_females_questions = common_females['questions_total']
common_males_questions = common_males['questions_total']

The data's shape


In [24]:
show_data_shape(common_females_questions, common_males_questions, "expon", 50, "Questions")


Levene's test:  0.0268154535093
Skewness for Females:  2.20314138974
Skewness for Males:  4.45239253748

Hypothesis test


In [25]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(common_females_questions, common_males_questions)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(common_females_questions, common_males_questions, equal_var=False)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(common_females_questions, common_males_questions)[1]


Two-sample Kolmogorov-Smirnov test:  0.000647210766037
Two-sample unpaired t-test:  0.0141123140939
Two-sample Mann Whitney U test:  0.000417704898114

Hypothesis 2: The number of answers posted per user is the same for both genders.

H0: answersMadeBy(Males) = answersMadeBy(Females);

H1: answersMadeBy(Males) != answersMadeBy(Females).

Data


In [26]:
females_answers = females['answers_total']
males_answers = males['answers_total']

The data's shape


In [27]:
show_data_shape(females_answers, males_answers, "expon", 80, "Answers")


Levene's test:  0.95928992108
Skewness for Females:  6.71068321924
Skewness for Males:  14.9153762768

Hypothesis test


In [28]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(females_answers, males_answers)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(females_answers, males_answers)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(females_answers, males_answers)[1]


Two-sample Kolmogorov-Smirnov test:  0.00794681373679
Two-sample unpaired t-test:  0.937080121104
Two-sample Mann Whitney U test:  0.0113548354218

Looking at the top contributors


In [29]:
top_females_answers = top_females['answers_total']
top_males_answers = top_males['answers_total']

The data's shape


In [30]:
show_data_shape(top_females_answers, top_males_answers, "expon", 30, "Answers")


Levene's test:  0.975955876799
Skewness for Females:  2.00203179184
Skewness for Males:  5.64249191119

Hypothesis test


In [31]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(top_females_answers, top_males_answers)[1]
# print "Two-sample unpaired t-test: ", stats.ttest_ind(top_females_answers, top_males_answers, equal_var=False)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(top_females_answers, top_males_answers)[1]


Two-sample Kolmogorov-Smirnov test:  0.297215176502
Two-sample Mann Whitney U test:  0.147924002969

Looking at the common contributors


In [32]:
common_females_answers = common_females['answers_total']
common_males_answers = common_males['answers_total']

The data's shape


In [33]:
show_data_shape(common_females_answers, common_males_answers, "expon", 30, "Answers")


Levene's test:  0.722435887715
Skewness for Females:  3.49178050159
Skewness for Males:  3.67889638357

Hypothesis test


In [34]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(common_females_answers, common_males_answers)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(common_females_answers, common_males_answers)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(common_females_answers, common_males_answers)[1]


Two-sample Kolmogorov-Smirnov test:  0.0113491235257
Two-sample unpaired t-test:  0.0663885681543
Two-sample Mann Whitney U test:  0.00626502859677

Hypothesis 3: The number of comments posted per user is the same for both genders.

H0: commentsMadeBy(Males) = commentsMadeBy(Females);

H1: commentsMadeBy(Males) != commentsMadeBy(Females).

Data


In [35]:
females_comments = females['comments_total']
males_comments = males['comments_total']

The data's shape


In [36]:
show_data_shape(females_comments, males_comments, "expon", 80, "Comments")


Levene's test:  0.646775115852
Skewness for Females:  10.0103316886
Skewness for Males:  20.8596228734

Hypothesis test


In [37]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(females_comments, males_comments)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(females_comments, males_comments)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(females_comments, males_comments)[1]


Two-sample Kolmogorov-Smirnov test:  0.891645847285
Two-sample unpaired t-test:  0.653011944043
Two-sample Mann Whitney U test:  0.524181424445

Looking at the top contributors


In [38]:
top_females_comments = top_females['comments_total']
top_males_comments = top_males['comments_total']

The data's shape


In [39]:
show_data_shape(top_females_comments, top_males_comments, "expon", 50, "Comments")


Levene's test:  0.565765941571
Skewness for Females:  3.43962863414
Skewness for Males:  7.95182669752

Hypothesis test


In [40]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(top_females_comments, top_males_comments)[1]
# print "Two-sample unpaired t-test: ", stats.ttest_ind(top_females_comments, top_males_comments, equal_var=False)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(top_females_comments, top_males_comments)[1]


Two-sample Kolmogorov-Smirnov test:  0.0945951365778
Two-sample Mann Whitney U test:  0.0235779669453

Looking at the common contributors


In [41]:
common_females_comments = common_females['comments_total']
common_males_comments = common_males['comments_total']

The data's shape


In [42]:
show_data_shape(common_females_comments, common_males_comments, "expon", 80, "Comments")


Levene's test:  0.0844358570899
Skewness for Females:  6.7426246402
Skewness for Males:  7.17698625932

Hypothesis test


In [43]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(common_females_comments, common_males_comments)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(common_females_comments, common_males_comments)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(common_females_comments, common_males_comments)[1]


Two-sample Kolmogorov-Smirnov test:  0.601677170024
Two-sample unpaired t-test:  0.0793147185376
Two-sample Mann Whitney U test:  0.331127920343

Hypothesis 4: The total number of contributions per user is the same for both genders.

H0: contributionsBy(Males) = contributionsBy(Females);

H1: contributionsBy(Males) != contributionsBy(Females).

Data


In [44]:
females_contributions = females['contributions_total']
males_contributions = males['contributions_total']

The data's shape


In [45]:
show_data_shape(females_contributions, males_contributions, "expon", 80, "Contributions")


Levene's test:  0.663731992768
Skewness for Females:  8.73086943159
Skewness for Males:  18.071068573

Hypothesis test


In [46]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(females_contributions, males_contributions)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(females_contributions, males_contributions)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(females_contributions, males_contributions)[1]


Two-sample Kolmogorov-Smirnov test:  0.952478477814
Two-sample unpaired t-test:  0.676103355321
Two-sample Mann Whitney U test:  0.408508056203

Looking at the top contributors


In [47]:
top_females_contributions = top_females['contributions_total']
top_males_contributions = top_males['contributions_total']

The data's shape


In [48]:
show_data_shape(top_females_contributions, top_males_contributions, "expon", 50, "Contributions")


Levene's test:  0.631153604413
Skewness for Females:  2.98940120633
Skewness for Males:  6.950801335

Hypothesis test


In [49]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(top_females_contributions, top_males_contributions)[1]
# print "Two-sample unpaired t-test: ", stats.ttest_ind(top_females_contributions, top_males_contributions, equal_var=False)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(top_females_contributions, top_males_contributions)[1]


Two-sample Kolmogorov-Smirnov test:  0.0414309408302
Two-sample Mann Whitney U test:  0.0206558039535

Looking at the common contributors


In [50]:
common_females_contributions = common_females['contributions_total']
common_males_contributions = common_males['contributions_total']

The data's shape


In [51]:
show_data_shape(common_females_contributions, common_males_contributions, "expon", 80, "Contributions")


Levene's test:  0.134800730135
Skewness for Females:  5.32288004184
Skewness for Males:  5.68299718259

Hypothesis test


In [52]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(common_females_contributions, common_males_contributions)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(common_females_contributions, common_males_contributions)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(common_females_contributions, common_males_contributions)[1]


Two-sample Kolmogorov-Smirnov test:  0.806323387233
Two-sample unpaired t-test:  0.122813659011
Two-sample Mann Whitney U test:  0.294131678534

Initialization

This section contains the data import and the helper functions used in the analysis. Run it first; otherwise the analysis will not work.

Importing the data from the MongoDB database and loading it into a pandas DataFrame for easy manipulation.


In [1]:
from __future__ import division
import pymongo, time, pylab, numpy, pandas, vincent
from scipy import stats
import matplotlib as mpl
from matplotlib import pyplot

%matplotlib inline

client = pymongo.MongoClient('localhost', 27017)

community = 'programmers'
stats_db = client[community].statistics

cursor = stats_db.find({'$or': [{'questions_total':{'$gt':0}}, {'answers_total':{'$gt':0}}, {'comments_total':{'$gt':0}}] }, 
                       {u'_id': False, u'questions_total': True, u'reputation': True, u'contributions_total':True,
                        u'comments_total': True, u'answers_total': True, u'gender':True})

df =  pandas.DataFrame(list(cursor))

males = df[df['gender']=='Male']
females = df[df['gender']=='Female']
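
The MongoDB query above keeps only users with at least one question, answer, or comment, and the result is then split by the `gender` field. For readers without the database at hand, the same filtering can be sketched on an in-memory list of records. The records below are invented for illustration, and the code is Python 3:

```python
import pandas as pd

# Hypothetical records shaped like the documents returned by the query.
records = [
    {"gender": "Female", "questions_total": 1, "answers_total": 0,
     "comments_total": 2, "contributions_total": 3, "reputation": 127},
    {"gender": "Male", "questions_total": 0, "answers_total": 1,
     "comments_total": 2, "contributions_total": 3, "reputation": 133},
    {"gender": "Male", "questions_total": 0, "answers_total": 0,
     "comments_total": 0, "contributions_total": 0, "reputation": 101},
    {"gender": "Unknown", "questions_total": 2, "answers_total": 0,
     "comments_total": 0, "contributions_total": 2, "reputation": 60},
]

df = pd.DataFrame(records)

# Equivalent of the '$or' filter: at least one question, answer, or comment.
active = df[(df["questions_total"] > 0)
            | (df["answers_total"] > 0)
            | (df["comments_total"] > 0)]

males = active[active["gender"] == "Male"]
females = active[active["gender"] == "Female"]

print(len(active), len(males), len(females))
```

Users whose `gender` is neither 'Male' nor 'Female' stay in the frame but end up in neither group.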

Utility functions for plotting.


In [2]:
pyplot.rcdefaults()
mpl.style.use('ggplot')

def histogram(sample1, sample2, bins, aspect):
    fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))
    axes[0].hist(list(sample1), bins)
    axes[0].set_title(aspect + " by Females - Histogram")
    axes[1].hist(list(sample2), bins)
    axes[1].set_title(aspect + " by Males - Histogram")

def pmf_plot(sample1, sample2, aspect):
    fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))
    
    aux1 = sample1.value_counts()
    pmf1 = aux1[aux1 >1].sort_index() / len(sample1)
#     pmf1 = sample1.value_counts().sort_index() / len(sample1)
    pmf1.plot(kind="bar", ax=axes[0])
    
    aux2 = sample2.value_counts()
    pmf2 = aux2[aux2 >1].sort_index() / len(sample2)
#     pmf2 = sample2.value_counts().sort_index() / len(sample2)
    pmf2.plot(kind="bar", ax=axes[1])
    
    axes[0].set_title(aspect + " by Females - Density")
    axes[1].set_title(aspect + " by Males - Density")
    axes[0].get_xaxis().set_visible(False)
    axes[1].get_xaxis().set_visible(False)    
    

    
def boxplot(sample1, sample2, aspect):
    fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))
    axes[0].boxplot(list(sample1))
    axes[0].set_title(aspect + " by Females - Boxplot")
    axes[1].boxplot(list(sample2))
    axes[1].set_title(aspect + " by Males - Boxplot")
    
def qq_plot(sample1, sample2, distribution, aspect):
    fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

    pyplot.subplot(121)
    stats.probplot(list(sample1), dist=distribution, plot=pyplot)
    axes[0].set_title(aspect + " by Females - QQPlot "+ distribution)

    pyplot.subplot(122)
    stats.probplot(list(sample2), dist=distribution, plot=pyplot)
    axes[1].set_title(aspect + " by Males - QQPlot "+ distribution)
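
`pmf_plot` divides the value counts by the sample size, but first drops every value that occurs only once (`aux[aux > 1]`), so the plotted bars generally sum to less than 1. A small Python 3 sketch of the difference on an invented sample:

```python
import pandas as pd

# Invented sample: the values 2 and 5 each occur only once.
sample = pd.Series([0, 0, 0, 1, 1, 2, 5])

counts = sample.value_counts()

# Full probability mass function: relative frequency of every value.
pmf_full = counts.sort_index() / len(sample)

# What pmf_plot draws: values seen only once are removed before normalising.
pmf_trimmed = counts[counts > 1].sort_index() / len(sample)

print(pmf_full.sum())
print(pmf_trimmed.sum())
```

Dropping the singletons keeps the bar chart readable for long-tailed data, at the cost of hiding the tail.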

Utility functions for describing the data.


In [3]:
def describe(sample1, sample2):
    print sample1.describe()
    print "Median: ", sample1.median()
    print 
    print sample2.describe()
    print "Median: ", sample2.median()
    
def show_data_shape(sample1, sample2, dist, bins, aspect):
    pyplot.close('all')
    #histogram
    histogram(sample1, sample2, bins, aspect)

    #PMF
    pmf_plot(sample1, sample2, aspect)

    #QQPlot
    qq_plot(sample1, sample2, dist, aspect)

    #boxplot
    boxplot(sample1,sample2, aspect)
    pyplot.show()

    #Levene
    print "Levene's test: ", stats.levene(sample1, sample2)[1]
    
    #skewness
    print "Skewness for Females: ", stats.skew(sample1)
    print "Skewness for Males: ", stats.skew(sample2)
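
The skewness printed by `show_data_shape` comes from `stats.skew`, which by default computes the biased Fisher-Pearson coefficient: the third central moment divided by the second central moment raised to the power 3/2. A plain Python 3 sketch of that formula; the function name `sample_skewness` and the inputs are illustrative assumptions:

```python
def sample_skewness(values):
    """Biased Fisher-Pearson skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3]
right_tailed = [0, 0, 0, 1, 10]

print(sample_skewness(symmetric))
print(sample_skewness(right_tailed))
```

A symmetric sample has zero skewness; a long right tail, as in the contribution counts of this notebook, gives a positive value.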
