Community: Mathematics

Initialization and data import are at the end of this notebook. They were placed at the bottom to keep the analysis readable, but they must be run first for the analysis to work as expected.

Data Summary

Women


In [4]:
females.describe()


Out[4]:
       answers_total  comments_total  contributions_total  questions_total    reputation
count     633.000000      633.000000            633.00000       633.000000    633.000000
mean        7.385466       27.399684             44.64139         9.857820    387.303318
std        48.434478       84.821629            134.96054        17.905307   1958.974331
min         0.000000        0.000000              1.00000         0.000000     50.000000
25%         0.000000        2.000000              4.00000         1.000000     83.000000
50%         0.000000        7.000000             14.00000         4.000000    116.000000
75%         2.000000       20.000000             37.00000        11.000000    198.000000
max      1057.000000     1457.000000           2552.00000       178.000000  45388.000000

In [5]:
females.median()


Out[5]:
answers_total            0
comments_total           7
contributions_total     14
questions_total          4
reputation             116
dtype: float64

Men


In [6]:
males.describe()


Out[6]:
       answers_total  comments_total  contributions_total  questions_total     reputation
count    7492.000000     7492.000000          7492.000000      7492.000000    7492.000000
mean       21.642285       59.028831            87.239989         6.569541     868.969434
std       191.770553      411.057410           591.301783        18.395609    6045.880132
min         0.000000        0.000000             1.000000         0.000000      50.000000
25%         0.000000        1.000000             3.000000         1.000000     101.000000
50%         1.000000        4.000000             8.000000         2.000000     121.000000
75%         3.000000       17.000000            29.000000         5.000000     242.250000
max      8667.000000    14520.000000         23187.000000       650.000000  221779.000000

In [7]:
males.median()


Out[7]:
answers_total            1
comments_total           4
contributions_total      8
questions_total          2
reputation             121
dtype: float64

Top contributors


In [8]:
pyplot.close('all')
histogram(females["reputation"], males["reputation"], 100, "Reputation")
histogram(females[females["reputation"]<= 1000]["reputation"], males[males["reputation"]<= 1000]["reputation"], 100, "Reputation")
histogram(females[females["reputation"]<= 300]["reputation"], males[males["reputation"]<= 300]["reputation"], 100, "Reputation")
pyplot.show()


Top Women


In [9]:
top_females = females[females["reputation"]> 300]
top_females.describe()


Out[9]:
       answers_total  comments_total  contributions_total  questions_total    reputation
count     123.000000      123.000000           123.000000       123.000000    123.000000
mean       34.398374       98.260163           158.308943        25.650407   1524.268293
std       105.940266      172.845749           275.603098        33.073436   4272.204505
min         0.000000        0.000000             1.000000         0.000000    302.000000
25%         1.000000       20.000000            39.500000         4.000000    363.500000
50%         5.000000       43.000000            75.000000        14.000000    504.000000
75%        27.500000      105.000000           173.000000        34.000000   1341.500000
max      1057.000000     1457.000000          2552.000000       178.000000  45388.000000

In [10]:
top_females.median()


Out[10]:
answers_total            5
comments_total          43
contributions_total     75
questions_total         14
reputation             504
dtype: float64

Top Men


In [11]:
top_males = males[males["reputation"]> 300]
top_males.describe()


Out[11]:
       answers_total  comments_total  contributions_total  questions_total     reputation
count    1683.000000     1683.000000          1683.000000      1683.000000    1683.000000
mean       92.901961      241.774807           353.688651        19.014854    3451.025550
std       396.505569      842.072927          1210.299048        35.062562   12416.953799
min         0.000000        0.000000             1.000000         0.000000     301.000000
25%         5.000000       23.000000            40.000000         2.000000     421.500000
50%        15.000000       56.000000            90.000000         8.000000     751.000000
75%        50.000000      152.500000           229.000000        22.000000    1965.000000
max      8667.000000    14520.000000         23187.000000       650.000000  221779.000000

In [12]:
top_males.median()


Out[12]:
answers_total           15
comments_total          56
contributions_total     90
questions_total          8
reputation             751
dtype: float64

Common women contributors


In [13]:
common_females = females[females["reputation"] <= 300]
common_females.describe()


Out[13]:
       answers_total  comments_total  contributions_total  questions_total  reputation
count     510.000000      510.000000           510.000000       510.000000  510.000000
mean        0.870588       10.309804            17.227451         6.049020  113.094118
std         1.963047       16.325824            23.126441         7.815879   50.373253
min         0.000000        0.000000             1.000000         0.000000   50.000000
25%         0.000000        1.000000             4.000000         1.000000   74.000000
50%         0.000000        5.000000             9.000000         3.000000  105.000000
75%         1.000000       13.000000            22.000000         8.000000  136.750000
max        16.000000      135.000000           208.000000        73.000000  298.000000

In [14]:
common_females.median()


Out[14]:
answers_total            0
comments_total           5
contributions_total      9
questions_total          3
reputation             105
dtype: float64

Common men contributors


In [15]:
common_males = males[males["reputation"] <= 300]
common_males.describe()


Out[15]:
       answers_total  comments_total  contributions_total  questions_total   reputation
count    5809.000000     5809.000000          5809.000000      5809.000000  5809.000000
mean        0.996729        6.083147            10.043725         2.963849   120.888793
std         2.194680        9.880875            14.382269         4.747850    46.529120
min         0.000000        0.000000             1.000000         0.000000    50.000000
25%         0.000000        1.000000             2.000000         1.000000   101.000000
50%         0.000000        3.000000             5.000000         1.000000   111.000000
75%         1.000000        7.000000            12.000000         3.000000   136.000000
max        41.000000      136.000000           213.000000        77.000000   300.000000

In [16]:
common_males.median()


Out[16]:
answers_total            0
comments_total           3
contributions_total      5
questions_total          1
reputation             111
dtype: float64
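
The top/common split used in the cells above (reputation above 300 versus 300 or below) can be written as a small helper. This is only a sketch: `split_by_reputation` and the toy data are hypothetical and not part of the notebook, and the code uses Python 3 print syntax.

```python
import pandas

def split_by_reputation(users, threshold=300):
    """Return (top, common): rows above the threshold and rows at or below it."""
    top = users[users["reputation"] > threshold]
    common = users[users["reputation"] <= threshold]
    return top, common

# Hypothetical toy data, not taken from the notebook's database.
toy_users = pandas.DataFrame({
    "reputation": [50, 116, 300, 301, 504, 45388],
    "questions_total": [1, 4, 3, 14, 20, 178],
})

toy_top, toy_common = split_by_reputation(toy_users)
print(len(toy_top), len(toy_common))
print(toy_top["questions_total"].median(), toy_common["questions_total"].median())
```

Because the two masks are complementary, every user lands in exactly one of the two groups.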

First Question: Is the amount of contribution made by both genders equal?

Hypothesis 1: The amount of questions each user posted is the same between genders.

H0: questionsMadeBy(Males) = questionsMadeBy(Females);

H1: questionsMadeBy(Males) != questionsMadeBy(Females).

Data


In [17]:
females_questions = females['questions_total']
males_questions = males['questions_total']

The data's shape


In [18]:
show_data_shape(females_questions, males_questions, "expon", 100, "Questions")


Levene's test:  0.000849751747887
Skewness for Females:  4.94841683931
Skewness for Males:  12.4636994937
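
Levene's test checks whether the two samples have equal variances, which matters for the `equal_var` argument of `stats.ttest_ind`. The sketch below shows one way that p-value could drive the choice between Student's and Welch's t-test. It assumes a 0.05 significance level, which the notebook does not state; `ttest_with_levene` and the synthetic samples are hypothetical.

```python
import numpy as np
from scipy import stats

def ttest_with_levene(sample1, sample2, alpha=0.05):
    """Pick Student's or Welch's t-test based on Levene's test.

    alpha is an assumed significance level, not one taken from the notebook.
    """
    levene_p = stats.levene(sample1, sample2)[1]
    # Fail to reject equal variances -> Student's t-test; otherwise Welch's.
    equal_var = bool(levene_p >= alpha)
    t_p = stats.ttest_ind(sample1, sample2, equal_var=equal_var)[1]
    return levene_p, equal_var, t_p

# Synthetic, skewed samples with clearly different spreads.
rng = np.random.RandomState(0)
sample_a = rng.exponential(scale=1.0, size=500)
sample_b = rng.exponential(scale=5.0, size=500)

levene_p, equal_var, t_p = ttest_with_levene(sample_a, sample_b)
print("Levene p-value:", levene_p)
print("equal_var used:", equal_var)
print("t-test p-value:", t_p)
```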

Hypothesis test


In [19]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(females_questions, males_questions)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(females_questions, males_questions, equal_var=False)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(females_questions, males_questions)[1]


Two-sample Kolmogorov-Smirnov test:  1.2736105893e-23
Two-sample unpaired t-test:  1.09614974125e-05
Two-sample Mann Whitney U test:  8.48999928353e-24
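
The factor of 2 applied to the Mann-Whitney result is presumably there because the SciPy release used for this notebook reported a one-sided p-value. In SciPy releases that accept the `alternative` argument, the two-sided p-value can be requested directly, so no manual doubling is needed. A minimal sketch on synthetic count data (not the notebook's data):

```python
import numpy as np
from scipy import stats

# Synthetic count data standing in for per-user question totals.
rng = np.random.RandomState(1)
group_x = rng.poisson(4, size=200)
group_y = rng.poisson(2, size=300)

# Two-sided test: H1 is that the two distributions differ in location.
u_stat, p_two_sided = stats.mannwhitneyu(group_x, group_y, alternative='two-sided')

# One-sided variants, shown for comparison.
_, p_greater = stats.mannwhitneyu(group_x, group_y, alternative='greater')
_, p_less = stats.mannwhitneyu(group_x, group_y, alternative='less')

print("two-sided p-value:", p_two_sided)
print("one-sided (greater) p-value:", p_greater)
print("one-sided (less) p-value:", p_less)
```

Doubling the result of a call that already returns a two-sided p-value would overstate it, so it is worth checking which convention the installed version follows before reusing the cells in this notebook.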

Looking at the top contributors


In [20]:
top_females_questions = top_females['questions_total']
top_males_questions = top_males['questions_total']

The data's shape


In [21]:
show_data_shape(top_females_questions, top_males_questions, "expon", 50, "Questions")


Levene's test:  0.169764824762
Skewness for Females:  2.4568377947
Skewness for Males:  6.89878393647

Hypothesis test


In [22]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(top_females_questions, top_males_questions)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(top_females_questions, top_males_questions)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(top_females_questions, top_males_questions)[1]


Two-sample Kolmogorov-Smirnov test:  0.00104958701579
Two-sample unpaired t-test:  0.0421244907497
Two-sample Mann Whitney U test:  0.0012792837483

Looking at the common contributors


In [23]:
common_females_questions = common_females['questions_total']
common_males_questions = common_males['questions_total']

The data's shape


In [24]:
show_data_shape(common_females_questions, common_males_questions, "expon", 50, "Questions")


Levene's test:  5.73038405135e-29
Skewness for Females:  2.94685836601
Skewness for Males:  3.81531424074

Hypothesis test


In [25]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(common_females_questions, common_males_questions)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(common_females_questions, common_males_questions, equal_var=False)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(common_females_questions, common_males_questions)[1]


Two-sample Kolmogorov-Smirnov test:  2.14700854482e-27
Two-sample unpaired t-test:  2.25050266887e-17
Two-sample Mann Whitney U test:  1.45618342034e-30

Hypothesis 2: The amount of answers each user posted is the same between genders.

H0: answersMadeBy(Males) = answersMadeBy(Females);

H1: answersMadeBy(Males) != answersMadeBy(Females).

Data


In [26]:
females_answers = females['answers_total']
males_answers = males['answers_total']

The data's shape


In [27]:
show_data_shape(females_answers, males_answers, "expon", 80, "Answers")


Levene's test:  0.062240712799
Skewness for Females:  17.0853331839
Skewness for Males:  28.1890641474

Hypothesis test


In [28]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(females_answers, males_answers)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(females_answers, males_answers)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(females_answers, males_answers)[1]


Two-sample Kolmogorov-Smirnov test:  0.00053547127166
Two-sample unpaired t-test:  0.0621730361686
Two-sample Mann Whitney U test:  4.59267552005e-05

Looking at the top contributors


In [29]:
top_females_answers = top_females['answers_total']
top_males_answers = top_males['answers_total']

The data's shape


In [30]:
show_data_shape(top_females_answers, top_males_answers, "expon", 30, "Answers")


Levene's test:  0.127551545406
Skewness for Females:  7.70203512819
Skewness for Males:  13.6620478045

Hypothesis test


In [31]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(top_females_answers, top_males_answers)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(top_females_answers, top_males_answers)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(top_females_answers, top_males_answers)[1]


Two-sample Kolmogorov-Smirnov test:  4.85280156303e-09
Two-sample unpaired t-test:  0.102907397664
Two-sample Mann Whitney U test:  2.20950124279e-07

Looking at the common contributors


In [32]:
common_females_answers = common_females['answers_total']
common_males_answers = common_males['answers_total']

The data's shape


In [33]:
show_data_shape(common_females_answers, common_males_answers, "expon", 30, "Answers")


Levene's test:  0.209651341298
Skewness for Females:  4.04158836906
Skewness for Males:  5.18790616299

Hypothesis test


In [34]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(common_females_answers, common_males_answers)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(common_females_answers, common_males_answers)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(common_females_answers, common_males_answers)[1]


Two-sample Kolmogorov-Smirnov test:  0.277524093606
Two-sample unpaired t-test:  0.209651341297
Two-sample Mann Whitney U test:  0.0910416585296

Hypothesis 3: The amount of comments each user posted is the same between genders.

H0: commentsMadeBy(Males) = commentsMadeBy(Females);

H1: commentsMadeBy(Males) != commentsMadeBy(Females).

Data


In [35]:
females_comments = females['comments_total']
males_comments = males['comments_total']

The shape of the data


In [36]:
show_data_shape(females_comments, males_comments, "expon", 80, "Comments")


Levene's test:  0.0470578684224
Skewness for Females:  9.97643408407
Skewness for Males:  20.5811340941

Hypothesis test


In [37]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(females_comments, males_comments)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(females_comments, males_comments, equal_var=False)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(females_comments, males_comments)[1]


Two-sample Kolmogorov-Smirnov test:  4.4951937443e-06
Two-sample unpaired t-test:  5.92505509993e-08
Two-sample Mann Whitney U test:  4.92703460612e-05

Looking at the top contributors


In [38]:
top_females_comments = top_females['comments_total']
top_males_comments = top_males['comments_total']

The data's shape


In [39]:
show_data_shape(top_females_comments, top_males_comments, "expon", 50, "Comments")


Levene's test:  0.0614927925734
Skewness for Females:  4.82438746795
Skewness for Males:  10.0039692793

Hypothesis test


In [40]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(top_females_comments, top_males_comments)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(top_females_comments, top_males_comments)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(top_females_comments, top_males_comments)[1]


Two-sample Kolmogorov-Smirnov test:  0.0200277094992
Two-sample unpaired t-test:  0.0593482268574
Two-sample Mann Whitney U test:  0.0426858355308

Looking at the common contributors


In [41]:
common_females_comments = common_females['comments_total']
common_males_comments = common_males['comments_total']

The data's shape


In [42]:
show_data_shape(common_females_comments, common_males_comments, "expon", 80, "Comments")


Levene's test:  2.03180869834e-15
Skewness for Females:  3.94070664171
Skewness for Males:  4.09901342558

Hypothesis test


In [43]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(common_females_comments, common_males_comments)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(common_females_comments, common_males_comments, equal_var=False)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(common_females_comments, common_males_comments)[1]


Two-sample Kolmogorov-Smirnov test:  1.29156220726e-10
Two-sample unpaired t-test:  1.45182752983e-08
Two-sample Mann Whitney U test:  8.58661618015e-13

Hypothesis 4: The total amount of contributions by each user is the same between genders.

H0: contributionsBy(Males) = contributionsBy(Females);

H1: contributionsBy(Males) != contributionsBy(Females).

Data


In [44]:
females_contributions = females['contributions_total']
males_contributions = males['contributions_total']

The data's shape


In [45]:
show_data_shape(females_contributions, males_contributions, "expon", 80, "Contributions")


Levene's test:  0.0577124683045
Skewness for Females:  11.6738514995
Skewness for Males:  22.2614873092

Hypothesis test


In [46]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(females_contributions, males_contributions)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(females_contributions, males_contributions)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(females_contributions, males_contributions)[1]


Two-sample Kolmogorov-Smirnov test:  2.16888119418e-10
Two-sample unpaired t-test:  0.070570409033
Two-sample Mann Whitney U test:  1.81246085606e-08

Looking at the top contributors


In [47]:
top_females_contributions = top_females['contributions_total']
top_males_contributions = top_males['contributions_total']

The data's shape


In [48]:
show_data_shape(top_females_contributions, top_males_contributions, "expon", 50, "Contributions")


Levene's test:  0.0777221084768
Skewness for Females:  5.81584337868
Skewness for Males:  10.8938909315

Hypothesis test


In [49]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(top_females_contributions, top_males_contributions)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(top_females_contributions, top_males_contributions, equal_var=False)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(top_females_contributions, top_males_contributions)[1]


Two-sample Kolmogorov-Smirnov test:  0.074323950921
Two-sample unpaired t-test:  5.39235197372e-07
Two-sample Mann Whitney U test:  0.0729170272654

Looking at the common contributors


In [50]:
common_females_contributions = common_females['contributions_total']
common_males_contributions = common_males['contributions_total']

The data's shape


In [51]:
show_data_shape(common_females_contributions, common_males_contributions, "expon", 80, "Contributions")


Levene's test:  1.31415724959e-17
Skewness for Females:  3.41786564393
Skewness for Males:  3.76624009367

Hypothesis test


In [52]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(common_females_contributions, common_males_contributions)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(common_females_contributions, common_males_contributions)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(common_females_contributions, common_males_contributions)[1]


Two-sample Kolmogorov-Smirnov test:  7.90802268576e-17
Two-sample unpaired t-test:  3.58474976324e-24
Two-sample Mann Whitney U test:  9.50798589351e-21

Initialization

This section contains the data import and some utility functions used in the analysis. Run it first; otherwise the analysis will not work.

Import the data from the MongoDB database and load it into a pandas DataFrame for easy manipulation.


In [1]:
from __future__ import division
import pymongo, time, pylab, numpy, pandas, vincent
from scipy import stats
import matplotlib as mpl
from matplotlib import pyplot

%matplotlib inline

client = pymongo.MongoClient('localhost', 27017)

community = 'math'
stats_db = client[community].statistics

cursor = stats_db.find({'$or': [{'questions_total':{'$gt':0}}, {'answers_total':{'$gt':0}}, {'comments_total':{'$gt':0}}] }, 
                       {u'_id': False, u'questions_total': True, u'reputation': True, u'contributions_total':True,
                        u'comments_total': True, u'answers_total': True, u'gender':True})

df =  pandas.DataFrame(list(cursor))

males = df[df['gender']=='Male']
females = df[df['gender']=='Female']
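
To try the DataFrame construction and the gender split without a running MongoDB instance, the cursor can be replaced by a plain list of dictionaries with the same fields. This is only a sketch: the records below are made up for illustration and the `'Unknown'` gender value is an assumption about what other values the collection might hold.

```python
import pandas

# Made-up records with the same fields the query projects.
records = [
    {'gender': 'Female', 'questions_total': 4, 'answers_total': 0,
     'comments_total': 7, 'contributions_total': 11, 'reputation': 116},
    {'gender': 'Female', 'questions_total': 14, 'answers_total': 5,
     'comments_total': 43, 'contributions_total': 62, 'reputation': 504},
    {'gender': 'Male', 'questions_total': 2, 'answers_total': 1,
     'comments_total': 4, 'contributions_total': 7, 'reputation': 121},
    {'gender': 'Male', 'questions_total': 8, 'answers_total': 15,
     'comments_total': 56, 'contributions_total': 79, 'reputation': 751},
    {'gender': 'Unknown', 'questions_total': 1, 'answers_total': 0,
     'comments_total': 0, 'contributions_total': 1, 'reputation': 50},
]

# pandas.DataFrame accepts a list of dicts, just like list(cursor).
df = pandas.DataFrame(records)

# Rows whose gender is neither 'Male' nor 'Female' fall out of both groups.
males = df[df['gender'] == 'Male']
females = df[df['gender'] == 'Female']

print(len(df), len(males), len(females))
```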

Utility functions for plotting.


In [2]:
pyplot.rcdefaults()
mpl.style.use('ggplot')

def histogram(sample1, sample2, bins, aspect):
    fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))
    axes[0].hist(list(sample1), bins)
    axes[0].set_title(aspect + " by Females - Histogram")
    axes[1].hist(list(sample2), bins)
    axes[1].set_title(aspect + " by Males - Histogram")

def pmf_plot(sample1, sample2, aspect):
    fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))
    
    aux1 = sample1.value_counts()
    pmf1 = aux1[aux1 >1].sort_index() / len(sample1)
#     pmf1 = sample1.value_counts().sort_index() / len(sample1)
    pmf1.plot(kind="bar", ax=axes[0])
    
    aux2 = sample2.value_counts()
    pmf2 = aux2[aux2 >1].sort_index() / len(sample2)
#     pmf2 = sample2.value_counts().sort_index() / len(sample2)
    pmf2.plot(kind="bar", ax=axes[1])
    
    axes[0].set_title(aspect + " by Females - Density")
    axes[1].set_title(aspect + " by Males - Density")
    axes[0].get_xaxis().set_visible(False)
    axes[1].get_xaxis().set_visible(False)    
    

    
def boxplot(sample1, sample2, aspect):
    fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))
    axes[0].boxplot(list(sample1))
    axes[0].set_title(aspect + " by Females - Boxplot")
    axes[1].boxplot(list(sample2))
    axes[1].set_title(aspect + " by Males - Boxplot")
    
def qq_plot(sample1, sample2, distribution, aspect):
    fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

    pyplot.subplot(121)
    stats.probplot(list(sample1), dist=distribution, plot=pyplot)
    axes[0].set_title(aspect + " by Females - QQPlot "+ distribution)

    pyplot.subplot(122)
    stats.probplot(list(sample2), dist=distribution, plot=pyplot)
    axes[1].set_title(aspect + " by Males - QQPlot "+ distribution)

Utility functions for describing the data.


In [3]:
def describe(sample1, sample2):
    print sample1.describe()
    print "Median: ", sample1.median()
    print 
    print sample2.describe()
    print "Median: ", sample2.median()
    
def show_data_shape(sample1, sample2, dist, bins, aspect):
    pyplot.close('all')
    #histogram
    histogram(sample1, sample2, bins, aspect)

    #PMF
    pmf_plot(sample1, sample2, aspect)

    #QQPlot
    qq_plot(sample1, sample2, dist, aspect)

    #boxplot
    boxplot(sample1,sample2, aspect)
    pyplot.show()

    #Levene
    print "Levene's test: ", stats.levene(sample1, sample2)[1]
    
    #skewness
    print "Skewness for Females: ", stats.skew(sample1)
    print "Skewness for Males: ", stats.skew(sample2)
