Community: Programmers

Initialization and data import are at the end of this notebook. They were placed at the bottom so the analysis is easier to read, but those cells must be run first for the analysis to work as expected. See the Initialization section at the bottom.

Data Summary

Women



In [ ]:

    
females.describe()



In [ ]:

    
females.median()

Men



In [ ]:

    
males.describe()



In [ ]:

    
males.median()

Top contributors



In [ ]:

    
pyplot.close('all')
histogram(females["reputation"], males["reputation"], 100, "Reputation")
histogram(females[females["reputation"]<= 1000]["reputation"], males[males["reputation"]<= 1000]["reputation"], 100, "Reputation")
histogram(females[females["reputation"]<= 450]["reputation"], males[males["reputation"]<= 450]["reputation"], 100, "Reputation")
pyplot.show()

Top Women



In [ ]:

    
top_females = females[females["reputation"]> 450]
top_females.describe()



In [ ]:

    
top_females.median()

Top Men



In [ ]:

    
top_males = males[males["reputation"]> 450]
top_males.describe()



In [ ]:

    
top_males.median()

Common women contributors



In [ ]:

    
common_females = females[females["reputation"] <= 450]
common_females.describe()



In [ ]:

    
common_females.median()

Common men contributors



In [ ]:

    
common_males = males[males["reputation"] <= 450]
common_males.describe()



In [ ]:

    
common_males.median()

First Question: Is the amount of contribution made by both genders equal?

Hypothesis 1: The number of questions each user posted is the same between genders.

H0: questionsMadeBy(Males) = questionsMadeBy(Females);

H1: questionsMadeBy(Males) != questionsMadeBy(Females).

Data



In [ ]:

    
females_questions = females['questions_total']
males_questions = males['questions_total']

The data's shape



In [ ]:

    
show_data_shape(females_questions, males_questions, "expon", 100, "Questions")

Hypothesis test



In [ ]:

    
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(females_questions, males_questions)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(females_questions, males_questions, equal_var=False)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(females_questions, males_questions)[1]
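
The three calls above can be wrapped in one helper so that every comparison runs the same way. The sketch below is an illustration, not the notebook's own code: `compare_samples` and the example lists are hypothetical, and it assumes a SciPy release whose `mannwhitneyu` accepts `alternative`. Older releases reported a one-sided p-value, which is presumably why the cells here multiply it by 2; asking for the two-sided value directly avoids doubling a p-value that is already two-sided.

```python
from scipy import stats


def compare_samples(sample_a, sample_b):
    """Return two-sided p-values of three tests for two independent samples.

    Hypothetical helper, not part of the original notebook.
    """
    ks = stats.ks_2samp(sample_a, sample_b)
    # equal_var=False selects Welch's t-test (no equal-variance assumption).
    welch = stats.ttest_ind(sample_a, sample_b, equal_var=False)
    # Assumption: this SciPy version supports the `alternative` argument.
    mwu = stats.mannwhitneyu(sample_a, sample_b, alternative="two-sided")
    return {"ks": ks[1], "welch_t": welch[1], "mann_whitney": mwu[1]}


# Made-up question counts, for illustration only.
example_a = [0, 1, 1, 2, 3, 5, 8, 13]
example_b = [0, 0, 1, 1, 2, 2, 3, 4]

p_values = compare_samples(example_a, example_b)
for name in sorted(p_values):
    print("%s: %.4f" % (name, p_values[name]))
```

With the real data the call would be `compare_samples(females_questions, males_questions)`.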

Looking at the top contributors



In [ ]:

    
top_females_questions = top_females['questions_total']
top_males_questions = top_males['questions_total']

The data's shape



In [ ]:

    
show_data_shape(top_females_questions, top_males_questions, "expon", 50, "Questions")

Hypothesis test



In [ ]:

    
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(top_females_questions, top_males_questions)[1]
# print "Two-sample unpaired t-test: ", stats.ttest_ind(top_females_questions, top_males_questions, equal_var=False)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(top_females_questions, top_males_questions)[1]

Looking at the common contributors



In [ ]:

    
common_females_questions = common_females['questions_total']
common_males_questions = common_males['questions_total']

The data's shape



In [ ]:

    
show_data_shape(common_females_questions, common_males_questions, "expon", 50, "Questions")

Hypothesis test



In [ ]:

    
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(common_females_questions, common_males_questions)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(common_females_questions, common_males_questions, equal_var=False)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(common_females_questions, common_males_questions)[1]

Hypothesis 2: The number of answers each user posted is the same between genders.

H0: answersMadeBy(Males) = answersMadeBy(Females);

H1: answersMadeBy(Males) != answersMadeBy(Females).

Data



In [ ]:

    
females_answers = females['answers_total']
males_answers = males['answers_total']

The data's shape



In [ ]:

    
show_data_shape(females_answers, males_answers, "expon", 80, "Answers")

Hypothesis test



In [ ]:

    
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(females_answers, males_answers)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(females_answers, males_answers)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(females_answers, males_answers)[1]

Looking at the top contributors



In [ ]:

    
top_females_answers = top_females['answers_total']
top_males_answers = top_males['answers_total']

The data's shape



In [ ]:

    
show_data_shape(top_females_answers, top_males_answers, "expon", 30, "Answers")

Hypothesis test



In [ ]:

    
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(top_females_answers, top_males_answers)[1]
# print "Two-sample unpaired t-test: ", stats.ttest_ind(top_females_answers, top_males_answers, equal_var=False)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(top_females_answers, top_males_answers)[1]

Looking at the common contributors



In [ ]:

    
common_females_answers = common_females['answers_total']
common_males_answers = common_males['answers_total']

The data's shape



In [ ]:

    
show_data_shape(common_females_answers, common_males_answers, "expon", 30, "Answers")

Hypothesis test



In [ ]:

    
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(common_females_answers, common_males_answers)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(common_females_answers, common_males_answers)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(common_females_answers, common_males_answers)[1]

Hypothesis 3: The number of comments each user posted is the same between genders.

H0: commentsMadeBy(Males) = commentsMadeBy(Females);

H1: commentsMadeBy(Males) != commentsMadeBy(Females).

Data



In [ ]:

    
females_comments = females['comments_total']
males_comments = males['comments_total']

The data's shape



In [ ]:

    
show_data_shape(females_comments, males_comments, "expon", 80, "Comments")

Hypothesis test



In [ ]:

    
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(females_comments, males_comments)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(females_comments, males_comments)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(females_comments, males_comments)[1]

Looking at the top contributors



In [ ]:

    
top_females_comments = top_females['comments_total']
top_males_comments = top_males['comments_total']

The data's shape



In [ ]:

    
show_data_shape(top_females_comments, top_males_comments, "expon", 50, "Comments")

Hypothesis test



In [ ]:

    
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(top_females_comments, top_males_comments)[1]
# print "Two-sample unpaired t-test: ", stats.ttest_ind(top_females_comments, top_males_comments, equal_var=False)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(top_females_comments, top_males_comments)[1]

Looking at the common contributors



In [ ]:

    
common_females_comments = common_females['comments_total']
common_males_comments = common_males['comments_total']

The data's shape



In [ ]:

    
show_data_shape(common_females_comments, common_males_comments, "expon", 80, "Comments")

Hypothesis test



In [ ]:

    
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(common_females_comments, common_males_comments)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(common_females_comments, common_males_comments)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(common_females_comments, common_males_comments)[1]

Hypothesis 4: The total number of contributions by each user is the same between genders.

H0: contributionsBy(Males) = contributionsBy(Females);

H1: contributionsBy(Males) != contributionsBy(Females).

Data



In [ ]:

    
females_contributions = females['contributions_total']
males_contributions = males['contributions_total']

The data's shape



In [ ]:

    
show_data_shape(females_contributions, males_contributions, "expon", 80, "Contributions")

Hypothesis test



In [ ]:

    
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(females_contributions, males_contributions)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(females_contributions, males_contributions)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(females_contributions, males_contributions)[1]

Looking at the top contributors



In [ ]:

    
top_females_contributions = top_females['contributions_total']
top_males_contributions = top_males['contributions_total']

The data's shape



In [ ]:

    
show_data_shape(top_females_contributions, top_males_contributions, "expon", 50, "Contributions")

Hypothesis test



In [ ]:

    
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(top_females_contributions, top_males_contributions)[1]
# print "Two-sample unpaired t-test: ", stats.ttest_ind(top_females_contributions, top_males_contributions, equal_var=False)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(top_females_contributions, top_males_contributions)[1]

Looking at the common contributors



In [ ]:

    
common_females_contributions = common_females['contributions_total']
common_males_contributions = common_males['contributions_total']

The data's shape



In [ ]:

    
show_data_shape(common_females_contributions, common_males_contributions, "expon", 80, "Contributions")

Hypothesis test



In [ ]:

    
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(common_females_contributions, common_males_contributions)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(common_females_contributions, common_males_contributions)[1]
print "Two-sample Mann Whitney U test: ",2* stats.mannwhitneyu(common_females_contributions, common_males_contributions)[1]

Initialization

This section contains the data import and some utility functions used to analyse the data. Please run it first; otherwise the analysis will not work.

Importing the data from the MongoDB database and loading it into a pandas DataFrame for easy manipulation.



In [ ]:

    
from __future__ import division
import pymongo, numpy, pandas
from scipy import stats
import matplotlib as mpl
from matplotlib import pyplot

%matplotlib inline

client = pymongo.MongoClient('localhost', 27017)

community = 'programmers'
stats_db = client[community].statistics

cursor = stats_db.find({'$or': [{'questions_total':{'$gt':0}}, {'answers_total':{'$gt':0}}, {'comments_total':{'$gt':0}}] }, 
                       {u'_id': False, u'questions_total': True, u'reputation': True, u'contributions_total':True,
                        u'comments_total': True, u'answers_total': True, u'gender':True})

df =  pandas.DataFrame(list(cursor))

males = df[df['gender']=='Male']
females = df[df['gender']=='Female']
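
Each document returned by the cursor is a plain dict, so `pandas.DataFrame(list(cursor))` builds one row per user. The sketch below reproduces the split without a database; the records are made up for illustration, and only the field names come from the query above.

```python
import pandas

# Made-up documents shaped like the ones the MongoDB query returns.
records = [
    {"gender": "Male", "reputation": 520, "questions_total": 3},
    {"gender": "Female", "reputation": 101, "questions_total": 0},
    {"gender": "Female", "reputation": 830, "questions_total": 7},
    {"gender": None, "reputation": 15, "questions_total": 1},
]

example_df = pandas.DataFrame(records)

# Boolean masks keep only the rows whose gender matches exactly.
example_males = example_df[example_df["gender"] == "Male"]
example_females = example_df[example_df["gender"] == "Female"]

print("%d males, %d females" % (len(example_males), len(example_females)))
```

Users whose `gender` value is neither `'Male'` nor `'Female'` end up in neither frame, so they are left out of every comparison.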

Utility functions for plotting.



In [ ]:

    
pyplot.rcdefaults()
mpl.style.use('ggplot')

def histogram(sample1, sample2, bins, aspect):
    fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))
    axes[0].hist(list(sample1), bins)
    axes[0].set_title(aspect + " by Females - Histogram")
    axes[1].hist(list(sample2), bins)
    axes[1].set_title(aspect + " by Males - Histogram")

def pmf_plot(sample1, sample2, aspect):
    fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))
    
    aux1 = sample1.value_counts()
    pmf1 = aux1[aux1 >1].sort_index() / len(sample1)
#     pmf1 = sample1.value_counts().sort_index() / len(sample1)
    pmf1.plot(kind="bar", ax=axes[0])
    
    aux2 = sample2.value_counts()
    pmf2 = aux2[aux2 >1].sort_index() / len(sample2)
#     pmf2 = sample2.value_counts().sort_index() / len(sample2)
    pmf2.plot(kind="bar", ax=axes[1])
    
    axes[0].set_title(aspect + " by Females - Density")
    axes[1].set_title(aspect + " by Males - Density")
    axes[0].get_xaxis().set_visible(False)
    axes[1].get_xaxis().set_visible(False)    
    

    
def boxplot(sample1, sample2, aspect):
    fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))
    axes[0].boxplot(list(sample1))
    axes[0].set_title(aspect + " by Females - Boxplot")
    axes[1].boxplot(list(sample2))
    axes[1].set_title(aspect + " by Males - Boxplot")
    
def qq_plot(sample1, sample2, distribution, aspect):
    fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

    pyplot.subplot(121)
    stats.probplot(list(sample1), dist=distribution, plot=pyplot)
    axes[0].set_title(aspect + " by Females - QQPlot "+ distribution)

    pyplot.subplot(122)
    stats.probplot(list(sample2), dist=distribution, plot=pyplot)
    axes[1].set_title(aspect + " by Males - QQPlot "+ distribution)
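
Note that `pmf_plot` keeps only the values that occur more than once but still divides by the full sample size, so the plotted bars need not sum to 1. A standard-library sketch of that computation, with a hypothetical helper name and made-up data:

```python
from collections import Counter


def truncated_pmf(sample):
    """Relative frequency of each value that occurs more than once.

    Hypothetical helper mirroring the filtering done in pmf_plot.
    """
    counts = Counter(sample)
    n = float(len(sample))  # denominator is the full sample size
    return {value: count / n for value, count in counts.items() if count > 1}


# Made-up sample: 2 and 5 occur once each and are dropped.
example_sample = [0, 0, 0, 1, 1, 2, 5]
example_pmf = truncated_pmf(example_sample)
print(sorted(example_pmf.items()))
```

Singletons are typically the long tail of very active users, so dropping them keeps the bar chart readable at the cost of hiding that tail.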

Utility functions for describing the data.



In [ ]:

    
def describe(sample1, sample2):
    print sample1.describe()
    print "Median: ", sample1.median()
    print 
    print sample2.describe()
    print "Median: ", sample2.median()
    
def show_data_shape(sample1, sample2, dist, bins, aspect):
    pyplot.close('all')
    #histogram
    histogram(sample1, sample2, bins, aspect)

    #PMF
    pmf_plot(sample1, sample2, aspect)

    #QQPlot
    qq_plot(sample1, sample2, dist, aspect)

    #boxplot
    boxplot(sample1,sample2, aspect)
    pyplot.show()

    #Levene
    print "Levene's test: ", stats.levene(sample1, sample2)[1]
    
    #skewness
    print "Skewness for Females: ", stats.skew(sample1)
    print "Skewness for Males: ", stats.skew(sample2)
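
Levene's p-value is often used to decide whether the t-test may assume equal variances. The sketch below shows one common convention (Welch's test whenever Levene's test rejects at a chosen `alpha`); it is an assumption for illustration, not necessarily the rule followed in this notebook, and `t_test_with_levene` and the sample lists are hypothetical.

```python
from scipy import stats


def t_test_with_levene(sample_a, sample_b, alpha=0.05):
    """Pick Student's or Welch's t-test from the outcome of Levene's test.

    Hypothetical helper; the alpha threshold is an assumed convention.
    """
    levene_p = stats.levene(sample_a, sample_b)[1]
    # Assume equal variances only when Levene's test does not reject.
    equal_var = bool(levene_p >= alpha)
    t_p = stats.ttest_ind(sample_a, sample_b, equal_var=equal_var)[1]
    return levene_p, equal_var, t_p


# Made-up samples with similar centres but very different spread.
narrow = [10, 10, 11, 9, 10, 11, 9, 10]
wide = [0, 22, 1, 19, 2, 18, 3, 17]

levene_p, equal_var, t_p = t_test_with_levene(narrow, wide)
print("Levene p=%.4f, equal_var=%s, t-test p=%.4f" % (levene_p, equal_var, t_p))
```

For heavily skewed counts such as these, the rank-based Mann-Whitney U test makes neither a normality nor an equal-variance assumption, which may be why it is reported alongside the t-test throughout.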


