Community: Programmers

Initialization and data import are at the end of this notebook. They were placed at the bottom to keep the analysis easy to read, but they must be run first for the analysis to work as expected.

Data Summary

Women


In [ ]:
females.describe()

In [ ]:
females.median()

Men


In [ ]:
males.describe()

In [ ]:
males.median()

Top contributors


In [ ]:
pyplot.close('all')
histogram(females["reputation"], males["reputation"], 100, "Reputation")
histogram(females[females["reputation"]<= 1000]["reputation"], males[males["reputation"]<= 1000]["reputation"], 100, "Reputation")
histogram(females[females["reputation"]<= 450]["reputation"], males[males["reputation"]<= 450]["reputation"], 100, "Reputation")
pyplot.show()

Top Women


In [ ]:
top_females = females[females["reputation"]> 450]
top_females.describe()

In [ ]:
top_females.median()

Top Men


In [ ]:
top_males = males[males["reputation"]> 450]
top_males.describe()

In [ ]:
top_males.median()

Common women contributors


In [ ]:
common_females = females[females["reputation"] <= 450]
common_females.describe()

In [ ]:
common_females.median()

Common men contributors


In [ ]:
common_males = males[males["reputation"] <= 450]
common_males.describe()

In [ ]:
common_males.median()
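
The top/common split above uses a reputation cut-off of 450 for both genders. A minimal sketch of that split as a reusable helper follows; `split_by_reputation` and the toy data are hypothetical names introduced here for illustration, not part of the original notebook.

```python
import pandas

def split_by_reputation(users, threshold=450):
    """Split a users DataFrame into (top, common) at a reputation threshold.

    Users with reputation strictly above the threshold are "top";
    the rest are "common". The default of 450 mirrors the cells above.
    """
    top = users[users["reputation"] > threshold]
    common = users[users["reputation"] <= threshold]
    return top, common

# Toy data, for illustration only.
toy = pandas.DataFrame({"reputation": [1, 450, 451, 3000]})
top, common = split_by_reputation(toy)
print("top: %d, common: %d" % (len(top), len(common)))
```

Keeping the threshold in one place would make it easier to try other cut-offs without editing four cells.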

First Question: Is the amount of contribution made by both genders equal?

Hypothesis 1: The amount of questions each user posted is the same between genders.

H0: questionsMadeBy(Males) = questionsMadeBy(Females);

H1: questionsMadeBy(Males) != questionsMadeBy(Females).
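
Each test below prints a p-value for H0. As a minimal sketch of how such a p-value could be turned into a decision, assuming a significance level of 0.05 (the notebook does not state one) and using a hypothetical helper named `decide`:

```python
def decide(p_value, alpha=0.05):
    """Return a short verdict for a two-sided test at significance level alpha.

    alpha=0.05 is an assumption made for this sketch.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    if p_value < alpha:
        return "reject H0"
    return "fail to reject H0"

# Hypothetical p-values, for illustration only (not results from this notebook).
for name, p in [("example test A", 0.003), ("example test B", 0.41)]:
    print("%s: p = %.3f -> %s" % (name, p, decide(p)))
```

Failing to reject H0 does not show that the two groups are equal, only that the data give no strong evidence of a difference.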

Data


In [ ]:
females_questions = females['questions_total']
males_questions = males['questions_total']

The data's shape


In [ ]:
show_data_shape(females_questions, males_questions, "expon", 100, "Questions")

Hypothesis test


In [ ]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(females_questions, males_questions)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(females_questions, males_questions, equal_var=False)[1]
print "Two-sample Mann Whitney U test: ", 2 * stats.mannwhitneyu(females_questions, males_questions)[1]
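
The Mann-Whitney p-value above is doubled because the SciPy version this notebook was written for returned a one-sided p-value. Newer SciPy releases accept an `alternative` argument, and doubling a p-value that is already two-sided would overstate it. A minimal sketch on synthetic data, assuming a SciPy version that supports `alternative`:

```python
import numpy
from scipy import stats

# Synthetic, skewed count data standing in for two groups (illustration only).
rng = numpy.random.RandomState(0)
group_a = rng.poisson(2.0, size=200)
group_b = rng.poisson(3.0, size=300)

# Ask for the two-sided p-value explicitly instead of doubling by hand.
result = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
p_two_sided = result[1]
print("Two-sided Mann-Whitney U p-value: %g" % p_two_sided)
```

If the installed SciPy does not accept `alternative`, the doubling used in the notebook is the way to get a two-sided value.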

Looking at the top contributors


In [ ]:
top_females_questions = top_females['questions_total']
top_males_questions = top_males['questions_total']

The data's shape


In [ ]:
show_data_shape(top_females_questions, top_males_questions, "expon", 50, "Questions")

Hypothesis test


In [ ]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(top_females_questions, top_males_questions)[1]
# print "Two-sample unpaired t-test: ", stats.ttest_ind(top_females_questions, top_males_questions, equal_var=False)[1]
print "Two-sample Mann Whitney U test: ", 2 * stats.mannwhitneyu(top_females_questions, top_males_questions)[1]

Looking at the common contributors


In [ ]:
common_females_questions = common_females['questions_total']
common_males_questions = common_males['questions_total']

The data's shape


In [ ]:
show_data_shape(common_females_questions, common_males_questions, "expon", 50, "Questions")

Hypothesis test


In [ ]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(common_females_questions, common_males_questions)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(common_females_questions, common_males_questions, equal_var=False)[1]
print "Two-sample Mann Whitney U test: ", 2 * stats.mannwhitneyu(common_females_questions, common_males_questions)[1]

Hypothesis 2: The amount of answers each user posted is the same between genders.

H0: answersMadeBy(Males) = answersMadeBy(Females);

H1: answersMadeBy(Males) != answersMadeBy(Females).

Data


In [ ]:
females_answers = females['answers_total']
males_answers = males['answers_total']

The data's shape


In [ ]:
show_data_shape(females_answers, males_answers, "expon", 80, "Answers")

Hypothesis test


In [ ]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(females_answers, males_answers)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(females_answers, males_answers)[1]
print "Two-sample Mann Whitney U test: ", 2 * stats.mannwhitneyu(females_answers, males_answers)[1]

Looking at the top contributors


In [ ]:
top_females_answers = top_females['answers_total']
top_males_answers = top_males['answers_total']

The data's shape


In [ ]:
show_data_shape(top_females_answers, top_males_answers, "expon", 30, "Answers")

Hypothesis test


In [ ]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(top_females_answers, top_males_answers)[1]
# print "Two-sample unpaired t-test: ", stats.ttest_ind(top_females_answers, top_males_answers, equal_var=False)[1]
print "Two-sample Mann Whitney U test: ", 2 * stats.mannwhitneyu(top_females_answers, top_males_answers)[1]

Looking at the common contributors


In [ ]:
common_females_answers = common_females['answers_total']
common_males_answers = common_males['answers_total']

The data's shape


In [ ]:
show_data_shape(common_females_answers, common_males_answers, "expon", 30, "Answers")

Hypothesis test


In [ ]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(common_females_answers, common_males_answers)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(common_females_answers, common_males_answers)[1]
print "Two-sample Mann Whitney U test: ", 2 * stats.mannwhitneyu(common_females_answers, common_males_answers)[1]

Hypothesis 3: The amount of comments each user posted is the same between genders.

H0: commentsMadeBy(Males) = commentsMadeBy(Females);

H1: commentsMadeBy(Males) != commentsMadeBy(Females).

Data


In [ ]:
females_comments = females['comments_total']
males_comments = males['comments_total']

The data's shape


In [ ]:
show_data_shape(females_comments, males_comments, "expon", 80, "Comments")

Hypothesis test


In [ ]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(females_comments, males_comments)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(females_comments, males_comments)[1]
print "Two-sample Mann Whitney U test: ", 2 * stats.mannwhitneyu(females_comments, males_comments)[1]

Looking at the top contributors


In [ ]:
top_females_comments = top_females['comments_total']
top_males_comments = top_males['comments_total']

The data's shape


In [ ]:
show_data_shape(top_females_comments, top_males_comments, "expon", 50, "Comments")

Hypothesis test


In [ ]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(top_females_comments, top_males_comments)[1]
# print "Two-sample unpaired t-test: ", stats.ttest_ind(top_females_comments, top_males_comments, equal_var=False)[1]
print "Two-sample Mann Whitney U test: ", 2 * stats.mannwhitneyu(top_females_comments, top_males_comments)[1]

Looking at the common contributors


In [ ]:
common_females_comments = common_females['comments_total']
common_males_comments = common_males['comments_total']

The data's shape


In [ ]:
show_data_shape(common_females_comments, common_males_comments, "expon", 80, "Comments")

Hypothesis test


In [ ]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(common_females_comments, common_males_comments)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(common_females_comments, common_males_comments)[1]
print "Two-sample Mann Whitney U test: ", 2 * stats.mannwhitneyu(common_females_comments, common_males_comments)[1]

Hypothesis 4: The total amount of contributions by each user is the same between genders.

H0: contributionsBy(Males) = contributionsBy(Females);

H1: contributionsBy(Males) != contributionsBy(Females).

Data


In [ ]:
females_contributions = females['contributions_total']
males_contributions = males['contributions_total']

The data's shape


In [ ]:
show_data_shape(females_contributions, males_contributions, "expon", 80, "Contributions")

Hypothesis test


In [ ]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(females_contributions, males_contributions)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(females_contributions, males_contributions)[1]
print "Two-sample Mann Whitney U test: ", 2 * stats.mannwhitneyu(females_contributions, males_contributions)[1]

Looking at the top contributors


In [ ]:
top_females_contributions = top_females['contributions_total']
top_males_contributions = top_males['contributions_total']

The data's shape


In [ ]:
show_data_shape(top_females_contributions, top_males_contributions, "expon", 50, "Contributions")

Hypothesis test


In [ ]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(top_females_contributions, top_males_contributions)[1]
# print "Two-sample unpaired t-test: ", stats.ttest_ind(top_females_contributions, top_males_contributions, equal_var=False)[1]
print "Two-sample Mann Whitney U test: ", 2 * stats.mannwhitneyu(top_females_contributions, top_males_contributions)[1]

Looking at the common contributors


In [ ]:
common_females_contributions = common_females['contributions_total']
common_males_contributions = common_males['contributions_total']

The data's shape


In [ ]:
show_data_shape(common_females_contributions, common_males_contributions, "expon", 80, "Contributions")

Hypothesis test


In [ ]:
print "Two-sample Kolmogorov-Smirnov test: ", stats.ks_2samp(common_females_contributions, common_males_contributions)[1]
print "Two-sample unpaired t-test: ", stats.ttest_ind(common_females_contributions, common_males_contributions)[1]
print "Two-sample Mann Whitney U test: ", 2 * stats.mannwhitneyu(common_females_contributions, common_males_contributions)[1]

Initialization

Here you can find the data import and some utility functions used to analyse the data. Please run this first; otherwise the analysis will not work.

Import the data from the MongoDB database and load it into a pandas DataFrame for easy manipulation.


In [ ]:
from __future__ import division
import pymongo, time, pylab, numpy, pandas, vincent
from scipy import stats
import matplotlib as mpl
from matplotlib import pyplot

%matplotlib inline

client = pymongo.MongoClient('localhost', 27017)

community = 'programmers'
stats_db = client[community].statistics

cursor = stats_db.find({'$or': [{'questions_total':{'$gt':0}}, {'answers_total':{'$gt':0}}, {'comments_total':{'$gt':0}}] }, 
                       {u'_id': False, u'questions_total': True, u'reputation': True, u'contributions_total':True,
                        u'comments_total': True, u'answers_total': True, u'gender':True})

df = pandas.DataFrame(list(cursor))

males = df[df['gender']=='Male']
females = df[df['gender']=='Female']
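
To try the analysis functions without a running MongoDB instance, the same DataFrame layout can be built from plain dictionaries. The sketch below uses made-up records with the fields projected by the query above; the `sample_` names and all values are hypothetical and are not data from the study.

```python
import pandas

# Hypothetical records with the same fields the MongoDB query projects.
sample_records = [
    {"gender": "Female", "reputation": 520, "questions_total": 3,
     "answers_total": 10, "comments_total": 4, "contributions_total": 17},
    {"gender": "Male", "reputation": 101, "questions_total": 1,
     "answers_total": 0, "comments_total": 2, "contributions_total": 3},
    {"gender": "Male", "reputation": 2300, "questions_total": 0,
     "answers_total": 45, "comments_total": 30, "contributions_total": 75},
    {"gender": "Unknown", "reputation": 15, "questions_total": 1,
     "answers_total": 0, "comments_total": 0, "contributions_total": 1},
]

sample_df = pandas.DataFrame(sample_records)

# Same gender filters as above; rows with any other label fall in neither group.
sample_males = sample_df[sample_df["gender"] == "Male"]
sample_females = sample_df[sample_df["gender"] == "Female"]
print("males: %d, females: %d" % (len(sample_males), len(sample_females)))
```

Different variable names are used on purpose so that running this cell does not overwrite `df`, `males`, or `females`.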

Utility functions for plotting.


In [ ]:
pyplot.rcdefaults()
mpl.style.use('ggplot')

def histogram(sample1, sample2, bins, aspect):
    fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))
    axes[0].hist(list(sample1), bins)
    axes[0].set_title(aspect + " by Females - Histogram")
    axes[1].hist(list(sample2), bins)
    axes[1].set_title(aspect + " by Males - Histogram")

def pmf_plot(sample1, sample2, aspect):
    fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))
    
    # Keep only values that occur more than once, then divide by the sample size.
    aux1 = sample1.value_counts()
    pmf1 = aux1[aux1 > 1].sort_index() / len(sample1)
#     pmf1 = sample1.value_counts().sort_index() / len(sample1)
    pmf1.plot(kind="bar", ax=axes[0])
    
    # Same filtering and normalisation for the second sample.
    aux2 = sample2.value_counts()
    pmf2 = aux2[aux2 > 1].sort_index() / len(sample2)
#     pmf2 = sample2.value_counts().sort_index() / len(sample2)
    pmf2.plot(kind="bar", ax=axes[1])
    
    axes[0].set_title(aspect + " by Females - Density")
    axes[1].set_title(aspect + " by Males - Density")
    axes[0].get_xaxis().set_visible(False)
    axes[1].get_xaxis().set_visible(False)    
    

    
def boxplot(sample1, sample2, aspect):
    fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))
    axes[0].boxplot(list(sample1))
    axes[0].set_title(aspect + " by Females - Boxplot")
    axes[1].boxplot(list(sample2))
    axes[1].set_title(aspect + " by Males - Boxplot")
    
def qq_plot(sample1, sample2, distribution, aspect):
    fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

    pyplot.subplot(121)
    stats.probplot(list(sample1), dist=distribution, plot=pyplot)
    axes[0].set_title(aspect + " by Females - QQPlot "+ distribution)

    pyplot.subplot(122)
    stats.probplot(list(sample2), dist=distribution, plot=pyplot)
    axes[1].set_title(aspect + " by Males - QQPlot "+ distribution)
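
A note on `pmf_plot` above: it drops values that occur only once before dividing by the sample size, so the plotted bars do not sum to 1. For reference, here is a minimal sketch of the full empirical PMF, which is the variant left commented out inside `pmf_plot`. The helper name `empirical_pmf` and the toy series are hypothetical.

```python
import pandas

def empirical_pmf(sample):
    """Relative frequency of each distinct value, ordered by value."""
    counts = sample.value_counts().sort_index()
    return counts / float(len(sample))

# Toy data, for illustration only.
toy = pandas.Series([0, 0, 1, 1, 1, 2, 5])
pmf = empirical_pmf(toy)
print(pmf)
```

Filtering out singletons, as `pmf_plot` does, keeps the bar chart readable for long-tailed data at the cost of hiding the tail.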

Utility functions for describing the data.


In [ ]:
def describe(sample1, sample2):
    print sample1.describe()
    print "Median: ", sample1.median()
    print 
    print sample2.describe()
    print "Median: ", sample2.median()
    
def show_data_shape(sample1, sample2, dist, bins, aspect):
    pyplot.close('all')
    #histogram
    histogram(sample1, sample2, bins, aspect)

    #PMF
    pmf_plot(sample1, sample2, aspect)

    #QQPlot
    qq_plot(sample1, sample2, dist, aspect)

    #boxplot
    boxplot(sample1,sample2, aspect)
    pyplot.show()

    #Levene
    print "Levene's test: ", stats.levene(sample1, sample2)[1]
    
    #skewness
    print "Skewness for Females: ", stats.skew(sample1)
    print "Skewness for Males: ", stats.skew(sample2)
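
`show_data_shape` prints Levene's test, and the t-tests in the analysis take an `equal_var` flag. One possible way to connect the two is sketched below, assuming a 0.05 cut-off and a hypothetical helper `t_test_with_levene`. The notebook does not state that this is how the flag was chosen, and some analysts prefer to always use Welch's test (`equal_var=False`).

```python
import numpy
from scipy import stats

def t_test_with_levene(sample1, sample2, alpha=0.05):
    """Choose Student's or Welch's t-test from Levene's test.

    alpha=0.05 is an assumption made for this sketch.
    Returns (levene_p, equal_var, t_test_p).
    """
    levene_p = stats.levene(sample1, sample2)[1]
    # A small Levene p-value suggests unequal variances, so use Welch's test.
    equal_var = bool(levene_p >= alpha)
    t_p = stats.ttest_ind(sample1, sample2, equal_var=equal_var)[1]
    return levene_p, equal_var, t_p

# Synthetic samples with clearly different spreads (illustration only).
rng = numpy.random.RandomState(42)
narrow = rng.normal(0.0, 1.0, 300)
wide = rng.normal(0.0, 5.0, 300)

levene_p, equal_var, t_p = t_test_with_levene(narrow, wide)
print("Levene p = %g, equal_var = %s, t-test p = %g" % (levene_p, equal_var, t_p))
```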
