Community: Programmers

Initialization and data import are at the end of this notebook, in the "Initialization" section. They were placed at the bottom so the analysis is easier to read, but those cells must be run first for the analysis to work as expected.

Data summary


In [5]:
merged.describe()


Out[5]:
            contrib  contrib_females  contrib_males  male_prop  female_prop
count     11.000000        11.000000      11.000000  11.000000    11.000000
mean   15251.636364       708.090909   14543.545455   0.955451     0.044549
std    13459.977298       747.487853   12747.309892   0.012528     0.012528
min      323.000000        15.000000     303.000000   0.938080     0.030488
25%      550.500000        20.500000     532.500000   0.942576     0.033284
50%    18394.000000       596.000000   17798.000000   0.956660     0.043340
75%    24746.000000      1052.500000   23557.500000   0.966716     0.057424
max    37746.000000      2229.000000   35517.000000   0.969512     0.061920

Fourth Question: Is the proportion of contributions by each gender decreasing?

Hypothesis 1:

H0: contributionsBefore(Males) > contributionsRecent(Males) & contributionsBefore(Females) > contributionsRecent(Females);

H1: contributionsBefore(Males) <= contributionsRecent(Males) & contributionsBefore(Females) <= contributionsRecent(Females).

Absolute amount of contributions


In [6]:
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))
axes[0].plot(merged['contrib_females'])
axes[0].set_title("Absolute number of contributions by women")
axes[1].plot(merged['contrib_males'])
axes[1].set_title("Absolute number of contributions by men")


Out[6]:
<matplotlib.text.Text at 0x10df5a990>

Proportion of contributions by gender


In [7]:
fig, ax = pyplot.subplots(figsize=(15, 7))
pyplot.plot(merged[['male_prop', 'female_prop']])
ax.set_title("Proportion of Contributions")
ax.legend(["Male","Female"],bbox_to_anchor=(1.02, 1), loc=2, borderaxespad=0.)


Out[7]:
<matplotlib.legend.Legend at 0x1102b7550>

Regression - Proportion of contributions by Women


In [10]:
merged['semester'] = [1,2,3,4,5,6,7,8,9,10,11]

import statsmodels.formula.api as smf
result_female = smf.ols(formula="female_prop ~ semester", data=merged).fit()
result_female.summary()


/Library/Python/2.7/site-packages/scipy/stats/stats.py:1205: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=11
  int(n))
Out[10]:
OLS Regression Results
Dep. Variable:     female_prop        R-squared:            0.044
Model:             OLS                Adj. R-squared:       -0.062
Method:            Least Squares      F-statistic:          0.4188
Date:              Sat, 11 Oct 2014   Prob (F-statistic):   0.534
Time:              13:18:48           Log-Likelihood:       33.344
No. Observations:  11                 AIC:                  -62.69
Df Residuals:      9                  BIC:                  -61.89
Df Model:          1

              coef   std err         t   P>|t|   [95.0% Conf. Int.]
Intercept   0.0493     0.008     5.909   0.000      0.030    0.068
semester   -0.0008     0.001    -0.647   0.534     -0.004    0.002

Omnibus:          3.811   Durbin-Watson:      0.785
Prob(Omnibus):    0.149   Jarque-Bera (JB):   1.106
Skew:            -0.010   Prob(JB):           0.575
Kurtosis:         1.447   Cond. No.           14.8
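
The summary above can be sanity-checked by hand. With a single predictor, the F-statistic equals the squared t-statistic of the slope, and it can also be recovered from R-squared and the residual degrees of freedom. The sketch below only reuses the rounded numbers printed in the summary, so the agreement is approximate; the variable names are illustrative and not part of the notebook.

```python
# Values copied from the OLS summary above (female_prop ~ semester).
n_obs = 11
r_squared = 0.044      # rounded in the summary
t_semester = -0.647    # t-statistic of the slope
f_reported = 0.4188

# One slope plus the intercept are estimated, so 2 degrees of freedom are used up.
df_resid = n_obs - 2

# With one predictor, F is the square of the slope's t-statistic.
f_from_t = t_semester ** 2

# F can also be written in terms of R-squared: F = R^2 / (1 - R^2) * df_resid.
f_from_r2 = r_squared / (1.0 - r_squared) * df_resid

print("F from t:   %.4f" % f_from_t)
print("F from R^2: %.4f" % f_from_r2)
print("F reported: %.4f" % f_reported)
```

Both routes land close to the reported F-statistic, which is why the slope's p-value and Prob (F-statistic) are the same number (0.534) in this table.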

In [11]:
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

fig = sm.graphics.plot_fit(result_female, "semester", ax=axes[0])
fig = sm.graphics.plot_ccpr(result_female, "semester", ax=axes[1])


Regression - Proportion of contributions by Men


In [12]:
result_male = smf.ols(formula="male_prop ~ semester", data=merged).fit()
result_male.summary()


Out[12]:
OLS Regression Results
Dep. Variable:     male_prop          R-squared:            0.044
Model:             OLS                Adj. R-squared:       -0.062
Method:            Least Squares      F-statistic:          0.4188
Date:              Sat, 11 Oct 2014   Prob (F-statistic):   0.534
Time:              13:18:51           Log-Likelihood:       33.344
No. Observations:  11                 AIC:                  -62.69
Df Residuals:      9                  BIC:                  -61.89
Df Model:          1

              coef   std err         t   P>|t|   [95.0% Conf. Int.]
Intercept   0.9507     0.008   113.889   0.000      0.932    0.970
semester    0.0008     0.001     0.647   0.534     -0.002    0.004

Omnibus:          3.811   Durbin-Watson:      0.785
Prob(Omnibus):    0.149   Jarque-Bera (JB):   1.106
Skew:             0.010   Prob(JB):           0.575
Kurtosis:         1.447   Cond. No.           14.8
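
The male regression mirrors the female one (same R-squared and p-value, slope with the opposite sign, intercepts adding up to 1) because male_prop = 1 - female_prop in every semester. The sketch below illustrates this with a small hand-written least-squares fit. The data values and the helper `ols_fit` are hypothetical and only serve to show the symmetry; they are not the notebook's data or the statsmodels implementation.

```python
def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = float(len(x))
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical proportions for five semesters (not taken from the notebook).
semester = [1, 2, 3, 4, 5]
female_prop = [0.05, 0.04, 0.06, 0.03, 0.04]
male_prop = [1.0 - p for p in female_prop]

b0_f, b1_f = ols_fit(semester, female_prop)
b0_m, b1_m = ols_fit(semester, male_prop)

print("female: intercept=%.4f slope=%.4f" % (b0_f, b1_f))
print("male:   intercept=%.4f slope=%.4f" % (b0_m, b1_m))
```

Since the two fits carry the same information, testing one proportion for a trend is enough; the second regression cannot give a different conclusion.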

In [13]:
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

fig = sm.graphics.plot_fit(result_male, "semester", ax=axes[0])
fig = sm.graphics.plot_ccpr(result_male, "semester", ax=axes[1])



Initialization

This section contains the data import and some helper functions used in the analysis. Run these cells first; otherwise the analysis will not work.

The data is imported from the MongoDB database and loaded into a pandas DataFrame for easy manipulation.
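
The aggregation pipeline in the next cell uses `$unwind` on the `dates` field. As a rough illustration of what that stage does, here is a plain-Python sketch. It assumes each document describes one user with a `dates` array holding one entry per contribution; the sample documents and the `unwind` helper are hypothetical and do not come from the database.

```python
def unwind(documents, field):
    """Yield one copy of each document per element of its array `field`.

    Documents whose array is missing or empty produce no output, which
    matches the default behaviour of MongoDB's $unwind stage.
    """
    for doc in documents:
        for value in doc.get(field, []):
            flat = dict(doc)      # shallow copy so the input is left untouched
            flat[field] = value   # replace the array with a single element
            yield flat

# Hypothetical user documents (dates shortened to strings for readability).
users = [
    {"gender": "Female", "dates": ["2012-01-03", "2012-02-10"]},
    {"gender": "Male", "dates": ["2012-01-05"]},
    {"gender": "Male", "dates": []},
]

rows = list(unwind(users, "dates"))
for row in rows:
    print(row)
```

After this step every row stands for a single contribution, so counting rows per gender and per time window gives the contribution totals used in the analysis.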


In [1]:
from __future__ import division
import pymongo, time, pylab, numpy, pandas, math
from scipy import stats
import matplotlib as mpl
from matplotlib import pyplot
import statsmodels.api as sm

%matplotlib inline
mpl.style.use('ggplot')
# pyplot.rcdefaults()

client = pymongo.MongoClient('localhost', 27017)

community = 'programmers'
stats_db = client[community].statistics

pipeline = [
    {'$match':{'$or': [{'questions_total':{'$gt':0}}, {'answers_total':{'$gt':0}}, 
                                {'comments_total':{'$gt':0}}], 
                        'gender': {'$ne': "Unknown"} }},
    {'$unwind': '$dates'},
    {'$project': {'gender':1, 'dates':1}}
    
]

cursor = stats_db.aggregate(pipeline, cursor={})

df =  pandas.DataFrame(list(cursor))

In [2]:
indexed_df = df.set_index(['dates'])
males = indexed_df[indexed_df['gender']=='Male']
females = indexed_df[indexed_df['gender']=='Female']

In [3]:
from dateutil.relativedelta import relativedelta
import datetime

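def aggregate_semesters(df):
    # Counts contributions per six-month window, in total and by gender.
    # Note: relies on the global frames males, females and indexed_df from In [2].
    maxi = max(df.dates).date()
    begin = min(df.dates).date()
    end = begin + relativedelta(months=+6)

    rows = []
    while begin <= maxi:
        rows.append({"semester": begin,
                     "contrib_males": len(males[str(begin):str(end)].index),
                     "contrib_females": len(females[str(begin):str(end)].index),
                     "contrib": len(indexed_df[str(begin):str(end)].index)})

        # The next window starts the day after the previous one ends.
        begin = end + relativedelta(days=+1)
        end = begin + relativedelta(months=+6)

    # Build the frame once instead of appending row by row
    # (DataFrame.append was removed in pandas 2.0).
    return pandas.DataFrame(rows, columns=["contrib", "contrib_females",
                                           "contrib_males", "semester"])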

def male_proportion(row):
    return row['contrib_males'] / row['contrib']

def female_proportion(row):
    return row['contrib_females'] / row['contrib']
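
The windowing logic in aggregate_semesters can be tried out without pandas or dateutil. The following is a minimal sketch under stated assumptions: `add_months` and `semester_windows` are hypothetical helpers that are not part of the notebook, `add_months` approximates `relativedelta(months=+6)` by clamping the day to the end of the target month, and the example dates are made up.

```python
import calendar
from datetime import date, timedelta

def add_months(d, months):
    """Shift a date by whole months, clamping the day to the month's length."""
    idx = d.month - 1 + months
    year = d.year + idx // 12
    month = idx % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

def semester_windows(first, last):
    """Return (begin, end) pairs, both inclusive, covering first..last."""
    windows = []
    begin = first
    while begin <= last:
        end = add_months(begin, 6)
        windows.append((begin, end))
        # Same rule as the notebook: the next window starts one day later.
        begin = end + timedelta(days=1)
    return windows

# Hypothetical date range, chosen only to show the window boundaries.
windows = semester_windows(date(2010, 9, 1), date(2011, 12, 31))
for begin, end in windows:
    print(begin, end)
```

Under these assumptions each window covers six months plus one day, so the window start drifts forward by a day every semester, and the last window may extend past the most recent contribution and therefore hold fewer contributions than a full semester.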

In [4]:
#aggregates data by semester
merged = aggregate_semesters(df)

#indexes dataframe by date
merged = merged.set_index(['semester'])

#calculating proportion of contributions by gender
merged['male_prop'] = merged.apply(male_proportion, axis=1)
merged['female_prop'] = merged.apply(female_proportion, axis=1)
