Community: Programmers

Initialization and data import are at the end of this notebook. They were placed at the bottom to keep the analysis easy to read, but they must be run first for the analysis to work as expected.

Data summary



In [5]:

    
merged.describe()









    Out[5]:






  
    
      
            contrib  contrib_females  contrib_males  male_prop  female_prop
count     11.000000        11.000000      11.000000  11.000000    11.000000
mean   15251.636364       708.090909   14543.545455   0.955451     0.044549
std    13459.977298       747.487853   12747.309892   0.012528     0.012528
min      323.000000        15.000000     303.000000   0.938080     0.030488
25%      550.500000        20.500000     532.500000   0.942576     0.033284
50%    18394.000000       596.000000   17798.000000   0.956660     0.043340
75%    24746.000000      1052.500000   23557.500000   0.966716     0.057424
max    37746.000000      2229.000000   35517.000000   0.969512     0.061920

Fourth Question: Is the proportion of contributions by each gender decreasing?

Hypothesis 1:

H0: contributionsBefore(Males) > contributionsRecent(Males) & contributionsBefore(Females) > contributionsRecent(Females);

H1: contributionsBefore(Males) <= contributionsRecent(Males) & contributionsBefore(Females) <= contributionsRecent(Females).

Absolute amount of contributions



In [6]:

    
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))
axes[0].plot(merged['contrib_females'])
axes[0].set_title("Absolute number of contributions by females")
axes[1].plot(merged['contrib_males'])
axes[1].set_title("Absolute number of contributions by males")









    Out[6]:





<matplotlib.text.Text at 0x10df5a990>

Proportion of contributions by gender



In [7]:

    
fig, ax = pyplot.subplots(figsize=(15, 7))
pyplot.plot(merged[['male_prop', 'female_prop']])
ax.set_title("Proportion of Contributions")
ax.legend(["Male","Female"],bbox_to_anchor=(1.02, 1), loc=2, borderaxespad=0.)









    Out[7]:





<matplotlib.legend.Legend at 0x1102b7550>

Regression - Proportion of contributions by Women



In [10]:

    
merged['semester'] = [1,2,3,4,5,6,7,8,9,10,11]

import statsmodels.formula.api as smf
result_female = smf.ols(formula="female_prop ~ semester", data=merged).fit()
result_female.summary()









    



/Library/Python/2.7/site-packages/scipy/stats/stats.py:1205: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=11
  int(n))






    Out[10]:





OLS Regression Results

  Dep. Variable:        female_prop      R-squared:             0.044
  Model:                    OLS          Adj. R-squared:       -0.062
  Method:              Least Squares     F-statistic:          0.4188
  Date:              Sat, 11 Oct 2014    Prob (F-statistic):    0.534
  Time:                  13:18:48        Log-Likelihood:       33.344
  No. Observations:           11         AIC:                  -62.69
  Df Residuals:                9         BIC:                  -61.89
  Df Model:                    1

                 coef    std err          t      P>|t|   [95.0% Conf. Int.]
  Intercept    0.0493      0.008      5.909      0.000      0.030     0.068
  semester    -0.0008      0.001     -0.647      0.534     -0.004     0.002

  Omnibus:         3.811    Durbin-Watson:         0.785
  Prob(Omnibus):   0.149    Jarque-Bera (JB):      1.106
  Skew:           -0.010    Prob(JB):              0.575
  Kurtosis:        1.447    Cond. No.               14.8
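The figures in the summary above are tied together by standard identities for a regression with a single predictor, which allows a quick sanity check. The sketch below plugs in the rounded values copied from the table, so agreement is only approximate; the variable names are illustrative and not part of the notebook.

```python
# Consistency check on the reported OLS summary (values copied from the table above).
n = 11               # No. Observations
r_squared = 0.044    # R-squared, rounded in the summary
t_slope = -0.647     # t statistic of the semester coefficient
f_reported = 0.4188  # F-statistic

# With one predictor, the F-statistic equals the squared t statistic of the slope.
f_from_t = t_slope ** 2

# F can also be recovered from R-squared: F = R^2 / (1 - R^2) * (n - 2).
f_from_r2 = r_squared / (1 - r_squared) * (n - 2)

# Adjusted R-squared corrects R-squared for the single predictor used.
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - 2)

print(round(f_from_t, 4), round(f_from_r2, 4), round(adj_r_squared, 3))
```

Both estimates of F land close to the reported 0.4188 (the gap comes from rounding in the table), and the adjusted R-squared reproduces the reported -0.062.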



In [11]:

    
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

fig = sm.graphics.plot_fit(result_female, "semester", ax=axes[0])
fig = sm.graphics.plot_ccpr(result_female, "semester", ax=axes[1])

Regression - Proportion of contributions by Men



In [12]:

    
result_male = smf.ols(formula="male_prop ~ semester", data=merged).fit()
result_male.summary()









    Out[12]:





OLS Regression Results

  Dep. Variable:         male_prop       R-squared:             0.044
  Model:                    OLS          Adj. R-squared:       -0.062
  Method:              Least Squares     F-statistic:          0.4188
  Date:              Sat, 11 Oct 2014    Prob (F-statistic):    0.534
  Time:                  13:18:51        Log-Likelihood:       33.344
  No. Observations:           11         AIC:                  -62.69
  Df Residuals:                9         BIC:                  -61.89
  Df Model:                    1

                 coef    std err          t      P>|t|   [95.0% Conf. Int.]
  Intercept    0.9507      0.008    113.889      0.000      0.932     0.970
  semester     0.0008      0.001      0.647      0.534     -0.002     0.004

  Omnibus:         3.811    Durbin-Watson:         0.785
  Prob(Omnibus):   0.149    Jarque-Bera (JB):      1.106
  Skew:            0.010    Prob(JB):              0.575
  Kurtosis:        1.447    Cond. No.               14.8
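The male and female regressions mirror each other because the two proportions sum to 1 in every semester: the slope flips sign (-0.0008 versus 0.0008), the intercepts add up to 1 (0.0493 + 0.9507), and R-squared and the p-value are identical. A minimal sketch of that property, using a hand-written least-squares fit on made-up proportions; the `fit_line` helper and the data are hypothetical, not the notebook's.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x (closed form)."""
    n = float(len(x))
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical female proportions for five semesters (not the notebook's data).
semester = [1, 2, 3, 4, 5]
female_prop = [0.050, 0.046, 0.048, 0.041, 0.043]
male_prop = [1 - p for p in female_prop]

f_intercept, f_slope = fit_line(semester, female_prop)
m_intercept, m_slope = fit_line(semester, male_prop)

# The slopes cancel and the intercepts add up to one (up to floating-point error).
print(f_slope + m_slope, f_intercept + m_intercept)
```

In other words, the second regression carries no information beyond the first; either one is enough to judge the trend.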



In [13]:

    
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

fig = sm.graphics.plot_fit(result_male, "semester", ax=axes[0])
fig = sm.graphics.plot_ccpr(result_male, "semester", ax=axes[1])

Initialization

Here you can find the data import and some helper functions used in the analysis. Please run this section first; otherwise the analysis will not work.

Importing the data from the MongoDB database and loading it into a pandas DataFrame for easy manipulation.



In [1]:

    
from __future__ import division
import pymongo, time, pylab, numpy, pandas, math
from scipy import stats
import matplotlib as mpl
from matplotlib import pyplot
import statsmodels.api as sm

%matplotlib inline
mpl.style.use('ggplot')
# pyplot.rcdefaults()

client = pymongo.MongoClient('localhost', 27017)

community = 'programmers'
stats_db = client[community].statistics

pipeline = [
    {'$match':{'$or': [{'questions_total':{'$gt':0}}, {'answers_total':{'$gt':0}}, 
                                {'comments_total':{'$gt':0}}], 
                        'gender': {'$ne': "Unknown"} }},
    {'$unwind': '$dates'},
    {'$project': {'gender':1, 'dates':1}}
    
]

cursor = stats_db.aggregate(pipeline, cursor={})

df =  pandas.DataFrame(list(cursor))
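For readers unfamiliar with the aggregation pipeline above: `$match` keeps users with at least one question, answer, or comment and a known gender, `$unwind` emits one document per entry in `dates`, and `$project` keeps only `gender` and `dates`. The pure-Python sketch below imitates those stages on three made-up user documents. It assumes `dates` holds one timestamp per contribution; the sample values are hypothetical, and the real documents in the `statistics` collection may look different.

```python
import datetime

# Hypothetical user documents shaped like the fields the pipeline refers to.
users = [
    {"gender": "Male", "questions_total": 2, "answers_total": 0, "comments_total": 1,
     "dates": [datetime.datetime(2011, 1, 5), datetime.datetime(2011, 3, 9)]},
    {"gender": "Unknown", "questions_total": 1, "answers_total": 0, "comments_total": 0,
     "dates": [datetime.datetime(2011, 2, 1)]},
    {"gender": "Female", "questions_total": 0, "answers_total": 1, "comments_total": 0,
     "dates": [datetime.datetime(2011, 4, 2)]},
]

def has_contributions(user):
    # Mirrors the $or clause: any of the three counters greater than zero.
    return (user["questions_total"] > 0 or user["answers_total"] > 0
            or user["comments_total"] > 0)

rows = []
for user in users:
    # $match: contributors with a known gender only.
    if not has_contributions(user) or user["gender"] == "Unknown":
        continue
    # $unwind + $project: one row per entry in dates, keeping two fields.
    for date in user["dates"]:
        rows.append({"gender": user["gender"], "dates": date})

print(len(rows))  # 3 rows: two for the male user, one for the female user
```

Each resulting row corresponds to one contribution, which is why the later cells can count contributions simply by counting rows.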



In [2]:

    
indexed_df = df.set_index(['dates'])
males = indexed_df[indexed_df['gender']=='Male']
females = indexed_df[indexed_df['gender']=='Female']



In [3]:

    
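from dateutil.relativedelta import relativedelta
import datetime

def aggregate_semesters(df):
    # Counts contributions per six-month window, in total and by gender.
    # Note: the counts come from the globals indexed_df, males and females
    # defined in In [2]; df is only used to find the first and last date.
    maxi = max(df.dates).date()
    begin = min(df.dates).date()
    end = begin + relativedelta(months=+6)
    
    # DataFrame.append was removed in pandas 2.0; on newer versions, collect
    # the dicts in a list and build the DataFrame once after the loop.
    return_df = pandas.DataFrame(data={})

    while begin <= maxi:
        d = {"semester": begin, 
             "contrib_males": len(males[str(begin):str(end)].index),
             "contrib_females": len(females[str(begin):str(end)].index),
             "contrib": len(indexed_df[str(begin):str(end)].index)
             }
        return_df = return_df.append(d, ignore_index=True)
        
        # The next window starts the day after the current one ends.
        begin = end + relativedelta(days=+1)
        end = begin + relativedelta(months=+6)
    
    return return_df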

def male_proportion(row):
    return row['contrib_males'] / row['contrib']

def female_proportion(row):
    return row['contrib_females'] / row['contrib']
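The window boundaries in `aggregate_semesters` are easy to misread from the loop: each window runs from `begin` to `begin` plus six months, and the next window starts one day after the previous end. The sketch below reproduces only that boundary logic with the standard library. `add_months` is a hypothetical stand-in for `relativedelta(months=+6)`, the dates are made up, and both endpoints are treated as inclusive on the assumption that the date-string slices in the notebook behave that way.

```python
import datetime

def add_months(d, months):
    """Shift a date by whole months, clamping the day to the target month's length."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    # First day of the following month, minus one day, gives the month's last day.
    first_of_next = datetime.date(year + month // 12, month % 12 + 1, 1)
    last_day = (first_of_next - datetime.timedelta(days=1)).day
    return datetime.date(year, month, min(d.day, last_day))

def semester_windows(first, last):
    """Return (begin, end) pairs covering first..last in six-month windows."""
    windows = []
    begin = first
    while begin <= last:
        end = add_months(begin, 6)
        windows.append((begin, end))
        begin = end + datetime.timedelta(days=1)
    return windows

# Hypothetical contribution dates (not the notebook's data).
contribution_dates = [
    datetime.date(2010, 9, 1), datetime.date(2011, 3, 1),
    datetime.date(2011, 3, 2), datetime.date(2011, 10, 15),
]
windows = semester_windows(min(contribution_dates), max(contribution_dates))

# Count the dates that fall inside each window, endpoints included.
counts = [sum(1 for d in contribution_dates if begin <= d <= end)
          for begin, end in windows]

for (begin, end), count in zip(windows, counts):
    print(begin, end, count)
```

In this example the windows start on 1 September, 2 March, and 3 September: because each window begins the day after the previous one ends, the start date drifts by one day per semester.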



In [4]:

    
#aggregates data by semester
merged = aggregate_semesters(df)

#indexes dataframe by date
merged = merged.set_index(['semester'])

#calculating proportion of contributions by gender
merged['male_prop'] = merged.apply(male_proportion, axis=1)
merged['female_prop'] = merged.apply(female_proportion, axis=1)


