Community: Programmers

Initialization and data import are at the end of this notebook. They were placed at the bottom to make the analysis easier to read, but they must be run first for the analysis to work as expected. See the Initialization section below.

Data summary



In [5]:

    
merged.describe()









    Out[5]:






  
    
      
           contrib  contrib_females  contrib_males  male_prop  female_prop
count     7.000000         7.000000       7.000000   7.000000     7.000000
mean   1302.428571        54.714286    1247.714286   0.955497     0.044503
std    1038.951372        36.568266    1002.807678   0.005516     0.005516
min     310.000000        14.000000     296.000000   0.947706     0.036736
25%     706.000000        32.500000     673.500000   0.951932     0.040596
50%     960.000000        45.000000     915.000000   0.954839     0.045161
75%    1557.000000        68.500000    1488.500000   0.959404     0.048068
max    3321.000000       122.000000    3199.000000   0.963264     0.052294
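
The mean of female_prop in the summary is an unweighted average over the seven semesters, so it need not equal the pooled share across all contributors. A minimal sketch of that arithmetic, assuming only the count and mean rows reported above (the variable names are illustrative and not part of the notebook):

```python
# Values copied from the count and mean rows of the summary above.
n_semesters = 7
mean_contrib = 1302.428571
mean_females = 54.714286
mean_males = 1247.714286
mean_female_prop = 0.044503  # unweighted mean of the per-semester proportions

# count * mean recovers each column total (rounded to undo the 6-decimal display).
total_contrib = round(n_semesters * mean_contrib)
total_females = round(n_semesters * mean_females)
total_males = round(n_semesters * mean_males)

# Pooled share: all female contributors divided by all contributors.
pooled_female_prop = total_females / total_contrib

print(total_contrib, total_females, total_males)
print(round(pooled_female_prop, 4), mean_female_prop)
```

The pooled share comes out at roughly 0.042, slightly below the per-semester mean of 0.0445, because semesters with more contributors weigh more in the pooled figure.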

Fifth Question: Is the proportion of registrations by each gender decreasing?

Absolute amount of contributions



In [6]:

    
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))
axes[0].plot(merged['contrib_females'])
axes[0].set_title("Absolute number of contributions by females")
axes[1].plot(merged['contrib_males'])
axes[1].set_title("Absolute number of contributions by males")









    Out[6]:





<matplotlib.text.Text at 0x113918590>

Proportion of contributions by gender



In [8]:

    
fig, ax = pyplot.subplots(figsize=(15, 7))
pyplot.plot(merged[['male_prop', 'female_prop']])
ax.set_title("Proportion of Contributions")
ax.legend(["Male","Female"],bbox_to_anchor=(1.02, 1), loc=2, borderaxespad=0.)









    Out[8]:





<matplotlib.legend.Legend at 0x113d770d0>

Regression - Proportion of Women who joined per semester



In [9]:

    
merged['semester'] = [1,2,3,4,5,6,7]


result_female = smf.ols(formula="female_prop ~ semester", data=merged).fit()
result_female.summary()









    Out[9]:





OLS Regression Results

  Dep. Variable:        female_prop      R-squared:             0.138
  Model:                    OLS          Adj. R-squared:       -0.035
  Method:              Least Squares     F-statistic:          0.7982
  Date:              Mon, 27 Oct 2014    Prob (F-statistic):    0.413
  Time:                  14:11:32        Log-Likelihood:       27.526
  No. Observations:            7         AIC:                  -51.05
  Df Residuals:                5         BIC:                  -51.16
  Df Model:                    1

                 coef    std err          t      P>|t|    [95.0% Conf. Int.]
  Intercept    0.0407      0.005      8.586      0.000       0.029     0.053
  semester     0.0009      0.001      0.893      0.413      -0.002     0.004

  Omnibus:           nan    Durbin-Watson:         1.685
  Prob(Omnibus):     nan    Jarque-Bera (JB):      0.725
  Skew:            0.631    Prob(JB):              0.696
  Kurtosis:        2.055    Cond. No.               10.4



In [10]:

    
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

fig = sm.graphics.plot_fit(result_female, "semester", ax=axes[0])
fig = sm.graphics.plot_ccpr(result_female, "semester", ax=axes[1])

Regression - Proportion of Men who joined per semester



In [11]:

    
result_male = smf.ols(formula="male_prop ~ semester", data=merged).fit()
result_male.summary()









    Out[11]:





OLS Regression Results

  Dep. Variable:         male_prop       R-squared:             0.138
  Model:                    OLS          Adj. R-squared:       -0.035
  Method:              Least Squares     F-statistic:          0.7982
  Date:              Mon, 27 Oct 2014    Prob (F-statistic):    0.413
  Time:                  14:11:33        Log-Likelihood:       27.526
  No. Observations:            7         AIC:                  -51.05
  Df Residuals:                5         BIC:                  -51.16
  Df Model:                    1

                 coef    std err          t      P>|t|    [95.0% Conf. Int.]
  Intercept    0.9593      0.005    202.294      0.000       0.947     0.971
  semester    -0.0009      0.001     -0.893      0.413      -0.004     0.002

  Omnibus:           nan    Durbin-Watson:         1.685
  Prob(Omnibus):     nan    Jarque-Bera (JB):      0.725
  Skew:           -0.631    Prob(JB):              0.696
  Kurtosis:        2.055    Cond. No.               10.4



In [12]:

    
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

fig = sm.graphics.plot_fit(result_male, "semester", ax=axes[0])
fig = sm.graphics.plot_ccpr(result_male, "semester", ax=axes[1])
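
The two regressions above mirror each other: the semester coefficients are 0.0009 and -0.0009, the intercepts 0.0407 and 0.9593 add up to 1, and R-squared and the p-values are identical. This follows from male_prop = 1 - female_prop in every semester. A minimal sketch of why, using a hand-written least-squares fit in place of statsmodels; the proportions below are made-up illustrative values, not the notebook's data, and fit_line is a hypothetical helper:

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x (single predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

semester = [1, 2, 3, 4, 5, 6, 7]

# ASSUMPTION: synthetic proportions chosen only to illustrate the symmetry.
female_prop = [0.041, 0.044, 0.039, 0.047, 0.045, 0.050, 0.046]
male_prop = [1 - p for p in female_prop]

intercept_f, slope_f = fit_line(semester, female_prop)
intercept_m, slope_m = fit_line(semester, male_prop)

# The slopes cancel and the intercepts sum to one, up to floating-point error.
print(slope_f + slope_m)
print(intercept_f + intercept_m)
```

Since the second model carries no extra information, fitting only one of the two proportions is enough to answer the question for both genders.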

Initialization

Here you can find the data import and some helper functions used to analyse the data. Please run this first, otherwise the analysis will not work.

Import the data from the MongoDB database and load it into a pandas DataFrame for easy manipulation.



In [1]:

    
from __future__ import division
import pymongo, time, pylab, numpy, pandas, math
from scipy import stats
import matplotlib as mpl
from matplotlib import pyplot
import statsmodels.api as sm
import statsmodels.formula.api as smf

%matplotlib inline
mpl.style.use('ggplot')

client = pymongo.MongoClient('localhost', 27017)

community = 'programmers'
stats_db = client[community].statistics

cursor = stats_db.find({'$or': [{'questions_total':{'$gt':0}}, {'answers_total':{'$gt':0}}, 
                                {'comments_total':{'$gt':0}}], 
                        'gender': {'$ne': "Unknown"}  }, 
                       {u'_id': False, u'contributions_total': True,
                        u'joined': True, u'gender':True, 'gender_cat': True})


df =  pandas.DataFrame(list(cursor))



In [2]:

    
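# Sort by join date so the date-range slices in aggregate_semesters run on a monotonic index.
indexed_df = df.set_index(['joined']).sort_index()
males = indexed_df[indexed_df['gender']=='Male']
females = indexed_df[indexed_df['gender']=='Female']
# The query above already excludes 'Unknown', so this frame is expected to be empty.
unknown = indexed_df[indexed_df['gender']=='Unknown']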



In [3]:

    
from dateutil.relativedelta import relativedelta
import datetime

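def aggregate_semesters(df):
    # Counts the contributors who joined in each six-month window.
    # Relies on the notebook-level frames `males`, `females` and `indexed_df`.
    maxi = max(df.joined).date()
    begin = min(df.joined).date()
    end = begin + relativedelta(months=+6)

    rows = []

    while begin <= maxi:
        # Label-based slices on the 'joined' index include both endpoints.
        rows.append({"contrib": len(indexed_df[str(begin):str(end)].index),
                     "contrib_females": len(females[str(begin):str(end)].index),
                     "contrib_males": len(males[str(begin):str(end)].index),
                     "semester": begin})

        # The next window starts the day after the previous one ended.
        begin = end + relativedelta(days=+1)
        end = begin + relativedelta(months=+6)

    # DataFrame.append was removed in pandas 2.0, so build the frame in one step.
    return pandas.DataFrame(rows)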

def male_proportion(row):
    return row['contrib_males'] / row['contrib']

def female_proportion(row):
    return row['contrib_females'] / row['contrib']
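
The windows that aggregate_semesters walks through can be hard to picture: each one spans begin to begin + 6 months with both endpoints included, and the next one starts a day later. A minimal standard-library sketch of just that windowing, assuming day clamping at month end like relativedelta does; add_months, semester_windows and the dates are illustrative and not part of the notebook:

```python
import calendar
import datetime

def add_months(d, months):
    """Shift a date by whole months, clamping the day to the target month's length."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return datetime.date(year, month, day)

def semester_windows(first, last):
    """Return (begin, end) pairs covering first..last in six-month steps."""
    windows = []
    begin = first
    while begin <= last:
        end = add_months(begin, 6)
        windows.append((begin, end))
        # Same stepping as the notebook: start the day after the previous end.
        begin = end + datetime.timedelta(days=1)
    return windows

# ASSUMPTION: hypothetical first and last join dates, for illustration only.
windows = semester_windows(datetime.date(2011, 1, 1), datetime.date(2012, 6, 30))
for begin, end in windows:
    print(begin, end)
```

Because of the extra day between windows, each window start drifts by one day per step, and the last window may extend past the latest join date.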



In [4]:

    
#aggregates data by semester
merged = aggregate_semesters(df)

#indexes dataframe by date
merged = merged.set_index(['semester'])

#calculating proportion of contributions by gender
merged['male_prop'] = merged.apply(male_proportion, axis=1)
merged['female_prop'] = merged.apply(female_proportion, axis=1)


