Community: Programmers

Initialization and data import are at the end of this notebook. They were placed at the bottom to keep the analysis easy to read, but they must be run first for the analysis to work as expected.

Data summary


In [5]:
merged.describe()


Out[5]:
            contrib  contrib_females  contrib_males  male_prop  female_prop
count      7.000000         7.000000       7.000000   7.000000     7.000000
mean    1302.428571        54.714286    1247.714286   0.955497     0.044503
std     1038.951372        36.568266    1002.807678   0.005516     0.005516
min      310.000000        14.000000     296.000000   0.947706     0.036736
25%      706.000000        32.500000     673.500000   0.951932     0.040596
50%      960.000000        45.000000     915.000000   0.954839     0.045161
75%     1557.000000        68.500000    1488.500000   0.959404     0.048068
max     3321.000000       122.000000    3199.000000   0.963264     0.052294

Fifth Question: Is the proportion of registrations by each gender decreasing?

Absolute number of contributions


In [6]:
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))
axes[0].plot(merged['contrib_females'])
axes[0].set_title("Absolute number of contributions by females")
axes[1].plot(merged['contrib_males'])
axes[1].set_title("Absolute number of contributions by males")


Out[6]:
<matplotlib.text.Text at 0x113918590>

Proportion of contributions by gender


In [8]:
fig, ax = pyplot.subplots(figsize=(15, 7))
ax.plot(merged[['male_prop', 'female_prop']])
ax.set_title("Proportion of Contributions")
ax.legend(["Male","Female"],bbox_to_anchor=(1.02, 1), loc=2, borderaxespad=0.)


Out[8]:
<matplotlib.legend.Legend at 0x113d770d0>

Regression - Proportion of Women who joined per semester


In [9]:
# Number the semesters 1..n so they can be used as the regressor
merged['semester'] = range(1, len(merged) + 1)


result_female = smf.ols(formula="female_prop ~ semester", data=merged).fit()
result_female.summary()


Out[9]:
OLS Regression Results
Dep. Variable:     female_prop        R-squared:           0.138
Model:             OLS                Adj. R-squared:      -0.035
Method:            Least Squares      F-statistic:         0.7982
Date:              Mon, 27 Oct 2014   Prob (F-statistic):  0.413
Time:              14:11:32           Log-Likelihood:      27.526
No. Observations:  7                  AIC:                 -51.05
Df Residuals:      5                  BIC:                 -51.16
Df Model:          1

             coef  std err        t  P>|t|  [95.0% Conf. Int.]
Intercept  0.0407    0.005    8.586  0.000    0.029     0.053
semester   0.0009    0.001    0.893  0.413   -0.002     0.004

Omnibus:           nan                Durbin-Watson:       1.685
Prob(Omnibus):     nan                Jarque-Bera (JB):    0.725
Skew:              0.631              Prob(JB):            0.696
Kurtosis:          2.055              Cond. No.            10.4
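
The headline numbers in this summary are linked by a few standard identities for an OLS fit with a single regressor, which helps when judging how weak the fit is. The following is a minimal sketch that uses only values printed in the summary; the variable names are illustrative, and small gaps are expected because the printed values are rounded.

```python
import math

# Values copied from the OLS summary above
n_obs, df_model, df_resid = 7, 1, 5
f_stat, t_slope, loglik = 0.7982, 0.893, 27.526

# With a single regressor, F is the square of the slope's t-statistic
assert abs(t_slope ** 2 - f_stat) < 0.005

# R-squared recovered from F and the degrees of freedom
r2 = f_stat * df_model / (f_stat * df_model + df_resid)
adj_r2 = 1 - (1 - r2) * (n_obs - 1) / df_resid

# Information criteria with k = 2 estimated parameters (intercept and slope)
k = df_model + 1
aic = -2 * loglik + 2 * k
bic = -2 * loglik + k * math.log(n_obs)

print(round(r2, 3), round(adj_r2, 3), round(aic, 2), round(bic, 2))
# 0.138 -0.035 -51.05 -51.16
```

A negative adjusted R-squared, a slope p-value of 0.413, and a confidence interval for the slope that spans zero (-0.002 to 0.004) all point the same way: with seven semesters, the data shows no trend in the female proportion that can be distinguished from zero.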

In [10]:
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

fig = sm.graphics.plot_fit(result_female, "semester", ax=axes[0])
fig = sm.graphics.plot_ccpr(result_female, "semester", ax=axes[1])


Regression - Proportion of Men who joined per semester


In [11]:
result_male = smf.ols(formula="male_prop ~ semester", data=merged).fit()
result_male.summary()


Out[11]:
OLS Regression Results
Dep. Variable:     male_prop          R-squared:           0.138
Model:             OLS                Adj. R-squared:      -0.035
Method:            Least Squares      F-statistic:         0.7982
Date:              Mon, 27 Oct 2014   Prob (F-statistic):  0.413
Time:              14:11:33           Log-Likelihood:      27.526
No. Observations:  7                  AIC:                 -51.05
Df Residuals:      5                  BIC:                 -51.16
Df Model:          1

             coef  std err        t  P>|t|  [95.0% Conf. Int.]
Intercept  0.9593    0.005  202.294  0.000    0.947     0.971
semester  -0.0009    0.001   -0.893  0.413   -0.004     0.002

Omnibus:           nan                Durbin-Watson:       1.685
Prob(Omnibus):     nan                Jarque-Bera (JB):    0.725
Skew:              -0.631             Prob(JB):            0.696
Kurtosis:          2.055              Cond. No.            10.4
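
This regression mirrors the female one: the intercepts sum to one (0.0407 + 0.9593), the slopes cancel (0.0009 and -0.0009), and the fit statistics are identical. That is expected, because users of unknown gender are excluded, so male_prop = 1 - female_prop in every semester. The sketch below illustrates the identity with a hand-written simple regression; `ols_line` is a hypothetical helper and the proportions are made-up numbers, not the notebook's data.

```python
def ols_line(x, y):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical female proportions for seven semesters (not the notebook's data)
semester = [1, 2, 3, 4, 5, 6, 7]
female_prop = [0.040, 0.045, 0.037, 0.048, 0.041, 0.052, 0.046]
male_prop = [1 - p for p in female_prop]

a_f, b_f = ols_line(semester, female_prop)
a_m, b_m = ols_line(semester, male_prop)

# Slopes cancel and intercepts sum to one, up to floating-point error
assert abs(b_f + b_m) < 1e-12
assert abs(a_f + a_m - 1) < 1e-12
```

The second regression therefore adds no information beyond the first; it restates the same result with the sign of the slope flipped.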

In [12]:
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

fig = sm.graphics.plot_fit(result_male, "semester", ax=axes[0])
fig = sm.graphics.plot_ccpr(result_male, "semester", ax=axes[1])



Initialization

This section contains the data import and some helper functions used in the analysis. Run it first; otherwise the analysis will not work.

Importing the data from the MongoDB database and inserting it into a pandas DataFrame for easy manipulation.


In [1]:
from __future__ import division
import pymongo, time, pylab, numpy, pandas, math
from scipy import stats
import matplotlib as mpl
from matplotlib import pyplot
import statsmodels.api as sm
import statsmodels.formula.api as smf

%matplotlib inline
mpl.style.use('ggplot')

client = pymongo.MongoClient('localhost', 27017)

community = 'programmers'
stats_db = client[community].statistics

cursor = stats_db.find({'$or': [{'questions_total':{'$gt':0}}, {'answers_total':{'$gt':0}}, 
                                {'comments_total':{'$gt':0}}], 
                        'gender': {'$ne': "Unknown"}  }, 
                       {u'_id': False, u'contributions_total': True,
                        u'joined': True, u'gender':True, 'gender_cat': True})


df =  pandas.DataFrame(list(cursor))
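
The filter passed to `find` keeps users who have at least one question, answer, or comment and whose gender is not "Unknown". As a rough illustration of that logic, here is the same condition written as a plain Python predicate. The function name `matches` and the sample records are hypothetical; only the field names come from the query.

```python
def matches(user):
    """Pure-Python reading of the MongoDB filter above, for one user record."""
    has_activity = any(user.get(field, 0) > 0
                       for field in ("questions_total", "answers_total", "comments_total"))
    return has_activity and user.get("gender") != "Unknown"

# Hypothetical records, for illustration only
users = [
    {"gender": "Female", "questions_total": 2, "answers_total": 0, "comments_total": 0},
    {"gender": "Male", "questions_total": 0, "answers_total": 0, "comments_total": 0},
    {"gender": "Unknown", "questions_total": 5, "answers_total": 1, "comments_total": 3},
]
kept = [u["gender"] for u in users if matches(u)]
print(kept)  # ['Female']
```

The second record is dropped because it has no activity, and the third because its gender is "Unknown".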

In [2]:
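# Index by join date; sort so the date-range slices in aggregate_semesters work
indexed_df = df.set_index(['joined']).sort_index()
males = indexed_df[indexed_df['gender']=='Male']
females = indexed_df[indexed_df['gender']=='Female']
# Always empty: the query above already excludes gender == "Unknown"
unknown = indexed_df[indexed_df['gender']=='Unknown']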

In [3]:
from dateutil.relativedelta import relativedelta
import datetime

def aggregate_semesters(df):
    # Counts the users who joined in each six-month window.
    # Relies on the module-level males, females and indexed_df frames.
    maxi = max(df.joined).date()
    begin = min(df.joined).date()
    end = begin + relativedelta(months=+6)

    rows = []

    while begin <= maxi:
        # Date-string slices on the index include both endpoints
        rows.append({"semester": begin,
                     "contrib_males": len(males[str(begin):str(end)].index),
                     "contrib_females": len(females[str(begin):str(end)].index),
                     "contrib": len(indexed_df[str(begin):str(end)].index)})

        # The next window starts the day after the previous one ends
        begin = end + relativedelta(days=+1)
        end = begin + relativedelta(months=+6)

    # Build the frame once instead of appending row by row
    return pandas.DataFrame(rows)

def male_proportion(row):
    return row['contrib_males'] / row['contrib']

def female_proportion(row):
    return row['contrib_females'] / row['contrib']
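
The loop in `aggregate_semesters` steps through time in windows of six months, starting each new window the day after the previous one ends. To make the resulting boundaries concrete, the following sketch reproduces only that stepping rule with the standard library. The helpers `add_months` and `semester_windows` and the two dates are hypothetical, chosen for illustration; the notebook itself uses `relativedelta`.

```python
import calendar
import datetime

def add_months(d, months):
    """Shift a date by whole months, clipping the day to the month's length."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return datetime.date(year, month, day)

def semester_windows(first, last):
    """Yield (begin, end) pairs using the same stepping rule as the loop above."""
    begin = first
    while begin <= last:
        end = add_months(begin, 6)
        yield begin, end
        begin = end + datetime.timedelta(days=1)

# Hypothetical first and last join dates
windows = list(semester_windows(datetime.date(2011, 1, 15), datetime.date(2012, 6, 30)))
for begin, end in windows:
    print(begin, end)
# 2011-01-15 2011-07-15
# 2011-07-16 2012-01-16
# 2012-01-17 2012-07-17
```

Two properties of this rule are worth keeping in mind when reading the per-semester counts: each window starts one day later in the month than the previous one, and the last window can extend past the last join date, so it may cover less than six months of data.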

In [4]:
#aggregates data by semester
merged = aggregate_semesters(df)

#indexes dataframe by date
merged = merged.set_index(['semester'])

#calculating proportion of contributions by gender
merged['male_prop'] = merged.apply(male_proportion, axis=1)
merged['female_prop'] = merged.apply(female_proportion, axis=1)
