Community: StackOverflow

Initialization and data import are at the end of this notebook. They were placed at the bottom to keep the analysis easy to read, but they must be run first for the analysis to work as expected. See the Initialization section at the bottom.

Data summary


In [5]:
merged.describe()


Out[5]:
contrib contrib_females contrib_males male_prop female_prop
count 11.000000 11.000000 11.000000 11.000000 11.000000
mean 12112.454545 717.909091 11394.545455 0.940064 0.059936
std 4388.494078 335.495292 4093.349444 0.017631 0.017631
min 3070.000000 275.000000 2794.000000 0.910098 0.029506
25% 9224.500000 449.500000 8911.500000 0.929288 0.049245
50% 12378.000000 647.000000 11740.000000 0.938508 0.061492
75% 15270.500000 1041.000000 14317.000000 0.950755 0.070712
max 17482.000000 1173.000000 16407.000000 0.970494 0.089902

Fifth Question: Is the proportion of registrations by each gender decreasing?

Absolute number of contributions


In [6]:
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))
axes[0].plot(merged['contrib_females'])
axes[0].set_title("Absolute number of contributions by females")
axes[1].plot(merged['contrib_males'])
axes[1].set_title("Absolute number of contributions by males")


Out[6]:
<matplotlib.text.Text at 0x10d07e510>

Proportion of contributions by gender


In [7]:
fig, ax = pyplot.subplots(figsize=(15, 7))
ax.plot(merged[['male_prop', 'female_prop']])
ax.set_title("Proportion of Contributions")
ax.legend(["Male","Female"],bbox_to_anchor=(1.02, 1), loc=2, borderaxespad=0.)


Out[7]:
<matplotlib.legend.Legend at 0x10d9d4810>

Regression - Proportion of Women who joined per semester


In [8]:
# number the semesters 1..N so they can be used as the regressor
merged['semester'] = range(1, len(merged) + 1)


result_female = smf.ols(formula="female_prop ~ semester", data=merged).fit()
result_female.summary()


/Library/Python/2.7/site-packages/scipy/stats/stats.py:1205: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=11
  int(n))
Out[8]:
OLS Regression Results
Dep. Variable: female_prop R-squared: 0.973
Model: OLS Adj. R-squared: 0.970
Method: Least Squares F-statistic: 319.2
Date: Mon, 27 Oct 2014 Prob (F-statistic): 2.45e-08
Time: 14:17:06 Log-Likelihood: 49.116
No. Observations: 11 AIC: -94.23
Df Residuals: 9 BIC: -93.44
Df Model: 1
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 0.0285 0.002 14.312 0.000 0.024 0.033
semester 0.0052 0.000 17.867 0.000 0.005 0.006
Omnibus: 2.233 Durbin-Watson: 1.401
Prob(Omnibus): 0.327 Jarque-Bera (JB): 1.089
Skew: -0.398 Prob(JB): 0.580
Kurtosis: 1.680 Cond. No. 14.8
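
As a rough reading of the table above, the fitted line can be evaluated by hand from the reported coefficients. The sketch below hardcodes the rounded values from the summary (intercept 0.0285, slope 0.0052), so its results are approximate. The name predicted_female_prop is made up for illustration; with the fitted model in memory, result_female.predict would be the more precise route.

```python
# Coefficients copied (rounded) from the OLS summary above.
INTERCEPT = 0.0285
SLOPE = 0.0052

def predicted_female_prop(semester):
    """Fitted female proportion for a semester number (1 = first semester)."""
    return INTERCEPT + SLOPE * semester

# Evaluate the fitted line at the first, middle and last semester.
for semester in (1, 6, 11):
    print(semester, round(predicted_female_prop(semester), 4))
```

Under this fit the female share grows by roughly half a percentage point per semester.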

In [9]:
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

fig = sm.graphics.plot_fit(result_female, "semester", ax=axes[0])
fig = sm.graphics.plot_ccpr(result_female, "semester", ax=axes[1])


Regression - Proportion of Men who joined per semester


In [10]:
result_male = smf.ols(formula="male_prop ~ semester", data=merged).fit()
result_male.summary()


Out[10]:
OLS Regression Results
Dep. Variable: male_prop R-squared: 0.973
Model: OLS Adj. R-squared: 0.970
Method: Least Squares F-statistic: 319.2
Date: Mon, 27 Oct 2014 Prob (F-statistic): 2.45e-08
Time: 14:17:07 Log-Likelihood: 49.116
No. Observations: 11 AIC: -94.23
Df Residuals: 9 BIC: -93.44
Df Model: 1
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 0.9715 0.002 488.198 0.000 0.967 0.976
semester -0.0052 0.000 -17.867 0.000 -0.006 -0.005
Omnibus: 2.233 Durbin-Watson: 1.401
Prob(Omnibus): 0.327 Jarque-Bera (JB): 1.089
Skew: 0.398 Prob(JB): 0.580
Kurtosis: 1.680 Cond. No. 14.8
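
Because male_prop equals 1 - female_prop in every row, this fit mirrors the female one: the slope changes sign and the two intercepts add up to 1 (0.0285 + 0.9715), while R-squared and the F-statistic are identical. A minimal sketch of that property follows, using numpy.polyfit on made-up proportions rather than the notebook's data.

```python
import numpy

# Made-up female proportions for 11 semesters (illustrative only).
semester = numpy.arange(1, 12)
female_prop = numpy.array([0.030, 0.041, 0.045, 0.049, 0.055, 0.061,
                           0.064, 0.070, 0.075, 0.083, 0.090])
male_prop = 1.0 - female_prop

# polyfit with degree 1 returns [slope, intercept] of the least-squares line.
slope_f, intercept_f = numpy.polyfit(semester, female_prop, 1)
slope_m, intercept_m = numpy.polyfit(semester, male_prop, 1)

print(slope_f + slope_m)          # close to 0: the slopes are negatives of each other
print(intercept_f + intercept_m)  # close to 1: the intercepts are complementary
```

The second regression therefore carries no information beyond the first; it is the same trend seen from the other side.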

In [11]:
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

fig = sm.graphics.plot_fit(result_male, "semester", ax=axes[0])
fig = sm.graphics.plot_ccpr(result_male, "semester", ax=axes[1])



Initialization

This section contains the data import and some helper functions used in the analysis. Run it first; otherwise the analysis will not work.

The data is imported from the MongoDB database and loaded into a pandas DataFrame for easy manipulation.


In [1]:
from __future__ import division
import pymongo, time, pylab, numpy, pandas, math
from scipy import stats
import matplotlib as mpl
from matplotlib import pyplot
import statsmodels.api as sm
import statsmodels.formula.api as smf

%matplotlib inline
mpl.style.use('ggplot')

client = pymongo.MongoClient('localhost', 27017)

community = 'stackoverflow'
stats_db = client[community].statistics

cursor = stats_db.find({'$or': [{'questions_total':{'$gt':0}}, {'answers_total':{'$gt':0}}, 
                                {'comments_total':{'$gt':0}}], 
                        'gender': {'$ne': "Unknown"}  }, 
                       {u'_id': False, u'contributions_total': True,
                        u'joined': True, u'gender':True, 'gender_cat': True})


df =  pandas.DataFrame(list(cursor))

In [2]:
indexed_df = df.set_index(['joined'])
males = indexed_df[indexed_df['gender']=='Male']
females = indexed_df[indexed_df['gender']=='Female']
unknown = indexed_df[indexed_df['gender']=='Unknown']

In [3]:
from dateutil.relativedelta import relativedelta
import datetime

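def aggregate_semesters(df):
    # Counts the users who joined in each consecutive six-month window.
    # Relies on the globals males, females and indexed_df defined above.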
    maxi = max(df.joined).date()
    begin = min(df.joined).date()
    end = begin + relativedelta(months=+6)
    
    return_df = pandas.DataFrame(data={})

    while(begin <= maxi):
        d = {"semester": begin, 
             "contrib_males": len(males[str(begin):str(end)].index),
             "contrib_females": len(females[str(begin):str(end)].index),
             "contrib": len(indexed_df[str(begin):str(end)].index)
             }
        return_df = return_df.append(d, ignore_index=True)
        
        begin = end + relativedelta(days=+1)
        end = begin + relativedelta(months=+6)
    
    return return_df

def male_proportion(row):
    return row['contrib_males'] / row['contrib']

def female_proportion(row):
    return row['contrib_females'] / row['contrib']
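
aggregate_semesters above grows its result with DataFrame.append, which was removed in pandas 2.0, and it reads the global frames males, females and indexed_df. The following is a sketch of the same windowing idea that collects rows in a list and builds the DataFrame once. It is not the notebook's code: the name aggregate_semesters_sketch and the sample data are hypothetical, and it assumes 'joined' is a datetime column and 'gender' holds 'Male' or 'Female'.

```python
import pandas

def aggregate_semesters_sketch(df):
    """Count users per gender in consecutive six-month windows of 'joined'."""
    joined = df['joined'].dt.normalize()  # drop the time of day
    begin = joined.min()
    last = joined.max()
    rows = []
    while begin <= last:
        end = begin + pandas.DateOffset(months=6)
        # Both window ends are inclusive, as in the date-string slices above.
        window = df[(joined >= begin) & (joined <= end)]
        rows.append({
            'semester': begin.date(),
            'contrib_males': int((window['gender'] == 'Male').sum()),
            'contrib_females': int((window['gender'] == 'Female').sum()),
            'contrib': len(window),
        })
        begin = end + pandas.Timedelta(days=1)
    return pandas.DataFrame(rows)

# Hypothetical sample data, for illustration only.
sample = pandas.DataFrame({
    'joined': pandas.to_datetime(['2010-01-01', '2010-03-15', '2010-07-01',
                                  '2010-07-02', '2011-01-01']),
    'gender': ['Male', 'Female', 'Male', 'Male', 'Female'],
})
sample_semesters = aggregate_semesters_sketch(sample)
print(sample_semesters)
```

Passing the frame in as an argument, rather than reading globals, also makes the function usable on the male-only or female-only subsets.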

In [4]:
#aggregates data by semester
merged = aggregate_semesters(df)

#indexes dataframe by date
merged = merged.set_index(['semester'])

#calculating proportion of contributions by gender
merged['male_prop'] = merged.apply(male_proportion, axis=1)
merged['female_prop'] = merged.apply(female_proportion, axis=1)
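
The two apply calls above divide one column by another row by row. As a possible alternative, the same proportions can be computed with column-wise division. The sketch below uses a small frame of hypothetical counts (the name counts and its values are made up) instead of merged.

```python
import pandas

# Hypothetical per-semester counts, for illustration only.
counts = pandas.DataFrame({
    'contrib_males': [2794, 8900, 11740],
    'contrib_females': [276, 450, 647],
})
counts['contrib'] = counts['contrib_males'] + counts['contrib_females']

# Column-wise division replaces the row-by-row apply calls.
counts['male_prop'] = counts['contrib_males'] / counts['contrib']
counts['female_prop'] = counts['contrib_females'] / counts['contrib']
print(counts)
```

When contrib is the sum of the two gender counts, the two proportions add up to 1 in every row.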
