Community: StackOverflow

Initialization and data import are at the end of this notebook. They were placed at the bottom to keep the analysis easy to read, but they must be run first for the analysis to work as expected.

Data summary


In [6]:
merged.describe()


Out[6]:
              contrib  contrib_females   contrib_males  male_prop  female_prop
count       11.000000        11.000000       11.000000  11.000000    11.000000
mean   1386627.272727     64762.636364  1321864.636364   0.957977     0.042023
std     720936.712398     41650.407168   679542.609959   0.010583     0.010583
min     266108.000000      5197.000000   260911.000000   0.946952     0.019530
25%     804623.500000     31121.000000   773502.500000   0.950056     0.038543
50%    1557108.000000     65366.000000  1491742.000000   0.957483     0.042517
75%    2021578.000000    101961.000000  1919617.000000   0.961457     0.049944
max    2348170.000000    121994.000000  2226176.000000   0.980470     0.053048

Fourth Question: Is the proportion of contributions by each gender decreasing?

Hypothesis 1:

H0: contributionsBefore(Males) > contributionsRecent(Males) & contributionsBefore(Females) > contributionsRecent(Females);

H1: contributionsBefore(Males) <= contributionsRecent(Males) & contributionsBefore(Females) <= contributionsRecent(Females).

Absolute amount of contributions


In [7]:
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))
axes[0].plot(merged['contrib_females'])
axes[0].set_title("Absolute number of female's contributions")
axes[1].plot(merged['contrib_males'])
axes[1].set_title("Absolute number of male's contributions")


Out[7]:
<matplotlib.text.Text at 0x116d7fd10>

Proportion of contributions by gender


In [8]:
fig, ax = pyplot.subplots(figsize=(15, 7))
pyplot.plot(merged[['male_prop', 'female_prop']])
ax.set_title("Proportion of Contributions")
ax.legend(["Male","Female"],bbox_to_anchor=(1.02, 1), loc=2, borderaxespad=0.)


Out[8]:
<matplotlib.legend.Legend at 0x117ce5790>

Regression - Proportion of contributions by Women


In [9]:
# Number the semesters 1..N so they can serve as a numeric regressor
merged['semester'] = list(range(1, len(merged) + 1))

import statsmodels.formula.api as smf
result_female = smf.ols(formula="female_prop ~ semester", data=merged).fit()
result_female.summary()


/Library/Python/2.7/site-packages/scipy/stats/stats.py:1205: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=11
  int(n))
Out[9]:
OLS Regression Results
Dep. Variable:     female_prop        R-squared:            0.881
Model:             OLS                Adj. R-squared:       0.868
Method:            Least Squares      F-statistic:          66.64
Date:              Sat, 11 Oct 2014   Prob (F-statistic):   1.88e-05
Time:              13:55:44           Log-Likelihood:       46.658
No. Observations:  11                 AIC:                  -89.32
Df Residuals:      9                  BIC:                  -88.52
Df Model:          1

            coef     std err  t        P>|t|   [95.0% Conf. Int.]
Intercept   0.0241   0.002    9.666    0.000   0.018   0.030
semester    0.0030   0.000    8.164    0.000   0.002   0.004

Omnibus:         1.666    Durbin-Watson:      0.871
Prob(Omnibus):   0.435    Jarque-Bera (JB):   0.937
Skew:            -0.688   Prob(JB):           0.626
Kurtosis:        2.612    Cond. No.           14.8
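
As a quick sanity check on the summary above: in a regression with a single predictor, the F-statistic equals the square of the slope's t-statistic, and it also equals R² · (n − 2) / (1 − R²). The sketch below re-uses only the rounded figures printed in the table, so the agreement is approximate. The variable names and the helper `fitted_female_prop` are illustrative and not part of the notebook.

```python
# Figures copied from the OLS summary above (already rounded by statsmodels).
r_squared = 0.881
t_slope = 8.164
f_reported = 66.64
df_resid = 9            # 11 observations minus 2 estimated parameters
intercept = 0.0241
slope = 0.0030

# Two ways to recover the F-statistic in a one-predictor regression.
f_from_t = t_slope ** 2
f_from_r2 = r_squared / (1 - r_squared) * df_resid


def fitted_female_prop(semester):
    """Fitted share of contributions by women for a semester number (hypothetical helper)."""
    return intercept + slope * semester


print(f_from_t, f_from_r2, f_reported)
print(fitted_female_prop(1), fitted_female_prop(11))
```

Both recomputed values land close to the reported 66.64, and the fitted line rises from about 0.027 in semester 1 to about 0.057 in semester 11.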

In [10]:
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

fig = sm.graphics.plot_fit(result_female, "semester", ax=axes[0])
fig = sm.graphics.plot_ccpr(result_female, "semester", ax=axes[1])


Regression - Proportion of contributions by Men


In [11]:
result_male = smf.ols(formula="male_prop ~ semester", data=merged).fit()
result_male.summary()


Out[11]:
OLS Regression Results
Dep. Variable:     male_prop          R-squared:            0.881
Model:             OLS                Adj. R-squared:       0.868
Method:            Least Squares      F-statistic:          66.64
Date:              Sat, 11 Oct 2014   Prob (F-statistic):   1.88e-05
Time:              13:55:56           Log-Likelihood:       46.658
No. Observations:  11                 AIC:                  -89.32
Df Residuals:      9                  BIC:                  -88.52
Df Model:          1

            coef      std err  t         P>|t|   [95.0% Conf. Int.]
Intercept   0.9759    0.002    392.216   0.000   0.970    0.982
semester    -0.0030   0.000    -8.164    0.000   -0.004   -0.002

Omnibus:         1.666    Durbin-Watson:      0.871
Prob(Omnibus):   0.435    Jarque-Bera (JB):   0.937
Skew:            0.688    Prob(JB):           0.626
Kurtosis:        2.612    Cond. No.           14.8
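
The male and female summaries mirror each other because, with the "Unknown" gender filtered out, the two proportions appear to sum to one in every semester (the reported intercepts 0.0241 and 0.9759 do, as do the two column means). Regressing 1 − y on x instead of y negates the slope and turns the intercept a into 1 − a, while the goodness of fit is unchanged. The sketch below shows this with a hand-written least-squares helper (`ols_line`, a hypothetical name) on made-up proportions, not the notebook's data.

```python
def ols_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Made-up proportions for illustration only; these are NOT the notebook's data.
semesters = [1, 2, 3, 4, 5]
female_prop = [0.020, 0.031, 0.035, 0.044, 0.050]
male_prop = [1 - p for p in female_prop]

a_f, b_f = ols_line(semesters, female_prop)
a_m, b_m = ols_line(semesters, male_prop)

print(abs(b_m + b_f) < 1e-12)        # slopes are equal in size, opposite in sign
print(abs(a_m + a_f - 1) < 1e-12)    # intercepts sum to one
```

So the second regression carries no new information: a rising share for women is the same statement as a falling share for men.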

In [12]:
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

fig = sm.graphics.plot_fit(result_male, "semester", ax=axes[0])
fig = sm.graphics.plot_ccpr(result_male, "semester", ax=axes[1])



Initialization

Here you can find the data import and some helper functions used in the analysis. Please run this section first; otherwise the analysis will not work.

Import the data from the MongoDB database into a pandas DataFrame for easy manipulation.


In [2]:
from __future__ import division
import pymongo, time, pylab, numpy, pandas, math
from scipy import stats
import matplotlib as mpl
from matplotlib import pyplot
import statsmodels.api as sm

%matplotlib inline
# mpl.style.use('ggplot')
# pyplot.rcdefaults()

client = pymongo.MongoClient('localhost', 27017)

community = 'stackoverflow'
stats_db = client[community].statistics

pipeline = [
    {'$match':{'$or': [{'questions_total':{'$gt':0}}, {'answers_total':{'$gt':0}}, 
                                {'comments_total':{'$gt':0}}], 
                        'gender': {'$ne': "Unknown"} }},
    {'$unwind': '$dates'},
    {'$project': {'gender':1, 'dates':1}}
    
]

cursor = stats_db.aggregate(pipeline, cursor={})

df =  pandas.DataFrame(list(cursor))

In [3]:
indexed_df = df.set_index(['dates'])
males = indexed_df[indexed_df['gender']=='Male']
females = indexed_df[indexed_df['gender']=='Female']

In [4]:
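from dateutil.relativedelta import relativedelta
import datetime

def aggregate_semesters(df):
    """Count contributions per six-month window, in total and by gender.

    Relies on the notebook-level `indexed_df`, `males` and `females` frames.
    """
    maxi = max(df.dates).date()
    begin = min(df.dates).date()
    end = begin + relativedelta(months=+6)

    rows = []

    while begin <= maxi:
        # Date-string slices on a datetime index include both endpoints
        rows.append({"semester": begin,
                     "contrib_males": len(males[str(begin):str(end)].index),
                     "contrib_females": len(females[str(begin):str(end)].index),
                     "contrib": len(indexed_df[str(begin):str(end)].index)})

        # The next window starts the day after the current one ends
        begin = end + relativedelta(days=+1)
        end = begin + relativedelta(months=+6)

    # Build the frame once; DataFrame.append was removed in pandas 2.0
    return pandas.DataFrame(rows, columns=["semester", "contrib",
                                           "contrib_females", "contrib_males"])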

def male_proportion(row):
    return row['contrib_males'] / row['contrib']

def female_proportion(row):
    return row['contrib_females'] / row['contrib']
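
The windowing used by aggregate_semesters can be sketched without pandas or dateutil: each window runs from `begin` to `begin` plus six months, both ends included, and the next window starts one day later. The helpers `add_months`, `semester_windows` and `count_by_gender`, and the sample records, are hypothetical and only meant to make the window boundaries visible.

```python
import calendar
from collections import Counter
from datetime import date, timedelta


def add_months(d, months):
    """Shift a date by whole months, clamping the day to the target month's length."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def semester_windows(first, last):
    """Yield inclusive (begin, end) pairs covering every date from first to last."""
    begin = first
    while begin <= last:
        end = add_months(begin, 6)
        yield begin, end
        begin = end + timedelta(days=1)


def count_by_gender(records, begin, end):
    """Count (day, gender) records whose day falls inside the inclusive window."""
    return Counter(gender for day, gender in records if begin <= day <= end)


# Made-up records for illustration only.
records = [
    (date(2008, 8, 15), "Male"),
    (date(2009, 2, 1), "Female"),   # last day of the first window
    (date(2009, 2, 2), "Male"),     # first day of the second window
    (date(2009, 9, 1), "Male"),
]

windows = list(semester_windows(date(2008, 8, 1), date(2009, 12, 31)))
for begin, end in windows:
    print(begin, end, dict(count_by_gender(records, begin, end)))
```

Because each window is six months plus the start day, the window start drifts forward by one day per semester; over the eleven semesters analysed here that drift is small.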

In [5]:
#aggregates data by semester
merged = aggregate_semesters(df)

#indexes dataframe by date
merged = merged.set_index(['semester'])

#calculating proportion of contributions by gender
merged['male_prop'] = merged.apply(male_proportion, axis=1)
merged['female_prop'] = merged.apply(female_proportion, axis=1)
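
As a worked check of the two proportion helpers, the minimum values in the data summary can be plugged in by hand. The three minimum counts add up (5197 + 260911 = 266108), which suggests they come from the same semester; that is an assumption, since only the summary statistics are shown. The variable names below are illustrative.

```python
# Minimum counts from the data summary; assumed to belong to the same semester.
contrib = 266108
contrib_females = 5197
contrib_males = 260911

# Same arithmetic as female_proportion(row) and male_proportion(row).
female_share = contrib_females / contrib
male_share = contrib_males / contrib

print(round(female_share, 6), round(male_share, 6))
```

The results agree with the summary's minimum female_prop (0.019530) and maximum male_prop (0.980470). Under Python 2, the `from __future__ import division` line in the initialization is what keeps `/` from truncating when both operands are integers.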
