Community: SuperUser

Initialization and data import are at the end of this notebook. They were placed at the bottom to keep the analysis easy to read, but those cells must be run first for the analysis to work as expected.

Data summary


In [5]:
merged.describe()


Out[5]:
       contrib       contrib_females  contrib_males  male_prop  female_prop
count  11.000000     11.000000        11.000000      11.000000  11.000000
mean   31573.454545  1238.545455      30334.909091   0.964964   0.035036
std    14222.254978  632.761782       13617.867039   0.011355   0.011355
min    349.000000    3.000000         346.000000     0.951722   0.008596
25%    32570.500000  1212.000000      31300.000000   0.959400   0.034489
50%    35081.000000  1366.000000      33787.000000   0.963114   0.036886
75%    39092.000000  1511.500000      37580.500000   0.965511   0.040600
max    45576.000000  2124.000000      43452.000000   0.991404   0.048278

Fourth Question: Is the proportion of contributions by each gender decreasing?

Hypothesis 1:

H0: contributionsBefore(Males) > contributionsRecent(Males) & contributionsBefore(Females) > contributionsRecent(Females);

H1: contributionsBefore(Males) <= contributionsRecent(Males) & contributionsBefore(Females) <= contributionsRecent(Females).

Absolute amount of contributions


In [6]:
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))
axes[0].plot(merged['contrib_females'])
axes[0].set_title("Absolute number of contributions by females")
axes[1].plot(merged['contrib_males'])
axes[1].set_title("Absolute number of contributions by males")


Out[6]:
<matplotlib.text.Text at 0x116465390>

Proportion of contributions by gender


In [7]:
fig, ax = pyplot.subplots(figsize=(15, 7))
pyplot.plot(merged[['male_prop', 'female_prop']])
ax.set_title("Proportion of Contributions")
ax.legend(["Male","Female"],bbox_to_anchor=(1.02, 1), loc=2, borderaxespad=0.)


Out[7]:
<matplotlib.legend.Legend at 0x1174145d0>

Regression - Proportion of contributions by Women


In [8]:
merged['semester'] = [1,2,3,4,5,6,7,8,9,10,11]

import statsmodels.formula.api as smf
result_female = smf.ols(formula="female_prop ~ semester", data=merged).fit()
result_female.summary()


/Library/Python/2.7/site-packages/scipy/stats/stats.py:1205: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=11
  int(n))
Out[8]:
OLS Regression Results
Dep. Variable: female_prop R-squared: 0.545
Model: OLS Adj. R-squared: 0.495
Method: Least Squares F-statistic: 10.79
Date: Sat, 11 Oct 2014 Prob (F-statistic): 0.00946
Time: 13:20:18 Log-Likelihood: 38.508
No. Observations: 11 AIC: -73.02
Df Residuals: 9 BIC: -72.22
Df Model: 1
           coef    std err  t      P>|t|  [95.0% Conf. Int.]
Intercept  0.0199  0.005    3.806  0.004  0.008  0.032
semester   0.0025  0.001    3.285  0.009  0.001  0.004
Omnibus: 0.007 Durbin-Watson: 1.080
Prob(Omnibus): 0.996 Jarque-Bera (JB): 0.172
Skew: -0.036 Prob(JB): 0.918
Kurtosis: 2.392 Cond. No. 14.8
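
As a reading aid, the fitted line from the summary above can be evaluated by hand. This is a minimal sketch that hard-codes the rounded coefficients reported in Out[8] (intercept 0.0199, slope 0.0025 per semester). The name `predicted_female_prop` is illustrative and is not part of the notebook.

```python
# Rounded coefficients copied from the OLS summary in Out[8].
INTERCEPT = 0.0199
SLOPE = 0.0025

def predicted_female_prop(semester):
    """Fitted female proportion for a 1-based semester index."""
    return INTERCEPT + SLOPE * semester

first = predicted_female_prop(1)
last = predicted_female_prop(11)
print("semester 1:  %.4f" % first)
print("semester 11: %.4f" % last)
print("change over the period: %.4f" % (last - first))
```

With these rounded values the fitted female share rises from roughly 2.2% in the first semester to roughly 4.7% in the eleventh. With only 11 observations, the estimate should be read with caution.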

In [9]:
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

fig = sm.graphics.plot_fit(result_female, "semester", ax=axes[0])
fig = sm.graphics.plot_ccpr(result_female, "semester", ax=axes[1])


Regression - Proportion of contributions by Men


In [10]:
result_male = smf.ols(formula="male_prop ~ semester", data=merged).fit()
result_male.summary()


Out[10]:
OLS Regression Results
Dep. Variable: male_prop R-squared: 0.545
Model: OLS Adj. R-squared: 0.495
Method: Least Squares F-statistic: 10.79
Date: Sat, 11 Oct 2014 Prob (F-statistic): 0.00946
Time: 13:20:33 Log-Likelihood: 38.508
No. Observations: 11 AIC: -73.02
Df Residuals: 9 BIC: -72.22
Df Model: 1
           coef     std err  t        P>|t|  [95.0% Conf. Int.]
Intercept  0.9801   0.005    187.770  0.000  0.968   0.992
semester   -0.0025  0.001    -3.285   0.009  -0.004  -0.001
Omnibus: 0.007 Durbin-Watson: 1.080
Prob(Omnibus): 0.996 Jarque-Bera (JB): 0.172
Skew: 0.036 Prob(JB): 0.918
Kurtosis: 2.392 Cond. No. 14.8
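
Because `male_prop` and `female_prop` sum to 1 in every semester, the two regressions mirror each other. They share the same R-squared, their slopes have opposite signs, and their intercepts sum to 1 (0.0199 + 0.9801). The sketch below checks this property with a hand-written least-squares fit. The proportions are made up for illustration and the `fit_line` helper is hypothetical; neither comes from the notebook.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Made-up female proportions for five semesters (illustration only).
semesters = [1, 2, 3, 4, 5]
female = [0.020, 0.026, 0.025, 0.031, 0.034]
male = [1.0 - p for p in female]

a_f, b_f = fit_line(semesters, female)
a_m, b_m = fit_line(semesters, male)

print("intercepts sum to: %.6f" % (a_f + a_m))
print("slopes sum to:     %.6f" % (b_f + b_m))
```

In practice this means the second regression adds no new information: testing whether the male share is falling is the same test as whether the female share is rising.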

In [11]:
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

fig = sm.graphics.plot_fit(result_male, "semester", ax=axes[0])
fig = sm.graphics.plot_ccpr(result_male, "semester", ax=axes[1])



Initialization

This section contains the data import and some helper functions used in the analysis. Run it first; otherwise the analysis will not work.

The data is imported from the MongoDB database and loaded into a pandas DataFrame for easy manipulation.


In [1]:
from __future__ import division
import pymongo, time, pylab, numpy, pandas, math
from scipy import stats
import matplotlib as mpl
from matplotlib import pyplot
import statsmodels.api as sm

%matplotlib inline
mpl.style.use('ggplot')
# pyplot.rcdefaults()

client = pymongo.MongoClient('localhost', 27017)

community = 'superuser'
stats_db = client[community].statistics

pipeline = [
    {'$match':{'$or': [{'questions_total':{'$gt':0}}, {'answers_total':{'$gt':0}}, 
                                {'comments_total':{'$gt':0}}], 
                        'gender': {'$ne': "Unknown"} }},
    {'$unwind': '$dates'},
    {'$project': {'gender':1, 'dates':1}}
    
]

cursor = stats_db.aggregate(pipeline, cursor={})

df =  pandas.DataFrame(list(cursor))
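
To make the shape of `df` concrete, the sketch below mimics the `$unwind` and `$project` stages in plain Python. The document layout is an assumption inferred from the field names in the pipeline, and `unwind_and_project` is a hypothetical helper, not a pymongo call.

```python
import datetime

# Hypothetical document shaped like the fields the pipeline touches.
doc = {
    "_id": 1,
    "gender": "Female",
    "questions_total": 2,
    "dates": [datetime.datetime(2010, 1, 5), datetime.datetime(2010, 8, 9)],
}

def unwind_and_project(document):
    """Mimic {'$unwind': '$dates'} followed by the $project stage."""
    # $unwind emits one output document per element of the array;
    # $project keeps _id (by default) plus the requested fields.
    return [
        {"_id": document["_id"], "gender": document["gender"], "dates": d}
        for d in document["dates"]
    ]

rows = unwind_and_project(doc)
for row in rows:
    print(row)
```

Under this assumption each row of `df` represents one dated contribution, which is why counting rows per time window gives the number of contributions.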

In [2]:
indexed_df = df.set_index(['dates'])
males = indexed_df[indexed_df['gender']=='Male']
females = indexed_df[indexed_df['gender']=='Female']

In [3]:
from dateutil.relativedelta import relativedelta
import datetime

def aggregate_semesters(df):
    # Counts contributions in consecutive six-month windows.
    # Relies on the globals males, females and indexed_df built in In [2].
    maxi = max(df.dates).date()
    begin = min(df.dates).date()
    end = begin + relativedelta(months=+6)

    rows = []

    while begin <= maxi:
        # String slices on the date index include both endpoints.
        rows.append({"semester": begin,
                     "contrib_males": len(males[str(begin):str(end)].index),
                     "contrib_females": len(females[str(begin):str(end)].index),
                     "contrib": len(indexed_df[str(begin):str(end)].index)
                     })

        # The next window starts the day after the previous one ended.
        begin = end + relativedelta(days=+1)
        end = begin + relativedelta(months=+6)

    # Build the frame once; DataFrame.append was removed in pandas 2.0.
    return pandas.DataFrame(rows)

def male_proportion(row):
    return row['contrib_males'] / row['contrib']

def female_proportion(row):
    return row['contrib_females'] / row['contrib']
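
The windowing used by `aggregate_semesters` can be looked at in isolation. The following sketch reproduces the same stepping rule (six months forward, then one extra day) using only the standard library. The `add_months` and `semester_windows` helpers and the example dates are hypothetical, not part of the notebook.

```python
import calendar
import datetime

def add_months(day, months):
    """Shift a date by whole months, clamping to the month's last day."""
    index = day.year * 12 + (day.month - 1) + months
    year, month = divmod(index, 12)
    last_day = calendar.monthrange(year, month + 1)[1]
    return datetime.date(year, month + 1, min(day.day, last_day))

def semester_windows(first, last):
    """Yield inclusive (begin, end) windows covering first..last."""
    begin = first
    while begin <= last:
        end = add_months(begin, 6)
        yield begin, end
        # The next window starts the day after the previous one ended.
        begin = end + datetime.timedelta(days=1)

windows = list(semester_windows(datetime.date(2009, 7, 15),
                                datetime.date(2010, 12, 31)))
for begin, end in windows:
    print("%s .. %s" % (begin, end))
```

Two things follow from this rule. Each window spans six months plus one day, so the start dates drift by a day per semester. The last window may also extend past the final observation, so it can cover less data than the others.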

In [4]:
#aggregates data by semester
merged = aggregate_semesters(df)

#indexes dataframe by date
merged = merged.set_index(['semester'])

#calculating proportion of contributions by gender
merged['male_prop'] = merged.apply(male_proportion, axis=1)
merged['female_prop'] = merged.apply(female_proportion, axis=1)
