Community: Mathematics

Initialization and data import are at the end of this notebook. They were placed at the bottom so the analysis is easier to read, but they must be run first for the analysis to work as expected. See the section at the end of the notebook.

Data summary


In [5]:
merged.describe()


Out[5]:
             contrib  contrib_females  contrib_males  male_prop  female_prop
count       8.000000         8.000000       8.000000   8.000000     8.000000
mean    85232.500000      3532.250000   81700.250000   0.964158     0.035842
std     53279.776514      2815.219238   50549.113848   0.012124     0.012124
min      7259.000000       187.000000    7072.000000   0.944552     0.020251
25%     47974.000000      1143.250000   46830.750000   0.956522     0.025455
50%     86157.000000      3138.500000   83018.500000   0.963940     0.036060
75%    122456.750000      6002.250000  116104.000000   0.974545     0.043478
max    162888.000000      7488.000000  155400.000000   0.979749     0.055448

Fourth Question: Is the proportion of contributions by each gender decreasing?

Hypothesis 1:

H0: contributionsBefore(Males) > contributionsRecent(Males) & contributionsBefore(Females) > contributionsRecent(Females);

H1: contributionsBefore(Males) <= contributionsRecent(Males) & contributionsBefore(Females) <= contributionsRecent(Females).
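
The notebook does not say where "before" ends and "recent" begins. As a minimal sketch, assuming the split is simply the first half of the semesters against the second half, the hypothesis could be checked as below. The `decreased` helper and the counts are hypothetical and are not the notebook's data.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def decreased(series):
    """True if the mean of the earlier half exceeds the mean of the recent half."""
    half = len(series) // 2
    return mean(series[:half]) > mean(series[half:])


# Hypothetical per-semester contribution counts, NOT the notebook's data.
contrib_males = [7000, 30000, 60000, 80000, 95000, 110000, 130000, 155000]
contrib_females = [200, 800, 1500, 2800, 3500, 5200, 6500, 7500]

# H0 requires a decrease for both genders at once.
h0_holds = decreased(contrib_males) and decreased(contrib_females)
print(h0_holds)
```

With rising counts like these, H0 would not hold. A different split point, or a formal test, may lead to a different reading.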

Absolute amount of contributions


In [6]:
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))
axes[0].plot(merged['contrib_females'])
axes[0].set_title("Absolute number of contributions by females")
axes[1].plot(merged['contrib_males'])
axes[1].set_title("Absolute number of contributions by males")


Out[6]:
<matplotlib.text.Text at 0x112730890>

Proportion of contributions by gender


In [7]:
fig, ax = pyplot.subplots(figsize=(15, 7))
pyplot.plot(merged[['male_prop', 'female_prop']])
ax.set_title("Proportion of Contributions")
ax.legend(["Male","Female"],bbox_to_anchor=(1.02, 1), loc=2, borderaxespad=0.)


Out[7]:
<matplotlib.legend.Legend at 0x11414cf10>

Regression - Proportion of contributions by Women


In [8]:
merged['semester'] = [1,2,3,4,5,6,7,8]

import statsmodels.formula.api as smf
result_female = smf.ols(formula="female_prop ~ semester", data=merged).fit()
result_female.summary()


/Library/Python/2.7/site-packages/scipy/stats/stats.py:1205: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=8
  int(n))
Out[8]:
OLS Regression Results
Dep. Variable:     female_prop         R-squared:             0.914
Model:             OLS                 Adj. R-squared:        0.900
Method:            Least Squares       F-statistic:           63.90
Date:              Sat, 11 Oct 2014    Prob (F-statistic):    0.000204
Time:              13:16:13            Log-Likelihood:        34.304
No. Observations:  8                   AIC:                   -64.61
Df Residuals:      6                   BIC:                   -64.45
Df Model:          1

             coef   std err        t   P>|t|   [95.0% Conf. Int.]
Intercept  0.0145     0.003    4.865   0.003      0.007     0.022
semester   0.0047     0.001    7.994   0.000      0.003     0.006

Omnibus:           1.042               Durbin-Watson:         1.673
Prob(Omnibus):     0.594               Jarque-Bera (JB):      0.551
Skew:              0.578               Prob(JB):              0.759
Kurtosis:          2.437               Cond. No.              11.5
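
As a quick sanity check on the summary above, the headline statistics of a one-regressor OLS fit are tied together by standard identities: F equals the squared t-statistic of the slope, and R-squared follows from F and the degrees of freedom. The sketch below only re-derives numbers from the values printed in Out[8]; the variable names are illustrative.

```python
# Values copied from the OLS summary in Out[8].
t_slope = 7.994   # t-statistic of the `semester` coefficient
f_stat = 63.90    # F-statistic
n_obs = 8         # No. Observations
df_resid = 6      # Df Residuals
df_model = 1      # Df Model

# With a single regressor, F is the square of the slope's t-statistic.
f_from_t = t_slope ** 2

# R-squared can be recovered from F and the degrees of freedom.
r_squared = (f_stat * df_model) / (f_stat * df_model + df_resid)

# Adjusted R-squared penalises for the number of fitted parameters.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

print(round(f_from_t, 2), round(r_squared, 3), round(adj_r_squared, 3))
```

The three values agree with the reported F-statistic, R-squared and Adj. R-squared up to rounding, so the summary is internally consistent. Note that with only 8 observations the fit rests on very few points, as the kurtosistest warning also suggests.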

In [9]:
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

fig = sm.graphics.plot_fit(result_female, "semester", ax=axes[0])
fig = sm.graphics.plot_ccpr(result_female, "semester", ax=axes[1])


Regression - Proportion of contributions by Men


In [13]:
result_male = smf.ols(formula="male_prop ~ semester", data=merged).fit()
result_male.summary()


Out[13]:
OLS Regression Results
Dep. Variable:     male_prop           R-squared:             0.914
Model:             OLS                 Adj. R-squared:        0.900
Method:            Least Squares       F-statistic:           63.90
Date:              Sat, 11 Oct 2014    Prob (F-statistic):    0.000204
Time:              13:16:47            Log-Likelihood:        34.304
No. Observations:  8                   AIC:                   -64.61
Df Residuals:      6                   BIC:                   -64.45
Df Model:          1

              coef   std err         t   P>|t|   [95.0% Conf. Int.]
Intercept   0.9855     0.003   329.630   0.000      0.978     0.993
semester   -0.0047     0.001    -7.994   0.000     -0.006    -0.003

Omnibus:           1.042               Durbin-Watson:         1.673
Prob(Omnibus):     0.594               Jarque-Bera (JB):      0.551
Skew:              -0.578              Prob(JB):              0.759
Kurtosis:          2.437               Cond. No.              11.5

In [14]:
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

fig = sm.graphics.plot_fit(result_male, "semester", ax=axes[0])
fig = sm.graphics.plot_ccpr(result_male, "semester", ax=axes[1])
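
The two regressions are mirror images of each other: the slopes are 0.0047 and -0.0047, the intercepts 0.0145 and 0.9855 sum to one, and R-squared and F are identical. This is expected because `male_prop` equals `1 - female_prop` once "Unknown" genders are filtered out. The sketch below illustrates the identity with a hand-written least-squares fit; the `ols_line` helper and the proportions are hypothetical and are not the notebook's data.

```python
def ols_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


semester = [1, 2, 3, 4, 5, 6, 7, 8]
# Hypothetical female proportions, NOT the notebook's data.
female_prop = [0.020, 0.023, 0.029, 0.034, 0.038, 0.041, 0.047, 0.055]
# The male proportion is the complement of the female proportion.
male_prop = [1 - p for p in female_prop]

b0_f, b1_f = ols_line(semester, female_prop)
b0_m, b1_m = ols_line(semester, male_prop)

print(b0_f, b1_f)
print(b0_m, b1_m)
```

The male slope is the negated female slope and the male intercept is one minus the female intercept, so the second regression carries no information beyond the first.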



Initialization

This section contains the data import and some helper functions used in the analysis. Please run it first; otherwise the analysis will not work.

The data is imported from the MongoDB database and loaded into a pandas DataFrame for easy manipulation.


In [1]:
from __future__ import division
import pymongo, time, pylab, numpy, pandas, math
from scipy import stats
import matplotlib as mpl
from matplotlib import pyplot
import statsmodels.api as sm

%matplotlib inline
mpl.style.use('ggplot')
# pyplot.rcdefaults()

client = pymongo.MongoClient('localhost', 27017)

community = 'math'
stats_db = client[community].statistics

pipeline = [
    {'$match':{'$or': [{'questions_total':{'$gt':0}}, {'answers_total':{'$gt':0}}, 
                                {'comments_total':{'$gt':0}}], 
                        'gender': {'$ne': "Unknown"} }},
    {'$unwind': '$dates'},
    {'$project': {'gender':1, 'dates':1}}
    
]

cursor = stats_db.aggregate(pipeline, cursor={})

df =  pandas.DataFrame(list(cursor))
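
To make the aggregation pipeline above easier to follow, here is a pure-Python sketch of what its three stages do to a handful of documents. The `run_pipeline` helper and the sample documents are hypothetical; only the field names come from the pipeline, and the dates are plain strings here, whereas the real collection appears to store datetimes.

```python
def run_pipeline(documents):
    """Pure-Python sketch of the $match / $unwind / $project stages."""
    rows = []
    for doc in documents:
        # $match: at least one activity counter above zero, and gender not "Unknown".
        active = any(doc.get(key, 0) > 0
                     for key in ('questions_total', 'answers_total', 'comments_total'))
        if not active or doc.get('gender') == 'Unknown':
            continue
        # $unwind: one output row per entry in the `dates` array.
        for date in doc.get('dates', []):
            # $project: keep gender and dates (_id is kept by default).
            rows.append({'_id': doc['_id'], 'gender': doc.get('gender'), 'dates': date})
    return rows


# Hypothetical documents, NOT taken from the real database.
documents = [
    {'_id': 1, 'gender': 'Female', 'questions_total': 2, 'answers_total': 0,
     'comments_total': 0, 'dates': ['2011-01-05', '2011-03-09']},
    {'_id': 2, 'gender': 'Unknown', 'questions_total': 1, 'answers_total': 4,
     'comments_total': 0, 'dates': ['2011-02-01']},
    {'_id': 3, 'gender': 'Male', 'questions_total': 0, 'answers_total': 0,
     'comments_total': 0, 'dates': []},
    {'_id': 4, 'gender': 'Male', 'questions_total': 0, 'answers_total': 3,
     'comments_total': 1, 'dates': ['2011-04-11']},
]

rows = run_pipeline(documents)
print(len(rows))
```

Each surviving row corresponds to a single contribution date, which is why the later cells can count contributions by counting rows.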

In [2]:
indexed_df = df.set_index(['dates'])
males = indexed_df[indexed_df['gender']=='Male']
females = indexed_df[indexed_df['gender']=='Female']

In [3]:
from dateutil.relativedelta import relativedelta
import datetime

def aggregate_semesters(df):
    maxi = max(df.dates).date()
    begin = min(df.dates).date()
    end = begin + relativedelta(months=+6)
    
    return_df = pandas.DataFrame(data={})

    while(begin <= maxi):
        d = {"semester": begin, 
             "contrib_males": len(males[str(begin):str(end)].index),
             "contrib_females": len(females[str(begin):str(end)].index),
             "contrib": len(indexed_df[str(begin):str(end)].index)
             }
        return_df = return_df.append(d, ignore_index=True)
        
        begin = end + relativedelta(days=+1)
        end = begin + relativedelta(months=+6)
    
    return return_df

def male_proportion(row):
    return row['contrib_males'] / row['contrib']

def female_proportion(row):
    return row['contrib_females'] / row['contrib']
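
The loop in `aggregate_semesters` steps through windows that run from `begin` to `begin` plus six months, with the next window starting one day after the previous `end`. The sketch below reproduces only that stepping logic with the standard library. The `add_months` helper is a hypothetical stand-in for `relativedelta(months=+6)` (it clamps the day to the length of the target month), and the dates are made up.

```python
import calendar
import datetime


def add_months(d, months):
    """Return d shifted by `months`, clamping the day to the target month's length."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return datetime.date(year, month, day)


def semester_windows(first, last):
    """Yield (begin, end) pairs the way the loop above steps through them."""
    begin = first
    while begin <= last:
        end = add_months(begin, 6)
        yield begin, end
        # The next window starts the day after the previous one ends.
        begin = end + datetime.timedelta(days=1)


# Hypothetical first and last contribution dates.
windows = list(semester_windows(datetime.date(2010, 7, 20),
                                datetime.date(2011, 12, 31)))
for begin, end in windows:
    print(begin, end)
```

If the slices are inclusive at both ends, as pandas label slicing normally is, each window spans six months plus one day, so the window boundaries drift forward by a day per semester. The last window may also extend past the last observed date and therefore cover less data than the others.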

In [4]:
#aggregates data by semester
merged = aggregate_semesters(df)

#indexes dataframe by date
merged = merged.set_index(['semester'])

#calculating proportion of contributions by gender
merged['male_prop'] = merged.apply(male_proportion, axis=1)
merged['female_prop'] = merged.apply(female_proportion, axis=1)
