Community: Mathematics

Initialization and data import are at the end of this notebook. They were placed at the bottom to keep the analysis easy to read, but they must be run first for the analysis to work as expected. See the Initialization section at the end.

Data summary


In [5]:
merged.describe()


Out[5]:
           contrib  contrib_females  contrib_males  male_prop  female_prop
count     7.000000         7.000000       7.000000   7.000000     7.000000
mean   1160.714286        90.428571    1070.285714   0.924035     0.075965
std     211.107172        35.771763     180.274526   0.020066     0.020066
min     952.000000        48.000000     902.000000   0.898922     0.047904
25%    1011.000000        64.000000     938.500000   0.910395     0.062641
50%    1072.000000        97.000000     994.000000   0.921721     0.078279
75%    1297.500000       105.000000    1192.500000   0.937359     0.089605
max    1484.000000       150.000000    1334.000000   0.952096     0.101078

Fifth Question: Is the proportion of registrations by each gender decreasing?

Absolute amount of contributions


In [6]:
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))
axes[0].plot(merged['contrib_females'])
axes[0].set_title("Absolute number of contributions by females")
axes[1].plot(merged['contrib_males'])
axes[1].set_title("Absolute number of contributions by males")


Out[6]:
<matplotlib.text.Text at 0x10c829290>

Proportion of contributions by gender


In [7]:
fig, ax = pyplot.subplots(figsize=(15, 7))
pyplot.plot(merged[['male_prop', 'female_prop']])
ax.set_title("Proportion of Contributions")
ax.legend(["Male","Female"],bbox_to_anchor=(1.02, 1), loc=2, borderaxespad=0.)


Out[7]:
<matplotlib.legend.Legend at 0x10cb2db50>

Regression - Proportion of Women who joined per semester


In [8]:
# Number the semesters 1..N so they can be used as a numeric regressor
merged['semester'] = list(range(1, len(merged) + 1))


result_female = smf.ols(formula="female_prop ~ semester", data=merged).fit()
result_female.summary()


Out[8]:
OLS Regression Results

Dep. Variable:     female_prop        R-squared:           0.882
Model:             OLS                Adj. R-squared:      0.858
Method:            Least Squares      F-statistic:         37.27
Date:              Mon, 27 Oct 2014   Prob (F-statistic):  0.00171
Time:              14:14:55           Log-Likelihood:      25.439
No. Observations:  7                  AIC:                 -46.88
Df Residuals:      5                  BIC:                 -46.99
Df Model:          1

              coef  std err        t  P>|t|  [95.0% Conf. Int.]
Intercept   0.0411    0.006    6.429  0.001     0.025     0.058
semester    0.0087    0.001    6.105  0.002     0.005     0.012

Omnibus:           nan                Durbin-Watson:       2.739
Prob(Omnibus):     nan                Jarque-Bera (JB):    0.961
Skew:              0.179              Prob(JB):            0.618
Kurtosis:          1.220              Cond. No.            10.4
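
To make the fitted coefficients concrete, the sketch below plugs the rounded values from the summary above (intercept 0.0411, slope 0.0087) into the line female_prop = intercept + slope * semester. The function name is chosen here for illustration only, and because the coefficients are rounded the results are approximate.

```python
# Rounded coefficients copied from the OLS summary above.
intercept = 0.0411
slope = 0.0087

def predicted_female_prop(semester):
    """Fitted proportion of women for a semester number (1 = first semester)."""
    return intercept + slope * semester

# Print the fitted value for each of the seven semesters.
for semester in range(1, 8):
    print(semester, round(predicted_female_prop(semester), 4))
```

Under this fit the proportion of women rises by roughly 0.87 percentage points per semester, from about 5% in the first semester to about 10% in the seventh.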

In [9]:
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

fig = sm.graphics.plot_fit(result_female, "semester", ax=axes[0])
fig = sm.graphics.plot_ccpr(result_female, "semester", ax=axes[1])


Regression - Proportion of Men who joined per semester


In [10]:
result_male = smf.ols(formula="male_prop ~ semester", data=merged).fit()
result_male.summary()


Out[10]:
OLS Regression Results

Dep. Variable:     male_prop          R-squared:           0.882
Model:             OLS                Adj. R-squared:      0.858
Method:            Least Squares      F-statistic:         37.27
Date:              Mon, 27 Oct 2014   Prob (F-statistic):  0.00171
Time:              14:14:56           Log-Likelihood:      25.439
No. Observations:  7                  AIC:                 -46.88
Df Residuals:      5                  BIC:                 -46.99
Df Model:          1

              coef  std err        t  P>|t|  [95.0% Conf. Int.]
Intercept   0.9589    0.006  150.085  0.000     0.942     0.975
semester   -0.0087    0.001   -6.105  0.002    -0.012    -0.005

Omnibus:           nan                Durbin-Watson:       2.739
Prob(Omnibus):     nan                Jarque-Bera (JB):    0.961
Skew:              -0.179             Prob(JB):            0.618
Kurtosis:          1.220              Cond. No.            10.4
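
The male and female regressions mirror each other because, as the describe() output suggests, male_prop and female_prop sum to 1 in every semester (the query in the Initialization section excludes users of unknown gender). A quick check on the rounded coefficients reported in the two summaries, as a sketch with illustrative variable names:

```python
# Rounded coefficients copied from the two OLS summaries.
female = {"intercept": 0.0411, "slope": 0.0087}
male = {"intercept": 0.9589, "slope": -0.0087}

# If the two proportions always sum to 1, the fitted lines must too:
# the intercepts add up to 1 and the slopes cancel out.
intercept_sum = female["intercept"] + male["intercept"]
slope_sum = female["slope"] + male["slope"]

print(round(intercept_sum, 4))
print(round(slope_sum, 4))
```

This is also why R-squared, the F-statistic and the absolute t-value of the slope are identical in both summaries: the second regression carries no information beyond the first.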

In [11]:
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

fig = sm.graphics.plot_fit(result_male, "semester", ax=axes[0])
fig = sm.graphics.plot_ccpr(result_male, "semester", ax=axes[1])



Initialization

Here you can find the data import and some helper functions used in the analysis. Please run this section first; otherwise the analysis will not work.

The data is imported from the MongoDB database and loaded into a pandas DataFrame for easy manipulation.


In [1]:
from __future__ import division
import pymongo, time, pylab, numpy, pandas, math
from scipy import stats
import matplotlib as mpl
from matplotlib import pyplot
import statsmodels.api as sm
import statsmodels.formula.api as smf

%matplotlib inline
mpl.style.use('ggplot')

client = pymongo.MongoClient('localhost', 27017)

community = 'math'
stats_db = client[community].statistics

# Keep only users with at least one question, answer or comment and a known gender
cursor = stats_db.find({'$or': [{'questions_total':{'$gt':0}}, {'answers_total':{'$gt':0}},
                                {'comments_total':{'$gt':0}}], 
                        'gender': {'$ne': "Unknown"}  }, 
                       {u'_id': False, u'contributions_total': True,
                        u'joined': True, u'gender':True, 'gender_cat': True})


df =  pandas.DataFrame(list(cursor))

In [2]:
indexed_df = df.set_index(['joined'])
males = indexed_df[indexed_df['gender']=='Male']
females = indexed_df[indexed_df['gender']=='Female']
# Always empty here: the query above already excludes gender == 'Unknown'
unknown = indexed_df[indexed_df['gender']=='Unknown']

In [3]:
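from dateutil.relativedelta import relativedelta
import datetime

def aggregate_semesters(df):
    # Counts the rows that fall in each six-month window, in total and by gender.
    # Relies on the globals males, females and indexed_df defined above.
    maxi = max(df.joined).date()
    begin = min(df.joined).date()
    end = begin + relativedelta(months=+6)

    rows = []

    while begin <= maxi:
        rows.append({"semester": begin,
                     "contrib": len(indexed_df[str(begin):str(end)].index),
                     "contrib_females": len(females[str(begin):str(end)].index),
                     "contrib_males": len(males[str(begin):str(end)].index)})

        # The next window starts the day after the previous one ends
        begin = end + relativedelta(days=+1)
        end = begin + relativedelta(months=+6)

    # Build the frame once; DataFrame.append was removed in pandas 2.0
    return pandas.DataFrame(rows, columns=["contrib", "contrib_females",
                                           "contrib_males", "semester"])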

def male_proportion(row):
    return row['contrib_males'] / row['contrib']

def female_proportion(row):
    return row['contrib_females'] / row['contrib']
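
The loop in aggregate_semesters walks the joined dates in consecutive six-month windows, each starting the day after the previous one ends. The sketch below reproduces only the window boundaries using the standard library. Here add_months is a hypothetical helper standing in for relativedelta(months=+6), semester_windows is an illustrative name, and the start and end dates are made up.

```python
import calendar
import datetime

def add_months(day, months):
    """Hypothetical stand-in for relativedelta(months=+n); clamps to month end."""
    month_index = day.month - 1 + months
    year = day.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return datetime.date(year, month, min(day.day, last_day))

def semester_windows(first, last):
    """Return (begin, end) pairs covering first..last in six-month steps."""
    windows = []
    begin = first
    while begin <= last:
        end = add_months(begin, 6)
        windows.append((begin, end))
        # Same rule as in aggregate_semesters: next window starts one day later
        begin = end + datetime.timedelta(days=1)
    return windows

# Made-up date range, for illustration only.
windows = semester_windows(datetime.date(2010, 7, 20), datetime.date(2011, 12, 31))
for begin, end in windows:
    print(begin, end)
```

Note that each window begins one day later in the month than the previous one, and that the last window may extend past the latest joined date, so the final semester can cover less data than the others.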

In [4]:
#aggregates data by semester
merged = aggregate_semesters(df)

#indexes dataframe by date
merged = merged.set_index(['semester'])

#calculating proportion of contributions by gender
merged['male_prop'] = merged.apply(male_proportion, axis=1)
merged['female_prop'] = merged.apply(female_proportion, axis=1)
