Community: Mathematics

Initialization and data import are at the end of this notebook. They were placed at the bottom so the analysis is easier to read, but they must be run first for the analysis to work as expected.

Data summary



In [5]:

    
merged.describe()









    Out[5]:






  
    
      
             contrib  contrib_females  contrib_males  male_prop  female_prop
count       8.000000         8.000000       8.000000   8.000000     8.000000
mean    85232.500000      3532.250000   81700.250000   0.964158     0.035842
std     53279.776514      2815.219238   50549.113848   0.012124     0.012124
min      7259.000000       187.000000    7072.000000   0.944552     0.020251
25%     47974.000000      1143.250000   46830.750000   0.956522     0.025455
50%     86157.000000      3138.500000   83018.500000   0.963940     0.036060
75%    122456.750000      6002.250000  116104.000000   0.974545     0.043478
max    162888.000000      7488.000000  155400.000000   0.979749     0.055448

Fourth Question: Is the proportion of contributions by each gender decreasing?

Hypothesis 1:

H0: contributionsBefore(Males) > contributionsRecent(Males) & contributionsBefore(Females) > contributionsRecent(Females);

H1: contributionsBefore(Males) <= contributionsRecent(Males) & contributionsBefore(Females) <= contributionsRecent(Females).

Absolute amount of contributions



In [6]:

    
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))
axes[0].plot(merged['contrib_females'])
axes[0].set_title("Absolute number of contributions by females")
axes[1].plot(merged['contrib_males'])
axes[1].set_title("Absolute number of contributions by males")









    Out[6]:





<matplotlib.text.Text at 0x112730890>

Proportion of contributions by gender



In [7]:

    
fig, ax = pyplot.subplots(figsize=(15, 7))
pyplot.plot(merged[['male_prop', 'female_prop']])
ax.set_title("Proportion of Contributions")
ax.legend(["Male","Female"],bbox_to_anchor=(1.02, 1), loc=2, borderaxespad=0.)









    Out[7]:





<matplotlib.legend.Legend at 0x11414cf10>

Regression - Proportion of contributions by Women



In [8]:

    
merged['semester'] = [1,2,3,4,5,6,7,8]

import statsmodels.formula.api as smf
result_female = smf.ols(formula="female_prop ~ semester", data=merged).fit()
result_female.summary()









    



/Library/Python/2.7/site-packages/scipy/stats/stats.py:1205: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=8
  int(n))






    Out[8]:





OLS Regression Results

  Dep. Variable:        female_prop      R-squared:             0.914
  Model:                    OLS          Adj. R-squared:        0.900
  Method:              Least Squares     F-statistic:           63.90
  Date:              Sat, 11 Oct 2014    Prob (F-statistic):  0.000204
  Time:                  13:16:13        Log-Likelihood:       34.304
  No. Observations:            8         AIC:                  -64.61
  Df Residuals:                6         BIC:                  -64.45
  Df Model:                    1

               coef      std err       t       P>|t|  [95.0% Conf. Int.]
  Intercept      0.0145      0.003      4.865   0.003      0.007     0.022
  semester       0.0047      0.001      7.994   0.000      0.003     0.006

  Omnibus:         1.042    Durbin-Watson:         1.673
  Prob(Omnibus):   0.594    Jarque-Bera (JB):      0.551
  Skew:            0.578    Prob(JB):              0.759
  Kurtosis:        2.437    Cond. No.               11.5
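
To read the summary above in concrete terms, the fitted line can be evaluated for each semester. The sketch below is only an illustration: it hard-codes the rounded coefficients reported in the table (intercept 0.0145, slope 0.0047), and the helper name `predicted_female_prop` is made up for this example rather than taken from the notebook.

```python
# Coefficients copied from the OLS summary above (rounded as reported there).
INTERCEPT = 0.0145
SLOPE = 0.0047


def predicted_female_prop(semester):
    """Fitted proportion of contributions by women for a semester index (1..8)."""
    return INTERCEPT + SLOPE * semester


# Print the fitted value for each of the eight semesters used in the regression.
for semester in range(1, 9):
    print(semester, round(predicted_female_prop(semester), 4))
```

With these rounded coefficients the fitted proportion grows by roughly 0.47 percentage points per semester, which is close to the observed range of `female_prop` in the data summary (0.020251 to 0.055448).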



In [9]:

    
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

fig = sm.graphics.plot_fit(result_female, "semester", ax=axes[0])
fig = sm.graphics.plot_ccpr(result_female, "semester", ax=axes[1])

Regression - Proportion of contributions by Men



In [13]:

    
result_male = smf.ols(formula="male_prop ~ semester", data=merged).fit()
result_male.summary()









    Out[13]:





OLS Regression Results

  Dep. Variable:         male_prop       R-squared:             0.914
  Model:                    OLS          Adj. R-squared:        0.900
  Method:              Least Squares     F-statistic:           63.90
  Date:              Sat, 11 Oct 2014    Prob (F-statistic):  0.000204
  Time:                  13:16:47        Log-Likelihood:       34.304
  No. Observations:            8         AIC:                  -64.61
  Df Residuals:                6         BIC:                  -64.45
  Df Model:                    1

               coef      std err       t       P>|t|  [95.0% Conf. Int.]
  Intercept      0.9855      0.003    329.630   0.000      0.978     0.993
  semester      -0.0047      0.001     -7.994   0.000     -0.006    -0.003

  Omnibus:         1.042    Durbin-Watson:         1.673
  Prob(Omnibus):   0.594    Jarque-Bera (JB):      0.551
  Skew:           -0.578    Prob(JB):              0.759
  Kurtosis:        2.437    Cond. No.               11.5
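
The male regression mirrors the female one (same R-squared, slope of opposite sign, intercepts summing to 1) because `male_prop` and `female_prop` add up to 1 in every semester. The sketch below illustrates that property with a hand-written least-squares fit; the helper `ols_line` and the proportions are hypothetical, made up for this example and not taken from the notebook's data.

```python
def ols_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical proportions for illustration only (not the notebook's data).
semesters = [1, 2, 3, 4, 5, 6, 7, 8]
female = [0.02, 0.03, 0.025, 0.035, 0.04, 0.045, 0.04, 0.055]
male = [1 - f for f in female]  # complementary by construction

a_f, b_f = ols_line(semesters, female)
a_m, b_m = ols_line(semesters, male)

# The intercepts sum to 1 and the slopes cancel, up to floating-point error.
print(round(a_f + a_m, 10), round(b_f + b_m, 10))
```

Since the two series are complementary, the second regression adds no independent evidence: a rising share for women and a falling share for men are the same finding.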



In [14]:

    
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

fig = sm.graphics.plot_fit(result_male, "semester", ax=axes[0])
fig = sm.graphics.plot_ccpr(result_male, "semester", ax=axes[1])

Initialization

Here you can find the data import and some helper functions used for analysing the data. Please run this first; otherwise the analysis will not work.

Importing the data from the MongoDB database and loading it into a pandas DataFrame for easy manipulation.



In [1]:

    
from __future__ import division
import pymongo, time, pylab, numpy, pandas, math
from scipy import stats
import matplotlib as mpl
from matplotlib import pyplot
import statsmodels.api as sm

%matplotlib inline
mpl.style.use('ggplot')
# pyplot.rcdefaults()

client = pymongo.MongoClient('localhost', 27017)

community = 'math'
stats_db = client[community].statistics

pipeline = [
    {'$match':{'$or': [{'questions_total':{'$gt':0}}, {'answers_total':{'$gt':0}}, 
                                {'comments_total':{'$gt':0}}], 
                        'gender': {'$ne': "Unknown"} }},
    {'$unwind': '$dates'},
    {'$project': {'gender':1, 'dates':1}}
    
]

cursor = stats_db.aggregate(pipeline, cursor={})

df =  pandas.DataFrame(list(cursor))



In [2]:

    
indexed_df = df.set_index(['dates'])
males = indexed_df[indexed_df['gender']=='Male']
females = indexed_df[indexed_df['gender']=='Female']



In [3]:

    
from dateutil.relativedelta import relativedelta
import datetime

def aggregate_semesters(df):
    # Counts contributions per six-month window.
    # Relies on the globals `males`, `females` and `indexed_df` from the previous cell.
    maxi = max(df.dates).date()
    begin = min(df.dates).date()
    end = begin + relativedelta(months=+6)

    rows = []

    while begin <= maxi:
        # Label-based slicing on the date index includes both endpoints.
        rows.append({"semester": begin,
                     "contrib_males": len(males[str(begin):str(end)].index),
                     "contrib_females": len(females[str(begin):str(end)].index),
                     "contrib": len(indexed_df[str(begin):str(end)].index)})

        # The next window starts the day after the previous one ends.
        begin = end + relativedelta(days=+1)
        end = begin + relativedelta(months=+6)

    # Build the DataFrame once instead of appending row by row.
    return pandas.DataFrame(rows)

def male_proportion(row):
    return row['contrib_males'] / row['contrib']

def female_proportion(row):
    return row['contrib_females'] / row['contrib']



In [4]:

    
#aggregates data by semester
merged = aggregate_semesters(df)

#indexes dataframe by date
merged = merged.set_index(['semester'])

#calculating proportion of contributions by gender
merged['male_prop'] = merged.apply(male_proportion, axis=1)
merged['female_prop'] = merged.apply(female_proportion, axis=1)


