Community: Mathematics

Initialization and data import are at the end of this notebook. They were placed at the bottom to keep the analysis easy to read, but they must be run first for the analysis to work as expected.

Data summary



In [5]:

    
merged.describe()









    Out[5]:






          contrib  contrib_females  contrib_males  male_prop  female_prop
count    7.000000         7.000000       7.000000   7.000000     7.000000
mean  1160.714286        90.428571    1070.285714   0.924035     0.075965
std    211.107172        35.771763     180.274526   0.020066     0.020066
min    952.000000        48.000000     902.000000   0.898922     0.047904
25%   1011.000000        64.000000     938.500000   0.910395     0.062641
50%   1072.000000        97.000000     994.000000   0.921721     0.078279
75%   1297.500000       105.000000    1192.500000   0.937359     0.089605
max   1484.000000       150.000000    1334.000000   0.952096     0.101078

Fifth Question: Is the proportion of registrations by each gender decreasing?

Absolute amount of contributions



In [6]:

    
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))
axes[0].plot(merged['contrib_females'])
axes[0].set_title("Absolute number of contributions by females")
axes[1].plot(merged['contrib_males'])
axes[1].set_title("Absolute number of contributions by males")









    Out[6]:





<matplotlib.text.Text at 0x10c829290>

Proportion of contributions by gender



In [7]:

    
fig, ax = pyplot.subplots(figsize=(15, 7))
pyplot.plot(merged[['male_prop', 'female_prop']])
ax.set_title("Proportion of Contributions")
ax.legend(["Male","Female"],bbox_to_anchor=(1.02, 1), loc=2, borderaxespad=0.)









    Out[7]:





<matplotlib.legend.Legend at 0x10cb2db50>

Regression - Proportion of Women who joined per semester



In [8]:

    
# numeric semester index (1..7 here), used as the regressor
merged['semester'] = list(range(1, len(merged) + 1))


result_female = smf.ols(formula="female_prop ~ semester", data=merged).fit()
result_female.summary()









    Out[8]:





OLS Regression Results

Dep. Variable:       female_prop         R-squared:             0.882
Model:               OLS                 Adj. R-squared:        0.858
Method:              Least Squares       F-statistic:           37.27
Date:                Mon, 27 Oct 2014    Prob (F-statistic):    0.00171
Time:                14:14:55            Log-Likelihood:        25.439
No. Observations:    7                   AIC:                   -46.88
Df Residuals:        5                   BIC:                   -46.99
Df Model:            1

               coef    std err          t    P>|t|    [95.0% Conf. Int.]
Intercept    0.0411      0.006      6.429    0.001      0.025     0.058
semester     0.0087      0.001      6.105    0.002      0.005     0.012

Omnibus:           nan      Durbin-Watson:       2.739
Prob(Omnibus):     nan      Jarque-Bera (JB):    0.961
Skew:              0.179    Prob(JB):            0.618
Kurtosis:          1.220    Cond. No.            10.4
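
As a reading aid, the fitted line can be evaluated by hand from the coefficients reported in the summary. The sketch below is not part of the original analysis: the helper name `predicted_female_prop` is an assumption made for illustration, and because the coefficients are rounded to four decimals the results are approximate.

```python
# Coefficients copied from the female_prop regression summary (rounded values)
intercept = 0.0411
slope = 0.0087


def predicted_female_prop(semester):
    """Fitted proportion of women for a semester number (1 to 7)."""
    return intercept + slope * semester


# Evaluate the fitted line over the seven observed semesters
for s in range(1, 8):
    print(s, round(predicted_female_prop(s), 4))
```

Under this fit the proportion of women grows by roughly 0.87 percentage points per semester, from about 0.050 in semester 1 to about 0.102 in semester 7.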



In [9]:

    
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

fig = sm.graphics.plot_fit(result_female, "semester", ax=axes[0])
fig = sm.graphics.plot_ccpr(result_female, "semester", ax=axes[1])

Regression - Proportion of Men who joined per semester



In [10]:

    
result_male = smf.ols(formula="male_prop ~ semester", data=merged).fit()
result_male.summary()









    Out[10]:





OLS Regression Results

Dep. Variable:       male_prop           R-squared:             0.882
Model:               OLS                 Adj. R-squared:        0.858
Method:              Least Squares       F-statistic:           37.27
Date:                Mon, 27 Oct 2014    Prob (F-statistic):    0.00171
Time:                14:14:56            Log-Likelihood:        25.439
No. Observations:    7                   AIC:                   -46.88
Df Residuals:        5                   BIC:                   -46.99
Df Model:            1

               coef    std err          t    P>|t|    [95.0% Conf. Int.]
Intercept    0.9589      0.006    150.085    0.000      0.942     0.975
semester    -0.0087      0.001     -6.105    0.002     -0.012    -0.005

Omnibus:           nan      Durbin-Watson:       2.739
Prob(Omnibus):     nan      Jarque-Bera (JB):    0.961
Skew:             -0.179    Prob(JB):            0.618
Kurtosis:          1.220    Cond. No.            10.4
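
The male summary mirrors the female one (same R-squared, slope with the opposite sign, intercept 0.9589 = 1 - 0.0411). This is expected because users of unknown gender are excluded, so `male_prop = 1 - female_prop`. The sketch below illustrates the relationship on made-up numbers; it uses `numpy.polyfit` instead of the `statsmodels` OLS call from the notebook, and the proportions are hypothetical, not the notebook's data.

```python
import numpy as np

semester = np.arange(1, 8)

# Hypothetical proportions, for illustration only (not the notebook's data)
female_prop = np.array([0.05, 0.06, 0.065, 0.08, 0.085, 0.09, 0.10])
male_prop = 1.0 - female_prop

# Degree-1 least-squares fit; polyfit returns [slope, intercept]
slope_f, intercept_f = np.polyfit(semester, female_prop, 1)
slope_m, intercept_m = np.polyfit(semester, male_prop, 1)

# The two fits are mirror images of each other
print(np.isclose(slope_m, -slope_f))
print(np.isclose(intercept_m, 1.0 - intercept_f))
```

Since the two regressions carry the same information, either one is enough to answer the question about the trend.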



In [11]:

    
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

fig = sm.graphics.plot_fit(result_male, "semester", ax=axes[0])
fig = sm.graphics.plot_ccpr(result_male, "semester", ax=axes[1])

Initialization

Here you can find the data import and some helper functions used in the analysis. Please run this section first; otherwise the analysis will not work.

The data is imported from the MongoDB database and loaded into a pandas DataFrame for easy manipulation.



In [1]:

    
from __future__ import division
import pymongo, time, pylab, numpy, pandas, math
from scipy import stats
import matplotlib as mpl
from matplotlib import pyplot
import statsmodels.api as sm
import statsmodels.formula.api as smf

%matplotlib inline
mpl.style.use('ggplot')

client = pymongo.MongoClient('localhost', 27017)

community = 'math'
stats_db = client[community].statistics

cursor = stats_db.find({'$or': [{'questions_total':{'$gt':0}}, {'answers_total':{'$gt':0}}, 
                                {'comments_total':{'$gt':0}}], 
                        'gender': {'$ne': "Unknown"}  }, 
                       {u'_id': False, u'contributions_total': True,
                        u'joined': True, u'gender':True, 'gender_cat': True})


df =  pandas.DataFrame(list(cursor))
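
To follow the rest of the notebook without a MongoDB instance, a frame of the same shape can be built from a plain list of dictionaries, which is what `list(cursor)` yields. The records below are hypothetical and only mimic the fields requested in the projection; the name `df_example` is an assumption chosen so it does not clash with `df`.

```python
import datetime
import pandas

# Hypothetical documents shaped like the projection in the query above
records = [
    {"joined": datetime.datetime(2011, 1, 15), "gender": "Male",
     "contributions_total": 12},
    {"joined": datetime.datetime(2011, 3, 2), "gender": "Female",
     "contributions_total": 4},
    {"joined": datetime.datetime(2011, 9, 20), "gender": "Male",
     "contributions_total": 7},
]

# One row per document, one column per field
df_example = pandas.DataFrame(records)
print(df_example.dtypes)
```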



In [2]:

    
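# index by join date; sorting keeps date-string slicing well defined
indexed_df = df.set_index(['joined']).sort_index()
males = indexed_df[indexed_df['gender']=='Male']
females = indexed_df[indexed_df['gender']=='Female']
# always empty here: the query above already excludes 'Unknown'
unknown = indexed_df[indexed_df['gender']=='Unknown']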



In [3]:

    
from dateutil.relativedelta import relativedelta
import datetime

def aggregate_semesters(df):
    # Counts the users who joined in each six-month window.
    # Relies on the module-level frames males, females and indexed_df.
    maxi = max(df.joined).date()
    begin = min(df.joined).date()
    end = begin + relativedelta(months=+6)

    # collect one dict per semester; DataFrame.append was removed in pandas 2.0
    rows = []

    while begin <= maxi:
        # date-string slices on a DatetimeIndex include both endpoints
        d = {"semester": begin,
             "contrib_males": len(males[str(begin):str(end)].index),
             "contrib_females": len(females[str(begin):str(end)].index),
             "contrib": len(indexed_df[str(begin):str(end)].index)
             }
        rows.append(d)

        # the next window starts the day after the previous one ended
        begin = end + relativedelta(days=+1)
        end = begin + relativedelta(months=+6)

    return pandas.DataFrame(rows)

def male_proportion(row):
    return row['contrib_males'] / row['contrib']

def female_proportion(row):
    return row['contrib_females'] / row['contrib']



In [4]:

    
#aggregates data by semester
merged = aggregate_semesters(df)

#indexes dataframe by date
merged = merged.set_index(['semester'])

#calculating proportion of contributions by gender
merged['male_prop'] = merged.apply(male_proportion, axis=1)
merged['female_prop'] = merged.apply(female_proportion, axis=1)
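
The row-wise `apply` calls above can also be written as column arithmetic, which pandas evaluates for all rows at once. The sketch below shows this on hypothetical per-semester counts; the frame `counts` and its values are assumptions for illustration, not the notebook's data.

```python
import pandas

# Hypothetical per-semester counts (not the notebook's data)
counts = pandas.DataFrame({
    "contrib_males": [950, 1000, 1300],
    "contrib_females": [50, 100, 150],
})
counts["contrib"] = counts["contrib_males"] + counts["contrib_females"]

# Element-wise division replaces apply(..., axis=1)
counts["male_prop"] = counts["contrib_males"] / counts["contrib"]
counts["female_prop"] = counts["contrib_females"] / counts["contrib"]

print(counts)
```

Both forms give the same columns; the vectorized one avoids a Python-level function call per row.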


