Community: SuperUser

Initialization and data import are at the end of this notebook. They were placed at the bottom to keep the analysis easy to read, but they must be run first for the analysis to work as expected.

Data summary



In [5]:

    
merged.describe()









    Out[5]:

            contrib  contrib_females  contrib_males  male_prop  female_prop
count     11.000000        11.000000      11.000000  11.000000    11.000000
mean   31573.454545      1238.545455   30334.909091   0.964964     0.035036
std    14222.254978       632.761782   13617.867039   0.011355     0.011355
min      349.000000         3.000000     346.000000   0.951722     0.008596
25%    32570.500000      1212.000000   31300.000000   0.959400     0.034489
50%    35081.000000      1366.000000   33787.000000   0.963114     0.036886
75%    39092.000000      1511.500000   37580.500000   0.965511     0.040600
max    45576.000000      2124.000000   43452.000000   0.991404     0.048278

Fourth Question: Is the proportion of contributions by each gender decreasing?

Hypothesis 1:

H0: contributionsBefore(Males) > contributionsRecent(Males) & contributionsBefore(Females) > contributionsRecent(Females);

H1: contributionsBefore(Males) <= contributionsRecent(Males) & contributionsBefore(Females) <= contributionsRecent(Females).

Absolute amount of contributions



In [6]:

    
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))
axes[0].plot(merged['contrib_females'])
axes[0].set_title("Absolute number of contributions by females")
axes[1].plot(merged['contrib_males'])
axes[1].set_title("Absolute number of contributions by males")









    Out[6]:





<matplotlib.text.Text at 0x116465390>

Proportion of contributions by gender



In [7]:

    
fig, ax = pyplot.subplots(figsize=(15, 7))
pyplot.plot(merged[['male_prop', 'female_prop']])
ax.set_title("Proportion of Contributions")
ax.legend(["Male","Female"],bbox_to_anchor=(1.02, 1), loc=2, borderaxespad=0.)









    Out[7]:





<matplotlib.legend.Legend at 0x1174145d0>

Regression - Proportion of contributions by Women



In [8]:

    
merged['semester'] = [1,2,3,4,5,6,7,8,9,10,11]

import statsmodels.formula.api as smf
result_female = smf.ols(formula="female_prop ~ semester", data=merged).fit()
result_female.summary()









    



/Library/Python/2.7/site-packages/scipy/stats/stats.py:1205: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=11
  int(n))






    Out[8]:





OLS Regression Results

  Dep. Variable:        female_prop      R-squared:             0.545
  Model:                    OLS          Adj. R-squared:        0.495
  Method:              Least Squares     F-statistic:           10.79
  Date:              Sat, 11 Oct 2014    Prob (F-statistic):   0.00946
  Time:                  13:20:18        Log-Likelihood:       38.508
  No. Observations:           11         AIC:                  -73.02
  Df Residuals:                9         BIC:                  -72.22
  Df Model:                    1

               coef      std err       t       P>|t|  [95.0% Conf. Int.]
  Intercept      0.0199      0.005      3.806   0.004      0.008     0.032
  semester       0.0025      0.001      3.285   0.009      0.001     0.004

  Omnibus:         0.007    Durbin-Watson:         1.080
  Prob(Omnibus):   0.996    Jarque-Bera (JB):      0.172
  Skew:           -0.036    Prob(JB):              0.918
  Kurtosis:        2.392    Cond. No.               14.8
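
To make the fitted line easier to read, the following minimal sketch evaluates it at the first and last semester. It only uses the rounded `Intercept` and `semester` coefficients printed in the summary above; the helper name `fitted_female_prop` is made up for this illustration and is not part of the notebook.

```python
# Coefficients copied from the OLS summary above (rounded as reported there).
INTERCEPT = 0.0199
SLOPE = 0.0025


def fitted_female_prop(semester):
    """Fitted proportion of female contributions for a 1-based semester index."""
    return INTERCEPT + SLOPE * semester


first = fitted_female_prop(1)
last = fitted_female_prop(11)

print("semester 1:  %.4f" % first)
print("semester 11: %.4f" % last)
print("fitted change across the period: %.4f" % (last - first))
```

Under this fit the female share rises by roughly 0.25 percentage points per semester. These are points on the fitted line, not observed values.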



In [9]:

    
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

fig = sm.graphics.plot_fit(result_female, "semester", ax=axes[0])
fig = sm.graphics.plot_ccpr(result_female, "semester", ax=axes[1])

Regression - Proportion of contributions by Men



In [10]:

    
result_male = smf.ols(formula="male_prop ~ semester", data=merged).fit()
result_male.summary()









    Out[10]:





OLS Regression Results

  Dep. Variable:         male_prop       R-squared:             0.545
  Model:                    OLS          Adj. R-squared:        0.495
  Method:              Least Squares     F-statistic:           10.79
  Date:              Sat, 11 Oct 2014    Prob (F-statistic):   0.00946
  Time:                  13:20:33        Log-Likelihood:       38.508
  No. Observations:           11         AIC:                  -73.02
  Df Residuals:                9         BIC:                  -72.22
  Df Model:                    1

               coef      std err       t       P>|t|  [95.0% Conf. Int.]
  Intercept      0.9801      0.005    187.770   0.000      0.968     0.992
  semester      -0.0025      0.001     -3.285   0.009     -0.004    -0.001

  Omnibus:         0.007    Durbin-Watson:         1.080
  Prob(Omnibus):   0.996    Jarque-Bera (JB):      0.172
  Skew:            0.036    Prob(JB):              0.918
  Kurtosis:        2.392    Cond. No.               14.8
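
The male regression mirrors the female one: same R-squared, slope of opposite sign, and intercepts that sum to 1. This follows from `male_prop = 1 - female_prop` in every semester. The sketch below shows the effect with a hand-written least-squares fit. The `female` values are invented for illustration and are not the notebook's data, and `ols_line` is a hypothetical helper, not a statsmodels function.

```python
def ols_line(xs, ys):
    """Return (intercept, slope) of the simple least-squares line through (xs, ys)."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


semesters = list(range(1, 12))

# Invented proportions, for illustration only (NOT the values in `merged`).
female = [0.009, 0.030, 0.034, 0.035, 0.036, 0.037,
          0.038, 0.040, 0.041, 0.045, 0.048]
# The two proportions are complementary by construction.
male = [1.0 - f for f in female]

f_intercept, f_slope = ols_line(semesters, female)
m_intercept, m_slope = ols_line(semesters, male)

print("female: intercept=%.4f slope=%.4f" % (f_intercept, f_slope))
print("male:   intercept=%.4f slope=%.4f" % (m_intercept, m_slope))
```

Because the two series are complementary, the second regression carries no independent information: a significant increase in one proportion is the same finding as a significant decrease in the other.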



In [11]:

    
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

fig = sm.graphics.plot_fit(result_male, "semester", ax=axes[0])
fig = sm.graphics.plot_ccpr(result_male, "semester", ax=axes[1])

Initialization

This section contains the data import and some helper functions used in the analysis. Run it first; otherwise the analysis will not work.

The data is imported from the MongoDB database and loaded into a pandas DataFrame for easy manipulation.



In [1]:

    
from __future__ import division
import pymongo, time, pylab, numpy, pandas, math
from scipy import stats
import matplotlib as mpl
from matplotlib import pyplot
import statsmodels.api as sm

%matplotlib inline
mpl.style.use('ggplot')
# pyplot.rcdefaults()

client = pymongo.MongoClient('localhost', 27017)

community = 'superuser'
stats_db = client[community].statistics

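# Keep users with at least one question, answer, or comment and a known gender,
# then emit one document per entry in `dates`.
pipeline = [
    {'$match': {'$or': [{'questions_total': {'$gt': 0}},
                        {'answers_total': {'$gt': 0}},
                        {'comments_total': {'$gt': 0}}],
                'gender': {'$ne': "Unknown"}}},
    {'$unwind': '$dates'},
    {'$project': {'gender': 1, 'dates': 1}}
]

cursor = stats_db.aggregate(pipeline, cursor={})

# One row per (user, date) pair returned by the pipeline.
df = pandas.DataFrame(list(cursor))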



In [2]:

    
indexed_df = df.set_index(['dates'])
males = indexed_df[indexed_df['gender']=='Male']
females = indexed_df[indexed_df['gender']=='Female']



In [3]:

    
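from dateutil.relativedelta import relativedelta

def aggregate_semesters(df):
    # Counts contributions per six-month window. Note: the date range comes from
    # df, but the counts use the module-level males, females and indexed_df frames.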
    maxi = max(df.dates).date()
    begin = min(df.dates).date()
    end = begin + relativedelta(months=+6)
    
    return_df = pandas.DataFrame(data={})

    while(begin <= maxi):
        d = {"semester": begin, 
             "contrib_males": len(males[str(begin):str(end)].index),
             "contrib_females": len(females[str(begin):str(end)].index),
             "contrib": len(indexed_df[str(begin):str(end)].index)
             }
        return_df = return_df.append(d, ignore_index=True)
        
        begin = end + relativedelta(days=+1)
        end = begin + relativedelta(months=+6)
    
    return return_df

def male_proportion(row):
    return row['contrib_males'] / row['contrib']

def female_proportion(row):
    return row['contrib_females'] / row['contrib']
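
The windowing logic of `aggregate_semesters` can be looked at in isolation. The sketch below reproduces only the loop over window boundaries, using the standard library instead of `dateutil`. `add_months` and `semester_windows` are hypothetical helpers written for this illustration, and the start and end dates are assumed, not taken from the database.

```python
import calendar
import datetime


def add_months(day, months):
    """Shift a date by whole months, clamping the day to the target month's length."""
    month_index = day.month - 1 + months
    year = day.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return datetime.date(year, month, min(day.day, last_day))


def semester_windows(first_day, last_day):
    """List (begin, end) pairs the way the loop in aggregate_semesters walks them."""
    windows = []
    begin = first_day
    while begin <= last_day:
        end = add_months(begin, 6)
        windows.append((begin, end))
        # The next window starts the day after the previous one ends.
        begin = end + datetime.timedelta(days=1)
    return windows


# Assumed date range, for illustration only.
windows = semester_windows(datetime.date(2009, 7, 15), datetime.date(2014, 9, 14))
for begin, end in windows:
    print("%s .. %s" % (begin, end))
```

Each window spans six months plus the closing day, so the start date drifts forward by one day per semester. The last window can also extend past the newest record, which leaves that semester only partially observed.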



In [4]:

    
#aggregates data by semester
merged = aggregate_semesters(df)

#indexes dataframe by date
merged = merged.set_index(['semester'])

#calculating proportion of contributions by gender
merged['male_prop'] = merged.apply(male_proportion, axis=1)
merged['female_prop'] = merged.apply(female_proportion, axis=1)


