Community: StackOverflow

Initialization and data import are at the end of this notebook. They were placed at the bottom so the analysis is easier to read, but they must be run first for the analysis to work as expected.

Data summary



In [6]:

    
merged.describe()









    Out[6]:






              contrib  contrib_females   contrib_males  male_prop  female_prop
count       11.000000        11.000000       11.000000  11.000000    11.000000
mean   1386627.272727     64762.636364  1321864.636364   0.957977     0.042023
std     720936.712398     41650.407168   679542.609959   0.010583     0.010583
min     266108.000000      5197.000000   260911.000000   0.946952     0.019530
25%     804623.500000     31121.000000   773502.500000   0.950056     0.038543
50%    1557108.000000     65366.000000  1491742.000000   0.957483     0.042517
75%    2021578.000000    101961.000000  1919617.000000   0.961457     0.049944
max    2348170.000000    121994.000000  2226176.000000   0.980470     0.053048

Fourth Question: Is the proportion of contributions by each gender decreasing?

Hypothesis 1:

H0: contributionsBefore(Males) > contributionsRecent(Males) & contributionsBefore(Females) > contributionsRecent(Females);

H1: contributionsBefore(Males) <= contributionsRecent(Males) & contributionsBefore(Females) <= contributionsRecent(Females).

Absolute amount of contributions



In [7]:

    
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))
axes[0].plot(merged['contrib_females'])
axes[0].set_title("Absolute number of female's contributions")
axes[1].plot(merged['contrib_males'])
axes[1].set_title("Absolute number of male's contributions")









    Out[7]:





<matplotlib.text.Text at 0x116d7fd10>

Proportion of contributions by gender



In [8]:

    
fig, ax = pyplot.subplots(figsize=(15, 7))
pyplot.plot(merged[['male_prop', 'female_prop']])
ax.set_title("Proportion of Contributions")
ax.legend(["Male","Female"],bbox_to_anchor=(1.02, 1), loc=2, borderaxespad=0.)









    Out[8]:





<matplotlib.legend.Legend at 0x117ce5790>

Regression - Proportion of contributions by Women



In [9]:

    
merged['semester'] = [1,2,3,4,5,6,7,8,9,10,11]

import statsmodels.formula.api as smf
result_female = smf.ols(formula="female_prop ~ semester", data=merged).fit()
result_female.summary()









    



/Library/Python/2.7/site-packages/scipy/stats/stats.py:1205: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=11
  int(n))






    Out[9]:





OLS Regression Results

  Dep. Variable:        female_prop      R-squared:             0.881
  Model:                    OLS          Adj. R-squared:        0.868
  Method:              Least Squares     F-statistic:           66.64
  Date:              Sat, 11 Oct 2014    Prob (F-statistic):  1.88e-05
  Time:                  13:55:44        Log-Likelihood:       46.658
  No. Observations:           11         AIC:                  -89.32
  Df Residuals:                9         BIC:                  -88.52
  Df Model:                    1

                 coef    std err          t    P>|t|    [95.0% Conf. Int.]
  Intercept    0.0241      0.002      9.666    0.000      0.018      0.030
  semester     0.0030      0.000      8.164    0.000      0.002      0.004

  Omnibus:         1.666    Durbin-Watson:         0.871
  Prob(Omnibus):   0.435    Jarque-Bera (JB):      0.937
  Skew:           -0.688    Prob(JB):              0.626
  Kurtosis:        2.612    Cond. No.               14.8
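
The summary above rounds the p-value for `semester` to 0.000, and that p-value is two-sided. Since the question is about the direction of the trend, a one-sided p-value may be more informative. The following is only a sketch: it recomputes the p-values from the t statistic (8.164) and the residual degrees of freedom (9) reported above, using `scipy.stats.t.sf`. The variable names are illustrative and not part of the original notebook.

```python
from scipy import stats

# Values copied from the female_prop regression summary
t_stat = 8.164   # t statistic of the `semester` coefficient
df_resid = 9     # Df Residuals

# Survival function = P(T > t): one-sided p-value for a positive slope
p_one_sided = stats.t.sf(t_stat, df_resid)
# The two-sided p-value doubles the tail probability
p_two_sided = 2 * p_one_sided

print(p_one_sided, p_two_sided)
```

With a single regressor the F-statistic equals the square of the t statistic, so the two-sided value should be close to the reported Prob (F-statistic).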



In [10]:

    
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

fig = sm.graphics.plot_fit(result_female, "semester", ax=axes[0])
fig = sm.graphics.plot_ccpr(result_female, "semester", ax=axes[1])

Regression - Proportion of contributions by Men



In [11]:

    
result_male = smf.ols(formula="male_prop ~ semester", data=merged).fit()
result_male.summary()









    Out[11]:





OLS Regression Results

  Dep. Variable:         male_prop       R-squared:             0.881
  Model:                    OLS          Adj. R-squared:        0.868
  Method:              Least Squares     F-statistic:           66.64
  Date:              Sat, 11 Oct 2014    Prob (F-statistic):  1.88e-05
  Time:                  13:55:56        Log-Likelihood:       46.658
  No. Observations:           11         AIC:                  -89.32
  Df Residuals:                9         BIC:                  -88.52
  Df Model:                    1

                 coef    std err          t    P>|t|    [95.0% Conf. Int.]
  Intercept    0.9759      0.002    392.216    0.000      0.970      0.982
  semester    -0.0030      0.000     -8.164    0.000     -0.004     -0.002

  Omnibus:         1.666    Durbin-Watson:         0.871
  Prob(Omnibus):   0.435    Jarque-Bera (JB):      0.937
  Skew:            0.688    Prob(JB):              0.626
  Kurtosis:        2.612    Cond. No.               14.8
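
The two regressions are mirror images of each other: the R-squared values are identical, the slopes have opposite signs, and the intercepts sum to 1 (0.0241 + 0.9759). This is expected when `male_prop = 1 - female_prop` in every semester. The sketch below illustrates the relationship on synthetic proportions (made-up values, not the notebook's data), using `numpy.polyfit` instead of statsmodels to keep it short.

```python
import numpy as np

# Synthetic, made-up proportions for 11 semesters (NOT the notebook's data)
semester = np.arange(1, 12)
rng = np.random.default_rng(0)
female_prop = 0.024 + 0.003 * semester + rng.normal(0, 0.002, size=11)
male_prop = 1.0 - female_prop  # complementary by construction

# polyfit with degree 1 returns [slope, intercept]
slope_f, intercept_f = np.polyfit(semester, female_prop, 1)
slope_m, intercept_m = np.polyfit(semester, male_prop, 1)

print(slope_f + slope_m)          # approximately 0
print(intercept_f + intercept_m)  # approximately 1
```

Under that assumption, the second regression adds no new information; either one is enough to describe the trend.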



In [12]:

    
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

fig = sm.graphics.plot_fit(result_male, "semester", ax=axes[0])
fig = sm.graphics.plot_ccpr(result_male, "semester", ax=axes[1])

Initialization

Here you can find the data import and some helper functions used in the analysis. Please run this section first; otherwise the analysis will not work.

The data is imported from the MongoDB database and loaded into a pandas DataFrame for easy manipulation.



In [2]:

    
from __future__ import division
import pymongo, time, pylab, numpy, pandas, math
from scipy import stats
import matplotlib as mpl
from matplotlib import pyplot
import statsmodels.api as sm

%matplotlib inline
# mpl.style.use('ggplot')
# pyplot.rcdefaults()

client = pymongo.MongoClient('localhost', 27017)

community = 'stackoverflow'
stats_db = client[community].statistics

pipeline = [
    {'$match':{'$or': [{'questions_total':{'$gt':0}}, {'answers_total':{'$gt':0}}, 
                                {'comments_total':{'$gt':0}}], 
                        'gender': {'$ne': "Unknown"} }},
    {'$unwind': '$dates'},
    {'$project': {'gender':1, 'dates':1}}
    
]

cursor = stats_db.aggregate(pipeline, cursor={})

df =  pandas.DataFrame(list(cursor))
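
The `$unwind` stage emits one document per element of the `dates` array, so each row of `df` corresponds to a single contribution. The sketch below emulates that step in plain Python on two hypothetical user documents (the `_id` values and dates are invented, and no MongoDB connection is needed), to show the shape of the resulting DataFrame.

```python
import datetime
import pandas as pd

# Hypothetical documents shaped like those the pipeline matches
docs = [
    {"_id": 1, "gender": "Male",
     "dates": [datetime.datetime(2009, 1, 5), datetime.datetime(2009, 8, 1)]},
    {"_id": 2, "gender": "Female",
     "dates": [datetime.datetime(2009, 3, 2)]},
]

# Plain-Python equivalent of {'$unwind': '$dates'} followed by the
# $project stage: one row per (user, contribution date) pair
rows = [
    {"_id": d["_id"], "gender": d["gender"], "dates": date}
    for d in docs
    for date in d["dates"]
]

df_toy = pd.DataFrame(rows)
print(df_toy)
```

The user with two dates yields two rows, which is why counting rows per time window counts contributions rather than users.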



In [3]:

    
indexed_df = df.set_index(['dates'])
males = indexed_df[indexed_df['gender']=='Male']
females = indexed_df[indexed_df['gender']=='Female']



In [4]:

    
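from dateutil.relativedelta import relativedelta

def aggregate_semesters(df):
    # Counts contributions per six-month window, from the earliest to the latest date.
    # Relies on the globals males, females and indexed_df defined in the previous cell.
    maxi = max(df.dates).date()
    begin = min(df.dates).date()
    end = begin + relativedelta(months=+6)

    rows = []

    while begin <= maxi:
        # Slicing the date index by label includes both endpoints
        rows.append({"semester": begin,
                     "contrib_males": len(males[str(begin):str(end)].index),
                     "contrib_females": len(females[str(begin):str(end)].index),
                     "contrib": len(indexed_df[str(begin):str(end)].index)
                     })

        # The next window starts the day after the current one ends
        begin = end + relativedelta(days=+1)
        end = begin + relativedelta(months=+6)

    # DataFrame.append was removed in pandas 2.0, so the frame is built once from the rows
    return pandas.DataFrame(rows)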

def male_proportion(row):
    return row['contrib_males'] / row['contrib']

def female_proportion(row):
    return row['contrib_females'] / row['contrib']



In [5]:

    
#aggregates data by semester
merged = aggregate_semesters(df)

#indexes dataframe by date
merged = merged.set_index(['semester'])

#calculating proportion of contributions by gender
merged['male_prop'] = merged.apply(male_proportion, axis=1)
merged['female_prop'] = merged.apply(female_proportion, axis=1)
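
The row-wise `apply` calls above could also be written as plain column arithmetic, which pandas evaluates for all rows at once. A minimal sketch on toy counts (made up for illustration, not the notebook's data):

```python
import pandas as pd

# Toy counts, made up for illustration
toy = pd.DataFrame({
    "contrib": [100, 200],
    "contrib_males": [90, 180],
    "contrib_females": [10, 20],
})

# Element-wise division of whole columns replaces the per-row helper functions
toy["male_prop"] = toy["contrib_males"] / toy["contrib"]
toy["female_prop"] = toy["contrib_females"] / toy["contrib"]

print(toy)
```

Both forms should give the same proportions; the vectorized one avoids calling a Python function once per row.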


