Community: StackOverflow

Initialization and data import are at the end of this notebook. They were placed at the bottom so the analysis is easier to read, but they must be run first for the analysis to work as expected.

Data summary



In [5]:

    
merged.describe()









    Out[5]:

            contrib  contrib_females  contrib_males  male_prop  female_prop
count     11.000000        11.000000      11.000000  11.000000    11.000000
mean   12112.454545       717.909091   11394.545455   0.940064     0.059936
std     4388.494078       335.495292    4093.349444   0.017631     0.017631
min     3070.000000       275.000000    2794.000000   0.910098     0.029506
25%     9224.500000       449.500000    8911.500000   0.929288     0.049245
50%    12378.000000       647.000000   11740.000000   0.938508     0.061492
75%    15270.500000      1041.000000   14317.000000   0.950755     0.070712
max    17482.000000      1173.000000   16407.000000   0.970494     0.089902

Fifth Question: Is the proportion of registrations by each gender decreasing?

Absolute amount of contributions



In [6]:

    
# Side-by-side line plots of the raw contribution counts per semester
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))
axes[0].plot(merged['contrib_females'])
axes[0].set_title("Absolute number of contributions by females")
axes[1].plot(merged['contrib_males'])
axes[1].set_title("Absolute number of contributions by males")









    Out[6]:





<matplotlib.text.Text at 0x10d07e510>

Proportion of contributions by gender



In [7]:

    
fig, ax = pyplot.subplots(figsize=(15, 7))
pyplot.plot(merged[['male_prop', 'female_prop']])
ax.set_title("Proportion of Contributions")
ax.legend(["Male","Female"],bbox_to_anchor=(1.02, 1), loc=2, borderaxespad=0.)









    Out[7]:





<matplotlib.legend.Legend at 0x10d9d4810>

Regression - Proportion of Women who joined per semester



In [8]:

    
merged['semester'] = [1,2,3,4,5,6,7,8,9,10,11]


result_female = smf.ols(formula="female_prop ~ semester", data=merged).fit()
result_female.summary()









    



/Library/Python/2.7/site-packages/scipy/stats/stats.py:1205: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=11
  int(n))






    Out[8]:





OLS Regression Results

  Dep. Variable:        female_prop      R-squared:             0.973
  Model:                    OLS          Adj. R-squared:        0.970
  Method:              Least Squares     F-statistic:           319.2
  Date:              Mon, 27 Oct 2014    Prob (F-statistic):  2.45e-08
  Time:                  14:17:06        Log-Likelihood:       49.116
  No. Observations:           11         AIC:                  -94.23
  Df Residuals:                9         BIC:                  -93.44
  Df Model:                    1

               coef      std err       t       P>|t|  [95.0% Conf. Int.]
  Intercept      0.0285      0.002     14.312   0.000      0.024     0.033
  semester       0.0052      0.000     17.867   0.000      0.005     0.006

  Omnibus:         2.233    Durbin-Watson:         1.401
  Prob(Omnibus):   0.327    Jarque-Bera (JB):      1.089
  Skew:           -0.398    Prob(JB):              0.580
  Kurtosis:        1.680    Cond. No.               14.8
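
To read the coefficients above in plain terms, the fitted line can be evaluated by hand. The sketch below is not part of the original analysis: it copies the rounded intercept and slope from the printed summary, and the function name is only illustrative. It suggests the fitted female proportion rises by about half a percentage point per semester. With only 11 observations this describes the observed range and should not be read as a forecast.

```python
# Coefficients copied from the OLS summary above (rounded as printed)
INTERCEPT = 0.0285
SLOPE = 0.0052


def predicted_female_prop(semester):
    """Fitted female proportion for a semester number (1 to 11 in the notebook)."""
    return INTERCEPT + SLOPE * semester


first = predicted_female_prop(1)
last = predicted_female_prop(11)

print("semester 1:  %.4f" % first)
print("semester 11: %.4f" % last)
print("change over ten semesters: %.1f percentage points" % ((last - first) * 100))
```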



In [9]:

    
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

fig = sm.graphics.plot_fit(result_female, "semester", ax=axes[0])
fig = sm.graphics.plot_ccpr(result_female, "semester", ax=axes[1])

Regression - Proportion of Men who joined per semester



In [10]:

    
result_male = smf.ols(formula="male_prop ~ semester", data=merged).fit()
result_male.summary()









    Out[10]:





OLS Regression Results

  Dep. Variable:         male_prop       R-squared:             0.973
  Model:                    OLS          Adj. R-squared:        0.970
  Method:              Least Squares     F-statistic:           319.2
  Date:              Mon, 27 Oct 2014    Prob (F-statistic):  2.45e-08
  Time:                  14:17:07        Log-Likelihood:       49.116
  No. Observations:           11         AIC:                  -94.23
  Df Residuals:                9         BIC:                  -93.44
  Df Model:                    1

               coef      std err       t       P>|t|  [95.0% Conf. Int.]
  Intercept      0.9715      0.002    488.198   0.000      0.967     0.976
  semester      -0.0052      0.000    -17.867   0.000     -0.006    -0.005

  Omnibus:         2.233    Durbin-Watson:         1.401
  Prob(Omnibus):   0.327    Jarque-Bera (JB):      1.089
  Skew:            0.398    Prob(JB):              0.580
  Kurtosis:        1.680    Cond. No.               14.8
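
The male and female summaries mirror each other (same R-squared, opposite slope, intercepts summing to 1) because male_prop + female_prop = 1 in every semester. The sketch below illustrates that property with a plain closed-form least-squares fit rather than statsmodels. The proportions in it are made-up illustrative values, not the notebook's data, and fit_line is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Closed-form simple least squares: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


semesters = list(range(1, 12))

# Hypothetical female proportions, invented for illustration only
female_prop = [0.030, 0.041, 0.044, 0.050, 0.053, 0.061,
               0.066, 0.069, 0.076, 0.080, 0.090]
# The complement plays the role of male_prop
male_prop = [1.0 - p for p in female_prop]

f_intercept, f_slope = fit_line(semesters, female_prop)
m_intercept, m_slope = fit_line(semesters, male_prop)

# Slopes are equal and opposite; intercepts sum to 1
print(f_slope + m_slope)
print(f_intercept + m_intercept)
```

Because of this symmetry, the second regression adds no information beyond the first; it restates the same trend from the other side.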



In [11]:

    
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

fig = sm.graphics.plot_fit(result_male, "semester", ax=axes[0])
fig = sm.graphics.plot_ccpr(result_male, "semester", ax=axes[1])

Initialization

Here you can find the data import and some helper functions used in the analysis. Please run this first, otherwise the analysis will not work.

Importing the data from the MongoDB database and loading it into a pandas DataFrame for easy manipulation.



In [1]:

    
from __future__ import division
import pymongo, time, pylab, numpy, pandas, math
from scipy import stats
import matplotlib as mpl
from matplotlib import pyplot
import statsmodels.api as sm
import statsmodels.formula.api as smf

%matplotlib inline
mpl.style.use('ggplot')

client = pymongo.MongoClient('localhost', 27017)

community = 'stackoverflow'
stats_db = client[community].statistics

cursor = stats_db.find({'$or': [{'questions_total':{'$gt':0}}, {'answers_total':{'$gt':0}}, 
                                {'comments_total':{'$gt':0}}], 
                        'gender': {'$ne': "Unknown"}  }, 
                       {u'_id': False, u'contributions_total': True,
                        u'joined': True, u'gender':True, 'gender_cat': True})


df =  pandas.DataFrame(list(cursor))
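
The filter passed to find() keeps users who have at least one question, answer, or comment and whose gender is not "Unknown". As a rough illustration, the same rule can be written as a plain Python predicate. The function name and the sample documents below are hypothetical. The sketch treats a missing counter as 0 and a missing gender as known, which is how I understand MongoDB's `$gt` and `$ne` to behave on absent fields.

```python
def matches_filter(doc):
    """Pure-Python mirror of the find() filter above, for illustration only."""
    has_activity = any(
        doc.get(field, 0) > 0
        for field in ("questions_total", "answers_total", "comments_total")
    )
    known_gender = doc.get("gender") != "Unknown"
    return has_activity and known_gender


# Hypothetical documents, not taken from the real database
sample = [
    {"gender": "Female", "questions_total": 3, "answers_total": 0, "comments_total": 0},
    {"gender": "Male", "questions_total": 0, "answers_total": 0, "comments_total": 0},
    {"gender": "Unknown", "questions_total": 5, "answers_total": 2, "comments_total": 1},
]

kept = [doc for doc in sample if matches_filter(doc)]
print(len(kept))
```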



In [2]:

    
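# Index by join date so rows can be sliced by date range
indexed_df = df.set_index(['joined'])
males = indexed_df[indexed_df['gender']=='Male']
females = indexed_df[indexed_df['gender']=='Female']
# Always empty here: the query above already excludes gender == "Unknown"
unknown = indexed_df[indexed_df['gender']=='Unknown']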



In [3]:

    
from dateutil.relativedelta import relativedelta
import datetime

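def aggregate_semesters(df):
    # Counts rows per six-month window; relies on the notebook-level
    # males, females and indexed_df frames defined in the previous cell.
    maxi = max(df.joined).date()
    begin = min(df.joined).date()
    end = begin + relativedelta(months=+6)

    rows = []

    while begin <= maxi:
        # Date-string slices on the 'joined' index include both endpoints
        rows.append({"semester": begin,
                     "contrib_males": len(males[str(begin):str(end)].index),
                     "contrib_females": len(females[str(begin):str(end)].index),
                     "contrib": len(indexed_df[str(begin):str(end)].index)})

        # The next window starts the day after the previous one ends
        begin = end + relativedelta(days=+1)
        end = begin + relativedelta(months=+6)

    # Build the frame once; DataFrame.append was removed in pandas 2.0
    return pandas.DataFrame(rows, columns=["contrib", "contrib_females",
                                           "contrib_males", "semester"])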

def male_proportion(row):
    return row['contrib_males'] / row['contrib']

def female_proportion(row):
    return row['contrib_females'] / row['contrib']
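
To see how aggregate_semesters lays out its windows, the date stepping can be reproduced on its own. The sketch below uses only the standard library. add_months is a hypothetical stand-in for relativedelta(months=+6), and the start and end dates are arbitrary examples, not the notebook's data. It shows that each window runs from begin to begin plus six months inclusive, and that the next window starts one day later, so window start dates drift forward by a day each semester.

```python
import calendar
import datetime


def add_months(day, months):
    """Shift a date by whole months, clamping the day to the month's length."""
    index = day.year * 12 + (day.month - 1) + months
    year, month_index = divmod(index, 12)
    last_day = calendar.monthrange(year, month_index + 1)[1]
    return datetime.date(year, month_index + 1, min(day.day, last_day))


def semester_windows(first, last):
    """Yield (begin, end) pairs using the same stepping rule as the loop above."""
    begin = first
    while begin <= last:
        end = add_months(begin, 6)
        yield begin, end
        begin = end + datetime.timedelta(days=1)


# Arbitrary example range, chosen only to illustrate the stepping
windows = list(semester_windows(datetime.date(2009, 1, 1),
                                datetime.date(2010, 6, 30)))

for begin, end in windows:
    print("%s .. %s" % (begin, end))
```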



In [4]:

    
#aggregates data by semester
merged = aggregate_semesters(df)

#indexes dataframe by date
merged = merged.set_index(['semester'])

#calculating proportion of contributions by gender
merged['male_prop'] = merged.apply(male_proportion, axis=1)
merged['female_prop'] = merged.apply(female_proportion, axis=1)


