Community: SuperUser

Initialization and data import are at the end of this notebook. They were placed at the bottom to keep the analysis easy to read, but they must be run first for the analysis to work as expected. See the Initialization section at the end.

Data summary



In [5]:

    
merged.describe()









    Out[5]:






  
    
      
           contrib  contrib_females  contrib_males  male_prop  female_prop
count     9.000000         9.000000       9.000000   9.000000     9.000000
mean   2116.888889        87.555556    2029.333333   0.956779     0.043221
std     837.383491        24.213174     818.799273   0.008407     0.008407
min    1020.000000        52.000000     968.000000   0.944318     0.030369
25%    1828.000000        76.000000    1757.000000   0.950385     0.038840
50%    1967.000000        80.000000    1886.000000   0.957694     0.042306
75%    2191.000000       103.000000    2069.000000   0.961160     0.049615
max    4116.000000       125.000000    3991.000000   0.969631     0.055682

Fifth Question: Is the proportion of registrations by each gender decreasing?

Absolute number of contributions



In [6]:

    
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))
axes[0].plot(merged['contrib_females'])
axes[0].set_title("Absolute number of contributions by females")
axes[1].plot(merged['contrib_males'])
axes[1].set_title("Absolute number of contributions by males")









    Out[6]:





<matplotlib.text.Text at 0x108fb0850>

Proportion of contributions by gender



In [ ]:

    
fig, ax = pyplot.subplots(figsize=(15, 7))
pyplot.plot(merged[['male_prop', 'female_prop']])
ax.set_title("Proportion of Contributions")
ax.legend(["Male","Female"],bbox_to_anchor=(1.02, 1), loc=2, borderaxespad=0.)

Regression - Proportion of Women who joined per semester



In [7]:

    
merged['semester'] = [1,2,3,4,5,6,7,8,9]


result_female = smf.ols(formula="female_prop ~ semester", data=merged).fit()
result_female.summary()









    



/Library/Python/2.7/site-packages/scipy/stats/stats.py:1205: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=9
  int(n))






    Out[7]:





OLS Regression Results

  Dep. Variable:        female_prop      R-squared:             0.667


  Model:                    OLS          Adj. R-squared:        0.620


  Method:              Least Squares     F-statistic:           14.04


  Date:              Mon, 27 Oct 2014    Prob (F-statistic):   0.00720


  Time:                  14:15:46        Log-Likelihood:       35.720


  No. Observations:            9         AIC:                  -67.44


  Df Residuals:                7         BIC:                  -67.05


  Df Model:                    1                                     




               coef      std err       t       P>|t|  [95.0% Conf. Int.] 


  Intercept      0.0307      0.004      8.147   0.000      0.022     0.040


  semester       0.0025      0.001      3.747   0.007      0.001     0.004




  Omnibus:         0.357    Durbin-Watson:         2.037


  Prob(Omnibus):   0.837    Jarque-Bera (JB):      0.434


  Skew:            0.017    Prob(JB):              0.805


  Kurtosis:        1.925    Cond. No.               12.6
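
To read the fit above in concrete terms, the printed coefficients can be plugged into the fitted line. This is only an illustrative sketch: it uses the rounded values from the summary table, and the helper name `predicted_female_prop` is made up for this example.

```python
# Coefficients copied from the OLS summary above (rounded as printed there).
INTERCEPT = 0.0307
SLOPE = 0.0025


def predicted_female_prop(semester):
    """Fitted female proportion for a 1-based semester number (hypothetical helper)."""
    return INTERCEPT + SLOPE * semester


first = predicted_female_prop(1)
last = predicted_female_prop(9)

print("semester 1: %.4f" % first)
print("semester 9: %.4f" % last)
print("change over 8 semesters: %.4f" % (last - first))
```

With only 9 observations (see the kurtosistest warning above), these fitted values are best treated as a rough description of the trend.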



In [8]:

    
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

fig = sm.graphics.plot_fit(result_female, "semester", ax=axes[0])
fig = sm.graphics.plot_ccpr(result_female, "semester", ax=axes[1])

Regression - Proportion of Men who joined per semester



In [9]:

    
result_male = smf.ols(formula="male_prop ~ semester", data=merged).fit()
result_male.summary()









    Out[9]:





OLS Regression Results

  Dep. Variable:         male_prop       R-squared:             0.667


  Model:                    OLS          Adj. R-squared:        0.620


  Method:              Least Squares     F-statistic:           14.04


  Date:              Mon, 27 Oct 2014    Prob (F-statistic):   0.00720


  Time:                  14:15:48        Log-Likelihood:       35.720


  No. Observations:            9         AIC:                  -67.44


  Df Residuals:                7         BIC:                  -67.05


  Df Model:                    1                                     




               coef      std err       t       P>|t|  [95.0% Conf. Int.] 


  Intercept      0.9693      0.004    257.372   0.000      0.960     0.978


  semester      -0.0025      0.001     -3.747   0.007     -0.004    -0.001




  Omnibus:         0.357    Durbin-Watson:         2.037


  Prob(Omnibus):   0.837    Jarque-Bera (JB):      0.434


  Skew:           -0.017    Prob(JB):              0.805


  Kurtosis:        1.925    Cond. No.               12.6
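
The two summaries mirror each other: same R-squared and F-statistic, slopes of opposite sign, and intercepts that sum to 1. This follows from `male_prop = 1 - female_prop`, which holds here because users of unknown gender are excluded from the query. A minimal sketch with made-up proportions and a hand-written least-squares helper (`ols_line` is hypothetical, not part of statsmodels) shows the same pattern:

```python
def ols_line(xs, ys):
    """Return (intercept, slope) of the simple least-squares line through the points."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


semesters = list(range(1, 10))
# Made-up female proportions, NOT the notebook's data.
female = [0.030, 0.036, 0.039, 0.041, 0.042, 0.047, 0.050, 0.049, 0.056]
male = [1.0 - f for f in female]

f_int, f_slope = ols_line(semesters, female)
m_int, m_slope = ols_line(semesters, male)

print("female: intercept=%.4f slope=%.4f" % (f_int, f_slope))
print("male:   intercept=%.4f slope=%.4f" % (m_int, m_slope))
```

The male regression therefore carries no information beyond the female one; either is enough to answer the question.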



In [10]:

    
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

fig = sm.graphics.plot_fit(result_male, "semester", ax=axes[0])
fig = sm.graphics.plot_ccpr(result_male, "semester", ax=axes[1])

Initialization

Here you can find the data import and some helper functions used for analysing the data. Please run this first, otherwise the analysis will not work.

Importing the data from the MongoDB database and loading it into a pandas DataFrame for easy manipulation.



In [1]:

    
from __future__ import division
import pymongo, time, pylab, numpy, pandas, math
from scipy import stats
import matplotlib as mpl
from matplotlib import pyplot
import statsmodels.api as sm
import statsmodels.formula.api as smf

%matplotlib inline
mpl.style.use('ggplot')

client = pymongo.MongoClient('localhost', 27017)

community = 'superuser'
stats_db = client[community].statistics

cursor = stats_db.find({'$or': [{'questions_total':{'$gt':0}}, {'answers_total':{'$gt':0}}, 
                                {'comments_total':{'$gt':0}}], 
                        'gender': {'$ne': "Unknown"}  }, 
                       {u'_id': False, u'contributions_total': True,
                        u'joined': True, u'gender':True, 'gender_cat': True})


df =  pandas.DataFrame(list(cursor))
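
For readers without access to the MongoDB instance, the shape of the resulting frame can be sketched with a plain list of dictionaries standing in for the cursor. The documents below are invented for illustration; only the field names follow the projection used in the query above.

```python
import datetime

import pandas

# Hypothetical documents shaped like the query's projection (values are made up).
docs = [
    {"contributions_total": 12, "joined": datetime.datetime(2009, 7, 15), "gender": "Male"},
    {"contributions_total": 3, "joined": datetime.datetime(2010, 1, 20), "gender": "Female"},
    {"contributions_total": 7, "joined": datetime.datetime(2010, 8, 1), "gender": "Male"},
]

# Same construction as pandas.DataFrame(list(cursor)): one row per document.
toy_df = pandas.DataFrame(docs)

# Index by join date and split by gender, as the next cell does.
toy_indexed = toy_df.set_index("joined").sort_index()
toy_females = toy_indexed[toy_indexed["gender"] == "Female"]

print(toy_df)
print(len(toy_females))
```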



In [2]:

    
indexed_df = df.set_index(['joined'])
males = indexed_df[indexed_df['gender']=='Male']
females = indexed_df[indexed_df['gender']=='Female']
unknown = indexed_df[indexed_df['gender']=='Unknown']



In [3]:

    
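from dateutil.relativedelta import relativedelta
import datetime

def aggregate_semesters(df):
    # Counts users per six-month window of join date.
    # Relies on the globals males, females and indexed_df defined above.
    maxi = max(df.joined).date()
    begin = min(df.joined).date()
    end = begin + relativedelta(months=+6)

    rows = []

    while begin <= maxi:
        # Select the rows whose join date falls between begin and end
        rows.append({"semester": begin,
                     "contrib_males": len(males[str(begin):str(end)].index),
                     "contrib_females": len(females[str(begin):str(end)].index),
                     "contrib": len(indexed_df[str(begin):str(end)].index)})

        # The next window starts the day after the previous one ends
        begin = end + relativedelta(days=+1)
        end = begin + relativedelta(months=+6)

    # Build the frame once instead of appending row by row
    return pandas.DataFrame(rows, columns=["contrib", "contrib_females",
                                           "contrib_males", "semester"])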
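
The windowing idea behind `aggregate_semesters` can be sketched without pandas or a database. This is a simplified stand-in, not the function above: it uses fixed six-month windows counted from the first join date (the original advances each window by six months plus one day), it assumes the start day exists in every month, and `semester_index` and the sample records are hypothetical.

```python
import datetime
from collections import Counter


def semester_index(joined, start):
    """0-based index of the six-month window containing `joined`, counted from `start`."""
    months = (joined.year - start.year) * 12 + (joined.month - start.month)
    if joined.day < start.day:
        # The current month's window boundary has not been reached yet.
        months -= 1
    return months // 6


start = datetime.date(2009, 7, 15)

# Made-up (join date, gender) pairs.
records = [
    (datetime.date(2009, 7, 15), "Male"),
    (datetime.date(2009, 12, 31), "Female"),
    (datetime.date(2010, 1, 14), "Male"),
    (datetime.date(2010, 1, 15), "Female"),
    (datetime.date(2010, 8, 1), "Male"),
]

# Count users per (window, gender).
counts = Counter((semester_index(day, start), gender) for day, gender in records)

for window, gender in sorted(counts):
    print("semester %d, %s: %d" % (window, gender, counts[(window, gender)]))
```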

def male_proportion(row):
    return row['contrib_males'] / row['contrib']

def female_proportion(row):
    return row['contrib_females'] / row['contrib']



In [4]:

    
#aggregates data by semester
merged = aggregate_semesters(df)

#indexes dataframe by date
merged = merged.set_index(['semester'])

#calculating proportion of contributions by gender
merged['male_prop'] = merged.apply(male_proportion, axis=1)
merged['female_prop'] = merged.apply(female_proportion, axis=1)
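
The row-wise `apply` calls above can also be written as column arithmetic, which should give the same proportions. The sketch below uses a small made-up frame (the name `toy` and its values are illustrative, not the notebook's data):

```python
import pandas

# Made-up counts; floats are used so the division is a true division on any Python version.
toy = pandas.DataFrame({
    "contrib": [1000.0, 2000.0, 4000.0],
    "contrib_females": [50.0, 80.0, 120.0],
    "contrib_males": [950.0, 1920.0, 3880.0],
})

# Column-wise division replaces merged.apply(..., axis=1).
toy["male_prop"] = toy["contrib_males"] / toy["contrib"]
toy["female_prop"] = toy["contrib_females"] / toy["contrib"]

print(toy[["male_prop", "female_prop"]])
```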


