Community: SuperUser

Initialization and data import are at the end of this notebook. They were placed at the bottom to keep the analysis easy to read, but they must be run first for the analysis to work as expected.

Data summary


In [5]:
merged.describe()


Out[5]:
           contrib  contrib_females  contrib_males  male_prop  female_prop
count     9.000000         9.000000       9.000000   9.000000     9.000000
mean   2116.888889        87.555556    2029.333333   0.956779     0.043221
std     837.383491        24.213174     818.799273   0.008407     0.008407
min    1020.000000        52.000000     968.000000   0.944318     0.030369
25%    1828.000000        76.000000    1757.000000   0.950385     0.038840
50%    1967.000000        80.000000    1886.000000   0.957694     0.042306
75%    2191.000000       103.000000    2069.000000   0.961160     0.049615
max    4116.000000       125.000000    3991.000000   0.969631     0.055682

Fifth Question: Is the proportion of registrations by each gender decreasing?

Absolute amount of contributions


In [6]:
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))
axes[0].plot(merged['contrib_females'])
axes[0].set_title("Absolute number of contributions by females")
axes[1].plot(merged['contrib_males'])
axes[1].set_title("Absolute number of contributions by males")


Out[6]:
<matplotlib.text.Text at 0x108fb0850>

Proportion of contributions by gender


In [ ]:
fig, ax = pyplot.subplots(figsize=(15, 7))
ax.plot(merged[['male_prop', 'female_prop']])
ax.set_title("Proportion of Contributions")
ax.legend(["Male","Female"],bbox_to_anchor=(1.02, 1), loc=2, borderaxespad=0.)

Regression - Proportion of Women who joined per semester


In [7]:
# Number the semesters 1..n so they can serve as a numeric regressor.
merged['semester'] = range(1, len(merged) + 1)


result_female = smf.ols(formula="female_prop ~ semester", data=merged).fit()
result_female.summary()


/Library/Python/2.7/site-packages/scipy/stats/stats.py:1205: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=9
  int(n))
Out[7]:
OLS Regression Results
Dep. Variable:     female_prop        R-squared:            0.667
Model:             OLS                Adj. R-squared:       0.620
Method:            Least Squares      F-statistic:          14.04
Date:              Mon, 27 Oct 2014   Prob (F-statistic):   0.00720
Time:              14:15:46           Log-Likelihood:       35.720
No. Observations:  9                  AIC:                  -67.44
Df Residuals:      7                  BIC:                  -67.05
Df Model:          1

              coef  std err        t  P>|t|  [95.0% Conf. Int.]
Intercept   0.0307    0.004    8.147  0.000     0.022    0.040
semester    0.0025    0.001    3.747  0.007     0.001    0.004

Omnibus:         0.357   Durbin-Watson:      2.037
Prob(Omnibus):   0.837   Jarque-Bera (JB):   0.434
Skew:            0.017   Prob(JB):           0.805
Kurtosis:        1.925   Cond. No.           12.6
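
To make the coefficients easier to read, the fitted line can be evaluated by hand. The sketch below is only an illustration: it copies the intercept and slope from the summary as printed (rounded to four decimals), so the values are approximate, and the function name predicted_female_prop is introduced here and is not part of the notebook.

```python
# Coefficients copied from the OLS summary above (rounded as printed).
INTERCEPT = 0.0307
SLOPE = 0.0025


def predicted_female_prop(semester):
    """Fitted female proportion for a 1-based semester index."""
    return INTERCEPT + SLOPE * semester


# Evaluate the fitted line for the nine observed semesters.
fitted = {s: predicted_female_prop(s) for s in range(1, 10)}
for s, value in sorted(fitted.items()):
    print("semester %d: %.4f" % (s, value))
```

Under these rounded coefficients the fitted proportion rises from about 0.033 in semester 1 to about 0.053 in semester 9, which lies within the range of female_prop reported in the data summary (0.030 to 0.056).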

In [8]:
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

fig = sm.graphics.plot_fit(result_female, "semester", ax=axes[0])
fig = sm.graphics.plot_ccpr(result_female, "semester", ax=axes[1])


Regression - Proportion of Men who joined per semester


In [9]:
result_male = smf.ols(formula="male_prop ~ semester", data=merged).fit()
result_male.summary()


Out[9]:
OLS Regression Results
Dep. Variable:     male_prop          R-squared:            0.667
Model:             OLS                Adj. R-squared:       0.620
Method:            Least Squares      F-statistic:          14.04
Date:              Mon, 27 Oct 2014   Prob (F-statistic):   0.00720
Time:              14:15:48           Log-Likelihood:       35.720
No. Observations:  9                  AIC:                  -67.44
Df Residuals:      7                  BIC:                  -67.05
Df Model:          1

               coef  std err        t  P>|t|  [95.0% Conf. Int.]
Intercept    0.9693    0.004  257.372  0.000     0.960    0.978
semester    -0.0025    0.001   -3.747  0.007    -0.004   -0.001

Omnibus:         0.357   Durbin-Watson:      2.037
Prob(Omnibus):   0.837   Jarque-Bera (JB):   0.434
Skew:           -0.017   Prob(JB):           0.805
Kurtosis:        1.925   Cond. No.           12.6
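
The male and female summaries mirror each other: the slopes have opposite signs, the intercepts add up to 1, and the fit statistics are identical. This is expected whenever male_prop = 1 - female_prop, which holds here because users of unknown gender are excluded. The sketch below illustrates the property with a small hand-written least-squares fit. The data points are hypothetical (they are not the notebook's data), and ols_line is a helper introduced for this illustration, not a statsmodels function.

```python
def ols_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = float(len(x))
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical proportions, for illustration only (not the notebook's data).
semester = [1, 2, 3, 4, 5]
female_prop = [0.030, 0.036, 0.035, 0.041, 0.044]
male_prop = [1.0 - p for p in female_prop]

a_f, b_f = ols_line(semester, female_prop)
a_m, b_m = ols_line(semester, male_prop)

# The slopes cancel and the intercepts sum to one (up to rounding error).
print(b_f + b_m, a_f + a_m)
```

Because of this symmetry, the second regression carries no information beyond the first: an increase of 0.0025 per semester in the female proportion is the same finding as a decrease of 0.0025 per semester in the male proportion.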

In [10]:
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(15,7))

fig = sm.graphics.plot_fit(result_male, "semester", ax=axes[0])
fig = sm.graphics.plot_ccpr(result_male, "semester", ax=axes[1])



Initialization

This section contains the data import and some helper functions used in the analysis. Run it first; otherwise the analysis will not work.

Importing the data from the MongoDB database and loading it into a pandas DataFrame for easy manipulation.


In [1]:
from __future__ import division
import pymongo, time, pylab, numpy, pandas, math
from scipy import stats
import matplotlib as mpl
from matplotlib import pyplot
import statsmodels.api as sm
import statsmodels.formula.api as smf

%matplotlib inline
mpl.style.use('ggplot')

client = pymongo.MongoClient('localhost', 27017)

community = 'superuser'
stats_db = client[community].statistics

cursor = stats_db.find({'$or': [{'questions_total':{'$gt':0}}, {'answers_total':{'$gt':0}}, 
                                {'comments_total':{'$gt':0}}], 
                        'gender': {'$ne': "Unknown"}  }, 
                       {u'_id': False, u'contributions_total': True,
                        u'joined': True, u'gender':True, 'gender_cat': True})


df =  pandas.DataFrame(list(cursor))

In [2]:
indexed_df = df.set_index(['joined'])
males = indexed_df[indexed_df['gender']=='Male']
females = indexed_df[indexed_df['gender']=='Female']
unknown = indexed_df[indexed_df['gender']=='Unknown']

In [3]:
from dateutil.relativedelta import relativedelta
import datetime

def aggregate_semesters(df):
    # Counts users by join date in consecutive six-month windows.
    # Relies on the module-level males, females and indexed_df defined above.
    maxi = max(df.joined).date()
    begin = min(df.joined).date()
    end = begin + relativedelta(months=+6)

    # Collect one dict per window and build the DataFrame once at the end.
    rows = []

    while begin <= maxi:
        rows.append({"semester": begin,
                     "contrib_males": len(males[str(begin):str(end)].index),
                     "contrib_females": len(females[str(begin):str(end)].index),
                     "contrib": len(indexed_df[str(begin):str(end)].index)})

        begin = end + relativedelta(days=+1)
        end = begin + relativedelta(months=+6)

    return pandas.DataFrame(rows)

def male_proportion(row):
    return row['contrib_males'] / row['contrib']

def female_proportion(row):
    return row['contrib_females'] / row['contrib']
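
To see which date ranges the loop in aggregate_semesters visits, the window arithmetic can be reproduced on its own. This is a minimal sketch using only the standard library: add_months is a hypothetical stand-in for relativedelta(months=+6), semester_windows is a name introduced here, and the date range is made up for illustration.

```python
import calendar
import datetime


def add_months(day, months):
    """Shift a date by whole months, clamping to the month's last day."""
    month_index = day.month - 1 + months
    year = day.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return datetime.date(year, month, min(day.day, last_day))


def semester_windows(first, last):
    """List the (begin, end) pairs that the aggregation loop would visit."""
    windows = []
    begin = first
    while begin <= last:
        end = add_months(begin, 6)
        windows.append((begin, end))
        # The next window starts the day after the previous one ends.
        begin = end + datetime.timedelta(days=1)
    return windows


# Hypothetical date range, for illustration only.
windows = semester_windows(datetime.date(2009, 1, 1), datetime.date(2010, 1, 1))
for begin, end in windows:
    print(begin, end)
```

For this range the sketch yields two windows, 2009-01-01 to 2009-07-01 and 2009-07-02 to 2010-01-02. Since the end date is included in each slice, every window appears to cover six months plus one day, so the window boundaries drift forward by a day each semester, and the last window can extend past the latest join date.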

In [4]:
#aggregates data by semester
merged = aggregate_semesters(df)

#indexes dataframe by date
merged = merged.set_index(['semester'])

#calculating proportion of contributions by gender
merged['male_prop'] = merged.apply(male_proportion, axis=1)
merged['female_prop'] = merged.apply(female_proportion, axis=1)
