In [1]:
from IPython.display import display
from IPython.display import HTML
import IPython.core.display as di # Example: di.display_html('<h3>%s:</h3>' % str, raw=True)
# This line will hide code by default when the notebook is exported as HTML
di.display_html('<script>jQuery(function() {if (jQuery("body.notebook_app").length == 0) { jQuery(".input_area").toggle(); jQuery(".prompt").toggle();}});</script>', raw=True)
# This line will add a button to toggle visibility of code blocks, for use with the HTML export version
di.display_html('''<button onclick="jQuery('.input_area').toggle(); jQuery('.prompt').toggle();">Toggle code</button>''', raw=True)
I am interested in identifying the leading indicators of experience, broken down by gender, in introductory CS at an elite research university like Berkeley. In short, I want to find the attributes that split the dataset as purely as possible into male and female.
To solve this problem, I will undertake the following course of action: frame the problem as a classification task, then explore a couple of classifiers that might be well suited for the problem at hand. This notebook tackles the first part of that plan: loading, preprocessing, and exploring the data. A follow-up notebook will cover building and evaluating the classifiers.
In [2]:
%pylab inline
In [3]:
# Import libraries
from __future__ import division
import sys
sys.path.append('tools/')
import numpy as np
import pandas as pd
import pickle
import tools
# Graphing Libraries
import matplotlib.pyplot as pyplt
import seaborn as sns
sns.set_style("white")
Let's go ahead and read in the student dataset. There are two functions that support this dataset:
dataLookUp(surveyItemCode)
This function takes a string that is a coded survey item. For example, if you execute dataLookUp('atcs_1'), it prints out the corresponding survey question, I like to use computer science to solve problems.
dataDescr()
This function gives you a general introduction to the dataset. Note: The majority of the questionnaire uses a 5-point Likert scale (where 1 = Strongly Disagree, 3 = Neutral, and 5 = Strongly Agree).
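The real dataLookUp lives in the project's tools module; a minimal sketch of how such a lookup might work is below. The dictionary is a hypothetical two-item subset of the codebook (only the atcs_1 text appears in this notebook; the blg_1 text is invented for illustration).

```python
# Hypothetical subset of the survey codebook; the real mapping lives in
# tools.py and covers the full questionnaire.
CODEBOOK = {
    'atcs_1': 'I like to use computer science to solve problems.',
    'blg_1': 'I feel like I belong in this class.',  # invented example text
}

def data_look_up(survey_item_code):
    """Print the survey question for a coded item, mimicking tools.dataLookUp."""
    question = CODEBOOK.get(survey_item_code, 'Unknown survey item code.')
    print(question)
    return question
```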
In [4]:
# Load the student data. For this project we will restrict the analysis to male and female gender.
dataset = tools.preprocess()
dataset = dataset.query('gender == "Female" or gender == "Male"') #load rows with binary gender
dataset = dataset.reset_index(drop=True)
In [5]:
# Use function to view data description
tools.dataDescr()
In [6]:
print dataset.head()
To prepare the data for classification, I need to devise a scheme to transform all features into numeric data. This dataset has several non-numeric columns that need converting. Many of them are simply yes/no, e.g. prcs_2. I can reasonably convert these into 1/0 (binary) values. For the columns whose values are NaN, I will convert these to the mean of the column.
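As a quick illustration of that scheme on a made-up two-column frame (not the real survey data):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the survey layout: a yes/no column and a numeric
# Likert column with one missing response.
df = pd.DataFrame({'prcs_2': ['Yes', 'No', 'Yes'],
                   'atcs_1': [5.0, np.nan, 3.0]})

# Map yes/no strings to binary values (the notebook uses Series.replace
# for the same effect).
df['prcs_2'] = df['prcs_2'].map({'Yes': 1, 'No': 0})

# Replace the remaining NaN with the column mean: (5.0 + 3.0) / 2 = 4.0.
df = df.fillna(df.mean())
print(df)
```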
In [7]:
# Find features that have any missing values and list their percentages
print "{:^40}".format("FEATURES WITH MISSING VALUES")
tools.find_missing_values(dataset)
In [8]:
# Preprocess feature columns
def preprocess_features(X):
    outX = pd.DataFrame(index=X.index)  # output dataframe, initially empty

    # Check each column
    for col, col_data in X.iteritems():
        # If data type is non-numeric, try to replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['Yes', 'No'], [1, 0])
            # Note: This should change the data type for yes/no columns to int

        # If still non-numeric, convert to one or more dummy variables
        if col_data.dtype == object:
            # e.g. 'reason' => 'reason_class_Interested', 'reason_class_Other'
            col_data = pd.get_dummies(col_data, prefix=col)

        outX = outX.join(col_data)  # collect column(s) in output dataframe

    outX.fillna(outX.mean(), inplace=True)  # set all NaN <missing> values to mean of the col
    return outX
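For the columns that are still non-numeric after the yes/no mapping, pd.get_dummies expands each distinct value into its own indicator column. A toy example (the 'reason' column name and its values are illustrative, borrowed from the comment above):

```python
import pandas as pd

# Toy categorical column standing in for a real survey column.
s = pd.Series(['Interested', 'Other', 'Interested'], name='reason')

# One indicator column per distinct value, prefixed with the source column name.
dummies = pd.get_dummies(s, prefix='reason')
print(dummies.columns.tolist())  # ['reason_Interested', 'reason_Other']
```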
In [9]:
dataset = preprocess_features(dataset)
In [10]:
# Preprocess feature columns - Rename columns
# Some columns have whitespace in their names, which makes it difficult for
# the tree-plotting algorithms we will use later to graph these features.
# As a result, we will replace the whitespace with underscores.
dataset.rename(columns = {'grade_B or above':'grade_B_or_above'}, inplace = True)
dataset.rename(columns = {'grade_B or below':'grade_B_or_below'}, inplace = True)
In [11]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(dataset), columns=dataset.columns)
dataset = df_scaled
dataset.tail()
Out[11]:
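MinMaxScaler rescales each column to [0, 1] via (x - min) / (max - min). A quick check of that formula against a toy Likert column (made-up values, not the survey data):

```python
import numpy as np

# Toy column on the original 1-5 Likert scale.
col = np.array([1.0, 3.0, 5.0, 4.0])

# The per-column transform that MinMaxScaler applies.
scaled = (col - col.min()) / (col.max() - col.min())
print(scaled)  # 1 -> 0.0, 3 -> 0.5, 5 -> 1.0, 4 -> 0.75
```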
As an aid in understanding the data, I will create 'coded' dimensions that I am interested in investigating.
These dimensions are as follows:
mtr: The role of mentorship
prcs: The role of prior CS exposure
atcs: The role of self-reported attitude about CS competency
atct: The role of self-reported attitudes about computational thinking
blg: The role of self-reported belonging in the classroom
clet: The role of social implications and ethics
atcsgender: The role of gendered notions of intelligence
atcsjob: The role of career-driven beliefs about CS
cltrcmp: The role of cultural competency
priorcs10: The role of CS10
In [12]:
mtr = ['mtr_1', 'mtr_2', 'mtr_3'] # CS Mentors
prcs = ['prcs_1', 'prcs_2', 'prcs_3', 'prcs_4', 'prcs_5'] # Prior CS Exposure
atcs = ['atcs_1', 'atcs_2', 'atcs_3', 'atcs_5', 'atcs_4',
'atcs_6', 'atcs_7', 'atcs_8', 'atcs_9']# self reported attitude about CS competency
atct = ['atct_1', 'atct_2', 'atct_3', 'atct_4',
'atct_5', 'atct_6', 'atct_7', 'atct_8'] # Self reported attitudes about computational thinking
blg = ['blg_1', 'blg_2', 'blg_3', 'blg_4'] # Sense of belonging in the class room
clet = ['clet_1', 'clet_2'] # Social implications and ethics
atcsgender = ['atcsgender_1', 'atcsgender_2', 'atcsgender_3']
atcsjob = ['atcsjob_1', 'atcsjob_2']
cltrcmp = ['cltrcmp_1', 'cltrcmp_2'] # Cultural competency
priorcs10 = 'priorcs10' # had taken CS10 prior
I created a density estimation for some dimensions in the data to gain an understanding of the variables and determine if I need to reject some of them, or collapse others. The distributions of most of the dimensions looked very similar to that of atcs. Most of the data is either skewed to the left or skewed to the right. As a result, I rejected using descriptive statistics to summarize the data in favor of quantiles represented by boxplots.
In [13]:
dataset[atcs].plot(kind='kde');
x = [-0.5, 0.0, 0.5, 1.0, 1.5]
labels = ["", "Strongly Disagree", "Neutral", "Strongly Agree" , ""]
pyplt.xticks(x, labels)
pyplt.xlabel('SURVEY RESPONSES')
pyplt.title('DENSITY ESTIMATION OF COMPUTER SCIENCE ABILITY: ATCS')
pyplt.legend(loc='upper right', shadow=True, fontsize='medium')
pyplt.savefig('report/figures/atcs.png', dpi=100)
pyplt.close()
In [14]:
dataset[atcs].plot.box();
y = np.arange(0,1.1,0.25)
labels = ['{} percentile'.format(int(i*100)) for i in y]
pyplt.yticks(y, labels)
pyplt.title('QUANTILES OF COMPUTER SCIENCE ABILITY: ATCS')
pyplt.savefig('report/figures/atcs_quantile.png', dpi=200)
pyplt.close()
So what does the boxplot of the atcs dimension tell us about the data? From the generated figure, we can see that the median of this dimension is approximately at the 75th percentile, which, given our Likert-scale dataset, means most students generally agree with the mostly positive attitudinal questions asked about their CS beliefs. Attitudes about computational thinking show a similar pattern.
In [15]:
dataset[atct].plot.box();
y = np.arange(0,1.1,0.25)
labels = ['{} percentile'.format(int(i*100)) for i in y]
pyplt.yticks(y, labels)
pyplt.title('QUANTILES OF COMPUTATIONAL THINKING ABILITY: ATCT')
pyplt.savefig('report/figures/atct_quantile.png', dpi=300)
pyplt.close()
When it comes to belonging, we see a different pattern. The majority of students feel like they belong, but most of them are neutral when the belonging questions become more specific.
In [16]:
dataset[blg].plot.box();
y = np.arange(0,1.1,0.25)
labels = ['{} percentile'.format(int(i*100)) for i in y]
pyplt.yticks(y, labels)
pyplt.title('QUANTILES OF BELONGING: BLG')
pyplt.savefig('report/figures/blg_quantile.png', dpi=100)
pyplt.close()
From a plot of its density estimation, I can see that the distribution for this dimension is heavily skewed to the right, i.e., most students strongly disagree with the statements. That does not come as a surprise; what I found fascinating is that the median for atcsgender_2 is at the 25th percentile, which corresponds to neutral. You can see this in the boxplot. While students do not agree that women are smarter than men, half of them are undecided about this statement!
In [17]:
dataset[atcsgender].plot(kind='kde');
x = [-0.5, 0.0, 0.5, 1.0, 1.5]
labels = ["", "Strongly Disagree", "Neutral", "Strongly Agree" , ""]
pyplt.xticks(x, labels)
pyplt.xlabel('SURVEY RESPONSES')
pyplt.title('DENSITY ESTIMATION OF GENDERED NOTIONS OF INTELLIGENCE: ATCSGENDER')
pyplt.legend(loc='upper right', shadow=True, fontsize='medium')
pyplt.savefig('report/figures/atcsgender.png', dpi=100)
pyplt.close()
In [18]:
dataset[atcsgender].plot.box();
y = np.arange(0,1.1,0.25)
labels = ['{} percentile'.format(int(i*100)) for i in y]
pyplt.yticks(y, labels)
pyplt.title('QUANTILES OF GENDERED NOTIONS OF INTELLIGENCE: ATCSGENDER')
pyplt.savefig('report/figures/atcsgender_quantile.png', dpi=100)
pyplt.close()
In [19]:
target_col = dataset['gender_Female'] # column is the target/label
y = target_col # corresponding targets/labels
print "\nLabel values:-"
print y.head()
In [20]:
X = dataset.drop(['gender_Female', 'gender_Male'], axis=1, inplace=False)
print "\nFeature values:-"
print X.head()
In [27]:
y.plot.hist()
x = [0.05, 0.2, 0.4, 0.6, 0.8, 0.95]
labels = ["Male", "", "", "", "", "Female"]
pyplt.xticks(x, labels)
pyplt.grid(False)
_= pyplt.xlabel('VALUE OF TARGET LABEL')
_= pyplt.ylabel('COUNT')
_= pyplt.title('HISTOGRAM OF TARGET CLASS')
_= pyplt.yticks(np.arange(0, 700, 100))
pyplt.savefig('report/figures/targetClass.png', dpi=100)
pyplt.close()
In [22]:
num_male = y.tolist().count(0)
num_female = y.tolist().count(1)
print "Number of males in data", num_male
print "Number of females in data", num_female
print "Ratio of males to females {}".format(num_male/ num_female)
In [23]:
# Save dataframes to file
X.to_pickle('data/features.pickle.dat')
y.to_pickle('data/labels.pickle.dat')
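The pickled frames can be read back later with pandas' matching loader. A small round-trip sketch using a throwaway frame and a temporary file (the path is illustrative, not the project's data/ directory):

```python
import os
import tempfile
import pandas as pd

# Round-trip a small frame through to_pickle / read_pickle.
df = pd.DataFrame({'atcs_1': [0.0, 0.5, 1.0]})
path = os.path.join(tempfile.mkdtemp(), 'features.pickle.dat')
df.to_pickle(path)
restored = pd.read_pickle(path)
print(restored.equals(df))  # True if the round trip preserved the data
```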
In [ ]: