In [1]:
from IPython.display import display
from IPython.display import HTML
import IPython.core.display as di # Example: di.display_html('<h3>%s:</h3>' % str, raw=True)
# This line will hide code by default when the notebook is exported as HTML
di.display_html('<script>jQuery(function() {if (jQuery("body.notebook_app").length == 0) { jQuery(".input_area").toggle(); jQuery(".prompt").toggle();}});</script>', raw=True)
# This line will add a button to toggle visibility of code blocks, for use with the HTML export version
di.display_html('''<button onclick="jQuery('.input_area').toggle(); jQuery('.prompt').toggle();">Toggle code</button>''', raw=True)
I am interested in identifying the leading indicators of experience, broken down by gender, in introductory CS at an elite research university like Berkeley. In short, I want to find the attributes that split the dataset as purely as possible into male and female.
To solve this problem, I will undertake the following course of action: frame the problem as a classification task, then explore a couple of classifiers that might be well suited for the problem at hand. This notebook tackles the first part of that plan: loading, preprocessing, and exploring the data. A follow-up notebook will cover building and evaluating the classifiers.
In [2]:
%pylab inline
In [3]:
# Import libraries
from __future__ import division
import sys
sys.path.append('tools/')
import numpy as np
import pandas as pd
import pickle
import tools
# Graphing Libraries
import matplotlib.pyplot as pyplt
import seaborn as sns
sns.set_style("white")
Let's go ahead and read in the student dataset. There are two functions that support this dataset:
dataLookUp(surveyItemCode)
This function takes a string that is a coded survey item. For example, if you execute dataLookUp('atcs_1'), it prints out the corresponding survey question, I like to use computer science to solve problems.
dataDescr()
This function gives you a general introduction to the dataset. Note: The majority of the questionnaire uses a 5-point Likert scale (where 1 = Strongly Disagree, 3 = Neutral, and 5 = Strongly Agree).
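The real dataLookUp lives in the project's tools module; a minimal sketch of how such a lookup might work is below. The dictionary is a hypothetical two-item subset of the codebook (only the atcs_1 text appears in this notebook; the blg_1 text is invented for illustration).

```python
# Hypothetical subset of the survey codebook; the real mapping lives in
# tools.py and covers the full questionnaire.
CODEBOOK = {
    'atcs_1': 'I like to use computer science to solve problems.',
    'blg_1': 'I feel like I belong in this class.',  # invented example text
}

def data_look_up(survey_item_code):
    """Print the survey question for a coded item, mimicking tools.dataLookUp."""
    question = CODEBOOK.get(survey_item_code, 'Unknown survey item code.')
    print(question)
    return question
```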
In [4]:
# Load the student data. For this project we will restrict the analysis to male and female gender.
dataset = tools.preprocess()
dataset = dataset.query('gender == "Female" or gender == "Male"') #load rows with binary gender
dataset = dataset.reset_index(drop=True)
In [5]:
# Use function to view data description
tools.dataDescr()
In [6]:
print dataset.head()
To prepare the data for classification, I need to devise a scheme to transform all features into numeric data. This dataset has several non-numeric columns that need converting. Many of them are simply yes/no, e.g. prcs_2. I can reasonably convert these into 1/0 (binary) values. For the columns whose values are NaN, I will convert these to the mean of the column.
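As a quick illustration of that scheme on a made-up two-column frame (not the real survey data):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the survey layout: a yes/no column and a numeric
# Likert column with one missing response.
df = pd.DataFrame({'prcs_2': ['Yes', 'No', 'Yes'],
                   'atcs_1': [5.0, np.nan, 3.0]})

# Map yes/no strings to binary values (the notebook uses Series.replace
# for the same effect).
df['prcs_2'] = df['prcs_2'].map({'Yes': 1, 'No': 0})

# Replace the remaining NaN with the column mean: (5.0 + 3.0) / 2 = 4.0.
df = df.fillna(df.mean())
print(df)
```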
In [7]:
# Find features that have any missing values and list their percentages
print "{:^40}".format("FEATURES WITH MISSING VALUES")
tools.find_missing_values(dataset)
In [8]:
# Preprocess feature columns
def preprocess_features(X):
    outX = pd.DataFrame(index=X.index)  # output dataframe, initially empty

    # Check each column
    for col, col_data in X.iteritems():
        # If data type is non-numeric, try to replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['Yes', 'No'], [1, 0])
            # Note: This should change the data type for yes/no columns to int

        # If still non-numeric, convert to one or more dummy variables
        if col_data.dtype == object:
            # e.g. 'reason' => 'reason_class_Interested', 'reason_class_Other'
            col_data = pd.get_dummies(col_data, prefix=col)

        outX = outX.join(col_data)  # collect column(s) in output dataframe

    outX.fillna(outX.mean(), inplace=True)  # set all NaN <missing> values to mean of the col
    return outX
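For the columns that are still non-numeric after the yes/no mapping, pd.get_dummies expands each distinct value into its own indicator column. A toy example (the 'reason' column name and its values are illustrative, borrowed from the comment above):

```python
import pandas as pd

# Toy categorical column standing in for a real survey column.
s = pd.Series(['Interested', 'Other', 'Interested'], name='reason')

# One indicator column per distinct value, prefixed with the source column name.
dummies = pd.get_dummies(s, prefix='reason')
print(dummies.columns.tolist())  # ['reason_Interested', 'reason_Other']
```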
In [9]:
dataset = preprocess_features(dataset)
In [10]:
# Preprocess feature columns - Rename columns
# Some columns have whitespace in their names, which makes it difficult for
# the tree-plotting algorithms we will use later to graph these features.
# As a result, we will replace the whitespace with underscores.
dataset.rename(columns = {'grade_B or above':'grade_B_or_above'}, inplace = True)
dataset.rename(columns = {'grade_B or below':'grade_B_or_below'}, inplace = True)
In [11]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(dataset), columns=dataset.columns)
dataset = df_scaled
dataset.tail()
Out[11]:
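MinMaxScaler rescales each column to [0, 1] via (x - min) / (max - min). A quick check of that formula against a toy Likert column (made-up values, not the survey data):

```python
import numpy as np

# Toy column on the original 1-5 Likert scale.
col = np.array([1.0, 3.0, 5.0, 4.0])

# The per-column transform that MinMaxScaler applies.
scaled = (col - col.min()) / (col.max() - col.min())
print(scaled)  # 1 -> 0.0, 3 -> 0.5, 5 -> 1.0, 4 -> 0.75
```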
As an aid in understanding the data, I will create 'coded' dimensions that I am interested in investigating.
These dimensions are as follows:
mtr: The role of mentorship
prcs: The role of prior CS exposure
atcs: The role of self-reported attitude about CS competency
atct: The role of self-reported attitudes about computational thinking
blg: The role of self-reported belonging in the classroom
clet: The role of social implications and ethics
atcsgender: The role of gendered notions of intelligence
atcsjob: The role of career-driven beliefs about CS
cltrcmp: The role of cultural competency
priorcs10: The role of CS10
In [12]:
mtr = ['mtr_1', 'mtr_2', 'mtr_3'] # CS Mentors
prcs = ['prcs_1', 'prcs_2', 'prcs_3', 'prcs_4', 'prcs_5'] # Prior CS Exposure
atcs = ['atcs_1', 'atcs_2', 'atcs_3', 'atcs_5', 'atcs_4',
'atcs_6', 'atcs_7', 'atcs_8', 'atcs_9']# self reported attitude about CS competency
atct = ['atct_1', 'atct_2', 'atct_3', 'atct_4',
'atct_5', 'atct_6', 'atct_7', 'atct_8'] # Self reported attitudes about computational thinking
blg = ['blg_1', 'blg_2', 'blg_3', 'blg_4'] # Sense of belonging in the class room
clet = ['clet_1', 'clet_2'] # Social implications and ethics
atcsgender = ['atcsgender_1', 'atcsgender_2', 'atcsgender_3']
atcsjob = ['atcsjob_1', 'atcsjob_2']
cltrcmp = ['cltrcmp_1', 'cltrcmp_2'] # Cultural competency
priorcs10 = 'priorcs10' # had taken CS10 prior
I created a density estimation for some dimensions in the data to gain an understanding of the variables and determine if I need to reject some of them, or collapse others. The distributions of most of the dimensions looked very similar to that of atcs. Most of the data is either skewed to the left or skewed to the right. As a result, I rejected using descriptive statistics to summarize the data in favor of quantiles represented by boxplots.
In [13]:
dataset[atcs].plot(kind='kde');
x = [-0.5, 0.0, 0.5, 1.0, 1.5]
labels = ["", "Strongly Disagree", "Neutral", "Strongly Agree" , ""]
pyplt.xticks(x, labels)
pyplt.xlabel('SURVEY RESPONSES')
pyplt.title('DENSITY ESTIMATION OF COMPUTER SCIENCE ABILITY: ATCS')
pyplt.legend(loc='upper right', shadow=True, fontsize='medium')
pyplt.savefig('report/figures/atcs.png', dpi=100)
pyplt.close()
In [14]:
dataset[atcs].plot.box();
y = np.arange(0,1.1,0.25)
labels = ['{} percentile'.format(int(i*100)) for i in y]
pyplt.yticks(y, labels)
pyplt.title('QUANTILES OF COMPUTER SCIENCE ABILITY: ATCS')
pyplt.savefig('report/figures/atcs_quantile.png', dpi=200)
pyplt.close()
So what does the boxplot of the atcs dimension tell us about the data? From the generated figure, we can see that the median of this dimension is approximately at the 75th percentile, which, given our Likert-scale dataset, means most students generally agree with the mostly positive attitudinal questions asked about their CS beliefs. Attitudes about computational thinking show a similar pattern.
In [15]:
dataset[atct].plot.box();
y = np.arange(0,1.1,0.25)
labels = ['{} percentile'.format(int(i*100)) for i in y]
pyplt.yticks(y, labels)
pyplt.title('QUANTILES OF COMPUTATIONAL THINKING ABILITY: ATCT')
pyplt.savefig('report/figures/atct_quantile.png', dpi=300)
pyplt.close()
When it comes to belonging, we see a different pattern. The majority of students feel like they belong, but most of them are neutral when the belonging questions become more specific.
In [16]:
dataset[blg].plot.box();
y = np.arange(0,1.1,0.25)
labels = ['{} percentile'.format(int(i*100)) for i in y]
pyplt.yticks(y, labels)
pyplt.title('QUANTILES OF BELONGING: BLG')
pyplt.savefig('report/figures/blg_quantile.png', dpi=100)
pyplt.close()
From a plot of its density estimation, I can see that the distribution for this dimension is heavily skewed to the right, i.e., most students strongly disagree with the statements. That does not come as a surprise; what I found fascinating is that the median for atcsgender_2 is at the 25th percentile, which corresponds to neutral. You can see this in the boxplot. While students do not agree that women are smarter than men, half of them are undecided about this statement!
In [17]:
dataset[atcsgender].plot(kind='kde');
x = [-0.5, 0.0, 0.5, 1.0, 1.5]
labels = ["", "Strongly Disagree", "Neutral", "Strongly Agree" , ""]
pyplt.xticks(x, labels)
pyplt.xlabel('SURVEY RESPONSES')
pyplt.title('DENSITY ESTIMATION OF GENDERED NOTIONS OF INTELLIGENCE: ATCSGENDER')
pyplt.legend(loc='upper right', shadow=True, fontsize='medium')
pyplt.savefig('report/figures/atcsgender.png', dpi=100)
pyplt.close()
In [18]:
dataset[atcsgender].plot.box();
y = np.arange(0,1.1,0.25)
labels = ['{} percentile'.format(int(i*100)) for i in y]
pyplt.yticks(y, labels)
pyplt.title('QUANTILES OF GENDERED NOTIONS OF INTELLIGENCE: ATCSGENDER')
pyplt.savefig('report/figures/atcsgender_quantile.png', dpi=100)
pyplt.close()
In [19]:
target_col = dataset['gender_Female'] # column is the target/label
y = target_col # corresponding targets/labels
print "\nLabel values:-"
print y.head()
In [20]:
X = dataset.drop(['gender_Female', 'gender_Male'], axis=1, inplace=False)
print "\nFeature values:-"
print X.head()
In [27]:
y.plot.hist()
x = [0.05, 0.2, 0.4, 0.6, 0.8, 0.95]
labels = ["Male", "", "", "", "", "Female"]
pyplt.xticks(x, labels)
pyplt.grid(False)
_= pyplt.xlabel('VALUE OF TARGET LABEL')
_= pyplt.ylabel('COUNT')
_= pyplt.title('HISTOGRAM OF TARGET CLASS')
_= pyplt.yticks(np.arange(0, 700, 100))
pyplt.savefig('report/figures/targetClass.png', dpi=100)
pyplt.close()
In [22]:
num_male = y.tolist().count(0)
num_female = y.tolist().count(1)
print "Number of males in data", num_male
print "Number of females in data", num_female
print "Ratio of males to females {}".format(num_male/ num_female)
In [23]:
# Save dataframes to file
X.to_pickle('data/features.pickle.dat')
y.to_pickle('data/labels.pickle.dat')
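The pickled frames can be read back later with pandas' matching loader. A small round-trip sketch using a throwaway frame and a temporary file (the path is illustrative, not the project's data/ directory):

```python
import os
import tempfile
import pandas as pd

# Round-trip a small frame through to_pickle / read_pickle.
df = pd.DataFrame({'atcs_1': [0.0, 0.5, 1.0]})
path = os.path.join(tempfile.mkdtemp(), 'features.pickle.dat')
df.to_pickle(path)
restored = pd.read_pickle(path)
print(restored.equals(df))  # True if the round trip preserved the data
```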
In [ ]: