In this notebook, our goal is to produce a useful and descriptive model that identifies students who may not finish high school on time. We do this using data science methods that produce predictive scores from input student data. Given a set of students, we split them into two disjoint groups: the first group is used to train the predictive model, while the second is held out to evaluate its performance.
First, let's import some Python packages. pandas is a Python package providing data structures designed to make working with labeled data fast and easy. We also import our high school prediction pipeline, hspipeline, along with the relevant functions it provides.
Second, we can define some basic constants. The input and output directories (input_dir and output_dir, respectively) determine where input (if applicable) is expected and where output is written.
In [ ]:
import os

import pandas as pd
# Import experimental configuration.
import config
# Import pipeline modules, and database utilities.
from hspipeline.pipeline import *
from hspipeline.utils.database import connect as db_connect

# In a notebook, __file__ is not defined; this idiom resolves to the
# current working directory.
BASE_PATH = os.path.dirname(os.path.abspath('__file__'))
input_dir = os.path.join(BASE_PATH, 'input')
output_dir = os.path.join(BASE_PATH, 'output')
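As a small convenience (an added safeguard, not part of the original pipeline; run_models may create output_dir on its own), we can make sure both directories exist before anything is read or written:
In [ ]:
# Create the input and output directories if they do not already exist.
os.makedirs(input_dir, exist_ok=True)
os.makedirs(output_dir, exist_ok=True)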
Next, we define modeling parameters.
unit_col specifies the field that defines an individual instance or record.
time_col specifies the field that defines an individual timestep.
cohort_col specifies the field that determines how individual instances or records are grouped. This is assumed to be a cohort year and is used to separate instances (i.e., students) into disjoint training and testing sets.
label_col specifies the field that is used as the target variable for prediction.
The fetch_data function either retrieves data from a saved Python pickle file or fetches data for the specified district from a PostgreSQL database. It returns data, in which each unit_col is represented as a row of features along with its label_col, and, if applicable, X_categories, the categories associated with each feature.
In [ ]:
unit_col = 'student_id'
time_col = 'grade_level'
cohort_col = 'cohort'
label_col = 'label'
district = 'vps'
#pickle_filename = 'data_all_2015_08_20_vps'
data, X_categories = fetch_data(district=district,
                                from_pickle=False,
                                # pickle_filename=os.path.join(input_dir, pickle_filename),
                                unit_col=unit_col,
                                time_col=time_col)
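Before moving on, it can help to sanity-check what came back (a minimal check, assuming data is a pandas DataFrame with one row per unit_col, as described above):
In [ ]:
# Quick sanity check: dimensions and a preview of the fetched data.
print(data.shape)
data.head()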
Next, we can perform basic pre-processing on the data prior to modeling. Here, we drop date of birth from the feature list, as well as district-specific fields. We define the features or prediction variables, X_cols, ensuring that the label, unit, and time fields are excluded. We also define the target variable, y_col.
In [ ]:
# Drop date of birth, which is not used as a feature.
data = data.drop(columns=['date_of_birth'])
# Drop district-specific fields.
if district == 'vps':
    pass
elif district == 'wcpss':
    data = data.drop(columns=['age_first_entered_wcpss',
                              'age_entered_into_us'])
X_cols = data.columns[(data.columns != label_col) &
                      (data.columns != unit_col) &
                      (data.columns != time_col)]
y_col = label_col
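As a quick check (an added assertion, not part of the original pipeline), we can confirm that the reserved identifier, time, and label fields were in fact excluded from the feature set:
In [ ]:
# Confirm that none of the reserved columns leaked into the feature list.
assert label_col not in X_cols
assert unit_col not in X_cols
assert time_col not in X_cols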
Finally, we run a sweep of predefined models via our run_models function. This function will write a variety of output to the specified output_dir.
summary.csv contains one row for each model configuration in the model sweep, providing a summary of the performance results. Within the output directory, a subdirectory, predictions, is created that provides additional information for each model. At the leaf of a nested folder structure, a file is written for each model configuration in the sweep. This file contains a row for each unit_col (i.e., student) in that model's testing data, with four columns: the actual label_col value (i.e., the true label), the predicted label value, the predicted probability of the positive label value, and the unit_col identifier.
Additionally, a summary_dummy.csv file may be created with baseline, "dummy classifier" results.
In [ ]:
run_models(data, unit_col, time_col, cohort_col, X_cols, y_col,
           X_categories=X_categories, output_dir=output_dir)
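Once the sweep finishes, we can load the summary to compare model configurations (a minimal sketch; it assumes only that summary.csv is a standard CSV written to output_dir, as described above):
In [ ]:
# Load the sweep summary and inspect the first few model configurations.
summary = pd.read_csv(os.path.join(output_dir, 'summary.csv'))
summary.head()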