Predicting School District Performance


The Data Schoolers
Ashwin Deo, Aasta Frascati-Robinson, Bhanu Kanna, Brendan Law

Overview and Motivation

What factors have an impact on school district performance? We seek to learn whether we can predict graduation rate from numerous school district characteristics, and to understand which factors have little or no impact on performance. We also strive to group school districts into custom, team-built peer groups rather than solely by geography (nation, state, and school district). These peer groupings would draw on factors such as total students, student/teacher ratio, percent of children in poverty, district type, and location. We used the most current national graduation data we found, which was for the 2009-2010 school year, and kept the dataset years consistent across data sources.

The goal of predicting school district performance based on school environment is to inform parents and interested citizens of which factors in school districts influence key success indicators such as graduation rate. Identifying these factors would help school districts look at potential opportunities to improve. This topic was selected because of a passion for using technology to enhance education and a desire to give back to the education communities that have helped shape us. One team member would love to work in educational data science in the future.

Open education data is now being provided via several national, state, and local government portals. It is often up to the end user to piece together datasets across these portals to answer their questions, which is not something that a typical parent or interested citizen has the time or expertise to pursue. Instead, the data science community can support these users by melding these datasets and answering important education questions.

Dekker, Pechenizkiy, and Vleeshouwers built multiple models to predict Eindhoven University of Technology freshman dropout (2009, Educational Data Mining). We referenced this work to identify what types of models might be applicable for interpreting education data.

District Data Processbook (2_ProcessBookHiLoGradRate.ipynb)

https://github.com/ashwindeo/dataschoolers/blob/master/2_ProcessBookHiLoGradRate.ipynb

After rigorous cleaning of the data and creating indicators, we try a variety of classifiers to determine the best approach to predicting the graduation rate.

Regression Workbook (3_ProcessBookNumericalGradRate.ipynb)

https://github.com/ashwindeo/dataschoolers/blob/master/3_ProcessBookNumericalGradRate.ipynb

Based on feedback from our TF on 12/7, we attempted several forms of regression to see how well we could predict numerical graduation rate with school district characteristics alone, then with previous years' graduation rates alone, then with school district characteristics and previous years' graduation rates together.

One opportunity in this space is that the U.S. Department of Education typically delays making graduation rate data available. For instance, it is the 2015-2016 school year, and the most current graduation rate data available is for 2009-2010. If we could build a model to predict graduation rate for the years of missing data, organizations that rely on graduation rate data to provide services to schools could use this approximation until newer graduation rate data becomes available.

For this notebook, we pulled three previous years of graduation rate data (2006-2007, 2007-2008, and 2008-2009). We then built regression models to predict numerical graduation rate in four ways: first using school district data alone, then using historic graduation rate data alone, then using school district data and historic graduation rate data together, and lastly using a model built on 2006-2007 school district data and fed new 2009-2010 school district data to see how well it would predict the 2009-2010 numerical graduation rate.

We compared the models using mean squared error; the lower the mean squared error, the better the model.

SVM Feature Selected Classifiers (4_FinalSVMFeatureSelected.ipynb)

https://github.com/ashwindeo/dataschoolers/blob/master/4_FinalSVMFeatureSelected.ipynb

To enable visualizing which factors school districts could readily change and which would be more difficult to change, we needed two more runs of our best model, logistic regression with lasso, with the gender and ethnicity features removed: one classifying high graduation rate and one classifying low graduation rate.
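To illustrate why a lasso penalty is useful for feature selection, here is a small sketch, assuming "lasso" refers to an L1 penalty on the coefficients of a logistic regression. It is a dependency-free proximal gradient implementation written for this example, not the notebook's code; the function names, the toy feature matrix, and the `lam`, `lr`, and `steps` values are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(value, threshold):
    """Shrink value toward zero; anything within the threshold becomes exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_logistic(X, y, lam=0.1, lr=0.5, steps=500):
    """Proximal gradient descent on logistic loss + lam * sum(|w|).

    The bias term is left unpenalized.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        # Residual = predicted probability minus the 0/1 label.
        residuals = [
            sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b) - yi
            for row, yi in zip(X, y)
        ]
        grad_w = [sum(r * row[j] for r, row in zip(residuals, X)) / n for j in range(d)]
        grad_b = sum(residuals) / n
        # Gradient step, then the L1 proximal step (soft-thresholding).
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Hypothetical standardized features: column 0 separates the two classes,
# column 1 is unrelated to the label.
X = [[-2.0, 1.0], [-1.0, -1.0], [1.0, 1.0], [2.0, -1.0]]
y = [0, 0, 1, 1]

weights, bias = fit_l1_logistic(X, y)
print(weights)
```

In this toy setup the penalty keeps the weight on the unrelated column at exactly zero while the informative column receives a positive weight, which is the behavior that makes lasso-style models convenient for picking out prominent features.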

Visualization School Cleanup (5_VisualizationSchoolCleanupEDA.ipynb)

https://github.com/ashwindeo/dataschoolers/blob/master/5_VisualizationSchoolCleanupEDA.ipynb

This is the process book that covers how we loaded and cleaned the schools data for the year 2009-2010. We use the schools data in the visualization.

We did not use the schools data in our models because graduation rate data is not available nationwide at the school level. We found individual states or cities that made graduation rate data publicly available, yet it would have been too time-consuming to download from many different places. The school-level data is used for data visualization purposes only.

Grouping Workbook (6_VisualizationSchoolGrouping.ipynb)

https://github.com/ashwindeo/dataschoolers/blob/master/6_VisualizationSchoolGrouping.ipynb

We wanted the ability to compare each school district against similar school districts as well as against districts statewide. This notebook creates the groupings. The output files for these groupings were used in the Tableau visualization.

The columns for these groupings were chosen based on the New York State Education Department's definition of similar schools. Link: http://www.p12.nysed.gov/repcrd2004/information/similar-schools/guide.shtml
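One simple way to form peer groups, sketched below, is to bin each district into low/mid/high terciles on a few columns and treat districts with the same combination of bins as peers. This is only an illustration of the idea: the column names, the district values, the helper names, and the choice of terciles are assumptions made for this example, not the notebook's actual grouping rules.

```python
from statistics import quantiles


def tercile(value, cuts):
    """Map a value to 'low', 'mid', or 'high' given two cut points."""
    low_cut, high_cut = cuts
    if value <= low_cut:
        return "low"
    if value <= high_cut:
        return "mid"
    return "high"


def peer_groups(districts, columns):
    """Group district names by their tercile on each of the given columns."""
    cuts = {
        col: quantiles([d[col] for d in districts], n=3, method="inclusive")
        for col in columns
    }
    groups = {}
    for d in districts:
        key = tuple(tercile(d[col], cuts[col]) for col in columns)
        groups.setdefault(key, []).append(d["name"])
    return groups


# Hypothetical districts; NOT project data.
districts = [
    {"name": "A", "total_students": 500, "poverty_pct": 28},
    {"name": "B", "total_students": 900, "poverty_pct": 30},
    {"name": "C", "total_students": 3000, "poverty_pct": 8},
    {"name": "D", "total_students": 4500, "poverty_pct": 10},
    {"name": "E", "total_students": 20000, "poverty_pct": 27},
    {"name": "F", "total_students": 35000, "poverty_pct": 15},
]

groups = peer_groups(districts, ["total_students", "poverty_pct"])
for key, names in groups.items():
    print(key, names)
```

Each key is a tuple of bins in column order, so `("low", "high")` here means small enrollment and a high poverty percentage.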

Generate data needed for visualization in Tableau (7_VisualizationTableauData.ipynb)

https://github.com/ashwindeo/dataschoolers/blob/master/7_VisualizationTableauData.ipynb

This notebook cleans up the data for the Tableau visualization. We removed the unused fields and fixed the field names.

Overall Findings and Result

Extensive cleaning of the data was required in order to process it through the models. Fields were converted to indicators, and data that did not pertain to the results was dropped. In the end, we had over 10,000 districts throughout the United States with data that included student population information, revenue data, graduation rates, and miscellaneous information about the district that we found important.
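Converting a field to indicators means replacing a categorical column with one 0/1 column per category. The sketch below shows the idea in plain Python; the `locale` column, its values, and the helper name are hypothetical and chosen only for illustration.

```python
def to_indicators(rows, column):
    """Replace a categorical column with one 0/1 indicator column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every other field unchanged.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical district records; NOT project data.
districts = [
    {"district": "A", "locale": "city", "total_students": 12000},
    {"district": "B", "locale": "rural", "total_students": 800},
    {"district": "C", "locale": "suburb", "total_students": 5400},
]

encoded = to_indicators(districts, "locale")
print(encoded[0])
```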

Going into the project, we had little idea of what the results would look like or what the direction would ultimately be. The first step was to be certain that the data could accurately determine whether a district was generally above or below average in graduation rate. For this we used two methods.

In the first, we attempted to find the main contributing factors that put districts in the top or bottom quartile for graduation rate. Twelve different classifier models were implemented, and their ROC curves were used to analyze the true positive rate for a district being in the top quartile or the bottom quartile of graduation rates. This was also an attempt to find the features that were the most prominent while avoiding a random selection of features. Surprisingly, while many features did have relative importance, the percentage of free lunches the district provided stood out as a highly correlated feature.
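The quartile framing can be sketched as follows: label each district as top-quartile or bottom-quartile by graduation rate, then measure how well a score ranks the labeled districts using the area under the ROC curve. Here a single feature (share of students receiving free lunch) stands in for a classifier's score. The data is invented for illustration, the helper names are hypothetical, and the notebook's twelve classifiers are not reproduced.

```python
from statistics import quantiles


def quartile_labels(rates):
    """Return (is_top, is_bottom) 0/1 label lists based on quartile cut points."""
    q1, _median, q3 = quantiles(rates, n=4, method="inclusive")
    is_top = [1 if r >= q3 else 0 for r in rates]
    is_bottom = [1 if r <= q1 else 0 for r in rates]
    return is_top, is_bottom


def roc_auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a random
    positive outscores a random negative (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical districts; NOT project data.
grad_rates = [55.0, 62.0, 68.0, 74.0, 79.0, 83.0, 88.0, 96.0]
free_lunch_share = [0.80, 0.65, 0.70, 0.50, 0.45, 0.30, 0.35, 0.10]

is_top, is_bottom = quartile_labels(grad_rates)

# A low free lunch share is used as the score for "top quartile",
# a high share as the score for "bottom quartile".
auc_top = roc_auc(is_top, [-share for share in free_lunch_share])
auc_bottom = roc_auc(is_bottom, free_lunch_share)
print(f"AUC top quartile: {auc_top:.3f}, AUC bottom quartile: {auc_bottom:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so comparing AUC values is one way to summarize ROC curves across several classifiers.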

In the second stage, we attempted to predict the actual numerical graduation rates using the data provided, initially with only the current school district data and then combining it with graduation rates from previous years. As expected, the models that also used the previous graduation rates were far better at predicting the graduation rates. However, limiting the models to current school district data such as financials, location, and student-teacher ratio gives much greater insight into what can be done to improve a district's performance.

Ultimately, many of the prominent features were centered around gender and ethnicity. We determined that these characteristics, while critical to understanding graduation rates, could not be changed in order to affect graduation levels. To better understand other factors, we also excluded gender and ethnicity in a separate model and found that a lower percentage of students receiving free lunch and a higher level of local funding were correlated with stronger graduation rates. Further analysis could be undertaken to determine whether this indicates that poorer communities struggle to graduate students or that more local funding could help increase graduation levels.

We hope that this analysis can be used by districts struggling with graduation rate to improve their performance. Similarly, it can be useful for new districts that want more information about what may be required to succeed, and for interested parties that want to understand and visualize the success of their district and those around the country.

