In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
%matplotlib inline
Let's load the data and give it a quick look.
In [15]:
df = pd.read_csv('data/apib12tx.csv')
In [16]:
df.describe()
Out[16]:
Let's start looking at how variables in our dataset relate to each other so we know what to expect when we start modeling.
In [17]:
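# restrict to numeric columns; recent pandas versions raise on text columns such as school names
df.corr(numeric_only=True)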
Out[17]:
The percentage of students enrolled in free/reduced-price lunch programs (the MEALS column) is often used as a proxy for poverty.
In [18]:
df.plot(kind="scatter", x="MEALS", y="API12B")
Out[18]:
Parental education works the other way: the education level of a student's parents (the AVG_ED column) is often a good predictor of how well a student will do in school.
In [19]:
df.plot(kind="scatter", x="AVG_ED", y="API12B")
Out[19]:
As we did last week, we'll use scikit-learn to run basic single-variable regressions. Let's start by looking at California's Academic Performance Index (API) as it relates to the percentage of students, per school, enrolled in free/reduced-price lunch programs.
In [20]:
data = np.asarray(df[['API12B','MEALS']])
x, y = data[:, 1:], data[:, 0]
In [21]:
lr = LinearRegression()
lr.fit(x, y)
Out[21]:
In [22]:
# slope of the fitted line: change in API for each one-point increase in MEALS
lr.coef_
Out[22]:
In [ ]:
# R^2: the share of the variance in API that MEALS alone explains
lr.score(x, y)
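To make the two numbers above less opaque, here is a minimal sketch of what `coef_`, `intercept_` and `score` report. The five data points are invented for illustration and do not come from `apib12tx.csv`; the names `meals`, `api` and `toy_lr` are hypothetical. It recomputes R² by hand and compares it with what `score` returns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy numbers invented for illustration; they are NOT from apib12tx.csv.
meals = np.array([[10.0], [30.0], [50.0], [70.0], [90.0]])  # 2-D: (n_samples, 1)
api = np.array([880.0, 850.0, 760.0, 740.0, 690.0])

toy_lr = LinearRegression().fit(meals, api)

slope = toy_lr.coef_[0]        # change in API per one-point rise in MEALS
intercept = toy_lr.intercept_  # predicted API when MEALS is 0

# R^2 by hand: 1 - (residual sum of squares / total sum of squares)
pred = toy_lr.predict(meals)
ss_res = np.sum((api - pred) ** 2)
ss_tot = np.sum((api - api.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(slope, intercept, r2_manual, toy_lr.score(meals, api))
```

A negative slope means API falls as the share of students on free/reduced-price lunch rises, and the hand-computed R² should match `score` up to floating-point error.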
In [23]:
# scatter the schools, then draw the fitted regression line on top
plt.scatter(x, y, color='blue')
plt.plot(x, lr.predict(x), color='red', linewidth=1)
Out[23]:
In our naive universe, where we're paying attention to only two variables (academic performance and free/reduced-price lunch enrollment), we can clearly see that some schools are performing better than the model predicts for their level of poverty.
A handful, in particular, seem to be dramatically overperforming. Let's look at them:
In [24]:
df[(df['MEALS'] >= 80) & (df['API12B'] >= 900)]
Out[24]:
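The filter above picks its thresholds by eye. One alternative, sketched here under assumptions, is to rank schools by their residual: actual API minus the API the regression predicts. The frame below is toy data invented for illustration, `SNAME` is a hypothetical label column, and `PREDICTED` and `RESIDUAL` are names introduced only for this example.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy frame invented for illustration; it is NOT a sample of apib12tx.csv.
# SNAME is a hypothetical school-label column.
toy = pd.DataFrame({
    "SNAME": ["A", "B", "C", "D", "E"],
    "MEALS": [10.0, 30.0, 50.0, 80.0, 90.0],
    "API12B": [880.0, 850.0, 760.0, 920.0, 690.0],
})

# Fit API on MEALS; double brackets keep the feature matrix two-dimensional
toy_lr = LinearRegression().fit(toy[["MEALS"]], toy["API12B"])

# Residual = actual minus predicted; positive means better than the model expects
toy["PREDICTED"] = toy_lr.predict(toy[["MEALS"]])
toy["RESIDUAL"] = toy["API12B"] - toy["PREDICTED"]

# Largest positive residuals first
ranked = toy.sort_values("RESIDUAL", ascending=False)
print(ranked[["SNAME", "MEALS", "API12B", "RESIDUAL"]])
```

Ranking by residual surfaces schools that beat expectations at any poverty level, not only those above a fixed cutoff.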
Let's look specifically at Solano Avenue Elementary, which has an API of 922 and 80 percent of its students in the free/reduced-price lunch program. If you were to use the above regression to predict how well Solano would do, it would look like this:
In [25]:
# predict expects a 2-D array: one row per sample, one column per feature
lr.predict([[80]])
Out[25]:
With an index of 922, the school is clearly outperforming what our simplified model expects.
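The size of that gap can be computed directly. The sketch below fits a model to three invented, exactly linear points (not the real data, so the predicted value here is not the one the notebook produces) and subtracts the prediction at 80 percent from the 922 quoted above. The names `toy_x`, `toy_y`, `toy_lr`, `expected` and `gap` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy, exactly linear points invented for illustration; NOT from apib12tx.csv.
toy_x = np.array([[0.0], [50.0], [100.0]])   # MEALS-like percentages, 2-D
toy_y = np.array([900.0, 750.0, 600.0])      # API-like scores
toy_lr = LinearRegression().fit(toy_x, toy_y)

# A single input still has to be 2-D: shape (1 sample, 1 feature)
expected = toy_lr.predict(np.array([[80.0]]))[0]

actual = 922.0           # Solano's API as quoted in the text
gap = actual - expected  # positive: the school beats the model's expectation
print(round(expected, 1), round(gap, 1))
```

With the real `lr` from above, the same subtraction, `922 - lr.predict([[80]])[0]`, gives the number of API points by which Solano exceeds the model's expectation.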