```
In [1]:
```
import warnings
warnings.filterwarnings('ignore')

```
In [2]:
```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

```
In [3]:
```
# Read the dataset into a DataFrame using pandas
df = pd.read_csv('../data/master.csv')
# Print the first few rows
df.head()

```
Out[3]:
```

We will be using **scikit-learn**, an open source machine learning library for the Python programming language.

```
In [4]:
```
# There are some NaN values in our numerics (UT Campus and ABIA).
# Remove the rows from those zip codes from the DataFrame:
df = df[np.isfinite(df['Population'])]

# Create X and y
feature_cols = ['Med_Income', 'Population', 'Home_Ownership']
X = df[feature_cols]
y = df.Score

# Follow the usual sklearn pattern: import, instantiate, fit
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X, y)

# Print intercept and coefficients
print(lm.intercept_)
print(lm.coef_)

```
```

```
In [5]:
```
# Pair the feature names with the coefficients
list(zip(feature_cols, lm.coef_))

```
Out[5]:
```

```
In [6]:
```
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % lm.score(X, y))

```
```

```
In [8]:
```
# Check how many rows we are left with after excluding the UT Campus and ABIA areas:
len(df)

```
Out[8]:
```

```
In [9]:
```
# Average the numeric columns for each Zip Code:
average_scores = df.groupby('Zip_Code').mean(numeric_only=True)
len(average_scores)

```
Out[9]:
```

```
In [10]:
```
average_scores.head()

```
Out[10]:
```

We are **NOT** supposed to extract any useful information from the regression model we are about to apply. After all, we only have a tiny number of rows, each one corresponding to an entire Zip Code. However, we might still be able to create a model that yields some strong trends, which would help us make, if not a prediction, at least an **educated guess** about the average score of all restaurants in a Zip Code with given demographic data.

Let's take it one by one:

```
In [17]:
```
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
X = average_scores['Med_Income'].values
y = average_scores['Score'].values

```
In [18]:
```
# Check the shapes of the X and y vectors:
print(X.shape)
print(y.shape)

```
```

```
In [19]:
```
# Reshape so they work when we fit the model:
X = X.reshape(-1, 1)
y = y.reshape(-1, 1)

```
In [20]:
```
# Fit the linear regression model:
lm.fit(X, y)
# Print intercept and coefficients
print(lm.intercept_)
print(lm.coef_)

```
```

```
In [21]:
```
# And score it:
lm.score(X, y)

```
Out[21]:
```

This is not a very exciting score for our model; we are explaining 35.2% of the variance, but we have to take into account that:

- We have 35 pairs of data points: (Median Income, Score)
- We have trained our model on our entire dataset, so we have really overfit the data
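
Since training and scoring on the same handful of rows inflates R-squared, one way to get a more honest estimate is cross-validation. The sketch below uses synthetic data (the notebook's `master.csv` is not reproduced here), so the numbers are illustrative only:

```python
# A minimal sketch (synthetic data, since master.csv isn't included here)
# of how cross-validated R^2 compares to scoring on the training data itself.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.uniform(30000, 120000, size=(35, 1))      # stand-in for median incomes
y = 85 + 0.0001 * X[:, 0] + rng.normal(0, 3, 35)  # noisy linear "scores"

lm = LinearRegression().fit(X, y)
in_sample = lm.score(X, y)          # R^2 on the data we trained on
# R^2 measured on held-out folds, usually a less flattering number:
cv_scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print('In-sample R^2: %.3f' % in_sample)
print('5-fold CV R^2: %.3f' % cv_scores.mean())
```

With only ~35 rows each fold is tiny, so the cross-validated estimate is noisy as well; it is a sanity check, not a cure.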

```
In [22]:
```
# Visualization using Seaborn:
sns.lmplot(x="Med_Income", y="Score", data=average_scores);

```
```

```
In [23]:
```
print("The Pearson correlation coefficient between Median Income and Average Score is {0}".format(
    np.corrcoef(average_scores['Med_Income'].values, average_scores['Score'].values)[1][0]))

```
```

Note that **for simple linear regression models like ours, the square of the Pearson correlation coefficient is equal to the R-squared value** we calculated above, i.e. the fraction of the variance in our data explained by our model.
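This identity is easy to verify numerically. A quick check on synthetic data (not the notebook's actual dataset):

```python
# Numerical check that r^2 == R^2 for one-feature OLS with an intercept.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(42)
x = rng.uniform(0, 100, 34)
y = 2.5 * x + 10 + rng.normal(0, 20, 34)

lm = LinearRegression().fit(x.reshape(-1, 1), y)
r_squared = lm.score(x.reshape(-1, 1), y)       # model's R^2
r = np.corrcoef(x, y)[1][0]                     # Pearson's r
print('R^2 = %.6f, r^2 = %.6f' % (r_squared, r ** 2))
```

The two printed values agree to floating-point precision; this holds for any simple linear regression fit by ordinary least squares with an intercept.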

The process is exactly the same as above:

```
In [24]:
```
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
X = average_scores['Population'].values
y = average_scores['Score'].values

```
In [25]:
```
# Check the shapes of the X and y vectors:
print(X.shape)
print(y.shape)

```
```

```
In [26]:
```
# Reshape so they work when we fit the model:
X = X.reshape(-1, 1)
y = y.reshape(-1, 1)

```
In [27]:
```
# Fit the linear regression model:
lm.fit(X, y)
# Print intercept and coefficients
print(lm.intercept_)
print(lm.coef_)

```
```

```
In [28]:
```
# And score it:
lm.score(X, y)

```
Out[28]:
```

I was expecting a much lower score for this model compared to the previous one. Population isn't really a good indicator of restaurants' health inspection scores, since the more populous Zip Codes (for instance, 78704) correspond to very diverse neighborhoods with a wide variety of establishments of all kinds. This will turn out to be an important result of this entire project: Austin's areas are really diverse.

Of course, the arguments above against placing much faith in this model still stand: we have only 35 pairs of data points, and we are overfitting.

```
In [29]:
```
# Visualization using Seaborn:
sns.lmplot(x="Population", y="Score", data=average_scores);

```
```

```
In [30]:
```
print("The Pearson correlation coefficient between Population and Average Score is {0}".format(
    np.corrcoef(average_scores['Population'].values, average_scores['Score'].values)[1][0]))

```
```

```
In [36]:
```
print("For a predicted score of {0} (just below the cutoff), the population would have to be {1}".format(
    lm.predict(np.array([[450000]]))[0][0], 450000))

```
```

As above:

```
In [37]:
```
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
X = average_scores['Home_Ownership'].values
y = average_scores['Score'].values

```
In [38]:
```
# Reshape the X and y vectors so they work when we fit the model:
X = X.reshape(-1, 1)
y = y.reshape(-1, 1)

```
In [39]:
```
# Fit the linear regression model:
lm.fit(X, y)
# Print intercept and coefficients
print(lm.intercept_)
print(lm.coef_)

```
```

```
In [40]:
```
# And score it:
lm.score(X, y)

```
Out[40]:
```

```
In [41]:
```
# Visualization using Seaborn:
sns.lmplot(x="Home_Ownership", y="Score", data=average_scores);

```
```

And for completeness:

```
In [42]:
```
print("The Pearson correlation coefficient between Home Ownership Percentage and Average Score is {0}".format(
    np.corrcoef(average_scores['Home_Ownership'].values, average_scores['Score'].values)[1][0]))

```
```

The three attempts at individual linear regression models have not really made us any wiser, even though they did at least reveal some broad trends in our data set, which we could loosely extrapolate to other urban areas with characteristics similar to Austin's. (Good luck with that!)

To complete our toy problem, let us attempt to fit a multiple linear regression model to our data: let us try to predict the average health inspection score for a Zip Code when all three features are given as inputs.

```
In [43]:
```
feature_cols = ['Med_Income', 'Population', 'Home_Ownership']

```
In [44]:
```
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
X = average_scores[feature_cols]
y = average_scores['Score'].values

```
In [45]:
```
# Check the shapes of the X and y vectors:
print(X.shape)
print(y.shape)

```
```

```
In [46]:
```
y = y.reshape(-1, 1)

```
In [47]:
```
# Fit the linear regression model:
lm.fit(X, y)
# Print intercept and coefficients
print(lm.intercept_)
print(lm.coef_)

```
```

```
In [48]:
```
# And score it:
lm.score(X, y)

```
Out[48]:
```