We are going to cover two main topics in this class: **Linear Regression** and **Validation**. We need to start with a broader question, though.

The goal this semester is to use machine learning to teach the computer how to make predictions. So we'll start with my definition of machine learning -- in particular, of supervised machine learning. We use a programming algorithm that gives the computer the tools it needs to identify patterns in a set of data. Once we have those patterns, we can use them to make predictions: what we would expect to happen if we gathered more data that may not be exactly the same as the data we learned from.

We'll start by looking at a very simple set of fake data that will help us cover the key ideas. Suppose that we just collected four data points. I've manually input them (as opposed to using a CSV file). Execute the following cell to see what the data look like.

In [1]:

```
import pandas as pd
fakedata1 = pd.DataFrame(
    [[0.862, 2.264],
     [0.694, 1.847],
     [0.184, 0.705],
     [0.41 , 1.246]], columns=['input', 'output'])
fakedata1.plot(x='input', y='output', kind='scatter')
```

Out[1]: *(scatter plot of the four data points)*

It is pretty clear that there is a linear trend here. If I wanted to predict what would happen if we tried the input of `x=0.6`, it would be a good guess to pick something like `y=1.6` or so. Training the computer to do this is what we mean by *Machine Learning*.

To formalize this a little bit, the machine learning process consists of four steps:

- We start with relevant historical data. This is our input to the machine learning algorithm.
- Choose an algorithm. There are a number of possibilities that we will cover over the course of the semester.
- Train the model. This is where the computer learns the pattern.
- Test the model. We now have to check to see how well the model works.

We then refine the model and repeat the process until we are happy with the results.
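As a preview, the four steps can be sketched with the tools we'll use this semester. The data here are made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Step 1: start with relevant historical data (a tiny synthetic example)
X = np.array([[0.1], [0.3], [0.5], [0.7], [0.9]])
y = np.array([0.3, 0.7, 1.1, 1.5, 1.9])  # exactly y = 2x + 0.1

# Step 2: choose an algorithm
model = LinearRegression()

# Step 3: train the model on part of the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)
model.fit(X_train, y_train)

# Step 4: test the model on the held-out part
score = model.score(X_test, y_test)  # R^2 score on data the model never saw
```

Don't worry about the details yet; every one of these calls gets its own discussion below.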

There is a bit of a sticky point here. If we use our data to train the computer, what do we use to test the model to see how good it is? If we use the same data to test the model we will, most likely, get fantastic results! After all, we used that data to train the model, so it should (if the model worked at all) do a great job of predicting the results.
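A quick illustration of why testing on the training data is misleading. Here a deliberately over-flexible model (a degree-7 polynomial through 8 noisy points -- not something we'll actually use in class) scores almost perfectly on its own training data but does much worse on fresh data drawn from the same process:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 8)
y = 2 * x + rng.normal(0, 0.1, 8)  # noisy linear data

# A degree-7 polynomial can pass through all 8 training points...
coefs = np.polyfit(x, y, 7)
train_rms = np.sqrt(np.mean((np.polyval(coefs, x) - y) ** 2))

# ...but on new data from the same process, the error is far larger
x_new = rng.uniform(0, 1, 100)
y_new = 2 * x_new + rng.normal(0, 0.1, 100)
test_rms = np.sqrt(np.mean((np.polyval(coefs, x_new) - y_new) ** 2))
```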

However, this doesn't tell us anything about how well the model will work with a *new* data point. Even if we get a new data point, we won't necessarily know what its correct output should be, so we can't use it to check the model. The solution is to hold back part of our original data for testing and train only on the rest.

The scikit-learn library provides a function that does this split for us, called `train_test_split`. The documentation is here: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html. I want you to get used to looking up the documentation yourself to see how a function works. Pay close attention to the inputs and the outputs of the function.
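As a sketch of what the documentation describes: the function takes one or more array-like objects (a DataFrame works) plus options such as `test_size`, and returns a train/test pair for each object you pass in. The data below are made up:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'input':  [0.1, 0.2, 0.3, 0.4, 0.5],
                   'output': [1.0, 2.0, 3.0, 4.0, 5.0]})

# One object in -> one (train, test) pair out; test_size=0.2 holds out 20% of the rows
train, test = train_test_split(df, test_size=0.2, random_state=23)
print(len(train), len(test))  # 4 1
```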

One of the inputs we will use is the `random_state` option. By using the same number here, we should all end up with the same results. If you change this number, you change the random split of the data and, thus, the end result.
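A quick sketch (again with made-up data) showing that the same `random_state` reproduces exactly the same split:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'x': range(10)})

a_train, a_test = train_test_split(df, test_size=0.3, random_state=23)
b_train, b_test = train_test_split(df, test_size=0.3, random_state=23)

# Identical random_state -> identical rows land in each split
same = a_test.index.tolist() == b_test.index.tolist()
print(same)  # True
```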

In [2]:

```
from sklearn.model_selection import train_test_split
faketrain1, faketest1 = train_test_split(fakedata1, test_size=0.2, random_state=23)
faketrain1.plot(x='input', y='output', kind='scatter')
faketest1.plot(x='input', y='output', kind='scatter')
```

Out[2]: *(scatter plots of the training and testing points)*

Next, we'll read the `Class02_fakedata2.csv` file and split it into 80/20 training/testing datasets.

In [3]:

```
fakedata2 = pd.read_csv('Class02_fakedata2.csv')
faketrain2, faketest2 = train_test_split(fakedata2, test_size=0.2, random_state=23)
faketrain2.plot(x='input', y='output', kind='scatter')
faketest2.plot(x='input', y='output', kind='scatter')
```

Out[3]: *(scatter plots of the training and testing points)*

We are now ready to train our linear model on the training part of this data. Remember that, from this point forward, we must "lock" the testing data and not use it to train our models. This takes two steps in Python. The first step is to define the model and set any model parameters (in this case we'll use the defaults). This is a Python object that will subsequently hold all the information about the model including fit parameters and other information about the fit. Again, take a look at the documentation: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html.
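A small sketch of that first step on its own: creating the object sets the (default) parameters but fits nothing yet, so the fit attributes don't exist until after `fit` is called:

```python
from sklearn.linear_model import LinearRegression

regr = LinearRegression()                  # step 1: model with default parameters
print(regr.get_params()['fit_intercept'])  # True -- the default fits an intercept
print(hasattr(regr, 'coef_'))              # False -- fit results appear only after regr.fit(...)
```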

The second step is to actually fit the data. We need to reformat our data so that we can tell the computer what our inputs are and what our outputs are. We define two new variables called "features" and "labels". Note the use of the double square brackets in selecting data for the features. This will allow us, in the future, to select multiple columns as our input variables. In the meantime, it formats the data in the way that the fit algorithm needs it to be formatted.
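A minimal sketch (made-up data) of what the double brackets change: single brackets give a 1-D Series, double brackets a 2-D DataFrame, and `fit` wants the 2-D shape for its features -- one row per sample, one column per input:

```python
import pandas as pd

df = pd.DataFrame({'input': [0.1, 0.2, 0.3], 'output': [1.0, 2.0, 3.0]})

print(df['input'].values.shape)    # (3,)   -- 1-D, fine for the labels
print(df[['input']].values.shape)  # (3, 1) -- 2-D, the shape fit() expects for features
```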

In [4]:

```
faketrain2.head()
```

Out[4]: *(first rows of the training data)*

In [5]:

```
from sklearn.linear_model import LinearRegression
# Step 1: Create linear regression object
regr = LinearRegression()
# Step 2: Train the model using the training sets
features = faketrain2[['input']].values
labels = faketrain2['output'].values
regr.fit(features, labels)
```

Out[5]: *(text representation of the fitted `LinearRegression` object)*

In [6]:

```
print('Coefficients: \n', regr.coef_)
print('Intercept: \n', regr.intercept_)
```

In [7]:

```
testinputs = faketest2[['input']].values
predictions = regr.predict(testinputs)
actuals = faketest2['output'].values
import matplotlib.pyplot as plt
plt.scatter(testinputs, actuals, color='black', label='Actual')
plt.plot(testinputs, predictions, color='blue', linewidth=1, label='Prediction')
# We also add a legend to our plot. The 'label' options above are gathered into a single legend.
plt.legend(loc='upper left', shadow=False, scatterpoints=1)
plt.xlabel('input')
plt.ylabel('output')
```

Out[7]: *(actual test points with the prediction line)*

In [8]:

```
plt.scatter(testinputs, (actuals - predictions), color='green', label='Residuals')
plt.xlabel('input')
plt.ylabel('residuals')
plt.legend(loc='upper left', shadow=False, scatterpoints=1)
```

Out[8]: *(residuals plot)*

In [9]:

```
import numpy as np
print("RMS Error: {0:.3f}".format(np.sqrt(np.mean((predictions - actuals) ** 2))))
```
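The RMS error here is computed by hand; scikit-learn's `mean_squared_error` gives the same number once you take the square root. A quick check with made-up values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

actuals = np.array([1.0, 2.0, 3.0])
predictions = np.array([1.1, 1.9, 3.2])

rms_manual = np.sqrt(np.mean((predictions - actuals) ** 2))
rms_sklearn = np.sqrt(mean_squared_error(actuals, predictions))
print(np.isclose(rms_manual, rms_sklearn))  # True
```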

Now let's apply the same process to real data: the diabetes dataset from Class01.

In [10]:

```
diabetes = pd.read_csv('../Class01/Class01_diabetes_data.csv')
diabetes.head()
```

Out[10]: *(first rows of the diabetes data)*

I've put all the steps together in one cell and commented on each step.

In [11]:

```
# Step 1: Split off the test data
dia_train, dia_test = train_test_split(diabetes, test_size=0.2, random_state=23)
# Step 2: Create linear regression object
dia_model = LinearRegression()
# Step 3: Choose the training features and labels
features = dia_train[['BMI']].values
labels = dia_train['Target'].values
# Step 4: Fit the model
dia_model.fit(features, labels)
# Step 5: Get the predictions
testinputs = dia_test[['BMI']].values
predictions = dia_model.predict(testinputs)
actuals = dia_test['Target'].values
# Step 6: Plot the results
plt.scatter(testinputs, actuals, color='black', label='Actual')
plt.plot(testinputs, predictions, color='blue', linewidth=1, label='Prediction')
plt.xlabel('BMI')  # Label the x axis
plt.ylabel('Target')  # Label the y axis
plt.legend(loc='upper left', shadow=False, scatterpoints=1)
# Step 7: Get the RMS value
print("RMS Error: {0:.3f}".format(np.sqrt(np.mean((predictions - actuals) ** 2))))
```

We can also train on more than one input column at a time. Change the `inputcolumns` list to try different combinations.

In [12]:

```
# Step 2: Create linear regression object
dia_model2 = LinearRegression()
# Possible columns:
# 'Age', 'Sex', 'BMI', 'BP', 'TC', 'LDL', 'HDL', 'TCH', 'LTG', 'GLU'
#
inputcolumns = ['BMI', 'HDL']
# Step 3: Choose the training features and labels
features = dia_train[inputcolumns].values
labels = dia_train['Target'].values
# Step 4: Fit the model
dia_model2.fit(features, labels)
# Step 5: Get the predictions
testinputs = dia_test[inputcolumns].values
predictions = dia_model2.predict(testinputs)
actuals = dia_test['Target'].values
# Step 6: Plot the results
#
# Note the change in how we plot the test inputs: we can only plot one variable, so we choose the first.
# It also no longer makes sense to plot the fit as a line. With more than one input, we visualize the
# predictions as points.
#
plt.scatter(testinputs[:, 0], actuals, color='black', label='Actual')
plt.scatter(testinputs[:, 0], predictions, color='blue', label='Prediction')
plt.legend(loc='upper left', shadow=False, scatterpoints=1)
# Step 7: Get the RMS value
print("RMS Error: {0:.3f}".format(np.sqrt(np.mean((predictions - actuals) ** 2))))
```
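Trying different `inputcolumns` lists and comparing RMS errors is the basic feature-selection loop. Here is a sketch of the idea on synthetic data (not the diabetes file), where adding an informative second column lowers the test error:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(23)
n = 200
df = pd.DataFrame({'a': rng.normal(size=n), 'b': rng.normal(size=n)})
df['target'] = 2 * df['a'] + 3 * df['b'] + rng.normal(0, 0.1, n)

train, test = train_test_split(df, test_size=0.2, random_state=23)

def rms_for(columns):
    """Fit on the given input columns and return the test RMS error."""
    model = LinearRegression()
    model.fit(train[columns].values, train['target'].values)
    preds = model.predict(test[columns].values)
    return np.sqrt(np.mean((preds - test['target'].values) ** 2))

# Using both informative columns fits far better than using only one
print(rms_for(['a']), rms_for(['a', 'b']))
```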
