# Model complexity vs accuracy - empirical analysis

This notebook is intended as a moderate stress test for the DSX infrastructure. It generates a known function, adds random noise to it, and sends an ML algorithm on a wild goose chase, asking it to fit and predict from this data.

Right now I am running this against 10 million points. To increase the load, you can do two things:

- Increase the number of points (direct hit)
- Increase the complexity of the function (indirect)
``````

In [101]:

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

``````

Create a linear stream of `10` million points between `-50` and `50`.

``````

In [102]:

x = np.arange(-50,50,0.00001)
x.shape

``````
``````

Out[102]:

(10000000,)

``````
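The point count above follows from `(stop - start) / step = 100 / 1e-5 = 1e7`. As a side note, `np.linspace` is an equivalent, float-safer way to ask for exactly that many points; a small sketch:

```python
import numpy as np

# Equivalent to np.arange(-50, 50, 0.00001): request the count directly.
# endpoint=False matches arange's half-open interval [-50, 50).
x = np.linspace(-50, 50, 10_000_000, endpoint=False)
print(x.shape[0])  # 10000000
```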

Create random noise of the same dimension.

``````

In [103]:

bias = np.random.standard_normal(x.shape)

``````
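Because the noise is random, each rerun produces slightly different RMSE numbers. If you want reproducible stress-test runs, one option is to seed the generator; a sketch (the seed value `42` is arbitrary, not from this notebook):

```python
import numpy as np

# Seeded generator: the same seed yields the same noise on every rerun,
# so RMSE figures become comparable across runs.
rng = np.random.default_rng(42)
bias = rng.standard_normal(1000)
print(bias.shape)  # (1000,)
```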

### Define the function

``````

In [104]:

y2 = np.cos(x)**3 * (x**2/max(x)) + bias*5

``````

### Train test split

``````

In [105]:

x_train, x_test, y_train, y_test = train_test_split(x,y2, test_size=0.3)

``````
``````

In [106]:

x_train.shape

``````
``````

Out[106]:

(7000000,)

``````
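The `7000000` above is just `test_size=0.3` at work: 70% of the 10M points go to training, 30% to test. A small-scale check of the same call (variable names here are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 points with test_size=0.3 split into 70 train / 30 test,
# mirroring the 7M / 3M split on the full data.
x_small = np.arange(100)
y_small = x_small * 2.0
x_tr, x_te, y_tr, y_te = train_test_split(x_small, y_small,
                                          test_size=0.3, random_state=0)
print(x_tr.shape[0], x_te.shape[0])  # 70 30
```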

Plotting libraries cannot cope with millions of points, so downsample just for plotting.

``````

In [107]:

stepper = int(x_train.shape[0]/1000)
stepper

``````
``````

Out[107]:

7000

``````
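Slicing with a step is all the downsampling needed: `x[::stepper]` keeps every `stepper`-th point, shrinking 7 million points to roughly a thousand. A self-contained sketch:

```python
import numpy as np

# 7M points, stepper = 7000: the strided slice keeps exactly 1000 points.
n_points = 7_000_000
stepper = int(n_points / 1000)           # 7000
sampled = np.arange(n_points)[::stepper]
print(stepper, sampled.shape[0])         # 7000 1000
```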
``````

In [108]:

fig, ax = plt.subplots(1,1, figsize=(13,8))
ax.scatter(x[::stepper],y2[::stepper], marker='d')
ax.set_title('Distribution of training points')

``````
``````

Out[108]:

Text(0.5,1,'Distribution of training points')

``````

### Curve fitting

Let us define a function that tries to fit the training data, starting with a low-order polynomial and sequentially increasing the complexity of the model. The hope is that somewhere in this range lies the sweet spot of low bias and low variance; we will find it empirically.

``````

In [109]:

def greedy_fitter(x_train, y_train, x_test, y_test, max_order=25):
    """Fitter will try to find the best order of
    polynomial curve fit for the given synthetic data"""
    import time
    train_predictions = []
    train_rmse = []

    test_predictions = []
    test_rmse = []

    for order in range(1, max_order + 1):
        t1 = time.time()
        coeff = np.polyfit(x_train, y_train, deg=order)

        # evaluate the fitted polynomial on the training data
        n_order = order
        count = 0
        y_predict = np.zeros(x_train.shape)
        while n_order >= 0:
            y_predict += coeff[count] * x_train**n_order
            count += 1
            n_order -= 1

        # append to predictions
        train_predictions.append(y_predict)

        # find training errors
        current_train_rmse = np.sqrt(mean_squared_error(y_train, y_predict))
        train_rmse.append(current_train_rmse)

        # predict and find test errors
        n_order = order
        count = 0
        y_predict_test = np.zeros(x_test.shape)
        while n_order >= 0:
            y_predict_test += coeff[count] * x_test**n_order
            count += 1
            n_order -= 1

        # append test predictions
        test_predictions.append(y_predict_test)
        # find test errors
        current_test_rmse = np.sqrt(mean_squared_error(y_test, y_predict_test))
        test_rmse.append(current_test_rmse)

        t2 = time.time()
        elapsed = round(t2 - t1, 3)
        print("Elapsed: " + str(elapsed) +
              "s Order: " + str(order) +
              " Train RMSE: " + str(round(current_train_rmse, 4)) +
              " Test RMSE: " + str(round(current_test_rmse, 4)))

    return (train_predictions, train_rmse, test_predictions, test_rmse)

``````
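The manual power-sum loops in `greedy_fitter` evaluate the fitted polynomial by hand. `np.polyval` does the same job, consuming the highest-order-first coefficients that `np.polyfit` returns; a small self-contained check of the equivalence (the toy data here is illustrative, not the notebook's):

```python
import numpy as np

# Fit a quadratic to slightly noisy quadratic data.
rng = np.random.default_rng(1)
xs = np.linspace(-1, 1, 200)
ys = 3 * xs**2 - 2 * xs + 1 + rng.normal(scale=0.01, size=xs.shape)
coeff = np.polyfit(xs, ys, deg=2)

# Manual evaluation, mirroring the while-loop in greedy_fitter:
# coeff[0] multiplies the highest power, coeff[-1] is the constant term.
y_manual = np.zeros(xs.shape)
n_order, count = 2, 0
while n_order >= 0:
    y_manual += coeff[count] * xs**n_order
    count += 1
    n_order -= 1

# np.polyval computes the identical polynomial.
print(np.allclose(y_manual, np.polyval(coeff, xs)))  # True
```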

Run the model. Change `max_order` to a higher or lower value if you wish.

``````

In [110]:

%%time
complexity = 50
train_predictions, train_rmse, test_predictions, test_rmse = greedy_fitter(
    x_train, y_train, x_test, y_test, max_order=complexity)

``````
``````

Elapsed: 0.826s Order: 1 Train RMSE: 13.1708 Test RMSE: 13.1646
Elapsed: 1.264s Order: 2 Train RMSE: 13.1646 Test RMSE: 13.1582
Elapsed: 2.061s Order: 3 Train RMSE: 13.1646 Test RMSE: 13.1582
Elapsed: 2.727s Order: 4 Train RMSE: 13.1627 Test RMSE: 13.1564
Elapsed: 3.4s Order: 5 Train RMSE: 13.1627 Test RMSE: 13.1564
Elapsed: 4.144s Order: 6 Train RMSE: 13.1585 Test RMSE: 13.1519
Elapsed: 5.01s Order: 7 Train RMSE: 13.1585 Test RMSE: 13.1519
Elapsed: 5.749s Order: 8 Train RMSE: 13.0983 Test RMSE: 13.0891
Elapsed: 6.43s Order: 9 Train RMSE: 13.0983 Test RMSE: 13.0891
Elapsed: 7.193s Order: 10 Train RMSE: 12.876 Test RMSE: 12.865
Elapsed: 7.955s Order: 11 Train RMSE: 12.876 Test RMSE: 12.865
Elapsed: 8.777s Order: 12 Train RMSE: 12.4236 Test RMSE: 12.4185
Elapsed: 9.727s Order: 13 Train RMSE: 12.4236 Test RMSE: 12.4185
Elapsed: 10.495s Order: 14 Train RMSE: 11.9035 Test RMSE: 11.9015
Elapsed: 11.452s Order: 15 Train RMSE: 11.9035 Test RMSE: 11.9014
Elapsed: 11.929s Order: 16 Train RMSE: 11.6687 Test RMSE: 11.6657
Elapsed: 12.827s Order: 17 Train RMSE: 11.6687 Test RMSE: 11.6657
Elapsed: 13.863s Order: 18 Train RMSE: 11.6666 Test RMSE: 11.6638
Elapsed: 15.234s Order: 19 Train RMSE: 11.6666 Test RMSE: 11.6638
Elapsed: 15.793s Order: 20 Train RMSE: 11.2828 Test RMSE: 11.2825
Elapsed: 16.477s Order: 21 Train RMSE: 11.2828 Test RMSE: 11.2825
Elapsed: 18.752s Order: 22 Train RMSE: 10.6544 Test RMSE: 10.6509
Elapsed: 19.699s Order: 23 Train RMSE: 10.6544 Test RMSE: 10.6509
Elapsed: 20.26s Order: 24 Train RMSE: 10.6051 Test RMSE: 10.601

/Users/atma6951/anaconda3/envs/pychakras/lib/python3.6/site-packages/ipykernel_launcher.py:13: RankWarning: Polyfit may be poorly conditioned
  del sys.path[0]

(this warning repeats for every order from 25 onward)

Elapsed: 20.433s Order: 25 Train RMSE: 10.6051 Test RMSE: 10.601
Elapsed: 20.777s Order: 26 Train RMSE: 10.6168 Test RMSE: 10.613
Elapsed: 20.747s Order: 27 Train RMSE: 10.6168 Test RMSE: 10.613
Elapsed: 22.231s Order: 28 Train RMSE: 9.7878 Test RMSE: 9.7872
Elapsed: 23.836s Order: 29 Train RMSE: 9.7878 Test RMSE: 9.7872
Elapsed: 25.725s Order: 30 Train RMSE: 9.5223 Test RMSE: 9.5227
Elapsed: 25.587s Order: 31 Train RMSE: 9.5223 Test RMSE: 9.5227
Elapsed: 25.041s Order: 32 Train RMSE: 9.3192 Test RMSE: 9.3201
Elapsed: 26.645s Order: 33 Train RMSE: 9.3192 Test RMSE: 9.3201
Elapsed: 27.387s Order: 34 Train RMSE: 9.2033 Test RMSE: 9.2045
Elapsed: 28.049s Order: 35 Train RMSE: 9.2033 Test RMSE: 9.2045
Elapsed: 29.866s Order: 36 Train RMSE: 9.1679 Test RMSE: 9.1692
Elapsed: 31.415s Order: 37 Train RMSE: 9.1679 Test RMSE: 9.1692
Elapsed: 33.605s Order: 38 Train RMSE: 9.1874 Test RMSE: 9.1887
Elapsed: 33.52s Order: 39 Train RMSE: 9.1874 Test RMSE: 9.1886
Elapsed: 33.863s Order: 40 Train RMSE: 9.1526 Test RMSE: 9.1539
Elapsed: 34.658s Order: 41 Train RMSE: 9.1526 Test RMSE: 9.1539
Elapsed: 35.006s Order: 42 Train RMSE: 9.0739 Test RMSE: 9.0755
Elapsed: 35.865s Order: 43 Train RMSE: 9.0739 Test RMSE: 9.0755
Elapsed: 36.595s Order: 44 Train RMSE: 8.3806 Test RMSE: 8.3852
Elapsed: 39.269s Order: 45 Train RMSE: 8.3806 Test RMSE: 8.3852
Elapsed: 38.545s Order: 46 Train RMSE: 8.4328 Test RMSE: 8.4372
Elapsed: 42.502s Order: 47 Train RMSE: 8.4328 Test RMSE: 8.4372
Elapsed: 41.427s Order: 48 Train RMSE: 8.5054 Test RMSE: 8.5096
Elapsed: 43.643s Order: 49 Train RMSE: 8.5054 Test RMSE: 8.5096
Elapsed: 43.055s Order: 50 Train RMSE: 8.5792 Test RMSE: 8.5831
CPU times: user 41min 15s, sys: 4min 3s, total: 45min 19s
Wall time: 17min 31s

``````

## Plot results

How well did the models fit the training data?

### Training results

``````

In [111]:

%%time
fig, axes = plt.subplots(1, 1, figsize=(15, 15))
axes.scatter(x_train[::stepper], y_train[::stepper],
             label='Original data', color='gray', marker='x')
order = 1
for p, r in zip(train_predictions, train_rmse):
    # use the same strided sample as the original data
    axes.scatter(x_train[::stepper], p[::stepper],
                 label='O: ' + str(order) + " RMSE: " + str(round(r, 2)),
                 marker='.')
    order += 1
axes.legend(loc=0)
axes.set_title('Performance against training data')

``````
``````

CPU times: user 1.1 s, sys: 39.6 ms, total: 1.14 s
Wall time: 918 ms

``````

### Test results

``````

In [112]:

%%time
fig, axes = plt.subplots(1, 1, figsize=(15, 15))
axes.scatter(x_test[::stepper], y_test[::stepper],
             label='Test data', color='gray', marker='x')
order = 1
for p, r in zip(test_predictions, test_rmse):
    # use the same strided sample as the test data
    axes.scatter(x_test[::stepper], p[::stepper],
                 label='O: ' + str(order) + " RMSE: " + str(round(r, 2)),
                 marker='.')
    order += 1
axes.legend(loc=0)
axes.set_title('Performance against test data')

``````
``````

CPU times: user 893 ms, sys: 25.9 ms, total: 919 ms
Wall time: 901 ms

``````

### Bias vs Variance

``````

In [120]:

ax = plt.plot(np.arange(1,complexity+1),test_rmse)
plt.title('Bias vs Complexity'); plt.xlabel('Order of polynomial'); plt.ylabel('Test RMSE')
ax[0].axes.get_yaxis().get_major_formatter().set_useOffset(False)
plt.savefig('Model efficiency.png')

``````
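Reading off the empirical sweet spot from the curve is a one-liner: the best order is the argmin of the test RMSE, plus one because orders start at 1. A sketch with hypothetical RMSE values standing in for the notebook's `test_rmse`:

```python
import numpy as np

# Hypothetical test-RMSE curve (index 0 = order 1); in the notebook,
# replace rmse_curve with the test_rmse list from greedy_fitter.
rmse_curve = [13.16, 12.42, 11.90, 10.65, 9.52, 9.20, 8.38, 8.43, 8.51]
best_order = int(np.argmin(rmse_curve)) + 1
print(best_order)  # 7
```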