This notebook is intended as a moderate stress test for the DSX infrastructure. It generates a known function, adds random noise to it, and then sends an ML algorithm on a wild goose chase, asking it to fit and predict from the noisy data.
Right now it runs against 10 million points. To increase the complexity, you can do two things: generate more points (shrink the step in `np.arange` below) or raise the maximum polynomial order passed to the fitter.
In [101]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
Create a linear stream of 10 million points between -50 and 50.
In [102]:
x = np.arange(-50,50,0.00001)
x.shape
Out[102]:
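As a side note, `np.arange` with a float step can gain or lose a point to floating-point rounding; a sketch of an alternative (not what the notebook uses) is `np.linspace` with an explicit count and `endpoint=False`, which pins the length exactly:

```python
import numpy as np

# Hypothetical alternative to np.arange(-50, 50, 0.00001):
# linspace takes the number of points directly, so the length is exact.
x_alt = np.linspace(-50, 50, 10_000_000, endpoint=False)
print(x_alt.shape)                    # (10000000,)
print(x_alt[0], x_alt[1] - x_alt[0])  # -50.0 and a step of ~1e-5
```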
Create random noise of the same dimension.
In [103]:
bias = np.random.standard_normal(x.shape)
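The legacy `np.random.standard_normal` call works fine; for what it's worth, newer NumPy code often draws the same noise through the `Generator` API, which makes seeding explicit. A minimal sketch (the seed 42 and the small shape are arbitrary choices for illustration, not values the notebook uses):

```python
import numpy as np

# Same standard-normal noise via the Generator API; seeding makes reruns reproducible.
rng = np.random.default_rng(42)
bias_alt = rng.standard_normal(10)   # small shape just for illustration
print(bias_alt.mean(), bias_alt.std())
```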
In [104]:
# known signal plus Gaussian noise; x.max() is much faster than Python's max() on a NumPy array
y2 = np.cos(x)**3 * (x**2/x.max()) + bias*5
In [105]:
x_train, x_test, y_train, y_test = train_test_split(x,y2, test_size=0.3)
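The split above is unseeded, so it shuffles differently on every run. If you want reruns of the stress test to be comparable, a sketch with a fixed `random_state` (42 here is an arbitrary choice) looks like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# With the same random_state, two calls produce identical splits.
x_small = np.arange(10)
a_train, a_test = train_test_split(x_small, test_size=0.3, random_state=42)
b_train, b_test = train_test_split(x_small, test_size=0.3, random_state=42)
print(np.array_equal(a_train, b_train))  # True
```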
In [106]:
x_train.shape
Out[106]:
Plotting libraries cannot cope with millions of points, so downsample just for plotting.
In [107]:
stepper = int(x_train.shape[0]/1000)
stepper
Out[107]:
In [108]:
fig, ax = plt.subplots(1,1, figsize=(13,8))
ax.scatter(x[::stepper],y2[::stepper], marker='d')
ax.set_title('Distribution of the data (downsampled)')
Out[108]:
In [109]:
def greedy_fitter(x_train, y_train, x_test, y_test, max_order=25):
    """Try to find the best order of polynomial
    curve fit for the given synthetic data."""
    import time
    train_predictions = []
    train_rmse = []
    test_predictions = []
    test_rmse = []
    for order in range(1, max_order+1):
        t1 = time.time()
        coeff = np.polyfit(x_train, y_train, deg=order)
        # evaluate the fitted polynomial on the training data
        n_order = order
        count = 0
        y_predict = np.zeros(x_train.shape)
        while n_order >= 0:
            y_predict += coeff[count]*x_train**n_order
            count += 1
            n_order -= 1
        # append to predictions and find training errors
        train_predictions.append(y_predict)
        current_train_rmse = np.sqrt(mean_squared_error(y_train, y_predict))
        train_rmse.append(current_train_rmse)
        # evaluate the fitted polynomial on the test data
        n_order = order
        count = 0
        y_predict_test = np.zeros(x_test.shape)
        while n_order >= 0:
            y_predict_test += coeff[count]*x_test**n_order
            count += 1
            n_order -= 1
        # append test predictions and find test errors
        test_predictions.append(y_predict_test)
        current_test_rmse = np.sqrt(mean_squared_error(y_test, y_predict_test))
        test_rmse.append(current_test_rmse)
        t2 = time.time()
        elapsed = round(t2 - t1, 3)
        print("Elapsed: " + str(elapsed) +
              "s Order: " + str(order) +
              " Train RMSE: " + str(round(current_train_rmse, 4)) +
              " Test RMSE: " + str(round(current_test_rmse, 4)))
    return (train_predictions, train_rmse, test_predictions, test_rmse)
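The manual power-sum loops inside `greedy_fitter` compute exactly what `np.polyval` does with the coefficients `np.polyfit` returns (highest order first). A minimal sketch on toy data, checking the two agree:

```python
import numpy as np

# Toy data: fit a cubic, then evaluate it both ways.
x = np.linspace(-1, 1, 100)
y = 2 * x**3 - x + 0.5
coeff = np.polyfit(x, y, deg=3)

# Manual evaluation, as in greedy_fitter
order, count = 3, 0
n_order = order
y_manual = np.zeros(x.shape)
while n_order >= 0:
    y_manual += coeff[count] * x**n_order
    count += 1
    n_order -= 1

# Library evaluation
y_poly = np.polyval(coeff, x)
print(np.allclose(y_manual, y_poly))  # True
```

Swapping the loops for `np.polyval` would shorten the function without changing its output.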
Run the model. Change `max_order` higher or lower if you wish.
In [110]:
%%time
complexity=50
train_predictions, train_rmse, test_predictions, test_rmse = greedy_fitter(
x_train, y_train, x_test, y_test, max_order=complexity)
In [111]:
%%time
fig, axes = plt.subplots(1,1, figsize=(15,15))
axes.scatter(x_train[::stepper], y_train[::stepper],
label='Original data', color='gray', marker='x')
order = 1
for p, r in zip(train_predictions, train_rmse):
    # note [::stepper] (every stepper-th point), not [:stepper] (first stepper points)
    axes.scatter(x_train[::stepper], p[::stepper],
                 label='O: ' + str(order) + " RMSE: " + str(round(r, 2)),
                 marker='.')
    order += 1
axes.legend(loc=0)
axes.set_title('Performance against training data')
In [112]:
%%time
fig, axes = plt.subplots(1,1, figsize=(15,15))
axes.scatter(x_test[::stepper], y_test[::stepper],
label='Test data', color='gray', marker='x')
order = 1
for p, r in zip(test_predictions, test_rmse):
    # note [::stepper] (every stepper-th point), not [:stepper] (first stepper points)
    axes.scatter(x_test[::stepper], p[::stepper],
                 label='O: ' + str(order) + " RMSE: " + str(round(r, 2)),
                 marker='.')
    order += 1
axes.legend(loc=0)
axes.set_title('Performance against test data')
In [120]:
ax = plt.plot(np.arange(1, complexity+1), test_rmse)
plt.title('Bias vs Complexity')
plt.xlabel('Order of polynomial')
plt.ylabel('Test RMSE')
ax[0].axes.get_yaxis().get_major_formatter().set_useOffset(False)
plt.savefig('Model efficiency.png')
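Beyond eyeballing the curve, the best order can be read off programmatically as the index of the smallest test RMSE. A sketch with a made-up `test_rmse` list (in the notebook it comes from `greedy_fitter`; the values below are purely illustrative):

```python
import numpy as np

# Hypothetical test RMSEs for orders 1..6
test_rmse = [5.8, 5.3, 5.31, 5.05, 5.06, 5.2]

# np.argmin gives a 0-based index; orders start at 1, hence the +1
best_order = int(np.argmin(test_rmse)) + 1
print(best_order, test_rmse[best_order - 1])  # 4 5.05
```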