In the first part (MarketPlaceSimulator.ipynb) we briefly described the main assumptions and strategy of the MarketPlace simulator and presented its source code. In the second part (Exploratory_data_analysis.ipynb), we used the data generated in part one to better understand the correlations and trends in the data. In this part of the project, we will use machine learning algorithms to answer the following questions:
For this task we will be using the following Python packages: matplotlib, GraphLab Create and SFrame. As mentioned in Part 1 of the project, the MarketPlace simulator generates a set of files named dayX_relinfo.csv, where X corresponds to a particular day. These data sets are already balanced, i.e., the numbers of positive and negative observations (Match_day) are equally distributed.
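Since the analysis below relies on the data sets being balanced, it is worth verifying this before modelling. A minimal sketch with pandas (the rows here are hypothetical stand-ins; the real data come from the dayX_relinfo.csv files):

```python
import pandas as pd

# Hypothetical stand-in for a dayX_relinfo.csv file: only the Match_day label matters here.
df = pd.DataFrame({'Match_day': [1, 0, 1, 0, 1, 0]})

# Count positive vs. negative observations and check they are (roughly) equal.
counts = df['Match_day'].value_counts()
balanced = counts.min() / float(counts.max()) > 0.9  # classes within ~10% of each other
print(counts.to_dict(), balanced)
```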
This notebook is structured as follows:
1. Data Overview
2. Applying different Machine Learning algorithms to different models
   i. Regression methods and results
   ii. Classification methods and results
In [405]:
import graphlab as gl
import matplotlib.pyplot as plt
import pandas as pd
import scipy
import numpy
import sklearn
gl.canvas.set_target('ipynb')
%matplotlib inline
In [421]:
# Loading the data
data_all = gl.SFrame.read_csv('Source_Code/TradeLog_relinfo.csv')
data_day1 = gl.SFrame.read_csv('Source_Code/day1_relinfo.csv')
data_day2 = gl.SFrame.read_csv('Source_Code/day2_relinfo.csv')
data_day3 = gl.SFrame.read_csv('Source_Code/day3_relinfo.csv')
In [424]:
# Quick look at the combined data set
data_all
Out[424]:
In [425]:
data_day1
Out[425]:
In [426]:
my_features = ['avgIBR', 'OrdRate_IBR', 'wavgMatchRate_IBR', 'AmountOrd','TimeStampPlaced']
In [427]:
# Filtering Euro and GBP
data_day1_Euro = data_day1[data_day1['CurrencySell'] == 'EURO']
data_day1_GBP = data_day1[data_day1['CurrencySell'] == 'GBP' ]
data_day1_Euro[my_features].show()
data_day1_GBP[my_features].show()
data_day1['Match_day'].show(view='Categorical')
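For readers without GraphLab Create, the same currency filtering can be sketched with pandas (column names as in the source; the rows below are hypothetical stand-ins for day1_relinfo.csv):

```python
import pandas as pd

# Hypothetical stand-in rows; the real data are loaded from day1_relinfo.csv.
df = pd.DataFrame({
    'CurrencySell': ['EURO', 'GBP', 'EURO'],
    'AmountOrd': [100.0, 250.0, 80.0],
})

# Boolean masks select the EURO and GBP subsets, mirroring the SFrame filtering above.
df_euro = df[df['CurrencySell'] == 'EURO']
df_gbp = df[df['CurrencySell'] == 'GBP']
print(len(df_euro), len(df_gbp))
```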
We will assess how well different regression algorithms work in predicting when a matching event will occur. For this, we will use Time_diff as our continuous target variable. In particular, we will use Linear Regression, Random Forest regression and Boosted Trees regression. We will now apply different combinations of the chosen variables to each of these algorithms.
In [443]:
# Splitting the data into train and test sets
train_set, test_set = data_all.random_split(.8,seed=0)
# 1) Linear Regression
predictors = ['CurrencySell','avgIBR', 'OrdRate_IBR', 'wavgMatchRate_IBR', 'AmountOrd','TimeStampPlaced']
LinearReg_model = gl.linear_regression.create(train_set, target='Time_diff',
features=predictors, validation_set=test_set);
In [444]:
# 2) RandomForest Regression
RFReg_model = gl.random_forest_regression.create(train_set, target='Time_diff', max_iterations=500, max_depth=3,
features=predictors, validation_set=test_set);
In [445]:
# 3) Boosted Trees
BTReg_model = gl.boosted_trees_regression.create(train_set, target='Time_diff', max_iterations=500, max_depth=3,
features=predictors, validation_set=test_set)
In [454]:
print '--------------------------------'
print 'Results for Regression models: '
print '--------------------------------'
print 'Linear Regression: ', LinearReg_model.evaluate(test_set, metric='auto')
#LinearReg_model.show(view='Evaluation')
print 'RandomForest:      ', RFReg_model.evaluate(test_set, metric='auto')
#RFReg_model.show(view='Evaluation')
print 'Boosted Trees:     ', BTReg_model.evaluate(test_set, metric='auto')
#BTReg_model.show(view='Evaluation')
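GraphLab Create is no longer maintained, so the same regression comparison can be sketched with scikit-learn (already imported above). The features and target here are synthetic stand-ins for the avgIBR/OrdRate_IBR/AmountOrd predictors and Time_diff; only the train/test split and RMSE comparison pattern carry over:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.rand(200, 3)                                      # stand-ins for the numeric predictors
y = X @ np.array([1.0, 2.0, 0.5]) + rng.normal(0, 0.05, 200)  # synthetic Time_diff

# 80/20 split, matching the random_split(.8) used with SFrames above.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

results = {}
for name, model in [('Linear', LinearRegression()),
                    ('RandomForest', RandomForestRegressor(n_estimators=100,
                                                           max_depth=3,
                                                           random_state=0))]:
    model.fit(X_tr, y_tr)
    results[name] = mean_squared_error(y_te, model.predict(X_te)) ** 0.5  # RMSE
    print(name, round(results[name], 3))
```

On this nearly linear synthetic target the linear model should achieve the lower RMSE; with real Time_diff data the ranking may well reverse.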
In [465]:
# Splitting the data into train and test sets
train_set, test_set = data_day2.random_split(.8,seed=0)
gl.canvas.set_target('ipynb')
predictors = ['CurrencySell','avgIBR', 'OrdRate_IBR', 'wavgMatchRate_IBR', 'AmountOrd','TimeStampPlaced']
max_it = 50
# 1) Logistic Regression
LogReg_model = gl.logistic_classifier.create(train_set, target='Match_day',
features=predictors, max_iterations=max_it,
feature_rescaling=True,
verbose=True, validation_set=test_set,
l2_penalty=0);
# 2) RandomForest Classifier
RFClass_model = gl.random_forest_classifier.create(train_set, target='Match_day',
max_iterations=500, max_depth=1,
features=predictors,
row_subsample=0.5,column_subsample=0.5,
validation_set=test_set,random_seed=0,
)
# 3) Boosted Trees
BTClass_model = gl.boosted_trees_classifier.create(train_set, target='Match_day',
max_iterations=500,# max_depth=5,
features=predictors,
row_subsample=0.5,column_subsample=0.5,
validation_set=test_set,random_seed=0,
)
In [466]:
print 'Logistic Regression '
print '--------------------'
LogReg_model.evaluate(test_set, metric='auto')
LogReg_model.show(view='Evaluation')
print 'RandomForest '
print '--------------------'
RFClass_model.evaluate(test_set, metric='auto')
RFClass_model.show(view='Evaluation')
print 'Boosted Trees '
print '--------------------'
BTClass_model.evaluate(test_set, metric='auto')
BTClass_model.show(view='Evaluation')
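The classification evaluation above (accuracy plus a confusion matrix per model) can be sketched without GraphLab using scikit-learn. The data here are a synthetic stand-in for the Match_day label; only the evaluation pattern is the point:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.RandomState(0)
X = rng.rand(300, 2)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # synthetic binary Match_day label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
pred = clf.predict(X_te)

acc = accuracy_score(y_te, pred)
cm = confusion_matrix(y_te, pred)   # rows: true class, columns: predicted class
print('accuracy:', acc)
print('confusion matrix:')
print(cm)
```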