Prediction with Machine Learning: Part 3

In the first part (MarketPlaceSimulator.ipynb) we briefly described the main assumptions and strategy of the MarketPlace simulator and presented its source code. In the second part (Exploratory_data_analysis.ipynb), we used the data generated in part one to explore correlations and trends in the data. In this part of the project, we will use ML algorithms to answer the following questions:

  1. Can we predict when a trader will be matched, given their chosen set of parameters?
  2. Which method performs best for this task?

For this task we will be using the following Python packages: matplotlib, GraphLab Create, and SFrame. As mentioned in Part 1 of the project, the MarketPlace simulator generates data sets named dayX_relinfo.csv, where X corresponds to a particular day. These data sets are already balanced, i.e., the numbers of positive and negative observations (Match_day) are equally distributed.
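As a quick sanity check on that balance claim, one could count the Match_day labels directly. Below is a minimal sketch in plain Python; the helper name and the small in-memory sample are illustrative stand-ins for rows read from a real dayX_relinfo.csv:

```python
# Hypothetical helper to verify the positive/negative balance of Match_day.
from collections import Counter

def match_day_balance(rows):
    """Return the fraction of each Match_day label in an iterable of dict rows."""
    counts = Counter(int(r['Match_day']) for r in rows)
    total = float(sum(counts.values()))
    return {label: n / total for label, n in counts.items()}

# Tiny in-memory stand-in for rows read from dayX_relinfo.csv:
sample = [{'Match_day': '1'}, {'Match_day': '0'},
          {'Match_day': '1'}, {'Match_day': '0'}]
print(match_day_balance(sample))
```

On a balanced data set both fractions should come out close to 0.5.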

This notebook is structured as follows:

1. Data Overview
2. Applying different Machine Learning algorithms to different models 
   i. Regression methods 
      Results
   ii. Classification methods
      Results

1. Data Overview


In [405]:
import graphlab as gl
import matplotlib.pyplot as plt
import pandas as pd
import scipy
import numpy
import sklearn
gl.canvas.set_target('ipynb')
%matplotlib inline

In [421]:
# Loading the data
data_all = gl.SFrame.read_csv('Source_Code/TradeLog_relinfo.csv')
data_day1 = gl.SFrame.read_csv('Source_Code/day1_relinfo.csv')
data_day2 = gl.SFrame.read_csv('Source_Code/day2_relinfo.csv')
data_day3 = gl.SFrame.read_csv('Source_Code/day3_relinfo.csv')


Finished parsing file /Users/meloamaury/Documents/Big_data_specialization/Projects_GitHub/FOREX_p2p_simulation_ML/Source_Code/TradeLog_relinfo.csv
Parsing completed. Parsed 100 lines in 0.128158 secs.
------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,float,float,float,float,int,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
Finished parsing file /Users/meloamaury/Documents/Big_data_specialization/Projects_GitHub/FOREX_p2p_simulation_ML/Source_Code/TradeLog_relinfo.csv
Parsing completed. Parsed 28874 lines in 0.09628 secs.
Finished parsing file /Users/meloamaury/Documents/Big_data_specialization/Projects_GitHub/FOREX_p2p_simulation_ML/Source_Code/day1_relinfo.csv
Parsing completed. Parsed 100 lines in 0.092324 secs.
------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,float,float,float,int,int,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
Finished parsing file /Users/meloamaury/Documents/Big_data_specialization/Projects_GitHub/FOREX_p2p_simulation_ML/Source_Code/day1_relinfo.csv
Parsing completed. Parsed 43244 lines in 0.0931 secs.
Finished parsing file /Users/meloamaury/Documents/Big_data_specialization/Projects_GitHub/FOREX_p2p_simulation_ML/Source_Code/day2_relinfo.csv
Parsing completed. Parsed 100 lines in 0.087495 secs.
Finished parsing file /Users/meloamaury/Documents/Big_data_specialization/Projects_GitHub/FOREX_p2p_simulation_ML/Source_Code/day2_relinfo.csv
Parsing completed. Parsed 41838 lines in 0.091811 secs.
Finished parsing file /Users/meloamaury/Documents/Big_data_specialization/Projects_GitHub/FOREX_p2p_simulation_ML/Source_Code/day3_relinfo.csv
Parsing completed. Parsed 100 lines in 0.09296 secs.
Finished parsing file /Users/meloamaury/Documents/Big_data_specialization/Projects_GitHub/FOREX_p2p_simulation_ML/Source_Code/day3_relinfo.csv
Parsing completed. Parsed 40628 lines in 0.096659 secs.

In [424]:
# Quick look at the full data set (data_all)
data_all


Out[424]:
CurrencySell avgIBR OrdRate_IBR wavgMatchRate_IBR AmountOrd TimeStampPlaced Time_diff
EURO 0.778787670271 0.7764 0.780645237163 70300.0 8053 970
GBP 1.28214298363 1.2844 1.2853470437 116100.0 0 14811
GBP 1.29415696773 1.2931 1.2962761421 36800.0 0 4215
GBP 1.28908591455 1.2888 1.28932439402 54400.0 0 7681
EURO 0.778015393152 0.7778 0.778816199377 60400.0 14249 4
GBP 1.27861019473 1.2787 1.28336755647 2800.0 0 17154
EURO 0.783349861192 0.7796 0.780125386965 147900.0 8678 9289
EURO 0.782811168004 0.78 0.78063902068 134900.0 1259 17513
EURO 0.771292037295 0.7696 0.771663156835 99000.0 3526 0
GBP 1.29322457642 1.2909 1.29123797431 113100.0 0 1341
[28874 rows x 7 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

In [425]:
data_day1


Out[425]:
CurrencySell avgIBR OrdRate_IBR wavgMatchRate_IBR AmountOrd TimeStampPlaced Match_day
GBP 1.28867470926 1.2896 1.29032258065 18800 0 0
EURO 0.772660362928 0.7715 0.773335395561 92500 4304 1
GBP 1.28501839922 1.2809 1.28766359505 147500 0 0
EURO 0.782007605574 0.7791 0.781606216338 77400 4606 0
EURO 0.776004350398 0.7724 0.776276975625 9500 11144 1
GBP 1.29253210267 1.2869 1.29115558425 14500 0 0
EURO 0.783022997104 0.7799 0.782131738464 120000 4205 0
EURO 0.77178992695 0.7734 0.773874013311 133700 4712 1
GBP 1.28104663896 1.2832 1.28435653737 12000 0 0
EURO 0.781611445232 0.779 0.78339208774 59900 1011 0
[43244 rows x 7 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

In [426]:
my_features = ['avgIBR', 'OrdRate_IBR', 'wavgMatchRate_IBR', 'AmountOrd','TimeStampPlaced']

In [427]:
# Filtering Euro and GBP
data_day1_Euro = data_day1[data_day1['CurrencySell'] == 'EURO']
data_day1_GBP  = data_day1[data_day1['CurrencySell'] == 'GBP' ]
data_day1_Euro[my_features].show()
data_day1_GBP[my_features].show()
data_day1['Match_day'].show(view='Categorical')


2. Applying different Machine Learning algorithms to different models

i. Regression methods

We will assess how well different regression algorithms predict when a matching event will occur. For this, we will use Time_diff as our continuous target variable. In particular, we will use Linear Regression, Random Forest, and Boosted Trees. We will now pick a set of predictor variables and feed the same train/test split to each algorithm.
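The same pipeline can be sketched with scikit-learn (which the notebook already imports). The models and hyperparameters below roughly mirror the GraphLab calls, but the data is a random stand-in for the real features, so the printed RMSE values are purely illustrative:

```python
# Hedged scikit-learn sketch of the regression comparison; X is a random
# stand-in for the five numeric features (avgIBR, OrdRate_IBR,
# wavgMatchRate_IBR, AmountOrd, TimeStampPlaced), y for Time_diff.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.rand(1000, 5)                         # synthetic features
y = 3000 * X[:, 0] + 500 * rng.randn(1000)    # synthetic Time_diff-like target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

models = {
    'LinearRegression': LinearRegression(),
    'RandomForest': RandomForestRegressor(n_estimators=100, max_depth=3,
                                          random_state=0),
    'BoostedTrees': GradientBoostingRegressor(n_estimators=100, max_depth=3,
                                              random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
    print('%s RMSE: %.1f' % (name, rmse))
```

The point of the comparison is the same as below: fit each regressor on the identical split and rank them by held-out RMSE.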


In [443]:
# Splitting the data into train and test sets
train_set, test_set = data_all.random_split(.8,seed=0)

# 1) Linear Regression
predictors = ['CurrencySell','avgIBR', 'OrdRate_IBR', 'wavgMatchRate_IBR', 'AmountOrd','TimeStampPlaced']

LinearReg_model = gl.linear_regression.create(train_set, target='Time_diff',
                                              features=predictors, validation_set=test_set);


Linear regression:
--------------------------------------------------------
Number of examples          : 23163
Number of features          : 6
Number of unpacked features : 6
Number of coefficients    : 7
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
| 1         | 2        | 0.047536     | 15849.872531       | 15842.107631         | 4793.398078   | 4822.789060     |
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
SUCCESS: Optimal solution found.


In [444]:
# 2) RandomForest Regression
RFReg_model = gl.random_forest_regression.create(train_set, target='Time_diff', max_iterations=500, max_depth=3,
                                              features=predictors, validation_set=test_set);


Random forest regression:
--------------------------------------------------------
Number of examples          : 23163
Number of features          : 6
Number of unpacked features : 6
+-----------+--------------+--------------------+----------------------+---------------+-----------------+
| Iteration | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
+-----------+--------------+--------------------+----------------------+---------------+-----------------+
| 1         | 0.015168     | 13217.349609       | 13099.349609         | 2780.771484   | 2803.247803     |
| 2         | 0.026861     | 10514.763672       | 8589.632812          | 2894.516113   | 2903.438721     |
| 3         | 0.041398     | 8414.614258        | 7126.173340          | 2745.850586   | 2761.121582     |
| 4         | 0.056609     | 8702.555664        | 6697.072266          | 2642.983643   | 2653.546875     |
| 5         | 0.071963     | 9619.421875        | 7803.712402          | 2629.226807   | 2639.807617     |
| 6         | 0.083867     | 10226.606445       | 8710.681641          | 2642.552246   | 2656.499023     |
| 11        | 0.144155     | 10510.059570       | 9244.789062          | 2590.905029   | 2603.983398     |
| 51        | 0.536541     | 10151.790039       | 7761.172852          | 2667.808350   | 2675.899414     |
| 100       | 1.049988     | 9887.041016        | 7602.176270          | 2624.868652   | 2632.700928     |
| 101       | 1.060756     | 9790.277344        | 7637.093750          | 2622.662354   | 2630.562500     |
| 200       | 2.035496     | 9417.754883        | 7426.119629          | 2600.487549   | 2608.229980     |
| 300       | 3.029947     | 9363.733398        | 7323.023438          | 2599.992676   | 2607.226807     |
| 400       | 4.024596     | 9215.952148        | 7397.187500          | 2600.870605   | 2608.482422     |
| 500       | 5.005080     | 9014.969727        | 7363.248047          | 2602.949707   | 2610.561035     |
+-----------+--------------+--------------------+----------------------+---------------+-----------------+

In [445]:
# 3) Boosted Trees
BTReg_model = gl.boosted_trees_regression.create(train_set, target='Time_diff', max_iterations=500, max_depth=3,
                                              features=predictors, validation_set=test_set)


Boosted trees regression:
--------------------------------------------------------
Number of examples          : 23163
Number of features          : 6
Number of unpacked features : 6
+-----------+--------------+--------------------+----------------------+---------------+-----------------+
| Iteration | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
+-----------+--------------+--------------------+----------------------+---------------+-----------------+
| 1         | 0.024020     | 15232.063477       | 15234.063477         | 6956.173340   | 6979.920410     |
| 2         | 0.048766     | 12002.864258       | 12004.864258         | 5141.800781   | 5163.077148     |
| 3         | 0.069211     | 9466.624023        | 9468.624023          | 3841.766113   | 3849.787842     |
| 4         | 0.083785     | 9242.631836        | 7336.728516          | 2926.786133   | 2933.408203     |
| 5         | 0.102145     | 9172.524414        | 7909.395020          | 2309.164307   | 2315.242188     |
| 6         | 0.116817     | 9092.616211        | 8314.770508          | 1902.402588   | 1910.980469     |
| 11        | 0.193334     | 10094.978516       | 9453.257812          | 1300.739990   | 1319.108643     |
| 51        | 0.649600     | 7457.339844        | 5789.382324          | 1048.066284   | 1082.307129     |
| 100       | 1.178002     | 9196.906250        | 5899.379883          | 983.706116    | 1041.555420     |
| 101       | 1.190107     | 9197.146484        | 5896.247559          | 982.727356    | 1041.043945     |
| 200       | 2.307863     | 7457.750488        | 5921.600098          | 917.339233    | 1016.228271     |
| 300       | 3.423663     | 6342.536621        | 6269.685547          | 871.117859    | 1004.445007     |
| 400       | 4.543557     | 6024.265137        | 6575.474609          | 837.137451    | 999.087769      |
| 500       | 5.816777     | 5923.801270        | 6568.659668          | 808.748047    | 1001.096985     |
+-----------+--------------+--------------------+----------------------+---------------+-----------------+

Results

As we can see, the best regression model is Boosted Trees, with a validation RMSE of roughly 1001 in Time_diff units (about 2 days on the simulator's time scale).


In [454]:
print '--------------------------------'
print 'Results for Regression models: '
print '--------------------------------'
print 'Linear Regression: ', LinearReg_model.evaluate(test_set,metric='auto')
#LinearReg_model.show(view='Evaluation')

print '--------------------------------'
print 'Results for Regression models: '
print '--------------------------------'
print 'RandomForest : ', RFReg_model.evaluate(test_set,metric='auto')
#RFReg_model.show(view='Evaluation')


print '--------------------------------'
print 'Results for Regression models: '
print '--------------------------------'
print 'Boosted Trees: ', BTReg_model.evaluate(test_set,metric='auto')
#BTReg_model.show(view='Evaluation')


--------------------------------
Results for Regression models: 
--------------------------------
Linear Regression:  {'max_error': 15842.107631492778, 'rmse': 4822.789059864433}
--------------------------------
Results for Regression models: 
--------------------------------
RandomForest :  {'max_error': 7363.248046875, 'rmse': 2610.561045707597}
--------------------------------
Results for Regression models: 
--------------------------------
Boosted Trees:  {'max_error': 6568.65966796875, 'rmse': 1001.096981365979}

ii. Classification methods

If we reformulate the question as "Will a matching event occur on day 1 (or day 2, or day 3, ...)?", the problem becomes a binary classification problem. For this, we will use the binary Match_day variable in each dayX data set as our target variable.
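The same three classifier families exist in scikit-learn, and a hedged sketch on synthetic stand-in data shows the shape of the experiment (GradientBoostingClassifier plays the role of GraphLab's boosted trees; the accuracies printed here mean nothing for the real data):

```python
# Hedged scikit-learn sketch of the binary classification comparison on
# synthetic data; y is a noisy stand-in for the Match_day label.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

rng = np.random.RandomState(0)
X = rng.rand(1000, 5)
y = (X[:, 0] + 0.3 * rng.randn(1000) > 0.5).astype(int)  # synthetic Match_day

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

for clf in (LogisticRegression(max_iter=200),
            RandomForestClassifier(n_estimators=100, max_depth=1,
                                   random_state=0),
            GradientBoostingClassifier(n_estimators=100, random_state=0)):
    clf.fit(X_train, y_train)
    print('%s accuracy: %.3f' % (clf.__class__.__name__,
                                 clf.score(X_test, y_test)))
```

As in the GraphLab run below, all three models are trained on the same split so their held-out accuracies are directly comparable.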


In [465]:
# Splitting the data into train and test sets
train_set, test_set = data_day2.random_split(.8,seed=0)

gl.canvas.set_target('ipynb')

predictors = ['CurrencySell','avgIBR', 'OrdRate_IBR', 'wavgMatchRate_IBR', 'AmountOrd','TimeStampPlaced']

max_it = 50

# 1) Logistic Regression
LogReg_model = gl.logistic_classifier.create(train_set, target='Match_day',
                                              features=predictors, max_iterations=max_it,
                                              feature_rescaling=True,
                                              verbose=True, validation_set=test_set,
                                              l2_penalty=0);
# 2) RandomForest Classifier
RFClass_model = gl.random_forest_classifier.create(train_set, target='Match_day',
                                                   max_iterations=500, max_depth=1,
                                                  features=predictors,
                                                   row_subsample=0.5,column_subsample=0.5,
                                                  validation_set=test_set,random_seed=0,
                                                  )
# 3) Boosted Trees
BTClass_model = gl.boosted_trees_classifier.create(train_set, target='Match_day',
                                                   max_iterations=500,# max_depth=5,
                                                  features=predictors,
                                                   row_subsample=0.5,column_subsample=0.5,
                                                  validation_set=test_set,random_seed=0,
                                                  )


Logistic regression:
--------------------------------------------------------
Number of examples          : 33505
Number of classes           : 2
Number of feature columns   : 6
Number of unpacked features : 6
Number of coefficients    : 7
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+-------------------+---------------------+
| Iteration | Passes   | Elapsed Time | Training-accuracy | Validation-accuracy |
+-----------+----------+--------------+-------------------+---------------------+
| 1         | 2        | 0.112509     | 0.781406          | 0.781471            |
| 2         | 3        | 0.188744     | 0.799254          | 0.799592            |
| 3         | 4        | 0.272983     | 0.805193          | 0.804872            |
| 4         | 5        | 0.351393     | 0.805641          | 0.805112            |
| 5         | 6        | 0.429185     | 0.805641          | 0.805112            |
+-----------+----------+--------------+-------------------+---------------------+
SUCCESS: Optimal solution found.

Random forest classifier:
--------------------------------------------------------
Number of examples          : 33505
Number of classes           : 2
Number of feature columns   : 6
Number of unpacked features : 6
+-----------+--------------+-------------------+---------------------+-------------------+---------------------+
| Iteration | Elapsed Time | Training-accuracy | Validation-accuracy | Training-log_loss | Validation-log_loss |
+-----------+--------------+-------------------+---------------------+-------------------+---------------------+
| 1         | 0.025651     | 0.777377          | 0.777151            | 0.519542          | 0.520197            |
| 2         | 0.038974     | 0.808118          | 0.799112            | 0.453576          | 0.459284            |
| 3         | 0.054225     | 0.855305          | 0.856354            | 0.427850          | 0.430925            |
| 4         | 0.073490     | 0.855305          | 0.856354            | 0.434483          | 0.436631            |
| 5         | 0.090868     | 0.855305          | 0.856354            | 0.425800          | 0.427173            |
| 6         | 0.111472     | 0.855842          | 0.856234            | 0.421337          | 0.422356            |
| 11        | 0.175801     | 0.855454          | 0.855994            | 0.419030          | 0.419000            |
| 51        | 0.683938     | 0.842173          | 0.841714            | 0.423037          | 0.423376            |
| 100       | 1.312712     | 0.842650          | 0.841954            | 0.424257          | 0.424470            |
| 101       | 1.332000     | 0.842382          | 0.841954            | 0.424364          | 0.424573            |
| 200       | 2.735931     | 0.845128          | 0.845434            | 0.423001          | 0.423182            |
| 300       | 4.147366     | 0.848829          | 0.849874            | 0.422484          | 0.422758            |
| 400       | 5.513219     | 0.848829          | 0.849874            | 0.421837          | 0.422129            |
| 500       | 6.837018     | 0.848829          | 0.849874            | 0.422212          | 0.422469            |
+-----------+--------------+-------------------+---------------------+-------------------+---------------------+
Boosted trees classifier:
--------------------------------------------------------
Number of examples          : 33505
Number of classes           : 2
Number of feature columns   : 6
Number of unpacked features : 6
+-----------+--------------+-------------------+---------------------+-------------------+---------------------+
| Iteration | Elapsed Time | Training-accuracy | Validation-accuracy | Training-log_loss | Validation-log_loss |
+-----------+--------------+-------------------+---------------------+-------------------+---------------------+
| 1         | 0.037335     | 0.777496          | 0.776671            | 0.609024          | 0.609530            |
| 2         | 0.065255     | 0.895180          | 0.889836            | 0.469258          | 0.473143            |
| 3         | 0.091921     | 0.915117          | 0.908796            | 0.379817          | 0.384646            |
| 4         | 0.116596     | 0.895120          | 0.891156            | 0.349481          | 0.354919            |
| 5         | 0.146902     | 0.897329          | 0.891876            | 0.313815          | 0.321056            |
| 6         | 0.173273     | 0.898045          | 0.889716            | 0.289776          | 0.298205            |
| 11        | 0.296606     | 0.942844          | 0.933637            | 0.182742          | 0.193816            |
| 50        | 1.130673     | 0.975347          | 0.965439            | 0.064928          | 0.078761            |
| 51        | 1.152377     | 0.975287          | 0.965319            | 0.063999          | 0.077857            |
| 100       | 2.227975     | 0.989196          | 0.978279            | 0.039098          | 0.057106            |
| 101       | 2.249638     | 0.989046          | 0.978279            | 0.038883          | 0.057076            |
| 150       | 3.445701     | 0.994240          | 0.982119            | 0.026600          | 0.047633            |
| 200       | 4.521577     | 0.997284          | 0.984279            | 0.019857          | 0.042855            |
| 250       | 5.766251     | 0.998418          | 0.985479            | 0.014701          | 0.039748            |
| 300       | 7.070087     | 0.999373          | 0.986799            | 0.011317          | 0.037721            |
| 350       | 8.295815     | 0.999702          | 0.987159            | 0.008983          | 0.036716            |
| 400       | 9.665097     | 0.999881          | 0.987040            | 0.007399          | 0.036447            |
| 450       | 10.910552    | 1.000000          | 0.987400            | 0.006140          | 0.036235            |
| 500       | 12.031473    | 1.000000          | 0.987520            | 0.005014          | 0.035362            |
+-----------+--------------+-------------------+---------------------+-------------------+---------------------+

Results

Next we show the metrics for each model and discuss the results.
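The kinds of numbers GraphLab's evaluate() reports (accuracy, confusion matrix, precision/recall) can also be computed by hand with sklearn.metrics. A toy example, where the label arrays are made up for illustration rather than taken from any model:

```python
# Reproducing standard classification metrics with sklearn.metrics on
# hand-made true vs. predicted Match_day labels.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print('accuracy : %.2f' % accuracy_score(y_true, y_pred))   # 6 of 8 correct
print('precision: %.2f' % precision_score(y_true, y_pred))  # 3 of 4 predicted 1s are true 1s
print('recall   : %.2f' % recall_score(y_true, y_pred))     # 3 of 4 true 1s recovered
print(confusion_matrix(y_true, y_pred))
```

All three metrics come out to 0.75 on this toy example, with a confusion matrix of [[3, 1], [1, 3]].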


In [466]:
print 'Logistic Regression '
print '--------------------'
LogReg_model.evaluate(test_set, metric='auto')
LogReg_model.show(view='Evaluation')

print 'RandomForest        '
print '--------------------'
RFClass_model.evaluate(test_set, metric='auto')
RFClass_model.show(view='Evaluation')

print 'Boosted Trees        '
print '--------------------'
BTClass_model.evaluate(test_set, metric='auto')
BTClass_model.show(view='Evaluation')


Logistic Regression 
--------------------
RandomForest        
--------------------
Boosted Trees        
--------------------

In [ ]: