AIR POLLUTION DATASET EXERCISE

This dataset was taken from the UCI machine learning repository. It contains features such as time, different gas pollution indicators, temperature, relative humidity and absolute humidity from a city in Italy. The main objective of this analysis is to estimate the change in temperature when the pollution features change over time. In simpler world, the model will try to estimate how temperature increase and decrease based on changes in the concentration of pollutants in the air column above the region where the measuring was taken. It will also determine which are the pollutants that most influence the temperature for the area in which the measures were taken. There are many factors that can contribute to the change in temperature, in consequence I expect to see how much these pollutants explain the variability of the temperature and/or humidity. If they contribute significantly to it, I believe that in the future, those local factors could be added to the weather prediction models.

Importing the main libraries


In [40]:
%matplotlib notebook
import IPython
from IPython.display import display
from sqlalchemy import create_engine
import psycopg2
import psycopg2.extras
import pandas as pd
import csv
from numpy import nan as NA
from datetime import datetime
import re
import sys
import numpy as np
import matplotlib.pyplot as plt 
from pandas import *
import pickle
import requests
import os
from sklearn.preprocessing import LabelEncoder
from yellowbrick.features import Rank2D
from sklearn.linear_model import LinearRegression
from yellowbrick.features.importances import FeatureImportances
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.decomposition import KernelPCA
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
import pprint

Data ingestion and Wrangling

Note here that part of the wrangling was done using Postgresql. The file it is called SQL_code and it is in the same folder.


In [41]:
# Setting display options:
pd.set_option('max_columns',50) 
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

Defining the ingestion class


In [42]:
class Ingest(object):

    def __init__(self, database, query):
        self.engine = create_engine(database)
        self.table_names = self.engine.table_names()
        self.con = self.engine.connect()
        self.rs = self.con.execute(query)
        self.df = pd.DataFrame(self.rs.fetchall())
        self.con.close()

    def cols(self):
        self.df.columns = self.rs.keys()
        return self.df

In [43]:
dataframe = Ingest('postgresql://postgres:postgres@localhost:5432/Project', "SELECT * FROM Air_Quality")

Exploring the data


In [44]:
df = dataframe.cols()

In [45]:
df.head(5)


Out[45]:
date time cogt pt08co nmhcgt c6h6gt pt08nmhc noxgt pt08nox no2 pt08no2 pt08o3 temperaturec relhumidity ahabhumidity datetime
0 2004-03-10 18:00:00 2.6 1360.0 150.0 11.9 1046.0 166.0 1056.0 113.0 1692.0 1268.0 13.6 48.9 0.7578 2004-03-10 18:00:00
1 2004-03-10 19:00:00 2.0 1292.0 112.0 9.4 955.0 103.0 1174.0 92.0 1559.0 972.0 13.3 47.7 0.7255 2004-03-10 19:00:00
2 2004-03-10 20:00:00 2.2 1402.0 88.0 9.0 939.0 131.0 1140.0 114.0 1555.0 1074.0 11.9 54.0 0.7502 2004-03-10 20:00:00
3 2004-03-10 21:00:00 2.2 1376.0 80.0 9.2 948.0 172.0 1092.0 122.0 1584.0 1203.0 11.0 60.0 0.7867 2004-03-10 21:00:00
4 2004-03-10 22:00:00 1.6 1272.0 51.0 6.5 836.0 131.0 1205.0 116.0 1490.0 1110.0 11.2 59.6 0.7888 2004-03-10 22:00:00

In [46]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9357 entries, 0 to 9356
Data columns (total 16 columns):
date            9357 non-null object
time            9357 non-null object
cogt            9357 non-null float64
pt08co          9357 non-null float64
nmhcgt          9357 non-null float64
c6h6gt          9357 non-null float64
pt08nmhc        9357 non-null float64
noxgt           9357 non-null float64
pt08nox         9357 non-null float64
no2             9357 non-null float64
pt08no2         9357 non-null float64
pt08o3          9357 non-null float64
temperaturec    9357 non-null float64
relhumidity     9357 non-null float64
ahabhumidity    9357 non-null float64
datetime        9357 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(13), object(2)
memory usage: 1.1+ MB

There is no missing data in this dataset. The little wrangling that was needed was done in PostgreSQL. The next step is to transform everything into numerical in order to perform visual exploration of the dataset. Since most of the columns (including the potential target features) contains continuous data. I intend to fit a regressor in order to perform predictions.

Encoding the dataset

Since the dataset contains time values, it will be necesary to encode the dates before using it. In this exercise I will only encode the date and time columns since all the other columns are continuous.


In [50]:
def encoder(data):
    """Since LabelEncoder only works for single features, it is necesary to create a loop
        that will iterate through all of them"""
    encoder = LabelEncoder() 
    for colname,col in data.iteritems(): # adapted from stack overflow  
        data[colname] = encoder.fit_transform(col.astype(str)) 
    return (data)

In [51]:
df2 = encoder(df[['date', 'time', 'datetime']])
df3 = df.drop(['date', 'time', 'datetime'], axis = 1)
df3[['date', 'time', 'datetime']] = df2[['date', 'time', 'datetime']]
df = df3

df.head(5)


c:\users\franc\appdata\local\programs\python\python37\lib\site-packages\ipykernel_launcher.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
Out[51]:
cogt pt08co nmhcgt c6h6gt pt08nmhc noxgt pt08nox no2 pt08no2 pt08o3 temperaturec relhumidity ahabhumidity date time datetime
0 2.6 1360.0 150.0 11.9 1046.0 166.0 1056.0 113.0 1692.0 1268.0 13.6 48.9 0.7578 0 10 0
1 2.0 1292.0 112.0 9.4 955.0 103.0 1174.0 92.0 1559.0 972.0 13.3 47.7 0.7255 0 11 1
2 2.2 1402.0 88.0 9.0 939.0 131.0 1140.0 114.0 1555.0 1074.0 11.9 54.0 0.7502 0 13 1112
3 2.2 1376.0 80.0 9.2 948.0 172.0 1092.0 122.0 1584.0 1203.0 11.0 60.0 0.7867 0 14 2223
4 1.6 1272.0 51.0 6.5 836.0 131.0 1205.0 116.0 1490.0 1110.0 11.2 59.6 0.7888 0 15 3334

In [52]:
df.keys()


Out[52]:
Index(['cogt', 'pt08co', 'nmhcgt', 'c6h6gt', 'pt08nmhc', 'noxgt', 'pt08nox',
       'no2', 'pt08no2', 'pt08o3', 'temperaturec', 'relhumidity',
       'ahabhumidity', 'date', 'time', 'datetime'],
      dtype='object')

Further Exploration of the dataset:

Exploring the relationship between variables:


In [53]:
features = df.keys()
visualizer = Rank2D(features=features, algorithm='pearson')

visualizer.fit(df.values)                
visualizer.transform(df.values)             
visualizer.poof()


As it can be appreciated here, there some multicollinearity between the most important variables. and the correlation between temperature and the rest of the features is quite variable.

Selecting the target variable:

There are two potential target variables. Humidity, Temperature and Relative Humidity. In this case, I will select Temperature because it the simplest and most understand feature. The other two will be left out since I am focusing in the importance of anthropogenic pollutants.


In [54]:
X_ = df.drop(['temperaturec', 'relhumidity', 'ahabhumidity'], axis=1)
y_ = df['temperaturec']

Preliminary Feature importance analysis


In [123]:
features = X_.keys()
Xi = X_[features]
#Xi = Xi.values
yi = y_.values

figure = plt.figure()
axis = figure.add_subplot()

viz = FeatureImportances(LinearRegression(), ax=axis)
viz.fit(Xi, yi)
viz.poof()



In [56]:
features = X_.keys()
Xi = X_[features]
#Xi = Xi.values
yi = y_.values

figure = plt.figure()
axis = figure.add_subplot()

viz = FeatureImportances(Lasso(), ax=axis)
viz.fit(Xi, yi)
viz.poof()



In [57]:
features = X_.keys()
Xi = X_[features]
#Xi = Xi.values
yi = y_.values

figure = plt.figure()
axis = figure.add_subplot()

viz = FeatureImportances(Ridge(), ax=axis)
viz.fit(Xi, yi)
viz.poof()



In [161]:
features = X_.keys()
Xi = X_[features]
#Xi = Xi.values
yi = y_.values

figure = plt.figure()
axis = figure.add_subplot()

viz = FeatureImportances(SVR(kernel='linear'), ax=axis)
viz = viz.fit(Xi, yi)
viz.poof()



In [162]:
with open('figure.pickle', 'wb') as figure:
            pickle.dump(viz , figure)

In [ ]:
with open('figure.pickle', "rb") as figur:
    fig = pickle.load(figur, encoding="utf8")
fig.poof()

In Here, it can be appreciated that the most important feature is benzene followed by benzene, date, no-methane hydrocarbons (nmhc) and nitrogen compounds. Note that this feature importance is only related to linear models. Other methods should be taken into consideration when selecting features.

Data Preprocessing and scaling

Scaling:

First, I will perfomn a test to check the veriability of the dataset and then perform a scaling in order to reduce that variability.


In [93]:
cols2 = pd.DataFrame(X_)
cols2 = list(cols2.columns)

boxplot = X_.boxplot(column=cols2, rot=90, fontsize=10)



In [101]:
minmax_scaler =  MinMaxScaler(feature_range=(0, 1)) 
X_minmax = minmax_scaler.fit(X_).transform(X_)
yi = y_.values.reshape(-1,1)
#y_minmax = minmax_scaler.fit(yi).transform(yi)

In [102]:
X_minmax = pd.DataFrame(X_minmax)
X_minmax.columns = X_.columns
cols2 = pd.DataFrame(X_minmax)
cols2 = list(cols2.columns)
boxplot = X_minmax.boxplot(column=cols2, rot=90, fontsize=10)



In [103]:
scaler =  StandardScaler() 
X_std = scaler.fit(X_).transform(X_)
y_std = y_.values.reshape(-1,1)
#y_std = scaler.fit(y_std).transform(y_std)

X_std = pd.DataFrame(X_std)
X_std.columns = X_.columns
cols2 = pd.DataFrame(X_std)
cols2 = list(cols2.columns)



boxplot = X_std.boxplot(column=cols2, rot=90, fontsize=10)


In this exersice I will use Standard since the medians are much closer together:

Defining the trainin set and the test set:


In [104]:
from sklearn import model_selection
X  = X_
y  = y_

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2,random_state= 0)
X_train = scaler.fit(X_train).transform(X_train)
X_test = scaler.fit(X_test).transform(X_test)

Optimal Number of features using recursive feature elimination:


In [105]:
def rec_fe(target, data, filename):
    
    if __name__ == '__main__':

        svr = SVR(kernel = 'linear')
        rfecv = RFECV(estimator=svr, step=1, cv=KFold(12), scoring='r2', verbose = 10, n_jobs = -1)
        rfecv.fit(data,target)
        optimal_features = rfecv.n_features_

        print("Optimal number of features : %d" % rfecv.n_features_) 

        with open(filename, 'wb') as features:
            pickle.dump([optimal_features, rfecv] , features)

In [106]:
rec_fe(y_train,X_train, 'OptimalFeatures.pickle')


Optimal number of features : 13

In [107]:
with open('OptimalFeatures.pickle', "rb") as feature:
    feat = pickle.load(feature, encoding="utf8")

rfecv = feat[1]

In [108]:
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score")
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()


The plot shows that the optimal number of features is 13. Later during the model improvement and hyperparameter tuning, I will confirm this using dimensionality reduction techniques.

Model Selection


In [109]:
from yellowbrick.regressor import ResidualsPlot


lin = LinearRegression()
visualizer = ResidualsPlot(lin)

visualizer.fit(X_train, y_train)  
visualizer.score(X_test, y_test)  
visualizer.poof()


Here the points for both, the training and the test set are not horizontally and randomly distributed along the plot. This suggest that a non-linear model would be best suited for this particular case.


In [110]:
regr = LinearRegression()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
print("Mean squared error: %.5f"% mean_squared_error(y_test,y_pred))

print('r2 score for the training set: %.5f' % regr.score(X_train, y_train))
print ('r2 score for the test set : %.5f' % regr.score(X_test, y_test))


Mean squared error: 26.68950
r2 score for the training set: 0.98616
r2 score for the test set : 0.98569

In [111]:
regr = Ridge()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
print("Mean squared error: %.5f"% mean_squared_error(y_test,y_pred))

print('r2 score for the training set: %.5f' % regr.score(X_train, y_train))
print ('r2 score for the test set : %.5f' % regr.score(X_test, y_test))


Mean squared error: 26.69380
r2 score for the training set: 0.98616
r2 score for the test set : 0.98569

In [112]:
regr = Lasso()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
print("Mean squared error: %.5f"% mean_squared_error(y_test,y_pred))

print('r2 score for the training set: %.5f' % regr.score(X_train, y_train))
print ('r2 score for the test set : %.5f' % regr.score(X_test, y_test))


Mean squared error: 48.26229
r2 score for the training set: 0.97555
r2 score for the test set : 0.97413

In [113]:
regr = SVR()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)


print('r2 score for the training set: %.5f' % regr.score(X_train, y_train))
print ('r2 score for the test set : %.5f' % regr.score(X_test, y_test))


r2 score for the training set: 0.90960
r2 score for the test set : 0.90263

In [114]:
regr = KNeighborsRegressor()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
print('r2 score for the training set: %.5f' % regr.score(X_train, y_train))
print ('r2 score for the test set : %.5f' % regr.score(X_test, y_test))


r2 score for the training set: 0.99777
r2 score for the test set : 0.99614

In [115]:
regr = RandomForestRegressor()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
print('r2 score for the training set: %.5f' % regr.score(X_train, y_train))
print ('r2 score for the test set : %.5f' % regr.score(X_test, y_test))


r2 score for the training set: 0.99964
r2 score for the test set : 0.99665

In [116]:
regr = MLPRegressor()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
print('r2 score for the training set: %.5f' % regr.score(X_train, y_train))
print ('r2 score for the test set : %.5f' % regr.score(X_test, y_test))


r2 score for the training set: 0.99554
r2 score for the test set : 0.99508
c:\users\franc\appdata\local\programs\python\python37\lib\site-packages\sklearn\neural_network\multilayer_perceptron.py:564: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  % self.max_iter, ConvergenceWarning)

As it can be seen here, the last three models showed to explain in a more accurate way the relationship between the independent variables and the dependent variable (temperature) but the differences in scores are almost negligible. In the final steps of this exercice I will work with the three best non linear regressors: KNNregressor, RandomForestRegressor and MLPRegressor.

Hyperparameter Tunning

This section is done solely for educational purposes, since the r-square of all the models is very high.


In [118]:
def gridsearch(model, parameters):
    grid_search = GridSearchCV(estimator = model, param_grid = parameters, scoring = 'r2', cv = 4, n_jobs = -1, verbose = 10)
    grid_search = grid_search.fit(X_train, y_train)
    score = grid_search.best_score_
    best_params = grid_search.best_params_
    return score, best_params

In [119]:
KNNREG = KNeighborsRegressor()
RNFREG = RandomForestRegressor()
MLPREG = MLPRegressor()
SVREG = SVR()
list_params = [{'n_neighbors':[1, 3, 5, 10, 30, 50]}, {'n_estimators' :[50,128,300, 500, 1000]},
               {'hidden_layer_sizes': [(100,), (200,), (300,)], 'max_iter': [200, 400, 1000]},
               {'kernel': ['poly', 'rbf', 'sigmoid'], 'C': [0.01,0.1,1,10]}]


models = [KNNREG, RNFREG, MLPREG, SVREG]

In [120]:
scores_dict = {'r2': [], 'best_params': []}

if __name__ == '__main__':

    for model, param in zip(models, list_params):

        acu, best_params = gridsearch(model, param)
        scores_dict['r2'].append(acu)
        scores_dict['best_params'].append(best_params)

        print (scores_dict)

    with open('gridsearch_y.pickle', 'wb') as grid:
        pickle.dump(scores_dict, grid)


Fitting 4 folds for each of 6 candidates, totalling 24 fits
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    1.4s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:    4.9s
[Parallel(n_jobs=-1)]: Done  12 out of  24 | elapsed:    5.0s remaining:    5.0s
[Parallel(n_jobs=-1)]: Done  15 out of  24 | elapsed:    5.2s remaining:    3.1s
[Parallel(n_jobs=-1)]: Done  18 out of  24 | elapsed:    6.1s remaining:    2.0s
[Parallel(n_jobs=-1)]: Done  21 out of  24 | elapsed:    6.4s remaining:    0.8s
[Parallel(n_jobs=-1)]: Done  24 out of  24 | elapsed:    6.8s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  24 out of  24 | elapsed:    6.8s finished
{'r2': [0.9960137542727955], 'best_params': [{'n_neighbors': 3}]}
Fitting 4 folds for each of 5 candidates, totalling 20 fits
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    3.8s
[Parallel(n_jobs=-1)]: Done   8 out of  20 | elapsed:   13.3s remaining:   19.9s
[Parallel(n_jobs=-1)]: Done  11 out of  20 | elapsed:   26.2s remaining:   21.4s
[Parallel(n_jobs=-1)]: Done  14 out of  20 | elapsed:   44.7s remaining:   19.1s
[Parallel(n_jobs=-1)]: Done  17 out of  20 | elapsed:  1.3min remaining:   13.9s
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:  1.4min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:  1.4min finished
{'r2': [0.9960137542727955, 0.998087083010924], 'best_params': [{'n_neighbors': 3}, {'n_estimators': 1000}]}
Fitting 4 folds for each of 9 candidates, totalling 36 fits
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    7.4s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:   17.4s
[Parallel(n_jobs=-1)]: Done  16 tasks      | elapsed:   36.0s
[Parallel(n_jobs=-1)]: Done  25 out of  36 | elapsed:  1.1min remaining:   28.8s
[Parallel(n_jobs=-1)]: Done  29 out of  36 | elapsed:  1.4min remaining:   20.2s
[Parallel(n_jobs=-1)]: Done  33 out of  36 | elapsed:  1.5min remaining:    8.1s
[Parallel(n_jobs=-1)]: Done  36 out of  36 | elapsed:  1.6min finished
{'r2': [0.9960137542727955, 0.998087083010924, 0.9957415968571529], 'best_params': [{'n_neighbors': 3}, {'n_estimators': 1000}, {'hidden_layer_sizes': (200,), 'max_iter': 1000}]}
Fitting 4 folds for each of 12 candidates, totalling 48 fits
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    3.5s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:    9.6s
[Parallel(n_jobs=-1)]: Done  16 tasks      | elapsed:   12.6s
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   17.3s
[Parallel(n_jobs=-1)]: Done  38 out of  48 | elapsed:   26.2s remaining:    6.8s
[Parallel(n_jobs=-1)]: Done  43 out of  48 | elapsed:   28.5s remaining:    3.2s
[Parallel(n_jobs=-1)]: Done  48 out of  48 | elapsed:   31.5s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  48 out of  48 | elapsed:   31.5s finished
{'r2': [0.9960137542727955, 0.998087083010924, 0.9957415968571529, 0.9903724953506169], 'best_params': [{'n_neighbors': 3}, {'n_estimators': 1000}, {'hidden_layer_sizes': (200,), 'max_iter': 1000}, {'C': 10, 'kernel': 'rbf'}]}

Final Feature selection


In [121]:
print (scores_dict)


{'r2': [0.9960137542727955, 0.998087083010924, 0.9957415968571529, 0.9903724953506169], 'best_params': [{'n_neighbors': 3}, {'n_estimators': 1000}, {'hidden_layer_sizes': (200,), 'max_iter': 1000}, {'C': 10, 'kernel': 'rbf'}]}

Based on the results from this gridsearch. I will select the best two models which where KnnRegressor and RandomForestRegressor. I will perform feature importance, and optimal number of features in order to further improve the model. There is no option for KNNRegressor feature importance: Note: this was made on the complete unscaled dataset and it is important to note the difference that exist with the linear models' feature importance plot.


In [122]:
X.columns = X_.columns
features = X_.keys()
Xi = X[features]
yi = y

from yellowbrick.features.importances import FeatureImportances
figure = plt.figure()
axis = figure.add_subplot()

viz = FeatureImportances(RandomForestRegressor(n_estimators= 1000), ax=axis)
viz = viz.fit(Xi, yi)
viz.poof()



In [125]:
def rec_fe2(target, data, filename):
    
    if __name__ == '__main__':

        rnfreg = RandomForestRegressor(n_estimators= 1000)
        rfecv = RFECV(estimator=rnfreg, step=1, cv=KFold(12), scoring='r2', verbose = 10, n_jobs = -1)
        rfecv.fit(data,target)
        optimal_features = rfecv.n_features_

        print("Optimal number of features : %d" % rfecv.n_features_) 

        with open(filename, 'wb') as features_knnreg:
            pickle.dump([optimal_features, rfecv] , features_knnreg)

In [126]:
rec_fe2(y_train,X_train, 'OptimalFeatures_rnfreg.pickle')


Optimal number of features : 10

In [127]:
with open('OptimalFeatures_rnfreg.pickle', "rb") as features:
    feat = pickle.load(features, encoding="utf8")

rferg = feat[1]

In [128]:
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score ")
plt.plot(range(1, len(rferg.grid_scores_) + 1), rferg.grid_scores_)
plt.show()


Final Models

Without dimentionality reduction:


In [129]:
regr = RandomForestRegressor(n_estimators= 1000)
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
print('r2 score for the training set: %.5f' % regr.score(X_train, y_train))
print ('r2 score for the test set : %.5f' % regr.score(X_test, y_test))


r2 score for the training set: 0.99978
r2 score for the test set : 0.99709

With dimentionality reduction:


In [130]:
def gridsearch2(model, parameters):
    grid_search = GridSearchCV(estimator = model, param_grid = parameters, scoring = 'r2', cv = 4, n_jobs = -1, verbose = 10)
    grid_search = grid_search.fit(X_train, y_train)
    score = grid_search.best_score_
    best_params = grid_search.best_params_
    return score, best_params

In [131]:
pipe = Pipeline([('reduce_dimensions', PCA()),('regressor', RandomForestRegressor(n_estimators = 1000, n_jobs = -1))])
random = RandomForestRegressor(n_estimators= 1000)
scores_dict2 = {'r2': [], 'best_params': []}
components = [2, 3, 4, 5, 6]
param_grid = [
    {
        'reduce_dimensions': [PCA(), KernelPCA(kernel = 'rbf'), TruncatedSVD()],
        'reduce_dimensions__n_components': components},]

if __name__ == '__main__':
    score, best_params = gridsearch2(pipe, param_grid)
    scores_dict2['r2'].append(acu)
    scores_dict2['best_params'].append(best_params)

    print (scores_dict2)

    with open('gridsearch2.pickle', 'wb') as grid2:
        pickle.dump(scores_dict2, grid2)


Fitting 4 folds for each of 15 candidates, totalling 60 fits
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    8.6s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:   50.4s
[Parallel(n_jobs=-1)]: Done  16 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  3.0min
[Parallel(n_jobs=-1)]: Done  45 tasks      | elapsed:  3.7min
[Parallel(n_jobs=-1)]: Done  52 out of  60 | elapsed:  4.3min remaining:   39.6s
[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed:  4.6min finished
{'r2': [0.9903724953506169], 'best_params': [{'reduce_dimensions': TruncatedSVD(algorithm='randomized', n_components=6, n_iter=5,
       random_state=None, tol=0.0), 'reduce_dimensions__n_components': 6}]}

The optimal number of features with dimentionality reduction is 6. The best Dimensionality reduction technique was TruncatedSVD. Next, I will Implement TruncatedSVD independently and compare it to manual feature elimination based on the feature importance analysis that was done previously.

Using Truncated SVD

Reset Training and test set:


In [138]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2,random_state= 0)
X_train = scaler.fit(X_train).transform(X_train)
X_test = scaler.fit(X_test).transform(X_test)

In [139]:
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(algorithm='randomized', n_components=6, n_iter=5)# input a number for feature extraction

X_train = svd.fit_transform(X_train)
X_test = svd.transform(X_test)

In [152]:
regr = RandomForestRegressor(n_estimators= 1000, n_jobs = -1)
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
print('r2 score for the training set: %.5f' % regr.score(X_train, y_train))
print ('r2 score for the test set : %.5f' % regr.score(X_test, y_test))


r2 score for the training set: 0.99942
r2 score for the test set : 0.99564

----------Predicted vs estimated----------


In [155]:
from sklearn.metrics import r2_score

'r2 score for the predicted y and the test y : %.5f' % r2_score(y_test, y_pred)


Out[155]:
'r2 score for the predicted y and the test y : 0.99564'

Manually selecting the most important features based on feature importance analysis


In [141]:
X_prime = X[['pt08nox', 'pt08o3', 'pt08nmhc', 'pt08no2', 'pt08co', 'c6h6gt']]

X_train_prime, X_test_prime, y_train_prime, y_test_prime = model_selection.train_test_split(X_prime, y, test_size=0.2,random_state= 0)
X_train_prime = scaler.fit(X_train_prime).transform(X_train_prime)
X_test_prime = scaler.fit(X_test_prime).transform(X_test_prime)

In [157]:
regr = RandomForestRegressor(n_estimators= 1000, n_jobs = -1)
regr.fit(X_train_prime, y_train_prime)
y_pred = regr.predict(X_test_prime)
print('r2 score for the training set: %.5f' % regr.score(X_train_prime, y_train_prime))
print ('r2 score for the test set : %.5f' % regr.score(X_test_prime, y_test_prime))


r2 score for the training set: 0.99883
r2 score for the test set : 0.99119

----------Predicted vs estimated----------


In [158]:
'r2 score for the predicted y and the test y : %.5f' % r2_score(y_test, y_pred)


Out[158]:
'r2 score for the predicted y and the test y : 0.99119'

Both Scores seems to be very similar. As a final step, I will perform a cross validation on the training and test sets to check if there are inconsistence with the previous results. Since cross validation will select many training and test sets along the dataset, this will rule ou any possible outlier that may have been omitted during the previous steps:

Cross Validation


In [159]:
from sklearn.model_selection import cross_val_score

def fin_models(pickle_file):
    
    scores = {'r_squared_knn_regressor_train': [], 'r_squared_knn_regressor_test': [],
              'r_squared_random_forest_regressor_train':[], 'r_squared_random_forest_regressor_test':[]}
    
    knn = KNeighborsRegressor(n_neighbors = 5, n_jobs= -1) 
    rnf = RandomForestRegressor(n_estimators = 1000, n_jobs= -1)
    
    # knn
    scores['r_squared_knn_regressor_train'].append(cross_val_score(knn, X_train, y_train, cv=12, scoring='r2'))
    scores['r_squared_knn_regressor_test'].append(cross_val_score(knn, X_test, y_test, cv=12, scoring='r2'))
    
    #rnf 
    
    scores['r_squared_random_forest_regressor_train'].append(cross_val_score(rnf, X_train, y_train, cv=12, scoring='r2'))
    scores['r_squared_random_forest_regressor_test'].append(cross_val_score(rnf, X_test, y_test, cv=12, scoring='r2'))
    
    print(scores)
    
    # write  estimators to disc.
    
    with open(pickle_file, 'wb') as final:
        pickle.dump([scores], final)
        
    
    print('DONE!')
        
    return (scores)

In [160]:
fin_models('Regressor_fitted_model_scores.pickle')


{'r_squared_knn_regressor_train': [array([0.99475279, 0.99511764, 0.99576769, 0.99333985, 0.9920085 ,
       0.99544331, 0.99657262, 0.99667494, 0.99286914, 0.99428561,
       0.99463395, 0.99493247])], 'r_squared_knn_regressor_test': [array([0.99134967, 0.99208708, 0.98672775, 0.99384526, 0.98807136,
       0.99346391, 0.99271903, 0.99548433, 0.99473428, 0.96662642,
       0.99289228, 0.99278065])], 'r_squared_random_forest_regressor_train': [array([0.99522845, 0.9959595 , 0.99621981, 0.99386935, 0.99292463,
       0.99656529, 0.99699533, 0.99680722, 0.99407218, 0.99516718,
       0.99564843, 0.99559626])], 'r_squared_random_forest_regressor_test': [array([0.99332362, 0.99338728, 0.99044333, 0.99398227, 0.98934744,
       0.99365539, 0.99246639, 0.99610479, 0.99452562, 0.96636112,
       0.99268549, 0.99339017])]}
DONE!
Out[160]:
{'r_squared_knn_regressor_train': [array([0.99475279, 0.99511764, 0.99576769, 0.99333985, 0.9920085 ,
         0.99544331, 0.99657262, 0.99667494, 0.99286914, 0.99428561,
         0.99463395, 0.99493247])],
 'r_squared_knn_regressor_test': [array([0.99134967, 0.99208708, 0.98672775, 0.99384526, 0.98807136,
         0.99346391, 0.99271903, 0.99548433, 0.99473428, 0.96662642,
         0.99289228, 0.99278065])],
 'r_squared_random_forest_regressor_train': [array([0.99522845, 0.9959595 , 0.99621981, 0.99386935, 0.99292463,
         0.99656529, 0.99699533, 0.99680722, 0.99407218, 0.99516718,
         0.99564843, 0.99559626])],
 'r_squared_random_forest_regressor_test': [array([0.99332362, 0.99338728, 0.99044333, 0.99398227, 0.98934744,
         0.99365539, 0.99246639, 0.99610479, 0.99452562, 0.96636112,
         0.99268549, 0.99339017])]}

Final Observations:

The model performance was quite high. There is a strong relationship between air pollutants an temperature. In order to use this trained model on new data, it will be necessary to use the same encoding that was used to encode any non-numerical variable. The label encoder assign a unique number to any non-numerical variable and that codification should be the same when entering new data. In consequence I believe that it is necessary to create an encoding glossary like the one that we built in our final group project.


In [ ]: