This dataset was taken from the UCI Machine Learning Repository. It contains features such as time, several gas pollution indicators, temperature, relative humidity and absolute humidity, recorded in an Italian city. The main objective of this analysis is to estimate the change in temperature when the pollution features change over time. In simpler words, the model will try to estimate how temperature increases and decreases based on changes in the concentration of pollutants in the air column above the region where the measurements were taken. It will also determine which pollutants most influence the temperature in that area. Many factors can contribute to changes in temperature, so I expect to see how much these pollutants explain the variability of the temperature and/or humidity. If they contribute significantly to it, I believe that in the future such local factors could be added to weather prediction models.
In [40]:
%matplotlib notebook
import IPython
from IPython.display import display
from sqlalchemy import create_engine
import psycopg2
import psycopg2.extras
import pandas as pd
import csv
from numpy import nan as NA
from datetime import datetime
import re
import sys
import numpy as np
import matplotlib.pyplot as plt
import pickle
import requests
import os
from sklearn.preprocessing import LabelEncoder
from yellowbrick.features import Rank2D
from sklearn.linear_model import LinearRegression
from yellowbrick.features.importances import FeatureImportances
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.decomposition import KernelPCA
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
import pprint
In [41]:
# Setting display options:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
In [42]:
class Ingest(object):
    """Connect to a database, run a query, and hold the result as a DataFrame."""
    def __init__(self, database, query):
        self.engine = create_engine(database)
        self.table_names = self.engine.table_names()
        self.con = self.engine.connect()
        self.rs = self.con.execute(query)
        self.df = pd.DataFrame(self.rs.fetchall())
        self.con.close()

    def cols(self):
        # Attach the column names from the result set to the DataFrame
        self.df.columns = self.rs.keys()
        return self.df
In [43]:
dataframe = Ingest('postgresql://postgres:postgres@localhost:5432/Project', "SELECT * FROM Air_Quality")
In [44]:
df = dataframe.cols()
In [45]:
df.head(5)
Out[45]:
In [46]:
df.info()
There is no missing data in this dataset; the little wrangling that was needed was done in PostgreSQL. The next step is to transform everything into numerical form in order to perform visual exploration of the dataset. Since most of the columns (including the potential target features) contain continuous data, I intend to fit a regressor in order to perform predictions.
In [50]:
def encoder(data):
    """Since LabelEncoder only works on a single feature at a time, it is
    necessary to loop over all of the columns."""
    le = LabelEncoder()
    for colname, col in data.iteritems():  # adapted from Stack Overflow
        data[colname] = le.fit_transform(col.astype(str))
    return data
In [51]:
df2 = encoder(df[['date', 'time', 'datetime']])
df3 = df.drop(['date', 'time', 'datetime'], axis = 1)
df3[['date', 'time', 'datetime']] = df2[['date', 'time', 'datetime']]
df = df3
df.head(5)
Out[51]:
In [52]:
df.keys()
Out[52]:
In [53]:
features = df.keys()
visualizer = Rank2D(features=features, algorithm='pearson')
visualizer.fit(df.values)
visualizer.transform(df.values)
visualizer.poof()
As can be appreciated here, there is some multicollinearity between the most important variables, and the correlation between temperature and the rest of the features is quite variable.
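To quantify this multicollinearity beyond pairwise correlations, one option is the variance inflation factor. This is a minimal sketch, not part of the original notebook, and it assumes statsmodels is installed:
In [ ]:
# Sketch (assumes statsmodels is available): VIF per feature; values above
# ~10 are a common warning sign of problematic multicollinearity.
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame({'feature': df.columns,
                    'VIF': [variance_inflation_factor(df.values, i)
                            for i in range(df.shape[1])]})
print(vif.sort_values('VIF', ascending=False))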
There are three potential target variables: temperature, relative humidity and absolute humidity. In this case, I will select temperature because it is the simplest and most easily understood feature. The other two will be left out, since I am focusing on the importance of anthropogenic pollutants.
In [54]:
X_ = df.drop(['temperaturec', 'relhumidity', 'ahabhumidity'], axis=1)
y_ = df['temperaturec']
In [123]:
features = X_.keys()
Xi = X_[features]
#Xi = Xi.values
yi = y_.values
figure = plt.figure()
axis = figure.add_subplot()
viz = FeatureImportances(LinearRegression(), ax=axis)
viz.fit(Xi, yi)
viz.poof()
In [56]:
features = X_.keys()
Xi = X_[features]
#Xi = Xi.values
yi = y_.values
figure = plt.figure()
axis = figure.add_subplot()
viz = FeatureImportances(Lasso(), ax=axis)
viz.fit(Xi, yi)
viz.poof()
In [57]:
features = X_.keys()
Xi = X_[features]
#Xi = Xi.values
yi = y_.values
figure = plt.figure()
axis = figure.add_subplot()
viz = FeatureImportances(Ridge(), ax=axis)
viz.fit(Xi, yi)
viz.poof()
In [161]:
features = X_.keys()
Xi = X_[features]
#Xi = Xi.values
yi = y_.values
figure = plt.figure()
axis = figure.add_subplot()
viz = FeatureImportances(SVR(kernel='linear'), ax=axis)
viz = viz.fit(Xi, yi)
viz.poof()
In [162]:
with open('figure.pickle', 'wb') as figure:
    pickle.dump(viz, figure)
In [ ]:
with open('figure.pickle', "rb") as figur:
    fig = pickle.load(figur, encoding="utf8")
fig.poof()
Here, it can be appreciated that the most important feature is benzene, followed by date, non-methane hydrocarbons (nmhc) and the nitrogen compounds. Note that this feature importance only applies to linear models; other methods should be taken into consideration when selecting features.
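One such alternative (a sketch, not part of the original analysis) is mutual information, which is model-agnostic and also captures non-linear dependence between each feature and the target:
In [ ]:
# Sketch: mutual information as a non-linear cross-check on the linear
# feature rankings above.
from sklearn.feature_selection import mutual_info_regression

mi = mutual_info_regression(X_, y_, random_state=0)
print(pd.Series(mi, index=X_.columns).sort_values(ascending=False))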
In [93]:
cols2 = pd.DataFrame(X_)
cols2 = list(cols2.columns)
boxplot = X_.boxplot(column=cols2, rot=90, fontsize=10)
In [101]:
minmax_scaler = MinMaxScaler(feature_range=(0, 1))
X_minmax = minmax_scaler.fit(X_).transform(X_)
yi = y_.values.reshape(-1,1)
#y_minmax = minmax_scaler.fit(yi).transform(yi)
In [102]:
X_minmax = pd.DataFrame(X_minmax)
X_minmax.columns = X_.columns
cols2 = pd.DataFrame(X_minmax)
cols2 = list(cols2.columns)
boxplot = X_minmax.boxplot(column=cols2, rot=90, fontsize=10)
In [103]:
scaler = StandardScaler()
X_std = scaler.fit(X_).transform(X_)
y_std = y_.values.reshape(-1,1)
#y_std = scaler.fit(y_std).transform(y_std)
X_std = pd.DataFrame(X_std)
X_std.columns = X_.columns
cols2 = pd.DataFrame(X_std)
cols2 = list(cols2.columns)
boxplot = X_std.boxplot(column=cols2, rot=90, fontsize=10)
In this exercise I will use StandardScaler, since the medians are much closer together:
In [104]:
from sklearn import model_selection
X = X_
y = y_
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2,random_state= 0)
X_train = scaler.fit(X_train).transform(X_train)
X_test = scaler.transform(X_test)  # reuse the train-fitted scaler; fitting on the test set would leak information
In [105]:
def rec_fe(target, data, filename):
    if __name__ == '__main__':
        svr = SVR(kernel='linear')
        rfecv = RFECV(estimator=svr, step=1, cv=KFold(12), scoring='r2', verbose=10, n_jobs=-1)
        rfecv.fit(data, target)
        optimal_features = rfecv.n_features_
        print("Optimal number of features : %d" % rfecv.n_features_)
        with open(filename, 'wb') as features:
            pickle.dump([optimal_features, rfecv], features)
In [106]:
rec_fe(y_train,X_train, 'OptimalFeatures.pickle')
In [107]:
with open('OptimalFeatures.pickle', "rb") as feature:
    feat = pickle.load(feature, encoding="utf8")
rfecv = feat[1]
In [108]:
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score")
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()
The plot shows that the optimal number of features is 13. Later, during model improvement and hyperparameter tuning, I will confirm this using dimensionality reduction techniques.
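As a quick preview of that confirmation (a sketch, assuming the scaled X_train from above is still in scope), the cumulative explained variance of a plain PCA hints at how many components carry most of the signal:
In [ ]:
# Sketch: cumulative explained variance ratio of PCA on the scaled training
# data; the count needed for ~95% of the variance suggests a component budget.
pca_check = PCA().fit(X_train)
cumulative = np.cumsum(pca_check.explained_variance_ratio_)
print('Components for 95%% of the variance: %d' % (np.argmax(cumulative >= 0.95) + 1))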
In [109]:
from yellowbrick.regressor import ResidualsPlot
lin = LinearRegression()
visualizer = ResidualsPlot(lin)
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
visualizer.poof()
Here, the points for both the training and the test set are not randomly and horizontally distributed across the plot. This suggests that a non-linear model would be better suited to this particular case.
In [110]:
regr = LinearRegression()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
print("Mean squared error: %.5f"% mean_squared_error(y_test,y_pred))
print('r2 score for the training set: %.5f' % regr.score(X_train, y_train))
print ('r2 score for the test set : %.5f' % regr.score(X_test, y_test))
In [111]:
regr = Ridge()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
print("Mean squared error: %.5f"% mean_squared_error(y_test,y_pred))
print('r2 score for the training set: %.5f' % regr.score(X_train, y_train))
print ('r2 score for the test set : %.5f' % regr.score(X_test, y_test))
In [112]:
regr = Lasso()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
print("Mean squared error: %.5f"% mean_squared_error(y_test,y_pred))
print('r2 score for the training set: %.5f' % regr.score(X_train, y_train))
print ('r2 score for the test set : %.5f' % regr.score(X_test, y_test))
In [113]:
regr = SVR()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
print('r2 score for the training set: %.5f' % regr.score(X_train, y_train))
print ('r2 score for the test set : %.5f' % regr.score(X_test, y_test))
In [114]:
regr = KNeighborsRegressor()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
print('r2 score for the training set: %.5f' % regr.score(X_train, y_train))
print ('r2 score for the test set : %.5f' % regr.score(X_test, y_test))
In [115]:
regr = RandomForestRegressor()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
print('r2 score for the training set: %.5f' % regr.score(X_train, y_train))
print ('r2 score for the test set : %.5f' % regr.score(X_test, y_test))
In [116]:
regr = MLPRegressor()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
print('r2 score for the training set: %.5f' % regr.score(X_train, y_train))
print ('r2 score for the test set : %.5f' % regr.score(X_test, y_test))
As can be seen here, the last three models explain the relationship between the independent variables and the dependent variable (temperature) more accurately, although the differences in scores among them are almost negligible. In the final steps of this exercise I will work with the three best non-linear regressors: KNeighborsRegressor, RandomForestRegressor and MLPRegressor.
In [118]:
def gridsearch(model, parameters):
    grid_search = GridSearchCV(estimator=model, param_grid=parameters, scoring='r2', cv=4, n_jobs=-1, verbose=10)
    grid_search = grid_search.fit(X_train, y_train)
    score = grid_search.best_score_
    best_params = grid_search.best_params_
    return score, best_params
In [119]:
KNNREG = KNeighborsRegressor()
RNFREG = RandomForestRegressor()
MLPREG = MLPRegressor()
SVREG = SVR()
list_params = [{'n_neighbors':[1, 3, 5, 10, 30, 50]}, {'n_estimators' :[50,128,300, 500, 1000]},
{'hidden_layer_sizes': [(100,), (200,), (300,)], 'max_iter': [200, 400, 1000]},
{'kernel': ['poly', 'rbf', 'sigmoid'], 'C': [0.01,0.1,1,10]}]
models = [KNNREG, RNFREG, MLPREG, SVREG]
In [120]:
scores_dict = {'r2': [], 'best_params': []}
if __name__ == '__main__':
    for model, param in zip(models, list_params):
        acu, best_params = gridsearch(model, param)
        scores_dict['r2'].append(acu)
        scores_dict['best_params'].append(best_params)
    print(scores_dict)
    with open('gridsearch_y.pickle', 'wb') as grid:
        pickle.dump(scores_dict, grid)
In [121]:
print (scores_dict)
Based on the results of this grid search, I will select the two best models, which were KNeighborsRegressor and RandomForestRegressor. I will examine feature importance and the optimal number of features in order to further improve the model. KNeighborsRegressor has no built-in feature importance (a model-agnostic workaround is sketched after the plot below). Note that the following plot was made on the complete, unscaled dataset, and it is important to note how it differs from the linear models' feature importance plots.
In [122]:
X.columns = X_.columns
features = X_.keys()
Xi = X[features]
yi = y
from yellowbrick.features.importances import FeatureImportances
figure = plt.figure()
axis = figure.add_subplot()
viz = FeatureImportances(RandomForestRegressor(n_estimators= 1000), ax=axis)
viz = viz.fit(Xi, yi)
viz.poof()
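Although KNeighborsRegressor exposes no feature importances of its own, permutation importance is a model-agnostic workaround. This is a sketch, not part of the original notebook, and it assumes scikit-learn 0.22 or later for sklearn.inspection:
In [ ]:
# Sketch: permutation importance works for any fitted estimator, including
# KNN, by measuring the score drop when one feature's values are shuffled.
from sklearn.inspection import permutation_importance

knn = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)
perm = permutation_importance(knn, X_test, y_test, n_repeats=10, random_state=0)
print(pd.Series(perm.importances_mean, index=X_.columns).sort_values(ascending=False))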
In [125]:
def rec_fe2(target, data, filename):
    if __name__ == '__main__':
        rnfreg = RandomForestRegressor(n_estimators=1000)
        rfecv = RFECV(estimator=rnfreg, step=1, cv=KFold(12), scoring='r2', verbose=10, n_jobs=-1)
        rfecv.fit(data, target)
        optimal_features = rfecv.n_features_
        print("Optimal number of features : %d" % rfecv.n_features_)
        with open(filename, 'wb') as features_knnreg:
            pickle.dump([optimal_features, rfecv], features_knnreg)
In [126]:
rec_fe2(y_train,X_train, 'OptimalFeatures_rnfreg.pickle')
In [127]:
with open('OptimalFeatures_rnfreg.pickle', "rb") as features:
    feat = pickle.load(features, encoding="utf8")
rferg = feat[1]
In [128]:
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score ")
plt.plot(range(1, len(rferg.grid_scores_) + 1), rferg.grid_scores_)
plt.show()
In [129]:
regr = RandomForestRegressor(n_estimators= 1000)
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
print('r2 score for the training set: %.5f' % regr.score(X_train, y_train))
print ('r2 score for the test set : %.5f' % regr.score(X_test, y_test))
In [130]:
def gridsearch2(model, parameters):
    grid_search = GridSearchCV(estimator=model, param_grid=parameters, scoring='r2', cv=4, n_jobs=-1, verbose=10)
    grid_search = grid_search.fit(X_train, y_train)
    score = grid_search.best_score_
    best_params = grid_search.best_params_
    return score, best_params
In [131]:
pipe = Pipeline([('reduce_dimensions', PCA()),
                 ('regressor', RandomForestRegressor(n_estimators=1000, n_jobs=-1))])
scores_dict2 = {'r2': [], 'best_params': []}
components = [2, 3, 4, 5, 6]
param_grid = [{'reduce_dimensions': [PCA(), KernelPCA(kernel='rbf'), TruncatedSVD()],
               'reduce_dimensions__n_components': components}]
if __name__ == '__main__':
    score, best_params = gridsearch2(pipe, param_grid)
    scores_dict2['r2'].append(score)  # fixed: was 'acu', a leftover from the earlier grid search loop
    scores_dict2['best_params'].append(best_params)
    print(scores_dict2)
    with open('gridsearch2.pickle', 'wb') as grid2:
        pickle.dump(scores_dict2, grid2)
The optimal number of components with dimensionality reduction is 6, and the best dimensionality reduction technique was TruncatedSVD. Next, I will apply TruncatedSVD independently and compare it to manual feature elimination based on the feature importance analysis that was done previously.
Reset the training and test sets:
In [138]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2,random_state= 0)
X_train = scaler.fit(X_train).transform(X_train)
X_test = scaler.transform(X_test)  # reuse the train-fitted scaler to avoid leakage
In [139]:
svd = TruncatedSVD(algorithm='randomized', n_components=6, n_iter=5)  # 6 components, per the grid search above
X_train = svd.fit_transform(X_train)
X_test = svd.transform(X_test)
In [152]:
regr = RandomForestRegressor(n_estimators= 1000, n_jobs = -1)
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
print('r2 score for the training set: %.5f' % regr.score(X_train, y_train))
print ('r2 score for the test set : %.5f' % regr.score(X_test, y_test))
---------- Predicted vs. observed ----------
In [155]:
from sklearn.metrics import r2_score
'r2 score for the predicted y and the test y : %.5f' % r2_score(y_test, y_pred)
Out[155]:
In [141]:
X_prime = X[['pt08nox', 'pt08o3', 'pt08nmhc', 'pt08no2', 'pt08co', 'c6h6gt']]
X_train_prime, X_test_prime, y_train_prime, y_test_prime = model_selection.train_test_split(X_prime, y, test_size=0.2,random_state= 0)
X_train_prime = scaler.fit(X_train_prime).transform(X_train_prime)
X_test_prime = scaler.transform(X_test_prime)  # reuse the train-fitted scaler to avoid leakage
In [157]:
regr = RandomForestRegressor(n_estimators= 1000, n_jobs = -1)
regr.fit(X_train_prime, y_train_prime)
y_pred = regr.predict(X_test_prime)
print('r2 score for the training set: %.5f' % regr.score(X_train_prime, y_train_prime))
print ('r2 score for the test set : %.5f' % regr.score(X_test_prime, y_test_prime))
---------- Predicted vs. observed ----------
In [158]:
'r2 score for the predicted y and the test y : %.5f' % r2_score(y_test_prime, y_pred)
Out[158]:
Both scores seem to be very similar. As a final step, I will perform cross-validation on the training and test sets to check whether there are inconsistencies with the previous results. Since cross-validation selects many training and validation folds across the dataset, this will rule out any possible outlier effect that may have been missed during the previous steps:
In [159]:
from sklearn.model_selection import cross_val_score
def fin_models(pickle_file):
    scores = {'r_squared_knn_regressor_train': [], 'r_squared_knn_regressor_test': [],
              'r_squared_random_forest_regressor_train': [], 'r_squared_random_forest_regressor_test': []}
    knn = KNeighborsRegressor(n_neighbors=5, n_jobs=-1)
    rnf = RandomForestRegressor(n_estimators=1000, n_jobs=-1)
    # knn
    scores['r_squared_knn_regressor_train'].append(cross_val_score(knn, X_train, y_train, cv=12, scoring='r2'))
    scores['r_squared_knn_regressor_test'].append(cross_val_score(knn, X_test, y_test, cv=12, scoring='r2'))
    # rnf
    scores['r_squared_random_forest_regressor_train'].append(cross_val_score(rnf, X_train, y_train, cv=12, scoring='r2'))
    scores['r_squared_random_forest_regressor_test'].append(cross_val_score(rnf, X_test, y_test, cv=12, scoring='r2'))
    print(scores)
    # write the scores to disk (the fitted estimators themselves are not pickled here)
    with open(pickle_file, 'wb') as final:
        pickle.dump([scores], final)
    print('DONE!')
    return scores
In [160]:
fin_models('Regressor_fitted_model_scores.pickle')
Out[160]:
The model performance was quite high; there is a strong relationship between air pollutants and temperature. In order to use this trained model on new data, it will be necessary to apply the same encoding that was used for every non-numerical variable. The label encoder assigns a unique number to each non-numerical value, and that codification must be the same when new data comes in. Consequently, I believe it is necessary to create an encoding glossary like the one we built in our final group project.
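A minimal sketch of such a glossary (hypothetical, since the encoder loop above reuses and discards a single LabelEncoder; raw_df stands in for the original, un-encoded frame):
In [ ]:
# Sketch of an encoding glossary: one fitted LabelEncoder per column,
# persisted so that new data can be mapped with exactly the same codes.
def build_glossary(data, columns):
    glossary = {}
    for colname in columns:
        enc = LabelEncoder().fit(data[colname].astype(str))
        # classes_ lists the original labels in the order of their integer codes
        glossary[colname] = dict(zip(enc.classes_, range(len(enc.classes_))))
    return glossary

# glossary = build_glossary(raw_df, ['date', 'time', 'datetime'])  # raw_df is hypothetical
# with open('encoding_glossary.pickle', 'wb') as g:
#     pickle.dump(glossary, g)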