This notebook explores a dataset to which the kNN algorithm can be applied to predict a continuous value. Problems of this kind are known as regression problems; typical examples include predicting stock prices or house prices from a series of observations.
The dataset chosen was the 'Forest Fires Data Set', whose aim is to predict the burned area of forest fires. We found the dataset at http://archive.ics.uci.edu/ml/datasets/Forest+Fires, and it is properly referenced in [1].
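Before diving into the data, the core idea of kNN regression can be sketched in a few lines of plain Python: the prediction for a query point is simply the average target value of its k closest training points. This is a toy 1-D example (hypothetical data, not the forest fire set) illustrating the mechanics that sklearn's KNeighborsRegressor implements with efficient neighbor search.

```python
# Hand-rolled 1-D kNN regression sketch on toy data:
# predict the mean target of the k nearest training points.
def knn_predict(train_x, train_y, query, k):
    # sort training point indices by distance to the query
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    nearest = order[:k]
    # average the targets of the k nearest neighbors
    return sum(train_y[i] for i in nearest) / k

xs = [1.0, 2.0, 3.0, 10.0]
ys = [1.0, 2.0, 3.0, 10.0]
print(knn_predict(xs, ys, 2.5, k=2))  # neighbors are 2.0 and 3.0, so 2.5
```

With k too small the model chases noise; with k too large it averages over distant, irrelevant points, which is why the notebook sweeps over a range of k values below.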
Data Set Characteristics | Number of Instances | Area | Attribute Characteristics | Number of Attributes | Associated Tasks | Missing Values?
---|---|---|---|---|---|---
Multivariate | 517 | Physical | Real | 13 | Regression | N/A
Data set abstract: This is a difficult regression task, where the aim is to predict the burned area of forest fires, in the northeast region of Portugal, by using meteorological and other data.
Data set features:
The components of the FWI (Fire Weather Index) System are calculated from consecutive daily observations of temperature, relative humidity, wind speed, and 24-hour rainfall. The six standard components provide numeric ratings of relative potential for wildland fire.
In [30]:
Victor, I think we could cover this in the project:
https://machinelearningmastery.com/feature-selection-machine-learning-python/
but we can leave it for after we finish; it would be the cherry on top.
In [13]:
import math

import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt
In [14]:
firedb = pd.read_csv("forestfires.csv")
firedb.columns
Out[14]:
No null values.
In [15]:
firedb.info()
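The info() output above confirms that none of the 517 rows contain missing values. As a quick illustration of how such a check works, here is the same pattern applied to a tiny frame that does contain a gap (toy data, not firedb):

```python
# Count missing values per column and in total on a small demo frame.
import numpy as np
import pandas as pd

demo = pd.DataFrame({"temp": [20.1, np.nan, 18.4],
                     "wind": [3.1, 2.7, 4.0]})
print(demo.isnull().sum())        # per-column counts: temp has 1, wind has 0
print(demo.isnull().sum().sum())  # grand total: 1
```

Running `firedb.isnull().sum().sum()` the same way should return 0 for this dataset.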
In [40]:
firedb['area'] = firedb['area'].astype(np.float32)
In [12]:
firedb[['FFMC','ISI','temp','RH','wind','rain']].plot(figsize=(17,15))
Out[12]:
In [24]:
firedb['month'].value_counts()
Out[24]:
In [15]:
firedb[['area']].plot(figsize=(17,15))
Out[15]:
In [16]:
firedb['area_adjusted'] = np.log(firedb['area']+1)
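The burned area is heavily right-skewed (most fires burn close to 0 ha), so the log(area + 1) transform compresses the long tail while mapping zero to zero. NumPy's log1p/expm1 pair computes the same transform and its inverse; a quick round-trip on toy values:

```python
import numpy as np

areas = np.array([0.0, 0.5, 10.0, 1000.0])  # toy values spanning a skewed range
adjusted = np.log1p(areas)        # equivalent to np.log(areas + 1)
recovered = np.expm1(adjusted)    # inverse transform, back to the original scale

print(adjusted[0])                    # 0.0, zeros stay at zero
print(np.allclose(recovered, areas))  # True
```

Any model trained on area_adjusted therefore needs np.expm1 applied to its predictions before they can be read as hectares.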
In [60]:
fig, axes = plt.subplots(nrows=1, ncols=2)
firedb['area'].plot.hist(ax=axes[0],figsize=(17,8))
firedb['area_adjusted'].plot.hist(ax=axes[1])
Out[60]:
In [17]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score, KFold
from sklearn import preprocessing
In [18]:
firedb.head()
Out[18]:
In [36]:
hyper_params = range(1,20)
features_list = [ ['FFMC'], ['DMC'], ['DC'], ['ISI'], ['temp'], ['RH'], ['wind'], ['rain'],
['X', 'Y', 'FFMC', 'DMC','DC','ISI'], ['X', 'Y', 'temp','RH','wind','rain'],
['FFMC', 'DMC','DC','ISI'], ['temp','RH','wind','rain'],
['DMC', 'wind'], ['DC','RH','wind'],
['X', 'Y', 'DMC', 'wind'], ['X', 'Y', 'DC','RH','wind'],
['FFMC', 'DMC','DC','ISI', 'temp','RH','wind','rain'],
['X', 'Y', 'FFMC', 'DMC','DC','ISI', 'temp','RH','wind','rain'] ]
num_folds = [3, 5, 7, 9, 10, 11, 13, 15, 17, 19, 21, 23]
outputs = ['area', 'area_adjusted']
In [20]:
import csv
# Initialize the CSV file that will act as a database for the results:
# open it in write mode, write the header row, and let the context manager close it.
db_row = ['HYPER_PARAM', 'FEATURES', 'K_FOLDS', 'OUTPUT',
          'AVG_RMSE', 'AVG_RMSE_%_AREA', 'STD_RMSE', 'CV_RMSE',
          'AVG_MAE', 'AVG_MAE_%_AREA', 'STD_MAE', 'CV_MAE']
with open('results_db.csv', 'w', newline='') as file:
    csv.writer(file).writerow(db_row)
In [21]:
from IPython.display import clear_output
from time import time

start_time = time()
k = 0
k_total = len(hyper_params) * len(features_list) * len(num_folds) * len(outputs)
for hp in hyper_params:
    for features in features_list:
        for fold in num_folds:
            for output in outputs:
                k += 1
                kf = KFold(fold, shuffle=True, random_state=1)
                model = KNeighborsRegressor(n_neighbors=hp, algorithm='auto')
                mses = cross_val_score(model, firedb[features], firedb[output],
                                       scoring="neg_mean_squared_error", cv=kf)
                rmses = np.sqrt(np.absolute(mses))
                avg_rmse = np.mean(rmses)
                avg_rmse_per_area = avg_rmse / np.mean(firedb[output])
                std_rmse = np.std(rmses)
                cv_rmse = std_rmse / np.mean(firedb[output])
                maes = cross_val_score(model, firedb[features], firedb[output],
                                       scoring="neg_mean_absolute_error", cv=kf)
                maes = np.absolute(maes)
                avg_mae = np.mean(maes)
                avg_mae_per_area = avg_mae / np.mean(firedb[output])
                std_mae = np.std(maes)
                cv_mae = std_mae / np.mean(firedb[output])
                db_row = [hp, ', '.join(features), fold, output,
                          avg_rmse, avg_rmse_per_area, std_rmse, cv_rmse,
                          avg_mae, avg_mae_per_area, std_mae, cv_mae]
                print('ITERATION %d OF %d' % (k, k_total))
                print('HP: ', hp)
                print('FEATURES: ', ', '.join(features))
                print('FOLDS: ', fold)
                print('OUTPUT: ', output)
                print('AVG_RMSE: ', avg_rmse)
                print('AVG_RMSE_PER_AREA: ', avg_rmse_per_area)
                print('STD_RMSE: ', std_rmse)
                print('CV_RMSE: ', cv_rmse)
                print('AVG_MAE: ', avg_mae)
                print('AVG_MAE_PER_AREA: ', avg_mae_per_area)
                print('STD_MAE: ', std_mae)
                print('CV_MAE: ', cv_mae)
                print('\n\n')
                #clear_output(wait=True)
                # Reopen the results file in append mode so each iteration
                # adds one row to our results database.
                with open('results_db.csv', 'a', newline='') as file:
                    csv.writer(file).writerow(db_row)
end_time = time()
elapsed_time = end_time - start_time
print('Elapsed time: ', elapsed_time)
In [37]:
results = pd.read_csv('results_db_complete.csv')
results.head()
Out[37]:
In [38]:
results[ results['OUTPUT'] == 'area' ].sort_values(by='AVG_RMSE')[:5]
Out[38]:
In [5]:
results[ results['OUTPUT'] == 'area' ].sort_values(by='STD_RMSE')[:5]
Out[5]:
In [24]:
results[ results['OUTPUT'] == 'area' ].sort_values(by='AVG_MAE')[:5]
Out[24]:
In [25]:
results[ results['OUTPUT'] == 'area' ].sort_values(by='STD_MAE')[:5]
Out[25]:
In [47]:
mean_weight = 0.5
std_weight = 1-mean_weight
results['SCORE_RMSE'] = mean_weight*results['AVG_RMSE'] + std_weight*results['STD_RMSE']
results['SCORE_RMSE'] = ( (np.absolute(results['SCORE_RMSE'] - results['SCORE_RMSE'].mean())) /
results['SCORE_RMSE'].std() )
results['SCORE_MAE'] = mean_weight*results['AVG_MAE'] + std_weight*results['STD_MAE']
results['SCORE_MAE'] = ( (np.absolute(results['SCORE_MAE'] - results['SCORE_MAE'].mean())) /
results['SCORE_MAE'].std() )
results.head()
Out[47]:
In [48]:
results[ results['OUTPUT'] == 'area' ].sort_values(by='SCORE_RMSE')[:5]
Out[48]:
In [ ]:
results[ results['OUTPUT'] == 'area' ].sort_values(by='SCORE_MAE')[:5]
In [99]:
# Pick the best configuration for 'area' by SCORE_RMSE and read off its settings.
best = results[ results['OUTPUT'] == 'area' ].sort_values(by='SCORE_RMSE').iloc[0]
features = [feature.strip() for feature in best['FEATURES'].split(',')]
hp = int(best['HYPER_PARAM'])
fold = int(best['K_FOLDS'])
In [104]:
kf = KFold(fold, shuffle=True, random_state=1)
model = KNeighborsRegressor(n_neighbors = hp, algorithm='auto')
mses = cross_val_score(model, firedb[features],
firedb['area'], scoring="neg_mean_squared_error", cv=kf)
rmses = np.sqrt(np.absolute(mses))
avg_rmse = np.mean(rmses)
std_rmse = np.std(rmses)
print(avg_rmse,std_rmse)
In [101]:
rmses
Out[101]:
In [105]:
# cross_val_score fits clones of the estimator, so `model` itself is still
# unfitted; fit it on the full data before calling predict.
model.fit(firedb[features], firedb['area'])
predictions = model.predict(firedb[features])
predictions
In [ ]:
In [76]:
hyper_params = range(1,10)
mse_values = []
mad_values = []
predictions = []
numFolds = 10
# 10-fold cross validation
#kf = KFold(n_splits=10)
le = preprocessing.LabelEncoder()
X = firedb.iloc[:, 1:10].values
Y = le.fit_transform(firedb.iloc[:, 10].values)
kf = KFold(numFolds, shuffle=True)
conv_X = pd.get_dummies(firedb.iloc[:, 1:10])
In [ ]:
kf = KFold(n_splits = 10, shuffle = True)
# Walk through all ten folds (equivalent to calling next() on the split
# iterator ten times): print each (train_index, test_index) pair and keep
# the current fold's train/test frames.
for result in kf.split(firedb):
    print(result)
    train = firedb.iloc[result[0]]
    test = firedb.iloc[result[1]]
In [ ]:
for train_index, test_index in kf.split(X):
    train_X = conv_X.iloc[train_index, :]
    train_Y = Y[train_index]
    test_X = conv_X.iloc[test_index, :]
    test_Y = Y[test_index]
    for knumber in hyper_params:
        # Configuring the regressor
        knn = KNeighborsRegressor(n_neighbors=knumber, algorithm='brute', n_jobs=3)
        # Fitting the model on this fold's training split
        knn.fit(train_X, train_Y)
        # Predicting on this fold's test split
        predictions = knn.predict(test_X)
        # Recording the mean squared and mean absolute errors
        mse_values.append(mean_squared_error(test_Y, predictions))
        mad_values.append(mean_absolute_error(test_Y, predictions))
pd.Series(mse_values).plot()
pd.Series(mad_values).plot()
In [ ]: