This notebook explores a dataset to which the kNN algorithm can be applied to predict a continuous value. Problems of this kind are known as regression problems; typical examples include predicting stock prices or house prices from a series of observations.
The dataset chosen was the 'Forest Fires Data Set', whose aim is to predict the burned area of forest fires. We found the dataset at http://archive.ics.uci.edu/ml/datasets/Forest+Fires, and it is properly referenced in [1].
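Before diving into the data, the core idea of kNN regression can be sketched in a few lines of plain Python: the prediction for a query point is simply the average target value of its k closest training points. This is a toy 1-D example (hypothetical data, not the forest fire set) illustrating the mechanics that sklearn's KNeighborsRegressor implements with efficient neighbor search.

```python
# Hand-rolled 1-D kNN regression sketch on toy data:
# predict the mean target of the k nearest training points.
def knn_predict(train_x, train_y, query, k):
    # sort training point indices by distance to the query
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    nearest = order[:k]
    # average the targets of the k nearest neighbors
    return sum(train_y[i] for i in nearest) / k

xs = [1.0, 2.0, 3.0, 10.0]
ys = [1.0, 2.0, 3.0, 10.0]
print(knn_predict(xs, ys, 2.5, k=2))  # neighbors are 2.0 and 3.0, so 2.5
```

With k too small the model chases noise; with k too large it averages over distant, irrelevant points, which is why the notebook sweeps over a range of k values below.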
Data Set Characteristics | Number of Instances | Area | Attribute Characteristics | Number of Attributes | Associated Tasks | Missing Values?
---|---|---|---|---|---|---
Multivariate | 517 | Physical | Real | 13 | Regression | N/A
Data set abstract: This is a difficult regression task, where the aim is to predict the burned area of forest fires, in the northeast region of Portugal, by using meteorological and other data.
Data set features:
The components of the FWI (Fire Weather Index) System are calculated from consecutive daily observations of temperature, relative humidity, wind speed, and 24-hour rainfall. The six standard components provide numeric ratings of relative potential for wildland fire.
In [30]:
Victor, I think we could cover this in the project:
https://machinelearningmastery.com/feature-selection-machine-learning-python/
but we can leave it for after we finish; it would be the cherry on top.
In [13]:
import math

import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt
In [14]:
firedb = pd.read_csv("forestfires.csv")
firedb.columns
Out[14]:
No null values.
In [15]:
firedb.info()
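The info() output above confirms that none of the 517 rows contain missing values. As a quick illustration of how such a check works, here is the same pattern applied to a tiny frame that does contain a gap (toy data, not firedb):

```python
# Count missing values per column and in total on a small demo frame.
import numpy as np
import pandas as pd

demo = pd.DataFrame({"temp": [20.1, np.nan, 18.4],
                     "wind": [3.1, 2.7, 4.0]})
print(demo.isnull().sum())        # per-column counts: temp has 1, wind has 0
print(demo.isnull().sum().sum())  # grand total: 1
```

Running `firedb.isnull().sum().sum()` the same way should return 0 for this dataset.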
In [40]:
firedb['area'] = firedb['area'].astype(np.float32)
In [12]:
firedb[['FFMC','ISI','temp','RH','wind','rain']].plot(figsize=(17,15))
Out[12]:
In [24]:
firedb['month'].value_counts()
Out[24]:
In [15]:
firedb[['area']].plot(figsize=(17,15))
Out[15]:
In [16]:
firedb['area_adjusted'] = np.log(firedb['area']+1)
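The burned area is heavily right-skewed (most fires burn close to 0 ha), so the log(area + 1) transform compresses the long tail while mapping zero to zero. NumPy's log1p/expm1 pair computes the same transform and its inverse; a quick round-trip on toy values:

```python
import numpy as np

areas = np.array([0.0, 0.5, 10.0, 1000.0])  # toy values spanning a skewed range
adjusted = np.log1p(areas)        # equivalent to np.log(areas + 1)
recovered = np.expm1(adjusted)    # inverse transform, back to the original scale

print(adjusted[0])                    # 0.0, zeros stay at zero
print(np.allclose(recovered, areas))  # True
```

Any model trained on area_adjusted therefore needs np.expm1 applied to its predictions before they can be read as hectares.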
In [60]:
fig, axes = plt.subplots(nrows=1, ncols=2)
firedb['area'].plot.hist(ax=axes[0],figsize=(17,8))
firedb['area_adjusted'].plot.hist(ax=axes[1])
Out[60]:
In [17]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score, KFold
from sklearn import preprocessing
In [18]:
firedb.head()
Out[18]:
In [36]:
hyper_params = range(1,20)
features_list = [ ['FFMC'], ['DMC'], ['DC'], ['ISI'], ['temp'], ['RH'], ['wind'], ['rain'],
['X', 'Y', 'FFMC', 'DMC','DC','ISI'], ['X', 'Y', 'temp','RH','wind','rain'],
['FFMC', 'DMC','DC','ISI'], ['temp','RH','wind','rain'],
['DMC', 'wind'], ['DC','RH','wind'],
['X', 'Y', 'DMC', 'wind'], ['X', 'Y', 'DC','RH','wind'],
['FFMC', 'DMC','DC','ISI', 'temp','RH','wind','rain'],
['X', 'Y', 'FFMC', 'DMC','DC','ISI', 'temp','RH','wind','rain'] ]
num_folds = [3, 5, 7, 9, 10, 11, 13, 15, 17, 19, 21, 23]
outputs = ['area', 'area_adjusted']
In [20]:
import csv
# Initialize the CSV file that will act as a database for the results:
# open it in write mode, write the header row, and let the context manager close it.
db_row = ['HYPER_PARAM', 'FEATURES', 'K_FOLDS', 'OUTPUT',
          'AVG_RMSE', 'AVG_RMSE_%_AREA', 'STD_RMSE', 'CV_RMSE',
          'AVG_MAE', 'AVG_MAE_%_AREA', 'STD_MAE', 'CV_MAE']
with open('results_db.csv', 'w', newline='') as file:
    csv.writer(file).writerow(db_row)
In [21]:
from IPython.display import clear_output
from time import time

start_time = time()
k = 0
k_total = len(hyper_params) * len(features_list) * len(num_folds) * len(outputs)
for hp in hyper_params:
    for features in features_list:
        for fold in num_folds:
            for output in outputs:
                k += 1
                kf = KFold(fold, shuffle=True, random_state=1)
                model = KNeighborsRegressor(n_neighbors=hp, algorithm='auto')
                mses = cross_val_score(model, firedb[features], firedb[output],
                                       scoring="neg_mean_squared_error", cv=kf)
                rmses = np.sqrt(np.absolute(mses))
                avg_rmse = np.mean(rmses)
                avg_rmse_per_area = avg_rmse / np.mean(firedb[output])
                std_rmse = np.std(rmses)
                cv_rmse = std_rmse / np.mean(firedb[output])
                maes = cross_val_score(model, firedb[features], firedb[output],
                                       scoring="neg_mean_absolute_error", cv=kf)
                maes = np.absolute(maes)
                avg_mae = np.mean(maes)
                avg_mae_per_area = avg_mae / np.mean(firedb[output])
                std_mae = np.std(maes)
                cv_mae = std_mae / np.mean(firedb[output])
                db_row = [hp, ', '.join(features), fold, output,
                          avg_rmse, avg_rmse_per_area, std_rmse, cv_rmse,
                          avg_mae, avg_mae_per_area, std_mae, cv_mae]
                print('ITERATION %d OF %d' % (k, k_total))
                print('HP: ', hp)
                print('FEATURES: ', ', '.join(features))
                print('FOLDS: ', fold)
                print('OUTPUT: ', output)
                print('AVG_RMSE: ', avg_rmse)
                print('AVG_RMSE_PER_AREA: ', avg_rmse_per_area)
                print('STD_RMSE: ', std_rmse)
                print('CV_RMSE: ', cv_rmse)
                print('AVG_MAE: ', avg_mae)
                print('AVG_MAE_PER_AREA: ', avg_mae_per_area)
                print('STD_MAE: ', std_mae)
                print('CV_MAE: ', cv_mae)
                print('\n\n')
                #clear_output(wait=True)
                # Reopen the results file in append mode so each iteration
                # adds one row to our results database.
                with open('results_db.csv', 'a', newline='') as file:
                    csv.writer(file).writerow(db_row)
end_time = time()
elapsed_time = end_time - start_time
print('Elapsed time: ', elapsed_time)
In [37]:
results = pd.read_csv('results_db_complete.csv')
results.head()
Out[37]:
In [38]:
results[ results['OUTPUT'] == 'area' ].sort_values(by='AVG_RMSE')[:5]
Out[38]:
In [5]:
results[ results['OUTPUT'] == 'area' ].sort_values(by='STD_RMSE')[:5]
Out[5]:
In [24]:
results[ results['OUTPUT'] == 'area' ].sort_values(by='AVG_MAE')[:5]
Out[24]:
In [25]:
results[ results['OUTPUT'] == 'area' ].sort_values(by='STD_MAE')[:5]
Out[25]:
In [47]:
mean_weight = 0.5
std_weight = 1-mean_weight
results['SCORE_RMSE'] = mean_weight*results['AVG_RMSE'] + std_weight*results['STD_RMSE']
results['SCORE_RMSE'] = ( (np.absolute(results['SCORE_RMSE'] - results['SCORE_RMSE'].mean())) /
results['SCORE_RMSE'].std() )
results['SCORE_MAE'] = mean_weight*results['AVG_MAE'] + std_weight*results['STD_MAE']
results['SCORE_MAE'] = ( (np.absolute(results['SCORE_MAE'] - results['SCORE_MAE'].mean())) /
results['SCORE_MAE'].std() )
results.head()
Out[47]:
In [48]:
results[ results['OUTPUT'] == 'area' ].sort_values(by='SCORE_RMSE')[:5]
Out[48]:
In [ ]:
results[ results['OUTPUT'] == 'area' ].sort_values(by='SCORE_MAE')[:5]
In [99]:
# Pick the best configuration for 'area' by SCORE_RMSE and read off its settings.
best = results[ results['OUTPUT'] == 'area' ].sort_values(by='SCORE_RMSE').iloc[0]
features = [feature.strip() for feature in best['FEATURES'].split(',')]
hp = int(best['HYPER_PARAM'])
fold = int(best['K_FOLDS'])
In [104]:
kf = KFold(fold, shuffle=True, random_state=1)
model = KNeighborsRegressor(n_neighbors = hp, algorithm='auto')
mses = cross_val_score(model, firedb[features],
firedb['area'], scoring="neg_mean_squared_error", cv=kf)
rmses = np.sqrt(np.absolute(mses))
avg_rmse = np.mean(rmses)
std_rmse = np.std(rmses)
print(avg_rmse,std_rmse)
In [101]:
rmses
Out[101]:
In [105]:
# cross_val_score fits clones of the estimator, so `model` itself is still
# unfitted; fit it on the full data before calling predict.
model.fit(firedb[features], firedb['area'])
predictions = model.predict(firedb[features])
predictions
In [ ]:
In [76]:
hyper_params = range(1,10)
mse_values = []
mad_values = []
predictions = []
numFolds = 10
# 10-fold cross validation
#kf = KFold(n_splits=10)
le = preprocessing.LabelEncoder()
X = firedb.iloc[:, 1:10].values
Y = le.fit_transform(firedb.iloc[:, 10].values)
kf = KFold(numFolds, shuffle=True)
conv_X = pd.get_dummies(firedb.iloc[:, 1:10])
In [ ]:
kf = KFold(n_splits = 10, shuffle = True)
# Walk through all ten folds (equivalent to calling next() on the split
# iterator ten times): print each (train_index, test_index) pair and keep
# the current fold's train/test frames.
for result in kf.split(firedb):
    print(result)
    train = firedb.iloc[result[0]]
    test = firedb.iloc[result[1]]
In [ ]:
for train_index, test_index in kf.split(X):
    train_X = conv_X.iloc[train_index, :]
    train_Y = Y[train_index]
    test_X = conv_X.iloc[test_index, :]
    test_Y = Y[test_index]
    for knumber in hyper_params:
        # Configuring the regressor
        knn = KNeighborsRegressor(n_neighbors=knumber, algorithm='brute', n_jobs=3)
        # Fitting the model on this fold's training split
        knn.fit(train_X, train_Y)
        # Predicting on this fold's test split
        predictions = knn.predict(test_X)
        # Recording the mean squared and mean absolute errors
        mse_values.append(mean_squared_error(test_Y, predictions))
        mad_values.append(mean_absolute_error(test_Y, predictions))
pd.Series(mse_values).plot()
pd.Series(mad_values).plot()
In [ ]: