Forest Fire Span Prediction


Objectives:


This notebook explores a data set to which the k-nearest neighbors (KNN) algorithm can be applied to predict a continuous value. Problems of this kind are known as regression problems; typical examples are predicting stock prices or predicting house prices from a series of events.

The dataset:


The data set chosen is the 'Forest Fire Data Set'. Its aim is to predict the burned area of forest fires. We found the data set at http://archive.ics.uci.edu/ml/datasets/Forest+Fires, and it is properly cited as reference 1 in the References section.

  • Data Set Characteristics: Multivariate
  • Number of Instances: 517
  • Area: Physical
  • Attribute Characteristics: Real
  • Number of Attributes: 13
  • Associated Tasks: Regression
  • Missing Values?: N/A

Data set abstract: This is a difficult regression task, where the aim is to predict the burned area of forest fires, in the northeast region of Portugal, by using meteorological and other data.

Participants:

  • Marco Olimpio - marco.olimpio at gmail
  • Rebecca Betwel - bekbetwel at gmail
  • Victor Hugo - victorhugo.automacao at gmail

Contents

  1. Introduction
  2. K-Nearest Neighbors - KNN
  3. Feature Selection
  4. Experiments
  5. Results and comparisons
  6. References

1. Introduction


Data set features:

Structure of the FWI System

The diagram below illustrates the components of the FWI System. Calculation of the components is based on consecutive daily observations of temperature, relative humidity, wind speed, and 24-hour rainfall. The six standard components provide numeric ratings of relative potential for wildland fire.


  • Location
    • X - X-axis spatial coordinate within the Montesinho park map: 1 to 9
    • Y - Y-axis spatial coordinate within the Montesinho park map: 2 to 9
  • Date
    • month - Month of the year: "jan" to "dec"
    • day - Day of the week: "mon" to "sun"
  • Fire Weather Index - FWI
    • FFMC - The Fine Fuel Moisture Code (FFMC) is a numeric rating of the moisture content of litter and other cured fine fuels. This code is an indicator of the relative ease of ignition and the flammability of fine fuel: 18.7 to 96.20
    • DMC - The Duff Moisture Code (DMC) is a numeric rating of the average moisture content of loosely compacted organic layers of moderate depth. This code gives an indication of fuel consumption in moderate duff layers and medium-size woody material: 1.1 to 291.3
    • DC - The Drought Code (DC) is a numeric rating of the average moisture content of deep, compact organic layers. This code is a useful indicator of seasonal drought effects on forest fuels and the amount of smoldering in deep duff layers and large logs: 7.9 to 860.6
    • ISI - The Initial Spread Index (ISI) is a numeric rating of the expected rate of fire spread. It combines the effects of wind and the FFMC on rate of spread without the influence of variable quantities of fuel: 0.0 to 56.10
    • temp - Temperature in °C: 2.2 to 33.30
    • RH - Relative humidity in %: 15.0 to 100
    • wind - Wind speed in km/h: 0.40 to 9.40
    • rain - Outside rain in mm/m2: 0.0 to 6.4
  • area - The burned area of the forest (in ha): 0.00 to 1090.84

2. K-Nearest Neighbors - KNN


3. Feature Selection



Note (translated from Portuguese): Victor, I think we could cover this in the project:
https://machinelearningmastery.com/feature-selection-machine-learning-python/
but we can leave it until after we finish; it would be the icing on the cake.

3. Start - Loading, Checking and Adjusting Data Set



In [13]:
# Standard library
import math
# Numerical computing and data handling
import numpy as np
import pandas as pd

# Plotting, rendered inline in the notebook
%matplotlib inline
import matplotlib.pyplot as plt

In [14]:
firedb = pd.read_csv("forestfires.csv")
firedb.columns


Out[14]:
Index(['X', 'Y', 'month', 'day', 'FFMC', 'DMC', 'DC', 'ISI', 'temp', 'RH',
       'wind', 'rain', 'area'],
      dtype='object')

The data set has no null values: every column reports 517 non-null entries.


In [15]:
firedb.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 517 entries, 0 to 516
Data columns (total 13 columns):
X        517 non-null int64
Y        517 non-null int64
month    517 non-null object
day      517 non-null object
FFMC     517 non-null float64
DMC      517 non-null float64
DC       517 non-null float64
ISI      517 non-null float64
temp     517 non-null float64
RH       517 non-null int64
wind     517 non-null float64
rain     517 non-null float64
area     517 non-null float64
dtypes: float64(8), int64(3), object(2)
memory usage: 52.6+ KB

In [40]:
firedb['area'] = firedb['area'].astype(np.float32)

In [12]:
firedb[['FFMC','ISI','temp','RH','wind','rain']].plot(figsize=(17,15))


Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x113590080>

In [24]:
firedb['month'].value_counts()


Out[24]:
aug    184
sep    172
mar     54
jul     32
feb     20
jun     17
oct     15
apr      9
dec      9
may      2
jan      2
nov      1
Name: month, dtype: int64

In [15]:
firedb[['area']].plot(figsize=(17,15))


Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x112ff2eb8>

In [16]:
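# Open question from the authors (translated from Portuguese): why the transformation?
# log(area + 1) compresses the strongly right-skewed burned area, and the +1 keeps
# fires with area = 0 defined.
firedb['area_adjusted'] = np.log(firedb['area']+1)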

In [60]:
fig, axes = plt.subplots(nrows=1, ncols=2)
firedb['area'].plot.hist(ax=axes[0],figsize=(17,8))
firedb['area_adjusted'].plot.hist(ax=axes[1])


Out[60]:
<matplotlib.axes._subplots.AxesSubplot at 0x1176371d0>
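
The `area_adjusted` transform can be inverted, which matters whenever a model trained on `area_adjusted` has to report its predictions in hectares. A minimal sketch, assuming NumPy's `log1p`/`expm1` pair (equivalent to `np.log(x + 1)` and its inverse) and a few made-up area values that are not rows from the data set:

```python
import numpy as np

# Hypothetical burned areas in ha (illustrative values, not rows from forestfires.csv)
area = np.array([0.0, 0.5, 10.0, 1090.84])

# Same transform as np.log(area + 1), written with the numerically stable helper
area_adjusted = np.log1p(area)

# Inverse transform: maps values on the log scale back to hectares
area_restored = np.expm1(area_adjusted)

print(area_adjusted)
print(area_restored)
```

Errors computed on `area_adjusted` are on the log scale, so they are not directly comparable with errors computed on `area` unless the predictions are mapped back first.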

Principal Component Analysis


Experiments



In [17]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score, KFold
from sklearn import preprocessing

In [18]:
firedb.head()


Out[18]:
   X  Y  month  day  FFMC   DMC     DC  ISI  temp  RH  wind  rain  area  area_adjusted
0  7  5    mar  fri  86.2  26.2   94.3  5.1   8.2  51   6.7   0.0   0.0            0.0
1  7  4    oct  tue  90.6  35.4  669.1  6.7  18.0  33   0.9   0.0   0.0            0.0
2  7  4    oct  sat  90.6  43.7  686.9  6.7  14.6  33   1.3   0.0   0.0            0.0
3  8  6    mar  fri  91.7  33.3   77.5  9.0   8.3  97   4.0   0.2   0.0            0.0
4  8  6    mar  sun  89.3  51.3  102.2  9.6  11.4  99   1.8   0.0   0.0            0.0

In [36]:
hyper_params = range(1,20)
features_list = [ ['FFMC'], ['DMC'], ['DC'], ['ISI'], ['temp'], ['RH'], ['wind'], ['rain'],
                  ['X', 'Y', 'FFMC', 'DMC','DC','ISI'], ['X', 'Y', 'temp','RH','wind','rain'],
                  ['FFMC', 'DMC','DC','ISI'], ['temp','RH','wind','rain'],
                  ['DMC', 'wind'], ['DC','RH','wind'], 
                  ['X', 'Y', 'DMC', 'wind'], ['X', 'Y', 'DC','RH','wind'], 
                  ['FFMC', 'DMC','DC','ISI', 'temp','RH','wind','rain'],
                  ['X', 'Y', 'FFMC', 'DMC','DC','ISI', 'temp','RH','wind','rain']  ]

num_folds = [3, 5, 7, 9, 10, 11, 13, 15, 17, 19, 21, 23]

outputs = ['area', 'area_adjusted']
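
The four lists above define a full grid: every combination of neighbor count, feature subset, fold count, and output is evaluated once. As a rough sketch of how large that grid is, the snippet below enumerates it with `itertools.product`. The name `n_feature_sets` is a stand-in introduced here for `len(features_list)`, which is 18 for the list above.

```python
from itertools import product

hyper_params = range(1, 20)
n_feature_sets = 18  # stand-in for len(features_list) in the cell above
num_folds = [3, 5, 7, 9, 10, 11, 13, 15, 17, 19, 21, 23]
outputs = ['area', 'area_adjusted']

# product() varies the last argument fastest, like four nested for-loops
grid = list(product(hyper_params, range(n_feature_sets), num_folds, outputs))

print(len(grid))  # 19 * 18 * 12 * 2 combinations
print(grid[0])
```

Each combination costs one or more cross-validation runs, so the size of the grid is the main driver of the total running time.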

In [20]:
import csv

# initialize the CSV file that will act as our database for the results

db_row = ['HYPER_PARAM', 'FEATURES', 'K_FOLDS', 'OUTPUT', 
          'AVG_RMSE', 'AVG_RMSE_%_AREA', 'STD_RMSE', 'CV_RMSE',
          'AVG_MAE', 'AVG_MAE_%_AREA', 'STD_MAE', 'CV_MAE']

#open the file in 'write' mode; the with-block closes it automatically
with open('results_db.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    #write the header row
    writer.writerow(db_row)

In [21]:
from IPython.display import clear_output
from time import time

start_time = time()

k = 0
k_total = len(hyper_params) * len(features_list) * len(num_folds) * len(outputs)

for hp in hyper_params:
    for features in features_list:
        for fold in num_folds:
            for output in outputs:
                k += 1
                
                kf = KFold(fold, shuffle=True, random_state=1)
                model = KNeighborsRegressor(n_neighbors = hp, algorithm='auto')
                
                mses = cross_val_score(model, firedb[features],
                                       firedb[output], scoring="neg_mean_squared_error", cv=kf)
                
                rmses = np.sqrt(np.absolute(mses))
                
                avg_rmse = np.mean(rmses)
                avg_rmse_per_area = avg_rmse / np.mean(firedb[output])
                
                std_rmse = np.std(rmses)
                cv_rmse = std_rmse / np.mean(firedb[output])
                                
                maes = cross_val_score(model, firedb[features],
                                       firedb[output], scoring="neg_mean_absolute_error", cv=kf)
                
                maes = np.absolute(maes)
                
                avg_mae = np.mean(maes)
                avg_mae_per_area = avg_mae / np.mean(firedb[output])
                
                std_mae = np.std(maes)
                cv_mae = std_mae / np.mean(firedb[output])
                
                
                db_row = [ hp, ', '.join(features), fold, output, 
                           avg_rmse, avg_rmse_per_area, std_rmse, cv_rmse,
                           avg_mae, avg_mae_per_area, std_mae, cv_mae ]
                
                print('ITERATION %d OF %d' % (k, k_total) )
                print( 'HP: ', hp )
                print('FEATURES: ', ', '.join(features) )
                print('FOLDS: ', fold)
                print('OUTPUT: ', output)
                print('AVG_RMSE: ', avg_rmse)
                print('AVG_RMSE_PER_AREA: ', avg_rmse_per_area)
                print('STD_RMSE: ', std_rmse)
                print('CV_RMSE: ', cv_rmse)
                print('AVG_MAE: ', avg_mae)
                print('AVG_MAE_PER_AREA: ', avg_mae_per_area)
                print('STD_MAE: ', std_mae)
                print('CV_MAE: ', cv_mae)
                print('\n\n')
                
                #clear_output(wait = True)
                
                #open the results file in 'append' mode,
                #which lets us add one row per iteration
                with open('results_db.csv', 'a', newline='') as file:
                    writer = csv.writer(file)
                    #write the assembled row
                    writer.writerow(db_row)

end_time = time()
elapsed_time = end_time - start_time
print('Elapsed time: ', elapsed_time)


Elapsed time:  559.6654286384583
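
The loop above calls `cross_val_score` twice per configuration, once for each metric. scikit-learn's `cross_validate` accepts several scorers at once, so both metrics can be computed from a single pass over the folds. The sketch below illustrates this on synthetic data from `make_regression`; the data, the sample size, and the choice of 5 neighbors are assumptions made for illustration and do not come from the notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_validate
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (assumption: not the forest fire data set)
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=1)

kf = KFold(5, shuffle=True, random_state=1)
model = KNeighborsRegressor(n_neighbors=5)

# One call evaluates both metrics on the same folds
scores = cross_validate(
    model, X, y, cv=kf,
    scoring=("neg_mean_squared_error", "neg_mean_absolute_error"),
)

# scikit-learn reports negated errors, so flip the sign before use
rmses = np.sqrt(-scores["test_neg_mean_squared_error"])
maes = -scores["test_neg_mean_absolute_error"]

print("RMSE mean/std:", rmses.mean(), rmses.std())
print("MAE  mean/std:", maes.mean(), maes.std())
```

Halving the number of model fits per configuration would roughly halve the running time of the grid search.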

In [37]:
results = pd.read_csv('results_db_complete.csv')
results.head()


Out[37]:
   HYPER_PARAM  FEATURES  K_FOLDS         OUTPUT   AVG_RMSE  AVG_RMSE_%_AVG   STD_RMSE  STD_RMSE_%_AVG    AVG_MAE  AVG_MAE_%_AVG    STD_MAE  STD_MAE_%_AVG
0            1      FFMC        3           area  64.234474        4.999845  33.784210        2.629676  22.584815       1.757944   6.838523       0.532293
1            1      FFMC        3  area_adjusted   2.118254        1.906575   0.170288        0.153271   1.577092       1.419492   0.144546       0.130101
2            1      FFMC        5           area  74.601200        5.806765  66.267227        5.158070  25.380312       1.975538  20.731563       1.613691
3            1      FFMC        5  area_adjusted   2.068721        1.861992   0.271317        0.244204   1.521285       1.369262   0.210700       0.189644
4            1      FFMC        7           area  60.500201        4.709179  35.810316        2.787382  19.424166       1.511927   6.023734       0.468872

In [38]:
results[ results['OUTPUT'] == 'area' ].sort_values(by='AVG_RMSE')[:5]


Out[38]:
      HYPER_PARAM  FEATURES  K_FOLDS  OUTPUT   AVG_RMSE  AVG_RMSE_%_AVG   STD_RMSE  STD_RMSE_%_AVG    AVG_MAE  AVG_MAE_%_AVG    STD_MAE  STD_MAE_%_AVG
1054            3      rain       23    area  37.646597        2.930314  51.812790        4.032974  12.982227       1.010503  12.502984       0.973200
1486            4      rain       23    area  37.771874        2.940065  51.799049        4.031904  12.871462       1.001881  12.516371       0.974242
1918            5      rain       23    area  37.861003        2.947003  51.785804        4.030873  12.860091       1.000996  12.506039       0.973438
2350            6      rain       23    area  37.912035        2.950975  51.780799        4.030484  12.830999       0.998732  12.512251       0.973921
3214            8      rain       23    area  37.944237        2.953481  51.774597        4.030001  12.830263       0.998674  12.512822       0.973966

In [5]:
results[ results['OUTPUT'] == 'area' ].sort_values(by='STD_RMSE')[:5]


Out[5]:
      HYPER_PARAM                                        FEATURES  K_FOLDS  OUTPUT   AVG_RMSE  AVG_RMSE_%_AVG   STD_RMSE  STD_RMSE_%_AVG    AVG_MAE  AVG_MAE_%_AVG    STD_MAE  STD_MAE_%_AVG
 984            3                                              RH        3    area  97.547290        7.592829   2.918129        0.227140  33.949893       2.642572   5.632186       0.438395
 288            1                                       DMC, wind        3    area  97.212446        7.566765   5.730754        0.446067  21.824580       1.698769   1.062871       0.082731
 408            1  X, Y, FFMC, DMC, DC, ISI, temp, RH, wind, rain        3    area  99.896722        7.775703   8.015584        0.623912  24.301300       1.891550   4.007867       0.311962
1416            4                                              RH        3    area  83.350460        6.487784  12.532951        0.975532  28.950223       2.253410   2.913883       0.226809
 312            1                                    DC, RH, wind        3    area  97.050761        7.554180  14.510319        1.129446  22.461071       1.748312   3.706224       0.288483

In [24]:
results[ results['OUTPUT'] == 'area' ].sort_values(by='AVG_MAE')[:5]


Out[24]:
      HYPER_PARAM              FEATURES  K_FOLDS  OUTPUT   AVG_RMSE  AVG_RMSE_%_AREA   STD_RMSE   CV_RMSE    AVG_MAE  AVG_MAE_%_AREA    STD_MAE    CV_MAE
  58            1    FFMC, DMC, DC, ISI       11    area  49.570663         3.858452  37.457692  2.915610  16.834410        1.310347   7.568464  0.589110
  50            1    FFMC, DMC, DC, ISI        5    area  55.622226         4.329490  27.703588  2.156376  17.077887        1.329299   4.285801  0.333596
  70            1    FFMC, DMC, DC, ISI       23    area  44.329268         3.450476  49.960001  3.888757  17.636206        1.372757  13.414093  1.044118
1054            5  temp, RH, wind, rain       23    area  42.471606         3.305880  48.530475  3.777487  17.944971        1.396790  11.961518  0.931054
1046            5  temp, RH, wind, rain       15    area  47.373685         3.687445  43.979429  3.423245  17.980452        1.399552   8.641483  0.672631

In [25]:
results[ results['OUTPUT'] == 'area' ].sort_values(by='STD_MAE')[:5]


Out[25]:
      HYPER_PARAM                    FEATURES  K_FOLDS  OUTPUT   AVG_RMSE  AVG_RMSE_%_AREA   STD_RMSE   CV_RMSE    AVG_MAE  AVG_MAE_%_AREA    STD_MAE    CV_MAE
3384           15  X, Y, temp, RH, wind, rain        3    area  62.384643         4.855859  28.334701  2.205500  21.705308        1.689485   1.058712  0.082407
  96            1                   DMC, wind        3    area  97.212446         7.566765   5.730754  0.446067  21.824580        1.698769   1.062871  0.082731
 504            3  X, Y, temp, RH, wind, rain        3    area  72.246404         5.623473  22.180449  1.726469  21.452032        1.669771   1.099584  0.085589
3144           14  X, Y, temp, RH, wind, rain        3    area  62.922219         4.897703  27.929218  2.173938  21.948226        1.708393   1.111336  0.086504
2904           13  X, Y, temp, RH, wind, rain        3    area  62.942560         4.899286  27.860243  2.168569  21.613934        1.682373   1.146942  0.089275

In [47]:
mean_weight = 0.5
std_weight = 1-mean_weight

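# Combined score: weighted sum of the mean and the spread of the fold errors (lower is better)
results['SCORE_RMSE'] = mean_weight*results['AVG_RMSE'] + std_weight*results['STD_RMSE']
# Standardize the score. The absolute value measures the distance from the mean in either
# direction, so an ascending sort ranks rows correctly only while they lie above the mean.
results['SCORE_RMSE'] = ( (np.absolute(results['SCORE_RMSE'] - results['SCORE_RMSE'].mean())) /
                           results['SCORE_RMSE'].std() )

# Same combined and standardized score for the MAE
results['SCORE_MAE'] = mean_weight*results['AVG_MAE'] + std_weight*results['STD_MAE']
results['SCORE_MAE'] = ( (np.absolute(results['SCORE_MAE'] - results['SCORE_MAE'].mean())) /
                           results['SCORE_MAE'].std() )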


results.head()


Out[47]:
   HYPER_PARAM  FEATURES  K_FOLDS         OUTPUT   AVG_RMSE  AVG_RMSE_%_AVG   STD_RMSE  STD_RMSE_%_AVG    AVG_MAE  AVG_MAE_%_AVG    STD_MAE  STD_MAE_%_AVG  SCORE_RMSE  SCORE_MAE
0            1      FFMC        3           area  64.234474        4.999845  33.784210        2.629676  22.584815       1.757944   6.838523       0.532293    1.061522   1.035485
1            1      FFMC        3  area_adjusted   2.118254        1.906575   0.170288        0.153271   1.577092       1.419492   0.144546       0.130101    0.978203   0.948428
2            1      FFMC        5           area  74.601200        5.806765  66.267227        5.158070  25.380312       1.975538  20.731563       1.613691    1.974522   2.230668
3            1      FFMC        5  area_adjusted   2.068721        1.861992   0.271317        0.244204   1.521285       1.369262   0.210700       0.189644    0.977106   0.947687
4            1      FFMC        7           area  60.500201        4.709179  35.810316        2.787382  19.424166       1.511927   6.023734       0.468872    1.025126   0.750776
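
The score in the cell above takes the absolute value of a z-score, which measures how far a row is from the mean score rather than how low its score is. The sketch below compares that absolute score with a signed z-score on a small hypothetical table. The four rows and the column names `SCORE_RMSE_SIGNED` and `SCORE_RMSE_ABS` are invented for illustration and are not taken from the results file.

```python
import pandas as pd

# Hypothetical results (illustrative numbers, not taken from results_db.csv)
results = pd.DataFrame({
    "AVG_RMSE": [40.0, 55.0, 60.0, 90.0],
    "STD_RMSE": [50.0, 28.0, 25.0, 5.0],
})

mean_weight = 0.5
std_weight = 1 - mean_weight

# Weighted sum of mean error and spread: lower is better
raw = mean_weight * results["AVG_RMSE"] + std_weight * results["STD_RMSE"]

# Signed z-score: keeps the ordering of the raw score
results["SCORE_RMSE_SIGNED"] = (raw - raw.mean()) / raw.std()
# Absolute z-score: distance from the mean in either direction
results["SCORE_RMSE_ABS"] = results["SCORE_RMSE_SIGNED"].abs()

best_signed = results["SCORE_RMSE_SIGNED"].idxmin()
best_abs = results["SCORE_RMSE_ABS"].idxmin()
print(results)
print("best by signed score:", best_signed, "| best by absolute score:", best_abs)
```

In this toy table the two rankings disagree: the absolute score prefers the row closest to the average, while the signed score prefers the row with the lowest combined error. Standardizing within a single `OUTPUT` value would also keep `area` and `area_adjusted` rows, which live on different scales, from sharing one mean.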

In [48]:
results[ results['OUTPUT'] == 'area' ].sort_values(by='SCORE_RMSE')[:5]


Out[48]:
      HYPER_PARAM                    FEATURES  K_FOLDS  OUTPUT   AVG_RMSE  AVG_RMSE_%_AVG   STD_RMSE  STD_RMSE_%_AVG    AVG_MAE  AVG_MAE_%_AVG    STD_MAE  STD_MAE_%_AVG  SCORE_RMSE  SCORE_MAE
 242            1          FFMC, DMC, DC, ISI        5    area  55.622226        4.329490  27.703588        2.156376  17.077887       1.329299   4.285801       0.333596    0.748460   0.458277
6746           16        temp, RH, wind, rain        5    area  56.972798        4.434615  28.305571        2.203232  21.088888       1.641505   2.025628       0.157670    0.790063   0.583666
3674            9  X, Y, temp, RH, wind, rain        5    area  60.524415        4.711064  24.888559        1.937261  21.873142       1.702549   3.742140       0.291278    0.792931   0.762763
4106           10  X, Y, temp, RH, wind, rain        5    area  60.498849        4.709074  25.097940        1.953559  21.659460       1.685916   3.321103       0.258506    0.796848   0.717307
 120            1                          RH        3    area  59.389493        4.622725  26.289632        2.046317  19.197119       1.494254   4.369890       0.340141    0.798602   0.616072

In [ ]:
results[ results['OUTPUT'] == 'area' ].sort_values(by='SCORE_MAE')[:5]

In [99]:
# Best configuration for the 'area' output according to SCORE_RMSE
best = results[ results['OUTPUT'] == 'area' ].sort_values(by='SCORE_RMSE').iloc[0]

# FEATURES is stored as a comma-separated string; turn it back into a list of column names
features = [feature.strip() for feature in str(best['FEATURES']).split(',')]

hp = int(best['HYPER_PARAM'])
fold = int(best['K_FOLDS'])

In [104]:
kf = KFold(fold, shuffle=True, random_state=1)
model = KNeighborsRegressor(n_neighbors = hp, algorithm='auto')

mses = cross_val_score(model, firedb[features],
                       firedb['area'], scoring="neg_mean_squared_error", cv=kf)

rmses = np.sqrt(np.absolute(mses))
                
avg_rmse = np.mean(rmses)
std_rmse = np.std(rmses)

print(avg_rmse,std_rmse)


55.6222261899 27.7035883176

In [101]:
rmses


Out[101]:
array([ 80.01691363,  93.85199869,  20.55676167,  32.58281942,  51.10263754])

In [105]:
# cross_val_score fits clones of the estimator, so `model` itself is still unfitted here;
# fit it on the full data set before asking for (in-sample) predictions
model.fit(firedb[features], firedb['area'])
predictions = model.predict(firedb[features])
predictions
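
`cross_val_score` fits clones of the estimator it receives, so the `model` object itself stays unfitted after cross-validation. A minimal sketch on synthetic data shows this behavior and two ways to get predictions: `cross_val_predict` for out-of-fold predictions and an explicit `fit` for in-sample ones. The data and the parameter values are assumptions for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.exceptions import NotFittedError
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (assumption: not the forest fire data set)
X, y = make_regression(n_samples=100, n_features=3, noise=5.0, random_state=1)
kf = KFold(5, shuffle=True, random_state=1)
model = KNeighborsRegressor(n_neighbors=3)

# Cross-validation works on clones and leaves `model` untouched
cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=kf)

try:
    model.predict(X)
    fitted_after_cv = True
except NotFittedError:
    fitted_after_cv = False
print("fitted after cross_val_score:", fitted_after_cv)

# Out-of-fold predictions: each row is predicted by a model that never saw it
oof_predictions = cross_val_predict(model, X, y, cv=kf)

# In-sample predictions: fit explicitly, then predict
model.fit(X, y)
in_sample_predictions = model.predict(X)
print(oof_predictions.shape, in_sample_predictions.shape)
```

Out-of-fold predictions are usually the better basis for judging a model, because in-sample predictions from a KNN regressor with few neighbors largely reproduce the training targets.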

In [ ]:


In [76]:
hyper_params = range(1,10)
mse_values = []
mad_values = []
predictions = []
numFolds = 10
# 10-fold cross validation
#kf = KFold(n_splits=10)
le = preprocessing.LabelEncoder()
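# .ix was removed from pandas; .iloc selects the same columns by position
# (columns 1 to 9: Y, month, day, FFMC, DMC, DC, ISI, temp, RH)
X = firedb.iloc[:, range(1, 10)].values

# column 10 is 'wind'; it is label-encoded here exactly as in the original draft
Y = le.fit_transform(firedb.iloc[:, 10].values)
kf = KFold(numFolds, shuffle=True)
# one-hot encode the categorical columns (month, day)
conv_X = pd.get_dummies(firedb.iloc[:, range(1, 10)])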

In [ ]:
kf = KFold(n_splits = 10, shuffle = True)
# kf.split yields one (train positions, test positions) pair per fold;
# looping over it visits all 10 folds instead of restarting the split each time
for result in kf.split(firedb):
    print (result)
    train = firedb.iloc[result[0]]
    test =  firedb.iloc[result[1]]
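
A single pass over `kf.split(...)` yields every fold once, and together the test folds cover each row exactly one time. The sketch below checks this for 517 rows, the size reported by `firedb.info()`. Fixing `random_state=1` is an assumption added here only to make the run reproducible.

```python
import numpy as np
from sklearn.model_selection import KFold

n_rows = 517  # number of rows reported by firedb.info()
kf = KFold(n_splits=10, shuffle=True, random_state=1)

# Count how often each row position lands in a test fold
test_counts = np.zeros(n_rows, dtype=int)
fold_sizes = []
for train_index, test_index in kf.split(np.arange(n_rows)):
    test_counts[test_index] += 1
    fold_sizes.append(len(test_index))

print(fold_sizes)
print(test_counts.min(), test_counts.max())
```

Because 517 is not divisible by 10, the folds differ in size by at most one row.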

In [ ]:
for train_index, test_index in kf.split(conv_X):
    train_X = conv_X.iloc[train_index, :]
    train_Y = Y[train_index]
    test_X = conv_X.iloc[test_index, :]
    test_Y = Y[test_index]
    for knumber in hyper_params:
        # Configuring the regressor
        knn = KNeighborsRegressor(n_neighbors=knumber, algorithm='brute', n_jobs=3)
    
        # Fitting the model on the training fold
        knn.fit(train_X, train_Y)
    
        # Predicting on the test fold
        predictions = knn.predict(test_X)
    
        # Recording the mean squared error and the mean absolute error
        mse_values.append(mean_squared_error(test_Y, predictions))
        mad_values.append(mean_absolute_error(test_Y, predictions))

# Each list holds one value per (fold, k) pair; average over the folds to get one value per k
mse_by_k = np.reshape(mse_values, (-1, len(hyper_params))).mean(axis=0)
mad_by_k = np.reshape(mad_values, (-1, len(hyper_params))).mean(axis=0)
plt.plot(hyper_params, mse_by_k, label='MSE')
plt.plot(hyper_params, mad_by_k, label='MAE')
plt.legend()

Results



In [ ]:

References


  1. P. Cortez and A. Morais. A Data Mining Approach to Predict Forest Fires using Meteorological Data. In J. Neves, M. F. Santos and J. Machado Eds., New Trends in Artificial Intelligence, Proceedings of the 13th EPIA 2007 - Portuguese Conference on Artificial Intelligence, December, Guimaraes, Portugal, pp. 512-523, 2007. APPIA, ISBN-13 978-989-95618-0-9. Available at: http://www.dsi.uminho.pt/~pcortez/fires.pdf
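  2. Natural Resources Canada, Canadian Wildland Fire Information System. Canadian Forest Fire Weather Index (FWI) System. Available at: http://cwfis.cfs.nrcan.gc.ca/background/summary/fwi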