Our initial goal was to apply a machine learning (ML) approach to accurately predict the likelihood of a wildfire occurring. The data we used was first balanced so that we had an equal amount of data for occasions with and without fires. This data was stored in CSV format. Using the Python library Pandas we can analyse the structure of this CSV.
In [12]:
import pandas as pd

df = pd.io.parsers.read_csv(
    'Data/NewBalanced.csv',
)
print(df.shape)
print('\n')
print(df.head(5))
print('\n')
print(df.tail(1))
This file has 23 features and 10,341 data points. Clearly not all of these features are useful for training a model; for example, we have the date and location. By applying principal component analysis (https://en.wikipedia.org/wiki/Principal_component_analysis) to our data we decided that seven features, including 'avgTemp', 'avgWind', '14dayAvgTemp', '14dayAvgHum' and '14DayAvgRain', were the most informative about the values in the 'Fire' column, where a 1 corresponds to there being a fire and a 0 to no fire. Next came model selection. Due to the relatively small amount of data we had relative to the number of useful features, we opted for a Support Vector Machine based model.
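For reference, below is a minimal sketch of the kind of PCA-based inspection described above. It is illustrative rather than our exact analysis code: the handling of the label and non-numeric columns is an assumption about the CSV layout.

In [ ]:
import numpy as np
from sklearn.decomposition import PCA
from sklearn import preprocessing

# Illustrative sketch only: standardise the numeric candidate features and
# inspect how much variance each principal component explains, plus the
# feature loadings on the first component.
numeric = df.drop('Fire', axis=1).select_dtypes(include=[np.number])
numeric_std = preprocessing.StandardScaler().fit_transform(numeric)

pca = PCA()
pca.fit(numeric_std)
print(pca.explained_variance_ratio_)   # variance explained per component
print(numeric.columns.values)
print(np.abs(pca.components_[0]))      # first component's loading on each feature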
A Support Vector Machine (or SVM for short) is a supervised ML method that can be used for the classification of data. Each data item is plotted in n-dimensional space (where n corresponds to the number of features) and classification is then performed by finding a hyperplane that separates the classes of the data well. It does so by identifying the data points of each class that lie closest to the boundary and basing the position of the hyperplane upon them. These points are known as support vectors, hence the name.
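As a quick aside, the snippet below (not part of our pipeline) fits an SVM to a tiny two-feature toy dataset with made-up values and prints the support vectors it selected, purely to make the idea concrete.

In [ ]:
from sklearn.svm import SVC

# Toy illustration only: two features, two classes, made-up values.
X_toy = [[1.0, 2.0], [2.0, 1.5], [1.5, 1.0],   # class 0
         [6.0, 6.5], [7.0, 6.0], [6.5, 7.5]]   # class 1
y_toy = [0, 0, 0, 1, 1, 1]

toy_clf = SVC(kernel='linear')
toy_clf.fit(X_toy, y_toy)
print(toy_clf.support_vectors_)  # the points the separating hyperplane rests on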
SVMs work particularly well when the dimensionality of the feature space is large (as in our case) because, as it increases, it becomes more likely that the classes will form distinct clusters, allowing for a better fitting hyperplane. In addition, having a relatively small amount of data is not a deal breaker: provided the classes form relatively tight clusters, a hyperplane fitted to a smaller dataset will sit in a similar place to one fitted to a larger dataset. It is for these reasons that we opted to go with an SVM model to make our predictions. Obviously we are assuming that our relatively small dataset is representative of what data in each class generally looks like, and training on a larger dataset likely wouldn't hurt. This is something we would like to improve our model by doing if provided with the resources.
In the context of our data, the classes we are training to classify are given by the 'Fire' column in the CSV file (0 for no fire and 1 for fire). We consider only the features we deemed important through PCA when training our model. Construction of the DataFrame and splitting of the data looks like this:
In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

df = pd.io.parsers.read_csv(
    'Data/NewBalanced.csv',
    header=None,
    skiprows=[0],
    usecols=[5, 10, 15, 17, 18, 19, 20, 22]  # the 7 selected feature columns plus the 'Fire' column
)
X = df.values[:, :7]  # feature values
y = df.values[:, 7]   # 'Fire' labels
# split the data into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=12345)
Next we have to standardise the data. Standardisation is an integral part of preprocessing for an SVM: it ensures all features exist on the same scale, so that features with large numeric ranges do not dominate the distance calculations the kernel relies on. Note that the scaler is fitted on the training split only, so no information from the test set leaks into training.
In [3]:
from sklearn import preprocessing
std_scale = preprocessing.StandardScaler().fit(X_train) #allows data to be standardised under the same scale
X_train_std = std_scale.transform(X_train)
X_test_std = std_scale.transform(X_test)
Implementing an SVM from scratch would be a tedious and tricky process. Luckily, Scikit-Learn has already done so by providing a Python wrapper around the C++ library LibSVM, a very efficient library for SVM-related tasks.
In [4]:
from sklearn.svm import SVC

clf = SVC(C=1.0, cache_size=200, class_weight='balanced', coef0=0.0,
          decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
          max_iter=-1, probability=True, random_state=None, shrinking=True,
          tol=0.001, verbose=False)
clf.fit(X_train_std, y_train)  # train on the standardised training data
Out[4]:
Of the input parameters above, the most important are C, class_weight, gamma and kernel. The purpose of C is to decide the trade-off between fitting the model to the training set and maintaining a smooth hyperplane. class_weight controls how heavily each class is penalised relative to its frequency in the training data; since our data is balanced, 'balanced' effectively gives both classes equal weight. gamma corresponds to how much influence a single training point has over the fitting of the hyperplane; in this example we have let sklearn select gamma automatically. Finally, the kernel is the function responsible for finding the mathematical relationship between the independent feature vectors and the corresponding classes. In our case we have selected 'rbf', or 'radial basis function'. This kernel allows fitting of a non-linear hyperplane to the data, which is useful as the relationship between our features (e.g. average temperature, humidity) and our classes (fire, no fire) may not necessarily be linear.
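For intuition, the RBF kernel scores the similarity of two feature vectors as exp(-gamma * ||x - x'||^2). The snippet below is a standalone illustration with made-up vectors, comparing a hand-computed value against Scikit-Learn's rbf_kernel helper.

In [ ]:
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Standalone illustration with made-up feature vectors.
x1 = np.array([[20.0, 5.0, 18.0, 40.0, 1.2, 0.0, 3.0]])
x2 = np.array([[32.0, 12.0, 30.0, 15.0, 0.0, 0.0, 1.0]])
gamma = 1.0 / x1.shape[1]  # mirrors gamma='auto' (1 / number of features)

by_hand = np.exp(-gamma * np.sum((x1 - x2) ** 2))
print(by_hand)
print(rbf_kernel(x1, x2, gamma=gamma))  # same value via Scikit-Learn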
The earlier call to clf.fit() trained the model. Predictions can now be made with the following code snippet:
In [9]:
clf.predict(X_test_std)
Out[9]:
In addition, an accuracy score can be calculated similarly:
In [14]:
print('Accuracy is: {}%'.format(clf.score(X_test_std, y_test, sample_weight=None)*100))
This accuracy can be tweaked by changing the hyper-parameters used to train the model, as well as by altering the data that is trained upon by changing the seed used when splitting it. Our final model obtains an accuracy of x%.
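As an aside, one systematic way to explore those hyper-parameters is a grid search with cross-validation. The sketch below is illustrative only; the parameter grid shown is an assumption rather than the values we settled on.

In [ ]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative hyper-parameter search; the grid values are assumptions.
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': ['auto', 0.01, 0.1, 1],
}
search = GridSearchCV(SVC(kernel='rbf', class_weight='balanced'), param_grid, cv=5)
search.fit(X_train_std, y_train)
print(search.best_params_)
print(search.best_score_)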
When designing a model for a scenario such as a wildfire, it is far more useful to have probability values associated with the predicted class outcomes. Unfortunately this is not an out-of-the-box functionality of SVMs. Luckily, however, it can be achieved by applying a procedure known as Platt scaling (https://en.wikipedia.org/wiki/Platt_scaling). Essentially, this approach fits a sigmoid function to the SVM's decision values, mapping them onto a probability distribution over the classes. In doing so it allows us to associate data points with the probability of belonging to each class. LibSVM, and by proxy Scikit-Learn, has an efficient implementation of this procedure (enabled by the probability=True argument we passed to SVC above), which means it only takes a single line:
In [24]:
clf.predict_proba(X_test_std)
Out[24]:
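To make the idea behind Platt scaling concrete, the sketch below squashes the SVM's raw decision values through a sigmoid by hand. The A and B parameters are placeholders purely for illustration; Scikit-Learn fits the real ones internally (via cross-validation) when probability=True.

In [ ]:
import numpy as np

decision_values = clf.decision_function(X_test_std)  # signed distances to the hyperplane

# Placeholder sigmoid parameters, for illustration only.
A, B = -1.0, 0.0
rough_probs = 1.0 / (1.0 + np.exp(A * decision_values + B))

print(rough_probs[:5])                        # hand-rolled sigmoid mapping
print(clf.predict_proba(X_test_std)[:5, 1])   # the fitted Platt-scaled probabilities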
Predictions can be made with this model by first retrieving the prediction data values from a CSV, standardising them under the same scale used for training, and then running Scikit-Learn's predict() function as shown earlier. For example:
In [17]:
foredf = pd.io.parsers.read_csv(
    'Data/svminput.csv',
    header=None,
    skiprows=[0],
    usecols=[1, 2, 3, 4, 5, 6, 8, 9, 10, 11]
)
X_forecast = foredf.values[:, 3:]                 # keep only the feature columns
X_forecast_std = std_scale.transform(X_forecast)  # reuse the scale fitted on the training data
fore_pred = clf.predict(X_forecast_std)
We then opted to append the predictions array above to a pandas DataFrame and write that DataFrame out as a new CSV, 'svmoutput.csv'.
In [18]:
forearray = foredf.values.tolist()
for i, element in enumerate(forearray):
    element.append(fore_pred[i])
    #element.append(fore_prob[i][1])
df = pd.DataFrame(forearray)
df.to_csv('Data/svmoutput.csv')
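An equivalent, slightly more direct way to build the same output is to assign the predictions as a named DataFrame column; this is an alternative sketch, and the column name 'prediction' is our choice for illustration.

In [ ]:
# Alternative sketch: attach the predictions as a named column instead of
# appending to nested lists. The column name 'prediction' is illustrative.
out = foredf.copy()
out['prediction'] = fore_pred
# out['fire_probability'] = fore_prob[:, 1]  # if Platt-scaled probabilities are wanted too
out.to_csv('Data/svmoutput.csv', index=False)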
As you would expect, the generated CSV has the same format as the input CSV, the only difference being the prediction column appended to the end.
In [28]:
df = pd.io.parsers.read_csv(
    'Data/svmoutput.csv',
)
print(df.shape)
print('\n')
print(df.head(10))
The ideas and code snippets above form the basis of our code. We have opted for an object-oriented approach as this lends us several advantages, such as clarity and generalisability. Below is an example of a script that utilises our code to process, standardise, train and then test a model. It outputs a prediction for every value in the test dataset along with its probability (calculated through Platt scaling) and the correct value. Finally, it outputs the overall accuracy the model achieved when making predictions upon the dataset.
In [ ]:
'''
Author: Flinn Dolman
@License: MIT
An example script that leverages our code to train a model and make predictions based upon it. Predictions
are printed to stdout and then the model used to make the predictions is saved.
'''
from SVM import SVM
from Standardiser import Standardiser
def Main():
    forecast_loc = 'Data/svminput.csv'
    # load, split and standardise the data
    standard_data = Standardiser()
    standard_data.initialise()
    # train the SVM on the standardised splits
    clf = SVM()
    clf.initialise(standard_data.get_std_X_train(), standard_data.get_std_X_test(), standard_data.get_y_train(), standard_data.get_y_test())
    print('\nThese are the predictions: {}\n'.format(clf.predictions()))
    predictions, probs = clf.predictions()
    y_test = standard_data.get_y_test()
    for i in range(len(predictions)):
        print('Prediction: {}, with probability: {}, correct value: {}'.format(predictions[i], probs[i], y_test[i]))
    print('Accuracy is: {}%'.format(clf.accuracy()*100))
    # predict on the forecast data, write the output CSV and save the model
    fore_Pred, fore_Prob = clf.forecast_Pred(standard_data.loadForecast(forecast_loc))
    standard_data.make_CSV(fore_Pred, fore_Prob, 'Data/svmoutputnew.csv')
    clf.saveModel()

if __name__ == "__main__":
    Main()
The structure of our code means that all this script is really responsible for is initialising objects and formatting the predictions, probabilities and correct values.
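The saveModel() call at the end of the script persists the trained classifier for later reuse. As a rough sketch of one way such persistence can be done with joblib (the file name is illustrative, and our SVM class's saveModel() may differ in detail):

In [ ]:
# Minimal sketch of persisting and reloading a fitted estimator with joblib.
from sklearn.externals import joblib   # in newer Scikit-Learn versions: import joblib

joblib.dump(clf, 'Data/svm_model.pkl')        # save the fitted classifier
restored = joblib.load('Data/svm_model.pkl')  # reload it later
print(restored.score(X_test_std, y_test))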