PREPROCESSING



In [1]:

    
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
%matplotlib inline

LOAD DATA

The following features were dropped to prevent overfitting: CountyName, State, and Date.



In [2]:

    
df = pd.read_csv('data/wheat-2013-supervised.csv')
drop_cols = ['CountyName','State','Date']
df.drop(drop_cols,axis=1,inplace=True)
df.head()









    Out[2]:

	Latitude	Longitude	apparentTemperatureMax	apparentTemperatureMin	cloudCover	dewPoint	humidity	precipIntensity	precipIntensityMax	precipProbability	...	precipTypeIsOther	pressure	temperatureMax	temperatureMin	visibility	windBearing	windSpeed	NDVI	DayInSeason	Yield
0	46.811686	-118.695237	35.70	20.85	0.00	29.53	0.91	0.0000	0.0000	0.00	...	0	1027.13	35.70	27.48	2.46	214	1.18	134.110657	0	35.7
1	46.929839	-118.352109	35.10	26.92	0.00	29.77	0.93	0.0001	0.0019	0.05	...	0	1026.87	35.10	26.92	2.83	166	1.01	131.506592	0	35.7
2	47.006888	-118.510160	33.38	26.95	0.00	29.36	0.94	0.0001	0.0022	0.06	...	0	1026.88	33.38	26.95	2.95	158	1.03	131.472946	0	35.7
3	47.162342	-118.699677	28.05	25.93	0.91	29.47	0.94	0.0002	0.0039	0.15	...	0	1026.37	33.19	27.17	2.89	153	1.84	131.288300	0	35.7
4	47.157512	-118.434056	28.83	25.98	0.91	29.86	0.94	0.0003	0.0055	0.24	...	0	1026.19	33.85	27.07	2.97	156	1.85	131.288300	0	35.7

5 rows × 23 columns

DATA SIZE CHECK

I check the size of the data to decide whether to run on my local computer or on AWS.



In [3]:

    
df.shape









    Out[3]:





(177493, 23)
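
The row count alone does not say how much memory the frame needs, so a rough footprint estimate can help with the local-versus-AWS decision. The sketch below is only an illustration: `memory_megabytes` is a hypothetical helper, and `demo` is an all-zero stand-in with the same shape as the wheat data, assuming every column is stored as float64 (about 33 MB at 8 bytes per value).

```python
import numpy as np
import pandas as pd

def memory_megabytes(frame):
    """Return the in-memory size of a DataFrame in megabytes (hypothetical helper)."""
    # deep=True also counts the contents of object (string) columns.
    return frame.memory_usage(deep=True).sum() / 1e6

# Stand-in data: same shape as the wheat data, all float64 zeros (an assumption).
demo = pd.DataFrame(np.zeros((177493, 23)))
size_mb = memory_megabytes(demo)
print('{:.1f} MB'.format(size_mb))
```

On the real data the same call would be `memory_megabytes(df)`; the result depends on the actual column dtypes.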

NULL CHECK

Missing values matter for the preprocessing stage, where they must either be imputed (for example, with the column mean) or dropped.

The following features have missing data: precipIntensity, precipIntensityMax, precipProbability, pressure, visibility.



In [4]:

    
df.isnull().sum()









    Out[4]:





Latitude                    0
Longitude                   0
apparentTemperatureMax      0
apparentTemperatureMin      0
cloudCover                  0
dewPoint                    0
humidity                    0
precipIntensity             1
precipIntensityMax          1
precipProbability           1
precipAccumulation          0
precipTypeIsRain            0
precipTypeIsSnow            0
precipTypeIsOther           0
pressure                  254
temperatureMax              0
temperatureMin              0
visibility                 30
windBearing                 0
windSpeed                   0
NDVI                        0
DayInSeason                 0
Yield                       0
dtype: int64
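
Raw counts are easier to judge next to the share of the data they represent. The sketch below shows one possible way to report only the columns with nulls, their percentages, and the number of rows affected. `null_summary` is a hypothetical helper and `demo` is toy data, not the wheat dataset.

```python
import numpy as np
import pandas as pd

def null_summary(frame):
    """Return (per-column null counts and percentages, number of rows with any null)."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({'nulls': counts,
                            'percent': 100.0 * counts / len(frame)})
    rows_with_nulls = int(frame.isnull().any(axis=1).sum())
    # Keep only the columns that actually contain nulls.
    return summary[summary['nulls'] > 0], rows_with_nulls

# Toy data for illustration only.
demo = pd.DataFrame({'pressure': [1027.1, np.nan, 1026.9, np.nan],
                     'visibility': [2.5, 2.8, np.nan, 3.0],
                     'humidity': [0.91, 0.93, 0.94, 0.94]})
summary, rows_with_nulls = null_summary(demo)
print(summary)
print('rows with at least one null: {}'.format(rows_with_nulls))
```

On the real data the call would be `null_summary(df)`.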

FEATURE VARIANCE CHECK

I check for zero-variance features (i.e. features that take just one value). The following feature has zero variance: precipTypeIsOther



In [5]:

    
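# Skip the first five columns (coordinates, apparent temperatures, cloud cover).
for col in df.columns[5:]:
    var = df[col].var()  # compute once, reuse for the test and the message
    if var == 0:
        print('*****LOW VARIANCE WARNING***** ==> {} ==> var:{}'.format(col, var))
    else:
        print('{} ==> var:{}'.format(col, var))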









    



dewPoint ==> var:278.479665344
humidity ==> var:0.026761541123
precipIntensity ==> var:2.07884441849e-05
precipIntensityMax ==> var:0.00200543474883
precipProbability ==> var:0.064219439784
precipAccumulation ==> var:0.131102111636
precipTypeIsRain ==> var:0.166287619868
precipTypeIsSnow ==> var:0.0822035304839
*****LOW VARIANCE WARNING***** ==> precipTypeIsOther ==> var:0.0
pressure ==> var:74.0873495915
temperatureMax ==> var:430.781354338
temperatureMin ==> var:316.90117822
visibility ==> var:1.64280922978
windBearing ==> var:10837.5596277
windSpeed ==> var:22.7315132832
NDVI ==> var:102.622306243
DayInSeason ==> var:2873.90002706
Yield ==> var:231.469040495
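
An alternative to comparing a floating-point variance with zero is to count distinct values per column, which also covers every column rather than a slice. The sketch below is an illustration under that assumption, not the check used above. `constant_columns` is a hypothetical helper and `demo` is toy data.

```python
import pandas as pd

def constant_columns(frame):
    """Return names of columns with at most one distinct non-null value."""
    distinct = frame.nunique()  # NaN is ignored by default
    return list(distinct[distinct <= 1].index)

# Toy data for illustration only.
demo = pd.DataFrame({'precipTypeIsRain': [0, 1, 0, 1],
                     'precipTypeIsOther': [0, 0, 0, 0],
                     'windSpeed': [1.18, 1.01, 1.03, 1.84]})
constant = constant_columns(demo)
print(constant)
```

On the real data, `constant_columns(df)` would be expected to flag the same feature as the variance loop.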

REMOVE USELESS FEATURE(S)

Every value of "precipTypeIsOther" is zero, as the value-count bar plot from the cell below shows, so the feature carries no information and I removed it.



In [6]:

    
df.precipTypeIsOther.value_counts().plot(kind='bar')
df.drop('precipTypeIsOther',axis=1,inplace=True)

IMPUTATION | DROP NULL ROWS

I'm typically frugal with my data and try not to throw anything out (i.e. I would impute averages or other appropriate estimates). However, the null counts above show that at most 287 of the 177,493 rows (under 0.2%) contain a NaN, which is small enough for me to simply drop those rows.



In [7]:

    
df.dropna(inplace=True)
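
For comparison, the sketch below puts the two options side by side on toy data: dropping rows with any null, and mean imputation with `fillna`. It is only an illustration of the trade-off. `demo` is hypothetical, and the mean is just one possible fill value.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
demo = pd.DataFrame({'pressure': [1027.0, np.nan, 1026.0, 1025.0],
                     'visibility': [2.0, 3.0, np.nan, 4.0]})

# Option 1: drop every row that contains at least one null.
dropped = demo.dropna()
fraction_lost = 1.0 - len(dropped) / float(len(demo))

# Option 2: keep all rows and fill each null with its column mean.
imputed = demo.fillna(demo.mean())

print('fraction of rows lost by dropping: {:.2f}'.format(fraction_lost))
print(imputed)
```

Dropping is the simpler choice when `fraction_lost` is tiny, as it is for the wheat data. Imputation keeps every row but inserts values that were never observed.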

SAVE DATA TO CSV



In [8]:

    
df.to_csv('data/wheat-2013-supervised-edited.csv')
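
One detail to be aware of when reloading the file: `to_csv` writes the row index as an unnamed first column by default, and after `dropna` that index has gaps. The sketch below shows the difference on toy data, using in-memory buffers instead of files. Whether to pass `index=False` depends on how the file is read later, which is not shown here.

```python
import io
import pandas as pd

# Toy data with a gapped index, similar to a frame after dropna().
demo = pd.DataFrame({'NDVI': [134.11, 131.51], 'Yield': [35.7, 35.7]},
                    index=[0, 2])

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written.
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```

If the default is kept, reading the file back with `pd.read_csv(path, index_col=0)` restores the original index instead of adding an extra column.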

	Latitude	Longitude	apparentTemperatureMax	apparentTemperatureMin	cloudCover	dewPoint	humidity	precipIntensity	precipIntensityMax	precipProbability	...	pressure	temperatureMax	temperatureMin	visibility	windBearing	windSpeed	NDVI	Yield
0	46.811686	-118.695237	35.70	20.85	0.00	29.53	0.91	0.0000	0.0000	0.00	...	1027.13	35.70	27.48	2.46	214	1.18	134.110657	35.7
1	46.929839	-118.352109	35.10	26.92	0.00	29.77	0.93	0.0001	0.0019	0.05	...	1026.87	35.10	26.92	2.83	166	1.01	131.506592	35.7
2	47.006888	-118.510160	33.38	26.95	0.00	29.36	0.94	0.0001	0.0022	0.06	...	1026.88	33.38	26.95	2.95	158	1.03	131.472946	35.7
3	47.162342	-118.699677	28.05	25.93	0.91	29.47	0.94	0.0002	0.0039	0.15	...	1026.37	33.19	27.17	2.89	153	1.84	131.288300	35.7
4	47.157512	-118.434056	28.83	25.98	0.91	29.86	0.94	0.0003	0.0055	0.24	...	1026.19	33.85	27.07	2.97	156	1.85	131.288300	35.7