EXPLORATORY DATA ANALYSIS


In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
%matplotlib inline

LOAD DATA


In [2]:
df = pd.read_csv('data/wheat-2013-supervised-edited.csv')
df.drop(df.columns[0],axis=1,inplace=True)
df.head()


Out[2]:
Latitude Longitude apparentTemperatureMax apparentTemperatureMin cloudCover dewPoint humidity precipIntensity precipIntensityMax precipProbability ... precipTypeIsSnow pressure temperatureMax temperatureMin visibility windBearing windSpeed NDVI DayInSeason Yield
0 46.811686 -118.695237 35.70 20.85 0.00 29.53 0.91 0.0000 0.0000 0.00 ... 0 1027.13 35.70 27.48 2.46 214 1.18 134.110657 0 35.7
1 46.929839 -118.352109 35.10 26.92 0.00 29.77 0.93 0.0001 0.0019 0.05 ... 0 1026.87 35.10 26.92 2.83 166 1.01 131.506592 0 35.7
2 47.006888 -118.510160 33.38 26.95 0.00 29.36 0.94 0.0001 0.0022 0.06 ... 1 1026.88 33.38 26.95 2.95 158 1.03 131.472946 0 35.7
3 47.162342 -118.699677 28.05 25.93 0.91 29.47 0.94 0.0002 0.0039 0.15 ... 1 1026.37 33.19 27.17 2.89 153 1.84 131.288300 0 35.7
4 47.157512 -118.434056 28.83 25.98 0.91 29.86 0.94 0.0003 0.0055 0.24 ... 0 1026.19 33.85 27.07 2.97 156 1.85 131.288300 0 35.7

5 rows × 22 columns


In [3]:
df.shape


Out[3]:
(177229, 22)
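
Rows with missing values get dropped later (before the pair plot), so a quick look at per-column missingness is useful here; a minimal sketch:

# Count missing values per column; rows containing NaNs are dropped before the pair plot.
print(df.isnull().sum().sort_values(ascending=False).head(10))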

TARGET DISTRIBUTION

Generally, I start by viewing the distribution of the target variable to ensure that the models will be trained on reasonably "balanced" data. Upon review, the distribution is fairly balanced. Based on this distribution, I would be comfortable with a 75% (~133,000 rows) train size and 25% (~44,000 rows) test size as my train/test split.


In [4]:
col = 'Yield'
figs,axes = plt.subplots(nrows=1,ncols=2)
figs.set_figwidth(12)
figs.set_figheight(5)
df[col].plot(kind='kde', ax=axes[0], grid=True, title='KDE:'+col)
df[col].plot(kind='hist',ax=axes[1], grid=True, title='HIST:'+col)
plt.show()
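
For reference, a minimal sketch of the 75%/25% split described above, assuming 'Yield' is the target column and using an arbitrary fixed random_state for reproducibility:

from sklearn.model_selection import train_test_split

# Illustrative 75/25 split; 'Yield' is the target, everything else is a feature.
X = df.drop(columns=['Yield'])
y = df['Yield']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
print(X_train.shape, X_test.shape)  # roughly ~133,000 and ~44,000 rows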


FEATURE DISTRIBUTIONS

I plotted the KDE and histogram of every feature except "precipTypeIsOther", since it has zero variance. Viewing the distributions gives me a loose first impression of each feature's usefulness. Later on, I will cross-check with coefficients and/or feature importances to decide whether or not to keep a feature.


In [5]:
feature_cols = df.columns[:-1]  # every column except the target ('Yield')
figs, axes = plt.subplots(nrows=len(feature_cols), ncols=2)
figs.set_figwidth(12)
figs.set_figheight(5 * len(feature_cols))
for i, col in enumerate(feature_cols):
    # Left panel: KDE of the feature.
    df[col].plot(kind='kde', ax=axes[i, 0], grid=True, title='KDE:' + col)
    if abs(df[col].max()) >= 1:
        # Right panel: histogram for features on larger scales.
        df[col].plot(kind='hist', ax=axes[i, 1], grid=True, title='HIST:' + col)
    else:
        # Right panel: log-x KDE for features whose values stay below 1.
        df[col].plot(kind='kde', ax=axes[i, 1], grid=True, title='KDE(log(x)):' + col, logx=True)
plt.show()
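
The zero-variance check behind dropping "precipTypeIsOther" can also be done programmatically. This is just a sketch on my part (the column name comes from the note above; the check itself is an assumption about how I'd flag constant columns):

# Flag constant (zero-variance) columns; 'precipTypeIsOther' should appear here
# if it is still present in the edited CSV (assumption, not verified above).
constant_cols = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]
print(constant_cols)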


SEABORN: Pair Plot

I like plotting a seaborn "pair plot" to show any correlation between the features and the target.


In [6]:
df.dropna(inplace=True)  # drop rows with missing values
df.drop(df.columns[:5], axis=1, inplace=True)  # drop the first five columns (Latitude through cloudCover)
sns.pairplot(df)


Out[6]:
<seaborn.axisgrid.PairGrid at 0x116e75090>
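
With ~177,000 rows the pair plot can take a while to render; one option (an assumption on my part, not something done above) is to draw it on a random subsample, which usually preserves the visible structure:

# Pair plot on a random 5,000-row subsample to keep rendering time manageable
# (sample size and random_state are arbitrary choices for illustration).
sns.pairplot(df.sample(n=5000, random_state=42))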

SEABORN: Heat Map

The seaborn pair plot is great, but I actually prefer looking at my correlation heat map. Specifically, I find it easier to evaluate the relationship between the target and each feature.


In [7]:
cols = list(df.columns)
cm = np.corrcoef(df[cols].values.T)
plt.figure(figsize=(10,10))
sns.set(font_scale=1.5)
hm = sns.heatmap(cm,cbar=True,
                 annot=True,
                 square=True,
                 fmt='.2f',
                 annot_kws={'size':5},
                 yticklabels=cols,
                 xticklabels=cols)
plt.show()
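
To read the target row of the heat map more directly, the correlations with Yield can also be pulled out and ranked; a minimal sketch, assuming 'Yield' is still a column of the reduced dataframe:

# Rank features by the absolute strength of their linear correlation with Yield.
corr_with_yield = df.corr()['Yield'].drop('Yield')
print(corr_with_yield.reindex(corr_with_yield.abs().sort_values(ascending=False).index))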