EXPLORATORY DATA ANALYSIS


In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
%matplotlib inline

LOAD DATA


In [2]:
df = pd.read_csv('data/wheat-2013-supervised-edited.csv')
df.drop(df.columns[0],axis=1,inplace=True)
df.head()


Out[2]:
Latitude Longitude apparentTemperatureMax apparentTemperatureMin cloudCover dewPoint humidity precipIntensity precipIntensityMax precipProbability ... precipTypeIsSnow pressure temperatureMax temperatureMin visibility windBearing windSpeed NDVI DayInSeason Yield
0 46.811686 -118.695237 35.70 20.85 0.00 29.53 0.91 0.0000 0.0000 0.00 ... 0 1027.13 35.70 27.48 2.46 214 1.18 134.110657 0 35.7
1 46.929839 -118.352109 35.10 26.92 0.00 29.77 0.93 0.0001 0.0019 0.05 ... 0 1026.87 35.10 26.92 2.83 166 1.01 131.506592 0 35.7
2 47.006888 -118.510160 33.38 26.95 0.00 29.36 0.94 0.0001 0.0022 0.06 ... 1 1026.88 33.38 26.95 2.95 158 1.03 131.472946 0 35.7
3 47.162342 -118.699677 28.05 25.93 0.91 29.47 0.94 0.0002 0.0039 0.15 ... 1 1026.37 33.19 27.17 2.89 153 1.84 131.288300 0 35.7
4 47.157512 -118.434056 28.83 25.98 0.91 29.86 0.94 0.0003 0.0055 0.24 ... 0 1026.19 33.85 27.07 2.97 156 1.85 131.288300 0 35.7

5 rows × 22 columns


In [3]:
df.shape


Out[3]:
(177229, 22)
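
Rows with missing values get dropped later (before the pair plot), so a quick look at per-column missingness is useful here; a minimal sketch:

# Count missing values per column; rows containing NaNs are dropped before the pair plot.
print(df.isnull().sum().sort_values(ascending=False).head(10))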

TARGET DISTRIBUTION

Generally, I start by viewing the distribution of the target variable to ensure that the models will be trained on reasonably "balanced" data. Upon review, the distribution is fairly balanced. Based on this distribution, I would be comfortable with a 75% (~133,000 rows) train size and 25% (~44,000 rows) test size as my train/test split.


In [4]:
col = 'Yield'
figs,axes = plt.subplots(nrows=1,ncols=2)
figs.set_figwidth(12)
figs.set_figheight(5)
df[col].plot(kind='kde', ax=axes[0], grid=True, title='KDE:'+col)
df[col].plot(kind='hist',ax=axes[1], grid=True, title='HIST:'+col)
plt.show()
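
For reference, a minimal sketch of the 75%/25% split described above, assuming 'Yield' is the target column and using an arbitrary fixed random_state for reproducibility:

from sklearn.model_selection import train_test_split

# Illustrative 75/25 split; 'Yield' is the target, everything else is a feature.
X = df.drop(columns=['Yield'])
y = df['Yield']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
print(X_train.shape, X_test.shape)  # roughly ~133,000 and ~44,000 rows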


FEATURE DISTRIBUTIONS

I plotted the KDE and histogram of every feature except "precipTypeIsOther", since it has zero variance. Viewing the distributions gives me a loose first impression of each feature's usefulness. Later on, I will cross-check with coefficients and/or feature importances to decide whether or not to keep a feature.


In [5]:
feature_cols = df.columns[:-1]  # every column except the target ('Yield')
figs, axes = plt.subplots(nrows=len(feature_cols), ncols=2)
figs.set_figwidth(12)
figs.set_figheight(5 * len(feature_cols))
for i, col in enumerate(feature_cols):
    # Left panel: KDE of the feature.
    df[col].plot(kind='kde', ax=axes[i, 0], grid=True, title='KDE:' + col)
    if abs(df[col].max()) >= 1:
        # Right panel: histogram for features on larger scales.
        df[col].plot(kind='hist', ax=axes[i, 1], grid=True, title='HIST:' + col)
    else:
        # Right panel: log-x KDE for features whose values stay below 1.
        df[col].plot(kind='kde', ax=axes[i, 1], grid=True, title='KDE(log(x)):' + col, logx=True)
plt.show()
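
The zero-variance check behind dropping "precipTypeIsOther" can also be done programmatically. This is just a sketch on my part (the column name comes from the note above; the check itself is an assumption about how I'd flag constant columns):

# Flag constant (zero-variance) columns; 'precipTypeIsOther' should appear here
# if it is still present in the edited CSV (assumption, not verified above).
constant_cols = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]
print(constant_cols)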


SEABORN: Pair Plot

I like plotting a seaborn "pair plot" to show any correlation between the features and the target.


In [6]:
df.dropna(inplace=True)  # drop rows with missing values
df.drop(df.columns[:5], axis=1, inplace=True)  # drop the first five columns (Latitude through cloudCover)
sns.pairplot(df)


Out[6]:
<seaborn.axisgrid.PairGrid at 0x116e75090>
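
With ~177,000 rows the pair plot can take a while to render; one option (an assumption on my part, not something done above) is to draw it on a random subsample, which usually preserves the visible structure:

# Pair plot on a random 5,000-row subsample to keep rendering time manageable
# (sample size and random_state are arbitrary choices for illustration).
sns.pairplot(df.sample(n=5000, random_state=42))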

SEABORN: Heat Map

The seaborn pair plot is great, but I actually prefer looking at my correlation heat map. Specifically, I find it easier to evaluate the relationship between the target and each feature.


In [7]:
cols = list(df.columns)
cm = np.corrcoef(df[cols].values.T)
plt.figure(figsize=(10,10))
sns.set(font_scale=1.5)
hm = sns.heatmap(cm,cbar=True,
                 annot=True,
                 square=True,
                 fmt='.2f',
                 annot_kws={'size':5},
                 yticklabels=cols,
                 xticklabels=cols)
plt.show()
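
To read the target row of the heat map more directly, the correlations with Yield can also be pulled out and ranked; a minimal sketch, assuming 'Yield' is still a column of the reduced dataframe:

# Rank features by the absolute strength of their linear correlation with Yield.
corr_with_yield = df.corr()['Yield'].drop('Yield')
print(corr_with_yield.reindex(corr_with_yield.abs().sort_values(ascending=False).index))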