We will make intuitive visualizations that help us understand our 'Seeds' data.

First we import the necessary libraries for processing (pandas) and plotting (seaborn, matplotlib).



In [1]:

    
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

We read our data and look at the first rows to make sure the header and columns were imported correctly.



In [2]:

    
wheat = pd.read_csv('wheat.data') # Dataset available at https://archive.ics.uci.edu/ml/datasets/seeds
wheat.head()









    Out[2]:

       id   area  perimeter  compactness  length  width  asymmetry  groove  wheat_type
    0   0  15.26      14.84       0.8710   5.763  3.312      2.221   5.220        kama
    1   1  14.88      14.57       0.8811   5.554  3.333      1.018   4.956        kama
    2   2  14.29      14.09       0.9050   5.291  3.337      2.699   4.825        kama
    3   3  13.84      13.94       0.8955   5.324  3.379      2.259   4.805        kama
    4   4  16.14      14.99       0.9034   5.658  3.562      1.355   5.175        kama
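
A quick way to confirm that the header was parsed as intended is to compare the column names and dtypes against what we expect. The sketch below is self-contained: instead of reading `wheat.data` it parses the five preview rows from an in-memory string. The names `sample_csv` and `expected_columns` are illustrative, and the comma-separated layout is an assumption based on `read_csv` working with its default separator.

```python
import io
import pandas as pd

# Assumption: the first rows of wheat.data, copied from the preview of wheat.head().
sample_csv = """id,area,perimeter,compactness,length,width,asymmetry,groove,wheat_type
0,15.26,14.84,0.8710,5.763,3.312,2.221,5.220,kama
1,14.88,14.57,0.8811,5.554,3.333,1.018,4.956,kama
2,14.29,14.09,0.9050,5.291,3.337,2.699,4.825,kama
3,13.84,13.94,0.8955,5.324,3.379,2.259,4.805,kama
4,16.14,14.99,0.9034,5.658,3.562,1.355,5.175,kama
"""

expected_columns = ['id', 'area', 'perimeter', 'compactness', 'length',
                    'width', 'asymmetry', 'groove', 'wheat_type']

sample = pd.read_csv(io.StringIO(sample_csv))

# If the header row had been read as data, the column names would not match.
header_ok = list(sample.columns) == expected_columns

# Measurement columns should be numeric; a misread header or delimiter
# usually shows up as an 'object' dtype instead.
numeric_cols = sample.columns.drop(['id', 'wheat_type'])
numeric_ok = all(pd.api.types.is_numeric_dtype(sample[c]) for c in numeric_cols)

print(header_ok, numeric_ok)
print(sample.isnull().sum().sum(), "missing values")
```

The same two checks can be run on `wheat` directly after the real `read_csv` call.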

We won't need the 'id' column, so we drop it.

Let's make a histogram of a few features and look at their variance.



In [3]:

    
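# Drop the 'id' column by name (keyword form works across pandas versions)
wheat.drop(columns='id', inplace=True)

# Two pairs of features, one histogram figure per pair
s1 = wheat[['area','perimeter']]
s2 = wheat[['groove','asymmetry']]
s1.hist(alpha=0.75)
s2.hist(alpha=0.75)
plt.show()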
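
To put numbers next to the histograms, the variance of each feature can be computed directly. This is a small sketch that uses only the five preview rows (an assumption made to keep it self-contained, so the values are not those of the full dataset); `sample` and `variances` are illustrative names.

```python
import pandas as pd

# Assumption: only the five rows shown by wheat.head(), not the full dataset.
sample = pd.DataFrame({
    'area':      [15.26, 14.88, 14.29, 13.84, 16.14],
    'perimeter': [14.84, 14.57, 14.09, 13.94, 14.99],
    'groove':    [5.220, 4.956, 4.825, 4.805, 5.175],
    'asymmetry': [2.221, 1.018, 2.699, 2.259, 1.355],
})

# Sample variance (ddof=1, the pandas default) per feature, largest first.
variances = sample.var().sort_values(ascending=False)
print(variances)
```

On the real data, `wheat[['area','perimeter','groove','asymmetry']].var()` gives the corresponding figures. Keep in mind that the features are on different scales, so raw variances are not directly comparable across them.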

With a scatter plot we can take a closer look at the relationship between two features, here asymmetry and perimeter, colored by wheat type.



In [4]:

    
# FacetGrid creates its own figure; `height` replaces the older `size` argument (seaborn >= 0.9)
sns.FacetGrid(wheat, hue='wheat_type', height=5) \
            .map(plt.scatter, 'asymmetry', 'perimeter') \
            .add_legend()
plt.show()









    






Violin plot. Dense regions of the data are wider, sparse regions are thinner.



In [5]:

    
# violinplot draws on the current axes, so the figure size is set on the figure itself
plt.figure(figsize=(6, 6))
sns.violinplot(x='wheat_type', y='perimeter', data=wheat)
plt.show()

Looking at univariate distributions. kdeplot computes and plots a kernel density estimate of the underlying feature.



In [6]:

    
# FacetGrid creates its own figure; `height` replaces the older `size` argument (seaborn >= 0.9)
sns.FacetGrid(wheat, hue="wheat_type", height=6) \
   .map(sns.kdeplot, "perimeter") \
   .add_legend()
plt.show()
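
To make the idea of a kernel density estimate concrete, here is a minimal NumPy sketch of a one-dimensional Gaussian KDE. It is not seaborn's implementation: the function name `gaussian_kde_1d`, the hand-picked bandwidth of 0.3, and the use of only the five preview perimeter values are all assumptions made for illustration.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on the points in `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance for every (grid point, sample) pair.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    # Average the kernels and rescale by the bandwidth so the result is a density.
    return kernels.sum(axis=1) / (samples.size * bandwidth)

# Assumption: the five perimeter values from the preview and a hand-picked bandwidth.
perimeter = [14.84, 14.57, 14.09, 13.94, 14.99]
grid = np.linspace(12.0, 17.0, 501)
density = gaussian_kde_1d(perimeter, grid, bandwidth=0.3)

# Riemann-sum check: the area under the estimate should be close to 1.
area_under_curve = density.sum() * (grid[1] - grid[0])
print(round(area_under_curve, 3))
```

Each observation contributes one small bell curve, and the estimate is their average. A larger bandwidth gives a smoother curve, a smaller one a bumpier curve; plotting libraries usually pick the bandwidth automatically.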









    




A nice overview of the pairwise relationships between features. Area and perimeter appear to be very strongly correlated.



In [7]:

    
# pairplot creates its own figure; `height` replaces the older `size` argument (seaborn >= 0.9)
sns.pairplot(wheat, hue='wheat_type', height=2, diag_kind="kde")
plt.show()
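
The visual impression of a strong area/perimeter relationship can be checked with a Pearson correlation coefficient. The sketch below uses only the five preview rows (an assumption made to keep it self-contained), so the number it prints is indicative rather than the value for the full dataset.

```python
import pandas as pd

# Assumption: only the five rows shown by wheat.head(), not the full dataset.
sample = pd.DataFrame({
    'area':      [15.26, 14.88, 14.29, 13.84, 16.14],
    'perimeter': [14.84, 14.57, 14.09, 13.94, 14.99],
})

# Series.corr computes the Pearson correlation by default.
r = sample['area'].corr(sample['perimeter'])
print(round(r, 3))
```

On the full data, `wheat.drop(columns='wheat_type').corr()` should give the whole correlation matrix for the numeric features.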









    




Andrews curves help visualize high-dimensional, multivariate data by plotting each observation as a curve. The feature values act as the coefficients of a Fourier series that defines the curve.



In [8]:

    
from pandas.plotting import andrews_curves  # pandas.tools.plotting was removed in later pandas releases
plt.figure()
andrews_curves(wheat,'wheat_type')
plt.show()
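
To show how feature values become curve coefficients, here is a small NumPy sketch of the usual Andrews curve formula, f(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + x5 cos(2t) + ... The function name `andrews_curve` is illustrative, and the observation is the first preview row with the 'id' column dropped.

```python
import numpy as np

def andrews_curve(x, t):
    """Evaluate the Andrews curve of one observation `x` at the angles in `t`."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    # The first feature sets the constant term.
    result = np.full_like(t, x[0] / np.sqrt(2.0))
    # Remaining features alternate between sine and cosine terms,
    # moving to the next harmonic after each sine/cosine pair.
    for i, coeff in enumerate(x[1:], start=1):
        harmonic = (i + 1) // 2  # 1, 1, 2, 2, 3, 3, ...
        if i % 2 == 1:
            result += coeff * np.sin(harmonic * t)
        else:
            result += coeff * np.cos(harmonic * t)
    return result

# First preview row: area, perimeter, compactness, length, width, asymmetry, groove.
row0 = [15.26, 14.84, 0.8710, 5.763, 3.312, 2.221, 5.220]
t = np.linspace(-np.pi, np.pi, 200)
curve = andrews_curve(row0, t)
print(curve.shape)
```

Observations with similar feature values produce similar curves, which is why samples of the same class tend to form visible bundles in the plot. Note that the order of the columns matters, since earlier features control the lower-frequency terms.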

Parallel coordinates let you view observations with more than three dimensions by drawing one vertical axis per feature and connecting each observation's values with a line. This works best with a limited number of features.



In [9]:

    
from pandas.plotting import parallel_coordinates  # pandas.tools.plotting was removed in later pandas releases
plt.figure()
parallel_coordinates(wheat, 'wheat_type')
plt.show()

Radviz - "...puts each feature as a point on a 2D plane, and then simulates having each sample attached to those points through a spring weighted by the relative value for that feature" ~ Ben Hamner, Kaggle notebook



In [10]:

    
from pandas.plotting import radviz  # pandas.tools.plotting was removed in later pandas releases
plt.figure()
radviz(wheat,'wheat_type')
plt.show()
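
The spring analogy can be sketched in a few lines of NumPy: each feature gets an anchor on the unit circle, each feature is min-max scaled to [0, 1], and a sample lands at the weighted average of the anchors. The function name `radviz_positions` and the toy data are hypothetical, chosen so the result is easy to verify by hand. The sketch assumes no feature is constant and no sample has all-zero scaled values, since either case would divide by zero.

```python
import numpy as np

def radviz_positions(values):
    """Project rows of `values` (samples x features) onto the Radviz plane."""
    values = np.asarray(values, dtype=float)
    n_features = values.shape[1]
    # Min-max scale each feature so the spring weights lie in [0, 1].
    lo = values.min(axis=0)
    hi = values.max(axis=0)
    weights = (values - lo) / (hi - lo)
    # One anchor per feature, evenly spaced on the unit circle.
    angles = 2.0 * np.pi * np.arange(n_features) / n_features
    anchors = np.column_stack([np.cos(angles), np.sin(angles)])
    # Equilibrium of the springs: weighted average of the anchor positions.
    return weights @ anchors / weights.sum(axis=1, keepdims=True)

# Hypothetical toy data: three samples each pulled by a single feature,
# plus one sample pulled equally by all three.
toy = [[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       [1, 1, 1]]
positions = radviz_positions(toy)
print(positions.round(3))
```

A sample dominated by one feature sits on that feature's anchor, while a sample with balanced features is pulled toward the center. This is why well-separated classes in a Radviz plot suggest that different features dominate in different classes.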
