In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
We read the data and use head() to check that it imported cleanly, in particular that the header row was parsed as column names.
In [2]:
wheat = pd.read_csv('wheat.data') # Dataset available at https://archive.ics.uci.edu/ml/datasets/seeds
wheat.head()
Out[2]:
We won't need the 'id' column, so we drop it.
Let's plot histograms of a few features to get a feel for their distribution and spread.
In [3]:
wheat.drop(columns='id', inplace=True) # the positional axis argument is no longer accepted by recent pandas
s1 = wheat[['area','perimeter']]
s2 = wheat[['groove','asymmetry']]
s1.hist(alpha=0.75)
s2.hist(alpha=0.75)
plt.show()
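The histograms only show spread visually. To put numbers on it, a minimal sketch like the following computes the per-feature variance. The rows here are made-up illustrative values, not taken from wheat.data; on the real data you would call var() on the wheat DataFrame instead.

```python
import pandas as pd

# Made-up illustrative rows (NOT from wheat.data); swap in `wheat` for real use.
sample = pd.DataFrame({
    'area': [15.0, 14.0, 13.0, 12.0],
    'perimeter': [14.8, 14.4, 14.0, 13.6],
    'groove': [5.2, 5.0, 4.9, 4.7],
    'asymmetry': [2.0, 1.0, 3.0, 5.0],
})

# Sample variance (ddof=1, the pandas default) per column, largest first.
variances = sample.var().sort_values(ascending=False)
print(variances)
```

Keep in mind that variance depends on the unit of each feature, so the numbers are only directly comparable for features measured on similar scales.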
With a scatter plot we can look more closely at the relationship between two features, colored by wheat type.
In [4]:
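# FacetGrid creates its own figure, so no plt.figure() call is needed.
# The 'size' parameter was renamed to 'height' in seaborn 0.9.
sns.FacetGrid(wheat, hue='wheat_type', height=5) \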
.map(plt.scatter, 'asymmetry', 'perimeter') \
.add_legend()
plt.show()
A violin plot shows the distribution of a feature for each class. The violin is wider where the data is dense and thinner where it is sparse.
In [5]:
plt.figure(figsize=(6, 6)) # violinplot has no 'size' argument; set the figure size here instead
sns.violinplot(x='wheat_type', y='perimeter', data=wheat)
plt.show()
Next we look at the distribution of a single feature. kdeplot computes and plots a kernel density estimate of the feature, here with one curve per wheat type.
In [6]:
# FacetGrid creates its own figure; 'size' was renamed to 'height' in seaborn 0.9.
sns.FacetGrid(wheat, hue="wheat_type", height=6) \
.map(sns.kdeplot, "perimeter") \
.add_legend()
plt.show()
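To make the idea of a kernel density estimate concrete, here is a minimal sketch of a one-dimensional Gaussian KDE written by hand with NumPy. The function name gaussian_kde_1d, the sample values and the bandwidth of 0.3 are all assumptions chosen for illustration; seaborn picks a bandwidth automatically and its result will differ.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Hypothetical helper: average of Gaussian bumps centred on each sample."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardised distance from every grid point to every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    # Sum the bumps and normalise so the curve integrates to one.
    return kernels.sum(axis=1) / (len(samples) * bandwidth)

# Made-up perimeter values (NOT from wheat.data).
perimeters = [13.6, 14.0, 14.1, 14.4, 14.8]
grid = np.linspace(11.0, 17.0, 601)
density = gaussian_kde_1d(perimeters, grid, bandwidth=0.3)

# Simple Riemann sum as a sanity check on the normalisation.
area_under_curve = density.sum() * (grid[1] - grid[0])
print(area_under_curve)
```

A smaller bandwidth gives a bumpier curve that follows individual points, while a larger one smooths them together.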
A pair plot gives a nice overview of the relationships between every pair of features. Area and perimeter appear to be strongly correlated.
In [7]:
# pairplot creates its own figure; 'size' was renamed to 'height' in seaborn 0.9.
sns.pairplot(wheat, hue='wheat_type', height=2, diag_kind="kde")
plt.show()
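The visual impression of a strong correlation can be checked numerically with a correlation matrix. The sketch below uses made-up rows rather than wheat.data, so the printed numbers say nothing about the real dataset; on the real data you would call corr() on the numeric columns of wheat.

```python
import pandas as pd

# Made-up illustrative rows (NOT from wheat.data).
sample = pd.DataFrame({
    'area': [12.0, 13.0, 14.0, 15.0],
    'perimeter': [13.6, 14.1, 14.4, 14.8],
    'asymmetry': [5.0, 1.0, 3.0, 2.0],
})

# Pearson correlation between every pair of columns.
corr = sample.corr()
print(corr.round(3))
```

Values close to 1 or -1 indicate a strong linear relationship, values near 0 a weak one.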
Andrews curves help visualize high-dimensional, multivariate data by plotting each observation as a curve. The feature values act as the coefficients of a Fourier series.
In [8]:
from pandas.plotting import andrews_curves # pandas.tools.plotting was moved to pandas.plotting
plt.figure()
andrews_curves(wheat,'wheat_type')
plt.show()
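To see how the feature values become coefficients, here is a minimal sketch of the usual Andrews curve formula, f(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + x5 cos(2t) + ..., evaluated for one observation. The helper name andrews_curve and the example row are assumptions for illustration and not the pandas implementation itself.

```python
import numpy as np

def andrews_curve(row, t):
    """Hypothetical helper: evaluate the Andrews curve of one observation at t."""
    row = np.asarray(row, dtype=float)
    t = np.asarray(t, dtype=float)
    # The first feature contributes a constant offset.
    result = np.full_like(t, row[0] / np.sqrt(2.0))
    # Remaining features alternate between sin and cos of rising frequency.
    for i, coeff in enumerate(row[1:]):
        harmonic = i // 2 + 1
        if i % 2 == 0:
            result += coeff * np.sin(harmonic * t)
        else:
            result += coeff * np.cos(harmonic * t)
    return result

t = np.linspace(-np.pi, np.pi, 200)
curve = andrews_curve([1.0, 2.0, 3.0], t)  # made-up feature values
print(curve[:5])
```

Observations with similar feature values produce similar curves, which is why classes tend to show up as bundles in the plot.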
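Parallel coordinates let you view observations with more than three dimensions by adding one parallel vertical axis per feature. Each observation is drawn as a line that crosses every axis at its value for that feature. The plot works best with a limited number of features.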
In [9]:
from pandas.plotting import parallel_coordinates # pandas.tools.plotting was moved to pandas.plotting
plt.figure()
parallel_coordinates(wheat, 'wheat_type')
plt.show()
Radviz: "...puts each feature as a point on a 2D plane, and then simulates having each sample attached to those points through a spring weighted by the relative value for that feature" (Ben Hamner, Kaggle notebook)
In [10]:
from pandas.plotting import radviz # pandas.tools.plotting was moved to pandas.plotting
plt.figure()
radviz(wheat,'wheat_type')
plt.show()
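The spring analogy can be written down directly. In the minimal sketch below, each feature gets an anchor on the unit circle, every column is min-max scaled to [0, 1], and a sample is placed at the weighted average of the anchors. The helper name radviz_positions and the data are assumptions for illustration; the sketch also assumes that no column is constant and no row is all minimum values, since either would divide by zero.

```python
import numpy as np

def radviz_positions(data):
    """Hypothetical helper: 2D Radviz coordinates for each row of `data`."""
    data = np.asarray(data, dtype=float)
    # Min-max scale each feature so every spring weight lies in [0, 1].
    lo = data.min(axis=0)
    hi = data.max(axis=0)
    scaled = (data - lo) / (hi - lo)
    # One anchor per feature, evenly spaced on the unit circle.
    n_features = data.shape[1]
    angles = 2.0 * np.pi * np.arange(n_features) / n_features
    anchors = np.column_stack([np.cos(angles), np.sin(angles)])
    # Each sample rests at the weighted average of the anchors.
    weights = scaled.sum(axis=1, keepdims=True)
    return scaled @ anchors / weights, anchors

# Made-up rows with three features (NOT from wheat.data).
data = [[10, 1, 1],
        [0, 5, 1],
        [0, 1, 9],
        [10, 5, 9]]
positions, anchors = radviz_positions(data)
print(positions.round(3))
```

A sample that is dominated by one feature is pulled onto that feature's anchor, while a sample with equal relative values on every feature ends up in the centre.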