Data Visualization

Before jumping straight into the analysis of the data, it's a good idea to get a feel of your data. This is best done using data visualization toegether with some descriptive statistics. By getting a feel for the data, you can ask better, more informed questions. It also allows a clear way of presenting any discoveries to others, which is the whole point.


In [8]:
%matplotlib notebook
import matplotlib.pyplot as plt
import matplotlib
import pandas as pd
import numpy as np
matplotlib.style.use('ggplot')

basic plotting


In [10]:
#Series
ts = pd.Series(np.random.randn(1000), index=pd.date_range('11/1/2016', periods = 1000))
#print(ts)#cols: date and randn
ts = ts.cumsum()#basically a random walk
ts.plot()


Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f73b97d8390>

If we have a DataFrame, we can use plot to plot all columns with labels


In [21]:
#DataFrame
df = pd.DataFrame(np.random.randn(1000,4), index=ts.index, columns=list('ABCD'))
#DataFrame takes a np array, index are row names, columns are column names
df = df.cumsum()#make it like a random walk
df.plot();#df.figure()



In [22]:
df.head(5)#just a sanity check


Out[22]:
A B C D
2016-11-01 0.004646 0.260336 0.382391 0.412165
2016-11-02 -0.353224 -1.469392 -0.700460 -0.817369
2016-11-03 -1.007779 -0.798864 -2.011398 1.107809
2016-11-04 0.771349 -2.067733 -0.957745 1.129245
2016-11-05 -0.688185 -2.224250 -0.391073 2.515247

plot column vs another


In [27]:
df3 = pd.DataFrame(np.random.randn(1000,2), columns=['B','C']).cumsum()
df3['A'] = pd.Series(list(range(len(df))))#a 1000 length pd.Series
df3.plot(x='A', y='B')


Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f73b5256668>

In [28]:
df3.head(5)


Out[28]:
B C A
0 -1.252196 -0.218002 0
1 -1.812590 0.512022 1
2 -0.607503 0.162963 2
3 -1.704258 0.558315 3
4 -1.222234 0.040880 4

histograms

Drawn by using DataFrame.plot.hist() and Series.plot.hist()


In [34]:
df4 = pd.DataFrame({
        'a':np.random.randn(1000)+2,
        'b':np.random.randn(1000),
        'c':np.random.randn(1000)-1
    }, columns=['a','b','c'])
#plt.figure();#I don't think I need this with inline plot
df4.plot.hist(alpha=.5, bins = 100)


Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f73b5412e10>

scatter plot

Use DataFrame.plot.scatter() which requires numeric columns for x and y


In [38]:
df = pd.DataFrame(np.random.randn(100,4), columns=list('abcd'))
df.plot.scatter(x='a', y='b')


Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f73b4c36ef0>

In [48]:
#or plot muliple columns on a single axis, repeat plot method specifying target ax
ax = df.plot.scatter(x='a',y='b', color='DarkOrange',label='Group 1')
df.plot.scatter(x='c',y='d', color='forestgreen', label='Group 2', ax=ax)


Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f73b465f2e8>

In [57]:
#c for color for each point, s for size
df.plot.scatter(x='a',y='b',c='c',s=(df['d']+10)*10)


Out[57]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f73b4171d30>

In [79]:
#hex plots are useful when plot is too dense to distinguish between points
df = pd.DataFrame(np.random.randn(1000,2), columns=['a','b'])
df['b'] = df['b'] + np.arange(1000)#arange returns evenly spaced values 
df.plot.hexbin(x='a',y='b',gridsize=30)


Out[79]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f73afad42e8>

In [64]:
df.head(5)


Out[64]:
a b
0 0.205532 -0.493490
1 3.008765 1.242169
2 0.246308 2.604009
3 0.658248 1.972886
4 0.334364 5.803460

pie plot


In [81]:
#pie plot
series = pd.Series(3*np.random.rand(4), index=list('abcd'), name='series')#uniform dist rather than normal
plt.figure()#without this, will overwrite previous plot 
series.plot.pie(figsize=(6,6))


Out[81]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f73b404a438>

scatter matrix plot


In [97]:
from pandas.tools.plotting import scatter_matrix
df = pd.DataFrame(np.random.randn(1000,4), columns=list('abcd'))
#this may take a noticeable amount of time
scatter_matrix(df, alpha=.2, figsize=(6,6),diagonal='kde')#kernel density estiamtion


Out[97]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f73a30637b8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f73a3032780>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f73a2ff9ba8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f73a2fb7908>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f73a2f7fc18>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f73a2f3b9b0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f73a2f019b0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f73a2f11b70>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f73a2e914a8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f73a2dd99b0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f73a2e13dd8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f73a2d62128>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f73a2d18f60>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f73a2ce0f60>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f73a2cf2080>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f73a2c6ea58>]], dtype=object)

Andrews plot

An Andrews curve allows structure visualization for high dimensional data. Each data point $x={x_1, x_2, \cdots, x_d}$ defines a finite fourier series $$f_x(t) = \frac{x_1}{\sqrt{2}} + x_2 sin(t) + x_3 cos(t) + \cdots$$ which is then plotted for $-\pi < t < \pi$. Each data point can be viewed as a curve between $-\pi$ and $\pi$. We can think of this as a projection of the datapoint onto the vector $$\left(\frac{1}{\sqrt{2}}, sin(t), cos(t), \cdots\right)$$


In [109]:
from pandas.tools.plotting import andrews_curves
#from sklearn import datasets
#iris = datasets.load_iris()
data = pd.read_csv('/home/kevin/Documents/data/iris.csv')
#data
plt.figure()
andrews_curves(data, 'Name')


Out[109]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f73a1ae3b00>

In [136]:
#try on another dataset
data2 = pd.read_csv('/home/kevin/Documents/data/breastcancer/data.csv')
plt.figure()
andrews_curves(data2[[1,2,3,4]],'diagnosis')#remove id
#running directly shows nothing, we probably normalize the data first convert strings to float
#from sklearn import preprocessing


Out[136]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f739d8125c0>

parallel coordinates

Another method for plotting multivariate data of moderate sizes. Allows us to see clusterss in data. Each vertical line represents an attribute. 1 set of connected line segments represents 1 data point.


In [137]:
from pandas.tools.plotting import parallel_coordinates
data = pd.read_csv('/home/kevin/Documents/data/iris.csv')
plt.figure()#need to prevent overwriting of previous plot
parallel_coordinates(data,'Name')
#it seems like versicolor and virginica are 'more similar' than setosa in their petals


Out[137]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f739d048da0>

more advanced plotting (to be continued)