Before jumping straight into the analysis, it's a good idea to get a feel for your data. This is best done with data visualization together with some descriptive statistics (see the describe() sketch after the first head() check below). Getting a feel for the data lets you ask more informative questions, and it gives you a clear way to present discoveries to others, which is sometimes the whole point of doing data analysis.
In [44]:
%matplotlib notebook
import matplotlib.pyplot as plt
import matplotlib
import pandas as pd
import numpy as np
matplotlib.style.use('ggplot')
In [45]:
#Series
ts = pd.Series(np.random.randn(1000), index=pd.date_range('11/1/2016', periods = 1000))
#print(ts)#cols: date and randn
ts = ts.cumsum()#basically a random walk
ts.plot()
Out[45]:
If we have a DataFrame, plot() plots every column against the index, each with its own label:
In [46]:
#DataFrame
df = pd.DataFrame(np.random.randn(1000,4), index=ts.index, columns=list('ABCD'))
#DataFrame takes a np array, index are row names, columns are column names
df = df.cumsum()#make it like a random walk
df.plot()#one line per column, labelled by column name
Out[46]:
It's also useful to take a look at the data directly:
In [47]:
df.head(5)#just a sanity check
Out[47]:
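The introduction mentioned descriptive statistics; describe() is the quickest way to get them for the same df (a minimal sketch, not executed here):
In [ ]:
df.describe()#count, mean, std, min, quartiles and max for each of the columns A-D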
To plot one column against another:
In [48]:
df3 = pd.DataFrame(np.random.randn(1000,2), columns=['B','C']).cumsum()
df3['A'] = pd.Series(list(range(len(df))))#a Series of length 1000 to serve as the x-axis
df3.plot(x='A', y='B')
Out[48]:
In [49]:
df3.head(5)
Out[49]:
In [50]:
df4 = pd.DataFrame({
'a':np.random.randn(1000)+2,
'b':np.random.randn(1000),
'c':np.random.randn(1000)-1
}, columns=['a','b','c'])
#plt.figure() is not needed here: df4.plot.hist creates its own figure
df4.plot.hist(alpha=.5, bins=100)
Out[50]:
Q-Q plots (quantile-quantile plots) help us visually determine whether some data follows a normal distribution. If the points roughly follow a straight line, the data is approximately normal and we can apply common techniques like linear regression. The x-axis shows the theoretical quantiles of a normal distribution, while the y-axis shows the actual data in units of standard deviations from the mean. Notice that the second plot has 3 outliers which I added by hand.
In [51]:
import numpy as np
import statsmodels.api as sm
from matplotlib import pyplot as plt
data = np.random.normal(0,1,50)
sm.qqplot(data, line = '45')
plt.show()
In [52]:
sm.qqplot(np.append(data, [4,5,6]), line='45')
plt.show()
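To see what qqplot is doing under the hood, here is a minimal sketch that builds the same picture by hand: sort the standardized data and plot it against normal quantiles at common plotting positions (statsmodels uses a slightly different default for these positions):
In [ ]:
from scipy import stats
#sample quantiles: the standardized data, sorted
z = np.sort((data - data.mean()) / data.std())
#theoretical quantiles of a standard normal at plotting positions (i - 0.5)/n
theoretical = stats.norm.ppf((np.arange(1, len(z) + 1) - 0.5) / len(z))
plt.figure()
plt.scatter(theoretical, z, s=10)
plt.plot(theoretical, theoretical, 'r--')#45-degree reference line
plt.xlabel('theoretical quantiles')
plt.ylabel('sample quantiles')
plt.show()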
In [53]:
df = pd.DataFrame(np.random.randn(100,4), columns=list('abcd'))
df.plot.scatter(x='a', y='b')
Out[53]:
In [54]:
#to plot multiple columns on a single axes, repeat the plot call and pass the target ax
ax = df.plot.scatter(x='a',y='b', color='DarkOrange',label='Group 1')
df.plot.scatter(x='c',y='d', color='forestgreen', label='Group 2', ax=ax)
Out[54]:
In [55]:
#c for color for each point, s for size
df.plot.scatter(x='a',y='b',c='c',s=(df['d']+10)*10)
Out[55]:
In [56]:
#hexbin plots are useful when a scatter plot is too dense to distinguish individual points
df = pd.DataFrame(np.random.randn(1000,2), columns=['a','b'])
df['b'] = df['b'] + np.arange(1000)#arange returns evenly spaced values
df.plot.hexbin(x='a',y='b',gridsize=30)
Out[56]:
In [57]:
from mpl_toolkits.mplot3d import Axes3D
# Create the figure
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
#generate data
n = 250
# Create a lambda function to generate the n random values in the given range
#np.random.rand returns [0,1)
f = lambda minval, maxval, n: minval + (maxval - minval) * np.random.rand(n)
g = lambda mean, sd, n: np.random.normal(mean, sd, n)
# Generate the values: uniformly distributed cluster
x_vals = f(15, 30, n)
y_vals = f(-10, 20, n)
z_vals = f(-32, -5, n)
#normally distributed cluster
x_vals_g = g(0,10,n)
y_vals_g = g(0,10,n)
z_vals_g = g(0,10,n)
# Plot the clusters in 3D
ax.scatter(x_vals, y_vals, z_vals, c='#e34234', marker='o')
ax.scatter(x_vals_g, y_vals_g, z_vals_g, c = '#21B6A8', marker = 'v')
ax.set_xlabel('X axis')
ax.set_ylabel('Y axis')
ax.set_zlabel('Z axis')
Out[57]:
In [58]:
df.head(5)
Out[58]:
In [59]:
#pie plot
series = pd.Series(3*np.random.rand(4), index=list('abcd'), name='series')#uniform dist rather than normal
plt.figure()#without this, will overwrite previous plot
series.plot.pie(figsize=(6,6))
#-------------------------------#
#labels and values dictionary (counterclockwise)
"""
import numpy as np
import matplotlib.pyplot as plt
# Labels and corresponding values in counter clockwise direction
data = {'Apple': 26,
        'Mango': 17,
        'Pineapple': 21,
        'Banana': 29,
        'Strawberry': 11}
# List of corresponding colors
colors = ['orange', 'lightgreen', 'lightblue', 'gold', 'cyan']
# Needed if we want to highlight a section
explode = (0, 0.3, 0, 0, 0)
# Plot the pie chart
plt.pie(list(data.values()), explode=explode, labels=data.keys(),
        colors=colors, autopct='%1.1f%%', shadow=False, startangle=90)
# Aspect ratio of the pie chart, 'equal' indicates that want it to be a circle
plt.axis('equal')
plt.show()"""
Out[59]:
In [60]:
from pandas.plotting import scatter_matrix
df = pd.DataFrame(np.random.randn(1000,4), columns=list('abcd'))
#this may take a noticeable amount of time
scatter_matrix(df, alpha=.2, figsize=(6,6), diagonal='kde')#kernel density estimates on the diagonal
Out[60]:
An Andrews curve allows structure visualization for high-dimensional data. Each data point $x = (x_1, x_2, \cdots, x_d)$ defines a finite Fourier series $$f_x(t) = \frac{x_1}{\sqrt{2}} + x_2 \sin(t) + x_3 \cos(t) + x_4 \sin(2t) + x_5 \cos(2t) + \cdots$$ which is plotted for $-\pi < t < \pi$, so each data point becomes a curve on that interval. For each $t$, we can think of $f_x(t)$ as the projection of the data point onto the vector $$\left(\frac{1}{\sqrt{2}}, \sin(t), \cos(t), \sin(2t), \cos(2t), \cdots\right)$$
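To make the formula concrete, here is a minimal sketch that evaluates $f_x(t)$ by hand for a single 4-dimensional point (the numbers are just an example; with $d=4$ the series stops at the $x_4 \sin(2t)$ term):
In [ ]:
#evaluate the Andrews series by hand for one 4-dimensional point
x = np.array([5.1, 3.5, 1.4, 0.2])#an example 4-d measurement
t = np.linspace(-np.pi, np.pi, 200)
f = x[0]/np.sqrt(2) + x[1]*np.sin(t) + x[2]*np.cos(t) + x[3]*np.sin(2*t)
plt.figure()
plt.plot(t, f)
plt.xlabel('t')
plt.ylabel('f_x(t)')
plt.show()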
In [61]:
from pandas.plotting import andrews_curves
#from sklearn import datasets
#iris = datasets.load_iris()
data = pd.read_csv('/home/kevin/Documents/data/iris.csv')
#data
plt.figure()
andrews_curves(data, 'Name')
Out[61]:
In [62]:
#try on another dataset
data2 = pd.read_csv('/home/kevin/Documents/data/breastcancer/data.csv')
plt.figure()
andrews_curves(data2.iloc[:, 1:5], 'diagnosis')#drop the id column; keep diagnosis plus the first three features
#the raw plot shows very little; we should probably normalize the features first (see the sketch below)
#from sklearn import preprocessing
Out[62]:
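One way to fix the nearly blank plot is to scale each numeric feature to [0, 1] before computing the curves, so no single large-valued column dominates the series. A minimal sketch (assuming the file has an 'id' column and a 'diagnosis' column, as the comments above suggest):
In [ ]:
#min-max scale the numeric columns so every feature contributes on the same scale
features = data2.select_dtypes(include=[np.number]).drop('id', axis=1, errors='ignore')
scaled = (features - features.min()) / (features.max() - features.min())
scaled['diagnosis'] = data2['diagnosis']
plt.figure()
andrews_curves(scaled, 'diagnosis')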
In [63]:
from pandas.plotting import parallel_coordinates
data = pd.read_csv('/home/kevin/Documents/data/iris.csv')
plt.figure()#need to prevent overwriting of previous plot
parallel_coordinates(data,'Name')
#versicolor and virginica look more similar to each other than to setosa in their petal measurements
Out[63]: