Table of Contents
1. Let's load the dataset
2. Let's plot the dataset
2.1 scatter_matrix
2.2 Bar plot
2.3 Histogram
2.4 Box plot
2.5 Area plot
2.6 Scatter plot
2.7 Hexbin plot
2.8 Pie plot
2.9 Missing data on plots
2.10 Density/KDE plots
2.11 Andrews Curves
2.12 Parallel Coordinates
2.13 Lag plot
2.14 Autocorrelation plot
2.15 Bootstrap plot
2.16 RadViz plot
3. Data holders
3.1 Series
3.1.1 Series from ndarray
3.1.2 Series from dict
3.1.3 Series from scalar
3.2 DataFrame
3.2.1 DataFrame from dict of Series or dicts
3.2.2 DataFrame from dict of ndarrays
3.2.3 DataFrame from structured/record arrays
3.2.4 DataFrame from lists of dicts
3.2.5 DataFrame from dict of tuples
3.2.6 DataFrame from Series
4. Dealing with missing data
4.1 Automatic Formatting of date strings
4.2 Handling of NaNs
4.2.1 Find NaN values
4.2.1.1 NaN and None
4.2.2 Fill NaN with fixed value
4.2.3 Fill NaN with mean or median
4.2.4 Drop NaN rows
4.2.5 Calculations with NaN
4.3 Reading Bad Input files
Let's focus on pandas DataFrame munging.
The modules that we'll need:
In [4]:
%matplotlib inline
In [5]:
import numpy as np
In [6]:
from sklearn import tree
In [7]:
import pandas as pd
We'll load one of the example datasets from sklearn!
In [8]:
from sklearn import datasets
In [9]:
iris = datasets.load_iris()
This is a sklearn dataset (a Bunch object) that includes:
.DESCR : general description
.data : all features
.feature_names : names of the features
.target : target values expressed as numbered classes
.target_names : names of the target classes
.shape : can be applied to both .data and .target and gives the (rows, columns) tuple
In [7]:
iris.DESCR;
iris.data;
iris.feature_names
Out[7]:
In [8]:
iris.target;
iris.target_names
Out[8]:
In [9]:
iris.data.shape
Out[9]:
In [10]:
type(iris.data)
Out[10]:
Let's use pandas to convert the ndarray into a DataFrame, keeping the column names!
In [11]:
dataframe = pd.DataFrame(iris.data, columns=iris.feature_names)
In [12]:
dataframe.head()
Out[12]:
In [13]:
target_df = pd.DataFrame(iris.target, columns=['Species'])
In [14]:
target_df.tail(2)
Out[14]:
Using pandas' scatter_matrix function. This is similar to R's pairs and makes a quick scatter-plot matrix of the quantitative features of the dataframe.
First we'll make a list named colors as a book-keeper. Then, to visualise the different target categories (i.e. the Species), we'll use the numerical target encoding (0, 1, 2) to give each row of the dataframe a colour based on its target value.
In [15]:
colors = list()
In [16]:
palette = { 0: "red", 1: "green", 2:"blue"}
In [17]:
for c in np.nditer(iris.target):
colors.append(palette[int(c)]) # c is 0,1 or 2 and we append red, green blue
In [18]:
scatterplot = pd.scatter_matrix(dataframe,  # in newer pandas: pd.plotting.scatter_matrix
alpha = 0.3,
figsize = (10,10),
diagonal = 'hist', #'kde'
color = colors,
marker = 'o',
grid = True)
In [19]:
dataframe.plot(kind='bar', stacked=True, figsize=(10,8)) #barh
Out[19]:
In [20]:
dataframe['sepal length (cm)'].plot(kind='hist', color='red', alpha=0.3, figsize=(10,8))
Out[20]:
In [21]:
dataframe.plot(kind='hist', alpha=0.3, orientation='horizontal',
cumulative=True, bins=15, figsize=(10,8))
Out[21]:
In [22]:
dataframe.diff().hist(color='k', alpha=0.5, bins=50, figsize=(10,8))
# diff() gives the difference of each
# row from the previous row
Out[22]:
Let's recall here that a box plot is a convenient way of depicting groups of numerical data and their quartiles.
The first quartile is the 25th percentile (splits off the lowest 25% of the data from the highest 75%), the second quartile (median) is the 50th percentile (cuts the dataset in half) and the third quartile is the upper quartile or the 75th percentile (splits off the highest 25% from the lowest 75%). The IQR (interquartile range) is the difference between the upper and lower quartiles: IQR = Q3 - Q1.
The line inside the box is the median (Q2, 50%). The box spans the IQR: the part from the median upwards is the Q2-Q3 region, while the part below it is the Q1-Q2 region.
_____ -> maximum
|
|
--- -> third quartile
| |
|___| -> median (Q2)
| |
| |
--- -> first quartile
|
|
_____ -> minimum
Outliers: points more than 3 x IQR beyond the box
Suspected outliers: points more than 1.5 x IQR beyond the box
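As a minimal sketch of these quantities (illustrative only, using the iris dataframe defined above):
col = dataframe['sepal length (cm)']
q1, q2, q3 = col.quantile([0.25, 0.50, 0.75])   # quartiles
iqr = q3 - q1                                   # interquartile range
# conventional fences: suspected outliers beyond 1.5*IQR, outliers beyond 3*IQR
suspected = (col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)
outliers = (col < q1 - 3.0 * iqr) | (col > q3 + 3.0 * iqr)
print(q1, q2, q3, iqr, suspected.sum(), outliers.sum())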
In [23]:
dataframe.plot(kind='box', vert=False, by='X', figsize=(10,8))
Out[23]:
In [24]:
dataframe.plot(kind='area', alpha=0.6, figsize=(10,8))
Out[24]:
In [25]:
dataframe.plot(kind='scatter', x='sepal length (cm)', y='sepal width (cm)', figsize=(10,8))
Out[25]:
In [26]:
ax = dataframe.plot(kind='scatter', x='sepal length (cm)', y='sepal width (cm)',
color='DarkBlue', label='Group 1');
dataframe.plot(kind='scatter', x='petal length (cm)', y='petal width (cm)',
color='DarkGreen', label='Group 2' , figsize=(10,8),
ax=ax); # ax = ax to overlay
# if not set 2 plots made
Out[26]:
In [27]:
dataframe.plot(kind='scatter', x='petal length (cm)', y='petal width (cm)',
c='sepal width (cm)', s=50, figsize=(10,8))
Out[27]:
In [28]:
dataframe.plot(kind='hexbin', x='petal length (cm)', y='petal width (cm)',
gridsize=25, figsize=(10,8))
Out[28]:
In [29]:
series = pd.Series(3 * np.random.rand(4), index=['a', 'b', 'c', 'd'], name='series')
series.plot(kind='pie', figsize=(10,10))
Out[29]:
Depending on the plot type, pandas handles NAs by default as follows:
Plot type | Handling |
---|---|
Line | Leave gaps at NA |
Line stacked | Fill 0's |
Bar | Fill 0's |
Scatter | Drop NA |
Histogram | Drop NA |
Box | Drop NA |
Area | Fill 0's |
KDE (density) | Fill 0's |
Hexbin | Drop NA |
Pie | Fill 0 |
If this is not the desired behaviour, use fillna() or dropna() before plotting.
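For example, a minimal sketch (purely illustrative here, since the iris dataframe has no missing values):
dataframe.dropna().plot(kind='scatter', x='sepal length (cm)', y='sepal width (cm)')  # drop rows containing NaN first
dataframe.fillna(0).plot(kind='area', alpha=0.6)                                      # or replace NaN with 0 first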
In [30]:
dataframe.iloc[:,0:2].plot(kind='kde', figsize=(10,8))
# take all rows of the first two columns of the dataframe
Out[30]:
They are used to plot multivariate data as a large number of curves, created by using the attributes of the samples as coefficients of a Fourier series. By colouring these curves per class it is possible to visualise data clustering.
It is a good way to visualize structure in high-dimensional data. Each data point $x={x_1 , x_2, ..., x_d}$ defines a finite Fourier series:
$f_{x}(t) = \frac{x_1}{\sqrt{2}} + x_2 sin(t) + x_3 cos(t) + x_4 sin(2t) + x_5 cos(2t)+...$
This function is plotted for $-\pi < t < \pi$. Therefore, each data point may be viewed as a line between $-\pi$ and $\pi$. The formula can be thought of as the projection of the data point onto the vector
$(\frac{1}{\sqrt{2}}, sin(t), cos(t), sin(2t), cos(2t),...)$
Each dataframe row gives one curve, coloured by its category. The horizontal axis is $-\pi < t < \pi$ and the vertical axis is $f_x(t)$.
Curves belonging to samples of the same class will usually be closer together and form larger structures!
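To make the formula concrete, here is a minimal sketch (not how pandas implements it) that evaluates $f_x(t)$ for the first sample of the dataframe and plots the single resulting curve:
t = np.linspace(-np.pi, np.pi, 200)
x = dataframe.iloc[0, 0:4].values                 # the 4 features of the first sample
f = x[0]/np.sqrt(2) + x[1]*np.sin(t) + x[2]*np.cos(t) + x[3]*np.sin(2*t)
pd.Series(f, index=t).plot()                      # one Andrews curve for this sample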
In [31]:
from pandas.tools.plotting import andrews_curves  # in newer pandas: from pandas.plotting import andrews_curves
In [32]:
new_df = dataframe;  # N.B. this is a reference, not a copy: the next line also adds the column to `dataframe` (use dataframe.copy() to avoid this)
new_df['Species']=target_df;
new_df.head(5)
Out[32]:
In [33]:
andrews_curves(new_df, "Species")
Out[33]:
Parallel coordinates is a way of visualising high-dimensional geometry and analysing multivariate data.
To show a set of points in an n-dimensional space, n parallel, equally spaced (and typically vertical) axes are drawn. A point in the n-dimensional space is represented as a polyline with vertices on the parallel axes; the position of the vertex on the i-th axis corresponds to the i-th coordinate of the point.
This type of visualisation is closely related to time-series visualisation, with the difference that it is applied to data that have no time dependence.
When the plot is used for statistical inference, keep in mind that points which tend to cluster in the data will appear as lines that stay close together.
In [34]:
from pandas.tools.plotting import parallel_coordinates
parallel_coordinates(new_df, "Species")
Out[34]:
Lag plots are used to check for randomness in data: random data should not exhibit any structure in the lag plot, while visible structure implies that the data are not random.
A lag is a fixed time displacement. For example, given a dataset $Y_1,~Y_2,~...,~Y_n$, $Y_2$ and $Y_7$ have lag $5$ since $7 - 2 = 5$. Lag plots can be generated for any arbitrary lag, although the most commonly used lag is $1$.
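Essentially, a lag-1 plot scatters each value against the previous one; a minimal sketch of the same idea using shift() (assuming the sepal-length column as the series):
s = dataframe['sepal length (cm)']
pairs = pd.concat([s.shift(1), s], axis=1)        # y(t-1) next to y(t): consecutive pairs
pairs.columns = ['y(t-1)', 'y(t)']
pairs.dropna().plot(kind='scatter', x='y(t-1)', y='y(t)')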
In [35]:
from pandas.tools.plotting import lag_plot
lag_plot(new_df['sepal length (cm)'])  # lag_plot is meant for a single Series (column)
Out[35]:
This is used to check randomness in a series. It computes the autocorrelations of the data values at varying time lags.
If the series is random, the autocorrelations should be near zero for any and all time-lag separations.
If the series is non-random, then one or more of the autocorrelations will be significantly non-zero.
The horizontal lines correspond to the 95% and 99% confidence bands (dashed = 99%)
Vertical axis : autocorrelation coefficient: $R_{h}=C_{h}/ C_{0}$,
where $C_{h}$ is the autocovariance :
$$C_{h} = \frac{1}{N} \sum_{t=1}^{N-h}(Y_{t}-\bar{Y})(Y_{t+h}-\bar{Y})$$
and $$C_{0} = \frac{\sum_{t=1}^{N}(Y_{t}-\bar{Y})^{2}}{N}$$
$-1 \leq R_{h}\leq 1$
N.B. Sometimes the autocovariance function is given as
$$C_{h} = \frac{1}{N-h} \sum_{t=1}^{N-h}(Y_{t}-\bar{Y})(Y_{t+h}-\bar{Y})$$
which has less bias (the $1/(N-h)$ factor instead of $1/N$).
Horizontal Axis: time lag $h$, $h=(1,2,...)$
The resulting plot also contains the horizontal reference lines (confidence bands) described above.
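A minimal sketch of computing $R_h$ by hand for a single lag, following the formulas above (for comparison, Series.autocorr(lag=h) gives a very similar Pearson-based value):
def acf(y, h):
    y = np.asarray(y, dtype=float)
    ybar = y.mean()
    c0 = ((y - ybar) ** 2).sum() / len(y)                   # C_0
    ch = ((y[:-h] - ybar) * (y[h:] - ybar)).sum() / len(y)  # C_h
    return ch / c0                                          # R_h
For instance, acf(data, 1) can be compared with data.autocorr(1) for the series defined in the next cell.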
In [36]:
from pandas.tools.plotting import autocorrelation_plot
data = pd.Series(0.7 * np.random.rand(1000) +
0.3 * np.sin(np.linspace(-9 * np.pi, 9 * np.pi, num=1000)))
autocorrelation_plot(data)
Out[36]:
Bootstrap plots are made to assess the uncertainty of a statistic (mean, median, mid-range, etc.).
To make a bootstrap uncertainty estimate for one statistic given a dataset, a subset of size less than or equal to the size of the dataset is generated from the data and the statistic is calculated. This subset is generated with replacement, so each data point can be resampled multiple times or not at all. The process is repeated, usually 500 to 1000 times. The computed values of the statistic form an estimate of its sampling distribution.
For example, suppose that in a sample of 50 values you want to bootstrap the median. You generate a subset of the sample with 50 elements and calculate the median. Repeat this 500 times, so that you have 500 values for the median. To calculate the 90% confidence interval for the median, the medians are sorted into ascending order; the value of the 25th median is the lower confidence limit, while the value of the 475th median is the upper confidence limit (i.e. the 5th and 95th percentiles of the bootstrap distribution).
The plots generated are the series and the histograms for mean, median and mid-range. For uniformly distributed values, mid-range has the smallest variance.
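A minimal sketch of the resampling idea itself (not of what bootstrap_plot draws), for the median of a hypothetical sample of 50 values:
sample = np.random.rand(50)
medians = np.sort([np.median(np.random.choice(sample, size=50, replace=True))
                   for _ in range(500)])       # 500 bootstrap medians, sorted
ci_low, ci_high = medians[24], medians[474]    # ~90% confidence interval for the median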
In [37]:
from pandas.tools.plotting import bootstrap_plot
data = pd.Series(np.random.rand(1000));
bootstrap_plot(data, size=50, samples=500, color='grey')
Out[37]:
RadViz is a way to visualise multivariate data, i.e. to map n-dimensional points into a two-dimensional space. In this case the mapping is not linear. The technique is based on a simple spring-tension minimisation algorithm.
Imagine $n$ points, $S_1, S_2, ..., S_n$, arranged to be equally spaced around the circumference of the unit circle. Now suppose a set of $n$ springs, each fixed at one end to one of these points and at the other end to a puck, and assume that the stiffness constant (as in Hooke's law) of the $j$-th spring is $x_{ij}$ for data point $i$.
If the puck is released and allowed to reach its equilibrium position, the coordinates of this position, $(u_i, v_i)^{T}$, are the projection into two-dimensional space of the point $(x_{i1}, x_{i2}, ..., x_{in})^{T}$ from the $n$-dimensional space. If $(u_i, v_i)^{T}$ is computed for every data point $i$ and the points are plotted, a visualisation of the $n$-dimensional dataset in two dimensions is obtained.
To understand more about the projection of the $n$-dimensional space into the two-dimensional one, consider the forces acting on the puck. When the puck is in equilibrium, the resultant force acting on it is zero. Denoting the position vectors of $S_1$ to $S_n$ by $\mathbf{S}_1$ to $\mathbf{S}_n$ and putting $\mathbf{u}_{i}=(u_i, v_i)^{T}$ we have:
$$\sum_{j=1}^{n}(\mathbf{S}_{j}-\mathbf{u}_{i})\,x_{ij} = 0$$
which, when solved for $\mathbf{u}_{i}$, gives:
$$\mathbf{u}_{i} = \sum_{j=1}^{n}w_{ij}\,\mathbf{S}_j$$
where
$$ w_{ij} = \left( \sum_{j=1}^{n} x_{ij} \right)^{-1} x_{ij}$$
This means that for each $i$ (i.e. each dataframe row), $\mathbf{u}_i$ is the weighted mean of the $\mathbf{S}_j$'s, where the weights are the $n$ feature values of case $i$ normalised so that they sum to unity.
N.B. This normalisation makes the mapping $\mathcal{R}^{n}\rightarrow \mathcal{R}^{2}$ non-linear.
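A minimal sketch of this projection for a single dataframe row (illustrative only; radviz below does this for every row, after first rescaling each feature column to the [0, 1] range):
n = 4                                               # number of feature dimensions
theta = 2 * np.pi * np.arange(n) / n                # anchors equally spaced on the unit circle
S = np.column_stack([np.cos(theta), np.sin(theta)])
x = dataframe.iloc[0, 0:4].values.astype(float)     # one data point (row i)
u = (x[:, None] * S).sum(axis=0) / x.sum()          # weighted mean of the anchors: (u_i, v_i)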
In [38]:
from pandas.tools.plotting import radviz
radviz(new_df, 'Species')
Out[38]:
Now let's see some 'data wrangling' commands from the pandas framework. These are used to re-shape, select and extract information from a given dataset.
Let's dive a bit into the different data structures
A Series is in fact a 1D array with labels that holds a specific data type (double, int, string, etc.). The (axis) labels are referred to as the index.
To create a Series you need data and (optionally) an index definition.
Data can be: an ndarray, a dict, or a scalar value.
Index can be: a list of axis labels.
If no index is included in the definition, the indices will be numerical, from 0 to len(data)-1.
In [39]:
s = pd.Series(np.random.randn(5), index = ['a','b','c','d','e'])
# 5 is the # of observations
# only one axis in tuple -> only 1D series
In [40]:
s
Out[40]:
In [41]:
s.plot(kind='kde')
Out[41]:
this does not work for a 2D ndarray
In [42]:
#s = pd.Series(np.random.randn(5,5)) # this returns an exception : Data must be 1-dimensional
In [43]:
d = {'a':125., 'b':500, 'c':400, 'e':240}; d
Out[43]:
In [44]:
s2 = pd.Series(d); s2
Out[44]:
In [45]:
s2.plot(kind='bar', alpha=0.8, color='pink', ylim=(0, 1000))
Out[45]:
In [46]:
s3 = pd.Series(d, index=['a','b','c','d','e','f']); s3
Out[46]:
In [47]:
s3.plot(kind='bar', alpha=0.8, color='cyan', ylim=(0, 1000))
Out[47]:
So indeed the Series is filled with NaNs, while the bar plot fills them with 0s!
In [48]:
s4 = pd.Series(5, index=['a','b','c']); s4
Out[48]:
Series can behave as ndarrays or as dictionaries including all the operations! For example
In [49]:
s2[3]
Out[49]:
In [50]:
s2[:4]
Out[50]:
In [51]:
s2.median(); s2.mean(); s2.std(); s2.quantile(0.50)
Out[51]:
In [52]:
s2[s2>s2.median()]
Out[52]:
In [53]:
s2['a']
Out[53]:
In [54]:
'a' in s2
Out[54]:
Also vectorised operations can be done. The difference with ndarray is that the result is aligned automatically based on the labels.
In [55]:
s+s2; s2*2; np.exp(s)
Out[55]:
In [56]:
s5 = pd.Series(np.random.randn(5), name='something'); s5.name
Out[56]:
On the other side, we have DataFrames. A DataFrame is like an Excel spreadsheet or a SQL table. DataFrames can accept many different kinds of input.
Apart from data, the input arguments can be columns and index.
The result index will be the union of the indexes of the various Series. Nested dicts will be converted to Series first.
In [57]:
d = {'one' : pd.Series([1.,2.,3.], index = ['a','b','c']),
'two' : pd.Series([1., 2., 3., 4.], index=['a','b','c','d']) }; d
Out[57]:
In [58]:
df = pd.DataFrame(d); df
Out[58]:
In [59]:
pd.DataFrame(d, index = ['d','b','a']) # rearranging the rows
Out[59]:
In [60]:
pd.DataFrame(d, index = ['d','b','a'], columns=['two' , 'three']) # there is no 'three' col
Out[60]:
The index and columns can be accessed directly:
In [61]:
df.index
Out[61]:
In [62]:
df.columns
Out[62]:
In [63]:
d = {'one' : [1.,2.,3.,4.],
'two' : [4.,3.,2.,1.]}; d
Out[63]:
In [64]:
pd.DataFrame(d)
Out[64]:
In [65]:
pd.DataFrame(d, index=['a','b','c','d'])
Out[65]:
In [66]:
x = np.array([(1,2,'Hello'), (2,3,'World') ],
dtype=[('foo','i4'),('bar','f4'), ('baz', 'S10')]) # structured array
In [67]:
x[0]
Out[67]:
In [68]:
x['foo']
Out[68]:
In [69]:
x.dtype.names
Out[69]:
In [70]:
x[['foo','bar']] # access multiple fields simultaneously
Out[70]:
In [71]:
y = np.rec.array([(1,2.,'Hello'),(2,3.,"World")],
dtype=[('foo', 'i4'),('bar', 'f4'), ('baz', 'S10')]) ## record array
In [72]:
y[0]
Out[72]:
In [73]:
y['foo']
Out[73]:
In [74]:
y.dtype.names
Out[74]:
In [75]:
y[['foo','bar']] # access multiple fields simultaneously
Out[75]:
In [76]:
y[1].baz # access one element
Out[76]:
So now making a DF from such arrays:
In [77]:
data = np.zeros((2,), dtype=[('A', 'i4'),('B', 'f4'),('C', 'a10')])
In [78]:
data[:] = [(1,2.,'Hello'), (2,3.,"World")]
In [79]:
pd.DataFrame(data)
Out[79]:
In [80]:
pd.DataFrame(data, index=['first', 'second'])
Out[80]:
In [81]:
pd.DataFrame(data, columns=['C', 'A', 'B'])
Out[81]:
In [82]:
data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
In [83]:
pd.DataFrame(data2)
Out[83]:
In [84]:
pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})
Out[84]:
Now let's make one dataframe and see what we can do when we have dates and NaN values in it.
In [91]:
# Create the csv file
out = open("file.csv", 'w');
s = '''Date,Temperature_city_1,Temperature_city_2,Temperature_city_3,Which_destination
20140910,80,32,40,1
20140911,100,50,36,2
20140912,102,55,46,1
20140912,60,20,35,3
20140914,60,,32,3
20140914,,57,42,2''';
out.write(s);
out.close();
In [92]:
# read it in pandas
df_csv = pd.read_csv("file.csv", sep=','); df_csv
Out[92]:
In [93]:
# read it in pandas and process dates
df_csv2 = pd.read_csv("file.csv", sep=',', parse_dates=[0]); df_csv2
Out[93]:
In [38]:
df_nulls = pd.read_csv("file.csv", sep=',', parse_dates=[0]); df_nulls
Out[38]:
In [109]:
pd.isnull(df_nulls['Temperature_city_1']) # looking for nulls in one column
Out[109]:
In [110]:
df_nulls['Temperature_city_1'].isnull() # same thing different format
Out[110]:
In [111]:
df_nulls.notnull() # look for actual values (NON NaN's) in the whole df
Out[111]:
In [114]:
print(np.nan);
np.nan == None
Out[114]:
Also, NaN does not compare equal to anything, not even to itself; this means that
In [115]:
np.nan == np.nan
Out[115]:
Pandas/NumPy use this to distinguish None from NaN, since
In [116]:
None == None
Out[116]:
In [97]:
# read it in pandas and process dates
df_csv3 = pd.read_csv("file.csv", sep=',', parse_dates=[0]);
df_csv3.fillna(-1)
Out[97]:
Keep in mind that this does not replace the values in the original DataFrame: fillna() returns a new DataFrame, which must be stored separately.
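(For completeness, fillna() also accepts inplace=True, e.g. df_csv3.fillna(-1, inplace=True), which modifies the DataFrame directly instead of returning a copy; below we keep the copy-based style.)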
So let's replace and see the values:
In [99]:
df_csv4 = df_csv3.fillna(-1); df_csv4
Out[99]:
df_csv4 has indeed replaced the NaNs with -1, but what happened to df_csv3?
In [100]:
df_csv3
Out[100]:
df_csv3 is the original DataFrame and its NaN values have not been replaced.
In [104]:
print(type(df_csv3.mean(axis=0)));
df_csv3.mean(axis=0)
Out[104]:
In [105]:
df_csv5 = df_csv3.fillna(df_csv3.mean(axis=0)); df_csv5
Out[105]:
N.B. The argument axis=0 means the calculation spans the rows, so one mean per column is obtained (these column means are what fillna used above); axis=1 spans the columns and gives one result per row.
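A quick illustration of the axis convention on a tiny hypothetical frame:
toy = pd.DataFrame({'a': [1., 2.], 'b': [3., 4.]})
toy.mean(axis=0)    # one mean per column: a -> 1.5, b -> 3.5
toy.mean(axis=1)    # one mean per row:    0 -> 2.0, 1 -> 3.0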
In [106]:
df_csv6 = df_csv3.dropna(); df_csv6
Out[106]:
In [39]:
df_nulls
Out[39]:
In [40]:
df_nulls.count()
Out[40]:
In [41]:
df_nulls.interpolate().count() # first interpolate the NaN
Out[41]:
In [43]:
df_nulls.interpolate()
Out[43]:
To see how this interpolation works, we can plot the data before and after:
In [49]:
df_nulls.plot(title='no linear interpolation')
df_nulls.interpolate().plot(title='with linear interpolation')
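On a toy Series the default (linear) interpolation is easy to read off; a minimal sketch (methods other than 'linear' generally require scipy):
toy = pd.Series([1.0, np.nan, np.nan, 4.0])
toy.interpolate()                    # linear by default -> 1.0, 2.0, 3.0, 4.0
toy.interpolate(method='nearest')    # nearest-neighbour fill (needs scipy) -> 1.0, 1.0, 4.0, 4.0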
In [25]:
## Let's make some dfs with NaNs
a1 = pd.Series(np.random.randn(3), index = ['e','f','h']);
a2 = pd.Series(np.random.randn(5), index = ['a','c','e','f','h']);
a = {'one': a1, 'two': a2};
dfa = pd.DataFrame(a);
print(dfa)
b1 = pd.Series(np.random.randn(2), index = ['e','f']);
b2 = pd.Series(np.random.randn(5), index = ['a','c','e','f','h']);
b3 = pd.Series(np.random.randn(5), index = ['a','c','e','f','h']);
b = {'one':b1, 'two':b2, 'three':b3};
dfb = pd.DataFrame(b);
print(dfb)
In [27]:
dfa+dfb ## NaN are propagated properly
Out[27]:
In [29]:
(dfa+dfb).mean(axis=0) # one mean per column (computed over the rows)
Out[29]:
Functions like cumsum() (cumulative sum over the requested axis) and cumprod() (cumulative product over the requested axis) ignore NaNs. Similarly, groupby() ignores NaNs in its calculations (rows with a NaN group key are simply excluded).
In [34]:
dfa.cumsum()
Out[34]:
In [37]:
dfa.groupby('one').mean()
Out[37]:
There is also the case - especially in the real world - that the dataset to be loaded is bad or erroneous. When loading such a dataset with read_csv(), the program will exit with an error. A workaround is to ignore the bad lines with the error_bad_lines option.
Assume that we have the following bad dataset:
In [1]:
f = open('bad_dataset.csv', 'w')
f.write('Val1,Val2,Val3\n')
f.write('1,1,1\n')
f.write('2,2,2,2\n') # 4 columns, 3 expected!
f.write('3,3,3\n')
f.close()
Now let's load it as usual:
In [11]:
bdf = pd.read_csv('bad_dataset.csv')
Notice the CParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4
Now trying with the option:
In [12]:
bdf = pd.read_csv('bad_dataset.csv', error_bad_lines=False) # i.e. skip bad lines instead of raising an error
In [14]:
bdf
Out[14]:
So the line (2,2,2,2) is dropped. N.B.: notice that the remaining rows are re-indexed, so the (3,3,3) row now has index 1.
In [ ]: