Data Munging with Pandas

Table of Contents

1. Let's load the dataset
2. Let's plot the dataset
    2.1  scatter_matrix
    2.2  Bar plot
    2.3  Histogram
    2.4  Box plot
    2.5  Area plot
    2.6  Scatter plot
    2.7  Hexbin plot
    2.8  Pie plot
    2.9  Missing data on plots
    2.10 Density/KDE plots
    2.11 Andrews Curves
    2.12 Parallel Coordinates
    2.13 Lag plot
    2.14 Autocorrelation plot
    2.15 Bootstrap plot
    2.16 RadVis plot
3. Data holders
    3.1 Series
        3.1.1 Series from ndarray
        3.1.2 Series from dict
        3.1.3 Series from scalar
    3.2 DataFrame
        3.2.1 DataFrame from dict of Series o dicts
        3.2.2 DataFrame from dict of ndarrays
        3.2.3 DataFrame from structured/record arrays
        3.2.4 DataFrame from lists of dicts 
        3.2.5 DataFrame from dict of tuples
        3.2.6 DataFrame from Series
4. Dealing with missing data
    4.1 Automatic Formating of date strings
    4.2 Handle of NaNs
        4.2.1 Find NaN values
            4.2.1.1 NaN and None
        4.2.2 Fill NaN with fixed value
        4.2.3 Fill NaN with mean or median
        4.2.4 Drop NaN rows
        4.2.5 Calculations with NaN
        4.2.6 Interpolate NaN
    4.3 Reading Bad Input files
    4.4 

Let's focus on the Pandas dataframe Munging stuff

The modules that we'll need:


In [4]:
%matplotlib inline

In [5]:
import numpy as np

In [6]:
from sklearn import tree

In [7]:
import pandas as pd

1. Let's load the dataset

We'll load one of the example datasets from sklearn!


In [8]:
from sklearn import datasets

In [9]:
iris = datasets.load_iris()

This is a sklearn dataset including the

  • .DESCR : general description
  • .data : all features
  • .features_names : names of features
  • .target : target values expressed as values / numbered classes
  • .target_names : names of the target classes
  • .shape : can be applied both on .data and .target and give the (row,column) tuple

In [7]:
iris.DESCR;
iris.data;
iris.feature_names


Out[7]:
['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [8]:
iris.target;
iris.target_names


Out[8]:
array(['setosa', 'versicolor', 'virginica'], 
      dtype='|S10')

In [9]:
iris.data.shape


Out[9]:
(150, 4)

In [10]:
type(iris.data)


Out[10]:
numpy.ndarray

Let's use pandas' to convert the ndarray into dataframe, keeping the column names!


In [11]:
dataframe = pd.DataFrame(iris.data, columns=iris.feature_names)

In [12]:
dataframe.head()


Out[12]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2

In [13]:
target_df = pd.DataFrame(iris.target, columns=['Species'])

In [14]:
target_df.tail(2)


Out[14]:
Species
148 2
149 2


2. Let's plot the dataset!

2.1 Scatter_matrix

Using pandas' scatter_matrix function. This is similar to R's pairs and makes a quick scatter plot map for the quantitative features of the dataframe

First we'll make a list named colors as a book-keeper. Then to visualise the different target categories (i.e. the Species) we'll use the target numerical categorisation (0,1,2) to differently colour each row of the dataframe based on the target variable


In [15]:
colors = list()

In [16]:
palette = { 0: "red", 1: "green", 2:"blue"}

In [17]:
for c in np.nditer(iris.target):
    colors.append(palette[int(c)])  # c is 0,1 or 2 and we append red, green blue

In [18]:
scatterplot = pd.scatter_matrix(dataframe, 
                               alpha    = 0.3,
                               figsize  = (10,10),
                               diagonal = 'hist', #'kde'
                               color    = colors, 
                               marker   = 'o',
                               grid     = True)



There are various plot methods inside pandas.

Now to make the scatter plot matrix

2.2 A bar plot


In [19]:
dataframe.plot(kind='bar', stacked=True, figsize=(10,8)) #barh


Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x1082d4e50>

2.3 Histogram (for one column, for all, or split by diff )


In [20]:
dataframe['sepal length (cm)'].plot(kind='hist', color='red', alpha=0.3, figsize=(10,8))


Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x108a73290>

In [21]:
dataframe.plot(kind='hist', alpha=0.3, orientation='horizontal', 
               cumulative=True, bins=15, figsize=(10,8))


Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x109da9910>

In [22]:
dataframe.diff().hist(color='k', alpha=0.5, bins=50, figsize=(10,8))
                    # diff() gets the difference of all
                    # rows with row-0


Out[22]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x10a3a8210>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x10a580490>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x10a5fd250>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x10a65e610>]], dtype=object)

2.4 Box plots

Let's remind here that box plot is a convenient way of depicting groups of numerical data and their quartiles.

The first quartile is the 25th percentile (splits off the lowest 25% of data from the highest 75%), the second quartile (median) is the 50th percentile (cuts the dataset in half) and the third quartile is the upper quartile or the 75th percentile (splits off the highest 25% from the lowest 75%). IRQ (interquantile range) is the difference between upper and lower quartiles IRQ=Q3-Q1).

The line is the median (Q2 or 50%). The box represents the IRQ, the from median to highest values is the Q2 to Q3 region, while the lowest is the Q1 to Q2 region.

      _____   -> maximum
        |
        |
       ---    -> third quartile
      |   |
      |___|   -> median (Q2)
      |   |
      |   |
       ---    -> first quartile
        |
        |
      _____   -> minimum

Outliers --> 3 x IRQ

Suspected Outliers 1.5 x IRQ


In [23]:
dataframe.plot(kind='box', vert=False, by='X', figsize=(10,8))


Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a74bf90>

2.5 Area plots!


In [24]:
dataframe.plot(kind='area', alpha=0.6, figsize=(10,8))


Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x10adb1e90>

2.6 Scatter Plots

  • Defining the x and y axis
  • plotting two sets with the ax=ax
  • palette height

In [25]:
dataframe.plot(kind='scatter', x='sepal length (cm)', y='sepal width (cm)', figsize=(10,8))


Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a3c1650>

In [26]:
ax = dataframe.plot(kind='scatter', x='sepal length (cm)', y='sepal width (cm)', 
                    color='DarkBlue', label='Group 1');
dataframe.plot(kind='scatter', x='petal length (cm)', y='petal width (cm)', 
                    color='DarkGreen', label='Group 2' , figsize=(10,8),
                    ax=ax); # ax = ax to overlay
                            # if not set 2 plots made


Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x10b11c190>

In [27]:
dataframe.plot(kind='scatter', x='petal length (cm)', y='petal width (cm)', 
               c='sepal width (cm)', s=50, figsize=(10,8))


Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x10b477f50>

2.7 Hexbin

Hexbin are alternative to scatter plots if the data are too dense to plot individually


In [28]:
dataframe.plot(kind='hexbin', x='petal length (cm)', y='petal width (cm)', 
               gridsize=25, figsize=(10,8))


Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x10b706e50>

2.8 Pie plots

Pies are good for series


In [29]:
series = pd.Series(3 * np.random.rand(4), index=['a', 'b', 'c', 'd'], name='series')
series.plot(kind='pie', figsize=(10,10))


Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x10b64a910>

2.9 Missing data

Depending on the plot type pandas handle by default the NAs

Plot type Handling
Line Leave gaps at NA
Line stacked Fill 0's
Bar Fill 0's
Scatter Drop NA
Histogram Drop NA
Box Drop NA
Area Fill 0's
KDE (density) Fill 0's
Hexbin Drop NA
Pie Fill 0

If this is not the wanted behaviour use: fillna() or dropna() before plotting


2.10 Density/KDE Plot


In [30]:
dataframe.ix[:,0:2].plot(kind='kde', figsize=(10,8))
# this is to take all rows for the first two columns of the dataframe (or df.iloc[:,0:2])


Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x10b9d4790>

2.11 Andrews Curves

They are used to plot multivariate data as a large number of curves that are created using the atributes of samples as coefficients for Fourier series. By coloring these curves per class it is possible to visualize data clustering.

It is a good way to visualize structure in high-dimensional data. Each data point $x={x_1 , x_2, ..., x_d}$ defines a finite Fourrier series:

$f_{x}(t) = \frac{x_1}{\sqrt{2}} + x_2 sin(t) + x_3 cos(t) + x_4 sin(2t) + x_5 cos(2t)+...$

This function is plotted for $-\pi < t < \pi$. Therefore, each data point may be viewed as a line between $-\pi$ and $\pi$. The formula can be thought of as the projection of the data point onto the vector

$(\frac{1}{\sqrt{2}}, sin(t), cos(t), sin(2t), cos(2t),...)$

Each df row is a line per category. On the horizontal axis is the $-\pi < t < \pi$ and on the vertical the $f_x(t)$.

Curves belonging to samples of the same class will usually be closer together and form larger structures!


In [31]:
from pandas.tools.plotting import andrews_curves

In [32]:
new_df = dataframe;
new_df['Species']=target_df;
new_df.head(5)


Out[32]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) Species
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0

In [33]:
andrews_curves(new_df, "Species")


Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x10bb5eb90>

2.12 Parallel Coordinates

Parallel coordinates is a way of visualising high-dimensional geometry and analysing multivariate data.

To show a set of points in a n-dimensional space, the horizontal axis is split consisting n parallel vertical (typically) lines. A point in the n-dimensional space is represented as a polyline with vertices on the parallel axes; the position of the vertex on the i-th axis corresponds to the i-th coordinate of the point.

This type of visualisation is closely related to time series visualisation, with the difference that it is applied on data that have no time dependance.

Three points must be taken into account when the plot is used for statistical inference:

  1. The order of the axis is critical for finding features (many reorderings are done in practice).
  2. The rotation of the axis is a translation in the parallel coordinates; if the lines intersected outside the parallel axes, it can be translated between them by rotations (i.e. 180$\circ$).
  3. The scaling is necessary since the plot is based on interpolation of consecutive pairs of variables. Thus the variables must be on a common scale and orthogonal to each parallel axis.

Points that tend to cluster will appear closer together.


In [34]:
from pandas.tools.plotting import parallel_coordinates
parallel_coordinates(new_df, "Species")


Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x10c307f50>

2.13 Lag plot

Lag plots are used to identify randomness in data. Random data should not exhibit any structure in the lag plot

Non-random structure implies that data are not random.

A lag is a fixed time displacement. For example, given a dataset $Y1,~Y2~...,~Yn$, $Y2$ and $Y7$ have lag $5$ since $7 - 2 = 5$. Lag plots can be generated for any arbitrary lag, although the most commonly used lag is $1$.


In [35]:
from pandas.tools.plotting import lag_plot
lag_plot(new_df)


Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x10c788490>

2.14 Autocorrelation Plot

This is to check randomness in series. It computes the autocorrelations for data values at varying time lags.

If series are random, the autocorrelation should be near zero for any and all time-lag separation.

If time series are non-random then one or more of the autocorrelations will be significantly non-zero.

The horizontal lines correspond to the 95% and 99% confidence bands (dashed = 99%)

definitions:

  • Vertical axis : autocorrelation coefficient: $R_{h}=C_{h}/ C_{0}$,

    where $C_{h}$ is the autocovariance :

    $$C_{h} = \frac{1}{N} \sum_{t=1}^{N-h}(Y_{t}-\bar{Y})(Y_{t+h}-\bar{Y})$$

    and $$C_{0} = \frac{\sum_{t=1}^{N}(Y_{t}-\bar{Y})^{2}}{N}$$

    $-1 \leq R_{h}\leq 1$

    N.B. Some times the the autocovariance function is given as

    $$C_{h} = \frac{1}{N-h} \sum_{t=1}^{N-h}(Y_{t}-\bar{Y})(Y_{t+h}-\bar{Y})$$

    which has less bias (1/N-h)

  • Horizontal Axis: time lag $h$, $h=(1,2,...)$

  • The above line containes several reference lines :

    • the middle line is at zero
    • the other four lines are the 95% and 99% confidence bands. Note that there are two distinct formulas for generating confidence bands.
      • if the autocorrelation plot is being used to test randomness (no time dependence in data) the following formula is recomended $$CL = \pm\frac{z_{1-a/2}}{\sqrt{N}}$$, where $N$ is the sample size, $z$ the cumulative distribution function of the standard normal distribution and $\alpha$ the the significance level
      • if autocorrelation plot is made using the ARIMA model fitting then the the confidence bands are generated by : $$CL= \pm z_{1-a/2} \cdot \sqrt{ \frac{1}{N}(1+2\sum_{i}^{k}y_{i}^2) } $$ where $k$ is the lag, $N$ the sample size, $z$ the cumulative distribution function of the standard normal distribution and $\alpha$ the significance level.

In [36]:
from pandas.tools.plotting import autocorrelation_plot
data = pd.Series(0.7 * np.random.rand(1000) +
                 0.3 * np.sin(np.linspace(-9 * np.pi, 9 * np.pi, num=1000)))

autocorrelation_plot(data)


Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x10c7adf90>

2.15 Bootstrap plot

Bootstrap plots are made to assess the uncertainty of a statistic metric (mean, median, midrange etc)

To make a bootstrap uncertainty estimate for one metric given a dataset, a subset of the sample of size les or equal to the size of the dataset is generated from the data and the statistic is calculated. This subset is generated with replacement, and thus each data point can be resampled multiple times or not at all. The process is repeated, usually 500 to 1000 times. The computed values for the statistic form an estimate of the sampling distribution of the statistics.

For example, in a sample of 50 values you want to bootstrap the median. You generate a subset of the sample with 50 elements and calculate the median. Repeat this 500 times, so that you have at least 500 values for the median. To calculate the 90% confidence interval for the median, the sample of medians are sorted into ascending order and the value of the 25th median is the lower confidence limit, while the value of 475th median is the upper confidence limit (first and third quartiles).

The plots generated are the series and the histograms for mean, median and mid-range. For uniformly distributed values, mid-range has the smallest variance.


In [37]:
from pandas.tools.plotting import bootstrap_plot
data = pd.Series(np.random.rand(1000));
    
bootstrap_plot(data, size=50, samples=500, color='grey')


Out[37]:

2.16 RadViz Plot

RadVis is a way to visualise multivariate data; to visualise n-dimensional points into a two dimensional space. In this case, the mapping is not linear. The technique is based on a simple spring tension minimization algorithm.

Imagine $n$ points, $S_1,S_2,...S_n$ arranged to be equally spaced around the cirumference of the unit circle. Now suppose a set of $n$ springs being fixed at one end to each of these points and the other ends to a puck. Finally, assume the stiffness constant (as in Hooke's law) of the j$th$ string in the $x_{ij}$ for one of the data points $i$.

If the puck is released and allowed to reach equilibrium position, the coordinates of this position, $(u_i, v_j)^{T}$, are the projection in the two dimensional space of the point $(x_{i1}, x_{i2}, ... x_{in})^{T}$ in the $n$-dimensional space. If the $(u_i, v_j)^{T}$ is computed for $i=1,2...n$ and the points plotted, a visualisation of the $n$-dimensional dataset in the two dimensions is achieved.

To understand more about the projection of the $n$-dimensional space into the two dimensional one, consider the forces acting on the puck. When the puck is in equilibrium, there are no resultant forces acting on it (their sum is 0). Denoting the position vectors of $S_1$ to $S_n$ by $\mathbf{S_1}$ to $\mathbf{S_n}$ and putting $\mathbf{u_{i}}=(u_i, v_i)^{T}$ we have:

$$\sum_{j=1,n}=(\mathbf{S}_{j}-\mathbf{u}_{i})x_{ij} = 0 $$

which when solved for $\mathbf{u}_{i}$ :

$$\mathbf{u}_{i} = \sum_{j=1,n}w_{ij}\mathbf{S}_j$$

where

$$ w_{ij} = \left( \sum_{j=1,n} x_{ij} \right)^{-1} x_{ij}$$

This means that for each $i$ (i.e. dataframe row), $\mathbf{u_i}$ is the weighted mean of the $\mathbf{S_j}$'s whose weights are the $n$ variables for case $i$ normalised to unity.

N.B. This normalisation makes the mapping $\mathcal{R}^{n}\rightarrow \mathcal{R}^{2}$ non-linear.


In [38]:
from pandas.tools.plotting import radviz

radviz(new_df, 'Species')


Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x10ce44590>


3. Data Holders

Now let's do see some 'data wrangling' commands with pandas framework. These are to re-shape, select and extract information from a given dataset.

Let's dive a bit into the different data structures

3.1. Series

Series is in fact an 1D array with labels that holds a specific data type (double, int, string etc). The labels (axis labels) are referred to as index.

To create a Series you need data and an index definition.

  • Data can be:

    • dictionary
    • ndarray
    • scalar value (i.e. 5)
  • Index can be:

    • any list of strings

If no index is included in the definition, the indices will be numerical from 0 to len(ndarray)-1.

3.1.1. Series from ndarray


In [39]:
s = pd.Series(np.random.randn(5), index = ['a','b','c','d','e']) 
                                    # 5 is the # of observations
                                    # only one axis in tuple -> only 1D series

In [40]:
s


Out[40]:
a    0.418756
b   -1.405162
c    0.525731
d    2.026037
e   -0.360525
dtype: float64

In [41]:
s.plot(kind='kde')


Out[41]:
<matplotlib.axes._subplots.AxesSubplot at 0x10d079350>

this does not work for a 2D ndarray


In [42]:
#s = pd.Series(np.random.randn(5,5)) # this returns an exception : Data must be 1-dimensional

3.1.2. Series from dictionary

Generating from a dictionary, the keys are taken as index. If extra index argument is given this the keys values are overwritten for the matching cases, while if the index argument is larger than the len(dict.keys()) then NaN's are created.


In [43]:
d = {'a':125., 'b':500, 'c':400, 'e':240}; d


Out[43]:
{'a': 125.0, 'b': 500, 'c': 400, 'e': 240}

In [44]:
s2 = pd.Series(d); s2


Out[44]:
a    125
b    500
c    400
e    240
dtype: float64

In [45]:
s2.plot(kind='bar', alpha=0.8, color='pink', ylim=(0, 1000))


Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x10d153a90>

In [46]:
s3 = pd.Series(d, index=['a','b','c','d','e','f']); s3


Out[46]:
a    125
b    500
c    400
d    NaN
e    240
f    NaN
dtype: float64

In [47]:
s3.plot(kind='bar', alpha=0.8, color='cyan', ylim=(0, 1000))


Out[47]:
<matplotlib.axes._subplots.AxesSubplot at 0x10cf27750>

So indeed the series are filled with NaN's and the plots with 0's!.


3.1.3. Series from scalar

Now introducing series with scalar input, the values will all be identical.


In [48]:
s4 = pd.Series(5, index=['a','b','c']); s4


Out[48]:
a    5
b    5
c    5
dtype: int64

Series can behave as ndarrays or as dictionaries including all the operations! For example


In [49]:
s2[3]


Out[49]:
240.0

In [50]:
s2[:4]


Out[50]:
a    125
b    500
c    400
e    240
dtype: float64

In [51]:
s2.median(); s2.mean(); s2.std(); s2.quantile(0.50)


Out[51]:
320.0

In [52]:
s2[s2>s2.median()]


Out[52]:
b    500
c    400
dtype: float64

In [53]:
s2['a']


Out[53]:
125.0

In [54]:
'a' in s2


Out[54]:
True

Also vectorised operations can be done. The difference with ndarray is that the result is aligned automatically based on the labels.


In [55]:
s+s2; s2*2; np.exp(s)


Out[55]:
a    1.520070
b    0.245327
c    1.691695
d    7.583971
e    0.697310
dtype: float64

In [56]:
s5 = pd.Series(np.random.randn(5), name='something'); s5.name


Out[56]:
'something'


3.2 DataFrames

On the other side, we have DataFrames. A dataframe is like an Excel spreadsheet or a SQL table. Dataframes can accept various many different kinds of inputs

  • Dictionay of 1D ndarrays, lists, dict or Series
  • 2D numpy.ndarray
  • structured or record ndarray
  • series
  • another DataFrame

Apart from data, the input arguments can be columns and index.

3.2.1 DataFrame from dict of Series or dicts

The result index will be the union of the indexes of the various Series. Nested dicts will be converted to Series first.


In [57]:
d = {'one' : pd.Series([1.,2.,3.], index = ['a','b','c']),
     'two' : pd.Series([1., 2., 3., 4.], index=['a','b','c','d']) }; d


Out[57]:
{'one': a    1
 b    2
 c    3
 dtype: float64, 'two': a    1
 b    2
 c    3
 d    4
 dtype: float64}

In [58]:
df = pd.DataFrame(d); df


Out[58]:
one two
a 1 1
b 2 2
c 3 3
d NaN 4

In [59]:
pd.DataFrame(d, index = ['d','b','a']) # rearanging stuff


Out[59]:
one two
d NaN 4
b 2 2
a 1 1

In [60]:
pd.DataFrame(d, index = ['d','b','a'], columns=['two' , 'three']) # there is no 'three' col


Out[60]:
two three
d 4 NaN
b 2 NaN
a 1 NaN

The index and columns can be accessed directly:


In [61]:
df.index


Out[61]:
Index([u'a', u'b', u'c', u'd'], dtype='object')

In [62]:
df.columns


Out[62]:
Index([u'one', u'two'], dtype='object')

3.2.2 DataFrame from dict of ndarrays/lists


In [63]:
d = {'one' : [1.,2.,3.,4.],
     'two' : [4.,3.,2.,1.]}; d


Out[63]:
{'one': [1.0, 2.0, 3.0, 4.0], 'two': [4.0, 3.0, 2.0, 1.0]}

In [64]:
pd.DataFrame(d)


Out[64]:
one two
0 1 4
1 2 3
2 3 2
3 4 1

In [65]:
pd.DataFrame(d, index=['a','b','c','d'])


Out[65]:
one two
a 1 4
b 2 3
c 3 2
d 4 1

3.2.3 DataFrame from structured/record arrays

These arrays are made using named indices.

For example


In [66]:
x = np.array([(1,2,'Hello'), (2,3,'World') ],
             dtype=[('foo','i4'),('bar','f4'), ('baz', 'S10')])   # structured array

In [67]:
x[0]


Out[67]:
(1, 2.0, 'Hello')

In [68]:
x['foo']


Out[68]:
array([1, 2], dtype=int32)

In [69]:
x.dtype.names


Out[69]:
('foo', 'bar', 'baz')

In [70]:
x[['foo','bar']]  # access multiple fields simultaneously


Out[70]:
array([(1, 2.0), (2, 3.0)], 
      dtype=[('foo', '<i4'), ('bar', '<f4')])


In [71]:
y = np.rec.array([(1,2.,'Hello'),(2,3.,"World")], 
                          dtype=[('foo', 'i4'),('bar', 'f4'), ('baz', 'S10')]) ## record array

In [72]:
y[0]


Out[72]:
(1, 2.0, 'Hello')

In [73]:
y['foo']


Out[73]:
array([1, 2], dtype=int32)

In [74]:
y.dtype.names


Out[74]:
('foo', 'bar', 'baz')

In [75]:
y[['foo','bar']]  # access multiple fields simultaneously


Out[75]:
array([(1, 2.0), (2, 3.0)], 
      dtype=[('foo', '<i4'), ('bar', '<f4')])

In [76]:
y[1].baz  # access one element


Out[76]:
'World'

So now making a DF from such arrays:


In [77]:
data = np.zeros((2,), dtype=[('A', 'i4'),('B', 'f4'),('C', 'a10')])

In [78]:
data[:] = [(1,2.,'Hello'), (2,3.,"World")]

In [79]:
pd.DataFrame(data)


Out[79]:
A B C
0 1 2 Hello
1 2 3 World

In [80]:
pd.DataFrame(data, index=['first', 'second'])


Out[80]:
A B C
first 1 2 Hello
second 2 3 World

In [81]:
pd.DataFrame(data, columns=['C', 'A', 'B'])


Out[81]:
C A B
0 Hello 1 2
1 World 2 3

3.2.4 DataFrame from list of dicts

Now make a list of dictionaries


In [82]:
data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

In [83]:
pd.DataFrame(data2)


Out[83]:
a b c
0 1 2 NaN
1 5 10 20

3.2.5 DataFrame from dict of tuples

Now make a dict containing tuples:


In [84]:
pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
                ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
                ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
                ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
                ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})


Out[84]:
a b
a b c a b
A B 4 1 5 8 10
C 3 2 6 7 NaN
D NaN NaN NaN NaN 9

3.2.6 DataFrame from Series

This is including various series for the creation of a DF



4. Dealing with dates and problematic data

Now let's make one dataframe and see what we can do when we have dates and NaN values in it.


In [91]:
# Create the csv file
out = open("csv_datasets/file.csv", 'write');
s = '''Date,Temperature_city_1,Temperature_city_2,Temperature_city_3,Which_destination\n
20140910,80,32,40,1\n
20140911,100,50,36,2\n
20140912,102,55,46,1\n
20140912,60,20,35,3\n
20140914,60,,32,3\n
20140914,,57,42,2''';
out.write(s);
out.close();

In [92]:
# read it in pandas
df_csv = pd.read_csv("file.csv", sep=','); df_csv


Out[92]:
Date Temperature_city_1 Temperature_city_2 Temperature_city_3 Which_destination
0 20140910 80 32 40 1
1 20140911 100 50 36 2
2 20140912 102 55 46 1
3 20140912 60 20 35 3
4 20140914 60 NaN 32 3
5 20140914 NaN 57 42 2

4.1 Automatic Formating of "date strings"

The read_csv() function has an option that specifyies which column contains dates. Then these dates can be parsed and formated accordingly.

Just include the argiment parse_dates=[list of columns with dates]


In [93]:
# read it in pandas and process dates
df_csv2 = pd.read_csv("csv_datasets/file.csv", sep=',', parse_dates=[0]); df_csv2


Out[93]:
Date Temperature_city_1 Temperature_city_2 Temperature_city_3 Which_destination
0 2014-09-10 80 32 40 1
1 2014-09-11 100 50 36 2
2 2014-09-12 102 55 46 1
3 2014-09-12 60 20 35 3
4 2014-09-14 60 NaN 32 3
5 2014-09-14 NaN 57 42 2

4.2 Handle NaNs

There are various ways of dealing with NaN values within pandas. But first one must find them!

4.2.1 Find NaN Values

There are various ways of finding NaN values. Mainly these are revolting around the isnull() and notnull() functions.


In [38]:
df_nulls = pd.read_csv("file.csv", sep=',', parse_dates=[0]); df_nulls


Out[38]:
Date Temperature_city_1 Temperature_city_2 Temperature_city_3 Which_destination
0 2014-09-10 80 32 40 1
1 2014-09-11 100 50 36 2
2 2014-09-12 102 55 46 1
3 2014-09-12 60 20 35 3
4 2014-09-14 60 NaN 32 3
5 2014-09-14 NaN 57 42 2

In [109]:
pd.isnull(df_nulls['Temperature_city_1'])  # looking for nulls in one column


Out[109]:
0    False
1    False
2    False
3    False
4    False
5     True
Name: Temperature_city_1, dtype: bool

In [110]:
df_nulls['Temperature_city_1'].isnull()   # same thing different format


Out[110]:
0    False
1    False
2    False
3    False
4    False
5     True
Name: Temperature_city_1, dtype: bool

In [111]:
df_nulls.notnull()   # look for actual values (NON NaN's) in the whole df


Out[111]:
Date Temperature_city_1 Temperature_city_2 Temperature_city_3 Which_destination
0 True True True True True
1 True True True True True
2 True True True True True
3 True True True True True
4 True True False True True
5 True False True True True

4.2.1.1 NaN and None

It is very important to remember that NaN is not None :


In [114]:
print np.nan;
np.nan == None


nan
Out[114]:
False

Also NaN are not comparable, this means that


In [115]:
np.nan == np.nan


Out[115]:
False

Pandas/Numpy use this to separate None and NaN, since


In [116]:
None == None


Out[116]:
True


4.2.2 Fill NaN with fixed value

One may choose to replace NaN with a certain fixed value (fillna(val))


In [97]:
# read it in pandas and process dates
df_csv3 = pd.read_csv("file.csv", sep=',', parse_dates=[0]);
df_csv3.fillna(-1)


Out[97]:
Date Temperature_city_1 Temperature_city_2 Temperature_city_3 Which_destination
0 2014-09-10 80 32 40 1
1 2014-09-11 100 50 36 2
2 2014-09-12 102 55 46 1
3 2014-09-12 60 20 35 3
4 2014-09-14 60 -1 32 3
5 2014-09-14 -1 57 42 2

keep in mind that this does not replace the values in the original DataFrame, but it must be stored in a copy.

So let's replace and see the values:


In [99]:
df_csv4 = df_csv3.fillna(-1); df_csv4


Out[99]:
Date Temperature_city_1 Temperature_city_2 Temperature_city_3 Which_destination
0 2014-09-10 80 32 40 1
1 2014-09-11 100 50 36 2
2 2014-09-12 102 55 46 1
3 2014-09-12 60 20 35 3
4 2014-09-14 60 -1 32 3
5 2014-09-14 -1 57 42 2

the df_csv4 has indeed replaced NaN with -1, but what happened of df_csv3 ?


In [100]:
df_csv3


Out[100]:
Date Temperature_city_1 Temperature_city_2 Temperature_city_3 Which_destination
0 2014-09-10 80 32 40 1
1 2014-09-11 100 50 36 2
2 2014-09-12 102 55 46 1
3 2014-09-12 60 20 35 3
4 2014-09-14 60 NaN 32 3
5 2014-09-14 NaN 57 42 2

df_csv3 is the original DF and has not replaced its NaN values.


4.2.2 Fill NaN with mean or Median

One can replace the NaN values with the column mean or median so to minimise the guessing error. This can easily be done by


In [104]:
print type(df_csv3.mean(axis=0));
df_csv3.mean(axis=0)


<class 'pandas.core.series.Series'>
Out[104]:
Temperature_city_1    80.4
Temperature_city_2    42.8
Temperature_city_3    38.5
Which_destination      2.0
dtype: float64

In [105]:
df_csv5 = df_csv3.fillna(df_csv3.mean(axis=0)); df_csv5


Out[105]:
Date Temperature_city_1 Temperature_city_2 Temperature_city_3 Which_destination
0 2014-09-10 80.0 32.0 40 1
1 2014-09-11 100.0 50.0 36 2
2 2014-09-12 102.0 55.0 46 1
3 2014-09-12 60.0 20.0 35 3
4 2014-09-14 60.0 42.8 32 3
5 2014-09-14 80.4 57.0 42 2

N.B. The argument axis=0 implies the calculation of the means that span the rows so the obtained means that they extend column-wise. The axis=1 spans columns and the row-wise results are obtained.


4.2.5 Drop NaN rows

One might choose to drop the NaN values this can be done by


In [106]:
df_csv6 = df_csv3.dropna(); df_csv6


Out[106]:
Date Temperature_city_1 Temperature_city_2 Temperature_city_3 Which_destination
0 2014-09-10 80 32 40 1
1 2014-09-11 100 50 36 2
2 2014-09-12 102 55 46 1
3 2014-09-12 60 20 35 3

4.2.6 Interpolate NaN

Series and DataFrames have the interpolate method, which performs linear interpolation (by default) for the missing datapoints


In [39]:
df_nulls


Out[39]:
Date Temperature_city_1 Temperature_city_2 Temperature_city_3 Which_destination
0 2014-09-10 80 32 40 1
1 2014-09-11 100 50 36 2
2 2014-09-12 102 55 46 1
3 2014-09-12 60 20 35 3
4 2014-09-14 60 NaN 32 3
5 2014-09-14 NaN 57 42 2

In [40]:
df_nulls.count()


Out[40]:
Date                  6
Temperature_city_1    5
Temperature_city_2    5
Temperature_city_3    6
Which_destination     6
dtype: int64

In [41]:
df_nulls.interpolate().count() # first interpolate the NaN


Out[41]:
Date                  6
Temperature_city_1    6
Temperature_city_2    6
Temperature_city_3    6
Which_destination     6
dtype: int64

In [43]:
df_nulls.interpolate()


Out[43]:
Date Temperature_city_1 Temperature_city_2 Temperature_city_3 Which_destination
0 2014-09-10 80 32.0 40 1
1 2014-09-11 100 50.0 36 2
2 2014-09-12 102 55.0 46 1
3 2014-09-12 60 20.0 35 3
4 2014-09-14 60 38.5 32 3
5 2014-09-14 60 57.0 42 2

to see how this interpolation is done we can plot it


In [50]:
df_nulls.plot(title='no linear interpolation')
df_nulls.interpolate().plot(title='with linear interpolation')


Out[50]:
<matplotlib.axes._subplots.AxesSubplot at 0x11577cf90>

Different interpolations method are given by the method kwarg.

  • [method=linear] : ignores the index and treat the values as equally spaced
  • method=time : interpolation works on daily and higher resolution data to interpolate given length of interval;interpolation works for series or df with DateTimeIndex.
  • method=index : use the numerical values of the index
  • method=value : use the floating point values of index

[scipy.interpolate.interp1d]

  • method=nearest
  • method=zero
  • method=slinear
  • method=quadratic
  • method=cubic
  • method=barycentric
  • method=polynomial, order=n

[Wrappers around SciPy methods]

  • method=pchip
  • method=krogh
  • method=piecewise_polynomial
  • method=spline

In [52]:
df_nulls.interpolate(method='polynomial', order=2).plot(title='with pol2 interpolation')


Out[52]:
<matplotlib.axes._subplots.AxesSubplot at 0x11577c450>

4.2.5 Calculations with NaNs

Missing values do propagate naturally in operations with Pandas Objects


In [25]:
## Let's make some dfs with NaNs
a1 = pd.Series(np.random.randn(3), index = ['e','f','h']);
a2 = pd.Series(np.random.randn(5), index = ['a','c','e','f','h']);
a = {'one': a1, 'two': a2};
dfa = pd.DataFrame(a); 
print dfa

b1 = pd.Series(np.random.randn(2), index = ['e','f']);
b2 = pd.Series(np.random.randn(5), index = ['a','c','e','f','h']);
b3 = pd.Series(np.random.randn(5), index = ['a','c','e','f','h']);
b = {'one':b1, 'two':b2, 'three':b3};
dfb = pd.DataFrame(b);
print dfb


        one       two
a       NaN -0.265040
c       NaN -2.358900
e -0.249150  0.321340
f -1.049753  1.032773
h -1.049537 -0.657652
        one     three       two
a       NaN  0.183358 -1.778737
c       NaN -1.081133 -0.915654
e -1.025047 -0.667466  1.083392
f  0.160163 -0.613965  0.599008
h       NaN  0.211446 -0.900994

In [27]:
dfa+dfb ## NaN are propagated properly


Out[27]:
one three two
a NaN NaN -2.043777
c NaN NaN -3.274554
e -1.274197 NaN 1.404731
f -0.889589 NaN 1.631781
h NaN NaN -1.558646

In [29]:
(dfa+dfb).mean(axis=0) #mean for rows


Out[29]:
one     -1.081893
three         NaN
two     -0.768093
dtype: float64

Functions like cumsum() (cumulative sum over requested axis) and cumprod() (cumulative product over requested axis) ignore the NaNs. Similarly the groupby()


In [34]:
dfa.cumsum()


Out[34]:
one two
a NaN -0.265040
c NaN -2.623940
e -0.249150 -2.302600
f -1.298902 -1.269827
h -2.348439 -1.927479

In [37]:
dfa.groupby('one').mean()


Out[37]:
two
one
-1.049753 1.032773
-1.049537 -0.657652
-0.249150 0.321340

4.3 Reading Bad Input Files

There is also the case - especially in the real world - that the dataset which is loaded is bad or erroeneous. When loading such dataset with load_csv() the program will exit with an error. A workaround is to ignore bad lines by the error_bad_lines option.

Assume that we have the following bad dataset:


In [1]:
f = open('bad_dataset.csv', 'write')
f.write('Val1,Val2,Val3\n')
f.write('1,1,1\n')
f.write('2,2,2,2\n') # 4 columns, 3 expected!
f.write('3,3,3\n')
f.close()

Now lets load it as always


In [11]:
bdf = pd.read_csv('bad_dataset.csv')


---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
<ipython-input-11-7138b6e8fd73> in <module>()
----> 1 bdf = pd.read_csv('bad_dataset.csv')

/usr/local/lib/python2.7/site-packages/pandas/io/parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
    472                     skip_blank_lines=skip_blank_lines)
    473 
--> 474         return _read(filepath_or_buffer, kwds)
    475 
    476     parser_f.__name__ = name

/usr/local/lib/python2.7/site-packages/pandas/io/parsers.pyc in _read(filepath_or_buffer, kwds)
    258         return parser
    259 
--> 260     return parser.read()
    261 
    262 _parser_defaults = {

/usr/local/lib/python2.7/site-packages/pandas/io/parsers.pyc in read(self, nrows)
    719                 raise ValueError('skip_footer not supported for iteration')
    720 
--> 721         ret = self._engine.read(nrows)
    722 
    723         if self.options.get('as_recarray'):

/usr/local/lib/python2.7/site-packages/pandas/io/parsers.pyc in read(self, nrows)
   1168 
   1169         try:
-> 1170             data = self._reader.read(nrows)
   1171         except StopIteration:
   1172             if nrows is None:

pandas/parser.pyx in pandas.parser.TextReader.read (pandas/parser.c:7544)()

pandas/parser.pyx in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7784)()

pandas/parser.pyx in pandas.parser.TextReader._read_rows (pandas/parser.c:8401)()

pandas/parser.pyx in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8275)()

pandas/parser.pyx in pandas.parser.raise_parser_error (pandas/parser.c:20691)()

CParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4

Notice the CParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4

Now trying with the option:


In [12]:
bdf = pd.read_csv('bad_dataset.csv', error_bad_lines=False) #i.e. "I don't want a parser check"


Skipping line 3: expected 3 fields, saw 4


In [14]:
bdf


Out[14]:
Val1 Val2 Val3
0 1 1 1
1 3 3 3

So now line (2,2,2,2) is dropped. N.B.: Notice the change of the indices


4.3 Reading Big Datasets

If a given dataset is too big for the available memory, pandas can split it to chunks and then use csv module to read over the various chunks.

For example let's load a dataset with ~1k rows.


In [53]:
sales = pd.read_csv('csv_datasets/sales2009.csv');
sales.count()


Out[53]:
Transaction_date    998
Product             998
Price               998
Payment_Type        998
Name                998
City                998
State               997
Country             998
Account_Created     998
Last_Login          998
Latitude            998
Longitude           998
dtype: int64

using the chunksize=n option we split the csv in a list of df's each one containing 10 rows.

N.B.: What this returns is not actually a list, but an iterator


In [62]:
sales_chunks = pd.read_csv('csv_datasets/sales2009.csv', header=None, 
                           names=['Transaction_date','Product', 
                                  'Price', 'Payment_Type', 'Name', 
                                  'City', 'State', 'Country', 
                                  'Account_Created','Last_Login',
                                  'Latitude', 'Longitude'], chunksize=100);
sales_chunks


Out[62]:
<pandas.io.parsers.TextFileReader at 0x11555d050>

In [63]:
for chunk in sales_chunks:
    print chunk.shape


(100, 12)
(100, 12)
(100, 12)
(100, 12)
(100, 12)
(100, 12)
(100, 12)
(100, 12)
(100, 12)
(99, 12)

so now we have fixed the size of the chunk to 100 rows.

We can also let Panda's dynamically choose the size by the iterator


In [64]:
sales_chunks2 = pd.read_csv('csv_datasets/sales2009.csv', header=None, 
                           names=['Transaction_date','Product', 
                                  'Price', 'Payment_Type', 'Name', 
                                  'City', 'State', 'Country', 
                                  'Account_Created','Last_Login',
                                  'Latitude', 'Longitude'], iterator=True);

for chunk in sales_chunks2:
    print chunk.shape


(999, 12)

In [ ]: