Data Munging with Pandas

Table of Contents

1. Let's load the dataset
2. Let's plot the dataset
    2.1  scatter_matrix
    2.2  Bar plot
    2.3  Histogram
    2.4  Box plot
    2.5  Area plot
    2.6  Scatter plot
    2.7  Hexbin plot
    2.8  Pie plot
    2.9  Missing data on plots
    2.10 Density/KDE plots
    2.11 Andrews Curves
    2.12 Parallel Coordinates
    2.13 Lag plot
    2.14 Autocorrelation plot
    2.15 Bootstrap plot
    2.16 RadVis plot
3. Data holders
    3.1 Series
        3.1.1 Series from ndarray
        3.1.2 Series from dict
        3.1.3 Series from scalar
    3.2 DataFrame
        3.2.1 DataFrame from dict of Series o dicts
        3.2.2 DataFrame from dict of ndarrays
        3.2.3 DataFrame from structured/record arrays
        3.2.4 DataFrame from lists of dicts 
        3.2.5 DataFrame from dict of tuples
        3.2.6 DataFrame from Series
4. Dealing with missing data
    4.1 Automatic Formating of date strings
    4.2 Handle of NaNs
        4.2.1 Find NaN values
            4.2.1.1 NaN and None
        4.2.2 Fill NaN with fixed value
        4.2.3 Fill NaN with mean or median
        4.2.4 Drop NaN rows
        4.2.5 Calculations with NaN
    4.3 Reading Bad Input files
    4.4

Let's focus on the Pandas dataframe Munging stuff

The modules that we'll need:



In [4]:

    
%matplotlib inline



In [5]:

    
import numpy as np



In [6]:

    
from sklearn import tree



In [7]:

    
import pandas as pd

1. Let's load the dataset

We'll load one of the example datasets from sklearn!



In [8]:

    
from sklearn import datasets



In [9]:

    
iris = datasets.load_iris()

This is a sklearn dataset including the

.DESCR : general description
.data : all features
.features_names : names of features
.target : target values expressed as values / numbered classes
.target_names : names of the target classes
.shape : can be applied both on .data and .target and give the (row,column) tuple



In [7]:

    
iris.DESCR;
iris.data;
iris.feature_names









    Out[7]:





['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']



In [8]:

    
iris.target;
iris.target_names









    Out[8]:





array(['setosa', 'versicolor', 'virginica'], 
      dtype='|S10')



In [9]:

    
iris.data.shape









    Out[9]:





(150, 4)



In [10]:

    
type(iris.data)









    Out[10]:





numpy.ndarray

Let's use pandas' to convert the ndarray into dataframe, keeping the column names!



In [11]:

    
dataframe = pd.DataFrame(iris.data, columns=iris.feature_names)



In [12]:

    
dataframe.head()









    Out[12]:






  
    
      
      sepal length (cm)
      sepal width (cm)
      petal length (cm)
      petal width (cm)
    
  
  
    
      0
      5.1
      3.5
      1.4
      0.2
    
    
      1
      4.9
      3.0
      1.4
      0.2
    
    
      2
      4.7
      3.2
      1.3
      0.2
    
    
      3
      4.6
      3.1
      1.5
      0.2
    
    
      4
      5.0
      3.6
      1.4
      0.2



In [13]:

    
target_df = pd.DataFrame(iris.target, columns=['Species'])



In [14]:

    
target_df.tail(2)

2. Let's plot the dataset!

2.1 Scatter_matrix

Using pandas' scatter_matrix function. This is similar to R's pairs and makes a quick scatter plot map for the quantitative features of the dataframe

First we'll make a list named colors as a book-keeper. Then to visualise the different target categories (i.e. the Species) we'll use the target numerical categorisation (0,1,2) to differently colour each row of the dataframe based on the target variable



In [15]:

    
colors = list()



In [16]:

    
palette = { 0: "red", 1: "green", 2:"blue"}



In [17]:

    
for c in np.nditer(iris.target):
    colors.append(palette[int(c)])  # c is 0,1 or 2 and we append red, green blue



In [18]:

    
scatterplot = pd.scatter_matrix(dataframe, 
                               alpha    = 0.3,
                               figsize  = (10,10),
                               diagonal = 'hist', #'kde'
                               color    = colors, 
                               marker   = 'o',
                               grid     = True)

There are various plot methods inside pandas.

Now to make the scatter plot matrix

2.2 A bar plot



In [19]:

    
dataframe.plot(kind='bar', stacked=True, figsize=(10,8)) #barh









    Out[19]:





<matplotlib.axes._subplots.AxesSubplot at 0x1082d4e50>

2.3 Histogram (for one column, for all, or split by diff )



In [20]:

    
dataframe['sepal length (cm)'].plot(kind='hist', color='red', alpha=0.3, figsize=(10,8))









    Out[20]:





<matplotlib.axes._subplots.AxesSubplot at 0x108a73290>



In [21]:

    
dataframe.plot(kind='hist', alpha=0.3, orientation='horizontal', 
               cumulative=True, bins=15, figsize=(10,8))









    Out[21]:





<matplotlib.axes._subplots.AxesSubplot at 0x109da9910>



In [22]:

    
dataframe.diff().hist(color='k', alpha=0.5, bins=50, figsize=(10,8))
                    # diff() gets the difference of all
                    # rows with row-0









    Out[22]:





array([[<matplotlib.axes._subplots.AxesSubplot object at 0x10a3a8210>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x10a580490>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x10a5fd250>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x10a65e610>]], dtype=object)

2.4 Box plots

Let's remind here that box plot is a convenient way of depicting groups of numerical data and their quartiles.

The first quartile is the 25th percentile (splits off the lowest 25% of data from the highest 75%), the second quartile (median) is the 50th percentile (cuts the dataset in half) and the third quartile is the upper quartile or the 75th percentile (splits off the highest 25% from the lowest 75%). IRQ (interquantile range) is the difference between upper and lower quartiles IRQ=Q3-Q1).

The line is the median (Q2 or 50%). The box represents the IRQ, the from median to highest values is the Q2 to Q3 region, while the lowest is the Q1 to Q2 region.

      _____   -> maximum
        |
        |
       ---    -> third quartile
      |   |
      |___|   -> median (Q2)
      |   |
      |   |
       ---    -> first quartile
        |
        |
      _____   -> minimum

Outliers --> 3 x IRQ

Suspected Outliers 1.5 x IRQ



In [23]:

    
dataframe.plot(kind='box', vert=False, by='X', figsize=(10,8))









    Out[23]:





<matplotlib.axes._subplots.AxesSubplot at 0x10a74bf90>

2.5 Area plots!



In [24]:

    
dataframe.plot(kind='area', alpha=0.6, figsize=(10,8))









    Out[24]:





<matplotlib.axes._subplots.AxesSubplot at 0x10adb1e90>

2.6 Scatter Plots

Defining the x and y axis
plotting two sets with the ax=ax
palette height



In [25]:

    
dataframe.plot(kind='scatter', x='sepal length (cm)', y='sepal width (cm)', figsize=(10,8))









    Out[25]:





<matplotlib.axes._subplots.AxesSubplot at 0x10a3c1650>



In [26]:

    
ax = dataframe.plot(kind='scatter', x='sepal length (cm)', y='sepal width (cm)', 
                    color='DarkBlue', label='Group 1');
dataframe.plot(kind='scatter', x='petal length (cm)', y='petal width (cm)', 
                    color='DarkGreen', label='Group 2' , figsize=(10,8),
                    ax=ax); # ax = ax to overlay
                            # if not set 2 plots made









    Out[26]:





<matplotlib.axes._subplots.AxesSubplot at 0x10b11c190>



In [27]:

    
dataframe.plot(kind='scatter', x='petal length (cm)', y='petal width (cm)', 
               c='sepal width (cm)', s=50, figsize=(10,8))









    Out[27]:





<matplotlib.axes._subplots.AxesSubplot at 0x10b477f50>

2.7 Hexbin

Hexbin are alternative to scatter plots if the data are too dense to plot individually



In [28]:

    
dataframe.plot(kind='hexbin', x='petal length (cm)', y='petal width (cm)', 
               gridsize=25, figsize=(10,8))









    Out[28]:





<matplotlib.axes._subplots.AxesSubplot at 0x10b706e50>

2.8 Pie plots

Pies are good for series



In [29]:

    
series = pd.Series(3 * np.random.rand(4), index=['a', 'b', 'c', 'd'], name='series')
series.plot(kind='pie', figsize=(10,10))









    Out[29]:





<matplotlib.axes._subplots.AxesSubplot at 0x10b64a910>

2.9 Missing data

Depending on the plot type pandas handle by default the NAs

Plot type	Handling
Line	Leave gaps at NA
Line stacked	Fill 0's
Bar	Fill 0's
Scatter	Drop NA
Histogram	Drop NA
Box	Drop NA
Area	Fill 0's
KDE (density)	Fill 0's
Hexbin	Drop NA
Pie	Fill 0

If this is not the wanted behaviour use: fillna() or dropna() before plotting

2.10 Density/KDE Plot



In [30]:

    
dataframe.ix[:,0:2].plot(kind='kde', figsize=(10,8))
# this is to take all rows for the first two columns of the dataframe (or df.iloc[:,0:2])









    Out[30]:





<matplotlib.axes._subplots.AxesSubplot at 0x10b9d4790>

2.11 Andrews Curves

They are used to plot multivariate data as a large number of curves that are created using the atributes of samples as coefficients for Fourier series. By coloring these curves per class it is possible to visualize data clustering.

It is a good way to visualize structure in high-dimensional data. Each data point $x={x_1 , x_2, ..., x_d}$ defines a finite Fourrier series:

$f_{x}(t) = \frac{x_1}{\sqrt{2}} + x_2 sin(t) + x_3 cos(t) + x_4 sin(2t) + x_5 cos(2t)+...$

This function is plotted for $-\pi < t < \pi$. Therefore, each data point may be viewed as a line between $-\pi$ and $\pi$. The formula can be thought of as the projection of the data point onto the vector

$(\frac{1}{\sqrt{2}}, sin(t), cos(t), sin(2t), cos(2t),...)$

Each df row is a line per category. On the horizontal axis is the $-\pi < t < \pi$ and on the vertical the $f_x(t)$.

Curves belonging to samples of the same class will usually be closer together and form larger structures!



In [31]:

    
from pandas.tools.plotting import andrews_curves



In [32]:

    
new_df = dataframe;
new_df['Species']=target_df;
new_df.head(5)









    Out[32]:






  
    
      
      sepal length (cm)
      sepal width (cm)
      petal length (cm)
      petal width (cm)
      Species
    
  
  
    
      0
      5.1
      3.5
      1.4
      0.2
      0
    
    
      1
      4.9
      3.0
      1.4
      0.2
      0
    
    
      2
      4.7
      3.2
      1.3
      0.2
      0
    
    
      3
      4.6
      3.1
      1.5
      0.2
      0
    
    
      4
      5.0
      3.6
      1.4
      0.2
      0



In [33]:

    
andrews_curves(new_df, "Species")









    Out[33]:





<matplotlib.axes._subplots.AxesSubplot at 0x10bb5eb90>

2.12 Parallel Coordinates

Parallel coordinates is a way of visualising high-dimensional geometry and analysing multivariate data.

To show a set of points in a n-dimensional space, the horizontal axis is split consisting n parallel vertical (typically) lines. A point in the n-dimensional space is represented as a polyline with vertices on the parallel axes; the position of the vertex on the i-th axis corresponds to the i-th coordinate of the point.

This type of visualisation is closely related to time series visualisation, with the difference that it is applied on data that have no time dependance.

Three points must be taken into account when the plot is used for statistical inference:

The order of the axis is critical for finding features (many reorderings are done in practice).
The rotation of the axis is a translation in the parallel coordinates; if the lines intersected outside the parallel axes, it can be translated between them by rotations (i.e. 180$\circ$).
The scaling is necessary since the plot is based on interpolation of consecutive pairs of variables. Thus the variables must be on a common scale and orthogonal to each parallel axis.

Points that tend to cluster will appear closer together.



In [34]:

    
from pandas.tools.plotting import parallel_coordinates
parallel_coordinates(new_df, "Species")









    Out[34]:





<matplotlib.axes._subplots.AxesSubplot at 0x10c307f50>

2.13 Lag plot

Lag plots are used to identify randomness in data. Random data should not exhibit any structure in the lag plot

Non-random structure implies that data are not random.

A lag is a fixed time displacement. For example, given a dataset $Y1,~Y2~...,~Yn$, $Y2$ and $Y7$ have lag $5$ since $7 - 2 = 5$. Lag plots can be generated for any arbitrary lag, although the most commonly used lag is $1$.



In [35]:

    
from pandas.tools.plotting import lag_plot
lag_plot(new_df)









    Out[35]:





<matplotlib.axes._subplots.AxesSubplot at 0x10c788490>

2.14 Autocorrelation Plot

This is to check randomness in series. It computes the autocorrelations for data values at varying time lags.

If series are random, the autocorrelation should be near zero for any and all time-lag separation.

If time series are non-random then one or more of the autocorrelations will be significantly non-zero.

The horizontal lines correspond to the 95% and 99% confidence bands (dashed = 99%)

definitions:

Vertical axis : autocorrelation coefficient: $R_{h}=C_{h}/ C_{0}$,

where $C_{h}$ is the autocovariance :

$$C_{h} = \frac{1}{N} \sum_{t=1}^{N-h}(Y_{t}-\bar{Y})(Y_{t+h}-\bar{Y})$$

and $$C_{0} = \frac{\sum_{t=1}^{N}(Y_{t}-\bar{Y})^{2}}{N}$$

$-1 \leq R_{h}\leq 1$

N.B. Some times the the autocovariance function is given as

$$C_{h} = \frac{1}{N-h} \sum_{t=1}^{N-h}(Y_{t}-\bar{Y})(Y_{t+h}-\bar{Y})$$

which has less bias (1/N-h)

Horizontal Axis: time lag $h$, $h=(1,2,...)$
The above line containes several reference lines :
- the middle line is at zero
- the other four lines are the 95% and 99% confidence bands. Note that there are two distinct formulas for generating confidence bands.
  - if the autocorrelation plot is being used to test randomness (no time dependence in data) the following formula is recomended $$CL = \pm\frac{z_{1-a/2}}{\sqrt{N}}$$, where $N$ is the sample size, $z$ the cumulative distribution function of the standard normal distribution and $\alpha$ the the significance level
  - if autocorrelation plot is made using the ARIMA model fitting then the the confidence bands are generated by : $$CL= \pm z_{1-a/2} \cdot \sqrt{ \frac{1}{N}(1+2\sum_{i}^{k}y_{i}^2) } $$ where $k$ is the lag, $N$ the sample size, $z$ the cumulative distribution function of the standard normal distribution and $\alpha$ the significance level.



In [36]:

    
from pandas.tools.plotting import autocorrelation_plot
data = pd.Series(0.7 * np.random.rand(1000) +
                 0.3 * np.sin(np.linspace(-9 * np.pi, 9 * np.pi, num=1000)))

autocorrelation_plot(data)









    Out[36]:





<matplotlib.axes._subplots.AxesSubplot at 0x10c7adf90>

2.15 Bootstrap plot

Bootstrap plots are made to assess the uncertainty of a statistic metric (mean, median, midrange etc)

To make a bootstrap uncertainty estimate for one metric given a dataset, a subset of the sample of size les or equal to the size of the dataset is generated from the data and the statistic is calculated. This subset is generated with replacement, and thus each data point can be resampled multiple times or not at all. The process is repeated, usually 500 to 1000 times. The computed values for the statistic form an estimate of the sampling distribution of the statistics.

For example, in a sample of 50 values you want to bootstrap the median. You generate a subset of the sample with 50 elements and calculate the median. Repeat this 500 times, so that you have at least 500 values for the median. To calculate the 90% confidence interval for the median, the sample of medians are sorted into ascending order and the value of the 25th median is the lower confidence limit, while the value of 475th median is the upper confidence limit (first and third quartiles).

The plots generated are the series and the histograms for mean, median and mid-range. For uniformly distributed values, mid-range has the smallest variance.



In [37]:

    
from pandas.tools.plotting import bootstrap_plot
data = pd.Series(np.random.rand(1000));
    
bootstrap_plot(data, size=50, samples=500, color='grey')









    Out[37]:

2.16 RadViz Plot

RadVis is a way to visualise multivariate data; to visualise n-dimensional points into a two dimensional space. In this case, the mapping is not linear. The technique is based on a simple spring tension minimization algorithm.

Imagine $n$ points, $S_1,S_2,...S_n$ arranged to be equally spaced around the cirumference of the unit circle. Now suppose a set of $n$ springs being fixed at one end to each of these points and the other ends to a puck. Finally, assume the stiffness constant (as in Hooke's law) of the j$th$ string in the $x_{ij}$ for one of the data points $i$.

If the puck is released and allowed to reach equilibrium position, the coordinates of this position, $(u_i, v_j)^{T}$, are the projection in the two dimensional space of the point $(x_{i1}, x_{i2}, ... x_{in})^{T}$ in the $n$-dimensional space. If the $(u_i, v_j)^{T}$ is computed for $i=1,2...n$ and the points plotted, a visualisation of the $n$-dimensional dataset in the two dimensions is achieved.

To understand more about the projection of the $n$-dimensional space into the two dimensional one, consider the forces acting on the puck. When the puck is in equilibrium, there are no resultant forces acting on it (their sum is 0). Denoting the position vectors of $S_1$ to $S_n$ by $\mathbf{S_1}$ to $\mathbf{S_n}$ and putting $\mathbf{u_{i}}=(u_i, v_i)^{T}$ we have:

$$\sum_{j=1,n}=(\mathbf{S}_{j}-\mathbf{u}_{i})x_{ij} = 0 $$

which when solved for $\mathbf{u}_{i}$ :

$$\mathbf{u}_{i} = \sum_{j=1,n}w_{ij}\mathbf{S}_j$$

where

$$ w_{ij} = \left( \sum_{j=1,n} x_{ij} \right)^{-1} x_{ij}$$

This means that for each $i$ (i.e. dataframe row), $\mathbf{u_i}$ is the weighted mean of the $\mathbf{S_j}$'s whose weights are the $n$ variables for case $i$ normalised to unity.

N.B. This normalisation makes the mapping $\mathcal{R}^{n}\rightarrow \mathcal{R}^{2}$ non-linear.



In [38]:

    
from pandas.tools.plotting import radviz

radviz(new_df, 'Species')









    Out[38]:





<matplotlib.axes._subplots.AxesSubplot at 0x10ce44590>

3. Data Holders

Now let's do see some 'data wrangling' commands with pandas framework. These are to re-shape, select and extract information from a given dataset.

Let's dive a bit into the different data structures

3.1. Series

Series is in fact an 1D array with labels that holds a specific data type (double, int, string etc). The labels (axis labels) are referred to as index.

To create a Series you need data and an index definition.

Data can be:
- dictionary
- ndarray
- scalar value (i.e. 5)
Index can be:
- any list of strings

If no index is included in the definition, the indices will be numerical from 0 to len(ndarray)-1.

3.1.1. Series from ndarray



In [39]:

    
s = pd.Series(np.random.randn(5), index = ['a','b','c','d','e']) 
                                    # 5 is the # of observations
                                    # only one axis in tuple -> only 1D series



In [40]:

    
s









    Out[40]:





a    0.418756
b   -1.405162
c    0.525731
d    2.026037
e   -0.360525
dtype: float64



In [41]:

    
s.plot(kind='kde')









    Out[41]:





<matplotlib.axes._subplots.AxesSubplot at 0x10d079350>

this does not work for a 2D ndarray



In [42]:

    
#s = pd.Series(np.random.randn(5,5)) # this returns an exception : Data must be 1-dimensional

3.1.2. Series from dictionary

Generating from a dictionary, the keys are taken as index. If extra index argument is given this the keys values are overwritten for the matching cases, while if the index argument is larger than the len(dict.keys()) then NaN's are created.



In [43]:

    
d = {'a':125., 'b':500, 'c':400, 'e':240}; d









    Out[43]:





{'a': 125.0, 'b': 500, 'c': 400, 'e': 240}



In [44]:

    
s2 = pd.Series(d); s2









    Out[44]:





a    125
b    500
c    400
e    240
dtype: float64



In [45]:

    
s2.plot(kind='bar', alpha=0.8, color='pink', ylim=(0, 1000))









    Out[45]:





<matplotlib.axes._subplots.AxesSubplot at 0x10d153a90>



In [46]:

    
s3 = pd.Series(d, index=['a','b','c','d','e','f']); s3









    Out[46]:





a    125
b    500
c    400
d    NaN
e    240
f    NaN
dtype: float64



In [47]:

    
s3.plot(kind='bar', alpha=0.8, color='cyan', ylim=(0, 1000))









    Out[47]:





<matplotlib.axes._subplots.AxesSubplot at 0x10cf27750>

So indeed the series are filled with NaN's and the plots with 0's!.

3.1.3. Series from scalar

Now introducing series with scalar input, the values will all be identical.



In [48]:

    
s4 = pd.Series(5, index=['a','b','c']); s4









    Out[48]:





a    5
b    5
c    5
dtype: int64

Series can behave as ndarrays or as dictionaries including all the operations! For example



In [49]:

    
s2[3]









    Out[49]:





240.0



In [50]:

    
s2[:4]









    Out[50]:





a    125
b    500
c    400
e    240
dtype: float64



In [51]:

    
s2.median(); s2.mean(); s2.std(); s2.quantile(0.50)









    Out[51]:





320.0



In [52]:

    
s2[s2>s2.median()]









    Out[52]:





b    500
c    400
dtype: float64



In [53]:

    
s2['a']









    Out[53]:





125.0



In [54]:

    
'a' in s2









    Out[54]:





True

Also vectorised operations can be done. The difference with ndarray is that the result is aligned automatically based on the labels.



In [55]:

    
s+s2; s2*2; np.exp(s)









    Out[55]:





a    1.520070
b    0.245327
c    1.691695
d    7.583971
e    0.697310
dtype: float64



In [56]:

    
s5 = pd.Series(np.random.randn(5), name='something'); s5.name









    Out[56]:





'something'

3.2 DataFrames

On the other side, we have DataFrames. A dataframe is like an Excel spreadsheet or a SQL table. Dataframes can accept various many different kinds of inputs

Dictionay of 1D ndarrays, lists, dict or Series
2D numpy.ndarray
structured or record ndarray
series
another DataFrame

Apart from data, the input arguments can be columns and index.

3.2.1 DataFrame from dict of Series or dicts

The result index will be the union of the indexes of the various Series. Nested dicts will be converted to Series first.



In [57]:

    
d = {'one' : pd.Series([1.,2.,3.], index = ['a','b','c']),
     'two' : pd.Series([1., 2., 3., 4.], index=['a','b','c','d']) }; d









    Out[57]:





{'one': a    1
 b    2
 c    3
 dtype: float64, 'two': a    1
 b    2
 c    3
 d    4
 dtype: float64}



In [58]:

    
df = pd.DataFrame(d); df



In [59]:

    
pd.DataFrame(d, index = ['d','b','a']) # rearanging stuff



In [60]:

    
pd.DataFrame(d, index = ['d','b','a'], columns=['two' , 'three']) # there is no 'three' col

The index and columns can be accessed directly:



In [61]:

    
df.index









    Out[61]:





Index([u'a', u'b', u'c', u'd'], dtype='object')



In [62]:

    
df.columns









    Out[62]:





Index([u'one', u'two'], dtype='object')

3.2.2 DataFrame from dict of ndarrays/lists



In [63]:

    
d = {'one' : [1.,2.,3.,4.],
     'two' : [4.,3.,2.,1.]}; d









    Out[63]:





{'one': [1.0, 2.0, 3.0, 4.0], 'two': [4.0, 3.0, 2.0, 1.0]}



In [64]:

    
pd.DataFrame(d)



In [65]:

    
pd.DataFrame(d, index=['a','b','c','d'])

3.2.3 DataFrame from structured/record arrays

These arrays are made using named indices.

For example



In [66]:

    
x = np.array([(1,2,'Hello'), (2,3,'World') ],
             dtype=[('foo','i4'),('bar','f4'), ('baz', 'S10')])   # structured array



In [67]:

    
x[0]









    Out[67]:





(1, 2.0, 'Hello')



In [68]:

    
x['foo']









    Out[68]:





array([1, 2], dtype=int32)



In [69]:

    
x.dtype.names









    Out[69]:





('foo', 'bar', 'baz')



In [70]:

    
x[['foo','bar']]  # access multiple fields simultaneously









    Out[70]:





array([(1, 2.0), (2, 3.0)], 
      dtype=[('foo', '<i4'), ('bar', '<f4')])



In [71]:

    
y = np.rec.array([(1,2.,'Hello'),(2,3.,"World")], 
                          dtype=[('foo', 'i4'),('bar', 'f4'), ('baz', 'S10')]) ## record array



In [72]:

    
y[0]









    Out[72]:





(1, 2.0, 'Hello')



In [73]:

    
y['foo']









    Out[73]:





array([1, 2], dtype=int32)



In [74]:

    
y.dtype.names









    Out[74]:





('foo', 'bar', 'baz')



In [75]:

    
y[['foo','bar']]  # access multiple fields simultaneously









    Out[75]:





array([(1, 2.0), (2, 3.0)], 
      dtype=[('foo', '<i4'), ('bar', '<f4')])



In [76]:

    
y[1].baz  # access one element









    Out[76]:





'World'

So now making a DF from such arrays:



In [77]:

    
data = np.zeros((2,), dtype=[('A', 'i4'),('B', 'f4'),('C', 'a10')])



In [78]:

    
data[:] = [(1,2.,'Hello'), (2,3.,"World")]



In [79]:

    
pd.DataFrame(data)



In [80]:

    
pd.DataFrame(data, index=['first', 'second'])



In [81]:

    
pd.DataFrame(data, columns=['C', 'A', 'B'])

3.2.4 DataFrame from list of dicts

Now make a list of dictionaries



In [82]:

    
data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]



In [83]:

    
pd.DataFrame(data2)

3.2.5 DataFrame from dict of tuples

Now make a dict containing tuples:



In [84]:

    
pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
                ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
                ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
                ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
                ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})

3.2.6 DataFrame from Series

This is including various series for the creation of a DF

4. Dealing with dates and problematic data

Now let's make one dataframe and see what we can do when we have dates and NaN values in it.



In [91]:

    
# Create the csv file
out = open("file.csv", 'write');
s = '''Date,Temperature_city_1,Temperature_city_2,Temperature_city_3,Which_destination\n
20140910,80,32,40,1\n
20140911,100,50,36,2\n
20140912,102,55,46,1\n
20140912,60,20,35,3\n
20140914,60,,32,3\n
20140914,,57,42,2''';
out.write(s);
out.close();



In [92]:

    
# read it in pandas
df_csv = pd.read_csv("file.csv", sep=','); df_csv









    Out[92]:






  
    
      
      Date
      Temperature_city_1
      Temperature_city_2
      Temperature_city_3
      Which_destination
    
  
  
    
      0
      20140910
      80
      32
      40
      1
    
    
      1
      20140911
      100
      50
      36
      2
    
    
      2
      20140912
      102
      55
      46
      1
    
    
      3
      20140912
      60
      20
      35
      3
    
    
      4
      20140914
      60
      NaN
      32
      3
    
    
      5
      20140914
      NaN
      57
      42
      2

4.1 Automatic Formating of "date strings"

The read_csv() function has an option that specifyies which column contains dates. Then these dates can be parsed and formated accordingly.

Just include the argiment parse_dates=[list of columns with dates]



In [93]:

    
# read it in pandas and process dates
df_csv2 = pd.read_csv("file.csv", sep=',', parse_dates=[0]); df_csv2









    Out[93]:






  
    
      
      Date
      Temperature_city_1
      Temperature_city_2
      Temperature_city_3
      Which_destination
    
  
  
    
      0
      2014-09-10
      80
      32
      40
      1
    
    
      1
      2014-09-11
      100
      50
      36
      2
    
    
      2
      2014-09-12
      102
      55
      46
      1
    
    
      3
      2014-09-12
      60
      20
      35
      3
    
    
      4
      2014-09-14
      60
      NaN
      32
      3
    
    
      5
      2014-09-14
      NaN
      57
      42
      2

4.2 Handle NaNs

There are various ways of dealing with NaN values within pandas. But first one must find them!

4.2.1 Find NaN Values

There are various ways of finding NaN values. Mainly these are revolting around the isnull() and notnull() functions.



In [38]:

    
df_nulls = pd.read_csv("file.csv", sep=',', parse_dates=[0]); df_nulls









    Out[38]:






  
    
      
      Date
      Temperature_city_1
      Temperature_city_2
      Temperature_city_3
      Which_destination
    
  
  
    
      0
      2014-09-10
      80
      32
      40
      1
    
    
      1
      2014-09-11
      100
      50
      36
      2
    
    
      2
      2014-09-12
      102
      55
      46
      1
    
    
      3
      2014-09-12
      60
      20
      35
      3
    
    
      4
      2014-09-14
      60
      NaN
      32
      3
    
    
      5
      2014-09-14
      NaN
      57
      42
      2



In [109]:

    
pd.isnull(df_nulls['Temperature_city_1'])  # looking for nulls in one column









    Out[109]:





0    False
1    False
2    False
3    False
4    False
5     True
Name: Temperature_city_1, dtype: bool



In [110]:

    
df_nulls['Temperature_city_1'].isnull()   # same thing different format









    Out[110]:





0    False
1    False
2    False
3    False
4    False
5     True
Name: Temperature_city_1, dtype: bool



In [111]:

    
df_nulls.notnull()   # look for actual values (NON NaN's) in the whole df









    Out[111]:






  
    
      
      Date
      Temperature_city_1
      Temperature_city_2
      Temperature_city_3
      Which_destination
    
  
  
    
      0
      True
      True
      True
      True
      True
    
    
      1
      True
      True
      True
      True
      True
    
    
      2
      True
      True
      True
      True
      True
    
    
      3
      True
      True
      True
      True
      True
    
    
      4
      True
      True
      False
      True
      True
    
    
      5
      True
      False
      True
      True
      True

4.2.1.1 NaN and None

It is very important to remember that NaN is not None :



In [114]:

    
print np.nan;
np.nan == None









    



nan






    Out[114]:





False

Also NaN are not comparable, this means that



In [115]:

    
np.nan == np.nan









    Out[115]:





False

Pandas/Numpy use this to separate None and NaN, since



In [116]:

    
None == None









    Out[116]:





True

4.2.2 Fill NaN with fixed value

One may choose to replace NaN with a certain fixed value (fillna(val))



In [97]:

    
# read it in pandas and process dates
df_csv3 = pd.read_csv("file.csv", sep=',', parse_dates=[0]);
df_csv3.fillna(-1)









    Out[97]:






  
    
      
      Date
      Temperature_city_1
      Temperature_city_2
      Temperature_city_3
      Which_destination
    
  
  
    
      0
      2014-09-10
      80
      32
      40
      1
    
    
      1
      2014-09-11
      100
      50
      36
      2
    
    
      2
      2014-09-12
      102
      55
      46
      1
    
    
      3
      2014-09-12
      60
      20
      35
      3
    
    
      4
      2014-09-14
      60
      -1
      32
      3
    
    
      5
      2014-09-14
      -1
      57
      42
      2

keep in mind that this does not replace the values in the original DataFrame, but it must be stored in a copy.

So let's replace and see the values:



In [99]:

    
df_csv4 = df_csv3.fillna(-1); df_csv4









    Out[99]:






  
    
      
      Date
      Temperature_city_1
      Temperature_city_2
      Temperature_city_3
      Which_destination
    
  
  
    
      0
      2014-09-10
      80
      32
      40
      1
    
    
      1
      2014-09-11
      100
      50
      36
      2
    
    
      2
      2014-09-12
      102
      55
      46
      1
    
    
      3
      2014-09-12
      60
      20
      35
      3
    
    
      4
      2014-09-14
      60
      -1
      32
      3
    
    
      5
      2014-09-14
      -1
      57
      42
      2

the df_csv4 has indeed replaced NaN with -1, but what happened of df_csv3 ?



In [100]:

    
df_csv3









    Out[100]:






  
    
      
      Date
      Temperature_city_1
      Temperature_city_2
      Temperature_city_3
      Which_destination
    
  
  
    
      0
      2014-09-10
      80
      32
      40
      1
    
    
      1
      2014-09-11
      100
      50
      36
      2
    
    
      2
      2014-09-12
      102
      55
      46
      1
    
    
      3
      2014-09-12
      60
      20
      35
      3
    
    
      4
      2014-09-14
      60
      NaN
      32
      3
    
    
      5
      2014-09-14
      NaN
      57
      42
      2

df_csv3 is the original DF and has not replaced its NaN values.

4.2.2 Fill NaN with mean or Median

One can replace the NaN values with the column mean or median so to minimise the guessing error. This can easily be done by



In [104]:

    
print type(df_csv3.mean(axis=0));
df_csv3.mean(axis=0)









    



<class 'pandas.core.series.Series'>






    Out[104]:





Temperature_city_1    80.4
Temperature_city_2    42.8
Temperature_city_3    38.5
Which_destination      2.0
dtype: float64



In [105]:

    
df_csv5 = df_csv3.fillna(df_csv3.mean(axis=0)); df_csv5









    Out[105]:






  
    
      
      Date
      Temperature_city_1
      Temperature_city_2
      Temperature_city_3
      Which_destination
    
  
  
    
      0
      2014-09-10
      80.0
      32.0
      40
      1
    
    
      1
      2014-09-11
      100.0
      50.0
      36
      2
    
    
      2
      2014-09-12
      102.0
      55.0
      46
      1
    
    
      3
      2014-09-12
      60.0
      20.0
      35
      3
    
    
      4
      2014-09-14
      60.0
      42.8
      32
      3
    
    
      5
      2014-09-14
      80.4
      57.0
      42
      2

N.B. The argument axis=0 implies the calculation of the means that span the rows so the obtained means that they extend column-wise. The axis=1 spans columns and the row-wise results are obtained.

4.2.5 Drop NaN rows

One might choose to drop the NaN values this can be done by



In [106]:

    
df_csv6 = df_csv3.dropna(); df_csv6









    Out[106]:






  
    
      
      Date
      Temperature_city_1
      Temperature_city_2
      Temperature_city_3
      Which_destination
    
  
  
    
      0
      2014-09-10
      80
      32
      40
      1
    
    
      1
      2014-09-11
      100
      50
      36
      2
    
    
      2
      2014-09-12
      102
      55
      46
      1
    
    
      3
      2014-09-12
      60
      20
      35
      3

4.2.6 Interpolate NaN

Series and DataFrames have the interpolate method, which performs linear interpolation (by default) for the missing datapoints



In [39]:

    
df_nulls









    Out[39]:






  
    
      
      Date
      Temperature_city_1
      Temperature_city_2
      Temperature_city_3
      Which_destination
    
  
  
    
      0
      2014-09-10
      80
      32
      40
      1
    
    
      1
      2014-09-11
      100
      50
      36
      2
    
    
      2
      2014-09-12
      102
      55
      46
      1
    
    
      3
      2014-09-12
      60
      20
      35
      3
    
    
      4
      2014-09-14
      60
      NaN
      32
      3
    
    
      5
      2014-09-14
      NaN
      57
      42
      2



In [40]:

    
df_nulls.count()









    Out[40]:





Date                  6
Temperature_city_1    5
Temperature_city_2    5
Temperature_city_3    6
Which_destination     6
dtype: int64



In [41]:

    
df_nulls.interpolate().count() # first interpolate the NaN









    Out[41]:





Date                  6
Temperature_city_1    6
Temperature_city_2    6
Temperature_city_3    6
Which_destination     6
dtype: int64



In [43]:

    
df_nulls.interpolate()









    Out[43]:






  
    
      
      Date
      Temperature_city_1
      Temperature_city_2
      Temperature_city_3
      Which_destination
    
  
  
    
      0
      2014-09-10
      80
      32.0
      40
      1
    
    
      1
      2014-09-11
      100
      50.0
      36
      2
    
    
      2
      2014-09-12
      102
      55.0
      46
      1
    
    
      3
      2014-09-12
      60
      20.0
      35
      3
    
    
      4
      2014-09-14
      60
      38.5
      32
      3
    
    
      5
      2014-09-14
      60
      57.0
      42
      2

to see how this interpolation is done we can plot it



In [49]:

    
df_nulls.plot(title='no linear interpolation')
df_nulls.interpolate().plot(title='with linear interpolation')









    



---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-49-409a2547db52> in <module>()
      1 df_nulls.plot(title='no linear interpolation')
----> 2 df_nulls.interpolate(title='with linear interpolation').plot()

/usr/local/lib/python2.7/site-packages/pandas/core/generic.pyc in interpolate(self, method, axis, limit, inplace, downcast, **kwargs)
   2960                                           inplace=inplace,
   2961                                           downcast=downcast,
-> 2962                                           **kwargs)
   2963         if inplace:
   2964             if axis == 1:

/usr/local/lib/python2.7/site-packages/pandas/core/internals.pyc in interpolate(self, **kwargs)
   2490 
   2491     def interpolate(self, **kwargs):
-> 2492         return self.apply('interpolate', **kwargs)
   2493 
   2494     def shift(self, **kwargs):

/usr/local/lib/python2.7/site-packages/pandas/core/internals.pyc in apply(self, f, axes, filter, do_integrity_check, **kwargs)
   2457                                                  copy=align_copy)
   2458 
-> 2459             applied = getattr(b, f)(**kwargs)
   2460 
   2461             if isinstance(applied, list):

/usr/local/lib/python2.7/site-packages/pandas/core/internals.pyc in interpolate(self, method, axis, index, values, inplace, limit, fill_value, coerce, downcast, **kwargs)
    772                                      **kwargs)
    773 
--> 774         raise ValueError("invalid method '{0}' to interpolate.".format(method))
    775 
    776     def _interpolate_with_fill(self, method='pad', axis=0, inplace=False,

ValueError: invalid method 'linear' to interpolate.

4.2.5 Calculations with NaNs

Missing values do propagate naturally in operations with Pandas Objects



In [25]:

    
## Let's make some dfs with NaNs
a1 = pd.Series(np.random.randn(3), index = ['e','f','h']);
a2 = pd.Series(np.random.randn(5), index = ['a','c','e','f','h']);
a = {'one': a1, 'two': a2};
dfa = pd.DataFrame(a); 
print dfa

b1 = pd.Series(np.random.randn(2), index = ['e','f']);
b2 = pd.Series(np.random.randn(5), index = ['a','c','e','f','h']);
b3 = pd.Series(np.random.randn(5), index = ['a','c','e','f','h']);
b = {'one':b1, 'two':b2, 'three':b3};
dfb = pd.DataFrame(b);
print dfb









    



        one       two
a       NaN -0.265040
c       NaN -2.358900
e -0.249150  0.321340
f -1.049753  1.032773
h -1.049537 -0.657652
        one     three       two
a       NaN  0.183358 -1.778737
c       NaN -1.081133 -0.915654
e -1.025047 -0.667466  1.083392
f  0.160163 -0.613965  0.599008
h       NaN  0.211446 -0.900994



In [27]:

    
dfa+dfb ## NaN are propagated properly



In [29]:

    
(dfa+dfb).mean(axis=0) #mean for rows









    Out[29]:





one     -1.081893
three         NaN
two     -0.768093
dtype: float64

Functions like cumsum() (cumulative sum over requested axis) and cumprod() (cumulative product over requested axis) ignore the NaNs. Similarly the groupby()



In [34]:

    
dfa.cumsum()



In [37]:

    
dfa.groupby('one').mean()

4.3 Reading Bad Input Files

There is also the case - especially in the real world - that the dataset which is loaded is bad or erroeneous. When loading such dataset with load_csv() the program will exit with an error. A workaround is to ignore bad lines by the error_bad_lines option.

Assume that we have the following bad dataset:



In [1]:

    
f = open('bad_dataset.csv', 'write')
f.write('Val1,Val2,Val3\n')
f.write('1,1,1\n')
f.write('2,2,2,2\n') # 4 columns, 3 expected!
f.write('3,3,3\n')
f.close()

Now lets load it as always



In [11]:

    
bdf = pd.read_csv('bad_dataset.csv')









    



---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
<ipython-input-11-7138b6e8fd73> in <module>()
----> 1 bdf = pd.read_csv('bad_dataset.csv')

/usr/local/lib/python2.7/site-packages/pandas/io/parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
    472                     skip_blank_lines=skip_blank_lines)
    473 
--> 474         return _read(filepath_or_buffer, kwds)
    475 
    476     parser_f.__name__ = name

/usr/local/lib/python2.7/site-packages/pandas/io/parsers.pyc in _read(filepath_or_buffer, kwds)
    258         return parser
    259 
--> 260     return parser.read()
    261 
    262 _parser_defaults = {

/usr/local/lib/python2.7/site-packages/pandas/io/parsers.pyc in read(self, nrows)
    719                 raise ValueError('skip_footer not supported for iteration')
    720 
--> 721         ret = self._engine.read(nrows)
    722 
    723         if self.options.get('as_recarray'):

/usr/local/lib/python2.7/site-packages/pandas/io/parsers.pyc in read(self, nrows)
   1168 
   1169         try:
-> 1170             data = self._reader.read(nrows)
   1171         except StopIteration:
   1172             if nrows is None:

pandas/parser.pyx in pandas.parser.TextReader.read (pandas/parser.c:7544)()

pandas/parser.pyx in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7784)()

pandas/parser.pyx in pandas.parser.TextReader._read_rows (pandas/parser.c:8401)()

pandas/parser.pyx in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8275)()

pandas/parser.pyx in pandas.parser.raise_parser_error (pandas/parser.c:20691)()

CParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4

Notice the CParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4

Now trying with the option:



In [12]:

    
bdf = pd.read_csv('bad_dataset.csv', error_bad_lines=False) #i.e. "I don't want a parser check"









    



Skipping line 3: expected 3 fields, saw 4



In [14]:

    
bdf

So now line (2,2,2,2) is dropped. N.B.: Notice the change of the indices



In [ ]:

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

	Date	Temperature_city_1	Temperature_city_2	Temperature_city_3	Which_destination
0	20140910	80	32	40	1
1	20140911	100	50	36	2
2	20140912	102	55	46	1
3	20140912	60	20	35	3
4	20140914	60	NaN	32	3
5	20140914	NaN	57	42	2

	Date	Temperature_city_1	Temperature_city_2	Temperature_city_3	Which_destination
0	2014-09-10	80	32	40	1
1	2014-09-11	100	50	36	2
2	2014-09-12	102	55	46	1
3	2014-09-12	60	20	35	3
4	2014-09-14	60	NaN	32	3
5	2014-09-14	NaN	57	42	2

	Date	Temperature_city_1	Temperature_city_2	Temperature_city_3	Which_destination
0	True	True	True	True	True
1	True	True	True	True	True
2	True	True	True	True	True
3	True	True	True	True	True
4	True	True	False	True	True
5	True	False	True	True	True

	Date	Temperature_city_1	Temperature_city_2	Temperature_city_3	Which_destination
0	2014-09-10	80.0	32.0	40	1
1	2014-09-11	100.0	50.0	36	2
2	2014-09-12	102.0	55.0	46	1
3	2014-09-12	60.0	20.0	35	3
4	2014-09-14	60.0	42.8	32	3
5	2014-09-14	80.4	57.0	42	2

	one	three	two
a	NaN	NaN	-2.043777
c	NaN	NaN	-3.274554
e	-1.274197	NaN	1.404731
f	-0.889589	NaN	1.631781
h	NaN	NaN	-1.558646

	one	two
a	NaN	-0.265040
c	NaN	-2.623940
e	-0.249150	-2.302600
f	-1.298902	-1.269827
h	-2.348439	-1.927479

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2