Introduction to data manipulation and analysis with Pandas

The objective of this notebook is to make a brief overview of the pandas module for data analysis and manipulation. To present the different pandas tools,w e use as an example the hydrometeorological data from the famous sistema Cantareira reservoir (São Paulo/Brazil). The data files are located in the data directory of this repository.

Have fun :)

What is Pandas and why use it?

Pandas provides easy to use and intuitive data structures and high performance tools for data analysis. It has been designed for practical, real-world analysis.

The community have the objective to build the most powerful and flexible open source data analysis / manipulation tool available in any language.

You will find everything you need to know about Pandas at: http://pandas.pydata.org/

Topics covered

Object Creation: Series and Dataframe
Data Visualisation
Data Selection
Read data from file
Dealing with missing data
Setting in data structure
Basic operation and apply function
Merge/concatenate/join/append data structures
Resample
Grouping : split-apply-combine
Pivot Table
Writing to file
Basic plots

This presentation is inspired from the amazing "10 minutes to Pandas" tutorial (http://pandas.pydata.org/pandas-docs/stable/10min.html).

For further details look at the Pandas documentation: http://pandas.pydata.org/pandas-docs/stable/index.html

Let's fire up Pandas !

We will need numpy and matplotlib.pyplot as well.



In [ ]:

    
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Object Creation

Pandas supports three data structures:

Series (1 axis).
Dataframe (2 axis)
Panel (3 - N axis)

They are labeled arrays which can contain any kind of data types (int,float, string, objects ...). The labels are called index in pandas.

The main difference between numpy and pandas reside

For more information about data structure, please have a look at the documentation: http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro

Create a serie

We need:

an index, e.g. a list of string
an array, e.g. a numpy.ndarray

if no index is specified, pandas will by default use a list of interger.



In [ ]:

    
labels = ['a','b','c','d']
values = np.arange(4)



In [ ]:

    
s = pd.Series(values, index=labels)

Create a dataframe

A dataframe is a two-dimensional labeled array. There is various way to create a dataframe. for more information, you can have a look at: http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe

from a dictionnary of list



In [ ]:

    
dic = {'a':[1,2,3,4,5],
       'b':[10,20,30,40,50]
       }



In [ ]:

    
df = pd.DataFrame(dic)

Time Series / Date functionality

Pandas provide powerful tools to work with time series based upon numpy datetime64 and timedelta64 dtypes.



In [ ]:

    
rng = pd.date_range('1/1/2016', periods=100, freq='S')



In [ ]:

    
rng = pd.period_range('1/1/2016', '1/1/2017', freq='D')

To create a temporal serie



In [ ]:

    
s = pd.Series(range(100), index = rng)

Read (real) data from file.

Let's explore hydrometeorological data observed in the region of the Cantareira sistema (The main reservoir of the São Paulo megacity), more precisely the accumulated rainfall (mm/day) and the volume of water in (%) (100% correspond to the maximum capacity).

Note: In this example, the index of our dataframe is going to be a datetime object, as time series are recurrently used in atmospheric sciences. However, the index could be whatever labeled object, for example the ID of the stars for applications in astronomy.



In [ ]:

    
Path = './data/'
filename = "DataCantareira.csv"

Lets have a look at the amazing capabilities of the "read_csv" method! http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

This method provide so many options that you will probably never need to perform extra programing in order to read your data! (This is really timesaving for data analysis tasks)



In [ ]:

    
df = pd.read_csv(Path+filename)

In this case the read csv methods return directly a dataframe. As you can see the first column hold the date of each event. We could have directly defined the index to be a datetime object by passing the following argument to read csv.



In [ ]:

    
df = pd.read_csv( Path+filename, index_col =0, parse_dates=True)

We can verify that the index is a datetime index with the folowing code



In [ ]:

    
# type(df.index)

Visualising the data structure

Methods to quickly give a look at your data:

head
tail
columns
index
values
describe



In [ ]:

    
# df.describe()



In [ ]:

    
# df.columns



In [ ]:

    
# df.index

Selection

You can select items in your Pandas data structure in the same style than numpy array.

However, Pandas has his own optimized methods for data access:

.at
.iat
.loc
.iloc
.ix

Opa Numpy style

Get a column from the dataframe



In [ ]:

    
# df['volume (%)']

Get some rows



In [ ]:

    
# df[0:3]



In [ ]:

    
# df['2003-01-01':'2004-01-01']

Selection by label

For production code, it is recommended that you take advantage of the optimized pandas data access methods.



In [ ]:

    
# df.loc['2003-01-01':'2004-01-01','vol (%)']

Selection by Position

Select via the position of the passed integers



In [ ]:

    
# df.iloc[3:5,0:2]

Boolean Indexing

Let's select the events when the Cantareira reservatory was under 5% of its volume



In [ ]:

    
# df[df['volume (%)'] < 5]

Setting

Adding a new column



In [ ]:

    
# df['rain (m/dia)'] = df['rain (mm/dia)']/1000

Working with Missing data

Pandas uses the object np.nan to represent missing data. Important note: it is by default not included in the operations.

Methods to deal with missing data

.isnull
dropna
fillna

To drop the NaN from the dataframe



In [ ]:

    
# df.dropna(how='any')

To fill NaN with a value



In [ ]:

    
# df.fillna(value=df.mean())

To interpolate where their is missing data



In [ ]:

    
# df.interpolate("linear")

Operations

Pandas includes a lot of methods to perform operations along an axis.

count: Number of non-null observations
sum: Sum of values
mean: Mean of values
mad: Mean absolute deviation
median: Arithmetic median of values
min: Minimum
max: Maximum
mode: Mode
abs: Absolute Value
prod: Product of values
std: Bessel-corrected sample standard deviation
var: Unbiased variance
sem: Standard error of the mean
skew: Sample skewness (3rd moment)
kurt: Sample kurtosis (4th moment)
quantile: Sample quantile (value at %)
cumsum: Cumulative sum
cumprod: Cumulative product
cummax: Cumulative maximum
cummin: Cumulative minimum

Perform the mean of each dataframe columns.



In [ ]:

    
# df.mean(axis=0)

Apply

You can also pass a function along an axis with the apply method. This method is very efficient to iterate along the axis, much faster than a for loop for example.



In [ ]:

    
# df['rain (mm/dia)'].apply(np.sqrt)

Merge data structures

Pandas provide different methods to merge Series, Dataframe and Paneis:

Merge
join
concat
append

Let's merge our dataframe with the observed temperature from external file.



In [ ]:

    
Path = "./data/"
filename = "temperature.txt"

df_temp = pd.read_csv(Path+filename, index_col =0, parse_dates=True)

Resample

df temp does not have the same time frequency. Therfore, It is necessary to resample this dataframe by day prior to merge.



In [ ]:

    
df_temp_day = df_temp.resample('D', how='mean')

Merge

We use the argument join='inner' to keep only the index labels that are present in both dataframe. But we could have use join='outer' to keep all the labeles of each dataframes



In [ ]:

    
df_merged = pd.concat([df, df_temp_day], axis=1, join='inner')

Avanced manipulation

From this stage the true power of Pandas is unleashed.

Grouping : split-apply-combine

Grouping with the method group-by consist of a 3 step process:

Spliting the data with some criteria
Applying a function to each group.
Combining the results in a data structure

In the following example, we want to perform the period average by months.

Create a new column with the month number
Pass the column month as a criteria
Apply the function mean() on each group

For more information please see the documentation http://pandas.pydata.org/pandas-docs/stable/groupby.html



In [ ]:

    
df['month'] = df.index.month
grouped = df.groupby('month').mean()

Pivot Table

Create pivot table spread-sheet like with multi-index.

data: A DataFrame object
values: a column or a list of columns to aggregate
index: a column, Grouper, array which has the same length as data, or list of them. Keys to group by on the pivot table index. If an array is passed, it is being used as the same manner as column values.
columns: a column, Grouper, array which has the same length as data, or list of them. Keys to group by on the pivot table column. If an array is passed, it is being used as the same manner as column values.
aggfunc: function to use for aggregation, defaulting to numpy.mean

In the following example we want to restructure the dataframe to see the average volume and precipitation by month and year.



In [ ]:

    
df['month'] = df.index.month
df['year']= df.index.year



In [ ]:

    
pt = pd.pivot_table(df, values=['volume (%)', "rain (mm/dia)"], index=['month','year'])

The new datastructure has a multi-index along the axis 0. Pandas object can handle multiple index on multiple axis.



In [ ]:

    
# pt.index

Writing data structure to file



In [ ]:

    
df.to_csv('./my_results.csv')

Plots

Pandas module has some basic plot functionality built on top of matplotlib



In [ ]:

    
df.loc[:,"volume (%)"].resample('A', how='mean').plot(kind="bar")
plt.show()



In [ ]:

    
df.loc[:,"volume (%)"].resample('A', how='mean').plot.area(stacked=False)
plt.show()