Introduction to data manipulation and analysis with Pandas

The objective of this notebook is to give a brief overview of the pandas module for data analysis and manipulation. To present the different pandas tools, we use as an example the hydrometeorological data from the famous Cantareira reservoir system (São Paulo, Brazil). The data files are located in the data directory of this repository.

Have fun :)

What is Pandas and why use it?

Pandas provides intuitive, easy-to-use data structures and high-performance tools for data analysis. It has been designed for practical, real-world analysis.

The pandas community aims to build the most powerful and flexible open-source data analysis and manipulation tool available in any language.

You will find everything you need to know about Pandas at: http://pandas.pydata.org/

Topics covered

  • Object Creation: Series and Dataframe
  • Data Visualisation
  • Data Selection
  • Read data from file
  • Dealing with missing data
  • Setting values in data structures
  • Basic operations and the apply function
  • Merge/concatenate/join/append data structures
  • Resample
  • Grouping : split-apply-combine
  • Pivot Table
  • Writing to file
  • Basic plots

This presentation is inspired from the amazing "10 minutes to Pandas" tutorial (http://pandas.pydata.org/pandas-docs/stable/10min.html).

For further details look at the Pandas documentation: http://pandas.pydata.org/pandas-docs/stable/index.html

Let's fire up Pandas !

We will need numpy and matplotlib.pyplot as well.


In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Object Creation

Pandas supports three data structures:

  • Series (1 axis)
  • DataFrame (2 axes)
  • Panel (3 to N axes; deprecated and later removed in recent pandas versions)

They are labeled arrays which can contain any data type (int, float, string, objects, ...). The labels are called the index in pandas.

The main difference between numpy arrays and pandas data structures resides in the labeled index: pandas aligns data by label rather than only by position.

For more information about data structure, please have a look at the documentation: http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro

Create a Series

We need:

  • an index, e.g. a list of strings
  • an array, e.g. a numpy.ndarray

If no index is specified, pandas will by default use a list of integers.


In [ ]:
labels = ['a','b','c','d']
values = np.arange(4)

In [ ]:
s = pd.Series(values, index=labels)
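A minimal self-contained sketch of the cell above, showing that a Series can be accessed either by label or by position:

```python
import numpy as np
import pandas as pd

labels = ['a', 'b', 'c', 'd']
values = np.arange(4)

# Build a Series with explicit labels
s = pd.Series(values, index=labels)

# Access by label or by integer position
print(s['c'])      # value stored under label 'c'
print(s.iloc[0])   # first value
```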

Create a dataframe

A dataframe is a two-dimensional labeled array. There are various ways to create a dataframe. For more information, you can have a look at: http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe

From a dictionary of lists


In [ ]:
dic = {'a':[1,2,3,4,5],
       'b':[10,20,30,40,50]
       }

In [ ]:
df = pd.DataFrame(dic)
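In the construction above, each dictionary key becomes a column and the rows receive a default integer index; a quick check:

```python
import pandas as pd

dic = {'a': [1, 2, 3, 4, 5],
       'b': [10, 20, 30, 40, 50]}

df = pd.DataFrame(dic)

# Each key becomes a column; rows get a default integer index
print(df.shape)           # (5, 2)
print(list(df.columns))   # ['a', 'b']
```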

Time Series / Date functionality

Pandas provides powerful tools to work with time series, built upon the numpy datetime64 and timedelta64 dtypes.


In [ ]:
rng = pd.date_range('1/1/2016', periods=100, freq='S')

In [ ]:
rng = pd.period_range('1/1/2016', '1/1/2017', freq='D')

To create a time series


In [ ]:
s = pd.Series(range(len(rng)), index=rng)  # the values must match the length of the index
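One benefit of a datetime-like index is partial-string selection; a small sketch with illustrative dates:

```python
import pandas as pd

# 100 consecutive days starting on 2016-01-01
rng = pd.date_range('2016-01-01', periods=100, freq='D')
ts = pd.Series(range(100), index=rng)

# Partial-string indexing: select all of January 2016 with one label
jan = ts.loc['2016-01']
print(len(jan))   # 31
```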

Read (real) data from file.

Let's explore hydrometeorological data observed in the region of the Cantareira system (the main reservoir of the São Paulo megacity), more precisely the accumulated rainfall (mm/day) and the volume of stored water (%), where 100% corresponds to the maximum capacity.

Note: In this example, the index of our dataframe is going to be a datetime object, as time series are ubiquitous in atmospheric sciences. However, the index could be any labeled object, for example the IDs of stars for applications in astronomy.


In [ ]:
Path = './data/'
filename = "DataCantareira.csv"

Let's have a look at the amazing capabilities of the "read_csv" method! http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

This method provides so many options that you will probably never need to write extra code to read your data! (This is a real time-saver for data analysis tasks.)


In [ ]:
df = pd.read_csv(Path+filename)

In this case the read_csv method directly returns a dataframe. As you can see, the first column holds the date of each event. We could have directly defined the index to be a datetime object by passing the following arguments to read_csv.


In [ ]:
df = pd.read_csv( Path+filename, index_col =0, parse_dates=True)
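Since the real data file is not reproduced here, the same pattern can be tested with an in-memory CSV; the column names below are assumed to mirror DataCantareira.csv:

```python
import io
import pandas as pd

# Hypothetical sample mimicking the structure of DataCantareira.csv
csv_text = """date,rain (mm/dia),volume (%)
2003-01-01,0.0,40.1
2003-01-02,5.2,40.3
2003-01-03,1.1,40.2
"""

# index_col=0 uses the first column as index; parse_dates converts it to datetimes
df_demo = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)
print(type(df_demo.index))
```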

We can verify that the index is a datetime index with the following code


In [ ]:
# type(df.index)

Visualising the data structure

Methods to quickly give a look at your data:

  • head
  • tail
  • columns
  • index
  • values
  • describe

In [ ]:
# df.describe()

In [ ]:
# df.columns

In [ ]:
# df.index

Selection

You can select items in your pandas data structures in the same style as with numpy arrays.

However, pandas has its own optimized methods for data access:

  • .at
  • .iat
  • .loc
  • .iloc

(The older .ix indexer has been deprecated and removed in recent pandas versions.)

Numpy style

Get a column from the dataframe


In [ ]:
# df['volume (%)']

Get some rows


In [ ]:
# df[0:3]

In [ ]:
# df['2003-01-01':'2004-01-01']

Selection by label

For production code, it is recommended that you take advantage of the optimized pandas data access methods.


In [ ]:
# df.loc['2003-01-01':'2004-01-01', 'volume (%)']

Selection by Position

Select via the position of the passed integers


In [ ]:
# df.iloc[3:5,0:2]

Boolean Indexing

Let's select the events when the Cantareira reservoir was below 5% of its volume


In [ ]:
# df[df['volume (%)'] < 5]
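The same boolean-mask pattern on a small illustrative frame (the values are made up):

```python
import pandas as pd

df_small = pd.DataFrame({'volume (%)': [3.9, 12.0, 4.5, 50.2],
                         'rain (mm/dia)': [0.0, 2.1, 0.4, 8.8]})

# The comparison produces a boolean Series; indexing with it keeps matching rows
low = df_small[df_small['volume (%)'] < 5]
print(len(low))   # 2
```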

Setting

Adding a new column


In [ ]:
# df['rain (m/dia)'] = df['rain (mm/dia)']/1000

Working with Missing data

Pandas uses the object np.nan to represent missing data. Important note: by default, missing values are excluded from computations.

Methods to deal with missing data

  • .isnull
  • .dropna
  • .fillna

To drop the NaN from the dataframe


In [ ]:
# df.dropna(how='any')

To fill NaN with a value


In [ ]:
# df.fillna(value=df.mean())

To interpolate where there is missing data


In [ ]:
# df.interpolate("linear")
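A minimal sketch of the three approaches on a toy Series with two gaps:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

dropped = s.dropna()                 # remove the missing entries
filled = s.fillna(value=s.mean())    # replace NaN with the mean (NaN excluded from it)
interp = s.interpolate('linear')     # linear interpolation between neighbours

print(interp.tolist())   # [1.0, 2.0, 3.0, 4.0, 5.0]
```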

Operations

Pandas includes a lot of methods to perform operations along an axis.

  • count: Number of non-null observations
  • sum: Sum of values
  • mean: Mean of values
  • mad: Mean absolute deviation
  • median: Arithmetic median of values
  • min: Minimum
  • max: Maximum
  • mode: Mode
  • abs: Absolute Value
  • prod: Product of values
  • std: Bessel-corrected sample standard deviation
  • var: Unbiased variance
  • sem: Standard error of the mean
  • skew: Sample skewness (3rd moment)
  • kurt: Sample kurtosis (4th moment)
  • quantile: Sample quantile (value at %)
  • cumsum: Cumulative sum
  • cumprod: Cumulative product
  • cummax: Cumulative maximum
  • cummin: Cumulative minimum

Perform the mean of each dataframe columns.


In [ ]:
# df.mean(axis=0)
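On a small illustrative frame, axis=0 averages each column while axis=1 averages each row:

```python
import pandas as pd

df_ops = pd.DataFrame({'a': [1.0, 2.0, 3.0],
                       'b': [10.0, 20.0, 30.0]})

col_means = df_ops.mean(axis=0)   # one value per column
row_means = df_ops.mean(axis=1)   # one value per row

print(col_means.tolist())   # [2.0, 20.0]
```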

Apply

You can also apply a function along an axis with the apply method. Iterating this way is typically much faster than an explicit Python for loop.


In [ ]:
# df['rain (mm/dia)'].apply(np.sqrt)
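For instance, applying a NumPy function element-wise to a toy rainfall column:

```python
import numpy as np
import pandas as pd

rain = pd.Series([0.0, 4.0, 9.0, 16.0])

# apply calls the function once per element of the Series
roots = rain.apply(np.sqrt)
print(roots.tolist())   # [0.0, 2.0, 3.0, 4.0]
```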

Merge data structures

Pandas provides different methods to combine Series and DataFrames:

  • merge
  • join
  • concat
  • append (removed in pandas 2.0; concat covers its use cases)

Let's merge our dataframe with the observed temperature from external file.


In [ ]:
Path = "./data/"
filename = "temperature.txt"

df_temp = pd.read_csv(Path+filename, index_col =0, parse_dates=True)

Resample

df_temp does not have the same time frequency. Therefore, it is necessary to resample this dataframe by day prior to the merge.


In [ ]:
df_temp_day = df_temp.resample('D').mean()
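Note that the old `resample('D', how='mean')` form was removed from pandas; the current idiom chains the aggregation after resample, as in this synthetic example:

```python
import pandas as pd

# Hourly synthetic temperatures spanning two days
idx = pd.date_range('2016-01-01', periods=48, freq='h')
temp = pd.Series(range(48), index=idx)

# Downsample to daily means (modern replacement for resample('D', how='mean'))
daily = temp.resample('D').mean()
print(daily.tolist())   # [11.5, 35.5]
```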

Merge

We use the argument join='inner' to keep only the index labels that are present in both dataframes. But we could have used join='outer' to keep all the labels of each dataframe.


In [ ]:
df_merged = pd.concat([df, df_temp_day], axis=1, join='inner')
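The difference between the two joins, shown on small frames with partially overlapping indexes:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2, 3]}, index=['p', 'q', 'r'])
b = pd.DataFrame({'y': [10, 20]}, index=['q', 'r'])

inner = pd.concat([a, b], axis=1, join='inner')   # only labels in both: q, r
outer = pd.concat([a, b], axis=1, join='outer')   # all labels: p, q, r (gaps become NaN)

print(len(inner), len(outer))   # 2 3
```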

Advanced manipulation

From this stage the true power of Pandas is unleashed.

Grouping : split-apply-combine

Grouping with the groupby method consists of a three-step process:

  • Splitting the data with some criteria
  • Applying a function to each group
  • Combining the results in a data structure

In the following example, we want to compute the average of each variable by month.

  1. Create a new column with the month number
  2. Pass the month column as the grouping criterion
  3. Apply the function mean() on each group

For more information please see the documentation http://pandas.pydata.org/pandas-docs/stable/groupby.html


In [ ]:
df['month'] = df.index.month
grouped = df.groupby('month').mean()
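The same three steps on synthetic daily data, so the result can be checked by hand:

```python
import pandas as pd

# Daily values for January and February 2016 (leap year: 29 days in February)
idx = pd.date_range('2016-01-01', '2016-02-29', freq='D')
df_syn = pd.DataFrame({'value': range(len(idx))}, index=idx)

# 1. derive the grouping key, 2. split by it, 3. apply mean and combine
df_syn['month'] = df_syn.index.month
grouped = df_syn.groupby('month').mean()

print(grouped['value'].tolist())   # [15.0, 45.0]
```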

Pivot Table

Create spreadsheet-like pivot tables with a multi-index. The main arguments are:

  • data: A DataFrame object
  • values: a column or a list of columns to aggregate
  • index: a column, Grouper, or array with the same length as the data, or a list of them. Keys to group by on the pivot table index. If an array is passed, it is used in the same manner as column values.
  • columns: a column, Grouper, or array with the same length as the data, or a list of them. Keys to group by on the pivot table columns. If an array is passed, it is used in the same manner as column values.
  • aggfunc: function to use for aggregation, defaulting to numpy.mean

In the following example we want to restructure the dataframe to see the average volume and precipitation by month and year.


In [ ]:
df['month'] = df.index.month
df['year']= df.index.year

In [ ]:
pt = pd.pivot_table(df, values=['volume (%)', "rain (mm/dia)"], index=['month','year'])
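A reduced version of the same reshaping on synthetic data (column name assumed to match the real file), small enough to verify the means by eye:

```python
import pandas as pd

# Four days straddling a year boundary
idx = pd.date_range('2015-12-30', '2016-01-02', freq='D')
df_pt = pd.DataFrame({'volume (%)': [10.0, 12.0, 20.0, 22.0]}, index=idx)
df_pt['month'] = df_pt.index.month
df_pt['year'] = df_pt.index.year

# Default aggfunc is the mean: one row per (month, year) pair
pt = pd.pivot_table(df_pt, values='volume (%)', index=['month', 'year'])
print(pt.loc[(12, 2015), 'volume (%)'])   # 11.0
```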

The new data structure has a multi-index along axis 0. Pandas objects can handle multiple indexes on multiple axes.


In [ ]:
# pt.index

Writing data structure to file


In [ ]:
df.to_csv('./my_results.csv')

Plots

The pandas module has some basic plotting functionality built on top of matplotlib.


In [ ]:
df.loc[:, "volume (%)"].resample('A').mean().plot(kind="bar")
plt.show()

In [ ]:
df.loc[:, "volume (%)"].resample('A').mean().plot.area(stacked=False)
plt.show()