The objective of this notebook is to make a brief overview of the pandas module for data analysis and manipulation. To present the different pandas tools,w e use as an example the hydrometeorological data from the famous sistema Cantareira reservoir (São Paulo/Brazil). The data files are located in the data directory of this repository.
Have fun :)
Pandas provides easy to use and intuitive data structures and high performance tools for data analysis. It has been designed for practical, real-world analysis.
The community have the objective to build the most powerful and flexible open source data analysis / manipulation tool available in any language.
You will find everything you need to know about Pandas at: http://pandas.pydata.org/
This presentation is inspired from the amazing "10 minutes to Pandas" tutorial (http://pandas.pydata.org/pandas-docs/stable/10min.html).
For further details look at the Pandas documentation: http://pandas.pydata.org/pandas-docs/stable/index.html
In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Pandas supports three data structures:
They are labeled arrays which can contain any kind of data types (int,float, string, objects ...). The labels are called index in pandas.
The main difference between numpy and pandas reside
For more information about data structure, please have a look at the documentation: http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro
In [ ]:
labels = ['a','b','c','d']
values = np.arange(4)
In [ ]:
s = pd.Series(values, index=labels)
A dataframe is a two-dimensional labeled array. There is various way to create a dataframe. for more information, you can have a look at: http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe
In [ ]:
dic = {'a':[1,2,3,4,5],
'b':[10,20,30,40,50]
}
In [ ]:
df = pd.DataFrame(dic)
In [ ]:
rng = pd.date_range('1/1/2016', periods=100, freq='S')
In [ ]:
rng = pd.period_range('1/1/2016', '1/1/2017', freq='D')
To create a temporal serie
In [ ]:
s = pd.Series(range(100), index = rng)
Let's explore hydrometeorological data observed in the region of the Cantareira sistema (The main reservoir of the São Paulo megacity), more precisely the accumulated rainfall (mm/day) and the volume of water in (%) (100% correspond to the maximum capacity).
Note: In this example, the index of our dataframe is going to be a datetime object, as time series are recurrently used in atmospheric sciences. However, the index could be whatever labeled object, for example the ID of the stars for applications in astronomy.
In [ ]:
Path = './data/'
filename = "DataCantareira.csv"
Lets have a look at the amazing capabilities of the "read_csv" method! http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
This method provide so many options that you will probably never need to perform extra programing in order to read your data! (This is really timesaving for data analysis tasks)
In [ ]:
df = pd.read_csv(Path+filename)
In this case the read csv methods return directly a dataframe. As you can see the first column hold the date of each event. We could have directly defined the index to be a datetime object by passing the following argument to read csv.
In [ ]:
df = pd.read_csv( Path+filename, index_col =0, parse_dates=True)
We can verify that the index is a datetime index with the folowing code
In [ ]:
# type(df.index)
In [ ]:
# df.describe()
In [ ]:
# df.columns
In [ ]:
# df.index
You can select items in your Pandas data structure in the same style than numpy array.
However, Pandas has his own optimized methods for data access:
Get a column from the dataframe
In [ ]:
# df['volume (%)']
Get some rows
In [ ]:
# df[0:3]
In [ ]:
# df['2003-01-01':'2004-01-01']
In [ ]:
# df.loc['2003-01-01':'2004-01-01','vol (%)']
In [ ]:
# df.iloc[3:5,0:2]
In [ ]:
# df[df['volume (%)'] < 5]
Adding a new column
In [ ]:
# df['rain (m/dia)'] = df['rain (mm/dia)']/1000
Pandas uses the object np.nan to represent missing data. Important note: it is by default not included in the operations.
Methods to deal with missing data
To drop the NaN from the dataframe
In [ ]:
# df.dropna(how='any')
To fill NaN with a value
In [ ]:
# df.fillna(value=df.mean())
To interpolate where their is missing data
In [ ]:
# df.interpolate("linear")
Perform the mean of each dataframe columns.
In [ ]:
# df.mean(axis=0)
In [ ]:
# df['rain (mm/dia)'].apply(np.sqrt)
Pandas provide different methods to merge Series, Dataframe and Paneis:
In [ ]:
Path = "./data/"
filename = "temperature.txt"
df_temp = pd.read_csv(Path+filename, index_col =0, parse_dates=True)
df temp does not have the same time frequency. Therfore, It is necessary to resample this dataframe by day prior to merge.
In [ ]:
df_temp_day = df_temp.resample('D', how='mean')
We use the argument join='inner' to keep only the index labels that are present in both dataframe. But we could have use join='outer' to keep all the labeles of each dataframes
In [ ]:
df_merged = pd.concat([df, df_temp_day], axis=1, join='inner')
Grouping with the method group-by consist of a 3 step process:
In the following example, we want to perform the period average by months.
For more information please see the documentation http://pandas.pydata.org/pandas-docs/stable/groupby.html
In [ ]:
df['month'] = df.index.month
grouped = df.groupby('month').mean()
Create pivot table spread-sheet like with multi-index.
In the following example we want to restructure the dataframe to see the average volume and precipitation by month and year.
In [ ]:
df['month'] = df.index.month
df['year']= df.index.year
In [ ]:
pt = pd.pivot_table(df, values=['volume (%)', "rain (mm/dia)"], index=['month','year'])
The new datastructure has a multi-index along the axis 0. Pandas object can handle multiple index on multiple axis.
In [ ]:
# pt.index
In [ ]:
df.to_csv('./my_results.csv')
Pandas module has some basic plot functionality built on top of matplotlib
In [ ]:
df.loc[:,"volume (%)"].resample('A', how='mean').plot(kind="bar")
plt.show()
In [ ]:
df.loc[:,"volume (%)"].resample('A', how='mean').plot.area(stacked=False)
plt.show()