Pandas (http://pandas.pydata.org) is great for data analysis. We met it briefly in the Software Carpentry course, but it's worth revisiting.
Note the book on that website, 'Python for Data Analysis': it is a useful text from which much of this section was drawn.
We're also going to look at how we might use pandas to work with data read in with CIS.
In [1]:
import pandas as pd
A Series is essentially a container for series data (think time-series, but more general).
Let's create a basic time-series:
In [2]:
from datetime import datetime
s = pd.Series([0.13, 0.21, 0.15, 'NaN', 0.29, 0.09, 0.24, -10], dtype='f',
              index=[datetime(2015, 11, 16, 15, 41, 23), datetime(2015, 11, 16, 15, 42, 22),
                     datetime(2015, 11, 16, 15, 43, 25), datetime(2015, 11, 16, 15, 44, 20),
                     datetime(2015, 11, 16, 15, 45, 22), datetime(2015, 11, 16, 15, 46, 23),
                     datetime(2015, 11, 16, 15, 47, 26), datetime(2015, 11, 16, 15, 48, 21)])
print(s)
As you can see, Pandas has handled our missing value gracefully, storing it as NaN. This is one of the nice things about Pandas.
We can get rid of the negative value easily as well:
In [3]:
s = s[s>0]
print(s)
Note that this also got rid of our NaN, since a comparison such as NaN > 0 always evaluates to False.
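To see why the NaN disappears, here is a minimal sketch on a small made-up Series (the name `values` and its contents are illustrative, not part of the data above). It also shows one way to keep the gap while still dropping the negative value, should we want that:

```python
import numpy as np
import pandas as pd

# Hypothetical values: one missing reading and one bad negative reading.
values = pd.Series([0.13, np.nan, -10.0, 0.29])

# NaN > 0 is False, so the boolean mask excludes the missing value too.
mask = values > 0
print(mask.tolist())       # [True, False, False, True]

filtered = values[mask]
print(filtered.tolist())   # [0.13, 0.29]

# Keep the NaN explicitly if only the negative value should go.
kept_nan = values[(values > 0) | values.isna()]
print(len(kept_nan))       # 3
```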
Now, as you probably noticed, I added a lot of datetimes to this data which represent the timings of the measurements. Pandas uses these times as an index
on the data, and gives us access to some very powerful tools.
For example, resampling our data into five-minute bins and taking the maximum of each bin is easy:
In [4]:
s.resample('5min').max()
Out[4]:
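The same resampler can produce other statistics, such as a mean. The sketch below uses a few made-up readings (the names `times` and `readings` are illustrative) to show how timestamps are grouped into five-minute bins, each labelled by the start of its bin:

```python
from datetime import datetime
import pandas as pd

# Hypothetical readings: two fall in 15:40-15:45 and two in 15:45-15:50.
times = [datetime(2015, 11, 16, 15, 41, 23),
         datetime(2015, 11, 16, 15, 42, 22),
         datetime(2015, 11, 16, 15, 46, 23),
         datetime(2015, 11, 16, 15, 47, 26)]
readings = pd.Series([0.1, 0.3, 0.2, 0.6], index=times)

# One resampler per call; the aggregation chosen afterwards decides the statistic.
five_min_max = readings.resample('5min').max()
five_min_mean = readings.resample('5min').mean()

print(five_min_max)
print(five_min_mean)
```

Swapping '5min' for '1min' would give one bin per minute instead.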
Another way of creating series is using dictionaries:
In [5]:
colours = pd.Series({'Blue': 42, 'Green': 12, 'Yellow': 37})
colours
Out[5]:
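We can index Series by position, just like numpy arrays, or using the named index:
In [6]:
# iloc makes positional access explicit; a bare integer on a string index is deprecated in recent pandas
print(colours.iloc[1])
print(colours[:-1])
print(colours['Blue'])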
Or both:
In [7]:
print(colours[1:]['Green'])
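When it matters whether we mean a position or a label, the `.iloc` and `.loc` accessors say so explicitly. A minimal sketch, using a stand-in Series called `shades` (an illustrative name) with the same contents as `colours`:

```python
import pandas as pd

shades = pd.Series({'Blue': 42, 'Green': 12, 'Yellow': 37})

second = shades.iloc[1]                    # by position
by_label = shades.loc['Blue']              # by label
label_slice = shades.loc['Blue':'Green']   # label slices include the end point
pos_slice = shades.iloc[0:1]               # positional slices exclude it

print(second, by_label)                    # 12 42
print(len(label_slice), len(pos_slice))    # 2 1
```

One thing to watch for is that the two kinds of slice treat their end point differently.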
Another nice benefit of the indices is data alignment. For example, when performing operations on two series, Pandas will line up the indices first:
In [8]:
more_colours = pd.Series({'Blue': 16, 'Red': 22,
'Purple': 34, 'Green': 25,})
more_colours + colours
Out[8]:
As you can see, if an index label is present in only one of the two series then Pandas returns NaN for that label.
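If we would rather treat a missing entry as zero than get NaN, the `add` method accepts a `fill_value`. A minimal sketch, with stand-in Series named `first` and `second` (illustrative names) holding the same values as above:

```python
import pandas as pd

first = pd.Series({'Blue': 42, 'Green': 12, 'Yellow': 37})
second = pd.Series({'Blue': 16, 'Red': 22, 'Purple': 34, 'Green': 25})

# Plain addition: NaN wherever a label exists on only one side.
aligned = first + second

# add() with fill_value: a label missing on one side counts as 0 there.
filled = first.add(second, fill_value=0)

print(aligned)
print(filled)
```

Whether zero is a sensible stand-in depends on the data, so this is a choice to make deliberately rather than a default.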
Pandas uses numpy heavily underneath, so many of the numpy array operations work on Series as well:
In [9]:
colours.mean(), colours.max()
Out[9]:
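Numpy functions can also be applied to a Series directly, and the result keeps the index. A small sketch, again using an illustrative stand-in called `shades`:

```python
import numpy as np
import pandas as pd

shades = pd.Series({'Blue': 42, 'Green': 12, 'Yellow': 37})

# Element-wise numpy function: returns a Series with the same labels.
roots = np.sqrt(shades)
print(roots)

# The underlying numpy array is available if we need it.
raw = shades.to_numpy()
print(raw.sum())   # 91
```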
Data frames are essentially collections of Series, with a shared index. Each column
of data is labelled and the whole frame can be pictured as a table, or spreadsheet of data.
In [10]:
df = pd.DataFrame({'First': colours, 'Second': more_colours})
print(df)
A DataFrame can be indexed by column, or by row via the loc attribute:
In [11]:
# Column by label
print(df['First'])
In [12]:
# Column as attribute
print(df.First)
In [ ]:
# Row via loc (the older ix attribute has been removed from pandas)
print(df.loc['Blue'])
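In current pandas releases, label-based row access is spelled `.loc` and position-based access `.iloc`; the `ix` attribute was removed in pandas 1.0. A minimal sketch on a small made-up frame (the name `table` and its values are illustrative):

```python
import pandas as pd

table = pd.DataFrame({'First': pd.Series({'Blue': 42, 'Green': 12}),
                      'Second': pd.Series({'Blue': 16, 'Red': 22})})

row = table.loc['Blue']              # row by label
first_row = table.iloc[0]            # row by position
cell = table.loc['Blue', 'Second']   # single value by row and column label

print(row)
print(cell)
```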
We can then apply many of the same numpy functions to this data, on a per-column basis:
In [13]:
df.max()
Out[13]:
In [14]:
df.sum()
Out[14]:
In [15]:
example_csv = pd.read_csv('../resources/B1_mosquito_data.csv',
parse_dates=True, index_col=0)
example_csv[0:10]
Out[15]:
In [16]:
example_csv.corr()
Out[16]:
We can easily convert CIS data into pandas data to take advantage of this time-series functionality.
In [17]:
from cis import read_data_list
aerosol_cci_collocated = read_data_list('col_output.nc', '*')
cis_df = aerosol_cci_collocated.as_data_frame()
cis_df
Out[17]:
In [18]:
# Now we can do cool Pandas stuff!
# idxmin returns the index label of the smallest value; loc then selects that row
cis_df.loc[cis_df['NUMBER_CONCENTRATION'].idxmin()]
Out[18]:
In [19]:
cis_short = cis_df.dropna()
In [20]:
cis_short.loc[cis_short['NUMBER_CONCENTRATION'].idxmin()]
Out[20]:
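Since the CIS output file may not be to hand, here is a minimal sketch of the same lookup on made-up data. The frame `obs` and its values are hypothetical; only the column name is borrowed from above. It uses `idxmin`, which returns the index label of the smallest value and skips NaN by default:

```python
import pandas as pd

# Hypothetical observations every ten minutes, with one missing value.
obs = pd.DataFrame(
    {'NUMBER_CONCENTRATION': [310.0, None, 95.0, 480.0]},
    index=pd.date_range('2015-11-16 15:40', periods=4, freq='10min'))

label = obs['NUMBER_CONCENTRATION'].idxmin()   # timestamp of the minimum
lowest = obs.loc[label]                        # the full row at that time

print(label)
print(lowest)
```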
In [21]:
%matplotlib inline
cis_df['NUMBER_CONCENTRATION'].plot(kind='kde', xlim=[0,1000], label='Raw')
cis_df['NUMBER_CONCENTRATION'].resample('10min').mean().plot(kind='kde', label='10min')
ax=cis_df['NUMBER_CONCENTRATION'].resample('120min').mean().plot(kind='kde', label='120min')
ax.legend()
Out[21]:
In [22]:
from pandas.plotting import scatter_matrix
m = scatter_matrix(cis_df, alpha=0.2, figsize=(8, 8), diagonal='kde', edgecolors='none')