Working with pandas DataFrames

Pandas (http://pandas.pydata.org) is great for data analysis. We met it briefly in the Software Carpentry course, but it's worth revisiting.

Note the book on that website, 'Python for Data Analysis': it is a useful text, and much of this section was drawn from it.

We're also going to look at how we might use pandas to work with data read in with CIS.



In [ ]:

    
import pandas as pd

Series

A Series is essentially a container for series data (think time-series, but more general).

Let's create a basic time-series:



In [ ]:

    
from datetime import datetime
# float('nan') marks the missing measurement (clearer than the string 'NaN')
s = pd.Series([0.13, 0.21, 0.15, float('nan'), 0.29, 0.09, 0.24, -10], dtype='f',
              index=[datetime(2015,11,16,15,41,23), datetime(2015,11,16,15,42,22),
                     datetime(2015,11,16,15,43,25), datetime(2015,11,16,15,44,20),
                     datetime(2015,11,16,15,45,22), datetime(2015,11,16,15,46,23),
                     datetime(2015,11,16,15,47,26), datetime(2015,11,16,15,48,21)])
print(s)

As you can see, it's dealt with our missing value nicely - this is one of the nice things about Pandas.

We can get rid of the negative value easily as well:



In [ ]:

    
s = s[s>0]
print(s)

Note this also got rid of our NaN, because any comparison involving NaN evaluates to False.
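
To see why the NaN disappears, here is a minimal sketch using a small stand-in Series (the name `values` and its contents are made up for illustration). The comparison builds a boolean mask, and indexing with that mask keeps only the entries where it is True.

```python
import numpy as np
import pandas as pd

# A small stand-in Series with one missing and one negative value
values = pd.Series([0.13, np.nan, -10.0, 0.29])

# Any ordering comparison involving NaN evaluates to False
mask = values > 0
print(mask.tolist())

# Boolean indexing keeps only the entries where the mask is True,
# so both the NaN and the negative value are dropped
positive = values[mask]
print(positive)
```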

Now, as you probably noticed, I added a lot of datetimes to this data which represent the timings of the measurements. Pandas uses these times as an index on the data, and gives us access to some very powerful tools.

For example, resampling our data into 5-minute bins and taking the maximum of each bin is easy:



In [ ]:

    
s.resample('5min').max()
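
The same resampler can be reduced in different ways. The sketch below uses a short made-up series (`obs` is an illustrative name, not part of the data above) to compare the per-bin maximum with the per-bin mean.

```python
from datetime import datetime
import pandas as pd

# Four made-up observations spread over two 5-minute bins
times = [datetime(2015, 11, 16, 15, 41, 23),
         datetime(2015, 11, 16, 15, 44, 20),
         datetime(2015, 11, 16, 15, 46, 23),
         datetime(2015, 11, 16, 15, 48, 21)]
obs = pd.Series([0.1, 0.3, 0.2, 0.4], index=times)

# resample() only groups the data; the reduction is chosen afterwards
binned = obs.resample('5min')
bin_max = binned.max()    # largest value in each bin
bin_mean = binned.mean()  # average value in each bin

print(bin_max)
print(bin_mean)
```

Each bin is labelled by its left edge by default, so the two bins here are labelled 15:40 and 15:45.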

Another way of creating series is using dictionaries:



In [ ]:

    
colours = pd.Series({'Blue': 42, 'Green': 12, 'Yellow': 37})
colours

We can index Series by position, much like numpy arrays, or using the named index:



In [ ]:

    
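# Positional access goes through .iloc; plain colours[1] is deprecated for labelled Series
print(colours.iloc[1])
print(colours.iloc[:-1])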
print(colours['Blue'])

Or both:



In [ ]:

    
print(colours[1:]['Green'])

Another nice benefit of the indices is data alignment. For example, when performing operations on two series, Pandas will line up the indices first:



In [ ]:

    
more_colours = pd.Series({'Blue': 16, 'Red': 22, 'Purple': 34, 'Green': 25,})

more_colours + colours

As you can see, where a label is missing from either of the two indices, Pandas returns NaN for that label.
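
If NaNs are not what you want for unmatched labels, one option is the `add` method with `fill_value`, which treats a label missing from one side as that value. This is only a sketch with stand-in data (`first` and `second` are illustrative names); whether zero is a sensible default depends on what the numbers mean.

```python
import pandas as pd

first = pd.Series({'Blue': 42, 'Green': 12, 'Yellow': 37})
second = pd.Series({'Blue': 16, 'Red': 22, 'Green': 25})

# The + operator aligns on labels and gives NaN for unmatched ones
plain_sum = first + second

# add() with fill_value=0 treats a label missing from one side as 0
filled_sum = first.add(second, fill_value=0)

print(plain_sum)
print(filled_sum)
```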

Pandas uses numpy heavily underneath, so many of the numpy array operations work on Series as well:



In [ ]:

    
colours.mean(), colours.max()
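
Element-wise numpy functions can also be applied to a Series directly, and the result keeps the index. A small sketch with made-up counts (`counts` is an illustrative name):

```python
import numpy as np
import pandas as pd

counts = pd.Series({'Blue': 16, 'Green': 25, 'Yellow': 36})

# A numpy ufunc applied to a Series returns a Series with the same labels
roots = np.sqrt(counts)
print(roots)

# Reductions are available as numpy functions or as Series methods
print(np.mean(counts), counts.mean())
```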

DataFrames

Data frames are essentially collections of Series, with a shared index. Each column of data is labelled and the whole frame can be pictured as a table, or spreadsheet of data.



In [ ]:

    
df = pd.DataFrame({'First': colours, 'Second': more_colours})
print(df)

A DataFrame can be indexed by column, or by row via the loc attribute (the older ix attribute has been removed from recent versions of pandas):



In [ ]:

    
# Column by label
print(df['First'])



In [ ]:

    
# Column as attribute
print(df.First)



In [ ]:

    
# Row via loc (ix has been removed from recent pandas)
print(df.loc['Blue'])
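
Row selection by label and by position can be sketched as follows, assuming a recent pandas where `.loc` (labels) and `.iloc` (integer positions) are the supported accessors. The frame `table` is a small stand-in built just for this example.

```python
import pandas as pd

table = pd.DataFrame({'First': {'Blue': 42, 'Green': 12},
                      'Second': {'Blue': 16, 'Green': 25}})

# Select a row by its index label
by_label = table.loc['Blue']

# Select the same row by its integer position
by_position = table.iloc[0]

# Select a single cell by row label and column label
single = table.loc['Green', 'Second']

print(by_label)
print(single)
```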

We can then apply many of the same numpy functions on this data, on a per column basis:



In [ ]:

    
df.max()



In [ ]:

    
df.sum()

Reading CSV files



In [ ]:

    
example_csv = pd.read_csv('../resources/B1_mosquito_data.csv', 
                          parse_dates=True, index_col=0)
example_csv[0:10]



In [ ]:

    
example_csv.corr()
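
The contents of the mosquito file are not shown here, so the sketch below uses a few lines of invented inline CSV text instead (the column names `temperature` and `rainfall` are hypothetical). It illustrates what `parse_dates=True` together with `index_col=0` does, and the shape of what `corr()` returns.

```python
import io
import pandas as pd

# Invented CSV text standing in for a file on disk
csv_text = """date,temperature,rainfall
2015-01-01,20.5,3.0
2015-01-02,21.0,0.0
2015-01-03,19.5,5.5
"""

# index_col=0 uses the first column as the index;
# parse_dates=True asks pandas to parse that index as dates
frame = pd.read_csv(io.StringIO(csv_text), parse_dates=True, index_col=0)
print(frame.index)

# corr() gives the pairwise correlation between the numeric columns
correlations = frame.corr()
print(correlations)
```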

Using Pandas with CIS data

We can easily convert CIS data into pandas data to take advantage of this time-series functionality.



In [ ]:

    
from cis import read_data_list

aerosol_cci_collocated = read_data_list('col_output.nc', '*')

cis_df = aerosol_cci_collocated.as_data_frame()
cis_df



In [ ]:

    
# Now we can do cool Pandas stuff!
# idxmin() returns the index label of the minimum, which loc then looks up
cis_df.loc[cis_df['NUMBER_CONCENTRATION'].idxmin()]



In [ ]:

    
cis_short = cis_df.dropna()



In [ ]:

    
cis_short.loc[cis_short['NUMBER_CONCENTRATION'].idxmin()]
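
The cells above need the CIS output file, so here is a self-contained sketch of the same pattern on invented data: drop incomplete rows, then look up the row holding the minimum. The `AOD550` column and all the values are hypothetical, and the sketch assumes a pandas version that provides `idxmin` and `.loc`.

```python
import numpy as np
import pandas as pd

# Invented measurements on a one-minute time index
times = pd.date_range('2015-11-16 15:00', periods=4, freq='min')
frame = pd.DataFrame({'NUMBER_CONCENTRATION': [120.0, np.nan, 80.0, 95.0],
                      'AOD550': [0.2, 0.3, np.nan, 0.1]},  # hypothetical column
                     index=times)

# dropna() removes every row that has a NaN in any column
complete = frame.dropna()

# idxmin() returns the index label of the smallest value,
# and loc uses that label to fetch the whole row
min_label = complete['NUMBER_CONCENTRATION'].idxmin()
min_row = complete.loc[min_label]

print(min_row)
```

Note that the minimum can change after `dropna()`: here the smallest raw value sits in a row that is dropped because another column is missing.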

Exercise

In pairs, plot probability distributions (use kde) of the raw, 10-minute averaged and 2-hour averaged number concentration.



In [ ]:

    
%matplotlib inline

cis_df['NUMBER_CONCENTRATION'].plot(kind='kde', xlim=[0,1000], label='Raw')
cis_df['NUMBER_CONCENTRATION'].resample('10min').mean().plot(kind='kde', label='10min')
ax=cis_df['NUMBER_CONCENTRATION'].resample('120min').mean().plot(kind='kde', label='120min')
ax.legend()

Extras



In [ ]:

    
# scatter_matrix lives in pandas.plotting (the old pandas.tools.plotting module was removed)
from pandas.plotting import scatter_matrix
m = scatter_matrix(cis_df, alpha=0.2, figsize=(8, 8), diagonal='kde', edgecolors='none')


