Introduction to Pandas

Pandas (http://pandas.pydata.org) is great for data analysis. We met it briefly in the Software Carpentry course, but it's worth revisiting in more depth.

Note the book on that website, 'Python for Data Analysis': it's a useful text from which much of this section was drawn.

We're also going to look at how we might use pandas to work with data read in with CIS.


In [1]:
import pandas as pd

Series

A Series is essentially a one-dimensional labelled array (think time-series, but more general).

Let's create a basic time-series:


In [2]:
from datetime import datetime
s = pd.Series([0.13, 0.21, 0.15, 'NaN', 0.29, 0.09, 0.24, -10], dtype='f',
              index=[datetime(2015, 11, 16, 15, 41, 23), datetime(2015, 11, 16, 15, 42, 22),
                     datetime(2015, 11, 16, 15, 43, 25), datetime(2015, 11, 16, 15, 44, 20),
                     datetime(2015, 11, 16, 15, 45, 22), datetime(2015, 11, 16, 15, 46, 23),
                     datetime(2015, 11, 16, 15, 47, 26), datetime(2015, 11, 16, 15, 48, 21)])
print(s)


2015-11-16 15:41:23     0.13
2015-11-16 15:42:22     0.21
2015-11-16 15:43:25     0.15
2015-11-16 15:44:20      NaN
2015-11-16 15:45:22     0.29
2015-11-16 15:46:23     0.09
2015-11-16 15:47:26     0.24
2015-11-16 15:48:21   -10.00
dtype: float32

As you can see, it's dealt with our missing value gracefully by converting it to NaN - this is one of Pandas' strengths.

We can get rid of the negative value easily as well:


In [3]:
s = s[s>0]
print(s)


2015-11-16 15:41:23    0.13
2015-11-16 15:42:22    0.21
2015-11-16 15:43:25    0.15
2015-11-16 15:45:22    0.29
2015-11-16 15:46:23    0.09
2015-11-16 15:47:26    0.24
dtype: float32

Note this also got rid of our NaN, since comparisons with NaN always evaluate to False.

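Rather than relying on a comparison to drop NaNs as a side effect, the explicit tool is dropna(). A minimal sketch on a small made-up series:

```python
import numpy as np
import pandas as pd

s = pd.Series([0.13, np.nan, 0.29, -10.0])

# dropna() removes only the missing values, leaving the negatives alone
print(s.dropna())

# Chaining both filters reproduces the combined effect of s[s > 0]
cleaned = s.dropna()
cleaned = cleaned[cleaned > 0]
print(cleaned.tolist())
```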
Now, as you probably noticed, I added a lot of datetimes to this data which represent the timings of the measurements. Pandas uses these times as an index on the data, and gives us access to some very powerful tools.

For example, resampling our data to five-minute maxima is easy:


In [4]:
s.resample('5min').max()


Out[4]:
2015-11-16 15:40:00    0.21
2015-11-16 15:45:00    0.29
Freq: 5T, dtype: float32
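The resampler object supports many aggregations besides max. A quick sketch on a toy minutely series (the values and timestamps here are made up for illustration):

```python
import pandas as pd

idx = pd.date_range('2015-11-16 15:41', periods=8, freq='1min')
s = pd.Series([0.13, 0.21, 0.15, 0.10, 0.29, 0.09, 0.24, 0.18], index=idx)

# One resampler, several aggregations over the same 5-minute bins
r = s.resample('5min')
print(r.mean())
print(r.count())
```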

Another way of creating series is using dictionaries:


In [5]:
colours = pd.Series({'Blue': 42, 'Green': 12, 'Yellow': 37})
colours


Out[5]:
Blue      42
Green     12
Yellow    37
dtype: int64

We can index Series just like numpy arrays, or using the named index:


In [6]:
print(colours[1])
print(colours[:-1])
print(colours['Blue'])


12
Blue     42
Green    12
dtype: int64
42

Or both:


In [7]:
print(colours[1:]['Green'])


12
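In recent pandas versions, positional indexing with plain square brackets is deprecated in favour of the explicit accessors iloc (by position) and loc (by label). A sketch of the equivalent explicit forms:

```python
import pandas as pd

colours = pd.Series({'Blue': 42, 'Green': 12, 'Yellow': 37})

print(colours.iloc[1])                # position 1 -> Green's value
print(colours.loc['Blue'])            # label lookup
print(colours.iloc[1:].loc['Green'])  # mixing positional and label access
```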

Another nice benefit of the indices is in data alignment. For example, when performing operations on two Series, Pandas will line up the indices first:


In [8]:
more_colours = pd.Series({'Blue': 16, 'Red': 22, 
                          'Purple': 34, 'Green': 25,})

more_colours + colours


Out[8]:
Blue      58.0
Green     37.0
Purple     NaN
Red        NaN
Yellow     NaN
dtype: float64

As you can see, where an index is present in only one of the Series, Pandas returns NaN.
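If those NaNs aren't wanted, the add method takes a fill_value that substitutes for missing entries before adding. A sketch using the same two Series:

```python
import pandas as pd

colours = pd.Series({'Blue': 42, 'Green': 12, 'Yellow': 37})
more_colours = pd.Series({'Blue': 16, 'Red': 22, 'Purple': 34, 'Green': 25})

# Treat a missing entry as 0 instead of propagating NaN
total = colours.add(more_colours, fill_value=0)
print(total)
```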

Pandas uses numpy heavily underneath, so many of the numpy array operations work on Series as well:


In [9]:
colours.mean(), colours.max()


Out[9]:
(30.333333333333332, 42)
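Because Series wrap numpy arrays, numpy ufuncs also apply element-wise while preserving the index. A sketch:

```python
import numpy as np
import pandas as pd

colours = pd.Series({'Blue': 42, 'Green': 12, 'Yellow': 37})

# Element-wise square root; the result is still a labelled Series
roots = np.sqrt(colours)
print(roots)
```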

DataFrames

Data frames are essentially collections of Series with a shared index. Each column of data is labelled, and the whole frame can be pictured as a table or spreadsheet.


In [10]:
df = pd.DataFrame({'First': colours, 'Second': more_colours})
print(df)


        First  Second
Blue     42.0    16.0
Green    12.0    25.0
Purple    NaN    34.0
Red       NaN    22.0
Yellow   37.0     NaN

And can be indexed by column, or by row via the loc attribute (older pandas used ix, which has since been removed):


In [11]:
# Column by index
print(df['First'])


Blue      42.0
Green     12.0
Purple     NaN
Red        NaN
Yellow    37.0
Name: First, dtype: float64

In [12]:
# Column as attribute
print(df.First)


Blue      42.0
Green     12.0
Purple     NaN
Red        NaN
Yellow    37.0
Name: First, dtype: float64

In [ ]:
# Row via loc (ix is removed in modern pandas)
print(df.loc['Blue'])

We can then apply many of the same numpy functions on this data, on a per column basis:


In [13]:
df.max()


Out[13]:
First     42.0
Second    34.0
dtype: float64

In [14]:
df.sum()


Out[14]:
First     91.0
Second    97.0
dtype: float64
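These reductions run down each column by default; passing axis=1 (or axis='columns') reduces across each row instead. A sketch on a small frame like the one above:

```python
import pandas as pd

df = pd.DataFrame({'First': {'Blue': 42, 'Green': 12},
                   'Second': {'Blue': 16, 'Green': 25}})

print(df.sum())        # per column: one value per label in df.columns
print(df.sum(axis=1))  # per row: one value per label in df.index
```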

Reading Excel/CSV files


In [15]:
example_csv = pd.read_csv('../resources/B1_mosquito_data.csv', 
                          parse_dates=True, index_col=0)
example_csv[0:10]


Out[15]:
            temperature  rainfall  mosquitos
year
1960-01-01           87       270        122
1961-01-01           77       296        116
1962-01-01           87       281        117
1963-01-01           73       167         87
1964-01-01           79       166         91
1965-01-01           76       272        113
1966-01-01           77       140         87
1967-01-01           87       207        106
1968-01-01           86       266        115
1969-01-01           79       159         92

In [16]:
example_csv.corr()


Out[16]:
             temperature  rainfall  mosquitos
temperature     1.000000  0.137805   0.408380
rainfall        0.137805  1.000000   0.948906
mosquitos       0.408380  0.948906   1.000000
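Alongside corr(), describe() gives a quick statistical summary of every numeric column. A sketch on a toy frame standing in for the mosquito data (the values below are made up):

```python
import pandas as pd

# Toy frame with the same column names as the CSV above
df = pd.DataFrame({'temperature': [87, 77, 87, 73],
                   'rainfall': [270, 296, 281, 167]})

# count, mean, std, min, quartiles and max for each column
summary = df.describe()
print(summary)
```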

Using Pandas with CIS data

We can easily convert CIS data into pandas data to take advantage of this time-series functionality.


In [17]:
from cis import read_data_list

aerosol_cci_collocated = read_data_list('col_output.nc', '*')

cis_df = aerosol_cci_collocated.as_data_frame()
cis_df


Out[17]:
altitude latitude longitude AOD670 time AOD550 AOD870 AOD1600 NUMBER_CONCENTRATION
2008-04-11 20:01:30 233.4 64.8029 -147.8828 0.284781 149120.834375 0.270871 0.130443 0.284781 2191.0
2008-04-11 20:02:30 652.9 64.7594 -147.9471 0.294635 149120.835069 0.279516 0.133261 0.294635 581.0
2008-04-11 20:03:30 1099.1 64.7385 -148.0415 0.345198 149120.835764 0.328403 0.158271 0.345198 306.0
2008-04-11 20:04:30 1496.7 64.7820 -148.1129 0.391794 149120.836458 0.379837 0.196193 0.391794 281.0
2008-04-11 20:05:30 1877.5 64.8434 -148.1207 0.443872 149120.837153 0.432696 0.227793 0.443872 246.0
2008-04-11 20:06:30 2245.8 64.9073 -148.1275 0.520862 149120.837847 0.514773 0.283677 0.520862 230.0
2008-04-11 20:07:30 2638.7 64.9720 -148.1360 0.489666 149120.838542 0.483544 0.265761 0.489666 241.0
2008-04-11 20:08:30 3013.7 65.0354 -148.1538 0.265527 149120.839236 0.260520 0.140179 0.265527 197.0
2008-04-11 20:09:30 3275.1 65.1007 -148.1691 0.416466 149120.839931 0.407154 0.216467 0.416466 209.0
2008-04-11 20:10:30 3411.1 65.1700 -148.1806 0.397496 149120.840625 0.387216 0.203360 0.397496 194.0
2008-04-11 20:11:30 3412.2 65.2419 -148.1874 0.529464 149120.841319 0.513911 0.266541 0.529464 185.0
2008-04-11 20:12:30 3410.5 65.3126 -148.1910 0.506485 149120.842014 0.447898 0.122831 0.506485 184.0
2008-04-11 20:13:30 3412.0 65.3824 -148.1974 0.429768 149120.842708 0.399655 0.160455 0.429768 190.0
2008-04-11 20:14:30 3412.1 65.4528 -148.2055 0.542344 149120.843403 0.500470 0.197428 0.542344 173.0
2008-04-11 20:15:30 3411.5 65.5235 -148.2154 0.553993 149120.844097 0.531456 0.264294 0.553993 170.0
2008-04-11 20:16:30 3410.9 65.5938 -148.2256 0.453869 149120.844792 0.437072 0.220412 0.453869 175.0
2008-04-11 20:17:30 3409.3 65.6636 -148.2357 0.377955 149120.845486 0.364274 0.184259 0.377955 169.0
2008-04-11 20:18:30 3408.5 65.7332 -148.2463 0.591343 149120.846181 0.576690 0.304027 0.591343 162.0
2008-04-11 20:19:30 3407.7 65.8034 -148.2568 0.381449 149120.846875 0.361938 0.172673 0.381449 177.0
2008-04-11 20:20:30 3407.2 65.8740 -148.2664 0.381449 149120.847569 0.361938 0.172673 0.381449 194.0
2008-04-11 20:21:30 3407.3 65.9449 -148.2756 NaN 149120.848264 NaN NaN NaN 236.0
2008-04-11 20:22:30 3407.1 66.0160 -148.2849 NaN 149120.848958 NaN NaN NaN 360.0
2008-04-11 20:23:30 3406.0 66.0875 -148.2942 NaN 149120.849653 NaN NaN NaN 333.0
2008-04-11 20:24:30 3404.7 66.1592 -148.3032 NaN 149120.850347 NaN NaN NaN 307.0
2008-04-11 20:25:30 3405.3 66.2312 -148.3120 NaN 149120.851042 NaN NaN NaN 277.0
2008-04-11 20:26:30 3405.2 66.3033 -148.3210 NaN 149120.851736 NaN NaN NaN 268.0
2008-04-11 20:27:30 3404.9 66.3755 -148.3303 NaN 149120.852431 NaN NaN NaN 255.0
2008-04-11 20:28:30 3403.7 66.4471 -148.3402 NaN 149120.853125 NaN NaN NaN 252.0
2008-04-11 20:29:30 3402.5 66.5181 -148.3506 NaN 149120.853819 NaN NaN NaN 158.0
2008-04-11 20:30:30 3403.1 66.5886 -148.3616 NaN 149120.854514 NaN NaN NaN 151.0
... ... ... ... ... ... ... ... ... ...
2008-04-16 02:34:30 5326.5 66.4332 -150.5058 NaN 149125.107292 NaN NaN NaN 350.0
2008-04-16 02:35:30 5327.2 66.3617 -150.3813 NaN 149125.107986 NaN NaN NaN 321.0
2008-04-16 02:36:30 5328.5 66.2913 -150.2537 NaN 149125.108681 NaN NaN NaN 496.0
2008-04-16 02:37:30 5330.0 66.2213 -150.1258 NaN 149125.109375 NaN NaN NaN 403.0
2008-04-16 02:38:30 5331.2 66.1513 -149.9991 NaN 149125.110069 NaN NaN NaN 344.0
2008-04-16 02:39:30 5332.1 66.0805 -149.8781 NaN 149125.110764 NaN NaN NaN 236.0
2008-04-16 02:40:30 5334.5 66.0091 -149.7616 NaN 149125.111458 NaN NaN NaN 288.0
2008-04-16 02:41:30 5337.4 65.9380 -149.6455 NaN 149125.112153 NaN NaN NaN 544.0
2008-04-16 02:42:30 5316.7 65.8672 -149.5295 NaN 149125.112847 NaN NaN NaN 610.0
2008-04-16 02:43:30 5018.4 65.7983 -149.4151 NaN 149125.113542 NaN NaN NaN 377.0
2008-04-16 02:44:30 4771.4 65.7328 -149.3015 NaN 149125.114236 NaN NaN NaN 303.0
2008-04-16 02:45:30 4504.9 65.6694 -149.1929 NaN 149125.114931 NaN NaN NaN 231.0
2008-04-16 02:46:30 4238.8 65.6070 -149.0823 NaN 149125.115625 NaN NaN NaN 200.0
2008-04-16 02:47:30 4013.3 65.5502 -148.9587 NaN 149125.116319 NaN NaN NaN 302.0
2008-04-16 02:48:30 3787.6 65.4937 -148.8395 NaN 149125.117014 NaN NaN NaN 262.0
2008-04-16 02:49:30 3546.2 65.4354 -148.7259 NaN 149125.117708 NaN NaN NaN 555.0
2008-04-16 02:50:30 3328.0 65.3767 -148.6144 NaN 149125.118403 NaN NaN NaN 546.0
2008-04-16 02:51:30 3088.1 65.3185 -148.5031 NaN 149125.119097 NaN NaN NaN 538.0
2008-04-16 02:52:30 2736.2 65.2592 -148.3931 NaN 149125.119792 NaN NaN NaN 511.0
2008-04-16 02:53:30 2400.7 65.2034 -148.2816 NaN 149125.120486 NaN NaN NaN 187.0
2008-04-16 02:54:30 2020.8 65.1602 -148.1653 NaN 149125.121181 NaN NaN NaN 210.0
2008-04-16 02:55:30 1694.9 65.1254 -148.0474 NaN 149125.121875 NaN NaN NaN 222.0
2008-04-16 02:56:30 1441.9 65.0882 -147.9370 NaN 149125.122569 NaN NaN NaN 230.0
2008-04-16 02:57:30 1232.0 65.0506 -147.8268 NaN 149125.123264 NaN NaN NaN 236.0
2008-04-16 02:58:30 1199.5 65.0156 -147.7097 NaN 149125.123958 NaN NaN NaN 237.0
2008-04-16 02:59:30 1091.9 64.9706 -147.6263 NaN 149125.124653 NaN NaN NaN 226.0
2008-04-16 03:00:30 944.9 64.9277 -147.6486 NaN 149125.125347 NaN NaN NaN 219.0
2008-04-16 03:01:30 682.1 64.8948 -147.7128 NaN 149125.126042 NaN NaN NaN 221.0
2008-04-16 03:02:30 435.8 64.8635 -147.7722 NaN 149125.126736 NaN NaN NaN 220.0
2008-04-16 03:03:30 203.5 64.8336 -147.8285 NaN 149125.127431 NaN NaN NaN 187.0

1268 rows × 9 columns


In [18]:
# Now we can do cool Pandas stuff!
# (loc/idxmin replace the old ix/argmin, which returned the index label)
cis_df.loc[cis_df['NUMBER_CONCENTRATION'].idxmin()]


Out[18]:
altitude                  2916.100000
latitude                    73.281700
longitude                 -153.326600
AOD670                            NaN
time                    149121.992708
AOD550                            NaN
AOD870                            NaN
AOD1600                           NaN
NUMBER_CONCENTRATION        17.000000
Name: 2008-04-12 23:49:30, dtype: float64

In [19]:
cis_short = cis_df.dropna()

In [20]:
cis_short.loc[cis_short['NUMBER_CONCENTRATION'].idxmin()]


Out[20]:
altitude                  2587.800000
latitude                    65.139400
longitude                 -147.897000
AOD670                       0.565450
time                    149121.878125
AOD550                       0.566969
AOD870                       0.326906
AOD1600                      0.565450
NUMBER_CONCENTRATION       139.000000
Name: 2008-04-12 21:04:30, dtype: float64
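Note that dropna() discards a row if any column is NaN; the subset argument restricts the check to particular columns. A sketch on a tiny made-up frame with the same column names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'AOD550': [0.27, np.nan, 0.33],
                   'NUMBER_CONCENTRATION': [2191.0, 581.0, np.nan]})

print(df.dropna())                   # only rows complete in every column
print(df.dropna(subset=['AOD550']))  # require AOD550 only
```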

Exercise

In pairs, plot probability distributions (use kde) of the raw, 10-minute averaged and 2-hour averaged number concentrations.


In [21]:
%matplotlib inline

cis_df['NUMBER_CONCENTRATION'].plot(kind='kde', xlim=[0,1000], label='Raw')
cis_df['NUMBER_CONCENTRATION'].resample('10min').mean().plot(kind='kde', label='10min')
ax=cis_df['NUMBER_CONCENTRATION'].resample('120min').mean().plot(kind='kde', label='120min')
ax.legend()


Out[21]:
<matplotlib.legend.Legend at 0x114628470>

Extras


In [22]:
from pandas.plotting import scatter_matrix  # pandas.tools.plotting in older versions
m = scatter_matrix(cis_df, alpha=0.2, figsize=(8, 8), diagonal='kde', edgecolors='none')