This notebook is an introduction to the pandas library for Python. We'll go over the data structures of pandas and explore some of their uses. We'll then play around with some data from a mobile gamma-ray detector array to look at some real uses of DataFrames.
- Pandas site
- Wes's Book
- Other pandas tutorials I found online with a 5 minute google search
- Stack Overflow
I'll just let the Pandas website do the talking here...
pandas consists of the following things:
- A set of labeled array data structures, the primary of which are Series and DataFrame
  - Series: 1D labeled, homogeneously typed array
  - DataFrame: 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns
  - Panel: general 3D labeled, also size-mutable array (since deprecated and removed from pandas)
- Index objects enabling both simple axis indexing and multi-level / hierarchical axis indexing
- An integrated group by engine for aggregating and transforming data sets
- Date range generation (date_range) and custom date offsets enabling the implementation of customized frequencies
- Input/Output tools: loading tabular data from flat files (CSV, delimited, Excel 2003), and saving and loading pandas objects from the fast and efficient PyTables/HDF5 format
- Memory-efficient “sparse” versions of the standard data structures for storing data that is mostly missing or mostly constant (some fixed value)
- Moving window statistics (rolling mean, rolling standard deviation, etc.)
- Static and moving window linear and panel regression
Pandas has a truly incredible amount of functionality; however, accessing a lot of it requires understanding the nuances of Index objects and knowing that the functionality exists in the first place.
In [1]:
import numpy as np
import pandas as pd
In [ ]:
sadstring = 'sadnap'
mfs = pd.Series(list(sadstring))
mfs
The .values attribute will return the underlying numpy array or recarray
In [ ]:
mfs.values
The associated index can be viewed. There are also plenty of attributes for working with more complicated indices.
You can also write yourself a new index.
In [ ]:
mfs.index
In [ ]:
mfs.index = range(5,11)
mfs
Grabbing an element of a Series is as easy as indexing by its label.
In [ ]:
mfs[5]
.loc and .iloc can be used to slice in different ways too (label-based and position-based, respectively; the old .ix mixed the two and has since been removed from pandas). More about this in a bit.
In [ ]:
mfs.loc[10]
In [ ]:
mfs.loc[8:10]  # label slices include both endpoints
In [ ]:
mfs.iloc[:3]  # positional slices exclude the endpoint, numpy-style
Passing a list of labels works as you'd expect. The ordering of the list is preserved in the result!
In [ ]:
listoflabels = np.arange(10,4,-1)
mfs[listoflabels]
Adding a few items is straightforward.
In [ ]:
mfs[4] = ' '
mfs[0] = ' wants friend'
In [ ]:
mfs.loc[[5,6,7,4,10,9,8,7,6,0]].tolist()
In [ ]:
''.join(_)  # _ is IPython's handle on the previous cell's output (the list above)
In [ ]:
animalsdict = {'moose': 1220,
               'buffalo': 774,
               'chicken': 13,
               'sadpanda': 97,
               'platypus': 2084}
mss = pd.Series(animalsdict)
mss.name = 'CoolFactor'
mss
In [ ]:
mss * 2.41
In [ ]:
mss>555
In [ ]:
mss[mss>555]
Boolean masking works with a boolean Series or a boolean array of the appropriate length.
The .apply method applies an arbitrary function to each labeled element. When we get to 2D structures you can apply to rows or columns if you'd like.
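For instance, here's a minimal sketch of masking with a plain boolean array rather than a boolean Series (the array just has to match the length of mss):
In [ ]:
# a raw boolean numpy array of matching length works just like a boolean Series
mss[np.array([True, False, True, False, True])]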
In [ ]:
mss[mss>555].apply(np.random.poisson)
In [ ]:
np.min(mss)
In [ ]:
mss.idxmin()  # label of the minimum; older pandas called this argmin
In [ ]:
mss.describe()
In [ ]:
'chicken' in mss
In [ ]:
13 in mss  # `in` checks the index, not the values, so this is False
In [ ]:
mts = pd.Series({'chicken': 1.2,
                 'moose': 30.77,
                 'elk': 22.88,
                 'marmot': 4.49,
                 'platypus': 14.9,
                 'humans': 100.,
                 'sadpanda': 0.9})
mts.name = 'NormalizedRadFactor'
mts
In [ ]:
coolness = mss / mss.chicken
radness = mts
print(coolness)
print(radness)
In [ ]:
animalBitCoinValue = (radness + coolness * radness) / radness.max()
animalBitCoinValue
In [ ]:
animalBitCoinValue.isnull()
In [ ]:
prizedAnimals = animalBitCoinValue[animalBitCoinValue>17.]
print(prizedAnimals)
In [ ]:
animalBitCoinValue.fillna(1.)
I personally end up building DataFrames from np.arrays, dicts of arrays, and lists of named Series, and sometimes by concatenating lists of DataFrames.
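For instance, a quick sketch of the named-Series route using the two Series we already have (their .name attributes become the column labels):
In [ ]:
# concatenate named Series side by side; the names become the column labels
pd.concat([mss, mts], axis=1)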
Let's make a DataFrame (df) with a dict.
In [ ]:
data = {'state': ['Ohio','California','Indiana','Idaho'],
        'year': [2000,2001,2002,2004],
        'pop': [23.4,34.1,7.8,1.3]}
df = pd.DataFrame(data)
df
.index gives us the row index
.columns gives the column index
In [ ]:
print(df.index)
print(' ')
print(df.columns)
In [ ]:
df['state']
In [ ]:
df.year
In [ ]:
df2 = pd.DataFrame(data, columns=['pop','state','year','debt'], index=range(1,5))
df2
In [ ]:
df2[1]  # raises a KeyError: [] on a DataFrame selects columns, and there's no column labeled 1
In [ ]:
print(df2)
In [ ]:
df2.loc[1]
In [ ]:
df2.iloc[0]
In [ ]:
df2.loc[1, 'state']  # .loc replaces the old .ix for label-based lookup
In [ ]:
df2['state'][1]  # chained indexing; fine for reading, but prefer .loc when setting values
In [ ]:
df2.loc[1:3,'year']
In [ ]:
df2['debt'] = 22
df2
In [ ]:
df2.loc[3,'debt'] = 0.22
df2.loc[2,'debt'] = np.nan
df2
In [ ]:
df2.fillna(321)
In [ ]:
df2.values
In [ ]:
df2.reindex(range(6), fill_value=0.1)
In [ ]:
df2
Loads of pandas functions return a new object; the original is left intact. Sometimes you'll want to overwrite the old object; in that case, also pass the inplace=True flag (see the sketch after the next cell).
In [ ]:
df2.drop(2)
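For comparison, a minimal sketch of the inplace variant, done on a copy here so df2 survives for the cells below:
In [ ]:
tmp = df2.copy()
tmp.drop(2, inplace=True)  # returns None and mutates tmp directly
tmp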
Some functionality can be applied to either rows or columns. To break the ambiguity there is the axis flag: axis=0 corresponds to rows, axis=1 to columns.
In [ ]:
df2.drop('year', axis=1)
In [ ]:
df2.loc[[2,4],['pop','state']] = [[11.1,'Maine'],[87,'Canada']]
print(df2)
This data comes from a mobile array of High Purity Germanium (HPGe) Detectors. It is measuring the gamma-ray background. In this background we'll have events across a range of energies, and some peaks at a few characteristic energies.
Please do not distribute this data (not that you would want to). While this is part of a semi-open dataset, access should be pursued through the formal channels.
https://dl.dropboxusercontent.com/u/4558549/THWPasses_segmented.hdf5
In [2]:
import matplotlib.pyplot as plt
%matplotlib inline
import mpld3
mpld3.enable_notebook()
Pandas can write objects to loads of different formats, and read them back in from a variety of formats; the amount of functionality in read_csv alone is great!
Pandas also has PyTables integration! This means that you can drop your Series or DataFrames directly into an HDF5 file (called an HDFStore in pandas lingo). A sketch of the writing side is below; after that, let's open the Store of data we'll use.
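A minimal sketch, with a made-up file name and tiny frame (not part of this dataset):
In [ ]:
# hypothetical example: drop a DataFrame into an HDFStore and pull it back out
example = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})
with pd.HDFStore('example.hdf5', 'w') as store:
    store['mydf'] = example  # stored under the key '/mydf'
roundtrip = pd.read_hdf('example.hdf5', 'mydf')
# read_csv is similarly one line, e.g. pd.read_csv('data.csv', index_col=0, parse_dates=True)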
In [3]:
infile = pd.HDFStore('THWPasses_segmented.hdf5','r')
In [5]:
print(infile)
In [6]:
keys = infile.keys()
print(keys)
So we've got a few dataframes hiding inside
In [7]:
gpsdf = infile[keys[0]]    # the GPS track table (lat/lon fixes)
gammadf = infile[keys[1]]  # the gamma-ray event table (energy, detch, timestamp, ...)
infile.close()
In [ ]:
gammadf.head(10)
In [ ]:
gpsdf.head(10)
gammadf.detch.value_counts()
Looks like two outliers... Let's investigate by looking at the spectrum of each of our detectors.
In [ ]:
fullhistogram = gammadf.groupby('detch').energy.apply(lambda x: np.histogram(x, bins=range(0,3000,1))[0])
fullhistogram.head()
Groupby is quite powerful once you get the hang of it. Here we've grouped our gamma-ray events by channel number, then applied a function to each entire group; in this case we've histogrammed.
The result of applying this function to each group is returned as a Series (or a DataFrame if the result has length), indexed by the grouped variable (detch). In my lambda function I've selected only the binned counts and not the bin edges themselves.
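As an aside, groupby objects also support agg for quick summaries without writing a lambda (a small sketch, not needed for the pipeline below):
In [ ]:
# per-detector event count and mean deposited energy
gammadf.groupby('detch').energy.agg(['count', 'mean'])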
While handy, a Series of lists isn't super helpful, but we can apply pd.Series to each list and make ourselves a DataFrame.
In [ ]:
fullhistogram = fullhistogram.apply(pd.Series)
In [ ]:
fullhistogram.head()
In [ ]:
fullhistogram.T.plot(figsize=(12,8))
Maybe we could normalize by number of events... and look at only the 1460 keV peak.
In [ ]:
fullhistogram.div(gammadf.detch.value_counts(),axis=0).loc[:,1450:1470].T.plot(figsize=(12,8))
plt.xlabel('Energy (keV)')
plt.ylabel('Normalized counts')
Great, all of the detectors are working nicely.
In [123]:
fullhistogram = gammadf.groupby(['detch','segment']).energy.apply(lambda x: np.histogram(x, bins=range(0,3000,1))[0])
fullhistogram = fullhistogram.apply(pd.Series)
fullhistogram.head()
Let's sort by timestamp so that we have sequential data; we'll also reset the index.
In [8]:
gammadf.sort_values(by='timestamp', inplace=True)
gpsdf.sort_values(by='timestamp', inplace=True)
gammadf.reset_index(inplace=True, drop=True)
gpsdf.reset_index(inplace=True, drop=True)
How about the GPS data?
In [ ]:
gpsdf.describe()
Looks like there isn't much variance in the GPS positions. Let's convert to meters in Cartesian space to have a look!
In [11]:
# in meters
R_earth = 6371000

def haversine(lat1deg, lon1deg, lat2deg, lon2deg):
    """
    Calculate the great circle distance between two points in meters
    and the bearing direction on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians
    lon1 = np.deg2rad(lon1deg)
    lat1 = np.deg2rad(lat1deg)
    lon2 = np.deg2rad(lon2deg)
    lat2 = np.deg2rad(lat2deg)
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    # distance
    dist = R_earth * c
    # bearing direction
    direction = np.arctan2(np.sin(dlon) * np.cos(lat2),
                           np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(dlon))
    return dist, direction

vhaversine = np.vectorize(haversine)

def convertLatLon(originlat, originlon, lat, lon):
    # convert a lat/lon pair to local Cartesian (y, x) meters relative to an origin
    r, heading = vhaversine(originlat, originlon, lat, lon)
    x = r * np.sin(heading)
    y = r * np.cos(heading)
    return y, x

# (everything above is ufunc-based, so this extra vectorize is belt-and-suspenders)
vconvertLatLon = np.vectorize(convertLatLon)
In [12]:
y,x = vconvertLatLon(gpsdf.lat.min(), gpsdf.lon.min(), gpsdf.lat, gpsdf.lon)
In [13]:
gpsdf['x'] = x
gpsdf['y'] = y
gpsdf.head()
In [21]:
gpssegments = gpsdf.groupby('segment')
plt.figure(figsize=(11,7))
for segname, segdat in gpssegments:
    plt.plot(segdat.x, segdat.y, label=segname)
There is definitely some noise in the GPS; applying a filter could be a good idea, but for now let's just select the eastbound segments.
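For reference, a minimal sketch of what such a filter might look like, using a per-segment rolling median (the window size is a guess, and we won't actually use the smoothed values below):
In [ ]:
# hypothetical smoothing: 5-point rolling median of each segment's track
smoothed = gpsdf.groupby('segment')[['x', 'y']].apply(
    lambda seg: seg.rolling(5, center=True, min_periods=1).median())
smoothed.head()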
In [23]:
eastbound = gpsdf.groupby('segment').apply(lambda aa: aa.x.iloc[-1] - aa.x.iloc[0]) > 0
In [47]:
eastsegs = eastbound[eastbound].index  # labels of the eastbound segments
gammadfE = gammadf.set_index('segment').loc[eastsegs].reset_index()  # the old .ix took the boolean Series directly
gpsdfE = gpsdf.set_index('segment').loc[eastsegs].reset_index()
OK, perhaps distance down the road would let us aggregate over chunks of distance and look at the variability of natural radioactivity.
In [48]:
gpsdfE['distance'] = np.sqrt(gpsdfE.x**2 + gpsdfE.y**2)
In [49]:
from datetime import datetime
gpsdfE.timestamp.apply(datetime.fromtimestamp).head()
OK, that works. Let's make it a DatetimeIndex and see what we can do!
In [51]:
gpsdfEts = gpsdfE.set_index(gpsdfE.timestamp.apply(datetime.fromtimestamp))
gpsdfEts.head()
In [52]:
gpsdfEts.loc['2012-03-20'].tail()
We can resample on our DatetimeIndex and specify the function to apply during the resampling. Dropping all of the empty days with dropna(), we can look at which days we have eastbound data.
In [53]:
gpsdfEts.resample('D').mean(numeric_only=True).dropna()  # modern pandas: aggregate on the resampler (the old how= keyword is gone)
In this example we'll join our eastbound gamma-ray data (all columns) with the eastbound GPS data (distance and time).
We'll use the 'timestamp' column as the joining variable and perform an outer join, meaning we keep all timestamp values from both dataframes.
In [93]:
lmjoined = gammadfE.merge(gpsdfE[['distance','timestamp']], on='timestamp', how='outer')
In [94]:
lmjoined.head()
In [95]:
lmjoined.tail()
In [65]:
lmjoined.timestamp.diff().describe()
Looks like we should sort again before doing anything.
In [96]:
lmjoined = lmjoined.sort_values('timestamp').reset_index(drop=True)
Forward- and backward-fill work as expected; there are interpolation tools as well.
We'll fill the distance column forward first (in our time-sorted frame), then backwards to catch the straggling events at the start. This lets us label each gamma-ray event with the corresponding distance coordinate.
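For reference, a minimal sketch of the interpolation route we're not taking here (linear interpolation would estimate intermediate distances instead of holding the last GPS fix):
In [ ]:
# hypothetical alternative to ffill/bfill: linearly interpolate the gaps
lmjoined.distance.interpolate(method='linear', limit_direction='both').head()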
In [97]:
# assign back rather than filling "inplace" on a selected column (which can act on a copy)
lmjoined['distance'] = lmjoined['distance'].ffill()
lmjoined['distance'] = lmjoined['distance'].bfill()
Now any row with np.nan is a GPS time point that has no gamma-ray data. Let's drop these.
In [99]:
lmjoined.dropna(inplace=True)
lmjoined.head()
Next, let's bucket the distance values so we can group on them. With that we can then look at variability in the gamma-ray background.
In [101]:
stepsize = 20
lmjoined['distchunk'] = (lmjoined.distance // stepsize).astype(int) * stepsize + stepsize/2
I like this trick: floor-dividing with //, casting to int, and scaling back up makes a nice arbitrary-unit value for grouping.
Another way is using pd.cut to label the data. Perhaps silly here, but useful in other circumstances.
In [102]:
labeledges = np.arange(0,lmjoined.distance.max()+stepsize,stepsize)
labels = (labeledges[:-1] + labeledges[1:])/2
lmjoined['distchunk2'] = pd.cut(lmjoined.distance, labeledges, labels=labels, include_lowest=False)
In [103]:
lmjoined.head()
In [104]:
segdistgroup = lmjoined.groupby(['segment','distchunk'])
histdata = segdistgroup.energy.apply(lambda x: 1. * np.histogram(x,bins=np.arange(0,3000,1))[0]).apply(pd.Series)
deltaT = segdistgroup.timestamp.max() - segdistgroup.timestamp.min()
numdetectors = segdistgroup.detch.apply(lambda x: len(x.unique()))
In [105]:
# the indexes already line up on (segment, distchunk), so a plain axis-0 divide aligns correctly
normhist = histdata.div(deltaT, axis=0)
normhist = normhist.div(numdetectors, axis=0)
In [120]:
normhist.iloc[::20].loc[:, 1450:1470].T.plot(figsize=(11,7))  # every 20th row by position, then columns by label (.ix is gone from modern pandas)
In [135]:
peakratesdict = {}
peakratesdict['k'] = normhist.loc[:, 1460-3:1460+3].sum(axis=1)   # 1460 keV 40K line
peakratesdict['pos'] = normhist.loc[:, 511-3:511+3].sum(axis=1)   # 511 keV annihilation line
peakratesdict['tl'] = normhist.loc[:, 2614-3:2614+3].sum(axis=1)  # 2614 keV 208Tl line
peakratesdict['bi'] = normhist.loc[:, 609-3:609+3].sum(axis=1)    # 609 keV 214Bi line
peakrates = pd.DataFrame(peakratesdict)
In [136]:
peakrates.head()
In [137]:
peakrates = peakrates.reorder_levels([1,0]).sort_index()
peakrates.head()
You can 'unstack' and 'stack' different levels of multi-indexed data frames, shifting a level of labels from the rows to the columns or vice versa (a round-trip sketch follows the next cell).
In [153]:
peakrates.unstack(level=0).head()
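As a quick sanity check (just a sketch), stacking the innermost column level (distchunk) back down should undo the unstack:
In [ ]:
# round trip: move the distchunk column level back to the rows
peakrates.unstack(level=0).stack().head()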
In [154]:
peakrates.unstack(level=0).apply(np.mean,axis=0)
In [151]:
fig = plt.figure()
ax = fig.add_subplot(111)
peakrates.unstack(level=0).apply(np.mean,axis=0).unstack(level=0).plot(ax=ax)