pandas

Pediatric autoimmune neuropsychiatric disorder associated with group A streptococci (PANDAS) is a term used to describe a subset of children whose symptoms of obsessive compulsive disorder (OCD) or tic disorders are exacerbated by group A streptococcal (GAS) infection...

pandas provides high-performance, easy-to-use data structures and data analysis tools.

  • A widely used tool in the data science community; data types are not restricted to the strictly numerical

  • Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;

  • Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;

  • Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;

  • and more
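As a quick taste of the label-based alignment and split-apply-combine machinery from the list above, here is a minimal sketch on made-up data (s1, s2 and df_demo are invented for illustration and are not part of the GPS example below):

import pandas as pd

# label-based alignment: the two series only share the labels 'b' and 'c',
# so the sum is NaN for labels that appear in just one of them
s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])
print(s1 + s2)

# split-apply-combine: split the rows by 'key', apply the mean to each
# group, and combine the results into a new frame
df_demo = pd.DataFrame({'key': ['x', 'y', 'x', 'y'],
                        'value': [1.0, 2.0, 3.0, 4.0]})
print(df_demo.groupby('key').mean())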

pandas does not implement significant modeling functionality outside of linear and panel regression; for this, look to statsmodels and scikit-learn. More work is still needed to make Python a first-class statistical modeling environment, but we are well on our way toward that goal.

pandas is under continuous development, and improvements are added on a weekly basis. As with many other open-source projects, you can also learn a lot just by keeping up to date with the development (as a minimum, read the release notes of every new release).


In [1]:
import numpy as np
import pandas as pd
import csv
from matplotlib import pyplot as plt
from datetime import datetime

Let's load the CSV GPS data that was introduced before


In [2]:
fname = 'gps.csv'
df = pd.read_csv(fname, header=0, skiprows=[1], quoting=2)
%timeit pd.read_csv(fname, header=0, skiprows=[1], quoting=2)


10 loops, best of 3: 129 ms per loop

Although it is not relevant here, since there are no quotes in the data set, we can also use the CSV dialect option.


In [3]:
dia = csv.excel()
dia.quoting = csv.QUOTE_NONE
df = pd.read_csv(fname, header=0, skiprows=[1], dialect=dia)

Yes, very nice, but what does pandas actually understand about the data types in this case?


In [4]:
df.dtypes


Out[4]:
Time            object
Latitude N     float64
Longitude E    float64
orientation    float64
dtype: object

Time is an object. I guess that's nice and general... Let's see if we can force it into a datetime object instead


In [5]:
df = pd.read_csv(fname, header=0, skiprows=[1], parse_dates=[0])
df.dtypes


Out[5]:
Time           datetime64[ns]
Latitude N            float64
Longitude E           float64
orientation           float64
dtype: object

Automatic date parsing is really nice and easy, but it is SLOOOOWWWW. How slow exactly?


In [6]:
%timeit pd.read_csv(fname, header=0, skiprows=[1], parse_dates=[0])


1 loops, best of 3: 13.6 s per loop

Can we do that any faster? How about manually defining the date parsing function?


In [7]:
date = '20131211 18:29:30'
print date[:4], date[4:6], date[6:8], date[9:11], date[12:14], date[15:17]
print datetime(int(date[:4]), int(date[4:6]), int(date[6:8]), int(date[9:11]), int(date[12:14]), int(date[15:17]))
print datetime.strptime(date, "%Y%m%d %H:%M:%S")


2013 12 11 18 29 30
2013-12-11 18:29:30
2013-12-11 18:29:30

The read_csv function has a handy date_parser argument that allows users to define their own date parsing function. The default one is:


In [8]:
import dateutil.parser as parser
def date_parser_default(date):
    # dateutil inspects every single string to infer its format
    date = parser.parse(date)
    return date

Because the parser tries to infer the date format from every single value, it is very slow. If we specify how to read the date field, it can go a lot faster. As a bonus, we could include any other correction or transformation operation on the date.


In [9]:
from datetime import datetime
def date_parser(date):
    # create a datetime instance and assume constant formatting
    # format: 20131211 18:29:30
    return datetime(int(date[:4]), int(date[4:6]), int(date[6:8]), int(date[9:11]), int(date[12:14]), int(date[15:17]))
df = pd.read_csv(fname, header=0, skiprows=[1], parse_dates=[0], date_parser=date_parser)

In [10]:
%timeit pd.read_csv(fname, header=0, skiprows=[1], parse_dates=[0], date_parser=date_parser)


1 loops, best of 3: 603 ms per loop

Looks a bit ugly; isn't the next one more Pythonic?


In [11]:
def date_parser_2(date):
    return datetime.strptime(date, "%Y%m%d %H:%M:%S")
%timeit pd.read_csv(fname, header=0, skiprows=[1], parse_dates=[0], date_parser=date_parser_2)


1 loops, best of 3: 2.62 s per loop

More compact, but unfortunately not as fast.
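A related option, not timed here, is to skip parse_dates entirely, read Time as a plain string column, and convert it afterwards with pd.to_datetime and an explicit format string; how this compares speed-wise depends on your pandas version, so treat it as a sketch rather than a benchmark:

df = pd.read_csv(fname, header=0, skiprows=[1])
# vectorized conversion with a fixed format, no per-row Python function call
df['Time'] = pd.to_datetime(df['Time'], format='%Y%m%d %H:%M:%S')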


In [12]:
df.dtypes


Out[12]:
Time           datetime64[ns]
Latitude N            float64
Longitude E           float64
orientation           float64
dtype: object

Printing DataFrames

What does the raw data actually look like when printing the pandas DataFrame?


In [13]:
print df[:10]


                 Time  Latitude N  Longitude E  orientation
0 2013-11-14 06:05:19   42.604009    13.369236        327.7
1 2013-11-14 06:05:48   42.604160    13.369462        325.4
2 2013-11-14 06:06:19   42.604325    13.369717        323.7
3 2013-11-14 06:06:48   42.604478    13.369960        325.6
4 2013-11-14 06:07:19   42.604640    13.370227        327.5
5 2013-11-14 06:07:49   42.604794    13.370491        327.8
6 2013-11-14 06:08:18   42.604941    13.370746        330.4
7 2013-11-14 06:08:49   42.605097    13.371018        331.2
8 2013-11-14 06:09:19   42.605247    13.371289        331.1
9 2013-11-14 06:09:49   42.605393    13.371561        331.6

[10 rows x 4 columns]

However, there is also some built-in IPython notebook display magic that goes like this


In [14]:
df[:10]


Out[14]:
                 Time  Latitude N  Longitude E  orientation
0 2013-11-14 06:05:19   42.604009    13.369236        327.7
1 2013-11-14 06:05:48   42.604160    13.369462        325.4
2 2013-11-14 06:06:19   42.604325    13.369717        323.7
3 2013-11-14 06:06:48   42.604478    13.369960        325.6
4 2013-11-14 06:07:19   42.604640    13.370227        327.5
5 2013-11-14 06:07:49   42.604794    13.370491        327.8
6 2013-11-14 06:08:18   42.604941    13.370746        330.4
7 2013-11-14 06:08:49   42.605097    13.371018        331.2
8 2013-11-14 06:09:19   42.605247    13.371289        331.1
9 2013-11-14 06:09:49   42.605393    13.371561        331.6

10 rows × 4 columns

Plotting

pandas plotting functions use matplotlib under the hood. Notice the fancy default pandas plotting style.


In [15]:
pd.options.display.mpl_style = 'default'
ax = df.plot()



In [16]:
plt.rcParams['figure.figsize'] = 12, 8
ax = df.plot(subplots=True, x='Time')



In [17]:
ax = df.plot(x='Latitude N', y='Longitude E', kind='scatter')


Observations regarding the data set:

  • Latitude and longitude could only have jumped around like this if we had used beam.me.up() from the Scotty package, which we didn't use here

  • Orientation should have a 0-360 range

  • Sampling rate is not constant

Let's try to deal with these issues.

First, index the data using the Time column in order to use more advanced selection, interpolation and group by functions provided by pandas.

Second, remove bad measurement samples by applying selection criteria to the data set.

Third, re-sample the data set.

Fourth, fill in the gaps by interpolating the data so that we have a constant sample rate. See also the Working with missing data entry in the pandas manual.
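As a preview, the last two steps could look roughly like the sketch below (df_sel is the cleaned data set constructed in the Data selection section further down; the '30S' rule is just an example rate, and note that older pandas versions aggregate directly inside resample() while newer ones require chaining an explicit .mean()):

# put the samples on a regular 30-second grid (mean per bin); bins without
# samples come out as NaN, which the time-based interpolation then fills
df_regular = df_sel.set_index('Time').resample('30S').mean()
df_regular = df_regular.interpolate(method='time')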

Indexing


In [18]:
df_indexed = df.set_index('Time')
print df_indexed[:3]
print df[:3]


                     Latitude N  Longitude E  orientation
Time                                                     
2013-11-14 06:05:19   42.604009    13.369236        327.7
2013-11-14 06:05:48   42.604160    13.369462        325.4
2013-11-14 06:06:19   42.604325    13.369717        323.7

[3 rows x 3 columns]
                 Time  Latitude N  Longitude E  orientation
0 2013-11-14 06:05:19   42.604009    13.369236        327.7
1 2013-11-14 06:05:48   42.604160    13.369462        325.4
2 2013-11-14 06:06:19   42.604325    13.369717        323.7

[3 rows x 4 columns]
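With the DatetimeIndex in place, label-based selection becomes available, including partial string indexing; a small sketch (the dates are taken from the rows printed above):

# select all samples from a single day, or from a whole month, by label
print(df_indexed.loc['2013-11-14'][:3])
print(df_indexed.loc['2013-12'][:3])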

Data selection


In [19]:
# keep only the samples that pass the latitude criterion; the rows that
# "teleported" are dropped here
df_sel = df[df['Latitude N'] > -1.0]

# selections on the parsed dates work just as well
df_sel2 = df[df.Time > datetime(2014, 1, 1)]

# equivalent alternatives include df.query("orientation > 360.0"),
# combined boolean masks such as
# df[(df['Latitude N'] < -1.0) & (df.orientation < 360.0)],
# and df.mask() / df.where() to replace bad values with NaN
# instead of dropping the rows

Plotting


In [20]:
ax = df_sel.plot(subplots=True, x='Time')



In [21]:
ax = df_sel.plot(x='Latitude N', y='Longitude E', kind='scatter')


Re-sampling

See the API reference for more details on resample()


In [22]:
# re-sample all columns to an hourly rate; each hourly value is the mean
# of the samples falling within that hour (the default aggregation)
df_resample = df_sel.set_index('Time').resample('H')
# let's have a look at the first and last rows
print df_resample[0:5]
print df_resample[-5:]


                     Latitude N  Longitude E  orientation
Time                                                     
2013-11-14 06:00:00   42.612452    13.383209   326.466364
2013-11-14 07:00:00   42.629733    13.411938   303.714167
2013-11-14 08:00:00   42.635881    13.437030   111.726667
2013-11-14 09:00:00   42.644507    13.449025   180.421667
2013-11-14 10:00:00   42.634876    13.455481    93.121667

[5 rows x 3 columns]
                     Latitude N  Longitude E  orientation
Time                                                     
2014-01-08 12:00:00   42.484987    13.468480    46.146667
2014-01-08 13:00:00   42.484981    13.468490    46.170000
2014-01-08 14:00:00   42.484986    13.468482    46.866667
2014-01-08 15:00:00   42.484988    13.468489    46.880833
2014-01-08 16:00:00   42.484981    13.468489    46.833735

[5 rows x 3 columns]
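The hourly values above are per-bin means, but nothing forces that choice; other statistics work the same way. A sketch, assuming a newer pandas where the aggregation is chained onto the resampler explicitly:

hourly = df_sel.set_index('Time').resample('H')
print(hourly.mean()[:3])  # the hourly means, as above
print(hourly.max()[:3])   # hourly maxima instead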

Group by

Grouping the data per time interval.


In [23]:
df_indexed = df_sel.set_index('Time')
# or use x.second, x.minute, x.hour, x.day, x.week, x.month, x.year 
df_grouped = df_indexed.groupby(lambda x: x.month)
for name, group in df_grouped:
    group.plot(x='Latitude N', y='Longitude E', kind='scatter')



In [24]:
df_grouped_date = df_indexed.groupby(lambda x: x.week)
for name, group in df_grouped_date:
    print name, group.orientation.count()
    group.plot()


1 20154
2 7658
46 10790
47 20094
48 20139
49 20153
50 18159
51 8822
52 14804

In [25]:
df_grouped = df_indexed.groupby(lambda x: x.month)
for name, group in df_grouped:
    print '%3i %6i %6.1f %6.1f' % (name, group.orientation.count(),
                                   group.orientation.min(),
                                   group.orientation.max())
    ax = group.orientation.hist(alpha=0.6, label=str(name), bins=range(0,361,20))
ax.legend(loc='best')
ax.set_xlim([0,360])


  1  22052   37.1  141.4
 11  48143    0.0  359.9
 12  70578    0.0  360.0
Out[25]:
(0, 360)

Is the above histogram presenting a fair comparison between the three months? If not, can you use simple pandas methods to reduce the bias that one month possibly has over another?
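One way to reduce the bias, sketched below: normalize each month's histogram to unit area, so that the month with the most samples no longer dominates (normed=True in the matplotlib of this era, density=True in newer versions); down-sampling every month to an equal number of rows would be another option.

for name, group in df_grouped:
    # same histogram as above, but with each month normalized to unit area
    ax = group.orientation.hist(alpha=0.6, label=str(name),
                                bins=range(0, 361, 20), normed=True)
ax.legend(loc='best')
ax.set_xlim([0, 360])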

