pandas

Pediatric autoimmune neuropsychiatric disorder associated with group A streptococci (PANDAS) is a term used to describe a subset of children whose symptoms of obsessive compulsive disorder (OCD) or tic disorders are exacerbated by group A streptococcal (GAS) infection...

pandas provides high-performance, easy-to-use data structures and data analysis tools.

  • A widely used tool in the data science community; data types are not restricted to the strictly numerical

  • Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;

  • Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;

  • Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;

  • and more
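As a quick taste of the label-based alignment and split-apply-combine machinery from the list above, here is a minimal sketch on made-up data (s1, s2 and df_demo are invented for illustration and are not part of the GPS example below):

import pandas as pd

# label-based alignment: the two series only share the labels 'b' and 'c',
# so the sum is NaN for labels that appear in just one of them
s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])
print(s1 + s2)

# split-apply-combine: split the rows by 'key', apply the mean to each
# group, and combine the results into a new frame
df_demo = pd.DataFrame({'key': ['x', 'y', 'x', 'y'],
                        'value': [1.0, 2.0, 3.0, 4.0]})
print(df_demo.groupby('key').mean())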

pandas does not implement significant modeling functionality outside of linear and panel regression; for this, look to statsmodels and scikit-learn. More work is still needed to make Python a first-class statistical modeling environment, but we are well on our way toward that goal.

pandas is under continuous development, and improvements are added on a weekly basis. As with many other open-source projects, you can also learn a lot just by keeping up to date with the development (as a minimum, read the release notes of every new release).


In [1]:
import numpy as np
import pandas as pd
import csv
from matplotlib import pyplot as plt
from datetime import datetime

Let's load the CSV GPS data that was introduced before


In [2]:
fname = 'gps.csv'
df = pd.read_csv(fname, header=0, skiprows=[1], quoting=2)
%timeit pd.read_csv(fname, header=0, skiprows=[1], quoting=2)


10 loops, best of 3: 129 ms per loop

Although it is not relevant here, since there are no quotes in the data set, we can also use the CSV dialect option.


In [3]:
dia = csv.excel()
dia.quoting = csv.QUOTE_NONE
df = pd.read_csv(fname, header=0, skiprows=[1], dialect=dia)

Yes, very nice, but what does pandas actually understand about the data types in this case?


In [4]:
df.dtypes


Out[4]:
Time            object
Latitude N     float64
Longitude E    float64
orientation    float64
dtype: object

Time is an object. I guess that's nice and general... Let's see if we can force it into a datetime object instead


In [5]:
df = pd.read_csv(fname, header=0, skiprows=[1], parse_dates=[0])
df.dtypes


Out[5]:
Time           datetime64[ns]
Latitude N            float64
Longitude E           float64
orientation           float64
dtype: object

Automatic date parsing is really nice and easy, but it is SLOOOOWWWW. How slow exactly?


In [6]:
%timeit pd.read_csv(fname, header=0, skiprows=[1], parse_dates=[0])


1 loops, best of 3: 13.6 s per loop

Can we do that any faster? How about manually defining the date parsing function?


In [7]:
date = '20131211 18:29:30'
print date[:4], date[4:6], date[6:8], date[9:11], date[12:14], date[15:17]
print datetime(int(date[:4]), int(date[4:6]), int(date[6:8]), int(date[9:11]), int(date[12:14]), int(date[15:17]))
print datetime.strptime(date, "%Y%m%d %H:%M:%S")


2013 12 11 18 29 30
2013-12-11 18:29:30
2013-12-11 18:29:30

The read_csv function has a handy date_parser argument that allows users to define their own date parsing function. The default one is:


In [8]:
import dateutil.parser as parser
def date_parser_default(date):
    # dateutil inspects every single string to infer its format
    date = parser.parse(date)
    return date

Because the parser tries to infer the date format from every single value, it is very slow. If we specify how to read the date field, it can go a lot faster. As a bonus, we could include any other correction or transformation operation on the date.


In [9]:
from datetime import datetime
def date_parser(date):
    # create a datetime instance and assume constant formatting
    # format: 20131211 18:29:30
    return datetime(int(date[:4]), int(date[4:6]), int(date[6:8]), int(date[9:11]), int(date[12:14]), int(date[15:17]))
df = pd.read_csv(fname, header=0, skiprows=[1], parse_dates=[0], date_parser=date_parser)

In [10]:
%timeit pd.read_csv(fname, header=0, skiprows=[1], parse_dates=[0], date_parser=date_parser)


1 loops, best of 3: 603 ms per loop

Looks a bit ugly; isn't the next one more Pythonic?


In [11]:
def date_parser_2(date):
    return datetime.strptime(date, "%Y%m%d %H:%M:%S")
%timeit pd.read_csv(fname, header=0, skiprows=[1], parse_dates=[0], date_parser=date_parser_2)


1 loops, best of 3: 2.62 s per loop

More compact, but unfortunately not as fast.
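A related option, not timed here, is to skip parse_dates entirely, read Time as a plain string column, and convert it afterwards with pd.to_datetime and an explicit format string; how this compares speed-wise depends on your pandas version, so treat it as a sketch rather than a benchmark:

df = pd.read_csv(fname, header=0, skiprows=[1])
# vectorized conversion with a fixed format, no per-row Python function call
df['Time'] = pd.to_datetime(df['Time'], format='%Y%m%d %H:%M:%S')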


In [12]:
df.dtypes


Out[12]:
Time           datetime64[ns]
Latitude N            float64
Longitude E           float64
orientation           float64
dtype: object

Printing DataFrames

What does the raw data actually look like when printing the pandas DataFrame?


In [13]:
print df[:10]


                 Time  Latitude N  Longitude E  orientation
0 2013-11-14 06:05:19   42.604009    13.369236        327.7
1 2013-11-14 06:05:48   42.604160    13.369462        325.4
2 2013-11-14 06:06:19   42.604325    13.369717        323.7
3 2013-11-14 06:06:48   42.604478    13.369960        325.6
4 2013-11-14 06:07:19   42.604640    13.370227        327.5
5 2013-11-14 06:07:49   42.604794    13.370491        327.8
6 2013-11-14 06:08:18   42.604941    13.370746        330.4
7 2013-11-14 06:08:49   42.605097    13.371018        331.2
8 2013-11-14 06:09:19   42.605247    13.371289        331.1
9 2013-11-14 06:09:49   42.605393    13.371561        331.6

[10 rows x 4 columns]

However, there is also some built-in IPython notebook display magic that goes like this


In [14]:
df[:10]


Out[14]:
                 Time  Latitude N  Longitude E  orientation
0 2013-11-14 06:05:19   42.604009    13.369236        327.7
1 2013-11-14 06:05:48   42.604160    13.369462        325.4
2 2013-11-14 06:06:19   42.604325    13.369717        323.7
3 2013-11-14 06:06:48   42.604478    13.369960        325.6
4 2013-11-14 06:07:19   42.604640    13.370227        327.5
5 2013-11-14 06:07:49   42.604794    13.370491        327.8
6 2013-11-14 06:08:18   42.604941    13.370746        330.4
7 2013-11-14 06:08:49   42.605097    13.371018        331.2
8 2013-11-14 06:09:19   42.605247    13.371289        331.1
9 2013-11-14 06:09:49   42.605393    13.371561        331.6

10 rows × 4 columns

Plotting

pandas plotting functions use matplotlib under the hood. Notice the fancy default pandas plotting style.


In [15]:
pd.options.display.mpl_style = 'default'
ax = df.plot()



In [16]:
plt.rcParams['figure.figsize'] = 12, 8
ax = df.plot(subplots=True, x='Time')



In [17]:
ax = df.plot(x='Latitude N', y='Longitude E', kind='scatter')


Observations regarding the data set:

  • Latitude and longitude could only have jumped around like this if we had used beam.me.up() from the Scotty package, which we didn't use here

  • Orientation should have a 0-360 range

  • Sampling rate is not constant

Let's try to deal with these issues.

First, index the data using the Time column in order to use more advanced selection, interpolation and group by functions provided by pandas.

Second, remove bad measurement samples by applying selection criteria to the data set.

Third, re-sample the data set.

Fourth, fill in the gaps by interpolating the data so that we have a constant sample rate. See also the Working with missing data entry in the pandas manual.
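As a preview, the last two steps could look roughly like the sketch below (df_sel is the cleaned data set constructed in the Data selection section further down; the '30S' rule is just an example rate, and note that older pandas versions aggregate directly inside resample() while newer ones require chaining an explicit .mean()):

# put the samples on a regular 30-second grid (mean per bin); bins without
# samples come out as NaN, which the time-based interpolation then fills
df_regular = df_sel.set_index('Time').resample('30S').mean()
df_regular = df_regular.interpolate(method='time')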

Indexing


In [18]:
df_indexed = df.set_index('Time')
print df_indexed[:3]
print df[:3]


                     Latitude N  Longitude E  orientation
Time                                                     
2013-11-14 06:05:19   42.604009    13.369236        327.7
2013-11-14 06:05:48   42.604160    13.369462        325.4
2013-11-14 06:06:19   42.604325    13.369717        323.7

[3 rows x 3 columns]
                 Time  Latitude N  Longitude E  orientation
0 2013-11-14 06:05:19   42.604009    13.369236        327.7
1 2013-11-14 06:05:48   42.604160    13.369462        325.4
2 2013-11-14 06:06:19   42.604325    13.369717        323.7

[3 rows x 4 columns]
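With the DatetimeIndex in place, label-based selection becomes available, including partial string indexing; a small sketch (the dates are taken from the rows printed above):

# select all samples from a single day, or from a whole month, by label
print(df_indexed.loc['2013-11-14'][:3])
print(df_indexed.loc['2013-12'][:3])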

Data selection


In [19]:
# keep only the samples that pass the latitude criterion; the rows that
# "teleported" are dropped here
df_sel = df[df['Latitude N'] > -1.0]

# selections on the parsed dates work just as well
df_sel2 = df[df.Time > datetime(2014, 1, 1)]

# equivalent alternatives include df.query("orientation > 360.0"),
# combined boolean masks such as
# df[(df['Latitude N'] < -1.0) & (df.orientation < 360.0)],
# and df.mask() / df.where() to replace bad values with NaN
# instead of dropping the rows

Plotting


In [20]:
ax = df_sel.plot(subplots=True, x='Time')



In [21]:
ax = df_sel.plot(x='Latitude N', y='Longitude E', kind='scatter')


Re-sampling

See the API reference for more details on resample()


In [22]:
# re-sample all columns to an hourly rate; each hourly value is the mean
# of the samples falling within that hour (the default aggregation)
df_resample = df_sel.set_index('Time').resample('H')
# let's have a look at the first and last rows
print df_resample[0:5]
print df_resample[-5:]


                     Latitude N  Longitude E  orientation
Time                                                     
2013-11-14 06:00:00   42.612452    13.383209   326.466364
2013-11-14 07:00:00   42.629733    13.411938   303.714167
2013-11-14 08:00:00   42.635881    13.437030   111.726667
2013-11-14 09:00:00   42.644507    13.449025   180.421667
2013-11-14 10:00:00   42.634876    13.455481    93.121667

[5 rows x 3 columns]
                     Latitude N  Longitude E  orientation
Time                                                     
2014-01-08 12:00:00   42.484987    13.468480    46.146667
2014-01-08 13:00:00   42.484981    13.468490    46.170000
2014-01-08 14:00:00   42.484986    13.468482    46.866667
2014-01-08 15:00:00   42.484988    13.468489    46.880833
2014-01-08 16:00:00   42.484981    13.468489    46.833735

[5 rows x 3 columns]
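The hourly values above are per-bin means, but nothing forces that choice; other statistics work the same way. A sketch, assuming a newer pandas where the aggregation is chained onto the resampler explicitly:

hourly = df_sel.set_index('Time').resample('H')
print(hourly.mean()[:3])  # the hourly means, as above
print(hourly.max()[:3])   # hourly maxima instead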

Group by

Grouping the data per time interval.


In [23]:
df_indexed = df_sel.set_index('Time')
# or use x.second, x.minute, x.hour, x.day, x.week, x.month, x.year 
df_grouped = df_indexed.groupby(lambda x: x.month)
for name, group in df_grouped:
    group.plot(x='Latitude N', y='Longitude E', kind='scatter')



In [24]:
df_grouped_date = df_indexed.groupby(lambda x: x.week)
for name, group in df_grouped_date:
    print name, group.orientation.count()
    group.plot()


1 20154
2 7658
46 10790
47 20094
48 20139
49 20153
50 18159
51 8822
52 14804

In [25]:
df_grouped = df_indexed.groupby(lambda x: x.month)
for name, group in df_grouped:
    print '%3i %6i %6.1f %6.1f' % (name, group.orientation.count(),
                                   group.orientation.min(),
                                   group.orientation.max())
    ax = group.orientation.hist(alpha=0.6, label=str(name), bins=range(0,361,20))
ax.legend(loc='best')
ax.set_xlim([0,360])


  1  22052   37.1  141.4
 11  48143    0.0  359.9
 12  70578    0.0  360.0
Out[25]:
(0, 360)

Is the above histogram presenting a fair comparison between the three months? If not, can you use simple pandas methods to reduce the bias that one month possibly has over another?
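One way to reduce the bias, sketched below: normalize each month's histogram to unit area, so that the month with the most samples no longer dominates (normed=True in the matplotlib of this era, density=True in newer versions); down-sampling every month to an equal number of rows would be another option.

for name, group in df_grouped:
    # same histogram as above, but with each month normalized to unit area
    ax = group.orientation.hist(alpha=0.6, label=str(name),
                                bins=range(0, 361, 20), normed=True)
ax.legend(loc='best')
ax.set_xlim([0, 360])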

