Getting Started with Exploratory Data Analysis

3 important Python packages

NumPy for efficient computation on arrays
Pandas for data analysis
Matplotlib for plotting in the notebook



In [1]:

    
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Pandas

Python module for manipulating tabular data

`pandas`

Provides Python with a DataFrame
- Table data structure
- Can be manipulated much like lists and arrays
Structured manipulation tools
Built on top of NumPy
Huge growth from 2011-2012
Very efficient
Great for medium data

Resources

pandas.pydata.org
Python for Data Analysis by Wes McKinney
Data Wrangling Kung Fu with Pandas by Wes McKinney
Cheat sheet by Quandl

Why `pandas`?

"80% of the effort in data analysis is spent cleaning data." (Hadley Wickham)

Efficiency

Different views of data
Tidy data by Hadley Wickham

Raw data is often in the wrong format

How often do you download an array ready for array-oriented computing?
e.g. scikit-learn interface

Storage may be best in a different format

Sparse representations
Upload to database

Simple example using seal observational data

Data from:

Baker L, Flemming JEM, Jonsen ID, Lidgard DC, Iverson SJ, Bowen WD (2015) A novel approach to quantifying the spatiotemporal behavior of instrumented grey seals used to sample the environment. Movement Ecology 3(1):20. doi:10.1186/s40462-015-0047-4
Lidgard DC, Bowen WD, Iverson SJ (2015) Data from: A novel approach to quantifying the spatiotemporal behavior of instrumented grey seals used to sample the environment. Movebank Data Repository. doi:10.5441/001/1.910p0c20

Reading a CSV file as text

Data is provided as a CSV file
You can just read the file in line by line
This will create a list of the lines in the file



In [2]:

    
# Use a context manager so the file is closed once the lines are read
with open("Grey seals (Halichoerus grypus) at Sable Island (data from Baker et al. 2015).csv", 'r') as f:
    lines = f.readlines()
lines[:10]









    Out[2]:





['event-id,visible,timestamp,location-long,location-lat,manually-marked-outlier,sensor-type,individual-taxon-canonical-name,tag-local-identifier,individual-local-identifier,study-name\r\n',
 '677436629,true,2011-06-15 17:35:18.000,-59.97949982,43.92495728,,"gps","Halichoerus grypus","106705","E 87","Grey seals (Halichoerus grypus) at Sable Island (data from Baker et al. 2015)"\r\n',
 '677436630,true,2011-06-15 17:50:19.000,-59.98273849,43.92548752,,"gps","Halichoerus grypus","106705","E 87","Grey seals (Halichoerus grypus) at Sable Island (data from Baker et al. 2015)"\r\n',
 '677436631,true,2011-06-15 18:05:32.000,-59.98968887,43.92582703,,"gps","Halichoerus grypus","106705","E 87","Grey seals (Halichoerus grypus) at Sable Island (data from Baker et al. 2015)"\r\n',
 '677436632,true,2011-06-15 18:21:27.000,-59.99033737,43.92613602,,"gps","Halichoerus grypus","106705","E 87","Grey seals (Halichoerus grypus) at Sable Island (data from Baker et al. 2015)"\r\n',
 '677436633,true,2011-06-15 18:36:31.000,-59.9889679,43.92525482,,"gps","Halichoerus grypus","106705","E 87","Grey seals (Halichoerus grypus) at Sable Island (data from Baker et al. 2015)"\r\n',
 '677436634,true,2011-06-15 18:51:23.000,-59.98394394,43.92564011,,"gps","Halichoerus grypus","106705","E 87","Grey seals (Halichoerus grypus) at Sable Island (data from Baker et al. 2015)"\r\n',
 '677436635,true,2011-06-15 19:06:20.000,-59.98566055,43.92499924,,"gps","Halichoerus grypus","106705","E 87","Grey seals (Halichoerus grypus) at Sable Island (data from Baker et al. 2015)"\r\n',
 '677436636,true,2011-06-15 19:22:18.000,-59.987854,43.92406082,,"gps","Halichoerus grypus","106705","E 87","Grey seals (Halichoerus grypus) at Sable Island (data from Baker et al. 2015)"\r\n',
 '677436637,true,2011-06-15 19:37:18.000,-59.98072815,43.92603302,,"gps","Halichoerus grypus","106705","E 87","Grey seals (Halichoerus grypus) at Sable Island (data from Baker et al. 2015)"\r\n']
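The raw lines still contain commas, quotes and line endings. As a minimal sketch (not part of the original notebook), the standard-library `csv` module can split such lines into fields. The two sample lines below use a reduced set of the columns shown above; that reduction and the variable names are assumptions made for brevity.

```python
import csv

# Two lines in the same layout as the output above, with fewer columns
# (the reduced column set is an assumption of this sketch).
raw_lines = [
    'event-id,timestamp,location-long,location-lat,sensor-type,individual-local-identifier\r\n',
    '677436629,2011-06-15 17:35:18.000,-59.97949982,43.92495728,"gps","E 87"\r\n',
]

reader = csv.reader(raw_lines)      # handles quoting and line endings
header = next(reader)               # first line holds the column names
rows = [dict(zip(header, fields)) for fields in reader]

first = rows[0]
longitude = float(first['location-long'])   # every field is still a string
print(first['individual-local-identifier'], longitude)
```

Every value arrives as a string, so types have to be converted by hand. This is the bookkeeping that `pd.read_csv` takes over.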

Creating a `DataFrame`

df = pd.read_csv(filename)
print(df)

Why store it this way?

Similar to a table
Powerful way to interact with the data

Converting the timestamp column

NumPy datetime64 dtype
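A small sketch of two ways to get a datetime column, assuming Python 3 and a recent pandas. The inline CSV text is a shortened stand-in for the seal file, and the names `csv_text`, `parsed` and `raw` are made up for this example.

```python
import io
import pandas as pd

# Shortened stand-in for the seal CSV (an assumption of this sketch)
csv_text = (
    "event-id,timestamp,location-long\n"
    "677436629,2011-06-15 17:35:18.000,-59.97949982\n"
    "677436630,2011-06-15 17:50:19.000,-59.98273849\n"
)

# Option 1: parse while reading, naming the column instead of using its position
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["timestamp"])

# Option 2: read the column as text, then convert it explicitly
raw = pd.read_csv(io.StringIO(csv_text))
raw["timestamp"] = pd.to_datetime(raw["timestamp"])

# Datetime columns support arithmetic, e.g. the time between two fixes
elapsed = parsed["timestamp"].iloc[1] - parsed["timestamp"].iloc[0]
print(elapsed)
```

Naming the column is less fragile than a positional index such as `parse_dates=[2]` when the column order of the file changes.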



In [3]:

    
?pd.read_csv



In [4]:

    
df = pd.read_csv("Grey seals (Halichoerus grypus) at Sable Island (data from Baker et al. 2015).csv", 
                 parse_dates=[2])
df.head(3)









    



/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/IPython/core/interactiveshell.py:2902: DtypeWarning: Columns (5) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)






    Out[4]:






  
    
      
   event-id   visible  timestamp            location-long  location-lat  manually-marked-outlier  sensor-type  individual-taxon-canonical-name  tag-local-identifier  individual-local-identifier  study-name
0  677436629  True     2011-06-15 17:35:18  -59.979500     43.924957     NaN                      gps          Halichoerus grypus               106705                E 87                         Grey seals (Halichoerus grypus) at Sable Islan...
1  677436630  True     2011-06-15 17:50:19  -59.982738     43.925488     NaN                      gps          Halichoerus grypus               106705                E 87                         Grey seals (Halichoerus grypus) at Sable Islan...
2  677436631  True     2011-06-15 18:05:32  -59.989689     43.925827     NaN                      gps          Halichoerus grypus               106705                E 87                         Grey seals (Halichoerus grypus) at Sable Islan...

What data types are in the data frame?



In [5]:

    
df.dtypes









    Out[5]:





event-id                                    int64
visible                                      bool
timestamp                          datetime64[ns]
location-long                             float64
location-lat                              float64
manually-marked-outlier                    object
sensor-type                                object
individual-taxon-canonical-name            object
tag-local-identifier                        int64
individual-local-identifier                object
study-name                                 object
dtype: object
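The `manually-marked-outlier` column is reported as `object`, and it is the column (position 5) named in the `DtypeWarning` above. One possible way to address the warning, sketched here on made-up inline data, is to state the dtype when reading. Whether the real file stores the flag as the text `true` is an assumption of this sketch.

```python
import io
import pandas as pd

# Made-up rows: the outlier flag is empty for most records (assumption)
csv_text = (
    "event-id,manually-marked-outlier,sensor-type\n"
    "1,,gps\n"
    "2,true,gps\n"
    "3,,gps\n"
)

# Declare the column as text so pandas does not have to guess its type
df_small = pd.read_csv(io.StringIO(csv_text),
                       dtype={"manually-marked-outlier": str})

# Empty fields become missing values; count the rows that carry a flag
n_marked = df_small["manually-marked-outlier"].notna().sum()
print(n_marked)
```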

Overview of the numerical columns in the data



In [6]:

    
df.describe()









    Out[6]:






  
    
      
       event-id      visible    location-long  location-lat   tag-local-identifier
count  1.244350e+05  124435     124435.000000  124435.000000  124435.000000
mean   6.455033e+08  0.999397   -60.352127     44.453460      106714.273621
std    8.829974e+06  0.0245431  1.124249       0.837039       5.576758
min    6.430037e+08  False      -75.647453     43.408535      106705.000000
25%    6.430355e+08  1          -60.473623     43.928371      106710.000000
50%    6.430666e+08  1          -60.064209     44.198532      106715.000000
75%    6.430977e+08  1          -59.855791     44.689791      106718.000000
max    6.774455e+08  True       -50.937111     59.374504      106724.000000

Size of the data set



In [7]:

    
len(df)









    Out[7]:





124435

Indexing - very similar to NumPy arrays

0 based indexing
last element not included
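A short sketch comparing the slicing rules on a NumPy array and on a DataFrame. The toy data and variable names are invented for this example. Note that label-based `.loc` slicing, unlike positional slicing, includes the end label.

```python
import numpy as np
import pandas as pd

arr = np.arange(10)
frame = pd.DataFrame({"value": arr})

arr_slice = arr[:10:2]           # start:stop:step, stop is excluded
row_slice = frame[:10:2]         # same rule, applied to the rows
iloc_slice = frame.iloc[:10:2]   # explicit positional form of the same slice
last_two = frame[-2:]            # negative positions count from the end

# Label-based slicing with .loc INCLUDES the end label
label_slice = frame.loc[0:4]

print(list(row_slice["value"]), len(label_slice), len(frame.iloc[0:4]))
```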



In [8]:

    
df[:10:2]









    Out[8]:






  
    
      
   event-id   visible  timestamp            location-long  location-lat  manually-marked-outlier  sensor-type  individual-taxon-canonical-name  tag-local-identifier  individual-local-identifier  study-name
0  677436629  True     2011-06-15 17:35:18  -59.979500     43.924957     NaN                      gps          Halichoerus grypus               106705                E 87                         Grey seals (Halichoerus grypus) at Sable Islan...
2  677436631  True     2011-06-15 18:05:32  -59.989689     43.925827     NaN                      gps          Halichoerus grypus               106705                E 87                         Grey seals (Halichoerus grypus) at Sable Islan...
4  677436633  True     2011-06-15 18:36:31  -59.988968     43.925255     NaN                      gps          Halichoerus grypus               106705                E 87                         Grey seals (Halichoerus grypus) at Sable Islan...
6  677436635  True     2011-06-15 19:06:20  -59.985661     43.924999     NaN                      gps          Halichoerus grypus               106705                E 87                         Grey seals (Halichoerus grypus) at Sable Islan...
8  677436637  True     2011-06-15 19:37:18  -59.980728     43.926033     NaN                      gps          Halichoerus grypus               106705                E 87                         Grey seals (Halichoerus grypus) at Sable Islan...



In [9]:

    
df[-5:]









    Out[9]:






  
    
      
        event-id   visible  timestamp            location-long  location-lat  manually-marked-outlier  sensor-type  individual-taxon-canonical-name  tag-local-identifier  individual-local-identifier  study-name
124430  643119961  True     2012-01-06 03:16:42  -59.756832     43.974522     NaN                      gps          Halichoerus grypus               106724                F357                         Grey seals (Halichoerus grypus) at Sable Islan...
124431  643119962  True     2012-01-06 03:33:53  -59.757130     43.974274     NaN                      gps          Halichoerus grypus               106724                F357                         Grey seals (Halichoerus grypus) at Sable Islan...
124432  643119963  True     2012-01-06 03:52:42  -59.758202     43.970879     NaN                      gps          Halichoerus grypus               106724                F357                         Grey seals (Halichoerus grypus) at Sable Islan...
124433  643119964  True     2012-01-06 04:10:06  -59.761147     43.971104     NaN                      gps          Halichoerus grypus               106724                F357                         Grey seals (Halichoerus grypus) at Sable Islan...
124434  643119965  True     2012-01-06 04:25:06  -59.762070     43.971161     NaN                      gps          Halichoerus grypus               106724                F357                         Grey seals (Halichoerus grypus) at Sable Islan...

Extracting the values of a column



In [10]:

    
longitude = df['location-long'].values
print(type(longitude))
print(len(longitude))
longitude









    



<type 'numpy.ndarray'>
124435






    Out[10]:





array([-59.97949982, -59.98273849, -59.98968887, ..., -59.7582016 ,
       -59.76114655, -59.7620697 ])

Finding how many individuals are tracked



In [11]:

    
df["individual-local-identifier"].unique()









    Out[11]:





array(['E 87', 'S0749', 'S0757', 'F104', 'S0753', 'F122', 'K 88', 'K 11',
       'S0751', 'S0758', 'S0756', 'F532', 'F719', 'F367', 'F357'], dtype=object)
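`unique()` lists the identifiers; counting them takes one more step. A minimal sketch on a made-up Series (the values and names are illustrative only) showing `nunique()` for the count and `value_counts()` for the number of records per individual:

```python
import pandas as pd

# Made-up identifiers standing in for df["individual-local-identifier"]
ids = pd.Series(["E 87", "E 87", "S0749", "F104", "S0749", "E 87"],
                name="individual-local-identifier")

n_individuals = ids.nunique()            # number of distinct individuals
fixes_per_individual = ids.value_counts()  # records per individual

print(n_individuals)
print(fixes_per_individual)
```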

Extracting columns for a new data frame



In [12]:

    
df.head(2)









    Out[12]:






  
    
      
   event-id   visible  timestamp            location-long  location-lat  manually-marked-outlier  sensor-type  individual-taxon-canonical-name  tag-local-identifier  individual-local-identifier  study-name
0  677436629  True     2011-06-15 17:35:18  -59.979500     43.924957     NaN                      gps          Halichoerus grypus               106705                E 87                         Grey seals (Halichoerus grypus) at Sable Islan...
1  677436630  True     2011-06-15 17:50:19  -59.982738     43.925488     NaN                      gps          Halichoerus grypus               106705                E 87                         Grey seals (Halichoerus grypus) at Sable Islan...



In [13]:

    
sdf = df[["timestamp","location-long","location-lat","individual-local-identifier","event-id"]]
sdf.head(5)









    Out[13]:






  
    
      
   timestamp            location-long  location-lat  individual-local-identifier  event-id
0  2011-06-15 17:35:18  -59.979500     43.924957     E 87                         677436629
1  2011-06-15 17:50:19  -59.982738     43.925488     E 87                         677436630
2  2011-06-15 18:05:32  -59.989689     43.925827     E 87                         677436631
3  2011-06-15 18:21:27  -59.990337     43.926136     E 87                         677436632
4  2011-06-15 18:36:31  -59.988968     43.925255     E 87                         677436633

Using the timestamp as an index



In [14]:

    
sdf.set_index("timestamp",inplace=True)
sdf.head(5)









    Out[14]:






  
    
      
                     location-long  location-lat  individual-local-identifier  event-id
timestamp
2011-06-15 17:35:18  -59.979500     43.924957     E 87                         677436629
2011-06-15 17:50:19  -59.982738     43.925488     E 87                         677436630
2011-06-15 18:05:32  -59.989689     43.925827     E 87                         677436631
2011-06-15 18:21:27  -59.990337     43.926136     E 87                         677436632
2011-06-15 18:36:31  -59.988968     43.925255     E 87                         677436633

Adding a column

Want to add a behavior index to the data.
Data is in a numpy array
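A minimal sketch of attaching a NumPy array as a new column, on a toy frame with invented values. It takes an explicit `.copy()` of the column subset first, which is one way to avoid modifying (or being warned about modifying) a view of the original frame. The random generator and its seed are assumptions of this sketch.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the seal data (values are illustrative)
positions = pd.DataFrame({"location-long": [-59.9795, -59.9827, -59.9897],
                          "location-lat": [43.9250, 43.9255, 43.9258],
                          "event-id": [677436629, 677436630, 677436631]})

# Work on an independent copy of the selected columns
subset = positions[["location-long", "location-lat"]].copy()

rng = np.random.default_rng(0)            # seeded so the run is repeatable
behav = rng.standard_normal(len(subset))  # one value per row

subset["behavior"] = behav                # assignment appends as the last column
subset.insert(0, "event-id", positions["event-id"])  # insert() picks the position

print(subset.columns.tolist())
```

The array must have exactly one entry per row; otherwise pandas raises a `ValueError`.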



In [15]:

    
behav = np.random.randn(len(sdf))
sdf.insert(4,'behavior',behav)
sdf.head(5)









    Out[15]:






  
    
      
                     location-long  location-lat  individual-local-identifier  event-id   behavior
timestamp
2011-06-15 17:35:18  -59.979500     43.924957     E 87                         677436629  0.328219
2011-06-15 17:50:19  -59.982738     43.925488     E 87                         677436630  1.007528
2011-06-15 18:05:32  -59.989689     43.925827     E 87                         677436631  0.785205
2011-06-15 18:21:27  -59.990337     43.926136     E 87                         677436632  1.010807
2011-06-15 18:36:31  -59.988968     43.925255     E 87                         677436633  -1.148943

Renaming columns

location-long -> longitude
location-lat -> latitude
individual-local-identifier -> individual



In [16]:

    
sdf = sdf.rename(columns={"location-long":"longitude", 
                        "location-lat":"latitude", 
                        "individual-local-identifier": "individual"})
sdf.head(5)









    Out[16]:






  
    
      
                     longitude   latitude   individual  event-id   behavior
timestamp
2011-06-15 17:35:18  -59.979500  43.924957  E 87        677436629  0.328219
2011-06-15 17:50:19  -59.982738  43.925488  E 87        677436630  1.007528
2011-06-15 18:05:32  -59.989689  43.925827  E 87        677436631  0.785205
2011-06-15 18:21:27  -59.990337  43.926136  E 87        677436632  1.010807
2011-06-15 18:36:31  -59.988968  43.925255  E 87        677436633  -1.148943

Writing to a csv file



In [17]:

    
sdf.to_csv("seal-behav.csv")
!head "seal-behav.csv"









    



timestamp,longitude,latitude,individual,event-id,behavior
2011-06-15 17:35:18,-59.97949982,43.92495728,E 87,677436629,0.328219244499
2011-06-15 17:50:19,-59.98273849,43.92548752,E 87,677436630,1.00752792258
2011-06-15 18:05:32,-59.98968887,43.92582703,E 87,677436631,0.785205425388
2011-06-15 18:21:27,-59.99033737,43.92613602,E 87,677436632,1.0108068571
2011-06-15 18:36:31,-59.9889679,43.92525482,E 87,677436633,-1.1489431311
2011-06-15 18:51:23,-59.98394394,43.92564011,E 87,677436634,-0.616059556057
2011-06-15 19:06:20,-59.98566055,43.92499924,E 87,677436635,-0.81089909047
2011-06-15 19:22:18,-59.987854,43.92406082,E 87,677436636,0.728499356042
2011-06-15 19:37:18,-59.98072815,43.92603302,E 87,677436637,-0.172214739418
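Reading such a file back needs the index and the date parsing to be stated again, because CSV stores only text. A round-trip sketch on a two-row toy frame, written to an in-memory buffer instead of a file on disk (the buffer and the names used are assumptions of this example):

```python
import io
import pandas as pd

# Two-row stand-in for sdf (values taken from the listing above)
frame = pd.DataFrame(
    {"longitude": [-59.97949982, -59.98273849], "individual": ["E 87", "E 87"]},
    index=pd.to_datetime(["2011-06-15 17:35:18", "2011-06-15 17:50:19"]),
)
frame.index.name = "timestamp"

buffer = io.StringIO()
frame.to_csv(buffer)      # the index is written as the first column
buffer.seek(0)

# Restore the timestamp column as a parsed datetime index
restored = pd.read_csv(buffer, index_col="timestamp", parse_dates=["timestamp"])
print(restored)
```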

Hierarchical columns

Reorder the organization of the table

Index: timestamp
Columns: the individuals
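To see what this reorganization does, here is a sketch on three invented rows. With only `columns` given, `pivot` keeps the existing index as rows and spreads every remaining column over the individuals, which yields two-level column labels. All values and names below are illustrative.

```python
import pandas as pd

# Three invented records in "long" form: one row per observation
long_form = pd.DataFrame(
    {"individual": ["E 87", "S0749", "E 87"],
     "longitude": [-59.98, -59.96, -59.99],
     "latitude": [43.92, 43.94, 43.93]},
    index=pd.to_datetime(["2011-06-15 17:35", "2011-06-11 19:07", "2011-06-15 17:50"]),
)
long_form.index.name = "timestamp"

# "Wide" form: one column per (quantity, individual) pair
wide = long_form.pivot(columns="individual")

print(wide.shape)
print(wide["longitude"])
```

An individual without an observation at a given timestamp gets `NaN`, which is why the pivoted seal table is mostly `NaN`. `pivot` raises a `ValueError` if the same index and column combination occurs more than once.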



In [18]:

    
sd = sdf.pivot(columns='individual')  # index (rows) and values are optional; only columns is given
sd[:5]









    Out[18]:






  
    
      
                     longitude                                                                                                               ...         behavior
individual           E 87        F104        F122        F357        F367        F532        F719        K 11        K 88        S0749       ...         F532        F719        K 11        K 88        S0749       S0751       S0753       S0756       S0757       S0758
timestamp
2011-06-11 19:07:27  NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         -59.961720  ...         NaN         NaN         NaN         NaN         0.752542    NaN         NaN         NaN         NaN         NaN
2011-06-11 19:24:28  NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         -59.960075  ...         NaN         NaN         NaN         NaN         -0.520997   NaN         NaN         NaN         NaN         NaN
2011-06-11 19:41:25  NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         -59.956333  ...         NaN         NaN         NaN         NaN         0.698960    NaN         NaN         NaN         NaN         NaN
2011-06-11 19:57:39  NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         -59.957340  ...         NaN         NaN         NaN         NaN         -1.912928   NaN         NaN         NaN         NaN         NaN
2011-06-11 20:14:14  NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         -59.965260  ...         NaN         NaN         NaN         NaN         1.311894    NaN         NaN         NaN         NaN         NaN

5 rows × 60 columns



In [19]:

    
sd['behavior'][:5]









    Out[19]:






  
    
individual           E 87  F104  F122  F357  F367  F532  F719  K 11  K 88  S0749      S0751  S0753  S0756  S0757  S0758
timestamp
2011-06-11 19:07:27  NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   0.752542   NaN    NaN    NaN    NaN    NaN
2011-06-11 19:24:28  NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   -0.520997  NaN    NaN    NaN    NaN    NaN
2011-06-11 19:41:25  NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   0.698960   NaN    NaN    NaN    NaN    NaN
2011-06-11 19:57:39  NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   -1.912928  NaN    NaN    NaN    NaN    NaN
2011-06-11 20:14:14  NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   1.311894   NaN    NaN    NaN    NaN    NaN

Extracting longitude and latitude for each individual



In [20]:

    
longLat = sdf[['individual', 'longitude', 'latitude']]
longLat[2::5000]









    Out[20]:






  
    
      
                     individual  longitude   latitude
timestamp
2011-06-15 18:05:32  E 87        -59.989689  43.925827
2011-09-26 15:03:18  E 87        -60.023972  43.928467
2011-07-01 14:24:30  S0749       -59.932743  44.020000
2011-10-17 15:09:15  S0749       -60.100521  44.495800
2011-07-07 01:14:48  S0757       -59.275227  44.682209
2011-12-01 04:11:42  S0757       -60.867977  44.248436
2011-08-20 09:16:11  F104        -60.041412  43.927155
2011-11-16 19:40:27  F104        -60.496716  44.621536
2011-07-20 04:25:05  S0753       -59.538494  44.334286
2011-10-12 08:00:38  S0753       -59.769253  44.785385
2011-06-23 10:07:36  F122        -59.639961  44.038654
2011-08-09 14:19:54  K 88        -59.081959  43.875942
2011-11-18 02:43:41  K 88        -61.186897  43.768154
2011-08-21 08:05:02  K 11        -60.871934  45.303607
2011-12-05 10:54:25  K 11        -60.863003  44.177757
2011-09-05 20:35:02  S0751       -64.866554  46.786152
2011-07-15 15:28:05  S0758       -60.021687  43.928288
2011-10-27 03:14:26  S0758       -59.938644  43.758900
2011-07-05 18:11:54  S0756       -59.991898  44.616425
2011-10-16 07:19:25  S0756       -60.324448  45.068775
2011-06-17 14:07:51  F532        -60.800690  43.785324
2011-12-24 18:04:44  F532        -60.248558  43.836136
2011-07-02 12:27:12  F367        -60.241093  44.314720
2011-08-13 11:41:55  F357        -60.123775  43.957939
2011-10-28 19:59:47  F357        -60.425831  45.067749

Same for the pivoted data



In [21]:

    
sd[['longitude', 'latitude']][::5000]









    Out[21]:






  
    
      
                     longitude                                                                                                               ...         latitude
individual           E 87        F104        F122        F357        F367        F532        F719        K 11        K 88        S0749       ...         F532        F719        K 11        K 88        S0749       S0751       S0753       S0756       S0757       S0758
timestamp
2011-06-11 19:07:27  NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         -59.961720  ...         NaN         NaN         NaN         NaN         43.936630   NaN         NaN         NaN         NaN         NaN
2011-06-22 03:16:32  NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         ...         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         43.643261
2011-06-29 01:15:33  NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         ...         NaN         NaN         NaN         NaN         NaN         NaN         NaN         44.672211   NaN         NaN
2011-07-06 09:42:34  NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         ...         NaN         NaN         NaN         NaN         NaN         NaN         44.525101   NaN         NaN         NaN
2011-07-14 09:15:25  NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         -59.762409  ...         NaN         NaN         NaN         NaN         43.997272   NaN         NaN         NaN         NaN         NaN
2011-07-23 05:14:59  NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         ...         NaN         NaN         NaN         NaN         NaN         NaN         44.120434   NaN         NaN         NaN
2011-07-30 19:32:57  NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         ...         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         43.618286
2011-08-07 22:27:20  NaN         NaN         NaN         NaN         NaN         -60.082142  NaN         NaN         NaN         NaN         ...         43.939621   NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN
2011-08-16 20:55:54  NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         -59.684322  NaN         ...         NaN         NaN         NaN         44.005180   NaN         NaN         NaN         NaN         NaN         NaN
2011-08-24 16:26:58  NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         -59.989624  ...         NaN         NaN         NaN         NaN         43.936295   NaN         NaN         NaN         NaN         NaN
2011-09-02 06:08:14  NaN         NaN         NaN         -59.808678  NaN         NaN         NaN         NaN         NaN         NaN         ...         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN
2011-09-10 07:23:12  NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         ...         NaN         NaN         NaN         NaN         NaN         NaN         NaN         43.929222   NaN         NaN
2011-09-19 15:56:32  NaN         NaN         NaN         NaN         -60.160984  NaN         NaN         NaN         NaN         NaN         ...         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN
2011-09-27 10:38:24  -60.009354  NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         ...         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN
2011-10-06 14:57:37  NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         ...         NaN         NaN         NaN         NaN         NaN         NaN         NaN         44.485497   NaN         NaN
2011-10-15 00:27:56  NaN         NaN         NaN         -60.398602  NaN         NaN         NaN         NaN         NaN         NaN         ...         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN
2011-10-22 13:20:00  NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         ...         NaN         NaN         NaN         NaN         NaN         48.025097   NaN         NaN         NaN         NaN
2011-10-29 02:48:50  NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         ...         NaN         NaN         NaN         NaN         NaN         NaN         44.784321   NaN         NaN         NaN
2011-11-04 21:24:41  NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         -60.117374  NaN         ...         NaN         NaN         NaN         43.948349   NaN         NaN         NaN         NaN         NaN         NaN
2011-11-13 00:09:13  NaN         -60.440189  NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         ...         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN
2011-11-20 05:29:35  NaN         -60.555164  NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         ...         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN
2011-11-27 13:48:57  NaN         NaN         NaN         -60.507767  NaN         NaN         NaN         NaN         NaN         NaN         ...         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN
2011-12-05 03:15:54  NaN         NaN         NaN         NaN         NaN         NaN         NaN         -60.880478  NaN         NaN         ...         NaN         NaN         44.168858   NaN         NaN         NaN         NaN         NaN         NaN         NaN
2011-12-13 01:49:53  NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         ...         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         44.117413   NaN
2011-12-20 18:14:31  NaN         NaN         NaN         NaN         NaN         NaN         NaN         -60.806515  NaN         NaN         ...         NaN         NaN         44.659252   NaN         NaN         NaN         NaN         NaN         NaN         NaN

25 rows × 30 columns

Extracting Data from a Data Frame with a Condition

Extracting Seal F719 from the table
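A comparison on a column gives a boolean Series, and indexing with it keeps the rows where it is `True`. A sketch on four invented rows (the names `tracks`, `is_f719` and the latitude threshold are made up for this example), including how conditions combine:

```python
import pandas as pd

# Four invented rows using the original column names
tracks = pd.DataFrame(
    {"individual-local-identifier": ["E 87", "F719", "F719", "K 11"],
     "location-lat": [43.924957, 43.937832, 43.931831, 45.303607]}
)

is_f719 = tracks["individual-local-identifier"] == "F719"   # boolean mask
f719 = tracks[is_f719]                                      # rows where mask is True

# Combine conditions with & and |, each wrapped in parentheses;
# .loc also lets you pick the columns to keep
f719_north = tracks.loc[is_f719 & (tracks["location-lat"] > 43.935), ["location-lat"]]

# Select several individuals at once
several = tracks[tracks["individual-local-identifier"].isin(["F719", "K 11"])]

print(f719)
```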



In [22]:

    
df[df["individual-local-identifier"] == "F719"][:5]









    Out[22]:






  
    
      
        event-id   visible  timestamp            location-long  location-lat  manually-marked-outlier  sensor-type  individual-taxon-canonical-name  tag-local-identifier  individual-local-identifier  study-name
105100  643100631  True     2011-06-14 19:59:29  -59.826447     43.937832     NaN                      gps          Halichoerus grypus               106721                F719                         Grey seals (Halichoerus grypus) at Sable Islan...
105101  643100632  True     2011-06-14 20:32:38  -59.829632     43.931831     NaN                      gps          Halichoerus grypus               106721                F719                         Grey seals (Halichoerus grypus) at Sable Islan...
105102  643100633  True     2011-06-14 20:48:07  -59.813534     43.936577     NaN                      gps          Halichoerus grypus               106721                F719                         Grey seals (Halichoerus grypus) at Sable Islan...
105103  643100634  True     2011-06-14 21:18:03  -59.812885     43.942875     NaN                      gps          Halichoerus grypus               106721                F719                         Grey seals (Halichoerus grypus) at Sable Islan...
105104  643100635  True     2011-06-14 21:33:03  -59.808041     43.941067     NaN                      gps          Halichoerus grypus               106721                F719                         Grey seals (Halichoerus grypus) at Sable Islan...



In [23]:

    
sd[[('behavior', "F719"), ('latitude', "F104"),('longitude', "F719")]][:5]









    Out[23]:






  
    
      
                     behavior  latitude  longitude
individual           F719      F104      F719
timestamp
2011-06-11 19:07:27  NaN       NaN       NaN
2011-06-11 19:24:28  NaN       NaN       NaN
2011-06-11 19:41:25  NaN       NaN       NaN
2011-06-11 19:57:39  NaN       NaN       NaN
2011-06-11 20:14:14  NaN       NaN       NaN



In [24]:

    
sd[[('behavior', "F719"), 
    ('latitude', "F719"),
    ('longitude', "F719")]].dropna()[:5]









    Out[24]:






  
    
      
                     behavior   latitude   longitude
individual           F719       F719       F719
timestamp
2011-06-14 19:59:29  -0.214912  43.937832  -59.826447
2011-06-14 20:32:38  0.738354   43.931831  -59.829632
2011-06-14 20:48:07  1.353730   43.936577  -59.813534
2011-06-14 21:18:03  0.369941   43.942875  -59.812885
2011-06-14 21:33:03  0.589742   43.941067  -59.808041

Simple Plotting

Plot the DataFrame



In [25]:

    
sd.plot()









    Out[25]:





<matplotlib.axes._subplots.AxesSubplot at 0x10f4a63d0>

Plot the behavior of all seals



In [26]:

    
sd['behavior'].plot(figsize=(10, 10))
plt.ylabel('behavior')
plt.title('Seal plotting exercise')
plt.grid()  # add the grid before saving so it appears in the PNG
plt.savefig('seal_behavior.png')
plt.show()

Plot the behavior of seal K 11



In [27]:

    
sd[('behavior', "K 11")].plot(figsize=(10, 10))
plt.ylabel('behavior')
plt.title('Seal plotting exercise')
plt.grid()  # add the grid before saving so it appears in the PNG
plt.savefig('seal-k11.png')
plt.show()

Importing the image into a Markdown cell
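
The cell for this step is empty in the notebook. One common approach, sketched here as an assumption rather than as the author's method, is to reference the saved PNG with the standard Markdown image syntax `![alt text](path)`. The helper name `markdown_image` is hypothetical; the file name comes from the `plt.savefig` call above.

```python
from pathlib import PurePosixPath


def markdown_image(path, alt_text=""):
    """Return the Markdown snippet that embeds the image stored at `path`."""
    return "![{}]({})".format(alt_text, PurePosixPath(path))


# Paste the printed line into a Markdown cell to display the saved figure.
snippet = markdown_image("seal-k11.png", "Behavior of seal K 11")
print(snippet)
```

The path is resolved relative to the notebook, so the PNG has to sit in the same directory as the notebook file for this snippet to render.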



In [ ]:

Revisiting the Weather Data Plot



In [28]:

    
import urllib.request  # Python 3 replacement for urllib2
import io  # io.StringIO replaces the Python 2 StringIO module
import itertools
import datetime



In [29]:

    
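boulder_url = "http://www.esrl.noaa.gov/psd/boulder/data/boulderdaily.complete"


def cdate(x1, x2, x3):
    """Build a datetime from year, month, and day values."""
    try:
        return datetime.datetime(int(x1), int(x2), int(x3))
    except ValueError:
        # Not a valid calendar date: fall back to the previous day.
        return datetime.datetime(int(x1), int(x2), int(x3) - 1)

# Columns are separated by runs of spaces. skipfooter and the regex separator
# both require the python engine. Note: nested parse_dates lists and
# date_parser are deprecated in recent pandas releases.
df = pd.read_csv(boulder_url, sep=' +', index_col=0, skiprows=1, skipfooter=14,
                 na_values='-998', engine='python',
                 parse_dates=[[0, 1, 2]], date_parser=cdate,
                 header=None,
                 names=['year', 'month', 'day', 'tmax', 'tmin', 'precip', 'snow', 'snowcover'])
df.dtypes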









    Out[29]:





tmax         float64
tmin         float64
precip       float64
snow         float64
snowcover    float64
dtype: object
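
The call above builds the date index from three separate columns through a custom parser. As a hedged alternative sketch, the same index can be assembled after reading by passing the `year`, `month`, and `day` columns to `pd.to_datetime`. The sample rows below are invented for illustration and only mimic the column order used above; the names `sample`, `raw`, and `weather` are assumptions. Note that this variant marks impossible dates as `NaT`, whereas `cdate` falls back to the previous day.

```python
import io
import pandas as pd

# Made-up rows in the same column order as above; -998 marks a missing
# value and 2015-02-30 is deliberately not a valid calendar date.
sample = """\
2015  2 27   45   20   0.00   0.0    0
2015  2 28   50   25 -998     0.0    0
2015  2 30   48   22   0.10   1.5    1
"""
names = ["year", "month", "day", "tmax", "tmin", "precip", "snow", "snowcover"]
raw = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None,
                  names=names, na_values=["-998"])

# Assemble timestamps from the year/month/day columns; errors="coerce"
# turns impossible dates into NaT instead of raising an exception.
dates = pd.to_datetime(raw[["year", "month", "day"]], errors="coerce")

# Use the assembled dates as the index and keep only the measurements.
weather = raw.drop(columns=["year", "month", "day"]).set_index(dates)
weather.index.name = "date"
print(weather)
```

Rows whose index is `NaT` can then be inspected or dropped explicitly, which makes the handling of bad dates visible instead of silently shifting them.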



In [30]:

    
df.tail(5)









    Out[30]:






  
    
      
                tmax  tmin  precip  snow  snowcover
year_month_day
2015-12-27        27     5       0     0          5
2015-12-28        31     4       0     0          4
2015-12-29        27     9       0     0          3
2015-12-30        30     7       0     0          3
2015-12-31        31     6       0     0          3



In [32]:

    
df['tmax'][-365:].plot()









    Out[32]:





<matplotlib.axes._subplots.AxesSubplot at 0x121854090>



In [ ]:
