Python Data Science with Pandas

Agenda:

Selecting and slicing data with loc and iloc
Selecting data by value
Selecting data from a time series by date and time values
Applying and mapping functions to pandas dataframes and series



In [1]:

    
# magic command to display matplotlib plots inline within the ipython notebook webpage
%matplotlib inline

# import necessary modules
import pandas as pd, numpy as np, matplotlib.pyplot as plt

Part 1
Basics of Selecting and Slicing Data



In [2]:

    
# Operation                       Syntax         Result
#----------------------------------------------------------
# Select column                   df[col]        Series
# Select row by label             df.loc[label]  Series
# Select row by integer location  df.iloc[loc]   Series
# Slice rows                      df[5:10]       DataFrame
# Select rows by boolean vector   df[bool_vec]   DataFrame



In [3]:

    
# create a pandas dataframe from the location data set
df = pd.read_csv('data/summer-travel-gps-full.csv')
df.head()









    Out[3]:






  
    
      
      lat
      lon
      date
      city
      country
    
  
  
    
      0
      51.481292
      -0.451011
      05/14/2014 09:07
      West Drayton
      United Kingdom
    
    
      1
      51.474005
      -0.450999
      05/14/2014 09:22
      Hounslow
      United Kingdom
    
    
      2
      51.478199
      -0.446081
      05/14/2014 10:51
      Hounslow
      United Kingdom
    
    
      3
      51.478199
      -0.446081
      05/14/2014 11:24
      Hounslow
      United Kingdom
    
    
      4
      51.474146
      -0.451562
      05/14/2014 11:38
      Hounslow
      United Kingdom

With Python strings and lists, you can access individual elements inside the object by indexing or slicing with square brackets like my_list[0]. The same syntax works on pandas dataframes and series.



In [4]:

    
# get a column from the dataframe by its label with [] indexing - returns column as a pandas series with the dataframe's index
df['city'].head()









    Out[4]:





0    West Drayton
1        Hounslow
2        Hounslow
3        Hounslow
4        Hounslow
Name: city, dtype: object

You can slice a pandas dataframe to grab multiple columns or rows at once, by their index, just like you can slice a list or a string. Here, we just pass a list of 2 column names in:



In [5]:

    
# get multiple columns by their labels by passing a list of column names within the [] operator - returns a dataframe
df[['city', 'country']].head()









    Out[5]:






  
    
      
      city
      country
    
  
  
    
      0
      West Drayton
      United Kingdom
    
    
      1
      Hounslow
      United Kingdom
    
    
      2
      Hounslow
      United Kingdom
    
    
      3
      Hounslow
      United Kingdom
    
    
      4
      Hounslow
      United Kingdom

To get a single "cell's" value out of a dataframe, pass a column name, then a row label. This is equivalent to slicing the dataframe down to a single series, then slicing a single value out of that series using [ ] indexing.



In [6]:

    
# get a value using the [] operator for a column label then a row label
df['city'][0]









    Out[6]:





'West Drayton'

Using .loc[ ]



In [7]:

    
# use .loc to select by row label - returns the row as a series whose index is the dataframe column labels
df.loc[0]









    Out[7]:





lat                 51.4813
lon               -0.451011
date       05/14/2014 09:07
city           West Drayton
country      United Kingdom
Name: 0, dtype: object



In [8]:

    
# use .loc to select by row label and column label
df.loc[0, 'country']









    Out[8]:





'United Kingdom'



In [9]:

    
# 0:3 is a slice of rows with label 0 to label 3
# ['city', 'date'] is a list of column labels
df.loc[0:3, ['city', 'date']]









    Out[9]:






  
    
      
      city
      date
    
  
  
    
      0
      West Drayton
      05/14/2014 09:07
    
    
      1
      Hounslow
      05/14/2014 09:22
    
    
      2
      Hounslow
      05/14/2014 10:51
    
    
      3
      Hounslow
      05/14/2014 11:24



In [10]:

    
# slice by rows and columns
df.loc[0:3, 'date':'country']









    Out[10]:






  
    
      
      date
      city
      country
    
  
  
    
      0
      05/14/2014 09:07
      West Drayton
      United Kingdom
    
    
      1
      05/14/2014 09:22
      Hounslow
      United Kingdom
    
    
      2
      05/14/2014 10:51
      Hounslow
      United Kingdom
    
    
      3
      05/14/2014 11:24
      Hounslow
      United Kingdom



In [11]:

    
# you can set values with .loc as well
print(df.loc[0, 'city'])
df.loc[0, 'city'] = 'London'
print(df.loc[0, 'city'])









    



West Drayton
London

Using .iloc[ ]



In [12]:

    
# use .iloc for integer position based indexing
# get the value from the row in position 3 and the column in position 2
df.iloc[3, 2]









    Out[12]:





'05/14/2014 11:24'



In [13]:

    
# you can use .iloc with slices too
# slice rows from position 112 to 115 and columns from position 2 to 4
# iloc is not inclusive, so for example "from 2 to 4" will return positions 2 and 3 (but not 4)
df.iloc[112:115, 2:4]









    Out[13]:






  
    
      
      date
      city
    
  
  
    
      112
      05/16/2014 07:03
      Lisbon
    
    
      113
      05/16/2014 07:33
      Lisbon
    
    
      114
      05/16/2014 08:03
      Lisbon



In [14]:

    
# you can use iloc to select every nth row from a data set
n = 300
df.iloc[range(0, len(df), n)]









    Out[14]:






  
    
      
      lat
      lon
      date
      city
      country
    
  
  
    
      0
      51.481292
      -0.451011
      05/14/2014 09:07
      London
      United Kingdom
    
    
      300
      41.377091
      2.151175
      05/20/2014 03:18
      Barcelona
      Spain
    
    
      600
      50.052338
      19.946220
      05/31/2014 21:10
      Krakow
      Poland
    
    
      900
      48.561181
      9.059672
      06/09/2014 15:12
      Tübingen
      Germany
    
    
      1200
      41.378301
      2.187443
      06/17/2014 16:37
      Barcelona
      Spain
    
    
      1500
      42.208201
      20.735993
      06/30/2014 08:27
      Prizren
      Kosovo



In [15]:

    
# it's easier to tell the difference between .loc and .iloc if the index labels aren't the same as their positions
df.index = [label**2 for label in df.index]
df.head()









    Out[15]:






  
    
      
      lat
      lon
      date
      city
      country
    
  
  
    
      0
      51.481292
      -0.451011
      05/14/2014 09:07
      London
      United Kingdom
    
    
      1
      51.474005
      -0.450999
      05/14/2014 09:22
      Hounslow
      United Kingdom
    
    
      4
      51.478199
      -0.446081
      05/14/2014 10:51
      Hounslow
      United Kingdom
    
    
      9
      51.478199
      -0.446081
      05/14/2014 11:24
      Hounslow
      United Kingdom
    
    
      16
      51.474146
      -0.451562
      05/14/2014 11:38
      Hounslow
      United Kingdom



In [16]:

    
# this returns the rows with labels between 4 and 9 (.loc is inclusive)
df.loc[4:9]









    Out[16]:






  
    
      
      lat
      lon
      date
      city
      country
    
  
  
    
      4
      51.478199
      -0.446081
      05/14/2014 10:51
      Hounslow
      United Kingdom
    
    
      9
      51.478199
      -0.446081
      05/14/2014 11:24
      Hounslow
      United Kingdom



In [17]:

    
# this returns rows with labels in positions 4 through 8 (not through 9, because .iloc is not inclusive)
df.iloc[4:9]









    Out[17]:






  
    
      
      lat
      lon
      date
      city
      country
    
  
  
    
      16
      51.474146
      -0.451562
      05/14/2014 11:38
      Hounslow
      United Kingdom
    
    
      25
      51.480146
      -0.441103
      05/14/2014 12:23
      Hounslow
      United Kingdom
    
    
      36
      51.479669
      -0.441024
      05/14/2014 12:38
      Hounslow
      United Kingdom
    
    
      49
      38.781775
      -9.137544
      05/14/2014 15:11
      Lisbon
      Portugal
    
    
      64
      38.774422
      -9.129195
      05/14/2014 15:25
      Lisbon
      Portugal

Part 2
How to select rows by some value(s)



In [18]:

    
# load a reduced set of gps data
df = pd.read_csv('data/summer-travel-gps-simplified.csv')
df.tail()









    Out[18]:






  
    
      
      lat
      lon
      date
      city
      country
    
  
  
    
      173
      41.044556
      28.983286
      07/08/2014 16:44
      Istanbul
      Turkey
    
    
      174
      41.008992
      28.968268
      07/08/2014 20:03
      Istanbul
      Turkey
    
    
      175
      41.043487
      28.985488
      07/08/2014 22:18
      Istanbul
      Turkey
    
    
      176
      40.977637
      28.823879
      07/09/2014 09:03
      Istanbul
      Turkey
    
    
      177
      48.357110
      11.791346
      07/09/2014 13:20
      Munich
      Germany



In [19]:

    
# create a Series of true/false, indicating if each row in the column is equal to some value
df['city']=='Munich'









    Out[19]:





0      False
1      False
2      False
3      False
4      False
5      False
6      False
7      False
8      False
9      False
10     False
11     False
12     False
13     False
14     False
15     False
16     False
17     False
18     False
19     False
20     False
21     False
22     False
23     False
24     False
25     False
26     False
27     False
28     False
29     False
       ...  
148    False
149    False
150    False
151    False
152    False
153    False
154    False
155    False
156    False
157    False
158    False
159    False
160    False
161    False
162    False
163    False
164    False
165    False
166    False
167    False
168    False
169    False
170    False
171    False
172    False
173    False
174    False
175    False
176    False
177     True
Name: city, dtype: bool



In [20]:

    
# now, select only those rows in the df that match that condition
df[df['city']=='Munich']









    Out[20]:






  
    
      
      lat
      lon
      date
      city
      country
    
  
  
    
      177
      48.35711
      11.791346
      07/09/2014 13:20
      Munich
      Germany



In [21]:

    
# pandas logical operators are: | for or, & for and, ~ for not
# these must be grouped by using parentheses
# what cities were visited in spain that were not barcelona?
not_bcn = df[(df['country']=='Spain') & ~(df['city']=='Barcelona')]
not_bcn









    Out[21]:






  
    
      
      lat
      lon
      date
      city
      country
    
  
  
    
      24
      41.303911
      2.105931
      05/18/2014 22:35
      El Prat de Llobregat
      Spain
    
    
      25
      41.289946
      2.064590
      05/18/2014 23:04
      Viladecans
      Spain
    
    
      126
      41.306752
      2.097624
      06/12/2014 17:19
      El Prat de Llobregat
      Spain
    
    
      127
      41.304333
      2.072728
      06/12/2014 17:49
      El Prat de Llobregat
      Spain
    
    
      131
      41.358460
      2.128701
      06/13/2014 11:35
      Hospitalet de Llobregat
      Spain
    
    
      138
      41.294761
      2.059722
      06/20/2014 22:15
      Viladecans
      Spain



In [22]:

    
not_bcn['city'].unique()









    Out[22]:





array(['El Prat de Llobregat', 'Viladecans', 'Hospitalet de Llobregat'], dtype=object)



In [23]:

    
# select rows where either the city is munich, or the country is serbia
df[(df['city']=='Munich') | (df['country']=='Serbia')]









    Out[23]:






  
    
      
      lat
      lon
      date
      city
      country
    
  
  
    
      139
      44.821164
      20.289821
      06/21/2014 01:59
      Belgrade
      Serbia
    
    
      140
      44.820414
      20.463465
      06/21/2014 18:44
      Belgrade
      Serbia
    
    
      141
      44.761583
      19.577904
      06/22/2014 07:58
      Slepčević
      Serbia
    
    
      177
      48.357110
      11.791346
      07/09/2014 13:20
      Munich
      Germany



In [24]:

    
# how many observations are west of the prime meridian?
len(df[df['lon'] < 0])









    Out[24]:





24



In [25]:

    
# get all rows that contain a city that starts with the letter G
row_mask = df['city'].str.startswith('G')
df[row_mask]









    Out[25]:






  
    
      
      lat
      lon
      date
      city
      country
    
  
  
    
      62
      50.273632
      18.729429
      06/02/2014 06:39
      Gliwice
      Poland
    
    
      114
      48.282940
      8.199630
      06/10/2014 13:33
      Gutach
      Germany
    
    
      115
      48.389029
      8.021342
      06/10/2014 13:48
      Gengenbach
      Germany
    
    
      152
      40.187825
      20.079303
      07/04/2014 17:42
      Gjirokastër
      Albania



In [26]:

    
# select all rows with certain city names by using .isin([list])
row_mask = df['city'].isin(['Munich', 'Berat', 'Maia', 'Sarajevo'])
df[row_mask]









    Out[26]:






  
    
      
      lat
      lon
      date
      city
      country
    
  
  
    
      23
      41.237205
      -8.672173
      05/18/2014 19:55
      Maia
      Portugal
    
    
      142
      43.859459
      18.425962
      06/22/2014 21:12
      Sarajevo
      Bosnia and Herzegovina
    
    
      143
      43.858562
      18.397478
      06/23/2014 14:12
      Sarajevo
      Bosnia and Herzegovina
    
    
      151
      40.702184
      19.946288
      07/04/2014 08:14
      Berat
      Albania
    
    
      177
      48.357110
      11.791346
      07/09/2014 13:20
      Munich
      Germany

Part 3
How to select based on a date-time values



In [27]:

    
# load the location data set, indexed by the date field
# and, parse the dates so they're no longer strings but now rather Python datetime objects
# this lets us do date and time based operations on the data set
dt = pd.read_csv('data/summer-travel-gps-full.csv', index_col='date', parse_dates=True)
dt.head()









    Out[27]:






  
    
      
      lat
      lon
      city
      country
    
    
      date
      
      
      
      
    
  
  
    
      2014-05-14 09:07:00
      51.481292
      -0.451011
      West Drayton
      United Kingdom
    
    
      2014-05-14 09:22:00
      51.474005
      -0.450999
      Hounslow
      United Kingdom
    
    
      2014-05-14 10:51:00
      51.478199
      -0.446081
      Hounslow
      United Kingdom
    
    
      2014-05-14 11:24:00
      51.478199
      -0.446081
      Hounslow
      United Kingdom
    
    
      2014-05-14 11:38:00
      51.474146
      -0.451562
      Hounslow
      United Kingdom



In [28]:

    
len(dt)









    Out[28]:





1759



In [29]:

    
# 1759 rows - but is the timestamp index unique?
dt.index.is_unique









    Out[29]:





False



In [30]:

    
# the index is not unique - drop duplicates
dt.drop_duplicates(inplace=True)
len(dt)









    Out[30]:





1750



In [31]:

    
# the .weekday attribute determines which day of the week a date is
# 0 is sunday and 6 is saturday, M-F are 1-5
# what day of the week is each datetime in our dataframe's index?
str(list(dt.index.weekday))









    Out[31]:





'[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]'



In [32]:

    
# use bitwise OR to create a boolean vector of which rows are a weekend date
weekend_mask = (dt.index.weekday==6) | (dt.index.weekday==0)
weekend_mask









    Out[32]:





array([False, False, False, ..., False, False, False], dtype=bool)



In [33]:

    
weekends = dt[weekend_mask]
weekdays = dt[~weekend_mask]



In [34]:

    
hourly_share = pd.DataFrame()

# calculate what share of the weekday observations each hour has
weekday_hourly = weekdays.groupby(weekdays.index.hour).size()
hourly_share['weekday'] = weekday_hourly / sum(weekday_hourly)

# calculate what share of the weekend observations each hour has
weekend_hourly = weekends.groupby(weekends.index.hour).size()
hourly_share['weekend'] = weekend_hourly / sum(weekend_hourly)

# format the x-axis ticks like 0:00 times and plot the data
hourly_share.index = [s + ':00' for s in hourly_share.index.astype(str)]
hourly_share.plot(figsize=[9, 4], kind='bar', stacked=False, alpha=0.7, title='Share of observations, by hour')









    Out[34]:





<matplotlib.axes._subplots.AxesSubplot at 0x1a63b4f8470>



In [35]:

    
# calculate and plot the number of observations each day of the week has
daily_count = dt.groupby(dt.index.weekday).size()
daily_count.index = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
daily_count.plot(figsize=[8, 5], kind='bar', title='Number of observations, by day of week')









    Out[35]:





<matplotlib.axes._subplots.AxesSubplot at 0x1a63b53fcc0>

Part 4
map(), apply(), and applymap() in pandas

These methods are useful for mapping/applying a function across elements, rows, and columns of a pandas DataFrame or Series. But they have some important and often confusing differences.

map() applies a function element-wise on a Series
apply() works on a row or column basis on a DataFrame (specify the axis!), or on a row basis on a Series
applymap() works element-wise on an entire DataFrame

Let's see what that means in practice with some examples.



In [36]:

    
# create a new DataFrame with fake year data
df = pd.DataFrame({'start_year':[2001, 2002, 2005, 2005, 2006], 
                   'end_year':[2002, 2010, 2008, 2006, 2014]})
df



In [37]:

    
# you can iterate through a DataFrame using the .iterrows() method
for _, row in df.iterrows():
    start_year = row['start_year']
    if start_year > 2004:
        print(start_year + 10)
    else:
        print(start_year)

alternatively, .map() applies a function element-wise on a Series



In [38]:

    
# create a function
def get_new_year(x):
    if x > 2004:
        return x + 10
    else:
        return x

# then map it to the series
df['start_year'].map(get_new_year)









    Out[38]:





0    2001
1    2002
2    2015
3    2015
4    2016
Name: start_year, dtype: int64



In [39]:

    
# or do the same thing all in one line, using a lambda function as .map()'s argument... you commonly see this in pandas
df['start_year'].map(lambda x: x + 10 if x > 2004 else x)









    Out[39]:





0    2001
1    2002
2    2015
3    2015
4    2016
Name: start_year, dtype: int64

A lambda function is a simple, one-off, anonymous function. You can't call it again later because it doesn't have a name. It just lets you repeatedly perform some operation across a series of values (in our case, a column in our dataframe) using a minimal amount of code. Also notice that the if-else statement is all on one line: this is called a ternary operator or an inline-if.



In [40]:

    
# you can easily create a new column to contain the results of the function mapping
df['new_year'] = df['start_year'].map(get_new_year)
df.head()

.apply() is like .map(), but it works on a row or column basis on an entire DataFrame (specify the axis)



In [41]:

    
# applies a function to calculate the difference between the min and max values in each column (ie, row-wise)
def get_difference(vector):
    difference = vector.max() - vector.min()
    return difference

df.apply(get_difference, axis=0)









    Out[41]:





end_year      12
start_year     5
new_year      15
dtype: int64



In [42]:

    
# same thing again, using a lambda function
df.apply(lambda x: x.max() - x.min(), axis=0)









    Out[42]:





end_year      12
start_year     5
new_year      15
dtype: int64



In [43]:

    
# here .apply() finds the difference between the min and max values in each row (ie, column-wise) and saves to a new column
df['difference'] = df.apply(get_difference, axis=1)
df

.applymap() works element-wise on an entire DataFrame

This is like doing a .map() to each column in the DataFrame



In [44]:

    
# divide every value in the dataframe by 2 (use a float so you don't do rounded integer division)
df.applymap(lambda x: x / 2.)



In [ ]:

	end_year	start_year	new_year	difference
0	1001.0	1000.5	1000.5	0.5
1	1005.0	1001.0	1001.0	4.0
2	1004.0	1002.5	1007.5	5.0
3	1003.0	1002.5	1007.5	5.0
4	1007.0	1003.0	1008.0	5.0

	lat	lon	date	city	country
0	51.481292	-0.451011	05/14/2014 09:07	West Drayton	United Kingdom
1	51.474005	-0.450999	05/14/2014 09:22	Hounslow	United Kingdom
2	51.478199	-0.446081	05/14/2014 10:51	Hounslow	United Kingdom
3	51.478199	-0.446081	05/14/2014 11:24	Hounslow	United Kingdom
4	51.474146	-0.451562	05/14/2014 11:38	Hounslow	United Kingdom

	date	city
112	05/16/2014 07:03	Lisbon
113	05/16/2014 07:33	Lisbon
114	05/16/2014 08:03	Lisbon

	lat	lon	date	city	country
173	41.044556	28.983286	07/08/2014 16:44	Istanbul	Turkey
174	41.008992	28.968268	07/08/2014 20:03	Istanbul	Turkey
175	41.043487	28.985488	07/08/2014 22:18	Istanbul	Turkey
176	40.977637	28.823879	07/09/2014 09:03	Istanbul	Turkey
177	48.357110	11.791346	07/09/2014 13:20	Munich	Germany

	lat	lon	date	city	country
24	41.303911	2.105931	05/18/2014 22:35	El Prat de Llobregat	Spain
25	41.289946	2.064590	05/18/2014 23:04	Viladecans	Spain
126	41.306752	2.097624	06/12/2014 17:19	El Prat de Llobregat	Spain
127	41.304333	2.072728	06/12/2014 17:49	El Prat de Llobregat	Spain
131	41.358460	2.128701	06/13/2014 11:35	Hospitalet de Llobregat	Spain
138	41.294761	2.059722	06/20/2014 22:15	Viladecans	Spain

	lat	lon	date	city	country
139	44.821164	20.289821	06/21/2014 01:59	Belgrade	Serbia
140	44.820414	20.463465	06/21/2014 18:44	Belgrade	Serbia
141	44.761583	19.577904	06/22/2014 07:58	Slepčević	Serbia
177	48.357110	11.791346	07/09/2014 13:20	Munich	Germany

	lat	lon	date	city	country
62	50.273632	18.729429	06/02/2014 06:39	Gliwice	Poland
114	48.282940	8.199630	06/10/2014 13:33	Gutach	Germany
115	48.389029	8.021342	06/10/2014 13:48	Gengenbach	Germany
152	40.187825	20.079303	07/04/2014 17:42	Gjirokastër	Albania

	lat	lon	date	city	country
23	41.237205	-8.672173	05/18/2014 19:55	Maia	Portugal
142	43.859459	18.425962	06/22/2014 21:12	Sarajevo	Bosnia and Herzegovina
143	43.858562	18.397478	06/23/2014 14:12	Sarajevo	Bosnia and Herzegovina
151	40.702184	19.946288	07/04/2014 08:14	Berat	Albania
177	48.357110	11.791346	07/09/2014 13:20	Munich	Germany

	lat	lon	city	country
date
2014-05-14 09:07:00	51.481292	-0.451011	West Drayton	United Kingdom
2014-05-14 09:22:00	51.474005	-0.450999	Hounslow	United Kingdom
2014-05-14 10:51:00	51.478199	-0.446081	Hounslow	United Kingdom
2014-05-14 11:24:00	51.478199	-0.446081	Hounslow	United Kingdom
2014-05-14 11:38:00	51.474146	-0.451562	Hounslow	United Kingdom

	end_year	start_year	new_year
0	2002	2001	2001
1	2010	2002	2002
2	2008	2005	2015
3	2006	2005	2015
4	2014	2006	2016