Tips for Selecting Columns in a DataFrame

Notebook to accompany this post.



In [1]:

    
import pandas as pd
import numpy as np



In [2]:

    
df = pd.read_csv(
    'https://data.cityofnewyork.us/api/views/vfnx-vebw/rows.csv?accessType=DOWNLOAD&bom=true&format=true'
)

Build a list that pairs each column's integer position with its name, so we can look up positions at a glance.



In [3]:

    
col_mapping = [f"{i}:{name}" for i, name in enumerate(df.columns)]



In [4]:

    
col_mapping









    Out[4]:





['0:X',
 '1:Y',
 '2:Unique Squirrel ID',
 '3:Hectare',
 '4:Shift',
 '5:Date',
 '6:Hectare Squirrel Number',
 '7:Age',
 '8:Primary Fur Color',
 '9:Highlight Fur Color',
 '10:Combination of Primary and Highlight Color',
 '11:Color notes',
 '12:Location',
 '13:Above Ground Sighter Measurement',
 '14:Specific Location',
 '15:Running',
 '16:Chasing',
 '17:Climbing',
 '18:Eating',
 '19:Foraging',
 '20:Other Activities',
 '21:Kuks',
 '22:Quaas',
 '23:Moans',
 '24:Tail flags',
 '25:Tail twitches',
 '26:Approaches',
 '27:Indifferent',
 '28:Runs from',
 '29:Other Interactions',
 '30:Lat/Long',
 '31:Zip Codes',
 '32:Community Districts',
 '33:Borough Boundaries',
 '34:City Council Districts',
 '35:Police Precincts']

We can also build a dictionary keyed by column position.



In [5]:

    
col_mapping_dict = dict(enumerate(df.columns))



In [6]:

    
col_mapping_dict









    Out[6]:





{0: 'X',
 1: 'Y',
 2: 'Unique Squirrel ID',
 3: 'Hectare',
 4: 'Shift',
 5: 'Date',
 6: 'Hectare Squirrel Number',
 7: 'Age',
 8: 'Primary Fur Color',
 9: 'Highlight Fur Color',
 10: 'Combination of Primary and Highlight Color',
 11: 'Color notes',
 12: 'Location',
 13: 'Above Ground Sighter Measurement',
 14: 'Specific Location',
 15: 'Running',
 16: 'Chasing',
 17: 'Climbing',
 18: 'Eating',
 19: 'Foraging',
 20: 'Other Activities',
 21: 'Kuks',
 22: 'Quaas',
 23: 'Moans',
 24: 'Tail flags',
 25: 'Tail twitches',
 26: 'Approaches',
 27: 'Indifferent',
 28: 'Runs from',
 29: 'Other Interactions',
 30: 'Lat/Long',
 31: 'Zip Codes',
 32: 'Community Districts',
 33: 'Borough Boundaries',
 34: 'City Council Districts',
 35: 'Police Precincts'}
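Going the other way, from a column name to its position, does not need a hand-built mapping. A minimal sketch, assuming a small stand-in frame (`demo` and its column subset are made up for illustration), using `Index.get_loc` and `Index.get_indexer`:

```python
import pandas as pd

# Hypothetical stand-in for the squirrel data; only the column names matter here.
demo = pd.DataFrame(columns=["X", "Y", "Unique Squirrel ID", "Hectare", "Shift"])

# Position of a single column name.
pos = demo.columns.get_loc("Unique Squirrel ID")

# Positions of several names at once; names that are missing come back as -1.
positions = demo.columns.get_indexer(["Y", "Shift", "Not A Column"])

print(pos)                 # 2
print(positions.tolist())  # [1, 4, -1]
```

The positions returned this way can be passed straight to .iloc.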

Use iloc to select just the third column (Unique Squirrel ID, at position 2 because positions are zero-based)



In [7]:

    
df.iloc[:, 2]









    Out[7]:





0       37F-PM-1014-03
1       37E-PM-1006-03
2        2E-AM-1010-03
3        5D-PM-1018-05
4       39B-AM-1018-01
             ...      
3018    30B-AM-1007-04
3019    19A-PM-1013-05
3020    22D-PM-1012-07
3021    29B-PM-1010-02
3022     5E-PM-1012-01
Name: Unique Squirrel ID, Length: 3023, dtype: object
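Note that a single integer returns a Series, as in the output above. A short sketch of the difference between a scalar position and a one-element list, assuming a tiny made-up frame (`demo` and its values are illustrative only):

```python
import pandas as pd

# Hypothetical three-row frame standing in for the real data.
demo = pd.DataFrame({
    "X": [-73.95, -73.96, -73.97],
    "Y": [40.79, 40.78, 40.77],
    "Unique Squirrel ID": ["A-1", "B-2", "C-3"],
})

as_series = demo.iloc[:, 2]    # scalar position -> Series
as_frame = demo.iloc[:, [2]]   # one-element list -> DataFrame with one column

print(type(as_series).__name__, as_series.name)
print(type(as_frame).__name__, list(as_frame.columns))
```

Wrapping the position in a list is a handy way to keep a DataFrame when downstream code expects one.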

Pass a list of integers to select multiple columns by index



In [8]:

    
df.iloc[:, [0,1,2]]









    Out[8]:







  
    
      
               X          Y  Unique Squirrel ID
0     -73.956134  40.794082      37F-PM-1014-03
1     -73.957044  40.794851      37E-PM-1006-03
2     -73.976831  40.766718       2E-AM-1010-03
3     -73.975725  40.769703       5D-PM-1018-05
4     -73.959313  40.797533      39B-AM-1018-01
...          ...        ...                 ...
3018  -73.963943  40.790868      30B-AM-1007-04
3019  -73.970402  40.782560      19A-PM-1013-05
3020  -73.966587  40.783678      22D-PM-1012-07
3021  -73.963994  40.789915      29B-PM-1010-02
3022  -73.975479  40.769640       5E-PM-1012-01
    
  

3023 rows × 3 columns

We can also pass a slice to select a range of columns. As with ordinary Python slices, the end position is excluded, so 0:3 selects positions 0, 1 and 2.



In [9]:

    
df.iloc[:, 0:3]









    Out[9]:







  
    
      
               X          Y  Unique Squirrel ID
0     -73.956134  40.794082      37F-PM-1014-03
1     -73.957044  40.794851      37E-PM-1006-03
2     -73.976831  40.766718       2E-AM-1010-03
3     -73.975725  40.769703       5D-PM-1018-05
4     -73.959313  40.797533      39B-AM-1018-01
...          ...        ...                 ...
3018  -73.963943  40.790868      30B-AM-1007-04
3019  -73.970402  40.782560      19A-PM-1013-05
3020  -73.966587  40.783678      22D-PM-1012-07
3021  -73.963994  40.789915      29B-PM-1010-02
3022  -73.975479  40.769640       5E-PM-1012-01
    
  

3023 rows × 3 columns
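One detail that is easy to trip over: positional slices with .iloc exclude the end, while label slices with .loc include it. A small sketch under the assumption of a made-up one-row frame (`demo` is illustrative, not the real data):

```python
import pandas as pd

# Hypothetical one-row frame reusing a few of the real column names.
demo = pd.DataFrame(
    [[-73.95, 40.79, "A-1", "37F"]],
    columns=["X", "Y", "Unique Squirrel ID", "Hectare"],
)

by_position = demo.iloc[:, 0:3]                   # positions 0, 1, 2 (3 is excluded)
by_label = demo.loc[:, "X":"Unique Squirrel ID"]  # both endpoints are included

print(list(by_position.columns))
print(by_position.equals(by_label))
```

Both selections return the same three columns here, even though the slice endpoints are written differently.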

To combine list and slice notation, we can use NumPy's np.r_, which concatenates the slices and integers into a single array of positions.



In [10]:

    
np.r_[0:3,15:19,24,25]









    Out[10]:





array([ 0,  1,  2, 15, 16, 17, 18, 24, 25])
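To make it clearer what np.r_ is doing, here is a sketch comparing it with the equivalent plain-Python construction. The `with_step` example is an extra illustration that is not part of the original notebook:

```python
import numpy as np

# Same expression as above: slices are expanded, bare integers are kept as-is.
positions = np.r_[0:3, 15:19, 24, 25]

# Roughly what np.r_ saves us from writing by hand.
plain = [*range(0, 3), *range(15, 19), 24, 25]

# Slices may also carry a step, just like range().
with_step = np.r_[0:10:2, 20]

print(positions.tolist() == plain)  # True
print(with_step)                    # [ 0  2  4  6  8 20]
```

Either form can be handed to .iloc; np.r_ is mostly a matter of brevity.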

We can pass the output of np.r_ to .iloc to combine both selection approaches in one call



In [11]:

    
df.iloc[:, np.r_[0:3,15:19,24,25]]









    Out[11]:







  
    
      
               X          Y  Unique Squirrel ID  Running  Chasing  Climbing  Eating  Tail flags  Tail twitches
0     -73.956134  40.794082      37F-PM-1014-03    False    False     False   False       False          False
1     -73.957044  40.794851      37E-PM-1006-03     True    False     False   False       False          False
2     -73.976831  40.766718       2E-AM-1010-03    False    False      True   False       False          False
3     -73.975725  40.769703       5D-PM-1018-05    False    False      True   False       False          False
4     -73.959313  40.797533      39B-AM-1018-01    False    False     False   False       False          False
...          ...        ...                 ...      ...      ...       ...     ...         ...            ...
3018  -73.963943  40.790868      30B-AM-1007-04    False    False     False    True       False          False
3019  -73.970402  40.782560      19A-PM-1013-05    False    False     False   False       False          False
3020  -73.966587  40.783678      22D-PM-1012-07    False    False     False    True       False          False
3021  -73.963994  40.789915      29B-PM-1010-02    False    False     False    True       False          False
3022  -73.975479  40.769640       5E-PM-1012-01    False    False     False    True       False          False
    
  

3023 rows × 9 columns

The same notation works with the usecols argument when reading a CSV



In [12]:

    
df_2 = pd.read_csv(
    'https://data.cityofnewyork.us/api/views/vfnx-vebw/rows.csv?accessType=DOWNLOAD&bom=true&format=true',
    usecols=np.r_[1,2,5:8,15:25],
)



In [13]:

    
df_2.head()









    Out[13]:







  
    
      
           Y  Unique Squirrel ID      Date  Hectare Squirrel Number       Age  Running  Chasing  Climbing  Eating  Foraging  Other Activities   Kuks  Quaas  Moans  Tail flags
0  40.794082      37F-PM-1014-03  10142018                        3       NaN    False    False     False   False     False               NaN  False  False  False       False
1  40.794851      37E-PM-1006-03  10062018                        3     Adult     True    False     False   False     False               NaN  False  False  False       False
2  40.766718       2E-AM-1010-03  10102018                        3     Adult    False    False      True   False     False               NaN  False  False  False       False
3  40.769703       5D-PM-1018-05  10182018                        5  Juvenile    False    False      True   False     False               NaN  False  False  False       False
4  40.797533      39B-AM-1018-01  10182018                        1       NaN    False    False     False   False     False           unknown   True  False  False       False
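One thing worth knowing about usecols is that it ignores the order of the positions: columns always come back in file order. A self-contained sketch, assuming a tiny in-memory CSV (`csv_text` is made up so the example runs without a download):

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV with six columns, held in memory instead of fetched from a URL.
csv_text = "a,b,c,d,e,f\n1,2,3,4,5,6\n7,8,9,10,11,12\n"

# np.r_[4, 0:2] expands to [4, 0, 1], but the result follows the file's column order.
subset = pd.read_csv(io.StringIO(csv_text), usecols=np.r_[4, 0:2])
print(list(subset.columns))  # ['a', 'b', 'e']

# Reorder explicitly afterwards if a particular order matters.
reordered = subset[["e", "a", "b"]]
print(list(reordered.columns))
```

This differs from .iloc, which returns columns in exactly the order the positions are given.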

We can also select columns using a boolean array



In [14]:

    
run_cols = df.columns.str.contains('run', case=False)
run_cols









    Out[14]:





array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False,  True, False, False,
       False, False, False, False, False, False, False, False, False,
       False,  True, False, False, False, False, False, False, False])



In [15]:

    
df.iloc[:, run_cols].head()









    Out[15]:







  
    
      
   Running  Runs from
0    False      False
1     True       True
2    False      False
3    False       True
4    False      False

A lambda function lets us build the boolean mask and select the columns in a single line.



In [16]:

    
df.iloc[:, lambda df:df.columns.str.contains('run', case=False)].head()









    Out[16]:







  
    
      
   Running  Runs from
0    False      False
1     True       True
2    False      False
3    False       True
4    False      False
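The same boolean mask also works with .loc, and DataFrame.filter offers a regex-based shortcut. A sketch comparing the three, assuming a small made-up frame (`demo` is illustrative; the `(?i)` prefix is the standard inline flag for case-insensitive matching):

```python
import pandas as pd

# Hypothetical two-row frame with two "run" columns and one unrelated column.
demo = pd.DataFrame(
    [[True, False, 1], [False, True, 2]],
    columns=["Running", "Runs from", "Hectare"],
)

mask = demo.columns.str.contains("run", case=False)

via_iloc = demo.iloc[:, mask]             # boolean array by position
via_loc = demo.loc[:, mask]               # the same mask is accepted by .loc
via_filter = demo.filter(regex="(?i)run") # regex match on column names

print(list(via_iloc.columns))
print(via_iloc.equals(via_loc), via_iloc.equals(via_filter))
```

Which form to use is largely a matter of taste; the mask is the most flexible because it can be combined with other conditions before selecting.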

A more complex example that matches several patterns at once with a regular expression



In [17]:

    
df.iloc[:, lambda df: df.columns.str.contains('district|precinct|boundaries',
                                              case=False)].head()









    Out[17]:







  
    
      
   Community Districts  Borough Boundaries  City Council Districts  Police Precincts
0                   19                   4                      19                13
1                   19                   4                      19                13
2                   19                   4                      19                13
3                   19                   4                      19                13
4                   19                   4                      19                13

To combine positional indices with a boolean array, first convert the boolean array into a list of positions, then pass both to np.r_



In [18]:

    
location_cols = df.columns.str.contains('district|precinct|boundaries',
                                        case=False)
location_cols









    Out[18]:





array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False,  True,  True,  True,  True])



In [19]:

    
location_indices = [i for i, col in enumerate(location_cols) if col]
location_indices









    Out[19]:





[32, 33, 34, 35]



In [20]:

    
df.iloc[:, np.r_[0:3,location_indices]].head()









    Out[20]:







  
    
      
            X          Y  Unique Squirrel ID  Community Districts  Borough Boundaries  City Council Districts  Police Precincts
0  -73.956134  40.794082      37F-PM-1014-03                   19                   4                      19                13
1  -73.957044  40.794851      37E-PM-1006-03                   19                   4                      19                13
2  -73.976831  40.766718       2E-AM-1010-03                   19                   4                      19                13
3  -73.975725  40.769703       5D-PM-1018-05                   19                   4                      19                13
4  -73.959313  40.797533      39B-AM-1018-01                   19                   4                      19                13
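The list comprehension that turns the boolean array into positions can also be replaced by `np.flatnonzero`, which returns the indices of the True entries directly. A minimal sketch, assuming a made-up one-row frame (`demo` and its values are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical frame reusing a handful of the real column names.
cols = ["X", "Y", "Unique Squirrel ID", "Age", "Community Districts", "Police Precincts"]
demo = pd.DataFrame([[-73.95, 40.79, "A-1", "Adult", 19, 13]], columns=cols)

mask = demo.columns.str.contains("district|precinct", case=False)

# Indices of the True entries, as an integer array.
positions = np.flatnonzero(mask)

# Combine a slice with the derived positions in one selection.
combined = demo.iloc[:, np.r_[0:3, positions]]

print(positions.tolist())      # [4, 5]
print(list(combined.columns))
```

This is a stylistic alternative rather than a correction; the list comprehension produces the same positions.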


