Tips for Selecting Columns in a DataFrame

Notebook to accompany this post.


In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv(
    'https://data.cityofnewyork.us/api/views/vfnx-vebw/rows.csv?accessType=DOWNLOAD&bom=true&format=true'
)

Build a mapping list so we can see the index of all the columns


In [3]:
col_mapping = [f"{c[0]}:{c[1]}" for c in enumerate(df.columns)]

In [4]:
col_mapping


Out[4]:
['0:X',
 '1:Y',
 '2:Unique Squirrel ID',
 '3:Hectare',
 '4:Shift',
 '5:Date',
 '6:Hectare Squirrel Number',
 '7:Age',
 '8:Primary Fur Color',
 '9:Highlight Fur Color',
 '10:Combination of Primary and Highlight Color',
 '11:Color notes',
 '12:Location',
 '13:Above Ground Sighter Measurement',
 '14:Specific Location',
 '15:Running',
 '16:Chasing',
 '17:Climbing',
 '18:Eating',
 '19:Foraging',
 '20:Other Activities',
 '21:Kuks',
 '22:Quaas',
 '23:Moans',
 '24:Tail flags',
 '25:Tail twitches',
 '26:Approaches',
 '27:Indifferent',
 '28:Runs from',
 '29:Other Interactions',
 '30:Lat/Long',
 '31:Zip Codes',
 '32:Community Districts',
 '33:Borough Boundaries',
 '34:City Council Districts',
 '35:Police Precincts']

We can also build a dictionary


In [5]:
col_mapping_dict = {c[0]:c[1] for c in enumerate(df.columns)}

In [6]:
col_mapping_dict


Out[6]:
{0: 'X',
 1: 'Y',
 2: 'Unique Squirrel ID',
 3: 'Hectare',
 4: 'Shift',
 5: 'Date',
 6: 'Hectare Squirrel Number',
 7: 'Age',
 8: 'Primary Fur Color',
 9: 'Highlight Fur Color',
 10: 'Combination of Primary and Highlight Color',
 11: 'Color notes',
 12: 'Location',
 13: 'Above Ground Sighter Measurement',
 14: 'Specific Location',
 15: 'Running',
 16: 'Chasing',
 17: 'Climbing',
 18: 'Eating',
 19: 'Foraging',
 20: 'Other Activities',
 21: 'Kuks',
 22: 'Quaas',
 23: 'Moans',
 24: 'Tail flags',
 25: 'Tail twitches',
 26: 'Approaches',
 27: 'Indifferent',
 28: 'Runs from',
 29: 'Other Interactions',
 30: 'Lat/Long',
 31: 'Zip Codes',
 32: 'Community Districts',
 33: 'Borough Boundaries',
 34: 'City Council Districts',
 35: 'Police Precincts'}

Use iloc to select just the second column (Unique Squirrel ID)


In [7]:
df.iloc[:, 2]


Out[7]:
0       37F-PM-1014-03
1       37E-PM-1006-03
2        2E-AM-1010-03
3        5D-PM-1018-05
4       39B-AM-1018-01
             ...      
3018    30B-AM-1007-04
3019    19A-PM-1013-05
3020    22D-PM-1012-07
3021    29B-PM-1010-02
3022     5E-PM-1012-01
Name: Unique Squirrel ID, Length: 3023, dtype: object

Pass a list of integers to select multiple columns by index


In [8]:
df.iloc[:, [0,1,2]]


Out[8]:
X Y Unique Squirrel ID
0 -73.956134 40.794082 37F-PM-1014-03
1 -73.957044 40.794851 37E-PM-1006-03
2 -73.976831 40.766718 2E-AM-1010-03
3 -73.975725 40.769703 5D-PM-1018-05
4 -73.959313 40.797533 39B-AM-1018-01
... ... ... ...
3018 -73.963943 40.790868 30B-AM-1007-04
3019 -73.970402 40.782560 19A-PM-1013-05
3020 -73.966587 40.783678 22D-PM-1012-07
3021 -73.963994 40.789915 29B-PM-1010-02
3022 -73.975479 40.769640 5E-PM-1012-01

3023 rows × 3 columns

We can also pass a slice object to select a range of columns


In [9]:
df.iloc[:, 0:3]


Out[9]:
X Y Unique Squirrel ID
0 -73.956134 40.794082 37F-PM-1014-03
1 -73.957044 40.794851 37E-PM-1006-03
2 -73.976831 40.766718 2E-AM-1010-03
3 -73.975725 40.769703 5D-PM-1018-05
4 -73.959313 40.797533 39B-AM-1018-01
... ... ... ...
3018 -73.963943 40.790868 30B-AM-1007-04
3019 -73.970402 40.782560 19A-PM-1013-05
3020 -73.966587 40.783678 22D-PM-1012-07
3021 -73.963994 40.789915 29B-PM-1010-02
3022 -73.975479 40.769640 5E-PM-1012-01

3023 rows × 3 columns

If we want to combine the list and slice notation, we need to use nump.r_ to process the data into an appropriate format.


In [10]:
np.r_[0:3,15:19,24,25]


Out[10]:
array([ 0,  1,  2, 15, 16, 17, 18, 24, 25])

We can pass the output of np.r_ to .iloc to use multiple selection approaches


In [11]:
df.iloc[:, np.r_[0:3,15:19,24,25]]


Out[11]:
X Y Unique Squirrel ID Running Chasing Climbing Eating Tail flags Tail twitches
0 -73.956134 40.794082 37F-PM-1014-03 False False False False False False
1 -73.957044 40.794851 37E-PM-1006-03 True False False False False False
2 -73.976831 40.766718 2E-AM-1010-03 False False True False False False
3 -73.975725 40.769703 5D-PM-1018-05 False False True False False False
4 -73.959313 40.797533 39B-AM-1018-01 False False False False False False
... ... ... ... ... ... ... ... ... ...
3018 -73.963943 40.790868 30B-AM-1007-04 False False False True False False
3019 -73.970402 40.782560 19A-PM-1013-05 False False False False False False
3020 -73.966587 40.783678 22D-PM-1012-07 False False False True False False
3021 -73.963994 40.789915 29B-PM-1010-02 False False False True False False
3022 -73.975479 40.769640 5E-PM-1012-01 False False False True False False

3023 rows × 9 columns

We can use the same notation when reading in a csv as well


In [12]:
df_2 = pd.read_csv(
    'https://data.cityofnewyork.us/api/views/vfnx-vebw/rows.csv?accessType=DOWNLOAD&bom=true&format=true',
    usecols=np.r_[1,2,5:8,15:25],
)

In [13]:
df_2.head()


Out[13]:
Y Unique Squirrel ID Date Hectare Squirrel Number Age Running Chasing Climbing Eating Foraging Other Activities Kuks Quaas Moans Tail flags
0 40.794082 37F-PM-1014-03 10142018 3 NaN False False False False False NaN False False False False
1 40.794851 37E-PM-1006-03 10062018 3 Adult True False False False False NaN False False False False
2 40.766718 2E-AM-1010-03 10102018 3 Adult False False True False False NaN False False False False
3 40.769703 5D-PM-1018-05 10182018 5 Juvenile False False True False False NaN False False False False
4 40.797533 39B-AM-1018-01 10182018 1 NaN False False False False False unknown True False False False

We can also select columns using a boolean array


In [14]:
run_cols = df.columns.str.contains('run', case=False)
run_cols


Out[14]:
array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False,  True, False, False,
       False, False, False, False, False, False, False, False, False,
       False,  True, False, False, False, False, False, False, False])

In [15]:
df.iloc[:, run_cols].head()


Out[15]:
Running Runs from
0 False False
1 True True
2 False False
3 False True
4 False False

A lambda function can be useful for combining into 1 line.


In [16]:
df.iloc[:, lambda df:df.columns.str.contains('run', case=False)].head()


Out[16]:
Running Runs from
0 False False
1 True True
2 False False
3 False True
4 False False

A more complex example


In [17]:
df.iloc[:, lambda df: df.columns.str.contains('district|precinct|boundaries',
                                              case=False)].head()


Out[17]:
Community Districts Borough Boundaries City Council Districts Police Precincts
0 19 4 19 13
1 19 4 19 13
2 19 4 19 13
3 19 4 19 13
4 19 4 19 13

Combining index and boolean arrays


In [18]:
location_cols = df.columns.str.contains('district|precinct|boundaries',
                                        case=False)
location_cols


Out[18]:
array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False,  True,  True,  True,  True])

In [19]:
location_indices = [i for i, col in enumerate(location_cols) if col]
location_indices


Out[19]:
[32, 33, 34, 35]

In [20]:
df.iloc[:, np.r_[0:3,location_indices]].head()


Out[20]:
X Y Unique Squirrel ID Community Districts Borough Boundaries City Council Districts Police Precincts
0 -73.956134 40.794082 37F-PM-1014-03 19 4 19 13
1 -73.957044 40.794851 37E-PM-1006-03 19 4 19 13
2 -73.976831 40.766718 2E-AM-1010-03 19 4 19 13
3 -73.975725 40.769703 5D-PM-1018-05 19 4 19 13
4 -73.959313 40.797533 39B-AM-1018-01 19 4 19 13

In [ ]: