03 - Pandas: Indexing and selecting data - part I

DS Data manipulation, analysis and visualisation in Python
December, 2019

© 2016-2019, Joris Van den Bossche and Stijn Van Hoey (mailto:jorisvandenbossche@gmail.com, mailto:stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons



In [1]:

    
import pandas as pd



In [2]:

    
# redefining the example objects

# series
population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3, 
                        'United Kingdom': 64.9, 'Netherlands': 16.9})

# dataframe
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries









    Out[2]:







  
    
      
      country
      population
      area
      capital
    
  
  
    
      0
      Belgium
      11.3
      30510
      Brussels
    
    
      1
      France
      64.3
      671308
      Paris
    
    
      2
      Germany
      81.3
      357050
      Berlin
    
    
      3
      Netherlands
      16.9
      41526
      Amsterdam
    
    
      4
      United Kingdom
      64.9
      244820
      London

Subsetting data

Subset variables (columns)

For a DataFrame, basic indexing selects the columns (cfr. the dictionaries of pure python)

Selecting a single column:



In [3]:

    
countries['area'] # single []









    Out[3]:





0     30510
1    671308
2    357050
3     41526
4    244820
Name: area, dtype: int64

Remember that the same syntax can also be used to add a new columns: df['new'] = ....

We can also select multiple columns by passing a list of column names into []:



In [4]:

    
countries[['area', 'population']] # double [[]]

Subset observations (rows)

Using [], slicing or boolean indexing accesses the rows:

Slicing



In [5]:

    
countries[0:4]









    Out[5]:







  
    
      
      country
      population
      area
      capital
    
  
  
    
      0
      Belgium
      11.3
      30510
      Brussels
    
    
      1
      France
      64.3
      671308
      Paris
    
    
      2
      Germany
      81.3
      357050
      Berlin
    
    
      3
      Netherlands
      16.9
      41526
      Amsterdam

Boolean indexing (filtering)

Often, you want to select rows based on a certain condition. This can be done with 'boolean indexing' (like a where clause in SQL) and comparable to numpy.

The indexer (or boolean mask) should be 1-dimensional and the same length as the thing being indexed.



In [6]:

    
countries['area'] > 100000









    Out[6]:





0    False
1     True
2     True
3    False
4     True
Name: area, dtype: bool



In [7]:

    
countries[countries['area'] > 100000]









    Out[7]:







  
    
      
      country
      population
      area
      capital
    
  
  
    
      1
      France
      64.3
      671308
      Paris
    
    
      2
      Germany
      81.3
      357050
      Berlin
    
    
      4
      United Kingdom
      64.9
      244820
      London



In [8]:

    
countries[countries['population'] > 50]









    Out[8]:







  
    
      
      country
      population
      area
      capital
    
  
  
    
      1
      France
      64.3
      671308
      Paris
    
    
      2
      Germany
      81.3
      357050
      Berlin
    
    
      4
      United Kingdom
      64.9
      244820
      London

An overview of the possible comparison operations:

Operator	Description
==	Equal
!=	Not equal
>	Greater than
>=	Greater than or equal
<	Lesser than
!=	Lesser than or equal

and to combine multiple conditions:

Operator	Description
&	And (`cond1 & cond2`)
\|	Or (`cond1 \| cond2`)

REMEMBER:

So as a summary, `[]` provides the following convenience shortcuts: * **Series**: selecting a **label**: `s[label]` * **DataFrame**: selecting a single or multiple **columns**:`df['col']` or `df[['col1', 'col2']]` * **DataFrame**: slicing or filtering the **rows**: `df['row_label1':'row_label2']` or `df[mask]`

Some other useful methods: `isin` and `string` methods

The isin method of Series is very useful to select rows that may contain certain values:



In [9]:

    
s = countries['capital']



In [10]:

    
s.isin?









    





Signature: s.isin(values)
Docstring:
Check whether `values` are contained in Series.

Return a boolean Series showing whether each element in the Series
matches an element in the passed sequence of `values` exactly.

Parameters
----------
values : set or list-like
    The sequence of values to test. Passing in a single string will
    raise a ``TypeError``. Instead, turn a single string into a
    list of one element.

    .. versionadded:: 0.18.1

      Support for values as a set.

Returns
-------
Series
    Series of booleans indicating if each element is in values.

Raises
------
TypeError
  * If `values` is a string

See Also
--------
DataFrame.isin : Equivalent method on DataFrame.

Examples
--------
>>> s = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama',
...                'hippo'], name='animal')
>>> s.isin(['cow', 'lama'])
0     True
1     True
2     True
3    False
4     True
5    False
Name: animal, dtype: bool

Passing a single string as ``s.isin('lama')`` will raise an error. Use
a list of one element instead:

>>> s.isin(['lama'])
0     True
1    False
2     True
3    False
4     True
5    False
Name: animal, dtype: bool
File:      ~/miniconda3/envs/DS-python-data-analysis/lib/python3.7/site-packages/pandas/core/series.py
Type:      method



In [11]:

    
s.isin(['Berlin', 'London'])









    Out[11]:





0    False
1    False
2     True
3    False
4     True
Name: capital, dtype: bool

This can then be used to filter the dataframe with boolean indexing:



In [12]:

    
countries[countries['capital'].isin(['Berlin', 'London'])]









    Out[12]:







  
    
      
      country
      population
      area
      capital
    
  
  
    
      2
      Germany
      81.3
      357050
      Berlin
    
    
      4
      United Kingdom
      64.9
      244820
      London

Let's say we want to select all data for which the capital starts with a 'B'. In Python, when having a string, we could use the startswith method:



In [13]:

    
string = 'Berlin'



In [14]:

    
string.startswith('B')









    Out[14]:





True

In pandas, these are available on a Series through the str namespace:



In [15]:

    
countries['capital'].str.startswith('B')









    Out[15]:





0     True
1    False
2     True
3    False
4    False
Name: capital, dtype: bool

For an overview of all string methods, see: http://pandas.pydata.org/pandas-docs/stable/api.html#string-handling

Exercises using the Titanic dataset



In [16]:

    
df = pd.read_csv("../data/titanic.csv")



In [17]:

    
df.head()









    Out[17]:







  
    
      
      PassengerId
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
    
  
  
    
      0
      1
      0
      3
      Braund, Mr. Owen Harris
      male
      22.0
      1
      0
      A/5 21171
      7.2500
      NaN
      S
    
    
      1
      2
      1
      1
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      female
      38.0
      1
      0
      PC 17599
      71.2833
      C85
      C
    
    
      2
      3
      1
      3
      Heikkinen, Miss. Laina
      female
      26.0
      0
      0
      STON/O2. 3101282
      7.9250
      NaN
      S
    
    
      3
      4
      1
      1
      Futrelle, Mrs. Jacques Heath (Lily May Peel)
      female
      35.0
      1
      0
      113803
      53.1000
      C123
      S
    
    
      4
      5
      0
      3
      Allen, Mr. William Henry
      male
      35.0
      0
      0
      373450
      8.0500
      NaN
      S

EXERCISE:

Select all rows for male passengers and calculate the mean age of those passengers. Do the same for the female passengers.



In [18]:

    
males = df[df['Sex'] == 'male']



In [19]:

    
males['Age'].mean()









    Out[19]:





30.72664459161148



In [20]:

    
df[df['Sex'] == 'female']['Age'].mean()









    Out[20]:





27.915708812260537

We will later see an easier way to calculate both averages at the same time with groupby.

EXERCISE:

How many passengers older than 70 were on the Titanic?



In [21]:

    
len(df[df['Age'] > 70])









    Out[21]:





5



In [22]:

    
(df['Age'] > 70).sum()









    Out[22]:





5

EXERCISE:

Select the passengers that are between 30 and 40 years old?



In [23]:

    
df[(df['Age'] > 30) & (df['Age'] <= 40)]









    Out[23]:







  
    
      
      PassengerId
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
    
  
  
    
      1
      2
      1
      1
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      female
      38.0
      1
      0
      PC 17599
      71.2833
      C85
      C
    
    
      3
      4
      1
      1
      Futrelle, Mrs. Jacques Heath (Lily May Peel)
      female
      35.0
      1
      0
      113803
      53.1000
      C123
      S
    
    
      4
      5
      0
      3
      Allen, Mr. William Henry
      male
      35.0
      0
      0
      373450
      8.0500
      NaN
      S
    
    
      13
      14
      0
      3
      Andersson, Mr. Anders Johan
      male
      39.0
      1
      5
      347082
      31.2750
      NaN
      S
    
    
      18
      19
      0
      3
      Vander Planke, Mrs. Julius (Emelia Maria Vande...
      female
      31.0
      1
      0
      345763
      18.0000
      NaN
      S
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      867
      868
      0
      1
      Roebling, Mr. Washington Augustus II
      male
      31.0
      0
      0
      PC 17590
      50.4958
      A24
      S
    
    
      872
      873
      0
      1
      Carlsson, Mr. Frans Olof
      male
      33.0
      0
      0
      695
      5.0000
      B51 B53 B55
      S
    
    
      881
      882
      0
      3
      Markun, Mr. Johann
      male
      33.0
      0
      0
      349257
      7.8958
      NaN
      S
    
    
      885
      886
      0
      3
      Rice, Mrs. William (Margaret Norton)
      female
      39.0
      0
      5
      382652
      29.1250
      NaN
      Q
    
    
      890
      891
      0
      3
      Dooley, Mr. Patrick
      male
      32.0
      0
      0
      370376
      7.7500
      NaN
      Q
    
  

155 rows × 12 columns

EXERCISE:

Split the 'Name' column on the `,` extract the first part (the surname), and add this as new column 'Surname' .

Tip: try it first on a single string (and for this, check the `split` method of a string), and then try to 'apply' this on each row.



In [24]:

    
df['Surname'] = df['Name'].apply(lambda x: x.split(',')[0])

EXERCISE:

Select all passenger that have a surname starting with 'Williams'.



In [25]:

    
df[df['Surname'].str.startswith('Williams')]









    Out[25]:







  
    
      
      PassengerId
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
      Surname
    
  
  
    
      17
      18
      1
      2
      Williams, Mr. Charles Eugene
      male
      NaN
      0
      0
      244373
      13.0000
      NaN
      S
      Williams
    
    
      155
      156
      0
      1
      Williams, Mr. Charles Duane
      male
      51.0
      0
      1
      PC 17597
      61.3792
      NaN
      C
      Williams
    
    
      304
      305
      0
      3
      Williams, Mr. Howard Hugh "Harry"
      male
      NaN
      0
      0
      A/5 2466
      8.0500
      NaN
      S
      Williams
    
    
      351
      352
      0
      1
      Williams-Lambert, Mr. Fletcher Fellows
      male
      NaN
      0
      0
      113510
      35.0000
      C128
      S
      Williams-Lambert
    
    
      735
      736
      0
      3
      Williams, Mr. Leslie
      male
      28.5
      0
      0
      54636
      16.1000
      NaN
      S
      Williams

EXERCISE:

Select all rows for the passengers with a surname of more than 15 characters.



In [26]:

    
df[df['Surname'].str.len() > 15]









    Out[26]:







  
    
      
      PassengerId
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
      Surname
    
  
  
    
      307
      308
      1
      1
      Penasco y Castellana, Mrs. Victor de Satode (M...
      female
      17.0
      1
      0
      PC 17758
      108.9000
      C65
      C
      Penasco y Castellana
    
    
      351
      352
      0
      1
      Williams-Lambert, Mr. Fletcher Fellows
      male
      NaN
      0
      0
      113510
      35.0000
      C128
      S
      Williams-Lambert
    
    
      430
      431
      1
      1
      Bjornstrom-Steffansson, Mr. Mauritz Hakan
      male
      28.0
      0
      0
      110564
      26.5500
      C52
      S
      Bjornstrom-Steffansson
    
    
      444
      445
      1
      3
      Johannesen-Bratthammer, Mr. Bernt
      male
      NaN
      0
      0
      65306
      8.1125
      NaN
      S
      Johannesen-Bratthammer
    
    
      505
      506
      0
      1
      Penasco y Castellana, Mr. Victor de Satode
      male
      18.0
      1
      0
      PC 17758
      108.9000
      C65
      C
      Penasco y Castellana
    
    
      587
      588
      1
      1
      Frolicher-Stehli, Mr. Maxmillian
      male
      60.0
      1
      1
      13567
      79.2000
      B41
      C
      Frolicher-Stehli
    
    
      632
      633
      1
      1
      Stahelin-Maeglin, Dr. Max
      male
      32.0
      0
      0
      13214
      30.5000
      B50
      C
      Stahelin-Maeglin



In [ ]:

[OPTIONAL] more exercises

For the quick ones among you, here are some more exercises with some larger dataframe with film data. These exercises are based on the PyCon tutorial of Brandon Rhodes (so all credit to him!) and the datasets he prepared for that. You can download these data from here: titles.csv and cast.csv and put them in the /data folder.



In [27]:

    
cast = pd.read_csv('../data/cast.csv')
cast.head()









    Out[27]:







  
    
      
      title
      year
      name
      type
      character
      n
    
  
  
    
      0
      Suuri illusioni
      1985
      Homo $
      actor
      Guests
      22.0
    
    
      1
      Gangsta Rap: The Glockumentary
      2007
      Too $hort
      actor
      Himself
      NaN
    
    
      2
      Menace II Society
      1993
      Too $hort
      actor
      Lew-Loc
      27.0
    
    
      3
      Porndogs: The Adventures of Sadie
      2009
      Too $hort
      actor
      Bosco
      3.0
    
    
      4
      Stop Pepper Palmer
      2014
      Too $hort
      actor
      Himself
      NaN



In [28]:

    
titles = pd.read_csv('../data/titles.csv')
titles.head()









    Out[28]:







  
    
      
      title
      year
    
  
  
    
      0
      The Rising Son
      1990
    
    
      1
      Ashes of Kukulcan
      2016
    
    
      2
      The Thousand Plane Raid
      1969
    
    
      3
      Crucea de piatra
      1993
    
    
      4
      The 86
      2015

EXERCISE:

How many movies are listed in the titles dataframe?



In [29]:

    
len(titles)









    Out[29]:





215981

EXERCISE:

What are the earliest two films listed in the titles dataframe?



In [30]:

    
titles.sort_values('year').head(2)









    Out[30]:







  
    
      
      title
      year
    
  
  
    
      165182
      Miss Jerry
      1894
    
    
      85708
      Reproduction of the Corbett and Fitzsimmons Fight
      1897

EXERCISE:

How many movies have the title "Hamlet"?



In [31]:

    
len(titles[titles['title'] == 'Hamlet'])









    Out[31]:





19

EXERCISE:

List all of the "Treasure Island" movies from earliest to most recent.



In [32]:

    
titles[titles.title == 'Treasure Island'].sort_values('year')









    Out[32]:







  
    
      
      title
      year
    
  
  
    
      191379
      Treasure Island
      1918
    
    
      47769
      Treasure Island
      1920
    
    
      192917
      Treasure Island
      1934
    
    
      90175
      Treasure Island
      1950
    
    
      104714
      Treasure Island
      1972
    
    
      103646
      Treasure Island
      1973
    
    
      190792
      Treasure Island
      1985
    
    
      166675
      Treasure Island
      1999

EXERCISE:

How many movies were made from 1950 through 1959?



In [33]:

    
len(titles[(titles['year'] >= 1950) & (titles['year'] <= 1959)])









    Out[33]:





12120



In [34]:

    
len(titles[titles['year'] // 10 == 195])









    Out[34]:





12120

EXERCISE:

How many roles in the movie "Inception" are NOT ranked by an "n" value?



In [35]:

    
inception = cast[cast['title'] == 'Inception']



In [36]:

    
len(inception[inception['n'].isnull()])









    Out[36]:





22



In [37]:

    
inception['n'].isnull().sum()









    Out[37]:





22

EXERCISE:

But how many roles in the movie "Inception" did receive an "n" value?



In [38]:

    
len(inception[inception['n'].notnull()])









    Out[38]:





51

EXERCISE:

Display the cast of the "Titanic" (the most famous 1997 one) in their correct "n"-value order, ignoring roles that did not earn a numeric "n" value.



In [39]:

    
titanic = cast[(cast['title'] == 'Titanic') & (cast['year'] == 1997)]
titanic = titanic[titanic['n'].notnull()]
titanic.sort_values('n')









    Out[39]:







  
    
      
      title
      year
      name
      type
      character
      n
    
  
  
    
      527123
      Titanic
      1997
      Leonardo DiCaprio
      actor
      Jack Dawson
      1.0
    
    
      3304287
      Titanic
      1997
      Kate Winslet
      actress
      Rose DeWitt Bukater
      2.0
    
    
      2241466
      Titanic
      1997
      Billy Zane
      actor
      Caledon 'Cal' Hockley
      3.0
    
    
      2323120
      Titanic
      1997
      Kathy Bates
      actress
      Molly Brown
      4.0
    
    
      2567572
      Titanic
      1997
      Frances Fisher
      actress
      Ruth Dewitt Bukater
      5.0
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      2678999
      Titanic
      1997
      Andie Hicks
      actress
      Dancer
      113.0
    
    
      898633
      Titanic
      1997
      Scott Hislop
      actor
      Dancer
      114.0
    
    
      1315604
      Titanic
      1997
      Stan Mazin
      actor
      Dancer
      115.0
    
    
      3054629
      Titanic
      1997
      Lisa Ratzin
      actress
      Dancer
      116.0
    
    
      3062849
      Titanic
      1997
      Julene Renee
      actress
      Dancer
      117.0
    
  

112 rows × 6 columns

EXERCISE:

List the supporting roles (having n=2) played by Brad Pitt in the 1990s, in order by year.



In [40]:

    
brad = cast[cast['name'] == 'Brad Pitt']
brad = brad[brad['year'] // 10 == 199]
brad = brad[brad['n'] == 2]
brad.sort_values('year')









    Out[40]:







  
    
      
      title
      year
      name
      type
      character
      n
    
  
  
    
      1614615
      Across the Tracks
      1990
      Brad Pitt
      actor
      Joe Maloney
      2.0
    
    
      1614613
      A River Runs Through It
      1992
      Brad Pitt
      actor
      Paul Maclean
      2.0
    
    
      1614657
      The Devil's Own
      1997
      Brad Pitt
      actor
      Rory Devaney
      2.0
    
    
      1614658
      The Devil's Own
      1997
      Brad Pitt
      actor
      Francis Austin McGuire
      2.0
    
    
      1614623
      Fight Club
      1999
      Brad Pitt
      actor
      Tyler Durden
      2.0

Acknowledgement

The optional exercises are based on the PyCon tutorial of Brandon Rhodes (so all credit to him!) and the datasets he prepared for that.

	country	population	area	capital
0	Belgium	11.3	30510	Brussels
1	France	64.3	671308	Paris
2	Germany	81.3	357050	Berlin
3	Netherlands	16.9	41526	Amsterdam
4	United Kingdom	64.9	244820	London

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

	PassengerId	Survived	Pclass	Name	Sex	Age	Parch	Ticket	Fare	Cabin	Embarked	Surname
17	18	1	2	Williams, Mr. Charles Eugene	male	NaN	0	244373	13.0000	NaN	S	Williams
155	156	0	1	Williams, Mr. Charles Duane	male	51.0	1	PC 17597	61.3792	NaN	C	Williams
304	305	0	3	Williams, Mr. Howard Hugh "Harry"	male	NaN	0	A/5 2466	8.0500	NaN	S	Williams
351	352	0	1	Williams-Lambert, Mr. Fletcher Fellows	male	NaN	0	113510	35.0000	C128	S	Williams-Lambert
735	736	0	3	Williams, Mr. Leslie	male	28.5	0	54636	16.1000	NaN	S	Williams

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Surname
307	308	1	1	Penasco y Castellana, Mrs. Victor de Satode (M...	female	17.0	1	0	PC 17758	108.9000	C65	C	Penasco y Castellana
351	352	0	1	Williams-Lambert, Mr. Fletcher Fellows	male	NaN	0	0	113510	35.0000	C128	S	Williams-Lambert
430	431	1	1	Bjornstrom-Steffansson, Mr. Mauritz Hakan	male	28.0	0	0	110564	26.5500	C52	S	Bjornstrom-Steffansson
444	445	1	3	Johannesen-Bratthammer, Mr. Bernt	male	NaN	0	0	65306	8.1125	NaN	S	Johannesen-Bratthammer
505	506	0	1	Penasco y Castellana, Mr. Victor de Satode	male	18.0	1	0	PC 17758	108.9000	C65	C	Penasco y Castellana
587	588	1	1	Frolicher-Stehli, Mr. Maxmillian	male	60.0	1	1	13567	79.2000	B41	C	Frolicher-Stehli
632	633	1	1	Stahelin-Maeglin, Dr. Max	male	32.0	0	0	13214	30.5000	B50	C	Stahelin-Maeglin

	title	year	name	type	character	n
0	Suuri illusioni	1985	Homo $	actor	Guests	22.0
1	Gangsta Rap: The Glockumentary	2007	Too $hort	actor	Himself	NaN
2	Menace II Society	1993	Too $hort	actor	Lew-Loc	27.0
3	Porndogs: The Adventures of Sadie	2009	Too $hort	actor	Bosco	3.0
4	Stop Pepper Palmer	2014	Too $hort	actor	Himself	NaN

	title	year
0	The Rising Son	1990
1	Ashes of Kukulcan	2016
2	The Thousand Plane Raid	1969
3	Crucea de piatra	1993
4	The 86	2015

	title	year
165182	Miss Jerry	1894
85708	Reproduction of the Corbett and Fitzsimmons Fight	1897

	title	year
191379	Treasure Island	1918
47769	Treasure Island	1920
192917	Treasure Island	1934
90175	Treasure Island	1950
104714	Treasure Island	1972
103646	Treasure Island	1973
190792	Treasure Island	1985
166675	Treasure Island	1999

	title	year	name	type	character	n
527123	Titanic	1997	Leonardo DiCaprio	actor	Jack Dawson	1.0
3304287	Titanic	1997	Kate Winslet	actress	Rose DeWitt Bukater	2.0
2241466	Titanic	1997	Billy Zane	actor	Caledon 'Cal' Hockley	3.0
2323120	Titanic	1997	Kathy Bates	actress	Molly Brown	4.0
2567572	Titanic	1997	Frances Fisher	actress	Ruth Dewitt Bukater	5.0
...	...	...	...	...	...	...
2678999	Titanic	1997	Andie Hicks	actress	Dancer	113.0
898633	Titanic	1997	Scott Hislop	actor	Dancer	114.0
1315604	Titanic	1997	Stan Mazin	actor	Dancer	115.0
3054629	Titanic	1997	Lisa Ratzin	actress	Dancer	116.0
3062849	Titanic	1997	Julene Renee	actress	Dancer	117.0

	title	year	name	type	character	n
1614615	Across the Tracks	1990	Brad Pitt	actor	Joe Maloney	2.0
1614613	A River Runs Through It	1992	Brad Pitt	actor	Paul Maclean	2.0
1614657	The Devil's Own	1997	Brad Pitt	actor	Rory Devaney	2.0
1614658	The Devil's Own	1997	Brad Pitt	actor	Francis Austin McGuire	2.0
1614623	Fight Club	1999	Brad Pitt	actor	Tyler Durden	2.0

Subsetting data

Subset variables (columns)

Subset observations (rows)

Slicing

Boolean indexing (filtering)

Some other useful methods: isin and string methods

Exercises using the Titanic dataset

[OPTIONAL] more exercises

Acknowledgement

Some other useful methods: `isin` and `string` methods