In [1]:
# conventional way to import pandas
import pandas as pd
# check the pandas version
print('pandas version:', pd.__version__)
In [2]:
# read a dataset of Chipotle orders into a DataFrame
orders = pd.read_table('data/chipotle.tsv')
In [3]:
# examine the first 5 rows
orders.head()
Out[3]:
Documentation for read_table
In [4]:
# read a dataset of movie reviewers (the default parameters don't parse this file correctly)
users = pd.read_table('data/u.user')
# examine the first 5 rows
users.head()
Out[4]:
In [5]:
users = pd.read_table('data/u.user', sep='|')
# examine the first 5 rows
users.head()
Out[5]:
In [6]:
users = pd.read_table('data/u.user', sep='|', header=None)
# examine the first 5 rows
users.head()
Out[6]:
In [7]:
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
users = pd.read_table('data/u.user', sep='|', header=None, names=user_cols)
# examine the first 5 rows
users.head()
Out[7]:
In [7]:
# read a dataset of UFO reports into a DataFrame
ufo = pd.read_table('data/ufo.csv', sep=',')
In [8]:
# read_csv is equivalent to read_table, except it assumes a comma separator
ufo = pd.read_csv('data/ufo.csv')
type(ufo)
Out[8]:
In [9]:
# examine the first 5 rows
ufo.head()
Out[9]:
In [10]:
# select the 'City' Series using bracket notation
ufo['City']
Out[10]:
In [16]:
type(ufo['City'])
Out[16]:
In [17]:
# or equivalently, use dot notation - see notes below
ufo.City
Out[17]:
Bracket notation will always work, whereas dot notation has limitations:
- dot notation doesn't work if there are spaces in the Series name
- dot notation doesn't work if the Series has the same name as a DataFrame method or attribute (like 'head' or 'shape')
- dot notation can't be used to define the name of a new Series (see below)
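A minimal sketch of the first two limitations (the small DataFrame here is hypothetical, not part of the dataset):
# hypothetical example: a column name with a space, and one that collides with a DataFrame attribute
df = pd.DataFrame({'my col': [1, 2], 'shape': [3, 4]})
df['my col']   # bracket notation works for any column name
df['shape']    # bracket notation selects the 'shape' column
df.shape       # dot notation returns the (rows, columns) tuple, NOT the 'shape' column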
In [10]:
# create a new 'Location' Series (must use bracket notation to define the Series name)
ufo['Location'] = ufo.City + ', ' + ufo.State
ufo.head()
Out[10]:
In [11]:
# read a dataset of top-rated IMDb movies into a DataFrame
movies = pd.read_csv('data/imdb_1000.csv')
Methods end with parentheses, while attributes don't:
In [12]:
# example method: show the first 5 rows
movies.head()
Out[12]:
In [13]:
# example method: calculate summary statistics
movies.describe()
Out[13]:
In [14]:
# example attribute: number of rows and columns
movies.shape
Out[14]:
In [15]:
# example attribute: data type of each column
movies.dtypes
Out[15]:
In [16]:
# use an optional parameter to the describe method to summarize only 'object' columns
movies.describe(include=['object'])
Out[16]:
Documentation for describe
In [11]:
# read a dataset of UFO reports into a DataFrame
ufo = pd.read_csv('data/ufo.csv')
In [12]:
# examine the column names
ufo.columns
Out[12]:
In [13]:
# rename two of the columns by using the 'rename' method
ufo.rename(columns={'Colors Reported':'Colors_Reported', 'Shape Reported':'Shape_Reported'}, inplace=True)
ufo.columns
Out[13]:
Documentation for rename
In [22]:
# replace all of the column names by overwriting the 'columns' attribute
ufo_cols = ['city', 'colors reported', 'shape reported', 'state', 'time']
ufo.columns = ufo_cols
ufo.columns
Out[22]:
In [23]:
# replace the column names during the file reading process by using the 'names' parameter
ufo = pd.read_csv('data/ufo.csv', header=0, names=ufo_cols)
ufo.head()
Out[23]:
Documentation for read_csv
In [24]:
# replace all spaces with underscores in the column names by using the 'str.replace' method
ufo.columns = ufo.columns.str.replace(' ', '_')
ufo.columns
Out[24]:
Documentation for str.replace
In [25]:
# read a dataset of UFO reports into a DataFrame
ufo = pd.read_csv('data/ufo.csv')
ufo.head()
Out[25]:
In [24]:
# remove a single column (axis=1 refers to columns)
ufo.drop('Colors Reported', axis=1, inplace=True)
ufo.head()
Out[24]:
Documentation for drop
In [25]:
# remove multiple columns at once
ufo.drop(['City', 'State'], axis=1, inplace=True)
ufo.head()
Out[25]:
In [26]:
# remove multiple rows at once (axis=0 refers to rows)
ufo.drop([0, 1], axis=0, inplace=True)
ufo.head()
Out[26]:
In [28]:
# read a dataset of top-rated IMDb movies into a DataFrame
movies = pd.read_csv('data/imdb_1000.csv')
movies.head()
Out[28]:
Note: None of the sorting methods below affect the underlying data. (In other words, the sorting is temporary).
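If you did want a sort to persist, assign the result back or pass inplace=True; a minimal sketch:
# persist the sorted order by assignment
movies = movies.sort_values('title')
# or equivalently, sort the DataFrame in place
movies.sort_values('title', inplace=True)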
In [29]:
# sort the 'title' Series in ascending order (returns a Series)
movies.title.sort_values()
Out[29]:
In [29]:
# sort in descending order instead
movies.title.sort_values(ascending=False).head()
Out[29]:
Documentation for sort_values for a Series. (Prior to version 0.17, use order instead.)
In [31]:
# sort the entire DataFrame by the 'title' Series (returns a DataFrame)
movies.sort_values('title').head()
Out[31]:
In [33]:
# sort in descending order instead
movies.sort_values('title', ascending=False).head()
Out[33]:
Documentation for sort_values for a DataFrame. (Prior to version 0.17, use sort instead.)
In [32]:
# sort the DataFrame first by 'content_rating', then by 'duration'
movies.sort_values(['content_rating', 'duration']).head()
Out[32]:
Summary of changes to the sorting API in pandas 0.17
In [35]:
# read a dataset of top-rated IMDb movies into a DataFrame
movies = pd.read_csv('data/imdb_1000.csv')
movies.head()
Out[35]:
In [34]:
# examine the number of rows and columns
movies.shape
Out[34]:
Goal: Filter the DataFrame rows to only show movies with a 'duration' of at least 200 minutes.
In [35]:
# create a list in which each element refers to a DataFrame row: True if the row satisfies the condition, False otherwise
booleans = []
for length in movies.duration:
    if length >= 200:
        booleans.append(True)
    else:
        booleans.append(False)
In [36]:
# confirm that the list has the same length as the DataFrame
len(booleans)
Out[36]:
In [37]:
# examine the first five list elements
booleans[0:5]
Out[37]:
In [38]:
# convert the list to a Series
is_long = pd.Series(booleans)
is_long.head()
Out[38]:
In [39]:
# use bracket notation with the boolean Series to tell the DataFrame which rows to display
movies[is_long]
Out[39]:
In [40]:
# simplify the steps above: no need to write a for loop to create 'is_long' since pandas will broadcast the comparison
is_long = movies.duration >= 200
movies[is_long]
# or equivalently, write it in one line (no need to create the 'is_long' object)
movies[movies.duration >= 200]
Out[40]:
In [41]:
# select the 'genre' Series from the filtered DataFrame
is_long = movies.duration >= 200
movies[is_long].genre
# or equivalently, use the 'loc' method
movies.loc[movies.duration >= 200, 'genre']
Out[41]:
Documentation for loc
In [36]:
# read a dataset of top-rated IMDb movies into a DataFrame
movies = pd.read_csv('data/imdb_1000.csv')
movies.head()
Out[36]:
In [37]:
# filter the DataFrame to only show movies with a 'duration' of at least 200 minutes
movies[movies.duration >= 200]
Out[37]:
Understanding logical operators:
- and: True only if both sides of the operator are True
- or: True if either side of the operator is True
In [44]:
# demonstration of the 'and' operator
print(True and True)
print(True and False)
print(False and False)
In [45]:
# demonstration of the 'or' operator
print(True or True)
print(True or False)
print(False or False)
Rules for specifying multiple filter criteria in pandas:
- use & instead of and
- use | instead of or
- surround each condition with parentheses to control the order of evaluation
Goal: Further filter the DataFrame of long movies (duration >= 200) to only show movies which also have a 'genre' of 'Drama'
In [46]:
# CORRECT: use the '&' operator to specify that both conditions are required
movies[(movies.duration >= 200) & (movies.genre == 'Drama')]
Out[46]:
In [47]:
# INCORRECT: using the '|' operator would have shown movies that are either long or dramas (or both)
movies[(movies.duration >= 200) | (movies.genre == 'Drama')].head()
Out[47]:
Goal: Filter the original DataFrame to show movies with a 'genre' of 'Crime' or 'Drama' or 'Action'
In [44]:
# use the '|' operator to specify that a row can match any of the three criteria
movies[(movies.genre == 'Crime') | (movies.genre == 'Drama') | (movies.genre == 'Action')].tail(20)
# or equivalently, use the 'isin' method
# movies[movies.genre.isin(['Crime', 'Drama', 'Action'])].tail(20)
Out[44]:
Documentation for isin
Question: When reading from a file, how do I read in only a subset of the columns?
In [49]:
# read a dataset of UFO reports into a DataFrame, and check the columns
ufo = pd.read_csv('data/ufo.csv')
ufo.columns
Out[49]:
In [50]:
# specify which columns to include by name
ufo = pd.read_csv('data/ufo.csv', usecols=['City', 'State'])
# or equivalently, specify columns by position
ufo = pd.read_csv('data/ufo.csv', usecols=[0, 4])
ufo.columns
Out[50]:
Question: When reading from a file, how do I read in only a subset of the rows?
In [51]:
# specify how many rows to read
ufo = pd.read_csv('data/ufo.csv', nrows=3)
ufo
Out[51]:
Documentation for read_csv
Question: How do I iterate through a Series?
In [52]:
# Series are directly iterable (like a list)
for c in ufo.City:
    print(c)
Question: How do I iterate through a DataFrame?
In [53]:
# various methods are available to iterate through a DataFrame
for index, row in ufo.iterrows():
    print(index, row.City, row.State)
Documentation for iterrows
Question: How do I drop all non-numeric columns from a DataFrame?
In [45]:
# read a dataset of alcohol consumption into a DataFrame, and check the data types
drinks = pd.read_csv('data/drinks.csv')
drinks.dtypes
Out[45]:
In [55]:
# only include numeric columns in the DataFrame
import numpy as np
drinks.select_dtypes(include=[np.number]).dtypes
Out[55]:
Documentation for select_dtypes
Question: How do I know whether I should pass an argument as a string or a list?
In [56]:
# describe all of the numeric columns
drinks.describe()
Out[56]:
In [57]:
# pass the string 'all' to describe all columns
drinks.describe(include='all')
Out[57]:
In [58]:
# pass a list of data types to only describe certain types
drinks.describe(include=['object', 'float64'])
Out[58]:
In [59]:
# pass a list even if you only want to describe a single data type
drinks.describe(include=['object'])
Out[59]:
Documentation for describe
In [60]:
# read a dataset of alcohol consumption into a DataFrame
drinks = pd.read_csv('data/drinks.csv')
drinks.head()
Out[60]:
In [61]:
# drop a column (temporarily)
drinks.drop('continent', axis=1).head()
Out[61]:
Documentation for drop
In [62]:
# drop a row (temporarily)
drinks.drop(2, axis=0).head()
Out[62]:
When referring to rows or columns with the axis parameter:
- axis 0 refers to rows
- axis 1 refers to columns
In [46]:
# calculate the mean of each numeric column
drinks.mean()
# or equivalently, specify the axis explicitly
drinks.mean(axis=0)
Out[46]:
Documentation for mean
In [64]:
# calculate the mean of each row
drinks.mean(axis=1).head()
Out[64]:
When performing a mathematical operation with the axis parameter:
- axis 0 means the operation should "move down" the row axis
- axis 1 means the operation should "move across" the column axis
In [65]:
# 'index' is an alias for axis 0
drinks.mean(axis='index')
Out[65]:
In [66]:
# 'columns' is an alias for axis 1
drinks.mean(axis='columns').head()
Out[66]:
In [67]:
# read a dataset of Chipotle orders into a DataFrame
orders = pd.read_table('data/chipotle.tsv')
orders.head()
Out[67]:
In [68]:
# normal way to access string methods in Python
'hello'.upper()
Out[68]:
In [69]:
# string methods for pandas Series are accessed via 'str'
orders.item_name.str.upper().head()
Out[69]:
In [70]:
# string method 'contains' checks for a substring and returns a boolean Series
orders.item_name.str.contains('Chicken').head()
Out[70]:
In [71]:
# use the boolean Series to filter the DataFrame
orders[orders.item_name.str.contains('Chicken')].head()
Out[71]:
In [72]:
# string methods can be chained together
orders.choice_description.str.replace('[', '').str.replace(']', '').head()
Out[72]:
In [73]:
# many pandas string methods support regular expressions (regex)
orders.choice_description.str.replace(r'[\[\]]', '').head()
Out[73]:
String handling section of the pandas API reference
In [74]:
# read a dataset of alcohol consumption into a DataFrame
drinks = pd.read_csv('data/drinks.csv')
drinks.head()
Out[74]:
In [75]:
# examine the data type of each Series
drinks.dtypes
Out[75]:
In [76]:
# change the data type of an existing Series
drinks['beer_servings'] = drinks.beer_servings.astype(float)
drinks.dtypes
Out[76]:
Documentation for astype
In [77]:
# alternatively, change the data type of a Series while reading in a file
drinks = pd.read_csv('data/drinks.csv', dtype={'beer_servings':float})
drinks.dtypes
Out[77]:
In [78]:
# read a dataset of Chipotle orders into a DataFrame
orders = pd.read_table('data/chipotle.tsv')
orders.head()
Out[78]:
In [79]:
# examine the data type of each Series
orders.dtypes
Out[79]:
In [80]:
# convert a string to a number in order to do math
orders.item_price.str.replace('$', '').astype(float).mean()
Out[80]:
In [81]:
# string method 'contains' checks for a substring and returns a boolean Series
orders.item_name.str.contains('Chicken').head()
Out[81]:
In [82]:
# convert a boolean Series to an integer (False = 0, True = 1)
orders.item_name.str.contains('Chicken').astype(int).head()
Out[82]:
In [83]:
# read a dataset of alcohol consumption into a DataFrame
drinks = pd.read_csv('data/drinks.csv')
drinks.head()
Out[83]:
In [84]:
# calculate the mean beer servings across the entire dataset
drinks.beer_servings.mean()
Out[84]:
In [85]:
# calculate the mean beer servings just for countries in Africa
drinks[drinks.continent=='Africa'].beer_servings.mean()
Out[85]:
In [86]:
# calculate the mean beer servings for each continent
drinks.groupby('continent').beer_servings.mean()
Out[86]:
Documentation for groupby
In [87]:
# other aggregation functions (such as 'max') can also be used with groupby
drinks.groupby('continent').beer_servings.max()
Out[87]:
In [88]:
# multiple aggregation functions can be applied simultaneously
drinks.groupby('continent').beer_servings.agg(['count', 'mean', 'min', 'max'])
Out[88]:
Documentation for agg
In [89]:
# specifying a column to which the aggregation function should be applied is not required
drinks.groupby('continent').mean()
Out[89]:
In [90]:
# allow plots to appear in the notebook
%matplotlib inline
In [91]:
# side-by-side bar plot of the DataFrame directly above
drinks.groupby('continent').mean().plot(kind='bar')
Out[91]:
Documentation for plot
In [92]:
# read a dataset of top-rated IMDb movies into a DataFrame
movies = pd.read_csv('data/imdb_1000.csv')
movies.head()
Out[92]:
In [93]:
# examine the data type of each Series
movies.dtypes
Out[93]:
Exploring a non-numeric Series:
In [94]:
# count the non-null values, unique values, and frequency of the most common value
movies.genre.describe()
Out[94]:
Documentation for describe
In [95]:
# count how many times each value in the Series occurs
movies.genre.value_counts()
Out[95]:
Documentation for value_counts
In [96]:
# display percentages instead of raw counts
movies.genre.value_counts(normalize=True)
Out[96]:
In [97]:
# 'value_counts' (like many pandas methods) outputs a Series
type(movies.genre.value_counts())
Out[97]:
In [98]:
# thus, you can add another Series method on the end
movies.genre.value_counts().head()
Out[98]:
In [99]:
# display the unique values in the Series
movies.genre.unique()
Out[99]:
In [100]:
# count the number of unique values in the Series
movies.genre.nunique()
Out[100]:
In [101]:
# compute a cross-tabulation of two Series
pd.crosstab(movies.genre, movies.content_rating)
Out[101]:
Documentation for crosstab
Exploring a numeric Series:
In [102]:
# calculate various summary statistics
movies.duration.describe()
Out[102]:
In [103]:
# many statistics are implemented as Series methods
movies.duration.mean()
Out[103]:
Documentation for mean
In [104]:
# 'value_counts' is primarily useful for categorical data, not numerical data
movies.duration.value_counts().head()
Out[104]:
In [105]:
# allow plots to appear in the notebook
%matplotlib inline
In [106]:
# histogram of the 'duration' Series (shows the distribution of a numerical variable)
movies.duration.plot(kind='hist')
Out[106]:
In [107]:
# bar plot of the 'value_counts' for the 'genre' Series
movies.genre.value_counts().plot(kind='bar')
Out[107]:
Documentation for plot
In [108]:
# read a dataset of UFO reports into a DataFrame
ufo = pd.read_csv('data/ufo.csv')
ufo.tail()
Out[108]:
What does "NaN" mean?
numpy.nan
.read_csv
detects missing values (by default) when reading the file, and replaces them with this special value.Documentation for read_csv
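A related property of numpy.nan worth knowing (a quick sketch): NaN compares unequal to everything, including itself, which is why dedicated methods like isnull exist.
import numpy as np
np.nan == np.nan   # False: NaN is not equal to anything, even itself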
In [109]:
# 'isnull' returns a DataFrame of booleans (True if missing, False if not missing)
ufo.isnull().tail()
Out[109]:
In [110]:
# 'notnull' returns the opposite of 'isnull' (True if not missing, False if missing)
ufo.notnull().tail()
Out[110]:
In [111]:
# count the number of missing values in each Series
ufo.isnull().sum()
Out[111]:
This calculation works because:
- the sum method for a DataFrame operates on axis=0 by default (and thus produces column sums)
- in order to add boolean values, pandas converts True to 1 and False to 0
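A quick sketch of the second point:
# pandas converts booleans to integers when summing (True=1, False=0)
pd.Series([True, False, True]).sum()   # returns 2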
In [112]:
# use the 'isnull' Series method to filter the DataFrame rows
ufo[ufo.City.isnull()].head()
Out[112]:
How to handle missing values depends on the dataset as well as the nature of your analysis. Here are some options:
In [113]:
# examine the number of rows and columns
ufo.shape
Out[113]:
In [114]:
# if 'any' values are missing in a row, then drop that row
ufo.dropna(how='any').shape
Out[114]:
Documentation for dropna
In [115]:
# 'inplace' parameter for 'dropna' is False by default, thus rows were only dropped temporarily
ufo.shape
Out[115]:
In [116]:
# if 'all' values are missing in a row, then drop that row (none are dropped in this case)
ufo.dropna(how='all').shape
Out[116]:
In [117]:
# if 'any' values are missing in a row (considering only 'City' and 'Shape Reported'), then drop that row
ufo.dropna(subset=['City', 'Shape Reported'], how='any').shape
Out[117]:
In [118]:
# if 'all' values are missing in a row (considering only 'City' and 'Shape Reported'), then drop that row
ufo.dropna(subset=['City', 'Shape Reported'], how='all').shape
Out[118]:
In [119]:
# 'value_counts' does not include missing values by default
ufo['Shape Reported'].value_counts().head()
Out[119]:
In [120]:
# explicitly include missing values
ufo['Shape Reported'].value_counts(dropna=False).head()
Out[120]:
Documentation for value_counts
In [121]:
# fill in missing values with a specified value
ufo['Shape Reported'].fillna(value='VARIOUS', inplace=True)
Documentation for fillna
In [122]:
# confirm that the missing values were filled in
ufo['Shape Reported'].value_counts().head()
Out[122]:
In [123]:
# read a dataset of alcohol consumption into a DataFrame
drinks = pd.read_csv('data/drinks.csv')
drinks.head()
Out[123]:
In [124]:
# every DataFrame has an index (sometimes called the "row labels")
drinks.index
Out[124]:
In [125]:
# column names are also stored in a special "index" object
drinks.columns
Out[125]:
In [126]:
# neither the index nor the columns are included in the shape
drinks.shape
Out[126]:
In [127]:
# index and columns both default to integers if you don't define them
pd.read_table('data/u.user', header=None, sep='|').head()
Out[127]:
What is the index used for?
In [128]:
# identification: index remains with each row when filtering the DataFrame
drinks[drinks.continent=='South America']
Out[128]:
In [129]:
# selection: select a portion of the DataFrame using the index
drinks.loc[23, 'beer_servings']
Out[129]:
Documentation for loc
In [130]:
# set an existing column as the index
drinks.set_index('country', inplace=True)
drinks.head()
Out[130]:
Documentation for set_index
In [131]:
# 'country' is now the index
drinks.index
Out[131]:
In [132]:
# 'country' is no longer a column
drinks.columns
Out[132]:
In [133]:
# 'country' data is no longer part of the DataFrame contents
drinks.shape
Out[133]:
In [134]:
# country name can now be used for selection
drinks.loc['Brazil', 'beer_servings']
Out[134]:
In [135]:
# index name is optional
drinks.index.name = None
drinks.head()
Out[135]:
In [136]:
# restore the index name, and move the index back to a column
drinks.index.name = 'country'
drinks.reset_index(inplace=True)
drinks.head()
Out[136]:
Documentation for reset_index
In [137]:
# many DataFrame methods output a DataFrame
drinks.describe()
Out[137]:
In [138]:
# you can interact with any DataFrame using its index and columns
drinks.describe().loc['25%', 'beer_servings']
Out[138]:
In [139]:
# read a dataset of alcohol consumption into a DataFrame
drinks = pd.read_csv('data/drinks.csv')
drinks.head()
Out[139]:
In [140]:
# every DataFrame has an index
drinks.index
Out[140]:
In [141]:
# every Series also has an index (which carries over from the DataFrame)
drinks.continent.head()
Out[141]:
In [142]:
# set 'country' as the index
drinks.set_index('country', inplace=True)
Documentation for set_index
In [143]:
# Series index is on the left, values are on the right
drinks.continent.head()
Out[143]:
In [144]:
# another example of a Series (output from the 'value_counts' method)
drinks.continent.value_counts()
Out[144]:
Documentation for value_counts
In [145]:
# access the Series index
drinks.continent.value_counts().index
Out[145]:
In [146]:
# access the Series values
drinks.continent.value_counts().values
Out[146]:
In [147]:
# elements in a Series can be selected by index (using bracket notation)
drinks.continent.value_counts()['Africa']
Out[147]:
In [148]:
# any Series can be sorted by its values
drinks.continent.value_counts().sort_values()
Out[148]:
In [149]:
# any Series can also be sorted by its index
drinks.continent.value_counts().sort_index()
Out[149]:
Documentation for sort_values
and sort_index
What is the index used for?
In [150]:
# 'beer_servings' Series contains the average annual beer servings per person
drinks.beer_servings.head()
Out[150]:
In [151]:
# create a Series containing the population of two countries
people = pd.Series([3000000, 85000], index=['Albania', 'Andorra'], name='population')
people
Out[151]:
Documentation for Series
In [152]:
# calculate the total annual beer servings for each country
(drinks.beer_servings * people).head()
Out[152]:
In [153]:
# concatenate the 'drinks' DataFrame with the 'population' Series (aligns by the index)
pd.concat([drinks, people], axis=1).head()
Out[153]:
In [154]:
# read a dataset of UFO reports into a DataFrame
ufo = pd.read_csv('data/ufo.csv')
ufo.head(3)
Out[154]:
The loc method is used to select rows and columns by label. You can pass it:
- a single label
- a list of labels
- a slice of labels
- a boolean Series
- a colon (which indicates "all labels")
In [155]:
# row 0, all columns
ufo.loc[0, :]
Out[155]:
In [156]:
# rows 0 and 1 and 2, all columns
ufo.loc[[0, 1, 2], :]
Out[156]:
In [157]:
# rows 0 through 2 (inclusive), all columns
ufo.loc[0:2, :]
Out[157]:
In [158]:
# this implies "all columns", but explicitly stating "all columns" is better
ufo.loc[0:2]
Out[158]:
In [159]:
# rows 0 through 2 (inclusive), column 'City'
ufo.loc[0:2, 'City']
Out[159]:
In [160]:
# rows 0 through 2 (inclusive), columns 'City' and 'State'
ufo.loc[0:2, ['City', 'State']]
Out[160]:
In [161]:
# accomplish the same thing using double brackets - but using 'loc' is preferred since it's more explicit
ufo[['City', 'State']].head(3)
Out[161]:
In [162]:
# rows 0 through 2 (inclusive), columns 'City' through 'State' (inclusive)
ufo.loc[0:2, 'City':'State']
Out[162]:
In [163]:
# accomplish the same thing using 'head' and 'drop'
ufo.head(3).drop('Time', axis=1)
Out[163]:
In [164]:
# rows in which the 'City' is 'Oakland', column 'State'
ufo.loc[ufo.City=='Oakland', 'State']
Out[164]:
In [165]:
# accomplish the same thing using "chained indexing" - but using 'loc' is preferred since chained indexing can cause problems
ufo[ufo.City=='Oakland'].State
Out[165]:
The iloc method is used to select rows and columns by integer position. You can pass it:
- a single integer position
- a list of integer positions
- a slice of integer positions
- a colon (which indicates "all integer positions")
In [166]:
# rows in positions 0 and 1, columns in positions 0 and 3
ufo.iloc[[0, 1], [0, 3]]
Out[166]:
In [167]:
# rows in positions 0 through 2 (exclusive), columns in positions 0 through 4 (exclusive)
ufo.iloc[0:2, 0:4]
Out[167]:
In [168]:
# rows in positions 0 through 2 (exclusive), all columns
ufo.iloc[0:2, :]
Out[168]:
In [169]:
# accomplish the same thing - but using 'iloc' is preferred since it's more explicit
ufo[0:2]
Out[169]:
The ix method is used to select rows and columns by label or integer position, and should only be used when you need to mix label-based and integer-based selection in the same call.
In [170]:
# read a dataset of alcohol consumption into a DataFrame and set 'country' as the index
drinks = pd.read_csv('data/drinks.csv', index_col='country')
drinks.head()
Out[170]:
In [171]:
# row with label 'Albania', column in position 0
drinks.ix['Albania', 0]
Out[171]:
In [172]:
# row in position 1, column with label 'beer_servings'
drinks.ix[1, 'beer_servings']
Out[172]:
Rules for using numbers with ix:
- if the index is made up of strings, numbers are treated as integer positions, and thus slices are exclusive on the right
- if the index is made up of integers, numbers are treated as labels, and thus slices are inclusive
In [173]:
# rows 'Albania' through 'Andorra' (inclusive), columns in positions 0 through 2 (exclusive)
drinks.ix['Albania':'Andorra', 0:2]
Out[173]:
In [174]:
# rows 0 through 2 (inclusive), columns in positions 0 through 2 (exclusive)
ufo.ix[0:2, 0:2]
Out[174]:
In [175]:
# read a dataset of UFO reports into a DataFrame
ufo = pd.read_csv('data/ufo.csv')
ufo.head()
Out[175]:
In [176]:
ufo.shape
Out[176]:
In [177]:
# remove the 'City' column (doesn't affect the DataFrame since inplace=False)
ufo.drop('City', axis=1).head()
Out[177]:
In [178]:
# confirm that the 'City' column was not actually removed
ufo.head()
Out[178]:
In [179]:
# remove the 'City' column (does affect the DataFrame since inplace=True)
ufo.drop('City', axis=1, inplace=True)
In [180]:
# confirm that the 'City' column was actually removed
ufo.head()
Out[180]:
In [181]:
# drop a row if any value is missing from that row (doesn't affect the DataFrame since inplace=False)
ufo.dropna(how='any').shape
Out[181]:
In [182]:
# confirm that no rows were actually removed
ufo.shape
Out[182]:
In [183]:
# use an assignment statement instead of the 'inplace' parameter
ufo = ufo.set_index('Time')
ufo.tail()
Out[183]:
In [184]:
# fill missing values using "backward fill" strategy (doesn't affect the DataFrame since inplace=False)
ufo.fillna(method='bfill').tail()
Out[184]:
In [185]:
# compare with "forward fill" strategy (doesn't affect the DataFrame since inplace=False)
ufo.fillna(method='ffill').tail()
Out[185]:
In [186]:
# read a dataset of alcohol consumption into a DataFrame
drinks = pd.read_csv('data/drinks.csv')
drinks.head()
Out[186]:
In [187]:
# exact memory usage is unknown because object columns are references elsewhere
drinks.info()
In [188]:
# force pandas to calculate the true memory usage
drinks.info(memory_usage='deep')
In [189]:
# calculate the memory usage for each Series (in bytes)
drinks.memory_usage(deep=True)
Out[189]:
Documentation for info
and memory_usage
In [190]:
# use the 'category' data type (new in pandas 0.15) to store the 'continent' strings as integers
drinks['continent'] = drinks.continent.astype('category')
drinks.dtypes
Out[190]:
In [191]:
# 'continent' Series appears to be unchanged
drinks.continent.head()
Out[191]:
In [192]:
# strings are now encoded (0 means 'Africa', 1 means 'Asia', 2 means 'Europe', etc.)
drinks.continent.cat.codes.head()
Out[192]:
In [193]:
# memory usage has been drastically reduced
drinks.memory_usage(deep=True)
Out[193]:
In [194]:
# repeat this process for the 'country' Series
drinks['country'] = drinks.country.astype('category')
drinks.memory_usage(deep=True)
Out[194]:
In [195]:
# memory usage increased because we created 193 categories
drinks.country.cat.categories
Out[195]:
The category data type should only be used with a string Series that has a small number of possible values.
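A rough check before converting (a sketch, assuming the drinks dataset of 193 rows):
# compare the number of unique values to the length of the Series
drinks.continent.nunique()   # 6 unique values in 193 rows: a good candidate for 'category'
drinks.country.nunique()     # 193 unique values in 193 rows: a poor candidate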
In [196]:
# create a small DataFrame from a dictionary
df = pd.DataFrame({'ID':[100, 101, 102, 103], 'quality':['good', 'very good', 'good', 'excellent']})
df
Out[196]:
In [197]:
# sort the DataFrame by the 'quality' Series (alphabetical order)
df.sort_values('quality')
Out[197]:
In [198]:
# define a logical ordering for the categories
df['quality'] = df.quality.astype('category', categories=['good', 'very good', 'excellent'], ordered=True)
df.quality
Out[198]:
In [199]:
# sort the DataFrame by the 'quality' Series (logical order)
df.sort_values('quality')
Out[199]:
In [200]:
# comparison operators work with ordered categories
df.loc[df.quality > 'good', :]
Out[200]:
In [201]:
# read the training dataset from Kaggle's Titanic competition into a DataFrame
train = pd.read_csv('http://bit.ly/kaggletrain')
train.head()
Out[201]:
Goal: Predict passenger survival aboard the Titanic based on passenger attributes
In [202]:
# create a feature matrix 'X' by selecting two DataFrame columns
feature_cols = ['Pclass', 'Parch']
X = train.loc[:, feature_cols]
X.shape
Out[202]:
In [203]:
# create a response vector 'y' by selecting a Series
y = train.Survived
y.shape
Out[203]:
Note: There is no need to convert these pandas objects to NumPy arrays. scikit-learn will understand these objects as long as they are entirely numeric and have the proper shapes.
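A quick way to confirm those expectations (a sketch):
# scikit-learn expects a two-dimensional feature matrix and a one-dimensional response vector
X.ndim   # 2
y.ndim   # 1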
In [204]:
# fit a classification model to the training data
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X, y)
Out[204]:
Video series: Introduction to machine learning with scikit-learn
In [205]:
# read the testing dataset from Kaggle's Titanic competition into a DataFrame
test = pd.read_csv('http://bit.ly/kaggletest')
test.head()
Out[205]:
In [206]:
# create a feature matrix from the testing data that matches the training data
X_new = test.loc[:, feature_cols]
X_new.shape
Out[206]:
In [207]:
# use the fitted model to make predictions for the testing set observations
new_pred_class = logreg.predict(X_new)
In [208]:
# create a DataFrame of passenger IDs and testing set predictions
pd.DataFrame({'PassengerId':test.PassengerId, 'Survived':new_pred_class}).head()
Out[208]:
Documentation for the DataFrame
constructor
In [209]:
# ensure that PassengerId is the first column by setting it as the index
pd.DataFrame({'PassengerId':test.PassengerId, 'Survived':new_pred_class}).set_index('PassengerId').head()
Out[209]:
In [210]:
# write the DataFrame to a CSV file that can be submitted to Kaggle
pd.DataFrame({'PassengerId':test.PassengerId, 'Survived':new_pred_class}).set_index('PassengerId').to_csv('sub.csv')
Documentation for to_csv
In [211]:
# save a DataFrame to disk ("pickle it")
train.to_pickle('train.pkl')
In [212]:
# read a pickled object from disk ("unpickle it")
pd.read_pickle('train.pkl').head()
Out[212]:
Documentation for to_pickle
and read_pickle
Question: Could you explain how to read the pandas documentation?
Question: What is the difference between ufo.isnull() and pd.isnull(ufo)?
In [213]:
# read a dataset of UFO reports into a DataFrame
ufo = pd.read_csv('http://bit.ly/uforeports')
ufo.head()
Out[213]:
In [214]:
# use 'isnull' as a top-level function
pd.isnull(ufo).head()
Out[214]:
In [215]:
# equivalent: use 'isnull' as a DataFrame method
ufo.isnull().head()
Out[215]:
Documentation for isnull
Question: Why are DataFrame slices inclusive when using .loc, but exclusive when using .iloc?
In [216]:
# label-based slicing is inclusive of the start and stop
ufo.loc[0:4, :]
Out[216]:
In [217]:
# position-based slicing is inclusive of the start and exclusive of the stop
ufo.iloc[0:4, :]
Out[217]:
In [218]:
# 'iloc' is simply following NumPy's slicing convention...
ufo.values[0:4, :]
Out[218]:
In [219]:
# ...and NumPy is simply following Python's slicing convention
'python'[0:4]
Out[219]:
In [220]:
# 'loc' is inclusive of the stopping label because you don't necessarily know what label will come after it
ufo.loc[0:4, 'City':'State']
Out[220]:
Question: How do I randomly sample rows from a DataFrame?
In [221]:
# sample 3 rows from the DataFrame without replacement (new in pandas 0.16.1)
ufo.sample(n=3)
Out[221]:
Documentation for sample
In [222]:
# use the 'random_state' parameter for reproducibility
ufo.sample(n=3, random_state=42)
Out[222]:
In [223]:
# sample 75% of the DataFrame's rows without replacement
train = ufo.sample(frac=0.75, random_state=99)
In [224]:
# store the remaining 25% of the rows in another DataFrame
test = ufo.loc[~ufo.index.isin(train.index), :]
Documentation for isin
In [225]:
# read the training dataset from Kaggle's Titanic competition
train = pd.read_csv('http://bit.ly/kaggletrain')
train.head()
Out[225]:
In [226]:
# create the 'Sex_male' dummy variable using the 'map' method
train['Sex_male'] = train.Sex.map({'female':0, 'male':1})
train.head()
Out[226]:
Documentation for map
In [227]:
# alternative: use 'get_dummies' to create one column for every possible value
pd.get_dummies(train.Sex).head()
Out[227]:
Generally speaking: if a categorical feature has 'k' possible values, you only need 'k-1' dummy variables to capture all of the information, which is why the first dummy variable is often dropped (as shown below).
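A sanity check of that claim for the 'Sex' feature (a sketch; it relies on the 'Sex_male' column created above):
# the dropped 'female' dummy is always the complement of 'Sex_male'
(pd.get_dummies(train.Sex)['female'] == 1 - train.Sex_male).all()   # True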
In [228]:
# drop the first dummy variable ('female') using the 'iloc' method
pd.get_dummies(train.Sex).iloc[:, 1:].head()
Out[228]:
In [229]:
# add a prefix to identify the source of the dummy variables
pd.get_dummies(train.Sex, prefix='Sex').iloc[:, 1:].head()
Out[229]:
In [230]:
# use 'get_dummies' with a feature that has 3 possible values
pd.get_dummies(train.Embarked, prefix='Embarked').head(10)
Out[230]:
In [231]:
# drop the first dummy variable ('C')
pd.get_dummies(train.Embarked, prefix='Embarked').iloc[:, 1:].head(10)
Out[231]:
How to translate these values back to the original 'Embarked' value:
- Embarked_Q=0 and Embarked_S=0 means C
- Embarked_Q=1 and Embarked_S=0 means Q
- Embarked_Q=0 and Embarked_S=1 means S
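One programmatic way to invert the full set of dummies (a sketch; idxmax picks the column holding the 1 in each row, and rows where 'Embarked' was missing fall back to the first column):
# recover the original value from the complete set of dummy columns
dummies = pd.get_dummies(train.Embarked, prefix='Embarked')
dummies.idxmax(axis=1).str.replace('Embarked_', '').head(10)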
In [232]:
# save the DataFrame of dummy variables and concatenate them to the original DataFrame
embarked_dummies = pd.get_dummies(train.Embarked, prefix='Embarked').iloc[:, 1:]
train = pd.concat([train, embarked_dummies], axis=1)
train.head()
Out[232]:
Documentation for concat
In [233]:
# reset the DataFrame
train = pd.read_csv('http://bit.ly/kaggletrain')
train.head()
Out[233]:
In [234]:
# pass the DataFrame to 'get_dummies' and specify which columns to dummy (it drops the original columns)
pd.get_dummies(train, columns=['Sex', 'Embarked']).head()
Out[234]:
In [235]:
# use the 'drop_first' parameter (new in pandas 0.18) to drop the first dummy variable for each feature
pd.get_dummies(train, columns=['Sex', 'Embarked'], drop_first=True).head()
Out[235]:
Documentation for get_dummies
In [236]:
# read a dataset of UFO reports into a DataFrame
ufo = pd.read_csv('http://bit.ly/uforeports')
ufo.head()
Out[236]:
In [237]:
# 'Time' is currently stored as a string
ufo.dtypes
Out[237]:
In [238]:
# hour could be accessed using string slicing, but this approach breaks too easily
ufo.Time.str.slice(-5, -3).astype(int).head()
Out[238]:
In [239]:
# convert 'Time' to datetime format
ufo['Time'] = pd.to_datetime(ufo.Time)
ufo.head()
Out[239]:
In [240]:
ufo.dtypes
Out[240]:
Documentation for to_datetime
In [241]:
# convenient Series attributes are now available
ufo.Time.dt.hour.head()
Out[241]:
In [242]:
ufo.Time.dt.weekday_name.head()
Out[242]:
In [243]:
ufo.Time.dt.dayofyear.head()
Out[243]:
API reference for datetime properties and methods
In [244]:
# convert a single string to datetime format (outputs a timestamp object)
ts = pd.to_datetime('1/1/1999')
ts
Out[244]:
In [245]:
# compare a datetime Series with a timestamp
ufo.loc[ufo.Time >= ts, :].head()
Out[245]:
In [246]:
# perform mathematical operations with timestamps (outputs a timedelta object)
ufo.Time.max() - ufo.Time.min()
Out[246]:
In [247]:
# timedelta objects also have attributes you can access
(ufo.Time.max() - ufo.Time.min()).days
Out[247]:
In [248]:
# allow plots to appear in the notebook
%matplotlib inline
In [249]:
# count the number of UFO reports per year
ufo['Year'] = ufo.Time.dt.year
ufo.Year.value_counts().sort_index().head()
Out[249]:
In [250]:
# plot the number of UFO reports per year (line plot is the default)
ufo.Year.value_counts().sort_index().plot()
Out[250]:
In [251]:
# read a dataset of movie reviewers into a DataFrame
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
users = pd.read_table('http://bit.ly/movieusers', sep='|', header=None, names=user_cols, index_col='user_id')
users.head()
Out[251]:
In [252]:
users.shape
Out[252]:
In [253]:
# detect duplicate zip codes: True if an item is identical to a previous item
users.zip_code.duplicated().tail()
Out[253]:
In [254]:
# count the duplicate items (True becomes 1, False becomes 0)
users.zip_code.duplicated().sum()
Out[254]:
In [255]:
# detect duplicate DataFrame rows: True if an entire row is identical to a previous row
users.duplicated().tail()
Out[255]:
In [256]:
# count the duplicate rows
users.duplicated().sum()
Out[256]:
Logic for duplicated:
- keep='first' (default): mark duplicates as True, except for the first occurrence
- keep='last': mark duplicates as True, except for the last occurrence
- keep=False: mark all duplicates as True
In [257]:
# examine the duplicate rows (ignoring the first occurrence)
users.loc[users.duplicated(keep='first'), :]
Out[257]:
In [258]:
# examine the duplicate rows (ignoring the last occurrence)
users.loc[users.duplicated(keep='last'), :]
Out[258]:
In [259]:
# examine the duplicate rows (including all duplicates)
users.loc[users.duplicated(keep=False), :]
Out[259]:
In [260]:
# drop the duplicate rows (inplace=False by default)
users.drop_duplicates(keep='first').shape
Out[260]:
In [261]:
users.drop_duplicates(keep='last').shape
Out[261]:
In [262]:
users.drop_duplicates(keep=False).shape
Out[262]:
Documentation for drop_duplicates
In [263]:
# only consider a subset of columns when identifying duplicates
users.duplicated(subset=['age', 'zip_code']).sum()
Out[263]:
In [264]:
users.drop_duplicates(subset=['age', 'zip_code']).shape
Out[264]:
In [265]:
# read a dataset of top-rated IMDb movies into a DataFrame
movies = pd.read_csv('http://bit.ly/imdbratings')
movies.head()
Out[265]:
In [266]:
# count the missing values in the 'content_rating' Series
movies.content_rating.isnull().sum()
Out[266]:
In [267]:
# examine the DataFrame rows that contain those missing values
movies[movies.content_rating.isnull()]
Out[267]:
In [268]:
# examine the unique values in the 'content_rating' Series
movies.content_rating.value_counts()
Out[268]:
Goal: Mark the 'NOT RATED' values as missing values, represented by 'NaN'.
In [269]:
# first, locate the relevant rows
movies[movies.content_rating=='NOT RATED'].head()
Out[269]:
In [270]:
# then, select the 'content_rating' Series from those rows
movies[movies.content_rating=='NOT RATED'].content_rating.head()
Out[270]:
In [271]:
# finally, replace the 'NOT RATED' values with 'NaN' (imported from NumPy)
import numpy as np
movies[movies.content_rating=='NOT RATED'].content_rating = np.nan
Problem: That statement involves two operations, a __getitem__ and a __setitem__. pandas can't guarantee whether the __getitem__ operation returns a view or a copy of the data.
- If __getitem__ returns a view of the data, __setitem__ will affect the 'movies' DataFrame.
- If __getitem__ returns a copy of the data, __setitem__ will not affect the 'movies' DataFrame.
In [272]:
# the 'content_rating' Series has not changed
movies.content_rating.isnull().sum()
Out[272]:
Solution: Use the loc method, which replaces the 'NOT RATED' values in a single __setitem__ operation.
In [273]:
# replace the 'NOT RATED' values with 'NaN' (does not cause a SettingWithCopyWarning)
movies.loc[movies.content_rating=='NOT RATED', 'content_rating'] = np.nan
In [274]:
# this time, the 'content_rating' Series has changed
movies.content_rating.isnull().sum()
Out[274]:
Summary: Use the loc method any time you are selecting rows and columns in the same statement.
More information: Modern Pandas (Part 1)
In [275]:
# create a DataFrame only containing movies with a high 'star_rating'
top_movies = movies.loc[movies.star_rating >= 9, :]
top_movies
Out[275]:
Goal: Fix the 'duration' for 'The Shawshank Redemption'.
In [276]:
# overwrite the relevant cell with the correct duration
top_movies.loc[0, 'duration'] = 150
Problem: pandas isn't sure whether 'top_movies' is a view or a copy of 'movies'.
In [277]:
# 'top_movies' DataFrame has been updated
top_movies
Out[277]:
In [278]:
# 'movies' DataFrame has not been updated
movies.head(1)
Out[278]:
Solution: Any time you are attempting to create a DataFrame copy, use the copy method.
In [279]:
# explicitly create a copy of 'movies'
top_movies = movies.loc[movies.star_rating >= 9, :].copy()
In [280]:
# pandas now knows that you are updating a copy instead of a view (does not cause a SettingWithCopyWarning)
top_movies.loc[0, 'duration'] = 150
In [281]:
# 'top_movies' DataFrame has been updated
top_movies
Out[281]:
Documentation on indexing and selection: Returning a view versus a copy
Stack Overflow: What is the point of views in pandas if it is undefined whether an indexing operation returns a view or a copy?