In [1]:
# conventional way to import pandas
import pandas as pd
# check the pandas version
print('Pandas version', pd.__version__)


('Pandas version', u'0.18.1')

2. How do I read a tabular data file into pandas?


In [2]:
# read a dataset of Chipotle orders from a TSV file and store the results in a DataFrame
orders = pd.read_table('data/chipotle.tsv')

In [3]:
# examine the first 5 rows
orders.head()


Out[3]:
order_id quantity item_name choice_description item_price
0 1 1 Chips and Fresh Tomato Salsa NaN $2.39
1 1 1 Izze [Clementine] $3.39
2 1 1 Nantucket Nectar [Apple] $3.39
3 1 1 Chips and Tomatillo-Green Chili Salsa NaN $2.39
4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans... $16.98

Documentation for read_table


In [4]:
# read the user dataset with default settings (tab separator, first row as header)
users = pd.read_table('data/u.user')
# examine the first 5 rows
users.head()


Out[4]:
1|24|M|technician|85711
0 2|53|F|other|94043
1 3|23|M|writer|32067
2 4|24|M|technician|43537
3 5|33|F|other|15213
4 6|42|M|executive|98101

In [5]:
# specify a pipe separator, since the file is not tab-delimited
users = pd.read_table('data/u.user', sep='|')
# examine the first 5 rows
users.head()


Out[5]:
1 24 M technician 85711
0 2 53 F other 94043
1 3 23 M writer 32067
2 4 24 M technician 43537
3 5 33 F other 15213
4 6 42 M executive 98101

In [6]:
# tell pandas that the file has no header row
users = pd.read_table('data/u.user', sep='|', header=None)
# examine the first 5 rows
users.head()


Out[6]:
0 1 2 3 4
0 1 24 M technician 85711
1 2 53 F other 94043
2 3 23 M writer 32067
3 4 24 M technician 43537
4 5 33 F other 15213

In [7]:
# supply descriptive column names via the 'names' parameter
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
users = pd.read_table('data/u.user', sep='|', header=None, names=user_cols)
# examine the first 5 rows
users.head()


Out[7]:
user_id age gender occupation zip_code
0 1 24 M technician 85711
1 2 53 F other 94043
2 3 23 M writer 32067
3 4 24 M technician 43537
4 5 33 F other 15213

3. How do I select a pandas Series from a DataFrame?


In [7]:
# read a dataset of UFO reports into a DataFrame
ufo = pd.read_table('data/ufo.csv', sep=',')

OR


In [8]:
# read_csv is equivalent to read_table, except it assumes a comma separator
ufo = pd.read_csv('data/ufo.csv')
type(ufo)


Out[8]:
pandas.core.frame.DataFrame

In [9]:
# examine the first 5 rows
ufo.head()


Out[9]:
City Colors Reported Shape Reported State Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00
3 Abilene NaN DISK KS 6/1/1931 13:00
4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00

In [10]:
# select the 'Colors Reported' Series using bracket notation
ufo['Colors Reported']


Out[10]:
0           NaN
1           NaN
2           NaN
3           NaN
4           NaN
5           NaN
6           NaN
7           NaN
8           NaN
9           NaN
10          NaN
11          NaN
12          RED
13          NaN
14          NaN
15          NaN
16          NaN
17          NaN
18          NaN
19          RED
20          NaN
21          NaN
22          NaN
23          NaN
24          NaN
25          NaN
26          NaN
27          NaN
28          NaN
29          NaN
          ...  
18211       NaN
18212       NaN
18213     GREEN
18214       NaN
18215       NaN
18216    ORANGE
18217       NaN
18218       NaN
18219       NaN
18220      BLUE
18221       NaN
18222       NaN
18223       NaN
18224       NaN
18225       NaN
18226       NaN
18227       NaN
18228       NaN
18229       NaN
18230       NaN
18231       NaN
18232       NaN
18233       RED
18234       NaN
18235       NaN
18236       NaN
18237       NaN
18238       NaN
18239       RED
18240       NaN
Name: Colors Reported, dtype: object

In [16]:
type(ufo['City'])


Out[16]:
pandas.core.series.Series

In [17]:
# or equivalently, use dot notation - see notes below
ufo.City


Out[17]:
0                      Ithaca
1                 Willingboro
2                     Holyoke
3                     Abilene
4        New York Worlds Fair
5                 Valley City
6                 Crater Lake
7                        Alma
8                     Eklutna
9                     Hubbard
10                    Fontana
11                   Waterloo
12                     Belton
13                     Keokuk
14                  Ludington
15                Forest Home
16                Los Angeles
17                  Hapeville
18                     Oneida
19                 Bering Sea
20                   Nebraska
21                        NaN
22                        NaN
23                  Owensboro
24                 Wilderness
25                  San Diego
26                 Wilderness
27                     Clovis
28                 Los Alamos
29               Ft. Duschene
                 ...         
18211                 Holyoke
18212                  Carson
18213                Pasadena
18214                  Austin
18215                El Campo
18216            Garden Grove
18217           Berthoud Pass
18218              Sisterdale
18219            Garden Grove
18220             Shasta Lake
18221                Franklin
18222          Albrightsville
18223              Greenville
18224                 Eufaula
18225             Simi Valley
18226           San Francisco
18227           San Francisco
18228              Kingsville
18229                 Chicago
18230             Pismo Beach
18231             Pismo Beach
18232                    Lodi
18233               Anchorage
18234                Capitola
18235          Fountain Hills
18236              Grant Park
18237             Spirit Lake
18238             Eagle River
18239             Eagle River
18240                    Ybor
Name: City, dtype: object

Bracket notation will always work, whereas dot notation has limitations:

  • Dot notation doesn't work if there are spaces in the Series name
  • Dot notation doesn't work if the Series has the same name as a DataFrame method or attribute (like 'head' or 'shape')
  • Dot notation can't be used to define the name of a new Series (see below)
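A minimal sketch of these limitations, using a small hypothetical DataFrame rather than the ufo data:

```python
import pandas as pd

# toy DataFrame with a space in one column name and a column name
# that collides with a built-in DataFrame attribute
df = pd.DataFrame({'Colors Reported': ['RED', 'BLUE'],
                   'shape': ['OVAL', 'DISK']})

# bracket notation always works
colors = df['Colors Reported']

# df.Colors Reported would be a syntax error (space in the name),
# and df.shape returns the (rows, columns) tuple, not the column
print(df.shape)        # the attribute wins: (2, 2)
print(df['shape'])     # bracket notation still reaches the column
```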

In [10]:
# create a new 'Location' Series (must use bracket notation to define the Series name)
ufo['Location'] = ufo.City + ', ' + ufo.State
ufo.head()


Out[10]:
City Colors Reported Shape Reported State Time Location
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00 Ithaca, NY
1 Willingboro NaN OTHER NJ 6/30/1930 20:00 Willingboro, NJ
2 Holyoke NaN OVAL CO 2/15/1931 14:00 Holyoke, CO
3 Abilene NaN DISK KS 6/1/1931 13:00 Abilene, KS
4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00 New York Worlds Fair, NY

4. Why do some pandas commands end with parentheses (and others don't)?


In [11]:
# read a dataset of top-rated IMDb movies into a DataFrame
movies = pd.read_csv('data/imdb_1000.csv')

Methods end with parentheses, while attributes don't:


In [12]:
# example method: show the first 5 rows
movies.head()


Out[12]:
star_rating title content_rating genre duration actors_list
0 9.3 The Shawshank Redemption R Crime 142 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1 9.2 The Godfather R Crime 175 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3 9.0 The Dark Knight PG-13 Action 152 [u'Christian Bale', u'Heath Ledger', u'Aaron E...
4 8.9 Pulp Fiction R Crime 154 [u'John Travolta', u'Uma Thurman', u'Samuel L....

In [13]:
# example method: calculate summary statistics
movies.describe()


Out[13]:
star_rating duration
count 979.000000 979.000000
mean 7.889785 120.979571
std 0.336069 26.218010
min 7.400000 64.000000
25% 7.600000 102.000000
50% 7.800000 117.000000
75% 8.100000 134.000000
max 9.300000 242.000000

In [14]:
# example attribute: number of rows and columns
movies.shape


Out[14]:
(979, 6)

In [15]:
# example attribute: data type of each column
movies.dtypes


Out[15]:
star_rating       float64
title              object
content_rating     object
genre              object
duration            int64
actors_list        object
dtype: object

In [16]:
# use an optional parameter to the describe method to summarize only 'object' columns
movies.describe(include=['object'])


Out[16]:
title content_rating genre actors_list
count 979 976 979 979
unique 975 12 16 969
top The Girl with the Dragon Tattoo R Drama [u'Daniel Radcliffe', u'Emma Watson', u'Rupert...
freq 2 460 278 6

Documentation for describe


5. How do I rename columns in a pandas DataFrame?


In [11]:
# read a dataset of UFO reports into a DataFrame
ufo = pd.read_csv('data/ufo.csv')

In [12]:
# examine the column names
ufo.columns


Out[12]:
Index([u'City', u'Colors Reported', u'Shape Reported', u'State', u'Time'], dtype='object')

In [13]:
# rename two of the columns by using the 'rename' method
ufo.rename(columns={'Colors Reported':'Colors_Reported', 'Shape Reported':'Shape_Reported'}, inplace=True)
ufo.columns


Out[13]:
Index([u'City', u'Colors_Reported', u'Shape_Reported', u'State', u'Time'], dtype='object')

Documentation for rename


In [22]:
# replace all of the column names by overwriting the 'columns' attribute
ufo_cols = ['city', 'colors reported', 'shape reported', 'state', 'time']
ufo.columns = ufo_cols
ufo.columns


Out[22]:
Index([u'city', u'colors reported', u'shape reported', u'state', u'time'], dtype='object')

In [23]:
# replace the column names during the file reading process by using the 'names' parameter
ufo = pd.read_csv('data/ufo.csv', header=0, names=ufo_cols)
ufo.head()


Out[23]:
city colors reported shape reported state time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00
3 Abilene NaN DISK KS 6/1/1931 13:00
4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00

Documentation for read_csv


In [24]:
# replace all spaces with underscores in the column names by using the 'str.replace' method
ufo.columns = ufo.columns.str.replace(' ', '_')
ufo.columns


Out[24]:
Index([u'city', u'colors_reported', u'shape_reported', u'state', u'time'], dtype='object')

Documentation for str.replace


6. How do I remove columns from a pandas DataFrame?


In [25]:
# read a dataset of UFO reports into a DataFrame
ufo = pd.read_csv('data/ufo.csv')
ufo.head()


Out[25]:
City Colors Reported Shape Reported State Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00
3 Abilene NaN DISK KS 6/1/1931 13:00
4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00

In [24]:
# remove a single column (axis=1 refers to columns)
ufo.drop('Colors Reported', axis=1, inplace=True)
ufo.head()


Out[24]:
City Shape Reported State Time
0 Ithaca TRIANGLE NY 6/1/1930 22:00
1 Willingboro OTHER NJ 6/30/1930 20:00
2 Holyoke OVAL CO 2/15/1931 14:00
3 Abilene DISK KS 6/1/1931 13:00
4 New York Worlds Fair LIGHT NY 4/18/1933 19:00

Documentation for drop


In [25]:
# remove multiple columns at once
ufo.drop(['City', 'State'], axis=1, inplace=True)
ufo.head()


Out[25]:
Shape Reported Time
0 TRIANGLE 6/1/1930 22:00
1 OTHER 6/30/1930 20:00
2 OVAL 2/15/1931 14:00
3 DISK 6/1/1931 13:00
4 LIGHT 4/18/1933 19:00

In [26]:
# remove multiple rows at once (axis=0 refers to rows)
ufo.drop([0, 1], axis=0, inplace=True)
ufo.head()


Out[26]:
Shape Reported Time
2 OVAL 2/15/1931 14:00
3 DISK 6/1/1931 13:00
4 LIGHT 4/18/1933 19:00
5 DISK 9/15/1934 15:30
6 CIRCLE 6/15/1935 0:00

7. How do I sort a pandas DataFrame or a Series?


In [28]:
# read a dataset of top-rated IMDb movies into a DataFrame
movies = pd.read_csv('data/imdb_1000.csv')
movies.head()


Out[28]:
star_rating title content_rating genre duration actors_list
0 9.3 The Shawshank Redemption R Crime 142 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1 9.2 The Godfather R Crime 175 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3 9.0 The Dark Knight PG-13 Action 152 [u'Christian Bale', u'Heath Ledger', u'Aaron E...
4 8.9 Pulp Fiction R Crime 154 [u'John Travolta', u'Uma Thurman', u'Samuel L....

Note: None of the sorting methods below affect the underlying data. (In other words, the sorting is temporary.)
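To make a sort persist, assign the result back to the original name. A minimal sketch with a toy Series, not the movies data:

```python
import pandas as pd

s = pd.Series([3, 1, 2])

# sort_values returns a new, sorted object; 's' itself is unchanged
sorted_s = s.sort_values()
print(list(s))        # still [3, 1, 2]

# assigning the result back makes the sort stick
s = s.sort_values()
print(list(s))        # now [1, 2, 3]
```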


In [29]:
# sort the 'title' Series in ascending order (returns a Series)
movies.title.sort_values()


Out[29]:
542                   (500) Days of Summer
5                             12 Angry Men
201                       12 Years a Slave
698                              127 Hours
110                  2001: A Space Odyssey
910                                   2046
596                               21 Grams
624                              25th Hour
708                       28 Days Later...
60                                3 Idiots
225                                 3-Iron
570                                    300
555                           3:10 to Yuma
427           4 Months, 3 Weeks and 2 Days
824                                     42
597                                  50/50
203                                  8 1/2
170                       A Beautiful Mind
941                       A Bridge Too Far
571                           A Bronx Tale
266                      A Christmas Story
86                      A Clockwork Orange
716                         A Few Good Men
750                    A Fish Called Wanda
276                   A Fistful of Dollars
612                     A Hard Day's Night
883                  A History of Violence
869              A Nightmare on Elm Street
865                        A Perfect World
426                              A Prophet
                      ...                 
207       What Ever Happened to Baby Jane?
562            What's Eating Gilbert Grape
719                When Harry Met Sally...
649                      Where Eagles Dare
33                                Whiplash
669                Who Framed Roger Rabbit
219        Who's Afraid of Virginia Woolf?
127                      Wild Strawberries
497    Willy Wonka & the Chocolate Factory
270                        Wings of Desire
483                           Withnail & I
920                                Witness
65             Witness for the Prosecution
970                            Wonder Boys
518                         Wreck-It Ralph
954                                  X-Men
248             X-Men: Days of Future Past
532                     X-Men: First Class
871                                     X2
695                      Y Tu Mama Tambien
403                             Ying xiong
235                                Yip Man
96                                 Yojimbo
280                     Young Frankenstein
535                                  Zelig
955                       Zero Dark Thirty
677                                 Zodiac
615                             Zombieland
526                                   Zulu
864                                  [Rec]
Name: title, dtype: object

In [29]:
# sort in descending order instead
movies.title.sort_values(ascending=False).head()


Out[29]:
864               [Rec]
526                Zulu
615          Zombieland
677              Zodiac
955    Zero Dark Thirty
Name: title, dtype: object

Documentation for sort_values for a Series. (Prior to version 0.17, use order instead.)


In [31]:
# sort the entire DataFrame by the 'title' Series (returns a DataFrame)
movies.sort_values('title').head()


Out[31]:
star_rating title content_rating genre duration actors_list
542 7.8 (500) Days of Summer PG-13 Comedy 95 [u'Zooey Deschanel', u'Joseph Gordon-Levitt', ...
5 8.9 12 Angry Men NOT RATED Drama 96 [u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals...
201 8.1 12 Years a Slave R Biography 134 [u'Chiwetel Ejiofor', u'Michael Kenneth Willia...
698 7.6 127 Hours R Adventure 94 [u'James Franco', u'Amber Tamblyn', u'Kate Mara']
110 8.3 2001: A Space Odyssey G Mystery 160 [u'Keir Dullea', u'Gary Lockwood', u'William S...

In [33]:
# sort in descending order instead
movies.sort_values('title', ascending=False).head()


Out[33]:
star_rating title content_rating genre duration actors_list
864 7.5 [Rec] R Horror 78 [u'Manuela Velasco', u'Ferran Terraza', u'Jorg...
526 7.8 Zulu UNRATED Drama 138 [u'Stanley Baker', u'Jack Hawkins', u'Ulla Jac...
615 7.7 Zombieland R Comedy 88 [u'Jesse Eisenberg', u'Emma Stone', u'Woody Ha...
677 7.7 Zodiac R Crime 157 [u'Jake Gyllenhaal', u'Robert Downey Jr.', u'M...
955 7.4 Zero Dark Thirty R Drama 157 [u'Jessica Chastain', u'Joel Edgerton', u'Chri...

Documentation for sort_values for a DataFrame. (Prior to version 0.17, use sort instead.)


In [32]:
# sort the DataFrame first by 'content_rating', then by 'duration'
movies.sort_values(['content_rating', 'duration']).head()


Out[32]:
star_rating title content_rating genre duration actors_list
713 7.6 The Jungle Book APPROVED Animation 78 [u'Phil Harris', u'Sebastian Cabot', u'Louis P...
513 7.8 Invasion of the Body Snatchers APPROVED Horror 80 [u'Kevin McCarthy', u'Dana Wynter', u'Larry Ga...
272 8.1 The Killing APPROVED Crime 85 [u'Sterling Hayden', u'Coleen Gray', u'Vince E...
703 7.6 Dracula APPROVED Horror 85 [u'Bela Lugosi', u'Helen Chandler', u'David Ma...
612 7.7 A Hard Day's Night APPROVED Comedy 87 [u'John Lennon', u'Paul McCartney', u'George H...

8. How do I filter rows of a pandas DataFrame by column value?


In [35]:
# read a dataset of top-rated IMDb movies into a DataFrame
movies = pd.read_csv('data/imdb_1000.csv')
movies.head()


Out[35]:
star_rating title content_rating genre duration actors_list
0 9.3 The Shawshank Redemption R Crime 142 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1 9.2 The Godfather R Crime 175 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3 9.0 The Dark Knight PG-13 Action 152 [u'Christian Bale', u'Heath Ledger', u'Aaron E...
4 8.9 Pulp Fiction R Crime 154 [u'John Travolta', u'Uma Thurman', u'Samuel L....

In [34]:
# examine the number of rows and columns
movies.shape


Out[34]:
(979, 6)

Goal: Filter the DataFrame rows to only show movies with a 'duration' of at least 200 minutes.


In [35]:
# create a list in which each element refers to a DataFrame row: True if the row satisfies the condition, False otherwise
booleans = []
for length in movies.duration:
    if length >= 200:
        booleans.append(True)
    else:
        booleans.append(False)

In [36]:
# confirm that the list has the same length as the DataFrame
len(booleans)


Out[36]:
979

In [37]:
# examine the first five list elements
booleans[0:5]


Out[37]:
[False, False, True, False, False]

In [38]:
# convert the list to a Series
is_long = pd.Series(booleans)
is_long.head()


Out[38]:
0    False
1    False
2     True
3    False
4    False
dtype: bool

In [39]:
# use bracket notation with the boolean Series to tell the DataFrame which rows to display
movies[is_long]


Out[39]:
star_rating title content_rating genre duration actors_list
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
7 8.9 The Lord of the Rings: The Return of the King PG-13 Adventure 201 [u'Elijah Wood', u'Viggo Mortensen', u'Ian McK...
17 8.7 Seven Samurai UNRATED Drama 207 [u'Toshir\xf4 Mifune', u'Takashi Shimura', u'K...
78 8.4 Once Upon a Time in America R Crime 229 [u'Robert De Niro', u'James Woods', u'Elizabet...
85 8.4 Lawrence of Arabia PG Adventure 216 [u"Peter O'Toole", u'Alec Guinness', u'Anthony...
142 8.3 Lagaan: Once Upon a Time in India PG Adventure 224 [u'Aamir Khan', u'Gracy Singh', u'Rachel Shell...
157 8.2 Gone with the Wind G Drama 238 [u'Clark Gable', u'Vivien Leigh', u'Thomas Mit...
204 8.1 Ben-Hur G Adventure 212 [u'Charlton Heston', u'Jack Hawkins', u'Stephe...
445 7.9 The Ten Commandments APPROVED Adventure 220 [u'Charlton Heston', u'Yul Brynner', u'Anne Ba...
476 7.8 Hamlet PG-13 Drama 242 [u'Kenneth Branagh', u'Julie Christie', u'Dere...
630 7.7 Malcolm X PG-13 Biography 202 [u'Denzel Washington', u'Angela Bassett', u'De...
767 7.6 It's a Mad, Mad, Mad, Mad World APPROVED Action 205 [u'Spencer Tracy', u'Milton Berle', u'Ethel Me...

In [40]:
# simplify the steps above: no need to write a for loop to create 'is_long' since pandas will broadcast the comparison
is_long = movies.duration >= 200
movies[is_long]

# or equivalently, write it in one line (no need to create the 'is_long' object)
movies[movies.duration >= 200]


Out[40]:
star_rating title content_rating genre duration actors_list
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
7 8.9 The Lord of the Rings: The Return of the King PG-13 Adventure 201 [u'Elijah Wood', u'Viggo Mortensen', u'Ian McK...
17 8.7 Seven Samurai UNRATED Drama 207 [u'Toshir\xf4 Mifune', u'Takashi Shimura', u'K...
78 8.4 Once Upon a Time in America R Crime 229 [u'Robert De Niro', u'James Woods', u'Elizabet...
85 8.4 Lawrence of Arabia PG Adventure 216 [u"Peter O'Toole", u'Alec Guinness', u'Anthony...
142 8.3 Lagaan: Once Upon a Time in India PG Adventure 224 [u'Aamir Khan', u'Gracy Singh', u'Rachel Shell...
157 8.2 Gone with the Wind G Drama 238 [u'Clark Gable', u'Vivien Leigh', u'Thomas Mit...
204 8.1 Ben-Hur G Adventure 212 [u'Charlton Heston', u'Jack Hawkins', u'Stephe...
445 7.9 The Ten Commandments APPROVED Adventure 220 [u'Charlton Heston', u'Yul Brynner', u'Anne Ba...
476 7.8 Hamlet PG-13 Drama 242 [u'Kenneth Branagh', u'Julie Christie', u'Dere...
630 7.7 Malcolm X PG-13 Biography 202 [u'Denzel Washington', u'Angela Bassett', u'De...
767 7.6 It's a Mad, Mad, Mad, Mad World APPROVED Action 205 [u'Spencer Tracy', u'Milton Berle', u'Ethel Me...

In [41]:
# select the 'genre' Series from the filtered DataFrame
is_long = movies.duration >= 200
movies[is_long].genre

# or equivalently, use the 'loc' method
movies.loc[movies.duration >= 200, 'genre']


Out[41]:
2          Crime
7      Adventure
17         Drama
78         Crime
85     Adventure
142    Adventure
157        Drama
204    Adventure
445    Adventure
476        Drama
630    Biography
767       Action
Name: genre, dtype: object

Documentation for loc


9. How do I apply multiple filter criteria to a pandas DataFrame?


In [36]:
# read a dataset of top-rated IMDb movies into a DataFrame
movies = pd.read_csv('data/imdb_1000.csv')
movies.head()


Out[36]:
star_rating title content_rating genre duration actors_list
0 9.3 The Shawshank Redemption R Crime 142 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1 9.2 The Godfather R Crime 175 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3 9.0 The Dark Knight PG-13 Action 152 [u'Christian Bale', u'Heath Ledger', u'Aaron E...
4 8.9 Pulp Fiction R Crime 154 [u'John Travolta', u'Uma Thurman', u'Samuel L....

In [37]:
# filter the DataFrame to only show movies with a 'duration' of at least 200 minutes
movies[movies.duration >= 200]


Out[37]:
star_rating title content_rating genre duration actors_list
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
7 8.9 The Lord of the Rings: The Return of the King PG-13 Adventure 201 [u'Elijah Wood', u'Viggo Mortensen', u'Ian McK...
17 8.7 Seven Samurai UNRATED Drama 207 [u'Toshir\xf4 Mifune', u'Takashi Shimura', u'K...
78 8.4 Once Upon a Time in America R Crime 229 [u'Robert De Niro', u'James Woods', u'Elizabet...
85 8.4 Lawrence of Arabia PG Adventure 216 [u"Peter O'Toole", u'Alec Guinness', u'Anthony...
142 8.3 Lagaan: Once Upon a Time in India PG Adventure 224 [u'Aamir Khan', u'Gracy Singh', u'Rachel Shell...
157 8.2 Gone with the Wind G Drama 238 [u'Clark Gable', u'Vivien Leigh', u'Thomas Mit...
204 8.1 Ben-Hur G Adventure 212 [u'Charlton Heston', u'Jack Hawkins', u'Stephe...
445 7.9 The Ten Commandments APPROVED Adventure 220 [u'Charlton Heston', u'Yul Brynner', u'Anne Ba...
476 7.8 Hamlet PG-13 Drama 242 [u'Kenneth Branagh', u'Julie Christie', u'Dere...
630 7.7 Malcolm X PG-13 Biography 202 [u'Denzel Washington', u'Angela Bassett', u'De...
767 7.6 It's a Mad, Mad, Mad, Mad World APPROVED Action 205 [u'Spencer Tracy', u'Milton Berle', u'Ethel Me...

Understanding logical operators:

  • and: True only if both sides of the operator are True
  • or: True if either side of the operator is True

In [44]:
# demonstration of the 'and' operator
print(True and True)
print(True and False)
print(False and False)


True
False
False

In [45]:
# demonstration of the 'or' operator
print(True or True)
print(True or False)
print(False or False)


True
True
False

Rules for specifying multiple filter criteria in pandas:

  • use & instead of and
  • use | instead of or
  • add parentheses around each condition to specify evaluation order
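Why the parentheses matter: in Python, & binds more tightly than comparison operators, so without parentheses the expression would be grouped incorrectly and raise an error. A minimal sketch with a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'duration': [90, 210, 250],
                   'genre': ['Drama', 'Drama', 'Crime']})

# parentheses force each comparison to be evaluated before '&'
long_dramas = df[(df.duration >= 200) & (df.genre == 'Drama')]

# without parentheses, Python would try to evaluate
# '200 & df.genre' first, which raises an error
print(long_dramas)
```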

Goal: Further filter the DataFrame of long movies (duration >= 200) to only show movies which also have a 'genre' of 'Drama'


In [46]:
# CORRECT: use the '&' operator to specify that both conditions are required
movies[(movies.duration >= 200) & (movies.genre == 'Drama')]


Out[46]:
star_rating title content_rating genre duration actors_list
17 8.7 Seven Samurai UNRATED Drama 207 [u'Toshir\xf4 Mifune', u'Takashi Shimura', u'K...
157 8.2 Gone with the Wind G Drama 238 [u'Clark Gable', u'Vivien Leigh', u'Thomas Mit...
476 7.8 Hamlet PG-13 Drama 242 [u'Kenneth Branagh', u'Julie Christie', u'Dere...

In [47]:
# INCORRECT: using the '|' operator would have shown movies that are either long or dramas (or both)
movies[(movies.duration >= 200) | (movies.genre == 'Drama')].head()


Out[47]:
star_rating title content_rating genre duration actors_list
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
5 8.9 12 Angry Men NOT RATED Drama 96 [u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals...
7 8.9 The Lord of the Rings: The Return of the King PG-13 Adventure 201 [u'Elijah Wood', u'Viggo Mortensen', u'Ian McK...
9 8.9 Fight Club R Drama 139 [u'Brad Pitt', u'Edward Norton', u'Helena Bonh...
13 8.8 Forrest Gump PG-13 Drama 142 [u'Tom Hanks', u'Robin Wright', u'Gary Sinise']

Goal: Filter the original DataFrame to show movies with a 'genre' of 'Crime' or 'Drama' or 'Action'


In [44]:
# use the '|' operator to specify that a row can match any of the three criteria
movies[(movies.genre == 'Crime') | (movies.genre == 'Drama') | (movies.genre == 'Action')].tail(20)

# or equivalently, use the 'isin' method
#movies[movies.genre.isin(['Crime', 'Drama', 'Action'])].tail(20)


Out[44]:
star_rating title content_rating genre duration actors_list
931 7.4 Mean Streets R Crime 112 [u'Robert De Niro', u'Harvey Keitel', u'David ...
939 7.4 Predestination R Drama 97 [u'Ethan Hawke', u'Sarah Snook', u'Noah Taylor']
941 7.4 A Bridge Too Far PG Drama 175 [u'Sean Connery', u"Ryan O'Neal", u'Michael Ca...
945 7.4 Take Shelter R Drama 120 [u'Michael Shannon', u'Jessica Chastain', u'Sh...
946 7.4 Far from Heaven PG-13 Drama 107 [u'Julianne Moore', u'Dennis Quaid', u'Dennis ...
947 7.4 Eraserhead UNRATED Drama 89 [u'Jack Nance', u'Charlotte Stewart', u'Allen ...
950 7.4 Bound R Crime 108 [u'Jennifer Tilly', u'Gina Gershon', u'Joe Pan...
951 7.4 Sleepy Hollow R Drama 105 [u'Johnny Depp', u'Christina Ricci', u'Miranda...
954 7.4 X-Men PG-13 Action 104 [u'Patrick Stewart', u'Hugh Jackman', u'Ian Mc...
955 7.4 Zero Dark Thirty R Drama 157 [u'Jessica Chastain', u'Joel Edgerton', u'Chri...
958 7.4 My Sister's Keeper PG-13 Drama 109 [u'Cameron Diaz', u'Abigail Breslin', u'Alec B...
963 7.4 La Femme Nikita R Action 118 [u'Anne Parillaud', u'Marc Duret', u'Patrick F...
967 7.4 The Rock R Action 136 [u'Sean Connery', u'Nicolas Cage', u'Ed Harris']
968 7.4 The English Patient R Drama 162 [u'Ralph Fiennes', u'Juliette Binoche', u'Will...
969 7.4 Law Abiding Citizen R Crime 109 [u'Gerard Butler', u'Jamie Foxx', u'Leslie Bibb']
970 7.4 Wonder Boys R Drama 107 [u'Michael Douglas', u'Tobey Maguire', u'Franc...
972 7.4 Blue Valentine NC-17 Drama 112 [u'Ryan Gosling', u'Michelle Williams', u'John...
973 7.4 The Cider House Rules PG-13 Drama 126 [u'Tobey Maguire', u'Charlize Theron', u'Micha...
976 7.4 Master and Commander: The Far Side of the World PG-13 Action 138 [u'Russell Crowe', u'Paul Bettany', u'Billy Bo...
978 7.4 Wall Street R Crime 126 [u'Charlie Sheen', u'Michael Douglas', u'Tamar...

Documentation for isin


10. Your pandas questions answered!

Question: When reading from a file, how do I read in only a subset of the columns?


In [49]:
# read a dataset of UFO reports into a DataFrame, and check the columns
ufo = pd.read_csv('data/ufo.csv')
ufo.columns


Out[49]:
Index([u'City', u'Colors Reported', u'Shape Reported', u'State', u'Time'], dtype='object')

In [50]:
# specify which columns to include by name
ufo = pd.read_csv('data/ufo.csv', usecols=['City', 'State'])

# or alternatively, specify columns by position (note: positions 0 and 4 are 'City' and 'Time')
ufo = pd.read_csv('data/ufo.csv', usecols=[0, 4])
ufo.columns


Out[50]:
Index([u'City', u'Time'], dtype='object')

Question: When reading from a file, how do I read in only a subset of the rows?


In [51]:
# specify how many rows to read
ufo = pd.read_csv('data/ufo.csv', nrows=3)
ufo


Out[51]:
City Colors Reported Shape Reported State Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00

Documentation for read_csv

Question: How do I iterate through a Series?


In [52]:
# Series are directly iterable (like a list)
for c in ufo.City:
    print(c)


Ithaca
Willingboro
Holyoke

Question: How do I iterate through a DataFrame?


In [53]:
# various methods are available to iterate through a DataFrame
for index, row in ufo.iterrows():
    print(index, row.City, row.State)


(0, 'Ithaca', 'NY')
(1, 'Willingboro', 'NJ')
(2, 'Holyoke', 'CO')

Documentation for iterrows

Question: How do I drop all non-numeric columns from a DataFrame?


In [45]:
# read a dataset of alcohol consumption into a DataFrame, and check the data types
drinks = pd.read_csv('data/drinks.csv')
drinks.dtypes


Out[45]:
country                          object
beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object

In [55]:
# only include numeric columns in the DataFrame
import numpy as np
drinks.select_dtypes(include=[np.number]).dtypes


Out[55]:
beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
dtype: object

Documentation for select_dtypes

Question: How do I know whether I should pass an argument as a string or a list?


In [56]:
# describe all of the numeric columns
drinks.describe()


Out[56]:
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol
count 193.000000 193.000000 193.000000 193.000000
mean 106.160622 80.994819 49.450777 4.717098
std 101.143103 88.284312 79.697598 3.773298
min 0.000000 0.000000 0.000000 0.000000
25% 20.000000 4.000000 1.000000 1.300000
50% 76.000000 56.000000 8.000000 4.200000
75% 188.000000 128.000000 59.000000 7.200000
max 376.000000 438.000000 370.000000 14.400000

In [57]:
# pass the string 'all' to describe all columns
drinks.describe(include='all')


Out[57]:
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
count 193 193.000000 193.000000 193.000000 193.000000 193
unique 193 NaN NaN NaN NaN 6
top Lesotho NaN NaN NaN NaN Africa
freq 1 NaN NaN NaN NaN 53
mean NaN 106.160622 80.994819 49.450777 4.717098 NaN
std NaN 101.143103 88.284312 79.697598 3.773298 NaN
min NaN 0.000000 0.000000 0.000000 0.000000 NaN
25% NaN 20.000000 4.000000 1.000000 1.300000 NaN
50% NaN 76.000000 56.000000 8.000000 4.200000 NaN
75% NaN 188.000000 128.000000 59.000000 7.200000 NaN
max NaN 376.000000 438.000000 370.000000 14.400000 NaN

In [58]:
# pass a list of data types to only describe certain types
drinks.describe(include=['object', 'float64'])


Out[58]:
country total_litres_of_pure_alcohol continent
count 193 193.000000 193
unique 193 NaN 6
top Lesotho NaN Africa
freq 1 NaN 53
mean NaN 4.717098 NaN
std NaN 3.773298 NaN
min NaN 0.000000 NaN
25% NaN 1.300000 NaN
50% NaN 4.200000 NaN
75% NaN 7.200000 NaN
max NaN 14.400000 NaN

In [59]:
# pass a list even if you only want to describe a single data type
drinks.describe(include=['object'])


Out[59]:
country continent
count 193 193
unique 193 6
top Lesotho Africa
freq 1 53

Documentation for describe

[Back to top]

11. How do I use the "axis" parameter in pandas?


In [60]:
# read a dataset of alcohol consumption into a DataFrame
drinks = pd.read_csv('data/drinks.csv')
drinks.head()


Out[60]:
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa

In [61]:
# drop a column (temporarily)
drinks.drop('continent', axis=1).head()


Out[61]:
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol
0 Afghanistan 0 0 0 0.0
1 Albania 89 132 54 4.9
2 Algeria 25 0 14 0.7
3 Andorra 245 138 312 12.4
4 Angola 217 57 45 5.9

Documentation for drop


In [62]:
# drop a row (temporarily)
drinks.drop(2, axis=0).head()


Out[62]:
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa
5 Antigua & Barbuda 102 128 45 4.9 North America

When referring to rows or columns with the axis parameter:

  • axis 0 refers to rows
  • axis 1 refers to columns

In [46]:
# calculate the mean of each numeric column (numeric_only=True is required in pandas 2.0+ for mixed frames)
drinks.mean(numeric_only=True)

# or equivalently, specify the axis explicitly
drinks.mean(axis=0, numeric_only=True)


Out[46]:
beer_servings                   106.160622
spirit_servings                  80.994819
wine_servings                    49.450777
total_litres_of_pure_alcohol      4.717098
dtype: float64

Documentation for mean


In [64]:
# calculate the mean of each row (of the numeric columns only)
drinks.mean(axis=1, numeric_only=True).head()


Out[64]:
0      0.000
1     69.975
2      9.925
3    176.850
4     81.225
dtype: float64

When performing a mathematical operation with the axis parameter:

  • axis 0 means the operation should "move down" the row axis
  • axis 1 means the operation should "move across" the column axis
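The two bullets above can be checked on a toy 2x3 DataFrame (made-up numbers, just for illustration):

```python
import pandas as pd

# hypothetical 2x3 DataFrame for illustration
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# axis=0: the operation "moves down" each column, producing one result per column
col_sums = df.sum(axis=0)
print(col_sums.tolist())  # [3, 7, 11]

# axis=1: the operation "moves across" each row, producing one result per row
row_sums = df.sum(axis=1)
print(row_sums.tolist())  # [9, 12]
```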

In [65]:
# 'index' is an alias for axis 0
drinks.mean(axis='index', numeric_only=True)


Out[65]:
beer_servings                   106.160622
spirit_servings                  80.994819
wine_servings                    49.450777
total_litres_of_pure_alcohol      4.717098
dtype: float64

In [66]:
# 'columns' is an alias for axis 1
drinks.mean(axis='columns', numeric_only=True).head()


Out[66]:
0      0.000
1     69.975
2      9.925
3    176.850
4     81.225
dtype: float64

12. How do I use string methods in pandas?


In [67]:
# read a dataset of Chipotle orders into a DataFrame
orders = pd.read_table('data/chipotle.tsv')
orders.head()


Out[67]:
order_id quantity item_name choice_description item_price
0 1 1 Chips and Fresh Tomato Salsa NaN $2.39
1 1 1 Izze [Clementine] $3.39
2 1 1 Nantucket Nectar [Apple] $3.39
3 1 1 Chips and Tomatillo-Green Chili Salsa NaN $2.39
4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans... $16.98

In [68]:
# normal way to access string methods in Python
'hello'.upper()


Out[68]:
'HELLO'

In [69]:
# string methods for pandas Series are accessed via 'str'
orders.item_name.str.upper().head()


Out[69]:
0             CHIPS AND FRESH TOMATO SALSA
1                                     IZZE
2                         NANTUCKET NECTAR
3    CHIPS AND TOMATILLO-GREEN CHILI SALSA
4                             CHICKEN BOWL
Name: item_name, dtype: object

In [70]:
# string method 'contains' checks for a substring and returns a boolean Series
orders.item_name.str.contains('Chicken').head()


Out[70]:
0    False
1    False
2    False
3    False
4     True
Name: item_name, dtype: bool

In [71]:
# use the boolean Series to filter the DataFrame
orders[orders.item_name.str.contains('Chicken')].head()


Out[71]:
order_id quantity item_name choice_description item_price
4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans... $16.98
5 3 1 Chicken Bowl [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou... $10.98
11 6 1 Chicken Crispy Tacos [Roasted Chili Corn Salsa, [Fajita Vegetables,... $8.75
12 6 1 Chicken Soft Tacos [Roasted Chili Corn Salsa, [Rice, Black Beans,... $8.75
13 7 1 Chicken Bowl [Fresh Tomato Salsa, [Fajita Vegetables, Rice,... $11.25

In [72]:
# string methods can be chained together
orders.choice_description.str.replace('[', '').str.replace(']', '').head()


Out[72]:
0                                                  NaN
1                                           Clementine
2                                                Apple
3                                                  NaN
4    Tomatillo-Red Chili Salsa (Hot), Black Beans, ...
Name: choice_description, dtype: object

In [73]:
# many pandas string methods support regular expressions (regex)
orders.choice_description.str.replace(r'[\[\]]', '', regex=True).head()


Out[73]:
0                                                  NaN
1                                           Clementine
2                                                Apple
3                                                  NaN
4    Tomatillo-Red Chili Salsa (Hot), Black Beans, ...
Name: choice_description, dtype: object

String handling section of the pandas API reference
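Note for newer pandas: since pandas 2.0, str.replace treats the pattern as a literal string unless regex=True is passed, so the character-class version needs that flag. A minimal sketch on made-up values:

```python
import pandas as pd

# hypothetical choice descriptions wrapped in brackets
s = pd.Series(['[Clementine]', '[Apple]'])

# regex=True makes the pattern a regular expression (a character class here)
print(s.str.replace(r'[\[\]]', '', regex=True).tolist())  # ['Clementine', 'Apple']
```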

[Back to top]

13. How do I change the data type of a pandas Series?


In [74]:
# read a dataset of alcohol consumption into a DataFrame
drinks = pd.read_csv('data/drinks.csv')
drinks.head()


Out[74]:
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa

In [75]:
# examine the data type of each Series
drinks.dtypes


Out[75]:
country                          object
beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object

In [76]:
# change the data type of an existing Series
drinks['beer_servings'] = drinks.beer_servings.astype(float)
drinks.dtypes


Out[76]:
country                          object
beer_servings                   float64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object

Documentation for astype


In [77]:
# alternatively, change the data type of a Series while reading in a file
drinks = pd.read_csv('data/drinks.csv', dtype={'beer_servings':float})
drinks.dtypes


Out[77]:
country                          object
beer_servings                   float64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object

In [78]:
# read a dataset of Chipotle orders into a DataFrame
orders = pd.read_table('data/chipotle.tsv')
orders.head()


Out[78]:
order_id quantity item_name choice_description item_price
0 1 1 Chips and Fresh Tomato Salsa NaN $2.39
1 1 1 Izze [Clementine] $3.39
2 1 1 Nantucket Nectar [Apple] $3.39
3 1 1 Chips and Tomatillo-Green Chili Salsa NaN $2.39
4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans... $16.98

In [79]:
# examine the data type of each Series
orders.dtypes


Out[79]:
order_id               int64
quantity               int64
item_name             object
choice_description    object
item_price            object
dtype: object

In [80]:
# convert a string to a number in order to do math
orders.item_price.str.replace('$', '').astype(float).mean()


Out[80]:
7.464335785374397
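pd.to_numeric is an alternative to astype for this kind of conversion; a sketch on made-up price strings (not the full Chipotle column):

```python
import pandas as pd

# hypothetical '$'-prefixed prices like those in the Chipotle data
prices = pd.Series(['$2.39', '$3.39', '$16.98'])

# strip the dollar sign literally, then convert to a numeric dtype
numeric_prices = pd.to_numeric(prices.str.replace('$', '', regex=False))
print(numeric_prices.mean())
```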

In [81]:
# string method 'contains' checks for a substring and returns a boolean Series
orders.item_name.str.contains('Chicken').head()


Out[81]:
0    False
1    False
2    False
3    False
4     True
Name: item_name, dtype: bool

In [82]:
# convert a boolean Series to an integer (False = 0, True = 1)
orders.item_name.str.contains('Chicken').astype(int).head()


Out[82]:
0    0
1    0
2    0
3    0
4    1
Name: item_name, dtype: int32

14. When should I use a "groupby" in pandas?


In [83]:
# read a dataset of alcohol consumption into a DataFrame
drinks = pd.read_csv('data/drinks.csv')
drinks.head()


Out[83]:
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa

In [84]:
# calculate the mean beer servings across the entire dataset
drinks.beer_servings.mean()


Out[84]:
106.16062176165804

In [85]:
# calculate the mean beer servings just for countries in Africa
drinks[drinks.continent=='Africa'].beer_servings.mean()


Out[85]:
61.471698113207545

In [86]:
# calculate the mean beer servings for each continent
drinks.groupby('continent').beer_servings.mean()


Out[86]:
continent
Africa            61.471698
Asia              37.045455
Europe           193.777778
North America    145.434783
Oceania           89.687500
South America    175.083333
Name: beer_servings, dtype: float64

Documentation for groupby
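The groupby above follows the split-apply-combine pattern: split the rows by continent, apply the mean to each subset, combine the results into one Series. A sketch on a toy frame (made-up numbers) showing the equivalence:

```python
import pandas as pd

# toy stand-in for the drinks data
df = pd.DataFrame({'continent': ['Africa', 'Africa', 'Europe'],
                   'beer_servings': [10, 30, 100]})

# manual split-apply-combine: filter each group, then aggregate the subset
manual = {c: df[df.continent == c].beer_servings.mean()
          for c in df.continent.unique()}

# groupby does the same thing in one step
grouped = df.groupby('continent').beer_servings.mean()

print(manual)             # {'Africa': 20.0, 'Europe': 100.0}
print(grouped.to_dict())  # identical
```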


In [87]:
# other aggregation functions (such as 'max') can also be used with groupby
drinks.groupby('continent').beer_servings.max()


Out[87]:
continent
Africa           376
Asia             247
Europe           361
North America    285
Oceania          306
South America    333
Name: beer_servings, dtype: int64

In [88]:
# multiple aggregation functions can be applied simultaneously
drinks.groupby('continent').beer_servings.agg(['count', 'mean', 'min', 'max'])


Out[88]:
count mean min max
continent
Africa 53 61.471698 0 376
Asia 44 37.045455 0 247
Europe 45 193.777778 0 361
North America 23 145.434783 1 285
Oceania 16 89.687500 0 306
South America 12 175.083333 93 333

Documentation for agg


In [89]:
# specifying a column to which the aggregation function should be applied is not required
# (numeric_only=True is needed in pandas 2.0+, where non-numeric columns are no longer dropped silently)
drinks.groupby('continent').mean(numeric_only=True)


Out[89]:
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol
continent
Africa 61.471698 16.339623 16.264151 3.007547
Asia 37.045455 60.840909 9.068182 2.170455
Europe 193.777778 132.555556 142.222222 8.617778
North America 145.434783 165.739130 24.521739 5.995652
Oceania 89.687500 58.437500 35.625000 3.381250
South America 175.083333 114.750000 62.416667 6.308333

In [90]:
# allow plots to appear in the notebook
%matplotlib inline

In [91]:
# side-by-side bar plot of the DataFrame directly above
drinks.groupby('continent').mean(numeric_only=True).plot(kind='bar')


Out[91]:
<matplotlib.axes._subplots.AxesSubplot at 0x9a4fd30>

Documentation for plot

[Back to top]

15. How do I explore a pandas Series?


In [92]:
# read a dataset of top-rated IMDb movies into a DataFrame
movies = pd.read_csv('data/imdb_1000.csv')
movies.head()


Out[92]:
star_rating title content_rating genre duration actors_list
0 9.3 The Shawshank Redemption R Crime 142 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1 9.2 The Godfather R Crime 175 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3 9.0 The Dark Knight PG-13 Action 152 [u'Christian Bale', u'Heath Ledger', u'Aaron E...
4 8.9 Pulp Fiction R Crime 154 [u'John Travolta', u'Uma Thurman', u'Samuel L....

In [93]:
# examine the data type of each Series
movies.dtypes


Out[93]:
star_rating       float64
title              object
content_rating     object
genre              object
duration            int64
actors_list        object
dtype: object

Exploring a non-numeric Series:


In [94]:
# count the non-null values, unique values, and frequency of the most common value
movies.genre.describe()


Out[94]:
count       979
unique       16
top       Drama
freq        278
Name: genre, dtype: object

Documentation for describe


In [95]:
# count how many times each value in the Series occurs
movies.genre.value_counts()


Out[95]:
Drama        278
Comedy       156
Action       136
Crime        124
Biography     77
Adventure     75
Animation     62
Horror        29
Mystery       16
Western        9
Thriller       5
Sci-Fi         5
Film-Noir      3
Family         2
Fantasy        1
History        1
Name: genre, dtype: int64

Documentation for value_counts


In [96]:
# display percentages instead of raw counts
movies.genre.value_counts(normalize=True)


Out[96]:
Drama        0.283963
Comedy       0.159346
Action       0.138917
Crime        0.126660
Biography    0.078652
Adventure    0.076609
Animation    0.063330
Horror       0.029622
Mystery      0.016343
Western      0.009193
Thriller     0.005107
Sci-Fi       0.005107
Film-Noir    0.003064
Family       0.002043
Fantasy      0.001021
History      0.001021
Name: genre, dtype: float64

In [97]:
# 'value_counts' (like many pandas methods) outputs a Series
type(movies.genre.value_counts())


Out[97]:
pandas.core.series.Series

In [98]:
# thus, you can add another Series method on the end
movies.genre.value_counts().head()


Out[98]:
Drama        278
Comedy       156
Action       136
Crime        124
Biography     77
Name: genre, dtype: int64

In [99]:
# display the unique values in the Series
movies.genre.unique()


Out[99]:
array(['Crime', 'Action', 'Drama', 'Western', 'Adventure', 'Biography',
       'Comedy', 'Animation', 'Mystery', 'Horror', 'Film-Noir', 'Sci-Fi',
       'History', 'Thriller', 'Family', 'Fantasy'], dtype=object)

In [100]:
# count the number of unique values in the Series
movies.genre.nunique()


Out[100]:
16

Documentation for unique and nunique


In [101]:
# compute a cross-tabulation of two Series
pd.crosstab(movies.genre, movies.content_rating)


Out[101]:
content_rating APPROVED G GP NC-17 NOT RATED PASSED PG PG-13 R TV-MA UNRATED X
genre
Action 3 1 1 0 4 1 11 44 67 0 3 0
Adventure 3 2 0 0 5 1 21 23 17 0 2 0
Animation 3 20 0 0 3 0 25 5 5 0 1 0
Biography 1 2 1 0 1 0 6 29 36 0 0 0
Comedy 9 2 1 1 16 3 23 23 73 0 4 1
Crime 6 0 0 1 7 1 6 4 87 0 11 1
Drama 12 3 0 4 24 1 25 55 143 1 9 1
Family 0 1 0 0 0 0 1 0 0 0 0 0
Fantasy 0 0 0 0 0 0 0 0 1 0 0 0
Film-Noir 1 0 0 0 1 0 0 0 0 0 1 0
History 0 0 0 0 0 0 0 0 0 0 1 0
Horror 2 0 0 1 1 0 1 2 16 0 5 1
Mystery 4 1 0 0 1 0 1 2 6 0 1 0
Sci-Fi 1 0 0 0 0 0 0 1 3 0 0 0
Thriller 1 0 0 0 0 0 1 0 3 0 0 0
Western 1 0 0 0 2 0 2 1 3 0 0 0

Documentation for crosstab

Exploring a numeric Series:


In [102]:
# calculate various summary statistics
movies.duration.describe()


Out[102]:
count    979.000000
mean     120.979571
std       26.218010
min       64.000000
25%      102.000000
50%      117.000000
75%      134.000000
max      242.000000
Name: duration, dtype: float64

In [103]:
# many statistics are implemented as Series methods
movies.duration.mean()


Out[103]:
120.97957099080695

Documentation for mean


In [104]:
# 'value_counts' is primarily useful for categorical data, not numerical data
movies.duration.value_counts().head()


Out[104]:
112    23
113    22
102    20
101    20
129    19
Name: duration, dtype: int64
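For numeric data, value_counts can bin the values instead of counting each distinct one, via its bins parameter; a sketch on made-up durations:

```python
import pandas as pd

# hypothetical movie durations in minutes
durations = pd.Series([64, 102, 117, 134, 242])

# split the range into 2 equal-width intervals and count values per interval
binned = durations.value_counts(bins=2)
print(binned.tolist())  # [4, 1] -- most of the durations fall in the lower half
```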

In [105]:
# allow plots to appear in the notebook
%matplotlib inline

In [106]:
# histogram of the 'duration' Series (shows the distribution of a numerical variable)
movies.duration.plot(kind='hist')


Out[106]:
<matplotlib.axes._subplots.AxesSubplot at 0x9ead668>

In [107]:
# bar plot of the 'value_counts' for the 'genre' Series
movies.genre.value_counts().plot(kind='bar')


Out[107]:
<matplotlib.axes._subplots.AxesSubplot at 0xa1a6240>

Documentation for plot

[Back to top]

16. How do I handle missing values in pandas?


In [108]:
# read a dataset of UFO reports into a DataFrame
ufo = pd.read_csv('data/ufo.csv')
ufo.tail()


Out[108]:
City Colors Reported Shape Reported State Time
18236 Grant Park NaN TRIANGLE IL 12/31/2000 23:00
18237 Spirit Lake NaN DISK IA 12/31/2000 23:00
18238 Eagle River NaN NaN WI 12/31/2000 23:45
18239 Eagle River RED LIGHT WI 12/31/2000 23:45
18240 Ybor NaN OVAL FL 12/31/2000 23:59

What does "NaN" mean?

  • "NaN" is not a string, rather it's a special value: numpy.nan.
  • It stands for "Not a Number" and indicates a missing value.
  • read_csv detects missing values (by default) when reading the file, and replaces them with this special value.

Documentation for read_csv
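A minimal sketch of those properties of numpy.nan:

```python
import numpy as np
import pandas as pd

# NaN is a float value, not the string 'NaN'
print(type(np.nan))       # <class 'float'>

# NaN is not equal to anything, including itself, so use isnull/isna to detect it
print(np.nan == np.nan)   # False
print(pd.isnull(np.nan))  # True
print(pd.isnull('NaN'))   # False -- the string is an ordinary value
```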


In [109]:
# 'isnull' returns a DataFrame of booleans (True if missing, False if not missing)
ufo.isnull().tail()


Out[109]:
City Colors Reported Shape Reported State Time
18236 False True False False False
18237 False True False False False
18238 False True True False False
18239 False False False False False
18240 False True False False False

In [110]:
# 'notnull' returns the opposite of 'isnull' (True if not missing, False if missing)
ufo.notnull().tail()


Out[110]:
City Colors Reported Shape Reported State Time
18236 True False True True True
18237 True False True True True
18238 True False False True True
18239 True True True True True
18240 True False True True True

Documentation for isnull and notnull


In [111]:
# count the number of missing values in each Series
ufo.isnull().sum()


Out[111]:
City                  25
Colors Reported    15359
Shape Reported      2644
State                  0
Time                   0
dtype: int64

This calculation works because:

  1. The sum method for a DataFrame operates on axis=0 by default (and thus produces column sums).
  2. In order to add boolean values, pandas converts True to 1 and False to 0.
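Both facts can be checked directly on toy values:

```python
import pandas as pd

# booleans behave as 1/0 in arithmetic, so summing a boolean Series counts the Trues
bools = pd.Series([True, False, True, True])
print(bools.sum())  # 3

# DataFrame.sum defaults to axis=0, producing one result per column
df = pd.DataFrame({'x': [True, False], 'y': [False, False]})
print(df.sum().tolist())  # [1, 0]
```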

In [112]:
# use the 'isnull' Series method to filter the DataFrame rows
ufo[ufo.City.isnull()].head()


Out[112]:
City Colors Reported Shape Reported State Time
21 NaN NaN NaN LA 8/15/1943 0:00
22 NaN NaN LIGHT LA 8/15/1943 0:00
204 NaN NaN DISK CA 7/15/1952 12:30
241 NaN BLUE DISK MT 7/4/1953 14:00
613 NaN NaN DISK NV 7/1/1960 12:00

How to handle missing values depends on the dataset as well as the nature of your analysis. Here are some options:


In [113]:
# examine the number of rows and columns
ufo.shape


Out[113]:
(18241, 5)

In [114]:
# if 'any' values are missing in a row, then drop that row
ufo.dropna(how='any').shape


Out[114]:
(2486, 5)

Documentation for dropna


In [115]:
# 'inplace' parameter for 'dropna' is False by default, thus rows were only dropped temporarily
ufo.shape


Out[115]:
(18241, 5)

In [116]:
# if 'all' values are missing in a row, then drop that row (none are dropped in this case)
ufo.dropna(how='all').shape


Out[116]:
(18241, 5)

In [117]:
# if 'any' values are missing in a row (considering only 'City' and 'Shape Reported'), then drop that row
ufo.dropna(subset=['City', 'Shape Reported'], how='any').shape


Out[117]:
(15576, 5)

In [118]:
# if 'all' values are missing in a row (considering only 'City' and 'Shape Reported'), then drop that row
ufo.dropna(subset=['City', 'Shape Reported'], how='all').shape


Out[118]:
(18237, 5)

In [119]:
# 'value_counts' does not include missing values by default
ufo['Shape Reported'].value_counts().head()


Out[119]:
LIGHT       2803
DISK        2122
TRIANGLE    1889
OTHER       1402
CIRCLE      1365
Name: Shape Reported, dtype: int64

In [120]:
# explicitly include missing values
ufo['Shape Reported'].value_counts(dropna=False).head()


Out[120]:
LIGHT       2803
NaN         2644
DISK        2122
TRIANGLE    1889
OTHER       1402
Name: Shape Reported, dtype: int64

Documentation for value_counts


In [121]:
# fill in missing values with a specified value
ufo['Shape Reported'] = ufo['Shape Reported'].fillna(value='VARIOUS')

Documentation for fillna


In [122]:
# confirm that the missing values were filled in
ufo['Shape Reported'].value_counts().head()


Out[122]:
VARIOUS     2977
LIGHT       2803
DISK        2122
TRIANGLE    1889
OTHER       1402
Name: Shape Reported, dtype: int64

17. What do I need to know about the pandas index?


In [123]:
# read a dataset of alcohol consumption into a DataFrame
drinks = pd.read_csv('data/drinks.csv')
drinks.head()


Out[123]:
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa

In [124]:
# every DataFrame has an index (sometimes called the "row labels")
drinks.index


Out[124]:
RangeIndex(start=0, stop=193, step=1)

In [125]:
# column names are also stored in a special "index" object
drinks.columns


Out[125]:
Index([u'country', u'beer_servings', u'spirit_servings', u'wine_servings',
       u'total_litres_of_pure_alcohol', u'continent'],
      dtype='object')

In [126]:
# neither the index nor the columns are included in the shape
drinks.shape


Out[126]:
(193, 6)

In [127]:
# index and columns both default to integers if you don't define them
pd.read_table('data/u.user', header=None, sep='|').head()


Out[127]:
0 1 2 3 4
0 1 24 M technician 85711
1 2 53 F other 94043
2 3 23 M writer 32067
3 4 24 M technician 43537
4 5 33 F other 15213

What is the index used for?

  1. identification
  2. selection
  3. alignment (covered in the next video)
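A minimal sketch of identification and selection, using toy values:

```python
import pandas as pd

df = pd.DataFrame({'val': [10, 20, 30]})

# identification: filtering keeps the original row labels
subset = df[df.val > 10]
print(subset.index.tolist())  # [1, 2]

# selection: those labels still address the same rows in the original frame
print(df.loc[2, 'val'])  # 30
```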

In [128]:
# identification: index remains with each row when filtering the DataFrame
drinks[drinks.continent=='South America']


Out[128]:
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
6 Argentina 193 25 221 8.3 South America
20 Bolivia 167 41 8 3.8 South America
23 Brazil 245 145 16 7.2 South America
35 Chile 130 124 172 7.6 South America
37 Colombia 159 76 3 4.2 South America
52 Ecuador 162 74 3 4.2 South America
72 Guyana 93 302 1 7.1 South America
132 Paraguay 213 117 74 7.3 South America
133 Peru 163 160 21 6.1 South America
163 Suriname 128 178 7 5.6 South America
185 Uruguay 115 35 220 6.6 South America
188 Venezuela 333 100 3 7.7 South America

In [129]:
# selection: select a portion of the DataFrame using the index
drinks.loc[23, 'beer_servings']


Out[129]:
245

Documentation for loc


In [130]:
# set an existing column as the index
drinks.set_index('country', inplace=True)
drinks.head()


Out[130]:
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
country
Afghanistan 0 0 0 0.0 Asia
Albania 89 132 54 4.9 Europe
Algeria 25 0 14 0.7 Africa
Andorra 245 138 312 12.4 Europe
Angola 217 57 45 5.9 Africa

Documentation for set_index


In [131]:
# 'country' is now the index
drinks.index


Out[131]:
Index([u'Afghanistan', u'Albania', u'Algeria', u'Andorra', u'Angola',
       u'Antigua & Barbuda', u'Argentina', u'Armenia', u'Australia',
       u'Austria',
       ...
       u'Tanzania', u'USA', u'Uruguay', u'Uzbekistan', u'Vanuatu',
       u'Venezuela', u'Vietnam', u'Yemen', u'Zambia', u'Zimbabwe'],
      dtype='object', name=u'country', length=193)

In [132]:
# 'country' is no longer a column
drinks.columns


Out[132]:
Index([u'beer_servings', u'spirit_servings', u'wine_servings',
       u'total_litres_of_pure_alcohol', u'continent'],
      dtype='object')

In [133]:
# 'country' data is no longer part of the DataFrame contents
drinks.shape


Out[133]:
(193, 5)

In [134]:
# country name can now be used for selection
drinks.loc['Brazil', 'beer_servings']


Out[134]:
245

In [135]:
# index name is optional
drinks.index.name = None
drinks.head()


Out[135]:
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
Afghanistan 0 0 0 0.0 Asia
Albania 89 132 54 4.9 Europe
Algeria 25 0 14 0.7 Africa
Andorra 245 138 312 12.4 Europe
Angola 217 57 45 5.9 Africa

In [136]:
# restore the index name, and move the index back to a column
drinks.index.name = 'country'
drinks.reset_index(inplace=True)
drinks.head()


Out[136]:
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa

Documentation for reset_index


In [137]:
# many DataFrame methods output a DataFrame
drinks.describe()


Out[137]:
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol
count 193.000000 193.000000 193.000000 193.000000
mean 106.160622 80.994819 49.450777 4.717098
std 101.143103 88.284312 79.697598 3.773298
min 0.000000 0.000000 0.000000 0.000000
25% 20.000000 4.000000 1.000000 1.300000
50% 76.000000 56.000000 8.000000 4.200000
75% 188.000000 128.000000 59.000000 7.200000
max 376.000000 438.000000 370.000000 14.400000

In [138]:
# you can interact with any DataFrame using its index and columns
drinks.describe().loc['25%', 'beer_servings']


Out[138]:
20.0

18. What do I need to know about the pandas index? (Part 2)


In [139]:
# read a dataset of alcohol consumption into a DataFrame
drinks = pd.read_csv('data/drinks.csv')
drinks.head()


Out[139]:
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa

In [140]:
# every DataFrame has an index
drinks.index


Out[140]:
RangeIndex(start=0, stop=193, step=1)

In [141]:
# every Series also has an index (which carries over from the DataFrame)
drinks.continent.head()


Out[141]:
0      Asia
1    Europe
2    Africa
3    Europe
4    Africa
Name: continent, dtype: object

In [142]:
# set 'country' as the index
drinks.set_index('country', inplace=True)

Documentation for set_index


In [143]:
# Series index is on the left, values are on the right
drinks.continent.head()


Out[143]:
country
Afghanistan      Asia
Albania        Europe
Algeria        Africa
Andorra        Europe
Angola         Africa
Name: continent, dtype: object

In [144]:
# another example of a Series (output from the 'value_counts' method)
drinks.continent.value_counts()


Out[144]:
Africa           53
Europe           45
Asia             44
North America    23
Oceania          16
South America    12
Name: continent, dtype: int64

Documentation for value_counts


In [145]:
# access the Series index
drinks.continent.value_counts().index


Out[145]:
Index([u'Africa', u'Europe', u'Asia', u'North America', u'Oceania',
       u'South America'],
      dtype='object')

In [146]:
# access the Series values
drinks.continent.value_counts().values


Out[146]:
array([53, 45, 44, 23, 16, 12], dtype=int64)

In [147]:
# elements in a Series can be selected by index (using bracket notation)
drinks.continent.value_counts()['Africa']


Out[147]:
53

In [148]:
# any Series can be sorted by its values
drinks.continent.value_counts().sort_values()


Out[148]:
South America    12
Oceania          16
North America    23
Asia             44
Europe           45
Africa           53
Name: continent, dtype: int64

In [149]:
# any Series can also be sorted by its index
drinks.continent.value_counts().sort_index()


Out[149]:
Africa           53
Asia             44
Europe           45
North America    23
Oceania          16
South America    12
Name: continent, dtype: int64

Documentation for sort_values and sort_index

What is the index used for?

  1. identification (covered in the previous video)
  2. selection (covered in the previous video)
  3. alignment

In [150]:
# 'beer_servings' Series contains the average annual beer servings per person
drinks.beer_servings.head()


Out[150]:
country
Afghanistan      0
Albania         89
Algeria         25
Andorra        245
Angola         217
Name: beer_servings, dtype: int64

In [151]:
# create a Series containing the population of two countries
people = pd.Series([3000000, 85000], index=['Albania', 'Andorra'], name='population')
people


Out[151]:
Albania    3000000
Andorra      85000
Name: population, dtype: int64

Documentation for Series


In [152]:
# calculate the total annual beer servings for each country
(drinks.beer_servings * people).head()


Out[152]:
Afghanistan            NaN
Albania        267000000.0
Algeria                NaN
Andorra         20825000.0
Angola                 NaN
dtype: float64

  • The two Series were aligned by their indexes.
  • If a value is missing in either Series, the result is marked as NaN.
  • Alignment enables us to easily work with incomplete data.
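If the NaN results are unwanted, the arithmetic methods accept a fill_value for labels missing from one side; a sketch with made-up numbers:

```python
import pandas as pd

beer = pd.Series([89, 245, 25], index=['Albania', 'Andorra', 'Algeria'])
population = pd.Series([3000000, 85000], index=['Albania', 'Andorra'])

# plain multiplication aligns by index; labels missing from either side become NaN
print((beer * population)['Algeria'])  # nan

# 'mul' with fill_value substitutes a default for the missing side instead
print(beer.mul(population, fill_value=0)['Algeria'])  # 0.0
```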

In [153]:
# concatenate the 'drinks' DataFrame with the 'population' Series (aligns by the index)
pd.concat([drinks, people], axis=1).head()


Out[153]:
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent population
Afghanistan 0 0 0 0.0 Asia NaN
Albania 89 132 54 4.9 Europe 3000000.0
Algeria 25 0 14 0.7 Africa NaN
Andorra 245 138 312 12.4 Europe 85000.0
Angola 217 57 45 5.9 Africa NaN

19. How do I select multiple rows and columns from a pandas DataFrame?


In [154]:
# read a dataset of UFO reports into a DataFrame
ufo = pd.read_csv('data/ufo.csv')
ufo.head(3)


Out[154]:
City Colors Reported Shape Reported State Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00

The loc method is used to select rows and columns by label. You can pass it:

  • A single label
  • A list of labels
  • A slice of labels
  • A boolean Series
  • A colon (which indicates "all labels")

In [155]:
# row 0, all columns
ufo.loc[0, :]


Out[155]:
City                       Ithaca
Colors Reported               NaN
Shape Reported           TRIANGLE
State                          NY
Time               6/1/1930 22:00
Name: 0, dtype: object

In [156]:
# rows 0 and 1 and 2, all columns
ufo.loc[[0, 1, 2], :]


Out[156]:
City Colors Reported Shape Reported State Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00

In [157]:
# rows 0 through 2 (inclusive), all columns
ufo.loc[0:2, :]


Out[157]:
City Colors Reported Shape Reported State Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00

In [158]:
# this implies "all columns", but explicitly stating "all columns" is better
ufo.loc[0:2]


Out[158]:
City Colors Reported Shape Reported State Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00

In [159]:
# rows 0 through 2 (inclusive), column 'City'
ufo.loc[0:2, 'City']


Out[159]:
0         Ithaca
1    Willingboro
2        Holyoke
Name: City, dtype: object

In [160]:
# rows 0 through 2 (inclusive), columns 'City' and 'State'
ufo.loc[0:2, ['City', 'State']]


Out[160]:
City State
0 Ithaca NY
1 Willingboro NJ
2 Holyoke CO

In [161]:
# accomplish the same thing using double brackets - but using 'loc' is preferred since it's more explicit
ufo[['City', 'State']].head(3)


Out[161]:
City State
0 Ithaca NY
1 Willingboro NJ
2 Holyoke CO

In [162]:
# rows 0 through 2 (inclusive), columns 'City' through 'State' (inclusive)
ufo.loc[0:2, 'City':'State']


Out[162]:
City Colors Reported Shape Reported State
0 Ithaca NaN TRIANGLE NY
1 Willingboro NaN OTHER NJ
2 Holyoke NaN OVAL CO

In [163]:
# accomplish the same thing using 'head' and 'drop'
ufo.head(3).drop('Time', axis=1)


Out[163]:
City Colors Reported Shape Reported State
0 Ithaca NaN TRIANGLE NY
1 Willingboro NaN OTHER NJ
2 Holyoke NaN OVAL CO

In [164]:
# rows in which the 'City' is 'Oakland', column 'State'
ufo.loc[ufo.City=='Oakland', 'State']


Out[164]:
1694     CA
2144     CA
4686     MD
7293     CA
8488     CA
8768     CA
10816    OR
10948    CA
11045    CA
12322    CA
12941    CA
16803    MD
17322    CA
Name: State, dtype: object

In [165]:
# accomplish the same thing using "chained indexing" - but using 'loc' is preferred since chained indexing can cause problems
ufo[ufo.City=='Oakland'].State


Out[165]:
1694     CA
2144     CA
4686     MD
7293     CA
8488     CA
8768     CA
10816    OR
10948    CA
11045    CA
12322    CA
12941    CA
16803    MD
17322    CA
Name: State, dtype: object
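The contrast above can be sketched on a tiny stand-in DataFrame (hypothetical values, not the real UFO data): both forms read the same values, but only the 'loc' form is safe when you later want to assign to the selection.

```python
import pandas as pd

# tiny stand-in for the UFO data (hypothetical values)
df = pd.DataFrame({'City': ['Oakland', 'Ithaca', 'Oakland'],
                   'State': ['CA', 'NY', 'MD']})

# both expressions select the same values...
chained = df[df.City == 'Oakland'].State
explicit = df.loc[df.City == 'Oakland', 'State']
assert chained.equals(explicit)

# ...but only the 'loc' form is guaranteed to modify the original DataFrame
df.loc[df.City == 'Oakland', 'State'] = 'XX'
```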

The iloc method is used to select rows and columns by integer position. You can pass it:

  • A single integer position
  • A list of integer positions
  • A slice of integer positions
  • A colon (which indicates "all integer positions")

In [166]:
# rows in positions 0 and 1, columns in positions 0 and 3
ufo.iloc[[0, 1], [0, 3]]


Out[166]:
City State
0 Ithaca NY
1 Willingboro NJ

In [167]:
# rows in positions 0 through 2 (exclusive), columns in positions 0 through 4 (exclusive)
ufo.iloc[0:2, 0:4]


Out[167]:
City Colors Reported Shape Reported State
0 Ithaca NaN TRIANGLE NY
1 Willingboro NaN OTHER NJ

In [168]:
# rows in positions 0 through 2 (exclusive), all columns
ufo.iloc[0:2, :]


Out[168]:
City Colors Reported Shape Reported State Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00

In [169]:
# accomplish the same thing - but using 'iloc' is preferred since it's more explicit
ufo[0:2]


Out[169]:
City Colors Reported Shape Reported State Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00

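The four iloc input types listed above can be demonstrated together on a small hypothetical DataFrame:

```python
import pandas as pd

# small hypothetical DataFrame for illustration
df = pd.DataFrame({'a': [10, 20, 30], 'b': [1, 2, 3], 'c': [7, 8, 9]})

assert df.iloc[0, 0] == 10                    # single integer position
assert list(df.iloc[[0, 2], 0]) == [10, 30]   # list of integer positions
assert df.iloc[0:2, :].shape == (2, 3)        # slice (right-exclusive) and colon
```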
The ix method is used to select rows and columns by label or integer position, and should only be used when you need to mix label-based and integer-based selection in the same call. (Note: ix was deprecated in pandas 0.20 and removed in pandas 1.0, so prefer loc and iloc in current code.)


In [170]:
# read a dataset of alcohol consumption into a DataFrame and set 'country' as the index
drinks = pd.read_csv('data/drinks.csv', index_col='country')
drinks.head()


Out[170]:
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
country
Afghanistan 0 0 0 0.0 Asia
Albania 89 132 54 4.9 Europe
Algeria 25 0 14 0.7 Africa
Andorra 245 138 312 12.4 Europe
Angola 217 57 45 5.9 Africa

In [171]:
# row with label 'Albania', column in position 0
drinks.ix['Albania', 0]


Out[171]:
89

In [172]:
# row in position 1, column with label 'beer_servings'
drinks.ix[1, 'beer_servings']


Out[172]:
89

Rules for using numbers with ix:

  • If the index is strings, numbers are treated as integer positions, and thus slices are exclusive of the stop value.
  • If the index is integers, numbers are treated as labels, and thus slices are inclusive of the stop value.

In [173]:
# rows 'Albania' through 'Andorra' (inclusive), columns in positions 0 through 2 (exclusive)
drinks.ix['Albania':'Andorra', 0:2]


Out[173]:
beer_servings spirit_servings
country
Albania 89 132
Algeria 25 0
Andorra 245 138

In [174]:
# rows 0 through 2 (inclusive), columns in positions 0 through 2 (exclusive)
ufo.ix[0:2, 0:2]


Out[174]:
City Colors Reported
0 Ithaca NaN
1 Willingboro NaN
2 Holyoke NaN

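Since ix is gone from current pandas, the mixed selections above can be rewritten with loc or iloc by translating one side explicitly. A minimal sketch on a hypothetical two-row version of the drinks data:

```python
import pandas as pd

# hypothetical stand-in for the drinks DataFrame
df = pd.DataFrame({'beer_servings': [89, 25], 'wine_servings': [54, 14]},
                  index=['Albania', 'Algeria'])

# old: df.ix['Albania', 0]  (label for the row, position for the column)
# modern equivalent: translate the positional part into a label...
value = df.loc['Albania', df.columns[0]]
assert value == 89

# ...or translate the label part into a position
value2 = df.iloc[df.index.get_loc('Albania'), 0]
assert value2 == 89
```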
20. When should I use the "inplace" parameter in pandas?


In [175]:
# read a dataset of UFO reports into a DataFrame
ufo = pd.read_csv('data/ufo.csv')
ufo.head()


Out[175]:
City Colors Reported Shape Reported State Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00
3 Abilene NaN DISK KS 6/1/1931 13:00
4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00

In [176]:
ufo.shape


Out[176]:
(18241, 5)

In [177]:
# remove the 'City' column (doesn't affect the DataFrame since inplace=False)
ufo.drop('City', axis=1).head()


Out[177]:
Colors Reported Shape Reported State Time
0 NaN TRIANGLE NY 6/1/1930 22:00
1 NaN OTHER NJ 6/30/1930 20:00
2 NaN OVAL CO 2/15/1931 14:00
3 NaN DISK KS 6/1/1931 13:00
4 NaN LIGHT NY 4/18/1933 19:00

In [178]:
# confirm that the 'City' column was not actually removed
ufo.head()


Out[178]:
City Colors Reported Shape Reported State Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00
3 Abilene NaN DISK KS 6/1/1931 13:00
4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00

In [179]:
# remove the 'City' column (does affect the DataFrame since inplace=True)
ufo.drop('City', axis=1, inplace=True)

In [180]:
# confirm that the 'City' column was actually removed
ufo.head()


Out[180]:
Colors Reported Shape Reported State Time
0 NaN TRIANGLE NY 6/1/1930 22:00
1 NaN OTHER NJ 6/30/1930 20:00
2 NaN OVAL CO 2/15/1931 14:00
3 NaN DISK KS 6/1/1931 13:00
4 NaN LIGHT NY 4/18/1933 19:00

In [181]:
# drop a row if any value is missing from that row (doesn't affect the DataFrame since inplace=False)
ufo.dropna(how='any').shape


Out[181]:
(2490, 4)

In [182]:
# confirm that no rows were actually removed
ufo.shape


Out[182]:
(18241, 4)

In [183]:
# use an assignment statement instead of the 'inplace' parameter
ufo = ufo.set_index('Time')
ufo.tail()


Out[183]:
Colors Reported Shape Reported State
Time
12/31/2000 23:00 NaN TRIANGLE IL
12/31/2000 23:00 NaN DISK IA
12/31/2000 23:45 NaN NaN WI
12/31/2000 23:45 RED LIGHT WI
12/31/2000 23:59 NaN OVAL FL

In [184]:
# fill missing values using "backward fill" strategy (doesn't affect the DataFrame since inplace=False)
ufo.fillna(method='bfill').tail()


Out[184]:
Colors Reported Shape Reported State
Time
12/31/2000 23:00 RED TRIANGLE IL
12/31/2000 23:00 RED DISK IA
12/31/2000 23:45 RED LIGHT WI
12/31/2000 23:45 RED LIGHT WI
12/31/2000 23:59 NaN OVAL FL

In [185]:
# compare with "forward fill" strategy (doesn't affect the DataFrame since inplace=False)
ufo.fillna(method='ffill').tail()


Out[185]:
Colors Reported Shape Reported State
Time
12/31/2000 23:00 RED TRIANGLE IL
12/31/2000 23:00 RED DISK IA
12/31/2000 23:45 RED DISK WI
12/31/2000 23:45 RED LIGHT WI
12/31/2000 23:59 RED OVAL FL

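The two patterns used throughout this section — inplace=True versus rebinding the name with an assignment — produce identical results; a minimal sketch on a hypothetical DataFrame:

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'x': [1.0, np.nan, 3.0]})
df2 = df1.copy()

df1.fillna(0, inplace=True)   # modifies df1 in place, returns None
df2 = df2.fillna(0)           # returns a new DataFrame; rebind the name

# both approaches yield the same result
assert df1.equals(df2)
```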
21. How do I make my pandas DataFrame smaller and faster?


In [186]:
# read a dataset of alcohol consumption into a DataFrame
drinks = pd.read_csv('data/drinks.csv')
drinks.head()


Out[186]:
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa

In [187]:
# exact memory usage is unknown because object columns hold references to strings stored elsewhere in memory
drinks.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
country                         193 non-null object
beer_servings                   193 non-null int64
spirit_servings                 193 non-null int64
wine_servings                   193 non-null int64
total_litres_of_pure_alcohol    193 non-null float64
continent                       193 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 9.1+ KB

In [188]:
# force pandas to calculate the true memory usage
drinks.info(memory_usage='deep')


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
country                         193 non-null object
beer_servings                   193 non-null int64
spirit_servings                 193 non-null int64
wine_servings                   193 non-null int64
total_litres_of_pure_alcohol    193 non-null float64
continent                       193 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 24.4 KB

In [189]:
# calculate the memory usage for each Series (in bytes)
drinks.memory_usage(deep=True)


Out[189]:
Index                             72
country                         9500
beer_servings                   1544
spirit_servings                 1544
wine_servings                   1544
total_litres_of_pure_alcohol    1544
continent                       9244
dtype: int64

Documentation for info and memory_usage


In [190]:
# use the 'category' data type (new in pandas 0.15) to store the 'continent' strings as integers
drinks['continent'] = drinks.continent.astype('category')
drinks.dtypes


Out[190]:
country                           object
beer_servings                      int64
spirit_servings                    int64
wine_servings                      int64
total_litres_of_pure_alcohol     float64
continent                       category
dtype: object

In [191]:
# 'continent' Series appears to be unchanged
drinks.continent.head()


Out[191]:
0      Asia
1    Europe
2    Africa
3    Europe
4    Africa
Name: continent, dtype: category
Categories (6, object): [Africa, Asia, Europe, North America, Oceania, South America]

In [192]:
# strings are now encoded (0 means 'Africa', 1 means 'Asia', 2 means 'Europe', etc.)
drinks.continent.cat.codes.head()


Out[192]:
0    1
1    2
2    0
3    2
4    0
dtype: int8

In [193]:
# memory usage has been drastically reduced
drinks.memory_usage(deep=True)


Out[193]:
Index                             72
country                         9500
beer_servings                   1544
spirit_servings                 1544
wine_servings                   1544
total_litres_of_pure_alcohol    1544
continent                        488
dtype: int64

In [194]:
# repeat this process for the 'country' Series
drinks['country'] = drinks.country.astype('category')
drinks.memory_usage(deep=True)


Out[194]:
Index                             72
country                         9886
beer_servings                   1544
spirit_servings                 1544
wine_servings                   1544
total_litres_of_pure_alcohol    1544
continent                        488
dtype: int64

In [195]:
# memory usage increased because we created 193 categories
drinks.country.cat.categories


Out[195]:
Index([u'Afghanistan', u'Albania', u'Algeria', u'Andorra', u'Angola',
       u'Antigua & Barbuda', u'Argentina', u'Armenia', u'Australia',
       u'Austria',
       ...
       u'United Arab Emirates', u'United Kingdom', u'Uruguay', u'Uzbekistan',
       u'Vanuatu', u'Venezuela', u'Vietnam', u'Yemen', u'Zambia', u'Zimbabwe'],
      dtype='object', length=193)

The category data type should only be used with a string Series that has a small number of possible values.

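The memory savings from the category type are easy to verify on a synthetic low-cardinality Series (hypothetical data, sized to make the effect obvious):

```python
import pandas as pd

# a string Series with only 3 distinct values, repeated many times
s = pd.Series(['Africa', 'Asia', 'Europe'] * 1000)
s_cat = s.astype('category')

before = s.memory_usage(deep=True)
after = s_cat.memory_usage(deep=True)

# small integer codes plus 3 stored strings use far less memory
# than 3000 separate string references
assert after < before
```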

In [196]:
# create a small DataFrame from a dictionary
df = pd.DataFrame({'ID':[100, 101, 102, 103], 'quality':['good', 'very good', 'good', 'excellent']})
df


Out[196]:
ID quality
0 100 good
1 101 very good
2 102 good
3 103 excellent

In [197]:
# sort the DataFrame by the 'quality' Series (alphabetical order)
df.sort_values('quality')


Out[197]:
ID quality
3 103 excellent
0 100 good
2 102 good
1 101 very good

In [198]:
# define a logical ordering for the categories
df['quality'] = df.quality.astype('category', categories=['good', 'very good', 'excellent'], ordered=True)
df.quality


Out[198]:
0         good
1    very good
2         good
3    excellent
Name: quality, dtype: category
Categories (3, object): [good < very good < excellent]

In [199]:
# sort the DataFrame by the 'quality' Series (logical order)
df.sort_values('quality')


Out[199]:
ID quality
0 100 good
2 102 good
1 101 very good
3 103 excellent

In [200]:
# comparison operators work with ordered categories
df.loc[df.quality > 'good', :]


Out[200]:
ID quality
1 101 very good
3 103 excellent

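The astype('category', categories=..., ordered=True) signature used above was later removed from pandas; CategoricalDtype is the current equivalent. A sketch of the same ordered-quality example:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({'quality': ['good', 'very good', 'good', 'excellent']})

# define the logical ordering with a CategoricalDtype object
quality_dtype = CategoricalDtype(categories=['good', 'very good', 'excellent'],
                                 ordered=True)
df['quality'] = df.quality.astype(quality_dtype)

# comparison operators respect the logical order, not alphabetical order
assert (df.quality > 'good').sum() == 2
```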
22. How do I use pandas with scikit-learn to create Kaggle submissions?


In [201]:
# read the training dataset from Kaggle's Titanic competition into a DataFrame
train = pd.read_csv('http://bit.ly/kaggletrain')
train.head()


Out[201]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Goal: Predict passenger survival aboard the Titanic based on passenger attributes

Video: What is machine learning, and how does it work?


In [202]:
# create a feature matrix 'X' by selecting two DataFrame columns
feature_cols = ['Pclass', 'Parch']
X = train.loc[:, feature_cols]
X.shape


Out[202]:
(891, 2)

In [203]:
# create a response vector 'y' by selecting a Series
y = train.Survived
y.shape


Out[203]:
(891L,)

Note: There is no need to convert these pandas objects to NumPy arrays. scikit-learn will understand these objects as long as they are entirely numeric and have the proper shapes.


In [204]:
# fit a classification model to the training data
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X, y)


Out[204]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [205]:
# read the testing dataset from Kaggle's Titanic competition into a DataFrame
test = pd.read_csv('http://bit.ly/kaggletest')
test.head()


Out[205]:
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S

In [206]:
# create a feature matrix from the testing data that matches the training data
X_new = test.loc[:, feature_cols]
X_new.shape


Out[206]:
(418, 2)

In [207]:
# use the fitted model to make predictions for the testing set observations
new_pred_class = logreg.predict(X_new)

In [208]:
# create a DataFrame of passenger IDs and testing set predictions
pd.DataFrame({'PassengerId':test.PassengerId, 'Survived':new_pred_class}).head()


Out[208]:
PassengerId Survived
0 892 0
1 893 0
2 894 0
3 895 0
4 896 0

Documentation for the DataFrame constructor


In [209]:
# ensure that PassengerID is the first column by setting it as the index
pd.DataFrame({'PassengerId':test.PassengerId, 'Survived':new_pred_class}).set_index('PassengerId').head()


Out[209]:
Survived
PassengerId
892 0
893 0
894 0
895 0
896 0

In [210]:
# write the DataFrame to a CSV file that can be submitted to Kaggle
pd.DataFrame({'PassengerId':test.PassengerId, 'Survived':new_pred_class}).set_index('PassengerId').to_csv('sub.csv')

Documentation for to_csv
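A round trip through to_csv and read_csv confirms that set_index puts PassengerId in the first column; calling to_csv with no path returns the CSV text directly (a sketch with hypothetical prediction values):

```python
import pandas as pd
from io import StringIO

# hypothetical submission DataFrame
sub = pd.DataFrame({'PassengerId': [892, 893],
                    'Survived': [0, 1]}).set_index('PassengerId')

# the index name appears as the first CSV column
csv_text = sub.to_csv()
assert csv_text.splitlines()[0] == 'PassengerId,Survived'

# reading it back with index_col recovers the original DataFrame
sub2 = pd.read_csv(StringIO(csv_text), index_col='PassengerId')
assert sub.equals(sub2)
```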


In [211]:
# save a DataFrame to disk ("pickle it")
train.to_pickle('train.pkl')

In [212]:
# read a pickled object from disk ("unpickle it")
pd.read_pickle('train.pkl').head()


Out[212]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Documentation for to_pickle and read_pickle


23. More of your pandas questions answered!

Question: Could you explain how to read the pandas documentation?

pandas API reference

Question: What is the difference between ufo.isnull() and pd.isnull(ufo)?


In [213]:
# read a dataset of UFO reports into a DataFrame
ufo = pd.read_csv('http://bit.ly/uforeports')
ufo.head()


Out[213]:
City Colors Reported Shape Reported State Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00
3 Abilene NaN DISK KS 6/1/1931 13:00
4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00

In [214]:
# use 'isnull' as a top-level function
pd.isnull(ufo).head()


Out[214]:
City Colors Reported Shape Reported State Time
0 False True False False False
1 False True False False False
2 False True False False False
3 False True False False False
4 False True False False False

In [215]:
# equivalent: use 'isnull' as a DataFrame method
ufo.isnull().head()


Out[215]:
City Colors Reported Shape Reported State Time
0 False True False False False
1 False True False False False
2 False True False False False
3 False True False False False
4 False True False False False

Documentation for isnull

Question: Why are DataFrame slices inclusive when using .loc, but exclusive when using .iloc?


In [216]:
# label-based slicing is inclusive of the start and stop
ufo.loc[0:4, :]


Out[216]:
City Colors Reported Shape Reported State Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00
3 Abilene NaN DISK KS 6/1/1931 13:00
4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00

In [217]:
# position-based slicing is inclusive of the start and exclusive of the stop
ufo.iloc[0:4, :]


Out[217]:
City Colors Reported Shape Reported State Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00
3 Abilene NaN DISK KS 6/1/1931 13:00

Documentation for loc and iloc


In [218]:
# 'iloc' is simply following NumPy's slicing convention...
ufo.values[0:4, :]


Out[218]:
array([['Ithaca', nan, 'TRIANGLE', 'NY', '6/1/1930 22:00'],
       ['Willingboro', nan, 'OTHER', 'NJ', '6/30/1930 20:00'],
       ['Holyoke', nan, 'OVAL', 'CO', '2/15/1931 14:00'],
       ['Abilene', nan, 'DISK', 'KS', '6/1/1931 13:00']], dtype=object)

In [219]:
# ...and NumPy is simply following Python's slicing convention
'python'[0:4]


Out[219]:
'pyth'

In [220]:
# 'loc' is inclusive of the stopping label because you don't necessarily know what label will come after it
ufo.loc[0:4, 'City':'State']


Out[220]:
City Colors Reported Shape Reported State
0 Ithaca NaN TRIANGLE NY
1 Willingboro NaN OTHER NJ
2 Holyoke NaN OVAL CO
3 Abilene NaN DISK KS
4 New York Worlds Fair NaN LIGHT NY

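The inclusive/exclusive distinction boils down to one line each; a sketch on a hypothetical 10-row DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'x': range(10)})

# label-based slice: the stop label 4 is included
assert len(df.loc[0:4]) == 5

# position-based slice: the stop position 4 is excluded (Python/NumPy convention)
assert len(df.iloc[0:4]) == 4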
Question: How do I randomly sample rows from a DataFrame?


In [221]:
# sample 3 rows from the DataFrame without replacement (new in pandas 0.16.1)
ufo.sample(n=3)


Out[221]:
City Colors Reported Shape Reported State Time
13722 Lincoln City NaN LIGHT OR 6/16/1999 23:00
14564 Pagosa Springs NaN CIGAR CO 9/12/1999 20:55
4176 Chicago NaN VARIOUS IL 8/28/1979 2:45

Documentation for sample


In [222]:
# use the 'random_state' parameter for reproducibility
ufo.sample(n=3, random_state=42)


Out[222]:
City Colors Reported Shape Reported State Time
217 Norridgewock NaN DISK ME 9/15/1952 14:00
12282 Ipava NaN TRIANGLE IL 10/1/1998 21:15
17933 Ellinwood NaN FIREBALL KS 11/13/2000 22:00

In [223]:
# sample 75% of the DataFrame's rows without replacement
train = ufo.sample(frac=0.75, random_state=99)

In [224]:
# store the remaining 25% of the rows in another DataFrame
test = ufo.loc[~ufo.index.isin(train.index), :]

Documentation for isin
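The sample/isin combination above partitions the DataFrame exactly — every row lands in train or test, never both. A sketch on a hypothetical 8-row DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'x': range(8)})

train = df.sample(frac=0.75, random_state=99)   # 6 of the 8 rows
test = df.loc[~df.index.isin(train.index), :]   # the remaining 2 rows

# the two pieces form an exact partition of the original rows
assert len(train) + len(test) == len(df)
assert len(train.index.intersection(test.index)) == 0
```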


24. How do I create dummy variables in pandas?


In [225]:
# read the training dataset from Kaggle's Titanic competition
train = pd.read_csv('http://bit.ly/kaggletrain')
train.head()


Out[225]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

In [226]:
# create the 'Sex_male' dummy variable using the 'map' method
train['Sex_male'] = train.Sex.map({'female':0, 'male':1})
train.head()


Out[226]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Sex_male
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 0
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 0
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 0
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 1

Documentation for map


In [227]:
# alternative: use 'get_dummies' to create one column for every possible value
pd.get_dummies(train.Sex).head()


Out[227]:
female male
0 0.0 1.0
1 1.0 0.0
2 1.0 0.0
3 1.0 0.0
4 0.0 1.0

Generally speaking:

  • If you have "K" possible values for a categorical feature, you only need "K-1" dummy variables to capture all of the information about that feature.
  • One convention is to drop the first dummy variable, which defines that level as the "baseline".

In [228]:
# drop the first dummy variable ('female') using the 'iloc' method
pd.get_dummies(train.Sex).iloc[:, 1:].head()


Out[228]:
male
0 1.0
1 0.0
2 0.0
3 0.0
4 1.0

In [229]:
# add a prefix to identify the source of the dummy variables
pd.get_dummies(train.Sex, prefix='Sex').iloc[:, 1:].head()


Out[229]:
Sex_male
0 1.0
1 0.0
2 0.0
3 0.0
4 1.0

In [230]:
# use 'get_dummies' with a feature that has 3 possible values
pd.get_dummies(train.Embarked, prefix='Embarked').head(10)


Out[230]:
Embarked_C Embarked_Q Embarked_S
0 0.0 0.0 1.0
1 1.0 0.0 0.0
2 0.0 0.0 1.0
3 0.0 0.0 1.0
4 0.0 0.0 1.0
5 0.0 1.0 0.0
6 0.0 0.0 1.0
7 0.0 0.0 1.0
8 0.0 0.0 1.0
9 1.0 0.0 0.0

In [231]:
# drop the first dummy variable ('C')
pd.get_dummies(train.Embarked, prefix='Embarked').iloc[:, 1:].head(10)


Out[231]:
Embarked_Q Embarked_S
0 0.0 1.0
1 0.0 0.0
2 0.0 1.0
3 0.0 1.0
4 0.0 1.0
5 1.0 0.0
6 0.0 1.0
7 0.0 1.0
8 0.0 1.0
9 0.0 0.0

How to translate these values back to the original 'Embarked' value:

  • 0, 0 means C
  • 1, 0 means Q
  • 0, 1 means S

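The decoding rule above — a row of all zeros means the dropped baseline category — can be verified on a hypothetical Embarked column:

```python
import pandas as pd

df = pd.DataFrame({'Embarked': ['S', 'C', 'S', 'Q']})

# drop_first removes 'Embarked_C', making 'C' the baseline
dummies = pd.get_dummies(df, columns=['Embarked'], drop_first=True)
assert list(dummies.columns) == ['Embarked_Q', 'Embarked_S']

# row 1 was 'C', so both remaining dummy columns are zero
baseline_row = dummies.iloc[1]
assert (baseline_row == 0).all()
```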
In [232]:
# save the DataFrame of dummy variables and concatenate them to the original DataFrame
embarked_dummies = pd.get_dummies(train.Embarked, prefix='Embarked').iloc[:, 1:]
train = pd.concat([train, embarked_dummies], axis=1)
train.head()


Out[232]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Sex_male Embarked_Q Embarked_S
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 1 0.0 1.0
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 0 0.0 0.0
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 0 0.0 1.0
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 0 0.0 1.0
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 1 0.0 1.0

Documentation for concat


In [233]:
# reset the DataFrame
train = pd.read_csv('http://bit.ly/kaggletrain')
train.head()


Out[233]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

In [234]:
# pass the DataFrame to 'get_dummies' and specify which columns to dummy (it drops the original columns)
pd.get_dummies(train, columns=['Sex', 'Embarked']).head()


Out[234]:
PassengerId Survived Pclass Name Age SibSp Parch Ticket Fare Cabin Sex_female Sex_male Embarked_C Embarked_Q Embarked_S
0 1 0 3 Braund, Mr. Owen Harris 22.0 1 0 A/5 21171 7.2500 NaN 0.0 1.0 0.0 0.0 1.0
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 38.0 1 0 PC 17599 71.2833 C85 1.0 0.0 1.0 0.0 0.0
2 3 1 3 Heikkinen, Miss. Laina 26.0 0 0 STON/O2. 3101282 7.9250 NaN 1.0 0.0 0.0 0.0 1.0
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0 1 0 113803 53.1000 C123 1.0 0.0 0.0 0.0 1.0
4 5 0 3 Allen, Mr. William Henry 35.0 0 0 373450 8.0500 NaN 0.0 1.0 0.0 0.0 1.0

In [235]:
# use the 'drop_first' parameter (new in pandas 0.18) to drop the first dummy variable for each feature
pd.get_dummies(train, columns=['Sex', 'Embarked'], drop_first=True).head()


Out[235]:
PassengerId Survived Pclass Name Age SibSp Parch Ticket Fare Cabin Sex_male Embarked_Q Embarked_S
0 1 0 3 Braund, Mr. Owen Harris 22.0 1 0 A/5 21171 7.2500 NaN 1.0 0.0 1.0
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 38.0 1 0 PC 17599 71.2833 C85 0.0 0.0 0.0
2 3 1 3 Heikkinen, Miss. Laina 26.0 0 0 STON/O2. 3101282 7.9250 NaN 0.0 0.0 1.0
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0 1 0 113803 53.1000 C123 0.0 0.0 1.0
4 5 0 3 Allen, Mr. William Henry 35.0 0 0 373450 8.0500 NaN 1.0 0.0 1.0

Documentation for get_dummies


25. How do I work with dates and times in pandas?


In [236]:
# read a dataset of UFO reports into a DataFrame
ufo = pd.read_csv('http://bit.ly/uforeports')
ufo.head()


Out[236]:
City Colors Reported Shape Reported State Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00
3 Abilene NaN DISK KS 6/1/1931 13:00
4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00

In [237]:
# 'Time' is currently stored as a string
ufo.dtypes


Out[237]:
City               object
Colors Reported    object
Shape Reported     object
State              object
Time               object
dtype: object

In [238]:
# hour could be accessed using string slicing, but this approach breaks too easily
ufo.Time.str.slice(-5, -3).astype(int).head()


Out[238]:
0    22
1    20
2    14
3    13
4    19
Name: Time, dtype: int32

In [239]:
# convert 'Time' to datetime format
ufo['Time'] = pd.to_datetime(ufo.Time)
ufo.head()


Out[239]:
City Colors Reported Shape Reported State Time
0 Ithaca NaN TRIANGLE NY 1930-06-01 22:00:00
1 Willingboro NaN OTHER NJ 1930-06-30 20:00:00
2 Holyoke NaN OVAL CO 1931-02-15 14:00:00
3 Abilene NaN DISK KS 1931-06-01 13:00:00
4 New York Worlds Fair NaN LIGHT NY 1933-04-18 19:00:00

In [240]:
ufo.dtypes


Out[240]:
City                       object
Colors Reported            object
Shape Reported             object
State                      object
Time               datetime64[ns]
dtype: object

Documentation for to_datetime

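When the string layout is known in advance, passing an explicit format to to_datetime is both faster and safer than letting pandas guess; a sketch using strings shaped like the 'Time' column:

```python
import pandas as pd

s = pd.Series(['6/1/1930 22:00', '6/30/1930 20:00'])

# format string matching the 'Time' strings: month/day/year hour:minute
times = pd.to_datetime(s, format='%m/%d/%Y %H:%M')

assert times.dt.year.tolist() == [1930, 1930]
assert times.dt.hour.tolist() == [22, 20]
```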

In [241]:
# convenient Series attributes are now available
ufo.Time.dt.hour.head()


Out[241]:
0    22
1    20
2    14
3    13
4    19
Name: Time, dtype: int64

In [242]:
ufo.Time.dt.weekday_name.head()


Out[242]:
0     Sunday
1     Monday
2     Sunday
3     Monday
4    Tuesday
Name: Time, dtype: object

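Note that the weekday_name attribute used below was removed in pandas 0.25; in current pandas the equivalent is the day_name() method:

```python
import pandas as pd

times = pd.to_datetime(pd.Series(['1930-06-01 22:00']))

# modern replacement for the removed 'dt.weekday_name' attribute
assert times.dt.day_name().tolist() == ['Sunday']
```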
In [243]:
ufo.Time.dt.dayofyear.head()


Out[243]:
0    152
1    181
2     46
3    152
4    108
Name: Time, dtype: int64

In [244]:
# convert a single string to datetime format (outputs a timestamp object)
ts = pd.to_datetime('1/1/1999')
ts


Out[244]:
Timestamp('1999-01-01 00:00:00')

In [245]:
# compare a datetime Series with a timestamp
ufo.loc[ufo.Time >= ts, :].head()


Out[245]:
City Colors Reported Shape Reported State Time
12832 Loma Rica NaN LIGHT CA 1999-01-01 02:30:00
12833 Bauxite NaN NaN AR 1999-01-01 03:00:00
12834 Florence NaN CYLINDER SC 1999-01-01 14:00:00
12835 Lake Henshaw NaN CIGAR CA 1999-01-01 15:00:00
12836 Wilmington Island NaN LIGHT GA 1999-01-01 17:15:00

In [246]:
# perform mathematical operations with timestamps (outputs a timedelta object)
ufo.Time.max() - ufo.Time.min()


Out[246]:
Timedelta('25781 days 01:59:00')

In [247]:
# timedelta objects also have attributes you can access
(ufo.Time.max() - ufo.Time.min()).days


Out[247]:
25781L

In [248]:
# allow plots to appear in the notebook
%matplotlib inline

In [249]:
# count the number of UFO reports per year
ufo['Year'] = ufo.Time.dt.year
ufo.Year.value_counts().sort_index().head()


Out[249]:
1930    2
1931    2
1933    1
1934    1
1935    1
Name: Year, dtype: int64

In [250]:
# plot the number of UFO reports per year (line plot is the default)
ufo.Year.value_counts().sort_index().plot()


Out[250]:
<matplotlib.axes._subplots.AxesSubplot at 0xd571278>

26. How do I find and remove duplicate rows in pandas?


In [251]:
# read a dataset of movie reviewers into a DataFrame
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
users = pd.read_table('http://bit.ly/movieusers', sep='|', header=None, names=user_cols, index_col='user_id')
users.head()


Out[251]:
age gender occupation zip_code
user_id
1 24 M technician 85711
2 53 F other 94043
3 23 M writer 32067
4 24 M technician 43537
5 33 F other 15213

In [252]:
users.shape


Out[252]:
(943, 4)

In [253]:
# detect duplicate zip codes: True if an item is identical to a previous item
users.zip_code.duplicated().tail()


Out[253]:
user_id
939    False
940     True
941    False
942    False
943    False
Name: zip_code, dtype: bool

In [254]:
# count the duplicate items (True becomes 1, False becomes 0)
users.zip_code.duplicated().sum()


Out[254]:
148

In [255]:
# detect duplicate DataFrame rows: True if an entire row is identical to a previous row
users.duplicated().tail()


Out[255]:
user_id
939    False
940    False
941    False
942    False
943    False
dtype: bool

In [256]:
# count the duplicate rows
users.duplicated().sum()


Out[256]:
7

Logic for duplicated:

  • keep='first' (default): Mark duplicates as True except for the first occurrence.
  • keep='last': Mark duplicates as True except for the last occurrence.
  • keep=False: Mark all duplicates as True.

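The three keep settings can be checked side by side on a small hypothetical Series with one repeated value:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c', 'a'])   # 'a' appears at positions 0, 2, 4

assert s.duplicated(keep='first').tolist() == [False, False, True, False, True]
assert s.duplicated(keep='last').tolist()  == [True, False, True, False, False]
assert s.duplicated(keep=False).tolist()   == [True, False, True, False, True]
```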
In [257]:
# examine the duplicate rows (ignoring the first occurrence)
users.loc[users.duplicated(keep='first'), :]


Out[257]:
age gender occupation zip_code
user_id
496 21 F student 55414
572 51 M educator 20003
621 17 M student 60402
684 28 M student 55414
733 44 F other 60630
805 27 F other 20009
890 32 M student 97301

In [258]:
# examine the duplicate rows (ignoring the last occurrence)
users.loc[users.duplicated(keep='last'), :]


Out[258]:
age gender occupation zip_code
user_id
67 17 M student 60402
85 51 M educator 20003
198 21 F student 55414
350 32 M student 97301
428 28 M student 55414
437 27 F other 20009
460 44 F other 60630

In [259]:
# examine the duplicate rows (including all duplicates)
users.loc[users.duplicated(keep=False), :]


Out[259]:
age gender occupation zip_code
user_id
67 17 M student 60402
85 51 M educator 20003
198 21 F student 55414
350 32 M student 97301
428 28 M student 55414
437 27 F other 20009
460 44 F other 60630
496 21 F student 55414
572 51 M educator 20003
621 17 M student 60402
684 28 M student 55414
733 44 F other 60630
805 27 F other 20009
890 32 M student 97301

In [260]:
# drop the duplicate rows (inplace=False by default)
users.drop_duplicates(keep='first').shape


Out[260]:
(936, 4)

In [261]:
users.drop_duplicates(keep='last').shape


Out[261]:
(936, 4)

In [262]:
users.drop_duplicates(keep=False).shape


Out[262]:
(929, 4)

Documentation for drop_duplicates


In [263]:
# only consider a subset of columns when identifying duplicates
users.duplicated(subset=['age', 'zip_code']).sum()


Out[263]:
16

In [264]:
users.drop_duplicates(subset=['age', 'zip_code']).shape


Out[264]:
(927, 4)

27. How do I avoid a SettingWithCopyWarning in pandas?


In [265]:
# read a dataset of top-rated IMDb movies into a DataFrame
movies = pd.read_csv('http://bit.ly/imdbratings')
movies.head()


Out[265]:
star_rating title content_rating genre duration actors_list
0 9.3 The Shawshank Redemption R Crime 142 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1 9.2 The Godfather R Crime 175 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3 9.0 The Dark Knight PG-13 Action 152 [u'Christian Bale', u'Heath Ledger', u'Aaron E...
4 8.9 Pulp Fiction R Crime 154 [u'John Travolta', u'Uma Thurman', u'Samuel L....

In [266]:
# count the missing values in the 'content_rating' Series
movies.content_rating.isnull().sum()


Out[266]:
3

In [267]:
# examine the DataFrame rows that contain those missing values
movies[movies.content_rating.isnull()]


Out[267]:
star_rating title content_rating genre duration actors_list
187 8.2 Butch Cassidy and the Sundance Kid NaN Biography 110 [u'Paul Newman', u'Robert Redford', u'Katharin...
649 7.7 Where Eagles Dare NaN Action 158 [u'Richard Burton', u'Clint Eastwood', u'Mary ...
936 7.4 True Grit NaN Adventure 128 [u'John Wayne', u'Kim Darby', u'Glen Campbell']

In [268]:
# examine the unique values in the 'content_rating' Series
movies.content_rating.value_counts()


Out[268]:
R            460
PG-13        189
PG           123
NOT RATED     65
APPROVED      47
UNRATED       38
G             32
PASSED         7
NC-17          7
X              4
GP             3
TV-MA          1
Name: content_rating, dtype: int64

Goal: Mark the 'NOT RATED' values as missing values, represented by 'NaN'.


In [269]:
# first, locate the relevant rows
movies[movies.content_rating=='NOT RATED'].head()


Out[269]:
star_rating title content_rating genre duration actors_list
5 8.9 12 Angry Men NOT RATED Drama 96 [u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals...
6 8.9 The Good, the Bad and the Ugly NOT RATED Western 161 [u'Clint Eastwood', u'Eli Wallach', u'Lee Van ...
41 8.5 Sunset Blvd. NOT RATED Drama 110 [u'William Holden', u'Gloria Swanson', u'Erich...
63 8.4 M NOT RATED Crime 99 [u'Peter Lorre', u'Ellen Widmann', u'Inge Land...
66 8.4 Munna Bhai M.B.B.S. NOT RATED Comedy 156 [u'Sunil Dutt', u'Sanjay Dutt', u'Arshad Warsi']

In [270]:
# then, select the 'content_rating' Series from those rows
movies[movies.content_rating=='NOT RATED'].content_rating.head()


Out[270]:
5     NOT RATED
6     NOT RATED
41    NOT RATED
63    NOT RATED
66    NOT RATED
Name: content_rating, dtype: object

In [271]:
# finally, replace the 'NOT RATED' values with 'NaN' (imported from NumPy)
import numpy as np
movies[movies.content_rating=='NOT RATED'].content_rating = np.nan


c:\Users\Kevin\Anaconda\lib\site-packages\pandas\core\generic.py:2701: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value

Problem: That statement involves two operations, a __getitem__ and a __setitem__. pandas can't guarantee whether the __getitem__ operation returns a view or a copy of the data.

  • If __getitem__ returns a view of the data, __setitem__ will affect the 'movies' DataFrame.
  • But if __getitem__ returns a copy of the data, __setitem__ will not affect the 'movies' DataFrame.
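The two-operation problem can be reproduced on a throwaway DataFrame (a sketch with a made-up 'rating' column): boolean indexing returns a copy, so assigning to that copy silently leaves the original untouched, while a single `.loc` setitem modifies it in place.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'rating': ['NOT RATED', 'R']})

# chained assignment: __getitem__ returns a copy here,
# so the __setitem__ only modifies that temporary copy
df[df.rating == 'NOT RATED'].rating = np.nan
print(df.rating.isnull().sum())   # 0: 'df' was not modified

# a single .loc __setitem__ updates 'df' directly
df.loc[df.rating == 'NOT RATED', 'rating'] = np.nan
print(df.rating.isnull().sum())   # 1: 'df' was modified
```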

In [272]:
# the 'content_rating' Series has not changed
movies.content_rating.isnull().sum()


Out[272]:
3

Solution: Use the loc method, which replaces the 'NOT RATED' values in a single __setitem__ operation.


In [273]:
# replace the 'NOT RATED' values with 'NaN' (does not cause a SettingWithCopyWarning)
movies.loc[movies.content_rating=='NOT RATED', 'content_rating'] = np.nan

In [274]:
# this time, the 'content_rating' Series has changed
movies.content_rating.isnull().sum()


Out[274]:
68

Summary: Use the loc method any time you are selecting rows and columns in the same statement.

More information: Modern Pandas (Part 1)


In [275]:
# create a DataFrame only containing movies with a high 'star_rating'
top_movies = movies.loc[movies.star_rating >= 9, :]
top_movies


Out[275]:
star_rating title content_rating genre duration actors_list
0 9.3 The Shawshank Redemption R Crime 142 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1 9.2 The Godfather R Crime 175 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3 9.0 The Dark Knight PG-13 Action 152 [u'Christian Bale', u'Heath Ledger', u'Aaron E...

Goal: Fix the 'duration' for 'The Shawshank Redemption'.


In [276]:
# overwrite the relevant cell with the correct duration
top_movies.loc[0, 'duration'] = 150


c:\Users\Kevin\Anaconda\lib\site-packages\pandas\core\indexing.py:465: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s

Problem: pandas isn't sure whether 'top_movies' is a view or a copy of 'movies'.


In [277]:
# 'top_movies' DataFrame has been updated
top_movies


Out[277]:
star_rating title content_rating genre duration actors_list
0 9.3 The Shawshank Redemption R Crime 150 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1 9.2 The Godfather R Crime 175 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3 9.0 The Dark Knight PG-13 Action 152 [u'Christian Bale', u'Heath Ledger', u'Aaron E...

In [278]:
# 'movies' DataFrame has not been updated
movies.head(1)


Out[278]:
star_rating title content_rating genre duration actors_list
0 9.3 The Shawshank Redemption R Crime 142 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...

Solution: Any time you are attempting to create a DataFrame copy, use the copy method.
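The full pattern, sketched on a two-row stand-in for the 'movies' DataFrame: slice with `.loc`, chain `.copy()` to make the new object's independence explicit, and then update the copy without a warning.

```python
import pandas as pd

# stand-in data with the same columns used above
movies = pd.DataFrame({'star_rating': [9.3, 8.0],
                       'duration': [142, 100]})

# explicitly copy the slice, so pandas knows it is not a view
top_movies = movies.loc[movies.star_rating >= 9, :].copy()

# updating the copy is unambiguous and warning-free
top_movies.loc[0, 'duration'] = 150

print(top_movies.loc[0, 'duration'])  # 150 in the copy
print(movies.loc[0, 'duration'])      # 142: the original is untouched
```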


In [279]:
# explicitly create a copy of 'movies'
top_movies = movies.loc[movies.star_rating >= 9, :].copy()

In [280]:
# pandas now knows that you are updating a copy instead of a view (does not cause a SettingWithCopyWarning)
top_movies.loc[0, 'duration'] = 150

In [281]:
# 'top_movies' DataFrame has been updated
top_movies


Out[281]:
star_rating title content_rating genre duration actors_list
0 9.3 The Shawshank Redemption R Crime 150 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1 9.2 The Godfather R Crime 175 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3 9.0 The Dark Knight PG-13 Action 152 [u'Christian Bale', u'Heath Ledger', u'Aaron E...