Welcome to the 'First steps with pandas'!

After this workshop you can (hopefully) call yourselves Data Scientists!

Gitter: https://gitter.im/first-steps-with-pandas-workshop

Before coding, let's check whether we have proper versions of libraries


In [1]:
import platform
print('Python: ' + platform.python_version())

import numpy as np
print('numpy: ' + np.__version__)

import pandas as pd
print('pandas: ' + pd.__version__)

import scipy
print('scipy: ' + scipy.__version__)

import sklearn
print('scikit-learn: ' + sklearn.__version__)

import matplotlib as plt
print('matplotlib: ' + plt.__version__)

import flask
print('flask: ' + flask.__version__)


Python: 3.6.1
numpy: 1.13.0
pandas: 0.20.2
scipy: 0.19.1
scikit-learn: 0.18.2
matplotlib: 2.0.2
flask: 0.12.2

What is pandas?

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Why to use it?

  • fast development
  • no reinventing wheel
  • fewer mistakes/bugs

In [2]:
# In case of no Internet, use:
# pd.read_json('data/cached_Python.json')

(
    pd.read_json('https://raw.githubusercontent.com/Nozdi/first-steps-with-pandas-workshop/master/data/cached_python.json')
    .resample('1W')
    .mean()
    ['daily_views']
)


Out[2]:
2016-01-03    2767.666667
2016-01-10    4626.571429
2016-01-17    4871.857143
2016-01-24    5249.333333
Freq: W-SUN, Name: daily_views, dtype: float64

It is easy to pick up

  • few simple concepts that are very powerful
  • easy, standardized API
  • good code readability

It is reasonably fast


In [3]:
some_data = [ list(range(1,100)) for x in range(1,1000) ]
some_df = pd.DataFrame(some_data)


def standard_way(data):
    return [[col*2 for col in row] for row in data]


def pandas_way(df):
    return df * 2

In [4]:
%timeit standard_way(some_data)


6.22 ms ± 505 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]:
%timeit pandas_way(some_df)


206 µs ± 5.96 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Data structures

Series

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).


In [6]:
strengths = pd.Series([400, 200, 300, 400, 500])
strengths


Out[6]:
0    400
1    200
2    300
3    400
4    500
dtype: int64

In [7]:
names = pd.Series(["Batman", "Robin", "Spiderman", "Robocop", "Terminator"])
names


Out[7]:
0        Batman
1         Robin
2     Spiderman
3       Robocop
4    Terminator
dtype: object

DataFrame

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects.

Creating


In [8]:
heroes = pd.DataFrame({
    'hero': names,
    'strength': strengths
})
heroes


Out[8]:
hero strength
0 Batman 400
1 Robin 200
2 Spiderman 300
3 Robocop 400
4 Terminator 500

In [9]:
other_heroes = pd.DataFrame([
    dict(hero="Hercules", strength=800),
    dict(hero="Conan")
])
other_heroes


Out[9]:
hero strength
0 Hercules 800.0
1 Conan NaN

In [10]:
another_heroes = pd.DataFrame([
    pd.Series(["Wonder Woman", 10, 3], index=["hero", "strength", "cookies"]),
    pd.Series(["Xena", 20, 0], index=["hero", "strength", "cookies"])
])
another_heroes


Out[10]:
hero strength cookies
0 Wonder Woman 10 3
1 Xena 20 0

Meta data


In [11]:
another_heroes.columns


Out[11]:
Index(['hero', 'strength', 'cookies'], dtype='object')

In [12]:
another_heroes.shape


Out[12]:
(2, 3)

In [13]:
another_heroes.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
hero        2 non-null object
strength    2 non-null int64
cookies     2 non-null int64
dtypes: int64(2), object(1)
memory usage: 128.0+ bytes

Selecting

[string] --> Series [ list of strings ] --> DataFrame


In [14]:
another_heroes['cookies']


Out[14]:
0    3
1    0
Name: cookies, dtype: int64

In [15]:
another_heroes.cookies


Out[15]:
0    3
1    0
Name: cookies, dtype: int64

In [16]:
another_heroes[ ['hero', 'cookies'] ]


Out[16]:
hero cookies
0 Wonder Woman 3
1 Xena 0

Chaining (most of operations on DataFrame returns new DataFrame or Series)


In [17]:
another_heroes[['hero', 'cookies']][['cookies']]


Out[17]:
cookies
0 3
1 0

In [18]:
another_heroes[['hero', 'cookies']][['cookies']]['cookies']


Out[18]:
0    3
1    0
Name: cookies, dtype: int64

EXERCISE

Create DataFrame presented below in 3 different ways

                                         movie_title  imdb_score
0                                            Avatar          7.9
1          Pirates of the Caribbean: At World's End          7.1
2                                           Spectre          6.8

Help: http://pandas.pydata.org/pandas-docs/stable/dsintro.html#from-dict-of-series-or-dicts

With dict of Series


In [19]:
# Solution here

titles = pd.Series(["Avatar", "Pirates of the Caribbean: At World's End", "Spectre"])
imdb_scores = pd.Series([7.9, 7.1, 6.8])

pd.DataFrame({'movie_title': titles, 'imdb_score': imdb_scores})


Out[19]:
imdb_score movie_title
0 7.9 Avatar
1 7.1 Pirates of the Caribbean: At World's End
2 6.8 Spectre

With list of dicts


In [20]:
# Solution here

pd.DataFrame([
    dict(movie_title="Avatar", imdb_score=7.9),
    dict(movie_title="Pirates of the Caribbean: At World's End", imdb_score=7.1),
    dict(movie_title="Spectre", imdb_score=6.8),
])


Out[20]:
imdb_score movie_title
0 7.9 Avatar
1 7.1 Pirates of the Caribbean: At World's End
2 6.8 Spectre

With list of Series


In [21]:
# Solution here

pd.DataFrame([
    pd.Series(["Avatar", 7.9], index=['movie_title', 'imdb_score']),
    pd.Series(["Pirates of the Caribbean: At World's End", 7.1], index=['movie_title', 'imdb_score']),
    pd.Series(["Spectre", 6.8], index=['movie_title', 'imdb_score'])
])


Out[21]:
movie_title imdb_score
0 Avatar 7.9
1 Pirates of the Caribbean: At World's End 7.1
2 Spectre 6.8

I/O part I


In [22]:
# Uncomment and press tab..
# pd.read_
# SQL, csv, hdf

In [23]:
# pd.read_csv?

In [24]:
# executing bash in jupyter notebook
!head -c 500 data/cached_python.json


{"daily_views":{"1451606400000":2701,"1451692800000":2829,"1451779200000":2773,"1451865600000":5072,"1451952000000":5513,"1452038400000":5233,"1452124800000":5518,"1452211200000":5312,"1452297600000":2749,"1452384000000":2989,"1452470400000":5539,"1452556800000":6161,"1452643200000":6175,"1452729600000":5545,"1452816000000":4833,"1452902400000":2858,"1452988800000":2992,"1453075200000":4673,"1453161600000":5625,"1453248000000":5450},"month":{"1451606400000":201601,"1451692800000":201601,"1451779

In [25]:
pd.read_json('data/cached_python.json')


Out[25]:
daily_views month project rank title
2016-01-01 2701 201601 en 2251 Python_(programming_language)
2016-01-02 2829 201601 en 2251 Python_(programming_language)
2016-01-03 2773 201601 en 2251 Python_(programming_language)
2016-01-04 5072 201601 en 2251 Python_(programming_language)
2016-01-05 5513 201601 en 2251 Python_(programming_language)
2016-01-06 5233 201601 en 2251 Python_(programming_language)
2016-01-07 5518 201601 en 2251 Python_(programming_language)
2016-01-08 5312 201601 en 2251 Python_(programming_language)
2016-01-09 2749 201601 en 2251 Python_(programming_language)
2016-01-10 2989 201601 en 2251 Python_(programming_language)
2016-01-11 5539 201601 en 2251 Python_(programming_language)
2016-01-12 6161 201601 en 2251 Python_(programming_language)
2016-01-13 6175 201601 en 2251 Python_(programming_language)
2016-01-14 5545 201601 en 2251 Python_(programming_language)
2016-01-15 4833 201601 en 2251 Python_(programming_language)
2016-01-16 2858 201601 en 2251 Python_(programming_language)
2016-01-17 2992 201601 en 2251 Python_(programming_language)
2016-01-18 4673 201601 en 2251 Python_(programming_language)
2016-01-19 5625 201601 en 2251 Python_(programming_language)
2016-01-20 5450 201601 en 2251 Python_(programming_language)

EXERCISE

Load movies from data/movies.csv to variable called movies


In [26]:
# Solution here

movies = pd.read_csv('data/movies.csv')
movies.head()


Out[26]:
movie_title title_year budget gross genres language country movie_facebook_likes imdb_score num_voted_users ... actor_3_name actor_3_facebook_likes cast_total_facebook_likes color duration plot_keywords content_rating aspect_ratio facenumber_in_poster movie_imdb_link
0 Avatar 2009.0 237000000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi English USA 33000 7.9 886204 ... Wes Studi 855.0 4834 Color 178.0 avatar|future|marine|native|paraplegic PG-13 1.78 0.0 http://www.imdb.com/title/tt0499549/?ref_=fn_t...
1 Pirates of the Caribbean: At World's End 2007.0 300000000.0 309404152.0 Action|Adventure|Fantasy English USA 0 7.1 471220 ... Jack Davenport 1000.0 48350 Color 169.0 goddess|marriage ceremony|marriage proposal|pi... PG-13 2.35 0.0 http://www.imdb.com/title/tt0449088/?ref_=fn_t...
2 Spectre 2015.0 245000000.0 200074175.0 Action|Adventure|Thriller English UK 85000 6.8 275868 ... Stephanie Sigman 161.0 11700 Color 148.0 bomb|espionage|sequel|spy|terrorist PG-13 2.35 1.0 http://www.imdb.com/title/tt2379713/?ref_=fn_t...
3 The Dark Knight Rises 2012.0 250000000.0 448130642.0 Action|Thriller English USA 164000 8.5 1144337 ... Joseph Gordon-Levitt 23000.0 106759 Color 164.0 deception|imprisonment|lawlessness|police offi... PG-13 2.35 0.0 http://www.imdb.com/title/tt1345836/?ref_=fn_t...
4 Star Wars: Episode VII - The Force Awakens NaN NaN NaN Documentary NaN NaN 0 7.1 8 ... NaN NaN 143 NaN NaN NaN NaN NaN 0.0 http://www.imdb.com/title/tt5289954/?ref_=fn_t...

5 rows × 28 columns

Analyze what dimensions and columns it has


In [27]:
# Solution here

print(movies.shape)
print(movies.columns)


(5043, 28)
Index(['movie_title', 'title_year', 'budget', 'gross', 'genres', 'language',
       'country', 'movie_facebook_likes', 'imdb_score', 'num_voted_users',
       'num_critic_for_reviews', 'num_user_for_reviews', 'director_name',
       'director_facebook_likes', 'actor_1_name', 'actor_1_facebook_likes',
       'actor_2_name', 'actor_2_facebook_likes', 'actor_3_name',
       'actor_3_facebook_likes', 'cast_total_facebook_likes', 'color',
       'duration', 'plot_keywords', 'content_rating', 'aspect_ratio',
       'facenumber_in_poster', 'movie_imdb_link'],
      dtype='object')

Filtering


In [28]:
heroes


Out[28]:
hero strength
0 Batman 400
1 Robin 200
2 Spiderman 300
3 Robocop 400
4 Terminator 500

Boolean indexing


In [29]:
heroes['strength'] == 400


Out[29]:
0     True
1    False
2    False
3     True
4    False
Name: strength, dtype: bool

In [30]:
heroes[heroes['strength'] == 400]


Out[30]:
hero strength
0 Batman 400
3 Robocop 400

In [31]:
heroes[heroes['strength'] > 400]


Out[31]:
hero strength
4 Terminator 500

Multiple conditions


In [32]:
try:
    heroes[200 < heroes['strength'] < 400]
except ValueError:
    print("This cool Python syntax ain't work :(")


This cool Python syntax ain't work :(

In [33]:
heroes[
    (heroes['strength'] > 200) & 
    (heroes['strength'] < 400)
]


Out[33]:
hero strength
2 Spiderman 300

In [34]:
heroes[
    (heroes['strength'] <= 200) |
    (heroes['strength'] >= 400)
]


Out[34]:
hero strength
0 Batman 400
1 Robin 200
3 Robocop 400
4 Terminator 500

Negation

~ is a negation operator


In [35]:
~(heroes['strength'] == 400)


Out[35]:
0    False
1     True
2     True
3    False
4     True
Name: strength, dtype: bool

In [36]:
heroes['strength'] != 400


Out[36]:
0    False
1     True
2     True
3    False
4     True
Name: strength, dtype: bool

In [37]:
heroes[~(
    (heroes['strength'] <= 200) |
    (heroes['strength'] >= 400)
)]


Out[37]:
hero strength
2 Spiderman 300

Filtering for containing one of many values (SQL's IN)


In [38]:
heroes[
    heroes['hero'].isin(['Batman', 'Robin'])
]


Out[38]:
hero strength
0 Batman 400
1 Robin 200

EXERCISE

What movies has been directed by Clint Eastwood?


In [39]:
# Solution here

movies[movies['director_name'] == "Clint Eastwood"]


Out[39]:
movie_title title_year budget gross genres language country movie_facebook_likes imdb_score num_voted_users ... actor_3_name actor_3_facebook_likes cast_total_facebook_likes color duration plot_keywords content_rating aspect_ratio facenumber_in_poster movie_imdb_link
664 Space Cowboys 2000.0 65000000.0 90454043.0 Action|Adventure|Thriller English USA 0 6.4 60165 ... William Devane 416.0 17771 Black and White 130.0 nasa|old age|satellite|space|space shuttle PG-13 2.35 1.0 http://www.imdb.com/title/tt0186566/?ref_=fn_t...
772 Invictus 2009.0 60000000.0 37479778.0 Biography|Drama|History|Sport English USA 23000 7.4 124222 ... Leleti Khumalo 204.0 24458 Color 134.0 apartheid|nation|nelson mandela|president|rugby PG-13 2.35 1.0 http://www.imdb.com/title/tt1057500/?ref_=fn_t...
814 American Sniper 2014.0 58800000.0 350123553.0 Action|Biography|Drama|History|Thriller|War English USA 112000 7.3 325264 ... Keir O'Donnell 318.0 16277 Color 133.0 assassin|death of child|iraq|sniper|sniper rifle R 2.35 0.0 http://www.imdb.com/title/tt2179136/?ref_=fn_t...
877 Changeling 2008.0 55000000.0 35707327.0 Crime|Drama|Mystery|Thriller English USA 14000 7.8 199056 ... Colm Feore 539.0 12840 Color 141.0 execution by hanging|los angeles police depart... R 2.35 0.0 http://www.imdb.com/title/tt0824747/?ref_=fn_t...
906 Flags of Our Fathers 2006.0 90000000.0 33574332.0 Drama|History|War English USA 0 7.1 101221 ... Tom McCarthy 310.0 24468 Color 135.0 imperial japan|japan|japanese army|japanese so... R 2.35 0.0 http://www.imdb.com/title/tt0418689/?ref_=fn_t...
973 Absolute Power 1997.0 50000000.0 50007168.0 Action|Crime|Drama|Thriller English USA 1000 6.7 39529 ... Scott Glenn 826.0 19139 Color 121.0 death|death of wife|murder|thief|violence R 2.35 0.0 http://www.imdb.com/title/tt0118548/?ref_=fn_t...
997 Hereafter 2010.0 50000000.0 32741596.0 Drama|Fantasy English USA 16000 6.5 80140 ... Cécile De France 447.0 14344 Color 129.0 death|france|near death experience|tsunami|twi... PG-13 2.35 0.0 http://www.imdb.com/title/tt1212419/?ref_=fn_t...
1004 Blood Work 2002.0 50000000.0 26199517.0 Action|Crime|Drama|Mystery|Thriller English USA 705 6.4 34356 ... Rick Hoffman 581.0 18635 Color 110.0 boat|heart transplant|murder|serial killer|sister R 2.35 1.0 http://www.imdb.com/title/tt0309377/?ref_=fn_t...
1245 Jersey Boys 2014.0 40000000.0 47034272.0 Biography|Drama|Music|Musical English USA 16000 6.9 25465 ... Scott Vance 235.0 2281 Color 134.0 band|music group|musical quartet|singer|song R 2.35 1.0 http://www.imdb.com/title/tt1742044/?ref_=fn_t...
1453 J. Edgar 2011.0 35000000.0 37304950.0 Biography|Crime|Drama English USA 16000 6.6 102728 ... Kaitlyn Dever 363.0 36258 Color 137.0 agent|fbi|federal bureau of investigation|gay ... R 2.35 0.0 http://www.imdb.com/title/tt1616195/?ref_=fn_t...
1462 Midnight in the Garden of Good and Evil 1997.0 30000000.0 25078937.0 Crime|Drama|Mystery|Thriller English USA 0 6.6 31751 ... Leon Rippy 229.0 19689 Color 155.0 graveyard|murder|murder trial|reporter|trial R 1.85 2.0 http://www.imdb.com/title/tt0119668/?ref_=fn_t...
1603 Mystic River 2003.0 25000000.0 90135191.0 Crime|Drama|Mystery|Thriller English USA 12000 8.0 338415 ... Tom Guiry 262.0 1942 Color 138.0 ballistics|detective|friend|gun|sexual abuse R 2.35 0.0 http://www.imdb.com/title/tt0327056/?ref_=fn_t...
1604 Million Dollar Baby 2004.0 30000000.0 100422786.0 Drama|Sport English USA 14000 8.1 482064 ... Mike Colter 567.0 28830 Color 132.0 boxing|boxing trainer|death|suicide attempt|wo... PG-13 2.35 1.0 http://www.imdb.com/title/tt0405159/?ref_=fn_t...
1869 Gran Torino 2008.0 33000000.0 148085755.0 Drama English USA 28000 8.2 561773 ... Ahney Her 198.0 17347 Color 116.0 gang|gang rape|hmong|teenager|widower R 2.35 0.0 http://www.imdb.com/title/tt1205489/?ref_=fn_t...
2086 The Bridges of Madison County 1995.0 35000000.0 70960517.0 Drama|Romance English USA 0 7.5 54631 ... Debra Monk 86.0 27237 Color 135.0 1960s|bridge|housewife|love|photographer PG-13 1.85 0.0 http://www.imdb.com/title/tt0112579/?ref_=fn_t...
2132 Firefox 1982.0 21000000.0 46700000.0 Action|Adventure|Thriller English USA 0 5.9 19832 ... Kenneth Colley 216.0 16805 Color 136.0 fighter plane|jet fighter|refueling|russia|sov... PG 2.35 1.0 http://www.imdb.com/title/tt0083943/?ref_=fn_t...
2760 Unforgiven 1992.0 14400000.0 101157447.0 Drama|Western English USA 10000 8.3 277505 ... Frances Fisher 638.0 28544 Color 131.0 englishman|leaving flowers on a grave|one last... R 2.35 1.0 http://www.imdb.com/title/tt0105695/?ref_=fn_t...
2863 Letters from Iwo Jima 2006.0 19000000.0 13753931.0 Drama|History|War Japanese USA 5000 7.9 132149 ... Shidô Nakamura 78.0 751 Color 141.0 blood splatter|general|island|japan|world war two R 2.35 0.0 http://www.imdb.com/title/tt0498380/?ref_=fn_t...
3540 Pale Rider 1985.0 6900000.0 41400000.0 Western English USA 0 7.3 39451 ... Sydney Penny 240.0 17209 Color 115.0 gold|miner|preacher|prospector|sheriff R 2.35 0.0 http://www.imdb.com/title/tt0089767/?ref_=fn_t...
3673 High Plains Drifter 1973.0 5500000.0 NaN Mystery|Western English USA 0 7.6 36381 ... John Hillerman 258.0 17599 Color 105.0 gravestone|gunfighter|outlaw|revenge|small town R 2.35 0.0 http://www.imdb.com/title/tt0068699/?ref_=fn_t...

20 rows × 28 columns

What movies have earned above $500m?


In [40]:
# Solution here

movies[movies['gross'] > 500e6]['movie_title']


Out[40]:
0               Avatar
17        The Avengers
26             Titanic
29      Jurassic World
66     The Dark Knight
794       The Avengers
Name: movie_title, dtype: object

Are there any Polish movies?


In [41]:
# Solution here

movies[movies['language'] == 'Polish']['movie_title']


Out[41]:
2824       Dekalog            
2828    The Border            
3207       Dekalog            
4170                       Ida
Name: movie_title, dtype: object

In [42]:
# Solution here

movies[
    (movies['movie_facebook_likes'] > 100000) &
    (movies['imdb_score'] > 8.5)
]['movie_title']


Out[42]:
96                  Interstellar
97                     Inception
1937    The Shawshank Redemption
Name: movie_title, dtype: object

In what movies main role was played by brutals like "Jason Statham", "Sylvester Stallone" or god ("Morgan Freeman")?


In [43]:
# Solution here

brutals = ["Jason Statham", "Sylvester Stallone"]
god = "Morgan Freeman"

movies[
    (movies['actor_1_name'].isin(brutals)) |
    (movies['actor_1_name'] == god) 
]['movie_title'].head()


Out[43]:
45             Furious 7
235             Oblivion
299    The Expendables 2
317    The Expendables 3
367              Ben-Hur
Name: movie_title, dtype: object

I/O part O

As numpy array


In [44]:
heroes.values


Out[44]:
array([['Batman', 400],
       ['Robin', 200],
       ['Spiderman', 300],
       ['Robocop', 400],
       ['Terminator', 500]], dtype=object)

As (list) of dicts


In [45]:
heroes.to_dict()


Out[45]:
{'hero': {0: 'Batman',
  1: 'Robin',
  2: 'Spiderman',
  3: 'Robocop',
  4: 'Terminator'},
 'strength': {0: 400, 1: 200, 2: 300, 3: 400, 4: 500}}

In [46]:
heroes.to_dict('records')


Out[46]:
[{'hero': 'Batman', 'strength': 400},
 {'hero': 'Robin', 'strength': 200},
 {'hero': 'Spiderman', 'strength': 300},
 {'hero': 'Robocop', 'strength': 400},
 {'hero': 'Terminator', 'strength': 500}]

In [47]:
heroes.to_json()


Out[47]:
'{"hero":{"0":"Batman","1":"Robin","2":"Spiderman","3":"Robocop","4":"Terminator"},"strength":{"0":400,"1":200,"2":300,"3":400,"4":500}}'

In [48]:
heroes.to_json(orient='records')


Out[48]:
'[{"hero":"Batman","strength":400},{"hero":"Robin","strength":200},{"hero":"Spiderman","strength":300},{"hero":"Robocop","strength":400},{"hero":"Terminator","strength":500}]'

In [49]:
heroes.to_csv()


Out[49]:
',hero,strength\n0,Batman,400\n1,Robin,200\n2,Spiderman,300\n3,Robocop,400\n4,Terminator,500\n'

In [50]:
heroes.to_csv(index=False)


Out[50]:
'hero,strength\nBatman,400\nRobin,200\nSpiderman,300\nRobocop,400\nTerminator,500\n'

In [51]:
heroes.to_csv('data/heroes.csv', index=False)

EXERCISE

Create a csv with movie titles and cast (actors) of movies with budget above $200m


In [52]:
# Solution here
cols = [
    'movie_title',
    'actor_1_name',
    'actor_2_name',
    'actor_3_name', 
    'budget'
]

movies[movies['budget'] > 200e6][cols].to_csv("data/expensive-cast.csv", index=False)

Create a list of dicts with movie titles and facebook likes of all Christopher Nolan's movies


In [53]:
# Solution here

cols = [
    'movie_title',
    'movie_facebook_likes'
]

movies[movies['director_name'] == 'Christopher Nolan'][cols].to_dict('r')


Out[53]:
[{'movie_facebook_likes': 164000, 'movie_title': 'The Dark Knight Rises'},
 {'movie_facebook_likes': 37000, 'movie_title': 'The Dark Knight'},
 {'movie_facebook_likes': 349000, 'movie_title': 'Interstellar'},
 {'movie_facebook_likes': 175000, 'movie_title': 'Inception'},
 {'movie_facebook_likes': 15000, 'movie_title': 'Batman Begins'},
 {'movie_facebook_likes': 0, 'movie_title': 'Insomnia'},
 {'movie_facebook_likes': 49000, 'movie_title': 'The Prestige'},
 {'movie_facebook_likes': 40000, 'movie_title': 'Memento'}]

New columns


In [54]:
heroes


Out[54]:
hero strength
0 Batman 400
1 Robin 200
2 Spiderman 300
3 Robocop 400
4 Terminator 500

Creating new column


In [55]:
heroes['health'] = np.NaN
heroes.head()


Out[55]:
hero strength health
0 Batman 400 NaN
1 Robin 200 NaN
2 Spiderman 300 NaN
3 Robocop 400 NaN
4 Terminator 500 NaN

In [56]:
heroes['health'] = 100
heroes.head()


Out[56]:
hero strength health
0 Batman 400 100
1 Robin 200 100
2 Spiderman 300 100
3 Robocop 400 100
4 Terminator 500 100

In [57]:
heroes['height'] = [180, 170, 175, 190, 185]
heroes


Out[57]:
hero strength health height
0 Batman 400 100 180
1 Robin 200 100 170
2 Spiderman 300 100 175
3 Robocop 400 100 190
4 Terminator 500 100 185

In [58]:
heroes['is_hungry'] = pd.Series([True, False, False, True, True])
heroes


Out[58]:
hero strength health height is_hungry
0 Batman 400 100 180 True
1 Robin 200 100 170 False
2 Spiderman 300 100 175 False
3 Robocop 400 100 190 True
4 Terminator 500 100 185 True

Vector operations


In [59]:
heroes['strength'] * 2


Out[59]:
0     800
1     400
2     600
3     800
4    1000
Name: strength, dtype: int64

In [60]:
heroes['strength'] / heroes['height']


Out[60]:
0    2.222222
1    1.176471
2    1.714286
3    2.105263
4    2.702703
dtype: float64

In [61]:
heroes['strength_per_cm'] = heroes['strength'] / heroes['height']
heroes


Out[61]:
hero strength health height is_hungry strength_per_cm
0 Batman 400 100 180 True 2.222222
1 Robin 200 100 170 False 1.176471
2 Spiderman 300 100 175 False 1.714286
3 Robocop 400 100 190 True 2.105263
4 Terminator 500 100 185 True 2.702703

Map, apply, applymap, str


In [62]:
pd.Series([1, 2, 3]).map(lambda x: x**3)


Out[62]:
0     1
1     8
2    27
dtype: int64

In [63]:
pd.Series(['Batman', 'Robin']).map(lambda x: x[:2])


Out[63]:
0    Ba
1    Ro
dtype: object

In [64]:
# however, more idiomatic approach for strings is to do..
pd.Series(['Batman', 'Robin']).str[:2]


Out[64]:
0    Ba
1    Ro
dtype: object

In [65]:
pd.Series(['Batman', 'Robin']).str.lower()


Out[65]:
0    batman
1     robin
dtype: object

In [66]:
pd.Series([
    ['Batman', 'Robin'],
    ['Robocop']
]).map(len)


Out[66]:
0    2
1    1
dtype: int64

In [67]:
heroes['code'] = heroes['hero'].map(lambda name: name[:2])
heroes


Out[67]:
hero strength health height is_hungry strength_per_cm code
0 Batman 400 100 180 True 2.222222 Ba
1 Robin 200 100 170 False 1.176471 Ro
2 Spiderman 300 100 175 False 1.714286 Sp
3 Robocop 400 100 190 True 2.105263 Ro
4 Terminator 500 100 185 True 2.702703 Te

In [68]:
heroes['effective_strength'] = heroes.apply(
    lambda row: (not row['is_hungry']) * row['strength'],
    axis=1
)
heroes.head()


Out[68]:
hero strength health height is_hungry strength_per_cm code effective_strength
0 Batman 400 100 180 True 2.222222 Ba 0
1 Robin 200 100 170 False 1.176471 Ro 200
2 Spiderman 300 100 175 False 1.714286 Sp 300
3 Robocop 400 100 190 True 2.105263 Ro 0
4 Terminator 500 100 185 True 2.702703 Te 0

In [69]:
heroes[['health', 'strength']] = heroes[['health', 'strength']].applymap(
    lambda x: x + 100
)
heroes


Out[69]:
hero strength health height is_hungry strength_per_cm code effective_strength
0 Batman 500 200 180 True 2.222222 Ba 0
1 Robin 300 200 170 False 1.176471 Ro 200
2 Spiderman 400 200 175 False 1.714286 Sp 300
3 Robocop 500 200 190 True 2.105263 Ro 0
4 Terminator 600 200 185 True 2.702703 Te 0

Cheatsheet

map: 1 => 1
apply: n => 1
applymap: n => n

Sorting and value counts (bonus skill)


In [70]:
heroes['strength'].value_counts()


Out[70]:
500    2
600    1
300    1
400    1
Name: strength, dtype: int64

In [71]:
heroes.sort_values('strength')


Out[71]:
hero strength health height is_hungry strength_per_cm code effective_strength
1 Robin 300 200 170 False 1.176471 Ro 200
2 Spiderman 400 200 175 False 1.714286 Sp 300
0 Batman 500 200 180 True 2.222222 Ba 0
3 Robocop 500 200 190 True 2.105263 Ro 0
4 Terminator 600 200 185 True 2.702703 Te 0

In [72]:
heroes.sort_values(
    ['is_hungry', 'code'],
    ascending=[False, True]
)


Out[72]:
hero strength health height is_hungry strength_per_cm code effective_strength
0 Batman 500 200 180 True 2.222222 Ba 0
3 Robocop 500 200 190 True 2.105263 Ro 0
4 Terminator 600 200 185 True 2.702703 Te 0
1 Robin 300 200 170 False 1.176471 Ro 200
2 Spiderman 400 200 175 False 1.714286 Sp 300

EXERCISE

What are 10 most profitable movies? (ratio between gross and budget)


In [73]:
# Solution here

movies['profitability'] = movies['gross'] / movies['budget']
movies.sort_values('profitability', ascending=False).head(10)


Out[73]:
movie_title title_year budget gross genres language country movie_facebook_likes imdb_score num_voted_users ... actor_3_facebook_likes cast_total_facebook_likes color duration plot_keywords content_rating aspect_ratio facenumber_in_poster movie_imdb_link profitability
4793 Paranormal Activity 2007.0 15000.0 107917283.0 Horror English USA 12000 6.3 184824 ... 21.0 330 Color 84.0 dark force|entity|evil force|found footage|par... R 1.85 0.0 http://www.imdb.com/title/tt1179904/?ref_=fn_t... 7194.485533
4799 Tarnation 2003.0 218.0 592014.0 Biography|Documentary English USA 754 7.2 5709 ... 0.0 78 Color 88.0 answering machine|home movie|lithium|schizophr... Unrated 1.37 3.0 http://www.imdb.com/title/tt0390538/?ref_=fn_t... 2715.660550
4707 The Blair Witch Project 1999.0 60000.0 140530114.0 Horror English USA 0 6.4 186786 ... 39.0 399 Black and White 81.0 found footage|looking at the camera|lost in th... R 1.33 0.0 http://www.imdb.com/title/tt0185937/?ref_=fn_t... 2342.168567
4984 The Brothers McMullen 1995.0 25000.0 10246600.0 Comedy|Drama|Romance English USA 265 6.6 6375 ... 73.0 388 Color 98.0 abusive father|critically acclaimed|loss of fa... R 1.85 0.0 http://www.imdb.com/title/tt0112585/?ref_=fn_t... 409.864000
4936 The Texas Chain Saw Massacre 1974.0 83532.0 30859000.0 Horror|Thriller English USA 0 7.5 96411 ... 177.0 1094 Color 88.0 cannibal|chainsaw|hitchhiker|independent film|... R 1.85 0.0 http://www.imdb.com/title/tt0072271/?ref_=fn_t... 369.427285
3278 The Texas Chain Saw Massacre 1974.0 83532.0 30859000.0 Horror|Thriller English USA 0 7.5 96410 ... 177.0 1094 Color 88.0 cannibal|chainsaw|hitchhiker|independent film|... R 1.85 0.0 http://www.imdb.com/title/tt0072271/?ref_=fn_t... 369.427285
5035 El Mariachi 1992.0 7000.0 2040920.0 Action|Crime|Drama|Romance|Thriller Spanish USA 0 6.9 52055 ... 6.0 147 Color 81.0 assassin|death|guitar|gun|mariachi R 1.37 0.0 http://www.imdb.com/title/tt0104815/?ref_=fn_t... 291.560000
4956 The Gallows 2015.0 100000.0 22757819.0 Horror|Thriller English USA 0 4.2 13521 ... 7.0 276 Color 81.0 breaking and entering|gallows|hanging|high sch... R 1.85 0.0 http://www.imdb.com/title/tt2309260/?ref_=fn_t... 227.578190
4977 Super Size Me 2004.0 65000.0 11529368.0 Comedy|Documentary|Drama English USA 0 7.3 85028 ... 0.0 0 Color 100.0 experiment|fast food|food industry|meal|overea... PG 1.78 2.0 http://www.imdb.com/title/tt0390521/?ref_=fn_t... 177.374892
2619 Halloween 1978.0 300000.0 47000000.0 Horror|Thriller English USA 12000 7.9 157857 ... 598.0 4400 Color 101.0 halloween|masked killer|michael myers|slasher|... R 2.35 0.0 http://www.imdb.com/title/tt0077651/?ref_=fn_t... 156.666667

10 rows × 29 columns

Create a column 'first_genre'. What is the distribution of values in this column?


In [74]:
# Solution here

movies['first_genre'] = movies['genres'].str.split('|').str[0]
movies.head()


Out[74]:
movie_title title_year budget gross genres language country movie_facebook_likes imdb_score num_voted_users ... cast_total_facebook_likes color duration plot_keywords content_rating aspect_ratio facenumber_in_poster movie_imdb_link profitability first_genre
0 Avatar 2009.0 237000000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi English USA 33000 7.9 886204 ... 4834 Color 178.0 avatar|future|marine|native|paraplegic PG-13 1.78 0.0 http://www.imdb.com/title/tt0499549/?ref_=fn_t... 3.208885 Action
1 Pirates of the Caribbean: At World's End 2007.0 300000000.0 309404152.0 Action|Adventure|Fantasy English USA 0 7.1 471220 ... 48350 Color 169.0 goddess|marriage ceremony|marriage proposal|pi... PG-13 2.35 0.0 http://www.imdb.com/title/tt0449088/?ref_=fn_t... 1.031347 Action
2 Spectre 2015.0 245000000.0 200074175.0 Action|Adventure|Thriller English UK 85000 6.8 275868 ... 11700 Color 148.0 bomb|espionage|sequel|spy|terrorist PG-13 2.35 1.0 http://www.imdb.com/title/tt2379713/?ref_=fn_t... 0.816629 Action
3 The Dark Knight Rises 2012.0 250000000.0 448130642.0 Action|Thriller English USA 164000 8.5 1144337 ... 106759 Color 164.0 deception|imprisonment|lawlessness|police offi... PG-13 2.35 0.0 http://www.imdb.com/title/tt1345836/?ref_=fn_t... 1.792523 Action
4 Star Wars: Episode VII - The Force Awakens NaN NaN NaN Documentary NaN NaN 0 7.1 8 ... 143 NaN NaN NaN NaN NaN 0.0 http://www.imdb.com/title/tt5289954/?ref_=fn_t... NaN Documentary

5 rows × 30 columns

Visualizing data


In [75]:
heroes


Out[75]:
hero strength health height is_hungry strength_per_cm code effective_strength
0 Batman 500 200 180 True 2.222222 Ba 0
1 Robin 300 200 170 False 1.176471 Ro 200
2 Spiderman 400 200 175 False 1.714286 Sp 300
3 Robocop 500 200 190 True 2.105263 Ro 0
4 Terminator 600 200 185 True 2.702703 Te 0

Basic stats


In [76]:
heroes.describe()


Out[76]:
strength health height strength_per_cm effective_strength
count 5.000000 5.0 5.000000 5.000000 5.000000
mean 460.000000 200.0 180.000000 1.984189 100.000000
std 114.017543 0.0 7.905694 0.572896 141.421356
min 300.000000 200.0 170.000000 1.176471 0.000000
25% 400.000000 200.0 175.000000 1.714286 0.000000
50% 500.000000 200.0 180.000000 2.105263 0.000000
75% 500.000000 200.0 185.000000 2.222222 200.000000
max 600.000000 200.0 190.000000 2.702703 300.000000

Plotting


In [77]:
%matplotlib inline

In [78]:
pd.Series([1, 2, 3]).plot()


Out[78]:
<matplotlib.axes._subplots.AxesSubplot at 0x11104a0b8>

In [79]:
pd.Series([1, 2, 3], index=['Batman', 'Robin', 'Rambo']).plot()


Out[79]:
<matplotlib.axes._subplots.AxesSubplot at 0x1110dccf8>

In [80]:
pd.Series([1, 2, 3], index=['Batman', 'Robin', 'Rambo']).plot(kind='bar')


Out[80]:
<matplotlib.axes._subplots.AxesSubplot at 0x11111d160>

In [81]:
pd.Series([1, 2, 3], index=['Batman', 'Robin', 'Rambo']).plot(
    kind='bar',
    figsize=(15, 6)
)


Out[81]:
<matplotlib.axes._subplots.AxesSubplot at 0x111296eb8>

In [82]:
pd.Series([1, 2, 3], index=['Batman', 'Robin', 'Rambo']).plot(kind='pie')


Out[82]:
<matplotlib.axes._subplots.AxesSubplot at 0x111473240>

In [83]:
heroes.plot()


Out[83]:
<matplotlib.axes._subplots.AxesSubplot at 0x111378748>

In [84]:
indexed_heroes = heroes.set_index('hero')
indexed_heroes


Out[84]:
strength health height is_hungry strength_per_cm code effective_strength
hero
Batman 500 200 180 True 2.222222 Ba 0
Robin 300 200 170 False 1.176471 Ro 200
Spiderman 400 200 175 False 1.714286 Sp 300
Robocop 500 200 190 True 2.105263 Ro 0
Terminator 600 200 185 True 2.702703 Te 0

In [85]:
indexed_heroes.plot()


Out[85]:
<matplotlib.axes._subplots.AxesSubplot at 0x111645be0>

In [86]:
indexed_heroes.plot(kind='barh')


Out[86]:
<matplotlib.axes._subplots.AxesSubplot at 0x1116f1b70>

In [87]:
indexed_heroes.plot(kind='bar', subplots=True, figsize=(15, 15))


Out[87]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x11189ce80>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x111a19fd0>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x111957748>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x111ae2710>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x111b43780>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x111b437b8>], dtype=object)

In [88]:
indexed_heroes[['height', 'strength']].plot(kind='bar')


Out[88]:
<matplotlib.axes._subplots.AxesSubplot at 0x111d724e0>

In [89]:
heroes.plot(x='hero', y=['height', 'strength'], kind='bar')


Out[89]:
<matplotlib.axes._subplots.AxesSubplot at 0x111aff828>

In [90]:
# alternative to subplots
heroes.plot(
    x='hero',
    y=['height', 'strength'],
    kind='bar',
    secondary_y='strength',
    figsize=(10,8)
)


Out[90]:
<matplotlib.axes._subplots.AxesSubplot at 0x1117922b0>

In [91]:
heroes.plot(
    x='hero',
    y=['height', 'strength'],
    kind='bar',
    secondary_y='strength',
    title='Super plot of super heroes',
    figsize=(10,8)
)


Out[91]:
<matplotlib.axes._subplots.AxesSubplot at 0x111d80860>

Histogram


In [92]:
heroes.hist(figsize=(10, 10))


Out[92]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x112a11550>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x112bfcda0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x112a567b8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x112cc2668>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x112d217f0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x112d21828>]], dtype=object)

In [93]:
heroes.hist(
    figsize=(10, 10),
    bins=2
)


Out[93]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x112d47780>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1131ce128>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x1131deb70>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x113292f98>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x1132f66a0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1132f66d8>]], dtype=object)

DataFrames everywhere.. are easy to plot


In [94]:
heroes.describe()['strength'].plot(kind='bar')


Out[94]:
<matplotlib.axes._subplots.AxesSubplot at 0x113485c50>

EXERCISE

Create a chart presenting grosses of movies directed by Christopher Nolan


In [95]:
# Solution here

nolan_movies = movies[movies['director_name'] == 'Christopher Nolan']
nolan_movies = nolan_movies.set_index('movie_title')

nolan_movies['gross'].plot(kind='bar')


Out[95]:
<matplotlib.axes._subplots.AxesSubplot at 0x11372fa90>

What are typical durations of the movies?


In [96]:
# Solution here

movies['duration'].hist(bins=25)


Out[96]:
<matplotlib.axes._subplots.AxesSubplot at 0x112cdfc18>

What is percentage distribution of first genre? (cake)


In [97]:
# Solution here

movies['first_genre'].value_counts().plot(
    kind='pie',
    figsize=(15,15)
)


Out[97]:
<matplotlib.axes._subplots.AxesSubplot at 0x113a7dcc0>

Aggregation

Grouping


In [98]:
movie_heroes = pd.DataFrame({
    'hero': ['Batman', 'Robin', 'Spiderman', 'Robocop', 'Lex Luthor', 'Dr Octopus'],
    'movie': ['Batman', 'Batman', 'Spiderman', 'Robocop', 'Spiderman', 'Spiderman'],
    'strength': [400, 100, 400, 560, 89, 300],
    'speed': [100, 10, 200, 1, 20, None],
})
movie_heroes = movie_heroes.set_index('hero')
movie_heroes


Out[98]:
movie speed strength
hero
Batman Batman 100.0 400
Robin Batman 10.0 100
Spiderman Spiderman 200.0 400
Robocop Robocop 1.0 560
Lex Luthor Spiderman 20.0 89
Dr Octopus Spiderman NaN 300

In [99]:
movie_heroes.groupby('movie')


Out[99]:
<pandas.core.groupby.DataFrameGroupBy object at 0x113bb5e10>

In [100]:
list(movie_heroes.groupby('movie'))


Out[100]:
[('Batman',          movie  speed  strength
  hero                           
  Batman  Batman  100.0       400
  Robin   Batman   10.0       100),
 ('Robocop',            movie  speed  strength
  hero                             
  Robocop  Robocop    1.0       560),
 ('Spiderman',                 movie  speed  strength
  hero                                  
  Spiderman   Spiderman  200.0       400
  Lex Luthor  Spiderman   20.0        89
  Dr Octopus  Spiderman    NaN       300)]

Aggregating


In [101]:
movie_heroes.groupby('movie').size()


Out[101]:
movie
Batman       2
Robocop      1
Spiderman    3
dtype: int64

In [102]:
movie_heroes.groupby('movie').count()


Out[102]:
speed strength
movie
Batman 2 2
Robocop 1 1
Spiderman 2 3

In [103]:
movie_heroes.groupby('movie')['speed'].sum()


Out[103]:
movie
Batman       110.0
Robocop        1.0
Spiderman    220.0
Name: speed, dtype: float64

In [104]:
movie_heroes.groupby('movie').mean()


Out[104]:
speed strength
movie
Batman 55.0 250
Robocop 1.0 560
Spiderman 110.0 263

In [105]:
movie_heroes.groupby('movie').apply(
    lambda group: group['strength'] / group['strength'].max()
)


Out[105]:
movie      hero      
Batman     Batman        1.0000
           Robin         0.2500
Robocop    Robocop       1.0000
Spiderman  Spiderman     1.0000
           Lex Luthor    0.2225
           Dr Octopus    0.7500
Name: strength, dtype: float64

In [106]:
movie_heroes.groupby('movie').agg({
    'speed': 'mean',
    'strength': 'max',
})


Out[106]:
speed strength
movie
Batman 55.0 400
Robocop 1.0 560
Spiderman 110.0 400

In [107]:
movie_heroes = movie_heroes.reset_index()
movie_heroes


Out[107]:
hero movie speed strength
0 Batman Batman 100.0 400
1 Robin Batman 10.0 100
2 Spiderman Spiderman 200.0 400
3 Robocop Robocop 1.0 560
4 Lex Luthor Spiderman 20.0 89
5 Dr Octopus Spiderman NaN 300

In [108]:
movie_heroes.groupby(['movie', 'hero']).mean()


Out[108]:
speed strength
movie hero
Batman Batman 100.0 400
Robin 10.0 100
Robocop Robocop 1.0 560
Spiderman Dr Octopus NaN 300
Lex Luthor 20.0 89
Spiderman 200.0 400

EXERCISE

What was maximal gross in each year?


In [109]:
# Solution here

movies.groupby('title_year')['gross'].max().tail(10).plot(kind='bar')


Out[109]:
<matplotlib.axes._subplots.AxesSubplot at 0x113aad4e0>

Which director earns the most on average?


In [110]:
# Solution here

(
    movies
    .groupby('director_name')
    ['gross']
    .mean()
    .sort_values(ascending=False)
    .head(3)
)


Out[110]:
director_name
Joss Whedon    432721657.0
Lee Unkrich    414984497.0
Chris Buck     400736600.0
Name: gross, dtype: float64

Index related operations

Data alignment on Index


In [111]:
movie_heroes


Out[111]:
hero movie speed strength
0 Batman Batman 100.0 400
1 Robin Batman 10.0 100
2 Spiderman Spiderman 200.0 400
3 Robocop Robocop 1.0 560
4 Lex Luthor Spiderman 20.0 89
5 Dr Octopus Spiderman NaN 300

In [112]:
apetite = pd.DataFrame([
    dict(hero='Spiderman', is_hungry=True),
    dict(hero='Robocop', is_hungry=False)
])
apetite


Out[112]:
hero is_hungry
0 Spiderman True
1 Robocop False

In [113]:
movie_heroes['is_hungry'] = apetite['is_hungry']
movie_heroes


Out[113]:
hero movie speed strength is_hungry
0 Batman Batman 100.0 400 True
1 Robin Batman 10.0 100 False
2 Spiderman Spiderman 200.0 400 NaN
3 Robocop Robocop 1.0 560 NaN
4 Lex Luthor Spiderman 20.0 89 NaN
5 Dr Octopus Spiderman NaN 300 NaN

In [114]:
apetite.index = [2, 3]

In [115]:
movie_heroes['is_hungry'] = apetite['is_hungry']
movie_heroes


Out[115]:
hero movie speed strength is_hungry
0 Batman Batman 100.0 400 NaN
1 Robin Batman 10.0 100 NaN
2 Spiderman Spiderman 200.0 400 True
3 Robocop Robocop 1.0 560 False
4 Lex Luthor Spiderman 20.0 89 NaN
5 Dr Octopus Spiderman NaN 300 NaN

Setting index


In [116]:
indexed_movie_heroes = movie_heroes.set_index('hero')
indexed_movie_heroes


Out[116]:
movie speed strength is_hungry
hero
Batman Batman 100.0 400 NaN
Robin Batman 10.0 100 NaN
Spiderman Spiderman 200.0 400 True
Robocop Robocop 1.0 560 False
Lex Luthor Spiderman 20.0 89 NaN
Dr Octopus Spiderman NaN 300 NaN

In [117]:
indexed_apetite = apetite.set_index('hero')
indexed_apetite


Out[117]:
is_hungry
hero
Spiderman True
Robocop False

In [118]:
# and alignment works well automagically..

indexed_movie_heroes['is_hungry'] = indexed_apetite['is_hungry']

In [119]:
indexed_movie_heroes


Out[119]:
movie speed strength is_hungry
hero
Batman Batman 100.0 400 NaN
Robin Batman 10.0 100 NaN
Spiderman Spiderman 200.0 400 True
Robocop Robocop 1.0 560 False
Lex Luthor Spiderman 20.0 89 NaN
Dr Octopus Spiderman NaN 300 NaN

Merging two DFs (a'la SQL join)


In [120]:
movie_heroes


Out[120]:
hero movie speed strength is_hungry
0 Batman Batman 100.0 400 NaN
1 Robin Batman 10.0 100 NaN
2 Spiderman Spiderman 200.0 400 True
3 Robocop Robocop 1.0 560 False
4 Lex Luthor Spiderman 20.0 89 NaN
5 Dr Octopus Spiderman NaN 300 NaN

In [121]:
apetite


Out[121]:
hero is_hungry
2 Spiderman True
3 Robocop False

In [122]:
# couple of other arguments available here

pd.merge(
    movie_heroes[['hero', 'speed']],
    apetite,
    on=['hero'],
    how='outer'
)


Out[122]:
hero speed is_hungry
0 Batman 100.0 NaN
1 Robin 10.0 NaN
2 Spiderman 200.0 True
3 Robocop 1.0 False
4 Lex Luthor 20.0 NaN
5 Dr Octopus NaN NaN

DateTime operations


In [123]:
spiderman_meals = pd.DataFrame([
        dict(time='2016-10-15 10:00', calories=300),
        dict(time='2016-10-15 13:00', calories=900),
        dict(time='2016-10-15 15:00', calories=1200),
        dict(time='2016-10-15 21:00', calories=700),
        dict(time='2016-10-16 07:00', calories=1600),
        dict(time='2016-10-16 13:00', calories=600),
        dict(time='2016-10-16 16:00', calories=900),
        dict(time='2016-10-16 20:00', calories=500),
        dict(time='2016-10-16 21:00', calories=300),
        dict(time='2016-10-17 08:00', calories=900),
    ])
spiderman_meals


Out[123]:
calories time
0 300 2016-10-15 10:00
1 900 2016-10-15 13:00
2 1200 2016-10-15 15:00
3 700 2016-10-15 21:00
4 1600 2016-10-16 07:00
5 600 2016-10-16 13:00
6 900 2016-10-16 16:00
7 500 2016-10-16 20:00
8 300 2016-10-16 21:00
9 900 2016-10-17 08:00

In [124]:
spiderman_meals.dtypes


Out[124]:
calories     int64
time        object
dtype: object

In [125]:
spiderman_meals['time'] = pd.to_datetime(spiderman_meals['time'])
spiderman_meals.dtypes


Out[125]:
calories             int64
time        datetime64[ns]
dtype: object

In [126]:
spiderman_meals


Out[126]:
calories time
0 300 2016-10-15 10:00:00
1 900 2016-10-15 13:00:00
2 1200 2016-10-15 15:00:00
3 700 2016-10-15 21:00:00
4 1600 2016-10-16 07:00:00
5 600 2016-10-16 13:00:00
6 900 2016-10-16 16:00:00
7 500 2016-10-16 20:00:00
8 300 2016-10-16 21:00:00
9 900 2016-10-17 08:00:00

In [127]:
spiderman_meals = spiderman_meals.set_index('time')
spiderman_meals


Out[127]:
calories
time
2016-10-15 10:00:00 300
2016-10-15 13:00:00 900
2016-10-15 15:00:00 1200
2016-10-15 21:00:00 700
2016-10-16 07:00:00 1600
2016-10-16 13:00:00 600
2016-10-16 16:00:00 900
2016-10-16 20:00:00 500
2016-10-16 21:00:00 300
2016-10-17 08:00:00 900

In [128]:
spiderman_meals.index


Out[128]:
DatetimeIndex(['2016-10-15 10:00:00', '2016-10-15 13:00:00',
               '2016-10-15 15:00:00', '2016-10-15 21:00:00',
               '2016-10-16 07:00:00', '2016-10-16 13:00:00',
               '2016-10-16 16:00:00', '2016-10-16 20:00:00',
               '2016-10-16 21:00:00', '2016-10-17 08:00:00'],
              dtype='datetime64[ns]', name='time', freq=None)

Filtering


In [129]:
spiderman_meals["2016-10-15"]


Out[129]:
calories
time
2016-10-15 10:00:00 300
2016-10-15 13:00:00 900
2016-10-15 15:00:00 1200
2016-10-15 21:00:00 700

In [130]:
spiderman_meals["2016-10-16 10:00":]


Out[130]:
calories
time
2016-10-16 13:00:00 600
2016-10-16 16:00:00 900
2016-10-16 20:00:00 500
2016-10-16 21:00:00 300
2016-10-17 08:00:00 900

In [131]:
spiderman_meals["2016-10-16 10:00":"2016-10-16 20:00"]


Out[131]:
calories
time
2016-10-16 13:00:00 600
2016-10-16 16:00:00 900
2016-10-16 20:00:00 500

In [132]:
spiderman_meals["2016-10"]


Out[132]:
calories
time
2016-10-15 10:00:00 300
2016-10-15 13:00:00 900
2016-10-15 15:00:00 1200
2016-10-15 21:00:00 700
2016-10-16 07:00:00 1600
2016-10-16 13:00:00 600
2016-10-16 16:00:00 900
2016-10-16 20:00:00 500
2016-10-16 21:00:00 300
2016-10-17 08:00:00 900

Resampling (downsampling and upsampling)


In [133]:
spiderman_meals.resample('1D').sum()


Out[133]:
calories
time
2016-10-15 3100
2016-10-16 3900
2016-10-17 900

In [134]:
spiderman_meals.resample('1H').mean()


Out[134]:
calories
time
2016-10-15 10:00:00 300.0
2016-10-15 11:00:00 NaN
2016-10-15 12:00:00 NaN
2016-10-15 13:00:00 900.0
2016-10-15 14:00:00 NaN
2016-10-15 15:00:00 1200.0
2016-10-15 16:00:00 NaN
2016-10-15 17:00:00 NaN
2016-10-15 18:00:00 NaN
2016-10-15 19:00:00 NaN
2016-10-15 20:00:00 NaN
2016-10-15 21:00:00 700.0
2016-10-15 22:00:00 NaN
2016-10-15 23:00:00 NaN
2016-10-16 00:00:00 NaN
2016-10-16 01:00:00 NaN
2016-10-16 02:00:00 NaN
2016-10-16 03:00:00 NaN
2016-10-16 04:00:00 NaN
2016-10-16 05:00:00 NaN
2016-10-16 06:00:00 NaN
2016-10-16 07:00:00 1600.0
2016-10-16 08:00:00 NaN
2016-10-16 09:00:00 NaN
2016-10-16 10:00:00 NaN
2016-10-16 11:00:00 NaN
2016-10-16 12:00:00 NaN
2016-10-16 13:00:00 600.0
2016-10-16 14:00:00 NaN
2016-10-16 15:00:00 NaN
2016-10-16 16:00:00 900.0
2016-10-16 17:00:00 NaN
2016-10-16 18:00:00 NaN
2016-10-16 19:00:00 NaN
2016-10-16 20:00:00 500.0
2016-10-16 21:00:00 300.0
2016-10-16 22:00:00 NaN
2016-10-16 23:00:00 NaN
2016-10-17 00:00:00 NaN
2016-10-17 01:00:00 NaN
2016-10-17 02:00:00 NaN
2016-10-17 03:00:00 NaN
2016-10-17 04:00:00 NaN
2016-10-17 05:00:00 NaN
2016-10-17 06:00:00 NaN
2016-10-17 07:00:00 NaN
2016-10-17 08:00:00 900.0

In [135]:
spiderman_meals.resample('1H').ffill()


Out[135]:
calories
time
2016-10-15 10:00:00 300
2016-10-15 11:00:00 300
2016-10-15 12:00:00 300
2016-10-15 13:00:00 900
2016-10-15 14:00:00 900
2016-10-15 15:00:00 1200
2016-10-15 16:00:00 1200
2016-10-15 17:00:00 1200
2016-10-15 18:00:00 1200
2016-10-15 19:00:00 1200
2016-10-15 20:00:00 1200
2016-10-15 21:00:00 700
2016-10-15 22:00:00 700
2016-10-15 23:00:00 700
2016-10-16 00:00:00 700
2016-10-16 01:00:00 700
2016-10-16 02:00:00 700
2016-10-16 03:00:00 700
2016-10-16 04:00:00 700
2016-10-16 05:00:00 700
2016-10-16 06:00:00 700
2016-10-16 07:00:00 1600
2016-10-16 08:00:00 1600
2016-10-16 09:00:00 1600
2016-10-16 10:00:00 1600
2016-10-16 11:00:00 1600
2016-10-16 12:00:00 1600
2016-10-16 13:00:00 600
2016-10-16 14:00:00 600
2016-10-16 15:00:00 600
2016-10-16 16:00:00 900
2016-10-16 17:00:00 900
2016-10-16 18:00:00 900
2016-10-16 19:00:00 900
2016-10-16 20:00:00 500
2016-10-16 21:00:00 300
2016-10-16 22:00:00 300
2016-10-16 23:00:00 300
2016-10-17 00:00:00 300
2016-10-17 01:00:00 300
2016-10-17 02:00:00 300
2016-10-17 03:00:00 300
2016-10-17 04:00:00 300
2016-10-17 05:00:00 300
2016-10-17 06:00:00 300
2016-10-17 07:00:00 300
2016-10-17 08:00:00 900

In [136]:
spiderman_meals.resample('1D').first()


Out[136]:
calories
time
2016-10-15 300
2016-10-16 1600
2016-10-17 900

EXERCISE

Read Star Wars: The Force Awakens's tweets from data/theforceawakens_tweets.csv. Create DateTimeIndex from created_at column.


In [137]:
# Solution here

force_awakens_tweets = pd.read_csv(
    'data/theforceawakens_tweets.csv',
    parse_dates=['created_at'],
    index_col='created_at'
)
force_awakens_tweets.head()


Out[137]:
tweet_id
created_at
2016-09-21 20:02:48 778685932769730565
2016-09-21 20:04:26 778686345170477057
2016-09-21 20:05:11 778686534207713280
2016-09-21 20:08:19 778687322204233728
2016-09-21 20:08:55 778687471961767937

How many tweets did Star Wars: The Force Awakens have in each of last days?


In [138]:
# Solution here

force_awakens_tweets.resample('1D').count()


Out[138]:
tweet_id
created_at
2016-09-21 153
2016-09-22 747
2016-09-23 1313
2016-09-24 898
2016-09-25 954
2016-09-26 791
2016-09-27 789
2016-09-28 695
2016-09-29 919
2016-09-30 944
2016-10-01 638

In [139]:
# Solution here

(
    force_awakens_tweets
    .resample('4H')
    .count()
    .plot(figsize=(15, 5))
)
    
(
    force_awakens_tweets["2016-09-29":]
    .resample('1H')
    .count()
    .plot(figsize=(15, 5))
)


Out[139]:
<matplotlib.axes._subplots.AxesSubplot at 0x114644208>

Advanced topics + Advanced exercises

Filling missing data


In [140]:
heroes_with_missing = pd.DataFrame([
        ('Batman', None, None),
        ('Robin', None, 100),
        ('Spiderman', 400, 90),
        ('Robocop', 500, 95),
        ('Terminator', 600, None)
    ], columns=['hero', 'strength', 'health'])
heroes_with_missing


Out[140]:
hero strength health
0 Batman NaN NaN
1 Robin NaN 100.0
2 Spiderman 400.0 90.0
3 Robocop 500.0 95.0
4 Terminator 600.0 NaN

In [141]:
heroes_with_missing.dropna()


Out[141]:
hero strength health
2 Spiderman 400.0 90.0
3 Robocop 500.0 95.0

In [142]:
heroes_with_missing.fillna(0)


Out[142]:
hero strength health
0 Batman 0.0 0.0
1 Robin 0.0 100.0
2 Spiderman 400.0 90.0
3 Robocop 500.0 95.0
4 Terminator 600.0 0.0

In [143]:
heroes_with_missing.fillna({'strength': 10, 'health': 20})


Out[143]:
hero strength health
0 Batman 10.0 20.0
1 Robin 10.0 100.0
2 Spiderman 400.0 90.0
3 Robocop 500.0 95.0
4 Terminator 600.0 20.0

In [144]:
heroes_with_missing.fillna(heroes_with_missing.min())


Out[144]:
hero strength health
0 Batman 400.0 90.0
1 Robin 400.0 100.0
2 Spiderman 400.0 90.0
3 Robocop 500.0 95.0
4 Terminator 600.0 90.0

In [145]:
heroes_with_missing.fillna(heroes_with_missing.median())


Out[145]:
hero strength health
0 Batman 500.0 95.0
1 Robin 500.0 100.0
2 Spiderman 400.0 90.0
3 Robocop 500.0 95.0
4 Terminator 600.0 95.0

Scikit-learn


In [146]:
pd.DataFrame({'x': [1, 2], 'y': [10, 20]}).plot(x='x', y='y', kind='scatter')


Out[146]:
<matplotlib.axes._subplots.AxesSubplot at 0x113cc9710>

In [147]:
from sklearn.linear_model import LinearRegression

X=[ [1], [2] ]
y=[ 10, 20 ]

clf = LinearRegression()
clf.fit(X, y)


/Users/jacekrzeszutek/.virtualenvs/first-steps-with-pandas-3.6.1/lib/python3.6/site-packages/scipy/linalg/basic.py:1018: RuntimeWarning: internal gelsd driver lwork query error, required iwork dimension not returned. This is likely the result of LAPACK bug 0038, fixed in LAPACK 3.2.2 (released July 21, 2010). Falling back to 'gelss' driver.
  warnings.warn(mesg, RuntimeWarning)
Out[147]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [148]:
clf.predict([ [0.5], [2], [4] ])


Out[148]:
array([  5.,  20.,  40.])

In [149]:
X = np.array([ [1], [2] ])
y = np.array([ 10, 20 ])

X


Out[149]:
array([[1],
       [2]])

In [150]:
clf = LinearRegression()
clf.fit(X, y)


Out[150]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [151]:
clf.predict( np.array([ [0.5], [2], [4] ]) )


Out[151]:
array([  5.,  20.,  40.])

In [152]:
train_df = pd.DataFrame([
        (1, 10),
        (2, 20),
    ], columns=['x', 'y'])

train_df


Out[152]:
x y
0 1 10
1 2 20

In [153]:
clf = LinearRegression()
clf.fit(train_df[['x']], train_df['y'])


Out[153]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [154]:
clf.predict([[0.5]])


Out[154]:
array([ 5.])

In [155]:
test_df = pd.DataFrame({'x': [0.5, 1.5, 4]})
test_df


Out[155]:
x
0 0.5
1 1.5
2 4.0

In [156]:
clf.predict(test_df[['x']])


Out[156]:
array([  5.,  15.,  40.])

In [157]:
test_df['y'] = clf.predict(test_df[['x']])
test_df


Out[157]:
x y
0 0.5 5.0
1 1.5 15.0
2 4.0 40.0

In [158]:
train_df['color'] = 'blue'
test_df['color'] = 'red'

all_df = train_df.append(test_df)
all_df.plot(x='x', y='y', kind='scatter', figsize=(10, 8), color=all_df['color'])


Out[158]:
<matplotlib.axes._subplots.AxesSubplot at 0x114948d30>

EXERCISE

Integration with scikit-learn: Create a model that tries to predict gross of movie. Use any features of the movies dataset.


In [159]:
# Solution here

from sklearn.linear_model import LinearRegression

FEATURES = ['num_voted_users', 'imdb_score']
TARGET = 'gross'

movies_with_data = movies[FEATURES + [TARGET]].dropna()

X = movies_with_data[FEATURES].values
y = movies_with_data[TARGET].values

clf = LinearRegression()
clf.fit(X, y)

clf.predict([
    [800000, 8.0],
    [400000, 8.0],
    [400000, 4.0],
    [ 40000, 8.0],
])


Out[159]:
array([  2.62239503e+08,   1.33963329e+08,   1.64074385e+08,
         1.85147715e+07])

Create a method discovering movies with plot keywords similar to the given list of keywords (i.e. ['magic', 'harry', 'wizard'])


In [160]:
# Solution here

def discover_similar_plot(target_keywords, threshold=0.5):
    movies_with_plot = movies.dropna(
        subset=['plot_keywords']
    ).copy()
    
    movies_with_plot['plot_keywords_set'] = movies_with_plot[
        'plot_keywords'
    ].str.split('|').map(set)
    
    movies_with_plot['match_count'] = movies_with_plot[
        'plot_keywords_set'
    ].map(
        lambda keywords: len(keywords.intersection(target_keywords))
    )
    
    return movies_with_plot[
        (movies_with_plot['match_count'] >= threshold*len(target_keywords))
    ]

discover_similar_plot(['magic', 'harry', 'wizard'])['movie_title']


Out[160]:
114    Harry Potter and the Order of the Phoenix
115          Harry Potter and the Goblet of Fire
195     Harry Potter and the Prisoner of Azkaban
Name: movie_title, dtype: object

Integration with Flask

In the file flask_exercise.py you'll find the scaffolding for Flask app.
Create endpoints returning:

  • all movie titles available in the movies dataset
  • 10 worst rated movies ever
  • 10 best rated (imdb_score) movies in a given year

In [161]:
# Solution in flask_exercise.py

Thank you for your effort! We hope that you enjoyed the journey!

Any questions?

Any feedback?

You can reach us:

  • rzeszutekjacek at the most popular service gmail.com
  • mateuszflieger at as you would expect gmail.com

http://66.media.tumblr.com/b0e67112232adb68180fe7b988304abd/tumblr_inline_nw063rleWO1tp3b6e_1280.jpg