Welcome to the 'First steps with pandas'!

After this workshop you can (hopefully) call yourselves Data Scientists!

Gitter: https://gitter.im/first-steps-with-pandas-workshop

Before coding, let's check whether we have proper versions of libraries



In [1]:

    
import platform
print('Python: ' + platform.python_version())

import numpy as np
print('numpy: ' + np.__version__)

import pandas as pd
print('pandas: ' + pd.__version__)

import scipy
print('scipy: ' + scipy.__version__)

import sklearn
print('scikit-learn: ' + sklearn.__version__)

import matplotlib as plt
print('matplotlib: ' + plt.__version__)

import flask
print('flask: ' + flask.__version__)









    



Python: 3.6.1
numpy: 1.13.0
pandas: 0.20.2
scipy: 0.19.1
scikit-learn: 0.18.2
matplotlib: 2.0.2
flask: 0.12.2

What is pandas?

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Why to use it?

fast development
no reinventing wheel
fewer mistakes/bugs



In [2]:

    
# In case of no Internet, use:
# pd.read_json('data/cached_Python.json')

(
    pd.read_json('https://raw.githubusercontent.com/Nozdi/first-steps-with-pandas-workshop/master/data/cached_python.json')
    .resample('1W')
    .mean()
    ['daily_views']
)









    Out[2]:





2016-01-03    2767.666667
2016-01-10    4626.571429
2016-01-17    4871.857143
2016-01-24    5249.333333
Freq: W-SUN, Name: daily_views, dtype: float64

It is easy to pick up

few simple concepts that are very powerful
easy, standardized API
good code readability

It is reasonably fast



In [3]:

    
some_data = [ list(range(1,100)) for x in range(1,1000) ]
some_df = pd.DataFrame(some_data)


def standard_way(data):
    return [[col*2 for col in row] for row in data]


def pandas_way(df):
    return df * 2



In [4]:

    
%timeit standard_way(some_data)









    



6.22 ms ± 505 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)



In [5]:

    
%timeit pandas_way(some_df)









    



206 µs ± 5.96 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

It has a very cool name.

https://c1.staticflickr.com/5/4058/4466498508_35a8172ac1_b.jpg

Library highlights

http://pandas.pydata.org/#library-highlights
http://pandas.pydata.org/pandas-docs/stable/api.html

Data structures

Series

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).



In [6]:

    
strengths = pd.Series([400, 200, 300, 400, 500])
strengths









    Out[6]:





0    400
1    200
2    300
3    400
4    500
dtype: int64



In [7]:

    
names = pd.Series(["Batman", "Robin", "Spiderman", "Robocop", "Terminator"])
names









    Out[7]:





0        Batman
1         Robin
2     Spiderman
3       Robocop
4    Terminator
dtype: object

DataFrame

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects.

Creating



In [8]:

    
heroes = pd.DataFrame({
    'hero': names,
    'strength': strengths
})
heroes



In [9]:

    
other_heroes = pd.DataFrame([
    dict(hero="Hercules", strength=800),
    dict(hero="Conan")
])
other_heroes









    Out[9]:







  
    
      
      hero
      strength
    
  
  
    
      0
      Hercules
      800.0
    
    
      1
      Conan
      NaN



In [10]:

    
another_heroes = pd.DataFrame([
    pd.Series(["Wonder Woman", 10, 3], index=["hero", "strength", "cookies"]),
    pd.Series(["Xena", 20, 0], index=["hero", "strength", "cookies"])
])
another_heroes









    Out[10]:







  
    
      
      hero
      strength
      cookies
    
  
  
    
      0
      Wonder Woman
      10
      3
    
    
      1
      Xena
      20
      0

Meta data



In [11]:

    
another_heroes.columns









    Out[11]:





Index(['hero', 'strength', 'cookies'], dtype='object')



In [12]:

    
another_heroes.shape









    Out[12]:





(2, 3)



In [13]:

    
another_heroes.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
hero        2 non-null object
strength    2 non-null int64
cookies     2 non-null int64
dtypes: int64(2), object(1)
memory usage: 128.0+ bytes

Selecting

[string] --> Series [ list of strings ] --> DataFrame



In [14]:

    
another_heroes['cookies']









    Out[14]:





0    3
1    0
Name: cookies, dtype: int64



In [15]:

    
another_heroes.cookies









    Out[15]:





0    3
1    0
Name: cookies, dtype: int64



In [16]:

    
another_heroes[ ['hero', 'cookies'] ]









    Out[16]:







  
    
      
      hero
      cookies
    
  
  
    
      0
      Wonder Woman
      3
    
    
      1
      Xena
      0

Chaining (most of operations on DataFrame returns new DataFrame or Series)



In [17]:

    
another_heroes[['hero', 'cookies']][['cookies']]



In [18]:

    
another_heroes[['hero', 'cookies']][['cookies']]['cookies']









    Out[18]:





0    3
1    0
Name: cookies, dtype: int64

EXERCISE

Create DataFrame presented below in 3 different ways

                                         movie_title  imdb_score
0                                            Avatar          7.9
1          Pirates of the Caribbean: At World's End          7.1
2                                           Spectre          6.8

Help: http://pandas.pydata.org/pandas-docs/stable/dsintro.html#from-dict-of-series-or-dicts

With dict of Series



In [19]:

    
# Solution here

titles = pd.Series(["Avatar", "Pirates of the Caribbean: At World's End", "Spectre"])
imdb_scores = pd.Series([7.9, 7.1, 6.8])

pd.DataFrame({'movie_title': titles, 'imdb_score': imdb_scores})









    Out[19]:







  
    
      
      imdb_score
      movie_title
    
  
  
    
      0
      7.9
      Avatar
    
    
      1
      7.1
      Pirates of the Caribbean: At World's End
    
    
      2
      6.8
      Spectre

With list of dicts



In [20]:

    
# Solution here

pd.DataFrame([
    dict(movie_title="Avatar", imdb_score=7.9),
    dict(movie_title="Pirates of the Caribbean: At World's End", imdb_score=7.1),
    dict(movie_title="Spectre", imdb_score=6.8),
])









    Out[20]:







  
    
      
      imdb_score
      movie_title
    
  
  
    
      0
      7.9
      Avatar
    
    
      1
      7.1
      Pirates of the Caribbean: At World's End
    
    
      2
      6.8
      Spectre

With list of Series



In [21]:

    
# Solution here

pd.DataFrame([
    pd.Series(["Avatar", 7.9], index=['movie_title', 'imdb_score']),
    pd.Series(["Pirates of the Caribbean: At World's End", 7.1], index=['movie_title', 'imdb_score']),
    pd.Series(["Spectre", 6.8], index=['movie_title', 'imdb_score'])
])









    Out[21]:







  
    
      
      movie_title
      imdb_score
    
  
  
    
      0
      Avatar
      7.9
    
    
      1
      Pirates of the Caribbean: At World's End
      7.1
    
    
      2
      Spectre
      6.8

I/O part I

Reading popular formats / data sources



In [22]:

    
# Uncomment and press tab..
# pd.read_
# SQL, csv, hdf



In [23]:

    
# pd.read_csv?



In [24]:

    
# executing bash in jupyter notebook
!head -c 500 data/cached_python.json









    



{"daily_views":{"1451606400000":2701,"1451692800000":2829,"1451779200000":2773,"1451865600000":5072,"1451952000000":5513,"1452038400000":5233,"1452124800000":5518,"1452211200000":5312,"1452297600000":2749,"1452384000000":2989,"1452470400000":5539,"1452556800000":6161,"1452643200000":6175,"1452729600000":5545,"1452816000000":4833,"1452902400000":2858,"1452988800000":2992,"1453075200000":4673,"1453161600000":5625,"1453248000000":5450},"month":{"1451606400000":201601,"1451692800000":201601,"1451779



In [25]:

    
pd.read_json('data/cached_python.json')









    Out[25]:







  
    
      
      daily_views
      month
      project
      rank
      title
    
  
  
    
      2016-01-01
      2701
      201601
      en
      2251
      Python_(programming_language)
    
    
      2016-01-02
      2829
      201601
      en
      2251
      Python_(programming_language)
    
    
      2016-01-03
      2773
      201601
      en
      2251
      Python_(programming_language)
    
    
      2016-01-04
      5072
      201601
      en
      2251
      Python_(programming_language)
    
    
      2016-01-05
      5513
      201601
      en
      2251
      Python_(programming_language)
    
    
      2016-01-06
      5233
      201601
      en
      2251
      Python_(programming_language)
    
    
      2016-01-07
      5518
      201601
      en
      2251
      Python_(programming_language)
    
    
      2016-01-08
      5312
      201601
      en
      2251
      Python_(programming_language)
    
    
      2016-01-09
      2749
      201601
      en
      2251
      Python_(programming_language)
    
    
      2016-01-10
      2989
      201601
      en
      2251
      Python_(programming_language)
    
    
      2016-01-11
      5539
      201601
      en
      2251
      Python_(programming_language)
    
    
      2016-01-12
      6161
      201601
      en
      2251
      Python_(programming_language)
    
    
      2016-01-13
      6175
      201601
      en
      2251
      Python_(programming_language)
    
    
      2016-01-14
      5545
      201601
      en
      2251
      Python_(programming_language)
    
    
      2016-01-15
      4833
      201601
      en
      2251
      Python_(programming_language)
    
    
      2016-01-16
      2858
      201601
      en
      2251
      Python_(programming_language)
    
    
      2016-01-17
      2992
      201601
      en
      2251
      Python_(programming_language)
    
    
      2016-01-18
      4673
      201601
      en
      2251
      Python_(programming_language)
    
    
      2016-01-19
      5625
      201601
      en
      2251
      Python_(programming_language)
    
    
      2016-01-20
      5450
      201601
      en
      2251
      Python_(programming_language)

EXERCISE

Load movies from data/movies.csv to variable called `movies`



In [26]:

    
# Solution here

movies = pd.read_csv('data/movies.csv')
movies.head()









    Out[26]:







  
    
      
      movie_title
      title_year
      budget
      gross
      genres
      language
      country
      movie_facebook_likes
      imdb_score
      num_voted_users
      ...
      actor_3_name
      actor_3_facebook_likes
      cast_total_facebook_likes
      color
      duration
      plot_keywords
      content_rating
      aspect_ratio
      facenumber_in_poster
      movie_imdb_link
    
  
  
    
      0
      Avatar
      2009.0
      237000000.0
      760505847.0
      Action|Adventure|Fantasy|Sci-Fi
      English
      USA
      33000
      7.9
      886204
      ...
      Wes Studi
      855.0
      4834
      Color
      178.0
      avatar|future|marine|native|paraplegic
      PG-13
      1.78
      0.0
      http://www.imdb.com/title/tt0499549/?ref_=fn_t...
    
    
      1
      Pirates of the Caribbean: At World's End
      2007.0
      300000000.0
      309404152.0
      Action|Adventure|Fantasy
      English
      USA
      0
      7.1
      471220
      ...
      Jack Davenport
      1000.0
      48350
      Color
      169.0
      goddess|marriage ceremony|marriage proposal|pi...
      PG-13
      2.35
      0.0
      http://www.imdb.com/title/tt0449088/?ref_=fn_t...
    
    
      2
      Spectre
      2015.0
      245000000.0
      200074175.0
      Action|Adventure|Thriller
      English
      UK
      85000
      6.8
      275868
      ...
      Stephanie Sigman
      161.0
      11700
      Color
      148.0
      bomb|espionage|sequel|spy|terrorist
      PG-13
      2.35
      1.0
      http://www.imdb.com/title/tt2379713/?ref_=fn_t...
    
    
      3
      The Dark Knight Rises
      2012.0
      250000000.0
      448130642.0
      Action|Thriller
      English
      USA
      164000
      8.5
      1144337
      ...
      Joseph Gordon-Levitt
      23000.0
      106759
      Color
      164.0
      deception|imprisonment|lawlessness|police offi...
      PG-13
      2.35
      0.0
      http://www.imdb.com/title/tt1345836/?ref_=fn_t...
    
    
      4
      Star Wars: Episode VII - The Force Awakens
      NaN
      NaN
      NaN
      Documentary
      NaN
      NaN
      0
      7.1
      8
      ...
      NaN
      NaN
      143
      NaN
      NaN
      NaN
      NaN
      NaN
      0.0
      http://www.imdb.com/title/tt5289954/?ref_=fn_t...
    
  

5 rows × 28 columns

Analyze what dimensions and columns it has



In [27]:

    
# Solution here

print(movies.shape)
print(movies.columns)









    



(5043, 28)
Index(['movie_title', 'title_year', 'budget', 'gross', 'genres', 'language',
       'country', 'movie_facebook_likes', 'imdb_score', 'num_voted_users',
       'num_critic_for_reviews', 'num_user_for_reviews', 'director_name',
       'director_facebook_likes', 'actor_1_name', 'actor_1_facebook_likes',
       'actor_2_name', 'actor_2_facebook_likes', 'actor_3_name',
       'actor_3_facebook_likes', 'cast_total_facebook_likes', 'color',
       'duration', 'plot_keywords', 'content_rating', 'aspect_ratio',
       'facenumber_in_poster', 'movie_imdb_link'],
      dtype='object')

Filtering



In [28]:

    
heroes

Boolean indexing



In [29]:

    
heroes['strength'] == 400









    Out[29]:





0     True
1    False
2    False
3     True
4    False
Name: strength, dtype: bool



In [30]:

    
heroes[heroes['strength'] == 400]









    Out[30]:







  
    
      
      hero
      strength
    
  
  
    
      0
      Batman
      400
    
    
      3
      Robocop
      400



In [31]:

    
heroes[heroes['strength'] > 400]









    Out[31]:







  
    
      
      hero
      strength
    
  
  
    
      4
      Terminator
      500

Multiple conditions



In [32]:

    
try:
    heroes[200 < heroes['strength'] < 400]
except ValueError:
    print("This cool Python syntax ain't work :(")









    



This cool Python syntax ain't work :(



In [33]:

    
heroes[
    (heroes['strength'] > 200) & 
    (heroes['strength'] < 400)
]









    Out[33]:







  
    
      
      hero
      strength
    
  
  
    
      2
      Spiderman
      300



In [34]:

    
heroes[
    (heroes['strength'] <= 200) |
    (heroes['strength'] >= 400)
]

Negation

~ is a negation operator



In [35]:

    
~(heroes['strength'] == 400)









    Out[35]:





0    False
1     True
2     True
3    False
4     True
Name: strength, dtype: bool



In [36]:

    
heroes['strength'] != 400









    Out[36]:





0    False
1     True
2     True
3    False
4     True
Name: strength, dtype: bool



In [37]:

    
heroes[~(
    (heroes['strength'] <= 200) |
    (heroes['strength'] >= 400)
)]









    Out[37]:







  
    
      
      hero
      strength
    
  
  
    
      2
      Spiderman
      300

Filtering for containing one of many values (SQL's IN)



In [38]:

    
heroes[
    heroes['hero'].isin(['Batman', 'Robin'])
]

EXERCISE

What movies has been directed by Clint Eastwood?



In [39]:

    
# Solution here

movies[movies['director_name'] == "Clint Eastwood"]









    Out[39]:







  
    
      
      movie_title
      title_year
      budget
      gross
      genres
      language
      country
      movie_facebook_likes
      imdb_score
      num_voted_users
      ...
      actor_3_name
      actor_3_facebook_likes
      cast_total_facebook_likes
      color
      duration
      plot_keywords
      content_rating
      aspect_ratio
      facenumber_in_poster
      movie_imdb_link
    
  
  
    
      664
      Space Cowboys
      2000.0
      65000000.0
      90454043.0
      Action|Adventure|Thriller
      English
      USA
      0
      6.4
      60165
      ...
      William Devane
      416.0
      17771
      Black and White
      130.0
      nasa|old age|satellite|space|space shuttle
      PG-13
      2.35
      1.0
      http://www.imdb.com/title/tt0186566/?ref_=fn_t...
    
    
      772
      Invictus
      2009.0
      60000000.0
      37479778.0
      Biography|Drama|History|Sport
      English
      USA
      23000
      7.4
      124222
      ...
      Leleti Khumalo
      204.0
      24458
      Color
      134.0
      apartheid|nation|nelson mandela|president|rugby
      PG-13
      2.35
      1.0
      http://www.imdb.com/title/tt1057500/?ref_=fn_t...
    
    
      814
      American Sniper
      2014.0
      58800000.0
      350123553.0
      Action|Biography|Drama|History|Thriller|War
      English
      USA
      112000
      7.3
      325264
      ...
      Keir O'Donnell
      318.0
      16277
      Color
      133.0
      assassin|death of child|iraq|sniper|sniper rifle
      R
      2.35
      0.0
      http://www.imdb.com/title/tt2179136/?ref_=fn_t...
    
    
      877
      Changeling
      2008.0
      55000000.0
      35707327.0
      Crime|Drama|Mystery|Thriller
      English
      USA
      14000
      7.8
      199056
      ...
      Colm Feore
      539.0
      12840
      Color
      141.0
      execution by hanging|los angeles police depart...
      R
      2.35
      0.0
      http://www.imdb.com/title/tt0824747/?ref_=fn_t...
    
    
      906
      Flags of Our Fathers
      2006.0
      90000000.0
      33574332.0
      Drama|History|War
      English
      USA
      0
      7.1
      101221
      ...
      Tom McCarthy
      310.0
      24468
      Color
      135.0
      imperial japan|japan|japanese army|japanese so...
      R
      2.35
      0.0
      http://www.imdb.com/title/tt0418689/?ref_=fn_t...
    
    
      973
      Absolute Power
      1997.0
      50000000.0
      50007168.0
      Action|Crime|Drama|Thriller
      English
      USA
      1000
      6.7
      39529
      ...
      Scott Glenn
      826.0
      19139
      Color
      121.0
      death|death of wife|murder|thief|violence
      R
      2.35
      0.0
      http://www.imdb.com/title/tt0118548/?ref_=fn_t...
    
    
      997
      Hereafter
      2010.0
      50000000.0
      32741596.0
      Drama|Fantasy
      English
      USA
      16000
      6.5
      80140
      ...
      Cécile De France
      447.0
      14344
      Color
      129.0
      death|france|near death experience|tsunami|twi...
      PG-13
      2.35
      0.0
      http://www.imdb.com/title/tt1212419/?ref_=fn_t...
    
    
      1004
      Blood Work
      2002.0
      50000000.0
      26199517.0
      Action|Crime|Drama|Mystery|Thriller
      English
      USA
      705
      6.4
      34356
      ...
      Rick Hoffman
      581.0
      18635
      Color
      110.0
      boat|heart transplant|murder|serial killer|sister
      R
      2.35
      1.0
      http://www.imdb.com/title/tt0309377/?ref_=fn_t...
    
    
      1245
      Jersey Boys
      2014.0
      40000000.0
      47034272.0
      Biography|Drama|Music|Musical
      English
      USA
      16000
      6.9
      25465
      ...
      Scott Vance
      235.0
      2281
      Color
      134.0
      band|music group|musical quartet|singer|song
      R
      2.35
      1.0
      http://www.imdb.com/title/tt1742044/?ref_=fn_t...
    
    
      1453
      J. Edgar
      2011.0
      35000000.0
      37304950.0
      Biography|Crime|Drama
      English
      USA
      16000
      6.6
      102728
      ...
      Kaitlyn Dever
      363.0
      36258
      Color
      137.0
      agent|fbi|federal bureau of investigation|gay ...
      R
      2.35
      0.0
      http://www.imdb.com/title/tt1616195/?ref_=fn_t...
    
    
      1462
      Midnight in the Garden of Good and Evil
      1997.0
      30000000.0
      25078937.0
      Crime|Drama|Mystery|Thriller
      English
      USA
      0
      6.6
      31751
      ...
      Leon Rippy
      229.0
      19689
      Color
      155.0
      graveyard|murder|murder trial|reporter|trial
      R
      1.85
      2.0
      http://www.imdb.com/title/tt0119668/?ref_=fn_t...
    
    
      1603
      Mystic River
      2003.0
      25000000.0
      90135191.0
      Crime|Drama|Mystery|Thriller
      English
      USA
      12000
      8.0
      338415
      ...
      Tom Guiry
      262.0
      1942
      Color
      138.0
      ballistics|detective|friend|gun|sexual abuse
      R
      2.35
      0.0
      http://www.imdb.com/title/tt0327056/?ref_=fn_t...
    
    
      1604
      Million Dollar Baby
      2004.0
      30000000.0
      100422786.0
      Drama|Sport
      English
      USA
      14000
      8.1
      482064
      ...
      Mike Colter
      567.0
      28830
      Color
      132.0
      boxing|boxing trainer|death|suicide attempt|wo...
      PG-13
      2.35
      1.0
      http://www.imdb.com/title/tt0405159/?ref_=fn_t...
    
    
      1869
      Gran Torino
      2008.0
      33000000.0
      148085755.0
      Drama
      English
      USA
      28000
      8.2
      561773
      ...
      Ahney Her
      198.0
      17347
      Color
      116.0
      gang|gang rape|hmong|teenager|widower
      R
      2.35
      0.0
      http://www.imdb.com/title/tt1205489/?ref_=fn_t...
    
    
      2086
      The Bridges of Madison County
      1995.0
      35000000.0
      70960517.0
      Drama|Romance
      English
      USA
      0
      7.5
      54631
      ...
      Debra Monk
      86.0
      27237
      Color
      135.0
      1960s|bridge|housewife|love|photographer
      PG-13
      1.85
      0.0
      http://www.imdb.com/title/tt0112579/?ref_=fn_t...
    
    
      2132
      Firefox
      1982.0
      21000000.0
      46700000.0
      Action|Adventure|Thriller
      English
      USA
      0
      5.9
      19832
      ...
      Kenneth Colley
      216.0
      16805
      Color
      136.0
      fighter plane|jet fighter|refueling|russia|sov...
      PG
      2.35
      1.0
      http://www.imdb.com/title/tt0083943/?ref_=fn_t...
    
    
      2760
      Unforgiven
      1992.0
      14400000.0
      101157447.0
      Drama|Western
      English
      USA
      10000
      8.3
      277505
      ...
      Frances Fisher
      638.0
      28544
      Color
      131.0
      englishman|leaving flowers on a grave|one last...
      R
      2.35
      1.0
      http://www.imdb.com/title/tt0105695/?ref_=fn_t...
    
    
      2863
      Letters from Iwo Jima
      2006.0
      19000000.0
      13753931.0
      Drama|History|War
      Japanese
      USA
      5000
      7.9
      132149
      ...
      Shidô Nakamura
      78.0
      751
      Color
      141.0
      blood splatter|general|island|japan|world war two
      R
      2.35
      0.0
      http://www.imdb.com/title/tt0498380/?ref_=fn_t...
    
    
      3540
      Pale Rider
      1985.0
      6900000.0
      41400000.0
      Western
      English
      USA
      0
      7.3
      39451
      ...
      Sydney Penny
      240.0
      17209
      Color
      115.0
      gold|miner|preacher|prospector|sheriff
      R
      2.35
      0.0
      http://www.imdb.com/title/tt0089767/?ref_=fn_t...
    
    
      3673
      High Plains Drifter
      1973.0
      5500000.0
      NaN
      Mystery|Western
      English
      USA
      0
      7.6
      36381
      ...
      John Hillerman
      258.0
      17599
      Color
      105.0
      gravestone|gunfighter|outlaw|revenge|small town
      R
      2.35
      0.0
      http://www.imdb.com/title/tt0068699/?ref_=fn_t...
    
  

20 rows × 28 columns

What movies have earned above $500m?



In [40]:

    
# Solution here

movies[movies['gross'] > 500e6]['movie_title']









    Out[40]:





0               Avatar
17        The Avengers
26             Titanic
29      Jurassic World
66     The Dark Knight
794       The Avengers
Name: movie_title, dtype: object

Are there any Polish movies?



In [41]:

    
# Solution here

movies[movies['language'] == 'Polish']['movie_title']









    Out[41]:





2824       Dekalog            
2828    The Border            
3207       Dekalog            
4170                       Ida
Name: movie_title, dtype: object

What are really popular great movies? (> 100k FB likes, > 8.5 IMDB score)



In [42]:

    
# Solution here

movies[
    (movies['movie_facebook_likes'] > 100000) &
    (movies['imdb_score'] > 8.5)
]['movie_title']









    Out[42]:





96                  Interstellar
97                     Inception
1937    The Shawshank Redemption
Name: movie_title, dtype: object

In what movies main role was played by brutals like "Jason Statham", "Sylvester Stallone" or god ("Morgan Freeman")?



In [43]:

    
# Solution here

brutals = ["Jason Statham", "Sylvester Stallone"]
god = "Morgan Freeman"

movies[
    (movies['actor_1_name'].isin(brutals)) |
    (movies['actor_1_name'] == god) 
]['movie_title'].head()









    Out[43]:





45             Furious 7
235             Oblivion
299    The Expendables 2
317    The Expendables 3
367              Ben-Hur
Name: movie_title, dtype: object

I/O part O

As numpy array



In [44]:

    
heroes.values









    Out[44]:





array([['Batman', 400],
       ['Robin', 200],
       ['Spiderman', 300],
       ['Robocop', 400],
       ['Terminator', 500]], dtype=object)

As (list) of dicts



In [45]:

    
heroes.to_dict()









    Out[45]:





{'hero': {0: 'Batman',
  1: 'Robin',
  2: 'Spiderman',
  3: 'Robocop',
  4: 'Terminator'},
 'strength': {0: 400, 1: 200, 2: 300, 3: 400, 4: 500}}



In [46]:

    
heroes.to_dict('records')









    Out[46]:





[{'hero': 'Batman', 'strength': 400},
 {'hero': 'Robin', 'strength': 200},
 {'hero': 'Spiderman', 'strength': 300},
 {'hero': 'Robocop', 'strength': 400},
 {'hero': 'Terminator', 'strength': 500}]

As popular data format



In [47]:

    
heroes.to_json()









    Out[47]:





'{"hero":{"0":"Batman","1":"Robin","2":"Spiderman","3":"Robocop","4":"Terminator"},"strength":{"0":400,"1":200,"2":300,"3":400,"4":500}}'



In [48]:

    
heroes.to_json(orient='records')









    Out[48]:





'[{"hero":"Batman","strength":400},{"hero":"Robin","strength":200},{"hero":"Spiderman","strength":300},{"hero":"Robocop","strength":400},{"hero":"Terminator","strength":500}]'



In [49]:

    
heroes.to_csv()









    Out[49]:





',hero,strength\n0,Batman,400\n1,Robin,200\n2,Spiderman,300\n3,Robocop,400\n4,Terminator,500\n'



In [50]:

    
heroes.to_csv(index=False)









    Out[50]:





'hero,strength\nBatman,400\nRobin,200\nSpiderman,300\nRobocop,400\nTerminator,500\n'



In [51]:

    
heroes.to_csv('data/heroes.csv', index=False)

EXERCISE

Create a csv with movie titles and cast (actors) of movies with budget above $200m



In [52]:

    
# Solution here
cols = [
    'movie_title',
    'actor_1_name',
    'actor_2_name',
    'actor_3_name', 
    'budget'
]

movies[movies['budget'] > 200e6][cols].to_csv("data/expensive-cast.csv", index=False)

Create a list of dicts with movie titles and facebook likes of all Christopher Nolan's movies



In [53]:

    
# Solution here

cols = [
    'movie_title',
    'movie_facebook_likes'
]

movies[movies['director_name'] == 'Christopher Nolan'][cols].to_dict('r')









    Out[53]:





[{'movie_facebook_likes': 164000, 'movie_title': 'The Dark Knight Rises'},
 {'movie_facebook_likes': 37000, 'movie_title': 'The Dark Knight'},
 {'movie_facebook_likes': 349000, 'movie_title': 'Interstellar'},
 {'movie_facebook_likes': 175000, 'movie_title': 'Inception'},
 {'movie_facebook_likes': 15000, 'movie_title': 'Batman Begins'},
 {'movie_facebook_likes': 0, 'movie_title': 'Insomnia'},
 {'movie_facebook_likes': 49000, 'movie_title': 'The Prestige'},
 {'movie_facebook_likes': 40000, 'movie_title': 'Memento'}]

New columns



In [54]:

    
heroes

Creating new column



In [55]:

    
heroes['health'] = np.NaN
heroes.head()









    Out[55]:







  
    
      
      hero
      strength
      health
    
  
  
    
      0
      Batman
      400
      NaN
    
    
      1
      Robin
      200
      NaN
    
    
      2
      Spiderman
      300
      NaN
    
    
      3
      Robocop
      400
      NaN
    
    
      4
      Terminator
      500
      NaN



In [56]:

    
heroes['health'] = 100
heroes.head()



In [57]:

    
heroes['height'] = [180, 170, 175, 190, 185]
heroes



In [58]:

    
heroes['is_hungry'] = pd.Series([True, False, False, True, True])
heroes









    Out[58]:







  
    
      
      hero
      strength
      health
      height
      is_hungry
    
  
  
    
      0
      Batman
      400
      100
      180
      True
    
    
      1
      Robin
      200
      100
      170
      False
    
    
      2
      Spiderman
      300
      100
      175
      False
    
    
      3
      Robocop
      400
      100
      190
      True
    
    
      4
      Terminator
      500
      100
      185
      True

Vector operations



In [59]:

    
heroes['strength'] * 2









    Out[59]:





0     800
1     400
2     600
3     800
4    1000
Name: strength, dtype: int64



In [60]:

    
heroes['strength'] / heroes['height']









    Out[60]:





0    2.222222
1    1.176471
2    1.714286
3    2.105263
4    2.702703
dtype: float64



In [61]:

    
heroes['strength_per_cm'] = heroes['strength'] / heroes['height']
heroes









    Out[61]:







  
    
      
      hero
      strength
      health
      height
      is_hungry
      strength_per_cm
    
  
  
    
      0
      Batman
      400
      100
      180
      True
      2.222222
    
    
      1
      Robin
      200
      100
      170
      False
      1.176471
    
    
      2
      Spiderman
      300
      100
      175
      False
      1.714286
    
    
      3
      Robocop
      400
      100
      190
      True
      2.105263
    
    
      4
      Terminator
      500
      100
      185
      True
      2.702703

Map, apply, applymap, str



In [62]:

    
pd.Series([1, 2, 3]).map(lambda x: x**3)









    Out[62]:





0     1
1     8
2    27
dtype: int64



In [63]:

    
pd.Series(['Batman', 'Robin']).map(lambda x: x[:2])









    Out[63]:





0    Ba
1    Ro
dtype: object



In [64]:

    
# however, more idiomatic approach for strings is to do..
pd.Series(['Batman', 'Robin']).str[:2]









    Out[64]:





0    Ba
1    Ro
dtype: object



In [65]:

    
pd.Series(['Batman', 'Robin']).str.lower()









    Out[65]:





0    batman
1     robin
dtype: object



In [66]:

    
pd.Series([
    ['Batman', 'Robin'],
    ['Robocop']
]).map(len)









    Out[66]:





0    2
1    1
dtype: int64



In [67]:

    
heroes['code'] = heroes['hero'].map(lambda name: name[:2])
heroes









    Out[67]:







  
    
      
      hero
      strength
      health
      height
      is_hungry
      strength_per_cm
      code
    
  
  
    
      0
      Batman
      400
      100
      180
      True
      2.222222
      Ba
    
    
      1
      Robin
      200
      100
      170
      False
      1.176471
      Ro
    
    
      2
      Spiderman
      300
      100
      175
      False
      1.714286
      Sp
    
    
      3
      Robocop
      400
      100
      190
      True
      2.105263
      Ro
    
    
      4
      Terminator
      500
      100
      185
      True
      2.702703
      Te



In [68]:

    
heroes['effective_strength'] = heroes.apply(
    lambda row: (not row['is_hungry']) * row['strength'],
    axis=1
)
heroes.head()









    Out[68]:







  
    
      
      hero
      strength
      health
      height
      is_hungry
      strength_per_cm
      code
      effective_strength
    
  
  
    
      0
      Batman
      400
      100
      180
      True
      2.222222
      Ba
      0
    
    
      1
      Robin
      200
      100
      170
      False
      1.176471
      Ro
      200
    
    
      2
      Spiderman
      300
      100
      175
      False
      1.714286
      Sp
      300
    
    
      3
      Robocop
      400
      100
      190
      True
      2.105263
      Ro
      0
    
    
      4
      Terminator
      500
      100
      185
      True
      2.702703
      Te
      0



In [69]:

    
heroes[['health', 'strength']] = heroes[['health', 'strength']].applymap(
    lambda x: x + 100
)
heroes









    Out[69]:







  
    
      
      hero
      strength
      health
      height
      is_hungry
      strength_per_cm
      code
      effective_strength
    
  
  
    
      0
      Batman
      500
      200
      180
      True
      2.222222
      Ba
      0
    
    
      1
      Robin
      300
      200
      170
      False
      1.176471
      Ro
      200
    
    
      2
      Spiderman
      400
      200
      175
      False
      1.714286
      Sp
      300
    
    
      3
      Robocop
      500
      200
      190
      True
      2.105263
      Ro
      0
    
    
      4
      Terminator
      600
      200
      185
      True
      2.702703
      Te
      0

Cheatsheet

map: 1 => 1
apply: n => 1
applymap: n => n

Sorting and value counts (bonus skill)



In [70]:

    
heroes['strength'].value_counts()









    Out[70]:





500    2
600    1
300    1
400    1
Name: strength, dtype: int64



In [71]:

    
heroes.sort_values('strength')









    Out[71]:







  
    
      
      hero
      strength
      health
      height
      is_hungry
      strength_per_cm
      code
      effective_strength
    
  
  
    
      1
      Robin
      300
      200
      170
      False
      1.176471
      Ro
      200
    
    
      2
      Spiderman
      400
      200
      175
      False
      1.714286
      Sp
      300
    
    
      0
      Batman
      500
      200
      180
      True
      2.222222
      Ba
      0
    
    
      3
      Robocop
      500
      200
      190
      True
      2.105263
      Ro
      0
    
    
      4
      Terminator
      600
      200
      185
      True
      2.702703
      Te
      0



In [72]:

    
heroes.sort_values(
    ['is_hungry', 'code'],
    ascending=[False, True]
)









    Out[72]:







  
    
      
      hero
      strength
      health
      height
      is_hungry
      strength_per_cm
      code
      effective_strength
    
  
  
    
      0
      Batman
      500
      200
      180
      True
      2.222222
      Ba
      0
    
    
      3
      Robocop
      500
      200
      190
      True
      2.105263
      Ro
      0
    
    
      4
      Terminator
      600
      200
      185
      True
      2.702703
      Te
      0
    
    
      1
      Robin
      300
      200
      170
      False
      1.176471
      Ro
      200
    
    
      2
      Spiderman
      400
      200
      175
      False
      1.714286
      Sp
      300

EXERCISE

What are 10 most profitable movies? (ratio between gross and budget)



In [73]:

    
# Solution here

movies['profitability'] = movies['gross'] / movies['budget']
movies.sort_values('profitability', ascending=False).head(10)









    Out[73]:







  
    
      
      movie_title
      title_year
      budget
      gross
      genres
      language
      country
      movie_facebook_likes
      imdb_score
      num_voted_users
      ...
      actor_3_facebook_likes
      cast_total_facebook_likes
      color
      duration
      plot_keywords
      content_rating
      aspect_ratio
      facenumber_in_poster
      movie_imdb_link
      profitability
    
  
  
    
      4793
      Paranormal Activity
      2007.0
      15000.0
      107917283.0
      Horror
      English
      USA
      12000
      6.3
      184824
      ...
      21.0
      330
      Color
      84.0
      dark force|entity|evil force|found footage|par...
      R
      1.85
      0.0
      http://www.imdb.com/title/tt1179904/?ref_=fn_t...
      7194.485533
    
    
      4799
      Tarnation
      2003.0
      218.0
      592014.0
      Biography|Documentary
      English
      USA
      754
      7.2
      5709
      ...
      0.0
      78
      Color
      88.0
      answering machine|home movie|lithium|schizophr...
      Unrated
      1.37
      3.0
      http://www.imdb.com/title/tt0390538/?ref_=fn_t...
      2715.660550
    
    
      4707
      The Blair Witch Project
      1999.0
      60000.0
      140530114.0
      Horror
      English
      USA
      0
      6.4
      186786
      ...
      39.0
      399
      Black and White
      81.0
      found footage|looking at the camera|lost in th...
      R
      1.33
      0.0
      http://www.imdb.com/title/tt0185937/?ref_=fn_t...
      2342.168567
    
    
      4984
      The Brothers McMullen
      1995.0
      25000.0
      10246600.0
      Comedy|Drama|Romance
      English
      USA
      265
      6.6
      6375
      ...
      73.0
      388
      Color
      98.0
      abusive father|critically acclaimed|loss of fa...
      R
      1.85
      0.0
      http://www.imdb.com/title/tt0112585/?ref_=fn_t...
      409.864000
    
    
      4936
      The Texas Chain Saw Massacre
      1974.0
      83532.0
      30859000.0
      Horror|Thriller
      English
      USA
      0
      7.5
      96411
      ...
      177.0
      1094
      Color
      88.0
      cannibal|chainsaw|hitchhiker|independent film|...
      R
      1.85
      0.0
      http://www.imdb.com/title/tt0072271/?ref_=fn_t...
      369.427285
    
    
      3278
      The Texas Chain Saw Massacre
      1974.0
      83532.0
      30859000.0
      Horror|Thriller
      English
      USA
      0
      7.5
      96410
      ...
      177.0
      1094
      Color
      88.0
      cannibal|chainsaw|hitchhiker|independent film|...
      R
      1.85
      0.0
      http://www.imdb.com/title/tt0072271/?ref_=fn_t...
      369.427285
    
    
      5035
      El Mariachi
      1992.0
      7000.0
      2040920.0
      Action|Crime|Drama|Romance|Thriller
      Spanish
      USA
      0
      6.9
      52055
      ...
      6.0
      147
      Color
      81.0
      assassin|death|guitar|gun|mariachi
      R
      1.37
      0.0
      http://www.imdb.com/title/tt0104815/?ref_=fn_t...
      291.560000
    
    
      4956
      The Gallows
      2015.0
      100000.0
      22757819.0
      Horror|Thriller
      English
      USA
      0
      4.2
      13521
      ...
      7.0
      276
      Color
      81.0
      breaking and entering|gallows|hanging|high sch...
      R
      1.85
      0.0
      http://www.imdb.com/title/tt2309260/?ref_=fn_t...
      227.578190
    
    
      4977
      Super Size Me
      2004.0
      65000.0
      11529368.0
      Comedy|Documentary|Drama
      English
      USA
      0
      7.3
      85028
      ...
      0.0
      0
      Color
      100.0
      experiment|fast food|food industry|meal|overea...
      PG
      1.78
      2.0
      http://www.imdb.com/title/tt0390521/?ref_=fn_t...
      177.374892
    
    
      2619
      Halloween
      1978.0
      300000.0
      47000000.0
      Horror|Thriller
      English
      USA
      12000
      7.9
      157857
      ...
      598.0
      4400
      Color
      101.0
      halloween|masked killer|michael myers|slasher|...
      R
      2.35
      0.0
      http://www.imdb.com/title/tt0077651/?ref_=fn_t...
      156.666667
    
  

10 rows × 29 columns

Create a column 'first_genre'. What is the distribution of values in this column?



In [74]:

    
# Solution here

movies['first_genre'] = movies['genres'].str.split('|').str[0]
movies.head()









    Out[74]:







  
    
      
      movie_title
      title_year
      budget
      gross
      genres
      language
      country
      movie_facebook_likes
      imdb_score
      num_voted_users
      ...
      cast_total_facebook_likes
      color
      duration
      plot_keywords
      content_rating
      aspect_ratio
      facenumber_in_poster
      movie_imdb_link
      profitability
      first_genre
    
  
  
    
      0
      Avatar
      2009.0
      237000000.0
      760505847.0
      Action|Adventure|Fantasy|Sci-Fi
      English
      USA
      33000
      7.9
      886204
      ...
      4834
      Color
      178.0
      avatar|future|marine|native|paraplegic
      PG-13
      1.78
      0.0
      http://www.imdb.com/title/tt0499549/?ref_=fn_t...
      3.208885
      Action
    
    
      1
      Pirates of the Caribbean: At World's End
      2007.0
      300000000.0
      309404152.0
      Action|Adventure|Fantasy
      English
      USA
      0
      7.1
      471220
      ...
      48350
      Color
      169.0
      goddess|marriage ceremony|marriage proposal|pi...
      PG-13
      2.35
      0.0
      http://www.imdb.com/title/tt0449088/?ref_=fn_t...
      1.031347
      Action
    
    
      2
      Spectre
      2015.0
      245000000.0
      200074175.0
      Action|Adventure|Thriller
      English
      UK
      85000
      6.8
      275868
      ...
      11700
      Color
      148.0
      bomb|espionage|sequel|spy|terrorist
      PG-13
      2.35
      1.0
      http://www.imdb.com/title/tt2379713/?ref_=fn_t...
      0.816629
      Action
    
    
      3
      The Dark Knight Rises
      2012.0
      250000000.0
      448130642.0
      Action|Thriller
      English
      USA
      164000
      8.5
      1144337
      ...
      106759
      Color
      164.0
      deception|imprisonment|lawlessness|police offi...
      PG-13
      2.35
      0.0
      http://www.imdb.com/title/tt1345836/?ref_=fn_t...
      1.792523
      Action
    
    
      4
      Star Wars: Episode VII - The Force Awakens
      NaN
      NaN
      NaN
      Documentary
      NaN
      NaN
      0
      7.1
      8
      ...
      143
      NaN
      NaN
      NaN
      NaN
      NaN
      0.0
      http://www.imdb.com/title/tt5289954/?ref_=fn_t...
      NaN
      Documentary
    
  

5 rows × 30 columns

Visualizing data



In [75]:

    
heroes









    Out[75]:







  
    
      
      hero
      strength
      health
      height
      is_hungry
      strength_per_cm
      code
      effective_strength
    
  
  
    
      0
      Batman
      500
      200
      180
      True
      2.222222
      Ba
      0
    
    
      1
      Robin
      300
      200
      170
      False
      1.176471
      Ro
      200
    
    
      2
      Spiderman
      400
      200
      175
      False
      1.714286
      Sp
      300
    
    
      3
      Robocop
      500
      200
      190
      True
      2.105263
      Ro
      0
    
    
      4
      Terminator
      600
      200
      185
      True
      2.702703
      Te
      0

Basic stats



In [76]:

    
heroes.describe()









    Out[76]:







  
    
      
      strength
      health
      height
      strength_per_cm
      effective_strength
    
  
  
    
      count
      5.000000
      5.0
      5.000000
      5.000000
      5.000000
    
    
      mean
      460.000000
      200.0
      180.000000
      1.984189
      100.000000
    
    
      std
      114.017543
      0.0
      7.905694
      0.572896
      141.421356
    
    
      min
      300.000000
      200.0
      170.000000
      1.176471
      0.000000
    
    
      25%
      400.000000
      200.0
      175.000000
      1.714286
      0.000000
    
    
      50%
      500.000000
      200.0
      180.000000
      2.105263
      0.000000
    
    
      75%
      500.000000
      200.0
      185.000000
      2.222222
      200.000000
    
    
      max
      600.000000
      200.0
      190.000000
      2.702703
      300.000000

Plotting



In [77]:

    
%matplotlib inline



In [78]:

    
pd.Series([1, 2, 3]).plot()









    Out[78]:





<matplotlib.axes._subplots.AxesSubplot at 0x11104a0b8>



In [79]:

    
pd.Series([1, 2, 3], index=['Batman', 'Robin', 'Rambo']).plot()









    Out[79]:





<matplotlib.axes._subplots.AxesSubplot at 0x1110dccf8>



In [80]:

    
pd.Series([1, 2, 3], index=['Batman', 'Robin', 'Rambo']).plot(kind='bar')









    Out[80]:





<matplotlib.axes._subplots.AxesSubplot at 0x11111d160>



In [81]:

    
pd.Series([1, 2, 3], index=['Batman', 'Robin', 'Rambo']).plot(
    kind='bar',
    figsize=(15, 6)
)









    Out[81]:





<matplotlib.axes._subplots.AxesSubplot at 0x111296eb8>



In [82]:

    
pd.Series([1, 2, 3], index=['Batman', 'Robin', 'Rambo']).plot(kind='pie')









    Out[82]:





<matplotlib.axes._subplots.AxesSubplot at 0x111473240>



In [83]:

    
heroes.plot()









    Out[83]:





<matplotlib.axes._subplots.AxesSubplot at 0x111378748>



In [84]:

    
indexed_heroes = heroes.set_index('hero')
indexed_heroes









    Out[84]:







  
    
      
      strength
      health
      height
      is_hungry
      strength_per_cm
      code
      effective_strength
    
    
      hero
      
      
      
      
      
      
      
    
  
  
    
      Batman
      500
      200
      180
      True
      2.222222
      Ba
      0
    
    
      Robin
      300
      200
      170
      False
      1.176471
      Ro
      200
    
    
      Spiderman
      400
      200
      175
      False
      1.714286
      Sp
      300
    
    
      Robocop
      500
      200
      190
      True
      2.105263
      Ro
      0
    
    
      Terminator
      600
      200
      185
      True
      2.702703
      Te
      0



In [85]:

    
indexed_heroes.plot()









    Out[85]:





<matplotlib.axes._subplots.AxesSubplot at 0x111645be0>



In [86]:

    
indexed_heroes.plot(kind='barh')









    Out[86]:





<matplotlib.axes._subplots.AxesSubplot at 0x1116f1b70>



In [87]:

    
indexed_heroes.plot(kind='bar', subplots=True, figsize=(15, 15))









    Out[87]:





array([<matplotlib.axes._subplots.AxesSubplot object at 0x11189ce80>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x111a19fd0>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x111957748>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x111ae2710>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x111b43780>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x111b437b8>], dtype=object)



In [88]:

    
indexed_heroes[['height', 'strength']].plot(kind='bar')









    Out[88]:





<matplotlib.axes._subplots.AxesSubplot at 0x111d724e0>



In [89]:

    
heroes.plot(x='hero', y=['height', 'strength'], kind='bar')









    Out[89]:





<matplotlib.axes._subplots.AxesSubplot at 0x111aff828>



In [90]:

    
# alternative to subplots
heroes.plot(
    x='hero',
    y=['height', 'strength'],
    kind='bar',
    secondary_y='strength',
    figsize=(10,8)
)









    Out[90]:





<matplotlib.axes._subplots.AxesSubplot at 0x1117922b0>



In [91]:

    
heroes.plot(
    x='hero',
    y=['height', 'strength'],
    kind='bar',
    secondary_y='strength',
    title='Super plot of super heroes',
    figsize=(10,8)
)









    Out[91]:





<matplotlib.axes._subplots.AxesSubplot at 0x111d80860>

Histogram



In [92]:

    
heroes.hist(figsize=(10, 10))









    Out[92]:





array([[<matplotlib.axes._subplots.AxesSubplot object at 0x112a11550>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x112bfcda0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x112a567b8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x112cc2668>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x112d217f0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x112d21828>]], dtype=object)



In [93]:

    
heroes.hist(
    figsize=(10, 10),
    bins=2
)









    Out[93]:





array([[<matplotlib.axes._subplots.AxesSubplot object at 0x112d47780>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1131ce128>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x1131deb70>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x113292f98>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x1132f66a0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1132f66d8>]], dtype=object)

DataFrames everywhere.. are easy to plot



In [94]:

    
heroes.describe()['strength'].plot(kind='bar')









    Out[94]:





<matplotlib.axes._subplots.AxesSubplot at 0x113485c50>

EXERCISE

Create a chart presenting grosses of movies directed by Christopher Nolan



In [95]:

    
# Solution here

nolan_movies = movies[movies['director_name'] == 'Christopher Nolan']
nolan_movies = nolan_movies.set_index('movie_title')

nolan_movies['gross'].plot(kind='bar')









    Out[95]:





<matplotlib.axes._subplots.AxesSubplot at 0x11372fa90>

What are typical durations of the movies?



In [96]:

    
# Solution here

movies['duration'].hist(bins=25)









    Out[96]:





<matplotlib.axes._subplots.AxesSubplot at 0x112cdfc18>

What is percentage distribution of first genre? (cake)



In [97]:

    
# Solution here

movies['first_genre'].value_counts().plot(
    kind='pie',
    figsize=(15,15)
)









    Out[97]:





<matplotlib.axes._subplots.AxesSubplot at 0x113a7dcc0>

Aggregation

Grouping

https://www.safaribooksonline.com/library/view/learning-pandas/9781783985128/graphics/5128OS_09_01.jpg



In [98]:

    
movie_heroes = pd.DataFrame({
    'hero': ['Batman', 'Robin', 'Spiderman', 'Robocop', 'Lex Luthor', 'Dr Octopus'],
    'movie': ['Batman', 'Batman', 'Spiderman', 'Robocop', 'Spiderman', 'Spiderman'],
    'strength': [400, 100, 400, 560, 89, 300],
    'speed': [100, 10, 200, 1, 20, None],
})
movie_heroes = movie_heroes.set_index('hero')
movie_heroes









    Out[98]:







  
    
      
      movie
      speed
      strength
    
    
      hero
      
      
      
    
  
  
    
      Batman
      Batman
      100.0
      400
    
    
      Robin
      Batman
      10.0
      100
    
    
      Spiderman
      Spiderman
      200.0
      400
    
    
      Robocop
      Robocop
      1.0
      560
    
    
      Lex Luthor
      Spiderman
      20.0
      89
    
    
      Dr Octopus
      Spiderman
      NaN
      300



In [99]:

    
movie_heroes.groupby('movie')









    Out[99]:





<pandas.core.groupby.DataFrameGroupBy object at 0x113bb5e10>



In [100]:

    
list(movie_heroes.groupby('movie'))









    Out[100]:





[('Batman',          movie  speed  strength
  hero                           
  Batman  Batman  100.0       400
  Robin   Batman   10.0       100),
 ('Robocop',            movie  speed  strength
  hero                             
  Robocop  Robocop    1.0       560),
 ('Spiderman',                 movie  speed  strength
  hero                                  
  Spiderman   Spiderman  200.0       400
  Lex Luthor  Spiderman   20.0        89
  Dr Octopus  Spiderman    NaN       300)]

Aggregating



In [101]:

    
movie_heroes.groupby('movie').size()









    Out[101]:





movie
Batman       2
Robocop      1
Spiderman    3
dtype: int64



In [102]:

    
movie_heroes.groupby('movie').count()









    Out[102]:







  
    
      
      speed
      strength
    
    
      movie
      
      
    
  
  
    
      Batman
      2
      2
    
    
      Robocop
      1
      1
    
    
      Spiderman
      2
      3



In [103]:

    
movie_heroes.groupby('movie')['speed'].sum()









    Out[103]:





movie
Batman       110.0
Robocop        1.0
Spiderman    220.0
Name: speed, dtype: float64



In [104]:

    
movie_heroes.groupby('movie').mean()









    Out[104]:







  
    
      
      speed
      strength
    
    
      movie
      
      
    
  
  
    
      Batman
      55.0
      250
    
    
      Robocop
      1.0
      560
    
    
      Spiderman
      110.0
      263



In [105]:

    
movie_heroes.groupby('movie').apply(
    lambda group: group['strength'] / group['strength'].max()
)









    Out[105]:





movie      hero      
Batman     Batman        1.0000
           Robin         0.2500
Robocop    Robocop       1.0000
Spiderman  Spiderman     1.0000
           Lex Luthor    0.2225
           Dr Octopus    0.7500
Name: strength, dtype: float64



In [106]:

    
movie_heroes.groupby('movie').agg({
    'speed': 'mean',
    'strength': 'max',
})









    Out[106]:







  
    
      
      speed
      strength
    
    
      movie
      
      
    
  
  
    
      Batman
      55.0
      400
    
    
      Robocop
      1.0
      560
    
    
      Spiderman
      110.0
      400



In [107]:

    
movie_heroes = movie_heroes.reset_index()
movie_heroes









    Out[107]:







  
    
      
      hero
      movie
      speed
      strength
    
  
  
    
      0
      Batman
      Batman
      100.0
      400
    
    
      1
      Robin
      Batman
      10.0
      100
    
    
      2
      Spiderman
      Spiderman
      200.0
      400
    
    
      3
      Robocop
      Robocop
      1.0
      560
    
    
      4
      Lex Luthor
      Spiderman
      20.0
      89
    
    
      5
      Dr Octopus
      Spiderman
      NaN
      300



In [108]:

    
movie_heroes.groupby(['movie', 'hero']).mean()









    Out[108]:







  
    
      
      
      speed
      strength
    
    
      movie
      hero
      
      
    
  
  
    
      Batman
      Batman
      100.0
      400
    
    
      Robin
      10.0
      100
    
    
      Robocop
      Robocop
      1.0
      560
    
    
      Spiderman
      Dr Octopus
      NaN
      300
    
    
      Lex Luthor
      20.0
      89
    
    
      Spiderman
      200.0
      400

EXERCISE

What was maximal gross in each year?



In [109]:

    
# Solution here

movies.groupby('title_year')['gross'].max().tail(10).plot(kind='bar')









    Out[109]:





<matplotlib.axes._subplots.AxesSubplot at 0x113aad4e0>

Which director earns the most on average?



In [110]:

    
# Solution here

(
    movies
    .groupby('director_name')
    ['gross']
    .mean()
    .sort_values(ascending=False)
    .head(3)
)









    Out[110]:





director_name
Joss Whedon    432721657.0
Lee Unkrich    414984497.0
Chris Buck     400736600.0
Name: gross, dtype: float64

Data alignment on Index



In [111]:

    
movie_heroes









    Out[111]:







  
    
      
      hero
      movie
      speed
      strength
    
  
  
    
      0
      Batman
      Batman
      100.0
      400
    
    
      1
      Robin
      Batman
      10.0
      100
    
    
      2
      Spiderman
      Spiderman
      200.0
      400
    
    
      3
      Robocop
      Robocop
      1.0
      560
    
    
      4
      Lex Luthor
      Spiderman
      20.0
      89
    
    
      5
      Dr Octopus
      Spiderman
      NaN
      300



In [112]:

    
apetite = pd.DataFrame([
    dict(hero='Spiderman', is_hungry=True),
    dict(hero='Robocop', is_hungry=False)
])
apetite









    Out[112]:







  
    
      
      hero
      is_hungry
    
  
  
    
      0
      Spiderman
      True
    
    
      1
      Robocop
      False



In [113]:

    
movie_heroes['is_hungry'] = apetite['is_hungry']
movie_heroes









    Out[113]:







  
    
      
      hero
      movie
      speed
      strength
      is_hungry
    
  
  
    
      0
      Batman
      Batman
      100.0
      400
      True
    
    
      1
      Robin
      Batman
      10.0
      100
      False
    
    
      2
      Spiderman
      Spiderman
      200.0
      400
      NaN
    
    
      3
      Robocop
      Robocop
      1.0
      560
      NaN
    
    
      4
      Lex Luthor
      Spiderman
      20.0
      89
      NaN
    
    
      5
      Dr Octopus
      Spiderman
      NaN
      300
      NaN



In [114]:

    
apetite.index = [2, 3]



In [115]:

    
movie_heroes['is_hungry'] = apetite['is_hungry']
movie_heroes









    Out[115]:







  
    
      
      hero
      movie
      speed
      strength
      is_hungry
    
  
  
    
      0
      Batman
      Batman
      100.0
      400
      NaN
    
    
      1
      Robin
      Batman
      10.0
      100
      NaN
    
    
      2
      Spiderman
      Spiderman
      200.0
      400
      True
    
    
      3
      Robocop
      Robocop
      1.0
      560
      False
    
    
      4
      Lex Luthor
      Spiderman
      20.0
      89
      NaN
    
    
      5
      Dr Octopus
      Spiderman
      NaN
      300
      NaN

Setting index



In [116]:

    
indexed_movie_heroes = movie_heroes.set_index('hero')
indexed_movie_heroes









    Out[116]:







  
    
      
      movie
      speed
      strength
      is_hungry
    
    
      hero
      
      
      
      
    
  
  
    
      Batman
      Batman
      100.0
      400
      NaN
    
    
      Robin
      Batman
      10.0
      100
      NaN
    
    
      Spiderman
      Spiderman
      200.0
      400
      True
    
    
      Robocop
      Robocop
      1.0
      560
      False
    
    
      Lex Luthor
      Spiderman
      20.0
      89
      NaN
    
    
      Dr Octopus
      Spiderman
      NaN
      300
      NaN



In [117]:

    
indexed_apetite = apetite.set_index('hero')
indexed_apetite









    Out[117]:







  
    
      
      is_hungry
    
    
      hero
      
    
  
  
    
      Spiderman
      True
    
    
      Robocop
      False



In [118]:

    
# and alignment works well automagically..

indexed_movie_heroes['is_hungry'] = indexed_apetite['is_hungry']



In [119]:

    
indexed_movie_heroes









    Out[119]:







  
    
      
      movie
      speed
      strength
      is_hungry
    
    
      hero
      
      
      
      
    
  
  
    
      Batman
      Batman
      100.0
      400
      NaN
    
    
      Robin
      Batman
      10.0
      100
      NaN
    
    
      Spiderman
      Spiderman
      200.0
      400
      True
    
    
      Robocop
      Robocop
      1.0
      560
      False
    
    
      Lex Luthor
      Spiderman
      20.0
      89
      NaN
    
    
      Dr Octopus
      Spiderman
      NaN
      300
      NaN

Merging two DFs (a'la SQL join)



In [120]:

    
movie_heroes









    Out[120]:







  
    
      
      hero
      movie
      speed
      strength
      is_hungry
    
  
  
    
      0
      Batman
      Batman
      100.0
      400
      NaN
    
    
      1
      Robin
      Batman
      10.0
      100
      NaN
    
    
      2
      Spiderman
      Spiderman
      200.0
      400
      True
    
    
      3
      Robocop
      Robocop
      1.0
      560
      False
    
    
      4
      Lex Luthor
      Spiderman
      20.0
      89
      NaN
    
    
      5
      Dr Octopus
      Spiderman
      NaN
      300
      NaN



In [121]:

    
apetite









    Out[121]:







  
    
      
      hero
      is_hungry
    
  
  
    
      2
      Spiderman
      True
    
    
      3
      Robocop
      False



In [122]:

    
# couple of other arguments available here

pd.merge(
    movie_heroes[['hero', 'speed']],
    apetite,
    on=['hero'],
    how='outer'
)









    Out[122]:







  
    
      
      hero
      speed
      is_hungry
    
  
  
    
      0
      Batman
      100.0
      NaN
    
    
      1
      Robin
      10.0
      NaN
    
    
      2
      Spiderman
      200.0
      True
    
    
      3
      Robocop
      1.0
      False
    
    
      4
      Lex Luthor
      20.0
      NaN
    
    
      5
      Dr Octopus
      NaN
      NaN

DateTime operations



In [123]:

    
spiderman_meals = pd.DataFrame([
        dict(time='2016-10-15 10:00', calories=300),
        dict(time='2016-10-15 13:00', calories=900),
        dict(time='2016-10-15 15:00', calories=1200),
        dict(time='2016-10-15 21:00', calories=700),
        dict(time='2016-10-16 07:00', calories=1600),
        dict(time='2016-10-16 13:00', calories=600),
        dict(time='2016-10-16 16:00', calories=900),
        dict(time='2016-10-16 20:00', calories=500),
        dict(time='2016-10-16 21:00', calories=300),
        dict(time='2016-10-17 08:00', calories=900),
    ])
spiderman_meals









    Out[123]:







  
    
      
      calories
      time
    
  
  
    
      0
      300
      2016-10-15 10:00
    
    
      1
      900
      2016-10-15 13:00
    
    
      2
      1200
      2016-10-15 15:00
    
    
      3
      700
      2016-10-15 21:00
    
    
      4
      1600
      2016-10-16 07:00
    
    
      5
      600
      2016-10-16 13:00
    
    
      6
      900
      2016-10-16 16:00
    
    
      7
      500
      2016-10-16 20:00
    
    
      8
      300
      2016-10-16 21:00
    
    
      9
      900
      2016-10-17 08:00



In [124]:

    
spiderman_meals.dtypes









    Out[124]:





calories     int64
time        object
dtype: object



In [125]:

    
spiderman_meals['time'] = pd.to_datetime(spiderman_meals['time'])
spiderman_meals.dtypes









    Out[125]:





calories             int64
time        datetime64[ns]
dtype: object



In [126]:

    
spiderman_meals









    Out[126]:







  
    
      
      calories
      time
    
  
  
    
      0
      300
      2016-10-15 10:00:00
    
    
      1
      900
      2016-10-15 13:00:00
    
    
      2
      1200
      2016-10-15 15:00:00
    
    
      3
      700
      2016-10-15 21:00:00
    
    
      4
      1600
      2016-10-16 07:00:00
    
    
      5
      600
      2016-10-16 13:00:00
    
    
      6
      900
      2016-10-16 16:00:00
    
    
      7
      500
      2016-10-16 20:00:00
    
    
      8
      300
      2016-10-16 21:00:00
    
    
      9
      900
      2016-10-17 08:00:00



In [127]:

    
spiderman_meals = spiderman_meals.set_index('time')
spiderman_meals









    Out[127]:







  
    
      
      calories
    
    
      time
      
    
  
  
    
      2016-10-15 10:00:00
      300
    
    
      2016-10-15 13:00:00
      900
    
    
      2016-10-15 15:00:00
      1200
    
    
      2016-10-15 21:00:00
      700
    
    
      2016-10-16 07:00:00
      1600
    
    
      2016-10-16 13:00:00
      600
    
    
      2016-10-16 16:00:00
      900
    
    
      2016-10-16 20:00:00
      500
    
    
      2016-10-16 21:00:00
      300
    
    
      2016-10-17 08:00:00
      900



In [128]:

    
spiderman_meals.index









    Out[128]:





DatetimeIndex(['2016-10-15 10:00:00', '2016-10-15 13:00:00',
               '2016-10-15 15:00:00', '2016-10-15 21:00:00',
               '2016-10-16 07:00:00', '2016-10-16 13:00:00',
               '2016-10-16 16:00:00', '2016-10-16 20:00:00',
               '2016-10-16 21:00:00', '2016-10-17 08:00:00'],
              dtype='datetime64[ns]', name='time', freq=None)

Filtering



In [129]:

    
spiderman_meals["2016-10-15"]









    Out[129]:







  
    
      
      calories
    
    
      time
      
    
  
  
    
      2016-10-15 10:00:00
      300
    
    
      2016-10-15 13:00:00
      900
    
    
      2016-10-15 15:00:00
      1200
    
    
      2016-10-15 21:00:00
      700



In [130]:

    
spiderman_meals["2016-10-16 10:00":]









    Out[130]:







  
    
      
      calories
    
    
      time
      
    
  
  
    
      2016-10-16 13:00:00
      600
    
    
      2016-10-16 16:00:00
      900
    
    
      2016-10-16 20:00:00
      500
    
    
      2016-10-16 21:00:00
      300
    
    
      2016-10-17 08:00:00
      900



In [131]:

    
spiderman_meals["2016-10-16 10:00":"2016-10-16 20:00"]









    Out[131]:







  
    
      
      calories
    
    
      time
      
    
  
  
    
      2016-10-16 13:00:00
      600
    
    
      2016-10-16 16:00:00
      900
    
    
      2016-10-16 20:00:00
      500



In [132]:

    
spiderman_meals["2016-10"]









    Out[132]:







  
    
      
      calories
    
    
      time
      
    
  
  
    
      2016-10-15 10:00:00
      300
    
    
      2016-10-15 13:00:00
      900
    
    
      2016-10-15 15:00:00
      1200
    
    
      2016-10-15 21:00:00
      700
    
    
      2016-10-16 07:00:00
      1600
    
    
      2016-10-16 13:00:00
      600
    
    
      2016-10-16 16:00:00
      900
    
    
      2016-10-16 20:00:00
      500
    
    
      2016-10-16 21:00:00
      300
    
    
      2016-10-17 08:00:00
      900

Resampling (downsampling and upsampling)



In [133]:

    
spiderman_meals.resample('1D').sum()



In [134]:

    
spiderman_meals.resample('1H').mean()









    Out[134]:







  
    
      
      calories
    
    
      time
      
    
  
  
    
      2016-10-15 10:00:00
      300.0
    
    
      2016-10-15 11:00:00
      NaN
    
    
      2016-10-15 12:00:00
      NaN
    
    
      2016-10-15 13:00:00
      900.0
    
    
      2016-10-15 14:00:00
      NaN
    
    
      2016-10-15 15:00:00
      1200.0
    
    
      2016-10-15 16:00:00
      NaN
    
    
      2016-10-15 17:00:00
      NaN
    
    
      2016-10-15 18:00:00
      NaN
    
    
      2016-10-15 19:00:00
      NaN
    
    
      2016-10-15 20:00:00
      NaN
    
    
      2016-10-15 21:00:00
      700.0
    
    
      2016-10-15 22:00:00
      NaN
    
    
      2016-10-15 23:00:00
      NaN
    
    
      2016-10-16 00:00:00
      NaN
    
    
      2016-10-16 01:00:00
      NaN
    
    
      2016-10-16 02:00:00
      NaN
    
    
      2016-10-16 03:00:00
      NaN
    
    
      2016-10-16 04:00:00
      NaN
    
    
      2016-10-16 05:00:00
      NaN
    
    
      2016-10-16 06:00:00
      NaN
    
    
      2016-10-16 07:00:00
      1600.0
    
    
      2016-10-16 08:00:00
      NaN
    
    
      2016-10-16 09:00:00
      NaN
    
    
      2016-10-16 10:00:00
      NaN
    
    
      2016-10-16 11:00:00
      NaN
    
    
      2016-10-16 12:00:00
      NaN
    
    
      2016-10-16 13:00:00
      600.0
    
    
      2016-10-16 14:00:00
      NaN
    
    
      2016-10-16 15:00:00
      NaN
    
    
      2016-10-16 16:00:00
      900.0
    
    
      2016-10-16 17:00:00
      NaN
    
    
      2016-10-16 18:00:00
      NaN
    
    
      2016-10-16 19:00:00
      NaN
    
    
      2016-10-16 20:00:00
      500.0
    
    
      2016-10-16 21:00:00
      300.0
    
    
      2016-10-16 22:00:00
      NaN
    
    
      2016-10-16 23:00:00
      NaN
    
    
      2016-10-17 00:00:00
      NaN
    
    
      2016-10-17 01:00:00
      NaN
    
    
      2016-10-17 02:00:00
      NaN
    
    
      2016-10-17 03:00:00
      NaN
    
    
      2016-10-17 04:00:00
      NaN
    
    
      2016-10-17 05:00:00
      NaN
    
    
      2016-10-17 06:00:00
      NaN
    
    
      2016-10-17 07:00:00
      NaN
    
    
      2016-10-17 08:00:00
      900.0



In [135]:

    
spiderman_meals.resample('1H').ffill()









    Out[135]:







  
    
      
      calories
    
    
      time
      
    
  
  
    
      2016-10-15 10:00:00
      300
    
    
      2016-10-15 11:00:00
      300
    
    
      2016-10-15 12:00:00
      300
    
    
      2016-10-15 13:00:00
      900
    
    
      2016-10-15 14:00:00
      900
    
    
      2016-10-15 15:00:00
      1200
    
    
      2016-10-15 16:00:00
      1200
    
    
      2016-10-15 17:00:00
      1200
    
    
      2016-10-15 18:00:00
      1200
    
    
      2016-10-15 19:00:00
      1200
    
    
      2016-10-15 20:00:00
      1200
    
    
      2016-10-15 21:00:00
      700
    
    
      2016-10-15 22:00:00
      700
    
    
      2016-10-15 23:00:00
      700
    
    
      2016-10-16 00:00:00
      700
    
    
      2016-10-16 01:00:00
      700
    
    
      2016-10-16 02:00:00
      700
    
    
      2016-10-16 03:00:00
      700
    
    
      2016-10-16 04:00:00
      700
    
    
      2016-10-16 05:00:00
      700
    
    
      2016-10-16 06:00:00
      700
    
    
      2016-10-16 07:00:00
      1600
    
    
      2016-10-16 08:00:00
      1600
    
    
      2016-10-16 09:00:00
      1600
    
    
      2016-10-16 10:00:00
      1600
    
    
      2016-10-16 11:00:00
      1600
    
    
      2016-10-16 12:00:00
      1600
    
    
      2016-10-16 13:00:00
      600
    
    
      2016-10-16 14:00:00
      600
    
    
      2016-10-16 15:00:00
      600
    
    
      2016-10-16 16:00:00
      900
    
    
      2016-10-16 17:00:00
      900
    
    
      2016-10-16 18:00:00
      900
    
    
      2016-10-16 19:00:00
      900
    
    
      2016-10-16 20:00:00
      500
    
    
      2016-10-16 21:00:00
      300
    
    
      2016-10-16 22:00:00
      300
    
    
      2016-10-16 23:00:00
      300
    
    
      2016-10-17 00:00:00
      300
    
    
      2016-10-17 01:00:00
      300
    
    
      2016-10-17 02:00:00
      300
    
    
      2016-10-17 03:00:00
      300
    
    
      2016-10-17 04:00:00
      300
    
    
      2016-10-17 05:00:00
      300
    
    
      2016-10-17 06:00:00
      300
    
    
      2016-10-17 07:00:00
      300
    
    
      2016-10-17 08:00:00
      900



In [136]:

    
spiderman_meals.resample('1D').first()

EXERCISE

Read Star Wars: The Force Awakens's tweets from data/theforceawakens_tweets.csv. Create DateTimeIndex from created_at column.



In [137]:

    
# Solution here

force_awakens_tweets = pd.read_csv(
    'data/theforceawakens_tweets.csv',
    parse_dates=['created_at'],
    index_col='created_at'
)
force_awakens_tweets.head()









    Out[137]:







  
    
      
      tweet_id
    
    
      created_at
      
    
  
  
    
      2016-09-21 20:02:48
      778685932769730565
    
    
      2016-09-21 20:04:26
      778686345170477057
    
    
      2016-09-21 20:05:11
      778686534207713280
    
    
      2016-09-21 20:08:19
      778687322204233728
    
    
      2016-09-21 20:08:55
      778687471961767937

How many tweets did Star Wars: The Force Awakens have in each of last days?



In [138]:

    
# Solution here

force_awakens_tweets.resample('1D').count()

What were the most popular tweeting times of the day for that movie?



In [139]:

    
# Solution here

(
    force_awakens_tweets
    .resample('4H')
    .count()
    .plot(figsize=(15, 5))
)
    
(
    force_awakens_tweets["2016-09-29":]
    .resample('1H')
    .count()
    .plot(figsize=(15, 5))
)









    Out[139]:





<matplotlib.axes._subplots.AxesSubplot at 0x114644208>

Advanced topics + Advanced exercises

Filling missing data



In [140]:

    
heroes_with_missing = pd.DataFrame([
        ('Batman', None, None),
        ('Robin', None, 100),
        ('Spiderman', 400, 90),
        ('Robocop', 500, 95),
        ('Terminator', 600, None)
    ], columns=['hero', 'strength', 'health'])
heroes_with_missing









    Out[140]:







  
    
      
      hero
      strength
      health
    
  
  
    
      0
      Batman
      NaN
      NaN
    
    
      1
      Robin
      NaN
      100.0
    
    
      2
      Spiderman
      400.0
      90.0
    
    
      3
      Robocop
      500.0
      95.0
    
    
      4
      Terminator
      600.0
      NaN



In [141]:

    
heroes_with_missing.dropna()









    Out[141]:







  
    
      
      hero
      strength
      health
    
  
  
    
      2
      Spiderman
      400.0
      90.0
    
    
      3
      Robocop
      500.0
      95.0



In [142]:

    
heroes_with_missing.fillna(0)









    Out[142]:







  
    
      
      hero
      strength
      health
    
  
  
    
      0
      Batman
      0.0
      0.0
    
    
      1
      Robin
      0.0
      100.0
    
    
      2
      Spiderman
      400.0
      90.0
    
    
      3
      Robocop
      500.0
      95.0
    
    
      4
      Terminator
      600.0
      0.0



In [143]:

    
heroes_with_missing.fillna({'strength': 10, 'health': 20})



In [144]:

    
heroes_with_missing.fillna(heroes_with_missing.min())



In [145]:

    
heroes_with_missing.fillna(heroes_with_missing.median())

Scikit-learn



In [146]:

    
pd.DataFrame({'x': [1, 2], 'y': [10, 20]}).plot(x='x', y='y', kind='scatter')









    Out[146]:





<matplotlib.axes._subplots.AxesSubplot at 0x113cc9710>



In [147]:

    
from sklearn.linear_model import LinearRegression

X=[ [1], [2] ]
y=[ 10, 20 ]

clf = LinearRegression()
clf.fit(X, y)









    



/Users/jacekrzeszutek/.virtualenvs/first-steps-with-pandas-3.6.1/lib/python3.6/site-packages/scipy/linalg/basic.py:1018: RuntimeWarning: internal gelsd driver lwork query error, required iwork dimension not returned. This is likely the result of LAPACK bug 0038, fixed in LAPACK 3.2.2 (released July 21, 2010). Falling back to 'gelss' driver.
  warnings.warn(mesg, RuntimeWarning)






    Out[147]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



In [148]:

    
clf.predict([ [0.5], [2], [4] ])









    Out[148]:





array([  5.,  20.,  40.])



In [149]:

    
X = np.array([ [1], [2] ])
y = np.array([ 10, 20 ])

X









    Out[149]:





array([[1],
       [2]])



In [150]:

    
clf = LinearRegression()
clf.fit(X, y)









    Out[150]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



In [151]:

    
clf.predict( np.array([ [0.5], [2], [4] ]) )









    Out[151]:





array([  5.,  20.,  40.])



In [152]:

    
train_df = pd.DataFrame([
        (1, 10),
        (2, 20),
    ], columns=['x', 'y'])

train_df



In [153]:

    
clf = LinearRegression()
clf.fit(train_df[['x']], train_df['y'])









    Out[153]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



In [154]:

    
clf.predict([[0.5]])









    Out[154]:





array([ 5.])



In [155]:

    
test_df = pd.DataFrame({'x': [0.5, 1.5, 4]})
test_df



In [156]:

    
clf.predict(test_df[['x']])









    Out[156]:





array([  5.,  15.,  40.])



In [157]:

    
test_df['y'] = clf.predict(test_df[['x']])
test_df



In [158]:

    
train_df['color'] = 'blue'
test_df['color'] = 'red'

all_df = train_df.append(test_df)
all_df.plot(x='x', y='y', kind='scatter', figsize=(10, 8), color=all_df['color'])









    Out[158]:





<matplotlib.axes._subplots.AxesSubplot at 0x114948d30>

More models to try: http://scikit-learn.org/stable/supervised_learning.html#supervised-learning

EXERCISE

Integration with scikit-learn: Create a model that tries to predict gross of movie. Use any features of the movies dataset.



In [159]:

    
# Solution here

from sklearn.linear_model import LinearRegression

FEATURES = ['num_voted_users', 'imdb_score']
TARGET = 'gross'

movies_with_data = movies[FEATURES + [TARGET]].dropna()

X = movies_with_data[FEATURES].values
y = movies_with_data[TARGET].values

clf = LinearRegression()
clf.fit(X, y)

clf.predict([
    [800000, 8.0],
    [400000, 8.0],
    [400000, 4.0],
    [ 40000, 8.0],
])









    Out[159]:





array([  2.62239503e+08,   1.33963329e+08,   1.64074385e+08,
         1.85147715e+07])

Create a method discovering movies with plot keywords similar to the given list of keywords (i.e. `['magic', 'harry', 'wizard']`)



In [160]:

    
# Solution here

def discover_similar_plot(target_keywords, threshold=0.5):
    movies_with_plot = movies.dropna(
        subset=['plot_keywords']
    ).copy()
    
    movies_with_plot['plot_keywords_set'] = movies_with_plot[
        'plot_keywords'
    ].str.split('|').map(set)
    
    movies_with_plot['match_count'] = movies_with_plot[
        'plot_keywords_set'
    ].map(
        lambda keywords: len(keywords.intersection(target_keywords))
    )
    
    return movies_with_plot[
        (movies_with_plot['match_count'] >= threshold*len(target_keywords))
    ]

discover_similar_plot(['magic', 'harry', 'wizard'])['movie_title']









    Out[160]:





114    Harry Potter and the Order of the Phoenix
115          Harry Potter and the Goblet of Fire
195     Harry Potter and the Prisoner of Azkaban
Name: movie_title, dtype: object

Integration with Flask

In the file flask_exercise.py you'll find the scaffolding for Flask app.
Create endpoints returning:

all movie titles available in the movies dataset
10 worst rated movies ever
10 best rated (imdb_score) movies in a given year



In [161]:

    
# Solution in flask_exercise.py

Thank you for your effort! We hope that you enjoyed the journey!

Any questions?

Any feedback?

You can reach us:

rzeszutekjacek at the most popular service gmail.com
mateuszflieger at as you would expect gmail.com

http://66.media.tumblr.com/b0e67112232adb68180fe7b988304abd/tumblr_inline_nw063rleWO1tp3b6e_1280.jpg

	imdb_score	movie_title
0	7.9	Avatar
1	7.1	Pirates of the Caribbean: At World's End
2	6.8	Spectre

	daily_views	month	project	rank	title
2016-01-01	2701	201601	en	2251	Python_(programming_language)
2016-01-02	2829	201601	en	2251	Python_(programming_language)
2016-01-03	2773	201601	en	2251	Python_(programming_language)
2016-01-04	5072	201601	en	2251	Python_(programming_language)
2016-01-05	5513	201601	en	2251	Python_(programming_language)
2016-01-06	5233	201601	en	2251	Python_(programming_language)
2016-01-07	5518	201601	en	2251	Python_(programming_language)
2016-01-08	5312	201601	en	2251	Python_(programming_language)
2016-01-09	2749	201601	en	2251	Python_(programming_language)
2016-01-10	2989	201601	en	2251	Python_(programming_language)
2016-01-11	5539	201601	en	2251	Python_(programming_language)
2016-01-12	6161	201601	en	2251	Python_(programming_language)
2016-01-13	6175	201601	en	2251	Python_(programming_language)
2016-01-14	5545	201601	en	2251	Python_(programming_language)
2016-01-15	4833	201601	en	2251	Python_(programming_language)
2016-01-16	2858	201601	en	2251	Python_(programming_language)
2016-01-17	2992	201601	en	2251	Python_(programming_language)
2016-01-18	4673	201601	en	2251	Python_(programming_language)
2016-01-19	5625	201601	en	2251	Python_(programming_language)
2016-01-20	5450	201601	en	2251	Python_(programming_language)

	movie_title	title_year	budget	gross	genres	language	country	movie_facebook_likes	imdb_score	num_voted_users	...	actor_3_name	actor_3_facebook_likes	cast_total_facebook_likes	color	duration	plot_keywords	content_rating	aspect_ratio	facenumber_in_poster	movie_imdb_link
0	Avatar	2009.0	237000000.0	760505847.0	Action\|Adventure\|Fantasy\|Sci-Fi	English	USA	33000	7.9	886204	...	Wes Studi	855.0	4834	Color	178.0	avatar\|future\|marine\|native\|paraplegic	PG-13	1.78	0.0	http://www.imdb.com/title/tt0499549/?ref_=fn_t...
1	Pirates of the Caribbean: At World's End	2007.0	300000000.0	309404152.0	Action\|Adventure\|Fantasy	English	USA	0	7.1	471220	...	Jack Davenport	1000.0	48350	Color	169.0	goddess\|marriage ceremony\|marriage proposal\|pi...	PG-13	2.35	0.0	http://www.imdb.com/title/tt0449088/?ref_=fn_t...
2	Spectre	2015.0	245000000.0	200074175.0	Action\|Adventure\|Thriller	English	UK	85000	6.8	275868	...	Stephanie Sigman	161.0	11700	Color	148.0	bomb\|espionage\|sequel\|spy\|terrorist	PG-13	2.35	1.0	http://www.imdb.com/title/tt2379713/?ref_=fn_t...
3	The Dark Knight Rises	2012.0	250000000.0	448130642.0	Action\|Thriller	English	USA	164000	8.5	1144337	...	Joseph Gordon-Levitt	23000.0	106759	Color	164.0	deception\|imprisonment\|lawlessness\|police offi...	PG-13	2.35	0.0	http://www.imdb.com/title/tt1345836/?ref_=fn_t...
4	Star Wars: Episode VII - The Force Awakens	NaN	NaN	NaN	Documentary	NaN	NaN	0	7.1	8	...	NaN	NaN	143	NaN	NaN	NaN	NaN	NaN	0.0	http://www.imdb.com/title/tt5289954/?ref_=fn_t...

	movie_title	title_year	budget	gross	genres	language	country	movie_facebook_likes	imdb_score	num_voted_users	...	actor_3_name	actor_3_facebook_likes	cast_total_facebook_likes	color	duration	plot_keywords	content_rating	aspect_ratio	facenumber_in_poster	movie_imdb_link
664	Space Cowboys	2000.0	65000000.0	90454043.0	Action\|Adventure\|Thriller	English	USA	0	6.4	60165	...	William Devane	416.0	17771	Black and White	130.0	nasa\|old age\|satellite\|space\|space shuttle	PG-13	2.35	1.0	http://www.imdb.com/title/tt0186566/?ref_=fn_t...
772	Invictus	2009.0	60000000.0	37479778.0	Biography\|Drama\|History\|Sport	English	USA	23000	7.4	124222	...	Leleti Khumalo	204.0	24458	Color	134.0	apartheid\|nation\|nelson mandela\|president\|rugby	PG-13	2.35	1.0	http://www.imdb.com/title/tt1057500/?ref_=fn_t...
814	American Sniper	2014.0	58800000.0	350123553.0	Action\|Biography\|Drama\|History\|Thriller\|War	English	USA	112000	7.3	325264	...	Keir O'Donnell	318.0	16277	Color	133.0	assassin\|death of child\|iraq\|sniper\|sniper rifle	R	2.35	0.0	http://www.imdb.com/title/tt2179136/?ref_=fn_t...
877	Changeling	2008.0	55000000.0	35707327.0	Crime\|Drama\|Mystery\|Thriller	English	USA	14000	7.8	199056	...	Colm Feore	539.0	12840	Color	141.0	execution by hanging\|los angeles police depart...	R	2.35	0.0	http://www.imdb.com/title/tt0824747/?ref_=fn_t...
906	Flags of Our Fathers	2006.0	90000000.0	33574332.0	Drama\|History\|War	English	USA	0	7.1	101221	...	Tom McCarthy	310.0	24468	Color	135.0	imperial japan\|japan\|japanese army\|japanese so...	R	2.35	0.0	http://www.imdb.com/title/tt0418689/?ref_=fn_t...
973	Absolute Power	1997.0	50000000.0	50007168.0	Action\|Crime\|Drama\|Thriller	English	USA	1000	6.7	39529	...	Scott Glenn	826.0	19139	Color	121.0	death\|death of wife\|murder\|thief\|violence	R	2.35	0.0	http://www.imdb.com/title/tt0118548/?ref_=fn_t...
997	Hereafter	2010.0	50000000.0	32741596.0	Drama\|Fantasy	English	USA	16000	6.5	80140	...	Cécile De France	447.0	14344	Color	129.0	death\|france\|near death experience\|tsunami\|twi...	PG-13	2.35	0.0	http://www.imdb.com/title/tt1212419/?ref_=fn_t...
1004	Blood Work	2002.0	50000000.0	26199517.0	Action\|Crime\|Drama\|Mystery\|Thriller	English	USA	705	6.4	34356	...	Rick Hoffman	581.0	18635	Color	110.0	boat\|heart transplant\|murder\|serial killer\|sister	R	2.35	1.0	http://www.imdb.com/title/tt0309377/?ref_=fn_t...
1245	Jersey Boys	2014.0	40000000.0	47034272.0	Biography\|Drama\|Music\|Musical	English	USA	16000	6.9	25465	...	Scott Vance	235.0	2281	Color	134.0	band\|music group\|musical quartet\|singer\|song	R	2.35	1.0	http://www.imdb.com/title/tt1742044/?ref_=fn_t...
1453	J. Edgar	2011.0	35000000.0	37304950.0	Biography\|Crime\|Drama	English	USA	16000	6.6	102728	...	Kaitlyn Dever	363.0	36258	Color	137.0	agent\|fbi\|federal bureau of investigation\|gay ...	R	2.35	0.0	http://www.imdb.com/title/tt1616195/?ref_=fn_t...
1462	Midnight in the Garden of Good and Evil	1997.0	30000000.0	25078937.0	Crime\|Drama\|Mystery\|Thriller	English	USA	0	6.6	31751	...	Leon Rippy	229.0	19689	Color	155.0	graveyard\|murder\|murder trial\|reporter\|trial	R	1.85	2.0	http://www.imdb.com/title/tt0119668/?ref_=fn_t...
1603	Mystic River	2003.0	25000000.0	90135191.0	Crime\|Drama\|Mystery\|Thriller	English	USA	12000	8.0	338415	...	Tom Guiry	262.0	1942	Color	138.0	ballistics\|detective\|friend\|gun\|sexual abuse	R	2.35	0.0	http://www.imdb.com/title/tt0327056/?ref_=fn_t...
1604	Million Dollar Baby	2004.0	30000000.0	100422786.0	Drama\|Sport	English	USA	14000	8.1	482064	...	Mike Colter	567.0	28830	Color	132.0	boxing\|boxing trainer\|death\|suicide attempt\|wo...	PG-13	2.35	1.0	http://www.imdb.com/title/tt0405159/?ref_=fn_t...
1869	Gran Torino	2008.0	33000000.0	148085755.0	Drama	English	USA	28000	8.2	561773	...	Ahney Her	198.0	17347	Color	116.0	gang\|gang rape\|hmong\|teenager\|widower	R	2.35	0.0	http://www.imdb.com/title/tt1205489/?ref_=fn_t...
2086	The Bridges of Madison County	1995.0	35000000.0	70960517.0	Drama\|Romance	English	USA	0	7.5	54631	...	Debra Monk	86.0	27237	Color	135.0	1960s\|bridge\|housewife\|love\|photographer	PG-13	1.85	0.0	http://www.imdb.com/title/tt0112579/?ref_=fn_t...
2132	Firefox	1982.0	21000000.0	46700000.0	Action\|Adventure\|Thriller	English	USA	0	5.9	19832	...	Kenneth Colley	216.0	16805	Color	136.0	fighter plane\|jet fighter\|refueling\|russia\|sov...	PG	2.35	1.0	http://www.imdb.com/title/tt0083943/?ref_=fn_t...
2760	Unforgiven	1992.0	14400000.0	101157447.0	Drama\|Western	English	USA	10000	8.3	277505	...	Frances Fisher	638.0	28544	Color	131.0	englishman\|leaving flowers on a grave\|one last...	R	2.35	1.0	http://www.imdb.com/title/tt0105695/?ref_=fn_t...
2863	Letters from Iwo Jima	2006.0	19000000.0	13753931.0	Drama\|History\|War	Japanese	USA	5000	7.9	132149	...	Shidô Nakamura	78.0	751	Color	141.0	blood splatter\|general\|island\|japan\|world war two	R	2.35	0.0	http://www.imdb.com/title/tt0498380/?ref_=fn_t...
3540	Pale Rider	1985.0	6900000.0	41400000.0	Western	English	USA	0	7.3	39451	...	Sydney Penny	240.0	17209	Color	115.0	gold\|miner\|preacher\|prospector\|sheriff	R	2.35	0.0	http://www.imdb.com/title/tt0089767/?ref_=fn_t...
3673	High Plains Drifter	1973.0	5500000.0	NaN	Mystery\|Western	English	USA	0	7.6	36381	...	John Hillerman	258.0	17599	Color	105.0	gravestone\|gunfighter\|outlaw\|revenge\|small town	R	2.35	0.0	http://www.imdb.com/title/tt0068699/?ref_=fn_t...

	hero	strength	health
0	Batman	400	NaN
1	Robin	200	NaN
2	Spiderman	300	NaN
3	Robocop	400	NaN
4	Terminator	500	NaN

	hero	strength	health	height	is_hungry
0	Batman	400	100	180	True
1	Robin	200	100	170	False
2	Spiderman	300	100	175	False
3	Robocop	400	100	190	True
4	Terminator	500	100	185	True

	hero	strength	health	height	is_hungry	strength_per_cm
0	Batman	400	100	180	True	2.222222
1	Robin	200	100	170	False	1.176471
2	Spiderman	300	100	175	False	1.714286
3	Robocop	400	100	190	True	2.105263
4	Terminator	500	100	185	True	2.702703

	movie_title	title_year	budget	gross	genres	language	country	movie_facebook_likes	imdb_score	num_voted_users	...	actor_3_facebook_likes	cast_total_facebook_likes	color	duration	plot_keywords	content_rating	aspect_ratio	facenumber_in_poster	movie_imdb_link	profitability
4793	Paranormal Activity	2007.0	15000.0	107917283.0	Horror	English	USA	12000	6.3	184824	...	21.0	330	Color	84.0	dark force\|entity\|evil force\|found footage\|par...	R	1.85	0.0	http://www.imdb.com/title/tt1179904/?ref_=fn_t...	7194.485533
4799	Tarnation	2003.0	218.0	592014.0	Biography\|Documentary	English	USA	754	7.2	5709	...	0.0	78	Color	88.0	answering machine\|home movie\|lithium\|schizophr...	Unrated	1.37	3.0	http://www.imdb.com/title/tt0390538/?ref_=fn_t...	2715.660550
4707	The Blair Witch Project	1999.0	60000.0	140530114.0	Horror	English	USA	0	6.4	186786	...	39.0	399	Black and White	81.0	found footage\|looking at the camera\|lost in th...	R	1.33	0.0	http://www.imdb.com/title/tt0185937/?ref_=fn_t...	2342.168567
4984	The Brothers McMullen	1995.0	25000.0	10246600.0	Comedy\|Drama\|Romance	English	USA	265	6.6	6375	...	73.0	388	Color	98.0	abusive father\|critically acclaimed\|loss of fa...	R	1.85	0.0	http://www.imdb.com/title/tt0112585/?ref_=fn_t...	409.864000
4936	The Texas Chain Saw Massacre	1974.0	83532.0	30859000.0	Horror\|Thriller	English	USA	0	7.5	96411	...	177.0	1094	Color	88.0	cannibal\|chainsaw\|hitchhiker\|independent film\|...	R	1.85	0.0	http://www.imdb.com/title/tt0072271/?ref_=fn_t...	369.427285
3278	The Texas Chain Saw Massacre	1974.0	83532.0	30859000.0	Horror\|Thriller	English	USA	0	7.5	96410	...	177.0	1094	Color	88.0	cannibal\|chainsaw\|hitchhiker\|independent film\|...	R	1.85	0.0	http://www.imdb.com/title/tt0072271/?ref_=fn_t...	369.427285
5035	El Mariachi	1992.0	7000.0	2040920.0	Action\|Crime\|Drama\|Romance\|Thriller	Spanish	USA	0	6.9	52055	...	6.0	147	Color	81.0	assassin\|death\|guitar\|gun\|mariachi	R	1.37	0.0	http://www.imdb.com/title/tt0104815/?ref_=fn_t...	291.560000
4956	The Gallows	2015.0	100000.0	22757819.0	Horror\|Thriller	English	USA	0	4.2	13521	...	7.0	276	Color	81.0	breaking and entering\|gallows\|hanging\|high sch...	R	1.85	0.0	http://www.imdb.com/title/tt2309260/?ref_=fn_t...	227.578190
4977	Super Size Me	2004.0	65000.0	11529368.0	Comedy\|Documentary\|Drama	English	USA	0	7.3	85028	...	0.0	0	Color	100.0	experiment\|fast food\|food industry\|meal\|overea...	PG	1.78	2.0	http://www.imdb.com/title/tt0390521/?ref_=fn_t...	177.374892
2619	Halloween	1978.0	300000.0	47000000.0	Horror\|Thriller	English	USA	12000	7.9	157857	...	598.0	4400	Color	101.0	halloween\|masked killer\|michael myers\|slasher\|...	R	2.35	0.0	http://www.imdb.com/title/tt0077651/?ref_=fn_t...	156.666667

	strength	health	height	strength_per_cm	effective_strength
count	5.000000	5.0	5.000000	5.000000	5.000000
mean	460.000000	200.0	180.000000	1.984189	100.000000
std	114.017543	0.0	7.905694	0.572896	141.421356
min	300.000000	200.0	170.000000	1.176471	0.000000
25%	400.000000	200.0	175.000000	1.714286	0.000000
50%	500.000000	200.0	180.000000	2.105263	0.000000
75%	500.000000	200.0	185.000000	2.222222	200.000000
max	600.000000	200.0	190.000000	2.702703	300.000000

	strength	health	height	is_hungry	strength_per_cm	code	effective_strength
hero
Batman	500	200	180	True	2.222222	Ba	0
Robin	300	200	170	False	1.176471	Ro	200
Spiderman	400	200	175	False	1.714286	Sp	300
Robocop	500	200	190	True	2.105263	Ro	0
Terminator	600	200	185	True	2.702703	Te	0

		speed	strength
movie	hero
Batman	Batman	100.0	400
Batman	Robin	10.0	100
Robocop	Robocop	1.0	560
Spiderman	Dr Octopus	NaN	300
	Lex Luthor	20.0	89
	Spiderman	200.0	400

	calories	time
0	300	2016-10-15 10:00
1	900	2016-10-15 13:00
2	1200	2016-10-15 15:00
3	700	2016-10-15 21:00
4	1600	2016-10-16 07:00
5	600	2016-10-16 13:00
6	900	2016-10-16 16:00
7	500	2016-10-16 20:00
8	300	2016-10-16 21:00
9	900	2016-10-17 08:00

	calories	time
0	300	2016-10-15 10:00:00
1	900	2016-10-15 13:00:00
2	1200	2016-10-15 15:00:00
3	700	2016-10-15 21:00:00
4	1600	2016-10-16 07:00:00
5	600	2016-10-16 13:00:00
6	900	2016-10-16 16:00:00
7	500	2016-10-16 20:00:00
8	300	2016-10-16 21:00:00
9	900	2016-10-17 08:00:00

	calories
time
2016-10-15 10:00:00	300
2016-10-15 13:00:00	900
2016-10-15 15:00:00	1200
2016-10-15 21:00:00	700
2016-10-16 07:00:00	1600
2016-10-16 13:00:00	600
2016-10-16 16:00:00	900
2016-10-16 20:00:00	500
2016-10-16 21:00:00	300
2016-10-17 08:00:00	900

	calories
time
2016-10-15 10:00:00	300.0
2016-10-15 11:00:00	NaN
2016-10-15 12:00:00	NaN
2016-10-15 13:00:00	900.0
2016-10-15 14:00:00	NaN
2016-10-15 15:00:00	1200.0
2016-10-15 16:00:00	NaN
2016-10-15 17:00:00	NaN
2016-10-15 18:00:00	NaN
2016-10-15 19:00:00	NaN
2016-10-15 20:00:00	NaN
2016-10-15 21:00:00	700.0
2016-10-15 22:00:00	NaN
2016-10-15 23:00:00	NaN
2016-10-16 00:00:00	NaN
2016-10-16 01:00:00	NaN
2016-10-16 02:00:00	NaN
2016-10-16 03:00:00	NaN
2016-10-16 04:00:00	NaN
2016-10-16 05:00:00	NaN
2016-10-16 06:00:00	NaN
2016-10-16 07:00:00	1600.0
2016-10-16 08:00:00	NaN
2016-10-16 09:00:00	NaN
2016-10-16 10:00:00	NaN
2016-10-16 11:00:00	NaN
2016-10-16 12:00:00	NaN
2016-10-16 13:00:00	600.0
2016-10-16 14:00:00	NaN
2016-10-16 15:00:00	NaN
2016-10-16 16:00:00	900.0
2016-10-16 17:00:00	NaN
2016-10-16 18:00:00	NaN
2016-10-16 19:00:00	NaN
2016-10-16 20:00:00	500.0
2016-10-16 21:00:00	300.0
2016-10-16 22:00:00	NaN
2016-10-16 23:00:00	NaN
2016-10-17 00:00:00	NaN
2016-10-17 01:00:00	NaN
2016-10-17 02:00:00	NaN
2016-10-17 03:00:00	NaN
2016-10-17 04:00:00	NaN
2016-10-17 05:00:00	NaN
2016-10-17 06:00:00	NaN
2016-10-17 07:00:00	NaN
2016-10-17 08:00:00	900.0

	tweet_id
created_at
2016-09-21 20:02:48	778685932769730565
2016-09-21 20:04:26	778686345170477057
2016-09-21 20:05:11	778686534207713280
2016-09-21 20:08:19	778687322204233728
2016-09-21 20:08:55	778687471961767937

	tweet_id
created_at
2016-09-21	153
2016-09-22	747
2016-09-23	1313
2016-09-24	898
2016-09-25	954
2016-09-26	791
2016-09-27	789
2016-09-28	695
2016-09-29	919
2016-09-30	944
2016-10-01	638

	hero	strength	health
0	Batman	400.0	90.0
1	Robin	400.0	100.0
2	Spiderman	400.0	90.0
3	Robocop	500.0	95.0
4	Terminator	600.0	90.0

Welcome to the 'First steps with pandas'!

Before coding, let's check whether we have proper versions of libraries

What is pandas?

Why to use it?

It has ready solutions for most of data-related problems

It is easy to pick up

It is reasonably fast

It has a very cool name.

Library highlights

Data structures

Series

DataFrame

Creating

Meta data

Selecting

Chaining (most of operations on DataFrame returns new DataFrame or Series)

EXERCISE

With dict of Series

With list of dicts

With list of Series

I/O part I

Reading popular formats / data sources

EXERCISE

Load movies from data/movies.csv to variable called movies

Analyze what dimensions and columns it has

Filtering

Boolean indexing

Multiple conditions

Negation

Filtering for containing one of many values (SQL's IN)

EXERCISE

What movies has been directed by Clint Eastwood?

What movies have earned above $500m?

Are there any Polish movies?

What are really popular great movies? (> 100k FB likes, > 8.5 IMDB score)

In what movies main role was played by brutals like "Jason Statham", "Sylvester Stallone" or god ("Morgan Freeman")?

I/O part O

As numpy array

As (list) of dicts

As popular data format

EXERCISE

Create a csv with movie titles and cast (actors) of movies with budget above $200m

Create a list of dicts with movie titles and facebook likes of all Christopher Nolan's movies

New columns

Creating new column

Vector operations

Map, apply, applymap, str

Cheatsheet

Sorting and value counts (bonus skill)

EXERCISE

What are 10 most profitable movies? (ratio between gross and budget)

Create a column 'first_genre'. What is the distribution of values in this column?

Visualizing data

Basic stats

Plotting

Histogram

DataFrames everywhere.. are easy to plot

EXERCISE

Create a chart presenting grosses of movies directed by Christopher Nolan

What are typical durations of the movies?

What is percentage distribution of first genre? (cake)

Aggregation

Grouping

Aggregating

EXERCISE

What was maximal gross in each year?

Which director earns the most on average?

Index related operations

Data alignment on Index

Setting index

Merging two DFs (a'la SQL join)

DateTime operations

Filtering

Resampling (downsampling and upsampling)

EXERCISE

Read Star Wars: The Force Awakens's tweets from data/theforceawakens_tweets.csv. Create DateTimeIndex from created_at column.

How many tweets did Star Wars: The Force Awakens have in each of last days?

What were the most popular tweeting times of the day for that movie?

Advanced topics + Advanced exercises

Filling missing data

Load movies from data/movies.csv to variable called `movies`

Create a method discovering movies with plot keywords similar to the given list of keywords (i.e. `['magic', 'harry', 'wizard']`)