Lab 1. An Introduction to Pandas and Python


In [1]:
# The %... is an IPython thing, and is not part of the Python language.
# In this case we're just telling the plotting library to draw things on
# the notebook, instead of on a separate window.
%matplotlib inline 
#this line above prepares IPython notebook for working with matplotlib

# See all the "as ..." constructs? They're just aliasing the package names.
# That way we can call methods like plt.plot() instead of matplotlib.pyplot.plot().

import numpy as np # imports a fast numerical programming library
import scipy as sp #imports stats functions, amongst other things
import matplotlib as mpl # this actually imports matplotlib
import matplotlib.cm as cm #allows us easy access to colormaps
import matplotlib.pyplot as plt #sets up plotting under plt
import pandas as pd #lets us handle data as dataframes
#sets up pandas table display
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns #sets up styles and gives us more plotting options

Python depends on packages for most of its functionality; these can be either built-in (such as sys) or third-party (like all the packages imported above). Either way, you need to import a package before using it.

The Notebook

Look up http://www.google.com. Let's eat a burrito. $\alpha = \frac{\beta}{\gamma}$

Longer:

$$\alpha = \frac{\beta}{\gamma}$$
  1. an item
  2. another item
  3. i like items

Pandas

Get Cheatsheet:

from https://drive.google.com/folderview?id=0ByIrJAE4KMTtaGhRcXkxNHhmY2M&usp=sharing

We read in some data from a CSV file. CSV files can be output by any spreadsheet software and are plain text, so they make a great way to share data. This dataset is from Goodreads: I scraped the highest-regarded books on that site (according to Goodreads' proprietary algorithm). You'll see how to do such scraping in the next lab.


In [2]:
df=pd.read_csv("all.csv", header=None,
               names=["rating", 'review_count', 'isbn', 'booktype','author_url', 'year', 'genre_urls', 'dir','rating_count', 'name'],
)
df.head()


Out[2]:
rating review_count isbn booktype author_url year genre_urls dir rating_count name
0 4.40 136455 0439023483 good_reads:book https://www.goodreads.com/author/show/153394.S... 2008 /genres/young-adult|/genres/science-fiction|/g... dir01/2767052-the-hunger-games.html 2958974 The Hunger Games (The Hunger Games, #1)
1 4.41 16648 0439358078 good_reads:book https://www.goodreads.com/author/show/1077326.... 2003 /genres/fantasy|/genres/young-adult|/genres/fi... dir01/2.Harry_Potter_and_the_Order_of_the_Phoe... 1284478 Harry Potter and the Order of the Phoenix (Har...
2 3.56 85746 0316015849 good_reads:book https://www.goodreads.com/author/show/941441.S... 2005 /genres/young-adult|/genres/fantasy|/genres/ro... dir01/41865.Twilight.html 2579564 Twilight (Twilight, #1)
3 4.23 47906 0061120081 good_reads:book https://www.goodreads.com/author/show/1825.Har... 1960 /genres/classics|/genres/fiction|/genres/histo... dir01/2657.To_Kill_a_Mockingbird.html 2078123 To Kill a Mockingbird
4 4.23 34772 0679783261 good_reads:book https://www.goodreads.com/author/show/1265.Jan... 1813 /genres/classics|/genres/fiction|/genres/roman... dir01/1885.Pride_and_Prejudice.html 1388992 Pride and Prejudice

Notice we have a table! A spreadsheet! And it indexed the rows. Pandas (borrowing from R) calls it a DataFrame. Let's see the types of the columns...

df, in Python parlance, is an instance of the pd.DataFrame class, created by calling the pd.read_csv function, which calls the DataFrame constructor inside of it. If you don't understand this sentence, don't worry; it will become clearer later. What you need to take away is that df is a dataframe object, and it has methods (functions belonging to it) which allow it to do things. For example, df.head() is a method that shows the first 5 rows of the dataframe.
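A minimal sketch of the same idea, assuming a tiny CSV held in memory (the two rows and the three column names are made up for illustration and are not read from all.csv):

```python
import io
import pandas as pd

# Hypothetical two-row CSV held in memory; all.csv itself is not used here.
csv_text = u"4.40,136455,The Hunger Games\n3.56,85746,Twilight\n"

# read_csv builds and returns a DataFrame object.
toy = pd.read_csv(io.StringIO(csv_text), header=None,
                  names=["rating", "review_count", "name"])

print(isinstance(toy, pd.DataFrame))  # toy is an instance of the DataFrame class
print(toy.head(1))                    # head is a method; its argument is the number of rows
```

Calling head() with no argument falls back to the default of 5 rows, which is what the cell above relies on.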

The basics


In [3]:
df.dtypes


Out[3]:
rating          float64
review_count     object
isbn             object
booktype         object
author_url       object
year            float64
genre_urls       object
dir              object
rating_count     object
name             object
dtype: object

The shape of the object is:


In [4]:
df.shape


Out[4]:
(6000, 10)

6000 rows times 10 columns. A spreadsheet is a table is a matrix. How can we access the members of this tuple (tuples are written with round brackets, like so: ())? By indexing with square brackets:


In [5]:
df.shape[0], df.shape[1]


Out[5]:
(6000, 10)
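Tuples can also be unpacked into separate names in one step. A small sketch in plain Python, using a literal tuple with the same values as df.shape (the variable names are illustrative):

```python
shape = (6000, 10)   # same values as df.shape above

# Indexing starts at 0, just as with lists.
n_rows = shape[0]
n_cols = shape[1]

# Unpacking assigns each member of the tuple to its own name.
rows, cols = shape

print("%d rows, %d columns" % (n_rows, n_cols))
print("%d cells in total" % (rows * cols))
```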

These are the column names.


In [6]:
df.columns


Out[6]:
Index([u'rating', u'review_count', u'isbn', u'booktype', u'author_url', u'year', u'genre_urls', u'dir', u'rating_count', u'name'], dtype='object')

As the diagram above shows, pandas considers a table (dataframe) as a pasting of many "series" together, horizontally.


In [7]:
type(df.rating), type(df)


Out[7]:
(pandas.core.series.Series, pandas.core.frame.DataFrame)
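To make the "pasting" picture concrete, here is a sketch that builds two small Series by hand and joins them side by side with pd.concat. The values are made up, and this is only an illustration of the idea, not how read_csv assembles df internally:

```python
import pandas as pd

# Two hypothetical columns, each a Series sharing the same row index 0, 1, 2.
rating = pd.Series([4.40, 4.41, 3.56], name="rating")
year = pd.Series([2008, 2003, 2005], name="year")

# axis=1 pastes the Series side by side as columns.
toy = pd.concat([rating, year], axis=1)

print(toy.shape)
print(type(toy["rating"]))  # pulling one column back out gives a Series
```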

Querying

A spreadsheet is useless if you can't slice, dice, or sort it. Here we look for all books with a rating less than 3.


In [8]:
df.rating < 3


Out[8]:
0       False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
5970    False
5971    False
5972    False
5973    False
5974    False
5975    False
5976    False
5977    False
5978    False
5979     True
5980    False
5981    False
5982    False
5983    False
5984    False
5985    False
5986    False
5987    False
5988    False
5989    False
5990    False
5991    False
5992    False
5993    False
5994    False
5995    False
5996    False
5997    False
5998    False
5999    False
Name: rating, dtype: bool

This gives us Trues and Falses. Such a series is called a mask. If we count the number of Trues and divide by the total, we'll get the fraction of ratings $\lt$ 3. First, count the Trues:


In [9]:
np.sum(df.rating < 3)


Out[9]:
4

Why did that work?


In [10]:
print 1*True, 1*False


1 0
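The reason is that Python's bool type is a subclass of int, so True behaves as 1 and False as 0 in arithmetic. A plain-Python sketch with a made-up list of flags:

```python
flags = [False, True, False, True, True]

# bool is a subclass of int, so True counts as 1 and False as 0.
print(isinstance(True, int))
print(sum(flags))   # number of True entries in the list
```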

So we ought to be able to do this


In [11]:
np.sum(df.rating < 3)/df.shape[0]


Out[11]:
0

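But we get a 0? Why? In Python 2.x, dividing one integer by another with / performs integer division by default, so the fractional part is thrown away. (In Python 3, / always performs true division.) One fix is to convert df.shape[0] to a float: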


In [12]:
np.sum(df.rating < 3)/float(df.shape[0])


Out[12]:
0.00066666666666666664
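A sketch that makes the two kinds of division explicit, so that it gives the same answers under Python 2 and Python 3. The numbers are the count and row total from the cells above:

```python
count = 4      # number of ratings below 3, from above
total = 6000   # number of rows, from above

floor_result = count // total        # floor division: always drops the fraction
true_result = count / float(total)   # forcing a float gives true division everywhere

print(floor_result)
print(true_result)
```

In Python 2, putting `from __future__ import division` at the top of a file makes / behave as it does in Python 3.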

Notice that you could just find the average since the Trues map to 1s.


In [15]:
np.mean(df.rating < 3.0)


Out[15]:
0.00066666666666666664

Or directly, in Pandas, which works since df.rating < 3 is a pandas Series.


In [16]:
(df.rating < 3).mean()


Out[16]:
0.00066666666666666664
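The three routes can be checked against each other on a Series small enough to verify by eye. The ratings below are made up:

```python
import numpy as np
import pandas as pd

# Hypothetical ratings; exactly one of the four is below 3.
ratings = pd.Series([4.2, 2.5, 3.9, 4.8])
mask = ratings < 3

by_hand = mask.sum() / float(len(mask))  # count of Trues over the total
via_numpy = np.mean(mask)                # numpy treats True as 1
via_pandas = mask.mean()                 # the Series method does the same

print(by_hand)
print(via_numpy)
print(via_pandas)
```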

Filtering

Here are two ways to get a filtered dataframe. The first uses the query method:


In [22]:
df.query("rating > 4.5")


Out[22]:
rating review_count isbn booktype author_url year genre_urls dir rating_count name
17 4.58 1314 0345538374 good_reads:book https://www.goodreads.com/author/show/656983.J... 1973 /genres/fantasy|/genres/classics|/genres/scien... dir01/30.J_R_R_Tolkien_4_Book_Boxed_Set.html 68495 J.R.R. Tolkien 4-Book Boxed Set
162 4.55 15777 075640407X good_reads:book https://www.goodreads.com/author/show/108424.P... 2007 /genres/fantasy|/genres/fiction dir02/186074.The_Name_of_the_Wind.html 210018 The Name of the Wind (The Kingkiller Chronicle...
222 4.53 15256 055357342X good_reads:book https://www.goodreads.com/author/show/346732.G... 2000 /genres/fantasy|/genres/fiction|/genres/fantas... dir03/62291.A_Storm_of_Swords.html 327992 A Storm of Swords (A Song of Ice and Fire, #3)
242 4.53 5404 0545265355 good_reads:book https://www.goodreads.com/author/show/153394.S... 2010 /genres/young-adult|/genres/fiction|/genres/fa... dir03/7938275-the-hunger-games-trilogy-boxset.... 102330 The Hunger Games Trilogy Boxset (The Hunger Ga...
249 4.80 644 0740748475 good_reads:book https://www.goodreads.com/author/show/13778.Bi... 2005 /genres/sequential-art|/genres/comics|/genres/... dir03/24812.The_Complete_Calvin_and_Hobbes.html 22674 The Complete Calvin and Hobbes
284 4.58 15195 1406321346 good_reads:book https://www.goodreads.com/author/show/150038.C... 2013 /genres/fantasy|/genres/young-adult|/genres/fa... dir03/18335634-clockwork-princess.html 130161 Clockwork Princess (The Infernal Devices, #3)
304 4.54 572 0140259449 good_reads:book https://www.goodreads.com/author/show/1265.Jan... 1933 /genres/classics|/genres/fiction|/genres/roman... dir04/14905.The_Complete_Novels.html 17539 The Complete Novels
386 4.55 8820 0756404738 good_reads:book https://www.goodreads.com/author/show/108424.P... 2011 /genres/fantasy|/genres/fantasy|/genres/epic-f... dir04/1215032.The_Wise_Man_s_Fear.html 142499 The Wise Man's Fear (The Kingkiller Chronicle,...
400 4.53 9292 1423140605 good_reads:book https://www.goodreads.com/author/show/15872.Ri... 2012 /genres/fantasy|/genres/young-adult|/genres/fa... dir05/12127750-the-mark-of-athena.html 128412 The Mark of Athena (The Heroes of Olympus, #3)
475 4.57 824 1416997857 good_reads:book https://www.goodreads.com/author/show/150038.C... 2009 /genres/fantasy|/genres/young-adult|/genres/fa... dir05/6485421-the-mortal-instruments-boxed-set... 39720 The Mortal Instruments Boxed Set (The Mortal I...
483 4.59 2622 0312362153 good_reads:book https://www.goodreads.com/author/show/4430.She... 2008 /genres/romance|/genres/paranormal-romance|/ge... dir05/2299110.Acheron.html 35028 Acheron (Dark-Hunter, #8)
554 4.54 4809 0385341679 good_reads:book https://www.goodreads.com/author/show/48206.Ka... 2011 /genres/fantasy|/genres/urban-fantasy|/genres/... dir06/7304203-shadowfever.html 52812 Shadowfever (Fever, #5)
577 4.60 5732 0765326353 good_reads:book https://www.goodreads.com/author/show/38550.Br... 2010 /genres/science-fiction-fantasy|/genres/fantas... dir06/7235533-the-way-of-kings.html 76551 The Way of Kings (The Stormlight Archive, #1)
620 4.54 7767 1423146727 good_reads:book https://www.goodreads.com/author/show/15872.Ri... 2013 /genres/fantasy|/genres/young-adult|/genres/fa... dir07/12127810-the-house-of-hades.html 72082 The House of Hades (The Heroes of Olympus, #4)
840 4.57 431 1423113497 good_reads:book https://www.goodreads.com/author/show/15872.Ri... 2008 /genres/fantasy|/genres/young-adult|/genres/fa... dir09/3165162-percy-jackson-and-the-olympians-... 22937 Percy Jackson and the Olympians Boxed Set (Per...
883 4.58 558 0140286802 good_reads:book https://www.goodreads.com/author/show/500.Jorg... 1998 /genres/short-stories|/genres/literature|/genr... dir09/17961.Collected_Fictions.html 12596 Collected Fictions
911 4.85 26 1491732954 good_reads:book https://www.goodreads.com/author/show/8189303.... 2014 /genres/fiction dir10/22242097-honor-and-polygamy.html 97 Honor and Polygamy
935 4.64 148 1595142711 good_reads:book https://www.goodreads.com/author/show/137902.R... 2009 /genres/paranormal|/genres/vampires|/genres/yo... dir10/6339989-vampire-academy-collection.html 21743 Vampire Academy Collection (Vampire Academy, #...
938 4.51 11011 1481426303 good_reads:book https://www.goodreads.com/author/show/150038.C... 2014 /genres/fantasy|/genres/young-adult|/genres/fa... dir10/8755785-city-of-heavenly-fire.html 69924 City of Heavenly Fire (The Mortal Instruments,...
953 4.56 27 1477276068 good_reads:book https://www.goodreads.com/author/show/6621980.... 2012 NaN dir10/16243767-crossing-the-seas.html 90 Crossing the Seas
958 4.57 38199 0545010225 good_reads:book https://www.goodreads.com/author/show/1077326.... 2007 /genres/fantasy|/genres/young-adult|/genres/fa... dir10/136251.Harry_Potter_and_the_Deathly_Hall... 1245866 Harry Potter and the Deathly Hallows (Harry Po...
1033 4.56 1304 0007119550 good_reads:book https://www.goodreads.com/author/show/346732.G... 2000 /genres/fiction|/genres/fantasy|/genres/epic-f... dir11/147915.A_Storm_of_Swords.html 41161 A Storm of Swords (A Song of Ice and Fire, #3-2)
1109 4.70 23 NaN good_reads:book https://www.goodreads.com/author/show/7488658.... 2013 /genres/romance dir12/19181419-a-bird-without-wings.html 56 A Bird Without Wings
1127 4.52 644 0141183047 good_reads:book https://www.goodreads.com/author/show/7816.Fer... 1982 /genres/poetry|/genres/fiction|/genres/philoso... dir12/45974.The_Book_of_Disquiet.html 7463 The Book of Disquiet
1151 4.64 84 1491877928 good_reads:book https://www.goodreads.com/author/show/7271860.... 2013 /genres/war|/genres/historical-fiction|/genres... dir12/18501652-the-guardian-of-secrets-and-her... 167 The Guardian of Secrets and Her Deathly Pact
1186 4.51 4853 1619630621 good_reads:book https://www.goodreads.com/author/show/3433047.... 2013 /genres/fantasy|/genres/young-adult|/genres/ro... dir12/17167166-crown-of-midnight.html 34142 Crown of Midnight (Throne of Glass, #2)
1202 4.59 1260 0310902711 good_reads:book https://www.goodreads.com/author/show/5158478.... 1972 /genres/religion|/genres/christian|/genres/non... dir13/280111.Holy_Bible.html 25584 Holy Bible
1260 4.60 1943 0842377506 good_reads:book https://www.goodreads.com/author/show/6492.Fra... 1993 /genres/christian-fiction|/genres/historical-f... dir13/95617.A_Voice_in_the_Wind.html 37923 A Voice in the Wind (Mark of the Lion, #1)
1268 4.52 215 1557091528 good_reads:book https://www.goodreads.com/author/show/63859.Ja... 1787 /genres/history|/genres/non-fiction|/genres/po... dir13/89959.The_Constitution_of_the_United_Sta... 12894 The Constitution of the United States of America
1300 4.61 24 1499227299 good_reads:book https://www.goodreads.com/author/show/7414345.... 2014 /genres/paranormal|/genres/vampires|/genres/pa... dir14/22090082-vampire-princess-rising.html 128 Vampire Princess Rising (The Winters Family Sa...
... ... ... ... ... ... ... ... ... ... ...
5532 4.86 4 1477504540 good_reads:book https://www.goodreads.com/author/show/5989528.... 2013 NaN dir56/17695243-call-of-the-lost-ages.html 7 Call Of The Lost Ages
5549 4.62 13 0882408704 good_reads:book https://www.goodreads.com/author/show/947.Will... 1899 /genres/classics|/genres/fiction|/genres/poetr... dir56/17134346-the-complete-works-of-william-s... 217 The Complete Works of William Shakespeare
5557 4.61 14 NaN good_reads:book https://www.goodreads.com/author/show/32401.Al... 2006 /genres/fantasy|/genres/young-adult dir56/13488552-the-books-of-pellinor.html 394 The Books of Pellinor
5563 4.70 30 NaN good_reads:book https://www.goodreads.com/author/show/7153266.... 2014 /genres/childrens dir56/20445451-children-s-book.html 57 Children's book
5564 5.00 9 NaN good_reads:book https://www.goodreads.com/author/show/7738947.... 2014 /genres/romance|/genres/new-adult dir56/21902777-untainted.html 14 Untainted (Photographer Trilogy, #3)
5584 4.75 3 1481959824 good_reads:book https://www.goodreads.com/author/show/5100743.... 2013 NaN dir56/17606460-why-not-world.html 8 Why Not-World
5588 4.66 190 NaN good_reads:book https://www.goodreads.com/author/show/4942228.... 2011 /genres/romance|/genres/m-m-romance|/genres/sc... dir56/11737700-fade.html 996 Fade (In the company of shadows, #4)
5591 4.58 31 1500118680 good_reads:book https://www.goodreads.com/author/show/7738947.... 2014 /genres/romance|/genres/new-adult dir56/22023804-logan-s-story.html 45 Logan's Story (Sand & Clay, #0.5)
5601 4.66 312 0842384898 good_reads:book https://www.goodreads.com/author/show/5158478.... 1902 /genres/christian|/genres/religion|/genres/non... dir57/930470.Holy_Bible.html 2666 Holy Bible
5607 4.66 513 0007444397 good_reads:book https://www.goodreads.com/author/show/4659154.... 2011 /genres/non-fiction|/genres/biography dir57/11792612-dare-to-dream.html 5572 Dare to Dream (100% Official)
5619 4.52 462 0991190920 good_reads:book https://www.goodreads.com/author/show/7092218.... 2014 /genres/fantasy|/genres/paranormal|/genres/fai... dir57/18188649-escaping-destiny.html 3795 Escaping Destiny (The Fae Chronicles, #3)
5635 4.54 958 0778315703 good_reads:book https://www.goodreads.com/author/show/4480131.... 2013 /genres/erotica|/genres/bdsm|/genres/adult-fic... dir57/17251444-the-mistress.html 4869 The Mistress (The Original Sinners, #4)
5642 4.70 158 1417642165 good_reads:book https://www.goodreads.com/author/show/13778.Bi... 1992 /genres/sequential-art|/genres/comics|/genres/... dir57/70487.Calvin_and_Hobbes.html 9224 Calvin and Hobbes
5657 4.80 8 1469908530 good_reads:book https://www.goodreads.com/author/show/4695431.... 2012 /genres/fantasy dir57/15734769-myrtle-mae-and-the-mirror-in-th... 10 Myrtle Mae and the Mirror in the Attic (The Ma...
5665 4.53 61 NaN good_reads:book https://www.goodreads.com/author/show/7738947.... 2014 /genres/romance|/genres/new-adult|/genres/myst... dir57/20975446-tainted-pictures.html 103 Tainted Pictures (Photographer Trilogy, #2)
5683 4.56 204 NaN good_reads:book https://www.goodreads.com/author/show/3097905.... NaN /genres/fantasy|/genres/young-adult|/genres/ro... dir57/12474623-tiger-s-dream.html 895 Tiger's Dream (The Tiger Saga, #5)
5692 5.00 0 NaN good_reads:book https://www.goodreads.com/author/show/5989528.... 2012 NaN dir57/14288412-abstraction-in-theory---laws-of... 6 Abstraction In Theory - Laws Of Physical Trans...
5716 4.67 34 0810117134 good_reads:book https://www.goodreads.com/author/show/205563.M... 1970 /genres/classics|/genres/fiction|/genres/histo... dir58/1679497.The_Fortress.html 1335 The Fortress
5717 4.71 4 NaN good_reads:book https://www.goodreads.com/author/show/5838022.... 2012 NaN dir58/13741511-american-amaranth.html 14 American Amaranth
5718 4.60 656 1613725132 good_reads:book https://www.goodreads.com/author/show/1122775.... 2012 /genres/romance|/genres/m-m-romance|/genres/ro... dir58/13246997-armed-dangerous.html 5268 Armed & Dangerous (Cut & Run, #5)
5726 4.55 106 1594170347 good_reads:book https://www.goodreads.com/author/show/5158478.... 1952 /genres/religion|/genres/reference|/genres/rel... dir58/147635.Holy_Bible.html 1750 Holy Bible
5729 4.83 16 NaN good_reads:book https://www.goodreads.com/author/show/7058502.... 2014 NaN dir58/22312293-the-keeper.html 29 The Keeper (The Keeper, #5)
5753 4.61 811 1937551865 good_reads:book https://www.goodreads.com/author/show/1122775.... 2013 /genres/romance|/genres/m-m-romance|/genres/ro... dir58/16159276-touch-geaux.html 4212 Touch & Geaux (Cut & Run, #7)
5764 4.54 228 NaN good_reads:book https://www.goodreads.com/author/show/2112402.... 2013 /genres/non-fiction|/genres/self-help|/genres/... dir58/18479831-staying-strong.html 2343 Staying Strong
5778 4.63 0 NaN good_reads:book https://www.goodreads.com/author/show/4808225.... 2010 NaN dir58/11187937-un-spoken.html 19 (Un) Spoken
5806 4.57 121 0679777458 good_reads:book https://www.goodreads.com/author/show/8361.Dor... 1966 /genres/historical-fiction|/genres/fiction|/ge... dir59/351211.The_Disorderly_Knights.html 2177 The Disorderly Knights (The Lymond Chronicles,...
5873 4.55 103 144247372X good_reads:book https://www.goodreads.com/author/show/2876763.... 2012 /genres/fantasy|/genres/paranormal|/genres/ang... dir59/14367071-the-complete-hush-hush-saga.html 2869 The Complete Hush, Hush Saga
5874 4.78 18 2851944371 good_reads:book https://www.goodreads.com/author/show/318835.O... 1972 /genres/poetry|/genres/fiction|/genres/nobel-p... dir59/2014000.Le_Monogramme.html 565 Le Monogramme
5880 4.61 123 NaN good_reads:book https://www.goodreads.com/author/show/4942228.... 2010 /genres/romance|/genres/m-m-romance|/genres/sc... dir59/10506860-the-interludes.html 1031 The Interludes (In the company of shadows, #3)
5957 4.72 104 178048044X good_reads:book https://www.goodreads.com/author/show/20248.J_... 2010 /genres/romance|/genres/paranormal|/genres/vam... dir60/10780042-j-r-ward-collection.html 1788 J. R. Ward Collection

224 rows × 10 columns

Here we create a mask and use it to "index" into the dataframe to get the rows we want.


In [37]:
df[df.year < 0]


Out[37]:
rating review_count isbn booktype author_url year genre_urls dir rating_count name author
47 3.68 5785 0143039954 book https://www.goodreads.com/author/show/903.Homer -800 /genres/classics|/genres/fiction|/genres/poetr... dir01/1381.The_Odyssey.html 560248 The Odyssey Homer
246 4.01 365 0147712556 book https://www.goodreads.com/author/show/903.Homer -800 /genres/classics|/genres/fantasy|/genres/mytho... dir03/1375.The_Iliad_The_Odyssey.html 35123 The Iliad/The Odyssey Homer
455 3.85 1499 0140449140 book https://www.goodreads.com/author/show/879.Plato -380 /genres/philosophy|/genres/classics|/genres/no... dir05/30289.The_Republic.html 82022 The Republic Plato
596 3.77 1240 0679729526 book https://www.goodreads.com/author/show/919.Virgil -29 /genres/classics|/genres/poetry|/genres/fictio... dir06/12914.The_Aeneid.html 60308 The Aeneid Virgil
629 3.64 1231 1580495931 book https://www.goodreads.com/author/show/1002.Sop... -429 /genres/classics|/genres/plays|/genres/drama|/... dir07/1554.Oedipus_Rex.html 93192 Oedipus Rex Sophocles
674 3.92 3559 1590302257 book https://www.goodreads.com/author/show/1771.Sun... -512 /genres/non-fiction|/genres/politics|/genres/c... dir07/10534.The_Art_of_War.html 114619 The Art of War Sun_Tzu
746 4.06 1087 0140449183 book https://www.goodreads.com/author/show/5158478.... -500 /genres/classics|/genres/spirituality|/genres/... dir08/99944.The_Bhagavad_Gita.html 31634 The Bhagavad Gita Anonymous
777 3.52 1038 1580493882 book https://www.goodreads.com/author/show/1002.Sop... -442 /genres/drama|/genres/fiction|/genres/classics... dir08/7728.Antigone.html 49084 Antigone Sophocles
1233 3.94 704 015602764X book https://www.goodreads.com/author/show/1002.Sop... -400 /genres/classics|/genres/plays|/genres/drama|/... dir13/1540.The_Oedipus_Cycle.html 36008 The Oedipus Cycle Sophocles
1397 4.03 890 0192840509 book https://www.goodreads.com/author/show/12452.Aesop -560 /genres/classics|/genres/childrens|/genres/lit... dir14/21348.Aesop_s_Fables.html 71259 Aesop's Fables Aesop
1398 3.60 1644 0141026286 book https://www.goodreads.com/author/show/5158478.... -1500 /genres/religion|/genres/literature|/genres/an... dir14/19351.The_Epic_of_Gilgamesh.html 42026 The Epic of Gilgamesh Anonymous
1428 3.80 539 0486275485 book https://www.goodreads.com/author/show/973.Euri... -431 /genres/classics|/genres/plays|/genres/drama|/... dir15/752900.Medea.html 29858 Medea Euripides
1815 3.96 493 0140443339 book https://www.goodreads.com/author/show/990.Aesc... -458 /genres/classics|/genres/plays|/genres/drama|/... dir19/1519.The_Oresteia.html 18729 The Oresteia Aeschylus
1882 4.02 377 0872205541 book https://www.goodreads.com/author/show/879.Plato -400 /genres/philosophy|/genres/classics|/genres/no... dir19/22632.The_Trial_and_Death_of_Socrates.html 18712 The Trial and Death of Socrates Plato
2078 3.84 399 0140440399 book https://www.goodreads.com/author/show/957.Thuc... -411 /genres/history|/genres/classics|/genres/non-f... dir21/261243.The_History_of_the_Peloponnesian_... 17212 The History of the Peloponnesian War Thucydides
2527 3.94 506 0140449086 book https://www.goodreads.com/author/show/901.Hero... -440 /genres/history|/genres/classics|/genres/non-f... dir26/1362.The_Histories.html 20570 The Histories Herodotus
3133 4.30 131 0872203492 book https://www.goodreads.com/author/show/879.Plato -400 /genres/philosophy|/genres/classics|/genres/no... dir32/9462.Complete_Works.html 7454 Complete Works Plato
3274 3.88 411 0140449493 book https://www.goodreads.com/author/show/2192.Ari... -350 /genres/philosophy|/genres/classics|/genres/no... dir33/19068.The_Nicomachean_Ethics.html 16534 The Nicomachean Ethics Aristotle
3757 3.82 364 0872206033 book https://www.goodreads.com/author/show/1011.Ari... -411 /genres/plays|/genres/classics|/genres/drama|/... dir38/1591.Lysistrata.html 18070 Lysistrata Aristophanes
4402 3.99 516 0140449272 book https://www.goodreads.com/author/show/879.Plato -370 /genres/non-fiction|/genres/classics|/genres/p... dir45/81779.The_Symposium.html 18457 The Symposium Plato
4475 4.11 281 0865163480 book https://www.goodreads.com/author/show/879.Plato -390 /genres/philosophy|/genres/classics|/genres/no... dir45/73945.Apology.html 11478 Apology Plato
5367 4.07 133 0872206335 book https://www.goodreads.com/author/show/879.Plato -360 /genres/philosophy|/genres/classics|/genres/no... dir54/30292.Five_Dialogues.html 9964 Five Dialogues Plato

If you want to combine conditions, use the second form and put parentheses around each condition. The example below combines two conditions with a boolean AND (&). Each condition creates a mask of Trues and Falses.


In [19]:
df[(df.year < 0) & (df.rating > 4)]#there were none greater than 4.5!


Out[19]:
rating review_count isbn booktype author_url year genre_urls dir rating_count name
246 4.01 365 0147712556 good_reads:book https://www.goodreads.com/author/show/903.Homer -800 /genres/classics|/genres/fantasy|/genres/mytho... dir03/1375.The_Iliad_The_Odyssey.html 35123 The Iliad/The Odyssey
746 4.06 1087 0140449183 good_reads:book https://www.goodreads.com/author/show/5158478.... -500 /genres/classics|/genres/spirituality|/genres/... dir08/99944.The_Bhagavad_Gita.html 31634 The Bhagavad Gita
1397 4.03 890 0192840509 good_reads:book https://www.goodreads.com/author/show/12452.Aesop -560 /genres/classics|/genres/childrens|/genres/lit... dir14/21348.Aesop_s_Fables.html 71259 Aesop's Fables
1882 4.02 377 0872205541 good_reads:book https://www.goodreads.com/author/show/879.Plato -400 /genres/philosophy|/genres/classics|/genres/no... dir19/22632.The_Trial_and_Death_of_Socrates.html 18712 The Trial and Death of Socrates
3133 4.30 131 0872203492 good_reads:book https://www.goodreads.com/author/show/879.Plato -400 /genres/philosophy|/genres/classics|/genres/no... dir32/9462.Complete_Works.html 7454 Complete Works
4475 4.11 281 0865163480 good_reads:book https://www.goodreads.com/author/show/879.Plato -390 /genres/philosophy|/genres/classics|/genres/no... dir45/73945.Apology.html 11478 Apology
5367 4.07 133 0872206335 good_reads:book https://www.goodreads.com/author/show/879.Plato -360 /genres/philosophy|/genres/classics|/genres/no... dir54/30292.Five_Dialogues.html 9964 Five Dialogues
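Masks can be combined with OR (|) and inverted with NOT (~) in the same way. A sketch on a made-up four-row table that reuses the column names year and rating:

```python
import pandas as pd

# Hypothetical mini-table with the same column names as df.
toy = pd.DataFrame({
    "year": [-800, -380, 1960, 2008],
    "rating": [4.01, 3.85, 4.23, 4.40],
})

both = toy[(toy.year < 0) & (toy.rating > 4)]          # AND: both conditions hold
either = toy[(toy.year < 0) | (toy.rating > 4.3)]      # OR: at least one holds
neither = toy[~((toy.year < 0) | (toy.rating > 4.3))]  # NOT: invert a mask

# The same AND filter written with query.
both_q = toy.query("year < 0 and rating > 4")

print(len(both))
print(len(either))
print(len(neither))
```

The parentheses matter because & binds more tightly than < and >, so without them Python would try to evaluate `0 & toy.rating` first.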

Cleaning

We first check the datatypes. Notice that review_count and rating_count are of type object (which means they are either strings or Pandas couldn't figure out what they are), while year is a float.


In [20]:
df.dtypes


Out[20]:
rating          float64
review_count     object
isbn             object
booktype         object
author_url       object
year            float64
genre_urls       object
dir              object
rating_count     object
name             object
dtype: object

Suppose we try to fix this:


In [21]:
df['rating_count']=df.rating_count.astype(int)
df['review_count']=df.review_count.astype(int)
df['year']=df.year.astype(int)


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-21-8bf38ae9d108> in <module>()
----> 1 df['rating_count']=df.rating_count.astype(int)
      2 df['review_count']=df.review_count.astype(int)
      3 df['year']=df.year.astype(int)

//anaconda/lib/python2.7/site-packages/pandas/core/generic.pyc in astype(self, dtype, copy, raise_on_error, **kwargs)
   2409 
   2410         mgr = self._data.astype(
-> 2411             dtype=dtype, copy=copy, raise_on_error=raise_on_error, **kwargs)
   2412         return self._constructor(mgr).__finalize__(self)
   2413 

//anaconda/lib/python2.7/site-packages/pandas/core/internals.pyc in astype(self, dtype, **kwargs)
   2502 
   2503     def astype(self, dtype, **kwargs):
-> 2504         return self.apply('astype', dtype=dtype, **kwargs)
   2505 
   2506     def convert(self, **kwargs):

//anaconda/lib/python2.7/site-packages/pandas/core/internals.pyc in apply(self, f, axes, filter, do_integrity_check, **kwargs)
   2457                                                  copy=align_copy)
   2458 
-> 2459             applied = getattr(b, f)(**kwargs)
   2460 
   2461             if isinstance(applied, list):

//anaconda/lib/python2.7/site-packages/pandas/core/internals.pyc in astype(self, dtype, copy, raise_on_error, values, **kwargs)
    371     def astype(self, dtype, copy=False, raise_on_error=True, values=None, **kwargs):
    372         return self._astype(dtype, copy=copy, raise_on_error=raise_on_error,
--> 373                             values=values, **kwargs)
    374 
    375     def _astype(self, dtype, copy=False, raise_on_error=True, values=None,

//anaconda/lib/python2.7/site-packages/pandas/core/internals.pyc in _astype(self, dtype, copy, raise_on_error, values, klass, **kwargs)
    401             if values is None:
    402                 # _astype_nansafe works fine with 1-d only
--> 403                 values = com._astype_nansafe(self.values.ravel(), dtype, copy=True)
    404                 values = values.reshape(self.values.shape)
    405             newb = make_block(values,

//anaconda/lib/python2.7/site-packages/pandas/core/common.pyc in _astype_nansafe(arr, dtype, copy)
   2729     elif arr.dtype == np.object_ and np.issubdtype(dtype.type, np.integer):
   2730         # work around NumPy brokenness, #1987
-> 2731         return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
   2732 
   2733     if copy:

pandas/lib.pyx in pandas.lib.astype_intsafe (pandas/lib.c:14844)()

pandas/src/util.pxd in util.set_value_at (pandas/lib.c:63086)()

ValueError: invalid literal for long() with base 10: 'None'

Oops, we got an error. Something is not right. It's trying to convert the text 'None' (Python's marker for a missing value, stored here as a string) into an int. This usually means data was missing. Was it?


In [22]:
df[df.year.isnull()]


Out[22]:
rating review_count isbn booktype author_url year genre_urls dir rating_count name
2442 4.23 526 good_reads:book https://www.goodreads.com/author/show/623606.A... NaN /genres/religion|/genres/islam|/genres/non-fic... dir25/1301625.La_Tahzan.html 4134 La Tahzan
2869 4.61 2 good_reads:book https://www.goodreads.com/author/show/8182217.... NaN dir29/22031070-my-death-experiences---a-preach... 23 My Death Experiences - A Preacher’s 18 Apoca...
3643 NaN None None None None NaN dir37/9658936-harry-potter.html None None
5282 NaN None None None None NaN dir53/113138.The_Winner.html None None
5572 3.71 35 8423336603 good_reads:book https://www.goodreads.com/author/show/285658.E... NaN /genres/fiction dir56/890680._rase_una_vez_el_amor_pero_tuve_q... 403 Érase una vez el amor pero tuve que matarlo. ...
5658 4.32 44 good_reads:book https://www.goodreads.com/author/show/25307.Ro... NaN /genres/fantasy|/genres/fantasy|/genres/epic-f... dir57/5533041-assassin-s-apprentice-royal-assa... 3850 Assassin's Apprentice / Royal Assassin (Farsee...
5683 4.56 204 good_reads:book https://www.goodreads.com/author/show/3097905.... NaN /genres/fantasy|/genres/young-adult|/genres/ro... dir57/12474623-tiger-s-dream.html 895 Tiger's Dream (The Tiger Saga, #5)

Aha, we had some incomplete data. Let's get rid of it.


In [23]:
df = df[df.year.notnull()]
df.shape


Out[23]:
(5993, 10)

We removed those 7 rows. Let's try the type conversion again.


In [26]:
df['rating_count']=df.rating_count.astype(int)
df['review_count']=df.review_count.astype(int)
df['year']=df.year.astype(int)

In [27]:
df.dtypes


Out[27]:
rating          float64
review_count      int64
isbn             object
booktype         object
author_url       object
year              int64
genre_urls       object
dir              object
rating_count      int64
name             object
dtype: object

Much cleaner now!
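An alternative route, not the one used in this lab, is to let pd.to_numeric turn unparseable entries into NaN first and then drop them. The sketch below assumes a made-up column in which the string 'None' marks a missing value:

```python
import pandas as pd

# Hypothetical column that arrived as strings, with 'None' marking a missing value.
raw = pd.DataFrame({"rating_count": ["2958974", "None", "90"]})

# errors="coerce" turns anything unparseable into NaN instead of raising.
raw["rating_count"] = pd.to_numeric(raw["rating_count"], errors="coerce")

# Drop the missing rows, then convert; .copy() makes the result an independent frame.
clean = raw[raw["rating_count"].notnull()].copy()
clean["rating_count"] = clean["rating_count"].astype(int)

print(clean.dtypes)
print(len(clean))
```

Taking a .copy() of the filtered frame before assigning new columns is one way to avoid pandas' SettingWithCopyWarning.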

Visualizing

Pandas has handy built-in visualization.


In [23]:
df.rating.hist();


We can do this in more detail, plotting against the mean, with a custom bin size or number of bins. Note how to label axes and create legends.


In [24]:
sns.set_context("notebook")
meanrat=df.rating.mean()
#you can get means and medians in different ways
print meanrat, np.mean(df.rating), df.rating.median()
with sns.axes_style("whitegrid"):
    df.rating.hist(bins=30, alpha=0.4);
    plt.axvline(meanrat, 0, 0.75, color='r', label='Mean')
    plt.xlabel("average rating of book")
    plt.ylabel("Counts")
    plt.title("Ratings Histogram")
    plt.legend()
    #sns.despine()


4.04220073358 4.04220073358 4.05

One can see the sparseness of review counts. This will be important when we learn about recommendations: we'll have to regularize our models to deal with it.


In [34]:
df.review_count.hist(bins=np.arange(0, 40000, 400))


Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x105bd4550>

The structure may be easier to see if we rescale the x-axis to be logarithmic.


In [35]:
df.review_count.hist(bins=100)
plt.xscale("log");
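One caveat: the 100 bins in the cell above are still equally wide in linear units, so on a logarithmic axis they look narrower and narrower toward the right. A possible alternative, sketched here on made-up counts with np.histogram instead of a plot, is to space the bin edges evenly in log10 using np.logspace:

```python
import numpy as np

# Hypothetical review counts spanning several orders of magnitude.
counts = np.array([1, 3, 8, 20, 45, 150, 900, 4000, 30000])

# 6 edges evenly spaced in log10, from 10**0 to 10**5, give 5 bins.
# A count of 0 has no logarithm, so zeros would fall outside these bins.
edges = np.logspace(0, 5, 6)

hist, _ = np.histogram(counts, bins=edges)

print(edges)
print(hist)
```

The same edges could then be passed as the bins argument when plotting, for example `df.review_count.hist(bins=edges)` followed by `plt.xscale("log")`.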


Here we make a scatterplot in matplotlib of rating against year. By setting the alpha transparency low, we can see how the density of highly rated books on Goodreads has changed.


In [38]:
plt.scatter(df.year, df.rating, lw=0, alpha=.08)
plt.xlim([1900,2010])
plt.xlabel("Year")
plt.ylabel("Rating")


Out[38]:
<matplotlib.text.Text at 0x109ca9090>

Pythons and ducks

Notice that we used the series in the x-list and y-list slots in the scatter function in the plt module.

In working with python I always remember: a python is a duck.

What I mean is, Python has a certain way of doing things. For example, let's call one of these ways listiness. Listiness works on lists, dictionaries, files, and a general notion of something called an iterator.

A Pandas series plays like a python list:


In [28]:
alist=[1,2,3,4,5]

We can construct another list by using the syntax below, also called a list comprehension.


In [29]:
asquaredlist=[i*i for i in alist]
asquaredlist


Out[29]:
[1, 4, 9, 16, 25]

And then we can again make a scatterplot


In [30]:
plt.scatter(alist, asquaredlist);



In [31]:
print type(alist)


<type 'list'>

In other words, something is a duck if it quacks like a duck. A Pandas series quacks like a Python list. They both support something called the iterator protocol, a notion of behaving in a "listy" way. And Python functions like plt.scatter will accept anything that behaves in a listy way. Indeed here's one more example:


In [34]:
plt.hist(df.rating_count.values, bins=100, alpha=0.5);



In [35]:
print type(df.rating_count), type(df.rating_count.values)


<class 'pandas.core.series.Series'> <type 'numpy.ndarray'>

Series and numpy arrays behave similarly as well.
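The duck-typing idea can be sketched in plain Python. The function total and the class Countdown below are hypothetical names made up for this illustration; the point is only that the for loop works on anything that supports iteration, whatever its type:

```python
def total(listy):
    """Add up the items of anything that can be looped over."""
    result = 0
    for item in listy:   # the for loop only needs the iterator protocol
        result += item
    return result


class Countdown(object):
    """Hypothetical class that is not a list but supports iteration."""
    def __init__(self, start):
        self.start = start

    def __iter__(self):
        # Hand back an iterator over start, start-1, ..., 1.
        return iter(range(self.start, 0, -1))


print(total([1, 2, 3]))      # a list
print(total((1, 2, 3)))      # a tuple
print(total(Countdown(3)))   # a custom object: 3 + 2 + 1
```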

Vectorization

Numpy arrays are a bit different from regular python lists, and are the bread and butter of data science. Pandas Series are built atop them.


In [36]:
alist + alist


Out[36]:
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]

In [37]:
np.array(alist)


Out[37]:
array([1, 2, 3, 4, 5])

In [38]:
np.array(alist)+np.array(alist)


Out[38]:
array([ 2,  4,  6,  8, 10])

In [39]:
np.array(alist)**2


Out[39]:
array([ 1,  4,  9, 16, 25])

In other words, operations on numpy arrays, and by extension Pandas Series, are vectorized. You can add two numpy arrays elementwise by just using +, whereas for regular Python lists + concatenates them, which isn't what you might expect. To add regular Python lists elementwise, you will need to use a loop:


In [40]:
newlist=[]
for item in alist:
    newlist.append(item+item)
newlist


Out[40]:
[2, 4, 6, 8, 10]
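When the two lists differ, the loop needs to walk both at once. One common way to write that, sketched here with a made-up second list blist, is a list comprehension over zip:

```python
alist = [1, 2, 3, 4, 5]
blist = [10, 20, 30, 40, 50]   # hypothetical second list

# zip pairs up items by position; the comprehension adds each pair.
summed = [a + b for a, b in zip(alist, blist)]

print(summed)
```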

Vectorization is a powerful idiom, and we will use it a lot in this class. And, for almost all data-intensive computing, we will use numpy arrays rather than Python lists, as the Python numerical stack is based on them.

You have seen this idea in spreadsheets, where you add an entire column to another one.

Two final examples


In [41]:
a=np.array([1,2,3,4,5])
print type(a)
b=np.array([1,2,3,4,5])

print a*b


<type 'numpy.ndarray'>
[ 1  4  9 16 25]

In [42]:
a+1


Out[42]:
array([2, 3, 4, 5, 6])
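Adding a scalar to an array applies it to every element. A short sketch of the same pattern in a few other forms, using the array a from above (the variable names shifted, centered and big are illustrative):

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])

shifted = a + 1          # the scalar 1 is applied to every element
centered = a - a.mean()  # subtract the mean (3.0) from every element
big = a[a > 2]           # a boolean mask selects elements, as with df[mask]

print(shifted)
print(centered)
print(big)
```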