Chapter 2, 3 of PDA


In [1]:
%pylab --no-import-all inline


Populating the interactive namespace from numpy and matplotlib

In [2]:
import matplotlib.pyplot as plt
import numpy as np

from pylab import figure, show

from pandas import DataFrame, Series
import pandas as pd

Preliminaries: Assumed location of pydata-book files

To make it more practical for me to look at your homework, I'm again going to assume a relative placement of files. I placed the files from

https://github.com/pydata/pydata-book

in a local directory, which in my case is "/Users/raymondyee/D/Document/Working_with_Open_Data/pydata-book/"

and then symbolically linked (ln -s) to the the pydata-book from the root directory of the working-open-data folder. i.e., on OS X

cd /Users/raymondyee/D/Document/Working_with_Open_Data/working-open-data
ln -s /Users/raymondyee/D/Document/Working_with_Open_Data/pydata-book/ pydata-book

That way the files from the pydata-book repository look like they sit in the working-open-data directory -- without having to actually copy the files.

With this arrangment, I should then be able to drop your notebook into my own notebooks directory and run them without having to mess around with paths.


In [3]:
import os

USAGOV_BITLY_PATH = os.path.join(os.pardir, "pydata-book", "ch02", "usagov_bitly_data2012-03-16-1331923249.txt")
MOVIELENS_DIR = os.path.join(os.pardir, "pydata-book", "ch02", "movielens")
NAMES_DIR = os.path.join(os.pardir, "pydata-book", "ch02", "names")

assert os.path.exists(USAGOV_BITLY_PATH)
assert os.path.exists(MOVIELENS_DIR)
assert os.path.exists(NAMES_DIR)

Please make sure the above assertions work

usa.gov bit.ly example

(PfDA, p. 18)

What's in the data file?

http://my.safaribooksonline.com/book/programming/python/9781449323592/2dot-introductory-examples/id2802197 :

In 2011, URL shortening service bit.ly partnered with the United States government website usa.gov to provide a feed of anonymous data gathered from users who shorten links ending with .gov or .mil.

Hourly archive of data: http://bitly.measuredvoice.com/bitly_archive/?C=M;O=D


In [4]:
open(USAGOV_BITLY_PATH).readline()


Out[4]:
'{ "a": "Mozilla\\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\\/535.11 (KHTML, like Gecko) Chrome\\/17.0.963.78 Safari\\/535.11", "c": "US", "nk": 1, "tz": "America\\/New_York", "gr": "MA", "g": "A6qOVH", "h": "wfLQtf", "l": "orofrog", "al": "en-US,en;q=0.8", "hh": "1.usa.gov", "r": "http:\\/\\/www.facebook.com\\/l\\/7AQEFzjSi\\/1.usa.gov\\/wfLQtf", "u": "http:\\/\\/www.ncbi.nlm.nih.gov\\/pubmed\\/22415991", "t": 1331923247, "hc": 1331822918, "cy": "Danvers", "ll": [ 42.576698, -70.954903 ] }\n'

In [5]:
import json
records = [json.loads(line) for line in open(USAGOV_BITLY_PATH)]  # list comprehension

Counting Time Zones with pandas

Recall what records is


In [6]:
len(records)


Out[6]:
3560

In [7]:
# list of dict -> DataFrame

frame = DataFrame(records)
frame.head()


Out[7]:
_heartbeat_ a al c cy g gr h hc hh kw l ll nk r t tz u
0 NaN Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi... en-US,en;q=0.8 US Danvers A6qOVH MA wfLQtf 1331822918 1.usa.gov NaN orofrog [42.576698, -70.954903] 1 http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/... 1331923247 America/New_York http://www.ncbi.nlm.nih.gov/pubmed/22415991
1 NaN GoogleMaps/RochesterNY NaN US Provo mwszkS UT mwszkS 1308262393 j.mp NaN bitly [40.218102, -111.613297] 0 http://www.AwareMap.com/ 1331923249 America/Denver http://www.monroecounty.gov/etc/911/rss.php
2 NaN Mozilla/4.0 (compatible; MSIE 8.0; Windows NT ... en-US US Washington xxr3Qb DC xxr3Qb 1331919941 1.usa.gov NaN bitly [38.9007, -77.043098] 1 http://t.co/03elZC4Q 1331923250 America/New_York http://boxer.senate.gov/en/press/releases/0316...
3 NaN Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)... pt-br BR Braz zCaLwp 27 zUtuOu 1331923068 1.usa.gov NaN alelex88 [-23.549999, -46.616699] 0 direct 1331923249 America/Sao_Paulo http://apod.nasa.gov/apod/ap120312.html
4 NaN Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi... en-US,en;q=0.8 US Shrewsbury 9b6kNl MA 9b6kNl 1273672411 bit.ly NaN bitly [42.286499, -71.714699] 0 http://www.shrewsbury-ma.gov/selco/ 1331923251 America/New_York http://www.shrewsbury-ma.gov/egov/gallery/1341...

5 rows × 18 columns

movielens dataset

PDA p. 26

http://www.grouplens.org/node/73 --> there's also a 10 million ratings dataset -- would be interesting to try out to test scalability of running IPython notebook on laptop


In [8]:
# let's take a look at the data

# my local dir: /Users/raymondyee/D/Document/Working_with_Open_Data/pydata-book/ch02/movielens

!head $MOVIELENS_DIR/movies.dat


1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
5::Father of the Bride Part II (1995)::Comedy
6::Heat (1995)::Action|Crime|Thriller
7::Sabrina (1995)::Comedy|Romance
8::Tom and Huck (1995)::Adventure|Children's
9::Sudden Death (1995)::Action
10::GoldenEye (1995)::Action|Adventure|Thriller

In [26]:
# how many movies?
!wc $MOVIELENS_DIR/movies.dat


    3883   15675  171308 ../pydata-book/ch02/movielens/movies.dat

In [10]:
!head $MOVIELENS_DIR/users.dat


1::F::1::10::48067
2::M::56::16::70072
3::M::25::15::55117
4::M::45::7::02460
5::M::25::20::55455
6::F::50::9::55117
7::M::35::1::06810
8::M::25::12::11413
9::M::25::17::61614
10::F::35::1::95370

In [11]:
!head $MOVIELENS_DIR/ratings.dat


1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291
1::1197::3::978302268
1::1287::5::978302039
1::2804::5::978300719
1::594::4::978302268
1::919::4::978301368

In [28]:
import pandas as pd
import os

os.path.join(MOVIELENS_DIR, 'users.dat')


Out[28]:
'../pydata-book/ch02/movielens/users.dat'

In [34]:
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table(os.path.join(MOVIELENS_DIR, 'users.dat'), sep='::', header=None,
  names=unames)

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table(os.path.join(MOVIELENS_DIR, 'ratings.dat'), sep='::', header=None,
  names=rnames)

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table(os.path.join(MOVIELENS_DIR, 'movies.dat'), sep='::', header=None,names=mnames, encoding='iso-8859-1')
movies[:5]


Out[34]:
movie_id title genres
0 1 Toy Story (1995) Animation|Children's|Comedy
1 2 Jumanji (1995) Adventure|Children's|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama
4 5 Father of the Bride Part II (1995) Comedy

5 rows × 3 columns


In [35]:
movies = pd.read_table(os.path.join(MOVIELENS_DIR, 'movies.dat'), sep='::', header=None,names=mnames)
movies[:5]


Out[35]:
movie_id title genres
0 1 Toy Story (1995) Animation|Children's|Comedy
1 2 Jumanji (1995) Adventure|Children's|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama
4 5 Father of the Bride Part II (1995) Comedy

5 rows × 3 columns


In [36]:
import traceback

try:
    movies[:100]
except:
    traceback.print_exc()

In [15]:
# explicit encoding of movies file

import pandas as pd
import codecs


unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table(os.path.join(MOVIELENS_DIR, 'users.dat'), sep='::', header=None,
  names=unames)

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table(os.path.join(MOVIELENS_DIR, 'ratings.dat'), sep='::', header=None,
  names=rnames)


movies_file = codecs.open(os.path.join(MOVIELENS_DIR, 'movies.dat'), encoding='iso-8859-1')

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table(movies_file, sep='::', header=None,
  names=mnames)

In [37]:
movies[:5]


Out[37]:
movie_id title genres
0 1 Toy Story (1995) Animation|Children's|Comedy
1 2 Jumanji (1995) Adventure|Children's|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama
4 5 Father of the Bride Part II (1995) Comedy

5 rows × 3 columns


In [17]:
users[:5]


Out[17]:
user_id gender age occupation zip
0 1 F 1 10 48067
1 2 M 56 16 70072
2 3 M 25 15 55117
3 4 M 45 7 02460
4 5 M 25 20 55455

5 rows × 5 columns


In [38]:
movies[:5]


Out[38]:
movie_id title genres
0 1 Toy Story (1995) Animation|Children's|Comedy
1 2 Jumanji (1995) Adventure|Children's|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama
4 5 Father of the Bride Part II (1995) Comedy

5 rows × 3 columns

hmmm...age 1? Where to learn about occupation types? We have zip data...so it'd be fun to map. Might be useful to look at distribution of age, gender, and zip.

check on encoding of the movie files


In [19]:
import codecs
from itertools import islice

fname = os.path.join(MOVIELENS_DIR, "movies.dat")

f = codecs.open(fname, encoding='iso-8859-1')
for line in islice(f,100):
    print line


1::Toy Story (1995)::Animation|Children's|Comedy

2::Jumanji (1995)::Adventure|Children's|Fantasy

3::Grumpier Old Men (1995)::Comedy|Romance

4::Waiting to Exhale (1995)::Comedy|Drama

5::Father of the Bride Part II (1995)::Comedy

6::Heat (1995)::Action|Crime|Thriller

7::Sabrina (1995)::Comedy|Romance

8::Tom and Huck (1995)::Adventure|Children's

9::Sudden Death (1995)::Action

10::GoldenEye (1995)::Action|Adventure|Thriller

11::American President, The (1995)::Comedy|Drama|Romance

12::Dracula: Dead and Loving It (1995)::Comedy|Horror

13::Balto (1995)::Animation|Children's

14::Nixon (1995)::Drama

15::Cutthroat Island (1995)::Action|Adventure|Romance

16::Casino (1995)::Drama|Thriller

17::Sense and Sensibility (1995)::Drama|Romance

18::Four Rooms (1995)::Thriller

19::Ace Ventura: When Nature Calls (1995)::Comedy

20::Money Train (1995)::Action

21::Get Shorty (1995)::Action|Comedy|Drama

22::Copycat (1995)::Crime|Drama|Thriller

23::Assassins (1995)::Thriller

24::Powder (1995)::Drama|Sci-Fi

25::Leaving Las Vegas (1995)::Drama|Romance

26::Othello (1995)::Drama

27::Now and Then (1995)::Drama

28::Persuasion (1995)::Romance

29::City of Lost Children, The (1995)::Adventure|Sci-Fi

30::Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)::Drama

31::Dangerous Minds (1995)::Drama

32::Twelve Monkeys (1995)::Drama|Sci-Fi

33::Wings of Courage (1995)::Adventure|Romance

34::Babe (1995)::Children's|Comedy|Drama

35::Carrington (1995)::Drama|Romance

36::Dead Man Walking (1995)::Drama

37::Across the Sea of Time (1995)::Documentary

38::It Takes Two (1995)::Comedy

39::Clueless (1995)::Comedy|Romance

40::Cry, the Beloved Country (1995)::Drama

41::Richard III (1995)::Drama|War

42::Dead Presidents (1995)::Action|Crime|Drama

43::Restoration (1995)::Drama

44::Mortal Kombat (1995)::Action|Adventure

45::To Die For (1995)::Comedy|Drama

46::How to Make an American Quilt (1995)::Drama|Romance

47::Seven (Se7en) (1995)::Crime|Thriller

48::Pocahontas (1995)::Animation|Children's|Musical|Romance

49::When Night Is Falling (1995)::Drama|Romance

50::Usual Suspects, The (1995)::Crime|Thriller

51::Guardian Angel (1994)::Action|Drama|Thriller

52::Mighty Aphrodite (1995)::Comedy

53::Lamerica (1994)::Drama

54::Big Green, The (1995)::Children's|Comedy

55::Georgia (1995)::Drama

56::Kids of the Round Table (1995)::Adventure|Children's|Fantasy

57::Home for the Holidays (1995)::Drama

58::Postino, Il (The Postman) (1994)::Drama|Romance

59::Confessional, The (Le Confessionnal) (1995)::Drama|Mystery

60::Indian in the Cupboard, The (1995)::Adventure|Children's|Fantasy

61::Eye for an Eye (1996)::Drama|Thriller

62::Mr. Holland's Opus (1995)::Drama

63::Don't Be a Menace to South Central While Drinking Your Juice in the Hood (1996)::Comedy

64::Two if by Sea (1996)::Comedy|Romance

65::Bio-Dome (1996)::Comedy

66::Lawnmower Man 2: Beyond Cyberspace (1996)::Sci-Fi|Thriller

67::Two Bits (1995)::Drama

68::French Twist (Gazon maudit) (1995)::Comedy|Romance

69::Friday (1995)::Comedy

70::From Dusk Till Dawn (1996)::Action|Comedy|Crime|Horror|Thriller

71::Fair Game (1995)::Action

72::Kicking and Screaming (1995)::Comedy|Drama

73::Misérables, Les (1995)::Drama|Musical

74::Bed of Roses (1996)::Drama|Romance

75::Big Bully (1996)::Comedy|Drama

76::Screamers (1995)::Sci-Fi|Thriller

77::Nico Icon (1995)::Documentary

78::Crossing Guard, The (1995)::Drama

79::Juror, The (1996)::Drama|Thriller

80::White Balloon, The (Badkonake Sefid ) (1995)::Drama

81::Things to Do in Denver when You're Dead (1995)::Crime|Drama|Romance

82::Antonia's Line (Antonia) (1995)::Drama

83::Once Upon a Time... When We Were Colored (1995)::Drama

84::Last Summer in the Hamptons (1995)::Comedy|Drama

85::Angels and Insects (1995)::Drama|Romance

86::White Squall (1996)::Adventure|Drama

87::Dunston Checks In (1996)::Children's|Comedy

88::Black Sheep (1996)::Comedy

89::Nick of Time (1995)::Action|Thriller

90::Journey of August King, The (1995)::Drama

92::Mary Reilly (1996)::Drama|Thriller

93::Vampire in Brooklyn (1995)::Comedy|Romance

94::Beautiful Girls (1996)::Drama

95::Broken Arrow (1996)::Action|Thriller

96::In the Bleak Midwinter (1995)::Comedy

97::Hate (Haine, La) (1995)::Drama

98::Shopping (1994)::Action|Thriller

99::Heidi Fleiss: Hollywood Madam (1995)::Documentary

100::City Hall (1996)::Drama|Thriller

101::Bottle Rocket (1996)::Comedy


In [20]:
import pandas as pd
import codecs

movies_file = codecs.open(os.path.join(MOVIELENS_DIR, 'movies.dat'), encoding='iso-8859-1')

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table(movies_file, sep='::', header=None,
  names=mnames)

print (movies.ix[72]['title'] == u'Misérables, Les (1995)')


True

Baby names dataset


In [21]:
import pandas as pd
import codecs

names1880_file = codecs.open(os.path.join(NAMES_DIR,'yob2010.txt'), encoding='iso-8859-1')
names1880 = pd.read_csv(names1880_file, names=['name', 'sex', 'births'])

names1880


Out[21]:
name sex births
0 Isabella F 22731
1 Sophia F 20477
2 Emma F 17179
3 Olivia F 16860
4 Ava F 15300
5 Emily F 14172
6 Abigail F 14124
7 Madison F 13070
8 Chloe F 11656
9 Mia F 10541
10 Addison F 10253
11 Elizabeth F 10135
12 Ella F 9796
13 Natalie F 8715
14 Samantha F 8334
15 Alexis F 8181
16 Lily F 7900
17 Grace F 7598
18 Hailey F 6969
19 Alyssa F 6934
20 Lillian F 6898
21 Hannah F 6891
22 Avery F 6633
23 Leah F 6474
24 Nevaeh F 6345
25 Sofia F 6282
26 Ashley F 6276
27 Anna F 6242
28 Brianna F 6224
29 Sarah F 6223
30 Zoe F 6200
31 Victoria F 6174
32 Gabriella F 6101
33 Brooklyn F 6068
34 Kaylee F 6049
35 Taylor F 5858
36 Layla F 5853
37 Allison F 5829
38 Evelyn F 5802
39 Riley F 5506
40 Amelia F 5417
41 Khloe F 5369
42 Makayla F 5348
43 Aubrey F 5337
44 Charlotte F 5314
45 Savannah F 5311
46 Zoey F 5164
47 Bella F 5082
48 Kayla F 5021
49 Alexa F 5012
50 Peyton F 4937
51 Audrey F 4910
52 Claire F 4885
53 Arianna F 4797
54 Julia F 4640
55 Aaliyah F 4628
56 Kylie F 4564
57 Lauren F 4429
58 Sophie F 4380
59 Sydney F 4301
... ... ...

33838 rows × 3 columns


In [22]:
# sort by name

names1880.sort('births', ascending=False)[:10]


Out[22]:
name sex births
0 Isabella F 22731
19698 Jacob M 21875
1 Sophia F 20477
19699 Ethan M 17866
2 Emma F 17179
19700 Michael M 17133
19701 Jayden M 17030
19702 William M 16870
3 Olivia F 16860
19703 Alexander M 16634

10 rows × 3 columns


In [23]:
names1880[names1880.sex == 'F'].sort('births', ascending=False)[:10]


Out[23]:
name sex births
0 Isabella F 22731
1 Sophia F 20477
2 Emma F 17179
3 Olivia F 16860
4 Ava F 15300
5 Emily F 14172
6 Abigail F 14124
7 Madison F 13070
8 Chloe F 11656
9 Mia F 10541

10 rows × 3 columns


In [24]:
names1880['births'].plot()


Out[24]:
<matplotlib.axes.AxesSubplot at 0x133fd410>

In [25]:
names1880['births'].count()


Out[25]:
33838

In [25]: