Chapters 2 and 3 of PDA (Python for Data Analysis)
In [1]:
%pylab --no-import-all inline
In [2]:
import matplotlib.pyplot as plt
import numpy as np
from pylab import figure, show
from pandas import DataFrame, Series
import pandas as pd
To make it more practical for me to look at your homework, I'm again going to assume a relative placement of files. I placed the files from
https://github.com/pydata/pydata-book
in a local directory, which in my case is "/Users/raymondyee/D/Document/Working_with_Open_Data/pydata-book/",
and then symbolically linked (ln -s) to pydata-book from the root directory of the working-open-data folder, i.e., on OS X:
cd /Users/raymondyee/D/Document/Working_with_Open_Data/working-open-data
ln -s /Users/raymondyee/D/Document/Working_with_Open_Data/pydata-book/ pydata-book
That way the files from the pydata-book repository look like they sit in the working-open-data directory -- without having to actually copy the files.
With this arrangement, I should then be able to drop your notebook into my own notebooks directory and run it without having to mess around with paths.
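The same kind of link can be created from Python. This is only a minimal sketch, assuming a filesystem that allows symlinks (on Windows this may need extra privileges). It works in throwaway temporary directories, and the directory and file names are placeholders rather than the real repository contents.

```python
import os
import tempfile

# Throwaway directories standing in for the two checkouts (placeholder names).
base = tempfile.mkdtemp()
book_dir = os.path.join(base, "pydata-book")
work_dir = os.path.join(base, "working-open-data")
os.makedirs(os.path.join(book_dir, "ch02"))
os.makedirs(work_dir)

# A stand-in data file inside the "book" checkout.
with open(os.path.join(book_dir, "ch02", "example.txt"), "w") as f:
    f.write("hello\n")

# Equivalent of running `ln -s <book_dir> pydata-book` inside work_dir.
link_path = os.path.join(work_dir, "pydata-book")
os.symlink(book_dir, link_path)

# The file is now reachable through the link without having been copied.
via_link = os.path.join(link_path, "ch02", "example.txt")
link_is_symlink = os.path.islink(link_path)
file_visible = os.path.exists(via_link)
print(link_is_symlink, file_visible)
```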
In [3]:
import os
USAGOV_BITLY_PATH = os.path.join(os.pardir, "pydata-book", "ch02", "usagov_bitly_data2012-03-16-1331923249.txt")
MOVIELENS_DIR = os.path.join(os.pardir, "pydata-book", "ch02", "movielens")
NAMES_DIR = os.path.join(os.pardir, "pydata-book", "ch02", "names")
assert os.path.exists(USAGOV_BITLY_PATH)
assert os.path.exists(MOVIELENS_DIR)
assert os.path.exists(NAMES_DIR)
Please make sure the above assertions pass before continuing.
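A bare assert does not say which path is missing. One possible, more talkative check is sketched below; `missing_paths` is a hypothetical helper that is not part of the original notebook, and the example paths are chosen only for demonstration.

```python
import os

def missing_paths(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]

# One path that exists (the current directory) and one deliberately bogus
# path. In the notebook, pass USAGOV_BITLY_PATH, MOVIELENS_DIR and NAMES_DIR.
example = [os.curdir, os.path.join(os.curdir, "no-such-dir-for-demo")]
problems = missing_paths(example)
for p in problems:
    print("missing: " + os.path.abspath(p))
```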
(PfDA, p. 18)
What's in the data file?
In 2011, URL shortening service bit.ly partnered with the United States government website usa.gov to provide a feed of anonymous data gathered from users who shorten links ending with .gov or .mil.
Hourly archive of data: http://bitly.measuredvoice.com/bitly_archive/?C=M;O=D
In [4]:
open(USAGOV_BITLY_PATH).readline()
Out[4]:
In [5]:
import json
records = [json.loads(line) for line in open(USAGOV_BITLY_PATH)] # list comprehension
Recall what records is:
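As a reminder of the shape, here is the same list-comprehension pattern applied to two in-memory lines. The sample lines are made up for illustration, so their keys and values are assumptions rather than real records from the feed.

```python
import json

# Invented stand-ins for lines of the data file: one JSON document per line.
sample_lines = [
    '{"tz": "America/New_York", "u": "http://example.gov/a"}',
    '{"tz": "", "u": "http://example.mil/b"}',
]

# Each line parses to a dict, so the result is a list of dicts.
sample_records = [json.loads(line) for line in sample_lines]

first = sample_records[0]
print(type(sample_records).__name__, len(sample_records))
print(first["tz"])
```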
In [6]:
len(records)
Out[6]:
In [7]:
# list of dict -> DataFrame
frame = DataFrame(records)
frame.head()
Out[7]:
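When the dicts do not all share the same keys, the DataFrame constructor uses the union of the keys as columns and fills the gaps with NaN. A small sketch with made-up records (the keys and values are illustrative only):

```python
import pandas as pd

toy_records = [
    {"tz": "America/New_York", "c": "US"},
    {"tz": "Europe/London"},  # no "c" key in this record
]

toy_frame = pd.DataFrame(toy_records)

# Columns are the union of keys; the missing "c" shows up as NaN.
print(toy_frame)
missing_c = toy_frame["c"].isnull().tolist()
```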
PDA p. 26
See http://www.grouplens.org/node/73. There's also a 10 million ratings dataset, which would be interesting to try as a test of how well the IPython notebook scales on a laptop.
In [8]:
# let's take a look at the data
# my local dir: /Users/raymondyee/D/Document/Working_with_Open_Data/pydata-book/ch02/movielens
!head $MOVIELENS_DIR/movies.dat
In [26]:
# how many movies?
!wc $MOVIELENS_DIR/movies.dat
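The `!wc` call relies on a Unix shell. A portable way to get the line count from Python is sketched below on a temporary stand-in file with invented rows; `count_lines` is a hypothetical helper, and in the notebook it would be pointed at movies.dat under MOVIELENS_DIR.

```python
import os
import tempfile

def count_lines(path):
    """Count lines by iterating over the file in binary mode (no decoding needed)."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)

# Stand-in file with three '::'-separated rows (made-up content).
fd, tmp_path = tempfile.mkstemp(suffix=".dat")
with os.fdopen(fd, "wb") as f:
    f.write(b"1::A (1990)::Drama\n2::B (1991)::Comedy\n3::C (1992)::Action\n")

n_lines = count_lines(tmp_path)
os.remove(tmp_path)
print(n_lines)
```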
In [10]:
!head $MOVIELENS_DIR/users.dat
In [11]:
!head $MOVIELENS_DIR/ratings.dat
In [28]:
import pandas as pd
import os
os.path.join(MOVIELENS_DIR, 'users.dat')
Out[28]:
In [34]:
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table(os.path.join(MOVIELENS_DIR, 'users.dat'), sep='::', header=None,
                      names=unames)
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table(os.path.join(MOVIELENS_DIR, 'ratings.dat'), sep='::', header=None,
                        names=rnames)
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table(os.path.join(MOVIELENS_DIR, 'movies.dat'), sep='::', header=None,
                       names=mnames, encoding='iso-8859-1')
movies[:5]
Out[34]:
In [35]:
# same read, but without an explicit encoding, for comparison
movies = pd.read_table(os.path.join(MOVIELENS_DIR, 'movies.dat'), sep='::', header=None,
                       names=mnames)
movies[:5]
Out[35]:
In [36]:
import traceback
try:
    # displaying more rows may hit titles with non-ASCII characters
    movies[:100]
except Exception:
    traceback.print_exc()
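Whatever the exact failure point in a given pandas version, the underlying issue is how the bytes of accented titles are decoded. The sketch below is independent of pandas and uses one title from the file: 'é' is a single byte (0xE9) in ISO-8859-1, and that byte followed by an ASCII letter is not valid UTF-8.

```python
# Escape sequence used so the source itself stays pure ASCII.
title = u"Mis\xe9rables, Les (1995)"

# Encode the way the file stores it (ISO-8859-1, also known as Latin-1).
raw = title.encode("iso-8859-1")

# Decoding with the matching codec round-trips cleanly.
round_trip = raw.decode("iso-8859-1")

# Decoding the same bytes as UTF-8 fails: 0xE9 announces a multi-byte
# sequence, but the next byte ('r') is not a continuation byte.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

print(round_trip == title, utf8_failed)
```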
In [15]:
# explicit encoding of movies file
import pandas as pd
import codecs
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table(os.path.join(MOVIELENS_DIR, 'users.dat'), sep='::', header=None,
                      names=unames)
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table(os.path.join(MOVIELENS_DIR, 'ratings.dat'), sep='::', header=None,
                        names=rnames)
movies_file = codecs.open(os.path.join(MOVIELENS_DIR, 'movies.dat'), encoding='iso-8859-1')
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table(movies_file, sep='::', header=None,
                       names=mnames)
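A separator longer than one character, such as '::', is interpreted by pandas as a regular expression, which requires the slower Python parsing engine. In recent pandas releases, passing engine='python' explicitly avoids the fallback warning. The sketch below uses read_csv with sep, on in-memory rows that are illustrative stand-ins for movies.dat.

```python
import io
import pandas as pd

# Two stand-in rows in the movie_id::title::genres layout.
sample = (u"1::Toy Story (1995)::Animation|Children's|Comedy\n"
          u"2::Jumanji (1995)::Adventure|Children's|Fantasy\n")

mnames = ["movie_id", "title", "genres"]
toy_movies = pd.read_csv(io.StringIO(sample), sep="::", header=None,
                         names=mnames, engine="python")
print(toy_movies)
```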
In [37]:
movies[:5]
Out[37]:
In [17]:
users[:5]
Out[17]:
In [38]:
movies[:5]
Out[38]:
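Hmm, age 1? In the MovieLens 1M data, age is a coded bracket rather than a literal age (per the dataset README, 1 means "Under 18"), and the same README lists the occupation codes. We also have zip data, so it'd be fun to map. It might be useful to look at the distributions of age, gender, and zip.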
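A first pass at those distributions could use value_counts. This is a sketch on a tiny made-up users table: the rows are invented, and only the column names follow unames above.

```python
import pandas as pd

toy_users = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "gender": ["F", "M", "M", "M"],
    "age": [1, 56, 25, 25],
    "zip": ["48067", "70072", "55117", "02460"],
})

# Frequency of each distinct value, most common first.
gender_counts = toy_users["gender"].value_counts()
age_counts = toy_users["age"].value_counts()
print(gender_counts)
print(age_counts)
```

The same call on the zip column would give counts per zip code, which is a natural starting point for a map.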
In [19]:
import codecs
from itertools import islice
fname = os.path.join(MOVIELENS_DIR, "movies.dat")
f = codecs.open(fname, encoding='iso-8859-1')
for line in islice(f, 100):
    print(line)
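islice stops after the requested number of items and leaves the rest of the file unread. Each line also keeps its trailing newline, so printing it unchanged leaves blank lines between rows; stripping the newline avoids that. A sketch on an in-memory file with invented contents:

```python
import io
from itertools import islice

fake_file = io.StringIO(u"line 1\nline 2\nline 3\nline 4\n")

# Take only the first two lines; the remainder stays unread.
head = [line.rstrip(u"\n") for line in islice(fake_file, 2)]
rest = fake_file.read()
print(head)
```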
In [20]:
import pandas as pd
import codecs
movies_file = codecs.open(os.path.join(MOVIELENS_DIR, 'movies.dat'), encoding='iso-8859-1')
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table(movies_file, sep='::', header=None,
                       names=mnames)
# .loc replaces the older .ix indexer, which newer pandas releases removed
print(movies.loc[72, 'title'] == u'Misérables, Les (1995)')
In [21]:
import pandas as pd
import codecs
# NOTE: despite the variable name, the file loaded here is yob2010.txt
names1880_file = codecs.open(os.path.join(NAMES_DIR, 'yob2010.txt'), encoding='iso-8859-1')
names1880 = pd.read_csv(names1880_file, names=['name', 'sex', 'births'])
names1880
Out[21]:
In [22]:
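# sort by births, largest first
# (sort_values replaces the older .sort, which newer pandas releases removed)
names1880.sort_values('births', ascending=False)[:10]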
Out[22]:
In [23]:
# top 10 female names by births
names1880[names1880.sex == 'F'].sort_values('births', ascending=False)[:10]
Out[23]:
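The filter-then-sort pattern can be checked on a small table where the answer is obvious. The rows below are invented. The sketch uses sort_values, the method name in current pandas (older releases spelled it sort).

```python
import pandas as pd

toy_names = pd.DataFrame({
    "name": ["Ava", "Liam", "Mia", "Noah"],
    "sex": ["F", "M", "F", "M"],
    "births": [120, 300, 250, 90],
})

# The boolean mask keeps rows where sex == 'F'; then order by births, descending.
top_girls = toy_names[toy_names.sex == "F"].sort_values("births", ascending=False)
print(top_girls)
```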
In [24]:
names1880['births'].plot()
Out[24]:
In [25]:
names1880['births'].count()
Out[25]: