Homework 2


Follow the README in Brandon Rhodes' PyCon Pandas Tutorial. The steps are briefly summarised below.

  1. Clone the pycon-pandas-tutorial repository.
  2. Download the data: actors.list.gz, actresses.list.gz, genres.list.gz, and release-dates.list.gz.
  3. Move the data files into the build folder of the repository.
  4. Run the build script from the terminal.


  • Do not clone a repository inside of another repository.
  • Do not extract the ".gz" files after downloading. Move them as is.
  • Please contact me if you're having issues downloading the files from the links provided in the tutorial's README. Some students run into authentication issues.
  • Run terminal commands in the terminal, not in Spyder or a Jupyter notebook. Technical documentation often shows terminal commands with a leading "$ ", signifying that they are to be run in the terminal. In addition, any command that starts with python is generally meant to be run in the terminal.
  • Original source for IMDB Data: http://www.imdb.com/interfaces
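
Before running the build script, it can help to confirm that the four downloads are in place and still compressed. The following is a minimal sketch, not part of the tutorial: the function name check_build_folder is hypothetical, and it assumes the build folder is reachable as "build" from the repository root. It only checks that each file exists and starts with the two-byte gzip signature.

```python
from pathlib import Path

# File names taken from step 2 of the list above.
EXPECTED = [
    "actors.list.gz",
    "actresses.list.gz",
    "genres.list.gz",
    "release-dates.list.gz",
]

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip file


def check_build_folder(build_dir):
    """Return a list of problems found; an empty list means nothing was flagged."""
    build_dir = Path(build_dir)
    problems = []
    for name in EXPECTED:
        path = build_dir / name
        if not path.is_file():
            problems.append(f"missing: {name}")
            continue
        # An extracted or renamed file will not start with the gzip signature.
        with path.open("rb") as fh:
            if fh.read(2) != GZIP_MAGIC:
                problems.append(f"not gzip-compressed: {name}")
    return problems


if __name__ == "__main__":
    # Assumption: this is run from the root of the cloned repository.
    for problem in check_build_folder("build"):
        print(problem)
```

If nothing is printed, all four files were found and look like gzip archives. This does not verify that the downloads are complete.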

If you missed class, please watch Brandon's Pandas Tutorial Video from PyCon 2015 in Montréal.

Brandon Rhodes (Website | Twitter | Github | StackOverflow)

Exercise 1

Complete Exercises-1.ipynb.


After completing the exercise, copy it into the workspace folder of the DAT-DC-12 repository. If you're having trouble with the command line, just copy and paste the file from one folder into the other. Commit and push the code.

[OPTIONAL] You'll notice that running the Exercises-1.ipynb notebook from the workspace folder results in an error, because it can no longer find the CSV files it needs. To fix this, copy the three CSV data files (titles.csv, release_dates.csv, and cast.csv) into the data folder of the class repository. Then, in Exercises-1.ipynb, change the path in the line that reads the CSV from data/titles.csv to ../data/titles.csv. The leading .. points one folder up from workspace, so the notebook can now be run from the workspace folder.
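
To see why the .. matters, here is a small sketch of how the two relative paths resolve when the notebook runs from the workspace folder. The helper name resolve_from and the /home/student location are made up for illustration. The resolution is purely textual, so no files need to exist.

```python
import os.path


def resolve_from(working_dir, relative_path):
    """Return the path that relative_path refers to when run from working_dir."""
    return os.path.normpath(os.path.join(working_dir, relative_path))


if __name__ == "__main__":
    # Hypothetical checkout location, used only for illustration.
    workspace = "/home/student/DAT-DC-12/workspace"

    # Original notebook path: looks for a data folder inside workspace.
    print(resolve_from(workspace, "data/titles.csv"))

    # Edited notebook path: goes up one folder first, then into data.
    print(resolve_from(workspace, "../data/titles.csv"))
```

The first path lands in workspace/data, which does not hold the CSV files. The second lands in the data folder of the class repository, where the CSV files were copied.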

My current working directory is my ~/Development folder. Within this folder I have my DAT-DC-12 folder and my pycon-pandas-tutorial folder.

➜   Development  pwd
➜   Development  cd DAT-DC-12
➜   DAT-DC-12 [master]  pwd
➜   DAT-DC-12 [master]  cd ..
➜   Development  cd pycon-pandas-tutorial
➜   pycon-pandas-tutorial [master]  pwd
➜   pycon-pandas-tutorial [master]  cd ..

Copy the three CSV files (titles.csv, release_dates.csv, and cast.csv) from pycon-pandas-tutorial/data/ to DAT-DC-12/data/.

➜   Development  cp pycon-pandas-tutorial/data/titles.csv DAT-DC-12/data
'pycon-pandas-tutorial/data/titles.csv' -> 'DAT-DC-12/data/titles.csv'
➜   Development  cp pycon-pandas-tutorial/data/release_dates.csv DAT-DC-12/data
'pycon-pandas-tutorial/data/release_dates.csv' -> 'DAT-DC-12/data/release_dates.csv'
➜   Development  cp pycon-pandas-tutorial/data/cast.csv DAT-DC-12/data
'pycon-pandas-tutorial/data/cast.csv' -> 'DAT-DC-12/data/cast.csv'
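
If the command line is giving you trouble, the same copy can be sketched in Python with the standard library. This is an illustrative alternative to the cp commands above, not a required step, and the function name copy_csvs is hypothetical. Unlike cp, this sketch creates the destination folder if it is missing.

```python
import shutil
from pathlib import Path

# The three files named in the instructions above.
CSV_NAMES = ["titles.csv", "release_dates.csv", "cast.csv"]


def copy_csvs(src_dir, dst_dir):
    """Copy the three CSV files from src_dir to dst_dir and return the new paths."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    # Assumption: creating the folder is acceptable if it does not exist yet.
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in CSV_NAMES:
        target = dst_dir / name
        shutil.copy2(src_dir / name, target)  # copies contents and timestamps
        copied.append(target)
    return copied
```

From the Development folder, calling copy_csvs("pycon-pandas-tutorial/data", "DAT-DC-12/data") would mirror the three cp commands.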

These CSV files are quite large, so we want to ensure that we don't commit them to the repository. This is why you may notice that they are gitignored.