Week 1

Aims:

  • To get all the bits and pieces set up
  • Enter a kaggle competition

Steps

0. Install Anaconda

(Contains jupyter, where we type notes, code and display things)

1. Install Git

(to get project files for the course)

2. Set up Amazon Web Services - AWS

(We will use their big computers)

3. If using windows, set up Cygwin

(allows windows computer to interact with a unix computer such as AWS)

4. Install AWS command line tools

(so you can command AWS from your computer)

5. Set up and configure AWS

(tell amazon what we want)

6. Log in to your AWS instance

(first-time login)

7. Daily AWS use

(bare essentials to get into your instance)

8. Get the course files onto the EC2 instance

(Our files won't be stored on our personal computers)

9. Get the data!

(Set up Kaggle)

10. Check out the code for our basic deep network

(Poke around and run the code)

11. Rapid fire

(Get practical, submit a squid or even an old scab, perfect it later)

0. Install Anaconda

Explanation:

  • Data science super-program
  • Contains jupyter which is a notebook where you can make notes, code and display things
    • Open notebook using terminal
      jupyter notebook
  • Can also use it to install many other programs in the future
    • To install a particular program
      conda install 'program name'
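    • For example, to install a particular package (pandas here is just an arbitrary example name)
      conda install pandas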

Steps:

  1. https://www.continuum.io/downloads
  2. Install python 2.7 using the graphical installer
  3. Open a notebook to start working
    jupyter notebook

Different versions of python

If you have two versions of python installed (e.g. from two anaconda installs), you can reorder the folders that the computer searches for python in, to pick and choose which one runs by default. To check which python you're currently getting:

python --version

To view the order that folders are searched for

  1. Go to your 'environment'
    env
  2. look for the part that says "PATH=/blah/blah:/gah:/wah/ma:/trash"
  3. This means it looks for python first in /blah/blah then /gah then /wah/ma then /trash
  4. To change the order you can go have a look what's in the bash profile
    vim ~/.bash_profile
  5. vim is a text editor (an improved version of vi, the original terminal text editor). It uses fancy commands to be efficient
  6. The bash profile is a thing that adds bits to your path when a new terminal is opened
  7. Play around with the order of the things in the bash profile using vim commands
  8. e.g. yy copies (yanks) a line, p pastes it below the cursor, dd deletes (cuts) a line, :wq writes and quits
  9. Then once you're back out of there, execute the file to apply the changes
    source ~/.bash_profile
  10. Then check what happened to your path in environment
    env
  11. If you mucked it up, copy the bit of the path that you want to keep e.g. only the folders /blah/blah and /gah, not /wah/ma or /trash
  12. then type
    export PATH=/blah/blah:/gah
  13. check your path again
    env
  14. then have a look at the bash profile
    vim ~/.bash_profile
  15. If you like what it says e.g. the order of the exports in there, write and quit
    :wq
  16. then execute the file
    source ~/.bash_profile
  17. check path
    env
  18. if it's all good then exit terminal window and reopen to make sure it's permanent
    env
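
A concrete sketch of what the PATH line in ~/.bash_profile might look like (the folder paths here are made up - yours will differ); the folder listed first gets searched first:

    # hypothetical example: anaconda's folder comes before the system folders,
    # so its python wins
    export PATH=/Users/you/anaconda2/bin:/usr/local/bin:/usr/bin:$PATH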

TIP: if you ever want to interrupt some process that the terminal is running -> ctrl+c


1. Install Git and make a folder for cheat-sheets

Explanation:

  • Git (as in get) gets things
  • It's the way most people download projects and collaborate on things.
    • It allows people to work on the same project at the same time and basically keeps track of what changes have been made and then makes sure there are no clashes.
  • We will use it to 'clone' the project files from the web to our computer
  • E.g.
    git clone www.projectlocation.com
  • Because we're going to be using different programs and we just want the basic bits from each, it'll be useful to have a folder with cheat-sheets for each of the different programs

How to get git: Summary: https://help.github.com/articles/set-up-git/

  1. Download here: https://git-scm.com/downloads
  2. If your computer (Mac) asks you to install the Xcode developer tools, say OK
    • If git is already installed, you can fetch its latest source (building it yourself is a separate step) with
      git clone https://github.com/git/git

Make a cheat-sheet dump!

  • Every time you install a new tool, google 'tool name cheat sheet pdf' and lump it into a folder with all your other cheat sheets
  • We'll use WWW Get (wget) to grab things using their URL
  1. Install wget and tree. Windows skip to step 2 (we will make sure they are installed with cygwin)
    1. We'll use homebrew (another package manager - like a non-science version of anaconda) to install wget.
      /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
    2. use homebrew to install wget
      brew install wget
    3. use homebrew to also install tree
      brew install tree
    4. If you want tree on cygwin, you can re-run the cygwin installer and make sure 'tree' is included along with 'wget'
  2. Make a folder where you want to keep computer and data science project notes
    • My setup:
      1. Went to my main directory
        cd
      2. Made a folder/directory called 'proj'. When naming things, spaces make it slow and difficult in terminal because spaces are used to separate commands, so use an underscore (_) instead of a space
        mkdir proj
      3. Went into 'proj' and made a folder for deep learning, 'dl'
        cd proj
        mkdir dl
      4. Went into 'dl'
        cd dl
      5. Cheat-sheet folder
        mkdir cheat_sheets
      6. Go there
        cd cheat_sheets
  3. Go to a cheat sheet's website in your browser and copy the url (you can also right click a link and copy the link address)
  4. Use wget to download the file into your cheat-sheets folder
    wget url
    • bash sheet
      wget -O bash_sheet.pdf http://www.lsv.ens-cachan.fr/~fthire/teaching/2016-2017/programmation-1/cheatsheet/shell.pdf
    • shorter bash sheet
      wget -O bash_quick_sheet.pdf http://sites.tufts.edu/cbi/files/2013/01/linux_cheat_sheet.pdf
    • conda sheet
      wget -O conda_cheat.pdf https://conda.io/docs/_downloads/conda-cheatsheet.pdf
    • git sheet
      wget -O git_sheet.pdf https://services.github.com/on-demand/downloads/github-git-cheat-sheet.pdf
    • jupyter sheet
      wget -O jupyter_sheet.pdf https://www.cheatography.com/weidadeyue/cheat-sheets/jupyter-notebook/pdf/
    • python sheet
      wget -O python_sheet.pdf https://github.com/ehmatthes/pcc/releases/download/v1.0.0/beginners_python_cheat_sheet_pcc_all.pdf
    • tmux sheet
      wget -O tmux_sheet.pdf http://alvinalexander.com/downloads/linux/tmux-cheat-sheet.pdf
    • aws special commands (alias commands) they made for the course
      wget -O aws_alias_sheet.html http://wiki.fast.ai/index.php/Aws-alias
    • kaggle sheet
      wget -O kaggle_sheet.html https://github.com/floydwch/kaggle-cli
    • tree (viewing file structure)
      wget -O tree_sheet.html http://mama.indstate.edu/users/ice/tree/tree.1.html
  5. To open a file from the command line
    open blah_blah.pdf

A tip: be careful to always know what folder you are in (pwd) when removing files (rm 'file') because you might accidentally delete something important


2. Set up Amazon Web Services (AWS)

Explanation:

  • Deep learning involves performing multiple simple operations
  • Running calculations on your computer's CPU (central processing unit) is possible, but would take aaaages for all but the simplest of problems
  • Deep learning has been revolutionised by sending computations to GPUs (Graphics Processing Units) which can do many simple computations simultaneously
  • Sending these computations to the GPU currently requires a GPU from NVIDIA (the deep learning libraries use NVIDIA's CUDA platform)
  • Using amazon's NVIDIA GPUs is cheaper and easier than buying your own and setting it up

Video: https://www.youtube.com/watch?v=8rjRfW4JM2I

Notes: http://wiki.fast.ai/index.php/AWS_install

There will be two types of AWS instance:

  • T2: shitty and free (for tinkering, testing and prototyping) (setup_t2.sh)
  • P2: beefy and $0.90 per hour (for training the models) (setup_p2.sh)

Install the free version (T2) of AWS

  1. Set up AWS account
  2. Get permission to use the P2 version of AWS
    • aws.amazon.com/contact-us/ec2-request
    • limit type EC2 instances
    • region: 'US west (oregon)'
    • primary instance type select 'p2.xlarge'
    • limit 'instance limit'
    • new limit value '1'
    • use case description 'fast.ai MOOC'
    • 'submit'
  3. Download folder from the course containing AWS setup files
    1. Open terminal (mac) or command line (windows)
    2. Use change directory (cd) to navigate to where you want the folder to go
      • list files and folders in current folder
        ls
      • go to a folder
        cd 'folder name'
      • go back up a folder level
        cd ..
      • find which folder you are currently in
        pwd
    3. Get the folder with all the AWS files they made for the course
      git clone https://github.com/fastai/courses
      • This copies their folder to the folder you are currently in

3. If using windows, install Cygwin

Explanation:

  • Unix is a family of operating systems, including linux, which is what the AWS computers use
  • To speak to a unix computer, you type into a window called bash
  • Macs have a bash window already (the Terminal), but Windows doesn't
  • Cygwin is a bash window for Windows; we'll use it to talk to our AWS computer

Download and install:

  1. Get the installer from https://www.cygwin.com/ and run it
  2. In the package-selection step, include 'wget' (and 'tree') as mentioned in the cheat-sheet section above

4. Install AWS command line tools

Explanation:

  • AWS instances are computers that are set up to receive specific instructions about how to run
  • We will control them from our computer, e.g. start, stop
  • We need to install the tools on our computer to do that

Steps:

  1. Make sure pip is up to date
    • pip ('Pip Installs Packages') is a tool we use to get and install other tools/packages from the internet
      pip install --upgrade pip
  2. Install the AWS command line interface
    pip install awscli
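
A quick sanity check that the install worked - ask the CLI to report its version:

    aws --version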

5. Set up and configure AWS

Explanation:

  • Amazon remembers what computer setup you have and keeps track of things by giving you a key
  • Need to make sure we use our keys when we access AWS
  • Steps 1-3 below are only for first-time setup. For day-to-day use, skip ahead to section 7 (Daily AWS use)
  • Names:
    • AWS: amazon web service (amazon allows access to their computers)
    • EC2: elastic compute cloud (amazon platform where you can get different computer setups)
    • AMI: amazon machine image (a specific setup that contains settings and folders. The course creators set up our AMI with anaconda and some course notes already in it. So handy. It is a virtual machine - it runs on a big server but you treat it like a desktop computer. Elastic because you can change which hardware you run it on - GPUs, CPUs, different sizes and capabilities)
    • Instance: A setup of an AMI on a particular type of hardware (our course AMI on a P2 is an instance, same for when it's on a T2)

I think Oregon is the best region for Australia at the moment. It's ages away, but I don't think the course AMI works on the Amazon computers located in Sydney, which would be much quicker for us to use.

Steps:

  1. Log in at https://aws.amazon.com/
  2. Create a 'user'
    1. 'Services' tab
    2. 'Security, Identity and Compliance' heading
    3. 'IAM' link
    4. 'Users' tab on left
    5. 'Add user' blue button
    6. Enter your name
    7. Tick 'programmatic access' and 'AWS Management Console Access'
    8. Make up password
    9. Uncheck 'require password reset'
    10. 'Next'
    11. Click 'attach existing policies directly'
    12. Choose 'AdministratorAccess'
    13. 'Next: review'
    14. 'Create user'
    15. Save the access key ID and secret access key to a document for later
  3. Configure AWS
    1. Configure (example prompts are shown after these steps)
      aws configure
      • Enter access key ID (copy+paste)
      • Enter secret access key (copy+paste)
      • Enter region:
        us-west-2
      • Default output format
        text
    2. change to the directory where we cloned the course folder, enter the 'courses' folder
    3. change to 'setup' folder
    4. execute the setup file using bash (if you have the p2 approved, use that, otherwise use t2 for now)
      • If P2 approved
        bash setup_p2.sh
      • Otherwise
        bash setup_t2.sh
    5. Wait for the thing to finish
    6. Copy and paste the details it spits out to a word document and save for later
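
For reference, the prompts in step 3.1 look roughly like this (the two key values here are placeholders - paste in your own):

    $ aws configure
    AWS Access Key ID [None]: AKIAXXXXXXXXEXAMPLE
    AWS Secret Access Key [None]: xXxXxXxXxXxXxXxXxXxXEXAMPLEKEY
    Default region name [None]: us-west-2
    Default output format [None]: text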

6. Log in to your AWS instance for the first time

Explanation:

  • Now you've run the scripts that set the instance up, you can SSH in (log in remotely from any computer)
  • Secure Shell (SSH) works by you specifying the IP address of the computer you want to log into remotely from your bash shell (terminal on mac, cygwin on windows)
  • Anyone can attempt this, so until we change our passwords from the default (e.g. the notebook's, see below), be a bit careful

Steps:

  1. Connect to AWS
    1. Copy the connect line (starting from 'ssh...' onwards) and run
      • e.g. -> ssh...etc
      • Will be of the format:
        ssh -i /Users/yourdirectory/.ssh/aws-key-fast-ai.pem ubuntu@ec2-a-bunch-of-numbers.us-west-2.compute.amazonaws.com
    2. Type 'yes' to approve the authenticity
    3. First time only: there is a file they accidentally left in there called .bash_history which you need to delete (only the first time you log in to the server), otherwise you can't save any bash preferences (e.g. the PATH variable and other settings)
      • to list regular files
        ls
      • to list all files (including hidden ones like the bash preferences)
        ls -a
      • to remove or alter these secret files you have to specify that you are a super user, who knows what's up. The command is 'super user do' (sudo)
        sudo rm .bash_history
  2. If you are using the P2 instance, check out the details with
    nvidia-smi
  3. Start a jupyter notebook
    1. Get the AWS server to open up a port for the notebook
      jupyter notebook
    2. Wait for it to start and see what port the notebook is running at. It will print an address like http://(blahblah):PORT
    3. Go to your browser and type in the address for your instance followed by a colon and then the port number
      1. You can copy the address from the login you used (the bit after 'ubuntu@'). E.g. ec2-a-bunch-of-numbers.us-west-2.compute.amazonaws.com:PORT
      2. Or you can type the IP address from the bunch of numbers. E.g. a.bunch.of.numbers:PORT
      3. Or you can get the IP address from aws.amazon.com under 'network and security', 'elastic IPs'
    4. Enter the password: dl_course
  4. Open a workbook by clicking 'new' (right top) and selecting 'python (conda root)'
    1. Test you can add numbers. Hit: shift+enter to execute a cell
      1+2
    2. Test you can import theano, the deep learning library:
      import theano
    3. Test you can import keras, the simpler library that sits on top of theano and instructs it:
      import keras
    4. Get cracking on some code!
  5. When you've finished ya thangs, shut down your instance
    • Might be able to leave your T2 open (not sure), but the P2 will be charging you
    • Starting the P2 costs 90c even if it's multiple times within the one hour
    • Go to aws.amazon.com and navigate to 'EC2' then 'running instances'
    • Right click and hit 'stop', rather than 'terminate'
      • 'stop'
        • No longer charged money (P2)
        • Your files will be stored on the virtual hard drive
      • 'terminate'
        • No longer charged money (P2)
        • Files deleted from virtual hard drive
        • Don't use this unless you want to get rid of an instance completely
        • If you terminate, you'll need to create a new instance like before by re-running the setup script (bash setup_p2.sh or bash setup_t2.sh)
  6. See if you've been billed for anything
    1. Go to aws.amazon.com
    2. Click on your name (top right)
    3. 'my billing dashboard'
    4. Check how much storage you've used
      • I think we get 30gb per month, not exactly sure if that's downloads or just total size of your files so be mindful of downloading lots of things to your AWS instance

7. Daily AWS use

Description:

  • How to get into your EC2 ASAP

Steps:

  1. Go to the folder with the course files in it, then into the setup folder (courses/setup)
  2. Start the alias, which simplifies the AWS commands for us
    source aws-alias.sh
  3. See the list of alias commands and what they are doing behind the scenes
    alias
    aws-get-t2
    aws-get-p2
    aws-start
    aws-ip
    aws-nb
    aws-ssh
    aws-stop
  4. We've already created the instance during setup, so to log in each day, do these:
    1. get t2
      aws-get-t2
    2. start
      aws-start
      • If you have trouble at this step, go to aws.amazon.com and navigate to EC2, right click and start the instance you want. Then return to your bash window
    3. get ip
      aws-ip
    4. start secure shell
      aws-ssh
  5. Start a notebook by copying the IP address you just printed and adding :8888 to the end in your browser (n.u.m.b.e.r.s:8888)
  6. The terminal/cygwin window will turn into a notebook logger; if you want another bash shell you'll have to open another window (either a new terminal window or a new tmux pane)
    1. In the new window, navigate to your course directory and into the courses/setup folder
    2. Activate the command aliases
      source aws-alias.sh
    3. Get that window to get the t2 details
      aws-get-t2
    4. Get the IP address
      aws-ip
    5. SSH in
      aws-ssh
  7. Now you can have a notebook running to write the deep learning program, but also be able to access the instance via the bash shell to manage files on the instance
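
Putting the daily routine together - a minimal sketch, assuming you cloned the course repo into your home folder so the aliases live at ~/courses/setup:

    cd ~/courses/setup    # where aws-alias.sh lives (path assumes a home-directory clone)
    source aws-alias.sh   # load the course's shortcut commands
    aws-get-t2            # grab the t2 instance details (use aws-get-p2 for the p2)
    aws-start             # boot the instance
    aws-ip                # print its IP (add :8888 for the notebook)
    aws-ssh               # log in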

8. Get the course files onto the instance

Explanation:

  • We've set up our virtual computer (EC2), now we want our project stuff inside it

Steps:

  1. Get git on your EC2 with the linux Advanced Packaging Tool, as a SuperUser (Do)
    sudo apt-get install git
  2. Navigate on your instance to where you want the course folder to go. I'm using the home directory (/home/ubuntu). Ubuntu is the name of the linux distribution that the Amazon computers use
    cd
  3. Use git to clone the course files from the course github site (on aws)
    git clone https://github.com/fastai/courses.git
  4. install tree to inspect your folders
    sudo apt-get install tree
  5. Visualise what you got by using tree to see 'd'irectories
    tree -d

9. Get the data!

Description:

  • We'll be downloading the images onto the EC2 where they will stay
  • During the project we'll be using data from kaggle competitions
  • We download kaggle data with the kaggle command line interface
  • Our data is coming from this competition: https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition but the course has taken that data and put it into nice folders for us already!

Steps

  1. Set up Kaggle
    1. Upgrade pip on your EC2
      pip install --upgrade pip
    2. Install kaggle command line interface
      pip install kaggle-cli
    3. In your browser go to kaggle.com and set up an account manually, don't link it to facebook, the cli doesn't like that. Remember your USERNAME and PASSWORD
    4. Go to the competition website https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition
    5. Go to 'more' (top right) then 'rules' and scroll down to accept the rules
    6. Configure your cli with your password, so your EC2 can talk to the kaggle website to download data and submit entries
      kg config -g -u USERNAME -p PASSWORD -c dogs-vs-cats-redux-kernels-edition
    7. Install unzip
      sudo apt install unzip
  2. Download the data. The course creators have downloaded the data from kaggle and put it in nice folders for us already
    1. Navigate to /courses/deeplearning1/nbs so that our data goes in the same folder as the rest of the project
    2. make a data folder
      mkdir data
    3. go there
      cd data
    4. get the zipped data from the course platform website
      wget http://www.platform.ai/files/dogscats.zip
    5. unzip it
      unzip dogscats.zip
    6. Remove the zip file
      rm dogscats.zip
    7. Inspect the folder structure
      tree -d
    8. Inspect the distribution of files by looking at the sizes
      tree -d --du


ALTERNATIVELY: if you want to download the data straight from kaggle and organise the files yourself

  1. Download the data
    1. Navigate to /courses/deeplearning1/nbs so that our data goes in the same folder as the rest of the project
    2. Make a data folder
      mkdir data
    3. Go into the data folder
      cd data
    4. Make a dogscats folder
      mkdir dogscats
    5. Go into that folder
      cd dogscats
    6. Download the data (now that we've told kaggle the name of our competition). Takes a while
      kg download
    7. Unzip test images
      unzip test.zip
    8. Unzip train images
      unzip train.zip
    9. Remove test.zip
      rm test.zip
    10. Remove train.zip
      rm train.zip
  2. Make sure our folder structure is accurate so that the training works (we will need training, validation and test folders)
    1. Make sure you're in /courses/deeplearning1/nbs/data/dogscats
      pwd
    2. Make sure the 'train' and 'test' folders are still there from when we downloaded them. They should be about 286,720 bytes and 757,760 bytes
      ls -l
      • Train (757,760 bytes): This will be used to fit the parameters of the model
      • Test (286,720 bytes): This will be fresh data the model hasn't seen. It will be used to see how good it is. The Kaggle website has a second set of secret test data which it will use to see how good we are
      • Note that our training folder is much larger than the test folder to maximise the information we have to learn from before being tested
    3. Make an empty validation folder
      mkdir valid
      • This will be used to fine tune the parameters of the model. Will need to put images in it (one tenth the amount of the train folder)
    4. We also want a duplicate of all of these folders, containing tiny quantities of data, so we can rapidly test
      1. make a sample directory
        mkdir sample
      2. enter sample directory
        cd sample
        mkdir train
        mkdir valid
        mkdir test
    5. navigate back to deeplearning1/nbs (three levels up from sample)
      cd ../../..
    6. See your folder structure
      tree -d
    7. In every end-folder make a 'cats' folder and a 'dogs' folder (navigate with cd 'name' and cd .., make with mkdir - or see the one-line shortcut after this list)
      • data
        • dogscats
          • sample
            • train
              • cats
              • dogs
            • valid
              • cats
              • dogs
            • test
          • test
          • train
            • cats
            • dogs
          • valid
            • cats
            • dogs
    8. See how many bytes each folder has
      tree --du
    9. Distribute the files as you need them (one way to do the validation split is sketched below)
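
The one-line shortcut for step 7: bash brace expansion can build the whole cats/dogs skeleton in one go. A sketch, assuming you are sitting in the dogscats folder:

    # makes cats and dogs subfolders in train, valid and their sample twins
    # (the test folders hold unlabelled images, so they get no subfolders)
    mkdir -p {train,valid,sample/train,sample/valid}/{cats,dogs}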
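
For step 9, one possible way to do the validation split: move a random slice of the training images into valid, then sort images into their label folders using the kaggle filenames (cat.123.jpg, dog.456.jpg). The figure of 2,500 is an arbitrary stand-in for 'about one tenth' - adjust to taste. Run from the dogscats folder:

    cd train
    # move ~2,500 randomly chosen images into the validation folder
    for f in $(ls *.jpg | shuf -n 2500); do mv "$f" ../valid/; done
    # sort the remaining training images into their label folders
    mv cat.*.jpg cats/
    mv dog.*.jpg dogs/
    cd ../valid
    mv cat.*.jpg cats/
    mv dog.*.jpg dogs/
    cd ..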

10. Check out the code for our basic deep network

  1. Find the file deeplearning1/nbs/lesson1.ipynb and make a copy
    cp lesson1.ipynb lesson1_copy.ipynb
  2. Go over to your notebook (or run notebook with -> jupyter notebook) and open deeplearning1/nbs/lesson1_copy.ipynb
  3. Inspect the code, then at each cell hit shift+enter to run, working from top to bottom. When a cell is busy it will change its number (top left) to an asterisk, so wait for that to finish
    1. Line 1: Matplotlib raises a warning that it's taking its time doing some font things. All good
    2. Line 2: Where we are looking for our files (small sample (quick), or big full set of data (slow))
      • At present we haven't yet copied any images from our regular data into our sample folders
      • Once we do, we can comment out (#) the real data path and uncomment the sample data path
    3. Line 3: Imports modules that we have already installed as part of anaconda (numpy, matplotlib)
    4. Line 4: Utils is a module that the course people wrote which simplifies some of the things we want to achieve. To inspect what utils can do, go to the document called utils.py and have a peek
    5. Line 6: VGG16!
      • We import the module that contains the specifics of the winning deep neural network from an old imagenet competition.
      • It's been turned into a module that we can import so we don't have to look at any of the nuts and bolts in week 1.
      • If you want to look at what the vgg16 module contains, go inspect the vgg16.py file in the same directory
        • Do it first in terminal with the concatenate+print command
          cat vgg16.py
        • Then via jupyter navigation

I'll put more here when I work out what it is

Need to do, but haven't looked into:

  • Add the data folder to .gitignore
  • How to upload a model to git for use on a different computer

11. Rapid fire

Description: No messing around. I have to go to bed and cats and dogs are running away!

  1. Get into P2
  2. Git clone the repo
    git clone https://github.com/fastai/courses.git
  3. upgrade pip
    pip install --upgrade pip
  4. install kaggle
    pip install kaggle-cli
    • Configure kaggle
      kg config -g -u USERNAME -p PASSWORD -c dogs-vs-cats-redux-kernels-edition
  5. In a separate bash window, ssh into the p2 instance and open the notebook courses/deeplearning1/nbs/dogs_cats_redux.ipynb
  6. make a 'data' folder, then inside that a 'redux' folder
    • Download data
      kg download
    • Get unzip
      sudo apt install unzip
    • Unzip the test and train zips, delete zip files
    • Get tree
      sudo apt install tree
    • See what you've got
      tree -d --du
  7. Run the boxes up to 'Action Plan'
  8. See what you did
    tree -d --du
  9. Run up to "Rearrange image files into their respective directories"
  10. See what you did
    tree -d --du
    • You moved some files into the sample/train folder and some into the validation folder
  11. Run up to "Finetuning and training"
  12. See what you did
  13. Run up to "Generate Predictions"
  14. Wait ages. Like 650s x 3 = 32 mins
  15. Run the next line to make predictions - takes like 3 mins or so
  16. Run the rest!
    kg submit submission1.csv -u USERNAME -p PASSWORD -c dogs-vs-cats-redux-kernels-edition -m any_message

Phew!

