Getting started - A guide to Git, GitHub and IPython Notebooks

What you will need for this course

This course focuses on developing practical skills in working with data and on giving students a hands-on understanding of classical data analysis techniques. As expected, this will be a code-intensive course. We have chosen Python as the language to work with, since it allows for fast prototyping and is supported by a great variety of scientific (and, specifically, data-related) libraries. For a quick introduction to Python, you can check Lecture 1.

The materials of this course can be found under the dataminingapp GitHub account (https://github.com/dataminingapp). Both the lectures and the homeworks of this course are in the format of IPython notebooks.

Installing Python

There are many ways to install Python. We highly recommend the free Anaconda Scientific Python distribution, which you can download from https://store.continuum.io/cshop/anaconda/. This Python distribution contains most of the packages that we will be using throughout the course. It also includes an easy-to-use but powerful packaging system, conda. For compatibility reasons, we will be using Python 2.7, so make sure to download the correct version of Anaconda.

Installing Git

One of the goals of this course is to make you familiar with the workflow of code versioning and collaboration. We will be using GitHub to host all the materials of the course, and we will expect you to use it when submitting your homeworks as well. If git is not already installed on your machine, you should download and install it, and create a profile on GitHub. You can find extensive documentation on how to use git on the Help Pages of GitHub, on Atlassian, on GitRef and many other sites.


Working with Git

Configuration

The first time we use git on a new machine, we need to configure our name and email:

$ git config --global user.name "Charalampos Mavroforakis"
$ git config --global user.email "cmav@bu.edu"

Use the email that you used for your GitHub account.

Creating a Repository

After installing Git, we can configure our first repository. First, let's create a new directory.

$ mkdir thoughts
$ cd thoughts

Now, we can create a git repository in this directory.

$ git init

We can check that everything is set up correctly by asking git to tell us the status of our project.

$ git status
On branch master

Initial commit

nothing to commit (create/copy files and use "git add" to track)

Now, create a file named science.txt, edit it with your favorite text editor and add the following line:

Starting to think about data

If we check the status of our repository again, git tells us that there is a new file:

$ git status
On branch master

Initial commit

Untracked files:
  (use "git add <file>..." to include in what will be committed)

        science.txt

nothing added to commit but untracked files present (use "git add" to track)

The "untracked files" message means that there's a file in the directory that git isn't keeping track of. We can tell git that it should do so using git add:

$ git add science.txt

and then check that the file is now being tracked:

$ git status
On branch master

Initial commit

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)

        new file:   science.txt

git now knows that it's supposed to keep track of science.txt, but it hasn't yet recorded any changes for posterity as a commit. To get it to do that, we need to run one more command:

$ git commit -m "Preparing for science"
[master (root-commit) f516d22] Preparing for science
 1 file changed, 1 insertion(+)
 create mode 100644 science.txt

When we run git commit, git takes everything we have told it to save by using git add and stores a copy permanently inside the special .git directory. This permanent copy is called a revision and its short identifier is f516d22. (Your revision may have another identifier.)
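
To give a feel for where identifiers such as f516d22 come from: by default, git names every object it stores by the SHA-1 hash of its contents, and the short identifier is a prefix of that 40-character hash. The sketch below reproduces the calculation for the contents of a file (a "blob" in git's terms). The identifier of a commit is derived in a similar way but also covers the author, the date and the message, which is why your revision will have a different identifier. The function name git_blob_id is our own and is not part of git.

```python
import hashlib

def git_blob_id(content):
    """Return the 40-character id git would assign to a file with these bytes."""
    # git hashes a small header ("blob <size>\0") followed by the raw bytes
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

full_id = git_blob_id(b"Starting to think about data\n")
print(full_id[:7])  # short form, like the 7-character ids that git prints
```

If you want to compare, git hash-object science.txt prints the identifier that git itself computes for the file.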

We use the -m flag (for "message") to record a comment that will help us remember later on what we did and why. If we just run git commit without the -m option, git will launch vim (or whichever other editor we have configured) so that we can write a longer message. If you are using Windows and you are not familiar with vim, try installing GitPad.

If we run git status now:

$ git status
On branch master
nothing to commit, working directory clean

it tells us everything is up to date. If we want to know what we've done recently, we can ask git to show us the project's history using git log:

$ git log
Author: Charalampos Mavroforakis <cmav@bu.edu>
Date:   Sun Jan 25 12:48:44 2015 -0500

    Preparing for science

Changing a file

Now, suppose that we want to edit the file:

Starting to think about data
I need to attend CS591

Now if we run git status, git will tell us that a file that it is tracking has been modified:

$ git status
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   science.txt

no changes added to commit (use "git add" and/or "git commit -a")

The last line is the key phrase: "no changes added to commit". We have changed this file, but we haven't told git that we want to save those changes (which we do with git add), much less actually saved them. Let's double-check our work using git diff, which shows us the differences between the current state of the file and the most recently saved version:

$ git diff
diff --git a/science.txt b/science.txt
index 0ac4b7b..c5b1b05 100644
--- a/science.txt
+++ b/science.txt
@@ -1 +1,2 @@
 Starting to think about data
+I need to attend CS591
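
To help read this output, here is a minimal sketch that builds the same kind of "unified diff" with Python's standard difflib module. This is not git's own implementation, and the two lists simply restate the old and new contents of science.txt. Lines that start with + were added, lines that start with a space are unchanged context, and the @@ marker gives the line ranges in the old and the new version.

```python
import difflib

old = ["Starting to think about data\n"]
new = ["Starting to think about data\n", "I need to attend CS591\n"]

# unified_diff yields the two file headers, the @@ hunk marker and the lines
diff_lines = list(difflib.unified_diff(
    old, new, fromfile="a/science.txt", tofile="b/science.txt"))
print("".join(diff_lines))
```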

Let's commit our change:

$ git commit -m "Related course"
On branch master
Changes not staged for commit:
        modified:   science.txt

no changes added to commit

Whoops! Git won't commit the file because we didn't use git add first. Let's fix that:

$ git add science.txt
$ git commit -m "Related course"
[master 1bd7277] Related course
 1 file changed, 1 insertion(+)

Git insists that we add files to the set we want to commit before actually committing anything because we may not want to commit everything at once. For example, suppose we're adding a few citations to our project. We might want to commit those additions, and the corresponding addition to the bibliography, but not commit the work we're doing on the analysis (which we haven't finished yet).

To allow for this, git has a special staging area where it keeps track of things that have been added to the current change set but not yet committed. git add puts things in this area, and git commit then copies them to long-term storage (as a commit).
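
The relationship between the working directory, the staging area and the commits can be sketched with a toy model. The class below is purely illustrative: ToyRepo and its methods are hypothetical names, and this is not how git is implemented internally. It only mirrors the behaviour described above, where a commit picks up what was staged and nothing else.

```python
class ToyRepo(object):
    """A toy model of git's three areas; not how git is implemented."""

    def __init__(self):
        self.working = {}   # file name -> current contents on disk
        self.staged = {}    # contents chosen with add(), waiting for a commit
        self.commits = []   # list of (message, snapshot) tuples

    def add(self, name):
        # copy the current contents of one file into the staging area
        self.staged[name] = self.working[name]

    def commit(self, message):
        if not self.staged:
            return False    # mirrors "no changes added to commit"
        # a commit is the previous snapshot plus whatever was staged
        snapshot = dict(self.commits[-1][1]) if self.commits else {}
        snapshot.update(self.staged)
        self.commits.append((message, snapshot))
        self.staged = {}
        return True

repo = ToyRepo()
repo.working["citations.txt"] = "a finished citation"
repo.working["analysis.txt"] = "unfinished work"
repo.add("citations.txt")            # stage only the finished file
repo.commit("Add citation")
print(sorted(repo.commits[-1][1]))   # analysis.txt is not part of the commit
```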

Recovering old versions

We can save changes to files and see what we have changed. But how can we restore older versions? Let's suppose we accidentally overwrite the file:

$ cat science.txt
Despair! Nothing works

Now, git status tells us that the file has been changed, but those changes haven't been staged:

$ git status
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   science.txt

no changes added to commit (use "git add" and/or "git commit -a")

We can put things back the way they were by using git checkout:

$ git checkout HEAD science.txt
$ cat science.txt
Starting to think about data
I need to attend CS591

More information

You can find more information on git in the documentation sites mentioned earlier (the Help Pages of GitHub, Atlassian and GitRef).

IMPORTANT!

Never git add sensitive files (e.g. passwords or keys) unless you are really sure you need to.
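
One way to build the habit is to check file names before adding them. The sketch below is only an illustration: the pattern list and the function looks_sensitive are hypothetical, and a name-based check cannot catch secrets stored inside ordinary files.

```python
import fnmatch

# Hypothetical patterns for files that usually should not be committed
SENSITIVE_PATTERNS = ["*.pem", "*.key", "id_rsa*", ".env", "*password*"]

def looks_sensitive(filename):
    """Return True if the file name matches one of the patterns above."""
    name = filename.lower()
    return any(fnmatch.fnmatch(name, pattern) for pattern in SENSITIVE_PATTERNS)

for candidate in ["science.txt", "id_rsa", "db_password.txt"]:
    print("%s: %s" % (candidate, looks_sensitive(candidate)))
```

Git can also be told to ignore such files altogether by listing similar patterns in a .gitignore file at the top of the repository.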


IPython Notebook

IPython has become the standard for interactive computing in Python. After installing Anaconda, you can access IPython (and the Notebooks) either through the Anaconda Launcher or the Anaconda command prompt.

To run the IPython Notebook server from the command line, type ipython notebook in the terminal. Your web browser will open and load the environment.

In the notebook, you can type and run code:


In [2]:
print "hi!"


hi!

You can use auto-complete (with the TAB key) and see the documentation (by adding `?`):


In [19]:
import os
# os.listdir?
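
The same information is available from plain Python, which is handy outside the notebook. As a rough sketch: the text shown by os.listdir? includes the function's docstring, and the names offered by TAB completion are roughly those returned by dir(). The variable names below are our own.

```python
import os

# The help text shown by `os.listdir?` includes the function's docstring
doc = os.listdir.__doc__
print(doc)

# dir() lists the attribute names that completion can offer after "os."
names = [name for name in dir(os) if name.startswith("list")]
print(names)
```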

The errors are nicely formatted:


In [21]:
1/0


---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-21-05c9758a9c21> in <module>()
----> 1 1/0

ZeroDivisionError: integer division or modulo by zero
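
The first word of the traceback's last line is the type of the exception, and it is the name to use if you want to handle the error instead of stopping. A minimal sketch, where safe_ratio is a hypothetical helper and not part of IPython:

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    try:
        return numerator / denominator
    except ZeroDivisionError as error:
        # the exception carries the message shown on the traceback's last line
        print("Could not divide: %s" % error)
        return None

result = safe_ratio(1, 0)
```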

Essential Shortcuts

  • Esc / Enter: Switch between edit and command mode
  • Execute cells
    • Shift-Enter: Run and move to the next cell
    • Alt-Enter: Run and make new cell
    • Ctrl-Enter: Run in place
  • a / b: Insert cell above / below
  • d, d (press d twice): Delete cell

Conda Package Manager

Anaconda also installs a package manager that makes it easy to install and update Python packages. To call it, type conda in the Anaconda command prompt. A brief FAQ is available in the conda documentation.

Working from the Undergraduate lab

Anaconda has also been installed on the Linux machines of the Undergraduate lab. If you want to work from there, you need to follow these steps:

  1. Add the conda executable to your PATH

    $ export PATH=/usr/local/anaconda/bin:$PATH
  2. Create a new environment (only do this once)

    $ conda create -p ~/envs/test numpy scipy networkx pandas scikit-learn matplotlib beautiful-soup ipython-notebook=2.2

    You can change its name to something other than test.

  3. Activate the environment

    $ source activate ~/envs/test
  4. Run IPython or IPython notebook

    $ ipython2 notebook
  5. Deactivate the environment

    $ source deactivate
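
Once the environment is activated (and before you deactivate it), you can check that the packages from step 2 are importable. The sketch below is one way to do that. Note that a module's import name can differ from its conda package name: we assume here that scikit-learn is imported as sklearn and beautiful-soup as bs4. The function check_packages is our own helper.

```python
import importlib

# Import names for the packages requested in step 2 (see the assumptions above)
PACKAGES = ["numpy", "scipy", "networkx", "pandas", "sklearn", "matplotlib", "bs4"]

def check_packages(names):
    """Map each module name to True if it can be imported, else False."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status

for name, ok in sorted(check_packages(PACKAGES).items()):
    print("%-12s %s" % (name, "ok" if ok else "MISSING"))
```

Any package reported as MISSING can be added to the active environment with conda install.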

GitHub

Systems like git allow us to move work between any two repositories. In practice, though, it's easiest to use one copy as a central hub, and to keep it on the web rather than on someone's laptop. Most programmers use hosting services like GitHub or BitBucket to hold those master copies. For the purpose of our course, we will be using GitHub to host the course material. You will also submit your homeworks through this platform. Next, we will cover how you can fork and clone the course's repository and how to submit your solutions to the homework. For more information on how to create your own repository on GitHub and upload code to it, please see the tutorial by Software Carpentry.

Course repositories

The material of the course is hosted on GitHub, under the dataminingapp account (https://github.com/dataminingapp).

Clone the lecture repository

In order to download a copy of the lectures and run them locally on your computer, you need to clone the lecture repository. To do that:

  1. Create a new folder for the course.
    $ mkdir cs591
    $ cd cs591
  2. Copy the clone url from the repository's website.
  3. Clone the repository with git.
    $ git clone https://github.com/dataminingapp/dataminingapp-lectures.git

You should now have a directory named dataminingapp-lectures with the course material.

To update the repository and download the new material, type

$ git pull

Fork & Clone the homework repository

In order to download and submit the homework, you will need to follow the next steps. You only need to do this once:

  1. Fork the homework repository under your GitHub account.
  2. Clone your fork locally, as above.
  3. Set up the upstream remote, so that you can download changes made to the original repository
    $ git remote add upstream https://github.com/dataminingapp/spring-2015-homeworks.git

Now, every time that you want to work on the homework, you will need to:

  1. Make sure your fork is up-to-date

    $ git pull --rebase upstream master
  2. Work on the homework. Don't forget to commit regularly!

    <...>
    $ git add Homework-0.ipynb
    $ git commit -m "Adds a hello-world function"
  3. Push your changes to your fork, on GitHub

    $ git push
  4. Before the submission deadline, make sure that the changes that you made to the homework (the commits) are reflected in your online fork.
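
If you prefer to drive these steps from Python, the commands above can be wrapped in a small script. This is only a sketch under our own naming (update_commands, submit_commands and run_all are hypothetical helpers). By default it only prints the commands; pass dry_run=False to actually run them inside your clone. The update step is kept separate because it should run before you start editing.

```python
import subprocess

def update_commands():
    """Step 1: bring the fork up to date before starting to work."""
    return [["git", "pull", "--rebase", "upstream", "master"]]

def submit_commands(notebook, message):
    """Steps 2 and 3: record the work and push it to the fork."""
    return [
        ["git", "add", notebook],
        ["git", "commit", "-m", message],
        ["git", "push"],
    ]

def run_all(commands, dry_run=True):
    for command in commands:
        # shown for reference only; arguments with spaces are not quoted here
        print("$ " + " ".join(command))
        if not dry_run:
            # check_call raises an error at the first command that fails
            subprocess.check_call(command)

run_all(update_commands())
run_all(submit_commands("Homework-0.ipynb", "Adds a hello-world function"))
```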


Practice

Now, practice what we have seen today by solving and submitting Homework 0.



In [1]:
# Code for setting the style of the notebook
from IPython.core.display import HTML
def css_styling():
    # Read the course stylesheet and close the file once it has been read
    with open("../theme/custom.css", "r") as css_file:
        styles = css_file.read()
    return HTML(styles)
css_styling()

