This course focuses on developing practical skills in working with data and providing students with a hands-on understanding of classical data analysis techniques. As expected, this will be a code-intensive course. We have chosen Python as the language to work with, since it allows for fast prototyping and is supported by a great variety of scientific (and, specifically, data-related) libraries. For a quick introduction to Python, you can check Lecture 1.
The materials of this course can be found under this GitHub account. Both the lectures and the homeworks of this course are in the format of IPython notebooks.
There are many ways to install Python. We highly recommend the free Anaconda Scientific Python distribution, which you can download from https://store.continuum.io/cshop/anaconda/. This Python distribution contains most of the packages that we will be using throughout the course. It also includes an easy-to-use but powerful packaging system, conda. For compatibility reasons, we will be using Python 2.7, so make sure to download the correct version of Anaconda.
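Once Anaconda is installed, a quick way to confirm which interpreter you are running is to inspect `sys.version_info`. The sketch below is only an illustration; the helper name `is_course_python` is hypothetical and not part of the course material.

```python
import sys


def is_course_python(version_info=None):
    """Return True when the version belongs to the Python 2.7 series.

    `version_info` defaults to the running interpreter; passing a tuple
    such as (2, 7, 9) makes the check easy to try out by hand.
    """
    major, minor = tuple(version_info or sys.version_info)[:2]
    return (major, minor) == (2, 7)


verdict = "matches" if is_course_python() else "differs from"
print("Python {0}.{1} {2} the course version (2.7)".format(
    sys.version_info[0], sys.version_info[1], verdict))
```

If the message says the versions differ, check that the Anaconda installation comes first on your PATH.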
One of the goals of this course is to make you familiar with the workflow of code versioning and collaboration. We will be using GitHub to host all the materials of the course, and we will expect you to use it when submitting your homeworks as well. You should download git from here if it is not already installed on your machine and create a profile on GitHub. You can find extensive documentation on how to use git on the Help Pages of GitHub, on Atlassian, on GitRef and many other sites.
The first time we use git on a new machine, we need to configure our name and email:
$ git config --global user.name "Charalampos Mavroforakis"
$ git config --global user.email "cmav@bu.edu"
Use the email that you used for your GitHub account.
After installing Git, we can configure our first repository. First, let's create a new directory.
$ mkdir thoughts
$ cd thoughts
Now, we can create a git repository in this directory.
$ git init
We can check that everything is set up correctly by asking git to tell us the status of our project.
$ git status
On branch master
Initial commit
nothing to commit (create/copy files and use "git add" to track)
Now, create a file named science.txt, edit it with your favorite text editor, and add the following line:
Starting to think about data
If we check the status of our repository again, git tells us that there is a new file:
$ git status
On branch master
Initial commit
Untracked files:
(use "git add <file>..." to include in what will be committed)
science.txt
nothing added to commit but untracked files present (use "git add" to track)
The "untracked files" message means that there's a file in the directory that git isn't keeping track of. We can tell git that it should do so using git add:
$ git add science.txt
and then check that the file is now being tracked:
$ git status
On branch master
Initial commit
Changes to be committed:
(use "git rm --cached <file>..." to unstage)
new file: science.txt
git now knows that it's supposed to keep track of science.txt, but it hasn't yet recorded any changes for posterity as a commit. To get it to do that, we need to run one more command:
$ git commit -m "Preparing for science"
[master (root-commit) f516d22] Preparing for science
1 file changed, 1 insertion(+)
create mode 100644 science.txt
When we run git commit, git takes everything we have told it to save by using git add and stores a copy permanently inside the special .git directory. This permanent copy is called a revision and its short identifier is f516d22. (Your revision may have another identifier.)
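Identifiers such as f516d22 are abbreviated content hashes. The sketch below, which assumes git's default SHA-1 object format, shows how git derives the identifier of a file's content (a "blob"). A commit identifier is computed in the same way over a commit object that also records the author, the timestamp and the message, which is why your revision will usually get a different identifier. The function name `git_blob_id` is hypothetical.

```python
import hashlib


def git_blob_id(content):
    """Return the 40-character id git assigns to file content (bytes).

    Git hashes a small header, "blob <size>" followed by a NUL byte,
    and then the raw content, using SHA-1.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


full_id = git_blob_id(b"Starting to think about data\n")
# Git usually prints only a short prefix of the full identifier.
print(full_id[:7])
```

Running `git hash-object science.txt` in the repository should report the same full identifier, provided the file content is byte-for-byte identical.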
We use the -m flag (for "message") to record a comment that will help us remember later on what we did and why. If we just run git commit without the -m option, git will launch vim (or whatever other editor we configured at the start) so that we can write a longer message. If you are using Windows and you are not familiar with vim, try installing GitPad.
If we run git status now:
$ git status
On branch master
nothing to commit, working directory clean
it tells us everything is up to date. If we want to know what we've done recently, we can ask git to show us the project's history using git log:
$ git log
Author: Charalampos Mavroforakis <cmav@bu.edu>
Date: Sun Jan 25 12:48:44 2015 -0500
Preparing for science
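The status messages seen so far are meant for people. Git can also print a compact, script-friendly form with `git status --porcelain`, where each line starts with two status characters followed by the path. The sketch below is a simplified parser for that format; the function name and the state labels are hypothetical, and rarer cases such as renames or merge conflicts are not handled.

```python
def parse_porcelain(output):
    """Turn `git status --porcelain` text into (state, path) pairs.

    The first character describes the staging area, the second the
    working directory. Only the common cases are covered here.
    """
    entries = []
    for line in output.splitlines():
        if not line:
            continue
        index_flag, worktree_flag, path = line[0], line[1], line[3:]
        if index_flag == "?" and worktree_flag == "?":
            state = "untracked"
        elif index_flag != " " and worktree_flag == " ":
            state = "staged"
        elif index_flag == " ":
            state = "modified, not staged"
        else:
            state = "staged, then modified again"
        entries.append((state, path))
    return entries


# Sample text in the porcelain format (not captured from a real run).
sample = "?? science.txt\n"
print(parse_porcelain(sample))
```

An empty output corresponds to the "nothing to commit" message shown earlier.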
Now, suppose that we want to edit the file:
Starting to think about data
I need to attend CS591
Now if we run git status, git will tell us that a file that it is tracking has been modified:
$ git status
On branch master
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)
modified: science.txt
no changes added to commit (use "git add" and/or "git commit -a")
The last line is the key phrase: "no changes added to commit". We have changed this file, but we haven't told git that we want to save those changes (which we do with git add), much less actually saved them. Let's double-check our work using git diff, which shows us the differences between the current state of the file and the most recently saved version:
$ git diff
diff --git a/science.txt b/science.txt
index 0ac4b7b..c5b1b05 100644
--- a/science.txt
+++ b/science.txt
@@ -1 +1,2 @@
Starting to think about data
+I need to attend CS591
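To get a feel for how to read this output, the same hunk can be reproduced with Python's standard `difflib` module. This is only an illustration of the unified diff format, not how git computes diffs internally; the two versions of the file are typed in by hand.

```python
import difflib

# The file as it was committed, and as it looks after the edit.
old = ["Starting to think about data\n"]
new = ["Starting to think about data\n", "I need to attend CS591\n"]

diff_lines = list(difflib.unified_diff(
    old, new, fromfile="a/science.txt", tofile="b/science.txt"))
print("".join(diff_lines))
```

The header `@@ -1 +1,2 @@` says that the hunk covers line 1 of the old file and lines 1 to 2 of the new one. Lines starting with `+` were added, and lines starting with a space are unchanged context.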
Let's commit our change:
$ git commit -m "Related course"
On branch master
Changes not staged for commit:
modified: science.txt
no changes added to commit
Whoops! Git won't commit the file because we didn't use git add first. Let's fix that:
$ git add science.txt
$ git commit -m "Related course"
[master 1bd7277] Related course
1 file changed, 1 insertion(+)
Git insists that we add files to the set we want to commit before actually committing anything because we may not want to commit everything at once. For example, suppose we're adding a few citations to our project. We might want to commit those additions, and the corresponding addition to the bibliography, but not commit the work we're doing on the analysis (which we haven't finished yet).
To allow for this, git has a special staging area where it keeps track of things that have been added to the current change set but not yet committed. git add puts things in this area, and git commit then copies them to long-term storage (as a commit).
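The relationship between the working directory, the staging area and the commit history can be sketched with a small in-memory model. The class below is hypothetical and greatly simplified; it is meant to show why only staged files end up in a commit, not how git is implemented.

```python
class ToyRepo(object):
    """Toy model of git's three areas (an illustration, not git itself)."""

    def __init__(self):
        self.working = {}  # file name -> current content "on disk"
        self.staged = {}   # file name -> content captured by add()
        self.commits = []  # list of (message, snapshot) pairs

    def write(self, name, content):
        self.working[name] = content

    def add(self, name):
        # Staging copies the content as it is at this moment.
        self.staged[name] = self.working[name]

    def commit(self, message):
        if not self.staged:
            return None  # mirrors "no changes added to commit"
        # A new snapshot is the previous one plus the staged changes.
        snapshot = dict(self.commits[-1][1]) if self.commits else {}
        snapshot.update(self.staged)
        self.commits.append((message, snapshot))
        self.staged = {}
        return len(self.commits)


repo = ToyRepo()
repo.write("science.txt", "Starting to think about data\n")
repo.write("analysis.txt", "unfinished\n")
repo.add("science.txt")
first = repo.commit("Preparing for science")
second = repo.commit("Nothing staged")
print("committed files: {0}".format(sorted(repo.commits[0][1])))
```

Here analysis.txt exists in the working directory but is left out of the commit because it was never staged, and the second commit attempt does nothing.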
We can save changes to files and see what we have changed. But how can we restore older versions? Let's suppose we accidentally overwrite the file:
$ cat science.txt
Despair! Nothing works
Now, git status tells us that the file has been changed, but those changes haven't been staged:
$ git status
On branch master
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)
modified: science.txt
no changes added to commit (use "git add" and/or "git commit -a")
We can put things back the way they were by using git checkout:
$ git checkout HEAD science.txt
$ cat science.txt
Starting to think about data
I need to attend CS591
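The overwrite-and-restore sequence can also be replayed from Python in a throwaway directory, which is a safe way to experiment. The sketch below assumes git is on your PATH (it returns None otherwise). The helper names and the placeholder identity passed with `-c` are hypothetical, and nothing in your global git configuration is changed.

```python
import os
import shutil
import subprocess
import tempfile


def git(args, cwd):
    """Run one git command inside `cwd` and return its output as text."""
    # The -c options supply a placeholder identity for this call only.
    cmd = ["git", "-c", "user.name=Example",
           "-c", "user.email=example@example.com"] + args
    return subprocess.check_output(cmd, cwd=cwd).decode("utf-8")


def restore_demo():
    """Commit a file, overwrite it, then restore the committed version."""
    repo = tempfile.mkdtemp()
    path = os.path.join(repo, "science.txt")
    try:
        try:
            git(["init", "-q"], repo)
        except OSError:
            return None  # the git executable was not found
        with open(path, "w") as f:
            f.write("Starting to think about data\nI need to attend CS591\n")
        git(["add", "science.txt"], repo)
        git(["commit", "-q", "-m", "Related course"], repo)

        # Simulate the accident: overwrite the tracked file.
        with open(path, "w") as f:
            f.write("Despair! Nothing works\n")

        # Bring back the version stored in the latest commit (HEAD).
        git(["checkout", "HEAD", "science.txt"], repo)
        with open(path) as f:
            return f.read()
    finally:
        shutil.rmtree(repo)  # remove the temporary repository


print(restore_demo())
```

Note that the checkout discards the uncommitted overwrite for good, so it is worth running git diff first if you are unsure what you are about to lose.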
You can find more information on git here:
IMPORTANT!
Never git add sensitive files, e.g. passwords, keys, etc., unless you are really sure you need to.
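One way to reduce the risk is to check file names before staging them. The sketch below flags names that match a few patterns; the pattern list and the function name are hypothetical examples and should be adapted to your own project. Listing such files in a .gitignore file is the usual complementary safeguard.

```python
import fnmatch

# Hypothetical patterns for files that often contain secrets.
SENSITIVE_PATTERNS = ["*.pem", "*.key", "id_rsa*", ".env", "*password*"]


def flag_sensitive(paths, patterns=SENSITIVE_PATTERNS):
    """Return the paths whose file name matches a sensitive pattern."""
    flagged = []
    for path in paths:
        # Compare only the lower-cased file name, not its directories.
        name = path.replace("\\", "/").split("/")[-1].lower()
        if any(fnmatch.fnmatchcase(name, pattern) for pattern in patterns):
            flagged.append(path)
    return flagged


candidates = ["science.txt", "keys/server.pem", "db_password.txt"]
print(flag_sensitive(candidates))
```

A name-based check is only a heuristic: a secret pasted into an ordinary file such as science.txt would not be caught.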
IPython has become the standard for interactive computing in Python. After installing Anaconda, you can access IPython (and the Notebooks) either through the Anaconda Launcher or the Anaconda command prompt.
To run the IPython Notebook server from the command line, type ipython notebook from the terminal. Your web browser will open and load the environment.
In the notebook, you can type and run code:
In [2]:
print "hi!"
You can use auto-complete (with the TAB key) and see the documentation (by adding `?`):
In [19]:
import os
# os.listdir?
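The `?` suffix is IPython syntax and does not work in a plain Python script. A minimal sketch of how to read the same documentation with the standard library, which works both inside and outside the notebook:

```python
import inspect
import os

# inspect.getdoc returns the cleaned-up docstring of an object.
doc = inspect.getdoc(os.listdir)
print(doc.splitlines()[0])  # first line of the documentation
```

The built-in `help(os.listdir)` shows the full text in an interactive session.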
The errors are nicely formatted:
In [21]:
1/0
Anaconda also installs a package manager that makes it easy to install and update Python packages. To call it, you need to type conda in the Anaconda command prompt. You can read a brief FAQ for conda here.
Anaconda has also been installed on the Linux machines of the Undergraduate lab. If you want to work from there, you need to follow these steps:
Add the conda executable to your PATH
$ export PATH=/usr/local/anaconda/bin:$PATH
Create a new environment (only do this once)
$ conda create -p ~/envs/test numpy scipy networkx pandas scikit-learn matplotlib beautiful-soup ipython-notebook=2.2
You can change its name to something other than test.
Activate the environment
$ source activate ~/envs/test
Run IPython or IPython notebook
$ ipython2 notebook
Deactivate the environment
$ source deactivate
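While the environment is active, one way to confirm that it contains the course packages is to try importing them. In the sketch below, the import names are assumptions (for example, scikit-learn is imported as `sklearn` and Beautiful Soup is assumed to be importable as `bs4`), and the function name is hypothetical.

```python
import importlib

# Import names assumed for the packages in the conda create command.
COURSE_MODULES = ["numpy", "scipy", "networkx", "pandas",
                  "sklearn", "matplotlib", "bs4"]


def check_modules(names):
    """Map each module name to True if it can be imported, else False."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


for name, ok in sorted(check_modules(COURSE_MODULES).items()):
    print("{0:<12} {1}".format(name, "ok" if ok else "MISSING"))
```

Any package reported as MISSING can be added to the active environment with conda install.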
Systems like git allow us to move work between any two repositories. In practice, though, it's easiest to use one copy as a central hub, and to keep it on the web rather than on someone's laptop. Most programmers use hosting services like GitHub or BitBucket to hold those master copies. For the purpose of our course, we will be using GitHub to host the course material. You will also submit your homeworks through this platform. Next, we will cover how you can fork and clone the course's repository and how to submit your solutions to the homework. For more information on how to create your own repository on GitHub and upload code to it, please see the tutorial by Software Carpentry.
The material of the course is hosted on GitHub, under this account.
In order to download a copy of the lectures and run them locally on your computer, you need to clone the lecture repository. To do that:
$ mkdir cs591
$ cd cs591
$ git clone https://github.com/dataminingapp/dataminingapp-lectures.git
You should now have a directory named dataminingapp-lectures with the course material.
To update the repository and download the new material, type
$ git pull
In order to download and submit the homework, you will need to follow the next steps. You need to do this once:
$ git remote add upstream https://github.com/dataminingapp/spring-2015-homeworks.git
Now, every time you want to work on the homework, you will need to:
Make sure your fork is up-to-date
$ git pull --rebase upstream master
Work on the homework. Don't forget to commit regularly!
<...>
$ git add Homework-0.ipynb
$ git commit -m "Adds a hello-world function"
Push your changes to your fork, on GitHub
$ git push
Before the submission deadline, make sure that the changes you made to the homework (the commits) are reflected in your online fork.
Now, practice what we have seen today by solving and submitting Homework 0.