Version Control with Git

This lesson is adapted from the lesson of the same name by Software Carpentry.

This lesson can be done interactively with the students, with this notebook distributed afterward for future reference.

Version control is the lab notebook of the digital world: it's what professionals use to keep track of what they've done and to collaborate with other people. Every large software development project relies on it, and most programmers use it for their small jobs as well. And it isn't just for software: books, papers, small data sets, and anything that changes over time or needs to be shared can and should be stored in a version control system.

version control: A tool for managing changes to a set of files. Each set of changes creates a new revision of the files; the version control system allows users to recover old revisions reliably, and helps manage conflicting changes made by different users.

Why do we need it?

Two people want to collaborate on a software program at the same time, but they have run into problems doing this in the past. If they take turns, each one will spend a lot of time waiting for the other to finish, but if they work on their own copies and email changes back and forth things will be lost, overwritten, or duplicated.

The right solution is to use version control to manage their work. Version control is better than mailing files back and forth because:

Nothing that is committed to version control is ever lost. This means it can be used like the "undo" feature in an editor, and since all old versions of files are saved it's always possible to go back in time to see exactly who wrote what on a particular day, or what version of a program was used to generate a particular set of results.
It keeps a record of who made what changes when, so that if people have questions later on, they know who to ask.
It's hard (but not impossible) to accidentally overlook or overwrite someone's changes: the version control system automatically notifies users whenever there's a conflict between one person's work and another's.

Objectives

Explain which initialization and configuration steps are required once per machine, and which are required once per repository.
Go through the modify-add-commit cycle for single and multiple files and explain where information is stored at each stage.
Identify and use Git revision numbers.
Compare files with old versions of themselves.
Restore old versions of files.
Configure Git to ignore specific files, and explain why it is sometimes useful to do so.

Setting Up

The first time we use Git on a new machine, we need to configure a few things. Since we may be moving around among computers on Cal Poly's Active Directory, we are going to set up a script to help us with that.

Use nano to create a new file in your home directory called configureGit.sh

$ nano configureGit.sh

The file should contain:

#!/bin/bash
git config --global user.name "<YOUR NAME>"
git config --global user.email "<YOUR CAL POLY EMAIL>"
git config --global color.ui "auto"
git config --global core.editor "nano"

Git commands are written git verb, where verb is what we actually want it to do. In this case, we're telling Git:

our name and email address,
to colorize output,
what our favorite text editor is, and
that we want to use these settings globally (i.e., for every project)

The four commands in our file only need to be run once per machine: the --global flag tells Git to use the settings for every project on this machine. Because we have saved them in a script (that's what the .sh extension indicates), we can simply run the script whenever we need to configure Git again, rather than remembering the commands.

Ok, we've created the script, now we need to run it. To do that type:

$ source configureGit.sh
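For contrast with the --global settings in the script, here is a minimal sketch of a setting stored in a single repository only. It assumes a throwaway directory made with mktemp, and the name "Demo User" is a placeholder rather than part of the lesson.

```shell
# Throwaway repository so no real project or global setting is touched.
demo_cfg="$(mktemp -d)"
git init --quiet "$demo_cfg"

# Without --global, git config writes to this repository's .git/config only.
git -C "$demo_cfg" config user.name "Demo User"

# --local reads back only the per-repository value.
git -C "$demo_cfg" config --local --get user.name

# --global reads the per-machine value set by configureGit.sh, if there is one.
git config --global --get user.name || echo "no global user.name set"
```

A per-repository value takes precedence over the global one inside that repository, which is why the global settings only need to be made once per machine.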

Creating a repository

Once Git is configured, we can start using it. Let's version control the directory we created last time by telling Git to make it a repository - a place where Git can store old versions of our files:

$ cd PHYS202-S14
$ git init

If we use ls to show the directory's contents, it appears that nothing has changed, but if we add the -a flag to show everything, we can see that Git has created a hidden directory called .git.

Git stores information about the project in this special sub-directory. If we ever delete it, we will lose the project's history.
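To see the hidden directory appear, the following sketch runs git init in an empty scratch directory. The directory comes from mktemp and is an assumption made for illustration; it is not the PHYS202-S14 directory used in the lesson.

```shell
# Start from an empty scratch directory.
demo_init="$(mktemp -d)"
cd "$demo_init"

ls -a              # before: nothing except . and ..
git init --quiet
ls -a              # after: the hidden .git directory is listed as well

# The project's history and settings live inside this directory.
ls .git
```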

We can check that everything is set up correctly by asking Git to tell us the status of our project:

$ git status

Tracking Changes to Files

The "untracked files" message means that there's a file in the directory that Git isn't keeping track of. We can tell Git that it should do so using git add:

$ git add pledge.txt

and then check that the right thing happened:

$ git status

pledge.txt is now in the index (also called the staging area): Git knows that it's supposed to keep track of this file, but it hasn't yet recorded any changes for posterity as a commit. To get it to do that, we need to run one more command:

$ git commit -m "Academic honesty pledge"

When we run git commit, Git takes everything we have told it to save by using git add and stores a copy permanently inside the special .git directory. This permanent copy is called a revision, and its short identifier is a string of letters and digits such as cf4fb04. (Your revision will have a different identifier.)

We use the -m flag (for "message") to record a comment that will help us remember later on what we did and why. If we just run git commit without the -m option, Git will launch nano (or whatever other editor we configured at the start) so that we can write a longer message.

If we run git status now:

$ git status

it tells us everything is up to date. If we want to know what we've done recently, we can ask Git to show us the project's history using git log:

$ git log

git log lists all revisions made to a repository in reverse chronological order. The listing for each revision includes the revision's full identifier (which starts with the same characters as the short identifier printed by the git commit command earlier), the revision's author, when it was created, and the log message Git was given when the revision was created.
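The whole cycle up to this point can be rehearsed in a scratch repository. This is only a sketch: the file name notes.txt and the identity "Demo User" are hypothetical, and the identity is stored in the scratch repository alone.

```shell
demo_log="$(mktemp -d)"
cd "$demo_log"
git init --quiet
git config user.name "Demo User"            # placeholder identity, this repo only
git config user.email "demo@example.com"

echo "I will do my own work." > notes.txt   # hypothetical file
git status --short                          # "??" marks an untracked file
git add notes.txt
git status --short                          # "A" marks a file staged for commit
git commit --quiet -m "Add notes"

# Print the short and the full identifier of the revision side by side.
git log -1 --format='short: %h%nfull:  %H'
```

The short identifier is a prefix of the full one, which is why either form can be used to name the revision.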

Where are my changes?

If we run ls at this point, we will still see just one file called pledge.txt. That's because Git saves information about files' history in the special .git directory mentioned earlier so that our filesystem doesn't become cluttered (and so that we can't accidentally edit or delete an old version).

Changing a file

Let's edit the pledge.txt file we created previously to change the date to today's date.

$ nano pledge.txt

When we run git status now, it tells us that a file it already knows about has been modified:

$ git status

The last line is the key phrase: "no changes added to commit". We have changed this file in our working tree, but we haven't promoted those changes to the index or saved them as a commit. Let's double-check our work using git diff, which shows us the differences between the current state of the file and the most recently saved version:

$ git diff

The output is cryptic because it is actually a series of commands for tools like editors and patch telling them how to reconstruct one file given the other. We can break it down into pieces:

The first line tells us that Git is producing output in the same format as the Unix diff command, comparing the old and new versions of the file.
The second line tells exactly which revisions of the file Git is comparing; df0654a and 315bf3a are unique computer-generated labels for those revisions.
The remaining lines show us the actual differences and the lines on which they occur.
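The pieces described above can be matched against real output with a small experiment. The sketch assumes a scratch repository and a hypothetical one-line file; the identifiers Git prints will differ from the ones quoted in this lesson.

```shell
demo_diff="$(mktemp -d)"
cd "$demo_diff"
git init --quiet
git config user.name "Demo User"            # placeholder identity, this repo only
git config user.email "demo@example.com"

echo "Date: yesterday" > pledge_demo.txt    # hypothetical file
git add pledge_demo.txt
git commit --quiet -m "Add pledge"

echo "Date: today" > pledge_demo.txt        # change the only line
git diff
# Reading the output from top to bottom:
#   diff --git a/... b/...   names the two versions being compared
#   index <old>..<new>       identifiers of the old and new file contents
#   --- a/... and +++ b/...  labels for the old and the new file
#   @@ ... @@                where in the file the change sits
#   -Date: yesterday         line removed
#   +Date: today             line added
```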

Let's commit our change:

$ git commit -m 'update the date in pledge file'

Whoops: Git won't commit because we didn't use git add first. There's nothing in the index, so there is nothing for Git to make a commit out of! Remember to promote our work from the working tree to the index first using git add:

$ git add pledge.txt
$ git commit -m 'update the date in pledge file'

Git insists that we add files to the set we want to commit before actually committing anything because we may not want to commit everything at once. For example, suppose we're adding a few citations to our supervisor's work to our thesis. We might want to commit those additions, and the corresponding addition to the bibliography, but not commit the work we're doing on the conclusion (which we haven't finished yet).

To allow for this, Git has a special staging area where it keeps track of things that have been added to the current change set but not yet committed. git add puts things in this area (the index), and git commit then copies them to long-term storage (as a commit):

Working files (what we see) --> git add --> Staging area (ready to commit) --> git commit --> Repository (Permanent storage)
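The thesis scenario can be sketched with two hypothetical files, intro.txt and conclusion.txt, in a scratch repository. Both are modified, but only the finished change is staged and committed.

```shell
demo_stage="$(mktemp -d)"
cd "$demo_stage"
git init --quiet
git config user.name "Demo User"            # placeholder identity, this repo only
git config user.email "demo@example.com"

echo "Introduction" > intro.txt
echo "Conclusion (draft)" > conclusion.txt
git add intro.txt conclusion.txt
git commit --quiet -m "Start thesis"

echo "A citation to our supervisor's work" >> intro.txt
echo "Half-finished thought" >> conclusion.txt

git add intro.txt                           # stage only the finished change
git commit --quiet -m "Add citation"

# conclusion.txt is still listed as modified: it was never staged.
git status --short
```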

Let's watch as our changes to a file move from our editor to the staging area and into long-term storage. First, we'll add another line to the file to indicate the location where we signed the pledge:

Location: San Luis Obispo, CA

$ nano pledge.txt
$ git diff

So far, so good: we've added one line to the end of the file (shown with a + in the first column). Now let's put that change in the staging area and see what git diff reports:

$ git add pledge.txt
$ git diff

There is no output: as far as Git can tell, there's no difference between what it's been asked to save permanently and what's currently in the directory. However, if we do this:

$ git diff --staged

it shows us the difference between the last committed change and what's in the staging area. Let's save our changes:

$ git commit -m 'added location'

check our status:

$ git status
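As a compact recap of the steps just shown, the sketch below asks Git which files differ at each stage. It assumes a scratch repository and a hypothetical file pledge_demo.txt; the --name-only option limits git diff to file names.

```shell
demo_flow="$(mktemp -d)"
cd "$demo_flow"
git init --quiet
git config user.name "Demo User"            # placeholder identity, this repo only
git config user.email "demo@example.com"

echo "I pledge to do my own work." > pledge_demo.txt
git add pledge_demo.txt
git commit --quiet -m "Add pledge"

echo "Location: San Luis Obispo, CA" >> pledge_demo.txt
echo "before add, unstaged: [$(git diff --name-only)]"

git add pledge_demo.txt
unstaged="$(git diff --name-only)"            # working tree vs staging area
staged="$(git diff --staged --name-only)"     # staging area vs last commit
echo "after add, unstaged: [$unstaged] staged: [$staged]"

git commit --quiet -m "Add location"
echo "after commit, staged: [$(git diff --staged --name-only)]"
```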

Now let's look at the history of what we've done so far:

$ git log

Exploring History

If we want to see what we changed when, we use git diff again, but refer to old versions using the notation HEAD~1, HEAD~2, and so on:

$ git diff HEAD~1 pledge.txt
$ git diff HEAD~2 pledge.txt

Each revision records which revision came before it, so revisions are chained together. In Git, the word HEAD refers to the most recent end of that chain, i.e., the last revision that was tacked on. Every time we commit, HEAD moves forward to point at that new latest revision. We can step backwards on the chain using the ~ notation: HEAD~1 (pronounced "head minus one") means "the previous revision", and HEAD~123 goes back 123 revisions from where we are now.

We can also refer to revisions using those long strings of digits and letters that git log displays. These are unique IDs for the changes, and "unique" really does mean unique: every change to any set of files on any machine has a unique 40-character identifier. Our first commit was given the ID f22b25e3233b4645dabd0d81e651fe074bd8e73b, so let's try this:

$ git diff f22b25e3233b4645dabd0d81e651fe074bd8e73b pledge.txt

That's the right answer, but typing random 40-character strings is annoying, so Git lets us use just the first few:

$ git diff f22b25e pledge.txt
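Both ways of naming a revision can be tried in a scratch repository with three commits. This is a sketch under assumptions: the file history_demo.txt is hypothetical, and the identifiers printed will not match the ones quoted in this lesson.

```shell
demo_hist="$(mktemp -d)"
cd "$demo_hist"
git init --quiet
git config user.name "Demo User"            # placeholder identity, this repo only
git config user.email "demo@example.com"

# Build a chain of three revisions.
for n in 1 2 3; do
  echo "version $n" > history_demo.txt
  git add history_demo.txt
  git commit --quiet -m "Version $n"
done

git log --oneline                           # newest revision first

git rev-parse HEAD~1                        # full identifier of the previous revision
short="$(git rev-parse --short HEAD~2)"     # abbreviated identifier of the first revision
git rev-parse "$short"                      # Git expands the abbreviation again

# Compare the current file with the first revision, named by its short identifier.
git diff "$short" history_demo.txt
```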

Recovering Old Versions

All right: we can save changes to files and see what we've changed—how can we restore older versions of things? Let's suppose we accidentally overwrite our file. Edit the file and remove several lines from it.

$ nano pledge.txt
$ cat pledge.txt

git status now tells us that the file has been changed, but those changes haven't been staged:

$ git status

We can put things back the way they were by using git checkout:

$ git checkout HEAD pledge.txt    
$ cat pledge.txt

As you might guess from its name, git checkout checks out (i.e., restores) an old version of a file. In this case, we're telling Git that we want to recover the version of the file recorded in HEAD, which is the last saved revision. If we want to go back even further, we can use a revision identifier instead:

$ git checkout f22b25e pledge.txt

It's important to remember that we must use the revision number that identifies the state of the repository before the change we're trying to undo. A common mistake is to use the revision number of the commit in which we made the change we're trying to get rid of.

Simplifying the Common Case

If you read the output of git status carefully, you'll see that it includes this hint:

(use "git checkout -- <file>..." to discard changes in working directory)

As it says, git checkout without a version identifier discards the unstaged changes to a file; when nothing has been staged, that returns the file to the state saved in HEAD. The double dash -- is needed to separate the names of the files being recovered from the command itself: without it, Git would try to use the name of the file as the revision identifier.
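A recovery can be rehearsed safely in a scratch repository. The sketch below assumes a hypothetical file restore_demo.txt, overwrites it on purpose, and then restores the committed version.

```shell
demo_restore="$(mktemp -d)"
cd "$demo_restore"
git init --quiet
git config user.name "Demo User"            # placeholder identity, this repo only
git config user.email "demo@example.com"

printf 'line 1\nline 2\nline 3\n' > restore_demo.txt
git add restore_demo.txt
git commit --quiet -m "Add three lines"

echo "oops" > restore_demo.txt              # simulate an accidental overwrite
git status --short                          # " M" means modified but not staged

# Bring back the version recorded in HEAD; -- separates the revision from the file name.
git checkout HEAD -- restore_demo.txt
cat restore_demo.txt
```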

The fact that files can be reverted one by one tends to change the way people organize their work. If everything is in one large document, it's hard (but not impossible) to undo changes to the introduction without also undoing changes made later to the conclusion. If the introduction and conclusion are stored in separate files, on the other hand, moving backward and forward in time becomes much easier.

Version control our configureGit.sh script

Use the UNIX mv command to move the file from our home directory to our git repo. The syntax of the command is

$ mv <source> <destination>

In this case, do

$ mv ~/configureGit.sh .
$ ls

Now add and commit the file to version control:

$ git add configureGit.sh
$ git commit -m 'store global config script'
$ git status

git mv and git rm: moving and removing files

While git add is used to add files to the list Git tracks, we must also tell it if we want their names to change or for it to stop tracking them. In familiar Unix fashion, the git mv and git rm commands do precisely this:

$ git mv pledge.txt pledgecopy.txt
$ ls
$ git status

Note that these changes must be committed too, to become permanent! In Git's world, until something has been committed, it isn't permanently recorded anywhere.

$ git commit -m 'renamed pledge'
$ git status

And git rm works in a similar fashion.
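For comparison with the rename above, here is a minimal sketch of git rm in a scratch repository. The file scratch.txt is hypothetical and exists only for this illustration.

```shell
demo_rm="$(mktemp -d)"
cd "$demo_rm"
git init --quiet
git config user.name "Demo User"            # placeholder identity, this repo only
git config user.email "demo@example.com"

echo "scratch work" > scratch.txt
git add scratch.txt
git commit --quiet -m "Add scratch file"

git rm --quiet scratch.txt                  # delete the file and stage the deletion
git status --short                          # "D" marks a staged deletion
git commit --quiet -m "Remove scratch file"

ls                                          # the file is gone from the working directory
git log --oneline -- scratch.txt            # but its history is still recorded
```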

Exercise

Add a new file file2.txt, commit it, make some changes to it, commit them again, and then remove it (and don't forget to commit this last step!).

Ignoring Things

What if we have files that we do not want Git to track for us, like backup files created by our editor or intermediate files created during data analysis or annoying Mac OS X files like .DS_Store?

$ git status

shows us that the .DS_Store file is untracked. But we don't want to track it anyway. Putting these files under version control would be a waste of disk space. What's worse, having them all listed could distract us from changes that actually matter, so let's tell Git to ignore them.

We do this by creating a file in the root directory of our project called .gitignore. Add a line to the file with just the name of the file to ignore:

.DS_Store

$ nano .gitignore
$ cat .gitignore

Once we have created this file, the output of git status is much cleaner:

$ git status

The only thing Git notices now is the newly-created .gitignore file. You might think we wouldn't want to track it, but everyone we're sharing our repository with will probably want to ignore the same things that we're ignoring. Let's add and commit .gitignore:

$ git add .gitignore
$ git commit -m "Add the ignore file"    
$ git status

As a bonus, using .gitignore helps us avoid accidentally adding files to the repository that we don't want.

If we really want to override our ignore settings, we can use git add -f to force Git to add something. We can also always see the status of ignored files if we want:

$ git status --ignored
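Ignore rules can also use wildcard patterns, which is handy for the editor backup files mentioned earlier. The sketch below is an illustration under assumptions: the files and the *.bak pattern are hypothetical, and git check-ignore is used to report which rule matches a file.

```shell
demo_ignore="$(mktemp -d)"
cd "$demo_ignore"
git init --quiet

touch .DS_Store results.bak analysis.txt    # hypothetical files
printf '.DS_Store\n*.bak\n' > .gitignore    # one exact name, one wildcard pattern

git status --short                          # lists only .gitignore and analysis.txt
git check-ignore -v .DS_Store results.bak   # shows the rule that matches each file
git status --short --ignored                # "!!" marks the ignored files
```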

Key Points

Use git config to configure a user name, email address, editor, and other preferences once per machine.
git init initializes a repository.
git status shows the status of a repository.
Files can be stored in a project's working directory (which users see), the staging area (where the next commit is being built up) and the local repository (where snapshots are permanently recorded).
git add puts files in the staging area.
git commit creates a snapshot of the staging area in the local repository.
Always write a log message when committing changes.
git diff displays differences between revisions.
git checkout recovers old versions of files.
The .gitignore file tells Git what files to ignore.

Challenge

Create a new Git repository on your computer called bio. Write a three-line biography for yourself in a file called me.txt and commit your changes. Then modify one line, add a fourth, and display the differences between the file's updated state and its original state.

Collaborating with GitHub

Objectives

Explain what remote repositories are and why they are useful.
Explain what happens when a remote repository is cloned.
Explain what happens when changes are pushed to or pulled from a remote repository.

Version control really comes into its own when we begin to collaborate with other people. We already have most of the machinery we need to do this; the only thing missing is to copy changes from one repository to another.

Systems like Git allow us to move work between any two repositories. In practice, though, it's easiest to use one copy as a central hub, and to keep it on the web rather than on someone's laptop. Most programmers use hosting services like GitHub or Bitbucket to hold those master copies.

Let's start by sharing the changes we've made to our current project with the world. Create an account using your Cal Poly User name (if taken, add 202 to the end) at GitHub. Log in, then click on the icon in the top right corner to create a new repository called PHYS202-S14.

Name your repository "PHYS202-S14" and then click "Create Repository".

As soon as the repository is created, GitHub displays a page with a URL and some information on how to configure your local repository.

This effectively does the following on GitHub's servers:

$ mkdir PHYS202-S14
$ cd PHYS202-S14
$ git init

Our local repository still contains our earlier work on pledge.txt, but the remote repository on GitHub doesn't contain any files yet.

The next step is to connect the two repositories. We do this by making the GitHub repository a remote for the local repository. The home page of the repository on GitHub includes the string we need to identify it.

Click on the 'HTTPS' link to change the protocol from SSH to HTTPS. It's slightly less convenient for day-to-day use, but much less work for beginners to set up.

Copy that URL from the browser, go into the local PHYS202-S14 repository in the terminal, and run this command:

$ git remote add origin https://github.com/<USERNAME>/PHYS202-S14

We can check that the command has worked by running

$ git remote -v

The name origin is a local nickname for your remote repository: we could use something else if we wanted to, but origin is by far the most common choice.

Once the nickname origin is set up, this command will push the changes from our local repository to the repository on GitHub:

$ git push origin master

Our local and remote repositories are now in sync with each other. Check out the website to see that your pledge.txt file is now in the cloud.

We can pull changes from the remote repository to the local one as well:

$ git pull origin master

Pulling has no effect in this case because the two repositories are already synchronized. If someone else had pushed some changes to the repository on GitHub, though, this command would download them to our local repository.

We can simulate working with a collaborator using another copy of the repository on our local machine. To do this, open a new terminal window. It should place you in your home directory by default. cd to the directory /tmp. (Note the absolute path: don't make tmp a subdirectory of the existing repository). Instead of creating a new repository here with git init, we will clone the existing repository from GitHub:

$ cd /tmp
$ git clone https://github.com/<USERNAME>/PHYS202-S14.git

git clone creates a fresh local copy of a remote repository. (We did it in /tmp or some other directory so that we don't overwrite our existing PHYS202-S14 directory.) Our computer now has two copies of the repository.

Let's make a change in the copy in /tmp/PHYS202-S14. Let's add our configureGit.sh script to the repository.

$ cd /tmp/PHYS202-S14
$ mv ~/configureGit.sh .
$ ls
$ ls ~

The mv command stands for "move" and it removes the file from our home directory ~ and places it into /tmp/PHYS202-S14.

Let's add it to our staging area and then commit it permanently to storage in our local repository and then let's send it to GitHub:

$ git add configureGit.sh
$ git commit -m "configure script so I don't have to remember rarely used commands"
$ git push origin master

Note that we didn't have to create a remote called origin: Git does this automatically, using that name, when we clone a repository. (This is why origin was a sensible choice earlier when we were setting up remotes by hand.)

We can now download changes into the original repository on our machine. Close the terminal in which you were working in the /tmp area and go back to your other terminal window.

$ pwd
$ git pull origin master

Now all three copies of our repo (the one in ~/PHYS202-S14, the one in /tmp/PHYS202-S14, and the one on GitHub) are in sync.

In practice, we would probably never have two copies of the same remote repository on the same computer at once. Instead, one of those copies would be on our laptop, and the other on a lab machine, or on someone else's computer. Pushing and pulling changes gives us a reliable way to share work between different people and machines.
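The push, clone, and pull round trip can be practiced without a network connection. In the sketch below a bare repository on the local disk stands in for GitHub; that substitution, the directory names, and the file shared.txt are all assumptions made for illustration. The branch name is read from Git rather than typed, since it may be master or main depending on the installation.

```shell
hub="$(mktemp -d)/hub.git"                  # stands in for the GitHub repository
work="$(mktemp -d)/work"                    # our original local copy
copy="$(mktemp -d)/copy"                    # the collaborator's copy

git init --quiet --bare "$hub"
git init --quiet "$work"
cd "$work"
git config user.name "Demo User"            # placeholder identity, this repo only
git config user.email "demo@example.com"
echo "shared file" > shared.txt
git add shared.txt
git commit --quiet -m "Add shared file"
branch="$(git symbolic-ref --short HEAD)"   # name of the current branch

git remote add origin "$hub"                # connect the local repository to the hub
git remote -v
git push --quiet origin "$branch"

git clone --quiet "$hub" "$copy"            # origin is set up automatically here
cd "$copy"
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "added in the clone" >> shared.txt
git add shared.txt
git commit --quiet -m "Edit from the clone"
git push --quiet origin "$branch"

cd "$work"
git pull --quiet origin "$branch"           # download the collaborator's change
cat shared.txt
```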

Key Points

A local Git repository can be connected to one or more remote repositories.
Use the HTTPS protocol to connect to remote repositories until you have learned how to set up SSH.
git push copies changes from a local repository to a remote repository.
git pull copies changes from a remote repository to a local repository.
git clone copies a remote repository to create a local repository with a remote called origin automatically set up.

Conflicts

Objectives

Explain what conflicts are and when they can occur.
Resolve conflicts resulting from a merge.

As soon as people can work in parallel, someone's going to step on someone else's toes. This will even happen with a single person: if we are working on a piece of software on both our laptop and a server in the lab, we could make different changes to each copy. Version control helps us manage these conflicts by giving us tools to resolve overlapping changes.

To see how we can resolve conflicts, we must first create one. The file pledge.txt currently looks like this in both local copies of our PHYS202-S14 repository (the one in our home directory and the one in /tmp):

$ cat pledge.txt

Let's add a line to the copy under our home directory:

This line added to our home copy

$ nano pledge.txt

and then push the change to GitHub:

$ git add pledge.txt
$ git commit -m "Adding a line in our home copy"
$ git push origin master

Now we have one local copy in sync with GitHub and one that is out of sync (the one in /tmp).

Open a fresh terminal window and cd into the /tmp directory. Now let's make a different change there without updating from GitHub:

We added a different line in the temporary copy

$ cd /tmp/PHYS202-S14
$ nano pledge.txt
$ cat pledge.txt

We can commit the change locally:

$ git add pledge.txt
$ git commit -m "Adding a line in the temporary copy"

but Git won't let us push it to GitHub:

$ git push origin master

Git detects that the changes made in one copy overlap with those made in the other and stops us from trampling on our previous work. What we have to do is pull the changes from GitHub, merge them into the copy we're currently working in, and then push that. Let's start by pulling:

$ git pull origin master

git pull tells us there's a conflict, and marks that conflict in the affected file:

$ cat pledge.txt

...
Location: San Luis Obispo, CA
<<<<<<< HEAD
We added a different line in the temporary copy
=======
This line added to our home copy
>>>>>>> dabb4c8c450e8475aee9b14b4383acc99f42af1d

Our change—the one in HEAD—is preceded by <<<<<<<. Git has then inserted ======= as a separator between the conflicting changes and marked the end of the content downloaded from GitHub with >>>>>>>. (The string of letters and digits after that marker identifies the revision we've just downloaded.)

It is now up to us to edit this file to remove these markers and reconcile the changes. We can do anything we want: keep the change in this branch, keep the change made in the other, write something new to replace both, or get rid of the change entirely. Let's resolve the conflict by removing both added lines altogether.

To finish merging, we add pledge.txt to the changes being made by the merge and then commit:

$ git add pledge.txt
$ git status

$ git commit -m "Merging changes from GitHub"

We still need to send the revision to GitHub:

$ git push origin master

Git keeps track of what we've merged with what, so we don't have to fix things by hand again if we switch back to the repository in our home directory and pull from GitHub. Go back to your other terminal window:

$ cd ~/PHYS202-S14
$ git pull origin master

we get the merged file.

$ cat pledge.txt

We don't need to merge again because the merge is recorded in the history, so Git knows someone has already done that.

Version control's ability to merge conflicting changes is another reason users tend to divide their programs and papers into multiple files instead of storing everything in one large file. There's another benefit too: whenever there are repeated conflicts in a particular file, the version control system is essentially trying to tell its users that they ought to clarify who's responsible for what, or find a way to divide the work up differently.
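The conflict walk-through above can be reproduced offline as a sketch. It assumes a local bare repository in place of GitHub, hypothetical directory and file names, and a resolution that keeps both lines (one of the choices listed earlier, not the one taken in the lesson). The --no-rebase option is passed to git pull because recent Git versions ask how to reconcile diverged histories.

```shell
hub_c="$(mktemp -d)/hub.git"                # stands in for the GitHub repository
home_copy="$(mktemp -d)/home"
tmp_copy="$(mktemp -d)/tmp"

git init --quiet --bare "$hub_c"
git init --quiet "$home_copy"
cd "$home_copy"
git config user.name "Demo User"            # placeholder identity, this repo only
git config user.email "demo@example.com"
echo "Location: San Luis Obispo, CA" > pledge_demo.txt
git add pledge_demo.txt
git commit --quiet -m "Add pledge"
branch="$(git symbolic-ref --short HEAD)"
git remote add origin "$hub_c"
git push --quiet origin "$branch"
git clone --quiet "$hub_c" "$tmp_copy"

# The home copy adds a line and pushes it.
echo "This line added to our home copy" >> pledge_demo.txt
git add pledge_demo.txt
git commit --quiet -m "Adding a line in our home copy"
git push --quiet origin "$branch"

# The temporary copy adds a different line without pulling first.
cd "$tmp_copy"
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "We added a different line in the temporary copy" >> pledge_demo.txt
git add pledge_demo.txt
git commit --quiet -m "Adding a line in the temporary copy"
git push --quiet origin "$branch" || echo "push rejected: the hub has newer work"
git pull --no-rebase origin "$branch" || echo "pull stopped: conflict to resolve"
cat pledge_demo.txt                         # shows the conflict markers

# Resolve by writing the file we want (here: keep both lines), then finish the merge.
printf '%s\n' "Location: San Luis Obispo, CA" \
  "This line added to our home copy" \
  "We added a different line in the temporary copy" > pledge_demo.txt
git add pledge_demo.txt
git commit --quiet -m "Merging changes from the hub"
git push --quiet origin "$branch"
```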

Key Points

Conflicts occur when two or more people change the same file(s) at the same time.
The version control system does not allow people to blindly overwrite each other's changes. Instead, it highlights conflicts so that they can be resolved.

Establishing a Workflow

To make sure your local and remote repositories are always in sync, develop a "workflow", which means the habitual way you interact with the computer to accomplish your work. Here is a good suggestion to get you started.

First time you sit down at the computer (in lab or at home):

$ cd PHYS202-S14
$ git pull origin master
$ ipython notebook

Create new notebooks, save work as you go. When you are done for the day, close the notebook server and do:

$ git status
$ git add <any new or modified files>
$ git commit -m "<comment about what changed or was added>"
$ git push origin master
$ git status

Replace the text inside the chevrons < > with the appropriate information.

Make sure the status says

On branch master
Your branch is up-to-date with 'origin/master'.

nothing to commit, working directory clean

Then, open the web browser to http://www.github.com and double-check that your changes have appeared. If so, you can now close your terminal window and log out.

All content is under a modified MIT License, and can be freely used and adapted. See the full license text here.
