This lesson is adapted from the lesson of the same name by Software Carpentry.
This lesson can be done interactively with the students and this notebook distributed for future reference.
Version control is the lab notebook of the digital world: it's what professionals use to keep track of what they've done and to collaborate with other people. Every large software development project relies on it, and most programmers use it for their small jobs as well. And it isn't just for software: books, papers, small data sets, and anything that changes over time or needs to be shared can and should be stored in a version control system.
version control: A tool for managing changes to a set of files. Each set of changes creates a new revision of the files; the version control system allows users to recover old revisions reliably, and helps manage conflicting changes made by different users.
Two people want to collaborate on a software program at the same time, but they have run into problems doing this in the past. If they take turns, each one will spend a lot of time waiting for the other to finish, but if they work on their own copies and email changes back and forth, things will be lost, overwritten, or duplicated.
The right solution is to use version control to manage their work. Version control is better than mailing files back and forth because:
The first time we use Git on a new machine, we need to configure a few things. Since we may be moving around among computers on Cal Poly's Active Directory, we are going to set up a script to help us with that.
Use nano to create a new file in your home directory called configureGit.sh
$ nano configureGit.sh
The file should contain:
#!/bin/bash
git config --global user.name "<YOUR NAME>"
git config --global user.email "<YOUR CAL POLY EMAIL>"
git config --global color.ui "auto"
git config --global core.editor "nano"
Git commands are written git verb, where verb is what we actually want it to do. In this case, we're telling Git our name and email address, to colorize its output, and to use nano as our editor.
The four commands in our file only need to be run once: the flag --global tells Git to use the settings for every project on this machine. Once we have the script (that's what the .sh in <file>.sh means), we can just run it whenever we need to reconfigure Git, rather than remembering the commands.
Ok, we've created the script, now we need to run it. To do that type:
$ source configureGit.sh
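To check that the settings took effect, we can read them back. The following is a minimal sketch, not part of the original lesson: it points HOME at a throwaway directory so that our real configuration is left alone, and the name and email are placeholders.

```shell
# Assumption: a scratch HOME keeps this experiment away from our real ~/.gitconfig.
# Run it in a separate terminal, since it changes HOME for the rest of the session.
export HOME="$(mktemp -d)"
unset XDG_CONFIG_HOME

# The same four settings as in configureGit.sh, with placeholder values
git config --global user.name "Example Student"
git config --global user.email "student@example.edu"
git config --global color.ui "auto"
git config --global core.editor "nano"

# Read individual settings back
configured_name="$(git config --global --get user.name)"
configured_email="$(git config --global --get user.email)"
configured_editor="$(git config --global --get core.editor)"
echo "name:   $configured_name"
echo "email:  $configured_email"
echo "editor: $configured_editor"

# List everything stored at the global level
git config --global --list
```

On our own machine, running only the last command (without changing HOME) shows the settings that configureGit.sh stored.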
Once Git is configured, we can start using it. Let's version control the directory we created last time by telling Git to make it a repository - a place where Git can store old versions of our files:
$ cd PHYS202-S14
$ git init
If we use ls to show the directory's contents, it appears that nothing has changed, but if we add the -a flag to show everything, we can see that Git has created a hidden directory called .git.
Git stores information about the project in this special sub-directory. If we ever delete it, we will lose the project's history.
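As a small sketch of the step above, the commands below create a repository in a scratch directory (an assumption made so that nothing in PHYS202-S14 is touched) and show where Git keeps its data.

```shell
# Hypothetical scratch directory standing in for our project directory
repo="$(mktemp -d)"
cd "$repo" || exit 1

git init --quiet

ls        # no ordinary files yet
ls -a     # the hidden .git directory is now listed

# Ask Git itself where the repository data lives
git_dir="$(git rev-parse --git-dir)"
echo "Git directory: $git_dir"
```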
We can check that everything is set up correctly by asking Git to tell us the status of our project:
$ git status
The "untracked files" message means that there's a file in the directory that Git isn't keeping track of. We can tell Git that it should do so using git add:
$ git add pledge.txt
and then check that the right thing happened:
$ git status
pledge.txt is now in the index - Git now knows that it's supposed to keep track of this file, but it hasn't yet recorded any changes for posterity as a commit. To get it to do that, we need to run one more command:
$ git commit -m "Academic honesty pledge"
When we run git commit, Git takes everything we have told it to save by using git add and stores a copy permanently inside the special .git directory. This permanent copy is called a revision and its short identifier is a number such as cf4fb04. (Your revision may have another identifier.)
We use the -m flag (for "message") to record a comment that will help us remember later on what we did and why. If we just run git commit without the -m option, Git will launch nano (or whatever other editor we configured at the start) so that we can write a longer message.
If we run git status now:
$ git status
it tells us everything is up to date. If we want to know what we've done recently, we can ask Git to show us the project's history using git log:
$ git log
git log lists all revisions made to a repository in reverse chronological order. The listing for each revision includes the revision's full identifier (which starts with the same characters as the short identifier printed by the git commit command earlier), the revision's author, when it was created, and the log message Git was given when the revision was created.
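The relationship between the short and full identifiers can be seen directly. This is a sketch under assumptions: it works in a scratch repository, and the file content, name, and email are placeholders.

```shell
# Hypothetical scratch repository with a placeholder identity
repo="$(mktemp -d)"
cd "$repo" || exit 1
git init --quiet
git config user.name "Example Student"
git config user.email "student@example.edu"

echo "I pledge to do my own work." > pledge.txt   # placeholder content
git add pledge.txt
git commit --quiet -m "Academic honesty pledge"

# The short identifier is a prefix of the full one
short_id="$(git rev-parse --short HEAD)"
full_id="$(git rev-parse HEAD)"
echo "short: $short_id"
echo "full:  $full_id"

# One line per revision, newest first
git log --oneline
```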
If we run ls at this point, we will still see just one file called pledge.txt. That's because Git saves information about files' history in the special .git directory mentioned earlier so that our filesystem doesn't become cluttered (and so that we can't accidentally edit or delete an old version).
Let's edit the pledge.txt file we created previously to change the date to today's date.
$ nano pledge.txt
When we run git status now, it tells us that a file it already knows about has been modified:
$ git status
The last line is the key phrase: "no changes added to commit". We have changed this file in our working tree, but we haven't promoted those changes to the index or saved them as a commit. Let's double-check our work using git diff, which shows us the differences between the current state of the file and the most recently saved version:
$ git diff
The output is cryptic because it is actually a series of commands for tools like editors and patch telling them how to reconstruct one file given the other. We can break it down into pieces, starting with the first line, which shows that Git is using a diff command to compare the old and new versions of the file.
Let's commit our change:
$ git commit -m 'update the date in pledge file'
Whoops: Git won't commit because we didn't use git add first - there's nothing in the index and nothing for Git to make a commit out of! Remember to promote our work from the working tree to the index first using git add:
$ git add pledge.txt
$ git commit -m 'update the date in pledge file'
Git insists that we add files to the set we want to commit before actually committing anything because we may not want to commit everything at once. For example, suppose we're adding a few citations to our supervisor's work to our thesis. We might want to commit those additions, and the corresponding addition to the bibliography, but not commit the work we're doing on the conclusion (which we haven't finished yet).
To allow for this, Git has a special staging area where it keeps track of things that have been added to the current change set but not yet committed. git add puts things in this area (the index), and git commit then copies them to long-term storage (as a commit):
Working files (what we see) --> git add --> Staging area (ready to commit) --> git commit --> Repository (permanent storage)
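The thesis example can be sketched as follows. The file names thesis.txt, bibliography.txt, and conclusion.txt are hypothetical, as are the scratch repository and the placeholder identity; the point is only that git add lets us commit two of the three changed files.

```shell
# Hypothetical scratch repository with a placeholder identity
repo="$(mktemp -d)"
cd "$repo" || exit 1
git init --quiet
git config user.name "Example Student"
git config user.email "student@example.edu"

# Starting point: three files, all committed
echo "Chapter 1" > thesis.txt
echo "References" > bibliography.txt
echo "Draft conclusion" > conclusion.txt
git add thesis.txt bibliography.txt conclusion.txt
git commit --quiet -m "Start thesis"

# Change all three files
echo "As shown by our supervisor [1]." >> thesis.txt
echo "[1] Supervisor's paper" >> bibliography.txt
echo "Unfinished thoughts" >> conclusion.txt

# Stage only the citation and the bibliography entry
git add thesis.txt bibliography.txt
staged="$(git diff --staged --name-only)"   # files in the staging area
unstaged="$(git diff --name-only)"          # changed but not staged
echo "staged:   $staged"
echo "unstaged: $unstaged"

git commit --quiet -m "Cite supervisor's work"

# The conclusion is still modified and uncommitted
remaining="$(git status --porcelain)"
echo "remaining: $remaining"
```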
Let's watch as our changes to a file move from our editor to the staging area and into long-term storage. First, we'll add another line to the file to indicate the location where we signed the pledge:
Location: San Luis Obispo, CA
$ nano pledge.txt
$ git diff
So far, so good: we've added one line to the end of the file (shown with a + in the first column). Now let's put that change in the staging area and see what git diff reports:
$ git add pledge.txt
$ git diff
There is no output: as far as Git can tell, there's no difference between what it's been asked to save permanently and what's currently in the directory. However, if we do this:
$ git diff --staged
it shows us the difference between the last committed change and what's in the staging area. Let's save our changes:
$ git commit -m 'added location'
and check our status:
$ git status
Now let's look at the history of what we've done so far:
$ git log
If we want to see what we changed when, we use git diff again, but refer to old versions using the notation HEAD~1, HEAD~2, and so on:
$ git diff HEAD~1 pledge.txt
$ git diff HEAD~2 pledge.txt
Revisions are chained together, each one building on the one before it. In Git, the word HEAD always refers to the most recent end of that chain, i.e., the last revision that was tacked on. Every time we commit, HEAD moves forward to point at that new latest revision. We can step backwards on the chain using the ~ notation: HEAD~1 (pronounced "head minus one") means "the previous revision", and HEAD~123 goes back 123 revisions from where we are now.
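To make the ~ notation concrete, here is a sketch that builds a chain of three revisions in a scratch repository (an assumption, with placeholder content and identity) and then reads the file as it was at each step.

```shell
# Hypothetical scratch repository with a placeholder identity
repo="$(mktemp -d)"
cd "$repo" || exit 1
git init --quiet
git config user.name "Example Student"
git config user.email "student@example.edu"

# Three revisions of the same file
for n in 1 2 3; do
    echo "version $n" > pledge.txt
    git add pledge.txt
    git commit --quiet -m "Version $n"
done

# git show REVISION:FILE prints the file as recorded in that revision
latest="$(git show HEAD:pledge.txt)"
previous="$(git show HEAD~1:pledge.txt)"
oldest="$(git show HEAD~2:pledge.txt)"
echo "HEAD:   $latest"
echo "HEAD~1: $previous"
echo "HEAD~2: $oldest"

# Differences between the oldest revision and the current file
git diff HEAD~2 pledge.txt
```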
We can also refer to revisions using those long strings of digits and letters that git log displays. These are unique IDs for the changes, and "unique" really does mean unique: every change to any set of files on any machine has a unique 40-character identifier. Our first commit was given the ID f22b25e3233b4645dabd0d81e651fe074bd8e73b, so let's try this:
$ git diff f22b25e3233b4645dabd0d81e651fe074bd8e73b pledge.txt
That's the right answer, but typing random 40-character strings is annoying, so Git lets us use just the first few:
$ git diff f22b25e pledge.txt
All right: we can save changes to files and see what we've changed. How can we restore older versions of things? Let's suppose we accidentally overwrite our file. Edit the file and remove several lines from it.
$ nano pledge.txt
$ cat pledge.txt
git status now tells us that the file has been changed, but those changes haven't been staged:
$ git status
We can put things back the way they were by using git checkout:
$ git checkout HEAD pledge.txt
$ cat pledge.txt
As you might guess from its name, git checkout checks out (i.e., restores) an old version of a file. In this case, we're telling Git that we want to recover the version of the file recorded in HEAD, which is the last saved revision. If we want to go back even further, we can use a revision identifier instead:
$ git checkout f22b25e pledge.txt
It's important to remember that we must use the revision number that identifies the state of the repository before the change we're trying to undo. A common mistake is to use the revision number of the commit in which we made the change we're trying to get rid of.
If you read the output of git status carefully, you'll see that it includes this hint:
(use "git checkout -- <file>..." to discard changes in working directory)
As it says, git checkout without a version identifier restores files to the state saved in HEAD. The double dash -- is needed to separate the names of the files being recovered from the command itself: without it, Git would try to use the name of the file as the revision identifier.
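The recovery steps can be replayed safely in a scratch repository. This is a sketch under assumptions: the three-line file content and the identity are placeholders, and the overwrite is simulated with echo instead of nano.

```shell
# Hypothetical scratch repository with a placeholder identity
repo="$(mktemp -d)"
cd "$repo" || exit 1
git init --quiet
git config user.name "Example Student"
git config user.email "student@example.edu"

printf 'line one\nline two\nline three\n' > pledge.txt
git add pledge.txt
git commit --quiet -m "Academic honesty pledge"

# Simulate an accidental overwrite
echo "oops" > pledge.txt
git status --short      # reports pledge.txt as modified

# Restore the version recorded in HEAD; -- separates the revision from the file name
git checkout HEAD -- pledge.txt

restored="$(cat pledge.txt)"
echo "$restored"
```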
The fact that files can be reverted one by one tends to change the way people organize their work. If everything is in one large document, it's hard (but not impossible) to undo changes to the introduction without also undoing changes made later to the conclusion. If the introduction and conclusion are stored in separate files, on the other hand, moving backward and forward in time becomes much easier.
Use the Unix mv command to move the file from our home directory to our Git repository. The syntax of the command is
$ mv <source> <destination>
In this case, do
$ mv ~/configureGit.sh .
$ ls
Now add and commit the file to version control:
$ git add configureGit.sh
$ git commit -m 'store global config script'
$ git status
While git add is used to add files to the list Git tracks, we must also tell it if we want their names to change or for it to stop tracking them. In familiar Unix fashion, the git mv and git rm commands do precisely this:
$ git mv pledge.txt pledgecopy.txt
$ ls
$ git status
Note that these changes must be committed too, to become permanent! In Git's world, until something has been committed, it isn't permanently recorded anywhere.
$ git commit -m 'renamed pledge'
$ git status
And git rm works in a similar fashion.
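A rename followed by a removal might look like the sketch below. It runs in a scratch repository with placeholder content and identity (assumptions), so the real pledge.txt is not affected.

```shell
# Hypothetical scratch repository with a placeholder identity
repo="$(mktemp -d)"
cd "$repo" || exit 1
git init --quiet
git config user.name "Example Student"
git config user.email "student@example.edu"

echo "I pledge to do my own work." > pledge.txt   # placeholder content
git add pledge.txt
git commit --quiet -m "Academic honesty pledge"

# Rename the file and record the rename
git mv pledge.txt pledgecopy.txt
git status --short      # reports the staged rename
git commit --quiet -m "renamed pledge"

# Stop tracking the file and delete it from the working directory
git rm --quiet pledgecopy.txt
git commit --quiet -m "removed pledge copy"

# Nothing is tracked any more, but all three commits remain in the history
tracked="$(git ls-files)"
commit_count="$(git rev-list --count HEAD)"
echo "tracked files: '$tracked'"
echo "commits: $commit_count"
```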
Add a new file file2.txt, commit it, make some changes to it, commit them again, and then remove it (and don't forget to commit this last step!).
What if we have files that we do not want Git to track for us, like backup files created by our editor or intermediate files created during data analysis or annoying Mac OS X files like .DS_Store?
$ git status
shows us that the .DS_Store file is untracked. But we don't want to track it anyway. Putting these files under version control would be a waste of disk space. What's worse, having them all listed could distract us from changes that actually matter, so let's tell Git to ignore them.
We do this by creating a file in the root directory of our project called .gitignore. Add a line to the file with just the name of the file to ignore:
.DS_Store
$ nano .gitignore
$ cat .gitignore
Once we have created this file, the output of git status is much cleaner:
$ git status
The only thing Git notices now is the newly-created .gitignore file. You might think we wouldn't want to track it, but everyone we're sharing our repository with will probably want to ignore the same things that we're ignoring. Let's add and commit .gitignore:
$ git add .gitignore
$ git commit -m "Add the ignore file"
$ git status
As a bonus, using .gitignore helps us avoid accidentally adding files to the repository that we don't want.
If we really want to override our ignore settings, we can use git add -f to force Git to add something. We can also always see the status of ignored files if we want:
$ git status --ignored
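The effect of .gitignore can be checked step by step. The sketch below is an illustration under assumptions: it uses a scratch repository, an empty stand-in for .DS_Store, and a hypothetical notes.txt to show that ordinary files are still reported.

```shell
# Hypothetical scratch repository
repo="$(mktemp -d)"
cd "$repo" || exit 1
git init --quiet

# An empty stand-in for .DS_Store and a file we do care about
touch .DS_Store notes.txt

# Before: both files are reported as untracked (??)
before="$(git status --porcelain)"
echo "$before"

# Tell Git to ignore .DS_Store
echo ".DS_Store" > .gitignore

# After: .DS_Store is gone from the listing; .gitignore itself is new
after="$(git status --porcelain)"
echo "$after"

# Ignored files are still visible on request (marked !!)
ignored="$(git status --porcelain --ignored)"
echo "$ignored"
```

The --porcelain option is used here only because it prints a compact format that is stable for scripts; plain git status shows the same information in a longer form.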
Use git config to configure a user name, email address, editor, and other preferences once per machine.
git init initializes a repository.
git status shows the status of a repository.
git add puts files in the staging area.
git commit creates a snapshot of the staging area in the local repository.
git diff displays differences between revisions.
git checkout recovers old versions of files.
The .gitignore file tells Git what files to ignore.
Create a new Git repository on your computer called bio. Write a three-line biography for yourself in a file called me.txt, commit your changes, then modify one line, add a fourth, and display the differences between its updated state and its original state.
Version control really comes into its own when we begin to collaborate with other people. We already have most of the machinery we need to do this; the only thing missing is to copy changes from one repository to another.
Systems like Git allow us to move work between any two repositories. In practice, though, it's easiest to use one copy as a central hub, and to keep it on the web rather than on someone's laptop. Most programmers use hosting services like GitHub or Bitbucket to hold those master copies.
Let's start by sharing the changes we've made to our current project with the world. Create an account using your Cal Poly user name (if taken, add 202 to the end) at GitHub. Log in, then click on the icon in the top right corner to create a new repository called PHYS202-S14.
Name your repository "PHYS202-S14" and then click "Create Repository".
As soon as the repository is created, GitHub displays a page with a URL and some information on how to configure your local repository.
This effectively does the following on GitHub's servers:
$ mkdir PHYS202-S14
$ cd PHYS202-S14
$ git init
Our local repository still contains our earlier work on pledge.txt, but the remote repository on GitHub doesn't contain any files yet.
The next step is to connect the two repositories. We do this by making the GitHub repository a remote for the local repository. The home page of the repository on GitHub includes the string we need to identify it.
Click on the 'HTTPS' link to change the protocol from SSH to HTTPS. It's slightly less convenient for day-to-day use, but much less work for beginners to set up.
Copy that URL from the browser, go into the local PHYS202-S14 repository in the terminal, and run this command:
$ git remote add origin https://github.com/<USERNAME>/PHYS202-S14
We can check that the command has worked by running
$ git remote -v
The name origin is a local nickname for your remote repository: we could use something else if we wanted to, but origin is by far the most common choice.
Once the nickname origin is set up, this command will push the changes from our local repository to the repository on GitHub:
$ git push origin master
Our local and remote repositories are now in sync with each other. Check out the website to see that your pledge.txt file is now in the cloud.
We can pull changes from the remote repository to the local one as well:
$ git pull origin master
Pulling has no effect in this case because the two repositories are already synchronized. If someone else had pushed some changes to the repository on GitHub, though, this command would download them to our local repository.
We can simulate working with a collaborator using another copy of the repository on our local machine. To do this, open a new terminal window. It should place you in your home directory by default. cd to the directory /tmp. (Note the absolute path: don't make tmp a subdirectory of the existing repository.) Instead of creating a new repository here with git init, we will clone the existing repository from GitHub:
$ cd /tmp
$ git clone https://github.com/<USERNAME>/PHYS202-S14.git
git clone creates a fresh local copy of a remote repository. (We did it in /tmp or some other directory so that we don't overwrite our existing PHYS202-S14 directory.) Our computer now has two copies of the repository.
Let's make a change in the copy in /tmp/PHYS202-S14. Let's add our configureGit.sh script to the repository.
$ cd /tmp/PHYS202-S14
$ mv ~/configureGit.sh .
$ ls
$ ls ~
The mv command stands for "move" and it removes the file from our home directory ~ and places it into /tmp/PHYS202-S14.
Let's add it to our staging area and then commit it permanently to storage in our local repository and then let's send it to GitHub:
$ git add configureGit.sh
$ git commit -m "configure script so I don't have to remember rarely used commands"
$ git push origin master
Note that we didn't have to create a remote called origin: Git does this automatically, using that name, when we clone a repository. (This is why origin was a sensible choice earlier when we were setting up remotes by hand.)
We can now download changes into the original repository on our machine. Close the terminal in which you were working in the /tmp area and go back to your other terminal window.
$ pwd
$ git pull origin master
Now all three copies of our repo (the one in ~/PHYS202-S14, the one in /tmp/PHYS202-S14, and the one on GitHub) are in sync.
In practice, we would probably never have two copies of the same remote repository on the same computer at once. Instead, one of those copies would be on our laptop, and the other on a lab machine, or on someone else's computer. Pushing and pulling changes gives us a reliable way to share work between different people and machines.
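The whole round trip can be rehearsed without a network connection. In this sketch a local bare repository (hub.git) stands in for GitHub; that stand-in, the directory names home and tmp-copy, the file content, and the identity are all assumptions made for illustration. The symbolic-ref commands just make sure the branch is called master, as in this lesson, regardless of how Git is configured.

```shell
work="$(mktemp -d)"     # hypothetical scratch area

# A bare repository stands in for the empty repository on GitHub
git init --quiet --bare "$work/hub.git"
git --git-dir="$work/hub.git" symbolic-ref HEAD refs/heads/master

# First copy: create, commit, connect to the remote, push
git init --quiet "$work/home"
cd "$work/home" || exit 1
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Student"
git config user.email "student@example.edu"
echo "I pledge to do my own work." > pledge.txt
git add pledge.txt
git commit --quiet -m "Academic honesty pledge"
git remote add origin "$work/hub.git"
git push --quiet origin master

# Second copy: clone (origin is set up automatically), change, push
git clone --quiet "$work/hub.git" "$work/tmp-copy"
cd "$work/tmp-copy" || exit 1
git config user.name "Example Student"
git config user.email "student@example.edu"
echo "Location: San Luis Obispo, CA" >> pledge.txt
git add pledge.txt
git commit --quiet -m "added location"
git push --quiet origin master

# Back in the first copy: pull the collaborator's change
cd "$work/home" || exit 1
git pull --quiet origin master
cat pledge.txt
```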
git push copies changes from a local repository to a remote repository.
git pull copies changes from a remote repository to a local repository.
git clone copies a remote repository to create a local repository with a remote called origin automatically set up.
As soon as people can work in parallel, someone's going to step on someone else's toes. This will even happen with a single person: if we are working on a piece of software on both our laptop and a server in the lab, we could make different changes to each copy. Version control helps us manage these conflicts by giving us tools to resolve overlapping changes.
To see how we can resolve conflicts, we must first create one. The file pledge.txt currently looks like this in both local copies of our PHYS202-S14 repository (the one in our home directory and the one in /tmp):
$ cat pledge.txt
Let's add a line to the copy under our home directory:
This line added to our home copy
$ nano pledge.txt
and then push the change to GitHub:
$ git add pledge.txt
$ git commit -m "Adding a line in our home copy"
$ git push origin master
Now we have one local copy in sync with GitHub and one that is out of sync (the one in /tmp).
Open a fresh terminal window and cd into the /tmp directory. Now let's make a different change there without updating from GitHub:
We added a different line in the temporary copy
$ cd /tmp/PHYS202-S14
$ nano pledge.txt
$ cat pledge.txt
We can commit the change locally:
$ git add pledge.txt
$ git commit -m "Adding a line in the temporary copy"
but Git won't let us push it to GitHub:
$ git push origin master
Git detects that the changes made in one copy overlap with those made in the other and stops us from trampling on our previous work. What we have to do is pull the changes from GitHub, merge them into the copy we're currently working in, and then push that. Let's start by pulling:
$ git pull origin master
git pull tells us there's a conflict, and marks that conflict in the affected file:
$ cat pledge.txt
...
Location: San Luis Obispo, CA
<<<<<<< HEAD
We added a different line in the temporary copy
=======
This line added to our home copy
>>>>>>> dabb4c8c450e8475aee9b14b4383acc99f42af1d
Our change (the one in HEAD) is preceded by <<<<<<<. Git has then inserted ======= as a separator between the conflicting changes and marked the end of the content downloaded from GitHub with >>>>>>>. (The string of letters and digits after that marker identifies the revision we've just downloaded.)
It is now up to us to edit this file to remove these markers and reconcile the changes. We can do anything we want: keep the change in this branch, keep the change made in the other, write something new to replace both, or get rid of the change entirely. Let's fix both by removing the added lines altogether.
To finish merging, we add pledge.txt to the changes being made by the merge and then commit:
$ git add pledge.txt
$ git status
$ git commit -m "Merging changes from GitHub"
We still need to send the revision to GitHub:
$ git push origin master
Git keeps track of what we've merged with what, so we don't have to fix things by hand again if we switch back to the repository in our home directory and pull from GitHub. Go back to your other terminal window:
$ cd ~/PHYS202-S14
$ git pull origin master
We get the merged file:
$ cat pledge.txt
We don't need to merge again because Git knows someone has already done that.
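The conflict and its resolution can be reproduced offline. As before, this is a sketch under assumptions: a local bare repository (hub.git) stands in for GitHub, and the directory names, identity, and starting file content are placeholders. The --no-rebase option asks git pull to merge, which is the behavior described in this lesson; some newer Git versions refuse to pull diverged histories until a strategy is chosen.

```shell
work="$(mktemp -d)"     # hypothetical scratch area

# Stand-in for GitHub, plus a first copy that pushes the starting file
git init --quiet --bare "$work/hub.git"
git --git-dir="$work/hub.git" symbolic-ref HEAD refs/heads/master
git init --quiet "$work/home"
cd "$work/home" || exit 1
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Student"
git config user.email "student@example.edu"
echo "Location: San Luis Obispo, CA" > pledge.txt
git add pledge.txt
git commit --quiet -m "Academic honesty pledge"
git remote add origin "$work/hub.git"
git push --quiet origin master

# Second copy, cloned while everything is still in sync
git clone --quiet "$work/hub.git" "$work/tmp-copy"
git -C "$work/tmp-copy" config user.name "Example Student"
git -C "$work/tmp-copy" config user.email "student@example.edu"

# Change 1: made in the home copy and pushed
echo "This line added to our home copy" >> pledge.txt
git add pledge.txt
git commit --quiet -m "Adding a line in our home copy"
git push --quiet origin master

# Change 2: made in the other copy without pulling first
cd "$work/tmp-copy" || exit 1
echo "We added a different line in the temporary copy" >> pledge.txt
git add pledge.txt
git commit --quiet -m "Adding a line in the temporary copy"

# The push is rejected because the remote has work we have not merged
if git push --quiet origin master 2>/dev/null; then push_result="accepted"; else push_result="rejected"; fi
echo "push: $push_result"

# Pulling merges the remote change and reports the conflict
if git pull --quiet --no-rebase origin master >/dev/null 2>&1; then pull_result="clean"; else pull_result="conflict"; fi
echo "pull: $pull_result"
conflicted="$(cat pledge.txt)"
echo "$conflicted"

# Resolve as in the lesson: drop both added lines, then finish the merge
echo "Location: San Luis Obispo, CA" > pledge.txt
git add pledge.txt
git commit --quiet -m "Merging changes from GitHub"
git push --quiet origin master

# The home copy picks up the merge without any further conflict
cd "$work/home" || exit 1
git pull --quiet --no-rebase origin master
cat pledge.txt
```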
Version control's ability to merge conflicting changes is another reason users tend to divide their programs and papers into multiple files instead of storing everything in one large file. There's another benefit too: whenever there are repeated conflicts in a particular file, the version control system is essentially trying to tell its users that they ought to clarify who's responsible for what, or find a way to divide the work up differently.
To make sure your local and remote repositories are always in sync, develop a "workflow", which means the habitual way you interact with the computer to accomplish your work. Here is a good suggestion to get you started.
When you first sit down at the computer (in lab or at home):
$ cd PHYS202-S14
$ git pull origin master
$ ipython notebook
Create new notebooks, save work as you go. When you are done for the day, close the notebook server and do:
$ git status
$ git add <any new or modified files>
$ git commit -m "<comment about what changed or was added>"
$ git push origin master
$ git status
Replace the text inside the angle brackets < > with the appropriate information.
Make sure the status says
On branch master
Your branch is up-to-date with 'origin/master'.
nothing to commit, working directory clean
Then, open the web browser to http://www.github.com and double-check that your changes have appeared. If so, you can now close your terminal window and log out.
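The end-of-day check can also be scripted. The function below is a hypothetical helper (check_in_sync is our own name, not a Git command); it assumes a remote called origin and a branch called master, as used throughout this lesson, and must be run from inside the repository.

```shell
# Hypothetical helper: succeeds only if nothing is uncommitted and
# the local branch matches origin/master.
check_in_sync() {
    # Any output from --porcelain means new, modified, or staged files
    if [ -n "$(git status --porcelain)" ]; then
        echo "not in sync: uncommitted changes"
        return 1
    fi

    # Refresh our view of the remote, then compare revision identifiers
    git fetch --quiet origin
    if [ "$(git rev-parse HEAD)" != "$(git rev-parse origin/master)" ]; then
        echo "not in sync: local and origin/master differ"
        return 1
    fi

    echo "in sync"
}

# Usage, from inside the repository:
#   cd ~/PHYS202-S14
#   check_in_sync
```

It does not replace looking at the GitHub page, but it catches the two most common slips: forgetting to commit and forgetting to push.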