Intro to Git

Authors: Henry Milner, Andrew Do. Some of the material in this notebook is inspired by lectures by Prof. George Necula in CS 169.

Why git?

Your first reason for this class (and likely many classes and projects to come): It's the only way to interact with other developers, because everyone uses it.

"Everyone?" Yes. Github, the biggest host for public git repositories, has 20 million repositories. There are probably many more private repositories. (You can create either.)

Better reasons:

Work without fear. If you make a change that breaks something (or just wasn't a good idea), you can always go back.
Work on multiple computers. Much simpler and less error-prone than emailing yourself files.
Collaborate with other developers.
Maintain multiple versions.

However, git can be a little confusing. Many confusions happen because people don't understand the fundamentals you'll learn today. If you've got the basics, the impact of other confusions will be bounded, and you can probably figure out how to search for a solution.

Cloning an existing repository

We made a special repository for this section (it takes 5 seconds) here:

https://github.com/DS-100/git-intro

We'll use a Jupyter notebook, but you can run any of these commands in a Bash shell. Note that cd is a magic command in Jupyter that doesn't have a ! in front of it. !cd only works for the line you write it on.

We'll clone the repo into the /tmp folder, which the OS will wipe when you reboot. Obviously, don't do that if you want to keep the repo.



In [42]:

    
cd /tmp



In [30]:

    
# Delete the repo if it happens to already exist:
!rm -rf git-intro



In [31]:

    
# Create the repo
!git clone https://github.com/DS-100/git-intro git-intro



In [44]:

    
!ls -lh | grep git-intro



In [45]:

    
cd git-intro

Looking at files in a repo

A repository is just a directory. Let's poke around.



In [37]:

    
# What files are in the repo?
!ls -lh



In [46]:

    
# What about hidden files?
!ls -alh

The special .git directory is where git stores all its magic. If you delete it (or this whole directory), the repository won't be a repository any more.



In [49]:

    
# What's the current status, according to git?
!git status



In [50]:

    
# What's the history of the repo?
!git log



In [51]:

    
# What does README.md look like currently?
!cat README.md

Making changes: Our first commit

Suppose we want to add a file. You could create a Jupyter notebook or download an image. For simplicity, we'll just add a text file.



In [140]:

    
# We can use Python to compute the filename.
# Then we can reference Python variables in
# ! shell commands using {}, because Jupyter
# is magic.
import datetime
our_id = datetime.datetime.now().microsecond
filename = "our_file_{:d}.txt".format(our_id)
filename



In [106]:

    
!echo "The quick brown fox \
jumped over the lazy dog." > "{filename}"
!ls

Creating the file only changed the local filesystem. We can go to the repository page on Github to verify that the file hasn't been added yet. You probably wouldn't want your changes to be published immediately to the world!



In [107]:

    
!git add "{filename}"

If you check again, our file still hasn't been published to the world. In git, you package together your new files and updates to old files, and then you create a new version called a "commit."

Git maintains a "staging" or "index" area for files that you've marked for committing with git add.
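One way to picture the staging area is the toy Python model below. It is only a sketch for building intuition, not how git stores anything, and the names working_dir, index, add, and commit are made up for this example.

```python
# Toy model of the working directory, the index, and commits.
# All names are hypothetical; this is not git's real storage format.
working_dir = {"README.md": "hello", "notes.txt": "draft"}
index = {"README.md": "hello"}   # what the next commit will contain
commits = []                     # list of snapshots, oldest first

def add(path):
    """Like `git add <path>`: copy the file's current contents into the index."""
    index[path] = working_dir[path]

def commit(message):
    """Like `git commit -m <message>`: snapshot the index, not the working directory."""
    commits.append({"message": message, "files": dict(index)})

working_dir["notes.txt"] = "final"   # edit a file
add("notes.txt")                     # stage the edit
working_dir["scratch.txt"] = "tmp"   # create another file, but never stage it
commit("Add notes")

# scratch.txt is missing from the commit because it was never added.
print(sorted(commits[-1]["files"]))
```

The point of the sketch is that a commit records whatever is in the index at that moment, which is why a file you forgot to git add does not show up in it.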



In [108]:

    
!git status



In [109]:

    
!git commit -m 'Added our new file, "{filename}"'



In [110]:

    
!git status



In [111]:

    
!git log

Now our local repository has this new commit in it. Notice that the log shows the message we wrote when we made the commit. It is very tempting to write something like "stuff" here. But then it will be very hard to understand your history, and you'll lose some of the benefits of git.

For the same reason, try to make each commit a self-contained idea: You fixed a particular bug, added a particular feature, etc.

Our commit hasn't been published to other repositories yet, including the one on Github. We can check again to verify that.

To publish a commit we've created locally to another repository, we use git push. Git remembers that we cloned from the Github repository, and by default it will push to that repository. Just to be sure, let's find the name git has given to that repository, and pass that explicitly to git push.



In [112]:

    
!git remote -v



In [113]:

    
!git help push



In [120]:

    
!git push origin

Now our commit is finally visible on Github. Even if we spill coffee on our laptop, our new state will be safely recorded in the remote repository.

Going back

Oops, we didn't want that file! In fact, if you look at the history, people have been adding a bunch of silly files. We don't want any of them.

Once a commit is created, git basically never forgets about it or its contents (unless you try really hard). When your local filesystem doesn't have any outstanding changes, it's easy to switch back to an older commit.

We have previously given the name first to the first commit in the repo, which had basically nothing in it. (We'll soon see how to assign names to commits.)



In [132]:

    
!git help branch



In [145]:

    
!git branch --list



In [146]:

    
# Let's make a new name for the first commit, "going-back",
# with our ID in there so we don't conflict with other
# sections.
!git branch going-back-{our_id} first



In [147]:

    
!git branch --list



In [148]:

    
!git checkout going-back-{our_id}



In [149]:

    
!ls



In [150]:

    
!git status



In [162]:

    
!git log --graph --decorate first going-back-{our_id} master

Note: we can always get back to the commit we made with:

git checkout master

Branches and commits

Git informs us that we've switched to the going-back "branch," and in the local filesystem, neither the file we created nor any other files, other than README.md, are there any more. What do you think would happen if we made some changes and made a new commit now?

A. The previous commits would be overwritten. The master branch would disappear.
B. The previous commits would be overwritten. The master branch would now refer to our new commit.
C. A new commit would be created. The master branch would still refer to our last commit. The first branch would refer to the new commit.
D. A new commit would be created. The master branch would still refer to our last commit. The first branch would still refer to the first commit in the repository.
E. Git would ask us what to do, because it's not clear what we intended.
F. Something else?

Let's find out.



In [152]:

    
new_filename = "our_second_file_{}.txt".format(our_id)
new_filename



In [153]:

    
!echo "Text for our second file!" > {new_filename}
!ls



In [154]:

    
!git add {new_filename}
!git commit -m'Adding our second file!'



In [157]:

    
!git status



In [161]:

    
!git log --graph --decorate first going-back-{our_id} master

How does committing work?

Every commit is a snapshot of some files. A commit can never be changed. It has a unique ID assigned by git, like 20f97c1.

Humans can't work with IDs like that, so git lets us give names like master or first to commits, using git branch <name> <commit ID>. These names are called "branches." (Branches are one kind of "ref"; "tags," created with git tag, are another kind that stays fixed on one commit.) They're just names. Often master is used for the most up-to-date commit in a repository, but not always.

At any point in time, your repository is pointing to a commit. Except in unusual cases, that commit will have a name. Git gives that name its own name: HEAD. Remember: HEAD is a special kind of name. It refers to other names rather than to a commit.

When you commit:

Git creates your new commit.
To keep track of its lineage, git records that your new commit is a "child" of the current commit. That's what the lines in the git log --graph output are showing.
Git updates whatever name HEAD points to (your "current branch"). Now that name refers to the new commit.
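The three steps above can be sketched with a few Python dictionaries. This is a toy model under simplifying assumptions: the commit IDs are hard-coded (real git computes them), and the names commits, refs, head, and commit are invented for illustration.

```python
# Toy model of commits, branch names, and HEAD (not git's real data structures).
commits = {"20f97c1": {"parent": None, "files": {"README.md": "hi"}}}
refs = {"master": "20f97c1"}   # branch name -> commit ID
head = "master"                # HEAD names a branch, not a commit

def commit(new_id, files):
    """Mimic the three steps of `git commit` on the current branch."""
    parent = refs[head]                                   # the current commit
    commits[new_id] = {"parent": parent, "files": files}  # create the commit, record its parent
    refs[head] = new_id                                   # move the current branch to the new commit

# "a1b2c3d" is a hypothetical ID chosen for this example.
commit("a1b2c3d", {"README.md": "hi", "our_file.txt": "fox"})
print(refs["master"], "->", commits["a1b2c3d"]["parent"])
```

Notice that head itself never changes during a commit. Only the branch it names moves.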

Can you list all the pieces that make up the full state of your git repository?

All the commits with their IDs.
All the pointers from commits to their parents (the previous commit they built on).
All your "refs," each pointing to a commit.
The HEAD, which points to a ref.
The "working directory," which is all the actual files you see.
The "index" or "staging" area, which is all the files you've added with git add but haven't committed yet. (You can find out what's staged with git status. The staging area is confusing, so use it sparingly. Usually you should stage things and then immediately create a commit.)
A list of "remotes," which are other repositories your repository knows about. Often this is just the repository you cloned.
The last-known state of the remotes' refs.
[...there are more, but these are the main ones.]
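If it helps to see that list as a data structure, here is one possible way to write it down in Python. The class RepoState and its field names are hypothetical, and real git stores this information very differently inside .git.

```python
from dataclasses import dataclass, field

@dataclass
class RepoState:
    """Hypothetical container mirroring the list above; not git's actual layout."""
    commits: dict = field(default_factory=dict)      # commit ID -> {"parents": [...], "files": {...}}
    refs: dict = field(default_factory=dict)         # branch name -> commit ID
    head: str = "master"                             # name of the ref HEAD points to
    working_dir: dict = field(default_factory=dict)  # path -> contents on disk
    index: dict = field(default_factory=dict)        # path -> staged contents
    remotes: dict = field(default_factory=dict)      # remote name -> URL
    remote_refs: dict = field(default_factory=dict)  # e.g. "origin/master" -> commit ID

    def current_commit(self):
        """Resolve HEAD -> ref -> commit ID."""
        return self.refs[self.head]

repo = RepoState(
    commits={"20f97c1": {"parents": [], "files": {"README.md": "hi"}}},
    refs={"master": "20f97c1", "first": "20f97c1"},
    working_dir={"README.md": "hi"},
    index={"README.md": "hi"},
    remotes={"origin": "https://github.com/DS-100/git-intro"},
    remote_refs={"origin/master": "20f97c1"},
)
print(repo.current_commit())
```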

How does pushing work?

In git, every repository is coequal. The repository we cloned from Github looks exactly like ours, except it might contain different commits and names.

Suppose you want to publish your changes.



In [171]:

    
!git push origin going-back-{our_id}

Here origin is the name (according to git remote -v) of the repository you want to push to. If you omit a remote name, origin is also the default. Normally that's what you want.

going-back-{our_id} (whatever the value of {our_id}) is a branch in your repository. If you omit a branch name here, your current branch (the branch HEAD refers to) is the default.

What do you think git does?

A few things happen:

Git finds all the commits in going-back-{our_id}'s history - all of its ancestors.
It sends all of those commits to origin, and they're added to that repository. (If origin already has a bunch of them, of course those don't need to be sent.)
It updates the branch named going-back-{our_id} in origin to point to the same commit yours does.
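Here is a rough Python sketch of those three steps. Repositories are modeled as plain dictionaries, the functions ancestors and push are invented for this example, and the sketch deliberately ignores the case where the remote branch has moved, which real git checks before accepting a push.

```python
def ancestors(commits, commit_id):
    """Return the set containing commit_id and every commit reachable through parents."""
    seen = set()
    stack = [commit_id]
    while stack:
        cid = stack.pop()
        if cid in seen:
            continue
        seen.add(cid)
        stack.extend(commits[cid])
    return seen

def push(local, remote, branch):
    """Toy push: no safety checks, just the three steps."""
    tip = local["refs"][branch]
    for cid in ancestors(local["commits"], tip):         # step 1: find the history
        if cid not in remote["commits"]:                 # step 2: send only what is missing
            remote["commits"][cid] = local["commits"][cid]
    remote["refs"][branch] = tip                         # step 3: update the remote's branch

# commit ID -> list of parent IDs (IDs are made up)
local = {"commits": {"123": [], "234": ["123"], "456": ["234"]},
         "refs": {"going-back": "456"}}
remote = {"commits": {"123": [], "234": ["123"]},
          "refs": {"going-back": "234"}}

push(local, remote, "going-back")
print(sorted(remote["commits"]), remote["refs"]["going-back"])
```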

However, suppose someone else has updated going-back-{our_id} since you last got it?

 456 (your going-back-{our_id})
   \   345 (origin's going-back-{our_id}, pushed by someone else)
    \   /
     \ /
     234 (going-back-{our_id} when you last pulled it from origin)
      |
     123

How do you think git handles that?

The answer may surprise you: git gives up and tells you you're not allowed to push. Instead, you have to pull the remote commits and merge them in your repository, then push after merging.

error: failed to push some refs to 'https://github.com/DS-100/git-intro.git'
hint: Updates were rejected because the remote contains work that you do
hint: not have locally. This is usually caused by another repository pushing
hint: to the same ref. You may want to first integrate the remote changes
hint: (e.g., 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.

We'll go over merging next, but the end result after merging will look like this:

 567 (your going-back-{our_id})
  |  \
  |   \
  |    \
 456    \
   \   345 (origin's going-back-{our_id}, pushed by someone else)
    \   /
     \ /
     234 (going-back-{our_id} when you last pulled it from origin)
      |
     123

Then git push origin going-back-{our_id} would succeed, since there are now no conflicts. We're updating going-back-{our_id} to a commit that's a descendant of the current commit going-back-{our_id} names in origin.
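The rule "the new commit must be a descendant of what origin already has" can be checked with a short graph walk. The sketch below uses the commit IDs from the diagrams above. The function is_ancestor is a made-up helper for illustration, not git's implementation.

```python
def is_ancestor(commits, maybe_ancestor, commit_id):
    """True if maybe_ancestor is commit_id itself or reachable by following parents."""
    stack, seen = [commit_id], set()
    while stack:
        cid = stack.pop()
        if cid == maybe_ancestor:
            return True
        if cid in seen:
            continue
        seen.add(cid)
        stack.extend(commits[cid])
    return False

# commit ID -> list of parent IDs, matching the diagrams above
commits = {"123": [], "234": ["123"], "345": ["234"], "456": ["234"], "567": ["456", "345"]}
remote_tip = "345"   # where origin's branch currently points

print(is_ancestor(commits, remote_tip, "456"))  # before merging: this push would be rejected
print(is_ancestor(commits, remote_tip, "567"))  # after merging: this push would be accepted
```

Commit 567 has two parents, so walking back from it reaches 345. Walking back from 456 never does, which is exactly why the first push fails.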

So it remains to see how to accomplish a merge. We need to start with pulling updates from other repositories.

How does pulling work?

Suppose someone else pushes a commit to the remote repository. We can simulate that with our own second repository:



In [165]:

    
cd /tmp



In [166]:

    
!git clone https://github.com/DS-100/git-intro git-intro-2



In [172]:

    
cd /tmp/git-intro-2



In [175]:

    
!git checkout going-back-{our_id}



In [176]:

    
third_filename = "our_third_file_{}.txt".format(our_id)
third_filename



In [177]:

    
!echo "Someone else added this third file!" > {third_filename}
!git add {third_filename}
!git commit -m"Adding a third file!"
!git push

Now we go back to our original repo.



In [178]:

    
cd /tmp/git-intro

You might just want the update. Or maybe you want to push your own commit to the same branch, and your git push failed.

Git has a command called pull that you could use. But it's complicated, and it's easier to break it down into two steps: fetching and merging.

Since git commits are never destroyed, it's always safe to fetch commits from another repository. (Refs can be changed, so that's not true for refs. That's the source of the problem with our push before!)
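To see why fetching is safe, here is a toy Python version. It assumes repositories are plain dictionaries and that remote-tracking names look like origin/<branch>. The function fetch and the dictionary layout are hypothetical.

```python
def fetch(local, remote, remote_name="origin"):
    """Copy missing commits and refresh remote-tracking refs; never touch local branches."""
    for cid, parents in remote["commits"].items():
        local["commits"].setdefault(cid, parents)       # commits are only ever added
    for branch, tip in remote["refs"].items():
        local["remote_refs"][remote_name + "/" + branch] = tip

# commit ID -> list of parent IDs (IDs are made up)
remote = {"commits": {"123": [], "234": ["123"], "345": ["234"]},
          "refs": {"going-back": "345"}}
local = {"commits": {"123": [], "234": ["123"]},
         "refs": {"going-back": "234"},
         "remote_refs": {"origin/going-back": "234"}}

fetch(local, remote)
print(local["refs"]["going-back"], local["remote_refs"]["origin/going-back"])
```

After the fetch, our own going-back still points where it did. Only the origin/going-back bookmark has moved, so nothing we were working on has changed.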



In [164]:

    
!git help fetch



In [ ]:

    
!git fetch origin



In [184]:

    
!git log --graph --decorate going-back-{our_id} origin/going-back-{our_id}

Now we need to update our ref to the newer commit. In this case, it's easy, because we didn't have any further commits. Git calls that a "fast-forward" merge.
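A fast-forward can be sketched as "slide the branch name forward, but only if nothing gets left behind." The Python below assumes a purely linear history where each commit has at most one parent. The function merge_ff_only is a hypothetical stand-in for what --ff-only checks.

```python
def merge_ff_only(commits, refs, branch, other_tip):
    """Move `branch` to other_tip only if the branch's commit is an ancestor of other_tip."""
    cid = other_tip
    while cid is not None and cid != refs[branch]:
        cid = commits[cid]          # follow the single parent link
    if cid is None:
        raise ValueError("not a fast-forward: " + branch + " has commits of its own")
    refs[branch] = other_tip

# commit ID -> parent ID (None for the first commit); IDs are made up
commits = {"123": None, "234": "123", "345": "234", "456": "234"}

refs = {"going-back": "234"}
merge_ff_only(commits, refs, "going-back", "345")
print(refs["going-back"])           # the name simply moved forward

refs = {"going-back": "456"}        # now pretend we had made our own commit first
try:
    merge_ff_only(commits, refs, "going-back", "345")
except ValueError as err:
    print(err)
```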



In [191]:

    
!git merge origin/going-back-{our_id} --ff-only



In [192]:

    
!git log --graph --decorate

As a shortcut, you can do fetch and fast-forward merge with a single command:

git pull --ff-only origin going-back-{our_id}

What if there's a nontrivial merge to do?

In this class, you have three repositories:

The class Github repository ds100, which contains blank copies of assignments
The repository that lives on your own computer, where you work on your assignments
Your Github repository origin, where you submit your assignments

ds100 will be updated regularly with commits that add new assignments. You'll never push to ds100. But you will pull from it regularly to get the new assignments.

When you pull from ds100, you don't want to just use the latest commit from that repo. Then you'd be starting from scratch, without all your work on previous assignments.

Instead, you want to merge the ds100 updates so that you get the new assignments but don't clobber your own work.

In the git log, after a few assignments, this will look something like this:

(ds100/master)              (master on local repo)
     ... ---------------------------efg (merged with 345 to get hw2)
      | /                            |
     345 (hw2)                      def (worked on lab2)
      |                              |
      |  ---------------------------cde (merged with 234 to get lab2)
      | /                            |
     234 (lab2)                     bcd (finishing touches on hw1)
      |                              |
      |  ---------------------------abc (your work on hw1)
      | /          ^ not a merge
     123 (hw1)

Consider the first merge only. The current commit is bcd, and you want to get lab 2. From what you know so far, how should you merge?

Answer: Assuming we're on the master branch in our repo, and there are no uncommitted changes to the working files:

!git fetch ds100
!git merge ds100/master

That doesn't finish things for us, though. How will the merge work? How will git reconcile your changes to the hw1 files with the addition of lab2 files?

Git tries to intelligently include all the changes introduced in the two merged branches since their last common ancestor. In this case, the changes are independent - one branch introduced new files in the lab2 directory, and the other edited files in the hw1 directory. So git will just do it.

Git assumes that changes introduced in separate files, or in separate lines of the same file, can be applied together. If two branches change the same line of the same file, it will give up and ask you to reconcile the changes. You'll then need to edit the file and follow the instructions to mark it as fixed. We won't go over an example of that today.
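The merge rule can be sketched in Python at the level of whole files. Real git also works within a file, on groups of lines, so treat this as a simplified illustration. The function three_way_merge and the file names are made up.

```python
def three_way_merge(base, ours, theirs):
    """Merge two snapshots against their common ancestor, one file at a time."""
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            result = o               # both sides agree
        elif o == b:
            result = t               # only their side changed this file
        elif t == b:
            result = o               # only our side changed this file
        else:
            conflicts.append(path)   # both changed it differently: a human must decide
            continue
        if result is not None:       # None means the file does not exist on that side
            merged[path] = result
    return merged, conflicts

base = {"hw1/hw1.ipynb": "blank hw1"}
ours = {"hw1/hw1.ipynb": "my hw1 answers"}
theirs = {"hw1/hw1.ipynb": "blank hw1", "lab2/lab2.ipynb": "blank lab2"}

merged, conflicts = three_way_merge(base, ours, theirs)
print(merged)
print(conflicts)
```

In this example our hw1 work and the new lab2 file both survive, and the conflict list is empty, matching the "git will just do it" case described above.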

Note that sometimes git's assumption about independence is not true. For example, suppose you are working on a project and you create a new code file A that imports code from another file B. Your coworker deletes file B. Git will merrily apply both changes, but your code in file A won't work any more. So you need to apply human judgment when merging. If you write informative commit messages, it's much easier to find such problems.

A shortcut to pull in this class

The instructions on the course website tell you to get changes from ds100 with this command:

git pull -s recursive -X ours --no-edit ds100 master

What does this do? It's basically what we just went through, with some extra options that let you avoid dealing with merges:

git pull ds100 master: Fetch the master branch from ds100 and merge it into your current branch. Roughly equivalent to git fetch ds100; git merge ds100/master as seen above, assuming you are on master.
-s recursive -X ours: If git finds that you and the ds100 repo have made changes to the same line in a file, always take your changes and delete the ds100 repo changes. It will do this instead of asking you to reconcile the changes.
--no-edit: Normally, git will ask you to create a commit message to describe the merge commit. This option generates a default message for you.
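To get a feel for what -X ours does, here is a line-by-line sketch in Python. It assumes all three versions of the file have the same number of lines, which real git does not require, and the function merge_lines_favor_ours is invented for this example.

```python
def merge_lines_favor_ours(base, ours, theirs):
    """Three-way merge of equal-length line lists, resolving conflicts in our favor."""
    merged = []
    for b, o, t in zip(base, ours, theirs):
        if o == b:
            merged.append(t)   # we did not touch this line, so accept their version
        else:
            merged.append(o)   # we changed it: keep ours, even if they changed it too
    return merged

base   = ["x = 1",   "y = 2",  "z = 3"]
ours   = ["x = 10",  "y = 2",  "z = 3"]   # we edited the first line
theirs = ["x = 100", "y = 20", "z = 3"]   # they edited the first and second lines

print(merge_lines_favor_ours(base, ours, theirs))
```

Their change to the second line still comes through. Only the line both sides edited is settled in our favor, so the option discards conflicting changes from ds100 rather than everything from ds100.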

Miscellaneous useful tips and commands

Think before you run commands like git merge or git checkout that might update your current branch. If you have outstanding uncommitted changes, it can be complicated to keep them intact. Generally you should commit your changes before running such commands.
git diff: See the changes in your working directory that you haven't staged yet. Use git diff HEAD to see all changes (staged or not) versus the most recent commit.
git diff <commit_or_ref> <other_commit_or_ref>: See all the changes between two commits.
git add -u: Add to the index (in preparation for a commit) all changes to files git already tracks, including deletions. New, untracked files are not added.
git add -A: Add to the index (in preparation for a commit) all changes, including new, untracked files.
git rm <file>: Delete a file. This change happens in the index, so it will show up in your next commit.
git mv <file> <new_name>: Rename a file. Again, this change happens in the index, so it will show up in your next commit.
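git checkout -- <file>: Reset a file to its state in the index (the same as the current commit if you haven't staged anything for that file), eliminating changes in the working directory.
git rm --cached <file>: Remove a file from the index without deleting it from the working directory. (Useful if you git add a new file you didn't mean to add.) For a file that was already tracked, this stages its deletion; to unstage an edit to a tracked file, use git reset HEAD <file> instead.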
git checkout -b <new_branch_name>: Create a new branch at the current commit and check it out, making HEAD point to it.