Git: how it works

This article summarizes the Git documentation and covers what you need to know to start working with Git and SourceTree. It is meant to be a quick reference for learning and remembering the most important and commonly used Git commands. If you are looking for a widely used Git development flow, read this article. Also check the Git reference site.

Git thinks of its data like a set of snapshots of a mini filesystem. Every time you commit, or save the state of your project in Git, it basically takes a picture of what all your files look like at that moment and stores a reference to that snapshot. If files have not changed, Git doesn’t store the file again, just a link to the previous identical file it has already stored.

Everything in Git is checksummed before it is stored and is then referred to by that checksum. The mechanism that Git uses for this checksumming is a SHA-1 hash. Git stores everything in its database addressable not by file name but by the hash value of its contents.

Git has three main states that your files can reside in:

  • Committed: the data is safely stored in your local database
  • Modified: you have changed the file but have not committed it to your database yet
  • Staged: you have marked a modified file in its current version to go into your next commit snapshot
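
The content addressing mentioned before the list of states can be observed with git hash-object, which prints the ID Git would assign to some content. This is only a small sketch: the sample strings are arbitrary, and nothing is written to any repository.

```shell
# Hash the same content twice and a different content once.
# git hash-object --stdin only computes the ID; it stores nothing.
first=$(printf 'hello\n' | git hash-object --stdin)
second=$(printf 'hello\n' | git hash-object --stdin)
other=$(printf 'goodbye\n' | git hash-object --stdin)

echo "hello   -> $first"
echo "hello   -> $second"   # identical content, identical ID
echo "goodbye -> $other"    # different content, different ID
```

Because the ID depends only on the content, two files with the same content share one stored object, whatever their names are.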

A Git project

This leads us to the three main sections of a Git project: the Git directory, the working directory, and the staging area.

The Git directory is where Git stores the metadata and object database for your project. This is the most important part of Git, and it is what is copied when you clone a repository from another computer.

The working directory is a single checkout of one version of the project. These files are pulled out of the compressed database in the Git directory and placed on disk for you to use or modify.

The staging area is a simple file, generally contained in your Git directory, that stores information about what will go into your next commit. It’s sometimes referred to as the index, but it’s becoming standard to refer to it as the staging area.
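
As a rough illustration of these three sections, the sketch below creates a throwaway repository and looks at each of them. The temporary directory and the file name notes.txt are made up for the example.

```shell
set -e
work=$(mktemp -d)            # throwaway location for the example
cd "$work"
git init -q                  # creates the Git directory: .git

echo 'draft' > notes.txt     # a file in the working directory
git add notes.txt            # recorded in the staging area (the index)

ls -d .git                   # the Git directory
ls .git/index                # the staging area is a file inside it
git ls-files --stage         # entries currently in the staging area
staged=$(git ls-files)
```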

Git workflow

The basic Git workflow goes something like this:

  • You modify files in your working directory
  • You stage the files, adding snapshots of them to your staging area
  • You do a commit, which takes the files as they are in the staging area and stores that snapshot permanently to your Git directory
  • Finally, you push your commits from your local repository to the remote repository
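
The four steps above can be sketched end to end. To keep the example self-contained, a local bare repository stands in for the remote server; the paths, the file name, and the user identity are assumptions made for illustration.

```shell
set -e
base=$(mktemp -d)
git init -q --bare "$base/remote.git"        # stand-in for a remote server
git clone -q "$base/remote.git" "$base/work" 2>/dev/null   # hides the empty-repository warning
cd "$base/work"
git config user.name 'Example User'          # hypothetical identity
git config user.email 'user@example.com'

echo 'first line' > file.txt                 # 1. modify files in the working directory
git add file.txt                             # 2. stage a snapshot of the file
git commit -q -m 'add file.txt'              # 3. commit what is in the staging area
git push -q origin HEAD                      # 4. push the current branch to the remote

# Count the commits that arrived in the stand-in remote.
pushed=$(git --git-dir="$base/remote.git" rev-list --count --all)
echo "commits on remote: $pushed"
```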

You can get a Git project using two main approaches. The first takes an existing project or directory and imports it into Git. The second clones an existing Git repository from another server. If you want to get a copy of an existing Git repository, the command you need is git clone. With git clone, Git receives a copy of nearly all data that the server has: every version of every file for the history of the project is pulled down. When you first clone a repository, all of your files will be tracked and unmodified because you just checked them out and haven’t edited anything.
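
Both approaches can be tried locally. In this sketch the "server" is simply another directory on disk, and the directory names and identity are invented for the example.

```shell
set -e
base=$(mktemp -d)

# Approach 1: import an existing directory into Git.
mkdir "$base/project"
cd "$base/project"
echo 'hello' > README
git init -q
git config user.name 'Example User'      # hypothetical identity
git config user.email 'user@example.com'
git add README
git commit -q -m 'initial import'

# Approach 2: clone an existing repository (here, the one just created).
git clone -q "$base/project" "$base/copy"
cd "$base/copy"
git status --short                       # prints nothing right after a clone
status=$(git status --porcelain)
```

The empty status output reflects that every file in a fresh clone is tracked and unmodified.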

As you edit files, Git sees them as modified. You stage these modified files and then commit all your staged changes.

Git add and git status

In order to begin tracking a new file, you use the command git add. If you run git status, you can see that your added file is now tracked and staged. You can tell that it’s staged because it’s under the Changes to be committed heading. If you commit at this point, the version of the file at the time you ran git add is what will be in the historical snapshot.

If you change a previously tracked file and then run your status command again, the file appears under a section named Changes not staged for commit. This means that a file that is tracked has been modified in the working directory but not yet staged. To stage it, you run the git add command. The staged file will go into your next commit.

Sidenote: git add is a multipurpose command: you use it to begin tracking new files, to stage files, and to do other things like marking merge-conflicted files as resolved.

If you make changes to your staged file and run the status command, you’ll see it is now listed as both staged and unstaged. It turns out that Git stages a file exactly as it is when you run the git add command. If you commit now, the version of the file as it was when you last ran the git add command is how it will go into the commit, not the version of the file as it looks in your working directory when you run git commit. If you modify a file after you run git add, you have to run git add again to stage the latest version of the file.
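
A small experiment makes this behaviour visible. The file name and contents are arbitrary; the short status format is used here because it shows the staged and unstaged state of a file in two columns.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name 'Example User'      # hypothetical identity
git config user.email 'user@example.com'

echo 'v1' > notes.txt
git add notes.txt                        # stage the file as it is now
echo 'v2' >> notes.txt                   # edit it again after staging

git status --short                       # "AM": added in the index, modified since
state=$(git status --porcelain)

git commit -q -m 'commit the staged version'
committed=$(git show HEAD:notes.txt)     # content recorded in the commit
echo "committed content: $committed"
```

The commit contains only the first line, because the second edit was never staged.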

Git diff

You can use git diff to see your staged and unstaged changes. Git diff by itself shows only changes that are still unstaged. If you want to see what you’ve staged that will go into your next commit, you can use git diff --staged, which compares your staged changes to your last commit. If you stage a file and then edit it again, git diff shows the edits made since staging and git diff --staged shows the edits that are staged.
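
The difference between the two forms can be checked with one file that has both a staged and an unstaged edit. The file name and the lines written to it are made up for this sketch.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name 'Example User'      # hypothetical identity
git config user.email 'user@example.com'

echo 'one' > file.txt
git add file.txt
git commit -q -m 'first version'

echo 'two' >> file.txt
git add file.txt                         # staged change: adds "two"
echo 'three' >> file.txt                 # unstaged change: adds "three"

unstaged=$(git diff)                     # working directory vs. staging area
staged=$(git diff --staged)              # staging area vs. last commit
echo "$unstaged"
echo "$staged"
```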

Git commit

The simplest way to commit is to type git commit. Once the commit is made, Git gives you some output about it: which branch you committed to, what SHA-1 checksum the commit has, how many files were changed, and statistics about lines added and removed in the commit. Remember that the commit records the snapshot you set up in your staging area. Anything you didn’t stage is still sitting there modified; you can do another commit to add it to your history. Every time you perform a commit, you’re recording a snapshot of your project that you can revert to or compare to later.

Let’s assume that you have a directory containing three files, and you stage them all and commit. Staging the files checksums each one, stores that version of the file in the Git repository (Git refers to them as blobs), and adds that checksum to the staging area:

$ git add README LICENSE test.rb
$ git commit -m 'initial commit'

Running git commit checksums all project directories and stores them as tree objects in the Git repository. Git then creates a commit object that has the metadata and a pointer to the root project tree object. Your Git repository now contains five objects: one blob for the contents of each of your three files, one tree that lists the contents of the directory and specifies which file names are stored as which blobs, and one commit with the pointer to that root tree and all the commit metadata.
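
The object count can be verified in a scratch repository. The file contents below are placeholders, chosen to differ from each other so that each file gets its own blob.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name 'Example User'      # hypothetical identity
git config user.email 'user@example.com'

echo 'readme'  > README
echo 'license' > LICENSE
echo 'puts 1'  > test.rb
git add README LICENSE test.rb
git commit -q -m 'initial commit'

git cat-file -p HEAD                     # commit: tree pointer plus metadata
git cat-file -p 'HEAD^{tree}'            # tree: file names mapped to blobs

# List every object reachable from the commit and count them.
count=$(git rev-list --objects --all | wc -l | tr -d ' ')
echo "objects: $count"
```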

If you make some changes and commit again, the next commit stores a pointer to the commit that came immediately before it.
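
The parent pointer is stored in the commit object itself, which the following sketch prints. Names and contents are illustrative only.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name 'Example User'      # hypothetical identity
git config user.email 'user@example.com'

echo 'one' > file.txt
git add file.txt
git commit -q -m 'first'
first=$(git rev-parse HEAD)              # ID of the first commit

echo 'two' >> file.txt
git add file.txt
git commit -q -m 'second'

git cat-file -p HEAD | grep '^parent'    # the stored pointer to the first commit
parent=$(git rev-parse 'HEAD^')          # the same pointer, resolved by Git
```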

Sidenote: the staging area is sometimes a bit more complex than you need in your workflow. If you want to skip the staging area, Git provides a simple shortcut. Providing the -a option to the git commit command makes Git automatically stage every file that is already tracked before doing the commit.
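
A short sketch of the shortcut, using invented file names. It also shows the limit of the -a option: files that are not tracked yet are left alone.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name 'Example User'      # hypothetical identity
git config user.email 'user@example.com'

echo 'one' > tracked.txt
git add tracked.txt
git commit -q -m 'start tracking'

echo 'two' >> tracked.txt                # change to a tracked file
echo 'new' > untracked.txt               # brand new, never added

git commit -q -a -m 'update tracked file'   # no git add needed for tracked.txt
remaining=$(git status --porcelain)
echo "$remaining"                        # only the untracked file is left
```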

Git remove and git move

To remove a file from Git, you have to remove it from your tracked files (more precisely, remove it from your staging area) and then commit. The git rm command does that and also removes the file from your working directory so you don’t see it as an untracked file next time around. If you simply remove the file from your working directory, it shows up under the Changes not staged for commit heading (that is, unstaged). If you modified the file and added it to the staging area already, you must force the removal with the -f option. Use git rm --cached if you want to remove a file from your staging area but keep it in your working directory.
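
The two variants can be compared side by side. The file names a.txt and b.txt are placeholders for this sketch.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name 'Example User'      # hypothetical identity
git config user.email 'user@example.com'

echo 'a' > a.txt
echo 'b' > b.txt
git add a.txt b.txt
git commit -q -m 'add two files'

git rm -q a.txt                          # removed from the index and from disk
git rm -q --cached b.txt                 # removed from the index only

git status --short                       # both deletions staged; b.txt now untracked
tracked=$(git ls-files)                  # nothing is tracked any more
```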

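Git doesn’t explicitly track file movement. If you rename a file in Git, no metadata is stored in Git that tells it you renamed the file. It doesn’t matter whether you rename a file by hand (and then run git rm on the old name and git add on the new one) or with the git mv command; Git works out the rename after the fact.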
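
As a sketch of that equivalence, the example below renames one file with git mv and another by hand. File names are invented; with default settings, git status typically reports both as renames.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name 'Example User'      # hypothetical identity
git config user.email 'user@example.com'

echo 'first file' > a.txt
echo 'second file' > c.txt
git add a.txt c.txt
git commit -q -m 'add files'

git mv a.txt b.txt                       # rename with git mv

mv c.txt d.txt                           # rename by hand ...
git rm -q c.txt                          # ... stage the removal of the old name
git add d.txt                            # ... and stage the new name

git status --short                       # both changes are staged the same way
tracked=$(git ls-files | tr '\n' ' ')
```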

Git log

The most basic and powerful tool to look back to see what has happened is the git log command. By default, with no arguments, git log lists the commits made in that repository in reverse chronological order. That is, the most recent commits show up first.
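
A minimal sketch of the default ordering, with three throwaway commits. The --oneline option is used only to keep the output short.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name 'Example User'      # hypothetical identity
git config user.email 'user@example.com'

for n in first second third; do
  echo "$n" >> file.txt
  git add file.txt
  git commit -q -m "$n"
done

git --no-pager log --oneline             # most recent commit is listed first
newest=$(git log --format=%s | head -n 1)
oldest=$(git log --format=%s | tail -n 1)
```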

Undoing things

At any stage, you may want to undo something. Be careful, because you can’t always revert some of these undos. This is one of the few areas in Git where you may lose some work if you do it wrong. One of the common undos takes place when you commit too early and possibly forget to add some files, or you mess up your commit message. If you want to try that commit again, you can run git commit --amend. This command takes your staging area and uses it for the commit. An example:

$ git commit -m 'initial commit'
$ git add forgotten_file
$ git commit --amend

After these three commands, you end up with a single commit; the second commit replaces the results of the first.

Let’s say you’ve changed two files and want to commit them as two separate changes, but you accidentally type git add . and stage them both. The git status command reminds you how to unstage them:

$ git add .
$ git status
On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

        modified:   index.php
        modified:   top.php
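
Following the hint in that output, the sketch below unstages one of the two files. It reuses the file names from the status listing; the contents and the identity are made up.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name 'Example User'      # hypothetical identity
git config user.email 'user@example.com'

echo 'index' > index.php
echo 'top' > top.php
git add .
git commit -q -m 'initial commit'

echo 'change' >> index.php
echo 'change' >> top.php
git add .                                # both files staged by accident

git reset -q HEAD top.php                # unstage top.php, keep its edit on disk

git status --short
staged=$(git diff --staged --name-only)  # still staged
unstaged=$(git diff --name-only)         # modified but no longer staged
```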

Remote repositories

Remote (or shared) repositories are versions of your project that are hosted on the Internet or network somewhere. A remote repository is generally a bare repository: a Git repository that has no working directory. To see which remote servers you have configured, you can run the git remote command. It lists the shortnames of each remote handle you’ve specified. If you’ve cloned your repository, you should at least see origin, which is the default name Git gives to the server you cloned from.
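
A quick way to see this is to clone something and list the remotes. In this sketch a local bare repository plays the role of the server, and all paths are temporary.

```shell
set -e
base=$(mktemp -d)
git init -q --bare "$base/server.git"            # bare: no working directory
git clone -q "$base/server.git" "$base/copy" 2>/dev/null   # hides the empty-repository warning
cd "$base/copy"

remotes=$(git remote)                            # shortnames only
echo "$remotes"
git remote -v                                    # shortnames with their URLs
```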

To get data from your remote projects, you can run git fetch [remote-name]. You should now have references to all the branches from that remote, which you can merge in or inspect at any time. If you clone a repository, the command automatically adds that remote repository under the name origin. So, git fetch origin fetches any new work that has been pushed to that server since you cloned (or last fetched from) it. If you have a branch set up to track a remote branch, you can use the git pull command to automatically fetch and then merge a remote branch into your current branch.

When you have your project at a point that you want to share, you have to push it upstream. The command for this is simple: git push [remote-name] [branch-name]. If you want to push your master branch to your origin server, you can run git push origin master to push your work back up to the server.

If you want to see more information about a particular remote, you can use the git remote show [remote-name] command:

$ git remote show origin
* remote origin
  URL: git://github.com/project/project.git
  Remote branch merged with 'git pull' while on branch master
    master
  Tracked remote branches
    master
    project

The command helpfully tells you that if you’re on the master branch and you run git pull, it will automatically merge in the master branch on the remote after it fetches all the remote references. It also lists all the remote references it has pulled down.

Git tag

Git has the ability to tag specific points in history as being important. Listing the available tags in Git is straightforward. Just type git tag. Git uses two main types of tags: lightweight and annotated:

  • A lightweight tag is very much like a branch that doesn’t change: it’s just a pointer to a specific commit
  • Annotated tags are stored as full objects in the Git database. They’re checksummed; contain the tagger name, e-mail, and date; and have a tagging message

It’s generally recommended that you create annotated tags so you can have all this information. By default, the git push command doesn’t transfer tags to remote servers. You will have to explicitly push tags to a shared server after you have created them. This process is just like sharing remote branches: you can run git push origin [tagname].
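
The two tag types and the explicit push can be sketched as follows. The tag names, the message, and the local bare repository that stands in for a shared server are all assumptions of this example.

```shell
set -e
base=$(mktemp -d)
git init -q --bare "$base/server.git"    # stand-in for a shared server
git init -q "$base/work"
cd "$base/work"
git config user.name 'Example User'      # hypothetical identity
git config user.email 'user@example.com'
git remote add origin "$base/server.git"

echo 'release' > file.txt
git add file.txt
git commit -q -m 'release commit'

git tag v1.0-lw                          # lightweight: just a pointer
git tag -a v1.0 -m 'version 1.0'         # annotated: a full tag object
git tag                                  # list the tags

light=$(git cat-file -t v1.0-lw)         # type of object the name refers to
annotated=$(git cat-file -t v1.0)
echo "v1.0-lw is a $light, v1.0 is a $annotated"

git push -q origin v1.0                  # tags are pushed only when named
remote_tags=$(git ls-remote --tags origin)
```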

Git branch

Nearly every version control system has some form of branching support. Branching means you diverge from the main line of development and continue to do work without messing with that main line. Git encourages a workflow that branches and merges often, even multiple times in a day.

A branch in Git is simply a lightweight movable pointer to one of your commits. The default branch name in Git is master. As you initially make commits, you’re given a master branch that points to the last commit you made. Every time you commit, it moves forward automatically.

If you create a new branch, Git creates a new pointer for you to move around. Let’s say you create a new branch called testing. You do this with the git branch command: git branch testing. This creates the pointer but does not switch you to the new branch.

Git keeps a special pointer called HEAD to know what branch you’re currently on: it points to the local branch you have checked out. To switch to an existing branch, you run the git checkout command. Let’s switch to the new testing branch: git checkout testing. This moves HEAD to point to the testing branch.

If you now make a change to an existing file and commit this change, your testing branch has moved forward, but your master branch still points to the commit you were on when you ran git checkout to switch branches.

Let’s switch back to the master branch with git checkout master. That command does two things. It moves the HEAD pointer back to point to the master branch, and it reverts the files in your working directory back to the snapshot that master points to. This means the changes you make from this point forward will diverge from an older version of the project. It essentially rewinds the work you’ve done in your testing branch temporarily so you can go in a different direction.

Let’s make a change and commit again. Now your project history has diverged. You created and switched to a branch, did some work on it, and then switched back to your main branch and did other work. Both of those changes are isolated in separate branches: you can switch back and forth between the branches and merge them together when you’re ready.
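
The whole sequence, from creating the testing branch to the diverged history, can be replayed in a scratch repository. File names and commit messages are invented, and the branch is explicitly named master so the sketch does not depend on a locally configured default branch name.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master  # make sure the first branch is master
git config user.name 'Example User'      # hypothetical identity
git config user.email 'user@example.com'

echo 'base' > base.txt
git add base.txt
git commit -q -m 'base commit'

git branch testing                       # new pointer, HEAD stays on master
git checkout -q testing                  # HEAD now points to testing
echo 'testing work' > t.txt
git add t.txt
git commit -q -m 'work on testing'       # testing moves forward

git checkout -q master                   # HEAD back on master, files rewound
echo 'master work' > m.txt
git add m.txt
git commit -q -m 'work on master'        # master moves forward separately

git --no-pager log --oneline --graph --all
only_testing=$(git rev-list --count master..testing)
only_master=$(git rev-list --count testing..master)
current=$(git rev-parse --abbrev-ref HEAD)
```

Each branch ends up with one commit the other does not have, and t.txt is absent from the working directory while master is checked out.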

Sidenote: the git branch command does more than just create and delete branches. If you run it with no arguments, you get a simple listing of your current branches. The * character prefixes the branch that you currently have checked out. If it prefixes master, for example, a commit at this point moves the master branch forward with your new work.

Git fetch

Let’s say you have a Git server on your network, git.ourcompany.com. If you clone from this, Git automatically names it origin for you, pulls down all its data, creates a pointer to where its master branch is, and names it origin/master locally. You can’t move this pointer yourself. Git also gives you your own master branch starting at the same place as origin’s master branch, so you have something to work from.

If you do some work on your local master branch, and, in the meantime, someone else pushes to git.ourcompany.com and updates its master branch, then your histories move forward differently. Also, as long as you stay out of contact with your origin server, your origin/master pointer doesn’t move.

To synchronize your work, you run a git fetch origin command. This command looks up which server origin is, fetches any data from it that you don’t yet have, and updates your local database, moving your origin/master pointer to its new, more up-to-date position.
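
The movement of origin/master can be observed with two clones of one server. In this sketch the server is a local bare repository rather than git.ourcompany.com, and the clone names alice and bob are hypothetical.

```shell
set -e
base=$(mktemp -d)
git init -q --bare "$base/server.git"
git --git-dir="$base/server.git" symbolic-ref HEAD refs/heads/master

# First clone publishes an initial commit on master.
git clone -q "$base/server.git" "$base/alice" 2>/dev/null
cd "$base/alice"
git symbolic-ref HEAD refs/heads/master
git config user.name 'Alice Example'     # hypothetical identity
git config user.email 'alice@example.com'
echo 'one' > file.txt
git add file.txt
git commit -q -m 'one'
git push -q origin master

# Second clone is taken now, then the first clone pushes more work.
git clone -q "$base/server.git" "$base/bob"
echo 'two' >> file.txt
git commit -q -a -m 'two'
git push -q origin master
server_tip=$(git rev-parse HEAD)

cd "$base/bob"
before=$(git rev-parse origin/master)    # still the old position
git fetch -q origin                      # contact the server
after=$(git rev-parse origin/master)     # moved to the server's new tip
local_tip=$(git rev-parse master)        # local master is untouched by fetch
echo "origin/master: $before -> $after"
```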

Let’s assume you have another internal Git server, git.team1.ourcompany.com. You can add it as a new remote reference to the project you’re currently working on by running the git remote add command. Name this remote teamone, which will be your shortname for that whole URL.

You can run git fetch teamone to fetch everything the remote teamone server has that you don’t have yet. Because that server has a subset of the data your origin server has right now, Git fetches no data but sets a remote branch called teamone/master to point to the commit that teamone has as its master branch.

Sharing branches

When you want to share a branch with the world, you need to push it up to a remote that you have write access to. Your local branches aren’t automatically synchronized to the remotes you write to; you have to explicitly push the branches you want to share. That way, you can use private branches for work you don’t want to share, and push up only the topic branches you want to collaborate on.

If you have a branch named serverfix that you want to work on with others, you can push it up the same way you pushed your first branch. Run git push (remote) (branch), for example git push origin serverfix, which means: take my local serverfix branch and push it to update the remote’s serverfix branch. You can also run git push origin serverfix:serverfix, which does the same thing: take my serverfix and make it the remote’s serverfix.

You can use this format to push a local branch into a remote branch that is named differently. If you didn’t want it to be called serverfix on the remote, you could instead run git push origin serverfix:awesomebranch to push your local serverfix branch to the awesomebranch branch on the remote project.

After a plain git push origin serverfix, the next time one of your collaborators fetches from the server, they will get a reference to where the server’s version of serverfix is under the remote branch origin/serverfix.

It’s important to note that when you do a fetch that brings down new remote branches, you don’t automatically have local, editable copies of them. In other words, in this case, you don’t have a new serverfix branch; you only have an origin/serverfix pointer that you can’t modify. To merge this work into your current working branch, you can run git merge origin/serverfix. If you want your own serverfix branch that you can work on, you can base it off your remote branch with git checkout -b serverfix origin/serverfix. This gives you a local branch that you can work on that starts where origin/serverfix is.
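
The sharing round trip can be sketched with a local bare repository as the server and two clones. The clone names alice and bob and the file contents are hypothetical; serverfix is the branch name used in the text.

```shell
set -e
base=$(mktemp -d)
git init -q --bare "$base/server.git"
git --git-dir="$base/server.git" symbolic-ref HEAD refs/heads/master

# The first collaborator publishes master.
git clone -q "$base/server.git" "$base/alice" 2>/dev/null
cd "$base/alice"
git symbolic-ref HEAD refs/heads/master
git config user.name 'Alice Example'     # hypothetical identity
git config user.email 'alice@example.com'
echo 'base' > file.txt
git add file.txt
git commit -q -m 'base'
git push -q origin master

# The second collaborator clones before serverfix exists.
git clone -q "$base/server.git" "$base/bob"

# The first collaborator creates serverfix and shares it explicitly.
git checkout -q -b serverfix
echo 'fix' >> file.txt
git commit -q -a -m 'server fix'
git push -q origin serverfix

# The second collaborator fetches: only origin/serverfix appears.
cd "$base/bob"
git fetch -q origin
local_before=$(git branch --list serverfix)      # empty: no local branch yet
git checkout -q -b serverfix origin/serverfix    # local branch based on the remote one
upstream=$(git rev-parse --abbrev-ref 'serverfix@{upstream}')
echo "serverfix tracks $upstream"
```

With Git's default settings, basing the new branch on origin/serverfix also records it as the upstream of the local serverfix branch.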

Checking out a local branch from a remote branch automatically creates what is called a tracking branch. Tracking branches are local branches that have a direct relationship to a remote branch. If you’re on a tracking branch and type git push, Git automatically knows which server and branch to push to. Also, running git pull while on one of these branches fetches all the remote references and then automatically merges in the corresponding remote branch. When you clone a repository, it generally automatically creates a master branch that tracks origin/master. That’s why git push and git pull work out of the box with no other arguments.

