Understanding Git

Last modified by Sergiu Dumitriu on 2012/11/29 17:53

Slides

Understanding git

DVCS basics

What is Version Control

  • Source Control, Version Control, Revision Control, Source Configuration Management, Software Configuration Management, Software Change and Configuration Management...
  • "Software Configuration Management (SCM) is the task of tracking and controlling changes in the software." [Wikipedia]
  • "Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later." [Pro Git book]
  • "Many people think of a version control system as a sort of time machine." [Version Control with Subversion book]
  • "Distributed version control system (DVCS) keeps track of software revisions and allows many developers to work on a given project without necessarily being connected to a common network." [Wikipedia]

Evolution of version control

  • Keeping multiple copies of the same file on the local disk
  • Keeping multiple copies of the same file on a remote disk
  • Keeping multiple copies of the same file on a shared remote server
  • Hiding previous versions in a browsable history, keeping only the head visible by default (branches and HEADs)

Evolution of collaboration

  • Sending papers back and forth
  • Sending emails back and forth
  • Storing files on a shared server
  • Storing files in a Central Version Control System
  • Now: Remove the Central aspect to get a Distributed Version Control System

The two main purposes of VCS

  1. Track changes over time
  2. Help people work collaboratively on the same project
  • The Distributed aspect of DVCS tries to improve the second purpose
    • but as a consequence also improves the first one by adding reliability through replication and cryptographic security to the codebase
    • and also by allowing more parallelism, with no locks and fewer conflicts

Concepts (SVN)

Repository

  • A database where the entire history of the project is stored
  • Not necessarily in a human-readable format
  • Existing versions can be read from the repository
  • New versions can be added to the repository

Working Copy

  • A local copy of a specific version from the repository
  • Can be freely modified, but changes aren't automatically put in the repo
  • Usually the latest version, but any previous version can be used
  • a.k.a. Workspace, Working Tree

Checkout

  • Copy a specific version from the repository into a local working copy

Commit

  • Upload a new version based on the working copy to the repository

Update

  • Re-fetch the latest version from the central server

Branches

  • It is possible to keep parallel development histories
  • For example, after releasing version 3.0 of the software, continue developing for the upcoming 4.0 version, but also maintain a 3.0.x branch for any critical bugfixes needed before 4.0 is ready

HEAD

  • The most recent version on a specific branch

The D in DVCS

Repositories

  • Instead of having just one central repository, everyone clones the entire repository locally
  • A working copy is always right next to the local repository
  • A checkout extracts files from the local repository
    • No network transfer is involved
  • A commit stores the new version in the local repository
    • No network transfer is involved

Collaboration

  • You can add as many remote repositories as needed to your local clone
  • You can fetch versions from any of your registered remotes
    • Fetch just adds new versions to your local repository without changing the current working tree or your local branches
    • Pull fetches and updates the local branches that track remote branches
  • You can issue a pull request for someone else to fetch your changes and include them in their repository
  • It is still possible to nominate an accessible repository as the central repository where committers can push directly

About Git

Git

  • Distributed Version Control System
  • Powerful branching
    • Branches have a very low cost (tens of bytes)
    • Versatile merging
  • Efficient storage
  • Data integrity assured by SHA-1 checksums of each file, tree or commit
  • Many commands at high (porcelain) and low (plumbing) level
  • Steep learning curve, but easy to master after the a-ha! moment
  • Easy to make mistakes, easy to recover
    • But hard to make irreparable mistakes

GitHub

  • Git hosting with many enhancements
  • Social interaction
    • Organizations
    • Following users and repositories
    • Public Forks and Pull Requests
  • Nice visualization for commits, versions, branches, forks
  • Comments on commits
  • Remote access APIs
  • Also provides basic issue tracker, wiki, web hosting, download hosting

From Subversion to Git

Equivalent actions

Subversion   Git
checkout     clone
update       fetch or pull
status       status
diff         diff
commit       add + commit + push
revert       checkout

Git internals

The Object Database

Blobs and Objects

  • Blob = A piece of data in the repository
  • Object = blob + type + SHA-1 ID
    • They have a SHA-1 validating their contents, thus objects are immutable; changing something means creating a new version of it
  • Usually, the contents of a file
    • This can be a symlink as well
  • Trees, Commits, Tags
  • git cat-file -p <SHA> pretty-prints the contents of an object (git cat-file -t <SHA> shows its type)
  • Object types: blob, tree, commit, tag

Blob (File)

  • Object of type 'blob'
  • Contains just the plain content of the file, be it text or binary
  • For symlinks, contains the path to the linked file
  • Does not contain a file name, or a file path, or a file mode
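
That a blob holds nothing but content can be checked by recomputing its ID by hand. This is a minimal sketch, assuming the default SHA-1 object format and that sha1sum (GNU coreutils) is installed; the variable names are made up for the illustration.

```shell
set -e
# Ask git for the ID it would give to a blob with this content
id_from_git=$(printf 'hello\n' | git hash-object --stdin)

# Recompute it by hand: SHA-1 over the header "blob <size>\0" plus the content.
# No file name, path or mode takes part in the hash.
id_by_hand=$(printf 'blob 6\0hello\n' | sha1sum | cut -d' ' -f1)

echo "$id_from_git"
echo "$id_by_hand"
```

Both lines should print the same ID, whatever the file holding that content is called.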

Tree (Directory)

  • Object of type 'tree'
  • Collection of other files and trees
  • The blob contains a list of entries
  • For each included entry, specifies:
    • File mode
    • Object type (blob or tree; a symlink is a blob with mode 120000)
    • SHA-1 of the file or subtree object
    • File name
  • A file does not know its name
  • A sub-tree does not know its parent

Tree example

$ git cat-file -p 8b79037423e3dfda32c87dc9e7add14be4688c9c
100644 blob 523a688c25b20dee0a9e0a0ebd2eed65545423d5    .gitattributes
100644 blob 53f9e9b271e54f46617ee0f52cfc9c370e5b01fe    .gitignore
100644 blob d0f1cacc70bccbcd587c8c5119cf11438eca4563    README.markdown
120000 blob ec54964de17e52fc4ee652d92ade47f35c9fb77d    README.symlink
040000 tree f663f5a69a6ca941381dd1c527cd506163113fbd    jetty-resources
040000 tree 40b20f4f91d1d85f5a523944522c8f44525ea5c1    ncbieutils-access-service
040000 tree 0c09a04658435dcd43e4cee149bef5ad170aff77    obo2solr
040000 tree 25065bef9f647ffb0ba1c6c571f1afa317082e78    patient-tools
040000 tree 869a3a8cb5863cd6cff802c05a9d4ad71b21dc05    patient-update-listeners
040000 tree 8cae1b0cd809c4a1133f8a3cd8505c3bff3fb229    phenotype-mapping-service
100644 blob 50deded834803b6e2530b9207c52e274177c4636    pom.xml
040000 tree 7a0bdf97ffa40f3decec211229aae3e96e70e73a    solr-access-service
040000 tree 329b8aceee152c46199f13773f58667e45b03f78    solr-configuration
040000 tree 513db7cae3ff0a61a83115445adf7e0a5d7d12fd    standalone-distribution
040000 tree b7562687ef03b5a1cf47fed14feffdf4ed8e62d8    standalone-ui
040000 tree 3970df2a29c988309410e44699d7d923fe10b560    wiki-database
040000 tree be08365fa5e665ac428c6b450b9622b496ebf222    wiki-distribution
040000 tree 3bd73e7f49c2828f412f0935bddfa1ce0e965ca0    wiki-ui

Files do not know their names
Files do not know their directory
Files do not know they're files!
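
The same point can be reproduced in a throwaway repository. The sketch below is illustrative only: the file names README and docs/COPY are hypothetical, and the tree is written straight from the index with the plumbing command git write-tree, without creating a commit.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

mkdir docs
printf 'same content\n' > README
printf 'same content\n' > docs/COPY    # other name, other directory, same bytes
git add README docs/COPY

# Store the index as tree objects and print the ID of the root tree
root=$(git write-tree)
git cat-file -p "$root"                # entries: mode, type, SHA-1, name
git cat-file -p "$root:docs"

# Both names resolve to one and the same blob
readme_blob=$(git rev-parse "$root:README")
copy_blob=$(git rev-parse "$root:docs/COPY")
echo "$readme_blob"
echo "$copy_blob"
```

The names live only in the two tree listings; the blob is stored once and is shared by both entries.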

Commit

  • Object of type 'commit'
  • A link to a tree, with associated metadata
    • Parent commit(s)
    • Author
    • Committer (exactly one per commit; differs from the author when someone else applies, amends or rebases the commit)
    • Dates (author date and committer date)
    • Commit message
  • The diff between two versions (a "classical" SVN commit) is actually done by recursively comparing the trees of the two commit objects
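
A small sketch of that last point, assuming a throwaway repository and a made-up Demo identity: each commit points at a complete tree, and the change between two commits is derived on demand by comparing those trees.

```shell
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d)
cd "$repo"
git init -q

printf 'one\n' > a.txt
git add a.txt
git commit -q -m 'first'
printf 'two\n' > b.txt
git add b.txt
git commit -q -m 'second'

# The tree each commit links to
old_tree=$(git rev-parse 'HEAD~1^{tree}')
new_tree=$(git rev-parse 'HEAD^{tree}')

# Compare the two trees recursively; no commit object is involved here
changed=$(git diff-tree -r --name-status "$old_tree" "$new_tree")
echo "$changed"
```

The output lists b.txt as added (status A), even though no patch was ever stored.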

Commit examples

$ git cat-file -p HEAD
tree ee5c210558e3ce9b8fc46568878562e76c5e5c98
parent 30303a2408f7911842e4cc5646f9541097c96fbe
author Sergiu Dumitriu <sergiu@xwiki.org> 1353364745 -0500
committer Sergiu Dumitriu <sergiu@xwiki.org> 1353364745 -0500

Issue #30: Autosave patient sheets
Done.
$ git cat-file -p 0b591f2
tree e4bace937c9b270874989d4e1e780ebdc31af64b
parent af7d5a6914693a73081fe0a220bc6cfb543f3ebf
parent 2076271fdd57dee7c66a6238e5d8d53c2ddfe8e5
author Sergiu Dumitriu <sergiu@xwiki.com> 1351284534 -0400
committer Sergiu Dumitriu <sergiu@xwiki.com> 1351284534 -0400

Merge branch 'new-pheno-displayer'

Tag

  • Object of type 'tag'
  • A link to a commit object, with additional metadata
    • Tag name
    • Tag author
    • Optional strong signature (PGP) to certify the author
  • This is an internal, invisible tag object; an actual tag, explained later, is a reference to this tag object

Tag example

$ git cat-file -p xwiki-platform-3.5.1
object 31024788158cc45879bf15832fa38c4834d434df
type commit
tag xwiki-platform-3.5.1
tagger Sergiu Dumitriu <sergiu@xwiki.com> Sun Apr 29 22:04:28 2012 +0200

Tagging xwiki-platform-3.5.1
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)

iQEcBAABAgAGBQJPnZ7MAAoJEDXWcnP59QNwmkQH/3763+OGMYE/VUfw5CP//elr
obmCSTYKWlVPf6GGn3kH6IQPD/iJGJLtoBziCU0uVBYPnOpRaKu/fo2jdi5LXR9/
nK6aosyoIwN7oqE+tDBwahujovXcIu93Dkf5vFtsrbSqg43EnB0dm54wWc26LR57
Q6Wp3s/hU6gdl7cHdoNDfGAbziTE6+e61WjkxuR18xdKwNN88G+aYYJAtEy21qaO
aMjyZMMSCTlj1KFSNQpOm5tu/49dzs8TQdZqbf5ykuOJ1WbJUgVirRzvmyisjzvC
HjjnUMkanH7BEZfI4nn9scZkiDTr+XwwoH6VzN81Ea8gu8LNy/fGBhK30KHBgMc=
=xnTk
-----END PGP SIGNATURE-----
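
The difference between a tag object and a plain tag reference can be seen by asking for the type behind each name. This is a sketch in a throwaway repository; the tag names light-1.0 and annotated-1.0 and the Demo identity are hypothetical, and no PGP signing is involved.

```shell
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d)
cd "$repo"
git init -q

printf 'v1\n' > file.txt
git add file.txt
git commit -q -m 'release'

git tag light-1.0                            # lightweight: reference straight to the commit
git tag -a -m 'Tagging 1.0' annotated-1.0    # annotated: creates a tag object first

light_type=$(git cat-file -t light-1.0)
annotated_type=$(git cat-file -t annotated-1.0)
echo "$light_type $annotated_type"

# Once the tag object is peeled, both names lead to the same commit
light_commit=$(git rev-parse 'light-1.0^{commit}')
annotated_commit=$(git rev-parse 'annotated-1.0^{commit}')
```

Here the lightweight tag reports type commit and the annotated one reports type tag.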

Merge Commits

  • Object of type 'commit' with more than one parent
  • Reunites two (or more) parent branches
  • Combines the two (or more) parent trees into a unified tree
  • Whenever two or more branches diverge from a common parent, bringing them together requires a merge
    • ...or a rebase of one of the branches

History example

* commit 60d25ab8915a2ae6ef04bf7bf122f61975105923
| Author: Marta Girdea <marta.girdea@gmail.com>
|
|     Issue #62: Sorting not supported by current implementation of the solr access service component
|    
*   commit 0a066433ce089f7ec8566f01a756d9aa227158f3
|\  Merge: 000ad9b 8816daf
| | Author: Marta Girdea <marta.girdea@gmail.com>
| |
| |     Merge branch 'master' of github.com:marta-/cidb
| |   
| * commit 8816daf88703189d6e88a5c1f949c01e2ccb1e9a
| | Author: Marta Girdea <marta.girdea@gmail.com>
| |
| |     [misc] Fixed translation
| |   
* | commit 369894b2ca95eba2660cb9c0d65ae8361951459b
| | Author: Marta Girdea <marta.girdea@gmail.com>
| |
| |     [misc] Some disabled fields still show up in the form
| |   
* | commit 0ffc9fd56fb8397b4c48265733eed10d13a4fc92
|/  Author: Marta Girdea <marta.girdea@gmail.com>
|   
|       Issue #61: Delete button fails on homepage
|  

All the objects in a git repository form a Directed Acyclic Graph

Rebasing

  • Rebase == rewrite a diverging branch of commits so that they appear in line after the “official” branch
  • New blobs, trees and new commit objects are created!
  • Useful when trying to minimize the number of merge commits
    • the parallel nature of the rebased commits is not of importance
  • Available as an option when pulling: git pull --rebase
  • Also a standalone command for rewriting history
    git rebase --interactive <older commit>

Rebase versus Merge

Assuming the following commit structure:

                      F---G    local master
                     /
            A---B---C---D---E  upstream master

After git pull:

                      F---G---H  merge commit
                     /       /
            A---B---C---D---E

After git pull --rebase:

                              F'--G'
                             /
            A---B---C---D---E
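
The rebase case can be tried locally, without a remote. This is a simplified sketch of the second diagram: a local branch named feature (a hypothetical name) plays the role of the diverged local master, and git rebase is run directly instead of git pull --rebase.

```shell
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # pin the branch name for the demo

printf 'base\n' > base.txt
git add base.txt
git commit -q -m 'C'

git checkout -q -b feature
printf 'local\n' > local.txt
git add local.txt
git commit -q -m 'F'

git checkout -q master
printf 'upstream\n' > upstream.txt
git add upstream.txt
git commit -q -m 'D'

before=$(git rev-parse feature)
git checkout -q feature
git rebase -q master                      # rewrite F as F' on top of D
after=$(git rev-parse feature)

merges=$(git rev-list --merges --count HEAD)
echo "$before -> $after, merge commits: $merges"
```

The commit gets a new ID because its parent changed, and the resulting history is linear.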

A git repository is a database of objects

plus some metadata and caches

The full repository

Repository contents

  • The object database
  • Repository configuration
    • config, info/*, description, hooks/*
  • References
    • refs/* and packed-refs
  • The current index
  • The Working Tree, or the local checkout
    • It is possible to have a bare repository, without the index and the checkout; this is for server repositories
  • A stash, a list of saved patches, not part of the history

References (Heads)

  • A link pointing to an existing object
    • Just a reference name and the SHA-1 of the target object
  • References are not objects!
    • Files located in .git/refs/*
    • Packed references inside .git/packed-refs
  • References can be:
    • Local and Remote branches
    • Tags (tag = reference to a tag object)
    • Stash
    • HEAD (the branch or commit that is checked out)

Example references

$ cat .git/packed-refs
# pack-refs with: peeled
cfce87925d25ad5cb33d5361bcdbbbdca517ded4 refs/heads/master
161cb5ebafa50395a442c3a1f96501f2607ac75b refs/heads/swizzle-upgrade
8177d7b5152171d7b892682a4023bf9de25f5f61 refs/remotes/origin/feature-solr-search
cfce87925d25ad5cb33d5361bcdbbbdca517ded4 refs/remotes/origin/master
2b232b899fb32c3f7d5be1f020c738cadb9596af refs/remotes/origin/stable-4.1.x
518fecb89bae8d1334341263c2f5bc049636959d refs/remotes/origin/stable-4.2.x
ae7de6e2993ffaffecec56415b76c64243fa1fa1 refs/remotes/origin/stable-4.3.x
584f14233004eae260d1bd3b8711f7f8051921b8 refs/stash
09a740e9c6e4183365130d58b0cd7017f7eace35 refs/tags/xwiki-platform-4.3-milestone-2
^7e8a6514d4e46260ad30551d0197db9ecbe581ab
3398ccdc8639aeb7c441b9dbdb7de417be1b7edd refs/tags/xwiki-platform-4.3-rc-1
^62337c8d3da2864ab147aaacdbc998534ebcff6f

Branches

  • A branch is just a reference to a commit object
    • A post-it, a label outside the object database
    • It doesn't have to point to a commit with no descendants
  • Creating a branch == creating a reference
  • Deleting a branch == deleting a reference
    • All the commits are still kept in the object database, until an explicit pruning is performed
  • Local branches can track a remote branch
    • Pulling will also try to update local tracking branches
    • Pushing will try to update the remote tracked branch

HEAD

  • A special type of reference: a symbolic reference
  • Unlike branches, the HEAD usually points to another reference, and not directly to a commit object
    • must be a local branch
  • Checking out a branch means updating the HEAD reference to point to the branch reference
  • Committing a new revision means that the reference pointed to by HEAD will be updated as well
  • Checking out a commit will cause the HEAD to point directly to the commit object, which means that a new commit will not update any branch; this is called a detached HEAD

HEAD examples

$ cat .git/HEAD
ref: refs/heads/master
$ git checkout feature-docnaming
Switched to branch 'feature-docnaming'
$ cat .git/HEAD
ref: refs/heads/feature-docnaming
$ git checkout xwiki-manager-4.2
Note: checking out 'xwiki-manager-4.2'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

[...]

HEAD is now at 0784c23... [maven-release-plugin] prepare release xwiki-manager-4.2
$ cat .git/HEAD
0784c2334be743ecc64d67f7c7b22f64fcb22044
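
A script can tell the two states apart with git symbolic-ref, which succeeds only while HEAD points to another reference. This is a minimal sketch in a throwaway repository; the variable names are made up.

```shell
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # pin the branch name for the demo

printf 'x\n' > x.txt
git add x.txt
git commit -q -m 'initial'

attached_to=$(git symbolic-ref HEAD)      # HEAD is a symbolic reference to a branch
echo "$attached_to"

git checkout -q --detach                  # HEAD now holds the commit ID directly
if git symbolic-ref -q HEAD >/dev/null; then
  state=attached
else
  state=detached
fi
echo "$state"
```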

Database, Index, Working Tree

  • The object database contains all the history of the commits, but doesn't have an explicit HEAD
    • It's just a DAG of objects, all equally important
    • References identify certain nodes as the head of a branch, or as a tagged state
  • The Working Tree is a local checkout of a given tree from the database, on which the user can work
  • The Index is a buffer between the DB and the workspace, structured as a git tree ready for commit, and a reference against which to compare the working tree
    • Also called staging tree, or commit cache
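
The three places can hold three different versions of the same file at once. The sketch below assumes a throwaway repository and a hypothetical file notes.txt.

```shell
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d)
cd "$repo"
git init -q

printf 'v1\n' > notes.txt
git add notes.txt
git commit -q -m 'v1'

printf 'v2\n' > notes.txt
git add notes.txt                  # the index now holds v2
printf 'v3\n' > notes.txt          # the working tree moves on to v3

in_db=$(git show HEAD:notes.txt)   # version in the last commit
in_index=$(git show :notes.txt)    # version staged in the index
in_worktree=$(cat notes.txt)       # version on disk
echo "$in_db / $in_index / $in_worktree"
```

A git commit at this point would record v2, the staged version, not v3.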

The stash

  • A separate list of changes to be saved for later
    • Work in progress not ready for a commit, but useful enough to be kept
  • Stashing the current state saves both the index and the working tree, hard-resetting to the HEAD
  • Any number of stash entries can be saved
  • A stashed change can be popped at a later time
    • Onto any index with a clean working tree

Code examples

Working with the index

# Checkout a tree from the DB:
# - update the HEAD reference
# - copy the tree from the DB into the index
# - extract its files (blobs) into the workspace
$ git checkout [<SHA-1> | <ref>]
# Add changed files from the workspace to the staging tree
$ git add [<files> | --all | --interactive]
# Remove blobs from the index and the workspace
$ git rm <files>
# Move blobs to another subtree, index+workspace
# Also for renaming files/directories
$ git mv <original location> <new location>

Git can automatically detect a file move/rename, even if not explicitly specified with a mv operation
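
A sketch of that detection, with hypothetical file names: the file is renamed with a plain mv, and git pairs the deletion and the addition by content when asked to look for renames (-M).

```shell
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d)
cd "$repo"
git init -q

printf 'line 1\nline 2\nline 3\n' > old.txt
git add old.txt
git commit -q -m 'add file'

# Rename without telling git about it
mv old.txt new.txt
git add --all                      # stages one deletion and one new file

# Compare the index with HEAD, detecting renames
renames=$(git diff --cached -M --name-status)
echo "$renames"
```

The status letter R with a similarity of 100 means the content is identical; only the tree entry changed.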

Working with the index: checking differences

# See changes between the index and the workspace
$ git diff [<files>]
# See changes between the index and the DB (HEAD)
$ git diff --cached
# See changes from another version, comparing the
# tree from the DB with the working tree
$ git diff <ref>
# See changes between any two trees in the DB
$ git diff <ref>..<ref>

Working with the index: dropping changes

# Reset the index, re-copying blobs from the DB into the index
# Doesn't touch the working tree
$ git reset [<ref>] [<files>]
# Reset the working tree from the index, re-copying files into the working tree
# Doesn't touch the index
$ git checkout -- <files>
# Reset both the index and the working tree:
# - copy blobs from the database to the Index
# - extract blobs into workspace
# When currently on a branch, looks like "forgetting"/discarding commits
# by going back to another revision and moving the branch head to it
$ git reset --hard <ref>
# Still, untracked files are not discarded by any of these commands;
# to get a really clean state with no extra changes, use:
$ git clean -dxf

Working with the index: commits

# Check the status of the index
$ git status [-s -u]
# Commit the index:
# - copy the staging tree into the DB
# - create a new commit object in the DB
# - update the HEAD or the current branch referenced by it
# It's the index that gets committed, not the working tree;
# you must 'git add' the changes you want to commit first
$ git commit
# Re-unite with another branch by:
# - merging the staging tree with another tree
# - creating a commit with two parents
$ git merge <ref>
# Clone a "commit" into the current branch
$ git cherry-pick [-x] <ref>

Working with the stash

# Save the changes into a new stash entry
$ git stash
$ git stash push -m "a nice name for this changeset"   # older Git: git stash save "..."
# Show the contents of the stash
$ git stash list
# Show a particular stash entry
$ git stash show [-p] [stash@{N}]
# Drop a stashed entry, permanently forgetting it
$ git stash drop [stash@{N}]
# Apply a stashed changeset onto the working tree, keeping it in the stash
$ git stash apply [stash@{N}]
# Apply a changeset and drop the entry from the stash
$ git stash pop [stash@{N}]
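
A round trip through the stash, as a sketch in a throwaway repository with a hypothetical file work.txt: stashing cleans the working tree, and popping brings the change back.

```shell
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d)
cd "$repo"
git init -q

printf 'committed\n' > work.txt
git add work.txt
git commit -q -m 'base'

printf 'in progress\n' >> work.txt
git stash -q                               # save the change, go back to HEAD
after_stash=$(git status --porcelain)      # empty: nothing left to commit
entries=$(git stash list | wc -l | tr -d ' ')

git stash pop -q                           # re-apply and drop the entry
after_pop=$(git status --porcelain)
echo "$after_pop"
```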

Working with branches

# Checkout a commit (by SHA-1 or tag) into the Index/Workspace
# This creates a detached HEAD, not on a branch
$ git checkout <ref>
# Checkout an existing branch
$ git checkout <branch name>
# Create a new branch from the current HEAD; does not switch to the new branch
$ git branch <name>
# Checkout a commit and create a branch from it
$ git checkout -b <branch name> <ref>
# List existing local branches
$ git branch
# Delete a branch; only deletes the reference, commits will remain in the DB
$ git branch -d <name>

Working with tags

# Checkout a tag into the Index/Workspace (detached HEAD)
$ git checkout <tag name>
# Tag the current HEAD
$ git tag <name>
# Create a signed tag; requires setting up GPG
$ git tag -s <name>
# Show existing tag names
$ git tag
# Delete a tag
$ git tag -d <name>

Working with the object database: reading objects

# Show the contents of an object
$ git show <SHA-1>
# Pretty-print the contents of an object (blob, tree, commit or tag)
$ git cat-file -p <SHA-1>

Working with the object database: browsing and searching for commits

# Show the history of the current HEAD
$ git log [--graph]
# Show commits with a given message
$ git log --grep=illumina
# Show commits that added or removed a given text
$ git log '-Stext to search for'
# Show commits from a certain author
$ git log --author=Sergiu
# Show commits (with diff) since a certain date
$ git log -p '--since=three days ago'
# Show the log on a specific file, following renames
$ git log --follow -- <file>
# Show the repository in a nice GUI
# Some also let you [un]stage and commit
$ gitx | gitk | gitg | gitview | git gui

Working with the object database: file history

# "Blame" a file: trace back each line of code to the last commit that changed it
$ git blame <file>
# Blame the file as it was at a given revision
$ git blame <ref> <file>
# Show the contents of a file at a given revision
$ git show <ref>:<file>

prune and fsck

  • References point to certain nodes in the DAG
  • Most objects are transitively reachable from one of the references
  • Unreachable objects are "invisible" if their ID is not known
  • git fsck --unreachable can list these objects
    • and git fsck --lost-found saves dangling objects under .git/lost-found
  • git reflog can list recent revisions, reachable or unreachable
  • git prune can remove unreachable objects
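
A sketch of recovering a commit that no branch points to any more, using the reflog. It assumes a throwaway repository with the reflog enabled (the default for non-bare repositories); the branch name rescued is hypothetical.

```shell
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d)
cd "$repo"
git init -q

printf 'one\n' > f.txt
git add f.txt
git commit -q -m 'first'
printf 'two\n' > f.txt
git commit -q -am 'second'
lost=$(git rev-parse HEAD)

git reset -q --hard HEAD~1             # no branch points to 'second' any more

# The reflog still remembers where HEAD was before the reset
previous=$(git rev-parse 'HEAD@{1}')
git branch rescued "$previous"         # put a label back on the commit
git --no-pager log --oneline rescued
```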

Collaboration: remotes

A remote repository is
a source of new objects

and a destination as well

Remotes

  • Clones of the same git repository located in another place
    • Another directory on the same machine
    • A repository on another machine accessible via ssh
    • An online repository accessible via HTTP or the special git communication protocol
    • Or a foreign repository of another type (SVN, CVS...)
  • == Related object databases located in other places
  • A repository can have as many remotes as it wants
  • origin is considered the main remote by convention

clone

  • Creates a new local clone of a repository
    git clone git@github.com:compbio-UofT/shrimp.git    (for committers)
    git clone git://github.com/compbio-UofT/shrimp.git  (read-only)
  1. Creates a new git repository
  2. Clones some or all of the references from the origin repository
  3. Fetches the objects reachable from the cloned references from the remote repository into the local database
  4. Configures the remote as the origin repository
  5. Creates the local master branch at the commit the remote HEAD points to (refs/remotes/origin/HEAD), tracking origin/master
  6. Checks out the master branch into the working tree
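
The result of these steps can be inspected on a local clone. This sketch clones a directory instead of the GitHub URLs above; the repository names upstream and local are made up.

```shell
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"

# A small repository to clone from
git init -q upstream
git -C upstream symbolic-ref HEAD refs/heads/master   # pin the branch name
printf 'hello\n' > upstream/README
git -C upstream add README
git -C upstream commit -q -m 'initial'

git clone -q upstream local
cd local

origin_url=$(git config --get remote.origin.url)      # the remote was registered as origin
tracking=$(git rev-parse --abbrev-ref 'master@{upstream}')
current=$(git symbolic-ref HEAD)
echo "$origin_url"
echo "$tracking"
echo "$current"
```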

fetch and pull

  • Fetches new values for the remote references
  • Also fetches new remote references (new remote branches, new tags)
    • unless only certain refs are specified
  • Fetches new objects reachable from the updated references
  • Does not update the local branches
  • git pull also updates the currently checked out branch, if tracking a remote branch, to point to the new remote branch head
    • in the fast-forward case, the local branch simply moves to the fetched head (FETCH_HEAD)
  • If the local and remote branch diverged, a merge commit will be created
    • unless --rebase is specified

Working with fetch and pull

# Fetch from origin, update all remote heads
$ git fetch
# Fetch from <remote>
$ git fetch <remote>
# Also fetch all new tags, even if not on followed branches
$ git fetch --tags
# Remove references to remote branches that no longer exist
$ git fetch --prune
# Only fetch the specified remote branch
$ git fetch origin stable-4.3.x
# Rebase local changes instead of merging
$ git pull --rebase
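
The difference between the two commands can be observed with two local repositories. This is a sketch with made-up names (upstream, local); the pull uses --ff-only because the local branch has no commits of its own here.

```shell
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"

git init -q upstream
git -C upstream symbolic-ref HEAD refs/heads/master   # pin the branch name
printf 'hello\n' > upstream/README
git -C upstream add README
git -C upstream commit -q -m 'initial'
git clone -q upstream local

# A new commit appears upstream after the clone
printf 'more\n' >> upstream/README
git -C upstream commit -q -am 'upstream change'
upstream_head=$(git -C upstream rev-parse HEAD)

cd local
git fetch -q origin
fetched_remote=$(git rev-parse origin/master)   # the remote branch reference moved
fetched_local=$(git rev-parse master)           # the local branch did not

git pull -q --ff-only
pulled_local=$(git rev-parse master)            # now the local branch moved too
```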

push

  • Updates remote refs using local refs, while sending objects necessary to complete the given refs
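  • Which branches are pushed by default depends on push.default: before Git 2.0 (matching), all local branches were pushed to the remote branches with the same name; since Git 2.0 (simple), only the current branch is pushed to its upstream branch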
  • Optionally pushes tags
  • Can also be used to delete remote references
  • Note that push fails if there are new commits on the remote; pull first to resolve the conflict

Working with push

# Push to the default remote (origin); the branches pushed depend on push.default
$ git push
# Push to <remote>
$ git push <remote>
# Also push all tags
$ git push --tags
# Push only a branch
$ git push origin <tracked remote branch name>
# Push a local branch to a new remote branch
$ git push origin <local branch name>:<remote branch name>
# Delete a remote branch
$ git push origin :<remote branch name>
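
The rejected push and its usual fix can be replayed locally. The sketch uses a bare repository as the shared one and two clones; the names central.git, alice and bob are hypothetical.

```shell
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"

# A seed repository, a bare "central" copy of it, and two clones
git init -q seed
git -C seed symbolic-ref HEAD refs/heads/master   # pin the branch name
printf 'seed\n' > seed/README
git -C seed add README
git -C seed commit -q -m 'initial'
git clone -q --bare seed central.git
git clone -q central.git alice
git clone -q central.git bob

printf 'alice\n' > alice/a.txt
git -C alice add a.txt
git -C alice commit -q -m 'alice work'
git -C alice push -q origin master

printf 'bob\n' > bob/b.txt
git -C bob add b.txt
git -C bob commit -q -m 'bob work'
if git -C bob push -q origin master 2>/dev/null; then
  first_push=accepted
else
  first_push=rejected     # the remote has a commit that bob does not have yet
fi
echo "$first_push"

# Integrate the remote commit first, then push again
git -C bob pull -q --rebase origin master
git -C bob push -q origin master
```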

Resolving conflicts

Conflicts

  • When merging different branches of the DAG, conflicts may occur
    • Also when applying stashed changes and cherry-picking
  • Git is pretty good at automatically solving conflicts, but overlapping changes can't be automatically merged
  • When a conflict occurs, git puts the working tree in a conflict state and waits for the user to solve the conflict

Resolving a conflict

  • git status shows the current state: which files are dirty, which are prepared for commit
  • The conflicting files have both changes inside them, marked with <<<<<<<, ======= and >>>>>>>
  • After choosing one of the two variants or combining them, use git add to mark a file as resolved and prepare it for commit
  • When all the files are resolved, git commit
    • Usually git shows a command line to execute
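
The whole cycle, as a sketch in a throwaway repository: two branches change the same line, the merge stops, and the conflict is resolved by hand. The file settings.txt, the branch feature and the chosen resolution are made up for the illustration.

```shell
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # pin the branch name for the demo

printf 'color = blue\n' > settings.txt
git add settings.txt
git commit -q -m 'base'

git checkout -q -b feature
printf 'color = green\n' > settings.txt
git commit -q -am 'feature: green'

git checkout -q master
printf 'color = red\n' > settings.txt
git commit -q -am 'master: red'

# Overlapping changes: the merge stops and reports a conflict
git merge -q feature >/dev/null 2>&1 || true
unmerged=$(git diff --name-only --diff-filter=U)
markers=$(grep -c '^<<<<<<< ' settings.txt)
cat settings.txt                          # both variants, between the markers

# Resolve: write the wanted content, mark as resolved, conclude the merge
printf 'color = green\n' > settings.txt
git add settings.txt
git commit -q --no-edit
parent_count=$(git cat-file -p HEAD | grep -c '^parent ')
```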

Resolving a rebase conflict

  • When the conflict appears during a rebase, git stores this information and lets you continue the rebase once the conflict is resolved
  • As before, resolve all conflicts and add files to the index, then:
    • git rebase --continue
    • git rebase --abort stops the process and restores the branch to its state before the rebase
    • git rebase --skip skips the current commit and continues with the next commit to rebase

When something is wrong,
analyze the DAG and
try to figure out where you are,
and where you want to get

Think DAG, not linear commit history

Debugging tips

  • Look at the graphical log and try to see what happens
    • git log --graph --decorate
  • git reflog shows recent states, even if no longer reachable from references
  • Compare local references with the remotes
  • If all else fails, hard reset to a previous working state and continue from there, even if some changes will be lost
    • You can stash them before resetting to preserve some data
  • Don't forget to check on what branch you are before committing
  • When panicking, ask for help from someone else

Resources

Documentation