Code release history and aesthetics with git filter-branch and friends

Many academic projects that make the transition to open source are lucky to have made it that far at all. What often happens is that the latest code is copied to a new repository, some license headers are pasted in, a README and friends are created, and the result is made public. In the process, a lot of history is dropped on the floor. I think this is a shame, especially if the main consumers are likely to be other researchers, students, or developers for whom the ability to see how something developed adds considerable value. I recently put a lot of legwork into doing a release while trying hard to preserve history and have decent aesthetics (consistent line endings, no binary blobs appearing and disappearing in the history, yanking out those accidentally committed proprietary files, giving committers consistent names and email addresses, preserving correct timestamps, collapsing embarrassing misguided commits, …). I accomplished all of this with git and its filter-branch tool, even though all of our history was in a Subversion repository. I am now a card-carrying member of the git-is-the-greatest-thing-ever club.

Here is our starting directory structure:
/buzz
/buzz/profanity/timesink.txt
/fizz
/fizz/foo/binary-abyss
/unpublished
/unrelated

Our program is known for fizzing and buzzing, so we definitely want to keep /fizz and /buzz, except that an otherwise well-mannered developer once expressed their frustrations a little too explicitly in /buzz/profanity/timesink.txt, and some inexperienced developers accidentally checked in /fizz/foo/binary-abyss. We also have some /unpublished work that we need to be careful with: as luck would have it, we were working on /fizz while preparing a draft of /unpublished, and we made some large commits that touch both paths. We don’t want that draft showing up on the ‘net too soon. Because administering repositories isn’t actually all that exciting, some other folks did a lot of /unrelated work in our SVN repository. Oh, and I just remembered, we also want to release /dancing-pigs at the same time, but that work was part of a different SVN repository. So we make a list of relevant paths for the dancing-pigs project too:

/trunk/dp-keepers
/dp-unrelated

STEP 0: Our tool of choice is git

Aiming for completeness here. :) This is an important step!

$ git svn clone https://svn-server/path/to/project/root release.git
$ cd release.git

STEP 1: Annihilate unwanted files and directories

$ git filter-branch -d /dev/shm/git --index-filter \
"git rm -qrf --cached --ignore-unmatch unpublished ;\
git rm -qrf --cached --ignore-unmatch unrelated ;\
git rm -qrf --cached --ignore-unmatch buzz/profanity/timesink.txt ;\
git rm -qrf --cached --ignore-unmatch fizz/foo/binary-abyss" \
--prune-empty HEAD

What’s going on here?

  • filter-branch: “Lets you rewrite git revision history.” “Really, I know what I’m doing”. You will need to add -f to force it if you’re experimenting at home and try to run this more than once: the first run leaves a backup of the original refs under refs/original/, and git refuses to overwrite that backup without -f (it is trying to be conservative against permanent loss of data).
  • -d /dev/shm/git: This tells git to use /dev/shm/git as a temporary directory (instead of /tmp or something inside your repository’s .git) while it’s working. /dev/shm is a ramdisk on most modern Linux distros, and this will get you a nice speedup. Our program actually does more than fizz and buzz, and this step is the most expensive.
  • --index-filter: This particular type of filter-branch works without actually checking out a copy of all your files once for every revision in history (i.e., it doesn’t update the working directory). It just notes internally that some things are getting deleted. We could have used a tree-filter and the system’s normal rm command and it would have accomplished the same result, but taken a lot longer due to the additional filesystem operations.
  • git rm: Because we’re doing an index-filter, we have to use git itself to do the rm, since the physical files aren’t actually being created and deleted each time.
    • -q: Be quiet. Do not print the name of each thing you delete. It’s a good idea to remove this when you’re experimenting, but you want it there for “the big run”.
    • -rf: Be recursive and force it. Just like traditional rm -rf.
    • --cached: Tell git rm not to bother updating the working directory. This is necessary to play nice with an --index-filter.
    • --ignore-unmatch: This tells git rm to exit cleanly even if the file it’s being asked to delete doesn’t exist in this particular commit. This is our first encounter with a constraint that we will see again: whatever command you run during the filter-branch must always exit cleanly. Otherwise the whole thing aborts. This is intended to be a recoverable situation but I found it easier to just keep around more than one clone while I was experimenting.
    • unpublished: This is a path. It can be a directory or a file. Just like with normal rm -rf. You do have to be conscious of troublesome files that get moved during the history of a project. I didn’t find any clean way to --follow such things. This was where a lot of the manual energy went: making a list of files and directories that shall be destroyed.
    • ;\: Normal shell syntax to execute more than one command and tolerate line-breaks.
  • --prune-empty: This tells git to just drop a commit if we’ve deleted everything that was touched by that commit. That way we don’t wind up with any empty commits or commits that reveal partial information about unpublished in their log message.
  • HEAD: This tells git where to start when working its way back through history. You’re welcome to filter-branch some other branch if you desire. There’s also a --all option, but I didn’t need it.
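To see these pieces work together, here is a minimal sketch on a throwaway repository. Everything in it (the temp directories, file names, commit messages) is invented for illustration, and FILTER_BRANCH_SQUELCH_WARNING is only there to silence the warning that newer git versions print. It first lists every path that was ever added in history, which is one way to start the "shall be destroyed" list, and then runs the same kind of --index-filter as above.

```shell
#!/bin/sh
# Throwaway demo repository; all names below are hypothetical.
set -eu
export FILTER_BRANCH_SQUELCH_WARNING=1   # newer git prints a warning otherwise
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git config user.name "Demo User"
git config user.email "demo@example.invalid"

mkdir fizz unpublished
echo "fizz" > fizz/main.c
git add . && git commit -q -m "Add fizz"
echo "draft" > unpublished/paper.tex
git add . && git commit -q -m "Start unpublished draft"
echo "more fizz" >> fizz/main.c
echo "more draft" >> unpublished/paper.tex
git add . && git commit -q -m "Touch both paths"

# Every path ever added on any branch, including files later moved or deleted.
all_paths=$(git log --all --diff-filter=A --name-only --pretty=format: |
  sed '/^$/d' | sort -u)
echo "paths ever added:"
echo "$all_paths"

# Same shape as the command above, with a temp dir standing in for /dev/shm/git.
git filter-branch -d "$work/rewrite" --index-filter \
  "git rm -qrf --cached --ignore-unmatch unpublished" \
  --prune-empty HEAD >/dev/null

remaining_commits=$(git rev-list --count HEAD)
remaining_paths=$(git log --diff-filter=A --name-only --pretty=format: HEAD |
  sed '/^$/d' | sort -u)
echo "commits left: $remaining_commits"
echo "paths left:"
echo "$remaining_paths"
git log --format=%s HEAD
```

In this sketch the commit that touched only unpublished disappears, while the commit that touched both paths survives with its original log message. --prune-empty does nothing about such mixed commits, so their messages still deserve a manual read.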

All of this filter-branching junks up your git repository with a lot of stale objects (for a good reason: so that you can recover from errors and so that you don’t accidentally lose a lot of important data, but that’s not especially relevant for what we’re doing here today). They can be garbage-collected as described here:

$ git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
$ git reflog expire --verbose --expire=now --all
$ git gc --prune=now

Or, you can just clone your repository to a new one to get a guaranteed clean start:

$ cd ..
$ git clone --no-hardlinks release.git release-clean.git
$ cd release-clean.git
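A quick way to convince yourself that the in-place cleanup really removed the old objects is to remember a pre-rewrite commit ID and ask git for it afterwards. The sketch below does this on a throwaway repository whose contents are made up. It uses --expire=now and --prune=now, the spellings shown in the git-filter-branch documentation.

```shell
#!/bin/sh
# Throwaway demo repository; all names below are hypothetical.
set -eu
export FILTER_BRANCH_SQUELCH_WARNING=1
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git config user.name "Demo User"
git config user.email "demo@example.invalid"

echo "keep" > keep.txt
git add keep.txt && git commit -q -m "Add keep.txt"
echo "secret" > secret.txt
git add secret.txt && git commit -q -m "Add secret.txt"
old_head=$(git rev-parse HEAD)        # the commit we want to be rid of

git filter-branch -d "$work/rewrite" --index-filter \
  "git rm -qrf --cached --ignore-unmatch secret.txt" \
  --prune-empty HEAD >/dev/null

# refs/original/ and the reflog still keep the old commit alive at this point.
if git cat-file -e "$old_head" 2>/dev/null; then before=present; else before=gone; fi

git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
git reflog expire --expire=now --all
git gc --quiet --prune=now

if git cat-file -e "$old_head" 2>/dev/null; then after=present; else after=gone; fi
echo "before cleanup: $before, after cleanup: $after"
```

The same cat-file check can be run in a fresh clone, where the old commit should be absent from the start.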

STEP 2: Recursively insert License header in *.[chS]

Goal: insert license header into every *.[chS] file in the repository. Make it look like it’s always been there.

$ git filter-branch -d /dev/shm/git --tree-filter '
perl /path/to/find-chS-and-add-license.pl
' --prune-empty HEAD

This time we use a --tree-filter and we do write to the working copy’s filesystem (well, a working-copy in a ramdisk, thanks to -d /dev/shm/git). The perl script that I ended up using was the inspiration for the one described in this StackOverflow question. If your repository isn’t too big, then this kind of thing is likely efficient enough (you could easily drop the Perl script altogether and use a few shell commands):

$ git filter-branch -d /dev/shm/git --tree-filter '
find . -name "*.[chS]" -exec perl /path/to/just-add-license.pl \{\} \;
' --prune-empty HEAD
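As a rough illustration of the "few shell commands" route, the sketch below prepends a header to every *.[chS] file under the current directory and skips files that already start with it. The header text, the demo files, and the add_license function are placeholders, not the script used for the actual release.

```shell
#!/bin/sh
set -eu
# Hypothetical header text; substitute the real license notice.
header="/* Copyright (c) The Fizz-Buzz Authors. See LICENSE for terms. */"

# Prepend $header to one file unless its first line already is the header.
add_license() {
  file=$1
  first_line=$(head -n 1 "$file")
  if [ "$first_line" != "$header" ]; then
    { printf '%s\n' "$header"; cat "$file"; } > "$file.tmp"
    cat "$file.tmp" > "$file"      # overwrite in place so file modes are kept
    rm "$file.tmp"
  fi
}

# Small demo tree in a temp dir; in a --tree-filter the current directory
# would be the checked-out commit instead.
tree=$(mktemp -d)
mkdir -p "$tree/fizz/foo"
printf 'int main(void) { return 0; }\n' > "$tree/fizz/main.c"
printf '#define FOO 1\n' > "$tree/fizz/foo/foo.h"
printf 'not source\n' > "$tree/fizz/notes.txt"

cd "$tree"
for pass in 1 2; do   # the second pass shows the check prevents duplicate headers
  find . -name "*.[chS]" -type f | while IFS= read -r f; do
    add_license "$f"
  done
done
head -n 2 fizz/main.c
```

Saved as a script without the demo tree, something like this could be called from the --tree-filter in place of the Perl script.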

Time to clean or clone your repo again. Really only important if it is large, and these operations consume many minutes each.

STEP 3: Use pretty author names

git-svn ends up leaving you with a git repository with author and committer names like username <username@23bd32ad-1234-e3ac-907f-1209acedead1>. Yuck. This time we use an --env-filter to change environment variables. Thanks yet again StackOverflow. Insert additional if-fi blocks for each relevant user in your repository and note that git maintains both an Author name and a Committer name. (Yes, this shell script is a lot more verbose than it really needs to be, but it aids readability.)

$ git filter-branch -d /dev/shm/git --env-filter '
an="$GIT_AUTHOR_NAME"
am="$GIT_AUTHOR_EMAIL"
cn="$GIT_COMMITTER_NAME"
cm="$GIT_COMMITTER_EMAIL"
if [ "$GIT_COMMITTER_EMAIL" = "user@23bd32ad-1234-e3ac-907f-1209acedead1" ]
then
cn="Nice Name"
cm="nicename@nicedomain.com"
fi
if [ "$GIT_AUTHOR_EMAIL" = "user@23bd32ad-1234-e3ac-907f-1209acedead1" ]
then
an="Nice Name"
am="nicename@nicedomain.com"
fi
export GIT_AUTHOR_NAME="$an"
export GIT_AUTHOR_EMAIL="$am"
export GIT_COMMITTER_NAME="$cn"
export GIT_COMMITTER_EMAIL="$cm"
' HEAD
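After a rewrite like this it is worth listing every identity that still appears in history, to catch users you missed. The sketch below runs a more compact env-filter (case statements instead of if-fi blocks, intended to have the same effect) on a throwaway repository and then prints the distinct author and committer identities. The UUID-style address is the placeholder used above; the repository and its single commit are made up.

```shell
#!/bin/sh
# Throwaway demo repository; all names below are hypothetical.
set -eu
export FILTER_BRANCH_SQUELCH_WARNING=1
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git config user.name "Demo Operator"
git config user.email "operator@example.invalid"

# Simulate a git-svn style identity on both author and committer.
ugly="user@23bd32ad-1234-e3ac-907f-1209acedead1"
echo "fizz" > fizz.c
git add fizz.c
GIT_AUTHOR_NAME=user GIT_AUTHOR_EMAIL="$ugly" \
GIT_COMMITTER_NAME=user GIT_COMMITTER_EMAIL="$ugly" \
  git commit -q -m "Add fizz"

git filter-branch -d "$work/rewrite" --env-filter '
case "$GIT_AUTHOR_EMAIL" in
  user@23bd32ad-1234-e3ac-907f-1209acedead1)
    GIT_AUTHOR_NAME="Nice Name"
    GIT_AUTHOR_EMAIL="nicename@nicedomain.com" ;;
esac
case "$GIT_COMMITTER_EMAIL" in
  user@23bd32ad-1234-e3ac-907f-1209acedead1)
    GIT_COMMITTER_NAME="Nice Name"
    GIT_COMMITTER_EMAIL="nicename@nicedomain.com" ;;
esac
export GIT_AUTHOR_NAME GIT_AUTHOR_EMAIL GIT_COMMITTER_NAME GIT_COMMITTER_EMAIL
' HEAD >/dev/null

# One line per distinct "Name <email>" used as author or committer.
identities=$(git log --format='%an <%ae>%n%cn <%ce>' HEAD | sort -u)
echo "$identities"
```

Any line in that listing that still shows a UUID after the @ points at a user who needs another case branch.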

STEP 4: Drop commits from a particular author

We had a situation where the end result of someone’s changes was fine to release (e.g., a bug is fixed), but some intermediate states were problematic (e.g., another flavor of the unpublished-research problem). We can just drop the commits by that author. Note that skip_commit is a shell function that filter-branch makes available to a --commit-filter.

$ git filter-branch --commit-filter '
if [ "$GIT_AUTHOR_EMAIL" = "userToDrop@23bd32ad-1234-e3ac-907f-1209acedead1" ]
then
skip_commit "$@"
else
git commit-tree "$@"
fi' HEAD
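One detail worth knowing: skip_commit leaves out the commit but not its changes, which end up folded into the next commit that survives (the git-filter-branch documentation says as much). Here is a minimal sketch on a throwaway repository, with made-up authors and file contents:

```shell
#!/bin/sh
# Throwaway demo repository; all names below are hypothetical.
set -eu
export FILTER_BRANCH_SQUELCH_WARNING=1
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git config user.name "Keeper"
git config user.email "keeper@example.invalid"

echo "v1" > fizz.c
git add fizz.c && git commit -q -m "Initial fizz"
echo "intermediate" >> fizz.c
git add fizz.c
GIT_AUTHOR_NAME="Dropped" GIT_AUTHOR_EMAIL="drop@example.invalid" \
  git commit -q -m "Problematic intermediate state"
echo "fixed" >> fizz.c
git add fizz.c && git commit -q -m "Final fix"

git filter-branch -d "$work/rewrite" --commit-filter '
if [ "$GIT_AUTHOR_EMAIL" = "drop@example.invalid" ]
then
  skip_commit "$@"
else
  git commit-tree "$@"
fi' HEAD >/dev/null

subjects=$(git log --format=%s HEAD)
authors=$(git log --format=%ae HEAD | sort -u)
final_content=$(git show HEAD:fizz.c)
echo "$subjects"
echo "$authors"
echo "$final_content"
```

The dropped author's commit and name are gone from the log, but the line they added is still in fizz.c, now attributed to the following commit. That is exactly what we wanted here; if the changes themselves have to go, this is the wrong tool.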

STEP 5: Integrate dancing-pigs work from separate SVN repository

This is basically a repeat of lots of the above steps for a separate SVN repo, so it’s more condensed:

$ git svn clone https://additional-svn-server/dancing-pigs dp.git
$ cd dp.git

$ git filter-branch -d /dev/shm/git --index-filter \
"git rm -qrf --cached --ignore-unmatch dp-unrelated" \
--prune-empty HEAD

$ git filter-branch -d /dev/shm/git --env-filter '
an="$GIT_AUTHOR_NAME"
am="$GIT_AUTHOR_EMAIL"
cn="$GIT_COMMITTER_NAME"
cm="$GIT_COMMITTER_EMAIL"
if [ "$GIT_COMMITTER_EMAIL" = "uglyname@32df21ae-3ed5-5678-ae11-54211aeedfed" ]
then
cn="Developer Two"
cm="dev2@favoritedomain.com"
fi
if [ "$GIT_AUTHOR_EMAIL" = "uglyname@32df21ae-3ed5-5678-ae11-54211aeedfed" ]
then
an="Developer Two"
am="dev2@favoritedomain.com"
fi
export GIT_AUTHOR_NAME="$an"
export GIT_AUTHOR_EMAIL="$am"
export GIT_COMMITTER_NAME="$cn"
export GIT_COMMITTER_EMAIL="$cm"
' HEAD

Here’s an interesting trick. We know we’re going to be pulling this into our release.git repository as a subdirectory, so we can do a little directory structure manipulation to make the merge easier. There are other ways to accomplish this goal, but since we’re doing so much filter-branching anyways, this solution worked well for us.

$ git filter-branch -d /dev/shm/git --tree-filter '
if [ -d trunk ]
then
mkdir -p dancing-pigs
mv trunk dancing-pigs
fi
' HEAD

$ git filter-branch -f -d /dev/shm/git --tree-filter '
perl /path/to/find-chS-and-add-license.pl
' --prune-empty HEAD

We also choose to rename master to something more descriptive, to give us a nicer looking history when we merge.

$ git branch -m master dancing-pigs-v3

STEP 6: Pull dancing-pigs into our main repository

$ cd /path/to/release.git
$ git remote add dancing-pigs /path/to/dp.git
$ git fetch dancing-pigs
$ git merge dancing-pigs/dancing-pigs-v3

One can also git remote add -f to add and fetch at the same time, but the three-step version is in my opinion easier to understand (-f usually means force, but not in this case).
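A note for newer setups: the two repositories share no common ancestor, and Git 2.9 and later refuse such a merge unless --allow-unrelated-histories is passed. The sketch below replays this step with two throwaway repositories standing in for release.git and dp.git. It assumes Git 2.9 or newer, and all paths and file contents are invented.

```shell
#!/bin/sh
# Two throwaway repositories; all names below are hypothetical.
set -eu
work=$(mktemp -d)
ident() {
  git -C "$1" config user.name "Demo User"
  git -C "$1" config user.email "demo@example.invalid"
}

git init -q "$work/release.git"; ident "$work/release.git"
git init -q "$work/dp.git";      ident "$work/dp.git"

cd "$work/release.git"
mkdir fizz && echo "fizz" > fizz/main.c
git add . && git commit -q -m "Add fizz"

cd "$work/dp.git"
mkdir -p dancing-pigs/trunk && echo "oink" > dancing-pigs/trunk/pigs.c
git add . && git commit -q -m "Add dancing pigs"
git branch -m dancing-pigs-v3    # rename whatever the default branch is called

cd "$work/release.git"
git remote add dancing-pigs "$work/dp.git"
git fetch -q dancing-pigs
# Without --allow-unrelated-histories, Git 2.9+ stops with
# "refusing to merge unrelated histories".
git merge -q --allow-unrelated-histories -m "Merge dancing-pigs history" \
  dancing-pigs/dancing-pigs-v3

merged_files=$(git ls-tree -r --name-only HEAD)
echo "$merged_files"
git log --format=%s HEAD
```

Because dancing-pigs already lives in its own subdirectory, the two trees do not overlap and the merge has nothing to conflict on.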

Really, we’re done at this point. release.git is a thing of beauty: complete history, pretty-printed author names, only the code of interest, exhaustive license headers, … However, to share it, one must put it somewhere public.

STEP 7: Push to remote repository

$ git remote rm origin
$ git remote add origin git://username@host/path/to/remote-fizz-buzz
$ git push -u origin master

The remote rm origin is necessary if some other origin is already defined. For us, it was, since this came from git-svn. You may also name your remote repo something else, of course. But for us it is to become the new origin.
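Before the push, a last automated pass over the rewritten history can catch leftovers. This is only a sketch: the forbidden path patterns come from the example layout at the top of this post, the check_history function name is made up, and a clean result is no substitute for reading the log yourself.

```shell
#!/bin/sh
set -eu

# Look for leftovers in the history reachable from HEAD.
# Returns non-zero if a forbidden path or a git-svn style address is found.
check_history() {
  bad_paths=$(git log --name-only --pretty=format: HEAD |
    grep -E '^(unpublished|unrelated)(/|$)|^buzz/profanity/|^fizz/foo/binary-abyss' |
    sort -u)
  # git-svn identities look like name@<uuid>; flag any address ending in a UUID.
  bad_idents=$(git log --format='%ae%n%ce' HEAD |
    grep -E '@[0-9a-f]{8}-([0-9a-f]{4}-){3}[0-9a-f]{12}$' |
    sort -u)
  [ -z "$bad_paths" ] && [ -z "$bad_idents" ]
}

# Throwaway demo repository; all names below are hypothetical.
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git config user.name "Nice Name"
git config user.email "nicename@nicedomain.com"

mkdir fizz && echo "fizz" > fizz/main.c
git add . && git commit -q -m "Add fizz"
if check_history; then first_result=clean; else first_result=dirty; fi

mkdir unpublished && echo "draft" > unpublished/paper.tex
git add . && git commit -q -m "Oops"
if check_history; then second_result=clean; else second_result=dirty; fi

echo "first check: $first_result, second check: $second_result"
echo "offending paths: $bad_paths"
```

In a real run, only the check_history function is needed, called from inside release.git right before git push.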

If you’re still here, thanks! I hope you found this useful and informative.
