Locales problems on Debian or Ubuntu

I use debootstrap to install a lot of Debian or Ubuntu systems. This often leads to “locales” problems. Examples of the errors you might see include (from aptitude or apt-get):

perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = (unset),
LC_ALL = (unset),
LANG = "en_US.UTF-8"
are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory

…or from gparted:

(process:16644): Gtk-WARNING **: Locale not supported by C library.
Using the fallback 'C' locale.
libparted : 2.3
(gpartedbin:16644): glibmm-ERROR **:
unhandled exception (type std::exception) in signal handler:
what: locale::facet::_S_create_c_locale name not valid

This fix has been the most reliable that I have found:

export LANGUAGE=en_US.UTF-8
export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
locale-gen en_US.UTF-8
dpkg-reconfigure locales

In particular, just running dpkg-reconfigure locales rarely accomplishes anything useful for me. If it doesn’t produce output, it probably didn’t fix your problem.
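To confirm that the fix actually took, it can help to ask the C library directly whether it can activate the locale. The following is a minimal sketch, not part of the original fix; the helper name locale_available is my own invention, and it only assumes Python's standard locale module.

```python
import locale


def locale_available(name):
    """Return True if the C library can activate the named locale."""
    saved = locale.setlocale(locale.LC_ALL)  # query (not set) the current locale
    try:
        locale.setlocale(locale.LC_ALL, name)
        return True
    except locale.Error:
        # Same root cause as "Cannot set LC_ALL to default locale".
        return False
    finally:
        locale.setlocale(locale.LC_ALL, saved)  # restore the previous setting


if __name__ == "__main__":
    for name in ("C", "en_US.UTF-8"):
        print(name, "ok" if locale_available(name) else "MISSING")
```

If en_US.UTF-8 is reported as missing after running locale-gen, the locale was probably not generated.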

Exporting Redmine issues and then importing them to a SourceForge project

I had 60-something issues in a Redmine installation that I managed myself, and I wanted to import them into a new SourceForge project. Redmine has native support for exporting these issues to a CSV file. It turns out that CSV is mildly richer than I had previously thought, and cleanly supports things like a single “cell” in the resulting spreadsheet containing a large, multi-line description full of quotes and commas.

The SourceForge API v2.0 Beta is scriptable. Awesome. The example at that page uses Python, and I like Python, so we’re good. The steps are roughly as follows:

  • Create an “oauth application” in your SourceForge account here. You will end up with a key and a secret from the registration process, which you will need to paste into the relevant scripts.
  • Thanks to the existing SourceForge example, the scripting related to OAuth login was already done for me. Note that webbrowser.open() is used to allow you (the human) to manually copy/paste a per-session PIN. Your OS needs to support the launch of a web browser. Mine (Ubuntu 10.04) did without issue.
  • Use the Python CSV package to parse the Redmine-exported CSV file into a Python data structure (roughly an array of dict objects, but see DictReader.fieldnames to understand how it is more than that).
  • Add some custom scripting to map Redmine fields to supported SourceForge fields. A few fields map in a sensible way. The rest I just inserted with a descriptive prefix at the beginning of the primary Ticket Description.
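The parsing and mapping steps above can be sketched roughly as follows. This is not the script I published; the column names in FIELD_MAP, the target field names, and the sample row are all assumptions made for illustration, so check them against your own Redmine export and the SourceForge API documentation. No network calls are made here.

```python
import csv
import io

# Hypothetical mapping: Redmine CSV column -> SourceForge ticket field.
# Both sides are assumptions for illustration only.
FIELD_MAP = {"Subject": "summary", "Status": "status", "Assignee": "assigned_to"}


def redmine_row_to_ticket(row):
    """Map one parsed CSV row (a dict) to a ticket dict."""
    ticket = {sf: row.get(rm, "") for rm, sf in FIELD_MAP.items()}
    # Columns without a counterpart become "Redmine <name>: <value>" lines
    # placed at the beginning of the primary description.
    extras = [
        f"Redmine {key}: {value}"
        for key, value in row.items()
        if key not in FIELD_MAP and key != "Description" and value
    ]
    ticket["description"] = "\n".join(extras + ["", row.get("Description", "")])
    return ticket


def read_tickets(csv_file):
    """Parse an open CSV file object into a list of ticket dicts."""
    return [redmine_row_to_ticket(row) for row in csv.DictReader(csv_file)]


# A made-up export with a comma, embedded quotes, and a multi-line cell.
SAMPLE = (
    "Subject,Status,Priority,Description\r\n"
    '"Crash, on start",New,High,"Line one\nsays ""boom"""\r\n'
)

tickets = read_tickets(io.StringIO(SAMPLE, newline=""))
print(tickets[0]["summary"])
```

With a real export you would pass open("export.csv", newline="") instead of the StringIO object; the newline="" argument is what lets the csv module handle multi-line cells correctly.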

I placed the scripts on github (no, the irony is not lost on me, but my current preference is for long-lived things to be on SourceForge and quick-and-dirty things to be on github).

Code release history and aesthetics with git filter-branch and friends

Many academic projects that make the transition to open source are lucky to have made it that far at all, so what often happens is that the latest code is copied to a new repository, some license headers are pasted in, a README and friends are created, and that is made public. In the process, a lot of history is dropped on the floor. I think this is a shame, especially if the main consumers are likely to be other researchers, students, or developers for whom the ability to see how something developed adds considerable value. I recently put a lot of legwork into doing a release while trying hard to preserve history and have decent aesthetics (consistent line endings, no binary blobs appearing and disappearing in the history, yanking out those accidentally-committed proprietary files, giving committers consistent names and email addresses, preserving correct timestamps, collapsing embarrassing misguided commits, …). I accomplished all of this with git and its filter-branch tool, even though all of our history was in a Subversion repository. I am now a card-carrying member of the git-is-the-greatest-thing-ever club.

Here is our starting directory structure:

Our program is known for fizzing and buzzing, so we definitely want to keep /fizz and /buzz, except that an otherwise well-mannered developer once expressed their frustrations a little too explicitly in /buzz/profanity/timesink.txt, and some inexperienced developers accidentally checked in /fizz/foo/binary-abyss. We also have some /unpublished work that we need to be careful with, and as luck would have it we were working on /fizz while preparing a draft of /unpublished, and we made some large commits that touch both paths. Don’t want it showing up on the ‘net too soon. Because administering repositories isn’t actually all that exciting, some other folks did a lot of /unrelated work in our SVN repository. Oh, and I just remembered, we also want to release /dancing-pigs at the same time, but that work was part of a different SVN repository. So we make a list of relevant paths for the dancing-pigs project too:


STEP 0: Our tool of choice is git

Aiming for completeness here. 🙂 This is an important step!

$ git svn clone https://svn-server/path/to/project/root release.git
$ cd release.git

STEP 1: Annihilate unwanted files and directories

$ git filter-branch -d /dev/shm/git --index-filter \
"git rm -qrf --cached --ignore-unmatch unpublished ;\
git rm -qrf --cached --ignore-unmatch unrelated ;\
git rm -qrf --cached --ignore-unmatch buzz/profanity/timesink.txt ;\
git rm -qrf --cached --ignore-unmatch fizz/foo/binary-abyss" \
--prune-empty HEAD

What’s going on here?

  • filter-branch: “Lets you rewrite git revision history.” “Really, I know what I’m doing”. You will need to add -f to force it if you’re experimenting at home and try to run this more than once (because git is trying to be conservative against permanent loss of data).
  • -d /dev/shm/git: This tells git to use /dev/shm/git as a temporary directory (instead of /tmp or something inside your repository’s .git) while it’s working. /dev/shm is a ramdisk on most modern Linux distros, and this will get you a nice speedup. Our program actually does more than fizz and buzz, and this step is the most expensive.
  • --index-filter: This particular type of filter-branch works without actually checking out a copy of all your files once for every revision in history (i.e., it doesn’t update the working directory). It just notes internally that some things are getting deleted. We could have used a tree-filter and the system’s normal rm command and it would have accomplished the same result, but taken a lot longer due to the additional filesystem operations.
  • git rm: Because we’re doing an index-filter, we have to use git itself to do the rm, since the physical files aren’t actually being created and deleted each time.
    • -q: Be quiet. Do not print the name of each thing you delete. It’s a good idea to remove this when you’re experimenting, but you want it there for “the big run”.
    • -rf: Be recursive and force it. Just like traditional rm -rf.
    • --cached: Tell git rm not to bother updating the working directory. This is necessary to play nice with an --index-filter.
    • --ignore-unmatch: This tells git rm to exit cleanly even if the file it’s being asked to delete doesn’t exist in this particular commit. This is our first encounter with a constraint that we will see again: whatever command you run during the filter-branch must always exit cleanly. Otherwise the whole thing aborts. This is intended to be a recoverable situation but I found it easier to just keep around more than one clone while I was experimenting.
    • unpublished: This is a path. It can be a directory or a file. Just like with normal rm -rf. You do have to be conscious of troublesome files that get moved during the history of a project. I didn’t find any clean way to --follow such things. This was where a lot of the manual energy went: making a list of files and directories that shall be destroyed.
    • ;\: Normal shell syntax to execute more than one command and tolerate line-breaks.
  • --prune-empty: This tells git to just drop a commit if we’ve deleted everything that was touched by that commit. That way we don’t wind up with any empty commits or commits that reveal partial information about unpublished in their log message.
  • HEAD: This tells git where to start when working its way back through history. You’re welcome to filter-branch some other branch if you desire. There’s also a --all option, but I didn’t need it.
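Since most of the manual effort goes into maintaining the list of doomed paths, it may be convenient to keep that list in one place and generate the command from it. Below is a small sketch of that idea; the function names are my own, and the code only builds the argument list without running git.

```python
import shlex


def build_index_filter(paths):
    """Join one `git rm` per path into a single --index-filter script."""
    return " ; ".join(
        "git rm -qrf --cached --ignore-unmatch " + shlex.quote(path)
        for path in paths
    )


def build_filter_branch_argv(paths, tmpdir="/dev/shm/git"):
    """Return the argv list for the filter-branch call (not executed here)."""
    return [
        "git", "filter-branch", "-d", tmpdir,
        "--index-filter", build_index_filter(paths),
        "--prune-empty", "HEAD",
    ]


# The paths from the example above.
PATHS = [
    "unpublished",
    "unrelated",
    "buzz/profanity/timesink.txt",
    "fizz/foo/binary-abyss",
]

argv = build_filter_branch_argv(PATHS)
# Print a copy-pasteable version of the command for review.
print(" ".join(shlex.quote(arg) for arg in argv))
```

To actually run it, something like subprocess.run(argv, check=True) from inside the repository should work; check=True matters because a failing filter aborts the whole rewrite.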

All of this filter-branching junks up your git repository with a lot of stale objects (for a good reason: so that you can recover from errors and so that you don’t accidentally lose a lot of important data, but that’s not especially relevant for what we’re doing here today). They can be garbage-collected as described here:

$ git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
$ git reflog expire --verbose --expire=0 --all
$ git gc --prune=0

Or, you can just clone your repository to a new one to get a guaranteed clean start:

$ cd ..
$ git clone --no-hardlinks release.git release-clean.git
$ cd release-clean.git

STEP 2: Recursively insert License header in *.[chS]

Goal: insert license header into every *.[chS] file in the repository. Make it look like it’s always been there.

$ git filter-branch -d /dev/shm/git --tree-filter '
perl /path/to/find-chS-and-add-license.pl
' --prune-empty HEAD

This time we use a --tree-filter and we do write to the working copy’s filesystem (well, a working-copy in a ramdisk, thanks to -d /dev/shm/git). The perl script that I ended up using was the inspiration for the one described in this StackOverflow question. If your repository isn’t too big, then this kind of thing is likely efficient enough (you could easily drop the Perl script altogether and use a few shell commands):

$ git filter-branch -d /dev/shm/git --tree-filter '
find . -name "*.[chS]" -exec perl /path/to/just-add-license.pl \{\} \;
' --prune-empty HEAD
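For illustration, here is roughly what a find-and-add-license script could look like. This is a Python sketch, not the Perl script I actually used; the header text is a placeholder and the function name is made up. The important property for a --tree-filter is that it is idempotent, so a file that already starts with the header is left alone.

```python
import os
import tempfile

# Placeholder header; the real license text is not given in this post.
LICENSE_HEADER = b"/* LICENSE PLACEHOLDER: replace with the real header */\n"
SUFFIXES = (".c", ".h", ".S")


def add_license_headers(root, header=LICENSE_HEADER):
    """Prepend header to each *.c, *.h, *.S file under root; return count changed."""
    changed = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune .git in place so os.walk never descends into git internals.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            if not name.endswith(SUFFIXES):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                body = f.read()
            if body.startswith(header):
                continue  # header already present: nothing to do
            with open(path, "wb") as f:
                f.write(header + body)
            changed += 1
    return changed


# Small self-contained demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as demo:
    os.mkdir(os.path.join(demo, "fizz"))
    with open(os.path.join(demo, "fizz", "main.c"), "wb") as f:
        f.write(b"int main(void) { return 0; }\n")
    with open(os.path.join(demo, "README"), "wb") as f:
        f.write(b"not a source file\n")
    first_pass = add_license_headers(demo)   # modifies fizz/main.c only
    second_pass = add_license_headers(demo)  # finds nothing left to change
    print(first_pass, second_pass)
```

Working on bytes rather than text avoids guessing at file encodings and leaves existing line endings untouched.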

Time to clean or clone your repo again. Really only important if it is large, and these operations consume many minutes each.

STEP 3: Use pretty author names

git-svn ends up leaving you with a git repository with author and committer names like username <username@23bd32ad-1234-e3ac-907f-1209acedead1>. Yuck. This time we use an --env-filter to change environment variables. Thanks yet again StackOverflow. Insert additional if-fi blocks for each relevant user in your repository and note that git maintains both an Author name and a Committer name. (Yes, this shell script is a lot more verbose than it really needs to be, but it aids readability.)

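$ git filter-branch -d /dev/shm/git --env-filter '
an="$GIT_AUTHOR_NAME"; am="$GIT_AUTHOR_EMAIL"
cn="$GIT_COMMITTER_NAME"; cm="$GIT_COMMITTER_EMAIL"
if [ "$GIT_COMMITTER_EMAIL" = "user@23bd32ad-1234-e3ac-907f-1209acedead1" ]; then
cn="Nice Name"
cm="nice.name@example.com"   # placeholder address
fi
if [ "$GIT_AUTHOR_EMAIL" = "user@23bd32ad-1234-e3ac-907f-1209acedead1" ]; then
an="Nice Name"
am="nice.name@example.com"   # placeholder address
fi
export GIT_AUTHOR_NAME="$an" GIT_AUTHOR_EMAIL="$am"
export GIT_COMMITTER_NAME="$cn" GIT_COMMITTER_EMAIL="$cm"
' HEAD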
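With more than a handful of committers, writing the if-fi blocks by hand gets tedious. One option is to generate the --env-filter script from a lookup table. The sketch below is an assumption-laden illustration: the function name is mine, the e-mail address is a placeholder, and it emits a more compact script that exports directly inside each block.

```python
# svn-style e-mail -> (pretty name, real e-mail); the values are placeholders.
AUTHORS = {
    "user@23bd32ad-1234-e3ac-907f-1209acedead1": ("Nice Name", "nice.name@example.com"),
}

# Characters that would break out of the double-quoted shell strings below.
UNSAFE = set('"$`\\')


def build_env_filter(authors):
    """Return shell text for --env-filter with one if-fi block per user and role."""
    lines = []
    for old_email, (name, email) in authors.items():
        for value in (old_email, name, email):
            if UNSAFE & set(value):
                raise ValueError(f"unsafe character in {value!r}")
        # git tracks author and committer separately, so rewrite both.
        for role in ("AUTHOR", "COMMITTER"):
            lines += [
                f'if [ "$GIT_{role}_EMAIL" = "{old_email}" ]',
                "then",
                f'    export GIT_{role}_NAME="{name}"',
                f'    export GIT_{role}_EMAIL="{email}"',
                "fi",
            ]
    return "\n".join(lines) + "\n"


script = build_env_filter(AUTHORS)
print(script)
```

The resulting string can be passed as the argument that follows --env-filter, for example through subprocess.run with an argument list, which sidesteps a second layer of shell quoting.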

STEP 4: Drop commits from a particular author

We had a situation where the end result of someone’s changes was fine to release (e.g., a bug is fixed), but some intermediate states were problematic (e.g., another flavor of the unpublished-research problem). We can just drop the commits by that author. Note that skip_commit is a shell function that git filter-branch makes available to a --commit-filter.

$ git filter-branch --commit-filter '
if [ "$GIT_AUTHOR_EMAIL" = "userToDrop@23bd32ad-1234-e3ac-907f-1209acedead1" ]
then
skip_commit "$@"
else
git commit-tree "$@"
fi' HEAD

STEP 5: Integrate dancing-pigs work from separate SVN repository

This is basically a repeat of lots of the above steps for a separate SVN repo, so it’s more condensed:

$ git svn clone https://additional-svn-server/dancing-pigs dp.git
$ cd dp.git

$ git filter-branch -d /dev/shm/git --index-filter \
"git rm -qrf --cached --ignore-unmatch dp-unrelated" \
--prune-empty HEAD

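$ git filter-branch -d /dev/shm/git --env-filter '
an="$GIT_AUTHOR_NAME"; am="$GIT_AUTHOR_EMAIL"
cn="$GIT_COMMITTER_NAME"; cm="$GIT_COMMITTER_EMAIL"
if [ "$GIT_COMMITTER_EMAIL" = "uglyname@32df21ae-3ed5-5678-ae11-54211aeedfed" ]; then
cn="Developer Two"
cm="developer.two@example.com"   # placeholder address
fi
if [ "$GIT_AUTHOR_EMAIL" = "uglyname@32df21ae-3ed5-5678-ae11-54211aeedfed" ]; then
an="Developer Two"
am="developer.two@example.com"   # placeholder address
fi
export GIT_AUTHOR_NAME="$an" GIT_AUTHOR_EMAIL="$am"
export GIT_COMMITTER_NAME="$cn" GIT_COMMITTER_EMAIL="$cm"
' HEAD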

Here’s an interesting trick. We know we’re going to be pulling this into our release.git repository as a subdirectory, so we can do a little directory structure manipulation to make the merge easier. There are other ways to accomplish this goal, but since we’re doing so much filter-branching anyway, this solution worked well for us.

$ git filter-branch -d /dev/shm/git --tree-filter '
if [ -d trunk ]
then
mkdir -p dancing-pigs
mv trunk dancing-pigs
fi' HEAD

$ git filter-branch -f -d /dev/shm/git --tree-filter '
perl /path/to/find-chS-and-add-license.pl
' --prune-empty HEAD

We also choose to rename master to something more descriptive, to give us a nicer looking history when we merge.

$ git branch -m master dancing-pigs-v3

STEP 6: Pull dancing-pigs into our main repository

$ cd /path/to/release.git
$ git remote add dancing-pigs /path/to/dp.git
$ git fetch dancing-pigs
$ git merge dancing-pigs/dancing-pigs-v3

One can also git remote add -f to add and fetch at the same time, but the three-step version is in my opinion easier to understand (-f usually means force, but not in this case).

Really, we’re done at this point. release.git is a thing of beauty (complete history, pretty-printed author names, only the code of interest, exhaustive license headers, …). However, to share it, one must put it somewhere public.

STEP 7: Push to remote repository

$ git remote rm origin
$ git remote add origin git://username@host/path/to/remote-fizz-buzz
$ git push -u origin master

The remote rm origin is necessary if some other origin is already defined. For us, it was, since this came from git-svn. You may also name your remote repo something else, of course. But for us it is to become the new origin.

If you’re still here, thanks! I hope you found this useful and informative.

Using git-svn for svn disaster recovery

Real Life dictates that I use SVN servers outside my control, and that makes me nervous that they might go down or otherwise become unavailable near an important deadline. Here is how to cope with that thanks to the wonders of git and git-svn. Note: Experienced git users may balk at this. I do too now that I have some git experience. However, it is intended to help inexperienced SVN users cope. Maybe nothing can actually accomplish that goal? /rant

  • Assumes no-password ssh to localhost is possible.
  • Otherwise, several commands will prompt for a password

Demonstrate the basic operation of the SVN repo. Note: In the scenario of concern, localhost is replaced by some other system outside one’s administrative control.

svnadmin create /tmp/svnrepo
svn checkout svn+ssh://localhost/tmp/svnrepo /tmp/svncheckout
cd /tmp/svncheckout
echo "hello" > test.txt
svn add test.txt
svn ci -m "informative commit message"

Demonstrate git-svn

git svn clone svn+ssh://localhost/tmp/svnrepo /tmp/gitsvnclone
cd /tmp/gitsvnclone
echo "more stuff" >> test.txt
git add test.txt
git commit -m "added some more stuff"
git svn dcommit

Demonstrate disaster. The SVN server has gone down. However, we (let’s call ourselves “Alice”) have a local git repository with the full history of the SVN server. Our basic disaster recovery is to adopt the convention that this local git repository is the “new official server”. Another team member (“Bob”) can readily clone our git repository and work within it:

git clone git+ssh://localhost/tmp/gitsvnclone aregulargitclone

Here is what NOT TO DO…

cd /path/to/aregulargitclone
echo "more stuff" >> test.txt
git add test.txt
git commit -m "added some more stuff"
git push

…because you will get the following:

remote: error: refusing to update checked out branch: refs/heads/master

This is saying that things can get out of sync in the “new official server” git repository: Bob is pushing to master, which is the branch Alice currently has checked out, and that would leave her working copy out of step with it. What Bob needs to do is make sure his work is within a git “branch”. Below is what Bob should do. For more information about branches see, e.g., http://stackoverflow.com/questions/2670680/git-basic-workflow

cd /path/to/aregulargitclone

Now create a new branch with a unique name:
git checkout -b bobsedits master
…and then proceed to do work as usual.

echo "more stuff" >> test.txt
git add test.txt
git commit -m "added some more stuff"
git push origin bobsedits

This will work fine. Now there is a branch called “bobsedits” back in Alice’s git repository. Alice can merge the changes:

git checkout master
git merge bobsedits

Basically what has happened is that “master” has become the anointed stack of changesets, and each individual editor / developer must make their edits on a branch. Once the SVN server comes back online, Alice can make sure everybody’s branches are merged into master, and she can then ‘git svn dcommit’ them back to the server.