Cloning a Git Repository Piece by Piece

A while back, I ran into a problem downloading a vendor's Board Support Package (BSP) from their public GitLab instance. This git repository included the source code for Android and Linux, which is what I was interested in. It also included the toolchain package used by the vendor to compile that source code, as well as some prebuilt firmware binaries. In all, there were about 11 GiB of files.

The problem was that every time I tried to clone this repository, my HTTP connection was reset after exactly 1 GiB had been transferred:

$ git clone https://git.example.com/project/big-repo.git
Cloning into 'big-repo'...
remote: Enumerating objects: 600022, done.
remote: Counting objects: 100% (600022/600022), done.
remote: Compressing objects: 100% (390985/390985), done.
error: RPC failed; curl 18 transfer closed with outstanding read data remaining
fatal: the remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed

No matter which machine I tried, and whether I connected over IPv4 or IPv6, I could not overcome this limit. It was not a timeout, as the disconnection happened at the same number of bytes downloaded regardless of the connection speed. My best guess is that there was a misconfigured firewall or proxy between me and their GitLab server. In any case, in order to clone this repository, I would have to divide it into pieces.

Recovering from failure

The first problem to solve is that git clone deletes the local copy of the repository when it fails. This prevents examining the partial clone to see what went wrong. Fortunately, this is relatively simple to work around: we can split git clone into a series of separate commands, one for each step in the clone process. Then, if any step fails, we can immediately retry that step instead of starting over at the beginning. For example, the command:

git clone https://git.example.com/project/big-repo.git

becomes:

mkdir big-repo
pushd big-repo
git init
git remote add origin https://git.example.com/project/big-repo.git
git fetch origin
git checkout --track origin/HEAD
popd # Optional, only here for equivalence

Now, unsurprisingly, the git fetch command fails, in exactly the same way as git clone.

Going back to the beginning

git fetch is already designed to incrementally fetch new commits, so the simplest way to partition the repository is by its commit history. By default, git fetch downloads all branches and tags, but we can limit the download to a single branch or tag by specifying it on the command line.

For example, to download only the history up through the first release tag, assuming such a tag exists, we could run:

git fetch origin tag v0.1

GitLab provides a list of available tags in its web interface. If there are no tags, the same concept would work with an older release or feature branch. Unfortunately, the repository in question had no tags, and contained only a single branch with about 400 commits.

Is it possible to fetch only part of a branch? The answer is "maybe". Git has a few server-side options controlling what clients are allowed to fetch. Here they are, in order of increasing permissiveness:

uploadpack.allowTipSHA1InWant: Git allows servers to hide branches. This option allows clients to fetch hidden branches, if they know exactly what commit each branch points to.
uploadpack.allowReachableSHA1InWant: This option allows fetching just part of the history (that's what we want!), but not anything that has been made unreachable by a force push.
uploadpack.allowAnySHA1InWant: This option allows fetching "any object at all", even objects that are not commits.

By default, all of these options are set to false. Thankfully, GitLab has the most permissive option, uploadpack.allowAnySHA1InWant, enabled, so we can find the hash of the oldest commit in the web interface, and fetch only that commit:

git fetch origin f1e2fe1208cf # Fetch the first commit
git fetch origin --negotiation-tip f1e2fe1208cf # Fetch the remaining commits

Using the --negotiation-tip option here tells the server that we already have the history up through commit f1e2fe1208cf, so the server does not need to send that part.

It is also possible to incrementally download the history of a branch starting from the newest commits and working backward, using the --depth and --deepen options to git fetch:

git fetch origin --depth=10
git fetch origin --deepen=10
git fetch origin --deepen=10 # Repeat until nothing new is fetched
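
The "repeat until nothing new is fetched" step could be automated along these lines. This is only a sketch: the function name, the step size, and the injectable run parameter are my own choices for illustration, and it assumes a git new enough to support git rev-parse --is-shallow-repository.

```python
import subprocess


def deepen_until_complete(repo_dir, remote="origin", step=10, run=subprocess.run):
    """Fetch `step` more commits at a time until the clone is no longer shallow.

    Returns the number of fetches performed. A failed fetch raises
    subprocess.CalledProcessError (because of check=True), so the caller
    can retry with a smaller step.
    """
    def git(*args):
        return run(["git", "-C", repo_dir, *args],
                   check=True, capture_output=True, text=True)

    git("fetch", remote, f"--depth={step}")
    rounds = 1
    # A repository stays "shallow" as long as some history is still cut off.
    while git("rev-parse", "--is-shallow-repository").stdout.strip() == "true":
        git("fetch", remote, f"--deepen={step}")
        rounds += 1
    return rounds
```

Passing run as a parameter is just a convenience so the loop can be exercised without a real remote.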

In this case, neither option was sufficient, because the root commit was over the 1 GiB limit by itself. We need some way to download just part of that commit.

Fetching trees

Since GitLab has uploadpack.allowAnySHA1InWant enabled, we can fetch "tree" objects in addition to commits. Tree objects are what git uses to represent the directory structure. They contain a list of entries, each with a name, a mode, and a reference to some object. Entries for files reference "blob" objects, and entries for subdirectories reference other tree objects. For example, here is a pretty-printed version of the tree object for the Android source directory:

$ git cat-file -p f2a048d9eff3c0263dfb3acbeb2987f3a9f0f6eb
120000 blob 543755010469814f7f2cb4551169f2dc7828dcf8    Android.bp
100644 blob 6c8f7952e9cc029a8cf834762ae4fcbf9d4ab746    Makefile
040000 tree 19e61df5f2f7cbfa6892cd8f43b4c2c9bba15636    abi
040000 tree 9e9fa1147dfca7b69d8842472078cff3a4a5264e    art
040000 tree b3140f26c00d2d9242bea0615da837bf2079fb07    bionic
040000 tree c5158a134ce4179cdc6eba482b4ced14910f0680    bootable
120000 blob 1dfe7549a7983ca0968c74188da99c5c941e42d6    bootstrap.bash
100755 blob c763b9b060b9acd1797d6785ea90a578f29d0bb0    build.sh
040000 tree 140f15a5a3fbaedc09710f61155de4f0fd137835    build
040000 tree 4652dfc37674ee79d09e32118640d2d8c1ad65e7    cts
040000 tree e6907482727e9555241c09ec5781f8e1c47c479e    dalvik
040000 tree 34938f981ab6197f5a3a10290df4b191508120d3    developers
040000 tree 98b0d55fd1070cb78dba01b248c0531a2d70eaaa    development
040000 tree b2130017ab6dcf6a5e5728d62ce87bd0abdc9547    device
040000 tree bb449224fba6960c4b370a60a7aaf35e75807c73    docs
040000 tree 50304feae93616932253329be1f8ccdd80d80e63    external
040000 tree 6b4c7e1e07f1bf2ad1c6fbbb0c516c697d80398c    frameworks
040000 tree e271f84b9d51ce6a39aebf5fc7bea79d4d14b681    hardware
040000 tree 093b48e71014c5382d501669a1163202f727a8a7    kernel
040000 tree 2492878d074d91d73fb4aa349f596e82b358af28    libcore
040000 tree ad4e5f443784eb23aa1744c22f84354f8c6c442c    libnativehelper
040000 tree d61c51939fe8155a2526990f9410cd7260f442d0    ndk
040000 tree b6c09b927416b63d375eed3e54c465b7f6074712    packages
040000 tree e95b4103d65cfcec2894318df77a77a1734d1353    pdk
040000 tree 3964cc4d96505825adeba51481efb0a8f7dce7e6    platform_testing
040000 tree 6e41e803329a9336e3d0f40f3fc21bd7dde4312f    prebuilts
040000 tree de549ff8c45b0d5b9612ddae6b1e9aa2cbed3cb1    sdk
040000 tree 41e6db157d7febf76a51c76ebea12c6593e0bb62    shortcut-fe
040000 tree eee0f75c0dbd76b76ef4f10e08769f83aa7c0742    system
040000 tree b8c5f1dfd434fb571036e1b6cb30b352b55d2d76    toolchain
040000 tree acdfa1bcb1923b081ab195daa23871ad358d4a1a    tools
040000 tree f74851a49456551f142611ea452be092e7bf621a    vendor

So we could pass each of those hashes to git fetch, and each command would download just that one directory. Of course, since git fetch ensures there are no dangling object references, all of the contents of that directory would be included as well.
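
As a rough illustration, the listing printed by git cat-file -p can be turned into one fetch command per subdirectory. The names TreeEntry, parse_tree_listing, and fetch_commands are made up for this sketch, and the sample listing is a shortened copy of the one above.

```python
from typing import NamedTuple


class TreeEntry(NamedTuple):
    mode: str
    type: str
    sha: str
    name: str


def parse_tree_listing(text: str) -> list[TreeEntry]:
    """Parse the text printed by `git cat-file -p <tree>` into entries."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Fields are "<mode> <type> <hash> <name>"; the name may contain spaces,
        # so split at most three times and keep the rest as the name.
        mode, obj_type, sha, name = line.split(None, 3)
        entries.append(TreeEntry(mode, obj_type, sha, name))
    return entries


def fetch_commands(entries, remote="origin"):
    """Build one `git fetch` command line per subdirectory (tree) entry."""
    return [["git", "fetch", remote, e.sha] for e in entries if e.type == "tree"]


listing = """\
100644 blob 6c8f7952e9cc029a8cf834762ae4fcbf9d4ab746\tMakefile
040000 tree 19e61df5f2f7cbfa6892cd8f43b4c2c9bba15636\tabi
040000 tree 9e9fa1147dfca7b69d8842472078cff3a4a5264e\tart
"""

for cmd in fetch_commands(parse_tree_listing(listing)):
    print(" ".join(cmd))
```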

And that is a problem: once we have downloaded all of the subdirectories using that method, if we were to try to download the tree object shown above:

git fetch origin f2a048d9eff3c0263dfb3acbeb2987f3a9f0f6eb

git would fetch not only that one tree object, but also all of the subdirectory contents that we just fetched! And while git fetch has a --negotiation-tip option, that option only works on commits, not trees.

We have a second problem as well: we do not know the hashes of any tree objects in the repository! We know the commit hash from the web interface, but we cannot use that to guess even the root directory's tree hash.

Using the GitLab API

The solution to both problems is the GitLab API. It provides three API endpoints we need:

First, we need the repository's internal project ID, so we can use the other API endpoints. The easy way to get this is through the search functionality in the Projects API:

$ curl -s 'https://git.example.com/api/v4/projects?search=big-repo' | jq
[
  {
    "id": 136,
    "name": "big-repo",
    "name_with_namespace": "project / big-repo",
    ...
  }
]

The Repositories API contains a tree endpoint which provides the all-important tree hashes. It also provides names, types, and modes, which are incidentally what we need to create our own tree objects.

$ curl -s https://git.example.com/api/v4/projects/136/repository/tree | jq
[
  {
    "id": "f2a048d9eff3c0263dfb3acbeb2987f3a9f0f6eb",
    "name": "android",
    "type": "tree",
    "path": "android",
    "mode": "040000"
  },
  ...
]

The Repositories API also has a raw blob endpoint, which we can use to download the contents of individual files.
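
For reference, here is a small sketch that builds the URLs for the three endpoints. The helper names are hypothetical, and git.example.com is the same placeholder host used in the transcripts above. As far as I know, the tree endpoint is paginated (20 entries per page by default, at most 100), so a large directory needs several requests with increasing page numbers.

```python
from urllib.parse import urlencode

# Placeholder host, as in the curl examples above.
API = "https://git.example.com/api/v4"


def project_search_url(name: str) -> str:
    """Projects API: search for a project by name to learn its numeric ID."""
    return f"{API}/projects?{urlencode({'search': name})}"


def tree_url(project_id: int, path: str = "", ref: str = "", page: int = 1) -> str:
    """Repositories API: list one directory (one page of up to 100 entries)."""
    params = {"per_page": 100, "page": page}
    if path:
        params["path"] = path   # directory inside the repository
    if ref:
        params["ref"] = ref     # branch, tag, or commit; server default if omitted
    return f"{API}/projects/{project_id}/repository/tree?{urlencode(params)}"


def raw_blob_url(project_id: int, blob_sha: str) -> str:
    """Repositories API: raw contents of a single blob."""
    return f"{API}/projects/{project_id}/repository/blobs/{blob_sha}/raw"


print(tree_url(136, path="android/build"))
```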

Using only these API endpoints, we can reimplement git fetch for tree objects. Here's how it works: we start by requesting the contents of the root directory tree. We convert the response to git's internal tree format, and write that out to disk. Then for each entry in the tree:

If it is a blob, we download the raw contents of the file, and convert that to git's internal blob format.
If it is a tree, we perform this process again, recursively.
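
The recursion described above has roughly the following shape. This is a simplified sketch, not the code from any real tool: list_tree and read_blob are stand-ins for the two API calls, and a plain dict stands in for git's object store so that the example runs without a network connection.

```python
def fetch_tree(path, list_tree, read_blob, objects):
    """Recursively collect everything below `path` into `objects`.

    list_tree(path) -> list of dicts shaped like the tree endpoint's JSON
    read_blob(sha)  -> raw file contents as bytes
    objects         -> dict standing in for the on-disk object store
    """
    entries = list_tree(path)
    # A real implementation would serialize a git tree object here.
    objects[("tree", path)] = entries
    for entry in entries:
        if entry["type"] == "blob":
            objects[("blob", entry["id"])] = read_blob(entry["id"])
        elif entry["type"] == "tree":
            fetch_tree(entry["path"], list_tree, read_blob, objects)
        # Any other entry type (for example a submodule) is skipped in this sketch.
    return objects


# Fake data in place of the API responses (hashes shortened for readability).
fake_dirs = {
    "": [{"id": "aaa", "name": "android", "type": "tree",
          "path": "android", "mode": "040000"}],
    "android": [{"id": "bbb", "name": "Makefile", "type": "blob",
                 "path": "android/Makefile", "mode": "100644"}],
}
fake_blobs = {"bbb": b"all:\n"}

store = fetch_tree("", fake_dirs.__getitem__, fake_blobs.__getitem__, {})
print(sorted(store))
```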

While git's internal blob and tree formats are simple enough, there are some gotchas to be aware of:

Trees are sorted, but in an unusual order. Notice how build comes after build.sh in the tree printed above? That is because trees have a forward slash (/) implicitly appended to their name when being sorted. And that slash, at \x2f, comes after the full stop (\x2e).
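Objects must be compressed in the zlib format before being written to disk, but the object ID is the SHA-1 hash of the uncompressed contents, including the short "<type> <size>" header (terminated by a NUL byte) that git prepends to every object.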

This is one place where git's use of hashes really shines, because it makes the object IDs 100% deterministic. Once we have downloaded a file from GitLab and constructed its blob, we can verify its integrity by checking the blob's hash against the entry in the tree.
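
To make the gotchas and the integrity check concrete, here is a minimal sketch of the object encoding. The function names are mine, and the tree entries are assumed to be dicts shaped like the JSON from the tree endpoint; this is not the code from the package mentioned below.

```python
import hashlib
import zlib


def encode_object(obj_type: str, body: bytes) -> bytes:
    """Prefix the body with git's "<type> <size>" header and a NUL byte."""
    return f"{obj_type} {len(body)}".encode() + b"\0" + body


def object_id(obj_type: str, body: bytes) -> str:
    """The object ID is the SHA-1 of the *uncompressed* header plus body."""
    return hashlib.sha1(encode_object(obj_type, body)).hexdigest()


def loose_object(obj_type: str, body: bytes) -> tuple[str, bytes]:
    """Return (path relative to .git/objects, zlib-compressed bytes to write)."""
    oid = object_id(obj_type, body)
    return f"{oid[:2]}/{oid[2:]}", zlib.compress(encode_object(obj_type, body))


def tree_sort_key(entry: dict) -> bytes:
    """Directories sort as if their name ended in a slash."""
    name = entry["name"].encode()
    return name + b"/" if entry["type"] == "tree" else name


def encode_tree(entries: list[dict]) -> bytes:
    """Serialize entries into the body of a tree object."""
    body = b""
    for e in sorted(entries, key=tree_sort_key):
        mode = e["mode"].lstrip("0")  # stored as "40000", not "040000"
        # Each entry is "<mode> <name>", a NUL byte, then the 20-byte binary hash.
        body += mode.encode() + b" " + e["name"].encode() + b"\0" + bytes.fromhex(e["id"])
    return body


def verify_blob(expected_id: str, content: bytes) -> bool:
    """Check downloaded file contents against the hash listed in the tree."""
    return object_id("blob", content) == expected_id


print(object_id("blob", b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

The printed value is the well-known ID of the empty blob, which makes a handy sanity check for the header format.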

The result is the gitlab_fetch package, available on GitHub!

Our git fetch implementation is incredibly inefficient, because it makes a separate HTTP request for each object, but splitting the work across many HTTP requests is exactly what we need.

A better way

Fetching one file at a time using the GitLab API is so slow that I developed another method and used it to fetch the whole repository while gitlab_fetch was still running.

It turns out that --negotiation-tip not working with trees is only a client-side limitation in git fetch. The actual git smart protocol allows providing trees for both have and want lines in a git-upload-pack request. So we can implement a similar algorithm combining the GitLab API with manually-constructed git-upload-pack requests:

Get the tree contents via the GitLab API.
For each entry in the tree:
1. Attempt to fetch that entry using git fetch.
2. If it fails due to hitting the 1 GiB limit, recurse into that directory.
Fetch the tree object manually (e.g. using curl or Python), providing the hashes of all entries as haves. The response contains a packfile, which can be piped to git unpack-objects.
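
A hand-built request body for the last step might look like the sketch below. It assumes the original (version 0) smart protocol framing, where every line is prefixed with its total length as four hex digits. The function names are hypothetical, and the capability string is left empty; which capabilities to request, and how the response is framed as a result, depends on the server.

```python
FLUSH = b"0000"  # a "flush" packet marks the end of the want list


def pkt_line(payload: bytes) -> bytes:
    """Frame one line: four hex digits of total length, then the payload."""
    return f"{len(payload) + 4:04x}".encode() + payload


def upload_pack_request(want: str, haves: list[str], caps: str = "") -> bytes:
    """Build a git-upload-pack request body asking for one object.

    `want` is the object to fetch; `haves` are objects we already hold,
    so the server can leave them (and what they reference) out of the pack.
    """
    # Capabilities, if any, ride along on the first want line.
    first = f"want {want} {caps}\n" if caps else f"want {want}\n"
    body = pkt_line(first.encode()) + FLUSH
    for sha in haves:
        body += pkt_line(f"have {sha}\n".encode())
    return body + pkt_line(b"done\n")


request = upload_pack_request(
    "f2a048d9eff3c0263dfb3acbeb2987f3a9f0f6eb",    # the android tree
    ["19e61df5f2f7cbfa6892cd8f43b4c2c9bba15636"],  # a subdirectory we already fetched
)
print(request.decode())
```

The resulting bytes would be sent as a POST to the repository URL with /git-upload-pack appended, using the Content-Type application/x-git-upload-pack-request.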

This method also allows fetching the actual commit object, since we can tell the server that we have the top-level tree object. Constructing the commit object from the GitLab API alone would have required finding all of the bits of metadata (author, committer, date, message) and putting them together in exactly the correct format.

There's no code available for this method. I must admit I used vim to construct the requests, curl to send them, and dd to extract the packfiles from the responses. But it worked!

Conclusion

If you ever run into this incredibly specific scenario, it is indeed possible, with a bit of effort, to clone the repository you want.