Cloning a Git Repository Piece by Piece
Posted: 2021-04-24
A while back, I ran into a problem downloading a vendor's Board Support Package (BSP) from their public GitLab instance. This git repository included the source code for Android and Linux, which is what I was interested in. It also included the toolchain package used by the vendor to compile that source code, as well as some prebuilt firmware binaries. In all, there were about 11 GiB of files.
The problem was that every time I tried to clone this repository, my HTTP connection was reset after exactly 1 GiB had been transferred:
$ git clone https://git.example.com/project/big-repo.git
Cloning into 'big-repo'...
remote: Enumerating objects: 600022, done.
remote: Counting objects: 100% (600022/600022), done.
remote: Compressing objects: 100% (390985/390985), done.
error: RPC failed; curl 18 transfer closed with outstanding read data remaining
fatal: the remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed
No matter what machine I tried, connecting over IPv4 or IPv6, I could not overcome this limit. It was not a timeout, as the disconnection happened at the same number of bytes downloaded regardless of the connection speed. There must have been a misconfigured firewall between me and their GitLab server. In any case, in order to clone this repository, I would have to divide it into pieces.
Recovering from failure
The first problem to solve is that git clone deletes the local copy of the repository when it fails. This prevents examining the partial clone to see what went wrong. This problem is relatively simple; we can split git clone into a series of separate commands, one for each step in the clone process. Then, if any step fails, we can immediately retry that step instead of starting over at the beginning. For example, the command:
git clone https://git.example.com/project/big-repo.git
becomes:
mkdir big-repo
pushd big-repo
git init
git remote add origin https://git.example.com/project/big-repo.git
git fetch origin
git checkout --track origin/HEAD
popd # Optional, only here for equivalence
Now, unsurprisingly, the git fetch command fails in exactly the same way as git clone.
Going back to the beginning
git fetch is already designed to incrementally fetch new commits, so the simplest way to partition the repository is by its commit history. By default, git fetch downloads all branches and tags, but we can limit the download to a single branch or tag by specifying it on the command line.
For example, to download only the history up through the first release tag, assuming such a tag exists, we could run:
git fetch origin v0.1
GitLab provides a list of available tags in its web interface. If there are no tags, the same concept would work with an older release or feature branch. Unfortunately, the repository in question had no tags, and contained only a single branch with about 400 commits.
Is it possible to fetch only part of a branch? The answer is "maybe". Git has a few server-side options controlling what clients are allowed to fetch, in order of increasing permissiveness:
- uploadpack.allowTipSHA1InWant: Git allows servers to hide branches. This option allows clients to fetch hidden branches, if they know exactly what commit each branch points to.
- uploadpack.allowReachableSHA1InWant: This option allows fetching just part of the history (that's what we want!), but not anything that has been made unreachable by a force push.
- uploadpack.allowAnySHA1InWant: This option allows fetching "any object at all", even objects that are not commits.
By default, all of these options are set to false. Thankfully, GitLab has the most permissive option, uploadpack.allowAnySHA1InWant, enabled, so we can find the hash of the oldest commit in the web interface and fetch only it:
git fetch origin f1e2fe1208cf # Fetch the first commit
git fetch origin --negotiation-tip=f1e2fe1208cf # Fetch the remaining commits
Using the --negotiation-tip option here tells the server that we already have the history up through commit f1e2fe1208cf, so the server does not need to send that part.
It is also possible to incrementally download the history of a branch, starting from the newest end, using the --depth and --deepen options to git fetch:
git fetch origin --depth=10
git fetch origin --deepen=10
git fetch origin --deepen=10 # Repeat until nothing new is fetched
In this case, neither option was sufficient, because the root commit was over the 1 GiB limit by itself. We need some way to download just part of that commit.
Fetching trees
Since GitLab has uploadpack.allowAnySHA1InWant enabled, we can fetch "tree" objects in addition to commits. Tree objects are what git uses to represent the directory structure. They contain a list of entries, each with a name, a mode, and a reference to some object. Entries for files reference "blob" objects, and entries for subdirectories reference other tree objects. For example, here is a pretty-printed version of the tree object for the Android source directory:
$ git cat-file -p f2a048d9eff3c0263dfb3acbeb2987f3a9f0f6eb
120000 blob 543755010469814f7f2cb4551169f2dc7828dcf8 Android.bp
100644 blob 6c8f7952e9cc029a8cf834762ae4fcbf9d4ab746 Makefile
040000 tree 19e61df5f2f7cbfa6892cd8f43b4c2c9bba15636 abi
040000 tree 9e9fa1147dfca7b69d8842472078cff3a4a5264e art
040000 tree b3140f26c00d2d9242bea0615da837bf2079fb07 bionic
040000 tree c5158a134ce4179cdc6eba482b4ced14910f0680 bootable
120000 blob 1dfe7549a7983ca0968c74188da99c5c941e42d6 bootstrap.bash
100755 blob c763b9b060b9acd1797d6785ea90a578f29d0bb0 build.sh
040000 tree 140f15a5a3fbaedc09710f61155de4f0fd137835 build
040000 tree 4652dfc37674ee79d09e32118640d2d8c1ad65e7 cts
040000 tree e6907482727e9555241c09ec5781f8e1c47c479e dalvik
040000 tree 34938f981ab6197f5a3a10290df4b191508120d3 developers
040000 tree 98b0d55fd1070cb78dba01b248c0531a2d70eaaa development
040000 tree b2130017ab6dcf6a5e5728d62ce87bd0abdc9547 device
040000 tree bb449224fba6960c4b370a60a7aaf35e75807c73 docs
040000 tree 50304feae93616932253329be1f8ccdd80d80e63 external
040000 tree 6b4c7e1e07f1bf2ad1c6fbbb0c516c697d80398c frameworks
040000 tree e271f84b9d51ce6a39aebf5fc7bea79d4d14b681 hardware
040000 tree 093b48e71014c5382d501669a1163202f727a8a7 kernel
040000 tree 2492878d074d91d73fb4aa349f596e82b358af28 libcore
040000 tree ad4e5f443784eb23aa1744c22f84354f8c6c442c libnativehelper
040000 tree d61c51939fe8155a2526990f9410cd7260f442d0 ndk
040000 tree b6c09b927416b63d375eed3e54c465b7f6074712 packages
040000 tree e95b4103d65cfcec2894318df77a77a1734d1353 pdk
040000 tree 3964cc4d96505825adeba51481efb0a8f7dce7e6 platform_testing
040000 tree 6e41e803329a9336e3d0f40f3fc21bd7dde4312f prebuilts
040000 tree de549ff8c45b0d5b9612ddae6b1e9aa2cbed3cb1 sdk
040000 tree 41e6db157d7febf76a51c76ebea12c6593e0bb62 shortcut-fe
040000 tree eee0f75c0dbd76b76ef4f10e08769f83aa7c0742 system
040000 tree b8c5f1dfd434fb571036e1b6cb30b352b55d2d76 toolchain
040000 tree acdfa1bcb1923b081ab195daa23871ad358d4a1a tools
040000 tree f74851a49456551f142611ea452be092e7bf621a vendor
So we could pass each of those hashes to git fetch, and each command would download just that one directory. Of course, since git fetch ensures there are no dangling object references, all of the contents of that directory would be included as well.
And that is a problem: once we have downloaded all of the subdirectories using that method, if we were to try to download the tree object shown above:
git fetch origin f2a048d9eff3c0263dfb3acbeb2987f3a9f0f6eb
git would fetch not only that one tree object, but also all of the subdirectory contents that we just fetched! And while git fetch has a --negotiation-tip option, that option only works on commits, not trees.
We have a second problem as well: we do not know the hashes of any tree objects in the repository! We know the commit hash from the web interface, but we cannot use that to guess even the root directory's tree hash.
Using the GitLab API
The solution to both problems is the GitLab API. It provides three API endpoints we need:
- First, we need the repository's internal project ID, so we can use the other API endpoints. The easy way to get this is through the search functionality in the Projects API:

$ curl -s https://git.example.com/api/v4/projects?search=big-repo | jq
[
  {
    "id": 136,
    "name": "big-repo",
    "name_with_namespace": "project / big-repo",
    ...
  }
]

- The Repositories API contains a tree endpoint which provides the all-important tree hashes. It also provides names, types, and modes, which are incidentally what we need to create our own tree objects.

$ curl -s https://git.example.com/api/v4/projects/136/repository/tree | jq
[
  {
    "id": "f2a048d9eff3c0263dfb3acbeb2987f3a9f0f6eb",
    "name": "android",
    "type": "tree",
    "path": "android",
    "mode": "040000"
  },
  ...
]

- The Repositories API also has a raw blob endpoint, which we can use to download the contents of individual files.
Using only these API endpoints, we can reimplement git fetch for tree objects. Here's how it works: we start by requesting the contents of the root directory tree. We convert the response to git's internal tree format, and write that out to disk. Then for each entry in the tree:
- If it is a blob, we download the raw contents of the file, and convert that to git's internal blob format.
- If it is a tree, we perform this process again, recursively.
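The blob half of that conversion fits in a few lines. The sketch below is mine, not code from gitlab_fetch, and the function names are invented, but the format itself is git's loose object format: a "blob <size>" header, a NUL byte, then the raw contents, zlib-compressed only when written to disk.

```python
import hashlib
import os
import zlib

def blob_object(contents: bytes):
    # A blob object is the header "blob <size>\0" followed by the raw
    # contents; the object ID is the SHA-1 of that uncompressed whole.
    obj = b"blob %d\x00" % len(contents) + contents
    return hashlib.sha1(obj).hexdigest(), obj

def write_object(git_dir, oid, obj):
    # Loose objects live at .git/objects/<first 2 hex chars>/<other 38>,
    # and only the on-disk copy is zlib-compressed.
    path = os.path.join(git_dir, "objects", oid[:2], oid[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(obj))

# The empty blob has a well-known object ID:
oid, obj = blob_object(b"")
print(oid)  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

An object written this way is indistinguishable from one unpacked by git itself, which is what makes the piecewise approach work.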
While git's internal blob and tree formats are simple enough, there are some gotchas to be aware of:
- Trees are sorted, but in an unusual order. Notice how build comes after build.sh in the tree printed above? That is because trees have a forward slash (/) implicitly appended to their name when being sorted, and that slash, at \x2f, comes after the full stop (\x2e).
- Objects must be compressed in the zlib format before being written to disk, but the object ID is the SHA-1 hash of the uncompressed contents.
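Both gotchas show up in the tree-building step, sketched below with helper names of my own. One extra detail not visible in the pretty-printed listing: git stores a directory's mode as "40000", without the leading zero that git cat-file and the GitLab API display.

```python
import hashlib

def tree_sort_key(name, is_tree):
    # Git sorts tree entries as if directory names had a trailing "/",
    # which is why "build.sh" (where "." is \x2e) sorts before
    # "build/" (where "/" is \x2f).
    return name.encode() + (b"/" if is_tree else b"")

def tree_object(entries):
    # Build a git tree object from (mode, name, hex_sha) entries.
    # Each serialized entry is: mode, space, name, NUL, 20 raw SHA-1
    # bytes. Modes are written without leading zeros ("40000").
    body = b""
    for mode, name, sha in sorted(
            entries, key=lambda e: tree_sort_key(e[1], e[0] == "040000")):
        body += b"%s %s\x00%s" % (
            mode.lstrip("0").encode(), name.encode(), bytes.fromhex(sha))
    obj = b"tree %d\x00" % len(body) + body
    return hashlib.sha1(obj).hexdigest(), obj

# The empty tree also has a well-known object ID:
print(tree_object([])[0])  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904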
This is one place where git's use of hashes really shines, because it makes the object IDs 100% deterministic. Once we have downloaded a file from GitLab and constructed its blob, we can verify its integrity by checking the blob's hash against the entry in the tree.
The result is the gitlab_fetch package, available on GitHub!
Our git fetch implementation is incredibly inefficient, because it makes a separate HTTP request for each object, but splitting the work across many HTTP requests is exactly what we need.
A better way
Fetching one file at a time using the GitLab API is so slow that I developed another method and used it to fetch the whole repository while gitlab_fetch was still running.
It turns out that --negotiation-tip not working with trees is only a client-side limitation in git fetch. The actual git smart protocol allows providing trees for both have and want lines in a git-upload-pack request. So we can implement a similar algorithm combining the GitLab API with manually-constructed git-upload-pack requests:
- Get the tree contents via the GitLab API.
- For each entry in the tree:
  - Attempt to fetch that entry using git fetch.
  - If it fails due to hitting the 1 GiB limit, recurse into that directory.
- Fetch the tree object manually (e.g. using curl or Python), providing the hashes of all entries as haves. This will return a packfile, which can be passed to git unpack-objects.
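Those manually-constructed requests use git's pkt-line framing: every line is prefixed with its own length (including the four length digits themselves) as four hex digits, and "0000" is a flush packet that ends a section. Here is a minimal sketch of just the framing; a real request also has to advertise capabilities on the first want line and parse the server's response, which this ignores.

```python
def pkt_line(payload: bytes) -> bytes:
    # The length prefix counts the whole pkt-line, including the four
    # hex digits of the prefix itself.
    return b"%04x" % (len(payload) + 4) + payload

FLUSH = b"0000"  # a flush-pkt has no payload and ends a section

def upload_pack_request(wants, haves):
    # want/have lines carry full 40-character object IDs; with
    # uploadpack.allowAnySHA1InWant these can be tree IDs, not just
    # commits, which is the whole trick.
    body = b"".join(pkt_line(b"want %s\n" % w.encode()) for w in wants)
    body += FLUSH
    body += b"".join(pkt_line(b"have %s\n" % h.encode()) for h in haves)
    body += pkt_line(b"done\n")
    return body

req = upload_pack_request(["f2a048d9eff3c0263dfb3acbeb2987f3a9f0f6eb"], [])
print(req[:4])  # b'0032' -- 4 + len("want <40 hex chars>\n") = 50 = 0x32
```

The body is POSTed to the repository's /info/refs peer endpoint, git-upload-pack; the packfile arrives in the response.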
This method also allows fetching the actual commit object, since we can tell the server that we have the top-level tree object. Constructing the commit object from the GitLab API alone would have required finding all of the bits of metadata (author, committer, date, message) and putting them together in exactly the correct format.
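To give a sense of how strict that format is, here is a sketch of the serialization (the function name and the example identity string are mine; reproducing a server-side commit ID requires matching every byte, including the Unix timestamp, UTC offset, and the trailing newline of the message):

```python
import hashlib

def commit_object(tree, parents, author, committer, message):
    # A commit object names its tree and parents, then gives the
    # author and committer lines, a blank line, and the message.
    # `author`/`committer` are full identity lines in the form
    # "Name <email> <unix timestamp> <utc offset>".
    lines = ["tree %s" % tree]
    lines += ["parent %s" % p for p in parents]
    lines += ["author %s" % author, "committer %s" % committer, "", message]
    body = "\n".join(lines).encode()
    obj = b"commit %d\x00" % len(body) + body
    return hashlib.sha1(obj).hexdigest(), obj

# Hypothetical identity, for illustration only:
ident = "Jane Doe <jane@example.com> 1619222400 +0000"
oid, obj = commit_object(
    "4b825dc642cb6eb9a060e54bf8d69288fbee4904", [], ident, ident,
    "Initial commit\n")
```

Getting every one of those bytes right from API metadata alone is possible, but telling the server we already have the tree was much less work.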
There's no code available for this method. I must admit I used vim to construct the requests, curl to send them, and dd to extract the packfiles from the responses. But it worked!
Conclusion
If you ever run into this incredibly specific scenario, with a bit of effort, it is indeed possible to clone the repository you want.