Handle cloning of objects larger than 4GB on Windows #2102
dscho wants to merge 6 commits into gitgitgadget:master
Conversation
When unpacking objects from a packfile, the object size is decoded from a variable-length encoding. On platforms where unsigned long is 32-bit (such as Windows, even in 64-bit builds), the shift operation overflows when decoding sizes larger than 4GB. The result is a truncated size value, causing the unpacked object to be corrupted or rejected. Fix this by changing the size variable to size_t, which is 64-bit on 64-bit platforms, and ensuring the shift arithmetic occurs in 64-bit space. This was originally authored by LordKiRon <https://github.com/LordKiRon>, who preferred not to reveal their real name and therefore agreed that I take over authorship. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
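For illustration, the decoding loop in question has roughly the following shape; the function name and signature here are hypothetical, but the encoding (4 size bits and the object type in the first byte, then 7 bits per continuation byte) follows the pack format:

```c
#include <stddef.h>

/*
 * Illustrative sketch only, not the exact patch: decode the
 * variable-length object size from a pack entry header. With an
 * `unsigned long` accumulator, the shift overflows once it reaches
 * 32 on Windows; a size_t accumulator plus a cast before the shift
 * keeps the arithmetic in 64-bit space.
 */
static size_t decode_pack_entry_size(const unsigned char **bufp,
				     unsigned *typep)
{
	const unsigned char *buf = *bufp;
	unsigned char c = *buf++;
	size_t size = c & 0x0f;		/* low 4 bits of the size */
	int shift = 4;

	*typep = (c >> 4) & 0x07;	/* object type lives in bits 4-6 */
	while (c & 0x80) {		/* continuation bit set? */
		c = *buf++;
		/* cast first so the shift happens in 64-bit space */
		size += (size_t)(c & 0x7f) << shift;
		shift += 7;
	}
	*bufp = buf;
	return size;
}
```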
On Windows, zlib's `uLong` type is 32-bit even on 64-bit systems. When processing data streams larger than 4GB, the `total_in` and `total_out` fields in zlib's `z_stream` structure wrap around, which caused the sanity checks in `zlib_post_call()` to trigger `BUG()` assertions. The git_zstream wrapper now tracks its own 64-bit totals rather than copying them from zlib. The sanity checks compare only the low bits, using `maximum_unsigned_value_of_type(uLong)` to mask appropriately for the platform's `uLong` size. This is based on work by LordKiRon in git-for-windows#6076. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
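A minimal sketch of that bookkeeping, with hypothetical names standing in for `git_zstream` and `zlib_post_call()` (the real functions derive the consumed/produced byte counts from the stream state rather than taking them as parameters):

```c
#include <stdint.h>
#include <stdlib.h>
#include <zlib.h>

struct zstream_sketch {
	z_stream z;
	uint64_t total_in;	/* 64-bit shadows of z.total_in/total_out, */
	uint64_t total_out;	/* which wrap at 4GB where uLong is 32-bit */
};

static void post_call_sketch(struct zstream_sketch *s,
			     uLong consumed, uLong produced)
{
	/* mirrors maximum_unsigned_value_of_type(uLong) */
	const uint64_t mask = (uLong)-1;

	/* accumulate in 64 bits ourselves... */
	s->total_in += consumed;
	s->total_out += produced;

	/* ...and sanity-check only the low, uLong-sized bits */
	if ((s->total_in & mask) != s->z.total_in ||
	    (s->total_out & mask) != s->z.total_out)
		abort();	/* stands in for Git's BUG() */
}
```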
The odb_read_stream structure uses unsigned long for the size field, which is 32-bit on Windows even in 64-bit builds. When streaming objects larger than 4GB, the size would be truncated to zero or an incorrect value, resulting in empty files being written to disk. Change the size field in odb_read_stream to size_t and introduce unpack_object_header_sz() to return sizes via size_t pointer. Since object_info.sizep remains unsigned long for API compatibility, use temporary variables where the types differ, with comments noting the truncation limitation for code paths that still use unsigned long. This was originally authored by LordKiRon <https://github.com/LordKiRon>, who preferred not to reveal their real name and therefore agreed that I take over authorship. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
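The temporary-variable pattern might look roughly like this; the struct and helper names below are hypothetical stand-ins, not Git's actual definitions:

```c
#include <stddef.h>

struct stream_sketch {
	size_t size;		/* widened field: was unsigned long */
};

/* stands in for a legacy API whose size pointer is unsigned long */
static void legacy_query_size(unsigned long *sizep)
{
	*sizep = 42;		/* placeholder result */
}

static void open_stream_sketch(struct stream_sketch *st)
{
	unsigned long tmp;	/* temporary matching the legacy type */

	legacy_query_size(&tmp);

	/*
	 * NOTE: this call path still truncates above 4GB where
	 * unsigned long is 32-bit; the series marks such sites with a
	 * comment rather than widening the whole object_info API.
	 */
	st->size = tmp;
}
```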
The delta header decoding functions return unsigned long, which truncates on Windows for objects larger than 4GB. Introduce size_t variants get_delta_hdr_size_sz() and get_size_from_delta_sz() that preserve the full 64-bit size, and use them in packed_object_info() where the size is needed for streaming decisions. This was originally authored by LordKiRon <https://github.com/LordKiRon>, who preferred not to reveal their real name and therefore agreed that I take over authorship. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
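As a rough illustration (the real `get_delta_hdr_size_sz()` may differ in detail), the delta header encodes the size as a little-endian sequence of 7-bit groups, so a size_t accumulator is enough to avoid the overflow:

```c
#include <stddef.h>

/* Hedged sketch of a size_t delta-header decoder. */
static inline size_t delta_hdr_size_sketch(const unsigned char **datap,
					   const unsigned char *top)
{
	const unsigned char *data = *datap;
	size_t cmd, size = 0;
	int shift = 0;

	/* 7 bits per byte, least significant group first */
	do {
		cmd = *data++;
		/* cmd is size_t, so the shift happens in 64-bit space */
		size |= (cmd & 0x7f) << shift;
		shift += 7;
	} while ((cmd & 0x80) && data < top);

	*datap = data;
	return size;
}
```

The quoted diff further down shows the complementary half of this approach: the legacy `unsigned long` function becomes a thin wrapper that down-casts the size_t result.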
To test Git's behavior with very large pack files, we need a way to generate such files quickly. A naive approach using only readily-available Git commands would take over 10 hours for a 4GB pack file, which is prohibitive. Side-stepping Git's machinery and actual zlib compression by writing uncompressed content with the appropriate zlib header makes things much faster. The fastest method using this approach generates many small, unreachable blob objects and takes about 1.5 minutes for 4GB. However, this cannot be used because we need to test git clone, which requires a reachable commit history. Generating many reachable commits with small, uncompressed blobs takes about 4 minutes for 4GB. But this approach 1) does not reproduce the issues we want to fix (which require individual objects larger than 4GB) and 2) is comparatively slow because of the many SHA-1 calculations. The approach taken here generates a single large blob (filled with NUL bytes), along with the trees and commits needed to make it reachable. This takes about 2.5 minutes for 4.5GB, which is the fastest option that produces a valid, clonable repository with an object large enough to trigger the bugs we want to test. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
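To make the trick concrete, here is a hedged sketch of emitting a valid zlib stream whose deflate payload uses only stored (BTYPE=00) blocks, i.e. no compression at all; it mirrors the idea described above but is not the test-tool's actual code:

```c
#include <stdio.h>
#include <zlib.h>	/* only for adler32() */

static void write_stored_zlib(FILE *out, const unsigned char *data,
			      size_t len)
{
	uLong adler = adler32(0L, Z_NULL, 0);

	fputc(0x78, out);	/* zlib header: deflate, 32K window... */
	fputc(0x01, out);	/* ...with a valid header check value */

	do {
		/* a stored deflate block holds at most 65535 bytes */
		size_t n = len < 65535 ? len : 65535;
		int final = (n == len);

		fputc(final ? 1 : 0, out);	/* BFINAL bit + BTYPE=00 */
		fputc(n & 0xff, out);		/* LEN, little-endian */
		fputc((n >> 8) & 0xff, out);
		fputc(~n & 0xff, out);		/* NLEN = one's complement */
		fputc((~n >> 8) & 0xff, out);
		fwrite(data, 1, n, out);	/* the raw, uncompressed bytes */

		adler = adler32(adler, data, n);
		data += n;
		len -= n;
	} while (len);

	/* trailing Adler-32 checksum, big-endian */
	fputc((adler >> 24) & 0xff, out);
	fputc((adler >> 16) & 0xff, out);
	fputc((adler >> 8) & 0xff, out);
	fputc(adler & 0xff, out);
}
```

For the NUL-filled blob described above, `data` could simply be one zeroed 64KB buffer reused for every block, so generating multiple gigabytes is limited mostly by I/O speed.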
The shift overflow bug in index-pack and unpack-objects caused incorrect object size calculation when the encoded size required more than 32 bits of shift. This would result in corrupted or failed unpacking of objects larger than 4GB. Add a test that creates a pack file containing a 4GB+ blob using the new 'test-tool synthesize pack --reachable-large' command, then clones the repository to verify the fix works correctly. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
/submit

Submitted as pull.2102.git.1777393580.gitgitgadget@gmail.com
@@ -86,8 +86,11 @@ void *patch_delta(const void *src_buf, unsigned long src_size,
 * This must be called twice on the delta data buffer, first to get the
Derrick Stolee wrote on the Git mailing list (how to reply to this email):
On 4/28/26 12:26 PM, Johannes Schindelin via GitGitGadget wrote:
> From: Johannes Schindelin <johannes.schindelin@gmx.de>
>
> The delta header decoding functions return unsigned long, which
> truncates on Windows for objects larger than 4GB. Introduce size_t
> variants get_delta_hdr_size_sz() and get_size_from_delta_sz() that
> preserve the full 64-bit size, and use them in packed_object_info()
> where the size is needed for streaming decisions.
> + * Size_t variant that doesn't truncate - use for >4GB objects on Windows.
> + */
> +static inline size_t get_delta_hdr_size_sz(const unsigned char **datap,
> + const unsigned char *top)
...
> +static inline unsigned long get_delta_hdr_size(const unsigned char **datap,
> + const unsigned char *top)
> +{
> + size_t size = get_delta_hdr_size_sz(datap, top);
> return cast_size_t_to_ulong(size);
> }
I like this trick to use the 64-bit implementation and only to
down-cast for API compatibility. This allows a more gradual
transition than if we replaced ulongs with size_ts everywhere
at once.
Thanks,
-Stolee
@@ -49,4 +49,41 @@ test_expect_success 'clone - with worktree, file:// protocol' '
Derrick Stolee wrote on the Git mailing list (how to reply to this email):
On 4/28/26 12:26 PM, Johannes Schindelin via GitGitGadget wrote:
> From: Johannes Schindelin <johannes.schindelin@gmx.de>
>
> The shift overflow bug in index-pack and unpack-objects caused incorrect
> object size calculation when the encoded size required more than 32 bits
> of shift. This would result in corrupted or failed unpacking of objects
> larger than 4GB.
>
> Add a test that creates a pack file containing a 4GB+ blob using the
> new 'test-tool synthesize pack --reachable-large' command, then clones
> the repository to verify the fix works correctly.
As mentioned in the previous patch, constructing this large packfile
takes ~4 minutes in CI pipelines. That's quite a lot to handle for
every CI run.
> +test_expect_success SIZE_T_IS_64BIT 'set up repo with >4GB object' '
Your prereq here prevents it from running on 32-bit builds, which is
good. However, I wonder if it would be worth also specifying these
tests as expensive. It's less likely that these layers will be touched
often, so it should be enough to run these on major occasions, such as
testing a release candidate.
(I also think it's appropriate to have these tests _not_ be marked
expensive in their original contribution to git-for-windows/git,
because the pull request build should prove that the tests work. And
maybe git-for-windows/git should keep them for every PR since that's
where the tests matter the most.)
I suppose this also is a question for Junio and our process for
validating releases. Do we have a certain cadence where we run the
expensive tests? What has been our threshold for hiding a test case
behind the expensive label?
Thanks,
-Stolee
Jeff King wrote on the Git mailing list (how to reply to this email):
On Wed, Apr 29, 2026 at 09:34:21AM -0400, Derrick Stolee wrote:
> As mentioned in the previous patch, constructing this large packfile
> takes ~4 minutes in CI pipelines. That's quite a lot to handle for
> every CI run.
And for local runs, too. ;) The test suite takes less than 90 seconds to
run on my laptop, but t5608 by itself takes 160 seconds. Even if running it in
parallel didn't slow down the rest of the suite (which is not true,
because it's obviously hogging a whole processor the whole time), that's
still almost doubling the run-time.
I'd also worry about assuming that the trash directory can hold 4+GB
(maybe 8GB+ since we clone it?), especially since many of us use ram
disks.
That said...
> > +test_expect_success SIZE_T_IS_64BIT 'set up repo with >4GB object' '
>
> Your prereq here prevents it from running on 32-bit builds, which is
> good. However, I wonder if it would be worth also specifying these
> tests as expensive. It's less likely that these layers will be touched
> often, so it should be enough to run these on major occasions, such as
> testing a release candidate.
I think it is already skipped in most cases, because t5608 requires the
GIT_TEST_CLONE_2GB environment variable be set. Arguably it should just
be using EXPENSIVE, too, as I do not think there is much value in having
individual flags for all of the expensive tests. I think that test just
predates the modern prereq system entirely.
> I suppose this also is a question for Junio and our process for
> validating releases. Do we have a certain cadence where we run the
> expensive tests? What has been our threshold for hiding a test case
> behind the expensive label?
AFAIK the labeling of expensive things is mostly ad-hoc, and nobody is
systematically running them. Likewise for the t/perf tests, which are
super expensive but do (very occasionally) turn up interesting
regressions.
-Peff
Derrick Stolee wrote on the Git mailing list (how to reply to this email):
On 5/1/2026 2:38 AM, Jeff King wrote:
> On Wed, Apr 29, 2026 at 09:34:21AM -0400, Derrick Stolee wrote:
>>> +test_expect_success SIZE_T_IS_64BIT 'set up repo with >4GB object' '
>>
>> Your prereq here prevents it from running on 32-bit builds, which is
>> good. However, I wonder if it would be worth also specifying these
>> tests as expensive. It's less likely that these layers will be touched
>> often, so it should be enough to run these on major occasions, such as
>> testing a release candidate.
>
> I think it is already skipped in most cases, because t5608 requires the
> GIT_TEST_CLONE_2GB environment variable be set. Arguably it should just
> be using EXPENSIVE, too, as I do not think there is much value in having
> individual flags for all of the expensive tests. I think that test just
> predates the modern prereq system entirely.
Thanks for the extra details here! That helps avoid the issues that I
was thinking about, but maybe doubling-down and adding EXPENSIVE is
still worth it.
>> I suppose this also is a question for Junio and our process for
>> validating releases. Do we have a certain cadence where we run the
>> expensive tests? What has been our threshold for hiding a test case
>> behind the expensive label?
>
> AFAIK the labeling of expensive things is mostly ad-hoc, and nobody is
> systematically running them. Likewise for the t/perf tests, which are
> super expensive but do (very occasionally) turn up interesting
> regressions.
I used to be more diligent about running the performance tests myself
around release windows. The EXPENSIVE tests would also be good to do
on rc0. I will contemplate how to put this into my routine.
Thanks,
-Stolee
Derrick Stolee wrote on the Git mailing list (how to reply to this email):
On 4/28/26 12:26 PM, Johannes Schindelin via GitGitGadget wrote:
> On Windows, unsigned long is 32-bit even on 64-bit systems. This causes
> multiple problems when Git handles objects larger than 4GB. This patch
> series is a very targeted fix for a very early part of the problem: it
> addresses the most fundamental truncation points that prevent a >4GB object
> from surviving a clone at all.
>
> Specifically, this fixes:
> > * zlib's uLong wrapping and triggering BUG() assertions in the git_zstream
> wrapper
> * Object sizes being truncated in pack streaming, delta headers, and
> index-pack/unpack-objects
> * pack-objects re-encoding reused pack entries with a truncated size,
> producing corrupt packs on the wire
I'm glad to see this progress in this direction. It's a big step!
> Many other code paths still use unsigned long for object sizes (e.g.,
> cat-file -s, object_info.sizep, the delta machinery) and will need their own
> conversions. This series does not attempt to fix those.
I appreciate the mechanisms used to keep the scope of change minimal. This
differs from the typical "replace all 'unsigned long's with 'size_t'"
proposals in some clever ways.
> Based on work by @LordKiRon in git-for-windows/git#6076.
>
> The last two commits add a test helper that synthesizes a pack with a >4GB
> blob and regression tests that clone it via both the unpack-objects and
> index-pack code paths using file:// transport.
My biggest concern here is about how expensive this test is. I wonder if
we should mark it as expensive for the core project and leave it enabled by
default in git-for-windows/git.
Thanks,
-Stolee
@@ -37,7 +37,7 @@ static const char index_pack_usage[] =
Torsten Bögershausen wrote on the Git mailing list (how to reply to this email):
On Tue, Apr 28, 2026 at 04:26:15PM +0000, Johannes Schindelin via GitGitGadget wrote:
> From: Johannes Schindelin <johannes.schindelin@gmx.de>
>
> When unpacking objects from a packfile, the object size is decoded
> from a variable-length encoding. On platforms where unsigned long is
> 32-bit (such as Windows, even in 64-bit builds), the shift operation
> overflows when decoding sizes larger than 4GB. The result is a
> truncated size value, causing the unpacked object to be corrupted or
> rejected.
>
> Fix this by changing the size variable to size_t, which is 64-bit on
> 64-bit platforms, and ensuring the shift arithmetic occurs in 64-bit
> space.
>
> This was originally authored by LordKiRon <https://github.com/LordKiRon>,
> who preferred not to reveal their real name and therefore agreed that I
> take over authorship.
Good to see things moving forward.
See even
https://github.com/git-for-windows/git/pull/2179
which is probably obsolete soon.
On Windows, `unsigned long` is 32-bit even on 64-bit systems. This
causes multiple problems when Git handles objects larger than 4GB.
This patch series is a very targeted fix for a very early part of the
problem: it addresses the most fundamental truncation points that
prevent a >4GB object from surviving a clone at all.

Specifically, this fixes:

* zlib's `uLong` wrapping and triggering `BUG()` assertions in the `git_zstream` wrapper
* Object sizes being truncated in pack streaming, delta headers, and `index-pack`/`unpack-objects`
* `pack-objects` re-encoding reused pack entries with a truncated size, producing corrupt packs on the wire

Many other code paths still use `unsigned long` for object sizes
(e.g., `cat-file -s`, `object_info.sizep`, the delta machinery) and
will need their own conversions. This series does not attempt to fix
those.

Based on work by @LordKiRon in git-for-windows#6076.

The last two commits add a test helper that synthesizes a pack with
a >4GB blob and regression tests that clone it via both the
`unpack-objects` and `index-pack` code paths using `file://` transport.
cc: Derrick Stolee stolee@gmail.com
cc: Torsten Bögershausen tboegi@web.de
cc: Jeff King peff@peff.net