pack-objects: integrate --path-walk and some --filter options#2101
Open
derrickstolee wants to merge 7 commits intogitgitgadget:masterfrom
Open
pack-objects: integrate --path-walk and some --filter options#2101derrickstolee wants to merge 7 commits intogitgitgadget:masterfrom
derrickstolee wants to merge 7 commits intogitgitgadget:masterfrom
Conversation
0c68681 to
bf00b44
Compare
When 'git pack-objects' has the --path-walk option enabled, it uses a different set of revision walk parameters than normal. For once, --objects was previously assumed by the path-walk API and was not needed to be added. We also needed --boundary to allow discovering UNINTERESTING objects to use as delta bases. We will be updating the path-walk API soon to work with some filter options. However, the revision machinery will trigger a fatal error: fatal: object filtering requires --objects The fix is easy: add the --objects option as an argument. This has no effect on the path-walk API but does simplify the revision option parsing for the objects filter. We can remove the comment about "removing" the options because they were never removed and instead not added. We still need to disable using bitmaps. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Add p5315-pack-objects-filter.sh to measure the performance of 'git pack-objects --revs --all' under different filter and traversal combinations: * no filter (baseline) * --filter=blob:none (blobless) * --filter=sparse:oid=<oid> (cone-mode sparse) Each filter scenario is tested both with and without --path-walk, producing paired measurements that show the impact of the path-walk traversal for each filter type as we integrate the --path-walk feature with different --filter options. It currently has no integration so falls back to the standard revision walk. Thus, there are no significant differences in the current results other than a full repack (and even then, the --path-walk feature is not incredibly different for the default Git repository): Test HEAD ----------------------------------------------------- 5315.2: repack (no filter) 27.91 5315.3: repack size (no filter) 250.7M 5315.4: repack (no filter, --path-walk) 34.92 5315.5: repack size (no filter, --path-walk) 220.0M 5315.6: repack (blob:none) 13.63 5315.7: repack size (blob:none) 137.6M 5315.8: repack (blob:none, --path-walk) 13.48 5315.9: repack size (blob:none, --path-walk) 137.7M 5315.10: repack (sparse:oid) 72.67 5315.11: repack size (sparse:oid) 187.4M 5315.12: repack (sparse:oid, --path-walk) 72.47 5315.13: repack size (sparse:oid, --path-walk) 187.4M The sparse filter definition is built automatically by sampling depth-2 directories from the test repository, making the test work on any repo passed via GIT_PERF_LARGE_REPO. For repos that lack depth-2 directories, a single top-level directory is used; for flat repos, the sparse tests are skipped via prerequisite. Signed-off-by: Derrick Stolee <stolee@gmail.com>
The 'git pack-objects' command can opt-in to using the path-walk API for scanning the objects. Currently, this option is dynamically disabled if combined with '--filter=<X>', even when using a simple filter such as 'blob:none' to signal a blobless packfile. This is a common scenario for repos at scale, so is worth integrating. Also, users can opt-in to the '--path-walk' option by default through the pack.usePathWalk=true config option. When using that in a blobless partial clone, the following warning can appear even though the user did not specify either option directly: warning: cannot use --filter with --path-walk Teach the path-walk API to handle the 'blob:none' object filter natively. When revs->filter.choice is LOFC_BLOB_NONE, the path-walk sets info->blobs to 0 (skipping all blob objects) and clears the filter from revs so that prepare_revision_walk() does not reject the configuration. This check is implemented in the static prepare_filters() method, which will simultaneously check if the input filters are compatible and will make the appropriate mutations to the path_walk_info and filters if the path_walk_info is non-NULL. This allows us to use this logic both in the API method path_walk_filter_compatible() for use in builtin/pack-objects.c and as a prep step in walk_objects_by_path(). Update the test helper (test-path-walk) to accept --filter=<spec> as a test-tool option (before '--'), applying it to revs after setup_revisions() to avoid the --objects requirement check. Also switch test-path-walk from REV_INFO_INIT with manual repo assignment to repo_init_revisions(), which properly initializes the filter_spec strbuf needed for filter parsing. Add tests for blob:none with --all and with a single branch. The performance test p5315 shows the impact of this change when using blobless filters: Test HEAD~1 HEAD --------------------------------------------------------------------- 5315.6: repack (blob:none) 13.53 13.87 +2.5% 5315.7: repack size (blob:none) 137.7M 137.8M +0.1% 5315.8: repack (blob:none, --path-walk) 13.51 23.43 +73.4% 5315.9: repack size (blob:none, --path-walk) 137.7M 115.2M -16.3% These performance tests were run on the Git repository. The --path-walk feature shows meaningful space savings (16% smaller for blobless packs) at the cost of increased computation time due to the two compression passes. This data demonstrates that the feature is engaged and provides real compression benefits when --no-reuse-delta forces fresh deltas. Signed-off-by: Derrick Stolee <stolee@gmail.com>
The 'git backfill' command uses the path-walk API in a critical way: it uses the objects output from the command to find the batches of missing objects that should be requested from the server. Unlike 'git pack-objects', we cannot fall back to another mechanism. The previous change added the path_walk_filter_compatible() method that we can reuse here. Use it during argument validation in cmd_backfill(). Signed-off-by: Derrick Stolee <stolee@gmail.com>
Extend the path-walk API to handle the 'blob:limit=<size>' object filter natively. This filter omits blobs whose size is equal to or greater than the given limit, matching the semantics used by the list-objects-filter machinery. When revs->filter.choice is LOFC_BLOB_LIMIT, the prepare_filters() method stores the limit value in info->blob_limit and clears the filter from revs. If the limit is zero, this degenerates to blob:none (all blobs excluded), so info->blobs is set to 0 instead. During walk_path(), blob batches are filtered before being delivered to the callback: each blob's size is checked via odb_read_object_info(), and only blobs strictly smaller than the limit are included. Blobs whose size cannot be determined (e.g. missing in a partial clone) are conservatively included, matching the existing filter behavior. Empty batches after filtering are skipped entirely. The check for inclusion in the path batch looks a little strange at first glance. We use odb_read_object_info() to read the object's size. Based on all of the assumptions to this point, this _should_ return OBJ_BLOB. Since we are focused on the size filter, we use a short-circuited OR (||) to skip the size check if that method returns a different object type. Notice that this inspection of object sizes requires the content to be present in the repository. The odb_read_object_info() call will download a missing blob on-demand. This means that the use of the path-walk API within 'git backfill' would not operate nicely with this filter type. The intention of that command is to download missing blobs in batches. Downloading objects one-by-one would go against the point. Update the validation in 'git backfill' to add its own compatibility check on top of path_walk_filter_compatible(). Add tests for blob:limit=0 (equivalent to blob:none) and blob:limit=3 (which exercises partial filtering within a batch where some blobs are kept and others are excluded). Signed-off-by: Derrick Stolee <stolee@gmail.com>
The path-walk API prunes trees and blobs when a sparse-checkout pattern list is provided, which is the correct behavior for 'git backfill --sparse' since it only needs to fill in objects at paths within the sparse cone. However, a future change will use the path-walk API with a sparse:<oid> filter that restricts only blobs while retaining all reachable trees. To support both behaviors, add a 'pl_sparse_trees' flag to path_walk_info. When set (as in 'git backfill --sparse' and the --stdin-pl test helper mode), the sparse patterns prune both trees and blobs. When unset, only blobs are filtered and all trees are walked and reported. Additionally, move the SEEN flag assignment in add_tree_entries() to after the sparse pattern and pathspec checks. Previously, SEEN was set immediately upon discovering an object, before checking whether its path matched the sparse patterns. When the same object ID appeared at multiple paths (e.g. sibling directories with identical contents), the first path to be visited would mark the object as SEEN. If that path was outside the sparse cone, the object would be skipped there but also never discovered at its in-cone path. By deferring the SEEN flag until after the checks pass, objects that are skipped due to sparse filtering remain discoverable at other paths where they may be in scope. Signed-off-by: Derrick Stolee <stolee@gmail.com>
bf00b44 to
6fb9f3e
Compare
The --filter=sparse:<oid> option to 'git pack-objects' allows focusing an object set to a sparse-checkout definition. This reduces the set of matching blobs while retaining all reachable trees. No server currently supports fetching with this filter because it is expensive to compute and reachability bitmaps do not help without a significant effort to extend the bitmap feature to store bitmaps for each supported sparse- checkout definition. Without focusing on serving fetches and clones with these filters, there are still benefits that could be realized by making this faster. With the sparse index, it's more realistic now than ever to be able to operate a local clone that was bootstrapped by a packfile created with a sparse filter, because the missing trees are not needed to move a sparse-checkout from one commit to another or to view the history of any path in scope. Such clones could perhaps be bootstrapped by partial bundles. Previously, constructing these sparse packs has been incredibly computationally inefficient. The revision walk that explores which objects are in scope spends a lot of time checking each object to see if it matches the sparse-checkout patterns, causing quadratic behavior (number of objects times number of sparse-checkout patterns). This improves somewhat when using cone-mode sparse-checkout patterns that can use hashtables and prefix matches to determine containment. However, the check per object is still too expensive for most cases. This is where the path-walk feature comes in. We can proceed as normal by placing objects in bins by path and _then_ check a group of objects all at once. Since sparse:<oid> only restricts blobs, the path-walk must include all reachable trees while using the cone-mode patterns to skip blobs at paths outside the sparse scope. This establishes a baseline for a potential future "treesparse:<oid>" filter that would also restrict trees, but introducing such a new filter is deferred to a later change. The implementation here is focused around loading the sparse-checkout patterns from the provided object ID and checking that the patterns are indeed cone-mode patterns. We can then load the correct pattern list into the path walk context and use the logic that already exists from bff4555 (backfill: add --sparse option, 2025-02-03), though that feature loads sparse-checkout patterns from the worktree's local settings and also restricts tree objects. We use a combination of errors and warnings to signal problems during this load. The difference is that errors are likely fatal for the non-path-walk version while the warnings are probably just implementation details for the path-walk version and the 'git pack-objects' command can fall back to the revision walk version. Now that the SEEN flag is deferred until after pattern checks (from the previous commit), handle the case where a tree with a shared OID appears at both an out-of-cone and in-cone path. When trees are not being pruned (pl_sparse_trees == 0), the path-walk re-walks the tree at the in-cone path so that in-cone blobs within it are discovered. The new tests in t5317 and t6601 demonstrate this behavior and would fail without these changes. The performance test p5315 shows the impact of this change when using sparse filters: Test HEAD~1 HEAD ---------------------------------------------------------------------- 5315.10: repack (sparse:oid) 77.98 77.47 -0.7% 5315.11: repack size (sparse:oid) 187.5M 187.4M -0.0% 5315.12: repack (sparse:oid, --path-walk) 77.91 31.41 -59.7% 5315.13: repack size (sparse:oid, --path-walk) 187.5M 161.1M -14.1% These performance tests were run on the Git repository. The --path-walk feature shows meaningful space savings (14% smaller for sparse packs) and dramatic time savings (60% faster) by leveraging the path-walk's ability to skip blobs outside the sparse scope. Signed-off-by: Derrick Stolee <stolee@gmail.com>
6fb9f3e to
859bee3
Compare
Author
|
/submit |
|
Submitted as pull.2101.git.1777731354.gitgitgadget@gmail.com To fetch this version into To fetch this version to local tag |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The 'git pack-objects' command has a '--path-walk' option that uses the path-walk API instead of a typical revision walk to group objects into chunks by path name instead of relying solely on name-hashes to group similar files together. (It also does a second compression pass looking for better deltas after the first pass that is focused within chunks per path.)
The '--path-walk' feature was not previously integrated with the '--filter' feature, so a warning would appear and disable the path-walk API when a filter is given. This patch series integrates these together in the following ways:
--filter=blob:noneupdates the path-walk API options to skip blobs.--filter=blob:limit=<size>adds a scan to a list of blob objects to remove objects that are too large.--filter=sparse:<oid>adds a scan to the chunks to validate that the paths match the sparse-checkout patterns.In particular, this last check is significantly faster than the previous algorithm because it can check all objects at a given path simultaneously instead of checking all sparse-checkout patterns for each discovered blob object.
A subtlety must be added here, in that we must change how we mark an object as "seen" during the path-walk. We may need to add an object to multiple paths and only mark it as "seen" if it indeed matched the sparse-checkout patterns as the path is accepted for emitting to the callback. This adds a new filter that the "seen" objects must also be removed from later chunks to avoid sending the same object as grouped to multiple chunks.
There's also a subtle detail here in that the path-walk API also prunes tree paths based on cone-mode sparse-checkouts, to enable 'git backfill --sparse' operating quickly for small sparse-checkout scopes. But the
--filter=sparse:<oid>feature doesn't prune trees!As a future step, I do plan to recommend that we add a
treesparse:<oid>setting that does allow us to trim the tree set by cone-mode sparse patterns. At the time that partial clone filters were being created, cone mode sparse-checkout didn't exist and neither did the sparse index. Those features together make a smaller tree set possible, assuming the user never needs to change their scope. This would be a significant change so it is not implemented here, though thegit pack-objectsintegration would be quick after this series completes.Neither the
sparse:<oid>or hypotheticaltreesparse:<oid>options are or should necessarily be supported by Git servers. It's too expensive to compute dynamically and it doesn't work well with reachability bitmaps. What becomes possible with this change is that it becomes reasonably fast to construct bundles with these filters that can bootstrap a working environment with the full history of all files within a given sparse-checkout scope.Performance Results
Since the '--path-walk' option is ignored in today's Git version when a '--filter' is added, the performance matches the behavior without '--path-walk'. For the tables below, you can compare the rows against each other (time and then packfile size) for the mode without and then with '--path-walk' as a representation of "before" and "after". (These tables are repeated in the commit messages as new implementations improve specific rows.)
I chose a number of open source repositories of various sizes and shapes:
git/git
nodejs/node
microsoft/fluentui
microsoftdocs/azure-devops-docs
Performance conclusions
As seen in earlier series around the '--path-walk' feature, the space savings can be valuable but is not always guaranteed. When the space savings doesn't happen, then the time spent is generally slower because of the two-pass mechanism. The microsoftdocs/azure-devops-docs repo demonstrates this case quite clearly.
However, even in these cases the '
sparse:<oid>' filters are much faster because of the ability to check an entire set of objects against the sparse-checkout patterns only once.Thanks,
-Stolee
P.S. I've CC'd the folks who were on the original path-walk feature thread [1]
[1] https://lore.kernel.org/git/pull.1819.git.1741571455.gitgitgadget@gmail.com/
cc: christian.couder@gmail.com
cc: gitster@pobox.com
cc: johannes.schindelin@gmx.de
cc: johncai86@gmail.com
cc: karthik.188@gmail.com
cc: kristofferhaugsbakk@fastmail.com
cc: me@ttaylorr.com
cc: newren@gmail.com
cc: peff@peff.net
cc: ps@pks.im