Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't search/list every single repo reference when explicit references are provided #112

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

elhmn
Copy link
Contributor

@elhmn elhmn commented Sep 25, 2023

Now that we can provide explicit roots to git-sizer, we don't want to look for other references than the one provided.

@elhmn elhmn requested a review from a team as a code owner September 25, 2023 13:15
@mhagger
Copy link
Member

mhagger commented Sep 25, 2023

It would be a nice optimization not to run git for-each-ref at all in the case that no references are being scanned. That might make a significant difference to the runtime in a repository that has a lot of references if only a small part of the history is being scanned.

But this PR as currently written introduces a couple of rough edges:

  1. In the case that both references and objects were specified, the old code scans both. For example,

     git-sizer --tags 'HEAD@{1.week.ago}'
    

    scans objects that are reachable from tags and also objects that are reachable from the reflog entry for HEAD from a week ago. I think that the optimization should only be used if an argument was specified AND no tag-related options were specified (which is the behavior documented in the --help output).

  2. When this optimization is triggered, the output still includes the lines

     | * References                 |           |                                |
     |   * Count                    |     0     |                                |
    

    The count usually lists the total number of references (including those that were not selected). It is misleading to say that this count is zero only because we didn't bother to read the references at all. I think it would be better not to output these lines at all when the optimization is in effect (and analogously for the JSON output).

    The textual output is defined in HistorySize.contents(). This function already has some code that adds an item for each reference group (in rgis). It could take an additional bool parameter and based on that include or omit the whole "References" section. It's a little bit awkward because of the way sections are constructed; let me know if you want some ideas for how that could be wired up.

…s are specified

When explicit roots and reference filters are both provided, we want to make
sure that the optimisation don't kick in. As we still want to iterate
over all references in order to apply the filters.

Here is an example with git-sizer being run with explicit roots and reference filters
```
git-sizer main~20^{tree} main~10^{tree} --tags --branches
```
@elhmn elhmn force-pushed the elhmn-limit-tree-traversal-to-refs-passed-by-arguments branch from 135d97e to 721e353 Compare September 29, 2023 12:09
We printed out the reference count, even when explicit roots were
provided.
```
git-sizer master~50^{tree} 'HEAD@{1.week.ago}' -v
[TRUNCATED_OUTPUT]
| * References                 |           |                                |
|   * Count                    |   187     |                                |
|     * Ignored                |   187     |                                |
|                              |           |                                |
[TRUNCATED_OUTPUT]
```

This is not very useful, as the reference count depends on the list of references
extracted by running `git for-each-ref`.

This commit changes the output to not print the reference count when
explicit roots and no reference filters are provided to git-sizer
The reference count is still displayed when the -j flag is passed to
git-sizer.
```
git-sizer master~50^{tree} 'HEAD@{1.week.ago}' -v -j
[TRUNCATED_OUTPUT]
    "reference_count": 15,
    "reference_groups": {
        "ignored": 15
    },
[TRUNCATED_OUTPUT]
```

We don't want to print the reference count for json v1, when explicit
root are not provided git-sizer
@elhmn elhmn force-pushed the elhmn-limit-tree-traversal-to-refs-passed-by-arguments branch from 41cd01a to d407c86 Compare October 5, 2023 12:20
Copy link
Member

@mhagger mhagger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside from the line comments, I noticed that this version of the code emits confusing counts in the "references" progress meter, like:

$ git-sizer HEAD^
[…]
Processing references: 1                        

I think that this count should be zero, since no references were scanned. But the progress meter is actually counting roots, not references. So this is misleading. (I know it's "just" a progress meter, but I know that I glance at those numbers all the time and rely on them.)

I suggest instead iterating over the references and roots separately, with two separate progress meters, like

diff --git a/sizes/graph.go b/sizes/graph.go
index 0fb1c8a..2fb1b71 100644
--- a/sizes/graph.go
+++ b/sizes/graph.go
@@ -269,14 +269,22 @@ func ScanRepositoryUsingGraph(
                return HistorySize{}, err
        }
 
+       // Process roots in two passes just to make the progress meter
+       // output more useful:
+
        progressMeter.Start("Processing references: %d")
        for _, root := range roots {
-               progressMeter.Inc()
                if refRoot, ok := root.(ReferenceRoot); ok {
+                       progressMeter.Inc()
                        graph.RegisterReference(refRoot.Reference(), refRoot.Groups())
                }
+       }
+       progressMeter.Done()
 
+       progressMeter.Start("Processing roots: %d")
+       for _, root := range roots {
                if root.Walk() {
+                       progressMeter.Inc()
                        graph.pathResolver.RecordName(root.Name(), root.OID())
                }
        }

for _, arg := range flags.Args() {
oid, err := repo.ResolveObject(arg)
// If no reference filters and no explicit roots were provided
if git.IsNoReferencesFilter(rgb.GetTopLevelGroup().GetFilter()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of revealing internal details via both RefGroupBuilder.GetTopLevelGroup() and refGroup.GetFilter() plus needing the strange function IsNoReferencesFilter(), how about adding a special-purpose method to RefGroupBuilder, something like the following:

// RefoptsSeen returns `true` iff any reference selection options were
// used. It is only valid after `Finish()` has been called.
func (rgb *RefGroupBuilder) RefoptsSeen() bool {
	return rgb.topLevelGroup.filter != git.NoReferencesFilter
}

git.NoReferencesFilter is a singleton, so I think that a simple comparison like this should suffice.

Comment on lines +143 to +147

func IsNoReferencesFilter(val interface{}) bool {
_, ok := val.(noReferencesFilter)
return ok
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RefGroupBuilder.RefoptsSeen() could make this could go away.

Comment on lines +33 to +36
func (rg *refGroup) GetFilter() git.ReferenceFilter {
return rg.filter
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RefGroupBuilder.RefoptsSeen() could make this could go away.

Comment on lines +25 to +28
func (rgb *RefGroupBuilder) GetTopLevelGroup() *refGroup {
return rgb.topLevelGroup
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RefGroupBuilder.RefoptsSeen() could make this could go away.

Comment on lines +509 to +522
var refCountSection section
if !opts.WithoutReferenceCount {
refCountSection = S(
"References",
I("referenceCount", "Count",
"The total number of references",
nil, s.ReferenceCount, metric, "", 25e3),
S(
"",
rgis...,
),
)
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach seems to depend on the fact that even though the uninitialized section is inserted into the table, it doesn't output anything. I think that's correct ✔️

@mhagger
Copy link
Member

mhagger commented Dec 14, 2023

(To be clear, the progress meter was already inaccurate before this PR. It just wasn't so obvious, because the references were being scanned and counted as well.)

@Andrei99120
Copy link

elhmn-limit-tree-traversal-to-refs-passed-by-arguments

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants