Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-45237: Add ability to disable file validation info on ingest-files and import #1038

Merged
merged 11 commits into from
Jul 23, 2024

Conversation

timj
Copy link
Member

@timj timj commented Jul 13, 2024

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes
  • (if changing dimensions.yaml) make a copy of dimensions.yaml in configs/old_dimensions

Copy link

codecov bot commented Jul 13, 2024

Codecov Report

Attention: Patch coverage is 82.35294% with 6 lines in your changes missing coverage. Please review.

Project coverage is 89.35%. Comparing base (ce18b02) to head (b5dc4a2).

Files Patch % Lines
python/lsst/daf/butler/datastores/fileDatastore.py 60.00% 3 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1038      +/-   ##
==========================================
- Coverage   89.36%   89.35%   -0.02%     
==========================================
  Files         359      359              
  Lines       45511    45526      +15     
  Branches     9354     9360       +6     
==========================================
+ Hits        40673    40681       +8     
- Misses       3536     3539       +3     
- Partials     1302     1306       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

timj added 2 commits July 15, 2024 16:23
* Existence checking has a large overhead for remote URIs and
  can make an ingest untenable with tens of thousands of files.
* Existence would eventually be checked for any transfer mode
  other than None or direct.
* Existence would eventually be checked if file sizes are being
  recorded (although it is likely that anyone using ingest
  would turn that off to save time).

On balance, removing the check entirely seems to be the
best approach rather than only doing it for local URIs
or only doing it if "direct and no track-file-attrs"
if n_datasets <= max_checks:
check_every_n = 1
elif transfer in ("direct", None):
check_every_n = int(n_datasets / max_checks + 1) # +1 so that if n < 100 the answer is 1.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ktlim I've asked for a re-review since this change is quite significant. The code coverage doesn't cover the important bit because none of the tests try to ingest large numbers of files. I have verified that @gpdf 's example does only check 200 files out of 155k and that doesn't notable affect the import time.

@timj timj force-pushed the tickets/DM-45237 branch 2 times, most recently from ad742e5 to df5e712 Compare July 22, 2024 18:19
@@ -352,10 +352,10 @@ def split_kv(
separator: str = "=",
unseparated_okay: bool = False,
return_type: type[dict] | type[tuple] = dict,
default_key: str = "",
default_key: str | None = "",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This None is here specifically to support --log-levels because the code in cliLog is using None to indicate the default logger and "" or "." to indicate the root logger. They are not the same because we want --log-level INFO to affect the lsst logger and not the root logger.

@@ -789,7 +790,7 @@ class MWOptionDecorator:
"""

def __init__(self, *param_decls: Any, **kwargs: Any) -> None:
self.partialOpt = partial(click.option, *param_decls, cls=partial(MWOption), **kwargs)
self.partialOpt = partial(click.option, *param_decls, cls=partial(MWOption), **kwargs) # type: ignore
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mypy doesn't think that partial(MWOption) is the same as an option and so now complains.

Copy link
Contributor

@andy-slac andy-slac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great.

doc/changes/DM-45237.misc.rst Outdated Show resolved Hide resolved
Comment on lines +1131 to +1137
# Ingest could be given tens of thousands of files. It is not efficient
# to check for the existence of every single file (especially if they
# are remote URIs) but in some transfer modes the files will be checked
# anyhow when they are relocated. For direct or None transfer modes
# it is possible to not know if the file is accessible at all.
# Therefore limit number of files that will be checked (but always
# include the first one).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could there be a use case when clients want to check every file even if there are thousands of them? 200 looks like an arbitrary number, maybe we want a new option for that (or per-datastore config parameter)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It's an issue. It's why part of the conversation was considering always doing the checks for POSIX and only skipping for remote. Gregory's original 150k import example was so long (many hours) that it seemed like it wouldn't ever be something people wanted to do. I picked 200 because that's about the size of the focal plane... and it doesn't take very long.

timj added 7 commits July 23, 2024 10:14
We want to check some of the files to ensure that they are
accessible at all but we don't want to check every file.
If there are thousands of files on a remote bucket this can take
a long time.
python/lsst/daf/butler/formatters/json.py:136: note: Possible overload variants:
python/lsst/daf/butler/formatters/json.py:136: note:     def asdict(obj: DataclassInstance) -> dict[str, Any]
python/lsst/daf/butler/formatters/json.py:136: note:     def [_T] asdict(obj: DataclassInstance, *, dict_factory: Callable[[list[tuple[str, Any]]], _T]) -> _T
We use a default_key=None in the log handling code to imply
the root logger. Changing the types in split_kv to use str | None
fixes the problem. The --log-levels option wants to use None
to indicate the default logger should be affected and saves
"" to mean the root logger (the default logger is "lsst").
New typeshed now says return type is bool | None.
python/lsst/daf/butler/cli/utils.py:793: error: Argument "cls" to "option" has incompatible type "partial[MWOption]"; expected "type[Option] | None"  [arg-type]
The suggested fix on python/mypy#17550 causes
another problem that we ignore for now.
@timj timj merged commit 8dd8362 into main Jul 23, 2024
17 of 18 checks passed
@timj timj deleted the tickets/DM-45237 branch July 23, 2024 17:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants