-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-45237: Add ability to disable file validation info on ingest-files and import #1038
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1038 +/- ##
==========================================
- Coverage 89.36% 89.35% -0.02%
==========================================
Files 359 359
Lines 45511 45526 +15
Branches 9354 9360 +6
==========================================
+ Hits 40673 40681 +8
- Misses 3536 3539 +3
- Partials 1302 1306 +4 ☔ View full report in Codecov by Sentry. |
* Existence checking has a large overhead for remote URIs and can make an ingest untenable with tens of thousands of files. * Existence would eventually be checked for any transfer mode other than None or direct. * Existence would eventually be checked if file sizes are being recorded (although it is likely that anyone using ingest would turn that off to save time). On balance, removing the check entirely seems to be the best approach rather than only doing it for local URIs or only doing it if "direct and no track-file-attrs"
Closes #802
if n_datasets <= max_checks: | ||
check_every_n = 1 | ||
elif transfer in ("direct", None): | ||
check_every_n = int(n_datasets / max_checks + 1) # +1 so that if n < 100 the answer is 1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ktlim I've asked for a re-review since this change is quite significant. The code coverage doesn't cover the important bit because none of the tests try to ingest large numbers of files. I have verified that @gpdf 's example does only check 200 files out of 155k and that doesn't notable affect the import time.
ad742e5
to
df5e712
Compare
@@ -352,10 +352,10 @@ def split_kv( | |||
separator: str = "=", | |||
unseparated_okay: bool = False, | |||
return_type: type[dict] | type[tuple] = dict, | |||
default_key: str = "", | |||
default_key: str | None = "", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This None
is here specifically to support --log-levels
because the code in cliLog
is using None
to indicate the default logger and "" or "." to indicate the root logger. They are not the same because we want --log-level INFO
to affect the lsst
logger and not the root logger.
@@ -789,7 +790,7 @@ class MWOptionDecorator: | |||
""" | |||
|
|||
def __init__(self, *param_decls: Any, **kwargs: Any) -> None: | |||
self.partialOpt = partial(click.option, *param_decls, cls=partial(MWOption), **kwargs) | |||
self.partialOpt = partial(click.option, *param_decls, cls=partial(MWOption), **kwargs) # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mypy doesn't think that partial(MWOption)
is the same as an option and so now complains.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great.
# Ingest could be given tens of thousands of files. It is not efficient | ||
# to check for the existence of every single file (especially if they | ||
# are remote URIs) but in some transfer modes the files will be checked | ||
# anyhow when they are relocated. For direct or None transfer modes | ||
# it is possible to not know if the file is accessible at all. | ||
# Therefore limit number of files that will be checked (but always | ||
# include the first one). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could there be a use case when clients want to check every file even if there are thousands of them? 200 looks like an arbitrary number, maybe we want a new option for that (or per-datastore config parameter)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. It's an issue. It's why part of the conversation was considering always doing the checks for POSIX and only skipping for remote. Gregory's original 150k import example was so long (many hours) that it seemed like it wouldn't ever be something people wanted to do. I picked 200 because that's about the size of the focal plane... and it doesn't take very long.
We want to check some of the files to ensure that they are accessible at all but we don't want to check every file. If there are thousands of files on a remote bucket this can take a long time.
python/lsst/daf/butler/formatters/json.py:136: note: Possible overload variants: python/lsst/daf/butler/formatters/json.py:136: note: def asdict(obj: DataclassInstance) -> dict[str, Any] python/lsst/daf/butler/formatters/json.py:136: note: def [_T] asdict(obj: DataclassInstance, *, dict_factory: Callable[[list[tuple[str, Any]]], _T]) -> _T
We use a default_key=None in the log handling code to imply the root logger. Changing the types in split_kv to use str | None fixes the problem. The --log-levels option wants to use None to indicate the default logger should be affected and saves "" to mean the root logger (the default logger is "lsst").
New typeshed now says return type is bool | None.
python/lsst/daf/butler/cli/utils.py:793: error: Argument "cls" to "option" has incompatible type "partial[MWOption]"; expected "type[Option] | None" [arg-type]
The suggested fix on python/mypy#17550 causes another problem that we ignore for now.
Checklist
doc/changes
configs/old_dimensions