DM-41879: Implement URL signing for client/server #920
Conversation
Force-pushed from 5def122 to 59d1e8b (Compare)
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##             main     #920      +/-   ##
==========================================
+ Coverage   87.55%   87.60%   +0.04%
==========================================
  Files         295      295
  Lines       38241    38259      +18
  Branches     8088     8088
==========================================
+ Hits        33482    33516      +34
+ Misses       3553     3539      -14
+ Partials     1206     1204       -2

☔ View full report in Codecov by Sentry.
Force-pushed from 24ac2c9 to 8ab6368 (Compare)
Some nice cleanups. I agree that checking the file size after download is fine, since that is the path that should always happen and a bad size is meant to be very rare (so the cost of downloading the file and immediately raising an exception should not be burdensome).
@@ -1998,10 +1998,19 @@ def get(
     def prepare_get_for_external_client(self, ref: DatasetRef) -> FileDatastoreGetPayload:
         # Docstring inherited

+        # 8 hours. Chosen somewhat arbitrarily -- this is long enough that the
Last time we talked with Russ in Princeton, we decided that a few minutes was reasonable here, since the signing should only happen close to the getting (and we wouldn't be automatically signing the results from query.datasets()), and the server shouldn't be returning URLs that live long enough for people to post on social media. If getting a single dataset is taking hours then we have bigger problems. (I was also under the impression that once the download starts, passing the expiration wouldn't cut it off, but that may be a misunderstanding of how chunked downloads work.)
It won't cut off the download, but if you need to retry at the HTTP layer, the signature will need to remain valid. It's nice if you can allow HTTP retries without having to throw an exception all the way up to Butler.get() to re-do the API request, too.
Or, e.g., for multiple range requests where some slow work happens in between reading the chunks, the signature will need to stay valid for the duration of that work.
Also, we had discussed something like a get_many(), so the download might not start immediately if there are other files queued on a slow link.
I would argue that a few minutes isn't really long enough. 10-20 minutes is probably OK, but I think it's best if we can make this the longest length that isn't a political problem. How would you feel about 1-2 hours? I think it's best to have timeouts one or two orders of magnitude higher than the worst expected typical case, so things don't blow up immediately when the system is under stress.
Also, I really don't buy the social media thing, given that we provide a function for downloading the files, which could then be stored wherever you want for as long as you want.
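For context, a minimal sketch of how an expiration window is typically attached to an S3 presigned URL with boto3; the bucket, key, and 1-hour value are illustrative placeholders, not what this PR actually uses:

```python
import boto3

# Sketch only: presign a GET for a single artifact.  Any HTTP retries or
# range requests for this object must complete before ExpiresIn elapses.
s3 = boto3.client("s3")
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "example-butler-bucket", "Key": "example/artifact.fits"},
    ExpiresIn=3600,  # seconds; the lifetime being debated in this thread
)
```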
@@ -225,7 +226,7 @@ def _read_artifact_into_memory(
         formatter = getInfo.formatter
         nbytes_max = 10_000_000  # Arbitrary number that we can tune
-        if resource_size <= nbytes_max and formatter.can_read_bytes():
+        if recorded_size >= 0 and recorded_size <= nbytes_max and formatter.can_read_bytes():
Good catch. The size of -1 for "unknown" was added long after this check was written.
@@ -311,4 +315,6 @@ def addDataset(
         if run:
             self.butler.registry.registerCollection(run, type=CollectionType.RUN)
         metric = self._makeExampleMetrics()
-        self.butler.put(metric, self.datasetType if datasetType is None else datasetType, dataId, run=run)
+        return self.butler.put(
Please add a Returns section to the docstring.
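For reference, a hedged sketch of what such a Returns section could look like in the numpydoc style used elsewhere in this codebase; the exact wording is up to the author:

```python
        Returns
        -------
        ref : `lsst.daf.butler.DatasetRef`
            The ref of the dataset that was stored by the ``put``.
```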
tests/test_server.py (outdated)
@@ -51,6 +52,7 @@
     removeTestTempDir,
 )
 from lsst.resources.http import HttpResourcePath
+from lsst.resources.s3utils import clean_test_environment_for_s3, getS3Client
S3 dependencies are not installed by default in the pyproject.toml definition, so if server testing now requires boto3 to be available, this import is going to need import protection so that the tests can be skipped.
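As an illustration, one common import-protection pattern (the class name here is hypothetical, not the PR's actual test case):

```python
import unittest

try:
    import boto3
    from lsst.resources.s3utils import clean_test_environment_for_s3, getS3Client
except ImportError:
    # boto3 (and therefore the S3 helpers) may not be installed.
    boto3 = None


@unittest.skipIf(boto3 is None, "S3 support (boto3) is not available")
class ExampleServerTestCase(unittest.TestCase):
    """Hypothetical stand-in for the server test case guarded this way."""
```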
tests/test_server.py (outdated)
def _create_corrupted_dataset(repo: MetricTestRepo) -> DatasetRef:
    run = "corrupted-run"
    ref = repo.addDataset({"instrument": "DummyCamComp", "visit": 423}, run=run)
    uris = repo.butler.getURIs(ref, run=run)
I'm hoping that the run parameter is not needed here because the ref knows its own run.
Force-pushed from 9ef02ce to d718a09 (Compare)
Force-pushed from 7af2210 to 55d0b33 (Compare)
Deduplicated some environment cleanup logic between resources and daf_butler
Butler server will initially only support returning HTTP URLs to the client via signing of S3 URLs, so our tests need to use an S3 root for the datastore.
Move the resource size check when retrieving artifacts so it uses the downloaded data size, instead of checking the size separately. HEAD is not supported on S3 URLs presigned for GET, so HttpResourcePath.size() does not work for them. This is also a slight optimization, since it saves an unnecessary network round trip.
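As a rough illustration of the reordered check (variable names are stand-ins, not the actual datastore code):

```python
# Sketch only: validate the size from the downloaded bytes rather than with a
# separate size()/HEAD request, which presigned GET URLs may reject.
data = uri.read()  # "uri" stands in for a ResourcePath-like object
if recorded_size >= 0 and len(data) != recorded_size:
    raise RuntimeError(
        f"Size mismatch for {uri}: expected {recorded_size} bytes, got {len(data)}"
    )
```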
Force-pushed from 55d0b33 to e6639aa (Compare)
In the currently planned use case for Butler client/server, the server uses S3 for storing artifacts and gives clients access to them by presigning URLs.
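For orientation, a hedged sketch of the client side of that flow; the presigned URL is a placeholder, and the real client goes through the Butler APIs rather than raw HTTP calls:

```python
import requests

# Sketch only: the server hands back an HTTPS presigned URL and the client
# fetches the artifact with a plain GET before the signature expires.
presigned_url = "https://example-bucket.s3.amazonaws.com/artifact?X-Amz-Signature=..."
response = requests.get(presigned_url, timeout=60)
response.raise_for_status()
artifact_bytes = response.content
```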
Force-pushed from e6639aa to a633a1a (Compare)
Requires lsst/resources#73
Checklist
doc/changes