
DM-41879: Implement URL signing for client/server #920

Merged 4 commits into main on Dec 8, 2023

Conversation

@dhirving (Contributor) commented Dec 5, 2023

Requires lsst/resources#73

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

@dhirving dhirving changed the base branch from main to tickets/DM-41878 December 5, 2023 22:24
codecov bot commented Dec 5, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison: base (1e7b397) at 87.55% vs. head (a633a1a) at 87.60%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #920      +/-   ##
==========================================
+ Coverage   87.55%   87.60%   +0.04%     
==========================================
  Files         295      295              
  Lines       38241    38259      +18     
  Branches     8088     8088              
==========================================
+ Hits        33482    33516      +34     
+ Misses       3553     3539      -14     
+ Partials     1206     1204       -2     


@dhirving dhirving marked this pull request as ready for review December 6, 2023 17:33
@timj (Member) left a comment

Some nice cleanups. I agree that checking the file size after download is fine, since that's the path that should always happen, and a bad size is meant to be very rare (so the cost of downloading the file and immediately raising an exception should not be burdensome).
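As a sketch of that post-download approach (the function and the `uri` object here are hypothetical stand-ins, not the actual daf_butler interfaces): the size is validated against the bytes actually received, so no separate size lookup is needed and a mismatch surfaces immediately after the download.

```python
# Illustrative sketch only -- read_artifact and the uri object are
# hypothetical, not the real daf_butler API.
def read_artifact(uri, expected_size: int) -> bytes:
    """Download an artifact and validate its size afterwards."""
    data = uri.read()  # single round trip; no separate size() call
    # A negative expected_size means "unknown", so skip the check.
    if expected_size >= 0 and len(data) != expected_size:
        raise RuntimeError(
            f"Size mismatch for {uri}: expected {expected_size} bytes, "
            f"got {len(data)}"
        )
    return data
```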

@@ -1998,10 +1998,19 @@ def get(
def prepare_get_for_external_client(self, ref: DatasetRef) -> FileDatastoreGetPayload:
# Docstring inherited

# 8 hours. Chosen somewhat arbitrarily -- this is long enough that the
Member

Last time we talked with Russ in Princeton, we decided that a few minutes was reasonable here, since signing should only happen close to the getting (and we wouldn't be automatically signing the results from query.datasets()), and the server shouldn't be returning URLs that live long enough for people to post on social media. If getting a single dataset is taking hours, then we have bigger problems. (I was also under the impression that once the download starts, it wouldn't be cut off if the URL passed expiration, but that may be a misunderstanding of how chunked downloads work.)

Contributor Author

It won't cut off the download, but if you need to retry at the HTTP layer, the signature will need to remain valid. It's also nice if you can allow HTTP retries without throwing an exception all the way up to Butler.get() to redo the API request.

Or, e.g., for multiple range requests: if the client does some slow work in between reading the chunks, the signature will need to stay valid for the duration of that work.

Also, we had discussed something like a get_many(), so the download might not start immediately if other files are queued ahead of it on a slow link.

I would argue that a few minutes isn't really long enough. 10-20 minutes is probably OK, but I think it's best to make this the longest length that isn't a political problem. How would you feel about 1-2 hours? I think timeouts should be one or two orders of magnitude above the worst expected typical case, so things don't blow up immediately when the system is under stress.

Also, I really don't buy the social-media concern, given that we provide a function for downloading the files, which can then be stored wherever you want for as long as you want.
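One stdlib-only way to reason about the retry headroom discussed above is to compute when a presigned URL expires from its SigV4 query parameters (`X-Amz-Date` and `X-Amz-Expires` are the standard AWS Signature Version 4 query fields; this sketch is not part of daf_butler):

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlparse


def presigned_expiry(url: str) -> datetime:
    """Return the UTC instant at which a SigV4-presigned URL expires."""
    qs = parse_qs(urlparse(url).query)
    # X-Amz-Date records when the URL was signed; X-Amz-Expires is the
    # validity window in seconds.
    signed_at = datetime.strptime(
        qs["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ"
    ).replace(tzinfo=timezone.utc)
    return signed_at + timedelta(seconds=int(qs["X-Amz-Expires"][0]))
```

A client could compare this against `datetime.now(timezone.utc)` before an HTTP-layer retry and fall back to requesting a fresh URL from the server when too little validity remains.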

@@ -225,7 +226,7 @@ def _read_artifact_into_memory(

formatter = getInfo.formatter
nbytes_max = 10_000_000 # Arbitrary number that we can tune
if resource_size <= nbytes_max and formatter.can_read_bytes():
if recorded_size >= 0 and recorded_size <= nbytes_max and formatter.can_read_bytes():
Member

Good catch. The size of -1 for "unknown" was added long after this check was written.
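Restated as a standalone predicate (the constant and function name here are illustrative, not the real code): the -1 "unknown" sentinel would otherwise satisfy `recorded_size <= nbytes_max` and wrongly select the read-into-memory path.

```python
NBYTES_MAX = 10_000_000  # arbitrary tuning knob, as in the diff above


def should_read_into_memory(recorded_size: int, can_read_bytes: bool) -> bool:
    """True only for a known, small artifact with a bytes-capable formatter."""
    # recorded_size == -1 means "unknown"; the chained comparison excludes it.
    return 0 <= recorded_size <= NBYTES_MAX and can_read_bytes
```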

@@ -311,4 +315,6 @@ def addDataset(
if run:
self.butler.registry.registerCollection(run, type=CollectionType.RUN)
metric = self._makeExampleMetrics()
self.butler.put(metric, self.datasetType if datasetType is None else datasetType, dataId, run=run)
return self.butler.put(
Member

Please add a Returns section to the docstring.

@@ -51,6 +52,7 @@
removeTestTempDir,
)
from lsst.resources.http import HttpResourcePath
from lsst.resources.s3utils import clean_test_environment_for_s3, getS3Client
Member

S3 dependencies are not installed by default in the pyproject.toml definition, so if server testing now requires boto3, this will need import protection so the tests can be skipped.
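A conventional shape for that import protection (a sketch; the test class and skip message are made up for illustration):

```python
import unittest

# Guarded import: boto3 is an optional extra, so its absence must not
# break test collection.
try:
    import boto3  # noqa: F401

    HAVE_BOTO3 = True
except ImportError:
    HAVE_BOTO3 = False


@unittest.skipUnless(HAVE_BOTO3, "boto3 not installed; skipping S3 server tests")
class ButlerServerS3TestCase(unittest.TestCase):  # hypothetical test class
    def test_signed_urls(self) -> None:
        ...  # S3-backed assertions would go here
```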

def _create_corrupted_dataset(repo: MetricTestRepo) -> DatasetRef:
run = "corrupted-run"
ref = repo.addDataset({"instrument": "DummyCamComp", "visit": 423}, run=run)
uris = repo.butler.getURIs(ref, run=run)
Member

I'm hoping that the run parameter is not needed here because the ref knows its own run.

@dhirving dhirving force-pushed the tickets/DM-41878 branch 2 times, most recently from 9ef02ce to d718a09 Compare December 7, 2023 19:40
@dhirving dhirving force-pushed the tickets/DM-41879 branch 2 times, most recently from 7af2210 to 55d0b33 Compare December 7, 2023 23:51
Base automatically changed from tickets/DM-41878 to main December 8, 2023 16:25
Deduplicated some environment cleanup logic between resources and daf_butler
Butler server will initially only support returning HTTP URLs to the client via signing of S3 URLs, so our tests need to use an S3 root for its DataStore.
Move the resource size check when retrieving artifacts so it uses the downloaded data size, instead of checking the size separately.

HEAD is not supported on S3 URLs presigned for GET, so HttpResourcePath.size() does not work for them.  This is also a slight optimization, since it saves an unnecessary network round trip.
In the currently planned use case for Butler client/server, the server uses S3 for storing artifacts and gives clients access to them by presigning URLs.
@dhirving dhirving merged commit d550cd6 into main Dec 8, 2023
17 checks passed
@dhirving dhirving deleted the tickets/DM-41879 branch December 8, 2023 20:08