Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/issue 334 drs bulk requests #365

Merged
merged 43 commits into from
Sep 25, 2023

Conversation

briandoconnor
Copy link
Contributor

@briandoconnor briandoconnor commented Aug 16, 2021

Goal

We want to have this merged into develop for DRS 1.4 for the 2023 Plenary

Background

See Issue #334

A PR request for bulk operations in DRS. As of 5/22/23 we have the complete set of bulk endpoints for authorization information, DRS IDs, and DRS access methods. This PR does not include pagination nor does it include explicit pairing of passports to the output of a bulk response. For pagination, I think we should move bulk forward and cleanly separate out pagination as a different feature/PR. For pairing, the feature branch now has bulk options support, so you can 1) ask the DRS server what authorization mechanisms are needed for a bulk list of DRS IDs and then 2) make one or more bulk requests where you pair the correct passport(s) or bearer tokens in your request.

Built Docs

See the built doc here: https://ga4gh.github.io/data-repository-service-schemas/preview/feature/issue-334-drs-bulk-requests/docs/#tag/Objects/operation/GetBulkObjects/

For more information...

See the Hackathon notes from our June FASP hackathon. Including this gist

See also Feature/issue 334 bulk requests nobundle from Ian and DRS bundle contents pagination from Jeremy

@@ -103,8 +103,12 @@ x-tagGroups:
paths:
/objects/{object_id}:
$ref: ./paths/objects@{object_id}.yaml
/bulk/objects:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about making this endpoint at just POST /objects rather than POST /bulk/objects ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'll make that change.

We do need to figure out how to enable multi-part POST here so it can be used with Passports. this is a TODO

schema:
type: object
properties:
object_ids:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about renaming this property to selection to harmonize with the DRS + Passport downscoping thread.

@kwrodarmer @mbarkley WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So "object_ids" --> "selection"? I'm fine with that...

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It started as selection and, with some discussion, we consciously went to object_ids as there may be other use cases than "selection" for how you get to a list of ids that you want to request. object_ids is neutral as to use case, rather than projecting a particular use case on to DRS which has some generality.

schema:
type: object
properties:
drsobjects:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some thoughts on the response payload of a bulk DRS request:

  1. Currently drsobjects is a flat list, would it be advantageous to make it an object with the supplied IDs as keys, and the corresponding DRS Object as the value?
  2. There could be a mix of valid and invalid IDs in the overall selection (e.g. no resource by an ID, client not authorized to access that DRS Object). Should we make a slightly more complex object that clearly outlines which IDs could not be returned as Objects and why?

Taking the above points together, here is a proposed example response payload:

{
  "summary": {
    "requested": 5,
    "loaded": 2,
    "unloaded": 3
  },
  "loadedDrsObjects": {
    "123": {
      "id": "123",
      "name": "DRS Object 123",
      ...
    },
    "456": {
      "id": "456",
      "name": "DRS Object 456",
      ...
    }
  },
  "unloadedDrsObjects": {
    "777": 404,
    "778": 404,
    "779": 401 
  }
}

In the above example, the response payload summarizes how many IDs were passed, how many DRS Objects were successfully retrieved, and how many were unsuccessful. Each retrieved DRS Object can be referenced by its ID. Each non-retrieved DRS Object comes with an explanation (in the form of HTTP code) why it wasn't retrieved.

It's a more complex payload overall, but provides more information to the client about how the request was processed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's discuss on today's call

Copy link

@patmagee patmagee Jan 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments on the proposed format:

Summary

Depending on the number of items that are requested and where we end up falling on pagination in a bulk request, the summary information may not be known ahead of time when the initial response is sent. Its possible this number may be per page, instead of in total.

Results Format

It may be nice to return the objects as an array with the ordering mirroring the order of the requested Id's This leads to less surprises and allows the client to iterate over the objects in the order they expect. Additionally, since this is a bulk operation we could interleave the errors that are encountered directly in the array at the appropriate index. (this is using the error message format from beacon here

for example, if I request both [456,457] the following may be a valid response

{
  "drsObjects" [
    {
      "id": 456,
      "name":  "..."
    },
    {
      "id": 457,
      "error":  {
        "errorCode": 404,
        "errorMessage": "DRS object does not exist"
      }
    }
  ]
}

@ianfore
Copy link

ianfore commented Jan 11, 2022

See also - #377 for a version of this which drops "contents" and "expand".

@bcli4d
Copy link

bcli4d commented Mar 14, 2022

Comments are relative to the built doc:
Does POST /objects need an error like "413 Payload Too Large", to indicate that bulk_object_ids is too large?
If so, how does the client find out what the limit is? Perhaps from the service-info endpoint? Note that the built doc is missing that endpoint.
The POST /objects Request Body schema is described as "Array of strings". Should that be more specific, e.g."Array of DrsObject identifiers'?
For POST /objects, if some but not all objects were resolved, what is the response code? Currently the build doc responses include "200 The DrsObjects were found successfully" and "404 The requested DrsObject wasn't found".

Same comments generally apply to POST /objects/access

@briandoconnor
Copy link
Contributor Author

Thanks @bcli4d this is great feedback. I'll make these changes/clarifications since they all sound totally reasonable.

@briandoconnor
Copy link
Contributor Author

FYI, this branch is building properly even though it's a PR. So I'm not going to use (and will delete) the v2 of this feature branch.

@briandoconnor
Copy link
Contributor Author

Added 413 error based on @bcli4d 's comment above

@mattions
Copy link

mattions commented Aug 3, 2023

We took a look at this PR and we do like the general approach and how it is set-up.

One request we would like to advance, similar to what @bcli4d has mentioned and @briandoconnor started to solve is the length of the bulk_objects_ids

We understand the 413 is a good error message, however we would like to have it written as an integer variable somewhere, so the client can figure out before starting, what is the allowed length.

We happy to have this in service info or somewhere else, but it must be explicit, and mandatory, so the client can build its own logic with that in mind.

Do we know if other GA4GH API specified specific variable/info in the service_info response, which it's useful for that API and if there is a canonical format to do it?

@briandoconnor
Copy link
Contributor Author

So looping back on the remaining issues/comments here:

  • bulk_objects_ids and bulk_object_access_ids array length max needed in /service-info, should be mandatory and describe the max length the bulk requests can be for this DRS server --> added maxBulkRequestLength to the /service-info endpoint
  • "if some but not all objects were resolved" --> the response includes "unresolved_drs_objects" to show which are not resolved. If none are resolved then the server can return 404
  • better descriptions instead of "Array of Strings" --> I added description text

@briandoconnor
Copy link
Contributor Author

@dglazer gave the feedback of what happens when the DRS server doesn't support bulk? Is the field still required? To be "1"?

@bcli4d
Copy link

bcli4d commented Sep 19, 2023

I think defaulting to 1 in the absence of the field is OK.

@dglazer
Copy link
Member

dglazer commented Sep 19, 2023

I'm "+0" on my own suggestion -- I think it's clean to say "you don't have to support batch, and if you don't you shouldn't have to specify a max batch size".

But I'm only +0 since, as Brian pointed out in person, my initial concern about breaking backwards compatibility was overblown, since compatibility only kicks in after you've already decided to upgrade to a new API version.

@briandoconnor
Copy link
Contributor Author

Just as a heads up, I'm planning on merging this into develop and opening a 1.4 release branch (which will have a request for comment period by drivers/implementers) by the end of GA4GH Connect 2023. Unless I hear an urgent problem that needs to be fixed of course...

@briandoconnor
Copy link
Contributor Author

Thanks everyone, I don't see any remaining issues. Merging into develop now. If you see more issues I'll open a release branch for comments. Always remember, this is going to develop now so if you see an issue you can always open up a new PR to fix it before we finalize 1.4

@briandoconnor briandoconnor merged commit 775868d into develop Sep 25, 2023
1 of 2 checks passed
@briandoconnor briandoconnor deleted the feature/issue-334-drs-bulk-requests branch September 25, 2023 20:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants