locations constraints on DRS Pointer #400

Open
mattions opened this issue Dec 5, 2023 · 5 comments

@mattions

mattions commented Dec 5, 2023

In the CRDC driver project and also in BioDataCatalyst, we have a situation where the host of the data would like to provide guidance on how to use the data, and where to use it.

In other words, they would like any platform downstream of the DRS server to compute on the data in certain cloud locations, which are usually the same ones where the data is hosted. The reasons for this request vary, from keeping egress costs down to not having the data leave its security level.

Given that in the end DRS hands out a download URL, it would be pretty difficult to enforce this. I therefore suggest we move towards an approach where the host "suggests" the preferred way to access the data, and the DRS clients accessing the data honor the request to the best of their ability.

Proposal

The proposal aims to enhance the GA4GH DRS (Data Repository Service) specification by introducing a new field that provides metadata regarding the intended usage and location constraints for data objects. This additional field will allow data providers to specify their preferences and requirements for how the data should be accessed and utilized. The proposed field will offer the following options:

  1. Cloud Exclusive (cloud_exclusive): the data object is intended for use exclusively within a cloud environment. Users are expected to access and process the data only within a cloud computing infrastructure and not outside of it; for example, the data should not be downloaded to somebody's laptop.

  2. Cloud Provider-Limited (cloud_provider_limited): the data object should not leave the cloud provider's ecosystem. Users are restricted from moving the data to external locations or platforms. It must remain within the boundaries of the specified cloud provider.

  3. Cloud Region-Limited (cloud_region_limited): the data object is restricted to a specific cloud region. Users are required to access and process the data within the designated region and are prohibited from transferring it to other geographic locations within the cloud provider's infrastructure.

By introducing this new field, data providers and administrators can communicate their data access and usage policies more effectively, ensuring that data is handled in accordance with their specific requirements. This addition not only enhances the flexibility of the DRS specification but also strengthens data governance and compliance for genomic and health-related data in cloud-based environments.

It could look like this:

{ 
  "id": "string", 
  "name": "string", 
  "self_uri": "drs://drs.example.org/314159", 
  "size": 1024, 
  "created_time": "2019-08-24T14:15:22Z", 
  "updated_time": "2019-08-24T14:15:22Z", 
  "version": "string", 
  "mime_type": "application/json", 
  "checksums": [ 
    { 
      "checksum": "string", 
      "type": "sha-256" 
    } 
  ], 
  "usage_constraints": { 
    "access_type": "cloud_exclusive", 
    "location_constraints": { 
      "cloud_provider": "AWS", 
      "cloud_region": "us-west-2" 
    } 
  } 
} 

In this structure:

  • usage_constraints is a section within the DRS metadata specifically dedicated to describing data usage and location constraints.
  • access_type is a mandatory field that specifies how the data should be accessed and used. We can define different values such as cloud_exclusive, cloud_provider_limited, cloud_region_limited to represent the intended usage constraints.
  • location_constraints is an optional nested section that provides additional details, depending on the access type. For example, it includes cloud_provider to specify the preferred cloud provider, and cloud_region to designate the desired cloud region.

This structured metadata allows data providers to clearly communicate their data access and usage policies, ensuring that users are aware of the intended constraints. It also enables data consumers to make informed decisions about how to handle and access the data. The specific values for access_type can be defined in the DRS specification, and they should correspond to the proposed usage policy options. This structure helps promote consistency and interoperability across different implementations of the DRS specification.
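As a minimal sketch of how a DRS client might honor these suggestions, the function below checks a DRS object against the environment the client runs in. The function name, its parameters, and the case-insensitive comparison are assumptions made for illustration; only the field names and the three access_type values come from the proposal above. The check is advisory: nothing here prevents a client from ignoring the result.

```python
from typing import Optional


def complies_with_usage_constraints(
    drs_object: dict,
    client_provider: Optional[str],
    client_region: Optional[str],
) -> bool:
    """Return True if the client's environment matches the provider's stated preference.

    client_provider is None when the client is not running in a cloud at all
    (for example, on a laptop). Hypothetical helper, not part of any DRS library.
    """
    constraints = drs_object.get("usage_constraints")
    if not constraints:
        # No preference stated by the data provider.
        return True

    access_type = constraints["access_type"]  # mandatory in the proposal
    location = constraints.get("location_constraints", {})

    def same(a: Optional[str], b: Optional[str]) -> bool:
        # Assumption: provider and region names are compared case-insensitively.
        return a is not None and b is not None and a.lower() == b.lower()

    if access_type == "cloud_exclusive":
        # Any cloud environment is acceptable.
        return client_provider is not None

    same_provider = same(client_provider, location.get("cloud_provider"))
    if access_type == "cloud_provider_limited":
        return same_provider
    if access_type == "cloud_region_limited":
        return same_provider and same(client_region, location.get("cloud_region"))

    raise ValueError(f"unknown access_type: {access_type!r}")


if __name__ == "__main__":
    example = {
        "id": "string",
        "usage_constraints": {
            "access_type": "cloud_region_limited",
            "location_constraints": {"cloud_provider": "AWS", "cloud_region": "us-west-2"},
        },
    }
    print(complies_with_usage_constraints(example, "AWS", "us-west-2"))
    print(complies_with_usage_constraints(example, None, None))
```

A client that gets False back could warn the user or pick a different compute location, which matches the "best of their ability" spirit of the proposal.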

@ianfore

ianfore commented Dec 5, 2023

The CRDC driven work in fasp-scripts had this use case in mind. The basic model was to use DRS to find out where the provider (CRDC, BDC, Anvil, etc) had made the data available and "go with the flow" of running compute there rather than downloading.

The guidance is in essence provided by the provider by having the DRS service tell the consumer where the data is available.

Some providers didn't enforce the expectation that the consumer would compute in place. They expected the consumer to "go with the flow".
Others made their buckets "requester pays", which meant they weren't restricting where you did the compute, but the consumer would have to pay if they didn't go the provider's preferred route, which is to compute on the data in place.

If we need the addition proposed here, it would likely be better as an attribute of an access method, providing the constraints on usage in that particular location.

@MichaelLukowski
Collaborator

I think this is a valid concern for data that is being indexed by a DRS server; however, I am not sure that the GET /objects/{object_id} endpoint is the best location for the requested information. I tend to agree with @ianfore that this could be part of the access method flow. Perhaps this could be an optional field as part of OPTIONS /objects/{object_id}?

@briandoconnor
Contributor

briandoconnor commented Mar 25, 2024

Trying to mock this as part of the access method, could this be informational in this way:

Some things I included here:

  • ability to have more than one constraint type (array)

  • more fields to express location

    "access_methods": [
      {
        "type": "s3",
        "access_url": {
          "url": "string",
          "headers": "Authorization: Basic Z2E0Z2g6ZHJz"
        },
        "access_id": "string",
        "location": {   <-- a place for declaring locations... could be cloud-based using a predetermined list of providers + regions or geo location or country code. This can be used for information... the constraints come below.
          "geo_location_coordinates": "lat long coordinates",
          "geo_location_country_code": "country code",
          "cloud_provider": "cloud provider name",
          "cloud_region": "region code that makes sense"
        },
        "usage_constraint": [   <-- if you try to access outside of these constraints as determined by this DRS server then you will get an unauthorized response
          {
            "access_type": "cloud_exclusive" or "geolocation_exclusive",   <-- this would let you express the access constraint type... in the future this might be a place we can express other constraint types (think DUO)
            "location_constraints": {
              "cloud_provider": "AWS",   <-- for "cloud_exclusive"
              "cloud_region": "us-west-2",   <-- for "cloud_exclusive"
              "geo_location_country_codes": ["country code"]   <-- array of country codes for "geolocation_exclusive"
            }
          }
        ],
        ...
      }
    ]
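To illustrate how a client could use per-access-method constraints like the ones mocked above, here is a small sketch that filters access methods down to those whose constraints the client appears to satisfy. The function names and the shape of the client dictionary (cloud_provider, cloud_region, country_code) are hypothetical. It also assumes that every entry in the usage_constraint array must hold, which the mock does not specify. In the mock the DRS server makes the final decision, so this is only a client-side pre-check.

```python
def satisfies(constraint: dict, client: dict) -> bool:
    """Check one usage_constraint entry against a description of the client."""
    limits = constraint.get("location_constraints", {})
    kind = constraint.get("access_type")
    if kind == "cloud_exclusive":
        # Only compare the fields the provider actually filled in.
        return all(
            client.get(key) == limits[key]
            for key in ("cloud_provider", "cloud_region")
            if key in limits
        )
    if kind == "geolocation_exclusive":
        return client.get("country_code") in limits.get("geo_location_country_codes", [])
    # Unknown constraint type: be conservative and treat it as not satisfied.
    return False


def usable_access_methods(access_methods: list, client: dict) -> list:
    """Keep the access methods whose constraints are all satisfied (assumed AND semantics)."""
    return [
        method
        for method in access_methods
        if all(satisfies(c, client) for c in method.get("usage_constraint", []))
    ]


if __name__ == "__main__":
    methods = [
        {
            "type": "s3",
            "access_id": "aws-copy",
            "usage_constraint": [
                {
                    "access_type": "cloud_exclusive",
                    "location_constraints": {"cloud_provider": "AWS", "cloud_region": "us-west-2"},
                }
            ],
        },
        {"type": "https", "access_id": "open-copy"},
    ]
    client = {"cloud_provider": "AWS", "cloud_region": "us-east-1", "country_code": "US"}
    for method in usable_access_methods(methods, client):
        print(method["access_id"])
```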

@kanchana404

{
  "id": "string",
  "name": "string",
  "self_uri": "drs://drs.example.org/314159",
  "size": 1024,
  "created_time": "2019-08-24T14:15:22Z",
  "updated_time": "2019-08-24T14:15:22Z",
  "version": "string",
  "mime_type": "application/json",
  "checksums": [
    {
      "checksum": "string",
      "type": "sha-256"
    }
  ],
  "usage_constraints": {
    "access_type": "cloud_exclusive",
    "location_constraints": {
      "cloud_provider": "AWS",
      "cloud_region": "us-west-2"
    }
  }
}

In this corrected version:

  1. The usage_constraints section contains the access_type and location_constraints fields.
  2. access_type specifies how the data should be accessed and used.
  3. location_constraints provides additional details such as the preferred cloud provider (cloud_provider) and desired cloud region (cloud_region).

@briandoconnor
Contributor

In the Cloud WS meeting on Aug 12th, 2024 we decided to simplify the feature described in Issue #400 for DRS release 1.5.

PR #407, intended for DRS release 1.5, simply adds a string "cloud" to the access response. We now include cloud, region, and type information only; there is no support for cloud or geo location constraints, for example.

The fields we will include are:

  • Cloud: e.g. gcp or aws or azure (this is the new field)
  • Type: s3 or https or etc
  • Region: e.g. us-east-1

After DRS 1.5 we can revisit how we express region, cloud, geo location, etc. constraints in DRS, which is a much bigger issue.
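With only informational fields available, a client can still "go with the flow" by preferring the access method closest to where it is running. The sketch below assumes each access method is a dictionary with lowercase keys cloud, region, and type, mirroring the fields listed above; the exact key names and the example values are assumptions, and the ranking rule is just one possible policy.

```python
def rank_access_methods(access_methods: list, client_cloud: str, client_region: str) -> list:
    """Order access methods: same cloud and region first, same cloud next, everything else last."""

    def score(method: dict) -> tuple:
        same_cloud = method.get("cloud") == client_cloud
        same_region = same_cloud and method.get("region") == client_region
        # False sorts before True, so better matches come first.
        return (not same_region, not same_cloud)

    # sorted() is stable, so ties keep the order the server returned them in.
    return sorted(access_methods, key=score)


if __name__ == "__main__":
    methods = [
        {"type": "https", "cloud": "gcp", "region": "us-central1"},
        {"type": "s3", "cloud": "aws", "region": "us-west-2"},
        {"type": "s3", "cloud": "aws", "region": "us-east-1"},
    ]
    for method in rank_access_methods(methods, "aws", "us-east-1"):
        print(method["cloud"], method["region"], method["type"])
```

Because nothing is enforced in this simplified form, the ranking only expresses a preference; a client could still fall back to a method in another cloud or region.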

5 participants