Scenarios to Deploy DataONE Indexer #24

Closed
taojing2002 opened this issue Sep 7, 2022 · 6 comments

@taojing2002
Collaborator

We have the following scenarios for deploying the DataONE indexer:

  1. CN
  2. Big Metacat Instances, like arctic and ess-dive
  3. Registered small Metacat instances, like PPBio
  4. Unregistered Metacat instances.

To get system metadata, the DataONE indexer looks in the file system first. If it can't find it there, it uses the DataONE API to get it. Until we release a Metacat version that supports storing system metadata in the file system, we have to use the member node token to read private system metadata. It would be a big burden for us to issue member node tokens for all registered Metacat instances every three months. The unregistered Metacat instances always share the same node id - urn:node:Metacat_Test. So a member node token issued for this node id can access every unregistered Metacat instance, except those where the operator has manually changed the node id, which is unlikely.
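
For illustration, the lookup order described above could be sketched roughly as follows. This is only a minimal Python sketch under stated assumptions, not the indexer's actual code: the on-disk layout (one file per object, named by the SHA-256 of its pid), the function names, and the `fetch_remote` callable that stands in for the DataONE API call are all hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Callable, Optional


def local_sysmeta_path(store_dir: Path, pid: str) -> Path:
    """Hypothetical layout: one file per object, named by the SHA-256 of its pid."""
    digest = hashlib.sha256(pid.encode("utf-8")).hexdigest()
    return store_dir / (digest + ".xml")


def load_system_metadata(
    pid: str,
    store_dir: Path,
    fetch_remote: Optional[Callable[[str], bytes]] = None,
) -> bytes:
    """Return system metadata for pid, preferring the file system over the API."""
    path = local_sysmeta_path(store_dir, pid)
    if path.is_file():
        # Local hit: no token and no network access are needed.
        return path.read_bytes()
    if fetch_remote is None:
        # No fallback configured, e.g. the DataONE API is not available.
        raise FileNotFoundError(
            f"no local system metadata for {pid!r} and no API fallback"
        )
    # Fallback: the caller supplies the API call, including any token it requires.
    return fetch_remote(pid)
```

Passing the API call in as a parameter keeps the token handling outside the lookup itself, so the same function works when no fallback is configured.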

@taojing2002 taojing2002 self-assigned this Sep 28, 2022
@taojing2002 taojing2002 added this to the 3.0.0 milestone Sep 28, 2022
@artntek
Collaborator

artntek commented Dec 1, 2023

@taojing2002 or @mbjones - Is there a "next action" here? It's hard to understand what will be needed in order to satisfy and close this issue.

@mbjones
Member

mbjones commented Dec 1, 2023

I think design overview documentation showing how those deployments each work would be sufficient.

@artntek
Collaborator

artntek commented Dec 6, 2023

upon re-(re)-reading this, it seems like it can be boiled down to:

  1. indexer currently uses the DataONE API to retrieve metadata from metacat
  2. it therefore needs the Member node token to read private metadata
  3. unregistered Metacat instances always share the same node id - urn:node:Metacat_Test
  4. So any member node token with this node id can access every other unregistered Metacat instance
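
A toy illustration of points 3 and 4, as a rough sketch only: it assumes, as a simplification, that an instance accepts a member node token whenever the token's node id equals the instance's configured node id. The function name, the instance names, and the renamed node id are made up for the example.

```python
# From the issue: the node id shared by unregistered Metacat instances.
DEFAULT_NODE_ID = "urn:node:Metacat_Test"


def token_accepted(token_node_id: str, instance_node_id: str) -> bool:
    """Simplified assumption: a token is honoured when the node ids match."""
    return token_node_id == instance_node_id


# Hypothetical instances: two keep the default id, one was renamed by its operator.
instances = {
    "unregistered-a": DEFAULT_NODE_ID,
    "unregistered-b": DEFAULT_NODE_ID,
    "renamed-by-operator": "urn:node:EXAMPLE_RENAMED",
}

# A single token issued for the default id matches every instance that kept it.
reachable = sorted(
    name
    for name, node_id in instances.items()
    if token_accepted(DEFAULT_NODE_ID, node_id)
)
```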

HOWEVER: once metacat has been changed to use hashstore, none of this will be necessary, since metadata can be accessed directly from the store. See #41 and #58

I therefore think this indexer issue can be closed (after metacat has been changed to use hashstore). Thoughts, @mbjones & @taojing2002 ?

@mbjones
Member

mbjones commented Dec 6, 2023

That all sounds reasonable. I think Jing's original list had some other characteristics that we should be watching out for. A few notes expanding on his list of considerations:

  • CN
    • the CN deployment of the indexer runs as an external process from Metacat (d1_cn_index_processor) and a separate index generator, which will need to be refactored before we can deploy; these both run independently of Metacat as standalone processes
    • the CN has a much larger corpus to reindex, which has historically taken 4-6 weeks to run the full re-indexing job, so our deployment should consider how to do this without downtime of the service
  • Big Metacat Instances, like arctic and ess-dive
    • deployment should ensure we can handle the reindex of these corpora with millions of files (similar to CN, but less metadata and a lot more data files)
    • make sure there is a deployment path to migrate from VM to K8S-based deployments
    • check on mechanism to incorporate large datasets that are not currently in metacat due to scaling limits
  • Registered small Metacat instances, like PPBio
    • simplest of deployments, probably just make sure the deployment is automated and smooth; reindexing is likely fast for small collections
  • Unregistered Metacat instances.
    • not sure there are special issues here, but need to be sure we can index without DataONE API access if it's disabled

@artntek
Collaborator

artntek commented Dec 7, 2023

ok - so it sounds like this is really 2 issues in one?

  1. Document the 4 deployment scenarios
    • still open. How/where should this be documented, ideally?
  2. Concerns about shared-id-member-node-token issue
    • can be closed when metacat has been changed to use hashstore

If this is true, then I'll split the issue to make it clearer.

@mbjones
Member

mbjones commented Dec 7, 2023

Yeah, I think the token issues for accessing system metadata will be moot for Metacat because of hashstore. Assuming we can deploy metacat on the CN with hashstore, there's probably nothing to be done there except factor out the reliance on the token altogether. A separate issue for that would be good.

@artntek artntek modified the milestones: 3.0.0, 3.1.0 Feb 6, 2024
@artntek artntek closed this as completed Feb 6, 2024