only return one document per file checked (including checks on data files) #454

Open
jeanetteclark opened this issue Oct 2, 2024 · 0 comments
jeanetteclark commented Oct 2, 2024

The goals here are:

  • return more atomic documents instead of massive result docs that hold hundreds of file results
  • increase efficiency by parallelizing checks across data files
  • return more atomic results from Solr

The current sequence looks something like this:

sequenceDiagram
    participant engine as Worker
    participant dispatcher as Dispatcher

    engine->>engine: Run getDataPids()
    engine->>dispatcher: create dispatcher
    dispatcher->>dispatcher: Run checks for each pid
    dispatcher->>engine: Return result for all pids
    engine-->>engine: Index into Solr

Proposed sequence would potentially look like this:

sequenceDiagram
    participant engine as Worker
    participant dispatcher1 as Dispatcher 1
    participant dispatcher2 as Dispatcher 2

    engine->>engine: Run getDataPids()

    par Parallel Thread 1
        engine->>dispatcher1: Create dispatcher for PID 1
        dispatcher1->>dispatcher1: Run checks for PID 1
        dispatcher1->>engine: Return result for PID 1
        engine-->>engine: Index into Solr
    and Parallel Thread 2
        engine->>dispatcher2: Create dispatcher for PID 2
        dispatcher2->>dispatcher2: Run checks for PID 2
        dispatcher2->>engine: Return result for PID 2
        engine-->>engine: Index into Solr
    end

To do this we'll need to modify the Solr indexing (which I think needs help anyway), the dispatch system (maybe running dispatchers in parallel?), and possibly the schema for the run document.
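
To make the proposed sequence concrete, here is a minimal Python sketch of the control flow only. It is not the engine's actual code: `get_data_pids`, `run_checks`, `index_document`, and `process_package` are hypothetical stand-ins, the result shape is invented for illustration, and a plain dict stands in for the Solr index. It assumes the checks for each PID are independent of each other.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def get_data_pids(metadata_pid):
    """Hypothetical stand-in for getDataPids(): list the data file PIDs."""
    return [f"{metadata_pid}/data-{i}" for i in range(1, 4)]


def run_checks(pid):
    """Hypothetical stand-in for one dispatcher: runs checks for a single PID."""
    # Assumed result shape; the real run document schema may differ.
    return {"pid": pid, "checks": [{"id": "example.check", "status": "SUCCESS"}]}


def index_document(index, doc):
    """Hypothetical stand-in for Solr indexing: stores one document per PID."""
    index[doc["pid"]] = doc


def process_package(metadata_pid, index, max_workers=4):
    """Run one dispatcher per PID in parallel and index each result separately."""
    pids = get_data_pids(metadata_pid)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One task per PID, instead of one dispatcher looping over all PIDs.
        futures = {pool.submit(run_checks, pid): pid for pid in pids}
        # Index each small document as soon as its checks finish.
        for future in as_completed(futures):
            index_document(index, future.result())
    return pids


if __name__ == "__main__":
    index = {}
    process_package("example-package", index)
    print(sorted(index))
```

One simplification relative to the diagram: the sketch indexes from the calling thread as results arrive, rather than inside each parallel branch, which avoids concurrent writes to the index. Whether indexing should also move into the parallel branches probably depends on how the Solr indexing ends up being reworked.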

@jeanetteclark jeanetteclark added this to the 3.2 milestone Oct 2, 2024