Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[batch] Add ability to create job groups at top level #14170

Closed
wants to merge 77 commits into from

Conversation

jigold
Copy link
Contributor

@jigold jigold commented Jan 18, 2024

Dan, Can you look at the last commit only? The stack is kind of broken right now. Note, it is not possible yet to create jobs within the new job groups.

The only non-straightforward part of how this turned out is the batch_updates table copies the previous start_job_id and start_job_group_id with n_jobs = 0 or n_job_groups = 0 if there are none in that update. I think it's fine to do this, but I did need to remove a unique constraint on (batch_id, start_job_id). I tried to find any instances where we use that index, but couldn't find any. We'll want to double check that.

@jigold
Copy link
Contributor Author

jigold commented Jan 18, 2024

Other to-do items are to make sure the stack has tests for authorization for all new endpoints in test_batch.py in the corresponding PR.

@danking
Copy link
Contributor

danking commented Jan 18, 2024

I will look but I suspect I have to look at everything to have the right context.

@@ -45,11 +45,6 @@ spec:
value: "{{ default_ns.name }}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason GitHub shows this file. I think if you merge/rebase with main it will go away.

@jigold jigold dismissed danking’s stale review February 7, 2024 20:06

I need feedback again!

@jigold jigold force-pushed the create-job-groups-w-batch-updates branch from 70e4d7f to 823da60 Compare February 8, 2024 13:26
@jigold
Copy link
Contributor Author

jigold commented Feb 8, 2024

I thought about this some more. I think the correct solution is to have two values for n_jobs in the job_groups table. There should be n_jobs (direct child jobs) and n_jobs_recursive (all descendent jobs). This way the UI and the status will make sense for whichever use case is most applicable. To do this, we'll need to write a migration that adds the new column and backfills the column. Right now, those two values are the same so it's just a copy of one column to another column. We might actually be able to do this migration in one update command. The other tables had hundreds of millions of rows and would have taken days and crashed the db if there was not enough memory. I can test this locally on my laptop by creating a test database and inserting 10 million records and then seeing how long it takes to do the update.

mysql> select count(*) from job_groups limit 10;
+----------+
| count(*) |
+----------+
|  8122788 |
+----------+
1 row in set (16.75 sec)

@jigold
Copy link
Contributor Author

jigold commented Feb 8, 2024

Actually, I have a feeling this bulk update will lock the entire table for 30 mins to an hour since there's no use of a where condition.

@danking
Copy link
Contributor

danking commented Feb 13, 2024

Can we close this PR now?

@danking
Copy link
Contributor

danking commented Feb 13, 2024

I'm going to start testing, but I think the only thing that wasn't clear to me was how to resolve these sorts of comments. Do I try and fix them now or in a separate PR with an issue to make sure it gets noted? #14170 (comment)

Since this is merely propagating existing bad behavior, let's change it separately so as to keep the conceptual overhead of this change as small as possible. Please make an issue and paste the link into each comment so that we can later track how we resolved each comment.

Ok. Still working on getting the tests to pass and cleaning things up. However, I ran into a small snag. The code below needs to be ironed out. Should the number of jobs and state of the job group be recursive or specific to that job group? It's a bit weird for the billing and cancellation to be nested, but the number of jobs etc. are not. More concretely, if a child batch is running, should the parent also be running even if it has no direct child jobs that are running? Thoughts?

cc: @daniel-goldstein

      UPDATE batches SET
        `state` = 'running',
        time_completed = NULL,
        n_jobs = n_jobs + expected_n_jobs
      WHERE id = in_batch_id;

      ### FIXME FIXME what should the state be of nested job groups?
      UPDATE job_groups
      INNER JOIN (
        SELECT batch_id, job_group_id, CAST(COALESCE(SUM(n_jobs), 0) AS SIGNED) AS staged_n_jobs
        FROM job_groups_inst_coll_staging
        WHERE batch_id = in_batch_id AND update_id = in_update_id
        GROUP BY batch_id, job_group_id
      ) AS t ON job_groups.batch_id = t.batch_id AND job_groups.job_group_id = t.job_group_id
      SET `state` = 'running', time_completed = NULL, n_jobs = n_jobs + t.staged_n_jobs;

When you say "billing and cancellation [is] nested" do you mean that the bill for a group is the sum of the bill for all jobs directly in the group with all jobs in any descendent group?

Since we decided that groups are nested, my inclination is for everything to represent a sum total over the direct jobs and jobs within any descendant groups. From here on out "sum total" means exactly that. OK, so:

  1. In the UI (database should do what makes sense and is fast), the number of jobs should be the sum total but the pagination should page through the direct jobs.
  2. In the UI (same caveat), the total bill should be the sum total. (Including a sum of direct jobs cost seems fine if it is efficiently computable from the database).
  3. Yes, a group is running if any direct job or job within any descendant group is running (reasoning: if cancelling the group could cancel a running job, the UI needs to indicate that fact).

@danking
Copy link
Contributor

danking commented Feb 13, 2024

When do we want n_direct_jobs?

@jigold
Copy link
Contributor Author

jigold commented Feb 13, 2024

Can we close this PR now?

Yes

When you say "billing and cancellation [is] nested" do you mean that the bill for a group is the sum of the bill for all jobs directly in the group with all jobs in any descendent group?

Ended up with this design in #14282

When do we want n_direct_jobs?

Not sure. If we need it later on, we can run a separate migration.

@jigold
Copy link
Contributor Author

jigold commented Feb 13, 2024

Closing in favor of #14282

@jigold jigold closed this Feb 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants