Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement kill, delete and remove endpoints #64

Merged
merged 2 commits into from
Oct 26, 2023

Conversation

simon-mazenoux
Copy link
Contributor

@simon-mazenoux simon-mazenoux commented Aug 24, 2023

Implement kill, delete and remove endpoints (see #22, #63)

@simon-mazenoux simon-mazenoux marked this pull request as draft August 24, 2023 15:49
@simon-mazenoux simon-mazenoux changed the title Implement taskqueuedb Implement Taskqueuedb Aug 24, 2023
@simon-mazenoux simon-mazenoux changed the title Implement Taskqueuedb Implement TaskQueueDB Aug 24, 2023
@simon-mazenoux simon-mazenoux force-pushed the feat-implement-taskqueuedb branch 3 times, most recently from f61699c to 6e86eb1 Compare August 30, 2023 16:38
@simon-mazenoux simon-mazenoux force-pushed the feat-implement-taskqueuedb branch 2 times, most recently from d64eb2a to bd52dc7 Compare September 11, 2023 08:39
@simon-mazenoux simon-mazenoux changed the title Implement TaskQueueDB Implement kill and delete endpoints Sep 11, 2023
@simon-mazenoux simon-mazenoux changed the title Implement kill and delete endpoints Implement kill, delete and remove endpoints Sep 12, 2023
@simon-mazenoux simon-mazenoux marked this pull request as ready for review September 12, 2023 09:14
@simon-mazenoux
Copy link
Contributor Author

In this PR, and unlike what is done in DIRAC, I used the ON DELETE CASCADE constraint. This change would therefore require an ALTER statement on some tables of the DB. Should we do this or duplicate the logic of what is currently done in DIRAC ?

@chrisburr
Copy link
Member

In this PR, and unlike what is done in DIRAC, I used the ON DELETE CASCADE constraint. This change would therefore require an ALTER statement on some tables of the DB. Should we do this or duplicate the logic of what is currently done in DIRAC ?

This seems very sensible to me, would any changes be needed in DIRAC v8.1 to make it tollerant to this? We should probably discuss it in the BiLD, can you open an issue in vanilla DIRAC?

@simon-mazenoux
Copy link
Contributor Author

In this PR, and unlike what is done in DIRAC, I used the ON DELETE CASCADE constraint. This change would therefore require an ALTER statement on some tables of the DB. Should we do this or duplicate the logic of what is currently done in DIRAC ?

This seems very sensible to me, would any changes be needed in DIRAC v8.1 to make it tollerant to this? We should probably discuss it in the BiLD, can you open an issue in vanilla

I think DIRAC is already tolerant: by adding the ON DELETE CASCADE constraint, we would just avoid having to delete manually every foreign key reference to the row we want to delete. Therefore, I don't really see how it could affect the codebase besides not raising an error anymore if a foreign key is not deleted before deleting the primary key.

@@ -198,3 +210,99 @@ class LoggingInfo(JobLoggingDBBase):
StatusTimeOrder = Column(Numeric(precision=12, scale=3), default=0)
StatusSource = Column(String(32), default="Unknown")
__table_args__ = (PrimaryKeyConstraint("JobID", "SeqNum"),)


class TaskQueues(TaskQueueDBBase):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just stylistic, but: you added this in the same module where JobDB (and JobLoggingDB) is. SandboxMetadataDB is instead defined in another module. We either put all "WMS-related" schemas together or we split them in different files like it was in DIRAC.

)


class BannedSitesQueue(TaskQueueDBBase):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you check if this is still used anywhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still used by the JobScheduling optimizer (__sendToTQ method that create/fills the JobRequirements section of the JDL, which is then interpreted by the insertIntoTQ method of the JobState class which calls the TQDB insertJob method). This can be the case if a user specifies a list of banned sites and the job doesn't require any site in particular.

However, I would argue that having a BannedSitesQueue table is not very appealing.

One solution would be to get the list of all sites, remove from this list the banned sites and insert the remaining sites in the tq_TQToSites table.

I think we should also ask the community if banning sites is a real use case, because this feature brings quite a lot of complexity.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feature is there because, most probably, of some (LHCb) user requirement of long time ago. This reminds me that we have often argued that it did not make much sense and that could even create issues. I would be tempted to remove it also.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as we use the ON DELETE CASCADE constraint in DIRAC to remove the entries from this table, then I think we can remove this table from diracx.

src/diracx/db/sandbox_metadata/db.py Outdated Show resolved Hide resolved
@simon-mazenoux simon-mazenoux force-pushed the feat-implement-taskqueuedb branch 2 times, most recently from 9b4990c to 993d2d4 Compare September 12, 2023 14:22
src/diracx/db/jobs/db.py Outdated Show resolved Hide resolved
src/diracx/db/jobs/db.py Outdated Show resolved Hide resolved
src/diracx/db/jobs/db.py Outdated Show resolved Hide resolved
src/diracx/db/jobs/db.py Outdated Show resolved Hide resolved
src/diracx/db/jobs/db.py Outdated Show resolved Hide resolved
src/diracx/db/jobs/status_utility.py Outdated Show resolved Hide resolved
src/diracx/routers/job_manager/__init__.py Show resolved Hide resolved
tests/routers/test_job_manager.py Outdated Show resolved Hide resolved
Comment on lines 249 to 261
await asyncio.gather(
*(
delete_job(
job_id,
config,
job_db,
job_logging_db,
task_queue_db,
background_task,
)
for job_id in job_ids
)
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be practical to make this an actual bulk operation on the DB side? i.e. something like:

DELETE FROM JOBS WHERE JOB_ID IN (123, 124, 125, 126)

rather than

DELETE FROM JOBS WHERE JOB_ID = 123
DELETE FROM JOBS WHERE JOB_ID = 124
DELETE FROM JOBS WHERE JOB_ID = 125
DELETE FROM JOBS WHERE JOB_ID = 126

(This is probably better as an issue and then a separate PR rather than adding complexity to this one.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up doing it here because I had problems using the TaskGroup+except* as you advised, and I wanted to see if making bulk methods could help me avoid theses. In the end, I still had to use them, but most of the work was already done. Sorry for the complexity.

for job_id in job_ids
)
)
except NoResultFound as e:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably use a TaskGroup+except* instead of gather so we get the better exception handling.

Ideally I think we should return a more explicit error like:

detail={
    "message": "Failed to remove 5 jobs out of 167",
    "failed_ids": [1, 78, 32, 25, 90]
}

or maybe even:

detail={
    "message": "Failed to remove 5 jobs out of 167",
    "failed_ids": {
        "1": str(e1),
        "78": str(e2),
        "32": str(e3),
        "25": str(e4),
        "90": str(e5)
    }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we should cancel any action if one of them fail ? (for example here should we remove the other 162 jobs or not ?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that’s practical I think that would be nice. What do you think @fstagni @chaen?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that’s practical I think that would be nice.

I concur.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chaen Agrees as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Linked discussion: #97

@simon-mazenoux simon-mazenoux force-pushed the feat-implement-taskqueuedb branch 7 times, most recently from f17af3a to f503216 Compare September 21, 2023 16:20
@simon-mazenoux
Copy link
Contributor Author

@chaen I think the pytest-integration tests are failing because the TaskQueueDB and SandboxMetadataDB are not in diracx-charts. Is this normal ? If so, do you want me to add them in a pull request there, or should I let you do it?

@chrisburr chrisburr merged commit ba6966b into DIRACGrid:main Oct 26, 2023
8 of 9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants