Request Status
This page describes the different statuses a request can have in ReqMgr.
A new request. This status is usually skipped: when a request is created with script tools, it goes directly to assignment-approved. (McM)
Requests in this status are awaiting review and assignment by the CompOps L2s. A request is moved to rejected if there is a problem with it; otherwise it is assigned. (McM, Physics groups)
Assigned requests have been reviewed and modified by the CompOps L2s, who have provided them with an appropriate site whitelist, acquisition era, processing string and other attributes. Requests in this state are processed by MSTransferor, which checks whether any input data needs to be placed and then moves the request to the staging status. (MSTransferor)
Workflows in this status are in the process of transferring input data. Requests in this state are processed by MSMonitor, which checks the completion of the input data placement and, if it is sufficient, advances the workflow to the staged status. Requests remain in this state until their input data placement completion reaches the expected fraction defined in the workflow campaign. (MSMonitor)
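The threshold rule above can be illustrated with a minimal sketch. It is not MSMonitor code: the function name and both parameters are hypothetical, and it assumes the placement completion and the campaign threshold are both expressed as fractions between 0 and 1.

```python
def next_status_after_staging(placed_fraction: float, required_fraction: float) -> str:
    """Return the status a staging workflow would hold after one check.

    Both arguments are hypothetical inputs: the fraction of input data
    already placed, and the fraction the campaign expects.
    """
    if not 0.0 <= required_fraction <= 1.0:
        raise ValueError("required_fraction must be between 0 and 1")
    # The workflow stays in staging until the placed fraction reaches
    # the campaign threshold; only then does it advance to staged.
    return "staged" if placed_fraction >= required_fraction else "staging"
```

For example, with a campaign threshold of 0.9, a workflow whose input is 50% placed stays in staging, while one at 95% would be advanced to staged.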
Workflows in this status have completed all the input data placement and are ready to be processed by the global WorkQueue, which creates the first level of splitting of work. (WorkQueue)
Acquired requests have been split by the global WorkQueue into work elements, but no work element has been injected into the Local Workqueue of any WMAgent yet, so the request is not considered as running. (ReqMgr2)
Refactored on 10/Nov/2021: Requests are set to the running-open state when at least one WQE has been pulled by an agent (thus, in state Acquired or beyond). It is also required that other WQEs are still waiting in the Global Workqueue (in Available, Negotiating or Acquired state, thus not yet in WMBS).
The old behavior of running-open required at least one work element to be injected into the Local Workqueue (with or without jobs created). The status is called running-open for historical reasons: when DBS supported open blocks, data could still be added to the request. This definition is no longer valid, since DBS3 only allows closed blocks. (ReqMgr2, via GlobalWQ elements' status checking)
Requests will be set to running-closed state when all of the WQEs are in at least Running state, thus everything has been injected into WMBS and there are no workqueue elements available in GQ.
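The running-open and running-closed rules can be sketched as a small function over the states of a request's work queue elements. This is only an illustration of the rules as described here, not the ReqMgr2 implementation. The linear ordering of WQE states, the constant names and the function name are assumptions, and terminal states such as failed or canceled elements are deliberately left out (the sketch raises `ValueError` for any state it does not know).

```python
from typing import Iterable, Optional

# Assumed progression of work queue element (WQE) states, earliest first.
WQE_ORDER = ["Available", "Negotiating", "Acquired", "Running", "Done"]
ACQUIRED = WQE_ORDER.index("Acquired")
RUNNING = WQE_ORDER.index("Running")


def running_status(wqe_states: Iterable[str]) -> Optional[str]:
    """Return 'running-open', 'running-closed', or None if neither applies."""
    ranks = [WQE_ORDER.index(state) for state in wqe_states]
    if not ranks:
        return None
    # running-closed: every element is in at least Running state,
    # so nothing is left waiting in the global queue.
    if all(rank >= RUNNING for rank in ranks):
        return "running-closed"
    # running-open: at least one element pulled by an agent (Acquired or
    # beyond) while some element has not yet reached Running.
    pulled = any(rank >= ACQUIRED for rank in ranks)
    waiting = any(rank <= ACQUIRED for rank in ranks)
    if pulled and waiting:
        return "running-open"
    return None
```

A request whose elements are all still Available or Negotiating matches neither rule, so the function returns `None` and the request would keep its current status.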
This status is set by a user and kills all remaining work. The workflow will still move to completed status. (Ops, Unified)
A request is marked as completed after all work elements are done, which means that the WMAgent(s) have processed all the jobs generated by each one of them. This includes not only the top level task, but also the auxiliary ones like log collection and cleanup of unmerged data. A completed request is checked by CompOps people to verify whether it succeeded or failed. When the output of the request is considered satisfactory, the request is moved to closed-out status; otherwise it is moved to rejected. (ReqMgr2, via GlobalWQ elements' status checking)
Note that a request in completed is not guaranteed to have all its output data registered in DBS and/or PhEDEx. Although this is usually taken for granted, there are failure cases in which it does not happen automatically.
The closed-out status indicates that the output has been reviewed and is ready to be announced back to the requestors. (Unified)
An announced request has been announced to the requestors using the usual channels and can be archived. (Unified, Ops)
After a request is announced, WMAgent cleans all the monitoring information for that workflow. MSRuleCleaner also cleans up the input data placement, if the data no longer needs to remain locked, and removes the transient output Rucio rules. Once these steps have succeeded, MSRuleCleaner advances the request to normal-archived. (MSRuleCleaner, via WMStats job information and archive delay config parameters)
A request is moved to rejected when it is considered invalid at assignment or when the produced output is not satisfactory. (Unified, Ops)
After a request is rejected, WMAgent cleans all the monitoring information for that workflow. MSRuleCleaner also cleans up the input data placement, if the data no longer needs to remain locked, and removes the transient output Rucio rules. Once these steps have succeeded, MSRuleCleaner advances the request to rejected-archived. (MSRuleCleaner, via WMStats job information and archive delay config parameters)
A failed request has had a failure in one of the work elements, or it didn't produce any. Such requests can be re-evaluated and reassigned to run again, or moved to the rejected state if unrecoverable. (GlobalQueue)
If there is an unrecoverable problem with a request after it has been acquired, it can be moved to the aborted state. This triggers an internal action to kill all current jobs and run only auxiliary tasks like unmerged data cleanup and log collection. After all these actions are completed, the request is moved to aborted-completed. (Ops)
A request is marked as aborted-completed after all left-over jobs of an aborted request have been processed. A request in this state has been cleaned up from the WMAgents and the global WorkQueue and is ready to be archived. (ReqMgr2, via GlobalWQ elements' status checking)
After a request is aborted-completed and all its monitoring information is cleaned up, MSRuleCleaner cleans up the input data placement, if the data no longer needs to remain locked, and removes the transient output Rucio rules. Once these steps have succeeded, MSRuleCleaner advances the request to aborted-archived. (MSRuleCleaner, via WMStats job information and archive delay config parameters)
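The status descriptions above can be read as a state machine. The following sketch collects the transitions that this page states or implies into a lookup table. It is an interpretation of the text, not the table used by ReqMgr2: the names `TRANSITIONS` and `can_move` are hypothetical, the abort transitions assume that "after it has been acquired" covers acquired, running-open and running-closed, and two cases are left out because the page does not name them (the user-set status that kills remaining work, and the target status when a failed request is reassigned).

```python
# Transitions read from the status descriptions on this page (partial).
TRANSITIONS = {
    "new": {"assignment-approved"},
    "assignment-approved": {"assigned", "rejected"},
    "assigned": {"staging"},
    "staging": {"staged"},
    "staged": {"acquired"},
    # Abort is assumed to be possible from acquired onwards.
    "acquired": {"running-open", "aborted"},
    "running-open": {"running-closed", "aborted"},
    "running-closed": {"completed", "aborted"},
    "completed": {"closed-out", "rejected"},
    "closed-out": {"announced"},
    "announced": {"normal-archived"},
    "rejected": {"rejected-archived"},
    "failed": {"rejected"},
    "aborted": {"aborted-completed"},
    "aborted-completed": {"aborted-archived"},
}


def can_move(current: str, target: str) -> bool:
    """Return True if this sketch lists a transition from current to target."""
    # Archived statuses have no entry, so they allow no further transition.
    return target in TRANSITIONS.get(current, set())
```

With this table, `can_move("completed", "closed-out")` is true, while `can_move("announced", "rejected")` is false, which matches the description that an announced request can only be archived.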
Common issues in different states, and their solutions:
- Issue: Too many workflows stuck in assigned state
Problem: The team is not set properly. A block is pulled by an agent only if the team is set properly. The teams are: mc, mc_highprio, repro_lowprio, repro_highprio, step0, hlt, relval. Solution: Send an elog about the problem; the workflow has to be resubmitted.
Problem: The site is not assigned properly. Some workflows need to be submitted to specific places (e.g. ACDCs). Solution: There is no single solution to this; you have to check each case. If the site where the block is available is not in the whitelist, send an elog about it.
Problem: The whitelist only includes the _Disk endpoint of a site. Solution: For sites where disk and tape are separated, send an elog about this and recommend sending the request to both T1_XX_XXXX and T1_XX_XXXX_Disk.
- Issue: Too many workflows acquired and not moving in an agent
Problem: