WMAgent Refactoring Implementation
- Need a new work-unit table to keep track of which lumis/events are successfully processed (not considering optimization; minimal changes only).
wmbs_workunit

CREATE TABLE wmbs_workunit (
    id INTEGER PRIMARY KEY AUTO_INCREMENT,
    taskid INTEGER NOT NULL,
    fileid INTEGER NOT NULL, -- fake file for MC
    run INTEGER NOT NULL,
    lumi INTEGER NOT NULL,
    firstevent INTEGER NOT NULL,
    lastevent INTEGER NOT NULL,
    status INT(1) DEFAULT 0,
    FOREIGN KEY (taskid)
        REFERENCES wmbs_workflow(id) ON DELETE CASCADE)
fileid, run, and lumi can be replaced by a single id if we add a unique id to the wmbs_file_runlumi_map table.
EWV: I think we would need 2-3 other fields here too: a retry count for the work unit, how many work units ended up in the last job that tried this work unit (remember we want to try a work unit by itself before giving up completely), and perhaps a timestamp. The timestamp might not be necessary if we have a timeout on the jobs themselves.
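EWV's extra fields could be folded into the schema roughly as below. This is only a sketch: the column names `retry_count`, `units_in_last_job`, and `last_update` are invented for illustration, and SQLite syntax (`AUTOINCREMENT`) stands in for MySQL's `AUTO_INCREMENT`.

```python
import sqlite3

# In-memory SQLite sketch of the proposed wmbs_workunit table, extended with
# the retry-count, units-in-last-job, and timestamp columns suggested above.
# These three column names are hypothetical, not existing WMBS columns.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE wmbs_workunit (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        taskid INTEGER NOT NULL,
        fileid INTEGER NOT NULL,        -- fake file for MC
        run INTEGER NOT NULL,
        lumi INTEGER NOT NULL,
        firstevent INTEGER NOT NULL,
        lastevent INTEGER NOT NULL,
        status INTEGER DEFAULT 0,       -- 0 = not attempted
        retry_count INTEGER DEFAULT 0,  -- retries of this work unit so far
        units_in_last_job INTEGER,      -- units packed into the last job
        last_update INTEGER             -- unix timestamp, optional
    )""")
conn.execute(
    "INSERT INTO wmbs_workunit (taskid, fileid, run, lumi, firstevent, lastevent)"
    " VALUES (1, 10, 100, 1, 0, 999)")
row = conn.execute(
    "SELECT status, retry_count FROM wmbs_workunit WHERE id = 1").fetchone()
print(row)  # (0, 0): defaults applied
```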
If one lumi can be spread across multiple files, we need an association table between the work-unit and wmbs_file_runlumi_map tables.
- The table above needs to be populated when the fileset and subscription are created (before job splitting happens).
- The wmbs_job_mask table should be modified (or replaced) so that it contains the relationship between work units and job ids.
CREATE TABLE wmbs_job_workunit_assoc (
    jobid INTEGER NOT NULL,
    workunitid INTEGER NOT NULL,
    FOREIGN KEY (jobid)
        REFERENCES wmbs_job(id) ON DELETE CASCADE,
    FOREIGN KEY (workunitid)
        REFERENCES wmbs_workunit(id) ON DELETE CASCADE)
CREATE TABLE file_run_lumi_map (
    fileid INTEGER NOT NULL,
    run INTEGER NOT NULL,
    lumi INTEGER NOT NULL,
    num_events INTEGER NOT NULL,
    PRIMARY KEY (fileid, run, lumi),
    FOREIGN KEY (fileid)
        REFERENCES wmbs_file_details(id) ON DELETE CASCADE)
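The population step noted above (at fileset/subscription creation, one work unit per (file, run, lumi) from DBS) could look roughly like the sketch below. It runs against an in-memory SQLite database, and the helper name `populate_workunits` is an assumption, not existing WMCore code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE file_run_lumi_map (
        fileid INTEGER NOT NULL,
        run INTEGER NOT NULL,
        lumi INTEGER NOT NULL,
        num_events INTEGER NOT NULL,
        PRIMARY KEY (fileid, run, lumi));
    CREATE TABLE wmbs_workunit (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        taskid INTEGER NOT NULL,
        fileid INTEGER NOT NULL,
        run INTEGER NOT NULL,
        lumi INTEGER NOT NULL,
        firstevent INTEGER NOT NULL,
        lastevent INTEGER NOT NULL,
        status INTEGER DEFAULT 0);
""")

def populate_workunits(conn, taskid):
    """Hypothetical helper: create one work unit per (file, run, lumi).

    Intended to run when the fileset/subscription is created, before any
    job splitting happens."""
    conn.execute("""
        INSERT INTO wmbs_workunit
            (taskid, fileid, run, lumi, firstevent, lastevent)
        SELECT ?, fileid, run, lumi, 0, num_events - 1
        FROM file_run_lumi_map""", (taskid,))

# Two lumis retrieved from DBS for one file.
conn.executemany(
    "INSERT INTO file_run_lumi_map VALUES (?, ?, ?, ?)",
    [(10, 100, 1, 500), (10, 100, 2, 300)])
populate_workunits(conn, taskid=1)
count = conn.execute("SELECT COUNT(*) FROM wmbs_workunit").fetchone()[0]
print(count)  # 2
```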
- The wmbs_job table contains 4 states (success, failure, partial_success, not_attempted). Not sure this is needed, but it may be required for the retry logic (e.g. in the case of a total failure, don't reshuffle the work units).
- We might need an association between output files and wmbs_workunit.
- No change: we could keep the current estimate of the job using the mask and EventsPerJob and EventsPerLumi.
- There need to be more changes to support the various StepChain workflows (out of scope for this implementation).
- When file info is retrieved from DBS, populate the file_run_lumi_map table above.
- The splitting algorithm can be done in 2 steps: first split into work units, then create the jobs over the work units. This way the retry logic can reuse the same logic in the second step, by selecting only the work units that failed. Job splitting needs to happen multiple times, not just once initially over the input; to make this simpler, splitting happens over wmbs_workunit rather than over files.
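A minimal sketch of the two-step idea, in plain Python rather than real WMBS code (the `WorkUnit` class, status codes, and function names are all invented for illustration): step one selects the work units that still need processing, step two packs them into jobs, so the initial pass and retry passes share the same code path.

```python
from dataclasses import dataclass

# Illustrative status codes, not the real WMBS values.
PENDING, SUCCESS, FAILED = 0, 1, 2

@dataclass
class WorkUnit:
    id: int
    status: int = PENDING

def select_splittable(workunits):
    """Step 1: pick work units that still need processing.

    On the first pass everything is PENDING; on a retry pass only the
    FAILED units are selected, so both passes use the same logic."""
    return [wu for wu in workunits if wu.status in (PENDING, FAILED)]

def make_jobs(workunits, units_per_job):
    """Step 2: pack the selected work units into jobs."""
    selected = select_splittable(workunits)
    return [selected[i:i + units_per_job]
            for i in range(0, len(selected), units_per_job)]

units = [WorkUnit(i) for i in range(5)]
first_pass = make_jobs(units, units_per_job=2)
print([len(j) for j in first_pass])  # [2, 2, 1]

# After a run, only the failed unit is re-split:
for wu in units:
    wu.status = SUCCESS
units[1].status = FAILED
retry_pass = make_jobs(units, units_per_job=2)
print([[wu.id for wu in j] for j in retry_pass])  # [[1]]
```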
- We still need to support the old way of job splitting, since most DBS input files don't contain event-count information by file, run, and lumi.
- Don't create the merge jobs unless all work units for a given input file in the initial job, within the same lumi, are successful. If one work unit fails, fail the whole set of work units in that job that belong to the same lumi.
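The merge-gating rule could be checked with a query along these lines (SQLite sketch; the status code `1 = success` and the helper name `mergeable` are assumptions):

```python
import sqlite3

SUCCESS = 1  # assumed status code

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE wmbs_workunit (
        id INTEGER PRIMARY KEY,
        fileid INTEGER, run INTEGER, lumi INTEGER,
        status INTEGER DEFAULT 0)""")
conn.executemany(
    "INSERT INTO wmbs_workunit VALUES (?, ?, ?, ?, ?)",
    [(1, 10, 100, 1, SUCCESS),
     (2, 10, 100, 1, 0),        # same file/lumi, not yet successful
     (3, 10, 100, 2, SUCCESS)])

def mergeable(conn, fileid, lumi):
    """True only if every work unit for this file+lumi succeeded,
    i.e. the merge job may be created."""
    pending = conn.execute(
        "SELECT COUNT(*) FROM wmbs_workunit"
        " WHERE fileid = ? AND lumi = ? AND status != ?",
        (fileid, lumi, SUCCESS)).fetchone()[0]
    return pending == 0

print(mergeable(conn, 10, 1))  # False: unit 2 has not succeeded
print(mergeable(conn, 10, 2))  # True
```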
FWJR changes (precondition: the CMSSW FWJR should contain information about which lumis/events failed or succeeded)
- The WMAgent FWJR needs to contain the list of work units and their success/failure information.
- Does CMSSW know about work units? If it just returns which lumis/events failed, we need to consider sub-range events for work units from the beginning.
- The JobAccountant needs to update the wmbs_workunit status, as well as the association between work units and output files.
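The accounting step might reduce to something like the sketch below. The FWJR is modelled as plain lists of work-unit ids, the association table name `workunit_outfile_assoc` is hypothetical, and the status codes are assumed.

```python
import sqlite3

SUCCESS, FAILED = 1, 2  # assumed status codes

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE wmbs_workunit (
        id INTEGER PRIMARY KEY, status INTEGER DEFAULT 0);
    CREATE TABLE workunit_outfile_assoc (
        workunitid INTEGER NOT NULL, outfileid INTEGER NOT NULL);
""")
conn.executemany("INSERT INTO wmbs_workunit (id) VALUES (?)",
                 [(1,), (2,), (3,)])

def account_fwjr(conn, succeeded, failed, outfileid):
    """Apply one FWJR: update work-unit statuses and record which
    output file each successful work unit contributed to."""
    conn.executemany("UPDATE wmbs_workunit SET status = ? WHERE id = ?",
                     [(SUCCESS, wid) for wid in succeeded] +
                     [(FAILED, wid) for wid in failed])
    conn.executemany(
        "INSERT INTO workunit_outfile_assoc VALUES (?, ?)",
        [(wid, outfileid) for wid in succeeded])

account_fwjr(conn, succeeded=[1, 3], failed=[2], outfileid=77)
statuses = dict(conn.execute("SELECT id, status FROM wmbs_workunit"))
print(statuses)  # {1: 1, 2: 2, 3: 1}
```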
- If sub-event support is needed, the accounting needs to break a work unit into sub-work-units and mark the correct status on each.
- Is job status meaningful, or does only work-unit status matter (in that case we have to track the status by work unit rather than by job)? The problem is that we might have to support both anyway.
- How do we define retry rules (by work unit or by job)?
- We might still be able to track jobs, which would contain the work-unit information.