
trouble shooting

ticoann edited this page Oct 18, 2016 · 36 revisions

WMAgent troubleshooting

WorkQueue is not acquiring requests: they reach acquired status but do not move to running-open status.

  1. Check the WorkQueueManager component log.
  2. If the log shows a JSON parse error like the one below, the workqueue view is corrupted:
    ERROR:WorkQueueManagerWMBSFileFeeder:Error in wmbs inject loop: unterminated array starting at position 0:

    1. Shut down the agent, then shut down the couch server.
    2. Remove the view from /data1/database/.workqueue_design
    3. Start the couch server and rebuild the view by querying one of its views:
      curl http://localhost:5984/workqueue/_design/WorkQueue/_view/availableByPriority

    4. Once the view rebuild has finished, start the agent.

Note: if this issue is happening on all the agents, the problem is probably in the central workqueue rather than in the agents themselves.
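
The log error above is a JSON parse failure, so one way to check a view response before restarting the agent is to try parsing it. The sketch below is a minimal illustration, assuming a view response is a JSON object with a `rows` list; the helper name `view_response_ok` is hypothetical and not part of WMCore.

```python
import json


def view_response_ok(body):
    """Return True if body parses as JSON and looks like a view result."""
    try:
        doc = json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    # A view result is expected to be an object carrying a "rows" list.
    return isinstance(doc, dict) and isinstance(doc.get("rows"), list)
```

Feeding it the body returned by the curl command in step 3 gives False for a truncated or non-JSON reply, which is the symptom WorkQueueManager reports.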

Datasets were produced with None as either the AcquisitionEra or the ProcessingString.

  1. In these cases the blocks and files will keep failing in DBS3Upload, because they do not pass the Lexicon validation in DBS. The fix is to close those blocks and mark them (and their files) as injected in the dbsbuffer tables.
    1. Shut down PhEDExInjector and DBS3Upload.
    2. Gather a list of workflow names and their bad output dataset names:
      SELECT DISTINCT dbsbuffer_workflow.name, dbsbuffer_dataset.path
        FROM dbsbuffer_dataset
        INNER JOIN dbsbuffer_dataset_subscription ON dbsbuffer_dataset.id = dbsbuffer_dataset_subscription.dataset_id
        INNER JOIN dbsbuffer_block ON dbsbuffer_block.dataset_id = dbsbuffer_dataset_subscription.dataset_id
        INNER JOIN dbsbuffer_file ON dbsbuffer_file.block_id = dbsbuffer_block.id
        INNER JOIN dbsbuffer_workflow ON dbsbuffer_workflow.id = dbsbuffer_file.workflow
        WHERE dbsbuffer_block.blockname LIKE '/%/None-%' AND dbsbuffer_block.status != 'Closed';
    3. Find the blocks and files that need manual intervention (just to keep a record):
      SELECT blockname FROM dbsbuffer_block WHERE blockname LIKE '/%/None-%' AND status!='Closed';
      SELECT lfn FROM dbsbuffer_file WHERE lfn LIKE '/store/%/None/%' AND status!='InDBS';
    4. Close them and mark them as injected:
      UPDATE dbsbuffer_block SET status='Closed' WHERE blockname LIKE '/%/None-%' AND status!='Closed';
      UPDATE dbsbuffer_file SET status='InDBS', in_phedex='1' WHERE lfn LIKE '/store/%/None/%' AND status!='InDBS';
    5. Record them in the workflow team elog.
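
The record-then-update part of this procedure (steps 3 and 4) can be rehearsed on a throwaway database before touching the agent. The sketch below uses an in-memory SQLite database with toy two-table stand-ins for dbsbuffer_block and dbsbuffer_file; the reduced schema, the sample block names and LFNs, the 'Open' and 'NOTUPLOADED' status values and the function names are all assumptions for illustration, and the real agent database is not SQLite.

```python
import sqlite3

# Toy stand-ins for the two dbsbuffer tables; the real schema has more columns.
TOY_SCHEMA = """
CREATE TABLE dbsbuffer_block (id INTEGER PRIMARY KEY, blockname TEXT, status TEXT);
CREATE TABLE dbsbuffer_file (id INTEGER PRIMARY KEY, lfn TEXT, status TEXT,
                             in_phedex INTEGER, block_id INTEGER);
"""

# Same filters as the queries in steps 3 and 4.
BLOCK_FILTER = "blockname LIKE '/%/None-%' AND status != 'Closed'"
FILE_FILTER = "lfn LIKE '/store/%/None/%' AND status != 'InDBS'"


def close_none_blocks(conn):
    """Record the affected blocks and files, then update them in one transaction."""
    blocks = [row[0] for row in conn.execute(
        "SELECT blockname FROM dbsbuffer_block WHERE " + BLOCK_FILTER)]
    files = [row[0] for row in conn.execute(
        "SELECT lfn FROM dbsbuffer_file WHERE " + FILE_FILTER)]
    with conn:  # commits on success, rolls back if either UPDATE raises
        conn.execute(
            "UPDATE dbsbuffer_block SET status = 'Closed' WHERE " + BLOCK_FILTER)
        conn.execute(
            "UPDATE dbsbuffer_file SET status = 'InDBS', in_phedex = 1 WHERE "
            + FILE_FILTER)
    return blocks, files


def make_toy_db():
    """Build an in-memory database with one bad and one good block/file."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(TOY_SCHEMA)
    conn.executemany("INSERT INTO dbsbuffer_block VALUES (?, ?, ?)", [
        (1, "/Prim/None-Proc-v1/GEN-SIM#aaa", "Open"),
        (2, "/Prim/Era-Proc-v1/GEN-SIM#bbb", "Open"),
    ])
    conn.executemany("INSERT INTO dbsbuffer_file VALUES (?, ?, ?, ?, ?)", [
        (1, "/store/mc/None/Prim/GEN-SIM/Proc-v1/0000/a.root", "NOTUPLOADED", 0, 1),
        (2, "/store/mc/Era/Prim/GEN-SIM/Proc-v1/0000/b.root", "NOTUPLOADED", 0, 2),
    ])
    conn.commit()
    return conn
```

Calling `close_none_blocks(make_toy_db())` returns the names that were touched, which is the list to keep for the elog entry. Running it a second time returns two empty lists, because the filters exclude rows that are already Closed or InDBS.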

  2. If these None samples have children (which are usually correctly named), the children will fail DBS injection too, because their parent information is missing. They also have to be marked as injected, using the following procedure:
    1. Find the exact block name in the DBS3Upload logs.
    2. Get the list of files that belong to that block (replace BLOCKNAME in the query below):
      SELECT lfn FROM dbsbuffer_file WHERE block_id=(SELECT id FROM dbsbuffer_block WHERE blockname='BLOCKNAME' AND status!='Closed') AND status!='InDBS';
    3. Mark all its files as injected in DBS and PhEDEx (replace BLOCKNAME in the query below):
      UPDATE dbsbuffer_file SET status='InDBS', in_phedex='1' WHERE block_id=(SELECT id FROM dbsbuffer_block WHERE blockname='BLOCKNAME' AND status!='Closed') AND status!='InDBS';
    4. Finally, mark the same block as Closed (replace BLOCKNAME in the query below):
      UPDATE dbsbuffer_block SET status='Closed' WHERE blockname='BLOCKNAME' AND status!='Closed';
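
The order of the per-block queries matters: the file queries look the block up with status!='Closed', so the files have to be updated before the block is closed. The sketch below shows that ordering with the block name passed as a bind parameter instead of being pasted into the SQL by hand. It is an illustration only: the function name `mark_block_injected` is hypothetical, and it is written against Python's sqlite3 module with a reduced toy schema, not against the agent's real database.

```python
import sqlite3


def mark_block_injected(conn, blockname):
    """Mark the files of one not-yet-closed block as injected, then close it.

    Returns the LFNs that were updated, so they can be kept for the record.
    """
    file_filter = ("block_id = (SELECT id FROM dbsbuffer_block "
                   "WHERE blockname = ? AND status != 'Closed') "
                   "AND status != 'InDBS'")
    lfns = [row[0] for row in conn.execute(
        "SELECT lfn FROM dbsbuffer_file WHERE " + file_filter, (blockname,))]
    with conn:  # one transaction for both updates
        # Files first: the subquery only finds the block while it is not Closed.
        conn.execute(
            "UPDATE dbsbuffer_file SET status = 'InDBS', in_phedex = 1 WHERE "
            + file_filter, (blockname,))
        conn.execute(
            "UPDATE dbsbuffer_block SET status = 'Closed' "
            "WHERE blockname = ? AND status != 'Closed'", (blockname,))
    return lfns
```

If the block were closed first, the subquery would return no row and the file update would silently change nothing.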

JobAccountant crashes with a duplicate entry error (this seems to be caused by a UUID collision)

1. Use Alan's script to find the framework job report (fwjr) that contains the duplicate file:
 https://github.com/amaltaro/ProductionTools/blob/master/removeDupJobAccountant.py
cmst1@vocms0304: curl https://raw.githubusercontent.com/amaltaro/ProductionTools/master/removeDupJobAccountant.py > ~/removeDupJobAccountant.py
cmst1@vocms0304: source /data/srv/wmagent/current/apps/wmagent/etc/profile.d/init.sh
cmst1@vocms0304: cd /data/srv/wmagent/current

cmst1@vocms0304:/data/srv/wmagent/current $ python ~/removeDupJobAccountant.py 
Found 300 pickle files to open
with a total of 200 output files
Retrieved 764442 lfns from wmbs_file_details

Found 1 duplicate files: ['/store/unmerged/RunIISummer15GSBackfill/InclusiveBtoJpsitoMuMu_JpsiPt8_TuneCUEP8M1_13TeV-pythia8-evtgen/GEN-SIM/BACKFILL-v4/90114/B205B040-DA76-E611-96E1-10983627C3C1.root']
The bad pkl files are:
/data/srv/wmagent/v1.0.19.patch1/install/wmagent/JobCreator/JobCache/vlimant_BPH-RunIISummer15GS-Backfill-00030_00212_v0__160907_112626_808/Production/JobCollection_74138_0/job_750147/Report.0.pkl
Remove them, restart the component and be happy!

2. Remove the bad pkl file reported by the script. Note that removing it will cause the job to fail.
rm /data/srv/wmagent/v1.0.19.patch1/install/wmagent/JobCreator/JobCache/vlimant_BPH-RunIISummer15GS-Backfill-00030_00212_v0__160907_112626_808/Production/JobCollection_74138_0/job_750147/Report.0.pkl
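
Judging from its output, the script compares the output LFNs listed in the job report pickles with the LFNs already stored in wmbs_file_details. The sketch below illustrates that comparison only. It is not the code of removeDupJobAccountant.py, and the function name and the input shapes (a dict of pkl path to LFN list, plus a collection of known LFNs) are assumptions.

```python
def find_duplicates(report_lfns, known_lfns):
    """Find report LFNs that already exist in the database.

    report_lfns: dict mapping a Report pkl path to the LFNs it declares.
    known_lfns: LFNs already present in wmbs_file_details.
    Returns (duplicate LFNs, pkl paths that contain at least one of them).
    """
    known = set(known_lfns)  # set membership keeps each lookup cheap
    dup_lfns, bad_pkls = [], []
    for pkl_path in sorted(report_lfns):
        dups = [lfn for lfn in report_lfns[pkl_path] if lfn in known]
        if dups:
            dup_lfns.extend(dups)
            bad_pkls.append(pkl_path)
    return dup_lfns, bad_pkls
```

Every pkl path returned in the second list corresponds to a report that would trigger the duplicate entry error when JobAccountant tries to insert its files.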

Error message example:

ERROR:BaseWorkerThread:Error in event loop (2): <WMComponent.JobAccountant.JobAccountantPoller.JobAccountantPoller instance at 0x26d4ab8> AccountantWorkerException Message: Error while adding files to WMBS! (IntegrityError) (1062, "Duplicate entry '/store/unmerged/logs/prod/2014/6/16/boudoul_RVCMSSW_6_2_0_SLHC14' for key 'lfn'") 'INSERT INTO wmbs_file_details (lfn, filesize, events,\n
first_event, merged)\n VALUES (%s, %s, %s, %s,\n %s)' [('/store/unmerged/logs/prod/2014/6/16/boudoul_RVCMSSW_6_2_0_SLHC14SingleElectronPt35Extended__UPG2023SHNoTaper_140615_171930_4391/SingleElectronPt35Extended_Extended2023SHCalNoTaper_GenSimFull/SingleElectronPt35Extended_Extended2023SHCalNoTaper_GenSimFullMergeFEVTDEBUGoutput/DigiFull_Extended2023SHCalNoTaper/DigiFull_Extended2023SHCalNoTaperMergeFEVTDEBUGHLToutput/RecoFull_Extended2023SHCalNoTaper/RecoFull_Extended2023SHCalNoTaperMergeDQMoutput/RecoFull_Extended2023SHCalNoTaperMergeDQMoutputEndOfRunDQMHarvestMerged/0000/1/0e6169c8-f542-11e3-b6db-003048c9c3fe-EndOfRun-Harvest-1-1-logArchive.tar.gz', 0, 0, 0, 0), ('/store/unmerged/CMSSW_6_2_0_SLHC14/RelValSingleElectronPt1000/GEN-SIM-RECO/DES23_62_V1_UPG2023SHNoTaper-v1/00000/C4C0AC11-46F5-E311-8999-0025905A60A6.root', 36213656, 50, 0, 0), ('/store/unmerged/CMSSW_6_2_0_SLHC14/RelValSingleElectronPt1000/DQM/DES23_62_V1_UPG2023SHNoTaper-v1/00000/C836D311-46F5-E311-8999-0025905A60A6.root', 6334400, 50, 0, 0), ('/store/unmerged/logs/prod/2014/6/16/boudoul_RVCMSSW_6_2_0_SLHC14SingleElectronPt1000__UPG2023SHNoTaper_140615_172119_1332/SingleElectronPt1000_Extended2023SHCalNoTaper_GenSimFull/SingleElectronPt1000_Extended2023SHCalNoTaper_GenSimFullMergeFEVTDEBUGoutput/DigiFull_Extended2023SHCalNoTaper/DigiFull_Extended2023SHCalNoTaperMergeFEVTDEBUGHLToutput/RecoFull_Extended2023SHCalNoTaper/0000/0/9be962e8-f545-11e3-b6db-003048c9c3fe-7-0-logArchive.tar.gz', 0, 0, 0, 0)] ModuleName : WMComponent.JobAccountant.AccountantWorker MethodName : handleWMBSFiles ClassInstance : None FileName : /data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/cms/wmagent/0.9.91/lib/python2.6/site-packages/WMComponent/JobAccountant/AccountantWorker.py ClassName : None LineNumber : 822 ErrorNr : 0

Traceback: File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/cms/wmagent/0.9.91/lib/python2.6/site-packages/WMComponent/JobAccountant/AccountantWorker.py", line 794, in handleWMBSFiles transaction = self.existingTransaction())

File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/cms/wmagent/0.9.91/lib/python2.6/site-packages/WMCore/WMBS/MySQL/Files/Add.py", line 40, in execute conn = conn, transaction = transaction)

File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/cms/wmagent/0.9.91/lib/python2.6/site-packages/WMCore/Database/DBCore.py", line 167, in processData returnCursor=returnCursor))

File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/cms/wmagent/0.9.91/lib/python2.6/site-packages/WMCore/Database/MySQLCore.py", line 140, in executemanybinds returnCursor)

File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/cms/wmagent/0.9.91/lib/python2.6/site-packages/WMCore/Database/DBCore.py", line 114, in executemanybinds result = connection.execute(s, b)

File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/external/py2-sqlalchemy/0.5.2-comp7/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 824, in execute return Connection.executors[c](self, object, multiparams, params)

File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/external/py2-sqlalchemy/0.5.2-comp7/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 888, in _execute_text return self.__execute_context(context)

File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/external/py2-sqlalchemy/0.5.2-comp7/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 894, in __execute_context self._cursor_executemany(context.cursor, context.statement, context.parameters, context=context)

File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/external/py2-sqlalchemy/0.5.2-comp7/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 960, in _cursor_executemany self._handle_dbapi_exception(e, statement, parameters, cursor, context)

File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/external/py2-sqlalchemy/0.5.2-comp7/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 931, in _handle_dbapi_exception raise exc.DBAPIError.instance(statement, parameters, e, connection_invalidated=is_disconnect)
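
In the example above, the entry quoted after "Duplicate entry" is only a prefix of two of the LFNs in the bind list, so the message alone does not say which file is the duplicate. The sketch below pulls the quoted entry out of such a message and lists the bind LFNs that start with it. The function name is hypothetical, and the regular expressions assume the message layout shown in the example.

```python
import re


def duplicate_candidates(message):
    """Return (entry, candidates) for a 'Duplicate entry' error message.

    entry is the value quoted in the message, or None if there is none.
    candidates are the LFNs from the bind list that start with entry.
    """
    match = re.search(r"Duplicate entry '([^']*)' for key", message)
    if match is None:
        return None, []
    entry = match.group(1)
    # Each bind tuple starts with its LFN: ('/store/...', filesize, ...)
    lfns = re.findall(r"\('(/[^']+)'", message)
    return entry, [lfn for lfn in lfns if lfn.startswith(entry)]
```

The candidate list narrows down which LFNs to look for; the script from step 1 is still what identifies the pkl file to remove.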
