This repository has been archived by the owner on Jan 30, 2024. It is now read-only.

Error codes

Paul Nilsson edited this page Mar 5, 2021 · 17 revisions

When it detects a fatal problem, the Pilot assigns an error code and informs the server. In addition to the numerical code, it reports the meaning of the error and more detailed error diagnostics. The current range of error codes is listed in the [Pilot 2 wiki](https://twiki.cern.ch/twiki/bin/view/PanDA/Pilot2ErrorCodes).
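As a rough illustration of the three pieces of information described above (code, meaning, diagnostics), here is a minimal Python sketch. The names `ERROR_CODES`, `ErrorReport` and `make_report` are hypothetical and are not taken from the Pilot source; the dictionary holds only a small subset of the codes listed on this page, and the diagnostics string is a made-up example.

```python
from dataclasses import dataclass

# Hypothetical subset of the error codes on this page: code -> (acronym, meaning).
ERROR_CODES = {
    1008: ("GENERALERROR", "General pilot error, consult batch log"),
    1099: ("STAGEINFAILED", "Failed to stage-in file"),
    1137: ("STAGEOUTFAILED", "Failed to stage-out file"),
}


@dataclass
class ErrorReport:
    """Illustrative container for what is reported to the server."""
    code: int
    acronym: str
    meaning: str
    diagnostics: str


def make_report(code: int, diagnostics: str = "") -> ErrorReport:
    """Build a report; codes missing from the subset get a generic meaning."""
    acronym, meaning = ERROR_CODES.get(code, ("UNKNOWN", "Unknown error code"))
    return ErrorReport(code, acronym, meaning, diagnostics)


# Made-up diagnostics text, for illustration only.
report = make_report(1099, "copytool exited with a non-zero status")
print(report.code, report.acronym, report.meaning)
```

Keeping the short meaning separate from the free-form diagnostics mirrors the split described above: the meaning is fixed per code, while the diagnostics vary from failure to failure.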

| Error code | Acronym | Meaning | Notes |
| --- | --- | --- | --- |
| 1008 | GENERALERROR | General pilot error, consult batch log | |
| 1098 | NOLOCALSPACE | Not enough local space | Set e.g. by job monitoring, and also if a copytool command fails and "No space left on device" is found in the command output |
| 1099 | STAGEINFAILED | Failed to stage-in file | |
| 1100 | REPLICANOTFOUND | The rucio API function `list_replicas()` did not return any replicas. Check log for details. | |
| 1103 | NOSUCHFILE | No such file or directory | Error thrown by the `open_file()` function. Also set if the copytool fails and "No such file or directory" is found in the command output |
| 1104 | USERDIRTOOLARGE | User work directory too large | Set if the user work directory exceeds the maximum allowed limit, as defined by `schedconfig.maxwdir` (default: 14 GB) |
| 1106 | STDOUTTOOBIG | Payload log or stdout file too big | Set if stdout exceeds the maximum allowed limit of 2 GB, which is set in the Pilot's default config file |
| 1110 | SETUPFAILURE | Failed during payload setup | |
| 1115 | NFSSQLITE | NFS SQLite locking problems | The Pilot identifies this error by grepping for the strings "prepare 5 database is locked" and "Error SQLiteStatement" in the payload stdout |
| 1116 | QUEUEDATA | Pilot could not download queuedata | |
| 1117 | QUEUEDATANOTOK | Pilot found non-valid queuedata | |
| 1124 | OUTPUTFILETOOLARGE | Output file too large | |
| 1133 | NOSTORAGE | Fetching default storage failed: no activity related storage defined | |
| 1137 | STAGEOUTFAILED | Failed to stage-out file | |
| 1141 | PUTMD5MISMATCH | md5sum mismatch on output file | Error acronym should be renamed |
| 1143 | CHMODTRF | Failed to chmod trf | After downloading a trf, the Pilot tries to do a `chmod 0755` on it. If this fails, the Pilot sets this error |
| 1144 | PANDAKILL | This job was killed by panda server | |
| 1145 | GETMD5MISMATCH | md5sum mismatch on input file | Error acronym should be renamed |
| 1149 | TRFDOWNLOADFAILURE | Transform could not be downloaded | |
| 1150 | LOOPINGJOB | Looping job killed by pilot | The Pilot kills the payload (or stops stage-in/out) if there is no activity within the allowed time, i.e. no files are touched in the work directory or the file transfer is stuck. The default looping job time limit is `12*3600` s for production jobs and `3*3600` s for user analysis jobs. The limit can be overridden in the Pilot's config file (or set by the user using the `maxCPUCount` variable) |
| 1151 | STAGEINTIMEOUT | File transfer timed out during stage-in | Currently only identified for rucio file transfer (unless "Operation timed out" is in stderr) |
| 1152 | STAGEOUTTIMEOUT | File transfer timed out during stage-out | Currently only identified for rucio file transfer (unless "Operation timed out" is in stderr) |
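
Several of the notes above describe errors that are recognised by looking for a marker string in a command's output, and a default looping job limit that depends on the job type. The following Python sketch shows one way such checks could look. It is not the Pilot's actual implementation: the function names, the `stage` parameter, the order of the checks and the use of a single output string (rather than separate stdout and stderr) are all assumptions made for illustration. Only the marker strings, the error codes and the default limits come from this page.

```python
from typing import Optional

# Error codes as listed on this page.
NOLOCALSPACE = 1098
NOSUCHFILE = 1103
STAGEINTIMEOUT = 1151
STAGEOUTTIMEOUT = 1152


def classify_copytool_output(output: str, stage: str) -> Optional[int]:
    """Map known marker strings in copytool output to an error code.

    `stage` is a hypothetical parameter: "in" for stage-in, anything
    else is treated as stage-out. Returns None if no marker is found.
    """
    if "No space left on device" in output:
        return NOLOCALSPACE
    if "No such file or directory" in output:
        return NOSUCHFILE
    if "Operation timed out" in output:
        return STAGEINTIMEOUT if stage == "in" else STAGEOUTTIMEOUT
    return None


def default_looping_limit(is_analysis_job: bool) -> int:
    """Default looping job limit in seconds, per the LOOPINGJOB note."""
    return 3 * 3600 if is_analysis_job else 12 * 3600


print(classify_copytool_output("cp: write error: No space left on device", "in"))
print(default_looping_limit(is_analysis_job=True))
```

A caller would fall back to a generic code such as 1099 (STAGEINFAILED) or 1137 (STAGEOUTFAILED) when the function returns `None`; that fallback is likewise an assumption and is not stated on this page.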