This repository has been archived by the owner on Jan 30, 2024. It is now read-only.

Error codes

Paul Nilsson edited this page Mar 5, 2021 · 17 revisions

When the Pilot detects a fatal problem, it assigns an error code and informs the server. In addition to the numerical code itself, it reports the meaning of the error and more detailed error diagnostics. The current range of error codes is listed in the [Pilot 2 wiki](https://twiki.cern.ch/twiki/bin/view/PanDA/Pilot2ErrorCodes).
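As a rough illustration of the three pieces of information described above (numerical code, meaning, and diagnostics), the following sketch builds a small report object from a lookup table. The names `ERROR_TABLE`, `ErrorReport` and `build_error_report` are hypothetical and are not taken from the Pilot source; only the codes, acronyms and meanings come from this page.

```python
from dataclasses import dataclass

# Hypothetical subset of the error table on this page: code -> (acronym, meaning).
ERROR_TABLE = {
    1008: ("GENERALERROR", "General pilot error, consult batch log"),
    1099: ("STAGEINFAILED", "Failed to stage-in file"),
    1137: ("STAGEOUTFAILED", "Failed to stage-out file"),
}


@dataclass
class ErrorReport:
    """Illustrative container for what is reported to the server."""
    code: int
    acronym: str
    meaning: str
    diagnostics: str


def build_error_report(code: int, diagnostics: str = "") -> ErrorReport:
    """Look up the acronym and meaning for a code and attach the diagnostics."""
    acronym, meaning = ERROR_TABLE.get(code, ("UNKNOWN", "Unknown error code"))
    return ErrorReport(code, acronym, meaning, diagnostics)


if __name__ == "__main__":
    report = build_error_report(1099, "copytool exited with a non-zero status")
    print(f"{report.code} {report.acronym}: {report.meaning} ({report.diagnostics})")
```

The fallback for unknown codes is an assumption made for this sketch; the real Pilot may handle unlisted codes differently.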

| Error code | Acronym | Meaning | Notes |
| --- | --- | --- | --- |
| 1008 | GENERALERROR | General pilot error, consult batch log | |
| 1098 | NOLOCALSPACE | Not enough local space | Error code is set e.g. by job monitoring, and also if the copytool command fails (if "No space left on device" is found in the command output) |
| 1099 | STAGEINFAILED | Failed to stage-in file | |
| 1100 | REPLICANOTFOUND | The rucio API function list_replicas() did not return any replicas. Check log for details. | |
| 1103 | NOSUCHFILE | No such file or directory | Error thrown by the open_file() function. Also set if the copytool fails and "No such file or directory" is found in the output |
| 1104 | USERDIRTOOLARGE | User work directory too large | The error is set if the user work directory exceeds the maximum allowed limit, as defined by schedconfig.maxwdir (default: 14 GB) |
| 1106 | STDOUTTOOBIG | Payload log or stdout file too big | Set if stdout exceeds the maximum allowed limit of 2 GB, set in the Pilot's default config file |
| 1110 | SETUPFAILURE | Failed during payload setup | |
| 1115 | NFSSQLITE | NFS SQLite locking problems | Pilot identifies this error by doing a grep on the strings "prepare 5 database is locked" and "Error SQLiteStatement" in the payload stdout |
| 1116 | QUEUEDATA | Pilot could not download queuedata | |
| 1117 | QUEUEDATANOTOK | Pilot found non-valid queuedata | |
| 1124 | OUTPUTFILETOOLARGE | Output file too large | |
| 1133 | NOSTORAGE | Fetching default storage failed: no activity related storage defined | |
| 1137 | STAGEOUTFAILED | Failed to stage-out file | |
| 1141 | PUTMD5MISMATCH | md5sum mismatch on output file | Error acronym should be renamed |
| 1143 | CHMODTRF | Failed to chmod trf | After downloading a trf, the pilot tries to do a chmod 0755 on it. If this fails, the pilot will set this error |
| 1144 | PANDAKILL | This job was killed by panda server | |
| 1145 | GETMD5MISMATCH | md5sum mismatch on input file | Error acronym should be renamed |
| 1149 | TRFDOWNLOADFAILURE | Transform could not be downloaded | |
| 1150 | LOOPINGJOB | Looping job killed by pilot | The pilot will kill the payload (or stop stage-in/out) if there is no activity (i.e. no files are touched in the work directory, or the file transfer is stuck) within the allowed time. The default looping job time limit is `12*3600` s for production jobs and `3*3600` s for user analysis jobs. The limit can be overridden in the pilot's config file (or set by the user using the maxCPUCount variable) |
| 1151 | STAGEINTIMEOUT | File transfer timed out during stage-in | Currently only identified for rucio file transfer (unless "Operation timed out" is in stderr) |
| 1152 | STAGEOUTTIMEOUT | File transfer timed out during stage-out | Currently only identified for rucio file transfer (unless "Operation timed out" is in stderr) |
| 1163 | NOPROXY | Grid proxy not valid | Set if grid-proxy-info fails or if "Could not establish context" is found in the copytool command output |
| 1165 | MISSINGOUTPUTFILE | Local output file is missing | |
| 1168 | SIZETOOLARGE | Total file size too large | Before stage-in, the pilot verifies that the sum of the input file sizes does not exceed maxwdir (set in schedconfig or in the pilot config file). Any files that are to be accessed directly/remotely are excluded |
| 1171 | GETADMISMATCH | adler32 mismatch on input file | Error acronym should be renamed |
| 1172 | PUTADMISMATCH | adler32 mismatch on output file | Error acronym should be renamed |
| 1177 | NOVOMSPROXY | Voms proxy not valid | Set if arcproxy fails |
| 1180 | GETGLOBUSSYSERR | Globus system error during stage-in | Pilot identifies this error if "globus_xio:" is found in the command output |
| 1181 | PUTGLOBUSSYSERR | Globus system error during stage-out | Pilot identifies this error if "globus_xio:" is found in the command output |
| 1186 | NOSOFTWAREDIR | Software directory does not exist | |
| 1187 | NOPAYLOADMETADATA | Payload metadata does not exist | This error can happen due to a previous uncaught error that leads to missing metadata, i.e. the error label can be misleading (when discovered, the pilot is usually patched) |
| 1190 | LFNTOOLONG | LFN too long (exceeding limit of 255 characters) | When validating a job definition, before executing the payload, the Pilot makes sure that no output file has an LFN that is longer than 255 characters (which is not supported by the DDM system) |
| 1191 | ZEROFILESIZE | File size cannot be zero | Before executing the stage-out command, the Pilot verifies that the size of the file is not zero (which will not be accepted by any storage system) |
| 1199 | MKDIR | Failed to create local directory | |
| 1200 | KILLSIGNAL | Job terminated by unknown kill signal | |
| 1201 | SIGTERM | Job killed by signal: SIGTERM | |
| 1202 | SIGQUIT | Job killed by signal: SIGQUIT | |
| 1203 | SIGSEGV | Job killed by signal: SIGSEGV | |
| 1204 | SIGXCPU | Job killed by signal: SIGXCPU | |
| 1205 | USERKILL | Job killed by user | Reserved error code for user defined kill instructions. Currently not implemented |
| 1206 | SIGBUS | Job killed by signal: SIGBUS | |
| 1207 | SIGUSR1 | Job killed by signal: SIGUSR1 | |
| 1211 | MISSINGINSTALLATION | Missing installation | Assigned error code if the payload fails to execute the transform |
| 1212 | PAYLOADOUTOFMEMORY | Payload ran out of memory | Assigned error code if the pilot finds the string "FATAL out of memory: taking the application down" in the stderr and "St9bad_alloc", "std::bad_alloc" in the stdout |
| 1213 | REACHEDMAXTIME | Reached batch system time limit | Pilot aborts automatically when 10 minutes remain of the maximum allowed running time, as set by 1) schedconfig.maxtime or 2) Pilot option `-l <maxtime>` (both values are in seconds) |
| 1214 | UNKNOWNPAYLOADFAILURE | Job failed due to unknown reason (consult log file) | |
| 2222 | SINGULARITYRESOURCEUNAVAILABLE | | |
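
Several notes in the table describe error codes that are chosen by searching the copytool command output for a known string (codes 1098, 1103, 1163, 1151 and 1152). A minimal sketch of that idea follows. The function name `classify_copytool_output`, the order of the checks, and the fallback to the generic stage-in/stage-out codes are assumptions made for illustration, not the Pilot's actual implementation.

```python
# Marker strings and codes are taken from the table above; everything else
# (function name, check order, fallback) is a hypothetical illustration.
NOLOCALSPACE = 1098
STAGEINFAILED = 1099
NOSUCHFILE = 1103
STAGEOUTFAILED = 1137
STAGEINTIMEOUT = 1151
STAGEOUTTIMEOUT = 1152
NOPROXY = 1163


def classify_copytool_output(output: str, is_stagein: bool) -> int:
    """Map the text output of a failed copytool command to an error code."""
    if "No space left on device" in output:
        return NOLOCALSPACE
    if "No such file or directory" in output:
        return NOSUCHFILE
    if "Could not establish context" in output:
        return NOPROXY
    if "Operation timed out" in output:
        return STAGEINTIMEOUT if is_stagein else STAGEOUTTIMEOUT
    # Assumed fallback: the generic transfer failure for the given direction.
    return STAGEINFAILED if is_stagein else STAGEOUTFAILED


if __name__ == "__main__":
    print(classify_copytool_output("write: No space left on device", True))
    print(classify_copytool_output("gfal-copy: Operation timed out", False))
```

Keeping the stage-in and stage-out variants behind a single `is_stagein` flag mirrors the paired codes in the table, where the same symptom maps to a different code depending on the transfer direction.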
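
The notes for codes 1168, 1190 and 1191 describe checks the Pilot performs before stage-in, before payload execution, and before stage-out. The sketch below shows one way such checks could look. All function names, the dictionary keys `size` and `direct_access`, and the convention of returning 0 for "no error" are hypothetical; the 255-character limit and the exclusion of directly accessed files come from the table.

```python
SIZETOOLARGE = 1168
LFNTOOLONG = 1190
ZEROFILESIZE = 1191
MAX_LFN_LENGTH = 255  # limit stated in the table


def check_lfn_lengths(lfns):
    """Return LFNTOOLONG if any output LFN exceeds 255 characters, else 0."""
    if any(len(lfn) > MAX_LFN_LENGTH for lfn in lfns):
        return LFNTOOLONG
    return 0


def check_output_file_size(size_bytes):
    """Return ZEROFILESIZE for an empty output file, else 0."""
    return ZEROFILESIZE if size_bytes == 0 else 0


def check_total_input_size(files, maxwdir_bytes):
    """Return SIZETOOLARGE if the locally staged inputs exceed maxwdir.

    Files flagged for direct/remote access are excluded from the sum.
    The unit of maxwdir (bytes here) is an assumption of this sketch.
    """
    total = sum(f["size"] for f in files if not f.get("direct_access", False))
    return SIZETOOLARGE if total > maxwdir_bytes else 0


if __name__ == "__main__":
    inputs = [
        {"size": 8_000, "direct_access": False},
        {"size": 9_000, "direct_access": True},  # not copied locally, so ignored
    ]
    print(check_total_input_size(inputs, maxwdir_bytes=10_000))
    print(check_lfn_lengths(["a" * 256]))
    print(check_output_file_size(0))
```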
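
The time-based rules for codes 1150 (LOOPINGJOB) and 1213 (REACHEDMAXTIME) can be sketched in the same way. The default limits and the 10-minute margin are taken from the table; the function names, the `override` parameter, and the use of plain timestamps in seconds are assumptions made for this example.

```python
PRODUCTION_LOOPING_LIMIT = 12 * 3600  # seconds, default for production jobs
USER_LOOPING_LIMIT = 3 * 3600         # seconds, default for user analysis jobs
MAXTIME_MARGIN = 10 * 60              # abort when 10 minutes remain


def looping_limit(is_user_job, override=None):
    """Return the looping job limit in seconds, honouring an optional override."""
    if override is not None:
        return override
    return USER_LOOPING_LIMIT if is_user_job else PRODUCTION_LOOPING_LIMIT


def is_looping(last_activity, now, limit):
    """True if nothing has happened for longer than the allowed time."""
    return now - last_activity > limit


def reached_max_time(start, now, maxtime):
    """True once no more than 10 minutes remain of the allowed running time."""
    return now - start >= maxtime - MAXTIME_MARGIN


if __name__ == "__main__":
    limit = looping_limit(is_user_job=True)
    print(limit, is_looping(last_activity=0, now=4 * 3600, limit=limit))
    print(reached_max_time(start=0, now=3000, maxtime=3600))
```

Here `last_activity` stands in for whatever the Pilot uses as evidence of progress, such as the most recent modification time of a file in the work directory.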