This repository has been archived by the owner on Jan 30, 2024. It is now read-only.

Error codes

Paul Nilsson edited this page May 13, 2021 · 17 revisions

When it detects a fatal problem, the Pilot assigns an error code and informs the server. Along with the numerical code, it reports the error meaning and more detailed error diagnostics.

| Error code | Acronym | Meaning | Notes |
|---|---|---|---|
| 1008 | GENERALERROR | General pilot error, consult batch log | |
| 1098 | NOLOCALSPACE | Not enough local space | Set e.g. by job monitoring; also set if a copytool command fails and "No space left on device" is found in the command output |
| 1099 | STAGEINFAILED | Failed to stage-in file | |
| 1100 | REPLICANOTFOUND | The Rucio API function list_replicas() did not return any replicas. Check log for details. | |
| 1103 | NOSUCHFILE | No such file or directory | Error thrown by the open_file() function. Also set if the copytool fails and "No such file or directory" is found in the output |
| 1104 | USERDIRTOOLARGE | User work directory too large | Set if the user work directory exceeds the maximum allowed limit, as defined by schedconfig.maxwdir (default: 14 GB) |
| 1106 | STDOUTTOOBIG | Payload log or stdout file too big | Set if stdout exceeds the maximum allowed limit of 2 GB, which is set in the Pilot's default config file |
| 1110 | SETUPFAILURE | Failed during payload setup | |
| 1115 | NFSSQLITE | NFS SQLite locking problems | Pilot identifies this error by searching for the strings "prepare 5 database is locked" and "Error SQLiteStatement" in the payload stdout |
| 1116 | QUEUEDATA | Pilot could not download queuedata | |
| 1117 | QUEUEDATANOTOK | Pilot found non-valid queuedata | |
| 1124 | OUTPUTFILETOOLARGE | Output file too large | |
| 1133 | NOSTORAGE | Fetching default storage failed: no activity related storage defined | |
| 1137 | STAGEOUTFAILED | Failed to stage-out file | |
| 1141 | PUTMD5MISMATCH | md5sum mismatch on output file | Error acronym should be renamed |
| 1143 | CHMODTRF | Failed to chmod trf | After downloading a trf, the pilot tries to do a chmod 0755 on it. If this fails, the pilot sets this error |
| 1144 | PANDAKILL | This job was killed by panda server | |
| 1145 | GETMD5MISMATCH | md5sum mismatch on input file | Error acronym should be renamed |
| 1149 | TRFDOWNLOADFAILURE | Transform could not be downloaded | |
| 1150 | LOOPINGJOB | Looping job killed by pilot | The pilot kills the payload (or stops stage-in/out) if there is no activity within the allowed time, i.e. no files are touched in the work directory or the file transfer is stuck. The default looping job time limit is 12 h (43200 s) for production jobs and 3 h (10800 s) for user analysis jobs. The limit can be overridden in the pilot's config file (or set by the user with the maxCPUCount variable) |
| 1151 | STAGEINTIMEOUT | File transfer timed out during stage-in | Currently only identified for rucio file transfer (unless "Operation timed out" is in stderr) |
| 1152 | STAGEOUTTIMEOUT | File transfer timed out during stage-out | Currently only identified for rucio file transfer (unless "Operation timed out" is in stderr) |
| 1163 | NOPROXY | Grid proxy not valid | Set if grid-proxy-info fails or if "Could not establish context" is found in the copytool command output |
| 1165 | MISSINGOUTPUTFILE | Local output file is missing | |
| 1168 | SIZETOOLARGE | Total file size too large | Before stage-in, the pilot verifies that the sum of the input file sizes does not exceed maxwdir (set in schedconfig or in the pilot config file). Files that are to be accessed directly/remotely are excluded |
| 1171 | GETADMISMATCH | adler32 mismatch on input file | Error acronym should be renamed |
| 1172 | PUTADMISMATCH | adler32 mismatch on output file | Error acronym should be renamed |
| 1177 | NOVOMSPROXY | Voms proxy not valid | Set if arcproxy fails |
| 1180 | GETGLOBUSSYSERR | Globus system error during stage-in | Pilot identifies this error if "globus_xio:" is found in the command output |
| 1181 | PUTGLOBUSSYSERR | Globus system error during stage-out | Pilot identifies this error if "globus_xio:" is found in the command output |
| 1186 | NOSOFTWAREDIR | Software directory does not exist | |
| 1187 | NOPAYLOADMETADATA | Payload metadata does not exist | This error can result from a previous uncaught error that led to missing metadata, so the error label can be misleading (when such a case is discovered, the pilot is usually patched) |
| 1190 | LFNTOOLONG | LFN too long (exceeding limit of 255 characters) | When validating a job definition, before executing the payload, the Pilot makes sure that no output file has an LFN longer than 255 characters (which is not supported by the DDM system) |
| 1191 | ZEROFILESIZE | File size cannot be zero | Before executing the stage-out command, the Pilot verifies that the size of the file is not zero (which would not be accepted by any storage system) |
| 1199 | MKDIR | Failed to create local directory | |
| 1200 | KILLSIGNAL | Job terminated by unknown kill signal | |
| 1201 | SIGTERM | Job killed by signal: SIGTERM | |
| 1202 | SIGQUIT | Job killed by signal: SIGQUIT | |
| 1203 | SIGSEGV | Job killed by signal: SIGSEGV | |
| 1204 | SIGXCPU | Job killed by signal: SIGXCPU | |
| 1205 | USERKILL | Job killed by user | Reserved error code for user-defined kill instructions. Currently not implemented |
| 1206 | SIGBUS | Job killed by signal: SIGBUS | |
| 1207 | SIGUSR1 | Job killed by signal: SIGUSR1 | |
| 1211 | MISSINGINSTALLATION | Missing installation | Assigned if the payload fails to execute the transform |
| 1212 | PAYLOADOUTOFMEMORY | Payload ran out of memory | Assigned if the pilot finds the string "FATAL out of memory: taking the application down" in the stderr and "St9bad_alloc", "std::bad_alloc" in the stdout |
| 1213 | REACHEDMAXTIME | Reached batch system time limit | Pilot aborts automatically when 10 minutes remain of the maximum allowed running time, as set by 1) schedconfig.maxtime or 2) Pilot option `-l <maxtime>` (both values are in seconds) |
| 1220 | UNKNOWNPAYLOADFAILURE | Job failed due to unknown reason (consult log file) | |
| 1221 | FILEEXISTS | File already exists | Set if "File exists", "SRM_FILE_BUSY" or "file already exists" is found in the copytool command output |
| 1223 | BADALLOC | Transform failed due to bad_alloc | Assigned if the pilot finds "badalloc" among the job report errors |
| 1224 | ESRECOVERABLE | Event service: recoverable error | |
| 1228 | ESFATAL | Event service: fatal error | |
| 1234 | EXECUTEDCLONEJOB | Clone job is already executed | |
| 1235 | PAYLOADEXCEEDMAXMEM | Payload exceeded maximum memory | |
| 1236 | KILLEDBYSERVER | Killed by server | This error is not set by the pilot. It is currently only set by Harvester |
| 1238 | ESNOEVENTS | Event service: no events | |
| 1240 | MESSAGEHANDLINGFAILURE | Failed to handle message from payload | |
| 1242 | CHKSUMNOTSUP | Query checksum is not supported | Set if the Pilot finds "query chksum is not supported" or "Unable to checksum" in the command output |
| 1244 | NORELEASEFOUND | No release candidates found | |
| 1246 | NOUSERTARBALL | User tarball could not be downloaded from PanDA server | |
| 1247 | BADXML | Badly formed XML | Parsing of metadata failed, most likely due to the presence of an illegal character |
| 1300 | NOTIMPLEMENTED | The class or function is not implemented | |
| 1301 | UNKNOWNEXCEPTION | An unknown pilot exception has occurred | |
| 1302 | CONVERSIONFAILURE | Failed to convert object data | E.g. if a JSON dictionary can't be converted from unicode to utf-8 |
| 1303 | FILEHANDLINGFAILURE | Failed during file handling | E.g. if a file can't be opened or a dictionary can't be loaded from file |
| 1305 | PAYLOADEXECUTIONFAILURE | Failed to execute payload | |
| 1306 | SINGULARITYGENERALFAILURE | Singularity: general failure | Site issue; set if the Pilot finds "Operation not permitted" in stderr |
| 1307 | SINGULARITYNOLOOPDEVICES | Singularity: No more available loop devices | Site issue; set if the Pilot finds "No more available loop devices" in stderr |
| 1308 | SINGULARITYBINDPOINTFAILURE | Singularity: Not mounting requested bind point | Site issue; set if the Pilot finds "Not mounting requested bind point" in stderr |
| 1309 | SINGULARITYIMAGEMOUNTFAILURE | Singularity: Failed to mount image | Site issue; set if the Pilot finds "Failed to mount image" in stderr |
| 1310 | PAYLOADEXECUTIONEXCEPTION | Exception caught during payload execution | Internal pilot problem |
| 1311 | NOTDEFINED | Not defined | A general, internally used error that is explained in the error diagnostics of the corresponding exception (NotDefined); e.g. the analytics package throws this exception if a fit has not been defined, or if a math function fails to convert a string to an integer |
| 1312 | NOTSAMELENGTH | Not same length | Internally used error corresponding to the exception NotSameLength, which is thrown if the input data in a fit are not of the same length |
| 1313 | NOSTORAGEPROTOCOL | No protocol defined for storage endpoint | |
| 1314 | UNKNOWNCHECKSUMTYPE | Unknown checksum type | |
| 1315 | UNKNOWNTRFFAILURE | Unknown TRF failure | |
| 1316 | RUCIOSERVICEUNAVAILABLE | Rucio: Service unavailable | Set if the corresponding Rucio error details (reg.exp. or "service_unavailable") are found in the copytool command output |
| 1317 | EXCEEDEDMAXWAITTIME | Exceeded maximum waiting time | Internally used exception/error code. The exception is thrown by pilot monitoring when the abort_job wait time has been exceeded (and other threads have not finished cleaning up in time). abort_job is set when the pilot has received a kill signal |
| 1318 | COMMUNICATIONFAILURE | Failed to communicate with server | |
| 1319 | INTERNALPILOTPROBLEM | An internal Pilot problem has occurred (consult Pilot log) | Error code used for internal debugging. A more precise error message should be written to the log |
| 1320 | LOGFILECREATIONFAILURE | Failed during creation of log file | Set if tarfile.open() or archive.add() fails |
| 1321 | RUCIOLOCATIONFAILED | Failed to get client location for Rucio | |
| 1322 | RUCIOLISTREPLICASFAILED | Failed to get replicas from Rucio | |
| 1323 | UNKNOWNCOPYTOOL | Unknown copy tool | Set if the requested copy tool has no implementation |
| 1324 | SERVICENOTAVAILABLE | Service not available at the moment | Rucio server not available |
| 1325 | SINGULARITYNOTINSTALLED | Singularity: not installed | Identified by trf exit code 64 and the string "Singularity is not installed" in the stderr |
| 1326 | NOREPLICAS | No matching replicas were found in list_replicas() output | list_replicas() returned replicas but no local matching replica was found |
| 1327 | UNREACHABLENETWORK | Unable to stage-in file since network is unreachable | Problem seen with the xrdcp command during stage-in |
| 1328 | PAYLOADSIGSEGV | SIGSEGV: Invalid memory reference or a segmentation fault | Special payload error extracted from the job report. A SIGSEGV is an error (signal) caused by an invalid memory reference or a segmentation fault. The payload is probably trying to access an array element out of bounds or trying to use too much memory |
| 1329 | NONDETERMINISTICDDM | Failed to construct SURL for non-deterministic ddm (update CRIC) | While Pilot 1 ignored the is_deterministic endpoint field if the storage path ended in /rucio, Pilot 2 instead fails the job if the endpoint is not deterministic. The endpoint should be fixed in CRIC |
| 1330 | JSONRETRIEVALTIMEOUT | JSON retrieval timed out | Assigned if the pilot fails to download JSON |
| 1331 | MISSINGINPUTFILE | Input file is missing in storage element | |
| 1332 | BLACKHOLE | Black hole detected in file system (consult Pilot log) | Assigned if a pilot module goes missing. Typically this means that it cannot be imported |
| 1333 | NOREMOTESPACE | No space left on device | |
| 1334 | SETUPFATAL | Setup failed with a fatal exception (consult payload log) | |
| 1335 | MISSINGUSERCODE | User code not available on PanDA server (resubmit task with --useNewCode) | Occurs when the user tarball has been deleted from the server and the pilot tries to download it. The user must resubmit the task with the prun/pathena option --useNewCode |
| 1336 | JOBALREADYRUNNING | Job is already running elsewhere | |
| 1337 | BADMEMORYMONITORJSON | Memory monitor produced bad output | Failure to parse the JSON file from the Memory monitor |
| 1338 | STAGEINAUTHENTICATIONFAILURE | Authentication failure during stage-in | |
| 1339 | DBRELEASEFAILURE | Local DBRelease handling failed (consult Pilot log) | |
| 1340 | SINGULARITYNEWUSERNAMESPACE | Singularity: Failed invoking the NEWUSER namespace runtime | |
| 1341 | BADQUEUECONFIGURATION | Bad queue configuration detected | |
| 1342 | MIDDLEWAREIMPORTFAILURE | Failed to import middleware (consult Pilot log) | |
| 1343 | NOOUTPUTINJOBREPORT | Found no output in job report | Set when output=[] in the job report |
| 1344 | RESOURCEUNAVAILABLE | Resource temporarily unavailable (consult Pilot log) | Set when get_current_cpu_consumption_time() fails due to an OSError exception raised in the subprocess module (failed os.fork()). To be extended in v 2.1.22+ |
| 1345 | SINGULARITYFAILEDUSERNAMESPACE | Singularity: Failed to create user namespace | Detected in stderr when the transform has a non-zero exit code |
| 1346 | TRANSFORMNOTFOUND | Transform not found | Detected in stderr when the transform has a non-zero exit code |
| 1347 | UNSUPPORTEDSL5OS | Unsupported SL5 OS | Detected in stderr when the transform has a non-zero exit code |
| 1348 | SINGULARITYRESOURCEUNAVAILABLE | Singularity: Resource temporarily unavailable | Detected in stderr when the transform has a non-zero exit code |
| 1349 | UNRECOGNIZEDTRFARGUMENTS | Unrecognized transform arguments | Detected in stderr when the transform has a non-zero exit code |
| 1350 | EMPTYOUTPUTFILE | Empty output file detected | Detected in stderr when the transform has a non-zero exit code |
| 1351 | UNRECOGNIZEDTRFSTDERR | Unrecognized fatal error in transform stderr | Detected in stderr when the transform has a non-zero exit code |
| 1352 | STATFILEPROBLEM | Failed to stat proc file for CPU consumption calculation | Set during the CPU consumption calculation if reading /proc/pid/stat fails with "No such file or directory" |
| 1353 | NOSUCHPROCESS | CPU consumption calculation failed: No such process | Set during the CPU consumption calculation if reading /proc/pid/stat fails with "No such process" |
| 1354 | GENERALCPUCALCPROBLEM | General CPU consumption calculation problem (consult Pilot log) | Set if there is an unrecognised problem accessing the /proc/pid/stat file |
| 1355 | COREDUMP | Core dump detected | Set if a core dump is found for a failed job in the payload work dir (during the initial payload error analysis). The core dump is removed. Note: currently the file name must be "core" (i.e. not "core.*") |
| 1356 | PREPROCESSFAILURE | Pre-process command failed | |
| 1357 | POSTPROCESSFAILURE | Post-process command failed | |
| 1358 | MISSINGRELEASEUNPACKED | Missing release setup in unpacked container | Pilot requires that /release_setup.sh is present in unpacked containers. It is not present in older containers |
| 1359 | PANDAQUEUENOTACTIVE | PanDA queue is not active | Set as soon as the pilot has downloaded the queue data, if the queue is not active |
| 1360 | IMAGENOTFOUND | Image not found | Set if the pilot cannot find an image whose path is known |
| 1361 | REMOTEFILECOULDNOTBEOPENED | Remote file could not be opened | For direct access jobs, the pilot attempts to open (and close) all input root files before the payload runs, to avoid wasting CPU on a payload that cannot read its input |
| 1362 | XRDCPERROR | Xrdcp was unable to open file | |
| 1363 | KILLPAYLOAD | Raythena has decided to kill payload | If the pilot monitoring discovers a kill instruction file in the pilot's work directory ($PILOT_HOME), it terminates the payload and sets this error. The kill instruction file name and checking time are defined in the pilot configuration file |
| 1364 | MISSINGCREDENTIALS | Unable to locate credentials for S3 transfer | Set if "Unable to locate credentials" is found in the S3 transfer command output |
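
Several of the Notes entries above describe the same mechanism: the Pilot looks for a known string in the copytool command output and assigns the matching error code. The Python sketch below illustrates that idea and is not the Pilot's actual implementation. The names `PilotError`, `COPYTOOL_PATTERNS` and `resolve_copytool_error` are hypothetical, as are the first-match-wins ordering and the fallback to the generic stage-in/stage-out codes. Only the codes, acronyms and search strings come from the table.

```python
from typing import NamedTuple


class PilotError(NamedTuple):
    """Hypothetical container for an error code and its acronym."""
    code: int
    acronym: str


# (substring, error) pairs taken from the Notes column of the table.
# Assumption of this sketch: patterns are tried in order, first match wins.
COPYTOOL_PATTERNS = [
    ("No space left on device", PilotError(1098, "NOLOCALSPACE")),
    ("No such file or directory", PilotError(1103, "NOSUCHFILE")),
    ("Could not establish context", PilotError(1163, "NOPROXY")),
    ("File exists", PilotError(1221, "FILEEXISTS")),
    ("SRM_FILE_BUSY", PilotError(1221, "FILEEXISTS")),
    ("file already exists", PilotError(1221, "FILEEXISTS")),
    ("query chksum is not supported", PilotError(1242, "CHKSUMNOTSUP")),
    ("Unable to checksum", PilotError(1242, "CHKSUMNOTSUP")),
    ("Unable to locate credentials", PilotError(1364, "MISSINGCREDENTIALS")),
]

# Direction-dependent codes: timeouts and the generic fallback.
TIMEOUT = {
    "stage-in": PilotError(1151, "STAGEINTIMEOUT"),
    "stage-out": PilotError(1152, "STAGEOUTTIMEOUT"),
}
FALLBACK = {
    "stage-in": PilotError(1099, "STAGEINFAILED"),
    "stage-out": PilotError(1137, "STAGEOUTFAILED"),
}


def resolve_copytool_error(output: str, direction: str = "stage-in") -> PilotError:
    """Map the combined stdout/stderr of a failed copytool command to an error."""
    if direction not in FALLBACK:
        raise ValueError(f"unknown transfer direction: {direction!r}")
    for pattern, error in COPYTOOL_PATTERNS:
        if pattern in output:
            return error
    if "Operation timed out" in output:
        return TIMEOUT[direction]
    # Nothing recognised: fall back to the generic stage-in/stage-out failure.
    return FALLBACK[direction]
```

For example, `resolve_copytool_error("xrdcp: No space left on device")` yields the NOLOCALSPACE error (1098), while unrecognised output during stage-out falls back to STAGEOUTFAILED (1137).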