This repository has been archived by the owner on Jan 30, 2024. It is now read-only.
Error codes
Paul Nilsson edited this page May 13, 2021
When it detects a fatal problem, the Pilot assigns an error code and informs the server. Aside from the numerical code itself, it also reports the error meaning and more detailed error diagnostics.
Error code | Acronym | Meaning | Notes |
---|---|---|---|
1008 | GENERALERROR | General pilot error, consult batch log | |
1098 | NOLOCALSPACE | Not enough local space | Error code is set e.g. by job monitoring, also if copytool command fails (if "No space left on device" is found in command output) |
1099 | STAGEINFAILED | Failed to stage-in file | |
1100 | REPLICANOTFOUND | The rucio API function list_replicas() did not return any replicas. Check log for details. | |
1103 | NOSUCHFILE | No such file or directory | Error thrown by open_file() function. Also set if copytool fails and "No such file or directory" is found in output |
1104 | USERDIRTOOLARGE | User work directory too large | The error is set if the user work directory exceeds the maximum allowed limit, as defined by schedconfig.maxwdir (default: 14 GB) |
1106 | STDOUTTOOBIG | Payload log or stdout file too big | Set if stdout exceeds maximum allowed limit of 2 GB, set in the Pilot's default config file |
1110 | SETUPFAILURE | Failed during payload setup | |
1115 | NFSSQLITE | NFS SQLite locking problems | Pilot identifies this error by doing a grep on the strings "prepare 5 database is locked" and "Error SQLiteStatement" in the payload stdout |
1116 | QUEUEDATA | Pilot could not download queuedata | |
1117 | QUEUEDATANOTOK | Pilot found non-valid queuedata | |
1124 | OUTPUTFILETOOLARGE | Output file too large | |
1133 | NOSTORAGE | Fetching default storage failed: no activity related storage defined | |
1137 | STAGEOUTFAILED | Failed to stage-out file | |
1141 | PUTMD5MISMATCH | md5sum mismatch on output file | Error acronym should be renamed |
1143 | CHMODTRF | Failed to chmod trf | After downloading a trf, the pilot tries to do a chmod 0755 on it. If this fails, the pilot will set this error |
1144 | PANDAKILL | This job was killed by panda server | |
1145 | GETMD5MISMATCH | md5sum mismatch on input file | Error acronym should be renamed |
1149 | TRFDOWNLOADFAILURE | Transform could not be downloaded | |
1150 | LOOPINGJOB | Looping job killed by pilot | The pilot will kill the payload (or stop stage-in/out) if there is no activity within the allowed time (i.e. no files are touched in the work directory, or the file transfer is stuck). The default looping job time limit is 12*3600 s for production jobs and 3*3600 s for user analysis jobs. The limit can be overridden in the pilot's config file (or set by the user using the maxCPUCount variable) |
1151 | STAGEINTIMEOUT | File transfer timed out during stage-in | Currently only identified for rucio file transfer (unless "Operation timed out" is in stderr) |
1152 | STAGEOUTTIMEOUT | File transfer timed out during stage-out | Currently only identified for rucio file transfer (unless "Operation timed out" is in stderr) |
1163 | NOPROXY | Grid proxy not valid | Set if grid-proxy-info fails or if "Could not establish context" is found in copytool command output |
1165 | MISSINGOUTPUTFILE | Local output file is missing | |
1168 | SIZETOOLARGE | Total file size too large | Before stage-in, the pilot verifies that the sum of the input file sizes does not exceed maxwdir (set in schedconfig or in pilot config file). Any files that are to be accessed directly/remotely are excluded |
1171 | GETADMISMATCH | adler32 mismatch on input file | Error acronym should be renamed |
1172 | PUTADMISMATCH | adler32 mismatch on output file | Error acronym should be renamed |
1177 | NOVOMSPROXY | Voms proxy not valid | Set if arcproxy fails |
1180 | GETGLOBUSSYSERR | Globus system error during stage-in | Pilot identifies this error if "globus_xio:" is found in command output |
1181 | PUTGLOBUSSYSERR | Globus system error during stage-out | Pilot identifies this error if "globus_xio:" is found in command output |
1186 | NOSOFTWAREDIR | Software directory does not exist | |
1187 | NOPAYLOADMETADATA | Payload metadata does not exist | This error can happen due to previous uncaught error, leading to missing metadata, i.e. the error label can be misleading (when discovered, pilot is usually patched) |
1190 | LFNTOOLONG | LFN too long (exceeding limit of 255 characters) | When validating a job definition, before executing the payload, the Pilot makes sure that no output file has an LFN that is longer than 255 characters (which is not supported by the DDM system) |
1191 | ZEROFILESIZE | File size cannot be zero | Before executing the stage-out command, the Pilot verifies that the size of the file is not zero (which will not be accepted by any storage system) |
1199 | MKDIR | Failed to create local directory | |
1200 | KILLSIGNAL | Job terminated by unknown kill signal | |
1201 | SIGTERM | Job killed by signal: SIGTERM | |
1202 | SIGQUIT | Job killed by signal: SIGQUIT | |
1203 | SIGSEGV | Job killed by signal: SIGSEGV | |
1204 | SIGXCPU | Job killed by signal: SIGXCPU | |
1205 | USERKILL | Job killed by user | Reserved error code for user defined kill instructions. Currently not implemented |
1206 | SIGBUS | Job killed by signal: SIGBUS | |
1207 | SIGUSR1 | Job killed by signal: SIGUSR1 | |
1211 | MISSINGINSTALLATION | Missing installation | Assigned error code if the payload fails to execute the transform |
1212 | PAYLOADOUTOFMEMORY | Payload ran out of memory | Assigned error code if the pilot finds the string "FATAL out of memory: taking the application down" in the stderr and "St9bad_alloc", "std::bad_alloc" in the stdout |
1213 | REACHEDMAXTIME | Reached batch system time limit | Pilot aborts automatically when 10 minutes remain of the maximum allowed running time, as set by 1) schedconfig.maxtime or 2) Pilot option -l <maxtime> (both values are in seconds) |
1220 | UNKNOWNPAYLOADFAILURE | Job failed due to unknown reason (consult log file) | |
1221 | FILEEXISTS | File already exists | Error code is set if "File exists", "SRM_FILE_BUSY" or "file already exists" is found in copytool command output |
1223 | BADALLOC | Transform failed due to bad_alloc | Assigned error code if the pilot finds "badalloc" among the job report errors |
1224 | ESRECOVERABLE | Event service: recoverable error | |
1228 | ESFATAL | Event service: fatal error | |
1234 | EXECUTEDCLONEJOB | Clone job is already executed | |
1235 | PAYLOADEXCEEDMAXMEM | Payload exceeded maximum memory | |
1236 | KILLEDBYSERVER | Killed by server | This error is not set by the pilot. It is currently only set by Harvester |
1238 | ESNOEVENTS | Event service: no events | |
1240 | MESSAGEHANDLINGFAILURE | Failed to handle message from payload | |
1242 | CHKSUMNOTSUP | Query checksum is not supported | The error code is set if Pilot finds "query chksum is not supported" or "Unable to checksum" in command output |
1244 | NORELEASEFOUND | No release candidates found | |
1246 | NOUSERTARBALL | User tarball could not be downloaded from PanDA server | |
1247 | BADXML | Badly formed XML | Parsing of metadata failed most likely due to presence of illegal character |
1300 | NOTIMPLEMENTED | The class or function is not implemented | |
1301 | UNKNOWNEXCEPTION | An unknown pilot exception has occurred | |
1302 | CONVERSIONFAILURE | Failed to convert object data | E.g. if a JSON dictionary can't be converted from unicode to utf-8 |
1303 | FILEHANDLINGFAILURE | Failed during file handling | E.g. if a file can't be opened or a dictionary can't be loaded from file |
1305 | PAYLOADEXECUTIONFAILURE | Failed to execute payload | |
1306 | SINGULARITYGENERALFAILURE | Singularity: general failure | Site issue; set if the Pilot finds "Operation not permitted" in stderr |
1307 | SINGULARITYNOLOOPDEVICES | Singularity: No more available loop devices | Site issue; set if Pilot finds "No more available loop devices" in stderr |
1308 | SINGULARITYBINDPOINTFAILURE | Singularity: Not mounting requested bind point | Site issue; set if the Pilot finds "Not mounting requested bind point" in stderr |
1309 | SINGULARITYIMAGEMOUNTFAILURE | Singularity: Failed to mount image | Site issue; set if the Pilot finds "Failed to mount image" in stderr |
1310 | PAYLOADEXECUTIONEXCEPTION | Exception caught during payload execution | Internal pilot problem |
1311 | NOTDEFINED | Not defined | A general - internally used - error that is explained in the corresponding exception (NotDefined) error diagnostics; e.g. the analytics package throws this exception if a fit has not been defined; or if a math function fails to convert a string to an integer |
1312 | NOTSAMELENGTH | Not same length | Internally used error corresponding to exception NotSameLength, which is thrown if input data are not of same length in a fit |
1313 | NOSTORAGEPROTOCOL | No protocol defined for storage endpoint | |
1314 | UNKNOWNCHECKSUMTYPE | Unknown checksum type | |
1315 | UNKNOWNTRFFAILURE | Unknown TRF failure | |
1316 | RUCIOSERVICEUNAVAILABLE | Rucio: Service unavailable | Set if corresponding Rucio error details (reg.exp. or "service_unavailable") are found in copytool command output |
1317 | EXCEEDEDMAXWAITTIME | Exceeded maximum waiting time | Internally used exception/error code. Exception thrown by pilot monitoring when abort_job wait time has been exceeded (and when other threads have not finished cleaning up on time). abort_job is set when pilot has received a kill signal |
1318 | COMMUNICATIONFAILURE | Failed to communicate with server | |
1319 | INTERNALPILOTPROBLEM | An internal Pilot problem has occurred (consult Pilot log) | Error code used for internal debugging. A more precise error message should be written to the log |
1320 | LOGFILECREATIONFAILURE | Failed during creation of log file | In case tarfile.open() or the archive.add() fails, the pilot will set this error code |
1321 | RUCIOLOCATIONFAILED | Failed to get client location for Rucio | |
1322 | RUCIOLISTREPLICASFAILED | Failed to get replicas from Rucio | |
1323 | UNKNOWNCOPYTOOL | Unknown copy tool | Set if the requested copy tool has no implementation |
1324 | SERVICENOTAVAILABLE | Service not available at the moment | Rucio server not available |
1325 | SINGULARITYNOTINSTALLED | Singularity: not installed | Identified by trf exit code 64 and the string "Singularity is not installed" present in the stderr |
1326 | NOREPLICAS | No matching replicas were found in list_replicas() output | list_replicas() returned replicas but no local matching replica was found |
1327 | UNREACHABLENETWORK | Unable to stage-in file since network is unreachable | Problem seen with xrdcp command during stage-in |
1328 | PAYLOADSIGSEGV | SIGSEGV: Invalid memory reference or a segmentation fault | Special payload error extracted from job report. A SIGSEGV is an error (signal) caused by an invalid memory reference or a segmentation fault. The payload is probably trying to access an array element out of bounds or trying to use too much memory |
1329 | NONDETERMINISTICDDM | Failed to construct SURL for non-deterministic ddm (update CRIC) | While Pilot 1 ignored the is_deterministic endpoint field if the storage path ended in /rucio, Pilot 2 will instead fail the job if the endpoint is not deterministic. The endpoint should be fixed in CRIC |
1330 | JSONRETRIEVALTIMEOUT | JSON retrieval timed out | Error is assigned if the pilot fails to download JSON |
1331 | MISSINGINPUTFILE | Input file is missing in storage element | |
1332 | BLACKHOLE | Black hole detected in file system (consult Pilot log) | This error is assigned if a pilot module goes missing. Typically this would mean that it cannot be imported |
1333 | NOREMOTESPACE | No space left on device | |
1334 | SETUPFATAL | Setup failed with a fatal exception (consult payload log) | |
1335 | MISSINGUSERCODE | User code not available on PanDA server (resubmit task with --useNewCode) | Error occurs when user tarball has been deleted from the server and the pilot tries to download it. User must resubmit task with prun/pathena option --useNewCode |
1336 | JOBALREADYRUNNING | Job is already running elsewhere | |
1337 | BADMEMORYMONITORJSON | Memory monitor produced bad output | Failure to parse JSON file from Memory monitor |
1338 | STAGEINAUTHENTICATIONFAILURE | Authentication failure during stage-in | |
1339 | DBRELEASEFAILURE | Local DBRelease handling failed (consult Pilot log) | |
1340 | SINGULARITYNEWUSERNAMESPACE | Singularity: Failed invoking the NEWUSER namespace runtime | |
1341 | BADQUEUECONFIGURATION | Bad queue configuration detected | |
1342 | MIDDLEWAREIMPORTFAILURE | Failed to import middleware (consult Pilot log) | |
1343 | NOOUTPUTINJOBREPORT | Found no output in job report | Set when output=[] in job report |
1344 | RESOURCEUNAVAILABLE | Resource temporarily unavailable (consult Pilot log) | Set when get_current_cpu_consumption_time() fails due to OSError exception raised in subprocess module (failed os.fork()). To be extended in v 2.1.22+ |
1345 | SINGULARITYFAILEDUSERNAMESPACE | Singularity: Failed to create user namespace | Detected in stderr when the transform has a non-zero exit code |
1346 | TRANSFORMNOTFOUND | Transform not found | Detected in stderr when the transform has a non-zero exit code |
1347 | UNSUPPORTEDSL5OS | Unsupported SL5 OS | Detected in stderr when the transform has a non-zero exit code |
1348 | SINGULARITYRESOURCEUNAVAILABLE | Singularity: Resource temporarily unavailable | Detected in stderr when the transform has a non-zero exit code |
1349 | UNRECOGNIZEDTRFARGUMENTS | Unrecognized transform arguments | Detected in stderr when the transform has a non-zero exit code |
1350 | EMPTYOUTPUTFILE | Empty output file detected | Detected in stderr when the transform has a non-zero exit code |
1351 | UNRECOGNIZEDTRFSTDERR | Unrecognized fatal error in transform stderr | Detected in stderr when the transform has a non-zero exit code |
1352 | STATFILEPROBLEM | Failed to stat proc file for CPU consumption calculation | The pilot sets this error during the CPU consumption calculation if reading /proc/pid/stat fails with "No such file or directory" |
1353 | NOSUCHPROCESS | CPU consumption calculation failed: No such process | The pilot sets this error during the CPU consumption calculation if reading /proc/pid/stat fails with "No such process" |
1354 | GENERALCPUCALCPROBLEM | General CPU consumption calculation problem (consult Pilot log) | If there is a problem accessing the /proc/pid/stat file that is not recognised, this error will be set |
1355 | COREDUMP | Core dump detected | Set if a core dump is found for a failed job in the payload work dir (during the initial payload error analysis). The core dump is removed. Note: currently the file name must be "core" (i.e. not "core.*") |
1356 | PREPROCESSFAILURE | Pre-process command failed | |
1357 | POSTPROCESSFAILURE | Post-process command failed | |
1358 | MISSINGRELEASEUNPACKED | Missing release setup in unpacked container | Pilot requires that /release_setup.sh is present in unpacked containers. It is not present in older containers |
1359 | PANDAQUEUENOTACTIVE | PanDA queue is not active | The error is set as soon as the pilot has downloaded queue data if the queue is not active |
1360 | IMAGENOTFOUND | Image not found | The error is set if the pilot cannot find an image whose path is known |
1361 | REMOTEFILECOULDNOTBEOPENED | Remote file could not be opened | For direct access jobs, the pilot attempts to open (and close) all input root files to avoid wasting CPU with the payload |
1362 | XRDCPERROR | Xrdcp was unable to open file | |
1363 | KILLPAYLOAD | Raythena has decided to kill payload | If the pilot monitoring discovers a kill instruction file in the pilot's work directory ($PILOT_HOME), it will terminate the payload and set this error. The kill instruction file name and checking time are defined in the pilot configuration file |
1364 | MISSINGCREDENTIALS | Unable to locate credentials for S3 transfer | Error set if "Unable to locate credentials" is found in the S3 transfer command output |
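
Several notes in the table describe the same mechanism: the Pilot looks for a known string in the copytool command output and sets the matching error code. The sketch below illustrates that idea in Python using only strings and codes listed in the table. The names `ERROR_CODES`, `OUTPUT_PATTERNS`, `resolve_copytool_error` and `format_error` are hypothetical and are not taken from the Pilot source. The fallback to 1099 (STAGEINFAILED) when nothing matches is also an assumption made for illustration.

```python
# Hypothetical sketch; these names are illustrative, not the Pilot's actual API.

# A small excerpt of the table: code -> (acronym, meaning)
ERROR_CODES = {
    1098: ("NOLOCALSPACE", "Not enough local space"),
    1099: ("STAGEINFAILED", "Failed to stage-in file"),
    1103: ("NOSUCHFILE", "No such file or directory"),
    1163: ("NOPROXY", "Grid proxy not valid"),
    1221: ("FILEEXISTS", "File already exists"),
    1242: ("CHKSUMNOTSUP", "Query checksum is not supported"),
    1364: ("MISSINGCREDENTIALS", "Unable to locate credentials for S3 transfer"),
}

# Substrings quoted in the Notes column, paired with the code they trigger.
# The list is checked in order, so the first matching substring wins.
OUTPUT_PATTERNS = [
    ("No space left on device", 1098),
    ("No such file or directory", 1103),
    ("Could not establish context", 1163),
    ("File exists", 1221),
    ("SRM_FILE_BUSY", 1221),
    ("file already exists", 1221),
    ("query chksum is not supported", 1242),
    ("Unable to checksum", 1242),
    ("Unable to locate credentials", 1364),
]


def resolve_copytool_error(output, default=1099):
    """Return the error code for the first known substring found in output.

    The default (1099, STAGEINFAILED) is an assumption for this sketch.
    """
    for pattern, code in OUTPUT_PATTERNS:
        if pattern in output:
            return code
    return default


def format_error(code, diagnostics=""):
    """Bundle the code, acronym, meaning and diagnostics into one dictionary."""
    acronym, meaning = ERROR_CODES.get(code, ("UNKNOWN", "Unknown error code"))
    return {
        "code": code,
        "acronym": acronym,
        "meaning": meaning,
        "diagnostics": diagnostics,
    }


if __name__ == "__main__":
    stderr = "cp: error writing 'data.root': No space left on device"
    print(format_error(resolve_copytool_error(stderr), diagnostics=stderr))
```

A plain substring search over an ordered list keeps the mapping easy to audit against the table. The order matters when one message could contain more than one known substring.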