Special Algorithms and Functionalities
The pilot executes several special algorithms for different tasks, described in the sections below.
The checksum is calculated after stage-in or before stage-out, and is then verified against a known value. The Adler32 checksum algorithm is implemented in the pilot since it is normally not available as a command on the worker nodes. The algorithm is standard, except that the pilot makes sure that the returned string is always eight characters long by padding it with leading zeros (e.g. '3d' -> '0000003d').
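The zero-padding described above can be illustrated with a minimal sketch. It uses zlib.adler32 from the Python standard library rather than the pilot's own implementation, and the function name adler32_hex and the block size are hypothetical choices made for this example.

```python
import zlib


def adler32_hex(path, blocksize=1024 * 1024):
    """Return the Adler32 checksum of a file as an 8-character hex string.

    Hypothetical helper for illustration; not the pilot's own function.
    """
    value = 1  # Adler32 is defined to start from 1
    with open(path, "rb") as f:
        while True:
            chunk = f.read(blocksize)
            if not chunk:
                break
            # Feed the running checksum back in so large files can be
            # processed block by block
            value = zlib.adler32(chunk, value)
    # Mask to 32 bits and left-pad with zeros to exactly eight characters
    return "%08x" % (value & 0xFFFFFFFF)
```

Reading the file in blocks keeps memory use constant for large files, and the "%08x" format is what guarantees the fixed width.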
The current CPU consumption time (system+user time) for a given process is calculated on the fly by looping over all of its child processes. After all child processes have been identified, the corresponding /proc/<pid>/stat files are parsed, and the utime, stime, cutime and cstime values are obtained by dividing the relevant fields from the stat files by the os.sysconf(os.sysconf_names['SC_CLK_TCK']) value. The CPU consumption time for each subprocess is the sum of these four values, and the CPU consumption time for the given process is the sum of the subprocess CPU consumption times.
See also the Timing measurements section.
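The calculation can be sketched as follows. This is a minimal illustration under stated assumptions, not the pilot's code: the function names are hypothetical, and child processes are found here by scanning the parent pid field of every /proc/<pid>/stat file, which may differ from how the pilot identifies them. The field positions (utime, stime, cutime, cstime are fields 14 to 17) follow the Linux proc documentation.

```python
import os


def parse_stat_cpu_seconds(stat_line, clock_ticks):
    """Return utime+stime+cutime+cstime in seconds from one stat line."""
    # The command name (field 2) is enclosed in parentheses and may contain
    # spaces, so only the part after the last ')' is split on whitespace.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state), so fields 14-17 are at indices 11-14
    utime, stime, cutime, cstime = (int(v) for v in fields[11:15])
    return (utime + stime + cutime + cstime) / clock_ticks


def get_process_tree(pid):
    """Return pid and all its descendants (assumed discovery method)."""
    children = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                # Field 4 (ppid) is at index 1 after the split
                ppid = int(f.read().rsplit(")", 1)[1].split()[1])
        except (OSError, IndexError, ValueError):
            continue  # the process may have exited in the meantime
        children.setdefault(ppid, []).append(int(entry))
    tree, stack = [], [pid]
    while stack:
        current = stack.pop()
        tree.append(current)
        stack.extend(children.get(current, []))
    return tree


def cpu_consumption_time(pid):
    """Sum the CPU time in seconds over pid and all of its descendants."""
    clock_ticks = os.sysconf(os.sysconf_names["SC_CLK_TCK"])
    total = 0.0
    for p in get_process_tree(pid):
        try:
            with open(f"/proc/{p}/stat") as f:
                total += parse_stat_cpu_seconds(f.read(), clock_ticks)
        except OSError:
            continue  # skip processes that vanished while looping
    return total
```

For example, cpu_consumption_time(os.getpid()) would return the CPU time consumed so far by the current process and its children on a Linux host.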
A job is considered to be looping if it has not updated any files in the work directory within the specified time. The pilot uses an internal time limit of 2h for both user analysis and production jobs. The mechanism can be turned off by using the noLoopingCheck task parameter (forwarded to the Pilot as loopingCheck=False). The internal limit can be changed in pilot/util/default.cfg.
To find the most recently touched files, the following command is executed once every 15 minutes (the interval is also configurable in the Pilot config file):
find <workdir> -mmin -<limit>
where <limit> is the looping time limit divided by 60, i.e. converted from seconds to minutes.
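The check can be sketched in Python as shown below. The function names are hypothetical and the time limit is assumed to be given in seconds. The pure-Python variant only approximates find -mmin: it considers regular files only and does not reproduce find's rounding of file ages.

```python
import os
import time


def build_find_command(workdir, limit_seconds):
    """Build the find command, converting the limit from seconds to minutes."""
    return ["find", workdir, "-mmin", "-%d" % (limit_seconds // 60)]


def recently_touched_files(workdir, limit_seconds, now=None):
    """List files under workdir modified within the last limit_seconds."""
    now = time.time() if now is None else now
    touched = []
    for root, _dirs, files in os.walk(workdir):
        for name in files:
            path = os.path.join(root, name)
            try:
                if now - os.path.getmtime(path) <= limit_seconds:
                    touched.append(path)
            except OSError:
                continue  # file removed while walking the directory
    return touched


def is_looping(workdir, limit_seconds=2 * 3600):
    """A job counts as looping if no file was touched within the limit."""
    return not recently_touched_files(workdir, limit_seconds)
```

With the default two-hour limit, build_find_command(workdir, 7200) yields a command ending in -mmin -120.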
A troublesome job can be debugged live by turning on the special debug mode in the prodtask-dev page. The instruction is delivered to the pilot via the job update backchannel (i.e. in the return dictionary after an updateJob call). In this case, the pilot changes the frequency of updateJob calls to once every five minutes, and adds the tail of the most recent non-binary file found in the working directory. The uploaded tail is then made visible in the corresponding PanDA monitor job page.
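How such a tail might be collected is sketched below. All function names are hypothetical. The test for a non-binary file (no NUL byte in the first kilobyte) and the number of lines returned are assumptions made for this example, not the pilot's actual criteria.

```python
import os


def is_probably_text(path, probe_size=1024):
    """Heuristic (assumption): a file is text if its start has no NUL byte."""
    with open(path, "rb") as f:
        return b"\0" not in f.read(probe_size)


def latest_text_file(workdir):
    """Return the most recently modified text file under workdir, or None."""
    candidates = []
    for root, _dirs, files in os.walk(workdir):
        for name in files:
            path = os.path.join(root, name)
            try:
                if is_probably_text(path):
                    candidates.append((os.path.getmtime(path), path))
            except OSError:
                continue  # unreadable or vanished file
    return max(candidates)[1] if candidates else None


def tail(path, n_lines=20):
    """Return the last n_lines lines of a text file as one string."""
    with open(path, "r", errors="replace") as f:
        return "".join(f.readlines()[-n_lines:])
```

Reading the whole file with readlines() is simple but would be wasteful for very large logs; seeking from the end of the file is a common alternative.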
The pilot monitors the payload for possible memory leakage, provided it has access to memory values returned by an external memory monitor tool. ATLAS currently uses the prmon tool. The pilot fits a straight line to the PSS+SWAP values versus time; the slope gives a measure of the leakage rate. Tails are removed and the Chi2 is also calculated.
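The fit can be illustrated with an ordinary least-squares sketch. The function names are hypothetical and several details are assumptions, not taken from the pilot: removing tails is interpreted here as dropping a fixed number of points from each end of the series, and the Chi2 is computed without weights, i.e. as the plain sum of squared residuals.

```python
def fit_line(times, values):
    """Least-squares fit of values = intercept + slope * times.

    Returns (slope, intercept, chi2), where chi2 is the unweighted sum of
    squared residuals (assumes unit uncertainty on every point).
    """
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two matching (time, value) points")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("all time values are identical")
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    intercept = mean_v - slope * mean_t
    chi2 = sum((v - (intercept + slope * t)) ** 2
               for t, v in zip(times, values))
    return slope, intercept, chi2


def leak_rate(times, pss, swap, n_tail=0):
    """Fit PSS+SWAP versus time after dropping n_tail points at each end."""
    total = [p + s for p, s in zip(pss, swap)]
    if n_tail > 0:
        # Assumed interpretation of tail removal: trim both ends
        times = times[n_tail:-n_tail]
        total = total[n_tail:-n_tail]
    slope, _intercept, chi2 = fit_line(times, total)
    return slope, chi2
```

A slope that stays clearly positive over a long run, together with a small Chi2, would point to steadily growing memory use rather than fluctuations.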