3.6.5.32 #96

PalNilsson · 2023-09-05T08:08:14Z

Measures against problems with lingering defunct processes
- Added internal timeout of 300s to curl call (on top of existing connection and max time timeouts)
- Added zombie reaper as part of job monitoring, executed at the time of the looping job check, i.e. every ten minutes
- Problems seen at least at MWT2, with a large number of lingering prmon processes and work directories
  - Prmon appears to have a problem with being killed by SIGUSR1, which often leaves it in a defunct state. The pilot waits for the process to end normally, but this mostly fails. This is being discussed with the prmon developers
  - The problem appears to have started around July 29 (unclear why), but lingering prmon processes seem to be everywhere. These defunct processes normally go away when the top bash process ends, but can linger if parent processes are hard-killed. If there are too many defunct processes, the batch system may kill the parent without warning, leaving lingering work directories that in turn require external cleanup
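
As an illustration of the zombie reaper idea described above, here is a minimal sketch, assuming a POSIX system and that the defunct processes are direct children of the calling process. The function name `reap_zombies` is hypothetical and is not taken from the pilot code; the actual implementation may differ.

```python
import os


def reap_zombies():
    """Collect any terminated (defunct) child processes without blocking.

    Returns the list of pids that were reaped. Hypothetical helper,
    not the pilot's own function.
    """
    reaped = []
    while True:
        try:
            # -1: any child process; WNOHANG: return immediately if none has exited
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            # the calling process has no child processes at all
            break
        if pid == 0:
            # children exist, but none of them has terminated yet
            break
        reaped.append(pid)
    return reaped
```

A periodic monitoring loop could call such a function at each check. Note that a process can only reap its own children, and that `os.waitpid(-1, ...)` also collects children started through `subprocess`, so code holding on to `Popen` objects may no longer see the real exit status.
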
Python 3.11 tests
- ALRB setup now supports Python 3.11 (A. De Silva)
- Tested successfully manually/interactively on Lxplus9/Alma9
  - This includes Rucio stage-in (Rucio stage-out fails as expected, since I don’t have permission to write to the SE when running interactively, i.e. with my own proxy)
- Logstash 2.3.0 and 2.5.0 tested successfully for real-time logging
- Also tested new logstash version 2.5.0 on CentOS7, works fine
- Added Python 3.11 to flake8 and unit tests
Irrelevant ‘warning’ from lsetup cpu_flags is now ignored by the pilot (it would otherwise cause the pilot to fail to interpret the cpu_arch output)
Reduced number of ps command calls
- Pilot uses ps to get info about processes in various situations, which can be heavy on the system when several pilots are running simultaneously
- Cached ps output when collecting child pids
- Removed several ps calls
- Note: there are still quite a few ps calls during a long-running job since the output is needed for the CPU consumption reporting. This will soon be addressed as well, now that A. De Silva has made the psutil module available (the related pilot development is pending a wrapper update)
- Requested by J. Templon
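
The ps caching mentioned above could look roughly like the following sketch. All names (`run_ps`, `cached_ps`, `child_pids`), the `ps` options and the 30 s maximum cache age are assumptions made for illustration and are not taken from the pilot code.

```python
import subprocess
import time

# module-level cache: time of last ps call and its output
_cache = {"time": None, "output": None}


def run_ps():
    """Run ps once and return its raw output (pid, parent pid, command)."""
    result = subprocess.run(
        ["ps", "-A", "-o", "pid,ppid,args"],
        capture_output=True, text=True, timeout=60, check=False,
    )
    return result.stdout


def cached_ps(max_age=30.0, runner=run_ps):
    """Return ps output, calling runner() only if the cache is older than max_age seconds."""
    now = time.monotonic()
    if _cache["time"] is None or now - _cache["time"] > max_age:
        _cache["output"] = runner()
        _cache["time"] = now
    return _cache["output"]


def child_pids(parent_pid, ps_output):
    """Extract the pids of the direct children of parent_pid from ps output."""
    pids = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 2)
        if len(fields) >= 2 and fields[0].isdigit() and fields[1] == str(parent_pid):
            pids.append(int(fields[0]))
    return pids
```

With this layout, repeated lookups within the cache lifetime, e.g. `child_pids(pid, cached_ps())` for several parent pids, share a single ps call.
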
Moved multiprocessing module import to where it is used (instead of the top of the module)
- This prevents possible deadlocks when importing other modules, especially the Google Cloud Logging modules, which have known problems with multiprocessing
- This change is most relevant for Rubin since they have seen locking behavior when importing gcloud modules
- Change made in the job control module (multiprocessing is also used in the timer module)
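
The pattern of moving an import to the point of use can be sketched as follows. The function `get_cpu_count` is a hypothetical example and not the pilot function that was changed.

```python
def get_cpu_count():
    """Return the number of CPUs reported by the multiprocessing module."""
    # Deferred import: multiprocessing is only loaded when this function is
    # first called, not when the enclosing module is imported.
    import multiprocessing

    return multiprocessing.cpu_count()
```

Importing the enclosing module then no longer pulls in multiprocessing as a side effect; the cost is paid on the first call only, since later calls find the module already loaded.
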
No output file verification for Raythena jobs since the final job report will not be known by the pilot (Raythena will handle it)
- Requested by J. Esseiva
- https://its.cern.ch/jira/browse/ATLASAMI-316
Added size-based timeout to log file creation
- Based on the size of the work directory (min timeout set to 90s, max 3h)
- New error code 1376, "Log file creation timed out"
- Requested by X. Zhao (sPHENIX) but change is relevant also for ATLAS
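
A size-based timeout with the limits given above (minimum 90 s, maximum 3 h) can be sketched as a clamped estimate. The function name and the assumed processing rate of 5 MiB/s are illustrative assumptions only; the scaling used by the pilot is not stated here.

```python
def log_creation_timeout(workdir_size_bytes,
                         bytes_per_second=5 * 1024 * 1024,  # assumed rate, not from the pilot
                         min_timeout=90,
                         max_timeout=3 * 3600):
    """Return a timeout in seconds that grows with the work directory size.

    The estimate is clamped to the interval [min_timeout, max_timeout].
    """
    estimate = workdir_size_bytes / bytes_per_second
    return int(min(max(estimate, min_timeout), max_timeout))
```

For example, an empty work directory yields the 90 s minimum, while a very large one is capped at 10800 s (3 h).
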
Main command execute function updates
- Now thread safe
  - An strace from MWT2 provided by F. Luehring indicated a thread lock in the execute function
- Always use a timeout on command calls
  - Even a very long timeout is better than none, since it forces the Python subprocess code to flush the stdout buffer, which can otherwise be a problem on nodes with a huge number of cores
  - Congested stdout buffers can lead to hanging
  - Requested by W. Guan
Truncating WARNING field in job report if too large
- Report includes the first 25 warnings
- Original report is backed up and kept in the log
- Requested by R. Walker
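
The truncation can be sketched as below, assuming for simplicity that the report is a dictionary with a top-level "WARNING" list. That layout and the function name are assumptions for illustration; the real job report structure may be nested differently.

```python
import copy


def truncate_warnings(report, max_warnings=25):
    """Return a copy of the report that keeps only the first max_warnings warnings.

    The original report is left untouched so that it can be backed up separately.
    """
    truncated = copy.deepcopy(report)
    warnings = truncated.get("WARNING")
    if isinstance(warnings, list) and len(warnings) > max_warnings:
        truncated["WARNING"] = warnings[:max_warnings]
    return truncated
```
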
Real-time logging update
- rtlogging field is now experiment specific (used to define RT server)
- Requested by X. Zhao (sPHENIX), change is transparent for other experiments
Updated encoded HTCondor env var
- Requested by X. Zhao (sPHENIX)
Object store updates (Rubin)
1. Added an environment variable for selecting a different AWS_PROFILE: for Rubin, multiple object stores can be in use (the pilot uses one and the Rubin payload another). AWS_PROFILE can be used to select different credentials for authentication.
2. Added the copy_out_extend function: by default, the pilot puts all log files in a tar file and copies out this tar file. For Rubin, the log files need to be copied out separately without being put into a tar file, so an environment variable now controls whether copy_out_extend is used.
3. Fixed upload_files so that it can use a different endpoint and bucket name
PanDA/Dask integration related changes
- Pilot now keeps job in running state until lease time is up
- Interactive job can now be aborted by user

Contributions from W. Guan, P. Nilsson

PalNilsson and others added 30 commits July 13, 2023 17:36

New version

ee6d454

New version

693acfd

Allowing Raythena jobs to get guids from job report

33b65cf

Added LEASETIME

0e001b5

Added ct_start, ct_least

47094a1

Added check_lease_time()

06bf4b2

Added logic for lease time

997db1a

Added logic for lease time

593474c

Added logic for lease time

7e52675

Added logic for lease time

2026fcf

Added queues.monitored_payloads to stager workflow

18fface

Added job object to queues.monitored_payloads in stager mode

d4ac70d

Changed HTCondor_JOB_ID to HTCondor_PANDA, updated comments

006ea60

Revert change HTCondor_JOB_ID to HTCondor_PANDA

c6da944

Interactive dask job can now be aborted by user

457df83

Removed useless comment

5df7785

Aborting threads after tobekilled in stager mode

d298104

Ordering log transfer after dask user kill

e95b5e7

Changes requested by Xin

f702cc5

Changes requested by Xin

6c7642b

Changes requested by Julien (do output file verification for raythena)

797dd1d

Fixed irrelevant stderr from cpu_arch script

b030dda

Fixed lease time

ce0afb2

rtlogging is now experiment specific

1f96a7c

ssl_enable and ssl_verify are now experiment specific

74d2fbf

Updated comment

4d478c1

Updated

4424d34

Using get_rtlogging_ssl()

726df72

Added Pythno 3.11 flake8 and unit testing

e5190c3

Added Python 3.11 flake8 and unit testing

b7ab4d8

Paul Nilsson and others added 29 commits August 11, 2023 13:53

Now kills any defunct child processes when kill_processes() is called

35b229a

Testing defunct processes. Added zombie monitoring (preliminary)

d4066d6

Added non zero return code after looping

135e86e

Added non zero return code after looping

fbd5b8f

Updated log message

171ee1e

Corrected PAMDA_HOSTNAME

0e7da2e

Added timeout=300 to execute() call

d145a84

Removed test code

47c40ac

Cleanup

fa90713

Added zombie collector

10059b1

Added ps caching

99d71ad

Prevent RT logging from starting

c84132c

Added thread lock. Only storing subprocess pid once. Added child processes function

972aee1

Reaping zombies test

bc307eb

Removed one ps call

91c78f3

Corrected r+b in adler32 calculation (harmless)

296ab76

Cleanup test code

c2b4090

Cleaned up is_defunct()

8a1cd7e

Now measuring log file creation

94309ca

Test

5b153d5

Improved job monitor. Improved tar command timeout message

de1cff7

Removed test code

49a1f8d

Refactored _stage_out_new()

0ea91c5

Refactoring

043df12

Cleanup, fixed return status handling

1671620

Corrected r+b from rb in adler32 calculation

a30d5ab

Always use a timeout

aa1b52c

Updated comment

379bba4

Truncating metadata if necessary (too long WARNING field)

4bfc731

PalNilsson merged commit 6fff31f into PanDAWMS:next Sep 5, 2023
4 checks passed