3.6.5.32 #96

Merged: 64 commits, Sep 5, 2023
@PalNilsson PalNilsson commented Sep 5, 2023

  • Measures against problems with lingering defunct processes
    • Added internal timeout of 300s to curl call (on top of existing connection and max time timeouts)
    • Added zombie reaper as part of job monitoring, executed at the time of the looping job check, i.e. every ten minutes
    • Problems seen at least at MWT2, with a large number of lingering prmon processes and work directories
      • Prmon is suspected to have a problem with being killed by SIGUSR1, which often leaves it in a defunct state. The pilot waits for the process to end normally, but this mostly fails. The issue is being discussed with the prmon developers
      • The problem appears to have started around July 29 (unclear why), but lingering prmon processes seem to be everywhere. These defunct processes normally go away when the top bash process ends, but can linger if parent processes are hard-killed. If there are too many defunct processes, the batch system may kill the parent without warning, which leaves lingering work directories that in turn require external cleanup
  • Python 3.11 tests
    • ALRB setup now supports Python 3.11 (A. De Silva)
    • Tested successfully manually/interactively on Lxplus9/Alma9
      • This included Rucio stage-in (Rucio stage-out fails as usual, since I don’t have permission to write to the SE while running interactively, i.e. with my own proxy)
    • Logstash 2.3.0 and 2.5.0 tested successfully for real-time logging
    • Also tested new logstash version 2.5.0 on CentOS7, works fine
    • Added Python 3.11 to flake8 and unit tests
  • Irrelevant ‘warning’ from lsetup cpu_flags is now ignored by the pilot (it would otherwise cause the pilot to fail to interpret the cpu_arch output)
  • Reduced number of ps command calls
    • The pilot uses ps to get info about processes in various situations, which can be heavy on the system when several pilots are running simultaneously
    • Cached ps output when collecting child pids
    • Removed several ps calls
    • Note: there are still quite a few ps calls during a long-running job, since the output is needed for CPU consumption reporting. This will also be addressed soon, as A. De Silva has made the psutil module available (the related pilot development is pending a wrapper update)
    • Requested by J. Templon
  • Moved multiprocessing module import to where it is used (instead of the top of the module)
    • This prevents possible deadlocks when importing other modules, especially the Google Cloud Logging modules, which have known problems with multiprocessing
    • The change is most relevant for Rubin, since they have seen locking behavior when importing gcloud modules
    • The change was made in the job control module (the timer module also uses multiprocessing)
  • No output file verification for Raythena jobs since the final job report will not be known by the pilot (Raythena will handle it)
  • Added size-based timeout to log file creation
    • Based on the size of the work directory (min timeout set to 90s, max 3h)
    • New error code 1376, "Log file creation timed out"
    • Requested by X. Zhao (sPHENIX) but change is relevant also for ATLAS
  • Main command execute function updates
    • Now thread safe
      • An strace from MWT2 provided by F. Luehring indicated a thread lock in the execute function
    • Always use a timeout on command calls
      • A ridiculously long timeout is better than none, since it forces the Python subprocess code to flush the stdout buffer, which can otherwise be a problem on nodes with a huge number of cores
      • Congested stdout buffers can lead to hanging
      • Requested by W. Guan
  • Truncating WARNING field in job report if too large
    • Report includes the first 25 warnings
    • Original report is backed up and kept in the log
    • Requested by R. Walker
  • Real-time logging update
    • rtlogging field is now experiment specific (used to define RT server)
    • Requested by X. Zhao (sPHENIX), change is transparent for other experiments
  • Updated encoded HTCondor env var
    • Requested by X. Zhao (sPHENIX)
  • Object store updates (Rubin)
    1. Added an environment variable for selecting a different AWS_PROFILE: for Rubin, multiple object stores can be in use (the pilot uses one and the Rubin payload another). AWS_PROFILE can be used to select different credentials for authentication.
    2. Added the copy_out_extend function: by default, the pilot puts all log files in a tar file and copies out that tar file. For Rubin, the log files need to be copied out separately, without being put into a tar file, so I added an environment variable that controls whether copy_out_extend is used.
    3. Fixed upload_files so that it can use a different endpoint and bucket name
  • PanDA/Dask integration related changes
    • Pilot now keeps job in running state until lease time is up
    • Interactive job can now be aborted by user
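
For readers unfamiliar with the "Main command execute function updates" item above, the idea of a thread-safe command call that always carries a timeout can be sketched roughly as follows. This is only an illustration, not the pilot's actual code: the lock name `_popen_lock`, the default timeout value, and the choice to serialize only process creation are all assumptions made for the example. It is POSIX-only, since it relies on process groups.

```python
import os
import signal
import subprocess
import threading

# Hypothetical module-level lock; the pilot's real names and layout may differ.
_popen_lock = threading.Lock()


def execute(command, timeout=3600):
    """Run a shell command and return (exit_code, stdout, stderr).

    A timeout is always applied; the default here is an arbitrary assumption.
    """
    # Serialize only the process creation, so that threads do not block
    # each other while their commands are running.
    with _popen_lock:
        process = subprocess.Popen(
            command,
            shell=True,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
            start_new_session=True,  # own process group, so children can be killed too
        )
    try:
        # communicate() drains stdout/stderr while waiting, which avoids
        # a hang caused by a full pipe buffer.
        stdout, stderr = process.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Kill the whole process group, then collect whatever output remains.
        os.killpg(process.pid, signal.SIGKILL)
        stdout, stderr = process.communicate()
    return process.returncode, stdout, stderr
```

A call such as `execute("echo hello")` returns the exit code together with the captured output, while a command that exceeds the timeout is killed and reported with a non-zero exit code.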

Contributions from W. Guan, P. Nilsson
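
As a rough illustration of the size-based log creation timeout listed above, the sketch below clamps a size-derived estimate between the stated limits (90 s minimum, 3 h maximum). The function name and the assumed processing rate are hypothetical; the notes do not say how the pilot actually scales the timeout with the work directory size.

```python
MIN_TIMEOUT = 90          # seconds, lower limit from the notes above
MAX_TIMEOUT = 3 * 3600    # seconds, upper limit from the notes above

# Hypothetical rate used only for this example; the pilot's scaling is not stated.
ASSUMED_BYTES_PER_SECOND = 5 * 1024 * 1024


def log_creation_timeout(workdir_size_bytes):
    """Return a timeout in seconds, scaled by size and clamped to the limits."""
    estimate = workdir_size_bytes // ASSUMED_BYTES_PER_SECOND
    return max(MIN_TIMEOUT, min(MAX_TIMEOUT, estimate))
```

With this scheme a nearly empty work directory still gets the 90 s minimum, and a very large one is capped at three hours, after which error code 1376 would apply.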

@PalNilsson PalNilsson merged commit 6fff31f into PanDAWMS:next Sep 5, 2023
4 checks passed