3.8.1.66 #140

Merged
merged 135 commits into from
Sep 9, 2024


PalNilsson
Collaborator

  • Added default path for ifconfig command (used to lookup IPv6 info) if command not found
  • Support for OIDC tokens in urllib based request function (used for pilot-PanDA server communications)
    • Together with a token key, the primary OIDC token is used to download a shorter-lived token, which is used in later communications with the PanDA server
    • The pilot refreshes the token immediately after launch, overwriting the original long-lived token
    • The short-lived tokens are refreshed periodically (once every 60 minutes)
    • Note: OIDC tokens are used by default if found locally, otherwise X509 is used - i.e. there is no corresponding pilot option to activate the mechanism
  • SIGTERM signals received on Kubernetes resources are now reported with new error code 1379, “Job was preempted”
  • Added two error codes for arcproxy failures
    • 1380: “General arcproxy failure” (was previously reported as 1008: “General pilot error, consult batch log”)
    • 1381: “Arcproxy failure while loading shared libraries”
      • Note: this (1381) is currently only used internally and does not lead to a failed job
  • Remote file open container now using EL9 instead of CentOS7
    • Required for latest ROOT release
    • Requested by A. De Silva
  • RUCIO_ACCOUNT is no longer set for the payload
    • Requested by R. Walker
  • Added a timeout to the gdb command execution (for producing a core dump file) when a looping job has been discovered
    • Requested by R. Walker
  • Real-time logging
    • Now possible to specify real-time logging server (type, protocol, URL and port) via pilot argument
      • Previously, it only worked via pilot config
      • Requested by W. Guan
    • Added Loki real-time logging module (Rubin)
    • Real-time logging can now be activated for all jobs on a given queue (relevant for pilot logs, not payload stdout)
      • Activation currently via PQ.catchall
      • Streaming of pilot logs requested by I. Vukotic
      • To be tested more widely
  • New pilot option --noworkerpilotstatusupdate can be used to switch off worker pilot status updates
    • Needed at NERSC
    • Requested by T. Maeno
  • Added timeout to urlopen() used for pilot-PanDA server communication
    • The default timeout is too short and, for getjob operations, can lead to “jobdispatcher, 102: Sent job didn't receive reply from pilot within 30 min” errors
    • In case of failure, the pilot currently falls back to curl based communication
    • Timeout is now explicitly set to 30 s
    • Reported by Z. Yang (Rubin)
  • Bug fix
    • Patch for the final job completion state being set before log stage-out had completed
      • This led to the “ddm, 200: Could not get GUID/LFN/MD5/FSIZE/SURL from pilot XML” error
      • Reported by R. Walker, discussed in JIRA ticket ATLASPANDA-1047
  • Housekeeping with pylint
    • The average pylint score of all pilot modules is 9.56
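
For the urlopen() timeout item above, the pattern can be sketched roughly as follows. This is only an illustration, not the pilot's actual code: the function name `send_request` and the `opener` and `fallback` parameters are hypothetical, with `fallback` standing in for the curl based communication path. Only the 30 s value and the fall-back behaviour come from the notes.

```python
import socket
import urllib.error
import urllib.request

PANDA_TIMEOUT = 30  # seconds, the value stated in the release notes


def send_request(url, data=None, timeout=PANDA_TIMEOUT,
                 opener=urllib.request.urlopen, fallback=None):
    """Call urlopen with an explicit timeout; on failure use the fallback.

    `opener` and `fallback` are hypothetical injection points for this
    sketch; `fallback` stands in for a curl based transport.
    """
    try:
        # Pass the timeout explicitly instead of relying on the default.
        with opener(url, data=data, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, socket.timeout, TimeoutError) as exc:
        if fallback is None:
            raise
        # Hand the failed request over to the alternative transport.
        return fallback(url, data, exc)
```

Passing the opener as a parameter is a choice made for this sketch so that the timeout and fall-back logic can be exercised without a network connection.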

Contributions from W. Guan, P. Nilsson
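
The OIDC behaviour described in the notes (use a token if one is found locally, otherwise X509, and refresh once every 60 minutes) could look roughly like the sketch below. The function names `choose_auth` and `refresh_due`, and the idea of passing file paths in, are assumptions made for illustration; they are not taken from the pilot source.

```python
import os
import time

REFRESH_INTERVAL = 60 * 60  # seconds; "once every 60 minutes" per the notes


def choose_auth(token_path, x509_path):
    """Return ("oidc", path) if a token file exists locally, else ("x509", path).

    Hypothetical helper: no pilot option is involved, the choice depends
    only on whether the token file is present.
    """
    if token_path and os.path.isfile(token_path):
        return "oidc", token_path
    return "x509", x509_path


def refresh_due(last_refresh, now=None, interval=REFRESH_INTERVAL):
    """Return True when at least `interval` seconds have passed since the last refresh."""
    now = time.time() if now is None else now
    return now - last_refresh >= interval
```

A caller would check `refresh_due()` in its monitoring loop and, when it returns True, download a new short-lived token and record the new refresh time.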

@PalNilsson PalNilsson merged commit e1e6571 into PanDAWMS:next Sep 9, 2024
6 checks passed