This repository has been archived by the owner on Jan 30, 2024. It is now read-only.
Paul Nilsson edited this page Apr 30, 2021 · 16 revisions

Introduction

The PanDA Pilot has been used by ATLAS and other experiments for well over a decade. To meet the demands of extending PanDA beyond grids and ATLAS, the original Pilot (henceforth referred to as Pilot 1) was rewritten and Pilot 2 was born.

What does the PanDA Pilot do?

The task of the PanDA Pilot is to execute and monitor work units on a worker node, either on the job or event level. On the job level, the work unit is a payload that a user or production system wants to execute. The payload has certain requirements, e.g. input and output files, that are staged by the Pilot, and needs a working environment (incl. containers) that is set up by the Pilot. On the event level, the Pilot launches and feeds a payload with event ranges (a set of events to be processed) downloaded from a server.
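The event-level mode described above can be pictured as a simple feeding loop. The following is only a minimal sketch and not the Pilot's actual code: the `EventRange` type, its fields, and the `download_ranges` and `process_range` callables are hypothetical names introduced for illustration, and the assumption that an empty batch means "no more work" is ours.

```python
from typing import Callable, List, NamedTuple


class EventRange(NamedTuple):
    """Hypothetical event range: a contiguous set of events to process."""
    range_id: str
    first_event: int
    last_event: int  # inclusive


def feed_payload(
    download_ranges: Callable[[], List[EventRange]],
    process_range: Callable[[EventRange], None],
) -> int:
    """Feed event ranges to a payload until the server has none left.

    Returns the total number of events handed to the payload.
    """
    events_fed = 0
    while True:
        batch = download_ranges()  # assumed: an empty list means no more work
        if not batch:
            break
        for event_range in batch:
            process_range(event_range)
            events_fed += event_range.last_event - event_range.first_event + 1
    return events_fed
```

In this sketch the server connection and the payload are passed in as plain callables, which keeps the loop easy to exercise with stand-ins.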

Job cycle

Jobs are downloaded and processed sequentially until the Pilot runs out of time (defined by PQ.timefloor). All sub-steps are optional and can be executed in containers. The Pilot monitors all steps and reports to the server regularly.
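As a rough illustration of this cycle, the sketch below fetches and runs jobs one after another until a time limit is reached. It is not the Pilot's implementation: `fetch_job`, `run_step`, `report`, the job dictionary layout and the step names are hypothetical, and the time limit is assumed to be given in seconds (the unit of PQ.timefloor is not stated here).

```python
import time
from typing import Callable, Dict, List, Optional


def run_job_cycle(
    fetch_job: Callable[[], Optional[Dict]],
    run_step: Callable[[Dict, str], None],
    report: Callable[[int, str], None],
    timefloor_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> List[int]:
    """Process jobs sequentially until the time limit is reached.

    Returns the ids of the jobs that were processed.
    """
    start = clock()
    processed = []
    # A new job is only requested while the elapsed time is below the limit.
    while clock() - start < timefloor_s:
        job = fetch_job()
        if job is None:  # assumed: the server has nothing to hand out
            break
        # Every sub-step is optional, so a job simply lists the ones it needs.
        for step in job.get("steps", []):
            run_step(job, step)
            report(job["id"], step)  # keep the server informed after each step
        processed.append(job["id"])
    return processed
```

Note that the time check happens only between jobs in this sketch, so a job that has already started is allowed to finish.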

PanDA Pilot highlights

  • The Pilot is launched by a pilot wrapper script sent to the WN by Harvester/batch system
  • It is responsible for running payloads created by users or the production system, monitoring all steps and keeping the server updated
    • Any necessary input and produced output is transferred from/to the current storage element
    • Input may be accessed directly from storage
    • The payload may consist of a suite of pre-, co- and post-processes as well as the main payload itself (e.g. in HPO jobs)
    • The Pilot can execute special utility processes (e.g. xcache service and memory monitoring tools)
    • All processes can be executed in their own containers, either predetermined or set by the users
    • In the event service mode, the Pilot launches and feeds a payload with event ranges (a set of events to be processed) downloaded from the server
  • All server communications use HTTPS
  • File transfers are handled by dedicated copy tools
    • rucio, xrdcp, gfal, gs, s3, mv/cp/ln, objectstore, lsm (locally defined)
  • Support for HPCs with no outside network communications is provided via plug-ins or entire workflows (in the case of Raythena)
  • Identification and reporting of 120+ unique errors
  • Troublesome payloads can be debugged live (PanDA monitor) via the tail of the most recently modified file, which is uploaded on each server update (every five minutes in debug mode)
  • The Pilot is user (“experiment”) independent and the user codes are stored in plug-ins
  • The current version is Python 2 and 3 compliant (Python 2 support is expected to be dropped in late 2021)
    • A GitHub pull request triggers unit tests and flake8 verifications for Python 2.7, 3.6, 3.7 and 3.8 (as well as automatic code documentation)
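
The live-debugging highlight above relies on sending the tail of the most recently modified file with each server update. A minimal sketch of how such a tail could be extracted is shown below. The function name `tail_of_latest_file`, the `max_bytes` limit and the choice to look only at the top level of the work directory are assumptions made for illustration, not the Pilot's actual behaviour.

```python
from pathlib import Path
from typing import Optional, Tuple


def tail_of_latest_file(workdir: str, max_bytes: int = 2048) -> Optional[Tuple[str, str]]:
    """Return (file name, last max_bytes of text) for the newest file in workdir.

    Returns None if the directory contains no regular files.
    """
    files = [path for path in Path(workdir).iterdir() if path.is_file()]
    if not files:
        return None
    # The most recently modified file is assumed to be the one still being written.
    latest = max(files, key=lambda path: path.stat().st_mtime)
    size = latest.stat().st_size
    with latest.open("rb") as handle:
        handle.seek(max(0, size - max_bytes))  # jump close to the end of the file
        data = handle.read()
    # Replace undecodable bytes, since the cut may fall inside a multi-byte character.
    return latest.name, data.decode("utf-8", errors="replace")
```

Reading only the last few kilobytes keeps each update small, whatever the size of the payload output.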

How does the Pilot fit into the PanDA hierarchy?

The PanDA Pilot is executed on the worker nodes on local resources, on grids and clouds, on HPCs and on volunteer computers. It is downloaded and run by wrapper scripts that are sent by Pilot factories to the worker nodes via batch systems. A Pilot interacts with the PanDA server in one of three ways: directly, via a local instance of the ARC Control Tower (a job management framework used on NorduGrid), or via the resource-facing Harvester service (which provides resource provisioning and workload shaping).