Pilot Architecture
The Pilot is component-based, with each component responsible for different tasks. The main tasks are handled by controller components such as Job Control, Payload Control and Data Control. There is also a set of components with auxiliary functionality, e.g. the Pilot Monitor and the Job Monitor: the former is for internal use and monitors the pilot's own threads, while the latter is tied to the job and checks parameters relevant for the payload (e.g. size checks). The Information System component presents an interface to a database containing knowledge about the resource where the Pilot is running (e.g. which copy tool to use and where to read and write data).
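As a rough illustration of the kind of lookup the Information System enables, consider the following sketch. The `InfoService` class, its keys and values are hypothetical stand-ins, not the pilot's actual interface:

```python
class InfoService:
    """Toy stand-in for the Information System component (illustrative only)."""

    def __init__(self, queue_data):
        self._data = queue_data  # settings for the resource the pilot runs on

    def get(self, key, default=None):
        return self._data.get(key, default)


# Example resource description, of the kind the real component reads from a database
info = InfoService({"copytool": "rucio"})
print(info.get("copytool"))  # which copy tool to use for stage-in/out -> "rucio"
```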
The pilot workflows are described in the corresponding section.
Each pilot component runs as an independent thread in the pilot, and each spawns additional subthreads, described below. Most of the threads manipulate Job objects, which contain the full information for a job downloaded from the PanDA server or read from file. The Job objects are stored in globally available Python queues. A Job object is passed between queues until its processing has completed. The various threads monitor these queues and act on a Job object as it arrives.
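This queue-based hand-off can be sketched as follows; the class, queue and thread names are simplified illustrations of the pattern, not the pilot's actual code:

```python
import queue
import threading

# Globally available queues holding Job objects (names simplified for illustration)
jobs = queue.Queue()
validated_jobs = queue.Queue()

class Job:
    """Minimal stand-in for the pilot's Job object."""
    def __init__(self, pandaid, definition):
        self.pandaid = pandaid
        self.definition = definition  # the full job information from the server or a file
        self.state = "starting"

def validate():
    """Toy 'validate' thread: acts on Job objects as they arrive in the jobs queue."""
    while True:
        job = jobs.get()          # blocks until a Job object arrives
        job.state = "validated"   # the real thread would run user-defined verification here
        validated_jobs.put(job)   # hand the Job object on to the next queue
        jobs.task_done()

threading.Thread(target=validate, daemon=True).start()
jobs.put(Job(12345, {"jobPars": "--some-payload-options"}))
job = validated_jobs.get()        # a downstream thread would pick this up
print(job.pandaid, job.state)     # -> 12345 validated
```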
The Job Control component spawns five subthreads for various tasks:
- `retrieve`: Retrieves a job definition from any source and places it in the "jobs" queue. The job definition is a JSON dictionary that is either preplaced in the launch directory or downloaded from a server specified by `args.url` (pilot option)
- `validate`: Retrieves a Job object from the "jobs" queue. If it passes the user-defined verification, the main payload work directory is created (`PanDA_Pilot-<pandaid>`) in the main pilot work directory. The Job object is passed on to the "validated_jobs" queue, or to the "failed_jobs" queue in case of failure
- `create_data_payload`: Gets a Job object from the "validated_jobs" queue. If the job has defined input files, moves the Job object to the "data_in" queue and sets the internal pilot state to "stagein". If there are no input files, places the Job object in the "finished_data_in" queue. In either case, the thread also places the Job object in the "payloads" queue, where another thread will retrieve it and wait for any stage-in to finish (see the sketch after this list)
- `queue_monitor`: Monitors the internal Python queues, in particular whether a job has finished or failed, and reports to the server. A completed job is moved to the "completed_jobs" queue
- `job_monitor`: Monitors certain job parameters, such as job looping, at various time intervals. The main loop is executed once a minute, while individual verifications may be executed at any time interval (>= 1 minute). E.g. looping jobs are checked once per ten minutes (default), the heartbeat is sent once per 30 minutes, and memory usage is checked once a minute
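As referenced in the `create_data_payload` item above, the routing it performs can be sketched like this. The queue names follow the list above, while the `indata` attribute and the helper function are assumptions for illustration, not the pilot's actual code:

```python
import queue
from types import SimpleNamespace

# Illustrative queue set; the real pilot keeps its queues in a shared namespace
queues = SimpleNamespace(
    validated_jobs=queue.Queue(),
    data_in=queue.Queue(),
    finished_data_in=queue.Queue(),
    payloads=queue.Queue(),
)

def create_data_payload_once(queues):
    """One iteration of a create_data_payload-like thread (simplified sketch)."""
    job = queues.validated_jobs.get()
    if getattr(job, "indata", None):      # the job defines input files (assumed attribute)
        job.state = "stagein"             # internal pilot state while input is staged in
        queues.data_in.put(job)           # data control will perform the stage-in
    else:
        queues.finished_data_in.put(job)  # no input files, nothing to stage in
    queues.payloads.put(job)              # payload control waits for any stage-in to finish
    return job

# Usage: route a job that has one input file
job = SimpleNamespace(pandaid=12345, indata=["EVNT.pool.root"], state="validated")
queues.validated_jobs.put(job)
print(create_data_payload_once(queues).state)  # -> "stagein"
```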