# Pilot Architecture
The Pilot is component-based, with each component responsible for different tasks. The main tasks are handled by controller components, such as Job Control, Payload Control and Data Control. There is also a set of components with auxiliary functionality, e.g. the Pilot Monitor and the Job Monitor: the former is for internal use and monitors threads, while the latter is tied to the job and checks parameters relevant for the payload (e.g. size checks). The Information System component presents an interface to a database containing knowledge about the resource where the Pilot is running (e.g. which copy tool to use and where to read and write data).
The pilot workflows are described in the corresponding section.
Each pilot component runs as an independent thread in the pilot, and each spawns additional subthreads, described below. Most of the threads manipulate Job objects, which contain the full information for a job downloaded from the PanDA server or read from file. The Job objects are stored in globally available Python queues. A Job object is passed between queues until its processing has completed; the various threads monitor these queues and act on a Job object as it arrives.
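This queue-based hand-off can be illustrated with a minimal sketch (not the actual pilot code); the `Job` class, queue names and sentinel-based shutdown here are simplifications for illustration:

```python
import queue
import threading

class Job:
    """Illustrative stand-in for the pilot's Job object."""
    def __init__(self, pandaid):
        self.pandaid = pandaid
        self.state = "created"

jobs = queue.Queue()            # filled by the retrieve thread
validated_jobs = queue.Queue()  # filled by the validate thread

def validate():
    """Toy validate thread: move Job objects from one queue to the next."""
    while True:
        job = jobs.get()
        if job is None:         # sentinel used here to stop the toy thread
            break
        job.state = "validated"
        validated_jobs.put(job)

t = threading.Thread(target=validate)
t.start()
jobs.put(Job(12345))
jobs.put(None)                  # stop the worker
t.join()

job = validated_jobs.get()
print(job.pandaid, job.state)   # 12345 validated
```

The real pilot keeps many such threads alive for the whole pilot lifetime; the sentinel shutdown above is only a convenience for the example.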
Job Control spawns five subthreads for various tasks:

- `retrieve`: Retrieve a job definition from any source and place it in the "jobs" queue. The job definition is a JSON dictionary that is either preplaced in the launch directory or downloaded from a server specified by `args.url` (pilot option)
- `validate`: Retrieve a Job object from the "jobs" queue. If it passes the user-defined verification, the main payload work directory is created (`PanDA_Pilot-<pandaid>`) in the main pilot work directory. The Job object is passed on to the "validated_jobs" queue, or to the "failed_jobs" queue in case of failure
- `create_data_payload`: Get a Job object from the "validated_jobs" queue. If the job has defined input files, move the Job object to the "data_in" queue and set the internal pilot state to "stagein". If there are no input files, place the Job object in the "finished_data_in" queue. In either case, the thread also places the Job object in the "payloads" queue (another thread will retrieve it and wait for any stage-in to finish)
- `queue_monitor`: Monitoring of the internal Python queues. This thread monitors queue activity, specifically whether a job has finished or failed, and reports to the server. A completed job is moved to the "completed_jobs" queue
- `job_monitor`: Monitoring of job parameters. This thread monitors certain job parameters, such as job looping, at various time intervals. The main loop is executed once a minute, while individual verifications may be executed at any time interval (>= 1 minute); e.g. looping jobs are checked once every ten minutes (default), the heartbeat is sent once every 30 minutes, and memory usage is checked once a minute
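The first branch of the `retrieve` step, reading a preplaced job definition, can be sketched as follows. This is a hedged illustration: the file name `pandaJobData.out` and the dictionary keys are assumptions, not necessarily the pilot's actual names.

```python
import json
import os
import queue
import tempfile

job_queue = queue.Queue()

def retrieve(launch_dir):
    """Load a preplaced JSON job definition and place it in the jobs queue."""
    path = os.path.join(launch_dir, "pandaJobData.out")  # assumed file name
    if os.path.exists(path):
        with open(path) as f:
            job_definition = json.load(f)
        job_queue.put(job_definition)
        return True
    return False  # would otherwise download from the server given by args.url

# demo: prepare a fake launch directory with a preplaced job definition
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "pandaJobData.out"), "w") as f:
        json.dump({"PandaID": 12345, "jobPars": ""}, f)
    retrieve(d)

job_definition = job_queue.get()
print(job_definition["PandaID"])  # 12345
```

In the real pilot, the downloaded or preplaced dictionary is then turned into a Job object before being queued.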
Payload Control spawns four threads related to executing the main payload:

- `validate_pre`: Get a Job object from the "payloads" queue. If the payload is successfully validated (user defined), the Job object is placed in the "validated_payloads" queue; otherwise it is placed in the "failed_payloads" queue
- `execute_payloads`: Extract a Job object from the "validated_payloads" queue and put it in the "monitored_jobs" queue. The payload stdout/stderr streams are opened and the pilot state is changed to "starting". A payload executor is selected (for executing a normal job, an event service job or an event service merge job). After the payload (or rather its executor) has started, the thread waits for it to finish and then checks for any failures. A successfully completed job is placed in the "finished_payloads" queue, and a failed job in the "failed_payloads" queue
- `validate_post`: Validate finished payloads. The completed job is added to the "data_out" queue
- `failed_post`: Get a Job object from the "failed_payloads" queue. Set the pilot state to "stageout" and the stageout field to "log", and add the Job object to the "data_out" queue
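The core of the `execute_payloads` step, opening the stdout/stderr streams, running the payload, and routing the job by exit code, can be sketched like this. The command, file names and dictionary-based job are illustrative assumptions; the real thread delegates to a payload executor object.

```python
import os
import queue
import subprocess
import tempfile

finished_payloads = queue.Queue()
failed_payloads = queue.Queue()

def execute_payload(job, workdir):
    """Run the payload, capturing stdout/stderr, then queue the result."""
    out = open(os.path.join(workdir, "payload.stdout"), "w")
    err = open(os.path.join(workdir, "payload.stderr"), "w")
    try:
        proc = subprocess.Popen(job["command"], shell=True,
                                stdout=out, stderr=err)
        exit_code = proc.wait()  # the real thread also monitors while waiting
    finally:
        out.close()
        err.close()
    job["exit_code"] = exit_code
    (finished_payloads if exit_code == 0 else failed_payloads).put(job)

with tempfile.TemporaryDirectory() as d:
    execute_payload({"PandaID": 1, "command": "echo hello"}, d)

print(finished_payloads.qsize())  # 1
```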
Data Control manages three threads:

- `copytool_in`: Call the stage-in function and put the Job object in the proper queue. Get a Job object from the "data_in" queue and place it immediately in the "current_data_in" queue. Notify the server that the job is in the "running" state (note that the payload is not yet running, but the job is). The stage-in function is called, and if it finishes correctly the Job object is moved to the "finished_data_in" queue and removed from the "current_data_in" queue. In case of stage-in failure, the Job object is moved to the "failed_data_in" queue
- `copytool_out`: Perform stage-out as soon as a Job object can be extracted from the "data_out" queue. If the stage-out function finishes correctly, place the Job object in the "finished_data_out" queue; in case of failure, place it in the "failed_data_out" queue
- `queue_monitoring`: Monitoring of the data queues. If a Job object can be extracted from the "failed_data_in" (or "failed_data_out") queue, set the stageout field to "log" and attempt to stage out the log. If a Job object can be extracted from the "finished_data_out" queue, put a successful job in the "finished_jobs" queue and a failed job in the "failed_jobs" queue
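The failure branch of `queue_monitoring` can be sketched as below: a job pulled from a failed-data queue gets its stageout field set to "log" so that at least the log file can still be staged out. Field names here are illustrative.

```python
import queue

failed_data_in = queue.Queue()

def monitor_failed_data_in():
    """Non-blocking check of the failed_data_in queue, as the monitor loop would do."""
    try:
        job = failed_data_in.get_nowait()
    except queue.Empty:
        return None
    job["stageout"] = "log"  # only the log file remains to be staged out
    return job

failed_data_in.put({"PandaID": 2, "stageout": "all"})
job = monitor_failed_data_in()
print(job["stageout"])  # log
```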
The job_monitor thread is described in the Job Control section above. The Pilot Monitor is responsible for internal thread monitoring and will report if a thread is no longer alive. It also provides internal memory monitoring of the pilot process and makes sure that the pilot does not exceed the maximum allowed running time.
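The thread-liveness check performed by the Pilot Monitor amounts to polling `is_alive()` on each registered thread; a minimal sketch (thread name chosen here for illustration):

```python
import threading
import time

def short_task():
    time.sleep(0.1)  # stand-in for a component thread's work

threads = [threading.Thread(target=short_task, name="job.retrieve")]
for t in threads:
    t.start()

def find_dead_threads(threads):
    """Return the names of monitored threads that have stopped running."""
    return [t.name for t in threads if not t.is_alive()]

for t in threads:
    t.join()
print(find_dead_threads(threads))  # ['job.retrieve']
```

The real monitor runs this kind of check periodically and reports (or reacts to) any thread that has died unexpectedly.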