
JobRequirements RFC


RFC #7

Authors: S.Poss, A.Tsaregorodtsev

Last Modified: 9.03.2013

User jobs can specify various requirements on the resources eligible for their execution. In most cases these requirements are given as reserved keywords in the job description (JDL). The keywords are:

  • Site
  • BannedSite
  • Platform
  • CPUTime

If these parameters are specified in the job description, they are added to the definition of the corresponding Task Queue (TQ). The pilots provide a resource description which is matched against the TQs by the Matcher service.
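For illustration, these keywords are typically set through the DIRAC Job API; the following is a minimal sketch (setDestination, setBannedSites and setCPUTime exist in the DIRAC Interfaces API, although the exact set of helpers varies between DIRAC versions, and the site names are examples):

from DIRAC.Interfaces.API.Job import Job

job = Job()
job.setExecutable( 'my_executable' )
job.setDestination( 'LCG.CERN.ch' )          # Site
job.setBannedSites( ['LCG.SomeSite.org'] )   # BannedSite
job.setCPUTime( 86400 )                      # CPUTime, in seconds
# Platform is set with a version-dependent helper and is omitted here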

The standard matching mechanism described above is very efficient but also rather limited. Not all requirements can be expressed in terms of the predefined job parameters. New activities can require new resource specifications to be requested by user jobs; examples are preinstalled software tags, specific services available on site (e.g. databases), memory available for jobs, CPU models, etc. There is therefore a need to add arbitrary, non-predefined characteristics to the resources (sites, CEs, queues) that can be used in job requirements without changing the code or the schema of the TQ database. In the present RFC, we propose such a mechanism.

Resources specification

As described in RFC #5, the new Resources description schema in the CS allows arbitrary parameters to be specified at each level (Site, Computing Element, Queue), both valid for all the VOs that can use the resources and valid for a given VO only. These parameters can be used to select a set of resources according to a given specification using the Resources Configuration helper. For example:

result = Resources( vo = vo ).getEligibleNodes( 'Queue', ceSelectDict, queueSelectDict )

As a result, a tree structure (nested dictionaries) is returned, starting from the Site level. This structure makes it possible to construct a list of eligible Sites, CEs and Queues with respect to the given selection criteria. It is important to note that queues inherit parameters from their computing elements and sites, and computing elements inherit theirs from the sites. A queue can override the specification made at the computing element or site level if the queue value is more specific than that of its parent.
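As an illustration, a minimal sketch of flattening the returned tree into ( site, CE, queue ) triplets; the exact nesting of the dictionaries is an assumption based on the description above:

result = Resources( vo = vo ).getEligibleNodes( 'Queue', ceSelectDict, queueSelectDict )
if not result['OK']:
   raise RuntimeError( result['Message'] )

eligibleQueues = []
# Assumed structure: { site : { ce : { queue : queueParameters } } }
for site, ceDict in result['Value'].items():
   for ce, queueDict in ceDict.items():
      for queue in queueDict:
         eligibleQueues.append( ( site, ce, queue ) )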

Job Description

In the job description, which is usually written in JDL but can also use the new CFG-format-based structure, the specific job requirements are described in a special section called Requirements. For example:

[
   Executable = "my_executable";
   ...
   Requirements = [
      SoftwareTag = { "AppVersion1","AppVersion2" };
      CPUModel = "Intel Xeon";
      Memory = 4000;
   ]
]
The parameter specification in the Requirements section must allow the selection of eligible queues. In particular, the following rules apply (a sketch implementing them follows the list):
  • if a parameter is specified in the job Requirements section and is not present in the resource description, the resource is not eligible for the job;
  • if the parameter value is a list, resources having one of these values in their description are eligible;
  • parameters can have numerical values that should be compared to the resource capacity, e.g. Memory in the example above. The parameter type can be deduced from the parameter description in the JobDescription section of the <Operations> configuration. In this case, the resource capacity should be greater than the job requirement.
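The sketch below implements these three rules for a single queue; the set of numerical parameters is hard-coded here, whereas in reality it would be deduced from the JobDescription section of the Operations configuration:

def queueMatchesRequirements( requirements, queueParams, numericalParams = ( 'Memory', 'CPUTime' ) ):
  """ Check the description of a queue against the job Requirements section """
  for name, required in requirements.items():
    if name not in queueParams:
      return False                                  # rule 1: missing parameter -> not eligible
    value = queueParams[name]
    if name in numericalParams:
      if float( value ) < float( required ):        # rule 3: capacity must exceed the requirement
        return False
    elif isinstance( required, ( list, tuple ) ):   # rule 2: any of the listed values is enough
      resourceValues = value if isinstance( value, ( list, tuple ) ) else [value]
      if not set( required ) & set( resourceValues ):
        return False
    elif value != required:
      return False
  return True

# With the example job description above:
#   queueMatchesRequirements( { 'SoftwareTag' : ['AppVersion1', 'AppVersion2'], 'Memory' : 4000 },
#                             { 'SoftwareTag' : ['AppVersion2'], 'Memory' : 8000 } )  ->  True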

Applying job requirements

The job requirements are applied in two steps. First, a list of eligible queues is evaluated and stored in the database as a SubmitPool object. Second, each pilot is instrumented with a method to determine the list of SubmitPools that it is eligible for. Finally, the matching is done with the standard mechanism using the SubmitPool TQ parameter.

Processing job requirements

When a new job arrives at the JobManager, it goes through a chain of Optimizer executors which check the job's validity and prepare its entry into the Task Queue. If the job specifies Requirements as described above, a special executor performs the following actions (a sketch of the hashing step follows the list):

  • extracts the job requirements, orders them alphabetically and creates a unique hash of the given set of requirements; this hash value will be referred to as the SubmitPool identifier;
  • stores the job requirements in a special SubmitPools table ( see below ) with the hash as the primary key;
  • obtains the eligible queues and sites for the given requirements using the Resources helper as described above;
  • stores the queues eligible for this set of job requirements in a special table defining the SubmitPool-to-queue correspondence, where a queue is defined by its Site/CEName/QueueName set of names;
  • optionally adds the list of eligible sites to the Site job description parameter;
  • sets the SubmitPool parameter of the job description to the identifier corresponding to the job requirements.
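A minimal sketch of the hashing step; the canonical string form and the choice of hash function are assumptions, the RFC only mandates the alphabetical ordering and the uniqueness of the result:

import hashlib

def submitPoolIdentifier( requirements ):
  """ Compute the SubmitPool identifier for a set of job requirements """
  # Alphabetical ordering makes the hash independent of the order in which
  # the Requirements section was written in the job description
  canonical = ';'.join( '%s=%s' % ( name, requirements[name] ) for name in sorted( requirements ) )
  return hashlib.md5( canonical.encode( 'utf-8' ) ).hexdigest()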

Instrumenting pilots with SubmitPool data

There are two cases: pilots submitted by SiteDirectors and pilots submitted by TaskQueueDirectors. In both cases the Matcher service interface getSubmitPoolsForQueue() is used; however, the SiteDirector case puts less load on the Matcher service, since a single call serves a whole batch of submitted pilots.

SiteDirectors

SiteDirectors submit pilots to a specific queue. At submission time they interrogate the Matcher service with the getSubmitPoolsForQueue() query, which returns all the SubmitPools defined for the queue (if any). The list of SubmitPools obtained this way is included in the pilot arguments at submission and is passed by the pilot directly to the Matcher when asking for jobs.
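A sketch of the SiteDirector side; the Matcher service location follows the usual DIRAC WorkloadManagement naming, while the argument shape of getSubmitPoolsForQueue() and the pilot option used to pass the result are assumptions:

from DIRAC.Core.DISET.RPCClient import RPCClient

site, ceName, queueName = 'LCG.CERN.ch', 'ce503.cern.ch', 'grid_2nh'   # example queue
pilotOptions = []

matcher = RPCClient( 'WorkloadManagement/Matcher' )
# One call serves the whole batch of pilots submitted to this queue;
# the queue is identified by its Site/CEName/QueueName triplet
result = matcher.getSubmitPoolsForQueue( site, ceName, queueName )
if result['OK'] and result['Value']:
  pilotOptions.append( '-o SubmitPools=%s' % ','.join( result['Value'] ) )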

TaskQueueDirectors

When pilots are submitted by TaskQueueDirectors, there is no way to know in advance in which queue they will run. Therefore, such pilots make a getSubmitPoolsForQueue() query to the Matcher service themselves, presenting the queue they happen to be running in. Once the list of SubmitPools is obtained, it can be used in the job requests.

Other considerations

Keeping SubmitPool definitions up to date

The proposed solution assumes that the properties of computing resources do not change often: only static parameters can be used. Still, site parameters can change after jobs are submitted but before they are picked up by pilots. In this case, either a no-longer-eligible resource can pick up a job, which will then likely fail, or an eligible resource will not be able to pick up a job. To minimize failures of this kind, the SubmitPool-to-queue mapping stored in the TaskQueueDB should be refreshed at regular intervals (for example, every 15 minutes). The SubmitPool definition in the database should carry a time stamp of its last update.
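A sketch of the refresh logic such an update could follow; the TaskQueueDB accessors used here are hypothetical, the RFC only requires the mapping and the time stamp to be stored:

import time

REFRESH_INTERVAL = 15 * 60   # seconds, following the 15 minute example above

def refreshSubmitPools( taskQueueDB, resources ):
  """ Recompute the SubmitPool -> queue mapping for stale SubmitPools """
  now = time.time()
  for poolID, requirements, lastUpdate in taskQueueDB.getSubmitPools():   # hypothetical accessor
    if now - lastUpdate < REFRESH_INTERVAL:
      continue
    # Re-evaluate the eligible queues with the Resources helper; splitting the
    # stored requirements into CE and queue selection dictionaries is omitted
    result = resources.getEligibleNodes( 'Queue', {}, requirements )
    if result['OK']:
      taskQueueDB.setSubmitPoolQueues( poolID, result['Value'], now )     # hypothetical accessor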

Cleaning SubmitPools

When no more jobs corresponding to a given SubmitPool are available in the TaskQueue, the SubmitPool should be removed from the database. This should not necessarily be done as soon as the last job defining the SubmitPool is picked up (it can be rescheduled soon after). Rather, the SubmitPools can be cleaned up by a dedicated agent running at regular intervals.
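Correspondingly, a sketch of the criterion such a cleaning agent could apply; the grace period and the accessor names are assumptions:

import time

GRACE_PERIOD = 3600   # seconds, to survive a quick reschedule of the last job

def cleanSubmitPools( taskQueueDB ):
  """ Remove SubmitPools that no longer correspond to any job in the TaskQueue """
  now = time.time()
  for poolID, jobCount, lastSeen in taskQueueDB.getSubmitPoolUsage():   # hypothetical accessor
    if jobCount == 0 and now - lastSeen > GRACE_PERIOD:
      taskQueueDB.deleteSubmitPool( poolID )                            # hypothetical accessor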

Initial draft below

# JobRequirements applied during TaskQueue creation

There is the idea to apply job-site requirements (like software tags) during the TaskQueue creation, and not during the Matching. This would use the new Resources description (in v6r8). What it implies (as far as I can tell) is to follow the same kind of structure that the InputData treatment has:

## Change in JobDB

* Addition of a table to hold the requirements: JobID, ReqName, ReqType, ReqValue, ReqOperator (the latter should hold >, <, <=, >=, =, in). ReqType may not be needed if the type is checked in the Python code. The Python code should know what the possible requirements are (from the CS) and throw an error when trying to add a Req that does not exist at any site.
* Addition of setters and getters in the JobDB.py (as dicts probably, a bit like meta data in the FC)
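A possible shape for the table and the setter hinted at above; the column list follows the draft, everything else (the SQL details, the method names, the use of the insertFields helper from the DIRAC DB base class) is an assumption:

REQUIREMENTS_TABLE = """
CREATE TABLE JobRequirements (
  JobID       INTEGER NOT NULL,
  ReqName     VARCHAR(64) NOT NULL,
  ReqType     VARCHAR(16),
  ReqValue    VARCHAR(255) NOT NULL,
  ReqOperator ENUM( '>', '<', '<=', '>=', '=', 'in' ) DEFAULT '=',
  INDEX ( JobID )
)
"""

def setJobRequirements( jobDB, jobID, reqDict ):
  """ Store { name : ( operator, value ) } pairs, metadata-style as in the FC """
  for name, ( op, value ) in reqDict.items():
    jobDB.insertFields( 'JobRequirements',
                        ['JobID', 'ReqName', 'ReqValue', 'ReqOperator'],
                        [jobID, name, str( value ), op] )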

## Change in JobManager

* In the submitJob of the JobManager the proper calls should be added.

## Change needed for executor

* JobStates should have a method to access the requirements as needed by the executors.

## New executor: JobRequirements

* Would find the sites matching the requirements among the available ones. Should it take all unbanned sites? Or only those that a previously run executor would have selected?

## Interface

* Addition of the relevant code in the DIRAC API to set the requirements in the JDL: need a clever way of encoding the operators (as req>12 is not req=12).
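One possible, purely illustrative, way of encoding the operator without extending the JDL syntax is to fold it into the value string and split it again on the server side:

import re

# Hypothetical convention: Requirements = [ Memory = "> 12"; CPUModel = "Intel Xeon" ]
def parseRequirement( value ):
  """ Split an 'operator value' string; a bare value defaults to '=' """
  match = re.match( r'^\s*(>=|<=|>|<|=|in)\s+(.*)$', str( value ) )
  if match:
    return match.group( 1 ), match.group( 2 )
  return '=', str( value )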

## Clean up

* The Requirements table must be cleaned when the jobs are removed (what about reset?)
* What to do with requirements that are dropped from the CS? Should there be a watch agent that cleans the tables?
