Cylc Refactor Proposal (2014)

Matt Shin edited this page Jul 31, 2014 · 47 revisions

Issues

Load on Server

Scheduler currently becomes inefficient when a suite has 1500+ tasks.

Reload of a large suite can take a long time, and can use up a lot of memory. (Stop and restart is faster. How come?)

Scheduler uses a significant amount of CPU, even when it should be idle. E.g. when a large suite is stopping and waiting for a single job to complete, scheduler appears to continue to consume CPU.

  • Is it still the case?
  • When the suite is idle, it should use almost no CPU.
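
One way to get near-zero CPU usage when idle is to block on an event queue rather than poll on a timer. The following is a minimal sketch of that idea, not Cylc's actual main loop; `run_scheduler`, `STOP` and the event strings are hypothetical names chosen for illustration.

```python
import queue
import threading

STOP = object()  # hypothetical sentinel telling the loop to exit


def run_scheduler(events, handled):
    """Process events as they arrive; sleep in the kernel while there are none."""
    while True:
        event = events.get()  # blocks without busy-waiting
        if event is STOP:
            break
        handled.append(event)  # stand-in for real event processing


events = queue.Queue()
handled = []
worker = threading.Thread(target=run_scheduler, args=(events, handled))
worker.start()

# Hypothetical events: a job message and a user command.
events.put("job succeeded: foo.2014073100")
events.put("user command: stop")
events.put(STOP)
worker.join()
```

While the queue is empty the loop does no work at all, so a suite waiting on a single job would only wake when that job's message arrives.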

See also #107, #108, #184, #788, #987.

Data Model and Persistent Layer

Current persistent layer was added as an afterthought. Hence it does not document the suite throughout its lifetime:

  • Environment and changes.
  • Configuration and changes.
  • Runtime state and changes.
  • User interactions.
  • Changes to other items in the suite?

SQLite database can be locked by a reader and can cause the suite to die.
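
A locked database need not be fatal if the write is retried. The sketch below shows one possible mitigation, not how Cylc handles it; `retry_on_locked`, its parameters and the `task_states` table are hypothetical. It relies only on the standard `sqlite3` module raising `sqlite3.OperationalError` with a "database is locked" message.

```python
import sqlite3
import time


def retry_on_locked(operation, attempts=5, delay=0.01):
    """Call operation(); retry if SQLite reports that the database is locked."""
    for attempt in range(attempts):
        try:
            return operation()
        except sqlite3.OperationalError as exc:
            # Re-raise anything that is not a lock, and the final failure.
            if "locked" not in str(exc) or attempt == attempts - 1:
                raise
            time.sleep(delay)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_states (name TEXT, cycle TEXT, status TEXT)")


def record_state():
    """Insert one row in its own transaction and return the row count."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO task_states VALUES (?, ?, ?)",
            ("foo", "2014073100", "succeeded"),
        )
    return conn.execute("SELECT COUNT(*) FROM task_states").fetchone()[0]


row_count = retry_on_locked(record_state)
```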

It is not clear to users what suite items to back up in order to guarantee a good restart. Do we need all of these?

  • Suite environment.
  • Suite definition.
  • State file.
  • Suite runtime database.
  • Job logs and status files.
  • Other items, which may be modified in the interim.

Current data model is difficult to serialise, because it mixes everything together:

  • Site/user configuration.
  • Runtime options.
  • Suite configuration.
  • Runtime states.
  • Functional logic.

(This also causes much unnecessary getting and setting of data throughout the logic.)
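
To illustrate what separating these concerns might look like, here is a minimal sketch that keeps configuration and runtime state in plain data classes with no logic attached. All class and field names are hypothetical and are not taken from Cylc.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TaskConfig:
    """Static suite configuration for one task (hypothetical fields)."""
    name: str
    script: str


@dataclass
class TaskState:
    """Runtime state only: no configuration and no functional logic."""
    name: str
    cycle: str
    status: str = "waiting"
    submit_number: int = 0


@dataclass
class SuiteSnapshot:
    """Everything needed to describe the suite at one moment."""
    config: list = field(default_factory=list)
    states: list = field(default_factory=list)


snapshot = SuiteSnapshot(
    config=[TaskConfig(name="foo", script="run-foo")],
    states=[
        TaskState(name="foo", cycle="2014073100", status="running", submit_number=1)
    ],
)

# Plain data round-trips through JSON without any custom pickling.
text = json.dumps(asdict(snapshot), sort_keys=True)
restored = json.loads(text)
```

Because the snapshot is plain data, it can be written to disk, sent over a network or passed between functions without carrying the scheduler's logic along with it.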

Task proxies are generated classes with various layers of inheritance. This is undesirable and restricts names that can be given to tasks.

Runtime files are not in one place.

  • It is not obvious to users what they should housekeep and/or archive.

See also #372, #421, #423, #705, #846, #864, #975.

Communication Layer

API for job messages, user queries and user commands. Current use of Pyro limits what we can do:

  • Only single passphrase authentication.
  • Once you are in, you can do everything.
  • Object RPC instead of RESTful API design.
  • There is no clear API to:
    • send job messages (except via cylc command).
    • send user queries. (A query does not change the suite state.)
    • send user commands. (A command asks the suite to perform an action.)
  • Unable to use mainstream technology built around HTTP and other more common protocols.
    • SMTP would be a useful protocol to support, as just about any load-balancing system on any site is able to send emails out.
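
The query/command distinction maps naturally onto HTTP methods: a read-only query as GET, a state-changing command as POST. The following is a rough sketch of such a dispatcher with no network code; the paths, status codes and `SUITE_STATE` layout are assumptions made for illustration, not an agreed API.

```python
SUITE_STATE = {"status": "running", "held_tasks": []}  # hypothetical state

KNOWN_PATHS = ("/status", "/commands/hold")  # hypothetical routes


def handle_request(method, path, body=None):
    """Route a request: GET is a read-only query, POST is a command."""
    if method == "GET" and path == "/status":
        # Query: report state without changing it.
        return 200, {"status": SUITE_STATE["status"]}
    if method == "POST" and path == "/commands/hold":
        # Command: ask the suite to perform an action.
        SUITE_STATE["held_tasks"].append(body["task"])
        return 202, {"accepted": "hold", "task": body["task"]}
    if path in KNOWN_PATHS:
        return 405, {"error": "method not allowed"}
    return 404, {"error": "not found"}


query_result = handle_request("GET", "/status")
command_result = handle_request("POST", "/commands/hold", {"task": "foo.2014073100"})
```

Keeping queries on GET means they can be cached or retried safely, while every state change goes through a POST that can be authenticated and logged.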

See also #72, #124, #126, #537, #969, #970.

Job Submission and Management

Inefficient host selection via rose host-select.

  • Multiple SSH commands to multiple hosts or login nodes for every job.
  • While individually insignificant, the time adds up when we start running a large number of jobs at the same time, e.g. large ensembles.

Multiple SSH connections and almost identical commands to submit jobs to queueing system.

  • This may create unnecessary load on suite hosts and job hosts.
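
One way to reduce the number of connections would be to group pending jobs by host and issue a single SSH command per host. This is only a sketch of the grouping idea; the function names are hypothetical, and `qsub` stands in for whatever submit command a site uses.

```python
import shlex
from collections import defaultdict


def group_jobs_by_host(jobs):
    """Group (host, job_file) pairs so each host needs only one connection."""
    by_host = defaultdict(list)
    for host, job_file in jobs:
        by_host[host].append(job_file)
    return dict(by_host)


def build_submit_commands(jobs, submit_command="qsub"):
    """Return one ssh argument list per host instead of one per job."""
    commands = []
    for host, job_files in sorted(group_jobs_by_host(jobs).items()):
        # Chain the submissions with ";" so one failure does not block the rest.
        remote = "; ".join(
            f"{submit_command} {shlex.quote(path)}" for path in job_files
        )
        commands.append(["ssh", host, remote])
    return commands


pending = [
    ("hpc1", "log/job/2014073100/foo.01"),
    ("hpc2", "log/job/2014073100/bar.01"),
    ("hpc1", "log/job/2014073100/baz.01"),
]
commands = build_submit_commands(pending)
```

Here three jobs on two hosts produce two SSH commands; for a large ensemble on one host the saving would be correspondingly larger.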

Submission error output, which currently goes to log/suite/err, can be lost in the noise. Users are often puzzled when they have a submission failure.

  • Similar issue with event hooks.

It is not easy to archive a single cycle of log files due to the log/job/$TASK.$CYCLE.$SUBMIT_NUM* naming convention.

  • In addition, *.1 to *.8 are the traditional file extensions for Unix manual pages.
  • It is not easy to compare logs between suites.
  • Submit number is not documented in the job script.
  • Users would find it easier with a hierarchy based on log/job/$CYCLE/.

Ditto for items in work/$TASK.$CYCLE/.

  • Users would find it easier with a hierarchy based on work/$CYCLE/$TASK/.
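
The difference between the two layouts can be seen by expanding the path templates proposed under Actions for a single job. The job values and the `.out` suffix below are hypothetical; the old-style template is a paraphrase of the current naming convention.

```python
import os

# Templates as proposed under Actions.
JOB_LOG_TEMPLATE = "log/job/%(cycle)s/%(task)s.%(submit_number)02d%(suffix)s"
WORK_TEMPLATE = "work/%(cycle)s/%(task)s/"

# Hypothetical job used for illustration.
job = {"cycle": "2014073100", "task": "foo", "submit_number": 1, "suffix": ".out"}

new_log_path = JOB_LOG_TEMPLATE % job
new_work_path = WORK_TEMPLATE % job

# Current flat convention, for comparison: log/job/$TASK.$CYCLE.$SUBMIT_NUM*
old_log_path = "log/job/%(task)s.%(cycle)s.%(submit_number)d%(suffix)s" % job

# With the new layout, all logs for one cycle share a single directory,
# so archiving a cycle means archiving that directory.
cycle_log_dir = os.path.dirname(new_log_path)
```

The zero-padded submit number (`01`) also avoids file names that end in a bare `.1` to `.8`.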

Rose Integration

Rose and other 3rd party tool-kits and frameworks.

Rose provides these functionalities, which should probably be part of Cylc?

  • Suite installation (rose suite-run).
  • Suite host selection (rose suite-run, etc).
    • In the future, provide a way to migrate a suite to a different host, e.g. if current host is scheduled for a reboot in the middle of a run.
  • Job log management and event handling (rose suite-hook).
  • Suite clean (rose suite-clean).
  • Locate suites on multiple hosts (rose suite-scan).
    • cylc gsummary works, but cylc scan doesn't.
    • Other relevant cylc commands, e.g. gcylc and cylc stop should do the same.
  • Browser of suite logs via HTTP (Rose Bush).
    • Need a new name. What about cylc moth (MOnitor Tasks via HTTP)?

Users are unable to call Rose functionalities via cylc gui.

  • Restart or reload suite, or re-trigger a task with or without reinstalling configurations for suites and/or applications.
  • Launch rose config-edit.
  • Launch Rose Bush.

Users have to hard-wire the Rose environment in job scripts. See also #511.

Actions

Following discussions, we agreed the following:

  1. Investigate how to improve suite runtime performance (CPU usage, memory usage, etc):

    • Activity can start as soon as possible.
    • Implement performance quick wins. (2014-Q3?)
    • Propose a new and more scalable architecture for the future. (2014-Q3/4?)
      • More event driven.
      • Boss-worker processes. DONE (#1012)
      • Functional architecture.
  2. Propose new data model and data persistent layer. (2014-Q3?)

    • Data model will be able to represent and fully document the runtime of a suite.
      • Task and job states.
      • Changes to suite.
      • User commands.
    • Data model will be easy to serialise.
    • Data model will be easy to pass between functions.
    • Data model will be memory efficient.
    • Persistent layer will be friendly to write and query.
  3. Propose new communication layer API. (2014-Q3/4 after data model activity?)

    • RESTful HTTP API, which will allow us to use common web technologies.
    • Job message via HTTP POST. Unique one-time authentication token for each job.
    • For those sites that do not allow outward communications from jobs to suites:
      • Support job message via email.
      • Support HTTP and SMTP proxies.
    • User commands via HTTP POST. (Tell the suite to perform an action.)
      • Require authentication. Use a system based on public/private key pairs?
    • User queries via HTTP GET. (Read-only, no change to suite.)
  4. Propose log/job/ and work/ directory restructure before cylc 6. (2014-Q3)

    • log/job/%(cycle)s/%(task)s.%(submit_number)02d%(suffix)s
    • work/%(cycle)s/%(task)s/
  5. Propose changes to job submission log and event hook log locations. (2014-Q3/4)

    • With the job logs at the suite host?
  6. Propose changes to migrate rose suite-hook and Rose Bush functionalities into cylc. (2014-Q4)

    • New configurations to ask the suite to pull job logs back from job hosts.
    • New configurations to send email notification on events.
    • New configurations to shut down on events.
    • Suite to populate job log database table.
    • Rose Bush -> cylc moth? (MOnitor Tasks via HTTP?)
  7. Propose changes to allow cylc commands to look for suite in configured suite hosts. (2014-Q3/Q4)

    • N.B. cylc gsummary already does this.
  8. Migrate rose suite-run and rose suite-clean functionalities. (2014-Q4/2015?)
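
The per-job authentication token in item 3 could work roughly as follows. This sketch reads "one-time" as "unique to one job submission and revoked when that job finishes", which is an assumption; the class, method names and job identifiers are hypothetical.

```python
import hmac
import secrets


class JobTokenStore:
    """Issue, check and revoke one token per job submission."""

    def __init__(self):
        self._tokens = {}  # job id -> token

    def issue(self, job_id):
        """Create a fresh random token when the job is submitted."""
        token = secrets.token_urlsafe(32)
        self._tokens[job_id] = token
        return token

    def verify(self, job_id, token):
        """Accept a job message only if it carries that job's own token."""
        expected = self._tokens.get(job_id)
        # compare_digest avoids leaking information through timing.
        return expected is not None and hmac.compare_digest(expected, token)

    def revoke(self, job_id):
        """Invalidate the token once the job has reached a final state."""
        self._tokens.pop(job_id, None)


store = JobTokenStore()
foo_token = store.issue("2014073100/foo/01")
bar_token = store.issue("2014073100/bar/01")
```

A token scoped to a single job limits the damage if it leaks: it cannot be used to send messages for other jobs or to issue user commands.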