Releases: martian-lang/martian

v3.2.1 bugfix release

07 Mar 18:01
  • Performance fixes for VDR computation in cases where stages have a
    large number of output files.
  • Include involuntary context switches in rusage tracking.
  • Fix a crash in cases where the mrp binary becomes unavailable on
    disk during a pipestance run.
  • Spelling corrections, mostly in code comments.

Martian 3.2.0

14 Jan 18:46

Martian 3.2.0 release.
Major new features:

  • The Python stage code adapter now works with Python 3.
  • Martian can now account for virtual address space size, in addition to
    physical memory.
    • Normally, virtual address space (vmem) size is ignored, since modern
      Linux systems have no good reason to restrict it - vmem size is not
      the same as rss+swap, contrary to inexplicably popular belief.
    • In local mode, a limit may be specified with the --localvmem flag.
    • A limit will also be imposed automatically if a virtual size rlimit
      (e.g. ulimit -d or ulimit -v) is detected by mrp. SGE's
      h_vmem, s_vmem, h_data, and s_data resource specifiers set
      these limits.
    • In cluster mode job templates, users may now use __MRO_VMEM_GB__
      and related variables in the same way as the existing
      __MRO_MEM_GB__ variables to get the predicted virtual address
      space (vmem) size rather than the physical memory requirement.
    • The job mode configuration for cluster modes found in
      jobmanagers/config.json may set the mem_is_vmem key to true,
      in which case __MRO_MEM_GB__ and related template variables will
      also use the virtual address space size, for backwards compatibility
      with existing user templates (most SGE clusters mistakenly enforce
      virtual size, if they handle anything like memory reservations at
      all). This is turned on by default for SGE.
    • Stages may specify a vmem_gb requirement in addition to mem_gb,
      through all of the same existing mechanisms (see the sketch after
      this list):
      • Specifying using ( vmem_gb = 4, ) in the mro declaration of the
        stage.
      • Specifying __vmem_gb in the chunk or join definitions returned
        by a split phase.
      • In overrides.json.
    • Stages which do not specify a vmem requirement will be allocated an
      amount equal to their physical memory requirement plus a constant
      specified in the extra_vmem_per_job key configured in
      jobmanagers/config.json.
    • With --monitor, mrjob will now restrict stage virtual size as
      well as physical size, to make sure the requests are being set
      correctly. It will include its own virtual size in the restriction,
      but will not include the virtual size of profiling jobs (e.g.
      perf record) which may be running alongside the stage code.
  • Update graph UI page:
    • Reduce the amount of excess bytes required to render the page.
      • Inline the 7% of bootstrap.min.css we actually use.
      • Remove the fonts, just use an svg icon instead.
      • Remove the clipboard button, since it hasn't actually worked in a
        long time.
    • Remove dead js files. These files either were already not being
      included in the serve package or are no longer required.
    • Concatenate javascript source files together.
    • Remove duplicated DOM element IDs.
    • Get angular and dagre-d3 from npm, as well as the support libraries
      d3 and lodash. This means we're no longer shipping an insecure
      version of lodash.
    • Pan/zoom now works on the graph page.
  • MRO syntax now supports escaping for string literals, using JSON
    escape syntax (shown in the sketch below).
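
The sketch below pulls the new vmem controls together. FOO is a hypothetical
stage; vmem_gb, mem_is_vmem, and extra_vmem_per_job are the documented key
names, while the parameter lists and numeric values are purely illustrative:

stage FOO(
    in  bam  in1,
    out json summary,
) using (
    mem_gb  = 4,
    # New: reserve virtual address space separately from physical memory.
    vmem_gb = 12,
)

A corresponding jobmanagers/config.json fragment might look like the
following; note that the settings/jobmodes nesting here is an assumption,
and only the key names come from the notes above:

{
    "settings": {
        "extra_vmem_per_job": 3
    },
    "jobmodes": {
        "sge": {
            "mem_is_vmem": true
        }
    }
}

The new string escaping applies to any mro string literal, for example in a
call binding (note is a hypothetical string parameter):

call FOO(
    note = "line one\nline \"two\"",
)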

Minor improvements:

  • mrp now checks for stage completion whenever local-mode jobs complete.
    Previously it would check every 3 seconds regardless. For very short
    jobs (such as, frequently, split phases) this results in shorter
    pipeline wall times. While the impact on large pipelines should be
    tiny in percentage terms, this significantly accelerates integration
    tests.
  • make tarball now produces both tar.gz and tar.xz.
  • Improvements to tests.
    • Integration tests can now run in parallel (make -j longtests)
    • Fix some bugs in integration test result validation.
    • More test coverage for both unit and integration tests.
  • Pipelines should be more robust against missed or delayed updates
    from the pipestance journal directory. Rather than timing out,
    mrp will now check whether the file exists if a notification wasn't
    seen.
  • mrjob now includes its own memory usage in the statistics reported
    in the jobinfo, which are used to generate the _perf summary.

Bug fixes:

  • Fix a potential deadlock when mrp receives a signal (e.g. from kill)
    or a shutdown request over the API while it is in the middle of
    starting or restarting a pipeline.
  • Fix a crash in mrf --includes if a stage called by a pipeline was
    not present in the transitive includes of the file defining the
    pipeline.
  • Fix a bug in mrf --includes which resulted in duplicate declarations
    for existing user-defined file types.
  • Updated npm dependencies.
  • mrjob will now begin waiting on the profiling command (e.g.
    perf record) immediately, rather than waiting until the stage code
    finishes. This prevents zombie processes from lingering if the
    profiling command finishes before the stage code.
  • mrp will no longer read chunk _outs files if no chunk outputs
    were expected, e.g. for pre-flight stages. This prevents spurious
    errors when chunk outputs were not a dictionary object. It also
    means chunk outputs need to be properly declared if the stage is
    expected to produce them.

v3.2.0-pre2

14 Jan 18:49
Fix a typo in the limit-exceeded message.

v3.2.0-pre1: Martian 3.2.0 release candidate.

14 Jan 18:49
See the Martian 3.2.0 notes above.

Martian 3.1.0

11 Oct 22:45

Martian 3.1

This release extensively reworks Volatile Disk Recovery (VDR), adding two new language keywords (hence the bump of the minor version number) - see below for details. A secondary focus on performance significantly reduces the memory footprint of mrp (especially important for users in cluster mode, where the submit host may have more constrained resources than the compute nodes). Improvements to developer tools should make authoring mro source code more convenient, and logging improvements should make debugging easier in the event of failures.

VDR (Volatile Disk Recovery) changes

VDR has been extensively overhauled. The general changes improve the storage high-water mark for all pipelines without requiring any modifications. Additionally, two new features have been added to further improve storage utilization and streamline development.

General changes

  • "rolling" is now the default VDR mode.
  • Each stage job (split, chunk, join) now has its own $TEMPDIR, which is cleaned up as soon as that stage phase has completed.
  • If a volatile stage call's output is bound to the top-level pipeline outputs, VDR now prevents deletion of only the files explicitly mentioned in the bound output, rather than being prevented from running on that stage at all.
  • When determining whether it is safe to clean up a stage's files, only outputs containing paths to files within the stage's directory hierarchy are considered when looking for downstream stages.
  • Stage metadata files are now accounted for in the storage high-water mark calculation.

New feature: strict-mode volatile

Stages may now declare themselves as being "strict-mode volatile" compatible:

stage FOO(
    in  bam in1,
    out bam bamfile,
    out bai index,
    out json summary,
) using (
    volatile = strict,
)

In this mode, the volatile modifier on the call to the stage is ignored. In addition, rather than VDR being an all-or-nothing affair that doesn't run until all downstream stages have completed, each file in the stage's outputs is evaluated separately. In the example above, if STAGE1 takes bamfile as input and STAGE2 takes summary as input, bamfile can be deleted as soon as STAGE1 completes, rather than waiting for STAGE2 to complete so that both can be deleted. In addition, any files not specified in any of the stage's outputs (most commonly intermediate files generated by the chunks and merged by the join) are deleted as soon as the stage completes. In many cases these changes significantly reduce the storage high-water mark for a pipeline, and obviate the need for some weird hacks such as creating intermediate stages which simply copy selected outputs from another stage in order to allow the earlier stage to be cleaned up.
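
For concreteness, a sketch of that scenario. STAGE1 and STAGE2 are the
hypothetical consumers described above, and STAGE2 is assumed to emit its
own summary output:

pipeline EXAMPLE(
    in  bam  in1,
    out json summary,
)
{
    call FOO(
        in1 = self.in1,
    )
    # FOO.bamfile becomes deletable as soon as STAGE1 completes...
    call STAGE1(
        bamfile = FOO.bamfile,
    )
    # ...while FOO.summary must survive until STAGE2 completes.
    call STAGE2(
        summary = FOO.summary,
    )
    return (
        summary = STAGE2.summary,
    )
}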

One important note about this feature is that in many cases stage code may produce files whose existence implies the existence of other files. For example, filename.bam often implies the existence of filename.bai. If a downstream stage does not bind an output which mentions filename.bai, then that file may be deleted by the time that stage runs. As another example, if one of the outputs is a file containing a list of other file names, those other files may also be deleted by VDR when in strict mode. This is why the feature is opt-in on a stage-by-stage basis. Any files produced by the stage which downstream stages are expected to read must be listed in the stage outputs, and those downstream stages must take those outputs as inputs.

New feature: "retained" outputs

In some cases, a file produced by a stage isn't part of the formal outputs of a pipeline, but should still not be deleted for other reasons. For example, during debugging one might want to preserve the outputs of one stage in order to have them as inputs when re-running a later stage that is being actively developed. As another example, some files may be small enough that the savings from deleting them are too small to justify making the pipeline's outputs harder to debug later. There are two ways to prevent such files from being cleaned up by VDR:

Pipeline retains

pipeline BAR(
    in  bam  input1,
    out bam  output1,
)
{
    call FOO(
        input1 = self.input1,
    )
    call FOO as BAZ(
        input2 = FOO.output1,
    )
    return (
        output1 = BAZ.output1,
    )
    retain (
        FOO.output1
    )
}

This specifies that output1 of this pipeline's call to FOO should never be deleted, for example if one wants to be able to re-run BAZ later. This should be preferred when one wishes to preserve a stage output for development purposes: first, because it puts the retain directive closer to where the output may be reused later, and second, because the stage in question might be called in other contexts (such as aliased to BAZ in this example, or from other pipelines) which do not need to retain the output.

Stage retains

stage FOO(
    in  int  input1,
    out bam  output1,
    out json summary,
) using (
    volatile = strict,
) retain (
    summary,
)

This specifies that VDR should never delete summary. Use this when a file must always be preserved for potential later inspection.

Runtime improvements

User-facing improvements

  • The memory and CPU consumption of mrp has been reduced, especially for very large pipelines, and in cases where stages create large output objects.
  • There is now a delay, configurable with the --retry-wait command line flag, between when mrp observes a potentially-transient failure and when it retries. In many cases (for example, cluster-mode jobs running on a remote machine which was taken offline) the failures are clustered, and waiting a short time allows all of them to be dealt with at once. The default wait time is 1 second.
  • The web UI now dynamically lists top-level metadata files (such as log) rather than having a hard-coded list which included files which are often not present before the pipestance completes.
  • The web UI will now show files in the /extras directory of the pipestance. This is intended primarily for outputs of on-finish hooks.

Improvements for pipeline developers

  • mrp can now run stage invocations as well as pipeline invocations. mrs now exists only as a symlink to mrp for backwards compatibility. This eliminates the feature gap between mrs and mrp, including for example restartability and user interface, as well as reducing the maintenance overhead involved in having separate binaries.
  • Performance profiling modes are now configurable through jobmanagers/config.json. Each profiling mode may specify an executable to run to attach to the stage code (such as perf) and environment variables (for example HEAPPROFILE may be used to enable tcmalloc's heap profiler), in addition to any profiling built into the stage adapter itself.
  • The default event collection for --profile=perf no longer includes bpf-output events. This was not working in most cases.
  • Profiling mode may now be specified in the overrides JSON file (used with the --overrides flag) to enable or disable it for an individual stage.
  • The _mrosource output, which merges all @includes, now includes comments indicating the file boundaries of the original source code.
  • More logging about the environment when mrp starts up:
    • Log a few more environment variables (MALLOC_* and RUST_*).
    • The filesystem type, available space, and inode count are now logged on startup.
  • When mrp is run with --monitor, stages with overly-large outputs (512MB, or 1GB divided by the number of chunks for chunk outputs) will now fail rather than potentially causing mrp to run out of memory while trying to parse them.
  • The stage code parent process, mrjob, now polls memory and IO usage much more frequently and efficiently. This provides a more accurate measurement of peak memory usage for multithreaded stages.
  • When attempting to reattach to an existing pipestance, verify that the mro source hasn't changed in "significant" ways.
  • Stage and pipeline _invocation files now only @include the mro file defining that stage or pipeline, rather than the complete set of includes for the whole pipeline.
  • Stage and pipeline _invocation files now properly represent call aliasing (e.g. call FOO as BAR).
  • The Python adapter now exposes two new methods, get_threads_allocation and get_memory_allocation, for stages to use in determining what they're allowed to use in cases where their request was large or dynamic (see the sketch below).
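
A sketch of how stage code might consume the new accessors. The method names
come from the note above; their placement on the martian adapter module, and
the join shown here, are assumptions:

import martian  # the Martian Python stage adapter

def join(args, outs, chunk_defs, chunk_outs):
    # Ask the adapter what was actually granted, rather than assuming the
    # (possibly large or dynamic) request was satisfied verbatim.
    threads = martian.get_threads_allocation()
    mem_gb = martian.get_memory_allocation()
    martian.log_info("granted %d threads and %d GB" % (threads, mem_gb))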

Improvements to the compiler/parser (mrc) and formatter (mrf)

  • The parser has been extensively rewritten to provide more useful (and correct) line number outputs for errors, especially in cases with complicated webs of @include directives.
  • The parser is now substantially faster, and uses less memory.
  • mrf has a new flag, --includes, which will remove @include directives which are not required, and will attempt to add @include and filetype directives which are missing. It is inspired by the clang iwyu tool.
  • mrf now transforms call modifiers to the new syntax introduced in Martian 3.0. That is,
call volatile FOO(
    ...
)

becomes

call FOO(
    ...
) using (
    volatile = true,
)
  • mrf now sorts keys in map literals.
  • mrf now inserts a trailing comma in map and array literals, e.g.
[
    1,
    2,
]

Bug fixes

  • The web UI now times out stale connections. Previously a buggy or
    malicious client could introduce a denial of service condition by opening
    more and more socket connections until the server ran out of file
    handles. …

Martian 3.1.0 release candidate 3

09 Oct 20:18
Pre-release

See the Martian 3.1.0 notes above.

Martian 3.1.0 release candidate 2

06 Sep 02:14
Pre-release

See the Martian 3.1.0 notes above for the full changelog. Additionally:

  • For users who desire more control over perf profile recording with
    --profile=perf, the environment variable MRO_PERF_ARGS allows one to
    specify the command line to perf record. This overrides MRO_PERF_EVENTS,
    MRO_PERF_FREQ, and MRO_PERF_DURATION. The command that runs will be
    perf record -p <pid> -o <path/to/job/_perf.data> $MRO_PERF_ARGS. The same
    behavior as setting those variables can thus be achieved by setting
    MRO_PERF_ARGS="-g -e $MRO_PERF_EVENTS -F $MRO_PERF_FREQ sleep $MRO_PERF_DURATION".

Martian 3.1.0 release candidate 1

13 Aug 20:56
Pre-release

See the Martian 3.1.0 notes above for the full changelog. Additionally:

  • When running with --zip --profile=perf, profiling outputs are no longer
    included in _metadata.zip.

v2.3.3: Fix web UI in Firefox.

19 Jul 01:44
This is a manual cherry-pick of d5f6fe982014edf3215f7c07e11d4fa5a773c1c5.

Martian 3.0.0 release.

21 Jun 19:23

This is a major version change and includes updates and improvements to
the syntax, as well as the usual assortment of bug fixes and
performance improvements.

This release of the Martian framework is packaged with 10X Genomics'
Cellranger DNA 1.0.0 release.

The repository has also been reorganized to conform to the "go way" of
code organization, so it can now be fetched and included with go get
without any weird messing around with GOPATH. Comments have been
substantially improved throughout the code base, so godoc is no longer
useless.

New MRO syntax features

  • Stages may now change their default thread and memory reservation used for
    the split, chunks, and join (see the sketch after this list). The split
    can, of course, still override the threads/memory for the chunks and join
    as before. This eliminates many cases where a stage would split/join
    simply to set a memory reservation (although this is still required if
    the reservation logic is dynamic).
  • @include lines are now deduplicated. Transitively including an mro
    file will no longer result in errors due to duplicate definitions.
    This allows developers to obey best practices by directly including
    every mro which defines a stage or pipeline they depend on.
  • Stage split definitions may now declare chunk outputs as well as
    inputs.
  • Calls may now bind array-type input parameters to an array of elements
    specified in mro, e.g.
call FOO(
    argv = [ BAR.out1, BAZ.out1 ],
)
  • Calls may now alias the name of a stage or pipeline to allow, for
    example, a pipeline to call the same stage or pipeline multiple times with
    different inputs, e.g.
call FOO as FOO1 (...)
call FOO as FOO2 (...)
  • Add a new syntax for call modifiers (e.g. preflight, local, volatile):
call FOO(
    ...
) using (
    volatile = true,
)

This allows comments to be placed around the modifiers, and also
allows...

  • Calls may now be conditionally disabled based on the output of a prior
    stage:
call FOO(
    ...
) using (
    disabled = BAR.disable_foo,
)
  • A new --strict flag to mrc, and --strict={disabled|log|alarm|error}
    flag for mrp, now checks inputs and outputs more strictly. "strict"
    mode is disabled by default for backwards compatibility, but in other
    modes it fixes some oversights in the pipeline type checking logic
    including:
    • mem_gb and threads were always truncated to integers. Now
      non-integral values are errors.
    • In several cases, arrays of a given type were treated
      interchangeably as scalar types. Now they're not.
    • In strict mode, undeclared chunk inputs or outputs, or chunk input
      or output values of an incorrect type, are considered errors.
  • A stage or split may now request a negative number for threads or
    memory. In cluster mode, the absolute value is used. In local mode,
    the entire --localmem or --localcores allocation is given to the stage,
    if that value is higher than the absolute value of the request.
    Otherwise, an error is raised due to insufficient resources. This
    provides more reliable behavior than the previous preferred strategy of
    asking for e.g. 2TB and counting on the job manager to cut it down to
    the actual availability.
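
A sketch pulling these stage-level features together. SUM_SQUARES is a
hypothetical stage, and the exact ordering of the split and using clauses
should be checked against the Martian grammar:

stage SUM_SQUARES(
    in  float[] values,
    out float   sum,
) split using (
    # Chunk inputs, and (new in 3.0) chunk outputs, may be declared here.
    in  float value,
    out float square,
) using (
    # Default reservations for the split, chunks, and join; the split may
    # still override the chunk and join reservations dynamically.
    threads = 2,
    # Negative request: cluster mode reserves 4GB; local mode grants the
    # entire --localmem allocation, or errors if it is below 4GB.
    mem_gb  = -4,
)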

Developer quality of life improvements:

  • MRO "compile" errors now include the full include stack, rather than
    just the location of the failure.
  • The mrp webserver now exposes the /debug/pprof endpoint for profiling
    and debugging a running instance.
  • Added a new tool, mro2go, which can automatically generate
    data structures for serializing and deserializing stage inputs and
    outputs from mro code, making authoring of native go stages much easier.

Bug fixes:

  • Fixed a situation where the cluster-mode --maxjobs limit could "leak"
    job slots until mrp stopped queuing new jobs to the cluster.
  • The graph UI should now render correctly in Firefox.
  • Several cases where the Python stage code adapter would fail around
    Unicode strings should be resolved.
  • A case where a stage in the MROPATH which was not transitively
    included by the top-level call mro could replace the correct call has
    been eliminated.
  • mrjob now forces a filesystem sync of the job's _outs file before
    writing the _complete file, to avoid race conditions observed on certain
    aggressively configured network file systems.
  • In cluster mode, the default --localmem will no longer ever be more
    than the available memory at the start of execution.
  • Cyclic mro @include paths will now result in an error rather than an
    infinite loop.
  • HTTP API commands to terminate the pipestance no longer leave it in an
    un-restartable state.
  • Fix a case where mrp would attempt to restart a pipestance which had
    not failed, resulting in a failure.
  • --jobresources is now correctly applied.

Run-time improvements:

  • mrp now tracks the number of POSIX processes owned by the user and
    compares it to the current process rlimit (ulimit -u), throttling the
    number of spawned jobs if the user is approaching that limit. It will
    also issue a warning at startup if the difference between the current
    process count and the process rlimit is small compared to --localcores.
    This should mitigate the frequently-observed issues on large machines
    with 64+ cores but the default process ulimit of 1024.
  • The default memory reservation for all jobs is now 1GB. Stages which
    need more may specify that either in the mro definition or dynamically
    in their split phase.
  • mrp now proceeds immediately between the split, chunk, and join phases
    of a stage rather than waiting for the next iteration of the run loop.
    This is a trivial speedup compared to most pipeline run times, but
    improves the cache hit rate for processing of stage inputs and outputs.
  • mrp no longer recomputes the bindings for a stage separately when
    starting each chunk of that stage. This can save a very large amount of
    memory and time for stages with many chunks and large input argument
    structures.
  • The --never-local flag causes mrp to ignore the local modifier on
    non-preflight stages. This may be important when mrp is running on a
    submit host with limited resources.
  • mrp's steady-state thread and memory usage has been reduced substantially.
    For most pipelines, the rss usage should stay under 30MB most of the time,
    and the number of POSIX threads should also be reduced.
  • Walking directory trees to, for example, delete files during VDR is now
    faster, particularly on network file systems.
  • Minor improvements to the rendering speed for the web UI.
  • The web UI now has a favicon.
  • The web UI for the pipeline graph now only renders a single line for
    each dependency, rather than a line for each argument.
  • Volatile Disk Recovery's dependency tracking now ignores arguments of
    types int, float, and bool when considering whether intermediate files
    can be deleted, since those cannot contain file names.
  • If an output argument of any other type (not int, float, or bool) of a
    stage is depended on by the top-level pipeline, the stage will not be
    cleaned up by VDR. This should make VDR safe to enable for nearly all
    stages.
  • mrjob now attempts to track the I/O syscall and bytes rates for stage
    code. Unfortunately, due to limitations of the Linux kernel's reporting of
    such metrics, this is only accurate for block devices (e.g. local disk), as
    opposed to for example NFS mounts.