Releases: martian-lang/martian
v3.2.1 bugfix release
- Performance fixes for VDR computation in cases where stages have a
  large number of output files.
- Include involuntary context switches in rusage tracking.
- Fix a crash in cases where the `mrp` binary becomes unavailable on
  disk during a pipestance run.
- Spelling corrections, mostly in code comments.
Martian 3.2.0
Martian 3.2.0 release.
Major new features:
- The Python stage code adapter now works with Python 3.
- Martian can now account for virtual address space size, in addition to
  physical memory.
  - Normally, virtual address space (vmem) size is ignored, since modern
    linux systems have no good reason to restrict it - vmem size is not
    the same as rss+swap, contrary to inexplicably popular belief.
  - In local mode, a limit may be specified with the `--localvmem` flag.
  - A limit will also be imposed automatically if a virtual size rlimit
    (e.g. `ulimit -d` or `ulimit -v`) is detected by mrp. SGE's
    `h_vmem`, `s_vmem`, `h_data`, and `s_data` resource specifiers set
    these limits.
  - In cluster mode job templates, users may now use `__MRO_VMEM_GB__`
    and related variables in the same way as the existing
    `__MRO_MEM_GB__` variables to get the predicted virtual address
    space (vmem) size rather than the physical memory requirement.
  - The job mode configuration for cluster modes found in
    `jobmanagers/config.json` may set the `mem_is_vmem` key to `true`,
    in which case `__MRO_MEM_GB__` and related template variables will
    also use the virtual address space size, for backwards compatibility
    with existing user templates (most SGE clusters mistakenly enforce
    virtual size, if they handle anything like memory reservations at
    all). This is turned on by default for SGE.
  - Stages may specify a `vmem_gb` requirement in addition to `mem_gb`,
    through all of the same existing mechanisms:
    - Specifying `using ( vmem_gb = 4, )` in the mro declaration of the
      stage.
    - Specifying `__vmem_gb` in the chunk or join definitions returned
      by a split phase.
    - In overrides.json.
  - Stages which do not specify a vmem requirement will be allocated an
    amount equal to their physical memory requirement plus a constant
    specified in the `extra_vmem_per_job` key configured in
    `jobmanagers/config.json`.
  - With `--monitor`, `mrjob` will now restrict stage virtual size as
    well as physical size, to make sure the requests are being set
    correctly. It will include its own virtual size in the restriction,
    but will not include the virtual size of profiling jobs (e.g.
    `perf record`) which may be running alongside the stage code.
- Update graph UI page
  - Reduce the amount of excess bytes required to render the page.
  - Inline the 7% of bootstrap.min.css we actually use.
  - Remove the fonts, just use an svg icon instead.
  - Remove the clipboard button, since it hasn't actually worked in a
    long time.
  - Remove dead js files. These files either were already not being
    included in the serve package or are no longer required.
  - Concatenate javascript source files together.
  - Remove duplicated DOM element IDs.
  - Get angular, dagre-d3 from npm, as well as support libraries d3 and
    lodash. This means we're no longer shipping an insecure version of
    lodash.
  - Pan/zoom now works on the graph page.
- MRO syntax now supports escaping for string literals, using json
escaping syntax.
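The default vmem reservation and the `mem_is_vmem` template behavior listed above can be sketched in Python. This is only an illustration of the arithmetic, not Martian's implementation: the function names, the template string, and the example `extra_vmem_per_job` value of 3 are assumptions made for this sketch.

```python
# Illustrative sketch only; not Martian's actual code.
# Function names, the template text and the default constant are hypothetical.


def effective_vmem_gb(mem_gb, vmem_gb=None, extra_vmem_per_job=3):
    """Return the vmem reservation for a stage.

    A stage that declares vmem_gb gets that value.  Otherwise the
    reservation is the physical memory request plus the configured
    extra_vmem_per_job constant.
    """
    if vmem_gb is not None:
        return vmem_gb
    return mem_gb + extra_vmem_per_job


def render_template(template, mem_gb, vmem_gb, mem_is_vmem=False):
    """Fill in the memory variables of a cluster job template.

    When mem_is_vmem is true, __MRO_MEM_GB__ also receives the vmem size.
    """
    mem_value = vmem_gb if mem_is_vmem else mem_gb
    return (template
            .replace("__MRO_VMEM_GB__", str(vmem_gb))
            .replace("__MRO_MEM_GB__", str(mem_value)))


vmem = effective_vmem_gb(mem_gb=4)
print(render_template("mem=__MRO_MEM_GB__G vmem=__MRO_VMEM_GB__G", 4, vmem))
print(render_template("mem=__MRO_MEM_GB__G", 4, vmem, mem_is_vmem=True))
```

With `mem_is_vmem` set, an existing template that only knows about `__MRO_MEM_GB__` ends up requesting the larger virtual size without being edited, which is the backwards-compatibility case the notes describe.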
Minor improvements:
- mrp now checks for stage completion whenever local-mode jobs complete.
  Previously it would check every 3 seconds regardless. For very short
  jobs (such as, frequently, split phases) this results in shorter
  pipeline wall times. While the impact on large pipelines should be
  tiny in percentage terms, this significantly accelerates integration
  tests.
- `make tarball` now produces both `tar.gz` and `tar.xz`.
- Improvements to tests.
  - Integration tests can now run in parallel (`make -j longtests`)
  - Fix some bugs in integration test result validation.
  - More test coverage for both unit and integration tests.
- Pipelines should be more robust against missed or delayed updates
  from the pipestance journal directory. Rather than timing out,
  mrp will now check whether the file exists if a notification wasn't
  seen.
- `mrjob` now includes its own memory usage in the statistics included
  in the jobinfo, which are used to generate the `_perf` summary.
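The journal fallback described above (check for the file when no notification was seen, instead of timing out) follows a common pattern that can be sketched in Python. This is not mrp's code, which is written in Go; the function name, the `job.complete` file name, and the use of `threading.Event` as the notification channel are all assumptions for illustration.

```python
import os
import tempfile
import threading


def wait_for_completion(notified, marker_path, timeout_s):
    """Wait for a notification; if none arrives, check the file directly.

    Returns True if the job is known to be complete.
    """
    if notified.wait(timeout_s):
        return True
    # No notification was seen in time: fall back to checking the file.
    return os.path.exists(marker_path)


with tempfile.TemporaryDirectory() as journal_dir:
    marker = os.path.join(journal_dir, "job.complete")  # hypothetical name
    never_notified = threading.Event()

    # Nothing has been written yet and no notification arrives.
    print(wait_for_completion(never_notified, marker, 0.01))  # False

    # The file appears, but the notification for it was missed.
    open(marker, "w").close()
    print(wait_for_completion(never_notified, marker, 0.01))  # True
```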
Bug fixes:
- Fix a potential deadlock when mrp receives a signal (e.g. from `kill`)
  or a shutdown request over the API while it is in the middle of
  starting or restarting a pipeline.
- Fix a crash in `mrf --includes` if a stage called by a pipeline was
  not present in the transitive includes of the file defining the
  pipeline.
- Fix a bug in `mrf --includes` which resulted in duplicate declarations
  for existing user-defined file types.
- Updated npm dependencies.
- `mrjob` will now begin waiting on the profiling command (e.g.
  `perf record`) immediately, rather than waiting until the stage code
  finishes. This prevents zombie processes lying around if the
  profiling command finishes before the stage code.
- `mrp` will no longer read chunk `_outs` files if no chunk outputs
  were expected, e.g. for pre-flight stages. This prevents spurious
  errors when chunk outputs were not a dictionary object. It also
  means chunk outputs need to be properly declared if the stage has
  no outputs.
v3.2.0-pre2
Fix a typo in limit exceeded message.
v3.2.0-pre1: Martian 3.2.0 release candidate.
Major new features:
- The Python stage code adapter now works with Python 3.
- Martian can now account for virtual address space size, in addition to physical memory.
  - Normally, virtual address space (vmem) size is ignored, since modern linux systems have no good reason to restrict it - vmem size is not the same as rss+swap, contrary to inexplicably popular belief.
  - In local mode, a limit may be specified with the `--localvmem` flag.
  - A limit will also be imposed automatically if a virtual size rlimit (e.g. `ulimit -d` or `ulimit -v`) is detected by mrp. SGE's `h_vmem`, `s_vmem`, `h_data`, and `s_data` resource specifiers set these limits.
  - In cluster mode job templates, users may now use `__MRO_VMEM_GB__` and related variables in the same way as the existing `__MRO_MEM_GB__` variables to get the predicted virtual address space (vmem) size rather than the physical memory requirement.
  - The job mode configuration for cluster modes found in `jobmanagers/config.json` may set the `mem_is_vmem` key to `true`, in which case `__MRO_MEM_GB__` and related template variables will also use the virtual address space size, for backwards compatibility with existing user templates (most SGE clusters mistakenly enforce virtual size, if they handle anything like memory reservations at all).
  - Stages may specify a `vmem_gb` requirement in addition to `mem_gb`, through all of the same existing mechanisms:
    - Specifying `using ( vmem_gb = 4, )` in the mro declaration of the stage.
    - Specifying `__vmem_gb` in the chunk or join definitions returned by a split phase.
    - In overrides.json.
  - Stages which do not specify a vmem requirement will be allocated an amount equal to their physical memory requirement plus a constant specified in the `extra_vmem_per_job` key configured in `jobmanagers/config.json`.
  - With `--monitor`, `mrjob` will now restrict stage virtual size as well as physical size, to make sure the requests are being set correctly. It will include its own virtual size in the restriction, but will not include the virtual size of profiling jobs (e.g. `perf record`) which may be running alongside the stage code.
Minor improvements:
- mrp now checks for stage completion whenever local-mode jobs complete. Previously it would check every 3 seconds regardless. For very short jobs (such as, frequently, split phases) this results in shorter pipeline wall times. While the impact on large pipelines should be tiny in percentage terms, this significantly accelerates integration tests.
- `make tarball` now produces both `tar.gz` and `tar.xz`.
- Improvements to tests.
  - Integration tests can now run in parallel (`make -j longtests`)
  - Fix some bugs in integration test result validation.
  - More test coverage for both unit and integration tests.
- Pipelines should be more robust against missed or delayed updates from the pipestance journal directory. Rather than timing out, mrp will now check whether the file exists if a notification wasn't seen.
- `mrjob` now includes its own memory usage in the statistics included in the jobinfo, which are used to generate the `_perf` summary.
Bug fixes:
- Fix a potential deadlock when mrp receives a signal (e.g. from `kill`) or a shutdown request over the API while it is in the middle of starting or restarting a pipeline.
- Fix a crash in `mrf --includes` if a stage called by a pipeline was not present in the transitive includes of the file defining the pipeline.
- Fix a bug in `mrf --includes` which resulted in duplicate declarations for existing user-defined file types.
- Updated npm dependencies.
- `mrjob` will now begin waiting on the profiling command (e.g. `perf record`) immediately, rather than waiting until the stage code finishes. This prevents zombie processes lying around if the profiling command finishes before the stage code.
Martian 3.1.0
Martian 3.1
This release extensively reworks Volatile Disk Recovery (VDR), adding two new language keywords (thus the bump of the minor version number); see below for details. A secondary focus on performance significantly reduces the memory footprint of `mrp` (especially important for users in cluster mode, where the submit host often may have more constrained resources than the compute nodes). Improvements to developer tools should make authoring of `mro` source code more convenient, and logging improvements should make debugging easier in the event of failures.
VDR (volatile disk recovery) Changes
VDR has been extensively overhauled. The general changes improve storage high-water-mark for all pipelines, without further modifications. Additionally, two new features have been added to further improve storage utilization and streamline development.
General changes
- "rolling" is now the default VDR mode.
- Each stage job (split, chunk, join) now has its own `$TEMPDIR`, which is cleaned up as soon as that stage phase has completed.
- If a volatile stage call's output is bound to the top-level pipeline outputs, rather than preventing VDR on that stage from happening at all, only prevent deletion of the files explicitly mentioned in the bound output.
- When determining whether it is safe to clean up a stage's files, only outputs containing paths to files within the stage's directory hierarchy are considered when looking for downstream stages.
- Stage metadata files are now accounted for in the storage high-water mark calculation.
New feature: strict-mode volatile
Stages may now declare themselves as being "strict-mode volatile" compatible:
    stage FOO(
        in  bam  in1,
        out bam  bamfile,
        out bai  index,
        out json summary,
    ) using (
        volatile = strict,
    )
In this mode, the `volatile` modifier on the call to the stage is ignored. In addition, rather than VDR being an all-or-nothing affair that doesn't run until all downstream stages have completed, each file in the stage's outputs is evaluated separately. In the example above, if `STAGE1` takes `bamfile` as input and `STAGE2` takes `summary` as input, `bamfile` can be deleted as soon as `STAGE1` completes, rather than waiting for `STAGE2` to complete so that both can be deleted. In addition, any files not specified in any of the stage's outputs (most commonly intermediate files generated by the chunks and merged by the join) are deleted as soon as the stage completes. In many cases these changes significantly reduce the storage high-water mark for a pipeline, and obviate the need for some weird hacks such as creating intermediate stages which simply copy selected outputs from another stage in order to allow the earlier stage to be cleaned up.
One important note about this feature is that in many cases stage code may produce files whose existence implies the existence of other files. For example, `filename.bam` often implies the existence of `filename.bai`. If a downstream stage does not bind an output which mentions `filename.bai` then that file may be deleted by the time that stage runs. As another example, if one of the outputs is a file containing a list of other file names, those other files may also be deleted by VDR when in strict mode. This is why the feature is opt-in on a stage-by-stage basis. Any files produced by the stage which downstream stages are expected to read must be listed in the stage outputs, and those downstream stages must take those outputs as inputs.
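The per-output evaluation described above can be sketched in Python. This is a simplified model rather than Martian's VDR algorithm: the function name and the dictionary of consumers are assumptions, and the real implementation also tracks file paths, metadata, and pipeline-level bindings.

```python
def deletable_outputs(consumers, completed):
    """Return the stage outputs whose files may be removed now.

    consumers maps each output name to the set of downstream stages
    that take it as input.  An output becomes deletable once every one
    of those stages has completed; an output nobody binds is deletable
    immediately.
    """
    done = set(completed)
    return sorted(
        name for name, stages in consumers.items()
        if stages <= done
    )


# Bindings from the example above: STAGE1 reads bamfile, STAGE2 reads
# summary, and nothing downstream binds index.
consumers = {
    "bamfile": {"STAGE1"},
    "summary": {"STAGE2"},
    "index": set(),
}

print(deletable_outputs(consumers, completed=[]))          # ['index']
print(deletable_outputs(consumers, completed=["STAGE1"]))  # ['bamfile', 'index']
```

The `index` entry shows the hazard from the note above: because no downstream stage binds it, it is eligible for deletion before `STAGE1` ever runs, even if `STAGE1` quietly expects the `.bai` file to sit next to the `.bam`.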
New feature: "retained" outputs
In some cases, a file produced by a stage isn't part of the formal outputs of a pipeline, but should still not be deleted for other reasons. For example, during debugging one might want to preserve the outputs of one stage in order to have them as an input when re-running a later stage that is being actively developed. As another example, some files may be small enough that the savings from deleting them are too small to justify making the outputs of a pipeline harder to debug later. There are two ways to prevent such files from being cleaned up by VDR:
Pipeline retains
    pipeline BAR(
        in  bam input1,
        out bam output1,
    )
    {
        call FOO(
            input1 = self.input1,
        )
        call FOO as BAZ(
            input2 = FOO.output1,
        )
        return (
            output1 = BAZ.output1,
        )
        retain (
            FOO.output1
        )
    }
This specifies that `output1` of this pipeline's call to `FOO` should never be deleted, for example if one wants to be able to later re-run `BAZ`. This should be preferred when one wishes to preserve a stage output for development purposes, first because it puts the `retain` directive closer to where the output may be reused later, and second because the stage in question might be called in other cases (such as aliased to `BAZ` in this example, or from other pipelines) which do not need to retain the output.
Stage retains
    stage FOO(
        in  int  input1,
        out bam  output1,
        out json summary,
    ) using (
        volatile = strict,
    ) retain (
        summary,
    )
This specifies that VDR should never delete `summary`. This should be used in the case where a file should always be preserved for potential later inspection.
Runtime improvements
User-facing improvements
- The memory and CPU consumption of `mrp` has been reduced, especially for very large pipelines, and in cases where stages create large output objects.
- There is now a timeout, configurable with the `--retry-wait` command line flag, between when `mrp` observes a potentially-transient failure and when it retries the failure. In many cases (for example cluster-mode jobs running on a remote machine which was taken offline) the failures are clustered, and waiting a short time allows all of the failures to be dealt with at once. The default wait time is 1 second.
- The web UI now dynamically lists top-level metadata files (such as log) rather than having a hard-coded list which included files which are often not present before the pipestance completes.
- The web UI will now show files in the `/extras` directory of the pipestance. This is intended primarily for outputs of on-finish hooks.
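The benefit of waiting before a retry can be illustrated with a small Python model. It is not a description of how `mrp` schedules retries internally; the function name and the timestamps are hypothetical, and the model only shows how a short wait lets clustered failures be handled in one pass.

```python
def retry_batches(failure_times, retry_wait=1.0):
    """Group failure timestamps (in seconds) into retry passes.

    A pass is triggered retry_wait seconds after the first failure it
    covers, and picks up every failure observed up to that moment.
    """
    batches = []
    for t in sorted(failure_times):
        if batches and t <= batches[-1][0] + retry_wait:
            batches[-1].append(t)
        else:
            batches.append([t])
    return batches


# Three jobs on the same node fail within a second; a fourth failure
# happens much later and is handled on its own.
print(retry_batches([10.0, 10.2, 10.9, 30.0]))
# [[10.0, 10.2, 10.9], [30.0]]
```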
Improvements for Pipeline-Developers
- `mrp` can now run stage invocations as well as pipeline invocations. `mrs` now exists only as a symlink to `mrp` for backwards compatibility. This eliminates the feature gap between `mrs` and `mrp`, including for example restartability and user interface, as well as reducing the maintenance overhead involved in having separate binaries.
- Performance profiling modes are now configurable through `jobmanagers/config.json`. Each profiling mode may specify an executable to run to attach to the stage code (such as `perf`) and environment variables (for example `HEAPPROFILE` may be used to enable tcmalloc's heap profiler), in addition to any profiling built into the stage adapter itself.
- The default event collection for `--profile=perf` no longer includes `bpf-output` events. This was not working in most cases.
- Profiling mode may now be specified in the overrides json file (used with the `--overrides` flag) to enable or disable it for an individual stage.
- The `_mrosource` output now includes comments indicating file boundaries from the original source code with merged `@include`s.
- More logging about the environment when `mrp` starts up
  - Log a few more environment variables (`MALLOC_*` and `RUST_*`).
  - The filesystem type and available space and inode count are now logged on startup.
- When `mrp` is run with `--monitor`, stages with overly-large outputs (512MB, or 1GB/number of chunks for chunk outputs) will now fail rather than potentially causing mrp to run out of memory while trying to parse them.
- The stage code parent process, `mrjob`, now polls memory and IO usage much more frequently and efficiently. This provides a more accurate measurement of peak memory usage for multithreaded stages.
- When attempting to reattach to an existing pipestance, verify that the mro source hasn't changed in "significant" ways.
- Stage and pipeline `_invocation` files now only `@include` the mro file defining that stage or pipeline, rather than the complete set of includes for the whole pipeline.
- Stage and pipeline `_invocation` files now properly represent call aliasing (e.g. `call FOO as BAR`).
- The python adapter now exposes two new methods, `get_threads_allocation` and `get_memory_allocation`, for stages to use in determining what they're allowed to use in cases where their request was large or dynamic.
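The output size thresholds quoted in the `--monitor` item above work out as follows. This Python sketch rests on two assumptions that the notes do not settle: that the units are binary (MiB/GiB), and that the chunk limit is simply 1GB divided by the chunk count with no further cap. The function names are hypothetical.

```python
MB = 1024 ** 2  # assumption: binary units; the notes only say "512mb" and "1GB"
GB = 1024 ** 3


def outs_size_limit(chunk_count=None):
    """Largest outputs file accepted under --monitor, in bytes.

    chunk_count is None for non-chunk outputs; for chunk outputs the
    1GB budget is shared across all chunks.
    """
    if chunk_count is None:
        return 512 * MB
    return GB // max(chunk_count, 1)


def check_outs_size(size_bytes, chunk_count=None):
    """Raise ValueError if an outputs file exceeds its limit."""
    limit = outs_size_limit(chunk_count)
    if size_bytes > limit:
        raise ValueError(
            f"outputs are {size_bytes} bytes, over the {limit} byte limit")
    return size_bytes


print(outs_size_limit())   # 536870912
print(outs_size_limit(8))  # 134217728
```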
Improvements to the compiler/parser (`mrc`) and formatter (`mrf`)
- The parser has been extensively rewritten to provide more useful (and correct) line number outputs for errors, especially in cases with complicated webs of `@include` directives.
- The parser is now substantially faster, and uses less memory.
- `mrf` has a new flag, `--includes`, which will remove `@include` directives which are not required, and will attempt to add `@include` and `filetype` directives which are missing. It is inspired by the clang iwyu tool.
- `mrf` now transforms call modifiers to the new syntax introduced in Martian 3.0. That is,
      call volatile FOO(
          ...
      )
  becomes
      call FOO(
          ...
      ) using (
          volatile = true,
      )
- `mrf` now sorts keys in `map` literals.
- `mrf` now inserts a trailing comma in map and array literals, e.g.
      [
          1,
          2,
      ]
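The two literal-formatting rules above (sorted map keys, trailing commas) can be mimicked in a few lines of Python. This only reproduces the output style for flat literals; it is not how `mrf` is implemented, and the function names and four-space indent are assumptions.

```python
import json


def format_array(values, indent="    "):
    """Render a flat array literal with one element per line and a trailing comma."""
    lines = ["["]
    lines += [f"{indent}{json.dumps(v)}," for v in values]
    lines.append("]")
    return "\n".join(lines)


def format_map(mapping, indent="    "):
    """Render a flat map literal with sorted keys and trailing commas."""
    lines = ["{"]
    for key in sorted(mapping):
        lines.append(f"{indent}{json.dumps(key)}: {json.dumps(mapping[key])},")
    lines.append("}")
    return "\n".join(lines)


print(format_array([1, 2]))
print(format_map({"b": 2, "a": 1}))
```

Sorting keys and always emitting the trailing comma keeps diffs small: adding an entry touches one line, regardless of where the entry lands.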
Bug fixes
- The web UI now times out stale connections. Previously a buggy or malicious client could introduce a denial of service condition by opening more and more socket conn...
Martian 3.1.0 release candidate 3
Martian 3.1
This release extensively reworks Volatile Disk Recovery (VDR), adding two new language keywords (thus the bump of the minor version number); see below for details. A secondary focus on performance significantly reduces the memory footprint of `mrp` (especially important for users in cluster mode, where the submit host often may have more constrained resources than the compute nodes). Improvements to developer tools should make authoring of `mro` source code more convenient, and logging improvements should make debugging easier in the event of failures.
VDR (volatile disk recovery) Changes
VDR has been extensively overhauled. The general changes improve storage high-water-mark for all pipelines, without further modifications. Additionally, two new features have been added to further improve storage utilization and streamline development.
General changes
- "rolling" is now the default VDR mode.
- Each stage job (split, chunk, join) now has its own `$TEMPDIR`, which is cleaned up as soon as that stage phase has completed.
- If a volatile stage call's output is bound to the top-level pipeline outputs, rather than preventing VDR on that stage from happening at all, only prevent deletion of the files explicitly mentioned in the bound output.
- When determining whether it is safe to clean up a stage's files, only outputs containing paths to files within the stage's directory hierarchy are considered when looking for downstream stages.
- Stage metadata files are now accounted for in the storage high-water mark calculation.
New feature: strict-mode volatile
Stages may now declare themselves as being "strict-mode volatile" compatible:
    stage FOO(
        in  bam  in1,
        out bam  bamfile,
        out bai  index,
        out json summary,
    ) using (
        volatile = strict,
    )
In this mode, the `volatile` modifier on the call to the stage is ignored. In addition, rather than VDR being an all-or-nothing affair that doesn't run until all downstream stages have completed, each file in the stage's outputs is evaluated separately. In the example above, if `STAGE1` takes `bamfile` as input and `STAGE2` takes `summary` as input, `bamfile` can be deleted as soon as `STAGE1` completes, rather than waiting for `STAGE2` to complete so that both can be deleted. In addition, any files not specified in any of the stage's outputs (most commonly intermediate files generated by the chunks and merged by the join) are deleted as soon as the stage completes. In many cases these changes significantly reduce the storage high-water mark for a pipeline, and obviate the need for some weird hacks such as creating intermediate stages which simply copy selected outputs from another stage in order to allow the earlier stage to be cleaned up.
One important note about this feature is that in many cases stage code may produce files whose existence implies the existence of other files. For example, `filename.bam` often implies the existence of `filename.bai`. If a downstream stage does not bind an output which mentions `filename.bai` then that file may be deleted by the time that stage runs. As another example, if one of the outputs is a file containing a list of other file names, those other files may also be deleted by VDR when in strict mode. This is why the feature is opt-in on a stage-by-stage basis. Any files produced by the stage which downstream stages are expected to read must be listed in the stage outputs, and those downstream stages must take those outputs as inputs.
New feature: "retained" outputs
In some cases, a file produced by a stage isn't part of the formal outputs of a pipeline, but should still not be deleted for other reasons. For example, during debugging one might want to preserve the outputs of one stage in order to have them as an input when re-running a later stage that is being actively developed. As another example, some files may be small enough that the savings from deleting them are too small to justify making the outputs of a pipeline harder to debug later. There are two ways to prevent such files from being cleaned up by VDR:
Pipeline retains
    pipeline BAR(
        in  bam input1,
        out bam output1,
    )
    {
        call FOO(
            input1 = self.input1,
        )
        call FOO as BAZ(
            input2 = FOO.output1,
        )
        return (
            output1 = BAZ.output1,
        )
        retain (
            FOO.output1
        )
    }
This specifies that `output1` of this pipeline's call to `FOO` should never be deleted, for example if one wants to be able to later re-run `BAZ`. This should be preferred when one wishes to preserve a stage output for development purposes, first because it puts the `retain` directive closer to where the output may be reused later, and second because the stage in question might be called in other cases (such as aliased to `BAZ` in this example, or from other pipelines) which do not need to retain the output.
Stage retains
    stage FOO(
        in  int  input1,
        out bam  output1,
        out json summary,
    ) using (
        volatile = strict,
    ) retain (
        summary,
    )
This specifies that VDR should never delete `summary`. This should be used in the case where a file should always be preserved for potential later inspection.
Runtime improvements
User-facing improvements
- The memory and CPU consumption of `mrp` has been reduced, especially for very large pipelines, and in cases where stages create large output objects.
- There is now a timeout, configurable with the `--retry-wait` command line flag, between when `mrp` observes a potentially-transient failure and when it retries the failure. In many cases (for example cluster-mode jobs running on a remote machine which was taken offline) the failures are clustered, and waiting a short time allows all of the failures to be dealt with at once. The default wait time is 1 second.
- The web UI now dynamically lists top-level metadata files (such as log) rather than having a hard-coded list which included files which are often not present before the pipestance completes.
- The web UI will now show files in the `/extras` directory of the pipestance. This is intended primarily for outputs of on-finish hooks.
Improvements for Pipeline-Developers
- `mrp` can now run stage invocations as well as pipeline invocations. `mrs` now exists only as a symlink to `mrp` for backwards compatibility. This eliminates the feature gap between `mrs` and `mrp`, including for example restartability and user interface, as well as reducing the maintenance overhead involved in having separate binaries.
- Performance profiling modes are now configurable through `jobmanagers/config.json`. Each profiling mode may specify an executable to run to attach to the stage code (such as `perf`) and environment variables (for example `HEAPPROFILE` may be used to enable tcmalloc's heap profiler), in addition to any profiling built into the stage adapter itself.
- The default event collection for `--profile=perf` no longer includes `bpf-output` events. This was not working in most cases.
- Profiling mode may now be specified in the overrides json file (used with the `--overrides` flag) to enable or disable it for an individual stage.
- The `_mrosource` output now includes comments indicating file boundaries from the original source code with merged `@include`s.
- More logging about the environment when `mrp` starts up
  - Log a few more environment variables (`MALLOC_*` and `RUST_*`).
  - The filesystem type and available space and inode count are now logged on startup.
- When `mrp` is run with `--monitor`, stages with overly-large outputs (512MB, or 1GB/number of chunks for chunk outputs) will now fail rather than potentially causing mrp to run out of memory while trying to parse them.
- The stage code parent process, `mrjob`, now polls memory and IO usage much more frequently and efficiently. This provides a more accurate measurement of peak memory usage for multithreaded stages.
- When attempting to reattach to an existing pipestance, verify that the mro source hasn't changed in "significant" ways.
- Stage and pipeline `_invocation` files now only `@include` the mro file defining that stage or pipeline, rather than the complete set of includes for the whole pipeline.
- Stage and pipeline `_invocation` files now properly represent call aliasing (e.g. `call FOO as BAR`).
- The python adapter now exposes two new methods, `get_threads_allocation` and `get_memory_allocation`, for stages to use in determining what they're allowed to use in cases where their request was large or dynamic.
Improvements to the compiler/parser (`mrc`) and formatter (`mrf`)
- The parser has been extensively rewritten to provide more useful (and correct) line number outputs for errors, especially in cases with complicated webs of `@include` directives.
- The parser is now substantially faster, and uses less memory.
- `mrf` has a new flag, `--includes`, which will remove `@include` directives which are not required, and will attempt to add `@include` and `filetype` directives which are missing. It is inspired by the clang iwyu tool.
- `mrf` now transforms call modifiers to the new syntax introduced in Martian 3.0. That is,
      call volatile FOO(
          ...
      )
  becomes
      call FOO(
          ...
      ) using (
          volatile = true,
      )
- `mrf` now sorts keys in `map` literals.
- `mrf` now inserts a trailing comma in map and array literals, e.g.
      [
          1,
          2,
      ]
Bug fixes
- The web UI now times out stale connections. Previously a buggy or malicious client could introduce a denial of service condition by opening more and more socket conn...
Martian 3.1.0 release candidate 2
Martian 3.1
This release extensively reworks Volatile Disk Recovery (VDR), adding two new language keywords (thus the bump of the minor version number); see below for details. A secondary focus on performance significantly reduces the memory footprint of `mrp` (especially important for users in cluster mode, where the submit host often may have more constrained resources than the compute nodes). Improvements to developer tools should make authoring of `mro` source code more convenient, and logging improvements should make debugging easier in the event of failures.
VDR (volatile disk recovery) Changes
VDR has been extensively overhauled. The general changes improve storage high-water-mark for all pipelines, without further modifications. Additionally, two new features have been added to further improve storage utilization and streamline development.
General changes
- "rolling" is now the default VDR mode.
- Each stage job (split, chunk, join) now has its own `$TEMPDIR`, which is cleaned up as soon as that stage phase has completed.
- If a volatile stage call's output is bound to the top-level pipeline outputs, rather than preventing VDR on that stage from happening at all, only prevent deletion of the files explicitly mentioned in the bound output.
- When determining whether it is safe to clean up a stage's files, only outputs containing paths to files within the stage's directory hierarchy are considered when looking for downstream stages.
- Stage metadata files are now accounted for in the storage high-water mark calculation.
New feature: strict-mode volatile
Stages may now declare themselves as being "strict-mode volatile" compatible:
    stage FOO(
        in  bam  in1,
        out bam  bamfile,
        out bai  index,
        out json summary,
    ) using (
        volatile = strict,
    )
In this mode, the `volatile` modifier on the call to the stage is ignored. In addition, rather than VDR being an all-or-nothing affair that doesn't run until all downstream stages have completed, each file in the stage's outputs is evaluated separately. In the example above, if `STAGE1` takes `bamfile` as input and `STAGE2` takes `summary` as input, `bamfile` can be deleted as soon as `STAGE1` completes, rather than waiting for `STAGE2` to complete so that both can be deleted. In addition, any files not specified in any of the stage's outputs (most commonly intermediate files generated by the chunks and merged by the join) are deleted as soon as the stage completes. In many cases these changes significantly reduce the storage high-water mark for a pipeline, and obviate the need for some weird hacks such as creating intermediate stages which simply copy selected outputs from another stage in order to allow the earlier stage to be cleaned up.
One important note about this feature is that in many cases stage code may produce files whose existence implies the existence of other files. For example, `filename.bam` often implies the existence of `filename.bai`. If a downstream stage does not bind an output which mentions `filename.bai` then that file may be deleted by the time that stage runs. As another example, if one of the outputs is a file containing a list of other file names, those other files may also be deleted by VDR when in strict mode. This is why the feature is opt-in on a stage-by-stage basis. Any files produced by the stage which downstream stages are expected to read must be listed in the stage outputs, and those downstream stages must take those outputs as inputs.
New feature: "retained" outputs
In some cases, a file produced by a stage isn't part of the formal outputs of a pipeline, but should still not be deleted for other reasons. For example, during debugging one might want to preserve the outputs of one stage in order to have them as an input when re-running a later stage that is being actively developed. As another example, some files may be small enough that the savings from deleting them are too small to justify making the outputs of a pipeline harder to debug later. There are two ways to prevent such files from being cleaned up by VDR:
Pipeline retains
pipeline BAR(
    in bam input1,
    out bam output1,
)
{
    call FOO(
        input1 = self.input1,
    )
    call FOO as BAZ(
        input2 = FOO.output1,
    )
    return (
        output1 = BAZ.output1,
    )
    retain (
        FOO.output1
    )
}
This specifies that output1 of this pipeline's call to FOO should never be deleted, for example if one wants to be able to later re-run BAZ. This should be preferred when one wishes to preserve a stage output for development purposes, first because it puts the retain directive closer to where the output may be reused later, and second because the stage in question might be called in other cases (such as aliased to BAZ in this example, or from other pipelines) which do not need to retain the output.
Stage retains
stage FOO(
    in int input1,
    out bam output1,
    out json summary,
) using (
    volatile = strict,
) retain (
    summary,
)
This specifies that VDR should never delete summary. This should be used in the case where a file should always be preserved for potential later inspection.
Runtime improvements
User-facing improvements
- The memory and CPU consumption of mrp has been reduced, especially for very large pipelines, and in cases where stages create large output objects.
- There is now a delay, configurable with the --retry-wait command line flag, between when mrp observes a potentially-transient failure and when it retries the failure. In many cases (for example cluster-mode jobs running on a remote machine which was taken offline) the failures are clustered, and waiting a short time allows all of the failures to be dealt with at once. The default wait time is 1 second.
- The web UI now dynamically lists top-level metadata files (such as log) rather than having a hard-coded list which included files which are often not present before the pipestance completes.
- The web UI will now show files in the /extras directory of the pipestance. This is intended primarily for outputs of on-finish hooks.
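The effect of the retry wait on clustered failures can be pictured with a small Python sketch. This is only an illustration of the idea, not mrp's implementation: the function name batch_failures and the windowing rule (a window opens at the first unhandled failure and stays open for the wait time) are assumptions made for the example.

```python
def batch_failures(failure_times, retry_wait=1.0):
    """Group failure timestamps (in seconds) into retry batches.

    Hypothetical model: a batch opens at the first failure that has not
    been handled yet and closes retry_wait seconds later. Every failure
    observed inside that window is retried together.
    """
    batches = []
    for t in sorted(failure_times):
        if batches and t <= batches[-1][0] + retry_wait:
            # Still inside the window opened by the batch's first failure.
            batches[-1].append(t)
        else:
            # Window has closed (or there is none yet): open a new batch.
            batches.append([t])
    return batches


if __name__ == "__main__":
    # Three failures within one second are handled as one batch.
    print(batch_failures([0.0, 0.2, 0.9, 5.0]))
```

With a wait of zero, every failure with a distinct timestamp would be retried on its own, which is roughly the behavior the wait is meant to avoid.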
Improvements for Pipeline-Developers
- mrp can now run stage invocations as well as pipeline invocations. mrs now exists only as a symlink to mrp for backwards compatibility. This eliminates the feature gap between mrs and mrp, including for example restartability and user interface, as well as reducing the maintenance overhead involved in having separate binaries.
- The default event collection for --profile=perf no longer includes bpf-output events. This was not working in most cases.
- For users who desire more control over perf profile recording with --profile=perf, the environment variable MRO_PERF_ARGS allows one to specify the command line to perf record. This overrides MRO_PERF_EVENTS, MRO_PERF_FREQ, and MRO_PERF_DURATION. The command that runs will be perf record -p <pid> -o <path/to/job/_perf.data> $MRO_PERF_ARGS. The same behavior as setting those variables can thus be achieved by setting MRO_PERF_ARGS="-g -e $MRO_PERF_EVENTS -F $MRO_PERF_FREQ sleep $MRO_PERF_DURATION".
- The _mrosource output now includes comments indicating file boundaries from the original source code with merged @includes.
- More logging about the environment when mrp starts up:
  - Log a few more environment variables (MALLOC_* and RUST_*).
  - The filesystem type and available space and inode count are now logged on startup.
- When mrp is run with --monitor, stages with overly-large outputs (512MB, or 1GB/number of chunks for chunk outputs) will now fail rather than potentially causing mrp to run out of memory while trying to parse them.
- The stage code parent process, mrjob, now polls memory and IO usage much more frequently and efficiently. This provides a more accurate measurement of peak memory usage for multithreaded stages.
- When attempting to reattach to an existing pipestance, mrp now verifies that the mro source hasn't changed in "significant" ways.
- Stage and pipeline _invocation files now only @include the mro file defining that stage or pipeline, rather than the complete set of includes for the whole pipeline.
- Stage and pipeline _invocation files now properly represent call aliasing (e.g. call FOO as BAR).
- The python adapter now exposes two new methods, get_threads_allocation and get_memory_allocation, for stages to use in determining what they're allowed to use in cases where their request was large or dynamic.
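The output size limit applied under --monitor can be written out as a short Python sketch. The helper names max_outs_bytes and outs_size_ok are hypothetical, and the use of binary units (1 MB = 1024 * 1024 bytes) and integer division are assumptions; the notes above only give the figures 512MB and 1GB/number of chunks.

```python
MIB = 1024 * 1024  # assumption: binary megabytes
GIB = 1024 * MIB


def max_outs_bytes(chunk_count=None):
    """Largest outputs size accepted, in bytes.

    chunk_count is None for non-chunk outputs (512MB limit). For chunk
    outputs the limit is 1GB divided by the number of chunks.
    """
    if chunk_count is None:
        return 512 * MIB
    if chunk_count < 1:
        raise ValueError("chunk_count must be at least 1")
    return GIB // chunk_count


def outs_size_ok(size_bytes, chunk_count=None):
    """True if outputs of this size would be accepted."""
    return size_bytes <= max_outs_bytes(chunk_count)


if __name__ == "__main__":
    # A 300MB output fits the 512MB limit, but not the per-chunk limit
    # of a stage that was split into 4 chunks (256MB each).
    print(outs_size_ok(300 * MIB), outs_size_ok(300 * MIB, chunk_count=4))
```

Under these assumptions the per-chunk limit shrinks as the chunk count grows, so a stage with many chunks needs to keep each chunk's outputs small.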
Improvements to the compiler/parser (mrc) and formatter (mrf)
- The parser has been extensively rewritten to provide more useful (and correct) line number outputs for errors, especially in cases with complicated webs of @include directives.
- The parser is now substantially faster, and uses less memory.
- mrf has a new flag, --includes, which will remove @include directives which are not required, and will attempt to add @include and filetype directives which are missing. It is inspired by the clang iwyu tool.
- mrf now transforms call modifiers to the new syntax introduced in Martian 3.0. That is,
    call volatile FOO(
        ...
    )
  becomes
    call FOO(
        ...
    ) using (
        volatile = true,
    )
- mrf now sorts keys in map literals.
- mrf now inserts a trailing comma in map and array literals, e.g.
    [
        1,
        2,
    ]
Bug fixes
- The web UI now times out stale connections. Previously a buggy or malicious client could introduce a denial of service condition by opening more and more socket connections until the server ran out of file handles....
Martian 3.1.0 release candidate 1
Martian 3.1
This release extensively reworks Volatile Disk Recovery (VDR), adding two new language keywords (thus the bump of the minor version number) - see below for details. A secondary focus on performance significantly reduces the memory footprint of mrp (especially important for users in cluster mode, where the submit host often may have more constrained resources than the compute nodes). Improvements to developer tools should make authoring of mro source code more convenient, and logging improvements should make debugging easier in the event of failures.
VDR (volatile disk recovery) Changes
VDR has been extensively overhauled. The general changes improve storage high-water-mark for all pipelines, without further modifications. Additionally, two new features have been added to further improve storage utilization and streamline development.
General changes
- "rolling" is now the default VDR mode.
- Each stage job (split, chunk, join) now has its own $TEMPDIR, which is cleaned up as soon as that stage phase has completed.
- If a volatile stage call's output is bound to the top-level pipeline outputs, then rather than preventing VDR on that stage from happening at all, mrp now only prevents deletion of the files explicitly mentioned in the bound output.
- When determining whether it is safe to clean up a stage's files, only outputs containing paths to files within the stage's directory hierarchy are considered when looking for downstream stages.
- Stage metadata files are now accounted for in the storage high-water mark calculation.
New feature: strict-mode volatile
Stages may now declare themselves as being "strict-mode volatile" compatible:
stage FOO(
    in bam in1,
    out bam bamfile,
    out bai index,
    out json summary,
) using (
    volatile = strict,
)
In this mode, the volatile modifier on the call to the stage is ignored. In addition, rather than VDR being an all-or-nothing affair that doesn't run until all downstream stages have completed, each file in the stage's outputs is evaluated separately. In the example above, if STAGE1 takes bamfile as input and STAGE2 takes summary as input, bamfile can be deleted as soon as STAGE1 completes, rather than waiting for STAGE2 to complete so that both can be deleted. In addition, any files not specified in any of the stage's outputs (most commonly intermediate files generated by the chunks and merged by the join) are deleted as soon as the stage completes. In many cases these changes significantly reduce the storage high-water mark for a pipeline, and obviate the need for some weird hacks such as creating intermediate stages which simply copy selected outputs from another stage in order to allow the earlier stage to be cleaned up.
One important note about this feature is that in many cases stage code may produce files whose existence implies the existence of other files. For example, filename.bam often implies the existence of filename.bai. If a downstream stage does not bind an output which mentions filename.bai then that file may be deleted by the time that stage runs. As another example, if one of the outputs is a file containing a list of other file names, those other files may also be deleted by VDR when in strict mode. This is why the feature is opt-in on a stage-by-stage basis. Any files produced by the stage which downstream stages are expected to read must be listed in the stage outputs, and those downstream stages must take those outputs as inputs.
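The per-output bookkeeping described above can be modeled in a few lines of Python. This is a simplified sketch rather than Martian's actual algorithm: the function deletable_outputs and its data layout are hypothetical, and it ignores outputs bound to the top-level pipeline, which would also be preserved.

```python
def deletable_outputs(consumers, completed, retained=frozenset()):
    """Return the names of stage outputs that are safe to delete.

    consumers: maps each output name of a strict-mode volatile stage to
               the set of downstream stages that bind it as an input.
    completed: set of downstream stages that have finished.
    retained:  output names that must never be deleted.
    An output is deletable once every stage that binds it has completed.
    """
    return sorted(
        name
        for name, stages in consumers.items()
        if name not in retained and stages <= completed
    )


if __name__ == "__main__":
    # Bindings for the FOO example: STAGE1 reads bamfile, STAGE2 reads
    # summary, and nothing binds index.
    foo_consumers = {
        "bamfile": {"STAGE1"},
        "index": set(),
        "summary": {"STAGE2"},
    }
    print(deletable_outputs(foo_consumers, completed={"STAGE1"}))
```

In this model index is deletable immediately because no downstream stage binds it, which is the filename.bai hazard described above: a stage that reads the index without binding it may find the file already gone.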
New feature: "retained" outputs
In some cases, a file produced by a stage isn't part of the formal outputs of a pipeline, but should still not be deleted for other reasons. For example, during debugging one might want to preserve the outputs of one stage in order to have them as an input when re-running a later stage that is being actively developed. As another example, some files may be small enough that the savings from deleting them are too small to justify making the outputs of a pipeline harder to debug later. There are two ways to prevent such files from being cleaned up by VDR:
Pipeline retains
pipeline BAR(
    in bam input1,
    out bam output1,
)
{
    call FOO(
        input1 = self.input1,
    )
    call FOO as BAZ(
        input2 = FOO.output1,
    )
    return (
        output1 = BAZ.output1,
    )
    retain (
        FOO.output1
    )
}
This specifies that output1 of this pipeline's call to FOO should never be deleted, for example if one wants to be able to later re-run BAZ. This should be preferred when one wishes to preserve a stage output for development purposes, first because it puts the retain directive closer to where the output may be reused later, and second because the stage in question might be called in other cases (such as aliased to BAZ in this example, or from other pipelines) which do not need to retain the output.
Stage retains
stage FOO(
    in int input1,
    out bam output1,
    out json summary,
) using (
    volatile = strict,
) retain (
    summary,
)
This specifies that VDR should never delete summary. This should be used in the case where a file should always be preserved for potential later inspection.
Runtime improvements
User-facing improvements
- The memory and CPU consumption of mrp has been reduced, especially for very large pipelines, and in cases where stages create large output objects.
- There is now a delay, configurable with the --retry-wait command line flag, between when mrp observes a potentially-transient failure and when it retries the failure. In many cases (for example cluster-mode jobs running on a remote machine which was taken offline) the failures are clustered, and waiting a short time allows all of the failures to be dealt with at once. The default wait time is 1 second.
- The web UI now dynamically lists top-level metadata files (such as log) rather than having a hard-coded list which included files which are often not present before the pipestance completes.
- The web UI will now show files in the /extras directory of the pipestance. This is intended primarily for outputs of on-finish hooks.
Improvements for Pipeline-Developers
- mrp can now run stage invocations as well as pipeline invocations. mrs now exists only as a symlink to mrp for backwards compatibility. This eliminates the feature gap between mrs and mrp, including for example restartability and user interface, as well as reducing the maintenance overhead involved in having separate binaries.
- The default event collection for --profile=perf no longer includes bpf-output events. This was not working in most cases.
- For users who desire more control over perf profile recording with --profile=perf, the environment variable MRO_PERF_ARGS allows one to specify the command line to perf record. This overrides MRO_PERF_EVENTS, MRO_PERF_FREQ, and MRO_PERF_DURATION. The command that runs will be perf record -p <pid> -o <path/to/job/_perf.data> $MRO_PERF_ARGS. The same behavior as setting those variables can thus be achieved by setting MRO_PERF_ARGS="-g -e $MRO_PERF_EVENTS -F $MRO_PERF_FREQ sleep $MRO_PERF_DURATION".
- When running with --zip --profile=perf, profiling outputs are no longer included in _metadata.zip.
- The _mrosource output now includes comments indicating file boundaries from the original source code with merged @includes.
- More logging about the environment when mrp starts up:
  - Log a few more environment variables (MALLOC_* and RUST_*).
  - The filesystem type and available space and inode count are now logged on startup.
- When mrp is run with --monitor, stages with overly-large outputs (512MB, or 1GB/number of chunks for chunk outputs) will now fail rather than potentially causing mrp to run out of memory while trying to parse them.
- The stage code parent process, mrjob, now polls memory and IO usage much more frequently and efficiently. This provides a more accurate measurement of peak memory usage for multithreaded stages.
- When attempting to reattach to an existing pipestance, mrp now verifies that the mro source hasn't changed in "significant" ways.
- Stage and pipeline _invocation files now only @include the mro file defining that stage or pipeline, rather than the complete set of includes for the whole pipeline.
- Stage and pipeline _invocation files now properly represent call aliasing (e.g. call FOO as BAR).
- The python adapter now exposes two new methods, get_threads_allocation and get_memory_allocation, for stages to use in determining what they're allowed to use in cases where their request was large or dynamic.
Improvements to the compiler/parser (mrc) and formatter (mrf)
- The parser has been extensively rewritten to provide more useful (and correct) line number outputs for errors, especially in cases with complicated webs of @include directives.
- The parser is now substantially faster, and uses less memory.
- mrf has a new flag, --includes, which will remove @include directives which are not required, and will attempt to add @include and filetype directives which are missing. It is inspired by the clang iwyu tool.
- mrf now transforms call modifiers to the new syntax introduced in Martian 3.0. That is,
    call volatile FOO(
        ...
    )
  becomes
    call FOO(
        ...
    ) using (
        volatile = true,
    )
- mrf now sorts keys in map literals.
- mrf now inserts a trailing comma in map and array literals, e.g.
    [
        1,
        2,
    ]
Bug fixes
- The web UI now times out stale connections. Previously a buggy or malicious client could introduce a denial ...
v2.3.3: Fix web UI in Firefox.
This is a manual cherry-pick of d5f6fe982014edf3215f7c07e11d4fa5a773c1c5
Martian 3.0.0 release.
This is a major version change and includes updates and improvements to
the syntax as well as the usual assortment of bug fixes and
performance improvements.
This release of the Martian framework is packaged with 10X Genomics'
Cellranger DNA 1.0.0 release.
The repository has also been reorganized to conform to the "go way" of
code organization, so it can now be fetched and included with go get
without any weird messing around with GOPATH. Comments have been
substantially improved throughout the code base, so godoc is no longer
useless.
New MRO syntax features
- Stages may now change their default thread and memory reservation used for
the split, chunks and join. The split can, of course, still override
the threads/memory for the chunks and join as before. This eliminates
many cases where a stage would split/join simply to set a memory
reservation (although this is still required if the reservation logic is
dynamic).
- @include lines are now deduplicated. Transitively including an mro
file will no longer result in errors due to duplicate definitions.
This allows developers to obey best practices by directly including
every mro which defines a stage or pipeline they depend on.
- Stage split definitions may now declare chunk outputs as well as
inputs.
- Calls may now bind array-type input parameters to an array of elements
specified in mro, e.g.
    call FOO(
        argv = [ BAR.out1, BAZ.out1 ],
    )
- Calls may now alias the name of a stage or pipeline to allow, for
example, a pipeline to call the same stage or pipeline multiple times with
different inputs, e.g.
    call FOO as FOO1 (...)
    call FOO as FOO2 (...)
- Add a new syntax for call modifiers (e.g. preflight, local, volatile):
    call FOO(
        ...
    ) using (
        volatile = true,
    )
This allows comments to be placed around the modifiers, and also
allows...
- Calls may now be conditionally disabled based on the output of a prior
stage:
    call FOO(
        ...
    ) using (
        disabled = BAR.disable_foo,
    )
- A new --strict flag to mrc, and --strict={disabled|log|alarm|error}
flag for mrp, now check inputs and outputs more strictly. "strict"
mode is disabled by default for backwards compatibility, but in other
modes it fixes some oversights in the pipeline type checking logic
including:
  - mem_gb and threads were always truncated to integers. Now
    non-integral values are errors.
  - In several cases, arrays of a given type were treated
    interchangeably as scalar types. Now they're not.
  - In strict mode, undeclared chunk inputs or outputs, or chunk input
    or output values of an incorrect type, are considered errors.
- A stage or split may now request a negative number for threads or
memory. In cluster mode, the absolute value is used. In local mode,
the entire --localmem or --localcores allocation is given to the stage,
if that value is higher than the absolute value of the request.
Otherwise, an error is raised due to insufficient resources. This
provides more reliable behavior than the previous preferred strategy of
asking for e.g. 2TB and counting on the job manager to cut it down to
the actual availability.
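The rule for negative thread or memory requests can be sketched in Python as follows. The function resolve_request is hypothetical and is not part of Martian; treating a local allocation exactly equal to the absolute value of the request as sufficient is also an assumption, since the text above only says "higher than".

```python
def resolve_request(requested, local_limit=None):
    """Resolve a threads or memory (GB) request to a concrete reservation.

    requested:   the value asked for by the stage or split; may be negative.
    local_limit: the --localcores or --localmem allocation in local mode,
                 or None in cluster mode.
    """
    if requested >= 0:
        # Ordinary request: nothing special to do in this sketch.
        return requested
    minimum = -requested
    if local_limit is None:
        # Cluster mode: the absolute value is used.
        return minimum
    if local_limit >= minimum:  # assumption: equality counts as enough
        # Local mode: the stage gets the entire local allocation.
        return local_limit
    raise RuntimeError(
        "insufficient resources: need at least %d, have %d"
        % (minimum, local_limit)
    )


if __name__ == "__main__":
    print(resolve_request(-4))       # cluster mode
    print(resolve_request(-4, 16))   # local mode with 16 available
```

Read this way, a negative value means "at least this much, and everything available when running locally", which is what replaces the old habit of requesting an unrealistically large amount.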
Developer quality of life improvements:
- MRO "compile" errors now include the full include stack, rather than
just the location of the failure.
- The mrp webserver now exposes the /debug/pprof endpoint for profiling
and debugging a running instance.
- Added a new tool, mro2go, which can automatically generate
data structures for serializing and deserializing stage inputs and
outputs from mro code, making authoring of native go stages much easier.
Bug fixes:
- A situation where the cluster mode --maxjobs limit could "leak" jobs
until it would stop queuing more jobs to the cluster has been fixed.
- The graph UI should now render correctly in Firefox.
- Several cases where the python stage code adapter would fail around
unicode strings should be resolved.
- A case where a stage in the MROPATH which was not transitively
included by the top-level call mro could replace the correct call has
been eliminated.
- mrjob now forces a filesystem sync of the job's _outs file before
writing the _complete file, to avoid race conditions observed on certain
aggressively configured network file systems.
- In cluster mode, the default --localmem will no longer ever be more
than the available memory at the start of execution.
- Cyclic mro @include paths will now result in an error rather than an
infinite loop.
- http API commands to terminate the pipestance no longer leave it in an
un-restartable state.
- Fix a case where mrp would attempt to restart a pipestance which had
not failed, resulting in a failure.
- --jobresources is now correctly applied.
Run-time improvements:
- mrp now tracks the number of posix processes owned by the user and
compares it to the current process rlimit (ulimit -u), throttling the
number of spawned jobs if the user is approaching that limit. It will
also issue a warning at startup if the difference between the current process
count and the process rlimit is small compared to --localcores. This
should mitigate the frequently-observed issues with massive computers with
64+ cores, but with the default process ulimit of 1024.
- The default memory reservation for all jobs is now 1GB. Stages which
need more may specify that either in the mro definition or dynamically
in their split stage.
- mrp now proceeds immediately between the split, chunk, and join phases
of a stage rather than waiting for the next iteration of the run loop.
This is a trivial speedup compared to most pipeline run times, but
improves the cache hit rate for processing of stage inputs and outputs.
- mrp no longer recomputes the bindings for a stage separately when
starting each chunk of that stage. This can save a very large amount of
memory and time for stages with many chunks and large input argument
structures.
- The --never-local flag causes mrp to ignore the local modifier on
non-preflight stages. This may be important when mrp is running on a
submit host with limited resources.
- mrp's steady-state thread and memory usage has been reduced substantially.
For most pipelines, the rss usage should stay under 30MB most of the time, and
the number of posix threads should also be reduced.
- Walking directory trees to, for example, delete files during VDR is now
faster, particularly on network file systems.
- Minor improvements to the rendering speed for the web UI.
- The web UI now has a favicon.
- The web UI for the pipeline graph now only renders a single line for
each dependency, rather than a line for each argument.
- Volatile Disk Recovery's dependency tracking now ignores arguments of
types int, float, and bool when considering whether intermediate files
can be deleted, since those cannot contain file names.
- If a (non-int,float,bool) output argument of a stage is depended on by
the top-level pipeline, the stage will not be cleaned up by VDR. This
should make VDR safe to enable for nearly all stages.
- mrjob now attempts to track the I/O syscall and byte rates for stage
code. Unfortunately, due to limitations of the Linux kernel's reporting of
such metrics, this is only accurate for block devices (e.g. local disk), as
opposed to for example NFS mounts.
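The process-count throttling described in the first item of this list can be illustrated with a minimal Python sketch. The function spawnable_jobs and the procs_per_job parameter are hypothetical; the notes do not say how many processes mrp budgets per job, so one process per job is assumed here.

```python
def spawnable_jobs(process_count, process_rlimit, wanted, procs_per_job=1):
    """How many of the wanted jobs can start without hitting the rlimit.

    process_count:  processes currently owned by the user.
    process_rlimit: the user's process limit (ulimit -u).
    wanted:         number of jobs the scheduler would like to start.
    procs_per_job:  hypothetical number of processes each job needs.
    """
    headroom = max(process_rlimit - process_count, 0)
    return min(wanted, headroom // procs_per_job)


if __name__ == "__main__":
    # A 64-core machine with the default limit of 1024 and 1000
    # processes already running can only start 24 more jobs.
    print(spawnable_jobs(1000, 1024, wanted=64))
```

On Linux, a Python program could read the soft limit for the first two arguments with resource.getrlimit(resource.RLIMIT_NPROC); counting the user's current processes is left out of the sketch.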