diff --git a/docs/release-notes.rst b/docs/release-notes.rst index 76e834db4d5..28b18fab770 100644 --- a/docs/release-notes.rst +++ b/docs/release-notes.rst @@ -10,6 +10,61 @@ Version 0.19 ************** +Version 0.19.2 +============== + +**Release Date:** August 26, 2022 + +**Breaking Changes** + +- API: Response format for metrics has been standardized to return aggregated and per-batch metrics + in a uniform way. The ``GetTrialWorkloads`` and ``GetTrials`` API response formats have changed. + The ``ReportTrialTrainingMetrics`` and ``ReportTrialValidationMetrics`` API request formats have + changed as well. + +- API: The ``GetJobs`` request format for pagination has changed. Instead of being contained in a + nested ``pagination`` object, pagination parameters are now top-level options, in line with the + other paginated API requests. + +- CLI: The ``det trial describe --json`` output format has changed. Fixed a bug where ``det trial + describe --json --metrics`` would fail for trials with a very large number of steps. + +- CLI: ``det job list`` will now return all jobs by default instead of a single API results page. + Use the ``--pages=1`` option to restore the old behavior. + +**Bug Fixes** + +- Kubernetes: Fixed an issue where restoring a job in a Kubernetes setup could crash the resource + manager. + +- CLI: Fixed a bug where ``det e set gc-policy`` would fail when deserializing an API response + because it wasn't adjusted for the new format. + +- Distributed training: Previously, experiments launched with + ``determined.launch.torch_distributed`` wrongly skipped ``torch.distributed.run`` for single-slot + trials and invoked training scripts directly. As a result, functions such as + ``torch.distributed.init_process_group()`` would fail, but only inside single-slot trials. Now, + ``determined.launch.torch_distributed`` conforms to the intended behavior as a wrapper around + ``torch.distributed.run`` and invokes ``torch.distributed.run`` on all training scripts.
+ +- Experiments with a single trial are now considered canceled when their trial is canceled or + killed. + +**Improvements** + +- API: ``GetTrialWorkloads`` can now optionally include per-batch metrics when the + ``includeBatchMetrics`` query parameter is set. + +**New Features** + +- Cluster: The enterprise edition of Determined (`HPE Machine Learning Development + Environment <https://www.hpe.com/us/en/solutions/artificial-intelligence/machine-learning-development-environment.html>`__) + can now be deployed on a Slurm cluster. When using Slurm, Determined delegates all job scheduling + and prioritization to the Slurm workload manager. This integration enables existing Slurm + workloads and Determined workloads to coexist and access all of the advanced capabilities of the + Slurm workload manager. The Determined Slurm integration can use either Singularity or Podman for + the container runtime. + Version 0.19.1 ============== diff --git a/docs/release-notes/137-feat-slurm.txt b/docs/release-notes/137-feat-slurm.txt deleted file mode 100644 index c0ffd598a78..00000000000 --- a/docs/release-notes/137-feat-slurm.txt +++ /dev/null @@ -1,11 +0,0 @@ -:orphan: - -**New Features** - -- Cluster: Determined Enterprise Edition can now be deployed on a Slurm cluster. When using Slurm, - Determined delegates all job scheduling and prioritization to the Slurm workload manager. - This integration enables existing Slurm workloads and Determined workloads to coexist and - Determined workloads to access all of the advanced capabilities of the Slurm workload manager. - The Determined Slurm integration can use either Singularity or PodMan for the container - runtime.
- diff --git a/docs/release-notes/canceled-experiment-trial-fix.txt b/docs/release-notes/canceled-experiment-trial-fix.txt deleted file mode 100644 index 2ceca2236ab..00000000000 --- a/docs/release-notes/canceled-experiment-trial-fix.txt +++ /dev/null @@ -1,6 +0,0 @@ -:orphan: - -**Fixes** - -- Experiments with a single trial are now considered canceled when their trial is - canceled or killed. diff --git a/docs/release-notes/concurrent-grid.rst b/docs/release-notes/concurrent-grid.rst deleted file mode 100644 index 176484f2947..00000000000 --- a/docs/release-notes/concurrent-grid.rst +++ /dev/null @@ -1,8 +0,0 @@ -:orphan: - -**Bug Fixes** - -- Hyperparameter Search: Prevent hyperparameter searches from incorrectly terminating early when - starting a new trial in response to the last previously open trial closing. One common way for - this situation to arise is when running an experiment with ``max_concurrent_trials`` set to - ``1``. diff --git a/docs/release-notes/experiment-restore.txt b/docs/release-notes/experiment-restore.txt deleted file mode 100644 index 8f571e1c407..00000000000 --- a/docs/release-notes/experiment-restore.txt +++ /dev/null @@ -1,6 +0,0 @@ -:orphan: - -**Bug Fixes** - -- Fix an issue where restoring a job in into a Kubernetes set up could crash the - resource manager. diff --git a/docs/release-notes/jobs-api.txt b/docs/release-notes/jobs-api.txt deleted file mode 100644 index a431ff23c4b..00000000000 --- a/docs/release-notes/jobs-api.txt +++ /dev/null @@ -1,6 +0,0 @@ -:orphan: - -**Breaking Changes** - -- CLI: ``det job list`` will now return all jobs by default instead of a single API results page. Use ``--pages=1`` option for the old behavior. -- API: ``GetJobs`` request format for pagination object has changed. Instead of being contained in a nested ``pagination`` object, these are now top level options, in line with the other paginatable API requests. 
diff --git a/docs/release-notes/max-cc-searcher-bugs.rst b/docs/release-notes/max-cc-searcher-bugs.rst deleted file mode 100644 index 0538f2ba111..00000000000 --- a/docs/release-notes/max-cc-searcher-bugs.rst +++ /dev/null @@ -1,6 +0,0 @@ -:orphan: - -**Bug Fixes** - -- Hyperparameter Search: Prevent the random and grid hyperparameter searches from spawning more - trials than allowed by ``max_concurrent_trials`` in the event of trial failures. diff --git a/docs/release-notes/set_gc_policy_fix.txt b/docs/release-notes/set_gc_policy_fix.txt deleted file mode 100644 index 2109d893b17..00000000000 --- a/docs/release-notes/set_gc_policy_fix.txt +++ /dev/null @@ -1,5 +0,0 @@ -:orphan: - -** Bug Fixes ** - -- Command: Fixed a bug where ``det e set gc-policy`` would fail when deserializing an api response because it wasn't adjusted for the new format. \ No newline at end of file diff --git a/docs/release-notes/torch-distributed-single-slot.txt b/docs/release-notes/torch-distributed-single-slot.txt deleted file mode 100644 index 0fc8c6b4c17..00000000000 --- a/docs/release-notes/torch-distributed-single-slot.txt +++ /dev/null @@ -1,8 +0,0 @@ -:orphan: - -**Fixes** -- Distributed training: previously, experiments launched with determined.launch.torch_distributed were wrongly skipping -torch.distributed.run for single-slot trials and invoking training scripts directly. As a result, functions such as -torch.distributed.init_process_group() would fail, but only inside single-slot trials. Now, -determined.launch.torch_distributed will conform to the intended behavior as a wrapper around -torch.distributed.run and will invoke torch.distributed.run on all training scripts. 
diff --git a/docs/release-notes/trial-describe.txt b/docs/release-notes/trial-describe.txt deleted file mode 100644 index 87f12a65df7..00000000000 --- a/docs/release-notes/trial-describe.txt +++ /dev/null @@ -1,10 +0,0 @@ -:orphan: - -**Improvements** - -- API: `GetTrialWorkloads` can now optionally include per-batch metrics when ``includeBatchMetrics`` query parameter is set. - -**Breaking Changes** - -- CLI: ``det trial describe --json`` output format has changed. Fixed a bug where ``det trial describe --json --metrics`` would fail for trials with a very large number of steps. -- API: Response format for metrics has been standardized to return aggregated and per-batch metrics in a uniform way. ``GetTrialWorkloads``, ``GetTrials`` API response format has changed. ``ReportTrialTrainingMetrics``, ``ReportTrialValidationMetrics`` API request format has changed as well.