Checkpoint is up to date when resuming training (#1043)
mjdenkowski authored Apr 21, 2022
1 parent 30c3913 commit c822e20
Showing 3 changed files with 10 additions and 2 deletions.
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -11,6 +11,12 @@ Note that Sockeye has checks in place to not translate with an old model that wa

Each version section may have subsections for: _Added_, _Changed_, _Removed_, _Deprecated_, and _Fixed_.

## [3.1.11]

### Fixed

- When resuming training with a fully trained model, `sockeye-train` will correctly exit without creating a duplicate (but separately numbered) checkpoint.

## [3.1.10]

### Fixed
2 changes: 1 addition & 1 deletion sockeye/__init__.py
@@ -11,4 +11,4 @@
# express or implied. See the License for the specific language governing
# permissions and limitations under the License.

-__version__ = '3.1.10'
+__version__ = '3.1.11'
4 changes: 3 additions & 1 deletion sockeye/training.py
@@ -209,7 +209,9 @@ def fit(self,
self.config.max_updates,
self.config.max_checkpoints)

-checkpoint_up_to_date = False
+# At the start of training, the checkpoint is only up to date if it has
+# just been loaded (resuming training with an existing model directory).
+checkpoint_up_to_date = resume_training
while True:
if self.config.max_epochs is not None and self.state.epoch == self.config.max_epochs:
logger.info("Maximum # of epochs (%s) reached.", self.config.max_epochs)
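
The effect of this change can be illustrated with a toy loop. This is a minimal sketch, not Sockeye's actual `fit` method: the function `run_training`, its parameters, and the checkpoint counter are hypothetical, and it assumes that a final checkpoint is written on exit only when `checkpoint_up_to_date` is false. The diff above does not show that part of the code.

```python
def run_training(max_epochs: int, start_epoch: int, resume_training: bool,
                 patched: bool = True) -> int:
    """Toy model of the loop; returns how many checkpoints were written.

    Hypothetical helper for illustration only, not part of Sockeye.
    """
    epoch = start_epoch
    checkpoints_written = 0
    # Patched behavior: a checkpoint that was just loaded counts as up to date.
    # Old behavior: always start by assuming the checkpoint is stale.
    checkpoint_up_to_date = resume_training if patched else False
    while True:
        if epoch == max_epochs:
            # Assumption: a final checkpoint is saved only if the state is stale.
            if not checkpoint_up_to_date:
                checkpoints_written += 1
            break
        # Simulate one epoch of training followed by a regular checkpoint.
        epoch += 1
        checkpoints_written += 1
        checkpoint_up_to_date = True
    return checkpoints_written


# Resuming a fully trained model: no new checkpoint with the patch,
# one duplicate checkpoint without it.
print(run_training(max_epochs=3, start_epoch=3, resume_training=True, patched=True))
print(run_training(max_epochs=3, start_epoch=3, resume_training=True, patched=False))
```

Under these assumptions, a fresh run (`resume_training=False`) behaves the same with or without the patch, since the flag starts as false either way.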