-
I am using chapter 4 of the book as a basis for my tests:

```scala
DefaultTrainingConfig(Loss.softmaxCrossEntropyLoss())
  .addEvaluator(Accuracy())
  .optDevices(trainSetUp.devices)
  .addTrainingListeners(TrainingListener.Defaults.logging(outputDir): _*)
  .addTrainingListeners(listener)
```

In my tests I see that I have to close the trainer (and the model?) in order for the logging to be saved. I assumed the files would be saved/flushed on every epoch, but I have run 40 epochs and no data is saved. Is this the expected behavior? How do I force the files to be flushed on each epoch?

As per the example, I also use a checkpoint listener:

```scala
val listener: CheckpointsTrainingListener = CheckpointsTrainingListener(outputDir)
listener.setSaveModelCallback(
  trainer => {
    // Record accuracy and loss on every epoch
    val result: TrainingResult = trainer.getTrainingResult
    val model: Model = trainer.getModel
    val accuracy = result.getValidateEvaluation("Accuracy")
    model.setProperty("Accuracy", String.format("%.5f", accuracy))
    model.setProperty("Loss", String.format("%.5f", result.getValidateLoss))
  })
```

I see that instead of saving the data to the model, I can save it to a file. But does DJL already have a pre-baked logger for this? I see that several default listeners are activated. These are:

```java
new EpochTrainingListener(),
new EvaluatorTrainingListener(),
new DivergenceCheckTrainingListener(),
new LoggingTrainingListener()
```

After the save I get a set of files in the output directory. Which of these files are generated by which of the listeners above? What is the meaning of the counter in the logs? Why is my memory file always empty? Does it log the memory used by the DL engines? Do I have to activate this somewhere? Finally, does DJL have an equivalent to TensorBoard? I think I saw something like this but cannot find it now. TIA
Replies: 1 comment
-
Most of the training listeners only save files when training is over, not every epoch. Each listener adds its own behavior to the training process individually, and the defaults are just pre-made collections of listeners. TrainingListener.Defaults.logging is named that way because it contains the LoggingTrainingListener, which logs to stdout.

If you want the CheckpointsTrainingListener to checkpoint every epoch, right now it looks like you have to set the step in the constructor to 1 (or n for every nth epoch). This seems a bit odd to me, so maybe this listener needs to be changed; you would expect a checkpoints listener to default to checkpointing every epoch.

For the files generated, training.log and validate.log are both from the TimeMeasureTrainingListener, which records the timing measurements taken during training and validation.

We have thought about adding support for TensorBoard, but haven't completed it yet. There is https://github.com/aws-samples/djl-demo/blob/master/visualization/README.md, which is from 0.6 but may still work.
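For reference, a minimal sketch of what that could look like, building on the snippet from the question. The three-argument constructor (output directory, model name, checkpoint step) and the "mlp" model name are assumptions on my part; check the CheckpointsTrainingListener javadoc for your DJL version for the exact overloads:

```scala
// Sketch only: assumes an overload CheckpointsTrainingListener(outputDir, modelName, step),
// as implied by "set the step in the constructor" above. "mlp" is a placeholder model name.
val listener: CheckpointsTrainingListener =
  CheckpointsTrainingListener(outputDir, "mlp", 1) // step = 1 -> checkpoint after every epoch
// Use step = n to checkpoint after every nth epoch instead.
```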
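On writing the per-epoch metrics to a file instead of model properties: the save-model callback from the question can append to a file itself. A minimal sketch, assuming outputDir is a plain path string; the metrics.csv name and the CSV layout are my own placeholders, not anything DJL defines:

```scala
import java.nio.file.{Files, Paths, StandardOpenOption}

// Append one line of validation metrics per checkpoint.
// "metrics.csv" is a placeholder; nothing in DJL prescribes this file or format.
listener.setSaveModelCallback(
  trainer => {
    val result = trainer.getTrainingResult
    val line = "%.5f,%.5f%n".format(
      result.getValidateEvaluation("Accuracy"),
      result.getValidateLoss)
    Files.write(
      Paths.get(outputDir, "metrics.csv"),
      line.getBytes,
      StandardOpenOption.CREATE, StandardOpenOption.APPEND)
  })
```

This replaces the callback from the question; you could of course keep the setProperty calls and the file write in the same callback.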