-
I am using chapter 4 of the book as a basis for my tests:

```scala
DefaultTrainingConfig(Loss.softmaxCrossEntropyLoss())
  .addEvaluator(Accuracy())
  .optDevices(trainSetUp.devices)
  .addTrainingListeners(TrainingListener.Defaults.logging(outputDir): _*)
  .addTrainingListeners(listener)
```

In my tests I see that I have to close the trainer (and the model?) in order for the logging to be saved. I assumed the files would be saved/flushed on every epoch, but I have run 40 epochs and no data is saved. Is this the expected behavior? How do I force the files to be flushed on each epoch?

As per the example, I also use a checkpoint listener:

```scala
val listener: CheckpointsTrainingListener = CheckpointsTrainingListener(outputDir)
listener.setSaveModelCallback(
  trainer => {
    // Record accuracy and loss on every epoch
    val result: TrainingResult = trainer.getTrainingResult
    val model: Model = trainer.getModel
    val accuracy = result.getValidateEvaluation("Accuracy")
    model.setProperty("Accuracy", String.format("%.5f", accuracy))
    model.setProperty("Loss", String.format("%.5f", result.getValidateLoss))
  })
```

I see that instead of saving the data to the model, I can save it to a file. But does DJL already have a pre-baked logger for this? I see that several default listeners are activated. These are:

```java
new EpochTrainingListener(),
new EvaluatorTrainingListener(),
new DivergenceCheckTrainingListener(),
new LoggingTrainingListener()
```

After the save I get a set of files in the output directory. Which of these files are generated by which of the listeners above? What is the meaning of the counter in the logs? Why is my memory file always empty? Does it log the memory used by the DL engines? Do I have to activate this somewhere? Finally, does DJL have an equivalent to TensorBoard? I think I saw something like this but cannot find it now. TIA
Replies: 1 comment
-
Most of the training listeners only save files when training is over, not every epoch. Each listener adds its own behavior to the training process individually, and the defaults are just pre-made collections of listeners. TrainingListener.Defaults.logging is named that way because it contains the LoggingTrainingListener, which logs to stdout.

If you want the CheckpointsTrainingListener to checkpoint every epoch, right now it looks like you have to set the step in the constructor to 1 (or n for every nth epoch). This seems a bit odd to me, so maybe this listener needs to be changed; you would expect a checkpoints listener to default to checkpointing every epoch.

For the files generated, training.log and validate.log are both from the TimeMeasureTrainingListener, which records the timing measurements taken during training and validation.

We have thought about adding support for TensorBoard, but haven't completed it yet. There is https://github.com/aws-samples/djl-demo/blob/master/visualization/README.md, which is from 0.6 but may still work.
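For reference, a minimal sketch of what that could look like, building on the snippet from the question. The three-argument constructor (output directory, model name, checkpoint step) and the "mlp" model name are assumptions on my part; check the CheckpointsTrainingListener javadoc for your DJL version for the exact overloads:

```scala
// Sketch only: assumes an overload CheckpointsTrainingListener(outputDir, modelName, step),
// as implied by "set the step in the constructor" above. "mlp" is a placeholder model name.
val listener: CheckpointsTrainingListener =
  CheckpointsTrainingListener(outputDir, "mlp", 1) // step = 1 -> checkpoint after every epoch
// Use step = n to checkpoint after every nth epoch instead.
```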
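On writing the per-epoch metrics to a file instead of model properties: the save-model callback from the question can append to a file itself. A minimal sketch, assuming outputDir is a plain path string; the metrics.csv name and the CSV layout are my own placeholders, not anything DJL defines:

```scala
import java.nio.file.{Files, Paths, StandardOpenOption}

// Append one line of validation metrics per checkpoint.
// "metrics.csv" is a placeholder; nothing in DJL prescribes this file or format.
listener.setSaveModelCallback(
  trainer => {
    val result = trainer.getTrainingResult
    val line = "%.5f,%.5f%n".format(
      result.getValidateEvaluation("Accuracy"),
      result.getValidateLoss)
    Files.write(
      Paths.get(outputDir, "metrics.csv"),
      line.getBytes,
      StandardOpenOption.CREATE, StandardOpenOption.APPEND)
  })
```

This replaces the callback from the question; you could of course keep the setProperty calls and the file write in the same callback.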