Clean up logging #45

Merged: frostedoyster merged 2 commits into main on Feb 5, 2024

Conversation

@frostedoyster (Collaborator) commented Feb 2, 2024

This aims to clean up the logging, according to #39.
The resulting logger can be used by several models, so it has been extracted into utils.

Before:

[2024-02-02 08:54:04,880][metatensor.models.cli.train_model][INFO] - Setting up training set
[2024-02-02 08:54:04,956][metatensor.models.utils.data.readers.readers][INFO] - Forces found in section 'energy'. Forces are taken for training!
[2024-02-02 08:54:04,976][metatensor.models.utils.data.readers.readers][WARNING] - Stress not found in section 'energy'. Continue without stress!
[2024-02-02 08:54:04,977][metatensor.models.cli.train_model][INFO] - Setting up test set
[2024-02-02 08:54:04,977][metatensor.models.cli.train_model][INFO] - Setting up validation set
[2024-02-02 08:54:04,978][metatensor.models.cli.train_model][INFO] - Setting up model
[2024-02-02 08:54:05,069][metatensor.models.cli.train_model][INFO] - Calling architecture trainer
[2024-02-02 08:54:05,069][metatensor.models.soap_bpnn.train][INFO] - Checking datasets for consistency
rascaline::math::splines -- spline reached requested accuracy (1.000e-8) with 321 reference points (max absolute error is 5.699e-8)
[2024-02-02 08:54:05,158][metatensor.models.soap_bpnn.train][INFO] - Calculating composition weights
[2024-02-02 08:54:05,273][metatensor.models.soap_bpnn.train][INFO] - Setting up data loaders
[2024-02-02 08:54:05,424][metatensor.models.soap_bpnn.train][INFO] - Starting training
rascaline::math::splines -- spline reached requested accuracy (1.000e-8) with 321 reference points (max absolute error is 5.699e-8)
rascaline::math::splines -- spline reached requested accuracy (1.000e-8) with 321 reference points (max absolute error is 5.699e-8)
rascaline::math::splines -- spline reached requested accuracy (1.000e-8) with 321 reference points (max absolute error is 5.699e-8)
rascaline::math::splines -- spline reached requested accuracy (1.000e-8) with 321 reference points (max absolute error is 5.699e-8)
rascaline::math::splines -- spline reached requested accuracy (1.000e-8) with 321 reference points (max absolute error is 5.699e-8)
rascaline::math::splines -- spline reached requested accuracy (1.000e-8) with 321 reference points (max absolute error is 5.699e-8)
rascaline::math::splines -- spline reached requested accuracy (1.000e-8) with 321 reference points (max absolute error is 5.699e-8)
rascaline::math::splines -- spline reached requested accuracy (1.000e-8) with 321 reference points (max absolute error is 5.699e-8)
rascaline::math::splines -- spline reached requested accuracy (1.000e-8) with 321 reference points (max absolute error is 5.699e-8)
rascaline::math::splines -- spline reached requested accuracy (1.000e-8) with 321 reference points (max absolute error is 5.699e-8)
rascaline::math::splines -- spline reached requested accuracy (1.000e-8) with 321 reference points (max absolute error is 5.699e-8)
rascaline::math::splines -- spline reached requested accuracy (1.000e-8) with 321 reference points (max absolute error is 5.699e-8)
rascaline::math::splines -- spline reached requested accuracy (1.000e-8) with 321 reference points (max absolute error is 5.699e-8)
rascaline::math::splines -- spline reached requested accuracy (1.000e-8) with 321 reference points (max absolute error is 5.699e-8)
rascaline::math::splines -- spline reached requested accuracy (1.000e-8) with 321 reference points (max absolute error is 5.699e-8)
rascaline::math::splines -- spline reached requested accuracy (1.000e-8) with 321 reference points (max absolute error is 5.699e-8)
rascaline::math::splines -- spline reached requested accuracy (1.000e-8) with 321 reference points (max absolute error is 5.699e-8)
/home/filippo/.local/lib/python3.8/site-packages/torch/autograd/__init__.py:251: UserWarning: second derivatives with respect to positions are not implemented and will not be accumulated during backward() calls. If you need second derivatives, please open an issue on rascaline repository. (Triggered internally at /home/filippo/rascaline/rascaline-torch/src/autograd.cpp:313.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
rascaline::math::splines -- spline reached requested accuracy (1.000e-8) with 321 reference points (max absolute error is 5.699e-8)
rascaline::math::splines -- spline reached requested accuracy (1.000e-8) with 321 reference points (max absolute error is 5.699e-8)
rascaline::math::splines -- spline reached requested accuracy (1.000e-8) with 321 reference points (max absolute error is 5.699e-8)
rascaline::math::splines -- spline reached requested accuracy (1.000e-8) with 321 reference points (max absolute error is 5.699e-8)
[2024-02-02 08:54:09,217][metatensor.models.soap_bpnn.train][INFO] - Epoch    0, train loss:  4104.6695, validation loss:   614.6490, train energy RMSE:     3.8793, train force RMSE:    28.3881, valid energy RMSE:     3.3691, valid force RMSE:    24.5621
[2024-02-02 08:54:15,572][metatensor.models.soap_bpnn.train][INFO] - Epoch    5, train loss:  2332.1070, validation loss:   328.8712, train energy RMSE:     2.9285, train force RMSE:    21.3973, valid energy RMSE:     2.0911, valid force RMSE:    18.0138
[2024-02-02 08:54:21,735][metatensor.models.soap_bpnn.train][INFO] - Epoch   10, train loss:   389.5255, validation loss:    61.7523, train energy RMSE:     2.2921, train force RMSE:     8.5236, valid energy RMSE:     1.7062, valid force RMSE:     7.6708
[2024-02-02 08:54:28,088][metatensor.models.soap_bpnn.train][INFO] - Epoch   15, train loss:   148.4480, validation loss:    28.8797, train energy RMSE:     1.7587, train force RMSE:     5.1572, valid energy RMSE:     1.6739, valid force RMSE:     5.1066
[2024-02-02 08:54:34,342][metatensor.models.soap_bpnn.train][INFO] - Epoch   20, train loss:   116.4343, validation loss:    25.2689, train energy RMSE:     2.1110, train force RMSE:     4.3394, valid energy RMSE:     2.4750, valid force RMSE:     4.3753
[2024-02-02 08:54:40,748][metatensor.models.soap_bpnn.train][INFO] - Epoch   25, train loss:   103.5423, validation loss:    16.3115, train energy RMSE:     2.2714, train force RMSE:     3.9433, valid energy RMSE:     0.9048, valid force RMSE:     3.9361

After:

[2024-02-02 08:50:28,485][metatensor.models.cli.train_model][INFO] - Setting up training set
[2024-02-02 08:50:28,560][metatensor.models.utils.data.readers.readers][INFO] - Forces found in section 'energy'. Forces are taken for training!
[2024-02-02 08:50:28,582][metatensor.models.utils.data.readers.readers][WARNING] - Stress not found in section 'energy'. Continue without stress!
[2024-02-02 08:50:28,583][metatensor.models.cli.train_model][INFO] - Setting up test set
[2024-02-02 08:50:28,583][metatensor.models.cli.train_model][INFO] - Setting up validation set
[2024-02-02 08:50:28,583][metatensor.models.cli.train_model][INFO] - Setting up model
[2024-02-02 08:50:28,673][metatensor.models.cli.train_model][INFO] - Calling architecture trainer
[2024-02-02 08:50:28,674][metatensor.models.soap_bpnn.train][INFO] - Checking datasets for consistency
[2024-02-02 08:50:28,761][metatensor.models.soap_bpnn.train][INFO] - Calculating composition weights
[2024-02-02 08:50:28,875][metatensor.models.soap_bpnn.train][INFO] - Setting up data loaders
[2024-02-02 08:50:29,044][metatensor.models.soap_bpnn.train][INFO] - Starting training
[2024-02-02 08:50:32,343][metatensor.models.utils.logging][INFO] - Epoch    0, train loss: 3930.5, validation loss: 614.71, train energy RMSE: 4.0425, train force RMSE: 27.744, valid energy RMSE: 3.1709, valid force RMSE: 24.590
[2024-02-02 08:50:39,017][metatensor.models.utils.logging][INFO] - Epoch    5, train loss: 2309.8, validation loss: 330.64, train energy RMSE: 2.7989, train force RMSE: 21.310, valid energy RMSE: 1.8786, valid force RMSE: 18.086
[2024-02-02 08:50:45,312][metatensor.models.utils.logging][INFO] - Epoch   10, train loss:  319.1, validation loss:  66.38, train energy RMSE: 2.1813, train force RMSE:  7.685, valid energy RMSE: 4.0625, valid force RMSE:  7.062
[2024-02-02 08:50:51,406][metatensor.models.utils.logging][INFO] - Epoch   15, train loss:  155.8, validation loss:  34.09, train energy RMSE: 2.8850, train force RMSE:  4.779, valid energy RMSE: 2.9216, valid force RMSE:  5.056
[2024-02-02 08:50:57,619][metatensor.models.utils.logging][INFO] - Epoch   20, train loss:  106.5, validation loss:  22.33, train energy RMSE: 2.1761, train force RMSE:  4.070, valid energy RMSE: 1.4479, valid force RMSE:  4.498
[2024-02-02 08:51:03,782][metatensor.models.utils.logging][INFO] - Epoch   25, train loss:  106.5, validation loss:  17.12, train energy RMSE: 2.3252, train force RMSE:  3.988, valid energy RMSE: 1.0178, valid force RMSE:  4.011

📚 Documentation preview 📚: https://metatensor-models--45.org.readthedocs.build/en/45/

@Luthaf (Member) commented Feb 3, 2024

[2024-02-02 08:50:28,485][metatensor.models.cli.train_model][INFO] - Setting up training set
[2024-02-02 08:50:28,560][metatensor.models.utils.data.readers.readers][INFO] - Forces found in section 'energy'. Forces are taken for training!
[2024-02-02 08:50:28,582][metatensor.models.utils.data.readers.readers][WARNING] - Stress not found in section 'energy'. Continue without stress!
[2024-02-02 08:50:28,583][metatensor.models.cli.train_model][INFO] - Setting up test set

This is still a bit overwhelming IMO. Could we do something like

[2024-02-02 08:50:28,485][metatensor.models.cli.train_model]
    [INFO] - Setting up training set
[2024-02-02 08:50:28,560][metatensor.models.utils.data.readers.readers]
    [INFO] - Forces found in section 'energy'. Forces are taken for training!
[2024-02-02 08:50:28,582][metatensor.models.utils.data.readers.readers]
    [WARNING] - Stress not found in section 'energy'. Continue without stress!
[2024-02-02 08:50:28,583][metatensor.models.cli.train_model]
    [INFO] - Setting up test set

Or remove the module path since it is not very relevant to users.

[2024-02-02 08:50:28,485][INFO] - Setting up training set
[2024-02-02 08:50:28,560][INFO] - Forces found in section 'energy'. Forces are taken for training!
[2024-02-02 08:50:28,582][WARNING] - Stress not found in section 'energy'. Continue without stress!
[2024-02-02 08:50:28,583][INFO] - Setting up test set

@frostedoyster frostedoyster requested review from PicoCentauri and removed request for PicoCentauri February 3, 2024 12:41
@PicoCentauri (Contributor):

Or remove the module path since it is not very relevant to users.

Yes, we can try. This is related to how Hydra prints the logging; we have to override its config.
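
For reference, Hydra's job logging follows Python's standard logging dictConfig schema, so the kind of override being discussed could look roughly like the sketch below. The format string and the formatter/handler names are illustrative assumptions, not the project's actual Hydra configuration.

    import logging.config

    # Hedged sketch: the same dictionary could be supplied to Hydra as a
    # job-logging override; the formatter and handler names are assumptions.
    logging.config.dictConfig(
        {
            "version": 1,
            "disable_existing_loggers": False,
            "formatters": {
                "simple": {"format": "[%(asctime)s][%(levelname)s] - %(message)s"}
            },
            "handlers": {
                "console": {
                    "class": "logging.StreamHandler",
                    "formatter": "simple",
                    "stream": "ext://sys.stdout",
                }
            },
            "root": {"level": "INFO", "handlers": ["console"]},
        }
    )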

@PicoCentauri (Contributor) left a comment

Good improvement to the logging.

In general, please follow our conventions for docstrings. When ChatGPT generates the code for you, you can ask it for reST docstrings in Sphinx format. 🤫

Comment on lines 28 to 33
Args:
    model_capabilities: The capabilities of the model.
    train_loss_0: The initial training loss.
    validation_loss_0: The initial validation loss.
    train_info_0: The initial training metrics.
    validation_info_0: The initial validation metrics.
Contributor:

This is not how we write docstrings. Can you adjust this please?
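
For reference, a hedged sketch of the same parameter block rewritten as a reST/Sphinx docstring; the surrounding signature is an assumption based on the argument names shown in the diff, not the PR's actual code.

    class MetricLogger:
        def __init__(self, model_capabilities, train_loss_0, validation_loss_0,
                     train_info_0, validation_info_0):
            """Initialize the metric logger.

            :param model_capabilities: The capabilities of the model.
            :param train_loss_0: The initial training loss.
            :param validation_loss_0: The initial validation loss.
            :param train_info_0: The initial training metrics.
            :param validation_info_0: The initial validation metrics.
            """
            ...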

logger.info(logging_string)


def _get_digits(value: float) -> Tuple[int, int]:
Contributor:

Is this ChatGPT code?

It looks very cumbersome.

Python's decimal module has a Decimal class that does this for you, e.g.

t = Decimal(1.532).normalize().as_tuple()

gives:

DecimalTuple(sign=0, digits=(1, 5, 3, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 8, 4, 2, 1, 7, 0, 9, 4, 3), exponent=-26)

which I think contains all the information you need in a single line. You can even get the mantissa and the exponent if you like...

Collaborator Author:

Hmmm, I don't think I can use it to do what I want... Yes, what I do is quite cumbersome, but I think it's the best solution for now. The function essentially takes a number and decides which parts of it should be printed so that there are always at least 5 significant digits.
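
For illustration only, a minimal sketch of that idea; this is not the PR's actual _get_digits implementation, just one way to split a value into integer and decimal digits so that at least five significant digits are shown.

    import math
    from typing import Tuple


    def _get_digits_sketch(value: float, n_sig: int = 5) -> Tuple[int, int]:
        """Hypothetical helper (not the PR's implementation): return the number
        of integer and decimal digits needed for at least n_sig significant digits."""
        if value == 0.0:
            return 1, n_sig - 1
        exponent = int(math.floor(math.log10(abs(value))))
        integer_digits = max(exponent + 1, 1)          # digits before the decimal point
        decimal_digits = max(n_sig - 1 - exponent, 0)  # digits after the decimal point
        return integer_digits, decimal_digits


    print(_get_digits_sketch(3930.5))  # (4, 1) -> printed as "3930.5"
    print(_get_digits_sketch(4.0425))  # (1, 4) -> printed as "4.0425"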

Comment:

What's wrong with %12.5e or equivalent?

Collaborator Author:

I have the strong feeling that many of the people we're dealing with won't be comfortable with exponential notation.

Member:

Yes, I would go for format specifiers: f"something {some_float:12.5e}". See documentation here: https://docs.python.org/3.11/library/string.html#formatspec

@Luthaf (Member), Feb 5, 2024:

I have the strong feeling that many of the people we're dealing with won't be comfortable with exponential notation

I'm not sure about this, but we can also use g specifiers instead.

Contributor:

I agree. Maybe g is the better option than f.
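
For comparison, here is how the three specifiers render one of the loss values from the logs above; this is standard Python format-spec behaviour, shown only to illustrate the trade-off being discussed.

    value = 4104.6695
    print(f"{value:12.5e}")  # ' 4.10467e+03' -> exponential notation
    print(f"{value:12.5f}")  # '  4104.66950' -> fixed point with 5 decimals
    print(f"{value:12.5g}")  # '      4104.7' -> 5 significant digits, switching to
                             #    exponential only for very large or very small values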

logger = logging.getLogger(__name__)


class MetricLogger:
Contributor:

You also promised tests for this in your issue. At least one test checking that a log is produced should be added.

Collaborator Author:

Sure, I will add tests after we find a way to log the way we want.
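
For context, a minimal pytest sketch of the kind of check being asked for, using the built-in caplog fixture; the logger name is taken from the logs above, and the direct logging call is only a stand-in for driving MetricLogger itself.

    import logging


    def test_training_metrics_are_logged(caplog):
        caplog.set_level(logging.INFO)
        # Stand-in for calling MetricLogger: a real test would exercise the
        # actual logging utility instead of emitting a message directly.
        logging.getLogger("metatensor.models.utils.logging").info(
            "Epoch    0, train loss: 3930.5"
        )
        assert "train loss" in caplog.text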

Comment on lines +162 to +169
if epoch == 0:
    metric_logger = MetricLogger(
        model_capabilities,
        train_loss,
        validation_loss,
        finalized_train_info,
        finalized_validation_info,
    )
Contributor:

Why don't you initialize the logger before entering the training loop?

Collaborator Author:

Because it needs the values from the first epoch to format things properly.

@frostedoyster (Collaborator Author):

Personally, I would go for the following:

[2024-02-02 08:50:28][INFO] - Setting up training set
[2024-02-02 08:50:28][INFO] - Forces found in section 'energy'. Forces are taken for training!
[2024-02-02 08:50:28][WARNING] - Stress not found in section 'energy'. Continue without stress!
[2024-02-02 08:50:28][INFO] - Setting up test set

i.e., removing both the module path and the milliseconds (they shouldn't be relevant during our training exercises; they clog up the log and we can definitely do without them).
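
A hedged sketch of what that format would look like with Python's standard logging module; the handler setup and names are illustrative, not the project's final configuration.

    import logging
    import sys

    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(
        logging.Formatter(
            fmt="[%(asctime)s][%(levelname)s] - %(message)s",
            datefmt="%Y-%m-%d %H:%M:%S",  # no milliseconds, no module path
        )
    )
    root = logging.getLogger()
    root.addHandler(handler)
    root.setLevel(logging.INFO)

    logging.getLogger(__name__).info("Setting up training set")
    # prints something like: [2024-02-02 08:50:28][INFO] - Setting up training set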

@frostedoyster frostedoyster merged commit cd9a8ab into main Feb 5, 2024
8 checks passed
@frostedoyster frostedoyster deleted the clean-up-logging branch February 5, 2024 17:04