Checkpoint callbacks with n_jobs > 1 #848
Without having looked at your solution: are you sure that you need a checkpoint for grid search (or any hyper-parameter search) at all? In general, the process looks like this:

1. Run the hyper-parameter search to find the best hyper-parameter combination.
2. Train a final model with those hyper-parameters on the full training data.

Only for this last step would you typically want to have checkpoints. You can either not use checkpointing at all during the grid search and perform this last step manually, or you can set …

Maybe you have a different use case in mind that would actually require one checkpoint for each hyper-parameter combination, but I wanted to make sure first that it is even necessary.
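A minimal sketch of that recommendation, with a toy module and toy data (all names here are placeholders, not the commenter's code): run the search without any Checkpoint, then refit the best configuration once with checkpointing enabled.

```python
import numpy as np
import torch
from torch import nn
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from skorch import NeuralNetClassifier
from skorch.callbacks import Checkpoint

# Toy data and module, just to make the sketch self-contained.
X, y = make_classification(1000, 20, random_state=0)
X, y = X.astype(np.float32), y.astype(np.int64)

class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.dense = nn.Linear(20, 2)

    def forward(self, X):
        return torch.softmax(self.dense(X), dim=-1)

# 1. Hyper-parameter search without any Checkpoint callback.
net = NeuralNetClassifier(MyModule, max_epochs=10, verbose=0)
gs = GridSearchCV(net, {'lr': [0.01, 0.1]}, cv=3, n_jobs=2)
gs.fit(X, y)

# 2. Only the final refit uses a Checkpoint, so nothing can collide.
cp = Checkpoint(dirname='final_model', load_best=True)
final_net = NeuralNetClassifier(
    MyModule, max_epochs=10, verbose=0, callbacks=[cp], **gs.best_params_)
final_net.fit(X, y)
```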
I understand your point. But consider the following case with grid search + k-fold cross-validation:
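A minimal sketch of such a setup, reusing the toy MyModule, X, and y from the sketch above: with n_jobs > 1, every worker writes checkpoint files with the same names into the same directory, so with load_best=True a fit may silently load another fit's weights.

```python
from sklearn.model_selection import GridSearchCV
from skorch import NeuralNetClassifier
from skorch.callbacks import Checkpoint

# One shared Checkpoint configuration for all parallel fits.
cp = Checkpoint(dirname='exp1', monitor='valid_loss_best', load_best=True)
net = NeuralNetClassifier(MyModule, max_epochs=10, verbose=0, callbacks=[cp])

gs = GridSearchCV(net, {'lr': [0.01, 0.1]}, cv=5, n_jobs=4)
gs.fit(X, y)  # 5 folds x 2 settings = 10 fits, all writing into 'exp1/'
```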
I have had similar issues. What seems to work for me is, instead of using … In a bit more detail:
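A rough sketch of a manual hyper-parameter loop of this kind, again reusing the toy MyModule, X, and y from above (the directory naming is illustrative, not the commenter's actual setup): each hyper-parameter combination and fold gets its own checkpoint directory, so no two fits write to the same files.

```python
import os
from sklearn.model_selection import ParameterGrid, StratifiedKFold
from skorch import NeuralNetClassifier
from skorch.callbacks import Checkpoint

grid = ParameterGrid({'lr': [0.01, 0.1]})
cv = StratifiedKFold(n_splits=5)

for i, params in enumerate(grid):
    for j, (train_idx, valid_idx) in enumerate(cv.split(X, y)):
        # One directory per (combination, fold), e.g. checkpoints/combo0_fold3.
        dirname = os.path.join('checkpoints', 'combo{}_fold{}'.format(i, j))
        cp = Checkpoint(dirname=dirname, load_best=True)
        net = NeuralNetClassifier(
            MyModule, max_epochs=10, verbose=0, callbacks=[cp], **params)
        net.fit(X[train_idx], y[train_idx])
        score = net.score(X[valid_idx], y[valid_idx])
```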
Thanks for your suggestion. It seems that your solution does not use GridSearchCV. However, I prefer to avoid setting up all the loops and models manually, so I would like to use GridSearchCV + k-fold from scikit-learn. Is the solution that I adopted in the previous post (i.e., a callback using a file to store the prefix filenames) valid, or is there a better one?
Okay, the use case of having … I took a look at the code for … Right now, …

@ottonemo WDYT? I vaguely remember that we discussed something like this at some point.
I have run into the same issue with conflicting checkpoint files while using …
Solves #848

Description

As of now, dirname can only be a string. With this update, it can also be a callable with no arguments that returns a string. What this solves is that the directory a model is saved in can now contain a dynamic element. This way, if you run, e.g., grid search with n_jobs>1 + checkpoint, each checkpoint instance can have its own directory name (e.g. using a function that returns a random name), while the files inside the directory still follow the same naming. Without such a possibility, if a user runs grid search with n_jobs>1 and checkpoint with load_best=True, the loaded model would always be whatever happens to be the latest one stored, which can result in (silent) errors.

Implementation

As a consequence of the dirname no longer being known at __init__ time, I removed the validation of the filenames from there. We still validate them inside initialize(), which is sufficient in my opinion. In theory, we could call the dirname function inside __init__ to validate it, and then call it again inside initialize() to actually set it, but I don't like that. The reason is that we would call a function that is possibly non-deterministic or might have side effects twice, with unknown consequences. This should be avoided if possible.
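A sketch of how the callable dirname described above could be used with a parallel grid search, reusing the toy MyModule, X, and y from the first sketch; the uuid-based helper is just one way to implement a "function that returns a random name".

```python
import os
import uuid
from sklearn.model_selection import GridSearchCV
from skorch import NeuralNetClassifier
from skorch.callbacks import Checkpoint

def random_dirname():
    # Called at checkpoint initialization, so every fit gets a fresh directory.
    return os.path.join('checkpoints', uuid.uuid4().hex)

cp = Checkpoint(dirname=random_dirname, load_best=True)
net = NeuralNetClassifier(MyModule, max_epochs=10, verbose=0, callbacks=[cp])

gs = GridSearchCV(net, {'lr': [0.01, 0.1]}, cv=5, n_jobs=4)
gs.fit(X, y)  # each of the parallel fits now saves into its own directory
```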
I would like to use GridSearchCV in parallel mode. However, I think that the checkpoints used during training in the different processes could overwrite each other, since they have the same filenames defined in the Checkpoint attributes (f_params etc.). My first attempt was subclassing the Checkpoint class and implementing a semaphore in the on_train_begin method that changed the filenames (using the fn_prefix) with a global variable as job counter. However, the jobs run as separate processes, not threads, so that did not work. My present attempt is to store the counter in a file, protected by a filelock. Is there a better way? My solution follows:
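An illustrative reconstruction of the described approach (not the original code): a Checkpoint subclass that draws a unique fn_prefix from a counter kept in a file, guarded by a file lock so parallel worker processes do not race. The file names ("cp_counter.txt") are made up for the example and the filelock package is assumed to be installed.

```python
import os
from filelock import FileLock
from skorch.callbacks import Checkpoint


class PrefixedCheckpoint(Checkpoint):
    counter_file = 'cp_counter.txt'

    def on_train_begin(self, net, X=None, y=None, **kwargs):
        # Only one process at a time may read and bump the counter.
        with FileLock(self.counter_file + '.lock'):
            count = 0
            if os.path.exists(self.counter_file):
                with open(self.counter_file) as f:
                    count = int(f.read().strip() or 0)
            with open(self.counter_file, 'w') as f:
                f.write(str(count + 1))
        # Give every fit its own prefix: job0_params.pt, job1_params.pt, ...
        self.fn_prefix = 'job{}_'.format(count)
        return super().on_train_begin(net, X=X, y=y, **kwargs)
```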