Scaffold Split Not Implemented Error #33

Open
kc-chong opened this issue Sep 19, 2024 · 3 comments

config = OptimizationConfig(
    data=Dataset(
        input_column=input_col,  # Typical names are "SMILES" and "smiles".
        response_column=value_col,  # Often a specific name (like here), or just "activity".
        training_dataset_file=train_data,
        test_dataset_file=test_data,  # Hidden during optimization.
    ),
    descriptors=descriptors,
    algorithms=algorithms,
    settings=OptimizationConfig.Settings(
        mode=ModelMode.REGRESSION,
        cross_validation=cv,
        cv_split_strategy=ScaffoldSplit(),  # Raises NotImplementedError (see traceback below).
        n_trials=100,  # Total number of trials.
        n_startup_trials=50,  # Number of startup ("random") trials.
        random_seed=42,  # Seed for reproducibility.
        direction=OptimizationDirection.MAXIMIZATION,
    ),
)

File "~/miniforge3/envs/qsartuna/lib/python3.10/site-packages/optunaz/utils/preprocessing/splitter.py", line 450, in get_sklearn_splitter
    raise NotImplementedError()
NotImplementedError

I tried to use ScaffoldSplit as the cv_split_strategy but got this error. It seems scaffold splitting is not as straightforward as the other splitting strategies. Any clue how to use it to perform the evaluation split? Thanks!

lewismervin1 (Collaborator) commented Oct 2, 2024

Hi @kc-chong.

ScaffoldSplit is not currently supported for the cross-validation split used during hyperparameter optimisation.

You can, however, use ScaffoldSplit as the evaluation splitting strategy, which determines the overall reported test performance of your model. It is set on the dataset in the configuration.

For instance, in this example you can change the line split_strategy=Stratified(fraction=0.2) to e.g. split_strategy=ScaffoldSplit().

I hope this helps. Let me know if you run into any problems.
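
The change described above might look like the following sketch. It only shows the dataset part of the configuration. The import paths are assumptions based on the traceback in this thread and should be checked against the installed QSARtuna version; the column names and the file name are hypothetical placeholders.

```python
# Assumed import paths; verify them against your installed QSARtuna version.
from optunaz.datareader import Dataset
from optunaz.utils.preprocessing.splitter import ScaffoldSplit

data = Dataset(
    input_column="smiles",  # hypothetical column name
    response_column="activity",  # hypothetical column name
    training_dataset_file="train.csv",  # hypothetical path
    # Evaluation split set on the dataset, replacing Stratified(fraction=0.2).
    split_strategy=ScaffoldSplit(),
)
```

With the split set on the dataset, cv_split_strategy=ScaffoldSplit() would be dropped from OptimizationConfig.Settings, since that is the setting that raises NotImplementedError.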

kc-chong (Author) commented Oct 4, 2024

@lewismervin1 Thanks for the response! Appreciate that :)

kc-chong closed this as completed Oct 7, 2024

kc-chong (Author) commented Oct 9, 2024

@lewismervin1,

I realise that in the example there is still a cv setting in the OptimizationConfig, and from the source code it looks like cross-validation is always active during fitting. So I'm a bit confused about the difference between the split in the dataset and the split in cv. Would you mind clarifying? More specifically, when I define ScaffoldSplit in the dataset, I assume the data is split into training and validation sets by ScaffoldSplit, and the score on the validation set is used to inform the Optuna optimization. If that's the case, what is the role of cv here?
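
Going by the earlier reply in this thread, one possible reading is that the two splits act at different levels: the dataset split holds out a test set used only for the reported performance, while cv repeatedly splits the remaining training data to score each trial during optimisation. That reading is an assumption and is not confirmed in this thread. The sketch below is plain Python, not QSARtuna code. The scaffold labels, the helper names scaffold_holdout and k_fold, and the "smallest groups first" heuristic are all made up for illustration.

```python
from collections import defaultdict


def scaffold_holdout(records, test_fraction=0.2):
    """Hold out whole scaffold groups until roughly test_fraction of records are held out."""
    groups = defaultdict(list)
    for idx, (_, scaffold) in enumerate(records):
        groups[scaffold].append(idx)
    # Smallest groups go to the test set first (illustrative heuristic only).
    ordered = sorted(groups.values(), key=len)
    target = int(round(test_fraction * len(records)))
    train, test = [], []
    for members in ordered:
        if len(test) + len(members) <= target:
            test.extend(members)
        else:
            train.extend(members)
    return sorted(train), sorted(test)


def k_fold(indices, k):
    """Yield (fit, validate) index lists for plain k-fold CV over the given indices."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validate = folds[i]
        fit = [j for n, fold in enumerate(folds) if n != i for j in fold]
        yield fit, validate


# Toy data: (placeholder molecule name, hypothetical scaffold label).
records = [("mol%d" % i, s) for i, s in enumerate("AAAABBBCCD")]

# Outer level: evaluation split, done once; the test set is never used for tuning.
train_idx, test_idx = scaffold_holdout(records, test_fraction=0.3)

# Inner level: every optimisation trial would be scored by CV on the training part only.
for fit_idx, validate_idx in k_fold(train_idx, k=3):
    print(len(fit_idx), "fit /", len(validate_idx), "validate /", len(test_idx), "held out")
```

Under this reading, the score that guides the optimisation comes from the inner folds, and the scaffold-based test set is only touched afterwards to report final performance.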

kc-chong reopened this Oct 9, 2024