Scaffold Split Not Implemented Error #33

kc-chong · 2024-09-19T07:48:47Z

config = OptimizationConfig(
    data=Dataset(
        input_column=input_col,  # Typical names are "SMILES" and "smiles".
        response_column=value_col,  # Often a specific name (like here), or just "activity".
        training_dataset_file=train_data,
        test_dataset_file=test_data,  # Hidden during optimization.
    ),
    descriptors=descriptors,
    algorithms=algorithms,
    settings=OptimizationConfig.Settings(
        mode=ModelMode.REGRESSION,
        cross_validation=cv,
        cv_split_strategy=ScaffoldSplit(),
        n_trials=100,  # Total number of trials.
        n_startup_trials=50,  # Number of startup ("random") trials.
        random_seed=42,  # Seed for reproducibility.
        direction=OptimizationDirection.MAXIMIZATION,
    ),
)

File "~/miniforge3/envs/qsartuna/lib/python3.10/site-packages/optunaz/utils/preprocessing/splitter.py", line 450, in get_sklearn_splitter
    raise NotImplementedError()
NotImplementedError

I tried to use ScaffoldSplit but got this error. It seems scaffold splitting is not as straightforward as the other splitting strategies. Any clue how to use it to perform the evaluation split? Thanks!


lewismervin1 · 2024-10-02T12:42:56Z

Hi @kc-chong.

ScaffoldSplit is not currently supported as the cross-validation split (cv_split_strategy) used during hyperparameter optimisation, which is why get_sklearn_splitter raises NotImplementedError.

You can, however, use ScaffoldSplit as the evaluation split strategy. That split is set on the dataset in the configuration and determines the overall reported test performance of your model.

For instance, in this example you can change the line split_strategy=Stratified(fraction=0.2) to e.g. split_strategy=ScaffoldSplit().
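To make the distinction between the two splits concrete, here is a minimal pure-Python sketch. It does not use QSARtuna's own splitting code: the helper names scaffold_holdout and kfold, the greedy group assignment, and the toy scaffold labels are all hypothetical and chosen only for illustration. The idea it shows is that the evaluation split first sets aside a test set by scaffold group, and the cross-validation folds used during optimisation are then drawn only from the remaining training data.

```python
from collections import defaultdict


def scaffold_holdout(scaffolds, test_fraction=0.2):
    """Split indices so that no scaffold label appears in both train and test.

    `scaffolds` holds one precomputed scaffold label per molecule. This is a
    hypothetical input; a real pipeline would derive the labels from SMILES.
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    n_test = round(len(scaffolds) * test_fraction)
    n_train_target = len(scaffolds) - n_test

    # Visit the largest groups first; a group that would overflow the
    # training budget is sent to the test set as a whole.
    train, test = [], []
    for members in sorted(groups.values(), key=len, reverse=True):
        if len(train) + len(members) <= n_train_target:
            train.extend(members)
        else:
            test.extend(members)
    return sorted(train), sorted(test)


def kfold(indices, n_folds=3):
    """Yield (fit, validate) index lists for plain K-fold cross-validation."""
    for k in range(n_folds):
        validate = indices[k::n_folds]
        fit = [i for i in indices if i not in validate]
        yield fit, validate


# Toy data: ten molecules sharing four scaffolds (labels are made up).
scaffolds = ["A", "A", "A", "A", "B", "B", "B", "C", "C", "D"]

# Evaluation split: the test indices stay hidden during optimisation.
train_idx, test_idx = scaffold_holdout(scaffolds, test_fraction=0.2)

# Cross-validation split: folds are built from the training indices only.
folds = list(kfold(train_idx, n_folds=3))

print("train:", train_idx)
print("test: ", test_idx)
for fit, validate in folds:
    print("fit:", fit, "validate:", validate)
```

In this sketch the per-fold validation scores would guide the hyperparameter search, while the held-out test indices are touched only once, to report final performance. That corresponds to split_strategy on the dataset versus cross_validation and cv_split_strategy in the settings.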

I hope this helps. Let me know if you run into any problems.

kc-chong · 2024-10-04T07:39:40Z

@lewismervin1 Thanks for the response! Appreciate that :)

kc-chong · 2024-10-09T05:52:31Z

@lewismervin1,

I realise that the example still sets cv in the OptimizationConfig, and from the source code it looks like cv is always active during fitting. So I'm a bit confused about the difference between the split in the dataset and the split in cv. Do you mind clarifying? More specifically, when I define ScaffoldSplit in the dataset, I assume the data is split into training and validation sets by ScaffoldSplit, and that the score on the validation set is used to guide the Optuna optimization. If that's the case, what is the role of cv here?

kc-chong closed this as completed Oct 7, 2024

kc-chong reopened this Oct 9, 2024
