This is the code for the paper "Data augmentation and transfer learning strategies for reaction prediction in low chemical data regimes". The preprint is available on ChemRxiv at https://doi.org/10.26434/chemrxiv.13383275.v1
The dataset we used is called the general chemical reaction dataset and contains approximately 380,000 chemical reactions. These reaction examples were originally sourced from Lowe's dataset, which was extracted from United States Patent and Trademark Office (USPTO) patents, and were then subjected to a series of pretreatments in which all reagents and conditions were removed. The input data for training and validation is in the tmp folder.
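As an illustration of what removing reagents from a reaction record can look like, here is a minimal sketch. It assumes each record is a plain reaction SMILES string of the form `reactants>agents>products` with no trailing annotations; the function name `strip_agents` is hypothetical, and this is not necessarily how the dataset was actually prepared.

```python
def strip_agents(reaction_smiles: str) -> str:
    """Return the reaction SMILES with its agent block emptied.

    Assumes the input has exactly three '>'-separated parts:
    reactants, agents (reagents, solvents, catalysts) and products.
    """
    parts = reaction_smiles.strip().split(">")
    if len(parts) != 3:
        raise ValueError(
            f"expected 'reactants>agents>products', got {reaction_smiles!r}"
        )
    reactants, _agents, products = parts
    # Keep reactants and products; leave the middle (agent) field empty.
    return f"{reactants}>>{products}"


# Illustrative record: esterification with sulfuric acid listed as an agent.
print(strip_agents("CC(=O)O.OCC>OS(=O)(=O)O>CC(=O)OCC.O"))
# CC(=O)O.OCC>>CC(=O)OCC.O
```

A record whose agent block is already empty passes through unchanged.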
Preprocess the input data by running the datagen.sh script; the output data goes in the t2t_data folder.
We use the Python program data_augmentation.py to augment the training set of the Baeyer-Villiger reaction dataset, which is represented as SMILES strings.
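The internals of data_augmentation.py are not described here. To show the general idea of augmenting reaction SMILES, the sketch below uses one simple technique: permuting the order of the reactants, which changes the input string without changing the chemistry. This is an assumption chosen for illustration and may differ from what the script does; generating non-canonical SMILES with a cheminformatics toolkit is another common approach. The function name `permute_reactants` and its `max_variants` parameter are hypothetical.

```python
from itertools import permutations


def permute_reactants(reaction_smiles: str, max_variants: int = 5) -> list[str]:
    """Return up to max_variants reaction SMILES with reordered reactants.

    Assumes the input has the form 'reactants>>products', with individual
    reactant fragments separated by '.'. The original order comes first.
    """
    reactants, products = reaction_smiles.strip().split(">>")
    molecules = reactants.split(".")
    variants: list[str] = []
    for order in permutations(molecules):
        candidate = ".".join(order) + ">>" + products
        # Skip duplicates, which occur when the same fragment appears twice.
        if candidate not in variants:
            variants.append(candidate)
        if len(variants) == max_variants:
            break
    return variants


# Illustrative Baeyer-Villiger example: cyclohexanone and peracetic acid
# giving caprolactone (by-products omitted).
for smiles in permute_reactants("O=C1CCCCC1.CC(=O)OO>>O=C1CCCCCO1"):
    print(smiles)
# O=C1CCCCC1.CC(=O)OO>>O=C1CCCCCO1
# CC(=O)OO.O=C1CCCCC1>>O=C1CCCCCO1
```

A reaction with a single reactant yields only the original string, so this kind of permutation helps only for multi-component inputs.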
To start training the model, run the train.sh script.
To start testing the model, run the decode.sh script.