Reproduction of table 1 #2

Open
mrTsjolder opened this issue May 5, 2023 · 1 comment

Comments

@mrTsjolder

I stumbled upon this paper and would like to reproduce some of the results in table 1.
However, when I run the code as indicated in the README, the values are quite far off.
Should it be possible to reproduce the results in table 1 with this codebase?
If yes, what arguments are necessary to get these results?

Concretely, I tried to reproduce the ZINC results by following the README (as closely as possible).
After setting up the environment and downloading the zinc250k.csv file from moflow, I was able to run the data_preprocess.py script.

After downloading the models, I managed to run the following scripts (if I remember correctly):

python chemspace.py --gpu 0 --data_name zinc250k --random
python train_boundary_zinc.py
python chemspace.py --gpu 0 --data_name zinc250k --traverse

However, it might be that I already had to fix the mflow import statements at this stage and ran the generate_prop_ranges.py script at this point.

After creating the zinc250k.txt file from zinc250k.csv and running generate_prop_ranges.py, I was able to run calculate_statistics_single_prop.py --mani_range 1, although this might already have required some changes to the original code.
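For anyone retracing this step, the conversion from zinc250k.csv to zinc250k.txt could look roughly like the sketch below. It assumes that the CSV has a column named `smiles` and that the scripts expect one SMILES string per line in the text file; neither is confirmed in this thread, and the helper name `csv_to_smiles_txt` is made up for illustration.

```python
import csv


def csv_to_smiles_txt(csv_path, txt_path, column="smiles"):
    """Write one SMILES string per line, taken from `column` of the CSV.

    Assumption: the column name "smiles" and the one-string-per-line
    output format are guesses, not something the repository documents.
    """
    count = 0
    # newline="" lets the csv module handle quoted fields with embedded newlines.
    with open(csv_path, newline="") as src, open(txt_path, "w") as dst:
        for row in csv.DictReader(src):
            smiles = row[column].strip()  # drop stray whitespace around the string
            if smiles:
                dst.write(smiles + "\n")
                count += 1
    return count


if __name__ == "__main__":
    import os

    # Only convert if the input file is actually present in the working directory.
    if os.path.exists("zinc250k.csv"):
        print(csv_to_smiles_txt("zinc250k.csv", "zinc250k.txt"), "molecules written")
```

If the text file is expected in another format, only the `dst.write` line should need to change.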

After some further modifications (most notably by creating directories that were missing for the code to work), I also managed to run the random and largest baselines as follows:

python chemspace.py --gpu 0 --data_name zinc250k --traverse --baseline random
python chemspace.py --gpu 0 --data_name zinc250k --largest
python chemspace.py --gpu 0 --data_name zinc250k --traverse --baseline largest

which allowed me to run calculate_statistics_single_prop.py on these baselines as well.
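To make the sequence easier to rerun, the commands above can be collected in a small driver script. This is only a sketch: the commands and their order are copied from this issue, but `run_pipeline`, its `extra_dirs` argument and the dry-run default are my own additions, and the directories that actually have to be created are not listed here, so they must be filled in by hand.

```python
import subprocess
import sys
from pathlib import Path

# Commands as listed in this issue, in the order they were run.
BASE = ["chemspace.py", "--gpu", "0", "--data_name", "zinc250k"]
STEPS = [
    BASE + ["--random"],
    ["train_boundary_zinc.py"],
    BASE + ["--traverse"],
    BASE + ["--traverse", "--baseline", "random"],
    BASE + ["--largest"],
    BASE + ["--traverse", "--baseline", "largest"],
]


def run_pipeline(repo_dir, extra_dirs=(), dry_run=True):
    """Create missing directories, then print (and optionally run) each step.

    `extra_dirs` is a placeholder for whatever output directories the
    scripts expect; their real names are not given in this thread.
    """
    repo = Path(repo_dir)
    for name in extra_dirs:
        (repo / name).mkdir(parents=True, exist_ok=True)
    commands = [[sys.executable] + step for step in STEPS]
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing script.
            subprocess.run(cmd, cwd=repo, check=True)
    return commands


if __name__ == "__main__":
    run_pipeline(".", dry_run=True)
```

With `dry_run=True` the script only prints the commands; passing `dry_run=False` executes them from `repo_dir` and aborts on the first error.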

All of this eventually provided me with the following results:

| QED | strict | relaxed local | relaxed global |
| --- | --- | --- | --- |
| random | 12.5 | 15.0 | 18.0 |
| largest | 17.0 | 18.0 | 24.5 |
| chemspace | 69.0 | 69.0 | 73.5 |

whereas table 1 (together with tables 5 and 6) in the paper seems to suggest something closer to

| QED | strict | relaxed local | relaxed global |
| --- | --- | --- | --- |
| random | 1.5 | 3.5 | 6.0 |
| largest | 1.5 | 3.0 | 4.5 |
| chemspace | 52.0 | 53.5 | 57.0 |
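To put a number on the discrepancy, the two sets of QED values can be subtracted entry by entry. The snippet below is a small sketch that uses only the numbers quoted in this issue; the helper name `gaps` is made up for illustration.

```python
COLUMNS = ("strict", "relaxed local", "relaxed global")

# Values obtained with the released code (first table in this issue).
reproduced = {
    "random": (12.5, 15.0, 18.0),
    "largest": (17.0, 18.0, 24.5),
    "chemspace": (69.0, 69.0, 73.5),
}

# Values read off the paper (second table in this issue).
reported = {
    "random": (1.5, 3.5, 6.0),
    "largest": (1.5, 3.0, 4.5),
    "chemspace": (52.0, 53.5, 57.0),
}


def gaps(reproduced, reported):
    """Return reproduced minus reported for every method and column."""
    return {
        method: tuple(r - p for r, p in zip(reproduced[method], reported[method]))
        for method in reproduced
    }


for method, diffs in gaps(reproduced, reported).items():
    print(method, dict(zip(COLUMNS, diffs)))
```

Every reproduced entry is higher than the reported one, by between 11.0 and 20.0 points, so the shift affects the baselines as well as ChemSpacE.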

Any chance you could provide me with some pointers (or explain the discrepancies)?

mrTsjolder changed the title from "Commandseproduction of table 1" to "Reproduction of table 1" on May 5, 2023
@yuanqidu
Owner

Thanks for your interest in our paper! We refactored the code before releasing it. At first glance, the results make sense: ChemSpacE outperforms the baseline methods by a large margin, as they are very simple. I will try to find some time to look through it, but I think the results are not very surprising, despite being different from what we reported in the paper.
