scikit-chem currently only provides descriptors implemented in RDKit. Its easy-to-use core API should make it straightforward to implement more descriptors quickly.
This could tie in with #46. I really like the idea of scikit-chem eventually being to cheminformatics what scikit-learn is to machine learning: a good place to get easy, open access to a large core of the functionality in an area, with good documentation of the underlying theory.
Specifically for descriptors, I am aiming for a number of descriptors somewhere between what ChemoPy and DRAGON provide.
The DRAGON descriptor blocks are listed below. Create a separate issue for each type, or strike it out if it is out of scope:
Constitutional
Ring descriptors
Topological indices
Walk and path counts
Connectivity indices
Information indices
2D matrix based
2D autocorrelations
Burden eigenvalues
P-VSA like (MOE)
ETA indices
Edge adjacency indices
Geometrical descriptors
3D matrix-based descriptors
3D autocorrelations
RDF descriptors
MoRSE descriptors
WHIM descriptors
Molecular profiles
Functional groups count
Atom-centered fragments (I think this is substructures...)
Atom-type E-state indices
CATS 2D
2D atom pairs
Charge descriptors
Molecular properties
Drug-like indices
CATS 3D
Implementation details
Each descriptor set should ideally have
Mathematical definitions
Example usage in docstring (with values from the source, so the testing framework picks this up).
References to original works, preferably including DOIs.
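As a rough illustration of these three points, here is a minimal sketch of a documented descriptor function whose docstring example is picked up by doctest. The function name `heavy_atom_count` and the list-of-element-symbols molecule are hypothetical stand-ins, not part of scikit-chem; in practice the expected values would come from the cited source.

```python
import doctest


def heavy_atom_count(symbols):
    """Count the non-hydrogen atoms in a molecule.

    Definition: N_heavy = |{a in atoms : element(a) != H}|.

    Parameters
    ----------
    symbols : list of str
        Element symbols, one per atom (a toy stand-in for a molecule).

    Examples
    --------
    Ethanol (C2H5OH) has three heavy atoms:

    >>> heavy_atom_count(["C", "C", "O", "H", "H", "H", "H", "H", "H"])
    3

    References
    ----------
    Placeholder: cite the original work here, preferably with a DOI.
    """
    return sum(1 for s in symbols if s != "H")


# Parse the docstring and run its example, as a test framework would.
test = doctest.DocTestParser().get_doctest(
    heavy_atom_count.__doc__,
    {"heavy_atom_count": heavy_atom_count},  # names visible to the example
    "heavy_atom_count",
    None,
    0,
)
results = doctest.DocTestRunner(verbose=False).run(test)
```

`results.failed` is zero when the documented value matches what the function returns, so a wrong reference value or a regression in the implementation shows up as a test failure.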
Implement as a function
We should implement the descriptors as functions for now.
These need to be plugged into the pipelining architecture somewhere along the line, so they will probably end up as (static?) methods on classes (so they can still be used directly), called from the class's Descriptorizer._transform_mol (but let's not call it Descriptorizer...). The functions should dictate the index for the columns they provide, but _transform_mol should yield results as np.arrays so they can be stacked faster.
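A minimal sketch of how this might fit together, under some loud assumptions: the class name `ConstitutionalFeatures` and its descriptors are hypothetical, a molecule is represented as a plain list of element symbols, and plain lists stand in for the np.arrays mentioned above so that the example has no dependencies.

```python
class ConstitutionalFeatures:
    """Hypothetical descriptor set; the name is illustrative only."""

    # The descriptor functions dictate the column index.
    index = ("n_atoms", "n_heavy_atoms")

    @staticmethod
    def n_atoms(symbols):
        # Static, so it can be called directly without an instance.
        return len(symbols)

    @staticmethod
    def n_heavy_atoms(symbols):
        return sum(1 for s in symbols if s != "H")

    def _transform_mol(self, symbols):
        # Bare values in index order, with no labels attached.
        # A real implementation would return an np.array here.
        return [getattr(self, name)(symbols) for name in self.index]

    def transform(self, mols):
        # Rows are stacked first; labels from `index` are applied once at the end.
        return [self._transform_mol(m) for m in mols]


water = ["O", "H", "H"]
methane = ["C", "H", "H", "H", "H"]

rows = ConstitutionalFeatures().transform([water, methane])
direct = ConstitutionalFeatures.n_heavy_atoms(methane)
```

Keeping the labels in `index` and out of `_transform_mol` means each molecule yields a bare row, and the column names only need to be attached once to the stacked result.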
Caching
A (the?) key issue with descriptors (especially in ChemoPy, from what I can see) is that some quantities are required by many, diverse descriptors. A good implementation should cache these. For example, RDKit caches them as m._{}, such as m._gasteigerCharges.
Whilst this works, it dirties the object, takes up unnecessary memory, and makes writing code quite difficult, as you need to call everything your descriptor depends on in case the quantities haven't been cached.
I propose a method similar in function, but perhaps cleaner in execution. We could declare requirements for cached descriptors, and have the value injected into the function.
Example usage is:
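One way this could look, as a minimal sketch rather than the actual scikit-chem API: the decorators `cached` and `requires`, the toy `Mol` class, and the `element_counts` quantity are all hypothetical names chosen for illustration.

```python
import functools


class Mol:
    """Toy stand-in for a molecule: element symbols plus a cache dict."""

    def __init__(self, symbols):
        self.symbols = symbols
        self.cache = {}


_PROVIDERS = {}  # maps a cached quantity's name to the function computing it


def cached(func):
    """Register func as a cacheable quantity, keyed by its function name."""
    _PROVIDERS[func.__name__] = func
    return func


def requires(*names):
    """Declare the cached quantities a descriptor needs and inject them."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(mol):
            injected = {}
            for name in names:
                if name not in mol.cache:  # compute at most once per Mol
                    mol.cache[name] = _PROVIDERS[name](mol)
                injected[name] = mol.cache[name]
            return func(mol, **injected)
        return wrapper
    return decorator


CALLS = {"element_counts": 0}  # only here to show the provider runs once


@cached
def element_counts(mol):
    CALLS["element_counts"] += 1
    counts = {}
    for s in mol.symbols:
        counts[s] = counts.get(s, 0) + 1
    return counts


@requires("element_counts")
def n_carbons(mol, element_counts):
    return element_counts.get("C", 0)


@requires("element_counts")
def n_hydrogens(mol, element_counts):
    return element_counts.get("H", 0)


m = Mol(["C", "C", "O"] + ["H"] * 6)  # ethanol
n_c = n_carbons(m)
n_h = n_hydrogens(m)
provider_calls = CALLS["element_counts"]
m.cache.clear()  # empty the cache once all descriptors are calculated
```

The descriptor author never has to check whether a quantity has already been computed: declaring the requirement is enough, and both descriptors above share a single call to `element_counts`.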
The cache puts the properties into the m.cache dictionary on the Mol, which could be emptied once we are finished calculating descriptors.
If there are options, we could save a nested dictionary.
The caching mechanism forwards the keyword args to the injected properties.
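A minimal sketch of the option-aware variant, again with hypothetical names (`cached`, `requires`, `Mol`, `atom_count`). It assumes that keyword arguments are forwarded only to the injected properties, that option values are hashable, and that the inner dictionary is keyed by the sorted keyword arguments.

```python
import functools


class Mol:
    """Toy stand-in for a molecule: element symbols plus a cache dict."""

    def __init__(self, symbols):
        self.symbols = symbols
        self.cache = {}


_PROVIDERS = {}  # maps a cached quantity's name to the function computing it


def cached(func):
    _PROVIDERS[func.__name__] = func
    return func


def requires(*names):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(mol, **kwargs):
            # Freeze the options into a hashable key for the inner dictionary.
            key = tuple(sorted(kwargs.items()))
            injected = {}
            for name in names:
                per_option = mol.cache.setdefault(name, {})
                if key not in per_option:
                    # Keyword args are forwarded to the injected property.
                    per_option[key] = _PROVIDERS[name](mol, **kwargs)
                injected[name] = per_option[key]
            return func(mol, **injected)
        return wrapper
    return decorator


@cached
def atom_count(mol, include_hydrogens=True):
    if include_hydrogens:
        return len(mol.symbols)
    return sum(1 for s in mol.symbols if s != "H")


@requires("atom_count")
def atoms_per_carbon(mol, atom_count):
    return atom_count / mol.symbols.count("C")


m = Mol(["C", "C", "O"] + ["H"] * 6)  # ethanol
with_h = atoms_per_carbon(m)
without_h = atoms_per_carbon(m, include_hydrogens=False)
```

After both calls, `m.cache["atom_count"]` holds one entry per distinct set of options, so repeating a call with the same options reuses the stored value.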