Descriptors #56

Open
28 tasks
lewisacidic opened this issue Aug 29, 2016 · 0 comments

scikit-chem currently only provides descriptors implemented in RDKit. The easy-to-use core API in scikit-chem should make it quick and simple to implement more descriptors.

This could tie in with #46. I really like the idea of scikit-chem eventually being to cheminformatics what scikit-learn is to machine learning: a good place to get easy, open access to a large core of the functionality in an area, with good documentation of the underlying theory.

Specifically for descriptors, I am looking at implementing somewhere between ChemoPy and DRAGON in terms of the number of descriptors covered.

To enumerate the DRAGON list (create a separate issue for each type), or strike out if it is out of scope:

  • Constitutional
  • Ring descriptors
  • Topological indices
  • Walk and path counts
  • Connectivity indices
  • Information indices
  • 2D matrix based
  • 2D autocorrelations
  • Burden eigenvalues
  • P-VSA like (MOE)
  • ETA indices
  • Edge adjacency indices
  • Geometrical descriptors
  • 3D matrix-based descriptors
  • 3D autocorrelations
  • RDF descriptors
  • MoRSE descriptors
  • WHIM descriptors
  • Molecular profiles
  • Functional groups count
  • Atom-centered fragments (I think this is substructures...)
  • Atom-type E-state indices
  • CATS 2D
  • 2D atom pairs
  • Charge descriptors
  • Molecular properties
  • Drug-like indices
  • CATS 3D

Implementation details

Each descriptor set should ideally have

  • Mathematical definitions
  • Example usage in docstring (with values from the source, so the testing framework picks this up).
  • References to original works, preferably including DOIs.
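
As a rough illustration of the three points above, a descriptor function might be documented like this. This is only a sketch: the Wiener index is used as an example, the adjacency-dict input is a hypothetical stand-in for a skchem Mol, and the expected value for n-butane was worked out by hand rather than taken from the original paper.

```python
from collections import deque


def wiener_index(adjacency):
    """Wiener index of a hydrogen-suppressed molecular graph.

    Defined as W = sum over pairs i < j of d_ij, where d_ij is the number
    of bonds on the shortest path between heavy atoms i and j.

    Parameters
    ----------
    adjacency : dict
        Maps each atom index to a list of neighbouring atom indices
        (hypothetical input format, used here instead of a Mol).

    Examples
    --------
    n-Butane, a chain of four carbons:

    >>> wiener_index({0: [1], 1: [0, 2], 2: [1, 3], 3: [2]})
    10

    References
    ----------
    Wiener, H. J. Am. Chem. Soc. 1947, 69, 17-20. doi:10.1021/ja01193a005
    """
    total = 0
    for source in adjacency:
        # Breadth-first search gives shortest path lengths from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            atom = queue.popleft()
            for nbr in adjacency[atom]:
                if nbr not in dist:
                    dist[nbr] = dist[atom] + 1
                    queue.append(nbr)
        total += sum(dist.values())
    # Every pair was visited from both ends, so halve the total.
    return total // 2
```

Because the example lives in the docstring, a doctest-based test run would check the value automatically.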

Implement as a function

We should implement the descriptors as functions for now.

These need to be plugged into the pipelining architecture at some point, so they will probably end up as (static?) methods on classes (so they can still be used directly), which are called from the class's Descriptorizer._transform_mol (but let's not call it Descriptorizer...). The functions should dictate the index for the columns they provide, but _transform_mol should yield results as np.arrays so they can be stacked faster.
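
One possible shape for this, as a sketch only: the class name, the `index` attribute and the `transform` method are all hypothetical, and a molecule is represented as a plain list of element symbols rather than a skchem Mol.

```python
import numpy as np


class ConstitutionalFeatures:
    """Hypothetical descriptor set; all names here are illustrative only."""

    # Column labels, in the same order as the values from _transform_mol.
    index = ('n_atoms', 'n_heavy_atoms')

    @staticmethod
    def n_atoms(symbols):
        """Total atom count; usable directly without an instance."""
        return len(symbols)

    @staticmethod
    def n_heavy_atoms(symbols):
        """Count of non-hydrogen atoms."""
        return sum(1 for s in symbols if s != 'H')

    def _transform_mol(self, symbols):
        # Return a bare array (no labels) so rows are cheap to stack.
        return np.array([getattr(self, name)(symbols) for name in self.index],
                        dtype=float)

    def transform(self, mols):
        # Stack one row per molecule; labels can be attached afterwards
        # from `index`.
        return np.vstack([self._transform_mol(m) for m in mols])


ethane = ['C', 'C'] + ['H'] * 6
methane = ['C'] + ['H'] * 4

features = ConstitutionalFeatures()
table = features.transform([ethane, methane])
print(table.shape)  # (2, 2)
```

The static methods stay callable on their own, e.g. `ConstitutionalFeatures.n_heavy_atoms(ethane)`, while the class decides the column order.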

Caching

A (the?) key issue with descriptors (especially in ChemoPy, from what I can see) is that some quantities are required by many, diverse descriptors. A good implementation should cache these. For example, RDKit caches them as m._{}, such as m._gasteigerCharges.

Whilst this works, it dirties the object, takes up unnecessary memory, and makes writing code quite difficult, as you need to call everything your descriptor depends on in case the quantities haven't been cached.

I propose a method similar in function, but perhaps cleaner in execution. We could declare requirements for cached descriptors, and have the value injected into the function.

Example usage is:

from .caching import cache

>>> @cache
... def atomic_masses(mol):
...     return mol.atoms.atomic_mass

>>> @cache.inject(atomic_masses)
... def molecular_weight(mol, a_mass):
...     return sum(a_mass)

>>> m = skchem.Mol.from_smiles('CC').add_hs()
>>> molecular_weight(m)
30...

>>> m.cache['atomic_masses']
array([12, 12, 1, 1, 1, 1, 1, 1])

The cache stores the properties in the m.cache dictionary on the Mol, which can be emptied once we are finished calculating descriptors.

If the cached function takes options, we could save a nested dictionary keyed by the option value:

>>> @cache
... def atomic_masses(mol, heavy=False):
...     return mol.atoms.atomic_mass + (1 if heavy else 0)

>>> _ = atomic_masses(m, heavy=False)
>>> _ = atomic_masses(m, heavy=True)
>>> m.cache['atomic_masses']
{False: array([12, 12, 1, 1, 1, 1, 1, 1]), True: array([13, 13, 2, 2, 2, 2, 2, 2])}

>>> @cache.inject(atomic_masses)
... def molecular_mass(mol, a_mass, heavy=False):
...     return sum(a_mass)

>>> molecular_mass(m)
30...

>>> molecular_mass(m, heavy=True)
38...

The caching mechanism forwards the keyword args to the injected properties.
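
To make the proposal concrete, here is a minimal sketch of how such a decorator could work. It is not an existing scikit-chem API: the `Cache` class, the stand-in `Mol` class and the `calls` list are assumptions made for illustration. It also differs from the example above in one respect: cached values are keyed by a tuple of the sorted keyword arguments rather than by the bare option value, which assumes all keyword values are hashable.

```python
import functools


class Cache:
    """Decorator object supporting both @cache and @cache.inject(...)."""

    def __call__(self, func):
        @functools.wraps(func)
        def wrapper(mol, **kwargs):
            # One sub-dictionary per cached function, keyed by its options.
            store = mol.cache.setdefault(func.__name__, {})
            key = tuple(sorted(kwargs.items()))
            if key not in store:
                store[key] = func(mol, **kwargs)
            return store[key]
        return wrapper

    def inject(self, *dependencies):
        def decorator(func):
            @functools.wraps(func)
            def wrapper(mol, **kwargs):
                # Forward the keyword args to every injected property.
                injected = [dep(mol, **kwargs) for dep in dependencies]
                return func(mol, *injected, **kwargs)
            return wrapper
        return decorator


cache = Cache()


class Mol:
    """Minimal stand-in for skchem.Mol: atomic masses plus an empty cache."""

    def __init__(self, masses):
        self.masses = list(masses)
        self.cache = {}


calls = []  # records each time the masses are actually computed


@cache
def atomic_masses(mol, heavy=False):
    calls.append(heavy)
    return [m + (1 if heavy else 0) for m in mol.masses]


@cache.inject(atomic_masses)
def molecular_mass(mol, a_mass, heavy=False):
    return sum(a_mass)


ethane = Mol([12, 12, 1, 1, 1, 1, 1, 1])
default_mass = molecular_mass(ethane)            # computes and caches
heavy_mass = molecular_mass(ethane, heavy=True)  # separate cache entry
heavy_again = molecular_mass(ethane, heavy=True) # served from the cache
ethane.cache.clear()                             # empty once finished
```

One limitation of this sketch is that `atomic_masses(mol)` and `atomic_masses(mol, heavy=False)` land under different keys even though they are equivalent; a fuller version would probably normalise the arguments against the function's defaults first.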

@lewisacidic lewisacidic added this to the v0.1.0 milestone Aug 29, 2016
@lewisacidic lewisacidic self-assigned this Aug 29, 2016