Descriptors #56

lewisacidic · 2016-08-29T20:12:02Z

scikit-chem currently only provides descriptors implemented in RDKit. The easy-to-use core API in scikit-chem should make it quick to implement more descriptors.

This could tie in with #46. I really like the idea of scikit-chem eventually being to cheminformatics what scikit-learn is to machine learning: a good place to get easy, open access to a large core of the functionality in an area, with good documentation of the underlying theory.

Specifically for descriptors, I am looking at implementing a set somewhere between ChemoPy and DRAGON in the number of descriptors covered.

To enumerate the DRAGON list (create a separate issue for each type), or strike out if it is out of scope:

Implementation details

Each descriptor set should ideally have:

Mathematical definitions
Example usage in docstring (with values from the source, so the testing framework picks this up).
References to original works, preferably including DOIs.
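
A minimal sketch of what such a docstring could look like, using the Wiener index as the example descriptor. The function name, the adjacency-dict input (in place of a skchem Mol) and the docstring layout are assumptions made for illustration, not an agreed convention; the n-butane value is worked out by hand from the definition.

```python
from collections import deque


def wiener_index(adjacency):
    r"""Wiener index of a hydrogen-suppressed molecular graph.

    .. math:: W = \frac{1}{2} \sum_{i} \sum_{j} d_{ij}

    where :math:`d_{ij}` is the number of bonds on the shortest path
    between atoms i and j.

    Args:
        adjacency (dict): maps each atom index to a list of its neighbours.
            (Hypothetical input format, standing in for a Mol.)

    Examples:
        n-Butane, C-C-C-C, has pair distances 1, 1, 1, 2, 2, 3:

        >>> wiener_index({0: [1], 1: [0, 2], 2: [1, 3], 3: [2]})
        10

    References:
        Wiener, H. J. Am. Chem. Soc. 1947, 69, 17-20.
        doi:10.1021/ja01193a005
    """
    total = 0
    for start in adjacency:
        # Breadth-first search gives shortest path lengths from `start`.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        total += sum(dist.values())
    # Every unordered pair was counted twice.
    return total // 2
```

Because the example sits in the docstring, a doctest runner would check the value automatically.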

Implement as a function

We should implement the descriptors as functions for now.

These need to be plugged into the pipelining architecture somewhere along the line, so they will probably end up as (static?) methods on classes (so they can be used directly), which are called from the class's Descriptorizer._transform_mol (but let's not call it Descriptorizer...). They should dictate the index for the columns that are provided, but _transform_mol should yield results as np.arrays so they can be stacked faster.
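
One way this could look, as a rough sketch only. The class name, the `columns` attribute, the `transform` method and the toy molecule format (a list of element symbols instead of a skchem Mol) are all assumptions for illustration.

```python
import numpy as np


class AtomCountFeaturizer:
    """Hypothetical descriptor set built from static descriptor functions."""

    # The descriptor set dictates the column index of its output.
    columns = ('n_atoms', 'n_heavy_atoms')

    @staticmethod
    def n_atoms(symbols):
        """Total number of atoms."""
        return len(symbols)

    @staticmethod
    def n_heavy_atoms(symbols):
        """Number of non-hydrogen atoms."""
        return sum(1 for s in symbols if s != 'H')

    def _transform_mol(self, symbols):
        # Return a bare array (no labels) so rows can be stacked cheaply.
        return np.array([getattr(self, name)(symbols) for name in self.columns])

    def transform(self, mols):
        # Stack the per-molecule arrays; labels can be attached once at the end.
        return np.vstack([self._transform_mol(m) for m in mols])


ethane = ['C', 'C'] + ['H'] * 6
water = ['O', 'H', 'H']
features = AtomCountFeaturizer().transform([ethane, water])
```

The static methods stay usable on their own, for example `AtomCountFeaturizer.n_heavy_atoms(ethane)`, while the class only handles ordering and stacking.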

Caching

A (the?) key issue with descriptors (especially in ChemoPy, from what I can see) is that some quantities are required by many, diverse descriptors. A good implementation should cache these. For example, RDKit caches them as m._{}, such as m._gasteigerCharges.

Whilst this works, it dirties the object, takes up unnecessary memory, and makes writing code quite difficult, as you need to call everything your descriptor depends on in case the quantities haven't been cached.

I propose a method similar in function, but perhaps cleaner in execution. We could declare requirements for cached descriptors, and have the value injected into the function.

Example usage is:

from .caching import cache

>>> @cache
... def atomic_masses(mol):
...     return mol.atoms.atomic_mass

>>> @cache.inject(atomic_masses)
... def molecular_weight(mol, a_mass):
...     return sum(a_mass)

>>> m = skchem.Mol.from_smiles('CC').add_hs()
>>> molecular_weight(m)
30...

>>> m.cache['atomic_masses']
array([12, 12, 1, 1, 1, 1, 1, 1])

The cache puts the properties into the m.cache dictionary on the Mol, which could be emptied once we are finished calculating descriptors.

If there are options, we could save a nested dictionary keyed by the option value:

>>> @cache
... def atomic_masses(mol, heavy=False):
...     return mol.atoms.atomic_mass + (1 if heavy else 0)

>>> mol = skchem.Mol.from_smiles('CC').add_hs()
>>> _ = atomic_masses(mol, heavy=False)
>>> _ = atomic_masses(mol, heavy=True)
>>> mol.cache['atomic_masses']
{False: array([12, 12, 1, 1, 1, 1, 1, 1]), True: array([13, 13, 2, 2, 2, 2, 2, 2])}

>>> @cache.inject(atomic_masses)
... def molecular_mass(mol, a_mass, heavy=False):
...     return sum(a_mass)

>>> molecular_mass(mol)
30...

>>> molecular_mass(mol, heavy=True)
38...

The caching mechanism forwards the keyword args to the injected properties.
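
To make the proposal concrete, here is a rough, self-contained sketch of how such a `cache` object might be written. Everything in it is an assumption: the `_Cache` class, the stand-in `Mol` (a plain list of masses rather than a skchem Mol), and the cache layout, which keys the inner dictionary on a tuple of (option, value) pairs rather than on the bare option value.

```python
import functools
import inspect


class _Cache:
    """Hypothetical implementation of the proposed `cache` decorator."""

    def __call__(self, func):
        sig = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(mol, **options):
            # Normalise the options so heavy=False and the default share a key.
            bound = sig.bind(mol, **options)
            bound.apply_defaults()
            key = tuple(list(bound.arguments.items())[1:])  # skip `mol`
            store = mol.cache.setdefault(func.__name__, {})
            if key not in store:
                store[key] = func(mol, **options)
            return store[key]
        return wrapper

    def inject(self, *deps):
        def decorator(func):
            @functools.wraps(func)
            def wrapper(mol, **options):
                injected = []
                for dep in deps:
                    # Forward only the options the dependency accepts.
                    params = inspect.signature(dep).parameters
                    accepted = {k: v for k, v in options.items() if k in params}
                    injected.append(dep(mol, **accepted))
                return func(mol, *injected, **options)
            return wrapper
        return decorator


cache = _Cache()


class Mol:
    """Toy stand-in for skchem.Mol: a list of masses plus a cache dict."""

    def __init__(self, masses):
        self.masses = list(masses)
        self.cache = {}


calls = []  # records each real (uncached) computation


@cache
def atomic_masses(mol, heavy=False):
    calls.append(heavy)
    return [m + (1 if heavy else 0) for m in mol.masses]


@cache.inject(atomic_masses)
def molecular_mass(mol, a_mass, heavy=False):
    return sum(a_mass)


ethane = Mol([12, 12, 1, 1, 1, 1, 1, 1])
```

Calling `molecular_mass(ethane)` twice computes `atomic_masses` only once, and `ethane.cache.clear()` empties the cache when descriptor calculation is finished.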

lewisacidic added the enhancement label Aug 29, 2016

lewisacidic added this to the v0.1.0 milestone Aug 29, 2016

lewisacidic self-assigned this Aug 29, 2016

lewisacidic added the descriptor label Aug 30, 2016
