astartes v1.3.0 (#176)
This PR will resolve a number of issues and then be released as `astartes` v1.3.0:
- resolves #175
- resolves #171
- resolves #128: will merge #134 into this PR and then get it working the rest of the way
JacksonBurns authored Aug 14, 2024
2 parents 09aa54b + 4fcbcf2 commit ce50dce
Showing 7 changed files with 108 additions and 25 deletions.
2 changes: 1 addition & 1 deletion CODE_OF_CONDUCT.md
@@ -1,4 +1,4 @@
# RMG Code of Conduct
# `astartes` Code of Conduct

## Our Pledge

4 changes: 2 additions & 2 deletions CONTRIBUTING.md
@@ -16,8 +16,8 @@ To contribute to the `astartes` source code, start by forking and then cloning t
### Version Checking

`astartes` uses `pyproject.toml` to specify all metadata, but the version is also specified in `astartes/__init__.py` (via `__version__`) for backwards compatibility with Python 3.7.
To check which version of `astartes` you have installed, you can run `python -c "import astartes; print(astartes.__version__)"` on Python 3.7 or `python -c "from importlib.metadata import version; version('astartes')" on Python 3.8 or newer.
`astartes` uses `pyproject.toml` to specify all metadata except the version, which is specified in `astartes/__init__.py` (via `__version__`) for backwards compatibility with Python 3.7.
To check which version of `astartes` you have installed, you can run `python -c "import astartes; print(astartes.__version__)"` on Python 3.7 or `python -c "from importlib.metadata import version; version('astartes')"` on Python 3.8 or newer.
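
As a side note on the commands above: the `importlib.metadata` one-liner evaluates `version('astartes')` without printing it, so it produces no output when run through `python -c`. A minimal sketch of a version check that also handles the "not installed" case is shown below. The helper name `installed_version` is hypothetical and is not part of `astartes`.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        # the distribution metadata could not be found in this environment
        return None


if __name__ == "__main__":
    # prints the version string, or None when astartes is not installed
    print(installed_version("astartes"))
```

On Python 3.7, where `importlib.metadata` is unavailable, `astartes.__version__` remains the fallback described above.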

### Testing
All of the tests in `astartes` are written using the built-in python `unittest` module (to allow running without `pytest`) but we _highly_ recommend using `pytest`.
15 changes: 8 additions & 7 deletions README.md
@@ -20,7 +20,7 @@
</tr>
<tr>
<td><img alt="PyPI - License" src="https://img.shields.io/github/license/JacksonBurns/astartes"></td>
<td><img alt="Test Status" src="https://github.com/JacksonBurns/astartes/actions/workflows/run_tests.yml/badge.svg?branch=main&event=schedule"></td>
<td><img alt="Test Status" src="https://github.com/JacksonBurns/astartes/actions/workflows/ci.yml/badge.svg?branch=main&event=schedule"></td>
<td><a href="https://doi.org/10.5281/zenodo.8147205"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.8147205.svg" alt="DOI"></a></td>
</tr>
<tr>
@@ -121,6 +121,11 @@ Click the badges in the table below to be taken to a live, interactive demo of `
To execute these notebooks locally, clone this repository (i.e. `git clone https://github.com/JacksonBurns/astartes.git`), navigate to the `astartes` directory, run `pip install .[demos]`, then open and run the notebooks in your preferred editor.
You do _not_ need to execute the cells prefixed with `%%capture` - they are only present for compatibility with Google Colab.

#### Packages Using `astartes`
- [Chemprop](https://github.com/chemprop/chemprop), a machine learning library for chemical property prediction, uses `astartes` in the backend for splitting molecular structures.
- [`fastprop`](https://github.com/JacksonBurns/fastprop), a descriptor-based property prediction library, uses `astartes`.
- [Google Scholar list of articles citing the JOSS paper for `astartes`](https://scholar.google.com/scholar?cites=4693802000464819413&as_sdt=40000005&sciodt=0,22&hl=en)

### Withhold Testing Data with `train_val_test_split`
For rigorous ML research, it is critical to withhold some data during training to use a `test` set.
The model should _never_ see this data during training (unlike the validation set) so that we can get an accurate measurement of its performance.
@@ -260,13 +265,9 @@ train_test_split_molecules(
train_size=0.8,
fingerprint="daylight_fingerprint",
fprints_hopts={
"minPath": 2,
"maxPath": 5,
"fpSize": 200,
"bitsPerHash": 4,
"useHs": 1,
"tgtDensity": 0.4,
"minSize": 64,
"numBitsPerFeature": 4,
"useHs": True,
},
sampler="random",
random_state=42,
2 changes: 1 addition & 1 deletion astartes/__init__.py
@@ -1,7 +1,7 @@
# convenience import to enable 'from astartes import train_test_split'
from .main import train_test_split, train_val_test_split

__version__ = "1.2.2"
__version__ = "1.3.0"

# DO NOT do this:
# from .molecules import train_test_split_molecules
4 changes: 2 additions & 2 deletions astartes/main.py
@@ -313,8 +313,8 @@ def _check_actual_split(
)
if actual_val_size != requested_val_size:
msg += "Requested validation size of {:.2f}, got {:.2f}. ".format(
requested_test_size,
actual_test_size,
requested_val_size,
actual_val_size,
)
if actual_test_size != requested_test_size:
msg += "Requested test size of {:.2f}, got {:.2f}. ".format(
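
The hunk above corrects a copy-paste slip: the validation-size message was being formatted with the test-size values. A standalone sketch of the corrected message-building logic follows. It is not the `astartes` implementation: the function name `describe_split_mismatch` and the dictionary-based arguments are hypothetical, and the wording of the train message is assumed by analogy with the validation and test messages shown in the diff.

```python
def describe_split_mismatch(requested: dict, actual: dict) -> str:
    """Build a message listing every split whose actual size differs from the request.

    Both arguments map the keys "train", "val" and "test" to fractional sizes.
    """
    # human-readable label for each split (the train wording is an assumption)
    labels = {"train": "train", "val": "validation", "test": "test"}
    msg = ""
    for key, label in labels.items():
        if actual[key] != requested[key]:
            # each message is formatted with the sizes of its *own* split
            msg += "Requested {} size of {:.2f}, got {:.2f}. ".format(
                label,
                requested[key],
                actual[key],
            )
    return msg


if __name__ == "__main__":
    print(
        describe_split_mismatch(
            {"train": 0.8, "val": 0.1, "test": 0.1},
            {"train": 0.8, "val": 0.12, "test": 0.08},
        )
    )
```

Looping over the splits, rather than repeating the formatting call three times, makes it harder to pair a label with the wrong variables.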
90 changes: 90 additions & 0 deletions examples/morais_lima_martin_sampling/mlm_sampler.ipynb
@@ -0,0 +1,90 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Implementing the Morais-Lima-Martin (MLM) Sampler\n",
"This notebook gives a brief demonstration of using the built-in utilities in `astartes` to implement the Morais-Lima-Martin sampler, which you can read about [here](https://academic.oup.com/bioinformatics/article/35/24/5257/5497250)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`astartes` has a very fast implementation of the Kennard-Stone algorithm, on which the MLM sampler is based, available in its `utils`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from astartes.utils.fast_kennard_stone import fast_kennard_stone"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The MLM sampler can then be implemented as shown below.\n",
"The `mlm_split` function takes a 2D array, orders its rows using the Kennard-Stone algorithm to form an initial split, then swaps a fraction of the indices between the partitions according to the MLM algorithm."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from scipy.spatial.distance import pdist, squareform\n",
"import numpy as np\n",
"\n",
"from astartes.utils.fast_kennard_stone import fast_kennard_stone\n",
"\n",
"\n",
"def mlm_split(X: np.ndarray, *, train_size: float = 0.8, val_size: float = 0.1, test_size: float = 0.1, random_state: int = 42):\n",
"    # order the samples with Kennard-Stone on the pairwise distance matrix\n",
"    ks_indexes = np.asarray(fast_kennard_stone(squareform(pdist(X, \"euclidean\"))))\n",
"    pivot = int(len(ks_indexes) * train_size)\n",
"    train_idxs = ks_indexes[0:pivot]\n",
"    other_idxs = ks_indexes[pivot:]\n",
"\n",
"    # set RNG\n",
"    rng = np.random.default_rng(seed=random_state)\n",
"\n",
"    # choose 10% of train to switch with the same number of val/test samples\n",
"    n_to_permute = min(int(0.1 * len(train_idxs)), len(other_idxs))\n",
"    train_permute_idxs = rng.choice(train_idxs, n_to_permute, replace=False)\n",
"    remaining_train_idxs = train_idxs[~np.isin(train_idxs, train_permute_idxs)]\n",
"    other_permute_idxs = rng.choice(other_idxs, n_to_permute, replace=False)\n",
"    remaining_other_idxs = other_idxs[~np.isin(other_idxs, other_permute_idxs)]\n",
"\n",
"    # reassemble the new arrays of indexes (np.concatenate takes a sequence)\n",
"    new_train_idxs = np.concatenate((remaining_train_idxs, other_permute_idxs))\n",
"    new_other_idxs = np.concatenate((train_permute_idxs, remaining_other_idxs))\n",
"    n_val = int(len(new_other_idxs) * (val_size / (val_size + test_size)))\n",
"    val_indexes = new_other_idxs[0:n_val]\n",
"    test_indexes = new_other_idxs[n_val:]\n",
"\n",
"    # return the split up array, using the permuted training indexes\n",
"    return X[new_train_idxs], X[val_indexes], X[test_indexes]\n",
"    "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "fprop",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
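
For readers unfamiliar with the Kennard-Stone ordering that the notebook builds on, the sketch below shows the idea in plain NumPy. It is a naive reference version written for illustration, not the `astartes` implementation: the name `naive_kennard_stone` is hypothetical, and it stores the full distance matrix, so it is only suitable for small arrays.

```python
import numpy as np


def naive_kennard_stone(X: np.ndarray) -> np.ndarray:
    """Return row indexes of the 2D array X in Kennard-Stone selection order."""
    n = len(X)
    if n < 2:
        return np.arange(n)
    # full pairwise Euclidean distance matrix, shape (n, n)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # start from the two samples that are farthest apart
    first, second = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(first), int(second)]
    # distance from every sample to its nearest already-selected sample
    min_dist = np.minimum(dist[first], dist[second])
    min_dist[selected] = -np.inf  # never pick a sample twice
    while len(selected) < n:
        # next sample: the one farthest from everything selected so far
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, dist[nxt])
        min_dist[nxt] = -np.inf
    return np.array(selected)


if __name__ == "__main__":
    points = np.array([[0.0], [1.0], [10.0], [4.0]])
    print(naive_kennard_stone(points))
```

Taking the first `train_size` fraction of this ordering as the training set yields the Kennard-Stone split that the MLM procedure then perturbs by swapping a random subset of indexes between partitions.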
16 changes: 4 additions & 12 deletions test/functional/test_molecules.py
@@ -126,13 +126,9 @@ def test_fprint_hopts(self):
sampler="random",
fingerprint="topological_fingerprint",
fprints_hopts={
"minPath": 2,
"maxPath": 5,
"fpSize": 200,
"bitsPerHash": 4,
"useHs": 1,
"tgtDensity": 0.4,
"minSize": 64,
"numBitsPerFeature": 4,
"useHs": True,
},
)

@@ -163,13 +159,9 @@ def test_maximum_call(self):
train_size=0.2,
fingerprint="topological_fingerprint",
fprints_hopts={
"minPath": 2,
"maxPath": 5,
"fpSize": 200,
"bitsPerHash": 4,
"useHs": 1,
"tgtDensity": 0.4,
"minSize": 64,
"numBitsPerFeature": 2,
"useHs": True,
},
sampler="random",
random_state=42,
