Added MIXMOD Thresholder
KulikDM committed Jan 29, 2024
1 parent 27352ac commit ae5886a
Showing 16 changed files with 785 additions and 44 deletions.
1 change: 1 addition & 0 deletions CHANGES.txt
@@ -73,3 +73,4 @@ v<0.3.4>, <09/08/2023> -- Updated GNBC model for meta
v<0.3.5>, <10/29/2023> -- Upgraded RANK and docs
v<0.3.5>, <11/02/2023> -- Added CONF for OD confidence
v<0.3.5>, <11/05/2023> -- Added CONF docs and updated FAQ
v<0.3.6>, <01/29/2024> -- Added mixmod thresholder
19 changes: 16 additions & 3 deletions README.rst
@@ -28,7 +28,7 @@
:target: https://codeclimate.com/github/KulikDM/pythresh/maintainability
:alt: Maintainability

-.. image:: https://img.shields.io/github/stars/KulikDM/pythresh.svg?logo=github&logoColor=white
+.. image:: https://img.shields.io/github/stars/KulikDM/pythresh.svg?logo=github&logoColor=white&style=flat
:target: https://github.com/KulikDM/pythresh/stargazers
:alt: GitHub stars

@@ -194,6 +194,11 @@ Key Attributes of threshold:
- **dscores_**: 1D array of the TruncatedSVD decomposed decision scores
if multiple outlier detector score sets are passed

- **mixture_**: fitted mixture model class of the selected model used
for thresholding. Only applies to MIXMOD. Attributes include:
components, weights, params. Functions include: fit, loglikelihood,
pdf, and posterior.
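
To make the ``mixture_`` attribute list more concrete, the sketch below fits a toy two-component Gaussian mixture to 1D scores with EM and exposes members named after those listed above (``weights``, ``params``, ``fit``, ``loglikelihood``, ``pdf``, ``posterior``). It is an illustrative assumption, not PyThresh's implementation: the class name ``TwoGaussianMixture``, the initialisation, the synthetic scores, and the 0.5 posterior cut-off are all hypothetical, and MIXMOD's actual distributions and selection logic may differ.

```python
import math
import random


class TwoGaussianMixture:
    """Hypothetical 1D two-component Gaussian mixture fitted with EM."""

    def __init__(self, tol=1e-5, max_iter=250):
        self.tol = tol
        self.max_iter = max_iter
        self.weights = [0.5, 0.5]
        self.params = [(0.0, 1.0), (1.0, 1.0)]  # (mean, std) per component

    @staticmethod
    def _normal_pdf(x, mean, std):
        z = (x - mean) / std
        return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

    def pdf(self, x):
        """Mixture density at a single score x."""
        return sum(w * self._normal_pdf(x, m, s)
                   for w, (m, s) in zip(self.weights, self.params))

    def posterior(self, x):
        """Probability that x belongs to each component."""
        joint = [w * self._normal_pdf(x, m, s)
                 for w, (m, s) in zip(self.weights, self.params)]
        total = sum(joint) or 1e-300  # guard against division by zero
        return [j / total for j in joint]

    def loglikelihood(self, data):
        """Total log-likelihood of an iterable of scores."""
        return sum(math.log(max(self.pdf(x), 1e-300)) for x in data)

    def fit(self, data):
        lo, hi = min(data), max(data)
        spread = (hi - lo) or 1.0
        # Assumed initialisation: means at the quarter points of the range
        self.params = [(lo + 0.25 * spread, spread / 4),
                       (lo + 0.75 * spread, spread / 4)]
        self.weights = [0.5, 0.5]
        prev = -math.inf
        for _ in range(self.max_iter):
            resp = [self.posterior(x) for x in data]  # E-step
            for k in range(2):  # M-step
                nk = max(sum(r[k] for r in resp), 1e-12)
                mean = sum(r[k] * x for r, x in zip(resp, data)) / nk
                var = sum(r[k] * (x - mean) ** 2
                          for r, x in zip(resp, data)) / nk
                self.weights[k] = nk / len(data)
                self.params[k] = (mean, max(math.sqrt(var), 1e-6))
            ll = self.loglikelihood(data)
            if abs(ll - prev) < self.tol:  # converged
                break
            prev = ll
        return self


# Synthetic scores: 180 inliers near 0.2 and 20 outliers near 0.8
rng = random.Random(42)
scores = ([rng.gauss(0.2, 0.05) for _ in range(180)]
          + [rng.gauss(0.8, 0.05) for _ in range(20)])

model = TwoGaussianMixture().fit(scores)

# Treat the component with the larger mean as the outlier component
outlier_comp = max(range(2), key=lambda k: model.params[k][0])
labels = [int(model.posterior(x)[outlier_comp] > 0.5) for x in scores]
```

Labelling by posterior probability is only one way to turn a fitted mixture into a threshold; it is used here because it needs nothing beyond the members the attribute list names.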

************************
External Feature Cases
************************
@@ -256,6 +261,8 @@ Unsupervised Anomaly Detection. <https://arxiv.org/abs/2210.10487>`_
+-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| META | Meta-model Trained Classifier | [#meta1]_ | `pythresh.thresholds.meta module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.meta>`_ |
+-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| MIXMOD | Normal & Non-Normal Mixture Models | [#mixmod1]_ | `pythresh.thresholds.mixmod module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.mixmod>`_ |
+-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| MOLL | Friedrichs' Mollifier | [#moll1]_ | `pythresh.thresholds.moll module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.moll>`_ |
| | | [#moll2]_ | |
+-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
@@ -287,11 +294,12 @@ is made available below

Additional `benchmarking
<https://pythresh.readthedocs.io/en/latest/benchmark.html>`_ has been
-done on all the thresholders and it was found that the ``META``
+done on all the thresholders and it was found that the ``MIXMOD``
thresholder performed best while the ``CLF`` thresholder provided the
smallest uncertainty about its mean and is the most robust (best least
accurate prediction). However, for interpretability and general
-performance the ``FILTER`` thresholder is a good fit.
+performance the ``MIXMOD, FILTER,`` and ``META`` thresholders are good
+fits.

Further utilities are available for assisting in the selection of the
most optimal outlier detection and thresholding methods `ranking
@@ -458,6 +466,11 @@ the threshold types available in PyThresh.
`Automating Outlier Detection via Meta-Learning
<https://arxiv.org/abs/2009.10606>`_
.. [#mixmod1]
`Application of Mixture Models to Threshold Anomaly Scores
<https://studenttheses.uu.nl/bitstream/handle/20.500.12932/45591/Masterthesis%20%284%29.pdf?sequence=1&isAllowed=y>`_
.. [#moll1]
`Riemannian center of mass and mollifier smoothing
22 changes: 14 additions & 8 deletions docs/api_cc.rst
@@ -2,23 +2,29 @@
API CheatSheet
################

-The following APIs are applicable for all detector models for ease of use.
+The following APIs are applicable for all detector models for ease of
+use.

-- :func:`pythresh.thresholders.base.BaseDetector.eval`: evaluate a single
-outlier or multiple outlier detection likelihood score sets
+- :func:`pythresh.thresholders.base.BaseDetector.eval`: evaluate a
+single outlier or multiple outlier detection likelihood score sets

Key Attributes of a fitted model:

- :attr:`pythresh.thresholds.base.BaseThresholder.thresh_`: threshold
value from scores normalize between 0 and 1

- :attr:`pythresh.thresholders.base.BaseDetector.confidence_interval_`:
-Return the lower and upper confidence interval of the contamination level.
-Only applies to the COMB thresholder
+Return the lower and upper confidence interval of the contamination
+level. Only applies to the COMB thresholder

-- :attr:`pythresh.thresholders.base.BaseDetector.dscores_`: 1D array of the
-TruncatedSVD decomposed decision scores if multiple outlier detector score
-sets are passed
+- :attr:`pythresh.thresholders.base.BaseDetector.dscores_`: 1D array of
+the TruncatedSVD decomposed decision scores if multiple outlier
+detector score sets are passed

- :attr:`pythresh.thresholders.mixmod.MIXMOD.mixture_`: fitted mixture
model class of the selected model used for thresholding. Only applies
to MIXMOD. Attributes include: components, weights, params. Functions
include: fit, loglikelihood, pdf, and posterior.
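
As a rough illustration of what "threshold value from scores normalized between 0 and 1" means in the list above, the sketch below min-max scales raw outlier scores and compares them against a threshold on that scale. The helper names ``normalize`` and ``label`` and the ``>=`` comparison are assumptions made for this example, not PyThresh functions.

```python
def normalize(scores):
    """Min-max scale scores to the [0, 1] range."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores identical: map everything to 0 to avoid dividing by zero
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def label(scores, thresh):
    """Return 1 (outlier) where the normalized score reaches thresh, else 0."""
    return [int(s >= thresh) for s in normalize(scores)]


raw_scores = [1.0, 2.0, 3.0, 10.0]
labels = label(raw_scores, thresh=0.5)
```

Because the threshold lives on the normalized scale, it can be compared across detectors whose raw scores have very different ranges.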

See base class definition below:

8 changes: 4 additions & 4 deletions docs/benchmark.rst
@@ -91,11 +91,11 @@ Finally, a baseline was also calculated if outliers were selected at
random (`Random`). This was done by setting :math:`MCC_{\rm{norm}} = 1`.

Overall, a significant amount of thresholders performed better than
-selecting a random contamination level. The ``META`` thresholder
+selecting a random contamination level. The ``MIXMOD`` thresholder
performed best while the ``CLF`` thresholder provided the smallest
uncertainty about its mean and is the most robust (best least accurate
prediction). However, for interpretability and general performance the
-``FILTER`` thresholder is a good fit.
+``MIXMOD, FILTER,`` and ``META`` thresholders are good fits.

.. figure:: figs/Benchmark1.png
:alt: Benchmark defaults
@@ -451,8 +451,8 @@ datapoints, 1s for 10000 datapoints, 100s for 100000 datapoints, and
about 2.5 hours for 1 million datapoints. If time is a factor, suggested
thresholders with reasonable accuracy are: FILTER with 10s, OCSVM with
0.1s, and MTT with 100s for one million datapoints. **Note** that these
-benchmarks were done using an i5 12th gen processor and results may scale
-slightly differently depending on the hardware used.
+benchmarks were done using an i5 12th gen processor and results may
+scale slightly differently depending on the hardware used.

+---------------+--------------------+------------------------+
| Method | Complexity | Big-O Notation |
Binary file modified docs/figs/All.png
Binary file modified docs/figs/Benchmark1.png
Binary file modified docs/figs/Multi1.png
Binary file modified docs/figs/Multi2.png
18 changes: 13 additions & 5 deletions docs/index.rst
@@ -28,7 +28,7 @@
:target: https://codeclimate.com/github/KulikDM/pythresh/maintainability
:alt: Maintainability

-.. image:: https://img.shields.io/github/stars/KulikDM/pythresh.svg?logo=github&logoColor=white
+.. image:: https://img.shields.io/github/stars/KulikDM/pythresh.svg?logo=github&logoColor=white&style=flat
:target: https://github.com/KulikDM/pythresh/stargazers
:alt: GitHub stars

@@ -95,10 +95,11 @@ complex mathematical methods that involve graph theory and topology.
**************************

Benchmarking has been done on all the thresholders and it was found
-that the ``META`` thresholder performed best while the ``CLF`` thresholder
-provided the smallest uncertainty about its mean and is the most robust
-(best least accurate prediction). However, for interpretability and
-general performance the ``FILTER`` thresholder is a good fit.
+that the ``MIXMOD`` thresholder performed best while the ``CLF``
+thresholder provided the smallest uncertainty about its mean and is
+the most robust (best least accurate prediction). However, for
+interpretability and general performance the ``MIXMOD, FILTER,`` and
+``META`` thresholders are good fits.

Further utilities are available for assisting in the selection of the
most optimal outlier detection and thresholding methods `ranking
@@ -173,6 +174,8 @@ Unsupervised Anomaly Detection <https://arxiv.org/abs/2210.10487>`_
+-----------+----------------------------------------------------------------+-----------------------------------+
| META | Metamodel Trained Classifier | :cite:`zhao2022meta` |
+-----------+----------------------------------------------------------------+-----------------------------------+
| MIXMOD | Normal & Non-Normal Mixture Models | :cite:`veluw2023mixmod` |
+-----------+----------------------------------------------------------------+-----------------------------------+
| MOLL | Friedrichs' Mollifier | :cite:`keyzer1997moll` |
+-----------+----------------------------------------------------------------+-----------------------------------+
| MTT | Modified Thompson Tau Test | :cite:`rengasamy2020mtt` |
@@ -227,6 +230,11 @@ Key Attributes of a threshold:
TruncatedSVD decomposed decision scores if multiple outlier detector score
sets are passed

- :attr:`pythresh.thresholders.mixmod.MIXMOD.mixture_`: fitted mixture model class
of the selected model used for thresholding. Only applies to MIXMOD. Attributes
include: components, weights, params. Functions include: fit, loglikelihood,
pdf, and posterior.

----

.. toctree::
11 changes: 11 additions & 0 deletions docs/pythresh.thresholds.rst
@@ -234,6 +234,17 @@
:show-inheritance:
:inherited-members:

*********************************
pythresh.thresholds.mixmod module
*********************************

.. automodule:: pythresh.thresholds.mixmod
:members:
:exclude-members: MixtureModel, MLES
:undoc-members:
:show-inheritance:
:inherited-members:

*********************************
pythresh.thresholds.moll module
*********************************
8 changes: 8 additions & 0 deletions docs/zreferences.bib
@@ -243,6 +243,14 @@ @misc{zhao2022meta
copyright = {arXiv.org perpetual, non-exclusive license}
}

@mastersthesis{veluw2023mixmod,
author = {van Veluw, Willem},
title = {Application of Mixture Models to Threshold Anomaly Scores},
school = {Utrecht University},
year = {2023},
url = {https://studenttheses.uu.nl/handle/20.500.12932/45591}
}

@article{keyzer1997moll,
author = {Keyzer, Michiel and Sonneveld, Ben},
title = {Using the mollifier method to characterize datasets and models: The case of the Universal Soil Loss Equation},
57 changes: 57 additions & 0 deletions examples/mixmod_example.py
@@ -0,0 +1,57 @@
"""Example of using mixture models for outlier thresholding."""
# Author: D Kulik
# License: BSD 2 clause


import os
import sys

from pyod.models.knn import KNN
from pyod.utils.data import evaluate_print, generate_data
from pyod.utils.example import visualize

from pythresh.thresholds.mixmod import MIXMOD

# temporary solution for relative imports in case pythresh is not installed
# if pythresh is installed, no need to use the following line
sys.path.append(
    os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))


if __name__ == '__main__':
    contamination = 0.1  # percentage of outliers
    n_train = 200  # number of training points
    n_test = 100  # number of testing points

    # Generate sample data
    X_train, X_test, y_train, y_test =\
        generate_data(n_train=n_train,
                      n_test=n_test,
                      n_features=2,
                      contamination=contamination,
                      random_state=42)

    # train KNN detector
    clf_name = 'KNN'
    clf = KNN()
    clf.fit(X_train)
    thres = MIXMOD()

    # get the prediction labels and outlier scores of the training data
    y_train_scores = clf.decision_scores_  # raw outlier scores
    # binary labels (0: inliers, 1: outliers)
    y_train_pred = thres.eval(y_train_scores)

    # get the prediction on the test data
    y_test_scores = clf.decision_function(X_test)  # outlier scores
    y_test_pred = thres.eval(y_test_scores)  # outlier labels (0 or 1)

    # evaluate and print the results
    print('\nOn Training Data:')
    evaluate_print(clf_name, y_train, y_train_scores)
    print('\nOn Test Data:')
    evaluate_print(clf_name, y_test, y_test_scores)

    # visualize the results
    visualize(clf_name, X_train, X_test, y_train, y_test, y_train_pred,
              y_test_pred, show_figure=True, save_figure=False)
Binary file modified imgs/All.png
52 changes: 28 additions & 24 deletions notebooks/Compare All Thresholders.ipynb

Large diffs are not rendered by default.

91 changes: 91 additions & 0 deletions pythresh/test/test_mixmod.py
@@ -0,0 +1,91 @@
import sys
import unittest
from itertools import product
from os.path import dirname as up

# noinspection
import numpy as np
from numpy.testing import assert_equal
from pyod.models.iforest import IForest
from pyod.models.knn import KNN
from pyod.models.pca import PCA
from pyod.utils.data import generate_data

from pythresh.thresholds.mixmod import MIXMOD

# temporary solution for relative imports in case pythresh is not installed
# if pythresh is installed, no need to use the following line

path = up(up(up(__file__)))
sys.path.append(path)


class TestMIXMOD(unittest.TestCase):
    def setUp(self):
        self.n_train = 200
        self.n_test = 100
        self.contamination = 0.1
        self.X_train, self.X_test, self.y_train, self.y_test = generate_data(
            n_train=self.n_train, n_test=self.n_test,
            contamination=self.contamination, random_state=42)

        clf = KNN()
        clf.fit(self.X_train)

        scores = clf.decision_scores_

        clfs = [KNN(), PCA(), IForest()]

        multiple_scores = [
            clf.fit(self.X_train).decision_scores_ for clf in clfs]
        multiple_scores = np.vstack(multiple_scores).T

        self.all_scores = [scores, multiple_scores]

        self.methods = ['mean', 'ks']

        self.tol = [1e-3, 1e-5, 1e-8, 1e-12]

        self.max_iter = [50, 100, 250, 500]

    def test_prediction_labels(self):

        params = product(self.all_scores, self.methods,
                         self.tol, self.max_iter)

        for scores, method, tol, max_iter in params:

            self.thres = MIXMOD(method=method, tol=tol, max_iter=max_iter)
            pred_labels = self.thres.eval(scores)

            assert (self.thres.thresh_ is not None)
            assert (self.thres.dscores_ is not None)
            assert (self.thres.mixture_ is not None)

            assert (self.thres.dscores_.min() == 0)
            assert (self.thres.dscores_.max() == 1)

            assert (self.thres.mixture_.components is not None)
            assert (self.thres.mixture_.weights is not None)
            assert (self.thres.mixture_.params is not None)

            nscores = self.thres.dscores_ + 1

            assert (callable(self.thres.mixture_.loglikelihood) and
                    (_ := self.thres.mixture_.loglikelihood(nscores))
                    is not None)

            assert (callable(self.thres.mixture_.pdf) and
                    (_ := self.thres.mixture_.pdf(nscores))
                    is not None)

            assert (callable(self.thres.mixture_.posterior) and
                    (_ := self.thres.mixture_.posterior(nscores))
                    is not None)

            assert_equal(pred_labels.shape, self.y_train.shape)

            if (not np.all(pred_labels == 0)) and (not np.all(pred_labels == 1)):

                assert (pred_labels.min() == 0)
                assert (pred_labels.max() == 1)