Added MIXMOD Thresholder

KulikDM · Jan 29, 2024 · ae5886a
1 parent 27352ac

Showing 16 changed files with 785 additions and 44 deletions.
diff --git a/CHANGES.txt b/CHANGES.txt
@@ -73,3 +73,4 @@ v<0.3.4>, <09/08/2023> -- Updated GNBC model for meta
 v<0.3.5>, <10/29/2023> -- Upgraded RANK and docs
 v<0.3.5>, <11/02/2023> -- Added CONF for OD confidence
 v<0.3.5>, <11/05/2023> -- Added CONF docs and updated FAQ
+v<0.3.6>, <01/29/2024> -- Added MIXMOD thresholder
diff --git a/README.rst b/README.rst
@@ -28,7 +28,7 @@
    :target: https://codeclimate.com/github/KulikDM/pythresh/maintainability
    :alt: Maintainability
 
-.. image:: https://img.shields.io/github/stars/KulikDM/pythresh.svg?logo=github&logoColor=white
+.. image:: https://img.shields.io/github/stars/KulikDM/pythresh.svg?logo=github&logoColor=white&style=flat
    :target: https://github.com/KulikDM/pythresh/stargazers
    :alt: GitHub stars
 
@@ -194,6 +194,11 @@ Key Attributes of threshold:
 -  **dscores_**: 1D array of the TruncatedSVD decomposed decision scores
    if multiple outlier detector score sets are passed
 
+-  **mixture_**: fitted mixture model object selected for
+   thresholding. Only applies to MIXMOD. Attributes include:
+   components, weights, and params. Methods include: fit,
+   loglikelihood, pdf, and posterior.
+
 ************************
  External Feature Cases
 ************************
@@ -256,6 +261,8 @@ Unsupervised Anomaly Detection. <https://arxiv.org/abs/2210.10487>`_
 +-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
 | META      | Meta-model Trained Classifier             | [#meta1]_          | `pythresh.thresholds.meta module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.meta>`_                |
 +-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
+| MIXMOD    | Normal & Non-Normal Mixture Models        | [#mixmod1]_        | `pythresh.thresholds.mixmod module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.mixmod>`_            |
++-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
 | MOLL      | Friedrichs' Mollifier                     | [#moll1]_          | `pythresh.thresholds.moll module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.moll>`_                |
 |           |                                           | [#moll2]_          |                                                                                                                                                        |
 +-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
@@ -287,11 +294,12 @@ is made available below
 
 Additional `benchmarking
 <https://pythresh.readthedocs.io/en/latest/benchmark.html>`_ has been
-done on all the thresholders and it was found that the ``META``
+done on all the thresholders and it was found that the ``MIXMOD``
 thresholder performed best while the ``CLF`` thresholder provided the
 smallest uncertainty about its mean and is the most robust (best least
 accurate prediction). However, for interpretability and general
-performance the ``FILTER`` thresholder is a good fit.
+performance the ``MIXMOD``, ``FILTER``, and ``META`` thresholders are
+good fits.
 
 Further utilities are available for assisting in the selection of the
 most optimal outlier detection and thresholding methods `ranking
@@ -458,6 +466,11 @@ the threshold types available in PyThresh.
    `Automating Outlier Detection via Meta-Learning
    <https://arxiv.org/abs/2009.10606>`_
 
+.. [#mixmod1]
+
+   `Application of Mixture Models to Threshold Anomaly Scores
+   <https://studenttheses.uu.nl/bitstream/handle/20.500.12932/45591/Masterthesis%20%284%29.pdf?sequence=1&isAllowed=y>`_
+
 .. [#moll1]
 
    `Riemannian center of mass and mollifier smoothing

diff --git a/docs/api_cc.rst b/docs/api_cc.rst
@@ -2,23 +2,29 @@
  API CheatSheet
 ################
 
-The following APIs are applicable for all detector models for ease of use.
+The following APIs are applicable for all detector models for ease of
+use.
 
--  :func:`pythresh.thresholders.base.BaseDetector.eval`: evaluate a single
-   outlier or multiple outlier detection likelihood score sets
+-  :func:`pythresh.thresholders.base.BaseDetector.eval`: evaluate a
+   single outlier or multiple outlier detection likelihood score sets
 
 Key Attributes of a fitted model:
 
 -  :attr:`pythresh.thresholds.base.BaseThresholder.thresh_`: threshold
    value from scores normalize between 0 and 1
 
 -  :attr:`pythresh.thresholders.base.BaseDetector.confidence_interval_`:
-   Return the lower and upper confidence interval of the contamination level.
-   Only applies to the COMB thresholder
+   Return the lower and upper confidence interval of the contamination
+   level. Only applies to the COMB thresholder
 
--  :attr:`pythresh.thresholders.base.BaseDetector.dscores_`: 1D array of the
-   TruncatedSVD decomposed decision scores if multiple outlier detector score
-   sets are passed
+-  :attr:`pythresh.thresholders.base.BaseDetector.dscores_`: 1D array of
+   the TruncatedSVD decomposed decision scores if multiple outlier
+   detector score sets are passed
+
+-  :attr:`pythresh.thresholds.mixmod.MIXMOD.mixture_`: fitted mixture
+   model object selected for thresholding. Only applies to MIXMOD.
+   Attributes include: components, weights, and params. Methods
+   include: fit, loglikelihood, pdf, and posterior.
 
 See base class definition below:
 

diff --git a/docs/benchmark.rst b/docs/benchmark.rst
@@ -91,11 +91,11 @@ Finally, a baseline was also calculated if outliers were selected at
 random (`Random`). This was done by setting :math:`MCC_{\rm{norm}} = 1`.
 
 Overall, a significant amount of thresholders performed better than
-selecting a random contamination level. The ``META`` thresholder
+selecting a random contamination level. The ``MIXMOD`` thresholder
 performed best while the ``CLF`` thresholder provided the smallest
 uncertainty about its mean and is the most robust (best least accurate
 prediction). However, for interpretability and general performance the
-``FILTER`` thresholder is a good fit.
+``MIXMOD``, ``FILTER``, and ``META`` thresholders are good fits.
 
 .. figure:: figs/Benchmark1.png
    :alt: Benchmark defaults
@@ -451,8 +451,8 @@ datapoints, 1s for 10000 datapoints, 100s for 100000 datapoints, and
 about 2.5 hours for 1 million datapoints. If time is a factor, suggested
 thresholders with reasonable accuracy are: FILTER with 10s, OCSVM with
 0.1s, and MTT with 100s for one million datapoints. **Note** that these
-benchmarks were done using an i5 12th gen processor and results may scale
-slightly differently depending on the hardware used.
+benchmarks were done using an i5 12th gen processor and results may
+scale slightly differently depending on the hardware used.
 
 +---------------+--------------------+------------------------+
 | Method        | Complexity         | Big-O Notation         |

diff --git a/docs/figs/All.png b/docs/figs/All.png
diff --git a/docs/figs/Benchmark1.png b/docs/figs/Benchmark1.png
diff --git a/docs/figs/Multi1.png b/docs/figs/Multi1.png
diff --git a/docs/figs/Multi2.png b/docs/figs/Multi2.png
diff --git a/docs/index.rst b/docs/index.rst
@@ -28,7 +28,7 @@
    :target: https://codeclimate.com/github/KulikDM/pythresh/maintainability
    :alt: Maintainability
 
-.. image:: https://img.shields.io/github/stars/KulikDM/pythresh.svg?logo=github&logoColor=white
+.. image:: https://img.shields.io/github/stars/KulikDM/pythresh.svg?logo=github&logoColor=white&style=flat
    :target: https://github.com/KulikDM/pythresh/stargazers
    :alt: GitHub stars
 
@@ -95,10 +95,11 @@ complex mathematical methods that involve graph theory and topology.
 **************************
 
 Benchmarking has been done on all the thresholders and it was found
-that the ``META`` thresholder performed best while the ``CLF`` thresholder
-provided the smallest uncertainty about its mean and is the most robust
-(best least accurate prediction). However, for interpretability and
-general performance the ``FILTER`` thresholder is a good fit.
+that the ``MIXMOD`` thresholder performed best while the ``CLF``
+thresholder provided the smallest uncertainty about its mean and is
+the most robust (best least accurate prediction). However, for
+interpretability and general performance the ``MIXMOD``, ``FILTER``,
+and ``META`` thresholders are good fits.
 
 Further utilities are available for assisting in the selection of the
 most optimal outlier detection and thresholding methods `ranking
@@ -173,6 +174,8 @@ Unsupervised Anomaly Detection <https://arxiv.org/abs/2210.10487>`_
 +-----------+----------------------------------------------------------------+-----------------------------------+
 | META      | Metamodel Trained Classifier                                   | :cite:`zhao2022meta`              |
 +-----------+----------------------------------------------------------------+-----------------------------------+
+| MIXMOD    | Normal & Non-Normal Mixture Models                             | :cite:`veluw2023mixmod`           |
++-----------+----------------------------------------------------------------+-----------------------------------+
 | MOLL      | Friedrichs' Mollifier                                          | :cite:`keyzer1997moll`            |
 +-----------+----------------------------------------------------------------+-----------------------------------+
 | MTT       | Modified Thompson Tau Test                                     | :cite:`rengasamy2020mtt`          |
@@ -227,6 +230,11 @@ Key Attributes of a threshold:
    TruncatedSVD decomposed decision scores if multiple outlier detector score
    sets are passed
 
+-  :attr:`pythresh.thresholds.mixmod.MIXMOD.mixture_`: fitted mixture model object
+   selected for thresholding. Only applies to MIXMOD. Attributes include:
+   components, weights, and params. Methods include: fit, loglikelihood, pdf,
+   and posterior.
+
 ----
 
 .. toctree::

diff --git a/docs/pythresh.thresholds.rst b/docs/pythresh.thresholds.rst
@@ -234,6 +234,17 @@
    :show-inheritance:
    :inherited-members:
 
+***********************************
+ pythresh.thresholds.mixmod module
+***********************************
+
+.. automodule:: pythresh.thresholds.mixmod
+   :members:
+   :exclude-members: MixtureModel, MLES
+   :undoc-members:
+   :show-inheritance:
+   :inherited-members:
+
 *********************************
  pythresh.thresholds.moll module
 *********************************

diff --git a/docs/zreferences.bib b/docs/zreferences.bib
@@ -243,6 +243,14 @@ @misc{zhao2022meta
   copyright = {arXiv.org perpetual, non-exclusive license}
 }
 
+@mastersthesis{veluw2023mixmod,
+  author = {van Veluw, Willem},
+  title = {Application of Mixture Models to Threshold Anomaly Scores},
+  school = {Utrecht University},
+  year = {2023},
+  url = {https://studenttheses.uu.nl/handle/20.500.12932/45591}
+}
+
 @article{keyzer1997moll,
   author = {Keyzer, Michiel and Sonneveld, Ben},
   title = {Using the mollifier method to characterize datasets and models: The case of the Universal Soil Loss Equation},

diff --git a/examples/mixmod_example.py b/examples/mixmod_example.py
@@ -0,0 +1,57 @@
+"""Example of using mixture models for outlier thresholding."""
+# Author: D Kulik
+# License: BSD 2 clause
+
+
+import os
+import sys
+
+from pyod.models.knn import KNN
+from pyod.utils.data import evaluate_print, generate_data
+from pyod.utils.example import visualize
+
+from pythresh.thresholds.mixmod import MIXMOD
+
+# temporary solution for relative imports in case pythresh is not installed
+# if pythresh is installed, no need to use the following line
+sys.path.append(
+    os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
+
+
+if __name__ == '__main__':
+    contamination = 0.1  # proportion of outliers
+    n_train = 200  # number of training points
+    n_test = 100  # number of testing points
+
+    # Generate sample data
+    X_train, X_test, y_train, y_test =\
+        generate_data(n_train=n_train,
+                      n_test=n_test,
+                      n_features=2,
+                      contamination=contamination,
+                      random_state=42)
+
+    # train KNN detector
+    clf_name = 'KNN'
+    clf = KNN()
+    clf.fit(X_train)
+    thres = MIXMOD()
+
+    # get the prediction labels and outlier scores of the training data
+    y_train_scores = clf.decision_scores_  # raw outlier scores
+    # binary labels (0: inliers, 1: outliers)
+    y_train_pred = thres.eval(y_train_scores)
+
+    # get the prediction on the test data
+    y_test_scores = clf.decision_function(X_test)  # outlier scores
+    y_test_pred = thres.eval(y_test_scores)  # outlier labels (0 or 1)
+
+    # evaluate and print the results
+    print('\nOn Training Data:')
+    evaluate_print(clf_name, y_train, y_train_scores)
+    print('\nOn Test Data:')
+    evaluate_print(clf_name, y_test, y_test_scores)
+
+    # visualize the results
+    visualize(clf_name, X_train, X_test, y_train, y_test, y_train_pred,
+              y_test_pred, show_figure=True, save_figure=False)
diff --git a/imgs/All.png b/imgs/All.png
diff --git a/notebooks/Compare All Thresholders.ipynb b/notebooks/Compare All Thresholders.ipynb
diff --git a/pythresh/test/test_mixmod.py b/pythresh/test/test_mixmod.py
@@ -0,0 +1,91 @@
+import sys
+import unittest
+from itertools import product
+from os.path import dirname as up
+
+# noinspection
+import numpy as np
+from numpy.testing import assert_equal
+from pyod.models.iforest import IForest
+from pyod.models.knn import KNN
+from pyod.models.pca import PCA
+from pyod.utils.data import generate_data
+
+from pythresh.thresholds.mixmod import MIXMOD
+
+# temporary solution for relative imports in case pythresh is not installed
+# if pythresh is installed, no need to use the following line
+
+path = up(up(up(__file__)))
+sys.path.append(path)
+
+
+class TestMIXMOD(unittest.TestCase):
+    def setUp(self):
+        self.n_train = 200
+        self.n_test = 100
+        self.contamination = 0.1
+        self.X_train, self.X_test, self.y_train, self.y_test = generate_data(
+            n_train=self.n_train, n_test=self.n_test,
+            contamination=self.contamination, random_state=42)
+
+        clf = KNN()
+        clf.fit(self.X_train)
+
+        scores = clf.decision_scores_
+
+        clfs = [KNN(), PCA(), IForest()]
+
+        multiple_scores = [
+            clf.fit(self.X_train).decision_scores_ for clf in clfs]
+        multiple_scores = np.vstack(multiple_scores).T
+
+        self.all_scores = [scores, multiple_scores]
+
+        self.methods = ['mean', 'ks']
+
+        self.tol = [1e-3, 1e-5, 1e-8, 1e-12]
+
+        self.max_iter = [50, 100, 250, 500]
+
+    def test_prediction_labels(self):
+
+        params = product(self.all_scores, self.methods,
+                         self.tol, self.max_iter)
+
+        for scores, method, tol, max_iter in params:
+
+            self.thres = MIXMOD(method=method, tol=tol, max_iter=max_iter)
+            pred_labels = self.thres.eval(scores)
+
+            assert (self.thres.thresh_ is not None)
+            assert (self.thres.dscores_ is not None)
+            assert (self.thres.mixture_ is not None)
+
+            assert (self.thres.dscores_.min() == 0)
+            assert (self.thres.dscores_.max() == 1)
+
+            assert (self.thres.mixture_.components is not None)
+            assert (self.thres.mixture_.weights is not None)
+            assert (self.thres.mixture_.params is not None)
+
+            nscores = self.thres.dscores_ + 1
+
+            assert (callable(self.thres.mixture_.loglikelihood) and
+                    (_ := self.thres.mixture_.loglikelihood(nscores))
+                    is not None)
+
+            assert (callable(self.thres.mixture_.pdf) and
+                    (_ := self.thres.mixture_.pdf(nscores))
+                    is not None)
+
+            assert (callable(self.thres.mixture_.posterior) and
+                    (_ := self.thres.mixture_.posterior(nscores))
+                    is not None)
+
+            assert_equal(pred_labels.shape, self.y_train.shape)
+
+            if (not np.all(pred_labels == 0)) & (not np.all(pred_labels == 1)):
+
+                assert (pred_labels.min() == 0)
+                assert (pred_labels.max() == 1)
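
Note on the mixture_ interface exercised above: the docs in this commit list the attributes components, weights, and params and the functions fit, loglikelihood, pdf, and posterior, and the test sweeps tol and max_iter. The sketch below is not PyThresh's implementation. It is a minimal, standard-library illustration of what such an interface could look like for a two-component normal mixture fitted with EM, assuming that scores are labelled as outliers when the posterior of the higher-mean component exceeds 0.5. The names TwoNormalMixture and label_outliers and the 0.5 cut-off are hypothetical.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


class TwoNormalMixture:
    """Hypothetical two-component normal mixture fitted with EM."""

    def __init__(self, tol=1e-5, max_iter=250):
        self.tol = tol
        self.max_iter = max_iter
        self.weights = [0.5, 0.5]
        self.params = [(0.0, 1.0), (1.0, 1.0)]  # (mean, std) per component

    def pdf(self, x):
        """Mixture density: weighted sum of the component densities."""
        return sum(w * normal_pdf(x, mu, sd)
                   for w, (mu, sd) in zip(self.weights, self.params))

    def posterior(self, x):
        """Probability that x was generated by each component."""
        parts = [w * normal_pdf(x, mu, sd)
                 for w, (mu, sd) in zip(self.weights, self.params)]
        total = sum(parts) or 1e-300  # guard against underflow to zero
        return [p / total for p in parts]

    def loglikelihood(self, data):
        """Total log-likelihood of data under the current parameters."""
        return sum(math.log(max(self.pdf(x), 1e-300)) for x in data)

    def fit(self, data):
        lo, hi = min(data), max(data)
        spread = (hi - lo) or 1.0
        # Crude initialisation: one component towards each end of the range.
        self.params = [(lo + 0.25 * spread, spread / 4),
                       (lo + 0.75 * spread, spread / 4)]
        self.weights = [0.5, 0.5]
        prev = self.loglikelihood(data)
        for _ in range(self.max_iter):
            # E-step: responsibilities of each component for each point.
            resp = [self.posterior(x) for x in data]
            # M-step: re-estimate weight, mean, and std of each component.
            for k in range(2):
                nk = max(sum(r[k] for r in resp), 1e-12)
                mu = sum(r[k] * x for r, x in zip(resp, data)) / nk
                var = sum(r[k] * (x - mu) ** 2
                          for r, x in zip(resp, data)) / nk
                self.weights[k] = nk / len(data)
                self.params[k] = (mu, max(math.sqrt(var), 1e-6))
            cur = self.loglikelihood(data)
            if abs(cur - prev) < self.tol:  # converged within tolerance
                break
            prev = cur
        return self


def label_outliers(scores, mixture):
    """Label a score 1 when the higher-mean component is more likely."""
    high = max(range(2), key=lambda k: mixture.params[k][0])
    return [int(mixture.posterior(s)[high] > 0.5) for s in scores]


# Synthetic scores: 180 low "inlier" scores and 20 high "outlier" scores.
random.seed(42)
scores = ([random.gauss(0.2, 0.05) for _ in range(180)]
          + [random.gauss(0.8, 0.05) for _ in range(20)])
mixture = TwoNormalMixture(tol=1e-8, max_iter=250).fit(scores)
labels = label_outliers(scores, mixture)
```

A tighter tol or a larger max_iter only changes how long EM iterates before stopping, which is presumably why the test above sweeps both and asserts only on shapes and value ranges rather than on specific labels.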