
score_array computes roc-auc values on discretized predictions #181

Open
pbenner opened this issue Sep 8, 2022 · 1 comment
pbenner commented Sep 8, 2022

I am not sure if the following behavior is intended:

> cat matbench-scores-bug.py

from matbench.data_ops import score_array, CLF_KEY
from sklearn.metrics import roc_auc_score

true_array = 8*[True] + 2*[False]
pred_array = 8*[0.4]  + 2*[0.2]

scores = score_array(true_array, pred_array, CLF_KEY)

print('matbench roc-auc:', scores['rocauc'])
print('    true roc-auc:', roc_auc_score(true_array, pred_array))

> python matbench-scores-bug.py 
matbench roc-auc: 0.5
    true roc-auc: 1.0
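For illustration, the discretization effect can be reproduced with sklearn alone. This sketch assumes the predictions are thresholded at 0.5 before scoring (which is what discretizing probabilities into labels would do here):

```python
from sklearn.metrics import roc_auc_score

true_array = 8 * [True] + 2 * [False]
pred_array = 8 * [0.4] + 2 * [0.2]

# Thresholding at 0.5 maps both 0.4 and 0.2 to False, erasing the
# ranking information that ROC-AUC is computed from.
labels = [p > 0.5 for p in pred_array]

# On the raw probabilities, every positive (0.4) outranks every
# negative (0.2), so the AUC is perfect.
print(roc_auc_score(true_array, pred_array))  # 1.0

# On the discretized labels, all scores are identical, so the AUC
# degenerates to chance level.
print(roc_auc_score(true_array, labels))      # 0.5
```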

The mismatch is caused by discretization of the values in pred_array before roc_auc_score is called. ROC-AUC is normally evaluated on class probabilities, not on hard labels. The following patch fixes the problem:

> cat matbench-scores-bug.patch 
--- data_ops.py 2022-08-29 09:51:34.565746826 +0200
+++ data_ops.py.new     2022-09-08 08:52:52.994181877 +0200
@@ -108,18 +108,22 @@
     for metric in metrics:
         mfunc = METRIC_MAP[metric]
 
+        true_array_ = true_array
+        pred_array_ = pred_array
+
         if metric == "rocauc":
             # Both arrays must be in probability form
             # if pred. array is given in probabilities
             if isinstance(pred_array[0], float):
-                true_array = homogenize_clf_array(true_array, to_probs=True)
+                true_array_ = homogenize_clf_array(true_array, to_probs=True)
 
         # Other clf metrics always be converted to labels
         elif metric in CLF_METRICS:
             if isinstance(pred_array[0], float):
-                pred_array = homogenize_clf_array(pred_array, to_labels=True)
+                pred_array_ = homogenize_clf_array(pred_array, to_labels=True)
+
+        computed[metric] = mfunc(true_array_, pred_array_)
 
-        computed[metric] = mfunc(true_array, pred_array)
     return computed

Matbench version: 70c79fb


ml-evs commented Sep 10, 2022

Related to #40 (where probabilities were introduced) and #137 (which I think is reporting the same underlying issue as this one).
