# Certifiably Optimal RulE ListS (CORELS)
CORELS is a custom discrete optimization technique for building rule lists over a categorical feature space. Our algorithm provides the optimal solution, with a certificate of optimality. By leveraging algorithmic bounds, efficient data structures, and computational reuse, we achieve several orders of magnitude speedup in time and a massive reduction of memory consumption. Our approach produces optimal rule lists on practical problems in seconds. This framework is a novel alternative to CART and other decision tree methods.
- Elaine Angelino, Nicholas Larus-Stone, Daniel Alabi, Margo Seltzer, and Cynthia Rudin. Learning Certifiably Optimal Rule Lists for Categorical Data. arXiv:1704.01701, 2017.
- Nicholas Larus-Stone. Learning Certifiably Optimal Rule Lists: A Case For Discrete Optimization in the 21st Century. Senior thesis, 2017.
CORELS is a custom branch-and-bound algorithm for optimizing rule lists.
A web UI can be found at: https://corels.eecs.harvard.edu/
- C/C++ dependencies
- Sample command
- Usage
- Data format
- Arguments
- Example dataset
- Example rule list
- Optimization algorithm and objective
- Data structures
- Related work
## C/C++ dependencies

- gmp (GNU Multiple Precision Arithmetic Library)
- mpfr (GNU MPFR Library for multiple-precision floating-point computations; depends on gmp)
- libmpc (GNU MPC for arbitrarily high precision and correct rounding; depends on gmp and mpfr)

e.g., install libmpc via Homebrew (this will also install gmp and mpfr):

    brew install libmpc
## Sample command

Run the following from the `src/` directory:

    make
    ./corels -r 0.015 -c 2 -p 1 ../data/compas_train.out ../data/compas_train.label ../data/compas_train.minor
## Usage

    ./corels [-b] [-n max_num_nodes] [-r regularization] [-v verbosity] -c (1|2|3|4) -p (0|1|2)
             [-f logging_frequency] -a (0|1|2) [-s] [-L latex_out] data.out data.label [data.minor]
## Data format

For examples, see `compas.out` and `compas.label` in `data/`. Also see `compas.minor` (optional). Note that our data parsing is fragile and errors will occur if the format is not followed exactly.
- The input data files must be space-delimited text.
- Each line contains `N + 1` fields, where `N` is the number of observations, and ends with `\n` (including the last line).
- In each line, the last `N` fields are `0`'s and `1`'s and encode a bit vector; the first field has the format `{text-description}`, where the text between the brackets provides a description of the bit vector.
- There can be no spaces in the text descriptions; words should be separated by dashes or underscores.
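For concreteness, here is what a single valid line might look like for a hypothetical dataset with `N = 5` observations (the description and bit values are made up for illustration):

```
{hair-color:brown} 0 1 1 0 1
```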
## Arguments

`[-b]` Breadth-first search (BFS). You must specify a search policy; use exactly one of (`-b` | `-c`).
`[-n max_num_nodes]` Maximum trie (cache) size. Stop execution if the number of nodes in the trie exceeds this number. The default value corresponds to `-n 100000`.
`[-r regularization]` Regularization parameter (optional). The default value corresponds to `-r 0.01` and can be thought of as adding a penalty equivalent to misclassifying 1% of data when increasing a rule list's length by one association rule.
`[-v verbosity]` Verbosity. The default value corresponds to `-v 0`. If verbosity is at least `10`, then print machine info. If verbosity is at least `1000`, then also print input antecedents and labels.
`-c (1|2|3|4)` Best-first search policy. You must specify a search policy; use exactly one of (`-b` | `-c`). We include four different prioritization schemes:

- Use `-c 1` to prioritize by curiosity (see Section 5.1, Custom scheduling policies, of our paper).
- Use `-c 2` to prioritize by the lower bound.
- Use `-c 3` to prioritize by the objective.
- Use `-c 4` for depth-first search.
`-p (0|1|2)` Symmetry-aware map (optional).

- Use `-p 0` for no symmetry-aware map (default).
- Use `-p 1` for the permutation map.
- Use `-p 2` for the captured vector map.
`[-f logging_frequency]` Logging frequency. The default value corresponds to `-f 1000`.
`-a (0|1|2)` Exclude a specific algorithm optimization. The default value corresponds to `-a 0`.

- Use `-a 0` to include all optimizations (default).
- Use `-a 1` to exclude the minimum support bounds (see Section 3.7, Lower bounds on antecedent support, of our paper).
- Use `-a 2` to exclude the lookahead bound (see Lemma 2 in Section 3.4, Hierarchical objective lower bound).
`[-s]` Calculate the upper bound on the remaining search space size (optional). This adds a small overhead; the default behavior does not perform the calculation. With `-s`, we dynamically and incrementally calculate `floor(log10(Gamma(Rc, Q)))`, where `Gamma(Rc, Q)` is the upper bound (see Theorem 7 in Section 3.6 of our paper).
`[-L latex_out]` LaTeX output. Include this flag to generate a LaTeX representation of the output rule list.
`data.out` File path to the training data. See Data format, above.

- Encode `M` antecedents as `M` space-delimited lines of text; each line contains `N + 1` fields.
- The first field has the format `{antecedent-description}`, where the text between the brackets describes the antecedent, e.g., `{hair-color:brown}` or `{age:20-25}`.
- The remaining `N` fields indicate whether the antecedent is true or false for the `N` observations.
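For illustration, a hypothetical two-antecedent `data.out` fragment for `N = 5` observations might look like this (antecedent names borrowed from the example dataset below; bit values are made up):

```
{sex:Female} 1 0 0 1 0
{age:18-20} 0 1 1 0 0
```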
`data.label` File path to the training labels. See Data format, above.

- Encode labels as two space-delimited lines of text.
- The first line starts with a description of the negative class label, e.g., `{label=0}`; the remaining `N` fields indicate which of the `N` observations have this label.
- The second line contains analogous (redundant) information for the positive class label.
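For example, a hypothetical `data.label` file for `N = 5` observations could look like this (made-up bits; note that the two lines are complementary):

```
{label=0} 1 0 0 1 1
{label=1} 0 1 1 0 0
```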
`data.minor` File path to a bit vector to support use of the equivalent points bound (optional; see Theorem 20 in Section 3.14 of our paper).
## Example dataset

The files in `data/` were generated from ProPublica's COMPAS recidivism dataset, specifically, the file `compas-scores-two-years.csv`. We include one training set (`N = 6489`) and one test set (`N = 721`) from a 10-fold cross-validation experiment.
There are five training data files:

- `compas_train.csv`: 7 categorical and integer-valued features and binary class labels extracted from `compas-scores-two-years.csv`: sex (Male, Female), age, juvenile-felonies, juvenile-misdemeanors, juvenile-crimes, priors, current-charge-degree (Misdemeanor, Felony)
- `compas_train-binary.csv`: 14 equivalent binary features and class labels (included for convenience / use with other algorithms): sex:Male, age:18-20, age:21-22, age:23-25, age:26-45, age:>45, juvenile-felonies:>0, juvenile-misdemeanors:>0, juvenile-crimes:>0, priors:2-3, priors:=0, priors:=1, priors:>3, current-charge-degree:Misdemeanor
- `compas_train.out`: 155 mined antecedents (binary features and length-2 conjunctions of binary features with support in `[0.005, 0.995]`), e.g., `{sex:Female}`, `{age:18-20}`, `{sex:Male,current-charge-degree:Misdemeanor}`, `{age:26-45,juvenile-felonies:>0}`
- `compas_train.labels`: class labels `{recidivate-within-two-years:No}`, `{recidivate-within-two-years:Yes}`
- `compas_train.minor`: bit vector used to support the equivalent points bound
The corresponding test data files are `compas_test.csv`, `compas_test-binary.csv`, `compas_test.out`, and `compas_test.labels`.
## Example rule list

    if (age = 23-25) and (priors = 2-3) then predict yes
    else if (age = 18-20) then predict yes
    else if (sex = male) and (age = 21-22) then predict yes
    else if (priors > 3) then predict yes
    else predict no

This rule list predicts two-year recidivism for the ProPublica two-year recidivism dataset and was found by CORELS. Its prefix corresponds to the first four rules, and its default rule corresponds to the last rule.
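To make the prefix and default-rule semantics concrete, here is a minimal C++ sketch of how this rule list classifies a single observation. It is not part of the CORELS codebase, and the `Observation` struct is a hypothetical encoding of the relevant binary features:

```cpp
#include <cstdio>

// Hypothetical per-observation binary features, matching the antecedents above.
struct Observation {
    bool age_18_20, age_21_22, age_23_25;
    bool priors_2_3, priors_gt_3;
    bool sex_male;
};

// Walk the prefix in order; the first rule whose antecedent is true fires.
// If no prefix rule fires, the default rule makes the prediction.
bool predict_recidivism(const Observation& o) {
    if (o.age_23_25 && o.priors_2_3) return true;  // rule 1
    if (o.age_18_20) return true;                  // rule 2
    if (o.sex_male && o.age_21_22) return true;    // rule 3
    if (o.priors_gt_3) return true;                // rule 4
    return false;                                  // default rule
}

int main() {
    Observation o = {false, true, false, false, false, true};  // male, age 21-22
    std::printf("predict yes? %d\n", predict_recidivism(o));   // prints 1 (rule 3 fires)
    return 0;
}
```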
## Optimization algorithm and objective

CORELS is a custom branch-and-bound algorithm that minimizes the following objective, defined for a rule list `d`, training data `x`, and corresponding labels `y`:

    R(d, x, y) = misc(d, x, y) + c * length(d)

This objective is a regularized empirical risk that consists of a loss and a regularization term that penalizes longer rule lists.

- The loss `misc(d, x, y)` measures `d`'s misclassification error.
- The regularization parameter `c >= 0` is a small constant.
- `length(d)` is the number of rules in `d`'s prefix.
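For example (with made-up numbers), if `c = 0.01` and a rule list with a four-rule prefix misclassifies 25% of the training data, then `R = 0.25 + 0.01 * 4 = 0.29`.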
Let `p` be `d`'s prefix. The following objective lower bound drives our branch-and-bound procedure:

    b(p, x, y) = misc(p, x, y) + c * length(p) <= R(d, x, y)

where `misc(p, x, y)` is the prefix misclassification error (due to mistakes made by the prefix, but not the default rule).
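As a rough sketch of these two quantities (simplified; the actual implementation operates on the bit vectors defined in `src/rule.h`), the computations look like this:

```cpp
#include <cstddef>

// Counts of training mistakes for a rule list with a given prefix.
struct Mistakes {
    std::size_t prefix_mistakes;   // mistakes made by the prefix rules
    std::size_t default_mistakes;  // mistakes made by the default rule
};

// R(d, x, y) = misc(d, x, y) + c * length(d)
double objective(const Mistakes& m, std::size_t prefix_len,
                 std::size_t n_samples, double c) {
    double misc = static_cast<double>(m.prefix_mistakes + m.default_mistakes)
                  / static_cast<double>(n_samples);
    return misc + c * static_cast<double>(prefix_len);
}

// b(p, x, y) = misc(p, x, y) + c * length(p) <= R(d, x, y)
// The bound drops the default rule's mistakes, so it never exceeds the
// objective of any rule list whose prefix starts with p.
double lower_bound(const Mistakes& m, std::size_t prefix_len,
                   std::size_t n_samples, double c) {
    double prefix_misc = static_cast<double>(m.prefix_mistakes)
                         / static_cast<double>(n_samples);
    return prefix_misc + c * static_cast<double>(prefix_len);
}
```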
## Data structures

- A trie (prefix tree) functions as a cache and supports incremental computation.
- A priority queue supports multiple best-first search policies, as well as both breadth-first and depth-first search.
- A map supports symmetry-aware pruning.
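For intuition, here is a simplified C++ sketch (not the repository's actual implementation) of how a priority queue can drive best-first search over prefixes, pruning with the lower bound:

```cpp
#include <queue>
#include <vector>

// A search node: a prefix of antecedent indices plus its lower bound b(p, x, y).
struct Node {
    std::vector<int> prefix;
    double lower_bound;
};

// Pop the prefix with the smallest lower bound first (-c 2 style priority).
struct ByLowerBound {
    bool operator()(const Node& a, const Node& b) const {
        return a.lower_bound > b.lower_bound;  // inverted compare -> min-heap
    }
};

using SearchQueue = std::priority_queue<Node, std::vector<Node>, ByLowerBound>;

// Skeleton of the branch-and-bound loop: prune any prefix whose lower bound
// already meets or exceeds the best objective found so far.
void search(SearchQueue& q, double& best_objective) {
    while (!q.empty()) {
        Node node = q.top();
        q.pop();
        if (node.lower_bound >= best_objective) continue;  // prune this subtree
        // ... extend node.prefix with each unused antecedent, evaluate each
        // child's objective and lower bound, update best_objective, and push
        // promising children onto the queue ...
    }
}
```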
## Related work

CORELS builds directly on:

- Hongyu Yang, Cynthia Rudin, and Margo Seltzer. Scalable Bayesian Rule Lists. arXiv:1602.08610, 2016.
- Benjamin Letham, Cynthia Rudin, Tyler McCormick, and David Madigan. Interpretable Classifiers Using Rules and Bayesian Analysis: Building a Better Stroke Prediction Model. The Annals of Applied Statistics, 2015, Vol. 9, No. 3, 1350–1371.

In particular, CORELS uses a library by Yang et al. for efficiently representing and operating on bit vectors; see the files `src/rule.h` and `src/rule.c`.