Repository of the tool DAT, presented in the following paper:
@article{martinez2023dat,
title = {Hyperparameter Optimization for AST Differencing},
journal = {IEEE Transactions on Software Engineering},
year = {2023},
doi = {10.1109/TSE.2023.3315935},
author = {Matias Martinez and Jean-Rémy Falleri and Martin Monperrus},
url = {http://arxiv.org/pdf/2011.10268},
}
The goal of DAT is to find a new configuration, i.e., a particular value for each hyper-parameter of an AST differencing algorithm.
DAT provides two types of optimization: global and local.
In global optimization, the search for the best hyper-parameters is done globally, i.e., over a set of file pairs.
DAT finds the hyper-parameter values that produce the best edit scripts according to a fitness function.
The input of this optimization mode is a file containing a list of file pairs:
path to file A1, path to file A2
path to file B1, path to file B2
path to file C1, path to file C2
path to file D1, path to file D2
...
In this mode, for each configuration C of the diff algorithm, DAT computes the fitness function on each file pair, then applies an aggregation function (e.g., mean, median) to obtain the performance of C across all the data.
Finally, it selects a configuration according to an objective (e.g., min, max).
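The following sketch illustrates this global selection logic in plain Java. It is not DAT's implementation: the computeFitness stub and the string-based representation of configurations and file pairs are placeholders introduced only for illustration.

```java
import java.util.List;

public class GlobalSelectionSketch {

    // Placeholder: in DAT, the fitness is obtained by running the diff
    // algorithm under configuration c on the file pair and scoring the
    // resulting edit script.
    static double computeFitness(String configuration, String[] filePair) {
        return 0.0; // illustrative stub
    }

    // For each configuration: evaluate the fitness on every file pair,
    // aggregate the values (here: mean), and keep the configuration that
    // optimizes the aggregated value (here: objective min).
    static String selectBestGlobal(List<String> configurations, List<String[]> filePairs) {
        String best = null;
        double bestScore = Double.POSITIVE_INFINITY;
        for (String c : configurations) {
            double sum = 0;
            for (String[] pair : filePairs)
                sum += computeFitness(c, pair);
            double mean = sum / filePairs.size(); // aggregation function
            if (mean < bestScore) {               // objective: min
                bestScore = mean;
                best = c;
            }
        }
        return best;
    }
}
```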
The local optimization is done on a single file pair.
DAT searches for the hyper-parameter values that produce the best edit script according to a fitness function (by default, the shortest edit script).
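As an illustration of this default fitness, the sketch below evaluates each configuration on the single file pair and keeps the one whose edit script has the fewest actions. The editScriptLength stub stands in for the actual diff run and is not part of DAT's API.

```java
import java.util.List;

public class LocalSelectionSketch {

    // Placeholder: in DAT, the edit script is produced by running the AST
    // diff algorithm on (leftFile, rightFile) under the given configuration.
    static int editScriptLength(String configuration, String leftFile, String rightFile) {
        return 0; // illustrative stub
    }

    // Local optimization: keep the configuration with the shortest edit script.
    static String selectBestLocal(List<String> configurations, String leftFile, String rightFile) {
        String best = null;
        int bestLength = Integer.MAX_VALUE;
        for (String c : configurations) {
            int length = editScriptLength(c, leftFile, rightFile);
            if (length < bestLength) {
                bestLength = length;
                best = c;
            }
        }
        return best;
    }
}
```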
The Java code requires at least Java 17. To build and install DAT:
$ mvn install
To use TPE, it is required to:
- Install Python 3.x.
- Include Python in the path, or create the environment variable python.home, whose value corresponds to the path to Python.
- In case of using TPE from Hyperopt, install hyperopt, e.g.,
pip install hyperopt
- In case of using TPE from Optuna, install [Optuna](https://optuna.org/), e.g.,
pip install optuna
To run local optimization on a single file pair, with exhaustive search or with TPE:
fr.gumtree.autotuning.Main -left <path_to_file> -right <path_to_file> -mode exhaustive -scope local
fr.gumtree.autotuning.Main -left <path_to_file> -right <path_to_file> -mode <tpe> -scope local
DAT works with different implementations of TPE, including one from Hyperopt (replace <tpe> by TPE_HYPEROPT) and one from Optuna (replace <tpe> by TPE_OPTUNA).
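For example, a local run with TPE from Optuna could look as follows (the classpath placeholder and the file names are illustrative and must be adapted to your setup):

```
$ java -cp <classpath> fr.gumtree.autotuning.Main \
    -left examples/Foo_v1.java -right examples/Foo_v2.java \
    -mode TPE_OPTUNA -scope local
```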
To run global optimization on a list of file pairs, with exhaustive search or with TPE:
fr.gumtree.autotuning.Main -listpairs <path_to_file> -mode exhaustive -scope global
fr.gumtree.autotuning.Main -listpairs <path_to_file> -mode <tpe> -scope global
As in the local mode, replace <tpe> by TPE_HYPEROPT to use TPE from Hyperopt, or by TPE_OPTUNA to use TPE from Optuna.
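For example, assuming a file pairs.txt that follows the list format shown above, a global exhaustive run could look as follows (the classpath placeholder and the file names are illustrative):

```
$ cat pairs.txt
examples/Foo_v1.java, examples/Foo_v2.java
examples/Bar_v1.java, examples/Bar_v2.java

$ java -cp <classpath> fr.gumtree.autotuning.Main \
    -listpairs pairs.txt -mode exhaustive -scope global
```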
The interface OptimizationMethod provides the methods that define the two optimization types:
public ResponseBestParameter computeBestGlobal(File dataFilePairs, Fitness fitnessFunction,
ExecutionConfiguration configuration) throws Exception;
public ResponseBestParameter computeBestLocal(File left, File right, Fitness fitnessFunction,
ExecutionConfiguration configuration) throws Exception;
As mentioned before, computeBestGlobal receives a file that stores a list of file pairs, while computeBestLocal receives two files, which correspond to the file pair to analyze.
Both methods have two other arguments: the fitness function and the configuration of DAT for this execution (the type and number of configuration entries depend on the interface implementation).
The current version of DAT provides two optimization methods, which implement the interface OptimizationMethod:
- Exhaustive search
- TPE
Other optimization methods can be added to DAT by implementing this interface.
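For instance, a new optimization method could be plugged in with a skeleton like the one below. This is a minimal sketch based solely on the interface shown above; the class name is hypothetical, the method bodies are placeholders, and the DAT imports are omitted.

```java
import java.io.File;

// Hypothetical custom optimization method; the search strategy is not shown.
public class RandomSearchMethod implements OptimizationMethod {

    @Override
    public ResponseBestParameter computeBestGlobal(File dataFilePairs, Fitness fitnessFunction,
            ExecutionConfiguration configuration) throws Exception {
        // 1) read the file pairs listed in dataFilePairs
        // 2) sample configurations and evaluate fitnessFunction on each pair
        // 3) aggregate the values and return the best configuration(s)
        throw new UnsupportedOperationException("sketch only");
    }

    @Override
    public ResponseBestParameter computeBestLocal(File left, File right, Fitness fitnessFunction,
            ExecutionConfiguration configuration) throws Exception {
        // 1) sample configurations and run the diff on (left, right)
        // 2) score each edit script with fitnessFunction
        // 3) return the configuration(s) with the best score
        throw new UnsupportedOperationException("sketch only");
    }
}
```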
For DAT, a configuration is a single point in the hyper-parameter space: for each hyper-parameter, a configuration has exactly one value.
The next listing shows the template for representing a particular configuration, i.e., a point in the hyper-parameter space:
<Algorithm Name>-<hyperparameter name 1>-<value of hyperparameter 1>-<hyperparameter name 2>-<value of hyperparameter 2>...-<hyperparameter name n>-<value of hyperparameter n>
For example, the following configuration:
ClassicGumtree-bu_minsim-0.1-bu_minsize-100-st_minprio-1-st_priocalc-size
concerns the ClassicGumtree matching algorithm and assigns value 0.1 to hyper-parameter bu_minsim, 100 to bu_minsize, 1 to st_minprio, and size to st_priocalc.
We recall that all four are hyper-parameters of the ClassicGumtree matcher.
Another diff algorithm may have different hyper-parameters.
For example, XyMatcher has 3 hyper-parameters (st_minprio, st_priocalc and xy_minsim), and one possible configuration is:
XyMatcher-st_minprio-1-st_priocalc-size-xy_minsim-0.1
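To make the template concrete, the sketch below parses such a configuration string into its algorithm name and hyper-parameter values. It is an illustrative helper, not DAT's own parser, and it assumes that neither hyper-parameter names nor values contain a dash.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ConfigurationParsingSketch {

    // Splits "<Algorithm>-<name1>-<value1>-...-<nameN>-<valueN>" into the
    // algorithm name and a name -> value map.
    static Map<String, String> parse(String configuration) {
        String[] tokens = configuration.split("-");
        Map<String, String> result = new LinkedHashMap<>();
        result.put("algorithm", tokens[0]);
        for (int i = 1; i + 1 < tokens.length; i += 2)
            result.put(tokens[i], tokens[i + 1]);
        return result;
    }

    public static void main(String[] args) {
        // Prints: {algorithm=XyMatcher, st_minprio=1, st_priocalc=size, xy_minsim=0.1}
        System.out.println(parse("XyMatcher-st_minprio-1-st_priocalc-size-xy_minsim-0.1"));
    }
}
```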
All methods from OptimizationMethod return a ResponseBestParameter, which contains:
- a list of best hyper-parameter configurations (DAT can find several sets of hyper-parameters that produce the same output);
- the value of the fitness function corresponding to the best configuration found.
Contact: Matias Martinez [email protected]