Merge pull request #100 from PGijsbers/develop
GAMA 20.2.1 release
PGijsbers authored Jun 30, 2020
2 parents 565f48e + 4f7f649 commit a285f64
Showing 71 changed files with 7,662 additions and 6,292 deletions.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -3,7 +3,7 @@ repos:
rev: 19.10b0
hooks:
- id: black
language_version: python3.6
language_version: python3.8
- repo: https://github.com/pre-commit/mirrors-mypy
rev: v0.761
hooks:
2 changes: 2 additions & 0 deletions .travis.yml
@@ -11,7 +11,9 @@ env:
jobs:
include:
- env: JOB=check
python: 3.8
- env: JOB=deploy
python: 3.8
if: (branch = master OR branch = develop) AND type = push

install:
391 changes: 391 additions & 0 deletions data/GAMA_20_2_1.csv


1 change: 1 addition & 0 deletions docs/source/api/index.rst
@@ -2,6 +2,7 @@
.. default-role:: code

.. _api_doc:

API
===

15 changes: 11 additions & 4 deletions docs/source/benchmark.rst
@@ -3,11 +3,18 @@
Benchmark Results
=================

.. note:: Additional results will be uploaded soon.

This page reports the benchmark results obtained by running GAMA on the `AutoML benchmark <https://openml.github.io/automlbenchmark/automl_overview.html>`_.
Note that the performance of other AutoML frameworks is taken from the original benchmark experiments; it is not reproduced by us.
Some results on this page may have been obtained on hardware other than AWS nodes, but care is taken that the available resources match.
We will do our best to keep the results complete and up to date.

.. image:: images/benchmark.png
GAMA 20.2.1
***********
The results are obtained on non-AWS hardware.
Additionally, GAMA was run with a one-hour time constraint instead of four hours; other conditions are equal.
It is important to point out that the constrained runtime may be both bad (less time to search) and good (less opportunity to overfit).
The restriction is in place due to our compute budget.
The raw data is available `here <https://github.com/PGijsbers/gama/tree/develop/data/GAMA_20_2_1.csv>`_.

.. image:: images/binary_results_stripplot.png

.. image:: images/multiclass_results_stripplot.png
40 changes: 36 additions & 4 deletions docs/source/citing.rst
@@ -1,8 +1,36 @@
Citing
Papers
======
This page contains BibTeX entries and up-to-date code listings for each paper.
Unless you want to reference a specific paper, please cite the `JOSS article <http://joss.theoj.org/papers/10.21105/joss.01132>`_ when citing GAMA.

If you want to cite GAMA, please cite the `JOSS article <http://joss.theoj.org/papers/10.21105/joss.01132>`_.
Here's the bibtex:
GAMA: a General Automated Machine learning Assistant
----------------------------------------------------
Features GAMA 20.2.1

BibTeX will be added after publication.

Listings
********
Listing 1:

.. code-block:: Python

    from gama import GamaClassifier
    from gama.search_methods import AsynchronousSuccessiveHalving
    from gama.postprocessing import EnsemblePostProcessing

    # X, y and X_test are assumed to be loaded beforehand
    automl = GamaClassifier(
        search_method=AsynchronousSuccessiveHalving(),
        post_processing_method=EnsemblePostProcessing()
    )
    automl.fit(X, y)
    automl.predict(X_test)
    automl.score(X_test, y_test)  # evaluate on the held-out test set

GAMA: Genetic Automated Machine learning Assistant
--------------------------------------------------
Features GAMA 19.01.0

.. code-block:: latex

@@ -18,4 +46,8 @@ Here's the bibtex:
author = {Pieter Gijsbers and Joaquin Vanschoren},
title = {{GAMA}: Genetic Automated Machine learning Assistant},
journal = {Journal of Open Source Software}
}
}

Listings
********
This paper features no listings.
Binary file added docs/source/images/binary_results_stripplot.png
Binary file added docs/source/images/multiclass_results_stripplot.png
3 changes: 0 additions & 3 deletions docs/source/index.rst
@@ -27,9 +27,6 @@ It describes visualization of optimization logs, changing the AutoML pipeline, a
If there are any questions you have that are not answered by the documentation, check the `issue page <https://github.com/PGijsbers/GAMA/issues>`_.
If your question has not been answered there either, please open a new issue.

.. note::


.. toctree::
:includehidden:

20 changes: 20 additions & 0 deletions docs/source/releases.rst
@@ -1,6 +1,26 @@
Release Notes
=============

Version 20.2.1
--------------
Changes:
#24: Changes to logging
The structure of the log files has changed.
The goal is to make the log files easier to use, by making them easier to read and
easier to extend with new write behavior.
There will now be three log files: one which contains just evaluation data, one which contains progress data, and one which contains resource usage data.
For more information see :ref:`logging-section` in the technical guide.


Features:
#66: csv files are now supported.
The `fit_arff` call is now `fit_from_file`, which accepts both ARFF and csv files.
The CLI and GAMA Dashboard also allow for csv files.
#92: You can specify a memory limit through the `max_memory_mb` hyperparameter.
GAMA does not guarantee it will not violate the constraint, but violations
should be infrequent and minor. Feel free to open an issue if you experience a
violation which is not minor.
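
For illustration, a minimal sketch combining both features (the file name is a placeholder):

.. code-block:: Python

    from gama import GamaClassifier

    # limit GAMA to roughly 2 GB of memory (#92)
    automl = GamaClassifier(max_memory_mb=2048)
    # fit_from_file accepts both ARFF and csv files (#66)
    automl.fit_from_file("my_training_data.csv")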

Version 20.2.0
--------------
Features:
Expand Down
4 changes: 0 additions & 4 deletions docs/source/technical_guide/index.rst
@@ -20,10 +20,6 @@ This section will cover more advanced usage of GAMA, in particular it covers:
.. include:: logging.rst
:start-line: 1

----

.. include:: visualization.rst
:start-line: 5

----

29 changes: 14 additions & 15 deletions docs/source/technical_guide/logging.rst
@@ -2,26 +2,30 @@

.. default-role:: code

.. _logging-section:

Logging
-------

GAMA makes use of the default Python `logging <https://docs.python.org/3.5/library/logging.html>`_ module.
This means logs can be captured at different levels, and handled by one of several StreamHandlers.
In addition to the Python built-in log levels GAMA introduces one level below `logging.DEBUG`, explicitly for log
messages that are meant to be parsed by a program later.

The most common logging use cases are to write a comprehensive log to file, as well as print important messages to `stdout`.
Writing log messages to `stdout` is directly supported by GAMA through the `verbosity` hyperparameter
(which defaults to `logging.WARNING`).
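
For example, a minimal sketch that prints progress information:

.. code-block:: Python

    import logging

    from gama import GamaClassifier

    # print INFO-level (and more severe) messages to stdout
    automl = GamaClassifier(verbosity=logging.INFO)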

Logging all log messages to file is not entirely supported directly.
Instead, GAMA provides a `keep_analysis_log` hyperparameter.
When set (by providing the name of the file to write to), it will only write *most* information to file.
In particular it will write any log messages at `logging.WARNING`, `logging.CRITICAL` and `logging.ERROR`, as
well as all messages at the `MACHINE_LOG_LEVEL` - the log level introduced by GAMA.
The `MACHINE_LOG_LEVEL` writes messages in a structured way, so that they may easily be parsed by a program later.
By default GAMA writes to 'gama.log'.
By default GAMA will also save several different logs.
This can be controlled with the `store` hyperparameter, which allows you to store the logs, as well as models and predictions.
By default logs are kept (which includes evaluation data), but models and predictions are discarded.

The `output_directory` hyperparameter determines where this data is stored; by default a unique name is generated.
In the output directory you will find three files and a subdirectory:

- 'evaluations.log': a csv file (with ';' as separator) in which each evaluation is stored.
- 'gama.log': A loosely structured file with general (human readable) information of the GAMA run.
- 'resources.log': A record of the memory usage for each of GAMA's processes over time.
- cache directory: contains evaluated models and predictions, only if `store` is 'all' or 'models'
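
For example, a sketch using the hyperparameters described above:

.. code-block:: Python

    from gama import GamaClassifier

    # keep logs, models and predictions in a directory of our choosing
    automl = GamaClassifier(store="all", output_directory="my_gama_run")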

If you want other behavior, the logging module offers great flexibility for making your own variations.
The following script writes any log messages of level `logging.DEBUG` or above to both file and console::
@@ -43,14 +47,9 @@ The following script writes any log messages of `logging.DEBUG` or up to both fi
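
    # A sketch of the elided script (its exact lines are collapsed in this diff);
    # it assumes GAMA logs under the logger name 'gama'.
    import logging
    import sys

    gama_log = logging.getLogger('gama')
    gama_log.setLevel(logging.DEBUG)

    file_handler = logging.FileHandler('logfile.txt')
    file_handler.setLevel(logging.DEBUG)
    gama_log.addHandler(file_handler)

    stdout_handler = logging.StreamHandler(sys.stdout)
    stdout_handler.setLevel(logging.DEBUG)
    gama_log.addHandler(stdout_handler)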
Running the above script will create the 'logfile.txt' file with all log messages that could also be seen in the console.
An overview of the log levels:

- `MACHINE_LOG_LEVEL (5)`: Messages mostly meant to be parsed by a program.
- `DEBUG`: Messages for developers.
- `INFO`: General information about the optimization process.
- `WARNING`: Serious errors that do not prohibit GAMA from running to completion (but results could be suboptimal).
- `ERROR`: Errors which prevent GAMA from running to completion.


By default GAMA will also capture log output to a file named 'gama.log'.
You can use the `keep_analysis_log` hyperparameter to specify the desired location and name of the log,
or set it to `None` to prevent it being produced in the first place.
As described in :ref:`Dashboard` this file can be used to generate visualizations about the optimization process.
As described in :ref:`dashboard-section`, the files in the output directory can be used to generate visualizations of the optimization process.
10 changes: 6 additions & 4 deletions docs/source/user_guide/dashboard.rst
@@ -1,14 +1,15 @@
:orphan:


.. _dashboard:
.. _dashboard-section:

Dashboard
---------

.. note::
The GAMA Dashboard is not done.
However, it is functional and released to get some early feedback on what users would like to see included.
The near future may see a reimplementation; see `#97 <https://github.com/PGijsbers/gama/issues/97>`_.

GAMA Dashboard is a graphical user interface to start and monitor the AutoML search.
It is available when GAMA has been installed with its visualization optional dependencies (`pip install gama[vis]`).
@@ -22,7 +23,7 @@ Starting GAMA Dashboard will open a new tab in your webbrowser which will show t
.. image:: images/DashboardHome.png

On the left you can configure GAMA, on the right you can select the dataset you want to perform AutoML on.
To provide a dataset, specify the path to the ARFF-file which contains your data.
To provide a dataset, specify the path to the csv or ARFF-file which contains your data.
Once the dataset has been set, the `Go!`-button on the bottom left will be enabled.
When you are satisfied with your chosen configuration, press the `Go!`-button to start GAMA.
This will take you to the 'Running' tab.
@@ -60,8 +61,9 @@ On this tab, you can visualize search results from logs.
.. image:: images/analysis_empty.png

Clicking 'Select or drop log(s)' in the top-right corner opens a file explorer which lets you select file(s) to load.
Select a log file generated by a GAMA run, here we use the example log found `here <https://github.com/PGijsbers/gama/blob/master/tests/data/airline_run_0.log>`_.
After loading the file, you can toggle its visualization by clicking the checkbox that appears next to the file name.
Select both the 'gama.log' and 'evaluations.log' files from your output directory together.
For example, the logs found `here <https://github.com/PGijsbers/gama/blob/master/tests/data/AsyncEA>`_.
After loading the files, you can toggle their visualization by clicking the checkbox that appears next to the file name.
The first visualization you will see is the best obtained score as a function of the number of evaluated pipelines:

.. image:: images/analysis_load.png
2 changes: 1 addition & 1 deletion docs/source/user_guide/examples.rst
@@ -9,5 +9,5 @@ Examples
.. include:: examples/regression_example.rst
:start-line: 1

.. include:: examples/using_arff_files.rst
.. include:: examples/using_files_directly.rst
:start-line: 1
docs/source/user_guide/examples/{using_arff_files.rst → using_files_directly.rst}
@@ -1,24 +1,26 @@
:orphan:

Using ARFF files
****************
Using Files Directly
********************

GAMA supports data in `ARFF <https://www.cs.waikato.ac.nz/ml/weka/arff.html>`_ files directly, utilizing extra information given, such as which features are categorical.
In the example below, make sure to replace the file paths to the ARFF files to be used.
You can load data directly from csv and `ARFF <https://www.cs.waikato.ac.nz/ml/weka/arff.html>`_ files.
For ARFF files, GAMA can utilize extra information given, such as which features are categorical.
For csv files GAMA will infer column types, but this might lead to mistakes.
In the example below, make sure to replace the file paths with those of the files to be used.
The example script can be run by using e.g.
`breast_cancer_train.arff <https://github.com/PGijsbers/gama/tree/master/gama/tests/data/breast_cancer_train.arff>`_ and
`breast_cancer_test.arff <https://github.com/PGijsbers/gama/tree/master/gama/tests/data/breast_cancer_test.arff>`_.
The target should always be specified as the last column.
The target should always be specified as the last column, unless the `target_column` is specified.
Make sure you adjust the file path if not executed from the examples directory.


.. file below is copied in by conf.py
.. literalinclude:: /user_guide/examples/arff_example.py

The GamaRegressor also has ARFF support.
The GamaRegressor also has csv and ARFF support.

The advantage of using an ARFF file over something like a numpy-array, is that attribute types are specified.
When supplying only numpy-arrays (e.g. through ``fit(X, y)``), GAMA can not know if a particular feature is ordinal or numeric.
The advantage of using an ARFF file over something like a numpy-array or a csv file is that attribute types are specified.
When supplying only numpy-arrays (e.g. through ``fit(X, y)``), GAMA can not know if a particular feature is nominal or numeric.
This means that GAMA might use a wrong feature transformation for the data (e.g. one-hot encoding on a numeric feature or scaling on a categorical feature).
Note that this is not unique to GAMA, but applies to any framework which accepts numeric input without meta-data.
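
For instance, a minimal sketch of loading a csv file whose target is not the last column (the file and column names are placeholders, assuming `fit_from_file` takes the `target_column` argument mentioned above):

.. code-block:: Python

    from gama import GamaClassifier

    automl = GamaClassifier(max_total_time=180, n_jobs=1)
    automl.fit_from_file("my_data.csv", target_column="class")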

Expand Down
6 changes: 3 additions & 3 deletions docs/source/user_guide/hyperparameters.rst
@@ -11,10 +11,10 @@ Optimization
************
Perhaps the most important hyperparameters are the ones that specify what to optimize for; these are:

``scoring``: ``string`` (default='log_loss' for classification and 'mean_squared_error' for regression)
``scoring``: ``string`` (default='neg_log_loss' for classification and 'mean_squared_error' for regression)
Sets the metric to optimize for. Make sure to optimize towards the metric that reflects well what is important to you.
Valid options include `roc_auc`, `accuracy` and `log_loss` for classification, and `mean_squared_error` and `r2` for regression.
For more options see :ref:`API documentation <api_doc>`.
Any string that can construct a scikit-learn scorer is accepted, see `this page <https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter>`_ for more information.
Valid options include `roc_auc`, `accuracy` and `neg_log_loss` for classification, and `neg_mean_squared_error` and `r2` for regression.

``regularize_length``: ``bool`` (default=True)
If True, in addition to optimizing towards the metric set in ``scoring``, also guide the search towards shorter pipelines.
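
For example, a minimal sketch that sets both optimization hyperparameters:

.. code-block:: Python

    from gama import GamaClassifier

    # optimize ROC AUC and disable the preference for shorter pipelines
    automl = GamaClassifier(scoring="roc_auc", regularize_length=False)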
2 changes: 1 addition & 1 deletion docs/source/user_guide/related_packages.rst
@@ -23,7 +23,7 @@ The advantage of asynchronous evolution is that in theory it is able to utilize
This is due to generation-based evolution waiting for the last individual in the population to be evaluated in each generation,
whereas asynchronous evolution does not have any point where all evaluations need to be finished at the same time.

TPOT sports some great user-friendly features such as a less limited command-line interface, support for csv files
TPOT sports some great user-friendly features such as a less limited command-line interface
and `DASK <https://dask.org/>`_ integration (which allows the use of a Dask cluster for pipeline evaluations and further optimizations).

GAMA focuses more on extensibility and research friendliness,
4 changes: 2 additions & 2 deletions docs/source/user_guide/simple_features.rst
@@ -3,13 +3,13 @@
Simple Features
---------------
This section features a couple of simple-to-use features that might be interesting for a wide audience.
For more advanced features, see the :ref:`Technical Guide`.
For more advanced features, see the :ref:`technical_guide_index`.

Command Line Interface
**********************

GAMA may also be called from a terminal, but the tool currently supports only part of the Python functionality.
In particular it can only load data from `.arff` files and AutoML pipeline configuration is not available.
In particular it can only load data from `.csv` or `.arff` files and AutoML pipeline configuration is not available.
The tool will produce a single pickled scikit-learn model (by default named 'gama_model.pkl');
code export is also available.
Please see `gama -h` for all options.
Expand Down
6 changes: 3 additions & 3 deletions examples/arff_example.py
@@ -5,7 +5,7 @@

automl = GamaClassifier(max_total_time=180, keep_analysis_log=None, n_jobs=1)
print("Starting `fit` which will take roughly 3 minutes.")
automl.fit_arff(file_path.format("train"))
automl.fit_from_file(file_path.format("train"))

label_predictions = automl.predict_arff(file_path.format("test"))
probability_predictions = automl.predict_proba_arff(file_path.format("test"))
label_predictions = automl.predict_from_file(file_path.format("test"))
probability_predictions = automl.predict_proba_from_file(file_path.format("test"))
6 changes: 3 additions & 3 deletions gama/GamaClassifier.py
@@ -7,7 +7,7 @@
from sklearn.preprocessing import LabelEncoder

from .gama import Gama
from gama.data import X_y_from_arff
from gama.data import X_y_from_file
from gama.configuration.classification import clf_config
from gama.utilities.metrics import scoring_to_metric

@@ -93,7 +93,7 @@ def predict_proba(self, x: Union[pd.DataFrame, np.ndarray]):
x = self._prepare_for_prediction(x)
return self._predict_proba(x)

def predict_proba_arff(
def predict_proba_from_file(
self,
arff_file_path: str,
target_column: Optional[str] = None,
@@ -119,7 +119,7 @@ def predict_proba_arff(
The array is of shape (N, K) where N is len(X),
and K is the number of class labels found in `y` of `fit`.
"""
x, _ = X_y_from_arff(arff_file_path, target_column, encoding)
x, _ = X_y_from_file(arff_file_path, target_column, encoding)
x = self._prepare_for_prediction(x)
return self._predict_proba(x)

2 changes: 1 addition & 1 deletion gama/__version__.py
@@ -1,2 +1,2 @@
# format: YY.minor.micro
__version__ = "20.2.0"
__version__ = "20.2.1"
6 changes: 3 additions & 3 deletions gama/dashboard/controller.py
@@ -20,7 +20,7 @@ def start_gama(
max_eval_time_h,
max_eval_time_m,
input_file,
log_file,
log_dir,
target,
):
# For some reason, 0 input registers as None.
@@ -32,7 +32,7 @@
max_eval_time = max_eval_time_h * 60 + max_eval_time_m
command = (
f'gama "{input_file}" -v -n {n_jobs} -t {max_total_time} '
f'--time_pipeline {max_eval_time} -log {log_file} --target "{target}"'
f'--time_pipeline {max_eval_time} -outdir {log_dir} --target "{target}"'
)
if regularize != "on":
command += " --long"
Expand All @@ -42,7 +42,7 @@ def start_gama(
command = shlex.split(command)
# fake_command = ['python', '-h']
process = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE)
self._on_gama_started(process, log_file)
self._on_gama_started(process, log_dir)

def _on_gama_started(self, process, log_file):
for subscriber in self._subscribers["gama_started"]: