Merge pull request #100 from PGijsbers/develop
GAMA 20.2.1 release
PGijsbers authored Jun 30, 2020
2 parents 565f48e + 4f7f649 commit a285f64
Showing 71 changed files with 7,662 additions and 6,292 deletions.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -3,7 +3,7 @@ repos:
rev: 19.10b0
hooks:
- id: black
language_version: python3.6
language_version: python3.8
- repo: https://github.com/pre-commit/mirrors-mypy
rev: v0.761
hooks:
2 changes: 2 additions & 0 deletions .travis.yml
@@ -11,7 +11,9 @@ env:
jobs:
include:
- env: JOB=check
python: 3.8
- env: JOB=deploy
python: 3.8
if: (branch = master OR branch = develop) AND type = push

install:
391 changes: 391 additions & 0 deletions data/GAMA_20_2_1.csv


1 change: 1 addition & 0 deletions docs/source/api/index.rst
@@ -2,6 +2,7 @@
.. default-role:: code

.. _api_doc:

API
===

15 changes: 11 additions & 4 deletions docs/source/benchmark.rst
@@ -3,11 +3,18 @@
Benchmark Results
=================

.. note:: Additional results will be uploaded soon.

This page reports the benchmark results obtained by running GAMA on the `AutoML benchmark <https://openml.github.io/automlbenchmark/automl_overview.html>`_.
Note that the performance of other AutoML frameworks is taken from the original benchmark experiments; it is not reproduced by us.
Some results on this page may have been obtained on hardware other than AWS nodes, but care is taken that the available resources match.
We will do our best to keep the results complete and up to date.

.. image:: images/benchmark.png
GAMA 20.2.1
***********
The results are obtained on non-AWS hardware.
Additionally, GAMA was run with a one-hour time constraint instead of four hours; other conditions are equal.
It is important to point out that the constrained runtime may be both bad (less time to search) and good (less opportunity to overfit).
The restriction is in place due to our compute budget.
The raw data is available `here <https://github.com/PGijsbers/gama/tree/develop/data/GAMA_20_2_1.csv>`_.

.. image:: images/binary_results_stripplot.png

.. image:: images/multiclass_results_stripplot.png
40 changes: 36 additions & 4 deletions docs/source/citing.rst
@@ -1,8 +1,36 @@
Citing
Papers
======
This page contains BibTeX entries and up-to-date code listings for each paper.
Unless you want to reference a specific paper, please cite the `JOSS article <http://joss.theoj.org/papers/10.21105/joss.01132>`_ when citing GAMA.

If you want to cite GAMA, please cite the `JOSS article <http://joss.theoj.org/papers/10.21105/joss.01132>`_.
Here's the bibtex:
GAMA: a General Automated Machine learning Assistant
----------------------------------------------------
Features GAMA 20.2.1

BibTeX will be added after publication.

Listings
********
Listing 1:

.. code-block:: Python

    from gama import GamaClassifier
    from gama.search_methods import AsynchronousSuccessiveHalving
    from gama.postprocessing import EnsemblePostProcessing

    # X, y and X_test are assumed to be loaded beforehand
    automl = GamaClassifier(
        search_method=AsynchronousSuccessiveHalving(),
        post_processing_method=EnsemblePostProcessing()
    )
    automl.fit(X, y)
    automl.predict(X_test)
    automl.score(X_test, y_test)  # evaluate on the held-out test set

GAMA: Genetic Automated Machine learning Assistant
--------------------------------------------------
Features GAMA 19.01.0

.. code-block:: latex

@@ -18,4 +46,8 @@ Here's the bibtex:
author = {Pieter Gijsbers and Joaquin Vanschoren},
title = {{GAMA}: Genetic Automated Machine learning Assistant},
journal = {Journal of Open Source Software}
}
}

Listings
********
This paper features no listings.
Binary file added docs/source/images/binary_results_stripplot.png
Binary file added docs/source/images/multiclass_results_stripplot.png
3 changes: 0 additions & 3 deletions docs/source/index.rst
@@ -27,9 +27,6 @@ It describes visualization of optimization logs, changing the AutoML pipeline, a
If there are any questions you have that are not answered by the documentation, check the `issue page <https://github.com/PGijsbers/GAMA/issues>`_.
If your question has not been answered there either, please open a new issue.

.. note::


.. toctree::
:includehidden:

20 changes: 20 additions & 0 deletions docs/source/releases.rst
@@ -1,6 +1,26 @@
Release Notes
=============

Version 20.2.1
--------------
Changes:
#24: Changes to logging
The structure of the log files has changed.
The goal is to make the log files easier to use, by making them easier to read and
easier to extend with new write behavior.
There will now be three log files: one which contains just evaluation data, one which contains progress data, and one which contains resource usage data.
For more information see :ref:`logging-section` in the technical guide.


Features:
#66: csv files are now supported.
The `fit_arff` call is now `fit_from_file`, which accepts both ARFF and csv files.
The CLI and GAMA Dashboard also allow for csv files.
#92: You can specify a memory limit through the `max_memory_mb` hyperparameter.
GAMA does not guarantee it will not violate the constraint, but violations
should be infrequent and minor. Feel free to open an issue if you experience a
violation which is not minor.
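
For illustration, a minimal sketch combining both features (the file name is a placeholder):

.. code-block:: Python

    from gama import GamaClassifier

    # limit GAMA to roughly 2 GB of memory (#92)
    automl = GamaClassifier(max_memory_mb=2048)
    # fit_from_file accepts both ARFF and csv files (#66)
    automl.fit_from_file("my_training_data.csv")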

Version 20.2.0
--------------
Features:
Expand Down
4 changes: 0 additions & 4 deletions docs/source/technical_guide/index.rst
@@ -20,10 +20,6 @@ This section will cover more advanced usage of GAMA, in particular it covers:
.. include:: logging.rst
:start-line: 1

----

.. include:: visualization.rst
:start-line: 5

----

29 changes: 14 additions & 15 deletions docs/source/technical_guide/logging.rst
@@ -2,26 +2,30 @@

.. default-role:: code

.. _logging-section:

Logging
-------

GAMA makes use of the default Python `logging <https://docs.python.org/3.5/library/logging.html>`_ module.
This means logs can be captured at different levels, and handled by one of several StreamHandlers.
In addition to the Python built-in log levels GAMA introduces one level below `logging.DEBUG`, explicitly for log
messages that are meant to be parsed by a program later.

The most common logging use cases are to write a comprehensive log to file, as well as print important messages to `stdout`.
Writing log messages to `stdout` is directly supported by GAMA through the `verbosity` hyperparameter
(which defaults to `logging.WARNING`).
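
For example, a minimal sketch that prints progress information:

.. code-block:: Python

    import logging

    from gama import GamaClassifier

    # print INFO-level (and more severe) messages to stdout
    automl = GamaClassifier(verbosity=logging.INFO)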

Logging all log messages to file is not entirely supported directly.
Instead, GAMA provides a `keep_analysis_log` hyperparameter.
When set (by providing the name of the file to write to), it will only write *most* information to file.
In particular it will write any log messages at `logging.WARNING`, `logging.CRITICAL` and `logging.ERROR`, as
well as all messages at the `MACHINE_LOG_LEVEL` - the log level introduced by GAMA.
The `MACHINE_LOG_LEVEL` writes messages in a structured way, so that they may easily be parsed by a program later.
By default GAMA writes to 'gama.log'.
By default GAMA will also save several different logs.
This can be controlled with the `store` hyperparameter, which allows you to store the logs, as well as models and predictions.
By default logs are kept (which includes evaluation data), but models and predictions are discarded.

The `output_directory` hyperparameter determines where this data is stored; by default a unique name is generated.
In the output directory you will find three files and a subdirectory:

- 'evaluations.log': a csv file (with ';' as separator) in which each evaluation is stored.
- 'gama.log': A loosely structured file with general (human readable) information of the GAMA run.
- 'resources.log': A record of the memory usage for each of GAMA's processes over time.
- cache directory: contains evaluated models and predictions, only if `store` is 'all' or 'models'
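
For example, a sketch using the hyperparameters described above:

.. code-block:: Python

    from gama import GamaClassifier

    # keep logs, models and predictions in a directory of our choosing
    automl = GamaClassifier(store="all", output_directory="my_gama_run")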

If you want other behavior, the logging module offers great flexibility for making your own variations.
The following script writes any log messages of level `logging.DEBUG` or above to both file and console::
@@ -43,14 +47,9 @@ The following script writes any log messages of `logging.DEBUG` or up to both fi
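
    # A sketch of the elided script (its exact lines are collapsed in this diff);
    # it assumes GAMA logs under the logger name 'gama'.
    import logging
    import sys

    gama_log = logging.getLogger('gama')
    gama_log.setLevel(logging.DEBUG)

    file_handler = logging.FileHandler('logfile.txt')
    file_handler.setLevel(logging.DEBUG)
    gama_log.addHandler(file_handler)

    stdout_handler = logging.StreamHandler(sys.stdout)
    stdout_handler.setLevel(logging.DEBUG)
    gama_log.addHandler(stdout_handler)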
Running the above script will create the 'logfile.txt' file with all log messages that could also be seen in the console.
An overview of the log levels:

- `MACHINE_LOG_LEVEL (5)`: Messages mostly meant to be parsed by a program.
- `DEBUG`: Messages for developers.
- `INFO`: General information about the optimization process.
- `WARNING`: Serious errors that do not prohibit GAMA from running to completion (but results could be suboptimal).
- `ERROR`: Errors which prevent GAMA from running to completion.


By default GAMA will also capture log output to a file named 'gama.log'.
You can use the `keep_analysis_log` hyperparameter to specify the desired location and name of the log,
or set it to `None` to prevent it being produced in the first place.
As described in :ref:`Dashboard` this file can be used to generate visualizations about the optimization process.
As described in :ref:`dashboard-section`, the files in the output directory can be used to generate visualizations of the optimization process.
10 changes: 6 additions & 4 deletions docs/source/user_guide/dashboard.rst
@@ -1,14 +1,15 @@
:orphan:


.. _dashboard:
.. _dashboard-section:

Dashboard
---------

.. note::
The GAMA Dashboard is not done.
However, it is functional and released to get some early feedback on what users would like to see included.
The near future may see a reimplementation; see `#97 <https://github.com/PGijsbers/gama/issues/97>`_.

GAMA Dashboard is a graphical user interface to start and monitor the AutoML search.
It is available when GAMA has been installed with its visualization optional dependencies (`pip install gama[vis]`).
@@ -22,7 +23,7 @@ Starting GAMA Dashboard will open a new tab in your webbrowser which will show t
.. image:: images/DashboardHome.png

On the left you can configure GAMA, on the right you can select the dataset you want to perform AutoML on.
To provide a dataset, specify the path to the ARFF-file which contains your data.
To provide a dataset, specify the path to the csv or ARFF-file which contains your data.
Once the dataset has been set, the `Go!`-button on the bottom left will be enabled.
When you are satisfied with your chosen configuration, press the `Go!`-button to start GAMA.
This will take you to the 'Running' tab.
@@ -60,8 +61,9 @@ On this tab, you can visualize search results from logs.
.. image:: images/analysis_empty.png

Clicking 'Select or drop log(s)' in the top-right corner opens a file explorer which lets you select file(s) to load.
Select a log file generated by a GAMA run, here we use the example log found `here <https://github.com/PGijsbers/gama/blob/master/tests/data/airline_run_0.log>`_.
After loading the file, you can toggle its visualization by clicking the checkbox that appears next to the file name.
Select both the 'gama.log' and 'evaluations.log' files from your output directory together.
For example, the logs found `here <https://github.com/PGijsbers/gama/blob/master/tests/data/AsyncEA>`_.
After loading the files, you can toggle their visualization by clicking the checkbox that appears next to the file name.
The first visualization you will see is the best obtained score as a function of the number of evaluated pipelines:

.. image:: images/analysis_load.png
2 changes: 1 addition & 1 deletion docs/source/user_guide/examples.rst
@@ -9,5 +9,5 @@ Examples
.. include:: examples/regression_example.rst
:start-line: 1

.. include:: examples/using_arff_files.rst
.. include:: examples/using_files_directly.rst
:start-line: 1
docs/source/user_guide/examples/{using_arff_files.rst → using_files_directly.rst}
@@ -1,24 +1,26 @@
:orphan:

Using ARFF files
****************
Using Files Directly
********************

GAMA supports data in `ARFF <https://www.cs.waikato.ac.nz/ml/weka/arff.html>`_ files directly, utilizing extra information given, such as which features are categorical.
In the example below, make sure to replace the file paths to the ARFF files to be used.
You can load data directly from csv and `ARFF <https://www.cs.waikato.ac.nz/ml/weka/arff.html>`_ files.
For ARFF files, GAMA can utilize extra information given, such as which features are categorical.
For csv files GAMA will infer column types, but this might lead to mistakes.
In the example below, make sure to replace the file paths with those of the files to be used.
The example script can be run by using e.g.
`breast_cancer_train.arff <https://github.com/PGijsbers/gama/tree/master/gama/tests/data/breast_cancer_train.arff>`_ and
`breast_cancer_test.arff <https://github.com/PGijsbers/gama/tree/master/gama/tests/data/breast_cancer_test.arff>`_.
The target should always be specified as the last column.
The target should always be specified as the last column, unless the `target_column` is specified.
Make sure you adjust the file path if not executed from the examples directory.


.. file below is copied in by conf.py
.. literalinclude:: /user_guide/examples/arff_example.py

The GamaRegressor also has ARFF support.
The GamaRegressor also has csv and ARFF support.

The advantage of using an ARFF file over something like a numpy-array, is that attribute types are specified.
When supplying only numpy-arrays (e.g. through ``fit(X, y)``), GAMA can not know if a particular feature is ordinal or numeric.
The advantage of using an ARFF file over something like a numpy-array or a csv file is that attribute types are specified.
When supplying only numpy-arrays (e.g. through ``fit(X, y)``), GAMA can not know if a particular feature is nominal or numeric.
This means that GAMA might use a wrong feature transformation for the data (e.g. one-hot encoding on a numeric feature or scaling on a categorical feature).
Note that this is not unique to GAMA, but applies to any framework which accepts numeric input without meta-data.
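
For instance, a minimal sketch of loading a csv file whose target is not the last column (the file and column names are placeholders, assuming `fit_from_file` takes the `target_column` argument mentioned above):

.. code-block:: Python

    from gama import GamaClassifier

    automl = GamaClassifier(max_total_time=180, n_jobs=1)
    automl.fit_from_file("my_data.csv", target_column="class")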

Expand Down
6 changes: 3 additions & 3 deletions docs/source/user_guide/hyperparameters.rst
@@ -11,10 +11,10 @@ Optimization
************
Perhaps the most important hyperparameters are the ones that specify what to optimize for; these are:

``scoring``: ``string`` (default='log_loss' for classification and 'mean_squared_error' for regression)
``scoring``: ``string`` (default='neg_log_loss' for classification and 'mean_squared_error' for regression)
Sets the metric to optimize for. Make sure to optimize towards the metric that reflects well what is important to you.
Valid options include `roc_auc`, `accuracy` and `log_loss` for classification, and `mean_squared_error` and `r2` for regression.
For more options see :ref:`API documentation <api_doc>`.
Any string that can construct a scikit-learn scorer is accepted, see `this page <https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter>`_ for more information.
Valid options include `roc_auc`, `accuracy` and `neg_log_loss` for classification, and `neg_mean_squared_error` and `r2` for regression.

``regularize_length``: ``bool`` (default=True)
If True, in addition to optimizing towards the metric set in ``scoring``, also guide the search towards shorter pipelines.
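
For example, a minimal sketch that sets both optimization hyperparameters:

.. code-block:: Python

    from gama import GamaClassifier

    # optimize ROC AUC and disable the preference for shorter pipelines
    automl = GamaClassifier(scoring="roc_auc", regularize_length=False)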
2 changes: 1 addition & 1 deletion docs/source/user_guide/related_packages.rst
@@ -23,7 +23,7 @@ The advantage of asynchronous evolution is that in theory it is able to utilize
This is due to generation-based evolution waiting for the last individual in the population to be evaluated in each generation,
whereas asynchronous evolution does not have any point where all evaluations need to be finished at the same time.

TPOT sports some great user-friendly features such as a less limited command-line interface, support for csv files
TPOT sports some great user-friendly features such as a less limited command-line interface
and `DASK <https://dask.org/>`_ integration (which allows the use of a Dask cluster for pipeline evaluations and further optimizations).

GAMA focuses more on extensibility and research friendliness,
4 changes: 2 additions & 2 deletions docs/source/user_guide/simple_features.rst
@@ -3,13 +3,13 @@
Simple Features
---------------
This section features a couple of simple-to-use features that might be interesting for a wide audience.
For more advanced features, see the :ref:`Technical Guide`.
For more advanced features, see the :ref:`technical_guide_index`.

Command Line Interface
**********************

GAMA may also be called from a terminal, but the tool currently supports only part of the Python functionality.
In particular it can only load data from `.arff` files and AutoML pipeline configuration is not available.
In particular it can only load data from `.csv` or `.arff` files and AutoML pipeline configuration is not available.
The tool will produce a single pickled scikit-learn model (by default named 'gama_model.pkl');
code export is also available.
Please see `gama -h` for all options.
Expand Down
6 changes: 3 additions & 3 deletions examples/arff_example.py
@@ -5,7 +5,7 @@

automl = GamaClassifier(max_total_time=180, keep_analysis_log=None, n_jobs=1)
print("Starting `fit` which will take roughly 3 minutes.")
automl.fit_arff(file_path.format("train"))
automl.fit_from_file(file_path.format("train"))

label_predictions = automl.predict_arff(file_path.format("test"))
probability_predictions = automl.predict_proba_arff(file_path.format("test"))
label_predictions = automl.predict_from_file(file_path.format("test"))
probability_predictions = automl.predict_proba_from_file(file_path.format("test"))
6 changes: 3 additions & 3 deletions gama/GamaClassifier.py
@@ -7,7 +7,7 @@
from sklearn.preprocessing import LabelEncoder

from .gama import Gama
from gama.data import X_y_from_arff
from gama.data import X_y_from_file
from gama.configuration.classification import clf_config
from gama.utilities.metrics import scoring_to_metric

@@ -93,7 +93,7 @@ def predict_proba(self, x: Union[pd.DataFrame, np.ndarray]):
x = self._prepare_for_prediction(x)
return self._predict_proba(x)

def predict_proba_arff(
def predict_proba_from_file(
self,
arff_file_path: str,
target_column: Optional[str] = None,
@@ -119,7 +119,7 @@ def predict_proba_arff(
The array is of shape (N, K) where N is len(X),
and K is the number of class labels found in `y` of `fit`.
"""
x, _ = X_y_from_arff(arff_file_path, target_column, encoding)
x, _ = X_y_from_file(arff_file_path, target_column, encoding)
x = self._prepare_for_prediction(x)
return self._predict_proba(x)

2 changes: 1 addition & 1 deletion gama/__version__.py
@@ -1,2 +1,2 @@
# format: YY.minor.micro
__version__ = "20.2.0"
__version__ = "20.2.1"
6 changes: 3 additions & 3 deletions gama/dashboard/controller.py
@@ -20,7 +20,7 @@ def start_gama(
max_eval_time_h,
max_eval_time_m,
input_file,
log_file,
log_dir,
target,
):
# For some reason, 0 input registers as None.
@@ -32,7 +32,7 @@
max_eval_time = max_eval_time_h * 60 + max_eval_time_m
command = (
f'gama "{input_file}" -v -n {n_jobs} -t {max_total_time} '
f'--time_pipeline {max_eval_time} -log {log_file} --target "{target}"'
f'--time_pipeline {max_eval_time} -outdir {log_dir} --target "{target}"'
)
if regularize != "on":
command += " --long"
Expand All @@ -42,7 +42,7 @@ def start_gama(
command = shlex.split(command)
# fake_command = ['python', '-h']
process = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE)
self._on_gama_started(process, log_file)
self._on_gama_started(process, log_dir)

def _on_gama_started(self, process, log_file):
for subscriber in self._subscribers["gama_started"]: