Development Reference Notes

Core assumptions:

  • User is only going to use up to 8 nodes (this number was chosen because it is the most manageable for now).
  • Data being sent into the pipeline is pre-cleaned, has no headers, and there is no categorical data.
  • The target variable occupies the last column of the data file (index position list[-1]).
  • The data sets being sent into the pipeline are in tabular CSV format (a small example follows this list).
  • Models are trained on many-to-one data, i.e. many features to one label.
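
For illustration, a data file satisfying these assumptions might look like the following (the values and column count are hypothetical). Every column but the last is a feature; the last column is the target label:

0.42,1.30,7.11,0
0.58,0.97,6.42,1
0.33,1.51,7.80,0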

Preprocessing Stage Reference:

The preprocessing stage is dedicated to performing system integrity checks and processing the XML macro file written by the user before launching the three other stages.

Manager Node:

  • Import local modules/libraries that are only needed by the manager node.
  • Define color-printing functions using the colorama package and lambda statements.
  • Perform the necessary checks to ensure that Jespipe is good to go (a sketch of these startup routines follows this list).
    • Check that the macro file is in XML format.
    • Check that main.py is in the current working directory.
    • Check that the user passed an XML macro file at the command line.
  • Manager takes in the XML macro file.
    • Parses the macro file using BeautifulSoup.
    • Converts the macro file into a job control dictionary.
  • After getting the job control dictionary, splits it into three control sections: train, attack, and clean.
  • Create the data directory.
    • This is the main directory that Jespipe operates out of.
      • This local approach was adopted because on most high-performance computing clusters, non-root users lack the ability to install software and packages at the root level.
  • Create the data/.logs directory.
    • This is where the available workers will write out their log files.
      • The workers write everything to the log files in order to prevent bottlenecks at stdout.
  • Create the data/.tmp directory.
    • This is where Jespipe stores temporary pickles.
  • Broadcast greenlight for the workers to begin their routines.
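
A minimal sketch of these startup routines, assuming mpi4py for the greenlight broadcast (the function and variable names here are illustrative, not Jespipe's actual source):

import os
import sys

from colorama import Fore, Style
from mpi4py import MPI

# Color-printing helpers built with colorama and lambda statements.
print_good = lambda msg: print(Fore.GREEN + msg + Style.RESET_ALL)
print_bad = lambda msg: print(Fore.RED + msg + Style.RESET_ALL)

comm = MPI.COMM_WORLD

# System integrity checks before launching the other stages.
if len(sys.argv) < 2 or not sys.argv[1].endswith(".xml"):
    print_bad("Pass an XML macro file at the command line.")
    sys.exit(1)
if not os.path.isfile("main.py"):
    print_bad("main.py is not in the current working directory.")
    sys.exit(1)

# Create the directories Jespipe operates out of.
for directory in ("data", "data/.logs", "data/.tmp"):
    os.makedirs(directory, exist_ok=True)

# Broadcast the greenlight (1 = go-ahead, 0 = abort) to the workers.
print_good("Jespipe is good to go.")
comm.bcast(1, root=0)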

Worker Nodes:

  • Wait for the greenlight from the manager node (a sketch of this handshake follows this list).
    • 0 means no-go; abort.
    • 1 means go-ahead; do not abort.
  • Log that greenlight was received from the manager node.
  • Create directory data/.logs/$worker-# for storing log files.
    • If greenlight is 1, wait for the signal indicating whether the training, attack, and/or cleaning stages are being skipped.
    • If greenlight is 0, abort execution with exit code 127.
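
The worker side of that handshake might look roughly like this, again assuming mpi4py (the log file layout is illustrative):

import logging
import os
import sys

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Block until the manager broadcasts the greenlight (0 = no-go, 1 = go).
greenlight = comm.bcast(None, root=0)

# Workers write to per-worker log files to prevent bottlenecks at stdout.
log_dir = "data/.logs/worker-{}".format(rank)
os.makedirs(log_dir, exist_ok=True)
logging.basicConfig(filename=os.path.join(log_dir, "worker.log"), level=logging.INFO)
logging.info("Received greenlight %d from the manager node.", greenlight)

if greenlight == 0:
    sys.exit(127)  # Abort execution with exit code 127.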

Training Stage Reference:

The training stage is dedicated to utilizing user-defined plugins to train machine learning models using manipulated/vanilla datasets.

Core Notes:

  • Format of task directive:
["dataset_name", "dataset_path", "model_name", "algorithm", {"algorithm": "parameter_as_string"}, "path_to_algorithm_plugin", "data_manip_name", "datamanip_tag", {"datamanip_parameters": "as_a_string"}]
  • Positional values of task list:
    • 0: dataset_name
    • 1: dataset_path
    • 2: model_name
    • 3: algorithm
    • 4: algorithm_parameters
    • 5: absolute_path_to_algorithm_plugin
    • 6: data_manipulation_name
    • 7: data_manipulation_tag
    • 8: data_manipulation_parameters
  • If there are more nodes than directives, numpy will send an empty list [] to the extra worker nodes.
  • Two positional arguments are sent to the user-specified plugin using subprocess:
    • 0: stage -> value is either "train" or "attack"
    • 1: parameters -> value is a reference to a pickled parameter dictionary stored in data/.tmp
      • The user will need to unpickle this dictionary in their plugin (a plugin-side sketch follows this list).
  • Format of parameter dictionary sent to training plugin:
{
    "dataset_name": "/path/to/dataset",
    "model_name": "mymodelsname",
    "dataframe": pandas.DataFrame,
    "model_params": {},
    "save_path": "/path/to/save/model/to/",
    "log_path": "path/to/log/model/statistics",
    "manip_info": ("manip_name", "manip_tag")
}
  • The training plugin is responsible for where the data created by the model is saved. Jespipe is hands-off in this regard, but provides paths that the user can make use of.
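
From a plugin author's perspective, the two positional arguments translate into roughly the following boilerplate (the dictionary keys match the format above; the training logic itself is elided):

import pickle
import sys

# Positional arguments passed by Jespipe via subprocess:
#   stage      -> "train" or "attack"
#   parameters -> path to the pickled parameter dictionary in data/.tmp
stage = sys.argv[1]
parameters = sys.argv[2]

# Unpickle the parameter dictionary prepared by the worker node.
with open(parameters, "rb") as f:
    params = pickle.load(f)

if stage == "train":
    dataframe = params["dataframe"]        # pandas.DataFrame holding the data
    model_params = params["model_params"]  # hyperparameters as a dictionary
    save_path = params["save_path"]        # where the plugin saves its model
    log_path = params["log_path"]          # where the plugin logs statistics
    # ... train the model and write artifacts under save_path ...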

Manager Node:

  • Broadcast out to nodes if skipping or proceeding with training stage.
    • 0 means no skip and continue with the training stage.
    • 1 means skip training stage and go to attack stage.
  • Take the train dictionary and unwrap it into a more easily processable dictionary.
  • Evaluate whether the paths to the plugin and the data set are absolute (a sketch of these path checks follows this list).
    • If a path is relative, convert it to an absolute path.
    • If a path is already absolute, continue.
  • Verify that the data set and plugin exist. If the pipeline cannot find the data set or the plugin, raise a FileNotFoundError exception.
  • Create the directory path data/$dataset_name/models.
  • Send unwrapped dictionary to scattershot.generate_train() to create directive list for worker nodes.
  • Send directive list to scattershot.slice() to slice up directive list amongst available worker nodes.
  • Send task list to workers using scattershot.delegate().
  • Block on the manager node until all the worker nodes have returned status 1.
    • This is to prevent the manager node from moving onto the next stage while the workers are still training the model.
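
The path evaluation and verification steps reduce to a few lines of os.path logic; a minimal sketch (resolve_and_verify is a hypothetical helper, not part of Jespipe's API):

import os

def resolve_and_verify(path):
    """Convert a relative path to an absolute one and verify the file exists."""
    if not os.path.isabs(path):
        path = os.path.abspath(path)
    if not os.path.isfile(path):
        raise FileNotFoundError("Cannot find {} on the system.".format(path))
    return path

plugin = resolve_and_verify("plugins/my_algorithm.py")   # hypothetical plugin path
dataset = resolve_and_verify("datasets/my_dataset.csv")  # hypothetical data set path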

Worker Nodes:

  • Log whether or not it is performing the training stage.
  • Wait for greenlight from manager to proceed with the training stage.
  • Receive greenlight from manager node.
    • 0: Something went wrong with generating the task list on the manager. Abort execution with exit code 127.
    • 1: Task list was successfully created on the manager. Proceed with the training stage.
  • Log the task list received from the manager node.
  • If the task list received from the manager is empty, return status 1 to the manager. This will inform the manager that the worker is good to move onto the next stage.
  • Once the task list is received, perform the user-specified manipulations on the data. The currently available manipulation options are the following:
    • XGBoost
    • Random Forest
    • Principal Component Analysis
    • Candlestick
    • None <- This option is reserved for users who just want to tune the hyperparameters.
  • Save a copy of the manipulation if the user so desires. This option can be removed from custom installations of Jespipe if storage space is limited.
  • Create the parameter dictionary to send to the plugin.
  • Call the plugin using the subprocess.run() function and block until training has completed (a sketch of these two steps follows this list).
  • Return status 1 to the manager node once all models in the task list have completed training.
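
Creating the parameter dictionary and blocking on the plugin could look roughly like the following (the pickle file name, plugin path, and placeholder data are illustrative):

import pickle
import subprocess
import uuid

import pandas as pd

# Parameter dictionary in the format shown in the core notes above.
param_dict = {
    "dataset_name": "/path/to/dataset",
    "model_name": "mymodelsname",
    "dataframe": pd.DataFrame([[0.42, 1.30, 7.11, 0]]),  # placeholder manipulated data
    "model_params": {},
    "save_path": "/path/to/save/model/to/",
    "log_path": "path/to/log/model/statistics",
    "manip_info": ("manip_name", "manip_tag"),
}

# Pickle the dictionary into data/.tmp for the plugin to pick up.
pickle_path = "data/.tmp/{}.pkl".format(uuid.uuid4().hex)
with open(pickle_path, "wb") as f:
    pickle.dump(param_dict, f)

# Call the user-specified plugin and block until training has completed.
subprocess.run(["python3", "/path/to/algorithm_plugin.py", "train", pickle_path], check=True)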

Attack Stage Reference:

The attack stage is dedicated to attacking the machine learning models trained during the training stage using user-defined attacks.

Core Notes:

  • This stage can be called before the training stage if the pretrained models already exist.
    • If the models do not exist, the attack stage will throw an error.
  • Models are auto-detected by Jespipe. The default format for models is .h5, but this can be easily changed in custom installations (a detection sketch follows this list).
  • Format of task directive:
["dataset_name", "dataset_path", "attack_type", "attack_tag", "attack_plugin", "model_plugin", "attack_parameters", "model_path", "model_root_path"]
  • Positional values of task directive:
    • 0: dataset_name
    • 1: dataset_path
    • 2: attack_type
    • 3: attack_tag
    • 4: attack_plugin
    • 5: model_plugin
    • 6: attack_parameters
    • 7: model_path
    • 8: model_root_path
  • If there are more nodes than directives, numpy will send an empty list [] to the extra worker nodes.
  • Two positional arguments are sent to the user-specified plugin using subprocess:
    • 0: stage -> value is "attack"
    • 1: parameters -> value is a reference to a pickled parameter dictionary stored in data/.tmp
      • The user will need to unpickle this dictionary in their plugin.
  • Still working on what the parameter dictionary is going to look like.
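
Model auto-detection can be sketched with a recursive glob over the default .h5 extension; the directory layout follows the data/$dataset_name/models path created during training, and the dataset name here is hypothetical:

import glob
import os

# Collect every pretrained model under the dataset's models directory.
model_root = "data/mydataset/models"  # hypothetical dataset name
models = glob.glob(os.path.join(model_root, "**", "*.h5"), recursive=True)

# The attack stage throws an error if the models do not exist.
if not models:
    raise FileNotFoundError("No pretrained models found; run the training stage first.")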

Manager Node:

  • Broadcast out to workers if the attack stage is being skipped or not.
    • 0 means no skip and continue with the attack stage.
    • 1 means skip attack stage and move onto cleaning stage.
  • Unwrap the attack dictionary into a more easily understandable format.
  • Check if there are any user-defined attack plugins and model plugins.
    • Convert relative paths to absolute paths and check if the files exist. If not, abort execution.
  • Create task directives for the worker nodes, and then slice the directive list based on how many available workers there are (a slicing sketch follows this list).
  • Send task list to worker nodes.
  • Block execution until workers report back to the manager node.
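
The empty-list behavior for extra workers noted in the core notes falls out of numpy's array_split; whether scattershot slices the directive list exactly this way is an assumption, but the sketch shows the effect:

import numpy as np

# Two directives split among four workers: the extra workers get empty lists.
directives = [["directive-1"], ["directive-2"]]  # placeholder task directives
slices = [chunk.tolist() for chunk in np.array_split(directives, 4)]
print(slices)  # [[['directive-1']], [['directive-2']], [], []]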

Worker Nodes:

TODO

Cleaning Stage Reference:

The cleaning stage is dedicated to performing the final routines of the pipeline. This consists of cleaning out the data/.tmp directory, compressing and moving the data directory, and creating plots using user-defined plotting plugins.

Core Notes:

  • There is no need to unwrap the clean_control dictionary since it is extremely simple.

Manager Node:

  • Broadcast out to workers whether or not the cleaning stage is being skipped.
    • 0 means no skip and continue with the cleaning stage.
    • 1 means skip cleaning stage and complete pipeline runtime.
  • Check if the user has defined any plots they would like rendered.
    • If plots is None, broadcast out 0 for greenlight message.
    • If plots is not None, broadcast out 1 for greenlight message.
    • If there are plots, create task list and send to worker nodes.
    • Block until all workers return a status of 1.
  • Check if user has requested to clean out the data/.tmp directory.
    • If clean_tmp is 0, leave data/.tmp alone.
    • If clean_tmp is 1, recursively delete data/.tmp.
  • Check if user has requested that the data directory be compressed.
    • If compress is None, continue.
    • If compress is not None, rename the data directory, compress it using the user-specified compression algorithm, and then move the compressed archive to the user-requested location on the system (a sketch follows this list).
  • Print final statements to stdout and then exit.
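
The compression step maps onto the standard library's shutil; a sketch assuming gzip-compressed tar as the user-specified algorithm and a hypothetical destination path:

import shutil

# Rename the data directory before archiving it.
shutil.move("data", "jespipe-run")  # hypothetical archive name

# Compress with the user-specified algorithm (gztar -> jespipe-run.tar.gz).
archive = shutil.make_archive("jespipe-run", "gztar", root_dir=".", base_dir="jespipe-run")

# Move the compressed archive to the user-requested location.
shutil.move(archive, "/user/requested/location/")  # hypothetical destination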

Worker Nodes:

  • Log whether or not the cleaning stage is being skipped.
    • If skipping the cleaning stage, exit.
    • If not skipping the cleaning stage, continue.
  • Wait for greenlight from manager to begin plotting output from model training and attacks.
    • If greenlight is 0, exit with status 0.
      • Exiting with status 0 because this means that the worker nodes are no longer required by the manager node.
    • If greenlight is 1, continue.
  • Log the task list received from the manager node.
    • If task list is empty, return status 1 to the manager node.
  • Create parameter dictionary and send to plotting plugin.
  • Call the plotting plugin using the subprocess.run() function and block until all plotting has completed.
  • Return status 1 to the manager node once all the plotting has been completed.
  • Exit, as the worker nodes are no longer needed by the manager node.