Adding TPCx-AI Benchmark #2061

108 changes: 108 additions & 0 deletions scripts/tpcx-ai/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,108 @@
# Implementation of TPCx-AI on Apache SystemDS

TPCx-AI is an express benchmark developed by the TPC (Transaction Processing Performance Council),
specifically tailored for end-to-end machine learning systems.
For further information, refer to the official documentation provided by the TPC:
[TPCx-AI documentation](https://www.tpc.org/TPC_Documents_Current_Versions/pdf/TPCX-AI_v1.0.3.1.pdf)

To run the TPCx-AI benchmark on SystemDS:
* Download the TPCx-AI benchmark kit from [TPC's website](https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp).
* Install and build SystemDS.
* Build the Python package and copy the wheel distribution to the TPCx-AI root directory.
* Copy the files from this directory (`tpcx-ai`) into the TPCx-AI benchmark kit root directory.
* Set the scale factor and the Java paths in `setenv_sds.sh`.
* Set up TPCx-AI by running `setup_python_sds.sh`.
* Generate data with `generate_data.sh`.
* Lastly, execute the benchmark using `TPCx-AI_Benchmarkrun_sds.sh`.

## Detailed Instructions

### Prerequisites

The following sections describe system prerequisites and steps to prepare and adapt the TPCx-AI benchmark kit to run on SystemDS.

#### Downloading the TPCx-AI Benchmark Kit

Go to [TPC's website](https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp) and download the TPCx-AI benchmark kit.
Extract the archive. The extracted directory is referred to below as the TPCx-AI root directory.


#### Building SystemDS

Go back to the SystemDS root directory and follow the installation guide for SystemDS: <http://apache.github.io/systemds/site/install>.
Then build the project with Maven:
```bash
mvn package -P distribution
```

#### Building the Python Package and Copying It to the TPCx-AI Root Directory

- From `SYSTEMDS_ROOT/src/main/python` run `create_python_dist.py`.

```bash
python3 create_python_dist.py
```

- The `./dist` directory now contains the source distribution `systemds-VERSION.tar.gz`
and the wheel distribution `systemds-VERSION-py3-none-any.whl`, where `VERSION` is the current version number.
- Copy `systemds-VERSION-py3-none-any.whl` to the TPCx-AI benchmark kit root directory.
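
The copy step can be scripted. The following is a minimal sketch, not part of the benchmark kit: `copy_wheel` is a hypothetical helper, and both of its arguments (the `dist` directory and the TPCx-AI root directory) are placeholders for your own paths.

```bash
# Hypothetical helper: copy the built SystemDS wheel(s) into the TPCx-AI root directory.
# $1 = dist directory created by create_python_dist.py, $2 = TPCx-AI root directory.
copy_wheel() {
  dist_dir=$1
  tpcxai_root=$2
  found=0
  for wheel in "$dist_dir"/systemds-*-py3-none-any.whl; do
    # An unmatched glob stays literal, so confirm the file really exists.
    [ -f "$wheel" ] || continue
    cp "$wheel" "$tpcxai_root"/ || return 1
    found=1
  done
  if [ "$found" -eq 0 ]; then
    echo "no SystemDS wheel found in $dist_dir" >&2
    return 1
  fi
}
```

For example, `copy_wheel ./dist /path/to/tpcx-ai-root` when run from `SYSTEMDS_ROOT/src/main/python`. Using a glob avoids hard-coding `VERSION`, at the cost of copying every matching wheel if several versions are present.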

#### Transferring Files to TPCx-AI Root Directory

The following files need to be copied from this directory into the TPCx-AI root directory:

- generate_data.sh
- setenv_sds.sh
- setup_python_sds.sh
- TPCx-AI_Benchmarkrun_sds.sh

The following directories and files in the TPCx-AI benchmark kit directory need to be **replaced**:

- Replace the `driver` directory with the `driver` directory from this directory.
- In the TPCx-AI root directory, navigate to `workload/python/workload` and replace the 10 use case files with the files from the `use_cases` directory.
- Replace `tpcxai_fdr.py` and `tpcxai_fdr_template.html` in the `TPCx-AI_ROOT/tools` directory with the files from this directory.
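
The copy and replace steps above can be sketched as a single shell function. This is an illustration only: `install_sds_files` is a hypothetical name, the two arguments are placeholders, and the sketch assumes that the kit's driver directory lives at `TPCx-AI_ROOT/driver`.

```bash
# Hypothetical helper: copy the SystemDS-specific files into the benchmark kit.
# $1 = this directory (scripts/tpcx-ai), $2 = TPCx-AI root directory.
install_sds_files() {
  src=$1
  dst=$2
  # Refuse to continue with missing directories, since a directory is removed below.
  [ -d "$src" ] && [ -d "$dst" ] || return 1
  # Top-level scripts.
  for f in generate_data.sh setenv_sds.sh setup_python_sds.sh TPCx-AI_Benchmarkrun_sds.sh; do
    cp "$src/$f" "$dst/" || return 1
  done
  # Replace the driver directory (assumed location: $dst/driver).
  rm -rf "$dst/driver"
  cp -R "$src/driver" "$dst/driver" || return 1
  # Overwrite the use case files.
  cp "$src"/use_cases/* "$dst/workload/python/workload/" || return 1
  # Replace the report tools.
  cp "$src/tpcxai_fdr.py" "$src/tpcxai_fdr_template.html" "$dst/tools/" || return 1
}
```

Check the paths against your copy of the kit before running it, for example as `install_sds_files . /path/to/tpcx-ai-root`, because the existing driver directory is deleted.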

#### Setting Up TPCx-AI

The benchmark kit is now ready for setup and installation.
Prerequisites for running are:
* Java 8
* Java 11
* Python 3.6+
* Anaconda3/Conda4+
* The binaries `java`, `sbt` and `conda` must be on the `PATH` environment variable and take priority over other installations.
* Disk space: make sure you have enough disk space in `output/raw_data` to store the generated test data.
The value of `TPCxAI_SCALE_FACTOR` in the `setenv_sds.sh` file determines the approximate size (in GB) of the dataset that is generated and used during benchmark execution.
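
Whether the required binaries are on the `PATH` can be checked before running the setup. The following sketch is not part of the kit, and `missing_binaries` is a hypothetical helper name. It prints each binary that cannot be found and returns a non-zero status if any is missing.

```bash
# Hypothetical helper: print every given binary that is not on PATH.
# Returns 0 if all binaries were found and 1 otherwise.
missing_binaries() {
  status=0
  for bin in "$@"; do
    if ! command -v "$bin" >/dev/null 2>&1; then
      echo "$bin"
      status=1
    fi
  done
  return "$status"
}
```

For example, `missing_binaries java sbt conda` prints nothing when all three tools are available. It does not verify which version comes first on the `PATH`, so the priority requirement still needs a manual check.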

For more detailed information and optional setup possibilities, refer to the official TPCx-AI documentation:
[TPCx-AI documentation](https://www.tpc.org/TPC_Documents_Current_Versions/pdf/TPCX-AI_v1.0.3.1.pdf).

#### Setting up Environment in setenv_sds.sh

Three variables need to be set in the `setenv_sds.sh` file prior to setup:
* JAVA8_HOME: Set this variable to the path to the home directory of your Java 8 version.
* JAVA11_HOME: Set this variable to the path to the home directory of your Java 11 version.
* TPCxAI_SCALE_FACTOR: Set to the desired scale factor of the data set; the default is 1.
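
A quick sanity check of the two Java variables can catch typos before the setup script runs. This is a minimal sketch under the assumption that a valid Java home contains an executable `bin/java`. The helper names `is_java_home` and `check_java_homes` are hypothetical and not part of the kit.

```bash
# Succeed if directory $1 looks like a Java home (contains an executable bin/java).
is_java_home() {
  [ -n "$1" ] && [ -x "$1/bin/java" ]
}

# Check JAVA8_HOME and JAVA11_HOME as set in the current shell.
check_java_homes() {
  rc=0
  for var in JAVA8_HOME JAVA11_HOME; do
    # Look up the value of the variable whose name is stored in $var.
    eval "dir=\${$var:-}"
    if ! is_java_home "$dir"; then
      echo "$var does not point to a Java home: '$dir'" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

After sourcing the environment file with `. ./setenv_sds.sh`, calling `check_java_homes` reports any variable that is unset or points to the wrong place. It does not verify that the Java versions are actually 8 and 11.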

#### Running the Setup Script

* This implementation is based on SystemDS version 3.3.0 (commit id: 5ad67e8). If you want to use a different version, make sure to set the correct filename for the SystemDS distribution in the `setup_python_sds.sh` file.
The filename should match the version and build of SystemDS in your environment.
* Run `setup_python_sds.sh` to automatically set up the benchmark,
install all the necessary libraries, install SystemDS, and set up the virtual environments.

## Benchmark Execution

### Data Generation

Before running the benchmark, data needs to be generated. To generate data, run the `generate_data.sh` script. The size of the generated data is controlled by the
`TPCxAI_SCALE_FACTOR` variable in the `setenv_sds.sh` file. The default value is 1, which generates a dataset of approximately 1 GB.
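
Because the scale factor roughly corresponds to the dataset size in GB, available disk space can be compared against it before generating data. The sketch below is an illustration: `has_free_gb` is a hypothetical helper, it assumes an integer scale factor, and it treats the scale factor as a lower bound only, since the benchmark may need additional space beyond the raw data.

```bash
# Succeed if the file system holding directory $1 has at least $2 GB available.
# Uses POSIX `df -Pk`, whose fourth column is the available space in KB.
has_free_gb() {
  avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
  [ "$avail_kb" -ge $(( $2 * 1024 * 1024 )) ]
}
```

Run from the TPCx-AI root directory, `has_free_gb . "$TPCxAI_SCALE_FACTOR" || echo "not enough disk space" >&2` gives an early warning before `generate_data.sh` is started.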

### Running the Benchmark

Now the benchmark can be executed by running the `TPCx-AI_Benchmarkrun_sds.sh` script.

```bash
./TPCx-AI_Benchmarkrun_sds.sh
```
90 changes: 90 additions & 0 deletions scripts/tpcx-ai/TPCx-AI_Benchmarkrun_sds.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,90 @@
#!/bin/bash

#
# Copyright (C) 2021 Transaction Processing Performance Council (TPC) and/or its contributors.
# This file is part of a software package distributed by the TPC
# The contents of this file have been developed by the TPC, and/or have been licensed to the TPC under one or more contributor
# license agreements.
# This file is subject to the terms and conditions outlined in the End-User
# License Agreement (EULA) which can be found in this distribution (EULA.txt) and is available at the following URL:
# http://www.tpc.org/TPC_Documents_Current_Versions/txt/EULA.txt
# Unless required by applicable law or agreed to in writing, this software is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, and the user bears the entire risk as to quality
# and performance as well as the entire cost of service or repair in case of defect. See the EULA for more details.
#


#
# Copyright 2019 Intel Corporation.
# This software and the related documents are Intel copyrighted materials, and your use of them
# is governed by the express license under which they were provided to you ("License"). Unless the
# License provides otherwise, you may not use, modify, copy, publish, distribute, disclose or
# transmit this software or the related documents without Intel's prior written permission.
#
# This software and the related documents are provided as is, with no express or implied warranties,
# other than those that are expressly stated in the License.
#
#

# Stop if any command fails
set -e

. setenv_sds.sh

LOG_DEST="tpcxai_benchmark_run"
TPCxAI_CONFIG_FILE_PATH=${TPCxAI_BENCHMARKRUN_CONFIG_FILE_PATH}
if [[ ${IS_VALIDATION_RUN} -eq "1" ]]; then
  echo "Benchmark validation run. Setting scale factor value to 1..."
  export TPCxAI_SCALE_FACTOR=1
  TPCxAI_CONFIG_FILE_PATH=${TPCxAI_VALIDATION_CONFIG_FILE_PATH}
  LOG_DEST="tpcxai_benchmark_validation"
fi

if [[ ${TPCx_AI_VERBOSE} == "True" ]]; then
  VFLAG="-v"
fi

echo "TPCx-AI_HOME directory: ${TPCx_AI_HOME_DIR}";
echo "Using configuration file: ${TPCxAI_CONFIG_FILE_PATH} and Scale factor ${TPCxAI_SCALE_FACTOR}..."

PATH=$JAVA11_HOME/bin:$PATH
export JAVA11_HOME
export PATH
echo "Using Java at $JAVA11_HOME"

echo "Starting Benchmark run..."
sleep 1;

bash ${TPCxAI_ENV_TOOLS_DIR}/clock_check.sh start

./bin/tpcxai.sh --phase {CLEAN,DATA_GENERATION,LOADING,TRAINING,SERVING,SERVING,SERVING_THROUGHPUT,SCORING,CHECK_INTEGRITY} -sf ${TPCxAI_SCALE_FACTOR} --streams ${TPCxAI_SERVING_THROUGHPUT_STREAMS} -c ${TPCxAI_CONFIG_FILE_PATH} ${VFLAG}

BENCHMARK_RUN_EXIT_CODE=$?

if [ ${BENCHMARK_RUN_EXIT_CODE} -eq "0" ]; then
  echo "Generating report..."
  lib/python-venv/bin/python tools/tpcxai_fdr.py -d logs/tpcxai.db -f logs/report.txt
  lib/python-venv/bin/python tools/tpcxai_fdr.py -d logs/tpcxai.db -t html -f logs/report.html
  echo "Finished generating report";

  echo ""

  echo "Saving output data. This may take a few minutes ..."
  bash ${TPCxAI_ENV_TOOLS_DIR}/saveOutputData.sh
  echo "Finished saving output data"
  echo ""

  echo "Saving execution environment details..."
  bash ${TPCxAI_ENV_TOOLS_DIR}/getEnvInfo.sh
  echo "Finished saving environment details"
  echo ""
  echo "Saving data redundancy information..."
  bash ${TPCxAI_ENV_TOOLS_DIR}/dataRedundancyInformation.sh
  echo "Finished saving data redundancy"
fi

bash ${TPCxAI_ENV_TOOLS_DIR}/clock_check.sh end
# %M is minutes; %m (month) was used in the time part by mistake
LOG_DEST="${LOG_DEST}_$(date +"%m%d%Y_%H%M%S")"
mkdir -p "logs/history/${LOG_DEST}"
find logs/ -mindepth 1 -maxdepth 1 -path logs/history -prune -o -print | xargs -I file mv file "logs/history/${LOG_DEST}"
zip -vr "logs/history/${LOG_DEST}.zip" "logs/history/${LOG_DEST}" -x logs/history/*