Initial commit of Node Affinity Anti-Affinity workload files #91

Open: wants to merge 1 commit into base `master`.
2 changes: 2 additions & 0 deletions docs/README.md
@@ -17,6 +17,7 @@
| [Namespaces per cluster](namespaces-per-cluster.md) | Maximum Namespaces | None |
| [Services per namespace](services-per-namespace.md) | Maximum services per namespace | None |
| [FIO I/O test](fio.md) | FIO I/O test - stress storage backend | Privileged Containers, Working storage class |
| [Node Affinity Anti-Affinity](node-affinity.md) | Node Affinity Anti-Affinity test | Privileged Containers |

* The baseline job without a tooled cluster just idles a cluster. The goal is to capture resource consumption over a period of time to characterize resource requirements, thus tooling is required. (For now)

@@ -53,3 +54,4 @@ Each workload will implement a form of pass/fail criteria in order to flag if th
| [Namespaces per cluster](namespaces-per-cluster.md) | Yes: Exit code, Test Duration |
| [Services per namespace](services-per-namespace.md) | Yes: Exit code, Test Duration |
| [FIO I/O test](fio.md) | No |
| [Node Affinity Anti-Affinity](node-affinity.md) | Yes: Exit code |
92 changes: 92 additions & 0 deletions docs/node-affinity.md
@@ -0,0 +1,92 @@
# Node Affinity Anti-Affinity Workload

The Node Affinity Anti-Affinity workload playbook is `workloads/node-affinity.yml` and runs the Node Affinity Anti-Affinity workload on your cluster.

The Node Affinity Anti-Affinity workload's purpose is to validate whether the OpenShift cluster can deploy 130 pause-pods with node affinity to one labeled worker node, and 130 hello-pods with anti-affinity to another labeled worker node. Deployed pods have memory and CPU requests, and the goal is to be close to 85% of CPU capacity after all pods are deployed.
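The actual pod specs live in the svt repo's scripts rather than in this PR, but as a hedged illustration only, a pause-pod pinned to a labeled worker might look like the sketch below (the label key `test-node`, its value, the image, and the request sizes are assumptions for illustration, not taken from the svt scripts):

```yaml
# Hypothetical pause-pod with required node affinity (illustrative only).
apiVersion: v1
kind: Pod
metadata:
  name: pause-pod-example
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: test-node          # assumed label key; the svt scripts define the real one
            operator: In
            values: ["affinity"]
  containers:
  - name: pause
    image: k8s.gcr.io/pause:3.1
    resources:
      requests:
        cpu: 100m                   # pods carry CPU/memory requests per the text above
        memory: 64Mi
```

The hello-pods would use an analogous `podAntiAffinity` stanza instead, repelling them from a second labeled node.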
> **Review comment (Member):** 85% is subjective to the worker node instance type right?


An OCP cluster with 3 master and 3 worker nodes is required.

Running from CLI:

```sh
$ cp workloads/inventory.example inventory
$ # Add orchestration host to inventory
$ # Edit vars in workloads/vars/node-affinity.yml or define Environment vars (See below)
$ time ansible-playbook -vv -i inventory workloads/node-affinity.yml
```

## Environment variables

### PUBLIC_KEY
Default: `~/.ssh/id_rsa.pub`
Public ssh key file for Ansible.

### PRIVATE_KEY
Default: `~/.ssh/id_rsa`
Private ssh key file for Ansible.

### ORCHESTRATION_USER
Default: `root`
User for Ansible to log in as. Must authenticate with PUBLIC_KEY/PRIVATE_KEY.

### WORKLOAD_IMAGE
Default: `quay.io/openshift-scale/scale-ci-workload`
Container image that runs the workload script.

### WORKLOAD_JOB_NODE_SELECTOR
Default: `false`
Enables/disables the node selector that places the workload job on the `workload` node.

### WORKLOAD_JOB_TAINT
Default: `false`
Enables/disables the toleration on the workload job to permit the `workload` taint.

### WORKLOAD_JOB_PRIVILEGED
Default: `true`
Enables/disables running the workload pod as privileged.

### KUBECONFIG_FILE
Default: `~/.kube/config`
Location of kubeconfig on orchestration host.

### PBENCH_INSTRUMENTATION
Default: `false`
Enables/disables running the workload wrapped by pbench-user-benchmark. When enabled, pbench agents can then be enabled (`ENABLE_PBENCH_AGENTS`) for further instrumentation data and pbench-copy-results can be enabled (`ENABLE_PBENCH_COPY`) to export captured data for further analysis.

### ENABLE_PBENCH_AGENTS
Default: `false`
Enables/disables the collection of pbench data on the pbench agent Pods. These Pods are deployed by the tooling playbook.

### ENABLE_PBENCH_COPY
Default: `false`
Enables/disables the copying of pbench data to a remote results server for further analysis.

### PBENCH_SSH_PRIVATE_KEY_FILE
Default: `~/.ssh/id_rsa`
Location of ssh private key to authenticate to the pbench results server.

### PBENCH_SSH_PUBLIC_KEY_FILE
Default: `~/.ssh/id_rsa.pub`
Location of the ssh public key to authenticate to the pbench results server.

### PBENCH_SERVER
Default: There is no public default.
DNS address of the pbench results server.

### SCALE_CI_RESULTS_TOKEN
Default: There is no public default.
Reserved for future use: allows the pbench and Prometheus scrapers to place results into the git repository that holds results data.

### JOB_COMPLETION_POLL_ATTEMPTS
Default: `360`
Number of retries for Ansible to poll if the workload job has completed. Poll attempts delay 10s between polls with some additional time taken for each polling action depending on the orchestration host setup.
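With the defaults above, the maximum wait is roughly attempts times delay (plus per-poll overhead), which can be sanity-checked as:

```shell
# Rough upper bound on job-completion polling with the documented defaults.
attempts=360
delay=10   # seconds between polls
max_wait=$((attempts * delay))
echo "max wait: ${max_wait}s (~$((max_wait / 60)) minutes)"
```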

### NODE_AFFINITY_TEST_PREFIX
Default: `node-affinity`
Test prefix applied to the pbench results.

## Smoke test variables

```sh
NODE_AFFINITY_TEST_PREFIX=node-affinity_smoke
```
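Shell's `${VAR:-default}` expansion mirrors the Ansible `lookup('env', ...)|default(..., true)` fallback these variables use; the smoke prefix simply overrides the `node-affinity` default (a sketch, not part of the playbook):

```shell
# How an unset env var falls back to the documented default prefix.
prefix="${NODE_AFFINITY_TEST_PREFIX:-node-affinity}"
echo "pbench result prefix: ${prefix}"
```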
97 changes: 97 additions & 0 deletions workloads/files/workload-node-affinity-script-cm.yml
@@ -0,0 +1,97 @@
apiVersion: v1
kind: ConfigMap
metadata:
name: scale-ci-workload-script
data:
run.sh: |
#!/bin/sh
set -eo pipefail
workload_log() { echo "$(date -u) $@" >&2; }
export -f workload_log
workload_log "Configuring pbench for Node Affinity Anti-Affinity"
mkdir -p /var/lib/pbench-agent/tools-default/
echo "${USER_NAME:-default}:x:$(id -u):0:${USER_NAME:-default} user:${HOME}:/sbin/nologin" >> /etc/passwd
if [ "${ENABLE_PBENCH_AGENTS}" = true ]; then
echo "" > /var/lib/pbench-agent/tools-default/disk
echo "" > /var/lib/pbench-agent/tools-default/iostat
echo "workload" > /var/lib/pbench-agent/tools-default/label
echo "" > /var/lib/pbench-agent/tools-default/mpstat
echo "" > /var/lib/pbench-agent/tools-default/oc
echo "" > /var/lib/pbench-agent/tools-default/perf
echo "" > /var/lib/pbench-agent/tools-default/pidstat
echo "" > /var/lib/pbench-agent/tools-default/sar
master_nodes=`oc get nodes -l pbench_agent=true,node-role.kubernetes.io/master= --no-headers | awk '{print $1}'`
for node in $master_nodes; do
echo "master" > /var/lib/pbench-agent/tools-default/remote@$node
done
infra_nodes=`oc get nodes -l pbench_agent=true,node-role.kubernetes.io/infra= --no-headers | awk '{print $1}'`
for node in $infra_nodes; do
echo "infra" > /var/lib/pbench-agent/tools-default/remote@$node
done
worker_nodes=`oc get nodes -l pbench_agent=true,node-role.kubernetes.io/worker= --no-headers | awk '{print $1}'`
for node in $worker_nodes; do
echo "worker" > /var/lib/pbench-agent/tools-default/remote@$node
done
fi
source /opt/pbench-agent/profile
workload_log "Done configuring pbench for Node Affinity Anti-Affinity"

workload_log "Running Node Affinity Anti-Affinity workload"
if [ "${PBENCH_INSTRUMENTATION}" = "true" ]; then
pbench-user-benchmark -- sh /root/workload/workload.sh
result_dir="/var/lib/pbench-agent/$(ls -t /var/lib/pbench-agent/ | grep "pbench-user" | head -1)"/1/sample1
if [ "${ENABLE_PBENCH_COPY}" = "true" ]; then
pbench-copy-results --prefix ${NODE_AFFINITY_TEST_PREFIX}
fi
else
sh /root/workload/workload.sh
result_dir=/tmp
fi

workload_log "Completed Node Affinity and Anti-Affinity workload run"

workload_log "Checking Test Results"
workload_log "Checking script run-node-affinity-anti-affinity.sh execution exit code : $(jq .exit_code ${result_dir}/exit.json)"

if [ "$(jq '.exit_code==0' ${result_dir}/exit.json)" = "false" ]; then
workload_log "Node Affinity Anti-Affinity Test Failure"
workload_log "Test Analysis: Failed"
exit 1
fi
# TODO: Check pbench-agent collected metrics for Pass/Fail
# TODO: Check prometheus collected metrics for Pass/Fail
workload_log "Test Analysis: Passed"

workload.sh: |
#!/bin/sh
set -o pipefail

result_dir=/tmp
if [ "${PBENCH_INSTRUMENTATION}" = "true" ]; then
result_dir=${benchmark_results_dir}
fi

# git clone svt repo in /root
> **Review comment (Member):** Maybe add a pointer to the script and pod spec/template in the svt repo since they are being maintained there. This way, we will know the location in case we need to modify something. On the other hand, it might be a good idea to bake the script and pod templates into the workload itself like we do for other workloads such as nodevertical, for example: https://github.com/openshift-scale/workloads/blob/master/workloads/templates/workload-nodevertical-script-cm.yml.j2#L75. This way we will have a clear idea of what the workload is doing, and it will be easier to modify if we want to change something. Thoughts?
cd /root
git clone https://github.com/openshift/svt.git
cd svt
git status
cd /root/svt/openshift_scalability/ci/scripts
ls -ltr

start_time=$(date +%s)
my_time=$(date +%Y-%m-%d-%H%M)

# run run-node-affinity-anti-affinity.sh
./run-node-affinity-anti-affinity.sh 2>&1 | tee /tmp/output-node-affinity-${my_time}.log

exit_code=$?
end_time=$(date +%s)
duration=$((end_time-start_time))
workload_log "Test duration was: ${duration} seconds"

workload_log "Output of script run-node-affinity-anti-affinity.sh execution: $(cat /tmp/output-node-affinity-${my_time}.log)"

workload_log "Writing script run-node-affinity-anti-affinity.sh execution exit code : ${exit_code}"
jq -n "{exit_code: ${exit_code}, duration: ${duration}}" > "${result_dir}/exit.json"
workload_log "Finished workload script"
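The pass/fail gate in `run.sh` hinges on a single `jq` expression over the `exit.json` that `workload.sh` writes; the same check can be reproduced standalone against a sample file (a sketch of the logic only, using a temp directory instead of the pbench results dir):

```shell
# Reproduce the exit.json pass/fail check from run.sh on a sample result.
result_dir=$(mktemp -d)
printf '{"exit_code":0,"duration":42}\n' > "${result_dir}/exit.json"

# jq prints "true" when exit_code is 0, "false" otherwise.
if [ "$(jq '.exit_code==0' "${result_dir}/exit.json")" = "false" ]; then
  echo "Test Analysis: Failed"
else
  echo "Test Analysis: Passed"
fi
```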
131 changes: 131 additions & 0 deletions workloads/node-affinity.yml
@@ -0,0 +1,131 @@
---
#
# Runs Node Affinity Anti-Affinity on OpenShift 4.x cluster
#

- name: Runs node-affinity on a RHCOS OpenShift cluster
hosts: orchestration
gather_facts: true
remote_user: "{{orchestration_user}}"
vars_files:
- vars/node-affinity.yml
vars:
workload_job: "node-affinity"
tasks:
- name: Create scale-ci-tooling directory
file:
path: "{{ansible_user_dir}}/scale-ci-tooling"
state: directory

- name: Copy workload files
copy:
src: "{{item.src}}"
dest: "{{item.dest}}"
with_items:
- src: scale-ci-tooling-ns.yml
dest: "{{ansible_user_dir}}/scale-ci-tooling/scale-ci-tooling-ns.yml"
- src: workload-node-affinity-script-cm.yml
dest: "{{ansible_user_dir}}/scale-ci-tooling/workload-node-affinity-script-cm.yml"

- name: Slurp kubeconfig file
slurp:
src: "{{kubeconfig_file}}"
register: kubeconfig_file_slurp

- name: Slurp ssh private key file
slurp:
src: "{{pbench_ssh_private_key_file}}"
register: pbench_ssh_private_key_file_slurp

- name: Slurp ssh public key file
slurp:
src: "{{pbench_ssh_public_key_file}}"
register: pbench_ssh_public_key_file_slurp

- name: Template workload templates
template:
src: "{{item.src}}"
dest: "{{item.dest}}"
with_items:
- src: pbench-cm.yml.j2
dest: "{{ansible_user_dir}}/scale-ci-tooling/pbench-cm.yml"
- src: pbench-ssh-secret.yml.j2
dest: "{{ansible_user_dir}}/scale-ci-tooling/pbench-ssh-secret.yml"
- src: kubeconfig-secret.yml.j2
dest: "{{ansible_user_dir}}/scale-ci-tooling/kubeconfig-secret.yml"
- src: workload-job.yml.j2
dest: "{{ansible_user_dir}}/scale-ci-tooling/workload-job.yml"
- src: workload-env.yml.j2
dest: "{{ansible_user_dir}}/scale-ci-tooling/workload-node-affinity-env.yml"

- name: Check if scale-ci-tooling namespace exists
shell: |
oc get project scale-ci-tooling
ignore_errors: true
changed_when: false
register: scale_ci_tooling_ns_exists

- name: Ensure any stale scale-ci-node-affinity job is deleted
shell: |
oc delete job scale-ci-node-affinity -n scale-ci-tooling
register: scale_ci_tooling_project
failed_when: scale_ci_tooling_project.rc == 0
until: scale_ci_tooling_project.rc == 1
retries: 60
delay: 1
when: scale_ci_tooling_ns_exists.rc == 0

- name: Block for non-existing tooling namespace
block:
- name: Create tooling namespace
shell: |
oc create -f {{ansible_user_dir}}/scale-ci-tooling/scale-ci-tooling-ns.yml

- name: Create tooling service account
shell: |
oc create serviceaccount useroot -n scale-ci-tooling
oc adm policy add-scc-to-user privileged -z useroot -n scale-ci-tooling
when: enable_pbench_agents|bool
when: scale_ci_tooling_ns_exists.rc != 0

- name: Create/replace kubeconfig secret
shell: |
oc replace --force -n scale-ci-tooling -f "{{ansible_user_dir}}/scale-ci-tooling/kubeconfig-secret.yml"

- name: Create/replace the pbench configmap
shell: |
oc replace --force -n scale-ci-tooling -f "{{ansible_user_dir}}/scale-ci-tooling/pbench-cm.yml"

- name: Create/replace pbench ssh secret
shell: |
oc replace --force -n scale-ci-tooling -f "{{ansible_user_dir}}/scale-ci-tooling/pbench-ssh-secret.yml"

- name: Create/replace workload script configmap
shell: |
oc replace --force -n scale-ci-tooling -f "{{ansible_user_dir}}/scale-ci-tooling/workload-node-affinity-script-cm.yml"

- name: Create/replace workload script environment configmap
shell: |
oc replace --force -n scale-ci-tooling -f "{{ansible_user_dir}}/scale-ci-tooling/workload-node-affinity-env.yml"

- name: Create/replace workload job that runs the workload script
shell: |
oc replace --force -n scale-ci-tooling -f "{{ansible_user_dir}}/scale-ci-tooling/workload-job.yml"

- name: Poll until job pod is running
shell: |
oc get pods --selector=job-name=scale-ci-node-affinity -n scale-ci-tooling -o json
register: pod_json
retries: 60
delay: 2
until: pod_json.stdout | from_json | json_query('items[0].status.phase==`Running`')

- name: Poll until job is complete
shell: |
oc get job scale-ci-node-affinity -n scale-ci-tooling -o json
register: job_json
retries: "{{job_completion_poll_attempts}}"
delay: 10
until: job_json.stdout | from_json | json_query('status.succeeded==`1` || status.failed==`1`')
failed_when: job_json.stdout | from_json | json_query('status.succeeded==`1`') == false
when: job_completion_poll_attempts|int > 0
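The two `json_query` polls above can be checked by hand with `jq` against canned job JSON (a sketch only; the live playbook feeds `oc get job scale-ci-node-affinity -o json` into the same condition):

```shell
# Evaluate the playbook's completion condition against a sample job status.
job_json='{"status":{"succeeded":1}}'
succeeded=$(printf '%s' "$job_json" | jq '.status.succeeded == 1')
failed=$(printf '%s' "$job_json" | jq '.status.failed == 1')
if [ "$succeeded" = "true" ]; then
  echo "job complete: success"
elif [ "$failed" = "true" ]; then
  echo "job complete: failure"
else
  echo "job still running"
fi
```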
4 changes: 4 additions & 0 deletions workloads/templates/workload-env.yml.j2
@@ -103,4 +103,8 @@ data:
PROMETHEUS_GRAPH_PERIOD: "{{prometheus_graph_period}}"
PROMETHEUS_REFRESH_INTERVAL: "{{prometheus_refresh_interval}}"
PROMETHEUS_SCALE_TEST_PREFIX: "{{prometheus_scale_test_prefix}}"
{% elif workload_job == "node-affinity" %}
PBENCH_INSTRUMENTATION: "{{pbench_instrumentation|bool|lower}}"
ENABLE_PBENCH_COPY: "{{enable_pbench_copy|bool|lower}}"
NODE_AFFINITY_TEST_PREFIX: "{{node_affinity_test_prefix}}"
{% endif %}
33 changes: 33 additions & 0 deletions workloads/vars/node-affinity.yml
@@ -0,0 +1,33 @@
---
###############################################################################
# Ansible SSH variables.
###############################################################################
ansible_public_key_file: "{{ lookup('env', 'PUBLIC_KEY')|default('~/.ssh/id_rsa.pub', true) }}"
ansible_private_key_file: "{{ lookup('env', 'PRIVATE_KEY')|default('~/.ssh/id_rsa', true) }}"

orchestration_user: "{{ lookup('env', 'ORCHESTRATION_USER')|default('root', true) }}"
###############################################################################
# Node affinity workload variables.
###############################################################################
workload_image: "{{ lookup('env', 'WORKLOAD_IMAGE')|default('quay.io/openshift-scale/scale-ci-workload', true) }}"

workload_job_node_selector: "{{ lookup('env', 'WORKLOAD_JOB_NODE_SELECTOR')|default(false, true)|bool }}"
workload_job_taint: "{{ lookup('env', 'WORKLOAD_JOB_TAINT')|default(false, true)|bool }}"
workload_job_privileged: "{{ lookup('env', 'WORKLOAD_JOB_PRIVILEGED')|default(true, true)|bool }}"

kubeconfig_file: "{{ lookup('env', 'KUBECONFIG_FILE')|default('~/.kube/config', true) }}"

# pbench variables
pbench_instrumentation: "{{ lookup('env', 'PBENCH_INSTRUMENTATION')|default(false, true)|bool|lower }}"
enable_pbench_agents: "{{ lookup('env', 'ENABLE_PBENCH_AGENTS')|default(false, true)|bool }}"
enable_pbench_copy: "{{ lookup('env', 'ENABLE_PBENCH_COPY')|default(false, true)|bool|lower }}"
pbench_ssh_private_key_file: "{{ lookup('env', 'PBENCH_SSH_PRIVATE_KEY_FILE')|default('~/.ssh/id_rsa', true) }}"
pbench_ssh_public_key_file: "{{ lookup('env', 'PBENCH_SSH_PUBLIC_KEY_FILE')|default('~/.ssh/id_rsa.pub', true) }}"
pbench_server: "{{ lookup('env', 'PBENCH_SERVER')|default('', true) }}"

# Other variables for workload tests
scale_ci_results_token: "{{ lookup('env', 'SCALE_CI_RESULTS_TOKEN')|default('', true) }}"
job_completion_poll_attempts: "{{ lookup('env', 'JOB_COMPLETION_POLL_ATTEMPTS')|default(360, true)|int }}"

# node affinity and anti-affinity workload specific parameters:
node_affinity_test_prefix: "{{ lookup('env', 'NODE_AFFINITY_TEST_PREFIX')|default('node-affinity', true) }}"