Add upgrade workload
akrzos committed Aug 16, 2019
1 parent 7b32be0 commit 9665e52
Showing 6 changed files with 450 additions and 31 deletions.
64 changes: 33 additions & 31 deletions docs/README.md
# Table of workloads

| Workload/tooling | Short Description | Minimum Requirements |
|:--------------------------------------------------- |:----------------------------------------- | -------------------------------------------- |
| [Tooling](tooling.md) | Setup pbench instrumentation tools | Cluster-admin, Privileged Containers |
| [Test](test.md) | Test/Run your workload from ssh Container | Cluster-admin, Privileged Containers |
| [Baseline](baseline.md) | Baseline metrics capture | Tooling job* |
| [Scale](scale.md) | Scales worker nodes | Cluster-admin |
| [NodeVertical](nodevertical.md) | Node Kubelet Density | Labeling Nodes |
| [PodVertical](podvertical.md) | Max Pod Density | None |
| [MasterVertical](mastervertical.md) | Master Node Stress workload | None |
| [HTTP](http.md) | HTTP ingress TPS/Latency | None |
| [Network](network.md) | TCP/UDP Throughput/Latency | Labeling Nodes, [See below](#network) |
| [Deployments Per Namespace](deployments-per-ns.md) | Maximum Deployments | None |
| [PVCscale](pvscale.md)                              | PVCScale test                              | Working storage class                         |
| [Conformance](conformance.md) | OCP/Kubernetes e2e tests | None |
| [Namespaces per cluster](namespaces-per-cluster.md) | Maximum Namespaces | None |
| [Services per namespace](services-per-namespace.md) | Maximum services per namespace | None |
| [FIO I/O test](fio.md) | FIO I/O test - stress storage backend | Privileged Containers, Working storage class |
| [Upgrade](upgrade.md) | Upgrades cluster | Cluster-admin |

* A Baseline job run without a tooled cluster simply idles the cluster. The goal is to capture resource consumption over a period of time to characterize resource requirements, thus tooling is required. (For now)


Each workload implements a form of pass/fail criteria so that failing tests can be flagged in CI.

| Workload/tooling | Pass/Fail |
|:--------------------------------------------------- |:----------------------------- |
| [Tooling](tooling.md) | NA |
| [Test](test.md) | NA |
| [Baseline](baseline.md) | NA |
| [Scale](scale.md) | Yes: Test Duration |
| [NodeVertical](nodevertical.md) | Yes: Exit Code, Test Duration |
| [PodVertical](podvertical.md) | Yes: Exit Code, Test Duration |
| [MasterVertical](mastervertical.md) | Yes: Exit Code, Test Duration |
| [HTTP](http.md) | No |
| [Network](network.md) | No |
| [Deployments Per Namespace](deployments-per-ns.md) | No |
| [PVCscale](pvscale.md) | No |
| [Conformance](conformance.md) | No |
| [Namespaces per cluster](namespaces-per-cluster.md) | Yes: Exit code, Test Duration |
| [Services per namespace](services-per-namespace.md) | Yes: Exit code, Test Duration |
| [FIO I/O test](fio.md) | No |
| [Upgrade](upgrade.md) | Yes: Test Duration |
114 changes: 114 additions & 0 deletions docs/upgrade.md
# Upgrade Workload

The upgrade workload playbook is `workloads/upgrade.yml` and will upgrade a cluster with or without tooling.

Note that upgrades can reboot nodes, so a pbench agent pod that is actively collecting data on a rebooted node will be interrupted. As with cloud-native workloads in general, pods are expected to be ephemeral.

Running from CLI:

```sh
$ cp workloads/inventory.example inventory
$ # Add orchestration host to inventory
$ # Edit vars in workloads/vars/upgrade.yml or define Environment vars (See below)
$ time ansible-playbook -vv -i inventory workloads/upgrade.yml
```
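
Each of the variables documented below can also be exported in the environment instead of editing the vars file; a minimal sketch (values illustrative):

```sh
# Hypothetical values; see the environment variable reference below.
export UPGRADE_NEW_VERSION_URL=quay.io/openshift-release-dev/ocp-release
export UPGRADE_NEW_VERSION=4.1.13
export EXPECTED_UPGRADE_DURATION=1800
time ansible-playbook -vv -i inventory workloads/upgrade.yml
```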

## Environment variables

### PUBLIC_KEY
Default: `~/.ssh/id_rsa.pub`
Public ssh key file for Ansible.

### PRIVATE_KEY
Default: `~/.ssh/id_rsa`
Private ssh key file for Ansible.

### ORCHESTRATION_USER
Default: `root`
User for Ansible to log in as. Must authenticate with PUBLIC_KEY/PRIVATE_KEY.

### WORKLOAD_IMAGE
Default: `quay.io/openshift-scale/scale-ci-workload`
Container image that runs the workload script.

### WORKLOAD_JOB_NODE_SELECTOR
Default: `true`
Enables/disables the node selector that places the workload job on the `workload` node.

### WORKLOAD_JOB_TAINT
Default: `true`
Enables/disables the toleration on the workload job that allows it to tolerate the `workload` taint.

### WORKLOAD_JOB_PRIVILEGED
Default: `false`
Enables/disables running the workload Pod as privileged.

### KUBECONFIG_FILE
Default: `~/.kube/config`
Location of kubeconfig on orchestration host.

### PBENCH_INSTRUMENTATION
Default: `false`
Enables/disables wrapping the workload in `pbench-user-benchmark`. When enabled, pbench agents can also be enabled (`ENABLE_PBENCH_AGENTS`) to collect further instrumentation data, and `pbench-copy-results` can be enabled (`ENABLE_PBENCH_COPY`) to export the captured data for further analysis.

### ENABLE_PBENCH_AGENTS
Default: `false`
Enables/disables the collection of pbench data on the pbench agent Pods. These Pods are deployed by the tooling playbook.

### ENABLE_PBENCH_COPY
Default: `false`
Enables/disables the copying of pbench data to a remote results server for further analysis.
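
The three pbench toggles build on one another: `PBENCH_INSTRUMENTATION` wraps the workload in `pbench-user-benchmark`, `ENABLE_PBENCH_AGENTS` adds agent data collection, and `ENABLE_PBENCH_COPY` ships the results off the cluster. A sketch of a fully instrumented run, assuming a reachable results server (hostname hypothetical):

```sh
export PBENCH_INSTRUMENTATION=true
export ENABLE_PBENCH_AGENTS=true         # requires the agent pods from the tooling playbook
export ENABLE_PBENCH_COPY=true
export PBENCH_SERVER=pbench.example.com  # hypothetical results server address
export PBENCH_SSH_PRIVATE_KEY_FILE=~/.ssh/id_rsa
export PBENCH_SSH_PUBLIC_KEY_FILE=~/.ssh/id_rsa.pub
```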

### PBENCH_SSH_PRIVATE_KEY_FILE
Default: `~/.ssh/id_rsa`
Location of ssh private key to authenticate to the pbench results server.

### PBENCH_SSH_PUBLIC_KEY_FILE
Default: `~/.ssh/id_rsa.pub`
Location of the ssh public key to authenticate to the pbench results server.

### PBENCH_SERVER
Default: There is no public default.
DNS address of the pbench results server.

### SCALE_CI_RESULTS_TOKEN
Default: There is no public default.
Reserved for future use: a token for the pbench and Prometheus scrapers to place results into the git repository that holds results data.

### JOB_COMPLETION_POLL_ATTEMPTS
Default: `360`
Number of retries for Ansible when polling whether the workload job has completed. Polls are spaced 10s apart, with some additional time taken by each polling action depending on the orchestration host setup; the default of 360 attempts therefore waits at least an hour.

### UPGRADE_TEST_PREFIX
Default: `upgrade`
Prefix applied to the pbench results for this test.

### UPGRADE_NEW_VERSION_URL
Default: No default.
The repository portion of the release image to upgrade to, for example `quay.io/openshift-release-dev/ocp-release` or `registry.svc.ci.openshift.org/ocp/release`.

### UPGRADE_NEW_VERSION
Default: No default.
The new version to upgrade to. Check [https://openshift-release.svc.ci.openshift.org/](https://openshift-release.svc.ci.openshift.org/) for versions and upgrade paths based on the installed cluster.
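
These two variables are joined into the release image reference passed to `oc adm upgrade` by the workload script; for example (version string illustrative):

```sh
# With UPGRADE_NEW_VERSION_URL=quay.io/openshift-release-dev/ocp-release
# and UPGRADE_NEW_VERSION=4.1.13 (hypothetical), the workload runs:
oc adm upgrade --force=false --to-image=quay.io/openshift-release-dev/ocp-release:4.1.13
```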

### FORCE_UPGRADE
Default: `false`
Determines the `--force` flag value for the `oc adm upgrade` command to initiate an upgrade.

### UPGRADE_POLL_ATTEMPTS
Default: `1800`
Number of times to poll to determine whether the cluster has finished upgrading. Each poll attempt corresponds to roughly a 2s wait plus the poll time itself, so the default of 1800 attempts allows at least an hour for the upgrade.

### EXPECTED_UPGRADE_DURATION
Default: `1800`
Pass/fail criterion. The maximum duration, in seconds, within which the upgrade workload is expected to complete.

## Smoke test variables

```sh
UPGRADE_TEST_PREFIX=upgrade_smoke
UPGRADE_NEW_VERSION_URL=registry.svc.ci.openshift.org/ocp/release
UPGRADE_NEW_VERSION=4.2.0-0.nightly-2019-08-13-183722
FORCE_UPGRADE=true
UPGRADE_POLL_ATTEMPTS=7200
```
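
These can be prepended to the playbook invocation to launch the smoke test in one line:

```sh
UPGRADE_TEST_PREFIX=upgrade_smoke \
UPGRADE_NEW_VERSION_URL=registry.svc.ci.openshift.org/ocp/release \
UPGRADE_NEW_VERSION=4.2.0-0.nightly-2019-08-13-183722 \
FORCE_UPGRADE=true \
UPGRADE_POLL_ATTEMPTS=7200 \
ansible-playbook -vv -i inventory workloads/upgrade.yml
```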
123 changes: 123 additions & 0 deletions workloads/files/workload-upgrade-script-cm.yml
apiVersion: v1
kind: ConfigMap
metadata:
  name: scale-ci-workload-script
data:
  run.sh: |
    #!/bin/sh
    set -eo pipefail
    workload_log() { echo "$(date -u) $@" >&2; }
    export -f workload_log
    workload_log "Configuring pbench for running upgrade workload"
    mkdir -p /var/lib/pbench-agent/tools-default/
    echo "${USER_NAME:-default}:x:$(id -u):0:${USER_NAME:-default} user:${HOME}:/sbin/nologin" >> /etc/passwd
    if [ "${ENABLE_PBENCH_AGENTS}" = true ]; then
      echo "" > /var/lib/pbench-agent/tools-default/disk
      echo "" > /var/lib/pbench-agent/tools-default/iostat
      echo "workload" > /var/lib/pbench-agent/tools-default/label
      echo "" > /var/lib/pbench-agent/tools-default/mpstat
      echo "" > /var/lib/pbench-agent/tools-default/oc
      echo "" > /var/lib/pbench-agent/tools-default/perf
      echo "" > /var/lib/pbench-agent/tools-default/pidstat
      echo "" > /var/lib/pbench-agent/tools-default/sar
      master_nodes=$(oc get nodes -l pbench_agent=true,node-role.kubernetes.io/master= --no-headers | awk '{print $1}')
      for node in $master_nodes; do
        echo "master" > /var/lib/pbench-agent/tools-default/remote@$node
      done
      infra_nodes=$(oc get nodes -l pbench_agent=true,node-role.kubernetes.io/infra= --no-headers | awk '{print $1}')
      for node in $infra_nodes; do
        echo "infra" > /var/lib/pbench-agent/tools-default/remote@$node
      done
      worker_nodes=$(oc get nodes -l pbench_agent=true,node-role.kubernetes.io/worker= --no-headers | awk '{print $1}')
      for node in $worker_nodes; do
        echo "worker" > /var/lib/pbench-agent/tools-default/remote@$node
      done
    fi
    source /opt/pbench-agent/profile
    workload_log "Done configuring pbench for upgrade workload run"
    workload_log "Running upgrade workload"
    if [ "${PBENCH_INSTRUMENTATION}" = "true" ]; then
      pbench-user-benchmark -- sh /root/workload/workload.sh
      result_dir="/var/lib/pbench-agent/$(ls -t /var/lib/pbench-agent/ | grep "pbench-user" | head -2 | tail -1)"/1/sample1
      if [ "${ENABLE_PBENCH_COPY}" = "true" ]; then
        pbench-copy-results --prefix ${UPGRADE_TEST_PREFIX}
      fi
    else
      sh /root/workload/workload.sh
      result_dir=/tmp
    fi
    workload_log "Completed upgrade workload run"
    workload_log "Checking Test Results"
    workload_log "Checking Test Exit Code"
    if [ $(jq '.exit_code==0' ${result_dir}/exit.json) == "false" ]; then
      workload_log "Test Failure"
      workload_log "Test Analysis: Failed"
      exit 1
    fi
    workload_log "Comparing upgrade duration to expected duration"
    workload_log "Upgrade Duration: $(jq '.duration' ${result_dir}/exit.json)"
    if [ $(jq ".duration>${EXPECTED_UPGRADE_DURATION}" ${result_dir}/exit.json) == "true" ]; then
      workload_log "EXPECTED_UPGRADE_DURATION (${EXPECTED_UPGRADE_DURATION}) exceeded ($(jq '.duration' ${result_dir}/exit.json))"
      workload_log "Test Analysis: Failed"
      exit 1
    fi
    # TODO: Check pbench-agent collected metrics for Pass/Fail
    # TODO: Check prometheus collected metrics for Pass/Fail
    workload_log "Test Analysis: Passed"
  workload.sh: |
    #!/bin/sh
    result_dir=/tmp
    if [ "${PBENCH_INSTRUMENTATION}" = "true" ]; then
      result_dir=${benchmark_results_dir}
    fi
    start_time=$(date +%s)
    workload_log "Before Upgrade Data"
    oc get clusterversion/version
    oc get clusteroperators
    oc adm upgrade --force=${FORCE_UPGRADE} --to-image=${UPGRADE_NEW_VERSION_URL}:${UPGRADE_NEW_VERSION}
    # Poll to see upgrade started
    retries=0
    while [ ${retries} -le 120 ] ; do
      clusterversion_output=$(oc get clusterversion/version)
      if [[ "${clusterversion_output}" == *"Working towards "* ]]; then
        workload_log "Cluster upgrade started"
        break
      else
        workload_log "Cluster upgrade has not started, Poll attempts: ${retries}/120"
        sleep 1
      fi
      retries=$((retries + 1))
    done
    # Poll to see if upgrade has completed
    retries=0
    while [ ${retries} -le ${UPGRADE_POLL_ATTEMPTS} ] ; do
      clusterversion_output=$(oc get clusterversion/version)
      if [[ "${clusterversion_output}" == *"Cluster version is "* ]]; then
        workload_log "Cluster upgrade complete"
        break
      else
        workload_log "Cluster still upgrading, Poll attempts: ${retries}/${UPGRADE_POLL_ATTEMPTS}"
        sleep 2
      fi
      retries=$((retries + 1))
    done
    end_time=$(date +%s)
    duration=$((end_time-start_time))
    exit_code=0
    workload_log "Post Upgrade Data"
    oc get clusterversion/version
    oc get clusteroperators
    if [[ "${clusterversion_output}" != *"Cluster version is "* ]]; then
      workload_log "Cluster failed to upgrade to ${UPGRADE_NEW_VERSION} in (${UPGRADE_POLL_ATTEMPTS} * 2s)"
      exit_code=1
    fi
    workload_log "Writing Exit Code and Duration"
    jq -n "{exit_code: ${exit_code}, duration: ${duration}}" > "${result_dir}/exit.json"
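
When run without pbench instrumentation, `workload.sh` records its verdict in `/tmp/exit.json`, which `run.sh` then inspects for the pass/fail checks. A sketch of the file and the exit-code check (duration value illustrative):

```sh
$ cat /tmp/exit.json
{
  "exit_code": 0,
  "duration": 2714
}
$ jq '.exit_code==0' /tmp/exit.json
true
```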
9 changes: 9 additions & 0 deletions workloads/templates/workload-env.yml.j2
data:
  FIOTEST_SSH_AUTHORIZED_KEYS: "{{pbench_ssh_public_key_file_slurp['content']}}"
  FIOTEST_SSH_PRIVATE_KEY: "{{pbench_ssh_private_key_file_slurp['content']}}"
  FIOTEST_SSH_PUBLIC_KEY: "{{pbench_ssh_public_key_file_slurp['content']}}"
{% elif workload_job == "upgrade" %}
  PBENCH_INSTRUMENTATION: "{{pbench_instrumentation|bool|lower}}"
  ENABLE_PBENCH_COPY: "{{enable_pbench_copy|bool|lower}}"
  UPGRADE_TEST_PREFIX: "{{upgrade_test_prefix}}"
  UPGRADE_NEW_VERSION_URL: "{{upgrade_new_version_url}}"
  UPGRADE_NEW_VERSION: "{{upgrade_new_version}}"
  FORCE_UPGRADE: "{{force_upgrade}}"
  UPGRADE_POLL_ATTEMPTS: "{{upgrade_poll_attempts}}"
  EXPECTED_UPGRADE_DURATION: "{{expected_upgrade_duration}}"
{% endif %}
