README.evals

Using epumgmt for running EPU workload evaluations

There are three main components to running EPU workload evaluations. First,
cloudinit.d is used to launch and configure the EPU. Second,
epumgmt/bin/generate-workload-definition.py is used to create an
epumgmt-understandable workload format file. And finally, epumgmt is used
to execute the workload and graph the results.

Discussion of cloudinit.d is beyond the scope of this README.


To generate a workload definition file for epumgmt you should use the
generate-workload-definition.py script provided in ./bin/. This command will
allow you to specify when during the evaluation you want to kill a controller,
worker instances, or submit work. (All of the options are explained by
running './bin/generate-workload-definition.py -h'.)

    For example, this command:

        $ ./bin/generate-workload-definition.py --kill-controller=60,120,300
          --kill-seconds=60,120 --kill-counts=1,12 --submit-seconds=0,120 
          --submit-counts=5,5 --submit-sleep=300,600

    will generate this on standard out (you should redirect to a file if you
    want to create a workload definition file to execute with epumgmt):

          KILL_CONTROLLER 60 1
          KILL_CONTROLLER 120 1
          KILL_CONTROLLER 300 1
          KILL 60 1
          KILL 120 12
          SUBMIT 0 5 300 0
          SUBMIT 120 5 600 5

    This workload attempts to submit 5 jobs at the very beginning of the test
    (second 0) that sleep for 300 seconds. It then submits another 5 jobs 120
    seconds into the evaluation. These jobs run for 600 seconds. This workload
    also attempts to kill 1 worker VM 60 seconds into the evaluation and 12 VMs
    120 seconds into the evaluation. Finally, it kills a controller at 60, 120,
    and 300 seconds into the evaluation.


Once you have generated a workload definition file with
generate-workload-definition.py, you can then use this file with epumgmt to 
execute the workload (and graph the results).

Assuming we launched a plan with cloudinit.d with the name "testrun" and
generated a workload definition file (similar to above) with the name
"workload.def" then to execute the workload with the EPU launched by
cloudinit.d you'd simply run the following command:

./bin/epumgmt.sh -a execute-workload-test -n testrun -f workload.def -w torque

You can also specify amqp as the workload type (-w).

Once this completes you should then fetch all logs with the following commands:

./bin/epumgmt.sh -a logfetch -n testrun
./bin/epumgmt.sh -a torque-logfetch -n testrun

Obviously you can skip torque-logfetch if you've only run an amqp workload.
These steps should actually already been done for you by execute-workload-test,
however, it isn't a bad idea to follow up a run with these commands just to
make sure you have all of the logs you need.

Once this is complete you can simply generate a graph with:

./bin/epumgmt.sh -a generate-graph -n testrun -r stacked-vms -t png -w torque

There numerous other graphs (-r) that you can specify: job-tts, job-rate, 
node-info, and controller. You can also specify eps instead of png for the
graph type (-t).

After examining your results, don't forget to kill the run:

./bin/epumgmt.sh -a killrun -n testrun

Also, you should probably check the cloud (e.g. EC2) that you're using and make
sure you didn't leave any zombie instances running.