Performance Testing Guidelines
Setting an entire environment up from scratch can be done with oae-provisioning, slapchop and fabric:
fab ulous:performance
This should create all the machines, run the puppet scripts and give you a working environment.
Once your environment is up and running, it's a good idea to stop the puppet service on all the machines for the duration of the data-load process, as that process is particularly stressful on the application and stopping puppet avoids additional background noise. From the puppet master machine run:
mco service puppet stop
Before you start loading data and running tests, it's usually a good idea to see if the environment is well balanced. A quick and easy way to check that is to ssh into the machines and run a sysbench test.
To run sysbench on a group of servers of the same class such as app servers, you can do something like the following from the puppet master:
mco package sysbench install
# The following runs the command "sysbench --test=cpu..." (after the -- ) on hosts (-H) app0, app1, app2 and app3 in parallel (-P)
fab -H app0,app1,app2,app3 -P -- sysbench --test=cpu --cpu-max-prime=20000 run
Each execution should result in a response like the following:
sysbench 0.4.12: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 1
Doing CPU performance benchmark
Threads started!
Done.
Maximum prime number checked in CPU test: 20000
Test execution summary:
total time: 33.4556s
total number of events: 10000
total time taken by event execution: 33.4531
per-request statistics:
min: 2.83ms
avg: 3.35ms
max: 19.62ms
approx. 95 percentile: 4.65ms
Threads fairness:
events (avg/stddev): 10000.0000/0.00
execution time (avg/stddev): 33.4531/0.00
The "important" number here is the avg execution time. That should be similar across the various groups.
For example, a decent number on the app/activity nodes is around 30. Joyent can't always guarantee a solid distribution of nodes, so if a couple of them are off by a factor of 2 it's a good idea to trash those machines and fire up new ones. This can be done with slapchop/fabric like so:
slapchop -d performance destroy -i app0 -i app2 -i activity1
fab "provision_machines:performance,app0;app2;activity1"
If you're starting from a fresh environment you need to generate and load data. If you've already done a dataload and are following this guide, you can restore a backup in Cassandra and start from there. Data can be generated with OAE-model-loader. Ensure you have all the user pictures, group pictures, ...
Generating:
nohup node generate.js -b 10 -t oae -u 1000 -g 2000 -c 5000 -d 5000 &
Loading:
- Create a tenant with the oae alias
- Disable reCaptcha on the tenant (or globally)
- Disable the activity servers, as the activities that get generated by the model loader would kill the db/activity-cache servers. On the puppet master machine you can run:
mco service hilary stop -I activity0 -I activity1 -I activity2
- Start the dataload:
nohup node loaddata.js -h http://oae.oae-performance.oaeproject.org -b 10 -s 0 -c 2
It's important that the dataload ends without any errors. If some users, groups or content failed to be created, you will end up with a bunch of 400s in the tsung tests, which makes them hard to read/interpret.
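A quick way to double-check is to scan the loader's output for errors before moving on. A sketch, assuming loaddata.js was started with nohup as above (so its output ended up in nohup.out) and that failures are logged with the word "error"; adjust the pattern to the loader's actual log format:
grep -ci "error" nohup.out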
Now that your data is in Cassandra, it's a good idea to take a backup of it so we can restore it for the next test.
Take a snapshot with:
fab -H db0,db1,db2 -P -- nodetool -h localhost -p 7199 snapshot oae
For more info see http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_backup_takes_snapshot_t.html
TODO: See http://www.datastax.com/docs/1.0/operations/backup_restore for now
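To verify that the snapshots were actually created, you can list the snapshot directories on each db node. A sketch, assuming the data files live under /var/lib/cassandra/data (on these nodes it may be /data/cassandra instead, and the exact layout varies by Cassandra version):
# run on each db node
ls -d /var/lib/cassandra/data/oae/*/snapshots/*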
- Stop the app/activity/pp servers from puppet master:
mco service hilary stop -W oaeservice::hilary
RabbitMQ:
- Reset RabbitMQ from mq0:
rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app
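To confirm the reset actually cleared everything, list the remaining queues on mq0 (the list should come back empty):
rabbitmqctl list_queues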
Clean the ActivityStreams CF in Cassandra:
- Truncate the ActivityStreams CF:
cqlsh << EOF
use oae;
truncate ActivityStreams;
EOF
- Stop Cassandra:
fab -H db0,db1,db2 -P -- service dse stop
- Trash the commitlogs and saved caches:
fab -H db0,db1,db2 -P -- rm -rf /var/lib/cassandra/*
(note the *: removing the entire /var/lib/cassandra directory will cause permission issues when starting Cassandra back up)
- Start Cassandra:
fab -H db0,db1,db2 -P -- service dse start
Redis data:
fab -H cache0,activity-cache0 -P -- redis-cli flushdb
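To confirm the flush worked, dbsize should report 0 keys on both cache nodes:
fab -H cache0,activity-cache0 -P -- redis-cli dbsize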
Restart:
- Start a (single!) app server:
mco service hilary start -I app0
(you might want to check the logs on app0 to verify that everything started up fine. We start a single app server first because each app server will try to create queues on startup, and starting them all at once can otherwise lead to concurrency issues)
- Start all the app servers:
mco service hilary start -W oaeservice::hilary
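To check that all of them actually came back up, the service agent's status action can be used with the same class filter (a sketch, assuming the mco service agent's status action is available):
mco service hilary status -W oaeservice::hilary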
Ensure munin is working in your test; it gives valuable OS performance information for identifying bottlenecks. If Tsung's munin integration fails on even one node you will get 0 munin stats :( So tail your performance test for a bit to ensure you're getting OS stats, and investigate the issue if you are not.
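Once the test is running (see the commands at the end of this page), a quick check is to grep its log for OS monitoring samples. A sketch, assuming Tsung's os_mon metrics such as freemem end up in tsung.log; adjust the metric name to whatever your version reports:
grep -c freemem tsung.log
# a count that stays at 0 after a few minutes means the munin integration isn't reporting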
To completely reset Cassandra (wiping all data), from the puppet master node:
mco service dse stop -W oaeservice::hilary::dse
On each db node:
rm -rf /data/cassandra/*
rm -rf /var/lib/cassandra/*
rm -rf /var/log/cassandra/*
# Ensure that the cassandra user has r/w access on all those directories
chown cassandra:cassandra /data/cassandra
chown cassandra:cassandra /var/lib/cassandra
chown cassandra:cassandra /var/log/cassandra
Then start them back up one by one:
service dse start
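Before moving on to the next node, it can help to confirm that the previous one has come back up and rejoined the ring (run from any db node):
nodetool -h localhost -p 7199 ring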
It might be best to restart opscenterd (from the monitor machine):
service opscenterd restart
Once your data has been generated/loaded you can generate a tsung test with node-oae-tsung, which should be at /opt/node-oae-tsung.
node main.js -s /opt/OAE-model-loader/scripts -b 10 -o ./name-of-feature-you-are-testing -a answers/baseline.json
That should give you a directory at /opt/node-oae-tsung/name-of-feature-you-are-testing with the tsung.xml and the properly formatted csv files that tsung can use.
The baseline.json file contains (among others) the arrival rates that will be used in tsung. Generally, we have 2 setups. When you're trying to find the breaking point of the application it's usually a good idea to do this in waves, i.e.:
- arrival rate of 5 users/s for 10 minutes
- arrival rate of 0.1 users/s for 10 minutes (cool down phase)
- arrival rate of 5.5 users/s for 10 minutes
- arrival rate of 0.1 users/s for 10 minutes (cool down phase)
- arrival rate of 6 users/s for 10 minutes
- arrival rate of 0.1 users/s for 10 minutes (cool down phase)
- arrival rate of 6.5 users/s for 10 minutes
- arrival rate of 0.1 users/s for 10 minutes (cool down phase) ...
The cool down phases are there to allow the system some time to catch up with users from the previous phase.
If you're trying to compare a feature against master (baseline), it might be easier to have one big phase that stretches 2 hours with a constant arrival rate:
- arrival rate of 5 users/s for 2 hours
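Whichever setup you pick, you can sanity-check which phases actually ended up in the generated config with a quick grep (a sketch; Tsung encodes these phases as arrivalphase elements in tsung.xml):
grep -A1 "<arrivalphase" /opt/node-oae-tsung/name-of-feature-you-are-testing/tsung.xml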
The answers file (baseline.json in the example above) also includes which hosts should be monitored with munin.
Then start the test from the generated directory:
cd /opt/node-oae-tsung/name-of-feature-you-are-testing
nohup tsung -f tsung.xml -l tsung start &
watch -n 120 /usr/lib/tsung/bin/tsung_stats.pl --stats tsung.log
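Once the run has finished, the same script can generate the full HTML report from the run's log directory (a sketch; tsung usually creates a timestamped subdirectory for each run under the log directory passed with -l):
cd tsung/<timestamp-of-this-run>
/usr/lib/tsung/bin/tsung_stats.pl --stats tsung.log
# open the generated report.html / graph.html in a browser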