Home
This wiki talks about ways to compare the accounting records at some site to the accounting records stored in APEL. It is based on the system we developed at Liverpool, but it is intended to be both portable and extensible. If the batch system or CE at your site is not covered by any of the sections, we invite you to extend the software using similar scripts and techniques to those presented. Andrew McNab ([email protected]) can give you write access to the repository for fixes and new methods.
For the time being, I recommend installing this software directly from the github repo as follows:
git clone https://:@github.com/gridpp/audit.git
The nodes that need the software will be described in specific sections later.
You'll need a Linux system with a browser for this step. Login as root and install the software, as described above, off the /root directory or somewhere like that.
Start the browser and go to the next-generation EGI accounting portal:
https://accounting-next.egi.eu/
In the web page, click Research Infrastructure, Tier 2 ...
In the Row Variable, select Submit Host, and choose the Start Time and End Time, which usually span one month, e.g. Oct 2016 to Oct 2016. This means the start of Oct to the end of Oct.
You will make two runs of the report. On the first run, select metric as Number of Jobs and click Update. Below the table is a button to download the data as CSV. Do so, and save the file in ~/audit/APEL/, calling the file jobcount.csv.
On the second run, select metric as Normalised Sum of Elapsed * Number of Processors, click Update and save the data as a CSV, calling the file hs06.csv this time.
On the command line, for each of your submit hosts, cd to the ~/audit/APEL/ directory and run the following commands (note that the CE hepgrid97 is connected to the Torque server hepgrid96).
$ ./totals.pl hs06.csv hepgrid97
Numeric column totals:
1859428.0362 1859428 0.09
$ ./totals.pl jobcount.csv hepgrid97
Numeric column totals:
76217 76217 0.23
This means that, for the hepgrid97.ph.liv.ac.uk CE, APEL knows of 76217 jobs, and thinks they did 1,859,428 HS06 Hours of work.
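If you want to check where such a total comes from, the idea is simply to pick out the CSV rows whose submit host matches and add up the numeric columns. Below is a minimal Perl sketch of that idea; it assumes the submit host appears in the first CSV column, and it is only an illustration, not the contents of the repo's totals.pl.

```perl
#!/usr/bin/perl
# Illustrative sketch only (not the repo's totals.pl): total the numeric
# columns of portal CSV rows whose first field starts with a given host.
use strict;
use warnings;

my ($csv, $host) = @ARGV;
die "Usage: $0 file.csv hostprefix\n" unless defined $host;

my @totals;
open(my $fh, '<', $csv) or die "Cannot open $csv: $!";
while (my $line = <$fh>) {
    chomp $line;
    my @fields = split /,/, $line;
    next unless defined $fields[0] && $fields[0] =~ /^\Q$host\E/;
    for my $i (1 .. $#fields) {
        $totals[$i] += $fields[$i]
            if defined $fields[$i] && $fields[$i] =~ /^-?\d+(\.\d+)?$/;
    }
}
close $fh;

print "Numeric column totals:\n";
print join(' ', grep { defined } @totals), "\n";
```

You would run it as, for example, ./sumCsv.pl hs06.csv hepgrid97 (the script name is made up for this sketch).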
Do that for all your CEs/submit hosts. Where you are using VAC, put in the leading characters of the VAC hostname; at Liverpool, it looks like this:
$ ./totals.pl hs06.csv vac01.ph.liv.ac.uk
Numeric column totals:
484434.3381 484435 0
$ ./totals.pl jobcount.csv vac01.ph.liv.ac.uk
Numeric column totals:
111806 111806 0.31
The work for each VAC factory is totalled, and this shows that APEL knows of 111806 jobs, and thinks they did 484,434 HS06 Hours of work.
Login as root into your Torque headnode and install the software, as described above, off the /root directory or somewhere like that.
cd to the ~/audit/Torque/ directory, and get a list of the files that cover the period you want so that the script doesn't have to plough through them all. Since some records for Oct might lie just inside Sep or Nov, list those files too. Note: the location of the files may vary; check with your admin.
rm -f recordFilesCoveringPeriod
ls /var/lib/torque/server_priv/accounting/201609* >> recordFilesCoveringPeriod
ls /var/lib/torque/server_priv/accounting/201610* >> recordFilesCoveringPeriod
ls /var/lib/torque/server_priv/accounting/201611* >> recordFilesCoveringPeriod
Get the UNIX epochs for the start and end of the period in question (in this case from the start of October up to, but not including, 1 Nov).
startEpoch=`date --date="Oct 01 00:00:00 UTC 2016" +%s`
endEpoch=`date --date="Nov 01 00:00:00 UTC 2016" +%s`
Now run a script to get the job data for that period. You have to pass it the Publishing Benchmark to which you scale. At Liverpool, we scale to 10 HS06, which is 2500 bogoSpecInt2K, hence we use 2500. Also give it the start and end epochs.
./extractRecordsBetweenEpochs.pl recordFilesCoveringPeriod 2500 $startEpoch $endEpoch > table.oct
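To make the scaling concrete: the text above says 2500 bogoSpecInt2K corresponds to 10 HS06, so each core-hour of walltime contributes roughly 10 HS06 hours at Liverpool. The following is only a sketch of that per-record arithmetic for a Torque 'E' (job end) line, under the assumption that walltime and exec_host appear as shown; it is not the actual logic of extractRecordsBetweenEpochs.pl.

```perl
#!/usr/bin/perl
# Illustrative sketch only (not extractRecordsBetweenEpochs.pl): work out the
# HS06 hours for one Torque 'E' (job end) accounting record, using the
# Liverpool scaling of 2500 bogoSpecInt2K == 10 HS06 per core.
use strict;
use warnings;

my $benchmark   = 2500;              # publishing benchmark, bogoSpecInt2K
my $hs06PerCore = $benchmark / 250;  # 2500 bogoSpecInt2K == 10 HS06 (from the text above)

# A made-up example record; real lines come from the accounting files.
my $record = '10/15/2016 12:00:00;E;123.hepgrid96;user=atlas001 '
           . 'exec_host=node1/0+node1/1 resources_used.walltime=02:30:00';

if ($record =~ /;E;/) {
    my ($h, $m, $s) = $record =~ /resources_used\.walltime=(\d+):(\d+):(\d+)/;
    my ($exec)      = $record =~ /exec_host=(\S+)/;
    my @slots       = defined $exec ? split(/\+/, $exec) : ();
    my $cores       = @slots ? scalar(@slots) : 1;
    my $hours       = $h + $m / 60 + $s / 3600;
    printf "cores=%d walltime=%.2fh HS06-hours=%.2f\n",
           $cores, $hours, $hours * $cores * $hs06PerCore;
}
```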
Add up the tables to get the result for the month.
./accu.pl table.oct
The work done for that month, in HS06 Hours, should pop out. The job count for the month is represented by the number of lines in the table file.
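The summing step itself is nothing more than adding a column and counting lines. For illustration only (this is not the repo's accu.pl), a few lines of Perl that read the table from a file argument or stdin:

```perl
#!/usr/bin/perl
# Illustrative sketch only (not the repo's accu.pl): sum the last column of a
# job table read from a file argument or stdin, and count the rows (one per job).
use strict;
use warnings;

my ($sum, $jobs) = (0, 0);
while (my $line = <>) {
    my @fields = split ' ', $line;
    next unless @fields;
    $sum  += $fields[-1];   # assume the last column holds the HS06 hours
    $jobs += 1;
}
printf "Jobs: %d  HS06 hours: %.2f\n", $jobs, $sum;
```

Run it as, say, ./sumTable.pl table.oct (the name is made up for this sketch).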
ARC must be running with archiving turned on in /etc/arc.conf. Check https://www.gridpp.ac.uk/wiki/Example_Build_of_an_ARC/Condor_Cluster
Login as root into your ARC headnode and install the software, as described above, off the /root directory or somewhere like that.
cd to the ~/audit/ARC directory and make a list of all the usage reports (location varies, check /etc/arc.conf)
ls /var/urs > /tmp/urs
Get the ones for jobs that ended in (say) September
for f in `cat /tmp/urs`; do grep -l "EndTime.2016-09" /var/urs/$f; done > /tmp/urs.sept
Parse them to make the table
for t in `cat /tmp/urs.sept `; do ./parseUrs.pl $t; done > table.sept
Sum up the table.
cat table.sept | ~/scripts/accu.pl
The usage for the month should pop out. The job count for the month is represented by the number of lines in the table file.
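For what it's worth, the information parseUrs.pl needs is sitting in each OGF usage record as WallDuration, Processors and EndTime elements. The sketch below only illustrates pulling those fields out with regexes; it assumes plain element names and a WallDuration in seconds such as PT9000S (adjust if your records carry namespace prefixes or other duration forms), and it is not the actual parseUrs.pl.

```perl
#!/usr/bin/perl
# Illustrative sketch only (not parseUrs.pl): pull the wall clock time and
# processor count out of one OGF usage record file with simple regexes.
use strict;
use warnings;

my ($file) = @ARGV;
open(my $fh, '<', $file) or die "Cannot open $file: $!";
my $ur = do { local $/; <$fh> };   # slurp the whole XML record
close $fh;

my ($secs)  = $ur =~ /<WallDuration>PT(\d+)S<\/WallDuration>/;
my ($procs) = $ur =~ /<Processors>(\d+)<\/Processors>/;
$procs = 1 unless defined $procs;

if (defined $secs) {
    printf "%s wall-hours=%.2f processors=%d\n", $file, $secs / 3600, $procs;
}
```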
Since VAC does not bring its results back to some headnode, the process is a bit more cumbersome at present, but I'll try to fix that in due course. Nonetheless, cumbersome as it is, the method below gives readings that are 99.99% correct, which is good enough.
So, to get an estimate of the VAC node accounting, it is necessary to designate some Linux system as the “controlnode” and install the software, as described above, off the /root directory or somewhere like that. You must also install the software on each VAC node. (Note: to be complete, the only software needed on the VAC node is the audit/VAC/factory/listRecordsWithEndDateBetween.pl tool.)
Unless you have only one or two VAC nodes, you must also set up the designated controlnode so that it can access the VAC systems without a password. A good way to do this is with “ssh-agent bash” and “ssh-add”, adding the root user's key into the ssh session. There are other ways too. Anyway, once that is set up, this is how to get an estimate of the work done for VAC.
On the control node, be in ~/audit/VAC/controlnode, and create a list of all the VAC hostnames in a file called vacnodes. Get your start and end epochs, in this case for October.
startEpoch=`date --date="Oct 01 00:00:00 UTC 2016" +%s`
endEpoch=`date --date="Nov 01 00:00:00 UTC 2016" +%s`
echo startEpoch $startEpoch
echo endEpoch $endEpoch
Then run a script in a loop to get all the VAC data for work done:
for vn in `cat vacnodes `; do
./auditOneVacNode.sh $vn /var/lib/vac/apel-archive $startEpoch $endEpoch ;
done > table.oct
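In case it helps to picture the per-node step, here is a hedged Perl sketch of what a wrapper like auditOneVacNode.sh might amount to: ssh to the factory and run listRecordsWithEndDateBetween.pl against the apel-archive directory for the given epochs. The install path on the node and the argument order are assumptions; check the actual scripts.

```perl
#!/usr/bin/perl
# Illustrative sketch only (not auditOneVacNode.sh): run the per-node record
# lister over ssh for one VAC factory and let its rows appear on stdout.
# Assumes passwordless ssh as root; the tool path and argument order are
# guesses, so check audit/VAC/factory/listRecordsWithEndDateBetween.pl itself.
use strict;
use warnings;

my ($node, $archiveDir, $startEpoch, $endEpoch) = @ARGV;
die "Usage: $0 node archiveDir startEpoch endEpoch\n" unless defined $endEpoch;

my @cmd = ('ssh', "root\@$node",
           '/root/audit/VAC/factory/listRecordsWithEndDateBetween.pl',
           $archiveDir, $startEpoch, $endEpoch);

system(@cmd) == 0 or warn "ssh to $node failed: $?\n";
```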
Sum up the table.
cat table.oct | ~/scripts/accu.pl
The usage for the month should pop out. The job count for the month is represented by the number of lines in the table file.
TBD
It's best to compile all the results into a table, like the one below, which lists the readings obtained from these procedures (headings "Site Log Jobs" and "Site Log HS06Hrs") alongside the readings present in the APEL system. The table below is the actual comparison done at Liverpool for the month of October 2016.
Cluster | Site Log Jobs | Site Log HS06Hrs | APEL Jobs | APEL HS06Hrs |
---|---|---|---|---|
ARC/Condor (hepgrid2) | 213,794 | 6,866,329 | 213,794 | 6,859,454 |
CREAM/Torque (hepgrid97/96) | 76,217 | 1,859,428 | 76,217 | 1,859,428 |
VAC Hammer (vachammer.ph.liv.ac.uk) | 20,435 | 4,190,244 | 20,426 | 4,188,542 |
VAC Chadwick (vac01.ph.liv.ac.uk) | 111,817 | 484,458 | 111,806 | 484,434 |
There you can compare the accuracy (or lack of it) in the readings.
If you have comments, please contact Steve Jones ([email protected])