
workqueue couch db replication procedure

Alan Malta Rodrigues edited this page Oct 19, 2018 · 7 revisions

Introduction

From time to time we need to clean up the central couch databases because they accumulate too many deleted documents (>90%), which causes:

  • a general slowdown in the system, including replication to the agents' workqueue_inbox database
  • an increase in the view sizes (central workqueue views are now about 70GB!)

There are at least two ways to clean up a couch database: delete the whole database (when we do not care about its data, since it can get replicated back), or temporarily create another database containing only the documents we want to keep (the so-called permanent docs). For the workqueue database in CMSWEB, the latter is the procedure we need to follow.

Since we migrated to CouchDB 1.6.1, replication is no longer driven through _replicate; all replication tasks are handled via the _replicator database.

It's also worth mentioning that Alan faced several couch oddities (and sometimes couch/replication crashes), such as:

  • permission denied when creating documents in the new database (thus requiring the user_ctx option, more below)
  • dozens/hundreds of processes triggering the couch replication, which caused most of the instabilities (visible via _active_tasks, which prints TONS of replication tasks instead of just one)
  • a replication filter written in Erlang causes many more replication processes to be spawned, increasing the couch server instability.

Step-by-step to replicate workqueue couch db

These instructions are meant to be performed on one of the developer VMs, after copying over the original workqueue.couch database file from the CMSWEB production node. When running these steps on CMSWEB, there might be minor changes to the commands, and a short outage should be announced.

To start off, we need to temporarily disable all compact and compactviews cron jobs on the backend that will go through this procedure. After updating the crontab, check the couchdb status: if there are any compaction or indexer tasks, stop here and come back later. This is mainly to avoid overloading that backend node with too many couch tasks, on top of all the standard requests that come through.

Let us start by creating the replication filter function in JavaScript. The function is very simple: it does NOT replicate documents that are deleted (doc._deleted: true); all other documents will get replicated. Run the following command in the terminal:

cat > /tmp/filter.js << EOF
function(doc, req) {
    if (doc._deleted){
       return false;
    }
    return true;
}
EOF
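The filter's pass/skip logic can be sanity-checked offline with a small Python mimic of the predicate above (this is only an illustration of the logic, not CouchDB's actual filter engine; the sample documents are invented):

```python
# Mimic of the filterDeletedDocs predicate: only non-deleted docs replicate.
def filter_deleted_docs(doc):
    """Return True when the document should be replicated."""
    return not doc.get("_deleted", False)

# Invented sample documents, one live and one tombstone.
docs = [
    {"_id": "element1", "status": "Available"},
    {"_id": "element2", "_deleted": True},
]
replicated = [d["_id"] for d in docs if filter_deleted_docs(d)]
print(replicated)  # ['element1']
```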

Now we need to move this temporary file into the workqueue couchapps directory AND fix the ownership and permissions if needed:

mv /tmp/filter.js /data/srv/current/apps/workqueue/data/couchapps/WorkQueue/filters/filterDeletedDocs.js 
sudo chown _sw:_sw /data/srv/current/apps/workqueue/data/couchapps/WorkQueue/filters/filterDeletedDocs.js 

Make sure the correct workqueue.couch database (the one you want to replicate/clean up) is in place and has the right permissions/ownership. The file is:

ls -lh /data/srv/state/couchdb/database/workqueue.couch

Before restarting the couch server, make sure once again that there are no active tasks running (using the status command). Then restart the couch server so that it uploads the new filter function (in filterDeletedDocs.js) to the WorkQueue design document:

(A=/data/cfg/admin; cd /data; $A/InstallDev -s stop:couchdb)
(A=/data/cfg/admin; cd /data; $A/InstallDev -s start:couchdb)

OPTIONAL: we might have to fix the permissions of this file to avoid errors in the logs (which have always been there). Not sure CMSWEB needs to do this...:

sudo chmod 770 /data/srv/current/auth/couchdb/hmackey.ini

Then it's time to start the replication by creating a new document in the _replicator database. Run this command:

curl -ks -X POST http://localhost:5984/_replicator -H 'Content-Type:application/json' -d '{"_id":"rep_wq", "source":"workqueue", "target":"newwq", "continuous": true, "create_target":true, "filter":"WorkQueue/filterDeletedDocs", "user_ctx": {"name": "_admin", "roles": ["_admin"]}}'

The options are described below:

  • _id: we can define the doc id in couch 1.6.1, which helps to keep track of this document
  • source: the database to replicate documents from
  • target: the database to replicate documents to (if they are both local, we can just provide their names)
  • continuous: set to true to make sure any new documents written to the source will also get replicated to the target
  • create_target: creates the target database if it does not exist
  • filter: the filter function which every single source document has to pass before getting replicated
  • user_ctx: we need to provide a user and its role in order to write documents to the new database (also needed in the CMSWEB environment!)
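Putting the options together, the replication document is just a small JSON object; here is a sketch of building the same payload programmatically (Python used purely for illustration, with the field values from the curl command above):

```python
import json

# Replication document for the _replicator database (CouchDB 1.6.1).
rep_doc = {
    "_id": "rep_wq",                  # fixed id, easy to track and delete later
    "source": "workqueue",            # local source database
    "target": "newwq",                # local target database
    "continuous": True,               # keep replicating new writes to the source
    "create_target": True,            # create newwq if it does not exist
    "filter": "WorkQueue/filterDeletedDocs",  # skip deleted documents
    "user_ctx": {"name": "_admin", "roles": ["_admin"]},  # write permission on target
}
print(json.dumps(rep_doc))
```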

The previous command returns a revision number, but you can also check the state of this replication by fetching the replication document you posted:

curl -ks "http://localhost:5984/_replicator/rep_wq"

or by checking the couch status command output (there should be an active replication task); couch.log is also your best friend:

(A=/data/cfg/admin; cd /data; $A/InstallDev -s status:couchdb)

In order to check the status of the replication, one can run the _active_tasks command or get a summary of each database:

curl -ks "http://localhost:5984/_active_tasks" 
curl -ks -X GET 'http://localhost:5984/workqueue'
curl -ks -X GET 'http://localhost:5984/newwq'

Once the newwq database has the same doc_count as the source database (workqueue) AND the task shows "progress":100, replication should be up to date and you can now perform the second phase of this intervention.
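The catch-up check can be sketched as follows, assuming database-info JSON of the shape CouchDB returns for GET /<db> (the doc_count values below are invented):

```python
import json

# Invented sample responses from GET /workqueue and GET /newwq.
source_info = json.loads('{"db_name": "workqueue", "doc_count": 12345}')
target_info = json.loads('{"db_name": "newwq", "doc_count": 12345}')

# Replication has caught up when the valid document counts match.
caught_up = source_info["doc_count"] == target_info["doc_count"]
print("replication caught up" if caught_up else "still behind")
```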

On the day of swapping the databases in the CMSWEB environment, we need to schedule a service outage to make sure that no new documents get written to the old database but not to the new one. The required steps include:

  • set access to the couch workqueue to nowhere in the frontend map rules, e.g. vocms0731nowhere.cern.ch instead of vocms0731.cern.ch
  • watch the couch.log until the read/write activity stops (around 5 min)
  • check again the amount of valid documents in workqueue and newwq. IF they are the same, you can stop the Couch server; ELSE, wait a bit longer watching couch.log and run this test again.
  • fetch the data volume of the original couch database and its views
  • given that the two databases have the same amount of valid documents (doc_count), move the workqueue.couch file somewhere else and rename newwq.couch to workqueue.couch
  • delete ALL the workqueue view files in /data/srv/state/couchdb/database/.workqueue_design/
  • at this point you have the new database in place and the old views completely removed from the system, so you can restart the Couch server
  • when couch gets restarted, your previous replication document is removed as well, so at this point there is no replication ongoing anymore.
  • trigger the views building with the following command:
curl -ks -X GET 'http://localhost:5984/workqueue/_design/WorkQueue/_view/elementsByStatus' | head
  • restore the frontend maps and that's all!
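The file-swap steps above amount to a backup, a rename, and a view wipe; here is a minimal sketch simulating them in a throwaway directory instead of /data/srv/state/couchdb/database (the file names match the procedure, the directory is a placeholder):

```python
import os
import shutil
import tempfile

# Simulate the swap in a temp dir: back up workqueue.couch, promote
# newwq.couch, and wipe the old view files under .workqueue_design/.
dbdir = tempfile.mkdtemp()
os.makedirs(os.path.join(dbdir, ".workqueue_design"))
for name in ("workqueue.couch", "newwq.couch"):
    open(os.path.join(dbdir, name), "w").close()
open(os.path.join(dbdir, ".workqueue_design", "old.view"), "w").close()

shutil.move(os.path.join(dbdir, "workqueue.couch"),
            os.path.join(dbdir, "workqueue.couch.bak"))   # keep a backup
shutil.move(os.path.join(dbdir, "newwq.couch"),
            os.path.join(dbdir, "workqueue.couch"))       # promote newwq
shutil.rmtree(os.path.join(dbdir, ".workqueue_design"))   # drop old views

print(sorted(os.listdir(dbdir)))  # ['workqueue.couch', 'workqueue.couch.bak']
```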

Keep an eye on the couch.log just to make sure it's healthy. Thanks!
