workqueue couch db replication procedure

Alan Malta Rodrigues edited this page Oct 18, 2018 · 7 revisions

Introduction

From time to time we need to clean up the central couch databases because they accumulate too many deleted docs (>90%), which causes:

  • a slowdown in the system in general, including replication to the agents workqueue_inbox db
  • an increase in the views size (central workqueue views are now about 70GB!)
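To judge whether a database has reached that point, the share of deleted documents can be computed from the doc_count and doc_del_count fields that couch reports in the database summary (GET on the database URL). The helper below is only a minimal sketch; the function name deleted_pct and the numbers in the example are made up for illustration.

```shell
# Print the percentage of deleted documents (integer, rounded down).
# $1: doc_count, $2: doc_del_count, both taken from the database summary.
deleted_pct() {
    total=$(( $1 + $2 ))
    if [ "$total" -eq 0 ]; then
        # Empty database: avoid dividing by zero.
        echo 0
        return 0
    fi
    echo $(( $2 * 100 / total ))
}

# Example with made-up numbers (50000 live docs, 950000 deleted docs):
# deleted_pct 50000 950000
```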

There are at least two ways to clean up couch databases: either by deleting the whole database (when we do not care about its data, since it can get replicated back), or by temporarily creating another database containing only the documents we want to keep (the so-called permanent docs). For the workqueue database in CMSWEB, the latter is the procedure we need to follow.

Since we migrated to CouchServer 1.6.1, the _replicate database is gone and all replication tasks are carried out by the _replicator db.

It's also worth mentioning that Alan ran into several couch oddities (and sometimes couch/replication crashes), with problems like:

  • permission denied to create documents in the new database (thus requiring user_ctx option, more below)
  • dozens/hundreds of processes triggering the couch replication, which caused most of the instabilities (can be seen with _active_tasks, which prints TONS of replication tasks instead of only 1)
  • a replication filter written in erlang causes many more replication processes to be spawned, increasing the couch server instability.
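The runaway replication symptom can be spotted by counting the replication entries in the _active_tasks output. The following is a rough sketch and not part of the original procedure; count_replication_tasks is a hypothetical helper name, and it assumes that each replication task shows up with "type":"replication" in the JSON.

```shell
# Count the replication entries in the _active_tasks JSON read from stdin.
count_replication_tasks() {
    # grep -o prints each match on its own line, wc -l counts those lines,
    # and tr strips the padding that some wc implementations add.
    grep -o '"type"[[:space:]]*:[[:space:]]*"replication"' | wc -l | tr -d ' '
}

# Usage against a local couch server:
# curl -ks "http://localhost:5984/_active_tasks" | count_replication_tasks
```

A healthy setup for this procedure should report 1; a much larger number hints at the instability described above.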

Step-by-step to replicate workqueue couch db

These instructions are meant to be performed in one of the developers' VMs, after copying over the original workqueue.couch database file from the CMSWEB production node. When running these steps on CMSWEB, there might be minor changes, such as the user_ctx option and/or declaring an outage before swapping the db files.

Let us start by creating the replication filter function in javascript. This function is very simple: it does NOT replicate documents that are deleted (doc._deleted: true), while all the other documents get replicated. Run the following command in the terminal:

cat > /tmp/filter.js << EOF
function(doc, req) {
    if (doc._deleted){
       return false;
    }
    return true;
}
EOF

Now we need to move this tmp file under the workqueue couchapps directory AND correct/update the ownership and permissions (owner might be different in CMSWEB):

mv /tmp/filter.js /data/srv/current/apps/workqueue/data/couchapps/WorkQueue/filters/FilterDeletedDocs.js 
sudo chown _sw:_sw /data/srv/current/apps/workqueue/data/couchapps/WorkQueue/filters/FilterDeletedDocs.js 

Make sure the correct workqueue.couch database (the one you want to replicate/clean up) is in place, with the right permissions/ownership. The file is:

/data/srv/state/couchdb/database/workqueue.couch

Restart the couch server so that it uploads the new filter function (in FilterDeletedDocs.js) to the WorkQueue design document:

(A=/data/cfg/admin; cd /data; $A/InstallDev -s stop:couchdb)
(A=/data/cfg/admin; cd /data; $A/InstallDev -s start:couchdb)

OPTIONAL: we might have to fix the permissions of this file to avoid errors in the logs (which have always been there). Not sure whether CMSWEB needs to do this:

sudo chmod 770 /data/srv/current/auth/couchdb/hmackey.ini

For CMSWEB nodes, we need to disable the cronjobs that run couch database and view compaction, to avoid overloading the node. Check that before proceeding to the next step.

Then it's time to start the replication by creating a new document in the _replicator database. Run this command:

curl -ks -X POST http://localhost:5984/_replicator -H 'Content-Type:application/json' -d '{"_id":"rep_wq", "source":"workqueue", "target":"newwq", "continuous": true, "create_target":true, "filter":"WorkQueue/FilterDeletedDocs", "user_ctx": {"name": "_admin", "roles": ["_admin"]}}'

The options are described below:

  • _id: we can define the doc id in couch 1.6.1, which helps to keep track of this document
  • source: the database to replicate documents from
  • target: the database to replicate documents to (if they are both local, we can just provide their names)
  • continuous: set to true to make sure any new documents written to the source will also get replicated to the target
  • create_target: creates the target database if it does not exist
  • filter: the filter function which every single source document has to pass before getting replicated
  • user_ctx: we need to provide a user and its role in order to write documents to the new database (CHECK: is it the same in CMSWEB?)

You get a revision number from the previous command, but you can also check the state of this replication by fetching the replication document you just posted:

curl -ks "http://localhost:5984/_replicator/rep_wq"
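In couch 1.x the server normally adds a _replication_state field to the replication document (typically triggered, completed or error). As a small sketch, that field can be pulled out of the response with sed; replication_state is a hypothetical helper name, and the parsing assumes the document is returned on a single line.

```shell
# Print the value of _replication_state from a replication document (JSON on stdin).
# The closing quote in the pattern keeps it from matching _replication_state_time.
replication_state() {
    sed -n 's/.*"_replication_state"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Usage against a local couch server:
# curl -ks "http://localhost:5984/_replicator/rep_wq" | replication_state
```

An empty output means the field is not there yet, so the replication has probably not been picked up by the server.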

In order to check the status of the replication, one can run the _active_tasks command or get a summary of each database:

curl -ks "http://localhost:5984/_active_tasks" 
curl -ks -X GET 'http://localhost:5984/workqueue'
curl -ks -X GET 'http://localhost:5984/newwq'

Once the newwq database reports the same doc_count as the source database (workqueue), the replication is up-to-date.
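The doc_count comparison can be scripted instead of done by eye. Below is a minimal sketch, assuming the database summary comes back as single-line JSON; doc_count and same_doc_count are hypothetical helper names, not part of any couch tooling.

```shell
# Print the doc_count value from a database summary (JSON on stdin).
# The quotes in the pattern keep it from matching doc_del_count.
doc_count() {
    sed -n 's/.*"doc_count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Succeed only when both summaries ($1 and $2) report the same doc_count.
same_doc_count() {
    src=$(printf '%s\n' "$1" | doc_count)
    dst=$(printf '%s\n' "$2" | doc_count)
    [ -n "$src" ] && [ "$src" = "$dst" ]
}

# Usage against a local couch server:
# same_doc_count "$(curl -ks http://localhost:5984/workqueue)" \
#                "$(curl -ks http://localhost:5984/newwq)" && echo "in sync"
```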

On the day of swapping the databases in the CMSWEB environment, we need to create a service outage to make sure that no new documents are written to the old database without also reaching the new one. The required steps include:

  • set access to couch workqueue to nowhere in the frontend map rules
  • watch the couch.log until the read/write activities stop (around 5min)
  • check again the number of valid documents in workqueue and newwq. IF they are the same, you can stop the Couch server. ELSE, wait a bit longer watching couch.log and run this check again.
  • given that both databases have the same number of valid documents (doc_count), move the workqueue.couch file somewhere else and rename newwq.couch to workqueue.couch
  • delete ALL the workqueue view files in /data/srv/state/couchdb/database/.workqueue_design/
  • at this point you should have the new database in place, and old views completely removed from the system, thus you can restart Couch server
  • trigger the views building with the following command:
curl -ks -X GET 'http://localhost:5984/workqueue/_design/WorkQueue/_view/elementsByStatus' | head
  • restore the frontend maps and that's all!
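The file swap and view cleanup from the list above can be sketched as a small shell function, to be run only while the Couch server is stopped. This is an illustration under assumptions and not a tested CMSWEB recipe: swap_workqueue_db is a hypothetical name, the backup directory is whatever location you choose, and ownership/permissions of the renamed file still have to be checked by hand afterwards.

```shell
# Swap the database files while couch is stopped.
# $1: couch database directory, $2: existing directory that will keep the old file.
swap_workqueue_db() {
    dbdir=$1
    backupdir=$2
    # Refuse to touch anything unless both database files and the backup dir exist.
    [ -f "$dbdir/workqueue.couch" ] || { echo "missing workqueue.couch" >&2; return 1; }
    [ -f "$dbdir/newwq.couch" ] || { echo "missing newwq.couch" >&2; return 1; }
    [ -d "$backupdir" ] || { echo "missing backup directory" >&2; return 1; }
    # Keep the old database aside, then put the cleaned one in its place.
    mv "$dbdir/workqueue.couch" "$backupdir/workqueue.couch.old" || return 1
    mv "$dbdir/newwq.couch" "$dbdir/workqueue.couch" || return 1
    # Remove the stale view files so they are rebuilt from the new database.
    if [ -d "$dbdir/.workqueue_design" ]; then
        find "$dbdir/.workqueue_design" -type f -exec rm -f {} +
    fi
}

# Usage (the backup path is a placeholder, pick your own):
# swap_workqueue_db /data/srv/state/couchdb/database /path/to/backup
```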