Data Restoration
The best kind of replication is one that requires no manual intervention, like ClickHouse's ReplicatedMergeTree. High-availability systems can be constructed that tolerate single-node failures. Sooner or later, maintainers of larger-scale setups also have to consider whole-cluster failures, and those call for some degree of manual data restoration.
At the scale of my time-series deployment, several million messages/sec with several aggregations, I have found it useful to keep a small restoration procedure separate from a big one, as described below.
A quick replacement of a small gap in the data, on the order of tens to hundreds of millions of rows (minutes of my time-series stream). Such a gap may be caused by an awkward node restart or a network glitch. Use ClickHouse Copier or simply a manual DELETE-SELECT-INSERT. I might do:
- ALTER TABLE x DELETE WHERE t BETWEEN t0 AND t1
- SELECT FROM x WHERE t BETWEEN t0 AND t1 and INSERT INTO x, with native protocol network transport between the two using netcat or rsync
- repeat for each table, rounding the (t0, t1) interval appropriately
The rounding is necessary for those of my tables that are time-resolution aggregates.
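A minimal sketch of that patch, assuming a table x with a time column t, hourly aggregation resolution, and hosts src.example / dst.example; every name and bound here is a placeholder, not the exact commands from my setup:

    # round the interval to the table's resolution first, e.g. whole hours for an hourly rollup
    T0="2024-01-01 10:00:00"
    T1="2024-01-01 12:00:00"

    # drop the damaged rows on the destination (an asynchronous mutation;
    # watch system.mutations until it finishes before re-inserting)
    clickhouse-client --host dst.example --query \
      "ALTER TABLE x DELETE WHERE t BETWEEN '$T0' AND '$T1'"

    # stream the same interval from a healthy source and re-insert it,
    # keeping the data in the Native format end to end
    clickhouse-client --host src.example --query \
      "SELECT * FROM x WHERE t BETWEEN '$T0' AND '$T1' FORMAT Native" \
      | clickhouse-client --host dst.example --query "INSERT INTO x FORMAT Native"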
This is an easy (a bunch of SQL), flexible (one can transform the schema in the SELECT statement as desired, plus do arbitrary resharding) and safe (everything goes through the command-line client) procedure. It is slowed down by having to serialise, deserialise and re-merge the inserted data. I might get a million rows/sec on my cluster with magnetic disk drives.
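To illustrate the flexibility point, here is a hypothetical SELECT that both reshapes the data and picks out only the rows destined for one shard of the target cluster; the columns device_id and bytes and the shard count of 4 are made up for the example:

    SELECT
        toStartOfMinute(t) AS t,            -- coarsen time resolution on the way in
        device_id,
        sum(bytes) AS bytes                 -- re-aggregate while copying
    FROM x
    WHERE t BETWEEN '2024-01-01 10:00:00' AND '2024-01-01 12:00:00'
      AND cityHash64(device_id) % 4 = 2     -- only rows for destination shard 2
    GROUP BY t, device_id
    FORMAT Native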
Repair of incidents lasting up to several days. Copy and attach entire merged parts. What I do is (see script example):
- SELECT path FROM system.parts WHERE table = 'x' AND partition BETWEEN t0 AND t1 to list parts in partitions overlapping the given interval, rounded to the table's partition size
- copy those parts from their filesystem location to the destination cluster's table detached directory (I use netcat over TCP)
- ALTER TABLE x DELETE WHERE t BETWEEN t0 AND t1
- ALTER TABLE x ATTACH PART on all copied parts
- repeat for each table
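A rough sketch of those steps, not the actual script, assuming a table x in the default database, the standard /var/lib/clickhouse data path, daily partitions, and a netcat listener already running on the destination; every path, host and port is illustrative:

    # 1. on the source node: list active parts in the affected (day-sized) partitions
    clickhouse-client --query "
      SELECT path FROM system.parts
      WHERE database = 'default' AND table = 'x' AND active
        AND partition BETWEEN '2024-01-01' AND '2024-01-03'" > parts.txt

    # 2. ship the part directories into the destination table's detached/ directory;
    #    on the destination node first run (netcat flag syntax varies by flavour):
    #      cd /var/lib/clickhouse/data/default/x/detached && nc -l 9999 | tar xf -
    tar cf - -C /var/lib/clickhouse/data/default/x \
        $(xargs -n1 basename < parts.txt) | nc dst.example 9999

    # 3. clear the damaged interval on the destination, then attach every copied part
    clickhouse-client --host dst.example --query \
      "ALTER TABLE x DELETE WHERE t BETWEEN '2024-01-01 00:00:00' AND '2024-01-03 23:59:59'"
    while read -r p; do
      clickhouse-client --host dst.example --query "ALTER TABLE x ATTACH PART '$(basename "$p")'"
    done < parts.txt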
Rounding to the partition size accommodates parts that have merged fully, i.e. a single part covering a whole partition.
This procedure is faster than SELECT-INSERT patching; I might get a couple of million rows/sec on my cluster. A decently formulated benchmark would be useful here, and I might add one at some point.
The main disadvantage is the potential for messing something up through direct filesystem access. The schema of the source and destination tables must be identical, with no SQL tweaks possible. Finally, you cannot reshuffle many shards into one, since part names collide.