From ab1c815f645f702bc87c9dc21b18bed3ceed3e74 Mon Sep 17 00:00:00 2001
From: Asias He <asias@scylladb.com>
Date: Tue, 2 Apr 2024 09:21:05 +0800
Subject: [PATCH] [criteo] repair: Improve estimated_partitions to reduce
 memory usage

Currently, we use the sum of the estimated_partitions from each
participant node as the estimated_partitions for the sstables produced
by repair. This sum is the largest number of partitions that repair
could possibly write.

Since repair writes only the difference between the participant nodes,
using the largest possible estimate overestimates the number of
partitions written by repair most of the time.

The problem is that an overestimated partition count makes the bloom
filter consume more memory. This has been observed to cause OOM in the
field.

This patch changes the estimate to one tenth of the average partitions
per node instead of the sum. It is still not a perfect estimate, but it
already reduces memory usage significantly.
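
As a rough illustration of the arithmetic only, here is a standalone
sketch. The helper name estimate_repair_partitions and the input vector
are hypothetical and do not exist in repair/row_level.cc; the sketch
also assumes the sum was taken over the same set of nodes that it is
later divided by.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper for illustration; not part of repair/row_level.cc.
// Input: one partition estimate per participant node (assumed).
std::uint64_t estimate_repair_partitions(const std::vector<std::uint64_t>& per_node) {
    // Old behaviour: the sum over all participants.
    std::uint64_t estimate = std::accumulate(per_node.begin(), per_node.end(), std::uint64_t{0});
    if (!per_node.empty()) {
        // New behaviour, step 1: average instead of sum.
        estimate /= per_node.size();
        // Step 2: assume repair writes about 10% of that average.
        estimate /= 10;
    }
    return estimate;
}
```

With three nodes reporting 1000000, 1200000 and 800000 partitions, the
old estimate would be 3000000 and the new one is 100000. Both divisions
are integer divisions, so very small inputs round down to 0.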

Fixes #18140

Criteo: cherry-picked from https://github.com/scylladb/scylladb/pull/18141 to
reduce the probability of bad_alloc during repair
---
 repair/row_level.cc | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/repair/row_level.cc b/repair/row_level.cc
index d4534c6376b3..1b9af82faa87 100644
--- a/repair/row_level.cc
+++ b/repair/row_level.cc
@@ -2787,6 +2787,26 @@ class row_level_repair {
                     });
                 }).get();
 
+                if (!master.all_nodes().empty()) {
+                    // Use the average number of partitions, instead of the sum
+                    // of the partitions, as the estimated partitions in a
+                    // given range. The bigger the estimated partitions, the
+                    // more memory the sstable's bloom filter would consume.
+                    _estimated_partitions /= master.all_nodes().size();
+
+                    // In addition, assume that the difference between nodes is
+                    // less than 10% for regular repair. Underestimation will
+                    // not be a big problem since those sstables produced by
+                    // repair will go through off-strategy later anyway. The
+                    // worst case is that we have a worse false positive ratio
+                    // than expected temporarily when the sstable is still in
+                    // maintenance set.
+                    //
+                    // To save memory and avoid extra special cases, we use
+                    // the same 10% estimate for RBNO repair as well.
+                    _estimated_partitions /= 10;
+                }
+
                 parallel_for_each(master.all_nodes(), [&, this] (repair_node_state& ns) {
                     const auto& node = ns.node;
                     rlogger.trace("Get repair_set_estimated_partitions for node={}, estimated_partitions={}", node, _estimated_partitions);