When using pglogical_create_subscriber to create a new logical subscriber, there is a race condition between two processes updating/reading the sync_status field of pglogical.local_sync_status. The two processes are:
the pglogical_create_subscriber process
the "pglogical apply worker" created by the first process (indirectly), using SELECT pglogical.create_subscription
There is a tiny time window in which the "pglogical apply worker" can start its work using the wrong state. The window is at these lines in the pglogical_subscribe function of pglogical_create_subscriber.c.
If the kernel switches the pglogical_create_subscriber process out for too long after the first query in the referenced lines has executed, the "pglogical apply worker" will run the pglogical_sync_subscription function with sync_status == SYNC_STATUS_INIT. Starting pglogical_sync_subscription in the INIT status is not OK, as it will re-create the replication slot and use the new snapshot's LSN as the logical replication origin.
I created a small patch that makes the bug very easy to reproduce on a system that has data coming in while the logical replica is being created. Here is the patch:
diff --git a/pglogical_create_subscriber.c b/pglogical_create_subscriber.c
index 509e165..e5b6552 100644
--- a/pglogical_create_subscriber.c
+++ b/pglogical_create_subscriber.c
@@ -1108,6 +1108,9 @@ pglogical_subscribe(PGconn *conn, char *subscriber_name, char *subscriber_dsn,
}
PQclear(res);
+ printf("========> Insert some new data on primary now (in the next 30 seconds), these data will be lost in subscriber\n");
+ sleep(30);
+
resetPQExpBuffer(&query);
initPQExpBuffer(&repsets);
@@ -1133,6 +1136,12 @@ pglogical_subscribe(PGconn *conn, char *subscriber_name, char *subscriber_dsn,
}
PQclear(res);
+ // This sleep makes the race condition easy to reproduce
+ //
+ // This sleep will allow the subscription worker process to start first with sync_status == 'i',
+ // before the local_sync_status.sync_status field is set to 'r'.
+ sleep(30);
+
/* TODO */
res = PQexec(conn, "UPDATE pglogical.local_sync_status SET sync_status = 'r'");
if (PQresultStatus(res) != PGRES_COMMAND_OK)
(the above patch is under MIT license)
I was able to get around this issue by creating the node, node_interface, subscription, and local_sync_status resources manually instead of using pglogical_create_subscriber, then creating a replication origin, and a replication slot on the remote node. I can then start replication with the alter_subscription_enable function, and it re-uses the existing replication slot.
Is there any step I am missing, or any downside to the above approach, for pglogical to work properly? I do not notice anything unusual ATM.
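For reference, a rough sketch of the sequence I mean is below. This is only an outline under my assumptions: the slot, origin, and subscription names are placeholders (pglogical expects specific pgl_... names matching the subscription), and the manual inserts into the node / node_interface / subscription / local_sync_status catalogs are omitted because their columns differ between versions. The functions themselves (pg_create_logical_replication_slot, pg_replication_origin_create, pglogical.alter_subscription_enable) are standard PostgreSQL / pglogical calls.

-- On the provider ("remote") node: create the replication slot the
-- subscription will re-use. Placeholder name; it must match whatever
-- the manually created subscription expects.
SELECT pg_create_logical_replication_slot('pgl_placeholder_slot', 'pglogical_output');

-- On the subscriber node: create the replication origin the apply worker
-- uses to track progress. Placeholder name again.
SELECT pg_replication_origin_create('pgl_placeholder_origin');

-- On the subscriber node, after the node / node_interface / subscription /
-- local_sync_status rows have been created manually: start replication.
SELECT pglogical.alter_subscription_enable('my_subscription', true);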