While testing, I've observed an intermittent test failure within the Ruby test InterruptResumeTest#test_interrupt_resume_idempotence_with_writes_to_source. When debugging it, I can see that all of the Ghostferry goroutines are stuck on logrus mutex methods [1]. At the same time, the Ruby process is always trying to modify a row via the data writer. That modification query appears to be stuck waiting for a lock, presumably because the data iterator just went over that row and issued a FOR UPDATE against it (BEFORE_ROW_COPY and AFTER_ROW_COPY in integrationferry are sent while the FOR UPDATE is held, which makes the situation worse).
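To make the suspected window concrete, here is a rough Go sketch of the ordering described above. This is my illustration, not Ghostferry's actual DataIterator/cursor code; fireEvent, writeRowsToTarget, and the exact SQL are assumptions.

```go
// Rough sketch (not Ghostferry's actual code) of the suspected ordering: the cursor
// takes the FOR UPDATE locks and only releases them at commit, after the listeners run.
package sketch

import "database/sql"

// fireEvent and writeRowsToTarget are placeholders for the integration listeners
// (an HTTP round trip to the test server) and the target-side write.
func fireEvent(name string)         {}
func writeRowsToTarget(r *sql.Rows) {}

func copyBatch(db *sql.DB, lo, hi uint64) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback()

	// The row locks are taken here...
	rows, err := tx.Query("SELECT * FROM t WHERE id BETWEEN ? AND ? FOR UPDATE", lo, hi)
	if err != nil {
		return err
	}
	defer rows.Close()

	// ...and are still held while the listeners run. In the integration test,
	// BEFORE_ROW_COPY / AFTER_ROW_COPY are sent to the test server here, so any
	// UPDATE the Ruby data writer issues against these rows blocks until the
	// transaction commits below.
	fireEvent("BEFORE_ROW_COPY")
	writeRowsToTarget(rows)
	fireEvent("AFTER_ROW_COPY")

	// The FOR UPDATE locks are released only at commit.
	return tx.Commit()
}
```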
I'm not quite sure why this is the case. It's possible that the problem lies in an upstream logrus issue (sirupsen/logrus#1201), or that it's somehow a race involving the DataIterator, which holds the lock while trying to send BEFORE_ROW_COPY and AFTER_ROW_COPY to the integration test server (the reader should check integrationferry.go to see this, and check how the listeners on the data iterator + cursor interact with the FOR UPDATE lock). I'm not sure which is more likely; if it were purely a problem with logrus, we should presumably see the same deadlock in production.
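For anyone trying to confirm where the goroutines are blocked, the per-goroutine stack trace in [1] can be reproduced with a plain goroutine dump; below is a minimal, generic sketch (not Ghostferry-specific, and the choice of SIGUSR1 is arbitrary) using runtime/pprof. Sending SIGQUIT to the Go process produces an equivalent dump.

```go
// Minimal, generic sketch for dumping per-goroutine stacks on demand.
package main

import (
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
	"time"
)

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGUSR1)

	go func() {
		for range sigs {
			// debug=2 prints the full stack of every goroutine, which is how
			// you can see them parked on the logrus mutex methods.
			pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		}
	}()

	// Stand-in for the real workload.
	time.Sleep(time.Hour)
}
```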
I've found that reducing the frequency of the data writer makes the deadlock go away. Notably, even a very small sleep (5ms) solves the issue, despite only reducing the total number of modification queries from about 30 to 22 (over the duration during which the data writer is active). I'm not sure why this makes such a big difference, either.
A failure grinding script is included below, at [2].
[1] Ghostferry per-goroutine stack trace:
[2] Failure grinding script
Before running this, you should comment out the sleep in test/helpers/data_writer_helper.rb, after the call to write_data.