Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

loop shutdown should not call CancelRequest for inflight requests because it causes a race with the code that's executing those requests, we should simply wait until those requests complete by themselves #1193

Merged
merged 8 commits into from
May 21, 2024

Conversation

qkrorlqr
Copy link
Collaborator

@qkrorlqr qkrorlqr commented May 14, 2024

Data race example:

WARNING: ThreadSanitizer: data race (pid=238490)
  Atomic write of size 8 at 0x7b20001f2c50 by main thread (mutexes: write M0):
    #0 AtomicCas /actions-runner/_work/nbs/nbs/library/cpp/deprecated/atomic/atomic_gcc.h:67:12 (filestore-vhost+0x10cf2eb8) (BuildId: ba951d02e6e461eddf1019ce03d1fdf7c3ddcc51)
    #1 std::__y1::invoke_result<NLWTrace::TOrbit::AddProbe(NLWTrace::TProbe*, NLWTrace::TParams const&, unsigned long)::'lambda'(TIntrusivePtr<NLWTrace::IShuttle, TDefaultIntrusivePtrOps<NLWTrace::IShuttle>>&), TIntrusivePtr<NLWTrace::IShuttle, TDefaultIntrusivePtrOps<NLWTrace::IShuttle>>&>::type NLWTrace::TOrbit::NotConcurrent<NLWTrace::TOrbit::AddProbe(NLWTrace::TProbe*, NLWTrace::TParams const&, unsigned long)::'lambda'(TIntrusivePtr<NLWTrace::IShuttle, TDefaultIntrusivePtrOps<NLWTrace::IShuttle>>&)>(NLWTrace::TOrbit::AddProbe(NLWTrace::TProbe*, NLWTrace::TParams const&, unsigned long)::'lambda'(TIntrusivePtr<NLWTrace::IShuttle, TDefaultIntrusivePtrOps<NLWTrace::IShuttle>>&)) /actions-runner/_work/nbs/nbs/library/cpp/lwtrace/shuttle.h:337:39 (filestore-vhost+0x10cf2eb8)
    #2 NLWTrace::TOrbit::AddProbe /actions-runner/_work/nbs/nbs/library/cpp/lwtrace/shuttle.h:232:13 (filestore-vhost+0x26b7a698) (BuildId: ba951d02e6e461eddf1019ce03d1fdf7c3ddcc51)
    #3 NLWTrace::TProbe::RunShuttles /actions-runner/_work/nbs/nbs/library/cpp/lwtrace/probe.h:164:19 (filestore-vhost+0x26b7a698)
    #4 NLWTrace::TUserProbe<TBasicString<char, std::__y1::char_traits<char> >, unsigned long, NLWTrace::TNil, NLWTrace::TNil, NLWTrace::TNil, NLWTrace::TNil, NLWTrace::TNil, NLWTrace::TNil, NLWTrace::TNil, NLWTrace::TNil, NLWTrace::TNil, NLWTrace::TNil, NLWTrace::TNil, NLWTrace::TNil, NLWTrace::TNil, NLWTrace::TNil, NLWTrace::TNil>::RunShuttles /actions-runner/_work/nbs/nbs/library/cpp/lwtrace/probe.h:257:19 (filestore-vhost+0x26b7a698)
    #5 NCloud::NFileStore::NFuse::(anonymous namespace)::CancelRequest(TLog&, NCloud::NFileStore::IRequestStats&, NCloud::NFileStore::TCallContext&, fuse_req*, fuse_cancelation_code) /actions-runner/_work/nbs/nbs/cloud/filestore/libs/vfs_fuse/loop.cpp:55:5 (filestore-vhost+0x26b7a698)
    #6 NCloud::NFileStore::NFuse::(anonymous namespace)::TCompletionQueue::Stop(fuse_cancelation_code) /actions-runner/_work/nbs/nbs/cloud/filestore/libs/vfs_fuse/loop.cpp:142:13 (filestore-vhost+0x26b95705) (BuildId: ba951d02e6e461eddf1019ce03d1fdf7c3ddcc51)
    #7 NCloud::NFileStore::NFuse::(anonymous namespace)::TFileSystemLoop::SuspendAsync() /actions-runner/_work/nbs/nbs/cloud/filestore/libs/vfs_fuse/loop.cpp:572:26 (filestore-vhost+0x26b70447) (BuildId: ba951d02e6e461eddf1019ce03d1fdf7c3ddcc51)
    #8 NCloud::NFileStore::NVhost::(anonymous namespace)::TEndpoint::SuspendAsync() /actions-runner/_work/nbs/nbs/cloud/filestore/libs/endpoint_vhost/listener.cpp:46:22 (filestore-vhost+0x270e7271) (BuildId: ba951d02e6e461eddf1019ce03d1fdf7c3ddcc51)
    #9 NCloud::NFileStore::(anonymous namespace)::TEndpointManager::Stop() /actions-runner/_work/nbs/nbs/cloud/filestore/libs/endpoint/service.cpp:133:50 (filestore-vhost+0x26c47f9e) (BuildId: ba951d02e6e461eddf1019ce03d1fdf7c3ddcc51)
    #10 NCloud::NFileStore::(anonymous namespace)::TAuthService::Stop() /actions-runner/_work/nbs/nbs/cloud/filestore/libs/endpoint/service_auth.cpp:42:18 (filestore-vhost+0x26c7821f) (BuildId: ba951d02e6e461eddf1019ce03d1fdf7c3ddcc51)
    #11 NCloud::NFileStore::NDaemon::TBootstrapVhost::StopComponents() /actions-runner/_work/nbs/nbs/cloud/filestore/libs/daemon/vhost/bootstrap.cpp:323:5 (filestore-vhost+0x26c12097) (BuildId: ba951d02e6e461eddf1019ce03d1fdf7c3ddcc51)
    #12 NCloud::NFileStore::NDaemon::TBootstrapCommon::Stop() /actions-runner/_work/nbs/nbs/cloud/filestore/libs/daemon/common/bootstrap.cpp:144:5 (filestore-vhost+0x1b932593) (BuildId: ba951d02e6e461eddf1019ce03d1fdf7c3ddcc51)
    #13 int NCloud::DoMain<NCloud::NFileStore::NDaemon::TBootstrapVhost>(NCloud::NFileStore::NDaemon::TBootstrapVhost&, int, char**) /actions-runner/_work/nbs/nbs/cloud/storage/core/libs/daemon/app.h:47:23 (filestore-vhost+0xf4c1d4e) (BuildId: ba951d02e6e461eddf1019ce03d1fdf7c3ddcc51)
    #14 main /actions-runner/_work/nbs/nbs/cloud/filestore/apps/vhost/main.cpp:22:12 (filestore-vhost+0xf4c1b14) (BuildId: ba951d02e6e461eddf1019ce03d1fdf7c3ddcc51)

  Previous read of size 8 at 0x7b20001f2c50 by thread T50:
    #0 TIntrusivePtr<NLWTrace::IShuttle, TDefaultIntrusivePtrOps<NLWTrace::IShuttle> >::Get /actions-runner/_work/nbs/nbs/util/generic/ptr.h:560:16 (filestore-vhost+0x1bd8c96d) (BuildId: ba951d02e6e461eddf1019ce03d1fdf7c3ddcc51)
    #1 NLWTrace::TOrbit::HasShuttles /actions-runner/_work/nbs/nbs/library/cpp/lwtrace/shuttle.h:203:31 (filestore-vhost+0x1bd8c96d)
    #2 NLWTrace::TManager::CreateTraceRequest(NLWTrace::TTraceRequest&, NLWTrace::TOrbit&) /actions-runner/_work/nbs/nbs/library/cpp/lwtrace/control.cpp:37:27 (filestore-vhost+0x1bd8c96d)
    #3 NCloud::(anonymous namespace)::TTraceSerializer::BuildTraceRequest(NLWTrace::TTraceRequest&, NLWTrace::TOrbit&) /actions-runner/_work/nbs/nbs/cloud/storage/core/libs/diagnostics/trace_serializer.cpp:105:19 (filestore-vhost+0x1bd8bf03) (BuildId: ba951d02e6e461eddf1019ce03d1fdf7c3ddcc51)
    #4 void NCloud::NFileStore::NStorage::TStorageServiceActor::ForwardRequest<NCloud::NFileStore::NStorage::TEvService::TWriteDataMethod>(NActors::TActorContext const&, NCloud::NFileStore::NStorage::TEvService::TWriteDataMethod::TRequest::TPtr const&) /actions-runner/_work/nbs/nbs/cloud/filestore/libs/storage/service/service_actor_forward.cpp:49:22 (filestore-vhost+0x266fcc46) (BuildId: ba951d02e6e461eddf1019ce03d1fdf7c3ddcc51)
    #5 NCloud::NFileStore::NStorage::TStorageServiceActor::HandleWriteData(TAutoPtr<NActors::TEventHandle<NCloud::NFileStore::NStorage::TProtoRequestEvent<NCloud::NFileStore::NProto::TWriteDataRequest, 275054747u>>, TDelete> const&, NActors::TActorContext const&) /actions-runner/_work/nbs/nbs/cloud/filestore/libs/storage/service/service_actor_writedata.cpp:411:9 (filestore-vhost+0x26776c88) (BuildId: ba951d02e6e461eddf1019ce03d1fdf7c3ddcc51)
    #6 NCloud::NFileStore::NStorage::TStorageServiceActor::HandleRequests(TAutoPtr<NActors::IEventHandle, TDelete>&) /actions-runner/_work/nbs/nbs/cloud/filestore/libs/storage/service/service_actor.cpp:128:9 (filestore-vhost+0x26585c1b) (BuildId: ba951d02e6e461eddf1019ce03d1fdf7c3ddcc51)
    #7 NCloud::NFileStore::NStorage::TStorageServiceActor::StateWork(TAutoPtr<NActors::IEventHandle, TDelete>&) /actions-runner/_work/nbs/nbs/cloud/filestore/libs/storage/service/service_actor.cpp:152:18 (filestore-vhost+0x265831c0) (BuildId: ba951d02e6e461eddf1019ce03d1fdf7c3ddcc51)
    #8 NActors::TActorCallbackBehaviour::Receive(NActors::IActor*, TAutoPtr<NActors::IEventHandle, TDelete>&) /actions-runner/_work/nbs/nbs/contrib/ydb/library/actors/core/actor.cpp:232:9 (filestore-vhost+0x10c30133) (BuildId: ba951d02e6e461eddf1019ce03d1fdf7c3ddcc51)
    #9 NActors::IActor::Receive(TAutoPtr<NActors::IEventHandle, TDelete>&) /actions-runner/_work/nbs/nbs/contrib/ydb/library/actors/core/actor.h:526:23 (filestore-vhost+0x10cf3a22) (BuildId: ba951d02e6e461eddf1019ce03d1fdf7c3ddcc51)
    #10 NActors::TExecutorThread::TProcessingResult NActors::TExecutorThread::Execute<NActors::TMailboxTable::TRevolvingMailbox>(NActors::TMailboxTable::TRevolvingMailbox*, unsigned int, bool) /actions-runner/_work/nbs/nbs/contrib/ydb/library/actors/core/executor_thread.cpp:236:28 (filestore-vhost+0x10ce0d33) (BuildId: ba951d02e6e461eddf1019ce03d1fdf7c3ddcc51)
    #11 NActors::TExecutorThread::ProcessExecutorPool(NActors::IExecutorPool*)::$_0::operator()(unsigned int, bool) const /actions-runner/_work/nbs/nbs/contrib/ydb/library/actors/core/executor_thread.cpp:416:25 (filestore-vhost+0x10cd9132) (BuildId: ba951d02e6e461eddf1019ce03d1fdf7c3ddcc51)
    #12 NActors::TExecutorThread::ProcessExecutorPool(NActors::IExecutorPool*) /actions-runner/_work/nbs/nbs/contrib/ydb/library/actors/core/executor_thread.cpp:470:13 (filestore-vhost+0x10cd874e) (BuildId: ba951d02e6e461eddf1019ce03d1fdf7c3ddcc51)
    #13 NActors::TExecutorThread::ThreadProc() /actions-runner/_work/nbs/nbs/contrib/ydb/library/actors/core/executor_thread.cpp:515:13 (filestore-vhost+0x10cdaa43) (BuildId: ba951d02e6e461eddf1019ce03d1fdf7c3ddcc51)
    #14 void* (anonymous namespace)::ThreadProcWrapper<ISimpleThread>(void*) /actions-runner/_work/nbs/nbs/util/system/thread.cpp:383:45 (filestore-vhost+0xf729782) (BuildId: ba951d02e6e461eddf1019ce03d1fdf7c3ddcc51)
    #15 (anonymous namespace)::TPosixThread::ThreadProxy(void*) /actions-runner/_work/nbs/nbs/util/system/thread.cpp:244:20 (filestore-vhost+0xf72d11c) (BuildId: ba951d02e6e461eddf1019ce03d1fdf7c3ddcc51)

Current code calls CancelRequest for each inflight request upon loop shutdown/suspend. But inflight requests are flying all over the system and are being accessed by other threads. We have to wait till those requests complete, we can't cancel them with our current code architecture.

#1307

@qkrorlqr qkrorlqr added large-tests Launch large tests for PR filestore Add this label to run only cloud/filestore build and tests on PR tsan Launch builds with thread sanitizer along with regular build labels May 14, 2024
@librarian librarian added rebase Add this label if you want to rebase your PR for test run and removed rebase Add this label if you want to rebase your PR for test run labels May 14, 2024
@qkrorlqr qkrorlqr added the rebase Add this label if you want to rebase your PR for test run label May 14, 2024
@qkrorlqr qkrorlqr force-pushed the users/qkrorlqr/filestore-proper-graceful-shutdown branch from d3384ce to 642be6e Compare May 14, 2024 15:47
Copy link
Contributor

Note

This is an automated comment that will be appended during run.

🔴 linux-x86_64-release-tsan: some tests FAILED for commit 642be6e.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
1522 1519 0 2 1 0

Copy link
Contributor

Note

This is an automated comment that will be appended during run.

🔴 linux-x86_64-release-tsan: some tests FAILED for commit 642be6e.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
1522 1519 0 2 1 0

🔴 linux-x86_64-relwithdebinfo: some tests FAILED for commit 642be6e.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
1547 1542 0 4 1 0

@qkrorlqr qkrorlqr added asan Launch builds with address sanitizer along with regular build rebase Add this label if you want to rebase your PR for test run and removed rebase Add this label if you want to rebase your PR for test run asan Launch builds with address sanitizer along with regular build labels May 14, 2024
…ause it causes a race with the code that's executing those requests, we should simply wait until those requests complete by themselves
…ug in TBootstrap in fs_ut.cpp (Apply on a callback which returns TFuture<void> discards the future returned by the callback!)
@qkrorlqr qkrorlqr force-pushed the users/qkrorlqr/filestore-proper-graceful-shutdown branch from 519a647 to e3d4cd0 Compare May 15, 2024 11:14
@librarian librarian removed the rebase Add this label if you want to rebase your PR for test run label May 15, 2024
@qkrorlqr qkrorlqr added tsan Launch builds with thread sanitizer along with regular build and removed tsan Launch builds with thread sanitizer along with regular build labels May 15, 2024
Copy link
Contributor

Note

This is an automated comment that will be appended during run.

🔴 linux-x86_64-relwithdebinfo: some tests FAILED for commit e3d4cd0.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
1551 1519 0 12 20 0

Copy link
Contributor

Note

This is an automated comment that will be appended during run.

🔴 linux-x86_64-relwithdebinfo: some tests FAILED for commit e3d4cd0.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
1551 1519 0 12 20 0

🔴 linux-x86_64-release-tsan: some tests FAILED for commit e3d4cd0.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
1525 1506 0 8 11 0

@qkrorlqr qkrorlqr removed the tsan Launch builds with thread sanitizer along with regular build label May 15, 2024
Copy link
Contributor

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo: all tests PASSED for commit 98c1958.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
1546 1546 0 0 0 0

Copy link
Contributor

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo: all tests PASSED for commit 98c1958.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
1546 1546 0 0 0 0

🔴 linux-x86_64-release-tsan: some tests FAILED for commit 98c1958.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
1523 1515 0 2 6 0

Copy link
Contributor

Note

This is an automated comment that will be appended during run.

🔴 linux-x86_64-relwithdebinfo: some tests FAILED for commit 8dc6125.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
1554 1505 0 16 33 0

Copy link
Contributor

Note

This is an automated comment that will be appended during run.

🔴 linux-x86_64-relwithdebinfo: some tests FAILED for commit 77339fb.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
1554 1506 0 16 32 0

@qkrorlqr qkrorlqr added tsan Launch builds with thread sanitizer along with regular build and removed tsan Launch builds with thread sanitizer along with regular build labels May 18, 2024
Copy link
Contributor

Note

This is an automated comment that will be appended during run.

🔴 linux-x86_64-relwithdebinfo: some tests FAILED for commit e074cae.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
1547 1545 0 2 0 0

Copy link
Contributor

Note

This is an automated comment that will be appended during run.

🔴 linux-x86_64-relwithdebinfo: some tests FAILED for commit e074cae.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
1547 1545 0 2 0 0

🔴 linux-x86_64-release-tsan: some tests FAILED for commit e074cae.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
1523 1521 0 2 0 0

@qkrorlqr qkrorlqr added tsan Launch builds with thread sanitizer along with regular build and removed tsan Launch builds with thread sanitizer along with regular build labels May 21, 2024
Copy link
Contributor

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo: all tests PASSED for commit e074cae.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
1546 1546 0 0 0 0

Copy link
Contributor

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo: all tests PASSED for commit e074cae.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
1546 1546 0 0 0 0

🟢 linux-x86_64-release-tsan: all tests PASSED for commit e074cae.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
1522 1522 0 0 0 0

Copy link
Contributor

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo: all tests PASSED for commit 0ad0e2c.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
1546 1546 0 0 0 0

Copy link
Contributor

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo: all tests PASSED for commit 0ad0e2c.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
1546 1546 0 0 0 0

🟢 linux-x86_64-release-tsan: all tests PASSED for commit 0ad0e2c.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
1522 1522 0 0 0 0

@qkrorlqr qkrorlqr merged commit 075bd0d into main May 21, 2024
12 of 13 checks passed
@qkrorlqr qkrorlqr deleted the users/qkrorlqr/filestore-proper-graceful-shutdown branch May 21, 2024 17:35
qkrorlqr added a commit that referenced this pull request May 23, 2024
…ause it causes a race with the code that's executing those requests, we should simply wait until those requests complete by themselves (#1193)

* loop shutdown should not call CancelRequest for inflight requests because it causes a race with the code that's executing those requests, we should simply wait until those requests complete by themselves

* CompletionQueue shutdown should actually be async + fixed StopAsync bug in TBootstrap in fs_ut.cpp (Apply on a callback which returns TFuture<void> discards the future returned by the callback!)

* loop shutdown: SuspendAsync ut + some tmp dbg output

* loop shutdown: more tmp dbg output

* loop shutdown: more tmp dbg output

* loop shutdown: outputting dbg stuff under a lock, outputting dbg stuff only when ShouldStop flag is set

* loop shutdown: fixed CompletingCount vs Requests.empty() checks race

* loop shutdown: inflight count and completing count should be checked together atomically
qkrorlqr added a commit that referenced this pull request May 24, 2024
…, filestore CompactionMap empty entry cleanup (#1266)

* loop shutdown should not call CancelRequest for inflight requests because it causes a race with the code that's executing those requests, we should simply wait until those requests complete by themselves (#1193)

* loop shutdown should not call CancelRequest for inflight requests because it causes a race with the code that's executing those requests, we should simply wait until those requests complete by themselves

* CompletionQueue shutdown should actually be async + fixed StopAsync bug in TBootstrap in fs_ut.cpp (Apply on a callback which returns TFuture<void> discards the future returned by the callback!)

* loop shutdown: SuspendAsync ut + some tmp dbg output

* loop shutdown: more tmp dbg output

* loop shutdown: more tmp dbg output

* loop shutdown: outputting dbg stuff under a lock, outputting dbg stuff only when ShouldStop flag is set

* loop shutdown: fixed CompletingCount vs Requests.empty() checks race

* loop shutdown: inflight count and completing count should be checked together atomically

* issue-1128: entries in the CompactionMap table should be deleted instead of updating of both BlobCount and DeletionCount are 0 (#1260)

* filestore-vhost loop shutdown logging (#1265)

* filestore-vhost loop shutdown logging

* filestore-vhost loop shutdown logging - fix
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
filestore Add this label to run only cloud/filestore build and tests on PR large-tests Launch large tests for PR tsan Launch builds with thread sanitizer along with regular build
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants