[Filestore] Invalid CollectGarbage requests to blobstorage. #652

Open
debnatkh opened this issue Mar 5, 2024 · 3 comments

debnatkh commented Mar 5, 2024

Errors like the following started causing IndexTablet to restart:

NFS_SERVER[578573]: 2024-03-05T15:06:36.674043Z :NFS_TABLET ERROR: [f:***][t:***] CollectGarbage failed: SEVERITY_ERROR | FACILITY_KIKIMR | 1 Processed status# ERROR from VDisk# [8200021d:2:0:0:0] incarnationGuid# empty QuorumTracker status# ERROR

It looks like the CollectGarbage requests sent by TIndexTablet do not guarantee an increasing order of (gen, step).

We started seeing this error much more often after enabling vhost-side reads on the whole cluster.
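The ordering requirement can be sketched as a check over the barriers a tablet has sent so far. This is only an illustration: `TBarrierModel` and `IsNonDecreasing` are hypothetical names, and the sketch assumes, as the error above suggests, that blobstorage rejects a barrier that is lower than one it has already seen.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical model of a collect barrier: (gen, step).
// std::pair compares lexicographically: first by gen, then by step.
using TBarrierModel = std::pair<std::uint32_t, std::uint32_t>;

// Returns true if every barrier is greater than or equal to the previous one.
bool IsNonDecreasing(const std::vector<TBarrierModel>& sent)
{
    return std::is_sorted(sent.begin(), sent.end());
}
```

With this model, sending step 42 after step 44 within the same generation fails the check, which is the pattern described in this issue.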

@debnatkh debnatkh added bug Something isn't working filestore Add this label to run only cloud/filestore build and tests on PR labels Mar 5, 2024
@debnatkh debnatkh self-assigned this Mar 5, 2024
@qkrorlqr qkrorlqr self-assigned this Mar 5, 2024
@debnatkh debnatkh linked a pull request Apr 3, 2024 that will close this issue

debnatkh commented Apr 4, 2024

  1. CollectGarbage is executed with commitId = GetCurrentCommitId() = 42
  2. Cleanup is started. It acquires a collect barrier with commitId = 42:

ExecuteTx<TCleanup>(
    ctx,
    std::move(requestInfo),
    msg->RangeId,
    GetCurrentCommitId());

AcquireCollectBarrier(args.CollectBarrier);

  3. Before the collect barrier is released on completion of the Cleanup transaction, another CollectGarbage is executed. CollectCommitId is selected as follows:

ui64 TIndexTabletState::GetCollectCommitId() const
{
    // should not collect after any barrier
    return Min(GetCurrentCommitId(), Impl->GarbageQueue.GetCollectCommitId());
}

  4. GarbageQueue.GetCollectCommitId() returns the commit id just below the oldest acquired barrier:

ui64 TGarbageQueue::GetCollectCommitId() const
{
    if (Impl->Barriers) {
        const auto& barrier = *Impl->Barriers.begin();
        return barrier.CommitId - 1;
    }
    return InvalidCommitId;
}

  5. There is an unreleased collect barrier with commitId = 42, so CollectCommitId will be equal to 41, which is less than LastCollectCommitId (42, from step 1).

Generating a new CommitId when Cleanup is executed will solve the issue.
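The steps above can be sketched with a small standalone model. Everything here is a simplified assumption for illustration: `TModelState`, `RunScenario`, and the `std::multiset` of barriers are hypothetical stand-ins, not the real `TIndexTabletState` or `TGarbageQueue`. Only the selection rule (minimum of the current commit id and the oldest barrier minus one) is taken from the code quoted above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <set>

// Hypothetical, simplified model of the tablet state (not the real classes).
constexpr std::uint64_t InvalidCommitId =
    std::numeric_limits<std::uint64_t>::max();

struct TModelState
{
    std::uint64_t CurrentCommitId = 0;
    std::uint64_t LastCollectCommitId = 0;
    std::multiset<std::uint64_t> Barriers;  // acquired collect barriers

    std::uint64_t GenerateCommitId() { return ++CurrentCommitId; }

    void AcquireCollectBarrier(std::uint64_t commitId)
    {
        Barriers.insert(commitId);
    }

    void ReleaseCollectBarrier(std::uint64_t commitId)
    {
        auto it = Barriers.find(commitId);
        assert(it != Barriers.end());
        Barriers.erase(it);
    }

    // Same rule as the quoted code: never collect at or after any barrier.
    std::uint64_t GetCollectCommitId() const
    {
        const std::uint64_t byBarrier =
            Barriers.empty() ? InvalidCommitId : *Barriers.begin() - 1;
        return std::min(CurrentCommitId, byBarrier);
    }

    // Returns false if the chosen commit id would go backwards.
    bool CollectGarbage()
    {
        const std::uint64_t id = GetCollectCommitId();
        if (id < LastCollectCommitId) {
            return false;
        }
        LastCollectCommitId = id;
        return true;
    }
};

// Replays steps 1-5. Cleanup either reuses the current commit id for its
// barrier (the behaviour described above) or generates a new one.
bool RunScenario(bool generateNewCommitId)
{
    TModelState state;
    state.CurrentCommitId = 42;
    state.CollectGarbage();  // step 1: LastCollectCommitId becomes 42

    const std::uint64_t barrier = generateNewCommitId
        ? state.GenerateCommitId()
        : state.CurrentCommitId;
    state.AcquireCollectBarrier(barrier);  // step 2: Cleanup starts

    return state.CollectGarbage();  // step 3: Cleanup is still in flight
}
```

In this model `RunScenario(false)` picks 41 after 42 and reports a regression, while `RunScenario(true)` places the barrier at 43, so the next collect commit id stays at 42 and the sequence does not decrease.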

@debnatkh

The main problem is that FlushBytes acquires a collect barrier that is less than the last collect commit id:


  1. Consider the following sequence of writes:
Write(0,       256 KiB, 'a') -> Blob(commitId = 41)
Write(256 KiB, 256 KiB, 'b') -> Blob(commitId = 42)
Write(512 KiB, 1,       'f') -> FreshBytes(commitId = 43)
Write(0,       256 KiB, 'c') -> Blob(commitId = 44)

This will lead to the following file layout: [ccccccc][bbbbbbb][f]

  2. After CollectGarbage is executed, all three new blobs get a KeepFlag and the last collect commit id becomes 44:
CommitId:     41      42        43       44
            Blob(a) Blob(b) FreshBytes Blob(c)
                                          |
                                 LastCollectCommitId
  3. After the Cleanup operation is executed, the first blob (Blob(a), fully overwritten by Blob(c)) is marked as garbage.

  4. Now let us execute the FlushBytes operation. It acquires a collect barrier equal to the minimal commitId associated with the FreshBytes being flushed:

    for (const auto& bytes: args.Bytes) {
        args.CollectCommitId = Min(args.CollectCommitId, bytes.MinCommitId);
    }

After this acquisition there will be one barrier, equal to 43:

CommitId:     41      42        43       44
            Blob(a) Blob(b) FreshBytes Blob(c)
                                |         |
                             Barrier  LastCollectCommitId
  5. When the next CollectGarbage operation is executed, it chooses 42 as the collectCommitId:

const auto& barrier = *Impl->Barriers.begin();
return barrier.CommitId - 1;

After that, a CollectGarbage request with one new garbage blob is sent, which makes the collectCommitId sequence decrease: 42 after 44.
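The arithmetic of this scenario can be checked with a minimal sketch. The names `TFreshBytesModel`, `SelectFlushBarrier`, and `NextCollectCommitId` are hypothetical; the two rules they implement (barrier = minimal `MinCommitId` of the flushed bytes, collect commit id = oldest barrier minus one, capped by the current commit id) follow the snippets quoted above. The current commit id of 44 is an assumption taken from the diagram.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical stand-in for a fresh bytes entry; only MinCommitId matters here.
struct TFreshBytesModel
{
    std::uint64_t MinCommitId;
};

// Barrier chosen by FlushBytes: the minimal commit id among the flushed bytes.
std::uint64_t SelectFlushBarrier(const std::vector<TFreshBytesModel>& bytes)
{
    std::uint64_t commitId = std::numeric_limits<std::uint64_t>::max();
    for (const auto& entry: bytes) {
        commitId = std::min(commitId, entry.MinCommitId);
    }
    return commitId;
}

// Commit id chosen by the next CollectGarbage while the barrier is held.
// Assumes oldestBarrier > 0, so the subtraction cannot wrap around.
std::uint64_t NextCollectCommitId(
    std::uint64_t currentCommitId,
    std::uint64_t oldestBarrier)
{
    return std::min(currentCommitId, oldestBarrier - 1);
}
```

For the layout above, the single FreshBytes entry has commit id 43, so the barrier is 43 and the next collect commit id is 42, below the last collect commit id of 44.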

@debnatkh

To reproduce the issue, one can use fio:

fio --name=random-write-test \
    --ioengine=libaio \
    --rw=randwrite \
    --bs=512-4k \
    --size=1G \
    --direct=1 \
    --iodepth=16 \
    --numjobs=4 \
    --offset_increment=512 \
    --do_verify=0 \
    --time_based \
    --runtime=$((120*60*60))

AppCriticalEvents/CollectGarbageError errors after starting the aforementioned fio job:

image

AppCriticalEvents/CollectGarbageError errors after deploying fix #1919:

image
