Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ccmlib/node: cleanup unsealed system sstables for offline tools #504

Merged
merged 1 commit into from
Aug 30, 2023

Conversation

tchaikov
Copy link
Contributor

see also 3616251

otherwise, sstablelevelreset would try to load the unsealed sstables and print out error messages in stderr if not all the components are available or is corrupted. this fails the test which checks for unexpected errors.

Fixes scylladb/scylladb#15210

see also 3616251

otherwise, `sstablelevelreset` would try to load the unsealed
sstables and print out error messages in stderr if not all the
components are available or is corrupted. this fails the test
which checks for unexpected errors.

Fixes scylladb/scylladb#15210

Signed-off-by: Kefu Chai <[email protected]>
@tchaikov
Copy link
Contributor Author

tested using

$ JAVA_HOME=/usr/lib/jvm/jre-11 pytest -c $PWD/pytest.ini --cassandra-dir $HOME/dev/scylladb/build/cmake offline_tools_test.py::TestOfflineTools::test_sstablelevelreset --count 20

previously 3 out 20 runs failed. after this change all run succeeded.

@tchaikov
Copy link
Contributor Author

hi @denesb and @bhalevy could you help review this change?

system_keyspace_names = ['system_schema', 'system']
if keyspace not in system_keyspace_names:
for system_keyspace_name in system_keyspace_names:
self.get_sstables(system_keyspace_name, '', cleanup_unsealed=True)
Copy link
Member

@bhalevy bhalevy Aug 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand why does ccm mess with unsealed sstables.
Maybe it makes sense for (older versions of?) Cassandra, but it doesn't seem to apply to Scylla.

It's obvious that this function was written with Cassandra in mind.
This other existing hunk doesn't apply to Scylla to either:

            if os.path.exists(f.replace('Data.db', 'Compacted')):
                files.remove(f)

I think that just ignore_unsealed=True, cleanup_unsealed=False should do the right job (for a ScyllaNode).

Copy link
Contributor Author

@tchaikov tchaikov Aug 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bhalevy it's not directly related to ccm. it's just how sstablelevelreset works. it tries to open sstables in system keyspaces. if we need to stick to sstablelevelreset, we have to clean the unsealed sstables before inviting it to look at the data directory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's obvious that this function was written with Cassandra in mind. This other existing hunk doesn't apply to Scylla to either:

            if os.path.exists(f.replace('Data.db', 'Compacted')):
                files.remove(f)

we can drop this bit in another change.

I think that just ignore_unsealed=True, cleanup_unsealed=False should do the right job (for a ScyllaNode).

i don't think so.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Why does the function even mess with system_schema if it was called for a user keyspace? I guess it is configured with LCS in Cassandra. I'm not sure if it applies to scylla.
  2. Touching unsealed sstables means that sstable components might be incomplete, how can we touch them? The caller should stop the node before doing this or ensure there will be no unsealed sstables in the table, e.g. by disabling compaction and running flush.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can drop this bit in another change.

CCM should support both Cassandra and Scylla.
Since Scylla has no Compacted component this logic won't apply to it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but the failure are in places where a flush call was done, but node is being stopped with cluster.stop(gently=False), so it can be kill during compaction.

should we disable compaction completely for this test ? and that's it ? i.e. for this tool it's should be part of the instructions (if it ever had one)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of these tools (be they C* or Scylla) work well when the node can compact while they are reading the sstables. Any test which wants to use these, should either make sure to disable compaction or stop the node when the tool is running.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So scylla is stopped, but not gently so it may have unsealed sstables that were in the processing of being written while the node was taken down.
It should actually be OK to remove all of them, as we do it anyhow on restart.
By mistake, I thought we're resetting the level for those unsealed sstables and sstables in the system keyspaces, but I see we only call self.get_sstables to perform this cleanup of unsealed sstables, which is OK.
Not elegant at all, but should work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed, it is even confusing. i will create a follow-up change to explain the rationale in a comment in this function. but since it's failing in multiple places frequently. will leave it for another PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of these tools (be they C* or Scylla) work well when the node can compact while they are reading the sstables. Any test which wants to use these, should either make sure to disable compaction or stop the node when the tool is running.

scylla is, well, stopped, in a harsh way.

@tchaikov
Copy link
Contributor Author

@fruch hi Israel, are you fine with this change? probably we can merge it?

Copy link
Contributor

@fruch fruch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@fruch fruch merged commit a45eda9 into scylladb:next Aug 30, 2023
3 checks passed
@tchaikov tchaikov deleted the node-cleanup-unsealed branch August 31, 2023 04:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

offline_tools_test.TestOfflineTools.test_sstablelevelreset is flaky
4 participants