Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: New mode to use transaction results to skip failed transaction and signature verification in catchup #4536

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

ThomasBrady
Copy link
Contributor

@ThomasBrady ThomasBrady commented Nov 6, 2024

Description

Resolves #X

Adds a new config option, CATCHUP_SKIP_KNOWN_RESULTS. When this config option is enabled, transaction results are downloaded from history archives for the catchup range. Failed transactions are not applied, and signatures are not verified.

Preliminary perf testing locally running catchup on 1000 ledgers:

user/system/total time seconds

*Baseline (no skipping):*     429 / 115 / 138s
*Skip Failed:*                373 / 99  / 114s  (1.14x / 1.16x / 1.21x speedup over baseline)
*Skip Failed + verification:* 334 / 88  / 95s   (1.28x / 1.30x / 1.45x speedup over baseline)

Remaining work:

  • Run catchup in supercluster to get a more representative idea of the performance characteristics of the new mode
  • Add more test coverage to test correct meta generation in this mode

Checklist

  • Reviewed the contributing document
  • Rebased on top of master (no merge commits)
  • Ran clang-format v8.0.0 (via make format or the Visual Studio extension)
  • Compiles
  • Ran all tests
  • If change impacts performance, include supporting evidence per the performance document

@ThomasBrady ThomasBrady force-pushed the lightweight-catchup branch 2 times, most recently from 01b7bb7 to 1116893 Compare November 6, 2024 18:24
@@ -70,6 +70,10 @@ class TransactionFrame : public TransactionFrameBase
mutable Hash mFullHash; // the hash of the contents and the sig.

std::vector<std::shared_ptr<OperationFrame const>> mOperations;
mutable std::optional<TransactionResult> mReplaySuccessfulTransactionResult{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please avoid adding mutable state to TransactionFrame - @SirTyson recently did a lot of work to make TransactionFrame immutable, so we should keep it that way. Reason is that tx frame is used across the codebase for overlay flooding as well as ledger application, so any incorrectly set mutable state leads to awful failure modes like state divergence. If you want to override results, could we modify MutableTransactionResult class instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, that makes sense. I'll change this to store the replay result in MutableTransactionResult.

@@ -1785,8 +1816,19 @@ TransactionFrame::apply(AppConnector& app, AbstractLedgerTxn& ltx,
{
mCachedAccountPreProtocol8.reset();
uint32_t ledgerVersion = ltx.loadHeader().current().ledgerVersion;
SignatureChecker signatureChecker{ledgerVersion, getContentsHash(),
getSignatures(mEnvelope)};
auto skipMode = mReplayFailingTransactionResult.has_value() ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is correct: don't we still need to do signature verification for failed transactions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(A good sanity check for this check would be running full parallel catchup)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to update one time signers (which are removed whether the signature verification succeeds or fails), but I don't see why we need to verify signatures for failed transactions as it doesn't effect the ledger state.

virtual bool checkSignature(std::vector<Signer> const&, int32_t) = 0;
virtual bool checkAllSignaturesUsed() const = 0;
};
class SignatureCheckerImpl : public SignatureChecker
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Request for some renaming for clarity:

  • SignatureChecker -> AbstractSignatureChecker
  • SignatureCheckerImpl -> SignatureChecker

TransactionHistoryEntry mTxHistoryEntry;
TransactionHistoryResultEntry mTxHistoryResultEntry;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid footguns, results should be optional

// to mark catchup as complete and node as synced. In OFFLINE mode node is not
// connected to network, so new ledgers are not being externalized. Only
// buckets and transactions from history archives are applied.
// Catchup can be done in two modes - ONLINE and OFFLINE. In ONLINE mode, the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did the formatting change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reworded it a bit, which changed the line lengths and formatting.

mCheckpointToQueue);
auto getAndUnzip =
std::make_shared<GetAndUnzipRemoteFileWork>(mApp, ft, mArchive);
OnFailureCallback cb = [archive = mArchive, filesToTransfer]() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please keep the implementation of OnFailureCallback the way it was before: mArchive can be null, in which case GetAndUnzipRemoteFileWork will pick a random archive. With the current change, core would crash de-referencing a null pointer.

@@ -98,8 +94,10 @@ DownloadApplyTxsWork::yieldMoreWork()
{
auto prev = mLastYieldedWork;
bool pqFellBehind = false;
auto applyName = apply->getName();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused

std::filesystem::remove(
std::filesystem::path(ft.localPath_nogz()));
CLOG_DEBUG(History, "Deleted transactions {}",
CLOG_DEBUG(History, "Deleting transactions {}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the log is misleading, since we're deleting transactions and results

@@ -1520,6 +1522,17 @@ LedgerManagerImpl::applyTransactions(

prefetchTransactionData(txs);

std::optional<std::vector<TransactionResultPair>::const_iterator>
expectedResultsIter = std::nullopt;
if (mApp.getConfig().CATCHUP_SKIP_KNOWN_RESULTS && expectedResults)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a valid scenario when CATCHUP_SKIP_KNOWN_RESULTS=true but expectedResults is not set?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the CATCHUP_SKIP_KNOWN_RESULTS=true, but for some reason we do not have expectedResults (e.g. if for some reason they aren't in the archive), catchup will proceed but no transactions will be skipped. I think I should probably add a warning to inform users that this is the case, but I'm hesitant to report an error in that case as its not fatal. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, this makes me realize: with the current implementation of catchup, won't it fail if results couldn't be downloaded? If so, I think we should make it less strict, so it'll try to download results, but if it can't, catchup should still proceed normally. In this case, I agree the logic here makes sense and we can add a warning (probably not at individual ledger level though as it'll get spammy. We can warn per checkpoint).

{
releaseAssert(*expectedResultsIter !=
expectedResults->results.end());
while ((*expectedResultsIter)->transactionHash !=
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks a bit suspicious: the order of results should match the order of transactions application exactly. I think you can do expectedResults->at(i) instead of the loop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, the order should match, but the loop accounts for the case when there are gaps in the results stored in the archive, which I believe can be the case when there are ledgers with empty txn sets (?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you have ledgers with empty sets, then txs should be empty as well, so we should use the same index logic for both results and txs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optionally skip work when performing catchup
3 participants