A tutorial on property-based testing and Apalache-TLC #1831
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@           Coverage Diff            @@
##             main    #1831    +/-   ##
========================================
- Coverage   78.86%   76.69%   -2.17%
========================================
  Files         466      391      -75
  Lines       15923    12005    -3918
  Branches     2557      550    -2007
========================================
- Hits        12557     9207    -3350
+ Misses       3366     2798     -568
```

View full report in Codecov by Sentry.
Done pretty-printing; could someone proofread again, please?
I will proofread and extend it again. There is something interesting about simulation going on here.
@konnov Are you saying there will be significant changes/additions? Then I will hold off reading until you've committed those.
I will add one more section on a hand-written simulator in C++. It is isolated. Go ahead and read what we have now :) |
Indicating that I want to review this 😃
For reference, I'm taking a token on the math section for now (I discussed with Thomas so there wouldn't be any conflicts).
Great work! I think this will be really useful to highlight how the "Apalache approach" compares to randomized testing.
To summarize some of my major remaining concerns:
I'm not sure the exact invariant that is checked in the Python code is the invariant that conceptually is expected to hold.
Also, I think the math should model the code closely to have the best idea of how likely it really is to find an invariant violation, and I think there are a couple of places where this is not the case.
I think we're learning more about the benefits and drawbacks of these approaches. I wonder whether it would eventually be useful to have a more high-level conceptual writeup comparing when PBT and when specification is appropriate, how either fits into the workflow of a project, and who the people doing either are or should be; but this writeup is already great for comparing the technical merits.
> Can we use some automation to discover such an execution? By looking at the
> above example, we can see that the core of this question is whether we can find
> the following sequence of events, for some values `n >= k > m >= l > 0`, and distinct addresses/users `u1, u2, u3`, such that `balanceOf[u1] >= k + l`:
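The quoted constraints can be checked mechanically. Here is a toy brute-force search (my own sketch, not part of the tutorial) that looks for constants satisfying `n >= k > m >= l > 0` with `balanceOf[u1] >= k + l`, using small bounds in the spirit of the tutorial's small-scope setup:

```python
# Hypothetical helper: enumerate small values of n, k, m, l that satisfy the
# quoted constraints n >= k > m >= l > 0, given an initial balance of u1.
def find_constants(max_value, balance):
    for n in range(1, max_value + 1):
        for k in range(1, n + 1):          # n >= k
            for m in range(1, k):          # k > m
                for l in range(1, m + 1):  # m >= l > 0
                    if balance >= k + l:
                        return n, k, m, l
    return None

print(find_constants(20, 10))  # → (2, 2, 1, 1)
```

Since `k > m >= l >= 1` forces `k >= 2`, no solution exists when the balance is below 3, which is why a sufficiently funded `u1` is part of the setup.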
The constants seem unnecessarily restrictive, in particular the requirement that k > l.
Consider the case where n >= k > 0, m >= l > 0, and balanceOf[u1] >= k+l > n.
I note this on the Python code, but since I think it's quite important to note: There are faulty sequences that do not have this shape (see the comment on the Python code for a sequence I think conceptually violates the properties one would expect, but is not caught by this). We either should 1) make an attempt to more generally describe all such sequences, or 2) explicitly mention that we only care about one particular type of sequence of events that can lead to a violation, but that there are others.
> The constants seem unnecessarily restrictive, in particular the requirement that k > l. Consider the case where n >= k > 0, m >= l > 0, and balanceOf[u1] >= k+l > n.
The same sequence of events with these constants exhibits faulty behaviour, but 1-5 on its own is not faulty, so I'm not sure whether you'd want to include that sequence
Yes, I had mentioned this to @konnov myself, conceptually you don't need "high" and "low" amounts in any particular order, just the fact that their sum is more than what was intended to be spent. He wants to fix it like this for simplicity, from what I recall.
> The majority of the above code should be clear. However, there are two new
> constructs in `commit_transfer`. First, we consume a transaction via
> `tx=consumes(pendingTxs)`, which deletes a transaction from the bundle
> `pendingTxs` and instantiates the input parameter `tx` with the chosen value. On top of that, we add the statement `assume(...)` inside the method. This statement tells the testing framework to reject the cases that violate the assumption.
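The `consumes`/`assume` pattern described above can be illustrated with a toy driver. This is a hedged sketch with hypothetical names, not the tutorial's actual testing framework:

```python
import random

# Toy sketch: `consumes` removes one element from a bundle and returns it;
# `assume` tells the driver to discard the current case when its condition
# fails. A real PBT framework would retry with a freshly generated case.
class Rejected(Exception):
    """Signals that the current generated case should be discarded."""

def consumes(bundle):
    """Pick a random element, delete it from the bundle, and return it."""
    if not bundle:
        raise Rejected("empty bundle")
    tx = random.choice(sorted(bundle))
    bundle.discard(tx)
    return tx

def assume(condition):
    """Reject the current case if the assumption is violated."""
    if not condition:
        raise Rejected("assumption failed")

def commit_transfer(pendingTxs):
    tx = consumes(pendingTxs)      # consume one pending transaction
    assume(tx[0] == "transfer")    # reject cases that picked another tag
    return tx

pendingTxs = {("transfer", "u1", "u2", 5), ("approve", "u1", "u3", 3)}
try:
    print("committed:", commit_transfer(pendingTxs))
except Rejected:
    print("case rejected; the framework would generate another one")
```

Note that roughly one case in two is rejected here, which is the overhead discussed in the comment below.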
Why did you choose to use `assume` and have 3 different `commit` rules, rather than one rule that matches the tag of the pending transaction? This might not matter, depending on how much overhead these rejections cause, but it seems like the approach with 3 different rules makes the framework reject many cases where the approach of matching the tag would make progress in each step.
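The single-rule alternative proposed here can be sketched as follows (handler names are hypothetical, not from the tutorial's code):

```python
# Hedged sketch of a single commit rule that dispatches on the tag of the
# consumed transaction, so every generated case makes progress instead of
# being rejected by an `assume` on a mismatched tag.
def commit_any(tx, handlers):
    tag = tx[0]
    return handlers[tag](tx)  # dispatch to the handler for this tag

# Hypothetical per-tag handlers; real handlers would update the ERC20 state.
handlers = {
    "transfer":     lambda tx: f"applied transfer of {tx[1]}",
    "approve":      lambda tx: f"applied approval of {tx[1]}",
    "transferFrom": lambda tx: f"applied transferFrom of {tx[1]}",
}
print(commit_any(("approve", 3), handlers))  # → applied approval of 3
```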
```python
if last:
    if last.tag == "transferFrom" and last.value > 0:
        for p in self.pendingTxsShadow:
            if p.tag == "approve" \
```
This doesn't catch some faulty cases:
```
tx1: approve(me, 5)
tx2: approve(me, 3)  // actually, I want to only allow 3 tokens
commit tx1
tx3: transferFrom(me, badGuy, 3)
commit tx3
commit tx2
tx4: transferFrom(me, badGuy, 3)  // a total of 6 tokens was transferred,
                                  // but I allowed 5 (and decided to decrease
                                  // to 3 afterwards!)
commit tx4
```
This is not caught, because no single transfer exceeds any single approval
If I understand correctly, the invariant you'd want to check is that after one approval is submitted, and before the next one is submitted, the sum of tokens transferred out is not more than the value specified by the earlier approval.
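A minimal sketch of that invariant, under a hypothetical event encoding of my own (the tutorial's Python code differs): track the most recently *submitted* approval and the sum transferred out since its submission.

```python
# Hedged sketch: between two consecutive approval submissions, the sum of
# tokens transferred out must not exceed the value of the earlier approval.
def check_trace(events):
    allowed = None  # value of the most recently submitted approval
    spent = 0       # tokens transferred out since that submission
    for kind, value in events:
        if kind == "submit_approve":
            allowed, spent = value, 0
        elif kind == "commit_transfer":
            spent += value
            if allowed is None or spent > allowed:
                return False  # invariant violated
    return True

# The trace from the comment above: approvals of 5 and then 3 are submitted,
# yet a total of 6 tokens is transferred out afterwards.
faulty = [
    ("submit_approve", 5),   # tx1
    ("submit_approve", 3),   # tx2: decrease the allowance
    ("commit_transfer", 3),  # tx3 committed
    ("commit_transfer", 3),  # tx4 committed: 3 + 3 = 6 > 3
]
print(check_trace(faulty))  # → False
```

Unlike a check that compares each single transfer against a single approval, this catches the trace above, because the second transfer pushes the running sum past the most recent approval.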
docs/src/tutorials/pbt-and-tla.md
Outdated
> If you are interested in the detailed analysis of probabilities, see the [math section](#math) below.
>
> In summary, there are `600'397'329'064'743` (6e14) possible executions, discounting premature termination due to e.g. insufficient coverage or commits preceding submissions.
> The odds of hitting an invariant violation are `6e-7` for our concrete selection of `3` addresses and `20` values.
Suggested change:
```suggestion
The odds of hitting an invariant violation are `6e-7` (1 in 60 million) for our concrete selection of `3` addresses and `20` values.
```
Afterwards, the number 2 million is written out, so I think these should be given in words too (I don't have an intuitive understanding of how 2 million compares to 6e-7, but I have an understanding how 1 in 60 million compares to 2 million)
> If you are interested in the detailed analysis of probabilities, see the [math section](#math) below.
>
> In summary, there are `600'397'329'064'743` (6e14) possible executions, discounting premature termination due to e.g. insufficient coverage or commits preceding submissions.
Suggested change:
```suggestion
In summary, there are `600'397'329'064'743` (6e14, or 600 trillion) possible executions, discounting premature termination due to e.g. insufficient coverage or commits preceding submissions.
```
docs/src/tutorials/pbt-and-tla.md
Outdated
> subject to the following constraints:
> - \\(c_1: sender_H \ne spender_H\\)
> - \\(c_2: sender_L \ne spender_L\\)
> - \\(c_3: sender_T, from, to\\) pairwise distinct
Why do `sender_T, to` have to be different? The code doesn't require this, and conceptually it also seems fine.
I'm glad you're raising the same questions that I originally had. @konnov specifically requested it be modeled as such.
> - \\(c_3: sender_T, from, to\\) pairwise distinct
> - \\(c_4: sender_H = sender_L = from\\)
> - \\(c_5: spender_H = spender_L = sender_T\\)
> - \\(c_6: v_H \ge v_T > v_L > 1\\)
Why `> 1`? It doesn't seem required, since

```
tx1: approve(me, 5)
tx2: approve(me, 1)
commit tx1
tx3: transferFrom(me, badGuy, 2)
commit tx3
```

violates the invariant.
`1` because the original amounts are `0..(N-1)`, but in the model they are `1..N` (nicer formulas w.r.t. bounds). So `1` plays the role of `0` in the real system.
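The shift described here can be made concrete with a tiny helper (the names are mine, not the model's):

```python
# Hedged illustration: the model shifts real amounts 0..(N-1) to 1..N,
# so the model value 1 plays the role of the real amount 0.
N = 20

def to_model(real_amount):
    """Map a real amount in 0..N-1 to a model amount in 1..N."""
    return real_amount + 1

def to_real(model_amount):
    """Map a model amount in 1..N back to a real amount in 0..N-1."""
    return model_amount - 1

assert to_model(0) == 1 and to_real(1) == 0  # model's 1 is the real 0
```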
Makes sense! I'd suggest reminding the reader of this in the text (maybe I overlooked it, but a sentence right beneath the constraints should suffice)
I'm about halfway through and have a preliminary question about the Python code.
Still holding the review token.
```python
# history variables that we need to express the invariants
self.pendingTxsShadow = set()
self.lastTx = None
```
What does `lastTx` actually model? It seems to be more complex than "the last committed transaction", because we reset it to `None` in the `submit_*` methods.
I was puzzled by this quite a bit, and it's not explained in the prose either.
Please add a more meaningful comment or explanation.
> # Tutorial on Checking ERC20 with Property-Based Testing and TLA+
Can we add a TOC to individual pages? This is quite a long document, and I would've liked to locate subsections.
I really like the overall approach taken here!
Sound, reproducible arguments about the benefits/drawbacks of certain approaches. 👍
Just a few more comments below, including suggestions that should be double-checked.
I also directly committed some stylistic fixes.
Releasing my review token.
> 8 hours of running PBT, we find the same execution with Apalache in 12 seconds.
> So it is probably worth looking at.
That last sentence sounds way too understated to me 😉 How about
Suggested change:
```suggestion
8 hours of running PBT, we find the same execution with Apalache in 12 seconds.
This underlines the effectiveness of symbolic simulation for this problem.
```
```
\* To make it possible to submit two 'equal' transactions,
\* we introduce a unique transaction id.
```
I think this makes it clearer where/how that unique ID is introduced:
Suggested change:
```suggestion
\* To make it possible to submit two 'equal' transactions,
\* we augment type `TX` with a unique transaction id.
```
```
\* @type: <<ADDR, ADDR>> -> Int;
allowance

\* Variables that model Ethereum transactions, not the ERC20 state machine.
```
Is this second clause necessary? I don't understand what it refers to.
Suggested change:
```suggestion
\* Variables that model Ethereum transactions.
```
```
\* Initialize an ERC20 token.
Init ==
```
Idk why this says "token"?

Suggested change:
```suggestion
\* Initialize the ERC20 state machine.
Init ==
```
> However, the important difference between `simulate` and `check` is that
> `simulate` does not give us an ultimate guarantee about all executions, even
> though we limit the scope to all executions of length up to 10, whereas `check`
> does.
Carve out the main argument here:
Suggested change:
```suggestion
However, the important difference between `simulate` and `check` is that
`simulate` (due to randomization) does not give us an ultimate guarantee that all possible executions have been explored, whereas `check` does.
```
> Note that we let TLC use 75% of the available memory and ran it on 4 CPU
> cores (make sure you have them or change this setting!). Our experiments server
Where is this 4 core setting in the CLI/code?
> an ad-hoc random exploration. As we have seen, this mode slows down when
> there is no error.
I think it's important to mention this?
Suggested change:
```suggestion
an ad-hoc random exploration. As we have seen, this mode slows down when
there is no error, but gives a completeness guarantee.
```
> - \\(N_{addr} = 100\\): \\(P(\omega) \doteq 2.1 \cdot 10^{-12}\\)
> - \\(N_{addr} = 1000\\): \\(P(\omega) \doteq 2.2 \cdot 10^{-16}\\)
>
> For reference, the space of 20-byte addresses admits \\(2^{160}\\) unique values.
I don't understand this last sentence, or how it's related to the above.
Well, if we wanted to simulate a real system, then `N_{addr} = 2^160`.
Okay, then maybe that's what the sentence should say?
Though I'm not sure it's needed, since we're following the small-model hypothesis throughout the doc?
Given that https://github.com/informalsystems/apalache/pull/1831/files#r890949777 is left outstanding, and given the age of this PR and the lack of a clear motive to move it forward, I'll close this. Again, feel free to reopen if the outstanding comments are to be addressed and work is resumed.
In this tutorial, we reproduce a known scenario of ERC20 tokens with property-based testing, Apalache, and TLC under five configurations.
If you want to see the results, jump straight to the conclusions.
If you just want to read the document, without running `mdbook`, access randomized.pdf.
If you want to fix the document and see it rendered, do the following:
Since this tutorial is about 25 pages, I would prefer the following reviewing process:
- Please commit all changes to English and the writing style directly in the PR. Otherwise, we will get too many small fixes, which makes it impossible to navigate through all the comments.
- If you are asking for substantial changes, please claim the review token, add your comments, and release the token. Doing it this way, we will avoid too much concurrency, as often happens with large texts in pull requests.
Entries added to ./unreleased/ for any new functionality