Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A tutorial on property-based testing and Apalache-TLC #1831

Closed
wants to merge 42 commits into from

Conversation

konnov
Copy link
Collaborator

@konnov konnov commented May 30, 2022

In this tutorial, we reproduce a known scenario of ERC20 tokens with property-based testing, Apalache, and TLC under five configurations.

  • Stateful testing with Hypothesis. This is pretty much random simulation.
  • Simulation with Apalache. The choice of actions is randomized, whereas symbolic executions are checked with Z3.
  • Bounded model checking with Apalache. All bounded executions up to a predefined lengths are checked with Z3.
  • Simulation with TLC. The model checker picks successors at random.
  • State enumeration with TLC.

If you want to see the results, jump straight to the conclusions.

If you want just to read the document, without running, mdbook, access randomized.pdf

If you want to fix the document and see it rendered, do the following:

cd docs
mdbook serve
open http://localhost:3000

Since this tutorial is about 25 pages, I would prefer the following reviewing process:

  • Please commit all changes to English and the writing style directly in the PR. Otherwise, we will get too many small fixes, which makes it impossible to navigate through all the comments.

  • If you are asking for substantial changes, please claim the review token, add your comments and release the token. Doing it this way, we will avoid too much concurrency, as it often happens with large texts in pull requests.

  • Entries added to ./unreleased/ for any new functionality

@codecov-commenter
Copy link

codecov-commenter commented May 30, 2022

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (253f4c4) 78.86% compared to head (20b6754) 76.69%.

❗ Current head 20b6754 differs from pull request most recent head 8665225. Consider uploading reports for the commit 8665225 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1831      +/-   ##
==========================================
- Coverage   78.86%   76.69%   -2.17%     
==========================================
  Files         466      391      -75     
  Lines       15923    12005    -3918     
  Branches     2557      550    -2007     
==========================================
- Hits        12557     9207    -3350     
+ Misses       3366     2798     -568     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@konnov konnov marked this pull request as ready for review May 30, 2022 08:31
@konnov konnov requested a review from shonfeder as a code owner May 30, 2022 08:31
Copy link
Collaborator

@Kukovec Kukovec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done pretty-printing, please someone proofread again.

@p-offtermatt p-offtermatt self-requested a review June 3, 2022 15:26
@konnov
Copy link
Collaborator Author

konnov commented Jun 7, 2022

I will proof-read and extend it again. There is something interesting about simulation going on here.

@thpani
Copy link
Collaborator

thpani commented Jun 7, 2022

I will proof-read and extend it again. There is something interesting about simulation going on here.

@konnov Are you saying there will be significant changes/additions? Then I will hold off reading until you've committed those.

@konnov
Copy link
Collaborator Author

konnov commented Jun 7, 2022

I will proof-read and extend it again. There is something interesting about simulation going on here.

@konnov Are you saying there will be significant changes/additions? Then I will hold off reading until you've committed those.

I will add one more section on a hand-written simulator in C++. It is isolated. Go ahead and read what we have now :)

Copy link
Collaborator

@thpani thpani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indicating that I want to review this 😃

@thpani thpani self-requested a review June 7, 2022 09:17
@p-offtermatt
Copy link
Collaborator

For reference, I'm taking a token on the math section for now (I discussed with Thomas so there wouldn't be any conflicts)
I'll finish the review once I'm done through with that

Copy link
Collaborator

@p-offtermatt p-offtermatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! I think this will be really useful to highlight how the "Apalache approach" compares to randomized testing.

To summarize some of my major remaining concerns:
I'm not sure the exact invariant that is checked in the Python code is the invariant that conceptually is expected to hold.
Also, I think the math should model the code closely to have the best idea of how likely it really is to find an invariant violation, and I think there are a couple of places where this is not the case.

I think we're learning more about the benefits and drawbacks of these approaches, and I wonder if eventually it would be useful to have a more high-level conceptual writeup comparing when PBT and when specification is appropriate, how either fits into the workflow along a project, who the people doing either are/should be, etc..., but this writeup is already great for comparing the technical merits.


Can we use some automation to discover such an execution? By looking at the
above example, we can see that the core of this question is whether we can find
the following sequence of events, for some values `n >= k > m >= l > 0`, and distinct addresses/users `u1, u2, u3`, such that `balanceOf[u1] >= k + l`:
Copy link
Collaborator

@p-offtermatt p-offtermatt Jun 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The constants seem unnecessarily restrictive, in particular the requirement that k > l.
Consider the case where n >= k > 0, m >= l > 0, and balanceOf[u1] >= k+l > n.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I note this on the Python code, but since I think it's quite important to note: There are faulty sequences that do not have this shape (see the comment on the Python code for a sequence I think conceptually violates the properties one would expect, but is not caught by this). We either should 1) make an attempt to more generally describe all such sequences, or 2) explicitly mention that we only care about one particular type of sequence of events that can lead to a violation, but that there are others.

Copy link
Collaborator

@Kukovec Kukovec Jun 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The constants seem unnecessarily restrictive, in particular the requirement that k > l. Consider the case where n >= k > 0, m >= l > 0, and balanceOf[u1] >= k+l > n.

The same sequence of events with these constants exhibits faulty behaviour, but 1-5 on its own is not faulty, so I'm not sure whether you'd want to include that sequence

Yes, I had mentioned this to @konnov myself, conceptually you don't need "high" and "low" amounts in any particular order, just the fact that their sum is more than what was intended to be spent. He wants to fix it like this for simplicity, from what I recall.

The majority of the above code should be clear. However, there are two new
constructs in `commit_transfer`. First, we consume a transaction via
`tx=consumes(pendingTxs)`, which deletes a transaction from the bundle
`pendingTxs` and instantiates the input parameter `tx` with the chosen value. On top of that, we add the statement `assume(...)` inside the method. This statement tells the testing framework to reject the cases that violate the assumption.
Copy link
Collaborator

@p-offtermatt p-offtermatt Jun 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you choose to use assume and have 3 different commit rules, rather than one rule that matches the tag of the pending transaction? This might not matter, depending on how much overhead these rejections cause. but it seems like the approach with 3 different rules makes the framework reject many cases where the approach with matching the tag would make progress in each step.

if last:
if last.tag == "transferFrom" and last.value > 0:
for p in self.pendingTxsShadow:
if p.tag == "approve" \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't catch some faulty cases:

tx1: approve(me, 5)
tx2: approve(me, 3) // actually, I want to only allow 3 tokens
commit tx1
tx3: transferFrom(me, badGuy, 3)
commit tx3
commit tx2
tx4: transferFrom(me, badGuy, 3) // a total of 6 tokens was transferred, but I allowed 5 (and decided to decrease to 3 afterwards!)
commit tx4

This is not caught, because no single transfer exceeds any single approval

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, the invariant you'd want to check is that after one approval is submitted, and before the next one is submitted, the sum of tokens transferred out is not more than the value specified by the earlier approval.

If you are interested in the detailed analysis of probabilities, see the [math section](#math) below.

In summary, there are `600'397'329'064'743` (6e14) possible executions, discounting premature termination due to e.g. insufficient coverage or commits preceding submissions.
The odds of hitting an invariant violation are `6e-7` for our concrete selection of `3` addresses and `20` values.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The odds of hitting an invariant violation are `6e-7` for our concrete selection of `3` addresses and `20` values.
The odds of hitting an invariant violation are `6e-7` (1 in 60 million) for our concrete selection of `3` addresses and `20` values.

Afterwards, the number 2 million is written out, so I think these should be given in words too (I don't have an intuitive understanding of how 2 million compares to 6e-7, but I have an understanding how 1 in 60 million compares to 2 million)


If you are interested in the detailed analysis of probabilities, see the [math section](#math) below.

In summary, there are `600'397'329'064'743` (6e14) possible executions, discounting premature termination due to e.g. insufficient coverage or commits preceding submissions.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
In summary, there are `600'397'329'064'743` (6e14) possible executions, discounting premature termination due to e.g. insufficient coverage or commits preceding submissions.
In summary, there are `600'397'329'064'743` (6e14, or 600 trillion) possible executions, discounting premature termination due to e.g. insufficient coverage or commits preceding submissions.

docs/src/tutorials/pbt-and-tla.md Show resolved Hide resolved
subject to the following constraints:
- \\(c_1: sender_H \ne spender_H\\)
- \\(c_2: sender_L \ne spender_L\\)
- \\(c_3: sender_T, from, to\\) pairwise distinct
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do sender_T, to have to be different? The code doesn't require this, and conceptually it also seems fine

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm glad you're raising the same questions that I originally had. @konnov specifically requested it be modeled as such.

- \\(c_3: sender_T, from, to\\) pairwise distinct
- \\(c_4: sender_H = sender_L = from\\)
- \\(c_5: spender_H = spender_L = sender_T\\)
- \\(c_6: v_H \ge v_T > v_L > 1\\)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why > 1? It doesn't seem required, since

tx1: approve(me, 5)
tx2: approve(me, 1)
commit tx1
tx3: transferFrom(me, badGuy, 2)
commit tx3

violates the invariant

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 because the original amounts are 0..(N-1), but in the model they are 1..N (nicer formulas w.r.t bounds). So 1 plays the role of 0 in the real system

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense! I'd suggest reminding the reader of this in the text (maybe I overlooked it, but a sentence right beneath the constraints should suffice)

Copy link
Collaborator

@thpani thpani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm about halfway through and have a preliminary question about the Python code.

Still holding the review token.

Comment on lines +84 to +86
# history variables that we need to express the invariants
self.pendingTxsShadow = set()
self.lastTx = None
Copy link
Collaborator

@thpani thpani Jun 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does lastTx actually model? It seems to be more complex than "the last committed transaction", because we reset it to None in the submit_* methods.

I was puzzled by this quite a bit, and it's not explained in the prose either.
Please add a more meaningful comment or explanation.

@@ -0,0 +1,1000 @@
# Tutorial on Checking ERC20 with Property-Based Testing and TLA+
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a TOC to individual pages? This is quite a long document, and I would've liked to locate subsections.

Copy link
Collaborator

@thpani thpani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the overall approach taken here!
Sound, reproducible arguments about the benefits/drawbacks of certain approaches. 👍

Just a few more comments below, including suggestions that should be double-checked.
I also directly committed some stylistic fixes.

Releasing my review token.

Comment on lines +379 to +380
8 hours of running PBT, we find the same execution with Apalache in 12 seconds.
So it is probably worth looking at.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That last sentence sounds way too understated to me 😉 How about

Suggested change
8 hours of running PBT, we find the same execution with Apalache in 12 seconds.
So it is probably worth looking at.
8 hours of running PBT, we find the same execution with Apalache in 12 seconds.
This underlines the effectiveness of symbolic simulation for this problem.

Comment on lines +44 to +45
\* To make it possible to submit two 'equal' transactions,
\* we introduce a unique transaction id.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes it clearer where/how that unique ID is introduced:

Suggested change
\* To make it possible to submit two 'equal' transactions,
\* we introduce a unique transaction id.
\* To make it possible to submit two 'equal' transactions,
\* we augment type `TX` with a unique transaction id.

\* @type: <<ADDR, ADDR>> -> Int;
allowance

\* Variables that model Ethereum transactions, not the ERC20 state machine.
Copy link
Collaborator

@thpani thpani Jun 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this second clause necessary? I don't understand what it refers to.

Suggested change
\* Variables that model Ethereum transactions, not the ERC20 state machine.
\* Variables that model Ethereum transactions.

Comment on lines +57 to +58
\* Initialize an ERC20 token.
Init ==
Copy link
Collaborator

@thpani thpani Jun 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idk why this says "token"?

Suggested change
\* Initialize an ERC20 token.
Init ==
\* Initialize the ERC20 state machine.
Init ==

test/tla/tutorials/randomized/test_erc20.py Show resolved Hide resolved
Comment on lines +651 to +654
However, the important difference between `simulate` and `check` is that
`simulate` does not give us an ultimate guarantee about all executions, even
though we limit the scope to all executions of length up to 10, whereas `check`
does.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Carve out the main argument here:

Suggested change
However, the important difference between `simulate` and `check` is that
`simulate` does not give us an ultimate guarantee about all executions, even
though we limit the scope to all executions of length up to 10, whereas `check`
does.
However, the important difference between `simulate` and `check` is that
`simulate` (due to randomization) does not give us an ultimate guarantee that all possible executions have been explored, whereas `check` does.

Comment on lines +724 to +725
Note that we let TLC use 75% of the available memory and ran it on 4 CPU
cores (make sure you have them or change this setting!). Our experiments server
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this 4 core setting in the CLI/code?

Comment on lines +768 to +769
an ad-hoc random exploration. As we have seen, this mode slows down when
there is no error.
Copy link
Collaborator

@thpani thpani Jun 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's important to mention this?

Suggested change
an ad-hoc random exploration. As we have seen, this mode slows down when
there is no error.
an ad-hoc random exploration. As we have seen, this mode slows down when
there is no error, but gives a completeness guarantee.

docs/src/tutorials/pbt-and-tla.md Outdated Show resolved Hide resolved
- \\(N_{addr} = 100\\): \\(P(\omega) \doteq 2.1 \cdot 10^{-12}\\)
- \\(N_{addr} = 1000\\): \\(P(\omega) \doteq 2.2 \cdot 10^{-16}\\)

For reference, the space of 20byte address admits \\(2^{160}\\) unique values.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this last sentence, or how it's related to the above.

Copy link
Collaborator

@Kukovec Kukovec Jun 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, if we wanted to simulate a real system, then N_{addr} = 2^160

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, then maybe that's what the sentence should say?
Though I'm not sure it's needed, since we're following the small-model hypothesis throughout the doc?

@konnov konnov added this to the Symbolic simulator milestone Jun 24, 2022
@shonfeder shonfeder changed the base branch from unstable to main July 21, 2022 21:23
@shonfeder
Copy link
Contributor

Given https://github.com/informalsystems/apalache/pull/1831/files#r890949777 is left out standing and the age of this PR and the lack of clear motive to move this forward, I'll close this. Again, feel free to reopen if the outstanding comments are to be addressed and work is resumed.

@shonfeder shonfeder closed this Jan 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants