A tutorial on property-based testing and Apalache-TLC #1831

konnov · 2022-05-30T08:22:12Z

In this tutorial, we reproduce a known scenario of ERC20 tokens with property-based testing, Apalache, and TLC under five configurations:

Stateful testing with Hypothesis. This is pretty much random simulation.
Simulation with Apalache. The choice of actions is randomized, whereas symbolic executions are checked with Z3.
Bounded model checking with Apalache. All bounded executions up to a predefined length are checked with Z3.
Simulation with TLC. The model checker picks successors at random.
State enumeration with TLC.

If you want to see the results, jump straight to the conclusions.

If you just want to read the document without running mdbook, open randomized.pdf.

If you want to fix the document and see it rendered, do the following:

cd docs
mdbook serve
open http://localhost:3000

Since this tutorial is about 25 pages, I would prefer the following reviewing process:

Please commit all changes to English and the writing style directly in the PR. Otherwise, we will get too many small fixes, which makes it impossible to navigate through all the comments.
If you are asking for substantial changes, please claim the review token, add your comments, and release the token. This way, we avoid too much concurrency, which often happens with large texts in pull requests.
Entries added to ./unreleased/ for any new functionality

codecov-commenter · 2022-05-30T08:30:16Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (253f4c4) 78.86% compared to head (20b6754) 76.69%.

❗ Current head 20b6754 differs from pull request most recent head 8665225. Consider uploading reports for the commit 8665225 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1831      +/-   ##
==========================================
- Coverage   78.86%   76.69%   -2.17%     
==========================================
  Files         466      391      -75     
  Lines       15923    12005    -3918     
  Branches     2557      550    -2007     
==========================================
- Hits        12557     9207    -3350     
+ Misses       3366     2798     -568

docs/src/tutorials/randomized.md

Kukovec

Done pretty-printing; could someone please proofread again?

konnov · 2022-06-07T08:12:35Z

I will proof-read and extend it again. There is something interesting about simulation going on here.

thpani · 2022-06-07T08:14:54Z

I will proof-read and extend it again. There is something interesting about simulation going on here.

@konnov Are you saying there will be significant changes/additions? Then I will hold off reading until you've committed those.

konnov · 2022-06-07T08:26:51Z

I will proof-read and extend it again. There is something interesting about simulation going on here.

@konnov Are you saying there will be significant changes/additions? Then I will hold off reading until you've committed those.

I will add one more section on a hand-written simulator in C++. It is isolated. Go ahead and read what we have now :)

thpani

Indicating that I want to review this 😃

p-offtermatt · 2022-06-07T09:37:43Z

For reference, I'm taking a token on the math section for now (I discussed with Thomas so there wouldn't be any conflicts).
I'll finish the review once I'm through with that.

p-offtermatt

Great work! I think this will be really useful to highlight how the "Apalache approach" compares to randomized testing.

To summarize some of my major remaining concerns:
I'm not sure the exact invariant that is checked in the Python code is the invariant that conceptually is expected to hold.
Also, I think the math should model the code closely to have the best idea of how likely it really is to find an invariant violation, and I think there are a couple of places where this is not the case.

I think we're learning more about the benefits and drawbacks of these approaches. I wonder whether it would eventually be useful to have a more high-level conceptual writeup comparing when PBT and when specification is appropriate, how either fits into a project's workflow, who the people doing either are or should be, etc. That said, this writeup is already great for comparing the technical merits.

p-offtermatt · 2022-06-07T08:16:06Z

docs/src/tutorials/pbt-and-tla.md

+
+Can we use some automation to discover such an execution? By looking at the
+above example, we can see that the core of this question is whether we can find
+the following sequence of events, for some values `n >= k > m >= l > 0`, and distinct addresses/users `u1, u2, u3`, such that `balanceOf[u1] >= k + l`:


The constants seem unnecessarily restrictive, in particular the requirement that k > l.
Consider the case where n >= k > 0, m >= l > 0, and balanceOf[u1] >= k+l > n.

I note this on the Python code as well, but I think it is important enough to repeat here: there are faulty sequences that do not have this shape (see my comment on the Python code for a sequence that I think conceptually violates the properties one would expect but is not caught by this). We should either 1) attempt to describe all such sequences more generally, or 2) explicitly mention that we only care about one particular type of sequence of events that can lead to a violation, and that there are others.

The constants seem unnecessarily restrictive, in particular the requirement that k > l. Consider the case where n >= k > 0, m >= l > 0, and balanceOf[u1] >= k+l > n.

The same sequence of events with these constants exhibits faulty behaviour, but 1-5 on its own is not faulty, so I'm not sure whether you'd want to include that sequence.

Yes, I had mentioned this to @konnov myself: conceptually, you don't need "high" and "low" amounts in any particular order, just the fact that their sum is more than what was intended to be spent. He wants to fix it like this for simplicity, from what I recall.

p-offtermatt · 2022-06-07T08:29:26Z

docs/src/tutorials/pbt-and-tla.md

+The majority of the above code should be clear. However, there are two new
+constructs in `commit_transfer`. First, we consume a transaction via
+`tx=consumes(pendingTxs)`, which deletes a transaction from the bundle
+`pendingTxs` and instantiates the input parameter `tx` with the chosen value. On top of that, we add the statement `assume(...)` inside the method. This statement tells the testing framework to reject the cases that violate the assumption. 


Why did you choose to use assume and have 3 different commit rules, rather than one rule that matches the tag of the pending transaction? This might not matter, depending on how much overhead these rejections cause, but it seems like the approach with 3 different rules makes the framework reject many cases where the approach with matching the tag would make progress in each step.
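The trade-off raised here can be sketched without Hypothesis. The snippet below is only an illustration, not the tutorial's code: the `Tx` record, the `"transfer"` tag, and both helper functions are assumptions made for the example. A tag-specific step rejects every draw whose tag does not match (standing in for a failed `assume(...)`), whereas a single step that dispatches on the tag accepts every draw.

```python
import random
from collections import namedtuple

# Hypothetical transaction record; the field names are assumptions.
Tx = namedtuple("Tx", ["tag", "value"])

# "approve" and "transferFrom" appear in the reviewed code; "transfer" is assumed.
TAGS = ("transfer", "approve", "transferFrom")


def commit_only(expected_tag, tx, log):
    """A tag-specific commit step; a mismatch stands in for a failed assume()."""
    if tx.tag != expected_tag:
        return False  # the draw is rejected and no progress is made
    log.append((tx.tag, tx.value))
    return True


def commit_by_tag(tx, log):
    """A single commit step that handles every tag, so no draw is rejected."""
    log.append((tx.tag, tx.value))
    return True


rng = random.Random(0)
pending = [Tx(rng.choice(TAGS), rng.randint(0, 19)) for _ in range(300)]

# Design A: the step is picked independently of the drawn transaction.
log_a = []
accepted_a = sum(commit_only(rng.choice(TAGS), tx, log_a) for tx in pending)

# Design B: one step that matches on the tag of whatever was drawn.
log_b = []
accepted_b = sum(commit_by_tag(tx, log_b) for tx in pending)

print(f"three guarded steps:  {accepted_a}/{len(pending)} draws accepted")
print(f"one dispatching step: {accepted_b}/{len(pending)} draws accepted")
```

With three equally likely tags, the guarded design is expected to accept about one draw in three. Whether that overhead matters in practice depends on how cheap a rejected draw is in the testing framework.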

p-offtermatt · 2022-06-07T08:45:17Z

test/tla/tutorials/randomized/test_erc20.py

+        if last:
+            if last.tag == "transferFrom" and last.value > 0:
+                for p in self.pendingTxsShadow:
+                    if p.tag == "approve" \


This doesn't catch some faulty cases:

tx1: approve(me, 5)
tx2: approve(me, 3) // actually, I want to only allow 3 tokens
commit tx1
tx3: transferFrom(me, badGuy, 3)
commit tx3
commit tx2
tx4: transferFrom(me, badGuy, 3) // a total of 6 tokens was transferred, but I allowed 5 (and decided to decrease to 3 afterwards!)
commit tx4

This is not caught, because no single transfer exceeds any single approval.

If I understand correctly, the invariant you'd want to check is that after one approval is submitted, and before the next one is submitted, the sum of tokens transferred out is not more than the value specified by the earlier approval.
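To make the gap concrete, the trace above can be replayed against a tiny allowance model. This is a hedged sketch, not the code under review: the `Tx` record and `replay` helper are hypothetical, the owner's balance is assumed to be sufficient, and the cumulative check used here (total transferred versus the largest approval) is a deliberately coarse stand-in for the invariant described above. It would raise false alarms on legitimate re-approvals, so it only serves to contrast the two kinds of check on this one trace.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tx:
    tag: str    # "approve" or "transferFrom"
    value: int


def replay(committed):
    """Replay committed transactions for one (owner, spender) pair.

    Returns the amounts that were actually transferred out.
    Assumes the owner's balance is always sufficient.
    """
    allowance = 0
    transferred = []
    for tx in committed:
        if tx.tag == "approve":
            allowance = tx.value              # approve overwrites the allowance
        elif tx.tag == "transferFrom" and tx.value <= allowance:
            allowance -= tx.value
            transferred.append(tx.value)
    return transferred


# Commit order from the trace: tx1, tx3, tx2, tx4.
committed = [
    Tx("approve", 5),
    Tx("transferFrom", 3),
    Tx("approve", 3),
    Tx("transferFrom", 3),
]
approvals = [tx.value for tx in committed if tx.tag == "approve"]
transferred = replay(committed)

# Per-transfer view: no single transfer exceeds any single approval.
per_transfer_ok = all(t <= a for t in transferred for a in approvals)

# Cumulative view (coarse): the total must not exceed the largest approval.
cumulative_ok = sum(transferred) <= max(approvals)

print("transferred:", transferred)
print("per-transfer check passes:", per_transfer_ok)
print("cumulative check passes:", cumulative_ok)
```

Both transfers go through under ordinary allowance bookkeeping, the per-transfer view sees nothing wrong, and only the cumulative view notices that 6 tokens left the account although at most 5 were ever approved.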

p-offtermatt · 2022-06-07T08:58:44Z

docs/src/tutorials/pbt-and-tla.md

+If you are interested in the detailed analysis of probabilities, see the [math section](#math) below.
+
+In summary, there are `600'397'329'064'743` (6e14) possible executions, discounting premature termination due to e.g. insufficient coverage or commits preceding submissions.
+The odds of hitting an invariant violation are `6e-7` for our concrete selection of `3` addresses and `20` values.


Suggested change

The odds of hitting an invariant violation are `6e-7` for our concrete selection of `3` addresses and `20` values.

The odds of hitting an invariant violation are `6e-7` (roughly 1 in 1.7 million) for our concrete selection of `3` addresses and `20` values.

Afterwards, the number 2 million is written out, so I think these odds should be given in words too. (I don't have an intuitive understanding of how 2 million compares to 6e-7, but I do understand how 1 in 1.7 million compares to 2 million.)
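For readers who want to relate the two figures, the arithmetic can be checked in a few lines. This is a rough sketch under two assumptions that are not stated in the thread: that "2 million" counts independent test runs, and that each run hits the violation with probability `6e-7`.

```python
import math

p_violation = 6e-7     # odds quoted in the tutorial
runs = 2_000_000       # assumption: "2 million" counts independent runs

# The same odds expressed as "1 in N".
one_in = 1 / p_violation

# Expected number of violating runs among `runs` independent runs.
expected_hits = runs * p_violation

# Probability of seeing at least one violation, exact and via the
# Poisson approximation 1 - exp(-expected_hits).
p_at_least_one = 1 - (1 - p_violation) ** runs
p_poisson = 1 - math.exp(-expected_hits)

print(f"1 in {one_in:,.0f}")
print(f"expected hits in {runs:,} runs: {expected_hits:.2f}")
print(f"P(at least one hit): {p_at_least_one:.3f} (Poisson: {p_poisson:.3f})")
```

Under these assumptions, `6e-7` corresponds to about 1 in 1.7 million, so 2 million runs yield about 1.2 expected hits and a chance of roughly 70% of seeing at least one violation.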

p-offtermatt · 2022-06-07T08:59:12Z

docs/src/tutorials/pbt-and-tla.md

+
+If you are interested in the detailed analysis of probabilities, see the [math section](#math) below.
+
+In summary, there are `600'397'329'064'743` (6e14) possible executions, discounting premature termination due to e.g. insufficient coverage or commits preceding submissions.


Suggested change

In summary, there are `600'397'329'064'743` (6e14) possible executions, discounting premature termination due to e.g. insufficient coverage or commits preceding submissions.

In summary, there are `600'397'329'064'743` (6e14, or 600 trillion) possible executions, discounting premature termination due to e.g. insufficient coverage or commits preceding submissions.

p-offtermatt · 2022-06-07T09:51:35Z

docs/src/tutorials/pbt-and-tla.md

+subject to the following constraints:
+  - \\(c_1: sender_H \ne spender_H\\)
+  - \\(c_2: sender_L \ne spender_L\\)
+  - \\(c_3: sender_T, from, to\\) pairwise distinct


Why do sender_T and to have to be different? The code doesn't require this, and conceptually it also seems fine.

I'm glad you're raising the same questions that I originally had. @konnov specifically requested it be modeled as such.

p-offtermatt · 2022-06-07T10:12:14Z

docs/src/tutorials/pbt-and-tla.md

+  - \\(c_3: sender_T, from, to\\) pairwise distinct
+  - \\(c_4: sender_H = sender_L = from\\)
+  - \\(c_5: spender_H = spender_L = sender_T\\)
+  - \\(c_6: v_H \ge v_T > v_L > 1\\)


Why > 1? It doesn't seem required, since

tx1: approve(me, 5)
tx2: approve(me, 1)
commit tx1
tx3: transferFrom(me, badGuy, 2)
commit tx3

violates the invariant

It is 1 because the original amounts are 0..(N-1), but in the model they are 1..N (nicer formulas w.r.t. bounds). So 1 plays the role of 0 in the real system.

Makes sense! I'd suggest reminding the reader of this in the text (maybe I overlooked it, but a sentence right beneath the constraints should suffice)
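The shift between the two ranges can be spelled out on the example from this thread. The sketch below is illustrative only: `to_model` and `satisfies_c6` are hypothetical helpers, and reading `v_H` and `v_L` as the two approval amounts and `v_T` as the `transferFrom` amount is an interpretation of constraints c4 and c5.

```python
def to_model(real_amount):
    """Shift a real amount in 0..N-1 to the model's range 1..N."""
    return real_amount + 1


def satisfies_c6(v_h, v_t, v_l):
    """Constraint c6 over model amounts: v_H >= v_T > v_L > 1."""
    return v_h >= v_t > v_l > 1


# Real amounts from the thread: approve 5, approve 1, transferFrom 2.
real = {"v_H": 5, "v_T": 2, "v_L": 1}
model = {name: to_model(v) for name, v in real.items()}

print("model amounts:", model)
print("c6 on model amounts:", satisfies_c6(model["v_H"], model["v_T"], model["v_L"]))

# A real approval of 0 maps to 1 in the model, which c6 excludes.
print("c6 with real v_L = 0:", satisfies_c6(to_model(5), to_model(2), to_model(0)))
```

Read this way, the example from the thread is covered by c6 once the amounts are shifted, and the `> 1` bound only rules out a second approval of 0 in the real system.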

thpani

I'm about halfway through and have a preliminary question about the Python code.

Still holding the review token.

thpani · 2022-06-08T06:33:04Z

test/tla/tutorials/randomized/test_erc20.py

+        # history variables that we need to express the invariants
+        self.pendingTxsShadow = set()
+        self.lastTx = None


What does lastTx actually model? It seems to be more complex than "the last committed transaction", because we reset it to None in the submit_* methods.

I was puzzled by this quite a bit, and it's not explained in the prose either.
Please add a more meaningful comment or explanation.

thpani · 2022-06-08T06:36:53Z

docs/src/tutorials/pbt-and-tla.md

@@ -0,0 +1,1000 @@
+# Tutorial on Checking ERC20 with Property-Based Testing and TLA+


Can we add a TOC to individual pages? This is quite a long document, and I would've liked to locate subsections.

thpani

I really like the overall approach taken here!
Sound, reproducible arguments about the benefits/drawbacks of certain approaches. 👍

Just a few more comments below, including suggestions that should be double-checked.
I also directly committed some stylistic fixes.

Releasing my review token.

thpani · 2022-06-08T06:54:31Z

docs/src/tutorials/pbt-and-tla.md

+8 hours of running PBT, we find the same execution with Apalache in 12 seconds.
+So it is probably worth looking at.


That last sentence sounds way too understated to me 😉 How about

Suggested change

8 hours of running PBT, we find the same execution with Apalache in 12 seconds.

So it is probably worth looking at.

8 hours of running PBT, we find the same execution with Apalache in 12 seconds.

This underlines the effectiveness of symbolic simulation for this problem.

thpani · 2022-06-08T06:59:25Z

test/tla/tutorials/randomized/ERC20.tla

+    \* To make it possible to submit two 'equal' transactions,
+    \* we introduce a unique transaction id.


I think this makes it clearer where/how that unique ID is introduced:

Suggested change

\* To make it possible to submit two 'equal' transactions,

\* we introduce a unique transaction id.

\* To make it possible to submit two 'equal' transactions,

\* we augment type `TX` with a unique transaction id.

thpani · 2022-06-08T07:00:57Z

test/tla/tutorials/randomized/ERC20.tla

+    \* @type: <<ADDR, ADDR>> -> Int;
+    allowance
+
+\* Variables that model Ethereum transactions, not the ERC20 state machine.


Is this second clause necessary? I don't understand what it refers to.

Suggested change

\* Variables that model Ethereum transactions, not the ERC20 state machine.

\* Variables that model Ethereum transactions.

thpani · 2022-06-08T07:01:58Z

test/tla/tutorials/randomized/ERC20.tla

+\* Initialize an ERC20 token.
+Init ==


Idk why this says "token"?

Suggested change

\* Initialize an ERC20 token.

Init ==

\* Initialize the ERC20 state machine.

Init ==

thpani · 2022-06-08T07:57:11Z

docs/src/tutorials/pbt-and-tla.md

+However, the important difference between `simulate` and `check` is that
+`simulate` does not give us an ultimate guarantee about all executions, even
+though we limit the scope to all executions of length up to 10, whereas `check`
+does.


Carve out the main argument here:

Suggested change

However, the important difference between `simulate` and `check` is that

`simulate` does not give us an ultimate guarantee about all executions, even

though we limit the scope to all executions of length up to 10, whereas `check`

does.

However, the important difference between `simulate` and `check` is that

`simulate` (due to randomization) does not give us an ultimate guarantee that all possible executions have been explored, whereas `check` does.
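The distinction can be illustrated with a toy explorer. This is a sketch under loud assumptions: the action names, the single "bad" trace, and both functions are invented for the example, and explicit enumeration merely stands in for what `check` does symbolically with an SMT solver.

```python
import itertools
import random

# Hypothetical action alphabet and bound; not taken from the specification.
ACTIONS = ("submit_approve", "submit_transferFrom", "commit")
MAX_LEN = 6

# A single made-up violating execution of length 6.
BAD = ("submit_approve", "submit_approve", "commit",
       "submit_transferFrom", "commit", "commit")


def violates(trace):
    return tuple(trace) == BAD


def check_all(max_len):
    """Enumerate every execution of length up to max_len (exhaustive)."""
    explored = 0
    for n in range(max_len + 1):
        for trace in itertools.product(ACTIONS, repeat=n):
            explored += 1
            if violates(trace):
                return trace, explored
    return None, explored


def simulate(runs, max_len, seed):
    """Sample `runs` random executions of length max_len (no guarantee)."""
    rng = random.Random(seed)
    for _ in range(runs):
        trace = tuple(rng.choice(ACTIONS) for _ in range(max_len))
        if violates(trace):
            return trace
    return None


found, explored = check_all(MAX_LEN)
print("exhaustive search found:", found is not None, "after", explored, "executions")
print("random search found:", simulate(100, MAX_LEN, seed=1) is not None)
```

If `check_all` returns `None`, no violating execution exists within the bound. If `simulate` returns `None`, it only means that none of the sampled executions was violating.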

thpani · 2022-06-08T08:02:20Z

docs/src/tutorials/pbt-and-tla.md

+Note that we let TLC use 75% of the available memory and ran it on 4 CPU
+cores (make sure you have them or change this setting!). Our experiments server


Where is this 4 core setting in the CLI/code?

thpani · 2022-06-08T08:08:46Z

docs/src/tutorials/pbt-and-tla.md

+   an ad-hoc random exploration. As we have seen, this mode slows down when
+   there is no error.


I think it's important to mention this?

Suggested change

an ad-hoc random exploration. As we have seen, this mode slows down when

there is no error.

an ad-hoc random exploration. As we have seen, this mode slows down when

there is no error, but gives a completeness guarantee.

thpani · 2022-06-08T08:30:58Z

docs/src/tutorials/pbt-and-tla.md

+  - \\(N_{addr} = 100\\): \\(P(\omega) \doteq 2.1 \cdot 10^{-12}\\)
+  - \\(N_{addr} = 1000\\): \\(P(\omega) \doteq 2.2 \cdot 10^{-16}\\)
+
+For reference, the space of 20byte address admits \\(2^{160}\\) unique values.


I don't understand this last sentence, or how it's related to the above.

Well, if we wanted to simulate a real system, then N_{addr} = 2^160

Okay, then maybe that's what the sentence should say?
Though I'm not sure it's needed, since we're following the small-model hypothesis throughout the doc?

shonfeder · 2024-01-03T20:31:24Z

Given that https://github.com/informalsystems/apalache/pull/1831/files#r890949777 is left outstanding, the age of this PR, and the lack of a clear motive to move this forward, I'll close this. Again, feel free to reopen if the outstanding comments are to be addressed and work is resumed.

konnov added 16 commits May 25, 2022 10:50

copy the examples

c56be04

update the test file

80fdfb2

work in progress on the tutorial

6577031

finish on symbolic simulation

604ba6f

update the specs

123a7e6

an almost complete version

e0335b5

finished the tutorial

08cf416

proof reading

5a3d9f2

update the figure

836e808

fix the figure

030c298

update the test

1581107

increasing the number of steps

67e1318

rollback the test parameters

e4587d3

finish the tutorial on Hypothesis, Apalache, and TLC

c76b2fa

bump the required version

04a1536

add the rendered version

cb7ff94

konnov added 2 commits May 30, 2022 10:30

add release note

1cc94da

Merge branch 'unstable' into ik/pbt-tutorial

288deb5

konnov marked this pull request as ready for review May 30, 2022 08:31

konnov requested a review from shonfeder as a code owner May 30, 2022 08:31

konnov requested review from danwt, thpani, Kukovec and andrey-kuprianov May 30, 2022 08:31

Red trails are medium, blue trails are easy

264fb26

shonfeder reviewed May 31, 2022

konnov added 3 commits May 31, 2022 08:23

Merge branch 'ik/pbt-tutorial'

875bd3e

rename randomized.md to pbt-and-tla.md

3b6938f

Merge branch 'unstable' into ik/pbt-tutorial

2d83787

typeset math

eafb52a

Kukovec approved these changes Jun 3, 2022

p-offtermatt self-requested a review June 3, 2022 15:26

Merge branch 'unstable' into ik/pbt-tutorial

a38a073

thpani requested changes Jun 7, 2022

thpani self-requested a review June 7, 2022 09:17

p-offtermatt requested changes Jun 7, 2022

konnov added 3 commits June 7, 2022 17:32

add a fresh version

0d79a29

add Jure as a co-author to pdf

740bbf8

remove nworkers

be50058

thpani requested changes Jun 8, 2022

thpani added 6 commits June 8, 2022 09:53

Apply stylistic fixes

cee2aa2

Replace verbatim with blockquote

b161cf9

Move paragraph to make it a bridge

0c7bd1b

Apply stylistic fixes

1f41fba

Number the math sections

38bfec2

Apply stylistic fixes

3817ce7

thpani reviewed Jun 8, 2022

thpani added 2 commits June 8, 2022 10:34

Stylistic fixes

85b45e2

Wrap some multi-letter identifiers

20b6754

konnov added this to the Symbolic simulator milestone Jun 24, 2022

shonfeder changed the base branch from unstable to main July 21, 2022 21:23

thpani assigned konnov Oct 24, 2022

Merge branch 'main' into ik/pbt-tutorial

8665225

shonfeder closed this Jan 3, 2024
