ci: tests: adds tooling to run very expensive tests #12234

laurentsenta · 2024-07-15T09:41:21Z

Related Issues

Proposed Changes

Keep the LOTUS_RUN_EXPENSIVE_TESTS flag unchanged, but make it explicit in CI
Adds the LOTUS_RUN_VERY_EXPENSIVE_TESTS flag based on a few signals (wip)
- If a PR contains the label need/very-expensive-tests
- During a nightly run
Adds a very_expensive_tests array in the test.yml settings (similar to how test are configured for custom runners, etc).
- Tests in that array will run with -timeout=60m

Additional Info

Currently the test using VERY EXPENSIVE is niporep_manual_test.go.

Example of a regular run:
https://github.com/filecoin-project/lotus/actions/runs/9908341740/job/27439500093?pr=12225
You'll find skipping test logs in the gotestsum step.

Adding the need/very-expensive-tests label should toggle those on.

Checklist

Before you mark the PR ready for review, please make sure that:

Commits have a clear commit message.
PR title is in the form of of <PR type>: <area>: <change being made>
- example: fix: mempool: Introduce a cache for valid signatures
- PR type: fix, feat, build, chore, ci, docs, perf, refactor, revert, style, test
- area, e.g. api, chain, state, mempool, multisig, networking, paych, proving, sealing, wallet, deps
Update CHANGELOG.md or signal that this change does not need it.
- If the PR affects users (e.g., new feature, bug fix, system requirements change), update the CHANGELOG.md and add details to the UNRELEASED section.
- If the change does not require a CHANGELOG.md entry, do one of the following:
  - Add [skip changelog] to the PR title
  - Add the label skip/changelog to the PR
New features have usage guidelines and / or documentation updates in
- Lotus Documentation
- Discussion Tutorials
Tests exist for new functionality or change in behavior
CI is green

.github/workflows/test.yml

laurentsenta · 2024-07-16T08:00:23Z

Seems like the feature is working: the test failing is the one using very expensive tests
https://github.com/filecoin-project/lotus/actions/runs/9940645037/job/27457885144?pr=12234

galargh

Looks good :D

.github/workflows/test.yml

laurentsenta · 2024-07-16T12:02:24Z

Setting this up for review, @rvagg we have an open question for you!

laurentsenta · 2024-07-18T09:10:43Z

@rvagg bumping this, the PR is ready for review,
we need your input on the envs discussion above, and you might want to fix the very expensive test before merging.

rvagg · 2024-07-19T10:46:06Z

2 things about very expensive tests:

They'll take a lot longer to run
They'll probably need a beefier machine to run on (or, they take a lot lot longer, which might be acceptable)

You can see in here that itest-niporep_manual where the only VERY_EXPENSIVE test is so far is timing out at the default golang 10m mark:

DONE 9 tests, 5 skipped, 3 failures in 673.657s

I'm not sure how long it takes in the CI machines we have, but we probably need to increase the go test timeout by quite a bit. Maybe 30m to start with and see how it goes? If it's something that we can easily bump in the config file then we could do that as we add more.

rvagg · 2024-07-22T04:59:05Z

Next steps? We either need somewhere new to run these that can handle the resources, and/or a change in test timeout to make these runnable. It'd be nice if there were a way to couple the timeout specifically with the expensive tests (so we don't end up giving a long timeout to tests that really should finish quickly) but I can't think of a good way to achieve this - the expensive tests are marked within the code but the timeout needs to be done outside the code.

laurentsenta · 2024-07-22T09:06:18Z

@rvagg thanks for taking a look,
I agree it'd make sense to wait for something like golang/go#48157 then set a timeout per test,
I'll check with timeout first and report back, that's the most manageable value to tweak.

rvagg · 2024-07-23T04:00:25Z

Even after 60m it still times out, that's not great at all, these machines must be pretty basic.

DONE 9 tests, 5 skipped, 3 failures in 3670.053s

I could probably have a look at making that test even dumber and doing less work, this is all the fault of proofs for niporep being really expensive.

What options do we have for introducing custom execution machines here? Can we throw our own hardware at this? We have a bare-metal machine which is usable for this, are we able to hook that up and have this test specifically run on there?

laurentsenta · 2024-07-23T07:54:35Z

@rvagg We use custom runners on AWS, c5.4xlarge machine with 16 cores and 32GB of memory,
I suspect the test is super flaky:

I've seen a success after 24min: https://github.com/filecoin-project/lotus/actions/runs/10041916541/job/27751141605
A fail at 3min: https://github.com/filecoin-project/lotus/actions/runs/10043543941/job/27756581112
That one failed after 1h: https://github.com/filecoin-project/lotus/actions/runs/10043543941/job/27757352123

(no success since I rebased with latest master)

On the last example, I noticed the log hang'ed after ~25min,

job log grouped per minute:

2024-07-22T15:29 19
2024-07-22T15:30 2653
2024-07-22T15:32 755
2024-07-22T15:33 1086
2024-07-22T15:34 178
2024-07-22T15:35 142
2024-07-22T15:36 109
2024-07-22T15:37 196
2024-07-22T15:38 88
2024-07-22T15:39 208
2024-07-22T15:40 57
2024-07-22T15:41 164
2024-07-22T15:42 124
2024-07-22T15:43 222
2024-07-22T15:44 207
2024-07-22T15:45 189
2024-07-22T15:46 145
2024-07-22T15:47 167
2024-07-22T15:48 161
2024-07-22T15:49 181
2024-07-22T15:50 204
2024-07-22T15:51 102
2024-07-22T15:52 19
2024-07-22T15:53 199
2024-07-22T15:54 212
2024-07-22T15:55 220
2024-07-22T15:56 36 # no more logs until timeout
2024-07-22T16:32 14761 # timeout errors

Same for that run:

2024-07-22T18:20 78
2024-07-22T18:21 2594
2024-07-22T18:22 63
2024-07-22T18:23 1385
2024-07-22T18:24 664
2024-07-22T18:25 208
2024-07-22T18:26 105
2024-07-22T18:27 221
2024-07-22T18:28 57
2024-07-22T18:29 235
2024-07-22T18:30 194
2024-07-22T18:31 142
2024-07-22T18:32 240
2024-07-22T18:33 136
2024-07-22T18:34 244
2024-07-22T18:35 131
2024-07-22T18:36 174
2024-07-22T18:37 207
2024-07-22T18:38 257
2024-07-22T18:39 181
2024-07-22T18:40 167
2024-07-22T18:41 118
2024-07-22T18:43 217
2024-07-22T18:44 234
2024-07-22T19:22 14917

We could allocate more resources to the job,

but could you double check nothing is wrong with the test itself? some sort of deadlock or flakiness.
if 16 cores and 32GB CPU is not enough, approx. how much would you need for the test to succeed consistently?

Edit: latest run succeeded in 24 min - https://github.com/filecoin-project/lotus/actions/runs/10054703843/job/27789842695?pr=12234

Edit: following run failed after 3 min - https://github.com/filecoin-project/lotus/actions/runs/10055630487/job/27792728538?pr=12234

laurentsenta · 2024-07-23T08:08:45Z

Please read my comment above on why I suspect the issue is related to the test itself,

A proposal to move this forward:

I'll add the "create github issue on schedule job failure" feature requested by @BigLep, (edit: Done)
We merge this PR and let the lotus team investigate, thanks to the need/very-expensive-tests label, this will be easier as its own PR managed by the team
If you identify an issue with the runners we can fix that later on.

Co-authored-by: Steve Loeppky <[email protected]>

BigLep

Thanks. Your proposal to move this forward makes sense to me:

.github/workflows/test.yml

rvagg · 2024-07-24T03:46:40Z

How much variability should we expect on the selection of these machines, and do we have a dedicated c5.4xlarge for the specific test?

I've seen a success after 24min: https://github.com/filecoin-project/lotus/actions/runs/10041916541/job/27751141605

Aggregating circuit proofs in the logs marks the end of the most expensive bit of the test, that happens somewhere around epoch 28257; and we have a blocktime of 2ms (but it ends up being much slower than that in practice, we just try and make it as quick as possible and let the block miner work hard to keep up and then we pause it now and again anyway).

It happens at 14:02:29, test finishes up @ 14:03:11. Total time is 1270.43s; which is decent. This is the kind of behaviour I'd expect from this test.

A fail at 3min: https://github.com/filecoin-project/lotus/actions/runs/10043543941/job/27756581112

signal: killed

That looks like an OOM error. I hope we're not taking more than 32G to run this test, are we sure we're getting those resources?

That one failed after 1h: https://github.com/filecoin-project/lotus/actions/runs/10043543941/job/27757352123

Aggregating circuit proofs at around epoch 32517, at 15:56:01, we continue for a bit until there's nothing after 15:56:05 until at all, until a timeout at 16:32:16. Which is very weird, we should at least have some log spam in there about something even if it continues to spin.

This looks like the culprit, but I have no idea why it might get stuck there!

Mon, 22 Jul 2024 16:32:16 GMT
goroutine 107964 [select, 36 minutes]:
Mon, 22 Jul 2024 16:32:16 GMT
github.com/filecoin-project/lotus/chain/stmgr.(*StateManager).WaitForMessage(0xc00084cdc0, {0x7c4fab8?, 0xc00ca90b90?}, {{0xc011844630?, 0x30?}}, 0x2, 0xffffffffffffffff, 0x1)
Mon, 22 Jul 2024 16:32:16 GMT
	/home/runner/work/lotus/lotus/chain/stmgr/searchwait.go:84 +0x9fd
Mon, 22 Jul 2024 16:32:16 GMT
github.com/filecoin-project/lotus/node/impl/full.(*StateModule).StateWaitMsg(0xc009310040, {0x7c4fab8, 0xc00ca90b90}, {{0xc011844630?, 0xc001f36fd0?}}, 0xc001f37078?, 0x2?, 0xd0?)
Mon, 22 Jul 2024 16:32:16 GMT
	/home/runner/work/lotus/lotus/node/impl/full/state.go:607 +0x55
Mon, 22 Jul 2024 16:32:16 GMT
github.com/filecoin-project/lotus/itests/kit.(*TestUnmanagedMiner).waitMessage(0xc00779a540, {{0xc011844630?, 0xc00ed954f0?}})
Mon, 22 Jul 2024 16:32:16 GMT
	/home/runner/work/lotus/lotus/itests/kit/node_unmanaged.go:1405 +0x6a
Mon, 22 Jul 2024 16:32:16 GMT
github.com/filecoin-project/lotus/itests/kit.(*TestUnmanagedMiner).SubmitMessage(0xc00779a540?, {0x7c2e300?, 0xc00ed954f0?}, 0x0?, 0x0?)
Mon, 22 Jul 2024 16:32:16 GMT
	/home/runner/work/lotus/lotus/itests/kit/node_unmanaged.go:1401 +0x48
Mon, 22 Jul 2024 16:32:16 GMT
github.com/filecoin-project/lotus/itests/kit.(*TestUnmanagedMiner).submitProveCommit(0xc00779a540, 0xf, {0xc0004f2500, 0x7, 0x0?}, 0x0, 0xc00421a390)
Mon, 22 Jul 2024 16:32:16 GMT
	/home/runner/work/lotus/lotus/itests/kit/node_unmanaged.go:752 +0x14c7
Mon, 22 Jul 2024 16:32:16 GMT
github.com/filecoin-project/lotus/itests/kit.(*TestUnmanagedMiner).OnboardSectors(0xc00779a540, 0xf, 0x0, 0x7, {0xc001a67ed8, 0x3, 0x2?})
Mon, 22 Jul 2024 16:32:16 GMT
	/home/runner/work/lotus/lotus/itests/kit/node_unmanaged.go:283 +0x41d
Mon, 22 Jul 2024 16:32:16 GMT
command-line-arguments.TestManualNISectorOnboarding.func2(0xc00c875520)
Mon, 22 Jul 2024 16:32:16 GMT
	/home/runner/work/lotus/lotus/itests/niporep_manual_test.go:277 +0xc0c
Mon, 22 Jul 2024 16:32:16 GMT
testing.tRunner(0xc00c875520, 0xc00c103bf0)
Mon, 22 Jul 2024 16:32:16 GMT
	/home/runner/work/_tool/go/1.21.12/x64/src/testing/testing.go:1595 +0xff
Mon, 22 Jul 2024 16:32:16 GMT
created by testing.(*T).Run in goroutine 67

So yes, I think there might be test issues here, but the signal: killed makes me worried that we're not getting the resources we need for this test. I'll have to do some local testing to find out what the memory demands are.

.github/workflows/test.yml

Co-authored-by: Steve Loeppky <[email protected]>

laurentsenta · 2024-07-24T13:01:29Z

@rvagg The test gets a fresh EC2 instance; instances are not shared or reused between tests, no noisy neighbors there,

We can share the AMI we use on the runner if you want to try it out manually on your AWS account,
This is out of scope for this work, but maybe researching the use of GOMEMLIMIT might help.

rvagg

👌 this is good for now I think; the flaky test can be dealt with separately by me

@laurentsenta good idea to have me test in AWS, if you could share the AMI then I'll give that a go and see. My dev machine is too overpowered to simulate this.

rvagg · 2024-07-25T07:04:38Z

#12307

galargh · 2024-07-25T13:51:04Z

@rvagg Could you give us an account ID that you'd like to use the AMI in? We'll make it available for that account.

BigLep · 2024-08-12T18:15:10Z

@rvagg: did you see the accountId comment (maybe was handled offline)? Also, anything else holding us back from merging this?

rvagg · 2024-08-13T01:25:53Z

@BigLep @galargh yes, in slack: https://filecoinproject.slack.com/archives/C06CY12V83S/p1721957023321019, I haven't checked yet whether I've got a new AMI to play with but I didn't get a response to that comment yet.

BigLep · 2024-09-04T23:34:53Z

I assigned this to @rvagg because I believe the ball is in his court for merging this functionality (but this isn't a top priority currently).

laurentsenta added the skip/changelog This change does not require CHANGELOG.md update label Jul 15, 2024

laurentsenta marked this pull request as draft July 15, 2024 09:41

rvagg reviewed Jul 15, 2024

View reviewed changes

.github/workflows/test.yml Outdated Show resolved Hide resolved

rvagg reviewed Jul 15, 2024

View reviewed changes

.github/workflows/test.yml Outdated Show resolved Hide resolved

laurentsenta force-pushed the ipdx-with-very-expensive-flag branch 2 times, most recently from 7f536fb to 7b6ddb5 Compare July 15, 2024 11:44

laurentsenta added need/very-expensive-tests skip/changelog This change does not require CHANGELOG.md update and removed skip/changelog This change does not require CHANGELOG.md update labels Jul 15, 2024

laurentsenta force-pushed the ipdx-with-very-expensive-flag branch 2 times, most recently from 6d114c7 to ce6947c Compare July 15, 2024 13:47

laurentsenta requested a review from galargh July 15, 2024 14:04

galargh approved these changes Jul 16, 2024

View reviewed changes

.github/workflows/test.yml Show resolved Hide resolved

.github/workflows/test.yml Show resolved Hide resolved

laurentsenta marked this pull request as ready for review July 16, 2024 11:57

laurentsenta requested a review from rvagg July 16, 2024 11:57

laurentsenta force-pushed the ipdx-with-very-expensive-flag branch 9 times, most recently from abded1d to 578b05c Compare July 22, 2024 14:10

laurentsenta and others added 8 commits July 23, 2024 11:50

ci: add run-very-expensive-tests checks

17e759f

ci: update tag name

240fad3

ci: labelling events

ee30822

ci: tweak niporep timeout

919535f

ci: simplif timeout

158bcce

ci: add doc

8dea5b8

ci: bump to 60 min

ad641c5

ci: refine .github/pull_request_template.md

ed60445

Co-authored-by: Steve Loeppky <[email protected]>

laurentsenta force-pushed the ipdx-with-very-expensive-flag branch from cb5416c to ed60445 Compare July 23, 2024 09:50

ci: tests: create issue on very expensive test failure (#12277)

dd43d30

laurentsenta requested a review from BigLep July 23, 2024 12:53

BigLep added area/expensive-test-failure Used in .github/workflows/test.yml to denote a VERY_EXPENSIVE_TEST failure. and removed area/expensive-test-failure Used in .github/workflows/test.yml to denote a VERY_EXPENSIVE_TEST failure. labels Jul 23, 2024

BigLep approved these changes Jul 23, 2024

View reviewed changes

.github/workflows/test.yml Outdated Show resolved Hide resolved

.github/workflows/test.yml Outdated Show resolved Hide resolved

laurentsenta commented Jul 24, 2024

View reviewed changes

.github/workflows/test.yml Outdated Show resolved Hide resolved

ci: tweak the create or update behavior

0b62e28

Co-authored-by: Steve Loeppky <[email protected]>

rvagg approved these changes Jul 25, 2024

View reviewed changes

rvagg mentioned this pull request Jul 25, 2024

[DX Streamline] Flaky niporep test when run as "very expensive" #12307

Open

BigLep assigned rvagg Sep 4, 2024

BigLep mentioned this pull request Sep 4, 2024

CI verifies that Lotus works with existing common smart contract developer tools #12362

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ci: tests: adds tooling to run very expensive tests #12234

ci: tests: adds tooling to run very expensive tests #12234

laurentsenta commented Jul 15, 2024 •

edited

Loading

laurentsenta commented Jul 16, 2024

galargh left a comment •

edited

Loading

laurentsenta commented Jul 16, 2024 •

edited

Loading

laurentsenta commented Jul 18, 2024

rvagg commented Jul 19, 2024

rvagg commented Jul 22, 2024

laurentsenta commented Jul 22, 2024

rvagg commented Jul 23, 2024

laurentsenta commented Jul 23, 2024 •

edited

Loading

laurentsenta commented Jul 23, 2024 •

edited

Loading

BigLep left a comment

rvagg commented Jul 24, 2024

laurentsenta commented Jul 24, 2024

rvagg left a comment

rvagg commented Jul 25, 2024

galargh commented Jul 25, 2024

BigLep commented Aug 12, 2024

rvagg commented Aug 13, 2024

BigLep commented Sep 4, 2024 •

edited

Loading

ci: tests: adds tooling to run very expensive tests #12234

Are you sure you want to change the base?

ci: tests: adds tooling to run very expensive tests #12234

Conversation

laurentsenta commented Jul 15, 2024 • edited Loading

Related Issues

Proposed Changes

Additional Info

Checklist

laurentsenta commented Jul 16, 2024

galargh left a comment • edited Loading

Choose a reason for hiding this comment

laurentsenta commented Jul 16, 2024 • edited Loading

laurentsenta commented Jul 18, 2024

rvagg commented Jul 19, 2024

rvagg commented Jul 22, 2024

laurentsenta commented Jul 22, 2024

rvagg commented Jul 23, 2024

laurentsenta commented Jul 23, 2024 • edited Loading

laurentsenta commented Jul 23, 2024 • edited Loading

BigLep left a comment

Choose a reason for hiding this comment

rvagg commented Jul 24, 2024

laurentsenta commented Jul 24, 2024

rvagg left a comment

Choose a reason for hiding this comment

rvagg commented Jul 25, 2024

galargh commented Jul 25, 2024

BigLep commented Aug 12, 2024

rvagg commented Aug 13, 2024

BigLep commented Sep 4, 2024 • edited Loading

laurentsenta commented Jul 15, 2024 •

edited

Loading

galargh left a comment •

edited

Loading

laurentsenta commented Jul 16, 2024 •

edited

Loading

laurentsenta commented Jul 23, 2024 •

edited

Loading

laurentsenta commented Jul 23, 2024 •

edited

Loading

BigLep commented Sep 4, 2024 •

edited

Loading