Add automatic triage utility #793

Merged
merged 5 commits into from
Oct 15, 2024

Conversation

olupton
Collaborator

@olupton olupton commented May 2, 2024

This is a small Python program that implements a two-stage bisection, identifying a commit in JAX or XLA that caused a test case to start failing.

An example use-case was the JetTest.test_dot test, which had been failing in the JAX unit tests on A100 for an extended period.

$ .github/triage/triage --container=jax test-jax.sh jet_test_gpu
[INFO] 2024-05-01 06:24:23 Checking end-of-range failure in 2024-04-30
[INFO] 2024-05-01 06:25:13 Ran test case in 2024-04-30 in 48.6s
[INFO] 2024-05-01 06:25:14 Starting coarse search with 2024-04-29 based on --end-date=None and end_date=2024-04-30
[INFO] 2024-05-01 06:26:07 Ran test case in 2024-04-29 in 52.4s
[INFO] 2024-05-01 06:26:58 Ran test case in 2024-04-28 in 49.8s
[INFO] 2024-05-01 06:27:52 Ran test case in 2024-04-26 in 51.4s
[INFO] 2024-05-01 06:28:50 Ran test case in 2024-04-22 in 54.7s
[INFO] 2024-05-01 06:29:44 Ran test case in 2024-04-14 in 51.4s
[INFO] 2024-05-01 06:30:36 Ran test case in 2024-03-29 in 49.7s
[INFO] 2024-05-01 06:31:31 Ran test case in 2024-02-26 in 52.5s
[INFO] 2024-05-01 06:32:40 Ran test case in 2023-12-24 in 68.0s
[INFO] 2024-05-01 06:34:23 Ran test case in 2023-08-18 in 101.6s
[INFO] 2024-05-01 06:34:23 Coarse container-level search yielded [2023-08-18, 2023-12-24]...
[INFO] 2024-05-01 06:35:28 Ran test case in 2023-10-20 in 61.4s
[INFO] 2024-05-01 06:35:28 Refined container-level range to [2023-10-20, 2023-12-24]
[INFO] 2024-05-01 06:36:32 Ran test case in 2023-11-21 in 63.4s
[INFO] 2024-05-01 06:36:32 Refined container-level range to [2023-10-20, 2023-11-21]
[INFO] 2024-05-01 06:37:38 Ran test case in 2023-11-03 in 61.5s
[INFO] 2024-05-01 06:37:38 Refined container-level range to [2023-11-03, 2023-11-21]
[INFO] 2024-05-01 06:38:45 Ran test case in 2023-11-12 in 66.4s
[INFO] 2024-05-01 06:38:45 Refined container-level range to [2023-11-03, 2023-11-12]
[INFO] 2024-05-01 06:39:49 Ran test case in 2023-11-09 in 59.5s
[INFO] 2024-05-01 06:39:49 Refined container-level range to [2023-11-03, 2023-11-09]
[INFO] 2024-05-01 06:39:52 Could not adjust 2023-11-06 00:00:00 given before=2023-11-09 and after=2023-11-03
[INFO] 2024-05-01 06:39:53 Bisecting JAX [db07f402333997d5d570a2d0478f9a15f79bf8b2, 9b1572c02859287294dec7a6b3df9018f462c8aa] and XLA [049a3e6caf3f60b7edcced4c05f579bf2c3f26dc, 0bc02d71f75ce5a4ea93ade1957a79c3227ce466] using ghcr.io/nvidia/jax:nightly-2023-11-09
[INFO] 2024-05-01 06:39:53 Building in the range-ending 2023-11-09 container...
[INFO] 2024-05-01 06:39:54 Checking out XLA 0bc02d71f75ce5a4ea93ade1957a79c3227ce466 JAX 9b1572c02859287294dec7a6b3df9018f462c8aa
[INFO] 2024-05-01 06:41:14 Build completed in 80.0s
[INFO] 2024-05-01 06:42:14 Test completed in 60.4s
[INFO] 2024-05-01 06:42:14 Verified test failure after rebuilding in 2023-11-09
[INFO] 2024-05-01 06:42:14 Checking out XLA 049a3e6caf3f60b7edcced4c05f579bf2c3f26dc JAX db07f402333997d5d570a2d0478f9a15f79bf8b2
[INFO] 2024-05-01 06:43:14 Build completed in 59.3s
[INFO] 2024-05-01 06:44:10 Test completed in 56.0s
[INFO] 2024-05-01 06:44:10 Test passes after rebuilding commits from 2023-11-03 in 2023-11-09
[INFO] 2024-05-01 06:44:10 Checking out XLA adee86839deffae1f789b483178eaf6454d0bd9b JAX 5a1731c16fd62ad1b6c3fbd059a65332949f4f15
[INFO] 2024-05-01 06:45:07 Build completed in 56.6s
[INFO] 2024-05-01 06:45:43 Test completed in 36.3s
[INFO] 2024-05-01 06:45:43 Checking out XLA 27c69deb22cafafb1226c6ed027d5917ef29f538 JAX 1c1dd7c8c7ff2e5790159d9cdbbbb1f029a92d4b
[INFO] 2024-05-01 06:46:37 Build completed in 53.7s
[INFO] 2024-05-01 06:47:28 Test completed in 51.6s
[INFO] 2024-05-01 06:47:29 Checking out XLA d257360bcc1ebc3cdca939baebfe8c48134e2b6a JAX 390022a227f7271fc0caebed3c4c066a667b8628
[INFO] 2024-05-01 06:48:22 Build completed in 53.4s
[INFO] 2024-05-01 06:48:59 Test completed in 37.5s
[INFO] 2024-05-01 06:49:00 Checking out XLA 75bd8cc70166bb532c5e7d98d929c2004b32b540 JAX dda76733e835788c611b0687f0477e469a92881b
[INFO] 2024-05-01 06:49:29 Build completed in 29.3s
[INFO] 2024-05-01 06:50:11 Test completed in 42.6s
[INFO] 2024-05-01 06:50:12 Checking out XLA e112baca8630aca294f996a5d95028e59ade56e5 JAX 1e810983fa7331a1ff18941e484b2a732c785e9d
[INFO] 2024-05-01 06:50:40 Build completed in 28.1s
[INFO] 2024-05-01 06:51:18 Test completed in 38.4s
[INFO] 2024-05-01 06:51:18 Checking out XLA 7fba2ad0bc4c21f8d62f09f07876ce2616c73349 JAX bfbf9e1c3313cae2dfd71e39e64c50c8f42017ef
[INFO] 2024-05-01 06:51:44 Build completed in 25.7s
[INFO] 2024-05-01 06:52:22 Test completed in 37.8s
[INFO] 2024-05-01 06:52:22 Checking out XLA a58070090a025db5cbdd4b596f9ef714fa13641a JAX 1126945da8e3d60b2ca670b78957ca38e92d23d6
[INFO] 2024-05-01 06:52:48 Build completed in 26.1s
[INFO] 2024-05-01 06:53:26 Test completed in 37.9s
[INFO] 2024-05-01 06:53:26 Checking out XLA 7fba2ad0bc4c21f8d62f09f07876ce2616c73349 JAX 1126945da8e3d60b2ca670b78957ca38e92d23d6
[INFO] 2024-05-01 06:53:52 Build completed in 25.9s
[INFO] 2024-05-01 06:54:30 Test completed in 37.9s
[INFO] 2024-05-01 06:54:30 Bisected failure to XLA 7fba2ad0bc4c21f8d62f09f07876ce2616c73349..a58070090a025db5cbdd4b596f9ef714fa13641a with JAX 1126945da8e3d60b2ca670b78957ca38e92d23d6

This points the finger at openxla/xla@a580700. The failure is worked around in jax-ml/jax#21035.
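For readers following the log, the container-level stage can be sketched roughly as below. This is a minimal illustration, not the code in this PR: `coarse_then_refine`, `passes` and `max_steps` are hypothetical names, and the sketch assumes a nightly container exists for every date. The backwards steps double after each failure, which matches the spacing of the dates probed in the log above (1, 1, 2, 4, 8, ... days). The real tool also has to cope with missing containers, so its refinement midpoints can differ from the ones computed here.

```python
import datetime


def coarse_then_refine(end_date, passes, max_steps=12):
    """Return (last_good, first_bad) dates, assuming the test fails on end_date.

    `passes(date)` is a hypothetical callback that runs the test case in the
    nightly container for `date` and returns True if it succeeds.
    """
    # Coarse search: walk backwards from the known-bad end date, doubling the
    # step after every failing container, until a passing one is found.
    first_bad = end_date
    step = datetime.timedelta(days=1)
    candidate = end_date - step
    for _ in range(max_steps):
        if passes(candidate):
            break
        first_bad = candidate
        candidate -= step
        step *= 2
    else:
        raise RuntimeError("no passing container found in the search window")
    last_good = candidate

    # Refinement: ordinary bisection on dates until the range is one day wide.
    while (first_bad - last_good).days > 1:
        half = (first_bad - last_good).days // 2
        mid = last_good + datetime.timedelta(days=half)
        if passes(mid):
            last_good = mid
        else:
            first_bad = mid
    return last_good, first_bad
```

The doubling step keeps the number of container runs logarithmic in the age of the regression, which matters when each run takes around a minute.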

@olupton olupton requested a review from DwarKapex May 3, 2024 12:19
@olupton olupton requested a review from gspschmid May 14, 2024 07:16
@olupton
Collaborator Author

olupton commented May 14, 2024

A different test case: .github/triage/triage --container jax --bazel-cache ... test-jax.sh linear_search_test_gpu points to openxla/xla@a3f9a7e for https://github.com/NVIDIA/JAX-Toolbox/actions/runs/9060572476/job/24891187791#step:7:772

[INFO] 2024-05-13 23:15:24 Checking end-of-range failure in 2024-05-13
[INFO] 2024-05-13 23:16:15 Ran test case in 2024-05-13 in 47.5s
[INFO] 2024-05-13 23:17:05 Starting coarse search with 2024-05-12 based on --end-date=None and end_date=2024-05-13
[INFO] 2024-05-13 23:17:52 Ran test case in 2024-05-12 in 42.6s
[INFO] 2024-05-13 23:19:25 Ran test case in 2024-05-11 in 43.2s
[INFO] 2024-05-13 23:21:05 Ran test case in 2024-05-09 in 45.9s
[INFO] 2024-05-13 23:21:05 Coarse container-level search yielded [2024-05-09, 2024-05-11]...
[INFO] 2024-05-13 23:22:46 Ran test case in 2024-05-10 in 46.9s
[INFO] 2024-05-13 23:22:46 Refined container-level range to [2024-05-10, 2024-05-11]
[INFO] 2024-05-13 23:22:47 Bisecting JAX [f21e3e82c78454ef83975fc3964998b48451519f, 3b03e5497d70a7d0745f1bd421e4cb4f9dd56ebd] and XLA [9b6f2cbb95482d7c69ec56125ebf392b8a2faad1, 93fe4f0d20fa1ad29fee664f7842d7e427dc6cf1] using ghcr.io/nvidia/jax:jax-2024-05-11
[INFO] 2024-05-13 23:22:47 Building in the range-ending 2024-05-11 container...
[INFO] 2024-05-13 23:22:48 Checking out XLA 93fe4f0d20fa1ad29fee664f7842d7e427dc6cf1 JAX 3b03e5497d70a7d0745f1bd421e4cb4f9dd56ebd
[INFO] 2024-05-13 23:35:41 Build completed in 773.3s
[INFO] 2024-05-13 23:36:06 Test completed in 24.9s
[INFO] 2024-05-13 23:36:06 Verified test failure after rebuilding in 2024-05-11
[INFO] 2024-05-13 23:36:06 Checking out XLA 9b6f2cbb95482d7c69ec56125ebf392b8a2faad1 JAX f21e3e82c78454ef83975fc3964998b48451519f
[INFO] 2024-05-13 23:45:28 Build completed in 562.5s
[INFO] 2024-05-13 23:45:50 Test completed in 21.9s
[INFO] 2024-05-13 23:45:50 Test passes after rebuilding commits from 2024-05-10 in 2024-05-11
[INFO] 2024-05-13 23:45:50 Checking out XLA cd35e5c5c1cffcdf3e914a94141d42163e215951 JAX 0a3e4327451a1825047cb883813ceb6c1b36f5ac
[INFO] 2024-05-13 23:49:19 Build completed in 208.5s
[INFO] 2024-05-13 23:49:51 Test completed in 31.8s
[INFO] 2024-05-13 23:49:51 Checking out XLA a3f9a7e68b82df174fee22b80a6ec1185ebca719 JAX 17444fc8fab32426f5ff8bbbe7be6507ec1641ea
[INFO] 2024-05-13 23:53:12 Build completed in 200.9s
[INFO] 2024-05-13 23:53:41 Test completed in 29.0s
[INFO] 2024-05-13 23:53:41 Checking out XLA 48f4ca5dd3b0900acac5f645f0f0911efe42350c JAX c231cd51eb074554b4c2abd115caa00f8bba3665
[INFO] 2024-05-13 23:54:55 Build completed in 74.4s
[INFO] 2024-05-13 23:55:27 Test completed in 31.5s
[INFO] 2024-05-13 23:55:27 Checking out XLA 4f6e8d670df87c114f7dc5e210b3846e25ef7932 JAX 17444fc8fab32426f5ff8bbbe7be6507ec1641ea
[INFO] 2024-05-13 23:56:16 Build completed in 48.8s
[INFO] 2024-05-13 23:56:47 Test completed in 31.3s
[INFO] 2024-05-13 23:56:47 Checking out XLA f9258de517de7ddfa912ac629eea2b827cc20575 JAX 17444fc8fab32426f5ff8bbbe7be6507ec1641ea
[INFO] 2024-05-13 23:57:47 Build completed in 59.7s
[INFO] 2024-05-13 23:58:18 Test completed in 31.3s
[INFO] 2024-05-13 23:58:18 Checking out XLA f9258de517de7ddfa912ac629eea2b827cc20575 JAX 17444fc8fab32426f5ff8bbbe7be6507ec1641ea
[INFO] 2024-05-13 23:58:53 Build completed in 34.2s
[INFO] 2024-05-13 23:59:24 Test completed in 31.4s
[INFO] 2024-05-13 23:59:24 Bisected failure to XLA f9258de517de7ddfa912ac629eea2b827cc20575..a3f9a7e68b82df174fee22b80a6ec1185ebca719 with JAX 17444fc8fab32426f5ff8bbbe7be6507ec1641ea
xla f9258de517de7ddfa912ac629eea2b827cc20575 a3f9a7e68b82df174fee22b80a6ec1185ebca719 jax 17444fc8fab32426f5ff8bbbe7be6507ec1641ea
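The commit-level stage visible in these logs can be approximated as follows. This is a sketch under assumptions, not the PR's implementation: it assumes the JAX and XLA commits in the range are merged into one timeline ordered by timestamp, that each point on the timeline pins the other repository to its most recent commit, and that `build_and_test` (a hypothetical callback here) returns True when the test passes. `bisect_commits` and the tuple layout are illustrative names only.

```python
def bisect_commits(xla_commits, jax_commits, build_and_test):
    """Locate the first failing commit across two repositories.

    xla_commits / jax_commits: lists of (timestamp, sha), oldest first. The
    oldest pair is assumed to pass and the newest pair to fail, and at least
    one of the lists must contain more than one commit.
    Returns (repo, last_good_sha, first_bad_sha).
    """
    # One timeline of every commit after the known-good starting pair.
    timeline = sorted(
        [(ts, "xla", sha) for ts, sha in xla_commits[1:]]
        + [(ts, "jax", sha) for ts, sha in jax_commits[1:]]
    )
    # states[i] holds the versions of both repos after the first i commits.
    states = [{"xla": xla_commits[0][1], "jax": jax_commits[0][1]}]
    for _, repo, sha in timeline:
        states.append({**states[-1], repo: sha})

    lo, hi = 0, len(states) - 1  # invariant: states[lo] passes, states[hi] fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if build_and_test(states[mid]["xla"], states[mid]["jax"]):
            lo = mid
        else:
            hi = mid
    # states[lo] and states[hi] differ by exactly one commit in one repo.
    _, repo, first_bad = timeline[hi - 1]
    return repo, states[lo][repo], first_bad
```

Because adjacent states differ by a single commit, the result names one repository and a last-good/first-bad pair, similar in shape to the final "Bisected failure to XLA ... with JAX ..." line.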

Contributor

@gspschmid gspschmid left a comment

Looks good overall, a few comments.

Algorithmic parts seem reasonable to me, but would nonetheless likely benefit from some tests :-)

Fwiw, I think this is a good target for property-based testing -- the idea being that we can randomly sample (or enumerate) some small scenarios and designate an XLA or JAX commit from which the CI will fail. So we'd mock container_exists, check_container and build_and_test and then test the logic using a model like

import datetime
import random

# Fabricate XLA and JAX commits for days a/b/c
def fake_commits(n_commits, n_days, rng):
  dt = datetime.datetime(2024, 10, 1)
  delta = datetime.timedelta(days=n_days) / n_commits
  commits: list[tuple[bool, datetime.datetime, str]] = []
  for i in range(n_commits):
    dt += delta
    is_jax = rng.random() < 0.5  # coin flip: JAX commit or XLA commit
    commit = '{0:09d}'.format(i)  # not really hex, but whatever
    commits.append((is_jax, dt, commit))
  return commits

def test_one_scenario():
  commits = fake_commits(100, 10, random.Random(123))
  # Designate each commit in turn as the one from which the test starts failing
  for bad_jax, bad_dt, bad_commit in commits:
    def build_and_test(xla_commit, jax_commit):
      return jax_commit >= bad_commit if bad_jax else xla_commit >= bad_commit
    def check_container(date):
      return ... # build_and_test(earliest xla commit >= date, earliest jax commit >= xla commit)
    # triage() stands for the logic under test, with the mocks passed in
    assert triage(commits, check_container, build_and_test) == (bad_jax, bad_commit)
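To make the testing idea concrete without depending on the real tool, here is a self-contained sketch of the same pattern. `toy_bisect` is a stand-in for the logic under test, not the `triage` function from this PR, and `check_random_scenarios` is a hypothetical helper. It draws random scenario sizes, designates every commit in turn as the culprit, and checks both the answer and the number of probes.

```python
import math
import random


def toy_bisect(n, fails):
    """Stand-in for the logic under test: index of the first failing commit.

    Assumes commit 0 passes, commit n - 1 fails, and failure is monotone.
    """
    lo, hi = 0, n - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid
        else:
            lo = mid
    return hi


def check_random_scenarios(n_scenarios, seed):
    """Run the bisection against every possible culprit in random scenarios."""
    rng = random.Random(seed)
    checked = 0
    for _ in range(n_scenarios):
        n_commits = rng.randint(2, 200)
        for culprit in range(1, n_commits):
            probes = []

            # Mocked "build and test": fails from the culprit commit onwards.
            def fails(i, culprit=culprit, probes=probes):
                probes.append(i)
                return i >= culprit

            assert toy_bisect(n_commits, fails) == culprit
            # Bisection should need at most ceil(log2(n)) probes.
            assert len(probes) <= math.ceil(math.log2(n_commits))
            checked += 1
    return checked
```

A fixed seed keeps any failure reproducible. A library such as Hypothesis could replace the hand-rolled sampling, but plain `random` avoids an extra dependency.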

gspschmid previously approved these changes Oct 9, 2024
@gpupuck
Contributor

gpupuck commented Oct 11, 2024

My case here:

jax-toolbox-triage --container=maxtext --start-date 2024-10-09 --end-date 2024-10-10 -- test-maxtext.sh -b 4 --model-name=llama2-7b --attn-type=cudnn_flash_te --remat-policy=minimal_flash --steps=10 --fsdp=8 --output train_output -a "scan_layers=true max_target_length=4096 use_iota_embed=true logits_dot_in_fp32=false"
[INFO] 2024-10-11 06:07:36 Checking end-of-range failure in 2024-10-10
[INFO] 2024-10-11 06:08:51 Ran test case in 2024-10-10 in 73.5s
[INFO] 2024-10-11 06:08:51 Starting coarse search with 2024-10-09 based on --start-date
[INFO] 2024-10-11 06:10:14 Ran test case in 2024-10-09 in 81.9s
[INFO] 2024-10-11 06:10:14 Coarse container-level search yielded [2024-10-09, 2024-10-10]...
[INFO] 2024-10-11 06:10:16 Bisecting JAX [9cf952a535518da59cdcecc9145dba287beddca2, 351187d9dac6767e4e08845da87ccb918eb0f5b2] and XLA [aace8011552f7ccc700b25c4be9acbee0ab0a997, 80784a0bc05f52e7f49ae54cb4656a2ac0c9d412] using ghcr.io/nvidia/jax:maxtext-2024-10-10
[INFO] 2024-10-11 06:10:16 Building in the range-ending container...
...
[INFO] 2024-10-11 06:42:55 Test completed in 68.8s
[INFO] 2024-10-11 06:42:55 Checking out XLA bdc227e6867b01f482fc5018ffee2b64e57d1c63 JAX b65be4e1ae1e67aeaf6d2075e79e1a9ae8819cad
[INFO] 2024-10-11 06:43:55 Build completed in 59.7s
[INFO] 2024-10-11 06:45:12 Test completed in 76.8s
[INFO] 2024-10-11 06:45:12 Checking out XLA 9036d19af836e9c9583dc882f1d899c8677d7800 JAX 351187d9dac6767e4e08845da87ccb918eb0f5b2
[INFO] 2024-10-11 06:45:51 Build completed in 39.3s
[INFO] 2024-10-11 06:47:07 Test completed in 76.0s
[INFO] 2024-10-11 06:47:08 Checking out XLA 9036d19af836e9c9583dc882f1d899c8677d7800 JAX 351187d9dac6767e4e08845da87ccb918eb0f5b2
[INFO] 2024-10-11 06:47:47 Build completed in 39.0s
[INFO] 2024-10-11 06:49:03 Test completed in 76.8s
[INFO] 2024-10-11 06:49:03 Bisected failure to XLA 9036d19af836e9c9583dc882f1d899c8677d7800..28a4ebf1369ce34c50dfcbdf9ec5bf5acdfa4e22 with JAX 351187d9dac6767e4e08845da87ccb918eb0f5b2

And this 28a4ebf1369ce34c50dfcbdf9ec5bf5acdfa4e22 is the culprit commit!

Contributor

@gpupuck gpupuck left a comment

What nice and neat work!

@olupton olupton merged commit 70f67a8 into main Oct 15, 2024
73 of 75 checks passed
@olupton olupton deleted the olupton/triage branch October 15, 2024 09:36
olupton added a commit that referenced this pull request Oct 22, 2024
Improve documentation of the triage tool added in
#793.