Incorporate rocm-pytorch and rocm-tensorflow runtime images #626

Merged (3 commits) on Jul 30, 2024


Member

@atheo89 atheo89 commented Jul 19, 2024

Related to: https://issues.redhat.com/browse/RHOAIENG-9680
Depends on: #620

Description

Include the PyTorch AMD and TensorFlow AMD images in the pre-included runtime image lists.
We need to provide runtime images with AMD support so that they can be used when creating pipelines, either through the Elyra component or directly.

NOTE: This PR depends on #620, because the runtime builds use amd-ubi9-python-3.9 as their base image.

Follow-up items that will be split into separate tracking tasks:

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that they work.

@atheo89
Member Author

atheo89 commented Jul 19, 2024

Build Notebooks (pr) / Generate job matrix (pull_request) is failing because #620 should be merged first (see the note in the PR description).

@atheo89
Member Author

atheo89 commented Jul 23, 2024

/retest-required

@atheo89
Member Author

atheo89 commented Jul 23, 2024

Images from the CI builds:
ROCm Runtime TensorFlow build:
ghcr.io/atheo89/notebooks/workbench-images:rocm-runtime-tensorflow-ubi9-python-3.9-RHOAIENG-9680_d8788081cac625bf2e1edf64ed8140c3c7223531

ROCm Runtime PyTorch build:
ghcr.io/atheo89/notebooks/workbench-images:rocm-runtime-pytorch-ubi9-python-3.9-RHOAIENG-9680_d8788081cac625bf2e1edf64ed8140c3c7223531

@atheo89
Member Author

atheo89 commented Jul 23, 2024

This PR is ready for a final review.

@jiridanek
Member

Is the openshift-ci still supposed to be failing?

ERRO[2024-07-23T09:53:09Z] Some steps failed:
ERRO[2024-07-23T09:53:09Z]

  • could not sort nodes
  • steps are missing dependencies
  • step [images] is missing dependencies: <&api.externalImageLink{namespace:"", name:"stable", tag:"runtime-rocm-pytorch-ubi9-python-3.9"}>, <&api.externalImageLink{namespace:"", name:"stable", tag:"runtime-rocm-tensorflow-ubi9-python-3.9"}>
  • step [output:stable:runtime-rocm-pytorch-ubi9-python-3.9] is missing dependencies: <&api.internalImageStreamTagLink{name:"pipeline", tag:"runtime-rocm-pytorch-ubi9-python-3.9", unsatisfiableError:""}>
  • step [output:stable:runtime-rocm-tensorflow-ubi9-python-3.9] is missing dependencies: <&api.internalImageStreamTagLink{name:"pipeline", tag:"runtime-rocm-tensorflow-ubi9-python-3.9", unsatisfiableError:""}>
  • step runtime-rocm-pytorch-ubi9-python-3.9 is missing dependencies: <&api.internalImageStreamTagLink{name:"pipeline", tag:"amd-ubi9-python-3.9", unsatisfiableError:""}>
  • step runtime-rocm-tensorflow-ubi9-python-3.9 is missing dependencies: <&api.internalImageStreamTagLink{name:"pipeline", tag:"amd-ubi9-python-3.9", unsatisfiableError:""}>

INFO[2024-07-23T09:53:09Z] Reporting job state 'failed' with reason 'building_graph'
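
For readers unfamiliar with this kind of failure, the log above reads like a build graph that cannot be ordered because some steps name inputs that no step produces. The following is a minimal sketch of that idea only, not the CI tool's actual implementation: `check_graph` is a hypothetical helper, and the step names are copied from the log purely for illustration. It uses the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def check_graph(steps):
    """Return (order, missing) for a step graph.

    `steps` maps each step name to the set of step names it depends on.
    If any dependency is not itself a known step, no order is computed and
    `missing` maps each affected step to its unresolved dependencies.
    """
    known = set(steps)
    missing = {}
    for name, deps in steps.items():
        unresolved = sorted(set(deps) - known)
        if unresolved:
            missing[name] = unresolved
    if missing:
        return None, missing
    # static_order() yields every dependency before the steps that need it.
    return list(TopologicalSorter(steps).static_order()), {}


# Hypothetical graph: both runtime images need a base image that is not defined.
broken = {
    "runtime-rocm-pytorch-ubi9-python-3.9": {"amd-ubi9-python-3.9"},
    "runtime-rocm-tensorflow-ubi9-python-3.9": {"amd-ubi9-python-3.9"},
}
order, missing = check_graph(broken)
print(order, missing)

# Once the base image step exists, the graph can be ordered.
fixed = dict(broken, **{"amd-ubi9-python-3.9": set()})
order, missing = check_graph(fixed)
print(order, missing)
```

In this sketch the second call succeeds only because the base image step has been added, which mirrors why a change that introduces the base image would need to land first.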

@atheo89
Member Author

atheo89 commented Jul 23, 2024

Is the openshift-ci still supposed to be failing?

I'm not sure what is happening in CI. However, there is a follow-up PR that incorporates more changes: openshift/release#54579

@atheo89
Member Author

atheo89 commented Jul 24, 2024

/test runtime-rocm-tensorflow-ubi9-python-3-9-pr-image-mirror

@atheo89
Member Author

atheo89 commented Jul 24, 2024

The images fail to build because the node was low on ephemeral-storage (threshold quantity: 32127475555, available: 31169256Ki):

 * could not run steps: step rocm-ubi9-python-3.9 failed: error occurred handling build rocm-ubi9-python-3.9-amd64: build not successful after 5 attempts: [the build rocm-ubi9-python-3.9-amd64 failed after 15m15s with reason BuildPodEvicted: The node was low on resource: ephemeral-storage. Threshold quantity: 32127475555, available: 31169256Ki. Container docker-build was using 51577548Ki, request is 0, has larger consumption of ephemeral-storage. , the build rocm-ubi9-python-3.9-amd64 failed after 19m33s with reason BuildPodEvicted: The node was low on resource: ephemeral-storage. Threshold quantity: 32127475555, available: 30127664Ki. Container docker-build was using 45128112Ki, request is 0, has larger consumption of ephemeral-storage. 
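
The numbers in this message mix units: the threshold is given in plain bytes, while the available and used amounts carry a `Ki` suffix (units of 1024 bytes). A small sketch of the comparison follows, assuming only binary suffixes need handling; `to_bytes` is a hypothetical helper written for this illustration, not a Kubernetes API.

```python
# Binary suffixes as used in Kubernetes-style quantities. Only this subset is
# handled here, which is an assumption that suffices for the values in the log.
_BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def to_bytes(quantity: str) -> int:
    """Convert a quantity such as '31169256Ki' or '32127475555' to bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: already plain bytes


threshold = to_bytes("32127475555")  # eviction threshold from the first attempt
available = to_bytes("31169256Ki")   # storage available on the node
used = to_bytes("51577548Ki")        # storage used by the docker-build container

print(f"threshold: {threshold} bytes")
print(f"available: {available} bytes")
print(f"below threshold: {available < threshold}")
print(f"docker-build usage: {used / 1024**3:.1f} GiB")
```

Converted, 31169256Ki is 31,917,318,144 bytes, which is below the 32,127,475,555-byte threshold, consistent with the reported eviction.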

@atheo89
Member Author

atheo89 commented Jul 24, 2024

/retest

@atheo89
Member Author

atheo89 commented Jul 24, 2024

/override ci/prow/runtimes-ubi9-e2e-tests
/override ci/prow/runtimes-ubi8-e2e-tests

Contributor

openshift-ci bot commented Jul 24, 2024

@atheo89: Overrode contexts on behalf of atheo89: ci/prow/runtimes-ubi8-e2e-tests, ci/prow/runtimes-ubi9-e2e-tests


@atheo89
Member Author

atheo89 commented Jul 24, 2024

/override ci/prow/rocm-notebooks-e2e-tests

Contributor

openshift-ci bot commented Jul 24, 2024

@atheo89: Overrode contexts on behalf of atheo89: ci/prow/rocm-notebooks-e2e-tests


@atheo89
Member Author

atheo89 commented Jul 25, 2024

/override ci/prow/rocm-notebooks-e2e-tests
/override ci/prow/runtimes-ubi8-e2e-tests
/override ci/prow/runtimes-ubi9-e2e-tests

/test rocm-runtimes-ubi9-e2e-tests
/test images

Contributor

openshift-ci bot commented Jul 25, 2024

@atheo89: Overrode contexts on behalf of atheo89: ci/prow/rocm-notebooks-e2e-tests, ci/prow/runtimes-ubi8-e2e-tests, ci/prow/runtimes-ubi9-e2e-tests


@atheo89
Member Author

atheo89 commented Jul 25, 2024

/test rocm-runtimes-ubi9-e2e-tests
/test images

Contributor

openshift-ci bot commented Jul 25, 2024

@atheo89: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/notebook-rocm-ubi9-python-3-9-pr-image-mirror | 767e285 | link | true | /test notebook-rocm-ubi9-python-3-9-pr-image-mirror |
| ci/prow/amd-runtimes-ubi9-e2e-tests | d878808 | link | true | /test amd-runtimes-ubi9-e2e-tests |
| ci/prow/images | d878808 | link | true | /test images |
| ci/prow/rocm-runtimes-ubi9-e2e-tests | d878808 | link | true | /test rocm-runtimes-ubi9-e2e-tests |
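
In every row listed here, the rerun command is the test name with the `ci/prow/` prefix replaced by `/test `. A short sketch of that mapping follows; it is inferred only from these rows and is not an official bot API, and `rerun_command` is a hypothetical helper.

```python
PROW_PREFIX = "ci/prow/"


def rerun_command(context: str) -> str:
    """Build the '/test ...' comment for a context named 'ci/prow/<job>'."""
    if not context.startswith(PROW_PREFIX):
        raise ValueError(f"unexpected context name: {context!r}")
    return "/test " + context[len(PROW_PREFIX):]


failed_contexts = [
    "ci/prow/amd-runtimes-ubi9-e2e-tests",
    "ci/prow/images",
    "ci/prow/rocm-runtimes-ubi9-e2e-tests",
]
for context in failed_contexts:
    print(rerun_command(context))
```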


Member

@harshad16 harshad16 left a comment

/lgtm
/approve

Thanks for the work 👍

Contributor

openshift-ci bot commented Jul 26, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: harshad16


@harshad16 harshad16 merged commit 2095eb5 into opendatahub-io:main Jul 30, 2024
11 of 15 checks passed
@atheo89 atheo89 deleted the RHOAIENG-9680 branch October 23, 2024 08:03