rocm/tensorflow is too large for GitLab CI #73

Open
Bengt opened this issue Dec 30, 2020 · 2 comments

Comments

Bengt commented Dec 30, 2020

Since upgrading to rocm/tensorflow:rocm4.0-tf2.4-dev, my pipeline jobs on GitLab.com fail:

https://gitlab.com/pfasdr/code/decoder/-/jobs/937693433
https://gitlab.com/pfasdr/code/decoder/-/jobs/937693435

The relevant error message is:

ERROR: Could not install packages due to an EnvironmentError: [Errno 28] No space left on device

As the documentation states, the shared runners on GitLab.com use n1-standard-1 instances:

https://docs.gitlab.com/ee/user/gitlab_com/#linux-shared-runners

These have only 3.75 GB of memory and cannot download the Docker image, which is currently 5.39 GB:

https://cloud.google.com/compute/docs/machine-types#n1_machine_types

When I run the jobs on my local machine via a GitLab runner registered as a group runner, they execute as expected:

https://gitlab.com/pfasdr/code/decoder/-/jobs/937751331
https://gitlab.com/pfasdr/code/decoder/-/jobs/937746578

Running a GitLab runner on my own machine is cumbersome, though. To make the image usable on the GitLab.com shared runners again, it should be slimmed down further, to somewhat under 3.75 GB.

Bengt commented Dec 30, 2020

As a workaround, I used the rocm/dev-ubuntu-20.04 Docker image, installed rccl via apt, and then installed tensorflow-rocm via pip. Here are some successful jobs using this approach:

https://gitlab.com/pfasdr/code/decoder/-/jobs/937928162
https://gitlab.com/pfasdr/code/decoder/-/jobs/937928161
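
For anyone who wants to try the same workaround, a `.gitlab-ci.yml` job along these lines might be a starting point. This is only a sketch of the steps described above, not the configuration from the linked jobs: the job name `test` is made up, installing `python3-pip` assumes the base image does not already ship pip, and the final command is a placeholder for a project's real test step.

```yaml
# Hypothetical job name; adapt it to the project's own pipeline.
test:
  # Smaller base image instead of rocm/tensorflow.
  image: rocm/dev-ubuntu-20.04
  before_script:
    - apt-get update
    # python3-pip is an assumption; drop it if the image already provides pip.
    - apt-get install -y rccl python3-pip
    - pip3 install tensorflow-rocm
  script:
    # Placeholder check; replace with the project's actual test command.
    - python3 -c "import tensorflow as tf; print(tf.__version__)"
```

Installing the packages at job time keeps the pulled image small, at the cost of repeating the download on every run.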

Bengt commented Dec 30, 2020

I created base images for use in TensorFlow ROCm projects:

https://gitlab.com/pfasdr/mesa/pfasdr_mesa_baseimage/container_registry/1598549
