Free up space in GitHub Actions Runners for remaining jobs #1601

Merged: cmuellner merged 1 commit into riscv-collab:master from patch-1 on Oct 31, 2024

jordancarlin
Contributor

Attempt to resolve issue #1591 by creating more free space on the runner for the jobs that are failing.

@jordancarlin
Contributor Author

@cmuellner @TommyMurphyTM1234 Would be good to get this merged so we can finally get a gcc 14 nightly release

@TommyMurphyTM1234
Collaborator

A few questions...

  1. What's the rationale for deleting the .NET and Android frameworks?
  2. Why are they installed in the first place?
  3. If they are to be removed then why not with apt uninstall rather than rm?
  4. Do we know that removing them frees up much/enough space?
  5. Why not remove them once at "init" time?
  6. Are there perhaps other things that could/should be removed at "init" time or elsewhere?
  7. What controls how much disk space an action container (?) is allocated?
  8. Why not use something like this?

I realise that these commands were already present in other places in the actions but I didn't understand them there either and was always meaning to ask about them.

@cmuellner
Collaborator

Thanks for the PR!

Are you sure that manually deleting unused distro components is sufficient to address the problem? That is, have you reproduced the issue and verified that this fixes it?

If so, then we might consider uninstalling packages.

@jordancarlin
Contributor Author

@TommyMurphyTM1234 In this case I just went with what had already been done in the other jobs for this workflow, but I dealt with a more complicated version of this for another project so can provide some context.

GitHub Actions runners are only guaranteed to have 14 GB of free disk space. In practice they tend to have something in the 20-25 GB range free. The actual runners are much larger (close to 75 GB), but much of that is used up by the default container configuration.

To answer your questions:

> What's the rationale for deleting the .NET and Android frameworks?

These are two of the largest preinstalled items (collectively using 9 GB) and neither is needed for these workflows. Presumably when these jobs were first created they were selected as easy targets to recover space.

> Why are they installed in the first place?

See https://github.com/actions/runner-images for details on what comes preinstalled on the GitHub Action runners. They try to preload most of the software people might need to reduce CI time and avoid the need to install various components every time.

> If they are to be removed then why not with apt uninstall rather than rm?

I believe the runner does not install Android with apt, so it must be manually deleted.

> Do we know that removing them frees up much/enough space?

All of the artifacts that the failing job is trying to download take up ~25 GB in total. Removing these two components gives us ~30 GB of free space.
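
The comparison above (roughly 25 GB of artifacts against the space left on the runner) could be checked at the start of the job before anything is downloaded. This is only a minimal sketch, assuming POSIX `df` and `awk` are available on the runner; the function names `avail_gib` and `has_free_gib` are hypothetical and not part of this PR, and the values are in GiB rather than the GB figures quoted in this thread.

```shell
#!/bin/sh
set -eu

# Print the space available on the filesystem holding $1 (default: .),
# in whole GiB. `df -Pk` gives POSIX-format output in 1024-byte blocks;
# the 4th column of the second line is the available block count.
avail_gib() {
  df -Pk "${1:-.}" | awk 'NR==2 {print int($4 / 1048576)}'
}

# Succeed only if at least $1 GiB are available on the filesystem holding $2.
has_free_gib() {
  hf_avail=$(avail_gib "${2:-.}")
  echo "available: ${hf_avail} GiB, required: $1 GiB"
  [ "$hf_avail" -ge "$1" ]
}

# Example (25 is the approximate artifact total quoted above):
#   has_free_gib 25 / || { echo "not enough disk space" >&2; exit 1; }
```

Failing early like this would at least turn an opaque download failure into an explicit "not enough disk space" message.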

> Why not remove them once at "init" time?

Each job is started in a new container, so there is no way to make things persist between them. There is no "init" time that applies to everything in the workflow.

> Are there perhaps other things that could/should be removed at "init" time or elsewhere?

I created a script that removes almost all of the preinstalled software for another repo (https://github.com/openhwgroup/cvw/blob/main/.github/cli-space-cleanup.sh). With that script the total available free storage increases to 61 GB. If we want to ensure this isn't an issue in the future we could do something like that to remove more software, but it seems like it is probably unnecessary for this.

> What controls how much disk space an action container (?) is allocated?

All GitHub Actions runners are created from the container image linked above and guaranteed to have at least 14 GB of free space. Anything beyond that will fluctuate as the images are updated.

> Why not use something like this?
> https://github.com/marketplace/actions/maximize-build-disk-space

We definitely could, but most of those actions do a lot of other strange things (such as recreating the filesystem to merge in another unused disk) that seem much more likely to break as the image is updated. Just removing software shouldn't fail even if that software were to no longer be included.

@cmuellner
Collaborator

The error log does not tell much. I assume we have not reached the disk space limit of a release's build artifacts, but we have reached the limit of the build machine that does the release step (create a release, download all toolchains, upload toolchains to release).

If my assumption is right, then the issue is that we now have 24 toolchains, and our approach of downloading them all at once and pushing them to the release is not working. Possible solutions: either we move the upload part to the toolchain builders, or we process one toolchain at a time (download from build, upload to release, delete).

@jordancarlin
Contributor Author

> The error log does not tell much. I assume we have not reached the disk space limit of a release's build artifacts, but we have reached the limit of the build machine that does the release step (create a release, download all toolchains, upload toolchains to release).
>
> If my assumption is right, then the issue is that we now have 24 toolchains, and our approach of downloading them all at once and pushing them to the release is not working. Possible solutions: either we move the upload part to the toolchain builders, or we process one toolchain at a time (download from build, upload to release, delete).

Yes. That is what I see as well. The job that downloads all of them runs out of space. The easiest solution would be to create more space on that runner, but changing the workflow to avoid downloading them all could also work. I'm not sure how to upload to the same release from multiple jobs though.
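
For reference, the "one toolchain at a time" option mentioned above could look roughly like the following. This is a sketch under stated assumptions, not what the workflow currently does: it swaps in the GitHub CLI (`gh run download`, `gh release upload`) for whatever actions the release job uses, and `RUN_ID`, `download_artifact`, `upload_to_release`, and `process_artifacts` are hypothetical names introduced only for illustration.

```shell
#!/bin/sh
set -eu

# Hypothetical hooks; replace them to match the real workflow.
# Assumes the GitHub CLI is installed and authenticated, and that RUN_ID
# holds the ID of the workflow run that produced the artifacts.
download_artifact() {  # $1 = artifact name, $2 = target directory
  gh run download "$RUN_ID" --name "$1" --dir "$2"
}
upload_to_release() {  # $1 = release tag, $2 = directory holding the files
  gh release upload "$1" "$2"/* --clobber
}

# Download, upload, and delete one artifact at a time, so that peak disk
# usage is roughly one artifact instead of all of them together.
# $1 = release tag, remaining arguments = artifact names.
process_artifacts() {
  pa_tag=$1
  shift
  for pa_name in "$@"; do
    pa_dir=$(mktemp -d)
    download_artifact "$pa_name" "$pa_dir"
    upload_to_release "$pa_tag" "$pa_dir"
    rm -rf -- "$pa_dir"
  done
}

# Example (tag and artifact names are placeholders):
#   RUN_ID=123456 process_artifacts nightly-tag artifact-a artifact-b
```

Because everything runs in a single job, this would also sidestep the question of uploading to the same release from multiple jobs. If a download or upload fails, the temporary directory is left behind, which a real script would probably want to clean up.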

@cmuellner
Collaborator

> If they are to be removed then why not with apt uninstall rather than rm?
>
> I believe the runner does not install Android with apt, so it must be manually deleted.

For dotnet this could work: apt remove --purge dotnet*.

For things under /usr/local this is different.
We can do the manual remove thing, but then we should justify the path with a comment that references https://github.com/actions/runner-images/blob/main/images/ubuntu/Ubuntu2204-Readme.md

E.g.

Removal of preinstalled Android NDK/SDK components in the image.
The installation path is documented here: ...

I did not look into the installation scripts of the images that install Android in the images.
There might be a better way to remove it.

@TommyMurphyTM1234
Collaborator

> @TommyMurphyTM1234 In this case I just went with what had already been done in the other jobs for this workflow, but I dealt with a more complicated version of this for another project so can provide some context.

Thanks @jordancarlin for the explanations. I guess that I don't understand enough about the GitHub actions/runners etc. and need to read up on them a bit. 🙂

@jordancarlin
Contributor Author

Maybe the best solution is to create a small script that uninstalls several of the tools (can be a simplified version of the one I linked above) and call that script at the beginning of each job. That way it is centralized in one place and can be easily updated if needed.
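
A rough idea of what such a centralized script could contain, as a minimal sketch rather than a finished implementation. The function name `remove_dirs` and the `SUDO` variable are hypothetical, and the two example paths are assumptions about where the runner image keeps .NET and the Android SDK; they should be checked against the runner-images README before being relied on.

```shell
#!/bin/sh
set -eu

# Remove each given directory if it exists. A missing directory is skipped
# rather than treated as an error, so the script keeps working if the
# runner image stops shipping one of them.
# Set SUDO=sudo when the directories are owned by root.
remove_dirs() {
  for rd_dir in "$@"; do
    if [ -d "$rd_dir" ]; then
      echo "Removing $rd_dir"
      ${SUDO:-} rm -rf -- "$rd_dir"
    else
      echo "Skipping $rd_dir (not present)"
    fi
  done
}

# Example call at the start of a job (paths are assumptions, see above):
#   df -h /
#   SUDO=sudo remove_dirs /usr/share/dotnet /usr/local/lib/android
#   df -h /
```

Keeping the list of paths as arguments means each job can pass only what it needs removed, while the removal logic itself lives in one place.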

@cmuellner
Collaborator

After looking into the CI/CD script again, I don't think we need to discuss or change this PR further.
I'll just merge the change, as we already have the exact same code in our repo: https://github.com/riscv-collab/riscv-gnu-toolchain/blob/master/.github/workflows/nightly-release.yaml#L61

That's enough justification to merge this change.

@cmuellner left a comment
Collaborator

LGTM

We already do this in the build step.

@cmuellner merged commit 7d8e9ad into riscv-collab:master on Oct 31, 2024
2 of 26 checks passed
@cmuellner
Collaborator

Thanks again for the PR, @jordancarlin!

@jordancarlin deleted the patch-1 branch October 31, 2024 23:03
@jordancarlin
Contributor Author

Great. Hopefully that'll solve our issues.

@jordancarlin
Contributor Author

Looks like it was a transient network issue that time

3 participants