test flakes tracker #579

Closed
cgwalters opened this issue Jun 1, 2024 · 19 comments
Labels
area/ci Issues related to our own CI

Comments

@cgwalters
Collaborator

cgwalters commented Jun 1, 2024

Parsing layer blob: Broken pipe

stderr: "\e[31mERROR\e[0m Switching: Pulling: Importing: Parsing layer blob sha256:4367367aae6325ce7351edb720e7e6929a7f369205b38fa88d140b7e3d0a274f: Broken pipe (os error 32)"

This one is like my enemy! I have a tracker for it over at coreos/rpm-ostree#4567 too.

@henrywang
Contributor

But anyways I think the larger problem pointed out by the aws error message is the script hardcodes a security group in a specific AZ, when it could really be targeting any AZ right?

There's only one zone we can use, because RHEL needs internal access to install podman in order to run the bootc install command. IT only configured one subnet, in one zone.

For the non-RHEL tests we already look up the available zones: https://gitlab.com/fedora/bootc/tests/bootc-workflow-test/-/blob/2bebcdd18f4e0ff9639aff59e2fdfdfcec70f450/playbooks/deploy-aws.yaml#L55.

A few things on this. First it seems like a lot of this script is a basic "provision an ec2 instance" code that could probably be shared and live outside this repo? Maybe we fetch this stuff from a container or a distinct repo?

These are things I'd like to discuss with you at Monday's QE sync meeting.
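
The zone constraint described in this comment could be sketched roughly as follows. This is only an illustration, not the logic of the linked playbook (which is Ansible): `pick_zone`, `PINNED_ZONE`, the `rhel` platform label, and the zone names are all hypothetical. The zone list is passed in as arguments so the function runs without AWS credentials.

```bash
# Hypothetical sketch: choose an availability zone for a test instance.
# RHEL runs are pinned to the one zone that has the internal subnet;
# other platforms may use any zone that is currently available.

# Assumption: this zone name is illustrative, not the real pinned zone.
PINNED_ZONE="${PINNED_ZONE:-us-east-1a}"

# pick_zone PLATFORM [ZONE...]
# Prints the chosen zone, or returns 1 if no zone is usable.
pick_zone() {
    local platform="$1"
    shift
    if [ "$platform" = "rhel" ]; then
        printf '%s\n' "$PINNED_ZONE"
        return 0
    fi
    if [ "$#" -eq 0 ]; then
        echo "no availability zones offered" >&2
        return 1
    fi
    # Take the first offered zone; a real script might pick at random.
    printf '%s\n' "$1"
}

# The zone list could come from something like:
#   aws ec2 describe-availability-zones \
#       --filters Name=state,Values=available \
#       --query 'AvailabilityZones[].ZoneName' --output text
```

Called as `pick_zone fedora $(...)` with the output of the AWS CLI query, this keeps the hardcoded zone limited to the RHEL case.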

@cgwalters
Collaborator Author

There's only one zone we can use, because RHEL needs internal access to install podman in order to run the bootc install command. IT only configured one subnet, in one zone.

OK, got it. Well...per the other discussion, what if we focused only on fedora:40 and centos:stream9 for PR testing by default, and did rhel integration testing both post merge (I'll get the -dev images re-spun up which build relevant things from git main) and also as part of dist-git merges to https://gitlab.com/redhat/centos-stream/rpms/bootc/ ?

@henrywang
Contributor

OK, got it. Well...per the other discussion, what if we focused only on fedora:40 and centos:stream9 for PR testing by default

I agree.

and did rhel integration testing both post merge (I'll get the -dev images re-spun up which build relevant things from git main) and also as part of dist-git merges to https://gitlab.com/redhat/centos-stream/rpms/bootc/ ?

As you mentioned above, a rhel-bootc-dev repo can be added just like centos-bootc-dev, and the -dev image can be stored in a GitLab repo (repos under https://gitlab.com/redhat/rhel/bifrost should be private?). I can add a test job to this repo without adding any test code, so that the pipeline only runs the code from https://gitlab.com/fedora/bootc/tests/bootc-workflow-test. The -dev image can be built daily, and the tests can run daily as well.

I wouldn't suggest adding testing in https://gitlab.com/redhat/centos-stream/rpms/bootc/, to avoid blocking releases. From my perspective, all tests should be run before release, not at release.

cgwalters added the area/ci (Issues related to our own CI) label Jun 4, 2024
@henrywang
Contributor

Recently (over roughly the last week) this error has shown up more often. The automation now retries 3 times in the Ansible playbook as a workaround. Let's see what happens with the retries in place.
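
For readers who want the same idea outside Ansible, a retry wrapper can be sketched in shell as below. This is a minimal sketch, not the playbook's actual implementation: the `retry` helper and its arguments are hypothetical, and the thread does not say which command the playbook retries or with what delay.

```bash
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARG...]
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed.
# Hypothetical helper for illustration only.
retry() {
    local attempts="$1" delay="$2" n=1
    shift 2
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$n" -ge "$attempts" ]; then
            echo "giving up after $n attempts: $*" >&2
            return 1
        fi
        echo "attempt $n failed, retrying in ${delay}s" >&2
        n=$((n + 1))
        sleep "$delay"
    done
}
```

A call such as `retry 3 5 some_flaky_step` (where `some_flaky_step` stands in for whatever is being retried) runs the step at most three times. Note that a retry only hides the flake; the failed attempts are still worth logging so the underlying error stays visible.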

@cgwalters
Collaborator Author

In a different run, we somehow ended up with

Creating root filesystem (xfs) on device /dev/loop0p2 (size=512M)

Which seems related but different from the other one:

Creating root filesystem (xfs) on device /dev/loop0p1 (size=1M)

Actually, having it be 1M sometimes and 512M others looks very much like we're getting partitions swapped.
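
To compare runs, the device and size can be pulled out of these log lines mechanically. The sketch below assumes the line format is exactly what is quoted above; `root_fs_info` is a hypothetical helper name, and it makes no claim about which device or size is the correct one.

```bash
# root_fs_info: read log text on stdin and print "DEVICE SIZE" for every
# "Creating root filesystem" line found. Other lines are ignored.
# Assumes the log format quoted in this comment.
root_fs_info() {
    sed -n 's/.*Creating root filesystem ([^)]*) on device \([^ ]*\) (size=\([^)]*\)).*/\1 \2/p'
}
```

Piping several run logs through it, for example `cat run-*.log | root_fs_info | sort | uniq -c`, shows at a glance whether different device and size pairs turn up across runs.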

@henrywang
Contributor

henrywang commented Jul 3, 2024

Recently the test has been hitting the following error when running bootc install to-existing-root: Installing to filesystem: Creating ostree deployment: Performing deployment: Importing: Parsing layer blob sha256:9536e521dd6b076e09fa076feb4428e4b94e5330c6d6b3ab1e235a54be3d88b7: Failed to invoke skopeo proxy method FinishPipe: remote error: write |1: broken pipe

@cgwalters
Collaborator Author

@henrywang is there anything we can do to fix or improve this?

[13:43:01] [E] [CentOS-Stream-9:x86_64:/plans/e2e/to-disk] guest provisioning failed: Guest couldn't be provisioned: Artemis resource ended in 'error' state
As seen on e.g. https://artifacts.dev.testing-farm.io/4fec6905-15b7-49d6-aff5-2bad9d78a12e/

Having basically permanently-red CI adds mental overhead, because each time we have to check which specific jobs are failing.

@henrywang
Contributor

Yes, there is an issue tracking it: https://issues.redhat.com/browse/TFT-2691

@cgwalters
Collaborator Author

Actually, having it be 1M sometimes and 512M others looks very much like we're getting partitions swapped.

I didn't try to stress test this much, but I think #698 is going to help. At the very least if we are still racing somehow, we'll get a more clear error message.

@cgwalters
Collaborator Author

I didn't try to stress test this much, but I think #698 is going to help. At the very least if we are still racing somehow, we'll get a more clear error message.

I think that fixed the install flake, haven't seen it since.

@henrywang
Contributor

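Recently, the install to-existing-root test has hit the following error in some runs: Installing to filesystem: Creating ostree deployment: Pulling: Importing: Unencapsulating base: failed to invoke method FinishPipe: failed to invoke method FinishPipe: expected 45 bytes in blob, got 139264. I think we should take a look at this error. Thanks.

Failed log example:

* https://artifacts.dev.testing-farm.io/c4f7b9ab-02f7-485f-84dd-9f55559c9129/

* https://artifacts.dev.testing-farm.io/e9de5ed4-d125-4167-8968-ecfbbbe94072/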

@jeckersb
Contributor

jeckersb commented Oct 1, 2024

Recently, the install to-existing-root test has hit the following error in some runs: Installing to filesystem: Creating ostree deployment: Pulling: Importing: Unencapsulating base: failed to invoke method FinishPipe: failed to invoke method FinishPipe: expected 45 bytes in blob, got 139264. I think we should take a look at this error. Thanks.

Failed log example:

* https://artifacts.dev.testing-farm.io/c4f7b9ab-02f7-485f-84dd-9f55559c9129/

* https://artifacts.dev.testing-farm.io/e9de5ed4-d125-4167-8968-ecfbbbe94072/

I noted this one over in the ostree-rs-ext tracker, it's likely related to the other similar issues around broken pipes.
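
Since several distinct error signatures have now appeared in this tracker, a small triage helper can group failing runs by signature. The sketch below is an assumption-heavy illustration: `classify_flake` and its labels are hypothetical, and the patterns are taken only from the error strings quoted in this thread.

```bash
# classify_flake ERROR_LINE
# Prints a short label for one error line. The labels are made up for
# this sketch; the patterns come from errors quoted in this thread.
classify_flake() {
    case "$1" in
        *"expected "*" bytes in blob, got "*) echo "blob-size-mismatch" ;;
        *"roken pipe"*)                       echo "broken-pipe" ;;
        *"guest provisioning failed"*)        echo "provisioning" ;;
        *)                                    echo "unknown" ;;
    esac
}
```

The blob-size pattern is checked first because those lines also mention FinishPipe, which would otherwise make them easy to lump in with the broken-pipe failures.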

@henrywang
Contributor

This issue seems to occur only on bare metal machines (the Testing Farm public ranch runs virtualization tests on AWS bare metal instances). I can't reproduce it in a nested virtualization environment running the same test script.
Is it possible that this issue is related to disk I/O?

@jeckersb
Contributor

jeckersb commented Oct 7, 2024

This issue seems to occur only on bare metal machines (the Testing Farm public ranch runs virtualization tests on AWS bare metal instances). I can't reproduce it in a nested virtualization environment running the same test script. Is it possible that this issue is related to disk I/O?

Hmm could be. Any idea what kind of storage is used on the baremetal instances?

I'm thinking of trying to reproduce in a virtualized environment by attaching the disk via nbd and then using the spinning filter to simulate a slow disk.

@henrywang
Contributor

Hi @jeckersb, do you know of any workaround for the error Installing to filesystem: Creating ostree deployment: Pulling: Importing: Unencapsulating base: failed to invoke method FinishPipe: failed to invoke method FinishPipe: expected 45 bytes in blob, got 139264? It has been failing a lot in our CI. Thanks!

@cgwalters
Collaborator Author

@henrywang isn't that #509 (comment) ? Is the input image zstd:chunked? Is it a RHEL10 system?

@henrywang
Contributor

henrywang commented Oct 28, 2024

It's a C10S system, so yes, the same thing as RHEL 10. Might the following workaround work? Thanks.

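# Tentative workaround for C10S / RHEL 10: force gzip layer compression.
# REDHAT_VERSION_ID is assumed to be set earlier by the test script.
if [[ "${REDHAT_VERSION_ID%%.*}" == "10" ]]; then
    sed -i 's/^compression_format = .*/compression_format = "gzip"/' /usr/share/containers/containers.conf
fi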
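
One thing to watch with a substitution like this: the sed expression only matches a line that already starts with `compression_format = `, so it silently does nothing if the key is absent or commented out. A guarded variant might look like the sketch below. `force_gzip` is a hypothetical helper, GNU sed's `-i` is assumed as in the snippet above, and taking the file path as an argument is an addition so the helper can be tried on a copy first.

```bash
# force_gzip CONF_FILE
# Rewrites an existing, uncommented compression_format line to gzip.
# Returns 1 (and changes nothing) if no such line is present, because
# the sed expression only matches lines that start with the key.
force_gzip() {
    local conf="$1"
    if ! grep -q '^compression_format = ' "$conf"; then
        echo "no uncommented compression_format line in $conf" >&2
        return 1
    fi
    sed -i 's/^compression_format = .*/compression_format = "gzip"/' "$conf"
}
```

Running it against a scratch copy of containers.conf before the real file makes it easy to confirm the workaround actually took effect on a given image.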

@cgwalters
Collaborator Author

Yep per #509 (comment) that's what the new default will be, hopefully soon

@cgwalters
Collaborator Author

I think we're good on this!

cgwalters added a commit to cgwalters/bootc that referenced this issue Nov 5, 2024
cgwalters pushed a commit to cgwalters/bootc that referenced this issue Nov 6, 2024
3 participants