
bug: fluentd unable to start with error="fork/exec /usr/local/bundle/bin/fluentd: no such file or directory" #1187

Open · joshuabaird opened this issue May 28, 2024 · 31 comments · Fixed by #1195

@joshuabaird
Collaborator

Describe the issue

It seems a recent image was pushed to the kubesphere/fluentd:1.15.3 tag (docker.io/kubesphere/fluentd@sha256:bc06e880c224e76e659bf59250e5302ad159ee6b5474a2c5ee45f3a0969644c5), which breaks fluentd:

level=error msg="start Fluentd error" error="fork/exec /usr/local/bundle/bin/fluentd: no such file or directory"

Pinning to a previous SHA fixes the issue -- kubesphere/fluentd:v1.15.3@sha256:794311919658aee8eb9829836cd6c3437dffd9c7112556d5dc2f01ca3fcb826b.

To Reproduce

Re-pull the latest SHA of the kubesphere/fluentd:1.15.3 tag and run it, as shown below.
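A minimal reproduction sketch (tag and digest as reported above; assumes an x86_64 Docker host):

docker run --rm docker.io/kubesphere/fluentd@sha256:bc06e880c224e76e659bf59250e5302ad159ee6b5474a2c5ee45f3a0969644c5
# fails immediately with:
#   level=error msg="start Fluentd error" error="fork/exec /usr/local/bundle/bin/fluentd: no such file or directory"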

Expected behavior

Fluentd should start.

Your Environment

N/A

How did you install fluent operator?

No response

Additional context

No response

@joshuabaird
Collaborator Author

@benjaminhuo @wenchajun Can someone please review this?

@m-gavrilyuk

Same error on fluent/fluent-operator/fluentd:v2.8.0.

@rurus9

rurus9 commented Jun 3, 2024

I think that, in general, each new image should have a new tag (this does not apply to floating tags like "latest").

@joshuabaird
Collaborator Author

joshuabaird commented Jun 3, 2024

I would agree. Folks rely on versioned tags for stability, and they should be immutable. If these images are going to be rebuilt for whatever reason, perhaps an internal "patch" version should be added (e.g., v1.15.3.x).

@benjaminhuo
Member

This might be related to the CI changes we made recently, cc @sarathchandra24

#1183
#1079

I also remember there is a PR from @sarathchandra24 for a similar issue:
#1093

Would you help take a look, @sarathchandra24?

Thanks

@benjaminhuo
Member

I've built the fluentd v1.17.0 image:

[screenshot]

@joshuabaird
Collaborator Author

joshuabaird commented Jun 4, 2024

@benjaminhuo It looks like the 1.17.0 image has the same bug for x86_64 images. Is this expected?

sarathchandra24 self-assigned this Jun 4, 2024
@sarathchandra24
Collaborator

Sorry for the late response, everyone. I found the problem after running it locally.

The root cause is defaultBinPath in main.go#L22: for amd64 it is "/usr/bin/fluentd", while for arm64 it is "/usr/local/bundle/bin/fluentd".

I'm creating a PR that chooses the path based on the architecture.
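For anyone who wants to confirm the mismatch locally, a rough check (paths as described above; assumes ls is present in both image variants) is to list the two candidate paths in each platform variant of the tag:

docker run --rm --platform linux/amd64 --entrypoint ls kubesphere/fluentd:v1.15.3 -l /usr/bin/fluentd /usr/local/bundle/bin/fluentd
docker run --rm --platform linux/arm64 --entrypoint ls kubesphere/fluentd:v1.15.3 -l /usr/bin/fluentd /usr/local/bundle/bin/fluentd
# each variant should report exactly one of the two paths as existing,
# which is why a single hardcoded defaultBinPath breaks one of the architectures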

@benjaminhuo
Member

Thank you very much @sarathchandra24

@benjaminhuo
Member

benjaminhuo commented Jun 5, 2024

Both 1.15.3 and 1.17.0 are updated; would you try again, @joshuabaird?

[screenshot]

@joshuabaird
Collaborator Author

joshuabaird commented Jun 5, 2024

@benjaminhuo @sarathchandra24 The bug is still present in kubesphere/fluentd:1.17.0@sha256:bc06e880c224e76e659bf59250e5302ad159ee6b5474a2c5ee45f3a0969644c5:

fluentd-1 fluentd level=error msg="start Fluentd error" error="fork/exec /usr/local/bundle/bin/fluentd: no such file or directory"
fluentd-1 fluentd level=info msg=backoff delay=4s

It looks like the v1.15.3 image is still broken as well.

@sarathchandra24
Collaborator

@joshuabaird Can I ask what OS you are using?

Also, I think there is something wrong with the builds or the build system.

[screenshot]

You see the message

level=info msg="Current architecture" arch=amd64

Also for

docker run sarathchandra24/fluentd-arm:local-arm1

[screenshot]

you see the message

level=info msg="Current architecture" arch=arm64

But this is not the case when running

docker run kubesphere/fluentd:1.17.0@sha256:bc06e880c224e76e659bf59250e5302ad159ee6b5474a2c5ee45f3a0969644c5

[screenshot]

@sarathchandra24
Collaborator

@joshuabaird Can you please run

docker run ghcr.io/fluent/fluent-operator/fluentd:v1.17@sha256:095572fbf94ee3bbd01c0597b7b8a113c647e64ad2c53457c9c561432207f99d

and

docker run ghcr.io/fluent/fluent-operator/fluentd:v1.17@sha256:baac1724e2277baf50817d2612f06f0bf3b9050a77e1f7b78d351386b84541b7

to check if the GitHub images are working?

After inspecting the images on GitHub:

running: docker run ghcr.io/fluent/fluent-operator/fluentd:v1.17@sha256:095572fbf94ee3bbd01c0597b7b8a113c647e64ad2c53457c9c561432207f99d

[screenshot]

We can see the message level=info msg="Current architecture" arch=amd64

running: docker run ghcr.io/fluent/fluent-operator/fluentd:v1.17@sha256:baac1724e2277baf50817d2612f06f0bf3b9050a77e1f7b78d351386b84541b7

[screenshot]

We can see the message level=info msg="Current architecture" arch=arm64
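To compare the per-architecture digests a tag points to without starting containers, something like this should also work (assumes Docker with buildx installed; docker manifest inspect is an alternative):

docker buildx imagetools inspect ghcr.io/fluent/fluent-operator/fluentd:v1.17
docker buildx imagetools inspect docker.io/kubesphere/fluentd:v1.15.3
# the output lists each platform (e.g. linux/amd64, linux/arm64) with its digest;
# an individual variant can then be run with docker run <image>@<digest>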

@joshuabaird
Collaborator Author

joshuabaird commented Jun 5, 2024

@sarathchandra24 Yeah - I'm not seeing the log statements in the images on Docker Hub. The images on GitHub do appear to be working as expected (I see the log statements).

We may have a CI problem with copying from GitHub to Docker Hub. I'll take a look at the CI runs and see if I can spot anything.

@joshuabaird
Collaborator Author

@benjaminhuo Also, I just noticed that the fluent-bit images aren't available on GitHub (ghcr.io) -- so we probably need to manually run the CI job that pushes them.

@joshuabaird
Collaborator Author

joshuabaird commented Jun 5, 2024

It also looks like the v1.17.0 linux/amd64 image on GHCR is actually 1.15.3:

❯ docker run --platform linux/amd64 ghcr.io/fluent/fluent-operator/fluentd:v1.17.0                                                                                                      
level=info msg="Current architecture" arch=amd64
level=info msg="Fluentd started"

2024-06-05 16:03:02 +0000 [info]: init supervisor logger path=nil rotate_age=nil rotate_size=nil
2024-06-05 16:03:02 +0000 [info]: parsing config file is succeeded path="/fluentd/etc/fluent.conf"
2024-06-05 16:03:02 +0000 [info]: gem 'fluentd' version '1.15.3'
...
2024-06-05 16:03:02 +0000 [info]: starting fluentd-1.15.3 pid=13 ruby="3.2.4"
2024-06-05 16:03:02 +0000 [info]: spawn command to main:  cmdline=["/usr/bin/ruby", "-Eascii-8bit:ascii-8bit", "/usr/bin/fluentd", "-c", "/fluentd/etc/fluent.conf", "-p", "/fluentd/plugins", "--under-supervisor"]
2024-06-05 16:03:02 +0000 [info]: init supervisor logger path=nil rotate_age=nil rotate_size=nil

The linux/arm64 image, however, is actually v1.17.0:

❯ docker run --platform linux/arm64 ghcr.io/fluent/fluent-operator/fluentd:v1.17.0
Unable to find image 'ghcr.io/fluent/fluent-operator/fluentd:v1.17.0' locally
v1.17.0: Pulling from fluent/fluent-operator/fluentd
Digest: sha256:4651f4340241b53534c5b481422082d9e785e4f9e86cd2d027a51f61e521fe2e
Status: Downloaded newer image for ghcr.io/fluent/fluent-operator/fluentd:v1.17.0
level=info msg="Current architecture" arch=arm64
level=info msg="Fluentd started"
2024-06-05 16:04:25 +0000 [info]: init supervisor logger path=nil rotate_age=nil rotate_size=nil
2024-06-05 16:04:25 +0000 [info]: parsing config file is succeeded path="/fluentd/etc/fluent.conf"
2024-06-05 16:04:25 +0000 [info]: gem 'fluentd' version '1.17.0'
...
2024-06-05 16:04:25 +0000 [info]: starting fluentd-1.17.0 pid=14 ruby="3.3.2"
2024-06-05 16:04:25 +0000 [info]: spawn command to main:  cmdline=["/usr/local/bin/ruby", "-Eascii-8bit:ascii-8bit", "/usr/local/bundle/bin/fluentd", "-c", "/fluentd/etc/fluent.conf", "-p", "/fluentd/plugins", "--under-supervisor"]
2024-06-05 16:04:25 +0000 [info]: #0 init worker0 logger path=nil rotate_age=nil rotate_size=nil
2024-06-05 16:04:25 +0000 [info]: adding match in @FLUENT_LOG pattern="fluent.*" type="null"
2024-06-05 16:04:25 +0000 [info]: #0 starting fluentd worker pid=23 ppid=14 worker=0
2024-06-05 16:04:25 +0000 [info]: #0 fluentd worker is now running worker=0
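As a quicker check than reading the full startup logs, something like this should show which fluentd version each platform variant actually ships (a sketch; it assumes fluentd is on PATH in both variants, which the spawn command lines above suggest):

docker run --rm --platform linux/amd64 --entrypoint fluentd ghcr.io/fluent/fluent-operator/fluentd:v1.17.0 --version
docker run --rm --platform linux/arm64 --entrypoint fluentd ghcr.io/fluent/fluent-operator/fluentd:v1.17.0 --version
# at the time of writing this prints fluentd 1.15.3 for amd64 and fluentd 1.17.0 for arm64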

@benjaminhuo
Member

@joshuabaird I've added you as a maintainer, and you can trigger the image build here:

[screenshot]

@joshuabaird
Collaborator Author

@benjaminhuo @sarathchandra24 Is the intention to build and maintain both v1.15.3 and v1.17.0 fluentd images? Even if you pass 1.17.0 to the workflow, the Dockerfile still installs v1.15.3 here:

&& gem install fluentd -v 1.15.3 \

So, if the intention is to build/maintain both v1.15.3 and 1.17.0, the Dockerfile will need to be modified.
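One option (just a sketch; the FLUENTD_VERSION build argument is an assumption, not something already in the repo) would be to turn the hardcoded version into a build argument and pass it from the workflow:

# in the Dockerfile (hypothetical):
#   ARG FLUENTD_VERSION=1.17.0
#   ...
#   && gem install fluentd -v ${FLUENTD_VERSION} \
# then build with:
docker build --build-arg FLUENTD_VERSION=1.17.0 -t kubesphere/fluentd:v1.17.0 .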

@benjaminhuo
Member

@joshuabaird You're right, the version is hardcoded in the Dockerfile for fluentd; we need to change that to use the new version of fluentd.

@joshuabaird
Collaborator Author

@benjaminhuo But do we want to continue to support v1.15.3 or just modify the Dockerfiles to use 1.17.0?

@benjaminhuo
Member

benjaminhuo commented Jun 11, 2024

We already have the 1.15.3 image built, which meets some people's requirements. I think we can move on to the latest version of fluentd; the image can be replaced with an older version if someone needs it.

@joshuabaird
Collaborator Author

@benjaminhuo #1198

@benjaminhuo
Member

@joshuabaird The new fluentd image for 1.17 has been rebuilt after your PR; would you give it a try?

[screenshot]

@joshuabaird
Collaborator Author

Things are looking good. I'm going to open a PR to update fluent-bit, and then we'll rebuild the fluent-bit images so they get pushed to GHCR.

@joshuabaird
Collaborator Author

joshuabaird commented Jun 18, 2024

@benjaminhuo Any idea why fluentd:v2.8.0 and fluent-bit:v2.8.0 tags exist?

This is confusing, because it's the operator tag, not the fluentd/fluent-bit tag. It is causing dependency update tools (like Dependabot/Renovate) to try to update these images.

Should we delete them?

[screenshot]

@benjaminhuo
Member

@joshuabaird I can delete them; they were created by a wrong CI workflow.

@benjaminhuo
Member

The 2.8.0 tags are all deleted:

[screenshot]

@joshuabaird
Collaborator Author

@benjaminhuo Great, thank you!

@vajgi90

vajgi90 commented Jun 20, 2024

We use Fluent Operator version 2.7.0, which uses Fluentd v1.15.3. Unfortunately, we now get the same error with the old image as well, which previously worked fine:

level=info msg="backoff timer done" actual=16.013265218s expected=16s
level=error msg="start Fluentd error" error="fork/exec /usr/local/bundle/bin/fluentd: no such file or directory"
level=info msg=backoff delay=32s

What can I do to get Fluentd to start?

@joshuabaird
Collaborator Author

@vajgi90 It looks like the amd64 image on Docker Hub for v1.15.3 has the bug. We'll try to get this fixed. Until then, you have two options:

  • Use ghcr.io/fluent/fluent-operator/fluentd:v1.15.3
  • Use ghcr.io/fluent/fluent-operator/fluentd:v1.17.0

@vajgi90

vajgi90 commented Jun 20, 2024

Great, thank you so much for the quick response!
