If fluent-plugin-opensearch failed to refresh @_aws_credentials, it won't refresh @_aws_credentials anymore #129

Closed
1 of 2 tasks
aYukiSekiguchi opened this issue Feb 27, 2024 · 8 comments

@aYukiSekiguchi
Contributor

(check apply)

  • read the contribution guideline
  • (optional) already reported 3rd party upstream repository or mailing list if you use k8s addon or helm charts.

Steps to replicate

There are no reliable steps to replicate.

The problem occurs when the plugin fails to refresh @_aws_credentials, as in the following error log:

2024-02-23 22:16:07 +0000 [error]: #0 Unexpected error raised. Stopping the timer. title=:out_opensearch_expire_credentials error_class=RuntimeError error="No valid AWS credentials found."
  2024-02-23 22:16:07 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluent-plugin-opensearch-1.1.4/lib/fluent/plugin/out_opensearch.rb:252:in `aws_credentials'
  2024-02-23 22:16:07 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluent-plugin-opensearch-1.1.4/lib/fluent/plugin/out_opensearch.rb:353:in `block (2 levels) in configure'
  2024-02-23 22:16:07 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluent-plugin-opensearch-1.1.4/lib/fluent/plugin/out_opensearch.rb:351:in `synchronize'
  2024-02-23 22:16:07 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluent-plugin-opensearch-1.1.4/lib/fluent/plugin/out_opensearch.rb:351:in `block in configure'
  2024-02-23 22:16:07 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/plugin_helper/timer.rb:80:in `on_timer'
  2024-02-23 22:16:07 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/cool.io-1.8.0/lib/cool.io/loop.rb:88:in `run_once'
  2024-02-23 22:16:07 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/cool.io-1.8.0/lib/cool.io/loop.rb:88:in `run'
  2024-02-23 22:16:07 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/plugin_helper/event_loop.rb:93:in `block in start'
  2024-02-23 22:16:07 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'

It then stopped refreshing and logged the following:

2024-02-23 22:16:07 +0000 [error]: #0 Timer detached. title=:out_opensearch_expire_credentials

As a result, later buffer flushes fail with the error message "The security token included in the request is expired".

FYI: The following is my config, but I don't think this depends on config.

<match apiserver>
  @type copy
  <store>
    @type s3
    <!-- skip -->
  </store>
  <store>
    @type opensearch
    bulk_message_request_threshold 6m
    request_timeout 90s
    resurrect_after 5s
    reload_connections false
    logstash_format true
    logstash_prefix apiserver
    logstash_dateformat %Y.%m.%d
    suppress_type_name true
    time_key time
    include_tag_key true
    tag_key @tag
    id_key _hash
    remove_keys _hash
    <buffer>
      @type file
      path /var/log/fluent/buffer/os/apiserver
      chunk_limit_size 60m
      flush_mode interval
      flush_interval 10s
      flush_at_shutdown true
    </buffer>
    <endpoint>
      url <URL to AWS OpenSearch Service>
      region ap-northeast-1
    </endpoint>
  </store>
</match>

Expected Behavior or What you need to ask

I'm not sure whether this is a bug, but I want fluent-plugin-opensearch to retry refreshing @_aws_credentials at the next refresh_credentials_interval. I guess AssumeRoleCredentials.new() fails if the network is unstable. When this happens, fluent-plugin-opensearch stops sending logs, which is not acceptable for us.

The reason fluent-plugin-opensearch stops refreshing @_aws_credentials is that timer_execute() removes the timer if its block raises an exception.
https://github.com/fluent/fluentd/blob/2b4ca5d2927b706c3bdc98ffd0a0b66232bc0b65/lib/fluent/plugin_helper/timer.rb#L84-L85
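The behaviour described above can be sketched without Fluentd. This is only an illustration, not the plugin's or Fluentd's code: `ToyTimer` and `FlakyCredentials` are hypothetical stand-ins that mimic a timer which detaches when its block raises and a credential source that fails once. The sketch compares an unguarded refresh block with one that rescues the error and keeps the previous credentials until the next tick.

```ruby
# Hypothetical stand-in for a timer helper that detaches a timer when its
# block raises, mimicking the behaviour described above. Not Fluentd code.
class ToyTimer
  attr_reader :detached

  def initialize(&block)
    @block = block
    @detached = false
  end

  # Simulates one timer tick.
  def tick
    return if @detached
    @block.call
  rescue StandardError
    @detached = true # the timer never fires again
  end
end

# Hypothetical credential source whose second call fails (a simulated outage).
class FlakyCredentials
  def initialize
    @calls = 0
  end

  def fetch
    @calls += 1
    raise "No valid AWS credentials found." if @calls == 2
    "token-#{@calls}"
  end
end

# Unguarded: the failure on the first tick detaches the timer for good.
source = FlakyCredentials.new
unguarded_token = source.fetch
unguarded = ToyTimer.new { unguarded_token = source.fetch }
3.times { unguarded.tick }

# Guarded: the block rescues the error, keeps the old token, and the
# timer stays attached so a later tick can succeed.
source = FlakyCredentials.new
guarded_token = source.fetch
guarded = ToyTimer.new do
  begin
    guarded_token = source.fetch
  rescue StandardError => e
    warn "credential refresh failed, will retry on next tick: #{e.message}"
  end
end
3.times { guarded.tick }
```

In this toy run the unguarded timer ends up detached with the stale initial token, while the guarded one recovers on the tick after the failure. Whether swallowing the error is acceptable in the real plugin is a design decision for the maintainers.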

Using Fluentd and OpenSearch plugin versions

  • OS version: Amazon Linux 2
  • Bare Metal or within Docker or Kubernetes or others?: Bare Metal
  • Fluentd v1.0 or later: fluentd 1.16.3
  • OpenSearch plugin version: fluent-plugin-opensearch (1.1.4)
  • OpenSearch version (optional): 1.3
  • OpenSearch template(s) (optional)
@aYukiSekiguchi
Contributor Author

We have been running 6 instances with this plugin for about 1 month, and we hit this bug on 3 of the 6. So this is not a rare problem.

@ashie ashie self-assigned this Mar 22, 2024
@davidpsv17

The same thing is happening to me with the same plugin version.

@akhil31415

@ashie san, Could you please confirm if there's any update for this issue?

@ntopee

ntopee commented Aug 26, 2024

This is similar to #110; we are experiencing the same issue.
In our case, there is an occasional network timeout in some regions while connecting to STS for the AWS token. This raises the error that stops the timer, and the only way to recover is to restart the pods manually.
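For a transient timeout like this, one possible mitigation is a bounded retry around the credential fetch. The following is a minimal sketch under stated assumptions, not what the plugin does: `with_retries` is a hypothetical helper, and the block simulates an STS timeout rather than calling AWS.

```ruby
# Hypothetical helper: run the block up to `attempts` times, sleeping with
# exponential backoff between failures, and re-raise the last error.
def with_retries(attempts:, base_delay:)
  tries = 0
  begin
    tries += 1
    yield tries
  rescue StandardError
    raise if tries >= attempts
    sleep(base_delay * (2**(tries - 1)))
    retry
  end
end

# Simulated fetch: the first two attempts time out, the third succeeds.
calls = 0
token = with_retries(attempts: 3, base_delay: 0.01) do |attempt|
  calls += 1
  raise "STS timeout (simulated)" if attempt < 3
  "fresh-token"
end
```

A retry alone does not cover longer outages, because the error is re-raised once the attempts are exhausted, so the timer block would still need to handle that exception.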

@aYukiSekiguchi
Contributor Author

FYI: Here is my quick and dirty fix:
https://github.com/aYukiSekiguchi/fluent-plugin-opensearch/commits/dont_stop_refresh_aws_credentials/

You can build and install it as follows:

$ fluent-gem build fluent-plugin-opensearch.gemspec
$ sudo fluent-gem install fluent-plugin-opensearch

@cosmo0920
Collaborator

Hi @aYukiSekiguchi,
Could you send your patch as a PR?
It looks like a good workaround to mitigate this issue.

@aYukiSekiguchi
Contributor Author

Sure. I created a PR: #142

@cosmo0920
Collaborator

This should be fixed in #142.
