Plugin does not recover after network issues #6

Open
ThomKoomen opened this issue Nov 3, 2017 · 6 comments

@ThomKoomen

When network issues occur while using td-agent with the fluent-plugin-beats plugin, it stops sending data and throws the exception(s) and stack trace below several times.
This is, of course, expected behaviour, and the logging is genuinely helpful.
However, after the network issues recover (automatically), the plugin does not resume sending data.
Trying to restart td-agent (service td-agent restart) hangs indefinitely, so the td-agent processes have to be killed.
After eventually getting td-agent back up again, it runs without problems.

td-agent log:

2017-10-24 00:34:16 +0000 [error]: unexpected error error="Resource temporarily unavailable"
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-beats-0.1.2/lib/lumberjack/beats/server.rb:342:in `sysread'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-beats-0.1.2/lib/lumberjack/beats/server.rb:342:in `read_socket'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-beats-0.1.2/lib/lumberjack/beats/server.rb:319:in `run'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-beats-0.1.2/lib/fluent/plugin/in_beats.rb:97:in `block in run'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:331:in `call'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:331:in `run_task'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:320:in `block (3 levels) in create_worker'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:303:in `loop'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:303:in `block (2 levels) in create_worker'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:302:in `catch'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:302:in `block in create_worker'
2017-10-24 00:34:16 +0000 [error]: unexpected error error="Bad file descriptor"
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-beats-0.1.2/lib/lumberjack/beats/server.rb:342:in `sysread'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-beats-0.1.2/lib/lumberjack/beats/server.rb:342:in `read_socket'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-beats-0.1.2/lib/lumberjack/beats/server.rb:319:in `run'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-beats-0.1.2/lib/fluent/plugin/in_beats.rb:97:in `block in run'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:331:in `call'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:331:in `run_task'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:320:in `block (3 levels) in create_worker'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:303:in `loop'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:303:in `block (2 levels) in create_worker'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:302:in `catch'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:302:in `block in create_worker'

metricbeat log:

2017-10-24T02:34:35+02:00 INFO Non-zero metrics in the last 30s: fetches.system-cpu.events=1 fetches.system-cpu.success=1 fetches.system-diskio.events=6 fetches.system-diskio.success=1 fetches.system-filesystem.events=2 fetches.system-filesystem.success=1 fetches.system-memory.events=1 fetches.system-memory.success=1 fetches.system-network.events=2 fetches.system-network.success=1 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.write_bytes=1276 libbeat.publisher.messages_in_worker_queues=10 libbeat.publisher.published_events=10
2017-10-24T02:34:36+02:00 ERR Failed to publish events caused by: read tcp 161.86.62.204:36038->161.86.62.206:5044: i/o timeout
2017-10-24T02:34:36+02:00 INFO Error publishing events (retrying): read tcp 161.86.62.204:36038->161.86.62.206:5044: i/o timeout
2017-10-24T02:35:05+02:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.read_errors=1 libbeat.logstash.publish.write_bytes=801 libbeat.logstash.published_but_not_acked_events=10
2017-10-24T02:35:07+02:00 ERR Failed to publish events caused by: read tcp 161.86.62.204:54702->161.86.62.206:5044: i/o timeout
2017-10-24T02:35:07+02:00 INFO Error publishing events (retrying): read tcp 161.86.62.204:54702->161.86.62.206:5044: i/o timeout
2017-10-24T02:35:13+02:00 ERR Failed to publish events caused by: read tcp 161.86.62.204:54922->161.86.62.206:5044: read: connection reset by peer
2017-10-24T02:35:13+02:00 INFO Error publishing events (retrying): read tcp 161.86.62.204:54922->161.86.62.206:5044: read: connection reset by peer
2017-10-24T02:35:23+02:00 ERR Failed to publish events caused by: read tcp 161.86.62.204:54992->161.86.62.206:5044: read: connection reset by peer
2017-10-24T02:35:23+02:00 INFO Error publishing events (retrying): read tcp 161.86.62.204:54992->161.86.62.206:5044: read: connection reset by peer
2017-10-24T02:35:35+02:00 INFO Non-zero metrics in the last 30s: fetches.system-cpu.events=1 fetches.system-cpu.success=1 fetches.system-diskio.events=6 fetches.system-diskio.success=1 fetches.system-filesystem.events=2 fetches.system-filesystem.success=1 fetches.system-memory.events=1 fetches.system-memory.success=1 fetches.system-network.events=2 fetches.system-network.success=1 libbeat.logstash.call_count.PublishEvents=3 libbeat.logstash.publish.read_bytes=24 libbeat.logstash.publish.read_errors=3 libbeat.logstash.publish.write_bytes=2700 libbeat.logstash.published_and_acked_events=10 libbeat.logstash.published_but_not_acked_events=30 libbeat.outputs.messages_dropped=1 libbeat.publisher.messages_in_worker_queues=10 libbeat.publisher.published_events=10
2017-10-24T02:36:05+02:00 INFO No non-zero metrics in the last 30s
2017-10-24T02:36:35+02:00 INFO Non-zero metrics in the last 30s: fetches.system-cpu.events=1 fetches.system-cpu.success=1 fetches.system-diskio.events=6 fetches.system-diskio.success=1 fetches.system-filesystem.events=2 fetches.system-filesystem.success=1 fetches.system-memory.events=1 fetches.system-memory.success=1 fetches.system-network.events=2 fetches.system-network.success=1 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.read_bytes=60 libbeat.logstash.publish.write_bytes=1566 libbeat.logstash.published_and_acked_events=10 libbeat.publisher.messages_in_worker_queues=10 libbeat.publisher.published_events=10

metricbeat config:

###################### Metricbeat Configuration #######################

#==========================  Modules configuration ============================
metricbeat.modules:

#------------------------------- System Module -------------------------------
- module: system
  metricsets:
    - cpu
    - diskio
    - filesystem
    - network
    - memory
  enabled: true
  period: 60s
  filters:
    - drop_event.when.regexp.mount_point: '^/(sys|cgroup|proc|dev|etc|host|run|mnt/home|var/lib/lxcfs)($|/)'
  fields:
    uuid: xxxxx

#================================ Processors ==================================
processors:
- drop_event:
    when:
       regexp:
           system.diskio.name: "fd0|sr0"

#----------------------------- Logstash output --------------------------------
output.logstash:
  # The Logstash hosts
  hosts: ['xx:xx:xx:xx:xxxx']

  # Optional SSL. By default is off.
  # List of root certificates for HTTPS server verifications
  #ssl.certificate_authorities: ['']

  # Certificate for SSL client authentication
  #ssl.certificate: ''

  # Client Certificate Key
  #ssl.key: ''


#================================ Logging =====================================

# Sets log level. The default log level is info.
# Available log levels are: critical, error, warning, info, debug
#logging.level: debug

# At debug level, you can selectively enable logging only for some components.
# To enable all selectors use ["*"]. Examples of other selectors are "beat",
# "publish", "service".
#logging.selectors: ["*"]

td-agent config:

<source>
        @type beats
        metadata_as_tag
</source>
<filter *beat>
        @type record_transformer
        <record>
            tag ${tag}
        </record>
</filter>
<match metricbeat>
        ...
</match>

After looking at the code in an effort to fix it myself, I feel the error handling may need a retry, and I suspect the @thread.join call is what prevents td-agent from restarting properly. However, I'm not that experienced with Ruby, so I may be completely wrong.
While waiting for a response I'll keep trying to fix it and will keep this ticket updated.
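
Roughly what I have in mind, as a sketch only (not the plugin's actual code; the method and variable names here are assumptions based on the stack trace above): retry the read on transient errors, and put a timeout on the thread join so shutdown cannot hang forever.

```ruby
# Sketch only, not fluent-plugin-beats code: method/variable names are assumed.
def read_with_retry(connection, max_retries: 3)
  attempts = 0
  begin
    connection.read_socket                   # blocking read (cf. server.rb:342)
  rescue Errno::EAGAIN, Errno::EWOULDBLOCK   # "Resource temporarily unavailable"
    attempts += 1
    raise if attempts > max_retries
    sleep 0.5                                # brief backoff before retrying
    retry
  rescue Errno::EBADF, Errno::ECONNRESET
    connection.close rescue nil              # drop the dead connection instead of looping on it
  end
end

# A bounded alternative to a bare @thread.join, so a stuck reader thread
# cannot block `service td-agent restart` indefinitely.
def stop(reader_thread)
  reader_thread.join(10) || reader_thread.kill   # wait up to 10 s, then force-stop
end
```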

@martinhynar commented Jun 8, 2018

Hi, I have exactly the same problem, except that there is no network problem and restarting the td-agent service does not help at all. The trace fills the logs right after start. As this continues, td-agent consumes all available memory until the oom_killer comes in and starts killing processes.

My environment:
fluent-plugin-beats 0.1.3
td-agent 1.0.2
OS: Red Hat Enterprise Linux Server release 7.4

@TeunRoefs

Hi @repeatedly ,

Did you have time to look into this? We are still experiencing this issue with fluentd and fluent-plugin-beats v1.0.0. It does not matter how many nodes send metricbeat data to fluentd.

We've got multiple environments that each have the same fluentd setup, and only one of them (an environment consisting mostly of stacked VMs on vCenter) is where we experience the most problems.

It has multiple sites containing between 10 and 300 nodes each, and any of those sites shuts down randomly, always with the exact same issues as described by Thom Koomen.

We're using multiple threads and writing the buffer to files, but nothing seems to help.

Do you have any suggestions as to what we can look at to try and resolve the issue?

Kind regards,

Teun Roefs

@repeatedly (Owner)

@TeunRoefs Which problem happens in your environment?
Is shutdown blocked, or is connection handling not recovered after an error?
fluentd v1.x and beats plugin v1.x don't seem to have the former problem, so it's the latter error, right?

@TeunRoefs

@repeatedly It seems it's the latter error; we've tried multiple versions but the errors seem to persist. We're also looking into another plugin that might be the cause: https://github.com/htgc/fluent-plugin-azureeventhubs

I'll come back to you once I have more information.

@repeatedly (Owner) commented Dec 11, 2018

@TeunRoefs I see. To debug the problem, I want to reproduce this bad fd issue, but it is hard to do on my laptop. So I want to know how to reproduce it.

If closing the bad socket resolves the problem, adding Errno::EBADF to the error list may be a workaround.
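
Something like this, as a minimal sketch (not the actual server.rb code; the method name, buffer size, and the other rescued errors are assumed):

```ruby
# Minimal sketch of the workaround above, not the actual fluent-plugin-beats code.
def read_socket_safely(socket)
  socket.sysread(16_384)                     # blocking read, as at server.rb:342
rescue EOFError, Errno::ECONNRESET, Errno::EPIPE, Errno::EBADF
  # With Errno::EBADF in the rescue list, a socket whose descriptor has already
  # gone bad is closed and abandoned instead of being read again in a tight loop.
  socket.close rescue nil
  nil
end
```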

@TeunRoefs

@repeatedly To reproduce this issue, we created a fluentd-data-collector VM that uses fluent-plugin-beats, and we disabled its connection (by clearing resolv.conf) towards the Azure Event Hub that we're using. This causes Fluentd to 'hang', and metricbeat can no longer send its data to Fluentd. The funny thing is that even metricbeat on localhost cannot send its data to Fluentd anymore.

We then re-enable the network connection (by restoring the values in resolv.conf), which should allow traffic to pass through again; however, this is not the case.
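
For what it's worth, the "Bad file descriptor" error itself can also be provoked locally with a few lines of plain Ruby (illustrative only, not our actual setup): read from an IO whose underlying descriptor was already closed elsewhere, which is what the td-agent log above shows.

```ruby
# Illustration only: provoke Errno::EBADF by reading from an IO whose
# underlying file descriptor was closed through another IO object.
r, _w = IO.pipe
stale = IO.for_fd(r.fileno, autoclose: false)  # second IO object over the same fd
r.close                                        # the descriptor itself is now closed

begin
  stale.sysread(16_384)                        # read(2) on a closed fd
rescue Errno::EBADF => e
  puts "reproduced: #{e.message}"              # => "Bad file descriptor"
end
```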

repeatedly added a commit that referenced this issue Jan 18, 2019
Add more Errno handling in server. ref #6