
Read error looking for ack: EOF #293

Open
robin13 opened this issue Oct 24, 2014 · 95 comments

Comments

@robin13
Contributor

robin13 commented Oct 24, 2014

Since upgrading to [0.3.1 / 0632ce](https://github.com/elasticsearch/logstash-forwarder/commit/0632ce3952fb4e941ec520d9fad05a3e10955dc4) I've been getting this error a lot, but strangely only on the boxes with relatively little activity. The boxes sending hundreds of events per second never hit this error, but on the quieter boxes it looks like this:

2014/10/24 12:02:21.681463 Setting trusted CA from file: /opt/our-configuration/logstash-forwarder/elk-cluster.crt
2014/10/24 12:02:21.683663 Connecting to [10.128.248.146]:5044 (es106.our.domain.com)
2014/10/24 12:02:21.743082 Connected to 10.128.248.146
2014/10/24 12:02:21.745460 Registrar: precessing 6 events
2014/10/24 12:02:31.680960 Read error looking for ack: EOF
2014/10/24 12:02:31.681032 Setting trusted CA from file: /opt/our-configuration/logstash-forwarder/elk-cluster.crt
2014/10/24 12:02:31.683085 Connecting to [10.128.248.149]:5044 (es109.our.domain.com)
2014/10/24 12:02:31.747049 Connected to 10.128.248.149
2014/10/24 12:02:31.752734 Registrar: precessing 6 events
2014/10/24 12:02:41.683922 Registrar: precessing 6 events
2014/10/24 12:02:51.681371 Read error looking for ack: EOF
2014/10/24 12:02:51.681447 Setting trusted CA from file: /opt/our-configuration/logstash-forwarder/elk-cluster.crt
2014/10/24 12:02:51.683464 Connecting to [10.128.248.147]:5044 (es107.our.domain.com)
2014/10/24 12:02:51.743022 Connected to 10.128.248.147
2014/10/24 12:02:51.745627 Registrar: precessing 6 events
2014/10/24 12:03:01.681006 Read error looking for ack: EOF
2014/10/24 12:03:01.681074 Setting trusted CA from file: /opt/our-configuration/logstash-forwarder/elk-cluster.crt
2014/10/24 12:03:01.686820 Connecting to [10.128.248.150]:5044 (es110.our.domain.com)
2014/10/24 12:03:01.751134 Connected to 10.128.248.150
2014/10/24 12:03:01.753258 Registrar: precessing 6 events

As you can see, it always happens exactly 10 seconds after the last log event, even though I've got the timeout set to 30s in logstash-forwarder.conf:

      "timeout": 30

Any ideas what could be going on here?

@driskell
Contributor

How are you connecting to Logstash - it says elk-cluster - is it via AWS ELB or something else? Or just direct?

@robin13
Contributor Author

robin13 commented Oct 24, 2014

Connecting direct to logstash (1.4.2) lumberjack input.
Naming for the certificate is perhaps not optimal... :)

@robin13
Contributor Author

robin13 commented Oct 24, 2014

We're in the same physical network too. I've had read errors before when load was too high and the logstash agents couldn't process events fast enough, but at the moment the cluster is healthy, and this time the oddity is that the machines sending the fewest events are experiencing the error, not the ones pumping out hundreds or thousands of events per second...

@driskell
Contributor

Have you got many servers connecting? It might be the issue where a client that fails to connect due to a certificate problem can randomly cause other clients to be disconnected. And because it retries the connection every second, it randomly throws lots of other clients off.

You can patch for the above using the current version:
https://github.com/elasticsearch/logstash-forwarder/blob/master/lib/lumberjack/server.rb

It sits in logstash/vendor/bundle/jruby/1.9/gems/jls-lumberjack-0.0.20/lib if I remember correctly. The gem hasn't been updated, so Logstash is still using the old one.

If you can, try running Logstash in debug mode; it might tell you what's going on. But I don't think the current gem does any logging - in that case, maybe look at #180, as it will log who's connecting and who's failing.

@robin13
Contributor Author

robin13 commented Oct 24, 2014

Hi Jason,

That patch worked! :) You're the bomb! Beers are on me if we ever meet IRL!
I still have to identify the clients which have invalid certificates - there are ~110 logstash-forwarder agents connecting to 10 logstash agents.

How could we get that patch pushed so it is included in the next logstash package?

Cheers,
-Robin-

@robin13 robin13 closed this as completed Oct 24, 2014
@caseydunham

I am getting this on CentOS 6.4 with a recent update of logstash, and I built logstash-forwarder from master.

Is this related to an issue with the SSL cert on my logstash server?

@jordansissel
Contributor

"Read error looking for ack: EOF" means likely that the remote (logstash,
or whatever is receiving the connection from lsf) closed the connection. By the time lsf gets to sending data and waiting for acks, I'm reasonably certain it has already successfully done a TLS handshake and is not showing you a cert error.


@jmreicha

Seeing this issue as well on CentOS 5.9 using LSF 3.1.1 and Logstash 1.4.2. I think I read there will be a fix for this in the newest release of Logstash? I see that there is a beta release out, would it be worth switching over to this release to correct the issue?

@rabidscorpio

I'm seeing this as well with Ubuntu 14.04, LSF 3.1.1 and Logstash 1.4.2. I upgraded to 1.5.0.beta1 but no luck with that either so @jmreicha I don't know if that's going to change.

@rabidscorpio

I also just tried logstash-forwarder 0.4.0 with logstash 1.4.2, to no avail. (I made a mistake above; it should have been LSF 0.3.1 for the version.)

@driskell
Contributor

@rabidscorpio Can you provide the full LSF log output?

@richardgoetz

I also have this problem -
2015/02/12 15:24:04.401619 Read error looking for ack: read tcp x.x.x.x:5000: i/o timeout

I had thought the issue was the cluster name in elasticsearch.yml (cluster.name: SOMENAME) not matching the one in 30-lumberjack-output.conf:

output {
  elasticsearch {
    host => localhost
    cluster => SOMENAME
  }
  stdout { codec => rubydebug }
}

This is keeping data from loading.

@nornagon

nornagon commented Mar 3, 2015

It seems like LSF 0.4.0 (built from master a week or so ago) is not compatible with Logstash 1.4.2?

@jmreicha

jmreicha commented Mar 5, 2015

I have been troubleshooting this again recently and can see packets from the client (logstash-forwarder) going through the firewall, sending SYN packets endlessly.

Is there anything I can look at on the logstash server side to see why it is closing the connection (or isn't accepting the SYN)?

It is strange - for at least a small period of time, logs are being forwarded correctly.

@tuukkamustonen

I also upgraded to LSF 0.4.0 (with Logstash 1.4.2) and am facing this issue now. Could someone re-open this ticket or is there a new one somewhere?

@brendangibat

I see this issue with Logstash 1.5.0beta1 (also saw it with 1.4.2) and a docker container for LSF from https://registry.hub.docker.com/u/digitalwonderland/logstash-forwarder/

One thing to note is that the connections tend to eventually become stable - however, I have not been able to discern any pattern in which connections succeed, since a client can fail to connect to one logstash server, then attempt to reconnect and join successfully another time.

@jordansissel
Contributor

"read error looking for ack: EOF"

Let me describe what I believe this message to mean:

  • "Looking for ack" - This is the phase of the lumberjack protocol where logstash-forwarder is waiting for a response from Logstash saying, "I got your messages! Send more!"
  • "EOF" - This means that, while waiting for (and instead of receiving) an acknowledgement, the socket connection was terminated. EOF == end of file, meaning the network read was terminated due to a connection closure, probably on the Logstash side (see the sketch below).
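
For illustration, the "looking for ack" phase boils down to roughly the following Go sketch (a minimal sketch, not the actual logstash-forwarder source; readAck and its error handling are simplified here). The lumberjack ACK frame is a version byte '1', a frame type 'A', and a big-endian sequence number; hitting io.EOF while reading it is exactly the "Read error looking for ack: EOF" case.

package lumberjack

import (
    "crypto/tls"
    "encoding/binary"
    "errors"
    "fmt"
    "io"
    "time"
)

// readAck waits for a lumberjack ACK frame (version byte '1', frame type 'A',
// then a big-endian sequence number). io.EOF here means the remote side closed
// the socket while we were waiting for the acknowledgement.
func readAck(conn *tls.Conn, timeout time.Duration) (uint32, error) {
    if err := conn.SetReadDeadline(time.Now().Add(timeout)); err != nil {
        return 0, err
    }
    header := make([]byte, 2)
    if _, err := io.ReadFull(conn, header); err != nil {
        return 0, fmt.Errorf("Read error looking for ack: %v", err)
    }
    if header[0] != '1' || header[1] != 'A' {
        return 0, errors.New("unexpected frame while looking for ack")
    }
    var seq uint32
    if err := binary.Read(conn, binary.BigEndian, &seq); err != nil {
        return 0, fmt.Errorf("Read error looking for ack: %v", err)
    }
    return seq, nil
}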

EXPERIENCING THIS ISSUE?

Please put the following on https://gist.github.com/ and link it here:

  • The last 500 lines of the logstash-forwarder log during the time this problem occurs
  • The last 500 lines of the receiving Logstash agent's log during the time this problem occurs
  • Your logstash config
  • Your logstash-forwarder config
  • What version of logstash
  • What version of logstash-forwarder
  • A brief description of the networking scenario between your two servers, for example: Are you using a proxy? Can you also attach the proxy's configuration?

If you aren't using Logstash 1.4.2 and logstash-forwarder 0.4.0, please upgrade and let me know if you are still having problems.

@jordansissel
Contributor

Reopening -

I need more information to make a determination. Please read my above comment and provide as much information as you can. Thank you!

@jordansissel jordansissel reopened this Mar 6, 2015
@brendangibat

Thanks for getting back to me so quickly! I'm testing updating to rc2 currently, I'll bring back all the info you need once I complete some changes in the configuration for rc2 deployment.

@jmreicha

jmreicha commented Mar 6, 2015

Hey @jordansissel, thanks for reopening. I am going to put what I have so far into the following gist. I will add to it as much as I can.

https://gist.github.com/jmreicha/dc25c3790793ae4a163f

Logstash version is 1.5.0.beta1
logstash-forwarder is/was 0.4.0 built from master a week or so ago.

Everything is running in Docker containers. The logstash stack (ELK) is running on its own server, and the containers running the clients are running in a few different ways. The stuff I am having trouble with is running in Kubernetes, so I don't know if that is maybe partially causing this issue. As far as I know there isn't any proxying. Logs that are being shipped by logstash-forwarder outside of Kubernetes seem to work fine for the most part so far, but I haven't been able to narrow anything down yet.

Here's a few things I've tested so far:

  • I am able to hit the public logstash server ports from inside the kubernetes containers with netcat.
  • If I run just one container in Kubernetes with the logstash-forwarder shipping logs, things seem to work for the most part. But after I add in another client then that is when I start seeing the connection errors.
  • I have tried running log-courier with the same results inside of Kubernetes. One container running the client works but after adding in another, it starts reporting connection errors.

Definitely let me know what other details you need. I have been playing around with tcpdump but don't know if any of its output would be useful or not.

@jordansissel
Contributor

@jmreicha we fixed a bug on the lumberjack input in 1.5.0 RC2, btw. Not sure if it's your bug, but may be.

@jordansissel
Contributor

running in Kubernetes

I have two hypotheses at this time:

  1. A bug in the lumberjack input causing connections to be terminated silently in Logstash. This would manifest as an EOF on the logstash-forwarder side.
  2. A network configuration problem causing connections to be interrupted or terminated (a stateful firewall with a low idle timeout, or something similar; see the keepalive sketch below).
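
If hypothesis 2 is in play, one common client-side mitigation is TCP keepalive, so an otherwise idle connection still generates traffic that a stateful firewall can see. A minimal Go sketch, assuming you control the dialing code (illustrative only, not necessarily what logstash-forwarder does):

package dialer

import (
    "net"
    "time"
)

// dialWithKeepalive opens a TCP connection and enables kernel keepalive probes,
// so a firewall with a short idle timeout is less likely to silently drop the
// connection between payloads.
func dialWithKeepalive(addr string) (net.Conn, error) {
    conn, err := net.DialTimeout("tcp", addr, 10*time.Second)
    if err != nil {
        return nil, err
    }
    if tcp, ok := conn.(*net.TCPConn); ok {
        // Errors are ignored in this sketch; a real client should handle them.
        tcp.SetKeepAlive(true)
        tcp.SetKeepAlivePeriod(30 * time.Second)
    }
    return conn, nil
}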

@brendangibat

@jordansissel For me bumping up to 1.5.0rc2 seems to have corrected the issue; I've not seen the issue repeat today since updating and using the new logstash plugins.

@jmreicha

jmreicha commented Mar 6, 2015

Oh sheesh, I didn't realize there was a new release out :) I will get that put in place and see if it helps.

The extent of my Kubernetes networking knowledge is limited. I know it uses a network overlay (flannel) for distributed containers to communicate, and I also know it does some things in the background with iptables; I'm just not sure exactly what.

Interestingly enough, with tcpdump running I can see logstash-forwarder traffic coming in to the logstash server.

Maybe my issue is a combination of the two?

@jordansissel
Contributor

I can see logstash-forwarder traffic coming in

When logstash-forwarder says things like Registrar: precessing 6 events, it means that it has successfully sent 6 events to Logstash and Logstash has acknowledged all 6 successfully. It means your network connectivity worked for one round trip of a log payload.

If you see these messages, and then later see "Read error looking for ack", it means something interrupted the previously-healthy connection.

@jordansissel
Contributor

1.5.0rc2 seems to have corrected the issue

@brendangibat This is very exciting. Thank you for the report!

@jmreicha

jmreicha commented Mar 6, 2015

Ah that makes sense, thanks for the clarification.

@driskell
Contributor

Ah, interesting. The json codec decode actually succeeds - it takes the float at the beginning of the line and returns that, rather than failing. This creates an event that is not a set of key-value pairs; it is just a float.

This should be raised in https://github.com/logstash-plugins/logstash-codec-json I think - it really should check that the decoded result is a set of key-value pairs; otherwise it crashes the input, when it should really just mark the event as failed to decode.
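
To make that failure mode concrete, here is a rough analogy in Go (using encoding/json, not the JRuby codec Logstash actually runs) of how a lenient streaming decoder "successfully" decodes a line that merely starts with a number, yielding a bare float instead of a key-value object:

package main

import (
    "encoding/json"
    "fmt"
    "strings"
)

func main() {
    // Not JSON, but it happens to start with a number.
    line := "3 not actually json"

    var v interface{}
    dec := json.NewDecoder(strings.NewReader(line))
    // Decode reads just the first JSON value ("3") and stops, so it succeeds
    // and returns a float64 -- an "event" that is not a map of fields at all.
    if err := dec.Decode(&v); err != nil {
        fmt.Println("decode failed:", err)
        return
    }
    if _, ok := v.(map[string]interface{}); !ok {
        fmt.Printf("decoded %T (%v), not a key-value object\n", v, v)
    }
}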

@vqng

vqng commented Jun 25, 2015

Correct me if I am wrong, but I thought that, based on the configs, only the grok pattern decoder should be invoked, not the json decoder?
Anyway, I raised an issue at the other repo. Thank you.

@TristanMatthews

I seem to be running into this same problem; I followed the same DO tutorial as naviehuynh and have almost exactly the same config file as his from the 22nd.

Are other people able to reproduce this error consistently? I'm seeing it as an intermittent problem where everything works fine and then it breaks. Non-JSON log statements seem to cause the EOF, but most of the time it picks up and starts processing again; only sometimes does it go into the repeating error state that strzelecki-maciek reported.

@driskell
Contributor

I would say don't use the codec if some lines are not JSON. Use the filter and put a conditional on it to ensure you only decode valid JSON logs. This is my own method with Log Courier.

Though I have heard people say the filter has issues. But with the plugin separation, hopefully fixing the issues is faster, and it's so easy to update individual filters in 1.5.
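
As a rough sketch of that "only decode lines that really are JSON objects" idea - written in Go with an invented helper, not actual Logstash filter configuration:

package main

import (
    "encoding/json"
    "fmt"
)

// asJSONObject returns the parsed fields only if the whole line is a valid
// JSON object (key-value pairs); anything else is left untouched.
func asJSONObject(line string) (map[string]interface{}, bool) {
    if len(line) == 0 || line[0] != '{' {
        return nil, false // cheap pre-filter, like a conditional on the line shape
    }
    var fields map[string]interface{}
    if err := json.Unmarshal([]byte(line), &fields); err != nil {
        return nil, false
    }
    return fields, true
}

func main() {
    for _, line := range []string{`{"level":"info","msg":"ok"}`, "3 not json at all"} {
        if fields, ok := asJSONObject(line); ok {
            fmt.Println("json event:", fields)
        } else {
            fmt.Println("plain event:", line)
        }
    }
}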

@TristanMatthews

@driskell Thanks, I'll try that. Is your filter actually validating the JSON? Or just filtering for files that should have valid JSON?

I found the issue because I had a log file with corrupted text (not sure how, but there was weird corrupted text in the middle of a Python traceback in the log), so I can't assume that I will always have valid JSON in my logs.

On Friday I was seeing the issue intermittently. Over the weekend, with the corrupted log file, I could put the system into the EOF loop state every time I moved a copy of the file into the path, but my system seems to be in a different state today and only sometimes goes into the EOF loop. Just so that I understand, do logs like:

Jun 29 06:20:06 ip-10-0-0-152 logstash-forwarder[16258]: 2015/06/29 06:20:06.542950 Read error looking for ack: EOF
Jun 29 06:20:06 ip-10-0-0-152 logstash-forwarder[16258]: 2015/06/29 06:20:06.543015 Setting trusted CA from file: /etc/pki/tls/certs/logstash-forwarder.crt
Jun 29 06:20:06 ip-10-0-0-152 logstash-forwarder[16258]: 2015/06/29 06:20:06.586836 Connecting to [X.X.X.X]:5000 (logs.example.com)
Jun 29 06:20:06 ip-10-0-0-152 logstash-forwarder[16258]: 2015/06/29 06:20:06.644742 Connected to X.X.X.X

indicate that something went wrong on the logstash server and the connection was closed, and that a properly working system should never see that, even if you try to ingest a corrupted file?

@driskell
Contributor

EOF usually means a network problem or a crashed input. Checking the Logstash error log will tell you which it was, as it will normally log why it failed.

@frutik

frutik commented Jul 14, 2015

I had the same "Read error looking for ack: EOF" issue. I've started logstash in debug mode

/opt/logstash/bin/logstash agent -f /etc/logstash/conf.d  --debug

And I noticed that logstash could not connect to ES using the default protocol.

Changing it to http resolved my issue

     elasticsearch {
             host => localhost
             index => 'logstash-%{+YYYY.MM}'
+            protocol => http
         }

So, this is probably one of the possible causes of this famous issue.

@pmoust
Member

pmoust commented Jul 29, 2015

I can confirm that explicitly declaring a protocol for the elasticsearch output resolved this for us.

@strzelecki-maciek

[endless loop of]
2015/08/05 13:58:42.609958 Connected to 10.20.0.249
2015/08/05 13:58:42.611478 Read error looking for ack: EOF
2015/08/05 13:58:42.611615 Setting trusted CA from file: /etc/pki/tls/certs/logstash-forwarder.crt
2015/08/05 13:58:42.612001 Connecting to [10.20.0.249]:5517 (10.20.0.249)
[...]

after updating the plugin (bin/plugin update logstash-input-lumberjack)

In the logstash logs (while logstash-forwarder <--> logstash constantly disconnects) I can see:

{:timestamp=>"2015-08-05T13:58:42.173000+0200", :message=>"Exception in lumberjack input thread", :exception=>#<ArgumentError: comparison of String with 1 failed>, :level=>:error}

This only happens for redis and redis sentinel logs. Every other logfile works fine.

@strzelecki-maciek

On the logstash side, removing the json codec fixed the issue.

@kuduk

kuduk commented Aug 6, 2015

Same problem with nxlog on Windows. I resolved it by adding a newline to the certificate.

@suyograo suyograo added the bug label Aug 10, 2015
@ph
Contributor

ph commented Aug 10, 2015

Some of the comments here (the recent ones) make me believe it's related to #293.

@rejia

rejia commented Aug 14, 2015

Logstash-forwarder: 0.4.0 (client on diff server, collecting logs)
Logstash: 1.5.3 (dedicated server)
Elasticsearch: 1.7.1 (on same server as Logstash)
OS: RHEL 6.6
Java: 1.8

2015/08/14 12:28:29.137179 Socket error, will reconnect: write tcp 10.200.106.101:5043: broken pipe
2015/08/14 12:28:30.137494 Setting trusted CA from file: /etc/pki/tls/certs/logstash-forwarder.crt
2015/08/14 12:28:30.137864 Connecting to [10.200.106.101]:5043 (server107)
2015/08/14 12:28:45.138399 Failed to tls handshake with 10.200.106.101 read tcp 10.200.106.101:5043: i/o timeout
2015/08/14 12:28:46.138739 Connecting to [10.200.106.101]:5043 (server107)
2015/08/14 12:29:00.253962 Connected to 10.200.106.101
2015/08/14 12:29:00.269056 Registrar: processing 1024 events
....
2015/08/14 12:29:06.067480 Registrar: processing 1024 events
2015/08/14 12:29:14.601041 Read error looking for ack: EOF
2015/08/14 12:29:14.601172 Setting trusted CA from file: /etc/pki/tls/certs/logstash-forwarder.crt
2015/08/14 12:29:14.601588 Connecting to [10.200.106.101]:5043 (server107)
2015/08/14 12:29:29.602059 Failed to tls handshake with 10.200.106.101 read tcp 10.200.106.101:5043: i/o timeout

I had the same issue with all the same error messages on the forwarder and indexer. I was able to verify that there were no firewall issues and could see that the TCP connections were working... checked access... tried stopping Logstash, but it wouldn't shut down... finally killed it :-( ... tried bringing it back up... it came up, but with the same issue, all the "broken pipe" and other connection errors...

After reading github issues 415, 416, and 293, and hearing that some folks had success, I thought I would give it a try. Although not the best option, I stopped all services/connections to Logstash, stopped all the logstash-forwarder instances, stopped Elasticsearch... THEN :-) brought back Logstash first, then all the other services... NO further errors in either of the logs, and Logstash + logstash-forwarder is working fine. So is Elasticsearch, and I can see the messages in Kibana.

@rejia

rejia commented Aug 20, 2015

Back to square one... the moment I touch anything in Logstash, the "broken pipe" happens and I cannot shut down Logstash gracefully:

[root@sever107 software]# service logstash stop
Killing logstash (pid 22234) with SIGTERM
Waiting logstash (pid 22234) to die...
Waiting logstash (pid 22234) to die...
Waiting logstash (pid 22234) to die...
Waiting logstash (pid 22234) to die...
Waiting logstash (pid 22234) to die...
logstash stop failed; still running.

@izenk

izenk commented Sep 8, 2015

Hi.
I get this issue with
logstash 1.5.4
logstash-forwarder 0.4.0

Env is:
logstash and logstash-forwarder are on the same VM
OS ubuntu 14.04.2

input {
  stdin {}
  lumberjack {
    port => 5043
    ssl_certificate => "/etc/ssl/certs/server.crt"
    ssl_key => "/etc/ssl/private/server.key"
    codec => rubydebug {
      metadata => true
    }
  }
}

output {
  elasticsearch {
    host => localhost
    cluster => "log"
    index => "logstash-test-%{+YYYY.MM.dd}"
    workers => 1
  }
}

logstash-forwarder config:

{
  "network": {
    "servers": [ "my_server:5043" ],
    "ssl ca": "/etc/ssl/certs/server.ca.crt",
    "timeout": 15
  },
  "files": [
    {
      "paths": [ "-" ]
    }
  ]
}

logstash-forwarder starts normally:

2015/09/08 18:00:59.490129 --- options -------
2015/09/08 18:00:59.490280 config-arg: ./logstash-forwarder.conf
2015/09/08 18:00:59.490337 idle-timeout: 5s
2015/09/08 18:00:59.490397 spool-size: 1024
2015/09/08 18:00:59.490425 harvester-buff-size: 16384
2015/09/08 18:00:59.490458 --- flags ---------
2015/09/08 18:00:59.490540 tail (on-rotation): false
2015/09/08 18:00:59.490588 log-to-syslog: false
2015/09/08 18:00:59.490608 quiet: false
2015/09/08 18:00:59.490673 {
"network": {
"servers": [ "my_server:5043" ],
"ssl ca": "/etc/ssl/certs/server.ca.crt",

"timeout": 15

},

"files": [
{
"paths": [ "-" ]
}
]
}

2015/09/08 18:00:59.491422 Waiting for 1 prospectors to initialise
2015/09/08 18:00:59.491510 harvest: "-" (offset snapshot:0)
2015/09/08 18:00:59.491768 All prospectors initialised with 0 states to persist
2015/09/08 18:00:59.492290 Setting trusted CA from file: /etc/ssl/certs/server.ca.crt
2015/09/08 18:00:59.493863 Connecting to [10.1.199.116]:5043 (<my_server>)
2015/09/08 18:00:59.651542 Connected to 10.1.199.116

After that (due to the STDIN input) I try to enter some text and the error appears:
test

2015/09/08 18:02:59.513948 Read error looking for ack: EOF
2015/09/08 18:02:59.514135 Setting trusted CA from file: /etc/ssl/certs/server.ca.crt
2015/09/08 18:02:59.514895 Connecting to [10.1.199.116]:5043 (my_server)
2015/09/08 18:02:59.587461 Connected to 10.1.199.116
2015/09/08 18:02:59.592947 Read error looking for ack: EOF
2015/09/08 18:02:59.593078 Setting trusted CA from file: /etc/ssl/certs/server.ca.crt
2015/09/08 18:02:59.593807 Connecting to [10.1.199.116]:5043 (my_server)
2015/09/08 18:02:59.663463 Connected to 10.1.199.116
2015/09/08 18:02:59.670076 Read error looking for ack: EOF

In Logstash at the same time I see errors too:
config LogStash::Codecs::RubyDebug/@metadata = true {:level=>:debug, :file=>"logstash/config/mixin.rb", :line=>"111", :method=>"config_init"}
Lumberjack input: unhandled exception {:exception=>#<RuntimeError: Not implemented>, :backtrace=>[
  "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-codec-rubydebug-1.0.0/lib/logstash/codecs/rubydebug.rb:24:in `decode'",
  "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-lumberjack-1.0.5/lib/logstash/inputs/lumberjack.rb:77:in `run'",
  "org/jruby/RubyProc.java:271:in `call'",
  "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-lumberjack-1.0.5/lib/logstash/inputs/lumberjack.rb:105:in `invoke'",
  "org/jruby/RubyProc.java:271:in `call'",
  "/opt/logstash/vendor/bundle/jruby/1.9/gems/jls-lumberjack-0.0.24/lib/lumberjack/server.rb:264:in `data'",
  "/opt/logstash/vendor/bundle/jruby/1.9/gems/jls-lumberjack-0.0.24/lib/lumberjack/server.rb:246:in `read_socket'",
  "/opt/logstash/vendor/bundle/jruby/1.9/gems/jls-lumberjack-0.0.24/lib/lumberjack/server.rb:190:in `data_field_value'",
  "/opt/logstash/vendor/bundle/jruby/1.9/gems/jls-lumberjack-0.0.24/lib/lumberjack/server.rb:101:in `feed'",
  "/opt/logstash/vendor/bundle/jruby/1.9/gems/jls-lumberjack-0.0.24/lib/lumberjack/server.rb:206:in `compressed_payload'",
  "/opt/logstash/vendor/bundle/jruby/1.9/gems/jls-lumberjack-0.0.24/lib/lumberjack/server.rb:101:in `feed'",
  "/opt/logstash/vendor/bundle/jruby/1.9/gems/jls-lumberjack-0.0.24/lib/lumberjack/server.rb:239:in `read_socket'",
  "/opt/logstash/vendor/bundle/jruby/1.9/gems/jls-lumberjack-0.0.24/lib/lumberjack/server.rb:224:in `run'",
  "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-lumberjack-1.0.5/lib/logstash/inputs/lumberjack.rb:104:in `invoke'",
  "org/jruby/RubyProc.java:271:in `call'",
  "/opt/logstash/vendor/bundle/jruby/1.9/gems/concurrent-ruby-0.9.1-java/lib/concurrent/executor/executor_service.rb:515:in `run'",
  "Concurrent$$JavaExecutorService$$Job_1733551258.gen:13:in `run'"
], :level=>:error, :file=>"logstash/inputs/lumberjack.rb", :line=>"117", :method=>"invoke"}

About my network:
I get this error when running both LS and LSF on the same VM.
my_server is pointed to its IP through the /etc/hosts file.
I also get this error when running LS and LSF on different VMs.

@ph
Contributor

ph commented Sep 8, 2015

Can you try using the rubydebug codec in a stdout output instead of the lumberjack input?
This config might help:

input {
  stdin {}
  lumberjack {
    port => 5043
    ssl_certificate => "/etc/ssl/certs/server.crt"
    ssl_key => "/etc/ssl/private/server.key"
  }
}

output {
 stdout { 
     codec => rubydebug { metadata => true  }
 }
  elasticsearch {
    host => "localhost"
    cluster => "log"
    index => "logstash-test-%{+YYYY.MM.dd}"
    workers => 1
  }
}

@izenk

izenk commented Sep 8, 2015

Yes, it works. The problem is that the rubydebug codec does not implement the decode method, and that method is called when the codec is added to an input section.
Thank you.

@Salmani

Salmani commented Sep 29, 2015

Hi All,

FYI for anyone potentially googling to this bug (like me):

In my case, with the error "Read error looking for ack: EOF" in the logstash-forwarder log, there was no bug in the code; I was getting this error because the logstash server was not able to connect to the output elasticsearch servers, due to a configuration error in logstash/ES/firewall on my part. Judging from the logs, this prevented Logstash from flushing events to the output server and kept disconnecting the logstash-forwarder client, with errors logged in logstash.log and the logstash-forwarder log.

This error in the logstash logs pointed to the real problem:

":timestamp=>"2015-09-29T04:04:35.592000+0000", :message=>"Got error to send bulk of actions: connect timed out", :level=>:error}
{:timestamp=>"2015-09-29T04:04:35.592000+0000", :message=>"Failed to flush outgoing items"

Once I fixed the elasticsearch server settings in my logstash config (+ ES + firewall), it started to work fine.

LogStash : 1.5.4
Logstash Forwarder : 0.4.0
ES Version : 1.7.2

Forwarder Log ****
2015/09/29 14:03:16.495311 Connected to x.x.x.x
2015/09/29 14:03:21.499879 Read error looking for ack: EOF
2015/09/29 14:03:21.499981 Setting trusted CA from file: x.crt
2015/09/29 14:03:21.513262 Connecting to x.x.x.x
2015/09/29 14:03:21.571949 Connected to x.x.x.x
2015/09/29 14:03:21.574897 Read error looking for ack: EOF
Forwarder Log ****

LogStash Log ****
{:timestamp=>"2015-09-29T04:04:13.661000+0000", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}
{:timestamp=>"2015-09-29T04:04:14.162000+0000", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}
{:timestamp=>"2015-09-29T04:04:14.662000+0000", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}
LogStash Log ****

Thanks,

Salman

@jamesblackburn

I've managed to reproduce this reliably with logstash-forwarder 0.4.0 and logstash 2.1.1:
https://github.com/jamesblackburn/logstash-issue/blob/master/logstash-issue.zip

What I've seen is that copy-truncate can produce a very large Java GC log.

This log starts with a bunch of NULL bytes. Running logstash-forwarder on it fails to make any progress:

Logstash

$ LOGSTASH_VERSION=1.5.5 logstash --config=./logstash-receiver.conf 
Logstash startup completed

Logstash-forwarder

$ /apps/research/tools/logstash-forwarder/0.4.0/logstash-forwarder -config=logstash-forwarder.conf 
2015/12/11 09:46:20.716506      --- options -------
2015/12/11 09:46:20.716600      config-arg:          logstash-forwarder.conf
2015/12/11 09:46:20.716617      idle-timeout:        5s
2015/12/11 09:46:20.716623      spool-size:          1024
2015/12/11 09:46:20.716629      harvester-buff-size: 16384
2015/12/11 09:46:20.716633      --- flags ---------
2015/12/11 09:46:20.716638      tail (on-rotation):  false
2015/12/11 09:46:20.716643      log-to-syslog:          false
2015/12/11 09:46:20.716648      quiet:             false
2015/12/11 09:46:20.716976 {
  "network": {
    "servers": [ "logs.dev.man.com:8999"
                  ],
    "timeout": 30,
    "ssl key": "/ahlinfra/monit/bin/../conf/logs.dev.man.com/../common/logs.dev.man.com.key",
    "ssl ca": "/ahlinfra/monit/bin/../conf/logs.dev.man.com/../common/logs.dev.man.com.crt"
  },

  "files": [
    {
      "paths": [ "/users/is/jblackburn/logstash-issue/logs/*.log"
               ],
      "fields": { "type": "lumberjack", "service": "%%SVC_HOST%%" }
    }
  ]
}

2015/12/11 09:46:20.717926 Waiting for 1 prospectors to initialise
2015/12/11 09:46:20.718315 Launching harvester on new file: /users/is/jblackburn/logstash-issue/logs/elasticsearch-gc.log
2015/12/11 09:46:20.718714 harvest: "/users/is/jblackburn/logstash-issue/logs/elasticsearch-gc.log" (offset snapshot:0)
2015/12/11 09:46:20.729481 All prospectors initialised with 0 states to persist
2015/12/11 09:46:20.729652 Setting trusted CA from file: /ahlinfra/monit/bin/../conf/logs.dev.man.com/../common/logs.dev.man.com.crt
2015/12/11 09:46:20.731115 Connecting to [10.220.53.55]:8999 (logs.dev.man.com) 
2015/12/11 09:46:21.105701 Connected to 10.220.53.55
2015/12/11 09:46:36.549103 Read error looking for ack: EOF
2015/12/11 09:46:36.549171 Setting trusted CA from file: /ahlinfra/monit/bin/../conf/logs.dev.man.com/../common/logs.dev.man.com.crt
2015/12/11 09:46:36.549920 Connecting to [10.220.53.55]:8999 (logs.dev.man.com) 
2015/12/11 09:46:36.634797 Connected to 10.220.53.55
2015/12/11 09:46:40.746815 Read error looking for ack: EOF
2015/12/11 09:46:40.746883 Setting trusted CA from file: /ahlinfra/monit/bin/../conf/logs.dev.man.com/../common/logs.dev.man.com.crt
2015/12/11 09:46:40.747473 Connecting to [10.220.53.55]:8999 (logs.dev.man.com) 
2015/12/11 09:46:40.888935 Connected to 10.220.53.55
2015/12/11 09:46:44.560995 Read error looking for ack: EOF
2015/12/11 09:46:44.561062 Setting trusted CA from file: /ahlinfra/monit/bin/../conf/logs.dev.man.com/../common/logs.dev.man.com.crt
2015/12/11 09:46:44.563795 Connecting to [10.220.53.55]:8999 (logs.dev.man.com) 
2015/12/11 09:46:44.697986 Connected to 10.220.53.55
2015/12/11 09:46:48.477937 Read error looking for ack: EOF
2015/12/11 09:46:48.478011 Setting trusted CA from file: /ahlinfra/monit/bin/../conf/logs.dev.man.com/../common/logs.dev.man.com.crt
2015/12/11 09:46:48.478623 Connecting to [10.220.53.55]:8999 (logs.dev.man.com) 
2015/12/11 09:46:48.616089 Connected to 10.220.53.55
2015/12/11 09:46:52.778987 Read error looking for ack: EOF
2015/12/11 09:46:52.779054 Setting trusted CA from file: /ahlinfra/monit/bin/../conf/logs.dev.man.com/../common/logs.dev.man.com.crt
2015/12/11 09:46:52.779699 Connecting to [10.220.53.55]:8999 (logs.dev.man.com) 
2015/12/11 09:46:52.910973 Connected to 10.220.53.55
2015/12/11 09:46:57.065908 Read error looking for ack: EOF
2015/12/11 09:46:57.065978 Setting trusted CA from file: /ahlinfra/monit/bin/../conf/logs.dev.man.com/../common/logs.dev.man.com.crt
2015/12/11 09:46:57.066617 Connecting to [10.220.53.55]:8999 (logs.dev.man.com) 
2015/12/11 09:46:57.198152 Connected to 10.220.53.55
2015/12/11 09:47:01.254964 Read error looking for ack: EOF
2015/12/11 09:47:01.255034 Setting trusted CA from file: /ahlinfra/monit/bin/../conf/logs.dev.man.com/../common/logs.dev.man.com.crt
2015/12/11 09:47:01.255556 Connecting to [10.220.53.55]:8999 (logs.dev.man.com) 
2015/12/11 09:47:01.391207 Connected to 10.220.53.55
2015/12/11 09:47:05.499504 Read error looking for ack: EOF
2015/12/11 09:47:05.499574 Setting trusted CA from file: /ahlinfra/monit/bin/../conf/logs.dev.man.com/../common/logs.dev.man.com.crt
2015/12/11 09:47:05.500155 Connecting to [10.220.53.55]:8999 (logs.dev.man.com) 
2015/12/11 09:47:05.589293 Connected to 10.220.53.55
2015/12/11 09:47:09.039561 Read error looking for ack: EOF
2015/12/11 09:47:09.039629 Setting trusted CA from file: /ahlinfra/monit/bin/../conf/logs.dev.man.com/../common/logs.dev.man.com.crt
2015/12/11 09:47:09.040237 Connecting to [10.220.53.55]:8999 (logs.dev.man.com) 
2015/12/11 09:47:09.169342 Connected to 10.220.53.55
2015/12/11 09:47:13.254280 Read error looking for ack: EOF
2015/12/11 09:47:13.254347 Setting trusted CA from file: /ahlinfra/monit/bin/../conf/logs.dev.man.com/../common/logs.dev.man.com.crt
2015/12/11 09:47:13.254948 Connecting to [10.220.53.55]:8999 (logs.dev.man.com) 
2015/12/11 09:47:13.384358 Connected to 10.220.53.55
2015/12/11 09:47:17.729627 Read error looking for ack: EOF
2015/12/11 09:47:17.729698 Setting trusted CA from file: /ahlinfra/monit/bin/../conf/logs.dev.man.com/../common/logs.dev.man.com.crt
2015/12/11 09:47:17.730253 Connecting to [10.220.53.55]:8999 (logs.dev.man.com) 
2015/12/11 09:47:17.860438 Connected to 10.220.53.55
2015/12/11 09:47:22.081924 Read error looking for ack: EOF
2015/12/11 09:47:22.081992 Setting trusted CA from file: /ahlinfra/monit/bin/../conf/logs.dev.man.com/../common/logs.dev.man.com.crt
2015/12/11 09:47:22.082687 Connecting to [10.220.53.55]:8999 (logs.dev.man.com) 
2015/12/11 09:47:22.211463 Connected to 10.220.53.55

@ruflin
Member

ruflin commented Dec 27, 2015

@jamesblackburn Interesting issue. I didn't test it yet, but I assume this issue also exists in filebeat: https://github.com/elastic/beats/tree/master/filebeat On the one hand it seems like a log rotation issue, but on the other hand the log crawler should be able to "overcome" it. Did you find a solution to the problem, or is it still an issue? If it is, would you mind testing it with filebeat and, if it occurs there too, opening an issue in that repo?

@jamesblackburn

Thanks @ruflin

It's still an issue with these files. The workaround is to disable logrotate on the Java GC out files.

The biggest issue we have is actually on the jruby logstash end. I'm using exactly the config above - the receiver just pipes data directly to files, no filtering or parsing. Unfortunately the receivers are very heavyweight, and can't keep up with the write load. I've got 25 hosts with forwarders, and I'm running 8 receivers. The problems I've seen:

  • if there are too few receivers, then you end up exhausting the number of threads as forwarders connect, time out and reconnect. In the meantime the Java process leaks threads for a long period of time
  • receiving input and output are single-threaded
  • the receivers need to be restarted periodically due to some leak, otherwise they just grind to a halt eventually

At this point I'm not sure that Logstash scales; I'm considering Heka, which claims to handle 10Gb/s/node(!): https://blog.mozilla.org/services/2013/04/30/introducing-heka/
Looking at the history on my receiver box, it's only doing 3MB/s (~20Mb/s), which is orders of magnitude less.

@ruflin
Member

ruflin commented Jan 12, 2016

@jamesblackburn As you mentioned yourself, it looks more like a Logstash issue. Perhaps it's better to start a discussion here: https://discuss.elastic.co/c/logstash

@ph
Contributor

ph commented Jan 12, 2016

He created this issue elastic/logstash#4333
And since the test case is provided, I will check what I can do to improve the situation. (The log is pretty big!)

In the mean time the java process leaks threads for a long period of time
receiving input and output are single threaded

The receiving input is multi-threaded; only the TCP accept loop is not - each connection has its own thread. But you are correct that the output to file is single-threaded.

@jamesboorn

I have hit the dreaded 'Read error looking for ack: EOF' and 'Connecting to' loop of death. We tracked it down to what looks like a parse error on the logstash side caused by a log message that begins with a number followed by whitespace.

I reproduced the issue with a simple logstash setup (lumberjack input and file output, no filtering and no codec specified). When a line in the file logstash-forwarder is watching begins with a number followed by whitespace, the non-recoverable error condition occurs.

For example, this will cause the error to occur:
echo '3 you are going down' >> file_forwarder_is_watching
This will not cause the error to occur:
echo '3you are still up' >> file_forwarder_is_watching

Using logstash-1.5.4 and logstash-forwarder-0.4.0.
