timeout is buggy in JRuby #58

Closed
HoneyryderChuck opened this issue Jan 27, 2016 · 48 comments
Comments

@HoneyryderChuck

First I thought it was the celluloid-io sockets, but then I realized that in JRuby, timeouts are not always triggered. Here is a simple script:

require 'time'
require 'celluloid/io'


class Tester
  include Celluloid::IO
  def initialize
    @sock = TCPSocket.new("localhost", 80)
  end

  def test_timeout
    begin
      timeout(2) do
         # option 1
         # @sock.wait_readable
         # option 2
         # sleep 3
         # option 3 
         loop do
            sleep 1
         end
      end
    rescue Celluloid::Task::TimeoutError
      puts "successfull timeout fetch: #{Time.now}"
      retry
    ensure
      puts "#{$!.class} #{$!.message}" if $!
    end
  end

end


puts "starting the actor..."
t = Tester.new
t.async(:test_timeout)
sleep

# Option 1
# Both MRI 2.1 and JRuby 9.0.5.0
....
successfull timeout fetch: 2016-01-27 18:50:47 +0100
successfull timeout fetch: 2016-01-27 18:50:49 +0100
successfull timeout fetch: 2016-01-27 18:50:51 +0100
successfull timeout fetch: 2016-01-27 18:50:53 +0100
successfull timeout fetch: 2016-01-27 18:50:55 +0100
successfull timeout fetch: 2016-01-27 18:50:57 +0100
successfull timeout fetch: 2016-01-27 18:50:59 +0100
successfull timeout fetch: 2016-01-27 18:51:01 +0100


(hangs at some point...)


# Option 2
# MRI 2.1 and JRuby 9.0.5.0
starting the actor...
successfull timeout fetch: 2016-01-27 18:49:12 +0100
W, [2016-01-27T18:49:15.799990 #328595]  WARN -- : Attempted to resume a dead task: type=:call, meta={:dangerous_suspend=>false, :method_name=>:test_timeout}, status=:dead

Option 3 works in MRI (apparently), but also hangs at some point.

This can seriously hinder celluloid/celluloid-io#170.

@tarcieri
Contributor

Yeah, this has been reported several times...

Any ideas @ioquatix?

@ioquatix
Member

I will investigate the issue. Is the above test case failing on MRI 2.3?

@HoneyryderChuck
Author

I didn't test with all possible MRI versions. I'll continue testing today and come back with more results.

@HoneyryderChuck
Author

Maybe this is old news, but I'll state it nonetheless: replacing Celluloid::IO with a plain Celluloid actor seems to fix the issues. My bet is now on the evented mailbox.

@HoneyryderChuck
Author

I've been investigating option 3, which consistently fails in JRuby. Something's wrong with the timeout...

I've added a few log messages before the reactor selects, and also inside the JRuby extension, before select is called on the NIO channel. Here's an example of a successful select:

# on celluloid/io/reactor.rb:59
selecting for 0.9849073000000006 seconds...   
# on ext/nio4r/org/nio4r/Nio4r.java:291
timing out for 984.9073000000006 milliseconds...

Here's where it gets blocked:

selecting for 0.0009740119999985808 seconds...
timing out for 0.9740119999985808 milliseconds

The doSelect() method has a condition stating that if the timeout is 0, selectNow() is called on the NIO channel.

I've patched it by calling selectNow() when the timeout is smaller than 1 millisecond, and it doesn't hang anymore:

/* inside doSelect(), line 290 */
long tim = (long)(t * 1000);
if (tim < 1) {
    result = this.selector.selectNow();
} else {
    result = this.selector.select(tim);
}
...

Would this be an acceptable patch?

I think the real "problem" still comes from the timers, as the interval triggering a message handling is probably so small that the Java runtime can't handle it properly. Or are Java NIO channels not able to handle that kind of precision?

@ioquatix
Member

This could be an okay patch, but I have a request: could you print out the values of tim where the bug happens? Is the bug also solved if you change the condition to (tim <= 0)?

@HoneyryderChuck
Author

@ioquatix I just did:

timing out for 0.9740119999985808 milliseconds

I'm digging a bit more; the problem comes from Celluloid::Actor#run, particularly the wait block (line 157).

Timers passes the wait_interval to the block, and this is used to wait for a message from the mailbox. This wait_interval can apparently be very small! I've just tracked it:

# these are okay!
checking message: 0.9983331230000001 seconds
checking message: 0.9967141259999996 seconds
# this one is not! that's your block!
checking message: 7.559099999987495e-05 seconds 

that's some precise shit!

Were select system calls ever able to deal with such an interval? Is this just the java way?

@ioquatix
Member

I wrote the code which generates those intervals, but it was a while ago. Yes, it's possible to generate small intervals.

Did you try with the condition I suggested?

Are you referring to this code in Nio4r?

https://github.com/celluloid/nio4r/blob/8cc48508a7cf3bb8497ba65f15febd691cf5d2a6/ext/nio4r/org/nio4r/Nio4r.java#L284-L291

That code is a bit different from what you quoted above. It uses double rather than long, so it avoids rounding issues and handles negative values correctly.

@ioquatix
Member

Can you link me to the code you are looking at w.r.t. wait_interval? Thanks :)

@ioquatix
Member

Just FYI, in some cases timers will provide a negative wait interval if a timer should already have fired (it was late): https://github.com/celluloid/timers/blob/41145ed260e4f061c6b93604c5e0ba86192f37af/lib/timers/group.rb#L81-L90 but I'm not sure whether this is used in the code you are talking about.
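
For illustration, a minimal Ruby sketch (hypothetical, not the actual Celluloid code) of how a caller might normalise the interval Timers yields before handing it to a selector:

# Hypothetical helper: Timers can yield nil (no timers pending) or a negative
# value (a timer that is already late). A caller would treat nil as "block
# until woken" and anything negative as "poll and fire immediately",
# assuming the selector treats 0 as an immediate poll.
def normalize_wait_interval(interval)
  return nil if interval.nil?
  interval < 0 ? 0 : interval
end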

@HoneyryderChuck
Author

It casts it to a long before calling select; check a few lines below. That's what tim was: the value already converted to long milliseconds that the channel must interpret on the select call.

I don't think the interval is the issue anymore, as timers are supposed to be that precise, right? It's as if the NIO channels can't handle that small an interval.

The code I meant: https://github.com/celluloid/celluloid/blob/master/lib/celluloid/actor.rb#L152-L156

@ioquatix
Member

When it gets cast to a long, perhaps the result is 0. In that case, some APIs have sleep-indefinitely behaviour when called with 0 (or sometimes -1).

@ioquatix
Member

So you can see here exactly what I said:

https://docs.oracle.com/javase/7/docs/api/java/nio/channels/Selector.html#select(long)

timeout - If positive, block for up to timeout milliseconds, more or less, while waiting for a channel to become ready; if zero, block indefinitely; must not be negative

So if you call it with 0, clearly it will block forever :D

@ioquatix
Member

If you call that API with a negative number, it will return the meaning of life.

@HoneyryderChuck
Author

The result is not 0; it's the one that was printed out in the example above. Are you saying the channel will interpret it as 0 and block forever?

@ioquatix
Member

selecting for 0.0009740119999985808 seconds...
timing out for 0.9740119999985808 milliseconds

If you cast 0.97 to an int, I'd expect truncation?

@HoneyryderChuck
Author

Yes, but the API expects a long as argument, unless it is truncating it inside. If so, the API is misleading, as it doesn't accept any positive value in milliseconds, only >= 1. I missed Java...

@ioquatix
Member

result = this.selector.select((long)(t * 1000));

If t < 0.001 this will result in select being called with 0.

@HoneyryderChuck
Author

Actually, it blocks even with 0.1

@ioquatix
Member

Are you saying t = 0.1 or t * 1000 = 0.1? Because 0.1 * 1000 should result in select(100) which should definitely be okay, but select(0.1) is definitely not.

@HoneyryderChuck
Author

t is already (long) milliseconds. So, 0.1 milliseconds.

@HoneyryderChuck
Author

OK, I think this is Java:

import java.io.IOException;
import java.nio.channels.Selector;

public class SelectorTest {
  public static void main(String[] args) {
    System.out.println("starting the test...");
    try {
      Selector selector = Selector.open();
      System.out.println("selecting now...");
      selector.selectNow();
      System.out.println("selecting later...");
      selector.select((long)1);
      System.out.println("selecting small...");
      selector.select((long)0.9999);
      System.out.println("that's all");
    } catch(IOException ie) {
      System.out.println("balls");
    }
  }
}

try to run this.

@ioquatix
Member

Dude,

 selector.select((long)0.9999);

is the same as

 selector.select(0);

@ioquatix
Member

long x = (long)0.9999;

x will contain 0.

@HoneyryderChuck
Author

jesus, what a mental block... of course.

@HoneyryderChuck
Author

well, I guess it's settled then. I'll come up with a patch in nio4r.
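
For reference, a Ruby-level sketch of the kind of guard being discussed (hypothetical; the actual patch went into the nio4r Java extension): never let a sub-millisecond timeout be truncated to the value that means "block forever".

# Hypothetical guard, not the actual nio4r patch: Java's Selector#select(0)
# blocks indefinitely, so a fractional-second timeout that would truncate to
# 0 ms is bumped up to 1 ms (alternatively, one could call selectNow there).
MIN_SELECT_TIMEOUT = 0.001 # seconds, i.e. 1 ms

def safe_select(selector, timeout)
  if timeout.nil?
    selector.select # block until an event or wakeup
  else
    selector.select([timeout, MIN_SELECT_TIMEOUT].max)
  end
end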

@ioquatix
Member

sweet, let me know and I'll check it out.

@HoneyryderChuck
Author

I was reading long and thinking double...

HoneyryderChuck pushed a commit to HoneyryderChuck/nio4r that referenced this issue Jan 28, 2016
This solves socketry/timers#58 , a bug that has been around.
Long story short, timers were passing really short intervals on the actor, these were being
passed to the reactor for select. Some of them were being cast to 0 before being passed to the
nio.Channel#select method, which according to documentation, blocks forever when passed 0.
@HoneyryderChuck
Author

@ioquatix, can you reopen this? Option 1 is still hanging...

@HoneyryderChuck
Author

The problem is here: wait_interval at some point is nil, and the selector will block. This behaviour occurs on MRI as well (tested on Ruby 2.1 and 2.3). I think this time it's really timers...

When calling select with the timeout set to nil, what is the expected behaviour? Both NIO and libev seem to have an inner threshold, as at some point I see the reactor calling back. But my timeout block is not retried anymore...

@ioquatix
Member

nil means there are no timers. So, in theory, wait indefinitely.

ioquatix reopened this Jan 29, 2016
@ioquatix
Member

Yes, we can certainly reopen this issue. What does @mailbox.check(interval) do if interval is nil?

@ioquatix
Member

In theory, run_once(nil) should block indefinitely, run_once(t > 0) should block for at most t seconds, and run_once(0) should check for events and return immediately.
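
A minimal sketch of that contract (hypothetical code, not the actual Celluloid reactor; it assumes an NIO::Selector in @selector and a callable attached to each monitor):

require 'nio'

# Intended semantics: nil blocks until something is ready (or Selector#wakeup
# is called), 0 polls and returns immediately, t > 0 waits at most t seconds.
# At the time of this issue the JRuby backend did not yet honour this for
# 0 / sub-millisecond timeouts.
def run_once(timeout = nil)
  @selector.select(timeout) do |monitor|
    monitor.value.call(monitor) # resume whatever handler was registered
  end
end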

@HoneyryderChuck
Author

Hmm, I don't know... when is it safe to run the reactor indefinitely? The main problem for me is that nil is passed to the reactor while the timers are still handling a timeout. It seems that one is blocking the other from triggering.

The way I see it, either the timers should not yield nil, or the reactor should not be passed nil (which would be in accordance with what libev expects). I just don't understand where the nil is being yielded from. Does it mean there are no timers to process? I'm not that familiar with the timers code.

@ioquatix
Member

Yes, as explained, nil means no timers.

@HoneyryderChuck
Author

It seems that the timers group is not triggering the exceptions properly. Here is a bit more logging for you, @ioquatix:

# good
...
successfull timeout fetch: 2016-01-29 14:35:35 +0100
timeout: adding timer: 2                            
timeout: timer added: 2 ; timers: 1                 

new timer: 1.9999205120000028                       
reactor inside: 1.9999205120000028                  
reactor outside: 1.9999205120000028                 
kaboom!!!!!!!!                                      

successfull timeout fetch: 2016-01-29 14:35:37 +0100
...

# also good
...
successfull timeout fetch: 2016-01-29 14:35:33 +0100
timeout: adding timer: 2                            
timeout: timer added: 2 ; timers: 1                 

new timer: 1.9999047220000037                       
reactor inside: 1.9999047220000037                  
reactor outside: 1.9999047220000037                 

new timer: 0.00020313299999941137                
reactor inside: 0.00020313299999941137              
reactor outside: 0.00020313299999941137             
kaboom!!!!!!!!                                      

successfull timeout fetch: 2016-01-29 14:35:35 +0100
...

# bad
...
successfull timeout fetch: 2016-01-29 14:35:37 +0100 
timeout: adding timer: 2                             
timeout: timer added: 2 ; timers: 1                  

new timer: 1.9999061400000002                        
reactor inside: 1.9999061400000002                   
reactor outside: 1.9999061400000002                  

new timer:     
reactor inside:                                      
reactor outside:                                     

new timer:                                           
mailbox: no message                                  
reactor inside:                                      
reactor outside:                                     
.... (ad eternum)

To add a bit of context: the first block of logs is what happens when the timeout fails and the action is retried (a new timeout is added to timers); the second block of logs is from the Actor#run method: the new timer interval, then the interval passed to the reactor; "kaboom" is when the timeout timer event triggers.

So, the latter is not happening in our bug scenario. I don't know why yet; what I see is that, if the timer event associated with the timeout call doesn't trigger within the 2-second interval, it's all gone. Now, why is it not triggering? That's the next step.

@ioquatix
Member

Can you make a spec in timers for the failing behaviour?

@HoneyryderChuck
Author

Still in the process of figuring it out; I'll get to the failing scenario later.

@HoneyryderChuck
Author

@ioquatix the problem is here: https://github.com/celluloid/timers/blob/master/lib/timers/group.rb#L66-L70. I've been following this and, in the normal case, you yield two intervals, fire the events at the end, and the timer gets cancelled afterwards. You can see an example below:

successfull timeout fetch: 2016-01-29 15:42:59 +0100   
Group has 1 events        
Handle: #<Timers::Events::Handle:0x007fe97ad1cf70>     
actor#run: interval 1.9999147540000024                     
Group has 1 events        
Handle: #<Timers::Events::Handle:0x007fe97ad1cf70>     
kaboom!!!!!!!! (timer.cancel from Celluloid::Actor)
successfull timeout fetch: 2016-01-29 15:43:01 +0100
...

The main point is that the handle must be a valid (not yet cancelled) one, which means the timer hasn't been triggered yet. In fact, the problem arrives as soon as the timer gets cancelled in between the two calls to wait_interval in the linked code:

successfull timeout fetch: 2016-01-29 15:43:03 +0100     
Group has 1 events          
Handle: #<Timers::Events::Handle:0x007fe97ad146b8>       
actor: interval 1.999946142000006                        
(timer gets cancelled here, but the task is not resumable at this point)
Group has 1 events    
Handle: #<Timers::Events::Handle:0x007fe97ad146b8> but it is cancelled!!!
Group has 0 events    
actor: interval (nil)
# our problems start here
....

I can't investigate this further, at least not today or over the weekend, but the main issue is right there: you can't fire cancelled handles.

@ioquatix
Member

I need a failing test case or it's impossible for me to debug :) Do you have Google Hangouts?

@ioquatix
Member

Okay dude so I'm trying to understand the failure case you describe above, but it's not failing for me:

  it "should be okay to cancel during wait" do
    fired = false
    timer = subject.after(0.1) { puts "Fired!"; fired = true }

    first = true

    subject.wait do |interval|
      puts "Interval: #{interval}"

      if first 
        timer.cancel
        first = false
      end

      break if interval < 0 or interval == nil
    end

    expect(fired).to be_falsey
  end

ioquatix added a commit that referenced this issue Jan 29, 2016
@ioquatix
Member

Please try to reproduce the issue in that spec.

@HoneyryderChuck
Author

@ioquatix no worries, I'll take the weekend to cool off, but I'll have time to come back to this issue next week, single out exactly what is causing this behaviour (sporadically, apparently) and write a meaningful spec. Sorry for the noise; I'm just sharing info while getting to the bottom of it, in the hope someone can have some input.

HoneyryderChuck pushed a commit to HoneyryderChuck/nio4r that referenced this issue Jan 31, 2016
@HoneyryderChuck
Author

@ioquatix false alarm. The problem was that my test wasn't really waiting for the socket to become readable, since the blocking call stops on EOF: wait_readable returns nil in such a case, but EOFError is only triggered once you actually try to read from the socket... which my small test script wasn't doing, hence the impression that the timeout wasn't being properly triggered. I think the main bug has already been fixed in nio4r.
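
In other words (a plain-Ruby sketch, not the original test script): waiting for readability is not enough to notice a closed peer; the EOFError only shows up once you actually read.

require 'socket'
require 'io/wait'

server = TCPServer.new(0)
client = TCPSocket.new("localhost", server.addr[1])
server.accept.close            # the peer closes the connection straight away

client.wait_readable(2)        # returns immediately: EOF counts as "readable"
client.read_nonblock(1024)     # EOFError is raised here, not above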

@ioquatix
Member

ioquatix commented Feb 1, 2016

We should probably make sure that nil is handled correctly by run_once, though.

@HoneyryderChuck
Author

OK, so let's just document it here, so anyone can come back to it:

  • JRuby: falls back to select(); waits indefinitely for events until wakeup is invoked or the thread is interrupted.
  • MRI: uses IO.select; if the timeout argument is nil, same thing.
  • libev: when passed nil for the timeout it stops the loop timer and calls ev_loop with EVLOOP_ONESHOT, which, according to the documentation, "... will block your process until at least one new event arrives, and will return after one iteration of the loop."

It seems to me that they all handle nil more or less the same way.
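
A quick hedged example with nio4r's Ruby API (not from this issue) showing that behaviour: select(nil) blocks until a registered IO becomes ready or Selector#wakeup is called from another thread.

require 'nio'

selector = NIO::Selector.new
reader, writer = IO.pipe
selector.register(reader, :r)

Thread.new { sleep 1; writer.write("x") }  # makes the pipe readable after ~1s

ready = selector.select(nil)   # nil timeout: blocks until readiness or wakeup
puts ready.map(&:io).inspect   # => the reader end of the pipe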

@ioquatix
Member

ioquatix commented Feb 1, 2016

Okay, awesome digging, thanks for providing a summary 👍
