timeout is buggy in JRuby #58

Closed
HoneyryderChuck opened this issue Jan 27, 2016 · 48 comments
Comments

@HoneyryderChuck

First I thought it was the celluloid-io sockets, but then I realized that in JRuby, timeouts are not always triggered. Here is a simple script:

require 'time'
require 'celluloid/io'


class Tester
  include Celluloid::IO
  def initialize
    @sock = TCPSocket.new("localhost", 80)
  end

  def test_timeout
    begin
      timeout(2) do
         # option 1
         # @sock.wait_readable
         # option 2
         # sleep 3
         # option 3 
         loop do
            sleep 1
         end
      end
    rescue Celluloid::Task::TimeoutError
      puts "successfull timeout fetch: #{Time.now}"
      retry
    ensure
      puts "#{$!.class} #{$!.message}" if $!
    end
  end

end


puts "starting the actor..."
t = Tester.new
t.async(:test_timeout)
sleep

# Option 1
# Both MRI 2.1 and JRuby 9.0.5.0
....
successfull timeout fetch: 2016-01-27 18:50:47 +0100
successfull timeout fetch: 2016-01-27 18:50:49 +0100
successfull timeout fetch: 2016-01-27 18:50:51 +0100
successfull timeout fetch: 2016-01-27 18:50:53 +0100
successfull timeout fetch: 2016-01-27 18:50:55 +0100
successfull timeout fetch: 2016-01-27 18:50:57 +0100
successfull timeout fetch: 2016-01-27 18:50:59 +0100
successfull timeout fetch: 2016-01-27 18:51:01 +0100


(hangs at some point...)


# Option 2
# MRI 2.1 and JRuby 9.0.5.0
starting the actor...
successfull timeout fetch: 2016-01-27 18:49:12 +0100
W, [2016-01-27T18:49:15.799990 #328595]  WARN -- : Attempted to resume a dead task: type=:call, meta={:dangerous_suspend=>false, :method_name=>:test_timeout}, status=:dead

Option 3 works in MRI (apparently), but also hangs at some point.

This can seriously hinder celluloid/celluloid-io#170.

@tarcieri
Contributor

Yeah, this has been reported several times...

Any ideas @ioquatix?

@ioquatix
Member

I will investigate the issue. Is the above test case failing on MRI 2.3?

@HoneyryderChuck
Author

I didn't test with all possible MRI versions. I'll continue testing today and come back with more results.

@HoneyryderChuck
Author

Maybe this is old news, but I'll state it nonetheless: replacing Celluloid::IO with a plain Celluloid actor seems to fix the issues. My bet is now on the evented mailbox.

@HoneyryderChuck
Author

I've been investigating option 3, which consistently fails in JRuby. Something's wrong with the timeout...

I've added a few log messages before the reactor selects, and also inside the JRuby extension, before select is called on the NIO channel. Here's an example of a successful select:

# on celluloid/io/reactor.rb:59
selecting for 0.9849073000000006 seconds...   
# on ext/nio4r/org/nio4r/Nio4r.java:291
timing out for 984.9073000000006 milliseconds...

Here's where it gets blocked:

selecting for 0.0009740119999985808 seconds...
timing out for 0.9740119999985808 milliseconds

The doSelect() method has a condition stating that if the timeout is 0, selectNow() is called on the NIO channel.

I've patched it by calling selectNow() when the timeout is smaller than 1 millisecond, and it doesn't hang anymore:

/* inside doSelect(), line 290 */
long tim = (long)(t * 1000);
if (tim < 1) {
    result = this.selector.selectNow();
} else {
    result = this.selector.select(tim);
}
...

Would this be an acceptable patch?

I think the real "problem" still comes from the timers, as the interval triggering a message handling is probably so small that the Java runtime can't handle it properly. Or are Java NIO channels not able to handle that kind of precision?

@ioquatix
Member

This could be an okay patch, but I have a request: could you print out the values of tim where the bug happens? Is the bug also solved if you change the condition to (tim <= 0)?

@HoneyryderChuck
Author

@ioquatix I just did:

timing out for 0.9740119999985808 milliseconds

I'm digging a bit more; the problem comes from Celluloid::Actor#run, particularly the wait block (line 157).

Timers passes the wait_interval to the block, and this is used to wait for a message from the mailbox. This wait_interval can apparently be very small! I've just tracked it:

# these are okay!
checking message: 0.9983331230000001 seconds
checking message: 0.9967141259999996 seconds
# this one is not! that's your block!
checking message: 7.559099999987495e-05 seconds 

that's some precise shit!

Were select system calls ever able to deal with such an interval? Is this just the java way?

@ioquatix
Member

I wrote the code which generates those intervals, but it was a while ago. Yes, it's possible to generate small intervals.

Did you try with the condition I suggested?

Are you referring to this code in Nio4r?

https://github.com/celluloid/nio4r/blob/8cc48508a7cf3bb8497ba65f15febd691cf5d2a6/ext/nio4r/org/nio4r/Nio4r.java#L284-L291

That code is a bit different from what you quoted above. It uses double rather than long, so it avoids rounding issues and handles negative values correctly.

@ioquatix
Member

Can you link me to the code you are looking at w.r.t. wait_interval? Thanks :)

@ioquatix
Member

Just FYI, in some cases timers will provide a negative wait interval if a timer should already have fired (it was late): https://github.com/celluloid/timers/blob/41145ed260e4f061c6b93604c5e0ba86192f37af/lib/timers/group.rb#L81-L90 but I'm not sure whether this is used in the code you are talking about.
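
For illustration, a minimal Ruby sketch (hypothetical, not the actual Celluloid code) of how a caller might normalise the interval Timers yields before handing it to a selector:

# Hypothetical helper: Timers can yield nil (no timers pending) or a negative
# value (a timer that is already late). A caller would treat nil as "block
# until woken" and anything negative as "poll and fire immediately",
# assuming the selector treats 0 as an immediate poll.
def normalize_wait_interval(interval)
  return nil if interval.nil?
  interval < 0 ? 0 : interval
end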

@HoneyryderChuck
Author

It casts it to a long before calling select; check a few lines below. That's what tim was: the value already converted to long milliseconds that the channel must interpret on the select call.

I don't think the interval is the issue anymore, as timers are supposed to be that precise, right? It's as if the NIO channels can't handle that small an interval.

The code I meant: https://github.com/celluloid/celluloid/blob/master/lib/celluloid/actor.rb#L152-L156

@ioquatix
Member

When it gets cast to a long, perhaps the result is 0. In that case, some APIs have sleep-indefinitely behaviour when called with 0 (or sometimes -1).

@ioquatix
Member

So you can see here exactly what I said:

https://docs.oracle.com/javase/7/docs/api/java/nio/channels/Selector.html#select(long)

timeout - If positive, block for up to timeout milliseconds, more or less, while waiting for a channel to become ready; if zero, block indefinitely; must not be negative

So if you call it with 0, clearly it will block forever :D

@ioquatix
Member

If you call that API with a negative number, it will return the meaning of life.

@HoneyryderChuck
Author

The result is not 0; it's the one that was printed out in the example above. Are you saying the channel will interpret it as 0 and block forever?

@ioquatix
Member

selecting for 0.0009740119999985808 seconds...
timing out for 0.9740119999985808 milliseconds

If you cast 0.97 to an int, I'd expect truncation?

@HoneyryderChuck
Author

Yes, but the API expects a long as argument, unless it is truncating it inside. If so, the API is misleading, as it doesn't accept any positive value in milliseconds, only >= 1. I missed Java...

@ioquatix
Member

result = this.selector.select((long)(t * 1000));

If t < 0.001 this will result in select being called with 0.

@HoneyryderChuck
Author

Actually, it blocks even with 0.1

@ioquatix
Member

Are you saying t = 0.1 or t * 1000 = 0.1? Because 0.1 * 1000 should result in select(100) which should definitely be okay, but select(0.1) is definitely not.

@HoneyryderChuck
Author

t is already (long) milliseconds. So, 0.1 milliseconds.

@HoneyryderChuck
Author

OK, I think this is Java:

import java.io.IOException;
import java.nio.channels.Selector;

public class SelectorTest {
  public static void main(String[] args) {
    System.out.println("starting the test...");
    try {
      Selector selector = Selector.open();
      System.out.println("selecting now...");
      selector.selectNow();
      System.out.println("selecting later...");
      selector.select((long)1);
      System.out.println("selecting small...");
      selector.select((long)0.9999);
      System.out.println("that's all");
    } catch(IOException ie) {
      System.out.println("balls");
    }
  }
}

try to run this.

@ioquatix
Member

Dude,

 selector.select((long)0.9999);

is the same as

 selector.select(0);

@ioquatix
Member

long x = (long)0.9999;

x will contain 0.

@HoneyryderChuck
Author

jesus, what a mental block... of course.

@HoneyryderChuck
Author

well, I guess it's settled then. I'll come up with a patch in nio4r.
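
For reference, a Ruby-level sketch of the kind of guard being discussed (hypothetical; the actual patch went into the nio4r Java extension): never let a sub-millisecond timeout be truncated to the value that means "block forever".

# Hypothetical guard, not the actual nio4r patch: Java's Selector#select(0)
# blocks indefinitely, so a fractional-second timeout that would truncate to
# 0 ms is bumped up to 1 ms (alternatively, one could call selectNow there).
MIN_SELECT_TIMEOUT = 0.001 # seconds, i.e. 1 ms

def safe_select(selector, timeout)
  if timeout.nil?
    selector.select # block until an event or wakeup
  else
    selector.select([timeout, MIN_SELECT_TIMEOUT].max)
  end
end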

@ioquatix
Member

sweet, let me know and I'll check it out.

@HoneyryderChuck
Author

I was reading long and thinking double...

HoneyryderChuck pushed a commit to HoneyryderChuck/nio4r that referenced this issue Jan 28, 2016
This solves socketry/timers#58 , a bug that has been around.
Long story short, timers were passing really short intervals on the actor, these were being
passed to the reactor for select. Some of them were being cast to 0 before being passed to the
nio.Channel#select method, which according to documentation, blocks forever when passed 0.
@HoneyryderChuck
Author

@ioquatix, can you reopen this? Option 1 is still hanging...

@HoneyryderChuck
Author

The problem is here: wait_interval at some point is nil, and the selector will block. This behaviour occurs on MRI as well (tested on Ruby 2.1 and 2.3). I think this time it's really timers...

When calling select with the timeout set to nil, what is the expected behaviour? Both NIO and libev seem to have an inner threshold, as at some point I see the reactor calling back. But my timeout block is not retried anymore...

@ioquatix
Member

nil means there are no timers. So, in theory, wait indefinitely.

ioquatix reopened this Jan 29, 2016
@ioquatix
Member

Yes, we can certainly reopen this issue. What does @mailbox.check(interval) do if interval is nil?

@ioquatix
Member

In theory, run_once(nil) should block indefinitely, run_once(t > 0) should block for at most t seconds, and run_once(0) should check for events and return immediately.
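
A minimal sketch of that contract (hypothetical code, not the actual Celluloid reactor; it assumes an NIO::Selector in @selector and a callable attached to each monitor):

require 'nio'

# Intended semantics: nil blocks until something is ready (or Selector#wakeup
# is called), 0 polls and returns immediately, t > 0 waits at most t seconds.
# At the time of this issue the JRuby backend did not yet honour this for
# 0 / sub-millisecond timeouts.
def run_once(timeout = nil)
  @selector.select(timeout) do |monitor|
    monitor.value.call(monitor) # resume whatever handler was registered
  end
end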

@HoneyryderChuck
Author

Hmm, I don't know... when is it safe to run the reactor indefinitely? The main problem for me is that nil is passed to the reactor while the timers are still handling a timeout. It seems that one is blocking the other from triggering.

The way I see it, either the timers should not yield nil, or the reactor should not be passed nil (which would be in accordance with what libev expects). I just don't understand where the nil is being yielded from. Does it mean there are no timers to process? I'm not that familiar with the timers code.

@ioquatix
Member

Yes, as explained, nil means no timers.

@HoneyryderChuck
Author

It seems that the timers group is not triggering the exceptions properly. Here is a bit more logging for you, @ioquatix:

# good
...
successfull timeout fetch: 2016-01-29 14:35:35 +0100
timeout: adding timer: 2                            
timeout: timer added: 2 ; timers: 1                 

new timer: 1.9999205120000028                       
reactor inside: 1.9999205120000028                  
reactor outside: 1.9999205120000028                 
kaboom!!!!!!!!                                      

successfull timeout fetch: 2016-01-29 14:35:37 +0100
...

# also good
...
successfull timeout fetch: 2016-01-29 14:35:33 +0100
timeout: adding timer: 2                            
timeout: timer added: 2 ; timers: 1                 

new timer: 1.9999047220000037                       
reactor inside: 1.9999047220000037                  
reactor outside: 1.9999047220000037                 

new timer: 0.00020313299999941137                
reactor inside: 0.00020313299999941137              
reactor outside: 0.00020313299999941137             
kaboom!!!!!!!!                                      

successfull timeout fetch: 2016-01-29 14:35:35 +0100
...

# bad
...
successfull timeout fetch: 2016-01-29 14:35:37 +0100 
timeout: adding timer: 2                             
timeout: timer added: 2 ; timers: 1                  

new timer: 1.9999061400000002                        
reactor inside: 1.9999061400000002                   
reactor outside: 1.9999061400000002                  

new timer:     
reactor inside:                                      
reactor outside:                                     

new timer:                                           
mailbox: no message                                  
reactor inside:                                      
reactor outside:                                     
.... (ad eternum)

To add a bit of context: the first block of logs is what happens when the timeout fails and the action is retried (a new timeout is added to timers); the second block of logs is from the Actor#run method: the new timer interval, then the interval passed to the reactor; "kaboom" is when the timeout timer event triggers.

So, the latter is not happening in our bug scenario. I don't know why yet; what I see is that, if the timer event associated with the timeout call doesn't trigger within the 2-second interval, it's all gone. Now, why is it not triggering? That's the next step.

@ioquatix
Member

Can you make a spec in timers for the failing behaviour?

@HoneyryderChuck
Author

Still in the process of figuring it out; I'll get to the failing scenario later.

@HoneyryderChuck
Author

@ioquatix the problem is here: https://github.com/celluloid/timers/blob/master/lib/timers/group.rb#L66-L70. I've been following this and, in the normal case, you yield two intervals, fire the events at the end, and the timer gets cancelled afterwards. You can see an example below:

successfull timeout fetch: 2016-01-29 15:42:59 +0100   
Group has 1 events        
Handle: #<Timers::Events::Handle:0x007fe97ad1cf70>     
actor#run: interval 1.9999147540000024                     
Group has 1 events        
Handle: #<Timers::Events::Handle:0x007fe97ad1cf70>     
kaboom!!!!!!!! (timer.cancel from Celluloid::Actor)
successfull timeout fetch: 2016-01-29 15:43:01 +0100
...

The main point is that the handle must be a valid (not yet cancelled) one, which means the timer hasn't been triggered yet. In fact, the problem arrives as soon as the timer gets cancelled in between the two calls to wait_interval in the linked code:

successfull timeout fetch: 2016-01-29 15:43:03 +0100     
Group has 1 events          
Handle: #<Timers::Events::Handle:0x007fe97ad146b8>       
actor: interval 1.999946142000006                        
(timer gets cancelled here, but the task is not resumable at this point)
Group has 1 events    
Handle: #<Timers::Events::Handle:0x007fe97ad146b8> but it is cancelled!!!
Group has 0 events    
actor: interval (nil)
# our problems start here
....

I can't investigate this further, at least not today or over the weekend, but the main issue is right there: you can't fire cancelled handles.

@ioquatix
Member

I need a failing test case or it's impossible for me to debug :) Do you have Google Hangouts?

@ioquatix
Member

Okay dude so I'm trying to understand the failure case you describe above, but it's not failing for me:

  it "should be okay to cancel during wait" do
    fired = false
    timer = subject.after(0.1) { puts "Fired!"; fired = true }

    first = true

    subject.wait do |interval|
      puts "Interval: #{interval}"

      if first 
        timer.cancel
        first = false
      end

      break if interval < 0 or interval == nil
    end

    expect(fired).to be_falsey
  end

ioquatix added a commit that referenced this issue Jan 29, 2016
@ioquatix
Member

Please try to reproduce the issue in that spec.

@HoneyryderChuck
Author

@ioquatix no worries, I'll take the weekend to cool off, but I'll have time to come back to this issue next week, single out exactly what is causing this behaviour (sporadically, apparently) and write a meaningful spec. Sorry for the noise; I'm just sharing info while getting to the bottom of it, in the hope someone can have some input.

HoneyryderChuck pushed a commit to HoneyryderChuck/nio4r that referenced this issue Jan 31, 2016
@HoneyryderChuck
Author

@ioquatix false alarm. The problem was that my test wasn't really waiting for the socket to become readable, since the blocking call stops on EOF: wait_readable returns nil in such a case, but EOFError is only triggered once you actually try to read from the socket... which my small test script wasn't doing, hence the impression that the timeout wasn't being properly triggered. I think the main bug has already been fixed in nio4r.
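
In other words (a plain-Ruby sketch, not the original test script): waiting for readability is not enough to notice a closed peer; the EOFError only shows up once you actually read.

require 'socket'
require 'io/wait'

server = TCPServer.new(0)
client = TCPSocket.new("localhost", server.addr[1])
server.accept.close            # the peer closes the connection straight away

client.wait_readable(2)        # returns immediately: EOF counts as "readable"
client.read_nonblock(1024)     # EOFError is raised here, not above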

@ioquatix
Member

ioquatix commented Feb 1, 2016

We should probably make sure that nil is handled correctly by run_once, though.

@HoneyryderChuck
Author

OK, so let's just document it here, so anyone can come back to it:

  • JRuby: falls back to select(); waits indefinitely for events until wakeup is invoked or the thread is interrupted.
  • MRI: uses IO.select; if the timeout argument is nil, same thing.
  • libev: when passed nil for the timeout it stops the loop timer and calls ev_loop with EVLOOP_ONESHOT, which, according to the documentation, "... will block your process until at least one new event arrives, and will return after one iteration of the loop."

It seems to me that they all handle nil more or less the same way.
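
A quick hedged example with nio4r's Ruby API (not from this issue) showing that behaviour: select(nil) blocks until a registered IO becomes ready or Selector#wakeup is called from another thread.

require 'nio'

selector = NIO::Selector.new
reader, writer = IO.pipe
selector.register(reader, :r)

Thread.new { sleep 1; writer.write("x") }  # makes the pipe readable after ~1s

ready = selector.select(nil)   # nil timeout: blocks until readiness or wakeup
puts ready.map(&:io).inspect   # => the reader end of the pipe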

@ioquatix
Member

ioquatix commented Feb 1, 2016

Okay, awesome digging, thanks for providing a summary 👍
