-
-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
timeout is buggy in JRuby #58
Comments
Yeah, this has been reported several times... Any ideas @ioquatix? |
I will investigate the issue, is the above test case failing on MRI 2.3? |
didn't test with all possible mri versions. will continue to test today, I'll come up with more results. |
maybe this is old news, but I'll state them nonetheless: replacing the Celluloid::IO with a Celluloid actor seems to fix the issues. My bet is now on the evented mailbox. |
I've been investigating option 3, which consistently happens in jruby. Something's wrong with the timeout... I've added a few messages before the reactor selects, and also inside the jruby extension, before select is called on the NIO channel. Here's an example of a successful select:
Here's where it gets blocked:
The doSelect() method has a condition stating that if timeout is 0, one calls selectNow() on the NIO channel. I've patched it by calling selectNow() when the timeout is smaller than 1 millisecond, and it doesn't hang anymore: /* inside doSelect(), line 290 */
long tim = (long)(t * 1000);
if (tim < 1) {
result = this.selector.selectNow();
} else {
result = this.selector.select(tim);
}
... Would this be an acceptable patch? I think the real "problem" still comes from timer, as an interval triggering a message handling is probably so small that the java Runtime can't handle it properly. Or are java NIO channels not able to handle that kind of precision? |
This could be an okay patch, but I have a request, could you print out values of |
@ioquatix I just did:
I'm digging a bit more, problem comes from Celluloid::Actor#run, particularly the wait block (line 157). Timers is passing the wait_interval to the block, and this will be used to check a message from the mailbox. This wait_interval can be apparently very small! I've just tracked the wait_interval:
that's some precise shit! Were select system calls ever able to deal with such an interval? Is this just the java way? |
I wrote the code which generate those intervals, but it was a while ago. Yes, it's possible to generate small intervals. Did you try with the condition I suggested? Are you referring to this code in Nio4r? That code is a bit different from what you quoted above. It uses double rather than long so avoids rounding issues, handles negative values correctly. |
Can you link me to the code you are looking at w.r.t wait_interval, thanks :) |
Just FYI, in some cases timers will provide a -ve wait interval if a timer should have fired already (it was late) https://github.com/celluloid/timers/blob/41145ed260e4f061c6b93604c5e0ba86192f37af/lib/timers/group.rb#L81-L90 but I'm not sure if this is used in the code you are talking about or not. |
it casts it to a long before calling select, check a few lines below. That's what I don't think anymore that the interval is the issue, as timers are supposed to be that precise, right? It's as if the NIO channels can't handle that low of an interval. The code I meant: https://github.com/celluloid/celluloid/blob/master/lib/celluloid/actor.rb#L152-L156 |
When it gets cast to a long, perhaps the result is 0. In that case, some API have sleep indefinitely behaviour when called with 0 (or sometimes -1). |
So you can see here exactly what I said: https://docs.oracle.com/javase/7/docs/api/java/nio/channels/Selector.html#select(long) timeout - If positive, block for up to timeout milliseconds, more or less, while waiting for a channel to become ready; if zero, block indefinitely; must not be negative So if you call it with 0, clearly it will block forever :D |
IF you call that API with a negative number, it will return the meaning of life. |
The result is not 0, is the one that was printed out in the example above. Are you saying the channel will interpret it as 0 and block forever? |
If you cast 0.97 to an int, I'd expect truncation? |
yes, but the API expects a long as argument. Unless it is truncating it inside. If so, the API is misleading as it doesn't accept positive in milliseconds, but > 1. I missed java... |
If |
Actually, it blocks even with 0.1 |
Are you saying |
t is already (long) milliseconds. so, 0.1 milliseconds. |
Ok, I think this is java: import java.io.IOException;
import java.nio.channels.Channel;
public class SelectorText {
public static void main(String[] args) {
System.out.println("starting the test...");
java.nio.channels.Selector selector = null;
try {
selector = java.nio.channels.Selector.open();
System.out.println("selecting now...");
selector.selectNow();
System.out.println("selecting later...");
selector.select((long)1);
System.out.println("selecting small...");
selector.select((long)0.9999);
System.out.println("that's all");
} catch(IOException ie) {
System.out.println("balls");
}
}
} try to run this. |
Dude,
is the same as
|
x will contain 0. |
jesus, what a mental block... of course. |
well, I guess it's settled then. I'll come up with a patch in nio4r. |
sweet, let me know and I'll check it out. |
I was reading |
This solves socketry/timers#58 , a bug that has been around. Long story short, timers were passing really short intervals on the actor, these were being passed to the reactor for select. Some of them were being cast to 0 before being passed to the nio.Channel#select method, which according to documentation, blocks forever when passed 0.
@ioquatix , can you reopen this? option 1 still hanging... |
Problem is here, wait_interval at some point is nil, and the selector will block. This is behaviour counts for MRI as well (tested in ruby 2.1 and 2.3). I think this time it's really timers... When calling select with timeout set to nil, what is the expected behaviour? Both NIO as libev seem to have an inner threshold, as at some point I see the reactor calling back. But my timeout block is not retried anymore... |
|
Yes we can certainly reopen this issue, what does |
Here is the non-everted mailbox: https://github.com/celluloid/celluloid/blob/master/lib/celluloid/mailbox.rb#L48-L70
Here is for evented mailbox: Calling That code looks like it is designed to handle nil, but passes it to
It looks like C expects timeout to be non-nil which isn't consistent with the other implementations: Not sure about libev, perhaps @tarcieri can comment on this since he wrote most of that code (?) |
In theory, the behaviour of |
humm, I don't know... when is it safe to run the reactor indefinitely? The main problem for me is, the nil is passed to the reactor while the timers are still handling a timeout. It seems that one is blocking the other from triggering. The way I see it, either the timers should not yield nil, or the reactor should not be passed nil (which would be in accordance to what libev expects). I just don't understand where is the nil being yielded from. Does this mean there are no timers to process? I'm not as familiar with the timers code. |
Yes, as explained, nil means no timers. |
It seems that the timers group is not triggering the exceptions properly. Here is a bit more logging for you @ioquatix :
to add a bit of context, first block of logs is what happens when the timeout fails and the action is retried; (a new timeout is added to timers); the second block of logs is from the Actor#run method: new timer interval, interval passed to the reactor; "kaboom" is when the timeout timer event triggers. So, the latter is not happening in our bug scenario. I don't know yet why, though, what I see is that, if the timer event associated with the timeout call doesn't trigger within the 2 second interval, it's all gone. Now, why is it not triggering? Next step. |
Can you make a spec in timers for the failing behaviour. |
still in the process of figuring out, I'll get to the failing scenario later. |
@ioquatix problem is here: https://github.com/celluloid/timers/blob/master/lib/timers/group.rb#L66-L70. I've been following this and, in the normal case, you yield two intervals, fire the events in the end, and the timer get's cancelled after. You can see an example below:
Main feature is that the handle must be a valid (not yet cancelled one), which means the timer hasn't been triggered yet. In fact, the problem arrive as soon as the timer cancels in between the two calls on wait_interval there in the link:
I can't investigate this further, at least for today and the weekend, but my the main issue is right there: you can't fire cancelled handles. |
I need a failing test case or it's impossible for me to debug :) Do you have Google Hangouts? |
Okay dude so I'm trying to understand the failure case you describe above, but it's not failing for me: it "should be okay to cancel during wait" do
fired = false
timer = subject.after(0.1) { puts "Fired!"; fired = true }
first = true
subject.wait do |interval|
puts "Interval: #{interval}"
if first
timer.cancel
first = false
end
break if interval < 0 or interval == nil
end
expect(fired).to be_falsey
end |
Please try to reproduce the issue in that spec. |
@ioquatix no worries, I'll take the weekend to cool off, but I'll have time to come back to this issue next week, single-out exactly what it is causing this behaviour (sporadically, apparently) and write a meaningful spec. Sorry for the noise, just sharing info while getting to the bottom in the hope someone can have some input. |
This solves socketry/timers#58 , a bug that has been around. Long story short, timers were passing really short intervals on the actor, these were being passed to the reactor for select. Some of them were being cast to 0 before being passed to the nio.Channel#select method, which according to documentation, blocks forever when passed 0.
@ioquatix false alarm. Problem was that my test wasn't waiting for socket to be readable, and the blocking call stops on EOF. wait_readable returns nil in such a case, but EOFError is only triggered once you actually try to read from the socket... which my small test script wasn't doing, and so therein lies the impression that the timeout wasn't being properly triggered. I think that the main bug has already been fixed in nio4r. |
We should probably make that nil is handled correctly by |
Ok, so let's just document here, so anyone come back to it:
It seems to me that they all handle nil more or less the same way. |
Okay, awesome digging, thanks for providing a summary 👍 |
First I thought it was celluloid-io sockets, but then I realized that in Jruby, timeouts will not always be triggered. Here is a simple script:
Option 3 works in MRI (apparently), but also hangs at some point.
This can seriously hinder celluloid/celluloid-io#170.
The text was updated successfully, but these errors were encountered: