Nextflow crashes when querying jobstate (from a dead server) #71

Open
matthdsm opened this issue Jul 17, 2024 · 5 comments · May be fixed by #72

@matthdsm
Collaborator

Hi,

We noticed that the Nextflow process crashes when the plugin (temporarily) can't query the job state. Perhaps it would be good to add a timeout and some retries here?

Cheers
M
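
For illustration, the "some retries" suggested above could look roughly like the sketch below. It is not the plugin's code: `JobStateRetry`, `withRetries`, and the attempt and delay parameters are hypothetical names chosen for this example, and the query is passed in as a plain `Callable` rather than a real Nomad client call. The wait doubles after each failed attempt, and the last failure is kept as the cause once the attempts run out.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper (not part of the plugin): bounded retries with a doubling delay. */
final class JobStateRetry {

    /**
     * Runs the query up to maxAttempts times, sleeping between failed attempts.
     * The sleep starts at initialDelayMillis and doubles after each failure.
     * If every attempt fails, the last failure is wrapped in an IllegalStateException.
     */
    static <T> T withRetries(Callable<T> query, int maxAttempts, long initialDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return query.call();
            } catch (Exception e) {
                last = e; // remember the failure and fall through to the wait
            }
            if (attempt < maxAttempts) {
                pause(delay);
                delay *= 2;
            }
        }
        throw new IllegalStateException(
                "Job state query failed after " + maxAttempts + " attempts", last);
    }

    /** Sleeps for the given time, preserving the interrupt flag if interrupted. */
    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting to retry", ie);
        }
    }
}
```

A caller would wrap the state lookup, for example `JobStateRetry.withRetries(() -> lookupState(jobId), 5, 500L)`, where `lookupState` stands in for whatever method actually queries the server. A per-request timeout would still have to be configured on the HTTP client itself; this sketch only covers the retry half.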

@abhi18av
Member

Hi @matthdsm ,

Interesting, the closest experience on my side has been a WARNing that the job hasn't been allocated to a node yet, which we addressed in a recent commit.

Could you please share a minimal reproducible use-case and the version of the plugin used?

Ideally:

  1. Nextflow log
  2. Any specific command/config/pipeline not in the main log

@matthdsm
Collaborator Author

We were rebooting some of our services and the public address of our Nomad server was offline for a short while. I got the following in the logs:

Jul-17 07:07:01.940 [Task monitor] DEBUG n.nomad.executor.NomadService - [NOMAD] Failed to get jobState nf-70f3e6e3033816668cd1ffc4e6217165-NFCMGG_PREPROCESSING_PRE -- Cause: java.net.ConnectException: Failed to connect to nomad.ops.cmgg.be/172.20.1.206:80
io.nomadproject.client.ApiException: java.net.ConnectException: Failed to connect to nomad.ops.cmgg.be/172.20.1.206:80
        at io.nomadproject.client.ApiClient.execute(ApiClient.java:928)
        at io.nomadproject.client.api.JobsApi.getJobAllocationsWithHttpInfo(JobsApi.java:629)
        at io.nomadproject.client.api.JobsApi.getJobAllocations(JobsApi.java:596)
        at nextflow.nomad.executor.NomadService.getJobState(NomadService.groovy:274)
        at nextflow.nomad.executor.NomadTaskHandler.taskState0(NomadTaskHandler.groovy:187)
        at nextflow.nomad.executor.NomadTaskHandler.checkIfCompleted(NomadTaskHandler.groovy:87)
        at nextflow.processor.TaskPollingMonitor.checkTaskStatus(TaskPollingMonitor.groovy:649)
        at nextflow.processor.TaskPollingMonitor.checkAllTasks(TaskPollingMonitor.groovy:571)
        at nextflow.processor.TaskPollingMonitor.pollLoop(TaskPollingMonitor.groovy:441)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
        at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
        at groovy.lang.MetaClassImpl.doInvokeMethod(MetaClassImpl.java:1333)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1088)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
        at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:645)
        at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:628)
        at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:82)
        at nextflow.processor.TaskPollingMonitor$_start_closure2.doCall(TaskPollingMonitor.groovy:316)
        at nextflow.processor.TaskPollingMonitor$_start_closure2.call(TaskPollingMonitor.groovy)
        at groovy.lang.Closure.run(Closure.java:505)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.net.ConnectException: Failed to connect to nomad.ops.cmgg.be/172.20.1.206:80
        at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.kt:297)
        at okhttp3.internal.connection.RealConnection.connect(RealConnection.kt:207)
        at okhttp3.internal.connection.ExchangeFinder.findConnection(ExchangeFinder.kt:226)
        at okhttp3.internal.connection.ExchangeFinder.findHealthyConnection(ExchangeFinder.kt:106)
        at okhttp3.internal.connection.ExchangeFinder.find(ExchangeFinder.kt:74)
        at okhttp3.internal.connection.RealCall.initExchange$okhttp(RealCall.kt:255)
        at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.kt:32)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
        at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.kt:95)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
        at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.kt:83)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
        at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:76)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
        at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:201)
        at okhttp3.internal.connection.RealCall.execute(RealCall.kt:154)
        at io.nomadproject.client.ApiClient.execute(ApiClient.java:924)
        ... 24 common frames omitted
Caused by: java.net.ConnectException: Connection refused (Connection refused)
        at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:412)
        at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:255)
        at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:237)
        at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.base/java.net.Socket.connect(Socket.java:609)
        at okhttp3.internal.platform.Platform.connectSocket(Platform.kt:120)
        at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.kt:295)
        ... 40 common frames omitted

@abhi18av
Member

Ah, it's related to the main HTTP connection itself. Sure, this will be addressed soon 👍
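
In the trace above, the `java.net.ConnectException` sits two levels down the cause chain, under the `ApiException`. One way to tell such a connectivity failure apart from other errors is to walk that chain, as in the hedged sketch below. The names `ConnectivityCheck`, `isConnectionFailure`, `stateOrUnknown`, and the `"unknown"` placeholder state are assumptions made for this example, not the plugin's API, and whether the plugin should report a placeholder state at all is a design decision for the maintainers.

```java
import java.net.ConnectException;
import java.util.concurrent.Callable;

/** Hypothetical helper (not part of the plugin): spot connection failures in a cause chain. */
final class ConnectivityCheck {

    /** Placeholder state returned while the server cannot be reached (assumed name). */
    static final String UNKNOWN = "unknown";

    /** Returns true if the error or any of its causes is a ConnectException. */
    static boolean isConnectionFailure(Throwable error) {
        int depth = 0;
        // The depth limit guards against a cyclic cause chain.
        for (Throwable t = error; t != null && depth < 32; t = t.getCause(), depth++) {
            if (t instanceof ConnectException) {
                return true;
            }
        }
        return false;
    }

    /**
     * Runs the query; a connection failure yields UNKNOWN so the caller can poll again,
     * while any other failure is rethrown wrapped in an IllegalStateException.
     */
    static String stateOrUnknown(Callable<String> query) {
        try {
            return query.call();
        } catch (Exception e) {
            if (isConnectionFailure(e)) {
                return UNKNOWN;
            }
            throw new IllegalStateException("Job state query failed", e);
        }
    }
}
```

Only `ConnectException` is checked here because that is what the log shows; other network errors, such as timeouts, would need their own handling.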

@abhi18av abhi18av self-assigned this Jul 17, 2024
@jagedn
Collaborator

jagedn commented Jul 17, 2024

interesting edge case to be addressed

@abhi18av abhi18av linked a pull request Jul 17, 2024 that will close this issue
@abhi18av abhi18av changed the title from "Nextflow crashes when querying jobstate" to "Nextflow crashes when querying jobstate (from a dead server)" Jul 17, 2024
@abhi18av
Member

@matthdsm @jhaezebr a couple of questions for you both:

  1. How many Nomad servers are in the cluster?

  2. Is nomad.ops.cmgg.be/172.20.1.206:80 the address of the leader?

3 participants