NSFS | Concurrency Tests | Warp fails with error - connect: connection refused #8471
Comments
@guymguym @romayalon Do we have any limit on the number of connections?
@javieramirez1 Where are the logs of this run?
In order to send a cleaner log, an 8-hour retest was performed with the latest rpm (1026). The following warp run was executed, and it seems the first issue is already resolved (connection refused errors in GET operations no longer occur), but now I see a different one: for some reason the PUT part is not performed. Warp says it is skipping PUT because it has too few samples, but this is a workload of 500 connections for 8 hours, so that should not happen. The warp summary shows:
Operation: PUT. Concurrency: 0
Operation: GET. Concurrency: 500
Throughput by host:
Throughput, split into 95 x 5m0s:
I attached the log via Slack.
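As a side note, a minimal sketch of how the saved warp benchmark data could be re-analyzed to inspect the per-operation summary; the file name below is a placeholder, and flag availability may vary between warp versions:

```sh
# Re-analyze a saved warp benchmark file to reproduce the per-operation summary
# (warp prints the output file path at the end of a run).
warp analyze warp-get-2024-10-17.csv.zst

# Restrict the analysis to a single operation type, e.g. PUT only.
warp analyze --analyze.op=PUT warp-get-2024-10-17.csv.zst
```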
Hi @javieramirez1, detailed information: From a node in the cluster that runs noobaa:
Note: For the
Note: without this change, running it would result in a timeout error during the preparing step of warp, for example:
From a client node: although the run got stuck a couple of times (broken pipe, not related to the run), in one of the runs, about 25-30 minutes before it got stuck, I saw:
I looked at the logs and saw that noobaa was down (in
I would add that I tried to search for ERROR and PANIC printings in the logs; I also didn't find anything in the events (
cc: @romayalon
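A sketch of the kind of log/event search described above, assuming the noobaa logs are available as a plain file and/or through journald; the log path, unit name, and time window are assumptions that depend on the deployment:

```sh
# Search the collected noobaa log for error/panic printings
# (log path is an assumption; use the actual collected log file).
grep -E "ERROR|PANIC" /var/log/noobaa.log

# If the endpoints are managed by systemd, search the journal around the time
# of the failure (unit name and time window are assumptions).
journalctl -u noobaa --since "2024-10-17 01:54" --until "2024-10-17 02:20" | grep -E "ERROR|PANIC"
```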
Hi @javieramirez1, today I received from you a log file of size 7.29 GB.
I would add that I looked at these statistics and didn't see any errors:
If you can, please give more details related to:
the PUT part that was not performed - on which side: warp or noobaa? Maybe you have a timestamp or a specific printing you observed in the logs that you can share?
cc: @romayalon
Other runs were made in which the noobaa logs on both CES nodes (2 were used) show that PUT operations were performed. Between these operations nothing out of the ordinary or any error is seen, and in the entire log no panics, failures or other errors are observed. However, comparing with previous warp behavior, an 8-hour workload with concurrency = 500 should not report "Operation: PUT. Concurrency: 0" - the results should not show that the concurrency of PUT operations is 0.
Hi @javieramirez1, At this point, we don't think the issue is related to versioning.
I would mention that in all our reproductions, we didn't see any errors in the noobaa logs or events (as mentioned in the comment above).
Hi @javieramirez1, we suggest using:

```js
process.stderr.write('PANIC: ' + message + (err.stack || err) + '\n', () => {
    process.exit(1);
});
```

instead of:
to make sure that the error message is printed before the process exits. We noticed that the error was related to open files, for example:
Therefore, as a workaround, I can suggest setting a higher limit on the number of open files.
cc: @romayalon @nadavMiz
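A sketch of how the open-files limit could be inspected and raised as part of this workaround; the process match pattern, unit name, and limit value are assumptions that depend on the deployment:

```sh
# Current soft limit for the shell.
ulimit -n

# Limit of a running noobaa process (the pgrep pattern is an assumption).
cat /proc/$(pgrep -f noobaa | head -n1)/limits | grep "open files"

# For a systemd-managed service, the limit can be raised with a drop-in file,
# e.g. /etc/systemd/system/<noobaa unit>.service.d/limits.conf containing:
#   [Service]
#   LimitNOFILE=65535
# followed by:
#   systemctl daemon-reload && systemctl restart <noobaa unit>
```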
Hi @javieramirez1,
cc: @romayalon @nadavMiz
Environment info
c83f2-dan8-hs200.test.net: noobaa-core-5.17.0-20241016.el9.x86_64
c83f2-dan10-hs200.test.net: noobaa-core-5.17.0-20241016.el9.x86_64
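For reference, the package version listed above can be collected on each node with a standard rpm query:

```sh
rpm -q noobaa-core
```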
Actual behavior
1. The following workload was executed with a bucket previously put in versioning-suspended mode (the errors are also seen when executing warp stat and warp list); a sketch of suspending versioning follows the command below.
warp get --host=172.20.100.6{0...9}:6443 --access-key="$access_key" --secret-key="$secret_key" --obj.size=1k --concurrent=1000 --duration=30m --bucket=bucket1$j --insecure --tls
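A minimal sketch of suspending versioning on the test bucket before the run, assuming the standard S3 API via the AWS CLI; the endpoint URL and bucket name are placeholders derived from the command above:

```sh
aws s3api put-bucket-versioning \
    --bucket bucket1 \
    --versioning-configuration Status=Suspended \
    --endpoint-url https://172.20.100.60:6443 \
    --no-verify-ssl
```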
Expected behavior
1. The workload completes without problems.
Steps to reproduce
1. warp get --host=172.20.100.6{0...9}:6443 --access-key="$access_key" --secret-key="$secret_key" --obj.size=1k --concurrent=1000 --duration=30m --bucket=bucket1$j --insecure --tls
More information - Screenshots / Logs / Other output
Cluster time at which the workload started:
Thu Oct 17 01:54:35 AM EDT 2024
The failures were seen after 19 minutes of execution.
The log is too large, so I'll add it to the Slack channel.