
NSFS | Concurrency Tests | Warp fails with error - connect: connection refused #8471

Open
javieramirez1 opened this issue Oct 17, 2024 · 9 comments

@javieramirez1

Environment info

  • NooBaa Version:
    c83f2-dan8-hs200.test.net: noobaa-core-5.17.0-20241016.el9.x86_64
    c83f2-dan10-hs200.test.net: noobaa-core-5.17.0-20241016.el9.x86_64

Actual behavior

1. The following workload was executed against a bucket previously put in suspend mode (the errors are also seen when executing warp stat and warp list):
warp get --host=172.20.100.6{0...9}:6443 --access-key="$access_key" --secret-key="$secret_key" --obj.size=1k --concurrent=1000 --duration=30m --bucket=bucket1$j --insecure --tls

warp: <ERROR> download error:Get "https://172.20.100.66:6443/bucket11/RS12F%28xN/3.VOaX3zJIp805QLN%29.rnd": dial tcp 172.20.100.66:6443: connect: connection refused
warp: <ERROR> download error:Get "https://172.20.100.67:6443/bucket11/AE5otFFM/1.6WqmwEJd0re19jG6.rnd": dial tcp 172.20.100.67:6443: connect: connection refused
warp: <ERROR> download error:Get "https://172.20.100.67:6443/bucket11/UgiXqpIj/5.Wct8ORJhXx5iyO20.rnd": dial tcp 172.20.100.67:6443: connect: connection refused
warp: <ERROR> download error:Get "https://172.20.100.64:6443/bucket11/YzjuGWbj/1.gHPA46P9I3pKkUO0.rnd": dial tcp 172.20.100.64:6443: connect: connection refused
----------------------------------------
Operation: PUT. Concurrency: 1000
* Average:0.09 MiB/s, 93.16 obj/s

Throughput by host:
 * https://172.20.100.60:6443: Avg: 0.01 MiB/s, 10.09 obj/s
 * https://172.20.100.61:6443: Avg: 0.01 MiB/s, 9.58 obj/s
 * https://172.20.100.62:6443: Avg: 0.01 MiB/s, 9.50 obj/s
 * https://172.20.100.63:6443: Avg: 0.01 MiB/s, 9.58 obj/s
 * https://172.20.100.64:6443: Avg: 0.01 MiB/s, 9.07 obj/s
 * https://172.20.100.65:6443: Avg: 0.01 MiB/s, 9.46 obj/s
 * https://172.20.100.66:6443: Avg: 0.01 MiB/s, 8.18 obj/s
 * https://172.20.100.67:6443: Avg: 0.01 MiB/s, 8.48 obj/s
 * https://172.20.100.68:6443: Avg: 0.01 MiB/s, 9.93 obj/s
 * https://172.20.100.69:6443: Avg: 0.01 MiB/s, 8.69 obj/s

Throughput, split into 22 x 1s:
 * Fastest:706.5KiB/s, 724.46 obj/s
 * 50% Median:46.4KiB/s, 48.00 obj/s
 * Slowest:46.2KiB/s, 47.88 obj/s

----------------------------------------
Operation: GET. Concurrency: 1000
Errors:614
* Average:5.97 MiB/s, 6265.07 obj/s

Throughput by host:
 * https://172.20.100.60:6443: Avg: 0.59 MiB/s, 622.45 obj/s
 * https://172.20.100.61:6443: Avg: 0.60 MiB/s, 632.92 obj/s, 125 errors
Errors:125
 * https://172.20.100.62:6443: Avg: 0.59 MiB/s, 620.88 obj/s
 * https://172.20.100.63:6443: Avg: 0.60 MiB/s, 630.26 obj/s
 * https://172.20.100.64:6443: Avg: 0.60 MiB/s, 627.16 obj/s, 119 errors
Errors:119
 * https://172.20.100.65:6443: Avg: 0.60 MiB/s, 626.47 obj/s
 * https://172.20.100.66:6443: Avg: 0.59 MiB/s, 622.15 obj/s, 128 errors
Errors:128
 * https://172.20.100.67:6443: Avg: 0.59 MiB/s, 618.20 obj/s, 121 errors
Errors:121
 * https://172.20.100.68:6443: Avg: 0.60 MiB/s, 632.47 obj/s
 * https://172.20.100.69:6443: Avg: 0.60 MiB/s, 632.12 obj/s, 121 errors
Errors:121

Throughput, split into 359 x 5s:
 * Fastest:7.3MiB/s, 7702.89 obj/s
 * 50% Median:6.0MiB/s, 6314.32 obj/s
 * Slowest:0.00 obj/s

Expected behavior

1. The workload completes without problems.

Steps to reproduce

1.warp get --host=172.20.100.6{0...9}:6443 --access-key="$access_key" --secret-key="$secret_key" --obj.size=1k --concurrent=1000 --duration=30m --bucket=bucket1$j --insecure --tls

More information - Screenshots / Logs / Other output

Cluster time at which the workload started: Thu Oct 17 01:54:35 AM EDT 2024
The failures were seen after 19 minutes of execution.
The log is too large, so I'll add it to the Slack channel.

@ramya-c3

@guymguym @romayalon Do we have any limit on the number of connections?

@romayalon
Contributor

@javieramirez1 where are the logs of this run?

@romayalon romayalon changed the title Suspend mode connect: connection refused NSFS | Concurrency Tests | Versioning | Warp fails Suspend mode with error - connect: connection refused Oct 27, 2024
@javieramirez1
Author

In order to send a cleaner log, an 8-hour retest was performed with the latest rpm (1026). The following warp run was executed:
warp get --host=172.20.100.6{0...9}:6443 --access-key="$access_key" --secret-key="$secret_key" --obj.size=1k --concurrent=500 --duration=8h --bucket=bucket18 --insecure --tls

It seems that the first issue is already resolved (connection refused errors in GET operations no longer occur), but now I see a different one: for some reason the PUT part is not performed. Warp says it is skipping PUT because it has too few samples, but this is a workload of 500 connections for 8 hours, so that should not happen.


Operation: PUT. Concurrency: 0
Skipping PUT too few samples. Longer benchmark run required for reliable results.


Operation: GET. Concurrency: 500

  • Average:6.18 MiB/s, 6476.31 obj/s

Throughput by host:

Throughput, split into 95 x 5m0s:

  • Fastest:6.6MiB/s, 6915.12 obj/s
  • 50% Median:6.2MiB/s, 6534.52 obj/s
  • Slowest:4.7MiB/s, 4889.34 obj/s

I attached the log via Slack.
The log was trimmed so that it starts at the beginning of the test.

@shirady
Contributor

shirady commented Oct 30, 2024

Hi @javieramirez1,
Regarding the original issue that was opened ("connection refused"): it also happened to me yesterday when I ran the warp test on a GPFS machine, and what caused it was that noobaa was down. I don't think it is related in any way to the bucket being in suspended mode, only that noobaa was starting up at a certain point, which led to the connection being refused during startup.

Detailed Information:
Testing on GPFS machine:

From a node in the cluster that runs noobaa:

  1. I changed the code on one of the nodes before running the test: vi /usr/local/noobaa-core/config.js (change config.NSFS_RENAME_RETRIES = 10; to config.NSFS_RENAME_RETRIES = 1000;)

Note: for --concurrent=1000 we increase the number of retries (we use a retry mechanism in case of competing threads).

  2. Change the config.json at /ibm/fs1/cessharedroot/ces/s3-config/config.json and add the values "UV_THREADPOOL_SIZE": 16 and "ENDPOINT_FORKS": 32 (synced with @nadavMiz; as I understand it, this is part of the setup used in previous issues). See the sketch after this list for both config edits.

Note: without this change, the run would fail with a timeout error during the warp prepare step, for example:

warp: upload error: Put "https:///bucket-02/w%28chStmR/3.4uZB%28P%29PrYWLMmAr.rnd": dial tcp 10.11.87.62:6443: i/o timeout

  3. mms3 config change DEBUGLEVEL="all" for the restart.

From a client node:
4. Create an account: noobaa-cli account add --name warp-shira --new_buckets_path /ibm/fs1/teams/ --uid 1001 --gid 1001 --fs_backend GPFS
5. Create an alias for the account (based on the existing account):
alias s3-u1='AWS_ACCESS_KEY_ID=<> AWS_SECRET_ACCESS_KEY=<> aws --no-verify-ssl --endpoint https://10.11.87.62:6443/'
Check the connection (by trying to list the buckets of the account): s3-u1 s3 ls; echo $?
6. Create a bucket: s3-u1 s3 mb s3://bsw-01
7. Put the bucket in suspended mode: s3-u1 s3api put-bucket-versioning --bucket bsw-01 --versioning-configuration Status=Suspended
8. Run the warp command: cd warp;
./warp get --host=10.11.87.62:6443 --access-key="<>" --secret-key="<>" --obj.size=1k --concurrent=1000 --duration=30m --bucket=bsw-01 --insecure --tls (I ran it with a single host).
Note: I ran it a couple of times; the logs are from the run where the bucket name is bsw-05 (not bsw-01).
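
For reference, the config edits from steps 1 and 2 boil down to something like the following sketch (the paths and key names are taken from those steps; sed and jq are just one way to apply them, and editing the files by hand works the same):

# Step 1 (sketch): raise the NSFS rename retries in the noobaa-core config.
sed -i 's/config.NSFS_RENAME_RETRIES = 10;/config.NSFS_RENAME_RETRIES = 1000;/' /usr/local/noobaa-core/config.js

# Step 2 (sketch): add the threadpool/forks settings to the CES s3 config.
CFG=/ibm/fs1/cessharedroot/ces/s3-config/config.json
jq '. + {"UV_THREADPOOL_SIZE": 16, "ENDPOINT_FORKS": 32}' "$CFG" > "$CFG.tmp" && mv "$CFG.tmp" "$CFG"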

Although the run got stuck a couple of times (broken pipe, not related to the run), in one of the runs, about 25 minutes into the 30-minute duration (before it got stuck), I saw:

warp: download error: Get "https:///bsw-05/ZslyUWhX/5.32L7b88aBA4eX%29pT.rnd": dial tcp : connect: connection refused

I looked at the logs and I saw that noobaa was down (in /var/log/messages):

Oct 29 07:18:34 <node-name> systemd[1]: noobaa.service: Failed with result 'exit-code'.
Oct 29 07:18:34 <node-name> systemd[1]: noobaa.service: Consumed 33min 7.469s CPU time.
...
Oct 29 07:18:36 <node-name> systemd[1]: noobaa.service: Scheduled restart job, restart counter is at 2.
Oct 29 07:18:36 <node-name> systemd[1]: Stopped The NooBaa service..
Oct 29 07:18:36 <node-name> systemd[1]: noobaa.service: Consumed 33min 7.469s CPU time.
Oct 29 07:18:36 <node-name> systemd[1]: Starting The NooBaa service....

I would add that I tried to search for ERROR and PANIC printings in /var/log/messages and didn't find any (using grep ERROR <log_file> | wc -l, for example).
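
For anyone repeating these checks, they boil down to something like the following sketch (noobaa.service is taken from the systemd lines above, and the time window is only an example):

# Count error/panic printings in the system log and look for service restart events.
grep -c ERROR /var/log/messages
grep -ci panic /var/log/messages
grep -E 'noobaa.service: (Failed|Scheduled restart)' /var/log/messages
# The same restart events can be read straight from the journal:
journalctl -u noobaa --since "2024-10-29 07:00" --until "2024-10-29 07:30" --no-pager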

I also didn't find anything in the events (/var/log/noobaa_events.log); I only found these:

Oct 29 06:48:10 <node-name> [3563230]: {"timestamp":"2024-10-29T13:48:10.075Z","host":"<node-name>","event":{"code":"noobaa_bucket_created","message":"Bucket created, value: bsw-05","description":"Noobaa bucket created","entity_type":"NODE","event_type":"INFO","scope":"NODE","severity":"INFO","state":"HEALTHY","arguments":{"bucket_name":"bsw-05"},"pid":3563230}}
Oct 29 07:00:35 <node-name> [3633270]: {"timestamp":"2024-10-29T14:00:35.744Z","host":"<node-name>","event":{"code":"noobaa_started","message":"Noobaa started","description":"Noobaa started running","entity_type":"NODE","event_type":"STATE_CHANGE","scope":"NODE","severity":"INFO","state":"HEALTHY","pid":3633270}}

cc: @romayalon

@shirady
Contributor

shirady commented Oct 30, 2024

Hi @javieramirez1,
Regarding the second run that you mention in the comment, I would like more details for the investigation.

I received from you today a log file of 7.29 GB.
I searched it for many keywords and didn't find anything interesting, for example:

➜  ~ grep ERROR  /Users/shiradymnik/Downloads/suspended.log
➜  ~ grep panic -i /Users/shiradymnik/Downloads/suspended.log
➜  ~ grep fail -i /Users/shiradymnik/Downloads/suspended.log

I would add that I saw these statistics and I didn't see any errors:

Oct 29 00:31:32 c83f2-dan10 [3852308]: [nsfs/3852308]    [L1] core.server.system_services.stats_aggregator:: standalon_update_nsfs_stats. nsfs_stats = { fs_workers_stats: { stat: { count: 6113, error_count: 0, min_time: 51, max_time: 77332, sum_time: 15020751 }, fileopen: { count: 2038, error_count: 0, min_time: 14, max_time: 298, sum_time: 104759 }, filestat: { count: 2036, error_count: 0, min_time: 60, max_time: 3583, sum_time: 473412 }, fileread: { count: 4073, error_count: 0, min_time: 4, max_time: 1105, sum_time: 99499 }, fileclose: { count: 2037, error_count: 0, min_time: 5, max_time: 271, sum_time: 39491 } }, op_stats: { head_object: { count: 2038, error_count: 0, min_time: 1, max_time: 89, sum_time: 45311 }, read_object: { count: 2037, error_count: 0, min_time: 2, max_time: 250, sum_time: 123179 } }, io_stats: { read_count: 2037, read_bytes: 2037000 } }

Could you please give more details related to:

the put part is not performed, it says that it is skipping put because it has few samples but it is a workload of 500 connections for 8 hours so it should not happen

On which side was the PUT part not performed: warp or noobaa?
noobaa should get the requests and, based on parsing the HTTP message, execute the action according to the namespace/bucketspace (in CES it is always FS).

Do you have a timestamp or a specific printing in the logs you observed that you can share?
At this point, I don't understand the issue you are trying to point out; please share more information if you can.

cc: @romayalon

@javieramirez1
Author

Other runs were made in which the noobaa logs on both CES nodes (2 were used) show that PUT operations were performed. Between these operations nothing out of the ordinary or any error is seen, and in the entire log no panics, failures, or other errors are observed. However, compared with previous warp behavior, a workload of 8 hours with concurrency = 500 shows:

Operation: PUT. Concurrency: 0
Skipping PUT too few samples. Longer benchmark run required for reliable results.

The results should not show that the concurrency for PUT operations is 0 while the noobaa logs show that PUT operations are being performed. I do not know what could be happening.

@shirady
Contributor

shirady commented Nov 4, 2024

Hi @javieramirez1,
An update so it would be clear -
On Thursday, we were able to reproduce the error - connect: connection refused - using the instructions written in this comment. We also reproduced it with the warp command on a bucket without versioning suspended (versioning disabled), and in another run we reproduced it with the warp command using the --stress flag.

At this point, we don't think the issue is related to versioning.
We don't have the root cause yet, but the common factor is that we see these printings from systemd in the logs (noobaa is down, and therefore warp sees the connection as refused):

systemd[1]: noobaa.service: Failed with result 'exit-code'.
systemd[1]: noobaa.service: Consumed 33min 7.469s CPU time.
systemd[1]: noobaa.service: Scheduled restart job, restart counter is at 2.
systemd[1]: Stopped The NooBaa service..
systemd[1]: noobaa.service: Consumed 33min 7.469s CPU time.
systemd[1]: Starting The NooBaa service....

I would mention that in all our reproductions we didn't see any errors in the noobaa logs or events (as mentioned in the comment).
We also tried running it with top in the background on the node that receives the requests, and we didn't notice any unusual resource usage (including near the time the process exited).

@shirady
Contributor

shirady commented Nov 10, 2024

Hi @javieramirez1,
From what we observed, in some cases this creates the error - connect: connection refused - so we changed the panic function (in util/panic.js) a bit to have:

    process.stderr.write('PANIC: ' + message + (err.stack || err) + '\n', () => {
        process.exit(1);
    });

instead of:

process.exit(1);

to make sure that the error message is printed before the process exits.

We noticed that the error was related to open files, for example:

Nov 10 03:35:56 tmtscalets-protocol-1 node[3912820]: Nov-10 3:35:56.877 [nsfs/3912820] [ERROR] CONSOLE:: PANIC: process uncaughtException Error: EMFILE: too many open files, uv_resident_set_memory
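
A quick way to watch this building up while reproducing (a sketch, assuming the endpoint runs under the noobaa systemd service):

# Compare the endpoint's currently open file descriptors with its limit.
PID=$(systemctl show -p MainPID --value noobaa)
ls /proc/$PID/fd | wc -l
grep 'Max open files' /proc/$PID/limits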

Therefore, as a workaround, I suggest setting a higher LimitNOFILE value when running warp with a high --concurrent value (even LimitNOFILE=infinity, so that the number of open file descriptors is limited only by the OS resources rather than by the service definition).
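
For example, a minimal sketch of that workaround as a systemd drop-in (the unit name noobaa.service is taken from the logs above; the drop-in path is only an illustration):

# Raise the open-files limit for the noobaa service and restart it.
mkdir -p /etc/systemd/system/noobaa.service.d
printf '[Service]\nLimitNOFILE=infinity\n' > /etc/systemd/system/noobaa.service.d/limits.conf
systemctl daemon-reload
systemctl restart noobaa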

cc: @romayalon @nadavMiz

@shirady
Contributor

shirady commented Nov 13, 2024

Hi @javieramirez1,
I wanted to update you that we added some code changes related to this issue to avoid a situation where noobaa is down (you can see all the details and explanations in PR #8518).
As mentioned in the PR, those code changes will not change the behavior of open FDs.
For now, you can use the workaround of increasing the LimitNOFILE number on the cluster node that you're running against.
I'm changing the title since we understand that this is not a versioning issue.

cc: @romayalon @nadavMiz

@shirady shirady changed the title NSFS | Concurrency Tests | Versioning | Warp fails Suspend mode with error - connect: connection refused NSFS | Concurrency Tests | Warp fails with error - connect: connection refused Nov 13, 2024