Auto-resume upon disconnect #5
Comments
This happened again last night, so 7h of possible transfer time was not used:
For reference, it happened again now (but I caught it fairly quickly this time)
I'm just posting these for reference, so we have an idea about how common this is.
Once more: connection time out at 2021-03-24 14:21:17.50.
Do you have an estimate of how many transfers succeeded between failures? Just to get a feeling for "once in every N"?
:) I see your point. I don't have logs (a logfile flag would be another nice feature for etc), but from the terminal output on my screen I estimate that I transfer about 10 files in 4 minutes. From the log above, I had 3 timeouts in about 24h. This would imply a failure rate of 3/(10*24*60/4) = 0.08%. It is a tiny number indeed. But in some ways the rate is not so important. The problem is that if it fails just after I leave the office, I cannot resume it until the next day. This means the possible downtime due to the failure is 16h (assuming an 8h work day), which for a 24h possible transfer window means 16/24 = 67% of the available time is lost. So the severity depends heavily on the calculation method. However, in practice, the problem is not that I have to restart it often (I could do that); it is the long periods when no one is looking at the transfer (e.g. at night). So I'd prefer the second method of calculation, which unfortunately gives quite a high possible failure impact.
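The two estimates in the comment above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the function names are made up for illustration, and the inputs are the rough figures quoted in the comment (10 files per 4 minutes, 3 timeouts in 24h, 16h unattended).

```python
def per_file_failure_rate(files, minutes, timeouts, hours):
    """Timeouts divided by the estimated number of files moved in `hours`."""
    files_per_minute = files / minutes
    files_total = files_per_minute * hours * 60
    return timeouts / files_total


def downtime_fraction(idle_hours, window_hours):
    """Share of the transfer window lost if a failure goes unnoticed."""
    return idle_hours / window_hours


# Figures taken from the comment above.
rate = per_file_failure_rate(files=10, minutes=4, timeouts=3, hours=24)
lost = downtime_fraction(idle_hours=16, window_hours=24)

print(f"per-file failure rate: {rate:.2%}")   # 0.08%
print(f"worst-case time lost:  {lost:.0%}")   # 67%
```

The first number measures how often a transfer fails; the second measures how much a single unattended failure can cost, which is the figure that motivates the request.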
? |
Yeah, that could work. I did something similar for m5copy where I had other issues. But it would be so nice to just be able to run one command and not have to write wrapper scripts for each transfer :). I deal with a few transfers every week, and although it is trivial, it would just be nice to have the transfer software handle all the issues for me :).
It is a very useful suggestion you make and it will arrive in the code at some point. But for your current ailment this poor man's solution could get you up to 24h usage cycles. Add in a bit of "output redirecting" and massaging and you have your logging facility too, skipping the pesky "Oh I already did this file" messages.
Thank you for the "bashing", it'll make me happy for now :).
Can you try this branch:
Not sure exactly how to test this, but I tried to run the "etd" from the compile-issues branch on one machine, and the new issue-5 etc from another machine. Seems to fail, although not sure why.
etd log:
And just to confirm: running etc from compile-issues, not from the issue-5 branch, works just fine with the command above.
Note regarding the proposed bash-while workaround: as stated, the suggested bash code has two problems. a) It did not run when I put my etc command where I thought it should be, but that may just be due to bash incompetence on my part. b) After spending some time getting it working (using eval instead to execute the command string), it no longer handles ^C, since the loop eats it. I can of course work around this, but then I'm spending time coding workarounds, which is exactly what I wanted to avoid :). For now, I settled on a bash script which just has the same "eval" call 10 times, no loops or anything, to catch the few timeouts that will happen. Simple enough to just work (tm) for now.
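For readers who hit the same ^C problem, a retry wrapper can also be sketched in Python. This is not part of etc and is not the bash snippet discussed in this thread. It assumes, without confirmation from the thread, that etc exits with a non-zero status after "connection time out" and with 0 on success. The name `run_with_retries` and its parameters are hypothetical; the defaults mirror the "5 times with 1 minute delay" idea from the issue description.

```python
import subprocess
import sys
import time


def run_with_retries(cmd, max_attempts=5, delay_s=60.0,
                     runner=subprocess.call, sleep=time.sleep):
    """Run `cmd` until it exits with 0 or `max_attempts` is reached.

    ASSUMPTION: a non-zero exit status means the transfer failed and is
    worth retrying. Returns the exit status of the last attempt.
    KeyboardInterrupt is deliberately not caught, so Ctrl-C ends the
    whole loop instead of only the current attempt.
    """
    code = 1
    for attempt in range(1, max_attempts + 1):
        # `cmd` is a list, so no shell is involved and a quoted glob such
        # as '/mnt/vbsmnt/vo1074_oe*' reaches the program unexpanded.
        code = runner(cmd)
        if code == 0:
            return 0
        print(f"attempt {attempt}/{max_attempts} exited with status {code}",
              file=sys.stderr)
        if attempt < max_attempts:
            sleep(delay_s)
    return code
```

It could be called as `run_with_retries(["etc", "/mnt/vbsmnt/vo1074_oe*", "TARGETIP#PORT:/gpfs/cdata/incoming/oe-test/", "--resume", "-v", "-m", "3"])`. Because `--resume` is passed on every attempt, each retry should continue where the previous one stopped, just like pressing "up-arrow, enter" by hand.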
Note: This is now very nearly solved, except for possible unusual cases like #12.
Closing this one for now, as this is continued in issue #12.
I want to transfer data and I run e.g.
etc '/mnt/vbsmnt/vo1074_oe*' TARGETIP#PORT:/gpfs/cdata/incoming/oe-test/ --resume -v -m 3
This runs fine, until it suddenly doesn't, due to the "connection time out." error. @haavee theorised that this may happen because UDT is based on UDP: if an important packet (e.g. the connection setup or the reply to it) is lost, UDT cannot recover. This may be the reason, and it may be a rare problem, but it is unfortunate because then no data is transferred. The fix is to simply restart the transfer, i.e. I press "up-arrow, enter" in my terminal. Now etc reconnects, resumes the transfer and all works fine. Until the next time it breaks.
I would like to be able to start etc with an auto-reconnect option, e.g. "--autorecon", which would try to re-connect automatically upon "connection time out". It could, for example, try to auto-reconnect 5 times with a 1 minute delay between tries. In this way, transfers will not sit idle due to an unlucky interruption.
Because in practice we start the transfer and then go do something else, there is a high risk of the transfer failing silently in this way. Therefore, this auto-reconnect feature would be good.