pySpark job freezes #5

Open
zrtvwp opened this issue Oct 5, 2022 · 1 comment

zrtvwp commented Oct 5, 2022

For the third time in a row, the job hangs at the same place. Sometimes it just freezes, and sometimes it ends up flooding the logs with org.apache.spark.network.server.TransportChannelHandler errors. It also bothers me that the bucket grows after each restart of the job, and it is not clear to me whether it only downloads what was unavailable last time, or whether some action repeats and downloads the same files again.

log1

total   - success: 0.851 - failed to download: 0.133 - failed to resize: 0.016 - images per sec: 3758 - count: 9976071
worker  - success: 0.845 - failed to download: 0.139 - failed to resize: 0.016 - images per sec: 24 - count: 10000
total   - success: 0.851 - failed to download: 0.133 - failed to resize: 0.016 - images per sec: 3755 - count: 9986071

log2

total   - success: 0.851 - failed to download: 0.133 - failed to resize: 0.016 - images per sec: 3728 - count: 9972142
worker  - success: 0.847 - failed to download: 0.134 - failed to resize: 0.019 - images per sec: 1 - count: 10000
total   - success: 0.851 - failed to download: 0.133 - failed to resize: 0.016 - images per sec: 580 - count: 9982142

log3

total   - success: 0.851 - failed to download: 0.133 - failed to resize: 0.016 - images per sec: 3747 - count: 9892144
worker  - success: 0.853 - failed to download: 0.133 - failed to resize: 0.014 - images per sec: 27 - count: 10000
total   - success: 0.851 - failed to download: 0.133 - failed to resize: 0.016 - images per sec: 3741 - count: 9902144
worker  - success: 0.852 - failed to download: 0.130 - failed to resize: 0.018 - images per sec: 27 - count: 10000
total   - success: 0.851 - failed to download: 0.133 - failed to resize: 0.016 - images per sec: 3742 - count: 9912144
worker  - success: 0.852 - failed to download: 0.133 - failed to resize: 0.015 - images per sec: 27 - count: 10000
total   - success: 0.851 - failed to download: 0.133 - failed to resize: 0.016 - images per sec: 3745 - count: 9922144
worker  - success: 0.847 - failed to download: 0.138 - failed to resize: 0.015 - images per sec: 27 - count: 10000
total   - success: 0.851 - failed to download: 0.133 - failed to resize: 0.016 - images per sec: 3740 - count: 9932144
worker  - success: 0.855 - failed to download: 0.127 - failed to resize: 0.018 - images per sec: 25 - count: 10000
total   - success: 0.851 - failed to download: 0.133 - failed to resize: 0.016 - images per sec: 3740 - count: 9942144
worker  - success: 0.849 - failed to download: 0.135 - failed to resize: 0.016 - images per sec: 23 - count: 10000
total   - success: 0.851 - failed to download: 0.133 - failed to resize: 0.016 - images per sec: 3738 - count: 9952144
worker  - success: 0.848 - failed to download: 0.137 - failed to resize: 0.016 - images per sec: 27 - count: 10000
total   - success: 0.851 - failed to download: 0.133 - failed to resize: 0.016 - images per sec: 3731 - count: 9962144
worker  - success: 0.844 - failed to download: 0.141 - failed to resize: 0.015 - images per sec: 24 - count: 10000
total   - success: 0.851 - failed to download: 0.133 - failed to resize: 0.016 - images per sec: 3728 - count: 9972144
worker  - success: 0.852 - failed to download: 0.130 - failed to resize: 0.018 - images per sec: 21 - count: 10000
total   - success: 0.851 - failed to download: 0.133 - failed to resize: 0.016 - images per sec: 3624 - count: 9982144
22/10/04 18:52:28 WARN org.apache.spark.network.server.TransportChannelHandler: Exception in connection from /10.128.15.220:53598
java.io.IOException: Connection timed out
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
	at sun.nio.ch.IOUtil.read(IOUtil.java:192)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
	at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:253)
	at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1133)
	at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:350)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:750)

[screenshot attached: Screenshot 2022-10-05 at 10 40 55]

mwbyeon self-assigned this Oct 5, 2022
Collaborator

mwbyeon commented Oct 5, 2022

@zrtvwp
img2dataset supports an incremental mode: on restart, it downloads only the shards that were not downloaded yet.

https://github.com/rom1504/img2dataset#api

incremental_mode: Can be "incremental" or "overwrite". For "incremental", img2dataset will download all the shards that were not downloaded, for "overwrite" img2dataset will delete recursively the output folder then start from zero (default incremental)
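
For reference, here is a minimal sketch of what a restarted run could look like with the PySpark distributor. The bucket paths and most parameter values are placeholders, not taken from this issue:

```python
from img2dataset import download

# All paths and sizes below are illustrative assumptions.
download(
    url_list="gs://my-bucket/metadata/",      # input metadata (placeholder path)
    output_folder="gs://my-bucket/images/",   # output shards (placeholder path)
    input_format="parquet",
    output_format="webdataset",
    image_size=256,
    distributor="pyspark",                    # run shards as Spark tasks
    incremental_mode="incremental",           # default: skip shards that already completed
)
```

With incremental_mode="incremental", a rerun should only process shards whose output is missing, which is why the output bucket keeps growing after a restart instead of being rewritten from scratch.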

That exception message can be caused by a preemptible secondary worker instance being terminated.
In that case, Dataproc will automatically restart the instance.
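
This is not a fix confirmed in this thread, but on clusters with preemptible workers one common mitigation is to give Spark more tolerance for executors that disappear and come back. A hedged sketch: the config keys are standard Spark settings, while the values and app name are assumptions:

```python
from pyspark.sql import SparkSession

# Illustrative values only; tune for your cluster.
# The heartbeat interval must stay well below spark.network.timeout.
spark = (
    SparkSession.builder
    .appName("img2dataset-download")
    .config("spark.network.timeout", "600s")
    .config("spark.executor.heartbeatInterval", "60s")
    .getOrCreate()
)
```

If a Spark session like this already exists before download() is called, the pyspark distributor should reuse it, so these settings apply to the download tasks as well.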

The hang issue seems to be related to rom1504/img2dataset#187,
but I'm not sure how to fix it yet :(

There seems to be a problem with Spark's task scheduling.
I think that if you run the job again, it will resume downloading in incremental mode.
