chunking takes a very long time #157

Open
wndywllms opened this issue Nov 11, 2016 · 3 comments

Comments

@wndywllms
Collaborator

For the full band (4ch 4s) the initial chunking took me over 12 hrs. After turning on compression and switching to a ramdisk (/dev/shm) for dir_local, instead of the local disk on the node, I got it down to roughly 7 hrs. That still seems a bit extreme. (I did have to limit the number of chunking tasks running simultaneously per frequency band.) I've tweaked the chunk size to produce 8 chunks (an integer multiple of the thread limit on IO-heavy tasks, thread_io), so it is now producing chunks of ~1.5G (pre-compression was 1.5G). The work dir is on a large shared disk.
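For anyone tuning this: the "integer multiple of thread_io" arithmetic above can be sketched as below. This is just an illustration of the sizing logic, not Factor's actual code; the function and parameter names are made up for the example.

```python
import math

def pick_n_chunks(total_size_gb, target_chunk_gb, thread_io):
    """Pick a chunk count close to total/target, rounded up to an
    integer multiple of thread_io so every IO "wave" runs a full
    set of simultaneous chunking tasks (illustrative sketch only)."""
    n = math.ceil(total_size_gb / target_chunk_gb)
    # round up to the next multiple of thread_io
    return ((n + thread_io - 1) // thread_io) * thread_io

# e.g. ~12 GB of data, ~1.5 GB target chunks, thread_io = 8 -> 8 chunks
print(pick_n_chunks(12, 1.5, 8))
```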

@darafferty
Collaborator

That does seem slow. The chunking script could likely be improved quite a bit, as it does a lot of copying of columns. I'll take a look at it.

Another issue is that the chunking is limited to a single node, so it can't take advantage of multiple nodes of a cluster. We could get around this by making a "chunking pipeline" or perhaps by moving the whole chunking operation into the initial-subtract pipeline.
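One way to cut down the column-by-column copying would be to copy contiguous row ranges instead, which also makes it easy to farm the chunks out to multiple nodes. A minimal, generic sketch of computing the per-chunk row ranges (illustrative only; this is not the existing chunking script):

```python
def chunk_row_ranges(n_rows, n_chunks):
    """Split n_rows into n_chunks contiguous (start, count) ranges,
    spreading any remainder over the first few chunks.
    Each range could then be copied out as one chunk (e.g. with a
    single row-selection copy per chunk, possibly one per node)."""
    base, rem = divmod(n_rows, n_chunks)
    ranges = []
    start = 0
    for i in range(n_chunks):
        count = base + (1 if i < rem else 0)
        ranges.append((start, count))
        start += count
    return ranges

# 10 rows into 3 chunks -> sizes 4, 3, 3
print(chunk_row_ranges(10, 3))
```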

@wndywllms
Collaborator Author

Using multiple nodes would be quite useful here. I've been using only one node for the init subtract (deep) but 3-4 for Factor, so for me it would go faster if you pipelined the chunking in Factor.

@AHorneffer
Contributor

Doing the chunking in the initial-subtract pipeline would be possible. I'm not too fond of this, because it would make the initial-subtract pipeline even more of a "Factor pipeline", but it probably already is one, so there is no real harm done.

Another question is whether the chunking part of Factor would speed up if the input data were already compressed with dysco.

Wendy: Is the chunking limited by CPU speed or by IO speed?
