Improvements for transpose, and more. #1624
Merged
`Transpose` is improved by forcing the use of our multi-threaded code (preexisting) instead of `Eigen::`'s. Checked to be 4+ times faster. As `Transpose` is used in several GDL functions, this speeds them up as well. The previous choice of `Eigen::` seems strange in retrospect, but it was based on actual measurements at the time; optimizations have been added since. I've added a command-line switch (`--with-eigen-transpose`) to re-enable `Eigen::Transpose` in case it proves faster on some architectures, or after `Eigen::` makes progress.

Second, this version permits, via another switch (`--smart-tpool`), a thread-pool mode where, when threads are available, `TPOOL_MIN_ELTS` is also, more or less, the number of elements each thread will process, so that GDL may use fewer threads than the machine provides (some machines running GDL have 64 or more cores). Obviously it is not worth starting 128 threads if 10 would already do the job in time. To get more concurrent threads, decrease `TPOOL_MIN_ELTS`; conversely, increase it, and so find the optimum for a specific case. May be a cure for #1149?

On the other hand, it is not always the number of elements processed by one thread that governs the overall time spent. The time spent per element, whether it is a simple addition or a long procedure, is also a key factor. The `parallelize()` function (in `basegdl.cpp`) accepts modifiers to change this behaviour. I've tweaked a few, but this is not very 'adaptive'; introspection will be needed.

Running GDL with `--smart-tpool` on machines with a large number of threads and testing it would be invaluable.