Run Regent on a distributed system #69

Open
crl123 opened this issue Nov 4, 2020 · 6 comments

@crl123

crl123 commented Nov 4, 2020

Good afternoon,
I am running Regent on my cluster of 9 nodes with the following parameters:
mpirun -np 9 -ppn 1 ./TaskBench/task-bench/regent/main.shard14 -steps 10 -type fft -kernel compute_bound -iter 1000000
It fails with the following error:
main.shard14: core.cc:588: void TaskGraph::execute_point(long int, long int, char*, size_t, const char**, const size_t*, size_t, char*, size_t) const: Assertion `input[i].second == dep' failed.

= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 6
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
Sometimes it fails with this error instead:
main.shard14: core.cc:565: void TaskGraph::execute_point(long int, long int, char*, size_t, const char**, const size_t*, size_t, char*, size_t) const: Assertion `offset <= point && point < offset+width' failed.

= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 6
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
I get the same failure with the tree type, but not with the stencil_1d type.
I compile Regent as follows:
DEFAULT_FEATURES=0 USE_REGENT=1 ./get_deps.sh
export CXX=mpicxx
export CC=mpicc
./build_all.sh
Thank you in advance for your help,

@elliottslaughter
Contributor

Hi @crl123,

This means that Task Bench is computing the wrong result. I'm a little confused, because I thought the Regent implementation was fully debugged.

I'm not expecting this to make a difference, but can you confirm what Task Bench branch/tag you're on?

I'll try to confirm on my end as well.

@crl123
Author

crl123 commented Nov 4, 2020

I'm on the 'origin/master' branch.
I updated the repository on my local machine this past Sunday.

@elliottslaughter
Contributor

Ok, I'm a bit swamped with things going on this week, but I'll try to find time to verify the Regent implementation on my own machine.

@elliottslaughter
Contributor

Sorry for taking so long to get back to this.

Looking back at your configuration here, I don't see any settings for the network. Typically you'd use something like:

export USE_GASNET=1
export CONDUIT=aries

Otherwise, what you're doing is running N copies of the single-node program, which is probably why this is misbehaving.
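
Combined with the build commands from the original post, a rebuild with the network enabled might look roughly like the sketch below. This is only a sketch under assumptions: `aries` is the example conduit from the comment above and should be replaced with whichever GASNet conduit matches the cluster's interconnect; setting the variables before both `get_deps.sh` and `build_all.sh` is a cautious guess rather than a documented requirement; and the guard around the build scripts is added here only so the snippet does nothing outside a Task Bench checkout.

```shell
# Network settings suggested in the comment above.
export USE_GASNET=1
export CONDUIT=aries   # assumption: substitute the conduit for your interconnect

# Compiler settings from the original post.
export CXX=mpicxx
export CC=mpicc

# Show what the build will see.
echo "USE_GASNET=$USE_GASNET CONDUIT=$CONDUIT CXX=$CXX CC=$CC"

# Rebuild from the top of the Task Bench checkout; skipped if the scripts are absent.
if [ -x ./get_deps.sh ] && [ -x ./build_all.sh ]; then
  DEFAULT_FEATURES=0 USE_REGENT=1 ./get_deps.sh
  ./build_all.sh
fi
```

If an earlier single-node build is still present, it may need to be cleaned out first so that the new settings take effect.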

@ysfess22

ysfess22 commented Oct 5, 2023

Hi @elliottslaughter. I have a further question about multi-node benchmarks.
Using GASNet the way you explained on a cluster with two nodes (udp conduit) creates double the number of tasks in the graph; half of the tasks are run by node 1 and the other half by node 2. Is that the expected behaviour, or is there a way to have the tasks split between nodes? E.g., given a 10x10 stencil graph, the 100 tasks would be split between the two nodes.

@elliottslaughter
Contributor

@ysfess22 Please submit this as a new issue unless it's specifically related to the original posting.

The answer will depend on how you have configured your system, and I will require more information, which will clog this thread if it's not specifically related.

3 participants