
I want to know the process that task-bench kernel executes on legion #98

Open
AngryBear2 opened this issue Oct 6, 2023 · 3 comments

@AngryBear2

Hello, I have some understanding of Task Bench at this point, but there are still many things that are not clear to me, so I would like to ask:

  1. It seems that the type option selects the dependence pattern of the task graph, and the kernel is the part that actually gets executed. For example, if I use the stencil pattern with compute_kernel, how is the generated Legion code partitioned? How are data dependencies handled across multiple nodes? Does each node split compute_kernel into many parts, or does each node execute the same compute_kernel?

  2. I am not familiar with how memory_kernel executes on Legion, so I hope you can advise me.

@elliottslaughter
Contributor

Legion has two main sets of partitions, which are defined here:

task-bench/legion/main.cc

Lines 477 to 478 in bf4ad19

std::vector<LogicalPartitionT<1> > primary_partitions;
std::vector<std::vector<std::vector<LogicalPartitionT<1> > > > secondary_partitions;

The primary partition is just an equal partition:

IndexPartitionT<1> primary_ip = runtime->create_equal_partition(ctx, is, ts);
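Conceptually, an equal partition just slices the 1-D index space into max_width same-sized pieces, one per point in the task graph. The following is a minimal sketch of that arithmetic in plain C++, independent of Legion; the helper name and signature are illustrative, not part of either API:

```cpp
#include <cassert>
#include <utility>

// Hypothetical helper: split the range [0, total) into `parts` roughly
// equal half-open subranges and return the one owned by `point`.
// This mirrors what create_equal_partition does for a 1-D index space:
// the first (total % parts) subranges get one extra element.
std::pair<long, long> equal_subrange(long total, long parts, long point) {
  long base = total / parts;    // minimum elements per subspace
  long extra = total % parts;   // first `extra` subspaces get one more
  long lo = point * base + (point < extra ? point : extra);
  long hi = lo + base + (point < extra ? 1 : 0);
  return {lo, hi};
}
```

So subspace i of the primary partition is simply the i-th equal slice of the field data, and a task at point i writes its output there.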

The secondary partitions are more complicated and encode, essentially, the dependence patterns:

task-bench/legion/main.cc

Lines 529 to 571 in bf4ad19

// Next create secondary partitions for dependencies
long ndsets = g.max_dependence_sets();
long max_deps = 0;
for (long dset = 0; dset < ndsets; ++dset) {
  for (long point = 0; point < g.max_width; ++point) {
    long deps = 0;
    for (auto dep : g.dependencies(dset, point)) {
      deps += dep.second - dep.first + 1;
    }
    max_deps = std::max(max_deps, deps);
  }
}
std::vector<std::vector<LogicalPartitionT<1> > > secondary_lps(ndsets);
for (long dset = 0; dset < ndsets; ++dset) {
  for (long ndep = 0; ndep < max_deps; ++ndep) {
    IndexPartitionT<1> secondary_ip = runtime->create_pending_partition(ctx, is, ts);
    for (long point = 0; point < g.max_width; ++point) {
      std::vector<IndexSpace> subspaces;
      long ninput = 0;
      bool done = false;
      for (auto dep : g.dependencies(dset, point)) {
        for (long i = dep.first; i <= dep.second; ++i) {
          if (ninput == ndep) {
            subspaces.push_back(runtime->get_index_subspace(ctx, primary_ip, i));
            done = true;
            break;
          }
          ++ninput;
        }
        if (done) break;
      }
      runtime->create_index_space_union(ctx, secondary_ip, point, subspaces);
    }
    LogicalPartitionT<1> secondary_lp = runtime->get_logical_partition(result_lr, secondary_ip);
    secondary_lps[dset].push_back(secondary_lp);
  }
}
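The tricky part of the loop above is picking, for each point, the ndep-th producer once all of its dependence intervals are flattened in order. That selection can be isolated as a small sketch in plain C++ (the Intervals alias and function name are illustrative stand-ins for g.dependencies and the inner loops, not Task Bench APIs):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-in for g.dependencies(dset, point): a list of
// closed intervals [first, second] of producer points.
using Intervals = std::vector<std::pair<long, long> >;

// Return the ndep-th producer point when the intervals are flattened
// in order, or -1 if this point has fewer than ndep+1 producers.
// This is what the inner loops do to choose one subspace of the
// primary partition per (point, ndep) pair.
long nth_dependence(const Intervals& deps, long ndep) {
  long ninput = 0;
  for (auto dep : deps) {
    for (long i = dep.first; i <= dep.second; ++i) {
      if (ninput == ndep) return i;
      ++ninput;
    }
  }
  return -1;  // fewer than ndep+1 inputs; the union stays empty
}
```

Subspace point of the ndep-th secondary partition is then the union over that selection, so a task can attach one region requirement per input.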

As a general rule, data in Task Bench is "fake" in that it does not encode "real" data, and the data is not consumed by any of the kernels (whether compute or memory or whatever). The data does contain information to encode where it's coming from so we can check it for correctness. But to a first approximation, you should completely separate in your mind the partitioning (which is related to the dependence pattern) and the kernels (which actually execute, but ignore all data).
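The provenance idea, writing where the data came from so a consumer can verify it received the right inputs, can be sketched like this. The struct layout and function names are illustrative, not Task Bench's actual encoding:

```cpp
#include <cassert>
#include <cstring>

// Sketch of "fake" data carrying provenance: each output buffer records
// which (timestep, point) produced it. The kernels never read this; it
// exists only so correctness checks can confirm the dependence pattern.
struct Provenance {
  long timestep;
  long point;
};

// Producer side: stamp the buffer with this task's coordinates.
void write_output(char* buf, long timestep, long point) {
  Provenance p{timestep, point};
  std::memcpy(buf, &p, sizeof p);
}

// Consumer side: check the buffer really came from the expected task.
bool check_input(const char* buf, long expect_ts, long expect_pt) {
  Provenance p;
  std::memcpy(&p, buf, sizeof p);
  return p.timestep == expect_ts && p.point == expect_pt;
}
```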

Kernels are never "partitioned", they just do what they're told. So if you run a compute kernel and tell it to execute N iterations, it will always execute N iterations, regardless of the size and shape of the graph.
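To make that concrete, a compute kernel is just a fixed amount of busy-work. The sketch below is in the spirit of Task Bench's compute kernel (a floating-point multiply-add loop), but the exact arithmetic and name are illustrative:

```cpp
// Illustrative compute kernel: always runs exactly `iterations` steps,
// regardless of the size or shape of the task graph. The FMA-style loop
// is a sketch, not a copy of Task Bench's actual kernel.
double execute_compute(long iterations) {
  double a = 1.0, b = 1.000001;
  for (long i = 0; i < iterations; ++i) {
    a = a * b + 0.000001;  // multiply-add busy-work
  }
  return a;  // returned so the loop cannot be optimized away
}
```

Passing a larger graph changes how many times this function is called, never what a single call does.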

Absolutely nothing in Task Bench is sensitive to the number of nodes. The graph is configured explicitly (via command line parameters) and it is up to each implementation to spread that graph out as best it sees fit. But the Task Bench core (used by each implementation) is oblivious to how many nodes/cores there are or how things are parallelized. You could just as easily make a sequential version of Task Bench that executes the same thing.
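The sequential version mentioned above would look roughly like this: walk the graph in timestep order, run the same kernel for every point, and let ordinary program order satisfy every dependence. All names here are hypothetical; this is a sketch, not code from the repository:

```cpp
// Hypothetical sequential "implementation" of a task graph: every task
// runs the same kernel, and executing timesteps in order trivially
// satisfies all dependencies. No knowledge of nodes or cores is needed.
long run_sequential(long timesteps, long width, long iterations) {
  volatile double scratch = 1.0;  // keeps the kernel loop from being elided
  long tasks_run = 0;
  for (long t = 0; t < timesteps; ++t) {
    for (long p = 0; p < width; ++p) {
      // Every task executes the same kernel for the same iteration count.
      for (long i = 0; i < iterations; ++i) scratch = scratch * 1.000001;
      ++tasks_run;
    }
  }
  return tasks_run;
}
```

A distributed implementation like the Legion one runs the same logical graph; it only differs in how the points of each timestep are spread across processors.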

Hope that helps.

@AngryBear2
Author

Thank you. From what you said, I understand that each node in the graph executes its own copy of the same kernel, rather than all nodes jointly executing one split-up kernel. How can I measure the data transfer time between dependent nodes?

@elliottslaughter
Contributor

There is a summary printed at the end which should include a bandwidth figure; but you may need to pass in the number of nodes (-nodes N) in order for it to accurately calculate the intra-node vs inter-node bandwidth.
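The bandwidth figure is, at heart, total bytes moved along graph edges divided by elapsed time. A minimal sketch of that arithmetic (the function and its parameters are illustrative; Task Bench's actual summary does per-edge accounting and, given -nodes N, classifies edges as intra-node or inter-node):

```cpp
// Illustrative bandwidth estimate: bytes moved across dependence edges
// per second of elapsed wall-clock time, in GB/s. All names here are
// hypothetical, not Task Bench's metrics code.
double bandwidth_gbps(long num_edges, long bytes_per_edge, double elapsed_s) {
  double total_bytes = static_cast<double>(num_edges) * bytes_per_edge;
  return total_bytes / elapsed_s / 1e9;
}
```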
