
I want to know the process that task-bench kernel executes on legion #98

Open
AngryBear2 opened this issue Oct 6, 2023 · 3 comments

@AngryBear2

Hello, I have some understanding of Task Bench at this point, but there are still many things that are not clear to me, so I would like to ask:

  1. It seems that the type option selects the dependence pattern of the task graph, and the kernel is the part that actually gets executed. For example, if I use the stencil pattern with compute_kernel, how is the generated Legion code partitioned? How are data dependencies handled across multiple nodes? Does each node split compute_kernel into many parts, or does each node execute the same compute_kernel?

  2. I am not familiar with how memory_kernel executes on Legion, so I hope you can advise me.

@elliottslaughter
Contributor

Legion has two main sets of partitions, which are defined here:

task-bench/legion/main.cc

Lines 477 to 478 in bf4ad19

std::vector<LogicalPartitionT<1> > primary_partitions;
std::vector<std::vector<std::vector<LogicalPartitionT<1> > > > secondary_partitions;

The primary partition is just an equal partition:

IndexPartitionT<1> primary_ip = runtime->create_equal_partition(ctx, is, ts);
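Conceptually, an equal partition just slices the 1-D index space into max_width same-sized pieces, one per point in the task graph. The following is a minimal sketch of that arithmetic in plain C++, independent of Legion; the helper name and signature are illustrative, not part of either API:

```cpp
#include <cassert>
#include <utility>

// Hypothetical helper: split the range [0, total) into `parts` roughly
// equal half-open subranges and return the one owned by `point`.
// This mirrors what create_equal_partition does for a 1-D index space:
// the first (total % parts) subranges get one extra element.
std::pair<long, long> equal_subrange(long total, long parts, long point) {
  long base = total / parts;    // minimum elements per subspace
  long extra = total % parts;   // first `extra` subspaces get one more
  long lo = point * base + (point < extra ? point : extra);
  long hi = lo + base + (point < extra ? 1 : 0);
  return {lo, hi};
}
```

So subspace i of the primary partition is simply the i-th equal slice of the field data, and a task at point i writes its output there.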

The secondary partitions are more complicated and encode, essentially, the dependence patterns:

task-bench/legion/main.cc

Lines 529 to 571 in bf4ad19

// Next create secondary partitions for dependencies
long ndsets = g.max_dependence_sets();
long max_deps = 0;
for (long dset = 0; dset < ndsets; ++dset) {
  for (long point = 0; point < g.max_width; ++point) {
    long deps = 0;
    for (auto dep : g.dependencies(dset, point)) {
      deps += dep.second - dep.first + 1;
    }
    max_deps = std::max(max_deps, deps);
  }
}
std::vector<std::vector<LogicalPartitionT<1> > > secondary_lps(ndsets);
for (long dset = 0; dset < ndsets; ++dset) {
  for (long ndep = 0; ndep < max_deps; ++ndep) {
    IndexPartitionT<1> secondary_ip = runtime->create_pending_partition(ctx, is, ts);
    for (long point = 0; point < g.max_width; ++point) {
      std::vector<IndexSpace> subspaces;
      long ninput = 0;
      bool done = false;
      for (auto dep : g.dependencies(dset, point)) {
        for (long i = dep.first; i <= dep.second; ++i) {
          if (ninput == ndep) {
            subspaces.push_back(runtime->get_index_subspace(ctx, primary_ip, i));
            done = true;
            break;
          }
          ++ninput;
        }
        if (done) break;
      }
      runtime->create_index_space_union(ctx, secondary_ip, point, subspaces);
    }
    LogicalPartitionT<1> secondary_lp = runtime->get_logical_partition(result_lr, secondary_ip);
    secondary_lps[dset].push_back(secondary_lp);
  }
}
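The tricky part of the loop above is picking, for each point, the ndep-th producer once all of its dependence intervals are flattened in order. That selection can be isolated as a small sketch in plain C++ (the Intervals alias and function name are illustrative stand-ins for g.dependencies and the inner loops, not Task Bench APIs):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-in for g.dependencies(dset, point): a list of
// closed intervals [first, second] of producer points.
using Intervals = std::vector<std::pair<long, long> >;

// Return the ndep-th producer point when the intervals are flattened
// in order, or -1 if this point has fewer than ndep+1 producers.
// This is what the inner loops do to choose one subspace of the
// primary partition per (point, ndep) pair.
long nth_dependence(const Intervals& deps, long ndep) {
  long ninput = 0;
  for (auto dep : deps) {
    for (long i = dep.first; i <= dep.second; ++i) {
      if (ninput == ndep) return i;
      ++ninput;
    }
  }
  return -1;  // fewer than ndep+1 inputs; the union stays empty
}
```

Subspace point of the ndep-th secondary partition is then the union over that selection, so a task can attach one region requirement per input.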

As a general rule, data in Task Bench is "fake" in that it does not encode "real" data, and the data is not consumed by any of the kernels (whether compute or memory or whatever). The data does contain information to encode where it's coming from so we can check it for correctness. But to a first approximation, you should completely separate in your mind the partitioning (which is related to the dependence pattern) and the kernels (which actually execute, but ignore all data).
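The provenance idea, writing where the data came from so a consumer can verify it received the right inputs, can be sketched like this. The struct layout and function names are illustrative, not Task Bench's actual encoding:

```cpp
#include <cassert>
#include <cstring>

// Sketch of "fake" data carrying provenance: each output buffer records
// which (timestep, point) produced it. The kernels never read this; it
// exists only so correctness checks can confirm the dependence pattern.
struct Provenance {
  long timestep;
  long point;
};

// Producer side: stamp the buffer with this task's coordinates.
void write_output(char* buf, long timestep, long point) {
  Provenance p{timestep, point};
  std::memcpy(buf, &p, sizeof p);
}

// Consumer side: check the buffer really came from the expected task.
bool check_input(const char* buf, long expect_ts, long expect_pt) {
  Provenance p;
  std::memcpy(&p, buf, sizeof p);
  return p.timestep == expect_ts && p.point == expect_pt;
}
```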

Kernels are never "partitioned", they just do what they're told. So if you run a compute kernel and tell it to execute N iterations, it will always execute N iterations, regardless of the size and shape of the graph.
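To make that concrete, a compute kernel is just a fixed amount of busy-work. The sketch below is in the spirit of Task Bench's compute kernel (a floating-point multiply-add loop), but the exact arithmetic and name are illustrative:

```cpp
// Illustrative compute kernel: always runs exactly `iterations` steps,
// regardless of the size or shape of the task graph. The FMA-style loop
// is a sketch, not a copy of Task Bench's actual kernel.
double execute_compute(long iterations) {
  double a = 1.0, b = 1.000001;
  for (long i = 0; i < iterations; ++i) {
    a = a * b + 0.000001;  // multiply-add busy-work
  }
  return a;  // returned so the loop cannot be optimized away
}
```

Passing a larger graph changes how many times this function is called, never what a single call does.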

Absolutely nothing in Task Bench is sensitive to the number of nodes. The graph is configured explicitly (via command line parameters) and it is up to each implementation to spread that graph out as best it sees fit. But the Task Bench core (used by each implementation) is oblivious to how many nodes/cores there are or how things are parallelized. You could just as easily make a sequential version of Task Bench that executes the same thing.
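The sequential version mentioned above would look roughly like this: walk the graph in timestep order, run the same kernel for every point, and let ordinary program order satisfy every dependence. All names here are hypothetical; this is a sketch, not code from the repository:

```cpp
// Hypothetical sequential "implementation" of a task graph: every task
// runs the same kernel, and executing timesteps in order trivially
// satisfies all dependencies. No knowledge of nodes or cores is needed.
long run_sequential(long timesteps, long width, long iterations) {
  volatile double scratch = 1.0;  // keeps the kernel loop from being elided
  long tasks_run = 0;
  for (long t = 0; t < timesteps; ++t) {
    for (long p = 0; p < width; ++p) {
      // Every task executes the same kernel for the same iteration count.
      for (long i = 0; i < iterations; ++i) scratch = scratch * 1.000001;
      ++tasks_run;
    }
  }
  return tasks_run;
}
```

A distributed implementation like the Legion one runs the same logical graph; it only differs in how the points of each timestep are spread across processors.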

Hope that helps.

@AngryBear2
Author

Thank you. From what you said, I understand that each node in the graph executes its own copy of the same kernel, rather than all nodes jointly executing one split-up kernel. How can I measure the data transfer time between dependent nodes?

@elliottslaughter
Contributor

There is a summary printed at the end which should include a bandwidth figure; but you may need to pass in the number of nodes (-nodes N) in order for it to accurately calculate the intra-node vs inter-node bandwidth.
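The bandwidth figure is, at heart, total bytes moved along graph edges divided by elapsed time. A minimal sketch of that arithmetic (the function and its parameters are illustrative; Task Bench's actual summary does per-edge accounting and, given -nodes N, classifies edges as intra-node or inter-node):

```cpp
// Illustrative bandwidth estimate: bytes moved across dependence edges
// per second of elapsed wall-clock time, in GB/s. All names here are
// hypothetical, not Task Bench's metrics code.
double bandwidth_gbps(long num_edges, long bytes_per_edge, double elapsed_s) {
  double total_bytes = static_cast<double>(num_edges) * bytes_per_edge;
  return total_bytes / elapsed_s / 1e9;
}
```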
