Documentation needed #84

Open · devreal opened this issue Apr 22, 2022 · 3 comments

devreal commented Apr 22, 2022

I am trying to implement the task-bench benchmark on top of a dataflow paradigm (TTG) and found it to be a rather frustrating exercise, mostly due to the total lack of documentation. In task_graph_t, exactly 3 out of 11 fields have some kind of documentation, and only 2 of the 12 task_graph_* functions are blessed with a comment, one of which is a FIXME... For me, it really comes down to a guessing game and to reading other implementations to figure out whether my implementation is actually correct in the sense of the benchmark.

Particular questions I still don't know the answer to are:

  1. What is nb_fields and what is the connection to timesteps? It appears that nb_fields is set to the number of timesteps if not explicitly provided. Why? What does the number of fields have to do with the number of timesteps? Can I ignore it and just use 1?
  2. What is the relation between the task_graph_[reverse]_dependencies and task_graph_offset_at_timestep/task_graph_width_at_timestep? Do I have to apply a correction for the offset/width to the dependencies provided by task_graph_[reverse]_dependencies? If so, why is that not done in task_graph_[reverse]_dependencies?
  3. I couldn't find any run rules. For a dataflow model like TTG it is possible to feed data into the graph and have updates never materialize outside of it (i.e., to never write back updated data). In dependency-based models such as StarPU, PaRSEC, OpenMP, etc., dependencies are described directly on physical memory locations, which is very different from dataflow-based models. So what is the expectation here? Also, can I add artificial dependencies if that benefits my model (say, between the same point x in consecutive timesteps)? What is the correctness metric? Am I free to distribute the data however I want? What is the minimum number of dependencies I need to support? (Clearly, the models mentioned above couldn't support the alltoall pattern at scale and chose a seemingly arbitrary number of 10.)

Given the state of things, I will go ahead and choose the interpretation most favorable for my model that won't crash or error out during execution. However, that should not be the way to write a benchmark meant to be portable...

elliottslaughter commented Apr 22, 2022

Hi @devreal,

I agree more documentation would be good. :-)

  1. nb_fields is there for uniformity of scripting. It is not used by the Task Bench core. You can safely ignore it, though the concept may be useful to your implementation. Basically, many implementations do a form of N-way buffering (double buffering is N=2, triple buffering is N=3, etc.). This is a compromise between (a) trying to cut the WAR dependencies to reduce blocking in the implementations, and (b) not over-allocating memory. Again, this is purely an implementation detail and you can use it or ignore it as you see fit (the sketch after this list shows one way it might be used).
  2. As you've noticed, the width and offset are specified separately from the dependence patterns. This is because in some systems it is very important for performance to amortize the dependence patterns across timesteps. So in a pattern like dom, the dependence pattern itself is constant (depend on yourself and the point to your left), but the width/offset vary over the course of the run; the sketch after this list shows how an implementation might apply that offset/width correction to the raw dependencies.
  3. TensorFlow has semantics like you describe. The way we do this in TensorFlow is that we require the output of every task to be included in the final result (see outputs in https://github.com/StanfordLegion/task-bench/blob/master/tensorflow/task_bench.py). You could probably get away with only requiring the value of the last task in each column, though that can, strictly speaking, cause problems on some dependence patterns. The bottom line is that there should be some way of making sure that every task has executed.
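To make 1 and 2 concrete, here is a small, self-contained sketch that mimics the dom example printed later in this comment (-width 4 -steps 8). The function names echo the core calls discussed in this thread (task_graph_offset_at_timestep, task_graph_width_at_timestep, task_graph_dependencies), but they are toy stand-ins, not the real prototypes; in core_c.h the dependence queries return interval lists rather than flat point lists (and, if I recall correctly, take a dependence-set index rather than a timestep), so check the header for the actual signatures. The point is the traversal logic: walk the live points of each timestep via offset/width, pick a buffer with timestep % nb_fields if you do N-way buffering, and keep only those raw dependencies that fall inside the previous timestep's live range.

#include <algorithm>
#include <cstdio>
#include <vector>

// Toy graph mimicking "-type dom -width 4 -steps 8" from the listing below.
struct toy_graph { long timesteps = 8, max_width = 4, nb_fields = 2; };

// First live point at a timestep (stand-in for task_graph_offset_at_timestep).
long offset_at(const toy_graph &g, long t) {
  return std::max(0L, t - (g.timesteps - g.max_width));
}

// Number of live points at a timestep (stand-in for task_graph_width_at_timestep).
long width_at(const toy_graph &g, long t) {
  return std::min({t + 1, g.max_width, g.timesteps - t});
}

// Raw dom dependence pattern: yourself and the point to your left, with no
// offset/width correction applied (stand-in for task_graph_dependencies).
std::vector<long> raw_deps(long p) {
  return p > 0 ? std::vector<long>{p - 1, p} : std::vector<long>{p};
}

int main() {
  toy_graph g;
  for (long t = 0; t < g.timesteps; ++t) {
    long off = offset_at(g, t), width = width_at(g, t);
    long last_off = t ? offset_at(g, t - 1) : 0;
    long last_width = t ? width_at(g, t - 1) : 0;
    long field = t % g.nb_fields;  // N-way buffering a la nb_fields (item 1)

    std::printf("timestep %ld (offset %ld, width %ld, field %ld)\n",
                t, off, width, field);
    for (long p = off; p < off + width; ++p) {
      std::printf("  point %ld depends on:", p);
      for (long d : raw_deps(p)) {
        // Item 2: a raw dependence only counts if the producer was live at
        // t-1, i.e. it falls inside [last_off, last_off + last_width).
        if (t > 0 && d >= last_off && d < last_off + last_width)
          std::printf(" %ld", d);
      }
      std::printf("\n");
      // A real implementation would now gather those inputs and call
      // execute_point for (t, p).
    }
  }
  return 0;
}

Run as-is, this reproduces the per-point dependence lists of the -v output shown later in this comment (modulo formatting); with the real API you would swap the toy helpers for the core calls, and the same clamping logic carries over.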

For your other questions:

  • You can add artificial dependencies, though note that most patterns (besides trivial) already include a self-dependence. (Edit: to be clear, you should not pass these extra dependencies into execute_point; using them purely for your own internal bookkeeping is fine.)
  • If you want to replicate the experiments from the SC paper, you need to support at least 10 dependencies. This is sort of a hack though; if systems supported arbitrary numbers of dependencies this wouldn't be required. Legion does not have an upper bound (aside from performance) on the number of dependencies.
  • You can distribute data however you want.

In general, Task Bench will aggressively check any inputs you give to any tasks. It is pretty much impossible to do this wrong, unless you just don't execute the task graph at all. By default, we'll also check that you execute at least one task from each task graph at the point where we print results. So it's pretty hard to mess up, but if you want you could add a print statement to https://github.com/StanfordLegion/task-bench/blob/master/core/core.cc#L546 and check that it's printing the number of tasks you expect. As long as it does that and does not crash, you're good.
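If it helps, one cheap cross-check (my suggestion, not something the core does for you): the total number of tasks in one task graph is simply the sum of the per-timestep widths, so you can compute that up front and compare it against whatever count your added print statement reports.

#include <numeric>
#include <vector>

// Hypothetical sanity check: total tasks in one graph = sum over timesteps of
// the per-timestep width (task_graph_width_at_timestep).
long expected_tasks(const std::vector<long> &width_per_timestep) {
  return std::accumulate(width_per_timestep.begin(), width_per_timestep.end(), 0L);
}

int main() {
  // Widths of the dom example below (-width 4 -steps 8): 1 2 3 4 4 3 2 1.
  std::vector<long> widths = {1, 2, 3, 4, 4, 3, 2, 1};
  return expected_tasks(widths) == 20 ? 0 : 1;  // 20 tasks in total
}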

You may also benefit from Task Bench's diagnostics about the task graph itself. E.g., for the dom pattern:

$ ./task_bench -ll:cpu 1 -ll:io 1 -type dom -width 4 -steps 8 -v
...
      Timestep 0 (offset 0, width 1, last offset 0, last width 0):
        Points: 0
        Dependencies:
          Point 0:
      Timestep 1 (offset 0, width 2, last offset 0, last width 1):
        Points: 0 1
        Dependencies:
          Point 0: 0
          Point 1: 0
      Timestep 2 (offset 0, width 3, last offset 0, last width 2):
        Points: 0 1 2
        Dependencies:
          Point 0: 0
          Point 1: 0 1
          Point 2: 1
      Timestep 3 (offset 0, width 4, last offset 0, last width 3):
        Points: 0 1 2 3
        Dependencies:
          Point 0: 0
          Point 1: 0 1
          Point 2: 1 2
          Point 3: 2
      Timestep 4 (offset 0, width 4, last offset 0, last width 4):
        Points: 0 1 2 3
        Dependencies:
          Point 0: 0
          Point 1: 0 1
          Point 2: 1 2
          Point 3: 2 3
      Timestep 5 (offset 1, width 3, last offset 0, last width 4):
        Points: 1 2 3
        Dependencies:
          Point 1: 0 1
          Point 2: 1 2
          Point 3: 2 3
      Timestep 6 (offset 2, width 2, last offset 1, last width 3):
        Points: 2 3
        Dependencies:
          Point 2: 1 2
          Point 3: 2 3
      Timestep 7 (offset 3, width 1, last offset 2, last width 2):
        Points: 3
        Dependencies:
          Point 3: 2 3
...

Pass -vv if you also want to see the reverse dependencies.

Hope that helps and feel free to ask questions if you have any.

1193749292 commented Jul 26, 2023

https://arxiv.org/pdf/1908.05790v2.pdf

So far, I have only found this one. I would like to learn more about the execution process and the scheduling principles of openmp/region. Have you found any more documentation?

elliottslaughter commented

@1193749292 I'm not sure what you're asking for. In general, the implementations are entirely separate from the core. What OpenMP does, and how it schedules, has nothing to do with the description of the task graphs.

If you have specific questions, feel free to ask them here. Please also read the earlier comments in this thread because there's good information up there as well.
