v0.15

@tbennun tbennun released this 16 Oct 17:32
0755385

What's Changed

Work-Depth / Average Parallelism Analysis by @hodelcl in #1363 and #1327

A new analysis engine statically analyzes SDFGs for work, depth, and average parallelism. You can supply a series of assumptions about symbolic program parameters to simplify and improve the analysis results. The following example shows how to use the analysis:

from dace.sdfg.work_depth_analysis import work_depth

# A dictionary mapping each SDFG element to a tuple (work, depth)
work_depth_map = {}
# Assumptions about symbolic parameters
assumptions = ['N>5', 'M<200', 'K>N']
work_depth.analyze_sdfg(mysdfg, work_depth_map, work_depth.get_tasklet_work_depth, assumptions)

# A dictionary mapping each SDFG element to its average parallelism
average_parallelism_map = {}
work_depth.analyze_sdfg(mysdfg, average_parallelism_map, work_depth.get_tasklet_avg_par, assumptions)
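
To make the three quantities concrete, here is a minimal pure-Python sketch on a toy dependency graph. It does not use the DaCe analysis engine. The function `work_and_depth` and the node names are hypothetical and exist only for illustration. Work is the total cost of all operations, depth is the cost of the longest dependency chain, and average parallelism is their ratio.

```python
def work_and_depth(costs, deps):
    """Return (work, depth) for a DAG.

    costs: dict mapping node name -> cost of that node
    deps:  dict mapping node name -> list of nodes it depends on
    """
    work = sum(costs.values())
    finish = {}

    def finish_time(node):
        # Longest chain ending at this node, including its own cost
        if node not in finish:
            preds = deps.get(node, [])
            finish[node] = costs[node] + max((finish_time(p) for p in preds), default=0)
        return finish[node]

    depth = max((finish_time(n) for n in costs), default=0)
    return work, depth


# Four independent unit-cost operations followed by one that needs all of them
costs = {"t0": 1, "t1": 1, "t2": 1, "t3": 1, "combine": 1}
deps = {"combine": ["t0", "t1", "t2", "t3"]}

work, depth = work_and_depth(costs, deps)
average_parallelism = work / depth
```

In this toy graph the four independent operations can run at the same time, so the depth is 2 while the work is 5.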

Symbol parameter reduction in generated code (#1338, #1344)

To improve integration with external codes, DaCe now limits the symbolic parameters in generated code to the symbols that are actually used. Take the following code, for example:

import dace

N = dace.symbol('N')

@dace
def addone(a: dace.float64[N]):
  for i in dace.map[0:10]:
    a[i] += 1

Since the internal code does not actually need N to process the array, N does not appear in the generated function signature. Before this release, the signature of the generated code would be:

DACE_EXPORTED void __program_addone(addone_t *__state, double * __restrict__ a, int N);

After this release it is:

DACE_EXPORTED void __program_addone(addone_t *__state, double * __restrict__ a);

Note that this is a major, breaking change: users who manually interact with the generated .so files must adapt their calls to the new signatures.
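
For callers that bind the exported function by hand, the change amounts to dropping the trailing N argument. The sketch below uses only Python's ctypes module and is not taken from the DaCe sources. Treating the state pointer as an opaque `void *` is an assumption, and `_fake_addone` is a hypothetical stand-in for the compiled function so that the example runs without a generated library.

```python
import ctypes

# Argument types for the two signatures quoted above. The state pointer is
# treated as an opaque void* here (an assumption for illustration).
OLD_ARGTYPES = [ctypes.c_void_p, ctypes.POINTER(ctypes.c_double), ctypes.c_int]
NEW_ARGTYPES = [ctypes.c_void_p, ctypes.POINTER(ctypes.c_double)]

# Prototype for the v0.15 signature: returns void, takes (state, a)
NewProto = ctypes.CFUNCTYPE(None, *NEW_ARGTYPES)


def _fake_addone(state, a):
    # Hypothetical stand-in for the generated code: add one to 10 elements
    for i in range(10):
        a[i] += 1.0


program_addone = NewProto(_fake_addone)

a = (ctypes.c_double * 10)(*range(10))
program_addone(None, a)  # v0.15-style call: no trailing N argument
result = list(a)
```

With a real generated library, the same `NEW_ARGTYPES` list would be assigned to the loaded function's `argtypes` attribute in place of `OLD_ARGTYPES`.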

Externally-allocated memory (workspace) support (#1294)

A new allocation lifetime, dace.AllocationLifetime.External, has been introduced into DaCe. You can now use your DaCe code with external memory allocators (such as PyTorch) and ask DaCe (a) how much transient memory it will need and (b) to use a specific pre-allocated pointer. Example:

@dace
def some_workspace(a: dace.float64[N]):
  workspace = dace.ndarray([N], dace.float64, lifetime=dace.AllocationLifetime.External)
  workspace[:] = a
  workspace += 1
  a[:] = workspace

csdfg = some_workspace.to_sdfg().compile()

sizes = csdfg.get_workspace_sizes()  # Returns {dace.StorageType.CPU_Heap: N*8}
wsp = ...  # Allocate externally
csdfg.set_workspace(dace.StorageType.CPU_Heap, wsp)

The same interface is available in the generated code:

size_t __dace_get_external_memory_size_CPU_Heap(programname_t *__state, int N);
void __dace_set_external_memory_CPU_Heap(programname_t *__state, char *ptr, int N);
// or GPU_Global...
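
As an illustration of the "allocate externally" step, the following sketch reserves a raw byte buffer with Python's ctypes module. It is only one possible allocator and is not part of DaCe. The helper `allocate_workspace` and the concrete value of N are hypothetical, and whether a given buffer type can be passed to `set_workspace` directly is not covered here.

```python
import ctypes


def allocate_workspace(num_bytes):
    """Allocate a zero-initialized byte buffer and return (buffer, address).

    The buffer object must be kept alive for as long as the address is in use.
    """
    buf = ctypes.create_string_buffer(num_bytes)
    return buf, ctypes.addressof(buf)


N = 1024          # hypothetical concrete value of the symbol N
required = N * 8  # N float64 elements, matching the N*8 size shown above
wsp, address = allocate_workspace(required)
```

The address corresponds to the `char *ptr` argument of the generated interface. Keeping `wsp` referenced prevents Python from freeing the memory while it is in use.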

Schedule Trees (EXPERIMENTAL, #1145)

Schedule trees are an experimental feature that allows you to analyze your SDFGs in a schedule-oriented format. The conversion takes in SDFGs (even after applying transformations) and outputs a tree of elements that can be printed out in a Python-like syntax. For example:

@dace.program
def matmul(A: dace.float32[10, 10], B: dace.float32[10, 10], C: dace.float32[10, 10]):
  for i in range(10):
    for j in dace.map[0:10]:
      atile = dace.define_local([10], dace.float32)
      atile[:] = A[i]
      for k in range(10):
        with dace.tasklet:
          # ...
sdfg = matmul.to_sdfg()

from dace.sdfg.analysis.schedule_tree.sdfg_to_tree import as_schedule_tree
stree = as_schedule_tree(sdfg)
print(stree.as_string())

will print:

for i = 0; (i < 10); i = i + 1:
  map j in [0:10]:
    atile = copy A[i, 0:10]
    for k = 0; (k < 10); k = (k + 1):
      C[i, j] = tasklet(atile[k], B(10) [k, j], C[i, j])

There are some new transformation classes and passes in dace.sdfg.analysis.schedule_tree.passes, for example, to remove empty control flow scopes:

from dace.sdfg.analysis.schedule_tree import treenodes as tn

class RemoveEmptyScopes(tn.ScheduleNodeTransformer):
  def visit_scope(self, node: tn.ScheduleTreeScope):
    if len(node.children) == 0:
      return None
    return self.generic_visit(node)
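
The transformer pattern above (return None to drop a node, otherwise recurse into its children) can be illustrated on a toy tree without DaCe. In this sketch the `Node` class and the `remove_empty_scopes` function are hypothetical stand-ins, not the schedule tree classes themselves.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    label: str
    # None marks a leaf statement; a list (possibly empty) marks a scope
    children: Optional[List["Node"]] = None


def remove_empty_scopes(node: Node) -> Optional[Node]:
    """Drop scopes that have no children; keep everything else."""
    if node.children is None:
        return node  # leaf statement, nothing to do
    if len(node.children) == 0:
        return None  # empty scope: returning None removes it
    # Visit the children and keep only those that were not removed
    visited = (remove_empty_scopes(child) for child in node.children)
    node.children = [child for child in visited if child is not None]
    return node


tree = Node("root", [
    Node("for i", []),                        # empty loop scope
    Node("map j", [Node("C[i, j] = ...")]),   # scope with one statement
])
cleaned = remove_empty_scopes(tree)
remaining = [child.label for child in cleaned.children]
```

Because the emptiness check in this sketch runs before the children are visited, a scope that only becomes empty after its children are removed survives a single pass; a second pass would remove it.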

We hope you find new ways to analyze and optimize DaCe programs with this feature!

Other Major Changes

Minor Changes

Fixes and Smaller Changes:

  • Fix transient bug in test with array_equal of empty arrays by @tbennun in #1374
  • Fixes GPUTransform bug when data are already in GPU memory by @alexnick83 in #1291
  • Fixed erroneous parsing of data slices when the data are defined inside a nested scope by @alexnick83 in #1287
  • Disable OpenMP sections by default by @tbennun in #1282
  • Make SDFG.name a proper property by @phschaad in #1289
  • Refactor and fix performance regression with GPU runtime checks by @tbennun in #1292
  • Fixed RW dependency violation when accessing data attributes by @alexnick83 in #1296
  • Externally-managed memory lifetime by @tbennun in #1294
  • External interaction fixes by @tbennun in #1301
  • Improvements to RefineNestedAccess by @alexnick83 and @Sajohn-CH in #1310
  • Fixed erroneous parsing of while-loop conditions by @alexnick83 in #1313
  • Improvements to MapFusion when the Map bodies contain NestedSDFGs by @alexnick83 in #1312
  • Fixed erroneous code generation of indirected accesses by @alexnick83 in #1302
  • RefineNestedAccess takes indices into account when checking for missing free symbols by @Sajohn-CH in #1317
  • Fixed SubgraphFusion erroneously removing/merging intermediate data nodes by @alexnick83 in #1307
  • Fixed SDFG DFS traversal missing InterstateEdges by @alexnick83 in #1320
  • Frontend now uses the AST nodes' context to infer read/write accesses by @alexnick83 in #1297
  • Added capability for non-strict shape validation by @alexnick83 in #1321
  • Fixes for persistent schedule and GPUPersistentFusion transformation by @tbennun in #1322
  • Relax test for inter-state edges in default schedules by @tbennun in #1326
  • Improvements to inference of an SDFGState's read and write sets by @Sajohn-CH in #1325 and #1329
  • Fixed ArrayElimination pass trying to eliminate data that were already removed in #1314
  • Bump certifi from 2023.5.7 to 2023.7.22 by @dependabot in #1332
  • Fix some underlying issues with tensor core sample by @computablee in #1336
  • Updated hlslib to support Xilinx Vitis >=2022.2 by @carljohnsen in #1340
  • Docs: mention FPGA backend tested with Intel Quartus PRO by @TizianoDeMatteis in #1335
  • Improved validation of NestedSDFG connectors by @alexnick83 in #1333
  • Remove unused global data descriptor shapes from arguments by @tbennun in #1338
  • Fixed Scalar data validation in NestedSDFGs by @alexnick83 in #1341
  • Fix for None set properties by @tbennun in #1345
  • Add Object to defined types in code generation and some documentation by @tbennun in #1343
  • Fix symbolic parsing for ternary operators by @tbennun in #1346
  • Fortran fix memlet indices by @Sajohn-CH in #1342
  • Have memory type as argument for fpga auto interleave by @TizianoDeMatteis in #1352
  • Eliminate extraneous branch-end gotos in code generation by @tbennun in #1355
  • TaskletFusion: Fix additional edges in case of none-connectors by @lukastruemper in #1360
  • Fix dynamic memlet propagation condition by @tbennun in #1364
  • Configurable GPU thread/block index types, minor fixes to integer code generation and GPU runtimes by @tbennun in #1357

New Contributors

Full Changelog: v0.14.4...v0.15