v0.15
What's Changed
Work-Depth / Average Parallelism Analysis by @hodelcl in #1363 and #1327
A new analysis engine allows SDFGs to be statically analyzed for work and depth / average parallelism. The analysis accepts a series of assumptions about symbolic program parameters, which can help simplify and improve the results. The following example shows how to use it:
from dace.sdfg.work_depth_analysis import work_depth
# A dictionary mapping each SDFG element to a tuple (work, depth)
work_depth_map = {}
# Assumptions about symbolic parameters
assumptions = ['N>5', 'M<200', 'K>N']
work_depth.analyze_sdfg(mysdfg, work_depth_map, work_depth.get_tasklet_work_depth, assumptions)
# A dictionary mapping each SDFG element to its average parallelism
average_parallelism_map = {}
work_depth.analyze_sdfg(mysdfg, average_parallelism_map, work_depth.get_tasklet_avg_par, assumptions)
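To make the reported quantities concrete, here is a minimal sketch of what work, depth, and average parallelism mean on a small task graph. It does not use DaCe or reproduce its analysis engine; the function `work_and_depth`, the task names, and the costs are all hypothetical and chosen only for illustration. Work is the sum of all task costs, depth is the cost of the longest dependency chain, and average parallelism is their ratio.

```python
from typing import Dict, List, Tuple


def work_and_depth(costs: Dict[str, int], deps: Dict[str, List[str]]) -> Tuple[int, int]:
    """Return (work, depth) for an acyclic task graph.

    costs maps each task to its cost; deps maps a task to the tasks it waits for.
    """
    finish: Dict[str, int] = {}

    def finish_time(task: str) -> int:
        # Earliest finish time = own cost + latest finish among predecessors.
        if task not in finish:
            preds = deps.get(task, [])
            finish[task] = costs[task] + max((finish_time(p) for p in preds), default=0)
        return finish[task]

    work = sum(costs.values())
    depth = max((finish_time(t) for t in costs), default=0)
    return work, depth


# Hypothetical graph: eight independent unit-cost tasks feeding one combining task.
costs = {f"t{i}": 1 for i in range(8)}
costs["combine"] = 1
deps = {"combine": [f"t{i}" for i in range(8)]}

work, depth = work_and_depth(costs, deps)
average_parallelism = work / depth
print(work, depth, average_parallelism)
```

In this toy graph the eight independent tasks can run at the same time, so the depth is much smaller than the work. In the DaCe analysis these quantities are symbolic expressions, which is where assumptions such as `N>5` can help simplify the result.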
Symbol parameter reduction in generated code (#1338, #1344)
To improve our integration with external codes, we limit the symbolic parameters generated by DaCe to only the used symbols. Take the following code for example:
@dace
def addone(a: dace.float64[N]):
    for i in dace.map[0:10]:
        a[i] += 1
Since the internal code does not actually need N to process the array, it will not appear in the generated code. Before this release, the signature of the generated code would be:
DACE_EXPORTED void __program_addone(addone_t *__state, double * __restrict__ a, int N);
After this release it is:
DACE_EXPORTED void __program_addone(addone_t *__state, double * __restrict__ a);
Note that this is a major, breaking change: users who manually interact with the generated .so files must adapt their calls to the new signatures.
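For callers that bind the generated function by hand, adapting mostly means dropping the unused symbol from the prototype and from the call. The sketch below shows this with `ctypes`. It loads no DaCe-generated library: a Python callback named `_stand_in_program` (hypothetical) plays the role of `__program_addone`, so only the shape of the new signature is demonstrated.

```python
import ctypes

DoublePtr = ctypes.POINTER(ctypes.c_double)

# Prototype matching the new signature shown above: (state, a), with no trailing N.
ProgramProto = ctypes.CFUNCTYPE(None, ctypes.c_void_p, DoublePtr)


def _stand_in_program(state, a):
    # Hypothetical stand-in for the generated code: increment the first 10 elements.
    for i in range(10):
        a[i] += 1.0


program = ProgramProto(_stand_in_program)

buf = (ctypes.c_double * 10)(*range(10))
program(None, buf)  # callers written for the old signature would also pass N here
result = list(buf)
print(result)
```

With a real shared library, the same two argument types would be assigned to the loaded function's `argtypes`, and any leftover `N` argument in existing call sites would have to be removed.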
Externally-allocated memory (workspace) support (#1294)
A new allocation lifetime, dace.AllocationLifetime.External, has been introduced into DaCe. You can now use your DaCe code with external memory allocators (such as PyTorch): you can (a) ask DaCe how much transient memory it will need, and (b) tell it to use a specific pre-allocated pointer. Example:
@dace
def some_workspace(a: dace.float64[N]):
    workspace = dace.ndarray([N], dace.float64, lifetime=dace.AllocationLifetime.External)
    workspace[:] = a
    workspace += 1
    a[:] = workspace

csdfg = some_workspace.to_sdfg().compile()
sizes = csdfg.get_workspace_sizes() # Returns {dace.StorageType.CPU_Heap: N*8}
wsp = # ...Allocate externally...
csdfg.set_workspace(dace.StorageType.CPU_Heap, wsp)
The same interface is available in the generated code:
size_t __dace_get_external_memory_size_CPU_Heap(programname_t *__state, int N);
void __dace_set_external_memory_CPU_Heap(programname_t *__state, char *ptr, int N);
// or GPU_Global...
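The "allocate externally" step can be sketched without DaCe. The example below only illustrates the size arithmetic behind `N*8` (N elements of an 8-byte float64) and one way to obtain a raw address for a caller-owned buffer. The helper `workspace_bytes` and the value of `N` are hypothetical, and how the buffer is then handed to `set_workspace` or `__dace_set_external_memory_CPU_Heap` is not shown here.

```python
import ctypes
from math import prod


def workspace_bytes(shape, itemsize):
    """Bytes needed for a dense array with the given shape and element size."""
    return prod(shape) * itemsize


N = 1024  # hypothetical concrete value for the symbol N
nbytes = workspace_bytes([N], ctypes.sizeof(ctypes.c_double))  # N * 8

# "External" allocation: the caller owns this buffer, not the library.
buffer = bytearray(nbytes)

# A ctypes view over the same memory; while it exists, `buffer` cannot be resized.
view = (ctypes.c_char * nbytes).from_buffer(buffer)

# Raw address of the buffer, i.e. what a C-style `char *ptr` parameter would receive.
address = ctypes.addressof(view)
print(nbytes, hex(address))
```

The caller stays responsible for keeping the buffer alive for as long as the compiled program may use it.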
Schedule Trees (EXPERIMENTAL, #1145)
Schedule trees are an experimental feature that lets you analyze your SDFGs in a schedule-oriented format. The conversion takes in SDFGs (even after applying transformations) and outputs a tree of elements that can be printed out in a Python-like syntax. For example:
@dace.program
def matmul(A: dace.float32[10, 10], B: dace.float32[10, 10], C: dace.float32[10, 10]):
    for i in range(10):
        for j in dace.map[0:10]:
            atile = dace.define_local([10], dace.float32)
            atile[:] = A[i]
            for k in range(10):
                with dace.tasklet:
                    # ...

sdfg = matmul.to_sdfg()
from dace.sdfg.analysis.schedule_tree.sdfg_to_tree import as_schedule_tree
stree = as_schedule_tree(sdfg)
print(stree.as_string())
will print:
for i = 0; (i < 10); i = i + 1:
  map j in [0:10]:
    atile = copy A[i, 0:10]
    for k = 0; (k < 10); k = (k + 1):
      C[i, j] = tasklet(atile[k], B(10) [k, j], C[i, j])
There are some new transformation classes and passes in dace.sdfg.analysis.schedule_tree.passes, for example, to remove empty control flow scopes:
class RemoveEmptyScopes(tn.ScheduleNodeTransformer):
    def visit_scope(self, node: tn.ScheduleTreeScope):
        if len(node.children) == 0:
            return None

        return self.generic_visit(node)
We hope you find new ways to analyze and optimize DaCe programs with this feature!
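The transformer pattern can be tried out on a toy tree without installing DaCe. The sketch below is not the DaCe implementation: `Node`, `Scope`, `remove_empty_scopes`, and `render` are hypothetical stand-ins written for this illustration. Unlike the pass shown above, this version cleans children first, so a scope that becomes empty only after its children are removed is dropped as well.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str


@dataclass
class Scope(Node):
    children: List[Node] = field(default_factory=list)


def remove_empty_scopes(node: Node) -> Optional[Node]:
    """Return a copy of the tree without empty scopes, or None if nothing is left."""
    if not isinstance(node, Scope):
        return node
    kept = [c for c in map(remove_empty_scopes, node.children) if c is not None]
    if not kept:
        return None
    return Scope(node.label, kept)


def render(node: Node, indent: int = 0) -> List[str]:
    """Render the tree as indented lines, two spaces per nesting level."""
    lines = ["  " * indent + node.label]
    if isinstance(node, Scope):
        for child in node.children:
            lines.extend(render(child, indent + 1))
    return lines


tree = Scope("for i = 0; (i < 10); i = i + 1:", [
    Scope("map j in [0:10]:", [
        Node("atile = copy A[i, 0:10]"),
        Scope("for k = 0; (k < 10); k = (k + 1):", []),  # empty loop body
    ]),
    Scope("if cond:", []),  # empty branch
])

cleaned = remove_empty_scopes(tree)
output = render(cleaned)
print("\n".join(output))
```

Returning `None` for a removed node mirrors the convention used by `RemoveEmptyScopes` above, where a visitor that returns `None` drops the node from its parent.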
Other Major Changes
- Support for tensor linear algebra (transpose, dot products) by @alexnick83 in #1309
- (Experimental) support for nested data containers and structures by @alexnick83 in #1324
- (Experimental) basic support for mpi4py syntax by @alexnick83 and @Com1t in #1070 and #1288
- (Experimental) Added support for a subset of F77 and F90 language features by @acalotoiu and @mcopik in #1275, #1293, #1349 and #1367
Minor Changes
- Support for Python 3.12 by @alexnick83 in #1386
- Support attributes in symbolic expressions by @tbennun in #1369
- GPU User Experience Improvements by @tbennun in #1283
- State Fusion Extension with happens before dependency edge by @acalotoiu in #1268
- Add CPU_Persistent map schedule (OpenMP parallel regions) by @tbennun in #1330
Fixes and Smaller Changes:
- Fix transient bug in test with array_equal of empty arrays by @tbennun in #1374
- Fixes GPUTransform bug when data are already in GPU memory by @alexnick83 in #1291
- Fixed erroneous parsing of data slices when the data are defined inside a nested scope by @alexnick83 in #1287
- Disable OpenMP sections by default by @tbennun in #1282
- Make SDFG.name a proper property by @phschaad in #1289
- Refactor and fix performance regression with GPU runtime checks by @tbennun in #1292
- Fixed RW dependency violation when accessing data attributes by @alexnick83 in #1296
- Externally-managed memory lifetime by @tbennun in #1294
- External interaction fixes by @tbennun in #1301
- Improvements to RefineNestedAccess by @alexnick83 and @Sajohn-CH in #1310
- Fixed erroneous parsing of while-loop conditions by @alexnick83 in #1313
- Improvements to MapFusion when the Map bodies contain NestedSDFGs by @alexnick83 in #1312
- Fixed erroneous code generation of indirected accesses by @alexnick83 in #1302
- RefineNestedAccess take indices into account when checking for missing free symbols by @Sajohn-CH in #1317
- Fixed SubgraphFusion erroneously removing/merging intermediate data nodes by @alexnick83 in #1307
- Fixed SDFG DFS traversal missing InterstateEdges by @alexnick83 in #1320
- Frontend now uses the AST nodes' context to infer read/write accesses by @alexnick83 in #1297
- Added capability for non-strict shape validation by @alexnick83 in #1321
- Fixes for persistent schedule and GPUPersistentFusion transformation by @tbennun in #1322
- Relax test for inter-state edges in default schedules by @tbennun in #1326
- Improvements to inference of an SDFGState's read and write sets by @Sajohn-CH in #1325 and #1329
- Fixed ArrayElimination pass trying to eliminate data that were already removed in #1314
- Bump certifi from 2023.5.7 to 2023.7.22 by @dependabot in #1332
- Fix some underlying issues with tensor core sample by @computablee in #1336
- Updated hlslib to support Xilinx Vitis >=2022.2 by @carljohnsen in #1340
- Docs: mention FPGA backend tested with Intel Quartus PRO by @TizianoDeMatteis in #1335
- Improved validation of NestedSDFG connectors by @alexnick83 in #1333
- Remove unused global data descriptor shapes from arguments by @tbennun in #1338
- Fixed Scalar data validation in NestedSDFGs by @alexnick83 in #1341
- Fix for None set properties by @tbennun in #1345
- Add Object to defined types in code generation and some documentation by @tbennun in #1343
- Fix symbolic parsing for ternary operators by @tbennun in #1346
- Fortran fix memlet indices by @Sajohn-CH in #1342
- Have memory type as argument for fpga auto interleave by @TizianoDeMatteis in #1352
- Eliminate extraneous branch-end gotos in code generation by @tbennun in #1355
- TaskletFusion: Fix additional edges in case of none-connectors by @lukastruemper in #1360
- Fix dynamic memlet propagation condition by @tbennun in #1364
- Configurable GPU thread/block index types, minor fixes to integer code generation and GPU runtimes by @tbennun in #1357
New Contributors
- @computablee made their first contribution in #1290
- @Com1t made their first contribution in #1288
- @mcopik made their first contribution in #1349
Full Changelog: v0.14.4...v0.15