How to promise alignment of Input and Output buffers? #6495
-
See test/correctness/constraints.cpp. Normally I say something like:
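A sketch of that kind of constraint, with my_input standing in for the ImageParam or Input<Buffer<>> (the exact expressions may differ, but this is the shape of it):

```cpp
// Constrain dim 0 so that its min and extent equal themselves rounded down
// to the next multiple of 32. This both tells the compiler about the
// alignment and inserts a runtime check on the buffer that gets passed in.
Expr min0 = my_input.dim(0).min();
Expr extent0 = my_input.dim(0).extent();
my_input.dim(0).set_bounds((min0 / 32) * 32, (extent0 / 32) * 32);
```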
That asserts that the min and extent are equal to themselves rounded down to the next multiple of 32. I think you can also say my_pipeline.add_requirement(my_input.dim(0).extent() % 32 == 0) (and similar for the min).
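A sketch of the add_requirement form, with an error message attached (again using my_input and my_pipeline as placeholder names):

```cpp
// Checked once at pipeline entry, in the order added. If a condition fails,
// the pipeline calls halide_error with the remaining arguments and returns
// halide_error_code_requirement_failed.
my_pipeline.add_requirement(my_input.dim(0).extent() % 32 == 0,
                            "dim 0 extent must be a multiple of 32, got ",
                            my_input.dim(0).extent());
my_pipeline.add_requirement(my_input.dim(0).min() % 32 == 0,
                            "dim 0 min must be a multiple of 32");
```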
-
Aha, cool! Thanks for the reply. I think this could use a helper function. If you think that's a good idea, I'll make a PR for it.
-
I think the simplifier doesn't exploit my constraint to its full potential. With

coord.dim(0).set_bounds(0, max(32, (coord.dim(0).extent() / 32) * 32));

the .stmt still yields:

let sum_prob_per_sample_global_wrapper$0.s0.sample.max.s = min((min(coord.extent.0, 16) + (((coord.extent.0 + -1) / 16) * 16)), coord.extent.0)

(Update: promising a multiple of 16 also doesn't work.)
-
I attempted
-
Hrm, I do this sort of thing constantly, so something else must be going on. Could you post a full repro?
-
I don't know where to start on making an MWE, so I'll drop a gist of my actual generator here. It's lengthy, and still contains preprocessor macros.
-
Where are you seeing the bad indexing? Here's the .stmt I get:
-
You don't have a GPU target selected. Sorry if that wasn't clear. I use:
-
This is not a discussion. This is a bug. Why was this made into a discussion?
-
What I'm seeing is some code that doesn't exploit the alignment, but it occurs before the assert. That's expected, because at that point in the code the extent still might not be a multiple of 16. If we incorrectly assumed that the assert can't fail later, we could return a garbled or incorrect error before reaching it.

It's possible that a let computed before the assert should be known to be a multiple of 16 but isn't, because it's computed before the assert, and that the same let is later used, after the assert, in a context where we'd like to know it's a multiple of 16. Halide isn't smart enough to infer the alignment of dependent lets like that, but I'm not seeing anything of the sort here.

I do see things like coord.extent.0/16*16 scattered throughout, but that's fine. That's how we communicate to inner code that this value is a multiple of 16. It'll be lifted as a loop invariant and computed once by masking off the low bits of coord.extent.0. So I don't think I'm seeing a bug here; it's exploiting the alignment to the extent necessary. Are you seeing unnecessary indexing code in the PTX somewhere?
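As an aside, a minimal standalone sketch (plain C++, not Halide) of why expressions like coord.extent.0/16*16 are cheap once lifted out of the loop: for a non-negative value, rounding down to a multiple of 16 is the same as masking off the low bits mentioned above.

```cpp
#include <cassert>
#include <cstdint>

int main() {
    for (uint32_t x = 0; x < 4096; x++) {
        // For non-negative x, (x / 16) * 16 rounds down to a multiple of 16,
        // which is the same as clearing the low four bits.
        assert((x / 16) * 16 == (x & ~15u));
    }
    return 0;
}
```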
-
The is_bounds_query stuff exists so that a user can call into a Halide pipeline
to ask what the constraints *are*. They need to be able to pass in an
(unallocated) buffer that fails the bounds checks and get back an updated shape
that would pass them. That's supposed to work with things like set_stride and
so on. So the bounds inference code and the bounds querying code need to come
before those asserts.
However, I'm not sure it needs to come before the add_requirement asserts,
because those aren't actionable by a bounds query (it can't easily change
the buffer shape to pass the assert). I'm not sure how people would feel
about add_requirement applying even to bounds queries. This is something
we'd have to raise more broadly.
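For context, a minimal sketch of what a bounds query looks like from the caller's side; my_pipeline here is a placeholder for an AOT-compiled pipeline:

```cpp
#include "HalideBuffer.h"

// Placeholder declaration; normally this comes from the generated header.
extern "C" int my_pipeline(halide_buffer_t *input, halide_buffer_t *output);

int main() {
    // Buffers with a shape but a null host pointer mark a bounds query:
    // the pipeline rewrites their shapes to what it actually requires
    // instead of doing any real work.
    Halide::Runtime::Buffer<float> input(nullptr, 100, 100);
    Halide::Runtime::Buffer<float> output(nullptr, 100, 100);
    my_pipeline(input.raw_buffer(), output.raw_buffer());
    // input.dim(0).min()/.extent() etc. now describe the required region;
    // allocate real buffers with those shapes and call the pipeline again.
    return 0;
}
```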
…On Thu, Dec 16, 2021 at 2:13 AM Martijn Courteaux ***@***.***> wrote:
Okay, looking at the generated code (now with %64 instead of 16), I indeed see
that the /64*64 and +63)/64 are not there anymore, except in the allocate[]
statements, which is not bad for performance but is the most confusing place
to investigate manually.
However, inserting the asserts that late in the pipeline leaves out a lot of
opportunities to optimize and to simplify for readability. I think the asserts
(at least the ones I'm dealing with) can be inserted higher: right after the
block of code that fetches all the buffer specs with
_halide_buffer_get_stride, _halide_buffer_get_extent and
_halide_buffer_get_min. It seems that the assert is currently inserted after
the bounds inference computations, which are exactly the thing I want to
simplify and influence with my add_requirement.
Judging from the documentation of add_requirement:
/** Add a top-level precondition to the generated pipeline,
 * expressed as a boolean Expr. The Expr may depend on parameters
 * only, and may not call any Func or use a Var. If the condition
 * is not true at runtime, the pipeline will call halide_error
 * with the remaining arguments, and return
 * halide_error_code_requirement_failed. Requirements are checked
 * in the order added. */
and looking at the statement file, it seems that the requirement-asserts can
be inserted earlier. Moreover, the requirement-asserts are currently inserted
after a bunch of Halide-generated asserts. So let's analyse whether they can
move up to before the bounds inference without trouble:
1. These Halide-generated asserts and the requirement-asserts are
commutative (meaning the requirement-asserts can go above them just fine).
2. Then there is a block of buffer initialization like this:

if ((uint1)_halide_buffer_is_bounds_query((halide_buffer_t *)acc_d.buffer)) {
  (halide_buffer_t *)_halide_buffer_init((halide_buffer_t *)acc_d.buffer, (halide_dimension_t *)_halide_buffer_get_shape((halide_buffer_t *)acc_d.buffer), (void *)reinterpret((uint64)0), (uint64)0, (halide_device_interface_t *)reinterpret((uint64)0), 2, 32, 2, (halide_dimension_t *)make_struct(0, indices.extent.0, 1, 0, 0, 7, indices.extent.0, 0), (uint64)0)
}

I don't have a clue what this does (the then-body runs if host is 0 and
device is 0, then proceeds to set host and device to 0 again, and then just
copies the other values (like shapes, strides, etc.) into the buffer, values
that were already there). However, it looks commutative with the
requirement-asserts.
3. Then above that, there is a block of bounds inference (my target for
optimizing / simplifying) which contains a lot of let statements that are
used throughout the rest of the statement file and cause unnecessary /64*64
and the like. This is exactly the part where they (the requirement-asserts
and the let-statements) are not commutative, as changing the order influences
the computations produced. However, they are functionally commutative, which
is exactly what I want, and would simplify the following let statements, and
by extension the entire statement file.
For completeness, this is the block of let statements:
let coord.extent.0.required = max(min((min(coord.extent.0, 16) + ((coord.extent.0/16)*16)) + -16, coord.extent.0), (max(logit.s0.sample.max.s + 15, coord.extent.0)/16)*16)
let indices.extent.0.required = min(max(min(num_kernels, 64) + (((num_kernels + -1)/64)*64), min(num_kernels, 32) + (((num_kernels + -1)/32)*32)), num_kernels)
assert(!(uint1)_halide_buffer_is_bounds_query((halide_buffer_t *)invCholR.buffer) || ((16 <= num_kernels) && (num_kernels <= indices.extent.0.required)), halide_error_constraints_make_required_region_smaller("Input buffer invCholR", 0, 0, indices.extent.0.required + -1, min(num_kernels, 16) + -16, num_kernels + -1))
assert(!(uint1)_halide_buffer_is_bounds_query((halide_buffer_t *)logit_kernel_offset.buffer) || ((16 <= num_kernels) && (num_kernels <= indices.extent.0.required)), halide_error_constraints_make_required_region_smaller("Input buffer logit_kernel_offset", 0, 0, indices.extent.0.required + -1, min(num_kernels, 16) + -16, num_kernels + -1))
assert(!(uint1)_halide_buffer_is_bounds_query((halide_buffer_t *)mu.buffer) || ((16 <= num_kernels) && (num_kernels <= indices.extent.0.required)), halide_error_constraints_make_required_region_smaller("Input buffer mu", 0, 0, indices.extent.0.required + -1, min(num_kernels, 16) + -16, num_kernels + -1))
4. Then we arrive at the block of _halide_buffer_get_stride,
_halide_buffer_get_extent and _halide_buffer_get_min calls, which we of
course don't want to reorder, as it just fetches all of the variables we
want to define asserts over.
So, in conclusion, I believe the requirement-asserts -- in general --
could be placed higher in the pipeline, and produce valid results, while
simplifying a lot in the statement file. I have not investigated how to do
that, but it would be very interesting to try it, and see if all the tests
still pass.
-
I'd like to simplify addressing and the intermediate-buffer allocation-size computations. Basically, I want to promise in the scheduling language that a generator's Input<Buffer<>> will have dimension extents with a zero remainder modulo 32. It can generate a pipeline assert to validate that the buffer I pass in effectively satisfies that. Right now I have a bunch of computations I don't grasp easily, like these:

So it does a bunch of calculations that will yield results that are aligned to 16. But I know that the coord buffer will always be a non-zero multiple of 32 along the sample dimension. If I manually substitute and propagate this information, I'd get:

In fact, I think coord.extent.0.required.s would even disappear. This would make the statement file so much cleaner. But I don't know whether it is possible to add buffer constraints.