Benchmarking and Optimizing Aesara Code #1174
-
Regarding PyMC, keep in mind that the gradient probably dominates the sampling time, so the result can easily be worse than what you would expect from examining the logp performance alone. I read somewhere that the Scan grad is 4-6x slower than the value evaluation, but that obviously depends on the graph itself.
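One way to check how much the gradient costs on a given graph is to compile the scan output and its gradient separately and time both. A minimal sketch with a toy scan-based "logp" (the graph and names here are purely illustrative, not the Kalman filter):

```python
import timeit

import numpy as np
import aesara
import aesara.tensor as at

x = at.dvector("x")

# A toy scan-based "logp": a running sum of log terms, just so a Scan is in the graph.
acc_seq, _ = aesara.scan(
    fn=lambda x_t, acc: acc + at.log(1 + x_t ** 2),
    sequences=[x],
    outputs_info=[at.constant(0.0)],
)
logp = acc_seq[-1]

f_val = aesara.function([x], logp)
f_grad = aesara.function([x], aesara.grad(logp, x))

data = np.random.rand(100)
print("logp :", timeit.timeit(lambda: f_val(data), number=1_000))
print("grad :", timeit.timeit(lambda: f_grad(data), number=1_000))
```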
-
Printing the graphs is often good for relative comparisons (e.g. the scalar-case graph vs. the matrix-case graph). Sometimes one can see which optimizations were performed in one case but not the other, and, if those optimizations are meaningful, they could explain any large relative differences in performance.
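For instance, a small side-by-side comparison of two optimized graphs (here with a toy scalar-vs-matrix pair rather than the Kalman filter itself) could look like this:

```python
import aesara
import aesara.tensor as at

# The same arithmetic expressed with a scalar and with a matrix.
x_s = at.dscalar("x")
x_m = at.dmatrix("X")

f_scalar = aesara.function([x_s], x_s * x_s + 1.0)
f_matrix = aesara.function([x_m], x_m.dot(x_m) + 1.0)

# Print both optimized graphs and compare which Ops/rewrites ended up in each.
aesara.dprint(f_scalar)
aesara.dprint(f_matrix)
```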
-
Don't forget that a …
-
Based on the discussion and the profiling so far, it seems the main bottleneck is in the linear algebra operators. There is another implementation of the Kalman filter that avoids computing determinants or inverting any matrices by introducing a second scan over the states of the system. Introducing this second scan reduces all the matrix-matrix operations to at most vector-vector products, with no matrix inversions at all. Combined with the observed speed of the scalar implementation, especially when sampling in PyMC, I thought this might be a fruitful way to go. Here is the so-called "univariate Kalman filter", which has a scan inside a scan. It ends up being significantly slower than the linear-algebra version, which surprised me, especially since the inner scan only runs over a single state (so it should amount to only a single step).
Univariate Kalman filter with a scan inside a scan, to avoid `at.linalg.solve` and `at.linalg.det`:
```python
import aesara
import aesara.tensor as at
from aesara.tensor.nlinalg import matrix_dot

# N_CONST and the symbolic inputs a_data, a_x0, a_P0, a_T, a_Z, a_R, a_H, a_Q
# are defined elsewhere in the gist (N_CONST is presumably log(2 * pi)).


def univariate_inner_filter_step(y, Z_row, sigma_H, a, P):
    # Update for a single observed component: everything here is at most
    # vector-vector, so no matrix inversions or determinants are needed.
    v = y - Z_row.dot(a)
    PZT = P.dot(Z_row.T)
    F = Z_row.dot(PZT) + sigma_H
    K = PZT / F
    a_filtered = a + K * v
    P_filtered = P - at.outer(K, K) * F
    ll_inner = at.log(F) + v ** 2 / F
    return a_filtered, P_filtered, ll_inner


def matrix_predict2(a_filtered, P_filtered, T, R, Q):
    a_predicted = T.dot(a_filtered)
    P_predicted = matrix_dot(T, P_filtered, T.T) + matrix_dot(R, Q, R.T)
    # Force P_predicted to be symmetric
    P_predicted = 0.5 * (P_predicted + P_predicted.T)
    return a_predicted, P_predicted


def univariate_kalman_step(y, a, P, T, Z, R, H, Q):
    y = y[:, None]
    # Inner scan over the components of the observation vector
    result, updates = aesara.scan(univariate_inner_filter_step,
                                  sequences=[y, Z, at.diag(H)],
                                  outputs_info=[a, P, None],
                                  name='scan_over_states',
                                  profile=True)
    a_filtered, P_filtered, ll_inner = result
    a_filtered, P_filtered = a_filtered[-1], P_filtered[-1]
    a_predicted, P_predicted = matrix_predict2(a_filtered, P_filtered, T=T, R=R, Q=Q)
    ll = -0.5 * ((at.neq(ll_inner, 0).sum()) * N_CONST + ll_inner.sum())
    return a_filtered, a_predicted, P_filtered, P_predicted, ll


# Outer scan over time steps
filter_result, updates = aesara.scan(univariate_kalman_step,
                                     sequences=[a_data],
                                     outputs_info=[None, a_x0, None, a_P0, None],
                                     non_sequences=[a_T, a_Z, a_R, a_H, a_Q],
                                     name='univariate_filter',
                                     profile=True)
```
Compiling the inner scan by itself and timing it shows that it is a very fast function, just a few microseconds (which makes sense, since it has only one step). When I profile the whole function, though, the inner scan ends up dominating the execution time:
Outer scan profile
Inner scan profile
I don't understand the discrepancy between the execution times in the two profiles. The outer scan reports that the inner scan took 6s to execute, but the profile of the inner scan says that it executed for only 2.4s. Sampling time for this set-up in PyMC is the worst of all, so I guess @ricardoV94 was right that gradients of Scan are expensive.
-
Hi everyone,
I've been working on implementing a Kalman filter in Aesara, with the objective of computing the log-likelihood of a class of time series models. The filter is pretty simple: it recursively predicts new data using the model dynamics, computes the error between the prediction and the observation at a single time step, and then combines the prediction and the observation into an optimal fused state.
In its most general form, the filter is a bunch of matrix equations, but a special case where everything is scalar can also be implemented. Here is this special case in Aesara:
Code block for the scalar kalman filter
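The collapsed code block is not reproduced here; as a rough idea of what a scalar Kalman filter scan in Aesara can look like, here is a minimal sketch (the recursion and names are a textbook-style illustration, not necessarily the gist's exact code):

```python
import numpy as np
import aesara
import aesara.tensor as at

# Scalar state-space model: a_{t+1} = T*a_t + R*eta_t,  y_t = Z*a_t + eps_t,
# with Var(eps) = H and Var(eta) = Q.
T, Z, R, H, Q = (at.dscalar(name) for name in ["T", "Z", "R", "H", "Q"])
a0, P0 = at.dscalar("a0"), at.dscalar("P0")
y_data = at.dvector("y")


def scalar_kalman_step(y, a, P, T, Z, R, H, Q):
    v = y - Z * a                              # one-step-ahead prediction error
    F = Z * P * Z + H                          # prediction-error variance
    K = T * P * Z / F                          # Kalman gain
    a_next = T * a + K * v                     # predicted state mean
    P_next = T * P * (T - K * Z) + R * Q * R   # predicted state variance
    ll = -0.5 * (at.log(2 * np.pi) + at.log(F) + v ** 2 / F)
    return a_next, P_next, ll


(a_seq, P_seq, ll_seq), _ = aesara.scan(
    scalar_kalman_step,
    sequences=[y_data],
    outputs_info=[a0, P0, None],
    non_sequences=[T, Z, R, H, Q],
    name="scalar_kalman_filter",
)
loglike = ll_seq.sum()
kalman_filter = aesara.function([y_data, a0, P0, T, Z, R, H, Q], loglike)
```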
On 100 data points with fixed parameters, this scalar function runs extremely fast: 511 µs ± 4.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each). Plugging the log-likelihood into PyMC, things also sample very fast (22 seconds). Here is the function profile, as a basis for comparison:
Aesara profile output for the scalar kalman filter
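For readers who want to reproduce this kind of output: a profile like the one in the collapsed block can be obtained by compiling with profile=True and calling the profile summary afterwards, roughly like this (toy graph, not the filter itself):

```python
import numpy as np
import aesara
import aesara.tensor as at

x = at.dvector("x")
acc_seq, _ = aesara.scan(
    lambda x_t, acc: acc + x_t ** 2,
    sequences=[x],
    outputs_info=[at.constant(0.0)],
)

# profile=True attaches profiling stats to the compiled function.
fn = aesara.function([x], acc_seq[-1], profile=True)

data = np.random.rand(100)
for _ in range(1_000):
    fn(data)

fn.profile.summary()  # prints the per-Op / per-apply-node timing breakdown
```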
Question 1: Checking the profile for `kalman_filter`, 98% of the execution time is spent in the scan. That makes perfect sense, since there isn't much else here. The scan is of type Py rather than C, is that correct? Is there anything else interesting in this profile output that should catch my attention? So far, everything seems fine.
What I really want, though, is to implement the filter with matrices, so that it can generalize to more complex time series models. Here is the code for a Kalman filter that uses matrices:
Matrix Kalman filter code
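The collapsed block is not reproduced above; a textbook-style sketch of the kind of update step involved, using `at.linalg.solve` and `at.linalg.det` (an illustration of the general structure, not the gist's exact code):

```python
import numpy as np
import aesara.tensor as at
from aesara.tensor.nlinalg import matrix_dot


def matrix_kalman_step(y, a, P, T, Z, R, H, Q):
    # One predict/update step of a multivariate Kalman filter, intended to be
    # used inside aesara.scan over the rows of the observed data.
    v = y - Z.dot(a)                                 # innovation
    F = matrix_dot(Z, P, Z.T) + H                    # innovation covariance
    F_inv = at.linalg.solve(F, at.eye(F.shape[0]))   # inverse via solve
    K = matrix_dot(T, P, Z.T, F_inv)                 # Kalman gain
    a_next = T.dot(a) + K.dot(v)
    P_next = matrix_dot(T, P, (T - K.dot(Z)).T) + matrix_dot(R, Q, R.T)
    ll = -0.5 * (y.shape[0] * at.log(2 * np.pi)
                 + at.log(at.linalg.det(F))
                 + v.dot(F_inv).dot(v))
    return a_next, P_next, ll
```

The scalar filter above is the special case of this step where every matrix is 1x1.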
Implementing the same scalar model using a set of 1x1 matrices gives the following time: 11.3 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 100 loops each), or a 22x slowdown. Passing the likelihood to PyMC, sampling goes from 22 seconds (scalar model) to 4 minutes (matrix model).
Once again, I look at the profile for `matrix_kalman_filter`, which is dominated by the scan Op, with a bit of `Dot22` thrown in.
Matrix Kalman Filter function profile
Question 2: How should I interpret these profile times in the presence of a single dominating apply node (the scan itself)? Can I really attribute the entirety of the speed differential to the calls to `Dot22`, even though it accounts for only 1.1% of the execution time? (Obviously there are more differences between the two implementations, addressed below; I am trying to ask a concrete question about this single output.)
Now, these are not apples-to-apples comparisons, because aside from switching from multiplication and division to dot-products and matrix inversion, there are several additional operations introduced for numerical stability. The scalar filter has 32 apply nodes, while the matrix filter has 37, so obviously I expect the code to be somewhat slower. How much slower, and whether I am leaving any optimizations on the table in the matrix form of the filter, are open questions.
To get a sense of how much slowdown I should expect when moving from scalar math to linear algebra, I ran everything in pure Python and saw a 25x slowdown, so Aesara is actually doing a bit better here. To get a sense of what optimizations are possible, I put `njit` decorators on all the functions and re-ran the speed tests. This brought times down to 9.33 µs ± 500 ns and 733 µs ± 3.13 µs for the scalar and matrix versions, respectively. So linear algebra slows Numba down the most, by 78x, but in absolute time it is still the fastest implementation (a sketch of the Numba scalar version is shown below). Here is a little table summarizing the results:
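The Numba scalar version mentioned above presumably looked something like this: the same recursion written as a plain NumPy loop under `njit` (a sketch under that assumption, not the gist's exact code):

```python
import numpy as np
from numba import njit


@njit
def scalar_kalman_filter_nb(y, a0, P0, T, Z, R, H, Q):
    # Scalar Kalman filter: accumulate the log-likelihood over the time series.
    a, P, ll = a0, P0, 0.0
    for t in range(y.shape[0]):
        v = y[t] - Z * a
        F = Z * P * Z + H
        K = T * P * Z / F
        a = T * a + K * v
        P = T * P * (T - K * Z) + R * Q * R
        ll += -0.5 * (np.log(2 * np.pi) + np.log(F) + v ** 2 / F)
    return ll


# Toy parameter values just to exercise the compiled function.
y = np.random.randn(100)
print(scalar_kalman_filter_nb(y, 0.0, 1.0, 0.9, 1.0, 1.0, 0.1, 0.5))
```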
Back to Aesara: @brandonwillard told me that all efforts start with `aesara.dprint`, so I checked the graph for the log-likelihood of the matrix function, but nothing jumps out at me as being "off"; there's no obvious duplicate computation or bug happening.
dprint output for matrix kalman filter log-likelihood term
So what have I learned? Linear algebra is slower than scalar multiplication. Thanks for coming to my TED talk. In all seriousness, what can be done to speed up the matrix formulation of the filter? From looking at the profile output and the `aesara.dprint` output, I don't see any obvious paths forward. The final goal is to be able to sample using this log-likelihood function, so speed is absolutely critical. With the simplest possible model already taking 5 minutes, it's quite disheartening. For comparison, an ARMA(1,1) model takes over 15 minutes, and that one involves only 2x2 matrices. Scaling gets worse from there.
All the code used to make this post is available in this gist if anyone is curious. I am really hoping that I am missing something obvious. At the very least, helpful pointers about how to read the profile summary and dprint outputs would be appreciated.