Some data in the State variable may be corrupt #625

sjsprecious · 2024-08-19T17:47:01Z

As identified in https://github.com/NCAR/MUSICA-Performance-Comparison/issues/62, if we reuse the State variable between different iterations of the solve function, the results are corrupted and we later get a "failed to converge" error message.

Kyle commented that this was also the reason why he had to initialize the LU matrix to zero originally (#587).

This issue indicates that the LU matrix in the State variable is not overwritten correctly between iterations (or should we not expect it to be overwritten correctly at all?).


K20shores · 2024-08-19T18:31:48Z

The state should be completely reusable without any zeroing, from my understanding. The fact that it isn't is a problem.

sjsprecious · 2024-08-19T19:19:54Z

Thanks, Kyle, for the clarification. I looked at these two lines: https://github.com/NCAR/micm/blob/main/include/micm/solver/lu_decomposition.inl#L201 and https://github.com/NCAR/micm/blob/main/include/micm/solver/lu_decomposition.inl#L214. Could U_vector pick up values from the previous time step if it is not zeroed first?

K20shores · 2024-08-19T20:22:53Z

We set them just before those lines, but the assignment is inside an if block, so maybe not all of the values are set. Perhaps we should add back the zeroing of L and U.

sjsprecious · 2024-08-20T15:52:33Z

Thanks @K20shores for your comments. Hmm, interesting. When the State variable is first created, are the L and U matrices zeroed?

K20shores · 2024-09-17T16:37:21Z

I've learned some things. Thanks to @mattldawson's suggestion, I set the L and U matrices to Inf and found out that sometimes we end up with infs after the LU decomposition.
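The poisoning trick described above can be sketched roughly as follows. This is only an illustration of the debugging idea, not micm code: `Poison`, `NonFiniteIndices`, and `IncompleteWrite` are hypothetical names, and `IncompleteWrite` is a deliberately buggy stand-in for a routine that fails to write every element of its output.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Fill a buffer with +inf so that any element a later routine fails to
// overwrite remains non-finite and is easy to spot.
inline void Poison(std::vector<double>& values)
{
  std::fill(values.begin(), values.end(), std::numeric_limits<double>::infinity());
}

// Return the indices of all elements that are inf or NaN.
inline std::vector<std::size_t> NonFiniteIndices(const std::vector<double>& values)
{
  std::vector<std::size_t> bad;
  for (std::size_t i = 0; i < values.size(); ++i)
    if (!std::isfinite(values[i]))
      bad.push_back(i);
  return bad;
}

// Hypothetical buggy routine: copies every element of `in` to `out`
// except the one at index `skipped`.
inline void IncompleteWrite(std::vector<double>& out, const std::vector<double>& in, std::size_t skipped)
{
  assert(out.size() == in.size());
  for (std::size_t i = 0; i < in.size(); ++i)
    if (i != skipped)
      out[i] = in[i];
}
```

Poisoning the output, running the routine, and then calling `NonFiniteIndices` pinpoints which entries were never written, which is the kind of evidence the inf-filled L and U matrices gave here.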

The issue seems to be related to the interplay between the number of blocks (grid cells) and the number of species.

Sizes leading to valid L and U matrices

Species (size)	Blocks (grid cells)
1	1
2	1
2	2
3	2

Sizes leading to L and U matrices with infs

Species (size)	Blocks (grid cells)
3	1
3	2
3	3
3	4

K20shores · 2024-09-17T21:39:47Z

Also, the math on GeeksforGeeks for this algorithm indicates that U should always take a value from the input matrix, but I can see in the debugger that we sometimes don't. I think this means we are missing some do_aik_ values, which are created in the initialize function.

micm/include/micm/solver/lu_decomposition.inl

Lines 56 to 66 in 94c22da

    
    if (matrix.IsZero(i, k))
    {
      if (nkj == 0 && k != i)
        continue;
      do_aik_.push_back(false);
    }
    else
    {
      do_aik_.push_back(true);
      aik_.push_back(matrix.VectorIndex(0, i, k));
    }

Missing values mean that when we set the U vector here, we subtract from whatever value U already holds (in this case inf). We haven't seen this before because we always initialized U to 1e-30, which is basically zero. As a result, the first decomposition produces the correct values, but repeated use doesn't overwrite the old ones.

micm/include/micm/solver/lu_decomposition.inl

Lines 197 to 203 in 94c22da

    
    if (*(do_aik++))
      U_vector[uik_nkj->first] = A_vector[*(aik++)];
    for (std::size_t ikj = 0; ikj < uik_nkj->second; ++ikj)
    {
      U_vector[uik_nkj->first] -= L_vector[lij_ujk->first] * U_vector[lij_ujk->second];
      ++lij_ujk;
    }
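To make the failure mode concrete, here is a minimal sketch that mirrors only the shape of the loop quoted above: an optional copy from A followed by in-place subtraction. `UpdateEntry` and `RepeatUpdate` are hypothetical helpers written for this illustration and are not part of micm.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for one U entry update: copy from A only when
// copy_from_a is true, then subtract each (L value * U value) product.
inline void UpdateEntry(double& u, bool copy_from_a, double a,
                        const std::vector<std::pair<double, double>>& products)
{
  if (copy_from_a)
    u = a;
  for (const auto& p : products)
    u -= p.first * p.second;
}

// Apply the same update `passes` times to one reused storage slot and
// return the value after each pass.
inline std::vector<double> RepeatUpdate(double initial, bool copy_from_a, double a,
                                        const std::vector<std::pair<double, double>>& products,
                                        std::size_t passes)
{
  assert(passes > 0);
  std::vector<double> history;
  double u = initial;
  for (std::size_t pass = 0; pass < passes; ++pass)
  {
    UpdateEntry(u, copy_from_a, a, products);
    history.push_back(u);
  }
  return history;
}
```

With a single product of 2 * 3 and the copy skipped, a slot that starts at zero reads -6 after the first pass and -12 after the second, so the first use looks right and reuse drifts. If the slot starts at inf instead, it stays inf. When the copy is performed on every pass, the result is the same each time.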

sjsprecious · 2024-09-18T18:36:47Z

Thanks to @K20shores, we tracked the issue down further. Given an initial Jacobian matrix of size 4x4 with nonzero elements marked as X below:

X 0 0 0 
0 X X X 
X X X X 
0 X 0 X

The expected L matrix should be:

X 0 0 0 
0 X 0 0 
X X X 0 
0 X X X

While we are constructing the third element of the fourth row, L[3][2] is not first overwritten by the corresponding Jacobian matrix element.

micm/include/micm/solver/lu_decomposition.inl

Line 211 in 94c22da

L_vector[lki_nkj->first] = A_vector[*(aki++)];

Since we are now initializing the L matrix with inf, this leads to an inf - (some value) operation later, which corrupts the result.

micm/include/micm/solver/lu_decomposition.inl

Line 214 in 94c22da

    
           L_vector[lki_nkj->first] -= L_vector[lkj_uji->first] * U_vector[lkj_uji->second];

After discussion with Kyle, the lines

micm/include/micm/solver/lu_decomposition.inl

Line 56 in 94c22da

if (matrix.IsZero(i, k))

and

micm/include/micm/solver/lu_decomposition.inl

Line 85 in 94c22da

if (matrix.IsZero(k, i))

can be problematic. The nonzero indices of the lower and upper matrices should be set by checking the lower and upper matrices themselves rather than the Jacobian matrix, unless we have misunderstood what is going on here.

K20shores · 2024-09-18T22:26:27Z

@jian helped me pin down the problem.

The issue is that the sparsity pattern of the LU matrices doesn't necessarily match the sparsity pattern of the Jacobian matrix. In the algorithm, the LU matrices need to be filled with the value from the A matrix wherever the LU matrices are nonzero.
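The mismatch between the two sparsity patterns can be reproduced with a small symbolic elimination. The sketch below assumes LU without pivoting and no exact numerical cancellation; `Pattern`, `SymbolicLU`, and `FillIn` are names invented for this example and do not correspond to micm functions.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Pattern = std::vector<std::vector<bool>>;

// Symbolic elimination: whenever row i has a nonzero in pivot column k,
// every nonzero to the right of the pivot in row k also becomes a
// (structural) nonzero in row i. The result is the combined pattern of L and U.
inline Pattern SymbolicLU(Pattern pattern)
{
  const std::size_t n = pattern.size();
  for (const auto& row : pattern)
  {
    assert(row.size() == n);  // the pattern must be square
    (void)row;
  }
  for (std::size_t k = 0; k < n; ++k)
    for (std::size_t i = k + 1; i < n; ++i)
    {
      if (!pattern[i][k])
        continue;
      for (std::size_t j = k + 1; j < n; ++j)
        if (pattern[k][j])
          pattern[i][j] = true;
    }
  return pattern;
}

// Entries that are nonzero in L or U but zero in the original matrix.
inline std::vector<std::pair<std::size_t, std::size_t>> FillIn(const Pattern& a)
{
  const Pattern lu = SymbolicLU(a);
  std::vector<std::pair<std::size_t, std::size_t>> fill;
  for (std::size_t i = 0; i < a.size(); ++i)
    for (std::size_t j = 0; j < a.size(); ++j)
      if (lu[i][j] && !a[i][j])
        fill.emplace_back(i, j);
  return fill;
}
```

Applied to the 4x4 Jacobian pattern shown earlier in this thread, `FillIn` reports exactly one entry, (3, 2), which is the L[3][2] element that was never initialized from the Jacobian.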

The first change we needed to make was to record when to set the LU matrix values based on where the LU matrices are nonzero, not where the Jacobian is nonzero, as was happening before.

The second change is to record a sentinel value. This value indicates that the L or U matrix is nonzero at a position where the Jacobian is actually zero. In this case, we set the LU matrix entry to zero rather than copying from the Jacobian matrix.
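One way the sentinel idea could look is sketched below. This is an assumption-laden illustration rather than the change made in the linked pull request: `kZeroFill`, `ResetFromA`, and the flat `source` index vector are all hypothetical, and the real code may store its indices quite differently.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical sentinel: marks an LU entry that has no counterpart in A
// (a fill-in position) and must therefore be reset to zero.
constexpr std::size_t kZeroFill = std::numeric_limits<std::size_t>::max();

// For every stored LU entry, either copy the matching A value or write an
// explicit zero, so that no entry keeps a value from a previous solve.
// source[i] is the index into `a` for LU entry i, or kZeroFill.
inline void ResetFromA(std::vector<double>& lu, const std::vector<double>& a,
                       const std::vector<std::size_t>& source)
{
  assert(lu.size() == source.size());
  for (std::size_t i = 0; i < lu.size(); ++i)
  {
    if (source[i] == kZeroFill)
    {
      lu[i] = 0.0;
    }
    else
    {
      assert(source[i] < a.size());
      lu[i] = a[source[i]];
    }
  }
}
```

Because every stored entry is written unconditionally, the later `-=` updates always start from a known value, regardless of what the reused State held before, including inf.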

sjsprecious self-assigned this Aug 19, 2024

sjsprecious added the bug Something isn't working label Aug 19, 2024

sjsprecious added this to the CUDA Rosenbrock Solver milestone Aug 19, 2024

sjsprecious removed their assignment Aug 19, 2024

K20shores mentioned this issue Sep 10, 2024

Provide an interface to the micm State NCAR/musica#217

Open

K20shores self-assigned this Sep 10, 2024

sjsprecious mentioned this issue Sep 19, 2024

More strict tolerance for unit/integration tests #662

Open

This was referenced Sep 19, 2024

Inspect the Robertson integration test for accuracy #578

Open

correcting indexing for LU decomposition #663

Open

K20shores linked a pull request Sep 19, 2024 that will close this issue

correcting indexing for LU decomposition #663

Open
