Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some data in the State variable may be corrupt #625

Open
sjsprecious opened this issue Aug 19, 2024 · 8 comments · May be fixed by #663
Open

Some data in the State variable may be corrupt #625

sjsprecious opened this issue Aug 19, 2024 · 8 comments · May be fixed by #663
Assignees
Labels
bug Something isn't working

Comments

@sjsprecious
Copy link
Collaborator

As identified by this issue https://github.com/NCAR/MUSICA-Performance-Comparison/issues/62, if we reuse the State variable between different iterations of solve function, the results are corrupt and we get fail to converge error message later.

Kyle commented that this was also the reason why he had to initialize the LU matrix to zero originally (#587).

This issue indicates that the LU matrix in State variable is not overwritten correctly between iterations (or we should not expect it to be overwritten correctly at all?).

@sjsprecious sjsprecious self-assigned this Aug 19, 2024
@sjsprecious sjsprecious added the bug Something isn't working label Aug 19, 2024
@sjsprecious sjsprecious added this to the CUDA Rosenbrock Solver milestone Aug 19, 2024
@sjsprecious sjsprecious removed their assignment Aug 19, 2024
@K20shores
Copy link
Collaborator

K20shores commented Aug 19, 2024

The state should be completely reusable without any zeroing, from my understanding. The fact that it isn't is a problem

@sjsprecious
Copy link
Collaborator Author

Thanks Kyle for the clarification. I looked at these two lines https://github.com/NCAR/micm/blob/main/include/micm/solver/lu_decomposition.inl#L201 and https://github.com/NCAR/micm/blob/main/include/micm/solver/lu_decomposition.inl#L214. Could the Uvector use the values from the previous time step without a zeroing?

@K20shores
Copy link
Collaborator

We set them just before those, but it's in an if block so maybe not all of the values are zeros. Perhaps we should add back a zero of L and U

@sjsprecious
Copy link
Collaborator Author

Thanks @K20shores for your comments. Hmm, interesting. For the first time when the State variable is generated, are the L and U matrices zeroed?

@K20shores
Copy link
Collaborator

I've learned some things. Thanks to @mattldawson's suggestion, I set the L and U matrices to Inf and found out that sometimes we end up with infs after the LU decomposition.

The issue seems to be related to the interplay between block size (which are grid cells) and the number of species.

Sizes leading to valid L and U matrices

Species (size) Blocks (grid cells)
1 1
2 1
2 2
3 2

Sizes leading to L and U matrices with infs

Species (size) Blocks (grid cells)
3 1
3 2
3 3
3 4

@K20shores
Copy link
Collaborator

K20shores commented Sep 17, 2024

Also, the math on geeks for geeks for this algorithm indicates that U should always take a value from the input matrix, but I can see in the debugger that we sometimes don't. I think this means we are missing some do_aik_ values which are made in the initialize function

Image

if (matrix.IsZero(i, k))
{
if (nkj == 0 && k != i)
continue;
do_aik_.push_back(false);
}
else
{
do_aik_.push_back(true);
aik_.push_back(matrix.VectorIndex(0, i, k));
}

Missing values mean that when we set the U vector here we are subtracting a value from whatever the value of U is (in this case inf). The reason we haven't seen this before is because we were always initializing U to 1e-30, which is basically zero. This means that the first time we do an operation we are getting the correct values, but repeated usage isn't overwriting the values.

if (*(do_aik++))
U_vector[uik_nkj->first] = A_vector[*(aik++)];
for (std::size_t ikj = 0; ikj < uik_nkj->second; ++ikj)
{
U_vector[uik_nkj->first] -= L_vector[lij_ujk->first] * U_vector[lij_ujk->second];
++lij_ujk;
}

@sjsprecious
Copy link
Collaborator Author

Thanks to @K20shores , we track down the issue further. Given an initial Jacobian matrix with size 4x4 and nonzero elements marked as X below:

X 0 0 0 
0 X X X 
X X X X 
0 X 0 X 

The expected L matrix should be:

X 0 0 0 
0 X 0 0 
X X X 0 
0 X X X 

While we are constructing the third element of the fourth row,L[3][2] is not overwritten by the corresponding Jacobian matrix element first.

L_vector[lki_nkj->first] = A_vector[*(aki++)];

Since we are initializing the L matrix with inf now, this leads to a inf - some value operation later and corrupts the result.
L_vector[lki_nkj->first] -= L_vector[lkj_uji->first] * U_vector[lkj_uji->second];

After discussion with Kyle, the lines

if (matrix.IsZero(i, k))
and
if (matrix.IsZero(k, i))
can be problematic. The nonzero index of lower and upper matrices should be set by checking the lower and upper matrices themselves, rather than the Jacobian matrix unless we have a misunderstanding of what is going on here.

@K20shores
Copy link
Collaborator

@jian helped me pin down the problem.

The issue is that the sparsity pattern of the LU matrices doesn't necessarily match the sparsity pattern of the jacobian matrix. In the algorithm, the LU matrices need to be filled with the value from the A matrix when the LU matrices are nonzero.

The first change we needed to make was to accurately record when to set the LU matrix values based off of when the LU matrices are zero, not the jacobian as was happening before.

The second change is to record a sentinel value. This value indicates when the L or U matrix is nonzero and the jacobian is actually zero. In this case, we set the LU matrix to zero rather than copying from the jacobian matrix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants