Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Roadmap for tests #50

Open
2 of 7 tasks
BenWeber42 opened this issue Oct 6, 2020 · 6 comments
Open
2 of 7 tasks

Roadmap for tests #50

BenWeber42 opened this issue Oct 6, 2020 · 6 comments
Labels
large Requires large effort/time med-prio Medium priority issue/task tech-debt Technical debt that is overdue

Comments

@BenWeber42
Copy link
Contributor

BenWeber42 commented Oct 6, 2020

Improving dusk's tests will probably take a bit. A preliminary roadmap is outlined here:

  • Add basic testing infrastructure ( Infrastructure for version 0.5 #37 )
  • Add sparse field tests ( Added sparse fill tests and improved reduction tests #43 )
  • Allow for negative tests (we want to be able to say that some stencil code should lead to a particular exception) ( see Negative Tests #74 )
  • Enable the naive & cuda backends for testing (it's good if dusk works well with its backend)
  • Allow for checking against reference SIR (other output?) to ensure the translation was correct
  • Allow good ways to write different kinds of stencil tests (decorator + builder pattern?)
  • Enable code coverage metrics if it doesn't take too much effort
@Stagno
Copy link
Contributor

Stagno commented Oct 6, 2020

Current tests rely on dawn to check that the output of dusk (SIR) is correct, however dawn is an unreliable validator of SIR, as it gives many false negatives (most likely the SIR will be correct, but dawn will fail in either lowering, optimization or codegen because of some implementation bug).

Although our serialized formats of SIR and IIR are not the most user friendly ones, it looks like the "industry standard" for testing compiler components is by validating the serialized output IR (or parts of it) against a reference (specifically I'm referring to how MLIR approaches this problem).
Rightly you raised the point that, in order to produce the correct reference, the programmer implementing a change in dusk would need to be himself a good validator of SIR, but that's difficult because SIR lacks good documentation and it's not always clear to everybody how to represent something in SIR. Also addition of new concepts means addition of new SIR structures.

I'm convinced that the following procedure to add a test in dusk will solve the above problem:

  • you write the example source,
  • you use dusk to translate it to SIR (if any problem occurs at this point it's within dusk, so it should be solved first) and serialize it to file,
  • you pass the SIR to dawn to code generate it, two possible outcomes:
    • no errors, you get valid generated code (need to visually check that it's correct, but this doesn't require obscure validation skills), or
    • some error occurred: at this point you, either by yourself or with the help of a "dawn senior", figure out whether the problem is
      • incorrect SIR, then fix something in dusk to make it correct and retry, or
      • some problem in dawn, then report it and since it's not a problem with your SIR you can go ahead,
  • store the SIR file as reference,
  • write the test to compare against serialized reference.

Once you setup your tests in this way, they will never fail because of some bug in dawn passes or codegen. If they fail it's because wrong SIR was produced.

Tests that aim at covering the interaction between the tools should belong to the e2e repo.

For what concerns negative tests i see two categories:

  • stencil code contains a semantic error, which can be caught only be running analyses/checks and, since dawn takes care of those, such tests belong to the e2e repo,
  • stencil code contains a syntax error, which should trigger an error in dusk to be caught and checked within dusk testing framework.

@BenWeber42
Copy link
Contributor Author

You're right, I meant to have the list include the ability to check against reference SIR. I updated the issue accordingly.

There are two problems still:

Once you setup your tests in this way, they will never fail because of some bug in dawn passes or codegen. If they fail it's because wrong SIR was produced.

They can also fail if dusk produces correct SIR but different SIR (e.g., because an existing element is translated differently in a new change).

stencil code contains a semantic error, which can be caught only be running analyses/checks and, since dawn takes care of those, such tests belong to the e2e repo,

Dawn can't do all semantic checks. Field accesses in ambiguous neighbor iterations need mandatory horizontal offsets. Outside of ambiguous neighbor iterations, horizontal offsets are voluntary and will be correctly inferred. In SIR horizontal offsets are always declared. So checking for this can only happen in dusk right now. Same with the semantics of index fields.

@Stagno
Copy link
Contributor

Stagno commented Oct 7, 2020

They can also fail if dusk produces correct SIR but different SIR (e.g., because an existing element is translated differently in a new change).

But if a new change updates how an element is translated, tests that were checking the old translation of the element are to be considered invalid. This means that you should follow again the procedure from the beginning.

Dawn can't do all semantic checks. Field accesses in ambiguous neighbor iterations need mandatory horizontal offsets. Outside of ambiguous neighbor iterations, horizontal offsets are voluntary and will be correctly inferred. In SIR horizontal offsets are always declared. So checking for this can only happen in dusk right now. Same with the semantics of index fields.

Very correct. This means that there's not a 1-to-1 map: syntax = "domain of dusk", semantics = "domain of dawn".
However it remains that the checks that are implemented in dawn should be tested:

  • in dawn's internal tests,
  • in end-to-end tests exercising dusk+dawn in the e2e repo.

Analogous argument for dusk.

@mroethlin
Copy link
Contributor

I would just like to weigh in that I feel that tests exercising the combination of dusk + dawn (e.g. by checking that dawn does indeed raise an error in the UnstructuredDimensionsChecker for a dimensionaly inconsistent dusk stencil) do not necessarily belong in the e2e repo. In the e2e repo we test full examples, which do have a driver, and tests ensure that the output is the same as a reference solution / two backends behave the same.

I think we could and should use dusk also as a productivity tool to quickly cover a lot of cases in dawn, but I don't think the e2e repo, as it is currently understood, would be the right place for that.

@Stagno
Copy link
Contributor

Stagno commented Oct 7, 2020

I think we should clarify how we understand icondusk-e2e, or rather how we intend it to be.
My understanding is that we want it to be a repository for both ICON stencils written in dusk and end-to-end tests of the dusk+dawn toolchain.
Also in my understanding, a driver in icondusk-e2e (regardless of whether it's in C++ or Fortran) is a test stub/mock for the real driver code (part of ICON) which takes care of preparing the input, running the stencil and verifying it. In this context, a driver is not part of what's being tested, but rather it constitutes the code of a test (a positive test, to be specific).

I think the most important question to be answered when designing a test is: what software components should the test cover? That determines the type of test: unit test, integration test or end-to-end test.
End-to-end tests are the ones covering all the stages and their purpose is to simulate a real usage scenario, regardless of whether it's a successful scenario or not. It shouldn't matter if the output to be checked are output fields (when a successful outcome is expected) or errors/exceptions thrown by the toolchain (when an ill-formed input was intentionally given). Trying to compile an ill-formed stencil source (unintentionally) is also a real usage scenario.

Given these premises, I'd say that

checking that dawn does indeed raise an error in the UnstructuredDimensionsChecker for a dimensionaly inconsistent dusk stencil

should be categorized as negative (cause it expects an error/exception) end-to-end (cause it covers dusk+dawn) test.

The risk that I see in allowing testing also dawn-opt (and possibly dawn-codegen) within dusk's testing framework, is that we might be tempted to write only an integration (or an end-to-end) test and not a "pure"1 dusk-only test (checks SIR against serialized reference) when introducing a new feature or a change in dusk. Don't get me wrong: integration and end-to-end tests are great in discovering problems! But they should test the integration of components and shouldn't substitute unit tests (which pinpoint much better the faulty component).

An example scenario:
You implement a new feature in dusk and in the same PR you write tests that involve dawn-opt for the validity checks. Since you provided some tests soliciting your new feature, you might be tempted to think that you're set, because if your implementation wasn't correct, the new tests would surely tell you.
Your implementation is correct and the tests initially pass. At some point a change in dawn breaks your dusk tests (the CI of the dusk repo), but the code of dusk remained the same. I say this is very odd.
What I would like to see, instead, is the CI of the e2e's that fails (if dawn tests didn't fail).

1 As discussed in multiple occasions, unit testing in compilers follows a more relaxed definition. Serialization/deserialization routines are not substituted by mock objects and they are even often employed by tests themselves.

@BenWeber42
Copy link
Contributor Author

For transparency: We had a bigger discussion on this. We agree that it's good to properly separate different kinds of tests and try to strive for a clean design when it comes to testing. However, this takes a long time to develop. We also noticed that we found many bugs simply by writing many combinations of different dusk/dawn features. With dusk we can do this rather cheaply (very similar to fuzzing). It's not yet clear how we can get the best of both worlds. We'd like to find a practical way that gives us both.

@BenWeber42 BenWeber42 added tech-debt Technical debt that is overdue large Requires large effort/time med-prio Medium priority issue/task labels Jan 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
large Requires large effort/time med-prio Medium priority issue/task tech-debt Technical debt that is overdue
Projects
None yet
Development

No branches or pull requests

3 participants