Roadmap for tests #50
Current tests rely on dawn to check that the output of dusk (SIR) is correct. However, dawn is an unreliable validator of SIR, as it gives many false negatives (most likely the SIR will be correct, but dawn will fail in lowering, optimization, or codegen because of some implementation bug). Although our serialized formats of SIR and IIR are not the most user-friendly, the "industry standard" for testing compiler components is to validate the serialized output IR (or parts of it) against a reference (specifically, I'm referring to how MLIR approaches this problem). I'm convinced that the following procedure for adding a test in dusk will solve the above problem:
Once you set up your tests in this way, they will never fail because of some bug in dawn's passes or codegen; if they fail, it's because wrong SIR was produced. Tests that aim at covering the interaction between the tools should belong to the e2e repo. As for negative tests, I see two categories:
You're right, I meant for the list to include the ability to check against reference SIR; I updated the issue accordingly. There are still two problems:
They can also fail if dusk produces correct but different SIR (e.g., because a new change translates an existing element differently).
Dawn can't do all semantic checks. Field accesses in ambiguous neighbor iterations need mandatory horizontal offsets. Outside of ambiguous neighbor iterations, horizontal offsets are optional and will be correctly inferred. In SIR, horizontal offsets are always declared, so checking for this can currently only happen in dusk. The same holds for the semantics of index fields.
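The mandatory-offset rule described above could be enforced in dusk by a small semantic check along these lines (purely illustrative: `FieldAccess` and `check_horizontal_offsets` are hypothetical stand-ins for dusk's internals, not its real API):

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FieldAccess:
    # Hypothetical simplified AST node for a field access.
    name: str
    # None means the user wrote no horizontal offset in the source.
    horizontal_offset: Optional[Tuple[int, int]]


def check_horizontal_offsets(
    accesses: List[FieldAccess], in_ambiguous_neighbor_iteration: bool
) -> List[str]:
    """Collect errors for accesses that omit a mandatory horizontal offset.

    Inside an ambiguous neighbor iteration the offset is mandatory; elsewhere
    it is optional and can be inferred later (so no error is reported).
    """
    errors = []
    for acc in accesses:
        if in_ambiguous_neighbor_iteration and acc.horizontal_offset is None:
            errors.append(
                f"field '{acc.name}': horizontal offset is mandatory "
                "inside an ambiguous neighbor iteration"
            )
    return errors
```

A dusk-only unit test could then assert on the returned error list directly, without ever invoking dawn.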
But if a new change updates how an element is translated, tests that checked the old translation of that element must be considered invalid. This means the procedure has to be followed again from the beginning.
Very correct. This means there's no 1-to-1 mapping: syntax = "domain of dusk", semantics = "domain of dawn".
An analogous argument applies to dusk.
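As an aside, the reference-based validation discussed in this thread could be sketched as a simple golden-file check (a hedged sketch: `check_against_reference` is a hypothetical helper, not dusk's actual testing API, and the reference-file layout is assumed):

```python
import difflib
from pathlib import Path


def check_against_reference(generated_sir: str, reference_path: Path) -> None:
    """Fail with a readable diff if serialized SIR deviates from the reference.

    The reference file is the checked-in "golden" serialization; any change in
    how dusk translates an element shows up here as a unified diff.
    """
    reference = reference_path.read_text()
    if generated_sir != reference:
        diff = "\n".join(
            difflib.unified_diff(
                reference.splitlines(),
                generated_sir.splitlines(),
                fromfile=str(reference_path),
                tofile="generated SIR",
                lineterm="",
            )
        )
        raise AssertionError(f"serialized SIR differs from reference:\n{diff}")
```

When a translation legitimately changes, the failing diff is exactly what needs to be reviewed before regenerating the reference.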
I would just like to weigh in on tests exercising the combination of dusk + dawn (e.g. by checking that dawn does indeed raise an error in the [...]). I think we could and should use dusk also as a productivity tool to quickly cover a lot of cases in dawn, but I don't think the e2e repo, as it is currently understood, would be the right place for that.
I think we should clarify how we understand [...]. I think the most important question to be answered when designing a test is: what software components should the test cover? That determines the type of test: unit test, integration test, or end-to-end test. Given these premises, I'd say that
should be categorized as a negative (because it expects an error/exception) end-to-end (because it covers dusk + dawn) test. The risk I see in also allowing dawn-opt (and possibly dawn-codegen) to be tested within dusk's testing framework is that, when introducing a new feature or a change in dusk, we might be tempted to write only an integration (or end-to-end) test and not a "pure"¹ dusk-only test (one that checks SIR against a serialized reference). Don't get me wrong: integration and end-to-end tests are great at discovering problems! But they should test the integration of components and shouldn't substitute for unit tests (which pinpoint the faulty component much better). An example scenario:

¹ As discussed on multiple occasions, unit testing in compilers follows a more relaxed definition: serialization/deserialization routines are not substituted by mock objects, and they are even often employed by the tests themselves.
For transparency: we had a bigger discussion on this. We agree that it's good to properly separate different kinds of tests and to strive for a clean design when it comes to testing. However, this takes a long time to develop. We also noticed that we found many bugs simply by writing many combinations of different dusk/dawn features. With dusk we can do this rather cheaply (very similar to fuzzing). It's not yet clear how we can get the best of both worlds, but we'd like to find a practical way that gives us both.
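The cheap, fuzzing-like combination coverage mentioned above could be generated with something as simple as `itertools.product` (the feature lists below are made-up placeholders; the enumeration pattern is the point, not the specific values):

```python
import itertools
from typing import Dict, Iterator

# Hypothetical feature axes to combine; real ones would mirror dusk features.
VERTICAL_REGIONS = ["forward", "backward"]
NEIGHBOR_CHAINS = ["edge->cell", "cell->edge", "vertex->edge"]
REDUCTIONS = ["sum", "min"]


def generate_cases() -> Iterator[Dict[str, str]]:
    """Yield one test-case descriptor per combination of features.

    Each descriptor could drive generation of a small dusk stencil, which is
    then compiled through dawn to shake out bugs in rarely-hit combinations.
    """
    for region, chain, reduction in itertools.product(
        VERTICAL_REGIONS, NEIGHBOR_CHAINS, REDUCTIONS
    ):
        yield {"vertical_region": region, "chain": chain, "reduction": reduction}
```

With three axes of 2, 3, and 2 values, this already yields 12 cases from a handful of lines, which is what makes the approach cheap.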
Improving dusk's tests will probably take a bit. A preliminary roadmap is outlined here: