Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Format preprocessed token stream in multiple passes #1898

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

kbieganski
Copy link
Collaborator

@kbieganski kbieganski commented Apr 28, 2023

Currently, the formatter doesn't handle many scenarios involving preprocessor ifdefs/endifs interleaved with begins, module headers, etc. (#228, #241, #267) This patch attempts to solve this problem by performing multiple passes of the formatter on preprocessed variants of the source. Each of these variants has a different set of preprocessor branches enabled. Together, they should cover the entire source (though that doesn't work in all cases yet). After several formatting passes for different variants of the AST, a correct and properly formatted file is produced.

This is still work in progress, so not everything works, and the code isn't very clean. I'd love to get some early feedback on this.

@kbieganski kbieganski requested a review from hzeller April 28, 2023 15:32
@kbieganski
Copy link
Collaborator Author

@hzeller Pinging in case you missed it

@kbieganski
Copy link
Collaborator Author

@hzeller I looked at the other perprocessor PRs you mentioned, and I think most of it is in master, except for +define+ handling (even though Verible claims to support it? I may be missing something) and possibly other command line options. That's not really relevant to the formatter though.

But I did decide to reuse FlowTree, which I didn't even realize was available in master, so thanks for pointing me to those PRs.

@hzeller
Copy link
Collaborator

hzeller commented May 23, 2023

I think the other pull request has been mostly added to Verible in separate sub-pull requests, so they are probably mostly added; mostly my remark was to figure out if there are things still missing.

Sorry, I did not have a look at this PR yet - I was busy, then vacation and then forgot about it. I'll have a look later today.

@kbieganski kbieganski force-pushed the formatter-preprocess branch 2 times, most recently from af3df1c to da412ef Compare May 24, 2023 10:36
@kbieganski
Copy link
Collaborator Author

The biggest challenges right now, from looking at the tests, are:

  • Matching column alignment across preprocessor control flow. Unfortunately, while formatting a single variant of the file, the formatter is completely unaware of any of the 'disabled' code. It is unable to infer alignment that takes all variants into account. I'm trying to do some sort of a pre-pass that would gather alignment info and then apply that to all formatting passes, but it's not going great so far. Perhaps it would be a good idea to clean this up and merge it behind a feature flag, and work on alignment in the scope of another PR, as I fear it will require an extensive gutting of the formatter code. That is provided the Verible maintainers like the approach presented in this PR :)
  • There are some problems with convergence. However, I think this is mostly caused by the previous point.

@hzeller
Copy link
Collaborator

hzeller commented Jun 17, 2023

I can imagine that filtering out branches can result in the potential problems you mentioned.

Maybe we can mostly mitigate that by trying to not include preprocessing branches that don't do any damage, and don't include them in the finding of MinCoverDefineVariants() ?

The following (intentionally unaligned) input for instance will be properly aligned in the current state of the formatter:

module foo(
input a,
`ifdef WITH_B
   input      [9999:0]  b,
`endif
  input wire [99:0] c
);
endmodule

So here, here we don't need to do anything special in the brancing of WITH_B as it does not create unbalanced syntax elements the parser would stumble upon. As a result, the formater sees all the parts necessary to generate pleasent alignment; however, if that same WITH_B is later used to create an unbalanced begin ... end block, then we're in trouble if we had not included it in the MinCoverDefineVariants()...

So maybe we need to not identify all the macros that create variants but drill down to only the preprocessing branches that cause invalid syntax trees if parsed without extra treatment ?

Detection of these might be a bit more complicated. Maybe we can have a very simple manual parser just looking at the token stream that e.g. counts if a block contains a balanced set of begin/end in each preprocessing branch (counter pluse/minus) ? There might be other common pairs, e.g. begin/end, module/endmodule ... so this might get a bit unwieldy if there are too many of these potentially nesting pairs to worry about...

If we have such a thing, it could also reduce the number of separate passes we need to do through the file and less chance of coming into a situation where we mess up alignment due to that.

Maybe we should have a video conference to discuss some ideas.

And yes, having the feature behind a flag can help (and if we're confident enough and it works well, we can switch it on by default, but allow to switch off).

(Sorry it took so long - I was distracted with other projects I am part of)

@@ -115,7 +118,9 @@ class VerilogPreprocess {

// Expand macro definition bodies, this will relexes the macro body.
bool expand_macros = false;
// TODO(hzeller): Provide a map of command-line provided +define+'s

MacroDefinitionRegistry macro_definitions;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might not be the right place for this registry, but I suspect this is here currently mostly for the experimenting.

The FileList struct has such information as well, which should probably be consolidated with what we have here.

The final location might probably be yet different; there are a bunch of TODO's around FileList anyway.

I suspect you just have it here for now to make quick progress, but once that is nearing completion, we probably need to address that 'where to put things'.

@@ -199,10 +213,10 @@ static Status ReformatVerilog(absl::string_view original_text,
}

static absl::StatusOr<std::unique_ptr<VerilogAnalyzer>> ParseWithStatus(
absl::string_view text, absl::string_view filename) {
absl::string_view text, absl::string_view filename,
const verilog::VerilogPreprocess::Config& preprocess_config = {}) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use the filter_branches config from the preprocess_config to enable the 'formatting with preprocessing' feature (so that we can test it with a flag)

} else if (IsPreprocessorControlFlow(verilog_tokentype(
uwline.TokensRange().begin()->TokenEnum()))) {
UnwrappedLine formatted_line = uwline;
formatted_line.SetIndentationSpaces(0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This zero is the indentation of the macro-line ? Maybe this can be made a constexpr somewhere so that it is easy to find in case someone wants to indent them in a particular way and wants to find the place
(not sure how common that is in verilog, but it could be some feature that people might request later)

if (!formatted_lines_.back().CompletedFormatting()) {
// Copy over any lines that did not finish wrap searching.
partially_formatted_lines.push_back(&uwline);
if (!uwline.TokensRange().front().token->visible) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some point we should worry about the cyclomatic complexity of this function, but good enough for this first work.

verilog/formatting/token_annotator.cc Outdated Show resolved Hide resolved
State new_state = old_state;
switch (old_state) {
case kPreprocessorControlFlow: {
if (token_type != PP_Identifier) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is that so that we don't start a new partition between the ifdef and the identifier ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

@kbieganski kbieganski marked this pull request as ready for review July 7, 2023 16:40
@kbieganski kbieganski force-pushed the formatter-preprocess branch 2 times, most recently from 7669ef9 to 2324a37 Compare July 9, 2023 20:02
@codecov-commenter
Copy link

codecov-commenter commented Jul 9, 2023

Codecov Report

Patch coverage: 94.07% and project coverage change: -0.01 ⚠️

Comparison is base (92dc626) 92.84% compared to head (1319cd3) 92.84%.

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1898      +/-   ##
==========================================
- Coverage   92.84%   92.84%   -0.01%     
==========================================
  Files         355      355              
  Lines       26249    26538     +289     
==========================================
+ Hits        24372    24640     +268     
- Misses       1877     1898      +21     
Impacted Files Coverage Δ
common/text/text_structure.h 100.00% <ø> (ø)
common/text/token_info.h 100.00% <ø> (ø)
common/util/vector_tree.h 98.66% <ø> (ø)
verilog/analysis/flow_tree.h 100.00% <ø> (ø)
verilog/analysis/verilog_project.cc 85.47% <ø> (ø)
verilog/formatting/formatter.h 100.00% <ø> (ø)
verilog/formatting/tree_unwrapper.h 66.66% <ø> (ø)
verilog/preprocessor/verilog_preprocess.h 100.00% <ø> (ø)
common/formatting/tree_unwrapper.cc 70.94% <9.09%> (-5.15%) ⬇️
verilog/formatting/tree_unwrapper.cc 94.46% <90.00%> (-0.33%) ⬇️
... and 13 more

... and 8 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@kbieganski kbieganski force-pushed the formatter-preprocess branch 2 times, most recently from 457da90 to ffe1581 Compare July 13, 2023 16:44
@hzeller
Copy link
Collaborator

hzeller commented Jul 13, 2023

Interesting that there are issues on Mac and Windows. I suspect that either newline confusion (on Windows) or some uninitialized value that randomly defaults to something else on these platforms.

Currently, the formatter doesn't handle many scenarios involving preprocessor
`ifdef`s/`endif`s interleaved with `begin`s, module headers, etc.
This patch attempts to solve this problem by performing multiple passes of the
formatting on preprocessed variants of the source. Each of these variants has a
different set of preprocessor branches enabled. Together, they should cover the
entire source (though that doesn't work in all cases yet). After several
formatting passes for different variants of the AST, a correct and properly
formatted file is produced.

Signed-off-by: Krzysztof Bieganski <[email protected]>
@kbieganski
Copy link
Collaborator Author

kbieganski commented Jul 14, 2023

The Windows and MacOS failures were basically due to some incorrect/UB pointer arithmetic. It works now. There's also the Read the Docs failure, not sure what that's about.

The multi-pass formatter is able to deal with all unit tests now, with the exception of two cases. First one:

module pd (
`ifdef FAA
    input  baaaz_t foo,
`else
    output reg     bar
`endif
);
endmodule : pd

(there's one more similar to this)
It sees a syntax error in the first port declaration. If FAA is defined, there's a dangling comma. I would argue Verible should either not accept this code, as it really is a syntax error, or just accept such dangling commas.

Second case:

`J(D(`ifdef e))

The multi-pass formatter doesn't change this, but according to the test, it should be formatted into:

`J(D(
   `ifdef e))

I don't really understand this test case, so I can't comment on the validity, but I think it's okay to not support this for now.

The multi-pass formatter can also format all of VeeR, except for two files where Verible's parser cannot deal with some preprocessor identifiers (but that doesn't work with the default formatter anyway). I haven't checked the result thoroughly, as it's a lot of code, but at least it doesn't crash and there is no divergence.

The PR is basically ready for review; the one architectural thing I don't like is that I had to put a new field in TokenInfo to distinguish tokens from different preprocessor passes. It's kind of weird, as it is only meaningful for the Verilog formatter. The fact that Verible is split into the common and verilog worlds does not help here. I don't have a great idea how to solve this for now.

@hzeller
Copy link
Collaborator

hzeller commented Feb 2, 2024

Just looking through pull requests that didn't finish up to the last development cycle. This one certainly should not get lost.

Since this needs rebase, I might rebase it into a new pull request. Probably collecting more comments here, then addressing them in the new PR. I am doing this in my free time, so probably only progress in evenings and weekends.

The use-after-move situations (clang-tidy 17 points them out nicely) is something I'd address, because we really can't assume that it would be safe to access afterwards (even though it typically works).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants