Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: structured diagnostics for the compiler #45

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Octachron
Copy link
Member

@Octachron Octachron commented Jul 19, 2024

This is a proposal for implementing structured diagnostics for the compiler, to be able to render compiler errors, debugging output, config information and toplevel output in structured formats like S-expression or JSON.

As a higher level overview, this RFC is mostly focused on the definition of a backward-compatibility and versioning policy for those structured diagnostics.

Rendered version

@goldfirere
Copy link
Contributor

I think having structured output from the compiler is great, and I think it is wise to think carefully about versioning and subtyping.

Yet I think this proposal would be strengthened with further examples, and proposed schemata for various outputs. Maybe this is too soon and that information is based saved until later? A few questions I have at this point:

  • The one example in this proposal uses Open_box and Close_box. I would expect instead a Box which contains the contents of the box. Is there a reason you went with explicit open and close directives?
  • The contents of the message are missing important semantic tags, saying that x and y are identifiers, for example. Is this at least forward-compatible with that addition? I would worry that simply adding it to the proposed schema would mix syntax (where to break text) with semantics too much.

I guess that's it for now, actually. I'm excited to see this posted!

@Octachron
Copy link
Member Author

Concerning the error message part:

  • The schema for format document was based on the stream representation for the format document. It sounds indeed better to convert it to a list of trees before printing.
  • The current set of tags should be fairly easy to extend:
    Currently the schema for format tags is defined with
type format_tag =
  | Unknown
  | String_tag of string
  | Error
  | Warning
  | Loc
  | Inline_code
  | Hint
  | Deletion
  | Insertion
  | Modification
  | Preservation

(see also the current implementation). Thus if we want to add a new constructor, we can add it as a derived constructor of String_tag.

Concerning the various schemata, my point of view is that there are two distinct situation in the current compiler.
Either, we already have a record type, or a similar data structure, which is built before being printed; or we are dumping output on stdout. The error report type

type msg = Fmt.t loc
type report = {
  kind : report_kind;
  main : msg;
  sub : msg list;
  footnote: Fmt.t option;
}

and the Config module falls in the first category. For those, I think it makes sense to start with schemata following the existing data types after checking that evolving it should not be painful.

For instance, for the error report, I knew of three points that I have considered to change at points:

  • Upgrading the list of messages to a tree of messages: this would be useful for functor error messages that are currently constructed as a tree structure then flattened to a list. If the msg type is a record, this would be a minor update.
  • Adding a variant of the message type which is associated to two locations, which could be done by adding a secondary optional fields (and messages).
  • Upgrading the report_kind in the case of warning to clearly separate the warning name from its warning number.

From the Config module, following the update policy proposed in this RFC will have avoided the issue in #13244, since the old field names would have been at most removed in a future major version after deprecating them. However, the output itself would be quite close from the existing going from

$ ocamlc -config
version: 5.2.0
standard_library_default: /home/angeletti/.opam/5.2.0+flambda/lib/ocaml
standard_library: /home/angeletti/.opam/5.2.0+flambda/lib/ocaml
ccomp_type: cc
c_compiler: gcc
ocamlc_cflags:  -O2 -fno-strict-aliasing -fwrapv -pthread -fPIC  -pthread
ocamlc_cppflags:  -D_FILE_OFFSET_BITS=64 
ocamlopt_cflags:  -O2 -fno-strict-aliasing -fwrapv -pthread -fPIC  -pthread
ocamlopt_cppflags:  -D_FILE_OFFSET_BITS=64 
bytecomp_c_compiler: gcc  -O2 -fno-strict-aliasing -fwrapv -pthread -fPIC  -pthread  -D_FILE_OFFSET_BITS=64 
native_c_compiler: gcc  -O2 -fno-strict-aliasing -fwrapv -pthread -fPIC  -pthread  -D_FILE_OFFSET_BITS=64 
bytecomp_c_libraries: -lzstd   -lm  -lpthread
native_c_libraries:  -lm  -lpthread
native_ldflags: 
native_pack_linker: ld -r -o 
native_compiler: true
architecture: amd64
model: default
int_size: 63
word_size: 64
system: linux
asm: as
asm_cfi_supported: true
with_frame_pointers: false
ext_exe: 
ext_obj: .o
ext_asm: .s
ext_lib: .a
ext_dll: .so
os_type: Unix
default_executable_name: a.out
systhread_supported: true
host: x86_64-pc-linux-gnu
target: x86_64-pc-linux-gnu
flambda: true
safe_string: true
default_safe_string: true
flat_float_array: true
function_sections: true
afl_instrument: false
tsan: false
windows_unicode: false
supports_shared_libraries: true
native_dynlink: true
naked_pointers: false
exec_magic_number: Caml1999X034
cmi_magic_number: Caml1999I034
cmo_magic_number: Caml1999O034
cma_magic_number: Caml1999A034
cmx_magic_number: Caml1999y034
cmxa_magic_number: Caml1999z034
ast_impl_magic_number: Caml1999M034
ast_intf_magic_number: Caml1999N034
cmxs_magic_number: Caml1999D034
cmt_magic_number: Caml1999T034
linear_magic_number: Caml1999L034

to

$ ocamlc -config -log-format json
{
  "metadata" : { "version" : [1, 0], "valid" : "Full"},
  "version" : "5.3.0+dev0-2023-12-22",
  "standard_library_default" : "/usr/local/lib/ocaml",
  "standard_library" : "/usr/local/lib/ocaml",
  "ccomp_type" : "cc",
  "c_compiler" : "gcc",
  "ocamlc_cflags" :
    " -O2 -fno-strict-aliasing -fwrapv -pthread -fPIC  -pthread",
  "ocamlc_cppflags" : " -D_FILE_OFFSET_BITS=64 ",
  "ocamlopt_cflags" :
    " -O2 -fno-strict-aliasing -fwrapv -pthread -fPIC  -pthread",
  "ocamlopt_cppflags" : " -D_FILE_OFFSET_BITS=64 ",
  "bytecomp_c_compiler" :
    "gcc  -O2 -fno-strict-aliasing -fwrapv -pthread -fPIC  -pthread  -D_FILE_OFFSET_BITS=64 ",
  "native_c_compiler" :
    "gcc  -O2 -fno-strict-aliasing -fwrapv -pthread -fPIC  -pthread  -D_FILE_OFFSET_BITS=64 ",
  "bytecomp_c_libraries" : "-lzstd   -lm  -lpthread",
  "native_c_libraries" : " -lm  -lpthread",
  "native_ldflags" : "",
  "native_pack_linker" : "ld -r -o ",
  "native_compiler" : true,
  "architecture" : "amd64",
  "model" : "default",
  "int_size" : 63,
  "word_size" : 64,
  "system" : "linux",
  "asm" : "as",
  "asm_cfi_supported" : true,
  "with_frame_pointers" : false,
  "ext_exe" : "",
  "ext_obj" : ".o",
  "ext_asm" : ".s",
  "ext_lib" : ".a",
  "ext_dll" : ".so",
  "os_type" : "Unix",
  "default_executable_name" : "a.out",
  "systhread_supported" : true,
  "host" : "x86_64-pc-linux-gnu",
  "target" : "x86_64-pc-linux-gnu",
  "flambda" : false,
  "safe_string" : true,
  "default_safe_string" : true,
  "flat_float_array" : true,
  "function_sections" : true,
  "afl_instrument" : false,
  "tsan" : false,
  "windows_unicode" : false,
  "supports_shared_libraries" : true,
  "native_dynlink" : true,
  "naked_pointers" : false,
  "exec_magic_number" : "Caml1999X035",
  "cmi_magic_number" : "Caml1999I035",
  "cmo_magic_number" : "Caml1999O035",
  "cma_magic_number" : "Caml1999A035",
  "cmx_magic_number" : "Caml1999Y035",
  "cmxa_magic_number" : "Caml1999Z035",
  "ast_impl_magic_number" : "Caml1999M035",
  "ast_intf_magic_number" : "Caml1999N035",
  "cmxs_magic_number" : "Caml1999D035",
  "cmt_magic_number" : "Caml1999T035",
  "linear_magic_number" : "Caml1999L035"
}

For the cases where the compiler is current dumping raw contents (for instance with -dparsetree) in presence of certain flags, I propose to start by using the flag name as field name. Typically, this gives us the following json schema

      "debug" :
        {
          "type" : "object",
          "properties" :
            {
              "profile" : { "type" : "string"},
              "cmm_invariant" : { "type" : "string"},
              "linear" : { "type" : "array", "items" : { "type" : "string"}},
              "mach" : { "type" : "array", "items" : { "type" : "string"}},
              "unbox-specialised-args" :  { "type" : "array", "items" : { "type" : "string"}},
              "unbox-closures" : { "type" : "array", "items" : { "type" : "string"}},
              "unbox-free-vars-of-closures" :  { "type" : "array", "items" : { "type" : "string"}},
              "remove-free-vars-equal-to-args" :  { "type" : "array", "items" : { "type" : "string"}},
              "cmm" : { "type" : "array", "items" : { "type" : "string"}},
              "raw_clambda" : { "type" : "array", "items" : { "type" : "string"}},
              "clambda" : { "type" : "array", "items" : { "type" : "string"}},
              "raw_flambda" : { "type" : "array", "items" : { "type" : "string"}},
              "flambda" : { "type" : "array", "items" : { "type" : "string"}},
              "raw_lambda" : { "type" : "string"},
              "lambda" : { "type" : "string"},
              "instr" : { "type" : "string"},
              "shape" : { "type" : "string"},
              "typedtree" : { "type" : "string"},
              "source" : { "type" : "string"},
              "parsetree" : { "type" : "string"}
            },
          "required" : []
        },

while the full compiler schema is

  "type" : "object",
  "properties" :
    {
      "metadata" : { "$ref" : "#/$defs/metadata"},
      "alerts" :
        { "type" : "array", "items" : { "$ref" : "#/$defs/error_report"}},
      "warnings" :
        { "type" : "array", "items" : { "$ref" : "#/$defs/error_report"}},
      "error" : { "$ref" : "#/$defs/error_report"},
      "debug" : { "$ref" : "#/$defs/debug"}
    },
  "required" : ["metadata"]

(full version)

The possible disadvantage of this schema for debugging outout is that if we want to structure the output in a more structured way, we will need to add new fields for the structured version.

On a more metalevel, I agree that it is necessary to discuss the various schemata, but I am not sure what are the best venues, nor specification to do that. Typically, json scheme is quite heavy to use to describe sum types, for instance the error_kind scheme looks like

      "error_kind" :
        {
          "oneOf" :
            [{
               "type" : "array",
               "prefixItems" :
                 [{ "const" : "Report_warning_as_error"},
                   { "type" : "string"}]
             },
              {
                "type" : "array",
                "prefixItems" :
                  [{ "const" : "Report_warning"}, { "type" : "string"}]
              },
              {
                "type" : "array",
                "prefixItems" :
                  [{ "const" : "Report_alert_as_error"}, { "type" : "string"}
                    ]
              },
              {
                "type" : "array",
                "prefixItems" :
                  [{ "const" : "Report_alert"}, { "type" : "string"}]
              }, { "const" : "Report_error"}]
        },

when transposed to json scheme. I am thus quite tempted by adding a versioned ADT output to my specification generator and validation checker tool. Would you prefer to discuss the schemata in this format?

@goldfirere
Copy link
Contributor

Would you prefer to discuss the schemata in this format?

Maybe? I'm really more after concrete examples instead of the generic schemata. That is, I think with maybe 2 examples, it would be easy to infer the rough schemata, but if we just look at schemata, it might be hard to intuit how examples would look.

@Octachron
Copy link
Member Author

I can certainly give some examples. At the same time, the exact schema can be updated easily. For instance, taking the functor error example from the RFC with a tree representation of document format and inlined constructor arguments:

((metadata ((version (1 0)) (valid Full)))
 (error
  ((kind Report_error)
   (main
    ((msg
      ((Box HV 0
        ((Text "This application of the functor ")
         (Tag Inline_code ((Text "F"))) (Text " is ill-typed.")
         (Simple_break 1 0) (Text "These arguments:") (Simple_break 1 2)
         (Box B 0 ((Tag Preservation ((Text "Y"))))) (Simple_break 1 0)
         (Text "do not match these parameters:") (Simple_break 1 2)
         (Box B 0
          ((Text "functor") (Simple_break 1 0)
           (Tag Insertion ((Text "(X : ") (Box B 2 ((Text "x"))) (Text ")")))
           (Simple_break 1 0)
           (Tag Preservation
            ((Text "(Y : ") (Box B 2 ((Text "y"))) (Text ")")))
           (Simple_break 1 0) (Text "-> ...")))))))
     (loc
      ((file "functor_example.ml") (start_line 7) (stop_line 7)
       (characters (11 15))))))
   (sub
    (((msg
       ((Tab_break 0 0)
        (Tbox
         ((Tag Insertion ((Text "1. "))) Set_tab
          (Box HV 2
           ((Text "An argument appears to be missing with module type")
            (Simple_break 1 2) (Box B 0 ((Box B 2 ((Text "x"))))))))))))
     ((msg
       ((Tab_break 0 0)
        (Tbox
         ((Tag Preservation ((Text "2. "))) Set_tab
          (Box HV 2
           ((Text "Module Y matches the expected module type ")
            (Box B 2 ((Text "y")))))))))))))))

Another compiler output with warnings, alerts and dump output:

(* example_warning_alert.ml *)
let f x = Gc.eventlog_pause ()
ocamlopt -dparsetree -dsource -dlambda  -log-format sexp -w +A example_warning_alert.ml
((metadata ((version (1 0)) (valid Full)))
 (debug
  ((source "let f x = Gc.eventlog_pause ()")
   (lambda
    "(seq\n  (let\n    (f/270 =\n       (function x/272 : int (apply (field_imm 7 (global Stdlib__Gc!)) 0)))\n    (setfield_ptr(root-init) 0 (global Example_warning_alert!) f/270))\n  0)"
   )))
 (warnings
  (((kind (Report_warning "27 [unused-var-strict]"))
    (main
     ((msg ((Text "unused variable x.")))
      (loc
       ((file "example_warning_alert.ml") (start_line 2) (stop_line 2)
        (characters (6 7)))))))
   ((kind (Report_warning "70 [missing-mli]"))
    (main
     ((msg ((Text "Cannot find interface file.")))
      (loc ((file "example_warning_alert.ml"))))))))
 (alerts
  (((kind (Report_alert "deprecated"))
    (main
     ((msg
       ((Text "Stdlib.Gc.eventlog_pause\nUse Runtime_events.pause instead."))
      )
      (loc
       ((file "example_warning_alert.ml") (start_line 2) (stop_line 2)
        (characters (10 27))))))))))

For the toplevel, I propose to put the compiler diagnostic inside a compiler_diagnostic field of the toplevel scheme at the same level of the toplevel specific fields (for the toplevel output, trace, and error message)

# let f x = 1 in raise Not_found; f ();;
((metadata ((version (1 0)) (valid Full)))
 (output
  ((Box B 0 ((Text "Exception:") (Simple_break 1 0) (Text "Not_found.")))))
 (compiler_diagnostic
  ((warnings
    (((kind (Report_warning "27 [unused-var-strict]"))
      (main
       ((msg ((Text "unused variable x.")))
        (loc ((start_line 1) (stop_line 1) (characters (6 7)))))))
     ((kind (Report_warning "21 [nonreturning-statement]"))
      (main
       ((msg
         ((Text "this statement never returns (or has an unsound type.)")))
        (loc ((start_line 1) (stop_line 1) (characters (15 30))))))))))))
#warn A;;
((metadata ((version (1 0)) (valid Full)))
 (errors
  (((Text "Unknown directive ") (Tag Inline_code ((Text "warn"))) (Text ".")
    (Flush false)))))

@goldfirere
Copy link
Contributor

Those examples are really helpful. Thanks.

My reaction to those is that I'm still worried about the mix of syntax and semantics. That is, the top few nodes are all about semantics: where this message was generated from and what it's doing. Then the lower nodes are all about syntax: how should we pretty-print this? Instead, I would rather see something like

((metadata ((version (1 0)) (valid Full)))
 (error
  ((kind Report_error)
   (main
    ((msg
      (error Ill_typed_functor_application
       (details (functor F) (arguments (Y)) (expected-arguments (X x) (Y y))))
      (root-cause-hint Missing_functor_argument (details x) (suggestion (Y y))))
     (loc
      ((file "functor_example.ml") (start_line 7) (stop_line 7)
       (characters (11 15))))))))))

(I did not check for parenthesis matching.)

Then, there would be a function render_error that takes the error bit and renders it into something like we see in the original sexp. With this plan, we have semantics all the way down, with the possibility of concrete pretty-printing if a client wants it. The advantage of this more-semantics approach is that clients can handle some errors specially if they so choose, without having to do a big pattern-match.

@Octachron
Copy link
Member Author

I agree that at a later stage, it would be better to provide more semantical information.

Nevertheless I think it is useful to start by exposing the current rendered error message as a reasonable way to present the error. This doesn't prevent us to add more semantics information later, and possibly deprecates the "raw" output if it becomes completely unused.

To be honest, on the review side, I would be a bit worried to ask reviewers to review both a new logging mechanism and a change that alters all errors in the compiler. Typically, one important information that is missing for exploiting error messages would be having error names (and error numbers?) as proposed in #33. I would rather propose to move step by step, first with structured output, which will make #33 more appealing, and then add more semantics structure to the error reports.

One alternative that has been suggested to me for the msg was to have a message format which is only "text + tags". But I tend to think that this format is losing quite a bit of information without providing way to recover it.

@lpw25
Copy link
Contributor

lpw25 commented Jul 29, 2024

I would rather propose to move step by step, first with structured output, which will make #33 more appealing, and then add more semantics structure to the error reports.

Whilst I generally favor an iterative approach, it is worth pointing out that some consumers of the structured output will probably need to consume every version of it, so the more versions you have the more work for your users. It is worth spending some time up front trying to get the right information for the tools that want this to avoid needing long term support for a bunch of versions.

@gasche
Copy link
Member

gasche commented Jul 29, 2024

@lpw25 what are the tools / use-case / scenario / persona that you have in mind?

@lpw25
Copy link
Contributor

lpw25 commented Jul 29, 2024

Without thinking hard about it, I'd assume build systems and editor tooling.

@goldfirere
Copy link
Contributor

I would be a bit worried to ask reviewers to review both a new logging mechanism and a change that alters all errors in the compiler.

I agree! But I don't think I understand the concrete worry here. The logging mechanism would be new, for sure, but why do you say that my suggestion also alters all errors in the compiler?

@Octachron
Copy link
Member Author

@lpw25 , I agree that it is better to avoid reporting too much complexity on the consumer side.

However, when speaking to dune developer and merlin developers, the list of features required right now that I ended up was:

  • Being able to redirect the diagnostic to a separate file
  • Parsable enough to be able to remove dune diagnostic lexer and other adhoc parsers.
  • Versioned and pinable to a previous version.

I think it makes sense to start with an initial version that implements those highly demanded features as soon as possible.

Then once we have actual structured diagnostics, I have no doubts that other feature requests will appear. But I am bit wary to try to anticipate on hypothetical future feedbacks without more data. At the same time, if you think that I might be ignoring some possible sources of concrete feedbacks at Janestreet, don't hesitate to point them to me.

A possible good compromise could be to have a ramp-up period: we could start with diagnostic at version 0.0 (0.1) to let developer tools try to use it and gather feedback, with the idea that tools may drop in the future the support for all versions 0.* once the versions 1.* are out.

@goldfirere , sorry I was unclear. I meant the error report pipeline rather than the error messages themselves. Currently, we have three levels of representation for errors:

  • the numerous error variants, which often contain a full typing-environment
  • the error report, which is essentially some metadata combined with a list of error messages
  • the terminal output.

Serializing the error report is straightforward, and require only to adapt the report printers without any other changes.
Serializing the error variants, which would be mostly busy work, and end up with data which can probably only be exploited by importing the compiler error printers.
Your proposition (which is probably the right one ultimately) would require to add a new intermediary representation as rendered (at least without environment and type digraphes) error variants. I expect that adding this new changes to be quite noisy and extensive. I would rather implement this in a second phase, or even a third after another round of error message improvement.

@gasche
Copy link
Member

gasche commented Jul 30, 2024

For the record, there is a history of previous work and previous discussions on this feature.

It was first explicitly discussed in a feature request created in january 2016 ocaml/ocaml#7123, with implementations proposed four years later in summer-fall 2020, when @nojb implemented ocaml/ocaml#9758 and Muskan Gard and @Octachron implemented ocaml/ocaml#9538, ocaml/ocaml#9979 (during a summer internship). At the time we asked for feedback from tool authors, but the feedback was inconclusive and the various PRs ended up being stalled -- seemingly forever.

In response to this feedback, @Octachron worked on schemas versioning for a long while, and the current RFC is the new iteration -- sitting on top of non-trivial buildup work, in particular ocaml/ocaml#13169 .

@gasche
Copy link
Member

gasche commented Jul 30, 2024

A possible good compromise could be to have a ramp-up period: we could start with diagnostic at version 0.0 (0.1) to let developer tools try to use it and gather feedback, with the idea that tools may drop in the future the support for all versions 0.* once the versions 1.* are out.

We are already in such a ramp-up period, where there are zero versions of the protocol available and tools are not using any of them. No one is forcing tools to adopt/support the structured output as soon as it gets out, they can keep doing whatever they want, and give us feedback on what they would need/want to start consuming it.

@goldfirere
Copy link
Contributor

Thanks for the responses.

I agree it makes sense to decouple the initial implementation of structured output from the work of updating all the code that assembles error messages. I can observe that the two steps -- add structured output and improve the error-message assembling -- can happen in either order. That is, we could go through the error messages first and add structure to their assembly without changing concrete output. (I don't necessarily think it's better to reorder. Just saying it's possible.)

As for ramp-up period: that's true. But part of what's in play here is time. That is, suppose we get to a first version of this, with little semantic information (that is, we implement the version that produces the sexps as they appear in this PR). Tools might think of adopting. If everything is going to change in 6 months -- and the change would make adoption easier -- then authors might wait for things to get easier. But if it's going to be 6 years, authors might just forge ahead, even if it means awkward compatibility support later. Of course it's always hard to make rock-solid commitments in open-source software (or closed-source!). Yet I think that, if we broadcast the direction of travel to tool authors, they can make more informed decisions about when, and what, to adopt.

But with all of this in mind, I think perhaps forging ahead with the proposal as written is a good next step.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants