Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for extending inline parsing with custom inline content parsers #321

Merged
merged 10 commits into from
Apr 26, 2024

Conversation

robinst
Copy link
Collaborator

@robinst robinst commented Apr 26, 2024

This adds an API for users/extensions to extend inline parsing (or override some built-in inline parsing), similar to the current support for custom block parsers. Fixes:

See also:

Overview

Since 5790505 we've internally extracted some of the inline parsing logic. This PR turns that into an API. It's currently exported in the beta package because it might be subject to change (and Scanner, which this depends on are is in beta).

The entry point for this is Parser.Builder#customInlineContentParserFactory.

The registered factory needs to provide one or more "trigger characters". When such a character is encountered during inline parsing, the parser is called with a Scanner and it can parse inline content from there. It can decide not to return a result; in that case other parsers (including built-in ones) are tried next.

Example

See InlineContentParserTest for an example of how to use it. Registering the factory:

var parser = Parser.builder().customInlineContentParserFactory(new DollarInlineParser.Factory()).build();

And the factory and parser:

private static class DollarInlineParser implements InlineContentParser {
private int index = 0;
@Override
public ParsedInline tryParse(InlineParserState inlineParserState) {
var scanner = inlineParserState.scanner();
scanner.next();
var pos = scanner.position();
var end = scanner.find('$');
if (end == -1) {
return ParsedInline.none();
}
var content = scanner.getSource(pos, scanner.position()).getContent();
scanner.next();
return ParsedInline.of(new DollarInline(content, index++), scanner.position());
}
static class Factory implements InlineContentParserFactory {
@Override
public Set<Character> getTriggerCharacters() {
return Set.of('$');
}
@Override
public InlineContentParser create() {
return new DollarInlineParser();
}
}
}

Design considerations

The trickiest part was deciding on the lifetime of the parser instance. There were a few options:

  • The parser takes an instance (instead of factory) which is used for all parsed documents. That would mean the instance has to be stateless, as it could be called concurrently from multiple parses. Care has been taken to make using a single parser from multiple threads safe, so this would go against that.
  • The parser takes a factory, and for each parsed document, one instance is created. While this solves the above problem, there is the following quote in the spec: "Note that the first step requires processing lines in sequence, but the second can be parallelized, since the inline parsing of one block element does not affect the inline parsing of any other." If we only had a single parser instance per document, we wouldn't be able to parallelize inline parsing in the future. There's also another problem: If an instance wants to keep state, it would probably be per inline snippet. In order to allow that, we would need to add some kind of "reset" method to signal it that a snippet has finished.
  • The parser takes a factory, and for each parsed inline snippet, an instance is created. This would allow parallelizing inline parsing, and is also a useful granularity if the parser wants to keep state. Note that e.g. link parsing is not yet using this mechanism, but with this, it could. (This is the option that was chosen.)
  • The parser takes a factory, and for each trigger character, an instance is created. This seems too small of a scope, and wouldn't allow the instance to keep any state across multiple parses.

Some other thoughts:

  • Why a Set<Character> for trigger characters instead of just a single char? If link parsing was implemented on top of this, it would require at least [ and ] as trigger characters, so let's allow multiple. There's not really a downside to allowing multiple.
  • Could a parser return more than a single Node? Not with the current API, but we could add that in a backwards-compatible way.
  • Why another Factory? Not sure what the alternative would be, and we already have the same concept for BlockParserFactory :).

@robinst robinst merged commit 3733963 into main Apr 26, 2024
10 checks passed
@robinst robinst deleted the robinst-inline-content-parser branch April 26, 2024 12:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant