Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add scanning #1429

Draft
wants to merge 8 commits into
base: master
Choose a base branch
from
Draft

Add scanning #1429

wants to merge 8 commits into from

Conversation

MegaIng
Copy link
Member

@MegaIng MegaIng commented Jun 20, 2024

An implement of Lark.scan. Also adds start_pos and end_pos to Lark.parse, Lark.parse_interactive and Lark.lex.

TODO:

  • add example
  • A bit more documentation for what exactly this function does
  • Notes about start_pos and end_pos mirroring the behavior of stdlib re with regard to look behind and look ahead.

But I do think the core logic is pretty stable and I would like a review of that already @erezsh.

Future work:

  • Check if it already works/What needs to be done to make this work with mmap to not have to load the text into memory at all (also involves checking up on the byte parsing implementation)
  • Check to see if I can implement a custom lexer that uses python's stdlib tokenize module, which would have a few benefits especially with regard to the new f string syntax, and how well that would play with this feature.

This PR is based on #1428, so merging it first would be better.

Copy link
Member

@erezsh erezsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! I only gave it a quick look so far, but looks pretty good.

I left a few comments. I'll take a deeper look later, and I might come up with more.

lark/parser_frontends.py Outdated Show resolved Hide resolved
lark/parser_frontends.py Outdated Show resolved Hide resolved
lark/lexer.py Outdated Show resolved Hide resolved
@MegaIng
Copy link
Member Author

MegaIng commented Jun 20, 2024

Btw, can we drop support for the now completely unsupported versions 3.6 and 3.7? https://devguide.python.org/versions/

The regex stubs in typshed now uses the / positional only syntax, meaning mypy error exists because the stubs can't be used for 3.6, which we have as our lowest supported version.

@erezsh
Copy link
Member

erezsh commented Jun 20, 2024

Yeah, I think we can. Users of <=3.7 will just stay stuck on this version of Lark, which is fine. But I think it should be in a separate PR.

@MegaIng MegaIng requested a review from erezsh June 20, 2024 18:48
Copy link
Member

@erezsh erezsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looks really good!

There is one design idea that I have, I hope you'll keep an open mind. If we change the definition of text to text: str | TextSlice, where TextSlice has 3 fields: text, start, end.

That would follow the principal of keeping related data together, and also make all the None checks a little less awkward ( I think ).

And as a bonus, that way we don't have to add new parameters to every function on the way.

lark/lexer.py Outdated
if m:
return m.group(0), m.lastgroup

def search(self, text, start_pos, end_pos):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose a different way to write this function:

    def search(self, text, start_pos, end_pos):
        results = list(filter(None, [
            mre.search(text, start_pos, end_pos)
            for mre in self._mres
        ]))
        if not results:
            return None

        best = min(results, key=lambda m: m.start())
        return (best.group(0), best.lastgroup), best.start()

# We don't want to check early since this can be expensive.
valid_end = []
ip = self.parse_interactive(text, start=chosen_start, start_pos=found.start_pos, end_pos=end_pos)
tokens = ip.lexer_thread.lex(ip.parser_state)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use ip.iter_parse()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had in mind that it doesn't work, gotta check.

@MegaIng
Copy link
Member Author

MegaIng commented Jun 20, 2024

There is one design idea that I have, I hope you'll keep an open mind. If we change the definition of text to text: str | TextSlice, where TextSlice has 3 fields: text, start, end.

That would follow the principal of keeping related data together, and also make all the None checks a little less awkward ( I think ).

Yep, sounds like a good idea. I think in a later PR it might also make sense to go a step further and make Lexer (and therefore Lark) generic over the type to parse, which should be quite reasonably possible without breaking changes thanks to Type Var defaults.

@erezsh
Copy link
Member

erezsh commented Jun 20, 2024

make Lexer (and therefore Lark) generic over the type to parse

I'm not sure what exactly you have in mind, but overall it sounds like a positive thing. Are you referring to the text also accepting bytes, or something else?

@MegaIng
Copy link
Member Author

MegaIng commented Jun 20, 2024

Specifically the think I had in mind is some kind of externally generated token stream with a custom lexer.

@erezsh
Copy link
Member

erezsh commented Jun 20, 2024

Hmm.. I'm not sure I get it.

@MegaIng
Copy link
Member Author

MegaIng commented Jun 20, 2024

Imagine using something like stdlib's tokenizer to produce a list of tokens, and using those as the input to Lark.parse instead of the raw text. Or using a bytecode stream from decompiling a python code object as the input to recover structures like loops.

@MegaIng
Copy link
Member Author

MegaIng commented Jun 20, 2024

I am going to bed for now, but an observation I am already making is that using TextSlice instead requires touching a lot more code. We have quite a few places where we slightly break layers of abstraction, which was fine because we always found either str or bytes. Now this is TextSlice instead which requires special handling. I think I might start from scratch tomorrow and separate TextSlice and scan into two different PRs.

@erezsh
Copy link
Member

erezsh commented Jun 21, 2024

I think I might start from scratch tomorrow and separate TextSlice and scan into two different PRs.

Sounds like that might be a good idea.

@erezsh
Copy link
Member

erezsh commented Aug 15, 2024

@MegaIng Do you have plans to submit the TextSlice PR in the near future? If not, maybe I'll take a crack at it.

@MegaIng
Copy link
Member Author

MegaIng commented Aug 15, 2024

Sorry, no, not next-few-days-soon. Forgot about this. It was a bit more work than I thought (since it also needed to touch a chunk of earley code). I am not currently on my main PC, so I can't even provide the work-in-progress, but I would be able to in ~2 days.

@erezsh
Copy link
Member

erezsh commented Aug 15, 2024

Sure, I'll wait until you're back to your PC.

I'm not in a rush, I'm just in a mood to work on something, and might as well do this. The scan feature is pretty cool, so I'll be happy to unblock it if I can.

@MegaIng
Copy link
Member Author

MegaIng commented Aug 15, 2024

If you want to work on something, maybe you can come up with a solution of the issue of multi-path imports, as seen in this SO question?

The problem is that the same terminal gets imported with different names and then ofcourse collides with "itself". I don't really know if there is a good solution to this, maybe you can come up with something.

@erezsh
Copy link
Member

erezsh commented Aug 15, 2024

I gave it some thought, and it's a tricky problem. I wrote my thoughts here: #1448

@MegaIng MegaIng marked this pull request as draft August 17, 2024 16:51
@MegaIng
Copy link
Member Author

MegaIng commented Aug 17, 2024

@erezsh Pushed a version of TextSlice I had into this branch if you want to use that as a jumping off point. The commit is called "temp", but we should probably never merge this specific PR anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants