-
Notifications
You must be signed in to change notification settings - Fork 1
Home
The new location is https://boozetools.readthedocs.io/en/latest/ but not everything has necessarily moved over just yet.
As mentioned on the readme.md
file, booze-tools is a set of language-processing tools.
At the moment, these are:
-
macroparse
for both scanning and parsing with a separate language-definition document. -
miniparse
for bootstrapping context-free language recognition. -
miniscan
for bootstrapping lexical analysis. - The beginnings of GLR support.
MacroParse takes language definition documents and produces corresponding automatons for both scanner and parser. As JSON objects, these are independent of host-language (e.g. Python, C, or Java) but sample implementations for Python are provided.
MacroParse is so named because it supports "macro-productions" which enable you to encapsulate patterns (e.g. "comma-separated-list-of-FOO") with standardized semantics rather than having to repeat yourself all over the grammar as normal BNF would require. As such, it's a bit nicer than typical EBNF.
It also supports embedded actions and a few other abbreviation devices. Full examples are provided explaining how to access (almost?) all the features.
MiniScan makes scanners: these analyze text into the basic lexical units like words, numbers, and punctuation. MiniScan definitions associate a set of patterns (regular expressions) with corresponding actions to perform.
The mini-scanner also supports:
- Rule ranks, which can override the general leftmost-longest heuristic and yield a shorter match that ranks higher.
- For example,
foo
at rank 1 preempts[a-z]+
at (default) rank 0 for inputfoobar
.
- For example,
- Intersection of character classes:
- For example,
[A-Z&&^AEIOU]
matches an upper-case consonant.
- For example,
- Perl-style counted repetitions.
- For example,
{xdigit}{4,8}
matches anywhere between a 16 and 32-bit hexadecimal numeral.
- For example,
- Named subexpressions.
- If the subexpression is a character class, it may be referred to inside another character class.
- Unicode input.
The MiniScan how-to page explains in detail how to get set up and working with the mini-scanner. The MiniScan usage notes page covers the regular expression dialect and limitations.
This is your basic LALR(1) or LR(1) parser accepting BNF rules, with one key extra-fancy feature:
- You can have more than one "start" symbol within a single grammar, and select among them at parse time.
In your application, you might want to be able to parse specific sub-phrases of a larger language without having to duplicate (parts of) the grammar. There is real application of this feature in the way the miniscan
module grapples with named sub-expressions, which may not contain anchors but otherwise run the gamut of regular expressions.
The miniparse
how-to page contains further detail.
The algorithms are in place for generating non-deterministic parse tables for all supported strengths. These are a straightforward extension of the deterministic case.
Simple routines are provided for both a test-parse (unconcerned with semantics) and normal semantic parsing with a tree-structured stack.
A tree-structured stack works OK for mild non-determinism over small-ish inputs, but it's not really ready for prime time. Preferable is a true graph-structured stack, with merging as in Elk-Hound or Bison. The code is yet to come. With it will probably arrive various disambiguation strategies, including support for computed semantic predicates.
For a complete, functional, working, tested example, please see the example
folder, and in particular the file JSON.py
therein defined. To see it in action, look in tests/test_examples.py
.