Exhaustive MySQL Parser #157
base: develop
Conversation
Commit messages (truncated): "…logic. Refactor the grammar." / "…le for a more useful parse tree, 15x smaller grammar file, and faster parsing time"
@bgrgicak tested all plugins in the WordPress plugin directory for installation errors. The top 1000 results are published at https://github.com/bgrgicak/playground-tester/blob/main/logs/2024-09-13-09-22-17/wordpress-seo. A lot of these are about SQL queries. Just migrating to the new parser would solve many of these errors and give us a proper foundation to add support for more MySQL features. CC @JanJakes
Force-pushed from d3e623d to 135f29f
@@ -422,6 +422,16 @@ private function parse_recursive($rule_id) {
        $node->append_child($subnode);
    }
}

// Negative lookahead for INTO after a valid SELECT statement.
Great comments!
NOW is a non-reserved keyword that can be used as an identifier in some cases, and thus needs to be passed through "determineFunction". CURRENT_TIMESTAMP, LOCALTIME, and LOCALTIMESTAMP are reserved keywords, and they can't be used as identifiers.
I spent a week looking further into this, and I'd like to summarize my progress.

Summary
Overall, I think @adamziel's approach is a great and viable one! 👏

Parsing mysqltest

MySQL server tests are written using The MySQL Test Framework, which uses a custom DSL called the mysqltest Language. I implemented a simple script that attempts to skip all mysqltest-specific commands and preserve only SQL queries. The script consumes the
The current implementation doesn't handle the last two points yet, but both of them seem pretty straightforward to implement when needed. The extracted queries are then deduplicated for exact matches and dumped in a CSV format to preserve newlines within SQL queries. In the future, we can also expand the data set to extract queries from other directories within the test suite.

Lexer issues

The AI-generated lexer is surprisingly good and well-structured, but it does have some issues and missing pieces:
There are likely more issues to be discovered in the lexer, but I wouldn't expect to run into any blockers. I'd like to do a full manual walkthrough to ensure the lexer fully corresponds to the MySQLLexer.g4 specification.

Grammar issues

The converted and compressed grammar originally comes from the MySQLParser.g4 specification, and it's almost equivalent. Although I discovered some small issues in the original specs and the grammar conversion as well, these are not major problems:
Grammar conflicts

What can pose more significant challenges is grammar conflicts. The original ANTLR grammar is meant to be processed by the ANTLR toolkit, which does a lot of heavy lifting in terms of conflict resolution. During grammar compilation, the toolkit can refactor and reorder some rules, or introduce lookaheads into the generated parser. In our case, we don't have such a compiler at hand, which results in some conflicts manifesting when parsing. To very briefly explain how conflicts occur, it is worth summarizing how the parser works:
This behavior is correct (backtracking can be extremely expensive). In fact, it is a description of a simple LL parser. In this scenario, conflicts can occur in the following ways:
A special case of a FIRST/FIRST conflict is left recursion, but that was already eliminated by @adamziel. Additionally, grammars can also contain ambiguities (multiple different parse trees matching the same input), but I wouldn't expect to run into that in this case. As for the above-mentioned conflicts, I did indeed run into some:
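To make FIRST/FIRST conflicts concrete, here is a toy sketch (in Python for brevity, though the parser in this PR is PHP). The rule, token kinds, and function names are all hypothetical and purely illustrative — the point is only that when two alternatives share a FIRST symbol, a single token of lookahead cannot pick between them, so alternative order and commitment strategy matter:

```python
# Toy FIRST/FIRST conflict: both alternatives of `stmt` start with IDENT.
#   stmt   := call | assign
#   call   := IDENT "(" ")"
#   assign := IDENT "=" IDENT
def match_seq(tokens, pos, expected_kinds):
    """Match a fixed sequence of token kinds starting at `pos`.

    Tokens are (kind, text) pairs; returns the end position or None."""
    for kind in expected_kinds:
        if pos >= len(tokens) or tokens[pos][0] != kind:
            return None
        pos += 1
    return pos

def parse_stmt(tokens):
    # FIRST(call) and FIRST(assign) both contain IDENT, so one token of
    # lookahead cannot choose the alternative up front.
    alternatives = [
        ("call", ["IDENT", "LPAREN", "RPAREN"]),
        ("assign", ["IDENT", "EQ", "IDENT"]),
    ]
    for name, kinds in alternatives:
        if match_seq(tokens, 0, kinds) == len(tokens):
            return name  # commit to the first alternative that matches
    return None
```

With a whole statement available, trying complete alternatives in order resolves the conflict cheaply; the trouble described above starts when the parser commits after a partial match and cannot afford to backtrack.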
Solving the conflicts manually and reordering the rules is probably not the ideal final solution, but it can help us understand the nature and quantity of these conflicts. As a next step, it might be worth addressing the conflicts by writing a simple script to analyze the grammar, but it's good to first understand what we actually need to solve. Ultimately, we should probably do some left-factoring, use lookaheads, and maybe try to build parsing tables to see if they could have a reasonable size in some form. Some of the tooling seems to be provided by enbfutils, but I had little luck addressing any of the conflicts using those tools. At least, I wrote a simple script to dump conflicts from the expanded JSON grammar. Ultimately, I think that once we understand the conflicts we're running into and how we want to address them, writing an algorithm to do so won't be a difficult task. I'll keep exploring a bit in this area.

Next steps

As for the next steps, I think it would make sense to focus on the following:
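As an aside on the left-factoring mentioned above, the core operation can be sketched in a few lines (Python for illustration; the real grammar lives in a PHP array, and this helper is hypothetical). It pulls the longest common prefix out of a rule's alternatives so that one token of lookahead can distinguish the remainders:

```python
def left_factor(alternatives):
    """Given a rule's alternatives (each a list of grammar symbols),
    return (common_prefix, suffixes):  A -> x y | x z  becomes the
    prefix [x] plus suffixes [[y], [z]], i.e.  A -> x A'; A' -> y | z."""
    prefix = []
    # Extend the prefix while every alternative still has a symbol at
    # this position and all of those symbols are identical.
    while (all(len(alt) > len(prefix) for alt in alternatives)
           and len({alt[len(prefix)] for alt in alternatives}) == 1):
        prefix.append(alternatives[0][len(prefix)])
    suffixes = [alt[len(prefix):] for alt in alternatives]
    return prefix, suffixes
```

A factoring script would apply this to every rule whose alternatives share a FIRST symbol, emitting a fresh helper rule for the suffixes.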
What do you think?
This. Is. Incredible. Such good work here @JanJakes! A few thoughts I had as I was reading:
Manual resolution is not a general solution over all possible parsers we could generate here, but it should be fine for the MySQL parser. I can only think of two scenarios where that would be a problem:
An automated script would surely help us with the latter, but perhaps that work doesn't need to happen before the initial SQLite translation layer – what do you think?
Another thought – once we have a generalized solution, we could generate more parsers this way. I wanted a reasonably fast Markdown parser in PHP for a while (although that's totally out of scope here :-))
This is lit 🔥
@adamziel Well, I'm only building on your phenomenal foundations 😊
💯 I think in the first phase, we should be able to go with the manual fixes. Doing it properly and automatically could be a rabbit hole, so I would avoid going that way for now if we can. The fact that we got to a 3% error rate on an extensive test suite is promising. Actually, it's even more promising given that there are still some issues in the lexer. I've done the first pass of semi-manually comparing the lexer to the MySQLLexer.g4 grammar, which already yielded some nice results:
What's promising is that this is only a part of the manual pass. I sorted out all tokens, function identifiers, synonyms, and version-specifics, but there are things I haven't addressed yet, and I know some of them need to be fixed. For instance, the AI used all the

In summary, there's a chance even more test failures will be eliminated just by improving the lexer, hopefully leaving us with fewer interventions on the grammar side.
I had a conversation with @JanJakes about this and just wanted to note the idea (whether feasible or not) that we might be able to transform the MySQL AST into a SQLite AST before we transform it back into a SQL query string and pass it on to SQLite. For example, I found this (JS) project that creates SQL from an SQLite AST.
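The AST-to-AST idea could look roughly like this. This is a Python sketch with made-up node shapes — it reflects neither the PR's actual parse tree format nor any real driver API, and the NOW() rewrite is just one illustrative mapping:

```python
# Hypothetical node shape: {"type": ..., "name": ..., "children": [...]}.
# The transform rewrites MySQL-only constructs into SQLite equivalents
# before the tree is serialized back into an SQL string.
def mysql_to_sqlite(node):
    if node["type"] == "function_call" and node["name"].upper() == "NOW":
        # SQLite has no NOW(); CURRENT_TIMESTAMP is the closest equivalent.
        return {"type": "keyword", "name": "CURRENT_TIMESTAMP"}
    # Recurse into children, leaving other constructs untouched.
    out = dict(node)
    out["children"] = [mysql_to_sqlite(c) for c in node.get("children", [])]
    return out
```

The appeal of this shape is that each MySQL-specific construct becomes one small, testable rewrite rule, instead of string manipulation on the query text.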
This was done together with the first semi-manual pass & check based on MySQLLexer.g4 grammar from https://github.com/mysql/mysql-workbench/blob/8.0.38/library/parsers/grammars/MySQLLexer.g4. It also cuts the size of the lexer from 9460 LOC to 3827 LOC, and the correctness improvements further reduce the number of failing tests.
Another thing that's worth mentioning is the SHOW PARSE_TREE statement in MySQL. It works with a debug build of MySQL, and it could be useful if we want to assert the parser's correctness in the future. We would need to do some kind of mapping to our AST, but I suppose it could work.
Force-pushed from a188f48 to 3d9671b
Note
You're most welcome to take over this PR. I won't be able to drive it to completion before November. Do you need reliable SQLite support? You can make this happen by driving this PR to completion (and forking if needed)! Are you dreaming about running WordPress on PostgreSQL? Finishing this PR is the first step there.
Read the MySQL parser proposal for the full context on this PR.
Ships an exhaustive MySQL Lexer and Parser that produce a structured parse tree. This is the first step towards supporting multiple databases. It's a simpler, more stable, and easier-to-maintain method than the token processing we use now. It will also dramatically improve the WordPress Playground experience – database integration is the single largest source of issues.
We don't have an AST yet, but we have a decent parse tree. That may already be sufficient – by adjusting the grammar file we can mold it into almost anything we want. If that isn't sufficient or convenient for any reason, converting the parse tree into an AST is a simple tree mapping. It could be done with a well-crafted MySQLASTRecursiveIterator and a good AI prompt.

Implementation
The three focal points of this PR are:
Before diving further, check out a few parse trees this parser generated:
MySQLLexer.php
This is an AI-generated lexer I initially proposed in #153. It needs a few passes from a human to inline most methods and cover a few tokens it doesn't currently produce, but overall it seems solid.
DynamicRecursiveDescentParser.php

A simple recursive parser to transform (token stream, grammar) => parse tree. In this PR we use MySQL tokens and the MySQL grammar, but the same parser could also support XML, IMAP, and many other grammars (as long as they have specific properties). The parse_recursive() method is just 100 lines of code (excluding comments). All of the parsing rules are provided by the grammar.

run-mysql-driver.php
A quick and dirty implementation of what a MySQL parse tree ➔ SQLite database driver could look like. It easily supports WITH and UNION queries that would be really difficult to implement in the current SQLite integration plugin. The tree transformation is an order of magnitude easier to read, expand, and maintain than the current one. I stand by this, even though the temporary ParseTreeTools / SQLiteTokenFactory API included in this PR seems annoying, and I'd like to ship something better than that. Here's a glimpse:

A deeper technical dive
MySQL Grammar
We use the MySQL workbench grammar converted from ANTLR4 format to a PHP array.
You can tweak the MySQLParser-reordered.ebnf file and regenerate the php grammar with the create_grammar.sh script. You'll need to run npm install before you do that.

The grammar conversion pipeline goes like this:
g4 ➔ EBNF with grammar-converter.
EBNF ➔ JSON with node-ebnf. This already factors compound rules into separate rules, e.g. query ::= SELECT (ALL | DISTINCT) becomes query ::= select %select_fragment0 and %select_fragment0 ::= ALL | DISTINCT.
Expansion of the *, +, ? modifiers into separate, right-recursive rules. For example, columns ::= column (',' column)* becomes columns ::= column columns_rr and columns_rr ::= ',' column columns_rr | ε.
JSON ➔ PHP with a PHP script. It replaces all string names with integers and ships an int ➔ string map to reduce the file size.

I ignored nuances like MySQL version-specific rules and output channels for these initial explorations. I'm now confident the approach from this PR will work. We're in a good place to start thinking about incorporating these nuances. I wonder if we even have to distinguish between MySQL 5 vs 8 syntax; perhaps we could just assume version 8 or use a union of all the rules.
✅ The grammar file is large, but fine for v1
Edit: I factored the grammar manually instead of using the automated factoring algorithm, and the grammar.php file size went down to 70kb. This one is now solved. Everything until the next header is no longer relevant and I'm only leaving it here for context.
grammar.php is 1.2MB, or 100kb gzipped. This already is a "compressed" form where all rules and tokens are encoded as integers. I see three ways to reduce the size:

Factor common prefixes, turning AB|AC|AD into A(B|C|D), etc.
Remove rules we don't need, like CREATE PROCEDURE, or modularize the grammar and lazy-load the parts we actually need to use. For example, most of the time we won't need anything related to GRANT PRIVILEGES or DROP INDEX.

The speed is decent
The proposed parser can handle about 1000 complex SELECT queries per second on a MacBook Pro. It only took a few easy optimizations to go from 50/second to 1000/second. There are a lot of further optimization opportunities once we need more speed. We could factor the grammar in different ways, explore other types of lookahead tables, or memoize the matching results per token. However, I don't think we need to do that in the short term.
If we spend enough time factoring the grammar, we could potentially switch to an LALR(1) parser and cut most of the time spent dealing with ambiguities.
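The per-token memoization mentioned above is essentially packrat-style caching: once a rule has been tried at a given token position, the result is stored and never recomputed. A minimal sketch in Python (the grammar encoding here — a dict of name ➔ alternatives, with terminals matched literally — is illustrative, not the PR's integer-encoded format):

```python
def parse(grammar, tokens):
    """Return True if `tokens` matches the grammar's "start" rule.

    Rules map a name to a list of alternatives; each alternative is a
    list of symbols. A symbol not present in the grammar is a terminal
    matched literally against the token at the current position."""
    cache = {}

    def parse_symbol(sym, pos):
        if sym not in grammar:  # terminal
            if pos < len(tokens) and tokens[pos] == sym:
                return pos + 1
            return None
        key = (sym, pos)
        if key in cache:        # memo hit: never re-parse this rule here
            return cache[key]
        result = None
        for alternative in grammar[sym]:
            end = pos
            for s in alternative:
                end = parse_symbol(s, end)
                if end is None:
                    break
            if end is not None:
                result = end
                break           # commit to the first matching alternative
        cache[key] = result
        return result

    return parse_symbol("start", 0) == len(tokens)
```

The cache turns repeated attempts at the same (rule, position) pair into O(1) lookups, which is exactly where a backtracking-ish grammar otherwise burns time.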
Next steps

These could be implemented either in follow-up PRs or as updates to this PR – whichever is more convenient:

Build a MySQLOnSQLite database driver to enable running MySQL queries on SQLite. Read this comment for more context. Use any method that's convenient for generating SQLite queries. Feel free to restructure and redo any APIs proposed in this PR. Be inspired by the idea that we may build a MySQLOnPostgres driver one day, but don't actually build any abstractions upfront. Make the driver generic so it can be used without WordPress. Perhaps it could implement a PDO driver interface?
Add support for MySQL features the MySQLOnSQLite driver is missing, for example the SQL_CALC_FOUND_ROWS option or the INTERVAL syntax.
Run the WordPress test suites on the MySQLOnSQLite driver and ensure they pass.
Use the MySQLOnSQLite driver instead of the current plumbing.