
Exhaustive MySQL Parser #157

Open · wants to merge 28 commits into develop
Conversation

adamziel (Collaborator) commented Aug 17, 2024

Note

You're most welcome to take over this PR. I won't be able to drive it to completion before November. Do you need reliable SQLite support? You can make it happen by driving this PR to completion (and forking if needed)! Are you dreaming about running WordPress on PostgreSQL? Finishing this PR is the first step there.

Read the MySQL parser proposal for the full context on this PR.

Ships an exhaustive MySQL Lexer and Parser that produce a structured parse tree. This is the first step towards supporting multiple databases. It's an easier, more stable, and more maintainable approach than the token processing we use now. It will also dramatically improve the WordPress Playground experience – database integration is the single largest source of issues.

We don't have an AST yet, but we have a decent parse tree. That may already be sufficient – by adjusting the grammar file we can mold it into almost anything we want. If that turns out to be insufficient or inconvenient for any reason, converting the parse tree into an AST is a simple tree mapping. It could be done with a well-crafted MySQLASTRecursiveIterator and a good AI prompt.
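For a taste of what that tree mapping could look like, here's a minimal sketch. The AstNode class and the node fields (->type, ->text, ->rule_name, ->children) are assumptions for illustration, not the API from this PR:

// Hypothetical sketch: a recursive map from parse tree to a minimal AST.
// AstNode, the token check, and the node fields are assumptions, not the
// PR's actual API.
class AstNode {
    public function __construct(
        public string $kind,
        public array $children = [],
        public ?string $value = null
    ) {}
}

function parse_tree_to_ast( $node ): AstNode {
    if ( is_object( $node ) && isset( $node->type ) ) {
        // A lexer token becomes a leaf.
        return new AstNode( 'token', [], $node->text );
    }
    $children = array_map( 'parse_tree_to_ast', $node->children );
    // Collapse single-child wrapper rules to keep the AST shallow.
    return 1 === count( $children )
        ? $children[0]
        : new AstNode( $node->rule_name, $children );
}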

Implementation

The three focal points of this PR are:

  • MySQLLexer.php – turns an SQL query into a stream of tokens
  • DynamicRecursiveDescentParser.php – turns a stream of tokens into a parse tree
  • run-mysql-driver.php – proof of concept of a parse-tree-based query conversion

Before diving further, check out a few parse trees this parser generated:

MySQLLexer.php

This is an AI-generated lexer I initially proposed in #153. It needs a few passes from a human to inline most methods and cover a few tokens it doesn't currently produce, but overall it seems solid.

DynamicRecursiveDescentParser.php

A simple recursive parser that transforms (token stream, grammar) => parse tree. In this PR we use MySQL tokens and MySQL grammar, but the same parser could also support XML, IMAP, and many other grammars (as long as they have specific properties).

The parse_recursive() method is just 100 lines of code (excluding comments). All of the parsing rules are provided by the grammar.
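For intuition, a condensed, hypothetical version of that loop might look as follows – simplified names and node shapes, not the actual parse_recursive() implementation:

// Condensed illustration of grammar-driven recursive descent. $grammar maps
// a rule id to its alternative branches; ids without an entry are terminals.
function parse_rule( int $rule_id, array $tokens, int &$pos, array $grammar ) {
    if ( ! isset( $grammar[ $rule_id ] ) ) {
        // Terminal: match the current token by type.
        return ( $tokens[ $pos ]->type === $rule_id ) ? $tokens[ $pos++ ] : null;
    }
    foreach ( $grammar[ $rule_id ] as $branch ) {
        $start    = $pos;
        $children = [];
        foreach ( $branch as $subrule_id ) {
            $child = parse_rule( $subrule_id, $tokens, $pos, $grammar );
            if ( null === $child ) {
                $pos = $start; // Rewind and try the next branch.
                continue 2;
            }
            $children[] = $child;
        }
        // First fully matching branch wins; no backtracking afterwards.
        return [ 'rule' => $rule_id, 'children' => $children ];
    }
    return null;
}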

run-mysql-driver.php

A quick and dirty implementation of what a MySQL parse tree ➔ SQLite database driver could look like. It easily supports WITH and UNION queries that would be really difficult to implement in the current SQLite integration plugin.

The tree transformation is an order of magnitude easier to read, expand, and maintain than the current one. I stand by this, even though the temporary ParseTreeTools/SQLiteTokenFactory API included in this PR seems annoying, and I'd like to ship something better than that. Here's a glimpse:

function translateQuery($subtree, $rule_name = null) {
    if (is_token($subtree)) {
        $token = $subtree;
        switch ($token->type) {
            case MySQLLexer::EOF:
                return new SQLiteExpression([]);

            case MySQLLexer::IDENTIFIER:
                return SQLiteTokenFactory::identifier(
                    SQLiteTokenFactory::identifierValue($token)
                );

            default:
                return SQLiteTokenFactory::raw($token->text);
        }
    }

    switch ($rule_name) {
        case 'indexHintList':
            // SQLite doesn't support index hints. Let's
            // skip them.
            return null;

        case 'fromClause':
            // Skip `FROM DUAL`. We only care about a singular
            // FROM DUAL statement, as FROM mytable, DUAL is a syntax
            // error.
            if (
                ParseTreeTools::hasChildren($subtree, MySQLLexer::DUAL_SYMBOL) &&
                !ParseTreeTools::hasChildren($subtree, 'tableReferenceList')
            ) {
                return null;
            }
            break; // Otherwise, fall back to the default handling.

        case 'functionCall':
            $name = $subtree[0]['pureIdentifier'][0]['IDENTIFIER'][0]->text;
            return translateFunctionCall($name, $subtree[0]['udfExprList']);
    }
}
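Roughly speaking, the driver walks the parse tree recursively, calling translateQuery() on each node and concatenating the resulting SQLite tokens into the final query string.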

A deeper technical dive

MySQL Grammar

We use the MySQL Workbench grammar converted from ANTLR4 format to a PHP array.

You can tweak the MySQLParser-reordered.ebnf file and regenerate the PHP grammar with the create_grammar.sh script. You'll need to run npm install before you do that.

The grammar conversion pipeline goes like this:

  1. g4 ➔ EBNF with grammar-converter
  2. EBNF ➔ JSON with node-ebnf. This already factors compound rules into separate rules, e.g. query ::= SELECT (ALL | DISTINCT) becomes query ::= SELECT %select_fragment0 and %select_fragment0 ::= ALL | DISTINCT.
  3. Rule expansion with a Python script: expand the *, +, and ? modifiers into separate, right-recursive rules. For example, columns ::= column (',' column)* becomes columns ::= column columns_rr and columns_rr ::= ',' column columns_rr | ε.
  4. JSON ➔ PHP with a PHP script. It replaces all string names with integers and ships an int->string map to reduce the file size.
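To make step 4 concrete, the compiled grammar plausibly has a shape along these lines – an illustrative guess, not the exact encoding of grammar.php:

// Illustrative guess at the int-encoded grammar shape; the real grammar.php
// encoding may differ. Rules and tokens are integers, and a name map is
// shipped alongside to restore readable names when needed.
$grammar = [
    'rules' => [
        // query ::= SELECT_SYMBOL %select_fragment0
        2001 => [ [ 1001, 2002 ] ],
        // %select_fragment0 ::= ALL_SYMBOL | DISTINCT_SYMBOL
        2002 => [ [ 1002 ], [ 1003 ] ],
    ],
    'names' => [
        1001 => 'SELECT_SYMBOL',
        1002 => 'ALL_SYMBOL',
        1003 => 'DISTINCT_SYMBOL',
        2001 => 'query',
        2002 => '%select_fragment0',
    ],
];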

I ignored nuances like MySQL version-specific rules and output channels for this initial exploration. I'm now confident the approach from this PR will work, so we're in a good place to start thinking about incorporating these nuances. I wonder if we even have to distinguish between MySQL 5 vs. 8 syntax; perhaps we could just assume version 8, or use a union of all the rules.

✅ The grammar file is large, but fine for v1

Edit: I factored the grammar manually instead of using the automated factoring algorithm, and the grammar.php file size went down to 70kb. This one is now solved. Everything until the next header is no longer relevant and I'm only leaving it here for context.

grammar.php is 1.2MB, or 100kb gzipped. This is already a "compressed" form where all rules and tokens are encoded as integers.

I see three ways to reduce the size:

  1. Explore further factorings of the grammar. Run left factoring to deduplicate any ambiguous rules, then extract AB|AC|AD into A(B|C|D), etc.
  2. Remove a large part of the grammar. We can either drop support for hand-picked concepts like CREATE PROCEDURE, or modularize the grammar and lazy-load the parts we actually need. For example, most of the time we won't need anything related to GRANT PRIVILEGES or DROP INDEX.
  3. Collapse some tokens into the same token. Perhaps we don't need the same granularity as the original grammar.

The speed is decent

The proposed parser can handle about 1,000 complex SELECT queries per second on a MacBook Pro. It only took a few easy optimizations to go from 50 queries/second to 1,000 queries/second. There are plenty of further optimization opportunities once we need more speed: we could factor the grammar in different ways, explore other types of lookahead tables, or memoize the matching results per token. However, I don't think we need to do that in the short term.
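As one example, memoizing match results per (rule, token position) – packrat-style – could look roughly like this inside the parser class. parse_recursive() exists in this PR; the $memo and $position property names are assumed for illustration:

// Hypothetical packrat-style cache keyed by (rule id, token position).
// parse_recursive() is real; the $position field name is assumed.
private array $memo = [];

private function parse_memoized( int $rule_id ) {
    $key = $rule_id . ':' . $this->position;
    if ( array_key_exists( $key, $this->memo ) ) {
        [ $node, $end_position ] = $this->memo[ $key ];
        // Replay the cached result (including cached failures, stored as
        // null) and fast-forward past the match.
        $this->position = $end_position;
        return $node;
    }
    $node = $this->parse_recursive( $rule_id );
    $this->memo[ $key ] = [ $node, $this->position ];
    return $node;
}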

If we spend enough time factoring the grammar, we could potentially switch to an LALR(1) parser and cut most of the time spent dealing with ambiguities.

Next steps

These could be implemented either in follow-up PRs or as updates to this PR – whichever is more convenient:

  • Bring in a comprehensive MySQL queries test suite, similar to WHATWG URL test data for parsing URLs. First, just ensure the parser either returns null or any parse tree where appropriate. Then, once we have more advanced tree processing, actually assert the parser outputs the expected query structures.
  • Create a MySQLOnSQLite database driver to enable running MySQL queries on SQLite. Read this comment for more context. Use any method that's convenient for generating SQLite queries. Feel free to restructure and redo any APIs proposed in this PR. Be inspired by the idea we may build a MySQLOnPostgres driver one day, but don't actually build any abstractions upfront. Make the driver generic so it can be used without WordPress. Perhaps it could implement a PDO driver interface?
  • Port MySQL features already supported by the SQLite database integration plugin to the new MySQLOnSQLite driver. For example, the SQL_CALC_FOUND_ROWS option or the INTERVAL syntax.
  • Run the SQLite database integration plugin test suite on the new MySQLOnSQLite driver and ensure it passes.
  • Rewire this plugin to use the new MySQLOnSQLite driver instead of the current plumbing.

adamziel changed the title from "Custom MySQL AST Parser" to "Exhaustive MySQL Parser" on Aug 17, 2024
adamziel (Collaborator, Author) commented Sep 13, 2024
@bgrgicak tested all plugins in the WordPress plugin directory for installation errors. The top 1000 results are published at https://github.com/bgrgicak/playground-tester/blob/main/logs/2024-09-13-09-22-17/wordpress-seo. A lot of these are about SQL queries. Just migrating to the new parser would solve many of these errors and give us a proper foundation for adding support for more MySQL features. CC @JanJakes

@@ -422,6 +422,16 @@ private function parse_recursive($rule_id) {
$node->append_child($subnode);
}
}

// Negative lookahead for INTO after a valid SELECT statement.
adamziel (Collaborator, Author):
Great comments!

NOW is a non-reserved keyword that can be used as an identifier in some cases,
and thus needs to be passed through "determineFunction".

CURRENT_TIMESTAMP, LOCALTIME, and LOCALTIMESTAMP are reserved keywords
and can't be used as identifiers.
JanJakes (Collaborator) commented:
I spent a week looking further into this, and I'd like to summarize my progress.

Summary

  1. I wrote a script to extract MySQL queries from mysql-server/mysql-test (link). I used it to scan the mysql-test/t directory (link), which, after a basic deduplication, yields over 66,000 queries.
  2. I ran @adamziel's lexer and parser on the dataset, and without any modifications, I got about 3,900 crashes (5%) and about 6,500 parsing failures (9%). This is a wonderful score, considering that the AI-generated lexer is incomplete and buggy, and I would say that this result by itself already shows that the approach appears to be viable.
  3. After implementing missing lexer parts, fixing some lexer issues, and solving some grammar conflicts, I got the numbers down to 0 crashes and about 2,500 failures (3%). It's likely only a matter of a few more fixes to get the number of parsing failures closer to 0 as well. Additionally, the query-scanning script doesn't yet handle some special constructs and SQL modes, which likely causes some failures too.

Overall, I think @adamziel's approach is a great and viable one! 👏

Parsing mysqltest

MySQL server tests are written using the MySQL Test Framework, which uses a custom DSL called the mysqltest language.

I implemented a simple script that attempts to skip all mysqltest-specific commands and preserve only the SQL queries. The script consumes the .test files and looks for a command delimiter. It is quite simple, but it needs to handle some specifics:

  1. There is a DELIMITER <delimiter> command that dynamically changes the delimiter itself.
  2. The delimiter must be ignored inside strings.
  3. The test files can have line comments.
  4. Some queries are expected to fail, which is marked by a special error command.
  5. Queries like SET SQL_MODE = ... change the SQL mode for all subsequent queries, which can affect lexing and parsing.
  6. Some commands can have multi-line arguments that are delimited by a dynamically set EOF sequence.

The current implementation doesn't handle the last two points yet, but both of them seem pretty straightforward to implement when needed.

The extracted queries are then deduplicated for exact matches and dumped in a CSV format to preserve newlines within SQL queries.
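A condensed sketch of such an extraction loop (covering points 1 and 3 from the list above; the string-awareness from point 2 and the remaining points are omitted) could look like this – an illustration, not the actual script:

// Illustrative sketch, not the actual script. Handles the dynamic DELIMITER
// command and skips comment/command lines; ignoring delimiters that appear
// inside string literals is deliberately left out.
function extract_queries( string $test_file ): array {
    $delimiter = ';';
    $buffer    = '';
    $queries   = [];
    foreach ( file( $test_file ) as $line ) {
        if ( preg_match( '/^\s*delimiter\s+(\S+)/i', $line, $matches ) ) {
            $delimiter = $matches[1]; // e.g. "DELIMITER //"
            continue;
        }
        if ( preg_match( '/^\s*(#|--)/', $line ) ) {
            continue; // mysqltest line comment or command prefix
        }
        $buffer .= $line;
        while ( false !== ( $at = strpos( $buffer, $delimiter ) ) ) {
            $queries[] = trim( substr( $buffer, 0, $at ) );
            $buffer    = substr( $buffer, $at + strlen( $delimiter ) );
        }
    }
    // Deduplicate exact matches, drop empties.
    return array_values( array_unique( array_filter( $queries ) ) );
}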

In the future, we can also expand the data set to extract queries from other directories within the test suite.

Lexer issues

The AI-generated lexer is surprisingly good and well-structured, but it does have some issues and missing pieces:

  1. Over 40 lexer symbols were simply missing. @adamziel had already added them to the list of symbols, and I added their implementation.
  2. There were some small issues with parsing alphanumerical and digit-based strings. Easy to fix.
  3. Binary and hex strings in the format x'abcdef' and b'010111' were not supported.
  4. The AI also hallucinated a bit and invented some non-existent version-based conditions.
  5. The lexer wasn't able to recognize charset definitions.
  6. A method to determine whether a non-reserved keyword is a function call or an identifier was not implemented.
  7. Handling of CURRENT_TIMESTAMP, LOCALTIME, and LOCALTIMESTAMP was implemented incorrectly. Fun fact – while all of these functions are synonyms of the NOW() function, they are reserved keywords and can be used without parentheses, while NOW is a non-reserved keyword and must be used with parentheses (otherwise, it would be an identifier). This is now handled correctly.
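To illustrate the last point: SELECT CURRENT_TIMESTAMP is valid and calls the function, whereas a bare SELECT NOW would be parsed as a column identifier – only SELECT NOW() invokes the function.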

There are likely more issues to be discovered in the lexer, but I wouldn't expect to run into any blockers. I'd like to do a full manual walkthrough to ensure the lexer fully corresponds to the MySQLLexer.g4 specification.

Grammar issues

The converted and compressed grammar originally comes from the MySQLParser.g4 specification, and it's almost equivalent. Although I discovered some small issues in the original specs and in the grammar conversion as well, none of them are major problems:

  1. The "alterOrderList" rule had a slightly incorrect definition in the original grammar.
  2. There was a rule that was incorrectly converted from ANTLR to EBNF.
  3. The "castType" seemed to be incomplete in the original grammar.

Grammar conflicts

What can pose more significant challenges is grammar conflicts. The original ANTLR grammar is meant to be processed by the ANTLR toolkit, which does a lot of heavy lifting in terms of conflict resolution. During the grammar compilation, the toolkit can refactor and reorder some rules, or introduce lookaheads into the generated parser. In our case, we don't have such a compiler at hand, which results in some conflicts manifesting during parsing.

To very briefly explain how conflicts occur, it is worth summarizing how the parser works:

  1. The grammar defines a set of rules (non-terminals) and symbols (terminals) that can be represented in the tree structure. In the tree, rules are the internal nodes and symbols are the leaves. Each rule can expand into one or more sequences of rules or symbols (its children).
  2. The parser starts at the root of the tree and tries to match the input against the first child branch. If it doesn't match, it tries the second branch, and so on. If no match is found for the current input using the chosen path, the parser won't backtrack to try other alternatives. In other words, if part of the input matches a specific subtree but the parent rule fails to match the whole input, the parser does not reconsider previous decisions. This can lead to a parsing conflict.

This behavior is correct (backtracking can be extremely expensive). In fact, this is a description of a simple LL parser. In this scenario, conflicts can occur in the following ways:

  1. FIRST/FIRST conflict — when different rules share the same terminal prefix.
  2. FIRST/FOLLOW conflict — when a rule can be skipped (ε), and the next input could fit either that rule or the one after it.
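To make these concrete: given a ::= X | X Y inside parent ::= a Z, the input X Y Z fails – the parser commits to the first alternative X and never revisits that choice when Z doesn't match (FIRST/FIRST). And given opt ::= X | ε inside parent ::= opt X, the input X is consumed by opt, leaving nothing for the trailing X (FIRST/FOLLOW).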

A special case of a FIRST/FIRST conflict is left recursion, but that was already eliminated by @adamziel. Additionally, grammars can also contain ambiguities (multiple different parse trees match the same input), but I wouldn't expect running into that in this case.

As for the above-mentioned conflicts, I did indeed run into some:

  1. SELECT ... INTO @var was matched by a branch for a simple select (without INTO), which made the full query fail to parse. For now, I hardcoded a manual lookahead to fix this.
  2. Non-reserved keywords in MySQL can be matched as identifiers in the grammar. However, identifiers can never take precedence over these keywords. For instance, it is valid to use ROWS as an identifier (e.g., in an alias), but in some places (e.g., OVER (ROWS ...)) it must be a keyword. In these cases, it helped to reorder some grammar rules – one, two, three, where the last one also had a FIRST/FIRST conflict.

Solving the conflicts manually and reordering the rules is probably not the ideal final solution, but it can help us understand the nature and quantity of these conflicts. As a next step, it might be worth addressing the conflicts by writing a simple script to analyze the grammar, but it's good to first understand what we actually need to solve. Ultimately, we should probably do some left-factoring, use lookaheads, and maybe try to build parsing tables to see if they could have a reasonable size in some form.

Some of the tooling seems to be provided by ebnfutils, but I had little luck addressing any of the conflicts using those tools. At least, I wrote a simple script to dump conflicts from the expanded JSON grammar.

Ultimately, I think when we understand the conflicts we're running into and how we want to address them, writing an algorithm to do so won't be a difficult task. I'll keep exploring a bit in this area.

Next steps

As for the next steps, I think it would make sense to focus on the following:

  1. Manual pass of the lexer and comparison with MySQLLexer.g4.
  2. Some more investigation into the reasons for the remaining parsing failures.
  3. Bringing this branch to a mergeable state to close the initial prototyping phase. The parser wouldn't be used anywhere just yet, but it would be a milestone.

What do you think?

adamziel (Collaborator, Author) commented Sep 30, 2024

This. Is. Incredible. Such good work here, @JanJakes!

A few thoughts I had as I was reading:

Solving the conflicts manually and reordering the rules is probably not the ideal final solution, but it can help us understand the nature and quantity of these conflicts.

Manual resolution is not a general solution over all possible parsers we could generate here, but it should be fine for the MySQL parser. I can only think of two scenarios where that would be a problem:

  1. There are too many conflicts to reasonably address them manually.
  2. The MySQL syntax keeps evolving in a direction that introduces more of these conflicts, making maintenance more difficult.

An automated script would surely help us with the latter, but perhaps that work doesn't need to happen before the initial SQLite translation layer – what do you think?

As a next step, it might be worth addressing the conflicts by writing a simple script to analyze the grammar

Another thought – once we have a generalized solution, we could generate more parsers this way. I've wanted a reasonably fast Markdown parser in PHP for a while (although that's totally out of scope here :-))

I got the numbers down to 0 crashes and about 2,500 failures (3%).

This is lit 🔥

JanJakes (Collaborator) commented Oct 1, 2024

@adamziel Well, I'm only building on your phenomenal foundations 😊

An automated script would surely help us with the latter, but perhaps that work doesn't need to happen before the initial SQLite translation layer – what do you think?

💯 I think in the first phase, we should be able to go with the manual fixes. Doing it properly and automatically could be a rabbit hole, so I would avoid going that way for now if we can. The fact that we got to a 3% error rate on an extensive test suite is promising.

Actually, it's even more promising given that there are still some issues in the lexer. I've done the first pass of semi-manually comparing the lexer to the MySQLLexer.g4 grammar, which has already yielded some nice results:

  1. More correctness (some more test failures eliminated).
  2. Nice size reduction (9460 LOC to 3828 LOC) while preserving performance.

What's promising is that this is only a part of the manual pass. I sorted out all tokens, function identifiers, synonyms, and version-specifics, but there are things I haven't addressed yet, and I know some of them need to be fixed. For instance, the AI used all the ctype functions, but that doesn't correctly represent the full range of allowed MySQL identifiers.

In summary, there's a chance even more test failures will be eliminated just by improving the lexer, hopefully leaving us with fewer interventions on the grammar side.

akirk (Member) commented Oct 2, 2024

I had a conversation with @JanJakes about this and just wanted to note the idea (whether feasible or not) that we might be able to transform the MySQL AST into an SQLite AST before transforming it back into an SQL query string and passing it on to SQLite. I found, for example, this (JS) project that creates SQL from an SQLite AST.

This was done together with the first semi-manual pass & check based on the MySQLLexer.g4 grammar
from https://github.com/mysql/mysql-workbench/blob/8.0.38/library/parsers/grammars/MySQLLexer.g4.

It also cuts the size of the lexer from 9460 LOC to 3827 LOC, and the correctness improvements
further reduce the number of failing tests.
JanJakes (Collaborator) commented Oct 2, 2024

Another thing that's worth mentioning is the SHOW PARSE_TREE statement in MySQL. It works with a debug build of MySQL, and it could be useful if we want to assert the parser's correctness in the future. We would need to do some kind of mapping to our AST, but I suppose it could work.
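(For reference: on debug builds, SHOW PARSE_TREE <statement> prints the server's own parse tree for the given statement – as JSON, if I remember correctly – so it could serve as ground truth to compare our output against.)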
