Is the "<" or the ">" character always escaped in the database? #17
Hey, @Zodiac1978! Likewise, it was a pleasure meeting you!
Character escaping
I tested this myself because I was surprised by your report that characters were not escaped in the caption's text content. Everything else was more or less expected. Here is where I misled you: I told you about the block editor's serialiser (which escapes You can observe the discrepancies between what the client does and what the server does by opening the block editor, adding content, then switching to code mode, and comparing that markup with what you get later from the database. What you describe sounds a lot like WordPress/gutenberg#15636, where the culprit appears to be
Bringing this back to the realm of your plugin: it's clear that we can't guarantee that something along the way hasn't produced malformed HTML when dealing with user content, but the next best thing you can do is assume that attribute values can contain
Sourcing
No, synchronized alternative text is not recommended, because one image could have different alternative texts in different places.
As for the escaping issue, this looks like a major blocker for my solution. I need to dig deeper. Thanks for the links and the additional context.
I'm unsure whether this long-standing problem should be solved by my little plugin. I'm thinking about going back to my first approach: just ignoring the HTML comments. This would make the RegEx much more robust: Additionally, this would leave the filename, alternative text and caption intact.
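To illustrate the "ignore only the HTML comments" idea, here is a minimal Python sketch. It is not the plugin's actual expression: the pattern `BLOCK_COMMENT`, the helper `strip_block_comments` and the sample markup are all hypothetical, and the pattern only approximates the block delimiter grammar (opener, block name, optional JSON attributes, closer).

```python
import re

# Hypothetical pattern for block delimiters such as
# <!-- wp:image {"id":12} -->, <!-- /wp:image --> and <!-- wp:separator /-->.
BLOCK_COMMENT = re.compile(
    r"<!--\s+/?wp:[a-z][a-z0-9_-]*(?:/[a-z][a-z0-9_-]*)?\s+"  # opener and block name
    r"(?:\{.*?\}\s+)?"                                         # optional JSON attributes
    r"/?-->",                                                  # closer, optionally self-closing
    re.DOTALL,
)


def strip_block_comments(post_content: str) -> str:
    """Remove block comment delimiters and leave all other markup untouched."""
    return BLOCK_COMMENT.sub("", post_content)


# Made-up post content with unescaped "<" and ">" in the alt text and caption.
sample = (
    '<!-- wp:image {"id":12,"sizeSlug":"large"} -->'
    '<figure class="wp-block-image"><img src="cat.jpg" alt="a < b"/>'
    "<figcaption>1 < 2 > 0</figcaption></figure>"
    "<!-- /wp:image -->"
)

print(strip_block_comments(sample))
```

Because the pattern is anchored on the comment delimiters only, a stray "<" or ">" in the caption or alt text never takes part in the match, which is why this approach does not depend on those characters being escaped.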
I think that's your best bet, yes! Actually, come to think of it, I think that the HTML spec reserves I suggest you take a look at one of our official parsers, specifically its Please read the documentation attached to the expression. I don't know how compatible the expression is with the regex implementation of the database, but this one is definitely battle-tested. :)
Thanks @mcsf, but the tokenizer still only affects the comment part: It does not help me get the markup removed. There is also this new WP HTML Tag Processor: But I have no idea how this would help me in this case. The only way would be to use a separate table with the cleaned-up content ... but I think that is a little too much for core. MySQL lets you define your own functions, but the solution ChatGPT is suggesting also just looks for "<" and ">", so we have the same issue if those are not reliably encoded as entities.
DELIMITER //
CREATE FUNCTION strip_html_tags(input_text TEXT) RETURNS TEXT
BEGIN
DECLARE output_text TEXT DEFAULT '';
DECLARE start_pos INT DEFAULT 1;
DECLARE end_pos INT DEFAULT 0;
DECLARE text_length INT DEFAULT CHAR_LENGTH(input_text);
DECLARE in_tag BOOLEAN DEFAULT FALSE;
WHILE start_pos <= text_length DO
SET end_pos = LOCATE('<', input_text, start_pos);
IF end_pos = 0 THEN
SET output_text = CONCAT(output_text, SUBSTRING(input_text, start_pos));
SET start_pos = text_length + 1;
ELSE
IF start_pos < end_pos THEN
SET output_text = CONCAT(output_text, SUBSTRING(input_text, start_pos, end_pos - start_pos));
END IF;
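SET start_pos = LOCATE('>', input_text, end_pos);
IF start_pos = 0 THEN
-- No closing '>' after this '<': drop the remainder and end the loop.
SET start_pos = text_length + 1;
ELSE
SET start_pos = start_pos + 1;
END IF;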
END IF;
END WHILE;
RETURN output_text;
END //
DELIMITER ;
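To make the concern about unescaped "<" concrete, here is a rough Python sketch of the same scanning idea (drop everything from each "<" to the next ">"). It is only an illustration of the technique, not a replacement for the SQL; the sample captions are made up.

```python
def strip_html_tags(text: str) -> str:
    """Drop everything between each '<' and the next '>'."""
    out = []
    pos = 0
    while pos < len(text):
        lt = text.find("<", pos)
        if lt == -1:
            out.append(text[pos:])  # no more tags: keep the tail
            break
        out.append(text[pos:lt])    # keep the text before the tag
        gt = text.find(">", lt)
        if gt == -1:
            break                   # unterminated '<': drop the rest and stop
        pos = gt + 1
    return "".join(out)


# Hypothetical captions: one with the "<" stored as an entity, one without.
escaped = "<figcaption>1 &lt; 2, test</figcaption>"
unescaped = "<figcaption>1 < 2, test</figcaption>"

print(repr(strip_html_tags(escaped)))
print(repr(strip_html_tags(unescaped)))
```

With the entity-encoded caption the word "test" survives the stripping. With the raw "<", everything up to the next ">" is treated as a tag, so "test" disappears from the searchable text. Any approach that only scans for "<" and ">" will behave like this.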
Hey @mcsf - thanks again for our great chat at WordCamp Porto last weekend!
I've just tried to insert ">" in the alt attribute of the image block (WordPress 6.5.3), and it looks like the ">" is encoded correctly as "&gt;", but the "<" is indeed not modified.
Even worse, in the caption neither of them is modified ...
This is from the database:
It looks like the unmodified "<" is not breaking the regex, but the second issue, the unmodified "<", is preventing the search for "test" in my case above:
https://regex101.com/r/3moGTD/1
In my test case, I was also wondering why the alternative text from the media library wasn't used for the image block. Is this the intended behavior?