Rich text formatting #81

Draft
wants to merge 3 commits into
base: master

jelmervdl
Collaborator

First steps towards working with text that carries markup info. The end goal is to be able to copy/paste (part of) a document from Microsoft Word into TranslateLocally, have it translated, and copy/paste the output directly back into Word with the right formatting.

  • Maintain markup through translation
    • Expand bergamot-translator to accept xhtml-style self-closing tags :(
  • Toggle between rich text and plain text
  • Have controls/keyboard shortcuts to edit markup (e.g. ctrl + b to make something bold)
  • (Is it just me or is it slow?)

No tools yet to actually format it yourself, but you can copy/paste in an existing bit of marked-up HTML from, for example, a web browser.
@@ -117,9 +117,14 @@ MarianInterface::MarianInterface(QObject *parent)
model = std::make_shared<marian::bergamot::TranslationModel>(modelConfig, std::move(bundle), modelChange->settings.cpu_threads);
} else if (input) {
if (model) {
// Remove the "<!DOCTYPE html>" bit
auto begin = input->find("<html");
input->erase(0, begin);


Any reason why we shouldn't be doing this in bergamot-translator? It would be nice to have for everyone trying to translate HTML with our library, would it not?

Collaborator Author


You're right, we probably should. It's not difficult to parse. But then we also have to support adding it back into the translated output, right?
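
A minimal sketch of what stripping the prefix and adding it back could look like, using only the standard library. The names splitHtmlPrefix and restoreHtmlPrefix are hypothetical and are not functions in bergamot-translator or TranslateLocally. The sketch also guards the case where no "<html" is present: std::string::find then returns npos, and erase(0, npos) would clear the whole input.

```cpp
#include <string>
#include <utility>

// Hypothetical helper, not part of bergamot-translator or TranslateLocally.
// Splits `input` into (prefix, document), where `document` starts at the
// first "<html". If "<html" is not found, the prefix is empty and the
// input is returned unchanged as the document.
std::pair<std::string, std::string> splitHtmlPrefix(const std::string &input) {
    const std::string::size_type begin = input.find("<html");
    if (begin == std::string::npos) {
        return {std::string(), input};
    }
    return {input.substr(0, begin), input.substr(begin)};
}

// Puts the saved prefix (for example the doctype) back in front of the
// translated document.
std::string restoreHtmlPrefix(const std::string &prefix, const std::string &translated) {
    return prefix + translated;
}
```

The search is case-sensitive, so a document that starts with "<HTML" would be passed through untouched; whether that matters depends on what the pasting application emits.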

@XapaJIaMnu
Owner

Hmm, I can't run it. I get stuck at this screen on startup:
Screenshot_20220120_130542

Qt5 issue?

@jelmervdl
Collaborator Author

jelmervdl commented Jan 20, 2022

Hmm, I can't run it. I get stuck at this screen on startup: (screenshot…)

Qt5 issue?

Looks like you're missing the bergamot-translator update that makes it accept self-closing tags. Did you update the submodule to the specific checkout (or main, since it has been merged already)?

@XapaJIaMnu
Owner

You were right, I didn't run git submodule update --recursive.
BBC looks... pretty bad:
Screenshot_20220120_131421

Wikipedia gives me an error:

Screenshot_20220120_131523

Also, maybe in this mode we should paste images as well, so it looks better?

https://stackoverflow.com/questions/3254652/several-ways-of-placing-an-image-in-a-qtextedit

But in general, web pages sometimes look horrible even when pasted into more sophisticated editors:

Screenshot_20220120_133114

@jelmervdl
Collaborator Author

The HTML supported by Qt's TextEdit is quite limited. It is basically just the inline formatting that can be translated into Qt's internal markup, nothing more. So copy/pasting most pages with CSS will not give good results.

Looking at that error though, I'm starting to see trouble. It might be pretty difficult to implement this properly using the HTML support we have in bergamot-translator. Looking at how Qt turns markup into HTML it looks like they don't emit closing tags for many elements, such as <li>. If we want this, we might need to write our own QTextDocument -> HTML -> QTextDocument serialisation pipeline.

(Also looking at that code, I think things break down if you put HTML into the document title because that is added unescaped?)
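
To illustrate the escaping concern, here is a minimal sketch of the kind of escaping a title would need before being placed into HTML. escapeHtml is a hypothetical standalone helper written for this example; it is not Qt's code and not part of this PR.

```cpp
#include <string>

// Hypothetical helper: replaces the characters that are special in HTML
// text and double-quoted attribute values with their entity references.
std::string escapeHtml(const std::string &text) {
    std::string out;
    out.reserve(text.size());
    for (char c : text) {
        switch (c) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;"; break;
            case '>': out += "&gt;"; break;
            case '"': out += "&quot;"; break;
            default: out += c; break;
        }
    }
    return out;
}
```

In Qt code, QString::toHtmlEscaped() does a similar job for QString values, so a hand-written helper like this would mainly be useful on plain std::string data.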

@XapaJIaMnu
Owner

Hmm... In that case, should we disable HTML handling and only handle XML from, say, Word documents? Can we detect that by peeking at the clipboard?

@jelmervdl
Collaborator Author

jelmervdl commented Jan 21, 2022

Hmm... In that case, should we disable HTML handling and only handle XML from, say, Word documents? Can we detect that by peeking at the clipboard?

It's not our support for HTML, it's Qt's support for markup. When you copy/paste something into TranslateLocally, it is converted into Qt's own markup structure (QTextDocument, which is a bunch of QTextBlocks). What they support with that markup is limited. Whether you copy in from a Word document or a website, it will always be limited to what that data structure can represent. Any other markup is lost.

The error message you ran into was due to how Qt does QTextDocument -> html:str. The screenshots though, those are all just illustrations of the limitations of QTextDocument.

Edit: Oh, but if you mean: could we intercept the clipboard and handle whatever XML Word throws at us directly, with the QTextEdit just there to give you a preview, while we copy the real, un-messed-up translation directly back to the clipboard, ready for pasting… I don't know!

bergamot-translator can't handle bad HTML, so that limits us a bit. But we could make the assumption that HTML generated by an application for copy/pasting is valid. Rough XML support could be implemented without too many issues (although it would need some flags to turn off some HTML-specific behaviour), but I don't know what Office XML looks like. Is all text in the XML translatable, or are there, as with HTML, certain tags that you should not translate, like <style> and <script>?

Also, the interaction would be a bit weird. We can't use anything other than HTML to insert text with markup into a QTextEdit, unless we write our own XXX->QTextDocument decoder. If you then interacted with the QTextEdit and copy/pasted back from that to Word, you'd get messed-up markup back. But if you copied the directly translated output without doing any editing, it could work, because we could bypass QTextEdit? There is also no guarantee that the translation we copy to the clipboard is the one you see in the preview, because the QTextEdit preview might be populated from an inferior format on the clipboard (often when you copy/paste, the program you copy from makes the copied data available in multiple MIME types). We can't just translate the Office XML and then use that to populate the QTextEdit preview, because it wouldn't know how to read it. It looks like QTextEdit really only supports HTML and plain text when copy/pasting.

@XapaJIaMnu
Owner

Hmmmm..... What if we use QtWebKit for rendering? That would at the very least have much better HTML support, but from looking into it, it will not be a trivial task. I am also worried about overengineering, though.

At the moment it seems like enabling rich text will work in only a few limited cases. My idea about peeking was mostly to see whether we have XML content that we can handle before enabling rich text mode, and otherwise fall back to plain text mode?
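
The peek-then-fall-back idea could be sketched roughly as follows. This is only an illustration under stated assumptions: PasteMode and choosePasteMode are hypothetical names, the list of clipboard formats is assumed to be available as plain strings (in Qt it could be obtained from QMimeData::formats()), and which formats count as "handleable" is left to the caller.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical result type for the decision.
enum class PasteMode { RichText, PlainText };

// Hypothetical decision helper. `formats` lists the MIME types offered by
// the clipboard; `supportedRich` lists the MIME types the application
// believes it can translate with markup intact. Rich text mode is chosen
// only if at least one offered format is supported; otherwise the function
// falls back to plain text.
PasteMode choosePasteMode(const std::vector<std::string> &formats,
                          const std::vector<std::string> &supportedRich) {
    for (const std::string &format : formats) {
        if (std::find(supportedRich.begin(), supportedRich.end(), format) != supportedRich.end()) {
            return PasteMode::RichText;
        }
    }
    return PasteMode::PlainText;
}
```

Note that this only checks which formats are advertised. It does not validate that the content itself is well-formed enough for bergamot-translator, so a second check on the actual data would probably still be needed.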

@Godnoken

Godnoken commented Oct 1, 2022

I'm sorry if this is off topic, as you are primarily talking about HTML.

I noticed that I can't successfully translate texts that contain \n through the command line. I "fixed" it by substituting \n with another symbol like * before translation and swapping it back after the translation is done. It works sometimes but often not, because my text input contains different combinations of \n, \r and so on.

Is there, or could there be, a way to ignore all sorts of carriage returns?

Edit

Hmm, apologies. It seems like it does indeed translate with no problems and automatically removes all carriage returns on the command line. The issue I had must have had to do with Tauri.

I'll revise my question: would it be possible to implement an option that ignores \n and \r during translation but does not remove them?
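
Until such an option exists, one possible workaround on the caller's side is to translate the text between line breaks separately and copy the line breaks through untouched. The sketch below assumes exactly that; translatePreservingLineBreaks is a hypothetical wrapper, and the translate callback stands in for whatever call actually performs the translation.

```cpp
#include <functional>
#include <string>

// Hypothetical wrapper: runs of '\n' and '\r' are copied to the output
// verbatim, and each stretch of text between them is passed to `translate`
// on its own.
std::string translatePreservingLineBreaks(
    const std::string &text,
    const std::function<std::string(const std::string &)> &translate) {
    std::string out;
    std::string::size_type pos = 0;
    while (pos < text.size()) {
        // Copy a (possibly empty) run of line-break characters unchanged.
        std::string::size_type end = text.find_first_not_of("\r\n", pos);
        if (end == std::string::npos) end = text.size();
        out.append(text, pos, end - pos);
        pos = end;
        if (pos == text.size()) break;

        // Translate the text up to the next line break (or the end).
        end = text.find_first_of("\r\n", pos);
        if (end == std::string::npos) end = text.size();
        out += translate(text.substr(pos, end - pos));
        pos = end;
    }
    return out;
}
```

A trade-off of this approach is that every line is translated in isolation, so a sentence that is wrapped across several lines loses its context and may translate worse than it would as a whole.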
