Rich text formatting #81
There are no tools yet to format text yourself, but you can copy/paste in an existing bit of marked-up HTML, for example from a web browser.
@@ -117,9 +117,14 @@ MarianInterface::MarianInterface(QObject *parent)
        model = std::make_shared<marian::bergamot::TranslationModel>(modelConfig, std::move(bundle), modelChange->settings.cpu_threads);
    } else if (input) {
        if (model) {
            // Remove the "<!DOCTYPE html>" bit
            auto begin = input->find("<html");
            input->erase(0, begin);
Any reason why we shouldn't be doing this at bergamot-translator? This is nice to have for everyone trying to translate HTML with our library, is it not?
You're right, we probably should. It's not difficult to parse. But then we also have to support adding it back into the translated output, right?
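The idea in this thread, stripping the doctype before translation and adding it back afterwards, could be sketched roughly as below. `splitPreamble` and `restorePreamble` are hypothetical names and not part of bergamot-translator or TranslateLocally. The sketch also guards against a missing `<html`: `std::string::find` then returns `npos`, and `erase(0, npos)` would clear the whole string. It assumes a lowercase `<html` tag, as the diff does.

```cpp
#include <string>
#include <utility>

// Hypothetical helper (illustration only).
// Splits `html` into {preamble, document}, where the preamble is everything
// before the first "<html" (for example "<!DOCTYPE html>").
// If "<html" is not found, the preamble is empty and the input is kept whole.
std::pair<std::string, std::string> splitPreamble(const std::string &html) {
    const std::string::size_type begin = html.find("<html");
    if (begin == std::string::npos)
        return {std::string(), html};
    return {html.substr(0, begin), html.substr(begin)};
}

// Hypothetical helper: puts the stored preamble back in front of the
// translated document.
std::string restorePreamble(const std::string &preamble,
                            const std::string &translated) {
    return preamble + translated;
}
```

The caller would keep the preamble around while the document part is being translated, then call `restorePreamble` on the result.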
Looks like you're missing the bergamot-translator update that makes it accept self-closing tags. Did you update the submodule to the specific checkout (or main, since it has been merged already)?
You were right, I didn't run Wikipedia gives me an error: Also, maybe in this mode we should paste images so it looks better? https://stackoverflow.com/questions/3254652/several-ways-of-placing-an-image-in-a-qtextedit But in general, web pages sometimes look horrible even when pasted into more complicated editors:
The HTML supported by Qt's TextEdit is quite limited. It basically is some inline stuff that is translatable to their internal markup. Nothing more. So copy/pasting most pages with CSS will not do anything good. Looking at that error though, I'm starting to see trouble. It might be pretty difficult to implement this properly using the HTML support we have in bergamot-translator. Looking at how Qt turns markup into HTML it looks like they don't emit closing tags for many elements, such as (Also looking at that code, I think things break down if you put HTML into the document title because that is added unescaped?)
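To illustrate the concern about the unescaped title: if text is placed into HTML without escaping, any `<` or `&` in it is read as markup. A minimal sketch of the escaping step follows. `escapeHtml` is a hypothetical helper written for illustration; in Qt code, `QString::toHtmlEscaped()` serves the same purpose.

```cpp
#include <string>

// Hypothetical helper (illustration only): escapes the characters that are
// significant in HTML text and double-quoted attribute values.
std::string escapeHtml(const std::string &text) {
    std::string out;
    out.reserve(text.size());
    for (char c : text) {
        switch (c) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;"; break;
            case '>': out += "&gt;"; break;
            case '"': out += "&quot;"; break;
            default:  out += c; break;
        }
    }
    return out;
}
```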
Hmm... In that case we should probably disable handling HTML and just handle XML from, say, Word documents? Can we detect that by peeking into the clipboard?
It's not our support for HTML, it's Qt's support for markup. When you copy/paste something into TranslateLocally, it is converted into Qt's own markup structure (QTextDocument, which is a bunch of QTextBlocks). What they support with that markup is limited. Whether you copy in from a Word document or a website, it will always be limited to what that data structure can represent. Any other markup is lost. The error message you ran into was due to how Qt does QTextDocument -> html:str. The screenshots though, those are all just illustrations of the limitations of QTextDocument.
Edit: Oh, but if you mean, could we intercept the clipboard and handle whatever XML Word throws at us directly? Where the QTextEdit is just there to give you a preview, but we copy the real un-messed-up translation directly back to the clipboard, ready for pasting… I don't know! bergamot-translator can't handle bad HTML, so that limits us a bit. But we could make the assumption that HTML generated by an application for copy/pasting is valid. Rough XML support could be implemented without too many issues (although it would need some flags to turn off some HTML-specific behaviour), but I don't know what Office XML looks like. Is all text in the XML translatable, or are there, like with HTML, certain tags that you should not translate, like
Also, the interaction would be a bit weird. We can't use anything other than HTML to insert text with markup in a QTextEdit, unless we write our own XXX->QTextDocument decoder. If you would then interact with the QTextEdit, and copy/paste back from that to Word, you'd get messed-up markup back. But if you'd copy the directly translated output without doing any editing, it could work because we could bypass QTextEdit?
Also there is no guarantee that we'd copy the translation you see in the preview to the clipboard because the QTextEdit preview might be using an inferior format from the clipboard to populate (often when you copy/paste, the program you copy from makes the copied data available in multiple mime types). We can't just translate the Office XML, and then use that to populate the QTextEdit preview because it wouldn't know how to read it. Looks like QTextEdit only supports HTML and plain text when copy/pasting really.
Hmmmm..... What if we use QtWebKit for rendering? That at the very least would have much better HTML support, but looking into it, it will not be a trivial task. I am also worried about overengineering though. At the moment it seems like enabling rich text will work in only a few limited cases. My idea about peeking was mostly to see if we have XML content that we can handle before enabling RTF mode, else fall back to plain text mode?
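The "peek, then fall back" idea could look something like the sketch below. It is only an illustration: `PasteMode` and `choosePasteMode` are hypothetical names, and the list of MIME types is assumed to come from the clipboard (in Qt, `QMimeData::formats()` returns such a list). Since QTextEdit only reads HTML and plain text, the sketch treats `text/html` as the only format that enables rich text mode.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical type (illustration only).
enum class PasteMode { RichText, PlainText };

// Hypothetical helper: picks rich text mode only if the clipboard offers a
// format we assume we can handle ("text/html" here), else plain text mode.
PasteMode choosePasteMode(const std::vector<std::string> &formats) {
    const bool hasHtml =
        std::find(formats.begin(), formats.end(), "text/html") != formats.end();
    return hasHtml ? PasteMode::RichText : PasteMode::PlainText;
}
```

Supporting another format, such as an Office XML type, would mean adding its MIME type to the check once there is a way to translate it.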
I'm sorry if this is off topic, as you are talking primarily about HTML. I noticed that I can't successfully translate texts that contain \n through the command line. Is there, or would there be, a way to ignore all sorts of carriage returns? Edit: Hmm, apologies. It seems like it does indeed translate with no problems and automatically removes all carriage returns on the command line. The issue I had must have had to do with Tauri. I'll revise my question: would it be possible to implement an option to only ignore \n and \r but not remove them?
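One way such an option might behave, sketched on the caller's side: translate each run of text between line breaks separately and copy the `\n` and `\r` characters through untouched. `translateKeepingLineBreaks` and its `translate` callback are hypothetical and stand in for whatever call performs the translation; this is not an existing option of the command line tool.

```cpp
#include <functional>
#include <string>

// Hypothetical helper (illustration only): passes each run of text between
// line breaks to `translate` and copies every '\n' and '\r' through unchanged.
std::string translateKeepingLineBreaks(
    const std::string &text,
    const std::function<std::string(const std::string &)> &translate) {
    std::string out;
    std::string::size_type pos = 0;
    while (pos < text.size()) {
        const std::string::size_type brk = text.find_first_of("\r\n", pos);
        const std::string::size_type end =
            (brk == std::string::npos) ? text.size() : brk;
        if (end > pos)
            out += translate(text.substr(pos, end - pos));  // text segment
        if (brk == std::string::npos)
            break;
        out += text[brk];  // keep the original line-break character
        pos = brk + 1;
    }
    return out;
}
```

A drawback of this approach is that a sentence wrapped across several lines is translated in pieces, which may reduce translation quality.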
First steps towards working with text that carries markup. The end goal is to be able to copy/paste (part of) a document from Microsoft Word into TranslateLocally, have it translated, and copy/paste the output directly back into Word with the right formatting.