Content types? #435

Open
kpu opened this issue Jul 8, 2022 · 1 comment
Labels
question Further information is requested

Comments

@kpu
Member

kpu commented Jul 8, 2022

I have 3 MSc students adding Word support. The most logical way of handling this is to extend HTML support to OOXML, which is mostly a matter of configuring the HTML Options object (though there are also details like multiple spaces being semantically meaningful). Doing that is the responsibility of the students. The question is how this should be exposed in the native interface. What we have right now is a boolean for HTML. Should it change to a content type: text/plain, text/html, or application/vnd.openxmlformats-officedocument.wordprocessingml.document?

@kpu kpu added the question Further information is requested label Jul 8, 2022
@jelmervdl
Member

Using MIME types instead of a boolean sounds like a good way to indicate how the content should be handled. It's also pretty future-proof, especially if we can just associate a MIME type with a processing class somewhere in the code.
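To make the "associate a MIME type with a processing class" idea concrete, here is a minimal sketch of what such a lookup could look like. Everything in it is hypothetical: `ContentProcessor`, the three processor classes, and `makeProcessor` are illustrative names, not part of the existing native interface.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical base class: one processor implementation per content type.
class ContentProcessor {
 public:
  virtual ~ContentProcessor() = default;
  virtual std::string name() const = 0;
};

// Hypothetical processors; real ones would carry the actual markup handling.
class PlainTextProcessor : public ContentProcessor {
 public:
  std::string name() const override { return "plain"; }
};

class HTMLProcessor : public ContentProcessor {
 public:
  std::string name() const override { return "html"; }
};

class OOXMLProcessor : public ContentProcessor {
 public:
  std::string name() const override { return "ooxml"; }
};

using ProcessorFactory = std::function<std::unique_ptr<ContentProcessor>()>;

// Single place where a MIME type is tied to a processing class.
const std::map<std::string, ProcessorFactory>& processorRegistry() {
  static const std::map<std::string, ProcessorFactory> registry = {
      {"text/plain",
       []() -> std::unique_ptr<ContentProcessor> {
         return std::make_unique<PlainTextProcessor>();
       }},
      {"text/html",
       []() -> std::unique_ptr<ContentProcessor> {
         return std::make_unique<HTMLProcessor>();
       }},
      {"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
       []() -> std::unique_ptr<ContentProcessor> {
         return std::make_unique<OOXMLProcessor>();
       }},
  };
  return registry;
}

// Returns a processor for the given content type, or throws if unsupported.
std::unique_ptr<ContentProcessor> makeProcessor(const std::string& contentType) {
  const auto& registry = processorRegistry();
  auto it = registry.find(contentType);
  if (it == registry.end()) {
    throw std::invalid_argument("unsupported content type: " + contentType);
  }
  return it->second();
}
```

With something like this, supporting a new format would mean adding one registry entry rather than another boolean in the interface.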

I'd warn them about extending the current HTML object to support OOXML. That code has already become pretty complicated on its own, and is filled with assumptions about how HTML is used semantically. Some of that is encapsulated in the HTML Options object, but there are also assumptions about how tags that are inserted back into the text need to align with whitespace (i.e. it turns hello_<b>_world_</b>! into hello_ _<b>world</b>_!). It could be a source of frustration, and I would not be opposed to just copying the HTML class for the OOXML one and stripping out all the bits you don't need, just to avoid possible weird interactions.

The xh_scanner.{h,cpp} files also have hard-coded assumptions about HTML, e.g. which tags should never have a closing element, like <input>, and which tags should never have their contents parsed, e.g. <script>. For XML parsing those rules would need to be disabled, and support for CDATA would need to be added back. I'm not sure whether it's easier to add those back in through if-statements (and then also having to add support for these new sections to the HTML class) or to have a copy of parts of that code in a separate XML parsing class. The latter might be more maintainable.
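As a rough illustration of a third option, the HTML-specific rules could be pulled out of the scanner into a small rule set chosen per markup mode. This is only a sketch under my own assumptions: `MarkupMode`, `ScannerRules`, and the helper functions are hypothetical names and do not reflect the actual xh_scanner API, and the tag lists are deliberately incomplete.

```cpp
#include <set>
#include <string>

// Hypothetical switch between the two markup flavours.
enum class MarkupMode { HTML, XML };

// Hypothetical rule set the scanner would consult instead of hard-coded lists.
struct ScannerRules {
  std::set<std::string> voidElements;     // tags that never have a closing element
  std::set<std::string> rawTextElements;  // tags whose contents are never parsed
  bool allowCDATA;                        // whether <![CDATA[ ... ]]> is recognised
};

ScannerRules rulesFor(MarkupMode mode) {
  if (mode == MarkupMode::HTML) {
    // Illustrative subset of the HTML rules, not the full list.
    return ScannerRules{{"br", "img", "input"}, {"script", "style"}, false};
  }
  // XML: no implicit void or raw-text elements, CDATA sections allowed.
  return ScannerRules{{}, {}, true};
}

// True if the scanner should expect a closing tag for this element.
bool needsClosingTag(const ScannerRules& rules, const std::string& tag) {
  return rules.voidElements.count(tag) == 0;
}

// True if the element's contents should be scanned for nested markup.
bool shouldParseContents(const ScannerRules& rules, const std::string& tag) {
  return rules.rawTextElements.count(tag) == 0;
}
```

Whether this is less work than a separate XML parsing class depends on how deeply the HTML assumptions are woven into the scanner's state machine, which the sketch does not address.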
