Translation Helper API
Formatter doubles as a translation helper to assist in extracting translatable text spans from the document and replacing non-translating text with identifiers so that they are not changed by the translation.
ℹ️ In versions prior to 0.60.0, formatter functionality was implemented in the `flexmark-formatter` module and required an additional dependency.
The process and the format of the extracted text assume that the translation service will not change markup elements consisting of the characters `*~()[]{}<>#`. Non-translating text is replaced with placeholder text `_#_`, where `#` is an integer that identifies the original text of the placeholder.
The translation process used was tested with Yandex.Translate which does an excellent job of preserving markdown markup during translation.
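Because the whole process relies on the translator leaving placeholders untouched, it can be useful to verify that assumption before feeding translated strings back. The following is a minimal sketch, not part of flexmark-java: the class `PlaceholderCheckSketch` and its methods are hypothetical, and it assumes the default `_#_` placeholder format.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; assumes the default "_#_" placeholder format. */
final class PlaceholderCheckSketch {
    private static final Pattern PLACEHOLDER = Pattern.compile("_\\d+_");

    /** Returns all placeholders found in the text, sorted so order changes are tolerated. */
    static List<String> placeholders(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = PLACEHOLDER.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        Collections.sort(found);
        return found;
    }

    /** True when the translated string contains exactly the placeholders of the source string. */
    static boolean placeholdersPreserved(String source, String translated) {
        return placeholders(source).equals(placeholders(translated));
    }

    public static void main(String[] args) {
        System.out.println(placeholdersPreserved("See [_1_](_2_) now.", "Voir [_1_](_2_) maintenant."));
        System.out.println(placeholdersPreserved("See [_1_](_2_) now.", "Voir [_1_](_ 2 _) maintenant."));
    }
}
```

A translated string that fails this check is probably best re-translated or reviewed by hand, since a damaged placeholder cannot be mapped back to its original text.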
The translation process is handled in several steps:
1. Parse the document to get the markdown AST. This is normal `flexmark-java` markdown parsing.
2. Format the document to get the markdown strings for translation. The document node from step 1 is used with the purpose set to `RenderPurpose.TRANSLATION_SPANS`.
3. Get the strings to be translated from the translation handler.
4. Translate the strings with your translation service of preference.
5. Set the translated strings in the translation handler.
6. Generate markdown with placeholders for non-translating strings and out-of-context translations. The document node from step 1 is used with the purpose set to `RenderPurpose.TRANSLATED_SPANS`.
7. Parse the document with placeholders. This is normal `flexmark-java` markdown parsing done on the document text returned from step 6.
8. Generate the final translated markdown with all non-translating placeholders replaced by the original text and translating placeholders replaced by their translated text. The document node from step 7 is used with the purpose set to `RenderPurpose.TRANSLATED`.
The extracted text runs are classified into three different types:
- Translating Spans - these are paragraphs, heading text, table cells and other stretches of text which can contain inline text elements: bold, italic and other custom elements such as strike-through, inserted, deleted, etc. The inline code element is excluded and considered non-translating, preserving its text as is.
- Non-Translating snippets - these are all text which should not be translated such as: link URI, identifiers, html blocks, inline html tags, etc.
- Translating snippets - these are text parts of other elements such as link text, image alt, reference link id, reference definition id. This translatable text is translated as a separate element outside the context of its container text.
For example:
Paragraph text with embedded link [Example Link](http://example.com) in it.
Although the link text appears inside a translating text span, it should not be translated as part of it because the translator can erroneously use its context to change the translation. The same element appearing in a different textual context would result in a different translation.
To eliminate such effects, the text `Example Link` will be replaced in the paragraph for translation by its placeholder `_1_` and its text provided as a separate translatable string. The non-translating URL will be replaced by the `_2_` placeholder and excluded from the translating text list.
In this example, the extracted translating text strings will be:
Example Link
Paragraph text with embedded link [_1_](_2_) in it.
If the following example translations are provided to the translation handler:
eXaAmpLeE liINK
paARaAGRaAph teEXt WiIth eEmBeEDDeED LiINK [_1_](_2_) iIN iIt.
Generation of the document at step 6 of the translation process will result in:
paARaAGRaAph teEXt WiIth eEmBeEDDeED LiINK [_1_](_2_) iIN iIt.
Parsing this markdown text in step 7 and generating the final document with placeholder replacement in step 8 will result in the translated document:
paARaAGRaAph teEXt WiIth eEmBeEDDeED LiINK [eXaAmpLeE liINK](http://example.com) iIN iIt.
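The link example can be traced with a small self-contained model. This is only a sketch of the idea and not flexmark-java's implementation: the class `LinkPlaceholderSketch`, its methods, and the simplified link pattern are all hypothetical, and it handles nothing beyond inline links.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy model of the link example; this is not flexmark-java's implementation. */
final class LinkPlaceholderSketch {
    // Simplified, hypothetical pattern for an inline link: [text](url)
    private static final Pattern LINK = Pattern.compile("\\[([^\\]]*)\\]\\(([^)]*)\\)");
    private static final Pattern PLACEHOLDER = Pattern.compile("_(\\d+)_");

    private final List<String> originals = new ArrayList<>();     // index = placeholder id - 1
    private final List<Boolean> translating = new ArrayList<>();  // true for translating snippets

    private String placeholder(String text, boolean isTranslating) {
        originals.add(text);
        translating.add(isTranslating);
        return String.format("_%d_", originals.size());
    }

    /** Replaces link text and URL with placeholders and returns the span to translate. */
    String extract(String paragraph) {
        Matcher m = LINK.matcher(paragraph);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String text = placeholder(m.group(1), true);   // translated out of context
            String url = placeholder(m.group(2), false);   // never translated
            m.appendReplacement(out, Matcher.quoteReplacement("[" + text + "](" + url + ")"));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Strings to send to the translator: translating snippets first, then the span itself. */
    List<String> translatingTexts(String span) {
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < originals.size(); i++) {
            if (translating.get(i)) {
                texts.add(originals.get(i));
            }
        }
        texts.add(span);
        return texts;
    }

    /** Replaces placeholders with translated text or, for non-translating ones, the original text. */
    String restore(String translatedSpan, Map<String, String> translations) {
        Matcher m = PLACEHOLDER.matcher(translatedSpan);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int index = Integer.parseInt(m.group(1)) - 1;
            String text = m.group(); // unknown ids are left untouched
            if (index >= 0 && index < originals.size()) {
                String original = originals.get(index);
                text = translating.get(index) ? translations.getOrDefault(original, original) : original;
            }
            m.appendReplacement(out, Matcher.quoteReplacement(text));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        LinkPlaceholderSketch sketch = new LinkPlaceholderSketch();
        String span = sketch.extract(
                "Paragraph text with embedded link [Example Link](http://example.com) in it.");
        System.out.println(sketch.translatingTexts(span));

        Map<String, String> translations = new HashMap<>();
        translations.put("Example Link", "eXaAmpLeE liINK");
        System.out.println(sketch.restore(
                "paARaAGRaAph teEXt WiIth eEmBeEDDeED LiINK [_1_](_2_) iIN iIt.", translations));
    }
}
```

The model keeps one table of originals for both kinds of placeholder and only consults the translations for the translating ones, which is why the URL comes back unchanged while the link text is replaced.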
A translator usage example is included in the `flexmark-java-samples` module: TranslationSample.java
Translation assistance is provided by the `Formatter.translationRender()` methods, which take the same arguments as `Formatter.render()` with two additional arguments: `TranslationHandler` and `RenderPurpose`.
`RenderPurpose` sets the purpose of the translation rendering:
- `RenderPurpose.FORMAT` - regular format, same as the `Formatter.render` methods.
- `RenderPurpose.TRANSLATION_SPANS` - extract translating text spans from the document and identify non-translating text spans.
- `RenderPurpose.TRANSLATED_SPANS` - replace translating text spans with the corresponding translated text.
- `RenderPurpose.TRANSLATED` - replace placeholder text with translated or original text, depending on the placeholder.
`TranslationHandler` provides functionality for tracking translating and non-translating spans and for storing information between renderer invocations. The default implementation can be customized or completely replaced.
The difficulty in the translation process is to ensure that intermediate text with placeholders results in text which will be recognized as the original markdown element which produced the placeholder. For this the parser is modified to recognize placeholders as valid elements.
For example, an HTML block element is replaced with a single `<___#_>`, where `#` is the integer placeholder ordinal position. Normally this is not a valid HTML block tag, but for purposes of translation the parser will recognize it as such. Similarly, inline HTML elements are replaced with `<__#_>` and auto-link URLs with `<____#_>`.
Other caveats include reference block element ids and their references, which must have matching placeholders for markdown parsing to work. Otherwise they will not resolve properly and will not produce the desired AST for placeholder replacement.
One such caveat relates to anchor refs which refer to heading elements and which are defined by the heading text. The translation process through the formatter will replace any anchor references in links to headings in the same document with new anchor refs generated from the translated heading text.
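The anchor remapping can be pictured with a short stand-alone sketch. Everything in it is an assumption made for illustration: the class `AnchorRemapSketch`, the simplified heading id rule (lower-case, punctuation dropped, whitespace turned into dashes), and the sample headings. The id generator actually used by flexmark-java may differ.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Toy model of anchor remapping; the heading id rule here is a simplified assumption. */
final class AnchorRemapSketch {
    /** Hypothetical id rule: lower-case, drop punctuation, turn whitespace runs into dashes. */
    static String headingId(String headingText) {
        String lower = headingText.toLowerCase(Locale.ROOT).trim();
        String cleaned = lower.replaceAll("[^\\p{L}\\p{Nd}\\s-]", "");
        return cleaned.replaceAll("\\s+", "-");
    }

    /** Maps each original anchor to the anchor generated from the translated heading text. */
    static Map<String, String> anchorMap(Map<String, String> headingTranslations) {
        Map<String, String> anchors = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : headingTranslations.entrySet()) {
            anchors.put("#" + headingId(e.getKey()), "#" + headingId(e.getValue()));
        }
        return anchors;
    }

    /** Rewrites link targets of the form (#old) to (#new). */
    static String remap(String markdown, Map<String, String> anchors) {
        String result = markdown;
        for (Map.Entry<String, String> e : anchors.entrySet()) {
            result = result.replace("(" + e.getKey() + ")", "(" + e.getValue() + ")");
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> headings = new LinkedHashMap<>();
        headings.put("Getting Started", "Erste Schritte");
        Map<String, String> anchors = anchorMap(headings);
        System.out.println(remap("See [intro](#getting-started).", anchors));
    }
}
```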
The most complex handling of reference consistency exists in EnumeratedReferenceNodeFormatter.java and AttributesNodeFormatter.java, where each enumerated reference consists of two parts, `category:id`, and both parts need to be consistent because the `category` part of the reference can also be used without the `id` part.
To help customize the placeholder format, the recognition of these placeholders by the parser, and the exclusion of non-translating text snippets, the following options are available in the `Formatter`:
- `TRANSLATION_ID_FORMAT`, default `"_%d_"`, format used for `String.format(format, placeholderId)` to convert an integer id into a text placeholder.
- `TRANSLATION_HTML_BLOCK_PREFIX`, default `"__"`, characters prefixed to placeholder text to distinguish an HTML block tag from HTML inline tags and other non-translating text during translation formatting.
- `TRANSLATION_HTML_INLINE_PREFIX`, default `"_"`, characters prefixed to placeholder text to distinguish an HTML inline tag from HTML block tags and other non-translating text during translation formatting.
- `TRANSLATION_AUTOLINK_PREFIX`, default `"___"`, characters prefixed to placeholder text to distinguish an auto-link placeholder tag from HTML block tags and other non-translating text during translation formatting.
- `TRANSLATION_EXCLUDE_PATTERN`, default `"^[^\\p{IsAlphabetic}]*$"`, pattern used to exclude any translating strings which match it. The default excludes any string that contains no Unicode alphabetic characters.
- `TRANSLATION_HTML_BLOCK_TAG_PATTERN`, default `"___(?:\\d+)_"`, parser pattern used to recognize HTML block tags which contain translation placeholders.
- `TRANSLATION_HTML_INLINE_TAG_PATTERN`, default `"__(?:\\d+)_"`, parser pattern used to recognize HTML inline tags which contain translation placeholders.
Custom elements which contain neither identifiers nor non-translating text need no changes, since by default all text nodes are treated as translating spans.
All text elements and reference identifiers in custom elements require implementing a `NodeFormatter` that renders the node using the translation API methods implemented in the `MarkdownWriter` used for appending formatted markdown:
- `MarkdownWriter.appendNonTranslating(CharSequence)` - will render a non-translating text snippet. Depending on the rendering purpose, it either takes text to be replaced with a placeholder, takes a placeholder and passes it through as is, or replaces the placeholder with the original text.
- `MarkdownWriter.appendTranslating(CharSequence)` - will render a translating text snippet. Depending on the rendering purpose, it either takes text to be replaced with a placeholder, takes a placeholder and passes it through as is, or replaces the placeholder with the translated text.
Translating and non-translating text spans are handled through the `NodeFormatterContext`:
- `NodeFormatterContext.translatingSpan(TranslatingSpanRender)` - will treat all text rendered by `TranslatingSpanRender` as translating. Depending on the rendering purpose, it will collect the text for translating, replace it with the translated text, or simply pass it through to the `MarkdownWriter` as is.
- `NodeFormatterContext.nonTranslatingSpan(TranslatingSpanRender)` - will treat all text rendered by `TranslatingSpanRender` as non-translating. Depending on the rendering purpose, it will replace the text with placeholder text, replace it with the original text, or simply pass it through to the `MarkdownWriter` as is.
For examples of how references are handled, it is best to consult the implementation of core elements in CoreNodeFormatter.java or in the extensions: