Support Latin <-> Cyrillic transliteration and Latin digraphs for Serbian #483

Open
ivankokan opened this issue Mar 23, 2021 · 12 comments

@ivankokan
Contributor

ivankokan commented Mar 23, 2021

We do support multiple scripts (with the same script for input and output) via the script option. We do not yet have a case where we support transliteration, though.

Wikipedia tells me that three scripts are common in different regions: Arabic, Latin, and Cyrillic. Given this, a script option would make sense.

Originally posted by @jspitz in #482 (comment)

I am not sure whether the comment above means that transliteration is being considered for polyglossia's future...

Here are the bidirectional Unicode mappings for Serbian to start with.

serbian_cyrillic-latin_transliteration.xlsx

Note:

  • To be precise, there are no Cyrillic digraphs in Serbian (Љ, љ, Њ and њ can be considered as digraph-like letter pairs merged into single characters).
  • On top of that, there are no Cyrillic Title case variants.
  • The same mechanism as for Croatian should be used for Latin digraphs (checks, fallbacks to separate characters, options, and shorthands - at least for digraphs). See Idea: Digraphs in Croatian definitions #216.
  • Cyrillic Serbian "digraphs" are widely used and available in Cyrillic fonts (even in T2A) and on keyboard layouts, so no checks or fallbacks to separate characters need to be implemented.
  • The mappings are almost completely bijective. The exceptions are the 3 mappings where the Latin Title case digraphs must be mapped to Cyrillic Upper case characters (there is no Title case for Cyrillic at all).
  • There are no Latin digraphs or Cyrillic "digraphs" in gloss-serbian.ldf, so there is nothing to take care of there.

Some good examples to eventually test with:

  • аАбБвВгГдДђЂжЖћЋчЧшШ <-> aAbBvVgGdDđĐžŽćĆčČšŠ
  • љЉњЊџЏ -> ǉǇǌǊǆǄ (Latin digraphs if disableligatures is false)
  • љЉњЊџЏ -> ljLJnjNJdžDŽ (separate characters if the font is missing Latin digraphs or disableligatures is true)
  • ljLJnjNJdžDŽ (separate characters) -> лјЛЈнјНЈджДЖ (separate characters)
  • "lj"Lj"LJ"nj"Nj"NJ"dž"Dž"DŽ (shorthands with separate characters) -> љЉЉњЊЊџЏЏ
  • ǉǈǇǌǋǊǆǅǄ (Latin digraphs) -> љЉЉњЊЊџЏЏ.
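
To make the mapping above concrete, here is a minimal Python sketch. It is only a model of the table, not polyglossia or TECkit code: the names `SIMPLE`, `DIGRAPHS`, `CYR_TO_LAT`, `LAT_TO_CYR` and `transliterate` are hypothetical, the alphabet is deliberately partial, and the Latin digraphs are assumed to be the single Unicode code points U+01C4 to U+01CC (written as escapes so they cannot be confused with separate characters).

```python
# Hypothetical sketch of the Serbian Cyrillic <-> Latin character mapping.
# Partial alphabet only; extend SIMPLE for real use.
SIMPLE = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "ж": "ž", "ј": "j", "л": "l", "н": "n", "ћ": "ć", "ч": "č", "ш": "š",
}

# Cyrillic letter -> (lowercase, titlecase, uppercase) Latin digraph code point.
DIGRAPHS = {
    "љ": ("\u01C9", "\u01C8", "\u01C7"),  # lj, Lj, LJ
    "њ": ("\u01CC", "\u01CB", "\u01CA"),  # nj, Nj, NJ
    "џ": ("\u01C6", "\u01C5", "\u01C4"),  # dž, Dž, DŽ
}

CYR_TO_LAT, LAT_TO_CYR = {}, {}

for cyr, lat in SIMPLE.items():
    for c_char, l_char in ((cyr, lat), (cyr.upper(), lat.upper())):
        CYR_TO_LAT[c_char] = l_char
        LAT_TO_CYR[l_char] = c_char

for cyr, (lower, title, upper) in DIGRAPHS.items():
    CYR_TO_LAT[cyr] = lower
    CYR_TO_LAT[cyr.upper()] = upper   # assumed default: uppercase digraph
    LAT_TO_CYR[lower] = cyr
    LAT_TO_CYR[upper] = cyr.upper()
    LAT_TO_CYR[title] = cyr.upper()   # the three non-bijective mappings


def transliterate(text, table):
    """Map each character through the table; unknown characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)
```

With this table, separate Latin characters such as `l` followed by `j` map to separate Cyrillic characters (л, ј), while only the digraph code points map to Љ/љ, matching the examples in the list.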
@ivankokan ivankokan changed the title Support transliteration (Latin - Cyrillic) for Serbian Support transliteration (Latin - Cyrillic) and Latin digraphs for Serbian Mar 23, 2021
@ivankokan ivankokan changed the title Support transliteration (Latin - Cyrillic) and Latin digraphs for Serbian Support transliteration (Latin <-> Cyrillic) and Latin digraphs for Serbian Mar 23, 2021
@ivankokan ivankokan changed the title Support transliteration (Latin <-> Cyrillic) and Latin digraphs for Serbian Support Latin <-> Cyrillic transliteration and Latin digraphs for Serbian Mar 23, 2021
@jspitz
Collaborator

jspitz commented Mar 23, 2021

I am not sure whether the comment above means that transliteration is being considered for polyglossia's future...

Many things are possible if someone steps up and does the implementation.

@ivankokan
Contributor Author

@yannis1962 has prepared map files based on my contribution here. We'll see what happens next...

@yannis1962

I have prepared map files for Latin->Cyrillic and Cyrillic->Latin in the case of Serbian.

The only flaw I see is that when I have Љ Њ Џ as input, I can send them either to LJ NJ DŽ (uppercase) or to Lj Nj Dž (titlecase).
I added a context rule so that Љ Њ Џ followed by a lowercase letter is always sent to titlecase, and otherwise to uppercase.

I need confirmation by native speakers that this is a good choice.

For example, what happens when somebody has a given name starting with Љ? When I transliterate the initial “Љ.” into Latin
I will get “LJ.” which is obviously bad, but is the correct way to write the initial in that case “Lj.” or rather “L.” ?

Maybe I should implement another rule saying that when Љ is not preceded by a capital letter and is followed by a period, it
should be titlecase?

I need help from native speakers…

I'm including the MAP and TEC files, as well as two test files with the UDHR in Serbian (converted from Latin to Cyrillic and from Cyrillic to Latin) in TeX and PDF format. You will need to use some other font if you run them (XeTeX only).

Archive.zip

@ivankokan
Contributor Author

ivankokan commented Mar 23, 2021

The only flaw I see is that when I have Љ Њ Џ as input, I can send them either to LJ NJ DŽ (uppercase) or to Lj Nj Dž (titlecase).
I added a context rule so that Љ Њ Џ followed by a lowercase letter is always sent to titlecase, and otherwise to uppercase.

I need confirmation by native speakers that this is a good choice.

I am not a native speaker/writer but it looks OK.

For example, what happens when somebody has a given name starting with Љ? When I transliterate the initial “Љ.” into Latin
I will get “LJ.” which is obviously bad, but is the correct way to write the initial in that case “Lj.” or rather “L.” ?

Here I can contribute with the explicit rule:
Правопис српскога језика, Матица српска, 1994. (друго издање) [Orthography of the Serbian Language, Matica srpska, 1994, second edition]
https://sr.wikipedia.org/sr-el/%D0%9F%D1%80%D0%B0%D0%B2%D0%BE%D0%BF%D0%B8%D1%81_%D1%81%D1%80%D0%BF%D1%81%D0%BA%D0%BE%D0%B3%D0%B0_%D1%98%D0%B5%D0%B7%D0%B8%D0%BA%D0%B0
https://gimnazijadg.files.wordpress.com/2012/03/pravopis-srpskoga-jezika.pdf

Free translation: Latin digraphs used as starting letters in a sentence, a given name, or an abbreviation must be written as given in Table 8: Dž, Lj, Nj; but as DŽ, LJ, NJ in fully uppercase context (to emphasize).

I guess there are no changes in newer editions.

Maybe I should implement another rule saying that when Љ is not preceded by a capital letter and is followed by a period, it
should be titlecase?

I need help from native speakers…

Definitely, let us wait until then...

@yannis1962

yannis1962 commented Mar 23, 2021 via email

@ivankokan
Contributor Author

As I suspected. So that raises the question: how do I force the transcription into titlecase? How about using a LaTeX macro \titlecase{Љ} to be sure you will get a titlecase, no matter what follows?

"Smart ways": transliterate to titlecase if it is followed by something lowercase (starting a sentence) or a period (initials/abbreviations). This would obviously fail with a sentence having simply "Љ" as its first word.

I think that macro is inevitable in any case, hence no "smart way" needs to be implemented.

@yannis1962

yannis1962 commented Mar 23, 2021 via email

@yannis1962

Here are the files with the three smart rules mentioned in the previous message

Archive.zip

@ivankokan
Contributor Author

ivankokan commented Apr 12, 2021

I have been in contact with Uroš Stefanović (https://ctan.org/author/stefanovic) meanwhile. It seems we are getting somewhere with this implementation.

Let me just summarize what we currently have:

  • map files prepared by Yannis Haralambous (@yannis1962), XeTeX only, including three smart rules on how to transliterate from Cyrillic uppercase to Latin:
  1. titlecase if followed by lowercase
  2. uppercase if preceded by uppercase
  3. titlecase if not 2. and followed by a period
  • enriched set of small test examples (spaces are added so that the rules 1.-3. do not transliterate wrongly):
    • а А б Б в В г Г д Д ђ Ђ ж Ж ћ Ћ ч Ч ш Ш <-> a A b B v V g G d D đ Đ ž Ž ć Ć č Č š Š
    • љ Љ њ Њ џ Џ -> ǉ Ǉ ǌ Ǌ ǆ Ǆ (Latin digraphs if disableligatures is false)
    • љ Љ њ Њ џ Џ -> lj LJ nj NJ dž DŽ (separate characters if the font is missing Latin digraphs or disableligatures is true)
    • lj Lj LJ nj Nj NJ dž Dž DŽ (separate characters) -> лј Лј ЛЈ нј Нј НЈ дж Дж ДЖ (separate characters)
    • "lj "Lj "LJ "nj "Nj "NJ "dž "Dž "DŽ (shorthands with separate characters) -> љ Љ Љ њ Њ Њ џ Џ Џ
    • ǉ ǈ Ǉ ǌ ǋ Ǌ ǆ ǅ Ǆ (Latin digraphs) -> љ Љ Љ њ Њ Њ џ Џ Џ
  • more test examples to test smart rules (each one in two variants depending on disableligatures):
    • ЉУДИ -> LJUDI / LJUDI (no rule applies)
    • Љубљана -> Ljubljana / Ljubljana (rule 1.)
    • КОЊ -> KONJ / KONJ (rule 2.)
    • Џ. Костанза -> Dž. Kostanza / Dž. Kostanza (rule 3.)
    • one wants Џ. КОСТАНЗА -> DŽ. KOSTANZA / DŽ. KOSTANZA (rule 3. would be wrongly applied, producing Dž / Dž; one would need to use something like \uppercase{Џ})
    • one wants Љ -> Lj / Lj (no rule applies, so the default wrongly produces LJ / LJ; one would need to use something like \titlecase{Љ})
    • ADDED: one wants Џ. К О С Т А Н З А -> D Ž. K O S T A N Z A (rule 3. would be wrongly applied, producing Dž / Dž; one would need to use something like \uppercase[separate]{Џ})
    • ADDED: one wants Љ У Б Љ А Н А -> L J U B L J A N A (no rule applies, so the default wrongly produces LJ U B LJ A N A / LJ U B LJ A N A; one would need to use something like \uppercase[separate]{Љ})
    • ADDED: one wants Љ у б љ а н а -> L j u b l j a n a (no rule applies, so the default wrongly produces LJ u b lj a n a / LJ u b lj a n a; one would need to use something like \titlecase[separate]{Љ} and \lowercase[separate]{љ})
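
The three smart rules and their known failure cases can be modeled in a few lines of Python. This is only a sketch of the logic as summarized above, not the contents of the map files: `digraph_case` and `latin_digraph` are hypothetical names, the `force` parameter merely stands in for the proposed \titlecase / \uppercase macros, and "uppercase when no rule applies" is assumed as the default.

```python
# Hypothetical model of the three context rules for Cyrillic Љ Њ Џ.
# Cyrillic uppercase letter -> (titlecase, uppercase) Latin digraph code point.
DIGRAPHS = {
    "Љ": ("\u01C8", "\u01C7"),  # Lj, LJ
    "Њ": ("\u01CB", "\u01CA"),  # Nj, NJ
    "Џ": ("\u01C5", "\u01C4"),  # Dž, DŽ
}


def digraph_case(text, i):
    """Return 'title' or 'upper' for the Cyrillic letter at text[i]."""
    prev = text[i - 1] if i > 0 else ""
    nxt = text[i + 1] if i + 1 < len(text) else ""
    if nxt.islower():   # rule 1: followed by lowercase
        return "title"
    if prev.isupper():  # rule 2: preceded by uppercase
        return "upper"
    if nxt == ".":      # rule 3: not rule 2 and followed by a period
        return "title"
    return "upper"      # assumed default when no rule applies


def latin_digraph(text, i, force=None):
    """Latin digraph for text[i]; force='title' or 'upper' overrides the rules."""
    title, upper = DIGRAPHS[text[i]]
    case = force or digraph_case(text, i)
    return title if case == "title" else upper
```

For example, `latin_digraph("Џ. КОСТАНЗА", 0)` yields the titlecase digraph (the documented misfire of rule 3), while passing `force="upper"` gives the desired uppercase form, which is the role an explicit macro would play.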

TODO:

  • integrate Yannis' map files
  • Yannis Haralambous (@yannis1962) should eventually be acknowledged as a contributor in the manual
  • LuaTeX transliteration support - can someone provide references on how to achieve the same?
  • take over all serbian/serbianc babelshorthands
  • add shorthands for digraph ligatures (like in Croatian, be careful with "D and "d as such babelshorthands already exist for Đ/đ)
  • add support for explicit uppercase -> uppercase / titlecase transliteration in Cyrillic -> Latin direction

I guess that's all.

@jspitz
Collaborator

jspitz commented Apr 13, 2021

As for LuaTeX: Look at how ArabLuaTeX does it.

@jspitz
Collaborator

jspitz commented Apr 17, 2021

More specifically: https://tex.stackexchange.com/questions/285610/

@ivankokan
Contributor Author

ivankokan commented Apr 20, 2021

I have found two additional rules:
Правопис српскога језика, Матица српска, 2010. (измењено и допуњено, четврто издање) [Orthography of the Serbian Language, Matica srpska, 2010, revised and expanded fourth edition]
https://jelenaradomir.files.wordpress.com/2016/08/pravopis-ms_2010.pdf


При размакнутом (спационираном) писању сва слова се једнако раздвајају (L j u b l j a n a а не Lj u b lj a n a). Ако се натписи (нпр. MENJAČNICA) пишу одозго надоле, NJ, LJ односно DŽ не треба да остану састављени, него друго слово долази испод првог.

Google Translate (a bit improved):
With an increased letter spacing (separated characters), all glyphs are equally separated (L j u b l j a n a, not Lj u b lj a n a). If the inscriptions (e.g. MENJAČNICA) are written from top to bottom, NJ, LJ or DŽ should not remain composed, but the second letter comes below the first instead.

I would say that the first rule is feasible by giving the future macros \uppercase{Љ} and \titlecase{Љ} an optional argument, separate. (I edited my previous comment that summarizes everything: #483 (comment).)

The second rule is well outside polyglossia's scope.
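
The first rule (spaced writing) amounts to decomposing every Latin digraph into its two letters before spacing. A minimal Python sketch follows, assuming the digraphs are present as single Unicode code points; `SEPARATE` and `letterspace` are hypothetical names, not an existing polyglossia interface.

```python
# Hypothetical sketch: letter-spaced writing with digraphs split apart.
# Digraph code point -> two separate Latin letters.
SEPARATE = {
    "\u01C4": "D\u017D", "\u01C5": "D\u017E", "\u01C6": "d\u017E",  # DŽ, Dž, dž
    "\u01C7": "LJ", "\u01C8": "Lj", "\u01C9": "lj",
    "\u01CA": "NJ", "\u01CB": "Nj", "\u01CC": "nj",
}


def letterspace(word, gap=" "):
    """Split digraphs into separate letters, then space all letters equally."""
    separated = "".join(SEPARATE.get(ch, ch) for ch in word)
    return gap.join(separated)
```

Applied to the Latin form of Љубљана written with digraph code points, `letterspace` gives `L j u b l j a n a`, which is the form the orthography manual asks for.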
