language codes #32

MortenHofft · 2020-02-07T13:22:04Z

We are currently using three-letter language codes. That is not enough to describe all the languages we would like to support/describe. An example is zh-TW (Chinese traditional/Taiwanese). We already have the website translated into traditional Chinese, and we do not want to lose this option.

So we need a new enumeration for languages (existing is here).
It seems natural to look to Crowdin as they make a living from translations.

marcos-lg · 2020-03-23T16:37:04Z

New enum is here: http://api.gbif-dev.org/v1/enumeration/basic/TranslationLanguage

Java code here: https://github.com/gbif/gbif-api/blob/master/src/main/java/org/gbif/api/vocabulary/TranslationLanguage.java

It's deployed in DEV

replaced Language enum with TranslationLanguage

MattBlissett · 2020-03-23T16:43:46Z

What's this for? (Meaning the API change more than the vocabulary using it.)

We have language codes for interpreting the language of a vernacular name, which will include minority and dead languages.

I don't know if that's the same thing as the languages we translate the portal / registry into.

marcos-lg · 2020-03-23T16:51:42Z

@MortenHofft requested it to differentiate between language variants like the Chinese ones (our current Language enum doesn't support that). The vocabularies will also be used, for example, to populate dropdowns in the UI, and the UI uses these variants.

The only reason I put it in gbif-api is for consistency for front-end developers to have this enum in the same endpoint as the others (http://api.gbif-dev.org/v1/enumeration/basic/TranslationLanguage). Should I move it?

MattBlissett · 2020-03-23T17:09:25Z

I'm not sure if we should add a second language vocabulary to the v1 API. We already have one, and should consider how it might be extended.

It seems a bit arbitrary to choose Crowdin's list of supported languages. There's a mixture of two- and three-letter codes, a few without countries, and stuff like Upside Down English and "Quenya", which is Lord of the Rings Elvish.

We'll support these APIs and vocabularies/enumerations for years, so it's worth spending the time to get it right.

@timrobertson100, @mdoering, what do you think?

mdoering · 2020-03-23T19:21:49Z

Yes, I feel similarly. There has been a prominent open issue in the GBIF API for some time about extending the existing but limited language enumeration: gbif/gbif-api#29

For CoL we need to support a wide array of languages for vernacular names. We decided to drop the GBIF language enum and instead go with a large list of three-letter ISO codes (>8000) taken from https://iso639-3.sil.org/code_tables/download_tables. These do not fit into an enum anymore.

This does not solve the problem with simplified and traditional Chinese though. These are seen as the same language but using different scripts. So you need a locale to distinguish them.
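To illustrate with the JDK's own java.util.Locale (not GBIF code): the two Chinese variants share one three-letter language code and differ only in the region or script part of the tag. This is a minimal sketch; the class name ChineseVariants and its helper methods are made up for this example.

```java
import java.util.Locale;

final class ChineseVariants {
  // Three-letter ISO 639 code of the language part of a BCP 47 tag.
  static String iso3(String tag) {
    return Locale.forLanguageTag(tag).getISO3Language();
  }

  // Four-letter ISO 15924 script code, or "" when the tag carries none.
  static String script(String tag) {
    return Locale.forLanguageTag(tag).getScript();
  }

  public static void main(String[] args) {
    // Same language code for both variants, so the code alone cannot tell them apart.
    System.out.println(iso3("zh-CN").equals(iso3("zh-TW")));
    // The script subtag is what separates simplified from traditional.
    System.out.println(script("zh-Hans-CN") + " / " + script("zh-Hant-TW"));
  }
}
```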

marcos-lg · 2020-03-23T19:56:58Z

I think it wouldn't be so easy to extend the current Language enum to accommodate the locales: we'd have to change the default serialization to use the locale instead of the 3-letter code we use now, and that would break the code that relies on it.

One solution could be to rename this new enum to Locale and not store the ISO 639-1 and 639-3 codes. Actually, this enum could use the current Language and Country enums, so we would only support the languages and countries available in those enums (we could add more languages if needed). So I mean this:

Locale(Language lang, Country country) {}

and this Locale will be serialized as something like Language.getIso2LetterCode-Country.get2LetterCode (e.g.: en-US)
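A rough sketch of how that could look, assuming nothing about the real gbif-api classes: the nested Language and Country enums below are cut-down stand-ins with a handful of values, and GbifLocale is a hypothetical name chosen only to avoid a clash with java.util.Locale.

```java
final class GbifLocale {
  // Stand-in for the gbif-api Language enum (hypothetical subset).
  enum Language {
    ENGLISH("en"), CHINESE("zh");
    final String iso2;
    Language(String iso2) { this.iso2 = iso2; }
  }

  // Stand-in for the gbif-api Country enum (hypothetical subset).
  enum Country {
    UNITED_STATES("US"), CHINA("CN"), TAIWAN("TW");
    final String iso2;
    Country(String iso2) { this.iso2 = iso2; }
  }

  private final Language language;
  private final Country country;

  GbifLocale(Language language, Country country) {
    this.language = language;
    this.country = country;
  }

  // Two-letter language code, a hyphen, then the two-letter country code.
  String serialize() {
    return language.iso2 + "-" + country.iso2;
  }

  public static void main(String[] args) {
    System.out.println(new GbifLocale(Language.ENGLISH, Country.UNITED_STATES).serialize());
    System.out.println(new GbifLocale(Language.CHINESE, Country.TAIWAN).serialize());
  }
}
```

Only languages that have a two-letter code can be serialized this way, which is one of the limits of this approach.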

timrobertson100 · 2020-03-24T07:10:35Z

Since we can't fix Language to work for us, I propose we consider marking Language as deprecated with instructions to use a LanguageCode. This is similar to the original proposal but with a naming change and following typical behavior for retiring something still in use.

If we did this, the deprecation notice should state that Language is expected to be removed in a v2 GBIF API. LanguageCode can contain a mix of existing Language codes plus the necessary subset of Crowdin language codes to meet our foreseen needs, with more added in future releases as necessary.

3-letter ISO codes look like a repeat of previous mistakes.

The Locale proposal looks likely to be limited in similar ways to 2 and 3 letter ISO codes (but I recognize the attempt to accommodate requests stated on this thread).

MattBlissett · 2020-03-24T10:28:05Z

> 3-letter ISO codes look like a repeat of previous mistakes.

What were those mistakes? Or, what are our requirements?

- Representing the many languages for which we have vernacular names. This can cover every language (major, minority, differences between countries etc).
  - This rules out an enum, it won't fit. That doesn't matter; we hardly ever refer to the languages in code. The ones we do can have a constant defined.
- Representing the languages we have in the registry, for dataset descriptions etc, which is a much smaller set, but still requires the country (zh-CN, zh-TW)
- Representing the languages the portal is translated to. Aligning with Crowdin would be helpful, otherwise including a mapping from these.
- Serializing the result into something reasonable for users of the checklist, registry and portal APIs.

Locale(LanguageCode, CountryCode), would cover these, except:

- Adding a ScriptCode would allow for when a language is written in different scripts, e.g. kaz-KZ-Cyrl and kaz-KZ-Latn. That might not be necessary, as the situation where it would be used (vernacular names) already accepts multiple values: hbs-SR-Cyrl: голуб hbs-SR-Latn: golub.
- Where a language is written in different ways but not according to countries. Mentioned only because I see an IETF language tag would then describe the GBIF/UN-style language, en-GB-oxendict.

If, alternatively, this is only for a small number of languages we choose to support, then an enum with 8-10 values seems reasonable.
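For reference, the JDK already models script and variant subtags, so the two exceptions above can be tried out with java.util.Locale. This is only an illustration, not a proposal for the GBIF classes; the class TagExamples and its methods are hypothetical. Note that IETF (BCP 47) tags put the script before the region, so the Kazakh example comes out as kk-Cyrl-KZ when built from the two-letter language code.

```java
import java.util.Locale;

final class TagExamples {
  // Builds a BCP 47 tag from a language, an ISO 15924 script and an ISO 3166 region.
  static String tag(String language, String script, String region) {
    return new Locale.Builder()
        .setLanguage(language)
        .setScript(script)
        .setRegion(region)
        .build()
        .toLanguageTag();
  }

  // Extracts the variant subtag (e.g. a spelling convention) from a tag.
  static String variant(String tag) {
    return Locale.forLanguageTag(tag).getVariant();
  }

  public static void main(String[] args) {
    System.out.println(tag("kk", "Cyrl", "KZ"));
    System.out.println(tag("kk", "Latn", "KZ"));
    System.out.println(variant("en-GB-oxendict"));
  }
}
```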

marcos-lg · 2020-03-24T12:23:53Z

To summarize, and if I understood correctly, the Locale(LanguageCode, CountryCode) option seems to be preferred? It looks useful to me to keep the current Language enum for the places where we need the 3-letter ISO codes. The Locale option is then an extension of it that covers a different use case. If at some point we need something that is not covered by this Locale enum (the 2 cases Matt mentioned), we'll see how to handle it.

So I think we can do this next:

1. I convert this new enum into a Locale(LanguageCode, CountryCode).
2. We clean it up a little and remove languages that we presumably won't need.
3. I create a PR with these changes, and you review it or make changes there (I will need help with the previous step).

Please comment if I've missed something or you disagree with something.

MattBlissett · 2020-03-24T13:40:27Z

This is the kind of thing I had in mind:

public class LanguageCode {
  private final String code2, code3, englishName;
  // Always use three-letter codes to serialize
  public static LanguageCode fromString(String code) ...
  // Validate and cache based on the ISO list Markus posted
}

Then we need a "Locale", except since there's java.util.Locale I think we should pick a different name. How about LanguageRegion?

public class LanguageRegion {
  private final LanguageCode languageCode;
  private final Optional<Country> region;
  // Serialize using the IETF form, i.e. prefer the two-letter language code if it exists.
  // It's possible to create "es_JP" or whatever, I don't think deciding what's valid is this class's issue.
  private static final LanguageRegion EN = ... // if it's useful to have these in code
  private static final LanguageRegion ES = ...
  ... fromString(String code)
  ... fromString(language, region)
}

The current Language can then be deprecated.
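Fleshed out a little, the two classes might look like the sketch below. It rests on several assumptions that are not settled in this thread: the five-row TABLE is a hypothetical stand-in for the ISO 639-3 file, the region is a plain String rather than the gbif-api Country enum, and toString() is used as the serialized form.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

final class LanguageCode {
  // Hypothetical stand-in for the ISO 639-3 table: {code3, code2 or "", English name}.
  private static final String[][] TABLE = {
    {"eng", "en", "English"},
    {"spa", "es", "Spanish"},
    {"zho", "zh", "Chinese"},
    {"dan", "da", "Danish"},
    {"nds", "", "Low German"},
  };
  // Cache keyed by both the two- and three-letter codes.
  private static final Map<String, LanguageCode> BY_CODE = new HashMap<>();
  static {
    for (String[] row : TABLE) {
      LanguageCode lc = new LanguageCode(row[1].isEmpty() ? null : row[1], row[0], row[2]);
      BY_CODE.put(lc.code3, lc);
      if (lc.code2 != null) BY_CODE.put(lc.code2, lc);
    }
  }

  final String code2; // null when the language has no two-letter code
  final String code3;
  final String englishName;

  private LanguageCode(String code2, String code3, String englishName) {
    this.code2 = code2;
    this.code3 = code3;
    this.englishName = englishName;
  }

  // Validates against the cached table; accepts two- or three-letter codes.
  static LanguageCode fromString(String code) {
    LanguageCode lc = BY_CODE.get(code.toLowerCase(Locale.ROOT));
    if (lc == null) throw new IllegalArgumentException("Unknown language code: " + code);
    return lc;
  }

  // Always serialize with the three-letter code.
  @Override public String toString() { return code3; }
}

final class LanguageRegion {
  private final LanguageCode languageCode;
  private final Optional<String> region; // ISO 3166-1 alpha-2, not validated here

  private LanguageRegion(LanguageCode languageCode, Optional<String> region) {
    this.languageCode = languageCode;
    this.region = region;
  }

  static LanguageRegion fromString(String language, String region) {
    return new LanguageRegion(LanguageCode.fromString(language),
        Optional.ofNullable(region).map(r -> r.toUpperCase(Locale.ROOT)));
  }

  // Accepts "zh-TW", "zh_TW" or a bare language code such as "eng".
  static LanguageRegion fromString(String tag) {
    String[] parts = tag.split("[-_]", 2);
    return fromString(parts[0], parts.length > 1 ? parts[1] : null);
  }

  // IETF-style form: prefer the two-letter language code if it exists.
  @Override public String toString() {
    String lang = languageCode.code2 != null ? languageCode.code2 : languageCode.code3;
    return region.map(r -> lang + "-" + r).orElse(lang);
  }
}
```

With this shape, LanguageRegion.fromString("zho", "tw") serializes as zh-TW, while a language without a two-letter code, such as nds, keeps its three-letter code in the tag.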

marcos-lg · 2020-03-24T14:55:59Z

OK, so I understand you mean to load all languages at startup from the file Markus posted (as they do in CoL)? And for the LanguageRegion we also have to do the same in order to know all the possible language-country combinations. Probably everything has to be in the same file, since the LanguageRegion depends on the LanguageCode, so that file as it is now is not valid for us.

Also, since they are not enums anymore we'd also have to do some changes in the http://api.gbif.org/v1/enumeration endpoint to accommodate these classes (probably agreed with the front-end developers).

As a long-term solution it looks good but requires some time, especially to come up with the file with all the possible combinations.

Since this is blocking the vocabulary from starting the import and curation of vocabularies, I suggest that we move this issue to the gbif-api and I move the TranslationLanguage enum from gbif-api to the vocabulary project and I remove some weird languages. As long as the new future implementation uses the same serialization it's no problem to change the vocabulary to use different classes.

Does this make sense to you?

MattBlissett · 2020-03-24T15:03:00Z

> Since this is blocking the vocabulary from starting the import and curation of vocabularies, I suggest that we move this issue to the gbif-api and I move the TranslationLanguage enum from gbif-api to the vocabulary project and I remove some weird languages. As long as the new future implementation uses the same serialization it's no problem to change the vocabulary to use different classes.

Yes, that's fine for the moment. It gives more time to consider how the API should handle languages.

#32 locales

marcos-lg · 2020-03-25T09:38:31Z

I moved the enum and renamed it: https://github.com/gbif/vocabulary/blob/master/model/src/main/java/org/gbif/vocabulary/model/enums/LanguageRegion.java

I also created an endpoint in the vocabulary to retrieve the values: http://api.gbif-dev.org/v1/vocabularyLanguage

It's only deployed in DEV for now.

Changes can still be made if the UI needs them or if more language cleanup is required.

marcos-lg · 2020-08-26T12:09:52Z

I'm closing this; the discussion can be continued in gbif/gbif-api#51

marcos-lg self-assigned this Mar 20, 2020

marcos-lg mentioned this issue Mar 23, 2020: replaced Language enum with TranslationLanguage #39 (Merged)

marcos-lg added a commit that referenced this issue Mar 23, 2020: Merge pull request #39 from gbif/#32-language-codes (f8e994f, replaced Language enum with TranslationLanguage)

This was referenced Mar 24, 2020: Implement new way of handling languages in the API gbif/gbif-api#51 (Open); #32 locales #42 (Merged)

marcos-lg added a commit that referenced this issue Mar 25, 2020: Merge pull request #42 from gbif/#32-locales (fb236ba, #32 locales)

MortenHofft mentioned this issue Aug 26, 2020: Add language to user's profile gbif/portal16#1402 (Closed)

marcos-lg closed this as completed Aug 26, 2020

ahahn-gbif mentioned this issue Apr 19, 2021: Language - curation before uploading first vocabulary version #77 (Open)
