Language.detect failed to detect Japanese from Mandarin #48

Closed
chengchingwen opened this issue Jul 9, 2023 · 1 comment

Comments

chengchingwen commented Jul 9, 2023

We found that Languages.detect fails to detect Japanese text that contains kanji characters; it reports Mandarin (or nothing at all) instead.

julia> Languages.detect("組織が人材を募集する際")
(Languages.Mandarin(), Languages.MandarinScript(), 1.0)

julia> Languages.detect("職場で危険な状況を見つけた場合、")
(Languages.Mandarin(), Languages.MandarinScript(), 1.0)

julia> Languages.detect("業務で血液やその他の感染性物質に触れる機会がある場合")
(Languages.Mandarin(), Languages.MandarinScript(), 1.0)

julia> Languages.detect("情報を紛失や盗難から保護することは、評判を維持し、ビジネスを成長させ続けるために不可欠です。")
(nothing, nothing, 0)

This doesn't happen with whatlang-rs or whatlang-pyo3:

use whatlang::{detect, Lang, Script};

fn main() {
    let text1 = "業務で血液やその他の感染性物質に触れる機会がある場合";
    let text2 = "組織が人材を募集する際、採用候補者に対して求める最大のスキルの 1 つに、コラボレーション能力があります";

    let info1 = detect(text1).unwrap();
    let info2 = detect(text2).unwrap();

    dbg!(info1);
    dbg!(info2);
}

[src/main.rs:10] info1 = Info {
    script: Mandarin,
    lang: Jpn,
    confidence: 1.0,
}
[src/main.rs:11] info2 = Info {
    script: Hiragana,
    lang: Jpn,
    confidence: 1.0,
}

>>> from whatlang import detect
>>> detect("組織が人材を募集する際")
Language: jpn - Script: Mandarin - Confidence: 1 - Is reliable: true
>>> detect("職場で危険な状況を見つけた場合、")
Language: jpn - Script: Mandarin - Confidence: 1 - Is reliable: true
>>> detect("業務で血液やその他の感染性物質に触れる機会がある場合")
Language: jpn - Script: Mandarin - Confidence: 1 - Is reliable: true
>>> from whatlang import detect_lang
>>> detect_lang("業務で血液やその他の感染性物質に触れる機会がある場合")
Language: jpn

This happens because detect_lang_based_on_script always returns "Cmn" for MandarinScript:

detect_lang_based_on_script(text::AbstractString, script::MandarinScript, options) = ("Cmn", 1.0)

We might need either a more complicated detect_lang_based_on_script for MandarinScript, or to use whatlang-ffi directly.
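
As a rough sketch of what a "more complicated" check could look like: hiragana and katakana are used in Japanese but not in Mandarin, so seeing any kana in text that was classified as MandarinScript is a strong hint that it is Japanese. The snippet below is Python only for illustration. The names contains_kana and guess_lang_for_han_script are hypothetical, they are not part of Languages.jl or whatlang, and this is not necessarily how whatlang makes the decision.

```python
def contains_kana(text: str) -> bool:
    """Return True if text has at least one hiragana or katakana letter."""
    return any(
        "\u3041" <= ch <= "\u3096"      # hiragana letters (U+3041..U+3096)
        or "\u30a1" <= ch <= "\u30fa"   # katakana letters (U+30A1..U+30FA)
        for ch in text
    )


def guess_lang_for_han_script(text: str) -> tuple[str, float]:
    # Hypothetical stand-in for the MandarinScript branch of
    # detect_lang_based_on_script. The 1.0 confidence is a placeholder
    # that mirrors the current hard-coded value, not a computed score.
    if contains_kana(text):
        return ("Jpn", 1.0)
    return ("Cmn", 1.0)


print(guess_lang_for_han_script("組織が人材を募集する際"))
print(guess_lang_for_han_script("你好世界"))
```

With this heuristic the first three examples above would come out as "Jpn", since each contains hiragana such as が, を, or で. Japanese text written purely in kanji would still be labelled "Cmn", so this only narrows the problem.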

cc @rssdev10

AnnaZav commented Jul 25, 2023
