added initial docs
Roman S Samarev committed Jul 29, 2023
1 parent c928e9d commit 99446b5
Showing 8 changed files with 117 additions and 0 deletions.
3 changes: 3 additions & 0 deletions README.md
@@ -6,6 +6,9 @@ Languages.jl
[![pkgeval](https://juliahub.com/docs/Languages/pkgeval.svg)](https://juliahub.com/ui/Packages/Languages/w1H1r)


[![](https://img.shields.io/badge/docs-stable-blue.svg)](https://juliatext.github.io/Languages.jl) [![](https://img.shields.io/badge/docs-dev-blue.svg)](https://juliatext.github.io/Languages.jl/dev)


## Introduction

Languages.jl is a Julia package for working with human languages. It provides:
2 changes: 2 additions & 0 deletions docs/.gitignore
@@ -0,0 +1,2 @@
build/
site/
2 changes: 2 additions & 0 deletions docs/Project.toml
@@ -0,0 +1,2 @@
[deps]
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
16 changes: 16 additions & 0 deletions docs/make.jl
@@ -0,0 +1,16 @@
using Documenter
using Languages

makedocs(
    sitename = "Languages",
    format = Documenter.HTML(),
    modules = [Languages],
    pages = [
        "Home" => "index.md",
        "API" => "api.md"
    ]
)

deploydocs(
    repo = "github.com/JuliaText/Languages.jl.git"
)
4 changes: 4 additions & 0 deletions docs/src/api.md
@@ -0,0 +1,4 @@
```@autodocs
Modules = [Languages]
Private = false
```
57 changes: 57 additions & 0 deletions docs/src/index.md
@@ -0,0 +1,57 @@
# Languages.jl

Languages.jl is a Julia package for working with human languages. It provides:

* Lists of words from each language for basic categories:
    * Articles
    * Indefinite Articles
    * Definite Articles
    * Prepositions
    * Pronouns
    * Stopwords

Currently, these word lists are available only for English and German.

This package also detects the script and language for written text in a wide variety of languages.

## Usage

    using Languages

    articles(Languages.English())
    stopwords(Languages.English())

All word lists are returned as vectors of UTF-8 strings.
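
Because the lists are plain string vectors, they can be used directly for filtering. The following is a minimal sketch, not part of the package: `sample_stopwords` is a hypothetical stand-in for the result of `stopwords(Languages.English())`, so the snippet runs without the package installed.

```julia
# Hypothetical stand-in for stopwords(Languages.English()); the real list is longer.
sample_stopwords = Set(["the", "a", "of"])

# Split a sentence into lowercase tokens and drop the stopwords.
tokens = split(lowercase("The history of a language"))
content_words = filter(t -> !(t in sample_stopwords), tokens)
```

Converting the list to a `Set` is a design choice for faster membership tests; a plain vector works as well.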

## Script detection

The script detection model works by checking which Unicode character ranges are present
in the input text.

    Languages.detect_script("To be or not to be") # => Languages.LatinScript()
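
To illustrate the idea of range-based detection, here is a minimal, self-contained sketch. It is not the package's implementation: `SCRIPT_RANGES` and `guess_script` are hypothetical names, and the ranges are a small illustrative subset of Unicode blocks.

```julia
# Hypothetical sketch: count characters per script by Unicode code point range.
# The ranges are an illustrative subset, not the tables used by Languages.jl.
const SCRIPT_RANGES = [
    "Latin"    => ['A':'Z', 'a':'z', '\u00C0':'\u024F'],
    "Cyrillic" => ['\u0400':'\u04FF'],
    "Greek"    => ['\u0370':'\u03FF'],
]

function guess_script(text::AbstractString)
    counts = Dict(name => 0 for (name, _) in SCRIPT_RANGES)
    for ch in text
        for (name, ranges) in SCRIPT_RANGES
            if any(r -> ch in r, ranges)
                counts[name] += 1
                break  # a character is counted for at most one script
            end
        end
    end
    # Pick the script with the most matching characters; walk the scripts in
    # declaration order so that ties resolve deterministically.
    best, best_count = nothing, 0
    for (name, _) in SCRIPT_RANGES
        if counts[name] > best_count
            best, best_count = name, counts[name]
        end
    end
    return best  # `nothing` when no character matched any range
end
```

In this sketch, spaces, digits, and punctuation match no range and are simply ignored.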

## Language Detection

A trigram-based model is used to detect the language of the text. The candidate
languages are filtered based on the detected script.

We detect 84 of the most common languages spoken around the world. This usually
covers most languages with more than 10 million native speakers.

    detector = LanguageDetector()
    detector("To be or not to be") # => (Languages.English(), Languages.LatinScript(), 1.0)
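
As a rough illustration of what a trigram model consumes, the sketch below counts the overlapping three-character sequences in a string. It is an assumption-laden example, not the package's code: `trigram_counts` is a hypothetical helper, and how the package scores such counts against each language is not shown here.

```julia
# Hypothetical helper: count overlapping character trigrams in `text`.
function trigram_counts(text::AbstractString)
    chars = collect(lowercase(text))  # Vector{Char} gives safe indexing for non-ASCII text
    counts = Dict{String,Int}()
    for i in 1:length(chars)-2
        tri = String(chars[i:i+2])
        counts[tri] = get(counts, tri, 0) + 1
    end
    return counts
end
```

Collecting the characters first avoids indexing into the string by byte offset, which would fail on multi-byte UTF-8 characters.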

## List All Supported Languages
You can use `list_languages()` to get all supported languages.

The `LanguageDetector` model returns the language, the script, and the confidence when applied to a string.

The language and script detection code in this package is heavily inspired by the Rust package [whatlang-rs](https://github.com/greyblake/whatlang-rs). That package is in turn derived from [franc](https://github.com/wooorm/franc). See `LICENSE.whatlang-rs` for details.

## Deprecations

The API of this package has been refurbished recently. If you have used this package earlier,
please be aware of these changes.

* The language names have been shortened: use `English` instead of `EnglishLanguage`. The language names are no longer exported, however, so they should be qualified with the package name: `Languages.English`.
* Every language is a type. However, all functions now accept and return instances of these types, rather than the types themselves.
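
The instance-based convention can be sketched in a few self-contained lines. The definitions below mirror the pattern in `src/types.jl` from this commit (an instance method forwarding to a method on the type), reduced to a single language; they are an illustration, not a replacement for loading the package.

```julia
# Reduced sketch of the pattern in src/types.jl, limited to one language.
abstract type Language end
struct Esperanto <: Language end

# Per-type definition, plus a generic method that forwards instances to it.
isocode(::Type{Esperanto}) = "epo"
isocode(lang::T) where {T<:Language} = isocode(T)

# Lookup table keyed by ISO code, holding instances rather than types.
const code_to_lang = Dict{String, Language}("epo" => Esperanto())

# Returns the language instance for `code`, or `nothing` if the code is unknown.
from_code(code::String) = get(code_to_lang, lowercase(code), nothing)
```

With these definitions, `isocode(Esperanto())` and `isocode(Esperanto)` both work, and `from_code` returns an instance that can be passed straight to the instance-based functions.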
22 changes: 22 additions & 0 deletions src/types.jl
Expand Up @@ -5,8 +5,25 @@ abstract type Language; end
# Portuguese, Romanian, Russian, Spanish, Swedish, Turkish

# These are ISO 639-2T alpha-3 and ISO 639-3 codes
"""
    isocode(lang::T) where {T<:Language}

Returns the ISO code of the language `lang`.
"""
isocode(lang::T) where {T<:Language} = isocode(T)

"""
    name(lang::T) where {T<:Language}

Returns the self-name of the language `lang`.
"""
name(lang::T) where {T<:Language} = name(T)

"""
    english_name(lang::T) where {T<:Language}

Returns the name of the language `lang` in English.
"""
english_name(lang::T) where {T<:Language} = english_name(T)

struct Esperanto <: Language; end; english_name(::Type{Esperanto}) = "Esperanto"; name(::Type{Esperanto}) = "Esperanto"; isocode(::Type{Esperanto}) = "epo";
@@ -182,6 +199,11 @@ global const code_to_lang = Dict{String, Language}(
"uig" => Uyghur(),
)

"""
    from_code(code::String)

Returns the language object for the ISO `code`, or `nothing` if the code is unknown.
The lookup is case-insensitive.
"""
function from_code(code::String)
    return get(code_to_lang, lowercase(code), nothing)
end
11 changes: 11 additions & 0 deletions src/whatlang.jl
@@ -9,6 +9,12 @@

const RELIABLE_CONFIDENCE_THRESHOLD = 0.8;

"""
    detect_script(text::AbstractString)

Detects the script of the given `text`.
Returns either a `Script` or a tuple `(Script, probability)`.
"""
function detect_script(text::AbstractString)
script_counters = [
[LatinScript() , 0],
@@ -384,6 +390,11 @@ Base.@deprecate detect(text::AbstractString, options=default_options()) Language
mutable struct LanguageDetector
end

"""
    (detector::LanguageDetector)(text::AbstractString, options=default_options())

Returns a tuple `(Language, Script, confidence)` for the given `text`.
Throws an `ArgumentError` if `text` is empty.
"""
function(m::LanguageDetector)(text::AbstractString, options=default_options())
if text==""; throw(ArgumentError("Cannot detect language for empty text")); end
script = detect_script(text)
