```julia
julia> train = ["khan","is","my","good", "friend","and","He","is","my","brother"]
# voc and train are used to train vocabulary and model respectively

julia> model = MLE(voc)
MLE(Vocabulary(Dict("khan"=>1,"name"=>1,"<unk>"=>1,"salman"=>1,"is"=>2,"Khan"=>1,"my"=>1,"he"=>1,"shahrukh"=>1,"and"=>1), 1, "<unk>", ["my", "name", "is", "salman", "khan", "and", "he", "is", "shahrukh", "Khan", "<unk>"]))

julia> voc
11-element Array{String,1}:
"my"
"name"
"is"
"salman"
"khan"
"and"
"he"
"is"
"my"
"name"
"is"
"salman"
"khan"
"and"
"he"
"is"
"shahrukh"
"Khan"
"<unk>"
"Khan"
"<unk>"

# you can see the "<unk>" token has been added to voc
julia> fit = model(train, 2, 2) # considering only bigrams

julia> unmaskedscore = score(model, fit, "is", "<unk>") # score outputs P(word | context) without replacing the context word with "<unk>"
0.3333333333333333

julia> masked_score = maskedscore(model, fit, "is", "alien")
0.3333333333333333
# as expected, masked_score is equivalent to unmaskedscore with the context word replaced by "<unk>"
```

### `score`

Used to evaluate the probability of a word given its context: *P(word | context)*

```julia
score(m::gammamodel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)
```

Arguments:

1. `m`: instance of the `Langmodel` struct.
2. `temp_lm`: output of a function call on the `Langmodel` instance.
3. `word`: the word to be scored.
4. `context`: the context of the given word.

- For `Lidstone` and `Laplace` models, it applies smoothing.

- For interpolated language models, `Kneserney` and `WittenBell` smoothing are provided; see the sketch below.
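
A minimal sketch, reusing `model` and `fit` from the MLE example above; the particular word/context pair is an illustrative assumption:

```julia
# a sketch, reusing `model` and `fit` from the MLE example above;
# the word/context pair here is an illustrative assumption
p = score(model, fit, "my", "is")  # P("my" | "is") under the fitted bigram counts
```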

### `maskedscore`

- It is used to evaluate the *score* with out-of-vocabulary words masked as "<unk>"

- The arguments are the same as for `score`; see the sketch below.
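
A quick check of the masking behaviour, reusing `model` and `fit` from the example above:

```julia
# "alien" is out of vocabulary, so maskedscore should replace it with "<unk>"
# before scoring, per the equivalence noted in the example above
maskedscore(model, fit, "is", "alien") == score(model, fit, "is", "<unk>")  # expected: true
```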

### `logscore`

- Evaluates the log score of the word in the given context.

- The arguments are the same as for `score` and `maskedscore`; see the sketch below.
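
A sketch relating it to `maskedscore`, assuming the log is taken base 2 (an assumption; base 2 is conventional for entropy and perplexity):

```julia
# assuming a base-2 logarithm, this should equal
# log2(maskedscore(model, fit, "is", "alien"))
lp = logscore(model, fit, "is", "alien")
```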

### `entropy`
```julia
entropy(m::Langmodel, lm::DefaultDict, text_ngram::Vector{T}) where {T <: AbstractString}
```

- Calculates the *cross-entropy* of the model for the given evaluation text.

- The input text must be an `Array` of ngrams of the same order; see the sketch below.
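
A minimal sketch, reusing the bigram `model` and `fit` from the example above; the evaluation bigrams themselves are illustrative assumptions:

```julia
# the evaluation ngrams must match the order of `fit` (bigrams here);
# these particular bigrams are illustrative assumptions
text_ngram = ["khan is", "is my"]
H = entropy(model, fit, text_ngram)
```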

### `perplexity`

- Calculates the perplexity of the given text.

- This is simply 2^cross-entropy (see `entropy`) of the text, so the arguments are the same as for `entropy`; see the sketch below.
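
A sketch mirroring the `entropy` example above, reusing its (assumed) `text_ngram`:

```julia
# perplexity is 2^entropy, so the call mirrors `entropy`
ppl = perplexity(model, fit, text_ngram)
```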

## Preprocessing

The following functions are provided for preprocessing:

1. `everygram`: returns all possible ngrams generated from a sequence of items, as an `Array{String,1}`:

```julia
julia> seq = ["To","be","or","not"]

julia> a = everygram(seq, min_len=1, max_len=-1)
10-element Array{Any,1}:
 "or"
 "not"
 "To"
 "be"
 "or not"
 "be or"
 "To be"
 "be or not"
 "To be or"
 "To be or not"
```

2. `padding_ngrams`: pads both the left and right of a sentence and outputs the ngrams of order n.

It also pads the original input `Array` of strings:

```julia
julia> example = ["1","2","3","4","5"]

julia> padding_ngrams(example, 2, pad_left=true, pad_right=true)
6-element Array{Any,1}:
 "<s> 1"
 "1 2"
 "2 3"
 "3 4"
 "4 5"
 "5 </s>"
```

## Vocabulary

Struct to store a language model's vocabulary.

It also adds a special "unknown" token which unseen words are mapped to.

```julia
julia> words = ["a", "c", "-", "d", "c", "a", "b", "r", "a", "c", "d"]

julia> vocabulary = Vocabulary(words, 2)
Vocabulary(Dict("<unk>"=>1,"c"=>3,"a"=>3,"d"=>2), 2, "<unk>")

# look up a sequence of words in the vocabulary
julia> word = ["a", "-", "d", "c", "a"]

julia> lookup(vocabulary ,word)
5-element Array{Any,1}:
"a"
"a"
"<unk>"
"d"
"c"
"d"
"c"
"a"
```
