Merge pull request #270 from pitmonticone/master
Clean README, docs and docstrings
aviks authored Aug 21, 2023
2 parents 66862db + 0e0fa23 commit 7cc7ab2
Showing 10 changed files with 23 additions and 23 deletions.
README.md (4 changes: 2 additions & 2 deletions)
@@ -19,15 +19,15 @@ TextAnalysis provides support for standard tools and models for working with tex
  * DocumentTermMatrix and TF/IDF
  * LSA/LDA
  * Vocabulary and statistical Language Model
- * Co-occurance matrix
+ * Co-occurrence matrix
  * NaiveBayes classifier
  * ROUGE evaluation metrics

  This package also incorporates features from the [Languages](https://juliahub.com/ui/Packages/Languages/w1H1r) and [WordTokenizers](https://juliahub.com/ui/Packages/WordTokenizers/wKkKC) packages within the [JuliaText](https://github.com/JuliaText) ecosystem.

  ## TextModels

- The [TextModels](https://github.com/JuliaText/TextModels.jl) package enhances this library with the additon of practical neural network based models. Some of that code used to live in this package, but was moved to simplify installation and reduce the number of dependencies.
+ The [TextModels](https://github.com/JuliaText/TextModels.jl) package enhances this library with the addition of practical neural network based models. Some of that code used to live in this package, but was moved to simplify installation and reduce the number of dependencies.

  ## Installation
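For context on the features this hunk touches, a minimal sketch using the package's documented `Corpus`, `DocumentTermMatrix`, and `tf_idf` API (illustrative only, not part of this commit):

```julia
using TextAnalysis

# Build a tiny corpus, then the document-term matrix and TF-IDF weights
# named in the feature list above.
crps = Corpus([StringDocument("To be or not to be"),
               StringDocument("To become or not to become")])
update_lexicon!(crps)          # populate the corpus lexicon first
m = DocumentTermMatrix(crps)   # sparse document-term matrix
weights = tf_idf(m)            # TF-IDF weighting of the same matrix
```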
docs/src/LM.md (4 changes: 2 additions & 2 deletions)
@@ -31,9 +31,9 @@ Arguments:

  * `unk_cutoff`: Tokens with counts greater than or equal to the cutoff value will be considered part of the vocabulary.

- * `unk_label`: token for unkown labels
+ * `unk_label`: token for unknown labels

- * `gamma`: smoothing arugment gamma
+ * `gamma`: smoothing argument gamma

  * `discount`: discounting factor for `KneserNeyInterpolated`
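For context, a sketch of how the `Vocabulary` arguments above are used, following the pattern in the package's LM docs (exact behavior may vary by version):

```julia
using TextAnalysis

words = ["a", "c", "-", "d", "c", "a", "b", "r", "a", "c", "d"]
vocab = Vocabulary(words, 2)             # unk_cutoff = 2; default unk_label is "<unk>"
lookup(vocab, ["a", "b", "c", "alien"])  # tokens below the cutoff map to "<unk>"
```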
docs/src/documents.md (2 changes: 1 addition & 1 deletion)
@@ -9,7 +9,7 @@ allows one to work with documents stored in a variety of formats:
  * _NGramDocument_ : A document represented as a bag of n-grams, which are UTF8 n-grams that map to counts

  !!! note
-     These formats represent a hierarchy: you can always move down the hierachy, but can generally not move up the hierachy. A `FileDocument` can easily become a `StringDocument`, but an `NGramDocument` cannot easily become a `FileDocument`.
+     These formats represent a hierarchy: you can always move down the hierarchy, but can generally not move up the hierarchy. A `FileDocument` can easily become a `StringDocument`, but an `NGramDocument` cannot easily become a `FileDocument`.

  Creating any of the four basic types of documents is very easy:
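A short sketch of the hierarchy the corrected note describes, using the documented constructors (the `text(nd)` call is left commented out because moving up the hierarchy fails):

```julia
using TextAnalysis

sd = StringDocument("To be or not to be")
nd = NGramDocument("To be or not to be")  # bag of n-grams with counts

ngrams(sd)  # down the hierarchy: derive n-grams from a string document
# text(nd)  # would error: an NGramDocument cannot recover the original text
```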
docs/src/evaluation_metrics.md (8 changes: 4 additions & 4 deletions)
@@ -7,7 +7,7 @@ As of now TextAnalysis provides the following evaluation metrics.
  * [ROUGE-L](https://en.wikipedia.org/wiki/ROUGE_(metric))

  ## ROUGE-N
- This metric evaluatrion based on the overlap of N-grams
+ This metric evaluation based on the overlap of N-grams
  between the system and reference summaries.

      rouge_n(references, candidate, n; avg, lang)

@@ -18,15 +18,15 @@ The function takes the following arguments -
  * `candidate::AbstractString` = Input candidate summary, to be scored against reference summaries.
  * `n::Integer` = Order of NGrams
  * `avg::Bool` = Setting this parameter to `true`, applies jackkniving the calculated scores. Defaults to `true`
- * `lang::Language` = Language of the text, usefule while generating N-grams. Defaults to English i.e. Languages.English()
+ * `lang::Language` = Language of the text, useful while generating N-grams. Defaults to English i.e. Languages.English()

  ```julia
  julia> candidate_summary = "Brazil, Russia, China and India are growing nations. They are all an important part of BRIC as well as regular part of G20 summits."
  "Brazil, Russia, China and India are growing nations. They are all an important part of BRIC as well as regular part of G20 summits."

- julia> reference_summaries = ["Brazil, Russia, India and China are the next big poltical powers in the global economy. Together referred to as BRIC(S) along with South Korea.", "Brazil, Russia, India and China are together known as the BRIC(S) and have been invited to the G20 summit."]
+ julia> reference_summaries = ["Brazil, Russia, India and China are the next big political powers in the global economy. Together referred to as BRIC(S) along with South Korea.", "Brazil, Russia, India and China are together known as the BRIC(S) and have been invited to the G20 summit."]
  2-element Array{String,1}:
-  "Brazil, Russia, India and China are the next big poltical powers in the global economy. Together referred to as BRIC(S) along with South Korea."
+  "Brazil, Russia, India and China are the next big political powers in the global economy. Together referred to as BRIC(S) along with South Korea."
   "Brazil, Russia, India and China are together known as the BRIC(S) and have been invited to the G20 summit."

  julia> rouge_n(reference_summaries, candidate_summary, 2, avg=true)
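For intuition, the ROUGE-N idea in this example can be hand-rolled as a simplified set-based bigram recall; this sketch is independent of the package's `rouge_n`, which additionally clips counts and can jackknife over references:

```julia
# Naive whitespace tokenization; real ROUGE implementations normalize more carefully.
bigrams(tokens) = Set(zip(tokens[1:end-1], tokens[2:end]))

ref  = split(lowercase("Brazil, Russia, India and China are together known as the BRIC(S)"))
cand = split(lowercase("Brazil, Russia, China and India are growing nations"))

overlap = length(intersect(bigrams(ref), bigrams(cand)))
rouge2  = overlap / (length(ref) - 1)  # fraction of reference bigrams recovered
```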
docs/src/index.md (2 changes: 1 addition & 1 deletion)
@@ -21,5 +21,5 @@ before every snippet of code.

  ## TextModels

- The [TextModels](https://github.com/JuliaText/TextModels.jl) package enhances this library with the additon of practical neural network based models. Some of that code used to live in this package, but was moved to simplify installation and dependencies.
+ The [TextModels](https://github.com/JuliaText/TextModels.jl) package enhances this library with the addition of practical neural network based models. Some of that code used to live in this package, but was moved to simplify installation and dependencies.

docs/src/semantic.md (2 changes: 1 addition & 1 deletion)
@@ -42,7 +42,7 @@ julia> update_lexicon!(crps)
  julia> m = DocumentTermMatrix(crps)
  ```

- Latent Dirchlet Allocation has two hyper parameters -
+ Latent Dirichlet Allocation has two hyper parameters -
  * _α_ : The hyperparameter for topic distribution per document. `α<1` yields a sparse topic mixture for each document. `α>1` yields a more uniform topic mixture for each document.
  - _β_ : The hyperparameter for word distribution per topic. `β<1` yields a sparse word mixture for each topic. `β>1` yields a more uniform word mixture for each topic.
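These hyperparameters feed directly into `lda`; a sketch continuing from the `m = DocumentTermMatrix(crps)` above, assuming the `ϕ, θ = lda(m, k, iter, α, β)` signature shown in the package docs:

```julia
using TextAnalysis

k, iter = 2, 1000              # number of topics, Gibbs sampling iterations
α, β = 0.1, 0.1                # the two hyperparameters described above
ϕ, θ = lda(m, k, iter, α, β)   # ϕ: topics × words, θ: topics × documents
```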
src/LM/counter.jl (2 changes: 1 addition & 1 deletion)
@@ -2,7 +2,7 @@ using DataStructures

  """
  counter is used to make conditional distribution, which is used by score functions to
- calculate conditonal frequency distribution
+ calculate conditional frequency distribution
  """
  function counter2(data, min::Integer, max::Integer)
      data = everygram(data, min_len=min, max_len=max)
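To make the corrected docstring concrete: a conditional frequency distribution maps each context to a counter of following words. A standalone sketch of the idea; `cond_freq` is a hypothetical helper, not the package's `counter2`:

```julia
using DataStructures  # DefaultDict, Accumulator and counter, as in this file

function cond_freq(ngrams)
    # One counter of next-words per context (all tokens but the last).
    dist = DefaultDict{String, Accumulator{String, Int}}(() -> counter(String))
    for ng in ngrams                  # e.g. ng = "to be or"
        toks = split(ng)
        length(toks) < 2 && continue  # unigrams carry no context
        context = join(toks[1:end-1], " ")
        inc!(dist[context], String(toks[end]))
    end
    return dist
end

cond_freq(["to be", "be or", "to be or"])["to be"]  # counts of words after "to be"
```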
src/LM/langmodel.jl (8 changes: 4 additions & 4 deletions)
@@ -76,7 +76,7 @@ end
  """
  score(m::gammamodel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)
- score is used to output probablity of word given that context
+ score is used to output probability of word given that context
  Add-one smoothing to Lidstone or Laplace(gammamodel) models

@@ -96,7 +96,7 @@ end
  """
  To get probability of word given that context
- In otherwords, for given context calculate frequency distribution of word
+ In other words, for given context calculate frequency distribution of word
  """
  function prob(m::Langmodel, templ_lm::DefaultDict, word, context=nothing)

@@ -120,7 +120,7 @@ end
  """
  score(m::MLE, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)
- score is used to output probablity of word given that context in MLE
+ score is used to output probability of word given that context in MLE
  """
  function score(m::MLE, temp_lm::DefaultDict, word, context=nothing)

@@ -179,7 +179,7 @@ end
  """
  score(m::InterpolatedLanguageModel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)
- score is used to output probablity of word given that context in InterpolatedLanguageModel
+ score is used to output probability of word given that context in InterpolatedLanguageModel
  Apply Kneserney and WittenBell smoothing
  depending upon the sub-Type
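A usage sketch for the `score` methods documented here, following the pattern in the package's LM docs (treat the exact calls as assumptions rather than guarantees):

```julia
using TextAnalysis

voc   = ["my", "name", "is", "salman", "khan", "and", "krishna"]
train = ["khan", "is", "my", "good", "friend", "and", "He", "is", "my", "brother"]

model = MLE(voc)               # maximum-likelihood model over the vocabulary
fit   = model(train, 2, 2)     # fit n-gram counts with min = max = 2 (bigrams)
score(model, fit, "is", "my")  # probability of "is" given the context "my"
```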
src/coom.jl (4 changes: 2 additions & 2 deletions)
@@ -10,8 +10,8 @@
  """
  coo_matrix(::Type{T}, doc::Vector{AbstractString}, vocab::OrderedDict{AbstractString, Int}, window::Int, normalize::Bool)
- Basic low-level function that calculates the co-occurence matrix of a document.
- Returns a sparse co-occurence matrix sized `n × n` where `n = length(vocab)`
+ Basic low-level function that calculates the co-occurrence matrix of a document.
+ Returns a sparse co-occurrence matrix sized `n × n` where `n = length(vocab)`
  with elements of type `T`. The document `doc` is represented by a vector of its
  terms (in order)`. The keywords `window` and `normalize` indicate the size of the
  sliding word window in which co-occurrences are counted and whether to normalize
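`coo_matrix` is the low-level entry point; a sketch of the higher-level wrapper described in the package docs, where `CooMatrix` and `coom` are assumed from those docs:

```julia
using TextAnalysis

crps = Corpus([StringDocument("this is a text about an apple"),
               StringDocument("there are many texts about apples")])
update_lexicon!(crps)
C = CooMatrix(crps, window=2, normalize=false)  # count within a ±2-word window
coom(C)  # the sparse co-occurrence matrix itself
```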
src/preprocessing.jl (10 changes: 5 additions & 5 deletions)
@@ -209,7 +209,7 @@ end
  """
  remove_words!(doc, words::Vector{AbstractString})
  remove_words!(crps, words::Vector{AbstractString})
- Remove the occurences of words from `doc` or `crps`.
+ Remove the occurrences of words from `doc` or `crps`.
  # Example
  ```julia-repl
  julia> str="The quick brown fox jumps over the lazy dog"

@@ -247,7 +247,7 @@ end

  """
  sparse_terms(crps, alpha=0.05])
- Find the sparse terms from Corpus, occuring in less than `alpha` percentage of the documents.
+ Find the sparse terms from Corpus, occurring in less than `alpha` percentage of the documents.
  # Example
  ```
  julia> crps = Corpus([StringDocument("This is Document 1"),

@@ -282,7 +282,7 @@ end

  """
  frequent_terms(crps, alpha=0.95)
- Find the frequent terms from Corpus, occuring more than `alpha` percentage of the documents.
+ Find the frequent terms from Corpus, occurring more than `alpha` percentage of the documents.
  # Example
  ```
  julia> crps = Corpus([StringDocument("This is Document 1"),

@@ -318,7 +318,7 @@ end

  """
  remove_sparse_terms!(crps, alpha=0.05)
- Remove sparse terms in crps, occuring less than `alpha` percent of documents.
+ Remove sparse terms in crps, occurring less than `alpha` percent of documents.
  # Example
  ```julia-repl
  julia> crps = Corpus([StringDocument("This is Document 1"),

@@ -342,7 +342,7 @@ remove_sparse_terms!(crps::Corpus, alpha::Real = alpha_sparse) = remove_words!(c

  """
  remove_frequent_terms!(crps, alpha=0.95)
- Remove terms in `crps`, occuring more than `alpha` percent of documents.
+ Remove terms in `crps`, occurring more than `alpha` percent of documents.
  # Example
  ```julia-repl
  julia> crps = Corpus([StringDocument("This is Document 1"),
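Tying the corrected docstrings together, a toy run of the documented helpers (thresholds illustrative; outputs depend on the corpus):

```julia
using TextAnalysis

crps = Corpus([StringDocument("This is Document 1"),
               StringDocument("This is Document 2")])

sparse_terms(crps, 0.6)       # terms in fewer than 60% of documents, here "1" and "2"
frequent_terms(crps)          # terms in more than 95% of documents

remove_frequent_terms!(crps)  # strip those shared terms in place
text(crps[1])                 # inspect what is left of the first document
```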
