Merge pull request #270 from pitmonticone/master
Clean README, docs and docstrings
aviks authored Aug 21, 2023
2 parents 66862db + 0e0fa23 commit 7cc7ab2
Showing 10 changed files with 23 additions and 23 deletions.
README.md (4 changes: 2 additions & 2 deletions)
@@ -19,15 +19,15 @@ TextAnalysis provides support for standard tools and models for working with tex
  * DocumentTermMatrix and TF/IDF
  * LSA/LDA
  * Vocabulary and statistical Language Model
- * Co-occurance matrix
+ * Co-occurrence matrix
  * NaiveBayes classifier
  * ROUGE evaluation metrics

  This package also incorporates features from the [Languages](https://juliahub.com/ui/Packages/Languages/w1H1r) and [WordTokenizers](https://juliahub.com/ui/Packages/WordTokenizers/wKkKC) packages within the [JuliaText](https://github.com/JuliaText) ecosystem.

  ## TextModels

- The [TextModels](https://github.com/JuliaText/TextModels.jl) package enhances this library with the additon of practical neural network based models. Some of that code used to live in this package, but was moved to simplify installation and reduce the number of dependencies.
+ The [TextModels](https://github.com/JuliaText/TextModels.jl) package enhances this library with the addition of practical neural network based models. Some of that code used to live in this package, but was moved to simplify installation and reduce the number of dependencies.

  ## Installation
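For context on the features this hunk touches, a minimal sketch using the package's documented `Corpus`, `DocumentTermMatrix`, and `tf_idf` API (illustrative only, not part of this commit):

```julia
using TextAnalysis

# Build a tiny corpus, then the document-term matrix and TF-IDF weights
# named in the feature list above.
crps = Corpus([StringDocument("To be or not to be"),
               StringDocument("To become or not to become")])
update_lexicon!(crps)          # populate the corpus lexicon first
m = DocumentTermMatrix(crps)   # sparse document-term matrix
weights = tf_idf(m)            # TF-IDF weighting of the same matrix
```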
docs/src/LM.md (4 changes: 2 additions & 2 deletions)
@@ -31,9 +31,9 @@ Arguments:

  * `unk_cutoff`: Tokens with counts greater than or equal to the cutoff value will be considered part of the vocabulary.

- * `unk_label`: token for unkown labels
+ * `unk_label`: token for unknown labels

- * `gamma`: smoothing arugment gamma
+ * `gamma`: smoothing argument gamma

  * `discount`: discounting factor for `KneserNeyInterpolated`
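For context, a sketch of how the `Vocabulary` arguments above are used, following the pattern in the package's LM docs (exact behavior may vary by version):

```julia
using TextAnalysis

words = ["a", "c", "-", "d", "c", "a", "b", "r", "a", "c", "d"]
vocab = Vocabulary(words, 2)             # unk_cutoff = 2; default unk_label is "<unk>"
lookup(vocab, ["a", "b", "c", "alien"])  # tokens below the cutoff map to "<unk>"
```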
docs/src/documents.md (2 changes: 1 addition & 1 deletion)
@@ -9,7 +9,7 @@ allows one to work with documents stored in a variety of formats:
  * _NGramDocument_ : A document represented as a bag of n-grams, which are UTF8 n-grams that map to counts

  !!! note
-     These formats represent a hierarchy: you can always move down the hierachy, but can generally not move up the hierachy. A `FileDocument` can easily become a `StringDocument`, but an `NGramDocument` cannot easily become a `FileDocument`.
+     These formats represent a hierarchy: you can always move down the hierarchy, but can generally not move up the hierarchy. A `FileDocument` can easily become a `StringDocument`, but an `NGramDocument` cannot easily become a `FileDocument`.

  Creating any of the four basic types of documents is very easy:
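A short sketch of the hierarchy the corrected note describes, using the documented constructors (the `text(nd)` call is left commented out because moving up the hierarchy fails):

```julia
using TextAnalysis

sd = StringDocument("To be or not to be")
nd = NGramDocument("To be or not to be")  # bag of n-grams with counts

ngrams(sd)  # down the hierarchy: derive n-grams from a string document
# text(nd)  # would error: an NGramDocument cannot recover the original text
```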
docs/src/evaluation_metrics.md (8 changes: 4 additions & 4 deletions)
@@ -7,7 +7,7 @@ As of now TextAnalysis provides the following evaluation metrics.
  * [ROUGE-L](https://en.wikipedia.org/wiki/ROUGE_(metric))

  ## ROUGE-N
- This metric evaluatrion based on the overlap of N-grams
+ This metric evaluation based on the overlap of N-grams
  between the system and reference summaries.

      rouge_n(references, candidate, n; avg, lang)

@@ -18,15 +18,15 @@ The function takes the following arguments -
  * `candidate::AbstractString` = Input candidate summary, to be scored against reference summaries.
  * `n::Integer` = Order of NGrams
  * `avg::Bool` = Setting this parameter to `true`, applies jackkniving the calculated scores. Defaults to `true`
- * `lang::Language` = Language of the text, usefule while generating N-grams. Defaults to English i.e. Languages.English()
+ * `lang::Language` = Language of the text, useful while generating N-grams. Defaults to English i.e. Languages.English()

  ```julia
  julia> candidate_summary = "Brazil, Russia, China and India are growing nations. They are all an important part of BRIC as well as regular part of G20 summits."
  "Brazil, Russia, China and India are growing nations. They are all an important part of BRIC as well as regular part of G20 summits."

- julia> reference_summaries = ["Brazil, Russia, India and China are the next big poltical powers in the global economy. Together referred to as BRIC(S) along with South Korea.", "Brazil, Russia, India and China are together known as the BRIC(S) and have been invited to the G20 summit."]
+ julia> reference_summaries = ["Brazil, Russia, India and China are the next big political powers in the global economy. Together referred to as BRIC(S) along with South Korea.", "Brazil, Russia, India and China are together known as the BRIC(S) and have been invited to the G20 summit."]
  2-element Array{String,1}:
-  "Brazil, Russia, India and China are the next big poltical powers in the global economy. Together referred to as BRIC(S) along with South Korea."
+  "Brazil, Russia, India and China are the next big political powers in the global economy. Together referred to as BRIC(S) along with South Korea."
   "Brazil, Russia, India and China are together known as the BRIC(S) and have been invited to the G20 summit."

  julia> rouge_n(reference_summaries, candidate_summary, 2, avg=true)
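For intuition, the ROUGE-N idea in this example can be hand-rolled as a simplified set-based bigram recall; this sketch is independent of the package's `rouge_n`, which additionally clips counts and can jackknife over references:

```julia
# Naive whitespace tokenization; real ROUGE implementations normalize more carefully.
bigrams(tokens) = Set(zip(tokens[1:end-1], tokens[2:end]))

ref  = split(lowercase("Brazil, Russia, India and China are together known as the BRIC(S)"))
cand = split(lowercase("Brazil, Russia, China and India are growing nations"))

overlap = length(intersect(bigrams(ref), bigrams(cand)))
rouge2  = overlap / (length(ref) - 1)  # fraction of reference bigrams recovered
```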
docs/src/index.md (2 changes: 1 addition & 1 deletion)
@@ -21,5 +21,5 @@ before every snippet of code.

  ## TextModels

- The [TextModels](https://github.com/JuliaText/TextModels.jl) package enhances this library with the additon of practical neural network based models. Some of that code used to live in this package, but was moved to simplify installation and dependencies.
+ The [TextModels](https://github.com/JuliaText/TextModels.jl) package enhances this library with the addition of practical neural network based models. Some of that code used to live in this package, but was moved to simplify installation and dependencies.

docs/src/semantic.md (2 changes: 1 addition & 1 deletion)
@@ -42,7 +42,7 @@ julia> update_lexicon!(crps)
  julia> m = DocumentTermMatrix(crps)
  ```

- Latent Dirchlet Allocation has two hyper parameters -
+ Latent Dirichlet Allocation has two hyper parameters -
  * _α_ : The hyperparameter for topic distribution per document. `α<1` yields a sparse topic mixture for each document. `α>1` yields a more uniform topic mixture for each document.
  - _β_ : The hyperparameter for word distribution per topic. `β<1` yields a sparse word mixture for each topic. `β>1` yields a more uniform word mixture for each topic.
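These hyperparameters feed directly into `lda`; a sketch continuing from the `m = DocumentTermMatrix(crps)` above, assuming the `ϕ, θ = lda(m, k, iter, α, β)` signature shown in the package docs:

```julia
using TextAnalysis

k, iter = 2, 1000              # number of topics, Gibbs sampling iterations
α, β = 0.1, 0.1                # the two hyperparameters described above
ϕ, θ = lda(m, k, iter, α, β)   # ϕ: topics × words, θ: topics × documents
```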
src/LM/counter.jl (2 changes: 1 addition & 1 deletion)
@@ -2,7 +2,7 @@ using DataStructures

  """
  counter is used to make conditional distribution, which is used by score functions to
- calculate conditonal frequency distribution
+ calculate conditional frequency distribution
  """
  function counter2(data, min::Integer, max::Integer)
      data = everygram(data, min_len=min, max_len=max)
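To make the corrected docstring concrete: a conditional frequency distribution maps each context to a counter of following words. A standalone sketch of the idea; `cond_freq` is a hypothetical helper, not the package's `counter2`:

```julia
using DataStructures  # DefaultDict, Accumulator and counter, as in this file

function cond_freq(ngrams)
    # One counter of next-words per context (all tokens but the last).
    dist = DefaultDict{String, Accumulator{String, Int}}(() -> counter(String))
    for ng in ngrams                  # e.g. ng = "to be or"
        toks = split(ng)
        length(toks) < 2 && continue  # unigrams carry no context
        context = join(toks[1:end-1], " ")
        inc!(dist[context], String(toks[end]))
    end
    return dist
end

cond_freq(["to be", "be or", "to be or"])["to be"]  # counts of words after "to be"
```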
src/LM/langmodel.jl (8 changes: 4 additions & 4 deletions)
@@ -76,7 +76,7 @@ end
  """
  score(m::gammamodel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)
- score is used to output probablity of word given that context
+ score is used to output probability of word given that context
  Add-one smoothing to Lidstone or Laplace(gammamodel) models

@@ -96,7 +96,7 @@ end
  """
  To get probability of word given that context
- In otherwords, for given context calculate frequency distribution of word
+ In other words, for given context calculate frequency distribution of word
  """
  function prob(m::Langmodel, templ_lm::DefaultDict, word, context=nothing)

@@ -120,7 +120,7 @@ end
  """
  score(m::MLE, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)
- score is used to output probablity of word given that context in MLE
+ score is used to output probability of word given that context in MLE
  """
  function score(m::MLE, temp_lm::DefaultDict, word, context=nothing)

@@ -179,7 +179,7 @@ end
  """
  score(m::InterpolatedLanguageModel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)
- score is used to output probablity of word given that context in InterpolatedLanguageModel
+ score is used to output probability of word given that context in InterpolatedLanguageModel
  Apply Kneserney and WittenBell smoothing
  depending upon the sub-Type
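A usage sketch for the `score` methods documented here, following the pattern in the package's LM docs (treat the exact calls as assumptions rather than guarantees):

```julia
using TextAnalysis

voc   = ["my", "name", "is", "salman", "khan", "and", "krishna"]
train = ["khan", "is", "my", "good", "friend", "and", "He", "is", "my", "brother"]

model = MLE(voc)               # maximum-likelihood model over the vocabulary
fit   = model(train, 2, 2)     # fit n-gram counts with min = max = 2 (bigrams)
score(model, fit, "is", "my")  # probability of "is" given the context "my"
```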
src/coom.jl (4 changes: 2 additions & 2 deletions)
@@ -10,8 +10,8 @@
  """
  coo_matrix(::Type{T}, doc::Vector{AbstractString}, vocab::OrderedDict{AbstractString, Int}, window::Int, normalize::Bool)
- Basic low-level function that calculates the co-occurence matrix of a document.
- Returns a sparse co-occurence matrix sized `n × n` where `n = length(vocab)`
+ Basic low-level function that calculates the co-occurrence matrix of a document.
+ Returns a sparse co-occurrence matrix sized `n × n` where `n = length(vocab)`
  with elements of type `T`. The document `doc` is represented by a vector of its
  terms (in order)`. The keywords `window` and `normalize` indicate the size of the
  sliding word window in which co-occurrences are counted and whether to normalize
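`coo_matrix` is the low-level entry point; a sketch of the higher-level wrapper described in the package docs, where `CooMatrix` and `coom` are assumed from those docs:

```julia
using TextAnalysis

crps = Corpus([StringDocument("this is a text about an apple"),
               StringDocument("there are many texts about apples")])
update_lexicon!(crps)
C = CooMatrix(crps, window=2, normalize=false)  # count within a ±2-word window
coom(C)  # the sparse co-occurrence matrix itself
```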
src/preprocessing.jl (10 changes: 5 additions & 5 deletions)
@@ -209,7 +209,7 @@ end
  """
  remove_words!(doc, words::Vector{AbstractString})
  remove_words!(crps, words::Vector{AbstractString})
- Remove the occurences of words from `doc` or `crps`.
+ Remove the occurrences of words from `doc` or `crps`.
  # Example
  ```julia-repl
  julia> str="The quick brown fox jumps over the lazy dog"

@@ -247,7 +247,7 @@ end

  """
  sparse_terms(crps, alpha=0.05])
- Find the sparse terms from Corpus, occuring in less than `alpha` percentage of the documents.
+ Find the sparse terms from Corpus, occurring in less than `alpha` percentage of the documents.
  # Example
  ```
  julia> crps = Corpus([StringDocument("This is Document 1"),

@@ -282,7 +282,7 @@ end

  """
  frequent_terms(crps, alpha=0.95)
- Find the frequent terms from Corpus, occuring more than `alpha` percentage of the documents.
+ Find the frequent terms from Corpus, occurring more than `alpha` percentage of the documents.
  # Example
  ```
  julia> crps = Corpus([StringDocument("This is Document 1"),

@@ -318,7 +318,7 @@ end

  """
  remove_sparse_terms!(crps, alpha=0.05)
- Remove sparse terms in crps, occuring less than `alpha` percent of documents.
+ Remove sparse terms in crps, occurring less than `alpha` percent of documents.
  # Example
  ```julia-repl
  julia> crps = Corpus([StringDocument("This is Document 1"),

@@ -342,7 +342,7 @@ remove_sparse_terms!(crps::Corpus, alpha::Real = alpha_sparse) = remove_words!(c

  """
  remove_frequent_terms!(crps, alpha=0.95)
- Remove terms in `crps`, occuring more than `alpha` percent of documents.
+ Remove terms in `crps`, occurring more than `alpha` percent of documents.
  # Example
  ```julia-repl
  julia> crps = Corpus([StringDocument("This is Document 1"),
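Tying the corrected docstrings together, a toy run of the documented helpers (thresholds illustrative; outputs depend on the corpus):

```julia
using TextAnalysis

crps = Corpus([StringDocument("This is Document 1"),
               StringDocument("This is Document 2")])

sparse_terms(crps, 0.6)       # terms in fewer than 60% of documents, here "1" and "2"
frequent_terms(crps)          # terms in more than 95% of documents

remove_frequent_terms!(crps)  # strip those shared terms in place
text(crps[1])                 # inspect what is left of the first document
```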
