Merge pull request #13 from datasciencecampus/feature/unit_tests
Update cleaning and spellchecker functions
ColinDaglish authored Aug 8, 2023
2 parents fe6d915 + 32c1ba3 commit 428e33f
Showing 31 changed files with 1,643 additions and 842 deletions.
8 changes: 8 additions & 0 deletions docs/aqa/aqa_plan.md
@@ -10,3 +10,11 @@ them to this plan.

[aqua-book]: https://www.gov.uk/government/publications/the-aqua-book-guidance-on-producing-quality-analysis-for-government
[aqua-book-resources]: https://www.gov.uk/government/collections/aqua-book-resources


# How are we assuring our work?

1) The analysis is delivered as part of a reproducible analytical pipeline
2) The code has been developed in accordance with [best practice](https://best-practice-and-impact.github.io/qa-of-code-guidance/intro.html)
3) Code has been peer-reviewed by a suitably qualified third party
4) This analysis is being verified by a manual process conducted by qualitative researchers
9 changes: 9 additions & 0 deletions docs/aqa/assumptions_caveats.md
@@ -3,6 +3,7 @@
This log contains a list of assumptions and caveats used in this analysis.

## NLTK stopwords

Stopwords are commonly used words which carry little meaning on their own. The NLTK package has a pre-defined list of stopwords which we have implemented in this pipeline, so we can focus our analysis on the key words that we think are likely to reveal more insights.

For transparency, here is a list of the NLTK stopwords:
@@ -13,3 +14,11 @@ For transparency, here is a list of the NLTK stopwords:
'not', 'on', 'shouldn', 'ours', 'be', 'me', 'we', 'here', 'o', 'was', 'herself', 'after', 'aren', 'the', 'ma', 'which', "you've", 'then', 'against', 'same', 'being', 'below', 'in', 'wasn', 'over', 'don', 'them', 'both', 'some', 'such', 'during', 'why', 'its', 're', 'won', 'where', 'of', 'under', 'she']`

We have also added a few additional words, which can be found in the config (e.g. 'census', 'data').
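As a minimal, illustrative sketch of this filtering step (not the pipeline's exact code — the stopword list below is a small subset of NLTK's, and the helper name is hypothetical):

```python
# Minimal sketch of stopword filtering. NLTK_STOPWORDS is a small subset of
# NLTK's full list; 'census' and 'data' stand in for the additional words
# configured for this pipeline.
NLTK_STOPWORDS = {"the", "of", "in", "on", "we", "was", "which", "such"}
ADDITIONAL_STOPWORDS = {"census", "data"}

def remove_stopwords(tokens, extra=ADDITIONAL_STOPWORDS):
    """Drop stopwords from a list of lower-cased tokens."""
    stopword_set = NLTK_STOPWORDS | set(extra)
    return [t for t in tokens if t not in stopword_set]

tokens = ["we", "think", "the", "census", "data", "reveals", "insights"]
print(remove_stopwords(tokens))  # ['think', 'reveals', 'insights']
```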

## Spell Checker

The spell checker function flags any words it thinks are misspelled, then uses a Levenshtein distance algorithm to find permutations within an edit distance of 2 from the original word. Each word in this candidate list has an associated frequency value; the algorithm then finds the most likely word and replaces the misspelled word with it. A more thorough explanation of this method can be found in [Peter Norvig's blog](https://norvig.com/spell-correct.html).

One of the potential challenges of using this method is that it can auto-correct words or phrases which are unknown to the pre-defined dictionary (e.g. DfE), or fail to adapt to words which are more or less likely in a specific context, e.g. 'amin data' -> 'main data', when it is most probably referring to admin data.

To override the preset word dictionary, we have added a JSON file in the src directory called `spelling_words.json`. This file contains key-value pairs, where the word is the key and the value corresponds to the word frequency. Higher-frequency words are more likely to be selected when there are matches at similar distances. According to Norvig's article, this method achieves about a 70% correction rate.
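A condensed sketch of the Norvig-style approach described above (illustrative only — `WORD_FREQ` stands in for the contents of `spelling_words.json`, and the frequencies are made up):

```python
import string

# Stand-in for spelling_words.json: word -> frequency (illustrative values).
WORD_FREQ = {"main": 500, "admin": 300}

def edits1(word):
    """All strings one Levenshtein edit away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right
                for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    """Subset of `words` that appear in the frequency dictionary."""
    return {w for w in words if w in WORD_FREQ}

def correction(word):
    """Most frequent known word within edit distance 2, else the word itself."""
    candidates = (known([word])
                  or known(edits1(word))
                  or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                  or [word])
    return max(candidates, key=lambda w: WORD_FREQ.get(w, 0))

print(correction("amin"))  # 'main' wins over 'admin' because it is more frequent
```

This reproduces the failure mode in the paragraph above: both 'main' and 'admin' are one edit from 'amin', and frequency decides. Raising the frequency of 'admin' in `spelling_words.json` above that of 'main' is exactly the kind of override the next paragraph describes.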
2 changes: 1 addition & 1 deletion docs/aqa/data_log.md
@@ -1,3 +1,3 @@
# Data log

This log contains a list of data sources used in this analysis.
- Consultation on the future of population and migration statistics in England and Wales (with Welsh already translated)
68 changes: 68 additions & 0 deletions docs/user_guide/README.md
@@ -6,3 +6,71 @@ This is the user guide for the `consultation-nlp-2023` project.
:maxdepth: 2
./loading_environment_variables.md
```

## How to configure the model
The majority of the model configuration happens in `question_model_config.yaml`.

Within this file you will have configuration options for each of the questions that gets processed.

**example:**
```yaml
qu_12:
max_features: null
ngram_range: !!python/tuple [1,2]
min_df: 2
max_df: 0.9
n_topics: 3
n_top_words: 10
max_iter:
lda: 25
nmf: 1000
lowercase: true
topic_labels:
lda: null
nmf:
- "Admin Data"
- "Research"
- "Policy"
```
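Note the `!!python/tuple` tag on `ngram_range`: `yaml.safe_load` will reject it, so the file has to be read with a loader that understands Python tags. A minimal sketch, assuming PyYAML is installed:

```python
import yaml

config_text = """
qu_12:
  ngram_range: !!python/tuple [1, 2]
  min_df: 2
"""
# safe_load raises a ConstructorError on !!python/tuple; unsafe_load
# (only appropriate for config files you trust) converts it to a tuple.
config = yaml.unsafe_load(config_text)
print(config["qu_12"]["ngram_range"])  # (1, 2)
```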
In this example you can see that the yaml file is indented at various levels.
### qu_12
type: str
At the top level of indentation we have the question id, in this case 'qu_12'. Each number corresponds to the column number of the raw input data (i.e. qu_12 is column 12 of the raw data csv).
### max_features
type: int (or null)
This is an optional value, which can either be null (which will convert to None when transposed to Python) or an integer value for the maximum number of text features to include.
### ngram_range
type: tuple (though it looks a bit like a list)
ngrams, or word-combination ranges, can help to increase the number of features in your dataset, which is useful if multi-word phrases like "admin data" are used a lot in the responses. The two values `[1,2]` correspond to the start and end of the range, so this example would include unigrams (individual words) and bi-grams (2-word combinations). To have only single-word features, you can change the setting to `[1,1]`. You can also include tri-grams and longer if you wish.
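To make the range concrete, here is a small sketch of the n-grams that `[1,2]` generates for a three-word response (an illustrative helper, not the pipeline's code — the real n-gram extraction happens inside scikit-learn's vectorizer):

```python
def ngrams(tokens, ngram_range=(1, 2)):
    """All n-grams for n in the inclusive range (lo, hi)."""
    lo, hi = ngram_range
    return [" ".join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]

print(ngrams(["admin", "data", "research"]))
# ['admin', 'data', 'research', 'admin data', 'data research']
```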

### min_df
type: int or float
This is a way of filtering out less important words that don't appear in enough responses. `min_df` can either be a float value (e.g. 0.1), in which case it will be interpreted as a proportion, or an integer value (e.g. 2), where it will be interpreted as a number of responses.
So 0.1 would mean that a word needs to appear in at least 10% of the corpus to get through, and 2 would mean that it needs to appear in at least 2 documents.

### max_df
type: int or float
Similar to `min_df`, `max_df` is a way of filtering out words, but this time the more common ones. This field also takes floats and integers, interpreting them as proportions and absolute numbers respectively. So 0.9 would stop words appearing in more than 90% of documents from making their way through, and 100 would stop words that appear in more than 100 documents.
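The two thresholds can be sketched together. This illustrates the rule scikit-learn's vectorizers apply internally; the helper and corpus are made up for the example:

```python
def df_filter(docs, min_df=2, max_df=0.9):
    """Keep terms whose document frequency lies within [min_df, max_df].
    Integers are absolute document counts; floats are proportions."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for term in set(doc.split()):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    lo = min_df if isinstance(min_df, int) else min_df * n_docs
    hi = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(t for t, df in doc_freq.items() if lo <= df <= hi)

docs = ["admin data", "admin data policy", "admin research", "admin policy"]
# min_df=2 drops 'research' (1 doc); max_df=0.9 drops 'admin' (all 4 docs)
print(df_filter(docs))  # ['data', 'policy']
```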

### n_topics
type: int
This is the number of topics to attempt to model in the topic modelling, it must be an integer value.

### n_top_words
type: int
This is the number of top words to include in the modelling, it must be an integer value.

### max_iter
type: dictionary
This option breaks down further into `lda` and `nmf`, which are both integers. This setting relates to the number of iterations for the models to run through in order to move towards convergence. You may need to adjust these separately depending on model performance.

### lowercase
type: boolean
A switch setting for parsing words as lowercase or leaving them in their unadjusted form.

### topic_labels
type: dictionary
Again this one breaks down further into `lda` and `nmf`, as it is likely that after you have run the models you may wish to add specific topic labels for the plots you are generating. These can either be null or a list of strings. If you are setting labels, you must ensure there are the same number of labels as there are `n_topics`, otherwise the system will throw an error.
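The length check described above can be sketched as follows (a hypothetical helper, named here for illustration only):

```python
def validate_topic_labels(n_topics, topic_labels):
    """Raise if a model's labels are set but don't match n_topics."""
    for model, labels in topic_labels.items():
        if labels is not None and len(labels) != n_topics:
            raise ValueError(
                f"{model}: expected {n_topics} labels, got {len(labels)}"
            )

# Matches the example config above: null labels pass, lists must have
# exactly n_topics entries.
validate_topic_labels(3, {"lda": None, "nmf": ["Admin Data", "Research", "Policy"]})
```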
2 changes: 2 additions & 0 deletions requirements.txt
@@ -2,12 +2,14 @@ matplotlib==3.7.1
nltk==3.8.1
numpy==1.25.0
pandas==2.0.2
pyspellchecker==0.7.2
pytest==7.3.2
PyYAML==6.0
rapidfuzz==3.1.1
scikit_learn==1.2.2
scipy==1.11.1
schema==0.7.5
setuptools==67.6.1
spacy==3.6.0
textblob==0.17.1
24 changes: 0 additions & 24 deletions src/config.yaml

This file was deleted.

4 changes: 4 additions & 0 deletions src/general.yaml
@@ -0,0 +1,4 @@
raw_data_path: "data/raw/20230717_consultation_ingest.csv" #str
additional_stopwords: #list of words to filter; must be type str
- 'he'
lemmatize: True #bool; select False to use Stemmer
16 changes: 16 additions & 0 deletions src/models.yaml
@@ -0,0 +1,16 @@
# See docs/user_guide/README.md for instructions on how to set
# these configuration settings.
qu_12:
max_features: null #null converts to None, or int value
ngram_range: !!python/tuple [1,2]
min_df: 2 #float (proportion) or int (count)
max_df: 0.9 #float (proportion) or int (count)
n_topics: 3 #int greater than 0
n_top_words: 10 #int
max_iter:
lda: 25 #int
nmf: 1000 #int
lowercase: true #bool
topic_labels: #if not null, must be same number as n_topics in each
lda: null
nmf: null
131 changes: 0 additions & 131 deletions src/modules/analysis.py

This file was deleted.

