Merge pull request #13 from datasciencecampus/feature/unit_tests
Update cleaning and spellchecker functions
ColinDaglish authored Aug 8, 2023
2 parents fe6d915 + 32c1ba3 commit 428e33f
Showing 31 changed files with 1,643 additions and 842 deletions.
8 changes: 8 additions & 0 deletions docs/aqa/aqa_plan.md
@@ -10,3 +10,11 @@ them to this plan.

[aqua-book]: https://www.gov.uk/government/publications/the-aqua-book-guidance-on-producing-quality-analysis-for-government
[aqua-book-resources]: https://www.gov.uk/government/collections/aqua-book-resources


# How are we assuring our work?

1) The analysis is delivered as part of a reproducible analytical pipeline
2) The code has been developed in accordance with [best practice](https://best-practice-and-impact.github.io/qa-of-code-guidance/intro.html)
3) Code has been peer-reviewed by a suitably qualified third party
4) This analysis is being verified by a manual process conducted by qualitative researchers
9 changes: 9 additions & 0 deletions docs/aqa/assumptions_caveats.md
@@ -3,6 +3,7 @@
This log contains a list of assumptions and caveats used in this analysis.

## NLTK stopwords

Stopwords are commonly used words which carry little meaning on their own. The NLTK package has a pre-defined list of stopwords which we have implemented in this pipeline, so we can focus our analysis on the key words that we think are likely to reveal more insights.

For transparency, here is a list of the NLTK stopwords:
@@ -13,3 +14,11 @@ For transparency, here is a list of the NLTK stopwords:
'not', 'on', 'shouldn', 'ours', 'be', 'me', 'we', 'here', 'o', 'was', 'herself', 'after', 'aren', 'the', 'ma', 'which', "you've", 'then', 'against', 'same', 'being', 'below', 'in', 'wasn', 'over', 'don', 'them', 'both', 'some', 'such', 'during', 'why', 'its', 're', 'won', 'where', 'of', 'under', 'she']`

We have also added a few additional words, which can be found in the config (e.g. 'census', 'data').
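As a minimal, illustrative sketch of this filtering step (not the pipeline's exact code — the stopword list below is a small subset of NLTK's, and the helper name is hypothetical):

```python
# Minimal sketch of stopword filtering. NLTK_STOPWORDS is a small subset of
# NLTK's full list; 'census' and 'data' stand in for the additional words
# configured for this pipeline.
NLTK_STOPWORDS = {"the", "of", "in", "on", "we", "was", "which", "such"}
ADDITIONAL_STOPWORDS = {"census", "data"}

def remove_stopwords(tokens, extra=ADDITIONAL_STOPWORDS):
    """Drop stopwords from a list of lower-cased tokens."""
    stopword_set = NLTK_STOPWORDS | set(extra)
    return [t for t in tokens if t not in stopword_set]

tokens = ["we", "think", "the", "census", "data", "reveals", "insights"]
print(remove_stopwords(tokens))  # ['think', 'reveals', 'insights']
```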

## Spell Checker

The spell checker function flags any words it thinks are misspelled, then uses a Levenshtein distance algorithm to find permutations within an edit distance of 2 from the original word. Each word in this candidate list has an associated frequency value; the algorithm then finds the most likely word and replaces the misspelled word with it. A more thorough explanation of this method can be found in [Peter Norvig's blog](https://norvig.com/spell-correct.html).

One of the potential challenges of using this method is that it can auto-correct words or phrases which are unknown to the pre-defined dictionary (e.g. DfE), or fail to adapt to words which are more or less likely in a specific context, e.g. 'amin data' -> 'main data', when it is most probably referring to admin data.

To override the preset word dictionary, we have added a JSON file in the src directory called `spelling_words.json`. This file contains key-value pairs, where the word is the key and the value corresponds to the word frequency. Higher-frequency words are more likely to be selected when there are matches at similar distances. According to Norvig's article, this method achieves about a 70% correction rate.
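A condensed sketch of the Norvig-style approach described above (illustrative only — `WORD_FREQ` stands in for the contents of `spelling_words.json`, and the frequencies are made up):

```python
import string

# Stand-in for spelling_words.json: word -> frequency (illustrative values).
WORD_FREQ = {"main": 500, "admin": 300}

def edits1(word):
    """All strings one Levenshtein edit away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right
                for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    """Subset of `words` that appear in the frequency dictionary."""
    return {w for w in words if w in WORD_FREQ}

def correction(word):
    """Most frequent known word within edit distance 2, else the word itself."""
    candidates = (known([word])
                  or known(edits1(word))
                  or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                  or [word])
    return max(candidates, key=lambda w: WORD_FREQ.get(w, 0))

print(correction("amin"))  # 'main' wins over 'admin' because it is more frequent
```

This reproduces the failure mode in the paragraph above: both 'main' and 'admin' are one edit from 'amin', and frequency decides. Raising the frequency of 'admin' in `spelling_words.json` above that of 'main' is exactly the kind of override the next paragraph describes.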
2 changes: 1 addition & 1 deletion docs/aqa/data_log.md
@@ -1,3 +1,3 @@
# Data log

This log contains a list of data sources used in this analysis.
- Consultation on the future of population and migration statistics in England and Wales (with Welsh already translated)
68 changes: 68 additions & 0 deletions docs/user_guide/README.md
@@ -6,3 +6,71 @@ This is the user guide for the `consultation-nlp-2023` project.
:maxdepth: 2
./loading_environment_variables.md
```

## How to configure the model
The majority of the model configuration happens in `question_model_config.yaml`.

Within this file you will have configuration options for each of the questions that gets processed.

**example:**
```yaml
qu_12:
max_features: null
ngram_range: !!python/tuple [1,2]
min_df: 2
max_df: 0.9
n_topics: 3
n_top_words: 10
max_iter:
lda: 25
nmf: 1000
lowercase: true
topic_labels:
lda: null
nmf:
- "Admin Data"
- "Research"
- "Policy"
```
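Note the `!!python/tuple` tag on `ngram_range`: `yaml.safe_load` will reject it, so the file has to be read with a loader that understands Python tags. A minimal sketch, assuming PyYAML is installed:

```python
import yaml

config_text = """
qu_12:
  ngram_range: !!python/tuple [1, 2]
  min_df: 2
"""
# safe_load raises a ConstructorError on !!python/tuple; unsafe_load
# (only appropriate for config files you trust) converts it to a tuple.
config = yaml.unsafe_load(config_text)
print(config["qu_12"]["ngram_range"])  # (1, 2)
```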
In this example you can see that the yaml file is indented at various levels.
### qu_12
type: str
At the top level of indentation we have the question id, in this case 'qu_12'. Each number corresponds to the column number of the raw input data (i.e. qu_12 is column 12 of the raw data csv).
### max_features
type: int (or null)
This is an optional value, which can either be null (which will convert to None when transposed to Python) or an integer value for the maximum number of text features to include.
### ngram_range
type: tuple (though it looks a bit like a list)
ngrams, or word-combination ranges, can help to increase the number of features in your dataset, which is useful if multi-word phrases like "admin data" are used a lot in the responses. The two values `[1,2]` correspond to the start and end of the range, so this example would include unigrams (individual words) and bi-grams (2-word combinations). To have only single-word features, you can change the setting to `[1,1]`. You can also include tri-grams and longer if you wish.
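To make the range concrete, here is a small sketch of the n-grams that `[1,2]` generates for a three-word response (an illustrative helper, not the pipeline's code — the real n-gram extraction happens inside scikit-learn's vectorizer):

```python
def ngrams(tokens, ngram_range=(1, 2)):
    """All n-grams for n in the inclusive range (lo, hi)."""
    lo, hi = ngram_range
    return [" ".join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]

print(ngrams(["admin", "data", "research"]))
# ['admin', 'data', 'research', 'admin data', 'data research']
```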

### min_df
type: int or float
This is a way of filtering out less important words that don't appear in enough responses. `min_df` can either be a float value (e.g. 0.1), in which case it will be interpreted as a proportion, or an integer value (e.g. 2), where it will be interpreted as a number of responses.
So 0.1 would mean that a word needs to appear in at least 10% of the corpus to get through, and 2 would mean that it needs to appear in at least 2 documents.

### max_df
type: int or float
Similar to `min_df`, `max_df` is a way of filtering out words, but this time the more common ones. This field also takes floats and integers, interpreting them as proportions and absolute numbers respectively. So 0.9 would stop words appearing in more than 90% of documents from making their way through, and 100 would stop words that appear in more than 100 documents.
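The two thresholds can be sketched together. This illustrates the rule scikit-learn's vectorizers apply internally; the helper and corpus are made up for the example:

```python
def df_filter(docs, min_df=2, max_df=0.9):
    """Keep terms whose document frequency lies within [min_df, max_df].
    Integers are absolute document counts; floats are proportions."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for term in set(doc.split()):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    lo = min_df if isinstance(min_df, int) else min_df * n_docs
    hi = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(t for t, df in doc_freq.items() if lo <= df <= hi)

docs = ["admin data", "admin data policy", "admin research", "admin policy"]
# min_df=2 drops 'research' (1 doc); max_df=0.9 drops 'admin' (all 4 docs)
print(df_filter(docs))  # ['data', 'policy']
```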

### n_topics
type: int
This is the number of topics to attempt to model in the topic modelling, it must be an integer value.

### n_top_words
type: int
This is the number of top words to include in the modelling, it must be an integer value.

### max_iter
type: dictionary
This option breaks down further into `lda` and `nmf`, which are both integers. This setting relates to the number of iterations for the models to run through in order to move towards convergence. You may need to adjust these separately depending on model performance.

### lowercase
type: boolean
A switch setting for parsing words as lowercase or leaving them in their unadjusted form.

### topic_labels
type: dictionary
Again this one breaks down further into `lda` and `nmf`, as it is likely that after you have run the models you may wish to add specific topic labels for the plots you are generating. These can either be null or a list of strings. If you are setting labels, you must ensure there are the same number of labels as there are `n_topics`, otherwise the system will throw an error.
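The length check described above can be sketched as follows (a hypothetical helper, named here for illustration only):

```python
def validate_topic_labels(n_topics, topic_labels):
    """Raise if a model's labels are set but don't match n_topics."""
    for model, labels in topic_labels.items():
        if labels is not None and len(labels) != n_topics:
            raise ValueError(
                f"{model}: expected {n_topics} labels, got {len(labels)}"
            )

# Matches the example config above: null labels pass, lists must have
# exactly n_topics entries.
validate_topic_labels(3, {"lda": None, "nmf": ["Admin Data", "Research", "Policy"]})
```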
2 changes: 2 additions & 0 deletions requirements.txt
@@ -2,12 +2,14 @@ matplotlib==3.7.1
nltk==3.8.1
numpy==1.25.0
pandas==2.0.2
pyspellchecker==0.7.2
pytest==7.3.2
PyYAML==6.0
rapidfuzz==3.1.1
scikit_learn==1.2.2
scipy==1.11.1
schema==0.7.5
setuptools==67.6.1
spacy==3.6.0
textblob==0.17.1
24 changes: 0 additions & 24 deletions src/config.yaml

This file was deleted.

4 changes: 4 additions & 0 deletions src/general.yaml
@@ -0,0 +1,4 @@
raw_data_path: "data/raw/20230717_consultation_ingest.csv" #str
additional_stopwords: #list of words to filter; must be type str
- 'he'
lemmatize: True #bool; select False to use Stemmer
16 changes: 16 additions & 0 deletions src/models.yaml
@@ -0,0 +1,16 @@
# See docs/user_guide/README.md for instructions on how to set
# these configuration settings.
qu_12:
max_features: null #null converts to None, or int value
ngram_range: !!python/tuple [1,2]
min_df: 2 #float (proportion) or int (count)
max_df: 0.9 #float (proportion) or int (count)
n_topics: 3 #int greater than 0
n_top_words: 10 #int
max_iter:
lda: 25 #int
nmf: 1000 #int
lowercase: true #bool
topic_labels: #if not null, must be same number as n_topics in each
lda: null
nmf: null
131 changes: 0 additions & 131 deletions src/modules/analysis.py

This file was deleted.

