Fixes on tokeniser, normalisation, qualifiers and CI #329

Thomzoy · 2024-10-11T18:11:39Z

Description

Regarding tokenization:

In texts, long words can be split across two lines with a "-". This can impede matching: dia-\nbete won't be matched by a simple "diabete" regex. To this end:

The EDS.Tokenizer now treats -\n as a token by itself
The eds.pollution can tag this token as to-be-discarded
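
A minimal sketch of the matching problem, using only Python's `re` module rather than the EDS-NLP tokenizer or `eds.pollution`. The pattern `INTRAWORD_LINEBREAK` and the helper `strip_intraword_linebreaks` are assumptions made for illustration; they are not the implementation in this PR.

```python
import re

text = "Patient avec dia-\nbete de type 2"

# A plain regex misses the word because of the intraword line break.
assert re.search(r"diabete", text) is None

# Hypothetical pattern: a hyphen + newline sitting between two word characters.
INTRAWORD_LINEBREAK = re.compile(r"(?<=\w)-\n(?=\w)")


def strip_intraword_linebreaks(raw: str) -> str:
    """Discard the "-\\n" sequence so that the split word is whole again."""
    return INTRAWORD_LINEBREAK.sub("", raw)


cleaned = strip_intraword_linebreaks(text)
match = re.search(r"diabete", cleaned)
```

A hyphen followed by a line break can also belong to a genuine compound word, so discarding it unconditionally is a simplification.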

Regarding `ignore_space_tokens`

With ignore_space_tokens=True, edsnlp.utils.doc_to_text.get_text (which is used under the hood by e.g. the regex matcher) removes line breaks, which can be problematic in texts containing enumerations without trailing spaces. E.g., get_text("Tabac\nAlcool\nSport", "TEXT", ignore_space_tokens=True) would output "TabacAlcoolSport".

Now, we replace this \n with a space when necessary
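
The idea can be sketched without spaCy. The code below models tokens as hypothetical `(text, trailing_whitespace)` pairs and is not the actual `get_text` code; `join_tokens` is an illustrative name. When a space token is skipped, a single space is kept only if the text built so far does not already end with one.

```python
from typing import List, Tuple

# (token text, trailing whitespace): a simplified stand-in for spaCy tokens
Token = Tuple[str, str]


def join_tokens(tokens: List[Token], ignore_space_tokens: bool = True) -> str:
    parts: List[str] = []
    for text, trailing in tokens:
        if ignore_space_tokens and text.isspace():
            # The space token is dropped, but adjacent words must stay separated.
            if parts and not parts[-1].endswith(" "):
                parts.append(" ")
            continue
        parts.append(text + trailing)
    return "".join(parts)


# "Tabac\nAlcool\nSport": no token carries trailing whitespace
enumeration = [("Tabac", ""), ("\n", ""), ("Alcool", ""), ("\n", ""), ("Sport", "")]
result = join_tokens(enumeration)
```

With these inputs, `result` keeps the three words apart instead of gluing them into a single string.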

Regarding the status mapping of behavior/disorder pipes

For entities matched by those pipes, there is:

A _.status attribute, by default set to 1, but that can take the value 2
A _.detailed_status attribute, which is actually a getter that uses a mapping dictionary to get the human-readable status

When loading already-annotated docs, it can occur that a status is automatically set to None. To avoid a KeyError, we now handle this status=None case.
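
A minimal sketch of a getter that tolerates a missing status. The mapping contents and the name `detailed_status` are assumptions for illustration; the real pipes define their own mappings and register the getter as a span extension.

```python
from typing import Dict, Optional

# Hypothetical mapping: the labels are illustrative, not those of a specific pipe.
STATUS_MAPPING: Dict[int, str] = {1: "PRESENT", 2: "SEVERE"}


def detailed_status(status: Optional[int]) -> Optional[str]:
    """Return the human-readable status, or None when the status is unset."""
    if status is None:
        return None
    return STATUS_MAPPING[status]


labels = [detailed_status(s) for s in (1, 2, None)]
```

Handling `None` explicitly, rather than using `dict.get` for every key, still lets an unexpected integer status raise a `KeyError` instead of being silently ignored.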

Regarding CI

ubuntu-latest doesn't support Python 3.7 anymore, so we should use ubuntu-22

Checklist

If this PR is a bug fix, the bug is documented in the test suite.
Changes were documented in the changelog (pending section).
If necessary, changes were made to the documentation (eg new pipeline).

github-actions · 2024-10-11T18:36:44Z

Coverage Report

Name	Stmts	Miss	∆ Miss	Cover
edsnlp/pipes/qualifiers/base.py	53	2	1	96.23%
TOTAL	9450	211	1	97.77%

edsnlp/pipes/qualifiers/base.py

New missing coverage at line 178!

     def process(self, doc: Doc) -> BaseQualifierResults:
-         doc = self.ensure_doc(doc)
         raise NotImplementedError

Was already missing at line 182

     def __call__(self, doc: Doc) -> Doc:
-         results = self.process(doc)
         raise NotImplementedError(f"{type(results)} should be used to tag the document")

Files without new missing coverage

Name	Stmts	Miss	Cover
edsnlp/utils/span_getters.py Was already missing at lines 52-55 else: - for span in candidates: - if span.label_ in span_filter: - yield span Was already missing at lines 59-61 if span_getter is None: - yield doc[:], None - return if callable(span_getter): Was already missing at lines 62-64 if callable(span_getter): - yield from span_getter(doc) - return for key, span_filter in span_getter.items(): Was already missing at line 66 if key == "*": - candidates = ( (span, group) for group in doc.spans.values() for span in group Was already missing at lines 75-78 else: - for span, group in candidates: - if span.label_ in span_filter: - yield span, group Was already missing at line 82 if callable(span_setter): - span_setter(doc, matches) else: Was already missing at line 124 elif isinstance(v, str): - new_value[k] = [v] elif isinstance(v, list) and all(isinstance(i, str) for i in v): Was already missing at line 162 elif isinstance(v, str): - new_value[k] = [v] elif isinstance(v, list) and all(isinstance(i, str) for i in v):	153	14	90.85%
edsnlp/utils/resources.py Was already missing at line 33 if not verbs: - return conjugated_verbs	24	1	95.83%
edsnlp/utils/numbers.py Was already missing at line 34 else: - string = s string = string.lower().strip() Was already missing at lines 38-41 return int(string) - except ValueError: - parsed = DIGITS_MAPPINGS.get(string, None) - return parsed	16	4	75.00%
edsnlp/utils/lazy_module.py Was already missing at line 46 ): - continue for import_node in node.body:	31	1	96.77%
edsnlp/utils/filter.py Was already missing at line 206 if isinstance(label, int): - return [span for span in spans if span.label == label] else:	74	1	98.65%
edsnlp/utils/bindings.py Was already missing at line 22 return "." + path - return path	66	1	98.48%
edsnlp/train.py Was already missing at line 190 else: - sample_len = lambda idx, noise=True: 1 # noqa: E731 Was already missing at lines 257-263 if total + num_tokens > self.grad_accumulation_max_tokens: - print( ... - mini_batches.append([]) total += num_tokens Was already missing at line 349 if 0 <= self.limit <= count: - break if not (len(doc) and (filter_fn is None or filter_fn(doc))): Was already missing at line 351 if not (len(doc) and (filter_fn is None or filter_fn(doc))): - continue count += 1 Was already missing at lines 385-387 for ent in doc.ents: - for token in ent: - token.is_sent_start = False for sent in doc.sents if doc.has_annotation("SENT_START") else (doc[:],):	257	8	96.89%
edsnlp/processing/spark.py Was already missing at line 51 getActiveSession = SparkSession.getActiveSession - except AttributeError:	43	1	97.67%
edsnlp/processing/multiprocessing.py Was already missing at lines 227-231 if os.environ.get("TORCH_SHARING_STRATEGY"): - try: - torch.multiprocessing.set_sharing_strategy(os.environ["TORCH_SHARING_STRATEGY"]) - except NameError: - pass Was already missing at line 249 def save_align_devices_hook(pickler: Any, obj: Any): - pickler.save_reduce(load_align_devices_hook, (obj.__dict__,), obj=obj) Was already missing at lines 252-259 def load_align_devices_hook(state): - state["execution_device"] = MAP_LOCATION ... - AlignDevicesHook = None Was already missing at line 452 - new_batch_iterator = None Was already missing at lines 570-572 else: - batch = gpu_pipe.prepare_batch(docs, device=device) - inputs = None active_batches[batch_id] = (docs, task_id, inputs) Was already missing at line 949 if isinstance(outputs, BaseException): - raise outputs Was already missing at line 1017 if v is not None: - os.environ[k] = v	420	16	96.19%
edsnlp/processing/deprecated_pipe.py Was already missing at lines 207-209 def converter(doc): - res = results_extractor(doc) - return ( [{"note_id": doc._.note_id, **row} for row in res]	57	2	96.49%
edsnlp/pipes/trainable/span_linker/span_linker.py Was already missing at lines 401-403 if self.reference_mode == "synonym": - embeds = embeds.to(new_lin.weight) - new_lin.weight.data = embeds else:	172	2	98.84%
edsnlp/pipes/trainable/ner_crf/ner_crf.py Was already missing at line 250 if self.labels is not None and not self.infer_span_setter: - return Was already missing at lines 258-260 if callable(self.target_span_getter): - for span in get_spans(doc, self.target_span_getter): - inferred_labels.add(span.label_) else:	157	3	98.09%
edsnlp/pipes/trainable/layers/crf.py Was already missing at line 21 # out: 2 * N * O - return (log_A.unsqueeze(-1) + log_B.unsqueeze(-3)).logsumexp(-2) Was already missing at line 29 # out: 2 * N * O - return (log_A.unsqueeze(-1) + log_B.unsqueeze(-3)).max(-2) Was already missing at line 97 if learnable_transitions: - self.transitions = torch.nn.Parameter( torch.zeros_like(forbidden_transitions, dtype=torch.float) Was already missing at line 107 if learnable_transitions and with_start_end_transitions: - self.start_transitions = torch.nn.Parameter( torch.zeros(num_tags, dtype=torch.float) Was already missing at line 116 if learnable_transitions and with_start_end_transitions: - self.end_transitions = torch.nn.Parameter( torch.zeros(num_tags, dtype=torch.float)	137	5	96.35%
edsnlp/pipes/trainable/embeddings/transformer/transformer.py Was already missing at line 165 if quantization is not None: - kwargs["quantization_config"] = quantization	157	1	99.36%
edsnlp/pipes/qualifiers/reported_speech/reported_speech.py Was already missing at lines 24-28 return "REPORTED" - elif token._.rspeech is False: - return "DIRECT" - else: - return None	100	3	97.00%
edsnlp/pipes/qualifiers/negation/negation.py Was already missing at line 28 else: - return None	100	1	99.00%
edsnlp/pipes/qualifiers/hypothesis/hypothesis.py Was already missing at line 27 else: - return None	97	1	98.97%
edsnlp/pipes/qualifiers/history/history.py Was already missing at lines 26-32 def history_getter(token: Union[Token, Span]) -> Optional[str]: - if token._.history is True: - return "ATCD" - elif token._.history is False: - return "CURRENT" - else: - return None Was already missing at lines 338-344 ) - except ValueError: ... - note_datetime = None Was already missing at lines 353-359 ) - except ValueError: ... - birth_datetime = None Was already missing at lines 425-428 ) - except ValueError as e: - absolute_date = None - logger.warning( "In doc {}, the following date {} raises this error: {}. "	178	14	92.13%
edsnlp/pipes/qualifiers/family/family.py Was already missing at line 27 else: - return None	82	1	98.78%
edsnlp/pipes/ner/tnm/tnm.py Was already missing at lines 156-158 value = TNM.parse_obj(groupdict) - except ValidationError: - value = TNM.parse_obj({})	44	2	95.45%
edsnlp/pipes/ner/tnm/model.py Was already missing at line 139 def __str__(self): - return self.norm() Was already missing at line 163 ) - exclude_unset = skip_defaults	104	2	98.08%
edsnlp/pipes/ner/scores/sofa/sofa.py Was already missing at line 32 if not assigned: - continue if assigned.get("method_max") is not None: Was already missing at line 40 else: - method = "Non précisée"	25	2	92.00%
edsnlp/pipes/ner/scores/elston_ellis/patterns.py Was already missing at line 26 if x <= 5: - return 1 Was already missing at lines 32-36 else: - return 3 - - except ValueError: - return None	21	4	80.95%
edsnlp/pipes/ner/scores/charlson/patterns.py Was already missing at lines 21-23 return int(extracted_score) - except ValueError: - return None	13	2	84.62%
edsnlp/pipes/ner/scores/base_score.py Was already missing at line 154 if value is None: - continue normalized_value = self.score_normalization(value)	47	1	97.87%
edsnlp/pipes/ner/disorders/solid_tumor/solid_tumor.py Was already missing at lines 121-124 if use_tnm: - from edsnlp.pipes.ner.tnm import TNM - - self.tnm = TNM(nlp, pattern=None, attr="TEXT") Was already missing at lines 126-136 def process_tnm(self, doc): - spans = self.tnm.process(doc) ... - yield span Was already missing at line 156 if self.use_tnm: - yield from self.process_tnm(doc)	37	12	67.57%
edsnlp/pipes/ner/disorders/peripheral_vascular_disease/peripheral_vascular_disease.py Was already missing at line 107 if "peripheral" not in span._.assigned.keys(): - continue	15	1	93.33%
edsnlp/pipes/ner/disorders/diabetes/diabetes.py Was already missing at line 131 # Mostly FP - continue Was already missing at line 134 elif self.has_far_complications(span): - span._.status = 2 Was already missing at line 146 if next(iter(self.complication_matcher(context)), None) is not None: - return True return False	31	3	90.32%
edsnlp/pipes/ner/disorders/connective_tissue_disease/connective_tissue_disease.py Was already missing at line 103 # Huge change of FP / Title section - continue	14	1	92.86%
edsnlp/pipes/ner/disorders/ckd/ckd.py Was already missing at lines 120-123 dfg_value = float(dfg_span.text.replace(",", ".").strip()) - except ValueError: - logger.trace(f"DFG value couldn't be extracted from {dfg_span.text}") - return False	29	3	89.66%
edsnlp/pipes/ner/disorders/cerebrovascular_accident/cerebrovascular_accident.py Was already missing at lines 111-113 if span._.source == "ischemia": - if "brain" not in span._.assigned.keys(): - continue	17	2	88.24%
edsnlp/pipes/ner/adicap/models.py Was already missing at line 15 def norm(self) -> str: - return self.code Was already missing at line 18 def __str__(self): - return self.norm()	14	2	85.71%
edsnlp/pipes/misc/sections/sections.py Was already missing at line 126 if sections is None: - sections = patterns.sections sections = dict(sections)	45	1	97.78%
edsnlp/pipes/misc/quantities/quantities.py Was already missing at lines 147-149 def __getitem__(self, item: int): - assert isinstance(item, int) - return [self][item] Was already missing at lines 160-163 def __eq__(self, other: Any): - if isinstance(other, SimpleQuantity): - return self.convert_to(other.unit) == other.value - return False Was already missing at line 166 if other.unit == self.unit: - return self.__class__(self.value + other.value, self.unit, self.registry) return self.__class__( Was already missing at line 193 return self.convert_to(other_unit) - except KeyError: raise AttributeError(f"Unit {other_unit} not found") Was already missing at line 198 def verify(cls, ent): - return True Was already missing at line 237 def __lt__(self, other: Union[SimpleQuantity, "RangeQuantity"]): - return max(self.convert_to(other.unit)) < min((part.value for part in other)) Was already missing at line 248 return self.convert_to(other.unit) == other.value - return False Was already missing at line 262 def verify(cls, ent): - return True Was already missing at line 861 if snippet.end != last and doclike.doc[last: snippet.end].text.strip() == "": - pseudo.append("w") pseudo = "".join(pseudo) Was already missing at line 1042 if start_line is None: - continue Was already missing at lines 1073-1075 unit_norm = self.unit_followers[unit_before.label_] - except (KeyError, AttributeError, IndexError): - pass Was already missing at line 1118 ): - ent = doc[unit_text.start: number.end] else: Was already missing at lines 1125-1127 dims = self.unit_registry.parse_unit(unit_norm)[0] - except KeyError: - continue Was already missing at lines 1233-1235 last._.set(last.label_, new_value) - except (AttributeError, TypeError): - merged.append(ent) else:	432	20	95.37%
edsnlp/pipes/misc/dates/models.py Was already missing at line 152 else: - d["month"] = note_datetime.month if self.day is None: Was already missing at lines 156-162 else: - if self.year is None: ... - d["day"] = default_day Was already missing at lines 170-172 return dt - except ValueError: - return None Was already missing at line 188 else: - return None Was already missing at line 204 if self.second: - norm += f"{self.second:02}s"	196	11	94.39%
edsnlp/pipes/misc/dates/dates.py Was already missing at line 248 if isinstance(absolute, str): - absolute = [absolute] if isinstance(relative, str): Was already missing at line 250 if isinstance(relative, str): - relative = [relative] if isinstance(duration, str): Was already missing at line 252 if isinstance(duration, str): - relative = [duration] if isinstance(false_positive, str): Was already missing at lines 356-365 if self.merge_mode == "align": - alignments = align_spans(matches, spans, sort_by_overlap=True) ... - matches.append(span) Was already missing at line 450 elif d1 in seen or v1.bound is None or v2.bound is None: - continue Was already missing at lines 461-463 if v1.mode == Mode.DURATION: - m1 = Bound.FROM if v2.bound == Bound.UNTIL else Bound.UNTIL - m2 = v2.mode or Bound.FROM elif v2.mode == Mode.DURATION:	152	15	90.13%
edsnlp/pipes/misc/consultation_dates/consultation_dates.py Was already missing at line 131 else: - self.date_matcher = None Was already missing at line 134 if not consultation_mention: - consultation_mention = [] elif consultation_mention is True:	48	2	95.83%
edsnlp/pipes/core/normalizer/__init__.py Was already missing at line 7 def excluded_or_space_getter(t): - return t.is_space or t.tag_ == "EXCLUDED"	5	1	80.00%
edsnlp/pipes/core/endlines/endlines.py Was already missing at lines 155-159 if end_lines_model is None: - path = build_path(__file__, "base_model.pkl") - - with open(path, "rb") as inp: - self.model = pickle.load(inp) elif isinstance(end_lines_model, str): Was already missing at lines 162-164 self.model = pickle.load(inp) - elif isinstance(end_lines_model, EndLinesModel): - self.model = end_lines_model else: Was already missing at line 195 ): - return "ENUMERATION" Was already missing at line 282 if np.isnan(sigma): - sigma = 1	87	7	91.95%
edsnlp/pipes/core/contextual_matcher/models.py Was already missing at lines 19-23 if isinstance(v, list): - assert ( - len(v) == 2 - ), "`window` should be a tuple/list of two integer, or a single integer" - v = tuple(v) if isinstance(v, int):	115	2	98.26%
edsnlp/pipes/core/contextual_matcher/contextual_matcher.py Was already missing at line 94 ) - label = label_name if label is None: Was already missing at line 343 if assigned is None: - continue if replace_entity:	143	2	98.60%
edsnlp/patch_spacy.py Was already missing at lines 67-69 # if module is reloaded. - existing_func = registry.factories.get(internal_name) - if not util.is_same_func(factory_func, existing_func): raise ValueError(	31	2	93.55%
edsnlp/optimization.py Was already missing at line 32 def param_groups(self, value): - self.optim.param_groups = value Was already missing at line 36 def state(self): - return self.optim.state Was already missing at line 40 def state(self, value): - self.optim.state = value Was already missing at line 89 def __init__(self, groups): - self.param_groups = groups	77	4	94.81%
edsnlp/matchers/simstring.py Was already missing at line 280 if custom: - attr = attr[1:].lower() Was already missing at line 295 if custom: - token_text = getattr(token._, attr) else:	146	2	98.63%
edsnlp/language.py Was already missing at line 103 if last != begin: - logger.warning( "Missed some characters during"	51	1	98.04%
edsnlp/data/standoff.py Was already missing at line 43 def __init__(self, ann_file, line): - super().__init__(f"File {ann_file}, unrecognized Brat line {line}") Was already missing at line 83 if not len(ann_paths): - return { "text": text, Was already missing at line 197 ) - except Exception: raise Exception(	172	3	98.26%
edsnlp/data/polars.py Was already missing at line 26 if hasattr(data, "collect"): - data = data.collect() assert isinstance(data, pl.DataFrame)	44	1	97.73%
edsnlp/data/json.py Was already missing at line 94 if not is_jsonl: - obj[FILENAME] = filename results.append(obj) Was already missing at line 96 results.append(obj) - except Exception: raise Exception(f"Cannot parse {filename}")	107	2	98.13%
edsnlp/data/converters.py Was already missing at line 659 if isinstance(converter, type) or kwargs_to_init: - return converter(**kwargs), {} return converter, validate_kwargs(converter, kwargs)	192	1	99.48%
edsnlp/data/base.py Was already missing at lines 174-180 """ - data = LazyCollection.ensure_lazy(data) - if converter: - converter, kwargs = get_doc2dict_converter(converter, kwargs) - data = data.map(converter, kwargs=kwargs) - - return data	39	5	87.18%
edsnlp/core/torch_component.py Was already missing at line 390 if hasattr(self, "compiled"): - res = self.compiled(batch) else: Was already missing at line 436 """ - return self.preprocess(doc)	179	2	98.88%
edsnlp/core/registries.py Was already missing at line 78 if obj.error is not None: - raise obj.error	164	1	99.39%
edsnlp/core/pipeline.py Was already missing at line 552 if name in exclude: - continue if name not in components:	410	1	99.76%
edsnlp/core/lazy_collection.py Was already missing at line 51 def __call__(self, args, kwargs): - return self.forward(args, *kwargs) Was already missing at line 448 for name, pipe, _ in self.torch_components(): - pipe.to(device) return self	151	2	98.68%
edsnlp/connectors/omop.py Was already missing at line 69 if not isinstance(row.ents, list): - continue Was already missing at line 87 else: - doc.spans[span.label_].append(span) Was already missing at line 127 if df.note_id.isna().any(): - df["note_id"] = range(len(df)) Was already missing at line 171 if i > 0: - df.term_modifiers += ";" df.term_modifiers += ext + "=" + df[ext].astype(str)	84	4	95.24%

263 files skipped due to complete coverage.

Coverage failure: total of 97.77% is less than 97.78% ❌
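
The figures in the report can be checked by hand. The sketch below assumes that coverage is computed as (Stmts - Miss) / Stmts and rounded to two decimals; the helper name `coverage_percent` is illustrative and the actual tool may compute it differently.

```python
def coverage_percent(stmts: int, miss: int) -> float:
    """Share of covered statements, in percent, rounded to two decimals."""
    return round(100 * (stmts - miss) / stmts, 2)


total = coverage_percent(9450, 211)               # TOTAL row of the report
base_py = coverage_percent(53, 2)                 # edsnlp/pipes/qualifiers/base.py
without_new_miss = coverage_percent(9450, 210)    # TOTAL if the one new miss were covered
```

Under that assumption, covering the single newly missed statement would bring the total back to the 97.78% reported as the threshold.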

docs/pipes/ner/disorders/index.md

edsnlp/pipes/ner/behaviors/alcohol/alcohol.py

edsnlp/pipes/ner/disorders/base.py

edsnlp/pipes/qualifiers/base.py

edsnlp/pipes/qualifiers/family/family.py

tests/pipelines/ner/disorders/alcohol.py

tests/pipelines/ner/disorders/tobacco.py

tests/test_language.py

edsnlp/utils/doc_to_text.py

sonarcloud · 2024-11-14T15:15:23Z

Quality Gate passed

Issues
7 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.7% Duplication on New Code

See analysis details on SonarQube Cloud

Thomzoy requested a review from percevalw October 11, 2024 18:12

Thomzoy changed the title ~~Fix handling of Span in BaseQualifier.process~~ [DRAFT] Fix handling of Span in BaseQualifier.process Oct 11, 2024

Thomzoy force-pushed the hotfix_qualifier_process branch 3 times, most recently from 5f31166 to 4f90b63 Compare October 15, 2024 09:23

Thomzoy changed the title ~~[DRAFT] Fix handling of Span in BaseQualifier.process~~ Small fixes Oct 15, 2024

Thomzoy force-pushed the hotfix_qualifier_process branch from 4f90b63 to 585b9d2 Compare October 15, 2024 09:53

percevalw reviewed Oct 15, 2024

Thomzoy added 6 commits October 16, 2024 09:15

ci: use ubuntu-22 instead of latest to keep python37 compatibility

808c392

feat: handle linebreak inside words and linebreak without leading whitespace

543d5df

fix: handle case where status is None in behavior/disorder pipes

698a8df

fix: treat span as doc in Qualifier process method

827f4be

fix: small bug in alcohol and tobacco pipes

5d790d2

docs: add details for disorders and behavior pipes

c1cf750

Thomzoy force-pushed the hotfix_qualifier_process branch from 6852be5 to c1cf750 Compare October 16, 2024 07:47

chore: update changelog

51d5d71

Thomzoy changed the title ~~Small fixes~~ Fixes on tokeniser, normalisation, qualifiers and CI Oct 16, 2024

Thomzoy added 4 commits October 16, 2024 11:14

fix: update pattern for intraword linebreak

336c463

various changes

fa337bc

Merge branch 'hotfix_qualifier_process' of https://github.com/aphp/edsnlp into hotfix_qualifier_process

25b9294

fix yielding last span + allow limited repeat in dataloader

464cd97

percevalw force-pushed the master branch 4 times, most recently from 2038fb9 to 232ca91 Compare November 4, 2024 21:23

continue

514157c
