docs: add details for disorders and behavior pipes
Thomzoy committed Oct 16, 2024
1 parent 5d790d2 commit c1cf750
Showing 5 changed files with 90 additions and 149 deletions.
2 changes: 2 additions & 0 deletions docs/pipes/ner/behaviors/alcohol.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,7 @@
# Alcohol consumption {: #edsnlp.pipes.ner.behaviors.alcohol.factory.create_component }

--8<-- "docs/pipes/ner/disorders/warning.md"

::: edsnlp.pipes.ner.behaviors.alcohol.factory.create_component
options:
heading_level: 2
97 changes: 2 additions & 95 deletions docs/pipes/ner/behaviors/index.md
@@ -2,99 +2,6 @@

## Presentation

EDS-NLP offers two components to extract behavioral patterns, namely the tobacco and alcohol consumption status. Each component is based on the ContextualMatcher component.
Some general considerations about those components:
EDS-NLP offers two components to extract behavioral patterns, namely the tobacco and alcohol consumption status. Each component is based on the [ContextualMatcher][edsnlp.pipes.core.contextual_matcher.ContextualMatcher] matcher, itself based on the `eds.contextual_matcher` component.

- Extracted entities are stored in `doc.ents` and `doc.spans`. For instance, the `eds.tobacco` component stores matches in `doc.spans["tobacco"]`.
- The matched comorbidity is also available under the `ent.label_` of each match.
- Matches have an associated `_.status` attribute taking the value `1` or `2`. A corresponding `_.detailed_status` attribute stores the human-readable status, which can be component-dependent. See each component's documentation for more details.
- Some components add additional information to matches. For instance, the `tobacco` component adds, if relevant, the extracted *pack-year* value (= *paquet-année*). This information is available under the `ent._.assigned` attribute.
- Those components work on **normalized** documents. Please use the `eds.normalizer` pipeline with the following parameters:
```{ .python .no-check }
nlp.add_pipe(
    eds.normalizer(
        accents=True,
        lowercase=True,
        quotes=True,
        spaces=True,
        pollution=dict(
            information=True,
            bars=True,
            biology=True,
            doctors=True,
            web=True,
            coding=True,
            footer=True,
        ),
    ),
)
```

!!! warning "Use qualifiers"
Those components **should be used with a qualification pipeline** to avoid extracting unwanted matches. At the very least, you can use the available rule-based qualifiers (`eds.negation`, `eds.hypothesis` and `eds.family`). Better still, a machine learning qualification component was developed and trained specifically for those components. For privacy reasons, the model isn't publicly available yet.

!!! aphp "Use the ML model"

The model will soon be available in the models catalogue of AP-HP's CDW.

## Usage

```{ .python .no-check }
import edsnlp, edsnlp.pipes as eds
nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.sentences())
nlp.add_pipe(
    eds.normalizer(
        accents=True,
        lowercase=True,
        quotes=True,
        spaces=True,
        pollution=dict(
            information=True,
            bars=True,
            biology=True,
            doctors=True,
            web=True,
            coding=True,
            footer=True,
        ),
    ),
)
nlp.add_pipe(eds.tobacco())
nlp.add_pipe(eds.diabetes())
text = """
Compte-rendu de consultation.
Je vois ce jour M. SCOTT pour le suivi de sa rétinopathie diabétique.
Le patient va bien depuis la dernière fois.
Je le félicite pour la poursuite de son sevrage tabagique (toujours à 10 paquet-année).
Sur le plan de son diabète, la glycémie est stable.
"""
doc = nlp(text)
doc.spans
# Out: {
# 'pollutions': [],
# 'tobacco': [sevrage tabagique (toujours à 10 paquet-année],
# 'diabetes': [rétinopathie diabétique, diabète]
# }
tobacco_matches = doc.spans["tobacco"]
tobacco_matches[0]._.detailed_status
# Out: "ABSTINENCE" #
tobacco_matches[0]._.assigned["PA"] # paquet-année
# Out: 10 # (1)
diabetes = doc.spans["diabetes"]
(diabetes[0]._.detailed_status, diabetes[1]._.detailed_status)
# Out: ('WITH_COMPLICATION', 'WITHOUT_COMPLICATION') # (2)
```

1. Here we see an example of additional information that can be extracted
2. Here we see the importance of document-level aggregation to extract the correct severity of each comorbidity.
--8<-- "docs/pipes/ner/disorders/presentation.md"
56 changes: 2 additions & 54 deletions docs/pipes/ner/disorders/index.md
@@ -2,58 +2,6 @@

## Presentation

The following components extract 16 different conditions from the [Charlson Comorbidity Index](https://www.rdplf.org/calculateurs/pages/charlson/charlson.html). Each component is based on the ContextualMatcher component.
The following components extract 16 different conditions from the [Charlson Comorbidity Index](https://www.rdplf.org/calculateurs/pages/charlson/charlson.html). Each component is based on the [ContextualMatcher][edsnlp.pipes.core.contextual_matcher.ContextualMatcher] matcher, itself based on the `eds.contextual_matcher` component.

The components were developed by AP-HP's Data Science team together with a team of medical experts, following the insights of the algorithm proposed by [@petitjean_2024].

Some general considerations about those components:

- Extracted entities are stored in `doc.ents` and `doc.spans`. For instance, the `eds.tobacco` component stores matches in `doc.spans["tobacco"]`.
- The matched comorbidity is also available under the `ent.label_` of each match.
- Matches have an associated `_.status` attribute taking the value `1` or `2`. A corresponding `_.detailed_status` attribute stores the human-readable status, which can be component-dependent. See each component's documentation for more details.
- Some components add additional information to matches. For instance, the `tobacco` component adds, if relevant, the extracted *pack-year* value (= *paquet-année*). This information is available under the `ent._.assigned` attribute.
- Those components work on **normalized** documents. Please use the `eds.normalizer` pipeline with the following parameters:

```{ .python .no-check }
import edsnlp, edsnlp.pipes as eds
...
nlp.add_pipe(
    eds.normalizer(
        accents=True,
        lowercase=True,
        quotes=True,
        spaces=True,
        pollution=dict(
            information=True,
            bars=True,
            biology=True,
            doctors=True,
            web=True,
            coding=True,
            footer=True,
        ),
    ),
)
```

!!! warning "Use qualifiers"
Those components **should be used with a qualification pipeline** to avoid extracting unwanted matches. At the very least, you can use the available rule-based qualifiers (`eds.negation`, `eds.hypothesis` and `eds.family`). Better still, a machine learning qualification component was developed and trained specifically for those components. For privacy reasons, the model isn't publicly available yet.

!!! aphp "Use the ML model"

The model will soon be available in the models catalogue of AP-HP's CDW.

!!! tip "On the medical definition of the comorbidities"

Those components were developed to extract **chronic** and **symptomatic** conditions only.

## Aggregation

For relevant phenotyping, matches should be aggregated at the document level. For instance, a document might mention a complicated diabetes at the beginning ("*Le patient a une rétinopathie diabétique*"), and later refer to this diabetes without mentioning that it is complicated ("*Concernant son diabète, le patient ...*").
Thus, a simple and effective aggregation rule is, for each comorbidity, to:

- disregard all entities tagged as irrelevant by the qualification component(s)
- take the maximum (i.e., the most severe) status of the leftover entities

An implementation of this rule is presented [here][aggregating-results].
--8<-- "docs/pipes/ner/disorders/presentation.md"
77 changes: 77 additions & 0 deletions docs/pipes/ner/disorders/presentation.md
@@ -0,0 +1,77 @@
The components were developed by AP-HP's Data Science team together with a team of medical experts, following the insights of the algorithm proposed by [@petitjean_2024].

Some general considerations about those components:

- Extracted entities are stored in `doc.ents` and `doc.spans`. For instance, the `eds.tobacco` component stores matches in `doc.spans["tobacco"]`.
- The matched comorbidity is also available under the `ent.label_` of each match.
- Matches have an associated `_.status` attribute taking the value `1` or `2`. A corresponding `_.detailed_status` attribute stores the human-readable status, which can be component-dependent. See each component's documentation for more details.
- Some components add additional information to matches. For instance, the `tobacco` component adds, if relevant, the extracted *pack-year* value (= *paquet-année*). This information is available under the `ent._.assigned` attribute.
- Those components work on **normalized** documents. Please use the `eds.normalizer` pipeline (see [Usage](#usage) below).

--8<-- "docs/pipes/ner/disorders/warning.md"

!!! warning "Use qualifiers"
Those components **should be used with a qualification pipeline** to avoid extracting unwanted matches. At the very least, you should use the available rule-based qualifiers (`eds.negation`, `eds.hypothesis` and `eds.family`). Better still, a machine learning qualification component was developed and trained specifically for those components. For privacy reasons, the model isn't publicly available yet.

!!! aphp "Use the ML model"

For projects working on AP-HP's CDW, this model is available via its models catalogue.

## Usage

```{ .python .no-check }
import edsnlp, edsnlp.pipes as eds
nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.sentences())
nlp.add_pipe(
    eds.normalizer(
        accents=True,
        lowercase=True,
        quotes=True,
        spaces=True,
        pollution=dict(
            biology=True,  # (1)
            coding=True,  # (2)
        ),
    ),
)
nlp.add_pipe(eds.tobacco())
nlp.add_pipe(eds.diabetes())
text = """
Compte-rendu de consultation.
Je vois ce jour M. SCOTT pour le suivi de sa rétinopathie diabétique.
Le patient va bien depuis la dernière fois.
Je le félicite pour la poursuite de son sevrage tabagique (toujours à 10 paquet-année).
Sur le plan de son diabète, la glycémie est stable.
"""
doc = nlp(text)
doc.spans
# Out: {
# 'pollutions': [],
# 'tobacco': [sevrage tabagique (toujours à 10 paquet-année],
# 'diabetes': [rétinopathie diabétique, diabète]
# }
tobacco_matches = doc.spans["tobacco"]
tobacco_matches[0]._.detailed_status
# Out: "ABSTINENCE" #
tobacco_matches[0]._.assigned["PA"] # paquet-année
# Out: 10 # (3)
diabetes = doc.spans["diabetes"]
(diabetes[0]._.detailed_status, diabetes[1]._.detailed_status)
# Out: ('WITH_COMPLICATION', 'WITHOUT_COMPLICATION') # (4)
```

1. This will discard mentions of biology results, which often lead to false positives.
2. This will discard mentions of ICD-10 codes, which sometimes appear at the end of clinical documents.
3. Here we see an example of additional information that can be extracted.
4. Here we see the importance of document-level aggregation to extract the correct severity of each comorbidity.
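
The document-level aggregation mentioned in note 4 can be sketched in plain Python. This is only an illustration under stated assumptions, not EDS-NLP's own implementation: the `Match` class, its field names and the `aggregate` helper are hypothetical stand-ins for the entities and qualifier flags, and the status values (`2` for a complicated diabetes, `1` otherwise) are assumed for the sake of the example. The rule it follows is the one described in this documentation: drop entities flagged by a qualifier, then keep the maximum status per comorbidity.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """Hypothetical stand-in for an extracted entity and its qualifier flags."""

    label: str
    status: int
    negation: bool = False
    hypothesis: bool = False
    family: bool = False


def aggregate(matches):
    """Return {label: highest status} over matches not flagged by any qualifier."""
    result = {}
    for match in matches:
        # Disregard entities tagged as irrelevant by a qualifier
        if match.negation or match.hypothesis or match.family:
            continue
        # Keep the maximum (most severe) status seen so far for this label
        result[match.label] = max(result.get(match.label, 0), match.status)
    return result


matches = [
    Match("diabetes", status=2),  # e.g. "rétinopathie diabétique" (assumed status)
    Match("diabetes", status=1),  # e.g. "diabète" (assumed status)
    Match("tobacco", status=1, negation=True),  # hypothetical negated mention
]
aggregated = aggregate(matches)
```

With these assumed inputs, the two diabetes mentions collapse into a single document-level value carrying the most severe status, and the negated tobacco mention is ignored.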
7 changes: 7 additions & 0 deletions docs/pipes/ner/disorders/warning.md
@@ -0,0 +1,7 @@
!!! danger "On overlapping entities"
When using multiple disorders or behavior pipelines, some entities may be extracted by several pipes. For instance:

* "Intoxication éthylotabagique" will be tagged both by `eds.tobacco` and `eds.alcohol`
* "Cirrhose alcoolique" will be tagged both by `eds.liver_disease` and `eds.alcohol`

As `doc.ents` discards overlapping entities, you should use `doc.spans` instead.
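
To make the effect concrete, here is a small plain-Python sketch of what an overlap filter does. It is an assumption-laden illustration rather than the library's actual code: the `Span` tuple, the token offsets and the `filter_overlaps` helper are hypothetical, and the "longest span wins" policy is only one common way of resolving overlaps.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical span: token offsets (end exclusive) and a label."""

    start: int
    end: int
    label: str


def filter_overlaps(spans):
    """Keep the longest spans first and drop any span overlapping a kept one."""
    kept = []
    taken = set()
    for span in sorted(spans, key=lambda s: (-(s.end - s.start), s.start)):
        tokens = set(range(span.start, span.end))
        if taken.isdisjoint(tokens):
            kept.append(span)
            taken |= tokens
    return sorted(kept, key=lambda s: s.start)


# Hypothetical matches on the two tokens of "Cirrhose alcoolique"
spans_by_pipe = {
    "liver_disease": [Span(0, 2, "liver_disease")],
    "alcohol": [Span(1, 2, "alcohol")],
}
all_spans = [span for group in spans_by_pipe.values() for span in group]

# An ents-like view keeps only non-overlapping spans, so one match is lost
ents_like = filter_overlaps(all_spans)
```

In this sketch the per-pipe groups still hold both matches, while the filtered list keeps only the longer one. That is why reading from the span groups is the safer choice when several of these pipes are combined.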
