DataTools4Heart Project         AI4HF Project

PubScience

Tip

This library comes with zero guarantees. If you want to see improvements, please open a GitHub issue or contribute with a pull request. If you want to collaborate scientifically, please send an email.

Repository for public-article extraction and mining.

Multiple components:

  • Select (S): use APIs to connect with third-party data sources
  • Retrieve (S): pull text data from arXiv/bioRxiv/medRxiv or PubMed/PMC
  • Parse (S): process incoming XML/JSON/HTML/PDF/CSV into the desired format
  • Identify (S): find relevant text in generic corpora
  • Deduplicate (S): remove exact and mark approximate duplicates
  • Clean: clean the XML/JSON/etc. from the previous step and output cleaned text
  • Translate: translate the pruned/cleaned text to any target language
  • Anonymize: replace PII with placeholder terms
  • Share: make shareable through e.g. Huggingface
  • Augment: add paraphrasing
  • Synonimize: identify and replace typos
  • Deabbreviate: identify and expand abbreviations
  • Stats: extract corpus statistics

Here the (S) indicates that these functions should be callable in streaming mode. Especially for smaller domains, with limited storage capacity, we may not want to download terabytes of corpora before we start our higher-level processing functions.

Status (minimum working example):

Task                 In progress   Completed
Select & Retrieve        [ ]          [ ]
Parse                    [ ]          [ ]
Identify                 [ ]          [ ]
Deduplicate              [ ]          [ ]
Clean                    [x]          [ ]
Translate                [x]          [ ]
Anonymise                [x]          [ ]
Share                    [ ]          [ ]
Augment                  [ ]          [ ]
Synonimize               [ ]          [ ]
Deabbreviate             [ ]          [ ]
Stats                    [ ]          [ ]

Project descriptions

Here we give a bit more detail on the components.

Select & Retrieve: interfaces with APIs for S2ORC/PubMed/PMC/arXiv/medRxiv/bioRxiv/OAI and Huggingface.

The select function must be able to pull in data in streaming mode.

For Huggingface datasets this might be easy:

from datasets import load_dataset

# with streaming=True the dataset is an IterableDataset: records are fetched lazily,
# so the full corpus never has to be downloaded or held in memory
dataset = load_dataset('some/dataset', streaming=True)

Parse: parser to normalise incoming data into JSON/YAML or Huggingface dataset formats

Identify: functionality to identify medical texts in general corpora using supervised and self-supervised models

  • Use pre-trained supervised models to identify relevant documents or text sections
  • Use LLMs to identify relevant texts using in-context-learning
  • Use seed-texts in combination with bi-encoder and cross-encoder models to find texts that are semantically near the seeds (see the sketch after this list)
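
For the seed-text approach, a minimal sketch using sentence-transformers as the bi-encoder is given below; the model name, the example texts and the similarity threshold are illustrative assumptions, not fixed choices of this library.

from sentence_transformers import SentenceTransformer, util

# illustrative multilingual bi-encoder; any sentence-embedding model would do
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

seed_texts = ["Patient admitted with acute myocardial infarction.",
              "Echocardiography showed a reduced ejection fraction."]
candidates = ["The fox jumps over the fence.",
              "The ECG revealed atrial fibrillation with rapid ventricular response."]

seed_emb = model.encode(seed_texts, convert_to_tensor=True)
cand_emb = model.encode(candidates, convert_to_tensor=True)

# cosine similarity of every candidate against every seed text
scores = util.cos_sim(cand_emb, seed_emb)

# keep candidates whose best match against any seed exceeds a threshold (0.4 is arbitrary)
relevant = [text for text, row in zip(candidates, scores) if row.max() > 0.4]
print(relevant)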

The core aim ab initio is to ease the creation and dissemination of Dutch clinical NLP work (including corpora), but in principle this code is not limited to the Dutch language or the medical domain.

Deduplicate: remove exact duplicates and mark approximate duplicates. Following the Llama 3.1 recipe, we use

  • MinHash (see Broder); a sketch using the datasketch library follows below
  • Line-level deduplication: line-level frequency determination with a cut-off, and selective removal
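
A minimal sketch of the MinHash step using the datasketch library; the shingle size, number of permutations and Jaccard threshold are illustrative assumptions.

from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128, shingle_size=5):
    """Build a MinHash signature from word shingles of a document."""
    words = text.lower().split()
    m = MinHash(num_perm=num_perm)
    for i in range(max(1, len(words) - shingle_size + 1)):
        shingle = " ".join(words[i:i + shingle_size])
        m.update(shingle.encode("utf-8"))
    return m

docs = {"doc1": "the patient was admitted with chest pain and shortness of breath",
        "doc2": "the patient was admitted with chest pain and some shortness of breath",
        "doc3": "completely unrelated text about traffic regulations"}

# LSH index for approximate-duplicate detection at an assumed Jaccard threshold
lsh = MinHashLSH(threshold=0.8, num_perm=128)
signatures = {key: minhash(text) for key, text in docs.items()}
for key, sig in signatures.items():
    lsh.insert(key, sig)

# documents whose estimated Jaccard similarity with doc1 exceeds the threshold
print(lsh.query(signatures["doc1"]))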

Clean: remove noise, code/format artifacts, escape/remove quotes

  • duplicated n-gram coverage ratio (see Rae et al.) to identify error logs
  • encoding degarbling
  • file-format headers/endings
  • using fastText-based language detection, remove text sections in which the fraction of lines in another language exceeds a pre-set threshold, e.g. if >50% of the lines of a paragraph or document are non-English we remove that paragraph or document (see the sketch after this list)
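
A minimal sketch of the per-line language filter, assuming fastText's pre-trained lid.176.bin identification model has been downloaded; the 50% cut-off matches the example above.

import fasttext

# pre-trained language-identification model, downloaded separately from fasttext.cc
lid_model = fasttext.load_model("lid.176.bin")

def keep_paragraph(paragraph, target_lang="en", max_other_fraction=0.5):
    """Drop the paragraph if too many of its lines are in another language."""
    lines = [ln.strip() for ln in paragraph.splitlines() if ln.strip()]
    if not lines:
        return False
    labels = [lid_model.predict(ln)[0][0] for ln in lines]  # e.g. '__label__en'
    other = sum(1 for lab in labels if lab != f"__label__{target_lang}")
    return other / len(lines) <= max_other_fraction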

The core function here is to extract the text intended to be read.

Translate: translate the pruned/cleaned corpora to a target language using NMT models and translation APIs, optionally in combination with glossaries.

Anonymize: replace PII with placeholder terms using de-identification libraries and optional custom patterns.

Share: turn translated datasets into shareable datasets, including a dataset card, license, etc.

Augment: code to use paraphrasing for text generation

Synonimize: identify and replace typos, normalise variations of the same word

Deabbreviate: identify and expand abbreviations to reduce ambiguity

Stats: extract statistics from corpora, specifically: the number of tokens, number of sentences, number of documents, and vocabulary size
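
A minimal sketch of these corpus statistics with a naive whitespace tokenizer and punctuation-based sentence splitting; a real corpus would call for a proper tokenizer and sentence splitter.

import re
from collections import Counter

def corpus_stats(documents):
    """Naive corpus statistics: document, sentence, token counts and vocabulary size."""
    vocab = Counter()
    n_tokens = n_sentences = 0
    for doc in documents:
        tokens = doc.split()
        n_tokens += len(tokens)
        vocab.update(t.lower() for t in tokens)
        n_sentences += len([s for s in re.split(r"[.!?]+", doc) if s.strip()])
    return {"documents": len(documents),
            "sentences": n_sentences,
            "tokens": n_tokens,
            "vocab_size": len(vocab)}

print(corpus_stats(["The fox jumps over the fence. The fence is high.",
                    "A second, shorter document."]))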

Basic operation:

from pubscience import clean, deduplicate, anonymise
from pubscience.utils import Pipeline

# clean_kwargs, dedup_kwargs and deid_kwargs are user-defined configuration dicts
Cleaner = clean(**clean_kwargs)
Deduplicate = deduplicate(**dedup_kwargs)
Deid = anonymise(**deid_kwargs)


TextPipe = Pipeline([('Cleaner', Cleaner),
                     ('Deduplicate',  Deduplicate),
                     ('Deid', Deid)],
                    n_jobs=16)

df['processed_text'] = TextPipe.fit_transform(df['raw_text'])

# here Deduplicate adds a column to indicate the duplication degree


Tools

Language: this is primarily interesting because large-scale text processing can in principle be parallelised in an embarrassingly simple way, which means we should prefer languages with native support for parallel and heterogeneous execution.

Select

Retrieve

  • Use the APIs to pull PDF, XML, or JSON files.
  • Pull directly over HTTP or FTP (see the sketch below).
  • Parse from local files (Parquet/gzipped CSV).
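
A minimal sketch of pulling a file over HTTP in streaming mode with requests, so nothing has to fit in memory; the URL and chunk size are placeholders.

import requests

url = "https://example.org/corpus/part-0000.xml.gz"  # placeholder URL

# stream the response to disk in chunks instead of loading it fully into memory
with requests.get(url, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    with open("part-0000.xml.gz", "wb") as fh:
        for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
            fh.write(chunk)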

Identify

Based on

  • keyword lookup, using e.g. FlashText
  • relevant document embedders (bi-encoders/cross-encoders) or
  • topic models, or
  • supervised models, trained to distinguish between domain specific texts and generic/other texts

A simple recipe could be: (1) use command-line string-manipulation tools such as grep, awk and cat for the initial pruning, for instance grep "cardiale\|hartziekte\|vasculair\|tachycardie\|hartritme\|angina pectoris\|vaatlijden" nl_clean_0000.jsonl > nl_clean_cardiale.jsonl; this is then followed by (2) a bi-encoder to check whether documents are 'near' medical texts, or (3) a supervised model to identify medical texts.
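
For the keyword-lookup step, an equivalent in-Python filter with FlashText might look as follows; the keyword list mirrors the grep example above.

from flashtext import KeywordProcessor

cardio_keywords = ["cardiale", "hartziekte", "vasculair", "tachycardie",
                   "hartritme", "angina pectoris", "vaatlijden"]

kp = KeywordProcessor(case_sensitive=False)
kp.add_keywords_from_list(cardio_keywords)

def is_candidate(document):
    """True if the document mentions at least one cardiology keyword."""
    return len(kp.extract_keywords(document)) > 0

print(is_candidate("Patiënt met angina pectoris en bekend vaatlijden."))  # True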

We want to be able to do this as part of the select process. For example, in the case of the PubMed full-text articles we can use the abstract for semantic search to identify the relevant PubMed identifiers, which we can then selectively parse from the full text.

Deduplicate

Clean

Fix broken XML/JSON, select text sections using BeautifulSoup and other Python libraries, and clean out non-word characters and formatting artifacts such as span tags.

Translate

Use bulk Google Translate/DeepL/LLMs (GPT-4/Gemini/etc.) or open-source translation models, in combination with UMLS-based glossaries, to translate the cleaned text to Dutch.

  • External LLM APIs:
    • Google Gemini
    • OpenAI GPT4
    • Anthropic Claude
    • Groq (Llama, Mistral etc.)
  • External translation APIs:
    • Google Translate
    • DeepL
  • pre-trained NLMs (in principle all models that are available through Huggingface; see the sketch after this list):
    • Marian NMT
    • NLLB200
    • M2M100
    • MADLAD400
    • T5
  • pre-trained local LLMs (assuming quantized models):
    • Llama
    • Mistral
    • DCLM
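
As an illustration of the NMT route, a minimal sketch using the transformers translation pipeline with a Helsinki-NLP Opus-MT English-to-Dutch model; the model choice, batch and max_length are assumptions, not defaults of this library.

from transformers import pipeline

# illustrative English->Dutch model; any Huggingface translation model can be swapped in
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-nl")

batch = ["The patient was diagnosed with atrial fibrillation.",
         "Beta blockers were prescribed at discharge."]
results = translator(batch, max_length=512)
print([r["translation_text"] for r in results])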

Key features:

  • A domain-specific glossary, and related,
  • a domain-specific vocabulary.
  • A cache functionality to reduce translation cost, i.e. a dynamically programmed wrapper (a sketch follows below).
  • Medical span alignment.
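
For the cache, a minimal sketch: a wrapper that memoises translations on a hash of the source text and the target language, so repeated segments are only sent to the (possibly paid) backend once. The translate_fn argument is a placeholder for any of the backends listed above.

import hashlib

class CachedTranslator:
    """Memoise translations keyed on (source text, target language)."""

    def __init__(self, translate_fn):
        self.translate_fn = translate_fn  # any callable: (text, target_lang) -> str
        self.cache = {}

    def __call__(self, text, target_lang="nl"):
        key = hashlib.sha256(f"{target_lang}\x00{text}".encode("utf-8")).hexdigest()
        if key not in self.cache:
            self.cache[key] = self.translate_fn(text, target_lang)
        return self.cache[key]

# usage: wrap an expensive backend once, then call it like a function
# translate = CachedTranslator(deepl_translate)
# translate("The patient was discharged.", "nl")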


When we translate annotated corpora we need to make sure that the labeled spans are correctly translated and spanned. We identify three approaches: (1) span-preserving translation, (2) span-inference of translation, (3) translate-then-align

Span preserving translation

An example approach is given by Seinen et al., who inject the span information directly into the original text prior to translation. Even though this might, arguably, negatively affect the translation quality, the models trained on the translated corpora showed accuracy similar to the model trained on the original English corpora.
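
A schematic illustration of the idea (not Seinen et al.'s exact markup): the labeled spans are wrapped in inline markers that most translation systems copy through to the target side, after which the markers are parsed back out to recover the offsets.

import re

def inject_spans(text, spans):
    """Wrap labeled (start, end, label) spans in inline markers, e.g. [DISEASE]...[/DISEASE]."""
    out, cursor = [], 0
    for start, end, label in sorted(spans):
        out.append(text[cursor:start])
        out.append(f"[{label}]{text[start:end]}[/{label}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)

def extract_spans(marked_text):
    """Recover (start, end, label) offsets from a (translated) marked-up text."""
    spans, plain, pos = [], [], 0
    for m in re.finditer(r"\[(\w+)\](.*?)\[/\1\]|([^\[]+)", marked_text):
        if m.group(1):                      # a marked entity
            spans.append((pos, pos + len(m.group(2)), m.group(1)))
            plain.append(m.group(2))
            pos += len(m.group(2))
        else:                               # plain text between markers
            plain.append(m.group(3))
            pos += len(m.group(3))
    return "".join(plain), spans

marked = inject_spans("Patient has angina pectoris.", [(12, 27, "DISEASE")])
# -> "Patient has [DISEASE]angina pectoris[/DISEASE]."
# translate `marked`, then recover the Dutch offsets:
print(extract_spans("Patiënt heeft [DISEASE]angina pectoris[/DISEASE]."))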

Span-inference of translation

In principle we are able to create a training set with span-to-span information, e.g. as part of existing collective translation efforts (such as DataTools4Heart).

Translate-then-align

We translate a text as is, e.g. "the fox jumps over the fence" -> "de vos springt over het hek", and then identify the spans in the translated sentence. One possible solution is to perform semantic similarity matching using multilingual (or at least bilingual) bi- or cross-encoders.
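
A minimal sketch of the bi-encoder variant: enumerate candidate word n-grams in the translated sentence and pick the one most similar to the source span under a multilingual sentence embedder; the model name and n-gram range are illustrative assumptions.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def align_span(source_span, translated_sentence, max_len=6):
    """Return the translated n-gram most similar to the source span."""
    words = translated_sentence.split()
    candidates = [" ".join(words[i:i + n])
                  for n in range(1, max_len + 1)
                  for i in range(len(words) - n + 1)]
    span_emb = model.encode([source_span], convert_to_tensor=True)
    cand_emb = model.encode(candidates, convert_to_tensor=True)
    scores = util.cos_sim(span_emb, cand_emb)[0]
    return candidates[int(scores.argmax())]

print(align_span("the fox", "de vos springt over het hek"))  # likely "de vos"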

A more lexical/syntactic approach is followed by Soares and Krallinger, who use the Aligner tool.

Anonymise

DEDUCE, Presidio
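
A minimal sketch with Presidio (it relies on a spaCy NER model being installed; DEDUCE would be the Dutch-specific alternative); the example text is illustrative.

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "Patient John Smith, phone 212-555-0199, was discharged on 2023-05-01."

# detect PII entities (spaCy NER plus pattern recognizers under the hood)
analyzer = AnalyzerEngine()
findings = analyzer.analyze(text=text, language="en")

# replace the detected spans with placeholder terms such as <PERSON>, <PHONE_NUMBER>
anonymizer = AnonymizerEngine()
result = anonymizer.anonymize(text=text, analyzer_results=findings)
print(result.text)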

Share
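
Sharing could, for example, push the processed corpus to the Huggingface Hub together with a dataset card; a minimal sketch, in which the repository name is a placeholder.

from datasets import Dataset

records = [{"text": "cleaned and translated document 1"},
           {"text": "cleaned and translated document 2"}]

ds = Dataset.from_list(records)
# requires `huggingface-cli login`; the repo id below is a placeholder
ds.push_to_hub("your-org/pubscience-nl-clinical", private=True)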

Text extraction pipelines:

  • download pdf, extract body text, translate, clean, store
  • download XML, fix broken XML, extract body text, translate, clean, store
  • download pdf, extract Dutch section, clean, store

Pre-training Sources

Dutch

As part of Dutch generic corpora

Here we have to note that CC100, mC4, GigaCorpus and MADLAD-400 all consist primarily (if not solely) of CC text. The mC4 corpus is "filtered" for profanities and is therefore unsuitable as a basis for medical corpora. If you use multiple extraction versions of CC, be aware of the considerable effort required to deduplicate the text.

English

As part of English corpora that we can filter, clean, then translate

We have Italian corpora:

And in principle we are able to identify medical texts in non-Dutch generic corpora, followed by translation.

As part of Dutch clinical texts

  • NtvG journals
  • Dutch medical protocols
  • medical health records from participating medical centers.
  • EMEA

Spanish

Finetuning source

English

Sentence similarity

Term similarity

NER

Entity classification

De-abbreviation

Relationship extraction

Document classification

Summarisation

Q/A

German

NER

Spanish

Entity Classification

Document Classification

Translation of majority language sources

In principle all the English corpora can be used given an appropriate translation method.
