Python Concept Recognition and Entity Linking Library

This package provides the building blocks for generic, dictionary-based concept recognition components. A concept recognizer indexes a dictionary and identifies concepts in text, reporting the begin and end offsets of each match.

The current version includes two concept recognizers:

StemIntersectionConceptRecognizer (instantiated as IntersStemConceptRecognizer in the usage example below)
DoubleMetaphoneConceptRecognizer

Each recognizer extends the ConceptRecognizer base class and implements two methods: initialize, which indexes the dictionary, and recognize, which annotates a text with concepts from the dictionary. recognize returns a list of Annotation objects that describe each match.
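
The contract described above can be sketched as follows. This is a self-contained illustration, not pyclinrec's source: the base class and Annotation below are local stand-ins, and ExactMatchRecognizer with its constructor argument is a hypothetical example. Only the method names and the four annotation fields are taken from this README.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Annotation:
    # Stand-in for the library's Annotation; fields follow the usage example.
    concept_id: str
    start: int
    end: int
    matched_text: str


class ConceptRecognizer(ABC):
    # Stand-in base class: subclasses index a dictionary, then annotate text.
    @abstractmethod
    def initialize(self) -> None: ...

    @abstractmethod
    def recognize(self, text: str) -> List[Annotation]: ...


class ExactMatchRecognizer(ConceptRecognizer):
    """Hypothetical recognizer doing case-insensitive exact label matching."""

    def __init__(self, labels_to_ids: Dict[str, str]) -> None:
        self._raw = labels_to_ids
        self._index: Dict[str, str] = {}

    def initialize(self) -> None:
        # "Indexing" here is just lowercasing the non-empty labels.
        self._index = {
            label.lower(): cid for label, cid in self._raw.items() if label
        }

    def recognize(self, text: str) -> List[Annotation]:
        lowered = text.lower()
        found: List[Annotation] = []
        for label, cid in self._index.items():
            start = lowered.find(label)
            while start != -1:
                end = start + len(label)  # end offset is exclusive here
                found.append(Annotation(cid, start, end, text[start:end]))
                start = lowered.find(label, end)
        return sorted(found, key=lambda a: a.start)


recognizer = ExactMatchRecognizer({"asthma": "C1", "heart attack": "C2"})
recognizer.initialize()
annotations = recognizer.recognize("History of asthma; no Heart Attack.")
```

The recognizers shipped with the package rely on stemming or phonetic encoding rather than exact matching; the sketch only shows the two-step initialize/recognize life cycle.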

Dictionary loading and formats

The DictionaryLoader base class allows for the implementation of dictionary loaders that accept different formats.

Currently, a single format is supported:

The Mgrep TSV format (the same format used by BioPortal Annotator), loaded with the MgrepDictionaryLoader. Each line holds one entry of the following form:

ID<TAB>LABEL

ID is a unique identifier for the concept (e.g. a URI)

LABEL is a label for that concept (it can include spaces, since the separator is a tab character)

A concept which has several labels will result in several ID<TAB>LABEL lines.
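
To make the format concrete, here is a minimal standalone sketch that groups labels by concept identifier from ID<TAB>LABEL lines. It does not use pyclinrec; the helper name parse_mgrep_lines and the example.org URIs are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def parse_mgrep_lines(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group labels by concept ID from ID<TAB>LABEL lines (illustrative helper)."""
    labels_by_id: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        # Split on the first tab only, so labels may contain spaces.
        concept_id, sep, label = line.partition("\t")
        if not sep or not concept_id.strip() or not label.strip():
            continue  # skip blank or malformed lines
        labels_by_id[concept_id.strip()].append(label.strip())
    return dict(labels_by_id)


sample = [
    "http://example.org/concept/1\tmyocardial infarction\n",
    "http://example.org/concept/1\theart attack\n",
    "\n",
    "http://example.org/concept/2\tasthma\n",
]
parsed = parse_mgrep_lines(sample)
```

The first concept appears on two lines and therefore ends up with two labels, which mirrors the one-line-per-label rule above.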

The dictionary loader can be instantiated as follows:

loader = MgrepDictionaryLoader("/path/to/tsv/file")

It is fairly straightforward to implement a custom dictionary loader. The loader is passed to a recognizer at construction time, as shown in the next section.
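
As a rough idea of what a custom loader could look like, the sketch below reads a comma-separated file instead of a TSV file. The abstract methods of the real DictionaryLoader are not documented in this README, so the base class here is a local stand-in with a single assumed load method; CsvDictionaryLoader and its id and label column names are hypothetical as well.

```python
import csv
import os
import tempfile
from abc import ABC, abstractmethod
from typing import List, Tuple


class DictionaryLoader(ABC):
    # Stand-in base class; the real pyclinrec interface may differ.
    @abstractmethod
    def load(self) -> List[Tuple[str, str]]:
        """Return (concept_id, label) pairs."""


class CsvDictionaryLoader(DictionaryLoader):
    """Hypothetical loader for CSV files with 'id' and 'label' columns."""

    def __init__(self, path: str) -> None:
        self.path = path

    def load(self) -> List[Tuple[str, str]]:
        with open(self.path, newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            return [(row["id"], row["label"]) for row in reader]


# Write a tiny CSV dictionary to a temporary file and load it back.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline="", encoding="utf-8"
) as tmp:
    tmp.write("id,label\nC1,asthma\nC1,bronchial asthma\nC2,heart attack\n")
    csv_path = tmp.name

entries = CsvDictionaryLoader(csv_path).load()
os.remove(csv_path)  # clean up the temporary file
```

To plug such a loader into an actual recognizer, it would need to subclass the package's own DictionaryLoader and implement whatever methods that class declares.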

Usage Example

The following example shows how to instantiate a recognizer, initialize it, and annotate a list of texts. Individual recognizers may require additional data files. For the two recognizers that are currently supported, the files are provided in the data directory for French (clinical text). Beware: the termination and stop lists are typically domain-specific.

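from typing import List

# Import paths below are assumed from the package layout; adjust them if they differ.
from pyclinrec.dictionary import MgrepDictionaryLoader
from pyclinrec.recognizer import IntersStemConceptRecognizer

corpus: List[str] = []  # Load your corpus here as a list of strings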

recognizer = IntersStemConceptRecognizer(
    dictionary_loader=loader,
    stop_words_file="pyclinrec/stopwordsfr.txt",
    termination_terms_file="pyclinrec/termination_termsfr.txt",
)
recognizer.initialize()

for text in corpus: 
    annotations = recognizer.recognize(text)
    for annotation in annotations:
        concept_id = annotation.concept_id # The unique identifier of the matching concept as defined in the dictionary
        start = annotation.start # Start character offset of the annotation
        end = annotation.end # End character offset of the annotation
        matched_text = annotation.matched_text # The surface form of the text matching the annotation
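
Once annotations are available, they can be post-processed with ordinary Python. The sketch below is one possible way to do so, assuming that the end offset is exclusive (as in Python slicing) and that spans do not overlap; neither assumption is confirmed by this README. The Annotation dataclass is a local stand-in carrying the four attributes read in the loop above, and highlight is a hypothetical helper.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Annotation:
    # Stand-in with the four attributes used in the usage example.
    concept_id: str
    start: int
    end: int
    matched_text: str


def highlight(text: str, annotations: List[Annotation]) -> str:
    """Wrap each annotated span as [surface|concept_id] (non-overlapping spans)."""
    pieces = []
    cursor = 0
    for ann in sorted(annotations, key=lambda a: a.start):
        pieces.append(text[cursor:ann.start])  # text before the span
        pieces.append(f"[{text[ann.start:ann.end]}|{ann.concept_id}]")
        cursor = ann.end
    pieces.append(text[cursor:])  # trailing text after the last span
    return "".join(pieces)


text = "History of asthma; no heart attack."
annotations = [
    Annotation("C2", 22, 34, "heart attack"),
    Annotation("C1", 11, 17, "asthma"),
]
marked = highlight(text, annotations)
counts = Counter(a.concept_id for a in annotations)  # occurrences per concept
```

If a recognizer returns overlapping annotations, they would need to be filtered (for example by keeping the longest span) before calling a helper like this one.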


EuromovDHM-SemTaxM/pyclinrec
