
Task Schemata

Nicolay Rusnachenko edited this page Oct 12, 2023 · 9 revisions

Task schema for no-label text opinion extraction from RuSentRel, a collection of Russian-language mass-media articles with document-level sentiment attitude annotations. Entity annotations come from the BRAT markup and are grouped into a synonyms collection by their stemmed form (Yandex Mystem). Opinion annotation combines index-based annotation pairs with a non-assigned (NoLabel) annotation of all entity pairs in each sentence whose distance does not exceed 50 words:

# 1. text parser pipeline.
text_parser = BaseTextParser(pipeline=[
    BratTextEntitiesParser(),
    DefaultTextTokenizer(keep_tokens=True),
]) 

# Initialize empty synonyms collection.
# Using stemming for values grouping.
stemmer = MystemWrapper()
synonyms = StemmerBasedSynonymCollection(RuSentRelSynonymsHelper.iter_groups(), 
                                         stemmer, 
                                         is_read_only=False)

# Initialize provider of the documents.
doc_provider = RuSentrelDocumentOperations()

# 2. text opinion annotation pipeline.
opinion_annotation = text_opinion_extraction_pipeline(
    # Function that provides a document by its ID.
    get_doc_by_id_func=doc_provider.by_id,
    # Pipeline of the text processing.
    text_parser=text_parser,
    # List of annotations.
    annotators=[ 
        # Value-based annotation.
        AlgorithmBasedTextOpinionAnnotator(
            PairBasedOpinionAnnotationAlgorithm(
                dist_in_terms_bound=50,
                label_provider=ConstantLabelProvider(NoLabel())),
            value_to_group_id_func=lambda v: stemmer.lemmatize_to_str(v),
            get_doc_existed_opinions_func=None,
            create_empty_collection_func=lambda: OpinionCollection(synonyms))
    ])
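
The pair selection rule described above can be illustrated without the library. The following is a minimal sketch in plain Python, not the actual `PairBasedOpinionAnnotationAlgorithm` implementation: the `Mention` tuple, the `iter_candidate_pairs` helper, and the use of `str.lower` in place of Mystem-based stemming are all assumptions made for illustration. It also assumes that pairs are ordered (source, target) and that mentions from the same synonym group are not paired with each other.

```python
from itertools import permutations
from typing import Callable, Iterator, List, Tuple

# Hypothetical stand-in for an entity mention: (term index in sentence, surface value).
Mention = Tuple[int, str]


def iter_candidate_pairs(
    mentions: List[Mention],
    dist_in_terms_bound: int = 50,
    value_to_group_id: Callable[[str], str] = str.lower,
) -> Iterator[Tuple[Mention, Mention]]:
    """Yield ordered (source, target) mention pairs from a single sentence."""
    for source, target in permutations(mentions, 2):
        # Assumption: mentions of the same synonym group are never paired.
        if value_to_group_id(source[1]) == value_to_group_id(target[1]):
            continue
        # Keep only pairs whose distance in terms does not exceed the bound.
        if abs(source[0] - target[0]) <= dist_in_terms_bound:
            yield source, target


# Toy sentence: lowercasing plays the role of stemming-based grouping here.
sentence_mentions = [(0, "Russia"), (12, "USA"), (70, "NATO"), (75, "russia")]
pairs = list(iter_candidate_pairs(sentence_mentions))
```

Every pair produced this way would receive the same constant label (`NoLabel` in the schema above), since no sentiment is assigned at this stage.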

Declaration of a text opinion annotator that converts the document-level attitudes of the RuSentRel collection into context-level opinions:

# Custom labels declaration.
class PositiveLabel(Label): pass
class NegativeLabel(Label): pass

# Label formatting declaration.
label_formatter = RuSentRelLabelsFormatter(
    pos_label_type=PositiveLabel,
    neg_label_type=NegativeLabel)

annot = AlgorithmBasedTextOpinionAnnotator(
    PredefinedOpinionAnnotationAlgorithm(
        doc_provider=doc_provider,
        get_opinions_by_doc_id_func=lambda doc_id: OpinionCollection(
            RuSentRelOpinions.iter_from_doc(doc_id, label_formatter)),
        value_to_group_id_func=lambda value: GroupingProviders.provide_value(
            synonyms=synonyms, value=value),
        create_empty_collection_func=lambda: OpinionCollection(synonyms))
)
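
To make the document-level to context-level conversion concrete, here is a minimal sketch in plain Python. It is not the `PredefinedOpinionAnnotationAlgorithm` implementation: the `project_to_contexts` helper, the tuple layouts, and the use of `str.lower` as the grouping function are hypothetical and chosen only for illustration. The idea shown is that a document-level attitude (source, target, label) is projected onto every sentence in which mentions of both groups occur.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical layouts used only in this sketch.
DocOpinion = Tuple[str, str, str]           # (source value, target value, label)
ContextOpinion = Tuple[int, str, str, str]  # (sentence index, source, target, label)


def project_to_contexts(
    doc_opinions: List[DocOpinion],
    sentences: List[List[str]],
    value_to_group_id: Callable[[str], str] = str.lower,
) -> List[ContextOpinion]:
    """Project document-level attitudes onto sentence-level mention pairs."""
    # Index document-level labels by (source group, target group).
    labels: Dict[Tuple[str, str], str] = {
        (value_to_group_id(s), value_to_group_id(t)): label
        for s, t, label in doc_opinions
    }
    result: List[ContextOpinion] = []
    for sent_idx, mentions in enumerate(sentences):
        for i, source in enumerate(mentions):
            for j, target in enumerate(mentions):
                if i == j:
                    continue
                key = (value_to_group_id(source), value_to_group_id(target))
                label = labels.get(key)
                # Only pairs known at the document level become context opinions.
                if label is not None:
                    result.append((sent_idx, source, target, label))
    return result


doc_opinions = [("USA", "Russia", "neg")]
sentences = [["usa", "Russia"], ["NATO", "Russia"], ["Russia", "USA"]]
contexts = project_to_contexts(doc_opinions, sentences)
```

In this toy input the second sentence yields nothing because it has no mention of the source group, while the direction of the attitude is preserved in the third sentence regardless of mention order.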

Application to the large RuAttitudes collection (252K documents), which consists of attitudes annotated using the distant supervision technique.

NOTE: The doc provider here is expected to be the one for RuAttitudes.

pipeline = text_opinion_extraction_pipeline(
    annotators=[ 
        # Index-based annotation.
        PredefinedTextOpinionAnnotator(
            doc_provider=doc_provider, 
            label_formatter=RuAttitudesLabelFormatter(RuAttitudesLabelScaler()))
    ],
    get_doc_by_id_func=doc_provider.by_id,
    text_parser=text_parser)