arekit-0.22.0

@nicolay-r nicolay-r released this 17 Mar 11:46
· 604 commits to master since this release

Release Notes 🎉

  • Pipelines integration!
    • Now used in text processing, which can be split into tokenization, entity assignment, and frame assignment stages.
  • Repositories for opinions and network input samples!
  • Storage kernel customizations support for opinion and samples! Using Pandas by default.
  • Opinion-related services turned into providers: pairs, opinions, text-opinions, etc.
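
As an illustration of the staged text processing mentioned above, here is a minimal sketch of a pipeline whose stages are applied in order. It does not use AREkit's API: `tokenize`, `assign_entities`, `assign_frames`, `run_pipeline`, and the sample vocabularies are all hypothetical names chosen for this example.

```python
from functools import reduce

# Hypothetical stages; none of these names come from AREkit itself.

def tokenize(text):
    # Stage 1: split raw text into whitespace-separated terms.
    return text.split()

def assign_entities(terms, entities=frozenset({"Moscow", "Berlin"})):
    # Stage 2: tag terms found in an (assumed) entity vocabulary.
    return [("ENTITY", t) if t in entities else t for t in terms]

def assign_frames(terms, frames=frozenset({"supports"})):
    # Stage 3: tag terms found in an (assumed) frame vocabulary.
    # Terms already tagged as tuples are left untouched.
    return [("FRAME", t) if t in frames else t for t in terms]

def run_pipeline(data, stages):
    # Feed the output of each stage into the next one.
    return reduce(lambda acc, stage: stage(acc), stages, data)

result = run_pipeline(
    "Moscow supports Berlin",
    [tokenize, assign_entities, assign_frames],
)
print(result)
```

Because each stage is an ordinary callable, a stage can be dropped or replaced without touching the others, which is the general appeal of a pipeline layout.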

NOTE: issue #232 has been moved to the next release.
This version does not support RuAttitudes collection news parsing!
Will be fixed in the upcoming project.

Changelog

v0.22.0-rc (2022-03-17)

Full Changelog

Changes

Implemented enhancements:

  • create_term_embedding -- Embedding algorithm based on parts requires useless check #298
  • UnitTests -- BertOntoNotes is no longer below the core processing #293
  • SingleLabelScaler -- provide [QUICK] #291
  • BRAT visualization -- support processing in case of multiple documents. #286
  • Entity -- IDs Refactoring #280
  • BaseSampleRowProvider -- provide sentence id #279
  • BRAT tool -- adopt ui as a callback for the predict pipeline #275
  • ExperimentIterationHandler -- add Labeled Output Samples conversion to OpinionCollection #270
  • InferenceContext -- split bags and samples extraction from a single method [Quick] #268
  • DataFolding -- organize united data folding. #267
  • BaseDataFolding -- iter_index is not related to the base implementation #266
  • DataFolding -- move into experiment context #264
  • DataIO (exp_data var) -- rename it to ExperimentContext #263
  • ExperimentIterationHandler (Callback before) -- organize ExperimentEvaluationCallback #262
  • NetworkCallback -- this callback should not inherit experiment base Callback #261
  • Neural Network Hidden states writers and providers refactoring #260
  • TrainingCallback -- separate into TrainingTerminationCallback and HiddenWriterCallback classes. #259
  • BaseTensorflowModel -- simplify fit and predict operations. #258
  • LabeledCollection -- remove is_empty and reset_labels api #257
  • NetworkCallback -- move train/predict notification info into callback #256
  • Tensorflow saver -- move the related logic outside of the model implementation #255
  • DefaultSingleLabelAnnotationAlgorithm -- single label is not a part of the algo #244
  • ThreeScaleTaskAnnotator -- rename and move into core. #243
  • Data/output -- create pipelines directory with the related output processing #240
  • Examples -- document parsing executes twice #239
  • Might be utilized pipeline implementation #238
  • OpinionsProvider -- performs two actions, including ids assignation #236
  • entity_to_group_func -- BaseExperiment should not provide this method. #235
  • TextOpinionHelper -- to news/parsed/providers (implement the latter as a provider) #233
  • DefaultSingleLabelAnnotationAlgorithm -- iter_opinion duplicates the generalized pair opinion pair creation approach #231
  • Common languages dir -- move its contents into processing contrib. #229
  • Linked Text Opinions Refactoring. #228
  • Lemmatization should be a part of the frames processing pipeline stage #226
  • DefaultTextParser -- this class is actually a Tokenizer #225
  • News -- text-opinions provider and entities access API might be a part of a ParsedNews by means of NewsParser (new class) #224
  • StringLabelsFormatter -- switch to label_types instead of label instances. #223
  • AnnotationAlgorithm -- iter_opinions requires EntitiesCollection while the latter is utilized for entities iteration #222
  • TextParseOptions -- add keep_tokens #221
  • FrameVariantsParser -- return modified terms only #218
  • FramesAnnotation -- is_inverted flag and processing should be a pipeline item #217
  • FramesCollection -- use FrameConnotationProvider instead #216
  • FrameVariantsParser -- move into processing subfolder. #215
  • OpinionOperations -- remove try_read_annotated_opinion_collection #213
  • DocumentOperation -- unify iter_doc_ids operation into one with tag parameter. #212
  • OpinionOperations -- move readers* into IO. #211
  • OpinionCollectionsProvider -- serialization should not be a part of this class #210
  • data -- separate data-related information from the experiment #209
  • BaseInputReader -- class stores _df, however it should be replaced with BaseRowsStorage #207
  • Repositories -- fill method should be a part of a storage rather than provider. #204
  • BaseStorage -- exclude save method into separated class BaseRowsWriter #202
  • Experiments -- rename formats to api (QUICK) #201
  • Embedding and Vocabulary -- organize Storage/Repository with serialize/load operations. #200
  • Sample -- remove dependency from DefaultNetworkConfig. #199
  • BaseOutputFormatter -- both provider and formatter mixes df usage #198
  • OpinionProvider -- remove dependency from Opinion and Document Operation instances. #197
  • Repositories -- add this class which unites all the providers for data writing #195
  • Add column providers #194
  • NetworkSampleFormatter -- switch to provider #193
  • BaseSampleStorage -- use store_labels instead of data_type passing (QUICK) #192
  • NetworkOutputEncoder -- separate formatting from serialization. #191
  • BaseSampleFormatter -- __create_row is not related to the Formatter, should be moved. #190
  • BaseDocumentStatGenerator -- provider depends on IO files. #189
  • OpinonFormatter -- use the latter in experiment io. #188
  • News -- remove return_text parameter from iter_sentences method (QUICK) #187
  • BaseRowsFormatter -- move format method in another class #185
  • BaseSampleFormatter -- _iter_sentence_terms should not be a part of this class. (QUICK) #184
  • BaseSampleFormatter -- _provide_rows behavior depends on row_ids_provider instance type. #182
  • BaseSampleFormatter -- remove data_type parameter from ctor #181
  • BaseObjectParser -- parse method should return object of the same type as sentence #179
  • News -- remove entities_parser instance from News class. #178
  • BaseEntitiesParser -- generalize to BaseObjectsParser. #177
  • Provide SHA checksums utilization for downloaded resources. #176
  • OpinionCollectionsFormatter -- use it as instance, created within with block #175
  • BaseOutput -- move _csv_to_dataframe out of this class. #174
  • DataIO -- remove Stemmer instance #172
  • BaseRowsFormatter -- formatter_type_log_name method should be removed. #171
  • BaseOpinionsFormatter -- leave save method implementation for inheritor classes. #170
  • BaseSampleFormatter -- leave save method implementation for inheritor classes. #169
  • BaseIOUtils -- remove dependencies from file/(path) based data storage format #168
  • BaseIOUtils -- get_input_sample_filepath and get_input_opinions_filepath limit possible storage abilities. #166
  • perform_reading_and_initialization -- provide samples reader. #165
  • perform_reading_and_initialization -- remove dependency from doc_ops #164
  • NetworkInputSampleReader -- remove inheritance from TSV-based reader. #163
  • OpinionCollectionsFormatter -- use save_to and load_from notation for method names with source provider (file/archive/storage, etc.). #142
  • RuSentRelOpinionCollectionFormatter -- move all the opinion iteration during saving/loading into base class #141
  • news_id or doc_id -- normalize class and field names #133
  • embeddings subdir -- considered to be a part of networks contrib #132
  • Sentiment frame polarity (A0->A1) considered to be a part of the related experiment. #118
  • EnumServices -- provide a base class with string to Enum conversion functionality #117
  • EntityFormaters -- Move formatters into the particular experiment implementation #116
  • _create_parse_options -- remove this method from DocumentOperations across all the experiments. #112
  • NewsParseOptions -- provide this options for the particular DefaultParser derived from TextParser #111
  • TextParser -- Provide a separated class with a text processing algorithm implementation API #75
  • Providing all the logging information into log_utils.py #30

Fixed bugs:

  • ModuleNotFoundError: No module named 'arekit.common.data.input.providers.instances' #301
  • UnitTests -- Discard RuAttitudes-v1.2 support due to index out of range exception on reading #295
  • text_opinions_iter_pipeline -- ids assignments vary after multiple calls #278
  • EntitiesParser -- provide doc_level ids #277
  • DeepPavlovNER -- BertOntoNotes entities annotation [Treating string and list-based text representation simultaneously] #274
  • Examples -- get_index_by_term of Vocabulary failed #271
  • Annotator Performance -- keeps all possible pairs between entities. #253
  • Network SampleID -- has type unicode, but expected to be integer type #248
  • Example -- given two sentences results in samples of only last of them. #246
  • UnitTests -- Incorrect labels formatter (QUICK) #186
  • test_samples_iter.py -- incorrect API usage in Tensorflow contrib. #158

Closed issues:

  • Transfer examples folder into separated project [ARElight] #300
  • RuSentRel Experiment -- Text is lemmatized irrespective of the save_lemmas parameter in parser [OK] #297
  • Experiment -- refactor inference pipeline implementation #290
  • Example -- reorganize infer folder (experiment) #289
  • Experiment -- Organize pipeline stages as items of the BasePipeline #285
  • BaseSampleRowProvider -- provide entity values and entity types. [QUICK] #283
  • DeepPavlov NER -- adopt BERTontonotes. #272
  • NeuralNetworks -- graph and tf session should be initialized before the predict method call. #247
  • NewsServiceCollection -- implement #245
  • numpy 1.19.5 -- returns int64 by default #242
  • Organize unit tests for Output to Opinion conversion pipeline #241
  • Iter_opinions_collection -- complicated, considering pipeline processing instead #237
  • EntitiesCollection -- provide value_to_group function instead of SynonymsCollection. #230
  • BaseTextParser -- parse_news is not related to the text parsing concepts and should be a part of another class #220
  • DocumentOperations -- _get_text_parser should not be a part of this API #219
  • Create simple parser for text with mentioned [entities] #214
  • NetworkInputHelper -- performing serialize_missed_collections during writing process #208
  • RowIDs -- should be common for input and output #206
  • SampleRowBalancerHelper -- simplify by using pandas group sampling #203
  • convert_output_to_opinion_collections -- pass opinion reader into parameters. #167
  • Experiment -- Separate TSV-based formatter from the base one for samples and opinions #162
  • Switch to Python3.6 #160
  • RuSentRel Experiment Contrib -- update description #153
  • Provide Cache for data sources #151
  • SynonymsCollection considered in ReadOnly mode only #5

Merged pull requests: