Add TF-IGM (Inverse Gravity Moment) weighting #45

rth · 2019-12-26T23:12:54Z

This adds TF-IGM (Inverse Gravity Moment), a supervised feature weighting scheme for text classification that measures class distinguishing power for each term.

This PR implements the following paper: "Turning from TF-IDF to TF-IGM for term weighting in text classification", Chen et al., 2016.
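For readers unfamiliar with IGM, here is a minimal sketch of the weighting as I understand it from the paper, not the code in this PR. For each term, the per-class frequencies are sorted in descending order, and igm = f_1 / sum_r (r * f_r), where r is the rank; the term weight factor is then 1 + lam * igm. The function name `igm_weights`, the dense `class_freq` input, and the value `lam=7.0` are assumptions chosen for illustration.

```python
import numpy as np


def igm_weights(class_freq, lam=7.0):
    """Sketch of the IGM factor (hypothetical helper, not this PR's API).

    class_freq: array-like of shape (n_classes, n_terms) holding the
    frequency of each term in each class.
    Returns an array of shape (n_terms,) with 1 + lam * igm per term.
    """
    f = np.asarray(class_freq, dtype=float)
    # Sort each term's class frequencies in descending order (rank 1 = largest).
    f_sorted = -np.sort(-f, axis=0)
    ranks = np.arange(1, f.shape[0] + 1)[:, None]
    # "Gravity moment": rank-weighted sum of the sorted frequencies.
    moment = (f_sorted * ranks).sum(axis=0)
    # Terms that never occur get igm = 0 instead of a division by zero.
    igm = np.divide(
        f_sorted[0], moment, out=np.zeros(f.shape[1]), where=moment > 0
    )
    return 1.0 + lam * igm


# Term 0 occurs in a single class; term 1 is spread evenly over three classes.
weights = igm_weights([[10, 5], [0, 5], [0, 5]])
```

A term concentrated in one class gets igm = 1 (the maximum), while a term spread evenly over the classes gets a much smaller value, which is the "class distinguishing power" mentioned above.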

There are a number of feature weighting techniques used for text, both supervised (see e.g. https://github.com/textvec/textvec) and unsupervised (see e.g. SMART notation in IR). One interesting point about TF-IGM is that it works well in the multi-class case (unlike some of the other supervised schemes, which are primarily applied to binary classification).

Chen et al. ran a benchmark against some of the other classical weighting methods on 3 datasets; in the benchmark table, the last two columns correspond to TfigmTransformer() and TfigmTransformer(tf_scale="sqrt"). They are meant to be drop-in replacements for TfidfTransformer. The included example illustrates their use on 20newsgroups; below are the cross-validation scores for BOW unigrams (i.e. ngram_range=(1, 1)):

metric                         F1-macro balanced_accuracy
preprocessing                                            
TF-IGM(tf_scale='sqrt')     0.935±0.005       0.934±0.005
TF-IGM(tf_scale='log1p')    0.934±0.004       0.933±0.004
TF-IDF(sublinear_tf=True)   0.927±0.005       0.926±0.005
TF-IGM(tf_scale=None)       0.927±0.005       0.926±0.005
TF-IDF(sublinear_tf=False)  0.918±0.005       0.917±0.005
TF                          0.908±0.002       0.907±0.002

I also ran this on a non-public dataset several months ago, where it performed marginally better than TF-IDF. I have not tried other datasets so far, although 20newsgroups is not a very good benchmark.

Another point I like about this is that it performs well when using unigrams and bigrams simultaneously, where classical TF-IDF doesn't make much sense, since bigrams are less frequent and will generally be weighted higher than unigrams. Below are CV scores for the same example with ngram_range=(1, 2):

metric                         F1-macro balanced_accuracy
preprocessing                                            
TF-IGM(tf_scale='sqrt')     0.945±0.002       0.944±0.002
TF-IGM(tf_scale='log1p')    0.943±0.003       0.943±0.003
TF-IGM(tf_scale=None)       0.937±0.004       0.936±0.004
TF-IDF(sublinear_tf=True)   0.930±0.004       0.928±0.004
TF-IDF(sublinear_tf=False)  0.922±0.004       0.921±0.004
TF                          0.912±0.002       0.911±0.002

TF-IDF scores remain quite similar, while there is a notable improvement for TF-IGM.
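The point about bigrams being weighted higher under IDF can be illustrated on a toy corpus. This is only a sketch: the three documents and the helper names `ngrams` and `smoothed_idf` are made up for illustration, and the IDF formula is the smoothed variant ln((1 + N) / (1 + df)) + 1 that scikit-learn uses by default.

```python
import math

# Toy corpus, invented for this illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]


def ngrams(tokens, n):
    """Return the list of space-joined n-grams of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def smoothed_idf(docs, n):
    """Map each n-gram to ln((1 + N) / (1 + df)) + 1."""
    df = {}
    for doc in docs:
        # Count each n-gram at most once per document (document frequency).
        for term in set(ngrams(doc.split(), n)):
            df[term] = df.get(term, 0) + 1
    n_docs = len(docs)
    return {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}


uni = smoothed_idf(docs, 1)
bi = smoothed_idf(docs, 2)
mean_uni = sum(uni.values()) / len(uni)
mean_bi = sum(bi.values()) / len(bi)
print(f"mean unigram idf: {mean_uni:.3f}, mean bigram idf: {mean_bi:.3f}")
```

A bigram can never appear in more documents than either of its words, so its IDF is at least as large as theirs; on this corpus the mean bigram IDF comes out higher than the mean unigram IDF, regardless of how useful the bigrams are for classification.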

TODO:

add user manual
more extensive tests

chkoar

On a first pass it looks good to me.

chkoar · 2019-12-26T23:21:34Z

sklearn_extra/feature_weighting/_text.py

+        elif self.tf_scale == "log1p":
+            X = np.log1p(X)
+        else:
+            raise ValueError


A friendlier message here?
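One possible shape for such a message, as a sketch only: the helper name `scale_tf` and its dense-array input are assumptions for illustration, and the accepted values (None, "sqrt", "log1p") are taken from the options shown in the PR description.

```python
import numpy as np


def scale_tf(X, tf_scale=None):
    """Apply the requested term-frequency scaling (hypothetical helper)."""
    if tf_scale is None:
        return X
    elif tf_scale == "sqrt":
        return np.sqrt(X)
    elif tf_scale == "log1p":
        return np.log1p(X)
    else:
        # Name the offending value and list the accepted ones.
        raise ValueError(
            f"tf_scale={tf_scale!r} is not supported; "
            "expected one of None, 'sqrt', 'log1p'."
        )
```

With this, a typo such as tf_scale="log" produces an error that states both what was passed and what is accepted.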

chkoar · 2020-03-04T15:24:41Z

Maybe move that into the sklearn_extra.feature_extraction package to stay in line with scikit-learn?

JustickDM · 2024-05-15T10:30:29Z

Any news?

rth and others added 8 commits December 26, 2019 16:15

ENH TF-IGM feature weighting (initial implementation)

f719269

Improve example

ff8e065

Fix bug in TF-IGM

92486f5

Improve example

04fd530

Improve docstrings

bcff83e

TST Additional tests

0c08fea

Style improvements

7a8ce1f

flake8

690a085

chkoar reviewed Dec 26, 2019


rth added 3 commits December 27, 2019 00:45

Better parameter validation

4f98816

FIX for legacy scipy

ddbf828

DOC Add to reference API

f415906

rth mentioned this pull request Dec 27, 2019

TffvVectorizer Enconding scikit-learn/scikit-learn#15970

Closed
