The task of Word Sense Disambiguation (WSD) consists of associating words in context with their most suitable entry in a pre-defined sense inventory. The de facto sense inventory for English in WSD is WordNet. For example, given the word “mouse” and the following sentence:
“A mouse consists of an object held in one's hand, with one or more buttons.”
we would assign “mouse” its electronic device sense (the 4th sense in the WordNet sense inventory).
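For reference, here is a minimal sketch of how the WordNet sense inventory can be queried (and a toy prediction obtained) with NLTK; it assumes the `nltk` package and its `wordnet` corpus are installed, and the Lesk call is purely illustrative, not one of the systems compared below:

```python
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk

sentence = "A mouse consists of an object held in one's hand, with one or more buttons."

# List the WordNet senses (synsets) of the noun "mouse".
for i, synset in enumerate(wn.synsets("mouse", pos=wn.NOUN), start=1):
    print(i, synset.name(), "-", synset.definition())

# Toy knowledge-based prediction using the simplified Lesk algorithm
# shipped with NLTK (illustrative only).
predicted = lesk(sentence.split(), "mouse", pos="n")
print("Predicted:", predicted.name(), "-", predicted.definition())
```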
The evaluation framework of Raganato et al. (2017) [1] includes two training sets (SemCor (Miller et al., 1993) and OMSTI (Taghipour and Ng, 2015)) and five test sets from the Senseval/SemEval series (Edmonds and Cotton, 2001; Snyder and Palmer, 2004; Pradhan et al., 2007; Navigli et al., 2013; Moro and Navigli, 2015), all standardized to the same format and sense inventory (i.e. WordNet 3.0).
WSD approaches typically fall into two categories: supervised approaches, which make use of sense-annotated training data, and knowledge-based approaches, which exploit the properties of lexical resources.
Supervised: The most widely used training corpus is SemCor, with 226,036 manually annotated sense instances across 352 documents. All supervised systems in the evaluation table are trained on SemCor. Some supervised methods, particularly neural architectures, use the SemEval 2007 dataset as a development set (marked by * in the table). The most common baseline is the Most Frequent Sense (MFS) heuristic, which assigns each target word its most frequent sense in the training data.
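As a concrete illustration, a minimal sketch of the MFS heuristic, assuming the training annotations have already been parsed into (lemma, sense key) pairs; the pair format and the toy sense keys below are illustrative, not a fixed file format:

```python
from collections import Counter, defaultdict

def train_mfs(training_pairs):
    """Map each lemma to its most frequent sense in the training data."""
    counts = defaultdict(Counter)
    for lemma, sense_key in training_pairs:
        counts[lemma][sense_key] += 1
    return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}

def predict_mfs(mfs, lemma, fallback=None):
    """Return the training MFS, backing off (e.g. to the WordNet 1st sense)
    for lemmas unseen in training."""
    return mfs.get(lemma, fallback)

# Toy usage (sense keys are illustrative).
mfs = train_mfs([("mouse", "mouse%1:06:00::"),
                 ("mouse", "mouse%1:06:00::"),
                 ("mouse", "mouse%1:05:00::")])
print(predict_mfs(mfs, "mouse"))  # -> mouse%1:06:00::
```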
Knowledge-based: Knowledge-based systems usually exploit WordNet or BabelNet as the underlying semantic network. The first sense given by the underlying sense inventory (i.e. WordNet 3.0) is included as a baseline.
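Similarly, a minimal sketch of the WordNet 1st sense baseline with NLTK, assuming the `wordnet` corpus is available (NLTK returns synsets in WordNet's sense order, so the first synset is the 1st sense):

```python
from nltk.corpus import wordnet as wn

def first_sense(lemma, pos=wn.NOUN):
    """Return the WordNet 1st sense of a lemma, or None if it is not in WordNet."""
    synsets = wn.synsets(lemma, pos=pos)
    return synsets[0] if synsets else None

print(first_sense("mouse"))  # Synset('mouse.n.01'), the rodent sense
```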
The main evaluation measure is F1-score.
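For intuition, a minimal sketch of the scoring, assuming gold and system answers are dicts mapping instance ids to sets of sense keys (an answer counts as correct if it matches any gold key, and unanswered instances lower recall); the data layout is illustrative rather than the exact file format of the scorer in [1]:

```python
def f1_score(gold, predictions):
    """Compute F1 over sense-annotated instances."""
    correct = sum(1 for inst, keys in predictions.items()
                  if keys & gold.get(inst, set()))
    precision = correct / len(predictions) if predictions else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: one instance, answered correctly.
gold = {"d000.s000.t000": {"mouse%1:06:00::"}}
preds = {"d000.s000.t000": {"mouse%1:06:00::"}}
print(f1_score(gold, preds))  # 1.0
```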
Model | Senseval 2 | Senseval 3 | SemEval 2007 | SemEval 2013 | SemEval 2015 | Paper / Source |
---|---|---|---|---|---|---|
MFS baseline | 65.6 | 66.0 | 54.5 | 63.8 | 67.1 | [1] |
Bi-LSTMatt+LEX | 72.0 | 69.4 | 63.7* | 66.4 | 72.4 | [2] |
Bi-LSTMatt+LEX+POS | 72.0 | 69.1 | 64.8* | 66.9 | 71.5 | [2] |
context2vec | 71.8 | 69.1 | 61.3 | 65.6 | 71.9 | [3] |
ELMo | 71.6 | 69.6 | 62.2 | 66.2 | 71.3 | [4] |
GAS (Linear) | 72.0 | 70.0 | --* | 66.7 | 71.6 | [5] |
GAS (Concatenation) | 72.1 | 70.2 | --* | 67.0 | 71.8 | [5] |
GASext (Linear) | 72.4 | 70.1 | --* | 67.1 | 72.1 | [5] |
GASext (Concatenation) | 72.2 | 70.5 | --* | 67.2 | 72.6 | [5] |
supWSD | 71.3 | 68.8 | 60.2 | 65.8 | 70.0 | [6] [11] |
supWSDemb | 72.7 | 70.6 | 63.1 | 66.8 | 71.8 | [7] [11] |
Model | All | Senseval 2 | Senseval 3 | SemEval 2007 | SemEval 2013 | SemEval 2015 | Paper / Source |
---|---|---|---|---|---|---|---|
WN 1st sense baseline | 65.2 | 66.8 | 66.2 | 55.2 | 63.0 | 67.8 | [1] |
Babelfy | 65.5 | 67.0 | 63.5 | 51.6 | 66.4 | 70.3 | [8] |
UKBppr_w2w-nf | 57.5 | 64.2 | 54.8 | 40.0 | 64.5 | 64.5 | [9] [12] |
UKBppr_w2w | 67.3 | 68.8 | 66.1 | 53.0 | 68.8 | 70.3 | [9] [12] |
WSD-TM | 66.9 | 69.0 | 66.9 | 55.6 | 65.3 | 69.6 | [10] |
Note: 'All' is the concatenation of all five test sets, as described in [10] and [12]. The scores of [6], [7] and [9] are not taken from the original papers but from the results of the implementations of [11] and [12], respectively.
[1] Word Sense Disambiguation: A Unified Evaluation Framework and Empirical Comparison
[2] Neural Sequence Learning Models for Word Sense Disambiguation
[3] context2vec: Learning generic context embedding with bidirectional lstm
[4] Deep contextualized word representations
[5] Incorporating Glosses into Neural Word Sense Disambiguation
[6] It makes sense: A wide-coverage word sense disambiguation system for free text
[7] Embeddings for Word Sense Disambiguation: An Evaluation Study
[8] Entity Linking meets Word Sense Disambiguation: A Unified Approach
[9] Random walks for knowledge-based word sense disambiguation
[10] Knowledge-based Word Sense Disambiguation using Topic Models
[11] SupWSD: A Flexible Toolkit for Supervised Word Sense Disambiguation
[12] The risk of sub-optimal use of Open Source NLP Software: UKB is inadvertently state-of-the-art in knowledge-based WSD