A Novel Challenge Set for Hebrew Morphological Disambiguation and Diacritics Restoration

There are 21 Hebrew homographs covered in this dataset, with two possible analyses per homograph. There are approximately 500-2K sentences for each of the two possible analyses for each homograph.

This dataset is provided for research purposes (see License section below).

Files

Currently the dataset contains sample sentences for 21 homographs. Each homograph is in its own directory, with two files inside the directory - one for each possible analysis. The name of the directory/file represents the homograph/diacritized form. For example, for the homograph עמנו:

----------------------------------------------
.
..
עמנו
    עִמָּנוּ.txt
    עַמֵּנוּ.txt
...
----------------------------------------------

Format

Each file is a text document, containing the tagged examples for that diacritic form. Each example is a full sentence on it's own line, with the relevant homograph marked in between double-daggers (U+2021) correctly diacritized, marking the class.

For example:

ידענו שוויתור על המנהג הזה יביא את הקץ על ‡עַמֵּנוּ‡ ועל היהדות כולה.

License

We provide this dataset for research purposes and make no ownership claim on any part of it.

Citing

If you use this dataset in academic work, please cite as follows:

@inproceedings{shmidman-etal-2020-challenge,
  author    = {Avi Shmidman and
               Joshua Guedalia and
               Shaltiel Shmidman and
               Moshe Koppel and
               Reut Tsarfaty},
  title     = {A Novel Challenge Set for Hebrew Morphological Disambiguation and Diacritics Restoration},
  booktitle = "EMNLP 2020 Findings, forthcoming",
  publisher = "Association for Computational Linguistics",
  year      = {2020}
}

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
ChallengeSets		ChallengeSets
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

A Novel Challenge Set for Hebrew Morphological Disambiguation and Diacritics Restoration

Files

Format

License

Citing

About

Releases

Packages

shaltielshmid/hebrew-disambig-challenge-set

Folders and files

Latest commit

History

Repository files navigation

A Novel Challenge Set for Hebrew Morphological Disambiguation and Diacritics Restoration

Files

Format

License

Citing

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages