zh_en_word_alignment_data

This repository contains data and parameters to replicate experiments in the following paper.

Model-Based Aligner Combination Using Dual Decomposition. John DeNero and Klaus Macherey. ACL 2011. http://denero.org/content/pubs/acl11_denero_dual.pdf

This repository has several components, listed below. It does not contain an implementation of the bidirectional model described in the paper. It does contain a simple reference implementation of the HMM-based alignment model baseline (Vogel, '96).

(1) zh_en_data contains the tokenized, labeled evaluation data. It is a small hand-aligned subset of the NIST 2002 MT evaluation set.

(2) hmm_models contains the parameters of HMM-based alignment models trained to generate English given Chinese and Chinese given English. Both models were trained on the LDC FBIS corpus. The data format and its interpretation are described in src/hmm.py.

(3) src contains hmm.py, which reads and applies an HMM-based alignment model. It also contains align.py, which reads a dataset and applies two models to each sentence, rendering a text grid of the intersection of their outputs (illustrated by the sketch after this list).

(4) output contains text grid renderings of the alignments evaluated in Table 2 of the paper; the standard alignment scoring metrics are sketched at the end of this README.

(5) align.sh runs src/align.py on zh_en_data using hmm_models.
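The grid rendering mentioned in item (3) can be illustrated with a short, self-contained sketch. This is not the code in src/align.py; the intersection helper, the toy sentence pair, and the output layout below are assumptions chosen purely for illustration.

```python
# Illustrative only: intersect two directional alignments and print a text
# grid. The layout produced by src/align.py may differ from this sketch.
def intersect(f2e, e2f):
    """f2e, e2f: sets of (src_index, tgt_index) links from the two models."""
    return f2e & e2f

def render_grid(src, tgt, links):
    """Rows are source words, columns are target positions; '*' marks a link."""
    width = max(len(w) for w in src)
    print(" " * (width + 1) + " ".join(str(j % 10) for j in range(len(tgt))))
    for i, w in enumerate(src):
        row = ("*" if (i, j) in links else "." for j in range(len(tgt)))
        print(w.ljust(width + 1) + " ".join(row))

# Toy example (hypothetical sentence pair and links):
links = intersect({(0, 0), (1, 1), (2, 2)}, {(0, 0), (2, 2)})
render_grid(["wo", "ai", "ni"], ["I", "love", "you"], links)
```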

Note: The implementation in src/hmm.py is not the original one used to conduct the experiments, nor is it efficient enough for word alignment of full corpora. It exists only to clarify the interpretation of the parameter files.
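For readers who want the intuition before opening src/hmm.py, the following is a generic sketch of how an HMM-based alignment model combines a lexical translation table with jump (transition) probabilities in a Viterbi decode. The parameter structures here are assumptions for illustration only (the real format is documented in src/hmm.py), and the sketch omits NULL alignments for brevity.

```python
# Generic Viterbi decode for an HMM-based alignment model (Vogel et al., 1996).
# The dictionaries below are illustrative assumptions, not the hmm_models format.
import math

def viterbi_align(src, tgt, t_prob, jump_prob):
    """Return, for each target position j, the source index aligned to tgt[j].

    src, tgt  : lists of tokens
    t_prob    : dict (src_word, tgt_word) -> P(tgt_word | src_word)   [assumed]
    jump_prob : dict (cur_src_pos - prev_src_pos) -> transition prob  [assumed]
    """
    n = len(src)
    neg_inf = float("-inf")
    # delta[j][i]: best log-prob of generating tgt[:j+1] with tgt[j] aligned to src[i]
    delta = [[neg_inf] * n for _ in tgt]
    back = [[0] * n for _ in tgt]

    for i in range(n):  # uniform initial distribution; emission score only
        delta[0][i] = math.log(t_prob.get((src[i], tgt[0]), 1e-12))

    for j in range(1, len(tgt)):
        for i in range(n):
            emit = math.log(t_prob.get((src[i], tgt[j]), 1e-12))
            best, arg = neg_inf, 0
            for k in range(n):
                score = delta[j - 1][k] + math.log(jump_prob.get(i - k, 1e-12))
                if score > best:
                    best, arg = score, k
            delta[j][i] = best + emit
            back[j][i] = arg

    # Backtrace from the best final state.
    align = [0] * len(tgt)
    align[-1] = max(range(n), key=lambda i: delta[-1][i])
    for j in range(len(tgt) - 1, 0, -1):
        align[j - 1] = back[j][align[j]]
    return align
```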

Warning: The implementation in src/hmm.py is not an exact replication of the original aligner. Known differences are marked as TODOs in the code, and small unknown differences exist as well. In the initial release of this dataset, approximately 1% of words receive different Viterbi alignments from the original implementation in one alignment model or the other. Hopefully, those differences will be identified and removed in future updates.
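For completeness, word alignments against hand-labeled data such as zh_en_data are conventionally scored with precision, recall, and alignment error rate (AER) over sure and possible links. The sketch below shows that standard computation; it is not the evaluation script used to produce the paper's results, which this repository does not include.

```python
# Standard alignment evaluation metrics over sure (S) and possible (P) links.
# Illustration only; not the script used to produce the paper's reported numbers.
def alignment_scores(predicted, sure, possible):
    """predicted, sure, possible: sets of (src_index, tgt_index) links.
    By convention, sure links are also counted as possible."""
    possible = possible | sure
    a_s = len(predicted & sure)
    a_p = len(predicted & possible)
    precision = a_p / len(predicted) if predicted else 1.0
    recall = a_s / len(sure) if sure else 1.0
    aer = 1.0 - (a_s + a_p) / (len(predicted) + len(sure))
    return precision, recall, aer
```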
