cynical-selection

Allo-media data selection tool

This code implements the data selection method and algorithms proposed in Axelrod's paper "Cynical Selection of Language Model Training Data". It is based on the paper's explanations and on the Perl implementation Axelrod published on GitHub.

Comments in code and details on usage to come, but it's pretty simple right now.

Basic usage

Say you have a (small) representative corpus (task.txt) and a (big) general one (unadapted.txt), and you want to select the sentences from the big corpus that most resemble those in the small corpus.

Usage would be:

./cynical-selection.py --task task.txt --unadapted unadapted.txt

This will produce a .jaded file containing the selected sentences using the following tab-separated format:

model score, sentence score (penalty + gain), length penalty, sentence gain, sentence id (in the selection), sentence id (in the unadapted corpus), best word, word gain, sentence.
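To work with the selection afterwards, the .jaded file can be read back field by field. The following is a minimal sketch, not part of the tool: it assumes the nine fields appear in the order listed above, that the file has no header row, that the scores are plain decimal numbers and that both sentence ids are integers. The names JadedRow and read_jaded, and the sample line, are made up for illustration.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class JadedRow(NamedTuple):
    """One line of a .jaded file (hypothetical container, field order assumed)."""
    model_score: float
    sentence_score: float   # penalty + gain
    length_penalty: float
    sentence_gain: float
    selection_id: int       # sentence id in the selection (assumed integer)
    unadapted_id: int       # sentence id in the unadapted corpus (assumed integer)
    best_word: str
    word_gain: float
    sentence: str


def read_jaded(handle: TextIO) -> Iterator[JadedRow]:
    """Yield one JadedRow per non-empty line of a tab-separated .jaded stream."""
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first 8 tabs only, so the sentence stays in one piece.
        fields = line.split("\t", 8)
        if len(fields) != 9:
            raise ValueError(f"line {line_number}: expected 9 fields, got {len(fields)}")
        yield JadedRow(
            model_score=float(fields[0]),
            sentence_score=float(fields[1]),
            length_penalty=float(fields[2]),
            sentence_gain=float(fields[3]),
            selection_id=int(fields[4]),
            unadapted_id=int(fields[5]),
            best_word=fields[6],
            word_gain=float(fields[7]),
            sentence=fields[8],
        )


# Invented sample line, only to show the shape of the data.
SAMPLE = "-5.25\t-0.5\t0.25\t-0.75\t1\t42\thello\t-0.3\thello world\n"
rows = list(read_jaded(io.StringIO(SAMPLE)))
print(rows[0].unadapted_id, rows[0].sentence)
```

With a real file, pass an open file handle instead of the io.StringIO sample, for example open(path, encoding="utf-8") where path points at the .jaded file the script produced.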

See the header of the script for the available options. Here are the two most important:

batch: essential with big corpora; selects more than one sentence at a time (see Axelrod's paper)

iterate: repeat selection runs until no more than 10% of the original size can be removed
