
testscript

Michal Koutny edited this page Apr 18, 2011 · 5 revisions

Script for statistical evaluation of language models

  • works with letter models

  • input file format

    • utf-8 encoding
    • one sentence per line
    • each sentence consists of space-separated words
  • output (in textual form)

    • cross entropy of chosen model over given file(s)
    • average no. of keystrokes per character
      • based on an input method that uses only the arrow and enter keys
      • no. of keystrokes needed is (o_c + 1), where (o_c) is the position of the character in a list sorted by probability; the extra one is for enter
  • modes of operation

    • name of the model is a parameter
    • simple
      • model should already be trained
      • calculates the measures for each filename given as a CLI argument
      • for multiple files, also calculates the mean and variance
    • leaving-one-out
      • (N) is a parameter
      • model must support receiving training data
      • no. of files must be either one or a multiple of (N)
        • a single file must have at least (N) lines
      • input data are divided into (N) groups (units are either files or lines)
      • each test leaves one group out
        • this is the held-out data/test data (Note: clarify the difference.)
        • the rest is used as training data
      • the output is then analogous to simple mode with multiple files
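
The two output measures described above can be sketched as follows. This is only an illustration, not the script's actual code: `predict` is a hypothetical stand-in for a letter model, cross entropy is assumed to be measured in bits per character, and (o_c) is assumed to count from zero, so that the most probable character costs a single keystroke (enter only).

```python
import math


def evaluate(text, predict):
    """Return (cross entropy in bits/char, mean keystrokes/char) for text.

    predict(history) is a hypothetical interface: it returns a dict that
    maps every candidate next character to a nonzero probability.
    """
    if not text:
        raise ValueError("text must not be empty")
    total_bits = 0.0
    total_keys = 0
    for i, ch in enumerate(text):
        dist = predict(text[:i])
        # Cross entropy term: -log2 of the probability given to the true char.
        total_bits += -math.log2(dist[ch])
        # Candidates sorted by decreasing probability (ties broken by char).
        ranked = sorted(dist, key=lambda c: (-dist[c], c))
        o_c = ranked.index(ch)  # assumed 0-based: number of arrow presses
        total_keys += o_c + 1   # plus one keystroke for enter
    return total_bits / len(text), total_keys / len(text)


def toy_model(history):
    """Hypothetical context-free model over three characters."""
    return {"a": 0.5, "b": 0.25, "c": 0.25}


entropy, keys = evaluate("abca", toy_model)
print(entropy, keys)
```

For the toy model and the text `abca`, the per-character costs are 1, 2, 2, 1 bits and 1, 2, 3, 1 keystrokes, which average to 1.5 bits and 1.75 keystrokes per character.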
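
The leaving-one-out partitioning can be sketched in the same way. The page does not say how units are assigned to groups, so this sketch assumes contiguous groups whose sizes differ by at most one; `split_groups` and `leave_one_out` are hypothetical names, and the units may be either file names or lines.

```python
def split_groups(units, n):
    """Divide units (files or lines) into n contiguous groups."""
    if n < 2 or len(units) < n:
        raise ValueError("need n >= 2 and at least n units")
    base, extra = divmod(len(units), n)
    groups, start = [], 0
    for i in range(n):
        # The first `extra` groups receive one additional unit.
        size = base + (1 if i < extra else 0)
        groups.append(units[start:start + size])
        start += size
    return groups


def leave_one_out(units, n):
    """Yield (training, held_out) pairs, leaving each group out once."""
    groups = split_groups(units, n)
    for i, held_out in enumerate(groups):
        training = [u for j, g in enumerate(groups) if j != i for u in g]
        yield training, held_out


lines = ["l1", "l2", "l3", "l4", "l5", "l6"]
folds = list(leave_one_out(lines, 3))
for training, held_out in folds:
    print("train:", training, "test:", held_out)
```

Each of the (N) tests would then train the model on `training`, compute the measures on `held_out`, and the (N) results would be summarized by their mean and variance as in simple mode.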