Skip to content

Latest commit

 

History

History
67 lines (49 loc) · 2.54 KB

README.md

File metadata and controls

67 lines (49 loc) · 2.54 KB

TGen Baseline for the E2E NLG Challenge

To train and evaluate the baseline, you need to:

  1. Convert the downloaded E2E data into a format used by TGen. This is done using the input/convert.py script.

    Note that multiple references are joined for one MR in the development set, but kept separate for the training set. All files are plain text, one instance per line (except for multiple references, where instances are separated by empty lines).

    The name and near slots in the MRs are delexicalized. The output files are:

    • *-abst.txt -- lexicalization instructions (what was delexicalized at which position in the references, can be used to lexicalize the outputs)
    • *-conc_das.txt -- original, lexicalized MRs (converted to TGen's representation, semantically equivalent)
    • *-conc.txt -- original, lexicalized reference texts
    • *-das.txt -- delexicalized MRs
    • *-text.txt -- delexicalized reference texts
./convert.py -a name,near -n new-data/trainset.csv train
./convert.py -a name,near -n -m new-data/devset.csv devel
  1. Train TGen on the training set. This uses the default configuration file, the converted data, and the default random seed. It will save the model into model.pickle.gz (and several other files starting with model). Note that we used five different random seeds (-r s0, -r s1 ... -r s4), then picked the setup that was best on the development data
../run_tgen.py seq2seq_train config/seq2seq.py \
    input/train-das.txt input/train-text.txt \
    model.pickle.gz
  1. Generate outputs on the development set. This will also perform lexicalization of the outputs.
../run_tgen.py seq2seq_gen -w outputs.txt -a input/devel-abst.txt \
    model.pickle.gz input/devel-das.txt
  1. Postprocess the outputs. This basically amounts to a simple detokenization. The script changes the outputs in-place, or you can specify a target file name.
./postprocess/postprocess.py outputs.txt

Remarks

Please refer to ../USAGE.md for TGen installation instructions.

The Makefile in this directory contains a simple experiment management system, but this assumes running on a SGE computing cluster and there are probably site-specific settings hardcoded. Please contact me if you want to use it.