Commit fb15574

author
Sam Wiseman
committed
preproc
1 parent 4d4360b commit fb15574

74 files changed

+12677
-1
lines changed

README.md

Lines changed: 30 additions & 1 deletion
@@ -1 +1,30 @@
# data2text

Code for [Challenges in Data-to-Document Generation](https://arxiv.org/abs/1707.08052); much of this code is adapted from an earlier fork of [OpenNMT](https://github.com/OpenNMT/OpenNMT).

The boxscore-data associated with the above paper can be downloaded from the [boxscore-data repo](https://github.com/harvardnlp/boxscore-data). This README goes over running experiments on the RotoWire portion of the data; running on the SBNation data (or other data) is quite similar.


## Preprocessing
Before training models, you must preprocess the data. Assuming the RotoWire JSON files reside at `~/Documents/code/boxscore-data/rotowire`, the following command preprocesses the data:

```
th box_preprocess.lua -json_data_dir ~/Documents/code/boxscore-data/rotowire -save_data roto
```

This writes files called `roto-train.t7`, `roto.src.dict`, and `roto.tgt.dict` to the current directory.

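As a quick sanity check after preprocessing, a small script can confirm that the expected files were written. The sketch below is only an illustration: the helper names `expected_outputs` and `missing_outputs` are hypothetical, and the assumption that other `-save_data` prefixes follow the same naming pattern as `roto` is not confirmed by this README.

```python
from pathlib import Path


def expected_outputs(prefix):
    """File names listed in the README for -save_data roto.

    Generalizing the pattern to other prefixes is an assumption of this sketch.
    """
    return [f"{prefix}-train.t7", f"{prefix}.src.dict", f"{prefix}.tgt.dict"]


def missing_outputs(prefix, directory="."):
    """Return the expected output files that are not present in `directory`."""
    base = Path(directory)
    return [name for name in expected_outputs(prefix) if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_outputs("roto")
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("all preprocessing outputs found")
```

Run it from the directory where `box_preprocess.lua` was invoked; an empty result from `missing_outputs` means all three files are present.
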
### Incorporating Pointer Information
For the "conditional copy" model, it is necessary to know where in the source table each target word may have been copied from. To generate this pointer file, run:

```
python data_utils.py -mode ptrs -input_fi ~/Documents/code/boxscore-data/rotowire/train.json -output_fi "roto-ptrs.txt"
```

This generates a file called `roto-ptrs.txt`.

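To make the idea of "where in the source table each target word may have been copied from" concrete, here is a toy sketch. It is not the logic of `data_utils.py` and does not reproduce the format of `roto-ptrs.txt`: the `(entity, type, value)` record layout, the exact string-match rule, the function name `candidate_pointers`, and the example players are all assumptions made for illustration.

```python
# Toy illustration only: the record layout and the exact-match rule are
# assumptions, not the behavior or output format of data_utils.py.
def candidate_pointers(records, summary_tokens):
    """Map each summary token position to the indices of records whose value matches it."""
    by_value = {}
    for idx, (_entity, _rtype, value) in enumerate(records):
        by_value.setdefault(str(value), []).append(idx)
    return {
        pos: by_value[tok]
        for pos, tok in enumerate(summary_tokens)
        if tok in by_value
    }


# Hypothetical table rows and summary sentence.
records = [
    ("Player A", "PTS", 22),
    ("Player A", "REB", 10),
    ("Player B", "PTS", 10),
]
summary = "Player A scored 22 points and grabbed 10 rebounds".split()

print(candidate_pointers(records, summary))  # prints {3: [0], 7: [1, 2]}
```

The token `10` matches two records, which is why the text speaks of where a word *may* have been copied from: a pointer can be ambiguous without further context.
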
This pointer information can then be incorporated into preprocessing by rerunning `box_preprocess.lua` with the `-ptr_fi` option:

```
th box_preprocess.lua -json_data_dir ~/Documents/code/boxscore-data/rotowire -save_data roto -ptr_fi "roto-ptrs.txt"
```
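
The two pointer-related steps above can be chained from a small driver script. The following is a minimal sketch: `build_commands` and `run_all` are hypothetical helper names, the flags are copied from the commands in this README, and it assumes that generating the pointer file and then preprocessing with `-ptr_fi` is sufficient. By default it only prints the commands.

```python
import shlex
import subprocess
from pathlib import Path


def build_commands(json_dir, prefix="roto", ptr_file="roto-ptrs.txt"):
    """Return the pointer-generation and preprocessing commands as argv lists."""
    # subprocess does not expand "~", so expand it here.
    data_dir = Path(json_dir).expanduser()
    train_json = str(data_dir / "train.json")
    return [
        ["python", "data_utils.py", "-mode", "ptrs",
         "-input_fi", train_json, "-output_fi", ptr_file],
        ["th", "box_preprocess.lua", "-json_data_dir", str(data_dir),
         "-save_data", prefix, "-ptr_fi", ptr_file],
    ]


def run_all(json_dir, dry_run=True):
    """Print each command; execute it as well when dry_run is False."""
    for argv in build_commands(json_dir):
        print(shlex.join(argv))
        if not dry_run:
            # check=True stops the pipeline if a step fails.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_all("~/Documents/code/boxscore-data/rotowire", dry_run=True)
```

Passing `dry_run=False` executes the steps in order, so the pointer file exists before `box_preprocess.lua` reads it.
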
