
Commit 0fb4390
Add files via upload
1 parent f652de7

13 files changed: +1788 -0 lines changed

NLP/cross_aligner/README.md

Lines changed: 121 additions & 0 deletions
@@ -0,0 +1,121 @@
# CrossAligner & Co: Zero-Shot Transfer Methods for Task-Oriented Cross-lingual Natural Language Understanding

This is the data/code repository and usage instructions for the above [ACL 2022 paper](https://arxiv.org/abs/2203.09982v1). If you find the resources/paper useful, please cite:

```
@article{gritta2022crossaligner,
  title={CrossAligner \& Co: Zero-Shot Transfer Methods for Task-Oriented Cross-lingual Natural Language Understanding},
  author={Gritta, Milan and Hu, Ruoyu and Iacobacci, Ignacio},
  journal={arXiv preprint arXiv:2203.09982},
  year={2022}
}
```

#### Acknowledgements
The starting point for this repo was cloned from one of our previous [papers](https://aclanthology.org/2021.findings-acl.32/) called [XeroAlign](https://github.com/huawei-noah/noah-research/tree/master/xero_align).

### Getting started
**`git clone`** the project first, then you can set up the data, models and runs.

#### Python Environment
Everything is written in Python 3.7.9 and PyTorch 1.7.0, so install the following packages:
- Install [miniconda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/) with Python 3.7.9 for Linux or MacOSX.
- Install [PyTorch](https://pytorch.org/get-started/locally/) 1.7.0 with conda using something like `conda install pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 cudatoolkit=10.2 -c pytorch`, selecting the appropriate **cudatoolkit** for your GPU.
- Use the `requirements.txt` with **conda** to install the remaining packages.
- That should be it, no extra packages required.

#### Pretrained Transformers
You need to download the pretrained XLM-R models from **HuggingFace**. The base XLM-R can be downloaded [here](https://huggingface.co/xlm-roberta-base/tree/main) and the large model [here](https://huggingface.co/xlm-roberta-large/tree/main). Save these models **_outside_** the project directory as they will be loaded from the paths **`../xlm-roberta-base/`** and **`../xlm-roberta-large/`** relative to the project (root) directory. Each model directory should contain at least the following: **`config.json`**, **`pytorch_model.bin`** and **`sentencepiece.bpe.model`**. That is all for the models!
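As a quick sanity check that the models were saved in the right place, you can try loading one locally. A minimal sketch, assuming the standard `transformers` API (the exact classes and output format may vary with your installed version; this helper is not part of the repo):

```python
# Run from the project root; assumes ../xlm-roberta-large/ holds the download.
from transformers import XLMRobertaModel, XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("../xlm-roberta-large/")
model = XLMRobertaModel.from_pretrained("../xlm-roberta-large/")

# A successful forward pass means config, weights and sentencepiece model were found.
inputs = tokenizer("set an alarm for six am", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```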
#### Datasets
The XNLU datasets can be downloaded from [this Github repository](https://github.com/milangritta/Datasets) (with some very minor corrections for MultiATIS++, see the [XeroAlign paper](https://aclanthology.org/2021.findings-acl.32/)). Next, **place the [downloaded](https://github.com/milangritta/Datasets) zip files into the ```data``` folder**. By the way, the original data downloads can be obtained here: [MTOP](https://fb.me/mtop_dataset), [MTOD](https://fb.me/multilingual_task_oriented_data) and [MultiATIS++](https://github.com/amazon-research/multiatis).

**The next command assumes you have saved a tokenizer in the ```../xlm-roberta-large/``` folder.** Run the preprocessing script with **`python preprocess.py`**. This will generate the required files and subdirectories: inside the **`data`** folder, there should be four task folders, each with multiple language subfolders containing the generated files, as sketched below.
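Based on the paths referenced in `alignment.py` and `analysis.py`, the generated layout should look roughly like this (file and folder names are partly illustrative):

```
data/
├── m_atis/
│   ├── intents.pkl
│   ├── en/
│   │   ├── train.tsv
│   │   └── train/
│   │       └── data.pkl
│   ├── de/
│   │   └── train.tsv
│   └── ...                # one subfolder per language
├── mtop/
├── mtod/
└── ...
```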
#### Running Experiments

The **`config`** folder contains the experiments reported in the paper, saved as shell scripts, which should help you reproduce our results. Here is an example:

Open the command line and type **`nohup ./config/mtop.sh mtop_aligned large &`**. This command will run the **`mtop_aligned`** experiments with XLM-R Large. The base model can be launched by using **`'... base &'`** instead of `... large &`.

Once you have trained an English model for MultiATIS++, for instance, you can type **`./config/m_atis.sh m_atis_zero_shot base`** to obtain the baseline zero-shot scores for MultiATIS++.

Finally, **`nohup ./config/mtod.sh mtod_target large &`** will train the large XLM-R on the labelled target-language data, referred to as 'Target' in the paper.

To specify which alignment methods to use during training, set the **`--use_aux_losses`** flag to any combination of **`CA XA CTR TI`** (space-separated, as in the config scripts). To use Coefficient of Variation ([Groenendijk et al. 2020](https://arxiv.org/pdf/2009.01717.pdf)) weighting, set the **`--use_weighting`** flag to **`COV`** (a sketch of the idea follows below). If no weighting method is specified, 1+1 weighting is used.
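For intuition, CoV weighting gives each loss a weight proportional to how much it still fluctuates relative to its own magnitude, so converged losses fade out. A simplified sketch of the idea (our illustration only, not the implementation in `train.py`):

```python
import torch

def cov_weights(loss_history):
    # loss_history: (num_steps, num_losses) tensor of recent loss values.
    mean = loss_history.mean(dim=0)
    std = loss_history.std(dim=0)
    cov = std / (mean + 1e-8)   # coefficient of variation per loss
    return cov / cov.sum()      # normalise so the weights sum to 1

# Example: three auxiliary losses tracked over 100 steps.
print(cov_weights(torch.rand(100, 3)))
```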
TOP TIP: Should some runs degrade to zero accuracy and zero F-score, just decrease the learning rate a bit. Some languages are more sensitive to higher learning rates than others (this usually happens quite rarely).

That should give you a good idea of how to launch further runs; if unsure, look inside the shell files for hints :)

#### Additional Notes

Here are some more notes to get you started with running experiments with auxiliary losses.

#### 1. Selecting Auxiliary Losses
When running alignment tasks (`m_atis_aligned`, `mtop_aligned`, `mtod_aligned`), set the `--use_aux_losses` flag to the list of auxiliary losses you would like to use.

For example, the following code in `config/m_atis.sh` would run the experiments with XeroAlign and CrossAligner as auxiliary losses, weighted using CoV weighting.
```bash
if [ $1 == "m_atis_aligned" ]
then
    for lang in de es tr zh hi fr ja pt
    do
        python main.py --task m_atis \
                       ...
                       --max_seq_len 100 \
                       --use_aux_losses XA CA \
                       --use_weighting COV
    done
fi
```

To use 1+1 weighting instead, comment out or delete the line below; training then defaults to 1+1 because the loss weights are initialised to 1 and not specifying a weighting method leaves them unaltered.
```bash
# --use_weighting COV
```

To see which flag corresponds to each loss, see the dictionary in `utils.py`.
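Based on the `set_aux_losses` excerpt shown in section 2 below, that dictionary presumably maps each command-line key to an internal flag, along these lines (the flag names for `CA`, `XA`, `CTR` and `TI` are our guesses, not verified against `utils.py`):

```python
# Hypothetical shape of the dictionary in utils.py; actual flag names may differ.
loss_keys = {
    "CA": "use_cross_aligner",     # CrossAligner (assumed flag name)
    "XA": "use_xero_align",        # XeroAlign (assumed flag name)
    "CTR": "use_contrastive",      # contrastive loss (assumed flag name)
    "TI": "use_translate_intent",  # translate-intent loss (assumed flag name)
}
```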
#### 2. Adding Auxiliary Losses
New auxiliary losses can be added in `train.py`. For examples, see XeroAlign, CrossAligner, etc.

To enable new aux losses on the command line, add your new loss to the `choices` for the `use_aux_losses` option of the parser in `main.py` (a sketch of this change follows the `utils.py` example below), **and also** add the new loss as a (loss, flag) pair in the `set_aux_losses(args)` function in `utils.py`.

For example, to implement a new loss `my_new_loss`, add the following to `utils.py`:

```
def set_aux_losses(args):
    ...
    loss_keys = {
        ...
        "NL": "use_new_loss",
    }
```
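The corresponding `main.py` change might look like this (a sketch; the real parser in `main.py` may use different argument settings and help text):

```python
# Hypothetical excerpt from main.py: extend the existing choices with "NL".
parser.add_argument("--use_aux_losses", nargs="+", type=str, default=None,
                    choices=["CA", "XA", "CTR", "TI", "NL"],
                    help="Auxiliary alignment losses to train with.")
```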
The new loss could then be implemented in `train.py`:
```
...
if use_losses.use_new_loss:
    # Implement the new auxiliary alignment loss here
    ...
```

#### 3. Adding Weighting Methods
New loss weighting methods can also be added in `train.py`, in a manner similar to auxiliary losses. As of the time of writing, the weighting method parser does not use `choices`, so only `set_weighting_method(args)` in `utils.py` needs to be updated.
```
def set_weighting_method(args):
    weighting_methods = {
        ...
        "NW": "use_new_weighting_method",
    }
    ...
```

Then implement the weighting method in `train.py`:
```
if use_weighting.use_new_weighting_method:
    # Implement the new weighting scheme here
    loss_weights = ...
    ...
```
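With both pieces in place, the new method should then be selectable with `--use_weighting NW` in your config scripts, mirroring how `COV` is passed above.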
We hope you find our resources useful. Get in touch if you need further help :)

NLP/cross_aligner/alignment.py

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
# Copyright (C) 2022. Huawei Technologies Co., Ltd. All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE

import logging
import os
import numpy as np
import pickle
import torch
from torch.utils.data import TensorDataset
from data_loader import InputFeatures
from utils import load_tokenizer, Tasks
from xlm_ra import get_intent_labels, get_slot_labels

logger = logging.getLogger(__name__)


def generate_alignment_pairs(args):

    # TOP TIP: Experiments could be faster by caching the alignment data
    # cached_alignment_file = os.path.join(args.model_dir, 'alignment_{}.bin'.format(args.task))

    examples = []
    slot_labels = get_slot_labels(args)
    intent_labels = get_intent_labels(args)
    # Pair each target-language training line with the labels of its English original.
    for language in args.align_languages.split(","):
        with open(os.path.join(args.data_dir, args.task, language, "train.tsv"), "r", encoding="utf-8") as tar_f:
            data = pickle.load(open(os.path.join(args.data_dir, args.task, "en", "train", "data.pkl"), "rb"))
            for target_line, label, slots in zip(tar_f, data['intent_labels'], data['slot_labels']):
                examples.append((target_line.strip(), label.strip(), slots))
    logger.info("Read %d lines...." % len(examples))

    tokenizer = load_tokenizer(args.model_name_or_path)
    pad_token_id = tokenizer.pad_token_id

    feats = []
    for ex_id, example in enumerate(examples):
        if ex_id % 5000 == 0:
            logger.info("Processed %d examples..." % ex_id)

        if args.task in [Tasks.MTOD.value, Tasks.MTOP.value, Tasks.M_ATIS.value]:
            tokens = tokenizer.tokenize(example[0])
            ids = tokenizer.build_inputs_with_special_tokens(tokenizer.convert_tokens_to_ids(tokens))
            assert tokens is not None and len(ids) <= args.max_seq_len
            ids = ids + ([pad_token_id] * (args.max_seq_len - len(ids)))
        else:
            raise Exception("The task '%s' is not recognised!" % args.task)

        if ids:
            # Multi-hot vector marking which slot types occur in this example.
            slots_present = np.zeros((len(slot_labels),))
            for s in example[2]:
                if s == "PAD":
                    continue
                slots_present[slot_labels.index(s)] = 1
            feats.append(InputFeatures(ids, intent_labels.index(example[1]), slots_present, None))

    target_input_ids = torch.tensor([f.input_ids for f in feats], dtype=torch.long)
    slots_binary = torch.tensor([f.slot_labels for f in feats], dtype=torch.float32)
    labels = torch.tensor([f.class_label for f in feats], dtype=torch.long)
    train_dataset = TensorDataset(target_input_ids, labels, slots_binary)
    assert len(target_input_ids) == len(slots_binary)

    logger.info("Created %d train/align instances." % len(train_dataset))
    return train_dataset
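For reference, a usage sketch: `generate_alignment_pairs` reads a handful of fields from the argument namespace (plus whatever `get_intent_labels`/`get_slot_labels` need). With hypothetical values, not an actual run configuration:

```python
# Hypothetical stand-in for the argparse namespace built in main.py.
from types import SimpleNamespace
from alignment import generate_alignment_pairs

args = SimpleNamespace(
    task="m_atis",
    data_dir="data",
    align_languages="de,es",  # comma-separated, as the split(",") above expects
    model_name_or_path="../xlm-roberta-large/",
    max_seq_len=100,
)
dataset = generate_alignment_pairs(args)  # TensorDataset of (input_ids, intents, slots)
```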

NLP/cross_aligner/analysis.py

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
# Copyright (C) 2022. Huawei Technologies Co., Ltd. All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE

import os
import pickle
import random
from seqeval.metrics import classification_report as seq_report
from sklearn.metrics import classification_report as sk_report

# Generate debug files by running e.g. './config/m_atis.sh m_atis_eval large' with your chosen language(s)
task = "m_atis"
target = "debug_" + task + "_target_es"
aligned = "debug_" + task + "_aligned_es"
random.seed(123456789)
print("Processing task: %s, target file: %s, aligned file: %s" % (task, target, aligned))
target_dump = pickle.load(open(os.path.join(target, "debug_dump.pkl"), "rb"))
aligned_dump = pickle.load(open(os.path.join(aligned, "debug_dump.pkl"), "rb"))
intent_map = pickle.load(open(os.path.join("data", task, "intents.pkl"), 'rb'))

assert len(target_dump) == len(aligned_dump)
total_slots_disagreed = 0
intent_prediction_agreement = 0
aligned_slots_preds, aligned_slot_labels = [], []
intent_predictions_target, intent_predictions_aligned = [], []
keys = list(target_dump.keys())
random.shuffle(keys)

for i in keys:
    t_example, a_example = target_dump[i], aligned_dump[i]
    s_preds_target, s_labels_target, i_pred_target, i_label_target = t_example[2], t_example[3], t_example[4], t_example[5]
    s_preds_aligned, s_labels_aligned, i_pred_aligned, i_label_aligned = a_example[2], a_example[3], a_example[4], a_example[5]
    assert len(s_preds_target) == len(s_preds_aligned)
    assert s_labels_target == s_labels_aligned

    # Print a word-level comparison whenever the two models disagree on slots.
    if not s_preds_aligned == s_preds_target:
        print("-" * 50)
        total_slots_disagreed += 1
        print("INTENT: " + intent_map[i_label_target])
        print("SENTENCE: " + "".join([w.replace("▁", " ") for w in t_example[0]]))
        s_preds_iter, s_labels_iter = iter(s_preds_aligned), iter(s_labels_target)
        print("WORD".ljust(25) + "PREDICTION".ljust(25) + "LABEL".ljust(25))
        for word, slot_type in zip(a_example[0], a_example[1]):
            if slot_type != -100:
                p, l = next(s_preds_iter), next(s_labels_iter)
                print("".join([word.ljust(25), p.ljust(25), l.ljust(25), str(p == l)]))
            else:
                print(word.ljust(25) + "PAD".ljust(25) + "PAD".ljust(25))
    aligned_slots_preds.append(s_preds_aligned)
    aligned_slot_labels.append(s_labels_aligned)

    if i_pred_aligned == i_pred_target:
        intent_prediction_agreement += 1
    intent_predictions_target.append(intent_map[i_pred_target])
    intent_predictions_aligned.append(intent_map[i_pred_aligned])

print(sk_report(intent_predictions_target, intent_predictions_aligned, zero_division=0))
print(seq_report(aligned_slot_labels, aligned_slots_preds, zero_division=0))
print("-" * 80)
print("%.1f percent slots agreement (%d out of %d) for aligned & target." % (100 * (1 - (total_slots_disagreed / float(len(keys)))), len(keys) - total_slots_disagreed, len(keys)))
print("%.1f percent intent agreement (%d out of %d) for aligned & target." % (100 * (intent_prediction_agreement / float(len(keys))), intent_prediction_agreement, len(keys)))
print("-" * 80)

NLP/cross_aligner/config/m_atis.sh

Lines changed: 107 additions & 0 deletions
@@ -0,0 +1,107 @@
#!/bin/bash

# Train the English-only model (used later for zero-shot evaluation).
if [ $1 == "m_atis_english" ]
then
  python main.py --task m_atis \
                 --train_languages en \
                 --dev_languages en \
                 --test_languages en \
                 --model_dir m_atis_english \
                 --do_train \
                 --do_eval \
                 --cuda_device cuda:0 \
                 --train_batch_size 2 \
                 --eval_batch_size 2 \
                 --gradient_accumulation_steps 5 \
                 --num_train_epochs 10 \
                 --learning_rate 0.00002 \
                 --save_model \
                 --model_type $2 \
                 --max_seq_len 100
fi

# Train with auxiliary alignment losses (CrossAligner + XeroAlign, CoV-weighted).
if [ $1 == "m_atis_aligned" ]
then
  for lang in de es tr zh hi fr ja pt
  do
    python main.py --task m_atis \
                   --train_languages en \
                   --dev_languages $lang \
                   --test_languages $lang \
                   --model_dir m_atis_aligned_$lang \
                   --do_train \
                   --do_eval \
                   --cuda_device cuda:0 \
                   --train_batch_size 10 \
                   --eval_batch_size 2 \
                   --gradient_accumulation_steps 1 \
                   --num_train_epochs 10 \
                   --learning_rate 0.00002 \
                   --align_languages $lang \
                   --save_model \
                   --model_type $2 \
                   --max_seq_len 100 \
                   --use_aux_losses CA XA \
                   --use_weighting COV
  done
fi

# Evaluate the saved English model on each target language (zero-shot baseline).
if [ $1 == "m_atis_zero_shot" ]
then
  for lang in de es tr zh hi fr ja pt
  do
    python main.py --task m_atis \
                   --train_languages $lang \
                   --dev_languages $lang \
                   --test_languages $lang \
                   --model_dir m_atis_zero_shot_$lang \
                   --do_eval \
                   --cuda_device cuda:0 \
                   --eval_batch_size 10 \
                   --model_type $2 \
                   --load_eval_model m_atis_english \
                   --max_seq_len 100
  done
fi

# Train directly on labelled target-language data ('Target' in the paper).
if [ $1 == "m_atis_target" ]
then
  for lang in de es tr zh hi fr ja pt
  do
    python main.py --task m_atis \
                   --train_languages $lang \
                   --dev_languages $lang \
                   --test_languages $lang \
                   --model_dir m_atis_target_$lang \
                   --do_eval \
                   --do_train \
                   --cuda_device cuda:0 \
                   --train_batch_size 2 \
                   --eval_batch_size 2 \
                   --gradient_accumulation_steps 5 \
                   --num_train_epochs 10 \
                   --learning_rate 0.00002 \
                   --model_type $2 \
                   --save_model \
                   --max_seq_len 100
  done
fi

# Evaluate saved aligned models and write the debug dumps used by analysis.py.
if [ $1 == "m_atis_eval" ]
then
  for lang in de es tr zh hi fr ja pt
  do
    python main.py --task m_atis \
                   --train_languages $lang \
                   --dev_languages $lang \
                   --test_languages $lang \
                   --model_dir m_atis_eval_$lang \
                   --do_eval \
                   --cuda_device cuda:0 \
                   --eval_batch_size 10 \
                   --load_eval_model m_atis_aligned_$lang \
                   --model_type $2 \
                   --max_seq_len 100 \
                   --debug
  done
fi
