
Commit 0fb4390
Add files via upload
1 parent f652de7

13 files changed: +1788 -0 lines changed

NLP/cross_aligner/README.md

Lines changed: 121 additions & 0 deletions
@@ -0,0 +1,121 @@
# CrossAligner & Co: Zero-Shot Transfer Methods for Task-Oriented Cross-lingual Natural Language Understanding

This is the data/code repository and usage instructions for the above [ACL 2022 paper](https://arxiv.org/abs/2203.09982v1). If you find the resources/paper useful, please cite:

```
@article{gritta2022crossaligner,
  title={CrossAligner \& Co: Zero-Shot Transfer Methods for Task-Oriented Cross-lingual Natural Language Understanding},
  author={Gritta, Milan and Hu, Ruoyu and Iacobacci, Ignacio},
  journal={arXiv preprint arXiv:2203.09982},
  year={2022}
}
```

#### Acknowledgements
The starting point for this repo was cloned from one of our previous [papers](https://aclanthology.org/2021.findings-acl.32/) called [XeroAlign](https://github.com/huawei-noah/noah-research/tree/master/xero_align).

### Getting started
**`git clone`** the project first, then you can set up the data, models and runs.

#### Python Environment
Everything is written in Python 3.7.9 and PyTorch 1.7.0, so install the following packages:
- Install [miniconda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/) with Python 3.7.9 for Linux or MacOSX.
- Install [PyTorch](https://pytorch.org/get-started/locally/) 1.7.0 with conda using something like `conda install pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 cudatoolkit=10.2 -c pytorch`, selecting the appropriate **cudatoolkit** for your GPU.
- Use the `requirements.txt` with **conda** to install the remaining packages.
- That should be it, no extra packages required.

#### Pretrained Transformers
You need to download the pretrained XLM-R models from **HuggingFace**. The base XLM-R can be downloaded [here](https://huggingface.co/xlm-roberta-base/tree/main) and the large model [here](https://huggingface.co/xlm-roberta-large/tree/main). Save these models **_outside_** the project directory as they will be loaded from the paths **`../xlm-roberta-base/`** and **`../xlm-roberta-large/`** relative to the project (root) directory. Each model directory should contain at least the following: **`config.json`**, **`pytorch_model.bin`** and **`sentencepiece.bpe.model`**. That is all for the models!
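As a quick sanity check that the models were saved in the right place, you can try loading one locally. A minimal sketch, assuming the standard `transformers` API (the exact classes and output format may vary with your installed version; this helper is not part of the repo):

```python
# Run from the project root; assumes ../xlm-roberta-large/ holds the download.
from transformers import XLMRobertaModel, XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("../xlm-roberta-large/")
model = XLMRobertaModel.from_pretrained("../xlm-roberta-large/")

# A successful forward pass means config, weights and sentencepiece model were found.
inputs = tokenizer("set an alarm for six am", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```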
#### Datasets
The XNLU datasets can be downloaded from [this Github repository](https://github.com/milangritta/Datasets) (with some very minor corrections for MultiATIS++, see the [XeroAlign paper](https://aclanthology.org/2021.findings-acl.32/)). Next, **place the [downloaded](https://github.com/milangritta/Datasets) zip files into the ```data``` folder**. By the way, the original data downloads can be obtained here: [MTOP](https://fb.me/mtop_dataset), [MTOD](https://fb.me/multilingual_task_oriented_data) and [MultiATIS++](https://github.com/amazon-research/multiatis).

**The next command assumes you have saved a tokenizer in the ```../xlm-roberta-large/``` folder.** Run the preprocessing script with **`python preprocess.py`**. This will generate the required files and subdirectories: inside the **`data`** folder, there should be four task folders, each with multiple language subfolders containing the generated files, as sketched below.
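Based on the paths referenced in `alignment.py` and `analysis.py`, the generated layout should look roughly like this (file and folder names are partly illustrative):

```
data/
├── m_atis/
│   ├── intents.pkl
│   ├── en/
│   │   ├── train.tsv
│   │   └── train/
│   │       └── data.pkl
│   ├── de/
│   │   └── train.tsv
│   └── ...                # one subfolder per language
├── mtop/
├── mtod/
└── ...
```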
#### Running Experiments

The **`config`** folder contains the experiments reported in the paper, saved as shell scripts, which should help you reproduce our results. Here is an example:

Open the command line and type **`nohup ./config/mtop.sh mtop_aligned large &`**. This command will run the **`mtop_aligned`** experiments with XLM-R Large. The base model can be launched by using **`'... base &'`** instead of `... large &`.

Once you have trained an English model for MultiATIS++, for instance, you can type **`./config/m_atis.sh m_atis_zero_shot base`** to obtain the baseline zero-shot scores for MultiATIS++.

Finally, **`nohup ./config/mtod.sh mtod_target large &`** will train the large XLM-R on the labelled target-language data, referred to as 'Target' in the paper.

To specify which alignment methods to use during training, set the **`--use_aux_losses`** flag to any combination of **`CA XA CTR TI`** (space-separated, as in the config scripts). To use Coefficient of Variation ([Groenendijk et al. 2020](https://arxiv.org/pdf/2009.01717.pdf)) weighting, set the **`--use_weighting`** flag to **`COV`** (a sketch of the idea follows below). If no weighting method is specified, 1+1 weighting is used.
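For intuition, CoV weighting gives each loss a weight proportional to how much it still fluctuates relative to its own magnitude, so converged losses fade out. A simplified sketch of the idea (our illustration only, not the implementation in `train.py`):

```python
import torch

def cov_weights(loss_history):
    # loss_history: (num_steps, num_losses) tensor of recent loss values.
    mean = loss_history.mean(dim=0)
    std = loss_history.std(dim=0)
    cov = std / (mean + 1e-8)   # coefficient of variation per loss
    return cov / cov.sum()      # normalise so the weights sum to 1

# Example: three auxiliary losses tracked over 100 steps.
print(cov_weights(torch.rand(100, 3)))
```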
TOP TIP: Should some runs degrade to zero accuracy and zero F-score, just decrease the learning rate a bit. Some languages are more sensitive to higher learning rates than others (this usually happens quite rarely).

That should give you a good idea of how to launch further runs; if unsure, look inside the shell files for hints :)

#### Additional Notes

Here are some more notes to get you started with running experiments with auxiliary losses.

#### 1. Selecting Auxiliary Losses
When running alignment tasks (`m_atis_aligned`, `mtop_aligned`, `mtod_aligned`), set the `--use_aux_losses` flag to the list of auxiliary losses you would like to use.

For example, the following code in `config/m_atis.sh` would run the experiments with XeroAlign and CrossAligner as auxiliary losses, weighted using CoV weighting.
```bash
if [ $1 == "m_atis_aligned" ]
then
    for lang in de es tr zh hi fr ja pt
    do
        python main.py --task m_atis \
                       ...
                       --max_seq_len 100 \
                       --use_aux_losses XA CA \
                       --use_weighting COV
    done
fi
```

To use 1+1 weighting instead, comment out or delete the line below; training then defaults to 1+1 because the loss weights are initialised to 1 and not specifying a weighting method leaves them unaltered.
```bash
# --use_weighting COV
```

To see which flag corresponds to each loss, see the dictionary in `utils.py`.
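Based on the `set_aux_losses` excerpt shown in section 2 below, that dictionary presumably maps each command-line key to an internal flag, along these lines (the flag names for `CA`, `XA`, `CTR` and `TI` are our guesses, not verified against `utils.py`):

```python
# Hypothetical shape of the dictionary in utils.py; actual flag names may differ.
loss_keys = {
    "CA": "use_cross_aligner",     # CrossAligner (assumed flag name)
    "XA": "use_xero_align",        # XeroAlign (assumed flag name)
    "CTR": "use_contrastive",      # contrastive loss (assumed flag name)
    "TI": "use_translate_intent",  # translate-intent loss (assumed flag name)
}
```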
#### 2. Adding Auxiliary Losses
New auxiliary losses can be added in `train.py`. For examples, see XeroAlign, CrossAligner, etc.

To enable new aux losses on the command line, add your new loss to the `choices` for the `use_aux_losses` option of the parser in `main.py` (a sketch of this change follows the `utils.py` example below), **and also** add the new loss as a (loss, flag) pair in the `set_aux_losses(args)` function in `utils.py`.

For example, to implement a new loss `my_new_loss`, add the following to `utils.py`:

```
def set_aux_losses(args):
    ...
    loss_keys = {
        ...
        "NL": "use_new_loss",
    }
```
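The corresponding `main.py` change might look like this (a sketch; the real parser in `main.py` may use different argument settings and help text):

```python
# Hypothetical excerpt from main.py: extend the existing choices with "NL".
parser.add_argument("--use_aux_losses", nargs="+", type=str, default=None,
                    choices=["CA", "XA", "CTR", "TI", "NL"],
                    help="Auxiliary alignment losses to train with.")
```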
The new loss could then be implemented in `train.py`:
```
...
if use_losses.use_new_loss:
    # Implement the new auxiliary alignment loss here
    ...
```

#### 3. Adding Weighting Methods
New loss weighting methods can also be added in `train.py`, in a manner similar to auxiliary losses. As of the time of writing, the weighting method parser does not use `choices`, so only `set_weighting_method(args)` in `utils.py` needs to be updated.
```
def set_weighting_method(args):
    weighting_methods = {
        ...
        "NW": "use_new_weighting_method",
    }
    ...
```

Then implement the weighting method in `train.py`:
```
if use_weighting.use_new_weighting_method:
    # Implement the new weighting scheme here
    loss_weights = ...
    ...
```
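With both pieces in place, the new method should then be selectable with `--use_weighting NW` in your config scripts, mirroring how `COV` is passed above.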
We hope you find our resources useful. Get in touch if you need further help :)

NLP/cross_aligner/alignment.py

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
# Copyright (C) 2022. Huawei Technologies Co., Ltd. All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE

import logging
import os
import numpy as np
import pickle
import torch
from torch.utils.data import TensorDataset
from data_loader import InputFeatures
from utils import load_tokenizer, Tasks
from xlm_ra import get_intent_labels, get_slot_labels

logger = logging.getLogger(__name__)


def generate_alignment_pairs(args):

    # TOP TIP: Experiments could be faster by caching the alignment data
    # cached_alignment_file = os.path.join(args.model_dir, 'alignment_{}.bin'.format(args.task))

    examples = []
    slot_labels = get_slot_labels(args)
    intent_labels = get_intent_labels(args)
    # Pair each target-language training line with the labels of its English original.
    for language in args.align_languages.split(","):
        with open(os.path.join(args.data_dir, args.task, language, "train.tsv"), "r", encoding="utf-8") as tar_f:
            data = pickle.load(open(os.path.join(args.data_dir, args.task, "en", "train", "data.pkl"), "rb"))
            for target_line, label, slots in zip(tar_f, data['intent_labels'], data['slot_labels']):
                examples.append((target_line.strip(), label.strip(), slots))
    logger.info("Read %d lines...." % len(examples))

    tokenizer = load_tokenizer(args.model_name_or_path)
    pad_token_id = tokenizer.pad_token_id

    feats = []
    for ex_id, example in enumerate(examples):
        if ex_id % 5000 == 0:
            logger.info("Processed %d examples..." % ex_id)

        if args.task in [Tasks.MTOD.value, Tasks.MTOP.value, Tasks.M_ATIS.value]:
            tokens = tokenizer.tokenize(example[0])
            ids = tokenizer.build_inputs_with_special_tokens(tokenizer.convert_tokens_to_ids(tokens))
            assert tokens is not None and len(ids) <= args.max_seq_len
            ids = ids + ([pad_token_id] * (args.max_seq_len - len(ids)))
        else:
            raise Exception("The task '%s' is not recognised!" % args.task)

        if ids:
            # Multi-hot vector marking which slot types occur in this example.
            slots_present = np.zeros((len(slot_labels),))
            for s in example[2]:
                if s == "PAD":
                    continue
                slots_present[slot_labels.index(s)] = 1
            feats.append(InputFeatures(ids, intent_labels.index(example[1]), slots_present, None))

    target_input_ids = torch.tensor([f.input_ids for f in feats], dtype=torch.long)
    slots_binary = torch.tensor([f.slot_labels for f in feats], dtype=torch.float32)
    labels = torch.tensor([f.class_label for f in feats], dtype=torch.long)
    train_dataset = TensorDataset(target_input_ids, labels, slots_binary)
    assert len(target_input_ids) == len(slots_binary)

    logger.info("Created %d train/align instances." % len(train_dataset))
    return train_dataset
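For reference, a usage sketch: `generate_alignment_pairs` reads a handful of fields from the argument namespace (plus whatever `get_intent_labels`/`get_slot_labels` need). With hypothetical values, not an actual run configuration:

```python
# Hypothetical stand-in for the argparse namespace built in main.py.
from types import SimpleNamespace
from alignment import generate_alignment_pairs

args = SimpleNamespace(
    task="m_atis",
    data_dir="data",
    align_languages="de,es",  # comma-separated, as the split(",") above expects
    model_name_or_path="../xlm-roberta-large/",
    max_seq_len=100,
)
dataset = generate_alignment_pairs(args)  # TensorDataset of (input_ids, intents, slots)
```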

NLP/cross_aligner/analysis.py

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
# Copyright (C) 2022. Huawei Technologies Co., Ltd. All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE

import os
import pickle
import random
from seqeval.metrics import classification_report as seq_report
from sklearn.metrics import classification_report as sk_report

# Generate debug files by running e.g. './config/m_atis.sh m_atis_eval large' with your chosen language(s)
task = "m_atis"
target = "debug_" + task + "_target_es"
aligned = "debug_" + task + "_aligned_es"
random.seed(123456789)
print("Processing task: %s, target file: %s, aligned file: %s" % (task, target, aligned))
target_dump = pickle.load(open(os.path.join(target, "debug_dump.pkl"), "rb"))
aligned_dump = pickle.load(open(os.path.join(aligned, "debug_dump.pkl"), "rb"))
intent_map = pickle.load(open(os.path.join("data", task, "intents.pkl"), 'rb'))

assert len(target_dump) == len(aligned_dump)
total_slots_disagreed = 0
intent_prediction_agreement = 0
aligned_slots_preds, aligned_slot_labels = [], []
intent_predictions_target, intent_predictions_aligned = [], []
keys = list(target_dump.keys())
random.shuffle(keys)

for i in keys:
    t_example, a_example = target_dump[i], aligned_dump[i]
    s_preds_target, s_labels_target, i_pred_target, i_label_target = t_example[2], t_example[3], t_example[4], t_example[5]
    s_preds_aligned, s_labels_aligned, i_pred_aligned, i_label_aligned = a_example[2], a_example[3], a_example[4], a_example[5]
    assert len(s_preds_target) == len(s_preds_aligned)
    assert s_labels_target == s_labels_aligned

    # Print a word-level comparison whenever the two models disagree on slots.
    if not s_preds_aligned == s_preds_target:
        print("-" * 50)
        total_slots_disagreed += 1
        print("INTENT: " + intent_map[i_label_target])
        print("SENTENCE: " + "".join([w.replace("▁", " ") for w in t_example[0]]))
        s_preds_iter, s_labels_iter = iter(s_preds_aligned), iter(s_labels_target)
        print("WORD".ljust(25) + "PREDICTION".ljust(25) + "LABEL".ljust(25))
        for word, slot_type in zip(a_example[0], a_example[1]):
            if slot_type != -100:
                p, l = next(s_preds_iter), next(s_labels_iter)
                print("".join([word.ljust(25), p.ljust(25), l.ljust(25), str(p == l)]))
            else:
                print(word.ljust(25) + "PAD".ljust(25) + "PAD".ljust(25))
    aligned_slots_preds.append(s_preds_aligned)
    aligned_slot_labels.append(s_labels_aligned)

    if i_pred_aligned == i_pred_target:
        intent_prediction_agreement += 1
    intent_predictions_target.append(intent_map[i_pred_target])
    intent_predictions_aligned.append(intent_map[i_pred_aligned])

print(sk_report(intent_predictions_target, intent_predictions_aligned, zero_division=0))
print(seq_report(aligned_slot_labels, aligned_slots_preds, zero_division=0))
print("-" * 80)
print("%.1f percent slots agreement (%d out of %d) for aligned & target." % (100 * (1 - (total_slots_disagreed / float(len(keys)))), len(keys) - total_slots_disagreed, len(keys)))
print("%.1f percent intent agreement (%d out of %d) for aligned & target." % (100 * (intent_prediction_agreement / float(len(keys))), intent_prediction_agreement, len(keys)))
print("-" * 80)

NLP/cross_aligner/config/m_atis.sh

Lines changed: 107 additions & 0 deletions
@@ -0,0 +1,107 @@
#!/bin/bash

# Train the English-only model (used later for zero-shot evaluation).
if [ $1 == "m_atis_english" ]
then
  python main.py --task m_atis \
                 --train_languages en \
                 --dev_languages en \
                 --test_languages en \
                 --model_dir m_atis_english \
                 --do_train \
                 --do_eval \
                 --cuda_device cuda:0 \
                 --train_batch_size 2 \
                 --eval_batch_size 2 \
                 --gradient_accumulation_steps 5 \
                 --num_train_epochs 10 \
                 --learning_rate 0.00002 \
                 --save_model \
                 --model_type $2 \
                 --max_seq_len 100
fi

# Train with auxiliary alignment losses (CrossAligner + XeroAlign, CoV-weighted).
if [ $1 == "m_atis_aligned" ]
then
  for lang in de es tr zh hi fr ja pt
  do
    python main.py --task m_atis \
                   --train_languages en \
                   --dev_languages $lang \
                   --test_languages $lang \
                   --model_dir m_atis_aligned_$lang \
                   --do_train \
                   --do_eval \
                   --cuda_device cuda:0 \
                   --train_batch_size 10 \
                   --eval_batch_size 2 \
                   --gradient_accumulation_steps 1 \
                   --num_train_epochs 10 \
                   --learning_rate 0.00002 \
                   --align_languages $lang \
                   --save_model \
                   --model_type $2 \
                   --max_seq_len 100 \
                   --use_aux_losses CA XA \
                   --use_weighting COV
  done
fi

# Evaluate the saved English model on each target language (zero-shot baseline).
if [ $1 == "m_atis_zero_shot" ]
then
  for lang in de es tr zh hi fr ja pt
  do
    python main.py --task m_atis \
                   --train_languages $lang \
                   --dev_languages $lang \
                   --test_languages $lang \
                   --model_dir m_atis_zero_shot_$lang \
                   --do_eval \
                   --cuda_device cuda:0 \
                   --eval_batch_size 10 \
                   --model_type $2 \
                   --load_eval_model m_atis_english \
                   --max_seq_len 100
  done
fi

# Train directly on labelled target-language data ('Target' in the paper).
if [ $1 == "m_atis_target" ]
then
  for lang in de es tr zh hi fr ja pt
  do
    python main.py --task m_atis \
                   --train_languages $lang \
                   --dev_languages $lang \
                   --test_languages $lang \
                   --model_dir m_atis_target_$lang \
                   --do_eval \
                   --do_train \
                   --cuda_device cuda:0 \
                   --train_batch_size 2 \
                   --eval_batch_size 2 \
                   --gradient_accumulation_steps 5 \
                   --num_train_epochs 10 \
                   --learning_rate 0.00002 \
                   --model_type $2 \
                   --save_model \
                   --max_seq_len 100
  done
fi

# Evaluate saved aligned models and write the debug dumps used by analysis.py.
if [ $1 == "m_atis_eval" ]
then
  for lang in de es tr zh hi fr ja pt
  do
    python main.py --task m_atis \
                   --train_languages $lang \
                   --dev_languages $lang \
                   --test_languages $lang \
                   --model_dir m_atis_eval_$lang \
                   --do_eval \
                   --cuda_device cuda:0 \
                   --eval_batch_size 10 \
                   --load_eval_model m_atis_aligned_$lang \
                   --model_type $2 \
                   --max_seq_len 100 \
                   --debug
  done
fi
