
Commit 045f35a: Initial commit

20 files changed: +3373 additions, -0 deletions

Readme.md

Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
# Introduction

This repository is the official implementation of our paper "[Robust Domain Misinformation Detection via Multi-modal Feature Alignment](https://ieeexplore.ieee.org/abstract/document/10288548/)". The contributions of this work can be summarized as follows:

1. A unified framework that tackles both domain generalization (target-domain data is unavailable) and domain adaptation (target-domain data is available). This is necessary because obtaining sufficient unlabeled target-domain data at an early stage of misinformation dissemination is difficult.
2. Inter-domain and cross-modality alignment modules that reduce the domain shift and the modality gap. These modules aim to learn rich features for misinformation detection. Both modules are plug-and-play and can potentially be applied to other multi-modal tasks.

Additionally, we believe that the multimodal generalization algorithms proposed in our work can be applied to other multimodal tasks. If you have any questions about this paper, please do not hesitate to ask.

# To run our code

1. Download the dataset and pretrained models from OneDrive and unzip them in the project directory.

2. `drive_outmodel.py` is the main file that drives our algorithms. It uses the comet_ml package to manage ML experiments; either remove the Comet-related code or fill in your `api_key` and the other parameters in the following snippet in that file:

```python
from comet_ml import Experiment

experiment = Experiment(
    api_key="",       # your Comet API key
    project_name="",  # your Comet project name
    workspace="",     # your Comet workspace
)
```

3. Multimodal JMMD, devised in our work for multimodal generalization tasks, captures cross-modal correlations among multiple modalities with theoretical guarantees. For a more robust implementation, we advise using the MMD implementation in DomainBed, which fixes the kernel parameters so that only the weight $\lambda_1$ of the JMMD loss needs tuning; see the sketch below. Otherwise, you can use our implementation and set the kernel manually.
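
For reference, here is a minimal sketch of this idea (not the repository's exact code): each modality gets a sum of Gaussian kernels over a fixed bandwidth grid, in the spirit of DomainBed's fixed-kernel MMD, and the per-modality kernel matrices are multiplied elementwise to form the joint kernel of JMMD. All tensor names, shapes, and the bandwidth grid are illustrative assumptions.

```python
import torch

def gaussian_kernel(x, y, gammas=(0.001, 0.01, 0.1, 1, 10, 100, 1000)):
    # Sum of RBF kernels with fixed bandwidths, so no kernel parameter is tuned.
    d = torch.cdist(x, y) ** 2                     # pairwise squared distances, (n, m)
    return sum(torch.exp(-g * d) for g in gammas)  # (n, m) kernel matrix

def jmmd(src_feats, tgt_feats):
    # Joint MMD: multiply per-modality kernel matrices elementwise,
    # then form the usual squared-MMD estimate.
    k_ss = k_tt = k_st = 1.0
    for xs, xt in zip(src_feats, tgt_feats):
        k_ss = k_ss * gaussian_kernel(xs, xs)
        k_tt = k_tt * gaussian_kernel(xt, xt)
        k_st = k_st * gaussian_kernel(xs, xt)
    return k_ss.mean() + k_tt.mean() - 2 * k_st.mean()

# Illustrative: text and image features for source and target batches.
text_s, img_s = torch.randn(32, 256), torch.randn(32, 256)
text_t, img_t = torch.randn(32, 256), torch.randn(32, 256)
lambda_1 = 1.0  # the only hyperparameter left to tune once kernels are fixed
loss = lambda_1 * jmmd([text_s, img_s], [text_t, img_t])
```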

4. Finally, run the code as follows:

```
sh multi_out_model.sh
```

# Citation

If you find this repository helpful, please cite our paper:

```
@ARTICLE{10288548,
  author={Liu, Hui and Wang, Wenya and Sun, Hao and Rocha, Anderson and Li, Haoliang},
  journal={IEEE Transactions on Information Forensics and Security},
  title={Robust Domain Misinformation Detection via Multi-modal Feature Alignment},
  year={2023},
  volume={},
  number={},
  pages={1-1},
  doi={10.1109/TIFS.2023.3326368}}
```

If you are interested in multimodal misinformation detection, another of our papers on this task may also help: https://arxiv.org/abs/2305.05964. Although it was accepted to Findings of ACL, it received three strong accepts :), so it can serve as a good reference.

```
@inproceedings{DBLP:conf/acl/LiuWL23,
  author    = {Hui Liu and
               Wenya Wang and
               Haoliang Li},
  editor    = {Anna Rogers and
               Jordan L. Boyd{-}Graber and
               Naoaki Okazaki},
  title     = {Interpretable Multimodal Misinformation Detection with Logic Reasoning},
  booktitle = {Findings of the Association for Computational Linguistics: {ACL} 2023,
               Toronto, Canada, July 9-14, 2023},
  pages     = {9781--9796},
  publisher = {Association for Computational Linguistics},
  year      = {2023},
  url       = {https://doi.org/10.18653/v1/2023.findings-acl.620},
  doi       = {10.18653/V1/2023.FINDINGS-ACL.620},
  timestamp = {Thu, 10 Aug 2023 12:35:42 +0200},
  biburl    = {https://dblp.org/rec/conf/acl/LiuWL23.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```

baseline/__init__.py

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
from .download_pretrain import *
from .generate_vocab import *
from .mvae import *
from .textcnn import *
from .da_baseline import *
from .dg_baseline import *

baseline/download_pretrain.py

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
from transformers import AutoConfig, AutoModel, AutoTokenizer

# Download roberta-base from the Hugging Face hub.
model_name = "roberta-base"
model_path = "./pretrain_model/roberta"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)

# Cache all three artifacts locally so they can be loaded offline later.
tokenizer.save_pretrained(model_path)
model.save_pretrained(model_path)
config.save_pretrained(model_path)
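
After this script runs, the cached copies can be loaded from the local directory instead of the hub; a minimal usage sketch:

```python
from transformers import AutoModel, AutoTokenizer

# Load the locally cached roberta-base saved by the script above.
tokenizer = AutoTokenizer.from_pretrained("./pretrain_model/roberta")
model = AutoModel.from_pretrained("./pretrain_model/roberta")
```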

baseline/generate_vocab.py

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
import torch
from utils import PhemeSet, TwitterSet
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

"""
Build the vocabulary for both datasets (the Twitter vocabulary here;
the Pheme lines are commented out).
"""


if __name__ == '__main__':
    twitter_set = TwitterSet(json_path="../final_twitter.json", img_path="../twitter/images",
                             type=0, events=["sandy", "boston", "sochi", "malaysia"],
                             visual_type='resnet', stage='train')

    twitter_vocab_path = "../vocab/twitter_vocab.pt"
    pheme_vocab_path = "../vocab/pheme_vocab.pt"

    print(twitter_set[0][0])
    tokenizer = get_tokenizer("spacy")
    # Tokenize every training caption, then build a frequency-filtered vocabulary.
    lines = []
    for i in range(len(twitter_set)):
        lines.append(tokenizer(twitter_set[i][0].strip()))
    vocab = build_vocab_from_iterator(iter(lines), specials=["<unk>", '<pad>'], min_freq=5)
    vocab.set_default_index(vocab['<unk>'])
    print(len(vocab))
    # torch.save(vocab, pheme_vocab_path)
    torch.save(vocab, twitter_vocab_path)
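
Downstream, the saved vocabulary can be loaded and used to numericalize tokenized text; a minimal sketch, assuming a torchtext version where `Vocab` objects are callable (the sentence is illustrative):

```python
import torch
from torchtext.data.utils import get_tokenizer

# Load the saved vocabulary and map tokens to indices; OOV tokens fall back to <unk>.
vocab = torch.load("../vocab/twitter_vocab.pt")
tokenizer = get_tokenizer("spacy")
token_ids = vocab(tokenizer("breaking news from boston"))
```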

baseline/textcnn.py

Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
import torch.nn as nn
import torch.nn.functional as F
import os
import torch
from transformers import AutoConfig, AutoModel

"""
TextCNN serves either as a uni-modal classifier or as a textual feature
extractor, depending on the num_classes parameter.
"""


class TextCNN(nn.Module):
    def __init__(self, kernel_sizes, num_filters, num_classes, d_prob, mode='rand', dataset_name="Pheme"):
        """
        :param kernel_sizes: convolution kernel sizes, one conv branch per size
        :param num_filters: number of filters per kernel size
        :param num_classes: output dimension (classes, or feature size when used as an extractor)
        :param d_prob: dropout probability
        :param mode: one of rand, roberta-yes, roberta-non, bert-yes, bert-non
        :param dataset_name: "Pheme" or "Twitter"; selects the vocabulary in rand mode
        """
        super(TextCNN, self).__init__()
        self.kernel_sizes = kernel_sizes
        self.num_filters = num_filters
        self.num_classes = num_classes
        self.d_prob = d_prob
        # One of: rand, roberta-yes, roberta-non, bert-yes, bert-non.
        self.mode = mode
        self.vocab = None
        self.dataset_name = dataset_name
        self.vocab_size = 1000
        self.embedding_dim = 100
        self.embedding = None
        # Only the rand mode needs a padding_idx; the BERT/RoBERTa embeddings do not.
        self.load_embeddings()
        self.conv = nn.ModuleList([nn.Conv1d(in_channels=self.embedding_dim,
                                             out_channels=num_filters,
                                             kernel_size=k, stride=1) for k in kernel_sizes])
        self.dropout = nn.Dropout(d_prob)
        self.fc = nn.Linear(len(kernel_sizes) * num_filters, num_classes)

    def forward(self, x):
        # x: (batch, seq_len) token ids; embed, then move channels first for Conv1d:
        # (batch, seq_len, embed_dim) -> (batch, embed_dim, seq_len)
        x = self.embedding(x).transpose(1, 2)
        x = [F.relu(conv(x)) for conv in self.conv]
        # Max-pool each conv branch over time, then concatenate the branches.
        x = [F.max_pool1d(c, c.size(-1)).squeeze(dim=-1) for c in x]
        x = torch.cat(x, dim=1)
        x = self.fc(self.dropout(x))
        return x.squeeze()

    def load_embeddings(self):
        if self.mode == 'rand':
            if self.dataset_name == "Pheme":
                path_saved = "/data/sunhao/robustfakenews/dataset/vocab/pheme_vocab.pt"
            elif self.dataset_name == "Twitter":
                path_saved = "/data/sunhao/robustfakenews/dataset/vocab/twitter_vocab.pt"
            else:
                raise ValueError('With randomly initialized embeddings, dataset_name must be "Pheme" or "Twitter".')
            vocab = torch.load(path_saved)
            self.vocab_size = len(vocab)
            self.embedding_dim = 100
            self.embedding = nn.Embedding(self.vocab_size, self.embedding_dim, padding_idx=vocab['<pad>'])
            self.embedding.weight.requires_grad = True
            del vocab
            print('Randomly initialized embeddings are used.')
        else:
            mode = self.mode.split("-")
            assert len(mode) == 2
            path_saved = "/data/sunhao/robustfakenews/pretrain_model"
            if mode[0] == 'roberta':
                config = AutoConfig.from_pretrained(os.path.join(path_saved, "roberta"))
                roberta = AutoModel.from_pretrained(os.path.join(path_saved, "roberta"), config=config)
                weight = roberta.get_input_embeddings().weight
                self.vocab_size, self.embedding_dim = weight.shape
                # from_pretrained is a classmethod; whether to freeze is decided below from mode[1].
                self.embedding = nn.Embedding.from_pretrained(weight, freeze=False)
                del roberta, config, weight
            elif mode[0] == 'bert':
                config = AutoConfig.from_pretrained(os.path.join(path_saved, "bert"))
                bert = AutoModel.from_pretrained(os.path.join(path_saved, "bert"), config=config)
                weight = bert.get_input_embeddings().weight
                self.vocab_size, self.embedding_dim = weight.shape
                self.embedding = nn.Embedding.from_pretrained(weight, freeze=False)
                del bert, config, weight
            else:
                raise ValueError('Unexpected value of mode. Choose from rand, roberta-yes, roberta-non, bert-yes, bert-non.')

            if mode[1] == 'non':
                # Set requires_grad on the parameter itself, not on .data, so freezing takes effect.
                self.embedding.weight.requires_grad = False
                print('Loaded pretrained embeddings, weights are not trainable.')
            elif mode[1] == 'yes':
                self.embedding.weight.requires_grad = True
                print('Loaded pretrained embeddings, weights are trainable.')
            else:
                raise ValueError('Unexpected value of mode[1]. Choose "yes" or "non".')
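
For orientation, a hypothetical instantiation of the two roles the module docstring describes (all hyperparameter values are illustrative, and the pretrained paths above must exist):

```python
# As a binary classifier over frozen RoBERTa input embeddings.
clf = TextCNN(kernel_sizes=[3, 4, 5], num_filters=100, num_classes=2,
              d_prob=0.5, mode='roberta-non')

# As a textual feature extractor: num_classes doubles as the output feature dimension.
extractor = TextCNN(kernel_sizes=[3, 4, 5], num_filters=100, num_classes=256,
                    d_prob=0.5, mode='rand', dataset_name='Twitter')
```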
