Vietnamese Legal Question Answering with Machine Reading Comprehension (MRC) and Answer Generation (AG) approaches.


Vietnamese Legal Question Answering: An Experimental Study [paper]

This paper investigates the legal question-answering (QA) task in Vietnamese. Unlike prior studies that only report results on machine reading comprehension (MRC), we compare strong QA models in two scenarios: MRC (span extraction) and answer generation (AG) (text generation). To do that, we first created a new dataset, ViBidLQA, based on the bidding law. The dataset was synthesized using a large language model (LLM) and corrected by two domain experts. We then trained a set of robust MRC and AG models on ViBidLQA and evaluated them on both ALQAC and the ViBidLQA test set. Experimental results show that for the MRC scenario, vi-mrc-large achieves the best scores, while for the AG scenario, ViT5 obtains good performance. The results also indicate that the new ViBidLQA dataset improves the performance of MRC models for domain adaptation on ALQAC.

Fig.1: An example of Question Answering

Problem Formulation

1. Machine Reading Comprehension (MRC)

QA is formulated as an MRC problem. Given a context $C = \{w_1, \dots, w_n\}$ and a question $Q$, the goal is to determine the start token $s$ and end token $e$ of the answer $A$, which form a span within $C$; the start and end positions follow directly from $s$ and $e$. MRC models transform $C$ and $Q$ into contextual representations $H_C$ and $H_Q$ and apply attention:

$$ A = \text{softmax}(H_Q \cdot H_C^T) $$
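As a concrete illustration of this attention step, here is a minimal NumPy sketch; the random matrices stand in for real contextual encodings, and the toy dimensions (4 question tokens, 10 context tokens, hidden size 8) are assumptions for illustration only:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(42)
d = 8                                # hidden size (toy value)
H_Q = rng.normal(size=(4, d))        # contextual representation of 4 question tokens
H_C = rng.normal(size=(10, d))       # contextual representation of 10 context tokens

# A = softmax(H_Q . H_C^T): each question token attends over all context tokens
A = softmax(H_Q @ H_C.T, axis=-1)    # shape (4, 10), rows sum to 1
```

Each row of `A` is a probability distribution telling how strongly one question token attends to each context token.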

Then the model predicts the answer span positions as:

$$ P_{\text{start}} = \text{softmax}(H \boldsymbol{w}_s), \qquad P_{\text{end}} = \text{softmax}(H \boldsymbol{w}_e) $$

where $H$ is the attended context representation and $\boldsymbol{w}_s$, $\boldsymbol{w}_e$ are learned start and end scoring vectors.
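The span-prediction step can be sketched in NumPy as follows; the representation and scoring vectors are random stand-ins for trained parameters, and the dimensions are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d = 6, 8                        # context length and hidden size (toy values)
H = rng.normal(size=(n, d))        # attended context representation
w_start = rng.normal(size=d)       # start-scoring vector (random stand-in)
w_end = rng.normal(size=d)         # end-scoring vector (random stand-in)

p_start = softmax(H @ w_start)     # P(token i is the answer start)
p_end = softmax(H @ w_end)         # P(token i is the answer end)
start = int(p_start.argmax())      # most likely start position
end = int(p_end.argmax())          # most likely end position
```

At inference time the predicted answer is the context span between `start` and `end` (real systems also enforce `start <= end` and a maximum span length).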

2. Answer Generation (AG)

Answer generation models produce a suitable answer $A$ by generating tokens conditioned on the context $C$ and the question $Q$, rather than extracting a span from $C$.

The training process uses the contextual vector $\boldsymbol{h}$ from the encoder to generate output tokens $y_t$ via the softmax function:

$$ P(y_t \mid y_{<t}, C, Q) = \text{softmax}(\theta \boldsymbol{h}_t) $$

where $\theta$ is the weight matrix. The objective is to minimize the negative log-likelihood of the conditional probability between the predicted outputs and the gold answer $A$:

$$ \mathcal{L} = -\sum_{t=1}^{k} \log P(y_t \mid y_{<t}, C, Q) $$

where $k$ is the number of tokens in $A$. For inference, given an input context and question, the trained AG models generate the corresponding answer.
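The AG training objective amounts to token-level cross-entropy over the $k$ gold answer tokens. A minimal NumPy sketch, where random logits stand in for a real decoder's outputs $\theta \boldsymbol{h}_t$ and the vocabulary size is a toy assumption:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
k, vocab = 5, 20                       # answer length and toy vocabulary size
logits = rng.normal(size=(k, vocab))   # stand-in for theta * h_t, one row per step
gold = rng.integers(0, vocab, size=k)  # gold answer token ids

probs = softmax(logits, axis=-1)       # P(y_t | y_<t, C, Q) at each decoding step
# negative log-likelihood summed over the k gold answer tokens
nll = -np.log(probs[np.arange(k), gold]).sum()
```

Minimizing `nll` pushes the model to assign high probability to each gold token at its decoding step; frameworks such as PyTorch compute the same quantity with a cross-entropy loss.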

For more details, access the repo: https://github.com/Shaun-le/ViQAG. In this repo, I'll dive deeply into the Machine Reading Comprehension (MRC) approach.

Installation Guide

1.1. Clone the repository:

git clone https://github.com/ntphuc149/ViLQA.git
cd ViLQA

1.2. Create a virtual environment (recommended):

python3 -m venv venv
source venv/bin/activate

1.3. Install the dependencies:

pip install -r requirements.txt

Usage Instructions

2.1. Configure the project:

Update the parameters in config.py to suit your dataset and requirements.
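For orientation, the kinds of parameters to adjust typically look like the sketch below. These field names are hypothetical, not the repository's actual settings; consult config.py itself for the real ones. The checkpoint name is the vi-mrc-large model mentioned in the paper, assumed here to be a Hugging Face-style identifier:

```python
# Hypothetical configuration sketch -- field names are illustrative only.
MODEL_NAME = "nguyenvulebinh/vi-mrc-large"  # MRC checkpoint to fine-tune (assumed id)
TRAIN_FILE = "data/train.json"              # path to your training split
MAX_SEQ_LENGTH = 384                        # max tokens per question+context pair
DOC_STRIDE = 128                            # overlap when splitting long contexts
LEARNING_RATE = 3e-5                        # common fine-tuning value
NUM_EPOCHS = 3
BATCH_SIZE = 16
```

A stride smaller than the maximum sequence length ensures that an answer cut off at a window boundary still appears whole in the next window.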

2.2. Fine-tune and evaluate the model:

Run the following command to start fine-tuning and evaluate the model:

python train.py

Contribution

We welcome contributions to this project. Please create a pull request or open an issue to discuss your ideas for improvement.

ViBidLawQA

We introduce a demo application system, ViBidLawQA, available here. A brief introduction to the system is also shown in a demo video in the repository.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Access our new dataset ViBidLQA

To access our data, please take the survey at: https://forms.gle/Ti4d31xKoa78Hud69

Citation

Coming soon
