Commit c698646

Add LayoutLMv3 backbone (#656)
1 parent 679fd67 commit c698646

21 files changed: +2141 -745 lines

Diff for: README.md

+8-3
@@ -201,7 +201,8 @@ You can do MindSpore Lite inference in MindOCR using **MindOCR models** or **Thi
 <details open markdown>
 <summary>Key Information Extraction</summary>
 
-- [x] [LayoutXLM SER](configs/kie/vi_layoutxlm/README_CN.md) (arXiv'2016)
+- [x] [LayoutXLM](configs/kie/vi_layoutxlm/README_CN.md) (arXiv'2021)
+- [x] [LayoutLMv3](configs/kie/layoutlmv3/README.md) (arXiv'2022)
 
 </details>
 
@@ -287,6 +288,10 @@ Frequently asked questions about configuring environment and mindocr, please ref
 <details close markdown>
 <summary>News</summary>
 
+- 2024/04/01
+1. Add new trained models
+    - [LayoutLMv3](configs/kie/layoutlmv3/) for key information extraction
+
 - 2024/03/20
 1. Add new trained models
     - [Vary-toy](configs/llm/vary/vary_toy.yaml) for OCR large model, providing Qwen-1.8B LLM-based object detection and OCR abilities
@@ -299,8 +304,8 @@ Frequently asked questions about configuring environment and mindocr, please ref
 
 - 2023/12/14
 1. Add new trained models
-    - [LayoutXLM SER](configs/kie/vi_layoutxlm) for key information extraction
-    - [VI-LayoutXLM SER](configs/kie/layoutlm_series) for key information extraction
+    - [LayoutXLM](configs/kie/layoutxlm) for key information extraction
+    - [VI-LayoutXLM](configs/kie/vi_layoutxlm) for key information extraction
     - [PP-OCRv3 DBNet](configs/det/dbnet/db_mobilenetv3_ppocrv3.yaml) for text detection and [PP-OCRv3 SVTR](configs/rec/svtr/svtr_ppocrv3_ch.yaml) for recognition, supporting online inference and finetuning
 2. Add more benchmark datasets and their results
     - [XFUND](configs/kie/vi_layoutxlm/README_CN.md)

Diff for: README_CN.md

+8-3
@@ -194,7 +194,8 @@ python tools/infer/text/predict_system.py --image_dir {path_to_img or dir_to_img
 <details open markdown>
 <summary>Key Information Extraction</summary>
 
-- [x] [LayoutXLM SER](configs/kie/vi_layoutxlm/README_CN.md) (arXiv'2016)
+- [x] [LayoutXLM](configs/kie/vi_layoutxlm/README_CN.md) (arXiv'2021)
+- [x] [LayoutLMv3](configs/kie/layoutlmv3/README_CN.md) (arXiv'2022)
 
 </details>
 
@@ -282,6 +283,10 @@ MindOCR提供了[数据格式转换工具](tools/dataset_converters) ,以支
 <details close markdown>
 <summary>Details</summary>
 
+- 2024/04/01
+1. Add new models
+    - [LayoutLMv3](configs/kie/layoutlmv3/) for key information extraction
+
 - 2024/03/20
 1. Add new models
     - [Vary-toy](configs/llm/vary/vary_toy.yaml), an OCR large model supporting detection and OCR based on the Qwen-1.8B LLM
@@ -294,8 +299,8 @@ MindOCR提供了[数据格式转换工具](tools/dataset_converters) ,以支
 
 - 2023/12/14
 1. Add new models
-    - [LayoutXLM SER](configs/kie/vi_layoutxlm) for key information extraction
-    - [VI-LayoutXLM SER](configs/kie/layoutlm_series) for key information extraction
+    - [LayoutXLM](configs/kie/layoutxlm) for key information extraction
+    - [VI-LayoutXLM](configs/kie/vi_layoutxlm) for key information extraction
     - [PP-OCRv3 DBNet](configs/det/dbnet/db_mobilenetv3_ppocrv3.yaml) for text detection and [PP-OCRv3 SVTR](configs/rec/svtr/svtr_ppocrv3_ch.yaml) for text recognition, supporting online inference and finetuning
 2. Add more benchmark datasets and their results
     - [XFUND](configs/kie/vi_layoutxlm/README_CN.md)

Diff for: configs/kie/layoutlmv3/README.md

+250
@@ -0,0 +1,250 @@
English | [中文](README_CN.md)

# LayoutLMv3
<!--- Guideline: use url linked to abstract in ArXiv instead of PDF for fast loading. -->

> [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387)

## 1. Introduction

Unlike previous models in the LayoutLM series, LayoutLMv3 does not rely on a CNN or Faster R-CNN backbone to represent images. Instead, it directly uses patches of the document image, which greatly reduces the number of parameters and avoids complex document preprocessing such as manual annotation of target region boxes and document object detection. Its simple unified architecture and training objectives make LayoutLMv3 a versatile pretraining model suitable for both text-centric and image-centric document AI tasks.

Experimental results show that LayoutLMv3 achieves better performance with fewer parameters on the following datasets:

- Text-centric datasets: the FUNSD form understanding dataset, the CORD receipt understanding dataset, and the DocVQA document visual question answering dataset.
- Image-centric datasets: the RVL-CDIP document image classification dataset and the PubLayNet document layout analysis dataset.

LayoutLMv3 employs a text-image multimodal Transformer to learn cross-modal representations. Text vectors are obtained by adding the word embedding, the one-dimensional positional embedding, and the two-dimensional positional embedding of each word. The text of a document image and its corresponding two-dimensional positional information (layout information) are extracted with optical character recognition (OCR) tools. Because adjacent words often convey similar semantics, LayoutLMv3 shares the two-dimensional positional embedding among adjacent words, whereas in LayoutLM and LayoutLMv2 each word has its own two-dimensional positional embedding.

Image representations typically rely on grid features extracted by a CNN or region features extracted by Faster R-CNN, which increase computational cost or depend on region annotations. The authors instead obtain image features by linearly mapping image patches, a representation first proposed in ViT, which incurs minimal computational cost and does not rely on region annotations, effectively addressing the aforementioned issues. Specifically, the image is first resized to a uniform size (e.g., 224x224), then divided into fixed-size patches (e.g., 16x16), and each patch is linearly mapped to form a sequence of image features, to which a learnable one-dimensional positional vector is added to obtain the image vectors. [[1](#references)]

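To make the patch-embedding step above concrete, here is a minimal NumPy sketch. The 224x224 input and 16x16 patch sizes come from the text; the 768-dimensional hidden size and the random projection/position weights are illustrative assumptions (learned parameters in the real model), not MindOCR's implementation.

```python
import numpy as np

# Minimal sketch of the patch-embedding step described above (illustrative only).
def embed_image(image, patch_size=16, hidden_size=768, seed=0):
    rng = np.random.default_rng(seed)
    h, w, c = image.shape
    num_patches = (h // patch_size) * (w // patch_size)                  # 14 * 14 = 196
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(num_patches, -1)  # (196, 16*16*3)
    projection = rng.standard_normal((patches.shape[1], hidden_size)) * 0.02  # linear mapping
    position = rng.standard_normal((num_patches, hidden_size)) * 0.02         # learnable 1-D positional vector
    return patches @ projection + position                                    # (196, hidden_size)

image_tokens = embed_image(np.zeros((224, 224, 3)))
print(image_tokens.shape)  # (196, 768)
```
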
<p align="center">
  <img src="layoutlmv3_arch.jpg" width="1000" />
</p>
<p align="center">
  <em> Figure 1. LayoutLMv3 architecture [<a href="#references">1</a>] </em>
</p>

## 2. Results
<!--- Guideline:
Table Format:
- Model: model name in lower case with _ separator.
- Context: Training context denoted as {device}x{pieces}-{MS mode}, where mindspore mode can be G - graph mode or F - pynative mode with ms function. For example, D910x8-G is for training on 8 pieces of Ascend 910 NPU using graph mode.
- Top-1 and Top-5: Keep 2 digits after the decimal point.
- Params (M): # of model parameters in millions (10^6). Keep 2 digits after the decimal point.
- Recipe: Training recipe/configuration linked to a yaml config file. Use absolute url path.
- Download: url of the pretrained model weights. Use absolute url path.
-->

### Accuracy

According to our experiments, the performance and accuracy of the model trained ([Model Training](#32-model-training)) and evaluated ([Model Evaluation](#33-model-evaluation)) on the XFUND Chinese dataset are as follows:

<div align="center">

| **Model** | **Task** | **Context** | **Dataset** | **Model Params** | **Batch size** | **Graph train 1P (s/epoch)** | **Graph train 1P (ms/step)** | **Graph train 1P (FPS)** | **hmean** | **Config** | **Download** |
| :--------: | :------: | :------------: | :------: | :--------------: | :------------: | :--------------------------: | :--------------------------: | :----------------------: | :-------: | :------------------------------------------------: | :----------: |
| LayoutLMv3 | SER | D910x1-MS2.1-G | XFUND_zh | 265.8 M | 8 | 19.53 | 1094.86 | 7.37 | 91.88% | [yaml](../layoutlmv3/ser_layoutlmv3_xfund_zh.yaml) | ckpt(TODO) |

</div>

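The `hmean` column reports the harmonic mean of precision and recall (an F1-style score). A tiny reference implementation is shown below; the numbers are illustrative only, not the precision/recall of the run in the table.

```python
def hmean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"{hmean(0.92, 0.918):.4f}")  # illustrative values only -> 0.9190
```
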
## 3. Quick Start
### 3.1 Preparation

#### 3.1.1 Installation
Please refer to the [installation instruction](https://github.com/mindspore-lab/mindocr#installation) in MindOCR.

#### 3.1.2 Dataset Download

[The XFUND dataset](https://github.com/doc-analysis/XFUND) is used as the experimental dataset. XFUND is a multilingual benchmark proposed by Microsoft for the key information extraction (KIE) task. It consists of seven subsets, each containing 149 training samples and 50 validation samples, covering ZH (Chinese), JA (Japanese), ES (Spanish), FR (French), IT (Italian), DE (German), and PT (Portuguese).

A preprocessed [Chinese dataset](https://download.mindspore.cn/toolkits/mindocr/vi-layoutxlm/XFUND.tar) that can be used directly is provided for download:

```bash
mkdir train_data
cd train_data
wget https://download.mindspore.cn/toolkits/mindocr/vi-layoutxlm/XFUND.tar && tar -xf XFUND.tar
cd ..
```

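After extraction you can optionally verify the sample counts quoted above (149 training / 50 validation images). The sketch below assumes the archive unpacks to `train_data/XFUND/`, matching the `XFUND/zh_train` and `XFUND/zh_val` paths used in the configuration later in this document.

```python
import os

# Assumed layout after `tar -xf XFUND.tar` inside train_data/ (see the paths
# used by the YAML config below); adjust if your extraction path differs.
for split, expected in (("zh_train", 149), ("zh_val", 50)):
    image_dir = os.path.join("train_data", "XFUND", split, "image")
    num_images = len(os.listdir(image_dir))
    print(f"{split}: {num_images} images (expected {expected})")
```
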
#### 3.1.3 Dataset Usage

After decompression, the data folder structure is as follows:

```text
└─ zh_train/                Training set
    ├── image/              Folder for storing images
    ├── train.json          Annotation information
└─ zh_val/                  Validation set
    ├── image/              Folder for storing images
    ├── val.json            Annotation information
```

The annotation format of this dataset is:

```text
{
    "height": 3508,                     # Image height
    "width": 2480,                      # Image width
    "ocr_info": [
        {
            "text": "邮政地址:",          # Text content of a single field
            "label": "question",        # Category of the text
            "bbox": [261, 802, 483, 859],  # Bounding box of the text
            "id": 54,                   # Text index
            "linking": [[54, 60]],      # Links between this text and other texts, as [question, answer] pairs
            "words": []
        },
        {
            "text": "湖南省怀化市市辖区",
            "label": "answer",
            "bbox": [487, 810, 862, 859],
            "id": 60,
            "linking": [[54, 60]],
            "words": []
        }
    ]
}
```

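For instance, the `linking` field pairs `question` entities with their `answer` entities. The sketch below parses a record with exactly the fields shown above; it only illustrates the annotation format and is not MindOCR's data pipeline.

```python
import json

# A single record in the format shown above (values copied from the example).
record = json.loads("""{
  "height": 3508, "width": 2480,
  "ocr_info": [
    {"text": "邮政地址:", "label": "question", "bbox": [261, 802, 483, 859],
     "id": 54, "linking": [[54, 60]], "words": []},
    {"text": "湖南省怀化市市辖区", "label": "answer", "bbox": [487, 810, 862, 859],
     "id": 60, "linking": [[54, 60]], "words": []}
  ]
}""")

# Index entities by id, then resolve each question -> answer link.
by_id = {item["id"]: item for item in record["ocr_info"]}
for item in record["ocr_info"]:
    if item["label"] == "question":
        for question_id, answer_id in item["linking"]:
            print(by_id[question_id]["text"], "->", by_id[answer_id]["text"])
```
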
**Data configuration for model training**

If you want to reproduce the training of the model, it is recommended to modify the dataset-related fields in the YAML configuration file as follows:

```yaml
...
train:
  ...
  dataset:
    type: KieDataset
    dataset_root: path/to/dataset/          # Root directory of the training dataset
    data_dir: XFUND/zh_train/image/         # Directory of the training images, concatenated with `dataset_root` to form the complete path
    label_file: XFUND/zh_train/train.json   # Label file of the training dataset, concatenated with `dataset_root` to form the complete path
  ...
eval:
  dataset:
    type: KieDataset
    dataset_root: path/to/dataset/          # Root directory of the validation dataset
    data_dir: XFUND/zh_val/image/           # Directory of the validation images, concatenated with `dataset_root` to form the complete path
    label_file: XFUND/zh_val/val.json       # Label file of the validation dataset, concatenated with `dataset_root` to form the complete path
  ...
```

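A quick way to sanity-check these paths before launching training is to load the config and verify that the concatenated directories and label files exist. This is a hedged sketch using PyYAML and the key names from the snippet above; it is not part of MindOCR's tooling.

```python
import os
import yaml  # PyYAML

# Load the SER config referenced in this README and check the dataset paths.
with open("configs/kie/layoutlmv3/ser_layoutlmv3_xfund_zh.yaml", "r", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

for split in ("train", "eval"):
    ds = cfg[split]["dataset"]
    data_dir = os.path.join(ds["dataset_root"], ds["data_dir"])
    label_file = os.path.join(ds["dataset_root"], ds["label_file"])
    print(f"{split}: images dir exists={os.path.isdir(data_dir)}, "
          f"label file exists={os.path.isfile(label_file)}")
```
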
#### 3.1.4 Check YAML Config Files
Apart from the dataset setting, please also check the following important args: `system.distribute`, `system.val_while_train`, `common.batch_size`, `train.ckpt_save_dir`, `train.dataset.dataset_path`, `eval.ckpt_load_path`, `eval.dataset.dataset_path`, `eval.loader.batch_size`. Explanations of these important args:

```yaml
system:
  mode:
  distribute: False                           # `True` for distributed training, `False` for standalone training
  amp_level: 'O0'
  seed: 42
  val_while_train: True                       # Validate while training
  drop_overflow_update: False
model:
  type: kie
  transform: null
  backbone:
    name: layoutlmv3
  head:
    name: TokenClassificationHead
    num_classes: 7
    use_visual_backbone: True
    use_float16: True
  pretrained:
...
train:
  ckpt_save_dir: './tmp_kie_ser'              # Directory for saving training results (checkpoints, per-epoch performance and curves)
  dataset_sink_mode: False
  dataset:
    type: KieDataset
    dataset_root: path/to/dataset/            # Path of the training dataset
    data_dir: XFUND/zh_train/image/           # Path of the training image directory
    label_file: XFUND/zh_train/train.json     # Path of the training label file
  ...
eval:
  ckpt_load_path: './tmp_kie_ser/best.ckpt'   # Checkpoint file path
  dataset_sink_mode: False
  dataset:
    type: KieDataset
    dataset_root: path/to/dataset/            # Path of the evaluation dataset
    data_dir: XFUND/zh_val/image/             # Path of the evaluation image directory
    label_file: XFUND/zh_val/val.json         # Path of the evaluation label file
    ...
  ...
...
```

**Notes:**
- As the global batch size (batch_size x num_devices) is important for reproducing the result, please adjust `batch_size` accordingly to keep the global batch size unchanged for a different number of GPUs/NPUs, or adjust the learning rate linearly to a new global batch size.

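For example, with the linear scaling rule the learning rate grows in proportion to the global batch size. The numbers below are purely illustrative assumptions; the recipe's actual reference learning rate is defined in the YAML file, not here.

```python
# Linear learning-rate scaling when the global batch size changes.
# All numbers are illustrative assumptions, not values from the recipe.
ref_batch_size, ref_num_devices, ref_lr = 8, 1, 5e-5   # assumed reference setup (1 card)
new_batch_size, new_num_devices = 8, 8                 # e.g. keeping batch_size=8 on 8 cards

ref_global = ref_batch_size * ref_num_devices
new_global = new_batch_size * new_num_devices
scaled_lr = ref_lr * new_global / ref_global
print(scaled_lr)  # 0.0004
```
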
### 3.2 Model Training
<!--- Guideline: Avoid using shell script in the command line. Python script preferred. -->
* Distributed Training

It is easy to reproduce the reported results with the pre-defined training recipe. For distributed training on multiple Ascend 910 devices, please set the configuration parameter `distribute` to True and run:

```shell
# distributed training on multiple GPU/Ascend devices
mpirun --allow-run-as-root -n 8 python tools/train.py --config configs/kie/layoutlmv3/ser_layoutlmv3_xfund_zh.yaml
```

* Standalone Training

If you want to train or finetune the model on a smaller dataset without distributed training, please set the configuration parameter `distribute` to False and run:

```shell
# standalone training on a CPU/GPU/Ascend device
python tools/train.py --config configs/kie/layoutlmv3/ser_layoutlmv3_xfund_zh.yaml
```

The training results (including checkpoints, per-epoch performance and curves) will be saved in the directory specified by the arg `ckpt_save_dir`. The default directory is `./tmp_kie_ser`.

### 3.3 Model Evaluation

To evaluate the accuracy of the trained model, you can use `eval.py`. Please set the checkpoint path in the arg `ckpt_load_path` in the `eval` section of the YAML config file, set `distribute` to False, and then run:

```shell
python tools/eval.py --config configs/kie/layoutlmv3/ser_layoutlmv3_xfund_zh.yaml
```

### 3.4 Model Inference

To perform inference with a pretrained model, you can use `tools/infer/text/predict_ser.py` and visualize the results:

```shell
python tools/infer/text/predict_ser.py --rec_algorithm CRNN_CH --image_dir {dir of images or path of image}
```

As an example of entity recognition in Chinese forms, use the script to recognize the entities in the form image `configs/kie/vi_layoutxlm/example.jpg`. The results are stored in the `./inference_results` folder by default; you can also customize the output path through the `--draw_img_save_dir` command-line parameter.

<p align="center">
  <img src="../vi_layoutxlm/example.jpg" width="1000" />
</p>
<p align="center">
  <em> example.jpg </em>
</p>

The recognition results are shown in the image below, which is saved as `inference_results/example_ser.jpg`.

<p align="center">
  <img src="../vi_layoutxlm/example_ser.jpg" width="1000" />
</p>
<p align="center">
  <em> example_ser.jpg </em>
</p>

## References
<!--- Guideline: Citation format GB/T 7714 is suggested. -->

[1] Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei. LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. arXiv preprint arXiv:2204.08387, 2022.
