This is the official release page of the DCOH-120K dataset. To obtain the dataset, follow the application instructions below.
The DCOH-120K dataset is an online handwriting dataset with diverse corpus types, comprising 83,142 Chinese text lines (DCOH-Chinese) and 39,398 English text lines (DCOH-English) from 314 writers. The Chinese corpus consists of data from CLUECorpusSmall, as well as web-based collections, including news articles, forum discussions, and comments from various websites. The English corpus comprises content from Wikipedia and e-books. The dataset was collected using tablets and styluses, which provide pressure information and ensure a high sampling rate.
The DCOH-120K dataset can only be used for non-commercial research purposes. Scholars or organizations wishing to use the DCOH-120K dataset should first complete this Application Form and email it to us ([email protected] or [email protected]). When submitting the application form, please list or attach 1-2 of your publications from the past 6 years to demonstrate that you (or your team) conduct research in related fields such as optical character recognition, handwriting analysis and recognition, or document image processing. Currently, the dataset is freely available only to scholars in these fields. We will send you the download link and decompression password once your application has been received and approved.
The DCOH-120K dataset should be used and distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) License for non-commercial research purposes.
The dataset is organized in the following directory format:
DCOH-120K
├── Chinese_train
│ ├── 1.json
│ ├── 2.json
│ └── ...
├── Chinese_test
│ ├── 1.json
│ ├── 2.json
│ └── ...
├── English_train
│ ├── 1.json
│ ├── 2.json
│ └── ...
└── English_test
  ├── 1.json
  ├── 2.json
  └── ...
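The layout above can be traversed with a few lines of Python. The following is a minimal sketch, not part of the official release: the helper name `list_samples` and the root path are hypothetical, and it assumes the archive has been extracted so that the four split folders sit directly under the root.

```python
from pathlib import Path

# Split folder names as shown in the directory tree above.
SPLITS = ("Chinese_train", "Chinese_test", "English_train", "English_test")


def list_samples(root, split):
    """Return the JSON sample files of one split (hypothetical helper)."""
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    files = Path(root, split).glob("*.json")

    # Sort numeric names as numbers (1, 2, ..., 10) instead of as strings;
    # any non-numeric name is placed after the numeric ones.
    def sort_key(path):
        stem = path.stem
        return (0, int(stem), "") if stem.isdigit() else (1, 0, stem)

    return sorted(files, key=sort_key)
```

For example, `list_samples("DCOH-120K", "English_test")` would return the sample paths of the English test split, assuming the dataset was extracted into a folder named `DCOH-120K` in the working directory.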
Each data sample in the DCOH-120K dataset is stored as an individual JSON file. Each JSON file contains three key-value pairs: writer, label, and points. The points field stores the handwriting trajectory as a two-level nested list. In this list, each element of the first level represents a point in the trajectory. Each point is represented by seven attributes in the following order: x-coordinate, y-coordinate, stroke index, timestamp, pen tilt angle in the x-direction, pen tilt angle in the y-direction, and pressure value.
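As an illustration of the format described above, the following sketch reads one sample and groups its points into strokes. It is an assumption-laden example rather than an official loader: the function names and the short field labels in `FIELDS` are hypothetical, chosen only to mirror the attribute order given above, and the code makes no assumption about the units or value types of the attributes.

```python
import json

# Hypothetical labels for the seven per-point attributes, in the documented order.
FIELDS = ("x", "y", "stroke", "timestamp", "tilt_x", "tilt_y", "pressure")


def load_sample(path):
    """Read one sample file and return (writer, label, points)."""
    with open(path, encoding="utf-8") as f:
        sample = json.load(f)
    return sample["writer"], sample["label"], sample["points"]


def group_strokes(points):
    """Group points by stroke index, keeping the order in which strokes first appear."""
    strokes = {}
    for point in points:
        record = dict(zip(FIELDS, point))  # name each attribute by position
        strokes.setdefault(record["stroke"], []).append(record)
    return list(strokes.values())
```

A typical use would be `writer, label, points = load_sample(path)` followed by `strokes = group_strokes(points)`, after which each stroke is a list of dictionaries that can be plotted or converted to a feature sequence.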
Please cite our paper when using the dataset:
@article{Li_2025_PR,
title = {EGO-LM: An efficient, generic, and out-of-the-box language model for handwritten text recognition},
author = {Hongliang Li and Dezhi Peng and Lianwen Jin},
journal = {Pattern Recognition},
year = {2025},
volume = {159},
pages = {111130}
}
For any questions, please contact the authors via email at [email protected].