edinet2dataset is a tool for constructing Japanese financial datasets from EDINET. It provides two classes:
- Downloader: Download financial reports of Japanese listed companies using the EDINET API.
- Parser: Extract key items such as the balance sheet (BS), cash flow statement (CF), profit and loss statement (PL), summary, and text from the downloaded TSV reports.
edinet2dataset is used to construct EDINET-Bench, a challenging Japanese financial benchmark dataset.
Install the dependencies using uv.
$ uv sync
To use the EDINET API, set your EDINET API key in a .env file. Refer to the official EDINET documentation to obtain an API key.
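For illustration, a .env file is a plain-text list of `KEY=value` lines. The sketch below reads such a file using only the standard library. The variable name `EDINET_API_KEY` is an assumption made for this example; check the repository source for the name the downloader actually reads.

```python
from pathlib import Path


def read_env_file(path: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values  # no file yet: nothing configured
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


if __name__ == "__main__":
    # EDINET_API_KEY is a hypothetical variable name used for illustration.
    env = read_env_file(".env")
    print("key configured:", bool(env.get("EDINET_API_KEY")))
```

Running this from the repository root is a quick way to confirm that the file is present and parseable before starting a download.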
- Search for a company name using a substring match query.
$ python src/edinet2dataset/downloader.py --query トヨタ
提出者名 | EDINETコード | 提出者業種 |
---|---|---|
トヨタ紡織株式会社 | E00540 | 輸送用機器 |
トヨタ自動車株式会社 | E02144 | 輸送用機器 |
トヨタファイナンス株式会社 | E05031 | サービス業 |
トヨタ モーター クレジット コーポレーション | E05904 | 外国法人・組合 |
トヨタ ファイナンス オーストラリア リミテッド | E05954 | 外国法人・組合 |
トヨタ モーター ファイナンス(ネザーランズ)ビーブイ | E20989 | 外国法人・組合 |
トヨタファイナンシャルサービス株式会社 | E23700 | 内国法人・組合(有価証券報告書等の提出義務者以外) |
- Download the annual report that Toyota Motor Corporation submitted between June 1, 2024 and June 28, 2024.
$ uv run python src/edinet2dataset/downloader.py --start_date 2024-06-01 --end_date 2024-06-28 --company_name "トヨタ自動車株式会社" --doc_type annual
Downloading documents (2024-06-01 - 2024-06-28): 100%|███████████████████████████████████████████| 28/28 [00:02<00:00, 9.76it/s]
- Extract balance sheet (BS) items from the annual report.
$ uv run python src/edinet2dataset/parser.py --file_path data/E02144/S100TR7I.tsv --category_list BS
2025-04-26 22:03:16.026 | INFO | __main__:parse_tsv:130 - Found 2179 unique elements in data/E02144/S100TR7I.tsv
{'現金及び預金': {'Prior1Year': '2965923000000', 'CurrentYear': '4278139000000'}, '現金及び現金同等物': {'Prior2Year': '6113655000000', 'Prior1Year': '1403311000000', 'CurrentYear': '9412060000000'}, '売掛金': {'Prior1Year': '1665651000000', 'CurrentYear': '1888956000000'}, '有価証券': {'Prior1Year': '1069082000000', 'CurrentYear': '3938698000000'}, '商品及び製品': {'Prior1Year': '271851000000', 'CurrentYear': '257113000000'}
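The parser output shown above is a dictionary that maps each item name to its values per relative period, with amounts stored as strings. As a minimal sketch of how such output could be consumed, the snippet below computes the year-over-year change for two entries copied from that sample. It assumes every item follows the same `Prior1Year` / `CurrentYear` shape; `year_over_year_change` is a helper written for this example and is not part of edinet2dataset.

```python
from typing import Optional

# Two entries copied from the sample parser output.
bs_items = {
    "現金及び預金": {"Prior1Year": "2965923000000", "CurrentYear": "4278139000000"},
    "売掛金": {"Prior1Year": "1665651000000", "CurrentYear": "1888956000000"},
}


def year_over_year_change(item: dict) -> Optional[float]:
    """Return (current - prior) / prior, or None if a value is missing, invalid, or zero."""
    try:
        prior = int(item["Prior1Year"])
        current = int(item["CurrentYear"])
    except (KeyError, ValueError):
        return None
    if prior == 0:
        return None
    return (current - prior) / prior


for name, values in bs_items.items():
    change = year_over_year_change(values)
    if change is not None:
        print(f"{name}: {change:+.1%}")
```

Converting with `int` rather than `float` keeps the yen amounts exact until the final division.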
You can reproduce EDINET-Bench by running the following commands.
Note: Since only the past 10 years of annual reports are available via the EDINET API, the time window used to construct the dataset shifts with each execution. As a result, datasets generated at different times may not be identical.
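To make the shifting window concrete, the sketch below computes the date range that a run on a given day could cover. It assumes the limit is exactly 10 calendar years counted back from the run date; the precise cutoff rule applied by the EDINET API is not stated here, and `ten_year_window` is a hypothetical helper.

```python
from datetime import date


def ten_year_window(today: date, years: int = 10) -> tuple[date, date]:
    """Return (earliest, latest) dates for a window of `years` ending today."""
    try:
        start = today.replace(year=today.year - years)
    except ValueError:
        # today is Feb 29 and the target year is not a leap year
        start = today.replace(year=today.year - years, day=28)
    return start, today


if __name__ == "__main__":
    earliest, latest = ten_year_window(date.today())
    print(f"Reports available from about {earliest} to {latest}")
```

Two runs a year apart would therefore differ by roughly one fiscal year of reports at each end of the window.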
Download all annual reports for the year 2024.
$ python scripts/prepare_edinet_corpus.py --doc_type annual --start_date 2024-01-01 --end_date 2025-01-01
Download securities reports spanning 10 years for approximately 4,000 companies from EDINET.
$ bash edinet_corpus.sh
Note: Avoid sending too many requests in parallel, as downloading reports from the past 10 years could place a significant load on EDINET.
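One way to keep the load low is to cap the number of concurrent workers and pause after each request. The following is a minimal sketch of that pattern, not the logic used by `edinet_corpus.sh`. The `fetch` callable, `run_throttled`, and the default values are all assumptions chosen for illustration; they are not limits recommended by EDINET.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def run_throttled(
    fetch: Callable[[str], str],
    doc_ids: Iterable[str],
    max_workers: int = 2,
    delay_seconds: float = 1.0,
) -> List[str]:
    """Call fetch(doc_id) with at most max_workers in flight, pausing after each call."""

    def polite_fetch(doc_id: str) -> str:
        result = fetch(doc_id)
        time.sleep(delay_seconds)  # pause this worker before it takes the next task
        return result

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input IDs.
        return list(pool.map(polite_fetch, doc_ids))


if __name__ == "__main__":
    # fake_fetch stands in for a real download call; it performs no network I/O.
    def fake_fetch(doc_id: str) -> str:
        return f"downloaded {doc_id}"

    print(run_throttled(fake_fetch, ["S1005SBA", "S1008JYI"], delay_seconds=0.1))
```

With `max_workers=1` the same function degrades to strictly sequential downloads, which is the most conservative setting.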
After the download finishes, you will get the following directory structure:
edinet_corpus
├── annual
│ ├── E00004
│ │ ├── S1005SBA.json
│ │ ├── S1005SBA.pdf
│ │ ├── S1005SBA.tsv
│ │ ├── S1008JYI.json
│ │ ├── S1008JYI.pdf
│ │ ├── S1008JYI.tsv
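Each document in this layout appears as a `.json`, `.pdf`, and `.tsv` file sharing one document ID, grouped under the company's EDINET code. A short sketch for indexing such a tree is shown below; it assumes the `edinet_corpus/<doc_type>/<EDINET code>/<document ID>.<ext>` layout from the listing, and `group_documents` is a hypothetical helper rather than part of the repository.

```python
from collections import defaultdict
from pathlib import Path


def group_documents(corpus_dir: str, doc_type: str = "annual") -> dict:
    """Map EDINET code -> {document ID -> sorted list of file suffixes}."""
    root = Path(corpus_dir, doc_type)
    if not root.is_dir():
        return {}
    grouped: dict = defaultdict(lambda: defaultdict(list))
    # Sorting the paths keeps the suffix lists in a stable order.
    for path in sorted(root.glob("*/*")):
        if path.is_file():
            grouped[path.parent.name][path.stem].append(path.suffix)
    return {code: dict(docs) for code, docs in grouped.items()}


if __name__ == "__main__":
    for code, docs in group_documents("edinet_corpus").items():
        print(code, len(docs), "documents")
```

An index like this makes it easy to spot documents whose TSV file is missing before passing them to the parser.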
Build a benchmark to detect accounting fraud in the securities report of a given fiscal year.
$ python scripts/fraud_detection/prepare_fraud.py
$ python scripts/fraud_detection/prepare_nonfraud.py
$ python scripts/fraud_detection/prepare_dataset.py
You can analyze the amended reports classified as fraud-related by running the following command:
$ python scripts/fraud_detection/analyze_fraud_explanation.py
Build a benchmark to forecast the following year’s profit based on the securities report of a given fiscal year.
$ python scripts/profit_forecast/prepare_dataset.py
Build a benchmark to predict a company's industry from its annual report.
$ python scripts/industry_prediction/prepare_dataset.py
@misc{sugiura2025edinet,
author={Issa Sugiura and Takashi Ishida and Taro Makino and Chieko Tazuke and Takanori Nakagawa and Kosuke Nakago and David Ha},
title={{EDINET-Bench: Evaluating LLMs on Complex Financial Tasks using Japanese Financial Statements}},
year={2025},
eprint={2506.08762},
archivePrefix={arXiv},
primaryClass={q-fin.ST},
url={https://arxiv.org/abs/2506.08762},
}
We acknowledge edgar-crawler as an inspiration for our tool. We also thank EDINET, which served as the primary resource for constructing our benchmark.