Publicly traded companies are required to submit periodic reports with eXtensible Business Reporting Language (XBRL) word-level tags. Manually tagging the reports is tedious and costly. We therefore introduce XBRL tagging as a new entity extraction task for the financial domain and release FiNER-139, a dataset of 1.1M sentences with gold XBRL tags. Unlike typical entity extraction datasets, FiNER-139 uses a much larger label set of 139 entity types. Most annotated tokens are numeric, with the correct tag per token depending mostly on context, rather than the token itself. We show that subword fragmentation of numeric expressions harms BERT's performance, allowing word-level BiLSTMs to perform better. To improve BERT's performance, we propose two simple and effective solutions that replace numeric expressions with pseudo-tokens reflecting original token shapes and numeric magnitudes. We also experiment with FIN-BERT, an existing BERT model for the financial domain, and release our own BERT (SEC-BERT), pre-trained on financial filings, which performs best. Through data and error analysis, we finally identify possible limitations to inspire future work on XBRL tagging.
Lefteris Loukas, Manos Fergadiotis, Ilias Chalkidis, Eirini Spyropoulou, Prodromos Malakasiotis, Ion Androutsopoulos and George Paliouras
FiNER: Financial Numeric Entity Recognition for XBRL Tagging
In the Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022), Volume 1: Long Papers, Dublin, Republic of Ireland, May 22-27, 2022
```bibtex
@inproceedings{loukas-etal-2022-finer,
    title = {FiNER: Financial Numeric Entity Recognition for XBRL Tagging},
    author = {Loukas, Lefteris and
              Fergadiotis, Manos and
              Chalkidis, Ilias and
              Spyropoulou, Eirini and
              Malakasiotis, Prodromos and
              Androutsopoulos, Ion and
              Paliouras, George},
    booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022)},
    publisher = {Association for Computational Linguistics},
    location = {Dublin, Republic of Ireland},
    year = {2022},
    url = {https://arxiv.org/abs/2203.06482}
}
```
- Dataset and Supported Task
- Dataset Repository
- Models Repository
- Install Python and Project Requirements
- Running an Experiment
- Setting up the experiment's parameters
## Dataset and Supported Task

FiNER-139 comprises 1.1M sentences annotated with eXtensible Business Reporting Language (XBRL) tags, extracted from the annual and quarterly reports of publicly traded companies in the US. Unlike other entity extraction tasks, such as named entity recognition (NER) or contract element extraction, which typically require identifying entities from a small set of common types (e.g., persons, organizations), FiNER-139 uses a much larger label set of 139 entity types. Another important difference from typical entity extraction is that FiNER-139 focuses on numeric tokens, with the correct tag depending mostly on context, not the token itself.
To promote transparency among shareholders and potential investors, publicly traded companies are required to file periodic financial reports annotated with tags from the eXtensible Business Reporting Language (XBRL), an XML-based language designed to facilitate the processing of financial information. However, manually tagging reports with XBRL tags is tedious and resource-intensive. We therefore introduce XBRL tagging as a new entity extraction task for the financial domain and study how financial reports can be automatically enriched with XBRL tags. To facilitate research towards automated XBRL tagging, we release FiNER-139.
## Dataset Repository

FiNER-139 is available at Hugging Face Datasets and you can load it as follows:

```python
import datasets

finer = datasets.load_dataset("nlpaueb/finer-139")
```
Note: You don't need to download or install the dataset manually; the code does that automatically.
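Each sample pairs a tokenized sentence with one label per token. Below is a minimal sketch for inspecting the first training sample, assuming the standard `datasets` token-classification layout with `tokens` and `ner_tags` fields:

```python
import datasets

finer = datasets.load_dataset("nlpaueb/finer-139")

# Label names (the B-/I- variants of the 139 entity types, plus "O")
# are stored in the dataset's features.
tag_names = finer["train"].features["ner_tags"].feature.names

# Parallel lists of tokens and tag ids for the first training sample.
sample = finer["train"][0]
for token, tag_id in zip(sample["tokens"], sample["ner_tags"]):
    print(token, tag_names[tag_id])
```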
## Models Repository

The SEC-BERT models are available at Hugging Face and you can load them as follows:
```python
from transformers import AutoTokenizer, AutoModel

# SEC-BERT-BASE: BERT pre-trained on financial filings.
tokenizer = AutoTokenizer.from_pretrained("nlpaueb/sec-bert-base")
model = AutoModel.from_pretrained("nlpaueb/sec-bert-base")
```

```python
from transformers import AutoTokenizer, AutoModel

# SEC-BERT-NUM: as above, but with numeric tokens replaced by a [NUM] pseudo-token.
tokenizer = AutoTokenizer.from_pretrained("nlpaueb/sec-bert-num")
model = AutoModel.from_pretrained("nlpaueb/sec-bert-num")
```

```python
from transformers import AutoTokenizer, AutoModel

# SEC-BERT-SHAPE: as above, but with numeric tokens replaced by shape
# pseudo-tokens (e.g. 23.5 -> [XX.X]).
tokenizer = AutoTokenizer.from_pretrained("nlpaueb/sec-bert-shape")
model = AutoModel.from_pretrained("nlpaueb/sec-bert-shape")
```
Note: You don't need to download any model manually; the code does that automatically.
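As a quick sanity check, the loaded model can encode a sentence into contextual embeddings. A minimal sketch using the standard `transformers` API (PyTorch is assumed here; the example sentence is arbitrary):

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/sec-bert-base")
model = AutoModel.from_pretrained("nlpaueb/sec-bert-base")

# Tokenize an (arbitrary) sentence in the style of a financial filing.
inputs = tokenizer(
    "Total revenue increased by $9.4 million in fiscal 2021.",
    return_tensors="pt",
)

# last_hidden_state holds one contextual vector per subword token.
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```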
## Install Python and Project Requirements

It is recommended to first create a virtual environment, via Python's `venv` module or Anaconda's `conda`.
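For example, using Python's built-in `venv` module (the environment name `finer-env` is just a placeholder):

```
python -m venv finer-env
source finer-env/bin/activate  # on Windows: finer-env\Scripts\activate
```

Then install the project requirements: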
```
pip install -r requirements.txt
```
The `requirements.txt` file contains:

```
click
datasets==2.1.0
gensim==4.2.0
regex
scikit-learn>=1.0.2
seqeval==1.2.2
tensorflow==2.8.0
tensorflow-addons==0.16.1
tf2crf==0.1.24
tokenizers==0.12.1
tqdm
transformers==4.18.0
wandb==0.12.16
wget
```
## Running an Experiment

To run an experiment, we execute the main script `run_experiment.py`, located at the root of the project, providing the following arguments:

- `method`: the neural model to run (possible values: `transformer`, `bilstm`)
- `mode`: the mode of the experiment. The following modes can be selected:
  - `train`: train a single model
  - `evaluate`: evaluate a pre-trained model

For example, to run a training experiment with a `transformer` model, we execute:

```
python run_experiment.py --method transformer --mode train
```
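Similarly, to re-evaluate a previously trained model (whose run folder is set via the `pretrained_model` parameter described below), we would execute:

```
python run_experiment.py --method transformer --mode evaluate
```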
## Setting up the experiment's parameters

We set the parameters of an experiment by editing the corresponding configuration file, located in the `configurations` folder of the project. This folder contains three `json` configuration files (`bilstm.json`, `transformer.json`, `transformer_bilstm.json`), where we select the parameters of the experiment we would like to run. For example, to run a `transformer` experiment, we edit the parameters of `transformer.json`.

These parameters are organized into the following groups:
- `train_parameters`: contains the major parameters of the experiment
  - `model_name`: the transformer model to train (e.g. `bert-base-uncased`, `sec-bert-base`, `sec-bert-num`, `sec-bert-shape`)
  - `max_length`: maximum length, in tokens, of an input sample
  - `replace_numeric_values`: boolean flag indicating whether to replace numeric values with their special shape tokens, e.g. `23.5 -> [XX.X]` (see the sketch after this list)
  - `subword_pooling`: which subword pooling to perform (possible values: `all`, `first`, `last`)
  - `use_fast_tokenizer`: boolean flag indicating whether to use fast tokenizers
- `general_parameters`: general parameters of the experiment
  - `debug`: boolean flag indicating whether to enable `debug` mode, which selects only a small portion of the dataset (100 samples for each of the train, validation and test splits) and also enables TensorFlow's eager execution
  - `loss_monitor`: the metric that the `early stopping` and `reduce learning rate on plateau` TensorFlow callbacks will monitor; possible values are `val_loss`, `val_micro_f1` and `val_macro_f1`
  - `early_stopping_patience`: used by the `early stopping` TensorFlow callback; the number of epochs to wait without improvement of `loss_monitor` before training stops
  - `reduce_lr_patience`: used by the `reduce learning rate on plateau` TensorFlow callback; the number of epochs to wait without improvement of `loss_monitor` before the learning rate is reduced by half
  - `reduce_lr_cooldown`: used by the `reduce learning rate on plateau` TensorFlow callback; the number of epochs to wait before resuming normal operation after the learning rate has been reduced
  - `epochs`: maximum number of iterations (epochs) over the corpus; usually a large value, letting `early stopping` end training once `early_stopping_patience` is reached
  - `batch_size`: number of samples per gradient update
  - `workers`: number of workers that create samples during model fitting; choose enough workers to saturate GPU utilization
  - `max_queue_size`: maximum number of samples in the queue; choose a large number to saturate GPU utilization
  - `use_multiprocessing`: boolean flag indicating the use of multiprocessing for generating samples
  - `wandb_entity`: your Weights & Biases username or team, used to log the run
  - `wandb_project`: the name of the Weights & Biases project where the run will be saved
- `hyper_parameters`: model hyper-parameters to use when training a single model
  - `learning_rate`: learning rate of the `Adam` optimizer
  - `n_layers`: number of stacked `BiLSTM` layers
  - `n_units`: number of units in each `BiLSTM` layer
  - `dropout_rate`: randomly sets input units to 0 with a frequency of `dropout_rate`
  - `crf`: boolean flag indicating the use of a CRF layer
- `evaluation`: evaluation parameters of the experiment
  - `pretrained_model`: name of the pre-trained model used when `evaluate` mode is selected; this is the folder name of the experiment to re-evaluate, located at `/data/experiments/runs` (e.g. `FINER139_2022_01_01_00_00_00`)
  - `splits`: list of dataset splits to evaluate (e.g. `validation`, `test`)
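As referenced above, the `replace_numeric_values` option maps each numeric value to a shape pseudo-token. The following is a minimal illustrative sketch of the idea (our own simplification, not the project's exact implementation): every digit is replaced by `X`, so all numbers with the same "shape" share one pseudo-token.

```python
import re

# Numbers consisting of digits with optional thousands/decimal separators.
NUMERIC = re.compile(r"\d+(?:[.,]\d+)*")

def to_shape_token(token: str) -> str:
    """Map a numeric token to its shape pseudo-token, e.g. 23.5 -> [XX.X]."""
    if NUMERIC.fullmatch(token):
        return "[" + re.sub(r"\d", "X", token) + "]"
    return token

tokens = ["Revenue", "rose", "to", "1,234.5", "from", "987"]
print([to_shape_token(t) for t in tokens])
# ['Revenue', 'rose', 'to', '[X,XXX.X]', 'from', '[XXX]']
```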
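Putting it together, a minimal `transformer.json` might look as follows. The group and parameter names come from the list above, while the specific values shown are only indicative and should be adapted to your experiment:

```json
{
  "train_parameters": {
    "model_name": "sec-bert-base",
    "max_length": 128,
    "replace_numeric_values": true,
    "subword_pooling": "first",
    "use_fast_tokenizer": true
  },
  "general_parameters": {
    "debug": false,
    "loss_monitor": "val_micro_f1",
    "early_stopping_patience": 3,
    "epochs": 30,
    "batch_size": 16,
    "wandb_entity": "my-team",
    "wandb_project": "finer-139"
  },
  "hyper_parameters": {
    "learning_rate": 5e-5,
    "crf": false
  },
  "evaluation": {
    "pretrained_model": "FINER139_2022_01_01_00_00_00",
    "splits": ["validation", "test"]
  }
}
```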