This repository contains all code used to generate the statistics and plots in the paper "Survey on English Entity Linking on Wikidata".
Additionally, the generated statistics and plots are available.
Note that the code does not necessarily run out of the box. Sometimes, small adjustments have to be made, such as setting a variable or adding a function call. In the following, all relevant scripts are briefly described and the necessary adjustments are pointed out.
To install the required dependencies, run
pip install -r requirements.txt
To regenerate the statistics, one has to download the necessary datasets. The links to the datasets can be found in the datasets folder; both dataset_links.json and links.txt contain the links to all datasets that were found.
Additionally, the download_datasets.py script downloads all easily downloadable datasets. Note that some datasets still need to be downloaded manually; these are pointed out when the script is executed. Run the script via
python -m download_datasets
inside the datasets folder.
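For orientation, the following is a minimal sketch of what such a download step does, assuming links.txt contains one URL per line; the actual download_datasets.py may differ, e.g. in file naming and error handling:

```python
# Minimal sketch of bulk-downloading dataset files; assumes links.txt holds
# one URL per line. The real download_datasets.py may behave differently.
import os
import requests

with open("links.txt", encoding="utf-8") as f:
    urls = [line.strip() for line in f if line.strip()]

os.makedirs("downloads", exist_ok=True)
for url in urls:
    target = os.path.join("downloads", url.rstrip("/").split("/")[-1])
    try:
        response = requests.get(url, timeout=60)
        response.raise_for_status()
        with open(target, "wb") as out:
            out.write(response.content)
        print(f"Downloaded {url}")
    except requests.RequestException as error:
        # Some datasets cannot be fetched automatically and must be downloaded manually.
        print(f"Could not download {url}: {error}")
```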
To repeat the ES-index EL tests, one has to set up an Elasticsearch (ES) instance.
To populate the ES index, one can execute the methods found in populate_ES.py.
The most important method is populate_entities_elasticsearch, which is run by default if
python -m utilities.populate_ES
is executed. One has to provide an N-Triples file containing only label statements about entities (the default filename is labels.nt, but it can be specified via --filename).
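To illustrate the indexing step, here is a rough sketch of how label triples could be bulk-indexed with the official Elasticsearch Python client. The index name wikidata_entities, the field names, and the line-based N-Triples parsing are assumptions for illustration, not necessarily what populate_entities_elasticsearch does:

```python
# Rough sketch: bulk-index English entity labels from an N-Triples file into
# Elasticsearch. Index and field names are illustrative assumptions.
import re
from elasticsearch import Elasticsearch, helpers

# Matches lines like:
# <http://www.wikidata.org/entity/Q42> <http://www.w3.org/2000/01/rdf-schema#label> "Douglas Adams"@en .
LABEL_TRIPLE = re.compile(r'<([^>]+)> <[^>]*label> "(.*)"@en \.')

def label_actions(path, index="wikidata_entities"):
    """Yield one bulk action per English label statement in the file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            match = LABEL_TRIPLE.match(line)
            if match:
                uri, label = match.groups()
                yield {"_index": index, "_source": {"uri": uri, "label": label}}

es = Elasticsearch("http://localhost:9200")
helpers.bulk(es, label_actions("labels.nt"))
```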
To obtain such pre-filtered N-Triples files, the methods in the filter_labels.py file can be used. They expect an existing compressed (gz) Wikidata N-Triples file and, if run via
python -m utilities.filter_labels {filename}
create two files, labels.nt and description.nt.
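Conceptually, the filtering only keeps label and description statements from the compressed dump. A simplified sketch follows; the exact predicate IRIs and the restriction to English literals are assumptions about what the actual script does:

```python
# Simplified sketch: split a gzipped Wikidata N-Triples dump into label and
# description files. Predicate IRIs and the @en filter are assumptions.
import gzip
import sys

LABEL = "<http://www.w3.org/2000/01/rdf-schema#label>"
DESCRIPTION = "<http://schema.org/description>"

with gzip.open(sys.argv[1], "rt", encoding="utf-8") as dump, \
        open("labels.nt", "w", encoding="utf-8") as labels, \
        open("description.nt", "w", encoding="utf-8") as descriptions:
    for line in dump:
        if '"@en ' not in line:
            continue  # keep only English literals in this sketch
        if LABEL in line:
            labels.write(line)
        elif DESCRIPTION in line:
            descriptions.write(line)
```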
After setting up the ES instance, the tests can be run by executing es_el_tests.py via
python -m dataset_evaluation.scripts.es_el_tests
The datasets have to be placed in the correct folder, as defined by the datasets_path variable in paths.py.
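At its core, such a test retrieves candidate entities for a mention by querying the label index. The following hedged sketch assumes the index and field names from the sketch above and the elasticsearch-py 8.x search API; the actual es_el_tests.py evaluates whole datasets and computes further statistics:

```python
# Hedged sketch of candidate retrieval for a single mention; index name,
# field name, and the example QID are illustrative assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def candidates(mention, k=10, index="wikidata_entities"):
    """Return the URIs of the top-k entities whose label matches the mention."""
    response = es.search(index=index, query={"match": {"label": mention}}, size=k)
    return [hit["_source"]["uri"] for hit in response["hits"]["hits"]]

# Example: check whether the gold entity appears among the retrieved candidates.
gold = "http://www.wikidata.org/entity/Q42"
print(gold in candidates("Douglas Adams"))
```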
All generated results can be reprocessed via
python -m dataset_evaluation.scripts.reprocess_es__results {folder with results}
to compute additional statistics.
The existing results can be found in the es folder for each dataset.
After downloading all datasets and placing them in the folder specified by the datasets_path variable in paths.py, one can execute the analysis script analyse_datasets.py via
python -m dataset_evaluation.scripts.analyse_datasets
The existing results can be found in the analysis folder for each dataset.
One has to download N-Triples dumps of the different KGs (DBpedia and Wikidata) as well as the JSON dump of Wikidata.
There exist multiple scripts to generate statistics. To calculate the mention overlap, calculate_mention_overlap.py and calculate_mention_overlap_wikidata.py can be used. Both expect an N-Triples file. The main difference is that calculate_mention_overlap_wikidata.py can only use a pre-filtered Wikidata dump containing labels.
The first is executed via
python -m kg_evaluation.scripts.calculate_mention_overlap {filename} {mention_dictionary_filename} {results_filename}
and the second via
python -m kg_evaluation.scripts.calculate_mention_overlap_wikidata {filename}
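To make the term mention overlap concrete, it can be thought of as the set overlap of the surface forms (labels) occurring in two KGs. Below is a minimal sketch under the assumption that both inputs are uncompressed N-Triples files with English label literals; the actual scripts additionally write the mention dictionary and result files:

```python
# Minimal sketch: mention overlap of two KGs as set overlap of their English
# label literals. File handling and the reported metrics are illustrative.
import re
import sys

LITERAL = re.compile(r'"(.*)"@en')

def labels(path):
    """Collect the lower-cased English label literals of an N-Triples file."""
    with open(path, encoding="utf-8") as f:
        return {m.group(1).lower() for line in f if (m := LITERAL.search(line))}

kg_a, kg_b = labels(sys.argv[1]), labels(sys.argv[2])
shared = kg_a & kg_b
print(f"shared labels: {len(shared)}")
print(f"Jaccard overlap: {len(shared) / len(kg_a | kg_b):.4f}")
```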
Both also generate a mention dictionary file which, in the case of Wikidata, is further processed by the wikidata_label_length.py script via
python -m kg_evaluation.scripts.wikidata_label_length {mention_dict_file}
It calculates statistics such as the mean, median, or p-percentile of the mention length.
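As an illustration of those statistics, assuming the mention dictionary is a JSON file whose keys are the mention strings (this format is an assumption):

```python
# Illustrative computation of mention-length statistics from a mention
# dictionary; the JSON structure (mention -> entities) is an assumption.
import json
import sys

import numpy as np

with open(sys.argv[1], encoding="utf-8") as f:
    mention_dict = json.load(f)

lengths = np.array([len(mention) for mention in mention_dict])
print("mean:", lengths.mean())
print("median:", np.median(lengths))
print("90th percentile:", np.percentile(lengths, 90))
```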
Language-wise statistics of Wikidata are generated by extract_language_statistics.py. It expects a Wikidata dump in JSON format and is executed via
python -m kg_evaluation.scripts.extract_language_statistics {filename}
It generates a JSON file.
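The Wikidata JSON dump stores one entity per line inside a large JSON array. The following is a rough sketch of counting, per language, how many entities carry a label; the gzip input and the output file name are assumptions, and the actual script extracts more statistics:

```python
# Rough sketch: per-language label counts from a Wikidata JSON dump
# (one entity per line in a JSON array). Input/output handling is assumed.
import gzip
import json
import sys
from collections import Counter

label_counts = Counter()
with gzip.open(sys.argv[1], "rt", encoding="utf-8") as dump:
    for line in dump:
        line = line.strip().rstrip(",")
        if not line or line in ("[", "]"):
            continue  # skip the surrounding array brackets
        entity = json.loads(line)
        label_counts.update(entity.get("labels", {}).keys())

with open("language_statistics.json", "w", encoding="utf-8") as out:
    json.dump(dict(label_counts), out, indent=2)
```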
The generated JSON files can be reprocessed to compute additional statistics by executing reprocess_language_statistics.py via
python -m kg_evaluation.scripts.reprocess_language_statistics {filename}
The existing results can be found in the results folder.
All scripts and additional data used for plotting can be found in the plotting folder, where all generated plots are stored as well.
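As a small example of the kind of plotting done there, the following sketch turns a language statistics file into a bar chart; the input format and file names are assumptions and not the exact plotting code used for the paper:

```python
# Small plotting sketch: bar chart of the ten most frequent label languages;
# input format and file names are assumptions.
import json

import matplotlib.pyplot as plt

with open("language_statistics.json", encoding="utf-8") as f:
    counts = json.load(f)

top = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:10]
languages, values = zip(*top)

plt.bar(languages, values)
plt.ylabel("entities with a label")
plt.title("Label coverage per language (top 10)")
plt.tight_layout()
plt.savefig("label_coverage.png")
```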