ICON (Implicit CONcept Insertion) is a self-supervised taxonomy enrichment system designed for implicit taxonomy completion.
ICON works by representing new concepts with combinations of existing concepts. It uses a seed to retrieve a cluster of closely related concepts, zooming in on a small facet of the taxonomy. It then enumerates subsets of the cluster and uses a generative model to create, for each subset, a virtual concept expected to represent the subset's semantic union. Each generated concept goes through a series of validations, and its placement in the taxonomy is decided by a search based on a sequence of subsumption tests. The outcome for each validated concept is either a new concept inserted into the taxonomy or a merger with existing concepts. The taxonomy is updated dynamically at each step.
ICON depends on the following packages:

- numpy
- owlready2
- networkx
- faiss
- tqdm
- nltk
The pipeline for training sub-models that we provide in this README further depends on the following packages:

- torch
- pandas
- transformers
- datasets
- evaluate
- info-nce-pytorch
ICON requires Python 3.9 or higher.
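The dependencies can typically be installed from PyPI, for example as below (assuming the CPU build of FAISS; the exact package names may differ in your environment):

```bash
pip install numpy owlready2 networkx faiss-cpu tqdm nltk
pip install torch pandas transformers datasets evaluate info-nce-pytorch
```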
The simplest usage of ICON is with a Jupyter notebook. A walkthrough tutorial is provided at `demo.ipynb`. Before initialising an ICON object, make sure you have your data and three dependent sub-models:
- `data`: A taxonomy (`taxo_utils.Taxonomy` object, which can be loaded from JSON via `taxo_utils.from_json`; for details see File IO Format) or an OWL ontology (`owlready2.Ontology` object)
- `emb_model` (recommended signature: `emb_model(query: List[str], *args, **kwargs) -> np.ndarray`): Embedding model for one or a batch of sentences
- `gen_model` (recommended signature: `gen_model(labels: List[str], *args, **kwargs) -> str`): Generates the union label for an arbitrary set of concept labels
- `sub_model` (recommended signature: `sub_model(sub: Union[str, List[str]], sup: Union[str, List[str]], *args, **kwargs) -> numpy.ndarray`): Predicts whether each `sup` subsumes the corresponding `sub`, given two lists of `sub` and `sup`
The sub-models are essential plug-ins for ICON. Everything above (except `emb_model` or `gen_model` if you are using ICON in a particular setting, to be explained below) will be required for ICON to function.
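As a minimal sketch of what the three callables could look like once wrapped, the snippet below assumes HuggingFace `transformers` checkpoints. The checkpoint paths, the pooling choice, and the positive-class index are illustrative assumptions, not ICON's API; see `/demo.ipynb` for the actual wrapping.

```python
# Illustrative sub-model wrappers matching the recommended signatures.
# Checkpoints marked "hypothetical" do not ship with ICON.
from typing import List, Union

import numpy as np
import torch
from transformers import (AutoModel, AutoModelForSeq2SeqLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

emb_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
emb_enc = AutoModel.from_pretrained("bert-base-uncased").eval()

def emb_model(query: List[str], *args, **kwargs) -> np.ndarray:
    """Return one embedding row per sentence (mean-pooled hidden states)."""
    with torch.no_grad():
        batch = emb_tok(query, padding=True, truncation=True, return_tensors="pt")
        hidden = emb_enc(**batch).last_hidden_state
        mask = batch["attention_mask"].unsqueeze(-1)
        return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

gen_tok = AutoTokenizer.from_pretrained("finetuned-t5")  # hypothetical checkpoint
gen_lm = AutoModelForSeq2SeqLM.from_pretrained("finetuned-t5").eval()

def gen_model(labels: List[str], *args, **kwargs) -> str:
    """Generate one union label for a set of concept labels."""
    with torch.no_grad():
        ids = gen_tok("; ".join(labels), return_tensors="pt").input_ids
        out = gen_lm.generate(ids, max_new_tokens=32)
        return gen_tok.decode(out[0], skip_special_tokens=True)

sub_tok = AutoTokenizer.from_pretrained("finetuned-bert")  # hypothetical checkpoint
sub_clf = AutoModelForSequenceClassification.from_pretrained("finetuned-bert").eval()

def sub_model(sub: Union[str, List[str]], sup: Union[str, List[str]],
              *args, **kwargs) -> np.ndarray:
    """Return one probability per (sub, sup) pair that sup subsumes sub."""
    subs = [sub] if isinstance(sub, str) else list(sub)
    sups = [sup] if isinstance(sup, str) else list(sup)
    with torch.no_grad():
        batch = sub_tok(subs, sups, padding=True, truncation=True,
                        return_tensors="pt")
        logits = sub_clf(**batch).logits
        # Assumes class index 1 is the "subsumes" label.
        return torch.softmax(logits, dim=-1)[:, 1].numpy()
```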
We offer a quick pipeline for fine-tuning solid, well-known pretrained language models (of roughly 2020-era strength) to obtain the three required sub-models:
- Use the scripts under `/experiments/data_wrangling` to build the training and evaluation data for each sub-model using your taxonomy (or the Google PT taxonomy placed there by default).
  - Open a terminal and `cd` to `/experiments/data_wrangling`.
  - Adjust the data building settings by modifying `data_config.json`. A list of available settings and an explanation of the data format is provided below.
  - Execute the scripts with `python ./FILENAME.py`, where `FILENAME` is replaced by the name of the script you wish to run.
- Download the pretrained language models from HuggingFace. Here we use BERT for both `emb_model` and `sub_model`, and T5 for `gen_model`.
- Fine-tune the pretrained language models. A demonstration of fine-tuning each model can be found in the notebooks under `/experiments/model_training`. Note that the tuned language models aren't yet the sub-models to be called by ICON; an example of wrapping the models for ICON, and an entire run, can be found at `/demo.ipynb`.
Please note that this is only a suggestion for the sub-models; deploying more recent models may enhance ICON's performance.
The `/experiments/data_wrangling/data_config.json` file contains the variable parameters for each of the dataset generation scripts we provide:

- Universal parameters:
  - `random_seed`: If set, this seed will be passed to the NumPy pseudorandom generator to ensure reproducibility.
  - `data_path`: Location of your raw data.
  - `eval_split_rate`: The ratio (acceptable range $[0,1)$) of the evaluation set in the whole dataset.
- EMB model: The data follow the standard contrastive-learning format of $(q,p,n_1,\ldots,n_k)$ tuples, each of which is called a minibatch. $q$ is the query concept; $p$ is the positive concept, a concept similar to the query (in our case a sibling of the query in the taxonomy); $n_1,\ldots,n_k$ are the negative concepts, which should be dissimilar to the query. A sample data file is provided here.
  - `concept_appearance_per_file`: How many times each concept in the taxonomy appears in the data.
  - `negative_per_minibatch`: $k$ in the aforementioned minibatch format.
- GEN model: The data are lists of semicolon-delimited concept names, each accompanied by the concept name of the list's LCA (least common ancestor) as reference. Each row is a `([PREFIX][C1];...;[Cn], [LCA])` tuple. Usually the LCA is non-trivial (i.e. not the root concept), but an option exists to intentionally corrupt some of the lists so that the LCA becomes trivial. A sample data file is provided here.
  - `max_chunk_size`: Maximal length ($\geq 2$) of the concept list in each row. The generated data will contain lists of every length from 1 to the specified number.
  - `corrupt_ratio`: The ratio (acceptable range $[0,1]$) of corrupted data rows.
  - `corrupt_patterns`: The specific ways data are allowed to be corrupted. This parameter should be a list of distinct pairs of integers $(p_i,n_i)$, where $p$ is the number of uncorrupted concepts and $n$ is the number of randomly chosen concepts used for corruption. For each pair, $p+n$ should be no greater than `max_chunk_size`, and $p$ should not equal 1 since that would be equivalent to $p=0$.
  - `pattern_weight`: The relative frequency of each corrupt pattern. These weights do not need to add up to 1. This parameter should have the same list length as `corrupt_patterns`.
  - `prompt_prefix`: The task prefix that will be prepended to all concept lists, used to facilitate the training of some language models.
- SUB model: The data are $(\mathrm{sub},\mathrm{sup},\mathrm{ref})$ tuples, where $\mathrm{ref}$ is 1 when $\mathrm{sub}$ is a sub-concept of $\mathrm{sup}$, and 0 otherwise. Positive data will be all the child-parent and grandchild-grandparent pairs in the dataset. Negative data (rows where $\mathrm{ref}=0$) are generated in two ways: easy and hard. A sample data file is provided here.
  - `easy_negative_sample_rate`: The number of easy negative rows relative to the number of positive rows. These negatives are obtained by replacing $\mathrm{sup}$ with a random concept.
  - `hard_negative_sample_rate`: The number of hard negative rows relative to the number of positive rows. These negatives are obtained by replacing $\mathrm{sup}$ with a concept reached via a graph random walk from the original $\mathrm{sup}$.
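Putting these together, a `data_config.json` might look like the sketch below. All values (including the data path) are illustrative assumptions; consult the shipped file for the authoritative key layout. Note that the example respects the constraints above: each $(p_i,n_i)$ pair satisfies $p_i+n_i \leq$ `max_chunk_size` and $p_i \neq 1$.

```json
{
  "random_seed": 42,
  "data_path": "./google_pt.json",
  "eval_split_rate": 0.1,
  "concept_appearance_per_file": 2,
  "negative_per_minibatch": 4,
  "max_chunk_size": 4,
  "corrupt_ratio": 0.2,
  "corrupt_patterns": [[0, 2], [2, 1]],
  "pattern_weight": [3, 1],
  "prompt_prefix": "summarize: ",
  "easy_negative_sample_rate": 1,
  "hard_negative_sample_rate": 1
}
```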
Once you are ready, initialise an ICON object with your preferred configurations. If you just want to see ICON at work, use all the default configurations, e.g. `iconobj = ICON(data=your_data, emb_model=your_emb_model, gen_model=your_gen_model, sub_model=your_sub_model)` followed by `iconobj.run()` (this will trigger auto mode, see below). A complete list of configurations is provided as follows, with a combined example after the list:
- `mode`: Select one of the following:
  - `'auto'`: The system will automatically enrich the entire taxonomy without supervision.
  - `'semiauto'`: The system will enrich the taxonomy with the seeds specified by user input.
  - `'manual'`: The system will try to place the new concepts specified by user input directly into the taxonomy. Does not require `gen_model`.
- `logging`: How much you want to see ICON reporting its progress. Set to 0 or `False` to suppress all logging. Set to 1 if you want to see a progress bar and some brief updates. Set to `True` if you want to hear basically everything! Other possible values for this argument include integers from 2 to 5 (5 is currently equivalent to `True`) and a list of message types.
- `rand_seed`: If provided, this will be passed to NumPy and PyTorch as the random seed. Use this to ensure reproducibility.
- `transitive_reduction`: Whether to perform transitive reduction on the outcome taxonomy, which makes sure it is in its simplest form with no redundancy.
- Auto mode config:
  - `max_outer_loop`: Maximal number of outer loops allowed.
- Semiauto mode config:
  - `semiauto_seeds`: An iterable of concepts that will be used as the seed for each outer loop.
- Manual mode config:
  - `input_concepts`: An iterable of new concept labels to be placed in the taxonomy.
  - `manual_concept_bases`: If provided, each entry will become the search bases for the corresponding input concept.
  - `auto_bases`: If enabled, ICON will build the search bases for each input concept. Can speed up the search massively at the cost of search breadth. If disabled, `emb_model` will not be required.
- Retrieval config:
  - `retrieve_size`: The number of concepts to retrieve for each query.
  - `restrict_combinations`: Whether to restrict the subsets under consideration to those including the seed concept.
- Generation config:
  - `ignore_label`: The set of output labels that indicate the `gen_model`'s refusal to generate a union label.
  - `filter_subsets`: Whether the `gen_model` should skip the subsets that have trivial LCAs, that is, where the LCAs of the set form a subset of itself.
- Concept placement config:
  - Search domain constraints:
    - `subgraph_crop`: Whether to limit the search domain to the descendants of the LCAs of the concepts used to generate the new concept (referred to as search bases in this documentation).
    - `subgraph_force`: If provided (type: list of lists of labels), the search domain will always include the LCAs of the search bases w.r.t. the sub-taxonomy defined by the edges whose labels are in each list of the input. Will not take effect if `subgraph_crop = False`.
    - `subgraph_strict`: Whether to further limit the search domain to the subsumers of at least one base concept.
  - Search:
    - `threshold`: The `sub_model`'s minimal predicted probability for accepting a subsumption.
    - `tolerance`: Maximal depth to continue searching a branch that has been rejected by `sub_model` before pruning the branch.
    - `force_known_subsumptions`: Whether to force the search to place the new concept at least as general as the LCA of the search bases, and at least as specific as the union of the search bases. Enabling this will also force the search to stop at the search bases.
    - `force_prune_branches`: Whether to force the search to reject all subclasses of a tested non-superclass in superclass search, and to reject all superclasses of a tested non-subclass in subclass search. Enabling this will slow down the search if the taxonomy is roughly tree-like.
- Taxonomy update config:
  - `do_update`: Whether you would like to actually update the taxonomy. If set to `True`, running ICON will return the enriched taxonomy. Otherwise, running ICON will return the records of its predictions in a dictionary.
  - `eqv_score_func`: When ICON is updating taxonomies, it is sometimes necessary to estimate the likelihood of $a = b$, where $a$ and $b$ are two concepts, given the likelihoods of $a \sqsubseteq b$ ($b$ subsumes $a$) and $b \sqsubseteq a$. This argument is a function that combines the two probabilities into an estimate of the equivalence probability. It is usually fine to leave it as the default, which is multiplication.
  - `do_lexical_check`: Whether you would like to run a simple lexical screening for each new concept to see if it coincides with any existing concept. If set to `True`, ICON will have to pre-compute and cache the lexical features for each concept in the taxonomy when initialising.
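For instance, a semiauto run combining several of these options might look like the sketch below. This is illustrative only: the seed labels are made up, and we assume configuration options can be passed as flat keyword arguments to the constructor, as in the default example above.

```python
# Hypothetical configuration; see demo.ipynb for an authoritative example.
iconobj = ICON(
    data=your_data,
    emb_model=your_emb_model,
    gen_model=your_gen_model,
    sub_model=your_sub_model,
    mode='semiauto',
    semiauto_seeds=['espresso drink', 'cold brew'],  # made-up seed concepts
    retrieve_size=10,
    threshold=0.9,                       # stricter subsumption acceptance
    eqv_score_func=lambda p, q: p * q,   # the default: multiply both directions
    logging=1,                           # progress bar plus brief updates
)
enriched_taxo = iconobj.run()
```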
After completing configuration and having an ICON object initialised, you can run ICON by simply calling `run()`. If you want to change configurations, use the method `iconobj.update_config(**your_new_config)`. For instance,

`iconobj.update_config(threshold=0.9, ignore_label=iconobj.config.gen_config.ignore_label + ['Owl:Thing'])`

would set the subsumption prediction threshold to 0.9 and add `'Owl:Thing'` to the list of ignored generated labels.
The outcome of an ICON run will be either the enriched taxonomy or a record of ICON's predictions. In the former case, you can save the taxonomy with `your_taxo_object.to_json(your_path, **your_kwargs)`. In the latter case, the record will be a Python dictionary of the form

```python
{concept_name1:
    {'eqv': eqv_1,
     'sup': sup_1,
     'sub': sub_1},
 concept_name2:
    {'eqv': eqv_2,
     'sup': sup_2,
     'sub': sub_2},
 ...
}
```

where each `eqv` is either empty or a single key-value pair `label: score` with the predicted equivalent concept and its confidence score. Likewise, each `sup` and `sub` is either empty or a dictionary of such key-value pairs, potentially including more than one concept.
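As a hedged illustration of consuming such a record (assuming `do_update=False` and the `iconobj` from earlier; key names follow the format above):

```python
# Sketch: print ICON's predictions from the record dictionary.
records = iconobj.run()  # returns the record dict when do_update=False
for name, rec in records.items():
    if rec['eqv']:
        # 'eqv' holds at most one pair: predicted equivalent concept -> score
        (label, score), = rec['eqv'].items()
        print(f"{name} is equivalent to {label} (score {score:.2f})")
    else:
        sups = ', '.join(rec['sup']) or 'none'  # predicted parents
        subs = ', '.join(rec['sub']) or 'none'  # predicted children
        print(f"{name}: parents [{sups}]; children [{subs}]")
```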
ICON reads and writes taxonomies in a designated JSON format. In particular, the files are expected to have two arrays, `"nodes"` and `"edges"`:

- `"nodes"` contains a list of node objects. Each node object contains the following fields:
  - Mandatory field `"id"`: The ID of the node. ID `0` is always reserved for the root node and should be avoided.
  - Mandatory field `"label"`: The name / surface form of the node.
  - Any other fields will be stored as node attributes.
- `"edges"` contains a list of edge objects. Each edge object contains the following fields:
  - Mandatory field `"src"`: The ID of the child node.
  - Mandatory field `"tgt"`: The ID of the parent node.
  - Any other fields will be stored as edge attributes.

While the only attribute ICON explicitly uses for each node or edge is `"label"`, you can store other attributes, for instance node term embeddings, as additional fields. These attributes will be stored in `Taxonomy` objects. An example file can be found in the data directory here.
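For orientation, a minimal file in this format might look as follows (the IDs, labels, and the extra `embedding` attribute are made up for illustration; see the shipped example file for the real thing):

```json
{
  "nodes": [
    {"id": 1, "label": "beverage"},
    {"id": 2, "label": "coffee", "embedding": [0.12, -0.03]},
    {"id": 3, "label": "tea"}
  ],
  "edges": [
    {"src": 2, "tgt": 1, "label": "subClassOf"},
    {"src": 3, "tgt": 1}
  ]
}
```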
If you wish to use ICON in your work, please cite the following paper:
```bibtex
@inproceedings{10.1145/3589334.3645584,
  author = {Shi, Jingchuan and Dong, Hang and Chen, Jiaoyan and Wu, Zhe and Horrocks, Ian},
  title = {Taxonomy Completion via Implicit Concept Insertion},
  year = {2024},
  isbn = {9798400701719},
  publisher = {Association for Computing Machinery},
  booktitle = {Proceedings of the ACM on Web Conference 2024},
  pages = {2159–2169},
  numpages = {11},
  series = {WWW '24}
}
```