
Graph ML: Directed Link Prediction on the Implication Graph [Claudio Moroni, Pietro Monticone]


Motivation

Directed Link Prediction [1] is a method enabling machine / deep learning models to predict missing edges in a directed graph. For our implication graph, this translates to predicting the truth values of unproven implications. This task serves as a necessary first step for advancing in the following directions:

  1. Reasoning over mathematical knowledge graphs: Recent advancements allow language models to integrate information from multiple modalities. For example, [2] and [3] share information between corresponding layers of Language Models (LMs) and Graph Neural Networks (GNNs), enabling simultaneous learning from text corpora and graph-structured data within the same expert domain. By leveraging both modalities, the language model can better structure its knowledge and respond to complex queries. This dual learning process combines masked language modelling for text with link prediction on the graph, highlighting the importance of link prediction for robust reasoning.
  2. Higher-order implication graphs: Our implication graph currently represents only implications of the form $p \implies q$, not more complex ones like $(p \land r) \implies q$. Extending to such higher-order edges would likely involve connecting sets of nodes, thereby requiring hypergraph representations. For a systematic overview, see [4]. While specific hypergraph neural architectures exist [5], we believe it is still conceptually important to apply Directed Link Prediction to simpler implication graphs first, providing insights and guiding principles that can anticipate challenges in higher-order graph representation learning.

Data

We used this edge list, generated on October 20, 2024, with the following commands:

lake exe extract_implications outcomes > data/tmp/outcomes.json
scripts/generate_edgelist_csv.py   

The structure of the implication graph is summarised below:

graph_summary = {
    "total_nodes": 4694,
    "total_directed_edges": 8178279,
    "edge_density_percentage": 37,  # Percentage of possible edges that exist
    "bidirectional_edges": 2475610,
    "bidirectional_percentage": 30  # Percentage of all edges that are bidirectional
}
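For reference, the summary above can be recomputed directly from the generated edge list. A minimal sketch, assuming the CSV exposes source and target columns for the directed edges (the file path and column names here are assumptions, not the actual script output):

import pandas as pd

edges = pd.read_csv("data/edgelist.csv")  # hypothetical path and column names
pairs = set(zip(edges["source"], edges["target"]))

n = len(set(edges["source"]) | set(edges["target"]))  # 4694 nodes
m = len(pairs)                                         # 8_178_279 directed edges

# Edge density over all ordered pairs of distinct nodes:
# 8_178_279 / (4694 * 4693) ≈ 0.371, i.e. the reported 37%.
density = m / (n * (n - 1))

# An edge (u, v) counts as bidirectional if its reverse (v, u) is also present;
# 2_475_610 / 8_178_279 ≈ 0.30, i.e. the reported 30%.
bidirectional = sum((v, u) in pairs for (u, v) in pairs)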

Below is a summary of the edge types in the graph:

edge_counts = {
    "explicit_conjecture_false": 92, 
    "explicit_proof_false": 582316, 
    "explicit_proof_true": 10657, 
    "implicit_conjecture_false": 142, 
    "implicit_proof_false": 13272681, 
    "implicit_proof_true": 8167622, 
    "unknown": 126
}

Edges are labeled according to the following scheme:

edge_labels = {
    "implicit_proof_true": 1, 
    "explicit_proof_false": 0, 
    "implicit_proof_false": 0, 
    "explicit_proof_true": 1, 
    "explicit_conjecture_false": 0, 
    "implicit_conjecture_false": 0, 
    "unknown": 0
}
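As a minimal sketch of how these labels are applied (the file path and the edge-type column name are assumptions), note that the two label-1 classes above (10 657 + 8 167 622) account exactly for the 8 178 279 directed edges reported earlier:

import pandas as pd

edge_labels = {
    "implicit_proof_true": 1, "explicit_proof_true": 1,
    "explicit_proof_false": 0, "implicit_proof_false": 0,
    "explicit_conjecture_false": 0, "implicit_conjecture_false": 0,
    "unknown": 0,
}

edges = pd.read_csv("data/edgelist.csv")              # hypothetical path
edges["label"] = edges["edge_type"].map(edge_labels)  # "edge_type" column name is an assumption
positives = edges[edges["label"] == 1]                # 10_657 + 8_167_622 = 8_178_279 directed edges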

The unknown class contains a very small number of edges (126), so their impact on the training phase is expected to be negligible. Future approaches might address this class by excluding unknown edges from the training set.

Methods

Setup

Consider a directed graph $G = (V, E)$ with $|V| = n$, where $E \subseteq \{(u,v) \mid u, v \in V\}$ is the edge set. We assume that each node is associated with a feature vector, resulting in a feature matrix $X \in \mathbb{R}^{n \times f}$.

We define the existing edges $(a, b) \in E$ as positives and the non-existing edges $(c, d) \notin E$ as negatives.

Intuitively, performing DLP on $G$ involves randomly splitting $E$ into three disjoint sets: $E_{\text{train}}$, $E_{\text{val}}$, and $E_{\text{test}}$, such that:

  • $E_{\text{train}}$ is the training set,
  • $E_{\text{val}}$ is the validation set,
  • $E_{\text{test}}$ is the test set,
  • and $E = E_{\text{train}} \dot{\cup} E_{\text{val}} \dot{\cup} E_{\text{test}}$.

The model then learns from $G_{\text{train}} = (V, E_{\text{train}})$ to map the topological and feature-related information of two nodes $u$ and $v$ to a probability $p_{uv}$ that $(u, v) \in E_{\text{test}}$.
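As a sketch of this naive setup (split proportions are illustrative; edges are assumed to be ordered node pairs):

import random

def naive_split(positive_edges, val_frac=0.05, test_frac=0.10, seed=0):
    """Randomly partition the positive edge set E into disjoint train / val / test subsets."""
    edges = list(positive_edges)
    random.Random(seed).shuffle(edges)
    n_val, n_test = int(val_frac * len(edges)), int(test_frac * len(edges))
    e_val, e_test = edges[:n_val], edges[n_val:n_val + n_test]
    e_train = edges[n_val + n_test:]
    return e_train, e_val, e_test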

However, this setup presents two key issues among others:

  1. The model learns only from positives, so it cannot recognize negatives.
  2. The model is evaluated only on positives, preventing us from measuring its ability to identify negatives.

To address these limitations, we adopted the setup proposed in [6]. Specifically, we renamed $E_{\text{train}}$ as $E_{\text{train}}^p$ (the positives) and redefined the training set as:

$$ E_{\text{train}} = E_{\text{train}}^p \dot{\cup} E_{\text{train}}^n $$

where $E_{\text{train}}^n$ includes all possible negatives in $G_{\text{train}} = (V, E_{\text{train}}^p)$. The model is now required to predict the non-existence of edges in $E_{\text{train}}^n$.
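With only $n = 4694$ nodes, a dense boolean adjacency mask fits comfortably in memory, so $E_{\text{train}}^n$ can be enumerated exhaustively. A minimal sketch (nodes are assumed to be indexed $0, \dots, n-1$; variable names are illustrative):

import numpy as np

def all_training_negatives(n_nodes, train_positives):
    """Every ordered pair (u, v), u != v, that is not an edge of G_train."""
    adj = np.zeros((n_nodes, n_nodes), dtype=bool)
    rows, cols = zip(*train_positives)
    adj[list(rows), list(cols)] = True
    np.fill_diagonal(adj, True)        # exclude self-loops from the negatives
    neg_u, neg_v = np.where(~adj)      # E_train^n: all remaining ordered pairs
    return list(zip(neg_u.tolist(), neg_v.tolist()))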

Similarly, if we redefine $E_{\text{test}}$ as follows:

$$ E_{\text{test}} = E_{\text{test}}^p \dot{\cup} E_{\text{test}}^n $$

where $E_{\text{test}}^n$ is a random sample of negatives, the model’s evaluation would fail to capture two crucial aspects:

  1. The model’s ability to distinguish $(u,v)$ from $(v,u)$ for all $(u,v) \in E_{\text{test}}^p$.
  2. The model’s ability to identify bi-implications.

These limitations arise from the random selection of negative edges in $E_{\text{test}}^n$. To address this, we define three distinct test sets: $E_{\text{test}}^G$, $E_{\text{test}}^D$, and $E_{\text{test}}^B$, to evaluate different facets of the model’s performance:

  • General Test Set ($E_{\text{test}}^G$): Here, $E_{\text{test}} = E_{\text{test}}^p \dot{\cup} E_{\text{test}}^n$, where $E_{\text{test}}^n$ is a random sample of non-existent edges with the same cardinality as $E_{\text{test}}^p$. This set assesses the model's ability to detect the presence of edges, regardless of direction. A model that cannot distinguish edge direction may still perform well on this set, highlighting the need for the following two additional test sets.
  • Directional Test Set ($E_{\text{test}}^D$): Defined as $E_{\text{test}}^{\text{up}} \dot{\cup} \tilde{E}_{\text{test}}^{\text{up}}$, where $E_{\text{test}}^{\text{up}}$ consists of the unidirectional edges in $E_{\text{test}}^p$, and $\tilde{E}_{\text{test}}^{\text{up}}$ contains their reverses (negatives by construction). This set evaluates the model's ability to recognize edge direction, making it suitable for assessing direction-sensitive models.
  • Bidirectional Test Set ($E_{\text{test}}^B$): Defined as $E_{\text{test}}^{\text{bp}} \dot{\cup} E_{\text{B}}^{\text{n}}$, where $E_{\text{test}}^{\text{bp}}$ contains one direction of each bidirectional edge in $E_{\text{test}}^p$, and $E_{\text{B}}^{\text{n}} \subset \tilde{E}$ is an equally sized sample of reverses of unidirectional positive edges (negatives by construction). Every edge in $E_{\text{test}}^B$ thus has a positive reverse, but only the positive half is itself bidirectional, so this set evaluates the model's ability to identify bi-implications. A construction sketch of the three test sets follows this list.
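Below is a minimal sketch of how the three test sets can be assembled from the positive edges, assuming integer node ids and ordered-pair edges (the helper and variable names are illustrative, and the construction is simplified with respect to the splitting actually performed during training):

import random

def sample_non_edges(k, positives, n_nodes, rng):
    """Rejection-sample k ordered pairs (u, v), u != v, that are not existing edges."""
    out = set()
    while len(out) < k:
        u, v = rng.randrange(n_nodes), rng.randrange(n_nodes)
        if u != v and (u, v) not in positives:
            out.add((u, v))
    return list(out)

def build_test_sets(test_positives, all_positives, n_nodes, seed=0):
    """General, Directional and Bidirectional test sets as (edge, label) pairs (label 1 = edge exists)."""
    rng = random.Random(seed)
    pos = set(all_positives)

    # General: positives plus an equally sized random sample of non-edges.
    general = [(e, 1) for e in test_positives]
    general += [(e, 0) for e in sample_non_edges(len(test_positives), pos, n_nodes, rng)]

    # Directional: unidirectional positives plus their reverses (negatives by construction).
    unidirectional = [(u, v) for (u, v) in test_positives if (v, u) not in pos]
    directional = [(e, 1) for e in unidirectional] + [((v, u), 0) for (u, v) in unidirectional]

    # Bidirectional: one direction of each bidirectional positive, plus reverses of
    # unidirectional positives; every candidate edge here has a positive reverse.
    bi_pos = [(u, v) for (u, v) in test_positives if (v, u) in pos and u < v]
    negatives = rng.sample(unidirectional, min(len(unidirectional), len(bi_pos)))
    bidirectional = [(e, 1) for e in bi_pos] + [((v, u), 0) for (u, v) in negatives]

    return general, directional, bidirectional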

Models

We tested six models: gae, gravity_gae, sourcetarget_gae, mlp_gae, digae, and magnet (the identifiers used in the results table below).

All these models are graph-based autoencoders with distinct encoder-decoder architectures. Notably, GAE is the only model structurally unable to differentiate edge directions. Each model outputs the probability that an ordered pair of nodes has a directed edge between them, with nodes represented using one-hot encodings as features.
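To make the direction-awareness concrete, here is a minimal sketch of a source/target-style decoder with one-hot node features (embedding sizes and names are illustrative, the encoder is omitted, and this is not the exact architecture of any of the models above):

import torch

n_nodes, dim = 4694, 64

X = torch.eye(n_nodes)        # one-hot features: row u is the feature vector of node u

# Separate "source" and "target" embeddings; in practice both are produced by the encoder from X.
S = torch.randn(n_nodes, dim)
T = torch.randn(n_nodes, dim)

def edge_probability(u, v):
    """Probability of the directed edge u -> v; asymmetric because S and T differ."""
    return torch.sigmoid(S[u] @ T[v])

# A symmetric decoder with a single embedding Z and score Z[u] @ Z[v] assigns the same
# probability to (u, v) and (v, u), which is why the plain GAE cannot learn direction.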

We trained the models using Binary Cross Entropy as the loss function, balancing the contribution of positive and negative edges through re-weighting. On the General test set, we evaluated the following metrics:

  • AUC (Area Under the ROC Curve): Measures the probability that the model ranks a random positive edge higher than a random negative edge. Higher values indicate better discrimination between positive and negative edges.
  • AUPRC (Area Under Precision-Recall Curve): Assesses model performance, particularly in cases of class imbalance. AUPRC is more robust to imbalanced data than AUC.
  • Hits@K: Evaluates the fraction of times a positive edge is ranked within the top $K$ positions among its personalized negative samples [9]. Briefly, given a positive edge, its $M$ personalized negative samples are $M$ negative edges with the same head but different tails. We compute Hits@K for $K = 1, 3, 10$ and set $M = 100$. Hits@K = 0.1 therefore means that the model ranks a positive edge within the top $K$ positions, relative to its $M$ personalized negatives, 10% of the time.
  • MRR (Mean Reciprocal Rank): Computes the average reciprocal rank of positive edges among their personalized negative samples [9] (the same as those used for Hits@K), providing an overall measure of ranking quality. For instance, $MRR = 0.1$ roughly corresponds to the model ranking a positive edge around $10^{\text{th}}$, on average, among its $M$ personalized negative samples.

Each metric ranges from 0 to 1, with higher values reflecting better performance. Based on prior work, we expect AUC and AUPRC scores to approach 1, while MRR and Hits@K often yield values around 0.15 for similar undirected tasks [9]. Directional and Bidirectional performance was evaluated using only AUC and AUPRC, since Hits@K and MRR are not readily definable in those scenarios. The number of training epochs was optimized through early stopping on the General validation set performance (measured by the sum of AUC and AUPRC).
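A minimal sketch of how Hits@K and MRR are computed from personalized negatives, assuming a score function mapping an ordered node pair to a predicted probability and integer node ids (function and variable names are illustrative):

import random

def ranking_metrics(test_positives, all_positives, n_nodes, score, ks=(1, 3, 10), m=100, seed=0):
    """Hits@K and MRR over M personalized negatives per positive edge (same head, different tails)."""
    rng = random.Random(seed)
    positives = set(all_positives)
    hits = {k: 0 for k in ks}
    reciprocal_ranks = []
    for (u, v) in test_positives:
        # Personalized negatives: same head u, tails w such that (u, w) is not an edge.
        candidate_tails = [w for w in range(n_nodes) if w != v and (u, w) not in positives]
        negatives = rng.sample(candidate_tails, min(m, len(candidate_tails)))
        pos_score = score(u, v)
        rank = 1 + sum(score(u, w) > pos_score for w in negatives)  # 1 = ranked above all negatives
        reciprocal_ranks.append(1.0 / rank)
        for k in ks:
            hits[k] += rank <= k
    n = len(test_positives)
    return {f"hits@{k}": hits[k] / n for k in ks}, sum(reciprocal_ranks) / n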

Results

The results below represent average performance over six random splits of $E_{\text{train}}$, $E_{\text{val}}$, and $E_{\text{test}}$ while keeping the model's seed fixed for fair comparison. The training / validation / test split proportions are:

  • 85 / 5 / 10 for unidirectional edges
  • 65 / 15 / 30 for bidirectional edges

(Column prefixes: G = General, D = Directional, B = Bidirectional test set.)

| Model | G_ROC_AUC | G_AUPRC | G_Hits@1 | G_Hits@3 | G_Hits@10 | G_MRR | D_ROC_AUC | D_AUPRC | B_ROC_AUC | B_AUPRC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gae | 0.8484 ± 9e-04 | 0.8558 ± 6e-04 | 6e-05 ± 4e-05 | 6e-05 ± 4e-05 | 0.0001 ± 4e-05 | 0.0165 ± 2e-04 | 0.5 ± 0 | 0.5 ± 0 | 0.941 ± 5e-03 | 0.964 ± 3e-03 |
| gravity_gae | 0.9806 ± 3e-04 | 0.9753 ± 4e-04 | 0.069 ± 6e-03 | 0.101 ± 5e-03 | 0.17 ± 3e-03 | 0.112 ± 5e-03 | 0.9958 ± 1e-04 | 0.9874 ± 2e-04 | 0.99717 ± 4e-05 | 0.99431 ± 7e-05 |
| sourcetarget_gae | 0.99976 ± 1e-05 | 0.999736 ± 8e-06 | 0.077 ± 4e-03 | 0.147 ± 7e-03 | 0.279 ± 9e-03 | 0.152 ± 5e-03 | 0.999982 ± 1e-06 | 0.999983 ± 1e-06 | 0.999989 ± 2e-06 | 0.999987 ± 3e-06 |
| mlp_gae | 0.99315 ± 1e-05 | 0.99409 ± 1e-05 | 0.181 ± 7e-03 | 0.299 ± 7e-03 | 0.53 ± 2e-03 | 0.289 ± 6e-03 | 0.99671 ± 2e-05 | 0.9973 ± 1e-05 | 0.99692 ± 2e-05 | 0.99736 ± 2e-05 |
| digae | 0.9978 ± 3e-04 | 0.998 ± 3e-04 | 0.035 ± 6e-03 | 0.068 ± 1e-02 | 0.18 ± 2e-02 | 0.091 ± 1e-02 | 0.9991 ± 2e-04 | 0.9993 ± 2e-04 | 0.9994 ± 1e-04 | 0.9995 ± 1e-04 |
| magnet | 0.989 ± 1e-04 | 0.99076 ± 3e-05 | 0.151 ± 1e-02 | 0.26 ± 2e-02 | 0.38 ± 2e-02 | 0.24 ± 1e-02 | 0.9962 ± 1e-03 | 0.9969 ± 6e-04 | 0.9976 ± 4e-04 | 0.9979 ± 2e-04 |

We observe consistently high General AUC and AUPRC scores, close to 1. These high values are expected, as similar neural architectures have demonstrated strong performance on graphs of comparable size [1]. The high edge density of the implication graph (approximately 37% of all possible directed edges exist) likely contributes to the near-perfect General AUC and AUPRC scores. For context, benchmark datasets such as Cora and Citeseer (in their directed and undirected versions) contain fewer than 1% of all possible edges.

Interestingly, the GAE model, though structurally unable to distinguish edge direction, performs well on the General task, at least in terms of AUC and AUPRC. This experimentally confirms the need for the Directional and Bidirectional test sets introduced above, which allow a comprehensive evaluation across all facets of DLP.

All other models demonstrate high AUC and AUPRC scores across the General, Directional, and Bidirectional test sets, indicating strong predictive capabilities even when accounting for directionality and bidirectionality.

Notably, the mlp_gae and magnet models also achieve competitive scores in MRR and Hits@K metrics.

Conclusions

We evaluated the performance of six different graph autoencoders on a DLP task. By adopting a multi-task evaluation framework, we assessed the models comprehensively across various aspects of DLP. All models demonstrated strong performance on AUC and AUPRC metrics, with some also achieving high scores on MRR and Hits@K.

Node features were represented using one-hot encodings, meaning that the models received no explicit information about the equations represented by the nodes. Instead, they relied entirely on the topological structure encoded during training. This approach aligns with previous research suggesting that one-hot encodings can promote asymmetric embeddings [6]. However, future experiments could explore alternative representations, such as encoding the equations with text-based embeddings like Word2Vec, to potentially enhance the models' understanding of the nodes’ semantic content.

In summary, our findings highlight the adaptability and robustness of graph autoencoders for DLP tasks in dense, directed graphs. This approach not only demonstrates robustness in predicting directed links but also suggests promising potential for future improvements through enhanced feature representations, thereby advancing the capabilities of link prediction in complex mathematical knowledge graphs.

References
