Peer-Reviewed Research

In this section you will find summaries of, and links to, all peer-reviewed research I have conducted. The summaries given here are formal abstracts, which means that they are not necessarily easy to understand if you do not have a background in computer science. I am currently working on turning some of these into blog posts, which will be located at Blogs.

Structural Alignment in Link Prediction

https://www.tara.tcd.ie/items/92acfeb2-45ce-4c6a-814b-6d1e62667742

Abstract: While Knowledge Graphs (KGs) have become increasingly popular across various scientific disciplines for their ability to model and interlink huge quantities of data, essentially all real-world KGs are known to be incomplete. As such, the growth of KG use has been accompanied by the concurrent development of machine learning tools designed to predict missing information in KGs, a problem referred to as the Link Prediction Task. The majority of state-of-the-art link predictors to date have followed an embedding-based paradigm. In this paradigm, it is assumed that the information content of a KG is best represented by the (individual) vector representations of its nodes and edges, and that therefore node and edge embeddings are particularly well-suited to performing link prediction.

This thesis proposes an alternative perspective on the field's approach to link prediction and KG data modelling. Specifically, this work re-analyses KGs and state-of-the-art link predictors from a graph-structure-first perspective that models the information content of a KG in terms of whole triples, rather than individual nodes and edges. A theoretical foundation for this structure-first approach is first built up from the state-of-the-art literature, and the approach is then evaluated in two contexts.

The thesis concludes that a structure-first perspective on KGs and link prediction is both viable and useful for understanding KG learning. This observation is used to create and propose the Structural Alignment Hypothesis, which postulates that link prediction can be understood and modelled as a structural task.

Ailíniú Struchtúir agus Réamhinsint Nasc

https://www.tara.tcd.ie/items/c8eef2c4-04b8-4351-9f26-a237e62f3a94

Abstract: Although Knowledge Graphs (KGs) are becoming far more common in a good many scientific fields because of their ability to store and link together large amounts of data, data is missing from nearly every KG. Alongside the development of KGs, therefore, there has been substantial development of machine learning systems built to predict the data missing from a KG -- a task called Link Prediction. Most link predictors to date adopt an embedding-based approach. In this learning paradigm, it is assumed that the data in a KG is best modelled through vector versions of its nodes and edges, and that node / edge embeddings are therefore particularly capable of carrying out the link prediction task.

This thesis offers another perspective on the field's approach to link prediction and data modelling in KGs. Specifically, this work re-analyses KGs and link predictors by adopting a structure-based view that models the data in KGs as whole triples, rather than as individual nodes / edges. After a theoretical basis for this approach is constructed from the literature, the approach is evaluated in two distinct contexts.

The main result of this thesis is that a structure-based view adds greatly to the field's understanding of KGs and of the link prediction task. This conclusion is used to formulate and share the Structural Alignment Hypothesis -- a hypothesis stating that link prediction can be understood and modelled as a structure-based task.

TWIG-I: Embedding-Free Link Prediction and Cross-KG Transfer Learning Using a Small Neural Architecture

https://ebooks.iospress.nl/pdf/doi/10.3233/SSW240010

Abstract: Knowledge Graphs (KGs) are relational knowledge bases that represent facts as a set of labelled nodes and the labelled relations between them. Their machine learning counterpart, Knowledge Graph Embeddings (KGEs), learn to predict new facts based on the data contained in a KG -- the so-called link prediction task. To date, almost all forms of link prediction for KGs rely on some form of embedding model, and KGEs hold state-of-the-art status for link prediction. In this paper, we present TWIG-I (Topologically-Weighted Intelligence Generation for Inference), a novel link prediction system that can represent the features of a KG in latent space without using node or edge embeddings. TWIG-I shows mixed performance relative to state-of-the-art KGE models -- at times exceeding and at times falling short of baseline performance. However, unlike KGEs, TWIG-I can be natively used for transfer learning across distinct KGs. We show that using transfer learning with TWIG-I can lead to increases in performance in some cases both over KGE baselines and over TWIG-I models trained without finetuning. While these results are still mixed, TWIG-I clearly demonstrates that structural features are sufficient to solve the link prediction task in the absence of embeddings. Finally, TWIG-I opens up cross-KG transfer learning as a new direction in link prediction research and application.
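To make "structural features" concrete, the following is a minimal sketch of describing a triple by simple graph statistics rather than by learned embeddings. The feature set (node degrees and relation frequency), the toy data, and the names `build_counts` and `triple_structural_features` are assumptions chosen for illustration; the paper defines its own features, which are not reproduced here.

```python
from collections import Counter

# A toy KG as (subject, predicate, object) triples; the data is made up.
triples = [
    ("aspirin", "treats", "headache"),
    ("aspirin", "interacts_with", "warfarin"),
    ("ibuprofen", "treats", "headache"),
    ("warfarin", "treats", "thrombosis"),
]

def build_counts(kg):
    """Count how often each node and each relation appears in the KG."""
    degree = Counter()
    rel_freq = Counter()
    for s, p, o in kg:
        degree[s] += 1
        degree[o] += 1
        rel_freq[p] += 1
    return degree, rel_freq

def triple_structural_features(triple, degree, rel_freq):
    """Describe a whole triple by graph statistics only (no embeddings)."""
    s, p, o = triple
    return [degree[s], rel_freq[p], degree[o]]

degree, rel_freq = build_counts(triples)
features = triple_structural_features(
    ("aspirin", "treats", "headache"), degree, rel_freq
)
print(features)  # [2, 3, 2]
```

Because these features depend only on counts and not on the identity of any node, the same function can be applied unchanged to a different KG, which gives some intuition for why a structure-based model might transfer across graphs.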

Extending TWIG: Zero-Shot Predictive Hyperparameter Selection for KGEs based on Graph Structure

https://arxiv.org/pdf/2412.14801

Abstract: Knowledge Graphs (KGs) have seen increasing use across various domains -- from biomedicine and linguistics to general knowledge modelling. In order to facilitate the analysis of knowledge graphs, Knowledge Graph Embeddings (KGEs) have been developed to automatically analyse KGs and predict new facts based on the information in a KG, a task called 'link prediction'. Many existing studies have documented that the structure of a KG, KGE model components, and KGE hyperparameters can significantly change how well KGEs perform and what relationships they are able to learn. Recently, the Topologically-Weighted Intelligence Generation (TWIG) model has been proposed as a solution to modelling how these elements relate to one another. In this work, we extend the previous research on TWIG and evaluate its ability to simulate the output of the KGE model ComplEx in the cross-KG setting. Our results are twofold. First, TWIG is able to summarise KGE performance on a wide range of hyperparameter settings and KGs being learned, suggesting that it represents a general knowledge of how to predict KGE performance from KG structure. Second, we show that TWIG can successfully predict hyperparameter performance on unseen KGs in the zero-shot setting. This second observation leads us to propose that, with additional research, optimal hyperparameter selection for KGE models could be determined in a pre-hoc manner using TWIG-like methods, rather than by using a full hyperparameter search.
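The idea of pre-hoc hyperparameter selection can be sketched as follows, assuming that a function predicting KGE performance from KG structure and a hyperparameter setting is already available. Here `predict_performance` is a hypothetical stand-in with a made-up formula, not the TWIG model, and the grid values and the `mean_degree` statistic are illustrative only.

```python
from itertools import product

# Hypothetical hyperparameter grid (values are illustrative only).
grid = {
    "learning_rate": [1e-2, 1e-4],
    "embedding_dim": [50, 100, 250],
    "margin": [0.5, 1.0],
}

def predict_performance(kg_stats, hparams):
    """Stand-in for a learned predictor of KGE performance.

    A real predictor would be trained on results from other KGs; this
    toy formula exists only so that the selection loop below can run.
    """
    score = 0.5
    score += 0.1 if hparams["learning_rate"] == 1e-2 else 0.0
    score += hparams["embedding_dim"] / 1000.0
    score -= 0.05 * abs(hparams["margin"] - kg_stats["mean_degree"] / 10.0)
    return score

def select_hyperparameters(kg_stats, grid):
    """Pick the grid point with the best predicted score, training no KGE."""
    names = list(grid)
    candidates = [dict(zip(names, values)) for values in product(*grid.values())]
    return max(candidates, key=lambda hp: predict_performance(kg_stats, hp))

# Structural statistics of an unseen KG (made-up numbers).
unseen_kg_stats = {"mean_degree": 5.0}
best = select_hyperparameters(unseen_kg_stats, grid)
print(best)
```

The point of the sketch is the shape of the loop: every candidate is scored by a cheap prediction, so the cost of the search no longer includes training a KGE model per configuration.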

A Survey on Knowledge Graph Structure and Knowledge Graph Embeddings

https://arxiv.org/pdf/2412.10092

Abstract: Knowledge Graphs (KGs) and their machine learning counterpart, Knowledge Graph Embedding Models (KGEMs), have seen ever-increasing use in a wide variety of academic and applied settings. In particular, KGEMs are typically applied to KGs to solve the link prediction task; i.e. to predict new facts in the domain of a KG based on existing, observed facts. While this approach has shown substantial power in many end-use cases, it remains incompletely characterised in terms of how KGEMs react differently to KG structure. This is of particular concern in light of recent studies showing that KG structure can be a significant source of bias as well as a partial determinant of overall KGEM performance. This paper seeks to address this gap in the state-of-the-art. It provides, to the authors' knowledge, the first comprehensive survey exploring established relationships between Knowledge Graph Embedding Models and graph structure in the literature. It is the hope of the authors that this work will inspire further studies in this area, and contribute to a more holistic understanding of KGs, KGEMs, and the link prediction task.

NamE: Capturing Biological Context in KGEs via Contextual Named Graph Embeddings

https://ieeexplore.ieee.org/abstract/document/10782639/

Abstract: Large biological databases have become standard in many fields of biology, particularly biomedicine. Much of this data (such as that from DrugBank, PharmKG, PrimeKG, BIOSNAP, and others) is now expressed in a Knowledge Graph (KG) format in which concepts (such as drugs and diseases) are represented as nodes and the relationships between them (such as clinical drug indications) are represented by edges. Bioinformatics pipelines that leverage this data commonly make use of Knowledge Graph Embeddings (KGEs) to learn and analyse the data at hand. While existing work demonstrates that this can effectively assist in a variety of biomedical tasks, existing KGE approaches remain limited by their inability to account for relevant biological context. In this paper, we present a novel KGE framework, called NamE, that solves this problem by explicitly modelling for context in KGE systems. In experiments on two applied bioinformatics use-cases, predicting drug-drug interactions and predicting clinical indications of drugs, we show that NamE substantially improves the ability of KGEs to make accurate predictions by up to 72.2% and 90.9% respectively as compared to conventional KGE methods.

Analysis of Attention Mechanisms in Box-Embedding Systems

https://link.springer.com/chapter/10.1007/978-3-031-26438-2_6

Abstract: Large-scale Knowledge Graphs (KGs) have recently gained considerable research attention for their ability to model the inter- and intra- relationships of data. However, the huge scale of KGs has necessitated the use of querying methods to facilitate human use. Question Answering (QA) systems have shown much promise in breaking down this human-machine barrier. A recent QA model that achieved state-of-the-art performance, Query2box, modelled queries on a KG using box embeddings with an attention mechanism backend to compute the intersections of boxes for query resolution. In this paper, we introduce a new model, Query2Geom, which replaces the Query2box attention mechanism with a novel, exact geometric calculation. Our findings show that Query2Geom generally matches the performance of Query2box while having many fewer parameters. Our analysis of the two models leads us to formally describe the interaction between knowledge graph data and box embeddings with the concepts of semantic-geometric alignment and mismatch. We create the Attention Deviation Metric as a measure of how well the geometry of box embeddings captures the semantics of a knowledge graph, and apply it to explain the difference in performance between Query2box and Query2Geom. We conclude that Query2box’s attention mechanism operates using “latent intersections” that attend to the semantic properties in embeddings not expressed in box geometry, acting as a limit on model interpretability. Finally, we generalise our results and propose that semantic-geometric mismatch is a more general property of attention mechanisms, and provide future directions on how to formally model the interaction between attention and latent semantics.
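As a point of reference for what an exact geometric intersection of boxes can look like, in contrast with a learned, attention-based one, here is a minimal sketch of the standard coordinate-wise intersection of axis-aligned boxes. Whether Query2Geom uses exactly this formulation is an assumption; the `Box` class and `intersect` function are illustrative names, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass(frozen=True)
class Box:
    """An axis-aligned box given by its lower and upper corners."""
    lower: tuple
    upper: tuple

def intersect(boxes: Sequence[Box]) -> Optional[Box]:
    """Exact intersection: per dimension, max of lowers and min of uppers.

    Returns None when the boxes fail to overlap in at least one dimension.
    """
    dims = range(len(boxes[0].lower))
    lower = tuple(max(b.lower[d] for b in boxes) for d in dims)
    upper = tuple(min(b.upper[d] for b in boxes) for d in dims)
    if any(lo > up for lo, up in zip(lower, upper)):
        return None
    return Box(lower, upper)

a = Box((0.0, 0.0), (2.0, 2.0))
b = Box((1.0, -1.0), (3.0, 1.0))
c = Box((5.0, 5.0), (6.0, 6.0))

overlap = intersect([a, b])
print(overlap)            # Box(lower=(1.0, 0.0), upper=(2.0, 1.0))
print(intersect([a, c]))  # None
```

Note that this computation has no learnable parameters of its own: the result is fully determined by the geometry of the input boxes.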

Veni, Vidi, Vici: Solving the Myriad of Challenges before Knowledge Graph Learning

https://arxiv.org/pdf/2402.06098

Abstract: Knowledge Graphs (KGs) have become increasingly common for representing large-scale linked data. However, their immense size has required graph learning systems to assist humans in analysis, interpretation, and pattern detection. While there have been promising results for researcher and clinician empowerment through a variety of KG learning systems, we identify four key deficiencies in state-of-the-art graph learning that simultaneously limit KG learning performance and diminish the ability of humans to interface optimally with these learning systems. These deficiencies are: 1) lack of expert knowledge integration, 2) instability to node degree extremity in the KG, 3) lack of consideration for uncertainty and relevance while learning, and 4) lack of explainability. Furthermore, we characterise state-of-the-art attempts to solve each of these problems and note that each attempt has largely been isolated from attempts to solve the other problems. Through a formalisation of these problems and a review of the literature that addresses them, we adopt the position that not only are deficiencies in these four key areas holding back human-KG empowerment, but that the divide-and-conquer approach to solving these problems as individual units rather than a whole is a significant barrier to the interface between humans and KG learning systems. We propose that it is only through integrated, holistic solutions to the limitations of KG learning systems that human and KG learning co-empowerment will be efficiently effected. Finally, we present our “Veni, Vidi, Vici” framework that sets a roadmap for effectively and efficiently shifting to a holistic co-empowerment model in both the KG learning and the broader machine learning domain.

TWIG: Towards pre-hoc Hyperparameter Optimisation and Cross-Graph Generalisation via Simulated KGE Models

https://arxiv.org/pdf/2402.06097

Abstract: Knowledge Graphs (KGs) have become ever-more important for modelling biomedical information, as their intrinsic graph structure matches the structure of many biological interaction networks. Together with KGs, Knowledge Graph Embeddings (KGEs) have shown immense potential to learn biological data and predict new, in-band facts about the data the KG describes. However, recent literature has suggested several major deficits to KGEs: that they have an extremely short 'receptive field' of data they use to make predictions and that their learning is guided by memorising graph structure, not learning latent semantics. Moreover, while several studies have suggested that graph structure and KGE model choice affect optimal hyperparameters, the exact relationship of hyperparameters to learning remains unknown and is instead solved using a computationally intensive hyperparameter search.

In this paper we introduce TWIG (Topologically-Weighted Intelligence Generation), a novel, embedding-free paradigm for simulating the output of KGEs that uses a tiny fraction of the parameters. TWIG learns weights from inputs that consist of topological features of the graph data, with no coding for latent representations of entities or edges. Our experiments on the UMLS dataset show that a single TWIG neural network can predict the results of the state-of-the-art ComplEx-N3 KGE model nearly exactly across all hyperparameter configurations. To do this it uses a total of 2590 learnable parameters, but accurately predicts the results of 1215 different hyperparameter combinations with a combined cost of 29,322,000 parameters. Based on these results, we make two claims: 1) that KGEs do not learn latent semantics, but only latent representations of structural patterns; 2) that hyperparameter choice in KGEs is a deterministic function of the KGE model and graph structure. We further hypothesise that, as TWIG can simulate KGEs without embeddings, node and edge embeddings are not needed to learn to accurately predict new facts in KGs. Finally, we formulate all of our findings under the umbrella of the “Structural Generalisation Hypothesis”, which suggests that 'twiggy' embedding-free / data-structure-based learning methods can allow a single neural network to simulate KGE performance, and perhaps solve the Link Prediction task, across many KGs from diverse domains and with different semantics.
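The parameter figures quoted in this abstract can be put side by side with a short calculation. The three numbers are taken directly from the abstract; the variable names and the derived ratios are illustrative and do not appear in the paper's text as quoted here.

```python
twig_params = 2590              # learnable parameters in the single TWIG network
kge_params_total = 29_322_000   # combined parameters across all simulated KGE runs
num_configs = 1215              # hyperparameter combinations simulated

# How many times larger the combined KGE parameter count is than TWIG's.
ratio = kge_params_total / twig_params

# Average parameter count of one KGE configuration.
mean_per_config = kge_params_total / num_configs

print(f"Combined KGE parameters are about {ratio:,.0f} times TWIG's")
print(f"Mean KGE parameters per configuration: {mean_per_config:,.0f}")
```

On these figures, the single TWIG network is roughly four orders of magnitude smaller than the combined set of KGE models it simulates, and about one order of magnitude smaller than an average single configuration.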

Structural Characteristics of Knowledge Graphs Determine the Quality of Knowledge Graph Embeddings Across Model and Hyperparameter Choices

https://ceur-ws.org/Vol-3573/paper2.pdf

Abstract: The realm of biomedicine is producing information at a rate far beyond the capacity of clinicians, researchers, and machine learning experts to analyse in full. Recently, developments in Knowledge Graphs (KGs) have facilitated the representation of all this information in an easily-integrable and easily-queryable format. With increasing academic and clinical interest in Knowledge Graph Embeddings (KGEs), various KGE models have been developed to allow machine learning to efficiently run on these large Knowledge Graphs and predict new, previously unseen information about the domain. However, the need to validate hyperparameters for every new dataset, especially considering the time and expertise needed for validation and model training, has limited the use of KGEs in biology to those who have expertise in machine learning and knowledge engineering. This research presents a framework by which the effect of hyperparameters on model performance for a given KG can be modelled as a function of KG structure. The presented evaluation of the framework finds a clear effect of graph structure on hyperparameter fitness. This leads to the conclusion that more research into cross-dataset hyperparameter prediction and re-use holds promise for increasing the accessibility and usability of KGEs for biomedical applications.