Towards Knowledge Representation of the Biowaste-to-Chemicals Domain Using Knowledge Graphs


Abstract

In recent years, there has been great interest in transitioning the chemical industry toward a circular and sustainable model by integrating potentially net-zero biowaste resources to produce chemicals, replacing traditional fossil feedstocks. This thesis lays the groundwork for addressing a critical research question: What pathways from biowaste to value-added chemicals are the most sustainable, and how can we identify them?

Central to answering this question is a comprehensive understanding of the data requirements within the biowaste-to-chemicals domain, sustainability assessment metrics, and effective methods to contextualize, represent and enrich this knowledge. Specifically, the domain can be divided into two broad parts: the biowaste-to-feedstocks domain, which focuses on extracting feedstocks such as biopolymers through various processes, and the feedstocks-to-chemicals domain, which involves chemical reactions that transform feedstocks into value-added chemicals. With numerous choices at each stage and a highly interconnected structure, this thesis hypothesizes that organizing this knowledge as a knowledge graph (KG) provides a powerful framework for capturing and contextualizing this complexity. The need for such a framework is underscored by significant data gaps in the literature, particularly in the biowaste-to-feedstocks domain, where critical information remains fragmented. Furthermore, while the feedstocks-to-chemicals domain benefits from large reaction datasets, these can be further enriched through advanced methods such as reaction completion and impurity prediction, addressing existing limitations and enhancing their future utility for sustainability assessment.

A state-of-the-art KG was developed to comprehensively represent the biowaste-to-chemicals domain. The process began with knowledge acquisition, which involved compiling a curated literature corpus of 43 papers for the biowaste-to-feedstocks domain and gathering a reaction dataset for the feedstocks-to-chemicals domain. Schema development was undertaken to clearly define entities and relationships, enabling the representation of subdomains such as biowaste location and potential, biowaste sources, compositions, taxonomy, feedstocks, processes for feedstock extraction (e.g., pretreatment), and chemical reactions. Knowledge ingestion followed, where structured outputs were generated from unstructured text and incorporated into the KG based on the developed schemas. This ingestion method leveraged a combination of semi-manual techniques and automated workflows powered by Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG), ensuring future scalability.

The final KG encompassed 2.76 million nodes and 7.36 million relationships, covering all the reviewed subdomains. Its utility was demonstrated by using graph traversal algorithms, such as depth-first search (DFS), to perform knowledge retrieval, missing property assessment, process simulation, and exergy assessment. Finally, a case study was conducted on a sample biowaste source, Empty Fruit Bunch (EFB), in which pathways to feedstock biopolymers via pretreatment processes were assessed using exergy as a sustainability metric.
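
As a rough illustration of how depth-first search can retrieve pathways from a biowaste source to a feedstock, the following minimal sketch enumerates paths over a toy adjacency list. The graph, its node names, and the function `dfs_paths` are hypothetical and chosen only for illustration; they are not taken from the thesis KG or its schema.

```python
from typing import Dict, List, Optional

# Toy graph: biowaste source -> pretreatment process -> feedstock biopolymer.
# All node names are illustrative assumptions, not entries from the thesis KG.
GRAPH: Dict[str, List[str]] = {
    "Empty Fruit Bunch": ["alkaline pretreatment", "dilute acid pretreatment"],
    "alkaline pretreatment": ["cellulose", "lignin"],
    "dilute acid pretreatment": ["cellulose", "hemicellulose"],
    "cellulose": [],
    "hemicellulose": [],
    "lignin": [],
}


def dfs_paths(
    graph: Dict[str, List[str]],
    start: str,
    target: str,
    path: Optional[List[str]] = None,
) -> List[List[str]]:
    """Return every simple path from start to target, found depth-first."""
    path = (path or []) + [start]
    if start == target:
        return [path]
    found: List[List[str]] = []
    for neighbour in graph.get(start, []):
        if neighbour not in path:  # skip nodes already on this path (no cycles)
            found.extend(dfs_paths(graph, neighbour, target, path))
    return found


cellulose_paths = dfs_paths(GRAPH, "Empty Fruit Bunch", "cellulose")
for p in cellulose_paths:
    print(" -> ".join(p))
```

In a graph of this shape, each returned path is one candidate route that could then be scored with a metric such as exergy; the scoring step itself is not sketched here.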

To enrich knowledge within the feedstocks-to-chemicals domain, impurity prediction and chemical reaction completion workflows were developed. For impurity prediction, a transparent 14-step workflow was developed using Python and RDKit. It mines Reaxys® data to identify analogue reactions featuring interactions between functional group fragments, extracts templates from them, and applies these templates to relevant reactants. When applied to paracetamol, agomelatine, and lersivirine synthesis, the workflow consistently ranked literature-identified impurities among the top two results, achieving traceability and accuracy. For chemical reaction completion, a combined heuristic and machine learning workflow based on masked language models was developed to balance incomplete reactions and predict missing molecules. The combined workflow was able to complete 52.4% of USPTO reactions and 60.9% of Reaxys® reactions.
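
To make the idea of an incomplete reaction concrete, the sketch below computes which atoms are missing from the product side of a reaction given only molecular formulas. It is a simplified atom-balance check written for illustration, assuming plain formulas without brackets or charges; it does not reproduce the thesis workflow, which combines heuristics with masked language models. The helper names `parse_formula` and `atom_deficit` are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, List


def parse_formula(formula: str) -> Counter:
    """Count atoms in a simple formula such as 'C6H7NO' (no brackets or charges)."""
    counts: Counter = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def atom_deficit(reactants: List[str], products: List[str]) -> Dict[str, int]:
    """Atoms present on the reactant side but unaccounted for in the products."""
    lhs = sum((parse_formula(f) for f in reactants), Counter())
    rhs = sum((parse_formula(f) for f in products), Counter())
    return dict(lhs - rhs)  # Counter subtraction keeps only positive counts


# Paracetamol synthesis recorded without its by-product:
# 4-aminophenol (C6H7NO) + acetic anhydride (C4H6O3) -> paracetamol (C8H9NO2)
missing = atom_deficit(["C6H7NO", "C4H6O3"], ["C8H9NO2"])
print(missing)
```

Here the deficit corresponds to C2H4O2, the formula of acetic acid, which is the by-product omitted from the recorded reaction. A formula-level check like this only flags that something is missing; proposing the actual missing molecule is the harder step that the workflow described above addresses.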

Overall, this thesis highlights the potential of knowledge graphs to represent highly complex and interconnected domains, enabling valuable insights and early-stage decision-making in the search for sustainable pathways that integrate biowaste resources for chemical production. Additionally, it outlines strategies to enrich existing chemical reaction datasets critical to the feedstocks-to-chemicals domain. This work thus lays a strong foundation for future advancements as the proposed knowledge graph continues to scale in size.

Description

Date

2025-01-01

Advisors

Lapkin, Alexei

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Rights and licensing

Except where otherwise noted, this item's license is described as All rights reserved