Improving Abstractive Summarization and Information Consistency Assessment
Abstract
Summarization is the process of compressing a document into a shorter form that retains all the relevant information. It is useful in many applications; for example, a brief summary of an article or a podcast can help a reader decide whether it is worth reading or listening to. Summarization research is often split into two distinct areas: extractive summarization, which pulls out key phrases directly from the text, and abstractive summarization, which rephrases and condenses ideas from the text. This thesis examines abstractive summarization, as it can provide a more effective and compact summary of the information than extractive summarization. In common with other areas of natural language processing, deep learning has become the dominant technology in abstractive summarization, demonstrating significant performance gains over more traditional approaches. Initially, these neural models were trained from scratch, with randomly initialized parameters, relying solely on supervised training. As training data was usually limited, the generated summaries often lacked diversity and fluency. The recent shift to foundation models, trained on vast quantities of text using self-supervised training schemes, has addressed some of these issues, such as fluency. However, issues remain, such as how to handle long documents. A key challenge in developing abstractive summarization systems is assessment. In common with many natural language generation tasks, it is not possible to generate an exhaustive set of reference summaries, so simple lexical-matching approaches have limited accuracy. This thesis investigates both the generation and assessment of abstractive summaries. In addition, approaches motivated by summary assessment are applied to hallucination detection in large language models.
The first area examined in this thesis covers two aspects of summary generation. The first aspect is the diversity issue that arises when training a model from scratch with limited data; the second is applying foundation models to long-document summarization. Initially, recurrent neural networks (RNNs) were the dominant form of abstractive summarization model. However, these models often produce summaries with low diversity, which means they may not make full use of the information available in the source. To address this problem, this work proposes metrics based on hierarchical representations that explicitly maximize information usage from the source and encourage diversity in the summary. The second aspect addressed is improving foundation-model-based systems so that they handle long documents efficiently. The contributions in this part are three-fold. First, sentence-filtering methods for content selection are examined for both the training and inference stages. Second, this work presents design considerations for local attention aimed at improving the efficiency of transformer models; combining local attention with sentence filtering yields high-performance long-document summarization. Third, this work investigates encoder-decoder attention and demonstrates that its cost can be critical at inference time. Based on this finding, sentence structure and sparsity in the encoder-decoder attention are exploited, reducing the complexity of the attention with minimal performance degradation.
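To make the local-attention idea concrete, the sketch below builds a banded self-attention mask in which each token attends only to its neighbours within a fixed window. The window size, the local_attention_mask and masked_attention helpers, and the single-head formulation are illustrative assumptions; the thesis' actual design choices, and their combination with sentence filtering, are not reproduced here.

# Minimal sketch of banded (local) self-attention with a fixed window.
# Assumptions: single head, no batching, window size chosen arbitrarily.
import torch

def local_attention_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where position i may attend to positions j with |i - j| <= window."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window

def masked_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, window: int) -> torch.Tensor:
    # q, k, v: (seq_len, d_model)
    scores = q @ k.T / (q.shape[-1] ** 0.5)
    mask = local_attention_mask(q.shape[0], window)
    scores = scores.masked_fill(~mask, float("-inf"))  # block out-of-window positions
    return torch.softmax(scores, dim=-1) @ v

# Usage: attend over a toy sequence with a window of 2 tokens on each side.
x = torch.randn(8, 16)
out = masked_attention(x, x, x, window=2)
print(out.shape)  # torch.Size([8, 16])

Restricting attention to a band reduces the cost per layer from quadratic to roughly linear in sequence length, which is what makes long-document inputs tractable.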
The second area investigated is summary assessment. Ideally, summaries would be assessed manually; however, manual assessment is expensive, motivating the development of automatic assessment methods. There are many attributes associated with a good summary, such as fluency and consistency. Since generation systems based on foundation models are already highly fluent, this work focuses on information consistency between the source document and the summary. A question answering (QA) based approach is proposed to measure this consistency. In contrast to previous QA approaches, a multiple-choice QA framework is proposed, which allows information consistency to be approximated by computing the expected statistical distance between the answer distributions conditioned on the summary and on the source. Additionally, current large language models (LLMs) have demonstrated strong zero-shot abilities on a wide range of NLP tasks. Hence, this thesis also investigates whether LLMs can be used for summary assessment in a zero-shot manner, and proposes a comparative assessment framework.
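As an illustration of the multiple-choice QA framing, the sketch below compares, for each question, the answer distribution obtained when conditioning on the summary with the one obtained when conditioning on the source, and averages a statistical distance over questions. The choice of total-variation distance, the consistency_score helper, and the toy distributions are assumptions for illustration; in practice the distributions would come from a multiple-choice QA model scoring the candidate answers.

# Minimal sketch of consistency as an expected distance between answer distributions.
import numpy as np

def total_variation(p: np.ndarray, q: np.ndarray) -> float:
    """Total-variation distance between two discrete distributions over answer options."""
    return 0.5 * float(np.abs(p - q).sum())

def consistency_score(summary_dists, source_dists) -> float:
    """Average distance over questions; lower expected distance -> higher consistency."""
    distances = [total_variation(p, q) for p, q in zip(summary_dists, source_dists)]
    return 1.0 - float(np.mean(distances))

# Toy example: two questions, four answer options each (hypothetical probabilities).
summary_dists = [np.array([0.70, 0.10, 0.10, 0.10]), np.array([0.25, 0.25, 0.25, 0.25])]
source_dists  = [np.array([0.80, 0.10, 0.05, 0.05]), np.array([0.90, 0.05, 0.03, 0.02])]
print(consistency_score(summary_dists, source_dists))

The second question illustrates the intended behaviour: the summary-conditioned distribution is near-uniform while the source-conditioned one is confident, signalling that the summary is missing or contradicting information from the source.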
Information consistency is useful in a wide range of tasks beyond summary assessment. Another application explored in this thesis is hallucination detection in LLMs. In the context of LLMs, a hallucination is a response that contains information that is not factual with respect to real-world knowledge. This work applies information consistency methods, initially developed for summarization, to general LLM generation to enhance the reliability of LLMs. Specifically, this work proposes SelfCheckGPT, which measures the consistency between stochastically generated responses from an LLM. SelfCheckGPT does not require an external database and can be applied in a black-box manner, making it applicable in a variety of situations. The effectiveness of SelfCheckGPT is demonstrated through experiments on content hallucinated by GPT-3.
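A minimal sketch of the SelfCheckGPT idea follows: each sentence of a main response is scored against a set of stochastically sampled responses from the same LLM, and the per-sample scores are aggregated into a per-sentence hallucination score. The simple unigram-overlap scorer and the hard-coded sample responses are illustrative stand-ins; the thesis uses stronger consistency scorers, and only the model's own samples are needed (no external database).

# Minimal sketch of sample-based consistency checking for hallucination detection.
from typing import List

def unigram_consistency(sentence: str, sample: str) -> float:
    """Fraction of sentence tokens that also appear in a sampled response (toy scorer)."""
    tokens = [t.strip(".,").lower() for t in sentence.split()]
    sample_tokens = {t.strip(".,").lower() for t in sample.split()}
    return sum(t in sample_tokens for t in tokens) / max(len(tokens), 1)

def selfcheck_scores(sentences: List[str], samples: List[str]) -> List[float]:
    """Per-sentence hallucination scores in [0, 1]; higher means less supported by the samples."""
    return [
        1.0 - sum(unigram_consistency(s, sample) for sample in samples) / len(samples)
        for s in sentences
    ]

# Usage with hypothetical sampled responses from the same (black-box) LLM.
main_response = ["Paris is the capital of France.", "It has a population of 90 million."]
samples = ["Paris, the capital of France, has about 2 million residents.",
           "France's capital is Paris."]
print(selfcheck_scores(main_response, samples))

Here the second sentence receives a higher score because its claim is not supported by any of the sampled responses, which is the behaviour the consistency check relies on to flag likely hallucinations.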