Evaluating Natural Language Generation Tasks for Grammaticality, Faithfulness and Diversity
Abstract
Natural language generation (NLG) plays a vital role in many applications. Evaluating the quality of generated text is crucial for ensuring the effectiveness and user satisfaction of NLG systems. With the popularisation of deep learning in recent years, many models have been reported to achieve superhuman performance on popular benchmarks. However, existing holistic benchmarks and evaluation metrics have been observed to frequently fail to assess the specific evaluation factors that are of interest to the field.
This thesis explores a diagnostic evaluation framework for assessing the grammaticality, faithfulness, and diversity (GFD) of generated text in NLG tasks. These three metrics are considered essential linguistic qualities that should be present in the outputs of NLG models. Grammaticality is examined by analysing the parsability of a sentence with a well-defined formal grammar. Faithfulness is divided into two facets: grounding faithfulness and task faithfulness. These two facets investigate how well the model outputs align with both the information provided in the input and the inherent requirements of the task. Diversity is further divided into word-level and parse-level diversity measures. In the proposed GFD framework, the evaluation of the three metrics does not require task-specific references to be constructed. By clearly defining and evaluating these generation qualities, this framework aims to provide insights into the strengths and limitations of NLG models.
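To make the two reference-free checks above concrete, the following minimal sketch illustrates the kind of measurement the framework describes: grammaticality as parsability under a formal grammar, and word-level diversity as a distinct-n ratio. The toy grammar, the use of NLTK's chart parser, and the distinct-n measure are illustrative assumptions, not the thesis's actual implementation.

```python
# Illustrative sketch only: a parsability check against a toy formal grammar
# and a distinct-n word-level diversity score. The grammar and metric choice
# are assumptions for demonstration, not the thesis's implementation.
import nltk

# A toy grammar; a real evaluation would use a well-defined, task-appropriate grammar.
GRAMMAR = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the' | 'a'
N -> 'player' | 'ball'
V -> 'kicks' | 'passes'
""")
PARSER = nltk.ChartParser(GRAMMAR)

def is_grammatical(tokens):
    """A sentence counts as grammatical if the grammar yields at least one parse."""
    try:
        return any(True for _ in PARSER.parse(tokens))
    except ValueError:
        # A token is not covered by the grammar, so no parse exists.
        return False

def distinct_n(sentences, n=2):
    """Word-level diversity: unique n-grams divided by total n-grams."""
    total, unique = 0, set()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0

outputs = [["the", "player", "kicks", "the", "ball"],
           ["a", "player", "passes", "the", "ball"]]
print([is_grammatical(s) for s in outputs])  # per-output parsability
print(distinct_n(outputs, n=2))              # corpus-level distinct-2
```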
To demonstrate the versatility of the GFD evaluation framework, three different generation tasks are explored: synthetic image captioning, football highlight generation from match statistics, and topic-shift dialogue generation. These tasks are deliberately chosen to cover a diverse range of generation scenarios. Each task provides unique grounding information and constraints that influence the generation process, which in turn creates diverse challenges for the evaluation of NLG models. Experiments on these tasks reveal the challenges of fine-grained NLG evaluation when the availability of ground-truth representations diminishes or when there is a delicate balance between input groundings and task constraints. This thesis empirically demonstrates how the GFD evaluation framework, in combination with diagnostic datasets, can provide insights into model strengths and limitations that supplement standard evaluations.