1 Introduction
The information explosion and rapid technological advances mean that critical decisions increasingly depend on reading large numbers of documents. However, reading many documents and producing summaries manually is tedious and time-consuming, which is where text summarization comes into the picture. Text summarization is an important domain in
Natural Language Processing (NLP) and
Information Retrieval (IR). Text summarization was proposed in the 1950s by Luhn [
155], who introduced an approach to extract prominent sentences from text using attributes such as word and phrase frequency. Edmundson et al. [
62] introduced a method in which salient content is extracted based on word frequency, supplemented by cue-, location-, and title-based features. Later, several studies contributed to significant advancements in the field of abstractive text summarization [
68,
72,
203,
217]. Since then, many methods have been introduced to address text summarization problems.
The text summarization task condenses a document while preserving its vital information with minimal loss, making it an effective way to filter out irrelevant content when reading documents. The task is difficult: when humans summarize a text, they read it in full, develop insight, and then reproduce its main points in their own words. Because machines lack comparable knowledge and linguistic capability, abstractive text summarization poses a challenging and nontrivial task for them. Motivated by this problem, several researchers have developed models that automatically generate summaries from documents.
Text summarization is categorized by summary generation, application, and input document type. Based on summary generation, text summarization can be extractive, abstractive, or hybrid. Extractive summarization extracts information directly from a source document without paraphrasing it, so the extracted text depends on the structure of the source document. Abstractive summarization generates short summaries by rewriting the entire document in a human-like manner; summaries written by humans are typically abstractive. Abstractive text summarization is more complex than extractive text summarization, because it requires a model to comprehend the input text and produce a summary that is not constrained by the input’s existing sentences. Hybrid text summarization combines several methods or approaches, which may be extractive, abstractive, or both, and leverages their complementary strengths to generate accurate summaries. Recent research has focused more on abstractive text summarization, because it is more challenging than extractive text summarization [
80].
Different summarization methods are used to generate summaries depending on the user’s application. These include query-focused, generic, and controlled summarization. Query-focused summarization generates summaries according to the user’s query and is used in applications such as question answering systems and search engines. Studies such as References [
23,
178,
273] have focused on query-focused summarization and its applications. In contrast, generic summarization generates a summary from the general content of a document without focusing on specific aspects; it is useful when a user requires a general overview of the document. Controlled summarization [
65,
88] is another type of summarization method that generates a summary based on predetermined rules that control the generated summary length, tone, or style. In addition, there are a few variants of controlled summarization [
36,
60] in which additional constraints provide finer control over the generated summaries.
Reinforcement learning-based summarization generates informative, fluent, and cohesive summaries that are accurate and contextually relevant, and it is particularly useful for summarizing news articles. Aspect-based summarization [
53] focuses on specific aspects of the content. For example, when summarizing product reviews, aspect-based summarization concentrates on precise details of the product, such as its features, design, and performance, helping the user quickly understand the product’s strengths and weaknesses. Opinion summarization [
8,
26] summarizes the general sentiments and attitudes expressed in large volumes of text, such as customer reviews, social media posts, and online comments. It is helpful for purposes including market research, customer feedback analysis, and social media monitoring, as it enables businesses to quickly spot patterns and trends in the attitudes and opinions of their target markets. Knowledge-based summarization summarizes a given text using external knowledge sources, such as knowledge graphs, ontologies, or domain-specific knowledge bases, and is useful for generating domain-specific summaries, for example, technical summaries of medical documents.
Text summarization also supports multi-lingual summarization [
4,
33], in which summaries of the input text are generated in different languages, enabling users to access information in their language of interest. While the base models for text summarization remain consistent across languages, the primary variation lies in the methods used for training and fine-tuning. Cross-lingual text summarization [
253] requires creating a summary in one language for a source article written in another language [
17,
190,
296]. It is helpful when data need to be communicated in different languages to reach a wider audience, such as news stories, social media posts, research publications, and business reports.
A hierarchical summarization technique involves creating a multi-level summary of a given document. This method generates a summary that encapsulates essential information at several levels of abstraction, from a high-level overview to more specific details. Extensive collections of papers and articles can be organized and summarized in content management systems using hierarchical summarization, making it more straightforward for users to access critical information. Abstractive dialogue summarization [
141,
260] aims to generate a clear and readable summary of conversational exchanges between two or more speakers. This technique can be used in many contexts such as customer support chats, corporate meetings, and interviews, where condensing lengthy discussions can save time. When labeled training data are unavailable or when dealing with input from a different language, unsupervised summarization [
117,
261] is used. It is applied in various fields such as social media analysis, scientific article summarization, and news summarization.
Based on the input document type, text summarization is classified into single-document and multi-document summarization. Single-document summarization generates a summary that captures the gist of a single document. Multi-document summarization [
186] is a technique in which summaries are generated from different source documents that discuss the same topic.
Long-document and short-document summarization are two distinct ways of summarizing documents of different lengths. Long-document summarization [
111] compresses a long document or a vast corpus of text into a more concise version containing the key points, usually by selecting the essential content and identifying key terms. Several studies [
92,
160] have proposed efficient methods for long-document summarization. In contrast, short-document summarization condenses a shorter text into a brief, readable summary, which is useful for extracting relevant information from sources such as news articles and blogs.
Humans summarize a source article by reading it, thoroughly understanding its context, and rephrasing it comprehensively in their own words. Summaries written by authors can sometimes be biased and may not capture the necessary information from the source document. Moreover, manually summarizing a large text corpus is time-consuming and impractical because of the sheer amount of available data. This has created demand for computational systems that can process vast amounts of data quickly and efficiently, making them important for NLP tasks such as text summarization. Abstractive text summarization algorithms aim to generate comprehensive and precise summaries with less bias while preserving the essence of the original text.
Although state-of-the-art (SOTA) abstractive summarization models generate concise summaries, they face significant challenges. Identifying and preserving the key information of a source article while generating a summary in a shorter format is the foremost challenge in this field. Data availability is another major challenge in abstractive text summarization, which can lead to overfitting or poor generalization of summarization models during training. Furthermore, abstractive text summarization tasks require computational resources to produce high-quality summaries by using neural networks with extensive parameters. Despite advancements in the text summarization domain, a few issues persist, such as hallucinations and bias in the generated summaries.
These challenges emphasize the complexity of abstractive text summarization and highlight the need for a systematic literature review (SLR) in this area. The results of an SLR are transparent and accountable and can minimize bias compared with traditional literature reviews [
188], owing to its use of explicit scientific methods and a systematic, clearly documented process.
The literature on text summarization indicates significant advancements in this field, as highlighted by key reviews and surveys [
1,
6,
259,
278]. These studies have primarily focused on extractive and abstractive methods and their respective evaluation metrics. Despite the comprehensive coverage, only a few studies [
7,
80,
168,
237] have exclusively targeted the domain of abstractive text summarization. This gap poses challenges for new researchers, because existing reviews lack a systematic approach, complicating the understanding of the domain. While there are existing SLRs in the domain of abstractive summarization, such as References [
105,
180,
199], these SLRs often cover only a limited scope of the literature and may not delve deeply into the core aspects of abstractive summarization. Many of these SLRs are constrained by their specific scope, selection criteria, and time frame, as well as by the rapidly evolving nature of the field, which continuously introduces new methodologies and challenges that may not have been thoroughly analyzed at the time of publication. Given these limitations, there is a significant need for a more comprehensive SLR that thoroughly analyzes the core aspects of abstractive summarization, such as datasets, evaluation metrics, approaches, and methods.
As the field of abstractive summarization continues to evolve, researchers are attempting to refine existing technologies to overcome the limitations of earlier models and improve the reliability and efficiency of summarization methods. This systematic approach ensures that advancements in the field are built on a comprehensive understanding of past and current methodologies, thereby enabling more effective innovations and addressing persistent challenges more strategically. In addition, this SLR provides a generic conceptual framework and guidelines for abstractive summarization and serves as a practical guide for researchers to select appropriate summarization models, which is vital for optimizing performance in real-world applications.
The remainder of this article is organized as follows: Section
2 presents the research gaps, objectives, and questions. In Section
3, the results are described and each research question is answered. Section
4 concludes the article with a discussion of future research.
2 Research Framework and Questions
This SLR conducted an in-depth review of 226 papers, positioning it as one of the most comprehensive reviews in single-document abstractive text summarization. The detailed review methodology, including the selection criteria, selection process, quality assessment, and data synthesis, is provided in the Electronic Supplement.
In abstractive text summarization, a significant research gap exists because datasets are not reported and evaluated consistently. This variability undermines the reproducibility and reliability of research findings and impedes scientific progress. Systematic documentation and evaluation of existing datasets are critical for understanding their limitations, enhancing their accessibility, and developing new, comprehensive datasets better aligned with the current demands of summarization tasks. In addition to the datasets, the automatic evaluation metrics used in text summarization often fall short in assessing the semantics, cohesion, consistency, relevance, and fluency of summaries. Developing a standardized set of evaluation methods will guide more refined evaluation practices and, in turn, lead to higher-performing and more creative summarization models.
Enumerating and assessing recent methodological advancements is essential to ensure that research communities are well informed about the most effective and innovative practices in this field. Despite the rapid technological advancements in abstractive summarization, developing summarization models capable of handling the complexities of natural language requires a detailed understanding of effective approaches and methods, leveraging their strengths, and addressing their weaknesses.
Observing the research gaps in the field of abstractive text summarization has laid the foundation for setting clear objectives for this SLR. The objectives of this study are as follows:
—
To systematically document and evaluate the datasets used in abstractive text summarization;
—
To comprehensively list and review the evaluation metrics for abstractive text summarization;
—
To identify and analyze the most effective approaches employed in abstractive text summarization;
—
To enumerate and assess recent methodological advancements in abstractive text summarization.
The research questions for this SLR are listed in Table
1. Research questions RQ1 to RQ3 are the main research questions, focusing on datasets, evaluation metrics, approaches, and methods. Table
2 presents the data extracted from the research papers to answer the research questions.
Based on the research gaps, objectives, and research questions, the contributions of this SLR are as follows:
—
A detailed examination and analysis of 226 original studies published between 2011 and 2023 provides a comprehensive overview of single-document abstractive text summarization.
—
A comparative and evaluative study of datasets, evaluation metrics, approaches, and methodologies is presented.
—
A generic conceptual framework for abstractive text summarization is proposed.
—
A comprehensive set of guidelines to aid in selecting the most suitable summarization techniques is presented.
3 Results
This section describes the results of the quantitative and qualitative analyses of the papers in the SLR. There are five subsections, which follow a systematic approach to answering each research question in Table
1.
3.1 Datasets in Abstractive Text Summarization - RQ1
The number of datasets in the abstractive text summarization domain has grown over the past decades in response to the field’s needs. A dataset is necessary to train, validate, and test abstractive text summarization models, including proposed SOTA methods. This SLR focuses only on studies using English datasets: although many techniques exist for datasets in other languages, their underlying architectures and techniques are similar to those used with English datasets.
Various datasets are used for abstractive text summarization. Each dataset, irrespective of its domain, contains a collection of source documents paired with human-written reference summaries, also known as gold summaries. Datasets in abstractive text summarization can be classified as general or domain-specific, regardless of whether they are public or private [
259]. Public datasets such as the CNN/DailyMail dataset [
173], Gigaword dataset [
74,
216],
Extreme Summarization (XSum) dataset [
176],
Document Understanding Conferences (DUC) dataset [
87], New York Times Annotated Corpus [
220], and Newsroom [
75] comprise general topics. Domain-specific datasets such as Amazon Fine Food Reviews [
164] and BigPatent [
226] include data on specific topics. This study reviews the datasets used for abstractive text summarization between 2011 and 2023.
The following are the widely used general and domain-specific datasets in abstractive text summarization:
—
CNN/DailyMail: Of the 226 publications, 162 reported their findings on the CNN/DailyMail dataset, making it the most widely used dataset in abstractive text summarization. The dataset was originally built for reading comprehension: the human-written highlights accompanying CNN and Daily Mail news articles were converted into questions, with the stories serving as the relevant passages, and these highlights now act as abstractive reference summaries. The dataset contains 286,817 training pairs, 13,368 validation pairs, and 11,487 testing pairs (a loading sketch is shown after this list).
—
Gigaword dataset: Forty-eight of the 226 publications reported their findings on the Gigaword dataset, which consists of approximately four million news articles paired with their headlines: 3,803,957 training pairs, 189,651 validation pairs, and 1,951 testing pairs.
—
XSum: The Extreme Summarization (XSum) dataset was used by 57 publications. It contains 226,711 BBC news articles, each paired with a single-sentence summary that captures the gist of the article, making the dataset particularly suited to abstractive summarization. The official random split contains 204,045 (90%) training, 11,332 (5%) validation, and 11,334 (5%) test documents.
—
DUC dataset: The Document Understanding Conferences (DUC) datasets were used by 42 of the 226 publications, most commonly DUC 2002, DUC 2003, DUC 2004, and DUC 2006. Each document collection in DUC provides four to five additional human-written “reference” summaries for each source document. Because the DUC datasets contain fewer samples than other datasets, many summarization models use them only as test sets.
—
New York Times Annotated Corpus: Eleven of the 226 publications used the New York Times Annotated Corpus. This dataset contains over 1.8 million articles written and published by the New York Times, of which 650,000 form article-summary pairs with summaries written by library scientists.
—
Newsroom: The Cornell Newsroom summarization dataset was used by 12 studies. It contains 1.3 million articles and summaries authored by journalists and editors working in the newsrooms of 38 major publications, and its summaries exhibit a diverse range of summarization styles, from extractive to abstractive.
—
Amazon Fine Food Reviews: Two publications used this domain-specific summarization dataset. This dataset contains approximately half a million food reviews from Amazon. The reviews include product and customer information, ratings, and textual reviews.
—
BigPatent: Four publications used this domain-specific summarization dataset. This dataset contains 1.3 million US patent filing records and abstractive summaries written by humans.
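To make the structure of these corpora concrete, the following minimal sketch loads CNN/DailyMail with the Hugging Face `datasets` library and prints the split sizes and one article-summary pair; the dataset identifier and the "3.0.0" configuration name are assumptions that may differ across library versions.

```python
# Minimal sketch: inspecting CNN/DailyMail with the Hugging Face `datasets`
# library (dataset and config names are assumptions; versions may differ).
from datasets import load_dataset

cnn_dm = load_dataset("cnn_dailymail", "3.0.0")
print({split: len(cnn_dm[split]) for split in cnn_dm})  # ~287k/13k/11k pairs

example = cnn_dm["train"][0]
print(example["article"][:300])   # source news story
print(example["highlights"])      # human-written reference (gold) summary
```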
Across the 226 publications we reviewed, 61 distinct datasets were reported; however, only 28 of these 61 datasets were used more than once. Nevertheless, a few publications reported their findings on multiple datasets, making it easier for researchers in this field to compare the efficiency of their models. Curating a dataset for abstractive text summarization is tedious and expensive: unlike extractive text summarization, abstractive text summarization requires abstractive human-written summaries.
Even though there are many abstractive text summarization datasets, the trend in this field is to use publicly available news-article-based datasets such as CNN/DM and XSum, or domain-specific datasets. Publications reporting their findings on widely used datasets help researchers in this field compare their abstractive text summarization models with benchmarks. However, one potential concern is that the number of training samples in these datasets is far smaller than the number of trainable parameters in recently introduced SOTA abstractive text summarization models. This issue may also contribute to problems such as hallucinations in the generated summaries [
28]. Even when models capture English sentence structure well thanks to pre-training, they still need increasingly large training sets to perform better on downstream tasks such as abstractive text summarization.
This review examined several datasets used in abstractive text summarization. The datasets reviewed included CNN/DailyMail, Gigaword, XSum, DUC, New York Times Annotated Corpus, Newsroom, Amazon Fine Food Reviews, and BigPatent. A comparative analysis and evaluation of the datasets based on the annotation process, diversity, and coverage, the model types they are suited for (extractive, abstractive, or both), and potential biases are presented in Table 5 (see electronic supplement). The findings suggest significant variations in the suitability of these datasets for training abstractive text summarization models. Among the datasets evaluated, CNN/DailyMail and XSum stood out as particularly robust for several reasons.
—
Wide Usage: Both datasets are extensively used in the field of abstractive text summarization, underscoring their reliability.
—
Quality: CNN/DM offers high-quality summaries that challenge models in capturing nuanced content. XSum, known for its concise one-sentence summaries, requires models to perform significant abstraction, making it a rigorous test for summarization capabilities.
—
Diversity and Coverage: CNN/DM covers many news topics and provides diverse articles for model training. Although XSum primarily includes BBC news articles, it presents a broad topic spectrum, supporting an effective model generalization.
—
Benchmarking and Performance Evaluation: Both datasets serve as standard benchmarks in the field, enabling SOTA summarization models to be compared against established performance baselines.
Based on the criteria of wide usage, quality of summaries, diversity, and their role in benchmarking, CNN/DailyMail and XSum are currently the best datasets available for abstractive text summarization. Their widespread adoption and the challenges they pose to summarization models make them ideal for developing new models and improving existing ones.
3.2 Evaluation Metrics in Abstractive Text Summarization - RQ2
Evaluation metrics assess generated summaries and determine the performance of an abstractive summarization model, typically by comparing the system-generated summary with the gold summary from the summarization dataset. The comparison may consider surface words, their lemmatized forms, and grammatical structure. In abstractive text summarization, evaluation centers on whether the critical keywords and crucial information of the gold summary appear in the generated summary, and results are mainly reported as precision, recall, and F1 measures. The studies published between 2011 and 2023 have proposed many automatic evaluation metrics.
The following are the widely used evaluation metrics in abstractive text summarization:
—
ROUGE: Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [
138] is one of the standard automatic evaluation metrics for abstractive text summarization and is the most widely used in the field. Of the 226 publications, 212 used the ROUGE metric to evaluate summaries. ROUGE scores a system-generated summary by the number of overlapping \(n\)-grams between the system-generated and gold summaries and reports precision, recall, and F1. The choice among ROUGE-1, ROUGE-2, and ROUGE-L depends on the specific needs of the summarization task, as each variant targets a different aspect of the summaries: ROUGE-1 assesses unigram overlap, ROUGE-2 examines bigram overlap, and ROUGE-L evaluates the longest common subsequence. It is common practice to report multiple ROUGE scores to evaluate a summarization system comprehensively (a toy ROUGE-N computation is sketched after this list).
Variants of ROUGE evaluation metrics, such as ROUGE-WE [
179], use word-embedding-based similarity metrics with a lexical matching method. ROUGE 2.0, which uses a synonym dictionary based on WordNet [
166], and ROUGE-G [
224] use semantics and lexical matching from WordNet to evaluate abstractive text summaries.
—
Human Evaluation: Human evaluation [
70,
95] is considered an ideal evaluation method for abstractive text summarization. Human annotators are given the source documents together with the gold and system-generated summaries, and they score the system-generated summaries on quality criteria such as relevance, fluency, and consistency. Sixty-one publications used human evaluation, and many studies combined ROUGE with human evaluation to obtain the desired assessments.
—
METEOR: The
Metric for Evaluation of Translation with Explicit Ordering (METEOR) [
21] was originally introduced for machine translation; because abstractive text summarization is formulated similarly as a text-to-text generation task, METEOR has also been used to evaluate summaries. METEOR uses the harmonic mean of precision and recall, with recall weighted more heavily than precision. Thirteen of the 226 publications used the METEOR evaluation metric.
—
BLEU: Bilingual Evaluation Understudy (BLEU) [
192] is an evaluation metric proposed for machine translation and later used for abstractive text summarization. Modified
\(n\)-gram precision, combined with a brevity penalty derived from the best match length, is used in BLEU to compensate for the absence of an explicit recall term. Nine of the 226 publications used the BLEU evaluation metric.
—
BERTScore: BERTScore [
286] is an automatic evaluation metric for text generation that compares the contextual embeddings of the tokens in the generated and gold summaries and scores the summary based on these token similarities. Twelve of the 226 publications used the BERTScore evaluation metric.
—
MoverScore: MoverScore [
290] is an evaluation metric that encodes the gold and generated summaries with contextualized embeddings and measures the Earth Mover’s Distance between them; it shows a high correlation with human judgments of text quality. Two studies have used this evaluation metric.
—
QAGS: Question Answering and Generation for Summarization (QAGS) [
249] is an automatic evaluation metric that measures the factual consistency of system-generated summaries; it was proposed because summaries generated by SOTA abstractive text summarization models tend to be factually inconsistent. Three studies used this evaluation metric. QAGS relies on
Question Answering (QA) and
Question Generation (QG) models. APES [
63] is another metric that uses a similar approach to evaluate summaries using QA systems and is a reference-free evaluation metric.
—
FactCC: Factual Consistency Checking (FactCC) [
114] is a model-based, weakly supervised evaluation metric for verifying factual consistency and identifying conflicts between source documents and generated summaries. Twelve studies used this evaluation metric. FactCC trains a BERT-based classifier to judge whether a generated summary is factually consistent with its source.
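As a concrete illustration of the \(n\)-gram overlap idea behind ROUGE (see the ROUGE item above), the following toy sketch computes ROUGE-N precision, recall, and F1 from whitespace tokens; published evaluations typically rely on standard packages that additionally handle stemming and ROUGE-L.

```python
# Toy ROUGE-N sketch: n-gram overlap between a generated and a gold summary.
# Illustrative only; standard packages add stemming, ROUGE-L, and significance tests.
from collections import Counter

def ngram_counts(text, n):
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(generated, gold, n=1):
    gen, ref = ngram_counts(generated, n), ngram_counts(gold, n)
    overlap = sum((gen & ref).values())                 # clipped n-gram matches
    precision = overlap / max(sum(gen.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

print(rouge_n("police arrested the suspect on friday",
              "the suspect was arrested by police on friday", n=1))
```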
In abstractive text summarization, generated summaries may contain novel words, phrases, and sentences that are semantically similar to the gold summaries yet share few overlapping words, phrases, and sentences with them, so overlap-based metrics can under-score genuinely good abstractive summaries. Table 6 (see electronic supplement) presents a quantitative analysis of the time complexities of the evaluation metrics used in abstractive text summarization.
Evaluating the quality of summaries generated by abstractive text summarization models can be challenging because of the subjective nature of summarization. The trend in abstractive summary evaluation has shifted from quantitative to qualitative evaluation. Traditional evaluation metrics such as ROUGE, METEOR, and BLEU quantitatively evaluate system-generated summaries by counting the overlapping \(n\)-grams between gold and system-generated summaries. In contrast, recently introduced evaluation metrics such as QAGS and FactCC qualitatively evaluate system-generated summaries by checking for factual inconsistency. However, automatic evaluation metrics still struggle to score summaries according to their semantics and abstractiveness. An efficient evaluation metric must capture the semantics, consistency, relevance, and fluency of the summaries as well as their abstractiveness, which is an inherently subjective judgment.
Because an efficient evaluation metric should capture the semantics and abstractiveness of summaries, human evaluation is considered the gold standard evaluation metric for abstractive text summarization [
95]. However, before accepting human evaluation as the gold standard, researchers must scrutinize its reliability and investigate the factors that may influence its accuracy. The proficiency of human evaluators can influence evaluation outcomes: evaluators with greater expertise in the topic of the text being summarized are more likely to give accurate and insightful scores than those without such expertise. The nature of the summaries can also affect the results of human evaluation; some summaries are straightforward to comprehend and judge, yielding higher scores, whereas others are harder to understand and evaluate, yielding lower scores. Despite the importance of expertise and summary generation methods, several challenges still arise during human evaluation, such as
—
Human evaluation of summaries generated by abstractive text summarization models can be time-consuming, expensive, and subject to bias in scoring summaries.
—
Studies employing human evaluation often fail to report crucial details, such as participant demographics, task design, and experimental protocols.
—
Human evaluators are often asked to evaluate text based on subjectivity, such as overall quality, cohesiveness, fluency, and relevancy, which can lead to inconsistencies in assessing the quality of the generated summaries.
—
Human evaluators often cannot reliably distinguish between human-generated and system-generated text [
52].
Even though there are several challenges associated with human evaluation, different solutions can improve their accuracy and reliability. These solutions are as follows:
—
Using a large and diverse group of expert annotators from crowdsourcing platforms to obtain evaluations can increase the reliability and reduce the cost of the evaluations.
—
Providing established evaluation protocols and clear instructions to annotators can help improve the consistency and accuracy of human evaluations.
—
Supplementing human evaluation with automated metrics allows a more efficient and comprehensive assessment of generated summaries.
—
Using multiple evaluation metrics and ensembling their results can provide a more comprehensive and accurate evaluation of generated summaries.
—
Incorporating user feedback allows human evaluators to refine and improve the quality of their evaluations of generated summaries over time.
These strategies and solutions can help address the challenges of human evaluation and enhance its validity and usefulness. Researchers must undertake rigorous human evaluation studies and thoroughly investigate the elements that could affect their accuracy. Nevertheless, this SLR suggests the need for a new evaluation metric for abstractive text summarization.
3.3 Approaches in Abstractive Text Summarization - RQ3
This subsection outlines the different approaches employed in abstractive text summarization, beginning with rule-based and graph-based techniques, followed by neural-network-based approaches, including CNNs, RNNs, attention mechanisms, and transformers, which are key components of many SOTA models for abstractive summarization, and concluding with hybrid approaches. These approaches are broadly categorized as general strategies that provide frameworks for summarization. Although many studies utilize similar approaches, they distinguish themselves through specific methods. This subsection therefore also discusses the different methods used in this field, evaluates SOTA abstractive summarization models, and investigates how pre-training affects the quality of their summaries.
Our SLR records from 2011 to 2023 show four main model-design approaches. The four approaches are as follows:
(i)
Rule-based approaches
(ii)
Graph-based approaches
(iii)
Neural-network-based approaches, comprising (a) CNN-based, (b) RNN-based, and (c) transformer-based architectures
(iv)
Hybrid approaches
Research on abstractive text summarization commenced with rule-based and graph-based approaches to summary generation. These were later superseded by neural-network-based approaches, which comprise CNN-, RNN-, and transformer-architecture-based models. Combinations of these approaches are curated as hybrid approaches. Table
3 lists the approaches and the number of papers using each in abstractive text summarization. The following subsections discuss each of these approaches in detail.
3.3.1 Rule-based Approach.
In a rule-based approach, the source article first undergoes preprocessing steps such as tokenization and sentence segmentation. Statistical measures or predefined linguistic rules, such as TF-IDF-based scoring, are then applied to rank the sentences of the source article, and the top-ranked sentences are chosen as the key points for summary generation (a minimal TF-IDF ranking sketch is shown below). Despite its structured nature, this method can suffer from grammatical inaccuracies and a lack of semantic comprehension. Table 7 (see electronic supplement) presents the taxonomic view of publications that used the rule-based approach.
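The following minimal sketch, using scikit-learn's TfidfVectorizer, illustrates the sentence-ranking step described above; the scoring rule (summing TF-IDF weights per sentence) and the top-k cutoff are illustrative assumptions rather than a specific published method.

```python
# Illustrative rule-based sentence ranking with TF-IDF (scoring rule and
# top-k cutoff are assumptions, not a specific published method).
from sklearn.feature_extraction.text import TfidfVectorizer

def rank_sentences(sentences, top_k=2):
    tfidf = TfidfVectorizer().fit_transform(sentences)   # one row per sentence
    scores = tfidf.sum(axis=1).A.ravel()                 # total term weight per sentence
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in order[:top_k]]

article = [
    "A new abstractive summarization model was released this week.",
    "The model improves ROUGE scores on the CNN/DailyMail benchmark.",
    "The launch event took place on a rainy afternoon.",
]
print(rank_sentences(article))
```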
3.3.2 Graph-based Approach.
In the graph-based approach, the input source article is represented as a graph, with nodes denoting the sentences of the source article and edges indicating the relationships between them. Graph algorithms such as PageRank identify the most important nodes, which represent the key points of the source article, and a summary is then generated from these key points, providing a structured portrayal of the article's main ideas (a TextRank-style sketch follows below). Although graph-based methodologies offer a systematic summarization framework, they can struggle to manage large-scale graphs and to capture semantic nuances, and explaining how such graphs identify and prioritize key points remains important for improving the effectiveness of the summarization process. Table 8 (see electronic supplement) depicts the taxonomic view of the publications using a graph-based approach.
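A TextRank-style sketch of this idea is shown below: sentences become nodes, TF-IDF cosine similarities become edge weights, and PageRank scores identify the key sentences that an abstractive stage would then rewrite. The similarity measure and library choices (scikit-learn, networkx) are assumptions for illustration.

```python
# TextRank-style key-sentence identification (illustrative assumptions:
# TF-IDF cosine similarity as edge weights, networkx PageRank for ranking).
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def key_sentences(sentences, top_k=2):
    tfidf = TfidfVectorizer().fit_transform(sentences)
    similarity = cosine_similarity(tfidf)            # sentence-to-sentence similarity
    graph = nx.from_numpy_array(similarity)          # nodes = sentences, weighted edges
    scores = nx.pagerank(graph)                      # importance of each node
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [sentences[i] for i in ranked[:top_k]]

print(key_sentences([
    "The council approved the new transit plan on Monday.",
    "The transit plan adds three bus routes across the city.",
    "Local cafes reported a busy weekend.",
]))
```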
3.3.3 Neural-network-based Approaches.
Neural-network-based approaches, such as CNN, RNN, and transformer architectures, are preferred in the abstractive text summarization domain. The shift towards neural-network-based approaches is motivated by their inherent capacity to learn complex patterns and relationships within textual data, enabling them to capture semantic nuances and generate summaries with enhanced coherence and contextuality. Unlike rule-based methods, which rely on predefined linguistic rules and statistical metrics, neural-network-based models can automatically learn representations of textual data through training on large-scale datasets. Learning from data allows neural-network-based models to adapt to diverse text types and domains, thereby improving summarization performance across various contexts. Additionally, neural-network-based models, particularly transformer architectures, are capable of capturing long-range dependencies and global contextual information, surpassing the limitations of graph-based approaches in handling large-scale graphs and semantic subtleties.
Nevertheless, it is essential to recognize the computational demands inherent in neural-network-based approaches, notably transformer architectures, which often require significant computational resources and prolonged training. Additionally, the interpretability of neural-network-based models remains a persistent challenge, as their internal mechanisms frequently lack transparency, hampering the explanation of their decision-making processes. Tables 9–11 (see electronic supplement) present the taxonomic view of the publications that used the CNN, RNN, and transformer architecture approaches.
—
CNN Architecture: CNNs process input articles by first segmenting them into tokens and representing each token numerically using word embeddings. Convolutional filters are then applied to the input embeddings to extract features and recognize relevant phrases or sentences. CNNs can identify spatial hierarchies within an input by focusing on important sentences at various levels of granularity, and the gathered features are combined to summarize the most salient information from the source article. This ability of CNNs is particularly useful in abstractive summarization, where the aim is to reinterpret and condense the original material creatively.
—
RNN Architecture: In abstractive text summarization, RNNs transform input articles into a sequence of word embeddings, preserving the temporal order of words and sentences. The encoded source article is processed sequentially by the RNN, which updates its hidden state at each timestep, enabling it to retain a memory of past inputs and grasp contextual information. RNNs also incorporate attention mechanisms, enabling the model to concentrate on relevant input segments during summarization and to assign higher importance to significant words or phrases. Based on the acquired contextual representations and attention weights, RNNs generate summaries by decoding the encoded sequence into natural language text, ensuring an accurate reflection of the key ideas of the original article. This memory of preceding inputs is critical for abstractive text summarization, where the input length can vary.
—
Attention Mechanisms in Abstractive Text Summarization: Although SOTA pre-trained abstractive summarization models have successfully generated high-quality summaries, attention mechanisms have emerged as a powerful tool for improving the performance of various NLP tasks, including abstractive summarization. Many studies on CNN, RNN, and transformer architectures have utilized attention mechanisms or variants thereof to enhance the summarization process. The breakthrough research by Bahdanau et al. and Vaswani et al. [
16,
244] introduced attention mechanisms for abstractive text summarization. By allowing abstractive summarization models to focus on relevant parts of the input text while generating a summary, attention mechanisms can help handle long input sequences and improve model accuracy. These mechanisms promote a better understanding of the relationships between words and phrases, leading to a more accurate and effective summarization. There are various types of attention mechanisms, and the choice of each attention mechanism depends on the tasks and characteristics of the input sequence. Different attention mechanisms used in abstractive text summarization are discussed in detail in Electronic Supplement Section A.3.
—
Transformer Architecture: Transformer models have revolutionized the NLP domain by capturing global dependencies and contextual information. Recent studies in the abstractive text summarization domain have widely used the transformer architecture introduced by Vaswani et al. [
244] because of its flexibility and accuracy. The transformer architecture was introduced specifically to address the limitations of RNNs in capturing the long-term dependencies in sequences. This architecture comprises a stack of encoders and decoders with attention mechanisms. Summarization with transformers initiates with self and multi-head attention mechanisms, wherein each word in the input sequence attends to all others, allowing the model to assess the importance of each word relative to the overall context. The transformers followed pre-training and fine-tuning strategies. The transformer models were pre-trained on a large corpus of texts, such as the entire BookCorpus (800 million words) [
297] and English Wikipedia articles (2,500 million words), and then fine-tuned on abstractive summarization datasets. Transformers allocate greater importance to pertinent words and phrases, simplifying the extraction of salient information from the input article, and generate a summary by decoding the encoded sequence into natural language text, producing succinct and coherent summaries that capture the primary ideas and key points of the original text. Pre-training helps language models internalize English sentence structure, which in turn helps build SOTA abstractive text summarization models. Because the transformer architecture approach uses pre-training and fine-tuning strategies, it performs better than rule-based and graph-based models, and models using transformer architectures are currently the SOTA in the abstractive text summarization domain (a minimal usage sketch follows this list).
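To ground the pre-train/fine-tune strategy described in the transformer item above, the sketch below runs a publicly available BART checkpoint fine-tuned on CNN/DailyMail through the Hugging Face transformers summarization pipeline; the checkpoint name and generation lengths are illustrative choices, not recommendations from the reviewed studies.

```python
# Minimal sketch: abstractive summarization with a pre-trained, fine-tuned
# transformer (checkpoint name and length limits are illustrative).
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
article = (
    "The transformer architecture relies on self-attention, which lets every "
    "token weigh every other token in the input. Pre-training on large text "
    "corpora followed by fine-tuning on summarization datasets allows these "
    "models to produce fluent abstractive summaries."
)
result = summarizer(article, max_length=40, min_length=10, do_sample=False)
print(result[0]["summary_text"])
```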
3.3.4 Hybrid Approaches.
Hybrid summarization methodologies combine multiple techniques, including rule-based, graph-based, and neural-network-based approaches, to harness their strengths and address their limitations. By integrating complementary techniques, hybrid methodologies endeavor to enhance the overall efficiency and effectiveness of the summarization process. These techniques may employ ensemble methods to amalgamate key points identified by various components or assign weights based on their significance. Although hybrid methodologies offer versatility and adaptability, they can incur greater computational complexity and require meticulous parameter tuning. Only a few publications have used hybrid approaches to improve the efficiency of abstractive text summarization models; these combine rule-based, graph-based, fuzzy-logic, and transformer-architecture-based approaches with CNN- and RNN-based approaches. Table 12 (see electronic supplement) shows a taxonomic view of the research papers using this approach.
A comprehensive examination of abstractive text summarization approaches conducted through an SLR from 2011 to 2023 has provided invaluable insights into the field’s evolution. Among the diverse range of identified model design approaches, including rule-based, graph-based, neural-network-based (such as CNNs, RNNs, and transformer architectures), and hybrid methodologies, neural-network-based approaches have emerged as the most promising and impactful avenues for advancing the field. A comparative evaluation of these approaches highlights their distinctive strengths and limitations.
Although rule-based and graph-based methods offer structured frameworks for abstractive summarization, their limited semantic comprehension may hinder their efficacy. Neural-network-based approaches, particularly those leveraging transformer architectures, exhibit superior performance in capturing complex patterns, long-range dependencies, and semantic nuances in textual data. These approaches generate more coherent and accurate summaries because of their ability to learn and adapt to diverse linguistic contexts, thereby enhancing the effectiveness of the summarization task. Moreover, hybrid approaches that integrate techniques from rule-based, graph-based, and neural-network-based approaches present a compelling way to combine the strengths of each approach. By leveraging combined techniques, hybrid models aim to address the limitations of individual methods while improving overall summarization performance. However, hybrid approaches may increase the computational complexity and require careful parameter tuning.
In conclusion, the findings of this review highlight the key role of neural-network-based approaches, particularly those employing transformer architectures, in shaping the future of abstractive text summarization. As this field continues to evolve, these approaches will facilitate the generation of more accurate, coherent, and contextually relevant summaries across various domains and applications.
3.3.5 Implementation of Abstractive Text Summarization Approaches.
This subsection describes the implementation of various abstractive text summarization approaches used in the publications included in the SLR.
The abstractive summarization models employing a rule-based approach use methods such as pattern extraction [
197] and sentence ranking [
218]. Graph-based models use rich semantic graphs [
167] and sentence enhancement methods [
46]. The sequence-to-sequence architecture-based approaches, such as the pointer generator networks with coverage mechanisms and their variants, are also widely used in abstractive text summarization tasks [
43,
108,
223,
291]. The pointer copies words from the source text, the generator produces words from the vocabulary, and the coverage mechanism discourages repetition of words in the generated summary. Reinforcement learning is another method used by some publications [
79,
115,
193,
195] in CNN- and RNN-based approaches. It is a reward-based mechanism in which the model receives rewards for generating good summaries and penalties for poor ones, helping the model maximize the likelihood of positive outcomes.
While the foundational and subsequent specialized attention mechanisms uniquely contribute to advancing neural network architectures for abstractive text summarization, the current SOTA abstractive text summarization models employ transformer-based attention mechanisms. Building upon the dynamic attention capabilities introduced by earlier models, transformer architectures have catalyzed a paradigm shift, dramatically enhancing the scalability and processing efficiency. Incorporating self-attention and multi-head attention within these architectures allows for unprecedented processing of multiple representation subspaces simultaneously, significantly outperforming earlier models across a range of complex tasks. These attributes emphasize transformer-based models as the best current implementation of attention mechanisms, offering an unmatched performance in handling long-range dependencies and diverse linguistic features.
The transformer architecture-based approach uses pre-training and fine-tuning-based methods. The majority of the methods in the transformer architecture approach lie in proposing a new pre-training objective, such as the
masked language model (MLM) [
56],
next sentence generation (NSG) [
271], and sentence reordering [
301]. These objectives help language models learn the syntactic and structural representations of language. A few publications have also used the few-shot transfer learning method [
64], wherein abstractive summarization models can be trained using less training data.
Many publications on transformer architecture-based approaches are entirely based on pre-training and fine-tuning methods and their variants. Table 13 (see electronic supplement) provides a quantitative evaluation of the SOTA pre-trained models using ROUGE scores. In abstractive text summarization, transformer-based SOTA models follow a series of steps to produce concise and coherent summaries: preprocessing and tokenization, embedding and representation, contextual understanding and content selection, advanced pre-training strategies, model training and optimization, and summary generation and decoding strategies. These steps are crucial for analyzing and understanding how different models work. Table 14 (see electronic supplement) compares the SOTA text summarization models based on these abstraction steps. The different steps of abstraction in abstractive text summarization are as follows:
(1)
Preprocessing and Tokenization: This initial step in abstractive text summarization involves preprocessing the raw input articles. The preprocessing step includes text cleaning, which removes punctuation and converts all text to lowercase. The next step is tokenization, which segments the input text into manageable tokens. Several tokenization methods are available. The WordPiece method breaks words into meaningful subwords and is used by models such as BERT. The SentencePiece method operates directly on the raw text stream without assuming word boundaries, which enhances robustness across languages; it is used in models such as T5 and PEGASUS. Byte-Pair Encoding (BPE) iteratively merges the most frequent pairs of characters or bytes, which helps manage rare words and reduce vocabulary size; it is used in models such as BART. These preprocessing and tokenization steps ensure that the text input to the model is uniform and optimized for subsequent processing.
(2)
Embedding and Representation: Once the input text is preprocessed and tokenized, each token is converted into a numerical representation called an embedding. These embeddings are either pre-trained on large datasets to capture a wide range of semantic and syntactic properties through unsupervised learning or fine-tuned for specific tasks such as summarization to adapt the general representations to summary-specific language. These embeddings and representations set the foundation for the deeper semantic analysis that effective summarization requires.
(3)
Contextual Understanding and Content Selection: Transformer-based abstractive text summarization models feed these embeddings into attention mechanisms that dynamically assign importance to every word and phrase based on context. This step allows a deep understanding of which words or phrases must be retained in the summary. The following techniques are used for contextual understanding and content selection:
—
Scaled-dot Product Attention: This is the fundamental mechanism of the self-attention process in a transformer. The attention score is calculated by taking the dot product between the query representing the token being processed and the keys representing every token in the input, scaling by the square root of the key dimension, and applying a softmax to obtain weights over the values (a NumPy sketch is given after this list).
—
Multi-head Attention: Transformers deploy multiple attention heads in parallel. Each head performs scaled dot-product attention independently over its own learned projections, concurrently exploring different relationships and patterns in the input data. This parallel mechanism gives the model a more profound understanding of context, leading to higher-quality summaries.
(4)
Advanced Pre-training Strategies: Advanced pre-training strategies are the objectives used during pre-training that allow the model to understand linguistic structure. They include masked language modeling (MLM), contrastive learning, and task-specific objectives. The MLM strategy masks certain words in the input text and trains the model to predict the masked words, which helps it understand the context of the input text, a capability crucial for language generation. The contrastive learning strategy trains the model to distinguish between closely related and heavily altered versions of an input sequence, helping it recognize subtle changes in language. Task-specific objectives train the model directly on tasks such as next-sentence prediction (NSP) or sentence reordering, helping it better understand the narrative flow and logical coherence of text. Together, these strategies prepare transformer-based models to perform well on complex language-processing tasks.
(5)
Model Training and Optimization: Targeted strategies used during model training enhance the model’s performance and its ability to generate summaries accurately. These strategies include the loss function, adversarial training, and regularization techniques. The loss function typically uses cross-entropy loss to ensure high prediction accuracy. Adversarial training introduces minor perturbations during training to improve the robustness of the model, teaching it to handle small disturbances in the input data effectively. Regularization techniques use dropout and weight decay to prevent overfitting and improve generalization: dropout discourages co-adaptation among neurons by randomly deactivating specific neurons during training, whereas weight decay penalizes large weights to keep the model stable and simple. Together, these training and optimization strategies help train abstractive summarization models efficiently.
(6)
Summary Generation and Decoding Strategies: This phase involves refined decoding strategies and advanced generation techniques, including conditional generation, a softmax layer, beam search, top-k sampling, and nucleus sampling. Conditional generation uses the decoder to generate each word based on the previously generated words and the overall context, ensuring that the output is coherent and contextually relevant. The softmax layer is the decoder component that computes a probability distribution over the vocabulary for the next word, favoring contextually coherent continuations of the input text. Beam search considers multiple hypotheses at each step of the generation process, keeping the beam-width most probable sequences. In comparison, top-k and nucleus sampling introduce randomness by restricting the model’s choices to the k most probable next words (top-k) or to the smallest set of words whose cumulative probability exceeds p (nucleus, or top-p, sampling).
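The scaled dot-product attention described in step (3) can be written in a few lines; the NumPy sketch below is illustrative and omits masking and the learned query/key/value projections that multi-head attention adds.

```python
# Minimal sketch of scaled dot-product attention (no masking, no learned
# projections); rows of Q, K, V correspond to tokens.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # query-key similarity
    scores -= scores.max(axis=-1, keepdims=True)           # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over keys
    return weights @ V                                     # context vectors

rng = np.random.default_rng(0)
Q = K = V = rng.standard_normal((4, 8))                    # 4 tokens, 8-dim embeddings
print(scaled_dot_product_attention(Q, K, V).shape)         # (4, 8)
```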
Approaches and methods in abstractive text summarization have evolved from statistical approaches, such as rule-based and graph-based approaches, to neural-network-based approaches. Researchers have also combined statistical and neural-network-based approaches to derive the benefits of both. Building a summarization model using a neural-network-based approach such as an RNN or CNN is comparatively easy: it requires minimal computational resources, and the model parameters can be controlled during training. However, the accuracy of these models falls short of that of transformer-based approaches.
The SOTA summarization models use transformer-based approaches. Building and training a summarization model using this approach demands substantial computational resources because of the billions of parameters associated with these models. There is a need in this field for robust abstractive text summarization models that are efficient and can be trained with low computational resources and optimized parameters.
3.4 Abstractive Text Summarization using Large Language Models
Recent advancements in NLP have highlighted the growing prominence of LLMs, owing to their powerful generative capabilities and versatility in handling various tasks, including abstractive summarization. LLMs are built on top of the transformer architecture, which utilizes a self-attention mechanism, allowing models to weigh the importance of different words in a sentence, irrespective of their sequential order. This architecture facilitates a deeper understanding of the context and semantic relationships across long texts, making LLMs particularly effective for generating coherent and contextually relevant summaries. As a result, these models have become fundamental in advancing SOTA in NLP, driving innovations and applications in various NLP tasks.
Although not originally designed for summarization tasks, LLMs offered by OpenAI, Google, and Meta have shown exceptional capabilities in generating abstractive summaries. These models can effectively produce general summaries; however, achieving customized content or specific stylistic outputs generally requires methods such as fine-tuning and prompt engineering. Fine-tuning tailors the model to a precise summarization task by training on targeted datasets, promising higher-quality results and the ability to learn from more examples than can fit in a prompt. While this process delivers improved outcomes and reduced request latency, it requires substantial computational power, potentially limiting accessibility for users with constrained hardware resources. By contrast, prompt engineering requires less computational effort and involves crafting detailed prompts that direct the model’s attention to the key elements of the text to generate succinct summaries.
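As a hedged illustration of prompt engineering for summarization, the sketch below sends an instruction-style prompt to a chat-completion endpoint. It assumes the OpenAI Python SDK (v1-style client); the model name, prompt wording, and temperature are illustrative choices, not recommendations drawn from the surveyed literature.

```python
# Minimal prompt-engineering sketch, assuming the OpenAI Python SDK (v1 client)
# and an API key available in the OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

def summarize(document: str, max_sentences: int = 3) -> str:
    prompt = (
        f"Summarize the following document in at most {max_sentences} sentences. "
        "Preserve the key facts and the original tone; do not add new information.\n\n"
        f"{document}"
    )
    response = client.chat.completions.create(
        model="gpt-4",                 # any chat-capable model could be substituted
        messages=[
            {"role": "system", "content": "You are an expert abstractive summarizer."},
            {"role": "user", "content": prompt},
        ],
        temperature=0.3,               # lower temperature favors focused, faithful summaries
    )
    return response.choices[0].message.content
```

Varying only the prompt (desired length, tone, or audience) is often enough to steer the style of the summary without any model training, which is the core appeal of prompt engineering over fine-tuning.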
SOTA LLMs include OpenAI’s GPT-3.5 and GPT-4 [189], which are renowned for generating contextually rich summaries through advanced fine-tuning and prompt-engineering techniques. Similarly, EleutherAI’s GPT-Neo and AI21 Labs’ Jurassic-1 Jumbo [137] offer robust text-generation capabilities. Megatron-Turing NLG [228], a collaborative effort between NVIDIA and Microsoft, stands out for its large-scale generative abilities. Google AI’s BARD and Gemini [12] models efficiently manage various NLP tasks, whereas Meta AI’s open-source models, LLaMA [241] and LLaMA-2 [242], offer significant customization with data privacy and fine-tuning flexibility. Mistral AI’s Mistral 7B [99] and Anthropic’s models, such as Claude 3 Opus, Sonnet, and Haiku [13], emphasize safety and usability, focusing on generating secure and user-friendly text outputs, including summaries. These models collectively push the boundaries of what is achievable in automated text summarization, showcasing a range of approaches for tackling the complexities of language understanding and generation.
The use of LLMs for abstractive summarization also involves critical tradeoffs between resource accessibility and operational convenience. Open-source models, such as LLaMA-2, provide extensive customization opportunities, allowing users to tailor the models to specific needs. However, they require significant computational power for inference and fine-tuning, necessitating robust hardware setups, which can be a barrier for smaller research teams or individual researchers. In contrast, cloud-based API access to models such as GPT-4 or Gemini offers a user-friendly alternative, with the infrastructure managed by the service provider; however, this comes at the cost of subscription fees, which can accumulate significantly depending on usage volume.
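To illustrate the open-source side of this tradeoff, the following is a minimal sketch of running a LLaMA-2 chat checkpoint locally for summarization with the Hugging Face transformers library. The model identifier, prompt template, and generation settings are illustrative assumptions; access to the gated checkpoint and a GPU large enough to hold the 7B model are presumed.

```python
# Hedged sketch of local inference with an open-source model via Hugging Face
# transformers; requires accepting the LLaMA-2 license and authenticating with
# the Hugging Face Hub before the weights can be downloaded.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    device_map="auto",      # spread the model across available GPUs/CPU
    torch_dtype="auto",
)

document = "..."  # source text to summarize
prompt = f"[INST] Summarize the following text in three sentences:\n{document} [/INST]"
summary = generator(prompt, max_new_tokens=200, do_sample=False)[0]["generated_text"]
print(summary)
```

Running locally avoids per-request fees and keeps data on-premises, but the hardware, setup, and maintenance costs fall entirely on the user, which is precisely the accessibility tradeoff discussed above.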
Moreover, LLMs can generate summaries that are accurate and stylistically aligned with the source material, reflecting nuances, such as tone and sentiment. This capability to adjust tone or style according to specific requirements makes LLMs particularly valuable for diverse summarization needs. However, their “black-box” nature can obscure how these summaries are generated and pose challenges in understanding and addressing potential biases within the models.
In conclusion, although LLMs offer significant advantages in terms of quality and versatility for abstractive summarization, they also present challenges related to computational demands, cost, and transparency. As the field continues to evolve, advancements are expected to mitigate these challenges, making them more accessible and understandable to a broader range of users.
4 Conclusion
This SLR follows a systematic review protocol with predefined research gaps, questions, objectives, inclusion and exclusion criteria, quality assessment, and systematic data extraction and analysis methods. This systematic methodology ensures that the findings are transparent, minimally biased, and easily reproducible for further research. This SLR helps researchers in this domain by providing valuable insights into the comparative evaluation of datasets, evaluation metrics, approaches, and methods, highlighting emergent trends within the field. This SLR reviews 226 papers and comprehensively explores abstractive text summarization methodologies, such as rule-based, graph-based, CNN-, RNN-, and transformer-based approaches. Among these, the current SOTA systems employ transformer-based approaches, particularly owing to their ability to manage long-range dependencies and their scalability.
Tracing the progression from basic approaches, such as rule-based and graph-based methods, to the most advanced neural network approaches, such as transformers and their attention mechanisms, helps researchers understand the progress in this field and the current SOTA systems. The recent rise of LLMs such as GPT-3.5, GPT-4, LLaMA-2, Claude, and Gemini variants has revolutionized the capability to generate coherent and contextual summaries. Although these models are the current SOTA and can generate summaries tailored to the user’s needs with a desired tone and structure, they demand substantial computational resources, which limits accessibility and raises environmental concerns. This calls for model optimization and the development of efficient models that can run without heavy computational requirements. Another important concern with these LLMs is that, although their underlying architectures are known, their pre-training strategies are not fully disclosed to the research community.
This SLR also splits the entire abstractive text summarization task into several layers of abstraction and explains how SOTA summarization models work in each abstraction layer. Additionally, a generic conceptual framework for abstractive text summarization is proposed to help researchers understand the task better and to support future technological advances. Finally, guidelines are proposed for choosing the most appropriate models for abstractive text summarization, helping researchers optimize performance and efficiency in real-world applications and select effective methods based on their needs. The findings of this SLR aim to encourage a more collaborative research environment responsive to the evolving needs of the global NLP community, thus fostering continuous advancement in abstractive text summarization. This SLR fills a critical gap in the literature and catalyzes further innovation.
To advance the field of abstractive text summarization, it is crucial to develop more sophisticated and robust datasets that minimize bias and factual inconsistencies, thereby helping models reduce bias and hallucination. Additionally, new evaluation metrics should be created that assess summaries beyond simple n-gram overlap and accurately reflect human judgments by evaluating semantics, coherence, consistency, relevance, and fluency. Future research should also prioritize optimizing model architectures and training processes to require low computational resources without compromising performance. Although the current SOTA models are LLMs, their pre-training strategies are often undisclosed, posing challenges for reproducibility; a focus on reproducibility will enable researchers to harness the power of LLMs while also optimizing them. Moreover, since LLMs are generally pre-trained and not specifically built for text summarization, researchers should aim to develop LLMs tailored to this task, utilizing few-shot fine-tuning or prompting techniques.