October 10, 2017

«Mix Multiple Features to Evaluate the Content and the Linguistic Quality of Text Summaries»



Samira Ellouze, Maher Jaoua and Lamia Hadrich Belguith
«Mix Multiple Features to Evaluate the Content and the Linguistic Quality of Text Summaries»

CIT. Journal of Computing and Information Technology, vol. 25, no. 2 (2017)

CIT. Journal of Computing and Information Technology | University of Zagreb | Faculty of Electrical Engineering and Computing | CROATIA


Excerpt of Sections 1, 2 and 6, on pages 149-152 and 163-164 of the PDF publication. See the references in the original publication of the text.




«1. Introduction

»With the significant increase in automatic summarization systems, text summary evaluation has become an absolutely necessary task that guides the development of suitable summarization approaches. However, it is a complex task. In fact, its complexity comes from the unclear definition of summarization properties: what represents a “good” summary? It is in this context that several studies have been conducted to develop manual and automatic evaluation metrics for text summaries.

»These metrics can be divided into intrinsic and extrinsic metrics. Because of the importance of evaluating summarization systems, many evaluation conferences have been organized over the last two decades, such as SUMMAC, DUC (Document Understanding Conference), TAC (Text Analysis Conference), etc., to evaluate the performance of automatically generated summaries. In addition, in the TAC’2009 session, an automatic evaluation task was proposed to encourage researchers to develop automatic evaluation metrics.

»Most of the metrics previously developed for the automatic evaluation of summary content have focused on surface-level analysis (lexical or syntactic). This level does not handle language phenomena such as synonymy, generalization, specification, abbreviations, homographs, etc., in text summaries.

»For this reason, other levels of analysis need to be added to an evaluation metric. Furthermore, most works have focused on the evaluation of content and have more or less neglected the evaluation of linguistic quality, even though [1] pointed out the importance of this quality for reading and understanding a text summary easily. In fact, a text summary without reference resolution, with redundant information or with errors in sentence structure cannot be understood. It is within this framework that we target both types of evaluation as our field of study, while trying to address some aspects of the semantic level.

»The initial idea builds on the experiments conducted by [2] and [3], who tried to combine automatic metrics so that they correlate better with manual metrics. The objective is therefore to build models able to predict a manual content metric, and others able to predict a linguistic quality metric, by combining automatic metrics and features defined on the candidate summary. Choosing to combine these features as a strategy has a number of advantages.

»For instance, one can benefit from content features that operate on different levels of analysis. In addition, linguistic quality aggregates several linguistic aspects such as structure and coherence, grammaticality, focus, etc. These aspects cannot be handled with a single simple metric, which is why we have used a combination of features.

»The combination of features is performed using two machine learning techniques, regression and classification, which allow us to predict the PYRAMID score and the linguistic quality score, respectively, for unseen summaries and summarization systems.
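
As a rough illustration of this two-sided setup, and not the authors' exact pipeline, the sketch below assumes a feature matrix has already been extracted for each candidate summary and uses scikit-learn to fit a regressor for the continuous PYRAMID score and a classifier for the discrete linguistic quality score. All data, feature counts and model choices are placeholders.

```python
# Sketch only: assumes features have already been extracted for each summary.
# The feature set, models and data below are illustrative placeholders, not
# the exact configuration used in the paper.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((200, 10))               # e.g. ROUGE variants, readability scores, ...
pyramid = rng.random(200)               # manual PYRAMID scores (continuous)
ling_quality = rng.integers(1, 6, 200)  # manual linguistic quality (1-5 scale)

# Regression: predict the (continuous) PYRAMID score.
reg = RandomForestRegressor(n_estimators=200, random_state=0)
print("PYRAMID R^2:", cross_val_score(reg, X, pyramid, cv=5).mean())

# Classification: predict the (ordinal) linguistic quality score.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
print("Ling. quality accuracy:", cross_val_score(clf, X, ling_quality, cv=5).mean())
```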

»In addition, in the first step, linguistic quality was evaluated using a single predictive model, without taking into account the variation in topic across the collections of source documents. However, a variation in topic leads to a variation in writing style, vocabulary, sentence structure, sentence length, etc., which may influence the performance of the predictive model.

»For this reason, in the second step, we evaluated the linguistic quality by building a predictive model for each collection of source documents.

»The rest of this paper is structured as follows: In Section 2, we present the principal works that have addressed the problem of the evaluation of the content and the linguistic quality of a summary. Then in Section 3, we explain the proposed method which is based on machine learning techniques. In Section 4, we give the details of each machine learning step. In Section 5, we present our experiments in different summarization tasks and different levels of evaluation, and then we discuss the obtained results.



»2. Previous Work

»In this section, we describe the principal related works that deal with the evaluation of the content and the linguistic quality of a text summary.


»2.1. Content Evaluation

»The summary evaluation task started with the manual comparison of peer summaries with reference summaries. Achieving this task was an arduous and costly process. One of the first and best-known tools for manual summary evaluation is SEE (Summary Evaluation Environment) [4]. It allowed human judges to manually evaluate the content and the linguistic quality (i.e. grammaticality, cohesion, coherence, etc.) of a summary.

»To evaluate content, human judges compared a candidate summary (system summary) to an ideal summary. Later, [5] proposed PYRAMID, a manual metric based on identifying the ideas shared by a candidate summary and one or several reference summaries. These ideas are represented as semantic information units called Summary Content Units (SCUs). The PYRAMID metric was used by the TAC and DUC conferences to evaluate the content of candidate summaries. Several automatic metrics have since been proposed to address the cost/time problem posed by manual metrics.
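
To make the PYRAMID idea concrete, here is a minimal sketch of how the (original) PYRAMID score can be computed once the SCUs have been annotated; the manual SCU identification step is simply assumed as given input, and the weights below are invented.

```python
# Minimal sketch of the original PYRAMID score computation.
# SCU annotation is a manual step; here the SCU weights (number of reference
# summaries expressing each SCU) and the SCUs found in the candidate are given.
def pyramid_score(scu_weights, candidate_scus):
    """scu_weights: dict SCU id -> weight; candidate_scus: SCU ids in the candidate."""
    observed = sum(scu_weights[scu] for scu in candidate_scus if scu in scu_weights)
    # Optimal score: the best possible weight sum using the same number of SCUs.
    best = sorted(scu_weights.values(), reverse=True)[:len(candidate_scus)]
    optimal = sum(best)
    return observed / optimal if optimal else 0.0

weights = {"scu1": 4, "scu2": 3, "scu3": 2, "scu4": 1}   # four reference summaries
print(pyramid_score(weights, ["scu1", "scu3"]))          # 6 / 7 ≈ 0.857
```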

»One of the well-known metrics in automatic text evaluation is ROUGE [6]. It measures the number of overlapping units between a candidate summary and reference summaries. There are many variants of the ROUGE metric, which differ according to the chosen unit of comparison between the candidate summary and the reference summaries: n-grams (ROUGE-N), word sequences (ROUGE-L, ROUGE-W), and word pairs (ROUGE-S).
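
A simplified, recall-oriented version of ROUGE-N can be written in a few lines; the official toolkit additionally handles stemming, stopword removal, jackknifing over multiple references, and the other variants mentioned above.

```python
# Simplified ROUGE-N (recall-oriented n-gram overlap), for illustration only.
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, references, n=2):
    cand = ngrams(candidate.lower().split(), n)
    overlap = total = 0
    for ref in references:
        ref_counts = ngrams(ref.lower().split(), n)
        overlap += sum(min(c, cand[g]) for g, c in ref_counts.items())
        total += sum(ref_counts.values())
    return overlap / total if total else 0.0

refs = ["the cat sat on the mat"]
print(rouge_n("the cat lay on the mat", refs, n=2))   # 3 of 5 reference bigrams matched
```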

»Afterwards, [7] proposed a new metric called BE (Basic Elements), which operates at the semantic level rather than at the shallow or surface level used by the ROUGE metric. Each summary requires a deep (semantic) analysis to decompose each of its sentences into minimal semantic units called Basic Elements (BEs). The final score relies on the overlap of BE units between a candidate summary and the reference summaries.

»Later, Giannakopoulos et al. [8] introduced the AutoSummENG metric, which is based on the statistical extraction of textual information from the summary. The information extracted from the summary represents a set of relations between the summary’s n-grams. A graph is constructed including the full set of relations and additional information concerning these relations.

»The similarity degree is estimated by comparing the graph of the candidate summary with the graph of each reference summary. Finally, the average similarity degree between the candidate summary and all the reference summaries is taken as the overall score of the candidate summary. In a subsequent work, Giannakopoulos and Karkaletsis [9] presented the Merge Model Graph (MeMoG), another variation of AutoSummENG based on n-gram graphs. This variation computes the merged graph of all the reference summaries and then the similarity degree between the candidate summary graph and that merged graph.
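
The n-gram graph idea can be illustrated with a deliberately simplified sketch: the real AutoSummENG metric works on weighted character n-gram graphs and a "value similarity", whereas this toy version keeps only unweighted word-bigram co-occurrence edges and compares edge sets with a Jaccard overlap. For a MeMoG-like variant, the reference graphs would first be merged (unioned) and the candidate graph compared against that merged graph.

```python
# Very simplified n-gram-graph comparison in the spirit of AutoSummENG/MeMoG.
def ngram_graph(text, n=2, window=3):
    """Build a set of edges between word n-grams that co-occur within a window."""
    toks = text.lower().split()
    grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    edges = set()
    for i, g in enumerate(grams):
        for h in grams[i + 1:i + 1 + window]:   # n-grams appearing nearby
            edges.add(frozenset((g, h)))
    return edges

def graph_similarity(g1, g2):
    """Jaccard overlap of edge sets (a crude stand-in for value similarity)."""
    return len(g1 & g2) / len(g1 | g2) if g1 | g2 else 0.0

cand = ngram_graph("the economy grew slowly last year")
ref = ngram_graph("the economy grew very slowly last year")
print(graph_similarity(cand, ref))
```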

»In a recent work, [10] developed the SIMetrix measure, which assesses a candidate summary by comparing it with the source documents instead of with reference summaries. SIMetrix is a fully automatic metric that does not depend on reference summaries. [10] computed ten similarity measures based on the comparison between the source documents and the candidate summary, including cosine similarity, Jensen-Shannon divergence and Kullback-Leibler divergence.
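
One of the SIMetrix-style measures, the Jensen-Shannon divergence between the unigram distribution of the source documents and that of the candidate summary, can be sketched as follows; the smoothing choice and the other nine measures are omitted, and the texts are toy examples.

```python
# Sketch of one SIMetrix-style measure: Jensen-Shannon divergence between the
# unigram distributions of the source documents and the candidate summary.
import math
from collections import Counter

def unigram_dist(text, vocab):
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    # Simple add-one smoothing over a shared vocabulary.
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}

def js_divergence(p, q):
    m = {w: 0.5 * (p[w] + q[w]) for w in p}
    kl = lambda a, b: sum(a[w] * math.log2(a[w] / b[w]) for w in a if a[w] > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

source = "the central bank raised interest rates to curb inflation this year"
summary = "the bank raised rates to fight inflation"
vocab = set(source.lower().split()) | set(summary.lower().split())
# Lower divergence = summary distribution closer to the source documents.
print(js_divergence(unigram_dist(source, vocab), unigram_dist(summary, vocab)))
```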

»In a more recent work, [11] developed the SERA (Summarization Evaluation by Relevance Analysis) metric, which is designed to evaluate summaries of scientific articles. This metric relies on the relevant content shared by a candidate summary and reference summaries. [11] used an information-retrieval-based method that treats summaries as search queries and then measures the overlap of the retrieved results. A larger overlap between the candidate summary and the reference summaries indicates that the candidate summary has higher content quality.
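
A rough, hypothetical sketch of this retrieval-based idea is shown below: both the candidate and a reference summary are used as queries over a small document index, and the overlap of their top-k retrieved documents serves as the score. The index, the queries and k are invented placeholders, and the weighting of ranked result lists used by the actual metric is omitted.

```python
# Rough sketch of a retrieval-based comparison: summaries act as queries over
# a document index; the overlap of their retrieved results is the score.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

index_docs = [                         # toy stand-in for a domain document index
    "gene expression in cancer cells",
    "protein folding and molecular dynamics",
    "deep learning for medical imaging",
    "statistical analysis of clinical trials",
]
vectorizer = TfidfVectorizer().fit(index_docs)
doc_matrix = vectorizer.transform(index_docs)

def top_k(query, k=2):
    """Indices of the k documents most similar to the query."""
    sims = cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]
    return set(np.argsort(sims)[::-1][:k])

candidate = "cancer cell gene expression study"
reference = "analysis of gene expression in tumour cells"
retrieved_c, retrieved_r = top_k(candidate), top_k(reference)
print(len(retrieved_c & retrieved_r) / len(retrieved_c | retrieved_r))
```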

»Looking at all of the previously cited metrics shows that each metric uses only one level of comparison (lexical, syntactic, semantic, etc.), while combining several comparison levels may overcome the limits of each individual metric. In addition, combining scores that rely on comparing candidate summaries with reference summaries and scores that rely on comparing candidate summaries with source documents can overcome the limits of each type of comparison.

»For instance, it is preferable to compare texts of similar length, but reference summaries cannot always cover all the formulations of the important ideas presented in the source documents. Conversely, comparing texts that differ greatly in length remains a difficult task.


»2.2. Linguistic Quality Evaluation

»Linguistic quality is an important factor in assessing the quality of a summary. Indeed, good linguistic quality makes a summary easy to read and understand. In the TAC conference, linguistic quality was based on the combination of five aspects, namely structure and coherence, grammaticality, non-redundancy, referential clarity and focus. During the DUC and TAC conferences, the linguistic quality of a summary was evaluated by human judges who took these five linguistic aspects into account without using reference summaries or source documents.

»Accordingly, the judges did not take into account the relationship between the summary and the source documents and were expected to assess the summary as a separate document.

»Because of the difficulty of manual evaluation, more work has been done to automate this evaluation. In this context, [12] evaluated mainly the local coherence of a summary using an entity grid model that captures the transitions of entities between two adjacent sentences. In this model, the text is represented as a matrix where each column corresponds to an entity and each row to a sentence; each cell contains the grammatical role of the entity in that sentence.

»The proposed method calculates the local coherence of the summary using the probability distribution of the entity transitions. Many other studies, such as [13] and [14], explored the entity grid model to evaluate local coherence. In addition, [15] dealt with the assessment of grammaticality and coherence in summaries.
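
A toy entity-grid computation might look like the following; the grid itself (entity roles per sentence) is assumed to have been produced beforehand by a parser and coreference resolver, which are not shown.

```python
# Toy entity-grid sketch.  Roles: S = subject, O = object, X = other, '-' = absent.
from collections import Counter
from itertools import product

grid = {                     # entity -> role in sentence 1, 2, 3 (invented example)
    "government": ["S", "S", "-"],
    "budget":     ["O", "-", "S"],
    "parliament": ["-", "X", "O"],
}

def transition_probs(grid):
    """Distribution of role transitions between adjacent sentences."""
    counts = Counter()
    for roles in grid.values():
        for a, b in zip(roles, roles[1:]):
            counts[(a, b)] += 1
    total = sum(counts.values())
    return {t: counts[t] / total for t in product("SOX-", repeat=2)}

probs = transition_probs(grid)
print(probs[("S", "S")])     # proportion of subject-to-subject transitions
```

These transition probabilities are then used as features of a coherence model rather than as a score on their own.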

»They proposed applying machine learning techniques to train a language model on a corpus of manual summaries annotated with part-of-speech and/or chunk labels. The learned model then estimates the probability of the grammatical acceptability of a sentence. To evaluate the structure and coherence of a summary, [15] built a lexical chain spread over the entire summary to represent the sequences of related words. The produced lexical chain can provide information on the focus of each sentence, which in turn contributes to the focus of the summary.
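
The grammaticality component can be illustrated with a hypothetical sketch: a bigram language model is trained over part-of-speech sequences taken from well-formed manual summaries, and new sentences are scored by their average smoothed log-probability per tag. The tag sequences below are invented, and POS tagging is assumed to have been done already.

```python
# Hypothetical sketch: bigram language model over POS sequences of manual
# summaries, used to score the grammatical acceptability of new sentences.
import math
from collections import Counter

train_tag_seqs = [                      # POS sequences of well-formed summaries
    ["DT", "NN", "VBZ", "DT", "NN"],
    ["DT", "JJ", "NN", "VBD", "IN", "DT", "NN"],
]
bigrams, unigrams = Counter(), Counter()
for seq in train_tag_seqs:
    seq = ["<s>"] + seq
    unigrams.update(seq)
    bigrams.update(zip(seq, seq[1:]))
vocab = len(unigrams)

def acceptability(tag_seq):
    """Average add-one-smoothed log-probability per tag (higher = more fluent)."""
    seq = ["<s>"] + tag_seq
    logp = sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
               for a, b in zip(seq, seq[1:]))
    return logp / len(tag_seq)

print(acceptability(["DT", "NN", "VBZ", "DT", "NN"]))   # well-formed pattern
print(acceptability(["VBZ", "DT", "DT", "NN", "IN"]))   # less typical pattern
```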

»Besides, [16] attempted to predict each of the five linguistic aspects mentioned previously. They identified several linguistic features that were grouped into classes. Then, they tried to identify the best class of features for each linguistic quality aspect.

»Next, for each aspect, they built a model from each class of features. Finally, they built a meta-ranker for each aspect by combining the predicted scores from each model related to this aspect. Also, [3] evaluated summaries by constructing predictive models for overall responsiveness, PYRAMID and linguistic quality using a combination of content scores based on bi-grams and others related to linguistic quality.

»To build the predictive model, [3] tested three regression methods, namely canonical correlation, Robust Least Squares, and Non-Negative Least Squares. On the other hand, the CREMER metric [17] combines a content metric named TESLA-S [17] and a linguistic quality metric called DICOMER [17] to predict overall responsiveness.

»Some works have tried to predict overall responsiveness scores using a combination of content scores and linguistic quality features, but none has combined them to predict a linguistic quality score.



»6. Conclusion and Future Work

»In this paper, we presented a method of content and linguistic quality evaluation for text summaries.

»Our work has been motivated by the lack of efficient and accurate automatic tools that evaluate the content and the linguistic quality of a summary.

»The proposed method is based on the construction of models that combine selected features which come from a large set of features that cover several linguistic aspects and several types of overlap between the candidate summary and reference summaries or source documents. For both scores, the combination of features is performed by testing many single and ensemble learning classifiers.

»We have evaluated our method at two levels of granularity, the system level and the summary level, and in two evaluation tasks, the initial summary task and the update summary task.

»At the summary level, we have noticed that the model built using selected features that evaluate the content has the best correlation with the PYRAMID score (0.7906 for the initial summary task).

»Furthermore, for linguistic quality evaluation in both tasks, we also noted that the predictive model built using selected features has the best accuracy (51.0081% for the initial summary task) compared to the baselines. In addition, we built 48 models for the 48 collections: one model for the summaries of each collection. We noticed that the accuracy of the models increased for most collections; the best accuracy is 87.0968%, obtained with collection 34 in the initial summary task and with collection 32 in the update summary task.

»This increase confirms our assumption that each collection has its own specificities (writing style, sentence length, sentence complexity, etc.), since each one covers a different topic.

»In the system-level evaluation, for a specific task and a predicted score Score_system (content or linguistic quality score), we calculated the average of the predicted score values of all the summaries produced by the same summarization system. In both tasks, the average of the predicted content scores of each system, Score_system, correlates best with the PYRAMID score.
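
A minimal sketch of this system-level aggregation, with invented numbers, is shown below: the predicted scores of each system's summaries are averaged into Score_system and then correlated with the manual PYRAMID scores.

```python
# Sketch of the system-level aggregation: average each system's predicted
# summary scores, then correlate the per-system averages with the manual
# PYRAMID scores.  All values here are invented placeholders.
from statistics import mean
from scipy.stats import pearsonr

predicted = {                      # system -> predicted scores of its summaries
    "systemA": [0.42, 0.51, 0.47],
    "systemB": [0.35, 0.30, 0.33],
    "systemC": [0.60, 0.58, 0.64],
}
manual_pyramid = {"systemA": 0.48, "systemB": 0.31, "systemC": 0.62}

systems = sorted(predicted)
score_system = [mean(predicted[s]) for s in systems]   # Score_system per system
gold = [manual_pyramid[s] for s in systems]
print(pearsonr(score_system, gold)[0])                 # system-level correlation
```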

»Likewise, Score_system correlates better with the manual linguistic quality score than the baselines do. In both tasks, there is a large gap between the correlation of Score_system with the manual linguistic quality score and the correlation of the baselines with that score. Indeed, in both tasks and for both scores, our method performs well compared to the baselines.

»All the obtained results show that, first, combining content and linguistic quality features to predict the PYRAMID and linguistic quality scores can give better performance than single metrics such as content scores (ROUGE, BE, etc.) or linguistic quality scores (SMOG, FOG, etc.).

»Second, we can affirm that adding linguistic features for the prediction of the content score, or content scores for the prediction of linguistic quality, also improves the prediction of the two manual scores, PYRAMID and linguistic quality. This means that there is a relation between the evaluation of content and the evaluation of linguistic quality.

»Third, we can assert that the selection of relevant features can, on the one hand, improve the performance of the predictive model and, on the other hand, provide faster and more cost-effective prediction models.

»As future work on linguistic quality evaluation, we aim to study why a given classifier performs well on some collections and poorly on others. In addition, we want to study the reasons for the weak accuracy and kappa values in some collections.»




