Title: The CNN-Corpus in Spanish: a Large Corpus for Extractive Text Summarization in the Spanish Language
This paper details the development and features of the CNN-corpus in Spanish, possibly the largest test corpus for single-document extractive text summarization in the Spanish language. Its current version encompasses 1,117 well-written texts in Spanish, each of which has both an abstractive and an extractive summary. The development methodology adopted allows good-quality qualitative and quantitative assessment of summarization strategies for tools developed for the Spanish language.
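The quantitative assessment such a corpus enables is typically an overlap score between a system summary and the gold extractive summary. A minimal sketch of a ROUGE-1-style recall, simplified to plain unigram overlap (no stemming or stopword handling; the example sentences are illustrative, not from the corpus):

```python
from collections import Counter

def rouge1_recall(reference: str, candidate: str) -> float:
    """Fraction of reference unigrams covered by the candidate (clipped counts)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum(min(c, ref[w]) for w, c in cand.items())
    return overlap / sum(ref.values())

score = rouge1_recall("the cat sat on the mat", "the cat lay on a mat")
# 4 of the 6 reference tokens are covered -> 4/6
```

Real evaluations would use a full ROUGE implementation (ROUGE-1/2/L with stemming options); this sketch only shows the shape of the metric.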
Award ID(s):
1842577
NSF-PAR ID:
10185298
Author(s) / Creator(s):
Date Published:
Journal Name:
Proceedings of the ACM Symposium on Document Engineering
Volume:
19
Page Range / eLocation ID:
1-4
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. This paper details the features and the methodology adopted in the construction of the CNN-corpus, a test corpus for single-document extractive text summarization of news articles. The current version of the CNN-corpus encompasses 3,000 texts in English, each of which has both an abstractive and an extractive summary. The corpus allows quantitative and qualitative assessment of extractive summarization strategies.
  2. We present the Multilingual Amazon Reviews Corpus (MARC), a large-scale collection of Amazon reviews for multilingual text classification. The corpus contains reviews in English, Japanese, German, French, Spanish, and Chinese, collected between 2015 and 2019. Each record in the dataset contains the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID, and the coarse-grained product category (e.g., 'books', 'appliances'). The corpus is balanced across the 5 possible star ratings, so each rating constitutes 20% of the reviews in each language. For each language, there are 200,000, 5,000, and 5,000 reviews in the training, development, and test sets, respectively. We report baseline results for supervised text classification and zero-shot cross-lingual transfer learning by fine-tuning a multilingual BERT model on review data. We propose the use of mean absolute error (MAE) instead of classification accuracy for this task, since MAE accounts for the ordinal nature of the ratings.
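The point about MAE versus accuracy for ordinal labels is easy to see numerically. A minimal sketch with made-up predictions (not MARC results): a model that is always off by one star scores 0% accuracy, while a model that is exactly right most of the time but occasionally off by four stars scores higher accuracy yet a worse MAE.

```python
def accuracy(y_true, y_pred):
    """Fraction of exact label matches."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error over ordinal labels."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

stars = [1, 2, 3, 4, 5, 5]
pred_a = [2, 3, 4, 5, 4, 4]  # always off by exactly one star
pred_b = [1, 2, 3, 4, 1, 1]  # exact 4 times, off by 4 stars twice

# accuracy prefers pred_b (0.667 vs 0.0), but MAE prefers pred_a (1.0 vs 1.333)
```

MAE ranks pred_a better because near-misses on an ordinal scale are less costly than large ones, which plain accuracy cannot express.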
  3. We asked whether increased exposure to iambs, two-syllable words with stress on the second syllable (e.g., guitar), by way of another language, Spanish, facilitates English-learning infants' segmentation of iambs. Spanish has twice as many iambic words (40%) as English (20%). Using the Headturn Preference Procedure, we tested bilingual Spanish- and English-learning 8-month-olds' ability to segment English iambs. Monolingual English-learning infants succeed at this task only by 11 months. We showed that at 8 months, bilingual Spanish- and English-learning infants successfully segmented English iambs, and not simply the stressed syllable, unlike their monolingual English-learning peers. At the same age, bilingual infants failed to segment Spanish iambs, just like their monolingual Spanish-learning peers. These results cannot be explained by bilingual infants' reliance on transitional probability cues to segment words in both their native languages, because statistical cues were comparable in the two languages. Instead, based on their accelerated development, we argue for autonomous but interdependent development of the two languages of bilingual infants.
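The transitional probability cue mentioned above has a simple definition: the forward probability that syllable b follows syllable a, estimated as count(a followed by b) / count(a). A minimal sketch on a toy syllable stream (the stream is illustrative, not the study's stimuli):

```python
from collections import Counter

def transitional_probs(syllables):
    """Forward TP(a -> b) = count(a immediately followed by b) / count(a)."""
    pairs = Counter(zip(syllables, syllables[1:]))
    firsts = Counter(syllables[:-1])
    return {(a, b): c / firsts[a] for (a, b), c in pairs.items()}

stream = ["gui", "tar", "ba", "by", "gui", "tar", "do", "ggy"]
tps = transitional_probs(stream)
# Within-word pair ("gui", "tar") has TP 1.0; cross-word pairs are lower,
# e.g. ("tar", "ba") has TP 0.5 -- the dip that signals a word boundary.
```

Infants are argued to exploit exactly such dips in TP to posit word boundaries, which is why the abstract rules this cue out as an explanation when TPs are matched across the two languages.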
  4. The DocEng'19 Competition on Extractive Text Summarization assessed the performance of two new and fourteen previously published extractive text summarization methods. The competitors were evaluated using the CNN-Corpus, the largest test set available today for single-document extractive summarization.
  5. Carlotta Demeniconi, Ian Davidson (Eds.)
    Multi-document summarization, which summarizes a set of documents with a small number of phrases or sentences, provides a concise and critical essence of the documents. Existing multi-document summarization methods ignore the fact that there often exist many relevant documents that provide surrounding background knowledge, which can help generate a salient and discriminative summary for a given set of documents. In this paper, we propose a novel method, SUMDocS (Surrounding-aware Unsupervised Multi-Document Summarization), which incorporates rich surrounding (topically related) documents to help improve the quality of extractive summarization without human supervision. Specifically, we propose a joint optimization algorithm to unify global novelty (i.e., category-level frequent and discriminative), local consistency (i.e., locally frequent, co-occurring), and local saliency (i.e., salient from its surroundings) such that the obtained summary captures the characteristics of the target documents. Extensive experiments on news and scientific domains demonstrate the superior performance of our method when the unlabeled surrounding corpus is utilized.