Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to non-federal websites. Their policies may differ from this site.
-
Environmental conservation organizations routinely monitor news content on conservation in protected areas to maintain situational awareness of developments that can have an environmental impact. Existing automated media monitoring systems require large amounts of data labeled by domain experts, which is only feasible at scale for high-resource languages like English. However, such tools are most needed in the global south where the news of interest is mainly in local low-resource languages, and far fewer experts are available to annotate datasets on a sustainable basis. In this paper, we propose NewsSerow, a method to automatically recognize environmental conservation content in low-resource languages. NewsSerow is a pipeline of summarization, in-context few-shot classification, and self-reflection using large language models (LLMs). Using at most 10 demonstration example news articles in Nepali, NewsSerow significantly outperforms other few-shot methods and can achieve comparable performance with models fully fine-tuned using thousands of examples. With NewsSerow, Organization X has been able to deploy the media monitoring tool in Nepal, significantly reducing their operational burden, and ensuring that AI tools for conservation actually reach the communities that need them the most. NewsSerow has also been deployed for countries with other languages like Colombia.more » « less
-
The authors introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments aimed at improving language models. DCLM provides a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants can experiment with dataset curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline, the authors find that model-based filtering is critical for assembling a high-quality training set. Their resulting dataset, DCLM-Baseline, enables training a 7B parameter model from scratch to achieve 64% 5-shot accuracy on MMLU with 2.6T training tokens. This represents a 6.6 percentage point improvement over MAP-Neo (the previous state-of-the-art in open-data LMs), while using 40% less compute. The baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% and 66%), and performs similarly on an average of 53 NLU tasks, while using 6.6x less compute than Llama 3 8B. These findings emphasize the importance of dataset design for training LMs and establish a foundation for further research on data curation.more » « lessFree, publicly-accessible full text available April 21, 2026
-
Non-governmental organizations for environmental conservation have a significant interest in monitoring conservation-related media and getting timely updates about infrastructure construction projects as they may cause massive impact to key conservation areas. Such monitoring, however, is difficult and time-consuming. We introduce NewsPanda, a toolkit which automatically detects and analyzes online articles related to environmental conservation and infrastructure construction. We fine-tune a BERT-based model using active learning methods and noise correction algorithms to identify articles that are relevant to conservation and infrastructure construction. For the identified articles, we perform further analysis, extracting keywords and finding potentially related sources. NewsPanda has been successfully deployed by the World Wide Fund for Nature teams in the UK, India, and Nepal since February 2022. It currently monitors over 80,000 websites and 1,074 conservation sites across India and Nepal, saving more than 30 hours of human efforts weekly. We have now scaled it up to cover 60,000 conservation sites globally.more » « less
An official website of the United States government
