Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews.
We present an approach for estimating the fraction of text in a large corpus that is likely to be substantially modified or produced by a large language model (LLM). Our maximum likelihood model leverages expert-written and AI-generated reference texts to accurately and efficiently examine real-world LLM use at the corpus level. We apply this approach to a case study of scientific peer review in AI conferences that took place after the release of ChatGPT: ICLR 2024, NeurIPS 2023, CoRL 2023, and EMNLP 2023. Our results suggest that between 6.5% and 16.9% of the text submitted as peer reviews to these conferences could have been substantially modified by LLMs, i.e., beyond spell-checking or minor writing updates. The circumstances in which generated text occurs offer insight into user behavior: the estimated fraction of LLM-generated text is higher in reviews that report lower confidence, were submitted close to the deadline, and come from reviewers who are less likely to respond to author rebuttals. We also observe corpus-level trends in generated text that may be too subtle to detect at the individual level, and discuss the implications of such trends for peer review. We call for future interdisciplinary work to examine how LLM use is changing our information and knowledge practices.
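The corpus-level estimate described above amounts to fitting a mixture weight by maximum likelihood. The sketch below is a minimal illustration of that idea, not the paper's released code: it assumes per-document log-likelihoods under a human-written reference model (`log_p`) and an AI-generated reference model (`log_q`) have already been computed from the two reference corpora, and all names and synthetic data are illustrative.

```python
# Minimal sketch of a corpus-level mixture MLE for the fraction of LLM-modified text.
# Assumes log_p and log_q hold per-document log-likelihoods under human-written and
# AI-generated reference models; building those models is outside this sketch.
import numpy as np
from scipy.optimize import minimize_scalar

def estimate_llm_fraction(log_p: np.ndarray, log_q: np.ndarray) -> float:
    """Fit alpha in the mixture (1 - alpha) * P + alpha * Q by maximum likelihood."""
    def neg_log_likelihood(alpha: float) -> float:
        # log((1 - alpha) * p(x) + alpha * q(x)) per document, computed in log space
        mix = np.logaddexp(np.log1p(-alpha) + log_p, np.log(alpha) + log_q)
        return -np.sum(mix)

    result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
    return float(result.x)

# Synthetic example: roughly 15% of 1,000 documents are better explained by Q.
rng = np.random.default_rng(0)
n_docs = 1000
is_ai_like = rng.random(n_docs) < 0.15
log_p = rng.normal(-100.0, 2.0, size=n_docs)      # made-up human-model scores
log_q = log_p + np.where(is_ai_like, 6.0, -6.0)   # AI model fits ~15% of docs much better
print(f"estimated LLM fraction: {estimate_llm_fraction(log_p, log_q):.3f}")  # near 0.15
```

Because only the difference `log_q - log_p` enters the fit, the estimate depends on how much better the AI-generated reference model explains each document, aggregated over the whole corpus.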
- Award ID(s): 2022435
- PAR ID: 10617068
- Publisher / Repository: ICML conference
- Date Published:
- Format(s): Medium: X
- Location: https://dl.acm.org/doi/10.5555/3692070.3693262
- Sponsoring Org: National Science Foundation
More Like this
-
This study was situated within a peer review mentoring program in which novice reviewers were paired with mentors who are former National Science Foundation (NSF) program directors with experience running discipline-based education research (DBER) panels. Whether for a manuscript or a grant proposal, the outcome of peer review can greatly influence academic careers and the impact of research on a field. Yet the criteria on which reviewers base their recommendations, and the processes they follow as they review, are poorly understood. Mentees reviewed three proposals previously submitted to the NSF and drafted pre-panel reviews addressing the proposals’ intellectual merit and broader impacts, as well as their strengths and weaknesses relative to solicitation-specific criteria. After participating in one mock review panel, mentees could revise their pre-panel reviews based on the panel discussion. Using a lens of transformative learning theory, this study sought to answer the following research questions: 1) What are the tacit criteria used to inform recommendations for grant proposal reviews among scholars new to the review process? 2) To what extent do these tacit criteria and the resulting recommendations change after participation in a mock panel review? Using a single case study approach to explore one mock review panel, we conducted document analyses of six mentees’ reviews completed before and after their participation in the mock review panel. Findings suggest that reviewers primarily focus on the positive broader impacts proposed by a study and the level of detail within a submitted proposal. Although mentees made few changes to their reviews after the mock panel discussion, the changes that were present illustrate that reviewers more deeply considered the broader impacts of the proposed studies. These results can inform review panel practices as well as approaches to training that support new reviewers in DBER fields.
-
We present the Multilingual Amazon Reviews Corpus (MARC), a large-scale collection of Amazon reviews for multilingual text classification. The corpus contains reviews in English, Japanese, German, French, Spanish, and Chinese, which were collected between 2015 and 2019. Each record in the dataset contains the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID, and the coarse-grained product category (e.g., ‘books’, ‘appliances’). The corpus is balanced across the 5 possible star ratings, so each rating constitutes 20% of the reviews in each language. For each language, there are 200,000, 5,000, and 5,000 reviews in the training, development, and test sets, respectively. We report baseline results for supervised text classification and zero-shot cross-lingual transfer learning by fine-tuning a multilingual BERT model on the review data. We propose the use of mean absolute error (MAE) instead of classification accuracy for this task, since MAE accounts for the ordinal nature of the ratings.
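As a quick illustration of why MAE is preferred over accuracy for ordinal star ratings (the numbers below are invented, not drawn from MARC): two models that misclassify the same number of reviews get identical accuracy, but the one whose errors are off by a single star gets a much lower MAE.

```python
# Toy illustration (not MARC data): MAE vs. accuracy on 1-5 star ratings.
import numpy as np

true_stars = np.array([5, 5, 5, 1, 1])
pred_near = np.array([4, 5, 5, 1, 2])  # errors are off by one star
pred_far = np.array([1, 5, 5, 1, 5])   # errors are off by four stars

for name, pred in [("near-miss model", pred_near), ("far-miss model", pred_far)]:
    accuracy = float(np.mean(pred == true_stars))
    mae = float(np.mean(np.abs(pred - true_stars)))
    print(f"{name}: accuracy={accuracy:.2f}, MAE={mae:.2f}")

# Both models score 0.60 accuracy, but MAE is 0.40 vs. 1.60,
# so MAE distinguishes near misses from large ordinal errors.
```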
-
Recent advances in Large Language Models (LLMs) show the potential to significantly augment or even replace complex human writing activities. However, for complex tasks where people need to make decisions as well as write a justification, the trade-offs between making work more efficient and hindering decision making remain unclear. In this paper, we explore this question in the context of designing intelligent scaffolding for writing meta-reviews in an academic peer review process. We prototyped a system called MetaWriter, trained on five years of open peer review data, to support meta-reviewing. The system highlights common topics in the original peer reviews, extracts key points made by each reviewer, and, on request, provides a preliminary draft of a meta-review that can be further edited. To understand how novice and experienced meta-reviewers use MetaWriter, we conducted a within-subjects study with 32 participants. Each participant wrote meta-reviews for two papers: one with and one without MetaWriter. We found that MetaWriter significantly expedited the authoring process and improved the coverage of meta-reviews, as rated by experts, compared to the baseline. While participants recognized the efficiency benefits, they raised concerns around trust, over-reliance, and agency. We also interviewed six paper authors to understand their opinions on using machine intelligence to support the peer review process, and report critical reflections. We discuss implications for future interactive AI writing tools that support complex synthesis work.
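The abstract does not specify how MetaWriter implements these features. Purely as a hypothetical illustration of what surfacing "common topics" across a paper's reviews could look like, one might count terms shared by multiple reviews; everything below (names, stopword list, example reviews) is invented and is not the authors' method.

```python
# Hypothetical illustration only: surfacing terms shared across a paper's reviews.
# This is NOT MetaWriter's implementation, which the abstract does not describe.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "is", "in", "this", "i", "but"}

def common_topics(reviews: list[str], min_reviews: int = 2, top_k: int = 5) -> list[str]:
    """Return terms that appear in at least `min_reviews` of the given reviews."""
    doc_freq = Counter()
    for review in reviews:
        # Count each term at most once per review so the counts are document frequencies.
        terms = {t for t in re.findall(r"[a-z]+", review.lower()) if t not in STOPWORDS}
        doc_freq.update(terms)
    shared = [term for term, df in doc_freq.most_common() if df >= min_reviews]
    return shared[:top_k]

reviews = [
    "The evaluation is limited to one dataset; baselines are missing.",
    "Strong idea, but the evaluation lacks baselines and ablations.",
    "Clear writing. I would like a broader evaluation with more baselines.",
]
print(common_topics(reviews))  # e.g. ['evaluation', 'baselines']
```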
-
Purpose: Despite tremendous gains from deep learning and the promise of AI in medicine to improve diagnosis and save costs, a large translational gap remains in implementing and using AI products in real-world clinical settings. Adoption of standards like the TRIPOD, CONSORT, and CLAIM checklists is increasing to improve the peer review process and the reporting of AI tools. However, no such standards exist for product-level review. Methods: A review of clinical trials shows a paucity of evidence for radiology AI products; we therefore developed a 10-question assessment tool for reviewing AI products, with an emphasis on their validation and result dissemination. We applied the assessment tool to commercial and open-source algorithms used for diagnosis to extract evidence on the clinical utility of the tools. Results: We find that there is limited technical information on methodologies for FDA-approved algorithms compared to open-source products, likely due to concerns about intellectual property. Furthermore, we find that FDA-approved products use much smaller datasets than open-source AI tools, as the terms of use of public datasets are limited to academic and non-commercial entities, which precludes their use in commercial products. Conclusion: Overall, we observe a broad spectrum of maturity and clinical use of AI products, but a large gap exists in exploring the actual performance of AI tools in clinical practice.