NSF PAR Search | NSF Public Access Repository

AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents

https://doi.org/10.18653/v1/2024.acl-long.850

Trivedi, Harsh; Khot, Tushar; Hartmann, Mareike; Manku, Ruskin; Dong, Vinty; Li, Edward; Gupta, Shashank; Sabharwal, Ashish; Balasubramanian, Niranjan (January 2024, Association for Computational Linguistics)

Full Text Available
Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions

https://doi.org/10.18653/v1/2023.acl-long.557

Trivedi, Harsh; Balasubramanian, Niranjan; Khot, Tushar; Sabharwal, Ashish (January 2023, 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers))

Full Text Available
Teaching Broad Reasoning Skills via Decomposition-Guided Contexts

Trivedi, Harsh; Balasubramanian, Niranjan; Khot, Tushar; Sabharwal, Ashish (December 2022, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing)

Question-answering datasets require a broad set of reasoning skills. We show how to use question decompositions to teach language models these broad reasoning skills in a robust fashion. Specifically, we use widely available QDMR representations to programmatically create hard-to-cheat synthetic contexts for real questions in six multi-step reasoning datasets. These contexts are carefully designed to avoid common reasoning shortcuts prevalent in real contexts that prevent models from learning the right skills. This results in a pretraining dataset, named TeaBReaC, containing 525K multi-step questions (with associated formal programs) covering about 900 reasoning patterns. We show that pretraining standard language models (LMs) on TeaBReaC before fine-tuning them on target datasets improves their performance by up to 13 F1 points across 4 multi-step QA datasets, with gains of up to 21 points on more complex questions. The resulting models also demonstrate higher robustness, with a 5-8 F1 point improvement on two contrast sets. Furthermore, TeaBReaC pretraining substantially improves model performance and robustness even when starting with numerate LMs pretrained using recent methods (e.g., PReasM, POET). Our work thus shows how to effectively use decomposition-guided contexts to robustly teach multi-step reasoning.
Full Text Available
Summarize-then-Answer: Generating Concise Explanations for Multi-hop Reading Comprehension

https://doi.org/10.18653/v1/2021.emnlp-main.490

Inoue, Naoya; Trivedi, Harsh; Sinha, Steven; Balasubramanian, Niranjan; Inui, Kentaro (November 2021, Conference on Empirical Methods in Natural Language Processing)

Full Text Available
Single-Turn Debate Does Not Help Humans Answer Hard Reading-Comprehension Questions

https://doi.org/10.18653/v1/2022.lnls-1.3

Parrish, Alicia; Trivedi, Harsh; Perez, Ethan; Chen, Angelica; Nangia, Nikita; Phang, Jason; Bowman, Samuel (January 2022, Proceedings of the First Workshop on Learning with Natural Language Supervision)

Full Text Available
IrEne: Interpretable Energy Prediction for Transformers

https://doi.org/10.18653/v1/2021.acl-long.167

Cao, Qingqing; Lal, Yash Kumar; Trivedi, Harsh; Balasubramanian, Aruna; Balasubramanian, Niranjan (August 2021, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers))
Full Text Available
What Ingredients Make for an Effective Crowdsourcing Protocol for Difficult NLU Data Collection Tasks?

Nangia, Nikita; Sugawara, Saku; Trivedi, Harsh; Warstadt, Alex; Vania, Clara; Bowman, Samuel R. (January 2021, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics)
Crowdsourcing is widely used to create data for common natural language understanding tasks. Despite the importance of these datasets for measuring and refining model understanding of language, there has been little focus on the crowdsourcing methods used for collecting the datasets. In this paper, we compare the efficacy of interventions that have been proposed in prior work as ways of improving data quality. We use multiple-choice question answering as a testbed and run a randomized trial by assigning crowdworkers to write questions under one of four different data collection protocols. We find that asking workers to write explanations for their examples is an ineffective stand-alone strategy for boosting NLU example difficulty. However, we find that training crowdworkers, and then using an iterative process of collecting data, sending feedback, and qualifying workers based on expert judgments, is an effective means of collecting challenging data. Using crowdsourced judgments instead of expert judgments to qualify workers and send feedback, however, does not prove to be effective. We observe that the data from the iterative protocol with expert assessments is more challenging by several measures. Notably, the human-model gap on the unanimous agreement portion of this data is, on average, twice as large as the gap for the baseline protocol data.
Full Text Available
Repurposing Entailment for Multi-Hop Question Answering Tasks

Trivedi, Harsh; Kwon, Heeyoung; Khot, Tushar; Sabharwal, Ashish; Balasubramanian, Niranjan (June 2019, North American Chapter of the Association for Computational Linguistics: Human Language Technologies)

Question Answering (QA) naturally reduces to an entailment problem, namely, verifying whether some text entails the answer to a question. However, for multi-hop QA tasks, which require reasoning with multiple sentences, it remains unclear how best to utilize entailment models pre-trained on large-scale datasets such as SNLI, which are based on sentence pairs. We introduce Multee, a general architecture that can effectively use entailment models for multi-hop QA tasks. Multee uses (i) a local module that helps locate important sentences, thereby avoiding distracting information, and (ii) a global module that aggregates information by effectively incorporating importance weights. Importantly, we show that both modules can use entailment functions pre-trained on large-scale NLI datasets. We evaluate performance on MultiRC and OpenBookQA, two multi-hop QA datasets. When using an entailment function pre-trained on NLI datasets, Multee outperforms QA models trained only on the target QA datasets and the OpenAI transformer models.
Full Text Available