<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Do great minds think alike? Investigating Human-AI Complementarity in Question Answering with CAIMIRA</title></titleStmt>
			<publicationStmt>
				<publisher>Association for Computational Linguistics</publisher>
				<date>01/01/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10611061</idno>
					<idno type="doi">10.18653/v1/2024.emnlp-main.1201</idno>
					
<author>Maharshi Gor</author><author>Hal Daumé III</author><author>Tianyi Zhou</author><author>Jordan Lee Boyd-Graber</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
<abstract><ab><![CDATA[CAIMIRA discovers the skills that humans and AIs use to answer questions. By scraping websites where trivia nerds answer really difficult questions and posing those questions to AI models like GPT-4 and LLaMA-3-70B, we find that while humans excel in knowledge-based abductive reasoning, AI outperforms them on fact-based historical recall. This research suggests future challenges should focus on more complex reasoning and nuanced language tasks to better align AI development with human cognitive strengths.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The NLP community has focused on human behavior emulation, treating human performance as ceiling for models. However, the latest wave of LLMs has turned the discussion to supremacy: models are purportedly acing tests <ref type="bibr">(Liu et al., 2023;</ref><ref type="bibr">Hendrycks et al., 2020</ref>) that many humans find challenging. 1 1 As should hopefully be clear from the rest of the paper, we are highly dubious of these claims, particularly on multichoice tests with copious study material online. But this is outside the main scope of this paper.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Relevant Latent Factors</head><p>Q: Blaise Pascal names a theorem about these shapes inscribed in conic sections.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Question Category: Science &gt; Mathematics</head><p>Reference from Wikipedia:</p><p>In projective geometry, Pascal's theorem (also known as the hexagrammum mysticum theorem) states that if six arbitrary points are chosen on a conic and joined by line segments in any order to form a hexagon, then the three pairs of opposite sides of the hexagon meet at three points which lie on a straight line, called the Pascal line of the hexagon.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Cult ural Records</head><p>History &amp; Events Scientific Reasoning Abduct ive Recall Complex Semantics Question Relevance 0.00 0.00 0.25 0.75 0.00 Trivia Nerds GPT 4 Turbo p( ,&#10003;) = 0.7 p( ,&#10003;) = 0.1 Trivia Nerd Skills Question Difficulty GPT-4 Skills Science &gt; Mathematics Neha Question: Blaise Pascal names a theorem about these shapes inscribed in conic sections. Answer: Hexagons Pascal's Theorem</p><p>In projective geometry, Pascal's theorem (also known as the hexagrammum mysticum theorem) states that if six arbitrary points are chosen on a conic and joined by line segments in any order to form a hexagon, then the three pairs of opposite sides of the hexagon meet at three points which lie on a straight line, called the Pascal line of the hexagon.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Hexagons</head><p>Ellipses p( ,&#10003;) = 0.7 p( ,&#10003;) = 0.1</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Pascal's Theorem</head><p>In projective geometry, Pascal's theorem (also known as the hexagrammum mysticum theorem) states that if six arbitrary points are chosen on a conic and joined by line segments in any order to form a hexagon, then the three pairs of opposite sides of the hexagon meet at three points which lie on a straight line, called the Pascal line of the hexagon.</p><p>In projective geometry, Pascal's theorem (also known as the hexagrammum mysticum theorem) states that if six arbitrary points are chosen on a conic and joined by line segments in any order to form a hexagon, then the three pairs of opposite sides of the hexagon meet at three points which lie on a straight line, called the Pascal line of the hexagon.</p><p>Pascal's Theorem Ellipses Hexagons We list the five latent factors that CAIMIRA discovers, and highlight the relevant ones (green), which contribute to estimating whether an agent will respond to the example question correctly. The agent skills over these relevant factors are highlighted in red boxes.</p><p>A notable 2010 example was IBM Watson's tour de force performance <ref type="bibr">Ferrucci et al. (2010)</ref> on Jeopardy!. While Watson defeated the two humans on stage over a few dozen questions, a thorough, quantitative examination of the relative strengths and weaknesses of human vs. computer on question answering (QA), particularly with the new panoply of recent LLMs, remains absent.</p><p>To address this gap, we turn to Item Response Theory (IRT, &#167;2.2), a statistical framework, originally developed in psychometrics <ref type="bibr">(Santor and Ramsay, 1998)</ref>, used for constructing effective standardized tests, by modeling the interaction between individuals and test items (questions). IRT is particularly suited for our analysis because it allows us to simultaneously assess the abilities of respondents (in our case, both humans and AI systems) and the characteristics of test items (our questions). This dual assessment is crucial for understanding the nuanced differences in performance between humans and AI systems across various types of questions.</p><p>Building upon IRT, we introduce CAIMIRA-Content-aware, Identifiable, and Multidimensional Item Response Analysis (pronounced Chimera )-a neural framework<ref type="foot">foot_0</ref> that overcomes key challenges of applying IRT to QA. CAIMIRA uses question text to infer characteristics, enabling generalization to new questions without needing prior responses.</p><p>For our questions, we use a QA format (Boyd-Graber et al., 2012, QuizBowl) specifically designed for effective comparison between QA agents ( &#167; 2.1). We then apply CAIMIRA ( &#167; 5) to responses collected from 155 human trivia players, and a wide range (~70) of QA systems, over thousands of these carefully crafted questions that probe knowledge recall and reasoning capabilities. CAIMIRA uncovers latent aspects (Figure <ref type="figure">5</ref>) that encapsulate different knowledge domains and reasoning skills, that best contrast agents' capabilities.</p><p>Humans and QA systems' skills are strikingly different across these latent axes (Figure <ref type="figure">6</ref>). Human responses reflect their superior interpretative abilities, instinctive thinking, and cognitive flexibility. 
This is particularly evident in questions demanding conceptual and knowledge-grounded abductive reasoning, characterized by indirect narrative references and ambiguous information gaps, where humans make intuitive leaps and draw connections that may not be immediately apparent. Conversely, large-scale LLMs like GPT-4-TURBO and LLAMA-3-70B demonstrate superior ability in retrieving specific information about events and locations, outdoing humans on questions loaded with entity-specific details, a feat we attribute to their extensive parametric memory. CAIMIRA also reveals questions that, while easily matched to relevant documents by retrieval systems, challenge most LLMs in extracting the final answer. These questions feature complex sentence structures and semantic relationships that turn simple information retrieval into demanding reading comprehension.</p><p>In conclusion, this study provides insights into the strengths and weaknesses of human and AI question answering, laying the groundwork for future AI developments that better emulate or complement human cognitive abilities. In doing so, it underscores the need for sophisticated benchmarks to controllably distinguish between proficient and less capable QA systems, especially in areas demanding deeper conceptual and linguistic understanding.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Background and Preliminaries</head><p>This section describes the source of the Quizbowl QA data ( &#167; 2.1) and preliminaries of IRT and MIRT ( &#167; 2.2), the foundation of CAIMIRA ( &#167; 3).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">QUIZBOWL: Where Trivia Nerds Practice</head><p>Our overarching goal is to identify similarities and differences between how systems and humans respond to questions. These questions must be diverse, less prone to false presuppositions, and designed to be challenging for humans, enabling us to draw conclusions about the strengths and weaknesses of agents without needing to "question the question" <ref type="bibr">(Min et al., 2020;</ref><ref type="bibr">Yu et al., 2022)</ref>. Following the categorization by <ref type="bibr">Rogers et al. (2023)</ref>, we focus on depth-testing "probing" questions over "information seeking" ones. This approach aligns with the Manchester paradigm outlined by <ref type="bibr">Rodriguez and Boyd-Graber (2021)</ref>, which highlights the significance of research agendas in the development of human-like, intelligent QA systems. More importantly, we need questions with many examples of diverse human answers. While humans may not answer Google queries <ref type="bibr">(Kwiatkowski et al., 2019)</ref> for fun, they do answer trivia questions as a hobby or to prepare for trivia competitions. Hence, we use the "Protobowl" <ref type="bibr">(He et al., 2016)</ref>, a dataset of trivia questions based on the Quizbowl (QB) QA setting <ref type="bibr">(Boyd-Graber et al., 2012)</ref>. Quizbowl, the source of questions for ProtoBowl, is a trivia game consisting of questions with sentence-clues decreasing in difficulty and culminating with a "giveaway" hint at the end of the question. It is the only open source QA dataset that contains records of many human players of varying levels of expertise answering questions across different categories like history, science and literature 3 (Figure <ref type="figure">2</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">A review of Item Response Theory (IRT)</head><p>We compare humans and AI systems by capturing their skills using Item Response Theory (IRT), a framework used to understand question quality and participant strengths, by analyzing responses (ruled as correct or incorrect) to a set of questions (or, "items"). It is widely adopted in psychometrics <ref type="bibr">(Morizot et al., 2009</ref><ref type="bibr">), medical education (Downing, 2003)</ref>, and other fields for developing standardized tests for human subjects.</p><p>In the context of this work, IRT assumes (1) a set of question-answer pairs, (2) subjects spanning humans and QA systems, and (3) binary correctness rulings of their responses. The IRT objective is to predict the response correctness (U i,j ) based on the subject's skill s i and the question's difficulty d j , where i and j are the indices of the subject and question, respectively. The probability of response correctness, p(U i,j = 1), is modeled as &#963;(s i -d j ), where &#963; is the sigmoid function.</p><p>(1)</p><p>The learning objective is to model skill and difficulty parameters that best fit assumed priors, given observed response data, typically using Bayesian inference. Existing IRT applications in NLP often model item characteristics in one dimension <ref type="bibr">(Lalor et al., 2019)</ref>, assuming a linear hierarchy in difficulty and skill levels. This approach is limiting when distinguishing between agents in NLP tasks.</p><p>For example, if a history question q h is found to be more difficult than a science question q s (d h &gt; d s ), the model asserts that agents correctly answering q h also correctly answer q s , and vice versa.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Multidimensional Latent IRT (MIRT).</head><p>To relax the monotonicity assumption and model multifactor characteristics, MIRT was developed <ref type="bibr">(Reckase, 2006;</ref><ref type="bibr">Chalmers, 2012)</ref>. It models two question characteristics: a scalar difficulty d j , and an m-dimensional discriminability &#945; j that interacts with the m-dimensional skill vector s i . The skill value s i,k corresponds to the agent's expertise on the k th latent aspect. The objective then becomes:</p><p>The discriminability &#945; j captures how sensitively the correctness probability changes with each dimension of the agent skill s i . To mitigate overexpressibility, MIRT assumes &#945; j to have a gamma prior, allowing only positive values. But, nonidentifiability issues <ref type="bibr">(Raue et al., 2009)</ref> persist. <ref type="foot">4</ref>A common practice of using hierarchical priors for resolving this makes optimization unstable for higher dimensions. Lastly, the model's exclusive dependence on question identifiers (q31_2) treats questions as unrelated and hinders generalization. The characteristics learned this way do not identify the difference in the questions based on their content <ref type="bibr">(Rodriguez et al., 2022)</ref> 3 Bootstrapping IRT with CAIMIRA</p><p>We propose CAIMIRA-Content-aware, Identifiable, and Multidimensional Item Response Analysis, an IRT framework that addresses the limitations of MIRT ( &#167; 2.2) by introducing three key modifications: (i) a novel concept of relevance (r j ) for each item j, (ii) zero-centered difficulty (d j ), and (iii) learnable content-aware transformations (f R and f D ) that produce r j and d j from the raw questions. These enable CAIMIRA to provide interpretable and identifiable results, and handle new questions without prior response data. The response prediction model, the probability of agent i correctly answering question j, for an m-dimensional CAIMIRA, is given by Equation <ref type="formula">3</ref>.</p><p>where, s i &#8712; R m is agent skills, and, r j , d j &#8712; R m are question relevance and difficulty resp.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Introducing question relevance r j</head><p>An interpretable item response analysis should include an item characteristic for each question that has the single responsibility of capturing how relevant each latent aspect is for estimating the likelihood of an agent correctly answering a particular question, p(U i,j ). We call this relevance.</p><p>Relevance r j measures how differences between and agent skills and question difficulty (s i -d j ), or latent scores, align across the m-dimensions (Eq 3), assigning each dimension (or, latent aspect) a proportion (r j,k ) to show its importance. To ensure clarity and prevent overlap with difficulty, r j is defined as a probability distribution across the m dimensions. For instance, for a Thermodynamics question, CAIMIRA assigns greater  <ref type="formula">3</ref>). Here, the question's raw relevance r &#8242; j and raw difficulty d; j are multidimensional and computed by learnt linear transformations over the question embedding E q j ( &#167; 3.3), and the agent skill s i is extracted from a learnable agent embedding matrix E a . r j is a probability distribution computed from the raw reference r &#8242; j and improves the interpretability of the multidimensional model ( &#167; 3.1); d j is achieved by zero centering of the raw difficulty d &#8242; j , which addresses the non-identifiability issue of s i and d j in</p><p>relevance to dimensions capturing physics knowledge and analytical reasoning, down weighing unrelated dimensions like history or language. This targeted aggregation of differences across relevant dimensions ensures that the likelihood estimate p(U i,j = 1 | s i , r j , d j ), is both precise and contextually appropriate.</p><p>Connection to Topic Models This admixture mirrors the per-document allocation in topic models; in CAIMIRA, questions are admixtures of latent aspects, or dimensions, with relevance r j indicating each dimension's contribution to the question.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Zero Centering of difficulty d j</head><p>Aggregating differences between agent skills and question difficulty (s i -d j ) across dimensions (Eq 3), leads to non-unique skill and difficulty values for same likelihood estimate p(U i,j = 1). We alleviate this non-identifiability issue by normalizing each question's raw difficulty d &#8242; j to have a zero mean for each dimension (Equation <ref type="formula">7</ref>). This normalization constrains skill and difficulty ranges and enables comparisons across dimensions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Content-Aware Transformations</head><p>CAIMIRA improves upon MIRT by incorporating question content, enabling CAIMIRA to compute characteristics for new questions without requiring prior response data, making it "cold-start friendly". At its core, CAIMIRA maps question text into relevance and difficulty values using learnable functions, f R , f D : Q &#8594; R m , transforming a question q j from the space of question texts Q into raw relevance (r &#8242; j ) and raw difficulty (d &#8242; j ) vectors (Figure <ref type="figure">3</ref>). These are modeled as linear transformations over a pre-trained embedder f E : Q &#8594; R n (e.g., BERT), which represents q j &#8712; Q in an n-dimensional space as an embedding e j :</p><p>where W R , W D &#8712; R m&#215;n and b R &#8712; R m are the parameters of the linear transformations. <ref type="foot">5</ref> The raw values are then normalized to obtain final relevance (r j ) and difficulty (d j ) values:</p><p>where n q is the number of questions in the dataset. softmax normalization for relevance ensures that the values sum to 1 across m-dimensions, reflecting the relative importance of each latent aspect.</p><p>Agent Skills. CAIMIRA learns an agent skill embedding matrix E a &#8712; R na&#215;m , where n a is the number of agents, and the skill vector for agent i is the i th row of this matrix:</p><p>This approach allows CAIMIRA to learn a compact representation of each agent's skills and question characteristics (difficulty and relevance), across m dimensions, which can be directly used in the response prediction model (Equation <ref type="formula">3</ref>).</p><p>Learning Objective. To optimize CAIMIRA's parameters (&#920;), which include the agent skill embedding matrix E a and the linear transformation parameters b R , W R and W D , we use maximum a posteriori estimate (MAP) based loss, which imposes implicit priors on the question characteristics and agent skills. This combines a cross-entropy loss L CE (Eq 9) with regularization terms (Eq 10):</p><p>where &#8467; CE (x, y) is the cross-entropy loss between the true label x and the predicted probability in Eq. (3), y. &#8741; &#8226; &#8741; 1 denotes the &#8467; 1 norm, and &#955; d and &#955; s are the regularization hyperparameters. Finally,</p><p>4 Experimental Setup</p><p>This section describes how we collect responses from humans and QA systems, assess their answers, and analyze the latent traits learned by CAIMIRA.</p><p>Protobowl Logs. We collect player logs from the "Protobowl" platform over QB questions spanning various categories. (Figure <ref type="figure">2</ref>) Player logs record question metadata, including category (e.g. History), time taken to answer the question, answer string, and the correctness ruling by the platform. The best players have deep knowledge and excellent lateral thinking skills <ref type="bibr">(Jennings, 2006)</ref>.</p><p>Constructing QA Dataset. QB questions are inherently multi-sentence (typically five) with each sentence serving as a distinct clue for the answer.</p><p>In our dataset, each item is formed by cumulatively adding clues from a QB question, with the first item containing the initial clue and subsequent items incorporating an additional clue each; i.e., the first item consists of only the first clue, the second item comprises the first two clues together, and so on. 
This cumulative clue addition provides insight into how progressively revealing information affects agents' response accuracy.</p><p>Mapping Player Responses to Cumulative Clues. Player responses are mapped to these cumulative-clue items to analyze the effectiveness of each clue set in eliciting correct answers. Responses to q31 after only the first clue are recorded under q31_1, responses after the second clue (which include the information from both clues) are recorded under q31_2, and so on. This mapping is further refined through a backfilling process. Because clues are meant to become progressively easier, we assume that a player who correctly answers a question at clue t would also correctly answer it at any clue t′ &gt; t, so we mark those as correct as well. An analogous argument holds for t′ &lt; t when humans answer incorrectly. Consequently, we collect a total of 3042 entries in our refined dataset.<ref type="foot">foot_3</ref> </p></div>
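<p>The backfilling rule is mechanical; the sketch below (our naming, not the authors' code) propagates rulings across the cumulative-clue items of one question:</p>

```python
def backfill(rulings: dict[int, bool], n_clues: int) -> dict[int, bool]:
    """Propagate correctness rulings across cumulative-clue items.

    A correct answer at clue t implies correct at every t' > t (more clues,
    easier item); an incorrect answer at clue t implies incorrect at every
    t' < t. `rulings` maps an observed clue index to its correctness.
    """
    filled = dict(rulings)
    for t, correct in rulings.items():
        if correct:
            for t2 in range(t + 1, n_clues + 1):
                filled.setdefault(t2, True)
        else:
            for t2 in range(1, t):
                filled.setdefault(t2, False)
    return filled

# A player answered q31 correctly at clue 3 of 5:
print(backfill({3: True}, 5))  # {3: True, 4: True, 5: True}
```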
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Human Agents</head><p>In exploring the complementary QA abilities of human and AI, a key challenge is the sparsity of individual human data: most players only engage with a set of few dozen questions. To address this, we form synthetic human agents by grouping individual human players. This approach serves two primary purposes: it helps in accumulating a dataset where agents have attempted a substantial portion of the questions, and it mitigates the issue of nonrepresentativeness of data from a few power users.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Group Formation and Decision Mechanism</head><p>Our dataset comprises only five human players who have answered over 1500 questions each. While these "power users" are invaluable, relying solely on their data could skew the understanding of human-AI interaction, as they might not be representative of the broader player base. Therefore, we introduce "grouped human agents". Each grouped agent is a synthetic construct, amalgamating responses from multiple human players with similar skill levels. We group human players such that the overall coverage of questions attempted by the group is maximized. In cases where multiple players in a group answer the same question, we use a majority rule to determine the group's response.</p><p>If no majority is reached, a response is sampled based on the votes. <ref type="foot">7</ref>We consider group sizes of 1 (individual), 5, 10, and 15, creating five groups for each size, totaling 20 human agents spanning 155 distinct players. Our human participants, all fluent in US English, are experienced Quiz Bowl players. While this sample may not encompass the full diversity of the broader population, their expertise in trivia games, particularly in Quiz Bowl, allows us to contrast the nuanced skill sets of seasoned Quiz Bowl enthusiasts with the capabilities of our AI systems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">AI Agents</head><p>To capture skill differentials across AI models and humans and to learn the effects of various training and modeling techniques, we select a broad range of QA systems, 8 grouped as below:</p><p>Retrievers. These agents, indexing Wikipedia, use sparse (e.g., BM25), and dense-GRIT-LM <ref type="bibr">(Muennighoff et al., 2024)</ref> and CON-TRIEVER <ref type="bibr">(Izacard et al., 2021)</ref>-methods to fetch the k most relevant context documents to a query (where k = 1, 3, 5, 10). We call these contextretrievers. We also test a title-retriever, where only the title(s) associated with the retrieved document(s) are answer predictions. Retrievers are evaluated on recall, with a point scored if the answer appears within retrieved documents for contextretrievers, or in the title for the title-retrievers.</p><p>Large Language Models (LLMs). We assess LLMs zero-shot in-context learning <ref type="bibr">(Brown et al., 2020)</ref>, providing a task instruction followed by a single QA pair demonstration. These LLMs include base models (OPT <ref type="bibr">(Zhang et al., 2022)</ref>, <ref type="bibr">GPT-Neo (Black et al., 2021)</ref> and Pythia <ref type="bibr">(Biderman et al., 2023)</ref>), instruction-tuned models (OPT-IML <ref type="bibr">(Iyer et al., 2022)</ref>, T0, T0pp <ref type="bibr">(Sanh et al., 2021</ref><ref type="bibr">), Flan-T5 (Chung et al., 2022)</ref> and Flan-UL2 (Tay et al., 2022)), very large-scaled models like LLAMA-3-70B <ref type="bibr">(Touvron et al., 2023)</ref>, Falcon40B <ref type="bibr">(Almazrouei et al., 2023)</ref>, Cohere's CMD-R+ 9 and Mixtral 8x7b <ref type="bibr">(Jiang et al., 2024)</ref>, and closed-sourced APIs such as GPT-4O, GPT-4-TURBO <ref type="bibr">(OpenAI, 2023)</ref> and Gemini-family <ref type="bibr">(Team et al., 2024)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Retriever-augmented Generative Models (RAG).</head><p>We combine above defined retrievers with generative models for answer production, primarily using FlanT5-XL <ref type="bibr">(Chung et al., 2022)</ref> with top 3 documents and exploring Flan-UL2 <ref type="bibr">(Tay et al., 2022)</ref>, and CMD-R+ to accommodate all ten.</p><p>Answer Match Equivalence. Traditional exactmatch <ref type="bibr">(Rajpurkar et al., 2016)</ref> often misses alternative answer that have different wordings or forms but the same semantic sense as the correct answer <ref type="bibr">(Bulian et al., 2022)</ref>. To better handle this, 8 Appendix C provides further details into model specs. 9 <ref type="url">https://docs.cohere.com/docs/command-r-plus</ref>  we adopt a fuzzy match evaluation using answer aliases <ref type="bibr">(Si et al., 2021)</ref>: if the character level matching rate between the predicted answer and the gold answer exceeds a certain threshold, the prediction is considered as correct. We tuned the threshold against human judgments on a small dev set.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">CAIMIRA Setup</head><p>We ablate the number of latent dimensions, m. Validation loss plateaus beyond m = 5 (Fig <ref type="figure">4</ref>). We thus train a 5-dimensional CAIMIRA model using all-mpnet-base-v2, an SBERT variant <ref type="bibr">(Reimers and Gurevych, 2019)</ref> as the question embedder f E .</p><p>To capture information gaps between questions and answers, we supplement SBERT's text input with both the answer and it's Wikipedia page summary. We minimize L CAIMIRA (Equation <ref type="formula">11</ref>) using Adam optimizer (Kingma and Ba, 2014), with learning rate 0.005, batch size 512, and &#955; d = &#955; s = 1e -5.</p><p>Interpreting Latent Aspects. To study the latent dimensions of CAIMIRA, we use Logistic Regression as a supplemental interpretative tool. We build upon <ref type="bibr">Benedetto et al. (2020)</ref>, which uses Linear Regression to post-hoc explain the latent item difficulty parameters, and follow <ref type="bibr">Gor et al. (2021)</ref> to interpret the latent relevance dimensions using logistic regression. For each latent dimension (k), Logistic Regression predicts if the relevance r jk is greater than 0.6 as a function of interpretable features extracted from the questions. These features span topical question subcategories, clue counts, temporal expression mentions, question similarity with corresponding Wikipedia pages (Wiki-MatchScore), and linguistic features from <ref type="bibr">Lee et al. (2021)</ref>. 10 Thereby, we explain CAIMIRA's latent dimensions by relating them to the logistic regression features with large (positive and negative) coefficients. Topical features are one-hot encoded; c_music is set to 1 for music related question, and 0 otherwise. The linguistics features span advanced 10 Appendix D lists all features we use.</p><p>c_plot_and_characters c_television/movies c_genre_and_style c_mathematics c_fine_arts New Automated Readability Index ratio of Content words to Function words Number of Clues # Entities Mentions / sentence Mentions of time periods Wiki Match Score -1 0 1 2 3 c_political_geography c_cultural_history Mentions of complex time expressions c_political_history # Entities Mentions / sentence Mentions of specific time expressions Mentions of time periods ratio of Content words to Function words Popular events Wiki Match Score -1 0 1 2 3 c_biology c_language c_physiography c_physics c_chemistry c_music c_earth_science ratio of Content words to Function words Number of Clues # Entities Mentions / sentence -1 0 1 2 3 c_mythology c_religion c_technology c_genre_and_style c_ancient_history Wiki Match Score Mentions of relative temporal expressions c_author_and_works Popular events ratio of Content words to Function words -1 0 1 2 3 c_plot_and_characters Wiki Match Score c_author_and_works c_music c_sports # tokens / sentence c_television/movies Popular events average Tree height per token (word) ratio of Content words to Function words Contextual diversity of tokens</p><p>-1 0 1 2 3 Model fit: 84.15% Model fit: 82.47% Model fit: 83.49% Model fit: 77.43% Model fit: 79.04% Dim 1: &#129504; Abductive Recall Dim 2: &#127963; History and Events Dim 3: &#129516; Scientific Facts Dim 4: &#127917; Cultural Records Dim 5: &#128269; Complex Semantics Figure 5: Interpretation of the five latent dimensions in CAIMIRA. We use Logistic Regression to predict the binary relevance label, r jk &gt; 0.6, for each dimension k. 
For question features, we use topical categories and linguistic properties. We report the classification accuracy and the statistically significant features. Coefficients are positive if the features positively affect classification, negative otherwise. This demonstrates the efficacy of predicting the relevance from a question's SBERT embedding. -6 -4 -2 0 2 4 6 -6 -4 -2 0 2 4 6 -6 -4 -2 0 2 4 6 -6 -4 -2 0 2 4 6 -6 -4 -2 0 2 4 6 Agent Types Human(s) API Agents Large-Scale LLMs RAG Base LLMs LLMs (Inst) Title Retrievers Context Retrievers &#129504; Abductive Recall &#127963; History and Events &#129516; Scientific Facts &#127917; Cultural Records &#128269; Complex Semantics Agent Skills semantic, discourse-based, and syntactic elements, providing a rich and multi-faceted representation of the questions. These are normalized to have zero mean and unit variance. Figure <ref type="figure">5</ref> lists the most contributing, statistically significant features for each dimension (p-value &lt; 0.05). To make the learned coefficients comparable across dimensions, we incorporate class-balancing maintaining the random guess accuracy for each dimension at 50%.</p></div>
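<p>The interpretation procedure reduces to one balanced binary classifier per latent dimension; a sketch with synthetic data standing in for the real feature table of Appendix D:</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# X: interpretable question features (topical one-hots, clue counts,
# WikiMatchScore, linguistic features); y: whether relevance r_jk > 0.6 on
# latent dimension k. Both are synthetic placeholders here.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=1000)) > 0.5

X_std = StandardScaler().fit_transform(X)          # zero mean, unit variance
clf = LogisticRegression(class_weight="balanced")  # keep chance accuracy at 50%
clf.fit(X_std, y)

# Features with large |coefficient| are the ones that explain the dimension.
top = np.argsort(-np.abs(clf.coef_[0]))[:3]
print(top, clf.coef_[0][top])
```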
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Question and Agent Analysis</head><p>This section interprets the latent aspects of CAIMIRA, emphasizing their role in differentiating agent skills. It also examines the patterns of question difficulty and agent performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Latent aspects and Agent skills</head><p>CAIMIRA uncovers five latent aspects, each capturing distinct question styles and content, determined by specific linguistic and topical features (Figure <ref type="figure">5</ref>). These aspects highlight varying agent skills across the latent dimensions (Figure <ref type="figure">6</ref>). In naming and interpreting these aspects, we draw on educational assessment frameworks, particularly Bloom's Taxonomy <ref type="bibr">(Anderson and Krathwohl, 2001)</ref>, which emphasizes the stages of knowledge recall, comprehension, and application-skills central to the Quizbowl dataset.</p><p>Abductive Recall. The first aspect captures a cognitive process that combines elements of inferential reasoning with targeted knowledge retrieval. It requires bridging indirect clues and vague references to formulate the information gap, and recalling specific entities to fill the gap. This distinguishes it from purely creative and commonsensebased abductive reasoning tasks in linguistics literature <ref type="bibr">(Bhagavatula et al., 2019;</ref><ref type="bibr">Shi et al., 2024)</ref>. We term this aspect "abductive recall" to highlight the interplay between hypothesis generation and gap resolution through targeted fact retrieval. Questions often narrate events and describe characters from a fictional realm while deliberately avoiding direct references to named entities or key phrases (Example in Fig 3 ). A low WikiMatchScore-semantic overlap between questions and their associated Wikipedia pages-combined with the absence of entities and key phrases, indicate a significant information gap that necessitates not just multi-hop reasoning skills to bridge the contextual divide, but also deducing relevant involved entities from the narrative. Humans excel at these questions, surpassing GPT-4-TURBO by leveraging intuition to connect abstract clues to specific entities, while most AI models struggle.</p><p>History and Events. In contrast, the second dimension involves historically grounded questions, where the information gap is clearer, though the queries are complex. These questions challenge participants to synthesize multiple pieces of information and infer connections between events. For e.g, "This man was killed by a crossbow bolt while besieging the castle Charlus-Chabrol", requires identifying both the event and the historical figure. While these questions still feature lower Wiki-MatchScores, the gap is more structured, centering around entity relations like events, people, and places. Bigger LLMs excel in this category, often outperforming humans and retrievers, suggesting effective recall and application of historical information through their parametric memory.</p><p>Scientific Facts. This aspect focuses on domainspecific conceptual knowledge, often featuring questions from scientific domains. Retrieval-based systems fare well when allowed to retrieve sufficient documents (Figure <ref type="figure">7</ref>). Notably, these questions, along with history-related ones, best differentiate instruction-tuned LLMs from base models, with the former outperforming the latter. 
Humans and large-scale LLMs excel in this category, as do closed-source systems like GPT-4-TURBO.</p><p>[Figure 7: Variation in Context Retriever skills (BM25, GRIT, Contriever) across latent dimensions as the number of retrieved documents (top-k) increases, showing that a system which retrieves more documents can achieve higher skills in Science, but not on Abduction and Events.]</p><p>Cultural Records. This aspect represents questions focusing on prominent figures such as authors, composers, artists, and leaders, asked in the style of "who did what", testing direct knowledge recall of well-known facts and making them relatively easy and accessible (high WikiMatchScore).</p><p>[Figure 8: Distribution of relevance (r_{j,k}) scores across CAIMIRA's five latent dimensions. Cultural Records and Complex Semantics are not as representative of the dataset as the first three.]</p><p>Complex Semantics. The final aspect pertains to questions about popular events, featuring complex semantic relationships and detailed sentences with less common, domain-specific keywords. Despite their intricacy, they are particularly retriever-friendly due to high WikiMatchScores, indicating a significant overlap with relevant source documents. The most prominent fact about the answer is directly mentioned in both the question and the document, enabling retrievers to locate correct documents. However, agents without retrieval abilities or large parametric memories struggle.</p><p>[Figure 9: Accuracy (%) of agents, from base LLMs through retrievers, RAG systems, GPT-4 Turbo, GPT-4 Omni, single humans, and human teams, on question clusters grouped by effective difficulty, from Abduction (V.Hard) to Mixed Cult. (V.Easy).]</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Which Questions are most difficult?</head><p>To identify groups of questions that present different challenges, we analyze each question's effective difficulty, denoted d^(e)_{j,k}. This metric represents the contribution of the k-th latent aspect to the difficulty of question j, calculated as r_{j,k} d_{j,k} according to Equation 3. We cluster questions into twelve groups using KMeans on their 5-dimensional effective difficulty d^(e)_j, then analyze mean relevance and mean effective difficulty per cluster across dimensions (Fig 10, full set in Appendix E). The mean effective difficulty d^(e)_{D,μ_k} on dimension k for a question set D is calculated as a weighted mean of the effective difficulty scores over the questions in D, normalized by the total relevance:</p><p>d^(e)_{D,μ_k} = Σ_{j∈D} r_{j,k} d_{j,k} / Σ_{j∈D} r_{j,k}.</p><p>[Figure 10: Mean relevance (r_{j,k}) and mean effective difficulty (r_{j,k} d_{j,k}) of selected question clusters, e.g., Abduction (V.Hard), Mixed Bag (Hard), Mixed Abd. (Hard), Sci. Reason (Med), and GeoPol 2 (Med), across the five CAIMIRA latent factors.]</p><p>Abduction (V.Hard) and Mixed Bag emerge as the most challenging categories, demonstrating high difficulty due to complex semantics, indirect phrasing, and mostly having a single clue. AI systems, including GPT-4-TURBO, struggle with these, highlighting a marked disparity with human accuracy (Fig <ref type="figure">9</ref>). Instruction-tuned LLMs outperform base ones on moderately difficult science questions, with GPT-4O surpassing single human players. A common trend we observe is that, for each latent factor, questions tend to be more difficult when they have fewer clues and a lower WikiMatchScore.</p></div>
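<p>A sketch of the effective-difficulty clustering and the weighted mean above, on synthetic relevance and difficulty values (names are ours):</p>

```python
import numpy as np
from sklearn.cluster import KMeans

# r, d: (n_q, 5) relevance and difficulty from a trained CAIMIRA;
# synthetic placeholders here.
rng = np.random.default_rng(0)
r = rng.dirichlet(np.ones(5), size=2000)
d = rng.normal(size=(2000, 5))

d_eff = r * d  # effective difficulty r_jk * d_jk, per question and dimension
clusters = KMeans(n_clusters=12, n_init=10, random_state=0).fit_predict(d_eff)

# Mean effective difficulty of cluster 0 on each dimension k, weighted by relevance:
in_c = clusters == 0
d_mu = (r[in_c] * d[in_c]).sum(axis=0) / r[in_c].sum(axis=0)
print(d_mu)
```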
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Related Work</head><p>Adoption of IRT in NLP. Current evaluation paradigms for machine and human QA inadequately segment datasets, treating questions as independent single transaction without assessing relative differences between the test set items. To remedy this, <ref type="bibr">Lalor et al. (2019)</ref> propose adopting the IRT ranking method from educational testing as a novel evaluation framework for NLP. <ref type="bibr">Rodriguez et al. (2021)</ref> argue for the adoption of IRT as the de facto standard for QA benchmarks, demonstrating its utility in guiding annotation effort, detecting annotator error, and revealing natural partitions in evaluation datasets. <ref type="bibr">Byrd and Srivastava (2022)</ref> further uses IRT to estimate question difficulty and model skills, and use question features to post-hoc predict question difficulty. Yet, existing studies are confined to a one-dimensional IRT models. Our research advances this domain by enhancing the learning method and capturing question traits that effectively differentiate human and AI QA abilities.</p><p>Ideal Point Models (IDP) IRT and IPM are two prominent statistical models used in different fields for distinct purposes. Both models deal with the analysis of preferences or abilities, but their applications and theoretical underpinnings show significant differences. IRT, used in educational assessments, gauges abilities from question responses, typically focusing on one-dimensional traits <ref type="bibr">(De Ayala, 2013)</ref>. Conversely, IPM, applied in political science, evaluates positions on spectra like political ideologies based on choices or votes <ref type="bibr">(Clinton et al., 2004)</ref>. Despite differences, both employ mathematically equivalent probabilistic methods to estimate the likelihood of a binary outcome-correctness in IRT, and votes in IDP, from a set of covariates, such as question difficulty or political ideology.</p><p>Human-AI Complementarity. Research in NLP has increasingly focused on augmenting human skills with language models, particularly in the areas like creative writing and question-answering. Studies have explored collaborative writing with LLMs, such as having human writers use GPT-3 for suggestions <ref type="bibr">(Lee et al., 2022)</ref> or modifying user-selected text spans for enhanced descriptiveness <ref type="bibr">(Padmakumar and He, 2021)</ref>. For trivia, experts and novices have teamed up with AI <ref type="bibr">(Feng and Boyd-Graber, 2018)</ref>, and for information retrieval, humans used AI-generated queries to find answers <ref type="bibr">(He et al., 2022)</ref> Our approach diverges by focusing modeling latent factors that best accentuate the distinct capabilities of trivia nerds and AI in QA. This strategy aims to identify the benchmarking methods for assessing and enhancing AI systems in subsequent work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusions</head><p>CAIMIRA enables discovery and interpretation of latent aspects in QA datasets that highlight the skills of various QA agents. On contrasting AI systems with humans, we find notable disparities: systems like GPT-4-TURBO and Gemini Pro excel at direct, context-rich queries that require connecting events and figures, but struggle with indirectly phrased questions lacking explicit entity references-domains where human acumen shines. Although GPT-4-TURBO matches individual human performance on complex knowledge-intensive abductive reasoning tasks, we caution against interpreting this as indicative of superhuman abilities. Given that the quiz questions that Protobowl is based off have been publicly available since 2011, and that these models' training data is not fully known, accurately assessing the reason for their near-perfect performance is challenging. Future research should aim to develop stronger and innovative evaluations that better gauge AI systems' ability to understand implicit contexts, and systematically contrast their skills with those of humans. Lastly, this work opens up new avenues for research on estimating agent skills that can be combined to assess multi-agent systems and collaborations, which becomes crucial as NLP evolves toward conversational agents and real-world problem-solving.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">Limitations</head><p>Dataset and Task Limitations Our study faces constraints related to dataset and task setup: (1) Limited language diversity: Our English-only dataset restricts generalizability to other languages.</p><p>(2) Lack of diverse task types: We rely solely on trivia-based questions, lacking non-trivia datasets with human responses in competitive settings. (3) Absence of multilingual trivia benchmarks: We lack multilingual trivia datasets with human responses and performance benchmarks. Future work should address these by creating datasets that include non-trivia tasks, multiple languages, and human responses, offering a more comprehensive understanding of human and AI performance across diverse linguistic and task environments.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Challenges in interpreting near-perfect scores</head><p>While models like GPT-4-TURBO match or exceed individual humans on complex tasks, caution is needed when interpreting these results as superhuman. Quiz questions in our Protobowl-based dataset have been public since 2011, and the models' full training data is unknown. This makes it difficult to determine if their near-perfect performance stems from genuine reasoning or exposure to specific questions during pre-training. genuine reasoning or exposure to specific questions during pre-training. This limitation highlights the need for more robust evaluation methods to accurately assess AI systems' understanding and reasoning abilities compared to humans.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lack of information on specific human players</head><p>Because of the nature of the Protobowl platform that we used to collect the human response data, we do not have access to information about the specific human players to incorporate that into our analysis. Future work can focus on collecting such information whilst hiding the user identity.</p><p>Non-extensibliity of a trained CAIMIRA to a new AI systems. Unlike how CAIMIRA extended MIRT to model question characteristics as a function of question texts, and not just unique question identifiers, CAIMIRA is not extensible to a new agent without retraining the model. To make this possible for AI systems, future work can maintain a feature set that describes the specifications of an AI system that can include the model architecture, the training data, parameters, training strategies, etc, and have CAIMIRA learn a transformation from the feature set to agent skills. However, since this approach would require having a feature set for human players as well, which is not available, this approach is not feasible at the moment.</p><p>Static representation from SBERT. In this work, we use a static dense representation of the question text from SBERT, instead of finetuning the model for adapting to CAIMIRA objective that learns representations from question text that best predicts the human response. This was out of the scope of this study. Future work can explore this direction using parameter efficient finetuning (PEFT) <ref type="bibr">(Xu et al., 2023)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9">Ethical Considerations</head><p>In conducting this study, we adhered to strict ethical guidelines to ensure respect for privacy, obtaining informed consent from human participants and annonimization of their data. Our work complies with all relevant ethical standards, underscoring our commitment to ethical research practices in advancing NLP technologies. We utilized GitHub Copilot for low level coding and writing assistancereimplementing plotting codes, as well as editing the prose in this document to improve readability and conciseness.</p><p>Regarding ethical considerations about running computationally expensive models, we acknowledge the carbon footprint of training and running large-scale language models. In our study we only train a very small of order 25000 parameters, for 20 minutes of single A4000 GPU time. We also use a pre-trained SBERT model for encoding the question text.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A Quizbowl Dataset</head><p>Quizbowl <ref type="bibr">(Rodriguez et al., 2019)</ref>, the source of questions for ProtoBowl, is a trivia game consisting of questions with clues decreasing in difficulty and culminating with a "giveaway" hint at the end of the question. The sequence of clues often reveals more information or helps disambiguate possible references and interpretations at each step. Figure 11 illustrates this structure with three example questions from different categories.</p><p>Question ID q832_5 (Category: Religion) This text was written down by Sahabas (sah-HAH-bahs) after the death of the leader that received it. The clarification of the meaning and significance of this document is the practice of tafsir (TAHFSEER). Its hundred and fourteen chapters are called suras (soor-AHS). It literally means "the recitation" and is said to been revealed by Gabriel to Muhammad. For 10 points, what "divinely ordained" religious text is sacred to Muslims? Answer: Piano / Pianoforte Question ID q622_3 (Category: Music) Paul Wittgenstein commissioned concertos for this instrument that used only the left hand. This instrument is said to have been invented by Bartolomeo Cristofori ("BAR-tow-lo-MAY-oh KRIS-tow-for-ee"). It was originally named for its ability to play both loud and soft sounds, which made it an improvement over the clavichord and harpsichord. Answer: Piano / Pianoforte Question ID q2443_1 (Category: Science &gt; Mathematics) 4 times the infinite sum one, minus one third, plus one fifth, minus one seventh, et cetera, equals this number. Answer: pi / 3.14 / &#960; Quizbowl naturally discriminates players' skills as players can interrupt questions to answer, and answering earlier is better.</p><p>In contrast to "all or nothing" QA, incremental QB questions help pinpoint the clues necessary for an agent a to answer question q by creating multiple opportunities for a to answer q. We achieve this by creating creating multiple entries for a single quizbowl question into our dataset. For instance, if a Quizbowl question q622 has four clues in total, we create four entries, viz. q622_1, q622_2, q622_3, and q622_4, each corresponding to the question with first i clues, where i &#8712; {1, 2, 3, 4}.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B CAIMIRA Setup.</head><p>In this section, we provide a detailed explanation of the learning objective for CAIMIRA and the hyperparameters used in our experiments. First, let's revise the CAIMIRA objective from Section 3:</p><p>where, s i &#8712; R m is agent skills, and, r j , d j &#8712; R m are question relevance and difficulty resp.</p><p>Here, d i and r j are functions of question representation E q j defined as:</p><p>where W R , W D &#8712; R m&#215;n and b R &#8712; R m . These, along with the embedding matrix E a of agent skills (s i = E a i ), are the parameters we train for CAIMIRA over a regularized cross entropy objective.</p><p>Hyperparameters. The trainable parameters are fit using mini-batch stochastic gradient descent to minimize L CAIMIRA (Equation <ref type="formula">11</ref>), where &#955; d and &#955; s are set to 1e -5. We use Adam optimizer (Kingma and Ba, 2014) without weight decay, and with a learning rate of 0.005, and the batch size is set to 512.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C QA Agents in our study</head><p>This section describes the QA agents used in our study, including the retrievers, LLMs, RAG models, and the prompts used to query them. Retrievers as QA agents. Our retrievers, which index Wikipedia documents, respond with the top k documents (where k = 1, 3, 10) most relevant to the question. We employ two types of retrievers: dense and sparse. The dense retriever, CONTRIEVER <ref type="bibr">(Izacard et al., 2021)</ref>, is pretrained via unsupervised contrastive learning on a mix of Wikipedia and CCNet data and then fine-tuned on MS- <ref type="bibr">MARCO (Campos et al., 2016)</ref>. The sparse Title Recall@10 bm25_title-recall@10 contriever_title-recall@10 Title Recall@3 bm25_title-recall@3 contriever_title-recall@3 Top Title bm25_title-recall@1 contriever_title-recall@1 Inst Title Retriever R@10 grit_title-recall@10 Inst Title Retriever R@3 grit_title-recall@3 Inst Title Retriever R@1 grit_title-recall@1 retriever utilizes the BM25 algorithm <ref type="bibr">(Robertson Zaragoza, 2009</ref>) and Anserini's implementation with index <ref type="bibr">(Lin et al., 2021)</ref>. We also test a title-retriever, assuming the document title is the query answer. Retrievers are evaluated on recallbased accuracy, with a point scored if the answer appears within the top-k documents for contextretrievers, or in the title of the top-k documents for the title-retriever.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Contexts</head><p>Large Language Models (LLMs). We evaluate an array of LLMs, grouped below by their training / scale. All models are evaluated in a zero-shot manner (no finetuning over QB questions). Base Models: The models are exclusively trained on an unsupervised CausalLM objective: <ref type="bibr">OPT (Zhang et al., 2022)</ref>, <ref type="bibr">GPT-Neo (Black et al., 2021)</ref> and Pythia <ref type="bibr">(Biderman et al., 2023)</ref> Benchmark Instruction Tuned (IT) Models: LLMs fine-tuned on tasks with natural instructions over each benchmark; OPT-IML <ref type="bibr">(Iyer et al., 2022)</ref>, T0, T0pp <ref type="bibr">(Sanh et al., 2021</ref><ref type="bibr">), Flan-T5 (Chung et al., 2022)</ref> and Flan-UL2 <ref type="bibr">(Tay et al., 2022)</ref>. Very Large-Scaled Models: Llama-2 (70 billion parameters) <ref type="bibr">(Touvron et al., 2023)</ref> and Falcon (40 billion parameters) <ref type="bibr">(Almazrouei et al., 2023)</ref> and its instruction tuned variant. Due to limited information on their training data mixtures, direct comparisons with other models are challenging. Nevertheless, we include these large-scale models to gauge their performance relative to humans. Closed-Sourced Model-Based APIs: OpenAI's ChatGPT <ref type="bibr">(Ouyang et al., 2022)</ref> and <ref type="bibr">GPT-4 Turbo (OpenAI, 2023)</ref> None of the Transformer-based models, including those pretrained on QA datasets like TriviaQA, are specifically finetuned on QB; we adhere to the standard in-context learning practice <ref type="bibr">(Brown et al., 2020)</ref>,providing a task instruction followed by concatenated QA pair demonstrations. Figure <ref type="figure">17</ref> shows an example of the prompt used for these models.</p><p>Retriever-augmented Generative Models. Following the RAG paradigm from <ref type="bibr">(Lewis et al., 2020)</ref> for open-domain QA, we first retrieve Wikipedia documents relevant to the questions, then employ a generator model for short answer generation. Our retrievers include dense CONTRIEVER and a sparse passage retriever (BM25). For the retriever, we use both a dense retriever (CONTRIEVER) as well as a sparse passage retriever that uses BM25 to encode documents. In our study, we mainly use FlanT5-XL <ref type="bibr">(Chung et al., 2022)</ref> as the generator model, whose input context is limited to 512 tokens and composed of the top-3 documents by retriever. We also explore Flan-UL2 <ref type="bibr">(Tay et al., 2022)</ref>, an instruction-tuned UL2 with a 2048-token receptive field, to handle all the 10 documents. Figure <ref type="figure">18</ref> shows an example of the prompt used for RAG models.</p><p>Answer Match Evaluation. Traditional exactmatch metric often misses alternative answers that have different wordings or forms but the same semantic meaning as the correct answer <ref type="bibr">(Bulian et al., 2022)</ref>. To better handle this, we adopt a fuzzy match evaluation using multiple-answer aliases <ref type="bibr">(Si et al., 2021)</ref>: if the character level matching rate between the predicted answer and the gold answer exceeds a certain threshold, the prediction is considered as correct. The threshold is tuned against human judgments on a small development set.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D Question Features for Logistic Regression Study</head><p>This section describes the features used in the logistic regression study in &#167; 4.3.</p><p>Question Category Features. These features are binary and indicate whether a question belongs to a specific category. These categories are the one highlighted in Figure 2. The categories are: c_question_categories, c_fine_arts, c_cultural_geography, c_geography, c_physical_geography, c_political_geography, c_technical_geography, c_ancient_history, c_history, c_cultural_history, c_exploration_and_colonization, c_military_history, c_other, c_political_history, c_scientific_history, c_social_history, c_language, c_author_and_works, c_literature, c_genre_and_style, c_literary_terms, c_plot_and_characters, c_music, c_mythology, c_political_events, c_politics, c_political_figures, c_political_institutions, c_political_theory, c_religion, c_astronomy, c_science, c_biology, c_chemistry, c_earth_science, c_materials, c_mathematics, c_other, c_physics, c_scientific_history, c_sports, c_technology, c_television/movies Linguistic Features LingFeat is a Python research package designed for the extraction of various handcrafted linguistic features, positioning itself as a comprehensive NLP feature extraction tool. Currently, it is capable of extracting 255 linguistic features from English textual inputs. The features extracted by LingFeat span across five broad linguistic branches that Lee et al. (2021) details.</p><p>&#8226; Advanced Semantic (AdSem): Aims at measuring the complexity of meaning structures. Note: This feature is currently facing some operational issues, which are under investigation.</p><p>&#8226; Semantic Richness, Noise, and Clarity: Extracted from trained LDA models. The models are included and require no further training.</p><p>&#8226; Discourse (Disco): Focuses on measuring coherence and cohesion through entity counts, entity grid, and local coherence score.</p><p>&#8226; Syntactic (Synta): Evaluates the complexity of grammar and structure, including phrasal counts (e.g., Noun Phrase), part-of-speech counts, and tree structure.</p><p>&#8226; Lexico Semantic (LxSem): Measures word/phrasal-specific difficulty through metrics like type-token ratio, variation score (e.g., verb variation), age-of-acquisition, and Sub-tlexUS frequency.</p><p>&#8226; Shallow Traditional (ShTra): Encompasses traditional features/formulas for assessing text difficulty, such as basic average counts (words per sentence), Flesch-Kincaid Reading Ease, Smog, Gunning Fog, etc.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Time based features</head><p>We create two time based feature, t_range and t_range. Both are binary features. t_range is 1 if the question was asked in the context of certain time period or a range, (e.g., in the 20th century, in the 19th), and 0 otherwise. t_range is 1 if the question refers to an event related to another event, (e.g., after the fall of Rome, before the French Revolution), and 0 otherwise.</p><p>Other features o_TRASH is 1 is the question enquires about specific events in pop culture category, and 0 otherwise. This feature reflects the TRASH category from Quizbowl. Similarly, o_Records is 1 if the question enquires about specific records through mention of superlative forms of words like "most recent", "best category", etc, and 0 otherwise. This feature reflects the Records category from Quizbowl. OpenAI GPT3+ openai-gpt-3.5-turbo_1shot openai-gpt-4-turbo_1shot openai-gpt-4o_1shot Figure 15: Agents we use in the GPT-3+ category. RAG (Top 10) rag-bm25_top10-flan-ul2 rag-bm25_wiki_top10-command-r-plus rag-grit_top10-flan-ul2 rag-grit_wiki_top10-command-r-plus RAG-flan-t5-xl (Top 3) rag-bm25_top3-T0pp-11b rag-bm25_top3-flan-t5-xl rag-contriever_top3-T0pp-11b rag-contriever_top3-flan-t5-xl Figure 16: Agents we use in the RAG category. You are a Quizbowl agent expert in Question Answering. Questions are in form of single or multiple clue(s) about a certain concept / entity. The following is a list of Quizbowl clues. Deduce the answer based on what the clues are describing, and answer the question in the form of a single word or a short phrase. Question: { demonstration clues } What is being talked about here? Answer the question in a single word / short phrase. Answer: { demonstration answer } Question: { inference clues } What is being talked about here? Answer the question in a single word / short phrase. Answer: Figure 17: A condensed version of our prompt to Base models, Instruction-tuned models and Closed-source models ( &#167; 4.2). You are a Quizbowl agent expert in Question Answering. Questions are in form of single or multiple clue(s) about a certain concept / entity. Answer the Quizbowl question by finding a short answer from the reference documents listed below. Documents: { Document 1 Title}: { Document 1 Content} { Document 2 Title}: { Document 2 Content} . . . { Document k Title}: { Document k Content} Question: { inference clues } What is being talked about here? Find the answer from above documents and answer in a single word or a short phrase. Answer: Figure 18: A condensed version of our prompt to our retriever-augmented generative (RAG) models ( &#167; 4.2).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E Question Difficulty</head><p>This section enlists the full set of heatmaps of mean relevance r j,k and mean effective difficulty d</p><p>(e) D,&#181; k of question clusters across the five latent factors (k).</p><p>Abduction (V.Hard) 0.62 0.09 0.14 0.09 0.06 Mean Relevance (r j, k )</p><p>1.87 -0.10 -0.38 -0.05 -0.47 Mean Effective Difficulty (r j, k d j, k ) 1.46 (r T j d j ) Mixed Bag (Hard) Mixed Abd. (Hard) 0.29 0.19 0.29 0.15 0.08 0.32 0.13 0.19 0.29 0.06 -0.28 0.13 -0.27 0.30 -0.03 0.35 0.25 -0.04 -0.77 -0.23 -0.22 -0.25 Sci. Reason (Med) GeoPol 2 (Med) 0.46 0.09 0.29 0.09 0.07 0.14 0.60 0.12 0.08 0.06 -1.55 0.33 0.61 0.14 0.80 0.20 -1.01 0.03 0.29 -0.31 -0.72 -0.93 Mixed Sem. (Easy) Hist. Reason (Easy) Science 1 (Easy) 0.20 0.20 0.07 0.11 0.41 0.32 0.37 0.17 0.10 0.05 0.20 0.06 0.69 0.03 0.02 0.23 -0.09 0.16 0.39 -2.03 -0.96 -0.90 -0.15 0.19 -0.16 0.35 -0.08 -2.12 0.05 0.20 -1.65 -1.68 -1.83 Abduce Events Sci Rec Sem CAIMIRA Latent factors (k) Sci. History (V.Easy) Mixed Cult. (V.Easy) Cult History (V.Easy) 0.11 0.45 0.41 0.01 0.02 0.19 0.09 0.11 0.58 0.03 0.17 0.38 0.06 0.36 0.03 Abduce Events Sci Rec Sem CAIMIRA Latent factors (k) 0.34 -1.30 -1.08 -0.04 -0.30 0.24 0.12 0.03 -2.34 -0.14 -0.16 -1.05 0.15 -1.34 -0.48 Overall -2.13 -2.18 -2.34  .9 17.5 19.6 21.2 37.8 42.2 50.1 57.1 64.6 48.0 75.7 71.5 63.7 14.1 57.7 54.9 62.6 71.1 87.7 89.5 86.9 97.7 79.7 97.8 95.6 88.8 17.4 35.0 46.7 52.7 47.9 34.6 49.2 53.6 53.5 78.3 54.0 70.5 80.2 37.4 60.5 67.6 73.6 69.2 53.5 76.6 85.7 78.9 87.3 77.4 90.2 91.9 17.4 52.3 50.9 65.0 59.4 52.8 74.2 67.9 80.3 81.0 63.4 75.4 77.9 30.2 67.1 69.2 80.5 72.8 67.1 84.1 78.6 91.5 76.7 86.1 85.2 90.7 55.9 85.8 90.2 95.0 88.6 83.1 99.5 89.3 100.0 85.2 95.5 98.4 99.2 43.4 79.3 81.0 90.5 84.4 79.7 97.2 88.4 97.2 96.0 92.9 99.2 97.3 30.2 76.7 74.3 73.2 81.0 92.0 89.2 98.2 94.4 93.1 97.2 93.4 97.3 36.8 81.1 80.1 80.7 85.2 91.5 96.8 96.4 98.6 96.8 99.5 97.5 97.7 34.5 81.3 78.0 82.7 83.6 94.4 92.7 94.6 88.7 91.0 98.6 100.0 92.6 48.8 93.2 87.0 88.6 90.6 97.3 98.8 96.4 97.2 95.2 98.6 98.4 98.8 49.5 90.9 89.1 88.6 90.9 99.0 97.2 98.2 100.0 99.5 99.3 100.0 98.8 76.2 96.7 96.6 94.1 96.2 99.3 99.3 100.0 100.0 100.0 100.0 98.4 100.0 76.2 74.9 80.2 87.1 84.2 85.0 88.7 91.6 92.4 82.5 94.2 96.6 89.6 85.2 84.2 87.1 92.4 90.6 89.0 97.2 96.0 95.5 87.7 96.5 96.0 95.9</p><p>Abduction (V.Hard) Mixed Bag (Hard) Mixed Abd. (Hard) Sci. Reason (Med) All GeoPol 2 (Med) Science 1 (Easy) Hist. Reason (Easy) Sci. History (V.Easy) Mixed Sem. (Easy) History 1 (V.Easy) Cult History (V.Easy) Mixed Cult. (V.Easy) Base LLMs Inst-tuned LLMs BM25 Title Recall@10 GRIT Title Recall@10 BM25 Context Recall@1 GRIT Context Recall@1 GRIT Context Recall@10 BM25 Context Recall@10 RAG-flan-ul2 (Top 1) RAG CMD-R+ (Top 10) Mixtral 8x7b Instruct Meta Llama-3 70b Instruct GPT-4 Turbo GPT-4 Omni Single Human Human Team (15) 10 20 30 40 50 60 70 80 90 100 Question-subsets clustered by their effective-difficulty Loading [MathJax]/extensions/MathMenu.js  <ref type="figure">19</ref>. We use the same color scheme as in Figure <ref type="figure">9</ref>.</p><p>Science 1 (Easy)</p><p>Answer: Spanish Clues: One writer in this language wrote the collection "Twenty Love Poems and a Song of Despair."</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Answer: Earth</head><p>Clues: In Jainism, this object's central point is Mount Meru. In Chinese mythology, this object is the lower half of a cosmic egg split by Pangu, while in ancient Egypt the original form of this object was the primordial ( * ) mound.</p><p>Answer: mitochondria (" MY-toe-KON-dree-uh ") Clues: The DNA in this organelle ("or-guh-NELL") is inherited only from the mother. The inner membrane of this organelle contains folds known as cristae ("CRISS-tay") and encloses its matrix.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Answer: coral reefs</head><p>Clues: Darwin's first paper was on the formation of this biome, whose organisms are threatened by white-band disease. Acidification removes the minerals needed for this ecosystem to grow as each new generation builds on the calcium carbonate skeletons of the previous one.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Answer: Ohio</head><p>Clues: n this state's capital, the Lane Avenue Bridge crosses the Olentangy River. Another of its cities contains historic Italian architecture in its Over-the-Rhine neighborhood, while another city, at the mouth of the Cuyahoga River, contains Case Western Reserve University. Much of its northern border is at Lake ( * ) Erie, and it is separated from Kentucky by its namesake river. For 10 points, name this state containing Cincinnati, Cleveland, and Columbus.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Answer: Chlorine or Cl</head><p>Clues: Stomach acid consists mainly of a compound of hydrogen and this element. It is the second-lightest halogen, after fluorine, and at room temperature is a yellow-green gas. Compounds with it, carbon, hydrogen, and fluorine deplete the ozone layer and are called ( * ) CFCs. It is used in bleach as well as to disinfect swimming pools, and forms table salt along with sodium. For 10 points, name this element, number 17, symbolized Cl.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Answer: electron</head><p>Clues: This particle was discovered by J.J. Thomson, and its exact charge was discovered in the Millikan oil drop experiment. According to the Pauli Exclusion Principle, two of these particles cannot exist in the same quantum state.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Answer: matter</head><p>Clues: The density parameter for the non-relativistic form of this falls off with the cube of the scale factor. This substance dominated the universe from approximately 75,000 years after the Big-Bang until about 4 billion years ago.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Answer: violin</head><p>Clues: The Rhapsody on a Theme of Paganini was written from twenty-four caprices originally written for this instrument. Vivaldi's The Four Seasons is a set of concerti ("con-CHAIR-tee") written for this instrument.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Answer: glaciers</head><p>Clues: These objects contain the zone of plastic flow and the zone of brittle flow. They are formed by compressing firn, and parts of them break off by calving. Till is soil left behind by these objects, which also push material to form moraines. Hist. Reason (Easy) Answer: Scooby-Doo Clues: Big Bob Oakley was the first person on this show to say "I'd have gotten away with it too, if it weren't for those kids," and one show in this series introduced a character named Scrappy. In 2002, a film of the same name starred Freddie Prinze, Jr. as Freddy and Sarah Michelle Gellar as Daphne. For 10 points, name this cartoon franchise, named for a cowardly Great Dane. Answer: Steve Jobs Clues: This man, along with Edwin Catmull, was credited as an executive producer of the original Toy Story movie, produced by Pixar Animation, which he renamed after purchasing it from George Lucas in 1986. From 2000 to 2011, he served as CEO of the computer company he co-founded with Steve Wozniak. Answer: Neptune Clues: A triangular patch of clouds that circulates this planet quickly is known as The Scooter. Its atmosphere contains the fastest winds in the solar system. Its existence was predicted by Alexis Bouvard, and it was discovered by Johann Galle. It often contains the Great Dark Spot. Its largest moon, which has a retrograde orbit, is Triton. For 10 points, name this gas giant, the farthest from the Sun in the solar system. Answer: Orion Clues: This constellation contains the Trapezium Cluster and is the site of a late-October meteor shower. Answer: Niccolo Machiavelli Clues: Although he is not Sun Tzu, this man wrote a version of The Art of War. He wrote a critique of Roman history in his Discourses on Livy. Answer: prime numbers Clues: The fundamental theorem of arithmetic states that every positive integer can be uniquely represented as a product of these numbers. Answer: The New York Times Clues: This newspaper was sued by Alabama public safety officer Louis B. Sullivan. Its long-time publisher, Arthur Ochs Sulzberger, died in 2012. Answer: Uncle Tom's Cabin Clues: In this novel, shelter is provided by the Halliday and Bird families. At the beginning of this novel, the Shelby family sells their property to the St. Clare family. At the end of this novel, George and Eliza Harris escape north. The husband of Aunt Chloe is killed by Simon Legree in, for 10 points, what American novel, depicting the life of slaves, written by Harriet Beecher Stowe? Answer: Harry Mason Reid Clues: This man almost lost his Senate seat in the 1998, surviving a challenge from future colleague John Ensign, and he is expected to have a tough re-election in 2010 against Sue Lowden or Danny Tarkanian. He commented that Barack Obama was "light-skinned" and "spoke with no Negro dialect, unless he wanted one." For 10 points, name this senior Senator from Nevada, the current Senate Majority Leader.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Answer: Pangaea</head><p>Clues: One piece of evidence that supports its existence is that the Caledonian mountains of Northern Europe are a continuation of the Appalachian Mountains. This entity broke up into Laurasia and Gondwanaland ("gon-DWON-uh-land"). History 1 (V.Easy) Answer: Puerto Rico Clues: The independence of this commonwealth has been sought by Rub&#233;n Berr&#237;os, while an opposite approach has been pushed by its New Progressive Party under Pedro Pierluisi. In 2012, this commonwealth elected Alejandro Garc&#237;a Padilla as governor and voted in a referendum to end its territorial status. ( * ) For 10 points, name this Caribbean Island, a United States territory that may someday become the 51st state. Answer: Philadelphia, Pennsylvania Clues: In this city, Wissahickon Creek goes through Fairmount Park. This city can be entered by crossing the Delaware River on the Betsy Ross Bridge. One of its buildings, where the Second Continental Congress adopted the ( * ) Declaration of Independence, is Independence Hall. The Liberty Bell is found in, for 10 points, what city in Pennsylvania? Answer: Yellowstone National Park Clues: The last wild herd of bison in the United States was located in this park, where today they are hunted by grizzly bears and wolves reintroduced in the 1990s. Answer: Leo Tolstoy Clues: One work by this author, about a man who injures himself while hanging curtains, is The Death of Ivan Ilyich. One of his novels has a relationship between Levin and Kitty, while the title character has an affair with Count Vronsky and eventually commits suicide by jumping in front of a ( * ) train. For 10 points, name this author who wrote about the French invasion of Russia in War and Peace in addition to writing Anna Karenina.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Answer: Federal Republic of Germany</head><p>Clues: One leader of this country forcibly annexed the Sudetenland ("soo-DAY-ten-land"). During a movement to reunite this country, the leader of one half operated under the policy of ostpolitik ("OST-pol-it-ick"). Following World War I, the Weimar ("VIE-mar") Republic was established in this nation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Answer: Thomas Jefferson</head><p>Clues: This politician responded to Francois Barbe-Marbois in his Notes on the State of Virginia. This man founded the University of Virginia and designed the mansion of Monticello..</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Answer: Mexico</head><p>Clues: In 1822, the House of Iturbide ("EE-tur-BEE-day") assumed control of this nation for one year. This nation was ruled by an Austrian emperor installed by Napoleon III, Maximilian, although he was overthrown by Benito Juarez ("WAHR-ezz"). The Gadsden Purchase bought land from this country, whose victory at Puebla ("PWAY-bluh") is celebrated as Cinco de Mayo. For 10 points, identify this nation that once owned California and Texas.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Answer: Ronald (Wilson) Reagan</head><p>Clues: This man used powers granted by the Taft-Hartley Act during a confrontation with air traffic controllers, and his Defense Secretary resigned after violations of the Boland Amendment were revealed. Before those events during his presidency, he served as Governor of California from 1967 until 1975. Prior to entering politics, this man was a famous ( * ) Hollywood actor. For 10 points, name this Republican president from 1981 to 1989.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Answer: Isaac Asimov</head><p>Clues: This author wrote a story in which the inhabitants of Lagash experience darkness for the first time. Along with "Nightfall," this author wrote a series of novels featuring the investigative interactions of Elijah Baley and R. Daneel Olivaw. Hari Selden invents the science of psychohistory in this author's novel ( * ) Foundation. For 10 points, name this Russian-American science fiction writer who depicted the Three Laws of Robotics in his collection, I, Robot.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Answer: Julius Caesar</head><p>Clues: This man fought against Ariovistus ("air-ee-oh-VIS-tuss"), a German leader, and Vercingetorix ("ver-KING-uh-TOR-ix"), a chieftain of the Arverni ("ar-VEHR-nee") whose defeat is described in this man's book, Commentaries on the Gallic Wars.</p><p>He led his troops across the Rubicon to start a civil war with Pompey, one of his partners in the First Triumvirate. For 10 points, name this Roman leader who was assassinated by Brutus on the Ides of March. Figure 28: Examples of questions from different clusters. Mixed Cult. (V.Easy) Answer: The Nutcracker Clues: This work opens with the title item given as a gift by Drosselmeyer; it is later broken by Fritz. Spanish, Arabian, and Chinese dances in this ballet are said to represent different substances such as chocolate, coffee, and tea. The Waltz of the Snowflakes and Dance of the ( * ) Sugarplum Fairy appear in, for 10 points, what Peter Tchaikovsky ballet about Clara's Christmas gift coming to life? Answer: King Arthur Clues: A popular novel about this figure is T.H. White's The Once and Future King. In the Annales Cambriae (ah-NAH-less CAM-bree-ay), this figure was mortally wounded at the Battle of Camlann during a fight with his son Mordred.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Answer: Thebes</head><p>Clues: This city was founded by Cadmus after following a cow until it sat. This city was besieged by the Sphinx, as all travelers who entered it were forced to either solve its riddle or be eaten. To avenge the sleight done to him by Eteocles("et-TEE-oh-clees"), Polyneices ("polly-NYE-kees") led a group of seven warriors against this city.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Answer: WikiLeaks</head><p>Clues: A PowerPoint presentation released by this organization details how Bank of America plans to attack it. One portion of this organization is run by the Sunshine Press. In November 2010, a Fox News host called it a "terrorist organization" after it published U.S. State Department diplomatic cables.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Answer: Isaac Newton</head><p>Clues: In this scientist's book Opticks, he discussed his experiments with the dispersion of light, including breaking white light into its constituent colors using a prism. One law named for him describes "universal ( * ) gravitation"; another states that the net force on an object is its mass times its acceleration, while a third states that for every action there is an equal and opposite reaction. For 10 points, name this English scientist who formulated three laws of motion.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Answer: Girl Scout Cookies</head><p>Clues: A group from Muskogee, Oklahoma is believed to be the first to produce and sell these items popularly sold as a fundraiser for an organization founded by Juliette Gordon Low in 1912.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Answer: Odysseus</head><p>Clues: This man's dog Argus dies atop a refuse heap. He reveals himself to a foot-washing maid, Eurycleia ("your-ee-CLAY-uh"). The Laestrygones ("LAY-strih-GOAN-ees") destroy many ships belonging to his fleet, and he also visits the land of the lotos ("lotus") -eaters. He kills his wife's suitors with the help of his son, Telemachus ("TELL-uh-MOCK-us"), then reunites with that wife, Penelope. For 10 points, an epic by Homer describes what man's twenty-year quest to get home after the Trojan War?</p><p>Answer: Alice Clues: This character watches a lion and a unicorn fight over a crown, and although her cat Dinah will not talk to her, the Tiger Lily and the other flowers will. She shrinks after drinking a potion labeled "Drink Me," and attends a tea party with a sleepy Dormouse, a March Hare, and a Mad Hatter.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Answer: Trojan War</head><p>Clues: Neoptolemus killed King Priam in the final stages of this event, after which Aeneas fled with his son. This event began after the Judgement of Paris and ( * ) Helen's abduction from King Menelaus of Sparta. After nine years, it finally ended after Greek soldiers got past enemy gates while hiding in a giant wooden horse. For 10 points, name this conflict in Greek mythology that featured warriors like Hector and Achilles.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Answer: Noah</head><p>Clues: Seven laws that apply to non-Jews are named for this figure, whose nakedness was uncovered by one of his sons.</p><p>An agreement this figure made with God is symbolized by the rainbow. He was the son of Lamekh (LAH-meck) and had three sons, Japheth (JAY-feth), Ham, and Shem. To confirm that one of his jobs was complete, he sent a dove to check for dry land. For 10 points, identify this Biblical character who took two animals of each kind in his ark. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Sci. History (V.Easy)</head><p>Answer: Andes Mountains Clues: This mountain range includes the Vilcabamba ("VEEL-cuh-BOM-buh") sub-range and contains a plateau called the altiplano ("ALL-tee-PLAN-oh").</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Answer: London</head><p>Clues: Hampstead Heath and Kensington Gardens are parks in this city which is served by the "Jubilee Line," "Piccadilly Line," and "Victoria Line" of its subway system, the Underground. A Norman castle built by William the Conqueror is this city's "Tower."</p><p>Answer: Amazon River Clues: The island of Marajo (mah-RAH-hoh) is located at the mouth of this river which was named by Spanish conquistador Francisco de Orellana (day OH-ray-YAH-nah) for the warrior women of Greek mythology.</p><p>Answer: Panama Canal Clues: Lake Gatun ("GAH-tune") is part of this waterway, whose construction was made possible by the Hay-Bunau-Varilla ("HAY boo-NOW vah-REE-uh") Treaty and the secession of a province from Colombia. A 1977 agreement between Omar Torrijos ("torr-EE-hos") and Jimmy Carter resulted in the return of the special zone associated with it.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Answer: Antarctica</head><p>Clues: This geographical feature has its lowest point at Bentley Trench. A lake here lies under Vostok Station. Mt. Erebus is found on Ross Island off itscoast, between Marie Byrd and Victoria lands. The Sentinel Range of the Ellsworth Mountains contains its highest peak, Vinson Massif, located on the Ronne ( * ) Ice Shelf.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Answer: Saturn</head><p>Clues: Great White Spots are frequent storms on this planet. Its moons include Iapetus, Rhea, Enceladus, and the only known one to have an atmosphere. This planet is less dense than water. The Cassini Division is located in its extensive ring system. For 10 points, name this second largest planet in the solar system, the sixth from the Sun.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Answer: New York City</head><p>Clues: A museum branch located in this city's Fort Tryon Park containing medieval art is known as The Cloisters. One of its straits, which includes Roosevelt Island and Rikers Island, is the East River.</p><p>Answer: Panama Canal Clues: Lake Gatun ("GAH-tune") is part of this waterway, whose construction was made possible by the Hay-Bunau-Varilla ("HAY boo-NOW vah-REE-uh") Treaty and the secession of a province from Colombia.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Answer: Vienna, Austria</head><p>Clues: This city contains the neo-gothic Votive Church, and its Karlskirche (KARLS-keer-kuh) is the largest Baroque Cathedral north of the Alps. It is the capital of a country with such states as Burgenland, Tyrol, and Styria. This city's Ring Boulevard was ordered to be restructured by Franz Joseph I, and it lies on the Danube just upriver from Bratislava, the capital of Slovakia.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Answer: Orion</head><p>Clues: This constellation contains the Trapezium Cluster and is the site of a late-October meteor shower. One of its stars, formerly known as the Amazon Star, is Bellatrix, and its brightest stars are Betelgeuse and Rigel. Its namesake nebula joins with Hatysa and other stars to form its sword, while Alnitak, Alnilam, and Mintaka form its belt. Cult History (V.Easy) Answer: Michelangelo di Lodovico Buonarroti Simoni Clues: This artist's statues of a dying slave and a horned Moses were to adorn the tomb of Julius II. His only signed work is one in which Mary holds the dead body of Jesus, entitled Piet&#225; ("pee-AY-tuh"). One of his works depicts a nude giant killer holding a sling. Answer: Charles Dickens Clues: This author wrote about the eviction of Nell Trent and her grandfather from The Old Curiosity Shop. In another work by this author, Abel Magwitch raises a fortune for the orphan Pip, who loves Estella. He also wrote about Sydney Carton sacrificing himself to save Charles Darnay in a work set in London and Paris. Answer: Oklahoma Clues: This modern state's panhandle was crossed by the Cimarron Cutoff, a branch of the Santa Fe Trail. A city in this state is called "Broken Arrow" because it was settled by Creek people, while part of this state was known as the "Indian Territory." White settlers who anticipated an 1889 decision to open its lands to homesteaders gave this state its nickname: the Sooner State. For 10 points, Tulsa is located in what state between Texas and Kansas? Answer: Blessed Virgin Mary Clues: In the Gospel of James, this Biblical figure is described as the child of Anna and Joachim. At the First Council of Ephesus, this figure was given the epithet Theotokos, or "God-Bearer." Martin Luther described this person as "the highest woman." This woman is held to be free from original sin under the doctrine of Immaculate Conception. For 10 points, name this mother of Jesus of Nazareth. Answer: Frankenstein, or the Modern Prometheus Clues: The protagonist of this work returns home from the University of Ingolstadt to find that Justine Moritz has been accused of his brother William's murder. The title character, whom Robert Walton discovers in the Arctic in a frame story, had earlier married Elizabeth Lavenza, who was killed on their wedding night. Answer: Paul Ryan Clues: This politician claimed that he went into politics because of Ayn Rand and made Atlas Shrugged required reading for his staff, but he later said he rejected Rand's atheism. He is the current chair of the House Budget Committee, and one of his budget proposals was titled ( * ) "The Path to Prosperity." For 10 points what Wisconsin Republican was Mitt Romney's Vice Presidential nominee in the 2012 election?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Answer: cerebrum</head><p>Clues: This structure is divided into Brodmann areas, and develops from the telencephalon ("TEAL"-en-SEFF-ah-"lawn"). The corpus callosum ("CORE"-puss kuh-LOE-sum) connects the two hemispheres of this structure, which is divided into temporal, parietal, occipital, and frontal lobes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Answer: Michelangelo di Lodovico Buonarroti Simoni</head><p>Clues: This artist's statues of a dying slave and a horned Moses were to adorn the tomb of Julius II.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Answer</head><p>: John Quincy Adams Clues: This person negotiated a treaty that ceded Florida to the United States with Luis de Onis (loo-EES day oh-"NIECE") while serving as James Monroe's Secretary of State. This man agreed to name Henry Clay Secretary of State in order to break a deadlock in the House of Representatives; that decision was the first "corrupt bargain." Answer: Sarah Palin Clues: This person's visit to Fort Bragg caused a stir when the press was denied entry to a book tour for Going Rogue. This person resigned from the position of Governor of the state closest to Russia shortly after a campaign loss in the most recent general election. Tina Fey did a notable impression of, for 10 points, what unsuccessful vice presidential candidate who ran alongside John McCain in 2008? </p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0"><p>The implementation can be found at https:// github.com/maharshi95/neural-irt</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_1"><p>Negative skill values (s i &lt; 0) and their interaction with &#945; j &gt; 1 could mimic similar likelihood estimates (p(Ui,j)) as that of positive skills (s i &gt; 0) with &#945; j &gt; 1.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_2"><p>We skip the bias term for d &#8242; j since it is mean-centered.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_3"><p>The dataset is available on the HuggingFace platform as mgor/protobowl-11-13.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_4"><p>This method is a basic approach to represent group decision-making, acknowledging more complex dynamics for future research.</p></note>
		</body>
		</text>
</TEI>
