<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>A Conceptual Chronicle of Solving Raven's Progressive Matrices Computationally</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10347959</idno>
					<idno type="doi"></idno>
					<title level='j'>Proceedings of the 8th International Workshop on Artificial Intelligence and Cognition</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Y. Yang</author><author>D. Sanyal</author><author>J. Michelson</author><author>J. Ainooson</author><author>M. Kunda</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Matrix reasoning or geometric analogy problems, like those found on the widely used Raven's Progressive Matrices test of intelligence, have been used as a challenge for machine intelligence since the early work of Evans in the 1960s. While AI research on the RPM has gone through dramatic shifts alongside other AI advances, including a dramatic rise of machine-learning-based approaches in the last five years, many of these studies have progressed in relatively siloed research lines, making it difficult to compare different approaches and judge progress in the field as a whole. This paper intends to provide a framework for understanding the different lines of work in this research field. In particular, we reviewed 50+ computational models for solving RPM or RPM-like problems and collated them into a linear conceptual framework to help researchers navigate across these diverse research paradigms. We also provide instructions on other resources such as problem/data sets and necessary background knowledge of RPM.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Raven's Progressive Matrices (RPM) is a very popular human intelligence test because of its easy administration, its interpretability, and its non-verbal item format. As shown in Fig. <ref type="figure">1</ref>, RPM items are multiple-choice problems, where the correct choice completes the matrix such that the variations are consistent across its rows and columns. The first RPM problem set was published nearly a century ago by Raven <ref type="bibr">[1]</ref> to study the genetic determinants of human intelligence. While our understanding of the original specifics of test development and of the targeted mental constructs has evolved over time, RPM is still used as a measure of fluid intelligence and even of general intelligence <ref type="bibr">[2]</ref>. Later studies <ref type="bibr">[3]</ref> showed that RPM indeed occupies a central position among intellectual ability tests, in that a person's RPM scores correlate highly with scores on tests across many different ability domains.</p><p>[Figure 1: An example from <ref type="bibr">[4]</ref> showing the two formats of the original RPM-2&#215;2 and 3&#215;3-and illustrating two common ways to interpret/construct RPMs-perceptual transformation and logical relation.]</p><p>RPM is a masterpiece of item-writing "art", integrating multiple complexity factors <ref type="bibr">[5,</ref><ref type="bibr">6</ref>] into a single framework. Within this framework, adjusting these factors can produce a large (even infinite) collection of items, which possesses great variability in appearance and a wide range of difficulty. Despite this complexity, RPM items are generally intelligible and solvable for human subjects (up to a reasonable distribution of difficulty). This type of task contrasts sharply with two types of tasks that are commonly considered in machine intelligence. 
The first type is represented by game playing, where the rules and the regulated search space restrict variability and openness, but the symbolic reasoning is complicated and can even exceed the capacity of human working memory; the second is represented by image classification, which is highly variable but trivial for human intellectual ability. The openness, variability, and moderate difficulty of RPM, together with its close relation to general intelligence, therefore make RPM an ideal propellant for research on machine intelligence. Given the above understanding of why researchers seek computational solutions to RPM, we now move on to the technical details. We keep the prerequisite knowledge to a minimum, requiring no prior knowledge of RPM or of solving RPM, and unfold our discussion in a manner that reveals the philosophy behind the technical development in the simplest language. As the first stop of this journey, we refer our readers to Section A in the appendix for available problem/data sets of RPM. Since the 3&#215;3 format (Fig. <ref type="figure">1b</ref>) is the most common one among these problem/data sets, we will assume a 3&#215;3 format in the following discussion. If the reader is already familiar with RPM problem/data sets, she can move directly to the next section, where we introduce a linear framework-a conceptual chronicle-that explains the technical development of computational solutions to RPM. The development is divided into four stages, each characterized by a distinct high-level approach. In particular, the last stage features the structural evolution of machine learning models for solving RPM, which is further divided into four types of models; through them we wish to show the progression from the first attempt to solve RPM by machine learning to the ultimate goal of building visual abstract reasoning ability through machine learning. 
As an important complement to the structural evolution of learning models, we give a full description, in Section B in the appendix, of the training techniques that have been specially adapted for solving RPM.</p><p>To clarify the terminology: we use RPM to refer to both the original RPM and automatically generated RPM-like items, unless they need to be differentiated. Various terms have been used for the images in an RPM matrix, such as panel, figure, entry, cell, and frame; in this paper we use panel, and thus refer to the images of given matrix entries as context panels and the images of answer choices as choice panels.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">A Conceptual Chronicle of Technical Development</head><p>To cover all the relevant but heterogeneous works and present them in an understandable way, we adopt a linear framework, in which we arrange the technical development into a conceptual chronicle of stages of computational solutions to RPM. This pseudo-chronicle is roughly, though not strictly, chronological, but serves well the purpose of being comprehensive and intelligible. We will use the acronyms of computational models for simplicity; please refer to Table <ref type="table">1</ref> in the appendix for their full names.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Stage 1: Imagery-Based Approach</head><p>Visual mental imagery, in human cognition, refers to the creation, inspection, and manipulation of visual knowledge representations that do not match concurrent perceptual inputs, and that serve some functional purpose in solving tasks <ref type="bibr">[7]</ref>. Using imagery to solve RPM problems-which evidence from psychology and neuroscience suggests does occur (see review in <ref type="bibr">[4]</ref>)-a person might inspect objects in the matrix, compare them by mentally superimposing one on another, mentally transform the objects, and mentally estimate perceptual similarity. Computational imagery-based systems <ref type="bibr">[8,</ref><ref type="bibr">4,</ref><ref type="bibr">9]</ref> represent RPM panels by raster images, systematically apply predefined pixel-level operations on the images (e.g., affine transformations and set operations) and calculate pixel-level similarities between the images (e.g., Jaccard index and Hausdorff distance). These systems have proven to be effective for solving the original RPM <ref type="bibr">[10,</ref><ref type="bibr">11]</ref>.</p></div>
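To make the pixel-level flavor of this stage concrete, the following minimal sketch scores answer choices by Jaccard similarity between binary raster panels. It is not any published system: the panels and the "keep the row-wise similarity consistent" heuristic are our own toy assumptions, and real imagery-based models also apply affine transformations, set operations, and other metrics such as the Hausdorff distance.

```python
import numpy as np

def jaccard(a, b):
    """Jaccard index of two binary raster panels (True/1 = ink pixel)."""
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # two blank panels count as identical
    return float(np.logical_and(a, b).sum() / union)

def best_choice(context_pair, choices):
    """Pick the choice panel whose similarity to the last context panel best
    matches the similarity observed between the two context panels, i.e. the
    choice that keeps the row-wise variation consistent."""
    a, b = context_pair
    target = jaccard(a, b)                        # similarity within the context
    errors = [abs(jaccard(b, c) - target) for c in choices]
    return int(np.argmin(errors))
```

In a fuller sketch, one would first apply each candidate transformation (identity, rotation, XOR, etc.) and keep the one that best explains the context row before scoring the choices.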
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Stage 2: Logical Reasoning</head><p>The imagery-based approach provides an "in-place" solution, i.e., it solves a visual reasoning problem "visually", without auxiliary devices or further abstraction of the problem representation. While this approach has been found successful on RPM problems, it is restricted by the predefined pixel operations and similarity metrics, and by their computational cost. Logical reasoning, a task-general and efficient tool from the early development of AI, can thus also be applied to RPM. In these efforts <ref type="bibr">[12,</ref><ref type="bibr">13,</ref><ref type="bibr">14]</ref>, predicate and propositional representations (or other symbolic representations) are predefined to describe the objects and rules in RPM problem sets. These efforts initially focused on modeling human cognitive behavior in solving RPM, by comparing different symbolic representations and control mechanisms, and, at the same time, laid the foundations for later problem-solving computational models. In fact, the logical reasoning approach predates the imagery-based approach of Stage 1 <ref type="bibr">[15]</ref>, but "logically" postdates it in our conceptual chronicle, as it requires a higher level of problem representation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Stage 3: Probabilistic Reasoning</head><p>Despite being more general, logical reasoning sidesteps the uncertainty in perception, e.g., how the symbols used in logical reasoning are determined from panel images. Probabilistic reasoning is a natural upgrade of pure logical reasoning, in which perceptual uncertainty is modeled through conditional probability distributions given a panel image. This upgrade leads us to the neural-symbolic paradigm, in which a neural perception frontend extracts distributions over a predefined set of objects from each panel, and a symbolic reasoning backend performs probability calculations according to a predefined set of rules. The result of the probability calculation indicates the probability of every possible outcome of the reasoning. Different implementations of the frontend and backend have been used to construct probabilistic reasoning models, such as ALANS2, PrAE, and NVSA <ref type="bibr">[16,</ref><ref type="bibr">17,</ref><ref type="bibr">18]</ref>. Although both probabilistic reasoning and logical reasoning work on the basis of predefined sets of objects and rules, probabilistic reasoning has entered the realm of data-driven approaches (because the neural perception frontend requires training data), leaving the imagery-based and logical reasoning approaches in the realm of knowledge-based approaches.</p></div>
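As a minimal illustration of the symbolic backend's probability calculation, suppose the neural frontend has already produced a categorical distribution over one attribute for each panel in a row. The attribute, the "constant" rule, and the independence assumption below are our own toy choices, not the actual formulations of ALANS2, PrAE, or NVSA.

```python
import numpy as np

def prob_constant(panel_dists):
    """Probability that all panels in a row share the same attribute value.

    panel_dists: (n_panels, n_values) array whose i-th row is the frontend's
    categorical distribution over attribute values for panel i. Assuming the
    panels are independent, P(constant) = sum_v prod_i P(panel_i = v).
    """
    return float(np.prod(panel_dists, axis=0).sum())

def score_choice(context_dists, choice_dist):
    """Score a choice panel by the probability that the completed third row
    satisfies the 'constant' rule."""
    return prob_constant(np.vstack([context_dists, choice_dist]))
```

A full backend would evaluate every rule in its predefined set this way and select the answer maximizing the joint probability.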
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Stage 4: Learning Approach</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>[Figure 2: Learner Type 1-a panel-wise encoder followed by a rule distribution approximator, whose input is a sequence of panels: an adjacent pair, a row, a column, a diagonal, etc.]</p><p>To reduce the reliance on explicit prior knowledge about RPM, the learning approach has been exploited, following recent developments in deep learning. This section subdivides the learning approach into four types that together show the structural evolution of learning models for solving RPM.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.">Learner Type 1</head><p>A natural way to reduce the reliance on predefined objects and rules is similar to the upgrade from logical reasoning to probabilistic reasoning. That is, we can approximate the conditional distribution over the possible rules given the matrix panels. There exist different ways to organize the matrix panels to compute this conditional distribution, which all depend on the parallelism heuristic of the matrix structure, i.e., the assumption that geometric parallelism implies abstract conceptual parallelism across rows and columns. For example, the Pairwise-ADV and Triple-ADV models <ref type="bibr">[19,</ref><ref type="bibr">20]</ref> compute the rule distribution of binary variables given two and three adjacent panels, respectively. A binary variable indicates whether a specific rule applies to the adjacent panels, for example, whether the objects in the panels are of the same color. Another example is DeepIQ <ref type="bibr">[21]</ref>, in which the rule distribution given two adjacent panels is over an ordered categorical variable (rather than binary), indicating, for example, that the objects in the two panels differ by 3 units in size. According to the parallelism heuristic, the rule distributions in parallel rows or columns should be the same or similar; thus, probability metrics, such as KL-divergence and Euclidean distance, are used to measure the similarity, and an answer choice is chosen when it gives a third row/column whose rule distribution is most similar to those of the context rows/columns. These models can be abstracted into Learner Type 1, as shown in Fig. <ref type="figure">2</ref>. Note that a panel-wise encoder is used to process each input panel individually, which is similar to the perception frontend in probabilistic reasoning. But, unlike the perception frontend, the panel-wise encoder does not necessarily output distributions over predefined objects. 
The panel-wise encoder may target any latent space, as long as the rule distribution can be approximated from that space. After the panel-wise encoder encodes each panel in an input sequence, the embeddings of these panels are aggregated by the rule distribution approximator into a rule distribution; the distributions of different sequences are finally compared to select the answer choice. The panel-wise encoder and rule distribution approximator can be implemented as appropriate neural network modules, such as a CNN, ResNet, or MLP. In practice, these two modules are jointly trained given the ground-truth rule labels of panel sequences.</p></div>
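The comparison step of a Type-1 learner can be sketched as follows. This is a toy illustration: the callables stand in for the trained panel-wise encoder and rule distribution approximator, and only the KL-divergence variant of the similarity metric is shown.

```python
import numpy as np

def kl(p, q, eps=1e-9):
    """KL divergence between two discrete rule distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float((p * np.log(p / q)).sum())

def select_answer(rule_dist, context_rows, complete_row, choices):
    """rule_dist: maps a row of panels to a rule distribution (in a real model,
    the panel-wise encoder plus rule distribution approximator).
    complete_row: builds the third row from a given choice panel.
    Returns the index of the choice whose completed row has the rule
    distribution closest, in total KL divergence, to those of the context rows.
    """
    refs = [rule_dist(r) for r in context_rows]
    def total_div(c):
        d = rule_dist(complete_row(c))
        return sum(kl(ref, d) for ref in refs)
    return min(range(len(choices)), key=lambda i: total_div(choices[i]))
```

Swapping `kl` for a Euclidean distance recovers the other metric mentioned above.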
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.">Learner Type 2</head><p>[Figure 3: Learner Type 2-the panels of an RPM (all context panels, and one or more choice panels) are processed by an encoder and a classifier head, which predicts the answer label.]</p><p>Unlike the approaches in Stages 1, 2, and 3, Type-1 learners avoid composing computational streams that explicitly rely on predefined objects and rules. But they still rely on ground-truth rule labels and the parallelism heuristic. Therefore, we introduce Learner Type 2, as shown in Fig. <ref type="figure">3</ref>, which is free of both dependencies. This type converts an RPM into a classification problem, where the class labels are the correctness of choice panels: if only one choice panel is included in the input, it is a binary classification problem; if multiple choice panels are included, it is a multi-class problem.</p><p>Readers might have noticed a small difference between Fig. <ref type="figure">2</ref> and Fig. <ref type="figure">3</ref>-the panel-wise encoder has been changed to an encoder (not necessarily panel-wise). As the name indicates, this encoder takes multiple panels as input and may encode relations among these panels into its output. Therefore, the encoder in Type-2 learners is responsible not only for perceptual processing but also for conceptual processing.</p><p>Conceptual processing in RPM generally involves reasoning about the relations among matrix panels; thus, if one wishes to explicitly separate these two types of processing, one would have a first module (like the panel-wise encoder in Type 1) that attends to each panel individually and a second module (like the rule distribution approximator in Type 1) that aggregates the outputs of the first module. This design choice is an important one for building any computational model for visual abstract reasoning. By changing the name to "encoder", we imply that Learner Type 2 does not necessarily require an explicit separation of perceptual and conceptual processing.</p></div>
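Structurally, a multi-class Type-2 learner can be sketched as below. The weights are random and untrained, the layer sizes are arbitrary assumptions, and a real model would use a CNN or ResNet encoder rather than a one-layer MLP; the point is only that a single holistic module sees all panels at once.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class Type2Learner:
    """A holistic encoder ingests all context and choice panels at once, so
    perceptual and conceptual processing are entangled in a single module;
    a classifier head then predicts the answer label."""
    def __init__(self, panel_dim, n_context=8, n_choices=8, hidden=32):
        in_dim = (n_context + n_choices) * panel_dim
        self.W_enc = rng.normal(0.0, 0.1, (hidden, in_dim))    # holistic encoder
        self.W_cls = rng.normal(0.0, 0.1, (n_choices, hidden)) # classifier head
    def forward(self, panels):
        x = np.concatenate([np.ravel(p) for p in panels])  # no per-panel separation
        h = np.tanh(self.W_enc @ x)
        return softmax(self.W_cls @ h)  # distribution over the choice panels
```

The binary variant would instead take the context panels plus a single choice panel and emit one correctness score.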
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Hoshen and Werman <ref type="bibr">[22]</ref> implemented this type as a CNN encoder with an MLP classifier, and tested it on simple figural series and RPM-like problems. This CNN+MLP model has been consistently used as a baseline in later works. Learner Type 2 can also be implemented in many other ways, such as the Wild-ResNet+MLP model <ref type="bibr">[23]</ref> and the ResNet+MLP model <ref type="bibr">[24]</ref>, representing the binary and multi-class versions of Learner Type 2, respectively. Hoshen and Werman <ref type="bibr">[22]</ref> also proposed the generative counterpart of the CNN+MLP model, replacing the MLP classifier with a deconvolutional network that generates the predicted answer panel (with no choice panel in the input in this case). We upgrade Type 2 to Type 2+ to include this generative case, as shown in Fig. <ref type="figure">4</ref>. Interestingly, this type can now be considered a prototype of Learner Type 4, which we will discuss later. But before Learner Type 4, we take a detour to see how we can reach the same destination differently.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3.">Learner Type 3</head><p>[Figure 5: Learner Type 3-the panels of an RPM (all context panels, and one or more choice panels) pass through a panel-wise encoder, combinatorial heuristics (group &amp; map), a group-wise and map-regulated relation encoder, a groups aggregator, and a classifier head, which predicts the answer label.]</p><p>By following the paradigm of image classification, Learner Type 2 eliminates the reliance on ground-truth rule labels and the parallelism heuristic. But an unnecessary cost is that it does not impose a separation of perceptual and conceptual processing, which is usually considered beneficial for visual abstract reasoning. This observation leads us to Learner Type 3, as shown in Fig. <ref type="figure">5</ref>. Note that, given the same input and output, one could certainly consider Learner Type 3 a special case of Learner Type 2, by regarding everything before the classifier head as a single module. But models based on this more detailed specification generally perform better than typical models of Learner Type 2. Thus, we separate it from Learner Type 2.</p><p>After the panel-wise encoder encodes each panel, the panel embeddings go through a combinatorial process, in which subsets of the panel embeddings are selected and fed into the next module, subset by subset. In Fig. <ref type="figure">5</ref>, we use two trapezoids of opposite orientations for the panel-wise encoder and this combinatorial process, to indicate that the amount of information is first compressed and then decompressed (i.e., the number of combinations exceeds the number of things combined). As the name "combinatorial heuristics" indicates, Learner Type 3 explicitly relies on some heuristics to form combinations, which include but are not limited to the aforementioned parallelism heuristic. Essentially, these heuristics inform the learner of which panel embeddings, together as a group, would manifest a rule. Each such group is individually processed by the same relation encoder to produce a relation embedding for that group. Finally, all relation embeddings are aggregated for classification.</p><p>A typical example of Learner Type 3 is the WReN model <ref type="bibr">[23]</ref>. 
WReN takes as input all context panels plus one choice panel (thus solving RPM as binary classification). Its panel-wise encoder is a small CNN (with each panel embedding tagged with a one-hot position vector indicating the panel's position in the matrix). For combinatorial heuristics, WReN considers all binary rules (i.e., relations between every two panels). Note that WReN does not use the parallelism heuristic, which is commonly used in other models; but the position-tagged panel embeddings make this less of a problem, because the relation encoder can trivially determine non-adjacent panels from the position tags and output a specific rule embedding for them. The groups aggregator in WReN is simply a summation. Following WReN, a series of models of Learner Type 3 have been created, using different panel-wise encoders, combinatorial heuristics, relation encoders, and groups aggregators. For example, LEN <ref type="bibr">[25]</ref> considers only ternary rules for combinatorial heuristics, i.e., it groups every three panels together for relation encoding, and applies gating variables to each group in the aggregator (unsurprisingly, the experimental results showed that all gating variables except those of rows and columns were zeroed); MXGNet <ref type="bibr">[26]</ref> also considers ternary rules, uses a CNN or R-CNN panel-wise encoder, relies on the parallelism heuristic for combinatorial heuristics (instead of gating variables), and employs a graph-learning-based relation encoder that models the three panels in a group as a graph and computes the graph embedding as the relation embedding.</p><p>Different from the previous Type-3 learners, multi-layer RN <ref type="bibr">[27,</ref><ref type="bibr">28,</ref><ref type="bibr">29]</ref> extends the relation encoding in WReN into a multi-layer format. 
That is, the relation embeddings of the groups are not aggregated into a single embedding for classification, but into multiple embeddings, which are further fed into another combinatorial module and relation encoder. Therefore, one could visualize multi-layer RN as a Type-3 learner that repeats the middle three modules as many times as needed, with the combinatorial heuristics at each layer defined according to task specifics. One would expect higher-order relations (if any) to be detected by this model.</p><p>The SRAN model <ref type="bibr">[30]</ref> adopts a more complicated encoding scheme with multiple encoders and multiple relation encoders: the panels of two rows/columns (6 in total) are encoded panel-wise, 3-panel-wise, and 6-panel-wise by three different encoders, and the resulting panel embeddings, 3-panel embeddings, and 6-panel embeddings are sequentially integrated by three relation encoders into a single rule embedding, representing the rule of these two rows/columns. The encoding scheme of SRAN, though complicated, does not deviate much from Learner Type 3. But, instead of using rule embeddings to solve RPM as a classification problem, SRAN directly uses similarity metrics on rule embeddings to select the answer, as in Learner Type 1, which is also a common practice (just a different way to present the same supervising signal). Thus, it gives us a more complete Learner Type 3+, as shown in Fig. <ref type="figure">6</ref>.</p><p>MRNet <ref type="bibr">[31]</ref> is another Type-3 learner using multiple panel encoders and multiple relation encoders, corresponding to three different resolutions defined by the outputs of different layers in a CNN panel-wise encoder. The computational streams for each resolution separately follow the parallelism heuristic and are aggregated at the end for classification.</p><p>Readers might have noticed the words "group" and "map" in the diagrams of Fig. <ref type="figure">5</ref> and <ref type="figure">6</ref>. 
By these words, we intend to call attention to a mechanism that permeates information processing systems for visual abstract reasoning: deciding which pieces of information should be grouped together, and thus aggregated later, and which pieces of information should be mapped<ref type="foot">foot_0</ref> to one another, and thus processed equally<ref type="foot">foot_1</ref>. These two kinds of decisions are interdependent; more precisely, they are better viewed as two aspects of the same process. Such decisions have to be made repeatedly, at every level of information processing. Unfortunately, there might not be a centralized or universal solution for this grouping-mapping mechanism. As one can see, these Type-3 learners all resort to specific heuristics, which might not be correct in a general sense of visual abstract reasoning.</p></div>
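The relational core shared by WReN-style Type-3 learners can be sketched as below. The weights are random and untrained, and the layer shapes are simplified assumptions (position tags are folded into the embeddings); this is a structural sketch, not WReN's actual architecture.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

class PairwiseRelationCore:
    """Combinatorial heuristic: all pairs of panel embeddings. Each pair goes
    through one shared relation encoder G; the relation embeddings are then
    sum-aggregated and scored by a head F, as in WReN's binary setting."""
    def __init__(self, emb_dim, rel_dim=16):
        self.G = rng.normal(0.0, 0.1, (rel_dim, 2 * emb_dim))  # shared relation encoder
        self.F = rng.normal(0.0, 0.1, rel_dim)                 # scoring head
    def score(self, panel_embs):
        """panel_embs: embeddings of the 8 context panels plus 1 choice panel."""
        rels = [np.tanh(self.G @ np.concatenate([a, b]))
                for a, b in combinations(panel_embs, 2)]  # 36 groups for 9 panels
        return float(self.F @ sum(rels))                  # groups aggregator: summation

def choose(core, context_embs, choice_embs):
    """Score each choice separately and return the index of the best one."""
    scores = [core.score(context_embs + [c]) for c in choice_embs]
    return int(np.argmax(scores))
```

Replacing the all-pairs combinatorics with row/column triples and the summation with gated aggregation would move the sketch toward LEN or MXGNet.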
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4.">Learner Type 4</head><p>[Figure 7: Learner Type 4-a panel-wise encoder followed by a single learnable reasoning module.]</p><p>When we look back at how learners evolved from Type 1 to Type 3, we can see a way to circumvent the difficulty of specifying a grouping-mapping mechanism while preserving the advantages of Learner Type 3, i.e., no reliance on ground-truth rule labels and the separation of perceptual and conceptual processing. Similar to the development from Type 1 to Type 2, we use a single learnable module to replace both the combinatorial heuristics and the relation encoder of Learner Type 3, without introducing any heuristic grouping-mapping. This gives us Learner Type 4, as shown in Fig. <ref type="figure">7</ref>. Since the reasoning module contains no heuristics about grouping-mapping, its output cannot be assumed to represent the relation among particular panels, and thus cannot be processed as in Type 3/3+; instead, supervising signals are applied directly to this output. Now the reader might want to look back at Learner Type 2+ to see why we said it is a prototype of Learner Type 4: Type 4 simply separates perceptual and conceptual processing by splitting the holistic encoder into a panel-wise encoder and a reasoning module. As in Learner Type 2+, the supervising signal is applied in two ways.</p></div>
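The structural contrast with Type 2 is small but is exactly the point: a minimal Type-4 sketch (random untrained weights, arbitrary sizes; our own toy construction, not any published model) first encodes each panel separately, then hands all embeddings to one heuristic-free reasoning module.

```python
import numpy as np

rng = np.random.default_rng(0)

class Type4Learner:
    """Perceptual and conceptual processing are separated: a shared panel-wise
    encoder E runs on each panel individually, then a single learnable
    reasoning module R sees all panel embeddings jointly, with no
    grouping-mapping heuristic imposed between the two."""
    def __init__(self, panel_dim, emb_dim=8, n_panels=9, out_dim=8):
        self.E = rng.normal(0.0, 0.1, (emb_dim, panel_dim))        # panel-wise encoder
        self.R = rng.normal(0.0, 0.1, (out_dim, n_panels * emb_dim))  # reasoning module
    def forward(self, panels):
        embs = [np.tanh(self.E @ np.ravel(p)) for p in panels]  # perception, panel by panel
        return np.tanh(self.R @ np.concatenate(embs))           # reasoning, all at once
```

Compared with the holistic Type-2 encoder, only the per-panel weight sharing in `E` changes; the reasoning kernels discussed next are candidate implementations of `R`.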
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Reasoning Kernel 0: CNN</head><p>The CNN module has been a basic tool for extracting features from raw input, and the extracted features are not only relevant for solving specific downstream tasks but also represent certain correlations in the input. Solving visual abstract reasoning tasks is likewise a matter of processing correlations among visual inputs. Therefore, CNN would seem the natural first choice for solving RPM. However, several early works <ref type="bibr">[23,</ref><ref type="bibr">24,</ref><ref type="bibr">32]</ref> argued that CNN is incapable of solving RPM and thus proposed new models <ref type="foot">3</ref>. Since then, research has gone in other directions, but without achieving generally good performance across multiple datasets (such as PGM and RAVEN). Ironically, Spratley et al. <ref type="bibr">[33]</ref> proposed two Type-4 learners-Rel-Base and Rel-AIR-which are both CNN-based models and perform well on both PGM and RAVEN. Comparing these two models with the previous CNN models, we found that the difference lies in whether conceptual and perceptual processing are separated. Taking Rel-Base as an example, its panel-wise encoder is a CNN module and its reasoning module is another CNN module; all the panel embeddings are first stacked together and then convolved with the convolution kernels in the reasoning module. The baseline CNN-based models do not have this separation. Therefore, we conjecture that the outstanding performance of many non-CNN models comes not from avoiding CNN for reasoning, but from separating perceptual and conceptual processing.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Reasoning Kernel 1: LSTM</head><p>A typical Type-4 learner is the CNN+LSTM+MLP model <ref type="bibr">[23]</ref>. This model takes as input all context panels and one or more choice panels. Each panel embedding is sequentially processed by an LSTM reasoning module, and the final state of the LSTM is fed into an MLP classifier to predict the answer panel. This model is also used as a common baseline in many later works. LSTM has also been combined with other modules: Double-LSTM <ref type="bibr">[34]</ref> uses two LSTM modules, which specialize in different rule types and are coordinated by an extra module trained to predict the rule type <ref type="foot">4</ref>; ESBN and NTM <ref type="bibr">[35,</ref><ref type="bibr">36]</ref>, which combine LSTM with external memory modules, can also be used as the reasoning kernel in Learner Type 4.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Reasoning Kernel 2: Self-Attention</head><p>Another commonly used reasoning kernel is the self-attention module, which is composed of a multi-head attention module and a feed-forward network (with residual connections and normalization). The most typical example of this reasoning kernel is the ARNe model <ref type="bibr">[37]</ref>. It extends the Type-3 learner WReN by inserting a self-attention module between the panel-wise encoder and the combinatorial heuristics module. Note that, although ARNe inherits the combinatorial heuristic of WReN, it is no longer a Type-3 learner, because the self-attended embeddings no longer represent individual panels. Instead, each self-attended embedding contains information about all the matrix panels, and is better considered a summary of the whole matrix from a particular angle. Therefore, the inherited combinatorial heuristics module and the following modules of WReN can be considered similar to other general classifier heads, which are semantically agnostic of their input, simply aggregating the input and predicting the answer. In hindsight, a more reasonable order would have been to first test the self-attention module with a simple classifier head rather than with WReN.</p><p>A similar example is the HTR model <ref type="bibr">[38]</ref>, in which an R-CNN panel-wise encoder extracts all entities in each panel, and two self-attention-based sub-modules move the reasoning from entity level to panel level and from panel level to matrix level. The first sub-module takes as input the entity embeddings in a single panel and sums the self-attended entity embeddings into the panel embedding. Unlike ARNe and WReN, which solve RPM as binary classification, HTR solves it as multi-class classification. Therefore, the output of the second sub-module contains 8 embeddings corresponding to the 8 choice panels. 
These 8 embeddings are fed into a contrastive classifier head <ref type="bibr">[16]</ref> (see the appendix) to predict the answer label.</p></div>
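For reference, a single attention head (the core of the multi-head module described above) can be written in a few lines. This is the standard scaled dot-product formulation, not the specific ARNe or HTR implementation, and it omits the residual connections, normalization, and feed-forward network.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over panel embeddings X (n, d).
    Every output row mixes information from all n panels, which is why
    self-attended embeddings no longer represent individual panels."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (n, n) attention weights
    return A @ V
```

Stacking several such heads (with learned projections) and adding the feed-forward sublayer yields the self-attention module used as a reasoning kernel.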
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Reasoning Kernel 3: Multi-Head Relation Detector</head><p>The last reasoning kernel is closely related to the relation encoder of Type 3. Recall that, in Type 3, the combinatorial heuristics module arranges the panel embeddings into multiple groups, and every group is processed by a single shared relation encoder to obtain a rule embedding for that group. Although this relation encoder has one input and one output, it is responsible for recognizing and encoding whichever rule the input group exhibits. Recall also that, in moving from Type 3 to Type 4, we intended to eliminate the reliance on combinatorial heuristics. A natural solution is an "all-in-all-out" relation encoder, which takes as input all the panel embeddings of a matrix (no grouping) and outputs all the possible rules. This is analogous to the step from image classification to object detection. In particular, the new relation encoder can have multiple output heads, and supervising pressure can later be applied to force each head to represent a specific rule. Therefore, we refer to this reasoning kernel as a multi-head relation detector. This kernel is underrepresented, and we found only one example using it-the SCL model <ref type="bibr">[39]</ref>.</p><p>These four learner types address the structural aspect of the learning approach. Another equally important issue is how these models are trained. Besides standard supervised learning on the correct answers of RPM, many other techniques have been attempted to take advantage of the enriched design of RPM. Due to the page limit, we refer our readers to Section B in the appendix for more details.</p></div>
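Structurally, such a detector might look as follows. The weights are random and untrained, and the number of candidate rules and the sigmoid heads are our own illustrative assumptions, not SCL's actual design.

```python
import numpy as np

rng = np.random.default_rng(0)

class MultiHeadRelationDetector:
    """'All-in-all-out': one module ingests the embeddings of all matrix
    panels (no grouping) and emits one score per candidate rule through a
    separate head, which supervision can later tie to a specific rule."""
    def __init__(self, in_dim, n_rules=5, hidden=16):
        self.W = rng.normal(0.0, 0.1, (hidden, in_dim))
        self.heads = rng.normal(0.0, 0.1, (n_rules, hidden))  # one row per rule head
    def detect(self, panel_embs):
        h = np.tanh(self.W @ np.concatenate(panel_embs))
        return 1.0 / (1.0 + np.exp(-self.heads @ h))  # per-rule probabilities
```

The object-detection analogy is visible here: instead of classifying one pre-grouped input, the module proposes scores for every candidate rule over the whole matrix at once.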
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Concluding Remarks</head><p>In this paper, we have replayed the technical development of computational solutions to RPM. The food for thought we would like to share with our readers is this: comparing the models across different stages, we found that technical development, on the one hand, continually explores new methods to solve RPM but, on the other hand, inevitably revisits old ideas again and again. Therefore, the most recent models are not necessarily superior in nature to the traditional ones; for example, the imagery-based approach might trigger the next cycle of technical development in future research.</p><p>the Psychological Test for College Freshmen of the American Council on Education. Each GAP is explicitly organized as a proportional analogy A:B::C:D, i.e., A is to B as C is to D, which resembles a 2&#215;2 RPM as in Fig. <ref type="figure">1a</ref>. A subtle difference is that an RPM can have algebraic relations in both the horizontal and vertical directions, while a GAP, if organized into a matrix, has algebraic relations in one direction, and the relations in the other direction are determined analogically (i.e., they form a meaningful analogy). Another type of RPM-like problem, figural series <ref type="bibr">[22,</ref><ref type="bibr">34]</ref>, features the inductive nature of RPM items and follows the format of the number series problems in many ability tests.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Stage 4.5: Training</head><p>We addressed the structural aspect of the learning approach in Stage 4. In this section, we discuss how these learning models are trained. All the learners in Stage 4 can be trained on the supervising signal of the correct answer of an RPM item, as in most supervised learning tasks. Unlike most such tasks, however, the special design of RPM provides additional information that can be used in training. We will focus on how the learning models are trained differently given this enrichment of useful information.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.1. Auxiliary Training</head><p>For the learners of Types 2, 3, and 4, an extra classifier head can be attached at exactly the point where the existing classifier head is attached, to predict the meta-target of the input RPM: a multi-hot vector indicating the rules and objects in that RPM. These meta-targets are available in automatically generated datasets, such as PGM and RAVEN. The learner can therefore be trained on the answer labels and meta-targets simultaneously; training on meta-targets is referred to as auxiliary training.</p><p>Intuitively, this extra supervising signal can boost the accuracy of the answer-label classifier head. Auxiliary training was first tried by the WReN model on the PGM dataset and indeed showed an approximately 10% boost (in the IID generalization regime). The contribution of auxiliary training was verified by a high correlation between the two classifier heads <ref type="bibr">[23]</ref>. Similar observations on PGM were made in other studies <ref type="bibr">[47,</ref><ref type="bibr">37]</ref>. For example, the ARNe model would not even converge without auxiliary training.</p><p>However, the effect of auxiliary training is inconclusive. For example, Benny et al. <ref type="bibr">[31]</ref> showed that auxiliary training on PGM increased the accuracy only on 1-rule problems while decreasing the accuracy on multi-rule problems (and thus degrading the overall accuracy). Besides depending on rules, the effect also differs between datasets. It has been reported that auxiliary training generally decreases performance on the RAVEN dataset <ref type="bibr">[24,</ref><ref type="bibr">32,</ref><ref type="bibr">25,</ref><ref type="bibr">26]</ref>, with one exception <ref type="bibr">[48]</ref>, which used a special contrastive loss and will be discussed later. 
However, Ma&#322;ki&#324;ski and Ma&#324;dziuk <ref type="bibr">[49]</ref> showed contradictory results: when the meta-target is encoded in a sparse manner (the above experiments all use dense encoding), auxiliary training can increase performance on RAVEN. Therefore, we can only say that the effect of auxiliary training is jointly determined by the model, loss function, dataset, and meta-target encoding.</p><p>checks whether the third row/column has a meaningful variation that is similar to any context row/column in the dataset. The PRD model <ref type="bibr">[53]</ref> enhanced this type of single-row/column contrasting by including the parallelism heuristic. As in standard contrastive learning, positive and negative pairs are constructed from rows/columns, where the first two rows/columns in an RPM matrix make a positive pair. Negative pairs can be constructed in multiple ways, such as taking rows/columns from different RPMs, randomly shuffling rows/columns of the same RPM, or filling the third row/column with a random non-choice panel. In PRD, a Type-2 learner is used to learn the difference (or a similarity metric) between two paired rows/columns. To solve an RPM, the choice row/column that is most similar to the first two rows/columns is selected. Compared to single-row/column contrasting, double-row/column contrasting is more common and can be found in many other works. For example, the aforementioned generative-MRNet <ref type="bibr">[47]</ref> contrasts the choice rows/columns formed by the generated answer panel with the choice rows/columns formed by the given choice panels.</p><p>The rationale for moving from single-row/column to double-row/column contrasting was also exemplified by the LABC training/testing regime <ref type="bibr">[54]</ref>, which makes the contrast more accurate and complete through the meta-targets used in auxiliary training. 
Different from single-row/column and double-row/column contrasting, where the effect of contrasting is applied through extra contrastive loss functions, LABC, as a training/testing regime, requires models to learn from adapted datasets, which forces them to contrast the rows/columns. In particular, an RPM is adapted by muting some digits of its meta-target and regenerating the incorrect choice panels (based on the given context panels). Since meta-targets use multi-hot vectors to indicate the rules and geometric objects used to generate RPM items, the newly generated choice panels are partially correct. This way, the model has to compare such choice rows/columns to the context rows/columns to find the correct answer, instead of only seeking meaningful variations in the choice row/column as in single-row/column contrasting. LABC makes this idea more systematic by introducing the concepts of semantically and perceptually plausible choices, corresponding to muting different subsets of meta-target digits and using distracting objects and rules.</p></div>
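The joint objective used in auxiliary training can be sketched as follows: softmax cross-entropy over the answer choices plus binary cross-entropy over the multi-hot meta-target bits. This is an illustrative NumPy sketch; the weight symbol `beta` and its value here are our own choices, not taken from any specific paper.

```python
import numpy as np

def auxiliary_training_loss(answer_logits, answer_label,
                            meta_logits, meta_target, beta=10.0):
    """Joint objective sketch: cross-entropy over the answer choices plus
    binary cross-entropy over the multi-hot meta-target (one bit per
    rule/object attribute), weighted by `beta` (illustrative value)."""
    # answer head: softmax cross-entropy over the 8 choices
    z = answer_logits - answer_logits.max()
    ce = -(z[answer_label] - np.log(np.exp(z).sum()))
    # auxiliary head: element-wise binary cross-entropy on meta-target bits
    p = 1 / (1 + np.exp(-meta_logits))
    bce = -(meta_target * np.log(p) + (1 - meta_target) * np.log(1 - p)).mean()
    return ce + beta * bce

# a confident, correct prediction on both heads yields a small total loss
loss = auxiliary_training_loss(
    answer_logits=np.array([5.0, 0, 0, 0, 0, 0, 0, 0]), answer_label=0,
    meta_logits=np.array([5.0, -5.0, -5.0]), meta_target=np.array([1.0, 0, 0]))
print(loss < 0.2)  # True
```

Both terms are differentiable in the logits, so the two heads can be optimized simultaneously with a single backward pass; sparse versus dense meta-target encodings change only how `meta_target` is laid out.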
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Intra-Item Contrasting: Matrix Contrasting</head><p>Instead of contrasting rows/columns, we can also contrast the matrices completed by each choice panel. This is essentially contrasting the choices relative to the context panels. A Type-2 learner, CoPINet <ref type="bibr">[32]</ref>, was the first model to perform such contrasting. The contrasting in CoPINet is two-fold: contrastive representation and contrastive loss. First, the embeddings of the matrices completed by each choice panel are aggregated into a "central" embedding, and their differences from the "central" embedding are used in the following processing. Second, given the interweaving of these matrix embeddings, a contrastive loss function naturally follows that incorporates both correctly and incorrectly completed matrices and increases the gap between their predicted values. This contrasting can easily be embedded into models with parallel computation streams, for example, the aforementioned HTR model <ref type="bibr">[38]</ref>.</p><p>We need to point out that row/column contrasting and matrix contrasting are not exclusive. For example, the DCNet model <ref type="bibr">[55]</ref> first uses row/column contrasting to compute the matrix embeddings and then uses matrix contrasting to predict the answer.</p></div>
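The two-fold contrasting described above can be sketched in NumPy. This is an illustrative simplification, not CoPINet's exact formulation: the "central" embedding is taken as a plain mean, and a softmax cross-entropy over candidate scores stands in for the contrastive loss.

```python
import numpy as np

def matrix_contrast(choice_matrix_embs):
    """Contrastive representation: aggregate the embeddings of the 8
    completed matrices into a 'central' embedding and keep each embedding's
    difference from that centre (a plain mean is used for illustration)."""
    center = choice_matrix_embs.mean(axis=0, keepdims=True)
    return choice_matrix_embs - center

def contrastive_loss(scores, correct_idx):
    """Softmax cross-entropy over the candidate scores: raising the correct
    matrix's score while lowering the others' widens the gap."""
    z = scores - scores.max()
    return -(z[correct_idx] - np.log(np.exp(z).sum()))

embs = np.random.default_rng(0).standard_normal((8, 16))
contrasted = matrix_contrast(embs)  # differences around the centre sum to zero
# widening the gap between correct and incorrect scores lowers the loss
gap_wide = contrastive_loss(np.array([3.0, 0, 0, 0, 0, 0, 0, 0]), 0)
gap_none = contrastive_loss(np.array([0.0, 0, 0, 0, 0, 0, 0, 0]), 0)
print(gap_wide < gap_none)  # True
```

Because each embedding is expressed relative to the shared centre, the incorrect choices directly shape the representation of the correct one, which is what makes the subsequent loss genuinely contrastive rather than eight independent scores.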
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Inter-Item Contrasting: Single-Label Contrasting</head><p>The above contrasting is restricted to a single RPM item. We now describe inter-item contrasting. ACL and Meta-ACL <ref type="bibr">[48]</ref> are the first two inter-item contrasting models. The relation between ACL and Meta-ACL is similar to that between single-row/column and double-row/column contrasting. Given an RPM, let &#119883; be its incomplete context matrix (regarding the missing panel as an empty image), &#119883;&#119894; an incomplete matrix obtained by replacing the &#119894;-th panel with a white-noise image, and &#119883;&#8242; an incomplete matrix obtained by randomly reordering the panels of &#119883;. The ACL model contrasts the positive pair (&#119883;, &#119883;&#119894;) with the negative pair (&#119883;, &#119883;&#8242;). Meta-ACL resorts to meta-targets to compose positive and negative pairs. In particular, the incomplete matrices of two distinct items with the same meta-target form a positive pair (&#119883;&#119878;, &#119883;&#119879;), and the corresponding negative pair is (&#119883;&#119878;, &#119883;&#8242;&#119878;). In both ACL and Meta-ACL, the contrasting effect is applied through an extra standard contrastive loss function.</p><p>The MLCL model <ref type="bibr">[49]</ref> formalizes the idea of Meta-ACL in a multi-label setting by regarding multi-hot meta-targets as multi-labels. Therefore, instead of requiring positive pairs to have exactly the same meta-targets, MLCL regards pairs with intersecting meta-targets as positive pairs (to some degree). Unlike Meta-ACL, MLCL uses the completed matrices. In particular, the correctly completed matrices are used for inter-item contrasting, and the intra-item contrasting between the correctly completed matrix and its corresponding incorrectly completed matrices is performed as in CoPINet. 
These two types of contrasting losses are jointly optimized.</p></div>
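ACL's construction of positive and negative partners can be sketched as follows. This NumPy sketch follows the notation above (X, X_i, X'); the function name and toy panel sizes are our own illustrative choices, not from the paper.

```python
import numpy as np

def acl_pairs(context, rng=np.random.default_rng(0)):
    """Sketch of ACL's pair construction. `context` holds the 8 context
    panels plus a zero image for the missing panel, shape (9, H, W).
    Positive partner X_i: one random panel replaced by white noise.
    Negative partner X_prime: the panels randomly reordered."""
    X = context
    X_i = X.copy()
    X_i[rng.integers(0, len(X))] = rng.standard_normal(X[0].shape)
    X_prime = X[rng.permutation(len(X))]
    return (X, X_i), (X, X_prime)

# toy 4x4 'panels': constant images 0..7 plus an empty 9th panel
ctx = np.zeros((9, 4, 4))
ctx[:8] = np.arange(8).reshape(8, 1, 1)
(pos_a, pos_b), (neg_a, neg_b) = acl_pairs(ctx)
# exactly one panel differs within the positive pair
print(int((pos_a != pos_b).any(axis=(1, 2)).sum()))  # 1
```

The intuition encoded here is that corrupting one panel preserves the item's overall rule structure (so the pair should embed similarly), whereas reordering the panels destroys the row/column relations (so that pair should embed differently).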
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Other Dimensions of Manipulating Data</head><p>Besides contrasting, there are other dimensions of manipulating data. For example, the FRAR model <ref type="bibr">[25]</ref> utilizes a reinforcement-learning teacher model to select items from an RPM item bank to train a student model, where the items in the bank are characterized by their meta-targets and the reward is the increase in accuracy of the student model. Models solving RPM have also been examined in the setting of continual learning. For example, RAVEN can be divided into 7 batches according to its 7 spatial configurations, and models are trained with different methods to mitigate forgetting when sequentially learning the 7 batches in different orders <ref type="bibr">[56]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. A Summary of Computational Models/Methods</head></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>or aligned; we use "map" to resonate with the terminology of structure-mapping theory and analogical reasoning; i.e., if two entities in the base and target domains are mapped, then they are analogous to each other.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>i.e., forcing the analogical relation between them.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"><p>This is also why the CNN+MLP and ResNet+MLP models of Learner Type 2 have constantly been used as baselines: they show no outstanding performance.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3"><p>This reliance on ground-truth rule labels slightly deviates from our definition of Learner Type 4.</p></note>
		</body>
		</text>
</TEI>
