<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Contrasting with Symile: Simple Model-Agnostic Representation Learning for Unlimited Modalities</title></titleStmt>
			<publicationStmt>
				<publisher>Neural Information Processing Systems</publisher>
				<date>12/10/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10608877</idno>
					<idno type="doi"></idno>
					
					<author>Adriel Saporta</author><author>Aahlad Puli</author><author>Mark Goldstein</author><author>Rajesh Ranganath</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Contrastive learning methods, such as CLIP, leverage naturally paired data (for example, images and their corresponding text captions) to learn general representations that transfer efficiently to downstream tasks. While such approaches are generally applied to two modalities, domains such as robotics, healthcare, and video need to support many types of data at once. We show that the pairwise application of CLIP fails to capture joint information between modalities, thereby limiting the quality of the learned representations. To address this issue, we present Symile, a simple contrastive learning approach that captures higher-order information between any number of modalities. Symile provides a flexible, architecture-agnostic objective for learning modality-specific representations. To develop Symile's objective, we derive a lower bound on total correlation, and show that Symile representations for any set of modalities form a sufficient statistic for predicting the remaining modalities. Symile outperforms pairwise CLIP, even with modalities missing in the data, on cross-modal classification and retrieval across several experiments, including on an original multilingual dataset of 33M image, text, and audio samples and a clinical dataset of chest X-rays, electrocardiograms, and laboratory measurements. All datasets and code used in this work are publicly available at https://github.com/rajesh-lab/symile.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Contrastive learning leverages naturally paired data to learn general representations that transfer efficiently to downstream tasks <ref type="bibr">[3,</ref><ref type="bibr">35,</ref><ref type="bibr">53]</ref>. A common contrastive approach is to maximize the mutual information between the paired modalities, ensuring that the learned representations retain sensitivity to all correlations between them. While SimCLR <ref type="bibr">[12]</ref> popularized the use of the mutual information estimator InfoNCE <ref type="bibr">[38]</ref> for data augmentations, CLIP <ref type="bibr">[40]</ref> applied the approach to distinct modalities-for example, images and their corresponding text captions-where representations are learned using any encoder for each modality.</p><p>While contrastive approaches are generally applied to two modalities, there is a rapidly expanding range of domains that require the integration of many types of data at once. For example, in robotics, agents combine information from visual, proprioceptive, and tactile sensors <ref type="bibr">[18,</ref><ref type="bibr">28]</ref>; healthcare providers analyze various types of patient data including imaging, biosignals, and genomics <ref type="bibr">[10,</ref><ref type="bibr">29]</ref>; and video encompasses RGB frames, audio waveforms, and text transcripts <ref type="bibr">[55]</ref>. One strategy for handling multimodal data has been to design specialized architectures capable of processing all data types at once, which limits their general applicability and increases operational complexity <ref type="bibr">[2,</ref><ref type="bibr">47]</ref>. 
Another common approach is to apply two-modality contrastive objectives, such as CLIP, to pairs of available modalities <ref type="bibr">[15,</ref><ref type="bibr">44]</ref>.</p><p>In this paper, we show that, despite its popularity, the pairwise application of CLIP fails to capture higher-order conditional information between modalities, thereby limiting the quality of the representations it learns. For instance, given three modalities a, b, and c, pairwise CLIP captures dependencies between a and b, b and c, and a and c, yet cannot capture any conditional dependencies, such as between a and b given c. We show in Section 2.2 that even in a simple one-dimensional controlled setting where the target b is perfectly predictable from a and c, CLIP performs no better than random chance. Effective contrastive learning for more than two modalities requires a model-agnostic approach capable of learning modality-specific representations-like CLIP-yet also captures higher-order information between any number of modalities-unlike CLIP.</p><p>Methodological contributions. This paper presents Symile, a simple contrastive learning approach that captures higher-order information between any number of modalities. Symile provides a flexible, architecture-agnostic objective for learning modality-specific representations. To develop Symile's objective, we derive a total correlation estimator, employing a generalization of inner products to more than two vectors that allows for the simultaneous contrasting of all modalities and enables zero-shot applications such as classification and retrieval. We then show that the representations produced by Symile for any set of modalities form a sufficient statistic for predicting the remaining modalities not considered in the set. 
Because it targets total correlation, Symile captures strictly more information than CLIP, guaranteeing performance that matches or surpasses CLIP, except in cases where it is known that only pairwise statistics are relevant. Given that such prior knowledge is rarely available, Symile should be favored over CLIP.</p><p>Empirical contributions. We demonstrate that Symile outperforms pairwise CLIP on cross-modal classification and retrieval across several experiments, including on a multilingual dataset of over 33M image, text, and audio examples and a clinical dataset of chest X-rays, electrocardiograms, and laboratory measurements. We show that Symile retains its advantage over pairwise CLIP even with modalities missing in the data. We publicly release both the multilingual and the clinical datasets, which are specifically designed to test a model's ability to capture higher-order information between three distinct high-dimensional data types.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Background and motivation</head><p>In this section, we first provide background on the original CLIP objective for two modalities, and describe how it has been extended to additional modalities. We then present a simple problem setup for three modalities that illustrates where pairwise contrastive objectives fall short.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Pairwise contrastive learning</head><p>Given a batch of (x, y) pairs, separately encoded by f_θx and f_θy, respectively, contrastive objectives such as CLIP maximize the similarity between representations of correctly paired (positive) samples and minimize the similarity between representations of incorrectly paired (negative) samples.</p><p>As is now standard in contrastive learning, in order to construct a batch of data, each modality is treated as the anchor in turn and used to construct a set of positive and negative samples. Letting τ ∈ R+ be a temperature parameter, the CLIP objective when x is the anchor modality is the categorical cross-entropy of correctly classifying the positive pair out of N possible pairs: ℓ^(x→y)(θ, τ) = −(1/N) Σ_{i=1}^N log[exp(f_θx(x_i) · f_θy(y_i) / τ) / Σ_{j=1}^N exp(f_θx(x_i) · f_θy(y_j) / τ)]. (1)</p><p>The final CLIP objective is an average of the losses in each direction: L^(x,y)_CLIP(θ, τ) = (1/2)[ℓ^(x→y)(θ, τ) + ℓ^(y→x)(θ, τ)]. The dot product in Equation (<ref type="formula">1</ref>) serves as a scoring function that is trained to assign high values to positive pairs, which are sampled from the joint distribution p_{x,y}, and low values to negative pairs, which are sampled from the product of marginals p_x p_y.</p><p>Contrastive methods are typically designed to maximize the mutual information between x and y, which is defined as the Kullback-Leibler divergence from the joint distribution to the product of the marginal distributions: I(x; y) = D_KL(p(x, y) ∥ p(x) p(y)). It has been shown that Equation (1) maximizes a lower bound on the mutual information between x and y <ref type="bibr">[38,</ref><ref type="bibr">39]</ref>. This information maximization ensures that the learned representations preserve all correlations between the modalities, which is essential for downstream tasks.</p><p>Incorporating additional modalities.
In order to learn a joint embedding space for more than two modalities, existing work has applied the CLIP objective in a pairwise fashion <ref type="bibr">[1,</ref><ref type="bibr">2,</ref><ref type="bibr">9,</ref><ref type="bibr">11,</ref><ref type="bibr">14,</ref><ref type="bibr">21,</ref><ref type="bibr">33,</ref><ref type="bibr">34,</ref><ref type="bibr">43,</ref><ref type="bibr">44,</ref><ref type="bibr">47,</ref><ref type="bibr">52]</ref>. For example, Guzhov et al. <ref type="bibr">[19]</ref> extend CLIP to incorporate audio alongside image and text, and ImageBind <ref type="bibr">[15]</ref> uses CLIP to align image embeddings with embeddings from five other modalities. In the simplest case, for three modalities, the pairwise CLIP loss corresponds to L^(x,y)_CLIP(θ, τ) + L^(x,z)_CLIP(θ, τ) + L^(y,z)_CLIP(θ, τ). CLIP can either be fine-tuned for downstream tasks or operate as a zero-shot classifier by computing the similarities between the query embedding from one modality and each candidate embedding from the other modality. In the case of more than two modalities, this generalizes to a sum across the pairwise similarities. The resulting similarity scores are used to rank the candidates, and the candidate with the highest similarity to the query is chosen <ref type="bibr">[40]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">A simple one-dimensional problem for three binary modalities</head><p>While contrastive objectives were originally designed for two modalities, the naive pairwise extension of CLIP to additional modalities warrants a deeper analysis. To explore this further, we propose a simple problem setup with the following data generating process: a ∼ Bernoulli(0.5), b ∼ Bernoulli(0.5), c = a XOR b.</p><p>Using the pairwise CLIP objective, we fit three affine linear models to perform the zero-shot classification task of predicting whether b is 0 or 1 given a, c.
See Appendix I for additional details.</p><p>Even in this simple one-dimensional controlled setting where the target b is perfectly predictable from a and c, CLIP performs no better than random chance, with an accuracy of 0.5.</p><p>CLIP failure analysis. It can be shown that even though the variables a, b, c are jointly dependent (since c is a deterministic function of a and b), they are pairwise independent (Appendix A): a ⊥ b, a ⊥ c, and b ⊥ c.</p><p>This explains CLIP's poor performance for the above XOR experiment: the objective maximizes a lower bound on the mutual information between pairwise terms, and therefore was not designed to capture higher-order dependencies such as the dependence between a and b given c.<ref type="foot">2</ref> Capturing conditional dependencies like this will require the formulation of a new contrastive learning objective.</p></div>
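The pairwise-independence claim above can be checked by exact enumeration. A minimal sketch (our illustration, not the paper's released code) that computes the pairwise and conditional mutual informations of the XOR process in bits:

```python
import itertools
import math

# XOR data-generating process: a, b ~ Bernoulli(0.5), c = a XOR b.
# The four (a, b) outcomes are equiprobable, so each triple has mass 0.25.
p = {(a, b, a ^ b): 0.25 for a, b in itertools.product([0, 1], repeat=2)}

def prob(**fixed):
    """Probability that the triple (a, b, c) matches the fixed coordinates."""
    idx = {"a": 0, "b": 1, "c": 2}
    return sum(pr for o, pr in p.items()
               if all(o[idx[k]] == v for k, v in fixed.items()))

def mi(u, v):
    """I(u; v) in bits, computed by enumeration over binary values."""
    total = 0.0
    for x in (0, 1):
        for y in (0, 1):
            pxy = prob(**{u: x, v: y})
            if pxy > 0:
                total += pxy * math.log2(pxy / (prob(**{u: x}) * prob(**{v: y})))
    return total

def cmi(u, v, w):
    """I(u; v | w) in bits, computed by enumeration."""
    total = 0.0
    for x, y, z in itertools.product((0, 1), repeat=3):
        pxyz = prob(**{u: x, v: y, w: z})
        if pxyz > 0:
            total += pxyz * math.log2(
                prob(**{w: z}) * pxyz / (prob(**{u: x, w: z}) * prob(**{v: y, w: z})))
    return total

# Every pair is independent, yet a and b share one full bit given c:
print(mi("a", "b"), mi("a", "c"), mi("b", "c"))  # 0.0 0.0 0.0
print(cmi("a", "b", "c"))                        # 1.0
```

Since every pairwise term that CLIP's objective bounds is exactly zero, any pairwise contrastive learner is blind to the one bit of structure that makes b predictable from (a, c).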
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Learning Symile representations</head><p>Instead of targeting the mutual information between pairs of modalities, we target the total correlation between any number of modalities, learning what we call Symile<ref type="foot">foot_1</ref> representations.</p><p>Total correlation <ref type="bibr">[50]</ref>, the higher-order generalization of mutual information, is defined as the Kullback-Leibler divergence from the joint distribution to the product of the marginal distributions: TC(x, y, z) = D_KL(p(x, y, z) ∥ p(x) p(y) p(z)).</p><p>In words, total correlation is a symmetric statistical measure that captures the amount of information shared in a set of random variables. A higher total correlation implies more dependency among the variables, and a total correlation of zero indicates that the variables are independent.</p><p>Total correlation can be decomposed into a summation of mutual information terms. For example, in the case of three random variables, TC(x, y, z) = I(x; y) + I(x; z) + I(y; z | x). (2)</p><p>While, as discussed, contrastive learning was designed to capture the shared information between modalities, Equation (2) indicates that when there are more than two modalities, the scope of what to capture should extend beyond pairwise information to include conditional interactions (Figure <ref type="figure">1</ref>).</p><p>Because it targets total correlation, Symile captures strictly more information than CLIP, guaranteeing performance that matches or surpasses CLIP, except in cases where only pairwise statistics are relevant, with no higher-order interactions whatsoever. In such cases, Symile may be less sample efficient, as it tracks both pairwise and higher-order information. Unless there is prior knowledge that the downstream task relies solely on pairwise statistics, Symile should be chosen over CLIP. To illustrate when such higher-order information might be relevant, consider again the XOR experiment outlined in Section 2.2.
Because all the pairwise information terms between a, b, and c are zero, the conditional mutual information terms constitute the only dependence between the variables to track.</p><p>The XOR experiment represents an extreme case where the CLIP target is zero, but most real-world applications will exhibit a combination of both pairwise and higher-order information. For example, in order to diagnose acute pancreatitis, one might consider a patient's clinical history of abdominal pain, elevated levels of digestive enzymes, and imaging results consistent with inflammation. While each of these modalities would provide useful information about the likelihood of pancreatitis (i.e., pairwise information between the modality and the diagnosis is non-zero), none of them alone would be diagnostic of the condition. Similarly, in the case of Parkinson's disease, clinical evaluation provides valuable information, along with imaging and blood tests to rule out other conditions, but clinicians rely on the integration of all modalities.</p></div>
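The XOR case makes the target concrete: its total correlation is exactly one bit, carried entirely by the conditional term. A small sketch (our illustration, not the paper's code) computing TC(a, b, c) = H(a) + H(b) + H(c) − H(a, b, c) by enumeration:

```python
import itertools
import math

# XOR process: a, b ~ Bernoulli(0.5), c = a XOR b; four equiprobable triples.
triples = [(a, b, a ^ b) for a, b in itertools.product([0, 1], repeat=2)]

def entropy(values):
    """Shannon entropy (bits) of the empirical distribution over `values`."""
    n = len(values)
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# TC(a, b, c) = H(a) + H(b) + H(c) - H(a, b, c)
tc = sum(entropy([t[i] for t in triples]) for i in range(3)) - entropy(triples)
print(tc)  # 1.0: the pairwise (CLIP) target is 0 bits, but TC is 1 bit
```

Each marginal carries one bit and the joint carries two, so TC = 3 − 2 = 1 bit, all of it higher-order.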
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Deriving a multi-sample lower bound on total correlation</head><p>In order to eventually derive a contrastive objective by maximizing total correlation, we first establish a multi-sample lower bound on total correlation. This lower bound and, in the next section, the Symile objective are illustrated using three modalities for simplicity, but both can be extended to an arbitrary number of modalities, as shown in Appendix B.</p><p>Given a batch of N (x, y, z) triples, let i ∼ Uniform({1, . . . , N}) denote the index of the positive triple in the batch. Our goal is to estimate TC(x, y, z) given one positive triple sampled from the joint distribution, and N − 1 negative triples sampled from the product of marginals: (x, y_i, z_i) ∼ p_{x,y,z}(x, y_i, z_i), (x, y_{j≠i}, z_{j≠i}) ∼ p_x(x) p_y(y_{j≠i}) p_z(z_{j≠i}). (3)</p><p>Letting Y_N = {y_n}_{n=1}^N and Z_N = {z_n}_{n=1}^N be the sets of all samples of y and z, respectively, this sampling procedure describes the following distribution: p(x, Y_N, Z_N | i) = p_{x,y,z}(x, y_i, z_i) ∏_{j≠i} p_y(y_j) p_z(z_j). (4)</p><p>We derive the following lower bound in Appendix B:</p><p>Theorem 3.1 (Total Correlation Lower Bound). Given the distributions in Equations (3) and (4), for any value i of i and any scoring function g, a multi-sample contrastive lower bound on total correlation is TC(x, y, z) ≥ log N + E[log(exp(g(x, y_i, z_i)) / Σ_{j=1}^N exp(g(x, y_j, z_j)))]. (5)</p><p>As described in Section 2.1, in contrastive learning each modality is sequentially treated as the anchor, with a batch of corresponding positive and negative samples generated for each. Theorem 3.1 treats x as the anchor modality, but by symmetry holds when y or z acts as the anchor modality. Notice that the term inside the expectation in Equation (<ref type="formula">5</ref>) is the categorical log likelihood of correctly identifying the index of the positive triple in the batch, where the scoring function (or critic) g is trained to assign a high value to positive samples and a low value to negative samples.
In Appendix E, we show that the optimal scoring function g* is equal to the instantaneous total correlation up to additive constants: Lemma 3.2. For some constant c, the g that maximizes the lower bound in Equation (5) is g*(x, y, z) = log[p(x, y, z) / (p(x) p(y) p(z))] + c.</p><p>We show in Appendix B.3 that, as N gets larger, the total correlation lower bound closes for the optimal scoring function g*. This implies a computational-statistical trade-off: a larger batch size demands more computation but results in a tighter bound.</p></div>
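The tightening of the bound with batch size can be checked numerically on the XOR process, where TC = ln 2 nats and the optimal critic of Lemma 3.2 is available in closed form. A Monte Carlo sketch (our illustration, with the additive constant set to 0):

```python
import numpy as np

rng = np.random.default_rng(0)

def g_star(x, y, z):
    # Optimal critic for the XOR process (Lemma 3.2 with the constant at 0):
    # log p(x,y,z)/(p(x)p(y)p(z)) is log 2 for valid triples (z = x XOR y)
    # and -inf for triples with zero joint probability.
    return np.where(z == (x ^ y), np.log(2.0), -np.inf)

def bound_estimate(N, trials=5000):
    """Monte Carlo estimate of the Theorem 3.1 bound (in nats):
    log N + E[log softmax score assigned to the positive triple]."""
    vals = np.empty(trials)
    for t in range(trials):
        x = int(rng.integers(0, 2))
        y = rng.integers(0, 2, size=N)
        z = rng.integers(0, 2, size=N)
        z[0] = x ^ y[0]                 # index 0 is the positive triple
        scores = g_star(x, y, z)
        stable = scores - scores.max()  # the positive always scores log 2
        vals[t] = np.log(N) + stable[0] - np.log(np.exp(stable).sum())
    return vals.mean()

# The estimate rises toward TC = ln 2 ~ 0.693 nats as the batch size N grows.
for N in (2, 8, 64):
    print(N, round(bound_estimate(N), 3))
```

With N = 2 the estimate sits near 0.35 nats; by N = 64 it is within a few hundredths of ln 2, illustrating the computational-statistical trade-off noted above.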
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">The Symile objective</head><p>We now derive the Symile loss by maximizing the total correlation lower bound in Theorem 3.1.</p><p>Instead of using the dot product as a scoring function, as CLIP does, Symile uses its generalized form: the coordinate-wise sum of the element-wise product of a set of vectors. We call this the multilinear inner product (MIP): MIP(x, y, z) = Σ_d [x]_d [y]_d [z]_d.</p><p>As a scoring function, the MIP strikes a balance between computational simplicity and expressive power: it represents one of the simplest possible generalizations of the dot product to more than two modalities, and the vector multiplication ensures it is expressive enough to model any joint statistic.<ref type="foot">4</ref></p><p>Given a batch of N positive triples (x_i, y_i, z_i), each with N − 1 corresponding negative triples (x_i, y′_j, z′_j), and letting τ ∈ R+ be a temperature parameter, the Symile loss is the negative of an empirical estimate of the expected log likelihood in Equation (5): ℓ^(x→(y,z))(θ, τ) = −(1/N) Σ_{i=1}^N log[exp(MIP(x_i, y_i, z_i) / τ) / (exp(MIP(x_i, y_i, z_i) / τ) + Σ_{j=1}^{N−1} exp(MIP(x_i, y′_j, z′_j) / τ))]. (6)</p><p>Minimizing Equation (<ref type="formula">6</ref>) optimizes the lower bound on total correlation by maximizing the MIP of positive tuples and minimizing the MIP of negative tuples (Figure <ref type="figure">2a</ref>).
See Appendix B.4 for the Symile objective generalized to any number of modalities.</p><p>As is done with CLIP, the final Symile loss is an average of the loss terms where each modality is treated as the anchor in turn:</p><p>Algorithm 1 Pseudocode for implementation of Symile with O(N) negative sampling

# compute [n, n] logits from x -> (y, z)
def get_logits(x, y, z):
    MIP_pos = (x * y * z).sum(axis=1)  # [n]
    y_shuffled = y[np.random.permutation(n)]
    z_shuffled = z[np.random.permutation(n)]
    MIP_neg = x @ (y_shuffled * z_shuffled).T  # [n, n]
    return np.where(np.eye(n), MIP_pos, MIP_neg)

# v, u, w: L2-normalized embeddings, each [n, dim]
def symile_loss(v, u, w):
    logits_v_uw = np.exp(t) * get_logits(v, u, w)
    logits_u_vw = np.exp(t) * get_logits(u, v, w)
    logits_w_vu = np.exp(t) * get_logits(w, v, u)
    labels = np.arange(n)
    loss_v_uw = ce_loss(logits_v_uw, labels)
    loss_u_vw = ce_loss(logits_u_vw, labels)
    loss_w_vu = ce_loss(logits_w_vu, labels)
    return (loss_v_uw + loss_u_vw + loss_w_vu) / 3</p><p>Efficient negative sampling. In the sampling procedure described in Section 3.1, negative samples for the non-anchor modalities are drawn independently for each positive triple, which can be intensive in terms of both computation and memory. Instead, for efficiency, negative sampling can be approximated within a batch by forming negative tuples from non-matching combinations of the non-anchor modalities.</p><p>Approximating negatives within a batch is straightforward with two modalities, but in the case of more than two modalities, both how negatives are formed and how many are used become design choices. At one extreme, one could generate N² − 1 negative triples for each positive by considering all possible combinations of the two remaining non-anchor modalities. This approach, which we call O(N²), can be computationally and memory intensive.
Instead, any subset of these negatives can be used for sampling.</p><p>For instance, a more efficient approach, which we refer to as O(N), involves randomly permuting the non-anchor modalities within the batch, providing each data point with N − 1 negatives. The cube in Figure <ref type="figure">2a</ref> illustrates the O(N²) approach and Algorithm 1 presents pseudocode for the O(N) approach, both for three modalities.</p><p>Missing data. The Symile objective is defined for data in which all modalities are observed. However, in practice, datasets often include samples where not all modalities are available. This raises the question: during training, how should one incorporate data points for which only a subset of modalities is observed? Symile can be easily adapted to such missingness by adding extra dimensions to the encoder inputs that indicate whether or not a modality is missing, ensuring that missing data points are out-of-support. This approach allows Symile to model dependencies between whichever modalities are observed within a sample. We show in Section 5.2 that Symile retains its advantage over pairwise CLIP even with modalities missing in the data.</p></div>
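For contrast with the O(N) scheme of Algorithm 1, the O(N²) scheme can be sketched in a few lines of NumPy (a hypothetical helper of our own, not the paper's released code): each anchor row scores the MIP against every (y_j, z_k) combination.

```python
import numpy as np

def get_logits_on2(x, y, z):
    """O(N^2) logits from anchor x.

    Returns an [n, n*n] matrix whose row i holds MIP(x_i, y_j, z_k) for all
    j, k; the positive triple for row i sits at flattened column i*n + i,
    and the other n*n - 1 columns are its negatives.
    """
    n = x.shape[0]
    # scores[i, j, k] = sum_d x[i, d] * y[j, d] * z[k, d]  (the MIP)
    scores = np.einsum('id,jd,kd->ijk', x, y, z)
    return scores.reshape(n, n * n)

# Tiny check with n=2, dim=3 random embeddings: the positive column holds
# the MIP of the matched triple.
rng = np.random.default_rng(0)
x, y, z = rng.normal(size=(3, 2, 3))
logits = get_logits_on2(x, y, z)
assert logits.shape == (2, 4)
assert np.isclose(logits[1, 1 * 2 + 1], (x[1] * y[1] * z[1]).sum())
```

The [n, n·n] logit matrix (the cube of Figure 2a, flattened) is exactly the memory cost that motivates the O(N) permutation shortcut.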
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Learning sufficient statistics with Symile</head><p>An important property of Symile is that it learns sufficient statistics, which is central to the representations' effectiveness for downstream tasks.</p><p>Theorem 3.3 (Symile Sufficient Statistics). Let x, y, z be three random variables whose optimal representations when trained using Symile are f*_x(x), f*_y(y), f*_z(z), respectively. The element-wise product of any subset of the representations is a sufficient statistic for predicting the remaining random variables.</p><p>For example, f*_x(x) ⊙ f*_z(z) is a sufficient statistic for predicting y, which can be expressed using the following conditional independence statement: y ⊥ (x, z) | f*_x(x) ⊙ f*_z(z).</p><p>The proof can be found in Appendix G. The independence statement in Theorem 3.3 tells us that the element-wise product of the Symile representations of any subset of modalities contains all the information required to predict the remaining modalities. In other words, once Symile representations have been computed, access to the full data is no longer needed. Theorem 3.3 confirms Symile's ability to learn efficient modality-specific representations for downstream tasks.</p></div>
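Theorem 3.3 can be made concrete on the XOR process from Section 2.2. With the hypothetical one-dimensional embeddings f(v) = (−1)^v (our choice for illustration; it makes the MIP of every valid triple maximal), the element-wise product f(a)·f(c) alone determines b:

```python
import itertools

# Hypothetical one-dimensional embeddings for the XOR process: f(v) = (-1)^v,
# so MIP(f(a), f(b), f(c)) = (-1)^(a+b+c), which is +1 exactly on positive
# triples (c = a XOR b) and -1 otherwise.
f = lambda v: (-1) ** v

for a, b in itertools.product([0, 1], repeat=2):
    c = a ^ b
    # f(a) * f(c) = (-1)^(a+c) = (-1)^b: the product is a sufficient
    # statistic for b; no further access to a or c is needed.
    b_hat = 0 if f(a) * f(c) == 1 else 1
    assert b_hat == b
print("f(a) * f(c) recovers b on all four triples")
```

No pairwise representation can achieve this: since a ⊥ c, the pair (f(a), f(c)) only becomes informative about b through their product, which is exactly the statistic the MIP trains.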
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Zero-shot prediction using the scoring function</head><p>Just as with CLIP, the optimal scoring function g* (Lemma 3.2) can be used to predict one of the modalities y ∈ Y using instances of the other modalities x, z. If p(y) is uniformly distributed, then the scoring function can be used to rank the candidates for y: argmax_{y ∈ Y} p(y | x, z) = argmax_{y ∈ Y} g*(x, y, z).</p><p>However, this zero-shot approach, whether applied to Symile or to CLIP, does not lead to the Bayes optimal prediction and, consequently, does not always yield reliable results when p(y) is not uniformly distributed (see Appendix H for a detailed discussion). To address this issue, we can instead compute the desired conditional probability directly using the scoring function:</p><p>Theorem 3.4 (Conditional Distribution using the Scoring Function). Let x, y, z be three random variables whose optimal representations when trained using Symile are f*_x(x), f*_y(y), f*_z(z). Then p(y | x, z) = p(y) exp(g*(x, y, z)) / Σ_{y′ ∈ Y} p(y′) exp(g*(x, y′, z)). (7)</p><p>The proof is provided in Appendix H.</p><p>If the marginal distribution of y is known, we could then perform zero-shot classification in one of two ways. When the distribution p(y | x, z) itself is of interest, as is often the case in healthcare <ref type="bibr">[10]</ref>, we could compute p(y | x, z) directly, following Equation <ref type="bibr">(7)</ref>. Alternatively, if only predictions are needed, we could use p(y | x, z) to train a simple model to predict any property of y, s(y): E[s(y) | x, z] = Σ_{y ∈ Y} s(y) p(y | x, z). Note that although the above discussion centers on Symile, it applies equally to CLIP and its own scoring function, the dot product.</p></div>
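Both zero-shot routes can be sketched with a toy candidate set (hypothetical 4-dimensional embeddings of our own construction): plain MIP ranking, valid under a uniform p(y), and the prior-weighted scoring that follows from Theorem 3.4.

```python
import numpy as np

def mip(x, y, z):
    # multilinear inner product: coordinate-wise sum of element-wise products
    return (x * y * z).sum(axis=-1)

def zero_shot_scores(x, z, y_candidates, log_prior=None):
    """Score each candidate y for the query (x, z).

    With log_prior=None this ranks by the score g(x, y, z) alone, which
    matches argmax_y p(y | x, z) only under a uniform prior; passing
    log p(y) reweights the scores as in Theorem 3.4 (up to normalization).
    """
    scores = mip(x[None, :], y_candidates, z[None, :])
    if log_prior is not None:
        scores = scores + log_prior
    return scores

x = np.array([1.0, -1.0, 1.0, -1.0])
z = np.array([1.0, 1.0, -1.0, -1.0])
y_candidates = np.array([
    [1.0, 1.0, 1.0, 1.0],    # MIP = 0
    [0.5, 0.5, 0.5, 0.5],    # MIP = 0
    x * z,                   # MIP = 4: the best match for (x, z)
    [-1.0, 1.0, -1.0, 1.0],  # MIP = 0
])
print(int(zero_shot_scores(x, z, y_candidates).argmax()))  # 2

# A sufficiently skewed prior overturns the uniform-prior ranking:
log_prior = np.log(np.array([0.97, 0.01, 0.01, 0.01]))
print(int(zero_shot_scores(x, z, y_candidates, log_prior).argmax()))  # 0
```

The second call illustrates the caution above: when p(y) is far from uniform, ranking by the raw score alone is not Bayes optimal.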
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Related work</head><p>Contrastive learning beyond two modalities. As discussed, previous work has extended contrastive learning to multiple modalities by applying CLIP to pairs of available modalities. Tian et al. <ref type="bibr">[49]</ref> distinguish between two such pairwise approaches: core view and full graph. The core view strategy fixes one modality and then averages the loss terms between that primary modality and each of the other modalities <ref type="bibr">[1,</ref><ref type="bibr">11,</ref><ref type="bibr">44]</ref>. ImageBind <ref type="bibr">[15]</ref> exemplifies this approach, using CLIP to align image embeddings with embeddings from five other modalities: text, audio, depth, thermal, and motion sensor data. One advantage of this strategy is that it avoids the need for datasets with all modalities (though each dataset must still align with a primary modality). As discussed in Sections 3.2 and 5.2, Symile representations can be learned even with modalities missing in the data.</p><p>The full graph strategy, which we have referred to as pairwise CLIP in this paper, is to consider all M(M − 1)/2 pairwise contrastive losses <ref type="bibr">[9,</ref><ref type="bibr">14,</ref><ref type="bibr">33,</ref><ref type="bibr">34,</ref><ref type="bibr">43]</ref>. For example, Guzhov et al. <ref type="bibr">[19]</ref> extend CLIP to include audio with text-to-image, text-to-audio, and image-to-audio losses. While this pairwise strategy captures strictly more information than the one used by ImageBind, neither pairwise approach is able to capture the higher-order information that Symile does.</p><p>Pairwise CLIP has also been applied to architecture-specific fusion models that simultaneously process modalities to capture cross-modal interactions <ref type="bibr">[2,</ref><ref type="bibr">21,</ref><ref type="bibr">52]</ref>. For example, Shvetsova et al.
<ref type="bibr">[47]</ref> train a Transformer to accept any number of modalities, using a weighted sum of contrastive losses across all input combinations. Such fusion approaches face a combinatorial explosion not only in the number of weighting coefficients to tune, but also in the number of forward passes required per batch. In contrast, Symile is architecture-agnostic and can learn modality-specific representations.</p><p>Targeting higher-order information with contrastive learning. The use of contrastive methods to target higher-order information has been explored primarily within the context of multiple augmentations of the same data. For instance, Bai et al. <ref type="bibr">[5]</ref> derive a total correlation estimator by recursively decomposing total correlation into a summation of mutual information terms, to which variational estimators are applied (in contrast, Symile optimizes only a single term when targeting total correlation). They then use their estimator to maximize the total correlation between four text augmentations. Shidani et al. <ref type="bibr">[46]</ref> develop a pairwise contrastive approach for image representation learning by generalizing a lower bound on mutual information to one-vs-rest mutual information across multiple augmentations. Liang et al. <ref type="bibr">[31]</ref> maximize the information in two modalities for a specific downstream task by targeting higher-order information.</p><p>The relationship between these studies and our work is analogous to that between SimCLR <ref type="bibr">[12]</ref> and CLIP. SimCLR popularized the use of the InfoNCE mutual information estimator for contrastive learning on two data augmentations. Building on this framework, CLIP applied the approach to distinct modalities, where representations are learned separately for each modality using any encoder. 
Similarly, while existing work leverages total correlation or mutual information estimators for multi-augmentation contrastive learning, to our knowledge only pairwise applications of CLIP have applied such estimators to more than two distinct modalities. Our work parallels the contributions of InfoNCE and CLIP for cases involving more than two modalities: like InfoNCE, we develop a simple estimator that recovers all possible information between any number of modalities, and like CLIP, we show how this estimator can be used to learn modality-specific representations using any encoder.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Experiments</head><p>In this section, we empirically evaluate Symile on cross-modal retrieval tasks in three settings: a synthetic dataset, a multilingual dataset encompassing text, images, and audio, and a clinical dataset with chest X-rays, electrocardiograms, and blood labs. Throughout our experiments, we use pairwise CLIP as a baseline comparison since, as outlined in Section 4, it represents the only architecture-agnostic approach that applies contrastive objectives to more than two modalities. We release all datasets and code used in these experiments at <ref type="url">https://github.com/rajesh-lab/symile</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Synthetic data</head><p>Building on the illustrative XOR experiment from Section 2, we first test Symile on a synthetic dataset drawn according to a sampling procedure over a, b, c governed by a parameter p ∈ [0, 1].</p><p>We fit three affine linear functions that map a, b, c ∈ R^5 to representations r_a, r_b, r_c ∈ R^16, respectively, and evaluate the model's ability to correctly predict r_b given the pair (r_a, r_c).</p><p>Results. The performance gap between Symile and pairwise CLIP is a consequence of the changing information dynamics between the variables as p moves from 0 to 1, as shown in Figure <ref type="figure">3</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Symile-M3: a multilingual dataset</head><p>We now evaluate Symile on a new multilingual dataset comprising 33 million (audio, image, text) samples. The dataset, Symile-M3, is specifically designed to test a model's ability to capture higher-order information between three distinct high-dimensional data types: by incorporating multiple languages, we construct a task where text and audio are both needed to predict the image, and where, importantly, neither text nor audio alone would suffice.</p><p>Dataset design and model setup. Let w represent the number of languages in the dataset. An (audio, image, text) sample is generated by first drawing a short one-sentence audio clip from Common Voice <ref type="bibr">[4]</ref> spoken in one of w languages with equal probability. An image is drawn from ImageNet <ref type="bibr">[45]</ref> that corresponds to one of 1,000 classes with equal probability. Finally, text containing exactly w words is generated based on the drawn audio and image: one of the w words in the text is the drawn image class name in the drawn audio language. The remaining w − 1 words are randomly chosen from the ImageNet class names and written in one of the w languages such that there is no overlap in language or class name across the w words in the text. The words are separated by underscores, and their order is randomized. We release three versions of the dataset: Symile-M3-2, Symile-M3-5, and Symile-M3-10, corresponding to 2, 5, and 10 languages (w). Figure <ref type="figure">4a</ref> shows an example of the data-generating process for Symile-M3-5. For each of the three datasets, 10M training, 500K validation, and 500K test samples were generated.</p><p>We use pre-trained encoders, freezing all parameters except for those in the text encoder's embedding layer and first encoder layer, which are fine-tuned.
We train three linear projections to map each encoder's representation to the same 8192-dimensional space. The Symile loss is trained with O(N) negative sampling. See Appendix I for details.</p><p>Evaluation and results. We evaluate the learned representations on the zero-shot retrieval task of finding an image of the appropriate class given the audio and text. The most probable image for a given query audio and text pair, selected from all possible candidate images in the test set, is the one with the highest similarity score (Figure <ref type="figure">2b</ref>). Symile-M3 was designed to ensure that neither text nor audio alone would suffice to predict the image. Therefore, success on this zero-shot retrieval task hinges on a model's ability to capture joint information between the three modalities.</p><p>As shown in Figure <ref type="figure">4b</ref>, Symile successfully leverages this joint information, with mean accuracies of 0.939, 0.919, and 0.882 on Symile-M3-2, Symile-M3-5, and Symile-M3-10, respectively, calculated across 10 bootstrap samples of the test set, all with standard error less than 4.0 × 10⁻⁴. In contrast, CLIP, which captures pairwise information between image and text, can only predict an image randomly from among the w class labels present in the text, resulting in mean accuracies of 0.473, 0.187, and 0.094 on Symile-M3-2, Symile-M3-5, and Symile-M3-10, respectively, all with standard error ≤ 3.01 × 10⁻⁴. Because CLIP cannot distinguish between the class labels in the text using the audio language, it can only pick a class label at random, bounding its accuracy by 1/w.</p><p>Missing data. We also train Symile on a variant of Symile-M3-2 where each modality is independently missing with probability 0.5 or 0.65, corresponding, respectively, to probabilities 0.125 and 0.043 of a complete data sample in the training set (see Appendix I for details). As before, the test set consists of complete triples. 
As shown in Figure <ref type="figure">4c</ref>, even when only 12.5% of the training data is complete, Symile achieves a mean accuracy of 0.906 ± 3.4 × 10⁻⁴ (SE), far outperforming the CLIP baseline accuracy of 0.473, despite the adverse effect of missing modalities. Notably, when less than 5% of the training data is complete, Symile still exceeds the CLIP baseline.</p></div>
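The dataset construction and the missing-data rates above can be sketched in a few lines. The miniature vocabulary and language set below are hypothetical stand-ins for the 1,000 ImageNet classes and the w Common Voice languages:

```python
import random

# Hypothetical miniature vocabulary standing in for the 1,000 ImageNet class
# names rendered in each language (the real dataset translates class names).
CLASS_NAMES = {
    "en": {"dog": "dog", "cat": "cat", "car": "car", "tree": "tree", "fish": "fish"},
    "es": {"dog": "perro", "cat": "gato", "car": "coche", "tree": "árbol", "fish": "pez"},
}

def make_text(image_class, audio_lang, languages, rng):
    """Build the w-word text: one word is the drawn image class in the drawn
    audio language; the other w - 1 words are distractor classes, each in a
    distinct other language, with no class or language repeated."""
    words = [CLASS_NAMES[audio_lang][image_class]]
    other_langs = [l for l in languages if l != audio_lang]
    other_classes = [c for c in CLASS_NAMES[audio_lang] if c != image_class]
    rng.shuffle(other_langs)
    rng.shuffle(other_classes)
    for lang, cls in zip(other_langs, other_classes):
        words.append(CLASS_NAMES[lang][cls])
    rng.shuffle(words)      # word order is randomized
    return "_".join(words)  # words separated by underscores

def complete_prob(p, n_modalities=3):
    """Chance a training triple survives when each of the three modalities is
    independently missing with probability p."""
    return (1 - p) ** n_modalities

text = make_text("dog", "en", ["en", "es"], random.Random(0))
```

With three modalities each dropped independently with probability p, a triple is complete with probability (1 − p)³, which gives the 0.125 (p = 0.5) and 0.043 (p = 0.65) figures quoted above.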
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Chest X-ray prediction using electrocardiograms and laboratory measurements</head><p>Zero-shot retrieval is widely used in the evaluation of representation learning for healthcare <ref type="bibr">[6,</ref><ref type="bibr">22,</ref><ref type="bibr">29,</ref><ref type="bibr">51,</ref><ref type="bibr">56]</ref>. In this section, we evaluate the Symile objective on Symile-MIMIC, a clinical dataset comprised of chest X-rays, electrocardiograms, and blood labs from MIMIC-IV <ref type="bibr">[17,</ref><ref type="bibr">24,</ref><ref type="bibr">27]</ref> and MIMIC-CXR <ref type="bibr">[25,</ref><ref type="bibr">26]</ref>. Since ECGs and labs are both safer than CXRs, this experiment explores whether an ECG and labs collected at admission are predictive of a CXR taken shortly thereafter.</p><p>Dataset design and model setup. Each data sample includes an ECG reading and blood labs taken within 24 hours of the patient's admission to the hospital, and a CXR taken in the 24- to 72-hour period post-admission (Figure <ref type="figure">5a</ref>). Our analysis focuses on the 50 most common blood labs, with each sample containing at least one.</p><p>We split our dataset (11,622 admissions) into a train/validation development set (95% of patients) and a test set (5% of patients), ensuring there is no patient overlap across the splits. Following previous work, we use the ResNet-50 and ResNet-18 architectures <ref type="bibr">[20]</ref> for the CXR and ECG encoders, respectively, and a three-layer neural network to encode the blood labs. All encoders are trained from scratch, and three linear projections map each encoder's representation to the same 8192-dimensional space.</p><p>Given the limited size of the dataset, the Symile loss is trained with O(N²) negative sampling to mitigate overfitting. See Appendix I for details.</p><p>Evaluation and results. 
We evaluate the learned representations on the zero-shot retrieval task of finding the most probable candidate CXR for a given query ECG and labs pair according to the similarity score. For each query ECG and labs pair in the test set, we sample nine negative CXR candidates from the remaining test samples, so that each query has a total of 10 candidates: one positive (the true corresponding CXR) and nine negatives.</p><p>In Figure <ref type="figure">5b</ref>, we report mean accuracy for Symile and CLIP over 10 bootstrap samples of the test set. While both models surpass random chance (0.1), Symile achieves an average accuracy of 0.435 ± 0.007 (SE), outperforming CLIP's 0.387 ± 0.003 (SE). These results correspond to a 12.5% relative increase in accuracy for Symile over CLIP.</p></div>
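The reported means and standard errors can be computed with a simple bootstrap over the test set. A sketch, where the per-query outcomes are hypothetical and the standard deviation of the bootstrap accuracies serves as the SE estimate:

```python
import numpy as np

def bootstrap_accuracy(correct, n_boot=10, seed=0):
    """Mean accuracy and a bootstrap standard-error estimate: resample the
    per-query 0/1 outcomes with replacement n_boot times and take the
    standard deviation of the resampled accuracies."""
    rng = np.random.default_rng(seed)
    n = len(correct)
    accs = np.array([rng.choice(correct, size=n, replace=True).mean()
                     for _ in range(n_boot)])
    return accs.mean(), accs.std(ddof=1)

# Hypothetical outcomes: did the true CXR rank first among the 10 candidates?
correct = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0] * 50)
mean_acc, se = bootstrap_accuracy(correct)
```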
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>This work presents Symile, a simple contrastive learning approach that captures higher-order information between any number of modalities. Symile provides a flexible, architecture-agnostic objective for learning modality-specific representations, maintaining the simplicity of CLIP while delivering superior performance, even in cases of missing modalities. Because it targets total correlation, Symile captures strictly more information than CLIP, guaranteeing performance that matches or surpasses CLIP, except in cases where it is known that only pairwise statistics are relevant. Given that such prior knowledge is rarely available, Symile should be favored over CLIP.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Future work. (1)</head><p>The sigmoid-based loss function SigLIP <ref type="bibr">[54]</ref> was recently introduced as a memory-efficient alternative to traditional softmax-based contrastive objectives. A potential avenue for future work would be to adapt Symile, and its use of the multilinear inner product, to this sigmoid loss.</p><p>(2) The proposed implementation of Symile relies on an approximation for negative sampling, and future work could examine how this approximation scales when applied to settings with more than three modalities. (3) Future work could integrate pre-trained Symile representations into multimodal large language models, enabling them to capture higher-order information between modalities.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A Pairwise independence in binary XOR experiment</head><p>In this section, we show that the three variables in the XOR experiment in Section 2.2 are pairwise independent.</p><p>Let</p><p>First, we will show that c ∼ Bernoulli(0.5).</p></div>
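The pairwise-independence claim is small enough to verify by exhaustive enumeration. A quick check with a, b ~ Bernoulli(0.5) independent and c = a XOR b:

```python
from itertools import product

# Enumerate the joint distribution of (a, b, c) with c = a XOR b: the triple
# is deterministic given (a, b), yet every pair of variables is independent.
joint = {}
for a, b in product([0, 1], repeat=2):
    c = a ^ b
    joint[(a, b, c)] = joint.get((a, b, c), 0) + 0.25

def marginal(dist, idx):
    """Marginalize the joint onto the coordinates in idx."""
    out = {}
    for outcome, p in dist.items():
        key = tuple(outcome[i] for i in idx)
        out[key] = out.get(key, 0) + p
    return out

# Check p(x_i, x_j) = p(x_i) p(x_j) for all three pairs.
for i, j in [(0, 1), (0, 2), (1, 2)]:
    pij = marginal(joint, (i, j))
    pi, pj = marginal(joint, (i,)), marginal(joint, (j,))
    for (x, y), p in pij.items():
        assert abs(p - pi[(x,)] * pj[(y,)]) < 1e-12  # pairwise independent
```

The same enumeration confirms that c itself is Bernoulli(0.5), since XOR of two fair coins is fair.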
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B Total correlation lower bound</head><p>Our goal in this section is to derive a lower bound on TC(m 1 , . . . , m M ).</p><p>We start by describing in Appendix B.1 the sampling procedure for a batch of (m 1 , . . . , m M ) tuples. In Appendix B.2, we derive the desired lower bound in Theorem 3.1 (our proof was inspired by Poole et al. <ref type="bibr">[39]</ref>'s derivation of the InfoNCE lower bound, which does not rely on an approximation used by Oord et al. <ref type="bibr">[38]</ref>). In Appendix B.3, we show that the bound is closed at optimality. Finally, we use the lower bound to define the Symile objective in Appendix B.4.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.1 Sampling procedure</head><p>We start by describing the sampling procedure for the batch of N M-tuples. In contrastive learning, the objective is to differentiate between positive and negative samples constructed from a given batch of matched data. To construct these samples, each modality is treated as the anchor in turn, and then for each anchor modality a corresponding set of positive and negative samples is generated.</p><p>Let µ be arbitrary in {1, . . . , M}, let m_µ denote the anchor modality, and let m_{−µ} denote the M − 1 non-anchor modalities. Let</p><p>denote the index of the positive M-tuple in the batch.</p><p>We draw m_µ from p(m_µ) and m_{−µ,i} from p(m_{−µ,i} | m_µ). We call (m_µ, m_{−µ,i}) our positive tuple.</p><p>For each non-anchor modality m_ℓ with ℓ ≠ µ, we draw N − 1 samples m_{ℓ,j} from p(m_ℓ), so that there are N − 1 total negative tuples (m_µ, m_{−µ,j}).</p><p>Let M_{−µ} = {m_{−µ,n}}_{n=1}^{N} be the set of all samples of the non-anchor modalities m_{−µ} in the batch. This sampling procedure describes the following distribution:</p><p>Letting M_ℓ = {m_{ℓ,n}}_{n=1}^{N} be the set of all samples of modality m_ℓ in the batch for each ℓ ≠ µ, the following properties hold by Lemma C.1:</p></div>
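A minimal sketch of this batch construction (array shapes and names are ours): the positive tuple aligns all modalities at the anchor's index i, while each negative keeps the anchor fixed and takes the non-anchor modalities at some other index j:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, d = 4, 3, 2  # batch size, number of modalities, feature dim

# A batch of N matched M-tuples (drawn jointly, so indices align).
batch = rng.normal(size=(M, N, d))

def tuples_for_anchor(mu, i):
    """For anchor modality mu and positive index i: the positive tuple keeps
    index i in every modality; each of the N - 1 negative tuples keeps the
    anchor at i but takes the non-anchor modalities at an index j != i."""
    positive = tuple(batch[m][i] for m in range(M))
    negatives = [tuple(batch[m][i] if m == mu else batch[m][j]
                       for m in range(M))
                 for j in range(N) if j != i]
    return positive, negatives

pos, negs = tuples_for_anchor(mu=0, i=1)
```

In training, this construction is repeated with each modality serving as the anchor in turn.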
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.2 Lower bound on total correlation</head><p>We now derive a lower bound on TC(m 1 , . . . , m M ), which we express using the following notation for convenience:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Theorem B.1 (Total Correlation Lower Bound). Given the distributions in Equations (8) and (9), for any value i of the index i and any scoring function g, a multi-sample contrastive lower bound on total correlation is</head><p>.</p><p>Proof. By Lemmas C.1 and D.1, we have</p><p>.</p><p>We call the above likelihood ratio in blue the total correlation (TC) likelihood ratio. We introduce a variational approximation q(M_{−µ} | m_µ, i = i) that has the same support as p(M_{−µ} | m_µ, i = i):</p><p>since the Kullback-Leibler divergence is always non-negative. Note that Equation (<ref type="formula">10</ref>) is the total correlation variant of Barber &amp; Agakov <ref type="bibr">[7]</ref>'s lower bound on mutual information.</p><p>We choose to set</p><p>where</p><p>and g is an arbitrary function.</p><p>Plugging Equation <ref type="bibr">(11)</ref> into Equation <ref type="bibr">(10)</ref> gives</p><p>Since log(b) ≤ b/a + log(a) − 1 for all b, a &gt; 0, we see that</p><p>which, continuing from Equation (<ref type="formula">13</ref>), gives us</p><p>Substituting the formulas for f and C into Equation (<ref type="formula">14</ref>),</p><p>. Now take the expectation of this bound over p(i):</p><p>Notice that the index i does not change the expected value in Equation <ref type="bibr">(15)</ref>. To see why, consider two values i and i′:</p><p>.</p><p>Swapping the names of integration variables does not change the integral from Equation (<ref type="formula">16</ref>) to Equation <ref type="bibr">(17)</ref>.</p><p>Therefore, continuing from Equation (<ref type="formula">15</ref>), the lower bound can be written for any value i of the index i as</p><p>.</p><p>The extra negative samples are auxiliary random variables for computation in that these random variables do not appear in the target total correlation. 
This is analogous to the auxiliary random variables used in approximating posteriors and probabilistic modeling <ref type="bibr">[37,</ref><ref type="bibr">42,</ref><ref type="bibr">48]</ref>.</p></div>
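The elementary inequality invoked in the proof above, log(b) ≤ b/a + log a − 1 for a, b &gt; 0, is the first-order bound on the concave logarithm, written out here for reference:

```latex
% Concavity of the logarithm: the tangent line at a lies above the curve,
% so for all a, b > 0 (with equality iff b = a):
\log b \;\le\; \log a + \frac{b - a}{a} \;=\; \frac{b}{a} + \log a - 1
```

Setting a = 1 recovers the familiar log b ≤ b − 1.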
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.3 Closing the lower bound</head><p>There are two inequalities in the derivation for the total correlation lower bound in Theorem B.1: the Barber &amp; Agakov gap in Equation ( <ref type="formula">10</ref>) and the log ratio gap in Equation <ref type="bibr">(14)</ref>. In this section, we show that each of these bounds is closed at optimality.</p><p>The Barber &amp; Agakov gap in Equation ( <ref type="formula">10</ref>) is closed when</p><p>Therefore, closing the Barber &amp; Agakov gap requires</p><p>The log ratio gap in Equation ( <ref type="formula">14</ref>) is closed when</p><p>Then by Equation ( <ref type="formula">18</ref>), the lower bound is closed if</p><p>By Equation <ref type="bibr">(12)</ref>,</p><p>Therefore, we need</p><p>Let</p><p>&#8656;&#8658;</p><p>Informally, for large enough N ,</p><p>Therefore, we have</p><p>as required by Equation <ref type="bibr">(20)</ref>.</p><p>The solution for the scoring function g in Equation ( <ref type="formula">21</ref>) equals the g * , derived in Lemma E.1, that maximizes the total correlation lower bound.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.4 The Symile objective</head><p>Given a batch of N′ positive tuples (m_{µ,i}, m_{−µ,i}), each with N − 1 corresponding negative tuples (m_{µ,i}, m′_{−µ,j}), and letting τ ∈ R+ be a temperature parameter, the Symile loss is the negative of an empirical estimate of the expected log likelihood in the lower bound in Theorem B.1:</p><p>We take the expectation over p(i) of both sides of Equation (<ref type="formula">24</ref>) to get the result, where the expectation over the negatives integrates against ∏_{ℓ≠µ} p(m_{ℓ,j}) dM_{−µ,j≠i} by Equation (23).</p></div>
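A self-contained numpy sketch of the resulting objective with O(N) negative sampling for M = 3 (the actual implementation uses learned encoders, a learned temperature, and a deep-learning framework; function names here are ours): each modality serves as anchor in turn, the multilinear inner product scores anchor row i against matched non-anchor rows j, and the loss is cross-entropy toward the diagonal.

```python
import numpy as np

def symile_loss(reps, temperature=1.0):
    """Symile loss with O(N) negative sampling for M = 3 modalities.
    reps = (r_a, r_b, r_c), each of shape (N, d), with row i across the
    three arrays forming the positive triple. For each anchor, entry (i, j)
    of the logit matrix is the multilinear inner product of anchor row i
    with the non-anchor rows j, so positives lie on the diagonal."""
    r_a, r_b, r_c = reps
    N = r_a.shape[0]
    total = 0.0
    for anchor, (u, v) in [(r_a, (r_b, r_c)),
                           (r_b, (r_a, r_c)),
                           (r_c, (r_a, r_b))]:
        # MIP over the feature dimension: sum_d anchor[i,d] * u[j,d] * v[j,d]
        logits = np.einsum("id,jd,jd->ij", anchor, u, v) / temperature
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        total += -log_softmax[np.arange(N), np.arange(N)].mean()
    return total / 3.0  # average over the three anchor choices

rng = np.random.default_rng(0)
reps = tuple(rng.normal(size=(8, 16)) for _ in range(3))
loss = symile_loss(reps)
```

As a sanity check, representations whose triple products are large only on the diagonal (for example, matched scaled one-hot rows) drive the loss toward zero.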
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D Total correlation for a batch</head><p>Lemma D.1 (Total Correlation for a Batch of Tuples). Suppose a batch of N M-tuples is sampled according to the data generating process outlined in Appendix B.1 where</p><p>We claim that for any value i of the index i</p><p>Proof. By the definition of conditional total correlation,</p></div>
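For reference, the total correlation being bounded throughout this appendix is the KL divergence from the product of marginals to the joint, equivalently a sum of entropies (a standard definition, stated here for convenience):

```latex
% Total correlation of m_1, ..., m_M; for M = 2 it reduces to
% the mutual information I(m_1; m_2).
TC(m_1, \ldots, m_M)
  = D_{\mathrm{KL}}\!\left( p(m_1, \ldots, m_M) \,\middle\|\,
      \textstyle\prod_{\ell=1}^{M} p(m_\ell) \right)
  = \sum_{\ell=1}^{M} H(m_\ell) \;-\; H(m_1, \ldots, m_M)
```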
<div xmlns="http://www.tei-c.org/ns/1.0"><head>G Symile learns sufficient statistics</head><p>Theorem G.1 (Symile Sufficient Statistics). Let m_1, . . . , m_M be M random variables whose optimal representations when trained using Symile are f*_1(m_1), . . . , f*_M(m_M), respectively. The elementwise product of any subset of the representations is a sufficient statistic for predicting the remaining random variables.</p><p>For example, letting µ be arbitrary in {1, . . . , M} and letting ⊙_{k≠µ} f*_k(m_k) indicate the elementwise product of the representations for the remaining M − 1 modalities, ⊙_{k≠µ} f*_k(m_k) is a sufficient statistic for predicting m_µ, which can be expressed using the following conditional independence statement:</p><p>Proof. Since, as discussed in Section 3.2, we use the multilinear inner product (MIP) as the scoring function g, by Lemma E.1, for some constant c &gt; 0 at optimality, we have</p><p>Consider the case in which we are given representations for the M − 1 modalities that are not m_µ.</p><p>The goal is to show</p><p>To do so, we will show that</p><p>Since, conditioned on m_{−µ}, m_µ is independent of any function of the m_{k≠µ},</p><p>Since p is a distribution,</p><p>.</p><p>Substituting this back into Equation <ref type="bibr">(27)</ref> yields</p><p>Now compute</p><p>by Eq. 28</p><p>Since m_{−µ} only appears inside the expectation through ⊙_{k≠µ} f*_k(m_k), and since we are conditioning on ⊙_{k≠µ} f*_k(m_k) being a particular value, the term inside the expectation is conditionally constant. Therefore,</p><p>by Eq. 
28</p><p>This equality establishes the desired conditional independence.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>H Zero-shot prediction using the score function</head><p>In this section, we discuss the limitations, for both Symile and CLIP, of using the scoring function for zero-shot prediction, and demonstrate how these limitations can be addressed by using the scoring function to directly compute the desired conditional probability.</p><p>Recall from Lemma 3.2 that the optimal scoring function g* is equal to the instantaneous total correlation up to additive constants:</p><p>g*(x, y, z) = log ( c · p(x, y, z) / ( p(x) p(y) p(z) ) ) for some constant c &gt; 0.</p><p>Similarly, the optimal scoring function h* for CLIP can be expressed as follows <ref type="bibr">[38,</ref><ref type="bibr">39]</ref>:</p><p>h*(x, y) = log ( c′ · p(x, y) / ( p(x) p(y) ) ) for some constant c′ &gt; 0.</p><p>Traditionally, for zero-shot prediction with CLIP, the scoring function is used to rank the candidates for one of the modalities: arg max_{y ∈ Y} p(y | x) is approximated by arg max_{y ∈ Y} h*(x, y). However, it turns out that this approach for zero-shot prediction does not lead to the Bayes optimal prediction, potentially sacrificing accuracy.</p><p>To illustrate the issue, consider a scenario in which we have two modalities: disease y and temperature t, with the following joint distribution. For y = a: p(a, 99) = 0.1, p(a, 100) = 0.1, p(a, 101) = 0.3, p(a, 102) = 0.3, so p(y = a) = 0.8. For y = b: p(b, 99) = 0, p(b, 100) = 0, p(b, 101) = 0.1, p(b, 102) = 0.1, so p(y = b) = 0.2. The marginals over temperature are p(t = 99) = 0.1, p(t = 100) = 0.1, p(t = 101) = 0.4, p(t = 102) = 0.4.</p><p>Now, consider a patient with a temperature of 101 degrees; our goal is to predict which disease the patient has. Predictions derived from the conditional distribution achieve optimal accuracy <ref type="bibr">[36]</ref>. Therefore, we should predict that the patient has disease a, since p(y = a | t = 101) = 0.3/0.4 = 0.75 &gt; 0.25 = p(y = b | t = 101). However, were we to apply the standard strategy of using the scoring function for zero-shot classification, we would predict that the patient has disease b, since dividing by the prior probability of disease b upweights its likelihood ratio compared to that of disease a: p(b, 101)/(p(b)p(t = 101)) = 0.1/(0.2 · 0.4) = 1.25, whereas p(a, 101)/(p(a)p(t = 101)) = 0.3/(0.8 · 0.4) ≈ 0.94. Why, then, does CLIP perform well in practice? 
Because the kinds of zero-shot classification tasks for which the dot product is used typically feature an almost deterministic likelihood, where the modality to predict has a point mass distribution at a single value, with probability zero everywhere else.</p><p>For example, in our case, this would mean that p(t | y) places all of its mass on a single temperature for each disease.</p><p>before being passed to the feature extractor. We freeze the three encoders' parameters except for those in the text encoder's embedding layer and first encoder layer, which are fine-tuned. We train three linear projections to map each encoder's representation to the same 8192-dimensional space, followed by layer normalization.</p><p>For each combination of objective (Symile or CLIP) and Symile-M3 version (2, 5, or 10), we do a grid search over learning rate (1e-5, 5e-5, 1e-4) and weight decay (0, 1e-4, 1e-3). We also tune these hyperparameters for the experiments with missing data. All models are trained for 24 epochs using a batch size of 256. The learned temperature parameter τ is initialized to -6. The Symile loss is trained with O(N) negative sampling. Checkpoints were saved every two epochs, and the best model was selected based on the lowest validation loss.</p><p>Missingness. We evaluate Symile on a variant of Symile-M3-2 where each modality is independently missing with probability 0.5 or 0.65, which correspond, respectively, to probabilities 0.125 and 0.043 of a complete data sample.</p><p>For audio and image data, we learn two embeddings, one for observed data points and one for missing data points. Each embedding matches the dimension of the last hidden layer of the respective audio or image encoder. When a data point is observed, we concatenate its encoder representation and the learned embedding for observed data points, and pass this combined vector into the linear projection head before layer normalization. 
When a data point is missing, we concatenate the mean encoder representation from the observed training samples and the learned embedding for missing data points, and pass this combined vector into the linear projection head before layer normalization.</p><p>For text data, if a data point is missing, we pass into the text encoder the tokenized representation of [MISSING], which is outside of the model's vocabulary.</p></div>
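A sketch of this missing-modality handling for the audio/image branches (dimensions and initialization are illustrative; the flag embeddings are learned jointly with the model in the real setup):

```python
import numpy as np

rng = np.random.default_rng(0)
d_enc = 16  # dimension of the encoder's last hidden layer (illustrative)

# Two learned flag embeddings, each matching the encoder dimension
# (randomly initialized here; trained in the real model).
flag_observed = rng.normal(size=d_enc)
flag_missing = rng.normal(size=d_enc)

# Mean encoder representation over observed training samples, used as the
# content vector when an audio or image input is missing.
train_reps = rng.normal(size=(100, d_enc))
mean_rep = train_reps.mean(axis=0)

def projection_input(enc_rep=None):
    """Vector handed to the linear projection head (before layer norm):
    the encoder output (or the training mean, if missing) concatenated with
    the matching observed/missing flag embedding."""
    if enc_rep is None:  # modality is missing
        return np.concatenate([mean_rep, flag_missing])
    return np.concatenate([enc_rep, flag_observed])

x_obs = projection_input(rng.normal(size=d_enc))
x_mis = projection_input(None)
```

The flag embedding lets the projection head distinguish a genuinely observed input from the mean-imputed placeholder.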
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I.4 Symile-MIMIC</head><p>Symile-MIMIC is a clinical dataset comprised of chest X-rays, electrocardiograms, and blood labs from the MIMIC-IV <ref type="bibr">[16,</ref><ref type="bibr">17,</ref><ref type="bibr">24,</ref><ref type="bibr">27]</ref> and MIMIC-CXR <ref type="bibr">[25,</ref><ref type="bibr">26]</ref> datasets. We use admissions and labs from MIMIC-IV v2.2, <ref type="foot">7</ref> ECGs from MIMIC-IV-ECG v1.0, <ref type="foot">8</ref> and CXRs from MIMIC-CXR-JPG v2.0.0.<ref type="foot">9</ref> </p><p>Each data sample includes an ECG reading and blood labs taken within 24 hours of the patient's admission to the hospital, and a CXR taken in the 24- to 72-hour period post-admission. For each admission, we choose the earliest CXR, ECG, and labs.</p><p>We use CXRs in JPG format, and consider only CXRs with a posteroanterior (PA) or anteroposterior (AP) view. Following Irvin et al. <ref type="bibr">[23]</ref>, each CXR is scaled such that the smaller edge is set to 320 pixels, followed by a square crop (random for training or center for validation and testing). Images are then normalized using the ImageNet mean and standard deviation.</p><p>We use 10-second 12-lead ECGs, and remove from consideration any ECGs with NaN values or with a signal of all zeros. The ECG signal is normalized to lie within the range [-1, 1]. 
We focus on the following 50 most common blood laboratory measurements in our dataset, with each data sample containing at least one: Hematocrit, Platelet Count, Creatinine, Potassium, Hemoglobin, White Blood Cells, MCHC, Red Blood Cells, MCV, MCH, RDW, Urea Nitrogen, Sodium, Chloride, Bicarbonate, Anion Gap, Glucose, Magnesium, Calcium Total, Phosphate, INR (PT), PT, PTT, Basophils, Neutrophils, Monocytes, Eosinophils, Lymphocytes, RDW-SD, H, L, I, Alanine Aminotransferase (ALT), Asparate Aminotransferase (AST), Lactate, Alkaline Phosphatase, Bilirubin Total, pH, Albumin, Base Excess, pO2, Calculated Total CO2, pCO2, Absolute Neutrophil Count, Absolute Eosinophil Count, Absolute Monocyte Count, Absolute Basophil Count, Absolute Lymphocyte Count, Creatine Kinase (CK), Immature Granulocytes.</p><p>For the labs model, we use a 100-dimensional vector as input: the first 50 coordinates are lab values standardized to percentiles based on the training set's empirical CDF, and the remaining 50 coordinates are binary indicators that denote whether each lab value is missing. When a lab value is unobserved, the mean percentile for that lab is substituted.</p><p>Following previous work <ref type="bibr">[8,</ref><ref type="bibr">22,</ref><ref type="bibr">29,</ref><ref type="bibr">30,</ref><ref type="bibr">57]</ref>, we use the ResNet-50 and ResNet-18 architectures <ref type="bibr">[20]</ref> for the CXR and ECG encoders, respectively, and a three-layer neural network to encode the blood labs. All encoders are trained from scratch, and three linear projections map each encoder's representation to the same 8192-dimensional space.</p><p>For Symile and CLIP each, we do a grid search over learning rate (5e-5, 1e-4, 5e-4, 1e-3, 5e-3, 1e-2) and weight decay (1e-3, 1e-2, 1e-1, 2e-1, 5e-1). All models are trained for 80 epochs using a batch size of 280. The learned temperature parameter τ is initialized to -7. 
The Symile loss is trained with O(N²) negative sampling to mitigate overfitting. Checkpoints were saved at the end of every epoch, and the best model was selected based on the lowest validation loss.</p><p>NeurIPS Paper Checklist</p></div>
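The percentile standardization and missingness indicators for the labs vector described in Appendix I.4 above can be sketched as follows, shown with two hypothetical labs instead of 50:

```python
import numpy as np

def empirical_cdf_percentile(train_values, x):
    """Map a lab value to its percentile under the training set's empirical
    CDF: the fraction of training values <= x."""
    train_values = np.sort(np.asarray(train_values))
    return np.searchsorted(train_values, x, side="right") / len(train_values)

def labs_vector(train_by_lab, observed):
    """Labs input: percentile-standardized values for each lab, followed by
    binary missingness indicators. Unobserved labs get the mean training
    percentile for that lab. `observed` maps lab index -> raw value."""
    n_labs = len(train_by_lab)
    values = np.empty(n_labs)
    missing = np.empty(n_labs)
    for k, train in enumerate(train_by_lab):
        if k in observed:
            values[k] = empirical_cdf_percentile(train, observed[k])
            missing[k] = 0.0
        else:
            # mean percentile of the training values themselves
            values[k] = np.mean([empirical_cdf_percentile(train, v) for v in train])
            missing[k] = 1.0
    return np.concatenate([values, missing])

# Two hypothetical labs with tiny training sets; lab 0 observed, lab 1 missing.
train_by_lab = [np.array([1.0, 2.0, 3.0, 4.0]), np.array([10.0, 20.0])]
vec = labs_vector(train_by_lab, {0: 2.5})
```

With 50 labs this produces the 100-dimensional input described above.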
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Claims</head><p>Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?</p><p>Answer: [Yes]</p><p>Justification: All claims made in the abstract and introduction are substantiated both theoretically and empirically in the paper.</p><p>Guidelines:</p><p>&#8226; The answer NA means that the abstract and introduction do not include the claims made in the paper.</p><p>&#8226; The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.</p><p>&#8226; The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.</p><p>&#8226; It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Limitations</head><p>Question: Does the paper discuss the limitations of the work performed by the authors?</p><p>Answer: [Yes] Justification: In Section 3, we outline limitations, clearly state all theoretical assumptions, and discuss the computational and memory trade-offs of various negative sampling approaches.</p><p>Guidelines:</p><p>&#8226; The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.</p><p>&#8226; The authors are encouraged to create a separate "Limitations" section in their paper.</p><p>&#8226; The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.</p><p>&#8226; The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.</p><p>&#8226; The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. 
Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.</p><p>• The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.</p><p>• If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.</p><p>• While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment.</p><p>• The full details can be provided either with the code, in appendix, or as supplemental material.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Experiment Statistical Significance</head><p>Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?</p><p>Answer: <ref type="bibr">[Yes]</ref> Justification: Standard error is reported for all experiments in Section 5.</p><p>Guidelines:</p><p>&#8226; The answer NA means that the paper does not include experiments.</p><p>&#8226; The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.</p><p>&#8226; The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).</p><p>&#8226; The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)</p><p>&#8226; The assumptions made should be given (e.g., Normally distributed errors).</p><p>&#8226; It should be clear whether the error bar is the standard deviation or the standard error of the mean.</p><p>&#8226; It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.</p><p>&#8226; For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).</p><p>&#8226; If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">Experiments Compute Resources</head><p>Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?</p><p>Answer: <ref type="bibr">[Yes]</ref> Justification: Details on the compute resources used are provided in Appendix I.</p><p>Guidelines:</p><p>&#8226; The answer NA means that the paper does not include experiments.</p><p>&#8226; The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage.</p><p>&#8226; The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.</p><p>&#8226; The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.">Code Of Ethics</head><p>Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics <ref type="url">https://neurips.cc/public/EthicsGuidelines</ref>?</p><p>&#8226; Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.</p><p>&#8226; Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.</p><p>&#8226; We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.</p><p>12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?</p><p>Answer: [Yes]</p><p>Justification: All sources for the datasets and models used in this work are properly credited.</p><p>Guidelines:</p><p>&#8226; The answer NA means that the paper does not use existing assets.</p><p>&#8226; The authors should cite the original paper that produced the code package or dataset.</p><p>&#8226; The authors should state which version of the asset is used and, if possible, include a URL.</p><p>&#8226; The name of the license (e.g., CC-BY 4.0) should be included for each asset.</p><p>&#8226; For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.</p><p>&#8226; If assets are released, the license, copyright information, and terms of use in the package should be provided. 
For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.</p><p>&#8226; For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.</p><p>&#8226; If this information is not available online, the authors are encouraged to reach out to the asset's creators.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="13.">New Assets</head><p>Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?</p><p>Answer: [Yes]</p><p>Justification: Full details for the new datasets are available in Section 5, Appendix I and at <ref type="url">https://github.com/rajesh-lab/symile</ref>.</p><p>Guidelines:</p><p>&#8226; The answer NA means that the paper does not release new assets.</p><p>&#8226; Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.</p><p>&#8226; The paper should discuss whether and how consent was obtained from people whose asset is used.</p><p>&#8226; At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0"><p>To be specific, we use "higher-order information" to mean information between two random variables given any number of additional random variables in the conditioning set.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1"><p>Symile stands for SYmmetric MultILinear Embeddings.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2"><p>Note that the MIP is a measure of similarity defined by the joint distribution of the modalities, rather than a measure of the geometric similarity of the modalities' representations. For example, a large MIP for Symile representations rx, ry, rz indicates that the sample (x, y, z) has high probability under the joint likelihood; it provides no information about whether rx, ry, rz are equal to one another.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_3"><p>https://www.kaggle.com/c/imagenet-object-localization-challenge/ overview/description</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_4"><p>https://cloud.google.com/translate</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_5"><p>https://physionet.org/content/mimiciv/2.2/</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_6"><p>https://physionet.org/content/mimic-iv-ecg/1.0/</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_7"><p>https://physionet.org/content/mimic-cxr-jpg/2.0.0/</p></note>
		</body>
		</text>
</TEI>
