<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Reviving the Context: Camera Trap Species Classification as Link Prediction on Multimodal Knowledge Graphs</title></titleStmt>
			<publicationStmt>
				<publisher>ACM</publisher>
				<date>10/21/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10611518</idno>
					<idno type="doi">10.1145/3627673.3679545</idno>
					
					<author>Vardaan Pahuja</author><author>Weidi Luo</author><author>Yu Gu</author><author>Cheng-Hao Tu</author><author>Hong-You Chen</author><author>Tanya Berger-Wolf</author><author>Charles Stewart</author><author>Song Gao</author><author>Wei-Lun Chao</author><author>Yu Su</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Camera traps are important tools in animal ecology for biodiversity monitoring and conservation. However, their practical application is limited by issues such as poor generalization to new and unseen locations. Images are typically associated with diverse forms of context, which may exist in different modalities. In this work, we exploit the structured context linked to camera trap images to boost out-of-distribution generalization for species classification tasks in camera traps. For instance, a picture of a wild animal could be linked to details about the time and place it was captured, as well as structured biological knowledge about the animal species. While often overlooked by existing studies, incorporating such context offers several potential benefits for better image understanding, such as addressing data scarcity and enhancing generalization. However, effectively incorporating such heterogeneous context into the visual domain is a challenging problem. To address this, we propose a novel framework that reformulates species classification as link prediction in a multimodal knowledge graph (KG). This framework enables the seamless integration of diverse multimodal contexts for visual recognition. We apply this framework for out-of-distribution species classification on the iWildCam2020-WILDS and Snapshot Mountain Zebra datasets and achieve competitive performance with]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Human activities are increasingly endangering wildlife species, resulting in a significant global decline in animal populations <ref type="bibr">[2,</ref><ref type="bibr">19,</ref><ref type="bibr">37]</ref>. Therefore, accurately identifying and tracking wildlife species is vital for preserving ecological biodiversity. Camera traps, digital cameras activated by motion or infrared in natural habitats, have become ecologists' preferred data collection tool <ref type="bibr">[23,</ref><ref type="bibr">44,</ref><ref type="bibr">67]</ref>. However, manually sifting through the numerous images they capture is a time-consuming and arduous task for experts. This has led to the increased use of computer vision techniques for species recognition <ref type="bibr">[1,</ref><ref type="bibr">13,</ref><ref type="bibr">30,</ref><ref type="bibr">52,</ref><ref type="bibr">56,</ref><ref type="bibr">68</ref>]. Yet, a challenge has arisen: many of these models overfit to the backgrounds of their training images, diminishing their effectiveness on images from new locations <ref type="bibr">[9,</ref><ref type="bibr">39,</ref><ref type="bibr">55]</ref>. This underscores the need for more adaptable species classification models that perform well across diverse contexts.</p><p>Building on this, cognitive science research has demonstrated the profound influence of contextual information on human perception and visual recognition processes <ref type="bibr">[4,</ref><ref type="bibr">5,</ref><ref type="bibr">45]</ref>. Particularly in wildlife monitoring, camera trap images are replete with crucial contextual data, such as where (i.e., camera location coordinates) and when (i.e., timestamps) a photo is taken. Furthermore, the structured knowledge of biology taxonomy (e.g., Open Tree Taxonomy <ref type="bibr">[46]</ref>) can also provide valuable context for understanding the species in camera trap images. 
Such context provides important knowledge that can boost the recognition of visual concepts. For instance, the knowledge that a certain feline image was taken from a camera trap in Africa significantly reduces the likelihood of it representing a tiger. In addition, more robust associations might be learned with the aid of contextual information because the context provides invariable knowledge that is unbiased towards variations in the illuminations or angles of an image. This may help to compensate for domain shifts in species images resulting from such variations and potentially lead to better out-of-distribution (OOD) generalizability <ref type="bibr">[6,</ref><ref type="bibr">21]</ref>. Consequently, the incorporation of contextual information in species identification presents a significant problem worthy of investigation.</p><p>Nevertheless, contextual information has been under-exploited in the literature of image classification; standard image classification models <ref type="bibr">[24,</ref><ref type="bibr">59]</ref> often disregard the contextual information tied to images. This is partly due to the heterogeneous nature of the context, which makes it challenging to incorporate contextual information in image classification using a unified learning framework. Contextual information in different modalities (e.g., numerical values, textual descriptions, or structured taxonomies) is usually represented separately from the image in distinct feature spaces. The question of effectively combining features from these different spaces within a unified learning framework remains unanswered. Existing research typically treats all the features as additional input to the classifier via feature vector concatenation <ref type="bibr">[6,</ref><ref type="bibr">21,</ref><ref type="bibr">32]</ref> or utilizes fusion to obtain aggregate representations <ref type="bibr">[16,</ref><ref type="bibr">18]</ref>. 
Despite their simplicity, such approaches are incapable of capturing complex structural and semantic relationships between images and various contextual information. Additionally, these approaches assume a uniform availability of contextual information across all images, which is often unrealistic in real-world scenarios. As a result, their flexibility is limited, especially when certain images lack some contextual details, such as coordinates or timestamps, as in camera trap photos.</p><p>To this end, we propose a new learning framework, COSMO (Classification Of Species using Multimodal cOntext), where we first organize all species images and contextual information as a multimodal knowledge graph (KG) and then reformulate species classification as the standard link prediction task on the KG. Specifically, we consider species images, their corresponding labels (which are available in the training data), and their associated attributes provided in the context as entities within our KG (see Figure <ref type="figure">1</ref> for an example). We represent the relationships between these entities as edges in our KG (see a more concrete description of our KG construction in Section 3.2). Our KG is multimodal because its entities belong to different modalities. In this context, species classification can be framed as a link prediction task, where the objective is to predict the presence of an edge between an image and its corresponding species label within the KG. This learning framework enables a unified way to incorporate heterogeneous contextual information for species classification. Each form of multimodal information is treated as a type of entity, a first-class citizen of the multimodal KG with its representation computed using a modality-specific encoder. The learning process enables the interaction of different modalities in a joint feature space for robust representation learning. 
In addition, COSMO demonstrates greater flexibility by not assuming uniform availability of all contextual information, unlike previous methods.</p><p>We employ the widely used DistMult <ref type="bibr">[70]</ref> model as our backbone model for link prediction to instantiate the COSMO framework. To assess the performance of COSMO, particularly in terms of out-of-distribution generalization, we conduct experiments on the iWildCam2020-WILDS benchmark <ref type="bibr">[30]</ref> and Snapshot Mountain Zebra <ref type="bibr">[48]</ref>, which are standard datasets for species classification in camera trap photos. They contain naturally occurring wildlife photos associated with metadata. Factors like variation in illumination, camera pose, and motion blur pose challenges for robustness and generalization, making these benchmarks an ideal testbed for assessing our framework's effectiveness. We show that COSMO offers a unified framework to incorporate heterogeneous context, leading to improved species classification performance over existing out-of-distribution generalization approaches.</p><p>The main contribution of this work is three-fold:</p><p>&#8226; We propose a novel framework, COSMO, that reformulates species classification as link prediction in a multimodal knowledge graph, which provides a unified way to incorporate heterogeneous forms of contextual information associated with images for visual recognition.</p><p>&#8226; We instantiate this framework for species classification of wildlife images, including the construction of a novel multimodal KG for this problem that integrates spatiotemporal information and structured biology knowledge.</p><p>&#8226; Evaluation on the standard iWildCam2020-WILDS and Snapshot Mountain Zebra datasets demonstrates that COSMO achieves competitive performance compared with standard species classification methods, especially in improving robustness and OOD generalization.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Figure <ref type="figure">1</ref>: Overview of our framework COSMO. Left (Multimodal Knowledge Graph): Our multimodal knowledge graph for camera traps and wildlife. Photos from camera traps are jointly represented in the KG with contextual information such as time, location, and structured biology taxonomy. The taxonomy is obtained from Open Tree Taxonomy (OTT) <ref type="bibr">[46]</ref> or iNaturalist <ref type="bibr">[25]</ref>. Right (COSMO Model Architecture): In our formulation of species classification as link prediction, the plausibility score &#120595; (&#119904;, &#119903;, &#119900;) of each (subject, relation, object) triplet is computed using a KGE model (e.g., DistMult), where the subject, relation, and object are all first embedded into a vector space. Specifically, for our multimodal KG, we represent visual entities using a ResNet-50 pre-trained on ImageNet and represent numerical entities using an MLP. For categorical entities and relations, we directly represent them with embedding lookups.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>Species Recognition in Camera Traps. Deep neural networks such as CNNs have been successfully deployed for large-scale recognition of camera trap images [43, 64, 68]. This has paved the way for significant savings in logistics costs for biodiversity conservation. However, training such models often requires enormous amounts of data to perform well. Sadegh Norouzzadeh et al. and Bothmann et al. propose active learning approaches to mitigate the sample inefficiency of training species classification models in such systems. Another challenge arises from the tendency of these models to overfit to the backgrounds present in the training images <ref type="bibr">[9,</ref><ref type="bibr">39]</ref>, which limits their deployment to new camera trap locations <ref type="bibr">[55,</ref><ref type="bibr">64]</ref>. Improving robustness to new locations is a significant research challenge <ref type="bibr">[8,</ref><ref type="bibr">63]</ref>, leading to the curation of datasets like iWildCam2020-WILDS <ref type="bibr">[30]</ref> to test OOD generalization for such systems. Domain adaptation approaches in the literature seek to mitigate this issue by distributionally robust optimization <ref type="bibr">[26,</ref><ref type="bibr">50]</ref> or learning domain invariant features <ref type="bibr">[61]</ref>. In contrast, this work helps improve the robustness to new camera trap locations by utilizing a multimodal KG of heterogeneous contexts.</p><p>Image Classification with Auxiliary Information. 
Despite the ubiquity of contextual metadata, the potential of leveraging them for image classification has been largely under-explored. Previous studies have primarily treated metadata as additional input features for classifiers <ref type="bibr">[6,</ref><ref type="bibr">21,</ref><ref type="bibr">32]</ref>, representing a shallow use that fails to capture the intricate relationships between metadata and images. Some works have attempted to model pairwise dependencies between images using heuristics based on metadata, such as shared tags on social media <ref type="bibr">[35,</ref><ref type="bibr">38]</ref> or aggregating information from neighborhood images with similar metadata <ref type="bibr">[28]</ref> while disregarding more complex relationships among images, metadata, and labels. Metaformer <ref type="bibr">[18]</ref> feeds a sequence of image patches and metadata to a Transformer model for their fusion. Additionally, these methods assume a uniform availability of metadata for all images, which is often not the case in reality due to data scarcity. For instance, the camera trap location coordinates may not be available in some cases due to privacy and security reasons. In our work, we do not assume such uniform availability and build the multimodal KG using available metadata.</p><p>Apart from the metadata, external sources of knowledge are also used in image classification. For instance, Jayathilaka et al. embed each class as a vector based on a hierarchy derived from WordNet <ref type="bibr">[40]</ref>. <ref type="bibr">Alsallakh</ref> et al. develop a class hierarchy-aware CNN for image classification on ImageNet. Similarly, Bertinetto et al. and Zhang et al. design hierarchy-aware objectives to incorporate taxonomy in image representations. BioCLIP [60] verbalizes the taxonomic hierarchy to train a CLIP-style foundational model for species classification across plants, animals, and fungi. Marino et al. 
represent images as local subgraphs of Visual Genome <ref type="bibr">[31]</ref>. In contrast, COSMO constructs a global KG with both metadata and external knowledge, e.g., taxonomy information from Open Tree Taxonomy, and approaches image classification as link prediction within the KG. Our novel formulation is flexible in handling data scarcity of metadata and enables reasoning over diverse relationships present in the KG.</p><p>KG Link Prediction. Most real-world KGs are incomplete. The task of link prediction or knowledge graph completion (KGC) tries to infer missing links given the observed ones. Early approaches for link prediction range from translation-based models <ref type="bibr">[12,</ref><ref type="bibr">34]</ref> and semantic matching models <ref type="bibr">[42,</ref><ref type="bibr">70]</ref> to the ones that leverage neural networks like feedforward neural networks <ref type="bibr">[20]</ref>, CNNs <ref type="bibr">[17,</ref><ref type="bibr">41]</ref>, and Transformer-based models <ref type="bibr">[15,</ref><ref type="bibr">53,</ref><ref type="bibr">71]</ref>. These methods use a parameterized scoring function based on learned entity and relation embeddings to calculate the plausibility of a particular triplet. However, it could be challenging to fully encode the rich semantic information of KGs into such shallow embeddings. To mitigate this, Schlichtkrull et al., <ref type="bibr">Vashishth et al.</ref>, Yu et al., and Pahuja et al. use graph neural networks (GNNs) to encode the rich neighborhood context of entities for link prediction. In our framework, we employ a global multimodal KG, which consists of biological taxonomy and metadata, as the context to enhance OOD generalization.</p><p>Multimodal KG Reasoning. Multimodal KGs extend traditional KGs by including entities of different modalities such as categorical data, images, numerical data, etc. 
KBLRN <ref type="bibr">[22]</ref> is a pioneering work in multimodal KG reasoning that uses extra information in the form of relational and numerical features. Similarly, IKRL <ref type="bibr">[57]</ref> proposes a fusion of linguistic and visual information with structured information for link prediction. MKBE <ref type="bibr">[49]</ref> constructs a multimodal KG using numerical, image, and textual information, treating them as entities instead of auxiliary features, for the link prediction task. MR-GCN <ref type="bibr">[69]</ref> further extends it by supporting more modalities, e.g., numerical, temporal, textual, visual, and spatial predicate links in the multimodal KG. To provide a more expressive way for interaction between different modalities, IMF <ref type="bibr">[33]</ref> uses bilinear pooling to fuse multiple modality features and trains it using contrastive learning on the contextual entity representations. Our work leverages link prediction in a multimodal KG to enable out-of-distribution generalization for species classification in camera traps.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Methodology</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Preliminaries</head><p>Multimodal KG. Given a set of KG entities with categorical values <formula notation="TeX">\mathcal{E}_{KG}</formula>, multimodal entities <formula notation="TeX">\mathcal{E}_{MM}</formula>, and a set of relations <formula notation="TeX">\mathcal{R}</formula>, a multimodal KG can be defined as a collection of facts <formula notation="TeX" rend="display">\mathcal{F} \subseteq (\mathcal{E}_{KG} \cup \mathcal{E}_{MM}) \times \mathcal{R} \times (\mathcal{E}_{KG} \cup \mathcal{E}_{MM}),</formula> where each fact is a triplet (&#8462;, &#119903;, &#119905;) consisting of a head entity, a relation, and a tail entity.</p><p>KG Link Prediction. The task of link prediction is to infer missing facts based on known facts in a KG. Given a link prediction query (&#8462;, &#119903;, ?) or (?, &#119903;, &#119905;), the model ranks the target entity among the set of candidate entities.</p><p>Problem Setup. The task entails species recognition for camera trap images amidst distribution shifts. The training and test sets comprise images obtained from disjoint camera traps, enabling the evaluation of out-of-distribution (OOD) generalization. During training, we use the multimodal KG to train our model, while at inference time we use only the image to make predictions. The goal is to learn visual representations robust to distribution shifts by leveraging the rich structural and semantic information provided by the multimodal knowledge graph.</p></div>
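<div xmlns="http://www.tei-c.org/ns/1.0"><p>The ranking step of link prediction described above can be illustrated with a minimal sketch. It assumes that a plausibility score is already available for every candidate tail entity of a query such as (image, instance_of, ?). The function names, entity identifiers, and score values are hypothetical and only serve the illustration.</p><p><code lang="python"><![CDATA[
```python
from typing import Dict, List


def rank_candidates(scores: Dict[str, float]) -> List[str]:
    """Return candidate tail entities sorted from most to least plausible."""
    return sorted(scores, key=lambda entity: scores[entity], reverse=True)


def rank_of_target(scores: Dict[str, float], target: str) -> int:
    """1-based rank of the target entity among all candidates."""
    return rank_candidates(scores).index(target) + 1


# Hypothetical plausibility scores for the query (image_1, instance_of, ?).
example_scores = {
    "thomsons_gazelle": 2.1,
    "red_fronted_gazelle": 1.4,
    "ocelot": -0.7,
}

# Species classification then amounts to picking the top-ranked label.
predicted_species = rank_candidates(example_scores)[0]
```
]]></code></p><p>Under this view, classification accuracy corresponds to how often the true species label is ranked first among all candidate labels.</p></div>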
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Building the Multimodal KG</head><p>The multimodal KG comprises entities from different modalities interconnected by heterogeneous relationships. The base KG consists of camera trap images linked with their species labels from the training set via triplets of the form (&lt;image&gt;, instance of, &lt;species label&gt;). Next, we progressively augment the KG with links connecting the existing entities to contextual information. In this work, we utilize the following attributes to provide context for species classification:</p><p>&#8226; Taxonomy: The taxonomy forms the core of the multimodal knowledge graph, connecting distinct species to higher-order taxa. For iWildCam2020-WILDS, we obtain the phylogenetic taxonomy corresponding to the species of interest from Open Tree Taxonomy (OTT) <ref type="bibr">[46]</ref> and manually link it to the species in the dataset. For the Snapshot Mountain Zebra dataset, we utilize the iNaturalist taxonomy <ref type="bibr">[25]</ref> mapping provided by <ref type="url">www.lila.science</ref>.</p><p>&#8226; Location: The camera trap images are associated with the GPS coordinates of their source cameras. For the iWildCam2020-WILDS dataset, this metadata is available for a portion of the images (67%) and is obfuscated to within 1 km for privacy reasons. Animals demonstrate a preference for particular habitats; thus, the location context attribute is useful for species recognition.</p><p>&#8226; Time: The timestamp attribute indicates the precise moment when the image was captured. This timestamp information proves valuable in species recognition since specific animals exhibit activity patterns tied to particular times of the day, such as feeding, hunting, or defending their territory. In our multimodal knowledge graph, we utilize the timestamp information at an hourly granularity.</p><p>Figure <ref type="figure">1</ref> presents a schematic representation of various contexts in a multimodal KG. 
For location, time, and taxonomy attributes, the corresponding RDF triplets can be represented as (&lt;image&gt;, location, &lt;GPS coordinates&gt;), (&lt;image&gt;, time, &lt;timestamp&gt;), and (&lt;taxon_1&gt;, parent, &lt;taxon_2&gt;), respectively.</p></div>
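<div xmlns="http://www.tei-c.org/ns/1.0"><p>The KG construction described in this section can be sketched as a small routine that emits triplets. This is a minimal illustration, not the authors' code: the helper names and image identifiers are hypothetical, the direction of the parent edge (child taxon to parent taxon) is an assumption, and the example values are taken from Figure 1. Images without location or time metadata simply contribute fewer edges.</p><p><code lang="python"><![CDATA[
```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, object]


def image_triples(
    image_id: str,
    species: str,
    location: Optional[Tuple[float, float]] = None,
    hour: Optional[int] = None,
) -> List[Triple]:
    """Triples for one training image; missing metadata simply adds no edge."""
    triples: List[Triple] = [(image_id, "instance_of", species)]
    if location is not None:
        triples.append((image_id, "location", location))
    if hour is not None:
        triples.append((image_id, "time", hour))  # hourly granularity
    return triples


def taxonomy_triples(parent_of: Dict[str, str]) -> List[Triple]:
    """One (taxon, parent, parent_taxon) edge per entry of a child -> parent map.

    Assumption: the edge points from the child taxon to its parent taxon.
    """
    return [(child, "parent", parent) for child, parent in parent_of.items()]


# Toy inputs (hypothetical image ids; taxa and values follow Figure 1).
kg: List[Triple] = []
kg += image_triples("img_001", "Eudorcas thomsonii", (-0.586389, 36.817223), 21)
kg += image_triples("img_002", "Eudorcas rufifrons")  # no location or time
kg += taxonomy_triples({
    "Eudorcas thomsonii": "Eudorcas",
    "Eudorcas rufifrons": "Eudorcas",
    "Eudorcas": "Bovidae",
})
```
]]></code></p><p>Because each kind of context is just another set of edges, adding or dropping a context type only changes which triplets are emitted, not the model interface.</p></div>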
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Model Architecture</head><p>We use DistMult <ref type="bibr">[70]</ref>, a strong baseline on KGE benchmarks, as our backbone KG embedding model. <ref type="foot">2</ref> Note that COSMO is a general framework that can leverage a variety of KG embedding models proposed in the literature. DistMult uses a bilinear scoring function between the embeddings of the subject and object entities. For a given triplet (&#8462;, &#119903;, &#119905;), the scoring function of DistMult is defined as: <formula notation="TeX" rend="display">\psi(h, r, t) = \boldsymbol{h}^{\top} \boldsymbol{W}_r \, \boldsymbol{t}</formula></p><p>Here, &#119945; and &#119957; denote the vector representations of the head entity and tail entity, respectively. The relation representation is parameterized by <formula notation="TeX">\boldsymbol{W}_r \in \mathbb{R}^{d \times d}</formula>, a diagonal matrix.</p></div>
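<div xmlns="http://www.tei-c.org/ns/1.0"><p>Because the relation matrix is diagonal, the DistMult score reduces to a sum of element-wise products of the head, relation, and tail vectors. The following sketch computes it in plain Python. The function name and the toy vectors are illustrative assumptions; in the actual model the vectors would come from the modality-specific encoders.</p><p><code lang="python"><![CDATA[
```python
from typing import Sequence


def distmult_score(
    h: Sequence[float], r_diag: Sequence[float], t: Sequence[float]
) -> float:
    """psi(h, r, t) = h^T diag(r_diag) t = sum_i h_i * r_i * t_i."""
    if not (len(h) == len(r_diag) == len(t)):
        raise ValueError("embeddings must share the same dimension")
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r_diag, t))


# Toy 3-dimensional embeddings (hypothetical values).
head = [1.0, 2.0, 0.5]
relation_diagonal = [0.5, -1.0, 2.0]
tail = [2.0, 1.0, 4.0]

score = distmult_score(head, relation_diagonal, tail)
```
]]></code></p><p>One consequence of the diagonal parameterization is that the score is symmetric in the head and tail, so swapping the two entities yields the same value.</p></div>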
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.1">Multi-modality Encoders.</head><p>We use an ImageNet pre-trained ResNet-50 <ref type="bibr">[24]</ref> as the image encoder. The base feature of each location is represented as a 2D vector [latitude, longitude]. Following prior work <ref type="bibr">[49]</ref>, we use an MLP to project the 2D location feature to a higher dimensional space. Similarly, for temporal context, we use an MLP to project the integer value of the hour timestamp to the higher dimensional embedding space. For categorical entities such as species labels and taxa, we learn dense embeddings as representations.</p></div>
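<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a rough illustration of the numerical encoders, the sketch below projects a 2D location and an integer hour into a shared embedding space with small MLPs. It is written in plain Python rather than PyTorch so that it is self-contained. The three-layer depth and PReLU activation follow the implementation details reported later in the paper; the hidden width, the random initialization, the fixed PReLU slope, and the small embedding size are assumptions made for the example.</p><p><code lang="python"><![CDATA[
```python
import random
from typing import List, Tuple

Matrix = List[List[float]]
Layer = Tuple[Matrix, List[float]]


def prelu(x: float, slope: float = 0.25) -> float:
    """PReLU: identity for positive inputs, a slope for negative ones.

    The slope is learnable in practice; it is fixed here for simplicity.
    """
    return x if x > 0 else slope * x


def linear(x: List[float], weight: Matrix, bias: List[float]) -> List[float]:
    """Affine map y = W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Randomly initialised weights and zero biases (illustrative only)."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out


def mlp_encode(x: List[float], layers: List[Layer]) -> List[float]:
    """Apply each layer, with PReLU between layers but not after the last."""
    for i, (weight, bias) in enumerate(layers):
        x = linear(x, weight, bias)
        if i < len(layers) - 1:
            x = [prelu(v) for v in x]
    return x


rng = random.Random(0)
embed_dim = 8  # the paper reports 512; kept small for the example
hidden = 16    # assumed hidden width
location_mlp = [init_layer(2, hidden, rng), init_layer(hidden, hidden, rng),
                init_layer(hidden, embed_dim, rng)]
time_mlp = [init_layer(1, hidden, rng), init_layer(hidden, hidden, rng),
            init_layer(hidden, embed_dim, rng)]

location_embedding = mlp_encode([-0.586389, 36.817223], location_mlp)  # [lat, lon]
time_embedding = mlp_encode([21.0], time_mlp)  # hour of day
```
]]></code></p><p>Both outputs live in the same space as the image and categorical embeddings, which is what allows a single scoring function to compare entities of different modalities.</p></div>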
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.2">Training.</head><p>We train the model using an optimization strategy based on the modality of the tail entity. For categorical attributes, we formulate it as a multi-class classification problem and use standard cross-entropy loss to train the model. For instance, in the case of a given image-species label ground truth triplet (I, instance of, &#119904;), the loss is defined as: <formula notation="TeX" rend="display">\mathcal{L}_{(I,\,\text{instance of},\,s)} = -\log \frac{\exp(\psi(I, \text{instance of}, s))}{\sum_{s' \in S} \exp(\psi(I, \text{instance of}, s'))}</formula></p><p>where &#119878; denotes the set of all species labels and &#120595; (&#8462;, &#119903;, &#119905;) denotes the plausibility score of KG edge (&#8462;, &#119903;, &#119905;).</p><p>For numerical attributes such as location and time, we formulate it as a multi-class multi-label classification problem and use a binary cross-entropy loss to optimize the parameters. This choice is motivated by the fact that images can be associated with a range of GPS coordinates and timestamps, e.g., most animals are active multiple times during the day. The label space comprises all entities of the ground truth modality. For instance, in the case of a given time modality ground truth triplet (I, time, &#119905;), the loss is defined as: <formula notation="TeX" rend="display">\mathcal{L}_{(I,\,\text{time},\,t)} = -\sum_{t' \in T} \left[ l^{I,\text{time}}_{t'} \log \sigma(\psi(I, \text{time}, t')) + \left(1 - l^{I,\text{time}}_{t'}\right) \log\left(1 - \sigma(\psi(I, \text{time}, t'))\right) \right]</formula></p><p>where <formula notation="TeX">T</formula> denotes the set of all time entities, <formula notation="TeX">l^{I,\text{time}}_{t'}</formula> is a binary label that indicates whether the triplet (I, &#119905;&#119894;&#119898;&#119890;, &#119905; &#8242; ) exists in the set of observed triplets, and &#120590; (&#8226;) is the sigmoid activation function. We train the model by sequentially minimizing the objective on each type of context triplet. Figure <ref type="figure">1</ref> illustrates the overall model architecture.</p></div>
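<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two training objectives can be sketched directly from plausibility scores. The example below is a plain-Python illustration under stated assumptions: the function names are hypothetical, the scores are toy values, and the binary cross-entropy is summed over candidates (a mean would differ only by a constant factor). A training framework would normally provide numerically hardened versions of both losses.</p><p><code lang="python"><![CDATA[
```python
import math
from typing import Sequence


def cross_entropy_loss(scores: Sequence[float], target_index: int) -> float:
    """-log softmax(scores)[target_index], computed stably via log-sum-exp."""
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[target_index]


def sigmoid(x: float) -> float:
    """Logistic function; adequate for the moderate scores used here."""
    return 1.0 / (1.0 + math.exp(-x))


def binary_cross_entropy_loss(
    scores: Sequence[float], labels: Sequence[int]
) -> float:
    """Sum of per-candidate binary cross-entropy terms over the label space."""
    total = 0.0
    for s, l in zip(scores, labels):
        p = sigmoid(s)
        total -= l * math.log(p) + (1 - l) * math.log(1.0 - p)
    return total


# Species edge: scores over three candidate labels, true label at index 0.
species_loss = cross_entropy_loss([2.0, 0.5, -1.0], target_index=0)

# Time edge: one score per hour entity (only 4 shown), multi-hot labels
# marking every hour observed for this image in the KG.
time_loss = binary_cross_entropy_loss([1.5, -0.5, 0.2, -2.0], [1, 0, 1, 0])
```
]]></code></p><p>The cross-entropy term forces the species labels to compete for a single image, whereas the binary terms let several time or location entities be plausible for the same image at once.</p></div>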
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experimental Setup</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Datasets</head><p>We test our approach on the iWildCam2020-WILDS dataset <ref type="bibr">[30]</ref>, a variant of the iWildCam 2020 dataset <ref type="bibr">[7]</ref>, and on Snapshot Mountain Zebra <ref type="bibr">[48]</ref>. iWildCam2020-WILDS is a benchmark dataset designed to test OOD generalization for the task of species classification. The label space consists of 182 species. Each domain corresponds to a different location of the camera trap. The training and test images belong to disjoint sets of locations in the OOD setting. Snapshot Mountain Zebra comprises camera trap images taken at the Mountain Zebra National Park in South Africa as a part of the Snapshot Safari project <ref type="bibr">[48]</ref>. The label space consists of 53 species, mostly annotated at the species level. Prominent animal species include Cape Mountain zebra, kudu, and springbok. The location coordinates are not available for this dataset due to privacy and security reasons. Because no standard split exists, we manually split the images so that each split contains a disjoint set of camera traps. These datasets pose a significant challenge for species recognition due to factors like inadequate illumination, motion blur, occlusion, temporal variations, and diverse weather conditions, effectively reflecting the complexities of real-life camera trap usage. Dataset statistics are shown in Table <ref type="table">2</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Baselines</head><p>We use COSMO with no context, which uses only the species label edges, as our baseline. In addition, we compare with the following baseline algorithms for OOD generalization: Empirical Risk Minimization (ERM) <ref type="bibr">[30]</ref>, which trains the model to minimize average training loss; CORAL <ref type="bibr">[61]</ref>, a method for unsupervised domain adaptation that learns domain invariant features; Group DRO <ref type="bibr">[26]</ref>, an algorithm that uses distributionally robust optimization to perform well on subpopulation shifts; Fish <ref type="bibr">[58]</ref>, which attempts domain adaptation using gradient matching; and ABSGD <ref type="bibr">[50]</ref>, an optimization method for addressing data imbalance. As an alternative way of incorporating contextual information, we implement MLP-concat, a baseline which utilizes the location and temporal features at both training and inference time. It uses vanilla concatenation to fuse visual and spatiotemporal representations, which are then fed into an MLP. The missing features are substituted by a mean value computed over the training dataset. All models use a pre-trained ResNet-50 as the image encoder. We evaluate the models using overall accuracy as the metric.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Implementation Details</head><p>We implement our models in PyTorch. The hidden dimension of the multimodal KG embedding model is set to 512, with a batch size of 16. The images are resized to 448 &#215; 448 before input to the image encoder. For the location and time attributes, we use a 3-layer MLP that projects the feature input dimension to the embedding dimension and uses PReLU as the activation function. We use the Adam <ref type="bibr">[29]</ref> optimizer with learning rates of 3e-5 and 1e-3 for the image encoder and the rest of the parameters, respectively. In our experiments, the models on iWildCam2020-WILDS and Snapshot Mountain Zebra were trained for 12 and 15 epochs, respectively. We use early stopping based on validation accuracy to prevent overfitting. The early stopping patience parameter is set to 5 epochs. All results are reported as averages across three random seeds.</p></div>
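<div xmlns="http://www.tei-c.org/ns/1.0"><p>The early stopping rule mentioned above can be sketched as follows. This is a generic illustration under assumptions, not the authors' training loop: the function name is hypothetical, the validation accuracies are made-up values, and "patience" is interpreted as the number of consecutive epochs without improvement that are tolerated before training stops.</p><p><code lang="python"><![CDATA[
```python
from typing import Iterable, Tuple


def early_stopping_epoch(
    val_accuracies: Iterable[float], patience: int = 5
) -> Tuple[int, float]:
    """Return (best_epoch, best_accuracy) under patience-based early stopping.

    Training stops once `patience` consecutive epochs fail to improve on the
    best validation accuracy seen so far; later epochs are never evaluated.
    """
    best_epoch, best_acc, waited = -1, float("-inf"), 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_epoch, best_acc, waited = epoch, acc, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, best_acc


# Hypothetical validation curve (percent accuracy per epoch).
curve = [60.1, 62.0, 63.2, 63.0, 62.5, 62.9, 63.1, 62.0, 70.0]
best_epoch, best_acc = early_stopping_epoch(curve, patience=5)
```
]]></code></p><p>In this toy curve the run stops after five non-improving epochs, so the checkpoint from the best epoch is the one that would be kept for evaluation.</p></div>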
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Results</head><p>In this section, we attempt to answer the following questions: Q1. Does the use of contextual information contribute toward better performance? (Section 5.1) Q2. How does COSMO's performance compare to the existing state-of-the-art? (Section 5.2) Q3. Does the taxonomy-aware COSMO model result in more semantically plausible predictions? (Section 5.4.1) Q4. How does COSMO's performance compare to baselines for under-represented species? (Section 5.4.3)</p></div>
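<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Performance Comparison with Addition of Multimodal Context</head><p>We add taxonomy, location, and temporal context information to the base KG and observe the impact on the species classification performance. Table <ref type="table">1</ref> shows the results for the iWildCam2020-WILDS dataset. We make the following observations from these results: Firstly, the addition of one or more contexts results in a performance gain over the no-context baseline in the vast majority of cases. For instance, in the case of COSMO with taxonomy, we obtain a 3.6% gain in test accuracy. We further analyze the role of location in predicting the species distribution in Section 5.4.2. Additionally, utilizing the time attribute yields a substantial improvement over the no-context baseline, resulting in a 2.3% gain in test accuracy. Secondly, we observe that the use of multiple contexts results in a performance boost in a majority of cases. For instance, the addition of location and time attributes improves over the taxonomy baseline by a margin of 2.6% and 2.1%, respectively, in terms of the validation set accuracy. Similarly, the taxonomy with time baseline obtains an improvement of 1.3% and 2.6% over the taxonomy and time baselines, respectively, in terms of test accuracy.</p>
<table><head>Table 1: Species classification results on iWildCam2020-WILDS (OOD) dataset. The first baseline in the second section shows the no-context baseline that uses only image-species labels as KG edges. All models use a pre-trained ResNet-50 as image encoder. Parentheses show standard deviation across 3 random seeds. We highlight the best result in bold and the second best with underline. We mark the improvements over COSMO (no-context) in green. Missing values are denoted by -.</head>
<row role="label"><cell>Model</cell><cell cols="3">Multi-modality</cell><cell>Val. Acc. (%)</cell><cell>Test Acc. (%)</cell></row>
<row role="label"><cell></cell><cell>Taxonomy</cell><cell>Location</cell><cell>Time</cell><cell></cell><cell></cell></row>
<row><cell>Empirical Risk Minimization (ERM) [30]</cell><cell cols="3">-</cell><cell>62.7 (&#177;2.4)</cell><cell>71.6 (&#177;2.5)</cell></row>
<row><cell>CORAL [61]</cell><cell cols="3">-</cell><cell>60.3 (&#177;2.8)</cell><cell>73.3 (&#177;4.3)</cell></row>
<row><cell>Group DRO [26]</cell><cell cols="3">-</cell><cell>60.0 (&#177;0.7)</cell><cell>72.7 (&#177;2.0)</cell></row>
<row><cell>Fish [58]</cell><cell cols="3">-</cell><cell>58.0 (&#177;0.2)</cell><cell>63.2 (&#177;0.7)</cell></row>
<row><cell>ABSGD [50]</cell><cell cols="3">-</cell><cell>-</cell><cell>72.7 (&#177;1.8)</cell></row>
<row><cell>MLP-concat</cell><cell></cell><cell>&#10003;</cell><cell>&#10003;</cell><cell>27.3 (&#177;0.8)</cell><cell>39.6 (&#177;1.0)</cell></row>
<row><cell>COSMO (no-context)</cell><cell cols="3">-</cell><cell>63.2 (&#177;0.4)</cell><cell>68.8 (&#177;2.1)</cell></row>
<row><cell cols="6">Single context</cell></row>
<row><cell>COSMO</cell><cell>&#10003;</cell><cell></cell><cell></cell><cell>62.8 (&#177;2.2) (-0.4)</cell><cell>72.4 (&#177;2.5) (+3.6)</cell></row>
<row><cell>COSMO</cell><cell></cell><cell>&#10003;</cell><cell></cell><cell>64.4 (&#177;1.0) (+1.2)</cell><cell><hi rend="bold">74.5</hi> (&#177;3.6) (+5.7)</cell></row>
<row><cell>COSMO</cell><cell></cell><cell></cell><cell>&#10003;</cell><cell>64.7 (&#177;0.4) (+1.5)</cell><cell>71.1 (&#177;3.1) (+2.3)</cell></row>
<row><cell cols="6">Multiple contexts</cell></row>
<row><cell>COSMO</cell><cell>&#10003;</cell><cell>&#10003;</cell><cell></cell><cell><hi rend="bold">65.4</hi> (&#177;0.4) (+2.2)</cell><cell>70.4 (&#177;2.1) (+1.6)</cell></row>
<row><cell>COSMO</cell><cell>&#10003;</cell><cell></cell><cell>&#10003;</cell><cell>64.9 (&#177;1.6) (+1.7)</cell><cell>73.7 (&#177;3.8) (+4.9)</cell></row>
<row><cell>COSMO</cell><cell></cell><cell>&#10003;</cell><cell>&#10003;</cell><cell>63.0 (&#177;2.1) (-0.2)</cell><cell><hi rend="underline">74.2</hi> (&#177;2.2) (+5.4)</cell></row>
<row><cell>COSMO</cell><cell>&#10003;</cell><cell>&#10003;</cell><cell>&#10003;</cell><cell><hi rend="underline">65.0</hi> (&#177;1.6) (+1.8)</cell><cell>71.5 (&#177;2.8) (+2.7)</cell></row>
</table>
<p>Table <ref type="table">3</ref> shows the results for the Snapshot Mountain Zebra dataset. Incorporating taxonomy and time contexts results in a performance boost of 1% and 2.4%, respectively, over the no-context baseline. Furthermore, their combined use yields a noteworthy 3.9% gain in test accuracy.</p>
<table><head>Table 3: Species classification results on Snapshot Mountain Zebra dataset. We obtain the results for OOD baselines by training them on this dataset using publicly available code. We mark the improvements over COSMO (no-context) in green.</head>
<row role="label"><cell>Model</cell><cell cols="2">Multi-modality</cell><cell>Test Acc. (%)</cell></row>
<row role="label"><cell></cell><cell>Taxonomy</cell><cell>Time</cell><cell></cell></row>
<row><cell>ERM <ref type="bibr">[30]</ref></cell><cell cols="2">-</cell><cell>96.2 (&#177;0.6)</cell></row>
<row><cell>CORAL <ref type="bibr">[61]</ref></cell><cell cols="2">-</cell><cell>96.6 (&#177;1.2)</cell></row>
<row><cell>Group DRO <ref type="bibr">[26]</ref></cell><cell cols="2">-</cell><cell>93.4 (&#177;2.1)</cell></row>
<row><cell>ABSGD <ref type="bibr">[50]</ref></cell><cell cols="2">-</cell><cell>93.4 (&#177;2.0)</cell></row>
<row><cell>MLP-concat</cell><cell></cell><cell>&#10003;</cell><cell>94.7 (&#177;0.0)</cell></row>
<row><cell>COSMO (no-context)</cell><cell cols="2">-</cell><cell>92.9 (&#177;2.5)</cell></row>
</table></div>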
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Comparison with OOD Generalization Approaches</head><p>We compare the performance of COSMO with methods specifically designed for out-of-distribution generalization. Notably, our best-performing model, which uses location as context, achieves state-of-the-art OOD test accuracy on the iWildCam2020-WILDS dataset, outperforming the existing SOTA model (CORAL) by 1.2%. Likewise, COSMO with taxonomy and time contexts outperforms existing approaches on the Snapshot Mountain Zebra dataset. This demonstrates the effectiveness of leveraging diverse multimodal contexts for more robust OOD generalization, even without the sophisticated objectives aimed at improving domain generalization that are used by methods such as CORAL, Group DRO, ABSGD, and Fish.</p><p>The MLP-concat baseline overfits to the training camera trap locations on the iWildCam2020-WILDS dataset, resulting in suboptimal performance. COSMO consistently outperforms the MLP-concat baseline by a significant margin across both datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Comparison with Alternative KGE Backbones</head><p>In our preliminary experiments, we explored ConvE <ref type="bibr">[17]</ref>, a strong neural network baseline, as an alternative to DistMult for the KGE backbone on the iWildCam2020-WILDS dataset (Table <ref type="table">4</ref>). Sun et al. <ref type="bibr">[62]</ref> show that ConvE outperforms more recent neural network KGE models when evaluated properly. We observe that DistMult outperforms ConvE in a majority of cases, particularly when incorporating all context types simultaneously. Furthermore, DistMult is more computationally efficient than neural network based KG embedding approaches.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4">Fine-grained Analyses</head><p>To analyze the predictions of our model with and without taxonomy information, we employ a metric that takes into account the hierarchical structure of the labels. Conventional measures like top-1 accuracy treat all errors equally, disregarding the semantic relationships among labels. Hence, we use the Least Common Ancestor (LCA) <ref type="bibr">[10]</ref> for the misclassified examples as the metric for this analysis (Table <ref type="table">5</ref>). A lower LCA value indicates that the errors made by the taxonomy-aware model are semantically closer to the true label than those made by the baseline. We compare the predictions of COSMO with taxonomy to those of the no-context baseline (Figure <ref type="figure">2</ref>). Notably, the inclusion of taxonomy information helps the model avoid implausible predictions. For instance, consider the ocelot (Leopardus pardalis), which belongs to the cat-like suborder Feliformia. The use of taxonomy information prevents the misprediction of this animal as a gray fox, which belongs to the dog-like suborder Caniformia.
Similarly, in the second example, the baseline model incorrectly predicts the given image as Central American agouti, a mammal, instead of ocellated turkey, a bird.</p></div>
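<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the hierarchical metric concrete, the following minimal sketch computes one common variant of an LCA-based error measure: the number of edges from the true label up to the lowest common ancestor of the true and predicted labels. The toy taxonomy, the function names, and the choice of this particular variant are illustrative assumptions; they are not taken from the released code, and the exact definition used for Table 5 may differ.</p>
<p>
```python
# Toy taxonomy (hypothetical subset, child -> parent); not the paper's KG.
PARENT = {
    "ocelot": "Felidae",
    "margay": "Felidae",
    "gray fox": "Canidae",
    "Felidae": "Feliformia",
    "Canidae": "Caniformia",
    "Feliformia": "Carnivora",
    "Caniformia": "Carnivora",
    "Carnivora": "Mammalia",
    "ocellated turkey": "Aves",
    "Mammalia": "Animalia",
    "Aves": "Animalia",
}


def path_to_root(node):
    """Return [node, parent, grandparent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lca_height(true_label, predicted_label):
    """Number of edges from the true label up to the lowest common ancestor."""
    predicted_ancestors = set(path_to_root(predicted_label))
    for height, node in enumerate(path_to_root(true_label)):
        if node in predicted_ancestors:
            return height
    raise ValueError("labels do not share a root")


def mean_lca_height(pairs):
    """Average LCA height over misclassified (true, predicted) pairs only."""
    errors = [(t, p) for t, p in pairs if t != p]
    if not errors:
        return 0.0
    return sum(lca_height(t, p) for t, p in errors) / len(errors)
```
</p>
<p>In this toy tree, confusing an ocelot with a margay (same family) gives a height of 1, whereas confusing it with a gray fox gives a height of 3, so a model whose errors stay within the cat family receives a lower average value.</p></div>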
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.2">Correlation Analysis for Location and Time Attributes</head><p>We examined the relationship between species distribution and numerical attributes, such as location and time, to gain insights into how these contexts contribute to the task. The location coordinates can be grouped into six clusters, which are visualized in Figure <ref type="figure">4</ref>. For each pair of clusters (one from the training set and one from the validation set), we compute the Bhattacharyya distance <ref type="bibr">[11]</ref>, a measure of similarity between probability distributions, between their species distributions (Figure <ref type="figure">3a</ref>). Similarly, we plot the distance between species distributions corresponding to each hour of the day (Figure <ref type="figure">3b</ref>). We observe that the similarity (corresponding to lower distance) peaks along the diagonal for the location attribute, as well as within the day and night categories for the time attribute. This suggests that these metadata provide a prior for the species class distribution.</p><p>Figure 3: (a) Each color square shows the distance between the corresponding validation cluster centroid on the x-axis and the training cluster centroid on the y-axis. The correlation peaks along the diagonal (highlighted in red).<ref type="foot">3</ref> (b) Each color square shows the distance between the corresponding training hour slot on the x-axis and the validation hour slot on the y-axis. The correlation peaks for day-day and night-night hour slots (highlighted in red).</p></div>
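<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a minimal sketch of this analysis, the code below computes the Bhattacharyya distance between two empirical species distributions, using the standard definition D_B(p, q) = -ln(sum_i sqrt(p_i q_i)). The helper names and the toy label lists are hypothetical, and returning None for distributions with no shared species is an assumption about how undefined (infinite) distances might be handled, not a description of the authors' code.</p>
<p>
```python
import math
from collections import Counter


def species_distribution(labels, classes):
    """Empirical probability of each class among a non-empty list of labels."""
    counts = Counter(labels)
    total = len(labels)
    return [counts[c] / total for c in classes]


def bhattacharyya_distance(p, q):
    """D_B(p, q) = -ln(sum_i sqrt(p_i * q_i)); None when supports are disjoint."""
    coefficient = sum(math.sqrt(a * b) for a, b in zip(p, q))
    if coefficient == 0.0:
        return None  # no shared species, so the distance is undefined (infinite)
    # Clamp to guard against tiny floating-point overshoot above 1.
    return max(0.0, -math.log(min(coefficient, 1.0)))


# Toy example (hypothetical labels for one training and two validation clusters).
classes = ["ocelot", "gray fox", "agouti"]
train = species_distribution(["ocelot", "ocelot", "agouti", "agouti"], classes)
val_same = species_distribution(["ocelot", "agouti"], classes)
val_other = species_distribution(["gray fox"], classes)

print(bhattacharyya_distance(train, val_same))   # identical distributions
print(bhattacharyya_distance(train, val_other))  # disjoint species sets
```
</p>
<p>Applying such a function to every pair of training and validation clusters (or hour slots) would yield a distance matrix of the kind visualized as a heat map, with smaller values indicating more similar species distributions.</p></div>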
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.3">Performance Comparison for Under-represented Species</head><p>The iWildCam2020-WILDS dataset exhibits a long-tail species distribution <ref type="bibr">[30]</ref>, which makes it challenging to accurately recognize species that are under-represented in the training set. We compare the performance of our best-performing model (COSMO with location context) to the baseline Empirical Risk Minimization (ResNet-50) model (Table <ref type="table">6</ref>). We focus on examples whose labels have at most 100 instances in the training set and report the overall test accuracy on this subset of species. This selection includes species like banded palm civet, Brazilian cottontail, and leopard, all classified as vulnerable in IUCN's list of threatened species <ref type="bibr">[66]</ref>. We observe that COSMO outperforms the ERM baseline by 2.7%, which is a 16.6% relative improvement. These findings illustrate the potential of our model to mitigate the sample inefficiency of existing approaches for under-represented species by utilizing multimodal context information.</p></div>
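<div xmlns="http://www.tei-c.org/ns/1.0"><p>The relation between the absolute and relative gains quoted above can be checked with a few lines of arithmetic. The sketch below uses hypothetical helper names, and the baseline and COSMO accuracies it prints are back-computed from the two quoted figures (2.7 points and 16.6%); they are rounded estimates, not values read from Table 6.</p>
<p>
```python
def relative_improvement_pct(baseline_acc, new_acc):
    """Relative gain of new_acc over baseline_acc, in percent."""
    if baseline_acc == 0:
        raise ValueError("baseline accuracy must be non-zero")
    return 100.0 * (new_acc - baseline_acc) / baseline_acc


def implied_baseline_acc(absolute_gain, relative_gain_pct):
    """Baseline accuracy implied by an absolute gain and its relative gain."""
    return 100.0 * absolute_gain / relative_gain_pct


# Figures quoted in the text: +2.7 points absolute, 16.6% relative.
baseline = implied_baseline_acc(2.7, 16.6)  # roughly 16.3 (back-computed)
cosmo = baseline + 2.7                      # roughly 19.0 (back-computed)
print(round(baseline, 1), round(cosmo, 1),
      round(relative_improvement_pct(baseline, cosmo), 1))
```
</p>
<p>The low implied baseline illustrates why a gain of a few points on rare species corresponds to a comparatively large relative improvement.</p></div>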
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Discussion and Conclusion</head><p>In this work, we presented a novel framework that reformulates species classification as link prediction in a multimodal KG of species images and their diverse contextual information. This provides a unified way to leverage various forms of multimodal context, e.g., numerical, categorical, and taxonomy information associated with images, for species classification in camera traps. Our experiments demonstrate that the framework achieves superior out-of-distribution generalization and performance competitive with the state of the art for species classification on the iWildCam2020-WILDS and Snapshot Mountain Zebra datasets. Additionally, our framework exhibits improved sample efficiency in recognizing under-represented and vulnerable wildlife species.</p><p>Our framework assumes a perfect linkage between these contexts and the corresponding images in the training set. In scenarios where such linkage is unavailable, the training procedure may introduce noise, which could lead to inferior generalization. Additionally, the effectiveness of diverse contexts varies based on their informativeness for the given task. Interestingly, combining two or more contexts can degrade performance compared to using a single context type in some cases (Table <ref type="table">1</ref>). We posit that specific metadata, like location, might have a stronger regularization effect on generalization in species recognition tasks than other metadata. To address this, future work will enable the model to assign greater importance to more informative metadata.</p><p>Furthermore, we are interested in training a foundation model for camera trap species classification across a wider spectrum of species. Such a model should demonstrate enhanced generalization to new camera trap setups worldwide.
Additionally, we aim to integrate a broader range of contexts such as temperature, weather conditions, habitat, and sequence information for use with real-world camera trap deployments.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>Our code is available at https://github.com/OSU-NLP-Group/COSMO</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>Recent work <ref type="bibr">[51]</ref> showed that simple baselines like DistMult outperform more sophisticated neural network baselines when trained properly.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"><p>The null value in row 4 is due to the absence of species overlap with respective validation clusters. The null value in columns 3 and 4 indicates the absence of these clusters in the validation set.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_4"><p>In this analysis, we define the time duration between 5 A.M. and 7 P.M. local time as daytime.</p></note>
		</body>
		</text>
</TEI>
