<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Human-in-the-Loop Visual Re-ID for Population Size Estimation</title></titleStmt>
			<publicationStmt>
				<publisher>European Conference on Computer Vision</publisher>
				<date>10/01/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10565220</idno>
					<idno type="doi"></idno>
					
					<author>Gustavo Perez</author><author>Daniel Sheldon</author><author>Grant Van_Horn</author><author>Subhransu Maji</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
<abstract><ab><![CDATA[Computer vision-based re-identification (Re-ID) systems are increasingly being deployed for estimating population size in large image collections. However, the estimated size can be significantly inaccurate when the task is challenging or when deployed on data from new distributions. We propose a human-in-the-loop approach for estimating population size driven by a pairwise similarity derived from an off-the-shelf Re-ID system. Our approach, based on nested importance sampling, selects pairs of images for human vetting driven by the pairwise similarity, and produces asymptotically unbiased population size estimates with associated confidence intervals. We perform experiments on various animal Re-ID datasets and demonstrate that our method outperforms strong baselines and active clustering approaches. In many cases, we are able to reduce the error rates of the estimated size from around 80% using CV alone to less than 20% by vetting a fraction (often less than 0.002%) of the total pairs. The cost of vetting reduces with the increase in accuracy and provides a practical approach for population size estimation within a desired tolerance when deploying Re-ID systems.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Computer vision-based re-identification (Re-ID) is increasingly being deployed to analyze large image collections for species-level population monitoring <ref type="bibr">[47,</ref><ref type="bibr">56]</ref>. The association of individuals across camera sensors provides valuable data about animal movement, population dynamics, geographic distribution, and habitat use, which in turn can inform conservation goals and ecological models <ref type="bibr">[9,</ref><ref type="bibr">50,</ref><ref type="bibr">52]</ref>. Computer vision can reduce the manual labor associated with individual identification, and there is significant effort from the community to develop tools that facilitate animal Re-ID across a wide range of animal species <ref type="bibr">[1,</ref><ref type="bibr">5,</ref><ref type="bibr">7,</ref><ref type="bibr">15,</ref><ref type="bibr">55,</ref><ref type="bibr">59]</ref>.</p><p>Fig. 1: (Left) (a) A simple approach involves using k-means clustering on image embeddings derived from the Re-ID system and selecting the optimal k using the "elbow heuristic." (b) Active clustering (e.g., pck-means <ref type="bibr">[4]</ref>) employs pairwise constraints to enhance clustering accuracy. (c) Our method leverages nested importance sampling to produce asymptotically unbiased estimates and confidence intervals on k directly. (Right) On the MacaqueFaces dataset <ref type="bibr">[60]</ref>, our approach (Nested-IS) not only converges to the true k = 34 with fewer constraints than alternative methods, but also provides confidence intervals for the estimate (shown as the shaded red region) for any amount of human feedback.</p><p>We consider the task of estimating population size, that is, the number of individuals, in a collection of images using a Re-ID system. Beyond surveys, this quantity can inform techniques for category discovery and incremental learning of AI systems <ref type="bibr">[12,</ref><ref type="bibr">26,</ref><ref type="bibr">54]</ref>. Since the number of individuals is often unbounded, Re-ID is cast as a pairwise classification task and involves predicting whether two images are of the same individual or not. However, estimating the population size from pairwise similarities is non-trivial. One approach is to use a clustering algorithm such as k-means with the "elbow heuristic" <ref type="bibr">[51]</ref> to pick k. However, there are many algorithms and heuristics to choose from, making the process subjective. Estimated pairwise similarity can also be a poor approximation to the true similarity, especially when models are deployed out-of-distribution or on challenging tasks where the performance of Re-ID is low. For instance, using k-means with the elbow heuristic on the MacaqueFaces dataset <ref type="bibr">[60]</ref> using the state-of-the-art "MegaDescriptor" <ref type="bibr">[55]</ref> results in 80 clusters when the true number of individuals is 34, as seen in Fig. <ref type="figure">1</ref>. Table <ref type="table">1</ref> shows that estimates of population size across a variety of animal species and clustering approaches are sometimes off by a factor of two or more, which can impact downstream tasks.</p><p>Our approach employs statistical estimation to compute the population size using human feedback on pairwise similarity. We develop an estimator based on nested importance sampling, with each iteration driven by the approximate pairwise similarity.
Feedback in the form of "same" or "not" on a set of sampled edges is used to estimate the number of clusters as well as to provide a confidence interval. We theoretically demonstrate that the estimator is unbiased, i.e., it converges to the true number of clusters in expectation, and an end user can stop screening when the confidence intervals are sufficiently small. Additionally, we contribute a strong baseline based on farthest-first traversal (FFT) to directly estimate the population size, which, although it does not offer statistical guarantees, outperforms alternative active clustering approaches. Our methods can employ any off-the-shelf image similarity and do not involve fine-tuning deep representations, and thus can be practical for non-AI experts.</p><p>We conduct experiments on seven datasets where the goal is to estimate the number of individuals they contain. These datasets span animal species and exhibit different data distributions, including both short and long-tailed, with varying numbers of individuals. We also experiment with different image representations from the MegaDescriptor library <ref type="bibr">[55]</ref>, which provides a unified deep network for various animal Re-ID tasks. The accuracy of the Re-ID system also varies across datasets and is assumed to be unknown, providing a realistic use case for deploying the models. Our approach is more accurate for the same amount of human supervision (sampled pairs) than competing baselines. We also show that the estimator has low bias and produces accurate confidence intervals. In summary, our main contributions are:</p><p>-We study the challenges involved in using Re-ID systems for estimating population size across a range of animal Re-ID datasets. -We propose a novel approach combining nested importance sampling with human-in-the-loop feedback that produces asymptotically unbiased estimates of population size ( &#167; 3).
-We design confidence intervals that provide intuitive feedback for guiding human effort and setting stopping conditions ( &#167; 3.2). -We report extensive experiments on a benchmark of seven animal Re-ID datasets with different data distributions, demonstrating that our method achieves significantly lower error than strong baselines with similar human effort. In most cases our approach produces estimates close to the true population size using human feedback on less than 0.004% of all pairs ( &#167; 4). -We use our framework to estimate the number of categories in a dataset for generalized category discovery <ref type="bibr">[12,</ref><ref type="bibr">26,</ref><ref type="bibr">54]</ref> and measure the impact on clustering accuracy, on animal Re-ID and fine-grained classification ( &#167; 4).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>The task of estimating population size in a dataset is closely related to the problem of grouping individuals using a Re-ID system. However, it is possible to estimate the population size using statistical approaches, such as ours, without fully solving the grouping problem. There is a vast literature in computer vision, especially in face recognition, on building accurate Re-ID systems, as well as on using these representations to group individuals. We briefly review these approaches. Our work focuses on animal Re-ID tasks, which can be relatively easy for some species that have distinctive patterns but significantly more challenging for others. We review prior work on animal Re-ID and introduce the benchmarks and representations we experiment with.</p><p>Re-ID Systems There is significant prior work on person Re-ID based on face and whole-body recognition (see <ref type="bibr">[31]</ref> for a more complete survey). Early techniques, such as Eigenfaces <ref type="bibr">[53]</ref>, have been replaced by modern deep learning approaches <ref type="bibr">[33,</ref><ref type="bibr">42]</ref> trained on large datasets <ref type="bibr">[21,</ref><ref type="bibr">25]</ref>. While significant advances have been made, there are also known issues with poor generalization and bias of Re-ID systems in the presence of demographic shifts <ref type="bibr">[32]</ref>. In comparison, techniques for animal Re-ID are less mature.
These techniques range from SIFT <ref type="bibr">[35]</ref> and Superpoint-based <ref type="bibr">[18]</ref> matching approaches (e.g., WildID <ref type="bibr">[7]</ref> and HotSpotter <ref type="bibr">[15]</ref>) that work relatively well for species with characteristic patterns on their skin or coats (e.g., zebras and jaguars), to complex pipelines designed to identify visual characteristics specific to particular species (e.g., whisker patterns of polar bears <ref type="bibr">[1]</ref>). Deep learning approaches have also been developed for some species (e.g., chimpanzees and bears), but the training datasets are relatively small. Recent work <ref type="bibr">[55]</ref> has attempted to train deep learning models across species by consolidating animal Re-ID datasets and has shown that the model generalizes across species and significantly outperforms prior work, including off-the-shelf image representations such as CLIP <ref type="bibr">[44]</ref> and DINOv2 <ref type="bibr">[10]</ref>.</p><p>We utilize a set of datasets from their collection (Table <ref type="table">1</ref>) and base our population size estimation on the various pre-trained deep networks they provide.</p><p>Their best-performing model is a Swin-transformer <ref type="bibr">[34]</ref> trained with ArcFace loss <ref type="bibr">[17]</ref> on the collection of datasets. However, the problem is far from solved: for example, on the WhaleSharkID dataset <ref type="bibr">[24]</ref>, the performance of the Re-ID system is around 62%, which poses a challenge for population size estimation.</p><p>Clustering The population size can be estimated by determining the number of clusters in a dataset.
A common approach involves using clustering algorithms such as k-means <ref type="bibr">[22]</ref> or mean-shift <ref type="bibr">[13]</ref>, which operate directly on embeddings, or graph-based approaches <ref type="bibr">[8,</ref><ref type="bibr">19,</ref><ref type="bibr">49]</ref> that incorporate pairwise similarity, and then determining the number of clusters based on a heuristic. For example, one can compute the within-cluster sum of squares (WCSS) for different values of k in k-means and select the optimal one based on the "elbow heuristic": the point in the curve where improvements diminish (see Fig. <ref type="figure">1</ref>). However, there are many clustering algorithms and heuristics available, making the process subjective. We find that population size estimates using these methods can be significantly inaccurate, even with state-of-the-art embeddings (see Table <ref type="table">1</ref>). More complex approaches, particularly those incorporating tracking information and sophisticated graph-based clustering (e.g., <ref type="bibr">[19]</ref>), have been proposed for grouping in videos. Our primary focus is on image-based approaches using simple clustering algorithms, as they are broadly applicable.</p><p>Active Clustering Human feedback can be incorporated to improve clustering in various ways. One can fine-tune the deep network using metric learning <ref type="bibr">[27,</ref><ref type="bibr">30]</ref> approaches to improve the underlying similarity using human feedback, for example through pairwise <ref type="bibr">[28]</ref> and triplet-based <ref type="bibr">[23]</ref> learning. However, these methods require significant compute resources and expertise to set various hyperparameters associated with training, and may not be practical for non-experts. Moreover, fine-tuning on the uncurated data often encountered in a real-world deployment of the Re-ID system is rarely effective.
Hence, we focus on active clustering approaches that incorporate constraints to improve clustering for a fixed embedding. We compare with constrained k-means algorithms that use 'must-link' and 'cannot-link' constraints within k-means <ref type="bibr">[4]</ref>. Constraints can be selected using a farthest-first scheme <ref type="bibr">[4]</ref>, or by running k-means with a large k and picking constraints to merge the small clusters into larger ones <ref type="bibr">[14]</ref>. We also develop a baseline based on the farthest-first traversal (FFT) scheme that is directly aimed at estimating the number of clusters. While these approaches improve over the baseline k-means algorithm, they do not provide statistical guarantees on the estimate. Our proposed sampling-based approach results in a consistent estimator of the cluster count and produces accurate confidence intervals ( &#167; 5). The amount of human effort required depends both on the quality of the pairwise similarity and on the level of precision needed for estimation.</p><p>Related Problems Apart from population surveys, estimating the number of clusters is often the primary goal for many tasks. Life-long learning systems must discover and learn novel categories during long-term deployment. In generalized category discovery (GCD) <ref type="bibr">[12,</ref><ref type="bibr">54]</ref>, the goal is to cluster images given the labels for a subset of images in the presence of novel categories. This problem is more challenging than traditional semi-supervised learning due to the open-world setting. However, we find that the performance of existing GCD approaches is limited by the accuracy of estimating the number of categories in a dataset. Often, existing approaches make the unrealistic assumption that this quantity is known. We show that our approach for estimating the number of clusters has a higher impact on improving GCD than using active clustering algorithms.
Our work is also related to approaches that improve statistical estimation by combining human effort and model predictions, such as ISCount <ref type="bibr">[37]</ref>, DISCount <ref type="bibr">[43]</ref>, and Prediction-Powered Inference <ref type="bibr">[3]</ref>. For example, both ISCount and DISCount use importance sampling to estimate the population mean, and our work extends this idea to a pairwise setting.</p></div>
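The elbow selection discussed above admits many concrete variants, which is precisely the subjectivity the text notes. A minimal, hypothetical sketch of one common rule (pick the k whose WCSS point lies farthest below the chord joining the curve's endpoints, a "knee"-style criterion; the numbers are illustrative):

```python
def elbow_k(wcss):
    """Given WCSS values for k = 1..len(wcss), return the k whose point
    lies farthest below the straight line joining the curve's endpoints.
    This is only one of many elbow variants; other rules may pick other k."""
    n = len(wcss)
    w0, wn = wcss[0], wcss[-1]
    best_k, best_gap = 1, float("-inf")
    for i, w in enumerate(wcss):
        k = i + 1
        chord = w0 + (wn - w0) * (k - 1) / (n - 1)  # line from (1, w0) to (n, wn)
        gap = chord - w                             # how far the curve dips below
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k

# toy WCSS curve with a sharp elbow at k = 3 (illustrative numbers)
k_star = elbow_k([2000.0, 500.0, 0.3, 0.25, 0.2, 0.15])
```

A different rule (e.g., thresholding the second difference of the curve) can return a different k on the same data, which is why the paper treats heuristic-based estimates with caution.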
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Method</head><p>Assume that we have a collection of images D = {x i } n i=1 where each image x i belongs to a cluster y i &#8712; Y. Our goal is to calculate the total number of clusters K = |Y| in D. We assume labels y i &#8712; Y are unknown but we have access to an image embedding &#934;(&#8226;) which can be used to compute an approximate pairwise similarity &#349;(x, z) = f (&#934;(x), &#934;(z)) between images x and z for some function f (such as the dot product). The true similarity is s(x, z) = 1 if images x and z belong to the same cluster and s(x, z) = 0 otherwise. We assume this can be obtained by using human feedback in the form of "same" or "not" for the pair of images. Our goal is to estimate K as accurately as possible using a small amount of human feedback given the approximate similarity &#349;(x, z). Lemma 1. Consider a graph G = (V, E) with vertices u, v &#8712; V = {1, ..., n} corresponding to images x u , with an edge e uv &#8712; E between u and v if the images x u and x v belong to the same cluster. Let d(u) = |{e uv &#8712; E}| be the degree of vertex u. Then the number of clusters K in the dataset D is equal to the number of connected components in G, and can be computed as: K = CC(G) = &#8721; u&#8712;V 1/(1 + d(u)). (1)</p><p>Proof (Proof of Lemma 1). It is easy to see that G is a collection of cliques, with the clique containing u having size 1 + d(u). The total contribution of the vertices in this clique is (1 + d(u)) &#215; (1/(1 + d(u))) = 1, i.e., each clique contributes a total of 1 to Eq. 1, adding up to the total number of cliques or connected components in G. See Fig. <ref type="figure">2</ref> for an example.</p></div>
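Lemma 1 can be sanity-checked on a toy labeled collection; a minimal Python sketch (the labels are hypothetical, not from the paper's data):

```python
from collections import Counter

def cluster_count_from_degrees(labels):
    """Lemma 1: K = sum over vertices u of 1 / (1 + d(u)).
    In a clique of size c every vertex has degree d(u) = c - 1,
    so each vertex contributes 1/c and each cluster totals 1."""
    sizes = Counter(labels)  # clique (cluster) size per label
    return sum(1.0 / sizes[y] for y in labels)

# three clusters of sizes 3, 2, 1: the sum is 3.0 up to float rounding
k_est = cluster_count_from_degrees(["a", "a", "a", "b", "b", "c"])
```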
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Estimation via Nested Sampling</head><p>The above lemma provides a way to estimate the number of connected components CC(G) by sampling.</p><p>Thus, one approach to estimate the number of connected components is to sample N vertices and estimate their degrees by querying a human about whether an edge exists between each sampled vertex and all other n &#8722; 1 vertices in the graph. This would require N &#215; (n &#8722; 1) queries, a saving over the n &#215; (n &#8722; 1)/2 queries needed for exact estimation. However, one can estimate the degree of a vertex by sampling as well. This suggests a two-step Monte Carlo (MC) approach for estimation. First, we sample N vertices u 1 , . . . , u N . For each sampled vertex u i , we then sample M nodes v i,1 , . . . , v i,M uniformly at random, query a human to obtain s(u i , v i,j ) for each potential neighbor, and estimate the degree of the vertex u i as: d(u i ) = (n &#8722; 1)/M &#8721; M j=1 s(u i , v i,j ). (3)</p><p>From this, we can estimate the number of connected components as: CC NMC (G) = n/N &#8721; N i=1 1/(1 + d(u i )). (4)</p><p>This nested Monte Carlo estimate is asymptotically unbiased for the expected number of connected components, i.e., it converges to the true mean under mild assumptions. One can also construct a sample estimate of the variance and confidence intervals around the mean (we will derive the expressions for a general sampling distribution in the next section). The number of queries required to construct the estimate scales as N &#215; M , which is more efficient than the earlier approach of N &#215; n. However, the variance of the estimator can be high for graphs where the degrees of the vertices vary significantly.</p><p>Remark 1. Our problem is related to work on estimating the number of connected components in a graph in sub-linear time. The randomized algorithms proposed in <ref type="bibr">[6,</ref><ref type="bibr">11]</ref> use a similar sampling argument but their cost model is different.
They assume the edges in the graph are provided as an adjacency list and run breadth-first search starting from each node for a number of steps, until the size of the connected component exceeds a threshold; a threshold of 1/&#1013; gives an n&#1013; additive approximation to CC. In our setting the true edges are unknown (we instead have a noisy pairwise similarity) and we must pay a cost to reveal each edge, hence the same approach is not applicable.</p></div>
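The two-step Monte Carlo scheme of this section can be sketched as a toy implementation; `same_cluster` stands in for human feedback, and all names are illustrative:

```python
import random

def nested_mc_cluster_count(n, same_cluster, N, M, rng):
    """Nested MC estimate of CC(G): sample N vertices uniformly, estimate
    each degree from M uniformly sampled potential neighbors, then average
    n / (1 + d_hat(u)) over the sampled vertices."""
    total = 0.0
    for _ in range(N):
        u = rng.randrange(n)
        others = [v for v in range(n) if v != u]
        hits = sum(same_cluster(u, rng.choice(others)) for _ in range(M))
        d_hat = (n - 1) * hits / M        # degree estimate from M queries
        total += 1.0 / (1.0 + d_hat)
    return n * total / N                  # CC = n * E_u[1 / (1 + d(u))]

# toy graph: 100 images in 10 equal clusters (true CC = 10); the estimate
# is noisy and biased for small M, as the text warns
labels = [i // 10 for i in range(100)]
est = nested_mc_cluster_count(100, lambda u, v: labels[u] == labels[v],
                              N=200, M=40, rng=random.Random(0))
```

The query cost is N &#215; M oracle calls, and the upward bias for small M comes from plugging a noisy degree into the convex function 1/(1 + d).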
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Estimation via Nested Importance Sampling</head><p>Importance sampling <ref type="bibr">[40]</ref> uses a proposal distribution q and replaces the expectation of a quantity f (x) under p with an expectation under q: E x&#8764;p [f (x)] = E x&#8764;q [p(x)f (x)/q(x)]. (5)</p><p>The equality holds for any proposal distribution q that satisfies p(x)f (x) &gt; 0 =&#8658; q(x) &gt; 0. Moreover, one can show that the optimal proposal distribution is q(x) &#8733; p(x)f (x), for which the estimator has zero variance. We are interested in computing various expectations under a uniform distribution p, and one can show that the optimal proposal distribution to sample nodes for estimating CC(G) in Eq. 1 is Q(u) &#8733; 1/(1 + d(u)), while that to sample edges for estimating the degree d(u) in Eq. 3 is q u (v) &#8733; s(u, v).</p><p>However, sampling from these distributions requires knowledge of the true s(u, v), which is unknown. But we can construct proposals using the approximate pairwise similarity &#349;(u, v) &#8712; [0, 1], computed from the similarity between the feature embeddings of a pair of images. Then, we can set Q(u) &#8733; 1/(1 + &#8721; v&#8800;u &#349;(u, v)) as the proposal distribution to sample a set of vertices I of size N . For each sampled vertex u i , we sample a set of M nodes v i,j from the distribution q u (v) &#8733; &#349;(u, v).</p><p>Algorithm 1 (Estimating cluster count k using NIS): for i = 1, . . . , N : sample u i &#8764; Q(u); sample v i,1 , ..., v i,M &#8764; q u i (v); obtain human feedback s(u i , v i,j ) and estimate d(u i ) via Eq. ( <ref type="formula">7</ref>); return the estimate of Eq. ( <ref type="formula">6</ref>).</p><p>The CC NIS (G) obtained using importance sampling is: CC NIS (G) = 1/N &#8721; N i=1 1/((1 + d(u i )) Q(u i )), (6)</p><p>where d(u i ) is estimated as: d(u i ) = 1/M &#8721; M j=1 s(u i , v i,j )/q u i (v i,j ). (7)</p><p>The proposal Q(u) biases the distribution towards isolated vertices, i.e., ones that have low degrees, as they contribute the most toward the number of clusters. The second proposal biases the distribution towards sampling nodes that are likely to be connected, i.e., to have high &#349;(u, v).
Note that the only place human feedback is required is in computing d(u) in Eq. 7, which consists of M queries. Hence the overall query complexity of the approach is N &#215; M , the same as for the simple Monte Carlo estimator. But if the similarity function &#349;(u, v) is a good approximation to the true similarity, then we expect the estimator to have a lower variance. Algorithm 1 and Fig. <ref type="figure">3</ref> describe the overall scheme.</p></div>
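The whole NIS scheme can be sketched end-to-end in a toy Python version; the similarity matrix `s_hat` and the human-feedback oracle are illustrative stand-ins, and a small floor keeps the edge proposal's support full (an assumption of this sketch, not a detail from the paper):

```python
import random

def nis_cluster_count(s_hat, same_cluster, N, M, rng, floor=1e-6):
    """Nested importance sampling estimate of the cluster count.
    Vertex proposal Q(u) proportional to 1/(1 + sum_v s_hat[u][v]);
    edge proposal q_u(v) proportional to max(s_hat[u][v], floor);
    degrees are reweighted by s/q and vertices by 1/Q."""
    n = len(s_hat)
    q_raw = [1.0 / (1.0 + sum(s_hat[u][v] for v in range(n) if v != u))
             for u in range(n)]
    z = sum(q_raw)
    Q = [q / z for q in q_raw]             # normalized vertex proposal
    total = 0.0
    for _ in range(N):
        u = rng.choices(range(n), weights=Q)[0]
        w = [0.0 if v == u else max(s_hat[u][v], floor) for v in range(n)]
        wz = sum(w)
        d_hat = 0.0
        for _ in range(M):
            v = rng.choices(range(n), weights=w)[0]
            d_hat += same_cluster(u, v) / (w[v] / wz)  # importance weight s/q
        d_hat /= M
        total += 1.0 / ((1.0 + d_hat) * Q[u])
    return total / N

# toy check with a near-perfect similarity: 3 clusters of sizes 2, 3, 1
labels = [0, 0, 1, 1, 1, 2]
s_hat = [[1.0 if u != v and labels[u] == labels[v] else 0.0
          for v in range(6)] for u in range(6)]
est = nis_cluster_count(s_hat, lambda u, v: labels[u] == labels[v],
                        N=100, M=10, rng=random.Random(0))
```

When `s_hat` is close to the true similarity, the importance weights make both the degree and vertex estimates nearly exact, which is why the variance collapses relative to uniform sampling.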
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Variance and Confidence Intervals</head><p>Theorem 1. Assume Q(u) &gt; 0 for all u and q u (v) &gt; 0 for all u, v. Let CC N,M denote the estimator using N sampled vertices and M sampled edges per vertex. For any M &gt; 0, the estimator CC N,M is asymptotically normal, i.e., &#8730;N (CC N,M &#8722; &#181; M ) converges in distribution to N (0, &#963;&#178; 1,M ) as N &#8594; &#8734;, where &#181; M = E[CC 1,M ] satisfies &#181; M = CC + O(1/M ).</p><p>Together these facts imply that the estimator is consistent as both N and M go to infinity, i.e., lim N &#8594;&#8734;,M &#8594;&#8734; CC N,M = CC.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Asymptotic normality justifies the construction of confidence intervals as follows</head><p>Let &#963;&#178; 1,M denote the sample variance of the N i.i.d. copies of CC 1,M . Then CC N,M &#177; z &#945;/2 &#963; 1,M /&#8730;N is an approximate (1 &#8722; &#945;) confidence interval for &#181; M .</p><p>Proof (Proof of Theorem 1). The estimator CC N,M is the sample average of N identically distributed copies of CC 1,M . Therefore E[CC N,M ] = E[CC 1,M ], and the asymptotic normality result holds by the central limit theorem as long as &#181; M and &#963;&#178; 1,M are finite. The quantity 1/((1 + d(u))Q(u)) is bounded above by our assumption that Q(u) &gt; 0 and q u (v) &gt; 0. Together with the fact that the support of the joint sample space (u 1 , v 1,1 , . . . , v 1,M ) is finite, this implies &#181; M and &#963;&#178; 1,M are finite. It remains to show that the bias is O(1/M ). This can be proved by the delta method, but also follows from a result of <ref type="bibr">[45]</ref>, which applies to our setting and asserts that the mean-squared error of CC N,M is O(1/N + 1/M &#178;). Because the mean-squared error equals the variance plus the squared bias, the bias, which depends only on M , must be O(1/M ).</p></div>
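The interval construction reduces to a standard normal CI over the N per-vertex copies; a small sketch (the numbers are made up for illustration):

```python
import math
import statistics

def mean_with_ci(copies, z=1.96):
    """Point estimate and normal confidence interval from the N i.i.d.
    per-vertex estimates: mean +/- z * sqrt(sample_variance / N).
    z = 1.96 gives the 95% interval (z_{0.025})."""
    n = len(copies)
    mean = statistics.fmean(copies)
    half = z * math.sqrt(statistics.variance(copies) / n)
    return mean, (mean - half, mean + half)

# hypothetical per-vertex estimates of the cluster count
est, (lo, hi) = mean_with_ci([30, 36, 33, 35, 32, 34, 38, 31, 36, 35])
```

As the text suggests, an end user can keep sampling until `hi - lo` falls below a desired tolerance and stop there.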
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Remark 2.</head><p>[45] describe the optimal allocation of samples when T = N M &#8594; &#8734; and find that it occurs when N grows proportionally to M &#178; in our setting. However, in applications with finite T , the lowest error occurred for N &#8733; T &#945; with &#945; between 0.5 and 0.6, which suggests that selecting N approximately proportional to M may be better in practice (see &#167; 5, Parameters for NIS).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experiments</head><p>This section describes the experimental setup, datasets, models used for computing image similarity, baselines, and the evaluation metrics.</p><p>Datasets We evaluate our approach on seven animal re-identification datasets, where the goal is to estimate the number of individuals (see Tab. 1). Chimpanzee Faces in the Wild (CTai and CZoo) [20]: CTai contains 5,078 images of 72 individuals living in the Ta&#239; National Park in C&#244;te d'Ivoire, while CZoo consists of 2,109 recordings of 24 chimpanzees. IPanda50 [58] contains 6,874 images of 50 giant panda individuals, OpenCows2020 [2] comprises 4,736 images of 46 Holstein-Friesian cattle individuals, MacaqueFaces [60] includes 6,280 face images of 34 rhesus macaque individuals, WhaleSharkID [24] features 7,693 images with 543 individual whale shark identifications, and GiraffeZebraID [41] contains 6,925 images of 2,056 zebra and giraffe individuals in Kenya.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Image Encoding and Similarity</head><p>We use image embeddings from the MegaDescriptor <ref type="bibr">[55]</ref>, a Swin-transformer model optimized with ArcFace loss, which beats CLIP <ref type="bibr">[44]</ref>, DINOv2 <ref type="bibr">[10]</ref>, and ImageNet-1k <ref type="bibr">[16]</ref> pre-trained image encoders on animal Re-ID tasks. Specifically, we consider MegaDescriptor-L-384 for our experiments and present ablation tests with its smaller version MegaDescriptor-B-224 in &#167; 5. The normalized cosine similarity between a pair of embeddings &#934;(x u ) and &#934;(x v ) is used as the proposal distribution, with temperature &#964; = 0.5 (chosen with CZoo).</p><p>Table <ref type="table">1</ref>: Population Size Estimation on Animal Re-ID Datasets. The number of images |D| and individuals |Y| per dataset along with the estimated population k for a given amount of human effort (sampled pairs). Estimates using connected components (CoCo), Robust Continuous Clustering <ref type="bibr">[48]</ref> (RCC), k-means (km), and mean-shift (not shown, but see Fig. <ref type="figure">4</ref>) exhibit high error rates. k-means improves with the addition of human feedback (pckm) but the error rates remain high. Farthest-first traversal (FFT) outperforms these methods. The proposed Nested-IS (NIS) surpasses these baselines and yields estimates and confidence intervals that often contain the true value. Error rates (%) on each dataset (shown for NMC and NIS) are also lower compared to the baseline and Nested Monte Carlo (NMC).
Our method demonstrates significant improvement on the challenging WhaleSharkID <ref type="bibr">[24]</ref> and GiraffeZebraID <ref type="bibr">[41]</ref>.</p><p>Estimated k @ N &#215; M sampled pairs, reported as CC&#177;CI (%error) for NMC and NIS. Columns: Dataset (|D|, |Y|, sampled pairs N &#215;M /|E|): CoCo, RCC, km&#8594;pckm, FFT, NMC, NIS.
CTai [20] (5078, 72, 0.014%): 7, 258, 90&#8594;87, 54, 26&#177;4 (64%), 63&#177;20 (16%).
CZoo [20] (2109, 24, 0.004%): 3, 57, 50&#8594;49, 13, 8&#177;1 (68%), 23&#177;7 (17%).
IPanda50 [58] (6874, 50, 0.002%): 7, 251, 80&#8594;79, 34, 17&#177;3 (66%), 50&#177;16 (13%).
OpenCows2020 [2] (4736, 46, 0.006%): 39, 220, 70&#8594;69, 39, 18&#177;3 (62%), 40&#177;13 (17%).
MacaqueFaces [60] (6280, 34, 0.002%): 13, 90, 80&#8594;77, 31, 15&#177;2 (55%), 34&#177;8 (9%).
WhaleSharkID [24] (7693, 543, 0.16%): 630, 1182, 70&#8594;132, 251, 145&#177;9 (73%), 543&#177;74 (6%).
GiraffeZebraID [41] (6925, 2056, 0.16%): 4714, 500, 100&#8594;155, 271, 403&#177;34 (80%), 1951&#177;370 (8%).
Average error (%): 69.4%, 219.2%, 80.4%&#8594;75.4%, 38.2%, 66.9%, 12.3%.</p><p>Clustering Baselines We employ four baselines for estimating the number of clusters. The first is mean-shift clustering <ref type="bibr">[13]</ref>. We set the bandwidth parameter as the average distance between samples and their nearest neighbor, and estimate the number of clusters as the number of modes. The second method is k-means <ref type="bibr">[22]</ref>. To select the optimal k, we compute the within-cluster sum of squares J km = &#8721; i ||&#934;(x i ) &#8722; c i ||&#178;, where c i is the cluster center corresponding to u i . The elbow is identified as the value of k where the slope becomes approximately constant. Third, we use connected components (CoCo) after thresholding the similarity values &#349;(u, v) of our graph G. To find the optimal threshold, we use the same procedure as the k-means elbow; that is, we calculate J km for different thresholds and return the threshold at the elbow.
Lastly, we use robust continuous clustering (RCC) to extract the number of clusters after optimizing the algorithm's objective using the default hyperparameters in <ref type="bibr">[48]</ref>.</p><p>Active Clustering We use active variants of the above approaches. First, we use pairwise constrained k-means (pck-means <ref type="bibr">[4]</ref>). For a given k, the objective is: J pckm = &#8721; i ||&#934;(x i ) &#8722; c l i ||&#178; + &#945; &#8721; (i,j)&#8712;M 1[l i &#8800; l j ] + &#946; &#8721; (i,j)&#8712;C 1[l i = l j ], where M and C are the sets of must-link and cannot-link constraints, respectively, and l i denotes the cluster index for i. Scalars &#945;, &#946; &gt; 0 trade off the k-means objective with the cost of violating constraints. We empirically select &#945; = &#946; = 1. Given a set of constraints, we can pick the optimal k using the elbow heuristic on the J pckm objective. We pick samples based on a farthest-first traversal scheme described below. The second method is constrained mean shift <ref type="bibr">[46]</ref> (Constrained-MS), which introduces a density-based integration of the constraints. Lastly, instead of incorporating constraints within k-means, one can directly estimate the number of clusters using a farthest-first traversal (FFT) of points and exhaustive comparison with existing individuals. The approach is as follows: 1) Initialize the list of sampled images S and the list of discovered individuals I as empty. 2) Select an unsampled point that is farthest from S, i.e., u = argmax u min v&#8712;S d(u, v), and add it to the sampled set S. 3) Compare u to all the previously discovered individuals in I; if it matches an individual, add it to the corresponding list, else start a new list with u and add it to I. FFT rapidly explores the dataset to find at least one member for each cluster. At each iteration it pays a cost equal to the number of discovered individuals.</p><p>Evaluation Metric and Human Effort Error is measured as |CC &#8722; CC|/CC, the fractional absolute difference between the estimated count CC and the true number of clusters CC.
Human effort is measured as the number of pairwise queries used for estimation. To account for variable dataset sizes, we report the number of sampled pairs normalized by the total number of edges |E| in G.</p></div>
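The three FFT steps above can be sketched as a toy implementation; `dist` and the `same_individual` oracle are illustrative stand-ins for embedding distance and human feedback:

```python
def fft_discover(points, dist, same_individual, budget):
    """Farthest-first traversal: repeatedly take the unsampled point farthest
    from the sampled set, compare it against one exemplar of every discovered
    individual (the per-iteration cost), and open a new individual on no match."""
    n = len(points)
    sampled = [0]                       # start from an arbitrary point
    individuals = [[0]]                 # exemplar lists, one per individual
    cost = 0
    while cost < budget and len(sampled) < n:
        # maximize distance to the nearest already-sampled point
        u = max((i for i in range(n) if i not in sampled),
                key=lambda i: min(dist(points[i], points[j]) for j in sampled))
        sampled.append(u)
        cost += len(individuals)        # one comparison per known individual
        for group in individuals:
            if same_individual(u, group[0]):
                group.append(u)
                break
        else:                            # no match: a new individual
            individuals.append([u])
    return len(individuals)

# toy 1-D "embeddings" with 3 true individuals
pts = [0.0, 0.1, 5.0, 5.1, 10.0]
ids = [0, 0, 1, 1, 2]
k_fft = fft_discover(pts, lambda a, b: abs(a - b),
                     lambda u, v: ids[u] == ids[v], budget=100)
```

Because the next sample is always far from everything seen so far, FFT tends to find one member per cluster early, but its count comes with no statistical guarantee, as the text notes.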
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Results</head><p>We compare our method to the baselines described in &#167; 4 that include connected components (CoCo), pairwise constrained k-means (pck-means), constrained mean shift (Constrained-MS), our farthest-first traversal baseline (FFT), and nested Monte Carlo sampling (Nested-MC) from Eq. ( <ref type="formula">4</ref>).</p><p>Fig. 5 (left, center panels): WhaleSharkID class distribution; WhaleSharkID vertex degree distribution. Fig. 6: Confidence Intervals' Coverage and Bias using MegaDescriptor-L-384 feature embeddings on all datasets. (Left) Our method produces a CI coverage close to 95% (when using z 0.025 = 1.96 for a 95% confidence interval, &#167; 3.3) with less than 0.04% of sampled pairs. (Right) Even though our estimator has a negative bias for a low number of sampled pairs, it rapidly reaches zero compared to Nested-MC.</p><p>Baselines Perform Poorly Techniques like k-means and CoCo perform poorly, with about 80% and 69% error on average, respectively, despite using state-of-the-art image embeddings to compute similarity (see Fig. <ref type="figure">4</ref> and Tab. 1). For instance, k-means estimates 70 clusters for WhaleSharkID instead of 543, 70 clusters instead of 46 for OpenCows2020, and 50 clusters instead of 24 for CZoo. Estimates using CoCo are slightly better on average but tend to be an underestimate. One reason is that choosing hyperparameters for clustering is particularly challenging. For instance, the WCSS curve as a function of k looks rather smooth, and there is no clear "elbow." Another reason is that the underlying similarity is imperfect.
Improving the similarity through deep metric learning would require significant resources and expertise to tune various parameters, and may not pan out when supervision is limited <ref type="bibr">[38]</ref>.</p><p>Nested-IS outperforms Nested-MC Both Nested-MC and Nested-IS improve over the baselines as more human feedback is collected. As described in &#167; 3.1, Nested-MC samples edges uniformly at random, while Nested-IS is driven by the pairwise similarity and reduces error at a much faster rate, as seen in Fig. <ref type="figure">4</ref>. Specifically, we get a relative error reduction of 82% over Nested-MC with the same human effort.</p><p>Nested-IS outperforms Active Clustering We compare our method to pairwise constrained k-means (pck-means-AL), constrained mean shift (Constrained-MS), and the proposed farthest-first traversal approach (FFT) with similar human effort. Fig. <ref type="figure">4</ref> shows that Nested-IS handily outperforms active clustering. On WhaleSharkID, despite performing best, our method still requires querying relatively many pairs to get accurate estimates, which can be explained by the difficulty of the task. Fig. <ref type="figure">5</ref>-left-center shows the histogram of individuals, with a clear long-tailed distribution, and the histogram of vertex degrees, which shows that in WhaleSharkID most individuals have fewer than 5 images. The performance of the MegaDescriptor on the Re-ID task is also the lowest (around 62%) on this dataset. Nevertheless, the savings over alternative approaches are significant.</p><p>Confidence Intervals are Calibrated In addition to lower error rates, a key advantage of our approach is that it also provides confidence intervals (CIs) (see Fig. <ref type="figure">4</ref>). To assess the quality of the CIs we calculate their empirical coverage over 100 runs (i.e., the percentage of runs in which the estimated CI contains the true count). 
Our method produces close to 95% coverage using z0.025 = 1.96 for a 95% CI, as described in &#167; 3.3, when around 0.02% of pairs are sampled across datasets, as shown in Fig. <ref type="figure">6</ref>-left.</p><p>Estimation Bias is Low We calculate the empirical bias as Bias = C&#772; &#8722; C, where C&#772; is the mean count estimate over 100 runs and C is the true count. Fig. <ref type="figure">6</ref>-right shows that the bias is initially negative but rapidly drops to zero. The initial negative bias might explain the lower coverage of the CIs.</p></div>
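The two diagnostics above can be sketched as follows (a minimal illustration, not the paper's evaluation code; `estimates`, `half_widths`, and `true_count` are hypothetical inputs):

```python
def coverage_and_bias(estimates, half_widths, true_count):
    """Empirical CI coverage and bias over repeated runs.

    estimates:   per-run count estimates (e.g., over 100 runs)
    half_widths: per-run CI half-widths (e.g., z0.025 * standard error)
    true_count:  ground-truth number of individuals
    """
    # Coverage: percentage of runs whose interval contains the true count.
    hits = sum(
        1 for est, hw in zip(estimates, half_widths)
        if est - hw <= true_count <= est + hw
    )
    coverage = 100.0 * hits / len(estimates)
    # Bias: mean estimate minus the true count.
    bias = sum(estimates) / len(estimates) - true_count
    return coverage, bias
```

With a well-calibrated 95% CI, `coverage` should land near 95 and `bias` near zero as the number of sampled pairs grows.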
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Parameters for Nested-IS</head><p>There is a trade-off between the number of sampled vertices N and sampled neighbors per vertex M . Increasing N will reduce the variance of the estimation of the number of clusters, and increasing M will reduce the variance of each vertex degree estimate. In Fig. <ref type="figure">7</ref>-left we show that increasing the the ratio between the number of sampled neighbors per vertex M relative to the number of sampled vertices N (i.e., r = M/N ) produces more accurate estimations with less human effort. For instance, using r = 7 achieves close-to-zero error with 1/5 of the sampled pairs compared to using r = 0.5.</p><p>60 80 60 70 80 90 Accuracy CTai KMeans (true k) KMeans+Elbow PCKMeans+AL PCKMeans+NIS 30 40 50 70 80 90 100 CZoo 60 80 70 80 90 IPanda50 40 60 40 60 80 100 OpenCows2020 40 60 80 60 70 80 90 100 MacaqueFaces 200 400 20 25 30 35 40 45 WhaleSharkID 1000 2000 0 10 20 30 GiraffeZebraID # clusters (k)</p><p>Fig. <ref type="figure">8</ref>: Measuring Clustering Accuracy. Pairwise constraints (pck-means+AL) improves the clustering accuracy over k-means+elbow, but the clustering accuracy with our estimated k is even better (pck-means+NIS). We plot results with five independent runs for our proposed approach (blue stars). The red shaded line indicates the true number of clusters in the dataset.</p><p>Ablation with other MegaDescriptors We test the performance of all baselines using a smaller less-performing version of the MegaDescriptor (MegaDescriptor-B-224). In Fig. <ref type="figure">7</ref>-center-right we show that our method performs similarly with both MegaDescriptor versions.</p><p>How Much do we Know about the Clustering? Sometimes we also wish to recover the clustering in a dataset, such as in generalized category discovery <ref type="bibr">[54]</ref>. 
Should human effort be used to estimate the clustering directly via active clustering, or to estimate k for k-means using our approach? To answer this, we conduct the following experiment. First, we measure the accuracy of a clustering as Acc = max over p &#8712; P(Y) of (1/|D|) &#931;_i 1[y_i = p(&#375;_i)], where P(Y) is the set of all permutations of the class labels, y_i is the true label, and &#375;_i the cluster assignment of the i-th image. We compare the accuracy of clustering using k-means with k estimated by the elbow method, using pck-means, and using k-means with k estimated by Nested-IS given the same amount of human effort as pck-means. Fig. <ref type="figure">8</ref> shows the results. Surprisingly, we find that accurately estimating k has a larger impact on the quality of the clustering than active clustering, which suggests that human effort is better spent estimating k first. In the Supplementary Material we describe experiments on GCD on fine-grained classification datasets and show a similar trend.</p></div>
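The accuracy measure above, which maximizes over permutations of the class labels, can be sketched with a brute-force search over permutations (fine for small k; a Hungarian-algorithm matching would scale better). This is an illustrative sketch, not the paper's code; `clustering_accuracy` is a hypothetical name:

```python
from itertools import permutations

def clustering_accuracy(y_true, y_pred):
    """Acc = max over permutations p of the class labels of the
    fraction of items i with y_true[i] == p(y_pred[i])."""
    labels = sorted(set(y_true) | set(y_pred))
    best = 0.0
    for perm in permutations(labels):
        mapping = dict(zip(labels, perm))  # p: predicted label -> class label
        correct = sum(t == mapping[p] for t, p in zip(y_true, y_pred))
        best = max(best, correct / len(y_true))
    return best
```

For instance, a clustering that swaps the two label names but groups items perfectly still scores 1.0, since the maximization over P(Y) absorbs the relabeling.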
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Discussion and Conclusion</head><p>We propose a human-in-the-loop approach to estimate population size when deploying imperfect Re-ID systems. By carefully selecting a small fraction of pairs to label (often less than 0.002% of all edges), our approach produces unbiased estimates of the population size. A key advantage of our method is that it generates confidence intervals which can be used for guiding human effort. This approach can be implemented on top of any Re-ID system, as it requires only a pairwise similarity between images, making it practical for low-resource settings. Our approach adds to the growing literature on statistical estimation techniques <ref type="bibr">[3,</ref><ref type="bibr">37,</ref><ref type="bibr">43]</ref> that combine model predictions and ground-truth labels to improve the precision of count estimates. However, we tackle the novel problem of estimating cluster counts, which involves pairwise comparisons.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.2 Measuring Clustering Accuracy</head><p>Similarly to the Re-ID datasets (See Fig. <ref type="figure">8</ref>), we find that accurately estimating k has a bigger impact on the quality of clustering than active clustering, which suggests that human effort is better spent to estimate k initially. </p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0"><p>Code available at: https://github.com/cvl-umass/counting-clusters</p></note>
		</body>
		</text>
</TEI>
