<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>DISCount: Counting in large image collections with detector-based importance sampling</title></titleStmt>
			<publicationStmt>
				<publisher>AAAI Conference on Artificial Intelligence</publisher>
				<date>02/01/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10565218</idno>
					<idno type="doi"></idno>
					
					<author>Gustavo Perez</author><author>Subhransu Maji</author><author>Daniel Sheldon</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Many modern applications use computer vision to detect and count objects in massive image collections. However, when the detection task is very difficult or in the presence of domain shifts, the counts may be inaccurate even with significant investments in training data and model development. We propose DISCOUNTa detector-based importance sampling framework for counting in large image collections that integrates an imperfect detector with human-in-the-loop screening to produce unbiased estimates of counts. We propose techniques for solving counting problems over multiple spatial or temporal regions using a small number of screened samples and estimate confidence intervals. This enables end-users to stop screening when estimates are sufficiently accurate, which is often the goal in a scientific study. On the technical side we develop variance reduction techniques based on control variates and prove the (conditional) unbiasedness of the estimators. DISCOUNT leads to a 9-12× reduction in the labeling costs over naive screening for tasks we consider, such as counting birds in radar imagery or estimating damaged buildings in satellite imagery, and also surpasses alternative covariate-based screening approaches in efficiency.* equal advising contribution Preprint. Under review.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Many modern applications use computer vision to detect and count objects in massive image collections. For example, we are interested in applications that involve counting bird roosts in radar images and damaged buildings in satellite images. The image collections are too massive for humans to solve these tasks in the available time. Therefore, a common approach is to train a computer vision detection model and run it exhaustively on the images.</p><p>The task is interesting because the goal is not to generalize, but to achieve the scientific counting goal with sufficient accuracy for a fixed image collection. The best use of human effort is unclear: it could be used for model development, labeling training data, or even directly solving the counting task! A particular challenge occurs when the detection task is very difficult, so the accuracy of counts made on the entire collection is questionable even with huge investments in training data and model development. Some works resort to human screening of the detector outputs <ref type="bibr">[1]</ref><ref type="bibr">[2]</ref><ref type="bibr">[3]</ref>, which saves time compared to manual counting but is still very labor intensive.</p><p>These considerations motivate statistical approaches to counting. Instead of screening the detector outputs for all images, a human can "spot-check" some images to estimate accuracy, and, more importantly, use statistical techniques to obtain unbiased estimates of counts across unscreened images. In a related context, Meng et al. <ref type="bibr">[4]</ref> proposed IS-count, which uses importance sampling to estimate total counts across a collection when (satellite) images are expensive to obtain by using spatial covariates to sample a subset of images.</p><p>We contribute counting methods for large image collections that build on IS-count in several ways. First, we work in a different model where images are freely available and it is possible to train a detector to run on all images, but the detector is not reliable enough for the final counting task, or its reliability is unknown. We contribute human-in-the-loop methods for count estimation using the detector to construct a proposal distribution, as seen in Fig. <ref type="figure">1</ref>. Second, we consider solving multiple counting problems-for example, over disjoint or overlapping spatial or temporal regionssimultaneously, which is very common in practice. We contribute a novel sampling approach to obtain simultaneous estimates, prove their (conditional) unbiasedness, and show that the approach allocates samples to regions in a way that approximates the optimal allocation for minimizing variance. Third, we design confidence intervals, which are important practically to know how much human effort is needed. Fourth, we use variance reduction techniques based on control variates.</p><p>Our method produces unbiased estimates and confidence intervals with reduced error compared to covariate-based methods. In addition, the labeling effort is further reduced with DISCOUNT as we only have to verify detector predictions instead of producing annotations from scratch. On our tasks, DISCOUNT leads to a 9-12&#215; reduction in the labeling costs over naive screening and 6-8&#215; reduction over IS-Count. Finally, we show that solving multiple counting problems jointly can be done more efficiently than solving them separately, demonstrating a more efficient use of samples.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>Computer vision techniques have been deployed for counting in numerous applications where exhaustive human-labeling is expensive due to the sheer volume of imagery involved. This includes areas such as detecting animals in camera trap imagery <ref type="bibr">[1,</ref><ref type="bibr">5]</ref>, counting buildings, cars, and other structures in satellite images <ref type="bibr">[2,</ref><ref type="bibr">[6]</ref><ref type="bibr">[7]</ref><ref type="bibr">[8]</ref>, species monitoring in citizen science platforms <ref type="bibr">[5,</ref><ref type="bibr">9]</ref>, monitoring traffic in videos <ref type="bibr">[10,</ref><ref type="bibr">11]</ref>, as well as various medicine, science and engineering applications. For many applications the cost associated with training an accurate model is considerably less than that of meticulously labeling the entire dataset. Even with a less accurate model, human-in-the-loop recognition strategies have been proposed to reduce annotation costs by integrating human validation with noisy predictions <ref type="bibr">[12,</ref><ref type="bibr">13]</ref>.</p><p>Our approach is related to work in active learning <ref type="bibr">[14]</ref> and semi-supervised learning <ref type="bibr">[15]</ref>, where the goal is to reduce human labeling effort to learn models that generalize on i.i.d. held out data. While these approaches reduce the cost of labels on training data, they often rely on large labeled test sets to estimate the performance of the model, which can be impractical. Active testing <ref type="bibr">[16,</ref><ref type="bibr">17]</ref> aims to reduce the cost of model evaluation by providing a statistical estimate of the performance using a small number of labeled examples. Unlike traditional learning where the goal is performance on held out data, the goal of active testing is to estimate performance on a fixed dataset. Similarly, our goal is to estimate the counts on a fixed dataset, but different from active testing we are interested in estimates of the true counts and not the model's performance. In particular, we want unbiased estimates of counts even when the detector is unreliable. Importantly, since generalization is not the goal, overfitting to the dataset statistics may lead to more accurate estimates.</p><p>Statistical estimation has been widely used to conduct surveys (e.g., estimating population demographics, polling, etc.) <ref type="bibr">[18]</ref>. In IS-Count <ref type="bibr">[4]</ref>, the authors propose an importance sampling approach to estimate counts in large image collections using humans-in-the-loop. They showed that one can count the number of buildings at the continental scale by sampling a small number of regions based on covariates such as population density and annotating those regions, thereby reducing the cost of obtaining high-resolution satellite imagery and human labels. However, for many applications the dataset is readily available, and running the detector is cost effective, but human screening is expensive. To address this, we propose using the detector to guide the screening process and demonstrate that this significantly reduces error rates in count estimation given a fixed amount of human effort. Furthermore, for some applications, screening the outputs of a detector can be significantly faster than to annotate from scratch, leading to additional savings.</p><p>An interesting question is what is the best way to utilize human screening effort to count on a dataset. For example, labels might be used to improve the detector, measure performance on the deployed dataset, or, as is the case in our work, to derive a statistical estimate of the counts. Our work is motivated by problems where improving the detector might require significant effort, but counts from the detector are correlated with true counts and can be used as a proposal distribution for sampling.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">DISCOUNT: Detector-based IS-Count</head><p>Consider a counting problem in a discrete domain &#8486; (usually spatiotemporal) with elements s &#8712; &#8486; that represent a single unit such as an image, grid cell, or day of year. For each s there is a ground truth "count" f (s) &#8805; 0, which can be any non-negative measurement, such as the number or total size of all objects in an image. A human can label the underlying images for any s to obtain f (s).</p><p>Define F (S) = s&#8712;S f (s) to be the cumulative count for a region S. We wish to estimate the total counts F (S 1 ), . . . , F (S k ) for k different subsets S 1 , . . . , S k &#8838; &#8486;, or regions, while using human effort as efficiently as possible. The regions represent different geographic divisions or time ranges and may overlap -for example, in the roost detection problem we want to estimate cumulative counts of birds for each day of the year, while disaster-relief planners want to estimate building damage across different geographical units such as towns, counties, and states. Assume without loss of generality that k i=1 S i = &#8486;, otherwise the domain can be restricted so this is true. We will next present our methods; derivations and proofs of all results are found in the appendix.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Single-Region Estimators</head><p>Consider first the problem of estimating the total count F (S) for a single region S. Meng et al. <ref type="bibr">[4]</ref> studied this problem in the context of satellite imagery, with the goal of minimizing the cost of purchasing satellite images to obtain an accurate estimate. <ref type="bibr">[4]</ref> This is a baseline based on simple Monte Carlo sampling. Write</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Simple Monte Carlo</head><p>Then the following estimator, which draws n random samples uniformly in S to estimate the total, is unbiased:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IS-Count [4]</head><p>Meng et al. then proposed an estimator based on importance sampling <ref type="bibr">[19]</ref>. Instead of sampling uniformly, the method samples from a proposal distribution q that is cheap to compute for all s &#8712; S. For example, to count buildings in US satellite imagery, the proposal distribution could use maps of artificial light intensity, which are freely available. The importance sampling estimator is:</p><p>DISCOUNT IS-count assumes images are costly to obtain, which motivates using external covariates for the proposal distribution. However, in many scientific tasks, the images are readily available, and the key cost is that of human supervision. In this case it is possible to train a detection model and run it on all images to produce an approximate count g(s) for each s. Define G(S) = s&#8712;S g(s) to be the approximate detector-based count for region S. We propose the detector-based IS-count ("DISCOUNT") estimator, which uses the proposal distribution proportional to g on region S, i.e., with density &#7713;S (s) = g(s) I[s &#8712; S]/G(S). The importance-sampling estimator then specializes to:</p><p>To interpret DISCOUNT, let w i = f (s i )/g(s i ) be the ratio of the true count to the detector-based count for the ith sample s i or (importance) weight. DISCOUNT reweights the detector-based total count G(S) by the average weight w = 1 n n i=1 w i , which can be viewed as a correction factor based on the tendency to over-or under-count, on average, across all of S.</p><p>DISCOUNT is unbiased as long as &#7713;(s) &gt; 0 for all s &#8712; S such that f (s) &gt; 0. Henceforth, we assume detector counts are pre-processed if needed so that g(s) &gt; 0 for all relevant units, for example, by adding a small amount to each count.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">k-DISCOUNT</head><p>We now return to the multiple region counting problem. A naive approach would be to run DISCOUNT separately for each region. However, this is suboptimal. First, it allocates samples equally to each region, regardless of their size or predicted count. Intuitively, we want to allocate more effort to regions with higher predicted counts. Second, if regions overlap it is wasteful to repeatedly draw samples from each one to solve the estimation problems separately.</p><p>k-DISCOUNT We propose estimators based on n samples drawn from all of &#8486; with probability proportional to g. Then, we can estimate F (S) for any region using only the samples from S. Specifically, the k-DISCOUNT estimator is</p><p>where n(S) = |{i : s i &#8712; S}| is the number of samples in region S and w(S) = 1 n(S) i:si&#8712;S w i is the average importance weight for region S.</p><p>The unconditional bias can also be analyzed (see Appendix). Overall, bias has negligible practical impact. It occurs only when the sample size n(S) is zero, which is an event that is both observable and has probability (1 -p(S)) n that decays exponentially in n, where p(S) = G(S)/G(&#8486;).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>In terms of variance, k-DISCOUNT behaves similarly to DISCOUNT run on each region S with sample size equal to E[n(S)] = np(S). To first order, both approaches have variance</head><p>where &#963; 2 (S) is the importance-weight variance. In the case of disjoint regions, running DISCOUNT on each region is the same as stratified importance sampling across the regions, and the allocation of np(S) samples to region S is optimal in the following sense: Claim 2. Suppose S 1 , . . . , S k partition &#8486; and the importance weight variance &#963; 2 (S i ) = &#963; 2 is constant across regions. Assume DISCOUNT is run on each region S i with n i samples. Given a total budget of n samples, the sample sizes that minimize k i=1 Var( FDIS (S i )) are given by</p><p>The analysis uses reasoning similar to the Neyman allocation for stratified sampling <ref type="bibr">[18]</ref>, and shows that k-DISCOUNT approximates the optimal allocation of samples to (disjoint) regions under the stated assumptions. One key difference is that k-DISCOUNT draws samples from all of &#8486; and then assigns them to regions, which is called "post-stratification" in the sampling literature <ref type="bibr">[18]</ref>. An exact variance analysis in the Appendix reveals that, if the expected sample size np(S) for a region is very small, k-DISCOUNT may have up to 30% "excess" variance compared to stratification due to the random sample size, but the excess variance disappears quickly and both approaches have the same asymptotic variance. A second key difference to stratification is that regions can overlap; k-DISCOUNT's approach of sampling from all of &#8486; and then assigning samples to regions extends cleanly to this setting.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Control Variates</head><p>Control variates are functions h(s) whose integrals H(S) = s&#8712;S h(s) are known and can be combined with importance sampling using the following estimator:</p><p>where wh (S) = 1 n(S) i:si&#8712;S w h,i and</p><p>It is clear that FkDIScv (S) has the same expectation as FkDIS(S) , but FkDIScv (S) might have a lower variance under certain conditions (if f and h are sufficiently correlated <ref type="bibr">[19]</ref>). For bird counting, estimated counts from previous years could be used as control variates as migration is periodic to improve count estimates (see experiments in &#167; 4 for details).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Confidence intervals</head><p>Confidence intervals for k-DISCOUNT can be constructed in a way similar to standard importance sampling. For a region S, first estimate the importance weight variance &#963; 2 (S) as:</p><p>2 .</p><p>An approximate 1 -&#945; confidence interval is then given by FkDIS (S) &#177; z &#945;/2 &#8226; G(S) &#8226; &#963;(S)/ n(S), where z &#947; is the 1 -&#947; quantile of the standard normal distribution, e.g., z 0.025 = 1.96 for a 95% confidence interval. The theoretical justification is subtle due to scaling by the random sample size n(S). It is based on the following asymptotic result, proved in the Appendix.</p><p>Claim 3. The k-DISCOUNT estimator with scaling factor G(S)&#963;(S)/ n(S) is asymptotically normal, that is, the distribution of FkDIS (S)-F (S)</p><p>converges to N (0, 1) as n &#8594; &#8734;.</p><p>In preliminary experiments we observed that for small expected sample sizes the importance weight variance &#963; 2 (S) can be underestimated leading to intervals that are too small -as an alternative, we propose a practical heuristic for smaller sample sizes where &#963;2 (&#8486;) is used instead of &#963;2 (S); that is, all samples are used to estimate variability of importance weights for each region S.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experimental Setup</head><p>In this section we describe the counting tasks and detection models ( &#167; 4.1-4.2) and the evaluation metrics ( &#167; 4.3) we will use to evaluate different counting methods. We focus on two applications: counting roosting birds in weather radar images and counting damaged buildings in satellite images of a region struck by a natural disaster.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Counting Roosting Birds from Weather Radar</head><p>Many species of birds and bats congregate in large numbers at nighttime or daytime roosting locations. Their departures from these "roosts" are often visible in weather radar, from which it's possible to estimate their numbers <ref type="bibr">[20]</ref><ref type="bibr">[21]</ref><ref type="bibr">[22]</ref>. The US "NEXRAD" weather radar network <ref type="bibr">[23]</ref> has collected data for 30 years from 143+ stations and provides an unprecedented opportunity to study long-term and wide-scale biological phenomenon such as roosts <ref type="bibr">[24,</ref><ref type="bibr">25]</ref>. However, the sheer volume of radar scans (&gt;250M) prevents manual analysis and motivates computer vision approaches <ref type="bibr">[26]</ref><ref type="bibr">[27]</ref><ref type="bibr">[28]</ref><ref type="bibr">3]</ref>.</p><p>Unfortunately, the best computer vision models <ref type="bibr">[3,</ref><ref type="bibr">28]</ref> for detecting roosts have average precision only around 50% and are not accurate enough for fully automated scientific analysis, despite using state-of-the-art methods such as Faster R-CNNs <ref type="bibr">[29]</ref> and training on thousands of human annotations -the complexity of the task suggests substantial labeling and model development efforts would be needed to improve accuracy, and may be impractical.</p><p>0 25 50 75 0 5 10 15 20 25 1e5 KGRB 2020 Ground truth k-DISCount (n = 5) k-DISCount (n = 30) 0 25 50 75 0 10 20 30 40 50 1e4 KTYX 2015 0 25 50 75 0 5 10 15 20 1e5 KBUF 2010 0 50 100 0 5 10 15 20 25 1e5 KDTX 2018 Season days Cumulative number of birds Previous work <ref type="bibr">[30,</ref><ref type="bibr">31]</ref> used a roost detector combined with manual screening of the detections to analyze more than 600,000 radar scans spanning a dozen stations in the Great Lakes region of the US to reveal patterns of bird migration over two decades. The vetting of nearly 64,000 detections was orders of magnitude faster than manual labeling, yet still required a substantial 184 hours of manual effort. Scaling to the entire US network would require at least an order of magnitude more effort, thus motivating a statistical approach.</p><p>We use the exhaustively screened detections from the Great Lakes analysis in <ref type="bibr">[30,</ref><ref type="bibr">31]</ref> to systematically analyze the efficiency of sampling based counting. The data is organized into domains &#8486; sta,yr corresponding to 12 stations and 20 years (see Fig. <ref type="figure">7 in Appendix B</ref>). Thus the domains are disjoint and treated separately. Counts are collected for each day s by running the detector using all radar scans for that day to detect and track roost signatures and then mapping detections to bird counts using the measured radar "reflectivity" within the tracks. For the approximate count g(s) we use the automatically detected tracks, while for the true count f (s) we use the manually screened and corrected tracks. For a single domain, i.e., each station-year, we divide a complete roosting season into temporal regions in three different scenarios: (1) estimating bird counts up to each day in the roosting season (i.e., regions are nested prefixes of days in the entire season), (2) the end of each quarter of (i.e., regions are nested prefixes of quarters in the entire season), and (3) estimating each quarter's count (each region is one quarter). We measure error using the fully-screened data and average errors across all domains and regions. Fig. <ref type="figure">2</ref> shows the counts and confidence intervals estimated using k-DISCOUNT for the first scenario on four station-years.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Counting Damaged Buildings from Satellite Images</head><p>Building damage assessment from satellite images <ref type="bibr">[32,</ref><ref type="bibr">33]</ref> is often used to plan humanitarian response after a natural disaster strikes. However, the performance of computer vision models degrades when applied to new regions and disaster types. Our approach can be used to quickly vet the data produced by the detector to correctly estimate counts in these scenarios.</p><p>We use the building damage detection model by <ref type="bibr">[34]</ref>, the winner of the xView2 challenge <ref type="bibr">[35]</ref>. The model is based on U-Net <ref type="bibr">[36]</ref> to detect buildings in the pre-disaster image, followed by a "siamese network" that incorporates at pre-and post-disaster images to estimate damage. The model is trained on the xBD dataset <ref type="bibr">[37]</ref> that contains building and damage annotations spanning multiple geographical regions and disaster types (e.g., earthquake, hurricane, tsunami, etc.). While the dataset contains four levels of damage (i.e., 0: no-damage, 1: minor-damage, 2: major-damage, and 3: destroyed), in this work we combine all damage levels (i.e., classes 1-3) into a single "damage" class.</p><p>We consider the Palu Tsunami from 2018; the data consists of 113 high-resolution satellite images labeled with 31,394 buildings and their damage levels. We run the model on each tile s to estimate the number of damaged buildings g(s), while the ground-truth number of damaged buildings is used as f (s). Our goal is to estimate the cumulative damaged building count in sub-regions expanding from the area with the most damaged buildings as shown in Fig. <ref type="figure">9</ref> in the Appendix C. To define the sub-regions, we sort all m images by their distance from the epicenter (defined as the image tile with</p><p>5 15 25 35 45 55 65 Number of samples 10 20 30 40 50 60 70 80 Error rate (%) MC IS-Count DISCount 5 15 25 35 45 55 65 Labeling effort (%) 10 20 30 40 50 60 70 80 5 10 15 20 25 30 35 40 45 Number of samples 10 20 30 40 50 60 70 5 10 15 20 25 30 35 Labeling effort (%) 10 20 30 40 50 60 70 Damaged building counting (Palu Tsunami)</p><p>Roosting birds counting (Great Lakes) the most damaged buildings) and then divide into chunks or "annuli" A 1 , . . . , A 7 of size m/7. The task is to estimate the cumulative counts S j = j i=1 A i of the first j chunks for j from 1 to 7.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Evaluation</head><p>We measure the fractional error between the true and the estimated counts averaged over all regions in a domain S 1 , . . . , S k &#8838; &#8486; as:</p><p>For the bird counting task, for any given definition of regions within one station-year &#8486; (i.e., cumulative days or quarters defined in &#167; 4.1) we report the error averaged across all station-years corresponding to 12 stations and &#8776; 20 years. For the damaged building counting problem there is only a single domain corresponding to the Palu Tsunami region. In addition, we calculate the average confidence interval width normalized by F (&#8486;). We run 1000 trials and plot average metrics &#177;1.96 &#215; std. error over the trials. We also evaluate confidence interval coverage, which is the fraction of confidence intervals that contain the true count over all domains, regions, and trials.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Results</head><p>In this section, we present the results comparing detector-based to covariate-based sampling. Also, we show reductions in labeling effort and demonstrate the advantages of estimating multiple counts jointly. Finally, we show confidence intervals and control variates results.</p><p>Detector-based sampling reduces error We first compare DISCOUNT (detector-based sampling) to IS-Count and simple Monte Carlo sampling for estimating F (&#8486;), that is, the total counts of birds in a complete roosting season for a given station year, or damaged buildings in the entire disaster region. Fig. <ref type="figure">3</ref> shows the error rate as a function of number of labeled samples (i.e., the number of distinct s i sampled, since each s is labeled at most once). In the buildings application, a sample refers to an image tile of size 1024 &#215; 1024 pixels, while for the birds a sample refers to a single day.</p><p>Using the detector directly without any screening results in high error rates -roughly 136% and 149% for estimating the total count for the damaged buildings and bird counting tasks respectively. Meng et al. <ref type="bibr">[4]</ref> show the advantages of using importance sampling with screening to produce count estimates with base covariates as opposed to simple Monte Carlo sampling (MC vs. IS-Count). For the bird counting task, we construct a non-detector covariate g IS by fitting a spline to f (s) with 10% of the days from an arbitrarily selected station-year pair (station KBUF in 2001). For the damaged building counting task, the covariate g IS is the true count of all buildings (independent of the damage) obtained using the labels provided with the xBD dataset.</p><p>R1 R2 R3 R4 R5 R6 R7 Regions (cumulative) 0 20 40 60 Error rate (%) Damaged building cumulative counting (Palu Tsunami) DISCount k-DISCount Q1 Q1-Q2 Q1-Q3 Q1-Q4 Season quarter (cumulative) 0 20 40 60 Roosting birds cumulative counting (Great Lakes) Q1 Q2 Q3 Q4 Season quarter (disjoint) 0 20 40 60 Roosting birds counting (Great Lakes) Covariate-based sampling (IS-Count) leads to significant savings over simple Monte Carlo sampling (MC), but DISCOUNT provides further improvements. In particular, to obtain an error rate of 20% DISCOUNT requires &#8776; 1.6&#215; fewer samples than IS-Count and &#8776; 3&#215; fewer samples than MC for both counting problems.</p><p>Screening leads to a further reduction in labeling effort DISCOUNT alleviates the need for users to annotate an image from scratch, such as identifying an object and drawing a bounding box around it. Instead, users only need to verify the detector's output, which tends to be a quicker process. In a study by Su et al. <ref type="bibr">[38]</ref> on the ImageNet dataset <ref type="bibr">[39]</ref>, the median time to draw a bounding-box was found to be 25.5 seconds, whereas verification took only 9.0 seconds (this matches the screening time of &#8776;10s per bounding-box in <ref type="bibr">[31,</ref><ref type="bibr">30]</ref>). The right side of Fig. <ref type="figure">3</ref> presents earlier plots with the x-axis scaled based on labeling effort, computed as 100 &#8226; c &#8226; n/|&#8486;|, where n denotes the number of screened samples and c &#8712; [0, 1] represents the fraction of time relative to labeling from scratch. For instance, the labeling effort is 100% when all elements must be labeled from scratch (c = 1 and n = |&#8486;|). For DISCOUNT, we estimate c DIS = 9.0/(25.5 + 9.0) = 0.26, since annotating from scratch requires both drawing and verification, while screening requires only verification. To achieve the same 20% error rate, DISCOUNT requires 6&#215; less effort than IS-Count and 9&#215; less effort than MC for the bird counting task, and 8&#215; less effort than IS-Count and 12&#215; less effort than MC for building counting.</p><p>Multiple counts can be estimated efficiently (k-DISCOUNT) To solve multiple counting problems, we compared k-DISCOUNT to using DISCOUNT separately on each region. For bird counting, the task was to estimate four quarterly counts (cumulative or individual) as described in &#167; 4.1. For k-DISCOUNT, we sampled n = 40 days from the complete season to estimate the counts simultaneously. For DISCOUNT, we solved each of the four problems separately using n/4 = 10 samples per region for the same total number of samples. For building damage counting, the task was to estimate seven cumulative counts as described in &#167; 4.2. For k-DISCOUNT, we used n = 70 images sampled from the entire domain, while for DISCOUNT we used n/7 = 10 sampled images per region.</p><p>Fig. <ref type="figure">4</ref> shows that solving multiple counting problems jointly (k-DISCOUNT) is better than solving them separately (DISCOUNT). For the cumulative tasks, k-DISCOUNT makes much more effective use of samples from overlapping regions. For single-quarter bird counts, k-DISCOUNT has slightly higher error in Q1 and Q4 and lower errors in Q2 and Q3. This can be understood in terms of sample allocation: k-DISCOUNT allocates in proportion to predicted counts, which provides more samples and better accuracy in Q2-Q3, when many more roosts appear, and approximates the optimal allocation of Claim 2. DISCOUNT allocates samples equally, so has slightly lower error for the smaller Q1 and Q4 counts. In contrast, for building counting, k-DISCOUNT has lower error even for the smallest region R1, since this has the most damaged buildings and thus gets more samples than DISCOUNT. Fig. <ref type="figure">5</ref> (left) shows k-DISCOUNT outperforms simple Monte Carlo (adapted to multiple regions similarly to k-DISCOUNT) for estimating cumulative daily bird counts as in Fig. <ref type="figure">2</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Confidence intervals</head><p>We measure the width and coverage of the estimated confidence intervals (CIs) per number of samples for cumulative daily bird counting; see examples in Fig. <ref type="figure">2</ref>. We compare the CIs of k-DISCOUNT, k-DISCOUNT-cv (control variates), k-DISCOUNT-cv-&#963;(&#8486;) (using all sam- ples to estimate variance), and simple Monte Carlo sampling in Fig. <ref type="figure">5</ref>. When using control variates, the error rate and the CI width are slightly reduced while keeping the same coverage. CI coverage is lower than the nominal coverage (95%) for all methods, but increasing with sample size and substantially improved by k-DISCOUNT-cv-&#963;(&#8486;), which achieves up to &#8776; 80% coverage. Importance weight distributions can be heavily right-skewed and the variance easily underestimated <ref type="bibr">[40]</ref>.</p><p>DISCOUNT improves over a calibration baseline We implement a calibration baseline where the counts are estimated as FCAL (S) = s&#8712;S &#966;(g(s)), where we learn an isotonic regression model &#966; between the predicted and true counts trained for each station using 15 selected samples from one year from that station. Results are shown as the straight line in Fig. <ref type="figure">5</ref> (left). DISCOUNT outperforms calibration with less than 10 samples per station suggesting the difficulties in generalization across years using a simple calibration approach.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Control variates (k-DISCOUNT-cv)</head><p>We perform experiments adding control variates to k-DISCOUNT in the roosting birds counting problem. We use the calibrated detector counts &#966;(g(s)) defined above as the control variate for each station year. Fig. <ref type="figure">5</ref> shows that control variates reduce the confidence interval width (middle: k-DISCOUNT vs. k-DISCOUNT-cv) without hurting coverage (right). In addition, the error of the estimate is reduced slightly, as shown in Fig. <ref type="figure">5</ref> (left). Note that this is achieved with a marginal increase in the labeling effort.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Discussion and Conclusion</head><p>We contribute methods for counting in large image collections with a detection model. When the task is complex and the detector is imperfect, allocating human effort to estimate the scientific result directly might be more efficient than improving the detector. For instance, performance gains from adding more training data may be marginal for a mature model. Our proposed solution produces accurate and unbiased estimates with a significant reduction in labeling costs from naive and covariate-based screening approaches. We demonstrate this in two real-world open problems where data screening is still necessary despite large investments in model development. Our approach is limited by the availability of a good detector, and confidence interval coverage is slightly low; possible improvements are to use bootstrapping or corrections based on importance-sampling diagnostics <ref type="bibr">[40]</ref>. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A Derivations</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">DISCount</head><p>Take q = &#7713;S in IS-Count, then</p><p>A.3 k-DISCount</p><p>Proof of Claim 1. For any m &gt; 0 we have</p><p>In the third line, we used the fact that E h(X)</p><p>for any random variable X and event A (see Lemma 1 below). In the fourth line we used the fact that s i is conditionally independent of n(S) given s i &#8712; S, since n(S) = I[s i &#8712; S] + j&#824; =i I[s j &#8712; S] and the latter sum is independent of s i . In the fifth line we used the fact that Pr</p><p>and the terms in the sum are exchangeable. In the sixth line we computed the conditional expectation as follows using the fact that the conditional density of s i given s i &#8712; S is equal to g(s i )/G(S):</p><p>The unconditional bias of k-DISCOUNT can also be analyzed:</p><p>In particular, bias decays exponentially with n and quickly becomes negligible, with magnitude at most &#1013; for n &#8805; log(1/r)/ log(1/&#1013;) and r = 1 -p(S). Further, the bias is easily computable from the detector counts and therefore known prior to sampling, and the event that leads to a biased estimate (n(S) = 0) is observed after sampling. All these factors make bias a very minor concern. <ref type="foot">1</ref>Proof of Claim 4. Using Claim 1, we compute the unconditional expectation as</p><p>In the final line, 1 -G(S)/G(&#8486;) is probability that s i / &#8712; S for a single i, and (1 -G(S)/G(&#8486;)) n = Pr[n(S) = 0] is the probability that s i / &#8712; S for all i. Rearranging gives the result.</p><p>for any random variable X and event A.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.4 Optimal allocation of samples for DISCOUNT to disjoint regions</head><p>Proof of Claim 2. The proof is similar to that of Theorem 5.6 in <ref type="bibr">[18]</ref>. We prove the claim for k = 2; the proof generalizes to larger k in an obvious way. The variance of DISCOUNT on S i is</p><p>We want to minimize i Var( FDIS (S i )), which with k = 2 is proportional to</p><p>By the Cauchy-Shwarz inequality, for any n 1 , n 2 &gt; 0,</p><p>If we substitute n i = G(S i )/Z for any Z on the left of the inequality and simplify, we see the inequality becomes tight, so the minimum is achieved. We further require i n i = n, so choose Z so</p><p>is the probability of a sample landing in S under the sampling distribution &#7713;&#8486; . Define</p><p>to be the variance of the importance weight for s i &#8764; &#7713;S . Claim 5. Let r = 1 -p(S). The variance of the k-DISCOUNT estimator is given</p><p>where (1 -r) n E 1/n(S) n(S) &gt; 0 = n j=1 (1/j) &#8226; Binomial j; n, p(S) .</p><p>The second term in the variance arise from the possibility that no samples land in S; it decays exponentially in n and is negligible compared to the first term. The first term can be compared to the variance G(S) 2 &#8226; &#963; 2 (S) &#8226; 1 m of importance sampling with exactly m samples allocated to S and the proposal distribution &#7713;S , i.e., DISCOUNT. Because the sample size n(S) is random, the correct scaling factor for k-DISCOUNT is (1-r n ) E[1/n(S) | n(S) &gt; 0], which it turns out is asymptotically equivalent to 1/(np(S)), i.e., DISCOUNT with a sample size of m = np(S) = E[n(S)] -see Claim 6 below. We find that for a small expected sample size (around 4) there can be up to 30% "excess variance" due to the randomness in the number of samples (see Figure <ref type="figure">6</ref>), but that this disappears quickly with larger expected sample size.   In the last line, we used the fact that n(S) &#8764; Binomial(n, p(S)), so Pr[n(S) &gt; 0] = 1 -r n where r = 1 -p(S). The summation for (1 -r n ) E[1/n(S) | n(S) &gt; 0] follows from the same fact. For the second term in Eq. (1), from the definition of k-DISCOUNT and conditional unbiasedness (Claim 1), we have E[ FkDIS (S) | n(S)] = 0 if n(S) = 0 F (S) if n(S) &gt; 0 = F (S) &#8226; Bernoulli(1 -r n ). The variance is therefore Var E[ FkDIS (S) | n(S)] = F (S) 2 &#8226; r n &#8226; (1 -r n ). Putting the two terms together yields the result. Proof of Claim 6. By Claim 5 we have lim n&#8594;&#8734; n Var( FkDIS,n (S)) = lim n&#8594;&#8734; n &#8226; G(S) 2 &#8226; &#963; 2 (S) &#8226; (1 -r n ) &#8226; E 1 n(S) n(S) &gt; 0 + lim n&#8594;&#8734; n &#8226; F (S) 2 &#8226; r n &#8226; (1 -r n ). (2)</p><p>The second limit on the right side is zero, because nr n &#8594; 0 as n &#8594; &#8734; (recall that r &lt; 1) and the other factors are bounded. We will show the first limit on the right side is equal to G(S) 2 &#8226;&#963; 2 (S)/p(S), which will prove the first part of the result. The asymptotic expansion of <ref type="bibr">[41]</ref> (Corollary 3) states that </p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>The k-DISCOUNT estimator can be debiased by dividing by u = 1 -(1 -p(S)) n &lt; 1. However, this leads to higher overall error: if n(S) = 0, the estimator is unchanged, and conditioned on the event n(S) &gt; 0 the estimator becomes biased and has higher variance by a factor of 1/u</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>&gt; 1.</p></note>
		</body>
		</text>
</TEI>
