<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Learned Compressive Representations for Single-Photon 3D Imaging</title></titleStmt>
			<publicationStmt>
				<publisher>IEEE</publisher>
				<date>10/01/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10479177</idno>
					<idno type="doi"></idno>
					<title level='j'>Proceedings of the IEEE/CVF International Conference on Computer Vision</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Felipe Gutierrez-Barragan</author><author>Fangzhou Mu</author><author>Andrei Ardelean</author><author>Atul Ingle</author><author>Claudio Bruschini</author><author>Edoardo Charbon</author><author>Yin Li</author><author>Mohit Gupta</author><author>Andreas Velten</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Single-photon 3D cameras can record the time-of-arrival of billions of photons per second with picosecond accuracy. One common approach to summarize the photon data stream is to build a per-pixel timestamp histogram, resulting in a 3D histogram tensor that encodes distances along the time axis. As the spatio-temporal resolution of the histogram tensor increases, the in-pixel memory requirements and output data rates can quickly become impractical. To overcome this limitation, we propose a family of linear compressive representations of histogram tensors that can be computed efficiently, in an online fashion, as a matrix operation. We design practical lightweight compressive representations that are amenable to an in-pixel implementation and consider the spatio-temporal information of each timestamp. Furthermore, we implement our proposed framework as the first layer of a neural network, which enables the joint end-to-end optimization of the compressive representations and a downstream SPAD data processing model. We find that a well-designed compressive representation can reduce in-sensor memory and data rates by up to 2 orders of magnitude without significantly reducing 3D imaging quality. Finally, we analyze the power consumption implications through an on-chip implementation.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>3D cameras based on single-photon avalanche diode (SPAD) technology are becoming increasingly popular for a wide range of applications that require high-resolution and low-power depth sensing, ranging from autonomous vehicles <ref type="bibr">[1]</ref> to consumer smartphones <ref type="bibr">[2]</ref>. Kilo-to-megapixel resolution SPAD pixel arrays <ref type="bibr">[27,</ref><ref type="bibr">28]</ref> can capture the time-of-arrival of billions of individual photons per frame with extremely high (picosecond) time resolution <ref type="bibr">[31]</ref>. Unfortunately, this extreme sensitivity and high speed come at a cost: the raw timestamp data causes a severe bottleneck between the image sensor and the image signal processor (ISP) that processes this data (Fig. <ref type="figure">1(a)</ref>). This data bottleneck severely limits the wider use of high-resolution SPAD arrays in 3D sensing applications. (Fig. 1(b) caption: Our method applies a lightweight, on-sensor compressive coding scheme to the photon timestamp data, which is later decoded at the ISP, resolving the data bandwidth limitation.)</p><p>One common approach to avoid transferring individual photon timestamps is to build a histogram in each pixel. This results in a 3D histogram tensor that is transferred off-sensor for processing. Although this may be practical at low spatio-temporal resolutions (e.g., 64x32 pixels with 16 time bins <ref type="bibr">[15]</ref>), the required in-sensor memory grows quickly at higher resolutions. Moreover, the data rates of this histogram tensor representation also scale rapidly with the spatio-temporal resolution and maximum depth range. 
For example, a megapixel SPAD-based 3D camera operating at 30 fps that outputs a histogram tensor with a thousand 8-bit bins per pixel would require an unmanageable data transfer rate of 240 Gbps.</p><p>To overcome the above limitations, we seek to design compressive representations of 3D histogram tensors. In order to reduce the data rates output by the SPAD camera, the compact representation needs to be built in-pixel or inside the focal plane array (FPA), as illustrated in Fig. <ref type="figure">1(b)</ref>. Due to the limited in-pixel memory and compute, the compressive representation needs to be built in a streaming manner, with minimal computation per photon. Photon histogram tensors are very different from conventional RGB image/video data; therefore, traditional compression algorithms such as MPEG are not directly applicable.</p><p>We propose a family of compressive representations for 3D histogram tensors that can be computed in an online fashion with limited memory and compute. They are based on a linear spatio-temporal projection of each photon timestamp, which can be expressed as a simple matrix operation. Instead of constructing per-pixel timestamp histograms, a compressive encoding maps each timestamp's spatio-temporal information into a compressive histogram. To exploit local spatio-temporal correlations, a single compressive histogram is built for a local 3D histogram block, as illustrated in Fig. <ref type="figure">2</ref>. Instead of building and storing the full 3D histogram tensor in-sensor, multiple compressive histograms are built and transferred off-sensor for processing, effectively reducing the required in-sensor memory and data rates. 
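The data-rate figure quoted above can be checked with quick arithmetic. The sketch below reproduces the megapixel example; the 100x compression factor in the last line is illustrative:

```python
# Arithmetic behind the bandwidth example: a 1-megapixel SPAD array,
# 1000 8-bit histogram bins per pixel, read out at 30 fps.
pixels, bins, bits_per_bin, fps = 1_000_000, 1000, 8, 30
raw_bps = pixels * bins * bits_per_bin * fps   # bits per second
raw_gbps = raw_bps / 1e9                       # 240 Gbps, as stated

# With ~100x compression the rate drops by two orders of magnitude.
compressed_gbps = raw_gbps / 100
```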
Recent works have proposed a similar compression framework based on compressive histograms <ref type="bibr">[13]</ref> or sketches <ref type="bibr">[36]</ref>. In Sec. <ref type="bibr">4</ref>, we show that these prior works can be viewed as special cases of our proposed framework.</p><p>In this paper, we explore the design space of spatio-temporal compressive encodings and analyze the trade-offs between different design choices. Furthermore, we present a method to integrate our compression framework with data-driven SPAD data processing methods using convolutional neural networks (CNNs), which enables end-to-end optimization of the compressive encoding and a SPAD data processing CNN. We demonstrate the feasibility of compressive histograms through an on-chip implementation.</p><p>For our experimental evaluation, we integrate the compressive histograms framework with a state-of-the-art learning-based denoising model for SPAD-based 3D imaging <ref type="bibr">[32]</ref>. Our results show that the jointly optimized compressive encoding and CNN can consistently reduce data rates by up to 2 orders of magnitude across a wide range of signal and noise levels. Moreover, for a given compression level, it can increase 3D imaging accuracy over previous hand-designed compressive histograms that only exploit temporal information <ref type="bibr">[13,</ref><ref type="bibr">36]</ref>, especially in low signal-to-background ratio (SBR) scenarios and at higher compression rates. Furthermore, we show that learned compressive histograms can perform comparably to, and sometimes even outperform, a theoretical SPAD sensor design where the full 3D histogram tensor is stored in-sensor and only per-pixel depths are transferred off-sensor. Finally, we analyze the power consumption of a compressive histogram implemented on the UltraPhase SPAD processing chip <ref type="bibr">[4]</ref>.</p></div>
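The online encoding described above can be sketched in a few lines. This is a minimal sketch, not the paper's implementation: the block dimensions and the random coding tensors are illustrative assumptions (in practice the coding tensors are learned), and it shows that the streaming per-photon update is equivalent to an inner product with the full histogram block:

```python
import numpy as np

# Sketch of the online compressive encoding (hypothetical sizes).
# A coding tensor C_k assigns one weight to every (time bin, row, col)
# cell of a histogram block; each detected photon adds the K weights
# at its (t, r, c) coordinate to a K-length compressive histogram.
rng = np.random.default_rng(0)
K, Mt, Mr, Mc = 8, 64, 4, 4               # hypothetical block shape
C = rng.standard_normal((K, Mt, Mr, Mc))  # coding tensors (learned in practice)

# A stream of photon detections: (time bin, row, col) per photon.
photons = list(zip(rng.integers(0, Mt, 500),
                   rng.integers(0, Mr, 500),
                   rng.integers(0, Mc, 500)))

# Online update: one K-vector addition per photon, no histogram stored.
Y = np.zeros(K)
for t, r, c in photons:
    Y += C[:, t, r, c]

# Equivalent offline view: inner product of C with the full histogram block.
H = np.zeros((Mt, Mr, Mc))
for t, r, c in photons:
    H[t, r, c] += 1
Y_offline = np.tensordot(C, H, axes=3)
```

The streaming loop never materializes the histogram block H, which is what makes the representation amenable to limited in-pixel memory.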
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Compressive histograms, also called sketches, are an emerging framework for online in-sensor compression of SPAD timestamp data <ref type="bibr">[13,</ref><ref type="bibr">41,</ref><ref type="bibr">36,</ref><ref type="bibr">33]</ref>. A coarse histogram is one common compressive histogram approach <ref type="bibr">[13]</ref> to reduce data rates and in-pixel memory <ref type="bibr">[15,</ref><ref type="bibr">8,</ref><ref type="bibr">18]</ref>. Despite their practical hardware implementation, coarse histograms achieve sub-optimal depth accuracy compared to compressive histograms based on Fourier <ref type="bibr">[36,</ref><ref type="bibr">37,</ref><ref type="bibr">13]</ref> and Gray <ref type="bibr">[13]</ref> codes. One limitation of these approaches is that the compressive representation only exploits the temporal information of the incident timestamp and disregards spatial redundancy. In this work, we generalize the compressive histogram framework to utilize the spatio-temporal information of each timestamp. Moreover, instead of relying on hand-designed coded projections, we learn them as the first layer of a CNN.</p><p>Shared In-Pixel Histograms: One common design is to have multiple neighboring SPAD pixels share a single memory where all timestamps are aggregated into a coarse histogram (e.g., 4 &#215; 4 <ref type="bibr">[18,</ref><ref type="bibr">15]</ref>, 3 &#215; 3 <ref type="bibr">[20]</ref>). This approach throws away the local spatial information (i.e., pixel location) of the detected photon timestamps. 
A compressive histogram is well-suited for these shared memory designs because it can be shared among multiple SPAD pixels while preserving spatial information through the coded projection.</p><p>Neural Sensors and Pixel Processor Arrays: Pixel processor arrays (PPAs) are an emerging sensing technology that embeds processing electronics inside each pixel <ref type="bibr">[9]</ref>. This sensing paradigm begins processing the image at the focal plane array, which reduces sensor data rates by transmitting only the relevant information; the resulting increase in throughput <ref type="bibr">[9]</ref> can enable computer vision at 3,000 fps <ref type="bibr">[6,</ref><ref type="bibr">7]</ref>. PPAs have also become building blocks of novel computational imaging systems optimized end-to-end for HDR imaging <ref type="bibr">[25,</ref><ref type="bibr">39]</ref>, motion deblurring <ref type="bibr">[30]</ref>, video compressive sensing <ref type="bibr">[25]</ref>, and light field imaging <ref type="bibr">[43]</ref>. A compressive histogram relies on a similar in-pixel processing paradigm. Similar to previous works, we jointly optimize the sensor parameters (i.e., the compressive histogram) and the processing algorithm (i.e., the CNN), but to our knowledge, this work is the first to use this approach for SPAD data compression.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Single-Photon 3D Histogram Tensors</head><p>SPAD-based 3D cameras consist of a SPAD sensor and a pulsed laser that illuminates the scene. The photon flux signal arriving at pixel p can be written as &#934; p (t) = a p h(t &#8722; 2d p /c) + &#934; p bkg, where a p is the amplitude of the returning signal accounting for laser power, reflectivity, and light fall-off; h(t) is the system's impulse response function (IRF), which accounts for the pulse waveform and sensor IRF; 2d p /c is the round-trip time-of-flight to a scene point at depth d p (c is the speed of light); and &#934; p bkg is the background flux due to ambient light and sensor dark counts.</p></div>
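The pulse-plus-background image formation model can be evaluated numerically. This is an illustrative sketch under assumptions not fixed by the text: a Gaussian IRF, and arbitrary values for the depth, amplitude, and background level; it checks that the flux peaks at the time bin corresponding to the round-trip time-of-flight:

```python
import numpy as np

# Discretized sketch of the per-pixel image formation model: an attenuated,
# time-shifted system IRF plus a constant ambient/background flux level.
# The Gaussian IRF and all numeric values here are illustrative assumptions.
c = 3e8                      # speed of light (m/s)
n_bins, bin_size = 1024, 80e-12
t = np.arange(n_bins) * bin_size

depth = 5.0                  # scene depth (m), illustrative
shift = 2 * depth / c        # round-trip time-of-flight
a_p, phi_bkg = 10.0, 0.01    # signal amplitude and background flux
sigma = 400e-12 / 2.355      # Gaussian IRF from a 400 ps FWHM pulse

h = np.exp(-0.5 * ((t - shift) / sigma) ** 2)   # system IRF h(t - 2 d_p / c)
flux = a_p * h + phi_bkg                        # Phi_p(t)

peak_bin = int(np.argmax(flux))
expected_bin = int(round(shift / bin_size))
```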
<div xmlns="http://www.tei-c.org/ns/1.0"><head>H b can be compressed in an online fashion through the linear projection of each timestamp tensor, expressed as the inner product with K pre-designed coding tensors, C k, with dimensions M t &#215; M r &#215; M c</head><p>Here H b is the decoded compressive histogram for block b, which is the weighted linear combination of the coding tensors. The decoded histogram blocks are then concatenated and given as input to the processing 3D CNN. Fig. <ref type="figure">3</ref>(b) illustrates this decoding step. The decoding step in Eq. 6 will occur off-sensor, after the compressive histograms have been moved to the camera compute module, which has access to larger memory and computational resources than the sensor module. One benefit of using the unfiltered backprojection as the upsampling operator is that if all coding tensors are mutually orthogonal, then in the limit when K approaches the size of H b (i.e., no compression), the decoding exactly recovers H b.</p><p>This suggests that at compression ratios close to unity, an appropriately trained compressive histogram layer should be approximately equal to an identity transformation applied to H b.</p><p>To summarize, a compressive histogram layer comprises an encoding/compression step followed by a decoding step, which uses the coding tensors as the convolutional filters. This layer can be appended to the beginning of any CNN that has been designed to process 3D histogram tensors. Finally, the coding tensors can be jointly optimized with the downstream CNN in an end-to-end manner.</p></div>
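The orthogonality property of the unfiltered backprojection decoder can be verified on a flattened histogram block. A minimal sketch, assuming random orthonormal codes (obtained here via a QR factorization, purely for illustration) rather than learned ones:

```python
import numpy as np

# Sketch of the encode/decode pair for one histogram block, flattened to a
# vector. With mutually orthonormal coding tensors and K equal to the block
# size (no compression), unfiltered backprojection recovers H_b exactly.
rng = np.random.default_rng(1)
block_size = 64                        # Mt*Mr*Mc, illustrative
H_b = rng.poisson(2.0, block_size).astype(float)

# Orthonormal coding "tensors" (rows), from a QR factorization.
Q, _ = np.linalg.qr(rng.standard_normal((block_size, block_size)))
C_full = Q.T                           # K = block_size rows

Y = C_full @ H_b                       # encoding: K inner products
H_hat = C_full.T @ Y                   # decoding: unfiltered backprojection

# With K < block_size the decode is only an approximation (compression).
K = 8
C_k = C_full[:K]
H_approx = C_k.T @ (C_k @ H_b)
```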
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Datasets and Implementation</head><p>In this section, we describe the datasets used for model training and testing, and also provide implementation details for the compressive histogram layer and the 3D CNN used for the experiments.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Datasets</head><p>For training, we generate a synthetic SPAD measurement dataset containing different scenes under a wide range of illumination settings. We use a synthetic data generation pipeline similar to the one used in previous learning-based SPAD 3D imaging works <ref type="bibr">[21,</ref><ref type="bibr">32,</ref><ref type="bibr">40]</ref>. Using Eq. 2, SPAD measurements can be simulated given an RGB-D image, the pulse waveform (h(t)), and the average number of detected signal and background photons per pixel. Please refer to the supplement for a detailed overview of the simulation pipeline.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Simulated Training Dataset:</head><p>We use the RGB-D images from the NYU v2 dataset <ref type="bibr">[38]</ref>. The simulated histograms have N t = 1024 bins and a &#8710; = 80ps bin size (12.3m depth range). The pulse waveform used has a full-width half maximum (FWHM) of 400ps, obtained from <ref type="bibr">[21]</ref>. For each scene, we randomly set the average number of signal and background photons detected per pixel to 2, 5, or 10 and to 2, 10, or 50, respectively. With appropriate normalization, the models generalize to other photon levels despite being trained on this photon-starved dataset. A total of 16,628 histogram tensors with dimensions 1024 &#215; 64 &#215; 64 are simulated and split into a training and a validation set with 13,851 and 2,777 examples, respectively. Simulated Test Dataset: For testing we use 8 RGB-D images from the Middlebury stereo dataset <ref type="bibr">[35]</ref>. The simulated histograms have N t = 1024 bins and a &#8710; = 100ps bin size (15.3m depth range). The pulse waveform used is a Gaussian pulse with an FWHM of 318ps (&#963; = 135ps). A total of 128 test histogram tensors are generated by simulating each scene with the following average numbers of detected signal/background photons: 2/2, 2/5, 2/50, 5/2, 5/10, 5/50, 10/2, 10/10, 10/50, 10/200, 10/500, 10/1000, 50/50, 50/200, 50/500, and 50/1000. Real-world Experimental Data: To evaluate the generalization of the proposed models, we use the raw histogram tensors captured in <ref type="bibr">[21]</ref> with a line-scanning SPAD-based 3D camera prototype. Please refer to the supplementary document for details on this dataset.</p></div>
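The simulation step described above can be sketched as follows. This is a hedged approximation of the pipeline, not the authors' exact code: a normalized temporal pulse is scaled to the average signal photon count, a flat background is added, and per-bin counts are drawn from a Poisson distribution; the pulse shape and photon levels are illustrative:

```python
import numpy as np

# Sketch of the per-pixel histogram simulation: scale a normalized temporal
# pulse to the average signal photon count, add a flat background level, and
# draw Poisson counts per bin. All numeric values are illustrative.
rng = np.random.default_rng(2)
n_pixels, n_bins = 1024, 1024
mean_signal, mean_bkg = 10.0, 50.0

pulse = np.exp(-0.5 * ((np.arange(n_bins) - 400) / 5.0) ** 2)
pulse /= pulse.sum()                            # normalized waveform h(t)

rate = mean_signal * pulse + mean_bkg / n_bins  # expected counts per bin
hists = rng.poisson(rate, size=(n_pixels, n_bins))

mean_total = hists.sum() / n_pixels             # ~ mean_signal + mean_bkg
```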
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Training and Implementation</head><p>To simplify training, the input to all models is a 3D histogram tensor, even though compressive histograms can work directly on streams of photon timestamps (Eq. 4). Compressive Histogram Layer: The encoder is implemented as a 3D convolution with a stride equal to the filter size, whose learned filters are the coding tensors, C k. We constrain C k to be zero-mean along the time dimension. The unfiltered backprojection decoder is implemented as a 3D transposed convolution with a stride equal to its filter size. To help the CNN model generalize to different photon count levels, we apply zero-normalization along the channel dimension (i.e., K) to the inputs (Y b ) and the weights (C) of the transposed convolution, also known as layer-norm <ref type="bibr">[5]</ref>. This normalization is also commonly used in depth decoding algorithms <ref type="bibr">[14,</ref><ref type="bibr">13,</ref><ref type="bibr">26]</ref>. Depth Estimation 3D CNN: To estimate depths from the decoded histogram tensor, we use the 3D deep boosting CNN model proposed by <ref type="bibr">[32]</ref> for single-photon 3D imaging, without the non-local block. Similar to <ref type="bibr">[21,</ref><ref type="bibr">32]</ref>, we use the pixel-wise KL-divergence between the output histogram tensor and a normalized true histogram tensor as our objective, and estimate depths using a soft-argmax. Training: At each training iteration we randomly sample patches of size 1024&#215;32&#215;32. We train all models using the ADAM optimizer <ref type="bibr">[19]</ref> with &#946; 1 = 0.9, &#946; 2 = 0.999, a batch size of 4, and a learning rate of 0.001 that decays by 0.9 after every epoch. We train all models for 30 epochs with periodic checkpoints, and for a given model we choose the checkpoint that achieves the lowest root mean squared error (RMSE) on the validation set.</p></div>
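Two of the normalization ingredients described above can be illustrated in isolation. A minimal numpy sketch with illustrative shapes (the actual layer uses 3D convolutions): (1) the zero-mean constraint on each coding tensor along the time dimension, and (2) the zero-normalization of the K-channel encodings:

```python
import numpy as np

# Sketch of two ingredients of the compressive histogram layer (shapes are
# illustrative): (1) constraining each coding tensor to be zero-mean along
# the time dimension, and (2) zero-normalizing the K-channel outputs so the
# decoder is less sensitive to absolute photon count levels.
rng = np.random.default_rng(3)
K, Mt, Mr, Mc = 16, 256, 4, 4
C = rng.standard_normal((K, Mt, Mr, Mc))

# (1) Zero-mean along time: constant (DC) offsets per code are removed.
C = C - C.mean(axis=1, keepdims=True)

# (2) Zero-normalization along the channel dimension of an encoding Y_b.
Y_b = rng.standard_normal(K) * 7.0 + 3.0    # arbitrary scale and offset
Y_norm = (Y_b - Y_b.mean()) / (Y_b.std() + 1e-8)
```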
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Experiments and Results</head><p>In this section, we present the performance at various compression levels for different coding tensor designs jointly optimized with the depth estimation CNN described in Sec. 5. A coding tensor design is determined by the dimensions of the histogram block it encodes (M t &#215; M r &#215; M c) and the number of coding tensors, K.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.">Baselines and Performance Metrics</head><p>We compare against the following baselines: &#8226; Temporal Truncated Fourier C <ref type="bibr">[36,</ref><ref type="bibr">13]</ref>: A compressive histogram that uses coding tensors with dimensions 1024&#215;1&#215;1 and whose weights are set using the first K/2 frequencies of the Fourier matrix. In the supplement, we compare against additional Fourier-based C <ref type="bibr">[13]</ref>.</p><p>&#8226; Temporal Coarse Histogram C: Here C is a box downsampling operator along the temporal dimension which produces a coarse histogram with K bins.</p><p>&#8226; No Compression Oracle: In this baseline, we assume the ideal scenario where the histogram tensor is transferred off-sensor and processed with the depth estimation 3D CNN. Similar to <ref type="bibr">[32]</ref>, we train this model with an initial learning rate of 0.0001 and total variation regularization.</p><p>&#8226; Peak Compression Oracle: This baseline implements an ideal SPAD camera with sufficient in-sensor memory to store the histogram tensor and sufficient computation power to compute per-pixel depths through an "argmax" along the time axis. To process the noisy 2D depth images with the 3D CNN, we generate a 3D grid where all elements are zero except for one element per spatial location whose index is proportional to the depth. This model is trained like the no-compression oracle. Like our proposed approach, all compressive histogram baselines described here implement their C as a compressive histogram layer with fixed weights, whose outputs are processed by the depth estimation 3D CNN.</p><p>Evaluation Metrics: The 3D imaging performance of each model is summarized using two metrics: (1) the mean absolute depth error (MAE), and (2) the percentage of pixels with absolute depth errors lower than 10mm. 
To understand the performance under these metrics, we divide the test set into different SBR ranges and report the metrics for each range individually. We also visualize the overall dataset performance as scatter plots (e.g., Fig. <ref type="figure">7</ref>) where each point shows the MAE for a given test scene and the color hue represents the mean SBR of the scene. Outliers with an MAE larger than 50mm are not visible in the plot; however, they are included in the calculation of the statistics. Finally, when comparing different compressive histogram strategies, the compression ratio is fixed. The compression ratio (CR) is the ratio of the block size to the length of the compressive histogram (i.e., CR = (M t &#183; M r &#183; M c)/K).</p></div>
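The two metrics and the compression ratio are straightforward to compute. A minimal sketch on toy depth values (the arrays below are illustrative, not from the paper's datasets):

```python
import numpy as np

# The two evaluation metrics and the compression ratio, on toy values.
def mae_mm(depth_est, depth_gt):
    """Mean absolute depth error in millimeters."""
    return float(np.mean(np.abs(depth_est - depth_gt)) * 1000)

def pct_within_10mm(depth_est, depth_gt):
    """Percent of pixels with absolute depth error below 10 mm."""
    return float(np.mean(np.abs(depth_est - depth_gt) * 1000 < 10) * 100)

def compression_ratio(Mt, Mr, Mc, K):
    """CR = (Mt * Mr * Mc) / K for one histogram block."""
    return (Mt * Mr * Mc) / K

est = np.array([1.00, 1.02, 1.05])   # estimated depths (m), illustrative
gt  = np.array([1.00, 1.00, 1.00])   # ground truth depths (m)
```

For example, a 1024 x 4 x 4 block encoded into K = 128 values gives CR = 128.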
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.">3D Imaging Performance</head><p>Compression vs. Performance: Fig. <ref type="figure">4</ref> shows the performance of compressive histograms as a function of the compression ratio. The learned coding tensors consistently outperform the temporal Fourier-based C. At low SBR and CR &gt; 100, it becomes essential for the learned coding tensors to utilize spatio-temporal information (i.e., green and red lines). Moreover, the proposed models can outperform the peak compression oracle for CR &#8804; 64. Overall, learned spatio-temporal coding tensors provide robust performance that degrades gracefully as compression increases. Importance of Learned Coding Tensors: Fig. <ref type="figure">5</ref> compares the depth reconstructions of compressive histograms with coding tensors that were optimized (ours) against coding tensors that were fixed and not optimized throughout training. The extreme quantization in coarse histograms causes large systematic depth errors. Random unoptimized coding tensors consistently produce lower-quality depth reconstructions. A well-designed coding tensor based on Fourier codes can produce reasonable depth reconstructions at 64x compression, however, at 128x compression, scene details become blurred. The proposed learned coding tensors are able to generate high-quality reconstructions comparable to the no-compression oracle. Overall, optimizing the coding tensors can provide non-trivial performance gains. What if we could compute depths in-pixel? Fig. <ref type="figure">6</ref> compares the depth reconstruction quality of two learned coding tensors at 64x compression with the peak compression oracle described in Sec. 6.1. At high SBR, all methods recover the fine and coarse scene details. 
At low SBR the peak compression oracle fails to reconstruct high-level scene structures such as the rings in the red box, while the learned coding tensors better preserve these coarse and fine details.</p></div>
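The peak compression oracle's in-pixel processing can be sketched as an argmax along the time axis, re-expanded into a one-hot 3D grid so the same 3D CNN can consume it. The shapes and the Poisson test tensor below are illustrative assumptions:

```python
import numpy as np

# Sketch of the "peak compression oracle" baseline: per-pixel depth indices
# from an argmax along the time axis, re-expanded into a one-hot 3D grid
# (all zeros except one element per spatial location).
rng = np.random.default_rng(4)
n_bins, H, W = 128, 8, 8
hist_tensor = rng.poisson(0.5, (n_bins, H, W)).astype(float)

peak_bins = hist_tensor.argmax(axis=0)      # in-pixel "argmax" depth index
one_hot = np.zeros_like(hist_tensor)
one_hot[peak_bins, np.arange(H)[:, None], np.arange(W)[None, :]] = 1.0
```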
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3.">Exploring the Coding Tensor Design Space</head><p>When does spatio-temporal coding help? Fig. <ref type="figure">7</ref> shows the effect of increasing the spatial dimension of C, at 64x compression. At high SBR, all methods have similar MAE, but coding tensors with smaller spatial dimensions better preserve fine details (e.g., sticks). On the other hand, at low SBR, coding tensors with larger spatial dimensions preserve high-level details such as the pot handle. This difference is also observed in the scatter plot, where the mean and median of the 256 &#215; 1 &#215; 1 C do not match, which indicates multiple low SBR scenes with high MAE. (Fig. 7 caption: the mean signal/background photon levels are [10, 10] and [10, 1000], respectively; for a fixed compression level, the spatial size of each model is increased from left to right and K is adjusted to maintain the same compression level; the coding tensors for all models in this plot are learned and separable.)</p><p>How does reducing the size of C affect performance? Fig. <ref type="figure">8</ref> shows the effect of reducing the size of C at 128x compression. The size of the coding tensors is reduced by training models with separable coding tensors that operate on smaller histogram blocks. The performance difference between full and separable coding tensors (1024 &#215; 4 &#215; 4) is negligible. As we further reduce the number of parameters in C, the overall performance degrades. Coding tensors with fewer parameters that operate on smaller histogram blocks tend to produce blurrier reconstructions. This can be observed in the red box, where the coding tensors with fewer than 10,000 parameters blur the spikes. Nonetheless, as discussed in Sec. 4.1, a parameter-efficient C is desirable due to the limited in-sensor memory. 
Ultimately, a practical compressive SPAD-based 3D camera design requires determining the trade-off between parameter efficiency and 3D imaging quality, which may depend on the application.</p></div>
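The parameter-efficiency argument can be made concrete with a quick count. Assuming a separable C_k factors into a temporal code of M_t values and a spatial code of M_r x M_c values (the exact factorization form used in the paper is an assumption here):

```python
# Parameter-count arithmetic for full vs. separable coding tensors,
# assuming each separable C_k factors into a temporal code (Mt values)
# and a spatial code (Mr*Mc values).
def full_params(K, Mt, Mr, Mc):
    return K * Mt * Mr * Mc

def separable_params(K, Mt, Mr, Mc):
    return K * (Mt + Mr * Mc)

# A 1024 x 4 x 4 block at 128x compression -> K = 128 coding tensors.
K, Mt, Mr, Mc = 128, 1024, 4, 4
ratio = full_params(K, Mt, Mr, Mc) / separable_params(K, Mt, Mr, Mc)
```

Under these assumptions the separable form stores roughly 16x fewer parameters for the same block size and K.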
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4.">Evaluation on Real-world Data</head><p>Fig. <ref type="figure">9</ref> shows the depth reconstructions at 256x compression of multiple compressive histograms and the no compression baseline on the real-world experimental data from <ref type="bibr">[21]</ref>. In low SBR scenes, such as the outdoor capture of the ball falling down stairs (first row), Truncated Fourier blurs the staircase edges, while the learned spatio-temporal C preserves these details. Compared to the no compression oracle, compressive histogram models produce less smooth depth images with some small artifacts. This suggests that the compressive histogram models could benefit from a spatial regularizer such as the one used when training the no compression oracle. Additional results and details on this evaluation are available in the supplement.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.5.">On-chip Implementation and Power Analysis</head><p>To validate the feasibility of compressive histograms, we implement them on the UltraPhase chip, which has a 3 &#215; 6 processor core array <ref type="bibr">[4]</ref>. At this time, UltraPhase has not been 3D stacked on a SPAD sensor, thus we only evaluate its compute and data readout power consumption. Fig. <ref type="figure">10</ref> shows the power dissipated by UltraPhase when processing photon timestamps with different methods and transferring data off-sensor. Although compressive histograms dissipate more power on computation (blue), this is less than 0.3% of the overall power consumption, which is dominated by data readout. Due to the limited memory in UltraPhase (4096 bits per core), our method learns C spatial and uses Fourier codes for C temporal due to their memory-efficient implementation. We also quantize C to 8 bits and find no degradation in performance. Refer to the supplement for additional details.</p></div>
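The 8-bit quantization mentioned above can be sketched as symmetric uniform quantization of the coding tensors; the rounding scheme below is an assumption, not necessarily the one used on-chip, and the quantization error is bounded by half the step size:

```python
import numpy as np

# Sketch of 8-bit weight quantization for the coding tensors: symmetric
# uniform quantization mapping the maximum magnitude to the int8 range.
rng = np.random.default_rng(5)
C = rng.standard_normal((16, 64))            # illustrative coding matrix

scale = np.abs(C).max() / 127.0              # quantization step size
C_q = np.clip(np.round(C / scale), -127, 127).astype(np.int8)
C_deq = C_q.astype(float) * scale            # dequantized weights

max_err = np.abs(C - C_deq).max()            # bounded by scale / 2
```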
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Discussion</head><p>SPAD-based 3D cameras encounter a data bottleneck between the SPAD array and the compute module when transferring the photon timestamps. A histogram tensor can help summarize timestamps at low resolutions, but as megapixel SPAD arrays become available, histogram tensors also lead to a data bottleneck. To overcome this limitation, we proposed compressive histograms as a compact representation that can be built on-the-fly, as each photon is detected. As a consequence, compressive histograms can reduce the in-sensor memory and data rates because neither the photon data stream nor a histogram tensor needs to be stored or transferred. Our results show that high-quality depth information can be recovered from a learned compressive histogram representation that is up to 2 orders of magnitude smaller than a histogram tensor representation. Practical compressive histogram operating points: The learned separable 256x4x4 and 256x2x2 coding tensors achieved a good balance between parameter count and reconstruction quality, and require on the order of 10-100 kbits of memory. In an UltraPhase-like chip, this would require distributing C among at least 8x8 cores (4 kbits per core). If the temporal dimension is fixed to Fourier-based C temporal such as the ones from <ref type="bibr">[13,</ref><ref type="bibr">36]</ref>, the number of parameters can be reduced by 10-40x (see Suppl. Sec. 3.3), which enables storing C on a per-pixel basis and made the UltraPhase evaluation possible. Overall, for a 1MP SPAD sensor with 1000 bins per pixel, compressive histograms that provide more than 50x compression would result in practical data rates of &#8764;0.6 GB/s, which USB 3.2 can support.</p><p>Trade-off space: Compressive histograms establish a trade-off between reconstruction fidelity, data bandwidth, and in-sensor compute and memory resources. 
For a fixed bandwidth, a coarse histogram requires arguably the lowest in-sensor resources but leads to poor reconstructions. Fourier-based histograms can improve reconstruction fidelity, at the expense of increased in-sensor computation and memory. The proposed generalization of compressive histograms allows further increasing reconstruction accuracy, at the expense of additional in-sensor memory over Fourier-based methods. Ultimately, the correct trade-off for a given scenario will be determined by power, application, and hardware constraints.</p><p>Although compressive histograms require more in-sensor computation than coarse histograms, our method presents a practical solution from a power consumption and a real-time application perspective. Power consumption is primarily determined by data rates. Fig. <ref type="figure">10</ref> and Suppl. Fig. <ref type="figure">4</ref> show that despite compressive histograms dissipating 100x more processing power than coarse histograms, their overall power consumption is still lower even when data rates are only reduced by 2x. The processing time for building compressive histograms is &#8764;0.5ms (Suppl. Fig. <ref type="figure">5</ref>). From a real-time (i.e., 30 FPS) or a high-speed application standpoint, this additional processing time is nearly negligible since it is overlapped with the acquisition/exposure time.</p><p>From a hardware perspective, the importance of reducing the memory overhead of C depends on resolution. In low-resolution SPAD cameras, memory-efficient C are essential because they are stored within a single pixel or a few pixels. The proposed method allowed integrating memory-efficient Fourier codes with learned spatial codes, which were implemented on the 2x2 pixels of UltraPhase (Suppl. Sec. 2). 
As resolution increases and C is shared among more pixels, the memory overhead reduces, and less parameter-efficient C with increased reconstruction fidelity can be considered.</p><p>Bias in Learned C: In the supplement, we show that coding tensors with M t = N t develop a depth range bias. These coding tensors learn to zero out photons coming from distances that are less common in the dataset since they are usually background/noise photons. Interestingly, learned coding tensors with M t &lt; N t avoid this bias and generalize to depths that are less common in the training set.</p><p>Generalization: There are multiple generalization axes including signal and ambient light levels, sensor and laser parameters (e.g., resolution, pulse waveform), and scene response complexity (e.g., depth range, indirect reflections). Our evaluation has probed some of them. Specifically, our training set only included a subset of the signal and ambient light levels that the test set contained. Furthermore, the model was trained and tested on histogram tensors with different resolutions. Finally, the dataset bias investigation showed that models with N t &#8804; 256 generalized well to less prevalent depths during training (Suppl. Sec. 1).</p><p>We further discuss the model's ability to generalize in other scenarios. For instance, all learned models (including baselines) are unlikely to generalize well to wider laser pulses, because if the model is trained with a narrow pulse width, it learns to extract signal information from high frequencies which will only contain noise in the wider pulse. Similarly, these learned models may generalize to narrower pulse widths. Furthermore, we anticipate a learned C with the properties described in <ref type="bibr">[13]</ref> to be robust to diffuse indirect reflections but may require data augmentation to generalize to other light transport scenarios.</p><p>Why not compute depths in-sensor? 
SPAD-based 3D cameras with large in-pixel memory could store per-pixel histograms and reduce data rates by computing depths in-pixel (i.e., the peak compression oracle). Our results show that compressive histograms can provide similar reconstruction quality, and outperform this method at low SBR, without requiring the storage of the full histogram tensor in-sensor.</p><p>Additional Coding Tensor Designs: Although we find promising empirical results for the coding tensor representations described in this paper, the optimal set of coding tensors will depend on the exact hardware specifications (e.g., in-pixel memory, system bandwidth) and scene-dependent parameters (e.g., SBR, geometry, albedo). Additional lightweight C designs could rely on other factorization techniques and weight quantization.</p><p>Task-specific Compressive Histograms: Histogram tensors are used in other active single-photon imaging modalities such as fluorescence lifetime microscopy <ref type="bibr">[34]</ref>, non-line-of-sight imaging <ref type="bibr">[10,</ref><ref type="bibr">23,</ref><ref type="bibr">29]</ref>, and diffuse optical tomography <ref type="bibr">[22,</ref><ref type="bibr">45,</ref><ref type="bibr">24]</ref>. Our framework could be used to find compressive representations optimized for these applications.</p></div></body>
		</text>
</TEI>
