<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>DeltaStream: 2D-Inferred Delta Encoding for Live Volumetric Video Streaming</title></titleStmt>
			<publicationStmt>
				<publisher>ACM</publisher>
				<date>06/23/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10656674</idno>
					<idno type="doi">10.1145/3711875.3729131</idno>
					
					<author>Hojeong Lee</author><author>Yu Hong Kim</author><author>Sangwoo Ryu</author><author>James Won-Ki Hong</author><author>Sangtae Ha</author><author>Seyeon Kim</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Not Available]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Volumetric video streaming is an emerging technology that facilitates highly immersive and interactive user experiences in augmented, virtual, and mixed reality (AR/VR/MR) applications. Such experiences are made possible by representing each video frame as a 3D space or object using point clouds or meshes, providing six degrees of freedom (6-DoF) movement for users. Unlike on-demand volumetric video streaming, which delivers pre-recorded content, live volumetric video streaming includes the entire pipeline from capture and volumetric content generation to delivery and rendering. Live streaming applications, such as remote surgery, virtual concerts, and telepresence <ref type="bibr">[18,</ref><ref type="bibr">32]</ref>, require real-time performance, including maintaining high frame rates (e.g., over 30 frames per second) and minimizing end-to-end latency.</p><p>While volumetric videos provide an immersive experience, streaming them requires extremely high network bandwidth <ref type="bibr">[48]</ref>. For example, streaming the raw point cloud frames in the 8i dataset at 30 FPS demands bandwidth over 3 Gbps <ref type="bibr">[21]</ref>. To address the large data sizes of volumetric video frames, compression is crucial for reducing frames to a size streamable within the available bandwidth. Although several methods have been adopted for volumetric content compression <ref type="bibr">[5,</ref><ref type="bibr">45]</ref>, they do not address temporal correlation between frames, which is the key factor in the compression efficiency of 2D video coders/decoders (CODECs). 2D video compression algorithms, such as H.264/AVC <ref type="bibr">[44,</ref><ref type="bibr">50]</ref>, achieve high compression rates by leveraging both spatial redundancy within frames and temporal redundancy between consecutive frames. 
They employ techniques such as motion estimation and compensation to exploit temporal correlations, significantly reducing the amount of data to be encoded.</p><p>Nevertheless, unlike in 2D, encoding methods that leverage temporal correlations between adjacent frames remain largely underexplored for volumetric videos. Focusing on the point cloud as the representation of a 3D video frame in this paper, we identify two key challenges in addressing temporal correlations for live volumetric video streaming.</p><p>First, the data representation of a point cloud frame is inherently irregular. In the case of 2D video, as a frame has a structured (i.e., fixed width × height) representation, it is straightforward to process spatial and temporal correlation (e.g., by subtracting frames to compute the difference). However, as a point cloud is an unstructured set of points and the number of points differs across frames, defining subtraction operations between point clouds is difficult. Therefore, it is challenging to compute and identify temporal and spatial correlations between point clouds.</p><p>Second, the computational complexity of 3D data processing, including encoding and decoding, is excessively high. The high-precision (x, y, z) position coordinates (i.e., 4 bytes for each coordinate) and (R, G, B) color attributes in point clouds, along with their irregular structure, significantly increase computational complexity, necessitating efficient processing on resource-limited devices like head-mounted displays (HMDs) <ref type="bibr">[1,</ref><ref type="bibr">10]</ref>.</p><p>Tackling these fundamental and practical challenges, we propose DeltaStream, the first live volumetric video streaming system that leverages 2D information to efficiently infer the delta¹ between volumetric video frames and reduce bandwidth. 
The key insight of DeltaStream is that 3D data is derived from 2D data, such as RGB and depth frames. Therefore, our approach aims to efficiently leverage 2D information to reduce the overhead associated with 3D processing and bandwidth usage, thereby enabling live volumetric video streaming. DeltaStream consists of the following three key components:</p><p>(1) The first component of DeltaStream involves transforming 2D temporal information into 3D. As previously discussed, the unstructured representation of point clouds makes it highly challenging to compute the differences between two frames efficiently.</p><p>To address this issue, we propose an inverse transformation method that maps the 3D point cloud back to the 2D frame to utilize 2D temporal information in the point cloud. This approach leverages pre-shared camera parameters.</p><p>(2) The second component is a set of two block-based delta encoding methods that address temporal redundancy between consecutive point cloud frames: 3D motion vectors and the delta point cloud. DeltaStream classifies blocks in 2D frames into matched blocks and mismatched blocks. Matched blocks are efficiently compressed using 3D motion vectors, whereas mismatched blocks are handled with delta point clouds. Such behavior is analogous to using residual blocks in 2D video encoding. For 3D motion vectors, we propose a method that leverages 2D motion vectors extracted from the 2D video CODEC and combines them with depth information to generate accurate 3D motion vectors.</p><p>(3) The third component addresses client-side computation to ensure real-time performance by leveraging 3D motion vectors and delta point clouds. While these methods reduce bandwidth by utilizing temporal correlation, they can introduce additional computational costs on the client when updating the previous point cloud frame to the current frame. 
To mitigate this, the server dynamically adjusts delta encoding parameters based on the observed latency characteristics of 3D motion vector updates and delta point cloud decoding. This helps meet the real-time constraints of live volumetric video streaming.</p><p>We extensively evaluate DeltaStream across a wide range of environments with various network bandwidths, client compute power, scene contents, and multiple RGB-D cameras. Our experiment results show that DeltaStream, compared to the state-of-the-art system, MetaStream <ref type="bibr">[24]</ref>, reduces the bandwidth by 71% for static scenes and 49% for dynamic scenes with 1.15× to 1.63× faster decoding speed while maintaining visual quality. As a result, DeltaStream successfully enables real-time volumetric video streaming by achieving a reliable 30 FPS and is broadly applicable to any AR/VR application, such as 3D video conferencing, that streams 3D point cloud objects live-captured from multiple depth cameras. (¹Delta refers to the difference between the previous and current frames.)</p><p>The main contributions of this paper are three-fold:</p><p>• We identify the challenges in existing live volumetric video streaming and explore the challenges in leveraging 2D information for 3D point cloud encoding. • We propose DeltaStream, the first live volumetric video streaming system that leverages temporal redundancies inferred from 2D frames to efficiently encode point cloud frames, thereby reducing bandwidth usage. • Based on extensive experiments, we verify that DeltaStream successfully achieves a reliable 30 FPS streaming performance, reducing bandwidth usage by up to 71% with 1.63× faster decoding speed compared to the state-of-the-art methods.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Background 2.1 Construction of a 3D Video Frame</head><p>While 2D video frames usually consist of 3-tuple data (e.g., RGB or YUV), 3D video frames are mostly represented in two ways: point cloud and mesh. A point cloud is a set of points in 3-dimensional space where each point is a 15-byte 6-tuple. In detail, a point consists of a 3D coordinate using 4 bytes per axis (i.e., x, y, z) and 1 byte for each color channel (i.e., R, G, B). A mesh, on the other hand, connects the points with edges to form polygons, typically triangles, creating a contiguous surface representation of a 3D object. Point clouds can be generated in several ways. One method is to capture an object using multiple RGB cameras, followed by considerable post-processing of the images to combine them into a point cloud <ref type="bibr">[21,</ref><ref type="bibr">27]</ref>. Another way to generate a point cloud is to capture color and depth frames using a depth camera (RGB-D), such as Intel RealSense [6] and Orbbec Femto <ref type="bibr">[12]</ref>. As illustrated in Figure <ref type="figure">1</ref>, a point cloud can be directly created from the captured frame using the camera parameters according to Eq. (1):</p><p>z = d · d_s, x = (u − c_x) · z / f_x, y = (v − c_y) · z / f_y, (1)</p><p>where u and v are 2D pixel coordinates and d is the depth value corresponding to the pixel. Note that the depth scale d_s, principal point offsets c_x and c_y, and focal lengths f_x and f_y are all camera intrinsic parameters, which are fixed values. Each point is generated as long as the pixel has a valid depth value. As the latter approach does not involve excessive computation, it is appropriate for real-time streaming. When using multiple depth cameras, point clouds are generated from the color and depth captures of each camera and are aligned into a single coordinate system.</p></div>
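The back-projection of Eq. (1) can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the function and parameter names (fx, fy, cx, cy, depth_scale) are illustrative stand-ins for the camera intrinsics.

```python
# Sketch of Eq. (1): back-projecting a depth frame into a point cloud with
# the pinhole camera model. Only pixels with a valid (non-zero) depth value
# produce a point, matching the text above.
import numpy as np

def backproject(depth, rgb, fx, fy, cx, cy, depth_scale):
    """Map every pixel with a valid depth to a 6-tuple (x, y, z, R, G, B)."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]          # per-pixel (v, u) coordinates
    z = depth * depth_scale            # metric depth
    valid = z > 0                      # skip pixels without depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    xyz = np.stack([x[valid], y[valid], z[valid]], axis=1)
    colors = rgb[valid].astype(np.float32)   # (N, 3) R, G, B
    return np.hstack([xyz, colors])
```

Because the whole computation is array-wise, it scales to full-resolution RGB-D frames without a per-pixel Python loop.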
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">2D and 3D Video CODECs</head><p>2D video CODECs have been extensively researched and developed through the leading efforts of the Moving Picture Experts Group (MPEG). This has resulted in efficient video CODECs, such as advanced video coding (AVC) <ref type="bibr">[44,</ref><ref type="bibr">50]</ref> and high-efficiency video coding (HEVC) <ref type="bibr">[47]</ref>. The principle behind these CODECs is to encode only the differences between frames (i.e., delta information) using spatial and temporal correlation. To reduce temporal redundancy, the CODECs use motion vectors instead of directly calculating these differences <ref type="bibr">[35,</ref><ref type="bibr">58]</ref>. The performance of those CODECs can be further optimized through hardware acceleration <ref type="bibr">[11]</ref>. While 2D video CODECs have been developed over decades, point cloud compression techniques are still in their early stages. MPEG is actively working on video-based point cloud compression (V-PCC) <ref type="bibr">[23,</ref><ref type="bibr">46]</ref>, which projects the point cloud onto multiple planes from various angles and then utilizes existing 2D video CODECs for compression, albeit at the cost of high encoding latency. Consequently, V-PCC is not well-suited for live volumetric video streaming and is primarily used for storage purposes. Other approaches are based on tree structures. Draco [5] and Point Cloud Library (PCL) <ref type="bibr">[45]</ref>, widely used 3D CODECs, utilize a KD-tree <ref type="bibr">[15]</ref> and an Octree <ref type="bibr">[40]</ref>, respectively. Both the Octree and the KD-tree divide the 3D space into smaller regions recursively, mapping each point in the space to a node in the tree structure. The main difference between the two tree structures is that every node in an Octree has 8 child nodes, whereas a KD-tree is a binary tree that splits along one coordinate axis at a time. 
However, these state-of-the-art point cloud compression techniques overlook the temporal correlation between neighboring point cloud frames, failing to encode 3D video frames as efficiently as 2D video CODECs encode 2D frames.</p></div>
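The recursive space division described above can be made concrete with the child-index computation an octree performs at each level. The sketch below is illustrative (the function name and bit layout are assumptions, not Draco's or PCL's actual encoding).

```python
# Minimal sketch of octree partitioning: at every level, a point is assigned
# to one of 8 child octants by comparing it with the midpoint of the current
# cell on each axis (one bit per axis).
def octant_index(point, center):
    """Return the child octant index (0-7) of `point` relative to `center`."""
    x, y, z = point
    cx, cy, cz = center
    return int(x >= cx) + 2 * int(y >= cy) + 4 * int(z >= cz)
```

Recursing on the selected octant and halving the cell extents yields the hierarchy that octree-based CODECs serialize.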
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Motivation 3.1 Inefficiency of Existing Point Cloud Compression Methods</head><p>Transmitting raw point cloud frames requires extremely high bandwidth (e.g., over 3 Gbps for the 8i dataset <ref type="bibr">[21]</ref> to achieve 30 FPS), making point cloud compression essential before transmission. Alongside reducing bandwidth usage, encoding and decoding latencies must be addressed to enable real-time streaming at 30 FPS. For a motivational study comparing the encoding and decoding speed, as well as the data size, of three widely used point cloud compression methods, we use the 8i dataset <ref type="bibr">[21]</ref> as a benchmark, one of the most commonly used datasets in on-demand volumetric video streaming scenarios <ref type="bibr">[34,</ref><ref type="bibr">38,</ref><ref type="bibr">55]</ref>.</p><p>Figure <ref type="figure">2</ref>: Point cloud compression performance on an Intel Core i9-14900K CPU using the high-resolution longdress sequence from the 8i dataset <ref type="bibr">[21]</ref>. (a) Average encoding and decoding speed. (b) Average data size for 30 frames after compression.</p><p>As shown in Figure <ref type="figure">2</ref>, V-PCC offers high compression efficiency by leveraging a 2D video CODEC to exploit the temporal correlation, but suffers from extremely slow encoding (&lt;1 FPS) and decoding (&lt;2 FPS) performance. V-PCC's high latency <ref type="bibr">[16,</ref><ref type="bibr">34,</ref><ref type="bibr">53]</ref> and excessive computational demands <ref type="bibr">[30]</ref> make it unsuitable for real-time volumetric video streaming. In contrast, while Draco and PCL achieve lower compression efficiency, they offer much faster encoding and decoding speeds compared to V-PCC. Notably, Draco provides significantly faster decoding speeds and maintains stable performance even when encoding noisy point clouds <ref type="bibr">[52]</ref>. 
Consequently, recent volumetric video streaming systems <ref type="bibr">[20,</ref><ref type="bibr">24,</ref><ref type="bibr">25,</ref><ref type="bibr">34,</ref><ref type="bibr">55]</ref> are predominantly designed using Draco, the state-of-the-art point cloud compression technique. While Draco [5] is widely used to support real-time streaming in many systems, it has a critical inefficiency: it employs a frame-by-frame compression approach, ignoring the temporal redundancy between consecutive point cloud frames. The same limitation applies to PCL point cloud compression. To overcome these challenges, DeltaStream introduces a novel approach that effectively reduces temporal redundancy, significantly lowering bandwidth requirements.</p></div>
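The raw-bandwidth figure cited above follows directly from the 15-byte-per-point representation described in Section 2.1. As a back-of-envelope check, assuming roughly 800K points per frame (an assumption typical of the 8i full-body sequences, not a number stated in the paper):

```python
# Back-of-envelope estimate of raw, uncompressed point cloud bandwidth.
BYTES_PER_POINT = 15        # (x, y, z) at 4 B each + (R, G, B) at 1 B each
POINTS_PER_FRAME = 800_000  # assumed typical count for an 8i full-body frame
FPS = 30

bits_per_second = BYTES_PER_POINT * POINTS_PER_FRAME * FPS * 8
gbps = bits_per_second / 1e9   # ~2.9 Gbps, in line with the >3 Gbps cited
```

The estimate lands in the same ballpark as the cited figure, which is why compression is a hard prerequisite for streaming.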
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">High Computational Cost in Point Cloud Inter-Frame Compression</head><p>Addressing temporal redundancy between point cloud frames entails significant challenges because: (1) the number of points in the previous and current point cloud frames can differ, (2) the inherently unstructured nature of point clouds results in no explicit geometric correspondence between frames, making it challenging to identify spatial relationships across consecutive frames, and (3) processing point clouds is computationally intensive, as it requires additional processing steps compared to 2D video frames and takes 6-tuple data (i.e., (x, y, z, R, G, B)) as input.</p><p>Encoding and decoding bottleneck. Due to the high computational cost, achieving real-time performance remains a significant challenge. Mekuria et al. <ref type="bibr">[41]</ref> proposed a 3D cube block (i.e., k × k × k) based inter-prediction approach, claiming nearly real-time encoding. However, the encoding takes approximately 1 second (≈ 1 FPS) for each frame. Deep learning-based inter-frame compression methods struggle to achieve real-time decoding even with GPU support (e.g., decoding takes 0.714 seconds for each frame in <ref type="bibr">[13]</ref>) and inevitably become a bottleneck for clients without powerful GPUs <ref type="bibr">[37]</ref>. Hermes <ref type="bibr">[49]</ref> generates a reference frame with low entropy, but its extremely slow encoder performance (&lt; 1 FPS) <ref type="bibr">[43]</ref> limits its applicability to on-demand volumetric video streaming. The Octree-based XOR operation for differential encoding <ref type="bibr">[29]</ref> is designed for real-time processing but is limited to nearly static scenes, as it considers only geometry and ignores colors <ref type="bibr">[41,</ref><ref type="bibr">49]</ref>. 
Therefore, a new point cloud compression method that can efficiently compute temporal redundancy with minimal processing for both encoding and decoding is highly needed. 3D residual processing complexity. A naive way to use 3D motion vectors is to simply follow the same approach used for 2D motion vectors, that is, applying 3D motion vectors for motion compensation on the point cloud and updating the residual information. In the case of 2D videos, this process is straightforward because the data structure has fixed dimensions (e.g., addition of two 2D matrices with RGB channels of width × height dimensions). In contrast, a point cloud is an unstructured set of points; therefore, point matching is needed to identify which points require residual updates. This approach introduces significant computational overhead on the client side. The computational complexity is O(n · m), where n and m denote the number of points in the previous and current point clouds, whereas a 2D residual update has the much lower per-pixel complexity of O(1).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">DeltaStream Design 4.1 DeltaStream Overview</head><p>Figure <ref type="figure">3</ref> illustrates the system overview of DeltaStream. A live media server is fed RGB-D video frames from multiple RGB-D cameras. Using preset camera parameters, the block-based Delta Encoder in the server calculates the temporal difference between frame t−1 and frame t, where each frame includes two types of 2D inputs, RGB and depth frames. Using block-based calculations, the encoder classifies each block as either a matched block or a mismatched block. For matched blocks, the Delta Encoder computes 3D motion vectors, while for mismatched blocks, it generates a delta point cloud. 
Then, the 3D motion vectors and delta point cloud are encoded and transmitted to the client along with their respective block coordinates.</p><p>On the client side, the block-based Delta Decoder utilizes preset camera parameters to decode point cloud frame t based on frame t−1. First, points corresponding to matched blocks are identified in point cloud frame t−1, and motion compensation is performed on those points. Next, points corresponding to mismatched blocks are identified and updated using the delta point cloud. Finally, the resulting delta-decoded point cloud frame t is transformed into the world coordinate system before being rendered on the client display, such as an HMD. Since the system assumes stationary cameras, the coordinate transformation from each camera to the world coordinate system can be performed straightforwardly using methods like checkerboard calibration <ref type="bibr">[26]</ref>.</p><p>The goal of the DeltaStream system design is to address the following practical challenges:</p><p>• DeltaStream overcomes the main challenge of volumetric video streaming, extremely high bandwidth usage between the server and client, by significantly reducing the bandwidth through the inference of temporal redundancy between point cloud frames in 2D space. • Achieving high compression efficiency and real-time inter-frame compression processing at the same time is challenging due to the unstructured data representation of point cloud frames.</p><p>DeltaStream addresses this challenge with a computationally efficient real-time delta encoding and decoding system, which backtracks from 3D to 2D data to establish a mapping between 2D and 3D blocks. • DeltaStream enhances compression efficiency while maintaining visual quality through a block-based approach that effectively filters the information corresponding to frame-to-frame differences. 
• DeltaStream adjusts the computational burden on the client based on latency feedback observed at the client, adaptively supporting a variety of user devices with different computational abilities.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Block-based Delta Encoder</head><p>Calculating and updating motion vectors for every pixel in a 2D video frame is highly inefficient. Consequently, existing 2D CODECs process video frames on a 2D block-by-block² basis, with these matrix computations optimized through hardware acceleration <ref type="bibr">[11]</ref>.</p><p>To leverage this efficiency, the Delta Encoder employs 2D block-based encoding to calculate 2D motion vectors, 2D RGB frame differences, and depth information at the block level. Based on this, the Delta Encoder classifies high difference blocks into matched and mismatched blocks, with each block type being encoded differently for computational efficiency in compressing temporal redundancy. Matched and mismatched blocks. Figure <ref type="figure">4</ref> describes how the Delta Encoder processes the blocks through several steps and finally classifies them into matched and mismatched blocks using two thresholds, τ_diff and τ_mv³. i) High difference blocks are those requiring updates due to the large temporal differences observed between two consecutive frames. To compute these blocks, the Delta Encoder first assigns 16 × 16 block indices for all blocks in the 2D frame as the set B = {b_0, b_1, . . . , b_N}, where N is the total number of blocks determined by the width and height of the 2D frame and the block size. For each block b_i in frame t, the Delta Encoder calculates the temporal difference between frame t and frame t−1 as d_i = Σ_{(u,v)∈b_i} ‖I_t(u, v) − I_{t−1}(u, v)‖_1, where (u, v) represents the pixel coordinates within the block and I(u, v) is the RGB vector at those coordinates. 
Based on the temporal difference d_i, high difference blocks are filtered with a predefined threshold τ_diff into the set B_high = {b_i ∈ B | d_i &gt; τ_diff}. Among the high difference blocks, matched and mismatched blocks are identified using a threshold τ_mv after motion compensation. (²A 2D block refers to dividing a width × height 2D frame into smaller fixed-size regions of pixels (i.e., 16×16). Similarly, a 3D block usually refers to a cube with dimensions k × k × k, used to partition a point cloud in 3D space into smaller subspaces. ³In our system, we empirically set τ_diff = 10000 and τ_mv = 5000. The decision on these threshold values is explained in Section 6.7.)</p><p>ii) Matched blocks refer to blocks that can be efficiently encoded through 3D motion vectors. To identify these blocks, the Delta Encoder calculates the temporal difference d̃_i between RGB frame t and the 2D motion-compensated RGB frame from t−1, in the same way as d_i is computed. Using the predefined threshold τ_mv, the matched blocks are determined as B_match = {b_i ∈ B_high | d̃_i &lt; τ_mv}.</p><p>The same approach could be used to leverage the depth information as well. However, based on our observations, it is more efficient to use motion vectors from the RGB frame alone for the following reasons. First, the depth information from RGB-D cameras is often noisy, making it difficult to rely on the temporal difference in z-axis data between consecutive frames. 
Second, since the Delta Encoder employs 16 × 16 blocks, the z-values within each block tend to be consistent, reducing the need for additional depth-based calculation.</p><p>iii) Mismatched blocks are the high difference blocks that fail to reduce the block difference after 2D motion compensation, denoted as B_mismatch = B_high − B_match. Since these blocks cannot be efficiently encoded using motion vectors despite their large temporal differences, the Delta Encoder processes their temporal difference as a delta point cloud. 3D motion vector. DeltaStream introduces a method that transforms 2D motion vectors combined with depth information into 3D motion vectors for matched blocks, as described in Figure <ref type="figure">5</ref>. Suppose the 2D frame and 2D motion vectors have (u, v)-coordinates, while their 3D counterparts have (x, y, z)-coordinates. To transform 2D motion vectors, mv(u, v), into 3D motion vectors, mv(x, y, z), the Delta Encoder first extracts 2D motion vectors from each block in B_match using a 2D video CODEC. Based on the 2D motion vector, it identifies how much a 16 × 16 source block in the previous frame needs to be shifted along the (u, v) axes to estimate the motion and reduce the difference with the destination block in the current frame. Subsequently, the points in the 2D source and destination blocks are transformed into (x_src, y_src, z_src) and (x_dst, y_dst, z_dst), respectively, by following Eq. (1). 
Finally, the differences between these transformed points are used as the 3D motion vector, mv(x, y, z). Then, the live media server sends the coordinates of the source block corresponding to each matched block together with the calculated 3D motion vectors (i.e., pairs of (u_src, v_src) and mv(x, y, z) are transmitted). Delta point cloud. Using 3D motion vectors improves compression efficiency, but it also imposes a significant computational burden on the client during decoding. Therefore, the Delta Encoder compresses the temporal difference as a point cloud only for blocks that cannot be efficiently encoded with 3D motion vectors (i.e., mismatched blocks). First, the Delta Encoder transforms the pixels in mismatched blocks of frame t into a point cloud, which we call the delta point cloud, and encodes it using Draco <ref type="bibr">[5]</ref>. The Delta Encoder uses Draco when compressing delta point clouds because Draco exploits spatial redundancy within a point cloud frame. When encoding mismatched blocks, the Delta Encoder does not encode the frame difference but instead directly encodes the point cloud in frame t. This approach is adopted to enhance decoding efficiency by leveraging vector operations, which will be addressed in detail in Section 4.3. Then, the server transmits the encoded delta point cloud with the corresponding block indices (i.e., the 2D coordinates of B_mismatch and the Draco-encoded delta point cloud are transmitted). As a result, the Delta Encoder leverages both temporal and spatial correlations for more efficient encoding. Keyframe. A keyframe is a frame that transmits the entire point cloud of the object, similar to an I-frame in 2D video CODECs. The keyframe is compressed using Draco [5] point cloud compression without delta encoding, analogous to intra-coding in 2D CODECs. 
In addition to serving as a reference frame, the keyframe also helps alleviate the cumulative errors that may arise from continuous delta encoding. In DeltaStream, we insert a keyframe every 5 frames.</p></div>
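The block classification above can be sketched compactly with array-wise operations. This is a hedged illustration, not the paper's implementation: the per-block L1 difference and the two thresholds follow Section 4.2 (τ_diff = 10000, τ_mv = 5000), while the function names and the boolean-map interface are assumptions.

```python
# Sketch of the Delta Encoder's block classification: per-16x16-block L1
# differences between consecutive RGB frames; blocks above tau_diff are
# "high difference", and among them, blocks whose difference after 2D motion
# compensation stays at or above tau_mv are mismatched.
import numpy as np

BLOCK = 16

def block_l1_diff(frame_a, frame_b):
    """Per-16x16-block L1 difference between two HxWx3 RGB frames."""
    diff = np.abs(frame_a.astype(np.int32) - frame_b.astype(np.int32))
    h, w, _ = diff.shape
    # group pixels into 16x16 tiles, then sum over the tile and color axes
    return diff.reshape(h // BLOCK, BLOCK, w // BLOCK, BLOCK, 3).sum(axis=(1, 3, 4))

def classify_blocks(prev, curr, mc_prev, tau_diff=10000, tau_mv=5000):
    """Return boolean maps of matched and mismatched blocks."""
    d = block_l1_diff(curr, prev)        # raw temporal difference d_i
    d_mc = block_l1_diff(curr, mc_prev)  # difference after motion compensation
    high = d > tau_diff
    matched = high & (d_mc < tau_mv)
    mismatched = high & ~matched
    return matched, mismatched
```

Matched blocks would then be encoded as 3D motion vectors and mismatched blocks as a Draco-compressed delta point cloud, as described above.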
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Block-based Delta Decoder</head><p>After receiving the encoded data from the server, the Delta Decoder performs the decoding process sequentially, as shown in Figure <ref type="figure">6</ref>. Block matching. To decode block-based delta encoding on the client, a computationally efficient method is required to link the 3D positions of the point cloud with the 2D-based information from the delta encoding. Therefore, we propose an inverse transformation technique that traces 3D point positions back to 2D blocks and refer to this process as block matching, which is a different concept from the classification of matched and mismatched blocks based on the thresholds defined in Section 4.2. This approach alleviates the high computational cost in the 3D space and is performed through the following process. From the inverse of Eq. (<ref type="formula">1</ref>), the transformation from 3D point positions (x, y, z) into the (u, v) 2D frame domain (i.e., pixel locations) is derived as Eq. (2):</p><p>u = x · f_x / z + c_x, v = y · f_y / z + c_y. (2)</p><p>The resulting (u, v) coordinates are rounded to integers to align with pixel indices. The transformation is computationally efficient because the operation can be performed on entire (x, y, z) vectors at once, without the need to repeatedly compute each individual point. As a result, the 2D image coordinates (u, v) are mapped to either the source block location for 3D motion compensation or the 2D block indices for the delta point cloud update. 3D motion compensation. The client identifies the points belonging to each received source block through this inverse transformation process. 
Specifically, given (u_src, v_src) as the top-left coordinate of the 2D source block, it filters the points satisfying u_src ≤ u &lt; u_src + 16 and v_src ≤ v &lt; v_src + 16. Finally, the client shifts the points in the source block (x_src, y_src, z_src) by adding the corresponding 3D motion vector mv(x, y, z) to each point. Delta point cloud update. After the 3D motion compensation process, the client decodes and updates the delta point cloud along with the corresponding block indices of B_mismatch. The only difference is that the delta point cloud is paired with the block index, while the 3D motion vectors are paired with the source block's location (u_src, v_src). The received mismatched blocks in B_mismatch are the blocks that should be removed from the previous point cloud and replaced with the delta point cloud. To remove the blocks, block indices are computed as b = ⌊v/16⌋ · ⌊W/16⌋ + ⌊u/16⌋, where W denotes the 2D frame width.</p><p>Vector-wise computation. Figure <ref type="figure">7</ref> shows the log-scale difference between point-based computation and vector-wise computation, assuming that each of the two point clouds has 100K points. 
Block-wise or vector-wise computations are processed far more efficiently than point-based computations for two reasons: 1) point-based computation needs to match the points across the two point clouds before further processing, and therefore has O(N^2) complexity, and 2) block-wise computation can be processed using vector-wise operations that can be additionally accelerated by parallel processing.</p></div>
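The block-matching, block-filtering, and motion-compensation steps above can be sketched as follows. This is a minimal illustrative sketch, not the paper's C++ implementation: the pinhole intrinsics fx, fy, cx, cy stand in for the actual Eq. (2), and all function names are hypothetical.

```python
# Minimal sketch of client-side block matching (Section 4.3).
# The intrinsics fx, fy, cx, cy and the depth convention are
# illustrative assumptions, not DeltaStream's exact Eq. (1)/(2).
BLOCK = 16  # block edge length in pixels

def project_to_pixels(points, fx, fy, cx, cy):
    """Inverse transform: trace 3D points (X, Y, Z) back to 2D
    pixel coordinates (u, v), rounded to integer pixel indices."""
    pixels = []
    for x, y, z in points:
        u = round(fx * x / z + cx)
        v = round(fy * y / z + cy)
        pixels.append((u, v))
    return pixels

def block_index(u, v, width):
    """2D block index of a pixel, for a frame of the given width."""
    return (v // BLOCK) * (width // BLOCK) + (u // BLOCK)

def points_in_block(points, pixels, u0, v0):
    """Filter points whose pixels fall in the 16x16 block at (u0, v0)."""
    return [p for p, (u, v) in zip(points, pixels)
            if u0 <= u < u0 + BLOCK and v0 <= v < v0 + BLOCK]

def motion_compensate(block_points, mv):
    """Shift every point of a source block by one 3D motion vector."""
    dx, dy, dz = mv
    return [(x + dx, y + dy, z + dz) for x, y, z in block_points]
```

Because the projection and filtering operate on whole coordinate arrays, the same logic maps directly onto the vector-wise (e.g., Eigen) formulation the paper relies on.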
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Computation-Adaptive Online Control</head><p>DeltaStream addresses temporal redundancy through delta encoding using 3D motion vectors and a delta point cloud. However, this approach is not without cost. While it effectively reduces bandwidth, it adds computational burden on the client side during decoding. Therefore, control is necessary on the server, considering the level of decoding complexity, to meet real-time decoding and rendering on the client. We show that this control can adapt to the client's computational power.</p><p>We first conduct a correlation analysis on latency and the factors affecting it. We use the longdress, soldier, loot, and redandblack sequences from the 8i dataset <ref type="bibr">[21]</ref> and point clouds generated from RGB-D frames captured with Intel RealSense cameras <ref type="bibr">[6]</ref>. Figure <ref type="figure">8</ref> (a) shows a linear correlation between the number of points in a point cloud and Draco [5] decoding latency, which is consistent with findings from previous studies <ref type="bibr">[20]</ref>. We also observe a linear correlation between the number of 3D motion vectors and update latency, as shown in Figure <ref type="figure">8 (c)</ref>. However, it is hard to find a relationship between the number of points and 3D motion vector update latency, as shown in Figure <ref type="figure">8 (b)</ref>. By leveraging the two clear correlations, the server adjusts the number of 3D motion vectors used in delta encoding to support the client's real-time processing.</p><p>The primary control knob is the limit on the maximum number of 3D motion vectors that can be used in delta encoding. Intuitively, increasing the number of 3D motion vectors reduces the number of points transmitted in the delta point cloud, but proportionally increases the computational burden on the client side at the same time. 
As shown in Figure <ref type="figure">8</ref> (a) and 8 (c), the slope of the linear regression for the relationship between the number of 3D motion vectors and update latency (≈ 0.5) is much steeper than that of the function relating the number of points to decoding latency (≈ 0.00025). Each 3D motion vector covers a 16 × 16 block in delta encoding, meaning that using m motion vectors can reduce the number of points in the delta point cloud by approximately 16 × 16 × m. Since changes in the number of 3D motion vectors have a dominant impact on latency (i.e., the latency change per 3D motion vector satisfies 0.5 ≫ 0.00025 × 16 × 16), the number of 3D motion vectors is controlled to meet the target rendering FPS (i.e., 30) on the client side. The client periodically feeds back its rendering FPS to the server. If the rendering FPS is below the target FPS (e.g., 28 FPS), the server updates the number of 3D motion vectors as n_mv ← α · n_mv, where α = 0.8 is a decaying parameter. The decaying parameter α was empirically set to 0.8 in DeltaStream. Note that the decoding complexity can be adjusted according to the client's computational capability during the initial phase of streaming. In particular, a lower α value results in more coarse-grained control, as motion vectors are adjusted more aggressively.</p></div>
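The control loop described above can be sketched as follows; this is an illustrative sketch, not the paper's implementation. The function names are hypothetical, and only the multiplicative decay on a low FPS report follows the paper; the slope constants come from the linear fits reported for Figure 8.

```python
# Illustrative sketch of DeltaStream's computation-adaptive control
# (Section 4.4). Names are hypothetical; the decay rule
# n_mv <- 0.8 * n_mv on a low FPS report is from the paper.
TARGET_FPS = 30
ALPHA = 0.8  # empirically chosen decaying parameter

def adapt_motion_vector_cap(n_mv, reported_fps, target=TARGET_FPS, alpha=ALPHA):
    """Shrink the 3D motion vector budget whenever the client's
    reported rendering FPS falls below the target."""
    if reported_fps < target:
        return max(1, int(alpha * n_mv))
    return n_mv

def estimate_client_latency_ms(n_points, n_mv):
    """Client latency model from the linear fits in Figure 8:
    ~0.00025 ms per decoded point and ~0.5 ms per 3D motion vector."""
    return 0.00025 * n_points + 0.5 * n_mv
```

With these slopes, one 3D motion vector costs far more to decode than the roughly 16 × 16 points it removes from Draco decoding (0.5 ≫ 0.00025 × 256), which is why capping n_mv is the effective knob.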
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Implementation</head><p>Hardware. Figure <ref type="figure">9</ref> shows our testbed setup for live volumetric video streaming. In the experiments, our system uses four depth cameras capturing an object in the center from the front, back, right, and left. The camera models include two Intel RealSense D435i, an Intel RealSense D455, and an Intel RealSense D415 <ref type="bibr">[6]</ref>. These cameras are connected to the server through wired connections and capture RGB and depth frames at 30 FPS to generate point clouds. The processing server is equipped with an Intel Core i9-14900K CPU and 64GB of memory. We implemented the remote client on a laptop equipped with an Intel Core i7-9750H @ 2.60GHz CPU, which has slightly lower computational capabilities than the Apple Vision Pro's M2 chip <ref type="bibr">[1]</ref>. Software. The server operates in a Docker container using the Ubuntu base image, and the client runs on Ubuntu Desktop. RGB and depth frames are captured through the Intel RealSense software development kit (SDK) <ref type="bibr">[7]</ref>, and the captured data is processed into point clouds using the Open3D library. The Open3D library is also employed for visualizing and manipulating point clouds. Delta point cloud compression is performed with Draco [5]. OpenCV and Open3D are used for handling 2D and 3D object manipulations, such as calculating differences between 2D frames. Eigen <ref type="bibr">[3]</ref> is used to perform matrix operations and transformations on point cloud data. This includes computing block indices and determining points to remove based on motion vectors. To extract motion vectors, FFmpeg <ref type="bibr">[4]</ref> is used to create an MPEG-4-coded video of the current and the previous 2D frame. Then, the av_frame_get_side_data function in the libavutil library [8] is used to extract motion vectors from the MPEG-4 video. 
For asynchronous network communication between the server and the client, the Boost Asio library is used, whilst serialization is done through the cereal library [2]. The implementation employs multi-threading to concurrently process heavy tasks such as Draco encoding/decoding and motion vector extraction. Finally, Open3D is used to render the transmitted point cloud on the client.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>RGB-D Camera</head><note type="other">Server WiFi AP Client</note><p>Figure <ref type="figure">10</ref>: Comparisons of end-to-end performance results with two baselines (LiveScan3D <ref type="bibr">[31]</ref> and MetaStream <ref type="bibr">[24]</ref>) and our DeltaStream for static and dynamic scene contents targeting 30 FPS.</p><p>The implementation comprises approximately 6k lines of code written in C++, including modules for camera calibration, depth data processing, 3D visualization, and communication protocols. The source code of DeltaStream is available online <ref type="foot">4</ref>.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Performance Evaluation</head></div><div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Evaluation Setup</head><p>Experiment environment. For performance evaluation, we set up a network connecting the server and client through commercial WiFi (802.11ac, 5 GHz) and used Linux tc <ref type="bibr">[9]</ref> for bandwidth-constrained experiments on the server. The four Intel RealSense cameras were assumed to have fixed positions during the experiments, and preset camera parameters were shared between the server and client prior to streaming. Evaluation metrics. The evaluation metrics include frames per second (FPS), latency (ms), structural similarity index measure (SSIM) for visual quality assessment, and network bandwidth usage. Video contents. We evaluated each metric in two live-captured scenes, static and dynamic, rather than using the 8i dataset <ref type="bibr">[21]</ref>, which is designed for on-demand volumetric video streaming. There is no standard open-source dataset captured with multiple (e.g., four) depth cameras that also provides fully disclosed intrinsic and extrinsic camera parameters. To address this, we created our own dataset using four Intel RealSense depth cameras [6], enabling synchronized live capture under realistic streaming conditions. 
The static scene resembles a telepresence scenario where a person is seated on a chair, making small movements with their arms or facial expressions. In the dynamic scene, a person moves freely, including actions such as sitting down, standing up, and moving body parts. The static scene consists of around 400K points, and the dynamic scene consists of around 500K points. Baselines. We compare the system performance of DeltaStream with that of recent studies. 1) LiveScan3D <ref type="bibr">[31]</ref> is a state-of-the-art system prior to MetaStream <ref type="bibr">[24]</ref>, which presents live volumetric video capturing using multiple RGB-D cameras. For a fair comparison, we extended its functionality by compressing point clouds with Draco [5] before streaming. 2) MetaStream <ref type="bibr">[24]</ref>, a state-of-the-art solution, is a bandwidth-efficient live point cloud video streaming system leveraging image segmentation processing on smart cameras. The performance of our baseline implementations closely matches the results reported in their papers. Note that we do not evaluate the bandwidth between the cameras and the server, as it is assumed to require only a few Mbps.</p><p>Other live volumetric video systems, such as Holoportation <ref type="bibr">[42]</ref> and Project Starline <ref type="bibr">[32]</ref>, are excluded because they are limited to scenarios where the server and the client are connected via Gbps-scale high-bandwidth wired connections and equipped with multiple powerful GPUs. FarfetchFusion <ref type="bibr">[33]</ref> is excluded because it primarily focuses on face reconstruction. MagicStream <ref type="bibr">[17]</ref> is not considered as a baseline since it does not utilize point cloud-based streaming and relies heavily on multiple deep learning models for components such as data reconstruction and neural rendering, and none of the model parameters or code is open-sourced.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">End-to-end Performance of DeltaStream</head><p>This section compares the overall performance of each system in live volumetric video streaming, including frames per second (FPS), end-to-end latency (ms), and visual quality (SSIM). Frames per second (FPS). In Figure <ref type="figure">10 (a)</ref>, DeltaStream and MetaStream achieve a stable 30 FPS, whereas LiveScan3D does not. DeltaStream and MetaStream achieve a stable 30 FPS by streaming only the region-of-interest point cloud object through segmentation. LiveScan3D lacks such optimizations, leading to lower FPS than the other two systems. When the bandwidth is limited to 100 Mbps in Figure <ref type="figure">10 (b)</ref>, MetaStream and LiveScan3D drop to 20-22 and 9-10 FPS, respectively. As both systems are inefficient at addressing temporal redundancy, the 100 Mbps bandwidth becomes a bottleneck in the streaming pipeline. In contrast, DeltaStream maintains 30 FPS even in the bandwidth-constrained case. End-to-end latency. In a live video streaming system, end-to-end latency is crucial, as it determines the motion-to-photon latency between the server and the client and the interaction delay between users <ref type="bibr">[14]</ref>. To evaluate this, we measure the end-to-end latency, which consists of encoding latency on the server, transmission delay, and decoding and rendering latency on the client. In Figure <ref type="figure">10</ref> (c), DeltaStream demonstrates lower latency compared to the other two baseline systems.</p><note type="other">Figure <ref type="figure">11</ref>: Average bandwidth usage of LiveScan3D <ref type="bibr">[31]</ref>, MetaStream <ref type="bibr">[24]</ref>, and DeltaStream for static and dynamic scenes. The number above each bar represents the bandwidth saving (%) compared to raw data transmission for each system.</note></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>As DeltaStream efficiently addresses temporal redundancy, it reduces the number of transmitted points and decreases the processing time taken by Draco [5] encoding and decoding. In contrast, the other baselines have to process more points in their point clouds, resulting in higher latencies. A more detailed analysis of processing latency and overhead is provided in Section 6.4. Visual quality. We evaluate SSIM values on rendered scenes, similar to other systems <ref type="bibr">[17,</ref><ref type="bibr">24]</ref>. SSIM is a perceptual metric that quantifies quality degradation caused by data compression and processing. Note that DeltaStream transmits encoded data independently of viewport changes, ensuring that both user experience and visual quality remain unaffected even during extreme viewport movements. Compared to the state-of-the-art MetaStream, which has SSIM values of 0.9050 and 0.9080, DeltaStream achieves comparable SSIM values of 0.8993 and 0.9031 for static and dynamic scenes, respectively, while performing delta encoding. As the rendered scene differs from the other two systems because of the region of interest, the SSIM metric is not directly compared with LiveScan3D.</p></div>
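For reference, the SSIM statistic can be computed as below. This single-window (global-statistics) variant is a simplification for illustration only; reported SSIM figures normally come from the standard windowed formulation over image patches.

```python
# Global (single-window) SSIM between two grayscale images given as
# flat lists of pixel intensities in [0, 255]. Real evaluations use
# the windowed SSIM; this simplified variant is only illustrative.
def ssim(x, y, L=255, k1=0.01, k2=0.03):
    n = len(x)
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2  # stabilization constants
    mx, my = sum(x) / n, sum(y) / n        # means
    vx = sum((a - mx) ** 2 for a in x) / n             # variance of x
    vy = sum((b - my) ** 2 for b in y) / n             # variance of y
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

Identical images score 1.0, and the score decays toward 0 as luminance, contrast, and structure diverge.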
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3">Bandwidth Savings</head><p>Since DeltaStream utilizes temporal redundancy to encode consecutive scenes efficiently, we compare the compression performance of each system by measuring the average bandwidth usage during streaming. For static and dynamic scene contents, Figure <ref type="figure">11</ref> compares average bandwidth usage.</p><note type="other">Figure <ref type="figure">12</ref>: Latency breakdown of LiveScan3D <ref type="bibr">[31]</ref>, MetaStream <ref type="bibr">[24]</ref>, and DeltaStream for the (a) static and (b) dynamic scenes. The hatched region shows the latency spent on Draco encoding/decoding.</note><p>As a result of the high compression efficiency of Draco [5] point cloud encoding, all three systems achieve over 80% bandwidth savings compared to streaming raw data. The results show that DeltaStream effectively reduces network bandwidth usage through delta encoding. On average, DeltaStream reduces bandwidth usage by 85% compared to LiveScan3D and by 71% compared to MetaStream for static scenes. For dynamic scenes, it achieves bandwidth reductions of 79% and 49% compared to LiveScan3D and MetaStream, respectively. Static scene. The overall bandwidth usage is lower than that of the dynamic scene because the static scene involves a person sitting with limited movements, which allows greater utilization of temporal correlation from the previous frame. In extreme cases, such as when there is almost no difference between the previous and current frames, almost no data needs to be transmitted, as DeltaStream efficiently handles temporal redundancy. Since MetaStream does not process temporal redundancy, DeltaStream saves 71% of bandwidth usage. 
Dynamic scene. Similar to the static scene, DeltaStream effectively reduces bandwidth usage compared to the other two baselines. On average, DeltaStream reduces bandwidth by 79% compared to LiveScan3D and by 49% compared to MetaStream. The bandwidth variation over time in Figure <ref type="figure">11</ref> is greater than in the static scene. In the dynamic scene, the object movement is more active, which reduces the temporal redundancy available for delta encoding. Therefore, the bandwidth reduction rate relative to the baselines is lower than in the static scene. However, DeltaStream still eliminates a substantial portion of the bandwidth compared to the baselines.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4">Processing Latency and Overhead Analysis</head><p>Figure <ref type="figure">12</ref> illustrates the latency components of each system. On the server and client sides, the latency is divided into Draco processing time and the remaining processing time, and the time spent by Draco is hatched in the graph. In all cases, DeltaStream demonstrates the lowest latency on both the server and client sides. This is because the efficient processing of temporal redundancy reduces unnecessary point transmissions between point cloud frames, thereby decreasing the time required for Draco encoding and decoding, which contributes the largest portion of the overall processing latency. As shown by the relationship in Figure <ref type="figure">8</ref> (a), the number of points and the time taken for Draco processing are linearly related. Therefore, reducing the number of points compressed by Draco directly results in reduced latency compared to other systems. While reducing the temporal redundancy in the point cloud decreases the latency of Draco encoding on the server, additional processing overheads, such as extracting 2D motion vectors from the 2D video CODEC and generating 3D motion vectors, are introduced. Consequently, the latency of these additional processes in DeltaStream is higher in comparison to the other baselines. However, the overall server-side latency remains lower than that of the other two systems.</p><p>Similarly, on the client side, the latency for Draco decoding is reduced for the same reason, but the additional processing required for 3D motion vector updates results in higher latency compared to MetaStream. On the other hand, the higher client-side latency of LiveScan3D is caused by the lack of segmentation optimization for objects. Therefore, there are redundant points outside the object compared to DeltaStream and MetaStream. 
As a result, LiveScan3D incurs greater processing delays in point cloud rendering and pre-render processing steps compared to both MetaStream and DeltaStream.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.5">Client Computation Adaptability</head><p>To account for the varying computational capabilities of client devices, arising from device differences or runtime fluctuations in computational resources due to thermal throttling, DeltaStream introduces client computation-adaptive control to ensure a reliable decoding speed, as discussed in Section 4.4. DeltaStream adjusts the number of 3D motion vectors n_mv during delta encoding to maintain the client's 30 FPS processing.</p><p>In Figure <ref type="figure">13</ref> (a), we temporarily removed the limit on the number of 3D motion vectors (n_mv) in the middle of streaming (i.e., at frame sequence 1000) to artificially impose additional computational burden on the client. This causes the FPS to drop to around 24 FPS because decoding the n_mv 3D motion vectors exceeds the real-time computational capacity of the client. When the adaptive control was reactivated at frame sequence 1200, the server adjusted n_mv to a level at which delta decoding remains manageable. After a few frame sequences, the FPS recovered and stabilized, maintaining a rendering rate of 30 FPS. We also conducted experiments under three different clock frequencies of the client CPU. Two clock frequencies were lower than the maximum clock frequency (2.6 GHz) of the client CPU, while one used the maximum clock frequency. Figure <ref type="figure">13 (b)</ref> illustrates the decoding latency and the proportion of motion compensation time within the total decoding time for each clock frequency on the client. As the clock frequency decreases, the server progressively reduces the number of 3D motion vectors to alleviate the client's computational burden. 
As a result, Figure <ref type="figure">13</ref> (b) shows that the proportion of 3D motion vector processing time relative to the overall latency decreases as the clock frequency decreases, which supports the proper operation of the adaptive control.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.6">Ablation Study</head><p>We compare DeltaStream with the following two encoding methods to analyze the impact of each component of the Delta Encoder on compression efficiency and visual quality: (1) 3DMV and (2) DPC. These two methods are not independent baselines. Instead, they represent integral components of DeltaStream. Our system (3DMV+DPC) balances the tradeoff between 3DMV and DPC to minimize bandwidth and computation while preserving visual quality. The difference between DeltaStream and these methods lies in how high-difference blocks f_high are processed. While 3DMV utilizes 3D motion vectors for all f_high regardless of the accuracy of motion compensation (i.e., e_mv = ∞), DPC encodes all f_high as the delta point cloud (i.e., e_mv = 0).</p><p>Figure <ref type="figure">14</ref> shows the evaluation results of DeltaStream and the other two methods. The 3DMV method is the most bandwidth-efficient of the three. As more 3D motion vectors are used in 3DMV, fewer points need to be transmitted, thereby reducing the bandwidth usage. However, 3DMV has the lowest visual quality, as it includes inaccurate motion vectors during encoding. On the other hand, DPC replaces all f_high with a delta point cloud, resulting in the highest visual quality but also the highest bandwidth usage. In summary, the optimal operating point for maximizing bandwidth savings is achieved when DeltaStream fully utilizes 3D motion vectors. By leveraging the respective advantages of both the 3DMV and DPC methods, DeltaStream (3DMV+DPC) effectively reduces bandwidth usage while maintaining visual quality. </p></div>
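The per-block choice that separates the three variants can be sketched as follows. Residual values and function names are hypothetical, but the rule mirrors the ablation setup: e_mv = ∞ reproduces 3DMV and e_mv = 0 reproduces DPC.

```python
# Sketch of the per-block encoding decision evaluated in the
# ablation (Section 6.6). Residual values are hypothetical.
import math

def encode_block(residual_after_mc, e_mv):
    """A high-difference block f_high becomes a 3D motion vector if its
    motion-compensation residual is within e_mv, else it joins the
    delta point cloud."""
    return "3d_motion_vector" if residual_after_mc <= e_mv else "delta_point_cloud"

def encode_frame(residuals, e_mv):
    """Apply the decision to every high-difference block of a frame."""
    return [encode_block(r, e_mv) for r in residuals]
```

A finite, intermediate e_mv (the 3DMV+DPC configuration) keeps only accurate motion vectors and falls back to delta points elsewhere.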
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.7">Complementary Analysis</head><p>3D motion vectors. We use the Chamfer Distance <ref type="bibr">[51]</ref> to verify the accuracy of the calculated 3D motion vectors. The metric measures the average closest-point distance between the 3D motion-compensated source point cloud and the destination point cloud to verify the geometric alignment of the point clouds. The Chamfer Distance between the ground truth (i.e., point cloud frames without encoding) and the 3D motion-compensated point cloud is 0.0043 on average, which corresponds to less than 5 mm, thereby demonstrating the high accuracy of the 3D motion vectors. Note that the slight misalignment arises from the block-based approach of DeltaStream, a common issue also observed in 2D video CODECs.</p><p>Predefined thresholds. The values of the thresholds primarily used in delta encoding in DeltaStream are determined empirically through experiments. We first set the threshold e_diff, which is responsible for filtering high-difference blocks f_high between the previous frame and the current frame, with a focus on visual quality. As shown in Figure <ref type="figure">15</ref>, visual quality degrades as e_diff increases, because blocks with significant differences are excluded from delta encoding and remain unchanged from the previous frame, even when they have considerable differences. e_diff is chosen to maintain SSIM above 0.9, which is considered to represent good visual quality <ref type="bibr">[19]</ref>. Therefore, we set e_diff = 10000, the value that most closely yields an SSIM of 0.9. On the other hand, we tested various values of e_mv after fixing e_diff, but observed only marginal changes in SSIM values. 
Therefore, we set e_mv = 5000, a tighter threshold than e_diff, to guarantee that only f_matched blocks with minimal differences after motion compensation are used.</p></div>
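The Chamfer Distance check from the complementary analysis can be reproduced with a brute-force sketch like the one below. This uses one common symmetric variant (the mean of the two directed averages); the paper's exact normalization may differ.

```python
# Brute-force symmetric Chamfer Distance between two point clouds,
# as used in Section 6.7 to check 3D motion compensation accuracy.
# O(N*M) nearest-neighbour search; a real system would use a k-d tree.
import math

def _avg_closest(a, b):
    """Average distance from each point in a to its nearest point in b."""
    total = 0.0
    for p in a:
        total += min(math.dist(p, q) for q in b)
    return total / len(a)

def chamfer_distance(a, b):
    """Symmetric Chamfer Distance: mean of the two directed averages."""
    return 0.5 * (_avg_closest(a, b) + _avg_closest(b, a))
```

Perfectly aligned clouds score 0, and the score grows with the residual misalignment left after motion compensation.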
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Discussion</head><p>Compression loss. While DeltaStream achieves significant bandwidth savings through delta encoding, the lossy nature of the compression introduces visual artifacts that can impact the quality of the reconstructed point cloud. DeltaStream utilizes 3D motion compensation to reduce the differences between consecutive frames, thereby minimizing the data that needs to be transmitted. However, inaccuracies in 3D motion vectors may result in slight misalignments. Compression loss also arises from the threshold used to classify matched and mismatched blocks. Smaller thresholds improve quality but increase bandwidth, while larger thresholds reduce bandwidth at the cost of quality. Adaptive thresholds and real-time perceptual metrics could further balance bandwidth efficiency and visual fidelity.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Potential performance enhancements</head><p>While we implemented DeltaStream with a focus on CPU processing, there is room for performance improvement through hardware acceleration, as with 2D video codecs <ref type="bibr">[11]</ref>. Since our approach is block-based, it is well-suited for parallel processing on GPUs, which could further enhance performance. DeltaStream efficiently addresses temporal redundancy in the current point cloud frame based on the previous frame. However, utilizing concepts such as bi-directional prediction frames or variable block sizes <ref type="bibr">[44,</ref><ref type="bibr">50]</ref> could lead to further improvements in bandwidth efficiency. As these encoding approaches increase decoding complexity, they must be carefully considered. User experience and system assumptions. Since DeltaStream deals with 3D point cloud transmission, the user experience remains unaffected even with extreme viewport movements. Viewport-adaptive solutions such as ViVo <ref type="bibr">[25]</ref> can be incorporated into DeltaStream to further reduce network bandwidth. Many practical immersive applications, such as remote surgery, virtual concerts, and telepresence, typically assume fixed camera positions. Even when a camera position changes, the update can be transmitted in real time using only a few bytes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">Related Work</head><p>Reducing temporal and spatial redundancy among adjacent frames in video streaming for AR/VR/MR applications is crucial for saving network bandwidth. For 360° video streaming applications, recent studies <ref type="bibr">[36,</ref><ref type="bibr">57]</ref> have explored reusing information from previous frames to enhance compression and computational efficiency. However, volumetric video streaming presents significantly greater challenges due to its inherent support for six degrees of freedom (6-DoF), where users can move freely in space rather than only rotating their heads. In the following, we review existing systems for volumetric video streaming, as summarized in Table <ref type="table">2</ref>. Live volumetric video streaming. The most representative application of live volumetric video streaming is telepresence, which enables real-time, immersive communication between users in different locations. Prominent examples include Holoportation <ref type="bibr">[42]</ref> and Project Starline <ref type="bibr">[32]</ref>, both of which are high-quality, real-time telepresence systems. However, these systems require extremely high bandwidth on the Gbps scale and rely on multiple powerful GPUs for processing. Additionally, Project Starline <ref type="bibr">[32]</ref> is restricted to use inside a booth, limiting user mobility and flexibility. FarfetchFusion <ref type="bibr">[33]</ref> introduces a mobile telepresence platform with a focus on 3D face reconstruction, allowing for greater user mobility. MagicStream <ref type="bibr">[17]</ref> transmits only semantic information instead of full 3D representations. While this approach significantly reduces bandwidth usage, it requires extensive user profiling to train multiple machine learning models and is not a generalizable solution for other types of objects. 
LiveScan3D <ref type="bibr">[31]</ref> suggests a live volumetric capturing pipeline using RGB-D cameras, offering a simpler, hardware-centric approach. MetaStream <ref type="bibr">[24]</ref>, regarded as the state-of-the-art system for live volumetric video streaming, supports the full pipeline from multi-camera capture to real-time streaming. It reduces bandwidth by performing image segmentation and offloading computation to smart cameras, thereby improving processing efficiency.</p><note type="other">Table 2 compares <ref type="bibr">[57]</ref> and Dragonfly <ref type="bibr">[22]</ref> (360° video, 3-DoF, 2D representation); Vues <ref type="bibr">[38]</ref> (on-demand volumetric, 6-DoF, 2D); ViVo <ref type="bibr">[25]</ref>, GROOT <ref type="bibr">[34]</ref>, and YuZu <ref type="bibr">[55]</ref> (on-demand volumetric, 6-DoF, point cloud); FarfetchFusion <ref type="bibr">[33]</ref> (live volumetric, 6-DoF, 2D); and MetaStream <ref type="bibr">[24]</ref> (live volumetric, 6-DoF, point cloud).</note><p>Despite the diversity of approaches, none of these methods effectively utilize the temporal correlation among consecutive point cloud frames, which presents an opportunity for further bandwidth and computational efficiency improvements.</p><p>On-demand volumetric video streaming. On-demand volumetric video streaming focuses on delivering pre-recorded content from the server. Unlike live volumetric video streaming, on-demand streaming does not require real-time encoding, allowing researchers to focus primarily on bandwidth-saving techniques.</p><p>Various approaches have been proposed to achieve this goal, leveraging advanced compression, prediction, and data reduction methods. 
Vues <ref type="bibr">[38]</ref> achieves bandwidth reduction by utilizing highly efficient video CODECs <ref type="bibr">[28]</ref> and incorporates multiple machine learning models for viewport prediction on edge servers, thereby enhancing the quality of experience. GROOT <ref type="bibr">[34]</ref> introduces an innovative octree-based, parallel-decodable tree structure, enabling real-time decoding on mobile GPUs. ViVo <ref type="bibr">[25]</ref> employs a visibility-aware sampling technique that considers viewport, distance, and occlusion factors to reduce bandwidth usage. YuZu <ref type="bibr">[55]</ref> and VoluSR <ref type="bibr">[54]</ref> utilize 3D super-resolution to lower the point density of transmitted point clouds, thereby significantly reducing bandwidth consumption. M5 <ref type="bibr">[56]</ref> and MuV2 <ref type="bibr">[39]</ref> are specifically designed to support multiple users under limited bandwidth conditions, ensuring stable streaming for multiple concurrent users. While many studies on on-demand volumetric video streaming have focused on improving compression rates, the computational overhead required by these techniques often makes them unsuitable for live streaming scenarios.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9">Conclusion</head><p>In this paper, we presented DeltaStream, a novel live volumetric video streaming system that leverages 2D information to efficiently encode and transmit point cloud data. Additionally, its computation-adaptive online control mechanism dynamically adjusts encoding strategies based on client-side computational constraints to ensure real-time streaming at 30 FPS. Comprehensive evaluations demonstrate that DeltaStream significantly reduces bandwidth usage while achieving low end-to-end latency and stable frame rates. Compared to state-of-the-art systems like MetaStream and LiveScan3D, DeltaStream reduces bandwidth usage by up to 71% for static scenes and 49% for dynamic scenes. Furthermore, it achieves 1.15-1.63× faster decoding speeds while maintaining comparable visual quality. We believe that DeltaStream provides a practical and effective solution for real-time volumetric video streaming in applications such as telepresence, virtual concerts, and augmented reality.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_0"><p>https://github.com/delta-stream/DeltaStream</p></note>
		</body>
		</text>
</TEI>
