<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>NEAT: Distilling 3D Wireframes from Neural Attraction Fields</title></titleStmt>
			<publicationStmt>
				<publisher>IEEE CVPR</publisher>
				<date>06/18/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10534201</idno>
					<idno type="doi"></idno>
					
					<author>Nan Xue</author><author>Bin Tan</author><author>Yuxi Xiao</author><author>Liang Dong</author><author>Gui-Song Xia</author><author>Tianfu Wu</author><author>Yujun Shen</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[This paper studies the problem of structured 3D reconstruction using wireframes that consist of line segments and junctions, focusing on the computation of structured boundary geometries of scenes. Instead of leveraging matching-based solutions from 2D wireframes (or line segments) for 3D wireframe reconstruction as done in prior art, we present NEAT, a rendering-distilling formulation using neural fields to represent 3D line segments with 2D observations, and bipartite matching for perceiving and distilling of a sparse set of 3D global junctions. The proposed NEAT enjoys the joint optimization of the neural fields and the global junctions from scratch, using view-dependent 2D observations without precomputed cross-view feature matching. Comprehensive experiments on the DTU and BlendedMVS datasets demonstrate our NEAT’s superiority over state-of-the-art alternatives for 3D wireframe reconstruction. Moreover, the 3D global junctions distilled by NEAT are a better initialization than SfM points for the recently-emerged 3D Gaussian Splatting for high-fidelity novel view synthesis, using about 20 times fewer initial 3D points.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>accuracy of the reconstruction, as the matching process relies on these endpoints to accurately represent the 3D geometry. These matching-based methods often result in incomplete 3D line models or suffer from fragmentation and noise, depending on the choice of 2D detectors <ref type="bibr">[25,</ref><ref type="bibr">36,</ref><ref type="bibr">[43]</ref><ref type="bibr">[44]</ref><ref type="bibr">[45]</ref><ref type="bibr">[46]</ref> and matchers <ref type="bibr">[23,</ref><ref type="bibr">24]</ref> of line segments, as shown in Fig. <ref type="figure">1</ref>.</p><p>Dense Fields of Sparse Geometries. We challenge the explicit matching pipeline of 3D wireframe reconstruction from the perspective of dense field representations. We draw inspiration from the "implicit matching" capacity <ref type="bibr">[42]</ref> of the emerging neural implicit fields <ref type="bibr">[2,</ref><ref type="bibr">22,</ref><ref type="bibr">49]</ref> for 3D dense representations (e.g., density fields and signed distance functions), and propose to render 3D line segments from multi-view 2D observations. This basic idea roughly works by leveraging a coordinate MLP to render 3D line segments from 2D observations, but it remains problematic due to the entailed view-by-view rendering of 3D line segments, in two ways: (1) the 2D line segments of a detected wireframe often undergo localization errors, resulting in erroneous 3D line segment predictions via view-by-view rendering, and (2) simply stacking the rendered 3D line segments from all views leads to a very large number of 3D line segments, requiring non-trivial merging/fusion to form a 3D wireframe representation of the scene.</p><p>Line-to-Point Attraction in Neural Fields. We tackle the above issues by leveraging the line-to-point attraction that is inherent in the wireframe representation, in which every endpoint of a 3D line segment should be in the set of 3D junctions of the underlying scene. 
Based on this, we formulate the two types of entities of 3D wireframes, the 3D line segments and junctions, in a novel rendering-distilling formulation, where the sparse set of 3D line segments is represented in a dense neural field while the junctions play the role of distilling a sparse wireframe structure from the field. Our work is entitled NEural Attraction (NEAT) for 3D wireframe reconstruction, mainly because of the neural design of the 3D line segments and junctions, and of leveraging the line-to-point attraction to enable joint optimization of the neural networks from multi-view images and their 2D wireframe detection results. To the best of our knowledge, we accomplish the first matching-free solution for 3D wireframe/line reconstruction by learning and optimizing from random initializations without any 3D scene information required.</p><p>In experiments, we show that our matching-free NEAT solution significantly outperforms all the matching-based approaches with accurate and complete 3D wireframe reconstruction results on both the DTU <ref type="bibr">[1]</ref> and BlendedMVS <ref type="bibr">[47]</ref> datasets, working well in both straight-line-dominated scenes and curve-based (or polygonal-line-segment-dominated) scenes that challenge the traditional matching-based approaches, paving a way towards learning the 3D primal sketch in a more general manner. Furthermore, we show that the neurally perceived 3D junctions are applicable to the recently proposed 3D Gaussian Splatting <ref type="bibr">[13]</ref> as a better initialization than the COLMAP <ref type="bibr">[29]</ref> points, with about 20 times fewer points, showcasing the potential of structured and compact 3D reconstruction.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Structured 3D Reconstruction in Geometric Primitives. Because of the inherent structural regularities for scene representation conveyed by line structures <ref type="bibr">[10,</ref><ref type="bibr">16,</ref><ref type="bibr">19,</ref><ref type="bibr">28,</ref><ref type="bibr">31]</ref> and planar structures <ref type="bibr">[33,</ref><ref type="bibr">34]</ref>, there has been a vast body of literature on line-based multi-view 3D reconstruction tasks, including single-view 3D reconstruction <ref type="bibr">[18,</ref><ref type="bibr">33]</ref>, line-based SfM <ref type="bibr">[3,</ref><ref type="bibr">27]</ref>, SLAM <ref type="bibr">[26,</ref><ref type="bibr">38]</ref>, and multi-view stereo <ref type="bibr">[12,</ref><ref type="bibr">17,</ref><ref type="bibr">39]</ref> based on the theory of multi-view geometry <ref type="bibr">[11]</ref>. Due to the challenge of line segment detection and matching in 2D images, most of those studies expect the 2D line segments detected from input images to be redundant and short to maximize the possibility of line segment matching. For estimating scene geometry and camera poses, keypoint correspondences (and often even 3D point clouds) are usually required. For example, in Line3D++ <ref type="bibr">[12]</ref>, even given camera poses from keypoint-based SfM systems <ref type="bibr">[29,</ref><ref type="bibr">30,</ref><ref type="bibr">32,</ref><ref type="bibr">40]</ref>, it remains challenging to establish reliable correspondences in the pursuit of structural regularity for 3D line reconstruction. 
For our goal of 3D wireframe reconstruction, because 2D wireframe parsers aim at producing parsimonious representations with a small number of 2D junctions and long line segments, those correspondence-based solutions face a challenging scenario for cross-view wireframe matching, thus leading to inferior results compared to the ones using redundant, short 2D line segments detected by LSD <ref type="bibr">[36]</ref>. To this end, we present a correspondence-free formulation based on coordinate MLPs, which provides a novel perspective to accomplish the goal of 3D wireframe reconstruction from the parsed 2D wireframes.</p><p>Neural Rendering for Geometric Primitives. In recent years, the emergence of neural implicit representations [2,</p><p>As depicted in Fig. <ref type="figure">3</ref> using a synthetic example, we utilize the attracted pixels of 2D line segments in each image to define the rays for 3D rendering. For each segment, its attracted pixels are projected perpendicularly onto the 2D segment. This projection is confined within the endpoints of the segment with respect to a predefined distance threshold, τ_ray. Each pixel is associated with its nearest line segment, ensuring a dense coverage of supporting areas for the segments. This approach facilitates the volume rendering of 3D line segments by providing a robust underlying structure.</p><p>In our approach, we model a 3D line segment at any point x_t along a ray. The endpoint displacements (Δx_t^1, Δx_t^2) relative to x_t are computed as,</p><p>yielding the two endpoints of the segment by (x_t + Δx_t^1, x_t + Δx_t^2). The mapping function L(·) is parameterized by a 4-layer coordinate MLP. It incorporates the view direction v, the surface normal n(·) from the SDF gradient, and a 128-dimensional feature vector z(x_t) from the SDF network, reflecting the view-dependent nature of 2D line segments. 
For rendering a 3D line segment, we apply the equation,</p><p>Here, x_s and x_t are the 3D endpoints for the attraction pixel x of a 2D line segment l = (j_1, j_2) ∈ V_i × V_i of the i-th view, calculated along its ray x_t.</p><p>According to the pixel-to-line relationship defined by 2D attraction field representations, the rendered 3D line segment (x_s, x_t) of a ray x_t should be consistent with l = (j_1, j_2), thus resulting in a loss function between l and the 2D endpoints obtained by the viewpoint projection Π(·),</p><p>The proposed Neural Attraction Field of 3D line segments is optimized together with the SDF and the radiance field by minimizing the loss functions stated above, forming a queryable and dense representation of 3D line segments.</p><p>Minimizing the loss functions L_neat, L_img, and L_eik allows us to derive a geometrically meaningful but noisy 3D line cloud from multi-view images, as demonstrated in Fig. <ref type="figure">4</ref> using both a synthetic example and a real case from the DTU-24 scene <ref type="bibr">[1]</ref>. The absence of explicit line matching across multiple views leads to duplication of the same 3D line segments, each with its own view-dependent prediction errors. In the following section, we discuss how this redundancy and noise, while initially seeming detrimental, actually provide a strong inductive bias towards achieving the goal of 3D wireframe reconstruction.</p></div>
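The pixel-to-segment attraction sampling described above can be sketched as follows. This is a minimal NumPy sketch under assumed array shapes, not the authors' implementation; `attraction_pixels` is a hypothetical helper name.

```python
import numpy as np

def attraction_pixels(pixels, segments, tau_ray=5.0):
    """Assign each pixel to its nearest 2D line segment (a simplified
    sketch of the attraction sampling described in the text).

    pixels:   (P, 2) array of pixel coordinates.
    segments: (S, 2, 2) array of 2D segments (two endpoints each).
    Returns (nearest, mask): the nearest segment index per pixel, and a
    boolean mask of pixels whose perpendicular foot lies between the
    endpoints within the distance threshold tau_ray.
    """
    p = pixels[:, None, :]                  # (P, 1, 2)
    a = segments[None, :, 0, :]             # (1, S, 2) segment start
    b = segments[None, :, 1, :]             # (1, S, 2) segment end
    ab = b - a
    # Parameter t of the perpendicular projection of p onto line(a, b).
    t = ((p - a) * ab).sum(-1) / (ab * ab).sum(-1)   # (P, S)
    foot = a + t[..., None] * ab
    dist = np.linalg.norm(p - foot, axis=-1)         # (P, S)
    # The projection must fall between the two endpoints (0 <= t <= 1).
    inside = (t >= 0.0) & (t <= 1.0)
    dist = np.where(inside, dist, np.inf)
    nearest = dist.argmin(axis=1)
    mask = dist[np.arange(len(pixels)), nearest] <= tau_ray
    return nearest, mask
```

The masked pixels would then define the rays used for volume rendering of the 3D line segments.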
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Neural 3D Junction Perceiver</head><p>This section introduces our method to "clean up" the noisy and redundant 3D line cloud created by the Neural Attraction Fields. Leveraging the relationship between 3D junctions and line segments in wireframes, we propose a neural and joint optimization approach, central to our NEAT method. Using the 3D line cloud, denoted by L_neat, a query-based learning method is designed for perceiving 3D junctions (Eq. ( <ref type="formula">6</ref>)) via junction-line attraction, which plays the role of distillation for 3D wireframe reconstruction.</p><p>Global 3D Junction Perceiving. Our 3D line segment rendering inherits a dense representation like the density field and the radiance field. To achieve parsimonious wireframes, we propose a novel query-based design to holistically perceive a predefined sparse set of N 3D junctions by</p><p>where Q ∈ R^{N×C} are C-dimensional latent queries (randomly initialized in learning). Surprisingly, as we shall show in experiments, the synergies between J ∈ R^{N×3} and the above 3D line segment rendering integral, induced by the underlying 3D scene geometry, enable us to learn a very meaningful global 3D junction perceiver.</p><p>In the absence of well-defined ground truth for learning 3D junctions, we use the endpoints of the redundant rendered 3D line segments (Sec. 3.1) as noisy labels. By reshaping the line cloud L_neat into J_neat ∈ R^{2M×3}, our process involves two steps: (1) clustering J_neat using DBSCAN to yield pseudo 3D junctions J^cls ∈ R^{m×3} with m &lt; 2M clusters; (2) applying bipartite set-to-set matching between the perceived junctions J ∈ R^{N×3} (Eq. ( <ref type="formula">6</ref>)) and J^cls using the Hungarian algorithm. The matching cost is based on the ℓ2 norm between 3D points. We define J = {(J_k, J^cls_{i_k}) | k = 1, . . . 
, K} as the set of matched junctions, where K = min(N, m), and i_k is the index of the k-th matched pseudo label J^cls_{i_k}. Then, our goal is to minimize the distance between each matched pair of junctions.</p><p>Figure <ref type="figure">6</ref>. Visualization of 3D Wireframe Reconstruction on the 12 scenes from the DTU dataset <ref type="bibr">[1]</ref> and the 4 scenes from the BlendedMVS dataset <ref type="bibr">[47]</ref>. For each scene, we show its line segment view (by hiding the junctions) in black, and the wireframe view by coloring the junctions in blue. For the comparison, please see our video.</p><p>Table <ref type="table">1</ref>. Evaluation Results on the DTU and BlendedMVS datasets for the reconstructed 3D wireframes. ACC-J and ACC-L are the accuracy metrics for junctions and line segments, respectively. For Line3D++@HAWP, LiMAP, and ELSR, all the endpoints of line segments are treated as junctions.</p><p>(Per-scene ACC-J, ACC-L, COMP, and line/junction counts for the DTU and BlendedMVS scenes are listed in Table 1.)</p><p>Since our goal is 3D wireframe reconstruction instead of 3D line segment reconstruction, for fair comparisons we use HAWPv3 <ref type="bibr">[46]</ref> as the alternative 2D detector for Line3D++ and LiMAP. For those baselines, we use their official implementations for 3D line segment reconstruction.</p><p>DTU <ref type="bibr">[1]</ref> and BlendedMVS <ref type="bibr">[47]</ref> Datasets.</p><p>These results verify the feasibility of optimizing coordinate MLPs using this sampling technique. As depicted in Fig. <ref type="figure">11</ref>(a), by masking over 80% of the pixels (using a distance threshold of 5 pixels), we can still effectively optimize coordinate MLPs, leading to the reasonable outcomes shown in Fig. <ref type="figure">11</ref>(b).</p><p>In addition to the rendering results, we observed that increasing the distance threshold leads to a reduction in the number of line segments and junctions. As detailed in Tab. 3, setting the distance threshold to τ_d = 20 results in fewer 3D lines and junctions. Although the ACC errors are marginally reduced, the COMP error increases. Conversely, when the distance threshold τ_d is set to 1, a performance degradation is noted across all metrics due to insufficient supervision signals.</p></div>
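The two-step distillation described in Sec. 3.2 (DBSCAN clustering of the rendered endpoints into pseudo junctions, then Hungarian matching against the perceived junctions under an ℓ2 cost) can be sketched as below. The function name and array shapes are assumptions; `eps=0.01` and `min_samples=2` follow the values reported in Sec. B.3.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import DBSCAN

def distill_junction_targets(line_cloud, perceived, eps=0.01, min_samples=2):
    """Sketch of the junction-distillation step described in the text.

    line_cloud: (M, 2, 3) rendered 3D line segments; their endpoints
                serve as noisy labels.
    perceived:  (N, 3) globally perceived junctions J.
    Returns (matched_idx, matched_pseudo): indices into `perceived` and
    the pseudo junctions they were paired with by Hungarian matching.
    """
    endpoints = line_cloud.reshape(-1, 3)                  # (2M, 3)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(endpoints)
    # Cluster centroids act as pseudo 3D junctions (noise label -1 dropped;
    # this sketch assumes at least one cluster is found).
    pseudo = np.stack([endpoints[labels == c].mean(axis=0)
                       for c in sorted(set(labels)) if c != -1])
    # Pairwise l2 cost between perceived junctions and pseudo junctions.
    cost = np.linalg.norm(perceived[:, None] - pseudo[None, :], axis=-1)
    row, col = linear_sum_assignment(cost)                 # K = min(N, m) pairs
    return row, pseudo[col]
```

The matched pairs would then drive the ℓ2 loss that pulls the perceived junctions toward the pseudo labels.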
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.2. The Number of Global Junctions</head><p>The number of global junctions is determined heuristically to encompass all potential 3D junctions. Based on observations from both the DTU and BlendedMVS datasets, where the detected 2D line segments are in the hundreds, we set the estimated number of 3D junctions to 1024. In Tab. 4, we present experiments conducted on the DTU-24 scene with varying numbers of junctions, denoted as N , to assess performance differences. The results indicate that increasing the number of possible global 3D junctions to a larger value (e.g., N = 2048) yields only a marginal increase in the count of learned 3D line segments and junctions in the final wireframe models. Conversely, a smaller N tends to result in incomplete 3D wireframe models. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.3. Additional Implementation Details</head><p>Network Architecture. The coordinate MLPs used in our NEAT approach are derived from VolSDF <ref type="bibr">[49]</ref>, which contains three coordinate MLPs for the SDF, the radiance field, and the NEAT field. The SDF MLP contains 8 layers with hidden width 256 and a skip connection from the input to the 4th layer. The radiance field and the NEAT field share the same architecture: 4 layers with hidden width 256 and no skip connections. The proposed global junction perceiving (GJP) module contains two hidden layers and one decoding layer, as described in the code snippets of Sec. 1 in our main paper.</p><p>Hyperparameters. The distance threshold τ_d for foreground pixel (ray) generation is set to 5 by default. For the number of global junctions (i.e., the size of the latent), we set it to 1024 on the DTU and BlendedMVS datasets. When the scene scale is larger (e.g., a scene from ScanNet mentioned in Fig. <ref type="figure">5</ref> of the main paper), the number of global junctions is set to 2048. For DBSCAN <ref type="bibr">[7]</ref>, we use the implementation from the sklearn package, setting epsilon (the maximum distance between two samples) to 0.01 and the number of samples (in a neighborhood for a point to be considered a core point) to 2.</p></div>
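A minimal PyTorch sketch of the SDF MLP described above (8 layers, hidden width 256, skip connection at the 4th layer, plus a head emitting the SDF value and a 128-dimensional feature z(x)). The class name, activation choice, and head layout are assumptions, not the authors' exact code.

```python
import torch
import torch.nn as nn

class SDFNetwork(nn.Module):
    """Hypothetical sketch of the 8-layer SDF MLP: hidden width 256,
    with the input coordinate re-injected at the 4th layer."""

    def __init__(self, in_dim=3, hidden=256, num_layers=8, skip_at=4, feat_dim=128):
        super().__init__()
        self.skip_at = skip_at
        layers = []
        for i in range(num_layers):
            d_in = in_dim if i == 0 else hidden
            if i == skip_at:
                d_in += in_dim  # skip connection: concatenate the input coordinate
            layers.append(nn.Linear(d_in, hidden))
        self.layers = nn.ModuleList(layers)
        self.head = nn.Linear(hidden, 1 + feat_dim)  # SDF value + feature z(x)
        self.act = nn.Softplus(beta=100)

    def forward(self, x):
        h = x
        for i, layer in enumerate(self.layers):
            if i == self.skip_at:
                h = torch.cat([h, x], dim=-1)  # re-inject the raw coordinate
            h = self.act(layer(h))
        out = self.head(h)
        return out[..., :1], out[..., 1:]  # (sdf value, feature vector)
```

The radiance and NEAT fields would follow the same pattern with 4 layers and no skip connection.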
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. The Final Distillation Step of NEAT</head><p>This section elaborates on the final distillation step required in our NEAT methodology for 3D wireframe reconstruction, with a particular focus on the extensive use of global junctions. We aim to provide a detailed insight into this crucial phase of the NEAT process.</p><p>To begin with, let us consider the challenge inherent in the junction-driven finalization of NEAT. As depicted in Fig. <ref type="figure">12</ref>, using a toy ABC scene as an example, we observe that a considerable number of 3D line segments are rendered and aggregated across different views. Concurrently, 3D junctions are dynamically distilled from the NEAT fields. While a simple approach to combine these 3D junctions with the redundant 3D line segments might seem viable, it is critical to address the potential misalignments between the junctions and line segments. To resolve this issue, we employ a least squares optimization combined with an SDF-based refinement scheme. This approach is designed to precisely adjust the position of 3D junctions, thereby ensuring an accurate and coherent reconstruction of the 3D wireframe.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.1. Least Squares Optimization</head><p>For the reader's convenience, we restate Eq. ( <ref type="formula">9</ref>) of our main paper as Eq. <ref type="bibr">(10)</ref>,</p><p>which is the main objective function to adjust the junction positions according to the observations from the optimized/learned NEAT field. Here, we mathematically define the alignment cost between the junction-driven 3D line segment l^0_{u,v} = (J_u, J_v) and its i-th NEAT-field observation l^i_{u,v} = (x^i_u, x^i_v) by the angular cost and the perpendicular cost as follows,</p><p>where ⟨·, ·⟩ is the inner product between two 3D vectors, and the function proj(l^i_{u,v}; J_v) projects the point J_v onto the infinite 3D line passing through the line segment l^i_{u,v}. In Tab. 5, we report the performance changes when disabling the non-linear optimization on the DTU dataset, which results in inferior 3D wireframes with larger ACC errors for both junctions and line segments.</p></div>
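The angular and perpendicular costs above can be sketched as follows; the equal weighting of the two terms is an assumption, and `alignment_cost` is a hypothetical helper name.

```python
import numpy as np

def alignment_cost(j_u, j_v, x_u, x_v):
    """Sketch of the angular + perpendicular alignment cost between the
    junction-driven segment (j_u, j_v) and one NEAT-field observation
    (x_u, x_v), each a 3D point. The weighting of the two terms is an
    assumption about Eq. (10)."""
    d0 = (j_v - j_u) / np.linalg.norm(j_v - j_u)
    di = (x_v - x_u) / np.linalg.norm(x_v - x_u)
    # Angular cost: 1 - |cos angle| between the two direction vectors.
    angular = 1.0 - abs(float(d0 @ di))
    # Perpendicular cost: ||j_v - proj(l_i; j_v)||, where proj drops j_v
    # onto the infinite line through (x_u, x_v).
    t = float((j_v - x_u) @ di)
    foot = x_u + t * di
    perpendicular = float(np.linalg.norm(j_v - foot))
    return angular + perpendicular
```

A perfectly aligned observation yields zero cost; both terms grow as the observed segment rotates or shifts away from the junction-driven one.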
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.2. SDF-based 3D Junction Refinement</head><p>Following the non-linear optimization, we employ an SDF-based refinement scheme to further enhance the localization accuracy of junctions. Specifically, for an initial 3D junction J_i ∈ R^3 and an optimized SDF d_Ω(·), we refine the location of J_i by projecting it onto the zero level set, J_i ← J_i − d_Ω(J_i) · ∇d_Ω(J_i)/‖∇d_Ω(J_i)‖, where ∇d_Ω represents the normal direction of the surface at the point J_i.</p><p>To assess the impact of this SDF-based refinement on junctions, we conducted an ablation study comparing 3D wireframe models with and without the SDF refinement. The results, presented in Tab. 5, clearly demonstrate the necessity of this refinement step for achieving significantly improved results.</p></div>
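A minimal sketch of such an SDF-based refinement, assuming the standard one-step projection along the normalized SDF gradient; the iteration count and function names are assumptions.

```python
import numpy as np

def refine_junction(j, sdf, grad_sdf, num_steps=5):
    """Walk a 3D junction toward the zero level set of the SDF along the
    normalized gradient direction (a sketch of the refinement described
    in the text, with a fixed, assumed number of steps).

    j:        (3,) initial junction position.
    sdf:      callable returning the signed distance at a 3D point.
    grad_sdf: callable returning the SDF gradient at a 3D point.
    """
    for _ in range(num_steps):
        g = grad_sdf(j)
        j = j - sdf(j) * g / np.linalg.norm(g)  # step onto the surface
    return j
```

For a true distance field a single step already lands on the surface; the extra iterations only matter when the learned SDF is approximate.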
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.3. Visibility Checking</head><p>As detailed in Sec. 3.3 of the main paper, we evaluate the reconstructed 3D line segments by projecting them onto the 2D images of each view. This process involves computing both the angular and perpendicular distances between the projected 3D line segments and the detected 2D line segments. A 3D line segment is considered to be supported by a 2D detection if it aligns within an angular distance of 10 degrees and a perpendicular distance of 5 pixels, with a minimum overlap ratio of 50%. This methodology allows us to determine the visibility of each 3D line segment and to filter out invisible ones as false alarms.</p><p>In our standard approach, the visibility threshold for each line segment is set to 1, aiming for a more complete reconstruction. Moreover, we explore the impact of varying this visibility threshold from 1 to 4 on the DTU dataset. The findings, summarized in Tab. 6, indicate that increasing the visibility threshold improves the ACC metric, while the COMP error also increases, i.e., the reconstructions become less complete.</p></div>
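The per-view support test above can be sketched as follows; measuring the overlap along the detected segment's direction is an assumption about the exact reference frame, and `supports` is a hypothetical helper name.

```python
import numpy as np

def supports(proj_seg, det_seg, max_angle_deg=10.0, max_perp=5.0, min_overlap=0.5):
    """Check whether a detected 2D segment supports a projected 3D
    segment, using the three thresholds stated in the text.

    proj_seg, det_seg: (2, 2) arrays, one 2D endpoint per row.
    """
    def unit(v):
        return v / np.linalg.norm(v)

    d_proj = unit(proj_seg[1] - proj_seg[0])
    d_det = unit(det_seg[1] - det_seg[0])
    # Angular distance between the two segment directions, in degrees.
    angle = np.degrees(np.arccos(np.clip(abs(d_proj @ d_det), 0.0, 1.0)))

    # Mean perpendicular distance of the projected endpoints to the
    # infinite line through the detected segment.
    rel = proj_seg - det_seg[0]
    perp = np.abs(rel @ np.array([-d_det[1], d_det[0]])).mean()

    # Overlap ratio measured along the detected segment.
    length = np.linalg.norm(det_seg[1] - det_seg[0])
    t = np.clip((rel @ d_det) / length, 0.0, 1.0)
    overlap = abs(t[1] - t[0])

    return angle <= max_angle_deg and perp <= max_perp and overlap >= min_overlap
```

Counting, over all views, how many detections support a segment and comparing that count to the visibility threshold would then decide whether the segment is kept.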
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Experiments on the ABC Dataset</head><p>Because 3D wireframe annotations are very difficult to obtain for real scene images, to better discuss the problem of 3D wireframe reconstruction and analyze our proposed NEAT approach, we conduct experiments on objects from the ABC dataset, as it provides 3D wireframe annotations.</p><p>Data Preparation. We use Blender <ref type="bibr">[4]</ref> to render 4 objects from the ABC dataset. The object IDs are listed in Tab. 7. For each object, we first resize it into a unit cube by dividing by the length of its longest side and then move it to the origin. Then, we randomly generate 100 camera locations, each of which is distant from the origin by √(1.5² + 1.5²) ≈ 2.1213 units. This choice of distance comes from our early-stage development of the rendering, in which we set a camera at the location (0, 1.5, 1.5). By setting the cameras to look at the origin (0, 0, 0), we obtain 100 camera poses. Considering that the ABC objects are relatively simple, we set the focal length to 60.00mm to ensure the object is only slightly occluded in the rendered images. The sensor width and height of the camera in Blender are both set to 32mm. The ground-truth annotations of the 3D wireframes come from the corresponding STEP files. For simplicity of evaluation, we only keep the straight-line structures and ignore the curved structures when obtaining the ground-truth annotations. The rendered images have a resolution of 512 × 512.</p><p>Baseline Configuration. Fig. <ref type="figure">13</ref> illustrates the rendered input images for the four used objects. Because the rendered images are textureless and depict planar objects, the dependency of the baselines on correspondence-based sparse reconstruction by SfM systems <ref type="bibr">[29]</ref> is hardly satisfied, failing to produce reliable line segment matches for 3D line reconstruction. 
Accordingly, we set up an ideal baseline instead of using Line3D++ <ref type="bibr">[12]</ref> and LiMAP <ref type="bibr">[17]</ref> for comparison. Specifically, we first detect the 2D wireframes for the rendered input images and then project the junctions and line segments of the ground-truth 3D wireframe models onto the 2D image plane. For the 2D junctions, if a projected ground-truth junction is supported by a detected one within 5 pixels in any view, we keep the ground-truth junction as the reconstructed one in the ideal case. For the 2D line segments, we compute the minimal endpoint-to-endpoint distance between a detected line segment and a projected ground-truth 3D line to check whether the detection supports it; the threshold is also set to 5 pixels. Then, we count the number of reconstructed 3D line segments and junctions in this ideal case.</p><p>Evaluation Metrics. For our method, we compute the precision and recall of the reconstructed 3D junctions and line segments under given thresholds. Because the objects (and the ground-truth wireframes) are normalized into a unit cube, we set the matching thresholds to {0.01, 0.02, 0.05} for evaluation. For the matching distance of line segments, we use the maximal value of the matching distance between the two endpoints to identify whether a line segment is successfully reconstructed under the specific distance threshold. For the ideal baseline, we report the number of ground-truth primitives (junctions or line segments), the number of reconstructed primitives, and the reconstruction rate.</p><p>Results and Discussion. Tab. 7 quantitatively summarizes the evaluation results and the statistics on the used scenes. As reported, our NEAT approach can accurately reconstruct the wireframes from posed multi-view images. The main performance bottleneck of our method comes from the 2D detection results. 
Tab. 7 reports, for each object, the precision and recall rates for junctions (J) and line segments (L); for the ideal baseline, it reports the number of ground-truth primitives, the number of reconstructed 3D primitives, and the reconstruction rate.</p><p>As shown by the ideal baseline, even when projecting the 3D junctions and line segments into the image planes to obtain ideal 2D detection results, the 2D detections by HAWPv3 <ref type="bibr">[46]</ref> did not perfectly hit all ground-truth annotations. Furthermore, even if we use only the hit ground-truth annotations (localization error less than 5 pixels) for 3D wireframe reconstruction, some 3D junctions and even more 3D line segments may be missed. In this sense, given a relaxed threshold of the reconstruction error for precision and recall computation, our NEAT approach is comparable with the performance of the ideal solution. For the first object (ID 4981), because of severe self-occlusion, some line segments are not successfully reconstructed by either the ideal baseline or our approach. For object 17078, our NEAT approach reconstructed some parts of the two circles that are excluded from the ground truth, which leads to a relatively low precision rate. Fig. <ref type="figure">13</ref> also supports these results.</p></div>
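The camera placement from the data preparation above (100 poses at the stated distance, looking at the origin) can be sketched as a look-at construction; the world-up convention and the random sampling of viewing directions are assumptions.

```python
import numpy as np

def random_lookat_cameras(n=100, radius=np.sqrt(1.5**2 + 1.5**2), seed=0):
    """Sketch of the ABC camera setup: n camera centers at the stated
    distance from the origin, each with a world-to-camera rotation
    looking at (0, 0, 0). The up vector (0, 0, 1) is an assumption."""
    rng = np.random.default_rng(seed)
    poses = []
    for _ in range(n):
        c = rng.normal(size=3)
        c = radius * c / np.linalg.norm(c)      # camera center on the sphere
        forward = -c / np.linalg.norm(c)        # viewing direction: toward origin
        right = np.cross(forward, np.array([0.0, 0.0, 1.0]))
        if np.linalg.norm(right) < 1e-8:        # degenerate: looking along up
            right = np.array([1.0, 0.0, 0.0])
        right /= np.linalg.norm(right)
        up = np.cross(right, forward)
        R = np.stack([right, up, forward])      # rotation rows: x, y, z axes
        poses.append((R, c))
    return poses
```

Each (R, c) pair corresponds to one rendered view of the normalized object.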
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. 3D Gaussians with NEAT Junctions</head><p>In this section, we extend the application of our NEAT framework to 3D Gaussian Splatting, as proposed by Kerbl et al. <ref type="bibr">[13]</ref>, by substituting the initial point cloud derived from Structure-from-Motion (SfM) with the junctions identified by NEAT. This experiment is designed to showcase the efficacy of NEAT junctions as a compact initialization for 3D Gaussian Splatting. Using only a few hundred points, our NEAT junctions demonstrate an enhanced fitting ability on the DTU dataset, as evidenced by improved Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM).</p><p>The experimental results on 12 scenes from the DTU dataset are detailed in Tab. 8. We observe that by initializing the 3D Gaussians with NEAT junctions, there is a notable improvement in performance: PSNR increases by 0.38 dB and SSIM improves by 0.0003 points. This finding underscores the effectiveness of NEAT junctions in providing a more precise and compact starting point for 3D Gaussian Splatting.</p><p>where P and P* are the point clouds sampled from the predictions and the ground-truth mesh; the accuracy (ACC) and completeness (COMP) metrics are the mean nearest-neighbor distances from P to P* and from P* to P, respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F.2. Information of Used BlendedMVS Scenes</head><p>The scene IDs and MD5 codes of the used BlendedMVS scenes are:</p><p>&#8226; Scene-01: 5c34300a73a8df509add216d</p><p>&#8226; Scene-02: 5b6e716d67b396324c2d77cb</p><p>&#8226; Scene-03: 5b6eff8b67b396324c5b2672</p><p>&#8226; Scene-04: 5af28cea59bc705737003253</p></div></body>
		</text>
</TEI>
