<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>DewarpNet: Single-Image Document Unwarping With Stacked 3D and 2D Regression Networks</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>10/01/2019</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10137869</idno>
					<idno type="doi">10.1109/ICCV.2019.00022</idno>
					<title level='j'>IEEE/CVF International Conference on Computer Vision</title>

					<author>Sagnik Das</author><author>Ke Ma</author><author>Zhixin Shu</author><author>Dimitris Samaras</author><author>Roy Shilkrot</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Capturing document images with hand-held devices in unstructured environments is a common practice nowadays. However, “casual” photos of documents are usually unsuitable for automatic information extraction, mainly due to physical distortion of the document paper, as well as various camera positions and illumination conditions. In this work, we propose DewarpNet, a deep-learning approach for document image unwarping from a single image. Our insight is that the 3D geometry of the document not only determines the warping of its texture but also causes the illumination effects. Therefore, our novelty resides in the explicit modeling of 3D shape for document paper in an end-to-end pipeline. Also, we contribute the largest and most comprehensive dataset for document image unwarping to date – Doc3D. This dataset features multiple ground-truth annotations, including 3D shape, surface normals, UV map, albedo image, etc. Training with Doc3D, we demonstrate state-of-the-art performance for DewarpNet with extensive qualitative and quantitative evaluations. Our network also significantly improves OCR performance on captured document images, decreasing character error rate by 42% on average. Both the code and the dataset are released.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Paper documents carry valuable information and serve an essential role in our daily work and life. Digitized documents can be archived, retrieved, and shared in a convenient, safe, and efficient manner. With the increasing popularity of portable cameras and smartphones, document digitization becomes more accessible to users through picture taking. Once captured, the document images can be converted into electronic formats, for example, a PDF file, for further processing, exchange, information extraction, and content analysis. While capturing images, it is desirable to preserve the information on the document with the best possible accuracy -with a minimal difference from a flatbedscanned version. However, casual photos captured with mobile devices often suffer from different levels of distortions due to uncontrollable factors such as physical deformation of the paper, varying camera positions, and unconstrained illumination conditions. As a result, these raw images are often unsuitable for automatic information extraction and content analysis.</p><p>Previous literature has studied the document-unwarping problem using various approaches. Traditional approaches <ref type="bibr">[26,</ref><ref type="bibr">46]</ref> usually rely on the geometric properties of the paper to recover the unwarping. These methods first estimate the 3D shape of the paper, represented by either some parametric shape representations <ref type="bibr">[9,</ref><ref type="bibr">47]</ref> or some non-parametric shape representations <ref type="bibr">[35,</ref><ref type="bibr">45]</ref>. After that, they compute the flattened image from the warped image and the estimated shape using optimization techniques. A common drawback of these methods is that they are usually computationally expensive and slow due to the optimization process. Recent work by Ma et al. <ref type="bibr">[23]</ref> proposed a deep learning system that directly regresses the unwarping operation from the deformed document image. Their method significantly improved the speed of document unwarping system. However, their method did not follow the 3D geometric properties of the paper warping -training data was created with a set of 2D deformations -and therefore often generate unrealistic results in testing. Paper folds happen in 3D: papers with different textures but the same 3D shape can be unwarped with the same deformation field. Hence, 3D shape is arguably the most critical cue for recovering the unwarped paper. Based on this idea, we propose DewarpNet, a novel data-driven unwarping framework that utilizes an explicit 3D shape representation for learning the unwarping operation. DewarpNet works in two-stages with two sub-networks: i) The "shape network" consumes an image of a deformed document and outputs a 3D-coordinate map which has shown to be sufficient for the unwarping task <ref type="bibr">[45]</ref>. ii) The "texture mapping network" backward maps the deformed document image to a flattened document image. We train both sub-networks jointly with regression losses on the intermediate 3D shape and final unwarping result (Fig. <ref type="figure">1</ref>). 
After that, we provide a "refinement network" that removes the shading effect from the rectified image, further improving the perceptual quality of the result.</p><p>To enable the training of this unwarping network with explicit intermediate 3D representation, we create the Doc3D dataset -the largest and most comprehensive dataset for document image unwarping to date. We collect Doc3D in a hybrid manner, combining (1) captured 3D shapes (meshes) from naturally warped papers with (2) photorealistic rendering of an extensive collection of document content. Each data point comes with rich annotations, including 3D coordinate maps, surface normals, UV texture maps, and albedo maps. In total, Doc3D contains approximately 100,000 richly annotated photorealistic images.</p><p>We summarize our contributions as follows: First, we contribute the Doc3D dataset. To the best of our knowledge, this is the first and largest document image dataset with multiple ground-truth annotations in both 3D and 2D domain.</p><p>Second, we propose DewarpNet, a novel end-to-end deep learning architecture for document unwarping. This network enables high-quality document image unwarping in real-time.</p><p>Third, trained with the rich annotations in the Doc3D dataset, DewarpNet shows superior performance compared to recent state-of-the-art <ref type="bibr">[23]</ref>. Evaluating with perceptual similarity to real document scans, we improve the Multi-Scale Structural Similarity (MS-SSIM) by 15% and reduce the Local Distortion by 36%. Furthermore, we demonstrate the practical significance of our method by a 42% decrease in OCR character error rate.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Previous Work</head><p>Based on how deformation is modeled, the two groups of prior work on document unwarping are: parametric shapebased models and non-parametric shape-based models:</p><p>Parametric shape-based methods assume that document deformation is represented by low dimensional parametric models and the parameters of these models can be inferred using visual cues. Cylindrical surfaces are the most prevalent parametric models <ref type="bibr">[8,</ref><ref type="bibr">16,</ref><ref type="bibr">19,</ref><ref type="bibr">26,</ref><ref type="bibr">41,</ref><ref type="bibr">46]</ref>. Other models include Non-Uniform Rational B-Splines (NURBS) <ref type="bibr">[10,</ref><ref type="bibr">44]</ref>, piece-wise Natural Cubic Splines (NCS) <ref type="bibr">[36]</ref>, Coon patches <ref type="bibr">[9]</ref>, etc. Visual cues used for estimating model parameters include text lines <ref type="bibr">[25]</ref>, document boundaries <ref type="bibr">[5]</ref>, or laser beams from an external device <ref type="bibr">[27]</ref>. Shafait and Breuel <ref type="bibr">[33]</ref> reported several parametric shape based methods on a small dataset with only perspective and curl distortions. However, it is difficult for such low dimensional models to model complex surface deformations.</p><p>Non-parametric shape-based methods, in contrast, do not rely on low-dimensional parametric models. Such methods usually assume a mesh representation for the de-  formed document paper, and directly estimate the position of each vertex on the mesh. Approaches used to estimate the vertex positions, include reference images <ref type="bibr">[29]</ref>, text lines <ref type="bibr">[21,</ref><ref type="bibr">35,</ref><ref type="bibr">39]</ref>, and Convolutional Neural Networks (CNNs) <ref type="bibr">[30]</ref>. Many approaches reconstruct the mesh from estimated or captured 3D paper shape information. Notable examples are point clouds estimated from stereo vision <ref type="bibr">[38]</ref>, multi-view images <ref type="bibr">[45]</ref>, structured light <ref type="bibr">[4]</ref>, laser range scanners <ref type="bibr">[47]</ref>, etc. There is also work on directly using texture information for this task <ref type="bibr">[11,</ref><ref type="bibr">24,</ref><ref type="bibr">43]</ref>. However, resorting to external devices or multi-view images makes the methods less practical. Local text line features cannot handle documents that mix text with figures. Moreover, these methods often involve complicated and time-consuming optimization. Recently, Ma et al. <ref type="bibr">[23]</ref> proposed "DocUNet", which is the first data-driven method to tackle document unwarping with deep learning. Compared to prior approaches, DocUNet is faster during inference but does not always perform well on real-world images, mainly because the synthetic training dataset only used 2D deformations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">The Doc3D Dataset</head><p>We created the Doc3D dataset in a hybrid manner, using both real document data and rendering software. We first captured the 3D shape (mesh) of naturally deformed real document paper. After that, we rendered the images with real document texture in Blender <ref type="bibr">[1]</ref> using path tracing <ref type="bibr">[40]</ref>. We used diverse camera positions and varying illumination conditions in rendering.</p><p>A significant benefit of our approach is that the dataset is created in large scale with photorealistic rendering. Meanwhile, our method generates multiple types of pixel-wise document image ground truth, including 3D coordinate maps, albedo maps, normals, depth maps, and UV maps. Such image formation variations are useful for our task, but usually harder to obtain in real-world acquisition scenarios.</p><p>Compared with the dataset in <ref type="bibr">[23]</ref> where 3D deformation was modeled in 2D only <ref type="bibr">[28]</ref>, our dataset simulates document deformation in a physically-grounded manner. Thus, it is reasonable to expect that deep-learning models trained on our dataset will generalize better when testing on realworld images, compared to models trained on the dataset of <ref type="bibr">[23]</ref>. We visually compare dataset samples in Fig. <ref type="figure">2</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Capturing Deformed Document 3D Shape</head><p>3D point cloud capture. Our workstation (Fig. <ref type="figure">3 (I)</ref>) for deformed document shape capture consists of a tabletop, a gantry, a depth camera, and a relief stand. The gantry holds the depth camera level, facing towards the tabletop, at the height of 58 cm. At this height, the depth camera captures the whole document while still preserving deformation details. The relief stand has 64 individually controlled pins, raising the height of the document to isolate it from the tabletop. The height differences make it easier to extract the document from the background in the depth map. The stand simulates complex resting surfaces for the document and also supports the deformed document to maintain curls or creases.</p><p>We used a calibrated Intel RealSense D415 depth camera to capture the depth map. Assuming no occlusion, the point cloud of the document was obtained via X (3D) = K -1 [i, j, d ij ] T , where d ij is the depth value at the pixel position i, j in the depth map. The intrinsic matrix K was read from the camera. We averaged 6 frames to reduce zeromean noise, and applied Moving Least Squares (MLS) <ref type="bibr">[32]</ref> with a Gaussian kernel to smooth the point cloud.</p><p>Mesh creation. We extracted a mesh from the captured point cloud using the ball pivoting algorithm <ref type="bibr">[3]</ref>. The mesh has &#8764;130,000 vertices and 270,000 faces covering all vertices. We then subsampled each mesh to a 100 &#215; 100 uniform mesh grid to facilitate mesh augmentation, alignment, and rendering. Due to the accuracy limits of our inexpensive sensor, even a higher resolution mesh grid cannot provide finer details like subtle creases. Each vertex has a UV position, to indicate texture coordinates, used for texture mapping in the rendering step. Assigning (u, v) = {(0, 0), (0, 1), (1, 0), <ref type="bibr">(</ref>  reconstruction quality. For joint training we used &#945; = &#946; = 0.5 (Eq. 3). We use the Adam solver <ref type="bibr">[15]</ref> with a batch size of 40, and weight decay of 5 &#215; 10 -4 . The learning rate is initially set at 1 &#215; 10 -4 , and reduced by a factor of 0.5 if the loss does not reduce for 5 epochs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Experiments</head><p>We evaluate our method with multiple experiments on the 130-image benchmark from <ref type="bibr">[23]</ref>, and also show qualitative results on real images from <ref type="bibr">[45]</ref>. As a baseline, we train the DocUNet <ref type="bibr">[23]</ref> unwarping method on our new Doc3D dataset. Furthermore, we evaluate OCR performance of our method from a document analysis perspective. Finally, we provide a detailed ablation study to show how the use of the Coordinate Convolutions <ref type="bibr">[22]</ref>, and the loss L D affect unwarping performance. Qualitative evaluations are shown in Fig. <ref type="figure">7</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Experimental Setup</head><p>Benchmark. For quantitative evaluation, we classify the 130-image benchmark <ref type="bibr">[23]</ref> into six classes indicating six different levels of deformation complexity (see Table <ref type="table">1</ref>). The benchmark dataset contains various kinds of documents, including images, graphics, and multi-lingual text.</p><p>Evaluation Metrics. We use two different evaluation schemes based on (a) Image similarity and (b) Optical Character Recognition (OCR) performance.</p><p>We use two image similarity metrics: Multi-Scale Structural Similarity (MS-SSIM) <ref type="bibr">[42]</ref> and Local Distortion (LD) <ref type="bibr">[45]</ref>, as quantitative evaluation criteria, following <ref type="bibr">[23]</ref>. SSIM computes the similarity of the mean pixel value and variance within each image patch and averages over all the patches in an image. MS-SSIM applies SSIM at multiple scales using a Gaussian pyramid, better suited for the evaluation of global similarity between the result and groundtruth. LD computes a dense SIFT flow <ref type="bibr">[20]</ref> from the unwarped document to the corresponding document scan, thus focusing on the rectification of local details. The parameters of LD are set to the default values of the implementation provided by <ref type="bibr">[23]</ref>. For a fair comparison, all the unwarped output and target flatbed-scanned images are resized to a 598400 pixel area, as recommended in <ref type="bibr">[23]</ref>.</p><p>OCR accuracy is calculated in terms of Character Error Rate (CER). CER is evaluated by calculating the Edit Distance (ED) <ref type="bibr">[17]</ref> between the reference and recognized text. ED is the total number of substitutions (s), insertions (i) and deletions (d) to obtain the reference text, given the recognized text. CER = (s+i+d)/N , where N is the number of characters in the reference text, which is obtained from the flatbed scanned document images.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">DocUNet on Doc3D</head><p>We present a baseline validation of the proposed Doc3D dataset by training the network architecture in Do-cUNet <ref type="bibr">[23]</ref> on our dataset -Doc3D. DocUNet is a 3Dagnostic model. The architecture consists of two stacked UNets. DocUNet takes a 2D image as input and outputs a forward mapping (each pixel represents the coordinates in the texture image). The supervisory signal is solely based on the ground truth forward mapping. Unlike the proposed DewarpNet which can directly output the unwarped image, DocUNet needs several post-processing steps to convert the forward mapping to the backward mapping (each pixel represents the coordinates in the warped input image) and then sample the input image to get the unwarped result. <ref type="table">2</ref> show significant improvement when we train DocUNet on Doc3D instead of the 2D synthetic dataset from <ref type="bibr">[23]</ref>. The significant reduction of LD (14.08 to 10.85) signals a better local detail rectification. This improvement is the result of both (1) the Dewarp-Net architecture and ( <ref type="formula">2</ref>) training with a more physically grounded Doc3D dataset, compared to the 2D synthetic dataset in <ref type="bibr">[23]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Results in Table</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Test DewarpNet on the DocUNet Benchmark</head><p>We evaluate both DewarpNet and DewarpNet(ref ) (i.e., DewarpNet augmented with the post-processing refinement network) on the DocUNet Benchmark dataset. We provide comparisons on both (1) the overall benchmark dataset (Table 2) and ( <ref type="formula">2</ref>) each class in the benchmark (Fig. <ref type="figure">6</ref>). The latter provides detailed insight into the improvements of our approach over previous methods. From class (a) to (e), our model consistently improves MM-SSIM and LD over the previous state-of-the-art. In the most challenging class (f), where the images usually exhibit multiple crumples and random deformations, our method achieves comparable and slightly better results. Time Efficiency of DewarpNet. Our model takes 32ms on average to process a 4K resolution image. Compared to DocUNet <ref type="bibr">[23]</ref> this represents a 125x speed up. Dewarp-Net directly outputs the unwarped image whereas DocUNet requires an expensive separate post-processing step.   </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.">OCR Evaluation</head><p>We use PyTesseract (v0.2.6) <ref type="bibr">[34]</ref> as the OCR engine to evaluate the utility of our work on text recognition from images. The text ground-truth (reference) is generated from 25 images from DocUNet <ref type="bibr">[23]</ref>. In all these images, more than 90% of the content is text. The supplementary material contains some samples from our OCR test-set. OCR performance comparison, presented in Table <ref type="table">3</ref>, shows our method outperforms <ref type="bibr">[23]</ref> with a large margin in all metrics. In particular, DewarpNet reduces CER by 33% compared to DocUNet, and the refinement network gives a reduction of 42%.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5.">Ablation Studies</head><p>Coordinate Convolution (CoordConv). We investigate the effects of CoordConv on texture mapping network performance. The experiment (Table <ref type="table">4</ref>) on Doc3D validation set demonstrates that using CoordConv leads to a 16% &#8467;2error reduction on B and a slight improvement of SSIM on D from 0.9260 to 0.9281.</p><p>Loss L D . The texture mapping network benefits greatly from using L D (unwarped visual quality loss). As shown in Table <ref type="table">4</ref> compared to using the absolute pixel coordinate loss L B only, using L B + L D significantly reduces the &#8467;2 error on B by 71% and improve the SSIM on D by 9%.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.6.">Qualitative Evaluation</head><p>For qualitative evaluation, we compare DewarpNet with DocUNet in Fig. <ref type="figure">7</ref> and You et al. <ref type="bibr">[45]</ref> in Fig. <ref type="figure">8</ref>. The method by <ref type="bibr">[45]</ref> utilizes multi-view images to unwarp a deformed document. Even with a single image, DewarpNet shows competitive unwarping results.</p><p>Additionally, we show that the proposed method is robust to illumination variation and camera viewpoint changes in Fig. <ref type="figure">9</ref>. To evaluate the illumination robustness, we test on multiple images with a fixed camera viewpoint but different directional lighting from front, back, left, right of the document, and environment lighting. We also test DewarpNet robustness to multiple camera viewpoints, on a sequence of multi-view images provided by <ref type="bibr">[45]</ref>. Results show that DewarpNet yields almost the same unwarped image in all cases.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusions and Future Work</head><p>In this work, we present DewarpNet, a novel deep learning architecture for document paper unwarping. Our method is robust to document content, lighting, shading, or background. Through the explicit modeling of 3D shape, DewarpNet shows superior performance over previous state-of-the-art. Additionally, we contribute the Doc3D dataset -the largest and most comprehensive dataset for document image unwarping, which comes with multiple 2D and 3D ground truth annotations. Some limitations exist in our work: First, the inexpensive depth sensor cannot capture fine details of deformation like subtle creases on a paper crumple. Thus our data lacks samples with highly complex paper crumple. In future work, we plan to construct a dataset with better details and more complex structures. Second, DewarpNet is rela-  tively sensitive to occlusion: results degrade when parts of the imaged document are occluded. In future work, we plan to address this difficulty via data augmentation and adversarial training.</p></div></body>
		</text>
</TEI>
