<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Assessing Monocular Depth Estimation Networks for UAS Deployment in Rainforest Environments</title></titleStmt>
			<publicationStmt>
				<publisher>IEEE</publisher>
				<date>10/14/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10583895</idno>
					<idno type="doi">10.1109/IROS58592.2024.10802125</idno>
					
					<author>Srisai Anirudh Tangellapalli</author><author>Harman Singh_Sangha</author><author>Joshua Peschel</author><author>Brittany A Duncan</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[The primary objective of this study was to utilize state-of-the-art deep learning-based monocular depth estimation models to assist UAS pilots in rainforest canopy data collection and navigation. Monocular depth estimation models provide a complementary technique to other depth measurement and estimation techniques to extend the range and improve measurements. Several state-of-the-art models were evaluated using a novel dataset composed of data from a simulated rainforest environment. In the evaluation, MiDaS outperformed the other models, and a segmentation pipeline was designed using this model to identify the highest areas of the canopies. The segmentation pipeline was evaluated using 1080p and 360p input videos from the simulated rainforest dataset. It was able to achieve an IoU of 0.848 and 0.826 and an F1 score of 0.915 and 0.902 at each resolution, respectively. We incorporated the proposed depth-estimation-based segmentation pipeline into an example application and deployed it on an edge system. Experimental results display the capabilities of a UAS using the segmentation pipeline for rainforest data collection.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Rainforest canopies are an important part of subtropical forest regions as they help support many plant and animal species in rainforests <ref type="bibr">[1]</ref>-<ref type="bibr">[4]</ref>. Rainforest studies help in understanding the effects on the biological processes and taking adequate control measures for preserving them. In these studies, data related to the water cycle, vegetation growth, etc., are collected at canopy and ground levels <ref type="bibr">[5]</ref>, <ref type="bibr">[6]</ref>. Canopy sampling has many useful applications in increasing productivity and growth <ref type="bibr">[7]</ref>. The most useful samples can only be found at the canopy level because of the increased exposure to sunlight; however, collecting samples from this region of the forest is difficult due to the height. One current practice uses poles to increase reach, but this method is limited by the poor maneuverability of a pole longer than 10 m <ref type="bibr">[8]</ref>. Other methods are more involved while providing less comprehensive coverage because they require fixed infrastructure such as canopy cranes or canopy rafts <ref type="bibr">[11]</ref>, <ref type="bibr">[12]</ref>.</p><p>Remote sensing has been a crucial part of forest management and monitoring, from the early use of crewed aerial imagery for forest inventory to satellite image-based resource monitoring <ref type="bibr">[13]</ref>. With the introduction of uncrewed aerial systems (UASs) in the early 2000s as a new tool for collecting imagery-based data, the turnaround time for remote sensing data collection was reduced from days and months to hours <ref type="bibr">[14]</ref>, <ref type="bibr">[15]</ref>. The ability to collect samples from precise locations using UASs has been developed for various environments <ref type="bibr">[8]</ref>-<ref type="bibr">[10]</ref>. 
As these systems further develop, they can aid in obtaining soil, water, and leaf samples from predetermined and coordinated sites to provide sufficient coverage and spatially distributed data collection in locations where physical access is limited <ref type="bibr">[16]</ref>. To efficiently collect canopy samples, the UAS must reach the sampling site and position itself accurately.</p><p>UASs perceive their surroundings by employing various kinds of sensors, such as LIDAR, depth cameras, etc. <ref type="bibr">[13]</ref>. Physical depth measurement sensors such as LIDAR, while invaluable for their precision in mapping and object detection, confront significant challenges. Foremost among these is their inherent limitation in range, constraining their effectiveness in capturing data over long distances. Moreover, LIDAR sensors frequently encounter noise interference, which can distort readings and compromise the accuracy of the collected data. In addition to range limitations and noise interference, another challenge inherent to LIDAR sensors lies in the complexity of processing the data they generate. Despite their precision in capturing spatial information, LIDAR data is difficult to process effectively. Furthermore, the output typically yields a sparse depth map, presenting only discrete data points rather than a continuous representation of the environment. This leads to further hurdles in utilizing the data for other applications, requiring sophisticated algorithms and computational techniques to fully perceive the environment. To address these issues, alternative methods are currently being studied.</p><p>The recent advancements in computer vision due to deep learning led to the rise of powerful monocular depth estimation networks that can perceive depth from a single RGB image. Monocular depth estimation models offer a complementary perspective to LIDAR and other physical depth measurement sensors, potentially circumventing some of the limitations. 
By harnessing advanced computer vision techniques, monocular depth estimation models aim to infer depth information directly from 2D images, providing a denser and more continuous depth map compared to LIDAR's sparse output. Integrating these models alongside LIDAR sensors holds promise for enhancing overall perception systems, enabling a more robust and comprehensive understanding of the surrounding environment. We investigated and evaluated state-of-the-art, deep learning-based, monocular depth estimation networks to aid UASs in rainforest canopy navigation and data collection. Among the depth estimation networks evaluated, the MiDaS model performed the best in a rainforest environment <ref type="bibr">[17]</ref>. MiDaS also achieved the highest average FPS among the tested networks.</p><p>After determining the best-performing model, we designed a depth-based segmentation pipeline to identify tree clusters. The proposed four-stage pipeline utilizes MiDaS and conventional computer vision techniques to segment tree clusters in the forest canopy. The pipeline was evaluated using a modified version of the synthetic dataset used to evaluate the depth estimation networks. Additionally, to verify the capability of the pipeline, an application was implemented that runs the pipeline on a drone to help the pilot approach leaf sampling heights above a tree. Through analyzing the experimental results of the pipeline and the flight information, we demonstrated the capability of our proposed pipeline to aid in rainforest leaf sampling and navigation. In summary, our contributions in this work are listed as follows.</p><p>&#8226; We constructed a synthetic dataset generator for rainforest environments.</p><p>&#8226; We evaluated state-of-the-art depth estimation models and highlighted the need for more diverse datasets.</p><p>&#8226; We designed a depth-based segmentation pipeline to identify tree clusters within rainforest canopies.</p><p>&#8226; We developed and deployed the proposed segmentation pipeline, which runs on the edge.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. RELATED WORK</head><p>In this section, we cover the importance of UAS in rainforest management. Through analyzing the current methods used to monitor rainforests, it is clear the usage of drones helps to cut the cost of intensive forest management and increase economic returns. We also review the usage of computer vision in autonomous navigation to learn the state-of-the-art in segmentation which is helpful for our pipeline. Finally, we review the current state of monocular depth estimation to find the best models and architectures for our pipeline.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Rainforest monitoring and management using uncrewed aerial systems</head><p>Early forest research in the 1900s was based on ground metering methods and bottom-up observations <ref type="bibr">[18]</ref>. For collecting data on forest canopies, tall structures were built to reach the upper heights of the canopy. Constructing such structures required a large investment, and observations were limited to a small area. With the development of remote sensing technologies, new methods were found to monitor and manage forest canopies that were cheaper and gave access to a larger area. Initial research used satellite imagery, but satellite remote sensing was found to be limiting when precise data collection was needed. Therefore, the introduction of UAS in the early 2000s in forest canopy research has revolutionized the data collected for forest canopy monitoring <ref type="bibr">[13]</ref>.</p><p>There are many studies where UAS were successfully deployed and crucial data was collected. They have been used in surveying forests, mapping canopy gaps, measuring forest canopy height, tracking forest wildfires, and forest management. Koh and Wich <ref type="bibr">[19]</ref> surveyed and mapped tropical rainforests in Indonesia. Getzin <ref type="bibr">[20]</ref> found a strong correlation between biodiversity and forest gap metrics obtained from drone remote sensing in independent forest regions in Germany. They suggest that UAS imagery is proficient in obtaining high-resolution images of the forest canopy. Forest canopy attributes are critical parameters of forest quantification. Chianucci <ref type="bibr">[21]</ref> obtained accurate measurements of canopy structure using high-resolution true-color UAV images.</p><p>Drone remote sensing missions have provided near real-time intelligence to support forest wildfire management. Ambrosia et al. 
<ref type="bibr">[22]</ref> and Hinkley and Zajkowski <ref type="bibr">[23]</ref> employed a large, long-duration (24 h) fixed-wing drone for assisting forest wildfire management. Martinez-de Dios et al. <ref type="bibr">[24]</ref> and Merino et al. <ref type="bibr">[25]</ref> conducted a series of experiments that indicated rotary-wing drones could effectively collect real-time data on forest wildfires. Simultaneous use of multiple drones allowed larger areas to be measured and complementary views of wildfires to be obtained. Management of forest plantations is similar to the practice of precision agriculture and can be promoted with drone remote sensing. Felderhof and Gillieson <ref type="bibr">[26]</ref> used drone remote sensing to map tree canopy health in a macadamia plantation. A significant correlation was found between spectral radiometry and leaf nitrogen levels by field sampling. Charron et al. <ref type="bibr">[8]</ref> proposed a novel tool that is directly installed on a drone to collect leaf samples. Data collection tools such as this would greatly benefit from our proposed pipeline to approach leaf sampling sites and navigate rainforest canopies.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Computer vision in autonomous navigation</head><p>Computer vision plays a large role in the field of autonomous navigation. Many of the advancements in this field, such as object detection and image segmentation networks, have come from deep learning-based methods. He et al. <ref type="bibr">[27]</ref> built upon previous RCNN models to introduce a state-of-the-art object instance segmentation model, Mask-RCNN. This model accurately identifies objects such as cars and pedestrians in street-level photos and videos and also provides masks that can be used for pixel-level segmentation. Li et al. <ref type="bibr">[28]</ref> propose a real-time image segmentation model that leverages smaller networks to achieve a high average FPS on difficult datasets. Milioto et al. <ref type="bibr">[29]</ref> utilize LIDAR point clouds to segment the surrounding scene for use in autonomous driving. LIDAR and image segmentation have also been used in the field of tree segmentation. Fekete et al. <ref type="bibr">[30]</ref> utilize LIDAR data collected by drones to segment trees in urban areas.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Monocular Depth Estimation</head><p>Deep neural networks have grown to be very popular in computer vision applications, such as image segmentation and classification, and now these networks are being applied to monocular depth estimation. Monocular depth estimation is considered to be the recovery of a pixel-level depth map from a single input RGB image <ref type="bibr">[33]</ref>, <ref type="bibr">[34]</ref>. One of the first methods, Eigen et al. <ref type="bibr">[35]</ref> proposed a two-stage network that provided the foundation for modern depth estimation networks. Since then, many different types of networks have been proposed, including CNNs, recurrent neural networks (RNNs), and generative adversarial networks (GANs) <ref type="bibr">[36]</ref>- <ref type="bibr">[38]</ref>. The proposed methods also utilized different training methods, including supervised, unsupervised, and semi-supervised <ref type="bibr">[36]</ref>.</p><p>The encoder-decoder architecture has contributed greatly to many depth estimation networks and is utilized by the methods evaluated in this paper <ref type="bibr">[17]</ref>, <ref type="bibr">[33]</ref>, <ref type="bibr">[34]</ref>, <ref type="bibr">[40]</ref>. The encoder-decoder architecture is a two-stage network in which the encoder extracts important features from the input image at each layer and then the decoder upsamples the features into the required outputs <ref type="bibr">[33]</ref>. The encoder-decoder architecture has been applied using both supervised and unsupervised models. Supervised methods are inherently at a disadvantage due to the requirement of ground truth training data <ref type="bibr">[41]</ref>. To overcome this issue, the usage of synthetic data is starting to be adopted in training deep learning networks but it is not yet widely used <ref type="bibr">[43]</ref>. 
Currently, there exist two datasets, NYU <ref type="bibr">[44]</ref> and KITTI <ref type="bibr">[45]</ref>, which most state-of-the-art models utilize for training and evaluation, including the model selected in this study.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. METHODS &amp; MATERIALS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Depth Estimation Models</head><p>We evaluated several existing depth estimation networks to identify the most suitable network for use in a rainforest environment. The resulting model would be the primary model used in the proposed segmentation pipeline. The selected models are MiDaS <ref type="bibr">[17]</ref>, Monodepth2 <ref type="bibr">[41]</ref>, GLPDepth <ref type="bibr">[34]</ref>, Big-To-Small (BTS) <ref type="bibr">[39]</ref>, and LapDepth <ref type="bibr">[40]</ref>. This subset was selected because the models contributed to the state-of-the-art of monocular depth estimation. The set of selected models includes supervised and self-supervised models. The networks were evaluated using the publicly released pre-trained weights with the configuration specified by their respective authors. The models were evaluated on a PC comprising an AMD Ryzen 9 5900X CPU, 32 GB of RAM, and an RTX 3070 GPU.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Synthetic Rainforest Dataset</head><p>The depth estimation models must be evaluated using a dataset comprising diverse forest canopies. Video data was collected at the Texas A&amp;M Soltis Center located next to the Children's Eternal Rainforest in San Juan de Pe&#241;as Blancas, San Ram&#243;n, Costa Rica. Fig. <ref type="figure">1</ref> displays one of the data collection missions conducted in Costa Rica. After analyzing the collected rainforest canopy video data, several scenarios occur frequently that need to be represented in the evaluation dataset. The scenarios were: continuous canopy, canopy with noticeable gaps, and canopy with irregular height of trees. Unfortunately, there is a lack of sufficient real rainforest datasets that represent these scenarios and also contain the necessary depth information to evaluate the selected models. Thus, we developed a novel synthetic dataset generator to create a dataset that provides complete control of the environment.</p><p>The generator was implemented using Blender <ref type="bibr">[42]</ref> and the included Python API. First, we modeled the trees and landscape and configured the environment to generate a photorealistic dataset similar to the collected video data. Using Blender's weight painting tool, it was possible to control the random placement of trees to generate the previously mentioned scenarios. Finally, we developed scripts to automate the camera placement and render an RGB image and the corresponding z-depth map necessary for evaluation. The generator created a dataset comprising approximately 2000 images and several 30-second to 1-minute-long videos. To ensure the scenarios are accurately represented in the dataset, 100 images are manually generated for each scenario. 
The dataset includes forward-facing (45&#176; below the horizon) and downward-facing (directly facing the ground) images because the different perspectives provide different navigational data. However, only the downward-facing images are used for the segmentation pipeline because the primary perspective for leaf sampling is approaching from directly above the sample site. Examples from the generator are displayed in the top two rows of Fig. <ref type="figure">2</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Segmentation Pipeline</head><p>After selecting the primary depth estimation model, we developed a segmentation pipeline incorporating the network. The pipeline expects an input of downward-facing RGB images or video and produces a segmentation mask that identifies the topmost layer of tree clusters in the canopy. The pipeline handles the input frame-by-frame and consists of four stages: preprocessing, depth estimation, post-processing, and segmentation. The preprocessing stage converts the input image to a normalized grayscale image and resizes it to 384 &#215; 384 to match the input requirements of the depth estimation network. The resulting grayscale image is piped into the depth estimation network to generate a depth map. In the post-processing stage, the depth map is resized back to the original input resolution, a Gaussian blur is applied, and the result is converted to a binary image using thresholding so that only the topmost layer of the canopy is highlighted. In the segmentation stage, the connected clusters in the binary image are separated into individual masks for use in leaf sampling or canopy navigation.</p></div>
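<div xmlns="http://www.tei-c.org/ns/1.0"><p>The post-processing and segmentation stages described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses NumPy only, it takes an already computed depth map as input in place of the MiDaS network, it assumes that larger values mean closer to the camera (MiDaS predicts relative inverse depth), it uses 4-connectivity for clusters, and the function names gaussian_blur, otsu_threshold, label_clusters, and segment_canopy are hypothetical.</p><ab><![CDATA[
```python
from collections import deque

import numpy as np


def gaussian_blur(img, size=5, sigma=1.0):
    """Separable Gaussian blur with edge padding; size must be odd."""
    offsets = np.arange(size) - size // 2
    kernel = np.exp(-(offsets ** 2) / (2 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(img, size // 2, mode="edge")
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")


def otsu_threshold(img, bins=256):
    """Return the threshold that maximises the between-class variance."""
    hist, edges = np.histogram(img, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(hist)                  # pixels at or below each bin
    w1 = w0[-1] - w0                      # pixels above each bin
    s0 = np.cumsum(hist * centers)
    mu0 = s0 / np.maximum(w0, 1)          # mean of the lower class
    mu1 = (s0[-1] - s0) / np.maximum(w1, 1)  # mean of the upper class
    between = w0 * w1 * (mu0 - mu1) ** 2
    return centers[np.argmax(between)]


def label_clusters(mask):
    """Label 4-connected foreground regions with a breadth-first search."""
    labels = np.zeros(mask.shape, dtype=int)
    count = 0
    height, width = mask.shape
    for seed in zip(*np.nonzero(mask)):
        if labels[seed]:
            continue  # already part of an earlier cluster
        count += 1
        labels[seed] = count
        queue = deque([seed])
        while queue:
            r, c = queue.popleft()
            for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                inside = 0 <= nr < height and 0 <= nc < width
                if inside and mask[nr, nc] and not labels[nr, nc]:
                    labels[nr, nc] = count
                    queue.append((nr, nc))
    return labels, count


def segment_canopy(depth_map, blur_size=5, sigma=1.0):
    """Normalise, blur, binarise, and split a depth map into cluster labels.

    Assumes larger depth_map values are closer to a downward-facing camera.
    """
    depth = np.asarray(depth_map, dtype=float)
    span = depth.max() - depth.min()
    norm = (depth - depth.min()) / span if span > 0 else np.zeros_like(depth)
    blurred = gaussian_blur(norm, blur_size, sigma)
    mask = blurred > otsu_threshold(blurred)
    return label_clusters(mask)
```
]]></ab><p>Calling segment_canopy on a depth map returns an integer label image and the number of clusters; each label value can then be turned into an individual binary mask with a comparison such as labels == k. In a real deployment, optimized library routines for blurring, thresholding, and connected-component labeling would likely replace these hand-written versions.</p></div>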
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Evaluation</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>1) Depth Estimation Model Evaluation Metrics</head><p>The following evaluation metrics are utilized to compare the selected depth estimation networks. These evaluation metrics are commonly used across other similar works <ref type="bibr">[36]</ref>. In addition, average FPS (frames per second) and model size were added because they indicate whether the model will be suitable for use in an edge system like a UAS. The model size was calculated by measuring the disk space taken up by the weights of each model in megabytes.</p><p>In the metric definitions, d_i is the predicted pixel value from the depth estimation model, and d*_i is the ground truth pixel value. N is the total number of valid pixels in the depth maps; depth values less than 1e-3 are ignored. thr is the threshold value used for the accuracy metrics. The chosen values are 1.25, 1.25^2, and 1.25^3. These values are commonly used in the evaluation of monocular depth estimation models <ref type="bibr">[34]</ref>, <ref type="bibr">[36]</ref>, <ref type="bibr">[41]</ref>. t_i is the time captured after processing the current frame, and t*_i is the time captured before the processing begins. F is the total number of frames in the input video.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>2) Segmentation Pipeline Metrics</head><p>The segmentation pipeline is evaluated by comparing the results with manually annotated images from the generated synthetic dataset using common image segmentation metrics <ref type="bibr">[31]</ref>, <ref type="bibr">[32]</ref>.</p><p>p_i is the binary mask output from the proposed pipeline, and g_i is the annotated mask. N is the total number of images in the dataset. To calculate the metrics, all of the individual segmentation masks related to an input image are combined into a single, whole binary mask. Then, the IoU and F1 scores for each pair of predicted and ground truth masks are calculated. Finally, the mean is calculated for both metrics. t_i is the time captured after processing the current frame, and t*_i is the time captured before the processing begins. F is the total number of frames in the input video.</p></div>
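<div xmlns="http://www.tei-c.org/ns/1.0"><p>The metrics named in this subsection can be sketched in code as follows. The formulas used here are the definitions commonly found in the monocular depth estimation and segmentation literature and are assumed, not confirmed, to match the ones used in this study; in particular, any scaling or alignment applied before computing the errors is not covered. The function names, the choice to mask both prediction and ground truth at 1e-3, and the reading of average FPS as F divided by the total processing time are assumptions made for illustration.</p><ab><![CDATA[
```python
import numpy as np


def depth_metrics(pred, gt, min_depth=1e-3):
    """Standard depth metrics over valid pixels.

    delta_k  : fraction of pixels with max(d/d*, d*/d) < 1.25**k
    abs_rel  : mean(|d - d*| / d*)
    sq_rel   : mean((d - d*)**2 / d*)
    rmse     : sqrt(mean((d - d*)**2))
    rmse_log : sqrt(mean((log d - log d*)**2))
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    # Assumption: pixels below min_depth in either map are ignored.
    valid = (gt > min_depth) & (pred > min_depth)
    d, d_star = pred[valid], gt[valid]
    ratio = np.maximum(d / d_star, d_star / d)
    diff = d - d_star
    log_diff = np.log(d) - np.log(d_star)
    return {
        "delta1": float(np.mean(ratio < 1.25)),
        "delta2": float(np.mean(ratio < 1.25 ** 2)),
        "delta3": float(np.mean(ratio < 1.25 ** 3)),
        "abs_rel": float(np.mean(np.abs(diff) / d_star)),
        "sq_rel": float(np.mean(diff ** 2 / d_star)),
        "rmse": float(np.sqrt(np.mean(diff ** 2))),
        "rmse_log": float(np.sqrt(np.mean(log_diff ** 2))),
    }


def iou_and_f1(pred_mask, gt_mask):
    """IoU = |p & g| / |p | g| and F1 = 2|p & g| / (|p| + |g|) for one image."""
    p = np.asarray(pred_mask, dtype=bool)
    g = np.asarray(gt_mask, dtype=bool)
    inter = np.logical_and(p, g).sum()
    union = np.logical_or(p, g).sum()
    total = p.sum() + g.sum()
    # Two empty masks are treated as a perfect match (a convention, not from the paper).
    iou = inter / union if union else 1.0
    f1 = 2 * inter / total if total else 1.0
    return float(iou), float(f1)


def average_fps(start_times, end_times):
    """Frames divided by total per-frame processing time (one plausible reading)."""
    elapsed = sum(end - start for start, end in zip(start_times, end_times))
    return len(start_times) / elapsed
```
]]></ab><p>For a dataset, iou_and_f1 would be called once per image on the combined binary masks and the two scores averaged over all images, which mirrors the procedure described above.</p></div>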
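<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. Leaf Sampling System</head><p>We designed and implemented an edge system that utilizes the proposed depth-based segmentation pipeline, is deployed on a drone, and assists the UAS pilot in approaching leaf sampling sites. The implementation of this system serves as a step in qualitatively validating the efficacy of our pipeline. We aim to expand upon this application based on the results we achieved in this study, and we will present a deeper, quantitative analysis of this application in future work.</p><p>The system is deployed on an Intel&#174; NUC 12 onboard a DJI Matrice 600 drone with an Intel&#174; RealSense&#8482; Depth Camera (D455) as the main camera. This RGB-D camera is used to simultaneously collect RGB video data and verify the depth map output from the segmentation pipeline. The depth-based segmentation pipeline was developed as a ROS package that runs on the NUC and publishes the resulting depth map and segmentation masks to corresponding ROS topics. A web application is designed to display several frames as a user interface to assist the pilot as they navigate the drone to the leaf sampling site. The frames include the original RGB camera input, estimated depth map, resulting segmentation masks, camera depth map, and an overlay of the original RGB camera input and segmentation masks. The overlay frame combines the original RGB video input and the segmentation masks to highlight the tree clusters that are closest for leaf sampling. The web application is deployed on a ground station connected to the same network as the drone and subscribes to the output topics from the drone.</p>
<table><head>TABLE I: Evaluation results for the entire forest dataset for every depth estimation model.</head>
<row role="label"><cell>Method</cell><cell>&#948;1 &#8593;</cell><cell>&#948;2 &#8593;</cell><cell>&#948;3 &#8593;</cell><cell>AbsRel &#8595;</cell><cell>SqRel &#8595;</cell><cell>RMSE &#8595;</cell><cell>RMSE log &#8595;</cell></row>
<row><cell>BTS</cell><cell>0.01</cell><cell>0.21</cell><cell>0.55</cell><cell>77.13</cell><cell>65.42</cell><cell>0.50</cell><cell>1.98</cell></row>
<row><cell>GLPDepth</cell><cell>0.29</cell><cell>0.60</cell><cell>0.75</cell><cell>55.41</cell><cell>34.18</cell><cell>0.29</cell><cell>1.76</cell></row>
<row><cell>LapDepth</cell><cell>0.42</cell><cell>0.69</cell><cell>0.80</cell><cell>47.17</cell><cell>26.81</cell><cell>0.23</cell><cell>1.67</cell></row>
<row><cell>MiDaS</cell><cell>0.28</cell><cell>0.56</cell><cell>0.69</cell><cell>18.28</cell><cell>5.61</cell><cell>0.22</cell><cell>1.54</cell></row>
<row><cell>Monodepth2</cell><cell>0.24</cell><cell>0.49</cell><cell>0.67</cell><cell>22.66</cell><cell>8.46</cell><cell>0.26</cell><cell>1.68</cell></row>
</table>
<table><head>TABLE II: Evaluation results for average FPS and model size.</head>
<row role="label"><cell>Method</cell><cell>Avg FPS &#8593;</cell><cell>Model Size (MB) &#8595;</cell></row>
<row><cell>BTS</cell><cell>1.1</cell><cell>549</cell></row>
<row><cell>GLPDepth</cell><cell>17.2</cell><cell>233</cell></row>
<row><cell>LapDepth</cell><cell>1.9</cell><cell>281</cell></row>
<row><cell>MiDaS</cell><cell>20.0</cell><cell>1280</cell></row>
<row><cell>Monodepth2</cell><cell>10.7</cell><cell>56.6</cell></row>
</table>
<table><head>TABLE III: Evaluation results for the proposed pipeline using MiDaS.</head>
<row role="label"><cell>Metric</cell><cell>1080p</cell><cell>360p</cell></row>
<row><cell>IoU score</cell><cell>0.848</cell><cell>0.826</cell></row>
<row><cell>F1 score</cell><cell>0.915</cell><cell>0.902</cell></row>
<row><cell>Avg FPS</cell><cell>7.549</cell><cell>14.818</cell></row>
</table></div>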
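<div xmlns="http://www.tei-c.org/ns/1.0"><p>The overlay frame described in this subsection can be illustrated with a simple alpha blend of a highlight color into the masked pixels of the RGB frame. This is a sketch under assumptions: the function name overlay_mask, the highlight color, and the blending weight are hypothetical, and the web application may render its overlay differently.</p><ab><![CDATA[
```python
import numpy as np


def overlay_mask(rgb, mask, color=(255, 0, 0), alpha=0.4):
    """Blend `color` into the pixels of `rgb` selected by the binary `mask`.

    rgb   : (H, W, 3) uint8 image
    mask  : (H, W) array, non-zero where a tree cluster was segmented
    alpha : weight of the highlight color inside the mask
    """
    out = np.asarray(rgb, dtype=float).copy()
    selected = np.asarray(mask, dtype=bool)
    tint = np.array(color, dtype=float)
    out[selected] = (1 - alpha) * out[selected] + alpha * tint
    return out.round().astype(np.uint8)
```
]]></ab><p>Pixels outside the mask are returned unchanged, so the pilot still sees the original scene with only the segmented clusters tinted.</p></div>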
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. RESULTS &amp; DISCUSSION</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>1) Model Evaluation Results</head><p>Table <ref type="table">I</ref> presents the evaluation results for the selected depth estimation models on the entire synthetic forest canopy dataset. The two best performing models were MiDaS and LapDepth. Table <ref type="table">I</ref> provides insight into the ability of the models to generalize to a completely new dataset. The accuracy is low and the error is higher than expected across all models, which indicates the models should be tuned to handle the new environment and dataset. However, another factor that could be impacting the error is the high precision of the synthetic dataset, because it provides a very precise depth map as opposed to real depth measurement sensors. Additionally, none of the models were trained on a dataset that contained a rainforest environment. Most were trained using indoor footage and outdoor footage of urban environments. Moreover, the original training data heavily leaned towards a forward-facing angle, which led to the models predicting that the top of the image is farther away and the bottom is closer. This highlights the need for a dataset like ours that can provide a new benchmark for future depth estimation models. Based on the results displayed in Fig. <ref type="figure">2</ref>, the depth estimation maps from the MiDaS model (seen in the sixth row) have the most definition and clarity, which is crucial for our pipeline. The MiDaS model was trained using zero-shot cross-dataset transfer, which led to the model being able to generalize well to unseen datasets <ref type="bibr">[17]</ref>. 
Additionally, despite its large model size as seen in Table <ref type="table">II</ref>, the MiDaS model proves to be sufficiently optimized as it had the highest average FPS.</p><p>Fig. 3: The top row contains example inputs provided to the pipeline. The middle row displays the ground truth for corresponding images. The bottom row showcases the results from the proposed segmentation pipeline.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>2) Pipeline results</head><p>Table <ref type="table">III</ref> presents the evaluation results for the proposed depth-based segmentation pipeline on the entire synthetic forest canopy dataset at different resolutions. Fig. <ref type="figure">3</ref> displays some of the results obtained from the pipeline and the expected outputs. Based on the high IoU and F1 scores at both input resolutions, the proposed pipeline displays a high ability to correctly segment tree clusters. However, there is room for improvement, and several factors impacted the scores, specifically in the fourth stage of the pipeline.</p><p>The fourth stage of the pipeline takes a binary image as input and segments the image based on connected clusters. The performance of this stage is highly dependent on the input image being properly binarized, such that no important details are lost in the conversion. Imperfect binarization was likely the cause of several mispredicted masks. To convert an image to binary, a threshold must be selected to classify a pixel as 0 or 1, and several algorithms help select an optimal threshold value. The selected thresholding method, Otsu's method, was not always able to correctly identify the optimal threshold. Otsu's method is an automatic thresholding technique that calculates the histogram of the input image and utilizes the results to determine the threshold. However, the input image has a large range of values and can contain several peaks in the resulting histogram. This can make choosing an optimal threshold difficult, and the threshold likely needs to be adjusted manually for every new environment to consistently attain well-defined segmentation masks.</p><p>Table <ref type="table">III</ref> also presents the results the pipeline achieved when processing a high-resolution (1080p) video and a low-resolution (360p) video. 
As shown in the table, the IoU and F1 scores are not heavily impacted by the resolution of the input video. Moreover, reducing the input resolution drastically improves the average FPS. This is an important observation because it demonstrates that a lower-resolution camera can be sufficient to utilize this pipeline. This approach not only reduces hardware complexity but also opens up possibilities for scalability. By leveraging multiple camera streams, it becomes feasible to achieve comprehensive 360-degree coverage, enhancing the system's ability to detect and navigate around obstacles effectively.</p><p>Another way to improve the average FPS of the proposed method would be to use lighter or older versions of the MiDaS model, which have significantly fewer parameters. This will lead to a faster inference time because the current bottleneck for the average FPS is the depth estimation network. However, it may significantly impact the accuracy of the resulting depth maps.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>3) Leaf Sampling System Analysis</head><p>The leaf sampling system provides an example implementation and use case for the proposed pipeline. The system was deployed and tested at Havelock Research Farm, Lincoln, Nebraska, USA. We qualitatively analyze the results of the test flights to determine the efficacy of the proposed pipeline and system. Our observation of data captured from the depth camera revealed significant noise levels when compared to the predicted depth results and segmentation masks generated by the pipeline. This discrepancy highlights the potential efficacy of using depth estimation models in mitigating the inherent limitations of depth cameras and similar depth measurement sensors. Despite our successful implementation allowing the application to execute tasks in real time, we encountered notable challenges regarding the stability of data transmission between the drone and the ground station. 
Communication between the drone and ground station took place over a Wi-Fi router with sufficient wireless range; however, the processing and streaming of each frame was too computationally intensive for the NUC to maintain a stable data stream to the ground station. We attempted to use a lighter version of the MiDaS model, which improved the consistency of the data stream but caused significant accuracy issues in our pipeline.</p></div>
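<div xmlns="http://www.tei-c.org/ns/1.0"><p>The resolution trade-off discussed in the pipeline results can be made concrete with a short calculation on the values reported in Table III. The snippet below is only a worked arithmetic example; the dictionary layout and the helper relative_change are hypothetical.</p><ab><![CDATA[
```python
# Values copied from Table III (proposed pipeline using MiDaS).
table_iii = {
    "1080p": {"iou": 0.848, "f1": 0.915, "avg_fps": 7.549},
    "360p": {"iou": 0.826, "f1": 0.902, "avg_fps": 14.818},
}


def relative_change(high_res, low_res):
    """Change of the low-resolution value relative to the high-resolution one."""
    return (low_res - high_res) / high_res


high, low = table_iii["1080p"], table_iii["360p"]
fps_speedup = low["avg_fps"] / high["avg_fps"]
iou_change = relative_change(high["iou"], low["iou"])
f1_change = relative_change(high["f1"], low["f1"])

print(f"FPS speedup: {fps_speedup:.2f}x")   # FPS speedup: 1.96x
print(f"IoU change: {iou_change:+.1%}")     # IoU change: -2.6%
print(f"F1 change: {f1_change:+.1%}")       # F1 change: -1.4%
```
]]></ab><p>On these numbers, moving from 1080p to 360p roughly doubles the average FPS while the IoU and F1 scores fall by about 2.6% and 1.4% in relative terms, which is consistent with the observation that resolution has little effect on segmentation quality.</p></div>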
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. FUTURE WORK</head><p>To improve the leaf sampling system, developing a lighter depth estimation model tailored to our novel dataset holds substantial promise for enhancing performance and accuracy. This targeted approach not only reduces the computational complexity and memory requirements of the model but also enables it to better capture and understand the intricacies of the rainforest environment. Furthermore, we plan to explore single-line depth estimation techniques, which combine the precision of LIDAR with the dense depth maps derived from monocular depth estimation. This approach presents a promising avenue for achieving high-fidelity depth perception while minimizing computational overhead. Additionally, bolstering the communication infrastructure between the drone and ground station through a stronger radio link will ensure consistent and stable data transmission, vital for real-time monitoring and control. Our current efforts are focused on exploring and implementing these changes to our leaf sampling system and presenting the quantitative evaluation results in a forthcoming paper.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. CONCLUSION</head><p>The primary objective of this study was to evaluate and utilize state-of-the-art monocular deep learning-based depth estimation methods to assist UASs in forest canopy navigation and monitoring. Monocular depth estimation could replace LIDAR sensors on UASs, which are expensive and heavy to carry as a payload. Monocular depth estimation simply requires an RGB input, and it is capable of constructing a dense depth map of the given scene. A synthetic dataset was created using Blender to gather ground truth depth data to evaluate depth estimation networks. After evaluating several networks, we found that the MiDaS model outperformed the other models in nearly all aspects. We proposed a pipeline using the best-performing depth estimation model, MiDaS, for tree cluster segmentation. However, all models encountered high errors, highlighting the need for more depth estimation networks targeting under-represented outdoor environments, such as rainforests. The pipeline comprised four stages and used RGB video as input. The pipeline's output was a mask of the tree clusters in each video frame. The pipeline was evaluated using a set of manually annotated images from the generated synthetic dataset. The pipeline achieved high IoU and F1 scores and displayed the capability to run at a sufficient average FPS. However, a new depth estimation network targeting the rainforest environment would significantly improve pipeline accuracy and viability. Finally, the pipeline was incorporated into an edge system to showcase the feasibility and usefulness of the proposed depth-based segmentation pipeline. This study displays the capabilities of monocular depth estimation in UAS navigation and proposes a novel segmentation pipeline to aid in rainforest leaf sampling and navigation.</p></div></body>
		</text>
</TEI>
