<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Surveying Sidewalk Materials for and by Individuals Who Are Blind or Have Low Vision: Audio Data Collection and Classification</title></titleStmt>
			<publicationStmt>
				<publisher>International Conference on SMART MULTIMEDIA</publisher>
				<date>03/27/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10554507</idno>
					<idno type="doi"></idno>
					
					<author>J Liu</author><author>W P Lam</author><author>Z Zhu</author><author>H Tang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Navigating safely and independently presents considerable challenges for people who are blind or have low vision (BLV), as it requires a comprehensive understanding of their neighborhood environments. Our user study reveals that understanding sidewalk materials and objects on the sidewalks plays a crucial role in navigation tasks. This paper presents a pioneering study in the field of navigational aids for BLV individuals. We investigate the feasibility of using auditory data, specifically the sounds produced by cane tips against various sidewalk materials, to achieve material identification. Our approach utilizes machine learning and deep learning techniques to classify sidewalk materials solely based on audio cues, marking a significant step towards empowering BLV individuals with greater autonomy in their navigation. This study contributes in two major ways: Firstly, a lightweight and practical method is developed for volunteers or BLV individuals to autonomously collect auditory data of sidewalk materials using a microphone-equipped white cane. This innovative approach transforms routine cane usage into an effective data-collection tool. Secondly, a deep learning-based classifier algorithm is designed that leverages a dual architecture to enhance audio feature extraction. This includes a pre-trained Convolutional Neural Network (CNN) for regional feature extraction from two-dimensional Mel-spectrograms and a booster module for global feature enrichment.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The World Health Organization (WHO) estimates that there are 285 million people with visual impairment worldwide, among whom 39 million are totally blind <ref type="bibr">[1]</ref>. People who are blind or have low vision (BLV) face many challenges in their daily lives, including the difficulty of navigating safely and independently <ref type="bibr">[2]</ref>. To navigate effectively, individuals with BLV need to acquire as much spatial information as possible from their surroundings, including information about sidewalk materials and defects <ref type="bibr">[3]</ref>. Regrettably, most existing advanced applications <ref type="bibr">[4]</ref><ref type="bibr">[5]</ref><ref type="bibr">[6]</ref> do not provide sufficient functionality to help BLV people collect landmark information and understand sidewalk conditions. Mobile navigation applications with GPS and mapping services (such as Google Maps and Apple Maps), mainly focus on providing efficient, short navigation routes, which is insufficient for BLV individuals <ref type="bibr">[7,</ref><ref type="bibr">8]</ref>. Therefore, their preferences have to tilt towards paths rich in tactile landmarks and minimal sidewalk defects, prioritizing safety and reliability over shorter distances.</p><p>As for the BLV individuals, they often rely on white canes to explore their surroundings, via auditory feedback that enhances their spatial awareness and assists in self-localization <ref type="bibr">[3]</ref>. For example, they can follow the tactile shoreline in their travel by identifying surface material changes, such as grass edges or raised curbs. Many street intersections are equipped with tactile pavements of varying materials and patterns, designed to aid BLV people in identifying important locations, such as street crossings, bus stops, and the direction of streets. 
These surface materials serve as effective landmarks and hence their inclusion in accessible maps is crucial. Maps including sidewalk materials would be very useful to BLV people, facilitating real-time navigation and trip planning. To gain a deeper understanding of the challenges faced by BLV individuals in their lives, we have conducted an informal user study with BLV individuals <ref type="bibr">[13]</ref>. This study has revealed that materials and objects on sidewalks play a crucial role in navigation tasks. Moreover, BLV individuals highlight the critical role of audio signals in identifying sidewalk landmarks and ensuring safe travel in urban areas.</p><p>This study conducts a preliminary investigation into the use of non-visual, audio-based data, specifically the sounds produced by the cane tips of visually impaired individuals rubbing against different sidewalk materials, for the identification of materials that are challenging to differentiate by sight. Leveraging machine learning (ML) and deep learning techniques, this research centers on the classification of sidewalk materials using exclusively auditory cues. This inquiry lays the groundwork for a future in which BLV individuals can independently gather data on sidewalk materials during their routine travels, transforming the mundane act of cane usage into an opportunistic data collection method. Such a paradigm not only fosters autonomy among BLV individuals but also augments the navigational data repository with their unique, experientially-rich insights. 
As BLV individuals navigate diverse urban terrains, their canes become dual-purpose instruments, serving both personal navigational needs and the communal objective of enhancing a dynamic, adaptive mapping infrastructure responsive to the intricacies of urban settings.</p><p>In pursuit of this future, our study develops a novel data collection methodology that enables BLV individuals to autonomously gather auditory data. Additionally, this research introduces a deep learning-based classification system for categorizing these auditory signals. The key contributions of this paper are:</p><p>1. The design of a lightweight data collection method for BLV individuals to acquire non-visual information on sidewalk materials. We equipped the white cane with a microphone that records the sound produced as the cane contacts the sidewalk surface. Additionally, we have generated an auditory dataset of sidewalk materials using the proposed method.</p><p>2. The design of a deep learning-based classifier that combines a pre-trained Convolutional Neural Network (CNN) for regional feature extraction from Mel-spectrograms with a booster module for global feature enrichment.</p><p>The organization of the paper is as follows: Section 2 reviews the related work, providing context and background. Section 3 outlines the data acquisition approach in detail. Section 4 describes the classification methodology. Section 5 discusses the results derived from our experiments. Finally, Section 6 offers concluding observations and remarks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Material Recognition</head><p>Most existing research on material recognition relies heavily on visual cues. One notable study <ref type="bibr">[14]</ref> achieved significant results by focusing on three key elements: material image datasets, contextual influences, and unique descriptors of material appearance. In addition, numerous studies have explored the utility of light field (LF) images for material identification <ref type="bibr">[15]</ref>. An alternative view of material recognition has been proposed using a combination of acceleration and images, and a fully convolutional network has been deployed for joint surface material recognition <ref type="bibr">[16]</ref>. In contrast, our project mainly utilizes non-visual data, specifically audio data. Our deep learning classifier aims to use non-visual forms of data to discriminate sidewalk materials, thus providing a new perspective in the field of material recognition.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Feature Engineering</head><p>A prevalent trend in the acoustic community is to preprocess raw audio data into spectrogram-like representations, including Mel-spectrograms and Mel-frequency cepstral coefficients (MFCC). These visual representations then serve as inputs to network models for training. Several studies have affirmed the effectiveness of CNN-based models when applied to spectrograms <ref type="bibr">[11,</ref><ref type="bibr">12]</ref>. Most state-of-the-art results have been achieved through transfer learning, employing pre-trained CNN models such as ResNet50 <ref type="bibr">[10]</ref>. Interestingly, one notable study indicated that CNNs pre-trained on regular image datasets, such as ImageNet, remain proficient at extracting critical features from audio spectrograms <ref type="bibr">[17]</ref>. Additionally, through a series of tests we have found that features derived from the mean, minimum, and maximum values of Mel-spectrogram frequency bands have discernible decision boundaries. Our approach aims to combine both forms of audio features: using deep learning to extract rich and effective features from spectrograms of audio data, while using global features derived from statistical techniques to "boost" training on the audio data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Data Acquisition</head><p>This section presents the overview of the data collection process. A modified white cane was used to capture the unique acoustic feedback of different sidewalk materials. The following subsections detail the equipment used and the methods of data collection, including both static and continuous modes, to ensure a diverse dataset. The sidewalk material audio data acquisition was performed by 23 volunteer students who embarked on an expansive data collection mission across 4 of the 5 boroughs of New York City. This audio data inventory provided us with a basis for a training dataset for our proposed classifier, which is further detailed in Section 4. We will introduce the audio data collection equipment and inventory in the following subsections 3.1 and 3.2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Audio Data Collection Equipment</head><p>A lightweight data collection system, built on a modified white cane (Fig. <ref type="figure">1</ref>), is designed to capture the acoustic feedback produced when the cane contacts sidewalk surfaces.</p><p>As a white cane interacts with different materials, distinct acoustic signals are produced by the cane tip (a metal tip is used in the system). To capture those differences, a wired microphone, clipped to a foam ring, was positioned near the cane tip to maximize the clarity of recorded sounds while minimizing ambient noise and cane vibration noise. Additionally, a phone mount was installed to record video data used as a reference for annotations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Audio Data Collection</head><p>To assemble the sidewalk material audio data inventory, a large-scale data collection effort was initiated involving student volunteers. The data were collected in two modes: static and continuous.</p><p>-Static data. Twenty-three (23) sighted volunteers collected a substantial amount of sidewalk material data. Each data record contains 30 or more seconds of audio from a single surface material category. This approximates BLV individuals stopping at a sidewalk landmark and repeatedly probing the material with a white cane to confirm that they have arrived at a location (e.g., tactile pavement strips at sidewalk intersections).</p><p>-Continuous data. Four (4) sighted volunteers collected sidewalk material data while walking along longer stretches of sidewalk to better emulate the standard walking conditions of BLV users. In this mode, each data record often contains multiple sidewalk landmarks, such as manhole covers and subway grates, providing data for multiple categories.</p><p>With the annotation tool Label Studio <ref type="bibr">[24]</ref>, each data record was manually annotated by marking the boundary points between sidewalk materials in the accompanying video.</p><p>To ensure accuracy and relevance, the collected sidewalk material audio data was classified according to the criteria specified in the New York City Street Design Manual <ref type="bibr">[9]</ref> and the Guidebook for Accessible Sidewalk and Street Intersection Information <ref type="bibr">[18]</ref>.</p><p>The audio data inventory contains several hours of static data and continuous data. The dataset covers 11 categories: concrete, asphalt, dirt, grass, metal, manhole, granite, tactile pavement, brick, subway grate, and cellar door.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Material Classification Approach</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Data Preprocessing</head><p>Data preprocessing is a crucial step in any machine learning project for transforming raw data into a form that machine learning models can learn effectively. In this project, the goal of data preprocessing is to slice the audio data into manageable, trainable pieces, and to convert these pieces into a format suitable for deep learning classifiers to learn. Fig. <ref type="figure">2</ref> provides a schematic diagram of the data preprocessing pipeline used in this study. The pipeline consists of three main components: data preparation (Fig. <ref type="figure">2</ref>, Part I), data slicing (Fig. <ref type="figure">2</ref>, Part II), and data transformation (Fig. <ref type="figure">2</ref>, Part III). These components are briefly described in the following subsections for completion even though we mostly follow existing approaches. Data preparation. In the initial stage of data preparation, as depicted in Fig. <ref type="figure">2</ref>, Part I, our methodology involves extracting audio data from the corresponding video recordings. This crucial step is followed by a meticulous process of resampling the audio data to a frequency of 44 kHz. We employ the Sinc interpolation method for this purpose, a technique renowned for its efficacy in handling nonuniform sample rates <ref type="bibr">[19]</ref>. Additionally, the audio data undergoes a rechanneling process to convert it into a stereo format. This rechanneling serves a dual purpose: firstly, it aligns the data with the common standards of auditory processing, and secondly, it facilitates a more nuanced analysis by preserving spatial characteristics of the sound, which could be crucial in distinguishing between different types of sidewalk materials. Stereo audio, with its dimensional quality, offers a richer dataset for the subsequent stages of processing and analysis <ref type="bibr">[23]</ref>.</p><p>Data slicing. 
The next step is data slicing (Fig. <ref type="figure">2</ref>, Part II), in which we decompose the original audio data into manageable segments that are suitable for training our deep learning classifier. A sliding window technique establishes a fixed-length window that moves across the data sequence with a determined step size. Each shift of this window generates a new data segment, enabling the extraction of localized features from the time series data.</p><p>Each data segment has a corresponding annotation label. For static data, the label is the same for all segments sliced from the original recording. For continuous data, each segment (1 second long in our experiments) is labeled by consulting the manual annotations of the recording and selecting the category that occupies the greatest duration within that segment.</p><p>Data transformation. The final step is data transformation (Fig. <ref type="figure">2</ref>, Part III), where we performed a Mel-spectrogram transformation on the segmented audio and acceleration data. This transformation maps raw data onto a two-dimensional grid, with the horizontal axis representing time and the vertical axis denoting frequency. The Mel-spectrogram efficiently captures both the spectral and temporal properties of the raw audio signal, which are crucial for our sidewalk material classification task. </p></div>
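<div xmlns="http://www.tei-c.org/ns/1.0"><p>The slicing and labeling rules of Section 4.1 can be illustrated with a minimal sketch. It is not the authors' code: the function names, the 50% window overlap, and the example annotation are assumptions made for illustration, since the paper states only the 1-second segment length and the greatest-duration labeling rule.</p>
<eg><![CDATA[
```python
from collections import defaultdict


def slice_windows(num_samples, window, step):
    """Return (start, end) sample indices of fixed-length sliding windows."""
    return [(s, s + window) for s in range(0, num_samples - window + 1, step)]


def majority_label(start, end, annotations):
    """Return the label that covers the most samples inside [start, end).

    annotations: list of (ann_start, ann_end, label), in samples.
    Returns None when no annotation overlaps the window.
    """
    overlap = defaultdict(int)
    for ann_start, ann_end, label in annotations:
        covered = min(end, ann_end) - max(start, ann_start)
        if covered > 0:
            overlap[label] += covered
    if not overlap:
        return None
    return max(overlap, key=overlap.get)


SAMPLE_RATE = 44_000      # the 44 kHz rate stated in the text
WINDOW = SAMPLE_RATE      # 1-second segments, as in the experiments
STEP = SAMPLE_RATE // 2   # assumed 50% overlap; the step size is not given

# Hypothetical 3-second continuous recording:
# concrete for the first 1.4 s, then a manhole cover.
annotations = [(0, 61_600, "concrete"), (61_600, 132_000, "manhole")]

windows = slice_windows(3 * SAMPLE_RATE, WINDOW, STEP)
labels = [majority_label(s, e, annotations) for s, e in windows]
```
]]></eg>
<p>For a static recording, the same sketch applies with a single annotation spanning the whole file, so every segment receives the same label.</p></div>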
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Classifier Architecture</head><p>The core of our material classification algorithm lies in the sophisticated architecture of our deep learning-based classifier (Fig. <ref type="figure">3</ref>). The proposed architecture has been carefully designed to proficiently process and analyze audio data to accurately identify different sidewalk materials. At the core of its efficacy is a dual-mechanism approach that includes two key modules: the feature extraction module and the booster module. These two modules work in tandem to meticulously extract and analyze features from the Mel-spectrogram representation of audio data. Together, these modules form the backbone of our classifier's architecture, playing an instrumental role in converting the auditory nuances, captured through our innovative data collection methods, into meaningful and practical insights. Subsequent subsections will delve into the specifics of each module, illustrating their respective functions and their synergistic operation in our classification algorithm.</p><p>Feature extraction module. Given the richness of the audio data acquired, the raw data has been transformed into Mel-spectrograms, which are essentially visual representations of the spectrum of frequencies in a sound signal as they vary with time. The utilization of Mel-spectrograms for feature extraction has been consistently corroborated by numerous studies in the field, highlighting its efficacy in capturing pertinent audio information <ref type="bibr">[17]</ref>. These Mel-spectrograms are excellent candidates for the application of deep learning-based feature extraction methods, as also expressed by <ref type="bibr">[10]</ref> in their pioneering work on audio classification.</p><p>With the Mel-spectrogram representations in hand, we harness the power of pre-trained deep neural network architectures to extract robust features. 
This approach, while novel in our specific application, builds on previous studies that have emphasized the robustness of pre-trained models in extracting meaningful features from complex data <ref type="bibr">[17]</ref>. Specifically, we employ a transfer learning approach using the ResNet model <ref type="bibr">[21,</ref><ref type="bibr">22]</ref>. Leveraging the representational power of ResNet, we extract the most important localized audio features from the Mel-spectrogram, preparing the audio data for subsequent stages of the classification pipeline.</p><p>Booster module. The booster module in our classifier architecture augments the feature set extracted from the Mel-spectrogram representations. This module processes the Mel-spectrogram along the time axis, computing the {minimum, maximum, and mean} values of each of the 64 Mel frequency bands. The resulting data has a shape of (64, 3), where the three channels correspond to the minimum, maximum, and mean values of each band; these values serve as global features. Global features, in contrast to the local or regional features extracted by the pre-trained ResNet50 model, summarize patterns and trends present in the entire audio sample. They provide a macroscopic perspective of the data, capturing broad properties that are not bound to specific time frames or localized spectral regions. The global features, combined with the regional features extracted by the ResNet50 model, create a more robust and nuanced feature set. This enriched feature set helps to improve the accuracy and reliability of the classifier.</p><p>Feature fusion. 
The most straightforward method of fusion is concatenation.</p><p>After the feature extraction module and the booster module (Fig. <ref type="figure">3</ref>, block A), two separate feature vectors are obtained: V_local with 2048 dimensions and V_global with 192 (64 &#215; 3) dimensions. To make the dimensions consistent, we designed a dimensionality alignment layer (Fig. <ref type="figure">3</ref>, block B). The regional features V_local are fed into the localized feature alignment module, which down-samples them to v_local of 128 dimensions, while the global features V_global are fed into the global feature alignment module, which resamples them to v_global of 128 dimensions as well. The two aligned vectors are then combined into a single, longer feature vector v in the local and global feature fusion layer (Fig. <ref type="figure">3</ref>, block C).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Experimental Results</head></div>
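<div xmlns="http://www.tei-c.org/ns/1.0"><p>The booster statistics and the fusion step described in Section 4.2 can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the helper names, the toy spectrogram, and the placeholder aligned vectors are hypothetical, and the 256-dimensional fused vector assumes that fusion is plain concatenation of the two 128-dimensional aligned vectors.</p>
<eg><![CDATA[
```python
def booster_features(mel):
    """Compute [min, max, mean] over time for every Mel band.

    mel: list of bands, each a non-empty list of values over time.
    Returns one triple per band, i.e. a (bands, 3) structure.
    """
    return [[min(band), max(band), sum(band) / len(band)] for band in mel]


def flatten(features):
    """Flatten the (bands, 3) structure into a single global vector."""
    return [value for triple in features for value in triple]


def fuse(v_local, v_global):
    """Fuse aligned local and global vectors by concatenation (assumed)."""
    return list(v_local) + list(v_global)


N_MELS = 64     # number of Mel bands stated in the paper
N_FRAMES = 10   # hypothetical number of time frames for this toy input

# Toy Mel-spectrogram: band b holds the values b, b+1, ..., b+9 over time.
mel = [[float(band + frame) for frame in range(N_FRAMES)]
       for band in range(N_MELS)]

stats = booster_features(mel)     # shape (64, 3)
v_global_raw = flatten(stats)     # 192 = 64 x 3 values

# Placeholders standing in for the outputs of the two alignment modules,
# which in the paper are learned feedforward layers producing 128 values.
v_local = [0.0] * 128
v_global = [1.0] * 128
v = fuse(v_local, v_global)       # 256 values under the concatenation assumption
```
]]></eg>
<p>In the actual model the alignment modules are trained layers, so only the shapes shown here, not the values, carry over.</p></div>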
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Dataset</head><p>Training a deep learning model requires a large amount of data. In this study, we selected seven (out of eleven) categories from the audio data inventory that were most commonly found on New York City sidewalks. These categories (Fig. <ref type="figure">4</ref>) are concrete, tactile pavement, subway grate, manholes, bricks, dirt, and cellar doors.</p><p>For training data selection, recordings with portions of audio missing due to a lost microphone connection or other audio signal errors were removed from the dataset. Additionally, static audio data from the most prevalent categories was removed to prevent a heavy class imbalance. Fig. <ref type="figure">5</ref> illustrates the data distribution between static and continuous training data of each category and among the seven categories. Each of these categories comprises almost 60 minutes of data, creating a robust foundation for our model training. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Implementation Details</head><p>Mel-spectrogram transformation and parameter selection. The conversion of audio data to Mel-spectrograms is a critical step in our preprocessing pipeline, enabling the effective extraction of features relevant to our classification task. In this study, the selection of key parameters was guided by empirical testing, with the chosen settings optimizing for both accuracy and model efficiency. Key parameters in this transformation include:</p><p>-Number of Mel bands. 64, chosen to capture a wide range of frequencies while maintaining computational efficiency.</p><p>-Frame length and hop length. 1024 and 256 respectively, balancing temporal resolution and frequency resolution.</p><p>Model architecture adaptations. As mentioned above, we used ResNet50 for localized feature extraction from Mel-spectrograms, in conjunction with a booster module for global feature extraction from the same Mel-spectrograms. The localized and global features are then aligned by their respective alignment modules. Throughout the network, the Rectified Linear Unit (ReLU) was employed as the activation function. We also integrated Batch Normalization layers to stabilize and accelerate training by normalizing intermediate feature maps, and applied dropout with a probability of 0.2, which randomly zeros elements of the input tensor to improve the model's generalizability and robustness.</p><p>In this study, the proposed model is implemented in PyTorch and the overall model size is 24.7M parameters. The detailed implementation of each module is as follows.</p><p>-Audio feature extractor. To adapt the ResNet50 model, we initially experimented with several variants, varying the number of frozen blocks in the architecture. 
Because transfer learning depends on balancing reused weights against fine-tuned ones, we compared these variants and found that freezing only the initial block gave the best balance, outperforming the other configurations in classification performance. We then removed the final classification layer so that the network outputs a feature vector of length 2048.</p><p>-Dimensionality alignment. The dimensionality alignment module has two main components: localized feature alignment and global feature alignment. The localized feature alignment component comprises two feedforward layers; the first down-samples the features from 2048 to 512 dimensions, and the second down-samples the intermediate features from 512 to 128 dimensions. Likewise, the global feature alignment component is a feedforward layer that resamples the global features from 192 to 128 dimensions, matching the localized features. All of these layers are followed by ReLU, batch normalization, and dropout to mitigate overfitting.</p><p>-Training procedures. We use an initial learning rate of 1 &#215; 10^-5, reduced by a factor of 10 when the validation loss plateaus, using the PyTorch learning rate scheduler <ref type="bibr">[25]</ref>. The proposed model is trained for 50 epochs with a batch size of 128. The Adam optimizer with a weight decay of 1 &#215; 10^-5 was employed for its efficiency in handling sparse gradients and its adaptive learning rate. For the loss function, we employ cross-entropy, given its effectiveness in multi-class classification problems.</p><p>-Inference time. 
To measure inference time, we ran model inference over 3000 samples three times on an "off-the-shelf" CPU (Intel i9-13900KF, 3.00 GHz, 32 GB RAM) and an "off-the-shelf" GPU (Nvidia RTX 4090, 24 GB RAM). The average inference times per 1-second audio segment were 13.3 &#177; 0.005 ms and 8.1 &#177; 0.002 ms, respectively, making the model suitable for near-real-time applications.</p><p>Evaluation metrics. Given the inherent imbalance in our dataset, overall accuracy would give a skewed picture of the model's performance. To counter this, we used macro accuracy, which computes the accuracy for each class and then averages the results. We also used the macro F1 score, which averages the per-class harmonic mean of precision and recall, giving a balanced view of the model's performance across classes.</p><p>Validation strategy. To ensure our model was not merely memorizing the idiosyncrasies of our dataset, we implemented a K-fold cross-validation approach <ref type="bibr">[20]</ref>. The entire dataset was partitioned into eight distinct folds. However, given the computational overhead and our aim to remain time-efficient, we did not validate across all folds. Instead, a subset of three randomly selected folds was used for cross-validation. This approach provided a rigorous assessment while remaining computationally feasible.</p></div>
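<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two evaluation metrics of Section 5.2 can be made concrete with a small sketch. It assumes that per-class accuracy means the fraction of a class's samples that are classified correctly (per-class recall); the paper does not spell this out. The function names and the toy labels are hypothetical.</p>
<eg><![CDATA[
```python
def per_class_counts(y_true, y_pred, labels):
    """Count true positives, false positives and false negatives per class."""
    counts = {c: {"tp": 0, "fp": 0, "fn": 0} for c in labels}
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            counts[truth]["tp"] += 1
        else:
            counts[truth]["fn"] += 1
            counts[pred]["fp"] += 1
    return counts


def macro_accuracy(y_true, y_pred, labels):
    """Average of per-class accuracy (taken here as per-class recall)."""
    counts = per_class_counts(y_true, y_pred, labels)
    scores = []
    for c in labels:
        support = counts[c]["tp"] + counts[c]["fn"]
        scores.append(counts[c]["tp"] / support if support else 0.0)
    return sum(scores) / len(scores)


def macro_f1(y_true, y_pred, labels):
    """Unweighted average of per-class F1 = 2TP / (2TP + FP + FN)."""
    counts = per_class_counts(y_true, y_pred, labels)
    scores = []
    for c in labels:
        tp, fp, fn = counts[c]["tp"], counts[c]["fp"], counts[c]["fn"]
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy imbalanced example: six concrete segments, two brick segments,
# with one brick segment misclassified as concrete.
labels = ["concrete", "brick"]
y_true = ["concrete"] * 6 + ["brick"] * 2
y_pred = ["concrete"] * 6 + ["concrete", "brick"]

plain_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_acc = macro_accuracy(y_true, y_pred, labels)
f1 = macro_f1(y_true, y_pred, labels)
```
]]></eg>
<p>In this toy case the plain accuracy is 0.875 while the macro accuracy is 0.75, which shows how averaging per class exposes errors on the minority class that overall accuracy hides.</p></div>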
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Ablation Study and Model Comparison</head><p>To assess the contribution of each component, we first tested the feasibility of using the minima, maxima, and averages of the Mel frequency bands by training a Multilayer Perceptron on these features. We then conducted an ablation study comparing a ResNet50 model with a ResNet50 plus audio booster model. Additionally, a selection of standard machine learning models (K-Nearest Neighbor, Naive Bayes, Random Forest, and Support Vector Machine), trained on flattened one-dimensional arrays of the Mel-spectrogram images derived from the audio data, served as a basis for comparison. Our findings indicate substantially greater classification accuracy and F1-score with the {ResNet50} and {ResNet50 + Audio Booster} models than with the standard machine learning models, whereas the Multilayer Perceptron trained on the average, minima, and maxima of the Mel frequency bands places in the middle of the standard machine learning models.</p><p>The {ResNet50 + Audio Booster} model shows a 2 percentage-point increase in accuracy and F1-score over the base ResNet50 model and an increase of more than 20% in accuracy and F1-score over the Multilayer Perceptron (Table <ref type="table">1</ref>).</p><p>The empirical results support our hypothesis: using features engineered from the audio data with statistical techniques to boost the training of a CNN deep learning model enhances the model's robustness and accuracy.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion and Future Work</head><p>Drawing on observations of the independent travel experiences of visually impaired individuals, our study has explored the use of auditory data from cane tips against different sidewalk materials for surface identification. We have generated a novel audio dataset and developed a dual-mechanism model for material classification, achieving a promising 80% accuracy.</p><p>Although the model was developed with a focus on accuracy and robustness rather than real-time classification, its average inference time of 13.3 &#177; 0.005 ms (CPU mode) per 1-second audio segment suggests that real-time classification, which could alert BLV individuals to materials and obstacles ahead, is worth exploring in the future. Balancing inference time and accuracy is another direction worth exploring.</p><p>In addition, we plan to explore a crowdsourcing framework, further enabling BLV users to contribute to sidewalk material data collection during their independent travels. This expansion aims not only to refine our existing model but also to actively involve the BLV community in our research process, which could improve assistive navigation technology.</p></div></body>
		</text>
</TEI>
