<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Going beyond RF: A survey on how AI-enabled multimodal beamforming will shape the NextG standard</title></titleStmt>
			<publicationStmt>
				<publisher>Elsevier</publisher>
				<date>06/01/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10487632</idno>
					<idno type="doi">10.1016/j.comnet.2023.109729</idno>
					<title level='j'>Computer Networks</title>
<idno>1389-1286</idno>
<biblScope unit="volume">228</biblScope>
<biblScope unit="issue">C</biblScope>					

					<author>Debashri Roy</author><author>Batool Salehi</author><author>Stella Banou</author><author>Subhramoy Mohanti</author><author>Guillem Reus-Muns</author><author>Mauro Belgiovine</author><author>Prashant Ganesh</author><author>Chris Dick</author><author>Kaushik Chowdhury</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Incorporating artificial intelligence and machine learning (AI/ML) methods within the 5G wireless standard promises autonomous network behavior and ultra-low-latency reconfiguration. However, the effort so far has purely focused on learning from radio frequency (RF) signals. Future standards and next-generation (nextG) networks beyond 5G will have two significant evolutions over the state-of-the-art 5G implementations: (i) a massive number of antenna elements, scaling up to hundreds-to-thousands in number, and (ii) inclusion of AI/ML in the critical path of the network reconfiguration process that can access sensor feeds from a variety of RF and non-RF sources. While the former allows unprecedented flexibility in 'beamforming', where signals combine constructively at a target receiver, the latter equips the network with enhanced situational awareness not captured by a single and isolated data modality. This survey presents a thorough analysis of the different approaches used for beamforming today, focusing on mmWave bands, and then proceeds to make a compelling case for considering non-RF sensor data from multiple modalities, such as LiDAR, radar, and GPS, for increasing beamforming directional accuracy and reducing processing time. This so-called multimodal beamforming will require deep learning based fusion techniques, which will serve to augment the current RF-only and classical signal processing methods that do not scale well for massive antenna arrays. The survey describes relevant deep learning architectures for multimodal beamforming, identifies computational challenges and the role of edge computing in this process, reviews dataset generation tools, and finally lists open research challenges that the community should tackle to realize this transformative vision of the future of beamforming.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Today's ultra-connected world demands high bandwidths, ultra-low latency, and autonomous network reconfiguration to accommodate new applications, heterogeneous devices and shared spectrum use. The number of users is also increasing at unprecedented levels, with predictions of the number of networked devices exceeding 3&#215; the global population by 2023 <ref type="bibr">[1]</ref>. To serve bandwidth-hungry application needs, the expected maximum 5G data rate is now revised to be 13&#215; faster in 2023, a significant revision from earlier estimations made only a few years ago in 2018 <ref type="bibr">[1]</ref>. Many exciting applications will leverage such high-capacity wireless networks, including relaying high-resolution three-dimensional (3D) graphical content, virtual or augmented reality (VR/AR) streams <ref type="bibr">[2]</ref>, and vehicle-to-everything (V2X) links leading towards autonomous cars <ref type="bibr">[3,</ref><ref type="bibr">4]</ref>.</p><p>* Corresponding author.</p><p>E-mail addresses: droy@ece.neu.edu (D. Roy), bsalehihikouei@ece.neu.edu (B. Salehi), sbanou@ece.neu.edu (S. Banou), smohanti@ece.neu.edu (S. Mohanti), greusmuns@ece.neu.edu (G. Reus-Muns), mbelgiovine@ece.neu.edu (M. Belgiovine), prashant.ganesh@ufl.edu (P. Ganesh), cdick@nvidia.com (C. Dick), krc@ece.neu.edu (K. Chowdhury).</p><p>A key underlying technology that is essential for all of the above is transmit beamforming, where signals from multiple antenna elements combine constructively at the receiver. Consider a multi-antenna radio, with each of these antenna elements having a specific directional radiation pattern, referred to as a beam. The beams from transmitter and receiver antennas are steered to initiate communication via beamforming <ref type="bibr">[5]</ref>. 
The communication link is then established through periodic beam sweeping and beam selection <ref type="bibr">[6]</ref>. Beamforming increases the signal strength at the receiver, which in turn raises the capacity limit, mitigates interference by avoiding undesirable signals at neighboring receivers, and combats the effect of pronounced path loss at high frequencies. Thus, beamforming is considered a critical component of all modern WiFi <ref type="bibr">[7]</ref> standards and is steadily being integrated into 5G <ref type="bibr">[8]</ref>.</p><p>Fig. <ref type="figure">1</ref>. An overview of different approaches for beamforming in an example scenario involving a mmWave vehicular network. The strategies for beamforming between the roadside base station (BS) and the vehicle are categorized into three types: (a) traditional exhaustive beam search that sweeps through all possible mmWave beam combinations between the receiver and transmitter, (b) RF-based out-of-band beamforming that uses channel state information (CSI) measurements from lower frequencies to restrict the mmWave beam search space, (c) multimodal beamforming that uses non-RF sensor modalities (image, LiDAR, GPS, radar) to predict the best possible beams from the situational information.</p><p><ref type="url">https://doi.org/10.1016/j.comnet.2023.109729</ref> Received 28 September 2021; Received in revised form 6 February 2023; Accepted 20 March 2023.</p><p>Our survey is motivated by this observation, and we strive to answer the following two questions: (i) are there fundamental limitations of traditional RF-only beamforming technology that will impact future standards evolution, and (ii) how can new data types (beyond RF) be harnessed in the future, and, given the possible information explosion by acquiring such multimodal sensor feeds, can they be analyzed through emerging machine learning methods to guide real-time beamforming decisions? 
To ensure a focused discussion, we emphasize use-cases that will shape the future standards beyond 5G (henceforth referred to as NextG), namely, beamforming scenarios that combine a very large number of antenna elements with mobility. As an indicative example of a mmWave vehicular network that we cover in this survey, Fig. <ref type="figure">1</ref> shows moving vehicles beamforming towards a static base station by combining data from RF and non-RF modalities, and then using ML to identify a smaller set of beam-pairs for further optimization, instead of an exhaustive search. RF-based input data needed for mmWave beamforming typically include but are not limited to:</p><p>&#8226; Channel state information, which includes path loss, shadowing, fading, and multi-path effects, and &#8226; Receiver feedback on the quality of the received signal, which can be leveraged to adjust the beamforming weights and beam selection for optimal performance.</p><p>The use of non-RF data as input for mmWave beamforming is geared towards improving the beamforming performance by providing additional information about the environment and system conditions, which is otherwise impossible to decipher from RF-only sources. Non-RF-based input from cameras, LiDAR, and other sensors can be used in conjunction with RF-based sources in mmWave beamforming to obtain:</p><p>&#8226; User location and movement information, which can be used to optimize the beamforming weights. &#8226; Comprehensive situational awareness, imparting full knowledge of the environment, including obstacles and potential reflections that can impact the performance of mmWave communications. &#8226; Information about device orientation and position, which can be used to improve the directionality of beamforming. 
&#8226; Information about the location of network access points, which can be used to optimize the beamforming in multi-user scenarios.</p><p>Overall, in all types of beamforming methods, the use of multimodal sensor data can lead to increased robustness by providing redundant sources of information and allowing for an enhanced ability to respond to changes in the environment in real time. We begin our discussion by highlighting the need for beamforming with a massive number of antennas and the use of AI/ML in beamforming communication systems.</p><p>Need for Beamforming in NextG Standards: The 5G New Radio (5G-NR) standard provides for the use of both sub-6 GHz and millimeter wave (mmWave) frequency bands, the latter from 24.25 GHz to 52.6 GHz <ref type="bibr">[8]</ref>. The sub-6 GHz band is already congested, and this problem worsens when a large data transfer needs to occur within short contact times, typically seen in mobile environments with few antennas <ref type="bibr">[6]</ref>. While mmWave-band transmission increases capacity using wider bandwidth (up to 2 GHz) <ref type="bibr">[9]</ref>, it also suffers from severe attenuation and penetration loss <ref type="bibr">[6]</ref>. Phased-array antennas <ref type="bibr">[10]</ref> address the attenuation problem by leveraging the highly directional gain of the antenna elements, thereby focusing radiated RF energy into beams. This capability is enhanced at higher frequencies given the dense packing of antenna elements, i.e., higher order phased arrays are possible with a proportional increase in the number of beams. While theoretically hundreds of antenna elements can be packed in a 1 cm &#215; 1 cm area for mmWave band operation, the bottleneck lies in the complexity of processing methods and the computational resources available to properly configure the beams. 
Even though it is economically feasible to create large phased arrays, scaling beyond 8-12 antennas while supporting real-time operation in small form factor wireless devices still remains an open challenge. Thus, there is a need to revisit existing approaches to beamforming to potentially scale up to thousands of antenna elements, as is being envisaged in NextG standards <ref type="bibr">[11]</ref>.</p><p>Motivation for using AI-enabled Beamforming: Traditional beamforming techniques are restricted to linear operations and have stringent requirements for accurate channel state information, and are thus susceptible to slight variations in channel conditions <ref type="bibr">[12]</ref><ref type="bibr">[13]</ref><ref type="bibr">[14]</ref>.</p><p>The integration of AI/ML in traditional beamforming scenarios can address these limitations by incorporating non-linear operations and allowing ML algorithms to efficiently model complex channel dynamics. This approach allows ML-based beamforming models to better adapt to rapidly changing channel conditions and determine optimal beamforming strategies in real time, leading to the realization of more accurate and efficient communication systems. Artificial intelligence and machine learning (AI/ML) based algorithms have been demonstrated to outperform classical approaches in wireless-centric tasks such as modulation recognition <ref type="bibr">[15]</ref>, RF fingerprinting <ref type="bibr">[16]</ref>, and rogue transmitter detection <ref type="bibr">[17]</ref>. The use of AI-enabled algorithms to solve the above-mentioned beamforming challenges in NextG networks is still in a nascent stage. The general approach so far in using ML involves RF channel estimation followed by channel equalization by using different neural network-based architectures that accept a stream of in-phase/quadrature (I/Q) samples collected by the receiver. 
We believe there is a vast untapped potential for AI-enabled techniques for extracting relevant information using different types of modalities; for example, images can be used to recognize the location of the target BS, and this can rapidly reduce the number of candidate beams to be explored. We refer to this emerging research trend in the domain of out-of-band beamforming as multimodal beamforming.</p><p>Scope of this Survey: The statistics presented in Fig. <ref type="figure">2</ref> comprise the number of articles (including patents) in Google Scholar search results that reference the terms 'beamforming in 5G' and 'beamforming in mmWave'. We believe this survey will serve the wireless research community working on beamforming in high-frequency bands, as in these frequencies, beamforming lies on the critical path to combat signal attenuation. We introduce and analyze the notion of multimodal beamforming for mmWave frequencies by recognizing the existing interest in the intersection of MIMO systems, wireless AI/ML and the NextG bandwidth needs. Furthermore, we emphasize the vehicular scenario shown in Fig. <ref type="figure">1</ref>, as it poses challenges caused by mobility that cannot be addressed in feasible time-scales through legacy methods for such large beamforming antenna arrays. As evidence of community interest in this general theme, we see a spike in citations (178 citations within 4 years) for the publicly available dataset called Raymobtime <ref type="bibr">[18]</ref>, which contains multimodal non-RF sensor data along with the corresponding RF ground truths for the purpose of mmWave beamforming in a V2X environment.</p><p>While we strive to produce a comprehensive survey on this subject matter, we skip reviewing the basics of mmWave channel models, mMIMO, and different beamforming system models and techniques, as a plethora of survey literature already covers these fundamentals and they are out of scope given our focus area. 
For example, the promise of mmWave communication in 5G is extensively reviewed in <ref type="bibr">[19]</ref>, the use of the mmWave band for vehicular communication is surveyed in <ref type="bibr">[20]</ref>, applications of mMIMO are surveyed in <ref type="bibr">[21]</ref> and <ref type="bibr">[22]</ref>, and a detailed analysis of general RF-only beamforming in indoor and outdoor mmWave communications can be found in <ref type="bibr">[23]</ref>. RF-only beamforming can be digital, analog, or a hybrid approach that combines the two. Related models and system architectures that contrast these three approaches are described in <ref type="bibr">[24]</ref> and <ref type="bibr">[25]</ref>. A flow-graph summarizing the existing surveys related to 'beamforming in 5G/NextG' systems is shown in Fig. <ref type="figure">3</ref>, and we explore each of these topics in their relevant sections later in this paper. We broadly categorize the existing surveys on that topic into three groups: beamforming techniques for 5G, hybrid beamforming, and out-of-band beamforming. The first two categories relate to the traditional beamforming process, while the last one is aligned towards out-of-the-box solutions. In this regard, the purpose of this survey is to identify the shortcomings of the traditional beamforming methods and the advantages of using non-RF modalities to facilitate the beamforming process, considering NextG communications. Ultimately, we make a case for expanding the research focus towards incorporating such non-RF sensor modalities in combination with AI/ML, as a feasible pathway for NextG networks.</p><p>Organization of this Survey: The remainder of this article is organized as follows. High-level differences between traditional and non-RF based beamforming techniques for NextG networks are described in Section 2, with a comprehensive review of published surveys in related areas of beamforming. 
The use of out-of-band RF data, single-modality non-RF data, and multimodal non-RF data for beamforming is presented in Section 3, Section 4, and Section 5, respectively. We discuss different emerging research trends using multimodal beamforming in Section 6. The conclusions are drawn in the last section.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Background on beamforming techniques and related surveys</head><p>In this section, we first analyze the state-of-the-art traditional beamforming techniques and their shortcomings. We then explore the current research on non-RF based beamforming to motivate our intent of using these methodologies to address those shortcomings of traditional RF-only beamforming.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Traditional beamforming</head><p>Existing RF-based beamforming approaches (analog, digital, hybrid) have their unique advantages, and are applicable in specific scenarios. Indeed, the 5G-NR standard supports all three types of beamforming in the time domain <ref type="bibr">[23]</ref>.</p><p>A brief comparison of these approaches is presented in Table <ref type="table">1</ref>. Digital beamforming improves the spectral efficiency (SE) of a MIMO system by simultaneously transmitting data to multiple users. However, it needs a distinct RF chain per antenna, making it less cost-effective for a higher order of antenna elements. This is one core reason why there are few off-the-shelf mmWave radios <ref type="bibr">[26]</ref> that support digital beamforming, even with a low order (1&#215;4) of antenna elements. Unlike its digital counterpart, analog beamforming creates the beam using one element per set of antennas. Once the best beam among all possible combinations of beam-pairs is identified, it is activated to mitigate the impact of high path loss in the mmWave band. This is why most of the off-the-shelf mmWave devices <ref type="bibr">[27]</ref><ref type="bibr">[28]</ref><ref type="bibr">[29]</ref> support only analog beamforming. Also, analog beamforming is considered mandatory in 5G-NR <ref type="bibr">[30]</ref> for mmWave communication.</p><p>Hybrid beamforming, on the other hand, is a combination of analog and digital beamforming. The idea of hybrid beamforming revolves around trading off the hardware cost against the time overhead involved in beam selection. Here, a subset of antennas is connected to a particular RF chain, as opposed to having individual RF chains for each antenna element in digital beamforming. Even though hybrid beamforming promises faster communication with higher order antenna elements, this is still an area of ongoing research <ref type="bibr">[31]</ref>. 
Additionally, for hybrid beamforming, the continuous beam management technique in a mobile environment involves periodic overhead <ref type="bibr">[30]</ref>. Here, beam selection is done after the measurement of reference signals (RS) received in a specific direction by manipulating the beamforming weights applied across different antenna elements.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Out-of-band beamforming</head><p>As discussed before, the analog beamforming technique incurs a time overhead for beam selection due to the exhaustive search among all possible transmitter-receiver (TX-RX) antenna elements. The decision is made based on a combination of RF measurements, such as CSI and SNR, in the desired frequency band of communication. This overhead (Fig. <ref type="figure">4</ref>) is exacerbated in the case of mobile users, where the positions of user equipments (UEs) change continuously, resulting in the exhaustive search being instantiated multiple times within a few seconds. Furthermore, the wireless channel varies 10&#215; faster at 30 GHz as opposed to 3 GHz, even for the same UE mobility rate. This results in 10&#215; more frequent beam sweeping and channel estimation <ref type="bibr">[32]</ref>. Thus, we believe that out-of-band RF measurements and the use of environmental non-RF data offer an attractive alternative towards minimizing the overhead of exhaustive search. We refer to such approaches as out-of-band beamforming techniques. A visual representation of the existing traditional and out-of-band beamforming techniques is given in Fig. <ref type="figure">5</ref>.</p></div>
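<div xmlns="http://www.tei-c.org/ns/1.0"><p>The scaling argument above can be illustrated with a short numerical sketch. It is not taken from the cited works: it assumes the textbook maximum Doppler shift f_d = v f_c / c, takes the coherence time as simply 1 / f_d (a rule of thumb), and uses a hypothetical vehicle speed and hypothetical 64-beam codebooks at each end.</p>
<eg xml:space="preserve"><![CDATA[
```python
C = 299_792_458.0  # speed of light in m/s


def max_doppler_hz(speed_mps, carrier_hz):
    """Maximum Doppler shift f_d = v * f_c / c."""
    return speed_mps * carrier_hz / C


def coherence_time_s(speed_mps, carrier_hz):
    """Rule-of-thumb coherence time, taken here as 1 / f_d (an assumption)."""
    return 1.0 / max_doppler_hz(speed_mps, carrier_hz)


def exhaustive_pairs(n_tx_beams, n_rx_beams):
    """Number of beam pairs an exhaustive TX-RX sweep has to measure."""
    return n_tx_beams * n_rx_beams


speed = 20.0  # hypothetical UE speed in m/s
t_3ghz = coherence_time_s(speed, 3e9)
t_30ghz = coherence_time_s(speed, 30e9)
ratio = t_3ghz / t_30ghz          # how much faster the channel changes at 30 GHz
pairs = exhaustive_pairs(64, 64)  # hypothetical 64-beam codebooks at TX and RX

print(f"coherence time at 3 GHz:  {t_3ghz * 1e3:.3f} ms")
print(f"coherence time at 30 GHz: {t_30ghz * 1e3:.3f} ms")
print(f"beam pairs per exhaustive sweep: {pairs}")
```
]]></eg>
<p>Under these assumptions the Doppler shift is roughly 200 Hz at 3 GHz and 2 kHz at 30 GHz, so the sweep over all beam pairs has to be repeated about ten times as often at the higher carrier, which is the overhead that out-of-band techniques aim to reduce.</p></div>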
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Beamforming surveys on mMIMO for 5G and beyond</head><p>The fundamentals of mMIMO and mmWave operation and the applications of mMIMO are comprehensively surveyed in earlier works <ref type="bibr">[21,</ref><ref type="bibr">22]</ref>. The promise of mmWave communication in 5G is extensively reviewed in <ref type="bibr">[19]</ref>, and the use of the mmWave band for vehicular communication is surveyed in <ref type="bibr">[20]</ref>.</p><p>From Fig. <ref type="figure">2</ref> we see that research interest in beamforming in the mmWave band and in 5G standards is strongly coupled, as the advancements in the former are essential to meet operational requirements for the latter. Additionally, exploration of new spectrum, assigning more bandwidth, carrier aggregation, inter-cell interference mitigation techniques, integration of mMIMO antennas, etc., are all key features that have been extensively covered in <ref type="bibr">[33]</ref>. Also, the authors state that providing accessibility, flexibility, and cloud-based services through a proper modulation and coding scheme (MCS), mmWave and device-to-device (D2D) communication is the key to realizing functional NextG networks. Authors in <ref type="bibr">[34]</ref> validate the notion that beamforming has a bigger role to play in mmWave bands, as compared to low frequency bands. Hence, there is great interest in beamforming optimization in mmWave bands for NextG standards.</p><p>For the sake of completeness, we mention the surveys that describe beamforming advancements tailored for sub-1 GHz, sub-6 GHz as well as sub-30 GHz 5G bands. Authors in <ref type="bibr">[35]</ref> focus on the frequency allocation, beamforming techniques and custom-designed integrated circuits for those specific bands. Kutty et al. capture the evolution of different beamforming techniques in the context of mmWave communication <ref type="bibr">[23]</ref>. 
They describe different radio frequency system designs and implementations for mmWave beamforming in indoor and outdoor communication scenarios. The authors describe the mmWave propagation characteristics in terms of path loss and clustered multipath structures, dominant line-of-sight (LoS) component, wideband communication, and 3D spatio-temporal modeling. They also illustrate different phased array antenna architectures to support MIMO capability in mmWave beamforming. Finally, the authors concur that using hybrid beamforming in the mmWave band for MIMO to minimize cost and power consumption has great promise.</p><p>In a survey on hybrid beamforming for mMIMO, Molisch et al. <ref type="bibr">[24]</ref> analyze the trade-offs of using instantaneous or average (second-order) CSI in hybrid beamforming. Here, the authors evaluate current research on various types of hybrid multiple-antenna transceivers and consider how the channel sparsity in the mmWave band can be leveraged for optimizing channel estimation and beam training. However, to cover broader aspects of hybrid beamforming, we review an extensive survey by Ahmed et al. <ref type="bibr">[25]</ref>, which thoroughly tracks the progress in this domain until 2017. In this paper the authors present different architectures of hybrid beamforming and the techniques for optimization of phase shifters, DAC/ADC resolutions and antenna configurations. From the system model perspective, they examine eight variations of hybrid beamforming and identify many resource management aspects, particularly in beam management and MAC protocol variants, which can impact the performance of hybrid beamforming.</p><p>However, the goal of this survey paper is to describe beamforming techniques that exploit out-of-band RF and non-RF multimodal data for NextG networks. To motivate the case for out-of-band RF and multimodal data, we first identify the limitations of traditional beamforming methods using RF-only data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.">Limitations of traditional RF-only based approaches</head><p>The traditional RF-only beamforming approach utilizes one of these two options for mmWave beamforming: (a) estimate the mmWave channel at the receiver, and send this information back to the transmitter for generating the precoding weights <ref type="bibr">[36]</ref><ref type="bibr">[37]</ref><ref type="bibr">[38]</ref><ref type="bibr">[39]</ref>, or (b) sweep through the antenna codebook elements of the transmitter and receiver <ref type="bibr">[40,</ref><ref type="bibr">41]</ref>.</p><p>However, complex methods such as compressive sensing <ref type="bibr">[42]</ref><ref type="bibr">[43]</ref><ref type="bibr">[44]</ref> or feedback of channel state information for channel estimation introduce complexity and overhead <ref type="bibr">[45]</ref>, which must be kept as low as possible. To reduce the overhead of complicated channel estimation and time-consuming beam-sweeping techniques, multiple out-of-band approaches have been explored in the recent literature. These beamforming techniques can be broadly categorized into (a) RF-based and (b) non-RF based, with their different sub-categories illustrated in Fig. <ref type="figure">5</ref>. In the next sections, we explore each of these categories in detail.</p></div>
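<div xmlns="http://www.tei-c.org/ns/1.0"><p>Option (b), the codebook sweep, can be sketched as follows to make its cost concrete. This is a minimal illustration rather than any cited implementation: it assumes half-wavelength-spaced uniform linear arrays, a single noiseless line-of-sight path, and hypothetical codebooks whose beams are spread evenly in angle. The function names <code>steering</code>, <code>codebook</code>, <code>gain</code> and <code>exhaustive_search</code> are introduced here for illustration only.</p>
<eg xml:space="preserve"><![CDATA[
```python
import cmath
import math


def steering(n_elems, angle_rad):
    """Unit-norm steering vector of a half-wavelength-spaced uniform linear array."""
    scale = 1.0 / math.sqrt(n_elems)
    return [scale * cmath.exp(1j * math.pi * k * math.sin(angle_rad))
            for k in range(n_elems)]


def codebook(n_elems, n_beams):
    """Hypothetical codebook: n_beams pointing angles spread evenly over (-90, 90) degrees."""
    angles = [math.radians(-90.0 + 180.0 * (b + 0.5) / n_beams) for b in range(n_beams)]
    return angles, [steering(n_elems, a) for a in angles]


def gain(weights, path_vec):
    """Beamforming gain |w^H a|^2 of a beam towards a single propagation path."""
    inner = sum(w.conjugate() * a for w, a in zip(weights, path_vec))
    return abs(inner) ** 2


def exhaustive_search(tx_beams, rx_beams, a_tx, a_rx):
    """Measure every TX-RX beam pair; return the best pair, its gain, and the measurement count."""
    best, best_gain, count = None, -1.0, 0
    for i, w_tx in enumerate(tx_beams):
        for j, w_rx in enumerate(rx_beams):
            g = gain(w_tx, a_tx) * gain(w_rx, a_rx)  # noiseless single-path link gain
            count += 1
            if g > best_gain:
                best, best_gain = (i, j), g
    return best, best_gain, count


tx_angles, tx_beams = codebook(8, 16)
rx_angles, rx_beams = codebook(8, 16)
a_tx = steering(8, math.radians(30.0))   # assumed angle of departure
a_rx = steering(8, math.radians(-20.0))  # assumed angle of arrival
best, best_gain, count = exhaustive_search(tx_beams, rx_beams, a_tx, a_rx)
print("best pair:", best, "measurements:", count)
```
]]></eg>
<p>Even for this small example with 16 beams at each end, the sweep needs 256 measurements to find one pair, and the count grows with the product of the two codebook sizes. The out-of-band approaches discussed next try to shrink this search space rather than speed up each measurement.</p></div>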
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Out-of-band RF based beamforming</head><p>The main idea behind leveraging out-of-band RF frequencies during beamforming is to exploit the cross-channel correlation between mmWave bands and lower frequencies (2.4 GHz, radar bands, etc.). Such cross correlation is then utilized to reduce the beam search space by establishing a mapping between the channel measurements in the mmWave bands and those at lower frequencies (see Fig. <ref type="figure">6</ref>). Although the propagation characteristics in mmWave are different from lower frequencies, recent research reveals that the main directions of arrival (DoAs) are comparable. Hence, the CSI at lower frequencies can be used to restrict the beam search space and avoid time-intensive exhaustive search, as proposed in the IEEE 802.11ad standard <ref type="bibr">[46]</ref>. This is relevant as mmWave systems are very likely to be deployed in conjunction with lower frequency systems, where mmWave access points (APs) are envisioned to be paired with lower frequency APs that provide wide area control signalling and coordination. Moreover, multi-band communication is one of the proposed solutions for providing high throughput communication systems with high reliability, thus reinforcing the interest in taking advantage of such systems in the near future <ref type="bibr">[47]</ref>. Among the RF based out-of-band beamforming techniques, the use of radar signals and of sub-6 GHz frequencies for mmWave beamforming has shown promising results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Radar</head><p>For a vehicle to infrastructure (V2I) hybrid MIMO scenario, Gonz&#225;lez-Prelcic et al. <ref type="bibr">[48]</ref> derive the channel information from the infrastructure mounted radar, which is used to obtain precoders/combiners at the vehicle and the infrastructure. The radar sensor operates at 76.5 GHz, which is close to the mmWave communication band at 65 GHz. Taking advantage of this close proximity of the operating frequencies, the computed covariance of the received signal at the radar is applied as an estimate of the covariance of the communication signal in the mmWave band. The authors then argue that the optimum combiner is the dominant eigenvector of the covariance matrix of the received signal. Similarly, in the scheme proposed by Ali et al. <ref type="bibr">[49]</ref>, a passive radar at the road side unit (RSU) taps the radar signals transmitted by vehicle mounted automotive radars. In comparison to the prior works, the authors propose a simplified RSU based radar receiver that does not require the transmitted waveform as a reference for covariance estimation in <ref type="bibr">[55]</ref>. To use the acquired radar information for mmWave beam initialization, a metric is defined that correlates the spatial information provided by the radar sensor and the spatial characteristics of the mmWave channel. This metric is then used to assess the accuracy of the angular estimation. Reus et al. <ref type="bibr">[50]</ref> leverage the PHY layer IEEE 802.11ad frames to perform both radar operations and conventional communications using the standard compliant TX/RX chain. In this case, the radar is employed to estimate the location of vehicles, which is then used to select the optimal mmWave beam. Similarly, Demirhan et al. <ref type="bibr">[56]</ref> use radar at 77 GHz and deep learning to perform beam selection and validate it on real-world data.</p></div>
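<div xmlns="http://www.tei-c.org/ns/1.0"><p>The covariance-based combiner design mentioned above can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the implementation of [48]: the snapshots are synthetic (a single path plus noise on a hypothetical 8-element half-wavelength array), the covariance is a plain sample average, and power iteration is swapped in as a simple way to obtain the dominant eigenvector.</p>
<eg xml:space="preserve"><![CDATA[
```python
import cmath
import math
import random


def steering(n, angle_rad):
    """Unit-norm steering vector of a half-wavelength-spaced uniform linear array."""
    s = 1.0 / math.sqrt(n)
    return [s * cmath.exp(1j * math.pi * k * math.sin(angle_rad)) for k in range(n)]


def sample_covariance(snapshots):
    """Sample covariance R = (1/T) * sum_t y_t y_t^H of the received snapshots."""
    n = len(snapshots[0])
    cov = [[0j] * n for _ in range(n)]
    for y in snapshots:
        for r in range(n):
            for c in range(n):
                cov[r][c] += y[r] * y[c].conjugate()
    t = len(snapshots)
    return [[v / t for v in row] for row in cov]


def dominant_eigenvector(mat, iters=200):
    """Power iteration: repeatedly apply the matrix and renormalise."""
    n = len(mat)
    v = [1 + 0j] + [0j] * (n - 1)
    for _ in range(iters):
        w = [sum(mat[r][c] * v[c] for c in range(n)) for r in range(n)]
        norm = math.sqrt(sum(abs(x) ** 2 for x in w))
        v = [x / norm for x in w]
    return v


rng = random.Random(0)                  # fixed seed so the sketch is repeatable
n_elems = 8                             # hypothetical array size
a = steering(n_elems, math.radians(25.0))  # assumed direction of the dominant path
snapshots = []
for _ in range(200):
    s = complex(rng.gauss(0, 1), rng.gauss(0, 1))  # random transmitted symbol
    snapshots.append([s * a_k + complex(rng.gauss(0, 0.1), rng.gauss(0, 0.1))
                      for a_k in a])

cov = sample_covariance(snapshots)      # stands in for the radar-derived covariance
combiner = dominant_eigenvector(cov)
alignment = abs(sum(w.conjugate() * a_k for w, a_k in zip(combiner, a))) ** 2
print(f"alignment with the true path direction: {alignment:.4f}")
```
]]></eg>
<p>With a single strong path, the dominant eigenvector of the covariance lines up with that path's steering vector, so using it as the combiner points the receive beam at the path without an explicit sweep. In the scheme described above, the covariance would come from the radar measurements instead of the synthetic snapshots used here.</p></div>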
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Sub-6 GHz</head><p>Nitsche et al. <ref type="bibr">[51]</ref> propose a blind beam steering (BBS) system which couples mmWave with legacy 2.4/5 GHz bands. Upon a beam training request, the proposed method first performs out-of-band direction inference to calculate angular profiles by broadcasting passively overheard frames at the legacy sub-6 GHz band. The LoS paths in all profiles remain nearly static, and appear as peaks at the same angle. However, the peaks resulting from reflections vary among profiles. Thus, using profile history aggregation, the alternating reflection peaks are flattened and the remaining strongest peak is estimated to correspond to the direct path. Given the profile history for each device, a threshold for the peak-to-average ratio is defined to infer the LoS path and to reject the reflected paths. The experimental results show that BBS successfully detects unobstructed direct path conditions with an accuracy of 96.5% and reduces the IEEE 802.11ad beam training overhead by 81%. Similarly, in <ref type="bibr">[52]</ref>, the candidate mmWave beams are restricted only to those beams that overlap with the dominant paths in the sub-6 GHz band. On the other hand, in <ref type="bibr">[53]</ref>, the estimated angle of arrival (AoA) on the 3 GHz channel is used to reduce the beam sweeping overhead at the 30 GHz mmWave frequency. In particular, the authors experimentally show that in 94% of LoS conditions, the identified AoA in the 3 GHz band is within &#177;10&#176; of the AoA of the mmWave signal. Hence, the authors propose using the MUltiple SIgnal Classification (MUSIC) algorithm to estimate the AoA in the sub-6 GHz band and running the exhaustive search only for angles in the corresponding direction of the mmWave band, while factoring in the error bound of &#177;10&#176;. A dual-band MAC protocol is proposed in <ref type="bibr">[54]</ref> for coordinated wireless gigabit (WiGig) WLANs (see Fig. <ref type="figure">6</ref>). In the proposed dual-band MAC protocol operation, the control frames to be shared among the APs are transmitted via the wide coverage sub-6 GHz WiFi band, while the high speed data frames are concurrently transmitted by the WiGig APs in the mmWave band. These control frames coordinate the beam training among the APs, so only one AP performs the beam training at a time, eliminating the probability of packet collisions due to simultaneous beamforming. Also, the link information, consisting of the used beam identification (ID), modulation and coding scheme (MCS) index and received power, is broadcast in the sub-6 GHz WiFi frequencies, allowing other APs to effectively exclude those beam IDs that may interfere with the existing data link from their beamforming training beams. Moreover, since the location of a UE can be roughly estimated using channel information at WiFi frequencies through a process called fingerprinting, the authors propose this WiFi fingerprinting method to estimate the best and bad beam IDs of the WiGig links. We conclude the discussion on out-of-band RF based beamforming techniques by providing a comprehensive overview of these processes in Table <ref type="table">2</ref>. Next, we explore the existing challenges in this area.</p></div>
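<div xmlns="http://www.tei-c.org/ns/1.0"><p>The search-space restriction used in [53] can be sketched as a simple angular filter. Only the &#177;10&#176; error bound comes from the text above; the 64-beam codebook, the AoA value, and the function name <code>restrict_beams</code> are hypothetical, and the sub-6 GHz AoA is assumed to have been estimated already (for instance with MUSIC, which is not implemented here).</p>
<eg xml:space="preserve"><![CDATA[
```python
def restrict_beams(beam_angles_deg, aoa_estimate_deg, error_bound_deg=10.0):
    """Indices of beams whose pointing angle lies within the error bound of the AoA estimate."""
    return [i for i, angle in enumerate(beam_angles_deg)
            if abs(angle - aoa_estimate_deg) <= error_bound_deg]


# Hypothetical mmWave codebook: 64 beams spread evenly over (-90, 90) degrees.
beam_angles = [-90.0 + 180.0 * (b + 0.5) / 64 for b in range(64)]

# Assumed AoA estimated in the sub-6 GHz band.
candidates = restrict_beams(beam_angles, aoa_estimate_deg=22.0)
reduction = 1.0 - len(candidates) / len(beam_angles)

print("beams left to sweep:", len(candidates), "of", len(beam_angles))
print(f"sweep reduction: {reduction:.1%}")
```
]]></eg>
<p>For this hypothetical codebook the filter retains 7 of the 64 beams, so the mmWave sweep only has to cover the window around the out-of-band estimate. The price is that the true mmWave AoA must fall inside the window, which the text reports for 94% of the LoS cases measured in [53].</p></div>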
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Challenges</head><p>While out-of-band RF assisted beamforming presents promising improvements in beam initialization speed, it has some associated limitations, which we itemize as follows:</p><p>&#8226; The out-of-band RF channel measurements need to be acquired constantly in order to estimate the channel at the mmWave band. Hence, an integrated protocol for multi-band coexistence is required, which can be challenging in dense networks.</p><p>&#8226; An optimal mapping is required between mmWave and out-of-band channel measurements. The mmWave band has unique propagation characteristics that preserve sparsity. In particular, the number of reflections is limited in the mmWave band, while in lower frequencies, multiple reflections are normally observed. As a result, translating the DoA between bands that are located far apart from each other can be challenging and error prone.</p><p>&#8226; RF-based out-of-band beamforming requires simultaneous multiband channel measurements that increase the complexity of mmWave transceivers. Although future mmWave devices will likely support lower frequencies as well, this feature is not widely deployed in commercial devices yet.</p><p>&#8226; The existing out-of-band methods do not yet support simultaneous beamforming at both the transmitter and receiver sides, which is required for effective directional transmissions.</p><p>After motivating the utility of leveraging various non-RF sensor data for RF tasks, we next map these benefits to the use-case of beamforming in mmWave bands, where a larger number of antenna elements (i.e., mMIMO systems) is involved. Additionally, the challenges of using RF-based out-of-band beamforming, described in Section 3.3, suggest that the research community needs to explore solutions beyond RF-only approaches (be they traditional or RF-based out-of-band). We explore this direction in the next sections.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Out-of-band non-RF based beamforming</head><p>In mmWave beamforming, the location of the TX-RX pair and potential obstacles are the key factors that directly affect the optimal beam configuration. Out-of-band RF aided beamforming methods estimate the approximate location of TX-RX pair given the AoA in other RF bands, which helps to narrow down the beam search space. Interestingly, the situational state of the environment can also be acquired through data obtained from other sensor devices <ref type="bibr">[57]</ref>, without occupying limited sub-6GHz RF resources. This motivates the use of non-RF sensor data to speed up the beam initialization process in mmWave band <ref type="bibr">[58]</ref>. Unlike the previously discussed out-of-band RF methods, the non-RF based beamforming does not require simultaneous multiband channel measurements and optimal mapping between mmWave and CSI collected from another band. It is also capable of generating a mutually acceptable decision for both transmitter and receiver.</p><p>Typically, non-RF based beamforming utilizes inputs from a number of different sensors such as, GPS (Global Positioning System), camera, and LiDAR (Light Detection and Ranging) etc. This is further aided by the fact that with the wide proliferation of IoT, multiple sensors are embedded in the environment, thus making it feasible to obtain situational information from non-RF sources. As an example, consider the automotive sector with vehicles that have advanced driver-assistance systems (ADAS). Fig. <ref type="figure">7</ref> depicts the increase in the market revenue of the various sensors enabling ADAS, as reported by Yole D&#233;velopment <ref type="bibr">[59]</ref>. It is expected that the global market for GPS, radar, cameras and LiDARs will reach $159.6 billion in 2025. 
With the easy availability of such a multitude of sensors, we need to incorporate methods that leverage the heterogeneous sensor data to extract a rich understanding of the environment.</p><p>In LoS scenarios, even though the optimal beam configuration can be estimated using the location of the transmitter and receiver, it is not trivial to employ such approaches when encountering irregular radiation patterns, e.g., when devices have multiple side lobes. The problem becomes more challenging when estimating the strongest reflection from obstacles in non-line-of-sight (NLoS) conditions. Hence, a proactive method is required to learn the channel characteristics associated with the observed non-RF sensor modalities on a case-by-case basis. Both deterministic and AI-enabled methods have been proposed in the literature that consider either single sensor modalities or multiple modalities through deep learning. We next go through these state-of-the-art methods, covering different input sensor acquisition techniques, available datasets, and exploitation methods of single and multiple modalities, and identify future research trends.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Input data acquisition and processing</head><p>Choosing the right subset of sensor modalities to accurately capture the environment for detecting potential LoS paths and reflections affecting mmWave frequencies is crucial. The most popular input sensor modalities for mmWave beamforming are presented below and their features are summarized in Table <ref type="table">3</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.1.">GPS</head><p>This is a popular and widely available satellite-based localization system that generates readings in the decimal degrees (DD) format, where the separation between each line of latitude or longitude (representing a 1&#176; difference) is expressed as a float with 5-digit precision. Each measurement results in two numbers that together pinpoint the location on the earth's surface. While outdoor localization accuracy can reach 2 m, it decreases drastically in indoor environments because of GPS signal attenuation through walls and structures. Note that GPS sensor data here refers to the latitude and longitude values generated by the GPS receiver, not the RF signals transmitted from the GPS satellites.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.2.">Image</head><p>Cameras can be used to capture still RGB images of the environment and are commonly used in different applications such as cell phones and surveillance monitoring. Although images allow comprehensive environmental assessment, they are impacted by low-light conditions and obstructions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.3.">LiDAR</head><p>The LiDAR (Light Detection And Ranging) sensor generates a 3-D representation of the environment by emitting pulsed laser beams. The distance of each individual object from the origin (i.e., the sensor location) is then calculated based on reflection times <ref type="bibr">[60]</ref>. The raw data in LiDAR is defined as the unprocessed data collected by the LiDAR sensor, usually represented as a point cloud representing the position and reflectivity of objects in the environment, along with the time-of-flight information for each measurement. Each point in the cloud represents the distance to the object from where the laser beam reflected back to the LiDAR sensor, along with additional information such as intensity, reflectivity, and wavelength. The raw data from LiDAR sensors can be used for a variety of applications, including mapping, obstacle detection, and autonomous navigation. This additional information can be used in conjunction with user GPS location and RF data for more efficient mmWave beam selection during beamforming <ref type="bibr">[61]</ref>. LiDAR can achieve much higher accuracy than image only, but it is expensive and sensitive to weather conditions. Even if a judicious choice is made on the sensor modality, simply using raw data might fail to provide an accurate prediction. In particular, preprocessing of the raw data can improve the system performance many-folds as we describe later in this paper. Raw observations are not useful unless the role of each device that senses the data is specified, i.e. is the data captured from a transmitter, receiver, or a potential obstacle? Each sensor type has its advantages and limitations. For example, GPS-equipped objects can be utilized to track location, but these sensors cannot capture the presence of obstacles. 
LiDAR can collect the 3D state of the environment by detecting reflections of emitted pulsed laser light from surrounding obstacles, but it fails to track the location of the target transceivers because LiDAR operates in the visible or near-infrared spectrum and does not detect RF signals. Thus, GPS data can be merged with raw LiDAR data in the preprocessing step to mark the coordinates of the target receiver in the collected point clouds. Such data-level aggregation methods are among the commonly used approaches to refine the raw data to be more informative.</p><p>Similarly, the preprocessing steps are also beneficial for reducing the data complexity by either discarding the irrelevant information or reducing the dimensionality of the input data. As an example, using a low-pass filter on camera images can reduce the dimensionality of the image by averaging the adjacent pixels while preserving its integrity. ML-based solutions only accept data arranged in a fixed size, while for some modalities such as LiDAR the number of points in the point cloud varies on a case-by-case basis, depending on the number of objects present. Preprocessing can account for this issue by transforming the data to a constrained representation without degrading the information content. It is therefore important to design proper preprocessing steps before using the data for inference. It should be noted that the preprocessing pipeline of each modality must be designed based on the unique properties of each sensor type.</p></div>
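<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of the image example above, the following sketch reduces the dimensionality of a camera image by averaging adjacent pixels in non-overlapping blocks. Block averaging is only one possible low-pass filter, chosen here for brevity; the function name block_average and the downsampling factor are hypothetical.</p>

```python
import numpy as np

def block_average(image, factor):
    """Downsample an image by averaging non-overlapping factor x factor
    blocks. Works for grayscale (H, W) and color (H, W, C) arrays."""
    img = np.asarray(image, dtype=float)
    # Crop so that height and width are multiples of the factor.
    h = img.shape[0] // factor * factor
    w = img.shape[1] // factor * factor
    img = img[:h, :w]
    # Split each spatial axis into (blocks, factor) and average per block.
    blocks = img.reshape(h // factor, factor, w // factor, factor,
                         *img.shape[2:])
    return blocks.mean(axis=(1, 3))
```

<p>A factor of 2 reduces the number of pixels by a factor of four, and the output size is fixed once the input resolution and the factor are fixed.</p></div>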
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Beamforming using single non-RF modalities</head><p>Next, we present detailed descriptions of different studies and algorithms that use a single non-RF sensor modality. These use either GPS coordinates, camera images or LiDAR point clouds to accelerate the beam selection and, by extension, the beamforming process.</p><p>GPS Coordinates: Knowledge of the location of the target receiver has been used earlier to address the challenges of cell discovery <ref type="bibr">[62]</ref>. The same idea can be used to speed up beam initialization in the mmWave band, which relies on directional transmission. The authors in <ref type="bibr">[63,</ref><ref type="bibr">64]</ref> use the GPS-based position of the receiver to estimate the optimum future beam directions. In particular, the proposed algorithms predict the future locations by tracking the mobility profile of the receiver and geometrical features of the environment. However, it should be noted that this approach only works when the LoS path is available. Alternatively, Wang et al. propose a framework for mmWave beam prediction that exploits situational awareness <ref type="bibr">[65]</ref>. They use the locations of all the vehicles in the same scene as features to extend the solution to NLoS scenarios. The simulation scenario consists of small cars and trucks, any of which can be the target receiver. The authors argue that the vehicle dynamics have the main effect on the optimum beam configuration, since the roadside buildings and infrastructure are stationary, and pedestrians are small in size. Hence, a feature vector v = [r, t_1, t_2, c_1, c_2] is generated, where r denotes the location of the RSU, and t and c represent trucks and cars, respectively. The subscripts 1 and 2 denote the lane index where the vehicle is located, and each vector t_i and c_i, i = 1, 2, includes the locations of the vehicles of the corresponding type in ascending order for lane i. 
Since the ML algorithms accept a fixed size input, the number of trucks/cars on each lane is constrained, and the vehicles which are far away are eliminated. This feature vector is then used to predict the received power for any beam in the codebook, by leveraging ML.</p><p>Similarly, Va et al. <ref type="bibr">[66]</ref> propose an algorithm where the GPS location of all the vehicles on the road, including the target receiver, is used as input to a statistical analysis-based algorithm, to infer the best beam configuration. The proposed algorithm uses the power loss probability as a metric to estimate the misalignment probability that might occur when non-optimal beams are selected. Power loss probability defines the probability that the received signal power at the receiver will be less than a certain threshold value. Beamforming solutions are designed to maximize the received signal power at the receiver while maintaining the required signal quality and reliability, which can be done through optimization algorithms that minimize the power loss probability subject to constraints on the transmission power and received noise power. By reducing the power loss probability through proper beam selection and beam width selection, the beamforming solution ensures that the transmitted signal will be received with high quality. In this case, a subset of the beam configurations is suggested by the authors to minimize this misalignment probability.</p><p>In order to speed up the beam initialization, an online learning algorithm is proposed in <ref type="bibr">[67]</ref>, which exploits the coarse user location information in vehicular systems. In particular, the problem is modeled as a contextual multi armed bandit (MAB) problem and a lightweight context-aware online learning algorithm, namely fast machine learning (FML) is used to learn from and adapt to the environment. 
The proposed FML algorithm explores different beams over time while accounting for contextual information (i.e., vehicles' direction of arrival) and adapts the future beams accordingly, in order to account for system dynamics such as the appearance of blockages and changes in traffic patterns. In comparison, in <ref type="bibr">[68]</ref> Aviles et al. first generate a database that captures the propagation characteristics at 28 GHz and the position of the UE. Then, given the location of a UE, a hierarchical alignment scheme is proposed, which consults this database and incorporates the position of the UE for faster beam alignment.</p><p>Camera Images and Light Sensors: Cameras are one of the sensing modalities that capture the situational state of the environment with high resolution. With the recent progress in computer vision and deep learning, powerful algorithms are now available that can process images in real time for beamforming. A baseline for the ViWi-BT dataset is presented in <ref type="bibr">[69]</ref> based on gated recurrent units (GRUs), using only the sequence of beam indices and not the images. GRUs are a type of recurrent neural network (RNN) used to capture long-term dependencies in time-series data <ref type="bibr">[70]</ref>, and can be used to select the beam that has the highest SNR during beamforming. Alrabeiah et al. argue that beam prediction accuracy is expected to improve significantly by leveraging both wireless and visual data <ref type="bibr">[69]</ref>. This strategy is proven to be efficient by the authors in <ref type="bibr">[71]</ref>, where they use LiDAR along with RF data to predict future mmWave beam selection decisions with high accuracy. In <ref type="bibr">[72]</ref>, Tian et al. propose a framework to predict future beam indices from previously observed beam indices and images. The proposed approach consists of three steps as follows. 
The first step consists of feature extraction, where ResNet, ResNext and 3D ResNext modules, each proven to have powerful feature-representation abilities, are used to capture 2D and 3D spatiotemporal visual and motion features from the input time series data in the form of images. In the second step, these features are merged through the use of a Feature Fusion Module (FFM) that comprises two long short-term memory (LSTM) <ref type="bibr">[73]</ref> networks for aggregating the features, followed by a crossgating block to make full use of related semantic information between these two features by multiplication and summation. To validate their approach, the authors use ViWi-BT dataset where the first eight pairs of images are used as time series input data to predict the next five future beams.</p><p>Similarly, in <ref type="bibr">[74]</ref>, Xu et al. propose a scheme where the images captured from different perspectives are used to construct a 3D scene that resembles the point cloud data collected by 3D sensors like LiDAR. Then, a CNN with 3D input is designed to predict the future beams to be selected. Results reveal that the proposed 3D scene based beam selection outperforms LiDAR in accuracy, without imposing the huge cost of LiDAR sensor. While the majority of current literature uses synthetic datasets, the authors in <ref type="bibr">[75]</ref> deploy a testbed using National Instruments radio at 60 GHz <ref type="bibr">[27]</ref> and camera generated images to predict the best beam configuration. Their proposed method consists of two main steps, namely detection and prediction. In the first step, the transmitter and receiver are detected in the image in the form of a bitmap. This step is important to detect the features which are relevant to the task and discard the irrelevant ones, such as static walls, etc. Finally, the bitmaps are fed to another CNN to predict the optimum beam configuration given the historical data from collected dataset. 
The LiSteer system proposed in <ref type="bibr">[76]</ref> steers mmWave beams to mobile devices by re-purposing indicator light emitting diodes (LEDs) on wireless APs to passively track the direction to the AP using light intensity measurements with off-the-shelf light sensors. The proposed approach exploits the pseudo-optical properties of the mmWave signal, i.e., dominant LoS propagation, to approximate the AP's AoA in the mmWave band. Hence, the approach requires the APs to be equipped with LEDs that are situated close to the mmWave band antenna. The authors propose using an array of light sensors to combat the incoherence of light-AoA estimation, which also allows steering beams for both 2D and 3D beamforming codebooks. The experimental results demonstrate that LiSteer achieves direction estimates within 2.5&#176; of ground truth on average with beam steering accuracy of more than 97% in tracking mode, without incurring any client beam training or feedback overhead.</p><p>LiDAR Point Clouds: Woodford et al. <ref type="bibr">[77]</ref> use LiDAR to build a 3D map of the surrounding physical environment and capture the characteristics of the physical materials. The proposed approach uses a customized ray-tracing algorithm that can identify real RF paths in a 3D mesh generated by LiDAR sensors, and reject false reflection paths caused by reconstruction noise. The output of this phase is a pre-computed look-up table used to select the best beams for all mmWave links in the environment. It should be noted that the LiDAR sensors are not required during the ordinary operation of the system and are only used in advance to generate the look-up table. The proposed approach can recompute the complete look-up table for the environment within 15 minutes. The authors validate their approach using an Azure Kinect LiDAR camera <ref type="bibr">[78]</ref> and a commercial 802.11ad radio <ref type="bibr">[79]</ref>, yielding a 66% reduction in latency and a 50% increase in throughput. 
Similarly, Jian et al. <ref type="bibr">[80]</ref> use LiDAR data to predict future beams in a V2I scenario, without requiring any knowledge about the previous optimal beams. A sequence of captured LiDAR sensing images is used as input to a recurrent neural network (RNN) to predict the top-K promising beams. The proposed approach is validated on the publicly available DeepSense 6G dataset <ref type="bibr">[81]</ref>, and the results indicate slightly lower accuracy than a baseline model that has perfect knowledge of the previous optimal beams.</p><p>A concise overview of different state-of-the-art beamforming methods that use single-sensor data is presented in Table <ref type="table">4</ref>.</p></div>
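<div xmlns="http://www.tei-c.org/ns/1.0"><p>The fixed-size location feature vector used for situational-awareness-based beam prediction earlier in this section (RSU location followed by per-lane truck and car locations) can be sketched as follows. This is only an illustration under stated assumptions: positions are scalar coordinates along the road, the caps max_trucks and max_cars and the padding value are hypothetical, and "far away" vehicles are taken to be those farthest from the RSU.</p>

```python
import numpy as np

def lane_vector(positions, rsu_pos, max_n, pad_value=0.0):
    """Keep the max_n vehicles nearest to the RSU, sort their positions
    in ascending order, and pad to a fixed length."""
    nearest = sorted(positions, key=lambda p: abs(p - rsu_pos))[:max_n]
    ordered = sorted(float(p) for p in nearest)
    return ordered + [pad_value] * (max_n - len(ordered))

def build_feature_vector(rsu_pos, trucks, cars,
                         max_trucks=2, max_cars=3, pad_value=0.0):
    """Assemble [r, t_1, t_2, c_1, c_2] as a flat fixed-size array.
    trucks and cars map a lane index (1 or 2) to a list of positions."""
    v = [float(rsu_pos)]
    for lane in (1, 2):
        v += lane_vector(trucks.get(lane, []), rsu_pos, max_trucks, pad_value)
    for lane in (1, 2):
        v += lane_vector(cars.get(lane, []), rsu_pos, max_cars, pad_value)
    return np.array(v, dtype=float)
```

<p>Regardless of how many vehicles are present, the output length is 1 + 2 * max_trucks + 2 * max_cars, which is what allows the vector to be used as input to a model that expects a fixed input size.</p></div>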
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Available datasets with single sensor</head><p>All of the deep learning based methods discussed above require data collection for training and evaluation. Hence, we next discuss the features of the available public datasets specific to beamforming using non-RF sensor modalities. These datasets enable the research community to explore different aspects of non-RF beamforming without each group having to collect its own data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.1.">ViWi</head><p>Alrabeiah et al. proposed a scalable synthetic framework called Vision-Wireless (ViWi) <ref type="bibr">[82]</ref>. The scenario of interest is a V2I setting in 28 GHz mmWave band. The first release of this dataset contains four scenarios with different camera distributions (co-located and distributed) and views (blocked and direct). The channel characteristics and images are generated using the Remcom Wireless Insite ray-tracing <ref type="bibr">[83]</ref> and Blender <ref type="bibr">[84]</ref> software, respectively. For each scenario, a set of images and raw wireless data (signal departure/arrival angles, path gains, and channel impulse responses) are recorded. An extended version of this dataset is named ViWi vision-aided mmWave beam tracking (ViWi-BT) <ref type="bibr">[69]</ref>, which contains 13 pairs of consecutive beam indices and corresponding street view images. This dataset contains a training set with 281,100 samples, a validation set with 120,468 samples, and a test set with 10,000 samples.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.2.">Image-based</head><p>This dataset is obtained by Salehi et al. in <ref type="bibr">[75]</ref> from a testbed composed of two Sibeam mmWave <ref type="bibr">[27]</ref> antenna arrays mounted on sliders enabling horizontal movement. Using the mmWave transceivers from National Instruments, the mutual channel is measured for 13 beam directions at the transmitter and receiver (169 beam configurations overall). Two GoPro cameras observe the movements in the environment and are synchronized with the mmWave channel measurements. In the designed scheme, an obstacle blocks the LoS path between the transmitter and receiver, and the experiment is repeated for two types of obstacles, wood and cardboard, causing 30 dB and 4 dB attenuation while blocking the LoS path, respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Multimodal beamforming</head><p>Each of the previously mentioned sensor modalities captures different aspects of the environment. Using more than one sensor modality and intelligently fusing the multimodal data can result in a more comprehensive understanding of the environment and, consequently, in more robust decisions.</p><p>Benefits of Fusion: Fusing multimodal data has multiple advantages over relying on single modalities, as explained below:</p><p>&#8226; Enhanced Data Representation: For the situational information to be effective during beamforming, it is crucial to differentiate between the transmitter, receiver and obstacles. However, some sensor modalities cannot provide such information by relying on raw data alone. In this case, the data from different modalities can be fused together to improve the data representation. For instance, it is not trivial to locate the receiver within a LiDAR point cloud; in this case, the GPS coordinates can be used to mark the target receiver.</p><p>&#8226; Compensate for the Missing Information: Sometimes the captured data from each sensing modality reflects an aspect of the environment, yet none can provide a complete understanding on its own. For instance, the dimensionality of objects is not reflected in GPS, and the accurate Cartesian coordinates of the target receiver cannot be acquired using LiDAR or image sensors.</p><p>&#8226; Improved Accuracy: Using more than one modality enables a fine-grained understanding of the environment, which results in more accurate predictions. Hence, fusion reinforces the prediction accuracy by gathering the information from different sensors to make the final decision. In this case, the fusion algorithms can automatically adjust the weights of each modality towards the optimum performance. 
</p><p>&#8226; Robustness to Errors: Collecting data using sensor devices comes with associated considerations, including the inherent error. Here, the accuracy of measurement depends on operating under the nominal conditions that the device is designed for. For instance, the accuracy of a LiDAR sensor degrades with sunlight reflections, while sunlight does not affect the GPS data <ref type="bibr">[85]</ref>. Hence, fusion increases prediction robustness in the case of inaccurate or unreliable data.</p><p>&#8226; Availability: In some applications, the sensors do not have to be co-located. Hence, secondary control channels are required to enable connectivity between the different sensors and the computing unit. However, this control channel is also subject to saturation and loss. Using more than one modality with fusion helps the system to be robust to such scenarios, and it guarantees that a prediction can be made when at least one modality is available during inference.</p><p>Below, we give some examples of state-of-the-art multimodal beamforming with different fusion approaches on multiple sensors. The following subsections are named after the sensor modalities that are used as input for beam prediction in the state-of-the-art. Data from each type of sensor is processed in the same way as described in Section 4.1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">GPS and LiDAR</head><p>Consider a typical V2I setting, where a static BS wants to establish communication with a target vehicle-mounted receiver. The vehicle is assumed to be equipped with GPS and LiDAR sensors that enable the vehicle to acquire its location in 3D space and generate a map of the environment, including the location and shape of objects in the area, which can be used to detect nearby obstacles. The LiDAR point cloud data can be formalized by the set P = {p_1, p_2, ..., p_n} consisting of 3D points. Each point p_i = (x_i, y_i, z_i) represents the position of the object in 3D space, where x_i, y_i, and z_i are the coordinates of the point in the X, Y, and Z dimensions, respectively. This spatiotemporal data together with the GPS coordinates of the user can be used to improve the accuracy of mmWave beamforming by allowing the system to avoid beams that would be blocked by obstacles, and instead select beams that will provide the best signal quality. Also, the precise location information obtained by LiDAR and GPS data fusion can be used to improve the precision of beam steering, ensuring that the beams are directed precisely at the intended target, thereby improving performance and ensuring higher data rates in mmWave communication systems.</p><p>In this scenario, Klautau et al. propose a distributed architecture to reduce the mmWave beam selection overhead <ref type="bibr">[86]</ref>. Here, the BS constantly broadcasts its position via a low-band control channel. The situational state of the environment is then collected using LiDAR, situated on the vehicle and is aggregated by BS location in the preprocessing pipeline, where a histogram is generated at the beginning to quantize the space. The LiDAR point clouds then lie in the corresponding bin of the histogram, and the location of BS and receiver is also marked with unique indicators. 
Using the proposed preprocessing step, the measured point clouds are mapped to a grid of fixed size. Note that the number of points in the raw data varies depending on the number of objects present during the measurement. This refined data representation is then fed as input to a deep CNN to estimate a set of the K most likely candidate beam pairs. The selected beam pairs are then sent to the BS, and beam training is performed over the suggested subset to obtain the optimum beam configuration. Similarly, Dias et al. consider a V2I setting and compare the performance of the previously described distributed scheme with two centralized schemes: (i) using a single LiDAR located at the BS, and (ii) fusing LiDAR data from neighboring vehicles at the BS <ref type="bibr">[87]</ref>. The LiDAR data is then used for both LoS detection and beam selection for the three competing scenarios. The experimental results in this work show that in LoS, the distributed and centralized methods perform closely, while the LiDAR at the BS results in lower top-K beam prediction accuracy because of the limited range of LiDAR. On the other hand, in NLoS scenarios, the distributed scheme outperforms the centralized method, and both are better than LiDAR at the BS.</p></div>
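<div xmlns="http://www.tei-c.org/ns/1.0"><p>The histogram-based preprocessing described above, in which a variable-size point cloud is quantized into a fixed-size representation and the BS and receiver are marked with unique indicators, can be sketched as follows. The bounds, bin counts and the marker values (1 for obstacles, -1 for the BS, -2 for the receiver) are assumptions made for illustration and are not necessarily those used in the cited work.</p>

```python
import numpy as np

def quantize_point_cloud(points, rx_xyz, bs_xyz, lower, upper, bins):
    """Map an (n, 3) LiDAR point cloud to a fixed-size 3D occupancy grid
    and mark the BS and receiver cells with unique indicators."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    bins = np.asarray(bins, dtype=int)
    grid = np.zeros(tuple(int(b) for b in bins), dtype=np.int8)

    def to_index(xyz):
        # Fractional position inside the bounding box, then bin index.
        frac = (np.asarray(xyz, dtype=float) - lower) / (upper - lower)
        return np.clip(np.floor(frac * bins).astype(int), 0, bins - 1)

    pts = np.asarray(points, dtype=float).reshape(-1, 3)
    # Discard points that fall outside the quantized region.
    inside = np.all(np.logical_and(pts >= lower, upper > pts), axis=1)
    idx = to_index(pts[inside])
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1    # obstacle (assumed marker)
    grid[tuple(to_index(bs_xyz))] = -1           # BS (assumed marker)
    grid[tuple(to_index(rx_xyz))] = -2           # receiver (assumed marker)
    return grid
```

<p>Because the grid shape depends only on the chosen bin counts, point clouds with different numbers of points all produce inputs of the same size for a CNN.</p></div>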
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Camera with sub-6 GHz</head><p>The possibility of vision-aided wireless communications is evaluated in <ref type="bibr">[88]</ref>, where a camera at the BS observes the movements in the environment, and snapshots of the environment are paired with sub-6 GHz channels to help reduce the beam selection and blockage prediction overhead. The proposed method models beam prediction from images as an image classification task. Hence, each user location in the scene is mapped to a class representing the associated beamforming codebook. However, the image input alone may be insufficient for blockage detection, since the instances of 'no user' and 'blocked user' are visually the same. Hence, in order to identify blocked users, the images are fused with sub-6 GHz channels.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">GPS and camera</head><p>Reus-Muns et al. <ref type="bibr">[61]</ref> propose a deep learning based fusion framework for beam selection that leverages GPS location information along with visual data, yielding a low-overhead, fast beamforming process under mobility scenarios. An object detection algorithm is employed to filter out background clutter and identify different types of vehicles. Moreover, the authors analyze the impact of the LoS/NLoS conditions on beam selection accuracy and propose a method to predict mmWave link blockage. Similarly, Charan et al. study multimodal beamforming in a drone communication system where the sensory data includes images of the drone captured at the BS, GPS position, height of flight (distance to ground), and projected distance (horizontal distance to BS) <ref type="bibr">[89]</ref>. The evaluations on a real-world dataset show up to 86.32% top-1 accuracy with 64 pre-defined beams. In <ref type="bibr">[90]</ref>, a deep neural network predicts the optimal beam using positional and visual (camera) data with 75% top-1 beam prediction accuracy.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.">GPS, camera and LiDAR</head><p>The sensor suite in a multimodal beamforming system can be extended to include a multitude of modalities. In fact, the more comprehensive the situational information is, i.e., the more sensors are used, the more robust the prediction will be. Using the Raymobtime dataset, Salehi et al. <ref type="bibr">[41]</ref> consider a V2I scenario in which the vehicle (receiver) is equipped with GPS and LiDAR sensors and a roadside camera also tracks the movements in front of the base station. The observations indicate that fusion of all three modalities improves the beam selection accuracy by 9.9-43.9% with respect to least and most significant single modalities, GPS and LiDAR, respectively. Moreover, the evaluations reveal a 95% improvement in beam selection speed over the classical RF-only beam sweeping method. Later, the same setting with three sensor modalities is studied in a federated learning setting in <ref type="bibr">[91]</ref>, motivated by the fact that in a centralized learning method the entire multimodal data must be transmitted to a central unit for training, which is costly. A concise overview of different state-of-the-art beamforming methods that use multimodal sensing data is presented in Table <ref type="table">5</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5.">Available datasets with multimodal sensors</head><p>To validate the deep learning based state-of-the-art methods for multimodal beamforming, we next discuss the data collection processes and published datasets for the discussed methodologies.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5.1.">Raymobtime</head><p>The Raymobtime multimodal dataset <ref type="bibr">[18]</ref> captures a virtual V2X deployment with high fidelity in the urban canyon region of Rosslyn, Virginia, for different traffic patterns. A static roadside BS is placed at a height of 4 meters, alongside moving buses, cars, and trucks. The traffic is generated using the Simulator for Urban MObility (SUMO) software <ref type="bibr">[95]</ref>, which allows flexibility in changing the vehicular movement patterns. The image and LiDAR sensor data are generated by Blender and Blender Sensor Simulation (BlenSor) <ref type="bibr">[96]</ref> software, respectively. For each so-called scene, the framework designates one active receiver out of three possible vehicle types, i.e., car, bus, and truck. A Python orchestrator invokes each software tool for each scene and collects synchronized samples of LiDAR point clouds, GPS coordinates, and images from the camera mounted at the BS. The combined channel quality of the different beam pairs is also generated using the Wireless InSite ray-tracing software <ref type="bibr">[83]</ref>. The number of codebook elements for the BS and the receiver is 32 and 8, respectively, leading to 256 beam configurations overall.</p></div>
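<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a minimal sketch of the codebook arithmetic above, the 32 BS beams and 8 receiver beams can be flattened into 256 class labels for a beam selection classifier. The row-major ordering and the helper names (pair_to_index, index_to_pair, best_pair) are assumptions made for illustration and are not taken from the Raymobtime tooling.</p>
<ab><![CDATA[
```python
# Codebook sizes stated in the text; everything else here is illustrative.
N_BS_BEAMS = 32
N_RX_BEAMS = 8
N_PAIRS = N_BS_BEAMS * N_RX_BEAMS  # 256 beam configurations


def pair_to_index(bs_beam, rx_beam):
    """Flatten a (BS beam, receiver beam) pair into one label (row-major, assumed)."""
    if bs_beam not in range(N_BS_BEAMS) or rx_beam not in range(N_RX_BEAMS):
        raise ValueError("beam index out of range")
    return bs_beam * N_RX_BEAMS + rx_beam


def index_to_pair(index):
    """Recover the (BS beam, receiver beam) pair from a flattened label."""
    if index not in range(N_PAIRS):
        raise ValueError("label out of range")
    return divmod(index, N_RX_BEAMS)


def best_pair(quality):
    """Return the (BS beam, receiver beam) pair with the highest channel quality.

    quality is a 32 x 8 nested list indexed as quality[bs_beam][rx_beam].
    """
    flat = [value for row in quality for value in row]
    if len(flat) != N_PAIRS:
        raise ValueError("expected a 32 x 8 quality matrix")
    return index_to_pair(max(range(N_PAIRS), key=flat.__getitem__))
```
]]></ab>
<p>Under this assumed ordering, the pair (31, 7) maps to label 255, and best_pair returns the ground-truth pair that a classifier over the 256 labels would be trained to predict.</p></div>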
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5.2.">FLASH and e-FLASH</head><p>The FLASH multimodal dataset is the first real-world deployment of a mmWave V2I system. The testbed includes a 2017 Lincoln MKZ Hybrid autonomous vehicle with GPS, camera, and LiDAR sensors onboard. The GoPro Hero4 camera, with a field-of-view (FOV) of 130 degrees, faces the BS, and the Velodyne VLP 16 LiDAR is mounted on top of the vehicle. The sensors are connected to a central computer running the Robot Operating System (ROS), which stores the recordings with their timestamps. TP-Link Talon AD7200 tri-band routers, with Qualcomm QCA9500 IEEE 802.11ad Wi-Fi chips, are used as the BS and Rx at the 60 GHz frequency <ref type="bibr">[93]</ref>. The default codebook includes sector IDs from 1 to 31 and 61 to 63, for a total of 34 sectors; the sectors with IDs 32 to 60 are undefined. The multimodal sensor data and RF ground truth are captured at the following rates: 1 Hz for GPS, 30 frames per second (fps) for the camera, 10 Hz for LiDAR, and 1-1.5 Hz for the RF ground truth. Later, Gu et al. extend the sensor suite in FLASH with two additional sensors to form e-FLASH. An extra GoPro Hero9 camera faces the road, unlike FLASH, where the camera faces the side of the road. Moreover, a Velodyne VLP 64 LiDAR is included to capture point clouds at a higher resolution of 64 channels, compared to the 16-channel LiDAR used in FLASH. The FLASH and e-FLASH datasets include different scenarios, such as LoS and NLoS with a pedestrian, a static car, or a moving car, and both are released to the community and publicly available in <ref type="bibr">[94]</ref>.</p></div>
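<div xmlns="http://www.tei-c.org/ns/1.0"><p>The sector numbering and the differing sensor rates above suggest two small preprocessing steps, sketched below under stated assumptions. The sector IDs come from the text, while the contiguous labelling, the nearest-timestamp pairing rule, and the names SECTOR_TO_LABEL and nearest_sample are hypothetical and are not part of the FLASH release.</p>
<ab><![CDATA[
```python
from bisect import bisect_left

# Sector IDs defined in the default codebook, as stated in the text: 1-31 and 61-63.
VALID_SECTORS = list(range(1, 32)) + list(range(61, 64))

# Map the non-contiguous sector IDs to contiguous class labels 0..33 (assumed labelling).
SECTOR_TO_LABEL = {sector: label for label, sector in enumerate(VALID_SECTORS)}


def nearest_sample(timestamps, t):
    """Return the index of the timestamp closest to t.

    timestamps must be sorted in ascending order; ties go to the earlier sample.
    """
    if not timestamps:
        raise ValueError("no timestamps")
    pos = bisect_left(timestamps, t)
    if pos == 0:
        return 0
    if pos == len(timestamps):
        return len(timestamps) - 1
    before, after = timestamps[pos - 1], timestamps[pos]
    return pos if after - t < t - before else pos - 1
```
]]></ab>
<p>For example, each RF ground-truth record (1-1.5 Hz) could be paired with the closest LiDAR sweep (10 Hz) by calling nearest_sample on the LiDAR timestamps, and its sector ID converted to a training label through SECTOR_TO_LABEL.</p></div>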
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5.3.">DeepSense 6G</head><p>Alkhateeb et al. explored a progressive set of 5 testbeds for real-world mmWave band deployments. In the most advanced setting, a multimodal dataset for a real-world V2I scenario is released <ref type="bibr">[81]</ref>. While in the FLASH dataset all the sensors are deployed on the vehicle, in DeepSense only GPS is available at the vehicle (sampling frequency of 10 Hz). The remaining sensors, including camera, RADAR, and LiDAR, are deployed at the roadside base station. The camera (ZED2 from StereoLabs) covers a field of view (FOV) of up to 110 degrees, recording 30 frames per second. The Frequency Modulated Continuous Wave (FMCW) RADAR, with 3 Tx and 4 Rx antennas, records the complex I/Q RADAR measurements in the 76-81 GHz frequency range. Finally, an Ouster LiDAR with a range of 40 m and a resolution of up to 3 cm records the point clouds <ref type="bibr">[56]</ref>. The mmWave radio in DeepSense includes a 16-element Uniform Linear Array (ULA) phased array from SIVERS Semiconductors. The radios operate at 60 GHz with a predefined codebook of 64 beams that uniformly scan a 90° field of view. The released dataset for the V2I scenario includes up to 18,667 samples.</p><p>An overview of the publicly available datasets for beamforming, covering both single and multiple modalities, is presented in Table <ref type="table">6</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.6.">Challenges</head><p>While multimodal learning is an extremely interesting research field, some challenges need to be addressed. First, in order to exploit more than one modality, the synchronized information of all modalities must be present during inference. This requires a precise network controller and back-channel to enable connectivity among different modules while accounting for privacy concerns. Second, the fusion scheme needs to be designed such that the different modalities result in a reinforced prediction. The fusion model can be as simple as a linear transformation, such as summation or multiplication. However, learning the relation between different modalities might require nonlinear transformations, such as deep learning on custom-made neural networks. In that regard, exploring novel fusion techniques based on different nonlinear transformations is a widely researched area of the current state-of-the-art.</p></div>
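<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the simple fusion rules mentioned above concrete, the following sketch combines per-modality beam scores by element-wise summation and multiplication and then selects the top-1 beam. Applying fusion to per-modality probability vectors (a late-fusion arrangement), the toy score values, and all function names are assumptions for illustration; they do not reproduce any of the surveyed methods.</p>
<ab><![CDATA[
```python
import math


def softmax(scores):
    """Convert raw per-beam scores into probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_sum(per_modality):
    """Element-wise summation of per-modality beam scores."""
    return [sum(values) for values in zip(*per_modality)]


def fuse_product(per_modality):
    """Element-wise multiplication of per-modality beam scores."""
    return [math.prod(values) for values in zip(*per_modality)]


def top1(scores):
    """Index of the highest-scoring beam."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy scores over four candidate beams (invented values, not from any dataset).
gps = softmax([2.0, 1.0, 0.1, 0.0])
camera = softmax([0.5, 2.5, 0.2, 0.1])
lidar = softmax([0.3, 2.0, 1.8, 0.0])

sum_choice = top1(fuse_sum([gps, camera, lidar]))
product_choice = top1(fuse_product([gps, camera, lidar]))
```
]]></ab>
<p>In this toy case GPS alone would select beam 0, whereas both fused scores select beam 1, the beam favored by the camera and LiDAR. A learned nonlinear fusion network would take the place of fuse_sum or fuse_product when such fixed rules cannot capture the relation between modalities.</p></div>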
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Emerging research frontiers using multimodal beamforming</head><p>In this section, we present selected research frontiers where AI-enabled multimodal beamforming techniques can make a transformative difference.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.">V2X networking</head><p>The Vehicle-to-Everything (V2X) market is projected to reach $13 billion by 2028 <ref type="bibr">[97]</ref>. V2X will enable communication among vehicles as well as between vehicles and networks, infrastructure, and pedestrians, aiming to improve traffic efficiency, road safety, and individual vehicle energy efficiency <ref type="bibr">[98]</ref>. V2X connectivity is also essential for the advancement of autonomous driving. Traffic efficiency improves by monitoring congested areas and providing alternative routes, while road safety is maintained by monitoring speed and identifying risky drivers. At the same time, V2X networks can improve energy efficiency by making vehicles more intelligent, enabling them to choose journeys with lower carbon emissions.</p><p>Different beamforming techniques have a direct impact on the performance of 5G-V2X networks <ref type="bibr">[99]</ref>. In <ref type="bibr">[100]</ref>, Lee et al. present an object detection algorithm that fuses visual and LiDAR data to form 3D images of the vehicle surroundings. Combining these two concepts, we envision that the application area of multimodal beamforming for V2X architectures will extend from fast and reliable communication to object detection in urban scenarios. As mentioned in Section 4.1, the availability of different types of sensor data forms the backbone of V2X communication. Multimodal beamforming using these sensors can be leveraged to provide low-latency V2X communication. The knowledge of the selected beams at each specific location can then be leveraged to detect objects or pedestrians using AI-enabled algorithms. An example use case of pedestrian detection via beamforming is depicted in Fig. <ref type="figure">8</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.">UAV communication</head><p>Unmanned aerial vehicles (UAVs) are used extensively in military, scientific, and civil applications. They can be used for capturing data, monitoring non-accessible areas, and developing high-throughput wireless communication infrastructure. Networks of UAVs, known as flying ad-hoc networks (FANETs), have sparked great interest in academia, industry, and government due to their flexibility, low cost, and wide range of applications: disaster management, relaying networks, agricultural processes, and many more <ref type="bibr">[101]</ref>. For all these applications, high-speed, low-latency wireless communication is essential between UAVs as well as from UAVs to ground entities (UAV-Ground).</p><p>Images captured by flying UAVs may need to be distributed to ground nodes, while data from the ground terminals is required by the UAVs for channel allocation and routing <ref type="bibr">[102]</ref>. Distributed beamforming is an important enabler for high-throughput and long-range communications through flying UAVs, given their high probability of LoS links due to their altitude. Drawbacks in these scenarios, such as inaccurate GPS signals and unpredictable UAV hovering, create the need for accurate transmission synchronization between multiple UAVs through external sensor data input <ref type="bibr">[103]</ref>, in order to realize a practical distributed beamforming implementation for multi-UAV to ground <ref type="bibr">[104]</ref> and UAV-UAV communications. The works in <ref type="bibr">[105]</ref> and <ref type="bibr">[106]</ref> highlight the use of mmWave links for UAV-UAV and UAV-ground communication. Ultra-fast UAV communication is also essential for wireless infrastructure drones (WIDs). To meet the need for faster and more reliable communication in both of the above cases, beamforming in mmWave can be combined with AI-enabled techniques. With the introduction of camera images and other non-RF multimodal data, such as GPS, multimodal beamforming can improve UAV communication in terms of throughput, robustness, coverage, and delay.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3.">Multi-agent robotics</head><p>Autonomous agents are increasingly used in a variety of applications such as mining <ref type="bibr">[107]</ref>, agriculture <ref type="bibr">[108]</ref>, military <ref type="bibr">[109]</ref>, aerospace <ref type="bibr">[110]</ref>, and medicine <ref type="bibr">[111]</ref>, to name a few. In this paradigm, many system entities need to coordinate with each other to make decisions online and collaboratively. Examples of multi-agent robotics include simultaneous localization and mapping (SLAM) <ref type="bibr">[112]</ref><ref type="bibr">[113]</ref><ref type="bibr">[114]</ref>, warehouse robotics <ref type="bibr">[115]</ref><ref type="bibr">[116]</ref><ref type="bibr">[117]</ref>, surgical robotics <ref type="bibr">[118,</ref><ref type="bibr">119]</ref>, autonomous driving <ref type="bibr">[120]</ref><ref type="bibr">[121]</ref><ref type="bibr">[122]</ref>, and agricultural robotics <ref type="bibr">[123]</ref><ref type="bibr">[124]</ref><ref type="bibr">[125]</ref>. In these applications, each agent in a multi-agent system may be equipped with sensors such as LiDAR, RGB and infra-red (IR) cameras, and GPS receivers, which enable it to function autonomously. Many applications rely on agents being able to communicate locally with other agents. For such applications, and for real-time collaboration, high-speed communication is needed for sharing sensor information, decisions, and actions. To support such large data-rate requirements, industries with highly automated process flows are pursuing high-bandwidth communication links, including access to mmWave bands <ref type="bibr">[126]</ref>. Multimodal beamforming using non-RF data can be an interesting approach to facilitate faster communication between such autonomous entities using their integrated sensors.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4.">Terahertz communication</head><p>Early works that prove the feasibility of exploiting the THz frequency bands (0.3 THz to 10 THz) point towards an upcoming paradigm shift in the way wireless spectrum will be used. THz-band links bridge the gap between radio and optical frequency ranges, which may be game-changing for nextG wireless networks <ref type="bibr">[127]</ref> by enabling transfer rates of 10 Gb/s <ref type="bibr">[128]</ref>. However, the highly directional and fine-grained beams of phased array antennas, which are essential to support THz communication, come with their own challenges. Additionally, the beam search space grows with increasing frequency. Hence, there is an urgent need to exploit out-of-the-box approaches, such as AI-enabled CSI estimation techniques, to decouple the number of antenna elements from the beamforming time overhead <ref type="bibr">[39]</ref>. We believe the idea of multimodal beamforming can also be extended to THz communication to reduce the exploding search space of antenna codebook elements by leveraging environmental multimodal data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.5.">Virtual presence</head><p>Since the start of the COVID-19 pandemic, we have quickly transitioned to using virtual communications platforms to aid in wellness and safety. However, platforms like Zoom and Teams can only do so much with respect to quality of user experience. Most of these platforms are still limited to on-screen presence. This is where the recent development of eXtended Reality (XR) can make a difference by opening up the possibility of transforming on-screen presence into virtual presence. The concept of holographic representation can emulate physical presence for meeting, gaming, or collaborating with others. Such virtual presence will support mobility in group presentation or multi-player gaming situations. XR technologies will require multi-Gbps data rates that may saturate a sub-mmWave band within seconds. Even the still-evolving 5G standard is not capable of supporting these data transfer rates. The standardization of ultra-fast beamforming in mmWave communication is integral for NextG standards <ref type="bibr">[11]</ref>. The concept of using multimodal non-RF data in such applications is promising in this regard. The rich properties of XR or holographic images can be exploited for situational awareness to aid beamforming at high frequencies, where the codebook search space is generally too large to search exhaustively in real time <ref type="bibr">[129]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.6.">Hybrid beamforming</head><p>mMIMO communication in hybrid transceivers is realized by a combination of high-dimensional analog phase shifters and power amplifiers with lower-dimensional digital signal processing units <ref type="bibr">[25]</ref>. For fully connected hybrid transceivers, the situational states captured through the non-RF modalities can be leveraged to select multiple phase shifters (multi-label prediction), from which the best RF chains can be derived to aid even faster beamforming. Multimodal beamforming can be applied per RF chain to select the best phase shifter, which enables parallel inference for all the RF chains at the same time. Hence, the use of multimodal data has considerable potential for improving the emerging hybrid beamforming technique and for allowing it to scale seamlessly to NextG networks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusions</head><p>This paper provides a comprehensive survey of AI-enabled beamforming techniques using out-of-band and multimodal data for mmWave communication in NextG networks. While the previous surveys on beamforming <ref type="bibr">[23]</ref><ref type="bibr">[24]</ref><ref type="bibr">[25]</ref> focus more on analyzing and using mmWave channel characteristics or channel state information for beamforming in massive MIMO, leveraging the complicated hybrid beamforming process, our survey reviews recent trends in the literature that adopt an out-of-the-box approach for solving the same problem. We discuss the state-of-the-art in research trends, application areas, and open challenges of this exciting and emerging paradigm of multimodal sensor data-enabled beamforming. We identify these open research challenges to motivate future research as well as to indicate the potential transformative impact of this area on different wireless applications.</p></div></body>
		</text>
</TEI>
