<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Reinforcement Learning for Beam Pattern Design in Millimeter Wave and Massive MIMO Systems</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>06/20/2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10292254</idno>
					<idno type="doi">10.1109/IEEECONF51394.2020.9443430</idno>
					<title level='j'>Asilomar Conference on Signals, Systems, and Computers</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Yu Zhang</author><author>Muhammad Alrabeiah</author><author>Ahmed Alkhateeb</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Deploying large-scale antenna arrays is a key characteristic of current and future wireless communication systems. However, due to non-ideal practical conditions, such as unknown array geometry or possible hardware impairments, accurate channel state information becomes hard to acquire. This impedes the design of the beamforming/combining vectors that are crucial to fully exploit the potential of large-scale MIMO systems or to combat the high path loss in millimeter wave (mmWave) communications. In this paper, we propose a novel solution that leverages deep reinforcement learning (DRL) to learn a beam pattern that is optimized for a group of users without explicit knowledge of the channels. Simulation results show that the developed solution is capable of finding a near-optimal beam pattern with quantized phase shifters while requiring only beamforming gain feedback from the users.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Leveraging the large bandwidth available at millimeter wave (mmWave) frequency bands requires the deployment of large antenna arrays. However, to balance the overall hardware cost, cheap and low-precision radio components might be adopted. This leads to some non-ideal practical conditions, such as unknown array geometry or possible hardware impairments. In this situation, the performance of the commonly used beams (such as the ones in classical beamsteering codebooks) degrades drastically because they are unaware of the environment and the hardware. Furthermore, accurate channel state information is generally hard or prohibitive to estimate due to the possible hardware impairments and the large number of antennas. As a result, classical or data-driven beam pattern/codebook design approaches, e.g. <ref type="bibr">[1]</ref>, may not be feasible.</p><p>Prior Work: Designing beamforming codebooks is a key step in realizing the potential of mmWave MIMO communications, and it has been an important research topic for quite some time <ref type="bibr">[2]</ref>-<ref type="bibr">[5]</ref>. With large-scale MIMO systems, the hardware limitations (especially at mmWave/THz) and the use of analog-only or hybrid transceiver architectures impose new constraints on the codebook design problems. This has motivated the development of new beamforming codebooks with single-lobe and narrow beams <ref type="bibr">[6]</ref>. Although very directive, those codebooks bring with them an increased training overhead. As such, <ref type="bibr">[7]</ref> and <ref type="bibr">[8]</ref> have explored hierarchical codebook structures, which implement different levels of beam widths.</p><p>Contribution: In this paper, we propose a deep reinforcement learning (DRL) based solution to learn the optimized beam pattern for a group of users. This is done by utilizing a novel Wolpertinger architecture <ref type="bibr">[9]</ref>, which is designed to efficiently explore the large discrete action space. The proposed model accounts for key hardware constraints such as the phase-only, constant-modulus, and quantized-angle constraints <ref type="bibr">[10]</ref>. This is realized by defining the state directly as the phases of the analog phase shifters and the action as the change of phases within the quantized phase set. Simulation results show that the proposed solution is capable of finding a near-optimal beam pattern and achieving a beamforming gain comparable to that of equal gain combining.</p><p>Yu Zhang, Muhammad Alrabeiah and Ahmed Alkhateeb are with Arizona State University (Email: y.zhang, malrabei, alkhateeb@asu.edu). This work is supported by the National Science Foundation under Grant No. 1923676.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. SYSTEM AND CHANNEL MODELS</head><p>In this section, we introduce our adopted system and channel models in detail. We also describe how the model accounts for arbitrary array geometries and possible hardware impairments.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. System Model</head><p>We consider a system model where a mmWave massive MIMO base station (BS) with M antennas communicates with a single-antenna user. Further, given the high cost and power consumption of mixed-signal components, we consider a practical system where the BS has only one radio frequency (RF) chain and employs analog-only beamforming using a network of r-bit quantized phase shifters. Therefore, the beamforming vector can be written as</p><formula>\mathbf{w} = \frac{1}{\sqrt{M}} \left[ e^{j\theta_1}, e^{j\theta_2}, \ldots, e^{j\theta_M} \right]^T, \quad (1)</formula><p>where each phase shift θ_m is selected from a finite set Θ with 2^r possible discrete values drawn uniformly from (−π, π]. In the uplink transmission, if a user u transmits a symbol x ∈ C to the base station, where the transmitted symbol satisfies the average power constraint E[|x|²] = P_x, the received signal at the base station after combining can be expressed as</p><formula>y_u = \mathbf{w}^H \mathbf{h}_u x + \mathbf{w}^H \mathbf{n}, \quad (2)</formula><p>where h_u ∈ C^{M×1} is the uplink channel vector between user u and the base station antennas and n ∼ N_C(0, σ_n² I) is the receive noise vector at the base station.</p></div>
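The quantized analog beamforming model in (1) can be sketched in a few lines of NumPy. The array size, phase-shifter resolution, and random seed below are illustrative choices, not values taken from the paper:

```python
import numpy as np

def quantized_beamformer(theta, M):
    # w = (1/sqrt(M)) * [e^{j*theta_1}, ..., e^{j*theta_M}]^T, as in (1)
    return np.exp(1j * theta) / np.sqrt(M)

# Illustrative sizes: M antennas, r-bit phase shifters.
M, r = 32, 3
# 2^r discrete phase values drawn uniformly from (-pi, pi]
Theta = -np.pi + 2 * np.pi * (np.arange(2 ** r) + 1) / 2 ** r
theta = np.random.default_rng(0).choice(Theta, size=M)
w = quantized_beamformer(theta, M)

assert np.allclose(np.abs(w), 1 / np.sqrt(M))  # constant modulus per element
assert np.isclose(np.linalg.norm(w), 1.0)      # unit-norm combiner
```

Note that the unit norm of w follows directly from the constant-modulus structure: M elements of magnitude 1/√M always sum to a squared norm of 1.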
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Channel Model</head><p>We adopt a general geometric channel model for h_u. Assume that the signal propagation between user u and the base station consists of L paths. Each path ℓ has a complex gain α_ℓ and an angle of arrival φ_ℓ. Then, the channel vector can be written as</p><formula>\mathbf{h}_u = \sum_{\ell=1}^{L} \alpha_\ell \, \mathbf{a}(\phi_\ell), \quad (3)</formula><p>where a(φ) is the array response vector of the base station. The definition of a(φ) depends on the array geometry and hardware impairments, which we discuss in more detail next.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Hardware Impairments Model</head><p>Most of the prior work on mmWave signal processing has assumed uniform antenna arrays with perfect calibration and ideal hardware <ref type="bibr">[3]</ref>, <ref type="bibr">[6]</ref>, <ref type="bibr">[8]</ref>, <ref type="bibr">[10]</ref>. In this paper, we consider a more general antenna array model that accounts for arbitrary geometry and hardware impairments, and we target learning beam patterns that mitigate the influence of those unknown factors. While the beam pattern learning solution that we develop in this paper is general for various kinds of array geometries and hardware impairments, we evaluate the proposed solution in Section V with respect to two main characteristics of interest, namely non-uniform spacing and phase mismatch between the antenna elements. For linear arrays, the array response vector can be modeled to capture these characteristics as follows</p><formula>\mathbf{a}(\phi) = \left[ e^{j(k d_1 \cos(\phi) + \Delta\theta_1)}, e^{j(k d_2 \cos(\phi) + \Delta\theta_2)}, \ldots, e^{j(k d_M \cos(\phi) + \Delta\theta_M)} \right]^T, \quad (4)</formula><p>where d_m is the position of the m-th antenna element, Δθ_m is its phase mismatch, and k is the wavenumber.</p></div>
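A minimal sketch of the impaired array response in (4). The impairment magnitudes (5% spacing jitter, 0.1 rad phase mismatch) are hypothetical values chosen purely for illustration:

```python
import numpy as np

def array_response(phi, d, delta_theta, wavelength=1.0):
    # a(phi) = [e^{j(k*d_m*cos(phi) + delta_theta_m)}]_m with k = 2*pi/wavelength, as in (4)
    k = 2 * np.pi / wavelength
    return np.exp(1j * (k * d * np.cos(phi) + delta_theta))

rng = np.random.default_rng(1)
M = 8
d = 0.5 * np.arange(M) + rng.normal(0, 0.05, M)  # non-uniform antenna spacing
delta_theta = rng.normal(0, 0.1, M)              # per-antenna phase mismatch
a = array_response(np.pi / 3, d, delta_theta)

assert a.shape == (M,) and np.allclose(np.abs(a), 1.0)
```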
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. PROBLEM DEFINITION</head><p>In this paper, we investigate the beam pattern design problem for mmWave and massive MIMO systems with unknown array geometry and hardware impairments. Given the system and channel models described in Section II, the SNR after combining for user u can be written as</p><formula>\mathrm{SNR}_u = \rho \left| \mathbf{w}^H \mathbf{h}_u \right|^2, \quad (5)</formula><p>where ‖w‖² = 1 is implicitly used and ρ = P_x/σ_n². Besides, we define the beamforming/combining gain of adopting w as a transmit/receive beamformer for user u as</p><formula>g_u = \left| \mathbf{w}^H \mathbf{h}_u \right|^2. \quad (6)</formula><p>It can be seen that maximizing (<ref type="formula">6</ref>) is equivalent to maximizing the SNR in (<ref type="formula">5</ref>). Therefore, the objective of this paper is to design (learn) the beamforming vector w that maximizes the beamforming/combining gain given by (<ref type="formula">6</ref>) averaged over a set of users with similar channels. The beam pattern learning problem can thus be formulated as</p><formula>\max_{\mathbf{w}} \; \frac{1}{|\mathcal{H}|} \sum_{\mathbf{h}_u \in \mathcal{H}} \left| \mathbf{w}^H \mathbf{h}_u \right|^2, \quad (7) \qquad \text{s.t.} \quad |w_m| = \frac{1}{\sqrt{M}}, \; \forall m, \quad (8) \qquad \theta_m \in \Theta, \; \forall m, \quad (9)</formula><p>where w_m is the m-th element of the beamforming vector and H is the channel set, which may contain a single channel or multiple similar channels. It is worth mentioning that the constraint in (<ref type="formula">8</ref>) is imposed to uphold the adopted analog-only system model, and the constraint in (<ref type="formula">9</ref>) respects the quantized phase-shifter hardware constraint. Due to the unknown array geometry as well as possible hardware impairments, accurate channel state information is generally hard to acquire. This means that all the channels h_u ∈ H in the objective function are possibly unknown. Instead, the base station may only have access to the beamforming/combining gain g_u (or equivalently the Received Signal Strength Indicator (RSSI)) reported by each user. 
Therefore, problem (<ref type="formula">7</ref>) is hard to solve in a general sense, owing to the unknown parameters in the objective function as well as the non-convex constraint (<ref type="formula">8</ref>) and the discrete constraint (<ref type="formula">9</ref>). Given that this problem is essentially a search problem with feedback in a dauntingly huge yet finite and discrete space, we leverage the powerful exploration capability of deep reinforcement learning to efficiently search over the space and find an optimal or near-optimal solution.</p></div>
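The objective in (7) only requires the users' gain feedback, never the channels themselves. A small sketch of the objective, together with a check that equal gain combining (the EGC upper bound referenced later in Section V) cannot be beaten under the constant-modulus constraint; the channel below is synthetic toy data, not from the paper's dataset:

```python
import numpy as np

def avg_beamforming_gain(w, H):
    # Objective of (7): mean of g_u = |w^H h_u|^2 over the channel set H.
    # In the real system the BS only sees these gains (RSSI feedback), not H.
    return np.mean([np.abs(np.vdot(w, h)) ** 2 for h in H])

rng = np.random.default_rng(2)
M = 16
h = (rng.normal(size=M) + 1j * rng.normal(size=M)) / np.sqrt(2)  # toy channel

w_egc = np.exp(1j * np.angle(h)) / np.sqrt(M)   # EGC: co-phase every antenna
w_rand = np.exp(1j * rng.uniform(-np.pi, np.pi, M)) / np.sqrt(M)

# EGC maximizes |w^H h| subject to |w_m| = 1/sqrt(M), so it upper-bounds any
# other beam satisfying constraints (8)-(9).
assert avg_beamforming_gain(w_egc, [h]) >= avg_beamforming_gain(w_rand, [h])
```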
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. BEAM PATTERN LEARNING</head><p>In this section, we present our proposed DRL-based algorithm for addressing the beam pattern design problem (<ref type="formula">7</ref>). It is worth mentioning that, when viewed from a reinforcement learning perspective, the problem features a finite yet very high dimensional action space. This makes traditional learning frameworks (such as deep Q-learning, deep deterministic policy gradient, etc.) hard to apply. Therefore, we adopt a novel architecture called Wolpertinger to enable efficient search in a large discrete action space; the details can be found in <ref type="bibr">[9]</ref>.</p><p>1) Reinforcement Learning Setup: To solve the problem with reinforcement learning, we first specify the corresponding building blocks of the learning algorithm as follows:</p><p>&#8226; State: We define the state s_t as the vector consisting of the phases of all the phase shifters at the t-th iteration, that is, s_t = [θ_1^{(t)}, θ_2^{(t)}, ..., θ_M^{(t)}]^T. This phase vector can be converted to the actual beamforming vector by applying (<ref type="formula">1</ref>). Since all the phases in s_t are selected from Θ, and all the phase values in Θ are within (−π, π], (1) essentially defines a bijective mapping from the phase vector to the beamforming vector. Therefore, for simplicity, we will use the term "beamforming vector" to refer to both the phase vector and the actual beamforming vector (the conversion is given by (<ref type="formula">1</ref>)), according to context.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>446</head><p>Authorized licensed use limited to: ASU Library. Downloaded on September 01,2021 at 00:36:33 UTC from IEEE Xplore. Restrictions apply. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Fig. <ref type="figure">1</ref>. The proposed beam pattern design framework with deep reinforcement learning. The schematic shows the agent architecture and the way it interacts with the environment.</p><p>&#8226; Action: We define the action a_t as the element-wise changes to all the phases in s_t. Since the phases can only take values in Θ, a change of a phase means that the phase shifter selects a value from Θ. Therefore, the action is directly specified as the next state, i.e. s_{t+1} = a_t.</p><p>&#8226; Reward: We define a ternary reward mechanism, i.e. the reward r_t takes values from {+1, 0, −1}. We compare the beamforming gain achieved by the current beamforming vector, denoted by g_t, with two values: (i) an adaptive threshold β_t, and (ii) the previous beamforming gain g_{t−1}. The reward is computed using the following rule: r_t = +1 if g_t ≥ β_t; r_t = 0 if g_t &lt; β_t and g_t ≥ g_{t−1}; and r_t = −1 if g_t &lt; β_t and g_t &lt; g_{t−1}. It is important to note that the adopted adaptive threshold mechanism does not rely on any prior knowledge of the channel distribution. The threshold starts from zero, and whenever the BS tries a new beam whose resulting beamforming gain surpasses the current threshold, the system updates the threshold to the value of this new beamforming gain. Since a threshold update also marks the detection of a new beam that achieves the best beamforming gain so far, the BS also records this beamforming vector. As can be seen in the reward definition, the system always tracks two quantities in order to calculate the reward: the previous beamforming gain and the best beamforming gain achieved so far (i.e. the threshold).</p><p>2) Environment Interaction: As mentioned in Sections I and III, due to the possible hardware impairments, accurate channel state information is generally unavailable. 
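The ternary reward and adaptive-threshold rule can be sketched directly; the numeric gains in the checks below are made-up illustrations:

```python
def ternary_reward(g_t, g_prev, beta):
    """Ternary reward of Section IV: +1 if the new gain beats the adaptive
    threshold beta (which is then raised), 0 if it merely improves on the
    previous gain, -1 otherwise. Returns (reward, updated_threshold)."""
    if g_t >= beta:
        return +1, g_t      # new best beam so far: the BS also records it
    if g_t >= g_prev:
        return 0, beta
    return -1, beta

assert ternary_reward(2.0, 1.0, 1.5) == (1, 2.0)   # surpasses the threshold
assert ternary_reward(1.2, 1.0, 1.5) == (0, 1.5)   # improves, but below threshold
assert ternary_reward(0.5, 1.0, 1.5) == (-1, 1.5)  # worse on both counts
```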
Therefore, the base station can only resort to the beamforming gain feedback reported by the users to adjust its beam pattern and achieve better performance. Upon forming a new beam w, the base station transmits a pilot x using this beam and collects feedback from every user. It then averages all the beamforming gain feedback values</p><formula>\bar{g}_t = \frac{1}{|\mathcal{H}|} \sum_{\mathbf{h}_u \in \mathcal{H}} \left| \mathbf{w}^H \mathbf{h}_u \right|^2, \quad (10)</formula><p>where H represents the targeted user channel set. Recall that (<ref type="formula">10</ref>) is the same as evaluating the objective function of (<ref type="formula">7</ref>) with the current beamforming vector w. Depending on whether or not the new average beamforming gain surpasses the previous one as well as the current threshold, the base station receives either a reward or a penalty, from which it can judge the "quality" of the current beam and decide how to move.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm 1 DRL Based Beam Pattern Learning</head><p>1: Initialize the actor network µ(s|θ^µ) and the critic network Q(s, a|θ^Q) with random weights θ^µ and θ^Q<lb/>2: Initialize the target networks µ′ and Q′ with the actor and critic networks' weights, θ^{µ′} ← θ^µ and θ^{Q′} ← θ^Q<lb/>3: Initialize the replay memory D and the minibatch size B<lb/>4: Initialize the adaptive threshold β = 0 and the previous average beamforming gain ḡ_1 = 0<lb/>5: Initialize a random process N for action exploration<lb/>6: Initialize a random beamforming vector w_1 as the initial state s_1<lb/>7: for t = 1 to T do<lb/>8: Receive a predicted action from the actor network with exploration noise, a_t = µ(s_t|θ^µ) + N_t<lb/>9: Quantize the predicted action to a valid beamforming vector a_t according to (<ref type="formula">11</ref>)<lb/>10: Execute action a_t, observe the reward r_t and update the state to s_{t+1} = a_t<lb/>11: Update the threshold β and the previous beamforming gain ḡ_t<lb/>12: Store the transition (s_t, a_t, r_t, s_{t+1}) in D<lb/>…<lb/>Update the actor network using the sampled policy gradient<lb/>Update the target networks every C iterations<lb/>18: end for</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><p>3) Exploration: The exploration happens after the actor network predicts the action a_t based on the current state (beam) s_t. Upon obtaining the predicted action, additive noise is applied element-wise to a_t for the purpose of exploration, which is customary in reinforcement learning with continuous action spaces <ref type="bibr">[11]</ref>, <ref type="bibr">[12]</ref>. In our problem, we use temporally correlated noise samples generated by an Ornstein-Uhlenbeck process <ref type="bibr">[13]</ref>, which is also used in <ref type="bibr">[9]</ref>. It is worth mentioning that a proper configuration of the noise generation parameters has a significant impact on the learning process. Normally, the extent of exploration (the noise power) is set to be a decreasing function of the iteration number, reflecting the well-known exploration-exploitation tradeoff <ref type="bibr">[11]</ref>. Furthermore, the exact configuration of the noise power should relate to the specific application. In our problem, for example, the noise is directly added to the predicted phases. Thus, at the very beginning, the noise should be strong enough to perturb a predicted phase to any other phase in Θ. 
By contrast, when the learning process approaches termination (the learned beam already performs well), the noise power should be decreased to a smaller level that is only capable of perturbing a predicted phase to its adjacent phases in Θ.</p><p>4) Quantization: The predicted beam (with exploration noise added) should be quantized in order to be a valid new beam that can be implemented by the discrete phase shifters. Therefore, each quantized phase in the new vector can be calculated as</p><formula>\theta_m = \operatorname*{arg\,min}_{\theta \in \Theta} \left| \theta - \tilde{\theta}_m \right|, \quad (11)</formula><p>where θ̃_m denotes the m-th predicted (noise-perturbed) phase. This is essentially a nearest neighbor lookup (i.e., a k-nearest-neighbor classifier with k = 1).</p></div>
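The exploration and quantization steps can be sketched together: temporally correlated Ornstein-Uhlenbeck noise perturbs the predicted phases, and a nearest-neighbor lookup maps the result back onto Θ as in (11). The OU parameters below are illustrative, not tuned values from the paper:

```python
import numpy as np

r = 3
Theta = -np.pi + 2 * np.pi * (np.arange(2 ** r) + 1) / 2 ** r  # r-bit phase set

def ou_step(x, mu=0.0, theta=0.15, sigma=0.3, rng=None):
    # One Euler step of an Ornstein-Uhlenbeck process: mean-reverting,
    # temporally correlated exploration noise.
    rng = np.random.default_rng() if rng is None else rng
    return x + theta * (mu - x) + sigma * rng.normal(size=x.shape)

def quantize_phases(phases):
    # Nearest-neighbor lookup onto Theta (a 1-NN classifier), as in (11).
    # Differences are wrapped to (-pi, pi] so values near the -pi/pi seam
    # quantize to the correct neighbor.
    diff = np.angle(np.exp(1j * (phases[:, None] - Theta[None, :])))
    return Theta[np.argmin(np.abs(diff), axis=1)]

rng = np.random.default_rng(3)
predicted = np.array([0.1, -3.0, 1.6, 3.1])       # raw actor output (phases)
noisy = predicted + ou_step(np.zeros(4), rng=rng)  # exploration perturbation
a_t = quantize_phases(noisy)                       # valid next beam / next state

assert np.all(np.isin(a_t, Theta))
```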
<div xmlns="http://www.tei-c.org/ns/1.0"><head>5) Forward Computation and Backward Update:</head><p>The current state s_t and the new state s_{t+1} (recall that we directly set s_{t+1} = a_t) are then fed into the critic network to compute the Q values, based on which the targets of both the actor and critic networks are calculated. This completes a forward pass. Following that, a backward update is performed on the parameters of the actor and critic networks. Pseudocode for the complete procedure is given in Algorithm 1.</p></div>
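The forward pass described above follows the standard DDPG-style bootstrapped target (Wolpertinger [9] builds on DDPG). The discount factor gamma and the toy linear "networks" below are stand-ins for illustration, not the paper's actual networks:

```python
def critic_target(r_t, s_next, actor_tgt, critic_tgt, gamma=0.99):
    # Bellman target for the critic: y_t = r_t + gamma * Q'(s_{t+1}, mu'(s_{t+1})),
    # computed entirely with the slowly-updated target networks.
    a_next = actor_tgt(s_next)
    return r_t + gamma * critic_tgt(s_next, a_next)

# Toy stand-in networks, just to exercise the computation.
actor_tgt = lambda s: 2.0 * s        # mu'(s)
critic_tgt = lambda s, a: s + a      # Q'(s, a)

y = critic_target(r_t=1.0, s_next=0.5, actor_tgt=actor_tgt,
                  critic_tgt=critic_tgt, gamma=0.5)
assert y == 1.0 + 0.5 * (0.5 + 1.0)  # = 1.75
```

The critic is regressed toward y, and the actor is then updated with the sampled policy gradient through the critic, as in step "Update the actor network" of Algorithm 1.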
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. SIMULATION RESULTS</head><p>In this section, we evaluate the performance of the proposed solution. We first describe the adopted scenario and dataset used in our simulations and then discuss the results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Scenario and Dataset</head><p>In our simulations, we consider the outdoor scenario 'O1 60' offered by the DeepMIMO dataset <ref type="bibr">[14]</ref>, which is generated based on the accurate 3D ray-tracing simulator Wireless InSite <ref type="bibr">[15]</ref>. This scenario comprises two streets and one intersection with three uniform x-y user grids, as shown in Fig. <ref type="figure">2</ref>. To generate the channels from the users to the base station, we adopt the following DeepMIMO parameters: (1) Scenario name: O1 60, (2) Active BSs: 3, (3) Active users: Row 1200 to 1200, (4) Number of BS antennas in (x, y, z): (1, 32, 1), (5) System bandwidth: 1 GHz, (6) Number of OFDM sub-carriers: 1 (single-carrier), (7) Number of multipaths: 5. From the generated dataset, we further select the user at row 1200 and column 181 of the scenario. The locations of both the selected user and the base station are marked in Fig. <ref type="figure">2</ref>.</p></div>
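For reference, the generation parameters listed above can be collected into a plain configuration dict. This simply mirrors the listed values; it is not the DeepMIMO library's actual API:

```python
# Illustrative record of the DeepMIMO generation settings in Section V-A.
deepmimo_config = {
    "scenario": "O1_60",
    "active_BS": [3],
    "active_user_rows": (1200, 1200),  # Row 1200 to 1200
    "BS_antennas_xyz": (1, 32, 1),
    "bandwidth_GHz": 1,
    "OFDM_subcarriers": 1,             # single-carrier
    "num_paths": 5,
}

assert deepmimo_config["BS_antennas_xyz"][1] == 32  # matches the M = 32 array
```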
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Performance Evaluation</head><p>We first evaluate our proposed DRL-based beam pattern learning solution on learning a single beam that serves a single user with a LOS connection to the base station. As shown in Fig. <ref type="figure">3</ref> (a), the proposed solution is capable of finding where the user is and forming a pointed beam to serve that user. By comparing the beam patterns of the equal gain combining/beamforming vector (plotted in red) and the learned beam (plotted in blue), it is evident that the proposed solution captures the main lobe of the equal gain combining/beamforming vector very well, which explains the excellent performance it achieves. The slight mismatch is mainly due to the use of quantized phase shifters: with 3-bit resolution, each phase shifter can only realize 8 different phase shifts drawn uniformly from (−π, π].</p><p>We also compare the performance of the learned single beam with a 32-beam classical beamsteering codebook, illustrated in Fig. <ref type="figure">3</ref> (b). Classical beamsteering codebooks normally perform very well in LOS scenarios. However, our proposed method surpasses the beamforming gain of the classical beamsteering codebook within a negligible number of iterations. More interestingly, with fewer than 5 × 10^4 iterations, the proposed solution reaches more than 90% of the equal gain combining (EGC) upper bound. It is worth mentioning that the EGC upper bound can only be reached when the user's channel is completely known and unquantized phase shifters are deployed. By contrast, our proposed solution ultimately achieves almost 95% of the EGC upper bound with 3-bit phase shifters and without any channel information.</p><p>The proposed beam pattern learning solution is also evaluated on a system with hardware impairments (for the same user considered above). 
This is a more realistic and interesting scenario, as mmWave systems are susceptible to perturbations such as antenna spacing mismatch and phase mismatch. The wavelength in mmWave bands is so small that even slight mismatches can lead to a drastic degradation in performance. This calls for an intelligent design process that is capable of adapting the beam pattern to the hardware, mitigating the loss caused by hardware mismatches.</p><p>The simulation results confirm that our proposed solution is capable of learning such an optimized beam pattern for a system with hardware impairments. Fig. <ref type="figure">4</ref> (a) shows the beam patterns of both the equal gain combining/beamforming vector and the learned beam. At first glance, the learned beam appears distorted and has multiple low-gain lobes. However, the performance of such a beam is excellent. This can be explained by comparing the beam patterns of the learned beam and the equal gain combining/beamforming vector: our proposed solution intelligently approximates the optimal beam, capturing all the dominant lobes. By contrast, the classical beamsteering codebook fails when the hardware is not perfect, as depicted in Fig. <ref type="figure">4</ref> (b). This is because the array pattern distorted by the hardware impairments leaves the pointed classical beamsteering beams able to capture only a small portion of the energy, which results in an inferior beamforming gain. The learned beam shown in Fig. <ref type="figure">4</ref> (a) is capable of achieving more than 90% of the EGC upper bound with only approximately 10^4 iterations, as shown in Fig. <ref type="figure">4</ref> (b). 
This is especially notable given that the proposed solution does not rely on any channel state information; channel estimation in this case would first require a full calibration of the hardware, which is a hard and expensive process.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. CONCLUSIONS AND DISCUSSIONS</head><p>In this paper, we developed a DRL-based approach to learn the optimized beam pattern for a single user, or a group of users with similar channels, in mmWave massive MIMO systems. More specifically, we adopted a novel Wolpertinger architecture, which is designed to efficiently explore the large discrete action space. The proposed learning framework respects key hardware constraints such as the phase-only, constant-modulus, and quantized-angle constraints. Simulation results show that the proposed solution is capable of finding a near-optimal beam pattern that achieves a beamforming gain comparable to that of equal gain combining.</p></div>
		</body>
		</text>
</TEI>
