<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Towards Addressing the Spatial Sparsity of MDT Reports to Enable Zero Touch Network Automation</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>12/01/2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10352770</idno>
					<idno type="doi">10.1109/GLOBECOM46510.2021.9686011</idno>
					<title level='j'>2021 IEEE Global Communications Conference (GLOBECOM)</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Joel Shodamola</author><author>Haneya Qureshi</author><author>Usama Masood</author><author>Ali Imran</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Minimization of Drive Test (MDT) reports are a key enabler for Machine Learning (ML)-based zero-touch automation envisioned for emerging cellular networks. However, due to numerous factors, MDT reports are spatially sparse. This sparsity undermines the performance of ML models that are built on MDT data to estimate and optimize network KPIs. In this paper, we present and evaluate a framework to address this challenge. We leverage generative models, specifically Generative Adversarial Networks (GAN) and Variational Autoencoders (VAE), to augment the sparse multi-dimensional MDT data. Unlike image data, where the quality of synthetic images produced by generative models can be evaluated visually, establishing the authenticity of synthetic tabular data is a more complex problem. We address this problem with a tripartite approach: 1) we use several statistical measures to quantify the resemblance of the synthetic data to the original data; 2) we compare the performance of an ensemble learning model trained on augmented data with that of one trained on original data only; 3) we benchmark the performance of the generative models against several classical ML models. This analysis is carried out for varying levels of sparsity and reveals insights into the robustness of generative models against training data sparsity, as well as the suitability of various methods for evaluating the quality of the generated synthetic tabular data. Results show that GAN performs considerably better than the other approaches. The presented solution can thus be used to overcome the sparsity problem in MDT reports, thereby enabling ML-based automation use cases.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>In emerging cellular networks, data-driven autonomous operations such as self-healing, self-configuration and self-optimization will depend on the availability of data. The operational complexity of the network is expected to become a burden for operators, as this complexity is assumed to scale linearly with network densification <ref type="bibr">[1]</ref>. Consequently, current manual and offline planning operations, which depend wholly on measurement reports collected from drive tests, are becoming obsolete and ineffective. To mitigate this problem, 3GPP introduced the minimization of drive test (MDT) <ref type="bibr">[2]</ref> to reinforce the autonomous solutions embedded in the features of self-organizing networks.</p><p>Data-driven autonomous solutions leverage machine learning, which can learn the underlying characteristics of the network from these MDT reports. The reports contain user location and network quality of service, quantified by key performance indicators (KPIs). The advantages of MDT reports include reduced human intervention, reduced operational expenditure (OPEX) and less time lost to offline configuration. However, self-organizing network functionality has not reached its expected capacity, predominantly because of a lack of representative data.</p><p>The coverage estimation maps derived from MDT reports come with several challenges that impede the seamless operation of intelligent network functions. Among these challenges are geographical positioning errors, quantization errors and data sparsity <ref type="bibr">[3]</ref>. The focus of this study is the data sparsity challenge. 
Several factors contribute to data sparsity in the cellular network domain, such as:</p><p>&#8226; Sparsity due to smaller cells: User traffic under small cells is expected to be less dense than under macro cells <ref type="bibr">[3]</ref>. Hence, the measurement reports obtained from small cell users will be scanty, leading to a sparse coverage map. &#8226; MDT incompatibility of user equipment (UE): While some UEs have built-in support for uploading MDT reports, some UE manufacturers have not implemented the MDT features. &#8226; Data privacy: Full ground truth is not available for optimization and planning due to privacy concerns, which adds to the sparsity of the data reported for MDT-based optimization solutions. &#8226; Data sparsity from network operators: Operators do not explore all possible combinations of network variables, to avoid jeopardizing quality of service in the live network. As a result, relevant or rich data is not available for machine learning exploration.</p><p>Several studies have proposed spatial interpolation techniques as a remedy for the sparsity challenge, as highlighted in Section I-A. Although these techniques have been shown to produce close estimates of the ground truth coverage map, they have limitations, such as being applicable only to stationary environments, which poses a problem for dynamic environments. Moreover, they are limited in the feature space used for prediction. For example, classical spatial interpolation techniques like inverse distance weighting or Kriging <ref type="bibr">[4]</ref> rely on the distance feature only and do not capture additional features, such as antenna tilt or azimuth angles.</p><p>A level of intelligence is required to absorb the real-time intricacy of the data and to enable proactive operation. 
To this end, several machine learning based solutions in the cellular network domain have also been proposed in the literature, as highlighted in Section I-A. However, these solutions are predominantly limited to using image data as features rather than tabular data <ref type="bibr">[5]</ref>- <ref type="bibr">[8]</ref>.</p><p>We investigate alternatives that involve regenerating entire coverage maps based on the inherent correlations that exist in tabular MDT data, and further augmenting that data. For this purpose, we leverage two deep learning based generative models, namely the generative adversarial network (GAN) and the variational autoencoder (VAE).</p><p>Generative models are well known for their ability to learn how data are generated. With a neural network as the foundation, data is passed into the generative model for training and, after a number of iterations or at a convergence point, similar data is reproduced. In contrast to most applications of generative models for cellular networks in the literature, which use image data <ref type="bibr">[5]</ref>- <ref type="bibr">[8]</ref>, this study applies generative models to regression-based analysis. We present two applications of generative models to address the sparsity challenge: coverage map generation and data augmentation.</p></div>
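<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the limitation of distance-only interpolation concrete, the following is a minimal sketch of inverse distance weighting, not the implementation used in the cited works. The function name idw_predict, the power parameter and the sample values are illustrative assumptions. Note that the estimate depends only on the geometric distance to each measured point; antenna tilt and azimuth cannot enter the prediction.</p>
<p>
```python
import math

def idw_predict(known, query, power=2.0):
    """Inverse distance weighted estimate at `query` from `known` samples.

    known: list of ((x, y), rsrp_dbm) tuples; query: (x, y).
    Only the geometric distance to each sample is used, so features
    such as tilt or azimuth play no role in the estimate.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for (x, y), value in known:
        dist = math.hypot(query[0] - x, query[1] - y)
        if dist == 0.0:
            return value  # query coincides with a measured point
        weight = 1.0 / dist ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total

# Hypothetical sparse measurements: (position in metres, RSRP in dBm).
samples = [((0.0, 0.0), -70.0), ((10.0, 0.0), -90.0)]
midpoint_estimate = idw_predict(samples, (5.0, 0.0))
```
</p>
<p>At a point equidistant from both samples, the two weights are equal and the estimate is simply their mean, regardless of any antenna configuration.</p></div>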
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Related Work</head><p>Prior to the advent of intelligence in cellular networks, most studies proposed using traditional interpolation techniques <ref type="bibr">[4]</ref>, <ref type="bibr">[9]</ref>, <ref type="bibr">[10]</ref> to estimate and predict missing coverage values in a geo-spatial environment. Although these techniques can estimate and fill in missing values, they do not scale well to dynamic environments because they are limited to stationary environments. In <ref type="bibr">[11]</ref>, the authors used a spatial sampling technique to approximate traffic in a network with several base stations, using an intuitive metric to select loads from a few base stations. They show that this technique eases the burden of collecting data for an entire network. The authors in <ref type="bibr">[12]</ref> and <ref type="bibr">[13]</ref> acknowledge cell outage sparsity in data base stations (BSs) and propose a mathematical approach called Grey prediction, which uses differential equations to predict RSRP data in the control BSs that comes from periodic updates between users and data. In <ref type="bibr">[14]</ref>, the authors give a comparative evaluation of different interpolation techniques, including Kriging, for localization and radio frequency estimation. They conclude that natural-neighbor interpolation performs better in terms of robustness to increased shadowing. The authors in <ref type="bibr">[15]</ref> address the sparsity of data from small cell users by using SMOTE to address data imbalance and an ensemble learning solution to classify network faults. They show that this method reduces the communication cost that results from overhead. A cost function is proposed in <ref type="bibr">[16]</ref>, where the authors use it to jointly optimize capacity and coverage in both uplink and downlink. 
In their contribution, they formulate sparsity as a function of two factors: first, the availability of data only within a limited parameter range (tilt) without knowledge of user locations, and second, the unknown dependence between network parameters and KPIs. Simulated annealing is used to obtain an upper bound on the KPIs, and coordinate descent is used for tilt search in that study.</p><p>In emerging networks, network automation utilizes deep learning models that require massive amounts of data to determine the inherent inter-dependencies that can be used to drive self-optimization and predict future patterns. One technique used to address the data sparsity challenge is transfer learning. For example, in <ref type="bibr">[17]</ref>, the authors use transfer learning in a different domain of wireless networks. To optimize local edge caching, they transfer knowledge from a source domain to a target domain under sparse knowledge of user content in small cells.</p><p>From classic interpolation techniques <ref type="bibr">[4]</ref>, <ref type="bibr">[9]</ref>, <ref type="bibr">[10]</ref>, to sampling techniques <ref type="bibr">[18]</ref>, and most recently, generative models <ref type="bibr">[19]</ref>- <ref type="bibr">[21]</ref>, several works in the literature have proposed ways to address data challenges such as data imbalance, data irregularity and corruption, privacy concerns and, most recent and most relevant to our study, data sparsity. While little attention has been paid to categorical, numerical and tabular data, which is typical of the cellular network domain, the majority of the literature leveraging GANs to address the data sparsity challenge focuses on recreating synthetic image and audio data similar to the full ground truth <ref type="bibr">[5]</ref>- <ref type="bibr">[8]</ref>. Recently, GANs have also gained acceptance in the medical domain. 
For example, one study <ref type="bibr">[22]</ref> addresses privacy concerns by generating and evaluating synthetic tabular data. With reference to cellular networks, the authors in <ref type="bibr">[8]</ref> applied a variant of the generative adversarial network (GAN) to generate radio frequency estimation maps from irregular maps, where a reconstruction error loss was added to the traditional GAN loss to enhance the stability of the generator. The authors in <ref type="bibr">[23]</ref> use integrated sampling, additive noise and a variational autoencoder (VAE) to generate synthetic data for localization in both indoor and outdoor environments, and conclude that the proposed augmentation techniques improve localization accuracy. However, studies using GANs for numerical tabular data in the cellular context are very few. The focus of the authors in <ref type="bibr">[24]</ref> was to generate and predict cellular traffic data for smart city usage, where a vanilla GAN with LSTM (long short-term memory) networks was used for time-series data. They conclude that performance increased with augmentation rates and decreased with generative data quantities. The authors in <ref type="bibr">[25]</ref> use a combination of a generative model and a classification model to address imbalanced data for cell outage detection. The study closest to our work is <ref type="bibr">[26]</ref>, where a GAN was used to augment call data records (CDR). However, that work focuses on one-dimensional data, whereas our work uses higher-dimensional data involving tilt, azimuth and distance parameters.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Contributions and Organization</head><p>To the best of the authors' knowledge, this work is the first to study the efficacy of GANs on multi-dimensional tabular data in the cellular network context under varying sparsity levels.</p><p>The key contributions of this work can be summarized as follows:</p><p>&#8226; We use generative models to regenerate coverage maps from sparse tabular multi-dimensional cellular data consisting of features such as user to base station distance, user antenna tilt and azimuth angles. &#8226; We study the effect of varying data sparsity levels on coverage estimation. &#8226; We compare the results with traditional methods, such as a sampling technique and classical machine learning (CML) predictive methods. &#8226; To test the authenticity of the generated synthetic data, we use a three-fold statistical and modeling analysis consisting of (i) evaluating the authenticity of the synthetic data using Spearman's rank correlation coefficient (SCC) and joint plots, (ii) observing the effect of the synthetic data on another ensemble ML model, and (iii) comparing its performance in terms of RMSE with other state-of-the-art synthetic data generating models. This analysis is crucial to ascertain whether the generated synthetic data has the same feature characteristics as the ground truth or is merely random noise.</p><p>The rest of this paper is organized as follows. Section II explains the proposed framework in detail, including data collection, map prediction using different CML methods and a description of the augmentation techniques. In Section III, we evaluate the performance of GAN, comparing it with CML methods as well as a sampling technique. Section IV concludes the paper. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. SYSTEM MODEL</head><p>The coverage of a network is usually estimated using reference signal received power (RSRP). This allows operators to detect coverage holes, blind coverage spots or poor coverage signals. In practice, the coverage area is usually divided into bins, and RSRP is measured based on the users in each bin. Fig. <ref type="figure">2a</ref> shows what a complete coverage map would look like given full ground truth. In realistic scenarios, however, for the reasons highlighted in Section I, network operators do not have access to this map and are left with the task of deciphering values from a sparse map like the one presented in Fig. <ref type="figure">2b</ref>. This work addresses this challenge by studying the ability of several CML models and deep learning based generative models to predict user received power in the white spaces of Fig. <ref type="figure">2b</ref>.  </p></div>
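<div xmlns="http://www.tei-c.org/ns/1.0"><p>The binning step and the resulting sparse map can be sketched as follows. This is an illustrative assumption about how per-bin RSRP could be aggregated and how sparsity could be emulated by keeping a random fraction of bins; the function names bin_rsrp and sparsify, the averaging rule and the sample reports are hypothetical and are not taken from the planning tool used in this work.</p>
<p>
```python
import random
from collections import defaultdict

BIN_WIDTH_M = 5.0  # assumed bin width in metres, as in Section II-A

def bin_rsrp(reports, bin_width=BIN_WIDTH_M):
    """Average RSRP per square bin. reports: iterable of (x, y, rsrp_dbm)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y, rsrp in reports:
        key = (int(x // bin_width), int(y // bin_width))
        sums[key] += rsrp
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

def sparsify(coverage, keep_fraction, seed=0):
    """Keep a random fraction of bins to emulate a sparse MDT map."""
    rng = random.Random(seed)
    keys = sorted(coverage)
    kept = rng.sample(keys, max(1, round(keep_fraction * len(keys))))
    return {key: coverage[key] for key in kept}

# Hypothetical reports: (x in m, y in m, RSRP in dBm).
reports = [(1.0, 1.0, -80.0), (4.0, 2.0, -84.0), (7.0, 1.0, -90.0),
           (12.0, 3.0, -95.0), (13.0, 8.0, -99.0), (2.0, 9.0, -88.0)]
full_map = bin_rsrp(reports)
sparse_map = sparsify(full_map, keep_fraction=0.2)
```
</p>
<p>The bins missing from sparse_map correspond to the white spaces that the prediction models are asked to fill.</p></div>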
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Data collection and coverage prediction</head><p>For this study, we utilize a commercial network planning tool with a state-of-the-art ray-tracing propagation model <ref type="bibr">[27]</ref>. From this tool, we acquire RSRP reports for users distributed according to a Poisson distribution. To reflect the ground truth of realistic coverage measurements, we incorporate environment maps from a real-world environment in the city of Brussels, consisting of buildings, heights, clutter and terrain profiles. We consider one cell site location with three sectors having the same coordinates, and a coverage area divided into bins of width 5 m. Coverage estimation from multiple cells will be studied as part of future work. Further network measurements are listed in Table <ref type="table">I</ref>.</p><p>For the classical ML approach, we use four different algorithms: Random Forest, K-Nearest Neighbor, Support Vector Regression and Linear Regression. We split the numerical regression data into training and test sets, where the training set represents the sparse data with the distance, tilt and azimuth values of each user. The training data is fed into each of the ML models, which are then used to predict the incomplete or white regions in the sparse map shown in Figure <ref type="figure">2b</ref>. For visualization purposes, Figure <ref type="figure">3</ref> shows the predicted coverage maps of all models along with their RMSE values, using 20 percent of the full coverage data as the available sparse data. Of all the listed models, K-Nearest Neighbor performed best, with the closest map and the lowest RMSE value. </p></div>
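<div xmlns="http://www.tei-c.org/ns/1.0"><p>The train-on-sparse, predict-the-rest procedure can be sketched with a small K-Nearest Neighbor regressor written in plain Python. This is a minimal illustration under stated assumptions, not the models or data used in this study: toy_rsrp is a hypothetical stand-in for the planning tool, the feature grid is invented, features are left unscaled for brevity, and k = 3 is an arbitrary choice. It shows how 20 percent of the rows serve as the sparse training set and how RMSE is computed on the remaining rows.</p>
<p>
```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def knn_predict(train_x, train_y, query, k=3):
    """Mean target of the k training rows closest to `query` (Euclidean)."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

def toy_rsrp(distance_m, tilt_deg, azimuth_deg):
    """Hypothetical stand-in for the planning tool, NOT a propagation model."""
    return -60.0 - 0.1 * distance_m - 0.5 * tilt_deg - 0.05 * abs(azimuth_deg)

# Full grid of (distance, tilt, azimuth) rows; every 5th row is "observed".
rows = [(float(d), float(t), float(a))
        for d in range(50, 550, 50) for t in (0, 4, 8) for a in (-30, 0, 30)]
targets = [toy_rsrp(*row) for row in rows]
train_idx = set(range(0, len(rows), 5))          # 20 percent sparse sample
train_x = [rows[i] for i in sorted(train_idx)]
train_y = [targets[i] for i in sorted(train_idx)]
test_x = [rows[i] for i in range(len(rows)) if i not in train_idx]
test_y = [targets[i] for i in range(len(rows)) if i not in train_idx]

# Predict the unobserved rows and compare against a constant-mean baseline.
predictions = [knn_predict(train_x, train_y, q) for q in test_x]
knn_rmse = rmse(test_y, predictions)
baseline = sum(train_y) / len(train_y)
baseline_rmse = rmse(test_y, [baseline] * len(test_y))
```
</p>
<p>In practice the features would typically be scaled before computing distances, since here the distance column dominates the Euclidean metric.</p></div>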
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Data Augmentation</head><p>In this section, we discuss the effect of data augmentation on training an ML model, since the ability of a model to learn network behaviour is known to depend on the amount of representative data. Generative models are widely known for their ability to produce data similar to the data fed into them. They recreate how data is generated by sampling from the probabilistic models underlying the data. With neural networks as their backbone, they tend to learn more as the amount of data fed into them grows. Although the first applications of generative models were on image datasets using convolutional neural networks, recent applications also cover tabular data. In our work, we employ and modify two types of generative models from the synthesizer in <ref type="bibr">[28]</ref> for data augmentation: GANs and VAEs.</p><p>1) GAN: A GAN is an unsupervised learning model comprising two neural networks, a generator and a discriminator. The former learns the distribution of the training data fed into it and maps points from a latent space to similar data, while the latter compares the generated data against real data. These two neural networks are modelled to play a mini-max game against each other. As the generator keeps learning the distribution of the original training data, it uses the parameters of that distribution to create similar samples from Gaussian noise, z. Simultaneously, the discriminator acts as a critic that differentiates between the true data and the synthetic samples. The endpoint of this mini-max game is usually user-defined or otherwise determined by a convergence point at which the discriminator can no longer tell the difference between true and synthetic samples. 
This game is described by the mathematical expressions in equations 1-3.</p><p>where D(x) is the discriminator's output for real data, G(z) is the generated data, and L(D) is the cross-entropy loss for correct classification.</p><p>2) Variational autoencoders (VAE): Like a GAN, a VAE comprises two networks, an encoder and a decoder. The encoder compresses the input into a hidden latent structure of lower dimension, while the decoder reconstructs the distribution from the latent space back to the dimension of the input data. The general loss of a variational autoencoder derives from the Evidence Lower Bound (ELBO), which consists of a reconstruction loss and a KL divergence term, as illustrated in equation 4.</p><p>where E(x) and D(x) represent the encoder and decoder terms, respectively. The right-hand term of equation 4 minimizes the KL divergence between the latent distribution and a Gaussian distribution and thus acts as a regularizer. Both models learn the joint distribution that exists in the training data fed into them. In our work, we take the coordinates, distance to base station, tilt and azimuth of user equipment as input features, together with the corresponding RSRP values to be mapped for each parameter setting.</p></div>
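<div xmlns="http://www.tei-c.org/ns/1.0"><p>For orientation, the loss terms described above can be sketched in their common textbook form. This is an assumption: the expressions below are the standard cross-entropy GAN losses and the closed-form Gaussian KL term of a VAE, and they are not necessarily identical to equations 1-4 of this paper or to the losses of the synthesizer in [28]. The function names and the use of a squared reconstruction error are illustrative choices.</p>
<p>
```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy of the discriminator.

    d_real: discriminator outputs D(x) on real rows, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated rows, each in (0, 1).
    """
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Minimax generator loss: mean of log(1 - D(G(z))), to be minimized."""
    return sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def vae_loss(x, x_hat, mu, log_var):
    """Negative ELBO sketch: squared reconstruction error plus KL term."""
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return reconstruction + gaussian_kl(mu, log_var)

# A discriminator that outputs 0.5 everywhere cannot tell real from fake.
confused = discriminator_loss([0.5, 0.5], [0.5, 0.5])   # equals 2 * ln 2
```
</p>
<p>The value 2 ln 2 corresponds to the convergence point described in the text, where the discriminator can no longer separate true from synthetic samples. The KL term vanishes when the encoder outputs a zero mean and unit variance, which is the regularizing pull toward the Gaussian prior.</p></div>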
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. EVALUATION AND RESULTS</head><p>Evaluating whether the generated synthetic data has the same feature characteristics as the ground truth, or is merely random noise, is crucial and remains an open research question. We evaluate GAN performance in three ways: (1) evaluating the authenticity of the synthetic data using Spearman's rank correlation coefficient (SCC) and joint plots; (2) observing the effect of the synthetic data on an ensemble ML model; and (3) comparing its performance in terms of RMSE with other state-of-the-art synthetic data models.</p><p>Spearman's correlation coefficient is a metric that measures the monotonicity of the relationship between two variables. In this work, we use SCC instead of the Pearson correlation (PC) because the latter tends to assume that both variables are normally distributed and evaluates only the linear relationship. SCC does not make this assumption and can capture nonlinear monotonic relationships between the variables, with less sensitivity to outliers. SCC scores lie in the range [-1, +1], where 0 means no correlation, +1 indicates a perfectly increasing monotonic relationship and -1 indicates a perfectly decreasing monotonic relationship between the variables. SCC is computed using the formula:</p><formula>\rho = 1 - \frac{6 \sum_{i=1}^{n} r_i^2}{n(n^2 - 1)}</formula><p>where n is the sample size and r is the difference between the ranks of the two variables for each observation. To calculate the SCC value, we first compare the distance with the RSRP in the original data and then check whether a similar relationship holds in the GAN data. As seen from Table <ref type="table">II</ref>, the negative values mean that increased distance from the base station yields a reduced RSRP value. GAN is able to reflect this relationship in both the intelligence as different models respond differently to the quantity and quality of data fed into them. 
Lastly, convergence of the generator in a GAN becomes a major issue, particularly when the distribution to be modeled is high-dimensional. This challenge will be investigated and addressed in future work.</p></div>
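<div xmlns="http://www.tei-c.org/ns/1.0"><p>The SCC check between distance and RSRP can be illustrated with a short sketch that applies the rank-difference formula, 1 - 6 * sum(r_i^2) / (n * (n^2 - 1)), with n the sample size and r_i the rank difference of observation i. The helper names and the sample values are hypothetical and are not the data of Table II. The rank-difference formula is exact only when there are no tied values; the ranks helper averages tied positions as a common convention.</p>
<p>
```python
def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start != len(order):
        end = start
        while end + 1 != len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            result[order[position]] = average
        start = end + 1
    return result

def spearman(x, y):
    """SCC via 1 - 6 * sum(r_i^2) / (n * (n^2 - 1)); exact when no ties."""
    n = len(x)
    diffs = [a - b for a, b in zip(ranks(x), ranks(y))]
    return 1.0 - 6.0 * sum(d * d for d in diffs) / (n * (n * n - 1))

# Hypothetical rows: RSRP falls as distance grows, but not linearly.
distance_m = [50.0, 120.0, 200.0, 350.0, 500.0]
rsrp_dbm = [-62.0, -75.0, -81.0, -84.0, -97.0]
scc = spearman(distance_m, rsrp_dbm)   # -1.0: perfectly decreasing
```
</p>
<p>Running the same function on the original rows and on the synthetic rows, and comparing the two scores, is one way to check whether a generative model has preserved the monotonic distance-RSRP relationship.</p></div>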
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. CONCLUSION</head><p>In this study, we investigated the use of generative models to predict and augment sparse MDT reports for coverage estimation, using multi-dimensional cellular data under varying sparsity levels. We evaluated the authenticity of the synthetic data generated by several generative models and classical machine learning models using statistical measures and by observing the effect of that synthetic data on another ensemble ML model. Results show that, of the two deep learning-based generative models used in this study, GANs are better able to learn the intrinsic characteristics of the data and to improve AI-assisted data-driven network automation solutions even with little representative data, compared with several classical ML-based and traditional sampling approaches. MDT reports are a key enabler for ML-based zero-touch automation; however, their sparsity hinders their practical use for reliable ML model training. The presented framework offers a method to overcome this challenge, paving the way for the practical adoption of MDT reports for ML-based zero-touch automation.</p></div>
		</body>
		</text>
</TEI>
