<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Dynamic Tensor Recommender Systems</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>02/01/2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10236470</idno>
					<idno type="doi"></idno>
					<title level='j'>Journal of machine learning research</title>
<idno>1532-4435</idno>
<biblScope unit="volume">22</biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Yanqing Zhang</author><author>Xuan Bi</author><author>Niansheng Tang</author><author>Annie Qu</author><author>Animashree Anandkumar</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Recommender systems have been extensively used by the entertainment industry, business marketing and the biomedical industry. In addition to its capacity of providing preference-based recommendations as an unsupervised learning methodology, it has been also proven useful in sales forecasting, product introduction and other production related businesses. Since some consumers and companies need a recommendation or prediction for future budget, labor and supply chain coordination, dynamic recommender systems for precise forecasting have become extremely necessary. In this article, we propose a new recommendation method, namely the dynamic tensor recommender system (DTRS), which aims particularly at forecasting future recommendation. The proposed method utilizes a tensor-valued function of time to integrate time and contextual information, and creates a time-varying coeﬃcient model for temporal tensor factorization through a polynomial spline approximation. Major advantages of the proposed method include competitive future recommendation predictions and eﬀective prediction interval estimations. In theory, we establish the convergence rate of the proposed tensor factorization and asymptotic normality of the spline coeﬃcient estimator. The proposed method is applied to simulations, IRI marketing data and Last.fm data. Numerical studies demonstrate that the proposed method outperforms existing methods in terms of future time forecasting.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Recommender systems (RS) are widely used in our daily lives, such as for selecting movies, restaurants, news articles, or online shopping. As one of the information filtering techniques, RS can help users to find interesting items through combining several information sources, e.g., users' ratings and purchasing histories, item profiles and sales volumes, time, location, and companion or promotion strategies. Particularly, incorporating time is useful in RS since users' purchase behaviors are dynamic and often highly dependent on seasonal and time factors, and business sectors also rely on dynamic recommendations to track users' changing purchase interests over time. Thus, it is essential to capture information related to time and develop time-dependent RS, and we refer this as dynamic RS (DRS).</p><p>However, developing competitive DRS brings new challenges. First, since data are streaming in over time and are time-dependent, general RS methods which are not capable of capturing time-dependency features may have reduced recommendation accuracy. Second, forecasting future recommendations accurately is also a great challenge for DRS due to the complexity of changing users' interests. For example, users might like to watch news on weekdays, but watch movies on weekends. A shoe store sells more sandals in summer and more snow boots in winter. It is important to borrow information from historical data in developing trends. Many RS methods are not designed to capture trends and predict future recommendations. In addition, as data are streaming in over time, future recommendations could involve new users or new items, whose information is not available from historical data. This is also a common problem encountered in RS, referred as the "cold start" problem.</p><p>General RS approaches include content-based filtering and collaborative filtering (CF). Traditionally, content-based filtering methods recommend similar types of items by matching a user's preferred item profile with current item's profile (e.g., <ref type="bibr">Salter and Antonopoulos, 2006;</ref><ref type="bibr">Son and Kim, 2017)</ref>. In contrast, CF methods recommend items by predicting item ratings for the active user based on ratings from other similar users (e.g., <ref type="bibr">Herlocker et al., 2004;</ref><ref type="bibr">Luo et al., 2012)</ref>. On the basis of CF methods, research work related to DRS have been developed in recent years (e.g., <ref type="bibr">Koren, 2009;</ref><ref type="bibr">Gultekin and Paisley, 2014;</ref><ref type="bibr">Yu et al., 2016;</ref><ref type="bibr">Wu et al., 2017;</ref><ref type="bibr">Guo et al., 2018;</ref><ref type="bibr">Xiong et al., 2010;</ref><ref type="bibr">Rafailidis and Nanopoulos, 2014;</ref><ref type="bibr">Bi et al., 2018;</ref><ref type="bibr">Wu et al., 2019)</ref>. However, most of these methods can only make recommendations for observed discrete time points, and are not designed for future recommendation prediction on unobserved time points. <ref type="bibr">Liao et al. (2018)</ref> constructed dynamic tensors by means of combining tensors in tensor stream. <ref type="bibr">Song et al. (2019)</ref> used temporal matrix factorization to construct temporal recommender model assuming that users' current interests are transformed from the previous time step with a Markov property. <ref type="bibr">Liu and Ye (2020)</ref> proposed a dynamic three-way granularity recommendation based on matrix factorization. However, neither of these methods can handle higher-orders tensors. Moreover, these methods cannot make recommendations for future time points.</p><p>To make future recommendations, <ref type="bibr">Yu et al. (2016)</ref> developed a CF method incorporating a time series model, and <ref type="bibr">Wu et al. (2017</ref><ref type="bibr">Wu et al. ( , 2019) )</ref> proposed CF methods incorporating long short-term memory modeling, but they cannot deal with new users, items or contextual variables. <ref type="bibr">Xiong et al. (2010)</ref> used a Bayesian estimation procedure with a time-dependent constraint to estimate DRS for new users and items but cannot deal with new time points. <ref type="bibr">Bi et al. (2018)</ref> created an additional layer of nested latent factors for new time points, users and items. However, <ref type="bibr">Xiong et al. (2010)</ref> and <ref type="bibr">Bi et al. (2018)</ref> can only estimate the components of a tensor at fixed time points instead of at any time point in a continuous time interval. In addition, for forecasting at future time points, their methods may involve an increasing number of parameters if time is treated as an additional tensor mode, which could be computationally costly.</p><p>Currently, there are several dynamic recommender systems based on neural network approaches. For example, <ref type="bibr">Ko et al. (2016)</ref> used Gated Recurrent Units (GRUs) to build collaborative sequence model. <ref type="bibr">Devooght and Bersini (2017)</ref> utilized a long short-term memory (LSTM) method to address changes in the interests of a user. <ref type="bibr">Wei et al. (2017)</ref> utilized the stacked denoising autoencoder (SDAEs) to extract features of items. <ref type="bibr">Livne et al. (2019)</ref> applied a LSTM encoder-decoder network on sequences of contextual information. However, none of these methods are able to accommodate contextual information, and solve the "cold-start" problem simultaneously. Some of these methods may have obvious hysteresis in forecasting, which could influence the accuracy of recommendation.</p><p>In this article, we propose a tensor-valued function of time for estimating the DRS and build a new time-varying coefficient model based on tensor canonical polyadic decomposition (CPD) framework; namely, the dynamic tensor recommender system (DTRS). Specifically, we introduce a tensor-valued function of time with each mode corresponding to user, item or a contextual variable, where each component of the tensor is a function of time and has intra-cluster correlation. In the CPD framework, we build a time-varying coefficient model incorporating group information of time points, users, items and contexts. We approximate each coefficient function by a polynomial spline and employ group factors to explore homogeneous group effects. We adopt the weighted least square approach to incorporate intra-cluster correlation for more efficient estimation. In addition, we construct the prediction intervals of estimators of tensor components to forecast the confidence range of predicted values. In theory, we establish the convergence rate of the proposed tensor factorization and the asymptotic property of the spline parametric estimator.</p><p>The proposed method has two significant contributions. First, it can effectively provide recommendations for an entire future interval as opposed to a series of limited time points. This is because the proposed method integrates time dependency feature to the dynamic recommender systems using the time-varying coefficient model in tensor factorization to capture dynamic trends of recommender systems. In addition, the proposed method can achieve accurate forecasts for long time period through the spline extrapolation technique. Furthermore, the proposed subgroup factors extract homogeneous information from the same group to provide recommendation forecasting for future time points, and consequently solves the "cold start" problem.</p><p>Second, we establish the asymptotic distribution of the proposed estimators in that statistical inferences such as prediction interval can be formulated. In practice, it is desirable to know the upper and lower bounds for predictions, e.g., the highest possible cost, or the future sales volumes or revenues in the worst case scenario. However, existing methods on prediction intervals are mostly univariate or multivariate time series, and the prediction intervals for user-item-context interactions under a tensor framework have not been developed. In contrast, our approach allows prediction intervals for each element of a tensor-valued function, which provides a more complete picture of the dynamic recommender system over time. Our numerical studies also demonstrate that the proposed approach provides effective prediction interval estimators.</p><p>The remainder of the paper is organized as follows. Section 2 introduces the notation and background on tensor and tensor factorization. Section 3 presents the proposed method and its implementation. Theoretical properties are derived in Section 4. Section 5 presents simulation studies to assess the performance of the proposed approach. In Section 6, we apply the proposed method to the IRI marketing data and Last.fm data. Concluding remarks and discussion are provided in Section 7.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Notation and Background</head><p>In this section, we introduce some notation and the background of the tensor and classical DRSs. Throughout this article, we use blackboard capital letters for sets, e.g., T, I, small letters for scalars, e.g., x, y &#8712; R, bold small letters for vectors, e.g., x, y &#8712; R n , bold capital letters for matrices, e.g., X, Y &#8712; R n 1 &#215;n 2 , and Euler script fonts for tensors, e.g.,</p><p>A dth-order tensor is an array with d dimensions (d &gt; 2), which is an extension of a matrix to higher order. Here d represents the tensor's order. We denote the component </p><p>) is a n k -dimensional latent factor corresponding to the kth mode. That is, each component of the tensor is the product of the corresponding vector components: The canonical polyadic decomposition (CPD) is commonly adopted in tensor decomposition, which decomposes a tensor as a sum of r rank-one tensors. That is:</p><p>) is a n k -dimensional latent factor corresponding to the kth mode for k = 1, . . . , d; j = 1, . . . , r. Equivalently, each component of Y is</p><p>The CPD can be considered to be a higher-order generalization of matrix factorisation.</p><p>Figure <ref type="figure">1</ref> illustrates a matrix factorization of a matrix and a CPD of a third-order tensor.</p><p>An extensive review of tensors and other forms of tensor decomposition are discussed in <ref type="bibr">Kolda and Bader (2009)</ref>.</p><p>We can estimate &#952; via minimizing a loss function (e.g., L 2 loss). However, the non-convexity of the loss function could impose computational complexity due to numerical instability or even non-convergence <ref type="bibr">(de Silva and Lim, 2008;</ref><ref type="bibr">Frolov and Oseledets, 2017)</ref>. A common approach to alleviate the non-convexity problem is to introduce regularization. That is, an objective function with regularization as the following:</p><p>where Q is a loss function and J is a penalty function, such as L 2 , L 1 or L 0 penalties, or a fused Lasso.</p><p>Specially, the optimization problem solves &#952; * = arg min L(&#952;|Y), where &#952; * defines an optimal set of model parameters. In the case of squared loss function with an L 2 -penalty, the objective function is</p><p>where &#8226; F represents the Frobenius norm, and &#8486; = {(i 1 , i 2 , . . . , i d ) :</p><p>is a set of indices corresponding to the observed components. Notice that, in the context of RS, the set &#8486; may not contain all indices of the tensor components and could be a small fraction of the entire tensor size, since the majority of the tensor components could be missing. Major algorithms for implementing the optimization problem include the cyclic coordinate descent algorithm, the stochastic gradient descent method and the maximum block improvement algorithm <ref type="bibr">(Chen et al., 2012)</ref>. Following the tensor techniques, the classical DRSs can incorporate time as an additional mode of a tensor, that is, Y &#8712; R n 1 &#215;n 2 &#215;&#8226;&#8226;&#8226;&#215;n d &#215;T , where the last mode is a time mode at fixed time points {t 1 , t 2 , &#8226; &#8226; &#8226; , t T }. The classical DRSs use CPD to obtain component estimators, that is, where q &#8226;j = (q t 1 j , &#8226; &#8226; &#8226; , q t T j ) is a T -dimensional latent factor corresponding to the time mode. However, the classical DRSs can only estimate the values y i 1 i 2 </p><p>, the classical DRSs need to extend the time mode to future time points. However, this involves an increasing number of parameters over time which could be computationally infeasible. In addition, the classical methods only focus on the estimations of the tensor components but do not provide statistical inference, e.g., the estimation of prediction intervals. In practice, providing the upper and lower bounds of predictions are also crucial in decision making. In the following, we pursue an alternative approach to solve this problem.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">The Proposed Method</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">General Methodology</head><p>In this subsection, we develop the methodology for the proposed DTRS method. Specifically, we adopt the ideas of time-varying coefficient model framework to generalize the CPD to capture the trends of the DRS, and classify time points into subgroups to infer new time point trends through existing time points of the same group. We consider a dth-order tensor-valued function Y(t) &#8712; R n 1 &#215;n 1 &#215;...&#215;n d , where the value at time t is a d-dimensional array. The tensor set Y = {Y(t) : t &#8712; T} is the corresponding stochastic process defined on a compact interval T. Without loss of generality, let T be a closed interval <ref type="bibr">[0,</ref><ref type="bibr">1]</ref>. Figure <ref type="figure">2</ref> illustrates an example of a tensor-valued process with d = 3. In the DRS, the tensor-valued process could be the rating or sale volume of items or products from users or stores given contexts. We assume that time points can be categorized into different subgroups, where time points of the same group have common information. For example, in our numerical studies, time points in the same month from the twelve months of each year are categorized in the same group. In addition to time, we also categorize subjects from other modes into subgroups if they share similar characteristics, for example, stores of the same market and products of the same product category.</p><p>Given the subgroup labels, we assume that each component of Y(t) can be estimated:</p><p>where p k i k j and q k i k are the jth latent factor and the subgroup factor for the i k th subject from the kth mode, respectively. Here, h j (t) is a trend function of time for j, and g(t) = m d+1 e=1 g e (t)I(t &#8712; s e ), where I(t &#8712; s e ) is an indicator function and assigns the interval s e on the eth subgroup, g e (t) is a trend function corresponding to the eth subgroup, and m d+1 is the number of subgroups for time. We have q</p><p>if the i k th and i k th subjects are from the e k th subgroup (e k = 1, 2, . . . , m k ), where q k (e k ) is the subgroup factor associated with the e k th subgroup, and m k is the number of subgroups for the kth mode. We denote the set of observed time points for the component</p><p>and the number of components of this set by</p><p>We assume that the covariance matrix is cov(y i 1 i 2 ...i d ) = &#931; 0 i 1 i 2 ...i d , typically not an identity matrix due to the intra-cluster correlation arising from repeated observed data.</p><p>Equation (1) adopts the idea of varying-coefficient models to create a CPD for tensor data. Varying-coefficient models are a useful tool to explore dynamic patterns, and have been applied to modeling and predicting longitudinal, functional and time series data <ref type="bibr">(Huang and Shen, 2004;</ref><ref type="bibr">Fan and Zhang, 2008)</ref>. Based on the varying-coefficient models, through the equations (1), we can obtain estimators of the component of tensor-value function at any time points in a continuous time interval (e.g., t &#8712; (a, b)) instead of at fixed time points as in the DRS approaches (e.g., <ref type="bibr">Xiong et al., 2010;</ref><ref type="bibr">Bi et al., 2018)</ref>. The first part of equation ( <ref type="formula">1</ref>) is an individual-level factor model which takes into account the heterogeneity of subjects and trend of time, and the time-varying coefficients h j (t) (j = 1, . . . , r) reflect the dynamic features. The second part of equation ( <ref type="formula">1</ref>) is a subgroup-level factor model to capture common features from the same subgroups, where the subgroup factors can accommodate new subjects from any mode at future time points, and the g(t) allows time variables to follow a subgroup function of time such that we can predict future time points via borrowing information from existing time points of the same group.</p><p>To capture these trend functions, we adopt the polynomial splines to approximate h j (t) and g e (t). Let {&#957; ji } a N i=1 be interior knots within T, and &#933; j be a partition of T with a N knots, that is &#933; j = {0 = &#957; j0 &lt; &#957; j1 &lt; &#8226; &#8226; &#8226; &lt; &#957; ja N &lt; &#957; ja N +1 = 1} for j = 1, 2, . . . , d. The polynomial splines of an order &#954;+1 are functions with &#954;-degree of polynomials on intervals [&#957; ji-1 , &#957; ji ) for i = 1, 2, . . . , a N and [&#957; ja N , &#957; ja N +1 ], and have &#954; -1 continuous derivatives globally. Denote a spline bases vector of the space of such spline functions as B j (t) = (B j1 (t), . . . , B jM (t)) , where M = a N + &#954; + 1 as the number of spline bases. The function h j (t) (j = 1, 2, . . . , d) can be approximated by</p><p>where &#945; j = (&#945; j1 , &#945; j2 , . . . , &#945; jM ) is a coefficient vector. Spline functions can be B-spline or truncated polynomial functions. For example, for the truncated polynomial function,</p><p>Similarly, let {&#969; ei } a N i=1 be interior knots within T,</p><p>, and A e (t) = (A e1 (t), . . . , A eM (t)) be a vector of spline bases for e = 1, 2, . . . , m d+1 . The g e (t) can be approximated by</p><p>where &#946; e = (&#946; e1 , &#946; e2 , . . . , &#946; eM ) . Based on equation (1), the prediction can be obtain as follows</p><p>where &#285;(t) = m d+1 e=1 &#285;e (t)I(t &#8712; s e ). The equation ( <ref type="formula">2</ref>) can capture trends of the DRS sufficiently through the polynomial spline approximations of time-varying coefficient functions. In addition, since the spline approximation is computationally fast <ref type="bibr">(Xue and Yang, 2006)</ref>, the equation ( <ref type="formula">2</ref>) can achieve the spline estimates of the coefficients efficiently, and this is especially advantageous in estimating high-dimensional parameters in RS. In contrast to these approaches like <ref type="bibr">Xiong et al. (2010)</ref> and <ref type="bibr">Bi et al. (2018)</ref>, equation ( <ref type="formula">2</ref>) can achieve forecasting at any future time points without requiring an increasing number of parameters over time. Note that the proposed method does not require the same number of knots and the same degree polynomial for either trend functions. In order to reduce the computational cost, we fixed the same numbers of knots and the same degree polynomial. We can also adopt different number of knots or different degree polynomial for different trend functions g(t) and h j (t) respectively, or apply existing methods <ref type="bibr">(Van Loock et al., 2011;</ref><ref type="bibr">Yuan et al., 2013;</ref><ref type="bibr">Dung and Tjahjowidodo, 2017)</ref> to identify the number of knots.</p><p>Due to the intra-cluster correlation, it is important to incorporate intra-cluster correlation into RS. However, in practice, the covariance matrix &#931; 0 i 1 i 2 ...i d is often unknown. We adopt an invertible working covariance matrix, denoted as</p><p>(1) , . . . , q k (m k ) ) , and k = 1, . . . , d. Define &#952; = {P, q, &#947;} as parameters of interest. Considering the intra-cluster correlation and non-convexity problem, we define the following weighted penalized objective function:</p><p>(3) where &#955; is the penalized parameter, &#8486; = {(i 1 , i 2 , . . . , i d ) :</p><p>&#8226; 2 is the Euclidean norm, and</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>and can be modeled as</head><p>Some commonly used working correlation structures include independence, exchangeable, and first-order autoregressive process (AR-1), among others. Given a working correlation structure, the working correlation matrix depends on fewer nuisance parameters which can be estimated by the residual-based moment method <ref type="bibr">(Liang and Zeger, 1986)</ref>. The proposed method is robust to the misspecification of correlation structure as indicated by our numerical examples.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Parameter Estimation</head><p>In this subsection, we discuss parameter estimation by minimizing (3). Let</p><p>) is observed at some t given i k } be the set of indices with the fixed kth mode index i k , where the corresponding components are observed at some time points. We assume that the number of observations for each time subgroup s e is larger or equal than 2 for e = 1, . . . , m d+1 , and the number of observations for each subgroup e k from the kth mode is larger or equal than 2 for e k = 1, . . . , m k ; k = 1, . . . , d.</p><p>The partial derivatives of the objective function (3) have explicit forms with respect to the individual factors, the subgroup factors and the spline coefficients, which makes it feasible to apply the blockwise coordinate descent approach (BCD). That is, for i k = 1, . . . , n k and k = 1, . . . , d,</p><p>In fact, the estimation procedure of p k i k in ( <ref type="formula">4</ref>) is a ridge regression, and does not require knowing</p><p>and p k n k efficiently. The minimization of L(&#952;|Y) can be done cyclically through estimating P, q, &#945; and &#946;. Notice that &#8486; = &#8746; n k i k =1 &#8486; k i k , and it is possible that &#8486; k i k is empty for certain i k 's, that is, there is no observation on the subject i k . Under this circumstance, the individual factor of the i k subject is assigned as p k i k = 0 0, and the predicted values may degenerate to the subgroup-level factor model by utilizing information from members of the same subgroup.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Implementation</head><p>In the following, we discuss several implementation issues. To solve the objective function (3), we incorporate the maximum block improvement (MBI) strategy <ref type="bibr">(Chen et al., 2012)</ref> into the BCD algorithm cyclically as in <ref type="bibr">Bi et al. (2018)</ref>. The MBI has two advantages over traditional cyclic BCD algorithms. First, it has a good algorithmic property which guarantees convergence to a stationary point, whereas traditional BCDs may end up with certain points where the criterion function ceases to decrease <ref type="bibr">(Chen et al., 2012)</ref>. Second, the MBI has the capability of choosing descending directions and hence has the possibility to discover "shortcuts", which may reduce the computational time significantly. Let &#952; l be an estimator of &#952; at the lth iteration, &#952; a be a subset of &#952;, &#952; c be the complementary set of &#952; a , and &#952; * a be the attempted update of &#952; a . The improvement of the &#952; * a is defined as</p><p>We summarize the implementation of the specifical algorithm as follows.</p><p>Algorithm Implementation Algorithm 1: (Initialization) Input all observed y i 1 i 2 &#8226;&#8226;&#8226;i d (t)'s, the number of factors r, tuning parameter &#955;, initial value &#952; 0 and a stopping criterion &#949; = 10 -4 . 2: (Individual factors update) At the lth iteration, estimate {P 1 , P 2 , &#8226; &#8226; &#8226; , P d , &#945;}.</p><p>(i) For each P k , solve (4) through parallel computing and obtain P k * . Then calculate J P k * through (8).</p><p>(ii) For &#945;, solve (6) and obtain &#945; * . Then calculate J &#945; * through (8).</p><p>(iii) Assign</p><p>3: (Subgroup factors update) At the lth iteration, estimate {q (1) , q (2) , &#8226; &#8226; &#8226; , q (d) , &#946;}.</p><p>(i) For every q (k) , solve (5) and obtain q (k) * . Then calculate J q (k) * through (8).</p><p>(ii) For &#946;, solve (7) and obtain &#946; * . Then calculate J &#946; * through (8).</p><p>Set the final estimator &#952; = &#952; l . Otherwise set l &#8592; l + 1 and go to step 2.</p><p>To select tuning parameter &#955;, we search the one from grid points minimizing the root mean square error on the validation set, defined as</p><p>, where &#915; is the set of indices and times of observed data. We choose the number of individual latent factors r such that it is sufficiently large and leads to stable estimation. In general, the r is no smaller than the theoretical rank of the tensor in order to represent subjects' latent features sufficiently well, but not so large as to over-burden the computational cost.</p><p>An appropriate selection of the knot sequence is important to efficiently implement the proposed method. In practice, knot locations are usually chosen to be equally-spaced over the range of data or placed at evenly-spaced quantiles of data. Since there are highdimensional factor parameters, for simplicity we set the number of knots to be the integer part of N 1/(2&#954;+3) , where N = |&#8486;| and &#954; is the degree of polynomials. One can also choose other methods to select the number of knots such as the AIC or BIC procedures <ref type="bibr">(Xue and Yang, 2006)</ref>. The degree of polynomials &#954; is commonly chosen as 1, 2, or 3. In our numerical study, we set &#954; = 2 and adopt truncated polynomial bases. One can also use different degrees and spline bases for different time-varying coefficients.</p><p>Another important issue is in selection of contextual variables as tensor modes. In practice, the chosen number of contexts is often pre-specified based on domain knowledge. A contextual variable can be considered an additional tensor mode of a higher-order tensor if users' and items' behaviors are distinctive under different values of the contextual variable.</p><p>On the one hand, a higher-order tensor with more contextual variables allows higher-order interactions and hence provides more accurate estimation. On the other hand, a higherorder tensor entails more complex and intensive computation, and may lead to overfitting. It is not suggested to assign too many contextual variables as additional tensor modes, which remains open to discussion regarding the number of contextual variables. In our numerical studies, promotion strategies are incorporated as a contextual variable, since users' and items' behaviors are distinctive under different promotion strategies. In general practice, however, we assume that the order of a tensor can be determined based on prior knowledge.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Theoretical Properties</head><p>In this section, we provide asymptotic properties for the proposed method and the estimation of prediction intervals. Specifically, we establish the convergence rate of the proposed tensor factorization and the asymptotic normality of the spline coefficient estimator. Following asymptotic normality, we can also construct the estimation of the prediction interval of the component. Note that identifiability is critical for tensor representation. We first present the sufficient conditions to ensure identifiability of the proposed tensor modeling as follows.</p><p>Proposition 1 If d k=1 K k &#8805; 2r + d + 1 holds, minimizers of L(P, q, &#945;, &#946;|Y) in P, q, &#945; and &#946; given fixed spline bases are unique up to permutation almost surely, where K k is the Kruskal rank of (P k , q k ), and</p><p>Proposition 1 shows that the proposed tensor modeling is identifiable up to permutation almost surely. To address permutation indeterminacy, we could align the factors according to a descending order of the first row of mode-1 factor matrix P 1 , that is,</p><p>1r , following the method in <ref type="bibr">Zhang et al. (2014)</ref>. The rearrangement can be implemented during or after the proposed algorithm, since it does not affect the estimation procedure. In the rest of Section 4, we assume that the parameters are identifiable.</p><p>Let |&#215;(r+1) be the matrix consisting of f (t) for all t &#8712; T i 1 i 2 &#8226;&#8226;&#8226;i d . Considering random errors based on the equation (1), we denote</p><p>is a random error with mean zero and finite variance.</p><p>..i d . Thus, the corresponding vector form is</p><p>Let J(U) be a non-negative penalty function of U. The overall criterion given h j (&#8226;) and g(&#8226;) is redefined as</p><p>(9) for U &#8712; S, where S is the parameter space for U.</p><p>Based on the proposed method, y i 1 i 2 ...i d can be rewritten as</p><p>and</p><p>..i d |&#215;M for j = 1, 2, . . . , r, e = 1, 2, . . . , m d+1 . By the approximation theory <ref type="bibr">(de Boor, 2001)</ref>, there exists a constant C &gt; 0, the spline functions h j (t) = &#945; 0j B j (t) and g e (t) = &#946; 0e A e (t) such that sup t&#8712;T |h j (t) -h j (t)| &#8804; Ca -&#958; N and sup t&#8712;T |g e (t) -g e (t)| &#8804; Ca -&#958; N for any j = 1, . . . , r, e = 1, . . . , m d+1 . Denote &#947; 0 = (&#945; 0 , &#946; 0 ) , and let N = |&#8486;| be the number of components of the set &#8486;, &#955; min {&#8226;} and &#955; max {&#8226;} be the smallest and largest eigenvalues of any symmetric matrix, respectively. We require the following regularity conditions to establish the asymptotic properties.</p><p>(C1) The functions h j (&#8226;) and g e (&#8226;) are &#958;th-order continuously differential for some &#958; &#8805; 2, all j = 1, . . . , d, and e = 1, . . . , m d+1 . The density function of design points t is absolutely continuous and bounded away from zero and infinity on a compact support T.</p><p>(C2) The knots sequences &#933; j and &#915; e are quasi-uniform for j = 1, . . . , d and e = 1, . . . , m d+1 ; that is, there exists a constant c &gt; 0, such that max j=1,...,d</p><p>(C3) There exist positive constants &#963; 2 1 and &#963; 2 2 such that the covariance matrix</p><p>(C4) There exist some positive constants c 1 and c 2 such that c 1 &#8804; &#955; min {&#931; -1</p><p>for 0 &#8804; &#964; /2 &lt; &#965; &#8804; &#964; &lt; 1, and &#955; = o p (1).</p><p>Conditions (C1)-(C3) are standard in the polynomial spline framework. Similar conditions are also presented in <ref type="bibr">Huang (2003)</ref> and <ref type="bibr">Claeskens et al. (2009)</ref>. In particular, condition (C1) imposes a smoothness condition of trend functions and a mild condition on time density, and guarantees that the observation time points are randomly scattered. Condition (C2) indicates that the adjacent distances among the knot sequence are comparable. Condition (C3) implies that the eigenvalues of random errors are bounded. Condition (C4) implies that the difference between the working covariance and true covariance matrices is bounded. Condition (C5) implies that the number of the observed time points grows as the number of the observed components of the tensor increases, to ensure the convergence of the proposed tensor factorization. The following theorem establishes the convergence rate for the proposed tensor factorization.</p><p>Theorem 2 Under conditions (C1)-(C5), if the penalty function J(U) has bounded first and second derivatives at true parameter U 0 , as N &#8594; &#8734;, on a &#948;-ball centered at U 0 for some &#948; &gt; 0, there exists a minimizer U of (9) such that</p><p>Theorem 2 provides the convergence rate of the proposed method given trend functions. When &#964; = &#965;, that is, T max and T min have the same order, the convergence rate of the estimator U reaches the optimal rate N -1/2 . Meanwhile, if the order of T max is &#8730; N faster than that of T min , that is, &#964; -&#965; = 0.5, then U will not converge to the true U 0 . This implies that to guarantee consistency of the tensor factorization, one should collect sufficient observations even for the least popular user-item-context combinations. In the following theorem, we establish the asymptotic property of the spline coefficient estimator.</p><p>Theorem 3 Under conditions (C1)-(C5), if lim N &#8594;&#8734; a N log a N /N = 0 and lim N &#8594;&#8734; a -&#958; N N &#964; = 0, then for any vector c whose components are not all zero, the parametric estimator &#947; by ( <ref type="formula">6</ref>) and ( <ref type="formula">7</ref>) satisfies</p><p>Theorem 3 establishes the asymptotic normality of the spline coefficient estimator. The convergence rate of the spline coefficient estimator is O p (a N N -1+&#964; -2&#965; ). If T max and T min have the same order, var{c ( &#947;&#947; 0 )} = O p (a N /N 1+&#965; ), and similar results can be found in <ref type="bibr">Huang et al. (2004)</ref>. The asymptotic variance in Theorem 3 depends on the working covariance matrix and the true covariance matrix. When the working covariance matrices are equal to the true covariance matrices, the asymptotic variance of the proposed estimator reaches the minimum in the sense of Lower order and the proposed estimator is asymptotic efficient.</p><p>More importantly, the result of Theorem 3 is the key foundation for constructing prediction intervals. First, we derive the standard error for the spline parametric estimates given a fixed &#955; using the sandwich covariance formula Cov( &#947;) = ( &#936;+&#955;I) -1 &#934;( &#936;+&#955;I) -1 , where &#936; =</p><p>, &#8855; operation is the vector operation a &#8855;2 = aa , and I is an identity matrix. Since &#375;i 1 i 2 ...i d (t) = w i 1 i 2 ...i d t &#947;, and w i 1 i 2 ...i d t is the tth column of estimator W i 1 i 2 ...i d , a 100(1 -&#963;)% prediction interval <ref type="bibr">(Chatfield, 1993)</ref> </p><p>where &#966; &#963;/2 is the 100(1 -&#963;)th percentile of the standard normal distribution, and the var{e i 1 i 2 ...i d (t)} is the variance of the prediction error and can be estimated as:</p><p>The first term in equation ( <ref type="formula">11</ref>) is due to estimation error, and the second term can be estimated by the mean squared error on training data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Simulation Studies</head><p>In this section, we perform simulation studies to compare the proposed method (DTRS) with competing methods, including Bayesian probabilistic tensor factorization (BPTF, <ref type="bibr">Xiong et al., 2010)</ref> and the recommendation engine of multilayers (REM, <ref type="bibr">Bi et al., 2018)</ref>. We assess forecasting performance via examining the root mean square error (RMSE) and the mean absolute error (MAE), where the RMSE is defined as </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Simple Tensor Function</head><p>In the simulation, we consider a third-order tensor function of time with user, context and item modes. We set the numbers of users, contexts and items to n 1 = 100, n 2 = 9, and n 3 = 100, respectively. We assume that users, contexts, items and time points are from m 1 = 10, m 2 = 3, m 3 = 10 and m 4 = 4 subgroups, respectively. Users, contexts, items and time points are evenly assigned to each subgroup. The number of latent factors is set as r = 3. We generate tensor functions at time points t &#8764; U (0, 1) by generating its components as</p><p>, where the latent factors p k i k &#8764; N (0, I r ), trend functions h 1 (t) = sin(0.3&#960;t), h 2 (t) = 8t(1 -t) -1 and h 3 (t) = cos(0.2&#960;t) + 1. To distinguish different subgroups, we set the subgroup factors as a simple sequence, where q 1 (e 1 ) = -1+0.4e 1 , q 2 (e 2 ) = -1.2+0.6e 2 and q 3 (e 3 ) = -0.4 + 0.2e 3 for e k = 1, . . . , m k and k = 1, 2, 3. The function g(t) = m 4 e=1 g e (t)I(t &#8712; s e ), where g 1 (t) = 2t -1, g 2 (t) = 8(t -0.5) 3 , g 3 (t) = sin(0.1&#960;t) + cos(&#960;t), and g 4 (t) = -5 exp(t) + 10. The error &#949; i 1 i 2 i 3 = (&#949; i 1 i 2 i 3 (t 1 ), . . . , &#949; i 1 i 2 i 3 (t T )) follows a multivariate normal distribution with mean 0 and a common marginal variance 1, and the correlation structure is either independence or AR-1 with correlation &#961; = 0.85.</p><p>In each simulation, we consider the number of time points as T = T 1 + T 2 , where the tensor data in the first T 1 = 12 time points are set as the training data, and the tensor data in the last T 2 time points are used as the testing data. For evaluating the forecasting performance at future time points, we consider T 2 = 8 or 12. Considering the missing case, we generate n 1 n 2 n 3 T (1 -&#960; m ) components out of the tensor functions, where &#960; m is the missing percentage and set as 80%. Furthermore, we use &#960; cs = 30% to represent the proportion of new items in the testing data unavailable from the training set. To illustrate the effect of incorporating intra-cluster correlation on estimation efficiency, we compare the estimation efficiency of the proposed methods using different working correlation structures: independent or AR-1, denoted as DTRSin and DTRSar, respectively.</p><p>According to <ref type="bibr">Xiong et al. (2010)</ref> and <ref type="bibr">Bi et al. (2018)</ref>, BPTF and REM methods model fourth-order tensor with user, context, item and time modes. For all methods, we assume that the subgroup structure and the number of latent factors are known. For REM and the proposed methods, the tuning parameter &#955; is pre-selected from grid points ranging from 0 to 20. The validation set is the data from the last four time points of the training set. For BPTF, we keep the remaining parameters by their default choices and obtain a forecast  <ref type="formula">2019</ref>), we also consider BPTF incorporating basic exponential smoothing (Holt's method, Holt, 2004) and double exponential smoothing <ref type="bibr">(Holt-Winters method, Holt, 2004;</ref><ref type="bibr">Winters, 1960)</ref>. That is, we first use BPTF to estimate the factor matrices of user, item, context and time in the training data, and then forecast the factor matrix of time at given time points of the testing data via basic exponential smoothing or double exponential smoothing, denoted as BPTF basic and BPTF double , respectively. All methods are replicated by 100 simulation runs.</p><p>Table <ref type="table">1</ref> provides the estimation results of all methods. We observe that the proposed method has better performance when the working correlation structure is the same as the true correlation structure. When the true correlation structure is independence, the DTRSin has smaller RMSE and MAE than the DTRSar, with more than 2.17% improvement. Similarly, when the true correlation structure is AR-1, the DTRSar outperforms the DTRSin. Moreover, the PICPs of the DTRS method are close to 0.95, which implies that the proposed method provides accurate prediction intervals, whereas the competing methods are not able to provide such prediction intervals. For the performance of forecasting, we observe that the DTRSin and DTRSar outperform other methods across all settings. Specifically, both DTRS methods improve the RMSE and MAE of the REM by more than 40%, and those of the BPTF bayes , BPTF basic and BPTF double by more than 60%, 24%, and 41%, respectively. In this setting, the BPTF basic performs better than the BPTF double . This is probably because the basic exponential smoothing method is more applicable in forecasting time series with no clear trend or seasonal pattern, whereas the double exponential smoothing method performs better when a trend is present <ref type="bibr">(Holt, 2004)</ref>. Although the BPTF basic and BPTF double perform better than the BPTF bayes , the proposed method is still able to beat the best of the BPTF variations. This indicates that the proposed method provides more accurate forecasting compared to other methods.</p><p>To illustrate the specific performance for forecasting at each time point, we calculate the MAE at each time point and provide box plots for the MAE in Figures <ref type="figure">3</ref><ref type="figure">4</ref>. We observe that the performance of the proposed method is relatively robust against time in all settings. The MAEs of both DTRS methods at any time point are the lowest. The proposed method outperforms other methods, especially for long-term forecasting, indicating that it can handle both short-term and long-term forecasting accurately.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Long Forecasting Time Period</head><p>In this simulation study, we evaluate the performance of the proposed method with a vast set of users and items and time period. Moreover, we also report the average computational time (ComTime) in seconds for each method based on 50 repetitions. All experiments are implemented using Window 10 with 1.99 GHz Intel Core i7 Processor and 16 GB memory.</p><p>We consider a third-order tensor function of time with user, context and item modes. We set the numbers of users, contexts and items to n 1 = 1000, n 2 = 30, n 3 = 10000, respectively. The number of time points is T = T 1 +T 2 with the number of the training time points T 1 = 80 and the number of the testing time points T 2 = 8, 12, 24, 32. Other setting is similar to the setting in Section 5.1. We consider that the error follows a multivariate normal distribution with mean 0 and the independent structure. We perform the proposed method and the compering methods as in Section 5.1. The simulation results are summarized in Table <ref type="table">2</ref>.</p><p>Table <ref type="table">2</ref> shows that the proposed method outperforms other methods across all settings. Specifically, both DTRS methods improve the RMSE and MAE of the REM by more than 17%, and those of the BPTF bayes , BPTF basic and BPTF double by more than 39%, 31% and 39%, respectively. Moreover, the proposed method is robust against longer forecasting time periods comparing with other methods. Specifically, we observe that the REM is more efficient computationally than the proposed method but the accuracy of the REM is lower than the proposed method. This is probably due to that the REM treats time as a mode of tensor, with more parameters involved as the time increases, and leads to more errors of estimation. Due to the demand of Gibbs sampling, the BPTF bayes , BPTF basic and BPTF double require longer computational time and larger menory storage. In additional, as the time increases, the BPTF bayes needs to involve more parameters in time factors which leads to low computational efficiency and low accuracy in forecasting. Although the BPTF basic and BPTF double do not have increasing number of parameters as the time increases due to the exponential smoothing method, the Gibbs sampling is still time-consuming. Reducing the number of Gibbs samples may lead to less computational time but with less accurate predictions. For our method, the DTRSar requires to estimate the covariance matrix, and the DTRSar requires more computational time than the DTRSin. Nevertheless, the DTRSar has higher accuracy than other competing methods in terms of RMSE and MAE. Overall, the proposed method performs well in term of computational efficiency and forecasting accuracy. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Robustness under Cluster Misspecification</head><p>In this simulation study, we study the robustness of the proposed method when the clusters are misspecified.</p><p>We follow the same data-generating process as in Section 5.2, but allow the cluster assignment to be misspecified. Specifically, we misassign users, contexts and items to adjacent clusters with 0%, 10% and 30% chance. The Adjacent clusters are defined as the clusters with the closest group effects. This definition of adjacent clusters reflects the real-data situation (e.g., a facial tissue might be misassigned as paper towels, but less likely to be misassigned as yogurt). We also compare the proposed method with the REM method which also consider the subgroup information.</p><p>The simulation results based on 50 replications are summarized in Table <ref type="table">3</ref>. Table <ref type="table">3</ref> shows that in general the proposed method is robust against the misspecification of cluster. In comparison with the results when 0% of the cluster members are misclassified, the proposed method is more robust than the REM method in all settings under 10% and 30% cluster misspecification rate. For example, the proposed method under the 10% misspecification rate is 2.9% worse than the proposed method without misspecification in terms of the RMSE under T 2 = 8; and is 3.6% worse than one without misspecification under T 2 = 8. However, the REM method under the 10% and 30% misspecification rates is 31.7% worse than the REM method without misspecification in terms of the RMSE under T 2 = 8.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Empirical Examples</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">IRI Marketing Data</head><p>In this section, we focus on sales data at drug stores from the IRI Marketing Data <ref type="bibr">(Bronnenberg et al., 2008)</ref> to illustrate the performance of the proposed method. The original IRI data is an immense collection of consumer panel data and store sales at grocery stores, drug stores and mass-market stores over the years 2001-2011. The store sales data contain weekly product sales volumes, pricing, and promotion data for all items from 31 product categories sold in 50 U.S. markets. These markets are geographic units defined typically as an agglomeration of counties, usually covering a major metropolitan areas (e.g., Chicago, IL) but sometimes covering just part of a region (e.g., New England). A detailed description of an early version of the data is available in <ref type="bibr">Bronnenberg et al. (2008)</ref>.</p><p>To illustrate the proposed method, we choose sales data at drug stores collected from 2001 to 2011, where there are sales volume records, recorded times, promotion strategies, 43,631 product IDs, and 471 drug store IDs. These drug stores are from 50 markets across the United States. The products include items sold from these stores during the 11-year period, and are from 31 product categories, including hot dogs, household cleaners, margarine/butter blends, mayonnaise, milk, coffee, cigarettes, photography supplies, paper towels, frozen pizza, toilet tissue, yogurt, beer/ale/alcoholic cider, blades, cold cereal, carbonated beverages, diapers, deodorant, facial tissue, frozen dinners/entrees, laundry detergent, peanut butter, razors, mustard and ketchup, sugar substitutes, spaghetti/Italian sauce, soup, shampoo, salty snacks, toothpaste, and toothbrush. Moreover, various advertising and promotions strategies are imposed on these products to attract consumers. The promotions strategies have 30 types which are combinations of 5 advertisement features, 3 types of merchandise display, and an indicator on whether the product has a price reduction of more than 5%.</p><p>The goal of our study is to predict the future sales volumes of each product from each store given each promotion strategy based on historical sales data. Through this prediction procedure, we are able to estimate future purchases, evaluate the influence of promotion strategy for product sales, and potentially recommend the most profitable products to store managers, so the company can make wiser decisions on marketing strategies and inventory planning. For the IRI marketing data, a personalized suggestion refers to the recommendation of potentially profitable products to store managers. Statistically this can be viewed as predicting future sales volumes of each product from each store based on historical sales data. There are abundant literature on product recommender systems including, but not limited to, <ref type="bibr">Giering (2008)</ref>, <ref type="bibr">Xiong et al. (2010)</ref> and <ref type="bibr">Yu et al. (2016)</ref>. In these works, similar to the proposed IRI data analysis, recommendations of products to stores are also considered. For considering the trend of product sales, we aggregate the weekly data into monthly data according to the record time information so that the data contain more than 79.2 million sales records for 132 months from the beginning of 2001 to the end of 2011. For the proposed method, we classify stores, products, observed time points and promotion strategies into subgroups based on their markets, product categories, month of the year and whether a price reduction is applied, respectively.</p><p>Table <ref type="table">4</ref> shows the summary statistics of the data. According to the proposed method, the data can be reframed into monthly third-order tensors by store, product and promotion.   According to the given structure of monthly third-order tensors, the total number of sale records could be up to 471 &#215; 30 &#215; 43631 &#215; 132 &#8776; 8.1 billion. Although the observed records are more than 79.2 million, there are still a large number of store-promotion-product-month combinations which are associated with unknown sales volumes. The sales data have a 99.9% missing rate and are highly sparse, which renders a particular challenge for recommender systems and forecasting.</p><p>For comparison, we implement and report the performances of the proposed methods with different working correlation matrices and the competing methods as in Section 5. For all methods, we select the number of latent factors r ranging from 3 to 30. For the REM method and the proposed methods, we select a tuning parameter &#955; from 1 to 29. For the BPTF methods, we use the default values of the remaining parameters. For selecting the above parameters, we set the data from the beginning of 2001 to the end of 2009 (i.e., the first 108 months) as the training set and the data from the beginning of 2010 to the end of 2010 as the validation set, and then tune these parameters through minimizing the root mean square error on the validation set. We randomly sample 80% of the data from 2001 to 2010 as a training set and 20% of the data from the entire year of 2011 as the testing set. The random sampling is replicated 50 times.</p><p>Table <ref type="table">5</ref> shows the forecasting results produced by each method. The results of the DTRSar are similar to ones of the DTRSin and are omitted. From Table <ref type="table">5</ref>, we observe that the DTRSin method achieves coverage probabilities of the prediction interval close to 95%, implying that the proposed method can estimate the prediction interval accurately, whereas the competing methods cannot provide such prediction intervals. In addition, the DTRSin method has the lowest RMSE and MAE. For example, DTRSin improves on the RMSE and MAE of the BPTF bayes by 39.95% and 12.82%, and of the REM by 18.97% and 7.44%, respectively. The BPTF basic and BPTF double perform better than the BPTF bayes , while the BPTF double performs better than the BPTF basic . However, the proposed method still outperforms the BPTF basic and BPTF double .</p><p>To illustrate the specific performance for forecasting at each time point, we calculate the RMSE and MAE at each time point and provide box plots for the RMSE and MAE in Figure <ref type="figure">6</ref>. We observe that the performances of the DTRSin, BPTF basic and BPTF double are more robust than those of the REM and BPTF bayes . However, the RMSEs of the DTRSin method at each time point are still lower than those of the BPTF basic and BPTF double methods. The proposed method outperforms other methods for forecasting at each time point, and can deal with long-term forecasting accurately.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Last.fm Data</head><p>In this section, we analyze the Lastfm-1K dataset collected by Last.fm API <ref type="bibr">(Celma, 2010)</ref> to evaluate the performance of the proposed method. The dataset is available at <ref type="url">http: //ocelma.net/MusicRecommendationDataset/lastfm-1K.html</ref> and has been widely used in music recommendation experiments. The Lastfm data includes the listening history of 992 users and songs played daily, recorded by quadruples with user, timestamp, artist and song information, where the users' profiles contain gender, age, country and signup, and artists contain 107,528 artists with ID and 69,420 without ID. A detailed description of the Lastfm-1K dataset is available at <ref type="url">http://ocelma.net/MusicRecommendationDataset/ lastfm-1K.html</ref>.</p><p>We extract the user-artist-song playcount tensor-valued function with each monthly time point based on the quadruples records. The goal of our study is to predict the future playcount of each song given each artist for each user. Through this prediction procedure, we are able to estimate future listening habit for each user, so that to recommend interesting songs to each user. To evaluate the performance of the proposed method, we consider a sub-dataset with 100 users randomly, where there are 7,490 artists, 32,287 songs and 53 months from February of 2005 to June of 2009. We classify users, song, time stamp and artists into subgroups based on users' gender, a song's artist, month of the year, and whether the artist has ID, respectively. Although the observed records have been 356,786, there are still a large number of user-artist-song-month combinations which are associated with unknown playcount. The data have a high missing rate and are highly sparse.</p><p>Similar to Section 6.1, we implement and report the performances of the proposed method and the competing methods. We randomly sample 80% of the data from February of 2005 to May of 2008 (i.e., the first 40 months) as a training set and 20% of the data from June of 2008 to June of 2009 (i.e., the last 13 months) as the testing set. The random sampling is replicated 50 times. Table <ref type="table">6</ref> shows the forecasting results produced by each method and the computational time.</p><p>Table <ref type="table">6</ref> indicates that the DTRSin method achieves coverage probabilities of the prediction interval close to 95%, which supports that the proposed method can obtain accurate prediction interval estimator, whereas the competing methods cannot provide such prediction interval. Table <ref type="table">6</ref> also shows that the proposed method achieves the best performance. Specifically, the proposed method improves the compering methods by at least 8.02% with respect to the RMSE, and by at least 10.05% with respect to the MAE, and still achieves the smallest standard error with a reasonable computational time.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Discussion</head><p>In this article, we propose a new dynamic tensor recommender system which incorporates time information through a tensor-valued function. A unique contribution of our method is that it can estimate recommendation accurately given any time point in continuous time intervals. Technically, the proposed method builds a time-value tensor decomposition model and borrows group information from existing time points of the same group for higher forecasting accuracy. Moreover, the proposed method utilizes the polynomial spline method and the weighted least squared method to incorporate time-dependency and intra-cluster correlation into the DRS. The spline extrapolation enables our method to achieve both short-term and long-term forecasts accurately, as confirmed in the numerical studies. In addition, the proposed method is able to provide pointwise prediction intervals based on the established asymptotic property, while existing recommender systems are not equipped with prediction intervals. In theory, we demonstrate that the proposed decomposition achieves asymptotic consistency on prediction and the spline coefficient estimators have asymptotic normality. In addition, we show the numerical advantages of our method compared to existing methods, including the analysis of IRI marketing data.</p><p>Then (12) implies that &#964; = 1 and &#957; j almost surely, j = 1, . . . , r. Similarly, suppose there exist some k 1 = 1, . . . , d, such that p k 1 &#8226;j = &#957; j p k 1 &#8226;j , &#945; j = &#945; j /&#957; j , q k 1 = &#964; q k 1 , &#946; e = &#946; e /&#964; for positive constants &#964;, &#957; j , j = 1, . . . , r; e = 1, . . . , m d+1 . We have</p><p>&#946; e 2 /&#964; 2 .</p><p>Then (12) implies that &#964; = 1 and &#957; j almost surely, j = 1, . . . , r. Thus, ( P, q, &#945;, &#946;) and ( P, q, &#945;, &#946;) are identical almost surely with the exception of permutation.</p><p>Proof of Theorem 2.</p><p>The estimator</p><p>represents the first derivatives with respect to the vector u i 1 &#8226;&#8226;&#8226;i d . By Taylor expansion, we have </p><p>where a b and b a mean a/b is bounded, T min = min t&#8712;T i 1 ...i d {|T i 1 ...i d |} and T max = max t&#8712;T i 1 ...i d {|T i 1 ...i d |}. Thus, under condition (C5), we have</p><p>1/2 max /T min N &#964; /2-&#965; . Under condition (C1) and (C5), we have (&#964; -&#965;) .</p><p>The proof of Theorem 2 is completed.</p><p>We can currently use a convenient basis system in our technical arguments and the results also hold true for other basis choices of the same function space. The B-spline and truncated polynomial basis functions span the same set of spline functions (de Boor, 2001), thus we use B-splines as the convenient basis system in our proofs. The B-splines have the following properties <ref type="bibr">(de Boor, 2001)</ref></p><p>k=1 &#966; 2 k , C 1 and C 2 are constant and &#966; k &#8712; R.</p><p>Lemma 5 Under Conditions (C1)-(C5), we have</p><p>1 N</p><p>(i 1 ,...,i d )&#8712;&#8486; ( W i 1 ...i d -W i 1 ...i d ) &#931; -1 i 1 ...i d &#949; i 1 ...i d = O p (N -1+&#964; -&#965; ).</p><p>Proof By Theorem 2 and the non-negative bounded properties of the B-spline basis functions (de Boor, 2001), we have for j, l = 1, . . . , r; e, k = 1, . . . , m d+1 , 1 N</p><p>(i 1 ,...,i d )&#8712;&#8486; ( X i 1 ...i d j -X i 1 ...i d j ) &#931; -1 i 1 ...i d ( X i 1 ...i d l -X i 1 ...i d l ) = 1 N (i 1 ,...,i d )&#8712;&#8486; ( u i 1 ...i d j -u i 1 ...i d j )( u i 1 ...i d l -u i 1 ...i d l )B i 1 ...i d j &#931; -1 i</p><p>1 N</p><p>(i 1 ,...,i d )&#8712;&#8486; ( X i 1 ...i d j -X i 1 ...i d j ) &#931; -1 i 1 ...i d ( Z i 1 ...i d e -Z i 1 ...i d e ) &#8804; c 2 &#963; -2 1 1 N (i 1 ,...,i d )&#8712;&#8486; ( u i 1 ...i d j -u i 1 ...i d j )( u i 1 ...i d (r+1) -u i 1 ...i d (r+1) )B i 1 ...i d j A i 1 ...i d e = O p (N -1+2(&#964; -&#965;) ), and 1 N (i 1 ,...,i d )&#8712;&#8486; ( Z i 1 ...i d k -Z i 1 ...i d k ) &#931; -1 i 1 ...i d ( Z i 1 ...i d e -Z i 1 ...i d e ) &#8804; c 2 &#963; -2 1 1 N (i 1 ,...,i d )&#8712;&#8486; ( u i 1 ...i d (r+1) -u i 1 ...i d (r+1) ) 2 A i 1 ...i d k A i 1 ...i d e = O p (N -1+2(&#964; -&#965;) ).</p></div></body>
		</text>
</TEI>
