<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Canonical sectors and evolution of firms in the US stock markets</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>09/12/2018</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10079793</idno>
					<idno type="doi">10.1080/14697688.2018.1444278</idno>
					<title level='j'>Quantitative Finance</title>
<idno>1469-7688</idno>
<biblScope unit="volume">18</biblScope>
<biblScope unit="issue">10</biblScope>					

					<author>Lorien X. Hayden</author><author>Ricky Chachra</author><author>Alexander A. Alemi</author><author>Paul H. Ginsparg</author><author>James P. Sethna</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Unsupervised machine learning can provide an objective and comprehensive broad-level sector decomposition of stocks]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Main text</head><p>Stock market performance is measured with aggregated quantities called indices that represent a weighted average price of a basket of stocks. Market-wide indices such as the Russell 3000&#174; (Russell 3000&#174; Index 2015) and the S&amp;P 500&#174; (S&amp;P 500&#174; Index 2014) consist of stocks from diverse companies reflecting a broad cross-section of the market. Sector-specific indices such as the Dow Jones&#174; Financials Index (Dow Jones&#174; US Indices 2015), the CBOE&#174; Oil Index (CBOE&#174; Oil Index 2013) and the Morgan Stanley&#174; High-Tech 35 Index (Morgan Stanley&#174; High-Tech 35 Index 2005) are more granular, and their composition requires a classification of companies into sectors. Major industrial classification schemes classify firms into sectors, albeit with many ambiguities <ref type="bibr">(Nadig and Crigger 2011)</ref>. It is not clear, for example, how to assign a sector to conglomerates or diversified companies such as General Electric&#174;. Conversely, non-conglomerates with exposure to firms outside their own sector (for example, an investment bank exclusively serving pharmaceutical firms) also blur the boundaries of sector identification. Moreover, as companies and their economic environments evolve, neither the industrial sectors nor the firms' sector association remains static, necessitating updates to sector assignments and the addition of new sectors.</p><p>A significant number of studies have previously aimed at identifying categories of stocks in financial markets with a variety of approaches. Recent numerical techniques have included extensive use of random matrix theory, principal component analysis or associated eigenvalue decomposition of the correlation matrix <ref type="bibr">(Plerou et al. 2002</ref><ref type="bibr">, Coronnello et al. 2005</ref><ref type="bibr">, Kim and Jeong 2005</ref><ref type="bibr">, Eom et al. 2007</ref><ref type="bibr">, Conlon et al. 2009</ref><ref type="bibr">, Fenn et al. 2011)</ref>, specialized clustering methods (Mantegna 1999, <ref type="bibr">Bonanno et al. 2000</ref><ref type="bibr">, 2003</ref><ref type="bibr">, Kullmann et al. 2000</ref><ref type="bibr">, Basalto et al. 2005</ref><ref type="bibr">, Heimo et al. 2009</ref><ref type="bibr">, Musmeci et al. 2014)</ref>, time series analysis <ref type="bibr">(Martins 2007, Podobnik and </ref><ref type="bibr">Stanley 2008)</ref>, pairwise coupling analysis <ref type="bibr">(Bury 2013)</ref>, and even topic modeling of returns <ref type="bibr">(Doyle and Elkan 2009)</ref>. Indeed, relevant prior work analyzing historical stock price returns <ref type="bibr">(Fama and French 1993</ref><ref type="bibr">, Laloux et al. 1999</ref><ref type="bibr">, Plerou et al. 2002</ref>) elucidated that the high-dimensional space of stock price returns has a low-dimensional representation.</p><p>In parallel with this, there is a long tradition of style analysis in finance in which time series are selected to serve as useful benchmarks for the performance of other stocks or indices. The three-factor model of <ref type="bibr">Fama and French (1993)</ref> is one such example. Recently, <ref type="bibr">Vistocco and Conversano (2009)</ref> proposed that Archetypal Analysis (AA) <ref type="bibr">(Cutler and Breiman 1994)</ref> could provide these benchmark time series while also providing a way to plot this data in a meaningful way. In particular, they provide a triangular plot for Italian mutual funds and suggest parallel coordinate plots or asymmetric maps for higher dimensional representations. The positive decomposition of mutual funds into sectors using standard benchmarks (not derived using AA) was later studied by the same authors <ref type="bibr">(Conversano and Vistocco 2010)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Feature</head><p>Table <ref type="table">1</ref>. Canonical sectors and major business lines of primary constituent firms. The eight canonical sectors identified by the analysis described here are listed in the column on the left; these were named in accord with the business lines (middle column) of firms that show strong association with these sectors. Some examples are provided in the right column; a full list is available on the companion website <ref type="bibr">(Chachra et al. 2013)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Here, we demonstrate a new, holistic way of classifying stocks into industrial sectors by utilizing the emergent structure of price returns in data space. Beyond the proposal of Vistocco and Conversano, we provide an interpretation of the archetypes of AA as sectors of the economy. This structure is purely contained in the geometry of the time series. Other methods, such as SVD, can discern that there is some such structure but are not well suited to a clean description. AA, on the other hand, determines the convex hull of the data-set, making it uniquely suited to a quantitative analysis of the data. In particular, if we take the log price returns of individual stocks, remove the overall market return, and normalize to zero mean and unit s.d., then stock returns are well approximated by a hyper-tetrahedral structure. Each lobe of the hyper-tetrahedron is populated by stocks of similar or related businesses (figure <ref type="figure">1</ref>); the lobe-corners (canonical sectors) approximate the returns of companies that are prototypical of individual sectors (table <ref type="table">1</ref>). Returns of each stock can be decomposed into a weighted sum (figure <ref type="figure">2</ref>) of the canonical sector returns (figure <ref type="figure">3</ref>). Lastly, the canonical sector weights for a given company are dynamic and lead to insights into its evolution (figure <ref type="figure">5</ref>).</p><p>The daily log returns of a stock s are defined as <formula notation="tex">r_{ts} = \log P_{ts} - \log P_{(t-1)s}</formula>, where <formula notation="tex">P_{ts}</formula> are adjusted closing prices (i.e. corrected for stock splits and dividend issues) and t is in trading days. In the present analysis, we used normalized returns, <formula notation="tex">R_{ts} = (r_{ts} - \langle r_{ts} \rangle_t)/\sigma_s</formula>, where <formula notation="tex">\sigma_s^2 = \langle r_{ts}^2 \rangle_t - \langle r_{ts} \rangle_t^2</formula> is the variance (squared volatility) and <formula notation="tex">\langle \cdot \rangle_t</formula> represents the average over time (trading days). 
Overall market returns from each stock were also removed, yielding what we shall call the log price returns, <formula notation="tex">R_{ts} - \langle R_{ts} \rangle_s</formula>, where <formula notation="tex">\langle \cdot \rangle_s</formula> represents the average over stocks; from here on, <formula notation="tex">R_{ts}</formula> denotes these market-removed returns. (The two degrees of freedom we remove from each stock, the variance and the overall return, are of practical interest elsewhere, but obscure the classification into sectors.) The hyper-tetrahedron, or simplex, which emerges (figure <ref type="figure">1</ref>) is a self-organized structure: it has prototypical firms in corners (table 1), closely related firms clumped together in each lobe, diversified companies (GE&#174;, Walt Disney&#174;, 3M&#174;, etc.) close to the centre, and the number of lobes denoting how many distinct sectors are exhibited by the data. This suggests a natural way to decompose stocks into canonical sectors: for convex sets, each interior point is representable as a unique weighted sum of corner points, implying here that every stock's return is approximated by a weighted sum of returns from the canonical sectors. Conversely, the weights for a given stock quantify its exposure to the canonical sectors.</p><p>We applied an in-house Python implementation of the AA algorithm described by <ref type="bibr">M&#248;rup and Hansen (2012)</ref>. The dataset consisted of the stocks of 705 US firms with a market capitalization of at least $1 billion in June 2013 and 20 years of continuous listing on major exchanges (appendix A). Analysis of this dataset (appendices B and C) revealed eight emergent sectors, which were named in accordance with the companies they comprised (the prefix c- denotes 'canonical'): c-cyclical (including retail), c-energy (including oil and gas), c-industrial (including capital goods and basic materials), c-financial, c-non-cyclical (including healthcare and consumer non-cyclical goods), c-real estate, c-technology, and c-utility. 
Calculated participation weights for a sample of 12 firms in figure <ref type="figure">2</ref> show a decomposition of their stocks into the canonical sectors, with the resulting insights discussed in the caption. Associated with each canonical sector f is a time series of returns. As expected, these series show hallmark historical events of individual sectors (figure <ref type="figure">3</ref>): the dot-com bubble, the energy crisis, and the financial crisis being the major events in the last two decades.</p><p>Figure <ref type="figure">2</ref>. Canonical sector decomposition of stocks of selected companies. A complete set of all 705 stocks is provided on the companion website <ref type="bibr">(Chachra et al. 2013)</ref>; the color scheme is shown on the right. Conglomerates like GE&#174; decompose roughly into their core business lines. Tech firms such as Apple&#174; that sell mass-market consumer goods have an important fraction in c-cyclical, whereas IBM&#174; has a significant portion of c-non-cyclical returns, presumably due to its government contracts. Telecom companies like AT&amp;T&#174; are generally classified under a separate telecom category by major classification systems, yet analysis shows their returns are described by a combination of c-non-cyclical and c-utility sectors. Health insurance providers like Aetna&#174; are commonly classified as financial services firms, but their returns consist of a major part c-non-cyclical and only a minor part c-financial; the healthcare sector is generally less prone to economic downturns. Defense contractors like Lockheed&#174; are listed as capital goods companies, but their returns are seen to be majority c-non-cyclical with only a smaller share of the c-industrial sector.</p><p>Figure <ref type="figure">3</ref>. Emergent sector time series. Annualized cumulative log price returns of the eight emergent sectors are shown. The time series capture all important features affecting different sectors: the build-up of the dot-com bubble (c. 2000) followed by a burst, the soaring energy valuations (2003-2008) followed by a crash, and the financial crisis of 2008. We note that the dot-com bubble was confined to the c-technology sector whereas the financial crisis effects were spread throughout the sectors. A precise definition of the cumulative returns plotted here is given in equation (C1); other measures of sector dynamics are in figure <ref type="figure">C1</ref>.</p><p>Figure <ref type="figure">5</ref>. Evolving sector participation weights. Results from the sector decomposition made with rolling two-year Gaussian windows are shown for selected stocks. A complete set of 705 charts is provided on the companion website <ref type="bibr">(Chachra et al. 2013)</ref>. For stable and focused companies such as Pacific Gas &amp; Electric&#174; or IBM&#174;, one sees no significant shifts in sector weights; changes in time agree with errors expected from unresolved fluctuations <ref type="bibr">(Chachra et al. 2013)</ref>. Wal-Mart&#174;'s returns, on the other hand, have moved significantly from c-cyclical to c-non-cyclical (consumer staples) in the post-financial crisis years as shown; this is also true of other low-price consumer commodities retailers such as Costco&#174;, but not true of higher price retailers such as Whole Foods&#174;, Macy's&#174;, etc. Corning&#174;, previously an industrial firm with a huge presence in optical fibre, suffered in the aftermath of the dot-com crisis and is now classified as a tech firm, presumably due to its Gorilla&#174; glass used in cellphones, laptop displays, and tablets. Berry Petroleum grew within its home state of California in the early 1990s through development on properties that were purchased in the earlier part of the 20th century. In 2003, the company embarked on a transformation (Berry Petroleum Company History 2013) by direct acquisition of light oil and natural gas production facilities outside California. The figure shows a clear shift in the distribution of sector weights as the company has moved toward c-energy and away from c-real estate. Similarly, as Plum Creek&#174; Timber converted to a real estate investment trust (REIT) in the late 1990s (Plum Creek&#174; History 2014), its sector weights have shifted significantly toward the c-real estate sector.</p><p>Determining the correct number of canonical sectors that appropriately describe the space of stock market returns is akin to the more general issue of selecting a signal-to-noise ratio cut-off, or a truncation threshold in the dimensional reduction of data. The choice of this threshold is generally sensitive to sampling, yet the results presented here are reasonably robust, with different choices leading to meaningful and similar decompositions. Figure <ref type="figure">4</ref> depicts the changes in the decomposition with dimension. Details of how the figure was generated, as well as more information on the two- and three-dimensional decompositions, are available in appendix G.</p><p>In addition to the full data-set of 20 years &#215; 705 firms, we also applied the algorithm to overlapping, two-year Gaussian windows to study how the sector weights for firms have evolved in time (figure <ref type="figure">5</ref>, see also appendix C). As expected, the sector decomposition of firms is dynamic. Mergers, acquisitions, spin-offs, new products, the effects of competitive environments or shifting consumer preferences can change the business foci of firms and hence alter the sector association of firms. 
External events affecting companies in an idiosyncratic manner also show a clear signature in this analysis.</p><p>The eight-factor decomposition presented here explains 11.1% of the total variation (r&#178;) in the normalized returns with the market mode removed, and 56% of the random matrix theory explainable variation defined in appendix F. For comparison, the classic three-factor decomposition of portfolio returns by <ref type="bibr">Fama and French (1993)</ref> into market mode, market capitalization, and growth vs. value yields an r&#178; value of only 4.75%. Indeed, if only three factors are used instead of the eight for the decomposition presented here, the regression yields a comparable r&#178; value (5.61%), but there appears to be no correspondence between the three factors found by our unsupervised model and those of Fama and French (figure <ref type="figure">F1</ref>). Carrying out a similar comparison with Fama and French's analysis applied to model portfolio returns, the regression on the S&amp;P 500&#174; yields an r&#178; value of 99.4% for Fama and French compared to 93.5% for our eight-factor decomposition (market mode reintroduced). Our decomposition was optimized without concern for market capitalization, which appears to be the key difference: for an equal-weighted index of the 338 stocks in the S&amp;P 500&#174; with current tickers and a complete data series in our time of interest, we obtain an r&#178; value of 99.0% (97.0% for 3 factors) compared to 95.8% for Fama and French. We conclude that a sector decomposition like the one presented here, perhaps weighted by market capitalization, should be an improved guide to investors, compared to the widespread value/growth and large-cap/small-cap stock characterizations currently used.</p><p>Future work remains to address survivorship bias, effects of sampling at different frequencies, and incorporating market capitalization. 
Investors, analysts, and governments alike would benefit from the development of new investable sector indices (appendix H) that measure the health of our industrial sectors just like the macroeconomic indicators (GDP, housing starts, unemployment rate, etc.) measure the health of our broader economy. Tracing the sectors back in time could elucidate the incorporation of science and technology into our economic system. Finally, our unsupervised decomposition could provide data suitable for quantitative modelling of the internal and external dynamics of our economic system.</p><p>The factorization of the returns is written as (repeated indices are summed): <formula notation="tex">R_{ts} \approx E_{tf} W_{fs}, \qquad E_{tf} = R_{ts'} C_{s'f}. \qquad \text{(B1)}</formula></p><p>Columns of <formula notation="tex">E_{tf} = R_{ts'} C_{s'f}</formula> are the emergent sector time series (basis vectors) representing the n corners of the hyper-tetrahedron, and <formula notation="tex">W_{fs}</formula> are the participation weights (<formula notation="tex">W_{fs} \geq 0</formula>) in sector f, so that <formula notation="tex">\sum_f W_{fs} = 1</formula> for each stock s. The sector matrix <formula notation="tex">E_{tf}</formula> is within the convex hull (<formula notation="tex">C &gt; 0</formula>, <formula notation="tex">\sum_s C_{sf} = 1</formula>) of the data <formula notation="tex">R_{ts}</formula>. It can be found by either minimizing the squared error with convex constraints in factorization as originally proposed <ref type="bibr">(Cutler and Breiman 1994)</ref>, or by making a convex hull of the dataset and choosing one or more of its vertices to be basis vectors, or by making a convex hull in low dimensions and choosing one or more of its vertices to be basis vectors <ref type="bibr">(Thurau et al. 2009)</ref>, or by minimizing after initializing with candidate archetypes that are guaranteed to lie in the minimal convex set of the data <ref type="bibr">(M&#248;rup and Hansen 2012)</ref>. The columns of the C matrix are shown in figure <ref type="figure">B1</ref>.</p><p>Figure <ref type="figure">B1</ref>. Canonical sector constituents (shown as columns of <formula notation="tex">C_{sf}</formula>). Each column of <formula notation="tex">C_{sf}</formula> is a weighted combination of stocks that defines a canonical sector, whose time series is given by <formula notation="tex">E_{tf} = R_{ts} C_{sf}</formula>. The eight subplots show the constituent participation component of stocks in each canonical sector f. 
Canonical sectors are labeled on the plot; their names were chosen according to the listed sectors of firms that comprise them. Noteworthy features seen above include the co-association of listed sectors: basic, capital, transport and part of cyclicals into industrial goods. Similarly, healthcare and non-cyclicals are coupled together in what we call non-cyclicals.</p><p>Canonical retail goes primarily with listed retail and cyclicals. Stocks are colored by listed sectors as shown at the bottom. Listed sector information was obtained from <ref type="bibr">Scottrade&#174; (2015)</ref>.</p></div>
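<div xmlns="http://www.tei-c.org/ns/1.0"><p>The return normalization described in the main text can be sketched in a few lines of Python. This is a minimal illustration only, not the in-house implementation used for the analysis: the function names and the toy price table are hypothetical, and the market return is removed here by subtracting the average over stocks, as in the main text, rather than by the SVD projection discussed in appendix D.</p>
<p><![CDATA[
```python
import math

def log_returns(prices):
    """Daily log returns r_t = log P_t - log P_(t-1) for one stock."""
    return [math.log(p1) - math.log(p0) for p0, p1 in zip(prices, prices[1:])]

def standardize(series):
    """Shift to zero mean and scale to unit standard deviation over time."""
    n = len(series)
    mean = sum(series) / n
    variance = sum(x * x for x in series) / n - mean * mean
    sd = math.sqrt(variance)
    return [(x - mean) / sd for x in series]

def normalized_returns(price_table):
    """price_table[s] holds the adjusted closing prices of stock s.

    Returns a table R with R[t][s]: standardized log returns with the
    average over stocks (the overall market return) removed on each day.
    """
    columns = [standardize(log_returns(p)) for p in price_table]
    n_stocks = len(columns)
    n_days = len(columns[0])
    table = []
    for t in range(n_days):
        market = sum(col[t] for col in columns) / n_stocks
        table.append([col[t] - market for col in columns])
    return table

# Hypothetical toy prices for three stocks over five trading days.
toy_prices = [
    [100.0, 101.0, 103.0, 102.0, 105.0],
    [50.0, 49.0, 51.0, 52.0, 50.0],
    [20.0, 21.0, 20.0, 22.0, 23.0],
]
R = normalized_returns(toy_prices)
```
]]></p>
<p>In this sketch each row of R (one trading day) sums to zero across stocks, which is the sense in which the overall market return has been removed.</p></div>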
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Appendix C. Calculations and convergence</head><p>Numerical computations were performed using an in-house Python implementation of the principal convex hull analysis (PCHA) algorithm as described by <ref type="bibr">M&#248;rup and Hansen (2012)</ref>. For the full dataset, the factorization <formula notation="tex">R = EW</formula>, with <formula notation="tex">E = RC</formula> as defined in equation (B1), converged in 35 iterations to a predefined tolerance value of <formula notation="tex">\Delta\mathrm{SSE} &lt; 10^{-7}</formula>, where <formula notation="tex">\Delta\mathrm{SSE}</formula> is the average difference in the sum of squared error per matrix element in <formula notation="tex">R - EW</formula> from one iteration to the next. The resulting columns of <formula notation="tex">E_{tf}</formula> are shown in figure <ref type="figure">C1</ref> (top row). Annualized cumulative log returns are obtained by summing rows of <formula notation="tex">E_{tf}</formula>:</p><p>The time series <formula notation="tex">Q_f(\tau)</formula> are shown in figure <ref type="figure">3</ref> and the middle row of figure C1. Weights <formula notation="tex">W_{fs}</formula> for selected stocks are shown in figure <ref type="figure">2</ref>; the remainder are available on the companion website <ref type="bibr">(Chachra et al. 2013)</ref>. The weight components of the companies in each canonical sector f are shown in figure <ref type="figure">C2</ref>.</p><p>The analysis of evolving sector weights was performed similarly, but with a sliding Gaussian time window. We decomposed the local normalized log returns for each stock into the canonical sectors determined from the entire time series. Each column (time series) of the returns matrix <formula notation="tex">R_{ts}</formula> was multiplied by a Gaussian, <formula notation="tex">G_\mu(\tau) = \exp\left(-(\tau - \mu)^2 / (2 \times 250^2)\right)</formula>, of standard deviation 250 centered at &#956;, to obtain <formula notation="tex">R^\mu_{ts}</formula>. We use the <formula notation="tex">C_{sf}</formula> found using the full dataset (equation (B1)), which corresponds to keeping the sector-defining simplex corners fixed. <formula notation="tex">R^\mu_{ts}</formula> is factorized to obtain new weights <formula notation="tex">W^\mu_{fs}</formula> that describe the sector decomposition of stocks in the period centered at t = &#956;:</p><p>&#956; is increased in steps of 50 starting at &#956; = 0 and ending at &#956; = 5000, and <formula notation="tex">W^\mu</formula> is calculated at each &#956; with the corresponding <formula notation="tex">R^\mu</formula>. These results are plotted in figure <ref type="figure">5</ref> for a select group of companies; the remainder are available on the companion website <ref type="bibr">(Chachra et al. 2013)</ref>.</p><p>To address the challenge of distinguishing signal from noise in the evolving sector weights, we emulated the effect of noise for each of the companies from figure 5. For each of these companies, we took its sector weights, <formula notation="tex">\omega_f</formula>, and multiplied by <formula notation="tex">E_{tf}</formula> to obtain a time series for the company with weights that are constant in time. We then added Gaussian random noise with standard deviation one and replaced the data for these companies with this simulated data. Figure <ref type="figure">C3</ref> shows the comparison between the real flows and the simulated constant data with noise added. General features are shown to be signal, while small fluctuations are consistent with noise.</p><p>Figure <ref type="figure">C3</ref>. Comparison of the flows in figure <ref type="figure">5</ref> with simulated data. The simulated data is created from the dot product of the weight vector of the company with the corner time series as described in this section. This yields a version of the company with constant weights in time. To this we add Gaussian noise with standard deviation one and repeat the analysis to generate the flows in time. In the left column are the actual flows for companies; on the right are their constant-in-time counterparts with added noise. We see that key features are in fact signal, while small fluctuations correspond to noise. Colour scheme as in figures 2 and 5.</p></div>
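<div xmlns="http://www.tei-c.org/ns/1.0"><p>The sliding Gaussian window described in this appendix can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, the returns table is assumed to be indexed as returns[t][s], and the subsequent factorization step that yields the windowed weights is not shown.</p>
<p><![CDATA[
```python
import math

def gaussian_window(n_days, mu, width=250.0):
    """G_mu(tau) = exp(-(tau - mu)^2 / (2 * width^2)) for tau = 0 .. n_days - 1."""
    return [math.exp(-((tau - mu) ** 2) / (2.0 * width ** 2))
            for tau in range(n_days)]

def window_centers(start=0, stop=5000, step=50):
    """Window centres mu, stepped as in the text (0, 50, ..., 5000)."""
    return list(range(start, stop + 1, step))

def apply_window(returns, mu, width=250.0):
    """Multiply every stock's time series (a column of returns[t][s]) by G_mu."""
    g = gaussian_window(len(returns), mu, width)
    return [[g_t * x for x in row] for g_t, row in zip(g, returns)]

# Hypothetical constant returns for two stocks over 501 trading days.
toy_returns = [[1.0, -1.0] for _ in range(501)]
windowed = apply_window(toy_returns, mu=250)
centers = window_centers()
```
]]></p>
<p>With the step and range quoted in the text, this gives 101 window centres; each windowed matrix would then be factorized with the sector corners held fixed.</p></div>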
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Appendix D. Dimensionality of the space of price returns</head><p>It is often the case with large data-sets that the effective dimensionality of the data space is much lower when one filters out the noise. Of the many dimensional reduction methods, the most commonly used is singular value decomposition (SVD) <ref type="bibr">(Press et al. 2007</ref>), a deterministic matrix factorization. We discuss SVD in more detail in order to draw a contrast with previous SVD results, and to apply it for quantifying the explainable variation in the returns data.</p><p>An SVD of <formula notation="tex">R_{ts}</formula> is a matrix factorization <ref type="bibr">(Press et al. 2007</ref>) <formula notation="tex">R_{ts} = U_{tf} \Sigma_{ff'} V^T_{f's}</formula> such that the matrices U and V are orthogonal and &#931; is a diagonal matrix of 'singular values'. If the goal were purely rank reduction, the n entries of &#931; chosen to lie above a 'noise threshold' are retained and the rest truncated, so that <formula notation="tex">0 \leq f, f' \leq n</formula>. This effectively reduces the dimension of R to n. The choice of n can be informed by the distribution of singular values as discussed later. The rows of <formula notation="tex">V^T</formula> are precisely the eigenvectors of the stock-stock returns correlation matrix, <formula notation="tex">\xi_{ss'} \sim R^T_{st} R_{ts'}</formula>. It was previously reported that some components of the stiff eigenvectors of this stock-stock correlation matrix loosely corresponded to firms belonging to the same conventionally identified business sector <ref type="bibr">(Plerou et al. 2002)</ref> (but see figure <ref type="figure">D1</ref>).</p><p>Figure <ref type="figure">D1</ref>. Right singular vectors <formula notation="tex">V^T_{fs}</formula> of the SVD of returns <formula notation="tex">R_{ts}</formula>. The orthonormal right singular vectors (rows of <formula notation="tex">V^T_{fs}</formula>) of the SVD of <formula notation="tex">R_{ts}</formula> are equivalent to the eigenvectors of the stock-stock correlation matrix <formula notation="tex">\xi_{ss'} \sim R^T R</formula>. The eight stiffest eigenvectors, including the market mode, are shown two per row. Each has 705 components corresponding to stocks in the dataset. The market mode, with all components in the same direction, describes overall fluctuations in the market; it was excluded from the analysis described in the paper. Previous work <ref type="bibr">(Plerou et al. 2002)</ref> has suggested that each eigenvector of the stock-stock correlation matrix describes a listed sector; however, as seen above, a more accurate interpretation is that each eigenvector is a mixture of listed sectors with opposite signs in components. For example, the stiffest direction (after the market mode) has positive components in real estate and utility, but negative in tech. Less stiff eigenvectors (including the last one shown here) do not contain sector-relevant information. Stocks are coloured by listed sectors as shown at the bottom. Listed sector information was obtained from <ref type="bibr">(Scottrade&#174; 2015)</ref>.</p><p>After normalizing the log returns, the returns matrix R has entries of unit variance. If the entries were uncorrelated random variables drawn from a standard normal distribution, their singular values (which are also the positive square roots of the eigenvalues of <formula notation="tex">R^T R</formula>) would be described by Wishart statistics <ref type="bibr">(Mehta 2004</ref>). The Wishart ensemble for a matrix of size &#945; &#215; &#946; predicts a distribution of singular values with a characteristic shape <ref type="bibr">(Mehta 2004)</ref>, bounded for large matrices by &#8730;&#945; &#177; &#8730;&#946;. Comparing the stock correlations with Wishart statistics has been previously used to filter noise from financial datasets <ref type="bibr">(Laloux et al. 1999)</ref>. 
As shown in figure <ref type="figure">D2</ref>, most singular values of the returns matrix R lie in the bulk below the bound set by the Wishart ensemble, whereas only &#8764;20 fall outside that cut-off. Historically, this has served as an indication that singular values within the bulk correspond to noise <ref type="bibr">(Laloux et al. 1999)</ref>. Recently, however, much progress has been made in the development of techniques to extract signal from the bulk <ref type="bibr">(Burda et al. 2004</ref><ref type="bibr">, 2006</ref><ref type="bibr">, Livan et al. 2011)</ref>. Our method does not claim to capture this information. Rather, we measure its ability to capture variation in the data above the cut-off by means of the random matrix theory explainable variation defined in appendix F. The largest singular value of <formula notation="tex">R_{ts}</formula> corresponds to what we will refer to as the 'market mode', as it represents the overall simultaneous rise and fall of stocks. In the analysis presented in this paper, this mode has been filtered from the returns matrix by projecting the R matrix into the subspace spanned by all non-market mode eigenvectors. This is nearly equivalent to filtering the market mode using simple linear regression, as is commonly done <ref type="bibr">(Plerou et al. 2002)</ref>, although more convenient.</p></div>
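<div xmlns="http://www.tei-c.org/ns/1.0"><p>The large-matrix bound &#8730;&#945; &#177; &#8730;&#946; quoted above suggests a simple way to count singular values that stand out from the random bulk. The sketch below is illustrative only: the function names are hypothetical, the singular values are assumed to have been computed elsewhere, and the example numbers are made up rather than taken from the returns data.</p>
<p><![CDATA[
```python
import math

def wishart_singular_value_bounds(alpha, beta):
    """Large-matrix bounds sqrt(alpha) -/+ sqrt(beta) on the singular values
    of an alpha x beta matrix of independent standard normal entries."""
    lower = abs(math.sqrt(alpha) - math.sqrt(beta))
    upper = math.sqrt(alpha) + math.sqrt(beta)
    return lower, upper

def count_above_edge(singular_values, alpha, beta):
    """Number of singular values above the upper edge of the random bulk."""
    _, upper = wishart_singular_value_bounds(alpha, beta)
    return sum(1 for s in singular_values if s > upper)

# Made-up example: a 100 x 25 matrix has bulk edges at 5 and 15.
edges = wishart_singular_value_bounds(100, 25)
n_signal = count_above_edge([16.0, 14.0, 20.0, 3.0], 100, 25)
```
]]></p>
<p>Applied to the singular values of the returns matrix, a count of this kind is what separates the roughly 20 values outside the bulk from the rest; the largest of these is the market mode.</p></div>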
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Appendix E. Low-dimensional projections of price returns</head><p>The emergent low-dimensional, hyper-tetrahedral (simplex) structure of stock price returns can be seen by projecting the dataset into stiff 'eigenplanes'. Eigenplanes are formed by pairs of right singular vectors from an SVD. Here, we construct an SVD of the simplex corners, <formula notation="tex">E_{tf} = X_{tk} (Y Z^T)_{kf}</formula>; simplex corners are mapped to columns of <formula notation="tex">Y Z^T</formula> because <formula notation="tex">(Y Z^T)_{kf} = X^T_{kt} E_{tf}</formula> (in other words, <formula notation="tex">X^T_{kt}</formula> is a projection operator). The plots in figure <ref type="figure">E1</ref> are the projections of the dataset, <formula notation="tex">X^T_{kt} R_{ts} = v_{ks}</formula>. The rows of v taken in pairs form the axes of the projections in figures 1 and E1. With those plots, it becomes clear that the eigenplanes represent projections of simplex-like data into two dimensions. Secondly, we note that the simplex structure becomes less clear as one looks at planes corresponding to smaller singular value directions; the signal eventually becomes buried in the noise.</p><p>Similarly, the results of the factorization can be seen in eigenplanes from the SVD of <formula notation="tex">E_{tf} W_{fs} = L_{tk} (M N^T)_{ks}</formula>. These results (rows of <formula notation="tex">(M N^T)_{ks}</formula>) are shown in figure <ref type="figure">E2</ref>, where we notice that the data now reside entirely within the simplex region, as expected from the constraints.</p><p>Figure <ref type="figure">E1</ref>. Low-dimensional projections of stock returns data, coloured by Scottrade&#174; sector. Each coloured circle represents a stock in our dataset and is coloured according to sectors assigned by Scottrade&#174; <ref type="bibr">(Scottrade&#174; 2015)</ref> as indicated in figure <ref type="figure">D1</ref>. The first row is equivalent to figure 1. Black circles represent the archetypes found with our analysis. The (i, j)th figure in the grid is a plane spanned by singular vectors i and j + 1 (rows of <formula notation="tex">X^T R</formula>) from the calculations described earlier. Projections after the factorization are shown in figure <ref type="figure">E2</ref>.</p><p>Figure <ref type="figure">E2</ref>. Cross-sections along eigenplanes of the factorized returns. Each coloured circle represents a stock in our dataset and is coloured according to the primary canonical sector association with the colour scheme in figure <ref type="figure">2</ref>. Black circles represent the archetypes found with our analysis. The (i, j)th figure in the grid is a plane spanned by singular vectors i and j + 1 (rows of <formula notation="tex">M N^T</formula>) from the calculations described earlier. Projections of raw data (before the factorization) are shown in figure <ref type="figure">E1</ref>. Note that the colours are very similar to those of the traditional Scottrade&#174; classification shown in figure <ref type="figure">E1</ref>; the colour schemes were designed to roughly match. Note that here all points have been projected into the hyper-tetrahedron by our factorization.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Appendix F. Coefficient of determination (r&#178;)</head><p>We measured the goodness of the returns decomposition <formula notation="tex">R = EW</formula> by measuring the coefficient of determination (r&#178;) as follows: <formula notation="tex">r^2 = 1 - \mathrm{SSE}/\mathrm{SST}</formula>.</p><p>Here, SSE denotes the sum of square errors <formula notation="tex">\|R - EW\|_F^2</formula>, and SST is the total sum of squares <formula notation="tex">\|R\|_F^2</formula>. This is also known as the proportion of variance explained (PVE). For the factorization of the full dataset, normalized with the market mode removed, the calculated r&#178; value is 11.1%. The SVD of R with singular values shown in figure <ref type="figure">D2</ref> provides a convenient way to put this number in context for the returns dataset. Only 20 singular values (excluding the market mode) were above the cut-off that was predicted by random matrix theory for a matrix of purely random Gaussian entries. For any matrix M with elements <formula notation="tex">m_{ij}</formula>, the norm <formula notation="tex">\|M\|_F^2 = \sum_{i,j} m_{ij}^2 = \sum_i s_i^2</formula>, where <formula notation="tex">s_i</formula> are the singular values <ref type="bibr">(Press et al. 2007</ref>). Thus, the fraction of intrinsic variation in R above the cut-off is the sum of squares of the 20 singular values (not including the market mode) divided by SST, <formula notation="tex">\sum_{i=1}^{20} s_i^2 / \|R\|_F^2 = 19.8\%</formula>. Therefore, as a first approximation, the factorization explains 11.1/19.8 = 56% of the random matrix theory (RMT) explainable variation.</p><p>For reference we provide the RMT explainable variation for the factor decomposition of Fama and French, the classification by Scottrade&#174;, and the top 8 singular vectors given by SVD. The percentage of the RMT explainable variation for different numbers of factors compared to the 3-factor decomposition of Fama and French is shown in table <ref type="table">F1</ref>. Fama and French have the benefit of allowing factors to have positive or negative weights. 
In order to compare with another non-negative decomposition, we fix the weight matrix according to the Scottrade&#174; labels and run archetypal analysis for this n = 14 factor version. The <formula notation="TeX">r^2</formula> value for this decomposition is 10.7% with a corresponding RMT explainable variance of 54.2% compared to 56% for our 8 factors. For completeness, we also note that if R is rank-reduced to the eight stiffest components found by SVD (not including the market mode), then the factorization explains 85% of the RMT explainable variation in R, with overall results in good accord with the analysis presented here. This implies that sector decomposition information was already contained in the stiff modes from the SVD of R; however, SVD is not the appropriate tool for the decomposition. Figure <ref type="figure">F1</ref> further shows that our unsupervised 3-factor decomposition appears quite distinct from Fama and French's hand-created one.</p></div>
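<div xmlns="http://www.tei-c.org/ns/1.0"><p>The bookkeeping in this appendix can be illustrated with a minimal sketch. It is not the code used for the paper: the data are synthetic, the function names <code>r_squared</code> and <code>fraction_in_top_modes</code> are hypothetical, and the number of retained singular values is passed in as a parameter rather than derived from a random matrix theory cut-off. <code lang="python"><![CDATA[
```python
import numpy as np

def r_squared(R, E, W):
    """Proportion of variance explained: 1 - ||R - E W||_F^2 / ||R||_F^2."""
    sse = np.linalg.norm(R - E @ W, "fro") ** 2
    sst = np.linalg.norm(R, "fro") ** 2
    return 1.0 - sse / sst

def fraction_in_top_modes(R, k):
    """Share of ||R||_F^2 carried by the k largest singular values of R."""
    s = np.linalg.svd(R, compute_uv=False)  # returned in descending order
    return np.sum(s[:k] ** 2) / np.sum(s ** 2)

# Synthetic stand-in data (an assumption for illustration, not the paper's returns).
rng = np.random.default_rng(0)
T, S, n = 200, 50, 3                       # time steps, stocks, factors
E = rng.normal(size=(T, n))                # factor time series
W = rng.dirichlet(np.ones(n), size=S).T    # non-negative weights, columns sum to 1
R = E @ W + 0.5 * rng.normal(size=(T, S))  # noisy returns

r2 = r_squared(R, E, W)
top = fraction_in_top_modes(R, n)
ratio = r2 / top  # fraction of the variation in the top modes that is explained
```
]]></code></p><p>With the values quoted in the text, the same ratio is 11.1/19.8, or about 56%.</p></div>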
<div xmlns="http://www.tei-c.org/ns/1.0"><head>(a) (b)</head><p>Figure <ref type="figure">F1</ref>. Three Factor Model vs. Fama and French. 2D projections of the weights for each company in the S&amp;P 500&#174; with current tickers and data in the date range we consider. Red denotes companies with large market caps (market cap &gt; 10 billion), blue denotes medium (market cap 2-10 billion) and green denotes small (market cap &lt; 2 billion). For our decomposition (a), there is no separation distinguishable by size of company. In comparison, for the Fama and French decomposition (b), there appears a gradation from large to small companies consistent with a factor of the model being related to size. (This is natural, since one of Fama and French's factors is explicitly the difference between large and small-cap returns.) Thus our unsupervised 3-factor decomposition appears quite distinct from Fama and French's hand-created one.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Appendix G. The number n of canonical sectors</head><p>It is an open problem to determine the effective dimensionality (optimal rank) of a general dataset (matrix). One could select among models of different dimensions using statistical tests such as the <formula notation="TeX">r^2</formula> discussed above, or information-theoretic criteria such as the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC), but the choice of the selection criterion is itself generally made on an ad hoc basis. Therefore, a direct observation of the comprehensibility of results is often the most reliable criterion. In the dataset used for the analysis described here, a factorization with n &gt; 8 yielded results where both the emergent time series <formula notation="TeX">E_{tf}</formula> and the weights in <formula notation="TeX">W_{fs}</formula> showed qualitative signs of overfitting. For example, with n = 9 the results were in good agreement with n = 8 except for an additional resulting sector involving participation from only 11 seemingly unrelated stocks (table G1 and figure <ref type="figure">4</ref>). The high-level results of factorization with different values of n may be explored in a number of ways, several of which are described below.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>G.1. Sector changes with dimensionality</head><p>One approach to investigating how the sector decomposition changes with dimension is to produce a flow diagram. To do this, we performed the fit <formula notation="TeX">\min_{S^{(N,N+1)}} \lVert E^{(N+1)} - E^{(N)} S^{(N,N+1)} \rVert_F^2 \quad \text{subject to} \quad \sum_f S^{(N,N+1)}_{f,f'} = 1,</formula> where <formula notation="TeX">E^{(N)}</formula> denotes the emergent time series of the decomposition with n = N.</p><p>Hence the sectors for n = 9 can be expressed as a linear combination of sectors for n = 8, n = 8 as a linear combination of n = 7, and so forth. The results of these fits are presented in figure <ref type="figure">5</ref>. The figure represents these relationships through connections between the decompositions for n = N + 1 and n = N weighted according to the matrix <formula notation="TeX">S^{(N,N+1)}</formula>. More precisely, we create a node corresponding to each of the 9 sectors whose size is proportional to <formula notation="TeX">\sum_s W_{f,s}</formula>, where <formula notation="TeX">W_{f,s}</formula> is the weight matrix for the 9 sector decomposition. Hence, the relative node sizes represent the amount of the market participating in each sector. Multiplying this vector by <formula notation="TeX">S^{(8,9)}</formula> gives the approximate size for each node in n = 8. Multiplying the resulting vector by <formula notation="TeX">S^{(7,8)}</formula> gives the approximate size for each node in n = 7, and so on. In this way, we generate a Sankey diagram whose node sizes correspond roughly to the amount of the market in the sector and whose connections depict how strongly the sectors for decompositions with different n overlap. In the image, we see that the n = 9 decomposition gives the 8 sector version with an additional small sector whose companies are listed in table <ref type="table">G1</ref>. We also see that for n = 7 c-finance and c-real estate merge. At n = 6, c-industrial and c-cyclical merge. For n = 5, the new sector containing c-industrial and c-cyclical merges with c-noncyclical. For n = 4, c-utility and c-energy merge. Finally, for n = 3 and n = 2, no clear pattern emerges given this image alone.</p></div>
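<div xmlns="http://www.tei-c.org/ns/1.0"><p>The fit between successive decompositions and the propagation of node sizes can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: only the column-sum constraint is imposed (whether the original fit also enforces non-negativity is not stated here), the function name <code>fit_sector_map</code> is hypothetical, and the sector returns and node sizes are synthetic. <code lang="python"><![CDATA[
```python
import numpy as np

def fit_sector_map(E_prev, E_next):
    """Least-squares S with E_next ~ E_prev @ S, each column of S summing to 1.

    Solves the KKT system of the equality-constrained least-squares problem
    for all columns of S at once (one Lagrange multiplier per column).
    """
    n_prev = E_prev.shape[1]
    n_next = E_next.shape[1]
    ones = np.ones((n_prev, 1))
    K = np.block([[2.0 * E_prev.T @ E_prev, ones],
                  [ones.T, np.zeros((1, 1))]])
    rhs = np.vstack([2.0 * E_prev.T @ E_next, np.ones((1, n_next))])
    sol = np.linalg.solve(K, rhs)
    return sol[:n_prev]  # drop the row of Lagrange multipliers

# Synthetic stand-ins (assumptions for illustration only).
rng = np.random.default_rng(1)
E8 = rng.normal(size=(300, 8))                 # sector returns for n = 8
S_true = rng.dirichlet(np.ones(8), size=9).T   # 8 x 9, columns sum to 1
E9 = E8 @ S_true                               # sector returns for n = 9

S_89 = fit_sector_map(E8, E9)                  # recovered 8 x 9 map
sizes9 = rng.uniform(1.0, 5.0, size=9)         # node sizes for n = 9
sizes8 = S_89 @ sizes9                         # propagated node sizes for n = 8
```
]]></code></p><p>Because each column of the fitted matrix sums to one, the total node size is preserved when passing from n = 9 to n = 8 in this sketch.</p></div>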
<div xmlns="http://www.tei-c.org/ns/1.0"><head>G.2. Two and three sector decompositions</head><p>We further explore the two and three sector decompositions by examining their constituent companies and looking at pie charts describing the relationship between our 8 sector decomposition and those with n = 2 and n = 3 respectively. Recall that each archetype is constrained to be a convex combination of companies, or in other words to lie in the convex hull of the data. Using this information, we list the 20 companies which contribute the most to each sector in the two and three factor decompositions (tables G2-G4). For the two sector decomposition, we find the sectors divide roughly into c-assets (e.g. financial and real estate companies) and c-goods (e.g. companies which provide goods and services). For n = 3, the division is less clear. Another way to look at the constituents of these sectors is by examining pie chart representations of these decompositions. Again consider the fit <formula notation="TeX">\lVert E_{t,f'} - E_{t,f} S_{f,f'} \rVert_F^2</formula> with the constraint <formula notation="TeX">\sum_f S_{f,f'} = 1</formula>, where f and f' index the sectors of the two decompositions being compared. Applying this, we can express the two sector archetypes as linear combinations of the 8 sector archetypes and vice versa. We can do the same for the three factor decomposition. The pie charts these fits produce are shown in figure <ref type="figure">G1</ref>. The results are consistent with the sector breakdowns obtained from examining the constituent companies.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>G.3. Robustness</head><p>In general, a factorization analysis of the returns dataset is sensitive to the number of stocks in the dataset, the criteria applied for picking stocks, the period over which historical prices are obtained, and the frequency at which returns are computed. A robust macroeconomic analysis would therefore require a large number of stocks chosen without sampling bias, with returns calculated over the period of interest and sensitivity checked against the frequency of returns calculation. On the other hand, an equity fund manager faces a less daunting task for an analysis that is limited to the universe of her portfolio of stocks: either to find its canonical sectors, or to analyse the exposure of her holdings to the core sectors of the economy.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Appendix H. Canonical sector indices</head><p>The matrix <formula notation="TeX">C_{sf}</formula> in the decomposition <formula notation="TeX">R = RCW</formula> represents how the returns R of stocks s must be combined to make the canonical sector returns <formula notation="TeX">E_{tf} = \sum_s R_{ts} C_{sf}</formula>. Since a canonical sector is defined as a combination of stocks, an investment in the sector f can be made by buying a basket of constituent stocks s in proportions given by <formula notation="TeX">C_{sf}</formula> or through an index <formula notation="TeX">I_{tf}</formula>:</p><p>where p are stock prices suitably weighted by market cap or another divisor, as is common practice for indices <ref type="bibr">(Tagiliani and Guide 2009)</ref>. An unweighted index of this kind is shown in the bottom row of figure <ref type="figure">C1</ref> for results corresponding to the analysis described in this paper. Conversely, a pre-defined basket of stocks such as the S&amp;P 500&#174; can be unbundled to find its exposure to the canonical sectors.</p><p>With an investment strategy employing longs and shorts at the same time in correct proportions, it is conceivable to invest in, for example, the c-tech component of the S&amp;P 500&#174;.</p><p>The desirable features of an index include completeness, objectivity and investability <ref type="bibr">(Pastor et al. 2013)</ref>. The c-indices constructed using the ideas outlined here would not only be of value to investors through investment vehicles such as exchange-traded funds, futures, etc., but would also serve as important economic indicators.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>&#169; 2018 Informa UK Limited, trading as Taylor &amp; Francis Group</p></note>
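<div xmlns="http://www.tei-c.org/ns/1.0"><p>The relations in Appendix H between stock returns, canonical sector returns and basket exposures can be illustrated with a short sketch. Everything below is hypothetical: the returns are synthetic, the matrices C and W are random placeholders on the simplex rather than fitted archetypal-analysis results, and reading a basket's sector exposure as the product of W with the basket weights is an assumption that follows from approximating R by EW. <code lang="python"><![CDATA[
```python
import numpy as np

rng = np.random.default_rng(2)
T, S, n = 120, 30, 4                        # hypothetical sizes: time, stocks, sectors

R = rng.normal(scale=0.01, size=(T, S))     # synthetic returns, time x stock
C = rng.dirichlet(np.ones(S), size=n).T     # S x n placeholder, columns sum to 1
W = rng.dirichlet(np.ones(n), size=S).T     # n x S placeholder, columns sum to 1

# Canonical sector returns: each sector is a combination of stocks.
E = R @ C                                   # T x n

# Unbundling a pre-defined basket: with R ~ E W, the basket return R @ basket
# is approximately E @ (W @ basket), so W @ basket acts as its sector exposure.
basket = np.full(S, 1.0 / S)                # equal-weight basket (assumption)
exposure = W @ basket                       # length-n exposure vector
```
]]></code></p><p>With both the basket weights and the columns of W normalized to one, the exposures in this sketch are non-negative and sum to one.</p></div>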
		</body>
		</text>
</TEI>
