<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Smoothed Analysis of Discrete Tensor Decomposition and Assemblies of Neurons</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2018</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10086313</idno>
					<idno type="doi"></idno>
					<title level='j'>32nd Annual Conference on Neural Information Processing Systems</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Nima Anari</author><author>Constantinos Daskalakis</author><author>Wolfgang Maass</author><author>Christos Papadimitriou</author><author>Santosh Vempala</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We analyze linear independence of rank one tensors produced by tensor powers of randomly perturbed vectors. This enables efficient decomposition of sums of high-order tensors. Our analysis builds upon [BCMV14] but allows for a wider range of perturbation models, including discrete ones. We give an application to recovering assemblies of neurons.Assemblies are large sets of neurons representing specific memories or concepts. The size of the intersection of two assemblies has been shown in experiments to represent the extent to which these memories co-occur or these concepts are related; the phenomenon is called association of assemblies. This suggests that an animal's memory is a complex web of associations, and poses the problem of recovering this representation from cognitive data. Motivated by this problem, we study the following more general question: Can we reconstruct the Venn diagram of a family of sets, given the sizes of their ℓ-wise intersections? We show that as long as the family of sets is randomly perturbed, it is enough for the number of measurements to be polynomially larger than the number of nonempty regions of the Venn diagram to fully reconstruct the diagram.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Tensor decomposition is one of the key algorithmic tools for learning many latent variable models <ref type="bibr">[1,</ref><ref type="bibr">5,</ref><ref type="bibr">14,</ref><ref type="bibr">19]</ref>. In practice, tensor decomposition methods based on gradient descent and power method have been observed to work well <ref type="bibr">[9,</ref><ref type="bibr">16]</ref>. Theoretically, determining the minimum number of rank one components in the tensor decomposition is known to be NP-hard in the worst case <ref type="bibr">[11,</ref><ref type="bibr">12]</ref>, so usually tensor decomposition is analyzed in the average case. Several algorithms have been analyzed in the average case, where the input tensor is produced according to some probabilistic model, for example see Bhaskara et al. <ref type="bibr">[3]</ref>, De Lathauwer et al. <ref type="bibr">[7]</ref>, Goyal et al. <ref type="bibr">[10]</ref> as well as sumof-squares-based algorithms like Barak et al. <ref type="bibr">[2]</ref>, Ge and Ma <ref type="bibr">[8]</ref>, Hopkins et al. <ref type="bibr">[13]</ref>, Ma et al. <ref type="bibr">[18]</ref>.</p><p>The average case models studied in the literature generally fall into two categories. They either assume components of the tensor are fully random, i.e., generated from a known distribution (e.g., Gaussian), or they follow a smoothed analysis setting where some adversarially chosen instance is perturbed by random noise, see for example Bhaskara et al. <ref type="bibr">[3]</ref>, Goyal et al. <ref type="bibr">[10]</ref>, Ma et al. <ref type="bibr">[18]</ref>. Our work falls into the second category.</p><p>We build upon the framework used in Bhaskara et al. <ref type="bibr">[3]</ref> which reduces decomposing sums of rank one tensors to showing robust linear independence of related rank one tensors, by using Jennrich's algorithm, also known as Chang's lemma <ref type="bibr">[5,</ref><ref type="bibr">17]</ref>. The main departing point of our work is our smoothed analysis of linear independence, which we base on a new notion we call echelon trees, a generalization of Gaussian elimination and echelon form to high-order tensors, which might be of independent interest. We also get improved guarantees compared to Bhaskara et al. <ref type="bibr">[3]</ref> when the tensors are of high enough order.</p><p>The main feature of our analysis is that it can handle discrete perturbations. To illustrate, suppose that vectors X 1 , . . . , X m &#8712; R n are drawn from some unknown distribution and our goal is to recover them by (noisily) observing i X &#8855;&#8467; i for small values of &#8467;. Bhaskara et al. <ref type="bibr">[3]</ref> showed that up to constant factor blow-ups in &#8467; an efficient algorithm can do this as long as X &#8855;&#8467; i are linearly independent in a robust sense. Note that the set of vector tuples (X 1 , . . . , X m ) for which X &#8855;&#8467; 1 , . . . , X &#8855;&#8467; m are linearly dependent can be defined by polynomial equations, using determinants, and is therefore an algebraic variety. As long as m &#8810; n &#8467; , this variety will have dimension smaller than the whole space, so we expect most vector tuples to fall outside. Bhaskara et al. <ref type="bibr">[3]</ref> showed that starting from an arbitrary set of vectors X 1 , . . . , X m , by adding Gaussian noise, the new tuple will lie far away from this variety. Our analysis on the other hand, handles a much wider class of perturbations. For example, if each X i is independently chosen at random from a "large enough" discrete set such as the vertices of an arbitrary hypercube, we show that with very high probability the resulting tensors are linearly independent, again in a robust sense.</p><p>For our main application, described in the next section, it is important to assume components of the tensor come from a discrete set.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">Assemblies of neurons and recovering sparse Venn diagrams</head><p>Experiments by neuroscientists over the past three decades <ref type="bibr">[21]</ref> have identified neurons which are selectively activated when a real-world object<ref type="foot">foot_1</ref> is seen (or more generally sensed). It is now widely accepted <ref type="bibr">[4]</ref> that these neurons are part of large cell assemblies, stable sets of highly interconnected neurons whose firing (more or less simultaneous and in unison) is tantamount to a cognitive event such as the sensing or imagining of a person, or of a word or concept (hence the other common name "concept cells").</p><p>In a recent experiment <ref type="bibr">[15]</ref>, a neuron firing when one real-world entity is seen (say, the Eiffel tower) but not another (e.g., Barak Obama) may start firing on presentation of an image of Obama after a visual experience associating the two -for example, a picture of Obama in front of the Eiffel tower. This experiment has taught us that assemblies seem to be "mobile" and able to intersect in complex ways reflecting perceived varying degrees of associations between the corresponding entities. The stronger the association between the entities, the larger the intersection will be of the corresponding assemblies. During one's life, presumably a complex mesh of entities and associations will be created, of some degree of permanence, reflecting the sum total of one's cognitive experiences.</p><p>All said, this complex mesh of memories in somebody's brain can be modeled as a Venn diagram where each set or assembly consists of neurons firing for a particular concept, and each region of the Venn diagram, a minimal set obtained from an intersection of assemblies and their complements, represents a class of neurons behaving the same way towards all concepts.</p><p>Alternatively to the Venn diagram, one may record associations between assemblies in a hypergraph. The entities are the sets or nodes, and the edges reflect associations between the nodes. Furthermore, the hypergraph representing a person's state of knowledge can be adorned with edge weights reflecting the degree of affinity between a set of nodes (or equivalently, the size of the intersection of their corresponding sets).</p><p>This gives rise to several natural questions. The first question concerns reconstruction. How many experiments or observations are needed to identify the structure of cell assembly intersections, or in other words the Venn diagram? Here, we make two crucial assumptions. First, we assume that we can only measure the degree of association between a small number of entities or concepts. Second, the total number of classes of neurons (which behave similarly in response to stimuli) is bounded.</p><p>In the language of sets, we assume the number of non-empty regions of the Venn diagram is upper bounded by some number m and we can measure the sizes of k-wise intersections of any k of our n sets for 1 &#8804; k &#8804; &#8467; for some small &#8467;. We also allow for measurement errors.</p><p>Our main result here is as follows: As long as the cell assemblies are slightly randomly perturbed, and as long as the number of measurements, n 1 + n 2 + &#8226; &#8226; &#8226; + n &#8467; , is polynomially larger than the number of nonempty regions of the Venn diagram, m, we can fully reconstruct the Venn diagram. The perturbation of cell assemblies, a process which likely occurs naturally in the brain, is a mild assumption that we need in order to escape idiosyncratic cases. We solve the problem of reconstructing the Venn diagram by casting it as a tensor decomposition problem where the elements of the decomposition come from high order tensors of the vertices of the hypercube.</p><p>We also explore a simpler graph-theoretic model of assembly association, motivated by more recent experimental findings <ref type="bibr">[6,</ref><ref type="bibr">15]</ref>: Assume that all assemblies have the same size K, and that two assemblies are associated if their intersection is of size at least b, and are not associated if the intersection is less than another threshold a &lt; b; the results of De Falco et al. <ref type="bibr">[6]</ref>, Ison et al. <ref type="bibr">[15]</ref> suggest that a is 4% of K, while b is 8% of K. We show that an unreasonably rich and complex family of graphs can be realized by associations (roughly, any graph of degree O(K/a)).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2">Problem formulation</head><p>Suppose that we have a Venn diagram formed by some n sets S 1 , . . . , S n . We will assume that this Venn diagram has at most m nonempty regions. For our main application, each set S i corresponds to neurons that respond to a particular stimuli, so we are assuming that there are at most m classes of neurons. We let U denote the set of neuron classes. We also have a weight function w : U &#8594; R &#8805;0 representing the sizes of various classes. Each set S i &#8838; U is an assembly and w(S i ) = u&#8712;Si w(u) is its weight. Our main question is the following: Question 1. Given the sizes of &#8467;-wise intersections of S 1 , . . . , S n for some constant &#8467;, i.e., w(S i1 &#8745; &#8226; &#8226; &#8226;&#8745;S i &#8467; ) for all i 1 , . . . , i &#8467; &#8712; [n], can we recover the full Venn diagram of S 1 , . . . , S n , i.e., the weight of all intersections formed by these sets and their complements?</p><p>Our main result is that as long as the set memberships of elements are slightly perturbed to avoid worst case scenarios, and as long as n &#8467; is polynomially larger than m = |U|, the answer is yes and moreover there is an efficient algorithm that performs recovery. Our algorithm is also robust to inverse polynomial noise in the input.</p><p>We pose the question as a tensor decomposition problem in the following way: To each element u &#8712; U assign a vector &#967;(u) &#8712; {0, 1}</p><p>n , where &#967;(u) i indicates whether u &#8712; S i . Then the entries of the following tensor capture all &#8467;-wise intersections:</p><p>For simplicity of exposition, we assume weights are all equal to 1, but our results easily generalize, since each weight w(u) can be absorbed into &#967;(u) &#8855;&#8467; .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Notations and preliminaries</head><p>We denote the set {1, . . . , n} by [n]. For a matrix A, we denote the minimum and maximum singular values of A by &#963; min (A) and &#963; max (A). We use &#8226;, &#8226; to denote the standard inner product.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>We denote the tensor product of two vectors</head><p>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>By abuse of notation we identify tensors</head><p>We also use the notation T (&#8226;, v 2 , . . . , v &#8467; ) to denote the multilinear map from R n1 to R given by:</p><p>In general we can use &#8226; in place of any of the arguments of T . So for example T (&#8226;, &#8226;, v 3 , . . . , v &#8467; ) is interpreted as living in R n1 &#8855; R n2 . With a slight abuse of notation we let some of the inputs of T be merged together by tensor operations. In other words we let T (v 1 &#8855; v 2 , v 3 , . . . , v &#8467; ) be the same as T (v 1 , . . . , v &#8467; ).</p><p>We use e 1 , . . . , e n to denote the standard basis of R n . For a tuple of coordinates I = (i 1 , . . . , i &#8467; ) we let e I denote e i1 &#8855; &#8226; &#8226; &#8226; &#8855; e i &#8467; . With this notation, the entry corresponding to coordinate (i 1 , . . . , i &#8467; ) of a tensor T can be written as T (e I ) = T (e i1 , . . . , e i &#8467; ).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Tensor decomposition</head><p>Suppose that we have a finite universe U of elements with a vector &#967;(u) &#8712; R n assigned to each u &#8712; U. Our goal is to recover &#967;(u)'s by observing u &#967;(u) &#8855;&#8467; . A necessary condition is for &#967;(u) &#8855;&#8467; 's to be linearly independent, otherwise it is an easy exercise to show that there is another decomposition u (c u &#967;(u)) &#8855;&#8467; for some positive weights {c u } u&#8712;U not all equal to 1. The framework introduced by Bhaskara et al. <ref type="bibr">[3]</ref> shows that linear independence is not just necessary, but up to a constant factor blow-up in &#8467;, it is sufficient. A more detailed account is given in supplementary materials.</p><p>We also use another trick from this framework which allows us to replace symmetric tensors &#967;(u) &#8855;&#8467; with asymmetric ones. If we divide the coordinates [n] into &#8467; roughly-equal sized parts I 1 , . . . , I &#8467; and define &#967;(u) (i) to be the projection of &#967;(u) onto the i-th part, then &#967;(u) (1) </p><p>So linear independence of these tensors proves linear independence of &#967;(u) &#8855;&#8467; 's. The advantage of this trick is that when we introduce perturbations to &#967;(u) (1) , . . . , &#967;(u) (&#8467;) , we do not have to worry about consistently perturbing the same coordinates and we can potentially use independent randomness. For simplicity of notation, from here on, we use n (as opposed to n/&#8467;) to denote the dimension of each &#967;(u) (i) . So now we can work with the following tensor:</p><p>Our main result is that the components of this sum are robustly linearly independent, assuming the components &#967;(u) (i) are randomly perturbed. We remark that this implies robust linear independence of {&#967;(u) &#8855;&#8467; } u&#8712;U as well, so we can recover them from the sum u&#8712;U &#967;(u) &#8855;&#8467; .</p><p>We first define our model of perturbations: Definition 2. Assume that a vector X &#8712; R d is drawn according to some distribution D. We call D a (&#948;, p)-nondeterministic distribution if for every coordinate i &#8712; [d] and any interval of the form (t -&#948;, t + &#948;) we have</p><p>For a set of random vectors {X i }, we call their joint distribution (&#948;, p)-nondeterministic iff their concatenation is (&#948;, p)-nondeterministic. In our setting, we will assume that for each u &#8712; U, the vectors &#967;(u) (1) , . . . , &#967;(u) (&#8467;) are chosen from a (&#948;, p)-nondeterministic distribution.</p><p>Two examples of (&#948;, p)-nondeterministic perturbations can be obtained as follows: Example 3. Suppose that each &#967;(u) (i) is chosen adversarially from {0, 1}</p><p>n , but then each bit is independently flipped with some probability q. This distribution is ( 1 2 , max(q, 1 -q))-nondeterministic. Example 4. Suppose that each &#967;(u) (i) is chosen adversarially from R n , but a standard Gaussian noise of total variance &#961; 2 is added to each one. Then for any &#948; &gt; 0, this distribution is (&#948;, erf( &#8730; n&#948;/&#961;))-nondeterministic.</p><p>Gaussian perturbations are the model used in Bhaskara et al. <ref type="bibr">[3]</ref>. Our main result is the following: Theorem 5. Assume that for each u &#8712; U, the concatenation of the n-dimensional vectors {&#967;(u</p><p>is drawn from a distribution D that is (&#948;, p)-nondeterministic. Let A be the matrix whose columns are given by flattened a(u) = &#967;(u) (1) </p><p>This theorem shows how the (&#948;, p)-nondeterministic property ensures robust linear independence.</p><p>To prove it, we use a strategy similar to Bhaskara et al. <ref type="bibr">[3]</ref>, by proving a bound on the leave-one-out distance. The leave-one-out distance is closely related to &#963; min (A), and only differs from it by a factor of at most |U| &#8804; n &#8467;/2 <ref type="bibr">[3]</ref>. It is enough to prove that for any fixed u</p><p>with probability at least 1 -n &#8467; p (1-c)n . Here dist measures the distance of a vector to the closet point in a linear subspace. A union bound implies the leave-one-out distance for all u is large. As in Bhaskara et al. <ref type="bibr">[3]</ref>, we simplify the analysis by treating span{a(u &#8242; )} u &#8242; &#8712;U -{u} as a generic linear subspace V &#8838; (R n ) &#8855;&#8467; , and only using the fact that dim(V ) &lt; (cn) &#8467; . Noting that</p><p>, it is enough to prove the following Lemma 6. Assume that vectors &#967; (1) , . . . , &#967; (&#8467;) are drawn according to a (&#948;, p)-nondeterministic distribution. Further assume that V &#8838; (R n ) &#8855;&#8467; is a subspace of dimension at most (cn) &#8467; . Then</p><p>In the rest of this section we prove lemma 6.</p><p>Let W = V &#8869; &#8838; (R n ) &#8855;&#8467; be the linear subspace of all tensors that vanish on V , or in other words have zero dot product with every member of V . Then dim(W ) &#8805; (1 -c &#8467; )n &#8467; . We will show that with high probability there is an element T &#8712; W such that T &#8804; n &#8467;/2 and</p><p>This implies that</p><p>and the proof would be complete.</p><p>We find it instructive to first prove this fact for &#8467; = 1 and then for general &#8467;.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">The case &#8467; = 1</head><p>Proof of lemma 6 for &#8467; = 1. We will generate a sequence T 1 , . . . , T dim(W ) &#8712; W , such that T i &#8734; &#8804; 1 for all i. This ensures that T i &#8804; &#8730; n. We will then show that</p><p>We will first pick T 1 to be any nonzero element of W . By rescaling, we can assume that T 1 &#8734; = 1 and that T 1 (e j ) = 1 for some j. Let us call j the pivot point of T 1 . By rearranging the coordinates we can assume without loss of generality that j = 1. In other words T 1 (e 1 ) = 1 and</p><p>In order to pick T 2 , consider the subspace {T &#8712; W | T (e 1 ) = 0}. This subspace has dimension at least dim(W ) -1, and we can pick T 2 to be any nonzero element of it. As before, we can without loss of generality and by scaling assume that T 2 (e 2 ) = 1 and T 2 &#8734; = 1.</p><p>When picking T i , we pick any nonzero element of {T &#8712; W | T (e j ) = 0 &#8704;j &lt; i} and by rescaling and rearranging the coordinates assume that T i (e i ) = 1 and T i &#8734; = 1. Thus we make sure that the pivot point of T i is i. A keen observer would notice that T 1 , . . . , T dim(W ) can also be obtained by a modified Gaussian elimination procedure run on some basis of the space W . Now that we have fixed T 1 , . . . , T dim(W ) it remains to prove eq. (1).</p><p>To do this, let us fix the coordinates of the random vector &#967; = &#967; (1) one-by-one, starting from &#967; n and going backwards to &#967; 1 . Once we have fixed &#967; dim(W )+1 , . . . , &#967; n we can argue about the probability of the event |T dim(W ) (&#967;)| &lt; &#948;. Since T dim(W ) (e i ) = 0 for i &lt; dim(W ), we have</p><p>Because &#967; is distributed according to a (&#948;, p)-nondeterministic distribution, this event happens with probability at most p. In other words P[|T dim(W ) (&#967;)| &lt; &#948;] &#8804; p. If this event does not occur, we are already done. Otherwise we can condition on &#967; dim(W ) , . . . , &#967; n , and look at the event |T dim(W )-1 (&#967;)| &lt; &#948;. Once we condition on &#967; dim(W ) , this event becomes independent of the previous event and we can again upperbound its probability by p. So we have</p><p>By continuing this, in the end we get</p><p>which is the complement of eq. ( <ref type="formula">1</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">The general case</head><p>Here we describe a structure that we name echelon tree. This definition is motivated by the Gaussian elimination procedure for matrices that produces an echelon form. Our definition can be seen as a generalization of this form for tensor spaces.</p><p>We first describe an index tree for R n1&#215;&#8226;&#8226;&#8226;&#215;n &#8467; : Consider an abstract rooted tree T of height &#8467; where the nodes at level k are labeled by different partial indices from</p><p>; the root has the empty label and resides at level 0, and all leaves reside at level &#8467;. We require the indices to be consistent with the tree structure, i.e., all children (and by extension descendants) of a node labeled I = (i 1 , . . . , i k ) must contain I as the prefix of their label. We further assume that T is ordered, i.e., each node of T has an ordering over its children. This enables us to talk about post-order traversal of the tree, a linear ordering of the nodes of the tree, which we denote by the binary relation &#8826;. For two nodes labeled I and J, we let I &#8826; J exactly when (i) I is a descendant of J or (ii) there are ancestors I &#8242; , J &#8242; of I, J with a common parent who places I &#8242; before J &#8242; (according to the ordering induced by the parent on its children). Definition 7. An index tree for R n1&#215;&#8226;&#8226;&#8226;&#215;n &#8467; is a height &#8467; tree T of partial indices together with a post-traversal ordering &#8826; on its nodes as described above.</p><p>We emphasize that nodes of an index tree have different labels, so we consider the partial indices the same as the nodes. For example, an index tree of height 1 is identical to an ordered list i (1) , . . . , i (s) of elements in [n 1 ], with no repetitions allowed. Next we define an echelon tree. Definition 8. An echelon tree is an index tree where each leaf I is additionally labeled by an element T I &#8712; R n1&#215;&#8226;&#8226;&#8226;&#215;n &#8467; . We require that T I (e I ) = 0 and that for every node J that appears before I in the post-order traversal, i.e., J &#8826; I, the following identity to hold:</p><p>Note that the identity in the above definition is requiring an entire sub-array of T I to be zero. For example a height 1 echelon tree is a list of unique indices i (1) , . . . , i (s) of [n 1 ] together with vectors T (1) , . . . , T (s) &#8712; R n1 such that T (j) has zeros in the i (1) , . . . , i (j-1) entries and has a nonzero i (j)th entry. Notice the similarity to the echelon form obtained by Gaussian elimination in a matrix. In particular, for a height 1 echelon tree, the vectors T (1) , . . . , T (s) must be linearly independent.</p><p>We say that T is an echelon tree for the linear subspace W &#8838; R n1&#215;&#8226;&#8226;&#8226;&#215;n &#8467; if for all leaves I, we have T I &#8712; W . Notice that we can collapse or flatten consecutive levels of an echelon tree, and the result would remain an echelon tree. In this operation, nodes of a particular level i are removed, and each orphaned node of level i + 1 is assigned to its grandparent (of level i -1). We then treat the indices as coming from</p><p>i.e., we merge the i, i + 1-st level indices. This also corresponds to partially flattening tensors T I and considering them as elements of R n1&#215;&#8226;&#8226;&#8226;&#215;nini+1&#215;...n &#8467; . It is easy to check that these operations preserve the properties in definition 8: Fact 9. Collapsing an echelon tree at level i produces an echelon tree.</p><p>The main question we would like to address here is how large of an echelon tree can be constructed for a subspace W . For example, for R n1&#215;&#8226;&#8226;&#8226;&#215;n &#8467; one can get a full tree, where nodes at level i -1 have branching factor n i , by simply placing the standard basis for R n1&#215;&#8226;&#8226;&#8226;&#215;n &#8467; at the leaves. We measure the size of a tree by its fractional branching factor. Definition 10. An echelon tree T has fractional branching (&#945; 1 , . . . , &#945; &#8467; ) &#8712; [0, 1] &#8467; if each node I at level i -1 has at least &#945; i n i children. For a single number &#945; &#8712; [0, 1], we say T has fractional branching &#945; when it has fractional branching (&#945;, &#945;, . . . , &#945;).</p><p>Note that fractional branching &#945; implies that the tree has at least &#945; &#8467; n 1 . . . n &#8467; leaves. On the other hand, repeated applications of fact 9 on the echelon tree would produce a height 1 echelon tree, and we have already observed that the vectors assigned to the leaves in such a tree must be linearly independent. So this implies that dim(W ) &#8805; &#945; &#8467; n 1 . . . n &#8467; . There is a partial inverse to this statement:</p><p>then there is an echelon tree with fractional branching 1 -c for W . However, this fact is not "robust", since the elements of W assigned to the leaves can have arbitrarily small or large entries. Instead we prove the following: Let us see first see why theorem 11 is enough to prove lemma 6.</p><p>Proof of lemma 6 for general &#8467;. Note that T I &#8734; = 1 implies that T I &#8804; n &#8467;/2 . So it suffices to show that T I (&#967; (1) , . . . , &#967; (&#8467;) ) &#8805; &#948; &#8467; for some I with high probability.</p><p>Let us say that an echelon tree is x-large when |T I (e I )| &#8805; x for all leaves I. Theorem 11 guarantees that the echelon tree produced by it is 1-large.</p><p>Our strategy is to fix &#967; (&#8467;) , &#967; (&#8467;-1) , . . . , &#967; (1) in that order, and simultaneously reduce the height of our echelon tree by 1 each time. When we fix &#967; (&#8467;) , we can get a smaller echelon tree in the following way: For each leaf I in the echelon tree, consider the reduced tensor T I (&#8226;, . . . , &#967; (&#8467;) ) &#8712; R n1&#215;&#8226;&#8226;&#8226;&#215;n &#8467;-1 as a candidate tensor for the parent of I. Now let J be a node of level &#8467; -1. Its children have produced candidate tensors for J. Pick the candidate T with the highest |T (e J )| to be T J . In this way we have removed the lowest level of the tree and have assigned appropriate tensors to the new leaves.</p><p>Our goal is to prove that if we start with an x-large echelon tree, then with high probability the next echelon tree is &#948;x-large. Inductively this would prove that with high probability over the choice of &#967; (1) , . . . , &#967; (&#8467;) , we have T I (&#967; (1) , . . . , &#967; (&#8467;) ) &#8805; &#948; &#8467; for some leaf I of the original echelon tree, completing the proof.</p><p>For a fixed node J of level &#8467; -1, we want to show that the quantity T I (e J , &#967; (&#8467;) ) is at least &#948;x in magnitude for some child I of J. But this is very similar to the &#8467; = 1 case of lemma 6, which we have already proved. The difference is that the pivots are not necessarily equal to 1, but are at least x in magnitude. This implies that</p><p>The number of nodes at level &#8467; -1 is at most n &#8467;-1 , so by a union bound, we get that with probability at least 1 -n &#8467;-1 p (1-c)n , the tree produced at the next level is &#948;x-large (the union bound is over fewer than n &#8467;-1 events, each corresponding to one J). Induction completes the proof. Now we give a proof of theorem 11. We use induction to prove a stronger version. Theorem 11 will be a corollary of the following by setting</p><p>then there is an echelon tree for W with fractional branching (&#945; 1 , . . . , &#945; n ) such that for each leaf I we have T I &#8734; = 1 and |T I (e I )| = 1.</p><p>Proof. We use induction on &#8467;. For the base case of &#8467; = 1, we have &#945; 1 &#8805; dim(W )/n 1 and we want an echelon tree with branching factor &#945; 1 n 1 &#8804; dim(W ). We have already proved this case. Now assume we have proved the statement for &#8467; -1 and want to prove it for &#8467;. Consider partially flattening the tensor space by merging the first two dimensions, i.e., considering W as a subspace of R n1n2&#215;n3&#215;&#8226;&#8226;&#8226;&#215;n &#8467; . Let us fix &#946; &#8712; [0, 1] such that the premise of the induction hypothesis holds and we can get an echelon tree of height &#8467; -1 with fractional branching (&#946;, &#945; 3 , . . . , &#945; &#8467; ). Nodes at level 1 of this tree have indices in [n 1 n 2 ], and there are &#946;n 1 n 2 of them. Considering these indices as living in [n 1 ] &#215; [n 2 ], by the pigeonhole principle at least &#946;n 1 n 2 /n 1 = &#946;n 2 of them will have the same first component; let's call this component i 1 &#8712; [n 1 ]. We can now extract the subtrees of these &#946;n 2 elements and join them into an echelon tree of height &#8467;. The common parent of these nodes will have index i 1 . So far we have constructed an echelon tree of height &#8467; with fractional branching (1/n 1 , &#946;, &#945; 3 , . . . , &#945; &#8467; ). Now consider the subspace {T &#8712; W | T (e i1 , &#8226;, . . . , &#8226;) = 0}. We think of W as living in R (n1-1)n2&#215;n3&#215;&#8226;&#8226;&#8226;&#215;n &#8467; , since index i 1 has been eliminated from the first dimension. We can again apply the induction hypothesis to this space and as long as the premise holds obtain an echelon tree of height &#8467; -1 with fractional branching (&#946;, &#945; 3 , . . . , &#945; &#8467; ). We can apply the pigeonhole principle again to find &#946;(n 1 -1)n 2 /(n 1 -1) = &#946;n 2 level-1 nodes having the same first index i 2 . We extract a height &#8467; echelon tree from them and join this with the height &#8467; echelon tree we already have. At the end we will have an echelon tree with fractional branching (2/n 1 , &#946;, &#945; 3 , . . . , &#945; &#8467; ).</p><p>Suppose we have repeated this procedure &#947;n 1 -1 many times and currently have a height &#8467; echelon tree with fractional branching (&#947;, &#946;, &#945; 3 , . . . , &#945; &#8467; ). As long as the premise of the induction hypothesis holds we can grow this echelon tree. The current subspace is</p><p>So the premise of the induction hypothesis holds as long as</p><p>This means that as long as</p><p>we can grow the echelon tree.</p><p>To finish the proof, we set &#946; = &#945; 2 , which means that while &#947; &lt; &#945; 1 , we can grow the echelon tree. So when this procedure stops we have an echelon tree with fractional branching (&#945; 1 , . . . , &#945; &#8467; ).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Implications for the main question</head><p>Our result, theorem 5, together with results from <ref type="bibr">[3]</ref> (see the supplementary material), imply that under very mild assumptions we can recover S 1 , . . . , S n from their &#8467;-wise intersections as long as |U| &#8804; n &#920;(&#8467;) . These mild assumptions are necessary to prevent adversarially constructed examples that have no hope of unique recovery.</p><p>To get a sense of the mild assumptions that we need, let us discuss the parameters that appear in theorem 5. We assume that &#8467; is a constant that does not grow with n. We can take c to be some fixed constant as well. For example 1/2, or even 1/ l &#8730; 2. If we perturb our cell assemblies according to example 3, i.e., flip assembly memberships for each neuron class and assembly pair with probability q, how large of a q do we need for the conditions of theorem 5 and <ref type="bibr">[3]</ref> to be satisfied? The distribution we get for &#967;(u)s is going to be (1/2, 1 -q)-nondeterministic as long as q &#8804; 1/2. So &#948; = 1/2 is a constant. The only condition we need is now for the failure probability to be small. This roughly translates to n O(&#8467;) (1 -q) (1-c)n &#8810; 1, which will be satisfied for q = &#8486;(log n/n). In other words, we only have to flip each coordinate of &#967;(u) with probability O(log n/n). On average, each neuron's membership will be changed in about O(log(n)) of the assemblies, which is a very small fraction of the assemblies. For slightly larger values of q, e.g., q = n &#491;-1 , the probability of failure becomes exponentially small similar to <ref type="bibr">[3]</ref>.</p><p>We also assumed that w(u) = 1 for all u &#8712; U. In general this is not needed. As long as the weights w(u) are in a range whose upper bound is at most a polynomially bounded factor larger than the lower bound, we can absorb the weights into the vector &#967;(u) and the running time and accuracy will only suffer by a polynomially bounded factor.</p><p>We also remark that recovering a {0, 1} n vector within an additive error of 1/n is the same as exact recovery (by rounding the coordinates). So by setting the recovery error (see supplementary material) to 1/n we get exact recovery.</p><p>Finally, we remark that even though we are mostly interested in the case where &#8467; = O(1), our dependencies on &#8467; seem to be better than the results of <ref type="bibr">[3]</ref> even in the setting of Gaussian perturbations. In particular, our running time (as well as our tolerance for error) grows polynomially with n &#8467; , whereas the running time of <ref type="bibr">[3]</ref> grows with n 3 &#8467; . When adding Gaussian noise of total variance &#961; 2 as in example 4, we can treat our vectors as coming from a (O(&#961;/ &#8730; n), 1/2)-nondeterministic distribution.</p><p>This means our probability of failure will be at most n 2&#8467; /2 (1-c)n . To have a fair comparison, we need to allow for the number of components to be roughly half the total dimension, so we need to let c = 1/ &#8467; &#8730; 2 &#8771; 1 -&#920;(1/&#8467;). So the probability of failure will be roughly exp(O(&#8467; log n) -&#8486;(n/&#8467;)). For large enough values of &#8467; this is much better than the guarantee of exp(-&#920;(n 1/3 &#8467; )) of <ref type="bibr">[3]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Association graphs and the soft model</head><p>When the number of observations is smaller than what is needed for reconstruction, we can still ask whether there exists some Venn diagram that is consistent with the observations. Which classes of weighted graphs (or hypergraphs) can be represented by Venn diagrams?</p><p>Interestingly, a similar model was formulated almost three decades ago, motivated by quantum mechanics and spin glass systems, and a mathematical object called correlation polytope was defined to frame that investigation <ref type="bibr">[20]</ref>. It is not hard to show that membership in the polytope is an NP-hard problem and natural optimization variants of it are hard to approximate.</p><p>In this section we formulate a promise version of the problem where either the intersection is above a certain threshold (corresponding to association) or below another (corresponding to non-association) which seems to be more tractable.</p><p>More precisely, we are given a graph that is unweighted. The nodes still stand for assemblies of neurons, all of the same size K, out of a universe of N neurons, and the edges signify association; the difference is that, in this model, if two assemblies are associated then they have an intersection of size at least a; whereas if they are not, then their intersection is at most b. The intended relationship between these numbers is that N is much larger than K (we take it to be a power of K), and K is in turn much larger than a, while a is quite a bit larger than b. To fix ideas, in the sequel we take N = K 2 and b &lt; a small constant fractions of K; in the experiment in <ref type="bibr">[6,</ref><ref type="bibr">15]</ref> a and b are found to be about 8% and 4% of K, respectively. We call a graph G = (V, E) representable with parameters (N, K, a, b) if every node of G can be associated with a set of K neurons such that for any two adjacent nodes the corresponding sets have intersection at least a, while for any two nonadjacent nodes the corresponding sets have intersection at most b. The question is, which graphs are representable? Theorem 13. Any graph of maximum degree at most 2K/a is representable, and so is any tree of maximum degree 2K 2 /a 2 .</p><p>The 2K/a bound follows from the fact that the edges of a regular Eulerian graph can be decomposed into cycles, while the 2K 2 /a 2 follows from the theory of block designs. Recalling that a is a small fraction of K, we conclude that rather rich and complex "association graphs" can be represented in principle. But can these sophisticated combinatorial constructions be carried out with surgical precision in the wet chaos of the brain?</p><p>Here is a more realistic framework which we call the soft model: Suppose that we are given an association graph G = (V, E). We wish to determine whether a model of G exists, i.e., |V | sets corresponding to nodes of G whose pairwise intersections realize G according to the rules above involving a and b. We wish to create sets of expected size K representing the nodes, starting from the universe of neurons [N ] and executing instructions of the following form (in the following C, C 1 , C 2 are previously constructed sets, and A is the set being constructed):</p><p>where by S(C, p) we denote the result of sampling each node in set C with probability p -a simple and realistic enough primitive. The question is, which graphs can be realized in such a way that the intended relations between the nodes and their intersections are not corrupted, with high enough probability, by the randomness of the process? We can show the following:</p><p>Theorem 14. Any graph with maximum degree 1 e &#8226; K a can be realized in the soft model with high probability.</p><p>by |U| &#8804; n &#8970; &#8467; 3 &#8971; /2. Finally, note that condition 3 is also automatically satisfied with very high probability for Gaussian perturbations and also our model, in which we sample vectors from the hypercube {0, 1} n . We assume that D is not only (&#948;, p)-nondeterministic but also that it satisfies condition 3 with high probability.</p><p>In section 3 we focus only on proving condition 1 in theorem 16. To make the notation simpler we replace &#8467; by (&#8467; -1)/2, and assume a(u)s are tensors of &#8467; factors. In order to bound the condition number of A in theorem 16, we need to lowerbound the minimum singular value and upperbound the maximum singular value of A. An upperbound on &#963; max (A) is readily given by condition 3 of theorem 16. The matrix A has columns with norms bounded by C and therefore &#963; max (A) &#8804; n &#8467;/2 C.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>Preprint. Work in progress.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_1"><p>Or person, these are commonly known as "Jennifer Aniston neurons".</p></note>
		</body>
		</text>
</TEI>
