<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Face familiarity detection with complex synapses</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>01/01/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10432369</idno>
					<idno type="doi">10.1016/j.isci.2022.105856</idno>
					<title level='j'>iScience</title>
<idno>2589-0042</idno>
<biblScope unit="volume">26</biblScope>
<biblScope unit="issue">1</biblScope>					

					<author>Li Ji-An</author><author>Fabio Stefanini</author><author>Marcus K. Benna</author><author>Stefano Fusi</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Highlights: A memory system with complex synapses can recognize a large number of faces. The number of recognizable faces grows almost as the square of the number of neurons. Memory systems with complex synapses outperform those with simple synapses. Complex synapses have distinctive signatures that are testable in experiments.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>INTRODUCTION</head><p>Synaptic memory is a complex phenomenon, which involves intricate networks of diverse biochemical processes that operate on different timescales. We recently showed that this complexity can be harnessed to greatly increase the memory capacity <ref type="bibr">1,</ref><ref type="bibr">2</ref> in situations in which the synaptic weights are stored with limited precision. More specifically, we proposed a complex synaptic model which is characterized by m dynamical variables. These variables might correspond to different biochemical processes that operate on different timescales. If the interactions between these processes are properly tuned, the memory capacity of a population of synapses, estimated by an ideal observer who has access to all the synaptic weights, can increase almost linearly with its size (i.e., the number of synapses $N_{\mathrm{syn}}$), even when both m and the number of states of each variable grow no faster than logarithmically with $N_{\mathrm{syn}}$. This is the optimal scaling under some conditions (see <ref type="bibr">2</ref>) and significantly better than what can be achieved by employing simple synapses characterized by a single variable. <ref type="bibr">[3]</ref><ref type="bibr">[4]</ref><ref type="bibr">[5]</ref> These previous studies on complex synapses focused on the problem of storing a large number of random and uncorrelated memories. Only recently have complex synapses started to be employed in more realistic problems (e.g., see <ref type="bibr">6</ref>) in which memories are structured and correlated. Here we show that synaptic complexity can also be important in a real-world problem, face familiarity detection. The task is particularly difficult because we require that each face is presented only once (one-shot learning) and that it remains recognizable for a long time. Moreover, each face is required to be recognizable even when a different pose is used as a memory cue. 
This is a typical situation in which a proper pre-processing of the visual stimuli combined with the complexity of the synapses can lead to a significant advantage in terms of memory capacity. The images of the faces that we used in our simulations are pre-processed by a simulated visual system which has been trained to report the identity of the person portrayed in the image. We then extracted the principal components (which can also be implemented by a neural network, see e.g. <ref type="bibr">7</ref> ) and binarized these representations. The pre-processed representations of different faces are approximately decorrelated, although a downstream readout can still retain the ability to generalize to different poses (i.e., different poses of the same face have similar representations). The decorrelation step is important to make the representations suitable for the memory system that stores the information about face familiarity. Modeling this process of ''recoding'' is of fundamental importance and it has been the subject of several studies which started with the work of David Marr in the 70s <ref type="bibr">8</ref> and continued in the 80s and in the 90s with the first memory models of the hippocampus. <ref type="bibr">[9]</ref><ref type="bibr">[10]</ref><ref type="bibr">[11]</ref><ref type="bibr">[12]</ref><ref type="bibr">[13]</ref> In these models the representations of memories are first orthogonalized to become more separable and hence facilitate the storage and reconstruction of memories. This orthogonalization process can be explicitly modeled as a process of compression <ref type="bibr">10,</ref><ref type="bibr">[14]</ref><ref type="bibr">[15]</ref><ref type="bibr">[16]</ref><ref type="bibr">[17]</ref><ref type="bibr">[18]</ref><ref type="bibr">[19]</ref> , which leads to the most efficient decorrelated representations for memory storage. Compression is also an important underlying principle of several computational processes. 
<ref type="bibr">20,</ref><ref type="bibr">21</ref> The pre-processed representations are then stored in a neural circuit that contains the complex synapses proposed in. <ref type="bibr">2</ref> These are characterized by dynamical variables that operate on multiple timescales. The fast ones can rapidly store information about a new visual stimulus such as a face, even when the stimulus is shown only once. This information is then progressively transferred to the slow variables, which can retain it for a long time. Because of these slow variables, which influence the synaptic efficacy, the older memories are protected from overwriting due to the storage of new faces. Synapses that are described by a single dynamical variable can either learn quickly if they are fast, but then they also forget quickly, or they can retain memories for a long time if they are slow, but then they cannot learn in one shot and require multiple exposures to the same face. This plasticity-rigidity dilemma concerns a very broad class of realistic synaptic models whose dynamical variables have a limited precision. <ref type="bibr">3,</ref><ref type="bibr">5,</ref><ref type="bibr">22</ref> Our memory benchmark for the complex synapses, familiarity detection (sometimes called familiarity discrimination or novelty detection), is an important component of recognition memory, which has been widely studied in humans and in animals. In particular, familiarity detection refers to the ability to rapidly memorize new items and report at a later time whether we have encountered them or not. In the case of faces, we would report that a face of a person is familiar if we experience the sense that we have already encountered that person in the past. The second component of recognition memory is recollection, which corresponds to the retrieval of the details of the individual (e.g. the name) and the episodic memories associated with that person. 
We can often experience a sense of familiarity without being able to recollect the details about an encountered individual. Familiarity detection, which is the focus of this article, has been studied in the famous and remarkable experiment by <ref type="bibr">Standing,</ref><ref type="bibr">23</ref> in which he showed that it is possible to recognize a surprisingly large proportion of 10,000 images that are flashed on a screen only once and for a brief time. The subjects were asked whether they had seen an image or not, which is one way of assessing the familiarity of an image. Although familiarity detection is only one component of recognition memory, in the article we will use the verb 'recognize' to indicate the ability of a subject to report whether a visual stimulus had already been seen or not. The result of the Standing experiment is even more remarkable when one considers that more recent studies proved that subjects could memorize many details about each image. <ref type="bibr">24,</ref><ref type="bibr">25</ref> The neural substrate of recognition memory is unknown, although multiple lesion studies indicate that the hippocampus and perirhinal cortex play an important role. <ref type="bibr">[26]</ref><ref type="bibr">[27]</ref><ref type="bibr">[28]</ref> The role of each area is controversial as for some investigators both the hippocampus and the perirhinal cortex contribute to recollection (memory retrieval) and familiarity <ref type="bibr">28,</ref><ref type="bibr">29</ref> and for others the hippocampus supports recollection only, and perirhinal cortex supports familiarity. <ref type="bibr">26,</ref><ref type="bibr">27</ref> One of the problems in the interpretation of these studies is that it is difficult to separate the contribution that each area gives to familiarity and recollection because when a memory can be recollected it can always be recognized. 
Another problem is that the role of these two areas differs depending on the nature of the memories (e.g., recognition of novel faces is intact in patients with lesioned hippocampus at a short retention interval, instead recognition memory for words, buildings, inverted faces, and famous faces is impaired <ref type="bibr">30</ref> ), on the length of the retention interval (for intervals of a few minutes or longer the hippocampus is certainly important for familiarity <ref type="bibr">28,</ref><ref type="bibr">31</ref> ) and on whether the memory is presented in a particular context or in isolation (perirhinal cortex is more important for the recognition of items in isolation whereas the hippocampus is more important when there is a contextual or associational component <ref type="bibr">26</ref> ). In the Discussion we will describe a possible interpretation of our model.</p><p>There are several biology-inspired computational models studying different aspects of recognition memory: some neural network models following the complementary learning systems approach were proposed to tease apart the hippocampal and neocortical contributions to recognition memory <ref type="bibr">32,</ref><ref type="bibr">33</ref> ; other models were concerned with the synaptic plasticity (learning) rules in the perirhinal cortex. <ref type="bibr">34</ref> Finally, there are models that stress the distinct roles of familiarity and recollection in retrieving memories. <ref type="bibr">35</ref> Analytical estimates of familiarity memory capacity showed that in the case of random uncorrelated patterns, the number of memories that can be correctly recognized as familiar can scale quadratically with the number of neurons N in a recurrent network. <ref type="bibr">36</ref> Not too surprisingly, this is a much better scaling than the linear scaling of the Hopfield model, <ref type="bibr">37</ref> in which random memories are actually reconstructed (see also the Discussion). 
The scaling for memory reconstruction is markedly worse and can be as low as $\sqrt{N}$ when the patterns representing the memories are correlated. <ref type="bibr">34</ref> These computational models can replicate some interesting aspects of experiments on the capacity of human recognition memory. <ref type="bibr">38</ref> We constructed a model for recognition memory that incorporates complex synapses characterized by variables that have limited dynamical range (number of distinguishable states). We show that a simple neural circuit designed to reconstruct the memorized face can take advantage of the complexity of synapses and can efficiently store a large number of faces. In particular, we show that the number of faces that can be successfully recognized as familiar scales approximately quadratically with the number of neurons, or linearly with the number of synapses. This is the same scaling achieved in <ref type="bibr">36</ref>, in which synaptic weights could be stored with unlimited precision. Moreover, this scaling is similar to the one predicted for random patterns in <ref type="bibr">2</ref>, despite the fact that our pre-processing system does not completely decorrelate the patterns that represent different faces. Importantly, the network can recognize a face even when it is presented in a different pose, and the scaling is only slightly worse than in the case in which the exact same picture of the face is presented for familiarity testing. This ability to generalize is a distinctive feature of recognition memory: it is observed in experiments and it plays an essential role in any machine learning system that relies on novelty signals to speed up learning. <ref type="bibr">39</ref> We then compared the performance of the recognition system with complex synapses to one with the same architecture but with simple synapses characterized by a single dynamical variable. 
The number of synapses is chosen so that the total number of synaptic variables would be the same in the two systems and of course the pre-processing system is exactly the same in the two cases. We show that the system with complex synapses outperforms the one with simple synapses, indicating that complexity provides the neural system with a clear computational advantage.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>RESULTS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Face familiarity detection system</head><p>Our face familiarity detection system consists of three modules: an input (embedding) module, a memory module, and a readout (detection) module (see Figure <ref type="figure">1A</ref> and model details in "face familiarity detection system" in STAR Methods and Table <ref type="table">1</ref> summarizing the notations).</p><p>The embedding module consists of a deep convolutional neural network (pre-trained on several face tasks), taking pre-processed face images from VGGFace2 <ref type="bibr">40</ref> as inputs (see "face data set" in STAR Methods). The activity of the penultimate layer (adjacent to the classification layer) was extracted, further decorrelated (principal component analysis), and binarized. The top N ($N \le 2048$) binarized principal components are taken as the binary face pattern $x = [x_1, \ldots, x_N]^T$, serving as the activity of the N input neurons for the memory module. Despite the similarities between these binary face patterns and random unstructured binary patterns, we found non-trivial high-order statistics in these face patterns (see STAR Methods "statistical differences between binary face patterns and random patterns" and Figure <ref type="figure">S1</ref>).</p><p>The memory module is the only part of our network containing plastic synapses. The synapses are continuously updated by the ongoing presentation of the face patterns, whereas the weights of the input module are frozen during online learning. The memory module consists of N memory neurons, one for each input neuron in the embedding module (see Figure <ref type="figure">1B</ref>). The j-th input neuron connects to the i-th memory neuron (for $i \neq j$) with synaptic weight (efficacy) $w_{ij}$ and bias term $b_i$. There is no connection between the i-th input neuron and the i-th memory neuron for any i (i.e., $w_{ii} = 0 \;\forall i$). 
This plastic layer of synapses implements a simple feedforward memory model that can perform an approximate one-step reconstruction of a stored input pattern from a noisy cue at test time, and we denote the binary memory patterns retrieved (reconstructed) in this manner as $y = [y_1, \ldots, y_N]^T$. Because the i-th memory neuron $y_i$ is expected to reconstruct the i-th input neuron $x_i$, we set the value of the i-th memory neuron to be $x_i$ during learning. The synaptic weights and biases are updated with the Hebbian learning rule with bounded dynamical ranges. For each synapse (i.e., for each weight w and bias term b), we implemented a complex synaptic model <ref type="bibr">2</ref> with m discretized dynamical variables $u_1, \ldots, u_m$ in discrete time. Here m denotes the total number of dynamical variables per synapse (a measure of synaptic complexity), each of which operates on a different timescale.</p><p>The readout (detection) module compares the output $x = [x_1, \ldots, x_N]^T$ of the embedding module and the output $y = [y_1, \ldots, y_N]^T$ of the memory module to assess the level of familiarity of a given pattern.</p>
<p>Figure 1. The architecture of our face familiarity detection system and the task diagram. (A) The neural system contains three modules: the input (embedding) module, the memory module, and the readout (detection) module. The synapses between the embedding module and the memory module (as well as the biases in the memory module) are plastic, while all other synapses are fixed (after being either set by hand or pre-trained), which requires online learning of face patterns. (B) The plastic connections between the input neurons (encoding face patterns) in the embedding module and the memory neurons (encoding memory patterns) in the memory module. (C) A series of face images are presented to the neural system. In each familiarity detection (FD) test, the system is required to determine whether a presented face is familiar or unseen. A face is considered familiar if the test image is identical to a previously presented one (i.e., the same pose, SP) or a new pose of a previously presented face (i.e., a different pose, DP), and is considered novel if it is an image of an unseen person's face. In each two-alternative forced-choice (FC) test, the neural system is presented with a pair of face images (exactly one familiar and one unseen), and is required to choose which one of the two is familiar. (D) During learning, face patterns $x(\cdot)$ are stored in the synaptic weights via the desired weight update $\Delta w(\cdot)$ generated from the Hebbian learning rule. When we test the face stored at time 0, the pattern $x'(0)$ (either a noisy version of $x(0)$ in the DP case or $x(0)$ itself in the SP case) is presented to the system at time t. The ioSignal is the overlap between the synaptic weight $w(t)$ at time t and the test weight update $\Delta w'(0)$ (even though the synaptic weight is never actually changed by the test weight update). The rSignal is the overlap between the test face pattern $x'(0)$ and the corresponding memory pattern $y^{(t)}(0)$ reconstructed using the current synaptic weight $w(t)$.</p>
<p>To evaluate the memory performance of our familiarity detection system we studied how the number of faces that can be recognized scales with the number of neurons N. The memory capacity of neural systems always increases with the size of the network (as the number of neurons increases, the number of synapses also increases), but the growth can vary in a wide range, from a very inefficient logarithmic scaling <ref type="bibr">3</ref> with N to a quadratic dependence. For networks with complex synapses, the memory capacity depends also on the number of dynamical variables m per synapse (synaptic complexity), and it is important to scale m up when N increases. 
If m is fixed, then the memory capacity can increase rapidly with N (e.g. quadratically), but only to some value determined by m. Beyond that value, the increase is only logarithmic. Fortunately, a modest increase in m allows us to rapidly (exponentially) increase this critical value. To take advantage of a larger population of neurons, it is important to increase the longest timescale of the synapses, which is related to its complexity m. This can be achieved by choosing an m that grows logarithmically with N (such that $m = \log_2 N - 1$, as suggested in <ref type="bibr">2</ref>). We present the results of varying m and N separately in Figures <ref type="figure">S5</ref> and <ref type="figure">S6</ref>.</p><p>In the following simulations, all memory metrics, including the signal-to-noise ratio (SNR) and the task performance, are evaluated after the neural system reaches its steady state, i.e., when a large number of face patterns (with constant input statistics) have already been stored. In the steady state, the distribution of synaptic weights does not change any longer, <ref type="bibr">2</ref> although synapses continue to be updated as new face images are memorized. The system is then presented with two thousand real face images from different people, interleaved with the necessary number of non-evaluated synthesized face image patterns (see "synthesizing artificial face patterns" in STAR Methods). We showed that the effect of interleaving with synthesized face patterns was indistinguishable from interleaving with real face patterns (see STAR Methods "interleaving with synthesized and real face patterns" and Figure <ref type="figure">S2</ref>). Our memory metrics were evaluated only over these real face images and further averaged over independent simulations to reduce the noise floor. We also quantified the variability of our results by first computing the metrics in each independent sequence and then considering their variations across sequences (see Figure <ref type="figure">S7</ref>).</p></div>
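<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a minimal illustrative sketch of the one-step reconstruction described above (not the model used in our simulations), the code below stores random patterns with a plain Hebbian rule and compares the readout overlap for a noisy familiar cue and a novel pattern. It assumes patterns in {-1, +1}, unbounded real-valued weights in place of the bounded complex synapses, zero biases, and a sign activation; all function names are hypothetical. The helper complexity_for implements the choice m = log2(N) - 1 quoted in the text.</p>
<p><![CDATA[
```python
import math
import numpy as np


def hebbian_update(x):
    """Desired weight update for one pattern x in {-1, +1}^N (assumed coding)."""
    dw = np.outer(x, x).astype(float)
    np.fill_diagonal(dw, 0.0)  # no self-connections: w_ii = 0 for all i
    return dw


def reconstruct(w, b, cue):
    """One-step feedforward reconstruction y = sign(w @ cue + b)."""
    return np.where(w @ cue + b >= 0, 1, -1)


def overlap(a, b):
    """Normalized overlap between two +/-1 patterns."""
    return float(np.mean(a * b))


def complexity_for(n):
    """Number of variables per synapse, m = log2(N) - 1 (N a power of two)."""
    return int(round(math.log2(n))) - 1


rng = np.random.default_rng(0)
N = 256
patterns = rng.choice([-1, 1], size=(20, N))

# Simplification: weights are summed without bounds or internal dynamics.
w = np.zeros((N, N))
b = np.zeros(N)
for x in patterns:
    w += hebbian_update(x)

# Familiar cue: the first stored pattern with 10% of its entries flipped.
cue = patterns[0].copy()
flipped = rng.choice(N, size=N // 10, replace=False)
cue[flipped] *= -1

# Novel cue: a pattern that was never stored.
novel = rng.choice([-1, 1], size=N)

familiar_score = overlap(reconstruct(w, b, cue), cue)
novel_score = overlap(reconstruct(w, b, novel), novel)
print(familiar_score, novel_score, complexity_for(2048))
```
]]></p>
<p>In this toy setting the familiar cue yields a clearly larger overlap than the novel pattern, which is the quantity the readout module thresholds. For N = 2048 the helper returns m = 10, the value used for the largest network.</p></div>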
<div xmlns="http://www.tei-c.org/ns/1.0"><head>SNR analysis and memory performance</head><p>To measure the strength of a memory we took the perspective of an ideal observer, who has direct access to all the synaptic weights <ref type="bibr">1,</ref><ref type="bibr">2,</ref><ref type="bibr">22</ref> (see also "evaluating the memory signal and noise" in STAR Methods) and can compare them to the synaptic modifications $\Delta w$ induced by the particular memory that we are tracking. The more similar (correlated) the current weights are to $\Delta w$, the stronger the memory signal. The way this similarity is computed is illustrated in Figure <ref type="figure">1D</ref>: say that the memory that we intend to track is memorized at time 0. At that time a pattern $x(0)$ is imposed on the input, determining through a simple Hebbian rule the $\Delta w(0)$, which is then used to update the synapses, leading to $w(1)$. Note that $\Delta w(0)$ is only the desired update, as the final new synaptic state will depend on the complex internal dynamics of the synapse (i.e., $w(1)$ is not necessarily $w(0) + \Delta w(0)$). At a later time, say time t, we can test the memory stored at time 0. As a test face, we considered both faces in the same pose (SP) as those stored in memory and faces in a different pose (DP) (see Figures <ref type="figure">1C</ref> and <ref type="figure">2</ref>). In the first case, we would simply compare $w(t)$ to $\Delta w(0)$.</p><p>In the DP case, $\Delta w'(0)$ is computed from a face pattern $x'(0)$ that is not in the same pose as the memorized face. The similarity (correlation) between $\Delta w'(0)$ and $w(t)$ is defined as the ideal observer signal (ioSignal). We computed the average of the signal $S_{\mathrm{io}}$ and the noise $N_{\mathrm{io}}$ (SD of the signal) across the full temporal series of the different faces. 
The ideal observer signal-to-noise ratio (ioSNR) $S_{\mathrm{io}}/N_{\mathrm{io}}(\Delta t)$ is our first measure of memory strength and it depends on the age of the memory $\Delta t$.</p><p>We also estimated the ability to reconstruct a memorized face by computing the rSignal, which is the overlap between the test face pattern $x'(0)$ and the corresponding memory pattern $y^{(t)}(0)$ (reconstructed from $x'(0)$ using the current synaptic weight $w(t)$). For $t \le 0$, ioSignal and rSignal are approximately zero because $w(t)$ has not been updated by $\Delta w(0)$ yet. The ioSignal and rSignal will reach their maximum at t = 1, and gradually decrease as time elapses.</p><p>The ioSNR critically depends on the number of memories that are stored after the tracked face pattern, i.e., the memory age. Different curves in Figures 2A and 2B correspond to synaptic models with different numbers of input neurons (and memory neurons) N and dynamical variables m. The curves are plotted on a log-log scale, for which a straight line represents a power-law dependence.</p><p>In the SP case, the ioSNR curves decay as a power-law over a time interval T corresponding to the longest timescale of the synapse before the decay becomes exponential. The ioSNR decays as slowly as the inverse square root of the memory age in the power-law regime. Changing N shifts the ioSNR curves in the log-log plot vertically, while increasing m primarily extends the power-law regime (i.e., increases T; see Figure <ref type="figure">2A</ref>). We determined the scaling of the familiarity memory lifetime with N (and m), where the lifetime $t^*_{\mathrm{ioSNR}}$ is represented by the memory age at which the ioSNR first drops below a given threshold. An SNR value of 1 corresponds to a situation where the signal and the noise are of the same intensity. We chose a threshold of 0.5, though its precise value does not affect the scaling behavior much. 
We found that the familiarity memory lifetime scales approximately as $N^2$ (see Figure <ref type="figure">2C</ref>, in which the linear regression slope on a log-log scale is about 1.78 for the SP case, compared to 1.79 for random patterns (RD)). This scaling is very close to the theoretical result for optimal storage of random unstructured patterns. <ref type="bibr">2</ref> Because m increases together with N (logarithmically), the familiarity memory lifetime scales exponentially with m (with the same linear regression slope on a $\log_2$-linear plot of $t^*_{\mathrm{ioSNR}}$ versus m).</p><p>For the DP case (see Figure <ref type="figure">2B</ref>), the ioSNR curves are lower than those in the SP case, due to the differences between the memorized and the tested face patterns. When there are more memory neurons, the shape of its initial decay with memory age becomes flatter. The initial ioSNR increases slowly with N for N &lt; 512, and then drops a little for larger N, because compared with the earlier features, the later features are much less correlated between poses of the same person. Nevertheless, the familiarity memory capacity still scales as a power of N: the regression slope is 1.53 (the model with 2048 neurons is removed from linear regression due to saturation).</p><p>As mentioned above, we also considered another measure of the memory signal that is more directly related to the ability of the system to reconstruct the stored memory, the rSignal. We then studied the readout signal-to-noise ratio (rSNR), defined similarly to the ioSNR (see Figure <ref type="figure">3</ref>). We found that the rSNR behaves similarly to the ioSNR at long time lags, but deviates from it for small memory ages, reflecting the effect of the neuronal nonlinearity (i.e., the nonlinear activation function). 
This nonlinear effect, which becomes more significant for larger N or smaller m (see also Figure <ref type="figure">S8</ref>), leads to larger initial rSNR values, but does not substantially affect the memory lifetime $t^*_{\mathrm{rSNR}}$ (similarly defined as the memory age at which the rSNR first drops below a given threshold) compared to the ioSNR measure. The initial SNR enhancement quickly attenuates, leading to a similar scaling for $t^*_{\mathrm{rSNR}}$. These results further validate the ideal observer approach.</p></div>
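<div xmlns="http://www.tei-c.org/ns/1.0"><p>The lifetime definition and the log-log regression used above can be sketched as follows. This is only an illustration under stated assumptions: toy_snr is a hypothetical pure power-law curve (initial SNR proportional to N, decaying as the inverse square root of the memory age) standing in for simulated ioSNR curves, and the function names are not taken from our code.</p>
<p><![CDATA[
```python
import numpy as np


def memory_lifetime(ages, snr, threshold=0.5):
    """First memory age at which the SNR drops below threshold (None if never)."""
    below = np.nonzero(np.asarray(snr) < threshold)[0]
    if below.size == 0:
        return None
    return int(ages[below[0]])


def loglog_slope(n_values, lifetimes):
    """Slope of a linear regression of log(lifetime) against log(N)."""
    slope, _intercept = np.polyfit(np.log(n_values), np.log(lifetimes), 1)
    return float(slope)


def toy_snr(ages, n):
    """Hypothetical stand-in curve: SNR = N / sqrt(age), no exponential cutoff."""
    return n / np.sqrt(ages)


ages = np.arange(1, 100_000)
n_values = np.array([16, 32, 64, 128])
lifetimes = [memory_lifetime(ages, toy_snr(ages, n)) for n in n_values]
slope = loglog_slope(n_values, lifetimes)
print(lifetimes, slope)
```
]]></p>
<p>For this idealized curve the lifetime is the first age beyond 4N^2, so the fitted slope is very close to 2. The slopes reported in the text (about 1.78 for SP and 1.53 for DP) come from simulated curves that include the exponential cutoff and pattern correlations, which this sketch omits.</p></div>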
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Task protocol and performance</head><p>To evaluate the task performance of our system, we considered two tests in which we presented a series of preprocessed face images to the neural system and tested its memory on randomly chosen faces (see Figure <ref type="figure">1C</ref>). These tasks are made particularly challenging by the fact that the familiar faces are presented only once.</p><p>In the first familiarity detection (FD) test, the neural system is required to determine whether the face image presented at test time is familiar (either SP or DP of a previously presented face image) or unseen (an image of an unseen person) by comparing the output of the detection module to a threshold. Here, the face images presented to the system are balanced, i.e., familiar faces previously presented within a certain age range and unseen faces appear at test time with equal probability. The threshold on the overlap did not depend on the age of the face, which is assumed to be unknown at test time, and it was optimized to best separate familiar from novel faces for all ages shorter than the memory lifetime $t^*_{\mathrm{ioSNR}}$ (for details see STAR Methods "choosing the optimal threshold in the FD task" and Figure <ref type="figure">S3</ref>). Note that $t^*_{\mathrm{ioSNR}}$, and hence the threshold, will depend on the number of neurons and the complexity of the synapses.</p><p>We now define $t^*_{\mathrm{FD}}$ as the age at which the FD test performance drops below some threshold. In Figure <ref type="figure">4</ref>, we plot the task performance in the FD test as a function of memory age. In the SP case, increasing N and m leads to a substantial extension of the task-relevant familiarity memory lifetime $t^*_{\mathrm{FD}}$ (see Figure <ref type="figure">4A</ref>). The memory lifetime was estimated assuming a performance threshold of 60% (this value was chosen to keep the initial task performance of all the simulations above the threshold). 
The power-law scaling behavior of the familiarity memory lifetime is revealed by plotting $t^*_{\mathrm{FD}}$ versus N on a log-log scale (linear regression slope 1.78; see Figure <ref type="figure">4E</ref>), which shows a very similar growth also in the RD case (linear regression slope 1.85). The initial task performance cannot reach 100% because each model optimized for the model-specific age range has a non-zero constant error rate for unseen faces, even if the true positive rate (accuracy for familiar faces) saturates at 100% (also see Figure <ref type="figure">S4</ref>). As expected, in the DP case the task performance is worse than in the SP case (see Figure <ref type="figure">4C</ref>). However, we still found a reasonable power-law scaling with N (regression slope 1.49).</p><p>In the second two-alternative forced-choice (FC) test, the neural system is presented with a pair of face images containing one familiar (either SP or DP) and one unseen face, and is required to choose which one of the two is familiar by comparing the output of the detection module for the two faces. The task performance is defined as the probability of correctly choosing the familiar face (over the unseen one) for face memories of different ages (see Figures <ref type="figure">4B, 4D</ref> and <ref type="figure">4F</ref>). The regression slope of the memory lifetime $t^*_{\mathrm{FC}}$ (defined as the age at which the FC test performance drops below some threshold) versus N on a log-log scale is 1.76 for the SP case and 1.43 for the DP case.</p></div>
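<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two tests can be illustrated with a short sketch that takes detection-module outputs as given. The scores below are hypothetical Gaussian samples, not simulation results, and the threshold search over a single pooled set of scores is a simplification of the age-range optimization described in STAR Methods; function names are illustrative.</p>
<p><![CDATA[
```python
import numpy as np


def fd_accuracy(familiar_scores, novel_scores, threshold):
    """Balanced FD accuracy: familiar if score > threshold, unseen otherwise."""
    hit_rate = np.mean(np.asarray(familiar_scores) > threshold)
    correct_rejection_rate = np.mean(np.asarray(novel_scores) <= threshold)
    return float(0.5 * (hit_rate + correct_rejection_rate))


def best_threshold(familiar_scores, novel_scores, n_candidates=201):
    """Grid search for the single threshold that maximizes balanced accuracy."""
    pooled = np.concatenate([familiar_scores, novel_scores])
    candidates = np.linspace(pooled.min(), pooled.max(), n_candidates)
    accuracies = [fd_accuracy(familiar_scores, novel_scores, c) for c in candidates]
    return float(candidates[int(np.argmax(accuracies))])


def fc_accuracy(familiar_scores, novel_scores):
    """2AFC accuracy: in each (familiar, unseen) pair pick the larger output."""
    return float(np.mean(np.asarray(familiar_scores) > np.asarray(novel_scores)))


# Hypothetical detection-module outputs (assumed unit-variance Gaussians).
rng = np.random.default_rng(1)
familiar = rng.normal(1.0, 1.0, 5000)
novel = rng.normal(0.0, 1.0, 5000)

theta = best_threshold(familiar, novel)
fd = fd_accuracy(familiar, novel, theta)
fc = fc_accuracy(familiar, novel)
print(theta, fd, fc)
```
]]></p>
<p>With equally noisy outputs, the forced-choice accuracy exceeds the thresholded detection accuracy in this toy example, because the FC test compares two outputs directly and does not require a fixed, age-independent threshold.</p></div>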
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Complex versus simple synapses</head><p>To obtain a fair comparison between complex synapses <ref type="bibr">2</ref> and the well-studied, simple (multi-state) synapses, <ref type="bibr">[3]</ref><ref type="bibr">[4]</ref><ref type="bibr">[5]</ref> we evaluated the familiarity memory performance of a neural system with complex synapses and three models with simple synapses in which we matched the total number of dynamic variables. As the complex synapse has 10 times more dynamical variables than the simple synapse, we randomly pruned 90% of the complex synapses. In this way each memory neuron in the complex model has on average 204 incoming synapses (randomly sampled 10% from 2047 presynaptic input neurons) and 1 bias (i.e., $2048 \times (204 + 1) \times 10 = 4198400$ variables in total), whereas in the three simple models each neuron has 2047 incoming synapses (1 dynamical variable) and 1 bias (i.e., $2048 \times 2047 \times 1 = 4192256 \approx 4198400$ variables in total). The simple synapses follow essentially the same model dynamics as the previously studied hard-bounded multi-state synapses. <ref type="bibr">4</ref> They differ in their level of plasticity: the synapses in the first model are updated every time an input pattern is stored, while the synapses in the second and third ones are changed stochastically according to a learning rate (encoding probability q) less than one, and thus are more "rigid". <ref type="bibr">3,</ref><ref type="bibr">41</ref> Small learning rates lead to lower initial ioSNR values, but also to longer memory lifetimes. We chose q = 0.128 for the second model so that its initial ioSNR is comparable to the complex synapse system in the SP and DP cases. 
For the third model, we picked q = 0.01 to obtain the longest memory lifetimes possible for a system of simple synapses of this size, with an initial SNR just above the threshold (in the DP case).</p><p>Each variable in all of these models has the same number of discrete levels, and the total numbers of variables are approximately matched in the simple and the complex system. These simulations show that the complex system has a substantially better familiarity memory performance than the simpler systems (see Figure <ref type="figure">5</ref>), despite the smaller number of synapses. For the SP and RD cases, the memory lifetime of the system with complex synapses is roughly 400 to 900 times longer, whereas for the DP case the improvement factor is roughly 20 to 40. Slower simple synapses (with smaller q) can greatly extend the familiarity memory lifetime, but at the expense of the initial SNR and thus the generalization ability. Even so, they are far from matching the memory lifetime of the complex system. This clear advantage is further confirmed by another comparison, where we matched the number of dynamic variables using a different approach: we considered a larger network for the simple synapses (see Figure <ref type="figure">S8</ref>). We can conclude that the memory model with complex synapses performs at least two orders of magnitude better in terms of familiarity memory capacity, and we expect the gap between simple and complex systems to grow even wider in networks with a larger number of neurons because of the different scaling behaviors.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Testable predictions for simple and complex synapses</head><p>Simple and complex synapses exhibit quantitatively different SNR decays and memory performance. We now show that it is possible to design an experiment with a specific learning schedule that would reveal whether the synapses are complex or simple. The main idea is that memories can be repeatedly refreshed in such a way that the asymptotic minimum memory strength remains constant. Ideally, this could be implemented by monitoring the memory signal of a specific stimulus, and refreshing the memory by presenting the same stimulus again as soon as the signal drops below some threshold. Using this procedure, we obtain a refresh schedule, which can be described by specifying the intervals that separate two consecutive presentations of the same stimulus. Depending on the synaptic model, the length of these intervals will be different, and, most importantly, will change over time in a different way.</p><p>Figure <ref type="figure">5</ref>. Comparison between models with simple synapses (N = 2048, m = 1) and different learning rates (q = 1, 0.128, and 0.01, respectively) and a complex model (N = 2048, m = 10, q = 1) that has the same total number of memory neurons and plastic variables, obtained by keeping only 10% of the synapses of the fully connected simple models (i.e., 90% of the complex synapses are pruned). (A-D) Comparisons between models for the same pose (SP) case in terms of ioSNR, rSNR, familiarity detection (FD) performance, and two-alternative forced-choice (FC) performance. (E-H) Similar comparisons between models in the different pose (DP) case. (I-L) Comparisons between models in terms of different measures of familiarity memory lifetime ($t^*_{\mathrm{ioSNR}}$, $t^*_{\mathrm{rSNR}}$, $t^*_{\mathrm{FD}}$, and $t^*_{\mathrm{FC}}$, respectively) in the SP, DP, and random-pattern (RD) cases. Also see Figure <ref type="figure">S8</ref>.</p><p>We illustrate this with three idealized models of memory traces through simulations: the exponential decay model (in which the signal decays as $e^{-t/\tau}$, where $\tau$ is the time constant), the inverse-square-root ($1/\sqrt{t}$) power-law decay, and the hyperbolic ($1/t$) power-law decay (which is achievable by a heterogeneous population of simple synapses <ref type="bibr">42</ref> ). See Figures <ref type="figure">6A-6C</ref> and <ref type="figure">6F</ref>. We set the threshold to q = 0.5, but its precise value does not affect the scaling behavior of the intervals. For the idealized exponential decay model, the length of the interval remains constant after the second presentation (the constant interval is $\tau \ln(1 + C/q)$, where C is the initial signal strength and q is the pre-specified threshold). Interestingly, the length of the interval increases asymptotically linearly for the idealized inverse-square-root decay model (the coefficient of this linear asymptotic growth is $\pi^2 C^2 / (2 q^2)$; see mathematical details in STAR Methods, "asymptotic behavior of the optimal learning schedules for idealized synaptic models with specific decay kernels"). The situation for the hyperbolic decay is intermediate between the exponential and inverse-square-root decay models, showing an approximately logarithmic increase of the length of the interval.</p><p>The idealized exponential decay and inverse-square-root decay models represent very different degrees of synaptic complexity. The memory signal of complex synapses decays as an inverse-square-root power-law over the longest timescale of the synapse before the decay becomes exponential. Example memory signal trajectories under the optimal learning schedule for simulated models with different synaptic complexity are shown in Figures <ref type="figure">6D</ref> and <ref type="figure">6E</ref>. 
The length of the interval is approximately constant for the model with simple synapses (m = 1), similar to the idealized exponential decay, but increases linearly for the model with complex synapses (m = 8), similar to the idealized inverse-square-root decay. Averaged over multiple such noisy trajectories, the interval curves are plotted on a log-log scale as a function of interval numbers (see Figure <ref type="figure">6G</ref>). Increasing the synaptic complexity m effectively extends the linear growth regime (corresponding to the inverse-square-root power-law decay regime of the ioSNR) and postpones the gradual transition into the constant interval regime (corresponding to the exponential decay regime of the ioSNR). The idealized inverse-square-root decay thus approximates the envelope of the interval curves of models of different synaptic complexity m. We further study the scaling properties of the length of the interval (Figure <ref type="figure">S9</ref>).</p><p>However, it would not be feasible in experiments to monitor the memory signal in real time. Indeed, to measure the signal we need to expose the subject to the memory we intend to test, and hence we would modify the memory signal we want to estimate. We propose instead to use a pre-determined presentation schedule, with either a constant interval or a linearly increasing one, without monitoring the memory signal between presentations. Both protocols are parameterized by a single variable. Under the constant schedule, a specific memory is refreshed after an interval of a fixed length g, whereas under the linear schedule, a specific memory is refreshed after a linearly increasing interval. In particular, the interval is equal to $g n$, where n is the interval number, and g is the length of the first interval (see Figures <ref type="figure">7A</ref> and <ref type="figure">7B</ref>). 
These refreshes serve the dual role of evaluating familiarity detection performance (e.g., by querying the subject whether the presented stimulus is familiar) and boosting memory strength (corresponding to the increase of the signal strength immediately after the refresh). Under the constant learning schedule, the performance of the exponential decay model will remain constant, consistent with the conclusion drawn from the optimal schedule. The inverse-square-root decay and the hyperbolic decay models will exhibit gradually improving performance. Under the linear learning schedule, the performance of exponential decay and hyperbolic decay models will quickly drop to chance level, but the inverse-square-root decay model will maintain its performance, as predicted by the optimal schedule.</p><p>Figure <ref type="figure">6</ref>. The optimal learning schedule, in which the pattern is presented whenever the monitored ioSignal drops below a pre-specified threshold (0.5 shown). (A-C) The idealized exponential decay, inverse-square-root decay, and hyperbolic decay models of ioSignal under the optimal learning schedule. The ioSignal decreases over time and is enhanced by pattern presentations indicated by red arrows. The signal strength immediately before each presentation is marked by orange stars. (D) A typical noisy ioSignal trajectory from the model with simple synapses (N = 64, m = 1). The length of the interval between consecutive presentations is approximately constant, similar to the idealized exponential decay. (E) A typical noisy ioSignal trajectory from the model with complex synapses (N = 64, m = 8). The length of the interval increases approximately linearly with interval numbers, similar to the idealized inverse-square-root decay. (F) The length of the interval as a function of interval numbers for three idealized decay models. (G) The length of the interval increases as a function of interval numbers under the optimal schedule, averaged over noisy ioSignal trajectories. Different colored curves correspond to models with a different synaptic complexity m. The shaded region is bounded by interval curves of the idealized exponential decay and inverse-square-root decay models. Also see Figure <ref type="figure">S9</ref>.</p></div>
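<div xmlns="http://www.tei-c.org/ns/1.0"><p>The threshold-triggered refresh procedure described in this section can be illustrated with a minimal sketch for the three idealized kernels. It is not the simulation code used for Figure 6. It assumes discrete time steps, linear summation of the traces left by successive presentations, and hypothetical values C = 1 and a time constant of 50 steps; the function name refresh_intervals is introduced here for illustration only.</p><p>
```python
import numpy as np


def refresh_intervals(kernel, threshold=0.5, n_refresh=8, max_steps=100_000):
    """Intervals between consecutive presentations when the stimulus is
    presented again as soon as the summed memory signal falls below threshold."""
    times = [0]  # first presentation at t = 0
    for t in range(1, max_steps):
        ages = t - np.array(times)       # time since each presentation (all >= 1)
        signal = kernel(ages).sum()      # assumption: traces add linearly
        if threshold > signal:
            times.append(t)              # refresh at this time step
            if len(times) > n_refresh:
                break
    return np.diff(times)


C, tau = 1.0, 50.0  # hypothetical initial signal strength and time constant
exp_iv = refresh_intervals(lambda s: C * np.exp(-s / tau))
sqrt_iv = refresh_intervals(lambda s: C / np.sqrt(s))
hyp_iv = refresh_intervals(lambda s: C / s)

print("exponential:        ", exp_iv)
print("inverse-square-root:", sqrt_iv)
print("hyperbolic:         ", hyp_iv)
```
</p><p>With these assumed values the exponential kernel settles on a constant interval close to the predicted value of 50 ln(3), about 54.9 steps, right after the second presentation, whereas the inverse-square-root kernel produces intervals that keep growing.</p></div>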
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We define the signal gain for interval number n as the logarithmic ratio of the ioSignal after the n-th interval (immediately before the (n + 1)-th presentation) relative to the ioSignal after the first interval (immediately before the second presentation). Positive signal gains correspond to better familiarity detection performance, and negative ones indicate worse performance after the following presentations. The three idealized decay models demonstrate qualitatively different signal gains under the constant and linear schedules with varying g (see Figure <ref type="figure">S10</ref>), offering testable predictions for experiments.</p><p>To better discriminate between complex synapses with an inverse-square-root decay and simple heterogeneous synapses with a hyperbolic decay, we introduced a more general learning schedule (see Figure <ref type="figure">7C</ref>). Here the length of the interval takes the form $g n^b$, where n is the interval number, and g is the length of the first interval. b = 0 corresponds to the constant schedule, and b = 1 to the linear schedule. In the parameter space spanned by the parameters g and b, as the interval number increases, the positive gain regime (red) shrinks quickly for any positive b in the idealized exponential decay model. This regime exists in the inverse-square-root decay model for b &gt; 1 and in the hyperbolic decay model for smaller but positive b. Differentiating between these two power-law decay models requires examining the sign of the signal gain around b = 1: positive for the inverse-square-root decay and negative for the hyperbolic decay. This general learning schedule thus provides experimental predictions for behavioral signatures that differ between the three idealized decay models, and allows us to discriminate between memory networks of various degrees of complexity.</p></div>
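<div xmlns="http://www.tei-c.org/ns/1.0"><p>The signal gain under the general pre-determined schedule can be sketched as follows. This is an illustrative reconstruction rather than the code behind Figure 7C: it assumes linearly summed traces, the natural logarithm for the "logarithmic ratio", and a hypothetical time constant of 50 for the exponential kernel. The helper signal_gain is a name introduced here.</p><p>
```python
import numpy as np


def signal_gain(kernel, g, b, n_intervals):
    """Log ratio of the signal just before presentation n + 1 to the signal
    just before presentation 2, for interval lengths g * n**b."""
    lengths = g * np.arange(1, n_intervals + 1, dtype=float) ** b
    times = np.concatenate(([0.0], np.cumsum(lengths)))  # presentation times
    signals = np.array([
        kernel(times[n] - times[:n]).sum()   # traces of all earlier presentations
        for n in range(1, n_intervals + 1)
    ])
    return np.log(signals / signals[0])      # assumption: natural logarithm


tau = 50.0  # hypothetical time constant for the exponential kernel
kernels = {
    "exponential": lambda s: np.exp(-s / tau),
    "inverse-square-root": lambda s: 1.0 / np.sqrt(s),
    "hyperbolic": lambda s: 1.0 / s,
}

g, b = 100.0, 1.0  # linear schedule
gains = {name: signal_gain(k, g, b, n_intervals=6) for name, k in kernels.items()}
for name, gain in gains.items():
    print(name, np.round(gain, 2))
```
</p><p>For the linear schedule (b = 1) this sketch gives a positive gain for the inverse-square-root kernel and negative gains for the hyperbolic and exponential kernels, matching the sign pattern described in the text.</p></div>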
<div xmlns="http://www.tei-c.org/ns/1.0"><head>DISCUSSION</head><p>We have presented a modular memory system that can solve a real-world problem such as face familiarity detection, which involves the ability to store in memory in one shot a large number of visual inputs. Thanks to the pre-processing and the interactions between fast and slow variables of the complex synaptic model, the familiarity memory capacity grows almost linearly with the number of plastic synapses or quadratically with the number of neurons of the memory module. The scaling of the system with simple synapses is only logarithmic with the number of synapses, <ref type="bibr">3,</ref><ref type="bibr">41</ref> although the memory performance can significantly increase when the learning rate q becomes small, or when the number of states per variable increases. <ref type="bibr">3,</ref><ref type="bibr">4</ref> However, even when the parameter q is properly tuned, the linear scaling cannot be achieved with a small number of states, and the system with complex synapses outperforms the one with simple synapses in all cases, even when the total number of dynamical variables is the same for the two systems.</p><p>The advantage of complex synapses comes from two important properties: the first one is that they involve multiple timescales, enabling the system to learn quickly using the fast components, and forget slowly due to the slow components. The second one is that the dynamical components operating on different timescales can interact to transfer information from one component to another. In the case of our specific model the information diffuses from the fast components to the slow ones, and back (see <ref type="bibr">2</ref> for more details). These two properties are important for any memory system that involves a process of consolidation, whether the process is synaptic or requires communication across multiple brain areas (memory consolidation at the systems level, see e.g. 
<ref type="bibr">42</ref> ).</p><p>Figure <ref type="figure">7</ref>. Pre-determined learning schedules. (A) The idealized exponential decay, inverse-square-root decay, and hyperbolic decay models of ioSignal under the constant learning schedule, in which the pattern is presented each time after an interval of a pre-determined constant length. The ioSignal decreases over time and is enhanced by pattern presentations indicated by red arrows. The signal strength immediately before each presentation is marked by orange stars, reflecting familiarity task performance. In the following presentations, the task performance of the exponential decay remains constant, while the two power-law decay models' performance gradually increases. (B) The idealized exponential decay, inverse-square-root decay, and hyperbolic decay models of ioSignal under the linear learning schedule, in which the pattern is presented each time after an interval of linearly increasing length. In the following presentations, the inverse-square-root decay model maintains its performance, but the performance of exponential decay and hyperbolic decay models quickly drops to chance level (orange stars not shown due to extremely low signal strength). (C) The signal gain as a function of g (length of the first interval) and b (exponent of length increase) for the three idealized decay models under the general pre-determined learning schedule, where the length of the interval equals $g n^b$ (n denotes the interval number, $10^0 \leq g \leq 10^4$ on a log scale, $0 \leq b \leq 2$ on a linear scale). Red indicates positive gain (greater than 0.2), blue negative gain (less than −0.2), and gray marginal gain (between −0.2 and 0.2). The three idealized decay models exhibit qualitatively different behaviors. Also see Figure <ref type="figure">S10</ref>.</p><p>Our previous work <ref type="bibr">2</ref> systematically studied the scaling properties, the memory capacity, and the robustness of a broad class of complex synaptic models for random and uncorrelated synaptic modifications. One of the situations in which the synaptic modifications are random and uncorrelated is when the patterns of activity that represent the memories are also random and uncorrelated, which is what was assumed in all the early works on memory capacity (e.g. <ref type="bibr">37</ref> ). One of the reasons behind this assumption is that it allowed theorists to perform analytic calculations. However, it is a reasonable assumption even when more complex memories are considered. Indeed, storage of new memories is likely to exploit similarities with previously stored information. Hence, the information contained in a memory is likely to be pre-processed, so that only those components that are not correlated with previously stored memories are actually stored. In other words, it is more efficient to store only the information that is not already present in our memory. As a consequence, it is not unreasonable to consider memories that are unstructured (random) and do not have any correlations with previously stored information (uncorrelated). Unfortunately, these processes that lead to uncorrelated representations are rarely modeled explicitly (but see <ref type="bibr">18</ref> ) and we currently do not have a general theory for dealing with more realistic, highly structured memories. In our model, the face stimuli, which are highly structured and correlated, are pre-processed by a simulated visual system, whose intermediate representations are then used as inputs to our memory module. 
Though non-trivial higher-order statistics remain in those intermediate representations (see Figure <ref type="figure">S1</ref>), this pre-processing seems to be sufficient to achieve approximately the same scaling properties predicted for random patterns.</p><p>Another important difference between our previous and present work is related to the nature of the memory problem to be solved. In our previous work, we were dealing either with a classification task with randomly chosen labels (a typical perceptron problem with only one output unit) or with a reconstruction memory problem in which a recurrent network would learn to reproduce a previously seen input at the time of memory retrieval. In this work, we considered familiarity detection, which is a recognition memory problem. To reconstruct each individual binary feature of a memorized pattern, we would employ N − 1 synapses. Here we have designed a system in which N such output neurons are combined and read out to report a one-bit response, which is familiarity. We are using all N(N − 1) plastic synapses that are available to output only one bit of information. Hence it is not surprising that in the case of reconstruction memory, the number of memories that can be retrieved (reconstructed) scales linearly with the number of neurons N, while in the case of familiarity detection, the memory lifetime scales quadratically with N.</p><p>We also studied the generalization performance of the system by considering different poses of presented faces as retrieval cues (the DP case), using probe patterns that differ from the originally stored ones. Although the task performance for this DP case is worse than in the SP case, the power-law scaling properties are similar, and the drop in performance could be compensated by introducing more memory neurons and possibly increasing the synaptic complexity. The ability to generalize to different poses is presumably helped by the complexity of the synapses. 
Indeed, in the case of random patterns, generalization is related to the memory SNR. <ref type="bibr">2</ref> In future studies, we will determine whether there is a similar relationship between the SNR and the ability to generalize to different poses.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Biological interpretation</head><p>We hypothesize that the embedding module represents the ventral stream of visual cortex, where faces are clearly represented in dedicated patches, which are present in the inferior temporal cortex <ref type="bibr">43,</ref><ref type="bibr">44</ref> and in the perirhinal cortex. <ref type="bibr">45</ref> The memory module could be mapped onto the hippocampus, containing synapses that can be significantly more plastic than in the cortex. These highly plastic components would support one-shot online learning. This hypothesis would be compatible with the models that see the hippocampus as a memory device that compresses correlated memories before they are stored. <ref type="bibr">10,</ref><ref type="bibr">17,</ref><ref type="bibr">18</ref> This compression process is often achieved by modeling the hippocampus as a sparse auto-encoder with one input layer, containing the representation of the memory to be compressed, an intermediate layer and an output reconstruction layer. The weights are tuned to reproduce the input in the output layer. The representations in the intermediate layer are compressed because sparseness is imposed during the learning process. Comparing the input and the output layer would be equivalent to the comparison we perform in our model between the representations in the embedding module and the representations in the memory module. In our model, we did not consider an intermediate layer as the face representations are already approximately uncorrelated. However, we could easily introduce an intermediate layer to deal with other classes of visual inputs. The reconstruction layer, and hence the detection module of our model could be in the entorhinal cortex (EC), taking advantage of the architecture of the hippocampal-cortex loop <ref type="bibr">17</ref> (the hippocampus projects back to EC, which is also the main input to the hippocampus). 
Alternatively, it could be that the reconstruction layer is not explicitly implemented (see e.g. <ref type="bibr">18</ref> ). In this case the compressed representations would emerge in one of the parts of the hippocampus without the need to reconstruct the inputs. It could be in the dentate gyrus, as hypothesized in <ref type="bibr">18</ref> , or in specific parts of the hippocampus that are involved in social interactions (e.g., CA2 is known to be involved in familiarity detection in mice <ref type="bibr">46</ref> ). The absence of an explicit reconstruction layer would require a more complex readout, that probably needs to be trained because the detection module would have to compare two different representations. This problem could be solved by adopting a different strategy to detect novelty, as suggested in. <ref type="bibr">47</ref> Perirhinal cortex is bidirectionally connected with EC and hence with the hippocampus. It could certainly represent familiarity even if we hypothesize that the hippocampus is the main locus of the memory module. This familiarity signal could then be broadcast to the rest of the cortex and explain why familiarity can be decoded also in other areas like infero-temporal cortex. <ref type="bibr">48</ref> </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Biological complexity at the systems level</head><p>In this article, we discussed how to take advantage of the biological complexity of individual synapses to achieve an elevated memory capacity. Complex synapses are characterized by multiple dynamical variables that operate on different timescales with interactions among them. The same computational principles could be applied to memory consolidation mechanisms implemented at the systems level: for example, we could assume that the synapses are simple (e.g. binary) but heterogeneous, each characterized by a different learning rate. This is a scenario proposed in <ref type="bibr">42</ref> where not only did the synapses have different time constants, but they could also communicate through replay activity, effectively implementing a mechanism of information transfer that is similar to the one that occurs in the complex synapses. In these scenarios it is possible to obtain a power-law decay of the memory SNR; however, in the case studied in <ref type="bibr">42</ref> , the slowest decay would scale as $1/t$, whereas with the complex synapses of <ref type="bibr">2</ref> we can achieve approximately $1/\sqrt{t}$. It is possible to choose a distribution of timescales of simple synapses that allows the SNR to decay as in the case of complex synapses (see <ref type="bibr">2</ref> , supplementary information 9). However, such a distribution would be strongly skewed toward slow synapses, making the initial SNR very small. Although we currently do not have an efficient model of heterogeneous simple synapses that has the same performance as the model with complex synapses that we studied here, we cannot rule out the possibility that the brain is using both mechanisms of memory consolidation (synaptic and systems level). With the experiments that we proposed and that are discussed in the next subsection we cannot really separate the contributions of the two mechanisms.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Predictions for familiarity detection experiments</head><p>Using different learning schedules (including the constant, the linear, and a more general learning schedule), we demonstrate that the exponential decay, the inverse-square-root decay, and the hyperbolic decay models lead to distinct and testable predictions for the familiarity task performance. This can be directly tested in human (and animal) experiments. Within a series of images used in such a familiarity experiment, face images of different identities are ordered so that the same face image is repeatedly presented and at the same time evaluated (by testing familiarity detection) after an interval of a pre-determined length following a specific schedule. When designing the experiment, a large g (the first inter-refresh interval) can lead to almost chance-level initial task performance, and a small one will cause saturated initial performance. In practice, g should be chosen based on preliminary experiments to find an initial performance which is sensitive to manipulation (e.g., 60-90%). A recent study showed that the two-alternative forced-choice task performance for images (sketches) is around 85% when there are 100 interposed items between the first presentation and the test (with a speed of 1 s for each stimulus and 0.5 s for the inter-item interval). <ref type="bibr">49</ref> We thus take $g \approx 100$ as an educated guess, which could be even smaller for a shorter stimulus duration (i.e., less than 1 s). For b, the relevant range in which it should vary would be between zero and one to estimate the complexity of synapses (e.g. by choosing b uniformly for different images in one experiment). Taking g = 100 and b = 1, the interval lengths would be $g n^b = 100, 200, 300, 400, 500, 600$ for $n = 1, \ldots, 6$. 
Roughly speaking, within a 1-h experiment containing 2400 interleaved images (about 1s for each stimulus plus 0.5s for response and inter-item interval, the same speed as in <ref type="bibr">49</ref> ), we will have test images refreshed at least six times. Then how the signal gain (and the corresponding probability of successful familiarity detection) develops as a function of the interval/presentation number will determine the temporal decay kernel of the memory signal and therefore it will allow us to infer the complexity of the memory consolidation process. We expect that biological synapses will behave similarly to the inverse-square-root model for a wide range of intervals until they gradually transition into an exponential decay. If we could ignore the effects of systems-level consolidation and internal replay, the interval number at which this transition occurs would provide a measure of the intrinsic complexity of synapses in the hippocampus.</p></div>
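<div xmlns="http://www.tei-c.org/ns/1.0"><p>The arithmetic behind this experimental budget can be checked with a short sketch. It uses only the numbers quoted in the text (g = 100, b = 1, six intervals, 2400 images at 1.5 s each); the variable names are introduced here for illustration.</p><p>
```python
g, b = 100, 1                     # first interval length and exponent from the text
intervals = [g * n ** b for n in range(1, 7)]
items_needed = sum(intervals)     # items from first to last presentation of one test image

n_images = 2400                   # images in the proposed session
seconds_per_image = 1.0 + 0.5     # stimulus plus response and inter-item interval
duration_h = n_images * seconds_per_image / 3600

print(intervals)      # [100, 200, 300, 400, 500, 600]
print(items_needed)   # 2100
print(duration_h)     # 1.0
```
</p><p>Six refreshes of a test image span 2100 items, which fits inside the 2400-image, one-hour session.</p></div>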
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Limitations of the study</head><p>One limitation of our work is the assumption that the memory neurons use exactly the same representations as the input neurons. In reality, the number of memory neurons is unlikely to be precisely the same as the number of input pattern dimensions, and they would in general use a different representation of a given face from the input neurons. The detection module has to essentially compare the reconstructed memory with the representation of the current cue. This is a computation that can be performed even when the representations in the detection module are completely different from those in the input. However, it will require a smarter readout system that is trained to perform this comparison. Generalizing our system to include a more biologically plausible mapping between the embedding module and the memory module, with a corresponding readout mechanism in the detection module, is an important direction for our future work.</p><p>In our hippocampus-like memory module, there is only one feedforward layer that uses dense neural representations. However, recurrent neural computations in the hippocampus can be beneficial in some memory tasks. <ref type="bibr">50,</ref><ref type="bibr">51</ref> In addition, sparse representations of memory patterns have long been known to harbor computational benefits such as larger memory capacity and the capability to mitigate disruptive effects of correlations. <ref type="bibr">2,</ref><ref type="bibr">3,</ref><ref type="bibr">52,</ref><ref type="bibr">53</ref> To what extent recurrent connections and sparse coding are beneficial in our neural system for familiarity detection are questions currently under investigation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>STAR+METHODS</head><p>Detailed methods are provided in the online version of this paper and include the following:  <ref type="bibr">40</ref> The 2048-dimensional activity of the penultimate layer was extracted as the face feature vector for each face image input. Because the face feature vectors are sparse and non-negative, we took the following steps to transform them into a format that is suitable as the input to the memory module: (i) the dimensionality of the feature vector of each face was first reduced using principal component analysis (PCA); (ii) each dimension was then binarized with a threshold equal to the median (−1 for values less than the median and +1 for values larger than the median). The first N binarized principal components were taken as the binary face pattern $x = [x_1, \ldots, x_N]^T$, serving as the activity of the N input neurons of the memory module.</p></div>
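<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two pre-processing steps, PCA followed by median binarization, can be sketched as follows. This is a minimal illustration, not the pipeline used in the paper: PCA is computed with a plain NumPy singular value decomposition, the feature matrix is a synthetic sparse non-negative stand-in rather than real network activations, and values exactly equal to the median are mapped to −1, which the text leaves unspecified. The function name binarize_features is hypothetical.</p><p>
```python
import numpy as np


def binarize_features(features, n_components):
    """Project feature vectors onto the first principal components and
    binarize each component at its median (+1 above, -1 otherwise)."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    medians = np.median(projected, axis=0)
    return np.where(projected > medians, 1, -1)


rng = np.random.default_rng(0)
# Synthetic stand-in for sparse, non-negative feature vectors (200 faces, 64 dims).
features = np.maximum(rng.normal(size=(200, 64)), 0.0)
patterns = binarize_features(features, n_components=16)
print(patterns.shape)        # (200, 16)
print(patterns.sum(axis=0))  # balanced components sum to zero
```
</p><p>Thresholding at the median makes each binary component balanced across the data set, so +1 and −1 occur equally often.</p></div>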
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Memory module</head><p>The memory module consists of N memory neurons, one for each input neuron of the embedding module. The j-th input neuron connects to the i-th memory neuron (for $i \neq j$) with synaptic weight (efficacy) $w_{ij}$ and bias term $b_i$. There is no connection between the i-th input neuron and the i-th memory neuron for any i (i.e., $w_{ii} = 0 \;\forall i$). The activity of the i-th memory neuron is</p><p>(Equation <ref type="formula">1</ref>)</p><p>and we denote the binary memory patterns retrieved in this manner as $y = [y_1, \ldots, y_N]^T$. This plastic layer of synapses implements a simple feedforward memory model that can perform an approximate one-step reconstruction of a stored input pattern from a noisy cue at test time. Because the i-th memory neuron $y_i$ is expected to reconstruct the i-th input neuron $x_i$, we set the value of the i-th memory neuron to be $x_i$ during learning.</p><p>To update the synaptic weights and biases we used the learning rule:</p><p>$\Delta w_{ij} = x_i x_j$; (Equation <ref type="formula">2</ref>)</p><p>(Equation <ref type="formula">3</ref>)</p><p>These equations describe the desirable plasticity steps to store each new pattern. However, simply applying these additive updates would eventually result in unbounded values of the $w_{ij}$. Therefore, we employed a mechanism to limit the weights to bounded dynamical ranges. For each synapse (i.e., for each weight w and bias term b), we implemented a complex synaptic model <ref type="bibr">2</ref> with m dynamical variables $u_1, \ldots, u_m$ in discrete time. Here m denotes the total number of variables per synapse (a measure of synaptic complexity), each of which operates on a different timescale. 
Specifically, at each time step t the dynamical variables $u_k$ (for $2 \leq k \leq m$) are updated as follows (the indices i and j labeling the synapses are omitted for simplicity)</p><p>$u_k(t+1) = u_k(t) + n_0^{-2k+2}\, a\, (u_{k-1}(t) - u_k(t)) - n_0^{-2k+1}\, a\, (u_k(t) - u_{k+1}(t))$. (Equation <ref type="formula">4</ref>)</p><p>For k = m, the last variable $u_{k+1}$ is simply set to zero in this update equation, and for k = 1 we have $u_1(t+1) = u_1(t) + I(t) - n_0^{-1}\, a\, (u_1(t) - u_2(t))$. (Equation <ref type="formula">5</ref>)</p><p>Here $I(t)$ is the desirable update ($\Delta w$ or $\Delta b$) imposed by the pattern $x(t)$, which takes a value +1 or −1 and is computed from Equations 2 or 3. The first variable $u_1$ is used as the actual value of the synaptic weight w or bias b at test time. The parameters $a$ and $n_0$ determine the overall timescale of the model dynamics and the ratio of timescales of successive synaptic variables (we set $a = 0.25$ and $n_0 = 2$ in our models; see <ref type="bibr">2</ref> for additional details).</p><p>To study the situation in which variables can only be stored with limited precision, we discretized the m synaptic variables and truncated their dynamical range to a maximum and minimum value. Hence, each variable can take one of only a finite number of integer-spaced values arranged symmetrically around zero, namely $\{-V, -V+1, \ldots, V-1, V\}$, where in our simulations we chose $V = 31/2$, corresponding to 32 levels (5 bits). 
At every time step, if the value $u_k(t+1)$ computed according to Equations 4 and 5 falls between two adjacent levels, its new value is set to one of those two levels, based on the result of a biased coin flip with an odds ratio equal to the inverse ratio of the distances from $u_k(t+1)$ to the two levels.</p><p>In the comparison between models with simple synapses and complex synapses, we further considered different learning rates q for the plastic synapses. Each synapse in the model is updated independently. A synapse $w_{ij}$ is updated with probability q every time an input pattern is stored. With probability $1 - q$, the weight update $\Delta w_{ij}$ is rejected and $w_{ij}$ remains unchanged. Synapses with a smaller learning rate are considered more "rigid". <ref type="bibr">3,</ref><ref type="bibr">41</ref></p><p>Readout (detection) module</p><p>The readout (detection) module compares the output $x = [x_1, \ldots, x_N]^T$ of the embedding module and the output $y = [y_1, \ldots, y_N]^T$ of the memory module to assess the level of familiarity of a given pattern. This module computes the Hamming distance between x and y, and outputs "familiar" (or "unseen"/"unfamiliar"/"novel") if the distance is smaller (or larger) than some pre-set threshold. This approach is similar to the one proposed in <ref type="bibr">36</ref>.</p></div>
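<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of Equations 4 and 5 together with the discretization described above, the following minimal Python sketch performs one update of a single complex synapse. It is not the code used for the simulations reported here: the function names (stochastic_round, update_synapse), the choice to truncate before rounding, and the random initialization on the grid are assumptions made for this example, and only the Python standard library is used.</p>
<p><![CDATA[
```python
import math
import random


def stochastic_round(value, v_max, rng):
    """Truncate value to [-v_max, v_max] and round it stochastically to the
    grid {-v_max, -v_max + 1, ..., v_max} (assumed order: clip, then round)."""
    value = max(-v_max, min(v_max, value))
    lower = math.floor(value + v_max) - v_max  # grid level at or below value
    frac = value - lower                       # distance to that level, in [0, 1)
    # Rounding up with probability frac gives odds frac : (1 - frac), i.e. the
    # inverse ratio of the distances to the upper and lower levels.
    return lower + 1 if rng.random() < frac else lower


def update_synapse(u, desired, alpha=0.25, n0=2.0, v_max=15.5, rng=random):
    """Return the m variables [u_1, ..., u_m] after one time step.

    desired is the input I(t) (+1 or -1) acting on u_1 (Equation 5); the other
    variables follow Equation 4, with u_{m+1} fixed to zero.
    """
    m = len(u)
    new_u = []
    for k in range(1, m + 1):
        u_k = u[k - 1]
        u_next = u[k] if k < m else 0.0
        if k == 1:
            inflow = desired
        else:
            inflow = n0 ** (-2 * k + 2) * alpha * (u[k - 2] - u_k)
        outflow = n0 ** (-2 * k + 1) * alpha * (u_k - u_next)
        new_u.append(stochastic_round(u_k + inflow - outflow, v_max, rng))
    return new_u


rng = random.Random(0)
# Zero is not a level when V = 31/2, so start from a randomly rounded zero.
u = [stochastic_round(0.0, 15.5, rng) for _ in range(6)]
u = update_synapse(u, +1, rng=rng)             # store one pattern
for _ in range(100):                           # then store unrelated patterns
    u = update_synapse(u, rng.choice([-1, 1]), rng=rng)
print(u[0])  # u_1 is read out as the weight (or bias) at test time
```
]]></p>
<p>All variables are updated from their values at time t, as in Equations 4 and 5. A learning rate q smaller than 1 could be emulated in this sketch by skipping the call that stores a pattern with probability 1 - q; how rejected updates interact with the internal dynamics is not specified here and is left out.</p></div>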
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Synthesizing artificial face patterns</head><p>The VGGFace2 data set contains faces from only 9131 different people. Evaluating the memory performance of our system over multiple timescales requires storing a larger number of independent, non-evaluated patterns in between the face patterns whose memory signals are being tracked. Thus, we synthesized artificial face patterns matching the first and second moments of real face patterns. First, we extracted the mean values and the diagonal covariance matrix of the face feature vectors after PCA to get an estimate of the distribution of patterns generated by the faces of all the people in the data set. We then synthesized artificial face patterns by passing new samples from the corresponding multivariate normal distribution through the binarization step. Mathematically, this process is equivalent to generating unstructured, random, binary patterns. These artificial patterns were presented to the neural system to be memorized at time steps in between the storage of the real face patterns, but were not used to evaluate the memory performance.</p></div>
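<div xmlns="http://www.tei-c.org/ns/1.0"><p>The synthesis procedure can be sketched in a few lines of Python. This is an illustrative example under stated assumptions, not the original pipeline: the function name synthesize_patterns is hypothetical, the binarization is assumed to take the sign of each feature (threshold at zero), and the means and variances below are placeholders rather than the moments estimated from VGGFace2.</p>
<p><![CDATA[
```python
import random


def synthesize_patterns(means, variances, n_patterns, rng):
    """Draw n_patterns samples from a Gaussian with the given per-feature means
    and variances (diagonal covariance), then binarize each feature to +1/-1."""
    patterns = []
    for _ in range(n_patterns):
        sample = [rng.gauss(mu, var ** 0.5) for mu, var in zip(means, variances)]
        # Assumed binarization: sign of each feature (threshold at zero).
        patterns.append([1 if s > 0 else -1 for s in sample])
    return patterns


rng = random.Random(1)
n_features = 128
means = [0.0] * n_features                               # placeholder moments
variances = [1.0 / (i + 1) for i in range(n_features)]   # decreasing, as for PCs
filler = synthesize_patterns(means, variances, n_patterns=1000, rng=rng)
```
]]></p>
<p>With zero means and a diagonal covariance, each binarized feature is +1 or -1 with equal probability and independently of the others, which is why the synthesized patterns behave like unstructured random binary patterns.</p></div>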
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Interleaving with synthesized and real face patterns</head><p>For all simulations in the main text, we stored synthesized face patterns (equivalent to unstructured, random, binary patterns) at time steps in between the storage of the real face patterns. Therefore, most of the patterns stored before and after any given real face pattern are synthesized face patterns.</p><p>Here we compared interleaving with synthesized patterns against interleaving with real face patterns (different poses of face images from other non-evaluated people) for a system with N = 128 and m = 6. We found that the two cases are almost indistinguishable (see Figure <ref type="figure">S2</ref>). This indicates that the results of our main simulations do not depend on whether synthesized or real face patterns are used for interleaving.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>QUANTIFICATION AND STATISTICAL ANALYSIS Statistical differences between binary face patterns and random patterns</head><p>After dimensionality reduction and binarization in the preprocessing for face image inputs, the face patterns become binary, similar to random unstructured binary patterns sampled from independent and identically distributed Bernoulli variables. Here we show that there are higher-order non-trivial statistics in the face patterns, by comparing the pattern correlation matrices and the feature correlation matrices between the face and the random data sets.</p><p>To generate the data sets compared here, we extracted the outputs of the convolutional neural network's penultimate layer, consisting of 431050 samples (8621 people, each with 50 poses) with 2048 features. Features were first centered and then fed into principal component (PC) analysis. The covariance matrix of these 2048 PC features has decreasing diagonal variance and zero off-diagonal covariance.</p><p>Taking one pose from each person, we then binarized these PC features and obtained the face data set (M = 8621 binary patterns, each with 2048 features). In the random data set, the 8621 random patterns (with 2048 binary features) were directly sampled from a multivariate Gaussian distribution (with zero mean and diagonal covariance matrix derived from the above face PC features) and then binarized. We ran the above process with ten random seeds (sampling new poses or new random patterns).</p><p>For the top N features ($N \le 2048$) in the face and the random data sets, we then computed the pattern correlation matrix ($M \times M$; each element representing the correlation between two patterns of N dimensions) and the feature correlation matrix ($N \times N$; each element representing the correlation between two features of M dimensions). 
The variance, skewness, and kurtosis of the off-diagonal elements in the pattern correlation and the feature correlation matrices are shown in Figure <ref type="figure">S1</ref>. For the off-diagonal elements in the pattern correlation matrix, the face data set and the random data set have indistinguishable variance, but different skewness and kurtosis. For the off-diagonal elements in the feature correlation matrix, the face data set also differs from the random data set in variance, skewness, and kurtosis. These differences indicate that the face patterns differ substantially in their statistics from the random patterns also studied in the main text.</p></div>
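<div xmlns="http://www.tei-c.org/ns/1.0"><p>The comparison described above can be sketched as follows in plain Python. The helper names (pearson, off_diagonal_correlations, moments, correlation_statistics) are hypothetical, and the use of population moments and of excess kurtosis is an assumption of this example; the text does not specify which estimators were used. The toy data set is random and much smaller than the real one.</p>
<p><![CDATA[
```python
import math
import random


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # undefined for a constant vector; reported as 0 here
    return cov / math.sqrt(var_a * var_b)


def off_diagonal_correlations(vectors):
    """Upper-triangle entries of the correlation matrix. The matrix is
    symmetric, so these have the same moments as all off-diagonal entries."""
    n = len(vectors)
    return [pearson(vectors[i], vectors[j])
            for i in range(n) for j in range(i + 1, n)]


def moments(values):
    """Population variance, skewness, and excess kurtosis of a list."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {"variance": m2,
            "skewness": m3 / m2 ** 1.5,
            "kurtosis": m4 / m2 ** 2 - 3.0}


def correlation_statistics(patterns):
    """patterns: list of M binary patterns, each a list of N features."""
    features = [list(column) for column in zip(*patterns)]
    return {"pattern": moments(off_diagonal_correlations(patterns)),
            "feature": moments(off_diagonal_correlations(features))}


rng = random.Random(0)
toy = [[rng.choice([-1, 1]) for _ in range(30)] for _ in range(40)]
stats = correlation_statistics(toy)
```
]]></p>
<p>Applying correlation_statistics to a face data set and to a matched random data set, for several values of N and several seeds, would give the kind of comparison summarized in Figure S1.</p></div>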
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Evaluating the memory signal and noise</head><p>The ideal observer signal $S_{\mathrm{io}}$ and noise $N_{\mathrm{io}}$ are computed as follows.</p><p>For a given face memory, the signal at time t of the input pattern $x(t_0)$ stored at an earlier time $t_0$ is defined as the overlap (inner product) between the synaptic modification $\Delta w_{ij}(t_0)$ imposed at storage and the current ensemble of synaptic weights $w_{ij}(t)$:</p><p>(Equation <ref type="formula">6</ref>)</p><p>We can then compute the average (denoted by $\langle \cdot \rangle$) over all memories with an age of $\Delta t = t - t_0$ to obtain the expected signal $S_{\mathrm{io}}(\Delta t) = \langle S_{\mathrm{io}}(\Delta t) \rangle$, (Equation <ref type="formula">7</ref>)</p><p>and the corresponding noise term $N_{\mathrm{io}}^2(\Delta t) = \langle (S_{\mathrm{io}}(\Delta t) - \langle S_{\mathrm{io}}(\Delta t) \rangle)^2 \rangle$. (Equation <ref type="formula">8</ref>)</p><p>Similarly to the ioSNR, the readout signal $S_r$ is defined as the overlap between an input pattern $x(t_0)$ (stored at time $t_0$) and the retrieved memory pattern $y(t, t_0)$, the output of the memory module when the same pattern is presented again at time t without updating the synaptic weights. We have</p><p>$x_i(t_0)\, y_i(t, t_0)$. (Equation <ref type="formula">9</ref>)</p><p>As above, we can compute the expected signal $S_r$ and noise $N_r$ by averaging over memories of a given age, and obtain the readout signal-to-noise ratio (rSNR) $(S_r/N_r)(\Delta t)$.</p><p>Choosing the optimal threshold in the FD task</p><p>As new patterns are presented to our proposed memory system, the familiarity of any old patterns decreases with time, and therefore the distribution of the readout signal for familiar patterns approaches the one of the unseen patterns (see Figure <ref type="figure">S3A</ref>). 
In Figure <ref type="figure">S3B</ref>, we plot the classification accuracy of the detection module as a function of elapsed time (relative to the presentation of the face pattern) with different signal thresholds. Smaller thresholds result in higher error rates for unseen faces, but better performance for familiar faces. To provide an operational definition of familiarity for the detection module, we study the overall performance (the average of classification accuracy over the whole age-range) as a function of the signal threshold and age-range (see Figure <ref type="figure">S3C</ref>). We include an equal number of familiar and unfamiliar faces in these test sets, and weight false positive and false negative errors equally. For longer age-ranges, the optimal threshold gradually decreases, since the familiar faces become indistinguishable from unseen faces, although choosing the right age-range ultimately depends on the application and on the longest timescale of the synaptic model employed in the memory system.</p><p>For the evaluations in the main text, the threshold of each model is first optimized on a balanced test set with familiar faces within the model-specific age-range (we choose $t^{*}_{\mathrm{ioSNR}}$), and then evaluated over longer timescales. Since the distribution of the readout signal for unseen faces does not change over time, the detection module with the fixed threshold detects unseen faces with a constant error rate (1 - true negative rate) (see Figure <ref type="figure">S4B</ref>), while it recognizes more recent familiar faces better than older ones (true positive rate; see Figure <ref type="figure">S4A</ref>). The FD classification performance is the average of the true positive and true negative rates.</p></div>
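<div xmlns="http://www.tei-c.org/ns/1.0"><p>The threshold selection described above amounts to maximizing the average of the true positive and true negative rates on a balanced test set. A minimal sketch is given below; the function names (balanced_accuracy, best_threshold), the convention that a readout signal strictly above the threshold is classified as familiar, and the toy signal values are assumptions made for illustration.</p>
<p><![CDATA[
```python
def balanced_accuracy(familiar_signals, unseen_signals, threshold):
    """Average of the true positive rate (familiar faces above threshold)
    and the true negative rate (unseen faces at or below threshold)."""
    tpr = sum(s > threshold for s in familiar_signals) / len(familiar_signals)
    tnr = sum(s <= threshold for s in unseen_signals) / len(unseen_signals)
    return 0.5 * (tpr + tnr)


def best_threshold(familiar_signals, unseen_signals, candidates):
    """Candidate threshold with the highest balanced accuracy."""
    return max(candidates,
               key=lambda th: balanced_accuracy(familiar_signals,
                                                unseen_signals, th))


# Toy readout signals: familiar faces pooled over the chosen age-range,
# and an equal number of unseen faces.
familiar = [0.9, 0.8, 0.7, 0.6]
unseen = [0.1, 0.2, 0.3, 0.5]
threshold = best_threshold(familiar, unseen, [0.0, 0.25, 0.55, 1.0])
```
]]></p>
<p>Once fixed, the same threshold would be applied to familiar faces of any age, so the true negative rate stays constant while the true positive rate falls as memories get older.</p></div>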
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Advanced learning schedules and idealized decay models</head><p>In addition to the simple schedule described above in which each face is stored only once, we further consider two types of advanced learning schedules (namely optimal and pre-determined learning schedules), where the same image pattern can be presented more than once and is evaluated during each presentation.</p><p>Under the optimal learning schedules, the ideal observer signal (ioSignal) of a specific pattern is constantly monitored after its initial presentation. Every time its ioSignal drops below the pre-specified threshold, the same pattern is presented again (refreshed) to boost its memory strength. The length of the n-th interval, between the n-th and $(n+1)$-th presentations, will vary with the interval number n. Even though monitoring the memory signal in real time (without modifying it) is not feasible in experiments, we will show that this theoretical analysis of the length of the n-th interval reveals major differences between synaptic models with different complexity, which have measurable consequences.</p><p>Motivated by theoretical results on optimal learning schedules, we also propose pre-determined learning schedules, under which the length of the interval between two consecutive presentations takes the form $g n^{b}$, where n is the interval number, g is the length of the first interval (between the first and the second presentations), and b is the exponent. 
When b takes the value 0 or 1, this schedule reduces to a constant or a linear schedule, respectively, in which a specific pattern is presented again (refreshed) after an interval of constant length or after an interval of linearly increasing length.</p><p>To facilitate the study of the behavior of our simulated models with simple and complex synapses under these schedules, we introduced three idealized decay models in which the memory signal decays with a pre-specified profile: exponential, inverse-square-root power-law, and hyperbolic power-law. This allows us to quickly run many numerical experiments with different learning schedules, since we do not have to simulate the internal dynamics of the complex synapses, which are inherently stochastic and would require averaging over many realizations to obtain an estimate of the expected behavior (which is instead represented directly by the specified decay function). These idealized models will allow us to demonstrate that pre-determined learning schedules can be used in experiments to discriminate different decay profiles, which are related to the complexity of a memory system, without the need to access its internal constituents.</p><p>Free parameters in the idealized decay models were chosen to match the behavior of the simulated models with simple or complex synapses:</p><p>(1) The exponential decay model, with memory signal $r(t) = C_{\mathrm{exp}}\, e^{-t/t_{\mathrm{exp}}}$, where $C_{\mathrm{exp}} = 1$ is the initial memory strength and $t_{\mathrm{exp}} = 7.486$ is the time constant, fit to the signal strength of simulated models with simple synapses ($m = 1$).</p><p>(2) The inverse-square-root power-law decay model, with $r(t) = C_{\mathrm{isr}} / \sqrt{t + 1}$, where $C_{\mathrm{isr}} = 1.316$ is the initial memory strength, chosen to fit the power-law decay regime of the signal strength of models with complex synapses ($m = 8$).</p><p>(3) The hyperbolic power-law decay model, with $r(t) = C_{\mathrm{hyp}} / (t + 1)$, 
where $C_{\mathrm{hyp}}$ is equal to $C_{\mathrm{isr}}$.</p><p>Asymptotic behavior of the optimal learning schedules for idealized synaptic models with specific decay kernels</p><p>Here we derive the constant length of the interval between successive presentations of the same pattern for a simple synaptic model with an exponential decay and the asymptotically linear increase of the length of the interval for the inverse-square-root decay model.</p><p>Let $r(t, t_n) = r(t - t_n)$ denote the decay kernel of the memory signal of a synaptic model at time t for a pattern presented at time $t_n$ (the n-th presentation, $t \ge t_n$). If q is the threshold on this memory signal (such that when the signal has dropped to this level the same pattern will be presented again), we have the following system of equations: $\begin{cases} r(t_2, t_1) = q \\ r(t_3, t_2) + r(t_3, t_1) = q \\ \quad\vdots \\ r(t_n, t_{n-1}) + r(t_n, t_{n-2}) + \cdots + r(t_n, t_1) = q \end{cases}$ (Equation <ref type="formula">10</ref>)</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>Zuckerman Institute, Columbia University, New York, NY 10027, USA</p></note>
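<div xmlns="http://www.tei-c.org/ns/1.0"><p>The system in Equation 10 can also be explored numerically for the idealized decay kernels. The sketch below is an illustration only: it solves each equation for the next presentation time by bisection (a numerical stand-in, not the analytical derivation), the function name optimal_schedule is hypothetical, and the threshold value 0.5 is an arbitrary choice for the example. The kernel parameters are those of the idealized exponential and inverse-square-root models.</p>
<p><![CDATA[
```python
import math


def optimal_schedule(kernel, threshold, n_presentations, tol=1e-9):
    """Presentation times t_1 = 0, t_2, ... such that the summed kernels of
    all earlier presentations equal the threshold at each new presentation
    (Equation 10). Assumes a decreasing kernel that tends to zero."""
    if kernel(0.0) <= threshold:
        raise ValueError("kernel(0) must exceed the threshold")
    times = [0.0]

    def total_signal(t):
        return sum(kernel(t - t_n) for t_n in times)

    while len(times) < n_presentations:
        lo = times[-1]
        hi = lo + 1.0
        while total_signal(hi) > threshold:   # expand until the signal is low
            hi = lo + 2.0 * (hi - lo)
        while hi - lo > tol:                  # bisection on the crossing time
            mid = 0.5 * (lo + hi)
            if total_signal(mid) > threshold:
                lo = mid
            else:
                hi = mid
        times.append(hi)
    return times


def intervals(times):
    return [b - a for a, b in zip(times, times[1:])]


exp_gaps = intervals(optimal_schedule(lambda t: math.exp(-t / 7.486), 0.5, 6))
isr_gaps = intervals(optimal_schedule(lambda t: 1.316 / math.sqrt(t + 1), 0.5, 8))
print(exp_gaps)  # equal from the second interval onward
print(isr_gaps)  # each interval longer than the previous one
```
]]></p>
<p>For the exponential kernel the signal just after each refresh is the threshold plus the initial strength, so every interval after the first has the same length, whereas for the inverse-square-root kernel the slowly decaying contributions of earlier presentations make successive intervals longer.</p></div>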
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_1"><p>iScience 26, 105856, January 20, 2023</p></note>
		</body>
		</text>
</TEI>
