<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Semi-supervised learning and inference in domain-wall magnetic tunnel junction (DW-MTJ) neural networks</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>09/16/2019</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10145308</idno>
					<idno type="doi">10.1117/12.2530308</idno>
					<title level='j'>SPIE Spintronics XII</title>
<idno></idno>
<biblScope unit="volume">11090</biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Christopher H. Bennett</author><author>Naimul Hassan</author><author>Xuan Hu</author><author>Jean Anne Incorvia</author><author>Joseph S. Friedman</author><author>Matthew M. Marinella</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Advances in machine intelligence have sparked interest in hardware accelerators to implement these algorithms, yet embedded electronics have stringent power, area, and speed requirements that may limit nonvolatile memory (NVM) integration. In this context, the development of fast nanomagnetic neural networks using minimal training data is attractive. Here, we extend an inference-only proposal using the intrinsic physics of domain-wall MTJ (DW-MTJ) neurons for online learning to implement fully unsupervised pattern recognition operation, using winner-take-all networks that contain either random or plastic synapses (weights). Meanwhile, a read-out layer trains in a supervised fashion. We find our proposed design can approach state-of-the-art success on the task relative to competing memristive neural network proposals, while eliminating much of the area and energy overhead that would typically be required to build the neuronal layers with CMOS devices.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION</head><p>Conventional (von Neumann) computing devices currently face serious difficulties in locally storing, mapping, and manipulating complex data structures, an ability that may be critical to systems that cannot rely on external (cloud) processing for energy, distance, or trust/reliability reasons. These difficulties can broadly be divided into issues with continued processor scaling, and memory energy overhead and bandwidth issues, the latter of which is often referred to as the memory wall. <ref type="bibr">1</ref> However, since these problems arise as a symptom of old computer architecture, the use of novel, brain-inspired (neuromorphic) architectures that exploit existing and modified logic and memory devices remains an alternate path. <ref type="bibr">2</ref> In addition to the use of standard CMOS and memory devices to realize neuromorphic prototypes, <ref type="bibr">3</ref> emerging non-volatile memory (NVM) devices have recently come into consideration for in-memory computing (IMC) systems. Neuromorphic/IMC systems built with NVM devices can reach very high density in the crossbar configuration, since the cell size can scale as low as 4F<hi rend="superscript">2</hi>, where F is the base feature size, and can serve as an excellent template for hardware implementations of neural network learning and inference that exploit local physics and electrical effects. <ref type="bibr">4</ref> This combination of form (extreme density) and function (data transformation/analysis) opens a pathway to extreme energy efficiency in future computing systems. Despite this promise, many types of NVM devices remain at the prototype stage, or suffer from critical issues such as low durability, slow programming/write time, or a high required programming current. 
Thankfully, magnetic tunnel junction (MTJ) memory devices, including typical two-terminal spin-transfer torque MTJ (STT-MTJ) and emerging three-terminal spin-orbit torque MTJ (SOT-MTJ) devices, open a doorway to dense arrays of emerging memory with low current (&lt; 10 &#181;A), fast speed (&lt; 10 ns), and high endurance (&gt; 10<hi rend="superscript">12</hi> operations). <ref type="bibr">5,</ref><ref type="bibr">6</ref> In order to fulfill the neuromorphic/IMC vision and perform local or in-situ learning, candidate NVM devices must be programmable according to simplified rules that are derived either from state-of-the-art optimization techniques in machine learning, e.g. stochastic gradient descent and back-propagation, or from neuroscience-derived rules, such as spike-timing dependent plasticity (STDP). While both approaches have so far been experimentally demonstrated using alternative emerging NVM devices, <ref type="bibr">7-9</ref> demonstrations of online learning using spintronic NVM candidates such as SOT-MTJ and STT-MTJ are sparse, or rely on significant CMOS-based circuits to operate. <ref type="bibr">10,</ref><ref type="bibr">11</ref> An additional challenge is that, while emerging NVM devices are typically highly analog, MTJ variant devices are typically binary, switching between their antiparallel (AP) and parallel (P) states. This requires the use of stochastic behaviors as a stand-in for physically analog behavior, an approach that can nonetheless yield promising results. <ref type="bibr">12</ref> In addition, spintronic memories are an exciting alternative for future neuromorphic systems since they may avoid some of the signature issues that plague neuromorphic implementations with existing filamentary or phase-change devices, e.g. characteristic non-linearity and write-mode asymmetry relating to physical challenges in moving ions. The combination of non-linearity and asymmetry can be devastating to efficient NVM online learning systems, so finding building blocks where this issue can be obviated is a promising path forward. 
<ref type="bibr">13</ref> We focus on an especially promising spintronic device: the three-terminal domain-wall MTJ (DW-MTJ) device, which offers intrinsic analog behavior and the potential for attojoule switching energies. <ref type="bibr">14</ref> While experimentally demonstrated only in the logic context so far, <ref type="bibr">15</ref> these devices constitute a possible implementation of not only the synapse but also the neuron in emerging nanomagnetic hardware neural network proposals. Specifically, an array of DW-MTJ devices can be used to implement the essential features of a population or layer of spiking neurons (leak, integrate, and fire) along with the inhibition behavior that is critical to unsupervised learning. <ref type="bibr">16</ref> In this work, we consider how effectively a nanomagnetic neural network with DW-MTJ hidden and output layers can perform on a standard machine learning task when its weights are learned in situ. Specifically, we consider a dual-phase, or semi-supervised, approach to online learning. In this approach, an unsupervised system/layer first adapts and is later co-integrated with a read-out system. Both systems can be physically realized using an adaptive crossbar system with emerging memory devices and realized at terabit or higher density scales for efficient edge computing applications. In the following sections, we detail this design and preliminary results when using it on a standard task.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">DW-MTJ DEVICE FULLY ONLINE LEARNING SYSTEM</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Realistic neuronal emulation by DW-MTJ nanotracks</head><p>As described in <ref type="bibr">14</ref>, domain-wall motion in a logic or memory device is physically modeled and explained by the spin polarization of current injected into a soft ferromagnetic wire that sits between an input and an output MTJ device; antiferromagnetic contacts placed on the ends of the wire create an exchange bias which pins the magnetic field at either end, and the domain wall is driven back towards the input MTJ or forward towards the output MTJ. This complex physical system can be notably simplified by noting that the current must be above a certain critical threshold I<hi rend="subscript">th</hi> in a certain direction to drive the DW 'forward' or 'backward'.</p><p>Recently, the DW-MTJ device was further engineered with the addition of a heavy ferromagnetic underlayer such that, when no current is applied, the position (x) of the DW tends to relax naturally towards the input MTJ. <ref type="bibr">16</ref> In this same work, the ability of transverse magnetic fields from a given DW-MTJ track to inhibit its neighbors was explored and confirmed via micromagnetic simulations. Taken together, the operations of leak, integrate, and fire, along with inhibition, create a physical basis for realistic emulation of winner-take-all circuits. Besides an underlayer, an alternative physical method to implement intrinsic leaking of DW-MTJs is the use of a gradient anisotropy. <ref type="bibr">17</ref> </p></div>
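<div xmlns="http://www.tei-c.org/ns/1.0"><p>The leak, integrate, and fire behavior described above can be illustrated with a minimal Python sketch of a single track. It is a toy model, not the micromagnetic model of the cited works: the class name DWNeuron, the normalized units, and every parameter value (i_th, mobility, leak, length) are assumptions chosen for illustration, and only forward drive above the critical current is modeled.</p><p><![CDATA[
```python
from dataclasses import dataclass


@dataclass
class DWNeuron:
    """Toy leaky integrate-and-fire model of one DW-MTJ track (hypothetical values)."""
    length: float = 1.0      # normalized distance from input MTJ to output MTJ
    i_th: float = 0.2        # critical current below which the wall does not advance
    mobility: float = 0.05   # wall displacement per step per unit of excess current
    leak: float = 0.01       # per-step relaxation rate toward the input MTJ
    x: float = 0.0           # wall position (0 = at the input MTJ)

    def step(self, current: float) -> bool:
        """Advance one forward-Euler step; return True if the neuron fires."""
        # Integrate: only current above the critical threshold moves the wall.
        drive = self.mobility * (current - self.i_th) if current > self.i_th else 0.0
        # Leak: the wall relaxes toward the input MTJ in proportion to its position.
        self.x = max(self.x + drive - self.leak * self.x, 0.0)
        if self.x >= self.length:   # wall reached the output MTJ
            self.x = 0.0            # fire and reset
            return True
        return False


neuron = DWNeuron()
spikes = [neuron.step(1.0) for _ in range(50)]
print("fired:", any(spikes))
```
]]></p><p>With these assumed values, a constant normalized current of 1.0 drives the wall to the output MTJ within a few tens of steps, whereas a current below i_th leaves the wall at the input MTJ and an undriven wall decays back towards it.</p></div>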
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Realizing fully online learning DW-MTJ neural networks</head></div><div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.1">First (unsupervised) layer</head><p>Winner-take-all (WTA) circuits are a staple of neural computing, and it has been shown that soft winner-take-all gates, i.e. those with more than one 'winning' or spiking neuron per layer/population, can serve as a foundation for universal function approximation. <ref type="bibr">18</ref> Moreover, when WTA networks use plastic weight updates governed by the biological learning rule STDP, they have been demonstrated to possess powerful emergent computational properties such as approximating hidden Markov models. <ref type="bibr">19</ref> In addition, the use of plastic updates and lateral inhibition in the context of vector-matrix multiplies and subsequent updates in a dense memory array has been suggested to implement the powerful algorithm known as Expectation Maximization (EM). <ref type="bibr">20</ref></p><figure><figDesc>The two systems, which can both be conveniently realized in a shared nanofabric, both use DW-MTJ devices at the hidden and output layer, notably using the lateral inhibition function to implement the k-WTA property and the max-out property, respectively. The yellow crosspoints, here representing adaptive synapses, can in principle be any two-terminal NVM device. For the first layer, we have considered them as both analog filamentary and binary magnetic devices; for the second layer, we have considered them as only analog devices.</figDesc></figure><p>Due to these mutually powerful effects, we use a first layer of DW-MTJs learning in two cases: with non-adaptive synapses but active neuron effects still in place (random weights), and with plastically adaptive synapses along with the active neuron effects (STDP). In addition, we have considered two variants of STDP to determine the variety of plastic updates needed to implement this additional computational power. 
On one hand, we considered a binary approximation of STDP (B-STDP), which, following <ref type="bibr">12</ref>, relies upon the naturally stochastic behavior of two-terminal STT-MTJ junctions and moves NVM devices in the first layer between only extrema states (P and AP, in the case of STT-MTJ devices). On the other hand, we considered analog STDP (A-STDP), in which multi-bit analog NVM devices switch between multiple states, as proposed in <ref type="bibr">21</ref> and realized experimentally in <ref type="bibr">22</ref>. As in these references, we have assumed the naturally exponential (non-linear) nature of the device's state changes.</p></div>
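<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two plasticity variants can be contrasted in a short Python sketch. Both functions are illustrative assumptions rather than the simulator used for the results: the function names, the switching probability p_switch, the learning rate, and the convention that only synapses of spiking neurons are updated (potentiating active inputs and depressing silent ones) are hypothetical simplifications.</p><p><![CDATA[
```python
import random


def binary_stdp(w, pre, post, p_switch, rng):
    """Stochastic binary STDP on weights w[i][k] holding 0 (AP) or 1 (P)."""
    new_w = [row[:] for row in w]
    for k, fired in enumerate(post):
        if not fired:
            continue                      # only synapses of spiking neurons update
        for i, active in enumerate(pre):
            if rng.random() < p_switch:   # stochastic switching event
                new_w[i][k] = 1 if active else 0
    return new_w


def analog_stdp(w, pre, post, rate=0.1, w_min=0.0, w_max=1.0):
    """Analog STDP with exponential (soft-bound) steps toward w_max or w_min."""
    new_w = [row[:] for row in w]
    for k, fired in enumerate(post):
        if not fired:
            continue
        for i, active in enumerate(pre):
            if active:
                new_w[i][k] += rate * (w_max - new_w[i][k])   # potentiate
            else:
                new_w[i][k] -= rate * (new_w[i][k] - w_min)   # depress
    return new_w


pre = [1, 0, 1, 0]     # input activity for one example
post = [0, 1, 0]       # only hidden neuron 1 spiked
w_bin = binary_stdp([[0] * 3 for _ in range(4)], pre, post, 0.5, random.Random(0))
w_ana = analog_stdp([[0.5] * 3 for _ in range(4)], pre, post)
print(w_bin)
print(w_ana)
```
]]></p><p>In this sketch the binary rule only ever writes the two extreme states, with learning speed set by the switching probability, while the analog rule takes steps that shrink as a weight approaches its bound, which is one simple way to express an exponential state change.</p></div>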
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.2">Second (supervised) layer</head><p>As noted in <ref type="bibr">23</ref>, such an unsupervised layer can be difficult to read out properly in the context of standard machine learning tasks, creating the need for a companion supervised learning system. Although the companion supervised learning system does not increase the computational capacity of the network, it serves as a bridge to implement an easily interpretable threshold-based learning system. Following hardware demonstrations of an efficient sign-based implementation of the classic delta or Widrow-Hoff rule, <ref type="bibr">24</ref> we realize weights in pairs of NVM devices, and adjust weights at every weight pair according to:</p><p>where &#8710;W<hi rend="subscript">i,k</hi> is the weight change at each programming step for the synapse connecting input i to output k, T<hi rend="subscript">k</hi> is the expected output, O<hi rend="subscript">k</hi> the actual output, X<hi rend="subscript">i</hi> the input value, and &#8710;G the conductance change. In our network, the teacher signal is the label of the actual image class (e.g. digit '1') at the moment of teaching/training, but it may in principle be any real-time signal directing the system's adaptation.</p></div>
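<div xmlns="http://www.tei-c.org/ns/1.0"><p>A sign-based delta rule that uses the symbols defined above can be sketched in Python as follows. The specific form assumed here, a fixed step of size dg applied in the direction sgn((T_k - O_k) X_i), is an illustration and should not be read as the exact rule used in this work. The function names, the choice to realize a negative change by raising the negative device of the pair, and the conductance bound g_max are likewise hypothetical.</p><p><![CDATA[
```python
def sign(v):
    """Return -1, 0, or +1 according to the sign of v."""
    return (v > 0) - (v < 0)


def delta_update(g_pos, g_neg, x, target, output, dg=0.01, g_max=1.0):
    """Apply one sign-based delta-rule step to differential conductance pairs.

    The effective weight is W[i][k] = g_pos[i][k] - g_neg[i][k]. A positive
    requested change raises g_pos and a negative one raises g_neg, each by a
    fixed step dg (assumed scheme), so only the sign of the error is needed.
    """
    for i, x_i in enumerate(x):
        for k, (t_k, o_k) in enumerate(zip(target, output)):
            s = sign((t_k - o_k) * x_i)
            if s > 0:
                g_pos[i][k] = min(g_pos[i][k] + dg, g_max)
            elif s < 0:
                g_neg[i][k] = min(g_neg[i][k] + dg, g_max)
    return g_pos, g_neg


g_pos = [[0.5, 0.5], [0.5, 0.5]]
g_neg = [[0.5, 0.5], [0.5, 0.5]]
# Input 0 is active; output 0 should have fired but did not, output 1 fired wrongly.
delta_update(g_pos, g_neg, x=[1, 0], target=[1, 0], output=[0, 1], dg=0.1)
weights = [[p - n for p, n in zip(rp, rn)] for rp, rn in zip(g_pos, g_neg)]
print(weights)
```
]]></p><p>Because the step size is fixed, the update needs only the sign of the error and the presence of an input, so it maps onto identical programming pulses rather than pulses of graded amplitude.</p></div>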
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">SIMULATION METHODS AND RESULTS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Description of dual learning operation simulations</head><p>In our simulations, we model the interacting DW-MTJs during unsupervised learning with a time-step simulator using forward Euler integration, along with a simplified model of lateral inhibition. Each example is presented for t = 200 time steps, during which the first set of parallel DW-MTJs, DW<hi rend="subscript">M</hi>, operates based on the input currents provided at that moment. At the end of the last time step, the spike(s) are collected by the simulator and passed forward to implement the plasticity operation (if it is being implemented in that first network W<hi rend="subscript">us</hi>). Following this operation, another example is presented and fed to the DW-MTJs, for a total of L<hi rend="subscript">us</hi> unsupervised learning samples. After this, the weights of the first memory matrix/crossbar are frozen and remain so during the second phase. During the second phase of the dual learning approach, training inputs are again fed to the input network (W<hi rend="subscript">us</hi>), spikes are collected from the DW-MTJ interaction sub-simulator, and these spikes are propagated to the second network (W<hi rend="subscript">s</hi>) in a logical sign-symmetric manner as suggested in <ref type="bibr">25</ref>. As pictured in Fig. <ref type="figure">2(a)</ref>, the output from this second crossbar is used to produce a set of spike outputs from the second set of DW-MTJs, DW<hi rend="subscript">N</hi>, which are collected in a logic cell or local memory. Subsequently, the comparison of these spike outputs to a teaching signal is used to implement the on-chip learning rule. Meanwhile, as in Fig. <ref type="figure">2(b)</ref>, the programming occurs via simultaneous voltage pulses which, in the case of the post-synaptic pulse, must bypass the DW-MTJ devices. This operation occurs a total of L<hi rend="subscript">s</hi> times using the frozen weights already obtained in the first phase. </p></div>
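<div xmlns="http://www.tei-c.org/ns/1.0"><p>The time-stepped presentation of one example to a layer of parallel tracks can be sketched in Python as below. This is a toy stand-in for the simulator described above, not its actual implementation: the function name simulate_layer, the normalized parameters, the rule that a fired wall stays latched at the output until the presentation ends, and in particular the inhibition model (a trailing wall is slowed in proportion to its distance behind the most advanced wall) are all assumptions made for illustration.</p><p><![CDATA[
```python
def simulate_layer(currents, inhibition=5.0, steps=200, i_th=0.2,
                   mobility=0.05, leak=0.01, length=1.0):
    """Forward-Euler simulation of parallel DW-MTJ tracks for one example.

    currents: one input current per track, held constant during the presentation.
    Returns a list of 0/1 flags marking which tracks reached the output MTJ.
    """
    x = [0.0] * len(currents)               # wall positions, 0 = input MTJ
    for _ in range(steps):
        lead = max(x)                       # position of the most advanced wall
        new_x = []
        for j, current in enumerate(currents):
            if x[j] >= length:              # already fired: wall stays latched
                new_x.append(length)
                continue
            drive = max(current - i_th, 0.0)
            # Assumed inhibition: the further a wall trails the leader, the
            # more its forward drive is suppressed (never below zero).
            brake = max(1.0 - inhibition * (lead - x[j]) / length, 0.0)
            move = mobility * drive * brake - leak * x[j]
            new_x.append(min(max(x[j] + move, 0.0), length))
        x = new_x
    return [1 if pos >= length else 0 for pos in x]


currents = [1.0, 0.9, 0.5, 0.3]
print(simulate_layer(currents, inhibition=0.0))
print(simulate_layer(currents, inhibition=5.0))
```
]]></p><p>In this sketch, the spike flags returned after the last time step play the role of the spikes that are collected and passed to the plasticity operation or to the second network. Raising the assumed inhibition strength reduces the number of tracks that fire for the same inputs, which is the winner-take-all behavior the layer relies on.</p></div>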
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Preliminary simulation results</head><p>We have tested this procedure and our simulator on the classic MNIST task of handwritten digits. Our preliminary simulations suggest that with DW<hi rend="subscript">M</hi> = 200, that is, with 200 competing hidden units, and with DW<hi rend="subscript">N</hi> = 10 as in <ref type="bibr">16</ref> at the final output layer, we achieve 88% accuracy on the test set with plastic binary synapses and 92% with plastic analog synapses, following L<hi rend="subscript">us</hi> = 500 and then L<hi rend="subscript">s</hi> = 50000. In order to verify that the weight adaptations in the unsupervised layer are of value, we also set the first-layer weights as constant (that is, skipped the first of the dual phases mentioned earlier). Considering random binary and analog devices, we achieve 78% and 79% respectively using this approach; this result confirms the value of lateral inhibition combined with plastic updates. However, regarding the value of analog vs. binary plastic updates, our early results do not yet establish that one is definitively better than the other. Since the binary updates rely upon stochastic effects to perform properly, we hope to better benchmark whether different probability switching distributions on each device may help to close this gap. Nevertheless, at the moment we can observe notably richer analog filters in the first layer as compared to binary ones, as in Fig. <ref type="figure">3</ref>. Lastly, we have eliminated the first layer entirely and found that a perceptron with max-out DW-MTJs, following the weight updates as in Fig. <ref type="figure">2</ref>, achieves 84% on the test set, an intermediate result between the random-weight input layer and the plastic input layer. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">DISCUSSION</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Comparison to state-of-the-art approaches</head><p>While <ref type="bibr">10,</ref><ref type="bibr">11</ref> previously proposed nanomagnetic neural networks incorporating realistic leaky-integrate-fire and STDP learning approaches, respectively, neither incorporated the inhibition effect which is critical to k-WTA learning. By combining STDP and k-WTA and demonstrating that the two effects together outperform either alone, our work opens a new avenue of research to explore the maximum computational power possible with locally learning arrays. Meanwhile, in comparison to other proposals combining WTA and stochasticity in the context of non-magnetic memristive/NVM devices, <ref type="bibr">26</ref> we have achieved better numerical results on the task while also massively reducing CMOS reliance at the neuron device level. <ref type="bibr">26</ref> While not a state-of-the-art result on the ML task, our approach yields strong performance when compared to competing neural network schemes using spintronic devices, even those using fully featured backpropagation. For instance, this performance is superior to various in-situ learning styles with two-terminal magnetic devices alone. <ref type="bibr">27</ref> Beyond this, we hope to achieve state-of-the-art results by increasing the small number of DW-MTJ tracks operating at the hidden layer, or by demonstrating that hidden layers of DW-MTJs may be stacked in order to approach performances derived using standard convolutional neural networks, e.g. &gt; 99% on the test set. <ref type="bibr">28</ref> </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Lateral inhibition and physical constraints</head><p>Consistent with <ref type="bibr">16</ref>, our simulations calculate the lateral inhibition assuming a small inter-DW-MTJ distance &#8710;D. A parameter analysis suggests that when this distance is expanded and the magnetic lateral force B<hi rend="subscript">F</hi> exerted upon neighbor wires is too low, too many DW-MTJ devices tend to spike. This degrades the system's performance during both inference and k-WTA operation; in the worst case almost every other neuron will spike, and performance is very poor. While we calculated a favorable value for &#8710;D, and hence B<hi rend="subscript">F</hi>, which resulted in, on average, 30% or fewer of the neurons in hidden layer DW<hi rend="subscript">M</hi> firing for each example, a priority is to integrate more physically realistic modeling into the neural solver to see if this is realistic. Optimizing our system to allow for a large enough B<hi rend="subscript">F</hi> value may be the most challenging task in later physically realizing a version of this system.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">CONCLUSION</head><p>Domain-wall magnetic tunnel junction (DW-MTJ) devices are an extremely promising platform for neuromorphic computing, as they can be used at both the synapse and neuron level, have rich analog behavior, and require a very low switching energy. When co-integrated in an ultra-dense array, they can stand in for biorealistic populations of neurons by implementing the key functions of leak, integrate, and fire (LIF). We have demonstrated that, beyond inference or prediction, this property can be used in the online learning context to implement a powerful semi-supervised learning strategy. This strategy, which uses both general plastic updates in the first system/layer and specifically directed updates in the second system/layer, has already achieved promising results compared to other semi-supervised and/or spintronic neuromorphic proposals.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>Proc. of SPIE Vol. 11090 110903I-1 Downloaded From: https://www.spiedigitallibrary.org/conference-proceedings-of-spie on 19 Apr 2020 Terms of Use: https://www.spiedigitallibrary.org/terms-of-use</p></note>
		</body>
		</text>
</TEI>
