<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Baldur: A Power-Efficient and Scalable Network Using All-Optical Switches</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>02/01/2020</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10182969</idno>
					<idno type="doi">10.1109/HPCA47549.2020.00022</idno>
					<title level='j'>2020 IEEE Symposium on High Performance Computer Architecture (HPCA)</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Mohammad Reza Jokar</author><author>Junyi Qiu</author><author>Frederic T. Chong</author><author>Lynford L. Goddard</author><author>John M. Dallesasse</author><author>Milton Feng</author><author>Yanjing Li</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We present the first all-optical network, Baldur, to enable power-efficient and high-speed communications in future exascale computing systems. The essence of Baldur is its ability to perform packet routing on-the-fly in the optical domain using an emerging technology called the transistor laser (TL), which presents interesting opportunities and challenges at the system level. Optical packet switching readily eliminates many inefficiencies associated with the crossings between optical and electrical domains. However, TL gates consume high power at the current technology node, which makes TL-based buffering and optical clock recovery impractical. Consequently, we must adopt novel (bufferless and clock-less) architecture and design approaches that are substantially different from those used in current networks. At the architecture level, we support a bufferless design by turning to techniques that have fallen out of favor for current networks. Baldur uses a low-radix, multi-stage network with a simple routing algorithm that drops packets to handle congestion, and we further incorporate path multiplicity and randomness to minimize packet drops. This design also minimizes the number of TL gates needed in each switch. At the logic design level, a non-conventional, length-based data encoding scheme is used to eliminate the need for clock recovery. We thoroughly validate and evaluate Baldur using a circuit simulator and a network simulator. Our results show that Baldur achieves up to 3,000X lower average latency while consuming 3.2X-26.4X less power than various state-of-the-art networks under a wide variety of traffic patterns and real workloads, for the scale of 1,024 server nodes. Baldur is also highly scalable, since its power per node stays relatively constant as we increase the network size to over 1 million server nodes, which corresponds to 14.6X-31.0X power improvements compared to state-of-the-art networks at this scale.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Power-efficient, high-speed, and scalable networks play a critical role in exascale high-performance computing (HPC) systems. In this paper, we present the first complete design of an all-optical network, Baldur, which is built using all-optical switches to perform network packet routing completely in the optical domain. Baldur consumes less power while achieving significantly higher network performance compared to state-of-the-art networks, and is highly scalable to over 1 million server nodes to enable exascale computing.</p><p>In large-scale networks, optical links are the main communication fabric because they are significantly more efficient than electrical links <ref type="bibr">[1]</ref>, <ref type="bibr">[2]</ref>. Current networks use optical links but electrical switches, which leads to many inefficiencies due to the need to cross between optical and electrical domains. Optical switching proposals exist in the literature, but they still rely on electrical packet header processing <ref type="bibr">[3]</ref>, <ref type="bibr">[4]</ref>, <ref type="bibr">[5]</ref>. In contrast, Baldur controls optical switches entirely in the optical domain. This is highly desirable as it removes significant overheads such as packet buffering, optical-to-electrical (O-E) conversions, and electrical-to-optical (E-O) conversions, and enables on-the-fly packet processing and switching.</p><p>In Baldur, optical processing and switching are made possible by a new device technology called the transistor laser (TL) <ref type="bibr">[6]</ref>, <ref type="bibr">[7]</ref>, <ref type="bibr">[8]</ref>, <ref type="bibr">[9]</ref>, which can be used to construct optical logic gates <ref type="bibr">[10]</ref>. 
TL gate prototypes have already been fabricated <ref type="bibr">[11]</ref>, which validates the TL as a feasible and promising technology.</p><p>However, along with these opportunities, the TL brings unique constraints and challenges, so novel architecture and design approaches are essential. There are two key challenges of an all-optical design: (1) currently there is no clear path to dense optical memories <ref type="bibr">[12]</ref>, <ref type="bibr">[13]</ref>; (2) a TL gate at the current technology node consumes &gt;100X higher power than a 32 nm CMOS gate; therefore, optical clock recovery and complex logic functions are impractical. These technology constraints lead us to several unorthodox architectural and design decisions that are largely unfavorable for electrical networks but turn out to be highly suitable for TL networks, which include: (1) multi-stage networks with randomized connections between network stages <ref type="bibr">[14]</ref> to achieve high scalability and immunity to worst-case traffic patterns; (2) low-radix (e.g., 2x2) switches to minimize design complexity; (3) packet drops and re-transmissions to handle network congestion; and (4) path multiplicity inside the network switches <ref type="bibr">[14]</ref>, <ref type="bibr">[15]</ref> to minimize packet drop rate.</p><p>Moreover, to enable practical and efficient implementation of these architectural ideas, we design our 2x2 optical switch based on a non-conventional asynchronous data encoding scheme. This design consists of only 1,112 TL gates and consumes 96.6X less power than a 2x2 electrical switch.</p><p>We thoroughly evaluate our Baldur network and show that it has several major advantages. 
First, Baldur provides 3.2X-26.4X power improvements and up to 3,000X performance improvement compared to state-of-the-art electrical networks (including dragonfly <ref type="bibr">[16]</ref>, fat-tree <ref type="bibr">[17]</ref>, and multi-butterfly <ref type="bibr">[18]</ref>) under various representative traffic patterns and real workloads. Second, Baldur's architectural property makes it immune to worst-case traffic patterns <ref type="bibr">[19]</ref>. Third, Baldur is highly scalable, unlike the widely-deployed dragonfly and fat-tree networks whose scalability is limited by switch radix. Finally, Baldur eliminates the need to manage a hierarchy of network switches; for example, as shown in Figure <ref type="figure">1</ref>, Baldur connects 1,024 server nodes using only a single 1,024-port switch that fits in one cabinet.</p><p>Our key architectural contribution is that, through in-depth analysis and exploration, we obtain a fundamental understanding of various tradeoffs between network design parameters and the TL technology. Our research spans device, circuit/logic, and architecture/system layers to create the first complete design of an all-optical network that is optimized for the TL, which achieves dramatic improvements vs. existing networks even in the presence of various limitations/constraints of the current TL technology node. The major efforts of this work include:</p><p>&#8226; We characterize TL gates using a detailed device-level simulator.</p><p>&#8226; We design an all-optical TL switch, and fully validate its correctness and reliability by performing detailed circuit simulations. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. MOTIVATION</head><p>Exascale computing systems are projected to have 100,000 10 teraFLOP or 1,000,000 1 teraFLOP server nodes, and each node is required to support very high (e.g., 200 GBps) network bandwidth <ref type="bibr">[20]</ref>. Thus, networks with high power efficiency, high bandwidth, and low latency are essential in order to satisfy the high communication demand between such large numbers of nodes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Limitations of Existing Networks</head><p>In today's HPC and datacenter environments, the networks typically consist of electrical switches (to perform header processing and packet switching), electrical links for short-distance communication, and optical links for long-distance communication. Such networks are referred to as electrical networks in this paper. Dragonfly <ref type="bibr">[16]</ref> and fat-tree <ref type="bibr">[17]</ref> networks are two representative electrical networks that are widely deployed. However, their scalability is limited by the switch radix. In both networks, as the number of server nodes increases, the radix of network switches also increases. However, it is not practical to build a single &gt;64-radix switch with high bandwidth per port, mostly due to power constraints <ref type="bibr">[21]</ref>, <ref type="bibr">[22]</ref>. It is also not practical to build &gt;64-radix switches using multiple low-radix switches <ref type="bibr">[22]</ref>, since that also increases network power, cost, and latency significantly (e.g., based on our analysis, a 128K-node fat-tree network built using 80-radix switches consumes 6.4X more power per node compared to a 1,024-node fat-tree network built using 16-radix switches). Given that the radix is no more than 64, fat-tree and dragonfly networks are limited to 66K and 263K nodes, respectively <ref type="bibr">[16]</ref>, <ref type="bibr">[17]</ref>.</p><p>Another class of well-known network architectures is the stage-based networks. An example of such networks is multi-butterfly with randomized connections between network stages, which is highly scalable and immune to worst-case traffic patterns <ref type="bibr">[18]</ref>. However, electrical multi-stage networks are expensive/impractical at large scales due to significant O-E/E-O/SerDes overheads. 
For example, based on our analysis, a radix-2 multi-butterfly with multiplicity of 4 consumes 223.5 W per node at the scale of 1,024 nodes (6X higher than fat-tree), and 41.7% of the power is attributed to O-E/E-O conversions and SerDes units.</p><p>In the literature, networks that utilize optical switching elements have also been proposed <ref type="bibr">[3]</ref>, <ref type="bibr">[4]</ref>, <ref type="bibr">[5]</ref>. Arrayed Waveguide Grating Routers (AWGRs) are one of the most popular examples <ref type="bibr">[3]</ref>, <ref type="bibr">[5]</ref>, <ref type="bibr">[23]</ref>, and they allow multiple input ports to send packets to one output port simultaneously using different wavelengths. However, in these networks, header processing is still performed in the electrical domain, which not only increases packet latency, but also incurs many of the same inefficiencies as electrical networks (i.e., significant packet buffering and O-E/E-O/SerDes overheads). Moreover, the scalability of AWGR-based networks is also limited by the switch radix. Using typical 32-radix AWGRs, the network size is limited to 128K nodes <ref type="bibr">[24]</ref>. Table <ref type="table">I</ref> summarizes the limitations of these existing networks.</p></div>
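<div xmlns="http://www.tei-c.org/ns/1.0"><p>The 66K and 263K node limits quoted in this subsection can be reproduced with a short calculation. The sketch below is illustrative only: it assumes a three-level fat-tree (k^3/4 hosts for radix k) and a balanced dragonfly (a = 2p = 2h, so that the radix is p + a - 1 + h = 4h - 1), which are the commonly used sizing formulas for these topologies rather than details stated in this paper. The function names are hypothetical.</p>
<p><![CDATA[
```python
def fat_tree_max_nodes(radix: int) -> int:
    """Hosts in a three-level fat-tree of radix-k switches (assumed formula: k^3 / 4)."""
    return radix ** 3 // 4


def dragonfly_max_nodes(max_radix: int) -> int:
    """Hosts in the largest balanced dragonfly whose switch radix fits max_radix.

    Assumed balanced configuration: a = 2p = 2h, where p is the number of
    terminals per switch, a the switches per group, h the global links per switch.
    """
    h = (max_radix + 1) // 4      # radix k = p + (a - 1) + h = 4h - 1
    p, a = h, 2 * h
    groups = a * h + 1            # one global link between every pair of groups
    return a * p * groups


print(fat_tree_max_nodes(64))     # 65536  (about 66K)
print(dragonfly_max_nodes(64))    # 262656 (about 263K)
```
]]></p>
<p>Under these assumptions, a radix of at most 64 yields 65,536 fat-tree nodes and 262,656 dragonfly nodes, which match the rounded limits given above.</p></div>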
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Opportunities and Challenges of the TL All-Optical Network</head><p>To the best of our knowledge, no practical technology existed to perform optical computing in the past. As the TL emerges as a key technology capable of high-speed optical computing, it provides a well-grounded basis to create all-optical networks, where both header processing and packet switching are performed in the optical domain. This is highly desirable in HPC and datacenter networks because:</p><p>&#8226; All-optical processing and switching eliminates the overheads associated with O-E/E-O conversions and SerDes units, which leads to significant reduction in power.</p><p>&#8226; It also eliminates the high power overheads associated with packet buffering (including memories, virtual channel allocation, and control logic <ref type="bibr">[25]</ref>, <ref type="bibr">[26]</ref>, <ref type="bibr">[27]</ref>).</p><p>&#8226; Eliminating the aforementioned overheads allows us to efficiently realize the benefits (high scalability and immunity to worst-case traffic patterns) of stage-based networks.</p><p>&#8226; An all-optical network enables on-the-fly packet processing and switching, which reduces network latency significantly.</p><p>Despite its high potential, properties associated with the current TL technology constrain network design choices. First, optical buffering is not practical and consumes high power <ref type="bibr">[12]</ref>, <ref type="bibr">[13]</ref>. For example, using TL latches <ref type="bibr">[10]</ref> as optical buffers can lead to a &gt;1,000X power consumption increase in Baldur. Therefore, we do not consider optical buffers an option. Second, because the high power consumption of TL gates makes optical clock recovery impractical, and because switching must be ultra-fast, the routing algorithm must be simple and clock-less. Hence, complex adaptive/oblivious routing techniques are infeasible.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Architectural Implications and Novelty of TL Networks</head><p>The opportunities and challenges associated with the TL technology impose several key architectural implications as summarized in Table <ref type="table">II</ref>, which lead to drastically different design decisions compared to current networks.</p><p>First, while multi-stage topologies (with randomized connections between network stages <ref type="bibr">[14]</ref>) are expensive/impractical in electrical networks as discussed in Sec. II-A, they turn out to be highly efficient and optimized for TLs because all buffering/E-O/O-E/SerDes overheads are eliminated by TL's high-speed optical computing capability. Second, while many electrical networks (such as dragonfly and fat-tree) favor high-radix and low-diameter designs to minimize packet latency and power overheads associated with O-E/E-O conversions and SerDes, low-radix (e.g., 2x2) switches are required for TL networks. This is because the routing algorithms must be simple enough so that they can be implemented using TL gates without incurring significant power costs, but switching power/complexity increases Path multiplicity provides more than one possible path to deliver a packet from the switch input port to the designated output destination <ref type="bibr">[14]</ref>, <ref type="bibr">[15]</ref>. This idea incurs high cost for electrical networks, but is well suited to TL networks: given that TL networks must be constructed using low-radix switches, and that there are no SerDes/E-O/O-E overheads, the benefits obtained from a reduced packet drop rate are well worth the complexity/cost associated with path multiplicity. 
In summary, although the architectural ideas in Table II have largely fallen out of favor for current electrical networks due to various practical concerns, the new TL technology presents drastically different tradeoffs. We therefore judiciously revisit these ideas from the large architectural design space and incorporate them to create the Baldur network, which achieves superior results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. THE TRANSISTOR LASER TECHNOLOGY</head><p>In this section, we provide background information on the transistor laser (TL), the key technology enabler for Baldur. The TL device is an InGaP/GaAs heterojunction bipolar transistor (HBT) with the addition of quantum-wells for photon generation and optical cavity for coherent light output. As shown in Figure <ref type="figure">2</ref>(a), a TL has three electrical ports and two optical ports. It can act as a transistor, a direct-modulated laser, and a photodetector, depending on various current and voltage bias conditions. When an electrical current is applied to the base, a proportional electrical current is generated at the collector port, like a transistor. If the collector-emitter voltage is higher than a threshold, a coherent optical output is also generated and the TL works as a laser. When the base-collector junction sees an optical input, a photocurrent is generated and the TL functions as a photodetector.</p><p>Table III (excerpt of device and circuit parameters): voltage supplies: 1.32 V (+V1), 0.6 V (+V2); load resistor: 5 ohm; base current modulation: 0.2 mA; collector tunneling modulation: 17 &#956;A; PD junction capacitance: 100 fF; average PD current: 0.1 mA. Table notes: (1) The TLs can produce different wavelengths by using materials with different bandgaps to support wavelength division multiplexing (WDM). (2) These values are on par with micro-cavity VCSELs <ref type="bibr">[30]</ref>.</p><p>Individual high-speed and energy-efficient TL devices have been fabricated <ref type="bibr">[31]</ref>, <ref type="bibr">[32]</ref>. In this paper, we focus on using the TL devices to construct optical logic gates, i.e., logic gates whose inputs and outputs are all optical signals. Figure <ref type="figure">2</ref>(b) shows the schematic of a TL inverter gate. Here, TL1 functions as a photodetector and a pull-down current source, and TL2 functions as a laser that provides optical output. 
TL1 generates photocurrent in response to its optical input and lowers the voltage at the base terminal of TL2, thereby modulating TL2's optical output. This basic TL inverter gate structure is the basis for logic gates with multiple inputs. For example, by adding photodetectors either in parallel or in series in the pull-down branch, NOR and NAND gates can be constructed. Similarly, replacing the left branch of the TL inverter with a pull-up source allows the construction of AND and OR gates. Moreover, the optical power of a signal will not deteriorate as it passes through the TL gate because the TL gate restores signal strength. In addition to logic gates, an optical latch can be constructed using two cross-coupled TL NOR gates <ref type="bibr">[10]</ref>.</p><p>The preliminary round of TL gate fabrication has been successfully completed <ref type="bibr">[11]</ref>. We are currently targeting a TL technology node which may be fabricated with reasonable cost and high yield rate in the near future, and the detailed simulation results (obtained using the Keysight Advanced Design System (ADS) software) are reported below. We are in the process of optimizing the fabrication flow of the TL gates to validate these simulation results, as well as scaling the TL technology further to continue to improve latency/power.</p><p>Table <ref type="table">III</ref> shows the device and circuit parameters used in the simulation of a TL inverter gate, and the simulation results are summarized in Table <ref type="table">IV</ref>. An interesting property of TL logic gates is that the same power and speed results apply to multi-input NAND, NOR, AND, and OR gates as well. This is because, although a multi-input TL gate requires multiple TLs at the optical input, only one TL is required at the output, and the TL at the output is the speed/power-limiting element. 
The power consumption and speed will be the same as long as the average photocurrent is maintained at the same level across different gates, regardless of the number of inputs. Therefore, all TL logic gates consume 0.406 mW at 60 Gbps<ref type="foot">foot_1</ref>, or 6.77 fJ/bit (a TL latch consists of 2 cross-coupled TL NOR gates <ref type="bibr">[10]</ref>, so it consumes double the power). However, additional inputs impose additional waveguide routing and coupling complexity. Therefore, in our design we limit the number of inputs to no more than 2.</p><p>We also include the simulated eye diagram of a TL inverter gate operating at 60 Gbps in Figure <ref type="figure">2</ref>(c), which shows sufficient eye opening that indicates good signal integrity and reliable operation.</p><p>By connecting TL logic gates using waveguides, optical circuits that perform various logic functions can be constructed. In addition to waveguides, the following passive components are also used in our TL switch design: (1) splitters to split one optical signal into multiple signals <ref type="bibr">[33]</ref>, <ref type="bibr">[34]</ref>, providing circuit fanouts; (2) combiners to combine multiple optical signals into one <ref type="bibr">[34]</ref>; (3) waveguide delay elements to delay the propagation of optical signals for a short amount of time on the order of 100 ps <ref type="bibr">[35]</ref>, <ref type="bibr">[36]</ref>.</p><p>To fabricate TL chips, since the TL's epitaxial structure is very similar to GaAs HBTs with the addition of a Distributed Bragg Reflector (DBR), the standard HBT process flow on GaAs substrates may be adopted with minimal modifications, which include: (1) forming an oxide aperture to provide optical and current confinement <ref type="bibr">[37]</ref>, <ref type="bibr">[38]</ref>; (2) using an Inductively-Coupled Plasma (ICP) etch to form mesas in the semiconductor DBR region (similar to how DBRs are fabricated in VCSELs). 
This is how our TL gate prototypes were processed. To form even larger circuits, TL chips can be bonded onto an optical interposer where waveguides and all passive circuit elements reside through hybrid integration <ref type="bibr">[39]</ref>, similar to <ref type="bibr">[40]</ref>.</p></div>
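<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a quick check of the per-bit energy quoted in this section, the following sketch divides the simulated gate power by the data rate. The constants are taken from the text; the function and variable names are illustrative and not part of the original work.</p>
<p><![CDATA[
```python
POWER_W = 0.406e-3      # simulated TL gate power at 60 Gbps (value from the text)
DATA_RATE_BPS = 60e9    # 60 Gbps


def energy_per_bit_fj(power_w: float, rate_bps: float) -> float:
    """Energy per bit in femtojoules: power divided by bit rate."""
    return power_w / rate_bps * 1e15


gate_fj = energy_per_bit_fj(POWER_W, DATA_RATE_BPS)
# A TL latch is two cross-coupled NOR gates, so its power is doubled.
latch_power_mw = 2 * POWER_W * 1e3

print(f"{gate_fj:.2f} fJ/bit")        # 6.77 fJ/bit
print(f"{latch_power_mw:.3f} mW")     # 0.812 mW
```
]]></p></div>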
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. THE ALL-OPTICAL BALDUR NETWORK</head><p>In this section, we present the details of our Baldur network<ref type="foot">foot_2</ref>, which consists of 2x2 TL switches to implement a bufferless and clock-less multi-butterfly topology. We expect Baldur to achieve similar results with other multi-stage topologies (e.g., Benes <ref type="bibr">[41]</ref>, Omega <ref type="bibr">[42]</ref>, etc.) because many multi-stage networks are largely isomorphic <ref type="bibr">[43]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. TL Switch Design Challenges</head><p>The basic building blocks of Baldur are a group of 2x2 optical switches. Using the TL to design these switches presents unique challenges and thus requires new approaches that are fundamentally different from synchronous electrical logic design. First, typical data encoding schemes such as 8b/10b and 64b/66b <ref type="bibr">[44]</ref> require clock and data recovery (CDR) units to first recover the clock signal that is used to generate the packet data, and then recover the actual packet data using each cycle of the recovered clock as a delimiter for each bit. CDR units cannot be efficiently implemented using TL gates, since the power of electrical CDR units is already very high and would increase further (by &gt;100X) due to the high power of TL gates at the current technology node. The only other option, which is to use electrical CDR units to recover optical data, undermines the benefits of all-optical switching. Therefore, a new way to decode data without a clock is required. Second, the lack of an optical clock as well as clocked sequential elements means that the TL switch design must be asynchronous.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Our Clock-less Data Encoding/Decoding Approach</head><p>To address the challenges discussed in the previous section, we developed a variant of the Digital Pulse Interval Width Modulation (DPIWM) scheme <ref type="bibr">[45]</ref>, <ref type="bibr">[46]</ref> to perform length-based encoding, as it allows data decoding without a clock. Note that our length-based encoding scheme only needs to be applied to the routing bits in the packet, as shown in Figure <ref type="figure">3</ref>. In Baldur, there is 1 routing bit for each stage of the multi-butterfly network. All routing bits are represented by the presence of light, where binary values are encoded into the lengths of optical signals: logic "0" is encoded as two bit periods (2T) and logic "1" is encoded as one bit period (T), where T is the clock cycle time. Moreover, the routing bits are separated by "gap periods" represented by the absence of light, such that the total length of each routing bit plus the following gap period is 3T. This specific uniform length of each routing bit is derived based on our analysis to simplify design logic and also enhance circuit reliability.</p><p>Our length-based encoding scheme introduces minimal bandwidth overhead. For example, if there are 8 routing bits and the rest of the packet is 512 bytes, this scheme would introduce 0.34% overhead compared to the typical 8b/10b encoding.</p></div>
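<div xmlns="http://www.tei-c.org/ns/1.0"><p>The length-based encoding rule described in this subsection can be sketched as a small function that maps routing bits to a light/dark sequence. This is a minimal illustration under the stated rule (a "0" is 2T of light, a "1" is 1T of light, and each routing bit plus its gap occupies 3T); the function name and the sampled list representation are assumptions made for the example, not the authors' implementation.</p>
<p><![CDATA[
```python
def encode_routing_bits(bits: str, samples_per_t: int = 1) -> list[int]:
    """Encode routing bits as light levels (1 = light, 0 = dark).

    Each routing bit occupies a 3T slot: logic "0" is 2T of light followed by
    a 1T gap, and logic "1" is 1T of light followed by a 2T gap.
    The output holds samples_per_t samples per bit period T.
    """
    slot_t = 3
    wave: list[int] = []
    for b in bits:
        if b not in "01":
            raise ValueError(f"invalid routing bit: {b!r}")
        light_t = 2 if b == "0" else 1
        wave += [1] * (light_t * samples_per_t)
        wave += [0] * ((slot_t - light_t) * samples_per_t)
    return wave


print(encode_routing_bits("01"))             # [1, 1, 0, 1, 0, 0]
print(len(encode_routing_bits("01101001")))  # 24
```
]]></p>
<p>With one sample per T, eight routing bits occupy 24T in total, since every bit and its gap together take 3T.</p></div>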
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Details of the TL Switch Design</head><p>Given network packets whose routing bits are encoded using the length-based scheme, our all-optical TL switch is responsible for obtaining the designated output port of the packet, and either routing the packet to the correct output port or dropping it if the network is congested. Figure <ref type="figure">4(a)</ref> shows the block diagram of the switch, which includes two main components: the switch fabric and the header processing unit.</p><p>In the switch fabric, an input packet is first split into two by an optical splitter (SP0 or SP1): one is sent to the header processing unit to obtain the packet's destination and arbitrate for the corresponding output port, and the other is connected to TL gate AND0 or AND1. Each of the AND gates modifies an incoming packet by AND'ing it with the output of a mask off latch from the header processing unit, so that the first routing bit is masked off. As a result, the next routing bit becomes the first bit at the next stage -in other words, the first routing bit always specifies the destination in the current stage. Masking the routing bits this way simplifies switching logic and enables modular design.</p><p>The output of AND0/AND1 is delayed using waveguide delays WD0/WD1 until the arbitration for the designated output port for the corresponding packet is finished. Based on our simulation experiments, the delay of WD0 and WD1 is set to 132 ps. The outputs of WD0 and WD1 are then split again using SP2 and SP3, and connected to optical AND gates AND2-AND5 so they can reach either output port depending on the routing decision from the header processing unit. Finally, the outputs of AND2 and AND3 are combined using optical combiner C0, which performs the OR function, to form a multiplexer. Similarly, AND4, AND5, and C1 together form another multiplexer. 
Four grant signals from the header processing unit, one for each input/output combination, act as the select signals for the two multiplexers to forward or drop packets according to arbitration results.</p><p>The header processing unit is composed of control latches, line activity detectors, and arbitration units.</p><p>Control Latches: There are three types of control latches in our design. A routing latch stores the decoded routing bit of an incoming packet, and a corresponding valid latch indicates if the bit stored in the routing latch is currently valid. Moreover, a mask off latch outputs a signal to the switch fabric to mask off the first routing bit in the packet, as discussed above. The values of all these latches are driven by the line activity detectors.</p><p>Line Activity Detector: The line activity detector (Figure <ref type="figure">4</ref>(b)) serves two main functions.</p><p>First, it detects the beginning and the end of each packet by continuously detecting the presence of light. In the current design, we assume that 8b/10b encoding is used for the non-routing bits of a packet, so they cannot contain more than 5 consecutive 0's<ref type="foot">3</ref>. Given this constraint and to achieve a low error probability of 10&#8315;&#8313; in the presence of variations and timing jitters (see Sec. IV-F), in Baldur we specify that the absence of light for a period longer than 6T means that no packet is currently in flight. To detect the presence of light, the input signal is split into multiple paths that connect to multiple delay elements with different delay values, and the outputs of all delay elements along with the original input are connected using an optical combiner as shown in Figure <ref type="figure">4(b)</ref>. 
The output of the combiner is "1" if and only if at least one of its inputs is "1"; therefore, it becomes "1" as soon as the presence of light is detected at the beginning of a packet, and stays "1" until the end of the packet, which is a 6T time period after the falling edge of the last "1" bit in the packet. The "0" to "1" and "1" to "0" transitions in the combiner output are used to signify the beginning and the end of a packet, respectively. To detect these transitions, we delay the combiner output by a short amount of time (0.5T in our implementation), and compare the delayed signal with the original signal: if the delayed signal is "0" and the original signal is "1", then there is a "0" to "1" transition; and vice versa for a "1" to "0" transition.</p><p>Both the valid and mask off latches are set 2.5T after the beginning of a packet, and are reset at the end of the packet<ref type="foot">4</ref>. This way, their values become "1" during the gap period that corresponds to the first routing bit, and remain "1" until the end of the packet. This is the intended behavior since the routing bit is indeed valid during the entire duration when the valid bit is "1". Moreover, the mask off bit is "1" before the beginning of the second bit, so only the first bit of the packet is masked off.</p><p>Second, the line activity detector decodes the routing bit of its input packet for the current stage (i.e., the first bit of the packet) and stores it in a routing latch. Figure <ref type="figure">3</ref> illustrates how the first bit of a packet is obtained in our design: we delay the input signal by 1.3T and measure the delayed signal at the falling edge of the first bit (i.e., before its gap period). If the value is 1, it means that the length of the first bit is 2T (corresponding to a "0"); otherwise the length is T (corresponding to a "1").</p><p>Arbitration Unit: After the routing bit of a packet is stored in the routing latch and the valid bit becomes "1", the packet is ready to participate in the arbitration process, and the arbitration result (i.e., the grant signals) is sent to the switch fabric to forward the packet to its designated output port if the packet wins the arbitration (the packet will be dropped otherwise). The design of our arbitration unit is a 2x2 asynchronous arbiter built using a latch and two threshold NOT gates, similar to <ref type="bibr">[47]</ref>. It guarantees that for each output, at most one packet wins the arbitration at any given time.</p></div>
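<div xmlns="http://www.tei-c.org/ns/1.0"><p>The clock-less decoding step of the line activity detector (delay the input by 1.3T, then sample the delayed signal at the falling edge of the first routing bit) can be illustrated with an idealized timing model. The sketch below assumes perfectly rectangular pulses with no jitter and measures time in units of T; the function names are hypothetical, and the model is a behavioral illustration rather than the TL circuit itself.</p>
<p><![CDATA[
```python
def light_at(pulse_len_t: float, t: float) -> int:
    """Level of the first routing pulse at time t (units of T).

    The pulse is assumed to be an ideal rectangle: light from t = 0
    up to (but not including) t = pulse_len_t.
    """
    return 1 if 0.0 <= t < pulse_len_t else 0


def decode_first_bit(pulse_len_t: float, delay_t: float = 1.3) -> str:
    """Decode the first routing bit without a clock.

    The delayed copy of the input is sampled at the falling edge of the
    original pulse. If the delayed copy is still lit, the pulse was the
    long one (2T, logic "0"); otherwise it was the short one (T, logic "1").
    """
    falling_edge = pulse_len_t
    delayed_level = light_at(pulse_len_t, falling_edge - delay_t)
    return "0" if delayed_level == 1 else "1"


print(decode_first_bit(2.0))   # 0
print(decode_first_bit(1.0))   # 1
```
]]></p>
<p>In this idealized model the 1.3T delay acts as the decision threshold between the two pulse lengths, so a nominal T pulse can stretch somewhat and a nominal 2T pulse can shrink somewhat before the decoded value changes.</p></div>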
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Circuit Simulation Results</head><p>We model TL gates using the results reported in Table <ref type="table">IV</ref>, and simulate the 2x2 TL switch using Synopsys's HSPICE tool. Figure <ref type="figure">5</ref> depicts the simulation waveform for a scenario where the designated output port is available when a packet arrives at the input. As shown in the waveform, the routing bit is stored in the routing latch before the falling edge of the routing bit. Also, both the valid and mask off bits become "1" during the gap period of the first routing bit and stay "1" until the end of the packet. Therefore, the packet traverses from its input to the designated output port as expected. The waveform also shows that the first routing bit in the packet is correctly masked off before the packet arrives at the switch output.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. Minimizing and Handling Packet Drops</head><p>We enhance our design by increasing path multiplicity and randomness to decrease the number of packet drops substantially. A multiplicity of m in a 2x2 switch is achieved by having 2m output ports (m output ports per output direction) and 2m input ports (m input ports per input direction). Thus, a multiplicity greater than 1 provides more than one path to deliver a packet to the designated output direction. To implement a 2x2 switch with a multiplicity of m, we augment the TL switch with a multiplicity of 1 (as shown in Figure <ref type="figure">4</ref>) such that: (1) 2m input packets can be processed independently (which is achieved by increasing the number of paths in the switch fabric and the number of control latches, line activity detectors, and arbitration units in the header processing unit); and (2) an input packet can be transmitted successfully as long as at least one of the m paths is available (which is achieved by checking the availability of each path sequentially in the arbitration units).</p><p>We evaluate the design complexity, latency, and packet drop rate for Baldur with different path multiplicity values in a 1,024-node network, and the results are shown in Table <ref type="table">V</ref> (drop rates are obtained using the CODES simulator for the transpose traffic pattern under 0.7 input load; see methodology details in Sec. V-A). We can see that, even with a high multiplicity of 4 or 5, the latency/area/power costs are still minimal, but the packet drop rate is reduced substantially. Therefore, incorporating path multiplicity results in a good trade-off for Baldur. 
Moreover, if we view a radix-2 multistage network as a sorting network that narrows a packet's possible destination by a factor of 2 at each stage, the effectiveness of multiplicity can be improved by connecting the switch outputs that belong to the same direction to random switches in the appropriate sorting group in the next stage. This randomization leads to a theoretical property called "expansion" <ref type="bibr">[14]</ref>. Networks with expansion, including Baldur, are immune to worst-case permutations <ref type="bibr">[19]</ref>.</p><p>For large-scale networks, detailed network simulations cannot be easily performed; therefore, we determine the multiplicity value required for Baldur to achieve a low (less than 1%) packet drop rate as follows: given a network scale (i.e., the number of server nodes), we consider the worst-case scenario where one packet per server node is injected to the network and all the packets arrive at the first stage of the network at the same time, and simulate this scenario using an in-house tool to obtain the packet drop rate corresponding to different multiplicity values for various traffic patterns. Our results show that multiplicity of 4 is required for a 1,024-node network (which is consistent with our detailed simulation results reported in Table <ref type="table">V</ref>), and multiplicity of 5 is sufficient for networks with over 1 million nodes.</p><p>In addition to incorporating multiplicity, we also implement the binary exponential backoff (BEB) mechanism <ref type="bibr">[48]</ref> to throttle the transmitter nodes when the network starts to experience an increasing number of packet drops to further reduce drop rate.</p><p>In the rare cases that packets are dropped, packet retransmissions are handled by the server nodes. When a transmitter node sends a packet, it waits for an ACK from the receiver. 
If it does not receive the ACK after a timeout period based on its own local timer (because either the packet itself or the ACK is dropped), the transmitter node re-transmits the packet. To support packet re-transmission, a small buffer is required for each node to hold all outgoing packets that have not been ACK'ed. Based on our simulation results (details in Sec. V), since the packet drop rate is less than 1%, a buffer of 536 KB per node is sufficient for a wide variety of traffic patterns under a heavy input load of 0.7, and we use 1 MB buffers in our design to allow abundant margins. This re-transmission mechanism may be implemented in either software or hardware, depending on the protocols supported at the server nodes.</p></div>
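<div xmlns="http://www.tei-c.org/ns/1.0"><p>The path-multiplicity rule of this section (a packet is forwarded as long as at least one of the m paths toward its output direction is free, and is dropped otherwise) can be sketched for a single switch as follows. This is a simplified behavioral model under stated assumptions: packets are considered in arrival order, every forwarded packet holds its path for the whole interval considered, and the name forward_and_drop is hypothetical. It does not model the asynchronous arbiter circuit or the multi-stage network.</p>
<p><code lang="python"><![CDATA[
```python
from collections import Counter


def forward_and_drop(requests, m):
    """Model one 2x2 switch with path multiplicity m.

    requests: output direction (0 or 1) requested by each arriving packet,
              listed in arrival order.
    m:        number of output ports (paths) per output direction.

    Returns (forwarded, dropped): two lists of packet indices.
    """
    if m < 1:
        raise ValueError("multiplicity must be at least 1")
    busy_paths = Counter()  # paths currently in use, per direction
    forwarded, dropped = [], []
    for index, direction in enumerate(requests):
        if direction not in (0, 1):
            raise ValueError("direction must be 0 or 1")
        if busy_paths[direction] < m:
            busy_paths[direction] += 1  # the packet wins a free path
            forwarded.append(index)
        else:
            dropped.append(index)  # all m paths are taken: packet is dropped
    return forwarded, dropped


if __name__ == "__main__":
    # Three packets contend for direction 0 and one goes to direction 1.
    for m in (1, 2, 3):
        print(m, forward_and_drop([0, 0, 1, 0], m))
```
]]></code></p>
<p>In this toy case, raising m from 1 to 3 removes every drop, which mirrors the qualitative trend discussed above; the actual drop rates depend on the full network and traffic pattern.</p></div>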
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F. Reliability of Baldur</head><p>Reliability is a critical consideration in network switches. In TL switches, optical signal timing and amplitude are two major factors that can affect the correctness of the switch. However, since a TL gate naturally restores signal strength as discussed in Sec. III, we focus on validating that TL switches can achieve reliable operations in the presence of timing variations and jitter.</p><p>Specifically, we consider 10% variation in the delay and rise/fall time of TL gates and 1 ps variation in the waveguide delay elements, and manually verify that, in the presence of these variations, our design can tolerate up to 0.42T change (in either direction) in the bit length of any routing bit. Given this result, if we model timing jitter (in ps) at each transition of the packet signal as a random variable that follows a Gaussian distribution with &#956; = 0 <ref type="bibr">[49]</ref> and a variance of 1.53, the major error scenarios in the TL switch design, which are listed below, would only occur at a low error probability of $10^{-9}$: (1) the length of a routing bit was originally 2T (T), but it is incorrectly stored as T (2T); (2) the valid bit becomes high (low) when the routing bit is invalid (valid); (3) the mask off bit is latched incorrectly; (4) the line activity detector fails to detect the presence/absence of network packets correctly. This analysis shows that our TL switch design is equipped with adequate design margins to perform reliable operations.</p><p>However, in the case that an error is detected in Baldur, diagnosis support is provided so that the error can be isolated to a single 2x2 TL switch, and appropriate repair actions can be taken. In Baldur with multiplicity of 1, the path that each packet traverses to reach its destination is deterministic. Thus, a faulty switch can be identified using a sufficiently large number of packets. 
In Baldur with multiplicity of greater than 1, each switch can be configured such that only one output port is enabled at a time to achieve deterministic routing (so the testing procedure is similar to that for Baldur with multiplicity of 1). This is done by incorporating additional test signals (driven by the server nodes) in Baldur to block all output ports except one in all of the 2x2 TL switches.</p></div>
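<div xmlns="http://www.tei-c.org/ns/1.0"><p>The jitter model used in the reliability analysis (zero-mean Gaussian jitter with a variance of 1.53, in ps) lends itself to a short tail-probability calculation. The sketch below is an illustration only: it computes the probability that a single jitter sample exceeds a given tolerance in either direction, takes the bit period T as a caller-supplied parameter because its value is not stated in this section, and uses hypothetical function names. It does not reproduce the paper's full analysis of the four error scenarios.</p>
<p><code lang="python"><![CDATA[
```python
import math


def violation_probability(tolerance_ps, variance_ps2=1.53):
    """P(|jitter| > tolerance) for zero-mean Gaussian jitter.

    tolerance_ps: allowed timing deviation in ps (either direction).
    variance_ps2: jitter variance in ps^2 (1.53 in the text).
    """
    if tolerance_ps < 0:
        raise ValueError("tolerance must be non-negative")
    sigma = math.sqrt(variance_ps2)
    # Two-sided Gaussian tail: P(|X| > a) = erfc(a / (sigma * sqrt(2))).
    return math.erfc(tolerance_ps / (sigma * math.sqrt(2.0)))


def routing_bit_tolerance_ps(bit_period_ps, fraction=0.42):
    """Tolerated change in routing-bit length: 0.42T in the text.

    bit_period_ps is an assumption supplied by the caller.
    """
    return fraction * bit_period_ps


if __name__ == "__main__":
    for t_ps in (10.0, 15.0, 20.0):  # hypothetical values of T
        tol = routing_bit_tolerance_ps(t_ps)
        print(f"T = {t_ps} ps, tolerance = {tol:.2f} ps, "
              f"P = {violation_probability(tol):.3e}")
```
]]></code></p>
<p>Because the tail probability falls off very quickly with the ratio of tolerance to standard deviation, small changes in the assumed T change the result by orders of magnitude.</p></div>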
<div xmlns="http://www.tei-c.org/ns/1.0"><head>G. Physical Link Interfaces and Packaging</head><p>The Baldur network is physically built using a 2D array of optical interposers <ref type="bibr">[40]</ref> integrated on one or more printed circuit boards (PCBs) as shown in Figure <ref type="figure">1</ref>. To simplify the connectivity requirements between different interposers, we constrain each interposer column to contain only 1 stage of the multi-butterfly network. Connections inside each interposer are formed using waveguides, and interposers in adjacent columns are connected using fiber array units (FAUs) (e.g., <ref type="bibr">[50]</ref>). To connect server nodes to the Baldur network, optical fibers with LC connectors can be used. Since there are a large number of input/output ports in Baldur, a fiber management system is included, where rack-mount fiber enclosures and cassettes (RFEC) <ref type="bibr">[51]</ref> are responsible for connecting the fibers from/to server nodes with dense FAUs, which are then coupled into the first/last column of the optical interposers. At the server nodes, either existing optical transceivers or TL-based lasers/photodetectors may be used, as long as the wavelength supported is consistent for the whole network.</p><p>The number of cabinets required to hold the entire Baldur network is 1 for the 1,024-node scale, and 752 (which is 3.2% of the total number of cabinets) for the 1M-node scale, assuming that the standard 60.96 cm x 45.72 cm PCBs, 32 mm x 10 mm interposers, and commercially-available FAUs, fiber management systems, and cabinets are used. These results are calculated under both fiber pitch (127 &#956;m <ref type="bibr">[50]</ref>) and power/thermal (no more than 85 kW per cabinet <ref type="bibr">[1]</ref>) constraints, with the fiber pitch being the limiting factor (e.g., if the 85 kW peak power per cabinet is the only constraint, then only 176 cabinets are needed for the 1M-node scale). 
Fiber pitch may decrease in the future (as this is an active research area <ref type="bibr">[52]</ref>), which will in turn reduce the hardware requirements for Baldur. Moreover, note that the TL gates occupy only a very small portion of the interposer area (e.g., &lt;10% for a 1,024-node network with multiplicity of 4), leaving abundant space for waveguides and other passive optical elements. This result suggests that the area of Baldur is insensitive to the area of TL gates.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. PERFORMANCE EVALUATION</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Methodology</head><p>We evaluate the performance of a 1,024-node Baldur network with multiplicity of 4 (which achieves a packet drop rate of &lt;1% as discussed in Sec. IV-E) using the CODES toolkit to perform packet-level network simulations <ref type="bibr">[53]</ref>. In our evaluation, we quantitatively compare Baldur with the following representative networks: (1) an electrical multi-butterfly network with multiplicity of 4; (2) a dragonfly network constructed using the most optimized architecture recommended in <ref type="bibr">[16]</ref>; (3) a 3-level fat-tree network with full bisection bandwidth as proposed in <ref type="bibr">[17]</ref>; (4) an ideal network with infinite bandwidth and a flat packet latency of 200 ns. Our simulation configurations and parameters are summarized in Table <ref type="table">VI</ref>.</p><p>Table VI (excerpt): Dragonfly uses adaptive routing <ref type="bibr">[16]</ref>, a switch latency of 90 ns <ref type="bibr">[54]</ref>, and link delays of 10 ns (intra-group) and 100 ns (inter-group). Fat-tree (24 KB buffer per port, 3 virtual channels) uses adaptive routing <ref type="bibr">[55]</ref>, a switch latency of 90 ns <ref type="bibr">[54]</ref>, and link delays of 10 ns (Level 1), 50 ns (Level 2), and 100 ns (Level 3). The ideal network has infinite bandwidth and a packet latency of 200 ns.</p><p>In terms of workloads, we simulate a wide variety of synthetic traffic patterns as well as real HPC workload traces:</p><p>&#8226; Random Permutation: server nodes are paired for packet transmission using a random permutation. &#8226; Transpose: a server node with binary address $a_n a_{n-1} \ldots a_{n/2} a_{n/2-1} \ldots a_1 a_0$ sends packets to the server node with address $a_{n/2-1} \ldots a_1 a_0 a_n a_{n-1} \ldots a_{n/2}$.</p><p>&#8226; Bisection: half of the server nodes are paired with the other half for packet transmission using a random permutation. &#8226; Group Permutation: first, in dragonfly, the groups are paired using a random permutation, and each server node sends packets to another randomly chosen node in the partner group. Then, the same transmitter/receiver node pairs are applied to all other networks.</p><p>&#8226; Hotspot: all server nodes send packets to one specific destination node. &#8226; Ping Pong1: server nodes are randomly paired. Each node sends a packet to its partner node, waits until it receives a packet from the partner node, and then immediately sends the next packet. &#8226; Ping Pong2: similar to ping pong1 but with a different pairing algorithm. First, in dragonfly, the nodes from one group (e.g., group A) are paired with nodes from another group (e.g., group B). Then, the same transmitter/receiver node pairs are applied to all other networks.</p><p>&#8226; Four HPC workloads from the Design Forward project <ref type="bibr">[56]</ref>. The communication traces are collected using the DUMPI framework <ref type="bibr">[57]</ref>.</p><p>In our simulation experiments, we vary the input load, which is defined as the percentage of time that the transmitter is busy transmitting packets. For each synthetic traffic pattern and input load value, each server node injects 10,000 packets into the network. 
For all synthetic traffic patterns except the two ping pong patterns (in which packets are sent back-and-forth sequentially between paired nodes without any idle periods), the time interval between two consecutive packets is determined based on an exponential distribution, and the mean value of the time interval is defined as:</p><p>packet size is set to 512 bytes <ref type="bibr">[53]</ref> and link data rate is set to 25 Gbps (which is the maximum data rate per lane in current standards) in our experiments.</p></div>
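<div xmlns="http://www.tei-c.org/ns/1.0"><p>The transpose traffic pattern defined above can be sketched as a bit manipulation on node addresses. The sketch assumes an even number of address bits so that the upper and lower halves have equal width (10 bits for a 1,024-node network); the function name transpose_destination is hypothetical and not part of CODES.</p>
<p><code lang="python"><![CDATA[
```python
def transpose_destination(src, address_bits):
    """Return the transpose-pattern destination of node `src`.

    The upper and lower halves of the binary address are swapped.
    address_bits must be even (an assumption of this sketch).
    """
    if address_bits <= 0 or address_bits % 2 != 0:
        raise ValueError("address_bits must be a positive even number")
    if not 0 <= src < (1 << address_bits):
        raise ValueError("src does not fit in address_bits")
    half = address_bits // 2
    low = src & ((1 << half) - 1)  # lower half of the address
    high = src >> half             # upper half of the address
    return (low << half) | high    # lower half moves to the top


if __name__ == "__main__":
    bits = 10  # 1,024 nodes
    for node in (1, 32, 0b1111100000):
        dest = transpose_destination(node, bits)
        print(f"{node:0{bits}b} -> {dest:0{bits}b}")
```
]]></code></p>
<p>Applying the mapping twice returns the original address, so the pattern pairs each node with exactly one partner (nodes whose two halves are equal map to themselves).</p></div>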
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Performance Results</head><p>In this section, we present the network performance results obtained from our simulation experiments. Note that the results for Baldur account for all latency overheads due to packet drops and re-transmissions.</p><p>Our results are depicted in Figures <ref type="figure">6</ref> and <ref type="figure">7</ref>. In Figure <ref type="figure">7</ref>, the input load is 0.7 for hotspot (latency results are similar for other load values) and is not applicable to the other workloads. Figure <ref type="figure">6</ref> shows the average and tail latency results for various networks running the random permutation, transpose, bisection, and group permutation synthetic traffic patterns. The advantage of the multi-butterfly topology is clear: both Baldur and electrical multi-butterfly achieve much better average and tail latency results, and they saturate at higher input loads compared to dragonfly and fat-tree. For input loads that are less than or equal to 0.7, Baldur achieves the lowest average (tail) latency among all networks; for example, its latency is lower by 1.9-6.3X (1.5-4.2X) vs. fat-tree, 1000-3000X (1000-2000X) vs. dragonfly, and 2.2-4.3X (1.3-2.1X) vs. electrical multi-butterfly when the load is 0.7. Although they have the same topology, Baldur outperforms electrical multi-butterfly because packet processing/switching in Baldur is performed in the ultra-fast optical domain. When the input load is high (&gt;0.7), Baldur still achieves the best results in almost all cases. The exception is that the tail latency of electrical multi-butterfly is lower than that of Baldur in some of the traffic patterns, because in Baldur some packets are dropped and re-transmitted. Note that, although electrical multi-butterfly may achieve good performance results, it is highly expensive and thus impractical at large scales (see Sec. VI).</p><p>When comparing Baldur with the ideal network, Baldur's average packet latency is only 1.7X-3.4X higher. The difference is mostly due to packet drops. 
Moreover, Baldur's total switch latency (1.5 ns per stage) is much smaller than the total link latency (200 ns or 100 ns per input/output link, which is equal to the packet latency in the ideal network). Thus, we expect that Baldur's packet latency will approach the latency of the ideal network as path multiplicity increases, because packet drop rate will decrease as a result.</p><p>In Figure <ref type="figure">7</ref>, we show the average and tail latency results for the rest of the workloads: hotspot, ping pong1, ping pong2, and the HPC workloads. Baldur achieves the best performance results for all synthetic traffic patterns; for example, the geometric mean of Baldur's average (tail) latency is 3.4X-4.1X (3.6X-5.9X) lower than that of other networks. For hotspot, which results in high network congestion, path multiplicity in both Baldur and electrical multi-butterfly effectively alleviates the congestion, thereby achieving better latency results than dragonfly and fat-tree. For ping pong1 and ping pong2, where the impact of packet latency on the overall network performance is emphasized by serialization dependency between server nodes, we see that Baldur achieves significantly better results due to its ultra-low switch latency, while the performance of all other networks is limited due to slow electrical header processing and switching. Dragonfly's packet latency values are particularly high for ping pong2, because ping pong2 is constructed in a way that forces dragonfly to experience high levels of congestion between two groups, where the network bandwidth is limited.</p><p>For the HPC workloads, Baldur also achieves the best results: the geometric mean of its average (tail) latency is 2.6X-9.1X (3.1X-11.6X) better compared to other networks. Moreover, both dragonfly and fat-tree suffer from very high average latency in certain workloads. For example, in FB, the average (tail) latency of dragonfly/fat-tree is 23.5X/46.1X (22.1X/39.7X) higher than Baldur. 
These results suggest that, unlike Baldur, dragonfly and fat-tree are not immune to worst-case traffic patterns.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. POWER, COST, AND SCALABILITY ANALYSIS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Power Analysis and Results</head><p>In this section, we report the total power of Baldur, and compare it with the power of dragonfly, fat-tree, and electrical multi-butterfly.</p><p>1) Methodology: We estimate the power of a network by summing up the power of the following major components:</p><p>1. Optical transceivers. The power consumption of Cisco's SFP28 modules <ref type="bibr">[58]</ref> (1.5 W) is used in our calculations for each optical transceiver.</p><p>2. SerDes units. We use the power number reported in <ref type="bibr">[59]</ref>, which is 0.693 W, for each SerDes unit.</p><p>3. Re-transmission buffers, which consume 0.741 W (for 1 MB) per node <ref type="bibr">[60]</ref> and are only required in Baldur (assuming that packet re-transmission is handled in hardware).</p><p>4. Network switches. For Baldur, the power of each TL switch is the product of the total number of TL gates per switch and the power of a TL gate (0.406 mW as reported in Sec. III). For other networks, we use ORION 3.0 <ref type="bibr">[61]</ref> and Cacti 6.5 <ref type="bibr">[62]</ref> to obtain the power associated with virtual channel allocation, switch allocation, crossbar, clocking, and buffering. Note that the results for dragonfly and fat-tree are optimistic since the power cost of the adaptive routing logic is not included.</p><p>To see how total network power changes as network size increases, we scale the number of servers from 1,024 to over 1M (where 1M is the expected scale for exascale computing). Our methodology ensures that all networks are optimized for a given scale. For example, in dragonfly, starting from ~83K server nodes it is no longer efficient to use electrical links for intra-group connections (because of the long distance between the switches within a group). 
Therefore, we report power numbers assuming that optical links are also used for intra-group connections for network scales greater than or equal to 83K. Scaling dragonfly this way results in far less power (e.g., by 23.45X) than keeping the group size small but increasing the number of groups for a network with ~1.3M nodes.</p><p>Note that the number of server nodes in each scale is not exactly the same in different networks due to the particular construction of the different topologies (for example, the number of nodes is always a power of two in Baldur and multi-butterfly, but this is not the case in dragonfly and fat-tree). Thus, when we present the power results, at each scale we show a range that includes the number of servers in all the networks.</p><p>2) Results: Figure <ref type="figure">8</ref> shows the total power consumption per server node in different network scales. Baldur achieves the best scalability, since its power consumption (per server) at the 1M-1.4M scale is only increased by 1.7X compared to the 1K-2K scale. On the other hand, between these two network scales, the power per server node is increased by 7.8X, 9.0X, and 2.0X, respectively, for dragonfly, fat-tree, and electrical multi-butterfly. The large increase in power for dragonfly and fat-tree limits the scalability of these two networks, and is a result of the significantly larger switch radix, which changes from 16 to 96 and 16 to 160, respectively. Electrical multi-butterfly shows the smallest increase in power among the three electrical networks. Baldur consumes less power than all other networks at all scales. 
In particular, although the number of switches is the same for both Baldur and electrical multi-butterfly, each switch in Baldur consumes significantly less power (e.g., 96.6X less at the 1K-2K scale), which is due to various optimizations including simpler control logic, and the elimination of optical transceivers, SerDes units, packet buffering, clocking, and so on. Moreover, as we scale the network, in general the power benefit of Baldur over other networks increases: Baldur reduces network power by 3.2X-26.4X at the 1K-2K scale, and 14.6X-31.0X at the 1M-1.4M scale (note that the power benefit of Baldur decreases slightly as we scale the network from 1K-2K to 16K-17K because the multiplicity is increased from 4 to 5).</p><p>3) Sensitivity Analysis: We perform sensitivity analysis by scaling the power of network switches by 0.5X and 2X (for both electrical and optical switches) to account for possible sources of inaccuracy in technology parameters and network power modeling, and the results for a 1M-1.4M scale network are shown in Figure <ref type="figure">9</ref>. The observation is that, even considering the pessimistic case, Baldur is still the most power efficient, consuming 5.1X, 8.2X and 14.7X less power than dragonfly, fat-tree and electrical multi-butterfly, respectively.</p><p>In summary, thanks to its high scalability and low power consumption, Baldur is a promising network for exascale computing.</p></div>
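<div xmlns="http://www.tei-c.org/ns/1.0"><p>The component-sum power methodology described in this section can be written out as a small calculation for Baldur. The per-component power values are the ones quoted in the methodology (1.5 W per optical transceiver, 0.693 W per SerDes unit, 0.741 W per re-transmission buffer, 0.406 mW per TL gate); the component counts are hypothetical inputs supplied by the caller, since the number of transceivers, switches, and gates per switch at each scale is not given in this section.</p>
<p><code lang="python"><![CDATA[
```python
TRANSCEIVER_W = 1.5       # per optical transceiver
SERDES_W = 0.693          # per SerDes unit
RETX_BUFFER_W = 0.741     # per 1 MB re-transmission buffer (one per node)
TL_GATE_W = 0.406e-3      # per TL gate (0.406 mW)


def baldur_network_power_w(num_transceivers, num_serdes, num_nodes,
                           num_switches, gates_per_switch):
    """Total network power in watts as a sum of the major components.

    All counts are assumptions supplied by the caller.
    """
    counts = (num_transceivers, num_serdes, num_nodes,
              num_switches, gates_per_switch)
    if any(c < 0 for c in counts):
        raise ValueError("counts must be non-negative")
    # Switch power: number of TL gates per switch times the power of a gate.
    switch_power = num_switches * gates_per_switch * TL_GATE_W
    return (num_transceivers * TRANSCEIVER_W
            + num_serdes * SERDES_W
            + num_nodes * RETX_BUFFER_W
            + switch_power)


def power_per_node_w(total_power_w, num_nodes):
    """Power per server node, the metric used to compare network scales."""
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    return total_power_w / num_nodes


if __name__ == "__main__":
    # Purely illustrative counts, not taken from the paper.
    total = baldur_network_power_w(num_transceivers=2, num_serdes=2,
                                   num_nodes=1, num_switches=1,
                                   gates_per_switch=1000)
    print(f"{total:.3f} W total, {power_per_node_w(total, 1):.3f} W per node")
```
]]></code></p>
<p>Scaling TL_GATE_W by 0.5 or 2 in such a model corresponds to the kind of switch-power sensitivity sweep described above.</p></div>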
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Cost Analysis and Results in Various Network Scales</head><p>To estimate the cost of deploying Baldur in practice, we adopt a cost model which takes into account the cost of fibers, FAUs, RFECs, optical interposers, and optical transceivers, similar to previous work <ref type="bibr">[2]</ref>, <ref type="bibr">[63]</ref>. Furthermore, we pessimistically assume that the cost of optical interposers (including the TL chips and other passive optical devices) is 5x more than the cost of current CMOS chips for the same area.</p><p>As shown in Figure <ref type="figure">10</ref>, Baldur achieves low cost (in terms of USD per server node) in various network scales.</p><p>For example, at the 1K-2K scale, Baldur's cost per node is 523 USD, compared to 1,992 USD for a fat-tree network with 2,560 nodes <ref type="bibr">[17]</ref>, <ref type="bibr">[63]</ref>. The cost of Baldur increases only slightly as the number of server nodes increases, which suggests that Baldur is highly scalable (this is consistent with our power analysis). Moreover, the cost of optical interposers dominates the total cost. Thus, the cost of Baldur may further decrease since the number of optical interposers may become smaller as a result of reduced fiber pitch in the future (as discussed in Sec. IV-G).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VII. RELATED WORK</head><p>In this section, we discuss various optical network and optical logic proposals in the literature, and compare these proposals with Baldur.</p><p>Optical packet switching (OPS): OPS performs optical switching at the packet level <ref type="bibr">[4]</ref>, <ref type="bibr">[5]</ref>. A popular example of OPS is based on AWGRs <ref type="bibr">[3]</ref>, <ref type="bibr">[5]</ref>, and an overview of AWGR networks can be found in Sec. II. In this section, we focus on a quantitative comparison between Baldur and an AWGR network for the scale of 32 nodes (and we expect the general conclusion to hold for other network scales and OPS schemes as well). For this scale, multiplicity of 3 is sufficient for Baldur to achieve a packet drop rate of &lt;1%. The AWGR network is built using a 32-radix AWGR <ref type="bibr">[3]</ref>, which is capable of sending up to 3 packets to each output port in parallel by using 3 different wavelengths. Note that, although in theory 32 wavelengths are available in a 32-radix AWGR, a much smaller number of wavelengths should be used instead due to cost and practical considerations <ref type="bibr">[3]</ref>.</p><p>In terms of performance, both Baldur and the AWGR network provide the same bandwidth, but the AWGR network needs to pay a large latency overhead due to electrical header processing (e.g., 90ns <ref type="bibr">[54]</ref>, much higher than the switching latency in Baldur). In terms of power, excluding the power cost of optical transceivers and SerDes units at the server nodes (which is the same for both networks), Baldur consumes 0.7 W per node (which accounts for the power of TL chips), while the AWGR network consumes 4.2 W per node (which accounts for optical receivers, SerDes units, buffers to support electrical header processing, and tunable wavelength converters). Therefore, Baldur is superior to the AWGR network in terms of both latency and power. 
In terms of scalability, unlike Baldur which is highly scalable, the scalability of AWGR networks is limited by the switch radix as discussed in Sec. II. Moreover, as the network size increases, the number of wavelengths required in an AWGR network increases, which may significantly increase the power/cost/complexity of the optical transceivers <ref type="bibr">[24]</ref>.</p><p>Optical circuit switching (OCS) and hybrid OCS/Electrical switching (ES): MEMS-based OCS relies on mirrors to perform switching <ref type="bibr">[64]</ref>, and a separate processing unit is responsible for establishing the correct paths between input and output ports by repositioning the mirrors. The main disadvantage of OCS is that the switching time is long, on the order of milliseconds. In the hybrid OCS/ES schemes <ref type="bibr">[2]</ref>, <ref type="bibr">[63]</ref>, ES is used to deliver packets with low latency, and OCS is used to provide high bandwidth transmission.</p><p>Compared to these OCS and hybrid OCS/ES schemes, Baldur has the following advantages: (1) when ES is used, Baldur has clear packet latency and power benefits as shown in Sec. V and VI; (2) when OCS is used, Baldur is expected to achieve similar performance and bandwidth due to its ultra-low packet latency; (3) in hybrid OCS/ES schemes, complex control logic is required to decide when to use OCS and ES. In contrast, the routing and control complexity in Baldur is low; (4) Baldur achieves lower cost than an OCS-based scheme (e.g., 523 USD per node for Baldur (see Sec. VI) vs. 1,719 USD per node for OCS <ref type="bibr">[63]</ref> for the scale of a few thousand nodes).</p><p>Optical logic: The TL is fundamentally different from other optical logic technologies (e.g., <ref type="bibr">[65]</ref>, <ref type="bibr">[66]</ref>). First, it is a unique technology that integrates the functionalities of a transistor, a laser, and a photodetector in a single compact device. 
Second, it is possible to directly interface TL devices and logic gates with circuits in the electrical domain. Last but not least, the TL technology has been demonstrated to be efficient, fast, and reliable through actual fabricated prototypes, unlike several previously proposed optical technologies (e.g., <ref type="bibr">[66]</ref>) that are impractical. Therefore, the TL serves as a solid and promising technology basis to develop all-optical networks and computing systems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VIII. CONCLUSIONS AND FUTURE WORK</head><p>In this paper, an all-optical network called Baldur is introduced. Baldur is able to achieve significant improvements in network latency, power, and cost compared to state-of-the-art networks. Baldur is also highly scalable to support exascale computing. These benefits are obtained by fundamentally understanding the TL technology and its architectural implications, which leads us to non-conventional design decisions and new design techniques that are optimized for the Baldur network.</p><p>In the future, we envision that the TL will drive further architectural innovations and enable new network capabilities. Some examples include in-flight routing for &gt;100 G links, and host-based and in-network accelerations (such as network filtering for security purposes, and traffic combining to improve performance). We also plan to apply TL-based optical computing in other application domains (e.g., machine learning) that demand ultra-high-speed and efficient processing.</p></div>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_1"><p>Note that, static power is the dominant component in the power of TL gates. Thus, unlike CMOS gates, the power of TL gates is roughly constant for different data rates and activity factors.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_2"><p>Baldur is the god of light in Norse mythology, which is a well-suited name for the first all-optical network.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_3"><p>We use 8b/10b as an example. Our design can be easily modified to work with other encoding schemes.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_4"><p>In Baldur with multiplicity of 1, the behavior of the valid and mask off latches is the same. However, this is not the case when multiplicity (m) is greater than 1, because for each input, m valid latches are required for m paths, but the number of the mask off and routing latches stays 1 regardless of the multiplicity value.</p></note>
		</body>
		</text>
</TEI>
