<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Expanse: Computing without Boundaries</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2021 Fall</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10291594</idno>
					<idno type="doi"></idno>
					<title level='j'>Practice &amp; Experience in Advanced Research Computing (PEARC)</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Shawn Strande</author><author>Ilkay Altintas</author><author>Haisong Cai</author><author>Trevor Cooper</author><author>Christopher Irving</author><author>Marty Kandes</author><author>Amitava Majumdar</author><author>Dmitry Mishin</author><author>Ismael Perez</author><author>Wayne Pfeiffer</author><author>Manu Shantharam</author><author>Robert S. Sinkovits</author><author>Subhashini Sivagnanam</author><author>Mahidhar Tatineni</author><author>Mary Thomas</author><author>Nicole Wolter</author><author>Michael Norman</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We describe the design motivation, architecture, deployment, and early operations of Expanse, a 5-petaflop, heterogeneous HPC system that entered production as an NSF-funded resource in December 2020 and will be operated on behalf of the national community for five years. Expanse will serve a broad range of computational science and engineering through a combination of standard batch-oriented services, and by extending the system to the broader CI ecosystem through science gateways, public cloud integration, support for high-throughput computing, and composable systems. Expanse was procured, deployed, and put into production entirely during the COVID-19 pandemic, adhering to stringent public health guidelines throughout. Nevertheless, the planned production date of October 1, 2020 slipped by only two months, thanks to thorough planning, a dedicated team of technical and administrative experts, collaborative vendor partnerships, and a commitment to getting an important national computing resource to the community at a time of great need.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>and Experience in Advanced Research Computing (PEARC '21), July 18&#347; 22, 2021, Boston, MA, USA. ACM, New York, NY, USA, 4 pages. <ref type="url">https: //doi.org/10.1145/3437359.3465588</ref> </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">SYSTEM ARCHITECTURE</head><p>High-performance computing system architectures, software, and expertise are evolving to support the growing diversification in computing that is being driven by rapidly evolving science and engineering research. Now more than ever, supercomputers must be part of a more integrated, national cyberinfrastructure that comprises distributed computing and data resources, scientific instruments, R&amp;E networks, and expertise. Accordingly, systems and application software, and user support must also evolve to support this expanding ecosystem <ref type="bibr">[1,</ref><ref type="bibr">2]</ref>.</p><p>Developed in response to NSF Solicitation 19-534, Expanse is an evolutionary system, designed in large part from lessons learned from the operation of Comet [3, 4], a supercomputer that has been operated by SDSC for the last six years. Like Comet, Expanse was designed to support the &#322;longtail&#382; of computing, which we define as the broad spectrum of computational science and engineering research that is carried out at modest scale, but with increasingly diverse system and application software. A summary of the major subsystems is given in Table <ref type="table">1</ref> Notable features of this design include: first large-scale NSF system to feature AMD EPYC processors; 13 identical Scalable Units, each with 56 compute and 4 GPU nodes; rich storage environment; full-bisection, low-latency Mellanox HDR-100 interconnect at the rack level, accessing 7,168 EPYC cores, and 16 V100 GPUs; support for Slurm and Kubernetes; integration with the Open Science Grid; scheduler-based integration with public cloud; and support for composable systems. The system includes a Lustre file system for compute and a Ceph file system, built from repurposed hardware from the Comet system that will be retired in July 2021, and will support the composable systems and cloud integration capability, and limited second copy data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">ACQUISTION AND DEPLOYMENT 2.1 Project Changes</head><p>Following a thorough assessment of the original design in light of vendor developments following the award and the potential impact on the planned deployment, a decision was made to execute a Project Change Request (PCR) for the standard compute node processor (the Expanse Risk Mitigation Plan had a provision for 1 PB this change.) Whereas the proposed system was based on an Intel processor, the deployed system is based on AMD's EPYC processor. This change resulted in doubling of the original core count from 46,592 to 93,184. Extensive benchmarking was performed in advance of the PCR to demonstrate that the change would result in a system that delivered as good or better performance than the proposed system. The results showed that the change would deliver both improved performance over Comet on real applications, and a substantial increase in overall system throughput. The other notable PCR was the addition of direct-to-chip cooling for the standard compute nodes. Both changes were accommodated within the original acquisition budget. Acquisition, deployment, and transition to production operations were carried out under COVID-19 restrictions. We estimate an approximate 2-month delay due to COVID-19 related delays, including data center access, and supply chain.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Installation and Acceptance Testing</head><p>System testing comprised several distinct phases, including: early access for benchmarking and software; early delivery of a single compute node in advance of the full system for system and application stack development; system burn-in and testing at vendor facilities; onsite vendor regression testing following initial deployment; SDSC verification of bills of materials; unit level testing; full systems benchmarking using micro benchmarks and applications; reliability testing; and 30-day Early User Program. Application testing was defined by SDSC and approved as part of the Cooperative Agreement with NSF.</p><p>SDSC staff received remote access to Dell and AMD clusters with EPYC processors and used the time to develop application build recipes with various compilers (GNU, Intel, and AOCC) and find optimal run configurations. We used the NPS4 configuration -which is 4 NUMA domains per socket. This is the AMD-recommended configuration for HPC workloads which are typically NUMA aware. NPS1 is suggested for codes that are not NUMA aware or cannot handle NUMA complexity, but we did not explore this. Synthetic benchmarks included HPL, STREAM, OSU Micro Benchmarks (OMB), IOR, mdtest, and Tartan. Using HPL v2.3, an average floating-point performance of 3,711 GFLOPS (80% of theoretical peak) was measured for Expanse's standard AMD compute nodes. An average memory bandwidth for STREAM's copy function of 346,072 MB/s was measured for Expanse's standard AMD compute nodes (85% of theoretical peak). OMB results show the average point-to-point in-rack latency between the AMD compute nodes was 1.184 &#181;s and the point-to-point in-rack bandwidth was 12,335 MB/s. IOR results showed an average read bandwidth of 2,694 MiB/s on the node-local NVMe drives. The Lustre filesystem performed at an average bandwidth of 143,098 MiB/s on IOR tests and achieved a maximum of 211,369 IOPs on file reads with the mdtest benchmark.</p><p>The CPU applications benchmarks included GROMACS, NAMD, NEURON, OpenFOAM, Quantum Espresso, WRF, and RAxML. The RAxML runs were on single nodes with core counts ranging from 10 to 40. WRF, NEURON, OpenFOAM, and Quantum Espresso were run with core counts ranging from 96 to 768. NAMD, GROMACS were run with core counts ranging from 96 to 1,536 cores. Speedups relative to Comet were calculated on a per-core basis and ranged from 0.97X to 1.9X with an average speed up of 1.41X combining all cases.</p><p>The GPU application benchmarks included: AMBER -run with a single GPU and in throughput mode using all 4 GPUs; BEAST, GROMACS, NAMD, MXNet, PyTorch, and TensorFlow -all run with 1, 2, and 4 GPUs on a single node. Speedups relative to the Comet P100 GPUs ranged from 1.29X to 1.89X with an average speed up of 1.63X combining all cases.</p><p>Compute node reliability was tested by running five of the benchmark application performance tests (HPL, RAxML, NEURON, Quantum Espresso, and WRF) described above over a period of three days. A total of 47,319 jobs were run in 3 days and the overall success rate for all of the benchmarks jobs run was 99.84%. The reliability of Expanse's GPU nodes was tested by running the AMBER and GROMACS benchmark application performance tests over a period of three days. A total of 7,214 jobs were run with only one job failure detected during the 3-day period. A few of the GROMACS run were slightly slower than the norm and two GPU nodes were fixed to address the issue. 
Full-scale HPL was run to further identify hardware issues.</p></div>
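<div xmlns="http://www.tei-c.org/ns/1.0"><p>For readers who want to relate the measured numbers above to hardware limits, the short sketch below back-computes the per-node theoretical peaks implied by the quoted HPL and STREAM efficiencies. It is a consistency check using only values stated in this section; the closing comment about clock rate and FLOPs per cycle is an assumption about the processor, not a figure from the paper. <eg><![CDATA[
# Consistency check (Python): per-node peaks implied by measured results.

hpl_gflops = 3_711           # measured HPL result per standard compute node
hpl_efficiency = 0.80        # quoted fraction of theoretical peak

stream_copy_mb_s = 346_072   # measured STREAM copy bandwidth per node
stream_efficiency = 0.85     # quoted fraction of theoretical peak

peak_gflops = hpl_gflops / hpl_efficiency                 # ~4,639 GFLOPS/node
peak_mem_bw_mb_s = stream_copy_mb_s / stream_efficiency   # ~407,000 MB/s/node

print(f"Implied FP64 peak per node : {peak_gflops:,.0f} GFLOPS")
print(f"Implied memory BW per node : {peak_mem_bw_mb_s:,.0f} MB/s")

# With 128 cores per node (Section 1), ~4,639 GFLOPS/node is roughly
# 36 GFLOPS/core, consistent with a ~2.25 GHz clock delivering 16 FP64
# FLOPs per cycle per core -- an assumption about the CPU, not a paper value.
]]></eg></p></div>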
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Early User Program</head><p>Following the activities described above, which addressed issues of performance and reliability, and prior to full transition to production operations, Expanse underwent a 30-day Early User Program (EUP) to ensure that the system was highly usable and stable under a real world mix of users and applications. Forty-one projects were allocated time on the system, thirty two of which made use of their allocations during the program.</p><p>During the EUP, the users ran 87k CPU and 5.7k GPU jobs, consuming a total of 6M CPU core-hours and 20k GPU hours, with an overall job completion rate of 99%. Users were surveyed as part of the program to assess their experience and determine if the system was ready for transition. Overwhelmingly, users expressed high satisfaction with the system, with many noting that it provided</p></div></body>
		</text>
</TEI>
