Today’s landscape of computational science is evolving rapidly, creating a need for new, flexible, and responsive supercomputing platforms to address the growing areas of artificial intelligence (AI), data analytics (DA), and convergent collaborative research. To support this community, we designed and deployed the Bridges-2 platform. Building on our highly successful Bridges supercomputer, a high-performance computing resource that supported new communities and complex workflows, Bridges-2 supports traditional and nontraditional research communities and applications; integrates new technologies for converged, scalable high-performance computing (HPC), AI, and data analytics; prioritizes researcher productivity and ease of use; and provides an extensible architecture for interoperation with complementary data-intensive projects, campuses, and clouds. In this report, we describe Bridges-2’s hardware and configuration, user environments, and systems support, and present the results of the successful Early User Program.
Neocortex and Bridges-2: A High Performance AI+HPC Ecosystem for Science, Discovery, and Societal Good
Artificial intelligence (AI) is transforming research through the analysis of massive datasets and by accelerating simulations by factors of up to a billion. Such acceleration eclipses the speedups made possible through improvements in CPU process and design and other kinds of algorithmic advances. It sets the stage for a new era of discovery in which previously intractable challenges will become surmountable, with applications such as discovering the causes of cancer and rare diseases; developing effective, affordable drugs; improving food sustainability; developing a detailed understanding of environmental factors to support protection of biodiversity; and developing alternative energy sources as a step toward reversing climate change. To succeed, the research community requires a high-performance computational ecosystem that seamlessly and efficiently brings together scalable AI, general-purpose computing, and large-scale data management. The authors, at the Pittsburgh Supercomputing Center (PSC), launched a second-generation computational ecosystem to support AI-enabled research, bringing together carefully designed systems and groundbreaking technologies to provide, at no cost, a uniquely capable platform to the research community. It consists of two major systems: Neocortex and Bridges-2. Neocortex embodies a revolutionary processor architecture to vastly shorten the time required for deep learning training, foster greater integration of artificial deep learning with …
- Editors: Nesmachnow, S.; Castro, H.; Tchernykh, A.
- Award ID(s): 1833317
- Publication Date:
- NSF-PAR ID: 10274872
- Journal Name: Communications in Computer and Information Science
- Volume: 1327
- Page Range or eLocation-ID: 205-219
- ISSN: 1865-0929
- Sponsoring Org: National Science Foundation
More Like this
To advance knowledge by enabling unprecedented AI speed and scalability, the Pittsburgh Supercomputing Center (PSC), a joint research center of Carnegie Mellon University and the University of Pittsburgh, in partnership with Cerebras Systems and Hewlett Packard Enterprise (HPE), has deployed Neocortex, an innovative computing platform that accelerates scientific discovery by vastly shortening the time required for deep learning training and inference, fosters greater integration of deep AI models with scientific workflows, and provides promising hardware for the development of more efficient algorithms for artificial intelligence and graph analytics. Neocortex advances knowledge by accelerating scientific research, enabling the development of more accurate models and the use of larger training data, scaling model parallelism to unprecedented levels, and focusing on human productivity by simplifying tuning and hyperparameter optimization to create a transformative hardware and software platform for the exploration of new frontiers. Neocortex has been integrated with PSC’s complementary infrastructure. This paper shares the experiences, decisions, and findings made in that process. The system is serving science and engineering users via an early user access program. Valuable artifacts developed during the integration phase have been made available via a public repository and have been consulted by other AI system deployments that have seen Neocortex as …
The National Ecological Observatory Network (NEON) is a continental-scale observatory with sites across the US collecting standardized ecological observations that will operate for multiple decades. To maximize the utility of NEON data, we envision edge computing systems that gather, calibrate, aggregate, and ingest measurements in an integrated fashion. Edge systems will employ machine learning methods to cross-calibrate, gap-fill, and provision data in near-real time to the NEON Data Portal and to High Performance Computing (HPC) systems running ensembles of Earth system models (ESMs) that assimilate the data. For the first time, gridded EC data products and response functions promise to offset pervasive observational biases through evaluating, benchmarking, optimizing parameters, and training new machine learning parameterizations within ESMs, all at the same model-grid scale. Leveraging open-source software for EC data analysis, we are already building software infrastructure for the integration of near-real-time data streams into the International Land Model Benchmarking (ILAMB) package for use by the wider research community. We will present a perspective on the design and integration of end-to-end infrastructure for data acquisition, edge computing, HPC simulation, analysis, and validation, where Artificial Intelligence (AI) approaches are used throughout the distributed workflow to improve accuracy and computational performance.
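As a concrete illustration of the gap-filling step described above, the following is a minimal sketch, not NEON's actual pipeline: the sensors, data, and model choice are invented for illustration. It trains a small regression on a co-located measurement stream and fills only the missing samples:

```python
# Minimal sketch (illustrative, not NEON's pipeline): gap-filling one
# sensor's missing readings from a co-located sensor with a small
# regression model, the kind of near-real-time step described above.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Synthetic co-located measurements: air temperature drives a noisy
# soil-temperature response; some soil readings are missing (NaN).
air = 10 + 8 * np.sin(np.linspace(0, 6 * np.pi, 500))
soil = 0.6 * air + 2 + rng.normal(0, 0.3, air.size)
soil[rng.choice(soil.size, 60, replace=False)] = np.nan  # sensor dropouts

# Train on the observed samples only.
mask = ~np.isnan(soil)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(air[mask].reshape(-1, 1), soil[mask])

# Fill only the gaps, keeping observed values untouched.
filled = soil.copy()
filled[~mask] = model.predict(air[~mask].reshape(-1, 1))
```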
-
Obeid, I.; Selesnik, I.; Picone, J. (Eds.) The Neuronix high-performance computing cluster allows us to conduct extensive machine learning experiments on big data [1]. This heterogeneous cluster uses innovative scheduling technology, Slurm [2], that manages a network of CPUs and graphics processing units (GPUs). The GPU farm consists of a variety of processors ranging from low-end consumer-grade devices such as the Nvidia GTX 970 to higher-end devices such as the GeForce RTX 2080. These GPUs are essential to our research since they allow extremely compute-intensive deep learning tasks to be executed on massive data resources such as the TUH EEG Corpus [2]. We use TensorFlow [3] as the core machine learning library for our deep learning systems, and routinely employ multiple GPUs to accelerate the training process. Reproducible results are essential to machine learning research. Reproducibility in this context means the ability to replicate an existing experiment – performance metrics such as error rates should be identical and floating-point calculations should match closely. Three examples of ways we typically expect an experiment to be replicable are: (1) the same job run on the same processor should produce the same results each time it is run; (2) a job run on a CPU and GPU should produce …
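The replicability expectations above come down to pinning every source of randomness and selecting deterministic kernels. Below is a minimal sketch for a TensorFlow 2.x job; the seed value is arbitrary, the determinism call assumes TF ≥ 2.8, and the toy model is illustrative rather than the Neuronix configuration:

```python
# Minimal sketch: pinning sources of nondeterminism in a TensorFlow 2.x
# training job, along the lines of the reproducibility goals described
# above. The seed value is arbitrary; the toy model is illustrative.
import os
import random

import numpy as np
import tensorflow as tf

SEED = 1337  # arbitrary choice, held fixed across runs

# Seed every RNG the training pipeline touches.
os.environ["PYTHONHASHSEED"] = str(SEED)
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

# Ask TensorFlow to select deterministic GPU kernels where available
# (TF >= 2.8); ops with no deterministic implementation will raise
# instead of silently producing run-to-run differences.
tf.config.experimental.enable_op_determinism()

# With seeds and deterministic ops pinned, repeated runs of the same
# job on the same processor should produce identical error rates.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(16,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
```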
With recent advances in online sensing technology and high-performance computing, structural health monitoring (SHM) has begun to emerge as an automated approach to the real-time condition monitoring of civil infrastructure. Ideal SHM strategies detect and characterize damage by leveraging measured response data to update physics-based finite element models (FEMs). When monitoring composite structures, such as reinforced concrete (RC) bridges, the reliability of FEM-based SHM is adversely affected by material, boundary, geometric, and other model uncertainties. Civil engineering researchers have adapted popular artificial intelligence (AI) techniques to overcome these limitations, as AI has an innate ability to solve complex and ill-defined problems by leveraging advanced machine learning techniques to rapidly analyze experimental data. In this vein, this study employs a novel Bayesian estimation technique to update a coupled vehicle-bridge FEM for the purposes of SHM. Unlike existing AI-based techniques, the proposed approach makes intelligent use of an embedded FEM model, thus reducing the parameter space while simultaneously guiding the Bayesian model via physics-based principles. To validate the method, bridge response data are generated from the vehicle-bridge FEM given a set of “true” parameters, and the bias and standard deviation of the parameter estimates are analyzed. Additionally, the mean …
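To make the Bayesian updating idea concrete, here is a generic random-walk Metropolis sketch, not the study's estimator: a stand-in damped-oscillation response takes the place of the coupled vehicle-bridge FEM, synthetic data are generated from assumed "true" parameters, and the posterior mean and standard deviation of the estimates are then examined, mirroring the validation described above:

```python
# Generic illustration (not the study's algorithm): random-walk Metropolis
# updating of physical parameters against measured bridge response, with a
# stand-in forward model where the coupled vehicle-bridge FEM would be called.
import numpy as np

rng = np.random.default_rng(0)

def forward_model(theta, t):
    """Stand-in for the vehicle-bridge FEM: a damped oscillation whose
    frequency/damping-like parameters theta = (omega, zeta) are unknown."""
    omega, zeta = theta
    return np.exp(-zeta * t) * np.sin(omega * t)

# Synthetic "measured" response from assumed true parameters plus noise,
# mirroring the validation against data generated from known values.
t = np.linspace(0.0, 10.0, 200)
theta_true = np.array([2.0, 0.3])
sigma = 0.05
y_obs = forward_model(theta_true, t) + rng.normal(0.0, sigma, t.size)

def log_posterior(theta):
    if np.any(theta <= 0):  # flat prior restricted to positive parameters
        return -np.inf
    resid = y_obs - forward_model(theta, t)
    return -0.5 * np.sum(resid**2) / sigma**2  # Gaussian likelihood

# Random-walk Metropolis over the (deliberately small) parameter space.
theta = np.array([1.0, 0.1])
logp = log_posterior(theta)
samples = []
for _ in range(20000):
    prop = theta + rng.normal(0.0, 0.02, size=2)
    logp_prop = log_posterior(prop)
    if np.log(rng.uniform()) < logp_prop - logp:  # accept/reject step
        theta, logp = prop, logp_prop
    samples.append(theta)

post = np.array(samples[5000:])  # discard burn-in
print("posterior mean:", post.mean(axis=0))  # bias relative to theta_true
print("posterior std: ", post.std(axis=0))   # spread of the estimates
```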