

Title: Expanse: Computing without Boundaries
We describe the design motivation, architecture, deployment, and early operations of Expanse, a 5-petaflop, heterogeneous HPC system that entered production as an NSF-funded resource in December 2020 and will be operated on behalf of the national community for five years. Expanse will serve a broad range of computational science and engineering through standard batch-oriented services and by extending the system to the broader CI ecosystem through science gateways, public cloud integration, support for high-throughput computing, and composable systems. Expanse was procured, deployed, and put into production entirely during the COVID-19 pandemic, adhering to stringent public health guidelines throughout. Nevertheless, the planned production date of October 1, 2020 slipped by only two months, thanks to thorough planning, a dedicated team of technical and administrative experts, collaborative vendor partnerships, and a commitment to getting an important national computing resource to the community at a time of great need.
Award ID(s):
2017767 1925558
NSF-PAR ID:
10291594
Author(s) / Creator(s):
Date Published:
Journal Name:
Practice & Experience in Advanced Research Computing (PEARC)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Obeid, Iyad; Picone, Joseph; Selesnick, Ivan (Eds.)
    The Neural Engineering Data Consortium (NEDC) is developing a large open source database of high-resolution digital pathology images known as the Temple University Digital Pathology Corpus (TUDP) [1]. Our long-term goal is to release one million images. We expect to release the first 100,000-image corpus by December 2020. The data are being acquired at the Department of Pathology at Temple University Hospital (TUH) using a Leica Biosystems Aperio AT2 scanner [2] and consist entirely of clinical pathology images. More information about the data and the project can be found in Shawki et al. [3]. We currently have a National Science Foundation (NSF) planning grant [4] to explore how the community can best leverage this resource. One goal of this poster presentation is to stimulate community-wide discussions about this project and determine how this valuable resource can best meet the needs of the public. The computing infrastructure required to support this database is extensive [5] and includes two HIPAA-secure computer networks, dual petabyte file servers, and Aperio’s eSlide Manager (eSM) software [6]. We have currently digitized over 50,000 slides from 2,846 patients and 2,942 clinical cases. There is an average of 12.4 slides per patient and 10.5 slides per case, with one report per case. The data are organized by tissue type as shown below:
Filenames:
tudp/v1.0.0/svs/gastro/000001/00123456/2015_03_05/0s15_12345/0s15_12345_0a001_00123456_lvl0001_s000.svs
tudp/v1.0.0/svs/gastro/000001/00123456/2015_03_05/0s15_12345/0s15_12345_00123456.docx
Explanation:
tudp: root directory of the corpus
v1.0.0: version number of the release
svs: the image data type
gastro: the type of tissue
000001: six-digit sequence number used to control directory complexity
00123456: eight-digit patient MRN
2015_03_05: the date the specimen was captured
0s15_12345: the clinical case name
0s15_12345_0a001_00123456_lvl0001_s000.svs: the actual image filename, consisting of a repeat of the case name, a site code (e.g., 0a001), the type and depth of the cut (e.g., lvl0001), and a token number (e.g., s000)
0s15_12345_00123456.docx: the filename for the corresponding case report
We currently recognize fifteen tissue types in the first installment of the corpus. The raw image data is stored in Aperio’s “.svs” format, which is a multi-layered compressed JPEG format [3,7]. Pathology reports containing a summary of how a pathologist interpreted the slide are also provided in a flat text file format. A more complete summary of the demographics of this pilot corpus will be presented at the conference. Another goal of this poster presentation is to share our experiences with the larger community, since many of these details have not been adequately documented in scientific publications. There are quite a few obstacles in collecting this data that have slowed down the process and need to be discussed publicly. Our backlog of slides dates back to 1997, meaning that many slides must be sifted through and discarded due to peeling or cracking. Additionally, during scanning a slide can get stuck, stalling a scan session for hours and resulting in a significant loss of productivity. Over the past two years, we have accumulated significant experience with how to scan a diverse inventory of slides using the Aperio AT2 high-volume scanner. We have been working closely with the vendor to resolve many problems associated with the use of this scanner for research purposes.
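Because the corpus layout above encodes version, tissue type, patient MRN, capture date, and case directly in the directory path, a small parser makes the convention concrete. The sketch below is an editorial illustration only, assuming the path structure exactly as listed; the field names are ours and are not part of the TUDP release.

```python
from pathlib import PurePosixPath

def parse_tudp_path(path: str) -> dict:
    """Split a TUDP image path into its documented components.

    Expected layout (from the corpus description above):
    tudp/<version>/svs/<tissue>/<sequence>/<mrn>/<date>/<case>/<image file>
    """
    parts = PurePosixPath(path).parts
    root, version, datatype, tissue, sequence, mrn, date, case, filename = parts
    return {
        "root": root,              # corpus root, e.g. "tudp"
        "version": version,        # release version, e.g. "v1.0.0"
        "data_type": datatype,     # image data type, e.g. "svs"
        "tissue": tissue,          # tissue type, e.g. "gastro"
        "sequence": sequence,      # six-digit directory sequence number
        "mrn": mrn,                # eight-digit patient MRN
        "date": date,              # specimen capture date (YYYY_MM_DD)
        "case": case,              # clinical case name
        "filename": filename,      # image (.svs) or case report (.docx) file
    }

# Example using the image path given in the corpus description:
info = parse_tudp_path(
    "tudp/v1.0.0/svs/gastro/000001/00123456/2015_03_05/"
    "0s15_12345/0s15_12345_0a001_00123456_lvl0001_s000.svs"
)
print(info["tissue"], info["mrn"], info["case"])
```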
This scanning project began in January of 2018 when the scanner was first installed. The scanning process was slow at first, since there was a learning curve in how the scanner worked and how to obtain samples from the hospital. From its start date until May of 2019, ~20,000 slides were scanned. In the six months from May to November we tripled that number and now hold ~60,000 slides in our database. This dramatic increase in productivity was due to additional undergraduate staff members and an emphasis on efficient workflow. The Aperio AT2 scans 400 slides a day, requiring at least eight hours of scan time. The efficiency of these scans can vary greatly. When our team first started, approximately 5% of slides failed the scanning process due to focal point errors. We have been able to reduce that to 1% through a variety of means: (1) best practices regarding daily and monthly recalibrations, (2) tweaking the software, such as the tissue finder parameter settings, and (3) experience with how to clean and prep slides so they scan properly. Nevertheless, this is not a completely automated process, making it very difficult to reach our production targets. With a staff of three undergraduate workers spending a total of 30 hours per week, we find it difficult to scan more than 2,000 slides per week using a single scanner (400 slides per night x 5 nights per week). The main limitation in achieving this level of production is the lack of a completely automated scanning process; it takes a couple of hours to sort, clean, and load slides. We have streamlined all other aspects of the workflow required to database the scanned slides so that there are no additional bottlenecks. To bridge the gap between hospital operations and research, we are using Aperio’s eSM software. Our goal is to provide pathologists access to high-quality digital images of their patients’ slides. eSM is a secure website that holds the images with their metadata labels, patient report, and the path to where the image is located on our file server. Although eSM includes significant infrastructure to import slides into the database using barcodes, TUH does not currently support barcode use. Therefore, we manage the data using a mixture of Python scripts and manual import functions available in eSM. The database and associated tools are based on proprietary formats developed by Aperio, making this another important point of community-wide discussion on how best to disseminate such information. Our near-term goal for the TUDP Corpus is to release 100,000 slides by December 2020. We hope to continue data collection over the next decade until we reach one million slides. We are creating two pilot corpora using the first 50,000 slides we have collected. The first corpus consists of 500 slides with a marker stain and another 500 without it. This set was designed to let people debug their basic deep learning processing flow on these high-resolution images. We discuss our preliminary experiments on this corpus and the challenges in processing these high-resolution images using deep learning in [3]. We are able to achieve a mean sensitivity of 99.0% for slides with pen marks, and 98.9% for slides without marks, using a multistage deep learning algorithm. While this dataset was very useful in initial debugging, we are in the midst of creating a new, more challenging pilot corpus using actual tissue samples annotated by experts. The task will be to detect ductal carcinoma in situ (DCIS) or invasive breast cancer tissue.
There will be approximately 1,000 images per class in this corpus. Based on the number of features annotated, we can train on a two-class problem of DCIS or benign, or increase the difficulty by expanding the classes to include DCIS, benign, stroma, pink tissue, non-neoplastic, etc. Those interested in the corpus or in participating in community-wide discussions should join our listserv, nedc_tuh_dpath@googlegroups.com, to be kept informed of the latest developments in this project. You can learn more from our project website: https://www.isip.piconepress.com/projects/nsf_dpath.
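The production figures quoted in this abstract (400 slides per night, five nights per week, and the 5% to 1% failure-rate improvement) imply a straightforward capacity estimate. The short calculation below is an editorial sketch based only on those stated numbers; it is not code from the project.

```python
# Capacity estimate from the figures quoted above (illustrative only).
slides_per_night = 400
nights_per_week = 5

weekly_capacity = slides_per_night * nights_per_week   # 2,000 slides/week
yearly_capacity = weekly_capacity * 52                  # ~104,000 slides/year

failure_rate_initial = 0.05   # ~5% of slides initially failed (focal point errors)
failure_rate_current = 0.01   # reduced to ~1% with recalibration and slide prep

weekly_good_initial = weekly_capacity * (1 - failure_rate_initial)
weekly_good_current = weekly_capacity * (1 - failure_rate_current)

print(f"Usable scans per week: {weekly_good_initial:.0f} -> {weekly_good_current:.0f}")
print(f"Weeks to reach 1,000,000 slides at the current rate: "
      f"{1_000_000 / weekly_good_current:.0f}")
```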
  2. Abstract This project is funded by the US National Science Foundation (NSF) through their NSF RAPID program under the title “Modeling Corona Spread Using Big Data Analytics.” The project is a joint effort between the Department of Computer & Electrical Engineering and Computer Science at FAU and a research group from LexisNexis Risk Solutions. The novel coronavirus Covid-19 originated in China in early December 2019 and has rapidly spread to many countries around the globe, with the number of confirmed cases increasing every day. Covid-19 is officially a pandemic. It is a novel infection with serious clinical manifestations, including death, and it has reached at least 124 countries and territories. Although the ultimate course and impact of Covid-19 are uncertain, it is not merely possible but likely that the disease will produce enough severe illness to overwhelm the worldwide health care infrastructure. Emerging viral pandemics can place extraordinary and sustained demands on public health and health systems and on providers of essential community services. Modeling the Covid-19 pandemic spread is challenging. But there are data that can be used to project resource demands. Estimates of the reproductive number (R) of SARS-CoV-2 show that at the beginning of the epidemic, each infected person spreads the virus to at least two others, on average (Emanuel et al. in N Engl J Med. 2020, Livingston and Bucher in JAMA 323(14):1335, 2020). A conservatively low estimate is that 5 % of the population could become infected within 3 months. Preliminary data from China and Italy regarding the distribution of case severity and fatality vary widely (Wu and McGoogan in JAMA 323(13):1239–42, 2020). A recent large-scale analysis from China suggests that 80 % of those infected either are asymptomatic or have mild symptoms; a finding that implies that demand for advanced medical services might apply to only 20 % of the total infected. Of patients infected with Covid-19, about 15 % have severe illness and 5 % have critical illness (Emanuel et al. in N Engl J Med. 2020). Overall, mortality ranges from 0.25 % to as high as 3.0 % (Emanuel et al. in N Engl J Med. 2020, Wilson et al. in Emerg Infect Dis 26(6):1339, 2020). Case fatality rates are much higher for vulnerable populations, such as persons over the age of 80 years (> 14 %) and those with coexisting conditions (10 % for those with cardiovascular disease and 7 % for those with diabetes) (Emanuel et al. in N Engl J Med. 2020). Overall, Covid-19 is substantially deadlier than seasonal influenza, which has a mortality of roughly 0.1 %. Public health efforts depend heavily on predicting how diseases such as those caused by Covid-19 spread across the globe. During the early days of a new outbreak, when reliable data are still scarce, researchers turn to mathematical models that can predict where people who could be infected are going and how likely they are to bring the disease with them. These computational methods use known statistical equations that calculate the probability of individuals transmitting the illness. Modern computational power allows these models to quickly incorporate multiple inputs, such as a given disease’s ability to pass from person to person and the movement patterns of potentially infected people traveling by air and land. This process sometimes involves making assumptions about unknown factors, such as an individual’s exact travel pattern. 
By plugging in different possible versions of each input, however, researchers can update the models as new information becomes available and compare their results to observed patterns for the illness. In this paper we describe the development of a model of coronavirus spread using innovative big data analytics techniques and tools. We leveraged our experience from research in modeling Ebola spread (Shaw et al., Modeling Ebola Spread and Using HPCC/KEL System, in: Big Data Technologies and Applications, 2016, pp. 347-385, Springer, Cham) to model coronavirus spread, with the aim of obtaining new results and helping to reduce the number of coronavirus patients. We closely collaborated with LexisNexis, which is a leading US data analytics company and a member of our NSF I/UCRC for Advanced Knowledge Enablement. The lack of a comprehensive view and informative analysis of the status of the pandemic can also cause panic and instability within society. Our work proposes the HPCC Systems Covid-19 tracker, which provides a multi-level view of the pandemic with informative virus-spreading indicators in a timely manner. The system embeds a classical epidemiological model known as SIR and spreading indicators based on a causal model. The data solution of the tracker is built on top of the Big Data processing platform HPCC Systems, from ingestion and tracking of various data sources to fast delivery of the data to the public. The HPCC Systems Covid-19 tracker presents the Covid-19 data on a daily, weekly, and cumulative basis, up to the global level and down to the county level. It also provides statistical analysis at each level, such as new cases per 100,000 population. The primary analyses, such as Contagion Risk and Infection State, are based on a causal model with a seven-day sliding window. Our work has been released as a publicly available website and has attracted a great volume of traffic. The project is open-sourced and available on GitHub. The system was developed on the LexisNexis HPCC Systems platform, which is briefly described in the paper.
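The tracker embeds the classical SIR compartmental model mentioned above. As a point of reference, the sketch below integrates the standard SIR equations with a simple Euler step; it is not taken from the HPCC Systems implementation, and the population size and rate parameters are purely illustrative.

```python
# Minimal SIR (Susceptible-Infected-Recovered) model sketch.
# Illustrative only: parameters are invented, and this is not the
# HPCC Systems Covid-19 tracker implementation.

def sir_step(s, i, r, beta, gamma, n, dt=1.0):
    """Advance the SIR equations by one time step (simple Euler integration)."""
    new_infections = beta * s * i / n * dt
    new_recoveries = gamma * i * dt
    return s - new_infections, i + new_infections - new_recoveries, r + new_recoveries

def simulate(n=1_000_000, i0=100, beta=0.4, gamma=0.2, days=120):
    """Run the model for `days` days; beta/gamma here imply R0 = beta/gamma = 2."""
    s, i, r = n - i0, i0, 0
    history = []
    for day in range(days):
        history.append((day, s, i, r))
        s, i, r = sir_step(s, i, r, beta, gamma, n)
    return history

if __name__ == "__main__":
    peak_day, _, peak_infected, _ = max(simulate(), key=lambda row: row[2])
    print(f"Peak of ~{peak_infected:,.0f} concurrently infected around day {peak_day}")
```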
  3. While bees are critical to sustaining a large proportion of global food production, as well as pollinating both wild and cultivated plants, they are decreasing in both numbers and diversity. Our understanding of the factors driving these declines is limited, in part, because we lack sufficient data on the distribution of bee species to predict changes in their geographic range under climate change scenarios. Additionally lacking is adequate data on the behavioral and anatomical traits that may make bees either vulnerable or resilient to human-induced environmental changes, such as habitat loss and climate change. Fortunately, a wealth of associated attributes can be extracted from the specimens deposited in natural history collections for over 100 years. Extending Anthophila Research Through Image and Trait Digitization (Big-Bee) is a newly funded US National Science Foundation Advancing Digitization of Biodiversity Collections project. Over the course of three years, we will create over one million high-resolution 2D and 3D images of bee specimens (Fig. 1), representing over 5,000 worldwide bee species, including most of the major pollinating species. We will also develop tools to measure bee traits from images and generate comprehensive bee trait and image datasets to measure changes through time. The Big-Bee network of participating institutions includes 13 US institutions (Fig. 2) and partnerships with US government agencies. We will develop novel mechanisms for sharing image datasets and datasets of bee traits that will be available through an open, Symbiota-Light (Gilbert et al. 2020) data portal called the Bee Library. In addition, biotic interaction and species association data will be shared via Global Biotic Interactions (Poelen et al. 2014). The Big-Bee project will engage the public in research through community science via crowdsourcing trait measurements and data transcription from images using Notes from Nature (Hill et al. 2012). Training and professional development for natural history collection staff, researchers, and university students in data science will be provided through the creation and implementation of workshops focusing on bee traits and species identification. We are also planning a short, artistic college radio segment called "the Buzz" to get people excited about bees, biodiversity, and the wonders of our natural world. 
  4.
    Abstract. Human-induced atmospheric composition changes cause a radiative imbalance at the top of the atmosphere which is driving global warming. This Earth energy imbalance (EEI) is the most critical number defining the prospects for continued global warming and climate change. Understanding the heat gain of the Earth system – and particularly how much and where the heat is distributed – is fundamental to understanding how this affects warming ocean, atmosphere and land; rising surface temperature; sea level; and loss of grounded and floating ice, which are fundamental concerns for society. This study is a Global Climate Observing System (GCOS) concerted international effort to update the Earth heat inventory and presents an updated assessment of ocean warming estimates as well as new and updated estimates of heat gain in the atmosphere, cryosphere and land over the period 1960–2018. The study obtains a consistent long-term Earth system heat gain over the period 1971–2018, with a total heat gain of 358±37 ZJ, which is equivalent to a global heating rate of 0.47±0.1 W m−2. Over the period 1971–2018 (2010–2018), the majority of heat gain is reported for the global ocean with 89 % (90 %), with 52 % for both periods in the upper 700 m depth, 28 % (30 %) for the 700–2000 m depth layer and 9 % (8 %) below 2000 m depth. Heat gain over land amounts to 6 % (5 %) over these periods, 4 % (3 %) is available for the melting of grounded and floating ice, and 1 % (2 %) is available for atmospheric warming. Our results also show that EEI is not only continuing, but also increasing: the EEI amounts to 0.87±0.12 W m−2 during 2010–2018. Stabilization of climate, the goal of the universally agreed United Nations Framework Convention on Climate Change (UNFCCC) in 1992 and the Paris Agreement in 2015, requires that EEI be reduced to approximately zero to achieve Earth's system quasi-equilibrium. The amount of CO2 in the atmosphere would need to be reduced from 410 to 353 ppm to increase heat radiation to space by 0.87 W m−2, bringing Earth back towards energy balance. This simple number, EEI, is the most fundamental metric that the scientific community and public must be aware of as the measure of how well the world is doing in the task of bringing climate change under control, and we call for an implementation of the EEI into the global stocktake based on best available science. Continued quantification and reduced uncertainties in the Earth heat inventory can be best achieved through the maintenance of the current global climate observing system, its extension into areas of gaps in the sampling, and the establishment of an international framework for concerted multidisciplinary research of the Earth heat inventory as presented in this study. This Earth heat inventory is published at the German Climate Computing Centre (DKRZ, https://www.dkrz.de/, last access: 7 August 2020) under the DOI https://doi.org/10.26050/WDCC/GCOS_EHI_EXP_v2 (von Schuckmann et al., 2020).
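The abstract's conversion from a cumulative heat gain (358 ZJ over 1971–2018) to a mean heating rate (≈0.47 W m−2) can be reproduced with a one-line calculation. The sketch below is an editorial cross-check; Earth's surface area (≈5.1×10^14 m²) is a standard value and is not stated in the abstract.

```python
# Cross-check: convert 358 ZJ over 1971-2018 into a mean heating rate (W m^-2).
# Earth's surface area (~5.1e14 m^2) is a standard value, not from the abstract.

heat_gain_joules = 358e21           # 358 ZJ = 358 x 10^21 J
years = 2018 - 1971 + 1             # 48-year period (inclusive)
seconds = years * 365.25 * 24 * 3600
earth_surface_m2 = 5.10e14

heating_rate = heat_gain_joules / (seconds * earth_surface_m2)
print(f"Mean heating rate = {heating_rate:.2f} W m^-2")
# Prints ~0.46-0.47 W m^-2 depending on how the period length is counted,
# consistent with the 0.47 +/- 0.1 W m^-2 quoted above.
```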
  5. Designing a Curriculum to Broaden Middle School Students’ Ideas and Interest in Engineering
    As the 21st century progresses, engineers will play critical roles in addressing complex societal problems such as climate change and nutrient pollution. Research has shown that more diverse teams lead to more creative and effective solutions (Smith-Doerr et al., 2017). However, while some progress has been made in increasing the number of women and people of color, 83% of employed engineers are male and 68% of engineers are white (NSF & NCSES, 2019). Traditional K–12 approaches to engineering often emphasize construction using a trial-and-error approach (ASEE, 2020). Although this approach may appeal to some students, it may alienate other students, who then view engineering simply as “building things.” Designing engineering experiences that broaden students’ ideas about engineering may help diversify the students entering the engineering pipeline. To this end, we developed Solving Community Problems with Engineering (SCoPE), an engineering curriculum that engages seventh-grade students in a three-week capstone project focusing on nutrient pollution in their local watershed. SCoPE engages students with the problem through local news articles about nutrient pollution and images of algae-covered lakes, which then drives the investigation into the detrimental processes caused by excess nutrients entering bodies of water from sources such as fertilizer and wastewater. Students research the sources of nutrient pollution and potential solutions, and they use simulations to investigate key variables and optimize strategies for effectively decreasing and managing nutrient pollution as they develop their plans. Throughout the development process, we worked with a middle school STEM teacher to ensure that the unit builds upon the science curriculum and that the activities would be engaging and meaningful to students. The problem and location were chosen to illustrate that engineers can solve problems relevant to rural communities. Since people in rural locations tend to remain very connected to their communities throughout their lives, it is important to illustrate that engineering could be a relevant and viable career near home. The SCoPE curriculum was piloted with two teachers and 147 seventh-grade students in a rural public school. Surveys and student drawings of engineers before and after implementation of the curriculum were used to characterize changes in students’ interest and beliefs about engineering. After completing the SCoPE curriculum, students’ ideas about engineers’ activities and the types of problems they solve were broadened. Students were 53% more likely to believe that engineers can protect the environment and 23% more likely to believe that they can identify problems in the community to solve (p < 0.001). When asked to draw an engineer, students were 1.3 times more likely to include nature/environment/agriculture (p < 0.01) and 3 times more likely to show engineers helping people in the community (p < 0.05). Additionally, while boys’ interest in science and engineering did not significantly change, girls’ interest in engineering and confidence in becoming an engineer significantly increased (Cohen’s d = 0.28, p < 0.05).
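The effect size reported above is Cohen's d. For readers unfamiliar with the statistic, the sketch below computes the common pooled-standard-deviation form for two sets of survey scores; the numbers are invented for illustration and are not the SCoPE survey data.

```python
# Cohen's d (pooled-standard-deviation form) for two sets of survey scores.
# The scores below are invented for illustration; they are not SCoPE data.
import statistics

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (statistics.mean(group_b) - statistics.mean(group_a)) / pooled_sd

pre  = [3.1, 2.8, 3.4, 3.0, 2.9, 3.2, 3.3, 2.7]   # hypothetical pre-unit interest scores
post = [3.4, 3.0, 3.6, 3.3, 3.1, 3.5, 3.4, 3.0]   # hypothetical post-unit interest scores
print(f"Cohen's d = {cohens_d(pre, post):.2f}")
```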
The SCoPE curriculum is available on PBS LearningMedia: https://www.pbslearningmedia.org/collection/solving-community-problems-with-engineering/ This project was funded by NSF through the Division of Engineering Education and Centers, Research in the Formation of Engineers program #202076.
References
American Society for Engineering Education. (2020). Framework for P-12 Engineering Learning. Washington, DC. DOI: 10.18260/1-100-1153
National Science Foundation, National Center for Science and Engineering Statistics. (2019). Women, Minorities, and Persons with Disabilities in Science and Engineering: 2019. Special Report NSF 17-310. Arlington, VA. https://ncses.nsf.gov/pubs/nsf21321/
Smith-Doerr, L., Alegria, S., & Sacco, T. (2017). How Diversity Matters in the US Science and Engineering Workforce: A Critical Review Considering Integration in Teams, Fields, and Organizational Contexts. Engaging Science, Technology, and Society, 3, 139-153.