
Title: Rucio beyond ATLAS: experiences from Belle II, CMS, DUNE, EISCAT3D, LIGO/VIRGO, SKA, XENON
For many scientific projects, data management is an increasingly complicated challenge. The number of data-intensive instruments generating unprecedented volumes of data is growing, and their accompanying workflows are becoming more complex. Their storage and computing resources are heterogeneous and are distributed at numerous geographical locations belonging to different administrative domains and organisations. These locations do not necessarily coincide with the places where data is produced nor where data is stored, analysed by researchers, or archived for safe long-term storage. To fulfil these needs, the data management system Rucio has been developed to allow the high-energy physics experiment ATLAS at the LHC to manage its large volumes of data in an efficient and scalable way. But ATLAS is not alone, and several diverse scientific projects have started evaluating, adopting, and adapting the Rucio system for their own needs. As the Rucio community has grown, many improvements have been introduced, customisations have been added, and many bugs have been fixed. Additionally, new dataflows have been investigated and operational experiences have been documented. In this article we collect the common successes, pitfalls, and oddities that arose in the evaluation efforts of multiple diverse experiments, and compare them with the ATLAS experience. This includes the high-energy physics experiments Belle II and CMS, the neutrino experiment DUNE, the scattering radar experiment EISCAT3D, the gravitational wave observatories LIGO and VIRGO, the SKA radio telescope, and the dark matter search experiment XENON.
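The abstract describes Rucio's model of declarative, rule-based replication across distributed, heterogeneous storage. As a hedged illustration of that model only, the short Python sketch below shows what registering a dataset and requesting replicated copies might look like with the Rucio Python client; the scope, dataset and file names, and the RSE expression are made-up placeholders, and the exact calls available depend on the Rucio version and server configuration.

# Minimal sketch, not taken from the paper: register a dataset, attach
# already-registered files, and ask Rucio to maintain two replicas on
# storage elements matching an RSE expression. The scope "user.jdoe",
# the DID names, and "TIER2_DISK" are hypothetical.
from rucio.client import Client

client = Client()

# A dataset is a named collection of files within a scope.
client.add_dataset(scope="user.jdoe", name="run2020.reco.v1")

# Attach files that were previously registered (e.g. via `rucio upload`).
client.attach_dids(
    scope="user.jdoe",
    name="run2020.reco.v1",
    dids=[{"scope": "user.jdoe", "name": "events_000001.root"}],
)

# Declare the desired state: two replicas on any RSE matching the expression.
# Rucio's daemons then create and maintain the necessary transfers.
client.add_replication_rule(
    dids=[{"scope": "user.jdoe", "name": "run2020.reco.v1"}],
    copies=2,
    rse_expression="TIER2_DISK",
)

Each community would supply its own scopes, storage endpoints (RSEs), and policies; the sketch only illustrates the general declarative pattern.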
Authors:
Editors:
Doglioni, C.; Kim, D.; Stewart, G.A.; Silvestris, L.; Jackson, P.; Kamleh, W.
Award ID(s):
1841475
Publication Date:
NSF-PAR ID:
10213855
Journal Name:
EPJ Web of Conferences
Volume:
245
Page Range or eLocation-ID:
11006
ISSN:
2100-014X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract
    Excessive phosphorus (P) applications to croplands can contribute to eutrophication of surface waters through surface runoff and subsurface (leaching) losses. We analyzed leaching losses of total dissolved P (TDP) from no-till corn, hybrid poplar (Populus nigra × P. maximowiczii), switchgrass (Panicum virgatum), miscanthus (Miscanthus giganteus), native grasses, and restored prairie, all planted in 2008 on former cropland in Michigan, USA. All crops except corn (13 kg P ha−1 year−1) were grown without P fertilization. Biomass was harvested at the end of each growing season except for poplar. Soil water at 1.2 m depth was sampled weekly to biweekly for TDP determination during March–November 2009–2016 …
  2. Obeid, Iyad; Picone, Joseph; Selesnick, Ivan (Eds.)
    The Neural Engineering Data Consortium (NEDC) is developing a large open source database of high-resolution digital pathology images known as the Temple University Digital Pathology Corpus (TUDP) [1]. Our long-term goal is to release one million images. We expect to release the first 100,000 image corpus by December 2020. The data is being acquired at the Department of Pathology at Temple University Hospital (TUH) using a Leica Biosystems Aperio AT2 scanner [2] and consists entirely of clinical pathology images. More information about the data and the project can be found in Shawki et al. [3]. We currently have a National Science Foundation (NSF) planning grant [4] to explore how best the community can leverage this resource. One goal of this poster presentation is to stimulate community-wide discussions about this project and determine how this valuable resource can best meet the needs of the public. The computing infrastructure required to support this database is extensive [5] and includes two HIPAA-secure computer networks, dual petabyte file servers, and Aperio's eSlide Manager (eSM) software [6]. We currently have digitized over 50,000 slides from 2,846 patients and 2,942 clinical cases. There is an average of 12.4 slides per patient and 10.5 slides per case, with one report per case. The data is organized by tissue type as shown below; a hedged sketch of parsing this naming convention appears after this item.
    Filenames:
      tudp/v1.0.0/svs/gastro/000001/00123456/2015_03_05/0s15_12345/0s15_12345_0a001_00123456_lvl0001_s000.svs
      tudp/v1.0.0/svs/gastro/000001/00123456/2015_03_05/0s15_12345/0s15_12345_00123456.docx
    Explanation:
      tudp: root directory of the corpus
      v1.0.0: version number of the release
      svs: the image data type
      gastro: the type of tissue
      000001: six-digit sequence number used to control directory complexity
      00123456: eight-digit patient MRN
      2015_03_05: the date the specimen was captured
      0s15_12345: the clinical case name
      0s15_12345_0a001_00123456_lvl0001_s000.svs: the actual image filename, consisting of a repeat of the case name, a site code (e.g., 0a001), the type and depth of the cut (e.g., lvl0001), and a token number (e.g., s000)
      0s15_12345_00123456.docx: the filename of the corresponding case report
We currently recognize fifteen tissue types in the first installment of the corpus. The raw image data is stored in Aperio's ".svs" format, which is a multi-layered compressed JPEG format [3,7]. Pathology reports containing a summary of how a pathologist interpreted the slide are also provided in a flat text file format. A more complete summary of the demographics of this pilot corpus will be presented at the conference. Another goal of this poster presentation is to share our experiences with the larger community, since many of these details have not been adequately documented in scientific publications. There are quite a few obstacles in collecting this data that have slowed down the process and need to be discussed publicly. Our backlog of slides dates back to 1997, meaning there are a lot that need to be sifted through and discarded for peeling or cracking. Additionally, during scanning a slide can get stuck, stalling a scan session for hours and resulting in a significant loss of productivity. Over the past two years, we have accumulated significant experience with how to scan a diverse inventory of slides using the Aperio AT2 high-volume scanner. We have been working closely with the vendor to resolve many problems associated with the use of this scanner for research purposes.
This scanning project began in January of 2018 when the scanner was first installed. The scanning process was slow at first, since there was a learning curve with how the scanner worked and how to obtain samples from the hospital. From its start date until May of 2019, ~20,000 slides were scanned. In the past six months, from May to November, we have tripled that number and now hold ~60,000 slides in our database. This dramatic increase in productivity was due to additional undergraduate staff members and an emphasis on efficient workflow. The Aperio AT2 scans 400 slides a day, requiring at least eight hours of scan time. The efficiency of these scans can vary greatly. When our team first started, approximately 5% of slides failed the scanning process due to focal point errors. We have been able to reduce that to 1% through a variety of means: (1) best practices regarding daily and monthly recalibrations, (2) tweaking software settings such as the tissue finder parameters, and (3) experience with how to clean and prep slides so they scan properly. Nevertheless, this is not a completely automated process, making it very difficult to reach our production targets. With a staff of three undergraduate workers spending a total of 30 hours per week, we find it difficult to scan more than 2,000 slides per week using a single scanner (400 slides per night × 5 nights per week). The main limitation in achieving this level of production is the lack of a completely automated scanning process; it takes a couple of hours to sort, clean, and load slides. We have streamlined all other aspects of the workflow required to database the scanned slides so that there are no additional bottlenecks.
To bridge the gap between hospital operations and research, we are using Aperio's eSM software. Our goal is to provide pathologists access to high-quality digital images of their patients' slides. eSM is a secure website that holds the images along with their metadata labels, the patient report, and the path to where each image is located on our file server. Although eSM includes significant infrastructure to import slides into the database using barcodes, TUH does not currently support barcode use. Therefore, we manage the data using a mixture of Python scripts and the manual import functions available in eSM. The database and associated tools are based on proprietary formats developed by Aperio, making this another important point of community-wide discussion on how best to disseminate such information.
Our near-term goal for the TUDP Corpus is to release 100,000 slides by December 2020. We hope to continue data collection over the next decade until we reach one million slides. We are creating two pilot corpora using the first 50,000 slides we have collected. The first corpus consists of 500 slides with a marker stain and another 500 without it. This set was designed to let people debug their basic deep learning processing flow on these high-resolution images. We discuss our preliminary experiments on this corpus and the challenges in processing these high-resolution images using deep learning in [3]. We are able to achieve a mean sensitivity of 99.0% for slides with pen marks, and 98.9% for slides without marks, using a multistage deep learning algorithm. While this dataset was very useful in initial debugging, we are in the midst of creating a new, more challenging pilot corpus using actual tissue samples annotated by experts. The task will be to detect ductal carcinoma in situ (DCIS) or invasive breast cancer tissue.
There will be approximately 1,000 images per class in this corpus. Based on the number of features annotated, we can train on a two-class problem of DCIS versus benign, or increase the difficulty by expanding the classes to include DCIS, benign, stroma, pink tissue, non-neoplastic tissue, etc. Those interested in the corpus or in participating in community-wide discussions should join our listserv, nedc_tuh_dpath@googlegroups.com, to be kept informed of the latest developments in this project. You can learn more from our project website: https://www.isip.piconepress.com/projects/nsf_dpath.
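As referenced in item 2 above, the following is a hedged sketch only, not NEDC project tooling, of how the directory layout and filename convention described there could be unpacked programmatically. The function name is illustrative; the example path reuses the values quoted in the abstract.

# Hedged sketch: unpack the TUDP path and filename fields described in item 2.
from pathlib import Path

def parse_tudp_image_path(path_str):
    p = Path(path_str)
    # .../<root>/<version>/<type>/<tissue>/<seq>/<MRN>/<date>/<case>/<file>.svs
    _root, version, data_type, tissue, _seq, mrn, date, case = p.parts[-9:-1]
    # <case>_<site>_<MRN>_<level>_<token>.svs, e.g. 0s15_12345_0a001_00123456_lvl0001_s000.svs
    stem = p.stem.split("_")
    return {
        "version": version,        # e.g. v1.0.0
        "data_type": data_type,    # e.g. svs
        "tissue": tissue,          # e.g. gastro
        "patient_mrn": mrn,        # eight-digit patient MRN
        "capture_date": date,      # e.g. 2015_03_05
        "case": case,              # e.g. 0s15_12345
        "site_code": stem[2],      # e.g. 0a001
        "cut_level": stem[4],      # e.g. lvl0001
        "token": stem[5],          # e.g. s000
    }

print(parse_tudp_image_path(
    "tudp/v1.0.0/svs/gastro/000001/00123456/2015_03_05/0s15_12345/"
    "0s15_12345_0a001_00123456_lvl0001_s000.svs"
))

Because the filename begins with a repeat of the clinical case name, the case recovered from the directory and from the first two underscore-separated fields of the filename can be cross-checked.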
  3. Exploring the effects of solar eclipses on radio wave propagation has been an active area of research since the first experiments conducted in 1912. In the first few decades of ionospheric physics, researchers started to explore the natural laboratory of the upper atmosphere. Solar eclipses offered a rare opportunity to undertake an active experiment. The results stimulated much scientific discussion. Early users of radio noticed that propagation was different during night and day. A solar eclipse provided the opportunity to study this day/night effect with much sharper boundaries than at sunrise and sunset, when gradual changes occur along with temperature changes in the atmosphere and variations in the sun angle. Plots of amplitude time series were hypothesized to indicate the recombination and reionization rates of the ionosphere during and after the eclipse, though not all time-amplitude plots showed the same curve shapes. A few studies used multiple receivers paired with one transmitter for one eclipse, with a 5:1 ratio as the upper bound. In these cases, the signal amplitude plots generated for data received from the five receive sites for one transmitter varied greatly in shape. Examination of the very earliest results shows the difficulty of using a solar eclipse to study propagation; different researchers used different frequencies from different locations at different times. Solar eclipses have been used to study propagation at a range of radio frequencies. For example, the first study in 1912 used a receiver tuned to 5,500 meters, roughly 54.545 kHz. We now have data from solar eclipses at frequencies ranging from VLF through HF, from many different sites with many different eclipse effects. These data have greatly contributed to our understanding of the ionosphere. The solar eclipse over the United States on August 21, 2017 presents an opportunity to have many locations receiving from the same transmitters. Experiments will target VLF, LF, and HF using VLF/LF transmitters, NIST's WWVB time station at 60 kHz, and hams using their HF frequency allocations. This effort involves citizen science, wideband software-defined radios, and the use of the Reverse Beacon Network and WSPRnet to collect eclipse-related data.
  4. Vast volumes of data are produced by today's scientific simulations and advanced instruments. These data cannot be stored and transferred efficiently because of limited I/O bandwidth, network speed, and storage capacity. Error-bounded lossy compression can be an effective method for addressing these issues: not only can it significantly reduce data size, but it can also control the data distortion based on user-defined error bounds. In practice, many scientific applications have specific requirements or constraints for lossy compression in order to guarantee that the reconstructed data are valid for post hoc analysis. For example, some datasets contain irrelevant data that should be isolated in particular, and users often have intuition regarding value ranges, geospatial regions, and other data subsets that are crucial for subsequent analysis. Existing state-of-the-art error-bounded lossy compressors, however, do not consider these constraints during compression, resulting in inferior compression ratios with respect to the user's post hoc analysis, because parts of the data provide little or no value for that analysis. In this work we address this issue by proposing an optimized framework that can preserve diverse constraints during error-bounded lossy compression, e.g., cleaning the irrelevant data, efficiently preserving different precision for multiple value intervals, and allowing users to set diverse precision over both regular and irregular regions. We perform our evaluation on a supercomputer with up to 2,100 cores. Experiments with six real-world applications show that our proposed diverse-constraints-based error-bounded lossy compressor can obtain higher visual quality or data fidelity on reconstructed data with the same or even higher compression ratios compared with the traditional state-of-the-art compressor SZ. Our experiments also demonstrate very good scalability in compression performance compared with the I/O throughput of the parallel file system.
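As context for the error-bounded idea in the item above, here is a toy sketch only; it is not SZ and not the framework proposed in that work. It shows a uniform scalar quantizer whose reconstruction is guaranteed to stay within a user-chosen absolute error bound, with a tighter bound applied to a hypothetical region of interest to mimic the idea of region-dependent precision. The array sizes, bounds, and region below are made up for illustration.

# Toy sketch: error-bounded quantization with a tighter bound on one region.
import numpy as np

def quantize(values, error_bound):
    # A bin width of 2*error_bound keeps every reconstruction within error_bound.
    return np.round(values / (2.0 * error_bound)).astype(np.int64)

def dequantize(codes, error_bound):
    return codes * (2.0 * error_bound)

rng = np.random.default_rng(seed=0)
field = rng.normal(size=10_000)          # stand-in for a simulation variable

coarse_eps = 1e-2                        # default error bound for most points
fine_eps = 1e-4                          # tighter bound for the "important" subset
roi = slice(1_000, 2_000)                # hypothetical region of interest

recon = dequantize(quantize(field, coarse_eps), coarse_eps)
recon[roi] = dequantize(quantize(field[roi], fine_eps), fine_eps)

print("max error overall:", np.abs(recon - field).max())             # within coarse_eps
print("max error in ROI: ", np.abs(recon[roi] - field[roi]).max())   # within fine_eps

In a real compressor the integer codes would additionally be predicted and entropy-coded; the sketch only demonstrates the pointwise error-bound guarantee and the notion of different precision for different data subsets.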