

Title: Expanding the Scope of Artifact Evaluation at HPC Conferences: Experience of SC21
A scientific paper consists of a constellation of artifacts that extend beyond the document itself: software, hardware, evaluation data and documentation, raw survey results, mechanized proofs, models, test suites, benchmarks, and so on. In some cases, the quality of these artifacts is as important as that of the document itself. Based on the success of the Artifact Evaluation efforts at other systems conferences, the 2021 International Conference for High Performance Computing, Networking, Storage, and Analysis (SC21) organized a comprehensive Artifact Description/Artifact Evaluation (AD/AE) review and competition as part of the SC21 Reproducibility Initiative. This paper summarizes the key findings of the AD/AE effort.
Award ID(s):
1846418
NSF-PAR ID:
10325827
Author(s) / Creator(s):
Date Published:
Journal Name:
Proceedings of Practical Reproducible Evaluation in Computer Systems
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Obeid, Iyad; Selesnick, Ivan (Eds.)
    The Temple University Hospital EEG Corpus (TUEG) [1] is the largest publicly available EEG corpus of its type and currently has over 5,000 subscribers (we currently average 35 new subscribers a week). Several valuable subsets of this corpus have been developed, including the Temple University Hospital EEG Seizure Corpus (TUSZ) [2] and the Temple University Hospital EEG Artifact Corpus (TUAR) [3]. TUSZ contains manually annotated seizure events and has been widely used to develop seizure detection and prediction technology [4]. TUAR contains manually annotated artifacts and has been used to improve machine learning performance on seizure detection tasks [5]. In this poster, we will discuss recent improvements made to both corpora that are creating opportunities to improve machine learning performance.

    Two major concerns were raised when v1.5.2 of TUSZ was released for the Neureka 2020 Epilepsy Challenge: (1) the subjects contained in the training, development (validation), and blind evaluation sets were not mutually exclusive, and (2) high frequency seizures were not accurately annotated in all files.

    Regarding (1), there were 50 subjects in dev, 50 subjects in eval, and 592 subjects in train. There was one subject common to dev and eval, five subjects common to dev and train, and 13 subjects common between eval and train. Though this does not substantially influence performance for the current generation of technology, it could be a problem down the line as technology improves. Therefore, we have rebuilt the partitions of the data so that this overlap was removed. This required augmenting the evaluation and development sets with new subjects that had not been previously annotated so that the size of these subsets remained approximately the same. Since these annotations were done by a new group of annotators, special care was taken to make sure the new annotators followed the same practices as the previous generations of annotators. Part of our quality control process was to have the new annotators review all previous annotations. This rigorous training, coupled with a strict quality control process in which annotators review a significant amount of each other's work, ensured high interrater agreement between the two groups (kappa statistic greater than 0.8) [6].

    In the process of reviewing this data, we also decided to split long files into a series of smaller segments to facilitate processing of the data. Some subscribers found it difficult to process long files using Python code, which tends to be very memory intensive. We also found it inefficient to manipulate these long files in our annotation tool. In this release, the maximum duration of any single file is limited to 60 minutes. This increased the number of EDF files in the dev set from 1012 to 1832.

    Regarding (2), as part of discussions of several issues raised by a few subscribers, we discovered that some files only had low frequency epileptiform events annotated (defined as events that ranged in frequency from 2.5 Hz to 3 Hz), while others had events annotated that contained significant frequency content above 3 Hz. Though there were not many files with this type of activity, it was enough of a concern to necessitate reviewing the entire corpus. An example of an epileptiform seizure event with frequency content higher than 3 Hz is shown in Figure 1. Annotating these additional events slightly increased the number of seizure events: in v1.5.2 there were 673 seizures, while in v1.5.3 there are 1239 events.
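    The subject-exclusivity requirement motivating the rebuilt partitions is straightforward to verify programmatically. Below is a minimal sketch of such a check (illustrative only, not TUSZ tooling; the path convention in `subject_of` is a hypothetical stand-in for however subject IDs are encoded in a given corpus layout):

```python
from itertools import combinations

def subject_of(edf_path: str) -> str:
    """Extract the subject ID from a file path.

    Hypothetical convention: paths look like .../<subject_id>/<session>/<file>.edf
    (a real corpus layout may differ).
    """
    return edf_path.split("/")[-3]

def check_subject_exclusivity(partitions: dict[str, list[str]]) -> bool:
    """Return True if no subject appears in more than one partition."""
    subjects = {name: {subject_of(p) for p in paths}
                for name, paths in partitions.items()}
    ok = True
    for a, b in combinations(subjects, 2):
        overlap = subjects[a] & subjects[b]
        if overlap:
            print(f"{a}/{b} share {len(overlap)} subject(s): {sorted(overlap)}")
            ok = False
    return ok
```

    Run against v1.5.2-style partitions, a check like this would flag the one dev/eval, five dev/train, and 13 eval/train shared subjects described above.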
    One of the fertile areas for technology improvement is artifact reduction. Artifacts and slowing constitute the two major error modalities in seizure detection [3]. This was a major reason we developed TUAR. It can be used to evaluate artifact detection and suppression technology as well as multimodal background models that explicitly model artifacts. An issue with TUAR was the practicality of the annotation tags used when there are multiple simultaneous events. An example of such an event is shown in Figure 2. In this section of the file, there is an overlap of eye movement, electrode artifact, and muscle artifact events. We previously annotated such events using a convention that included annotating background along with any artifact that is present. The artifacts present would either be annotated with a single tag (e.g., MUSC) or a coupled artifact tag (e.g., MUSC+ELEC). When multiple channels have background, the tags become crowded and difficult to identify. This is one reason we now support a hierarchical annotation format using XML – annotations can be arbitrarily complex and support overlaps in time. Our annotators also reviewed specific eye movement artifacts (e.g., eye flutter, eyeblinks). Eye movements are often mistaken for seizures due to their similar morphology [7][8]. We have improved our understanding of ocular events, which has allowed us to annotate artifacts in the corpus more carefully.

    In this poster, we will present statistics on the newest releases of these corpora and discuss the impact these improvements have had on machine learning research. We will compare TUSZ v1.5.3 and TUAR v2.0.0 with previous versions of these corpora. We will release v1.5.3 of TUSZ and v2.0.0 of TUAR in Fall 2021 prior to the symposium.

    ACKNOWLEDGMENTS: Research reported in this publication was most recently supported by the National Science Foundation's Industrial Innovation and Partnerships (IIP) Research Experience for Undergraduates award number 1827565. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the official views of any of these organizations.

    REFERENCES
    [1] I. Obeid and J. Picone, "The Temple University Hospital EEG Data Corpus," in Augmentation of Brain Function: Facts, Fiction and Controversy. Volume I: Brain-Machine Interfaces, 1st ed., vol. 10, M. A. Lebedev, Ed. Lausanne, Switzerland: Frontiers Media S.A., 2016, pp. 394–398. https://doi.org/10.3389/fnins.2016.00196.
    [2] V. Shah et al., "The Temple University Hospital Seizure Detection Corpus," Frontiers in Neuroinformatics, vol. 12, pp. 1–6, 2018. https://doi.org/10.3389/fninf.2018.00083.
    [3] A. Hamid et al., "The Temple University Artifact Corpus: An Annotated Corpus of EEG Artifacts," in Proceedings of the IEEE Signal Processing in Medicine and Biology Symposium (SPMB), 2020, pp. 1–3. https://ieeexplore.ieee.org/document/9353647.
    [4] Y. Roy, R. Iskander, and J. Picone, "The Neureka 2020 Epilepsy Challenge," NeuroTechX, 2020. [Online]. Available: https://neureka-challenge.com/. [Accessed: 01-Dec-2021].
    [5] S. Rahman, A. Hamid, D. Ochal, I. Obeid, and J. Picone, "Improving the Quality of the TUSZ Corpus," in Proceedings of the IEEE Signal Processing in Medicine and Biology Symposium (SPMB), 2020, pp. 1–5. https://ieeexplore.ieee.org/document/9353635.
    [6] V. Shah, E. von Weltin, T. Ahsan, I. Obeid, and J. Picone, "On the Use of Non-Experts for Generation of High-Quality Annotations of Seizure Events," Available: https://www.isip.piconepress.com/publications/unpublished/journals/2019/elsevier_cn/ira. [Accessed: 01-Dec-2021].
    [7] D. Ochal, S. Rahman, S. Ferrell, T. Elseify, I. Obeid, and J. Picone, "The Temple University Hospital EEG Corpus: Annotation Guidelines," Philadelphia, Pennsylvania, USA, 2020. https://www.isip.piconepress.com/publications/reports/2020/tuh_eeg/annotations/.
    [8] D. Strayhorn, "The Atlas of Adult Electroencephalography," EEG Atlas Online, 2014. [Online].
  2. Research prior to 2005 found that no single framework existed that could capture the engineering design process fully or well and benchmark each element of the process to a commonly accepted set of referenced artifacts. Compounding the construction of a stepwise, artifact-driven framework is that engineering design is typically practiced over time as a complex and iterative process. For both novice and advanced students, learning and applying the design process is often cumulative, with many informal and formal programmatic opportunities to practice essential elements.

    The Engineering Design Process Portfolio Scoring Rubric (EDPPSR) was designed to apply to any portfolio intended to document an individual- or team-driven process leading to an original attempt to design a product, process, or method that provides the best and most optimal solution to a genuine and meaningful problem. In essence, the portfolio should be a detailed account or "biography" of a project and the thought processes that inform that project. Besides narrative and explanatory text, entries may include (but need not be limited to) drawings, schematics, photographs, notebook and journal entries, transcripts or summaries of conversations and interviews, and audio/video recordings. Such entries are likely to be necessary in order to convey accurately and completely the complex thought processes behind the planning, implementation, and self-evaluation of the project. The rubric comprises four main components, each in turn comprising three elements, and each element has its own holistic rubric.

    The process by which the EDPPSR was created gives evidence of the relevance and representativeness of the rubric and helps to establish validity. The EDPPSR model as originally rendered has a strong theoretical foundation, as it was developed by reference to the literature on the steps of the design process, through focus groups, and through expert review by teachers, faculty, and researchers in performance-based, portfolio rubrics and assessments. Using the unified construct validity framework, the EDPPSR's validity was further established through expert reviewers (experts in engineering design) providing evidence supporting the content relevance and representativeness of the EDPPSR in representing the basic process of engineering design.

    This manuscript offers empirical evidence that supports the use of the EDPPSR model to evaluate student design-based projects in a reliable and valid manner. Intraclass correlation coefficients (ICC) were calculated to determine the inter-rater reliability (IRR) of the rubric. Given the small sample size, we also examined 95% confidence intervals to provide a range of values in which the estimate of inter-rater reliability is likely contained.
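    For readers unfamiliar with the statistic, the sketch below computes ICC(2,1), the two-way random-effects, absolute-agreement, single-rater intraclass correlation, from a portfolios-by-raters score matrix. This is a generic Shrout–Fleiss formulation offered for illustration; it is not the authors' analysis code, the confidence-interval procedure used in the study is not reproduced, and the example ratings are invented:

```python
import numpy as np

def icc_2_1(scores: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    scores: (n_targets, k_raters) matrix, every target rated by every rater.
    """
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)   # per-target (portfolio) means
    col_means = scores.mean(axis=0)   # per-rater means

    # Mean squares from the two-way ANOVA decomposition.
    ms_rows = k * np.sum((row_means - grand) ** 2) / (n - 1)
    ms_cols = n * np.sum((col_means - grand) ** 2) / (k - 1)
    ss_err = (np.sum((scores - grand) ** 2)
              - k * np.sum((row_means - grand) ** 2)
              - n * np.sum((col_means - grand) ** 2))
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

# Invented example: 6 portfolios scored by 3 raters on one rubric element.
ratings = np.array([[3, 3, 2],
                    [4, 4, 4],
                    [2, 2, 3],
                    [4, 5, 4],
                    [1, 1, 1],
                    [3, 4, 3]])
print(f"ICC(2,1) = {icc_2_1(ratings):.3f}")
```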
  3. Peresani, Marco (Ed.)
    Investigations of organic lithic micro-residues have, over the last decade, shifted from entirely morphological observations using visible-light microscopy to compositional ones using scanning electron microscopy and Fourier-transform infrared microspectroscopy, providing a seemingly objective chemical basis for residue identifications. Contamination, though, remains a problem that can affect these results. Modern contaminants, accumulated during the post-excavation lives of artifacts, are pervasive, subtle, and even "invisible" (unlisted ingredients in common lab products). Ancient contamination is a second issue. The aim of residue analysis is to recognize residues related to use, but other types of residues can also accumulate on artifacts. Caves are subject to various taphonomic forces and organic inputs, and use-related residues can degrade into secondary compounds. This organic "background noise" must be taken into consideration.

    Here we show that residue contamination is more pervasive than is often appreciated, as revealed by our studies of Middle Palaeolithic artifacts from two sites: Lusakert Cave 1 in Armenia and Crvena Stijena in Montenegro. First, we explain how artifacts from Lusakert Cave 1, despite being handled following specialized protocols, were tainted by a modern-day contaminant from an unanticipated source: a release agent used inside the zip-top bags that are ubiquitous in the field and lab. Second, we document that, when non-artifact "controls" are studied alongside artifacts from Crvena Stijena, comparisons reveal that organic residues adhere to both, indicating that they are prevalent throughout the sediments and not necessarily related to use.

    We provide suggestions for reducing contamination and increasing the reliability of residue studies. Ultimately, we propose that archaeologists working in the field of residue studies must start with the null hypothesis that minuscule organic residues reflect contamination, either ancient or modern, and systematically proceed to rule out all possible contaminants before interpreting them as evidence of an artifact's use in the distant past.
  4. Barkai, Ran (Ed.)

    Recycling behaviors are becoming increasingly recognized as important parts of the production and use of stone tools in the Paleolithic. Yet there are still no well-defined expectations for how recycling affects the appearance of the archaeological record across landscapes. Using an agent-based model of recycling in surface contexts, this study looks at how the archaeological record changes under different conditions of recycling frequency, occupational intensity, mobility, and artifact selection. The simulations show that while an increased number of recycled artifacts across a landscape does indicate the occurrence of more scavenging and recycling behaviors generally, the locations with large numbers of recycled artifacts are not necessarily where the scavenging itself happened. This is particularly true when mobility patterns mean each foraging group spends more time moving around the landscape. The results of the simulations also demonstrate that recycled artifacts are typically those that have been exposed longer in surface contexts, confirming hypothesized relationships between recycling and exposure. In addition to these findings, the recycling simulation shows how archaeological record formation due to recycling behaviors is affected by mobility strategies and selection preferences. While only a simplified model of recycling behaviors, these simulations give us insight into how to better interpret recycling behaviors from the archaeological record, specifically by demonstrating the importance of contextualizing the occurrence of recycled artifacts at a wider, landscape-level scale.
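    To make the modeled mechanism concrete, here is a deliberately minimal toy version of such an agent-based model (an illustrative sketch, not the published model; all parameter names and values are invented). Foraging groups random-walk across a gridded landscape, discard freshly made artifacts, and occasionally scavenge and rework ("recycle") artifacts already exposed on the surface, preferring long-exposed pieces:

```python
import random

GRID, STEPS, GROUPS = 20, 500, 5      # invented landscape and run parameters
P_DISCARD, P_RECYCLE = 0.3, 0.1       # invented behavioral probabilities

class Artifact:
    def __init__(self, t):
        self.deposited = t            # time step of (re)deposition on the surface
        self.recycled = 0             # number of times scavenged and reworked

def run(seed=0):
    rng = random.Random(seed)
    surface = {}                      # (x, y) cell -> list of Artifacts
    groups = [(rng.randrange(GRID), rng.randrange(GRID)) for _ in range(GROUPS)]
    for t in range(STEPS):
        for i, (x, y) in enumerate(groups):
            cell = surface.setdefault((x, y), [])
            # Scavenge: rework an already-exposed artifact, preferring long exposure.
            if cell and rng.random() < P_RECYCLE:
                art = max(cell, key=lambda a: t - a.deposited)
                art.recycled += 1
                art.deposited = t     # reworking resets its surface-exposure clock
            # Discard a freshly made artifact in the current cell.
            if rng.random() < P_DISCARD:
                cell.append(Artifact(t))
            # Mobility: simple random walk with wraparound.
            groups[i] = ((x + rng.choice((-1, 0, 1))) % GRID,
                         (y + rng.choice((-1, 0, 1))) % GRID)
    return surface

surface = run()
all_arts = [a for arts in surface.values() for a in arts]
recycled = [a for a in all_arts if a.recycled]
print(f"{len(recycled)} of {len(all_arts)} artifacts were recycled at least once")
```

    Varying P_RECYCLE, GROUPS, and the movement rule would be the toy analogues of the recycling-frequency, occupational-intensity, and mobility conditions explored in the study.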

     
  5. Abstract

    We have designed, built, tested, and deployed a novel device to extract porewater from deep-sea sediments in situ, constructed to work with a standard multicorer. Despite the importance of porewater measurements for numerous applications, many sampling artifacts can bias data and interpretation during traditional porewater processing from shipboard-processed cores. A well-documented artifact occurs in deep-sea porewater when carbonate precipitates during core recovery as a function of temperature and pressure changes, while porewater is in contact with sediment grains before filtration, thereby lowering porewater alkalinity and dissolved inorganic carbon (DIC). Here, we present a novel device built to obviate these sampling artifacts by filtering porewater in situ on the seafloor, with a focus on cm-scale resolution near the sediment–water interface, to obtain accurate porewater profiles. We document 1–10% alkalinity loss in shipboard-processed sediment cores compared to porewater filtered in situ at depths of 1600–3200 m. We also show that alkalinity loss is a function of both weight % sedimentary CaCO3 and water column depth. The average ratio of alkalinity loss to DIC loss in shipboard-processed sediment cores relative to in situ porewater is 2.2, consistent with the signal expected from carbonate precipitation. In addition to collecting porewater for defining natural profiles, we also conducted the first in situ dissolution experiments within the sediment column using isotopically labeled calcite. We present evidence of successful deployments of this device on and adjacent to the Cocos Ridge in the Eastern Equatorial Pacific across a range of depths and calcite saturation states.
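    As a back-of-the-envelope consistency check (ours, not drawn from the abstract): precipitating one mole of CaCO3 consumes one mole of carbonate ion, which removes two equivalents of alkalinity but only one mole of DIC, so precipitation alone predicts a loss ratio of 2, close to the observed average of 2.2:

```latex
% Effect of CaCO3 precipitation on porewater alkalinity and DIC
\mathrm{Ca^{2+} + CO_3^{2-} \;\longrightarrow\; CaCO_3(s)}
\qquad
\frac{\Delta\mathrm{Alk}}{\Delta\mathrm{DIC}}
  = \frac{-2\,n_{\mathrm{CaCO_3}}}{-1\,n_{\mathrm{CaCO_3}}} = 2
```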

     