Search: All records where Creators/Authors contains "Yang, Lishan"

  1. Data center downtime typically centers on IT equipment failure, and storage devices are the most frequently failing components in data centers. We present a comparative study of the hard disk drives (HDDs) and solid state drives (SSDs) that constitute typical data center storage. Using six years of field data on 100,000 HDDs of several models from a single manufacturer (the Backblaze dataset) and six years of field data on 30,000 SSDs of three models from a Google data center, we characterize the workload conditions that lead to failures. We illustrate that the root causes of failure differ from common expectations and remain difficult to discern. For HDDs, we observe that young and old drives do not differ much in how they fail; instead, failing drives can be distinguished by the time they spend on head positioning. For SSDs, we observe high levels of infant mortality and characterize the differences between infant and non-infant failures. We develop several machine learning failure prediction models that prove surprisingly accurate, achieving high recall and low false positive rates. These models serve beyond simple prediction: they help us untangle the complex interaction of workload characteristics that leads to failures and identify failure root causes from monitored symptoms.
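The prediction pipeline described in this record lends itself to a compact illustration. Below is a minimal sketch, not the authors' code: it assumes a hypothetical per-drive-day telemetry CSV with invented SMART-style column names, trains a standard classifier, and reports the recall and false-positive rate that the study optimizes for.

```python
# Minimal sketch of a drive-failure prediction model in the spirit of the
# study above. The file name, column names, and labeling scheme are all
# illustrative assumptions, not taken from the paper.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# Hypothetical daily snapshots: one row per drive-day, fails_soon=1 if the
# drive fails within some lookahead window.
df = pd.read_csv("drive_days.csv")
features = ["reallocated_sectors", "seek_error_rate", "power_on_hours",
            "read_error_rate", "temperature"]  # assumed column names
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["fails_soon"], test_size=0.3, stratify=df["fails_soon"])

clf = RandomForestClassifier(n_estimators=200, class_weight="balanced")
clf.fit(X_train, y_train)

tn, fp, fn, tp = confusion_matrix(y_test, clf.predict(X_test)).ravel()
print("recall:", tp / (tp + fn))              # fraction of failures caught
print("false positive rate:", fp / (fp + tn))

# Feature importances hint at which monitored symptoms drive failures,
# echoing the paper's use of models to untangle root causes.
for name, imp in sorted(zip(features, clf.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```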
  2. Mourlas, Costas; Pacheco, Diego; Pandi, Catia (Eds.)
    We present an individual-centric agent-based model and a flexible tool, GeoSpread, for studying and predicting the spread of viruses and diseases in urban settings. Using COVID-19 data collected by the Korea Centers for Disease Control and Prevention (KCDC), we analyze patient and route data of infected people from January 20, 2020, to May 31, 2020, and discover how infection clusters develop as a function of time. This analysis offers a statistical characterization of population mobility and is used to parameterize GeoSpread to capture the spread of the disease. We validate simulation predictions from GeoSpread against ground truth and evaluate different what-if countermeasure scenarios to illustrate the usefulness and flexibility of the tool for epidemic modeling.
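The mechanics of an individual-centric agent-based spread model can be sketched in a few lines. The following toy illustration is in the spirit of the description above, not GeoSpread itself; the mobility rule, transmission probability, and recovery time are invented placeholders rather than the KCDC-derived parameters the paper uses.

```python
# Toy individual-centric agent-based epidemic model. All parameters are
# illustrative assumptions, not GeoSpread's calibrated values.
import random

N_AGENTS, N_LOCATIONS, DAYS = 1000, 50, 60
P_TRANSMIT = 0.05          # assumed per-contact daily transmission probability
RECOVERY_DAYS = 14         # assumed infectious period

agents = [{"loc": random.randrange(N_LOCATIONS),
           "state": "S", "days_infected": 0} for _ in range(N_AGENTS)]
agents[0]["state"] = "I"   # seed one infected agent

for day in range(DAYS):
    # Mobility: each agent relocates with some probability (a stand-in for
    # the empirical route data that parameterizes the real model).
    for a in agents:
        if random.random() < 0.3:
            a["loc"] = random.randrange(N_LOCATIONS)
    # Transmission among co-located agents.
    by_loc = {}
    for a in agents:
        by_loc.setdefault(a["loc"], []).append(a)
    for group in by_loc.values():
        n_inf = sum(1 for a in group if a["state"] == "I")
        for a in group:
            if a["state"] == "S" and random.random() < 1 - (1 - P_TRANSMIT) ** n_inf:
                a["state"] = "I"
    # Recovery after a fixed infectious period.
    for a in agents:
        if a["state"] == "I":
            a["days_infected"] += 1
            if a["days_infected"] >= RECOVERY_DAYS:
                a["state"] = "R"
    print(day, sum(a["state"] == "I" for a in agents))  # daily infected count
```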
  3. Agrobacterium-mediated plant transformation (AMT) is the basis of modern-day plant biotechnology. One major drawback of this technology is the recalcitrance of many plant species/varieties to Agrobacterium infection, most likely caused by elicitation of plant defense responses. Here, we develop a strategy to increase AMT by engineering Agrobacterium tumefaciens to express a type III secretion system (T3SS) from Pseudomonas syringae and individually deliver the P. syringae effectors AvrPto, AvrPtoB, or HopAO1 to suppress host defense responses. Using the engineered Agrobacterium, we demonstrate an increase in AMT of wheat, alfalfa, and switchgrass by ~250%–400%. We also show that engineered A. tumefaciens expressing a T3SS can deliver a plant protein, histone H2A-1, to enhance AMT. This strategy is of great significance to both basic research and agricultural biotechnology for transient and stable transformation of recalcitrant plant species/varieties and for delivering proteins into plant cells in a non-transgenic manner.
  4. As Graphics Processing Units (GPUs) become a de facto solution for accelerating a wide range of applications, their reliable operation grows increasingly important. One of the major challenges in GPU reliability is to accurately measure GPGPU application error resilience. The challenge stems from the fact that a typical GPGPU application spawns a huge number of threads and utilizes a large amount of potentially unreliable compute and memory resources on the GPU. As the number of possible fault locations can be in the billions, evaluating every fault and examining its effect on the application error resilience is impractical; instead, resilience is evaluated via extensive fault injection campaigns that sample the vast fault site space. Typically, the larger the input of the GPGPU application, the longer the experimental campaign. In this work, we devise a methodology, SUGAR (Speeding Up GPGPU Application Resilience Estimation with input sizing), that dramatically speeds up the evaluation of GPGPU application error resilience through judicious input sizing. We show that analyzing a small fraction of the input is sufficient to estimate application resilience with high accuracy and dramatically reduce the duration of experimentation. Key to our estimation methodology is the discovery of repeating patterns as a function of the input size. Using the well-established fact that error resilience in GPGPU applications is mostly determined by the dynamic instruction count at the thread level, we identify the patterns that allow us to accurately predict application error resilience for arbitrarily large inputs. For the cases examined in this paper, this estimation mechanism provides significant speedups (up to 1,336 times, and 97 times on average) while keeping estimation errors below 1%.
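The input-sizing idea can be made concrete. The sketch below assumes, as the abstract states, that per-thread resilience is largely determined by dynamic instruction count and that the mix of counts repeats predictably as the input grows; the profiler and fault injector are stubbed placeholders, not SUGAR's actual components.

```python
# Sketch of input-sizing resilience estimation: (1) profile a small input,
# (2) fault-inject a few representatives per instruction-count group,
# (3) re-weight those results by the group populations of a larger input.
# The pattern and the injection outcomes below are invented placeholders.
from collections import Counter

def profile_threads(input_size):
    """Hypothetical profiler: dynamic instruction count per thread.
    Here the counts follow a fixed repeating pattern that scales with
    input size, mirroring the repeating patterns the paper discovers."""
    pattern = [120, 120, 300, 80]          # assumed repeating count pattern
    return [pattern[i % len(pattern)] for i in range(input_size)]

def inject_faults(inst_count, trials=100):
    """Hypothetical fault-injection campaign for one representative thread;
    returns the observed masked (benign) outcome fraction. Stubbed here."""
    return 0.9 if inst_count < 200 else 0.6   # placeholder outcome

# Steps 1-2: measure resilience per instruction-count group on a small input.
small_counts = Counter(profile_threads(64))
masked_frac = {c: inject_faults(c) for c in small_counts}

# Step 3: predict resilience for a much larger input from its group mix
# alone, with no additional injections.
large_counts = Counter(profile_threads(1_000_000))
total = sum(large_counts.values())
predicted = sum(masked_frac[c] * n / total for c, n in large_counts.items())
print(f"predicted masked fraction at large input: {predicted:.3f}")
```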
  5. Graphics Processing Units (GPUs) have rapidly evolved to enable energy-efficient data-parallel computing for a broad range of scientific areas. While GPUs achieve exascale-class performance at a stringent power budget, they are also susceptible to soft errors, often caused by high-energy particle strikes, that can significantly affect application output quality. Understanding the resilience of general purpose GPU applications is the purpose of this study. To this end, it is imperative to explore the range of application outputs by injecting faults at all potential fault sites. This problem is especially challenging because, unlike CPU applications, which are mostly single-threaded, GPGPU applications can contain hundreds to thousands of threads, resulting in a tremendously large fault site space, on the order of billions even for some simple applications. In this paper, we present a systematic way to progressively prune the fault site space, aiming to dramatically reduce the number of fault injections so that assessing GPGPU application error resilience becomes practical. The key insight behind our methodology is that GPGPU applications spawn many threads, yet large numbers of them execute the same set of instructions; many fault sites are therefore redundant and can be pruned by a careful analysis of faults across threads and instructions. We identify important features across a set of 10 applications (16 kernels) from the Rodinia and Polybench suites and conclude that threads can first be classified by the number of dynamic instructions they execute. We achieve significant fault site reduction by analyzing only a small subset of threads that are representative of the dynamic instruction behavior (and therefore the error resilience behavior) of the GPGPU applications. Further pruning is achieved by identifying and analyzing: a) the dynamic instruction commonalities (and differences) across code blocks within this representative set of threads, b) a subset of loop iterations within the representative threads, and c) a subset of destination register bit positions. These steps reduce the number of fault sites by up to seven orders of magnitude, yet the reduced fault site space still accurately captures the error resilience profile of GPGPU applications.
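The first pruning step, classifying threads by dynamic instruction count and injecting faults only into one representative per class, can be sketched as follows. The thread counts and grouping rule here are illustrative assumptions, not the authors' implementation.

```python
# Sketch of thread-level fault site pruning: group threads by dynamic
# instruction count and keep one fault-injection target per group, scaling
# its results back up by the group size. Inputs below are invented.
from collections import defaultdict

def prune_by_instruction_count(thread_inst_counts):
    """thread_inst_counts: {thread_id: dynamic instruction count}.
    Returns {count: (representative_thread_id, group_size)}; the group
    size is the weight needed to extrapolate the representative's
    injection results to the whole group."""
    groups = defaultdict(list)
    for tid, count in thread_inst_counts.items():
        groups[count].append(tid)
    return {count: (tids[0], len(tids)) for count, tids in groups.items()}

# Hypothetical kernel: 10,000 threads but only 3 distinct dynamic
# instruction behaviors, mimicking the redundancy the paper exploits.
counts = {tid: [150, 150, 150, 40][tid % 4] if tid % 100 else 900
          for tid in range(10_000)}
reps = prune_by_instruction_count(counts)
print("fault-injection targets:", len(reps), "of", len(counts), "threads")
for count, (tid, weight) in sorted(reps.items()):
    print(f"inst_count={count}: inject into thread {tid}, weight {weight}")
```

Further pruning along code blocks, loop iterations, and register bit positions would follow the same pattern: identify an equivalence class, inject into one member, and reuse the result for the rest.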