Search for: All records

Creators/Authors contains: "An, S."

  1. Labeling data via rules-of-thumb and minimal label supervision is central to Weak Supervision, a paradigm subsuming subareas of machine learning such as crowdsourced learning and semi-supervised ensemble learning. By using this labeled data to train modern machine learning methods, the cost of acquiring large amounts of hand-labeled data can be ameliorated. Approaches to combining the rules-of-thumb fall into two camps, reflecting different ideologies of statistical estimation. The most common approach, exemplified by the Dawid-Skene model, is based on probabilistic modeling. The other, developed in the work of Balsubramani-Freund and others, is adversarial and game-theoretic. We provide a variety of statistical results for the adversarial approach under log-loss: we characterize the form of the solution, relate it to logistic regression, demonstrate consistency, and give rates of convergence. On the other hand, we find that probabilistic approaches for the same model class can fail to be consistent. Experimental results are provided to corroborate the theoretical results.
    Free, publicly-accessible full text available July 19, 2025
  2. We present a study that examines the effects of guidance on learning about addressing ill-defined problems in undergraduate biology education. Two groups of college students used an online laboratory named VERA to learn about ill-defined ecological phenomena. While one group received guidance, such as giving the learners a specific problem and instruction on problem-solving methods, the other group received minimal guidance. The results indicate that, while performance in a problem-solving task was not different between groups receiving more vs. minimal guidance, the group that received minimal guidance adopted a more exploratory strategy and generated more interesting models of the given phenomena in a problem-solving task.
  3. Virtual laboratories that enable novice scientists to construct, evaluate and revise models of complex systems heavily involve parameter estimation tasks. We seek to understand novice strategies for parameter estimation in model exploration to design better cognitive supports for them. We conducted a study of 50 college students performing a parameter estimation task while exploring an ecological model. We identified three types of behavioral patterns and their underlying cognitive strategies. Specifically, the students used systematic search, problem decomposition and reduction, and global search followed by local search as their cognitive strategies.
  4. Modeling is an important aspect of scientific problem-solving. However, modeling is a difficult cognitive process for novice learners in part due to the high dimensionality of the parameter search space. This work investigates 50 college students’ parameter search behaviors in the context of ecological modeling. The study revealed important differences in behaviors of successful and unsuccessful students in navigating the parameter space. These differences suggest opportunities for future development of adaptive cognitive scaffolds to support different classes of learners.
  5. Citizen scientists have the potential to expand scientific research. The virtual research assistant called VERA empowers citizen scientists to engage in environmental science in two ways. First, it automatically generates simulations based on the conceptual models of ecological phenomena for repeated testing and feedback. Second, it leverages the Encyclopedia of Life biodiversity knowledgebase to support the process of model construction and revision. 
  6. Free, publicly-accessible full text available January 1, 2026
  7. Abstract

    Computing demands for large scientific experiments, such as the CMS experiment at the CERN LHC, will increase dramatically in the next decades. To complement the future performance increases of software running on central processing units (CPUs), explorations of coprocessor usage in data processing hold great potential and interest. Coprocessors are a class of computer processors that supplement CPUs, often improving the execution of certain functions due to architectural design choices. We explore the approach of Services for Optimized Network Inference on Coprocessors (SONIC) and study the deployment of this as-a-service approach in large-scale data processing. In the studies, we take a data processing workflow of the CMS experiment and run the main workflow on CPUs, while offloading several machine learning (ML) inference tasks onto either remote or local coprocessors, specifically graphics processing units (GPUs). With experiments performed at Google Cloud, the Purdue Tier-2 computing center, and combinations of the two, we demonstrate the acceleration of these ML algorithms individually on coprocessors and the corresponding throughput improvement for the entire workflow. This approach can be easily generalized to different types of coprocessors and deployed on local CPUs without decreasing the throughput performance. We emphasize that the SONIC approach enables high coprocessor usage and enables the portability to run workflows on different types of coprocessors.

     
    Free, publicly-accessible full text available December 1, 2025
  8. Abstract

    A measurement is performed of Higgs bosons produced with high transverse momentum ($p_{\mathrm{T}}$) via vector boson or gluon fusion in proton-proton collisions. The result is based on a data set with a center-of-mass energy of 13 TeV collected in 2016–2018 with the CMS detector at the LHC and corresponds to an integrated luminosity of 138 fb$^{-1}$. The decay of a high-$p_{\mathrm{T}}$ Higgs boson to a boosted bottom quark-antiquark pair is selected using large-radius jets and employing jet substructure and heavy-flavor taggers based on machine learning techniques. Independent regions targeting the vector boson and gluon fusion mechanisms are defined based on the topology of two quark-initiated jets with large pseudorapidity separation. The signal strengths for both processes are extracted simultaneously by performing a maximum likelihood fit to data in the large-radius jet mass distribution. The observed signal strengths relative to the standard model expectation are ${4.9}_{-1.6}^{+1.9}$ and ${1.6}_{-1.5}^{+1.7}$ for the vector boson and gluon fusion mechanisms, respectively. A differential cross section measurement is also reported in the simplified template cross section framework.

     
    Free, publicly-accessible full text available December 1, 2025
  9. Abstract

    This paper describes the Combine software package used for statistical analyses by the CMS Collaboration. The package, originally designed to perform searches for a Higgs boson and the combined analysis of those searches, has evolved to become the statistical analysis tool presently used in the majority of measurements and searches performed by the CMS Collaboration. It is not specific to the CMS experiment, and this paper is intended to serve as a reference for users outside of the CMS Collaboration, providing an outline of the most salient features and capabilities. Readers are provided with the possibility to run Combine and reproduce examples provided in this paper using a publicly available container image. Since the package is constantly evolving to meet the demands of ever-increasing data sets and analysis sophistication, this paper cannot cover all details of Combine. However, the online documentation referenced within this paper provides an up-to-date and complete user guide.

     
    Free, publicly-accessible full text available December 1, 2025