skip to main content


The NSF Public Access Repository (NSF-PAR) system and access will be unavailable from 11:00 PM ET on Thursday, June 13 until 2:00 AM ET on Friday, June 14 due to maintenance. We apologize for the inconvenience.

Search for: All records

Creators/Authors contains: "Chen, Hsinchun"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Automated monitoring of dark web (DW) platforms on a large scale is the first step toward developing proactive Cyber Threat Intelligence (CTI). While there are efficient methods for collecting data from the surface web, large-scale dark web data collection is often hindered by anti-crawling measures. In particular, text-based CAPTCHA serves as the most prevalent and prohibiting type of these measures in the dark web. Text-based CAPTCHA identifies and blocks automated crawlers by forcing the user to enter a combination of hard-to-recognize alphanumeric characters. In the dark web, CAPTCHA images are meticulously designed with additional background noise and variable character length to prevent automated CAPTCHA breaking. Existing automated CAPTCHA breaking methods have difficulties in overcoming these dark web challenges. As such, solving dark web text-based CAPTCHA has been relying heavily on human involvement, which is labor-intensive and time-consuming. In this study, we propose a novel framework for automated breaking of dark web CAPTCHA to facilitate dark web data collection. This framework encompasses a novel generative method to recognize dark web text-based CAPTCHA with noisy background and variable character length. To eliminate the need for human involvement, the proposed framework utilizes Generative Adversarial Network (GAN) to counteract dark web background noise and leverages an enhanced character segmentation algorithm to handle CAPTCHA images with variable character length. Our proposed framework, DW-GAN, was systematically evaluated on multiple dark web CAPTCHA testbeds. DW-GAN significantly outperformed the state-of-the-art benchmark methods on all datasets, achieving over 94.4% success rate on a carefully collected real-world dark web dataset. We further conducted a case study on an emergent Dark Net Marketplace (DNM) to demonstrate that DW-GAN eliminated human involvement by automatically solving CAPTCHA challenges with no more than three attempts. Our research enables the CTI community to develop advanced, large-scale dark web monitoring. We make DW-GAN code available to the community as an open-source tool in GitHub. 
    more » « less
  2. Federal funding agencies and industry entities are seeking innovative approaches to address the ever-growing cybersecurity crisis. Increasingly, numerous cybersecurity thought leaders are indicating that Artificial Intelligence (AI)-enabled analytics can help tackle key cybersecurity tasks and deploy defenses. This half-day workshop, co-located with ACM KDD, sought to attain significant research contributions to various aspects of AI-enabled analytics for cybersecurity applications and deployable defense solutions from academics and practitioners. This workshop was a joint workshop of the 2021 AI-enabled Cybersecurity Analytics and 2021 International Workshop on Deployable Machine Learning for Security Defense. As such, we developed an interdisciplinary Program Committee with significant experience in various aspects of AI, cybersecurity, and/or deployable defense. 
    more » « less
  3. Rich, diverse cybersecurity data are critical for efforts by the intelligence and security informatics (ISI) community. Although open-access data repositories (OADRs) provide tremendous benefits for ISI researchers and practitioners, determinants of their adoption remain understudied. Drawing on affordance theory and extant ISI literature, this study proposes a factor model to explain how the essential and unique affordances of an OADR (i.e., relevance, accessibility, and integration) affect individual professionals' intentions to use and collaborate with AZSecure, a major OADR. A survey study designed to test the model and hypotheses reveals that the effects of affordances on ISI professionals' intentions to use and collaborate are mediated by perceived usefulness and ease of use, which then jointly determine their perceived value. This study advances ISI research by specifying three important affordances of OADRs; it also contributes to extant technology adoption literature by scrutinizing and affirming the interplay of essential user acceptance and value perceptions to explain ISI professionals' adoptions of OADRs. 
    more » « less
  4. Hacker forums provide malicious actors with a large database of tutorials, goods, and assets to leverage for cyber-attacks. Careful research of these forums can provide tremendous benefit to the cybersecurity community through trend identification and exploit categorization. This study aims to provide a novel static word embedding, Hack2Vec, to improve performance on hacker forum classification tasks. Our proposed Hack2Vec model distills contextual representations from the seminal pre-trained language model BERT to a continuous bag-of-words model to create a highly targeted hacker forum static word embedding. The results of our experimental design indicate that Hack2Vec improves performance over prominent embeddings in accuracy, precision, recall, and F1-score for a benchmark hacker forum classification task. 
    more » « less
  5. null (Ed.)
    Events such as Facebook-Cambridge Analytica scandal and data aggregation efforts by technology providers have illustrated how fragile modern society is to privacy violations. Internationally recognized entities such as the National Science Foundation (NSF) have indicated that Artificial Intelligence (AI)-enabled models, artifacts, and systems can efficiently and effectively sift through large quantities of data from legal documents, social media, Dark Web sites, and other sources to curb privacy violations. Yet considerable efforts are still required for understanding prevailing data sources, systematically developing AI-enabled privacy analytics to tackle emerging challenges, and deploying systems to address critical privacy needs. To this end, we provide an overview of prevailing data sources that can support AI-enabled privacy analytics; a multi-disciplinary research framework that connects data, algorithms, and systems to tackle emerging AI-enabled privacy analytics challenges such as entity resolution, privacy assistance systems, privacy risk modeling, and more; a summary of selected funding sources to support high-impact privacy analytics research; and an overview of prevailing conference and journal venues that can be leveraged to share and archive privacy analytics research. We conclude this paper with an introduction of the papers included in this special issue. 
    more » « less
  6. null (Ed.)
  7. null (Ed.)
    Cybersecurity has rapidly emerged as a grand societal challenge of the 21st century. Innovative solutions to proactively tackle emerging cybersecurity challenges are essential to ensuring a safe and secure society. Artificial Intelligence (AI) has rapidly emerged as a viable approach for sifting through terabytes of heterogeneous cybersecurity data to execute fundamental cybersecurity tasks, such as asset prioritization, control allocation, vulnerability management, and threat detection, with unprecedented efficiency and effectiveness. Despite its initial promise, AI and cybersecurity have been traditionally siloed disciplines that relied on disparate knowledge and methodologies. Consequently, the AI for Cybersecurity discipline is in its nascency. In this article, we aim to provide an important step to progress the AI for Cybersecurity discipline. We first provide an overview of prevailing cybersecurity data, summarize extant AI for Cybersecurity application areas, and identify key limitations in the prevailing landscape. Based on these key issues, we offer a multi-disciplinary AI for Cybersecurity roadmap that centers on major themes such as cybersecurity applications and data, advanced AI methodologies for cybersecurity, and AI-enabled decision making. To help scholars and practitioners make significant headway in tackling these grand AI for Cybersecurity issues, we summarize promising funding mechanisms from the National Science Foundation (NSF) that can support long-term, systematic research programs. We conclude this article with an introduction of the articles included in this special issue. 
    more » « less
  8. null (Ed.)
  9. null (Ed.)
  10. null (Ed.)
    Cybersecurity experts have appraised the total global cost of malicious hacking activities to be $450 billion annually. Cyber Threat Intelligence (CTI) has emerged as a viable approach to combat this societal issue. However, existing processes are criticized as inherently reactive to known threats. To combat these concerns, CTI experts have suggested proactively examining emerging threats in the vast, international online hacker community. In this study, we aim to develop proactive CTI capabilities by exploring online hacker forums to identify emerging threats in terms of popularity and tool functionality. To achieve these goals, we create a novel Diachronic Graph Embedding Framework (D-GEF). D-GEF operates on a Graph-of-Words (GoW) representation of hacker forum text to generate word embeddings in an unsupervised manner. Semantic displacement measures adopted from diachronic linguistics literature identify how terminology evolves. A series of benchmark experiments illustrate D-GEF's ability to generate higher quality than state-of-the-art word embedding models (e.g., word2vec) in tasks pertaining to semantic analogy, clustering, and threat classification. D-GEF's practical utility is illustrated with in-depth case studies on web application and denial of service threats targeting PHP and Windows technologies, respectively. We also discuss the implications of the proposed framework for strategic, operational, and tactical CTI scenarios. All datasets and code are publicly released to facilitate scientific reproducibility and extensions of this work. 
    more » « less