NSF PAR Search | NSF Public Access Repository


Retrieval and Structuring Augmented Generation with LLMs for Web Applications

https://doi.org/10.1145/3701716.3715870

Jiao, Yizhu; Ouyang, Siru; Zhong, Ming; Zhang, Yunyi; Ding, Linyi; Zhou, Sizhe; Han, Jiawei (May 2025, ACM)

Free, publicly-accessible full text available May 8, 2026
TELEClass: Taxonomy Enrichment and LLM-Enhanced Hierarchical Text Classification with Minimal Supervision

https://doi.org/10.1145/3696410.3714940

Zhang, Yunyi; Yang, Ruozhen; Xu, Xueqiang; Li, Rui; Xiao, Jinfeng; Shen, Jiaming; Han, Jiawei (April 2025, ACM)

Free, publicly-accessible full text available April 22, 2026
Ontology Enrichment for Effective Fine-grained Entity Typing

https://doi.org/10.1145/3637528.3671857

Ouyang, Siru; Huang, Jiaxin; Pillai, Pranav; Zhang, Yunyi; Zhang, Yu; Han, Jiawei (August 2024, ACM)
Baeza-Yates, Ricardo; Bonchi, Francesco (Ed.)
Fine-grained entity typing (FET) is the task of identifying specific entity types at a fine-grained level for entity mentions based on their contextual information. Conventional methods for FET require extensive human annotation, which is time-consuming and costly given the massive scale of data. Recent studies have been developing weakly supervised or zero-shot approaches. We study the setting of zero-shot FET where only an ontology is provided. However, most existing ontology structures lack rich supporting information and even contain ambiguous relations, making them ineffective in guiding FET. Recently developed language models, though promising in various few-shot and zero-shot NLP tasks, may face challenges in zero-shot FET due to their lack of interaction with task-specific ontology. In this study, we propose OnEFET, where we (1) enrich each node in the ontology structure with two categories of extra information: instance information for training sample augmentation and topic information to relate types with contexts, and (2) develop a coarse-to-fine typing algorithm that exploits the enriched information by training an entailment model with contrasting topics and instance-based augmented training samples. Our experiments show that OnEFET achieves high-quality fine-grained entity typing without human annotation, outperforming existing zero-shot methods by a large margin and rivaling supervised methods. OnEFET also enjoys strong transferability to unseen and finer-grained types. Code is available at https://github.com/ozyyshr/OnEFET.
Full Text Available
Automated Mining of Structured Knowledge from Text in the Era of Large Language Models

https://doi.org/10.1145/3637528.3671469

Zhang, Yunyi; Zhong, Ming; Ouyang, Siru; Jiao, Yizhu; Zhou, Sizhe; Ding, Linyi; Han, Jiawei (August 2024, ACM)
Baeza-Yates, Ricardo; Bonchi, Francesco (Ed.)
Massive amounts of unstructured text data are generated daily, ranging from news articles to scientific papers. How to mine structured knowledge from the text data remains a crucial research question. Recently, large language models (LLMs) have shed light on the text mining field with their superior text understanding and instruction-following ability. There are typically two ways of utilizing LLMs: fine-tune the LLMs with human-annotated training data, which is labor intensive and hard to scale; or prompt the LLMs in a zero-shot or few-shot way, which cannot take advantage of the useful information in the massive text data. Therefore, automated mining of structured knowledge from massive text data remains a challenge in the era of large language models. In this tutorial, we cover the recent advancements in mining structured knowledge using language models with very weak supervision. We will introduce the following topics in this tutorial: (1) introduction to large language models, which serves as the foundation for recent text mining tasks, (2) ontology construction, which automatically enriches an ontology from a massive corpus, (3) weakly-supervised text classification in flat and hierarchical label space, (4) weakly-supervised information extraction, which extracts entity and relation structures.
Full Text Available
Seed-Guided Fine-Grained Entity Typing in Science and Engineering Domains

https://doi.org/10.1609/AAAI.V38I17.29933

Zhang, Yu; Zhang, Yunyi; Shen, Yanzhen; Deng, Yu; Popa, Lucian; Shwartz, Larisa; Zhai, ChengXiang; Han, Jiawei (March 2024, Proceedings of the AAAI Conference on Artificial Intelligence)
Wooldridge, Michael J; Dy, Jennifer G; Natarajan, Sriraam (Ed.)
Accurately typing entity mentions from text segments is a fundamental task for various natural language processing applications. Many previous approaches rely on massive human-annotated data to perform entity typing. Nevertheless, collecting such data in highly specialized science and engineering domains (e.g., software engineering and security) can be time-consuming and costly, not to mention the domain gaps between training and inference data if the model needs to be applied to confidential datasets. In this paper, we study the task of seed-guided fine-grained entity typing in science and engineering domains, which takes the name and a few seed entities for each entity type as the only supervision and aims to classify new entity mentions into both seen and unseen types (i.e., those without seed entities). To solve this problem, we propose SEType, which first enriches the weak supervision by finding more entities for each seen type from an unlabeled corpus using the contextualized representations of pre-trained language models. It then matches the enriched entities to unlabeled text to get pseudo-labeled samples and trains a textual entailment model that can make inferences for both seen and unseen types. Extensive experiments on two datasets covering four domains demonstrate the effectiveness of SEType in comparison with various baselines. Code and data are available at: https://github.com/yuzhimanhua/SEType.
Full Text Available
Taxonomy-guided Semantic Indexing for Academic Paper Search

https://doi.org/10.18653/v1/2024.emnlp-main.407

Kang, SeongKu; Zhang, Yunyi; Jiang, Pengcheng; Lee, Dongha; Han, Jiawei; Yu, Hwanjo (January 2024, Association for Computational Linguistics)

Full Text Available
Unsupervised Story Discovery from Continuous News Streams via Scalable Thematic Embedding

https://doi.org/10.1145/3539618.3591782

Yoon, Susik; Lee, Dongha; Zhang, Yunyi; Han, Jiawei (July 2023, ACM)
Proc. 2023 ACM SIGIR Int. Conf. on Research and Development in Information Retrieval (Ed.)
Unsupervised discovery of stories with correlated news articles in real-time helps people digest massive news streams without expensive human annotations. A common approach of the existing studies for unsupervised online story discovery is to represent news articles with symbolic- or graph-based embedding and incrementally cluster them into stories. Recent large language models are expected to improve the embedding further, but a straightforward adoption of the models by indiscriminately encoding all information in articles is ineffective for dealing with text-rich and evolving news streams. In this work, we propose a novel thematic embedding with an off-the-shelf pretrained sentence encoder to dynamically represent articles and stories by considering their shared temporal themes. To realize the idea for unsupervised online story discovery, a scalable framework USTORY is introduced with two main techniques, theme- and time-aware dynamic embedding and novelty-aware adaptive clustering, fueled by lightweight story summaries. A thorough evaluation with real news data sets demonstrates that USTORY achieves higher story discovery performance than baselines while being robust and scalable to various streaming settings.
Full Text Available
Pretrained Language Representations for Text Understanding: A Weakly-Supervised Perspective

https://doi.org/10.1145/3580305.3599569

Meng, Yu; Huang, Jiaxin; Zhang, Yu; Zhang, Yunyi; Han, Jiawei (August 2023, ACM)

Full Text Available
Weakly Supervised Multi-Label Classification of Full-Text Scientific Papers

https://doi.org/10.1145/3580305.3599544

Zhang, Yu; Jin, Bowen; Chen, Xiusi; Shen, Yanzhen; Zhang, Yunyi; Meng, Yu; Han, Jiawei (August 2023, ACM)
Proc. 2023 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (Ed.)
Instead of relying on human-annotated training samples to build a classifier, weakly supervised scientific paper classification aims to classify papers only using category descriptions (e.g., category names, category-indicative keywords). Existing studies on weakly supervised paper classification are less concerned with two challenges: (1) Papers should be classified into not only coarse-grained research topics but also fine-grained themes, and potentially into multiple themes, given a large and fine-grained label space; and (2) full text should be utilized to complement the paper title and abstract for classification. Moreover, instead of viewing the entire paper as a long linear sequence, one should exploit the structural information such as citation links across papers and the hierarchy of sections and paragraphs in each paper. To tackle these challenges, in this study, we propose FuTex, a framework that uses the cross-paper network structure and the in-paper hierarchy structure to classify full-text scientific papers under weak supervision. A network-aware contrastive fine-tuning module and a hierarchy-aware aggregation module are designed to leverage the two types of structural signals, respectively. Experiments on two benchmark datasets demonstrate that FuTex significantly outperforms competitive baselines and is on par with fully supervised classifiers that use 1,000 to 60,000 ground-truth training samples.
Full Text Available
Unsupervised Event Chain Mining from Multiple Documents

https://doi.org/10.1145/3543507.3583295

Jiao, Yizhu; Zhong, Ming; Shen, Jiaming; Zhang, Yunyi; Zhang, Chao; Han, Jiawei (April 2023, ACM)
Proc. 2023 The Web Conf. (Ed.)
Massive and fast-evolving news articles keep emerging on the web. To effectively summarize and provide concise insights into real-world events, we propose a new event knowledge extraction task Event Chain Mining in this paper. Given multiple documents about a super event, it aims to mine a series of salient events in temporal order. For example, the event chain of super event Mexico Earthquake in 2017 is {earthquake hit Mexico, destroy houses, kill people, block roads}. This task can help readers capture the gist of texts quickly, thereby improving reading efficiency and deepening text comprehension. To address this task, we regard an event as a cluster of different mentions of similar meanings. In this way, we can identify the different expressions of events, enrich their semantic knowledge and replenish relation information among them. Taking events as the basic unit, we present a novel unsupervised framework, EMiner. Specifically, we extract event mentions from texts and merge them with similar meanings into a cluster as a single event. By jointly incorporating both content and commonsense, essential events are then selected and arranged chronologically to form an event chain. Meanwhile, we annotate a multi-document benchmark to build a comprehensive testbed for the proposed task. Extensive experiments are conducted to verify the effectiveness of EMiner in terms of both automatic and human evaluations.
Full Text Available
