ArcheType: A Novel Framework for Open-Source Column Type Annotation Using Large Language Models

Feuer, Benjamin; Liu, Yurong; Hegde, Chinmay; Freire, Juliana

doi:10.14778/3665844.3665857

Existing deep-learning approaches to semantic column type annotation (CTA) have important shortcomings: they rely on semantic types that are fixed at training time; require a large number of training samples per type; incur high run-time inference costs; and can degrade in performance when evaluated on novel datasets, even when types remain constant. Large language models have exhibited strong zero-shot classification performance on a wide range of tasks, and in this paper we explore their use for CTA. We introduce ArcheType, a simple, practical method for context sampling, prompt serialization, model querying, and label remapping, which enables large language models to solve CTA problems in a fully zero-shot manner. We ablate each component of our method separately, and establish that improvements to context sampling and label remapping provide the most consistent gains. ArcheType establishes new state-of-the-art performance on zero-shot CTA benchmarks (including three new domain-specific benchmarks which we release along with this paper), and when used in conjunction with classical CTA techniques, it outperforms a SOTA DoDuo model on the fine-tuned SOTAB benchmark.
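The four stages named in the abstract (context sampling, prompt serialization, model querying, label remapping) can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names, the prompt wording, the random sampling rule, and the similarity-based remapping are all hypothetical, and the model is passed in as a plain callable rather than a real LLM client.

```python
import random
from difflib import get_close_matches
from typing import Callable, List, Sequence


def sample_context(column: Sequence[str], k: int = 5, seed: int = 0) -> List[str]:
    """Context sampling (assumed rule): keep at most k distinct, non-empty values."""
    unique = sorted({v.strip() for v in column if v and v.strip()})
    if len(unique) <= k:
        return unique
    return random.Random(seed).sample(unique, k)


def serialize_prompt(samples: Sequence[str], labels: Sequence[str]) -> str:
    """Prompt serialization (assumed wording): render samples and candidate types."""
    return (
        f"Column values: {'; '.join(samples)}\n"
        f"Candidate types: {', '.join(labels)}\n"
        "Answer with exactly one candidate type."
    )


def remap_label(raw: str, labels: Sequence[str]) -> str:
    """Label remapping (assumed rule): snap free-form output onto an allowed label."""
    if not labels:
        raise ValueError("labels must be non-empty")
    cleaned = raw.strip().lower()
    by_lower = {label.lower(): label for label in labels}
    # Prefer the longest allowed label that appears verbatim in the output.
    for low in sorted(by_lower, key=len, reverse=True):
        if low in cleaned:
            return by_lower[low]
    # Otherwise fall back to the most similar label by string similarity.
    closest = get_close_matches(cleaned, list(by_lower), n=1, cutoff=0.0)
    return by_lower[closest[0]]


def annotate_column(
    column: Sequence[str],
    labels: Sequence[str],
    query_model: Callable[[str], str],
    k: int = 5,
) -> str:
    """Run the four stages in order; query_model stands in for any LLM call."""
    prompt = serialize_prompt(sample_context(column, k), labels)
    return remap_label(query_model(prompt), labels)
```

Calling `annotate_column(values, ["country", "city"], my_model)` with any function `my_model` that maps a prompt string to a text answer returns one of the supplied labels, even when the model's answer is verbose or slightly misspelled.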

Award ID(s): 2106888

PAR ID: 10540025

Author(s) / Creator(s): Feuer, Benjamin; Liu, Yurong; Hegde, Chinmay; Freire, Juliana

Publisher / Repository: VLDB Endowment

Date Published: 2024-05-01

Journal Name: Proceedings of the VLDB Endowment

Volume: 17

Issue: 9

ISSN: 2150-8097

Page Range / eLocation ID: 2279 to 2292

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Journal Article: https://doi.org/10.14778/3665844.3665857
