

Title: Tab-Cleaner: Weakly Supervised Tabular Data Cleaning via Pre-training for E-commerce Catalog
Product catalogs, conceptually in the form of text-rich tables, are self-reported by individual retailers and thus inevitably contain noisy facts. Verifying such textual attributes in product catalogs is essential to improving their reliability. However, popular methods for processing free-text content, such as pre-trained language models, are not particularly effective on structured tabular data, since they are typically trained on free-form natural language text. In this paper, we present Tab-Cleaner, a model designed to handle error detection over text-rich tabular data following a pre-training / fine-tuning paradigm. We train Tab-Cleaner on a real-world Amazon Product Catalog table covering millions of products and show improvements over state-of-the-art methods of 16% in PR AUC on the attribute applicability classification task and 11% in PR AUC on the attribute value validation task.
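To make the setup concrete, below is a minimal illustrative sketch of error detection over a text-rich table, framed as fine-tuning a generic pre-trained encoder on serialized (product context, attribute, value) examples. The checkpoint, serialization format, and toy data are assumptions for illustration only, not the authors' Tab-Cleaner implementation.

```python
# Illustrative sketch only: fine-tuning a generic pre-trained encoder to flag
# suspicious attribute values in a catalog table. The serialization scheme,
# model checkpoint, and toy data below are assumptions, not the paper's code.
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # 1 = value looks erroneous, 0 = valid

# Each example serializes one catalog cell together with its row context.
examples = [
    ("title: Stainless Steel Water Bottle [SEP] attribute: material [SEP] value: steel", 0),
    ("title: Cotton T-Shirt [SEP] attribute: battery_capacity [SEP] value: 2000 mAh", 1),
]

def collate(batch):
    texts, labels = zip(*batch)
    enc = tokenizer(list(texts), padding=True, truncation=True, return_tensors="pt")
    enc["labels"] = torch.tensor(labels)
    return enc

loader = DataLoader(examples, batch_size=2, collate_fn=collate, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for batch in loader:                       # single pass over the toy data
    out = model(**batch)                   # cross-entropy loss on the 2 classes
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```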
Award ID(s):
2211557 1937599
NSF-PAR ID:
10464429
Author(s) / Creator(s):
Date Published:
Journal Name:
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics
Volume:
5
Page Range / eLocation ID:
172 to 185
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Mathematical reasoning, a core ability of human intelligence, presents unique challenges for machines in abstract thinking and logical reasoning. Recent large pre-trained language models such as GPT-3 have achieved remarkable progress on mathematical reasoning tasks written in text form, such as math word problems (MWP). However, it is unknown if the models can handle more complex problems that involve math reasoning over heterogeneous information, such as tabular data. To fill the gap, we present Tabular Math Word Problems (TABMWP), a new dataset containing 38,431 open-domain grade-level problems that require mathematical reasoning on both textual and tabular data. Each question in TABMWP is aligned with a tabular context, which is presented as an image, semi-structured text, and a structured table. There are two types of questions: free-text and multi-choice, and each problem is annotated with gold solutions to reveal the multi-step reasoning process. We evaluate different pre-trained models on TABMWP, including the GPT-3 model in a few-shot setting. As earlier studies suggest, because few-shot GPT-3 relies on the selection of in-context examples, its performance is unstable and can degrade to near chance. This instability is more severe when handling complex problems like TABMWP. To mitigate this, we further propose a novel approach, PROMPTPG, which utilizes policy gradient to learn to select in-context examples from a small amount of training data and then constructs the corresponding prompt for the test example. Experimental results show that our method outperforms the best baseline by 5.31% on the accuracy metric and reduces the prediction variance significantly compared to random selection, which verifies its effectiveness in selecting in-context examples.
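The example-selection loop can be sketched loosely as REINFORCE over a candidate pool, as below. The scoring network, embeddings, and simulated reward are placeholders rather than the PROMPTPG implementation.

```python
# Toy sketch of policy-gradient selection of in-context examples (the idea
# behind PROMPTPG); embeddings, rewards, and the candidate pool are simulated
# placeholders rather than the paper's actual setup.
import torch
import torch.nn as nn

torch.manual_seed(0)
num_candidates, dim = 20, 32
candidate_emb = torch.randn(num_candidates, dim)     # embeddings of candidate examples

policy = nn.Linear(dim, dim, bias=False)             # scores candidates against a problem
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def select_and_update(problem_emb, k=2):
    scores = candidate_emb @ policy(problem_emb)     # one score per candidate
    probs = torch.softmax(scores, dim=0)
    picks = torch.multinomial(probs, k, replacement=False)
    log_prob = torch.log(probs[picks]).sum()

    # In the real method, the picked examples form the prompt, the LLM answers,
    # and the reward is whether the answer is correct. Here it is simulated.
    reward = float(torch.rand(1) < 0.5)

    loss = -reward * log_prob                        # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return picks.tolist(), reward

for step in range(100):
    picks, reward = select_and_update(torch.randn(dim))
```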
  2. Large language models benefit from training with a large amount of unlabeled text, which gives them increasingly fluent and diverse generation capabilities. However, using these models for text generation that takes into account target attributes, such as sentiment polarity or specific topics, remains a challenge. We propose a simple and flexible method for controlling text generation by aligning disentangled attribute representations. In contrast to recent efforts on training a discriminator to perturb the token-level distribution for an attribute, we use the same data to learn an alignment function that guides the pre-trained, non-controlled language model to generate texts with the target attribute without changing the original language model parameters. We evaluate our method on sentiment- and topic-controlled generation, and show large performance gains over previous methods while retaining fluency and diversity.
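As a very rough illustration of steering a frozen language model with an attribute-conditioned shift of its hidden states (one possible reading of an alignment function; the paper's actual formulation and training objective may differ substantially):

```python
# Very loose illustration of steering a frozen LM with an attribute vector
# added to its hidden states; the alignment function, its training, and the
# attribute names here are assumptions, not the paper's method.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
for p in model.parameters():                 # the base LM stays frozen
    p.requires_grad_(False)

hidden_size = model.config.n_embd
# Would be learned from attribute-labeled data; training is omitted here,
# so this placeholder vector is all zeros.
steer = {"positive": torch.zeros(hidden_size)}

def next_token_logits(prompt, attribute):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    hidden = model.transformer(ids).last_hidden_state   # frozen forward pass
    hidden = hidden + steer[attribute]                   # shift toward the attribute
    return model.lm_head(hidden)[0, -1]

logits = next_token_logits("The movie was", "positive")
print(tokenizer.decode(int(logits.argmax())))
```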
  3. Recent years have witnessed the enormous success of text representation learning in a wide range of text mining tasks. Earlier word embedding learning approaches represent words as fixed low-dimensional vectors to capture their semantics. The word embeddings so learned are used as the input features of task-specific models. Recently, pre-trained language models (PLMs), which learn universal language representations via pre-training Transformer-based neural models on large-scale text corpora, have revolutionized the natural language processing (NLP) field. Such pre-trained representations encode generic linguistic features that can be transferred to almost any text-related application. PLMs outperform previous task-specific models in many applications as they only need to be fine-tuned on the target corpus instead of being trained from scratch. In this tutorial, we introduce recent advances in pre-trained text embeddings and language models, as well as their applications to a wide range of text mining tasks. Specifically, we first overview a set of recently developed self-supervised and weakly-supervised text embedding methods and pre-trained language models that serve as the fundamentals for downstream tasks. We then present several new methods based on pre-trained text embeddings and language models for various text mining applications such as topic discovery and text classification. We focus on methods that are weakly-supervised, domain-independent, language-agnostic, effective and scalable for mining and discovering structured knowledge from large-scale text corpora. Finally, we demonstrate with real-world datasets how pre-trained text representations help mitigate the human annotation burden and facilitate automatic, accurate and efficient text analyses.
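One of the simplest weakly-supervised setups in this space is seed-word pseudo-labeling followed by training an ordinary classifier. The toy sketch below assumes that setup with made-up seed words and documents; it is a generic illustration, not a specific method from the tutorial.

```python
# Toy illustration of weak supervision from seed words: documents are
# pseudo-labeled by keyword matching, then a standard classifier is trained.
# Seed words and documents are invented examples, not from the tutorial.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

seed_words = {"sports": ["game", "team", "score"],
              "politics": ["election", "senate", "policy"]}

docs = ["The team won the game with a record score",
        "The senate debated the new election policy",
        "Fans cheered as the score was tied late in the game",
        "Lawmakers proposed a policy ahead of the election"]

def pseudo_label(doc):
    counts = {label: sum(w in doc.lower() for w in words)
              for label, words in seed_words.items()}
    return max(counts, key=counts.get)        # label with the most seed-word hits

labels = [pseudo_label(d) for d in docs]      # no human annotation used

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(docs), labels)
print(clf.predict(vec.transform(["Who will win the election?"])))
```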
  4. Multilingual transformer language models have recently attracted much attention from researchers and are used in cross-lingual transfer learning for many NLP tasks such as text classification and named entity recognition. However, similar methods for transfer learning from monolingual text to code-switched text have not been extensively explored, mainly due to the following challenges: (1) a code-switched corpus, unlike a monolingual corpus, consists of more than one language, so existing methods can't be applied efficiently; (2) a code-switched corpus is usually made of a resource-rich and a low-resource language, and upon using multilingual pre-trained language models, the final model might be biased towards the resource-rich language. In this paper, we focus on code-switched sentiment analysis where we have a labelled resource-rich language dataset and unlabelled code-switched data. We propose a framework that takes the distinction between resource-rich and low-resource languages into account. Instead of training on the entire code-switched corpus at once, we create buckets based on the fraction of words in the resource-rich language and progressively train from resource-rich language dominated samples to low-resource language dominated samples. Extensive experiments across multiple language pairs demonstrate that progressive training helps low-resource language dominated samples.
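The bucketing and curriculum idea can be sketched as follows; the token-level language test, the bucket edges, the training schedule, and the data are placeholders, not the paper's pipeline.

```python
# Minimal sketch of the bucketing idea: sort code-switched sentences by the
# fraction of resource-rich-language tokens and fine-tune progressively from
# the most resource-rich-dominated bucket to the least. Vocabulary lookup,
# schedule, and data below are placeholders, not the paper's pipeline.
def resource_rich_fraction(tokens, rich_vocab):
    return sum(t.lower() in rich_vocab for t in tokens) / max(len(tokens), 1)

def make_buckets(sentences, rich_vocab, edges=(0.75, 0.5, 0.25, 0.0)):
    buckets = [[] for _ in edges]
    for sent in sentences:
        frac = resource_rich_fraction(sent.split(), rich_vocab)
        for i, lo in enumerate(edges):         # first bucket whose lower edge fits
            if frac >= lo:
                buckets[i].append(sent)
                break
    return buckets                              # ordered rich-dominated -> low-resource

def progressive_train(model, buckets, train_step):
    seen = []
    for bucket in buckets:                      # curriculum over buckets
        seen.extend(bucket)
        for sent in seen:                       # one plausible schedule: revisit all seen data
            train_step(model, sent)
    return model

# Example usage with a dummy "model" and training step:
english_vocab = {"the", "is", "good", "movie", "very"}
data = ["the movie is very good", "movie bahut achhi hai", "the film achhi hai"]
progressive_train({}, make_buckets(data, english_vocab), lambda m, s: None)
```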
  5. We present a simple approach to improve direct speech-to-text translation (ST) when the source language is low-resource: we pre-train the model on a high-resource automatic speech recognition (ASR) task, and then fine-tune its parameters for ST. We demonstrate that our approach is effective by pre-training on 300 hours of English ASR data to improve Spanish-English ST from 10.8 to 20.2 BLEU when only 20 hours of Spanish-English ST training data are available. Through an ablation study, we find that the pre-trained encoder (acoustic model) accounts for most of the improvement, despite the fact that the shared language in these tasks is the target language text, not the source language audio. Applying this insight, we show that pre-training on ASR helps ST even when the ASR language differs from both source and target ST languages: pre-training on French ASR also improves Spanish-English ST. Finally, we show that the approach improves performance on a true low-resource task: pre-training on a combination of English ASR and French ASR improves Mboshi-French ST, where only 4 hours of data are available, from 3.5 to 7.1 BLEU.
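The transfer recipe can be sketched as initializing the ST model's encoder from an encoder pre-trained on ASR and then fine-tuning on the ST pairs. The toy architecture and dimensions below are simplified placeholders, not the paper's model.

```python
# Simplified sketch of the transfer recipe: pre-train an encoder-decoder on
# ASR, then initialize the ST model's encoder from it and fine-tune on ST.
# Architecture, dimensions, and data are placeholders, not the paper's model.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, vocab=1000):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, vocab)

    def forward(self, speech_feats):
        enc_out, _ = self.encoder(speech_feats)
        dec_out, _ = self.decoder(enc_out)        # toy decoder, no attention
        return self.proj(dec_out)

asr_model = Seq2Seq()
# ... pre-train asr_model on high-resource ASR data (omitted) ...

st_model = Seq2Seq()
st_model.encoder.load_state_dict(asr_model.encoder.state_dict())  # transfer encoder

# ... fine-tune st_model on the low-resource ST pairs (omitted) ...
feats = torch.randn(2, 50, 80)                    # batch of 2 utterances, 50 frames
logits = st_model(feats)                          # (2, 50, vocab) translation logits
```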