We present a novel symbolic reasoning engine for SQL which can efficiently generate an input I for n queries P1, ⋯, Pn, such that their outputs on I satisfy a given property (expressed in SMT). This is useful in different contexts, such as disproving equivalence of two SQL queries and disambiguating a set of queries. Our first idea is to reason about an under-approximation of each Pi, that is, a subset of Pi's input-output behaviors. While this makes our approach both semantics-aware and lightweight, this idea alone is incomplete, as a fixed under-approximation might miss some behaviors of interest. Therefore, our second idea is to perform search over an expressive family of under-approximations (which collectively cover all program behaviors of interest), thereby making our approach complete. We have implemented these ideas in a tool, Polygon, and evaluated it on over 30,000 benchmarks across two tasks (namely, SQL equivalence refutation and query disambiguation). Our evaluation results show that Polygon significantly outperforms all prior techniques.
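The high-level idea can be sketched in a few lines: model queries as functions over small candidate inputs, and search over a family of bounded input spaces (each a finite under-approximation of the queries' behaviors) until one input distinguishes the queries. This is a simplified illustration, not Polygon's actual symbolic encoding; the single-column tables, toy queries, and enumeration strategy are assumptions for the sketch.

```python
# Sketch of the search idea (not Polygon's actual algorithm): enumerate
# increasingly expressive under-approximations of the input space --
# here, single-column tables of growing size over a small value domain --
# until we find an input on which two queries disagree.
from itertools import product

def q1(table):  # SELECT x FROM t WHERE x > 0
    return [x for x in table if x > 0]

def q2(table):  # SELECT x FROM t WHERE x >= 0
    return [x for x in table if x >= 0]

def refute_equivalence(qa, qb, domain=range(-2, 3), max_rows=3):
    # Each (size, domain) pair is one under-approximation: a finite
    # subset of the queries' input-output behaviors.
    for size in range(1, max_rows + 1):
        for rows in product(domain, repeat=size):
            if qa(list(rows)) != qb(list(rows)):
                return list(rows)  # counterexample input I
    return None  # equivalence not refuted within this family

witness = refute_equivalence(q1, q2)
print("counterexample table:", witness)  # the row 0 distinguishes q1 and q2
```

Searching a whole family of under-approximations rather than a single fixed one is what makes the approach complete: if any input of interest distinguishes the queries, some member of the family contains it.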
DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing
Analyzing unstructured data has been a persistent challenge in data processing. Recent proposals offer declarative frameworks for LLM-powered processing of unstructured data, but they typically execute user-specified operations as-is in a single LLM call, focusing on cost rather than accuracy. This is problematic for complex tasks, where even well-prompted LLMs can miss relevant information. For instance, reliably extracting all instances of a specific clause from legal documents often requires decomposing the task, the data, or both. We present DocETL, a system that optimizes complex document processing pipelines while accounting for LLM shortcomings. DocETL offers a declarative interface for users to define such pipelines and uses an agent-based approach to automatically optimize them, leveraging novel agent-based rewrites (which we call rewrite directives), as well as an optimization and evaluation framework. We introduce (i) logical rewriting of pipelines, tailored for LLM-based tasks, (ii) an agent-guided plan evaluation mechanism, and (iii) an optimization algorithm that efficiently finds promising plans, considering the latencies of LLM execution. Across four real-world document processing tasks, DocETL improves accuracy by 21–80% over strong baselines. DocETL is open-source at docetl.org and, as of March 2025, has over 1.7k GitHub stars across diverse domains.
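One data-decomposition rewrite of the kind described can be sketched as a split/map/reduce plan: instead of asking a single LLM call to extract every clause from a long document, split the document into chunks, extract per chunk, and merge. This is a hypothetical illustration in the spirit of a rewrite directive, not DocETL's API; the chunking scheme and the keyword-match stand-in for the LLM call are assumptions.

```python
# Sketch of a task-decomposition rewrite (hypothetical, not DocETL's
# actual directives): a single extraction op is rewritten into a
# split / map / reduce plan over document chunks.
def split(document, chunk_size=500):
    # Fixed-size chunks; a real rewrite would overlap chunks so that
    # clauses straddling a boundary are not missed.
    return [document[i:i + chunk_size]
            for i in range(0, len(document), chunk_size)]

def extract_clauses(chunk):
    # Stand-in for a per-chunk LLM extraction call.
    return [line for line in chunk.splitlines() if "indemnify" in line]

def reduce_results(per_chunk):
    merged = []
    for clauses in per_chunk:
        merged.extend(clauses)
    return merged

def rewritten_pipeline(document):
    # original single-call operation -> decomposed plan
    return reduce_results([extract_clauses(c) for c in split(document)])
```

The point of the rewrite is that each LLM call now sees a small, focused input, which is exactly where single-call pipelines tend to drop instances.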
- Award ID(s):
- 2129008
- PAR ID:
- 10675167
- Publisher / Repository:
- VLDB
- Date Published:
- Journal Name:
- Proceedings of the VLDB Endowment
- Volume:
- 18
- Issue:
- 9
- ISSN:
- 2150-8097
- Page Range / eLocation ID:
- 3035 to 3048
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
We consider the problem of automatic parallelism in high-performance, tensor-based systems. Our focus is on intra-operator parallelism for inference tasks on a single GPU server or CPU cluster, where each operator is automatically broken up so that it runs on multiple devices. We assert that tensor-based systems should offer a programming abstraction based on an extended Einstein summation notation, which is a fully declarative, mathematical specification for tensor computations. We show that any computation specified in the Einstein summation notation can be re-written into an equivalent tensor-relational computation that facilitates intra-operator parallelism, and this re-write generalizes existing notions of tensor parallelism such as data parallel and model parallel. We consider the algorithmic problem of optimally computing a tensor-relational decomposition of a graph of operations specified in our extended Einstein summation notation.
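The data-parallel case of such a decomposition is easy to see on the einsum C[i,j] = Σ_k A[i,k]·B[k,j]: sharding A by rows across devices and concatenating the partial results reproduces the full computation. The sketch below is a toy pure-Python illustration of this equivalence, not the paper's tensor-relational optimizer; the two-device row sharding is an assumption.

```python
# Sketch: the einsum C[i,j] = sum_k A[i,k] * B[k,j], decomposed for
# intra-operator (data-parallel) execution. A tensor-relational
# optimizer would choose among such decompositions automatically.
def einsum_ik_kj(A, B):
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][t] * B[t][j] for t in range(k))
             for j in range(m)] for i in range(n)]

def data_parallel(A, B, devices=2):
    # Shard A by rows across "devices"; B is replicated.
    shard = (len(A) + devices - 1) // devices
    shards = [A[d * shard:(d + 1) * shard] for d in range(devices)]
    partial = [einsum_ik_kj(s, B) for s in shards if s]  # per-device work
    return [row for part in partial for row in part]     # gather rows

A = [[1, 2], [3, 4], [5, 6]]
B = [[7, 8], [9, 10]]
assert data_parallel(A, B) == einsum_ik_kj(A, B)
```

Model parallelism corresponds to sharding along a different index of the same einsum (columns of B, or the contracted index k with a final sum), which is why a single relational re-write generalizes both notions.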
-
Large language models (LLMs) often incorporate multiple text chunks in their inputs to provide the necessary contexts. To speed up the prefill of long LLM inputs, one can pre-compute the KV cache of a text and re-use the KV cache when the text is reused as the prefix of another LLM input. However, the reused text chunks are not always the input prefix, which makes precomputed KV caches not directly usable, since they ignore the text's cross-attention with the preceding texts. Thus, the benefits of reusing KV caches remain largely unrealized. This paper tackles just one challenge: when an LLM input contains multiple text chunks, how can we quickly combine their precomputed KV caches in order to achieve the same generation quality as the expensive full prefill (i.e., without reusing KV caches)? This challenge naturally arises in retrieval-augmented generation (RAG), where the input is supplemented with multiple retrieved texts as the context. We present CacheBlend, a scheme that reuses the pre-computed KV caches, regardless of whether they form the prefix or not, and selectively recomputes the KV values of a small subset of tokens to partially update each reused KV cache. In the meantime, the small extra delay for recomputing some tokens can be pipelined with the retrieval of KV caches within the same job, allowing CacheBlend to store KV caches on slower devices with more storage capacity while retrieving them without increasing the inference delay. By comparing CacheBlend with the state-of-the-art KV cache reusing schemes on six open-source LLMs of various sizes and five popular benchmark datasets of different tasks, we show that CacheBlend reduces time-to-first-token (TTFT) by 2.2–3.3x and increases the inference throughput by 2.8–5x relative to full KV recompute, without compromising generation quality. The code is available at https://github.com/LMCache/LMCache.
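The selective-recompute step can be pictured with a toy numeric model: given reused per-chunk KV values and the values a full prefill would produce, correct only the small fraction of tokens that deviate most. This is purely illustrative; in reality the full-prefill values are exactly what one avoids computing, and CacheBlend identifies which tokens to recompute without an oracle, so the ranking below is an assumption of the sketch.

```python
# Toy sketch of selective KV recompute (illustrative, not CacheBlend's
# actual algorithm). Each token has one scalar standing in for its KV
# entry: `precomputed` was cached as if the chunk were a prefix, while
# `full_prefill` reflects cross-attention with preceding chunks.
def selective_recompute(precomputed, full_prefill, r=0.2):
    n = len(precomputed)
    budget = max(1, int(n * r))  # recompute at most an r-fraction
    # Rank tokens by how far the reused value is from the true one.
    # (Oracle ranking for illustration; the real system estimates this.)
    deviations = sorted(range(n),
                        key=lambda i: abs(precomputed[i] - full_prefill[i]),
                        reverse=True)
    blended = list(precomputed)
    for i in deviations[:budget]:
        blended[i] = full_prefill[i]  # splice in the recomputed value
    return blended

reused = [1.0, 2.0, 3.0, 4.0, 5.0]
true_kv = [1.0, 2.1, 9.0, 4.0, 5.0]  # token 2 badly misses cross-attention
print(selective_recompute(reused, true_kv))
```

Because only a small token subset is touched, the recompute delay is small enough to overlap with fetching the next chunk's cache from storage, which is what lets the caches live on slower, larger devices.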
-
Transfer learning is an effective technique for tuning a deep learning model when training data or computational resources are limited. Instead of training a new model from scratch, the parameters of an existing base model are adjusted for the new task. The accuracy of such a fine-tuned model depends on the suitability of the base model chosen. Model search automates the selection of such a base model by evaluating the suitability of candidate models for a specific task. This entails inference with each candidate model on task-specific data. With thousands of models available through model stores, the computational cost of model search is a major bottleneck for efficient transfer learning. In this work, we present Alsatian, a novel model search system. Based on the observation that many candidate models overlap to a significant extent and following a careful bottleneck analysis, we propose optimization techniques that are applicable to many model search frameworks. These optimizations include: (i) splitting models into individual blocks that can be shared across models, (ii) caching of intermediate inference results and model blocks, and (iii) selecting a beneficial search order for models to maximize sharing of cached results. In our evaluation on state-of-the-art deep learning models from computer vision and natural language processing, we show that Alsatian outperforms baselines by up to 14x.
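The interplay of the three optimizations can be sketched with models as sequences of block IDs: cache the intermediate result after every block prefix, and visit models in an order that puts long shared prefixes next to each other. This is a hypothetical simplification (blocks as opaque IDs, sorting as the search order), not Alsatian's actual system design.

```python
# Sketch of prefix sharing in model search (hypothetical simplification):
# a model is a list of block IDs; inference results after any shared
# block prefix are cached and reused across candidate models.
def search_with_cache(models, run_block, data):
    cache = {}    # block-prefix tuple -> intermediate activation
    executed = 0  # block executions actually performed
    for model in sorted(models):  # sorted order groups shared prefixes
        # Find the longest already-cached prefix of this model.
        start, x = 0, data
        for depth in range(len(model), 0, -1):
            key = tuple(model[:depth])
            if key in cache:
                start, x = depth, cache[key]
                break
        # Run only the remaining blocks, caching each new prefix.
        for depth in range(start, len(model)):
            x = run_block(model[depth], x)
            executed += 1
            cache[tuple(model[:depth + 1])] = x
        # x is now this candidate's output on the task data
    return executed

models = [["a", "b", "c"], ["a", "b", "d"], ["a", "e"]]
run_block = lambda block, x: x + block  # stand-in for a layer forward pass
print(search_with_cache(models, run_block, ""))  # 5 executions, not 8
```

Without sharing, the three candidates would cost 3 + 3 + 2 = 8 block executions; with prefix caching, the second model reuses ("a", "b") and the third reuses ("a",), so only 5 blocks run.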
-
Probabilistic programming languages (PPLs) are an expressive means for creating and reasoning about probabilistic models. Unfortunately, hybrid probabilistic programs that involve both continuous and discrete structures are not well supported by today's PPLs. In this paper we develop a new approximate inference algorithm for hybrid probabilistic programs that first discretizes the continuous distributions and then performs discrete inference on the resulting program. The key novelty is a form of discretization that we call bit blasting, which uses a binary representation of numbers such that a domain of discretized points can be succinctly represented as a discrete probabilistic program over polynomially many Boolean random variables. Surprisingly, we prove that many common continuous distributions can be bit blasted in a manner that incurs no loss of accuracy over an explicit discretization and supports efficient probabilistic inference. We have built a probabilistic programming system for hybrid programs called HyBit, which employs bit blasting followed by discrete probabilistic inference. We empirically demonstrate the benefits of our approach over existing sampling-based and symbolic inference approaches.
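The core observation behind bit blasting can be checked on the simplest case: a uniform distribution discretized over 2^k points is induced exactly by k independent fair Boolean random variables read as the bits of an index. The sketch below verifies this equality by brute-force enumeration; it is a correctness check of the claim for uniforms only, not HyBit's succinct symbolic representation (which avoids the exponential enumeration done here), and the bit-to-value encoding is an assumption of the sketch.

```python
# Sketch of the bit-blasting idea for a uniform distribution: k fair
# Boolean random variables, interpreted as the binary digits of an
# index, induce exactly the uniform distribution over 2^k points.
from fractions import Fraction
from itertools import product

def bit_blasted_uniform(k):
    # Each bit b_i is an independent Bernoulli(1/2) random variable;
    # the encoded value is sum_i b_i * 2^i.
    dist = {}
    for bits in product([0, 1], repeat=k):
        value = sum(b << i for i, b in enumerate(bits))
        dist[value] = dist.get(value, Fraction(0)) + Fraction(1, 2) ** k
    return dist

def explicit_uniform(k):
    # Explicit discretization: one probability per point.
    return {v: Fraction(1, 2 ** k) for v in range(2 ** k)}

assert bit_blasted_uniform(3) == explicit_uniform(3)
```

The payoff is representational: the explicit discretization stores 2^k probabilities, while the bit-blasted program needs only k Boolean variables, which discrete inference engines can handle symbolically.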

