ClueWeb22: 10 Billion Web Documents with Rich Information

Overwijk, Arnold; Xiong, Chenyan; Callan, Jamie

doi:10.1145/3477495.3536321

ClueWeb22, the newest iteration of the ClueWeb line of datasets, is the result of more than a year of collaboration between industry and academia. Its design is influenced by the research needs of the academic community and the real-world needs of large-scale industry systems. Compared with earlier ClueWeb datasets, the ClueWeb22 corpus is larger, more varied, and has higher-quality documents. Its core is raw HTML, but it includes clean text versions of documents to lower the barrier to entry. Several aspects of ClueWeb22 are available to the research community for the first time at this scale, for example, visual representations of rendered web pages, parsed structured information from the HTML document, and the alignment of document distributions (domains, languages, and topics) to commercial web search. This talk shares the design and construction of ClueWeb22, and discusses its new features. We believe this newer, larger, and richer ClueWeb corpus will enable and support a broad range of research in IR, NLP, and deep learning.

Award ID(s): 1822975

PAR ID: 10469642

Author(s) / Creator(s): Overwijk, Arnold; Xiong, Chenyan; Callan, Jamie

Publisher / Repository: ACM

Date Published: 2022-07-06

ISBN: 9781450387323

Page Range / eLocation ID: 3360 to 3362

Location: Madrid, Spain

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper:
https://doi.org/10.1145/3477495.3536321
