Title: ClueWeb22: 10 Billion Web Documents with Rich Information
ClueWeb22, the newest iteration of the ClueWeb line of datasets, is the result of more than a year of collaboration between industry and academia. Its design is influenced by the research needs of the academic community and the real-world needs of large-scale industry systems. Compared with earlier ClueWeb datasets, the ClueWeb22 corpus is larger, more varied, and has higher-quality documents. Its core is raw HTML, but it includes clean text versions of documents to lower the barrier to entry. Several aspects of ClueWeb22 are available to the research community for the first time at this scale, for example, visual representations of rendered web pages, parsed structured information from the HTML document, and the alignment of document distributions (domains, languages, and topics) to commercial web search. This talk shares the design and construction of ClueWeb22, and discusses its new features. We believe this newer, larger, and richer ClueWeb corpus will enable and support a broad range of research in IR, NLP, and deep learning.
Award ID(s):
1822975
NSF-PAR ID:
10469642
Publisher / Repository:
ACM
Page Range / eLocation ID:
3360 to 3362
Format(s):
Medium: X
Location:
Madrid Spain
Sponsoring Org:
National Science Foundation
More Like this
  1. Obeid, Iyad ; Selesnick, Ivan ; Picone, Joseph (Ed.)
    The Temple University Hospital Seizure Detection Corpus (TUSZ) [1] has been in distribution since April 2017. It is a subset of the TUH EEG Corpus (TUEG) [2] and the most frequently requested corpus from our 3,000+ subscribers. It was recently featured as the challenge task in the Neureka 2020 Epilepsy Challenge [3]. A summary of the development of the corpus is shown below in Table 1. The TUSZ Corpus is a fully annotated corpus, which means every seizure event that occurs within its files has been annotated. The data is selected from TUEG using a screening process that identifies files most likely to contain seizures [1]. Approximately 7% of the TUEG data contains a seizure event, so it is important that we triage TUEG for high-yield data. One hour of EEG data requires approximately one hour of human labor to complete annotation using the pipeline described below, so it is important from a financial standpoint that we accurately triage data. A summary of the labels being used to annotate the data is shown in Table 2. Certain standards are put into place to optimize the annotation process while not sacrificing consistency. Due to the nature of EEG recordings, some records start off with a segment of calibration. This portion of the EEG is instantly recognizable and transitions from what resembles lead artifact to a flat line on all the channels. For the sake of seizure annotation, the calibration is ignored, and no time is wasted on it. During the identification of seizure events, a hard “3-second rule” is used to determine whether two events should be combined into a single larger event: events separated by a gap shorter than 3 seconds are merged. This greatly reduces the time that it takes to annotate a file with multiple events occurring in succession. In addition to the required minimum 3-second gap between seizures, part of our standard dictates that no seizure shorter than 3 seconds be annotated. 
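The two 3-second conventions described above can be sketched in a few lines of Python. This is only an illustration, not the annotation tool's actual logic: the function name, the `(start, stop)` tuple layout, the treatment of a gap of exactly 3 seconds as sufficient separation, and the choice to apply the minimum-duration filter after merging are all assumptions.

```python
from typing import List, Tuple

Event = Tuple[float, float]  # (start_sec, stop_sec); layout is an assumption


def apply_three_second_rule(
    events: List[Event], min_gap: float = 3.0, min_dur: float = 3.0
) -> List[Event]:
    """Merge events separated by less than min_gap, then drop short events."""
    merged: List[Event] = []
    for start, stop in sorted(events):
        if merged and start - merged[-1][1] < min_gap:
            # Gap to the previous event is too small: extend that event.
            prev_start, prev_stop = merged[-1]
            merged[-1] = (prev_start, max(prev_stop, stop))
        else:
            merged.append((start, stop))
    # Assumed ordering: the duration filter runs on the merged events.
    return [(s, e) for s, e in merged if e - s >= min_dur]


candidates = [(10, 14), (15.5, 20), (30, 31), (40, 50)]
print(apply_three_second_rule(candidates))
```

In this example the first two candidates are 1.5 seconds apart and become one event, while the 1-second candidate at 30 s is discarded.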
Although there is no universally accepted definition for how long a seizure must be, we find that it is difficult to distinguish a seizure with confidence from burst suppression or other morphologically similar impressions when the event is only a couple of seconds long. This is due to several reasons, the most notable being the lack of evolution, which is oftentimes crucial for the determination of a seizure. After the EEG files have been triaged, a team of annotators at NEDC is provided with the files to begin data annotation. An example of an annotation is shown in Figure 1. A summary of the workflow for our annotation process is shown in Figure 2. Several passes are performed over the data to ensure the annotations are accurate. Each file undergoes three passes to ensure that no seizures were missed or misidentified. The first pass of TUSZ involves identifying which files contain seizures and annotating them using our annotation tool. The time it takes to fully annotate a file can vary drastically depending on the specific characteristics of each file; however, on average a file containing multiple seizures takes 7 minutes to fully annotate. This includes the time that it takes to read the patient report as well as traverse through the entire file. Once an event has been identified, the start and stop time for the seizure is stored in our annotation tool. This is done on a channel-by-channel basis, resulting in an accurate representation of the seizure spreading across different parts of the brain. Files that do not contain any seizures take approximately 3 minutes to complete. Even though there is no annotation being made, the file is still carefully examined to make sure that nothing was overlooked. In addition to scrolling through a file from start to finish, a file is often examined through different lenses. Depending on the situation, low-pass filters are used, as well as increasing the amplitude of certain channels. 
These techniques are never used in isolation and are meant to further increase our confidence that nothing was missed. Once each file in a given set has been looked at once, the annotators start the review process. The reviewer checks a file and comments on any changes that they recommend. This takes about 3 minutes per seizure-containing file, which is significantly less time than the first pass. After each file has been commented on, the third pass commences. This step takes about 5 minutes per seizure file and requires the reviewer to accept or reject the changes that the second reviewer suggested. Since tangible changes are made to the annotation using the annotation tool, this step takes a bit longer than the previous one. Assuming 18% of the files contain seizures, a set of 1,000 files takes roughly 127 work hours to annotate. Before an annotator contributes to the data interpretation pipeline, they are trained for several weeks on previous datasets. A new annotator can be trained on data that resembles what they would see under normal circumstances. An additional benefit of using released data to train is that it serves as a means of constantly checking our work. If a trainee stumbles across an event that was not previously annotated, it is promptly added, and the data release is updated. It takes about three months to train an annotator to a point where their annotations can be trusted. Even though we carefully screen potential annotators during the hiring process, only about 25% of the annotators we hire survive more than one year doing this work. To keep the annotators consistent in their annotations, the team periodically conducts an interrater agreement evaluation to confirm that there is a consensus within the team. The annotation standards are discussed in Ochal et al. [4]. An extended discussion of interrater agreement can be found in Shah et al. [5]. 
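The text does not break down the 127-hour estimate, so the following is only one reading that happens to reproduce it. It uses the per-file times quoted above (7, 3, and 5 minutes for the three passes on a seizure file, 3 minutes for the first pass on a file without seizures) and additionally assumes that files without seizures also receive a review pass of about 3 minutes. That last figure is an assumption, not something the abstract states, and the function and parameter names are hypothetical.

```python
def annotation_hours(
    n_files: int = 1000,
    seizure_fraction: float = 0.18,
    seizure_first_pass_min: float = 7.0,
    seizure_review_min: float = 3.0,
    seizure_resolve_min: float = 5.0,
    clean_first_pass_min: float = 3.0,
    clean_review_min: float = 3.0,  # assumption: not stated in the text
) -> float:
    """Estimate total annotation effort in hours for a set of files."""
    n_seizure = n_files * seizure_fraction
    n_clean = n_files - n_seizure
    minutes = n_seizure * (
        seizure_first_pass_min + seizure_review_min + seizure_resolve_min
    ) + n_clean * (clean_first_pass_min + clean_review_min)
    return minutes / 60.0


print(round(annotation_hours()))                      # with the assumed review of clean files
print(round(annotation_hours(clean_review_min=0.0)))  # without it
```

Under the assumption, 180 seizure files at 15 minutes plus 820 other files at 6 minutes gives 7,620 minutes, or about 127 hours. Without the assumed review pass for files that contain no seizures, the same arithmetic gives about 86 hours.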
The most recent release of TUSZ, v1.5.2, represents our efforts to review the quality of the annotations for two challenges we hosted: an internal deep learning challenge at IBM [6] and the Neureka 2020 Epilepsy Challenge [3]. One of the biggest changes made to the annotations was the imposition of a stricter standard for determining the start and stop time of a seizure. Although evolution is still included in the annotations, the start times were altered to start when the spike-wave pattern becomes distinct as opposed to merely when the signal starts to shift from background. This cuts down on background that was mislabeled as a seizure. For seizure end times, all post-ictal slowing that was included was removed. The v1.5.2 release did not include any new data files, apart from two EEG files that were corrupted in v1.5.1 but were recovered and added for the latest release. The progression from v1.5.0 to v1.5.1 and later to v1.5.2 included the re-annotation of all of the EEG files in order to develop a confident dataset regarding seizure identification. Starting with v1.4.0, we have also developed a blind evaluation set that is withheld for use in competitions. The annotation team is currently working on the next release for TUSZ, v1.6.0, which is expected to occur in August 2020. It will include new data from 2016 to mid-2019. This release will contain 2,296 files from 2016 as well as several thousand files representing the remaining data through mid-2019. In addition to files that were obtained with our standard triaging process, a part of this release consists of EEG files that do not have associated patient reports. Since actual seizure events are in short supply, we are mining a large chunk of data for which we have EEG recordings but no reports. 
Some of this data contains interesting seizure events collected during long-term EEG sessions or data collected from patients with a history of frequent seizures. It is being mined to increase the number of files in the corpus that have at least one seizure event. We expect v1.6.0 to be released before IEEE SPMB 2020. The TUAR Corpus is an open-source database that is currently available for use by any registered member of our consortium. To register and receive access, please follow the instructions provided at this web page: https://www.isip.piconepress.com/projects/tuh_eeg/html/downloads.shtml. The data is located here: https://www.isip.piconepress.com/projects/tuh_eeg/downloads/tuh_eeg_artifact/v2.0.0/. 
  2. Organisations disclose their privacy practices by posting privacy policies on their websites. Even though internet users often care about their digital privacy, they usually do not read privacy policies, since understanding them requires a significant investment of time and effort. Natural language processing has been used to create experimental tools to interpret privacy policies, but there has been a lack of large privacy policy corpora to facilitate the creation of large-scale semi-supervised and unsupervised models to interpret and simplify privacy policies. Thus, we present the PrivaSeer Corpus of 1,005,380 English language website privacy policies collected from the web. The number of unique websites represented in PrivaSeer is about ten times larger than the next largest public collection of web privacy policies, and it surpasses the aggregate of unique websites represented in all other publicly available privacy policy corpora combined. We describe a corpus creation pipeline with stages that include a web crawler, language detection, document classification, duplicate and near-duplicate removal, and content extraction. We employ an unsupervised topic modelling approach to investigate the contents of policy documents in the corpus and discuss the distribution of topics in privacy policies at web scale. We further investigate the relationship between privacy policy domain PageRanks and text features of the privacy policies. Finally, we use the corpus to pretrain PrivBERT, a transformer-based privacy policy language model, and obtain state of the art results on the data practice classification and question answering tasks. 
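The abstract names duplicate and near-duplicate removal as one pipeline stage but does not say how it is done. As a rough illustration of what such a stage can look like, the sketch below uses word-shingle Jaccard similarity, a common generic technique that is swapped in here and is not claimed to be the PrivaSeer method. The function names, the shingle size, and the 0.8 threshold are all assumptions.

```python
import re
from typing import List, Set, Tuple


def shingles(text: str, k: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of k-word shingles of a lowercased, tokenized text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: Set, b: Set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(docs: List[str], threshold: float = 0.8) -> List[str]:
    """Keep a document only if it is not too similar to any kept document."""
    kept: List[str] = []
    kept_shingles: List[Set] = []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


policies = [
    "We collect your email address and name.",
    "We collect your email address and name!",
    "Cookies are used for analytics.",
]
print(len(drop_near_duplicates(policies)))
```

The pairwise comparison is quadratic in the number of documents, so a corpus of this size would need an approximate method such as hashing-based candidate selection instead.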
  3. Obeid, Iyad ; Selesnick, Ivan ; Picone, Joseph (Ed.)
    The Neural Engineering Data Consortium has recently developed a new subset of its popular open source EEG corpus – TUH EEG (TUEG) [1]. The TUEG Corpus is the world’s largest open source corpus of EEG data and currently has over 3,300 subscribers. There are several valuable subsets of this data, including the TUH Seizure Detection Corpus (TUSZ) [2], which was featured in the Neureka 2020 Epilepsy Challenge [3]. In this poster, we present a new subset of the TUEG Corpus – the TU Artifact Corpus. This corpus contains 310 EEG files in which every artifact has been annotated. This data can be used to evaluate artifact reduction technology. Since TUEG consists of actual clinical data, the set of artifacts appearing in the data is rich and challenging. EEG artifacts are defined as waveforms that are not of cerebral origin and may be affected by numerous external and/or physiological factors. These extraneous signals are often mistaken for seizures due to their morphological similarity in amplitude and frequency [4]. Artifacts often lead to raised false alarm rates in machine learning systems, which poses a major challenge for machine learning research. Most state-of-the-art systems use some form of artifact reduction technology to suppress these events. The corpus was annotated using a five-way classification that was developed to meet the needs of our constituents. Brief descriptions of each form of the artifact are provided in Ochal et al. [4]. The five basic tags are: • Chewing (CHEW): An artifact resulting from the tensing and relaxing of the jaw muscles. Chewing is a subset of the muscle artifact class. Chewing has the same characteristic high-frequency sharp waves with 0.5 sec baseline periods between bursts. This artifact is generally diffuse throughout the different regions of the brain. However, it might have a higher level of activity in one hemisphere. 
Classification of a muscle artifact as chewing often depends on whether the accompanying patient report mentions any chewing, since other muscle artifacts can appear superficially similar to chewing artifact. • Electrode (ELEC): An electrode artifact encompasses various electrode-related artifacts. Electrode pop is an artifact characterized by channels using the same electrode “spiking” with an electrographic phase reversal. Electrostatic is an artifact caused by movement or interference of electrodes and/or the presence of dissimilar metals. A lead artifact is caused by the movement of electrodes from the patient’s head and/or poor connection of electrodes. This results in disorganized and high-amplitude slow waves. • Eye Movement (EYEM): A spike-like waveform created during patient eye movement. This artifact is usually found on all of the frontal polar electrodes with occasional echoing on the frontal electrodes. • Muscle (MUSC): A common artifact with high-frequency, sharp waves corresponding to patient movement. These waveforms tend to have a frequency above 30 Hz with no specific pattern, often occurring because of agitation in the patient. • Shiver (SHIV): A specific and sustained sharp wave artifact that occurs when a patient shivers, usually seen on all or most channels. Shivering is a relatively rare subset of the muscle artifact class. Since these artifacts can overlap in time, a concatenated label format was implemented as a compromise between the limitations of our annotation tool and the complexity needed in an annotation data structure used to represent these overlapping events. We distribute an XML format that easily handles overlapping events. Our annotation tool [5], like most annotation tools of this type, is limited to displaying and manipulating a flat or linear annotation. 
Therefore, we encode overlapping events as a series of concatenated names using symbols such as: • EYEM+CHEW: eye movement and chewing • EYEM+SHIV: eye movement and shivering • CHEW+SHIV: chewing and shivering An example of an overlapping annotation is shown below in Figure 1. This release is an update of TUAR v1.0.0, which was a partially annotated database. In v1.0.0, a similar five-way system was used, along with an additional “null” tag. The “null” tag covers anything that was not annotated, including instances of artifact. Only a limited number of artifacts were annotated in v1.0.0. In this updated version, every instance of an artifact is annotated; ultimately, this provides the user with confidence that any part of the record that is not annotated with one of the five classes does not contain an artifact. No new files, patients, or sessions were added in v2.0.0. However, the data was reannotated with these standards. The total number of files remains the same, but the number of artifact events increases significantly. Complete statistics will be provided on the corpus once annotation is complete and the data is released. This is expected to occur in early July – just after the IEEE SPMB submission deadline. The TUAR Corpus is an open-source database that is currently available for use by any registered member of our consortium. To register and receive access, please follow the instructions provided at this web page: https://www.isip.piconepress.com/projects/tuh_eeg/html/downloads.shtml. The data is located here: https://www.isip.piconepress.com/projects/tuh_eeg/downloads/tuh_eeg_artifact/v2.0.0/. 
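To make the concatenated label format concrete, the sketch below splits a flat label such as EYEM+CHEW back into the five basic tags and expands flat annotations into one event per tag. The helper names and the `(start, stop, label)` tuple layout are hypothetical. They do not reflect the annotation tool's file format or the distributed XML format, neither of which is described here.

```python
from typing import List, Tuple

BASE_TAGS = {"CHEW", "ELEC", "EYEM", "MUSC", "SHIV"}  # the five basic tags

FlatEvent = Tuple[float, float, str]  # (start_sec, stop_sec, label); assumed layout


def split_label(label: str) -> List[str]:
    """Split a concatenated label like 'EYEM+CHEW' into its basic tags."""
    parts = [p.strip().upper() for p in label.split("+")]
    unknown = [p for p in parts if p not in BASE_TAGS]
    if unknown:
        raise ValueError(f"unknown artifact tag(s): {unknown}")
    return parts


def expand_events(flat: List[FlatEvent]) -> List[FlatEvent]:
    """Turn each flat, possibly concatenated event into one event per tag."""
    per_tag: List[FlatEvent] = []
    for start, stop, label in flat:
        for tag in split_label(label):
            per_tag.append((start, stop, tag))
    return per_tag


flat_annotation = [(1.0, 4.5, "EYEM+CHEW"), (6.0, 7.0, "MUSC")]
for event in expand_events(flat_annotation):
    print(event)
```

Expanding the labels this way yields overlapping single-tag events, which is the kind of structure a flat annotation cannot display directly.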
  4.
    Text categorization is an essential task in Web content analysis. Considering the ever-evolving Web data and new emerging categories, instead of the laborious supervised setting, in this paper, we focus on the minimally-supervised setting that aims to categorize documents effectively, with a couple of seed documents annotated per category. We recognize that texts collected from the Web are often structure-rich, i.e., accompanied by various metadata. One can easily organize the corpus into a text-rich network, joining raw text documents with document attributes, high-quality phrases, label surface names as nodes, and their associations as edges. Such a network provides a holistic view of the corpus’ heterogeneous data sources and enables a joint optimization for network-based analysis and deep textual model training. We therefore propose a novel framework for minimally supervised categorization by learning from the text-rich network. Specifically, we jointly train two modules with different inductive biases – a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning. Each module generates pseudo training labels from the unlabeled document set, and both modules mutually enhance each other by co-training using pooled pseudo labels. We test our model on two real-world datasets. On the challenging e-commerce product categorization dataset with 683 categories, our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%, significantly outperforming all compared methods; our accuracy is only less than 2% away from the supervised BERT model trained on about 50K labeled documents. 
  5. In civil and construction engineering education research, a focus has been on using 3D models to support students’ design comprehension. Despite this trend, the predominant mode of design communication in the industry relies on 2D plans and specifications, which typically supersede other modes of communication. Rather than focusing on the presentation of less common 3D content as an input to support students’ understanding of a design, this paper explores more common 2D inputs, but compares different visualization formats of student output in two educational interventions. In the first intervention, students document a construction sequence for wood-framed elements in a 2D worksheet format. In the second, students work with the same wood-framed design, but document their sequence through an augmented reality (AR) format where their physical interactions move full-scale virtual elements as if they were physically constructing the wood frame. Student approaches and performance were analyzed using qualitative attribute coding of video, audio, and written documentation of the student experience. Overall, results showed that the 2D worksheet format was simple to implement and was not mentally demanding to complete, but often corresponded with a lack of critical checks and a lack of mistake recognition from the students. The AR approach challenged students more in terms of cognitive load and completion rates but showed the potential for facilitating mistake recognition and self-remediation through visualization. These results suggest that when students are tasked with conceptualizing construction sequences from 2D documentation, the cognitive challenges associated with documenting a sequence in AR may support their recognition of their own mistakes in ways that may not be effectively supported through 2D documentation as an output for documenting and planning a construction sequence. 
The results presented in this paper provide insights on student tendencies, behaviors, and perceptions related to defining construction sequences from 2D documentation in order for educators to make informed decisions regarding the use of similar learning activities to prepare their students for understanding the 2D design documents used in industry. 