

Title: Modeling Updates of Scholarly Webpages Using Archived Data
The vastness of the web imposes a prohibitive cost on building large-scale search engines with limited resources. Crawl frontiers thus need to be optimized to improve the coverage and freshness of crawled content. In this paper, we propose an approach for modeling the dynamics of change in the web using archived copies of webpages. To evaluate its utility, we conduct a preliminary study on the scholarly web using 19,977 seed URLs of authors’ homepages obtained from their Google Scholar profiles. We first obtain archived copies of these webpages from the Internet Archive (IA), and estimate when their actual updates occurred. Next, we apply maximum likelihood to estimate their mean update frequency (λ) values. Our evaluation shows that λ values derived from a short history of archived data provide a good estimate of the true update frequency in the short term, and that our method provides better estimates of updates at a fraction of the resources required by the baseline models. Based on this, we demonstrate the utility of archived data for optimizing the crawling strategy of web crawlers, and uncover important challenges that inspire future research directions.
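As a concrete illustration of the estimation step above (a minimal sketch under a simple Poisson change model, not the paper's exact pipeline; the function name and example timestamps are hypothetical), the maximum likelihood estimate of the mean update frequency λ is the number of updates inferred from archived copies divided by the length of the observation window:

    from datetime import datetime

    def estimate_lambda(update_times, window_start, window_end):
        """MLE of the Poisson update rate (updates per day): the number of
        detected updates divided by the observation span in days."""
        span_days = (window_end - window_start).total_seconds() / 86400.0
        updates = [t for t in update_times if window_start <= t <= window_end]
        return len(updates) / span_days

    # Example: three updates inferred from Internet Archive copies over 90 days.
    start, end = datetime(2020, 1, 1), datetime(2020, 3, 31)
    observed = [datetime(2020, 1, 20), datetime(2020, 2, 14), datetime(2020, 3, 9)]
    lam = estimate_lambda(observed, start, end)
    print(f"lambda ~ {lam:.3f} updates/day; expected gap ~ {1.0 / lam:.1f} days")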
Award ID(s):
1823288
NSF-PAR ID:
10271904
Author(s) / Creator(s):
Date Published:
Journal Name:
2020 IEEE International Conference on Big Data (Big Data)
Page Range / eLocation ID:
1868-1877
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Obeid, I.; Selesnick, I. (Ed.)
    The Neural Engineering Data Consortium at Temple University has been providing key data resources to support the development of deep learning technology for electroencephalography (EEG) applications [1-4] since 2012. We currently have over 1,700 subscribers to our resources and have been providing data, software, and documentation from our web site [5] since 2012. In this poster, we introduce additions to our resources that have been developed within the past year to facilitate software development and big data machine learning research. Major resources released in 2019 include:
    ● Data: The most current release of our open source EEG data is v1.2.0 of TUH EEG and includes the addition of 3,874 sessions and 1,960 patients from mid-2015 through 2016.
    ● Software: We have recently released a package, PyStream, that demonstrates how to correctly read an EDF file and access samples of the signal. This software demonstrates how to properly decode channels based on their labels and how to implement montages. Most existing open source packages to read EDF files do not directly address the problem of channel labels [6].
    ● Documentation: We have released two documents that describe our file formats and data representations: (1) electrodes and channels [6]: describes how to map channel labels to physical locations of the electrodes, and includes a description of every channel label appearing in the corpus; (2) annotation standards [7]: describes our annotation file format and how to decode the data structures used to represent the annotations.
    Additional significant updates to our resources include:
    ● NEDC TUH EEG Seizure (v1.6.0): This release includes the expansion of the training dataset from 4,597 files to 4,702. Calibration sequences have been manually annotated and added to our existing documentation. Numerous corrections were made to existing annotations based on user feedback.
    ● IBM TUSZ Pre-Processed Data (v1.0.0): A preprocessed version of the TUH Seizure Detection Corpus using two methods [8], both of which use an FFT sliding window approach (STFT). In the first method, FFT log magnitudes are used. In the second method, the FFT values are normalized across frequency buckets and correlation coefficients are calculated; the eigenvalues are calculated from this correlation matrix, and the eigenvalues and the upper triangle of the correlation matrix are used to generate features.
    ● NEDC TUH EEG Artifact Corpus (v1.0.0): This corpus was developed to support modeling of non-seizure signals for problems such as seizure detection. We have been using the data to build better background models. Five artifact events have been labeled: (1) eye movements (EYEM), (2) chewing (CHEW), (3) shivering (SHIV), (4) electrode pop, electrostatic artifacts, and lead artifacts (ELPP), and (5) muscle artifacts (MUSC). The data is cross-referenced to TUH EEG v1.1.0 so you can match patient numbers, sessions, etc.
    ● NEDC Eval EEG (v1.3.0): In this release of our standardized scoring software, the False Positive Rate (FPR) definition of the Time-Aligned Event Scoring (TAES) metric has been updated [9]. The standard definition is the number of false positives divided by the number of false positives plus the number of true negatives: #FP / (#FP + #TN).
    We also recently introduced the ability to download our data from an anonymous rsync server. The rsync command [10] effectively synchronizes a remote directory with a local directory and copies the selected folder from the server to the desktop.
It is available as part of most, if not all, Linux and Mac distributions (unfortunately, there is not an acceptable port of this command for Windows). To use the rsync command to download the content from our website, both a username and password are needed. An automated registration process on our website grants both. An example of a typical rsync command to access our data on our website is:
    rsync -auxv nedc_tuh_eeg@www.isip.piconepress.com:~/data/tuh_eeg/
Rsync is a more robust option for downloading data. We have also experimented with Google Drive and Dropbox, but these types of technology are not suitable for such large amounts of data. All of the resources described in this poster are open source and freely available at https://www.isip.piconepress.com/projects/tuh_eeg/downloads/. We will demonstrate how to access and utilize these resources during the poster presentation and collect community feedback on the most needed additions to enable significant advances in machine learning performance.
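    As a rough sketch of the second pre-processing method summarized above for the IBM TUSZ data (our own illustration, not the official pipeline; the function name, window length, hop size, and normalization details are assumptions), correlation-based features can be computed per STFT window as follows:

        import numpy as np

        def stft_correlation_features(eeg, win=256, hop=128):
            """eeg: array of shape (n_channels, n_samples); returns one feature
            vector (eigenvalues + upper triangle of the channel correlation
            matrix) per sliding FFT window."""
            n_ch, n_samp = eeg.shape
            features = []
            for start in range(0, n_samp - win + 1, hop):
                spec = np.abs(np.fft.rfft(eeg[:, start:start + win], axis=1))
                # normalize each channel's FFT values across frequency buckets
                # (the exact normalization used in [8] may differ)
                spec = (spec - spec.mean(axis=1, keepdims=True)) \
                       / (spec.std(axis=1, keepdims=True) + 1e-8)
                corr = np.corrcoef(spec)                   # (n_ch, n_ch)
                eigvals = np.linalg.eigvalsh(corr)         # symmetric matrix
                upper = corr[np.triu_indices(n_ch, k=1)]
                features.append(np.concatenate([eigvals, upper]))
            return np.asarray(features)

        # Synthetic example: 22 channels, 10 s at 256 Hz -> (n_windows, 253)
        print(stft_correlation_features(np.random.randn(22, 2560)).shape)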
  2. It takes great effort to manually or semi-automatically convert free-text phenotype narratives (e.g., morphological descriptions in taxonomic works) to a computable format before they can be used in large-scale analyses. We argue that neither a manual curation approach nor an information extraction approach based on machine learning is a sustainable solution for producing computable phenotypic data that are FAIR (Findable, Accessible, Interoperable, Reusable) (Wilkinson et al. 2016). This is because these approaches do not scale to all biodiversity, and they do not stop the publication of free-text phenotypes that would need post-publication curation. In addition, both manual and machine learning approaches face great challenges: the problem of inter-curator variation (curators interpret/convert a phenotype differently from each other) in manual curation, and keyword-to-ontology-concept translation in automated information extraction, make it difficult for either approach to produce data that are truly FAIR. Our empirical studies show that inter-curator variation in translating phenotype characters to Entity-Quality statements (Mabee et al. 2007) is as high as 40% even within a single project. With this level of variation, curated data integrated from multiple curation projects may still not be FAIR. The key causes of this variation have been identified as semantic vagueness in original phenotype descriptions and difficulties in using standardized vocabularies (ontologies). We argue that the authors describing characters are the key to the solution. Given the right tools and appropriate attribution, the authors should be in charge of developing a project's semantics and ontology. This will speed up ontology development and improve the semantic clarity of the descriptions from the moment of publication. In this presentation, we will introduce the Platform for Author-Driven Computable Data and Ontology Production for Taxonomists, which consists of three components: a web-based, ontology-aware software application called 'Character Recorder,' which features a spreadsheet as the data entry platform and gives authors the flexibility of using their preferred terminology in recording characters for a set of specimens (this application also facilitates semantic clarity and consistency across species descriptions); a set of services that produces RDF graph data, collects terms added by authors, detects potential conflicts between terms, dispatches conflicts to the third component, and updates the ontology with resolutions; and an Android mobile application, 'Conflict Resolver,' which displays ontological conflicts and accepts solutions proposed by multiple experts. Fig. 1 shows the system diagram of the platform.
The presentation will consist of: a report on the findings from a recent survey of 90+ participants on the need for a tool like Character Recorder; a methods section that describes how we provide semantics to an existing vocabulary of quantitative characters through a set of properties that explain where and how a measurement (e.g., length of perigynium beak) is taken, and how a custom color palette of RGB values obtained from real specimens or high-quality specimen images can be used to help authors choose standardized color descriptions for plant specimens; and a software demonstration, where we show how Character Recorder and Conflict Resolver can work together to construct both human-readable descriptions and RDF graphs using morphological data derived from species in the plant genus Carex (sedges). The key difference of this system from other ontology-aware systems is that authors can directly add needed terms to the ontology as they wish and can update their data according to ontology updates. The software modules currently incorporated in Character Recorder and Conflict Resolver have undergone formal usability studies. We are actively recruiting Carex experts to participate in a 3-day usability study of the entire Platform for Author-Driven Computable Data and Ontology Production for Taxonomists. Participants will use the platform to record 100 characters about one Carex species. In addition to usability data, we will collect the terms that participants submit to the underlying ontology and the data related to conflict resolution. Such data allow us to examine the types and quantities of logical conflicts that may result from the terms added by the users and to use Discrete Event Simulation models to understand if and how term additions and conflict resolutions converge. We look forward to a discussion on how the tools described in our presentation (Character Recorder is online at http://shark.sbs.arizona.edu/chrecorder/public) can contribute to producing and publishing FAIR data in taxonomic studies.
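    To illustrate the kind of RDF graph data such services could emit (a hypothetical sketch using rdflib; the namespace, property names, and example measurement are placeholders rather than the platform's actual vocabulary):

        from rdflib import Graph, Literal, Namespace, URIRef
        from rdflib.namespace import RDF, XSD

        EX = Namespace("http://example.org/carex/")   # placeholder namespace

        g = Graph()
        specimen = URIRef(EX["specimen/001"])
        obs = URIRef(EX["observation/001"])

        g.add((obs, RDF.type, EX.CharacterObservation))
        g.add((obs, EX.ofSpecimen, specimen))
        g.add((obs, EX.character, Literal("perigynium beak length")))
        g.add((obs, EX.valueInMillimeters, Literal(1.2, datatype=XSD.decimal)))

        print(g.serialize(format="turtle"))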
  3. Abstract

    Population cycles can be caused by consumer–resource interactions. Confirming the role of consumer–resource interactions, however, can be challenging due to an absence of data for the resource candidate. For example, interactions between midge larvae and benthic algae likely govern the high-amplitude population fluctuations of Tanytarsus gracilentus in Lake Mývatn, Iceland, but there are no records of benthic resources concurrent with adult midge population counts. Here, we investigate consumer population dynamics using the carbon stable isotope signatures of archived T. gracilentus specimens collected from 1977 to 2015, under the assumption that midge δ13C values reflect those of resources they consumed as larvae. We used the time series for population abundance and δ13C to estimate interactions between midges and resources while accounting for measurement error and possible preservation effects on isotope values. Results were consistent with consumer–resource interactions: high δ13C values preceded peaks in the midge population, and δ13C values tended to decline after midges reached high abundance. One interpretation of this dynamic coupling is that midge isotope signatures reflect temporal variation in benthic algal δ13C values, which we expected to mirror primary production. Following from this explanation, high benthic production (enriched δ13C values) would contribute to increased midge abundance, and high midge abundance would result in declining benthic production (depleted δ13C values). An additional and related explanation is that midges deplete benthic algal abundance once they reach peak densities, causing midges to increase their relative reliance on other resources including detritus and associated microorganisms. Such a shift in resource use would be consistent with the subsequent decline in midge δ13C values. Our study adds evidence that midge–resource interactions drive T. gracilentus fluctuations and demonstrates a novel application of stable isotope time-series data to understand consumer population dynamics.
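    To make the hypothesized coupling concrete (a purely illustrative simulation written for this summary, not the authors' fitted model; the coefficients and noise levels are arbitrary assumptions), a two-variable stochastic model in which the resource proxy boosts consumer growth and high consumer abundance depresses the resource reproduces the qualitative lead-lag pattern described above:

        import numpy as np

        rng = np.random.default_rng(0)
        years = 40
        log_midge = np.zeros(years)   # log consumer (midge) abundance
        d13c = np.zeros(years)        # centered resource proxy (delta 13C)

        b_rc = 0.8    # assumed effect of the resource proxy on midge growth
        b_cr = -0.6   # assumed effect of midge abundance on next year's resource

        for t in range(years - 1):
            log_midge[t + 1] = 0.5 * log_midge[t] + b_rc * d13c[t] + rng.normal(0, 0.3)
            d13c[t + 1] = 0.5 * d13c[t] + b_cr * log_midge[t] + rng.normal(0, 0.3)

        # With b_rc > 0 and b_cr < 0 the system shows quasi-cycles in which
        # peaks in d13c tend to lead peaks in log_midge.
        print(np.corrcoef(d13c[:-1], log_midge[1:])[0, 1])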

     
  4. Phytoplankton samples for the 4 southern Wisconsin LTER lakes (Mendota, Monona, Wingra, Fish) have been collected for analysis by LTER since 1995 (1996 Wingra, Fish) when the southern Wisconsin lakes were added to the North Temperate Lakes LTER project. Samples are collected as a composite whole-water sample and are preserved in glutaraldehyde. Composite sample depths are 0-8 meters for Lake Mendota (to conform to samples collected and analyzed since 1990 for a UW/DNR food web research study), and 0-2 meters for the other three lakes. A tube sampler is used for the 0-8 m Lake Mendota samples; samples for the other lakes are obtained by collecting water at 1-meter intervals using a Kemmerer water sampler and compositing the samples in a bucket. Samples are taken in the deep hole region of each lake at the same time and location as other limnological sampling. Phytoplankton samples are analyzed by PhycoTech, Inc., a private lab specializing in phytoplankton analyses (see data protocol for procedures). Samples for Wingra and Fish lakes are archived but not routinely counted. Permanent slide mounts (3 per sample) are prepared for all analyzed Mendota and Monona samples as well as 6 samples per year for Wingra and Fish; the slide mounts are archived at the University of Wisconsin - Madison Zoology Museum. Phytoplankton are identified to species using an inverted microscope (Utermohl technique) and are reported as natural unit (i.e., colonies, filaments, or single cells) densities per mL, cell densities per mL, and algal biovolume densities per mL. Multiple entries for the same species on the same date may be due to different variants or vegetative states (e.g., colonial or attached vs. free cells). Biovolumes for individual cells of each species are determined during the counting procedure by obtaining cell measurements needed to calculate volumes for geometric solids (e.g., cylinders, spheres, truncated cones) corresponding to actual cell shapes. Biovolume concentrations are then computed by multiplying the average cell biovolume by the cell density in the water sample. Note that one million cubic micrometers of biovolume per milliliter of water equals a biovolume concentration of one cubic millimeter per liter. Assuming a cell density equal to water, one cubic millimeter per liter of biovolume converts to a biomass concentration of one milligram per liter. Sampling frequency: bi-weekly during the ice-free season from late March or early April through early September, then every 4 weeks through late November; sampling is usually conducted once during the winter (depending on ice conditions). Number of sites: 4. Several taxonomic updates were made to this dataset in February 2013; see methods for details.
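    As a quick check of the unit conversions stated above (illustrative numbers only, not LTER data):

        # biovolume concentration = cell density x mean cell biovolume
        cells_per_ml = 5000.0        # hypothetical cell density
        um3_per_cell = 200.0         # hypothetical mean cell biovolume (um^3)

        biovolume_um3_per_ml = cells_per_ml * um3_per_cell          # 1.0e6 um^3/mL
        biovolume_mm3_per_l = biovolume_um3_per_ml * 1e-9 * 1e3     # um^3 -> mm^3, /mL -> /L
        biomass_mg_per_l = biovolume_mm3_per_l * 1.0                # water: ~1 mg per mm^3

        print(biovolume_um3_per_ml, biovolume_mm3_per_l, biomass_mg_per_l)
        # -> 1000000.0 1.0 1.0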
  5. Traditionally, web applications have been written as HTML pages with embedded JavaScript code that implements dynamic and interactive features by manipulating the Document Object Model (DOM) through a low-level browser API. However, this unprincipled approach leads to code that is brittle, difficult to understand, and non-modular, and that does not facilitate incremental updates of user interfaces in response to state changes. React is a popular framework for constructing web applications that aims to overcome these problems. React applications are written in a declarative and object-oriented style, and consist of components that are organized in a tree structure. Each component has a set of properties representing input parameters, a state consisting of values that may vary over time, and a render method that declaratively specifies the subcomponents of the component. React’s concept of reconciliation determines the impact of state changes and updates the user interface incrementally by selective mounting and unmounting of subcomponents. At designated points, the React framework invokes lifecycle hooks that enable programmers to perform actions outside the framework, such as acquiring and releasing resources needed by a component. These mechanisms exhibit considerable complexity, but, to our knowledge, no formal specification of React’s semantics exists. This paper presents a small-step operational semantics that captures the essence of React, as a first step towards a long-term goal of developing automatic tools for program understanding, automatic testing, and bug finding for React web applications. To demonstrate that key operations such as mounting, unmounting, and reconciliation terminate, we define the notion of a well-behaved component and prove that well-behavedness is preserved by these operations.
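    To make these mechanisms concrete (a toy Python model written for this summary, not the paper's formal small-step semantics and not React's actual implementation; the class and hook names merely mimic React's API), the sketch below models components with props, state, and render, plus mounting, unmounting, and a positional reconciliation step:

        class Component:
            def __init__(self, props):
                self.props, self.state, self.children = props, {}, []
            def render(self):                  # returns a list of (ComponentClass, props)
                return []
            def component_did_mount(self):     # lifecycle hook: e.g., acquire resources
                pass
            def component_will_unmount(self):  # lifecycle hook: e.g., release resources
                pass

        def mount(cls, props):
            inst = cls(props)
            inst.children = [mount(c, p) for c, p in inst.render()]
            inst.component_did_mount()
            return inst

        def unmount(inst):
            for child in inst.children:
                unmount(child)
            inst.component_will_unmount()

        def reconcile(inst):
            # Re-render and diff old children against the new child specs: a child
            # of the same class at the same position is updated in place and
            # reconciled recursively; anything else is unmounted and replaced.
            new_specs, new_children = inst.render(), []
            for i, (cls, props) in enumerate(new_specs):
                old = inst.children[i] if i < len(inst.children) else None
                if old is not None and type(old) is cls:
                    old.props = props
                    reconcile(old)
                    new_children.append(old)
                else:
                    if old is not None:
                        unmount(old)
                    new_children.append(mount(cls, props))
            for leftover in inst.children[len(new_specs):]:
                unmount(leftover)
            inst.children = new_children

        def set_state(inst, **updates):        # a state change triggers reconciliation
            inst.state.update(updates)
            reconcile(inst)

        class Item(Component):
            def component_did_mount(self):
                print("mounted", self.props["label"])
            def component_will_unmount(self):
                print("unmounted", self.props["label"])

        class App(Component):
            def render(self):
                return [(Item, {"label": f"item-{i}"})
                        for i in range(self.state.get("n", 1))]

        app = mount(App, {})   # mounts item-0
        set_state(app, n=3)    # incremental update: mounts only item-1 and item-2
        set_state(app, n=1)    # unmounts item-1 and item-2, keeps item-0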