Title: Allan Variance-based Granulation Technique for Large Temporal Databases
Abstract: To decrease query response time with limited main memory and storage space, data reduction techniques that preserve data quality are needed. Existing data reduction techniques, however, are often computationally expensive and rely on heuristics to decide how to split or reduce the original dataset. In this paper, we propose an effective granular data reduction technique for temporal databases based on the Allan variance (AVAR). AVAR is used to systematically determine the temporal window length over which data remain relevant. The dataset to be reduced is then partitioned into granules whose size equals the AVAR-determined window length, and data reduction is achieved by generating aggregated information for each granule. The proposed method is tested on a large database of temporal vehicular data, and its runtime performance is compared against three clustering-based data reduction methods. The results demonstrate that the proposed Allan variance-based technique efficiently generates a reduced representation of the original data without losing data quality, while significantly reducing computation time.
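To make the pipeline concrete, the following is a minimal Python sketch of AVAR-guided granulation. It assumes evenly sampled data; the function names (avar, granulate) and the mean/min/max aggregation are illustrative choices, not the paper's exact implementation.

```python
# Hedged sketch: Allan variance-guided granulation of a time series.
import numpy as np

def avar(x, m):
    """Non-overlapping Allan variance of series x at window size m samples."""
    n = len(x) // m
    means = x[:n * m].reshape(n, m).mean(axis=1)   # per-window averages
    return 0.5 * np.mean(np.diff(means) ** 2)      # AVAR definition

def granulate(x, window_sizes):
    """Pick the AVAR-minimizing window, then aggregate each granule."""
    best_m = min(window_sizes, key=lambda m: avar(x, m))
    n = len(x) // best_m
    granules = x[:n * best_m].reshape(n, best_m)
    # One summary row per granule: mean, min, max -- an illustrative aggregation.
    return best_m, np.column_stack(
        [granules.mean(axis=1), granules.min(axis=1), granules.max(axis=1)]
    )

rng = np.random.default_rng(0)
signal = np.cumsum(rng.normal(size=100_000)) + rng.normal(size=100_000)
m, reduced = granulate(signal, window_sizes=[2**k for k in range(1, 12)])
print(f"window = {m} samples, reduced to {len(reduced)} granules")
```

For a signal that mixes a slow drift with fast noise, the AVAR curve typically has an interior minimum, and that minimum supplies the granule length.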
Award ID(s):
1932138
PAR ID:
10299310
Author(s) / Creator(s):
Date Published:
Journal Name:
13th International Conference on Knowledge Management and Information Systems
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Moving averages are widely used to estimate time-varying parameters, especially when the underlying dynamic model is unknown or uncertain. However, selecting the optimal window length over which to evaluate the moving average remains an unresolved issue in the field. In this paper, we demonstrate the use of the Allan variance to identify the characteristic timescales of a noisy random walk from historical measurements. Further, we provide a closed-form analytical result showing that the Allan variance-informed averaging window is indeed optimal for moving-average estimation of noisy random walks. We complement the analytical proof with supporting numerical results, which are also reflected in the authors' related works. This systematic methodology for selecting the optimal averaging window length using the Allan variance should benefit practitioners in the many fields that apply moving-average estimation to noisy random walk signals.
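A minimal numerical sketch of this result, under assumed noise scales (all names here are illustrative): simulate a random walk observed in white noise, choose the window that minimizes the Allan variance, and use it as the moving-average length.

```python
# Hedged sketch: AVAR-informed moving-average window for a noisy random walk.
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
truth = np.cumsum(rng.normal(scale=0.05, size=n))   # random walk state
obs = truth + rng.normal(scale=1.0, size=n)         # noisy measurements

def allan_variance(x, m):
    k = len(x) // m
    ybar = x[:k * m].reshape(k, m).mean(axis=1)
    return 0.5 * np.mean(np.diff(ybar) ** 2)

candidates = np.unique(np.logspace(0, 3, 30).astype(int))
m_star = min(candidates, key=lambda m: allan_variance(obs, m))

# Moving-average estimate of the state with the AVAR-informed window.
kernel = np.ones(m_star) / m_star
estimate = np.convolve(obs, kernel, mode="same")
mse = np.mean((estimate - truth) ** 2)
print(f"AVAR-optimal window: {m_star} samples, MSE = {mse:.4f}")
```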
  2. Persistent homology computes topological features of a space at different spatial resolutions and is one of the main tools from computational topology applied to data analysis. Despite several attempts to reduce its complexity, persistent homology remains expensive in both time and space, to the point that the largest datasets to which the method can be applied contain on the order of thousands of points in R^3. This paper explores a technique for reducing the number of data points while preserving the salient topological features of the data. The proposed technique enables the computation of persistent homology on a reduced version of the original input data without affecting significant components of the output. Since the run time of persistent homology grows exponentially with the number of data points, the proposed data reduction method completes the computation in a fraction of the time required for the original data. Moreover, the method can be combined with any existing technique that simplifies the computation of persistent homology. The reduction is performed by creating small groups of similar data points, called nano-clusters, and then replacing the points within each nano-cluster with its cluster center; a sketch of this step appears below. The persistent homology of the reduced data differs from that of the original data by an amount bounded by the radius of the nano-clusters. The theoretical analysis is backed by experimental results showing that persistent homology is preserved by the proposed data reduction technique.
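The nano-clustering step can be illustrated with a short greedy sketch: cover the point cloud with balls of radius r and keep one representative per ball, so every original point lies within r of some representative. The greedy strategy and names are assumptions for illustration; the paper's exact clustering may differ.

```python
# Hedged sketch: greedy nano-cluster reduction of a point cloud.
import numpy as np

def nano_cluster_reduce(points, r):
    """Return representatives such that every input point lies within r of one."""
    centers = []
    for p in points:
        if not centers or np.min(
            np.linalg.norm(np.asarray(centers) - p, axis=1)
        ) > r:
            centers.append(p)                 # start a new nano-cluster
    return np.asarray(centers)

rng = np.random.default_rng(2)
cloud = rng.normal(size=(5_000, 3))           # points in R^3
reduced = nano_cluster_reduce(cloud, r=0.3)
print(f"{len(cloud)} points -> {len(reduced)} representatives")
# Persistent homology of `reduced` (computed with, e.g., ripser or GUDHI)
# then differs from that of `cloud` by an amount controlled by r.
```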
  3. Sheeri, Abhay (Ed.)
    This research presents a comparative analysis of segmentation techniques for non-stationary spatial data, comparing fixed-length and dynamic segmentation in terms of feature-extraction efficiency. The study uses 5 miles of railway track geometry data, a non-stationary spatial dataset, to assess the effectiveness of both segmentation approaches; the profile (vertical alignment) of the track geometry is used for this purpose. For fixed-length segmentation, the track data is divided into segments of 264 feet (1/20th of a mile), resulting in about 102 segments. Dynamic segmentation is performed using an l2 model-based change point detection algorithm, which adapts to natural variations in the signal. Key features such as standard deviation, kurtosis, and energy are extracted under both segmentation methods. Performance is evaluated on multiple criteria: the discriminative power of the features for classifying track safety and ride-quality conditions, assessed with statistical tests such as the f-test and the Fisher score; the consistency of signal quality across segments, measured by the variance of the signal-to-noise ratio (SNR); and computational efficiency in terms of run time and memory usage. Results indicate that features from fixed-length segments demonstrate better discriminative power between the safety and ride-quality classes, with higher Fisher scores and f-values showing strong statistical significance (p < 0.05). Fixed-length segmentation also shows better performance, with lower run time and stable signal power across segments; a sketch of this fixed-length pipeline appears below.
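A rough illustration of the fixed-length pipeline follows. The sampling interval and the signal are assumptions (with one sample per foot, 264-ft segments yield 100 segments over 5 miles), and the feature set matches the three named above.

```python
# Hedged sketch: fixed-length segmentation and feature extraction.
import numpy as np
from scipy.stats import kurtosis

def fixed_length_features(profile, seg_len):
    n = len(profile) // seg_len
    segs = profile[:n * seg_len].reshape(n, seg_len)
    return np.column_stack([
        segs.std(axis=1),                     # standard deviation
        kurtosis(segs, axis=1),               # excess kurtosis
        np.sum(segs ** 2, axis=1),            # signal energy
    ])

# Assumed: one sample per foot, so 5 miles of profile is 5 * 5280 samples.
rng = np.random.default_rng(3)
profile = np.cumsum(rng.normal(scale=0.01, size=5 * 5280))
features = fixed_length_features(profile, seg_len=264)
print(features.shape)   # (100, 3): one feature row per segment
```

Dynamic segmentation could be prototyped in the same frame with an l2 change point detector such as the one in the ruptures library, in place of the fixed reshape.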
  4. To better fit real data, this paper considers both spatio-temporal correlation and heterogeneity when building the model. To overcome the "curse of dimensionality" inherent in nonparametric methods, we improve the estimation method of the single-index model and combine it with the correlation and heterogeneity of the spatio-temporal model to obtain a good estimation method. Assuming that the spatio-temporal process satisfies the α-mixing condition, a nonparametric procedure is developed for estimating the variance function based on either a fully nonparametric form or a dimension-reduction structure, and the resulting estimator is consistent. A reweighted estimate of the parametric component is then obtained by taking the estimated variance function into account. The rate of convergence and the asymptotic normality of the new estimators are established under mild conditions. Simulation studies are conducted to evaluate the efficacy of the proposed methodologies, and a case study on estimating the air quality evaluation index in Nanjing is provided for illustration.
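For concreteness, a plausible statement of the single-index variance structure the abstract alludes to, in notation chosen here for illustration rather than taken from the paper:

```latex
% Illustrative model statement; the notation is ours, not necessarily the paper's.
\[
  Y_{it} = m(X_{it}) + \sigma(X_{it})\,\varepsilon_{it},
  \qquad
  \sigma^2(x) = g\!\left(\theta^{\top} x\right),
\]
% where g is an unknown univariate link and \theta an index vector, so the
% d-dimensional variance function reduces to a one-dimensional smoothing
% problem, sidestepping the curse of dimensionality; the estimated
% \hat{\sigma}^2 then supplies the weights for the reweighted estimator of
% the parametric component.
```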
  5. Test-case prioritization (TCP) aims to detect regression bugs faster by reordering the tests to be run. While TCP has been studied for over 20 years, it has almost always been evaluated using seeded faults/mutants rather than real test failures. In this work, we study the recent change-aware information retrieval (IR) technique for TCP. Prior work has shown it performing better than traditional coverage-based TCP techniques, but it was only evaluated on a small-scale dataset with a cost-unaware metric based on seeded faults/mutants. We extend the prior work by conducting a much larger and more realistic evaluation and by proposing enhancements that substantially improve performance. In particular, we evaluate the original technique on a large-scale, real-world software-evolution dataset with real failures, using both cost-aware and cost-unaware metrics under various configurations. We also design and evaluate hybrid techniques combining the IR features, historical test execution time, and test failure frequencies. Our results show that the change-aware IR technique outperforms state-of-the-art coverage-based techniques in this real-world setting, and that our hybrid techniques improve even further upon the original IR technique. Moreover, we show that flaky tests have a substantial impact on evaluating change-aware TCP techniques based on real test failures.
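One hedged sketch of how such a hybrid score might be assembled (the field names, weights, and combination rule are illustrative assumptions, not the paper's formula):

```python
# Hedged sketch: a hybrid test-prioritization score combining an IR
# relevance signal with historical failure frequency and execution time.
from dataclasses import dataclass

@dataclass
class Test:
    name: str
    ir_score: float       # change-aware IR relevance to the current change
    fail_rate: float      # historical failure frequency in [0, 1]
    avg_time_s: float     # average execution time in seconds

def hybrid_score(t: Test, w_ir=0.6, w_fail=0.3, w_time=0.1):
    # Cheaper tests get a small boost so failures surface earlier per unit cost.
    return w_ir * t.ir_score + w_fail * t.fail_rate + w_time / (1 + t.avg_time_s)

tests = [
    Test("test_parser", ir_score=0.9, fail_rate=0.05, avg_time_s=2.0),
    Test("test_io",     ir_score=0.2, fail_rate=0.30, avg_time_s=0.5),
    Test("test_ui",     ir_score=0.4, fail_rate=0.01, avg_time_s=30.0),
]
for t in sorted(tests, key=hybrid_score, reverse=True):
    print(t.name, round(hybrid_score(t), 3))
```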