Title: Change Point Detection in a Dynamic Stochastic Blockmodel
Abstract. We study a change point detection scenario for a dynamic community graph model, which is formed by adding new vertices and randomly attaching them to the existing nodes. The goal of this work is to design a test statistic to detect the merging of communities without solving the problem of identifying the communities. We propose a test that can ascertain when the connectivity between the balanced communities is changing. In addition to the theoretical analysis of the test statistic, we perform Monte Carlo simulations of the dynamic stochastic blockmodel to demonstrate that our test can detect changes in graph topology, and we study a dynamic social-contact graph.
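The growth model described in the abstract can be illustrated with a small Monte Carlo sketch. This is not the paper's test statistic, which the abstract does not specify. It assumes two balanced communities, a within-community attachment probability p_in, and a cross-community probability q that jumps at a known step; all of these names and values are hypothetical. The quantity tracked, the normalized degree of each newly added vertex, is a simple stand-in chosen only because it needs no community labels.

```python
import random


def simulate_growth(n_steps, p_in, q_before, q_after, change_step, seed=0):
    """Grow a two-community graph one vertex at a time.

    Returns, for each new vertex, its number of edges divided by the
    number of vertices already present. Labels are used only to
    generate the graph; the returned statistic never reads them.
    All parameters are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    labels = [0, 1]  # start with one vertex in each community
    normalized_degrees = []
    for step in range(n_steps):
        # Cross-community attachment probability changes at change_step.
        q = q_before if step < change_step else q_after
        new_label = rng.randrange(2)  # balanced communities
        degree = 0
        for label in labels:
            prob = p_in if label == new_label else q
            if rng.random() < prob:
                degree += 1
        normalized_degrees.append(degree / len(labels))
        labels.append(new_label)
    return normalized_degrees


def window_mean(values, start, stop):
    """Average of values[start:stop]."""
    return sum(values[start:stop]) / (stop - start)


degrees = simulate_growth(
    n_steps=600, p_in=0.3, q_before=0.02, q_after=0.2, change_step=300
)
before = window_mean(degrees, 200, 300)  # window just before the change
after = window_mean(degrees, 300, 400)   # window just after the change
print(f"mean normalized degree before: {before:.3f}, after: {after:.3f}")
```

Under these assumptions the expected normalized degree is roughly (p_in + q) / 2, so raising q, which moves the two communities toward merging, shifts the windowed mean upward. A practical detector would compare such windows against a calibrated threshold rather than a known change step.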
Award ID(s): 1815971
PAR ID: 10137373
Author(s) / Creator(s):
Date Published:
Journal Name: Studies in computational intelligence
Volume: 1
ISSN: 1860-949X
Page Range / eLocation ID: 211-222
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Hypothesis testing is one of the most common types of data analysis and forms the backbone of scientific research in many disciplines. Analysis of variance (ANOVA) in particular is used to detect dependence between a categorical and a numerical variable. Here we show how one can carry out this hypothesis test under the restrictions of differential privacy. We show that the F-statistic, the optimal test statistic in the public setting, is no longer optimal in the private setting, and we develop a new test statistic F1 with much higher statistical power. We show how to rigorously compute a reference distribution for the F1 statistic and give an algorithm that outputs accurate p-values. We implement our test and experimentally optimize several parameters. We then compare our test to the only previous work on private ANOVA testing, using the same effect size as that work. We see an order of magnitude improvement, with our test requiring only 7% as much data to detect the effect.
  2. Physically unclonable hardware fingerprints can be used for device authentication. The photo-response non-uniformity (PRNU) is the most reliable hardware fingerprint of digital cameras and can be conveniently extracted from images. However, we find that image post-processing software may introduce extra noise into images. Part of this noise remains in the extracted PRNU fingerprints and is hard to eliminate with traditional approaches, such as denoising filters. We define this noise as software noise, which pollutes PRNU fingerprints and interferes with authenticating a camera-armed device. In this paper, we propose novel approaches for fingerprint matching, a critical step in device authentication, in the presence of software noise. We calculate the cross correlation between PRNU fingerprints of different cameras using a test statistic such as the Peak to Correlation Energy (PCE) so as to estimate software noise correlation. During fingerprint matching, we derive the ratio of the test statistic on two PRNU fingerprints of interest over the estimated software noise correlation. We denote this ratio as the fingerprint to software noise ratio (FITS), which allows us to detect the PRNU hardware noise correlation component in the test statistic for fingerprint matching. Extensive experiments over 10,000 images taken by more than 90 smartphones are conducted to validate our approaches, which significantly outperform the state-of-the-art approaches for polluted fingerprints. We are the first to study fingerprint matching in the presence of software noise.
  3. In pretraining data detection, the goal is to detect whether a given sentence is in the dataset used for training a Large Language Model (LLM). Recent methods (such as Min-K% and Min-K%++) reveal that most training corpora are likely contaminated with both sensitive content and evaluation benchmarks, leading to inflated test set performance. These methods sometimes fail to detect samples from the pretraining data, primarily because they depend on statistics composed of causal token likelihoods. We introduce Infilling Score, a new test statistic based on non-causal token likelihoods. Infilling Score can be computed for autoregressive models without re-training using Bayes' rule. A naive application of Bayes' rule scales linearly with the vocabulary size. However, we propose a ratio test statistic whose computation is invariant to vocabulary size. Empirically, our method achieves a significant accuracy gain over state-of-the-art methods including Min-K% and Min-K%++ on the WikiMIA benchmark across seven models with different parameter sizes. Further, we achieve higher AUC compared to reference-free methods on the challenging MIMIR benchmark. Finally, we create a benchmark dataset consisting of recent data sources published after the release of Llama-3; this benchmark provides a statistical baseline to indicate potential corpora used for Llama-3 training.
  4.
    A quickest change detection problem is considered in a sensor network with observations whose statistical dependency structure across the sensors before and after the change is described by a decomposable graphical model (DGM). Distributed computation methods for this problem are proposed that are capable of producing the optimum centralized test statistic. The DGM leads to the proper way to collect nodes into local groups equivalent to cliques in the graph, such that a clique statistic which summarizes all the clique sensor data can be computed within each clique. The clique statistics are transmitted to a decision maker to produce the optimum centralized test statistic. In order to further improve communication efficiency, an ordered transmission approach is proposed where transmissions of the clique statistics to the fusion center are ordered and then adaptively halted when sufficient information is accumulated. This procedure is guaranteed to provide the optimal change detection performance, despite not transmitting all the statistics from all the cliques. A lower bound on the average number of transmissions saved by ordered transmissions is provided. For the case where the change seldom occurs, the lower bound approaches approximately half the number of cliques, provided a well-behaved distance measure between the distributions of the sensor observations before and after the change is sufficiently large. We also extend the approach to the case when the graph structure is different under each hypothesis. Numerical results show significant savings using the ordered transmission approach and validate the theoretical findings.