Context. Realistic synthetic observations of theoretical source models are essential for our understanding of real observational data. Using synthetic data, one can verify the extent to which source parameters can be recovered and evaluate how various data corruption effects can be calibrated. These studies are most important when proposing observations of new sources, when characterizing the capabilities of new or upgraded instruments, and when verifying model-based theoretical predictions in a direct comparison with observational data. Aims. We present the SYnthetic Measurement creator for long Baseline Arrays (SYMBA), a novel synthetic data generation pipeline for Very Long Baseline Interferometry (VLBI) observations. SYMBA takes into account several realistic atmospheric, instrumental, and calibration effects. Methods. We used SYMBA to create synthetic observations for the Event Horizon Telescope (EHT), a millimetre VLBI array, which has recently captured the first image of a black hole shadow. After testing SYMBA with simple source and corruption models, we studied the importance of including all corruption and calibration effects, compared to the addition of thermal noise only. Using synthetic data based on two example general relativistic magnetohydrodynamics (GRMHD) model images of M 87, we performed case studies to assess the image quality that can be obtained with the current and future EHT array under different weather conditions. Results. Our synthetic observations show that the effects of atmospheric and instrumental corruptions on the measured visibilities are significant. Despite these effects, we demonstrate that the overall structure of our GRMHD source models can be recovered robustly with the EHT2017 array after performing calibration steps, which include fringe fitting, a priori amplitude and network calibration, and self-calibration. With the planned addition of new stations to the EHT array in the coming years, images could be reconstructed with higher angular resolution and dynamic range. In our case study, these improvements allowed for a distinction between a thermal and a non-thermal GRMHD model based on salient features in the reconstructed images.
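To illustrate the kind of corruption such a pipeline applies, here is a minimal sketch of the baseline thermal-noise step common to synthetic-VLBI generators, using the standard radiometer estimate sigma = sqrt(SEFD1 * SEFD2 / (2 * bandwidth * t_int)) / eta. The function name and signature are illustrative assumptions, not SYMBA's actual API.

```python
import numpy as np

def add_thermal_noise(vis, sefd1, sefd2, bandwidth, t_int, eta=0.88, rng=None):
    """Add baseline-dependent thermal noise to complex visibilities (Jy).

    Uses the standard radiometer estimate for the per-quadrature noise,
    sigma = sqrt(SEFD1 * SEFD2 / (2 * bandwidth * t_int)) / eta,
    where eta is the quantization efficiency (~0.88 for 2-bit sampling).
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.sqrt(sefd1 * sefd2 / (2.0 * bandwidth * t_int)) / eta
    noise = rng.normal(0.0, sigma, vis.shape) + 1j * rng.normal(0.0, sigma, vis.shape)
    return vis + noise
```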
Stress-Testing ML Pipelines with Adversarial Data Corruption

This content will become publicly available on July 1, 2026.
Structured data-quality issues, such as missing values correlated with demographics, culturally biased labels, or systemic selection biases, routinely degrade the reliability of machine-learning pipelines. Regulators increasingly demand evidence that high-stakes systems can withstand these realistic, interdependent errors, yet current robustness evaluations typically use random or overly simplistic corruptions, leaving worst-case scenarios unexplored. We introduce Savage, a causally inspired framework that (i) formally models realistic data-quality issues through dependency graphs and flexible corruption templates, and (ii) systematically discovers corruption patterns that maximally degrade a target performance metric. Savage employs a bi-level optimization approach to efficiently identify vulnerable data subpopulations and fine-tune corruption severity, treating the full ML pipeline, including preprocessing and potentially non-differentiable models, as a black box. Extensive experiments across multiple datasets and ML tasks (data cleaning, fairness-aware learning, uncertainty quantification) demonstrate that even a small fraction (around 5%) of structured corruptions identified by Savage severely impacts model performance, with an effect far exceeding that of random or manually crafted errors, and invalidates core assumptions of existing techniques. Savage thus provides a practical tool for rigorous pipeline stress-testing, a benchmark for evaluating robustness methods, and actionable guidance for designing more resilient data workflows.
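To make the search procedure concrete, the sketch below shows the shape of the stress-testing loop the abstract describes: an outer search over candidate subpopulations and corruption templates, and an inner sweep over corruption severity, with the pipeline treated as a black box. The exhaustive grid search here is a brute-force stand-in for Savage's bi-level optimizer, and all names (stress_test, pipeline_fn, corrupt) are assumptions, not the actual Savage API.

```python
import itertools
import numpy as np

def stress_test(pipeline_fn, X, y, predicates, corruptions, severities, budget=0.05):
    """Schematic black-box stress test.

    pipeline_fn(X, y) -> scalar metric (higher is better).
    predicates: functions pred(X) -> boolean mask selecting a subpopulation.
    corruptions: functions corrupt(X, mask, severity) -> corrupted copy of X.
    budget: maximum fraction of rows that may be corrupted.
    """
    worst = (None, np.inf)
    n_budget = int(budget * len(X))
    for pred, corrupt in itertools.product(predicates, corruptions):
        mask = pred(X)
        if mask.sum() == 0:
            continue
        # Respect the corruption budget: corrupt at most n_budget rows.
        idx = np.flatnonzero(mask)[:n_budget]
        sel = np.zeros(len(X), dtype=bool)
        sel[idx] = True
        for s in severities:  # inner loop: tune corruption severity
            metric = pipeline_fn(corrupt(X.copy(), sel, s), y)
            if metric < worst[1]:
                worst = ((pred, corrupt, s), metric)
    return worst  # most damaging (subpopulation, corruption, severity) found
```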
- PAR ID: 10639846
- Publisher / Repository: VLDB Endowment
- Date Published:
- Journal Name: Proceedings of the VLDB Endowment
- Volume: 18
- Issue: 11
- ISSN: 2150-8097
- Page Range / eLocation ID: 4668 to 4681
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Robust estimation is much more challenging in high dimensions than in one dimension: most techniques either lead to intractable optimization problems or to estimators that can tolerate only a tiny fraction of errors. Recent work in theoretical computer science has shown that, in appropriate distributional models, it is possible to robustly estimate the mean and covariance with polynomial-time algorithms that can tolerate a constant fraction of corruptions, independent of the dimension. However, the sample and time complexity of these algorithms is prohibitively large for high-dimensional applications. In this work, we address both of these issues by establishing sample complexity bounds that are optimal up to logarithmic factors, as well as giving various refinements that allow the algorithms to tolerate a much larger fraction of corruptions. Finally, we show on both synthetic and real data that our algorithms have state-of-the-art performance and make high-dimensional robust estimation a realistic possibility.
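A deliberately simplified sketch of the filtering idea behind such polynomial-time robust estimators follows, assuming inliers with roughly identity covariance; the stopping threshold and trimming quantile are heuristic choices for illustration, not the paper's tuned constants.

```python
import numpy as np

def filtered_mean(X, eps=0.1, max_iter=50):
    """Iterative filtering for robust mean estimation (simplified).

    While the empirical covariance has a suspiciously large top eigenvalue,
    trim the points that score highest along the corresponding eigenvector.
    """
    X = np.asarray(X, dtype=float).copy()
    for _ in range(max_iter):
        mu = X.mean(axis=0)
        w, V = np.linalg.eigh(np.cov(X, rowvar=False))
        if w[-1] <= 1.0 + 10.0 * eps:          # heuristic stopping rule
            break
        scores = ((X - mu) @ V[:, -1]) ** 2    # energy along worst direction
        keep = scores < np.quantile(scores, 1.0 - eps / 2.0)
        if keep.all():
            break
        X = X[keep]
    return X.mean(axis=0)
```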
- As machine learning (ML) systems become pervasive, safeguarding their security is critical. However, it has recently been demonstrated that motivated adversaries are able to mislead ML systems by perturbing test data using semantic transformations. While there exists a rich body of research providing provable robustness guarantees for ML models against ℓp-norm-bounded adversarial perturbations, guarantees against semantic perturbations remain largely underexplored. In this paper, we provide TSS -- a unified framework for certifying ML robustness against general adversarial semantic transformations. First, depending on the properties of each transformation, we divide common transformations into two categories, namely resolvable (e.g., Gaussian blur) and differentially resolvable (e.g., rotation) transformations. For the former, we propose transformation-specific randomized smoothing strategies and obtain strong robustness certification. The latter category covers transformations that involve interpolation errors, and we propose a novel approach based on stratified sampling to certify the robustness. Our framework TSS leverages these certification strategies and combines them with consistency-enhanced training to provide rigorous certification of robustness. We conduct extensive experiments on over ten types of challenging semantic transformations and show that TSS significantly outperforms the state of the art. Moreover, to the best of our knowledge, TSS is the first approach that achieves nontrivial certified robustness on the large-scale ImageNet dataset. For instance, our framework achieves 30.4% certified robust accuracy against rotation attacks (within ±30°) on ImageNet. Moreover, to consider a broader range of transformations, we show that TSS is also robust against adaptive attacks and unforeseen image corruptions such as those in CIFAR-10-C and ImageNet-C.
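As a toy illustration of transformation-specific smoothing for a resolvable transform such as Gaussian blur, the sketch below majority-votes over randomly blurred copies of an input; the certified robustness radius that TSS actually derives is omitted, and model_fn plus the parameter choices are hypothetical.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def smoothed_predict(model_fn, image, n=100, sigma_max=2.0, rng=None):
    """Majority-vote prediction over randomly blurred copies of an image.

    model_fn(image) -> class id. This mimics smoothing over a resolvable
    semantic transform (Gaussian blur) without the certification step.
    """
    rng = np.random.default_rng() if rng is None else rng
    votes = {}
    for _ in range(n):
        pred = model_fn(gaussian_filter(image, sigma=rng.uniform(0.0, sigma_max)))
        votes[pred] = votes.get(pred, 0) + 1
    return max(votes, key=votes.get)
```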
- Electroencephalography (EEG)-based systems utilize machine learning (ML) and deep learning (DL) models in various applications such as seizure detection, emotion recognition, cognitive workload estimation, and brain-computer interfaces (BCIs). However, the security and robustness of such intelligent systems under analog-domain threats have received limited attention. This paper presents the first demonstration of physical signal injection attacks on ML and DL models utilizing EEG data. We investigate how an adversary can degrade the performance of different models by non-invasively injecting signals into EEG recordings. We show that the attacks can mislead or manipulate the models and diminish the reliability of EEG-based systems. Overall, this research sheds light on the need for more trustworthy physiological-signal-based intelligent systems in the healthcare field and opens up avenues for future work.
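A minimal illustration of this analog-domain threat model, assuming the injected signal can be approximated as an additive narrowband tone; the function and its parameters are illustrative, not the paper's attack implementation.

```python
import numpy as np

def inject_interference(eeg, fs, freq=50.0, amp_uv=5.0):
    """Superimpose a narrowband interference tone on EEG data.

    eeg: array of shape (channels, samples) in microvolts; fs: sampling
    rate in Hz. Models an additive analog-domain injection at `freq` Hz.
    """
    t = np.arange(eeg.shape[-1]) / fs
    return eeg + amp_uv * np.sin(2.0 * np.pi * freq * t)
```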
- Deep Learning (DL) models achieve great successes in many domains. However, DL models increasingly face safety and robustness concerns, including noisy labeling in the training stage and feature distribution shifts in the testing stage. Previous works made significant progress in addressing these problems, but the focus has largely been on developing solutions for only one problem at a time. For example, recent work has argued for the use of tunable robust loss functions to mitigate label noise, and data augmentation (e.g., AugMix) to combat distribution shifts. As a step towards addressing both problems simultaneously, we introduce AugLoss, a simple but effective methodology that achieves robustness against both train-time noisy labeling and test-time feature distribution shifts by unifying data augmentation and robust loss functions. We conduct comprehensive experiments in varied settings of real-world dataset corruption to showcase the gains achieved by AugLoss compared to previous state-of-the-art methods. Lastly, we hope this work will open new directions for designing more robust and reliable DL models under real-world corruptions.
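As an illustration of the combination this abstract describes, the sketch below pairs an augmentation callable with a generalized cross-entropy loss, one common family of tunable noise-robust losses; it is a hypothetical PyTorch sketch, not the paper's AugLoss code, and all names are assumptions.

```python
import torch
import torch.nn.functional as F

def gce_loss(logits, targets, q=0.7):
    """Generalized cross-entropy, a tunable noise-robust loss:
    L = (1 - p_y^q) / q, interpolating between CE (q -> 0) and MAE (q = 1)."""
    p_y = F.softmax(logits, dim=1).gather(1, targets.unsqueeze(1)).squeeze(1)
    return ((1.0 - p_y.clamp_min(1e-8) ** q) / q).mean()

def train_step(model, x, y, augment, optimizer, q=0.7):
    """One training step pairing data augmentation (e.g., an AugMix-style
    transform passed in as `augment`) with the robust loss above."""
    optimizer.zero_grad()
    loss = gce_loss(model(augment(x)), y, q=q)
    loss.backward()
    optimizer.step()
    return loss.item()
```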