Title: UConn Voter Center - Voting Bubbles with Swatches
We introduce the UConn Bubbles with Swatches dataset. It contains images of voting bubbles scanned from Connecticut ballots, captured as either grayscale (8 bpp) or color (RGB, 24 bpp) artifacts and extracted through segmentation using ballot geometry. The images are organized into four datasets. The stored file contains all data in color; grayscale versions are produced by manual conversion. Each bubble image is 40x50 pixels. Labels are produced by an optical lens scanner.

The first dataset, Gray-B (Bubbles), contains 42,679 images (40x50, 8 bpp): 35,429 blank bubbles and 7,250 bubbles filled in by humans, with no marginal marks. There are two classes, mark and non-mark. The second dataset, RGB-B, is a 24 bpp color (RGB) version of Gray-B.

The third dataset, Gray-C (Combined), augments Gray-B with a collection of marginal marks called "swatches": synthetic images that vary the position of signal to create samples close to the boundary of an optical lens scanner. Each of the 423,703 randomly generated swatches distributes an equal amount of random noise throughout the image so that the total amount of light is the same. This yields 466,382 labeled images. The fourth dataset, RGB-C, is a 24 bpp color (RGB) version of Gray-C.

The empty bubbles were printed by a commercial vendor and have undergone registration and segmentation using predetermined coordinates; marks are on paper printed by the same vendor. These datasets can be used for training classifiers.

The .h5 file contains several levels of datasets, as shown below. The main dataset used for training is POSITIONAL, which is separated only into blank (non-mark) and vote (mark); whether an example is a bubble or a swatch is indicated by its batch number. See https://github.com/VoterCenter/Busting-the-Ballot/blob/main/Utilities/LoadVoterData.py for code that creates torch arrays for RGB-B and RGB-C, and https://github.com/VoterCenter/Busting-the-Ballot/blob/main/Utilities/VoterLab_Classifier_Functions.py for grayscale conversion functions and other utilities.

Dataset structure:

COLOR
  B / V / Q
    IMAGE
POSITIONAL
  B / V / Q
    IMAGE
INFORMATION
  COLOR / POSITIONAL
    B / V / Q
      BACKGROUND RGB VALUES

Images are divided into "batches", not all of which contain data. INFORMATION contains labels for all images. Q is the swatch data, while B and V are non-mark and mark respectively.
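For reference, here is a minimal sketch of how the POSITIONAL group might be read into torch tensors, assuming h5py group names that match the outline above ('POSITIONAL' containing 'B', 'V', and 'Q' subgroups whose members are per-batch image arrays). The file name, key strings, and label convention are illustrative assumptions; LoadVoterData.py in the repository is the authoritative loader.

```python
# Hedged sketch: load blank/vote images from the POSITIONAL group.
# Group and file names below are assumptions based on the structure
# outline, not confirmed keys from the released .h5 file.
import h5py
import numpy as np
import torch

def load_positional(path="bubbles_with_swatches.h5"):
    images, labels = [], []
    with h5py.File(path, "r") as f:
        positional = f["POSITIONAL"]
        # B = blank (non-mark), V = vote (mark); Q holds the swatch data.
        for key, label in (("B", 0), ("V", 1)):
            group = positional[key]
            for batch_name in group:          # images come in batches...
                batch = group[batch_name]
                if batch.shape[0] == 0:       # ...not all of which contain data
                    continue
                images.append(batch[...])     # read the batch into memory
                labels.append(np.full(batch.shape[0], label, dtype=np.int64))
    x = torch.from_numpy(np.concatenate(images)).float()
    y = torch.from_numpy(np.concatenate(labels))
    return x, y
```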
Award ID(s):
2232813 2141033
PAR ID:
10651257
Author(s) / Creator(s):
Corporate Creator(s):
Publisher / Repository:
Zenodo
Date Published:
Edition / Version:
1.0.0
Subject(s) / Keyword(s):
machine learning; election security; voting; computer vision
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Currently deployed election systems that scan and process hand-marked ballots are not sophisticated enough to handle marks that are insufficiently filled in (e.g., partially filled), improper marks (e.g., check marks or crosses instead of filled bubbles), or marks outside of bubbles; they do little more than apply a threshold to decide whether the pixels inside a bubble are dark and dense enough to be counted as a vote. Current work along this line is still largely limited in its degree of automation and requires substantial manpower for annotation and adjudication. In this study, we propose a highly automated deep learning (DL) mark-segmentation-based ballot tabulation assistant that accurately identifies legitimate ballot marks. For comparison, a highly customized traditional computer vision (T-CV) mark-segmentation-based method has also been developed, with a detailed discussion included. Our experiments on two real election datasets achieved the highest ballot tabulation accuracy of 99.984%. To further enhance our DL model's ability to detect marks that are underrepresented in training datasets, e.g., insufficiently or improperly filled marks, we propose a Siamese network architecture that enables our DL model to exploit the contrasting features between a hand-marked ballot image and its corresponding blank template image. Without the need for extra data collection, incorporating this novel network architecture not only achieved a higher accuracy score but also substantially reduced the overall false negative rate.
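As a concrete illustration of the baseline this abstract critiques, here is a minimal sketch of a darkness-and-density threshold rule; the crop convention and cutoff values are illustrative assumptions, not values from the paper.

```python
# Hedged sketch: threshold-based bubble classification, the rule the
# abstract attributes to currently deployed scanners. Cutoffs are
# illustrative assumptions.
import numpy as np

def is_vote(ballot: np.ndarray, bubble_box: tuple,
            darkness_cutoff: int = 128, fill_ratio: float = 0.25) -> bool:
    """ballot: 2-D grayscale scan; bubble_box: (row0, row1, col0, col1)."""
    r0, r1, c0, c1 = bubble_box
    bubble = ballot[r0:r1, c0:c1]
    dark_fraction = np.mean(bubble < darkness_cutoff)  # density of dark pixels
    return dark_fraction >= fill_ratio                 # dense enough -> vote
```

A rule like this is exactly what fails on partial fills and check marks, which motivates the segmentation-based approaches above.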
  2. As paper ballots and post-election audits gain increased adoption in the United States, election technology vendors are offering products that allow jurisdictions to review ballot images (digital scans produced by optical-scan voting machines) in their post-election audit procedures. Jurisdictions including the state of Maryland rely on such image audits as an alternative to inspecting the physical paper ballots. We show that image audits can be reliably defeated by an attacker who can run malicious code on the voting machines or election management system. Using computer vision techniques, we develop an algorithm that automatically and seamlessly manipulates ballot images, moving voters' marks so that they appear to be votes for the attacker's preferred candidate. Our implementation is compatible with many widely used ballot styles, and we show that it is effective using a large corpus of ballot images from a real election. We also show that the attack can be delivered in the form of a malicious Windows scanner driver, which we test with a scanner that has been certified for use in vote tabulation by the U.S. Election Assistance Commission. These results demonstrate that post-election audits must inspect physical ballots, not merely ballot images, if they are to strongly defend against computer-based attacks on widely used voting systems.
  3. Major semantic segmentation approaches are designed for RGB color images, which are interpolated from raw Bayer images. On the one hand, RGB images provide abundant scene color information; on the other hand, they are easy for human users to interpret. The RGB color continuity also makes it easier for researchers to design segmentation algorithms, although this becomes unnecessary in end-to-end learning. More importantly, the use of three channels adds extra storage and computation burden for neural networks. In contrast, raw Bayer images preserve the primitive color information to the largest extent with just a single channel. The compact design of the Bayer pattern not only potentially enables higher segmentation accuracy by avoiding interpolation, but also significantly decreases the storage requirement and computation time in comparison with standard R, G, B images. In this paper, we propose BayerSeg-Net to segment single-channel raw Bayer images directly. Unlike RGB color images, which already incorporate neighboring context during ISP color interpolation, each pixel in a raw Bayer image does not contain any context clues. Based on Bayer pattern properties, BayerSeg-Net assigns dynamic attention to Bayer images' spectral frequencies and spatial locations to mitigate classification confusion, and uses a re-sampling strategy to capture both global and local contextual information. We demonstrate the usability of raw Bayer images in segmentation tasks and the efficiency of BayerSeg-Net on multiple datasets.
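To make the input format concrete, here is a minimal sketch that simulates a single-channel Bayer mosaic from an RGB image, using the standard RGGB layout; BayerSeg-Net itself is not reproduced here, and this is only an illustration of why the raw representation needs one third of the storage.

```python
# Hedged sketch: simulate a raw RGGB Bayer mosaic from an RGB image.
# One color sample survives per pixel, so an (H, W, 3) image becomes (H, W).
import numpy as np

def rgb_to_bayer_rggb(rgb: np.ndarray) -> np.ndarray:
    """rgb: (H, W, 3) array -> (H, W) single-channel RGGB mosaic."""
    h, w, _ = rgb.shape
    bayer = np.empty((h, w), dtype=rgb.dtype)
    bayer[0::2, 0::2] = rgb[0::2, 0::2, 0]  # R at even rows, even cols
    bayer[0::2, 1::2] = rgb[0::2, 1::2, 1]  # G at even rows, odd cols
    bayer[1::2, 0::2] = rgb[1::2, 0::2, 1]  # G at odd rows, even cols
    bayer[1::2, 1::2] = rgb[1::2, 1::2, 2]  # B at odd rows, odd cols
    return bayer
```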
  4. We consider the problem of estimating the structure of an undirected weighted graph underlying a set of smooth multi-attribute signals. Most existing methods for graph estimation are based on single-attribute models, where one associates a scalar data variable with each node of the graph, and the problem is to infer the graph topology that captures the relationships between these variables. An example is image graphs for grayscale texture images, used to model the dependence of a pixel on its neighboring pixels. In multi-attribute graphical models, each node represents a vector, as, for example, in color image graphs where one has three variables (RGB color components) per pixel node. In this paper, we extend the single-attribute approach of Kalofolias (2016) to multi-attribute data. An alternating direction method of multipliers (ADMM) algorithm is presented to optimize the objective function and infer the graph topology. Numerical results based on synthetic as well as real data are presented.
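For intuition, here is a minimal sketch of the smoothness term that Kalofolias-style graph learning minimizes, generalized to multi-attribute node vectors; the ADMM solver itself is not reproduced, and the function names are illustrative.

```python
# Hedged sketch: graph smoothness of multi-attribute signals.
# Learning W amounts to making tr(W @ Z) / 2 small (plus regularizers),
# where Z[i, j] = ||x_i - x_j||^2 for node vectors x_i.
import numpy as np

def pairwise_smoothness_matrix(X: np.ndarray) -> np.ndarray:
    """X: (n_nodes, n_attributes). Returns Z with Z[i, j] = ||x_i - x_j||^2."""
    sq = np.sum(X * X, axis=1)
    return sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)

def smoothness(W: np.ndarray, X: np.ndarray) -> float:
    """Total variation of X on graph W: (1/2) * sum_ij W_ij ||x_i - x_j||^2."""
    return 0.5 * float(np.sum(W * pairwise_smoothness_matrix(X)))
```

In the single-attribute case each x_i is a scalar; the multi-attribute extension simply replaces scalar differences with vector norms.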
  5. Purpose: Prior studies have shown convolutional neural networks predicting self-reported race from x-rays of the chest, hand, and spine, chest computed tomography, and mammograms. We seek an understanding of the mechanism that reveals race within x-ray images, investigating the possibility that race is not predicted from the physical structure in x-ray images but is embedded in the grayscale pixel intensities. Approach: A retrospective set of 298,827 AP/PA chest x-ray images from full-year 2021, drawn from three academic health centers across the United States and MIMIC-CXR and labeled by self-reported race, was used in this study. Image structure is removed by counting the occurrences of each grayscale value and scaling the counts to percent per image (PPI). The resulting data are tested using multivariate analysis of variance (MANOVA) with Bonferroni multiple-comparison adjustment and class-balanced MANOVA. Machine learning (ML) feed-forward networks (FFN) and decision trees were built to predict race (binary Black or White, and binary Black or other) using only the grayscale value counts. Stratified analysis by body mass index, age, sex, gender, patient type, make/model of scanner, exposure, and kilovoltage peak setting was run, following the same methodology, to study the impact of these factors on race prediction. Results: MANOVA rejects the null hypothesis that the classes are the same with 95% confidence (F = 7.38, P < 0.0001), as does class-balanced MANOVA (F = 2.02, P < 0.0001). The best FFN performance is limited [area under the receiver operating characteristic curve (AUROC) of 69.18%]. Gradient-boosted trees predict self-reported race using grayscale PPI (AUROC 77.24%). Conclusions: Within chest x-rays, pixel intensity value counts alone are statistically significant indicators of, and sufficient for, ML classification of patient self-reported race.
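To make the feature construction concrete, here is a minimal sketch of the percent-per-image (PPI) computation described in the Approach, assuming 8-bit grayscale input; the function name is illustrative.

```python
# Hedged sketch: percent-per-image (PPI) grayscale histogram.
# Spatial structure is discarded; only the distribution of intensity
# values remains, which is the input to the MANOVA and ML classifiers.
import numpy as np

def grayscale_ppi(image: np.ndarray) -> np.ndarray:
    """image: 2-D uint8 grayscale x-ray -> (256,) percent-per-image vector."""
    counts = np.bincount(image.ravel(), minlength=256)
    return 100.0 * counts / image.size
```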