Abstract In this work, we develop a method named Twinning for partitioning a dataset into statistically similar twin sets. Twinning is based on SPlit, a recently proposed model‐independent method for optimally splitting a dataset into training and testing sets. Twinning is orders of magnitude faster than the SPlit algorithm, which makes it applicable to Big Data problems such as data compression. Twinning can also be used for generating multiple splits of a given dataset to aid divide‐and‐conquer procedures and k‐fold cross validation.
                    
                            
                            A multi-species benchmark for training and validating mass spectrometry proteomics machine learning models
                        
                    
    
Abstract Training machine learning models for tasks such as de novo sequencing or spectral clustering requires large collections of confidently identified spectra. Here we describe a dataset of 2.8 million high-confidence peptide-spectrum matches derived from nine different species. The dataset is based on a previously described benchmark but has been re-processed to ensure consistent data quality and enforce separation of training and test peptides.
        
    
- Award ID(s): 2245300
- PAR ID: 10554332
- Publisher / Repository: Nature Publishing Group
- Date Published:
- Journal Name: Scientific Data
- Volume: 11
- Issue: 1
- ISSN: 2052-4463
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Abstract The Consistent Artificial Intelligence (AI)-based Soil Moisture (CASM) dataset is a global, consistent, and long-term, remote sensing soil moisture (SM) dataset created using machine learning. It is based on the NASA Soil Moisture Active Passive (SMAP) satellite mission SM data and is aimed at extrapolating SMAP-like quality SM back in time using previous satellite microwave platforms. CASM represents SM in the top soil layer, and it is defined on a global 25 km EASE-2 grid and for 2002–2020 with a 3-day temporal resolution. The seasonal cycle is removed for the neural network training to ensure its skill is targeted at predicting SM extremes. CASM comparison to 367 global in-situ SM monitoring sites shows a SMAP-like median correlation of 0.66. Additionally, the SM product uncertainty was assessed, and both aleatoric and epistemic uncertainties were estimated and included in the dataset. CASM dataset can be used to study a wide range of hydrological, carbon cycle, and energy processes since only a consistent long-term dataset allows assessing changes in water availability and water stress.
- Abstract “How strong is this Lewis acid?” is a question researchers often approach by calculating its fluoride ion affinity (FIA) with quantum chemistry. Here, we present FIA49k, an extensive FIA dataset with 48,986 data points calculated at the RI‐DSD‐BLYP‐D3(BJ)/def2‐QZVPP//PBEh‐3c level of theory, including 13 different p‐block atoms as the fluoride accepting site. The FIA49k dataset was used to train FIA‐GNN, two message‐passing graph neural networks, which predict gas and solution phase FIA values of molecules excluded from training with a mean absolute error of 14 kJ mol⁻¹ (r² = 0.93) from the SMILES string of the Lewis acid as the only input. The level of accuracy is notable, given the wide energetic range of 750 kJ mol⁻¹ spanned by FIA49k. The model's value was demonstrated with four case studies, including predictions for molecules extracted from the Cambridge Structural Database and by reproducing results from catalysis research available in the literature. Weaknesses of the model are evaluated and interpreted chemically. FIA‐GNN and the FIA49k dataset can be reached via a free web app (www.grebgroup.de/fia‐gnn).
- Background: The treatment of depression in children and adolescents is a substantial public health challenge. This study examined artificial intelligence tools for the prediction of early outcomes in depressed children and adolescents treated with fluoxetine, duloxetine, or placebo. Methods: The study samples included training datasets (N = 271) from patients with major depressive disorder (MDD) treated with fluoxetine and testing datasets from patients with MDD treated with duloxetine (N = 255) or placebo (N = 265). Treatment trajectories were generated using probabilistic graphical models (PGMs). Unsupervised machine learning identified specific depressive symptom profiles and related thresholds of improvement during acute treatment. Results: Variation in six depressive symptoms (difficulty having fun, social withdrawal, excessive fatigue, irritability, low self‐esteem, and depressed feelings) assessed with the Children’s Depression Rating Scale‐Revised at 4–6 weeks predicted treatment outcomes with fluoxetine at 10–12 weeks with an average accuracy of 73% in the training dataset. The same six symptoms predicted 10–12 week outcomes at 4–6 weeks in (a) duloxetine testing datasets with an average accuracy of 76% and (b) placebo‐treated patients with accuracies of 67%. In placebo‐treated patients, the accuracies of predicting response and remission were similar to antidepressants. Accuracies for predicting nonresponse to placebo treatment were significantly lower than antidepressants. Conclusions: PGMs provided clinically meaningful predictions in samples of depressed children and adolescents treated with fluoxetine or duloxetine. Future work should augment PGMs with biological data for refined predictions to guide the selection of pharmacological and psychotherapeutic treatment in children and adolescents with depression.
- Recent data search platforms use ML task-based utility measures rather than metadata-based keywords, to search large dataset corpora. Requesters submit a training dataset, and these platforms search for augmentations---join or union-compatible datasets---that, when used to augment the requester's dataset, most improve model (e.g., linear regression) performance. Although effective, providers that manage personally identifiable data demand differential privacy (DP) guarantees before granting these platforms data access. Unfortunately, making data search differentially private is nontrivial, as a single search can involve training and evaluating datasets hundreds or thousands of times, quickly depleting privacy budgets. We present Saibot, a differentially private data search platform that employs Factorized Privacy Mechanism (FPM), a novel DP mechanism, to calculate sufficient semi-ring statistics for ML over different combinations of datasets. These statistics are privatized once, and can be freely reused for the search. This allows Saibot to scale to arbitrary numbers of datasets and requests, while minimizing the amount that DP noise affects search results. We optimize the sensitivity of FPM for common augmentation operations, and analyze its properties with respect to linear regression. Specifically, we develop an unbiased estimator for many-to-many joins, prove its bounds, and develop an optimization to redistribute DP noise to minimize the impact on the model. Our evaluation on a real-world dataset corpus of 329 datasets demonstrates that Saibot can return augmentations that achieve model accuracy within 50--90% of non-private search, while the leading alternative DP mechanisms (TPM, APM, shuffling) are several orders of magnitude worse.
