Abstract Data reduction methods are frequently employed in large genomics and phenomics studies to extract core patterns, reduce dimensionality, and alleviate multiple testing effects. Principal component analysis (PCA), in particular, identifies the components that capture the most variance within omics datasets. While data reduction can simplify complex datasets, it remains unclear how the use of PCA impacts downstream analyses such as quantitative trait loci (QTL) or genome-wide association (GWA) approaches and their biological interpretation. In QTL studies, an alternative to data reduction is the use of post-hoc data summarization approaches, such as hotspot analysis, which involves mapping individual traits and consolidating results based on shared genomic locations. To evaluate how different analytical approaches may alter the biological insights derived from multi-dimensional QTL datasets, we compared individual trait hotspots with PCA-based QTL mapping using transcriptomic and metabolomic data from a structured recombinant inbred line population. Interestingly, these two approaches identified different genomic regions and genetic architectures. These findings suggest that mapping PCA-reduced data does not merely streamline analyses but may generate a fundamentally different view of the underlying genetic architecture compared to individual trait mapping and hotspot analysis. Thus, the use of PCA and other data reduction techniques prior to QTL or GWAS mapping should be carefully considered to ensure alignment with the specific biological question being addressed.
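As a hedged illustration of the workflow the abstract contrasts, the sketch below runs PCA on a simulated trait matrix and then relates the first component to a single marker. All data, effect sizes, and the correlation-based effect measure are illustrative assumptions, not the study's actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 lines x 50 traits, one biallelic marker (0/1).
n_lines, n_traits = 200, 50
marker = rng.integers(0, 2, size=n_lines)
traits = rng.normal(size=(n_lines, n_traits))
traits[:, :10] += marker[:, None] * 1.5   # marker shifts the first 10 traits

# PCA via SVD on the centered trait matrix.
centered = traits - traits.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
pc_scores = U * S                          # per-line scores on each component

# "Map" PC1 against the marker: squared correlation as a simple effect measure.
r = np.corrcoef(pc_scores[:, 0], marker)[0, 1]
print(f"PC1 share of variance: {S[0]**2 / (S**2).sum():.1%}; r^2 with marker: {r**2:.2f}")
```

Mapping each of the 50 traits individually and then looking for co-located signals (the hotspot approach) would interrogate the same matrix very differently, which is the contrast the study quantifies.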
Loki: Streamlining Integration and Enrichment
Data scientists frequently transform data from one form to another while cleaning, integrating, and enriching datasets. Writing such transformations, or “mapping functions”, is time-consuming and often involves significant code re-use. Unfortunately, when every dataset is slightly different from the last, finding the right mapping functions to re-use can be equally difficult. In this paper, we propose “Link Once and Keep It” (Loki), a system which consists of a repository of datasets and mapping functions and relates new datasets to datasets it already knows about, helping a data scientist to quickly locate and re-use mapping functions she developed for other datasets in the past. Loki represents a first step towards building and re-using repositories of domain-specific data integration pipelines.
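A minimal sketch of the core idea, assuming a repository indexed by column signatures and a Jaccard-overlap heuristic for matching new datasets; the class and method names are illustrative, not Loki's actual API.

```python
def jaccard(a, b):
    """Overlap between two column-name sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

class MappingRepo:
    """Hypothetical repository: mapping functions keyed by the column
    signature of the dataset each one was written for."""
    def __init__(self):
        self._entries = []   # (column_signature, mapping_fn)

    def register(self, columns, fn):
        self._entries.append((frozenset(columns), fn))

    def suggest(self, columns):
        """Return the stored mapping function whose source schema best
        overlaps the new dataset's columns."""
        return max(self._entries, key=lambda e: jaccard(e[0], columns))[1]

repo = MappingRepo()
repo.register({"first_name", "last_name"},
              lambda r: r["first_name"] + " " + r["last_name"])
repo.register({"lat", "lon"},
              lambda r: (float(r["lat"]), float(r["lon"])))

# A new dataset with an extra column still matches the name-merging function.
fn = repo.suggest({"first_name", "last_name", "email"})
print(fn({"first_name": "Ada", "last_name": "Lovelace"}))  # → Ada Lovelace
```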
- PAR ID:
- 10208623
- Editor(s):
- Abouzied, Azza; Amer-Yahia, Sihem; Ives, Zachary
- Date Published:
- Journal Name:
- Human in the Loop Data Analytics
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Registering functions (curves) using time warpings (re-parameterizations) is central to many computer vision and shape analysis solutions. While traditional registration methods minimize a penalized L2 norm, the elastic Riemannian metric and square-root velocity functions (SRVFs) have resulted in significant improvements in terms of theory and practical performance. This solution uses the dynamic programming algorithm to minimize the L2 norm between SRVFs of given functions. However, the computational cost of this elastic dynamic programming framework, O(nT²k), where T is the number of time samples along curves, n is the number of curves, and k < T is a parameter, limits its use in applications involving big data. This paper introduces a deep-learning approach, named SRVF Registration Net or SrvfRegNet, to overcome these limitations. The SrvfRegNet architecture trains by optimizing the elastic metric-based objective function on the training data and then applies this trained network to the test data to perform fast registration. In case the training and the test data are from different classes, it generalizes to the test data using transfer learning, i.e., retraining only the last few layers of the network. It achieves state-of-the-art alignment performance albeit at much reduced computational cost. We demonstrate the efficiency and efficacy of this framework using several standard curve datasets.
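A minimal sketch of the SRVF transform the abstract builds on, q = f′ / √|f′|, computed on a sampled curve and its warped copy; the epsilon guard, the example curves, and the unaligned RMS distance are illustrative assumptions, not the paper's method.

```python
import numpy as np

def srvf(f, t):
    """Square-root velocity function of a sampled curve f(t):
    q = f' / sqrt(|f'|), with a small floor guarding flat segments."""
    df = np.gradient(f, t)
    return df / np.sqrt(np.maximum(np.abs(df), 1e-8))

t = np.linspace(0.0, 1.0, 200)
f1 = np.sin(2 * np.pi * t)
gamma = t ** 2                      # a time warping (re-parameterization)
f2 = np.sin(2 * np.pi * gamma)     # the same curve, warped in time

q1, q2 = srvf(f1, t), srvf(f2, t)
# The elastic framework searches over warpings to minimize the distance
# between SRVFs; here we only show the un-aligned distance that the
# dynamic-programming step (or SrvfRegNet) would drive down.
dist = np.sqrt(np.mean((q1 - q2) ** 2))
print(f"RMS distance between SRVFs before alignment: {dist:.3f}")
```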
-
Abstract. One of the key components of this research has been the mapping of Antarctic bed topography and ice thickness parameters that are crucial for modelling ice flow and hence for predicting future ice loss and the ensuing sea level rise. Supported by the Scientific Committee on Antarctic Research (SCAR), the Bedmap3 Action Group aims not only to produce new gridded maps of ice thickness and bed topography for the international scientific community, but also to standardize and make available all the geophysical survey data points used in producing the Bedmap gridded products. Here, we document the survey data used in the latest iteration, Bedmap3, incorporating and adding to all of the datasets previously used for Bedmap1 and Bedmap2, including ice bed, surface and thickness point data from all Antarctic geophysical campaigns since the 1950s. More specifically, we describe the processes used to standardize and make these and future surveys and gridded datasets accessible under the Findable, Accessible, Interoperable, and Reusable (FAIR) data principles. With the goals of making the gridding process reproducible and allowing scientists to re-use the data freely for their own analysis, we introduce the new SCAR Bedmap Data Portal (https://bedmap.scar.org, last access: 1 March 2023) created to provide unprecedented open access to these important datasets through a web-map interface. We believe that this data release will be a valuable asset to Antarctic research and will greatly extend the life cycle of the data held within it. Data are available from the UK Polar Data Centre: https://data.bas.ac.uk (last access: 5 May 2023). See the Data availability section for the complete list of datasets.
-
Alaska has witnessed a significant increase in wildfire events in recent decades that have been linked to drier and warmer summers. Forest fuel maps play a vital role in wildfire management and risk assessment. Freely available multispectral datasets are widely used for land use and land cover mapping, but they have limited utility for fuel mapping due to their coarse spectral resolution. Hyperspectral datasets have a high spectral resolution, ideal for detailed fuel mapping, but they are limited and expensive to acquire. This study simulates hyperspectral data from Sentinel-2 multispectral data using the spectral response function of the Airborne Visible/Infrared Imaging Spectrometer-Next Generation (AVIRIS-NG) sensor, and normalized ground spectra of gravel, birch, and spruce. We used the Uniform Pattern Decomposition Method (UPDM) for spectral unmixing, which is a sensor-independent method, where each pixel is expressed as the linear sum of standard reference spectra. The simulated hyperspectral data have spectral characteristics of AVIRIS-NG and the reflectance properties of Sentinel-2 data. We validated the simulated spectra by visually and statistically comparing them with real AVIRIS-NG data. We observed a high correlation between the spectra of tree classes collected from AVIRIS-NG and simulated hyperspectral data. Upon performing species-level classification, we achieved a classification accuracy of 89% for the simulated hyperspectral data, which is better than the accuracy of Sentinel-2 data (77.8%). We generated a fuel map from the simulated hyperspectral image using the Random Forest classifier. Our study demonstrated that low-cost and high-quality hyperspectral data can be generated from Sentinel-2 data using UPDM for improved land cover and vegetation mapping in the boreal forest.
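A hedged sketch of the linear-mixing idea behind spectral unmixing, where each pixel is a weighted sum of reference spectra: the three random endmembers stand in for the gravel, birch, and spruce ground spectra, and ordinary least squares stands in for the UPDM solver.

```python
import numpy as np

rng = np.random.default_rng(1)
n_bands = 50
# Columns are hypothetical endmember (reference) spectra over 50 bands.
endmembers = rng.random((n_bands, 3))

# Synthesize a mixed pixel: 20% / 50% / 30% plus small sensor noise.
true_fractions = np.array([0.2, 0.5, 0.3])
pixel = endmembers @ true_fractions + rng.normal(0.0, 0.005, n_bands)

# Least squares recovers the per-pixel mixing coefficients.
fractions, *_ = np.linalg.lstsq(endmembers, pixel, rcond=None)
print(np.round(fractions, 2))
```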
-
Ishikawa, H.; Liu, CL.; Pajdla, T.; Shi, J. (Eds.)
We propose a novel technique to register sparse 3D scans in the absence of texture. While existing methods such as KinectFusion or Iterative Closest Points (ICP) heavily rely on dense point clouds, this task is particularly challenging under sparse conditions without RGB data. Sparse texture-less data does not come with high-quality boundary signal, and this prohibits the use of correspondences from corners, junctions, or boundary lines. Moreover, in the case of sparse data, it is incorrect to assume that the same point will be captured in two consecutive scans. We take a different approach and first re-parameterize the point cloud using a large number of line segments. In this re-parameterized data, there exists a large number of line intersection (and not correspondence) constraints that allow us to solve the registration task. We propose the use of a two-step alternating projection algorithm by formulating the registration as the simultaneous satisfaction of intersection and rigidity constraints. The proposed approach outperforms other top-scoring algorithms on both Kinect and LiDAR datasets. In Kinect, we can use 100X downsampled sparse data and still outperform competing methods operating on full-resolution data.
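A minimal sketch of the rigidity-projection step that an alternating-projection scheme like this needs: given matched point sets, the closest rigid motion (the classical Kabsch/Procrustes solution, shown here in 2D). The intersection-constraint projection, the heart of the paper, is omitted, and all data below are made up.

```python
import numpy as np

def best_rigid(P, Q):
    """Rotation R and translation t minimizing ||R @ P + t - Q||,
    for 2 x n point matrices P, Q in correspondence."""
    cP, cQ = P.mean(axis=1, keepdims=True), Q.mean(axis=1, keepdims=True)
    U, _, Vt = np.linalg.svd((Q - cQ) @ (P - cP).T)
    # Correct an improper (reflecting) solution if the SVD returns one.
    D = np.diag([1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ D @ Vt
    return R, cQ - R @ cP

# Ground-truth motion: 0.3 rad rotation plus a translation.
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
P = np.random.default_rng(2).random((2, 30))
Q = R_true @ P + np.array([[1.0], [2.0]])

R, t = best_rigid(P, Q)
print(np.allclose(R, R_true, atol=1e-8))  # → True
```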