Title: Efficient Projection onto the Perfect Phylogeny Model
Several algorithms build on the perfect phylogeny model to infer evolutionary trees. This problem is particularly hard when evolutionary trees are inferred from the fraction of genomes that have mutations in different positions, across different samples. Existing algorithms might do extensive searches over the space of possible trees. At the center of these algorithms is a projection problem that assigns a fitness cost to phylogenetic trees. In order to perform a wide search over the space of trees, it is critical to solve this projection problem fast. In this paper, we use Moreau's decomposition for proximal operators, and a tree reduction scheme, to develop a new algorithm to compute this projection. Our algorithm terminates with an exact solution in a finite number of steps, and is extremely fast. In particular, it can search over all evolutionary trees with fewer than 11 nodes (more than 2 billion trees), a size relevant for several biological problems, in about 2 hours.
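As background for the abstract above (not part of the published record): Moreau's decomposition relates the proximal operator of a closed, proper, convex function f to that of its convex conjugate f*, which is what allows a projection to be computed through whichever of the two operators is cheaper to evaluate.

```latex
% Moreau's decomposition: for a closed, proper, convex f with conjugate f*,
%   v = prox_f(v) + prox_{f*}(v)   for every v,
% so an algorithm for either proximal operator immediately yields the other.
\[
  v = \operatorname{prox}_{f}(v) + \operatorname{prox}_{f^{*}}(v),
  \qquad
  f^{*}(y) = \sup_{x}\,\bigl\{\langle x, y\rangle - f(x)\bigr\}.
\]
```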
Greenberg, Craig; Macaluso, Sebastian; Monath, Nicholas; Dubey, Avinava; Flaherty, Patrick; Zaheer, Manzil; Ahmed, Amr; Cranmer, Kyle; McCallum, Andrew. The Conference on Uncertainty in Artificial Intelligence (UAI).
Hierarchical clustering is a critical task in numerous domains. Many approaches are based on heuristics and the properties of the resulting clusterings are studied post hoc. However, in several applications, there is a natural cost function that can be used to characterize the quality of the clustering. In those cases, hierarchical clustering can be seen as a combinatorial optimization problem. To that end, we introduce a new approach based on A* search. We overcome the prohibitively large search space by combining A* with a novel "trellis" data structure. This combination results in an exact algorithm that scales beyond the previous state of the art, from a search space with 10^12 trees to 10^15 trees, and an approximate algorithm that improves over baselines, even in enormous search spaces that contain more than 10^1000 trees. We empirically demonstrate that our method achieves substantially higher quality results than baselines for a particle physics use case and other clustering benchmarks. We describe how our method provides significantly improved theoretical bounds on the time and space complexity of A* for clustering.
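Since the record above leans on A* search, here is a minimal, generic A* skeleton in Python for orientation. This is a sketch only: the paper's actual contribution is the trellis data structure that compactly encodes the space of hierarchical clusterings, which is not reproduced here, and the `neighbors` and `heuristic` callables are placeholders the caller must supply.

```python
# Generic A* skeleton (illustrative only; the trellis data structure from the
# paper above is NOT reproduced here).
import heapq
import itertools

def a_star(start, is_goal, neighbors, heuristic):
    """neighbors(state) yields (next_state, step_cost) pairs;
    heuristic(state) must be admissible (never overestimate remaining cost)."""
    tie = itertools.count()          # tie-breaker so the heap never compares states
    frontier = [(heuristic(start), next(tie), 0.0, start, [start])]
    best_g = {start: 0.0}
    while frontier:
        _, _, g, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return g, path           # total cost and the path of expanded states
        for nxt, cost in neighbors(state):
            g_new = g + cost
            if g_new < best_g.get(nxt, float("inf")):
                best_g[nxt] = g_new
                heapq.heappush(frontier,
                               (g_new + heuristic(nxt), next(tie), g_new, nxt, path + [nxt]))
    return None                      # goal unreachable

# Example on a toy weighted graph (heuristic 0 reduces A* to Dijkstra):
graph = {"a": [("b", 1.0), ("c", 4.0)], "b": [("c", 1.0)], "c": []}
print(a_star("a", lambda s: s == "c", lambda s: graph[s], lambda s: 0.0))
# -> (2.0, ['a', 'b', 'c'])
```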
Carraher, Lee A.; Dey, Sayantan; Wilsey, Philip A. MLDM 2019: 15th International Conference on Machine Learning and Data Mining.
Clustering continues to be an important tool for data engineering and analysis. While advances in deep learning tend to be at the forefront of machine learning, deep learning is chiefly useful for the supervised classification of data sets. Clustering is an essential tool for problems where labeling data sets is either too labor intensive or where there is no agreed-upon ground truth. The well-studied k-means problem partitions groups of similar vectors into k clusters by iteratively updating the cluster assignments so as to minimize the within-cluster sum-of-squares metric. Unfortunately, k-means can become prohibitively expensive for very large, high-dimensional data sets, as iterative methods often rely on random access to, or multiple passes over, the data set, a requirement that is often not satisfiable for large and potentially unbounded data sets. In this work we explore a randomized, approximate method for clustering called Tree-Walk Random Projection Clustering (TWRP), a fast, memory-efficient method for finding cluster embeddings in high-dimensional spaces. TWRP combines random projection with a tree-based partitioner to achieve a clustering method that forgoes storing the exhaustive representation of all vectors in the data space and instead performs a bounded search over the implied cluster bifurcation tree, represented as approximate vector and count values. The TWRP algorithm is described and experimentally evaluated for scalability and accuracy in the presence of noise against several other well-known algorithms.
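A minimal sketch of the general random-projection-tree idea underlying the abstract above; this is not the TWRP algorithm itself (in particular, TWRP's streaming approximations of vectors and counts are omitted), and the depth and size thresholds are illustrative parameters.

```python
# Random-projection-tree sketch (the general idea only, NOT the TWRP algorithm).
import numpy as np

def rp_tree(points, depth=0, max_depth=4, min_size=8, rng=None):
    """Recursively split an (n, d) array of points along random directions;
    each leaf summarizes a candidate cluster by its count and centroid."""
    rng = rng if rng is not None else np.random.default_rng(0)
    if depth == max_depth or len(points) <= min_size:
        return {"count": len(points), "centroid": points.mean(axis=0)}
    direction = rng.standard_normal(points.shape[1])   # random projection axis
    proj = points @ direction
    split = np.median(proj)
    left, right = points[proj <= split], points[proj > split]
    if len(left) == 0 or len(right) == 0:              # degenerate split: stop
        return {"count": len(points), "centroid": points.mean(axis=0)}
    return {"left": rp_tree(left, depth + 1, max_depth, min_size, rng),
            "right": rp_tree(right, depth + 1, max_depth, min_size, rng)}

# Example: two well-separated Gaussian blobs end up in different subtrees.
rng = np.random.default_rng(1)
blobs = np.vstack([rng.normal(0, 0.1, (50, 10)), rng.normal(5, 0.1, (50, 10))])
tree = rp_tree(blobs)
```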
Purpose: Marine transportation has faced an increasing demand for containerized cargo during the past decade. Marine container terminals (MCTs), as the facilities connecting seaborne and inland transportation, are expected to handle the increasing number of containers delivered by vessels. Berth scheduling plays an important role in the total throughput of MCTs as well as the overall effectiveness of MCT operations. This study aims to propose a novel island-based metaheuristic algorithm to solve the berth scheduling problem and minimize the total cost of serving the arriving vessels at the MCT.

Design/methodology/approach: A universal island-based metaheuristic algorithm (UIMA) was proposed in this study, aiming to solve the spatially constrained berth scheduling problem. The UIMA population was divided into four sub-populations (i.e., islands). Unlike canonical island-based algorithms that execute the same metaheuristic on each island, four different population-based metaheuristics are adopted within the developed algorithm to search the islands: an evolutionary algorithm (EA), particle swarm optimization (PSO), an estimation of distribution algorithm (EDA) and differential evolution (DE). The adopted population-based metaheuristics rely on different operators, which facilitate the search for superior solutions on the UIMA islands. A sketch of the general island-model scheme appears after this abstract.

Findings: The conducted numerical experiments demonstrated that the developed UIMA algorithm returned near-optimal solutions for the small-size problem instances. As for the large-size problem instances, UIMA was found to be superior to the EA, PSO, EDA and DE algorithms executed in isolation, in terms of the objective function values obtained at termination. Furthermore, the developed UIMA algorithm outperformed various single-solution-based metaheuristic algorithms (including variable neighborhood search, tabu search and simulated annealing) in terms of solution quality. The maximum UIMA computational time did not exceed 306 s.

Research limitations/implications: Some previous berth scheduling studies modeled uncertain vessel arrival times and/or handling times, while this study assumed the vessel arrival and handling times to be deterministic.

Practical implications: The developed UIMA algorithm can be used by MCT operators as an efficient decision support tool and can assist with a cost-effective design of berth schedules within an acceptable computational time.

Originality/value: A novel island-based metaheuristic algorithm is designed to solve the spatially constrained berth scheduling problem. The proposed island-based algorithm adopts several types of metaheuristic algorithms to cover different areas of the search space. The considered metaheuristic algorithms rely on different operators, a feature that is expected to facilitate the search for superior solutions.
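As referenced in the abstract above, a generic island-model skeleton in Python. This is a sketch of the general scheme only: in the actual UIMA each island runs a different metaheuristic (EA, PSO, EDA, DE) on the berth scheduling problem, whereas here every island uses the same toy mutation operator and `sphere` is a stand-in objective.

```python
# Island-model skeleton (generic sketch; NOT the UIMA algorithm itself).
import random

def sphere(x):                       # stand-in minimization objective
    return sum(v * v for v in x)

def mutate(x, step=0.1):             # toy variation operator used by every island
    return [v + random.uniform(-step, step) for v in x]

def evolve_island(pop, objective):   # one generation of greedy local search
    new_pop = []
    for x in pop:
        y = mutate(x)
        new_pop.append(y if objective(y) < objective(x) else x)
    return new_pop

def island_model(objective, dim=5, islands=4, size=20, gens=200, migrate_every=25):
    pops = [[[random.uniform(-5, 5) for _ in range(dim)] for _ in range(size)]
            for _ in range(islands)]
    for g in range(1, gens + 1):
        pops = [evolve_island(p, objective) for p in pops]
        if g % migrate_every == 0:   # ring migration: each island's best moves on
            bests = [min(p, key=objective) for p in pops]
            for i, p in enumerate(pops):
                p[random.randrange(size)] = bests[(i - 1) % islands]
    best = min((min(p, key=objective) for p in pops), key=objective)
    return best, objective(best)

print(island_model(sphere))
```

Periodic migration is what distinguishes the island model from running the four metaheuristics in isolation: good solutions found on one island seed the search on the next.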
Shekhar, Subhanshu; Javidi, Tara. Proceedings of Machine Learning Research.
In this paper, the problem of estimating the level set of a black-box function from noisy and expensive evaluation queries is considered. A new algorithm for this problem in the Bayesian framework with a Gaussian Process (GP) prior is proposed. The proposed algorithm employs a hierarchical sequence of partitions to explore different regions of the search space at varying levels of detail, depending upon their proximity to the level set boundary. It is shown that this approach results in an algorithm with a low-complexity implementation whose computational cost is significantly smaller than that of existing algorithms for higher-dimensional search spaces. Furthermore, high-probability bounds on a measure of discrepancy between the estimated level set and the true level set are obtained for the proposed algorithm, which are shown to be strictly better than the existing guarantees for a large class of GPs. In the process, a tighter characterization of the information gain of the proposed algorithm is obtained, which takes into account the structured nature of the evaluation points. This approach improves upon the existing technique of bounding the information gain by the maximum information gain.
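For orientation, a small sketch of GP-based level-set classification in Python (background only; the hierarchical-partition algorithm and the guarantees described above are not reproduced, and the kernel, length scale, threshold and test function are illustrative choices):

```python
# GP level-set classification sketch (background illustration only).
import numpy as np

def rbf(a, b, ell=0.3):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def gp_posterior(x_train, y_train, x_test, noise=1e-2):
    """Standard GP regression posterior mean and standard deviation (RBF kernel)."""
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = rbf(x_test, x_train)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = Ks @ alpha
    v = np.linalg.solve(L, Ks.T)
    var = np.ones(len(x_test)) - np.sum(v ** 2, axis=0)   # prior variance is 1
    return mean, np.sqrt(np.maximum(var, 0.0))

def f(x):                                # hidden "black-box" function
    return np.sin(3 * x)

x_tr = np.linspace(0, 2, 15)
y_tr = f(x_tr) + 0.01 * np.random.default_rng(0).normal(size=15)
xs = np.linspace(0, 2, 200)
mu, sd = gp_posterior(x_tr, y_tr, xs)
tau = 0.5                                # the level defining the level set
above = xs[mu - 2 * sd > tau]            # confidently inside the super-level set
below = xs[mu + 2 * sd < tau]            # confidently outside
```

The points classified by neither band are exactly where a level-set algorithm should spend its next queries, which is the intuition behind refining partitions near the boundary.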
Kvinge, Henry; Farnell, Elin; Kirby, Michael; Peterson, Chris. 2018 IEEE International Conference on Big Data (Big Data).
Dimensionality-reduction methods are a fundamental tool in the analysis of large datasets. These algorithms work on the assumption that the "intrinsic dimension" of the data is generally much smaller than the ambient dimension in which it is collected. Alongside their usual purpose of mapping data into a smaller-dimensional space with minimal information loss, dimensionality-reduction techniques implicitly or explicitly provide information about the dimension of the dataset. In this paper, we propose a new statistic that we call the kappa-profile for analysis of large datasets. The kappa-profile arises from a dimensionality-reduction optimization problem: namely, that of finding a projection that optimally preserves the secants between points in the dataset. From this optimal projection we extract kappa, the norm of the shortest projected secant from among the set of all normalized secants. This kappa can be computed for any dimension k; thus the tuple of kappa values (indexed by dimension) becomes a kappa-profile. Algorithms such as the Secant-Avoidance Projection algorithm and the Hierarchical Secant-Avoidance Projection algorithm provide a computationally feasible means of estimating the kappa-profile for large datasets, and thus a method of understanding and monitoring their behavior. As we demonstrate in this paper, the kappa-profile serves as a useful statistic in several representative settings: weather data, soundscape data, and dynamical systems data.
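A short sketch of the kappa statistic as described above (an illustrative reading, not the authors' SAP/HSAP code; finding the projection P that maximizes kappa is the optimization problem those algorithms address, whereas here P is simply given):

```python
# kappa for a fixed projection: the smallest norm among projected unit secants.
import numpy as np
from itertools import combinations

def kappa(points, P):
    """points: (n, d) array; P: (k, d) matrix with orthonormal rows."""
    secants = np.array([points[i] - points[j]
                        for i, j in combinations(range(len(points)), 2)])
    secants /= np.linalg.norm(secants, axis=1, keepdims=True)   # normalize
    return np.linalg.norm(secants @ P.T, axis=1).min()

# Example: project 3-D points onto the first two coordinates (k = 2).
pts = np.random.default_rng(0).normal(size=(20, 3))
P = np.eye(2, 3)
print(kappa(pts, P))
```

A kappa near 1 means no pair of points is collapsed by the projection; sweeping k and recording the optimal kappa at each dimension yields the kappa-profile.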
Jia, Bei; Ray, Surjyendu; Safavi, Sam; Bento, Jose. "Efficient Projection onto the Perfect Phylogeny Model." Advances in Neural Information Processing Systems 31 (NIPS 2018). https://par.nsf.gov/biblio/10107399.
@article{osti_10107399,
  title = {Efficient Projection onto the Perfect Phylogeny Model},
  author = {Jia, Bei and Ray, Surjyendu and Safavi, Sam and Bento, Jose},
  journal = {Advances in Neural Information Processing Systems 31 (NIPS 2018)},
  url = {https://par.nsf.gov/biblio/10107399},
}