skip to main content


Title: Accelerating Substructure Similarity Search for Formula Retrieval
Formula retrieval systems using substructure matching are effective, but suffer from slow retrieval times caused by the complexity of structure matching. We present a specialized inverted index and rank-safe dynamic pruning algorithm for faster substructure retrieval. Formulas are indexed from their Operator Tree (OPT) representations. Our model is evaluated using the NTCIR-12 Wikipedia Formula Browsing Task and a new formula corpus produced from Math StackExchange posts. Our approach preserves the effectiveness of structure matching while allowing queries to be executed in real-time.  more » « less
Award ID(s):
1717997
NSF-PAR ID:
10198750
Author(s) / Creator(s):
;
Date Published:
Journal Name:
Proc. European Conference on Information Retrieval
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. When searching for mathematical content, accurate measures of formula similarity can help with tasks such as document ranking, query recommendation, and result set clustering. While there have been many attempts at embedding words and graphs, formula embedding is in its early stages. We introduce a new formula em- bedding model that we use with two hierarchical representations, (1) Symbol Layout Trees (SLTs) for appearance, and (2) Operator Trees (OPTs) for mathematical content. Following the approach of graph embeddings such as DeepWalk, we generate tuples represent- ing paths between pairs of symbols depth-first, embed tuples using the fastText n-gram embedding model, and then represent an SLT or OPT by its average tuple embedding vector. We then combine SLT and OPT embeddings, leading to state-of-the-art results for the NTCIR-12 formula retrieval task. Our fine-grained holistic vector representations allow us to retrieve many more partially similar for- mulas than methods using structural matching in trees. Combining our embedding model with structural matching in the Approach0 formula search engine produces state-of-the-art results for both fully and partially relevant results on the NTCIR-12 benchmark. Source code for our system is publicly available. 
    more » « less
  2. Abstract

    The local escape velocity provides valuable inputs to the mass profile of the galaxy, and requires understanding the tail of the stellar speed distribution. Following Leonard & Tremaine, various works have since modeled the tail of the stellar speed distribution as(vescv)k, wherevescis the escape velocity, andkis the slope of the distribution. In such studies, however, these two parameters were found to be largely degenerate and often a narrow prior is imposed onkin order to constrainvesc. Furthermore, the validity of the power-law form can breakdown in the presence of multiple kinematic substructures or other mis-modeled features in the data. In this paper, we introduce a strategy that for the first time takes into account the presence of kinematic substructure. We model the tail of the velocity distribution as a sum of multiple power laws as a way of introducing a more flexible fitting framework. Using mock data and data from FIRE simulations of Milky Way-like galaxies, we show the robustness of this method in the presence of kinematic structure that is similar to the recently discovered Gaia Sausage. In a companion paper, we present the new measurement of the escape velocity and subsequently the mass of the Milky Way using Gaia eDR3 data.

     
    more » « less
  3. Fix a weakly minimal (i.e. superstable [Formula: see text]-rank [Formula: see text]) structure [Formula: see text]. Let [Formula: see text] be an expansion by constants for an elementary substructure, and let [Formula: see text] be an arbitrary subset of the universe [Formula: see text]. We show that all formulas in the expansion [Formula: see text] are equivalent to bounded formulas, and so [Formula: see text] is stable (or NIP) if and only if the [Formula: see text]-induced structure [Formula: see text] on [Formula: see text] is stable (or NIP). We then restrict to the case that [Formula: see text] is a pure abelian group with a weakly minimal theory, and [Formula: see text] is mutually algebraic (equivalently, weakly minimal with trivial forking). This setting encompasses most of the recent research on stable expansions of [Formula: see text]. Using various characterizations of mutual algebraicity, we give new examples of stable structures of the form [Formula: see text]. Most notably, we show that if [Formula: see text] is a weakly minimal additive subgroup of the algebraic numbers, [Formula: see text] is enumerated by a homogeneous linear recurrence relation with algebraic coefficients, and no repeated root of the characteristic polynomial of [Formula: see text] is a root of unity, then [Formula: see text] is superstable for any [Formula: see text]. 
    more » « less
  4. Abstract

    Brown dwarf spectra offer vital testbeds for our understanding of the chemical and physical processes that sculpt substellar atmospheres. Recently, atmospheric retrieval approaches have been successfully applied to low-resolution (R∼ 100) spectra of L, T, and Y dwarfs, yielding constraints on the chemical abundances and temperature structures of these atmospheres. Medium-resolution (R∼ 103) spectra of brown dwarfs offer additional insight, as molecular features are more easily disentangled and the thermal structure of the upper atmosphere is better probed. We present results from a GPU-based retrieval analysis of a high signal-to-noise, medium-resolution (R∼ 6000) FIRE spectrum from 0.85 to 2.5μm of the T9 dwarf UGPS J072227.51–054031.2. At 60× higher spectral resolution than previous brown dwarf retrievals, a number of novel challenges arise. We examine the effect of different opacity sources, in particular for CH4. Furthermore, we find that flaws in the data like errors from order stitching can bias our constraints. We compare these retrieval results to those for anR∼ 100 spectrum of the same object, revealing how constraints on atmospheric abundances and temperatures improve by an order of magnitude or more with increased spectral resolution. In particular, we can constrain the abundance of H2S, which is undetectable at lower spectral resolution. While these medium-resolution retrievals offer the potential of precise, stellar-like constraints on atmospheric abundances (∼0.02 dex), our retrieved radius is unphysically small (R=0.500.01+0.01RJup), indicating shortcomings with our modeling framework. This work is an initial investigation into brown dwarf retrievals at medium spectral resolution, offering guidance for future ground-based studies and JWST observations.

     
    more » « less
  5. null (Ed.)
    We address the problem of ad hoc table retrieval via a new neural architecture that incorporates both semantic and relevance matching. Understanding the connection between the structured form of a table and query tokens is an important yet neglected problem in information retrieval. We use a learning- to-rank approach to train a system to capture semantic and relevance signals within interactions between the structured form of candidate tables and query tokens. Convolutional filters that extract contextual features from query/table interactions are combined with a feature vector based on the distributions of term similarity between queries and tables. We propose using row and column summaries to incorporate table content into our new neural model. We evaluate our approach using two datasets, and we demonstrate substantial improvements in terms of retrieval metrics over state-of-the-art methods in table retrieval and document retrieval, and neural architectures from sentence, document, and table type classification adapted to the table retrieval task. Our ablation study supports the importance of both semantic and relevance matching in the table retrieval. 
    more » « less