When searching for mathematical content, accurate measures of formula similarity can help with tasks such as document ranking, query recommendation, and result set clustering. While there have been many attempts at embedding words and graphs, formula embedding is in its early stages. We introduce a new formula embedding model that we use with two hierarchical representations, (1) Symbol Layout Trees (SLTs) for appearance, and (2) Operator Trees (OPTs) for mathematical content. Following the approach of graph embeddings such as DeepWalk, we generate tuples representing paths between pairs of symbols depth-first, embed tuples using the fastText n-gram embedding model, and then represent an SLT or OPT by its average tuple embedding vector. We then combine SLT and OPT embeddings, leading to state-of-the-art results for the NTCIR-12 formula retrieval task. Our fine-grained holistic vector representations allow us to retrieve many more partially similar formulas than methods using structural matching in trees. Combining our embedding model with structural matching in the Approach0 formula search engine produces state-of-the-art results for both fully and partially relevant results on the NTCIR-12 benchmark. Source code for our system is publicly available.
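The pipeline described in this abstract (depth-first path tuples, per-tuple embedding, averaging) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the tree encoding as `(label, children)` pairs, the `(ancestor, descendant, path)` tuple layout, and the helper names are hypothetical, and a simple hashed character n-gram count vector stands in for a trained fastText model.

```python
import hashlib
import math

# Hypothetical tree encoding: a node is (symbol_label, [child_nodes]).

def path_tuples(root):
    """Collect (ancestor, descendant, path) for every ancestor/descendant
    pair, visiting the tree depth-first. The path is the sequence of child
    indices leading from the ancestor down to the descendant."""
    out = []

    def visit(node, ancestors):
        label, children = node
        for anc_label, path in ancestors:
            out.append((anc_label, label, path))
        for i, child in enumerate(children):
            extended = [(a, p + str(i)) for a, p in ancestors]
            extended.append((label, str(i)))
            visit(child, extended)

    visit(root, [])
    return out

def ngram_vector(token, dim=16, n=3):
    """Stand-in for a fastText lookup: hash character n-grams into buckets.
    A real system would use vectors learned by fastText instead."""
    vec = [0.0] * dim
    padded = "<" + token + ">"
    grams = [padded[i:i + n] for i in range(max(1, len(padded) - n + 1))]
    for gram in grams:
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    return vec

def formula_vector(root, dim=16):
    """Represent a tree by the average of its tuple vectors."""
    tuples = path_tuples(root)
    if not tuples:
        return [0.0] * dim
    vecs = [ngram_vector("|".join(t), dim) for t in tuples]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy operator-tree-like input for x + 2 (illustrative only).
tree = ("+", [("x", []), ("2", [])])
tuples = path_tuples(tree)
similarity = cosine(formula_vector(tree), formula_vector(tree))
```

Averaging gives every formula a fixed-length vector regardless of tree size, so formulas can be compared with cosine similarity without matching tree structure directly.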
                            Accelerating Substructure Similarity Search for Formula Retrieval
                        
                    
    
            Formula retrieval systems using substructure matching are effective, but suffer from slow retrieval times caused by the complexity of structure matching. We present a specialized inverted index and rank-safe dynamic pruning algorithm for faster substructure retrieval. Formulas are indexed from their Operator Tree (OPT) representations. Our model is evaluated using the NTCIR-12 Wikipedia Formula Browsing Task and a new formula corpus produced from Math StackExchange posts. Our approach preserves the effectiveness of structure matching while allowing queries to be executed in real-time. 
                            - Award ID(s):
- 1717997
- PAR ID:
- 10198750
- Date Published:
- Journal Name:
- Proc. European Conference on Information Retrieval
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
- 
            We address the problem of ad hoc table retrieval via a new neural architecture that incorporates both semantic and relevance matching. Understanding the connection between the structured form of a table and query tokens is an important yet neglected problem in information retrieval. We use a learning-to-rank approach to train a system to capture semantic and relevance signals within interactions between the structured form of candidate tables and query tokens. Convolutional filters that extract contextual features from query/table interactions are combined with a feature vector based on the distributions of term similarity between queries and tables. We propose using row and column summaries to incorporate table content into our new neural model. We evaluate our approach using two datasets, and we demonstrate substantial improvements in terms of retrieval metrics over state-of-the-art methods in table retrieval and document retrieval, and neural architectures from sentence, document, and table type classification adapted to the table retrieval task. Our ablation study supports the importance of both semantic and relevance matching in table retrieval.
- 
            Fix a weakly minimal (i.e. superstable [Formula: see text]-rank [Formula: see text]) structure [Formula: see text]. Let [Formula: see text] be an expansion by constants for an elementary substructure, and let [Formula: see text] be an arbitrary subset of the universe [Formula: see text]. We show that all formulas in the expansion [Formula: see text] are equivalent to bounded formulas, and so [Formula: see text] is stable (or NIP) if and only if the [Formula: see text]-induced structure [Formula: see text] on [Formula: see text] is stable (or NIP). We then restrict to the case that [Formula: see text] is a pure abelian group with a weakly minimal theory, and [Formula: see text] is mutually algebraic (equivalently, weakly minimal with trivial forking). This setting encompasses most of the recent research on stable expansions of [Formula: see text]. Using various characterizations of mutual algebraicity, we give new examples of stable structures of the form [Formula: see text]. Most notably, we show that if [Formula: see text] is a weakly minimal additive subgroup of the algebraic numbers, [Formula: see text] is enumerated by a homogeneous linear recurrence relation with algebraic coefficients, and no repeated root of the characteristic polynomial of [Formula: see text] is a root of unity, then [Formula: see text] is superstable for any [Formula: see text].
- 
            We explore computational strategies for matching human vocal imitations of birdsong to actual birdsong recordings. We recorded human vocal imitations of birdsong and subsequently analysed these data using three categories of audio features for matching imitations to original birdsong: spectral, temporal, and spectrotemporal. These exploratory analyses suggest that spectral features can help distinguish imitation strategies (e.g. whistling vs. singing) but are insufficient for distinguishing species. Similarly, whereas temporal features are correlated between human imitations and natural birdsong, they are also insufficient. Spectrotemporal features showed the greatest promise, in particular when used to extract a representation of the pitch contour of birdsong and human imitations. This finding suggests a link between the task of matching human imitations to birdsong and retrieval tasks in the music domain such as query-by-humming and cover song retrieval; we borrow from such existing methodologies to outline directions for future research.
- 
            This paper presents CLEAR, a retrieval model that seeks to complement classical lexical exact-match models such as BM25 with semantic matching signals from a neural embedding matching model. CLEAR explicitly trains the neural embedding to encode language structures and semantics that lexical retrieval fails to capture with a novel residual-based embedding learning method. Empirical evaluations demonstrate the advantages of CLEAR over state-of-the-art retrieval models, and that it can substantially improve the end-to-end accuracy and efficiency of reranking pipelines.
 An official website of the United States government