Title: DeQA: On-Device Question Answering
Today there is no effective support for device-wide question answering on mobile devices. State-of-the-art QA models are deep-learning behemoths designed for the cloud; they run extremely slowly and require more memory than is available on phones. We present DeQA, a suite of latency and memory optimizations that adapts existing QA systems to run completely locally on mobile phones. Specifically, we design two latency optimizations that (1) stop processing documents when further processing cannot improve answer quality, and (2) identify computation that does not depend on the question and move it offline. These optimizations do not depend on the QA model internals and can be applied to several existing QA models. DeQA also implements a set of memory optimizations by (i) loading partial indexes in memory, (ii) working with smaller units of data, and (iii) replacing in-memory lookups with a key-value database. We use DeQA to port three state-of-the-art QA systems to the mobile device and evaluate them over three datasets. The first is the large-scale SQuAD dataset defined over the Wikipedia collection. We also create two on-device QA datasets, one over a publicly available email data collection and the other using a cross-app data collection we obtain from two users. Our evaluations show that DeQA can run QA models with only a few hundred MBs of memory and provides at least a 13x speedup on average on the mobile phone across all three datasets.
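The first latency optimization (stopping document processing early) can be sketched as follows. This is a hedged illustration, not DeQA's actual code: the function names, the confidence model, and the per-document upper bound `bound_fn` are all assumptions made for the example.

```python
# Illustrative sketch of early stopping over retrieved documents:
# read documents in ranked order and stop once the best answer found
# so far cannot be beaten by any remaining document.

def answer_with_early_stopping(question, ranked_docs, read_fn, bound_fn):
    """ranked_docs: documents sorted by retrieval score, best first.
    read_fn(question, doc) -> (answer, confidence)
    bound_fn(doc) -> upper bound on any answer confidence from doc."""
    best_answer, best_conf = None, float("-inf")
    for doc in ranked_docs:
        # If even an optimistic estimate for this (and every later,
        # lower-ranked) document cannot exceed the current best answer's
        # confidence, further processing cannot improve answer quality.
        if bound_fn(doc) <= best_conf:
            break
        answer, conf = read_fn(question, doc)
        if conf > best_conf:
            best_answer, best_conf = answer, conf
    return best_answer, best_conf
```

Because `ranked_docs` is sorted, a single check against the next document's bound suffices to justify stopping; in the abstract's terms, the reader skips exactly the documents whose processing could not change the returned answer.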
Award ID(s):
1815358
PAR ID:
10098293
Author(s) / Creator(s):
Date Published:
Journal Name:
ACM International Conference on Mobile Systems, Applications, and Services
Volume:
1
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    The problem of answering questions using knowledge from pre-trained language models (LMs) and knowledge graphs (KGs) presents two challenges: given a QA context (question and answer choice), methods need to (i) identify relevant knowledge from large KGs, and (ii) perform joint reasoning over the QA context and KG. Here we propose a new model, QA-GNN, which addresses the above challenges through two key innovations: (i) relevance scoring, where we use LMs to estimate the importance of KG nodes relative to the given QA context, and (ii) joint reasoning, where we connect the QA context and KG to form a joint graph, and mutually update their representations through graph-based message passing. We evaluate QA-GNN on the CommonsenseQA and OpenBookQA datasets, and show its improvement over existing LM and LM+KG models, as well as its capability to perform interpretable and structured reasoning, e.g., correctly handling negation in questions. 
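The relevance-scoring idea in QA-GNN can be illustrated with a toy stand-in. A hedged sketch only: the paper uses a pre-trained LM to score KG nodes against the QA context, whereas here a plain dot product over toy embedding vectors plays that role, and all vectors are invented for the example.

```python
# Toy illustration of KG-node relevance scoring: rank knowledge-graph
# nodes by a similarity score against the QA context representation.

def score_kg_nodes(context_vec, node_vecs):
    """Return (node, score) pairs sorted by descending relevance.
    node_vecs: dict mapping node name -> embedding vector."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scored = [(name, dot(context_vec, vec)) for name, vec in node_vecs.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)
```

In QA-GNN the resulting scores are then used to condition the joint graph over which message passing runs; this sketch covers only the ranking step.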
  2. We present a method for mining the web for text entered on mobile devices. Using searching, crawling, and parsing techniques, we locate text that can be reliably identified as originating from 300 mobile devices. This includes 341,000 sentences written on iPhones alone. Our data enables a richer understanding of how users type “in the wild” on their mobile devices. We compare text and error characteristics of different device types, such as touchscreen phones, phones with physical keyboards, and tablet computers. Using our mined data, we train language models and evaluate these models on mobile test data. A mixture model trained on our mined data, Twitter, blog, and forum data predicts mobile text better than baseline models. Using phone and smartwatch typing data from 135 users, we demonstrate that our models improve the recognition accuracy and word predictions of a state-of-the-art touchscreen virtual keyboard decoder. Finally, we make our language models and mined dataset available to other researchers.
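The mixture model mentioned above amounts to a linear interpolation of per-source language models. The sketch below shows the interpolation step only; the word probabilities, mixture weights, and unigram framing are toy assumptions, not values from the paper.

```python
# Hedged sketch of a mixture language model: the probability of a word
# is a weighted sum of its probability under each source-specific model
# (e.g., mined mobile text, Twitter, blog, forum data).

def mixture_prob(word, source_models, weights):
    """P(word) = sum_i w_i * P_i(word); weights should sum to 1.
    source_models: list of dicts mapping word -> probability."""
    return sum(w * model.get(word, 0.0)
               for model, w in zip(source_models, weights))
```

In practice the mixture weights are typically tuned on held-out mobile text so that sources resembling real mobile input receive more weight.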
  3. The memory footprint of modern applications like large language models (LLMs) far exceeds the memory capacity of the accelerators they run on and often spills over to host memory. As model sizes continue to grow, DRAM-based memory is no longer sufficient to contain these models, resulting in further spill-over to storage and necessitating the use of technologies like Intel Optane and CXL-enabled memory expansion. While such technologies provide more capacity, their higher latency and lower bandwidth have given rise to heterogeneous memory configurations that attempt to strike a balance between capacity and performance. This paper evaluates the impact of such memory configurations on a GPU running out-of-core LLMs. Starting with basic host/device bandwidth measurements on a NUMA system equipped with Intel Optane and an Nvidia A100, we present a comprehensive performance analysis of serving OPT-30B and OPT-175B models using FlexGen, a state-of-the-art serving framework. Our characterization shows that FlexGen's weight placement algorithm is a key bottleneck limiting performance. Based on this observation, we evaluate two alternate weight placement strategies, one each optimizing for inference latency and throughput. When combined with model quantization, our strategies improve latency and throughput by 27% and 5x, respectively. These figures are within 9% and 6% of an all-DRAM system, demonstrating how careful data placement can effectively enable the substitution of DRAM with high-capacity but slower memory, improving overall system energy efficiency.
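The weight-placement problem described above can be sketched as assigning model layers to memory tiers, fastest first. This is an illustrative greedy fill under assumed tier names and sizes; FlexGen's actual algorithm optimizes a cost model rather than simply filling capacity in order.

```python
# Hedged sketch of tiered weight placement: put each layer's weights in
# the fastest memory tier (GPU HBM, then DRAM, then slower expansion
# memory) that still has room for them.

def place_weights(layer_sizes, tiers):
    """layer_sizes: dict layer -> size; tiers: (name, capacity) pairs,
    fastest first. Returns a dict mapping each layer to a tier name."""
    placement = {}
    free = {name: cap for name, cap in tiers}
    order = [name for name, _ in tiers]
    for layer, size in layer_sizes.items():
        for name in order:
            if free[name] >= size:
                free[name] -= size
                placement[layer] = name
                break
        else:
            # No tier has room: the model does not fit in this hierarchy.
            raise MemoryError(f"no tier can hold {layer}")
    return placement
```

A latency-optimized strategy would bias hot layers toward the fast tiers, while a throughput-optimized one can tolerate more weights in slow memory if transfers overlap with compute; the sketch captures only the capacity constraint both must respect.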
  4. To build robust question answering systems, we need the ability to verify whether answers to questions are truly correct, not just “good enough” in the context of imperfect QA datasets. We explore the use of natural language inference (NLI) as a way to achieve this goal, as NLI inherently requires the premise (document context) to contain all necessary information to support the hypothesis (proposed answer to the question). We leverage large pre-trained models and recent prior datasets to construct powerful question conversion and decontextualization modules, which can reformulate QA instances as premise-hypothesis pairs with very high reliability. Then, by combining standard NLI datasets with NLI examples automatically derived from QA training data, we can train NLI models to evaluate QA models’ proposed answers. We show that our approach improves the confidence estimation of a QA model across different domains, evaluated in a selective QA setting. Careful manual analysis over the predictions of our NLI model shows that it can further identify cases where the QA model produces the right answer for the wrong reason, i.e., when the answer sentence cannot address all aspects of the question. 
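The premise-hypothesis framing above can be shown with a deliberately naive stand-in. The paper builds learned question-conversion and decontextualization models; the template rewrite below is a hypothetical toy that only illustrates how a QA instance becomes an NLI pair, and it handles just one question shape.

```python
# Toy illustration of recasting a QA instance as an NLI pair: the
# document context becomes the premise, and the question plus proposed
# answer are rewritten into a declarative hypothesis.

def qa_to_nli(context, question, answer):
    """Template-based rewrite for 'What is X?' questions only."""
    hypothesis = question.rstrip("?").replace("What is", "").strip()
    hypothesis = f"{hypothesis} is {answer}."
    return {"premise": context, "hypothesis": hypothesis}
```

An NLI model then judges whether the premise entails the hypothesis; a proposed answer is accepted only when entailment holds, which is what lets the approach catch answers that are right for the wrong reason.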
  5. Graph Convolutional Networks (GCNs) have successfully brought deep learning to graph structures for social network analysis, bio-informatics, etc. The execution pattern of GCNs is a hybrid of graph processing and neural networks, which poses unique and significant challenges for hardware implementation. Graph processing involves a large amount of irregular memory access with little computation, whereas processing of neural networks involves a large number of operations with regular memory access. Existing graph processing and neural network accelerators are therefore inefficient for computing GCNs. This paper presents PARAG, a processing-in-memory (PIM) architecture for GCN computation. It consists of customized logic with minuscule computing units called Neural Processing Elements (NPEs) interfaced to each bank of the DRAM to support parallel graph processing and neural network computation. It utilizes the massive internal parallelism of DRAM to accelerate GCN execution with high energy efficiency. Simulation results for GCN inference over standard datasets show a latency and energy reduction of three orders of magnitude over a CPU implementation. Compared to a state-of-the-art PIM architecture, PARAG achieves on average a 4x reduction in latency and a 4.23x reduction in energy-delay product (EDP).