Decomposable tasks are complex and comprise of a hierarchy of sub-tasks. Spoken intent prediction, for example, combines automatic speech recognition and natural language understanding. Existing benchmarks, however, typically hold out examples for only the surface-level sub-task. As a result, models with similar performance on these benchmarks may have unobserved performance differences on the other sub-tasks. To allow insightful comparisons between competitive end-to-end architectures, we propose a framework to construct robust test sets using coordinate ascent over sub-task specific utility functions. Given a dataset for a decomposable task, our method optimally creates a test set for each sub-task to individually assess sub-components of the end-to-end model. Using spoken language understanding as a case study, we generate new splits for the Fluent Speech Commands and Snips SmartLights datasets. Each split has two test sets: one with held-out utterances assessing natural language understanding abilities, and one with heldout speakers to test speech processing skills. Our splits identify performance gaps up to 10% between end-to-end systems that were within 1% of each other on the original test sets. These performance gaps allow more realistic and actionable comparisons between different architectures, driving future model development. We release our splits and tools for the community
more »
« less
A Unified Framework of Surrogate Loss by Refactoring and Interpolation
We introduce UniLoss, a unified framework to generate surrogate losses for training deep networks with gradient descent, reducing the amount of manual design of task-specific surrogate losses. Our key observation is that in many cases, evaluating a model with a performance metric on a batch of examples can be refactored into four steps: from input to real-valued scores, from scores to comparisons of pairs of scores, from comparisons to binary variables, and from binary variables to the final performance metric. Using this refactoring we generate differentiable approximations for each non-differentiable step through interpolation. Using UniLoss, we can optimize for different tasks and metrics using one unified framework, achieving comparable performance compared with task-specific losses. We validate the effectiveness of UniLoss on three tasks and four datasets. Code is available at this https URL.
more »
« less
- Award ID(s):
- 1734266
- PAR ID:
- 10202372
- Date Published:
- Journal Name:
- ECCV 2020
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Decomposable tasks are complex and comprise of a hierarchy of sub-tasks. Spoken intent prediction, for example, combines automatic speech recognition and natural language understanding. Existing benchmarks, however, typically hold out examples for only the surface-level sub-task. As a result, models with similar performance on these benchmarks may have unobserved performance differences on the other sub-tasks. To allow insightful comparisons between competitive end-to-end architectures, we propose a framework to construct robust test sets using coordinate ascent over sub-task specific utility functions. Given a dataset for a decomposable task, our method optimally creates a test set for each sub-task to individually assess sub-components of the end-to-end model. Using spoken language understanding as a case study, we generate new splits for the Fluent Speech Commands and Snips SmartLights datasets. Each split has two test sets: one with held-out utterances assessing natural language understanding abilities, and one with heldout speakers to test speech processing skills. Our splits identify performance gaps up to 10% between end-to-end systems that were within 1% of each other on the original test sets. These performance gaps allow more realistic and actionable comparisons between different architectures, driving future model development. We release our splits and tools for the community.1more » « less
-
Abstract Background Individuals with hemiparesis post-stroke often have difficulty with tasks requiring upper extremity (UE) intra- and interlimb use, yet methods to quantify both are limited. Objective To develop a quantitative yet sensitive method to identify distinct features of UE intra- and interlimb use during task performance. Methods Twenty adults post-stroke and 20 controls wore five inertial sensors (wrists, upper arms, sternum) during 12 seated UE tasks. Three sensor modalities (acceleration, angular rate of change, orientation) were examined for three metrics (peak to peak amplitude, time, and frequency). To allow for comparison between sensor data, the resultant values were combined into one motion parameter, per sensor pair, using a novel algorithm. This motion parameter was compared in a group-by-task analysis of variance as a similarity score (0–1) between key sensor pairs: sternum to wrist, wrist to wrist, and wrist to upper arm. A use ratio (paretic/non-paretic arm) was calculated in persons post-stroke from wrist sensor data for each modality and compared to scores from the Adult Assisting Hand Assessment (Ad-AHA Stroke) and UE Fugl-Meyer (UEFM). Results A significant group × task interaction in the similarity score was found for all key sensor pairs. Post-hoc tests between task type revealed significant differences in similarity for sensor pairs in 8/9 comparisons for controls and 3/9 comparisons for persons post stroke. The use ratio was significantly predictive of the Ad-AHA Stroke and UEFM scores for each modality. Conclusions Our algorithm and sensor data analyses distinguished task type within and between groups and were predictive of clinical scores. Future work will assess reliability and validity of this novel metric to allow development of an easy-to-use app for clinicians.more » « less
-
Semantic typing aims at classifying tokens or spans of interest in a textual context into semantic categories such as relations, entity types, and event types. The inferred labels of semantic categories meaningfully interpret how machines understand components of text. In this paper, we present UniST, a unified framework for semantic typing that captures label semantics by projecting both inputs and labels into a joint semantic embedding space. To formulate different lexical and relational semantic typing tasks as a unified task, we incorporate task descriptions to be jointly encoded with the input, allowing UniST to be adapted to different tasks without introducing task-specific model components. UniST optimizes a margin ranking loss such that the semantic relatedness of the input and labels is reflected from their embedding similarity. Our experiments demonstrate that UniST achieves strong performance across three semantic typing tasks: entity typing, relation classification and event typing. Meanwhile, UniST effectively transfers semantic knowledge of labels and substantially improves generalizability on inferring rarely seen and unseen types. In addition, multiple semantic typing tasks can be jointly trained within the unified framework, leading to a single compact multi-tasking model that performs comparably to dedicated single-task models, while offering even better transferability.more » « less
-
We consider the crowdsourcing setting where, in response to the assigned tasks, agents strategically decide both how much effort to exert (from a continuum) and whether to manipulate their reports. The goal is to design payment mechanisms that (1) satisfy limited liability (all payments are non-negative), (2) reduce the principal’s cost of budget, (3) incentivize effort and (4) incentivize truthful responses. In our framework, the payment mechanism composes a performance measurement, which noisily evaluates agents’ effort based on their reports, and a payment function, which converts the scores output by the performance measurement to payments. Previous literature suggests applying a peer prediction mechanism combined with a linear payment function. This method can achieve either (1), (3) and (4), or (2), (3) and (4) in the binary effort setting. In this paper, we suggest using a rank-order payment function (tournament). Assuming Gaussian noise, we analytically optimize the rank-order payment function, and identify a sufficient statistic, sensitivity, which serves as a metric for optimizing the performance measurements. This helps us obtain (1), (2) and (3) simultaneously. Additionally, we show that adding noise to agents’ scores can preserve the truthfulness of the performance measurements under the non-linear tournament, which gives us all four objectives. Our real-data estimated agent-based model experiments show that our method can greatly reduce the payment of effort elicitation while preserving the truthfulness of the performance measurement. In addition, we empirically evaluate several commonly used performance measurements in terms of their sensitivities and strategic robustness.more » « less