
Search for: All records

Creators/Authors contains: "Nwogu, Ifeoma"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).


  1. The role of a sign interpreting agent is to bridge the communication gap between the hearing-only and Deaf or Hard of Hearing communities by translating both from sign language to text and from text to sign language. Until now, much of the AI work in automated sign language processing has focused primarily on sign language to text translation, which puts the advantage mainly on the side of hearing individuals. In this work, we describe advances in sign language processing based on transformer networks. Specifically, we introduce SignNet II, a sign language processing architecture, a promising step towards facilitating two-way sign language communication. It comprises sign-to-text and text-to-sign networks jointly trained using a dual learning mechanism. Furthermore, by exploiting the notion of sign similarity, a metric embedding learning process is introduced to enhance the text-to-sign translation performance. Using a bank of multi-feature transformers, we analyzed several input feature representations and discovered that keypoint-based pose features consistently performed well, irrespective of the quality of the input videos. We demonstrated that the two jointly trained networks outperformed their singly-trained counterparts, showing noteworthy enhancements in BLEU-1 through BLEU-4 scores when tested on the largest available German Sign Language (GSL) benchmark dataset.
    Free, publicly-accessible full text available December 1, 2023
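The metric-embedding step this abstract mentions can be illustrated with a generic triplet loss, which pulls embeddings of similar signs together and pushes dissimilar ones apart. The specific loss and embedding space used in SignNet II are not given in the abstract, so this numpy sketch is a hypothetical stand-in:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style metric-embedding loss: a similar sign (positive) should
    sit closer to the anchor than a dissimilar sign (negative) by `margin`."""
    d_pos = np.linalg.norm(anchor - positive)  # distance to a similar sign
    d_neg = np.linalg.norm(anchor - negative)  # distance to a dissimilar sign
    return max(0.0, d_pos - d_neg + margin)

# Toy 2-D embeddings: the positive already lies much closer than the negative,
# so the margin is satisfied and the loss is zero.
a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])   # similar sign
n = np.array([3.0, 0.0])   # dissimilar sign
loss = triplet_loss(a, p, n)   # 0.1 - 3.0 + 1.0 clamps to 0.0
```

Swapping the positive and negative yields a large positive loss, which is the gradient signal that would reshape the embedding during training.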
  2. While several methods for predicting uncertainty on deep networks have been recently proposed, they do not always readily translate to large and complex datasets without significant overhead. In this paper we utilize a special instance of the Mixture Density Networks (MDNs) to produce an elegant and compact approach to quantify uncertainty in regression problems. When applied to standard regression benchmark datasets, we show an improvement in predictive log-likelihood and root-mean-square error when compared to existing state-of-the-art methods. We demonstrate the efficacy and practical usefulness of the method for (i) predicting future stock prices from stochastic, highly volatile time-series data; (ii) anomaly detection in real-life highly complex video segments; and (iii) the task of age estimation and data cleansing on the challenging IMDb-Wiki dataset of half a million face images.
    Free, publicly-accessible full text available October 9, 2023
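The Gaussian-mixture machinery behind an MDN can be sketched in a few lines: the negative log-likelihood that training would minimize, and the total predictive variance that serves as the uncertainty estimate. The 1-D setting and function names here are illustrative, not taken from the paper:

```python
import numpy as np

def mdn_nll(y, pi, mu, sigma):
    """Negative log-likelihood of a scalar target y under a 1-D Gaussian
    mixture with mixing weights pi, means mu, and standard deviations sigma."""
    pi, mu, sigma = map(np.asarray, (pi, mu, sigma))
    comp = pi * np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return -np.log(np.sum(comp))

def mdn_variance(pi, mu, sigma):
    """Total predictive variance via the law of total variance:
    within-component variance plus the spread of the component means."""
    pi, mu, sigma = map(np.asarray, (pi, mu, sigma))
    mean = np.sum(pi * mu)
    return np.sum(pi * (sigma ** 2 + mu ** 2)) - mean ** 2
```

A wide `mdn_variance` flags inputs the network is uncertain about, which is what makes the approach usable for anomaly detection and data cleansing.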
  3. We investigate the behaviors that compressed convolutional models exhibit for two key areas within AI trust: (i) the ability for a model to be explained and (ii) its ability to be robust to adversarial attacks. While compression is known to shrink model size and decrease inference time, other properties of compression are not as well studied. We employ several compression methods on benchmark datasets, including ImageNet, to study how compression affects the convolutional aspects of an image model. We investigate explainability by studying how well compressed convolutional models can extract visual features with t-SNE, as well as visualizing the localization ability of our models with class activation maps. We show that even with significantly compressed models, vital explainability is preserved and even enhanced. We find that, when applying the Carlini & Wagner attack algorithm to our compressed models, robustness is maintained, and some forms of compression make attacks more difficult or time-consuming.
    Free, publicly-accessible full text available August 21, 2023
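As a concrete, hypothetical example of one method in the compression family the abstract studies, magnitude pruning zeroes out the smallest-magnitude weights of a layer (the paper's exact compression methods are not listed in the abstract):

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest
    absolute values; ties at the threshold are also pruned."""
    w = np.asarray(w, dtype=float)
    k = int(round(sparsity * w.size))
    if k == 0:
        return w.copy()
    thresh = np.sort(np.abs(w), axis=None)[k - 1]   # k-th smallest magnitude
    return np.where(np.abs(w) <= thresh, 0.0, w)

w = np.array([0.05, -0.9, 0.3, -0.02, 0.7, 0.1])
pruned = magnitude_prune(w, 0.5)   # drops the 3 smallest-magnitude weights
```

The resulting sparse weight tensor shrinks storage and can speed inference, which is the starting point for then asking how explainability and robustness are affected.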
  4. This work is motivated by the need to automate the analysis of parent-infant interactions to better understand the existence of any potential behavioral patterns useful for the early diagnosis of autism spectrum disorder (ASD). It presents an approach for synthesizing the facial expression exchanges that occur during parent-infant interactions. This is accomplished by developing a novel approach that uses landmarks when synthesizing changing facial expressions. The proposed model consists of two components: (i) The first is a landmark converter that receives a set of facial landmarks and the target emotion as input and outputs a set of new landmarks transformed to match the emotion. (ii) The second component involves an image converter that takes in an input image, a target landmark and a target emotion and outputs a face transformed to match the input emotion. The inclusion of landmarks in the generation process proves useful in the generation of baby facial expressions; babies have somewhat different facial musculature and facial dynamics than adults. This paper presents a realistic-looking matrix of changing facial expressions sampled from a 2-D emotion continuum (valence and arousal) and displays successfully transferred facial expressions from real-life mother-infant dyads to novel ones.
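The landmark-converter idea can be illustrated with a toy linear blend between input landmarks and an emotion-template layout. The paper's converter is a learned network conditioned on the target emotion, not this interpolation, so treat the sketch as a hypothetical stand-in:

```python
import numpy as np

def shift_landmarks(landmarks, template, intensity):
    """Illustrative (not the paper's) landmark converter: linearly move each
    (x, y) facial landmark toward an emotion-template layout.
    intensity=0 keeps the input; intensity=1 reaches the template exactly."""
    landmarks = np.asarray(landmarks, dtype=float)
    template = np.asarray(template, dtype=float)
    return (1.0 - intensity) * landmarks + intensity * template

neutral = np.array([[0.0, 0.0], [1.0, 0.0]])   # toy mouth-corner points
smile   = np.array([[0.0, 0.3], [1.0, 0.3]])   # corners raised
half    = shift_landmarks(neutral, smile, 0.5)  # halfway expression
```

Transforming landmarks first and only then rendering pixels is what lets the approach cope with infant faces, whose musculature and dynamics differ from the adult faces most generators are trained on.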
  5. While a significant amount of work has been done on the commonly used, tightly-constrained weather-based German Sign Language (GSL) dataset, little has been done for continuous sign language translation (SLT) in more realistic settings, including American Sign Language (ASL) translation. Also, while CNN-based features have been consistently shown to work well on the GSL dataset, it is not clear whether such features will work as well in more realistic settings when there are more heterogeneous signers in non-uniform backgrounds. To this end, in this work, we introduce a new, realistic phrase-level ASL dataset (ASLing), and explore the role of different types of visual features (CNN embeddings, human body keypoints, and optical flow vectors) in translating it to spoken American English. We propose a novel Transformer-based, visual feature learning method for ASL translation. We demonstrate the explainability efficacy of our proposed learning methods by visualizing activation weights under various input conditions and discover that the body keypoints are consistently the most reliable set of input features. Using our model, we successfully transfer-learn from the larger GSL dataset to ASLing, resulting in significant BLEU score improvements. In summary, this work goes a long way in bringing together the AI resources required for automated ASL translation in unconstrained environments.
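The reliability of keypoint features across heterogeneous signers typically depends on normalization. A plausible preprocessing step (not specified in the abstract; the joint indices and scale pair here are assumptions) centers the pose on a root joint and scales by a reference bone length, removing camera and body-size effects:

```python
import numpy as np

def normalize_keypoints(kps, root_idx=0, scale_pair=(1, 2)):
    """Hypothetical keypoint preprocessing: translate so the root joint
    (e.g. the neck) sits at the origin, then scale by the distance between
    a reference joint pair (e.g. the shoulders)."""
    kps = np.asarray(kps, dtype=float)
    centered = kps - kps[root_idx]
    scale = np.linalg.norm(centered[scale_pair[0]] - centered[scale_pair[1]])
    return centered / scale

# Toy pose: neck plus two shoulders, in arbitrary camera coordinates.
pose = np.array([[2.0, 2.0], [1.0, 2.0], [3.0, 2.0]])
norm = normalize_keypoints(pose)   # shoulder distance becomes exactly 1
```

After this step, the same sign performed by different signers at different distances maps to nearly identical feature vectors, which is one reason keypoints can transfer better than raw CNN embeddings.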
  6. This research work explores different machine learning techniques for recognizing the existence of rapport between two people engaged in a conversation, based on their facial expressions. First, using artificially generated pairs of correlated data signals, a coupled gated recurrent unit (cGRU) neural network is developed to measure the extent of similarity between the temporal evolution of pairs of time-series signals. By pre-selecting their covariance values (between 0.1 and 1.0), pairs of coupled sequences are generated. Using the developed cGRU architecture, this covariance between the signals is successfully recovered. Using this and various other coupled architectures, tests for rapport (measured by the extent of mirroring and mimicking of behaviors) are conducted on real-life datasets. On fifty-nine (N = 59) pairs of interactants in an interview setting, a transformer-based coupled architecture performs the best in determining the existence of rapport. To test for generalization, the models were applied on never-been-seen data collected 14 years prior, also to predict the existence of rapport. The coupled transformer model again performed the best for this transfer learning task, determining which pairs of interactants had rapport and which did not. The experiments and results demonstrate the advantages of coupled architectures for predicting an interactional process such as rapport, even in the presence of limited data.
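Coupled signal pairs with a preset correlation can be built in closed form; the standard construction below (the paper's exact generator may differ) produces two unit-variance series with population correlation rho, which is the quantity the cGRU is trained to recover:

```python
import numpy as np

def coupled_sequences(n, rho, rng=None):
    """Generate a pair of unit-variance Gaussian time series whose
    population correlation is rho, via y = rho*x + sqrt(1-rho^2)*z."""
    rng = np.random.default_rng(rng)
    x = rng.standard_normal(n)
    z = rng.standard_normal(n)           # independent noise
    y = rho * x + np.sqrt(1.0 - rho ** 2) * z
    return x, y

x, y = coupled_sequences(100_000, rho=0.6, rng=0)
empirical = np.corrcoef(x, y)[0, 1]      # sample estimate, close to 0.6
```

Because the ground-truth coupling strength is known by construction, such pairs make a clean synthetic benchmark before moving to real interactant data, where "coupling" must stand in for behavioral mirroring.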
  7. Sign language translation without transcription has only recently started to gain attention. In our work, we focus on improving the state-of-the-art translation by introducing a multi-feature fusion architecture with enhanced input features. As sign language is challenging to segment, we obtain the input features by extracting overlapping scaled segments across the video and obtaining their 3D CNN representations. We exploit the attention mechanism in the fusion architecture by initially learning dependencies between different frames of the same video and later fusing them to learn the relations between different features from the same video. In addition to 3D CNN features, we also analyze pose-based features. Our robust methodology outperforms the state-of-the-art sign language translation model by achieving higher BLEU-3 and BLEU-4 scores and also outperforms the state-of-the-art sequence attention models by achieving a 43.54% increase in BLEU-4 score. We conclude that the combined effects of feature scaling and feature fusion make our model more robust in predicting longer n-grams which are crucial in continuous sign language translation.
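The overlapping segment extraction described above amounts to a sliding window over the frame sequence; each window would then be encoded by a 3D CNN. The window and stride values here are illustrative, not the paper's:

```python
def overlapping_segments(frames, window, stride):
    """Slice a frame sequence into fixed-length overlapping windows.
    With stride < window, consecutive segments share frames, so no sign
    boundary is ever lost between two segments."""
    return [frames[i:i + window]
            for i in range(0, len(frames) - window + 1, stride)]

# A 10-frame toy video, 4-frame windows, half-window overlap:
segs = overlapping_segments(list(range(10)), window=4, stride=2)
# windows start at frames 0, 2, 4 and 6
```

Running the same extraction at several window sizes ("scaled segments") gives the multi-feature bank that the fusion attention layers then combine.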
  8. Many sign languages are bona fide natural languages with grammatical rules and lexicons, and hence can benefit from machine translation methods. Similarly, since sign language is a visual-spatial language, it can also benefit from computer vision methods for encoding it. With the advent of deep learning methods in recent years, significant advances have been made in natural language processing (specifically neural machine translation) and in computer vision methods (specifically image and video captioning). Researchers have therefore begun expanding these learning methods to sign language understanding. Sign language interpretation is especially challenging, because it involves a continuous visual-spatial modality where meaning is often derived based on context. The focus of this article, therefore, is to examine various deep learning–based methods for encoding sign language as inputs, and to analyze the efficacy of several machine translation methods, over three different sign language datasets. The goal is to determine which combinations are sufficiently robust for sign language translation without any gloss-based information. To understand the role of the different input features, we perform ablation studies over the model architectures (input features + neural translation models) for improved continuous sign language translation. These input features include body and finger joints, facial points, as well as vector representations/embeddings from convolutional neural networks. The machine translation models explored include several baseline sequence-to-sequence approaches, more complex and challenging networks using attention, reinforcement learning, and the transformer model. We implement the translation methods over multiple sign languages—German (GSL), American (ASL), and Chinese sign languages (CSL).
From our analysis, the transformer model combined with input embeddings from ResNet50 or pose-based landmark features outperformed all the other sequence-to-sequence models by achieving higher BLEU-2 through BLEU-4 scores when applied to the controlled and constrained GSL benchmark dataset. These combinations also showed significant promise on the other less controlled ASL and CSL datasets.
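Since the comparisons across these papers hinge on BLEU-1 through BLEU-4, here is a compact single-reference, sentence-level BLEU sketch (clipped n-gram precision with a brevity penalty). Published results use corpus-level tooling such as sacreBLEU, so this is only an illustration of the metric's mechanics:

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Single-reference BLEU over whitespace tokens: geometric mean of
    clipped n-gram precisions for n=1..max_n, times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        r_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())
        precisions.append(overlap / max(1, sum(c_ngrams.values())))
    if min(precisions) == 0:
        return 0.0                      # no smoothing in this sketch
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(1, len(cand)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

score = bleu("the cat sat on the mat", "the cat sat on the mat")   # 1.0
```

Because BLEU-4 requires matching 4-grams, it is the score most sensitive to long-range fluency, which is why the translation papers above emphasize BLEU-3 and BLEU-4 gains.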
  9. Expression neutralization is the process of synthetically altering an image of a face so as to remove any facial expression from it without changing the face's identity. Facial expression neutralization could have a variety of applications, particularly in facial recognition, action unit analysis, or improving the quality of identification pictures for various types of documents. Our proposed model, StoicNet, combines the robust encoding capacity of variational autoencoders, the generative power of generative adversarial networks, and the enhancing capabilities of super-resolution networks with a learned encoding transformation to achieve compelling expression neutralization while preserving the identity of the input face. Objective experiments demonstrate that StoicNet successfully generates realistic, identity-preserved faces with neutral expressions, regardless of the emotion or expression intensity of the input face.
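The variational-autoencoder half of such a model rests on the reparameterization trick, which keeps latent sampling differentiable with respect to the encoder outputs. A minimal numpy sketch (shapes and names are illustrative, not StoicNet's):

```python
import numpy as np

def reparameterize(mu, log_var, rng=None):
    """VAE reparameterization trick: sample z = mu + sigma * eps with
    eps ~ N(0, I), so gradients can flow through mu and log_var while
    the randomness stays in the parameter-free eps."""
    rng = np.random.default_rng(rng)
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# One latent draw from a 2-D standard-normal posterior:
z = reparameterize([0.0, 0.0], [0.0, 0.0], rng=42)
```

In an expression-neutralization pipeline, a learned transformation of such latent codes would map an expressive face's encoding toward its neutral counterpart before decoding.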