People increasingly use the Internet to make food-related choices, prompting research on food recommendation systems. Recently, works that incorporate nutritional constraints into the recommendation process have been proposed to promote healthier recipes. Ingredient substitution is also used, particularly by people motivated to reduce their intake of a specific nutrient or to avoid a particular category of ingredients, for instance because of allergies. This study takes a complementary approach to empowering people to make healthier food choices by simplifying the process of identifying plausible recipe substitutions. To achieve this goal, this work constructs a large-scale network of similar recipes and analyzes it to reveal properties with important implications for the development of food recommendation systems.
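As a rough illustration of what constructing such a recipe-similarity network could look like, the sketch below links recipes whose ingredient sets overlap strongly. The abstract does not specify the similarity measure or data used, so the Jaccard threshold and toy recipes here are purely hypothetical.

```python
# Hypothetical sketch: building a recipe-similarity network from ingredient overlap.
# Jaccard similarity over ingredient sets is used purely for illustration; the
# study's actual similarity measure and corpus are not given in the abstract.
import itertools
import networkx as nx

recipes = {
    "pancakes":     {"flour", "egg", "milk", "sugar", "butter"},
    "crepes":       {"flour", "egg", "milk", "butter"},
    "omelette":     {"egg", "butter", "salt"},
    "vanilla_cake": {"flour", "egg", "milk", "sugar", "butter", "vanilla"},
}

def jaccard(a, b):
    """Overlap of two ingredient sets, in [0, 1]."""
    return len(a & b) / len(a | b)

# Connect two recipes when their ingredient overlap exceeds a chosen threshold.
G = nx.Graph()
G.add_nodes_from(recipes)
for (r1, i1), (r2, i2) in itertools.combinations(recipes.items(), 2):
    sim = jaccard(i1, i2)
    if sim >= 0.5:
        G.add_edge(r1, r2, weight=sim)

# Network-level properties of the kind such a study might analyze.
print(nx.density(G))
print(max(nx.connected_components(G), key=len))
```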
Ki-Cook: clustering multimodal cooking representations through knowledge-infused learning
Cross-modal recipe retrieval has gained prominence due to its ability to retrieve a text representation given an image representation and vice versa. Clustering these recipe representations by similarity is essential for retrieving relevant information about unknown food images. Existing studies cluster similar recipe representations in the latent space based on class names. Due to inter-class similarity and intra-class variation, associating a recipe with a class name does not provide sufficient knowledge about the recipe to determine similarity. In contrast, the recipe title, ingredients, and cooking actions provide detailed knowledge about a recipe and are a better determinant of similar recipes. In this study, we utilized this additional knowledge, such as the ingredients and recipe title, to identify similar recipes, with particular attention to rare ingredients. To incorporate this knowledge, we propose Ki-Cook, a knowledge-infused multimodal cooking representation learning network built on the procedural attribute of the cooking process. To the best of our knowledge, this is the first study to adopt a comprehensive recipe similarity determinant to identify and cluster similar recipe representations. The proposed network also incorporates ingredient images to learn a multimodal cooking representation. Since the motivation for clustering similar recipes is to retrieve relevant information for an unknown food image, we evaluated the ingredient retrieval task. Our empirical analysis establishes that the proposed model improves Coverage of Ground Truth by 12% and Intersection over Union by 10% compared to the baseline models. On average, the representations learned by our model contain an additional 15.33% of rare ingredients compared to the baseline models. Owing to this difference, our qualitative evaluation shows a 39% improvement in clustering similar recipes in the latent space compared to the baseline models, with an inter-annotator agreement (Fleiss' kappa) of 0.35.
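For readers unfamiliar with the two ingredient-retrieval metrics named above, the sketch below shows one common set-based reading of Coverage of Ground Truth and Intersection over Union for a predicted versus a true ingredient set. The exact definitions used by Ki-Cook may differ, so treat these formulas and the toy example as assumptions.

```python
# Illustrative, assumed definitions of the two retrieval metrics:
#   coverage = |pred ∩ gt| / |gt|        (fraction of true ingredients recovered)
#   IoU      = |pred ∩ gt| / |pred ∪ gt| (Jaccard overlap of the two sets)

def coverage_of_ground_truth(predicted: set, ground_truth: set) -> float:
    """Fraction of the true ingredients that the retrieved set recovers."""
    return len(predicted & ground_truth) / len(ground_truth)

def intersection_over_union(predicted: set, ground_truth: set) -> float:
    """Jaccard overlap between retrieved and true ingredient sets."""
    return len(predicted & ground_truth) / len(predicted | ground_truth)

# Toy example: retrieval for an unknown food image of a tomato soup.
gt = {"tomato", "onion", "garlic", "basil", "olive oil"}
pred = {"tomato", "onion", "garlic", "cream"}

print(f"coverage = {coverage_of_ground_truth(pred, gt):.2f}")  # 0.60
print(f"IoU      = {intersection_over_union(pred, gt):.2f}")   # 0.50
```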
- Award ID(s): 2335967
- PAR ID: 10530769
- Publisher / Repository: Frontiers in Big Data
- Date Published:
- Journal Name: Frontiers in Big Data
- Volume: 6
- ISSN: 2624-909X
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering. Visual question answering (VQA) requires systems to perform concept-level reasoning by unifying unstructured (e.g., the context in question and answer; "QA context") and structured (e.g., knowledge graph for the QA context and scene; "concept graph") multimodal knowledge. Existing works typically combine a scene graph and a concept graph of the scene by connecting corresponding visual nodes and concept nodes, then incorporate the QA context representation to perform question answering. However, these methods only perform a unidirectional fusion from unstructured knowledge to structured knowledge, limiting their potential to capture joint reasoning over the heterogeneous modalities of knowledge. To perform more expressive reasoning, we propose VQA-GNN, a new VQA model that performs bidirectional fusion between unstructured and structured multimodal knowledge to obtain unified knowledge representations. Specifically, we inter-connect the scene graph and the concept graph through a super node that represents the QA context, and introduce a new multimodal GNN technique to perform inter-modal message passing for reasoning that mitigates representational gaps between modalities. On two challenging VQA tasks (VCR and GQA), our method outperforms strong baseline VQA methods by 3.2% on VCR (Q-AR) and 4.6% on GQA, suggesting its strength in performing concept-level reasoning. Ablation studies further demonstrate the efficacy of the bidirectional fusion and multimodal GNN method in unifying unstructured and structured multimodal knowledge.
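A minimal, hypothetical sketch of the structural idea in that summary: a QA-context super node connected to both the scene graph and the concept graph, with one round of bidirectional message passing. The layer sizes, aggregation scheme, and node features below are toy placeholders, not the authors' implementation.

```python
# Toy sketch of super-node fusion between a scene graph and a concept graph.
import torch
import torch.nn as nn

dim = 64
scene_nodes = torch.randn(5, dim)    # visual region nodes (placeholder features)
concept_nodes = torch.randn(8, dim)  # knowledge-graph concept nodes (placeholder)
qa_context = torch.randn(1, dim)     # super node summarizing the question/answer text

to_super = nn.Linear(dim, dim)       # messages flowing into the super node
from_super = nn.Linear(dim, dim)     # messages flowing back out to both graphs

# Bidirectional fusion, one round: nodes -> super node, then super node -> nodes.
incoming = to_super(torch.cat([scene_nodes, concept_nodes], dim=0)).mean(0, keepdim=True)
qa_context = torch.relu(qa_context + incoming)
scene_nodes = torch.relu(scene_nodes + from_super(qa_context))
concept_nodes = torch.relu(concept_nodes + from_super(qa_context))

# A unified representation like this could feed an answer classifier.
unified = torch.cat([qa_context,
                     scene_nodes.mean(0, keepdim=True),
                     concept_nodes.mean(0, keepdim=True)], dim=-1)
print(unified.shape)  # torch.Size([1, 192])
```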
- In this paper, we explore text classification with extremely weak supervision, i.e., only relying on the surface text of class names. This is a more challenging setting than seed-driven weak supervision, which allows a few seed words per class. We opt to attack this problem from a representation learning perspective: ideal document representations should lead to nearly the same results between clustering and the desired classification. In particular, one can classify the same corpus differently (e.g., based on topics and locations), so document representations should be adaptive to the given class names. We propose a novel framework, X-Class, to realize the adaptive representations. Specifically, we first estimate class representations by incrementally adding the most similar word to each class until inconsistency arises. Following a tailored mixture of class attention mechanisms, we obtain the document representation via a weighted average of contextualized word representations. With the prior of each document assigned to its nearest class, we then cluster and align the documents to classes. Finally, we pick the most confident documents from each cluster to train a text classifier. Extensive experiments demonstrate that X-Class can rival and even outperform seed-driven weakly supervised methods on 7 benchmark datasets.
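The toy sketch below illustrates the two steps that summary describes: growing a class representation from its most similar words, then assigning each document to the nearest class. It uses a fixed neighborhood size and synthetic embeddings, whereas X-Class itself expands each class until inconsistency arises and uses attention-weighted contextualized representations, so the code is only a simplified stand-in.

```python
# Simplified stand-in for class-representation estimation and nearest-class assignment.
import numpy as np

rng = np.random.default_rng(0)
base = {"sports": rng.normal(size=32), "politics": rng.normal(size=32)}

# Toy embeddings: each word sits near its topic vector (stand-ins for contextualized vectors).
vocab = {}
for topic, words in [("sports", ["sports", "game", "team", "score"]),
                     ("politics", ["politics", "election", "vote", "party"])]:
    for w in words:
        vocab[w] = base[topic] + 0.1 * rng.normal(size=32)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def class_representation(class_name, k=3):
    """Average the class name's vector with its k most similar vocabulary words
    (X-Class instead grows this set until inconsistency arises)."""
    seed = vocab[class_name]
    ranked = sorted(vocab, key=lambda w: cosine(seed, vocab[w]), reverse=True)
    return np.mean([vocab[w] for w in ranked[:k]], axis=0)

classes = {c: class_representation(c) for c in ["sports", "politics"]}

# Assign a document (here, a plain average of its word vectors) to the nearest class.
doc = np.mean([vocab[w] for w in ["game", "team", "score"]], axis=0)
label = max(classes, key=lambda c: cosine(doc, classes[c]))
print(label)  # "sports"
```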
- College students may have limited access to produce and may lack confidence in preparing it, but cooking videos can show how to make healthy dishes. The Cognitive Theory of Multimedia Learning suggests that learning is enhanced when visual and auditory information is presented considering cognitive load (e.g., highlighting important concepts, eliminating extraneous information, and keeping the video brief and conversational). The purpose of this project was to pilot test a food label for produce grown at an urban university and assess whether student confidence in preparing produce improved after using the label and QR code to view a recipe video developed using principles from the Cognitive Theory of Multimedia Learning. The video showed a student preparing a salad with ingredients available on campus. Students indicated the label was helpful and reported greater perceived confidence in preparing lettuce after viewing the label and video (mean confidence of 5.60 ± 1.40 before vs. 6.14 ± 0.89 after, p = 0.016, n = 28). Keeping the video short and providing ingredients and amounts onscreen as text were cited as helpful. Thus, a brief cooking video and interactive label may improve confidence in preparing produce available on campus. Future work should determine whether the label impacts produce consumption and if it varies depending on the type of produce used.
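The before/after comparison in that summary is a paired design. The summary does not state which test produced p = 0.016, so the sketch below simply shows how such paired ratings could be analyzed with a paired t-test on synthetic data; the generated numbers will not reproduce the study's values.

```python
# Illustrative paired before/after analysis on synthetic ratings (assumed test:
# paired t-test; the study's actual analysis is not specified in the summary).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
before = np.clip(rng.normal(5.6, 1.4, size=28), 1, 7)            # 1-7 confidence ratings
after = np.clip(before + rng.normal(0.55, 0.8, size=28), 1, 7)   # modest average gain

t, p = stats.ttest_rel(after, before)
print(f"mean before = {before.mean():.2f}, mean after = {after.mean():.2f}, p = {p:.3f}")
```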
- In multimodal machine learning, effectively addressing the missing modality scenario is crucial for improving performance in downstream tasks such as in medical contexts where data may be incomplete. Although some attempts have been made to retrieve embeddings for missing modalities, two main bottlenecks remain: (1) the need to consider both intra- and inter-modal context, and (2) the cost of embedding selection, where embeddings often lack modality-specific knowledge. To address this, the authors propose MoE-Retriever, a novel framework inspired by Sparse Mixture of Experts (SMoE). MoE-Retriever defines a supporting group for intra-modal inputs—samples that commonly lack the target modality—by selecting samples with complementary modality combinations for the target modality. This group is integrated with inter-modal inputs from different modalities of the same sample, establishing both intra- and inter-modal contexts. These inputs are processed by Multi-Head Attention to generate context-aware embeddings, which serve as inputs to the SMoE Router that automatically selects the most relevant experts (embedding candidates). Comprehensive experiments on both medical and general multimodal datasets demonstrate the robustness and generalizability of MoE-Retriever, marking a significant step forward in embedding retrieval methods for incomplete multimodal data.
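As a minimal sketch of the Sparse-Mixture-of-Experts routing idea in that summary, the code below scores a set of expert modules with a gating network and combines only the top-k of them. The dimensions, expert layers, and the context input are placeholders, not the MoE-Retriever implementation.

```python
# Toy sparse-MoE router: score experts, keep top-k, combine their outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, n_experts, top_k = 64, 8, 2
router = nn.Linear(dim, n_experts)                              # gating network
experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_experts)])

# Context-aware input, e.g. produced upstream by multi-head attention over
# intra- and inter-modal inputs (here just a random placeholder).
context = torch.randn(1, dim)

scores = router(context)                                        # (1, n_experts)
weights, idx = scores.topk(top_k, dim=-1)                       # keep only top-k experts
weights = F.softmax(weights, dim=-1)

# Weighted combination of the selected experts' outputs = retrieved embedding.
retrieved = sum(w * experts[int(i)](context) for w, i in zip(weights[0], idx[0]))
print(retrieved.shape)  # torch.Size([1, 64])
```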