We propose a novel knowledge distillation (KD) method to selectively instill teacher knowledge into a student model, motivated by situations where the student's capacity is significantly smaller than the teacher's. In vanilla KD, the teacher primarily sets a predictive target for the student to follow, and we posit that this target is overly optimistic given the student's limited capacity. We develop a novel scaffolding scheme in which the teacher, in addition to setting a predictive target, also scaffolds the student's prediction by censoring hard-to-learn examples. The student model uses the same information as in vanilla KD, namely the teacher's softmax predictions, as inputs, so our proposal can be viewed as a natural variant of vanilla KD. We show on synthetic examples that censoring hard examples smooths the student's loss landscape, so that the student encounters fewer local minima and, as a result, generalizes better. On benchmark datasets, we improve on vanilla KD and are comparable to more intrusive techniques that leverage feature matching.
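The abstract does not spell out the censoring rule. Below is a minimal PyTorch-style sketch of one plausible instantiation, in which examples are "censored" (dropped from the distillation term) when the teacher's softmax confidence on the true label falls below a threshold; the threshold `tau`, temperature `T`, and mixing weight `alpha` are illustrative assumptions, not the paper's actual choices.

```python
import torch
import torch.nn.functional as F

def censored_kd_loss(student_logits, teacher_logits, labels,
                     tau=0.2, T=4.0, alpha=0.5):
    """Hypothetical censoring-based KD loss (a sketch, not the paper's method).

    Examples whose teacher confidence on the true class is below `tau`
    are treated as hard to learn and excluded from the distillation term;
    all examples still contribute to the hard-label cross-entropy term.
    """
    teacher_probs = F.softmax(teacher_logits / T, dim=1)
    # Teacher confidence on the ground-truth class decides censoring.
    conf = teacher_probs.gather(1, labels.unsqueeze(1)).squeeze(1)
    keep = (conf >= tau).float()

    # Soft-target (distillation) term, applied only to kept examples.
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    kd_per_example = F.kl_div(log_p_student, teacher_probs,
                              reduction="none").sum(dim=1)
    kd_term = (keep * kd_per_example).sum() / keep.sum().clamp(min=1.0)

    # Standard cross-entropy on all examples.
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * (T ** 2) * kd_term + (1.0 - alpha) * ce_term
```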
Teaching and learning in uncertainty
We investigate a simple model for social learning with two agents: a teacher and a student. The teacher's goal is to teach the student the state of the world Θ; however, the teacher herself is not certain about Θ and needs to simultaneously learn it and teach it to the student. We model the teacher's and the student's uncertainty via binary symmetric channels and employ a simple heuristic decoder at the student's end. We focus on two teaching strategies: a "low effort" strategy of simply forwarding information, and a "high effort" strategy of communicating the teacher's current best estimate of Θ at each time instant. Using tools from large deviation theory, we calculate the exact learning rates for these strategies and demonstrate regimes where the low effort strategy outperforms the high effort strategy. Our primary technical contribution is a detailed analysis of the large deviation properties of the sign of a transient Markov random walk on the integers.
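The two strategies are easy to simulate. The following is a minimal Monte Carlo sketch, under assumed crossover probabilities `p` (nature to teacher) and `q` (teacher to student) that are not specified in the abstract: the "low effort" strategy forwards the teacher's raw noisy observation, while the "high effort" strategy forwards the teacher's running majority-vote estimate; the student decodes with a simple majority rule in both cases, standing in for the paper's heuristic decoder.

```python
import random

def simulate(theta=1, p=0.1, q=0.2, horizon=200, trials=2000):
    """Monte Carlo sketch of the two teaching strategies.

    theta : true state of the world, in {0, 1}
    p     : crossover probability of the nature -> teacher BSC
    q     : crossover probability of the teacher -> student BSC
    All numerical choices are illustrative, not from the paper.
    """
    def flip(bit, prob):
        return bit ^ (random.random() < prob)   # pass bit through a BSC

    errors = {"low": 0, "high": 0}
    for _ in range(trials):
        teacher_obs_sum = 0                      # running sum of teacher's +/-1 observations
        student_rx = {"low": 0, "high": 0}       # running sums of received +/-1 bits
        for _ in range(horizon):
            obs = flip(theta, p)                 # teacher's noisy observation
            teacher_obs_sum += 1 if obs else -1

            # Low effort: forward the raw observation over the student's BSC.
            student_rx["low"] += 1 if flip(obs, q) else -1

            # High effort: forward the teacher's current best (majority) estimate.
            est = 1 if teacher_obs_sum >= 0 else 0
            student_rx["high"] += 1 if flip(est, q) else -1

        for strat in ("low", "high"):
            decoded = 1 if student_rx[strat] >= 0 else 0   # student's majority decoder
            errors[strat] += decoded != theta

    return {s: e / trials for s, e in errors.items()}

if __name__ == "__main__":
    print(simulate())
```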
- Award ID(s): 1841190
- PAR ID: 10105686
- Date Published:
- Journal Name: Proceedings of the IEEE International Symposium on Information Theory
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Knowledge distillation aims at reducing model size without compromising much performance. Recent work has applied it to large vision-language (VL) Transformers and has shown that attention maps in the multi-head attention modules of vision-language Transformers contain extensive intra-modal and cross-modal co-reference relations to be distilled. The standard approach is to apply a one-to-one attention map distillation loss, i.e. the Teacher's first attention head instructs the Student's first head, the second teaches the second, and so forth, but this only works when the numbers of attention heads in the Teacher and Student are the same. To remove this constraint, we propose a new Attention Map Alignment Distillation (AMAD) method for Transformers with multi-head attention, which works for a Teacher and a Student with different numbers of attention heads. Specifically, we soft-align different heads in Teacher and Student attention maps using a cosine similarity weighting. The Teacher head contributes more to the Student heads for which it has a higher similarity weight. Each Teacher head contributes to all the Student heads by minimizing the divergence between the attention activation distributions for the soft-aligned heads. No head is left behind. This distillation approach operates like cross-attention. We experiment on distilling VL-T5 and BLIP, and apply AMAD loss on their T5, BERT, and ViT sub-modules. We show, under the vision-language setting, that AMAD outperforms conventional distillation methods on VQA-2.0, COCO captioning, and Multi30K translation datasets. We further show that even without VL pre-training, the distilled VL-T5 models outperform corresponding VL pre-trained VL-T5 models that are further fine-tuned by ground-truth signals, and that fine-tuning distillation can also compensate to some degree for the absence of VL pre-training for BLIP models. (A minimal sketch of this cosine-weighted soft alignment appears after this list.)
- Embedding models that encode semantic information into low-dimensional vector representations are useful in various machine learning tasks with limited training data. However, these models are typically too large to support inference on small edge devices, which motivates training of smaller yet comparably predictive student embedding models through knowledge distillation (KD). While knowledge distillation traditionally uses the teacher's original training dataset to train the student, we hypothesize that using a dataset similar to the student's target domain allows for better compression and training efficiency for that domain, at the cost of reduced generality across other (non-pertinent) domains. Hence, we introduce Specialized Embedding Approximation (SEA) to train a student featurizer to approximate the teacher's embedding manifold for a given target domain. We demonstrate the feasibility of SEA in the context of acoustic event classification for urban noise monitoring and show that leveraging a dataset related to this target domain not only improves the baseline performance of the original embedding model but also yields competitive students with more than an order of magnitude less storage and activation memory. We further investigate the impact of using random and informed sampling techniques for dimensionality reduction in SEA. (A minimal sketch of this domain-specialized distillation setup also appears after this list.)
- We analyze teachers' written feedback to students in an online learning environment, specifically a setting in which high school students in Uruguay are learning English as a foreign language. How complex should teachers' feedback be? Should it be adapted to each student's English proficiency level? How does teacher feedback affect the probability of engaging the student in a conversation? To explore these questions, we conducted both parametric (multilevel modeling) and non-parametric (bootstrapping) analyses of 27,627 messages exchanged between 35 teachers and 1074 students in 2017 and 2018. Our results suggest: (1) Teachers should adapt their feedback complexity to their students' English proficiency level. Students who receive feedback that is too complex or too basic for their level post 13-15% fewer comments than those who receive adapted feedback. (2) Feedback that includes a question is associated with a higher odds ratio (17.5-19) of engaging the student in conversation. (3) For students with low English proficiency, slow turnaround (feedback after 1 week) reduces this odds ratio by 0.7. These results have potential implications for online platforms offering foreign language learning services, in which it is crucial to give the best possible learning experience while judiciously allocating teachers' time.
- Wang, N.; Rebolledo-Mendez, G.; Matsuda, N.; Santos, O.C.; Dimitrova, V. (Eds.): Research indicates that teachers play an active and important role in classrooms with AI tutors. Yet, our scientific understanding of the way teacher practices around AI tutors mediate student learning is far from complete. In this paper, we investigate spatiotemporal factors of student-teacher interactions by analyzing student engagement and learning with an AI tutor ahead of teacher visits (defined as episodes of a teacher being in close physical proximity to a student) and immediately following teacher visits. To conduct such integrated, temporal analysis around the moments when teachers visit students, we collect fine-grained, time-synchronized data on teacher positions in the physical classroom and student interactions with the AI tutor. Our case study in a K12 math classroom with a veteran math teacher provides some indications of factors that might affect a teacher's decision to allocate their limited classroom time to their students and what effects these interactions have on students. For instance, teacher visits were associated more with students' in-the-moment behavioral indicators (e.g., idleness) than with a broader, static measure of student needs such as low prior knowledge. While teacher visits were often associated with positive changes in student behavior afterward (e.g., decreased idleness), there could be a potential mismatch between the students visited by the teacher and those who may have needed it more at that time (e.g., students who were disengaged for much longer). Overall, our findings indicate that teacher visits may yield immediate benefits for students but also that it is challenging for teachers to meet all needs, suggesting the need for better tool support.
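For the Attention Map Alignment Distillation (AMAD) abstract above, a minimal sketch of one plausible reading of the loss is given below: teacher and student attention maps are flattened per head, heads are soft-aligned with cosine-similarity weights, and each teacher head contributes a KL term to every student head in proportion to its weight. The tensor shapes, the softmax over similarities, and the per-row averaging are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def amad_loss(teacher_attn, student_attn, eps=1e-8):
    """Sketch of a cosine-weighted attention-map alignment loss.

    teacher_attn: (H_t, L, L) row-stochastic attention maps, one per teacher head
    student_attn: (H_s, L, L) row-stochastic attention maps, one per student head
    Both are assumed to come from the same input sequence of length L.
    """
    # Cosine similarity between every flattened teacher head and student head.
    sim = F.cosine_similarity(teacher_attn.flatten(1).unsqueeze(1),
                              student_attn.flatten(1).unsqueeze(0), dim=-1)  # (H_t, H_s)
    # Soft alignment: each teacher head distributes weight over all student heads.
    weights = F.softmax(sim, dim=1)                                          # rows sum to 1

    # KL divergence between each (teacher, student) pair of attention
    # distributions, averaged over query rows.
    t = teacher_attn.unsqueeze(1) + eps        # (H_t, 1, L, L)
    s = student_attn.unsqueeze(0) + eps        # (1, H_s, L, L)
    kl = (t * (t.log() - s.log())).sum(-1).mean(-1)   # (H_t, H_s)

    # Every teacher head contributes to every student head ("no head left behind").
    return (weights * kl).mean()
```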
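For the Specialized Embedding Approximation (SEA) abstract above, a minimal sketch of the basic training loop follows: a small student featurizer is regressed onto the frozen teacher's embeddings over an unlabeled dataset drawn from the target domain. The model classes, the L2 objective, and the data loader are illustrative assumptions; the paper's dimensionality reduction of the teacher embedding is omitted here.

```python
import torch
import torch.nn as nn

def train_sea_student(teacher: nn.Module, student: nn.Module,
                      target_domain_loader, epochs=10, lr=1e-3, device="cpu"):
    """Sketch of SEA-style domain-specialized embedding distillation.

    The student is trained to reproduce the (frozen) teacher's embeddings
    on data from the student's target domain only, trading generality on
    other domains for a smaller, domain-specialized featurizer.
    """
    teacher.to(device).eval()
    student.to(device).train()
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    mse = nn.MSELoss()

    for _ in range(epochs):
        for batch in target_domain_loader:        # unlabeled target-domain inputs
            batch = batch.to(device)
            with torch.no_grad():
                target_emb = teacher(batch)        # teacher embedding (frozen)
            opt.zero_grad()
            loss = mse(student(batch), target_emb) # approximate the teacher's manifold
            loss.backward()
            opt.step()
    return student
```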