<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Simulated Language Learning from Communicative Goals and Linguistic Input</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10356961</idno>
					<idno type="doi"></idno>
					<title level='j'>Proceedings of the Annual Meeting of the Cognitive Science Society</title>
<biblScope unit="volume">44</biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Hao Zhu</author><author>Yonatan Bisk</author><author>Graham Neubig</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Children do not learn language from passively analyzing correlations between language and observations, but from interaction with caregivers or peers. The non-nativist approach claims that the main driver of language learning should be to achieve communicative goals. Imitation, on the other hand, is another natural desire that many argue influences language learning. However, there are still gaps in the research on what roles communicative goals and imitating linguistic input play in language acquisition, due to the difficulty of performing comprehensive experiments with human learners. In this paper, we propose a computational framework using simulated experiments that allows us to compare the roles of the two drivers. Specifically, we simulate a two-way communication game between a speaker, corresponding to a language learner, and a listener, corresponding to a caregiver or teacher. The speaker's communicative goals are modeled as rewards for successful completion of a referential game, and imitation is performed by mimicking feedback from the listener. The listener adaptively chooses to give feedback and makes choices based on the speaker's utterances. With empirical results on naturalistic visual and language data, we find that communicative goals play an important role in driving language learning, whereas imitation accelerates the learning process. We also find that (1) models trained with communicative goals tend to use minimal vocabulary and utterances and overextend them to concepts outside the original word meanings; (2) the strategy with which the listener provides feedback also influences the learning results and speed. Code and data for replicating the experiments are available (https://bit.ly/interactgym) to spur future research on models for computational studies of language learning.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Introduction</head><p>Children learn a striking amount of language in their first few years of life: thousands of sounds, words, grammatical categories, and how to combine them into meaningful utterances. Unlike most recent machine learning models, which learn language from static existing text or images (?, ?), very young children do not learn language purely from observing visual-linguistic co-occurrences, e.g. watching television (?, ?, ?, ?, ?), but rather from interacting with their parents in conversations about family members, body parts, animals, foods, and clothing, directed by the interest of the child (?, ?, ?). The challenge is then to understand how this learning process works and what internal and external factors influence it.</p><p>Figure <ref type="figure">1</ref>: A child and adult playing shared-goal bidirectional communication games. The child learns from both communicative goals and the parent's feedback as linguistic input. On the left, the child uses incorrect word order to describe the shape in the middle, but the adult understands and gives corrective input. In contrast, on the right, the child uses "orange square" to describe the shape on the right, but the adult misinterprets and provides feedback for the shape on the left.</p><p>The most common and straightforward view is that children primarily use language as a tool to communicate (?, ?, ?). Just like learning to use other tools, one becomes more proficient through trial and error. Getting what they ask for, such as asking for "applesauce" and receiving it rather than another object, reinforces the connection between entities and names. Conversely, failing to achieve a goal weakens the connection or even provides negative reinforcement. Parents and adult members of the community also share intentions with children and respond to children's requests (?, ?), and thus children learn language from the use of language (?, ?). 
From this view, language is learned to convey meaning and is reinforced by communicative goals (CG), providing pressure to learn at least semantics, and perhaps also syntax to allow for the disambiguation of more difficult concepts (?, ?).</p><p>Another way children learn language from their parents is through imitation, which has been studied for centuries since ? (1787). Although parents do not always explicitly point out grammatical mistakes in children's language, they offer corrective linguistic input (LI) to children based on their understanding of the meaning of the children's utterances (?, ?, ?). As a part of social learning, children imitate the feedback from their parents and learn the correlation between the feedback and the meaning they want to express. In this way, the fluency of a child's speech improves; however, since parents may not interpret the request correctly, the meaning of the feedback may not align with the child's intent.</p><p>In this paper, we simulate this learning process in the context of language games, which have been the proving ground for various linguistic theories since their conception by ? (?).</p><p>Drawing an analogy to the child-parent conversation scenario, we model the child as a speaker agent, which generates utterances based on the target objects provided by the environment, and the parent as a listener agent, which responds to the child's utterances by choosing objects and/or giving corrective feedback. Based on this setup, we study the following research questions: Can we build a computational speaker model that learns to speak from a skilled language listener in the communication game setting? What roles do communicative goals and linguistic input play in this process?</p><p>Our hypothesis is that communicative goals are the main driver of language learning in this formulation (Communication Games section), while linguistic input accelerates the learning process through syntax-level supervision. 
To evaluate this, we use neural networks combined with heuristic rules to model the speakers and listeners (Model section). The learning process is implemented as a balance of reinforcement learning for CG and maximum likelihood estimation for LI (Learning Process section). We perform empirical experiments on communication games with MSCOCO (?, ?) images and captions (Experiment section). We find that CG contributes most to game accuracy and also helps in learning syntax, as reflected by a fluency metric. We also find that different listener strategies contribute to the success of language learning. Interestingly, we find that overextension is very common in the resulting language of CG-driven models, as it is in early child speech (?, ?), whereas the same phenomenon rarely appears in models trained only with LI. Our results may provide evidence for usage-based language acquisition theory, the belief that language is acquired in the service of communicative functions (?, ?).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Communication Games</head><p>Following previous work on communicative agents learning to form communication pacts in referential games, we use asymmetric speaker-listener games (?, ?) with additional feedback channels.</p><p>A general goal-oriented language game provides an environment where the participants use language to communicate with each other to achieve the given goal. We consider the most basic setting of a collaborative referential game.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Procedure</head><p>In each game, a goal image x, visible only to the speaker, is sampled from the pool of images I. N distractor images are randomly sampled from a distribution D_x^N. The target image and distractors are randomly shuffled before being shown to the listener, to prevent any bias from the order. We denote the shuffled sequence of images as C̃ and the index of the goal as i_g, i.e. C̃_{i_g} = x and {C̃_i} for i ≠ i_g is a permutation of C.</p><p>The speaker (modeling the child) takes the first turn in each game by describing the image in English. The listener (modeling the parent) then takes one of two actions based on the utterance u: (1) choose an image î, or (2) perform no action, î = noop (e.g. when it does not understand the utterance with enough confidence). Additionally, at the end of each game, the listener can choose to provide linguistic supervision to the speaker. The speaker then receives a reward based on the listener's action.</p></div>
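The game setup above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' released code; the names setup_game, pool, and n_distractors are introduced here, and the distractors are drawn uniformly (the easy setting).

```python
import random

def setup_game(pool, n_distractors, rng=random):
    """Sample a goal image and distractors, then shuffle for the listener.

    pool is a list of image identifiers. The goal's position is hidden
    from the listener by shuffling; goal_index plays the role of i_g.
    """
    goal = rng.choice(pool)
    distractors = rng.sample([img for img in pool if img != goal], n_distractors)
    candidates = distractors + [goal]
    rng.shuffle(candidates)              # prevent any bias from the order
    goal_index = candidates.index(goal)  # i_g: only the environment knows this
    return goal, candidates, goal_index
```

With a seeded generator the same game can be replayed deterministically, which is convenient for simulated experiments.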
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Reward</head><p>To model the communicative goals, we give a positive reward when the game is successful and a negative reward if the listener chooses the wrong image. In addition, we encourage the speaker to produce unambiguous utterances by penalizing the noop action with a small negative reward w_noop, where -1 &lt; w_noop &lt; 0.<ref type="foot">foot_0</ref></p></div>
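The reward scheme can be made concrete with a small sketch (the +1/-1 magnitudes are an assumption; the paper only states the signs and the noop penalty):

```python
def game_reward(choice, goal_index, w_noop=-0.5):
    """Speaker's reward after one game.

    Positive for a correct choice, negative for a wrong one, and a small
    negative reward w_noop (strictly between -1 and 0) when the listener
    abstains (noop), here encoded as choice=None.
    """
    if choice is None:       # listener performed no action
        return w_noop
    return 1.0 if choice == goal_index else -1.0
```

Because w_noop sits strictly between the failure and success rewards, abstaining is better than guessing wrong but worse than communicating successfully.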
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Players</head><p>As mentioned before, the participants consist of a speaker and a listener sending and receiving natural language messages. The speaker is a message-producing model defined by the vocabulary Σ; the space of observations I; and a model f : I → Σ*. The listener is an instruction-follower defined by the same vocabulary Σ as the speaker; the observation space I^(N+1); the space of actions [N + 1]; and a model g : Σ* × I^(N+1) → [N + 1] ∪ {noop}.</p><p>Note that the listener cannot directly observe the goal, so the speaker needs to use its utterances to inform the listener about the goal of each game.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Models: Speaker</head><p>The speaker is an image captioning model (?, ?, ?), which first encodes the goal image x with a pretrained ResNet (?, ?) and then generates the utterance u = (u_1, ..., u_M) with an LSTM neural network (?, ?) in an auto-regressive fashion, where w_{u_i} ∈ R^{d_w} is the word embedding of word u_i.</p></div>
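The auto-regressive generation loop can be sketched generically. This is an assumption-laden illustration: greedy decoding stands in for whatever decoding rule the paper uses, and step_fn abstracts the ResNet-conditioned LSTM that actually produces next-token scores.

```python
def generate(step_fn, bos, eos, max_len=20):
    """Greedy auto-regressive decoding of an utterance u = (u_1, ..., u_M).

    step_fn(prefix) returns a dict mapping each candidate next token to a
    score; in the paper these scores come from an LSTM conditioned on the
    ResNet encoding of the goal image.
    """
    utterance = [bos]
    for _ in range(max_len):
        scores = step_fn(utterance)
        token = max(scores, key=scores.get)  # greedy choice of next word
        if token == eos:
            break
        utterance.append(token)
    return utterance[1:]                     # drop the BOS marker
```

The same loop supports sampling-based decoding by replacing the argmax with a draw from the softmax over scores.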
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Listener</head><p>A listener consists of two parts: a neural network-based ranker and a rule-based controller that decides whether to act and whether to give language feedback to the speaker. Given the utterance u, the listener ranks the images C̃ by the dot product between the LSTM embedding of u and the image embeddings encoded by the same pretrained ResNet as the speaker; i.e., for each image C̃_i, the ranking score is the dot product of the two embeddings, where w_u is a shorthand for the word embeddings of all words in the sentence. Note that we use the same visual network for speakers and listeners, ignoring differences between the visual perception of individuals, but the parameters of the language networks are not shared. We discuss how these parameters are acquired later in this section.</p><p>In human conversations, parents use a variety of techniques when giving feedback, including asking clarification questions and providing exemplar utterances. However, incorporating such open-ended feedback presents a major challenge for the computational modeling of speakers. In this paper, we limit the feedback to full correct utterances for the goal image, which may be redundant or ineffective in many real-world cases, but is general enough that most other kinds of feedback can be converted to it. We consider a listener controlled by both neural network rankers (as described above) and heuristic rules (Alg. ??), which makes a choice when its confidence is high enough and gives feedback to the speaker if it judges the utterance to be poorly articulated. Following (?, ?), we use the probability of the prediction as the indicator of confidence. Alg. ?? has two parameters θ_1 and θ_2, which control making choices and giving feedback, respectively. We will show the dramatic effect of these two parameters on language learning in the experiments. 
The gold utterances U* for the images are drawn from the captions provided in the MSCOCO dataset (?, ?).</p><p>(If the confidence is too low, the listener will not make a choice or give feedback.)</p><p>Listener Pretraining: To model learning from a proficient language user, we need a sufficiently skilled listener. Apart from θ_1 and θ_2 and the parameters of the ResNet, the parameters of the language network need pretraining, which we perform with mini-batch stochastic gradient descent.</p><p>Learning Process</p></div>
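The rule-based controller described above can be sketched as follows. This is a sketch under stated assumptions: the softmax-confidence rule and the direction of the feedback test stand in for the unavailable Alg. ??, while θ_1 and θ_2 play the roles described in the text.

```python
import math

def listener_act(scores, theta1=0.4, theta2=0.9):
    """Heuristic listener controller (assumed form of Alg. ??).

    scores are the ranker's dot-product scores over candidate images.
    The listener commits to its top-ranked image only when the softmax
    confidence reaches theta1 (otherwise noop), and gives corrective
    feedback when confidence stays below theta2.
    """
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    confident = probs[best] >= theta1
    choice = best if confident else None        # None models the noop action
    give_feedback = theta2 > probs[best]        # poorly understood: correct it
    return choice, give_feedback
```

Raising θ_1 makes the listener abstain more often; raising θ_2 makes it give feedback more often, which is consistent with the sensitivity to these thresholds reported in the experiments.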
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Objectives</head><p>Communicative goals and mimicking linguistic input can be modeled as two distinct learning objectives for the speaker network. Similar to children receiving rewards from the environment when their requests are fulfilled, and penalties otherwise, we use the expected game reward as the objective for CG. Note that the action space, the space of utterances, is discrete and non-differentiable, so we employ reinforcement learning to optimize the speaker policy π. Later in this section we give a brief introduction to PPO (?, ?), the RL method we use in the experiments. Children's language models are reinforced if they recover the parent's corrective linguistic input. We thus model this objective as a maximum likelihood objective, which measures the likelihood of the parent's language under the child's model. This part is optimized with stochastic gradient descent.</p><p>To study the joint effect of both objectives, we adopt a multitask learning objective J = λ J_CG + J_LI, where λ is the coefficient balancing the two objectives and correlates with the importance of the CG objective.</p></div>
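The two objectives and their combination can be illustrated with a minimal sketch. These are assumed surrogate forms: the paper optimizes the CG term with PPO rather than the plain score-function surrogate below, and the function names are introduced here for illustration.

```python
def li_loss(feedback_token_logprobs):
    """LI objective: negative log-likelihood of the parent's corrective
    utterance under the child/speaker model (imitation by MLE)."""
    return -sum(feedback_token_logprobs)

def cg_loss(utterance_logprob, reward):
    """CG objective via a REINFORCE-style surrogate: minimizing
    -reward * log pi(u|x) ascends the expected game reward."""
    return -reward * utterance_logprob

def multitask_loss(cg, li, lam=0.01):
    """Combined objective lam * J_CG + J_LI; lam balances the two
    drivers (lam = 0.01 is the best value reported for CG+LI)."""
    return lam * cg + li
```

Setting lam to zero recovers the LI-only model, and dropping the LI term recovers the CG-only model studied in the experiments.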
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Optimizing the CG Objective</head><p>Reinforcement learning methods are often employed to optimize non-differentiable objectives. In this subsection, we use the shorthands state s = {u_i} for i = 1, ..., t-1 and action a = u_t at time step t of generating the utterance. The simplest RL method is the policy gradient (PG).</p><p>(Figure: models driven by communicative goals converge to a higher average reward than models with only linguistic input.)</p><p>As an on-policy method, PG optimizes its policy on the rollout data only once, which is inefficient. We can use importance sampling to reuse the data, weighting each sample by r_{s,a}^{π_old}(π) = π(a|s) / π_old(a|s), the likelihood ratio between the new policy π and the old policy π_old used to sample the data; A_{s,a}^{π_old} is the advantage function of the old policy π_old. However, this method does not guarantee policy improvement. Therefore, TRPO (?, ?) and PPO (?, ?) were introduced, both based on the idea of keeping the new policy within a close distance of the old one. PPO restricts the policy with a clipping function (?, ?),<ref type="foot">foot_1</ref> F_CLIP, whose interval (1 - ε, 1 + ε) is called the clipping range, with parameter 0 &lt; ε &lt; 1. Note that in theory most RL methods can be employed to optimize the CG objective; we use PPO here as a trade-off between simplicity and relatively good performance. We discuss other RL methods in the related work section.</p></div>
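The clipped surrogate can be written compactly for a single (state, action) pair; a minimal sketch, with ratio and advantage as defined above and eps the clipping parameter:

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate for one (state, action) pair.

    ratio = pi(a|s) / pi_old(a|s); clipping confines the ratio to
    (1 - eps, 1 + eps) so a single update cannot exploit very large
    likelihood ratios and drift far from pi_old.
    """
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    # pessimistic minimum of the unclipped and clipped surrogates
    return min(ratio * advantage, clipped * advantage)
```

Averaging this quantity over rollout samples and ascending its gradient gives the PPO update used for the CG objective.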
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Experiment: Game Setup</head><p>We use the conventional split of MSCOCO (?, ?). All of our neural networks are trained or pretrained on the training set, and all the results below are computed on the test set.</p><p>In each game, after sampling the goal image x, the distractors are sampled either from a uniform distribution, C ~ U(I)^N (easy and default setting), or from a distribution skewed toward the goal, C ~ D_x^N, where D_x(y) ∝ exp(-||x - y||_2) and x and y are embeddings from the pretrained ResNet (?, ?) (hard setting).</p></div>
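The hard-setting distractor distribution can be sketched as follows. The exponent's exact form (sign convention, temperature) is an assumption here; the intent is that images whose embeddings lie close to the goal's receive more probability mass.

```python
import math
import random

def sample_hard_distractors(goal_emb, pool_embs, n, rng=random):
    """Sample n distractor indices with weight exp(-||x - y||_2),
    i.e. a sketch of D_x skewed toward the goal embedding."""
    def l2(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    weights = [math.exp(-l2(goal_emb, e)) for e in pool_embs]
    return rng.choices(range(len(pool_embs)), weights=weights, k=n)
```

Replacing the weights with a constant recovers the uniform easy setting.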
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Metrics</head><p>We use two metrics in the following experiments: (1) accuracy, the frequency with which the listener chooses the goal among the images; and (2) fluency score, which reflects the grammatical quality of a sentence without considering semantic relatedness, following (?, ?). We use GPT-2 large (?, ?) as p_M and a unigram model as p_U, both fine-tuned or trained on MSCOCO.</p></div>
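The fluency formula itself is not reproduced above; a common instantiation of this kind of LM-vs-unigram metric is the syntactic log-odds ratio (SLOR), sketched here purely as an assumption about the intended form:

```python
def fluency_score(lm_logprob, unigram_logprob, length):
    """SLOR-style fluency sketch (assumed form, not the paper's exact
    formula): the language-model log-probability log p_M(u) is offset
    by the unigram baseline log p_U(u) and normalized by sentence
    length, so rare but grammatical words are not unfairly penalized.
    """
    return (lm_logprob - unigram_logprob) / length
```

Higher scores indicate sentences the full LM likes much more than a bag-of-words model would predict, i.e. better-formed word order.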
<div xmlns="http://www.tei-c.org/ns/1.0"><head>What Drives Accuracy?</head><p>The first question we investigate is which signal is more important for learning semantically correct descriptions of the target image. We use the listener's accuracy as a proxy for the semantic quality of the generated descriptions. As shown in Fig. ??, the accuracy of the LI-only model tops out at 60%, while models with the CG objective achieve significantly higher accuracy. However, the CG-only model needs about 400k warm-up steps before improving dramatically to performance similar to the combined model's. With the help of LI, the CG+LI model (where λ = 0.01 is the best hyperparameter, used in all CG+LI models) not only improves faster at the start of training but also achieves higher accuracy than the CG-only model. From this result, we can see that CG is the main driver for conveying accurate information: the communicative-goal signal steers the model to output pragmatic descriptions that help the listener choose the correct target. In the hard setting, the CG+LI model and the CG-only model both achieve 74% accuracy while the LI-only model reaches only 59%, a trend similar to the easy setting, confirming that our conclusion holds even when more detailed descriptions are needed for the game.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>What Factors Help Fluency Learning?</head><p>The second question is which signal helps the speaker learn to produce fluent language. Fig. ?? shows that LI is the main driver of learning to speak more fluently. The likely reason for the decreasing fluency of the CG-only model is that its vocabulary shrinks and concentrates on a few words instead of covering all the frequent words in MSCOCO. In contrast, learning from linguistic input helps the model fit the natural distribution of words. Later in this section, we discuss the overextension behavior of CG-driven models. The improvement brought by LI may be the reason the CG+LI model does not need a warm-up in Stage I in Fig. ??.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Does the Listener's Strategy Affect Learning?</head><p>In all previous experiments, we presented results with θ_1 = 0.4 and θ_2 = 0.9 as the thresholds for the listener strategy. In Fig. ?? we show the influence of these two parameters on the model's final accuracy. We find that performance is very sensitive to the listener strategy: a change as small as 0.05 makes the difference between the best result and failure.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Overextension Phenomenon</head><p>Beyond the experiments on the influence of CG and LI on language learning, a significant difference between CG-driven models and the LI-only model is overextension. To explore this, we randomly choose several nouns in the empirical vocabulary (words that occur in utterances) of the CG+LI model. Most words exhibit intuitive cases of overextension. Some are based on color similarity, e.g. court; some on shape similarity, e.g. horse, giraffe, kite; and others on texture similarity, e.g. couch, pizza. We hypothesize a few possible reasons for overextension in the model: (1) shared visual perception: images that look similar to the speaker also look similar to the listener; (2) lack of linguistic input: with a limited vocabulary, the RL model tests the acceptability of similar concepts; (3) the generality of the listener: the listener can understand the utterances, just as we can interpret these errors. Although the models make these errors, this is not necessarily a bad thing: overextension accounts for 40% of the words used by children aged 1;6 to 2;6 (?, ?). This suggests our formulation may be a good model of child language acquisition.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Related Work: Emergent Communication</head><p>Without natural language annotations, the pressure on the speaker and listener to communicate successfully enables language emergence. (?, ?) first used identical recurrent neural networks for the speaker and the listener to conduct emergent communication in a referential game. Following their lead, (?, ?) study how emergent languages are grounded in the input images, and (?, ?) study multi-turn communication via negotiation. (?, ?, ?) study the compositionality and systematicity of emergent languages. (?, ?) also explore training the speaker with both reinforcement learning and MLE in referential games. To build a model that can communicate with humans, they start with a pretrained language model and use ground-truth game data in their experiments, whereas we start from randomly initialized speakers and do not allow the listener access to the goal, in order to study language learning from scratch in communication games.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Reinforcement Learning for Language Generation</head><p>Various reinforcement learning methods have been applied to language generation. On-policy methods include REINFORCE (?, ?, ?), actor-critic (?, ?, ?), and policy gradient (?, ?, ?, ?); off-policy methods include importance-weighted policy gradient (?, ?, ?, ?), Q-learning (?, ?), and soft Q-learning (?, ?). In this paper, we use the most commonly used on-policy method, PPO, to optimize the CG objective. Experimenting with other methods is an interesting future direction.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Conclusion and Future Directions</head><p>In this paper, we propose a computational framework for language learning through communication games with both communicative-goal and linguistic-input objectives. We investigate the roles of CG and LI in language learning in terms of conveying meaning and learning syntax. We also find that the listener's strategy is important for language learning. This sheds light on child language learning: language usage may be the main driver, but without linguistic input, language may be slow to acquire. Additionally, the adults' strategy in responding to children's requests is also important. Future work could further test this intuition by teaching human subjects a (new) language with the best listener setting.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0"><p>We use a tighter lower bound for w_noop, so that a random choice is worse than no action: w_noop &gt; 1/N - 1.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1"><p>There are two variants of PPO: we refer to the one with clipping function as PPO, and refer to the one with adaptive KL penalty coefficient as PPO-penalty (?, ?).</p></note>
		</body>
		</text>
</TEI>
