<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Progressive Sentiment Analysis for Code-Switched Text Data</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10403514</idno>
					<idno type="doi"></idno>
					<title level='j'>Findings of the Association for Computational Linguistics: EMNLP 2022</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Sudhanshu Ranjan</author><author>Dheeraj Mekala</author><author>Jingbo Shang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Multilingual transformer language models have recently attracted much attention from researchers and are used in cross-lingual transfer learning for many NLP tasks such as text classification and named entity recognition. However, similar methods for transfer learning from monolingual text to code-switched text have not been extensively explored, mainly due to the following challenges: (1) a code-switched corpus, unlike a monolingual corpus, consists of more than one language, so existing methods can’t be applied efficiently; (2) a code-switched corpus is usually made of resource-rich and low-resource languages, and when multilingual pre-trained language models are used, the final model might be biased towards the resource-rich language. In this paper, we focus on code-switched sentiment analysis where we have a labelled resource-rich language dataset and unlabelled code-switched data. We propose a framework that takes the distinction between resource-rich and low-resource language into account. Instead of training on the entire code-switched corpus at once, we create buckets based on the fraction of words in the resource-rich language and progressively train from resource-rich language dominated samples to low-resource language dominated samples. Extensive experiments across multiple language pairs demonstrate that progressive training helps low-resource language dominated samples.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Code-switching is the phenomenon where a speaker alternates between two or more languages in a conversation. The lack of annotated data and the diverse combinations of languages in which this phenomenon can be observed make it difficult to progress in NLP tasks on code-switched data. Moreover, the prevalence of different languages varies, making annotation expensive and difficult.</p><p>* Jingbo Shang is the corresponding author.</p><p>Intuitively, multilingual language models like mBERT <ref type="bibr">(Devlin et al., 2019)</ref> can be used for code-switched text since a single model learns multilingual representations. Although the idea seems straightforward, there are multiple issues. Firstly, mBERT performs differently on different languages depending on their script, prevalence and predominance. mBERT performs well in medium-resource to high-resource languages, but is outperformed by non-contextual subword embeddings in a low-resource setting <ref type="bibr">(Heinzerling and Strube, 2019)</ref>. Moreover, the performance is highly dependent on the script <ref type="bibr">Pires et al. (2019)</ref>. Secondly, pre-trained language models have only seen monolingual sentences during the unsupervised pretraining; however, code-switched text contains phrases from both languages in a single sentence, as shown in Figure <ref type="figure">1</ref>, making it an entirely new scenario for the language models. Thirdly, languages differ in the amount of unsupervised corpus that is used during pretraining. For example, mBERT is trained on the Wikipedia corpus: English has approximately 6.3 million articles, whereas Hindi and Tamil have only approximately 140K articles each. This may lead to under-representation of low-resource languages in the final model. 
Further, English has been extensively studied by the NLP community over the years, making supervised data and tools more easily accessible. Thus, the model can easily learn patterns present in the resource-rich language segments, which motivates us to attempt transfer learning from English supervised datasets to code-switched datasets.</p><p>Figure 2: The source labelled dataset $S$ in the resource-rich language should be easily available. Using $S$, a classifier, say $m_{pt}$, is trained. The unlabelled code-switched dataset $T$ is divided into buckets using the fraction of English words as the metric. The leftmost bucket $B_1$ has samples dominated by the resource-rich language, and as we move towards the right, the samples in the buckets are dominated by the low-resource language. $m_{pt}$ is used to generate pseudo-labels for the unlabelled texts in bucket $B_1$. We use the texts from $B_1$ along with their pseudo-labels and the dataset $S$ to train a second text classifier $m_1$. Then, $m_1$ is used to get the pseudo-labels for the texts in bucket $B_2$. We keep repeating this until we obtain the final model, which is used for predictions.</p><p>The main idea behind our paper can be summarised as follows: when doing zero-shot transfer learning from a resource-rich language (LangA) to a code-switched language (say LangA-LangB, where LangB is a low-resource language compared to LangA), the model is more likely to be wrong when the instances are dominated by LangB. Thus, instead of self-training on the entire corpus at once, we propose to progressively move from LangA-dominated instances to LangB-dominated instances while transfer learning. As illustrated in Figure <ref type="figure">2</ref>, the model trained on the annotated resource-rich language dataset is used to generate pseudo-labels for code-switched data. 
Progressive training uses the resource-rich language dataset and (unlabelled) resource-rich language dominated code-switched samples together to generate better quality pseudo-labels for (unlabelled) low-resource language dominated code-switched samples. Lastly, the annotated resource-rich language dataset and the pseudo-labelled code-switched data are used together for training, which increases the performance of the final model.</p><p>Our key contributions are summarized as follows:</p><p>&#8226; We propose a simple, novel training strategy that demonstrates superior performance. Since our hypothesis is based on the pretraining phase of the multilingual language models, it can be combined with any transfer learning method.</p><p>&#8226; We conduct experiments across multiple language-pair datasets, showing the efficiency of our proposed method.</p><p>&#8226; We create probing experiments that verify our hypothesis.</p><p>Reproducibility. Our code is publicly available on GitHub<ref type="foot">foot_0</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related work</head><p>Multiple tasks like Language Identification, Named Entity Recognition, Part-of-Speech tagging, Sentiment Analysis, Question Answering and NLI have been studied in the code-switched setting. For sentiment analysis, <ref type="bibr">Vilares et al. (2015)</ref> showed that multilingual approaches can outperform pipelines of monolingual models on code-switched data. <ref type="bibr">Lal et al. (2019)</ref> use a CNN-based network for the same task. <ref type="bibr">Winata et al. (2019)</ref> use hierarchical meta-embeddings to combine multilingual word, character and sub-word embeddings for the NER task. <ref type="bibr">Aguilar and Solorio (2020)</ref> augment language models with morphological clues and use them for transfer learning from English to code-switched data with labels. <ref type="bibr">Samanta et al. (2019)</ref> use a translation API to create synthetic code-switched text from English datasets and use it for transfer learning from English to code-switched text without labels in the code-switched case. <ref type="bibr">Qin et al. (2020)</ref> use synthetically generated code-switched data to enhance zero-shot cross-lingual transfer learning. Recently, <ref type="bibr">Khanuja et al. (2020)</ref> released the GLUECoS benchmark to study the performance of multiple models for code-switched tasks across two language pairs, En-Es and En-Hi. The benchmark contains 6 tasks and 11 datasets, and has 8 models for every task. Multilingual transformers fine-tuned with the masked-language-model objective on code-switched data can outperform generic multilingual transformers. Results from <ref type="bibr">Khanuja et al. (2020)</ref> show that sentiment analysis, question answering and NLI are significantly harder than tasks like NER, POS and LID. 
In this work, we focus on the sentiment analysis task in the absence of labeled code-switched data using multilingual transformer models, while taking into account the distinction between resource-rich and low-resource languages. Although our work seems related to curriculum learning, it is distinct from the existing work. Most of the work in curriculum learning is in the supervised setting <ref type="bibr">(Zhang et al., 2019;</ref><ref type="bibr">Xu et al., 2020)</ref>, whereas our work focuses on the zero-shot setting, where no code-switched sample is annotated. Note that this is also different from the semi-supervised setting because of distribution shifts between the labeled resource-rich language data and the target unlabeled code-switched data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Preliminaries</head><p>Our problem is a sentiment analysis problem where we have a labelled resource-rich language dataset and unlabelled code-switched data. From here onwards, we refer to the labelled resource-rich language dataset as the source dataset and the unlabelled code-switched dataset as the target dataset. Since code-switching often occurs in language pairs that include English, we refer to English as the resource-rich language. The source dataset, say $S$, is in English and has the text-label pairs $\{(x_{s_1}, y_{s_1}), (x_{s_2}, y_{s_2}), \ldots, (x_{s_m}, y_{s_m})\}$, and the target dataset, say $T$, is in code-switched form and has texts $\{x_{cs_1}, x_{cs_2}, \ldots, x_{cs_n}\}$, where $m$ is significantly greater than $n$. The objective is to learn a sentiment classifier that detects the sentiment of code-switched data by leveraging the labelled source dataset and the unlabelled target dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Methodology</head><p>Our methodology can be broken down into three main steps: (1) Source dataset pretraining, which uses the resource-rich language labelled source dataset $S$ for training a text classifier. This classifier is used to generate pseudo-labels for the target dataset $T$. (2) Bucket creation, which divides the unlabelled data $T$ into buckets based on the fraction of words from the resource-rich language. Some buckets contain samples that are more resource-rich language dominated, while others contain samples dominated by the low-resource language. (3) Progressive training, where we initially train using $S$ and the samples dominated by the resource-rich language and gradually include the low-resource language dominated instances while training. For the rest of the paper, pretraining refers to step 1 and training refers to the training in step 3. We also use class ratio based instance selection to prevent the model from becoming biased towards the majority label.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Source Dataset Pretraining</head><p>Resource-rich languages have abundant resources, including labeled data. Intuitively, sentences in $T$ that are similar to positive sentiment sentences in $S$ would also have positive sentiment (and the same holds for negative sentiment). Therefore, we can treat the predictions made on $T$ by a multilingual model trained on $S$ as their respective pseudo-labels. This assigns noisy pseudo-labels to the unlabeled dataset $T$. The source dataset pretraining step is a text classification task. Let the model obtained after pretraining on dataset $S$ be called $m_{pt}$. This model is used to generate the initial pseudo-labels and to select the instances to be used for progressive training.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Bucket Creation</head><p>Since progressive training aims to gradually progress from training on resource-rich language dominated samples to low-resource language dominated samples, we divide the dataset $T$ into buckets based on the fraction of words in the resource-rich language. This creates some buckets with more resource-rich language dominated instances and other buckets with more low-resource language dominated instances. In Figure <ref type="figure">2</ref>, we can observe that the instances in the leftmost bucket are dominated by English, whereas the instances in the rightmost bucket are dominated by Hindi. More specifically, we define:</p><formula notation="tex">f_{eng}(x_i) = \frac{n_{eng}(x_i)}{n_{words}(x_i)}</formula><p>where $n_{eng}(x_i)$ and $n_{words}(x_i)$ denote the number of English words and the total number of words in the text $x_i$, respectively. Then, we sort the texts in dataset $T$ in decreasing order of $f_{eng}(x_i)$ and create $k$ buckets $(B_1, \ldots, B_k)$ with an equal number of texts in each bucket. Thus, bucket $B_1$ contains the instances most dominated by English, and as we move towards buckets with higher index, the instances become dominated by the low-resource language.</p></div>
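<div xmlns="http://www.tei-c.org/ns/1.0"><p>The bucket creation step can be illustrated with a minimal sketch. It assumes whitespace tokenisation and a caller-supplied word-level English detector; the names english_fraction, make_buckets, TOY_VOCAB and toy_is_english are hypothetical and are not taken from the released code. The handling of empty texts and of bucket sizes when the corpus size is not divisible by k is likewise an assumption.</p><ab><![CDATA[
```python
from typing import Callable, List, Sequence


def english_fraction(text: str, is_english: Callable[[str], bool]) -> float:
    """Return f_eng(x): English words in x divided by all words in x."""
    words = text.split()
    if not words:
        return 0.0  # assumption: an empty text has no English words
    return sum(1 for w in words if is_english(w)) / len(words)


def make_buckets(texts: Sequence[str],
                 is_english: Callable[[str], bool],
                 k: int) -> List[List[str]]:
    """Sort texts by decreasing f_eng and cut them into k near-equal buckets."""
    ranked = sorted(texts, key=lambda t: english_fraction(t, is_english), reverse=True)
    n = len(ranked)
    # Bucket j holds ranked[j*n//k : (j+1)*n//k], so sizes differ by at most one.
    return [ranked[j * n // k:(j + 1) * n // k] for j in range(k)]


# Toy usage with a hypothetical English vocabulary (illustration only).
TOY_VOCAB = {"this", "movie", "is", "really", "good", "the"}


def toy_is_english(word: str) -> bool:
    return word.lower() in TOY_VOCAB


corpus = [
    "this movie is really good",      # f_eng = 1.0
    "yeh movie bahut achhi hai",      # f_eng = 0.2
    "the film bahut good hai",        # f_eng = 0.4
    "mujhe yeh bilkul pasand nahi",   # f_eng = 0.0
]
b1, b2 = make_buckets(corpus, toy_is_english, k=2)
```
]]></ab><p>In this toy run, the first bucket holds the two most English-dominated texts and the second bucket holds the two texts dominated by the other language.</p></div>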
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Progressive Training</head><p>As the model $m_{pt}$ is obtained by fine-tuning on a resource-rich language dataset $S$, it is more likely to perform better on resource-rich language dominated instances. Therefore, we choose to start progressive training from resource-rich language dominated samples. However, the pseudo-labels generated for dataset $T$ are noisy, so we sample high-confidence resource-rich language dominated samples to obtain better quality pseudo-labels for the rest of the instances.</p><p>Firstly, we use $m_{pt}$ to obtain all the high-confidence samples from dataset $T$ to be used for progressive training, together with their respective pseudo-labels. Among the samples to be used for progressive training, we select the samples from $B_1$ and use them along with $S$ to train a second classifier, which is then used to generate pseudo-labels for the rest of the samples to be used for progressive training. Then we select samples from $B_2$ and use them along with the samples from previous iterations (i.e. the samples selected from $B_1$, and $S$) to get a third classifier. We continue this process until we reach the last bucket and use the model obtained at the last iteration to make the final predictions.</p><p>More formally, we use $m_{pt}$ to select a fixed fraction of the most confident samples from the dataset $T$, using the predicted probability as a proxy for confidence. Let $X_{st}$ denote this set of selected samples, i.e. the samples with the highest probability of the majority (predicted) class, to be used for progressive training. Let $X_{st}^{i} = X_{st} \cap B_i$, where $X_{st}^{i}$ is the subset of samples from bucket $B_i$ that is used for progressive training. To train across $k$ buckets, we use $k$ iterations. Let $m_j$ denote the model obtained after training for iteration $j$, with $m_0$ referring to model $m_{pt}$. Iteration $j$ is trained using the texts $\left(\bigcup_{i=1}^{j} X_{st}^{i}\right) \cup S$. 
The true labels for the texts in $S$ are available, and for the texts in $X_{st}^{i}$, the predictions obtained using model $m_{i-1}$ are considered as their respective labels. The model obtained at the last iteration, i.e. $m_k$, is used for final predictions.</p></div>
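<div xmlns="http://www.tei-c.org/ns/1.0"><p>The iteration scheme above can be sketched as a short loop. This is only an illustration under stated assumptions: the classifier is abstracted as a training function that returns a callable model, and a toy keyword-vote classifier (train_keyword_model, hypothetical) stands in for the fine-tuned multilingual transformer. The names progressive_training, Example and Model are also hypothetical.</p><ab><![CDATA[
```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

Example = Tuple[str, int]        # (text, label) with label 1 = positive, 0 = negative
Model = Callable[[str], int]     # maps a text to a predicted label


def progressive_training(source: Sequence[Example],
                         buckets: Sequence[Sequence[str]],
                         selected: Set[str],
                         train_fn: Callable[[List[Example]], Model]) -> Model:
    """Train m_1 ... m_k, adding pseudo-labelled samples one bucket at a time."""
    model = train_fn(list(source))            # m_0 = m_pt, trained on S only
    pseudo: List[Example] = []                # union of X_st^1 ... X_st^j with labels
    for bucket in buckets:                    # B_1 (most English) to B_k
        chosen = [x for x in bucket if x in selected]   # X_st^i = X_st intersect B_i
        pseudo.extend((x, model(x)) for x in chosen)    # labels come from m_{i-1}
        model = train_fn(list(source) + pseudo)         # m_i
    return model                              # m_k, used for final predictions


def train_keyword_model(examples: List[Example]) -> Model:
    """Toy stand-in for a real classifier: each word votes for the labels it co-occurred with."""
    weights: Dict[str, int] = {}
    for text, label in examples:
        for word in text.split():
            weights[word] = weights.get(word, 0) + (1 if label == 1 else -1)

    def model(text: str) -> int:
        return 1 if sum(weights.get(w, 0) for w in text.split()) > 0 else 0

    return model


# Toy data: the source is English only; buckets go from English-heavy to Hindi-heavy.
source = [("good great", 1), ("bad awful", 0)]
buckets = [["good accha", "bad bura"], ["accha film", "bura film"]]
selected = {x for b in buckets for x in b}    # assumption: every sample was selected

m_pt = train_keyword_model(source)
m_k = progressive_training(source, buckets, selected, train_keyword_model)
```
]]></ab><p>In this toy setting, the source-only model has never seen "accha" and falls back to the negative label, whereas the progressively trained model picks up the word through the pseudo-labelled first bucket and can then label the second bucket.</p></div>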
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Class ratio based instance selection</head><p>Datasets frequently have a significant amount of class imbalance. Therefore, when selecting the samples for progressive training, we often end up selecting very few or no samples from the minority class, which leads to very poor performance. Hence, instead of selecting a fraction of samples from the entire dataset $T$, we select the same fraction of samples per class. Specifically, let $X^{+}$ and $X^{-}$ denote the sets of samples for which the pseudo-labels are positive and negative sentiment, respectively. For progressive training, we choose this fraction of the most confident samples from $X^{+}$ and the same fraction of the most confident samples from $X^{-}$.</p><p>The pseudo-code for the algorithm is given in Algorithm 1.</p></div>
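<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal sketch of class ratio based selection is shown next, contrasted with selecting from the whole pseudo-labelled pool. The function names and the rounding rule (round down, but keep at least one sample per class) are assumptions made for illustration; the paper does not specify them here.</p><ab><![CDATA[
```python
from typing import Dict, List, Tuple

Prediction = Tuple[str, int, float]   # (text, pseudo-label, probability of that label)


def select_global(predictions: List[Prediction], fraction: float) -> List[Prediction]:
    """Keep the most confident `fraction` of all samples, ignoring class."""
    ranked = sorted(predictions, key=lambda p: p[2], reverse=True)
    return ranked[:max(1, int(fraction * len(ranked)))]


def select_per_class(predictions: List[Prediction], fraction: float) -> List[Prediction]:
    """Keep the most confident `fraction` of samples within each pseudo-label class."""
    by_class: Dict[int, List[Prediction]] = {}
    for pred in predictions:
        by_class.setdefault(pred[1], []).append(pred)
    selected: List[Prediction] = []
    for label in sorted(by_class):
        ranked = sorted(by_class[label], key=lambda p: p[2], reverse=True)
        # Assumption: round down, but never drop a class entirely.
        selected.extend(ranked[:max(1, int(fraction * len(ranked)))])
    return selected


# Toy pool: six confident positives and two less confident negatives.
pool = [
    ("p1", 1, 0.99), ("p2", 1, 0.98), ("p3", 1, 0.97),
    ("p4", 1, 0.96), ("p5", 1, 0.95), ("p6", 1, 0.94),
    ("n1", 0, 0.70), ("n2", 0, 0.60),
]
global_pick = select_global(pool, 0.5)
class_pick = select_per_class(pool, 0.5)
```
]]></ab><p>With this toy pool, global selection keeps only positive samples, while per-class selection keeps three positives and one negative, so the minority class stays represented.</p></div>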
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Experiments</head><p>We describe the details relevant to the experiments in this section and also elaborate on the probing tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Datasets</head><p>For source dataset pretraining, we use the English Twitter dataset from SemEval 2017 Task 4 <ref type="bibr">(Rosenthal et al., 2017)</ref>. We use three code-switched datasets for our experiments: Hindi-English <ref type="bibr">(Patra et al., 2018)</ref>, Spanish-English <ref type="bibr">(Vilares et al., 2016)</ref>, and Tamil-English <ref type="bibr">(Chakravarthi et al., 2016)</ref>. The Hindi-English and Spanish-English datasets are collected from Twitter, and the Tamil-English dataset is collected from YouTube comments. The statistics of the datasets can be found in Table <ref type="table">1</ref>. Two out of the three datasets have a class imbalance, the largest being in Tamil-English, where the positive class is approximately 5 times the size of the negative class. We upsample the minority class to create a balanced dataset.</p><p>Most of the sentences in the datasets are written in the Roman script. The words in Hindi and Tamil are converted into the Devanagari script. We use the processed dataset provided by <ref type="bibr">Khanuja et al. (2020)</ref> for the Hindi-English and Spanish-English datasets. The processed version of the Hindi-English dataset has the Hindi words in Devanagari script. Since we deal with low-resource languages for which tools might not be well developed, we use heuristics to detect the words in English. For the Spanish-English and Tamil-English datasets, we use the Brown corpus from NLTK<ref type="foot">foot_1</ref> to detect the English words in the sentence. Words that are not present in the corpus are considered to belong to another language. For the Tamil-English dataset, we use the AI4Bharat Transliteration python library<ref type="foot">foot_2</ref> to get the transliterations.</p></div>
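<div xmlns="http://www.tei-c.org/ns/1.0"><p>The vocabulary-lookup heuristic for English word detection can be sketched as follows. The normalisation choices (lowercasing, stripping surrounding punctuation) and the function names build_vocabulary and tag_english are assumptions for illustration. The vocabulary is passed in as a plain word list so that the sketch runs without external data; with NLTK installed and the Brown corpus downloaded, nltk.corpus.brown.words() could supply that list.</p><ab><![CDATA[
```python
import string
from typing import Iterable, List, Set, Tuple


def build_vocabulary(words: Iterable[str]) -> Set[str]:
    """Lowercased alphabetic word types from a reference English corpus."""
    return {w.lower() for w in words if w.isalpha()}


def tag_english(text: str, vocabulary: Set[str]) -> List[Tuple[str, bool]]:
    """Mark each whitespace token as English (True) or other (False) by vocabulary lookup."""
    tagged = []
    for token in text.split():
        # Assumption: compare the token without surrounding punctuation, in lowercase.
        core = token.strip(string.punctuation).lower()
        tagged.append((token, core in vocabulary))
    return tagged


# Toy reference word list standing in for a real English corpus.
vocab = build_vocabulary(["The", "movie", "was", "good", "!"])
tags = tag_english("Movie was good, pero muy largo!", vocab)
```
]]></ab><p>A lookup of this kind treats any word missing from the reference corpus as non-English, so rare English words and misspellings are counted towards the other language.</p></div>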
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Model training</head><p>In all the experiments, we use the multilingual-bert-base-cased (mBERT) classifier. The supervised English dataset has an 80-20 train-validation split.</p><p>Following <ref type="bibr">Wang et al. (2021)</ref>, the fraction of samples selected for self-training is set to 0.5. We observe that in most datasets, the number of spikes in the distribution plot of $f_{eng}(x_i)$ is either 1 or 2. For example, we observe only two spikes for the Hindi-English dataset in Figure <ref type="figure">7</ref> in Appendix A.2. Therefore, we set $k=2$. We train the classifier in both the pretraining and training phases for 4 epochs. During pretraining with the supervised English dataset, we choose the best weights using the validation set. While training, we use the weights obtained after the fourth epoch. For final evaluation, we use the model obtained from the last iteration.</p><p>Additional details about the hyper-parameters can be found in Appendix A.1.</p><p>In the rest of the paper, we refer to the model pretrained on the resource-rich language source dataset as model $m_{pt}$, the model trained on the source dataset along with bucket $B_1$ as $m_1$, and the model trained on the source dataset along with the buckets $B_1$ and $B_2$ as $m_2$. $m_2$ is used for final predictions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Evaluation</head><p>As the datasets are significantly skewed between the two classes, we report micro, macro and weighted F1 scores, as done in <ref type="bibr">Mekala and Shang (2020)</ref>. For code-switched datasets, we use all the samples without labels during self-training. The final score is obtained using the predictions made by model $m_2$ on all the samples and their true labels. For each dataset, we run the experiment with 5 seeds and report the mean and standard deviation.</p></div>
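<div xmlns="http://www.tei-c.org/ns/1.0"><p>For reference, the three averages can be computed from per-class counts as in the sketch below. It follows the standard definitions for single-label classification (macro: unweighted mean of per-class F1; weighted: mean weighted by class support; micro: F1 over pooled counts) and is not taken from the released code; the function name f1_scores is hypothetical.</p><ab><![CDATA[
```python
from collections import Counter
from typing import Dict, Sequence


def _f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), defined as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_scores(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Micro, macro and weighted F1 for single-label classification."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted class gains a false positive
            fn[t] += 1   # true class gains a false negative
    per_class = {c: _f1(tp[c], fp[c], fn[c]) for c in labels}
    support = Counter(y_true)
    return {
        "micro": _f1(sum(tp.values()), sum(fp.values()), sum(fn.values())),
        "macro": sum(per_class.values()) / len(labels),
        "weighted": sum(per_class[c] * support[c] for c in labels) / len(y_true),
    }


# Skewed toy example: four positives, two negatives, one negative misclassified.
scores = f1_scores([1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 0])
```
]]></ab><p>In the toy example the per-class F1 is 8/9 for the positive class and 2/3 for the negative class, so the macro average (7/9) is lower than the micro (5/6) and weighted (22/27) averages, which are dominated by the majority class.</p></div>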
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4">Baselines</head><p>We consider four baselines as described below:</p><p>&#8226; Deep Embedding for Clustering (DEC) <ref type="bibr">(Xie et al., 2016)</ref> has been used in WeSTClass <ref type="bibr">(Meng et al., 2018)</ref> for self-training using unlabeled documents after pretraining on generated pseudo documents. We adapt DEC similarly to our setting, by pretraining on $S$ and self-training using the DEC objective only on $T$.</p><p>&#8226; No Progressive Training (No-PT) initially trains the model on the source dataset $S$. As done in <ref type="bibr">(Wang et al., 2021)</ref>, it selects a fraction of the code-switched data with pseudo-labels and trains a classifier on the selected samples and the source dataset $S$ without any progressive training.</p><p>&#8226; Unsupervised Self-Training (Unsup-ST) <ref type="bibr">(Gupta et al., 2021)</ref> starts with a pretrained sentiment analysis model and then self-trains using the code-switched dataset. We use the default version</p><p>&#8226; Zero-shot (ZS) <ref type="bibr">(Pires et al., 2019)</ref> denotes the zero-shot performance when the model is pretrained on the monolingual resource-rich language dataset $S$.</p><p>We also compare with two ablation versions of our method, denoted by -Source and -Ratio. -Source uses only the code-switched dataset with its corresponding pseudo-labels, without the source dataset $S$, for training. -Ratio chooses the most confident samples for training without taking the class ratio into account.</p><p>We also report the performance in the supervised setting, denoted by Supervised. For each dataset, we train the model only on dataset $T$, using its true labels. This is the possible upper bound.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5">Performance comparison</head><p>The results for all three datasets are reported in Table <ref type="table">2</ref>. In almost all the cases, we observe a performance improvement using our method as compared to the baselines, with a maximum improvement of approximately 1.2% in the case of Spanish-English. The comparison between ZS and our method shows the necessity of the target code-switched dataset, and the comparison between No-PT and our method shows that progressive training has a positive impact. In most cases, the final performance is within approximately 10% of the supervised setting. We believe our improvements are significant since the baselines are close to the supervised model in terms of performance and yet our progressive training strategy makes a significant improvement. We report the statistical significance test results between our method and the other baselines in Table <ref type="table">8</ref> in Appendix A.4. In all the cases, we observe the p-value to be less than 0.001. The progressively trained model for Spanish-English does better than its corresponding supervised setting, outperforming it by approximately 6%. We hypothesize that this is because, with the large number of instances in the source dataset $S$, the progressively trained model has access to more information and successfully leverages it to improve the performance on the target code-switched dataset. The comparison between our method and its ablated version -Source demonstrates the importance of the source dataset while training the classifier. We can note that our proposed method efficiently transfers the relevant information from the source dataset to the code-switched dataset, thereby improving the performance. On comparing our method with -Ratio, we observe that using class ratio based instance selection improves the performance in two out of three cases. 
For the Tamil-English dataset, we observe that the weighted and micro F1 scores are higher for the -Ratio method but the macro F1 score is poor. This is because the F1 score of the positive class increases by approximately 2% but the F1 score of the negative class drops by approximately 9% when using the -Ratio method instead of ours. Since the dataset is skewed in favor of the positive class, this leads to higher weighted and micro F1 scores.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.6">Performance comparison across buckets</head><p>In Figure <ref type="figure">3</ref>, we plot the performance obtained by No-PT and our method on both buckets. Since our method aims at improving the performance on low-resource language dominated instances, we expect our model $m_2$ to perform better on bucket $B_2$, and we observe the same. As shown in Figure <ref type="figure">3</ref>, in most of the cases, our method performs better than the baseline on bucket $B_2$. For bucket $B_1$, we observe a minor improvement in the case of Spanish-English, whereas it stays similar for the other datasets. A detailed qualitative analysis is presented in Appendix A.5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.7">Probing task : Out-Of-Distribution (OOD) detection</head><p>As previously mentioned, our proposed framework is based on two main hypotheses: (1) a transformer model trained on a resource-rich language dataset is more likely to be correct/robust on resource-rich language dominated samples than on low-resource language dominated samples; (2) the models obtained using the progressive training framework are more likely to be correct/robust on the low-resource dominated samples than the models self-trained on the entire code-switched corpus at once. To confirm our hypotheses, we perform a probing task where we compute the fraction of the samples that are OOD. More specifically, we ask two questions: a) Is the fraction of OOD samples the same for both buckets for model $m_{pt}$? b) Is there a change in the OOD fraction for bucket $B_2$ if we use model $m_1$ instead of model $m_{pt}$? The first question helps in verifying the first hypothesis and the second question helps in verifying the second hypothesis.</p><p>Since the source dataset $S$ is in English and the target dataset $T$ is code-switched, the entire dataset $T$ might be considered out-of-distribution. However, transformer models are considered robust and can generalise well to OOD data <ref type="bibr">(Hendrycks et al., 2020)</ref>. Determining whether a sample is OOD is difficult unless we know more about the difference between the datasets. However, model probability can be used as a proxy. We use a method based on the model's softmax probability output, similar to <ref type="bibr">Hendrycks and Gimpel (2017)</ref>, to do OOD detection. The higher the probability of the predicted class, the more confident the model is, and thus the less likely the sample is to be out of distribution.</p><p>For a given model trained on a dataset, a threshold $p_\alpha$ is determined using the development set (or the unseen set of samples) to detect OOD samples. 
$p_\alpha$ is the probability value such that only an $\alpha$ fraction of samples from the development set (or the unseen set of samples) have probability of the predicted class less than $p_\alpha$. For example, if $\alpha = 10\%$, 90% of samples in the development set have probability of the predicted class greater than $p_\alpha$. If a new sample from another dataset (or bucket) has probability of the predicted class less than $p_\alpha$, we consider it to be OOD. Using $p_\alpha$, we can determine the fraction of samples from the new set that are OOD. Since there is no method to know the exact value of $\alpha$ to be used, we report OOD using three values of $\alpha$: 0.01, 0.05 and 0.10. For model $m_{pt}$, we use the development split from the dataset $S$ to determine the value of $p_\alpha$, and for model $m_1$, we use the set of samples from bucket $B_1$ that are not used in self-training (i.e. $B_1 \setminus X_{st}^{1}$) to determine $p_\alpha$. Based on the value of $\alpha$, we conduct two experiments and answer our two questions.</p><p>Is the fraction of OOD samples the same for both buckets for model $m_{pt}$? In the first experiment, we consider the model trained on the source dataset and find the fraction of OOD samples in both buckets. Since the first bucket contains more resource-rich language dominated samples, we expect a smaller fraction of its samples to be out-of-distribution compared to the second bucket. In Figure <ref type="figure">4</ref>, we plot the bucketwise OOD for different datasets. We observe that a smaller fraction of samples from the first bucket are OOD in all the datasets except Spanish-English. This shows that instances domi-</p><p>Is there a change in the OOD fraction for bucket $B_2$ if we use model $m_1$ instead of model $m_{pt}$?</p><p>In the second experiment, we compare the fraction of OOD data in bucket $B_2$ for the models $m_{pt}$ and $m_1$.</p><p>In Figure <ref type="figure">5</ref>, we observe that a smaller fraction of samples in bucket $B_2$ is OOD for model $m_1$ compared to model $m_{pt}$. 
This is expected since the model $m_1$ has seen samples with low-resource language words while training, thus providing empirical evidence in support of our proposed training strategy. Although the samples from $B_2$ would still have noisy labels, we expect them to be more accurate when predicted by $m_1$ than by $m_{pt}$.</p></div>
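<div xmlns="http://www.tei-c.org/ns/1.0"><p>The thresholding procedure of the probing task can be sketched as below. The inputs are lists of predicted-class probabilities (the maximum softmax probability per sample); how the threshold index is rounded is an assumption, and the function names ood_threshold and ood_fraction are hypothetical. Both functions assume non-empty inputs.</p><ab><![CDATA[
```python
from typing import Sequence


def ood_threshold(dev_confidences: Sequence[float], alpha: float) -> float:
    """Return p_alpha: about an alpha fraction of dev samples fall strictly below it."""
    ranked = sorted(dev_confidences)
    # Assumption: round alpha * n to the nearest index and clamp it to the list.
    cut = min(int(round(alpha * len(ranked))), len(ranked) - 1)
    return ranked[cut]


def ood_fraction(confidences: Sequence[float], threshold: float) -> float:
    """Fraction of samples whose predicted-class probability is below the threshold."""
    return sum(1 for c in confidences if c < threshold) / len(confidences)


# Toy predicted-class probabilities (illustration only, not results from the paper).
dev = [0.99, 0.55, 0.60, 0.70, 0.80, 0.85, 0.90, 0.92, 0.95, 0.97]
p_alpha = ood_threshold(dev, alpha=0.10)

bucket_1 = [0.90, 0.95, 0.58, 0.80]   # hypothetical English-dominated bucket
bucket_2 = [0.55, 0.50, 0.70, 0.59]   # hypothetical low-resource dominated bucket
ood_b1 = ood_fraction(bucket_1, p_alpha)
ood_b2 = ood_fraction(bucket_2, p_alpha)
```
]]></ab><p>With these toy numbers the threshold is 0.60, one of the ten development samples falls below it, and the second bucket has a larger OOD fraction than the first, which is the pattern the first probing question looks for.</p></div>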
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.8">Comparison with other multilingual models</head><p>Recently, multiple multilingual transformer models have been proposed. We experiment with MuRIL <ref type="bibr">(Khanuja et al., 2021)</ref> and IndicBERT <ref type="bibr">(Kakwani et al., 2020)</ref>. Firstly, we obtain the performance of three language models (mBERT, MuRIL, and IndicBERT) without progressive training on all datasets. We then use progressive training on top of the best performing model for each dataset and verify whether it further improves the performance. The F1 scores are reported in Table <ref type="table">3</ref>. We observe that performance either increases or stays very competitive in all the cases, showing that our method is capable of improving performance even when used with the best multilingual model for the task.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.9">Hyper-parameter sensitivity analysis</head><p>There are two hyper-parameters in our experiments: the number of buckets ($k$) and the fraction of samples selected for self-training. We vary $k$ from 2 to 4 to study the effect of the number of buckets on the performance, and the F1 scores are reported in Table <ref type="table">4</ref>. Our method is fairly robust to the value of $k$. For almost all values of $k$, our method does better than the baselines. As mentioned earlier, the number of spikes in the distribution plot of $f_{eng}$ is 1 or 2 for all the datasets. In the presence of more spikes, a higher value of $k$ is recommended. To study the effect of the selection fraction, we plot macro, micro, and weighted F1 scores across multiple values of it in Figure <ref type="figure">6</ref>. With a low fraction, there would not be enough sentences for self-training to help, whereas with a high fraction, the samples would be too noisy. Thus, a value in the middle, i.e. 0.4-0.6, should be a reasonable choice.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion and Future work</head><p>In this paper, we propose a progressive training framework that takes the distinction between low-resource and resource-rich languages into account while doing zero-shot transfer learning for code-switched sentiment analysis. We show that our framework improves performance across multiple datasets. Further, we also create probing tasks to provide empirical evidence in support of our hypothesis. In the future, we want to extend the framework to other tasks like question answering and natural language inference.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Limitations</head><p>A key potential limitation of the current framework is that if the numbers of samples in the buckets are highly disproportionate, progressive learning might not yield a significant improvement.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">Ethical consideration</head><p>This paper proposes a progressive training framework to transfer knowledge from resource-rich language data to low-resource code-switched data. We work on the sentiment classification task, which is a standard NLP problem. Based on our experiments, we do not see any major ethical concerns with our work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A Appendix</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.1 Additional hyper-parameter details</head><p>The batch size is 64, the sequence length is 128, and the learning rate is 5e-5. These hyper-parameters are the same for all the models used during pre-training and training. Every iteration takes approximately 1-2 seconds and approximately 12 GB of memory on a GPU.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.2 Statistics related to the buckets</head><p>We report the average value and standard deviation of f_eng across the buckets in Table <ref type="table">5</ref>. We report the number of instances selected for self-training across the buckets in Table <ref type="table">6</ref>. We plot the distribution of f_eng in Figure <ref type="figure">7</ref>.</p><p>Figure <ref type="figure">7</ref>: Distribution of f_eng (x-axis) vs. number of samples for the Spanish-English, Hindi-English and Tamil-English datasets (left to right). For Hindi-English, we observe two spikes in the graph, showing that some samples are heavily dominated by English and some are heavily dominated by Hindi. For the other two datasets, the progression is more gradual.</p></div>
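<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of how the per-bucket statistics of f_eng could be computed, the following Python sketch treats f_eng as the fraction of whitespace-separated tokens found in an English vocabulary. The toy vocabulary ENGLISH_WORDS, the tokenisation, and the function names f_eng and bucket_stats are assumptions made for this example and are not taken from the released code.</p>

```python
from statistics import mean, pstdev

# Hypothetical toy vocabulary; a real English word list would be much larger.
ENGLISH_WORDS = {"the", "movie", "is", "good", "gift", "fixing"}


def f_eng(sentence, english_words=ENGLISH_WORDS):
    """Fraction of whitespace-separated tokens found in the English vocabulary."""
    tokens = sentence.lower().split()
    if not tokens:
        return 0.0
    return sum(tok in english_words for tok in tokens) / len(tokens)


def bucket_stats(buckets):
    """Map each bucket name to (mean, population standard deviation) of f_eng."""
    stats = {}
    for name, sentences in buckets.items():
        values = [f_eng(s) for s in sentences]
        stats[name] = (mean(values), pstdev(values))
    return stats
```

<p>For example, with this toy vocabulary, f_eng("fixing me saja hone ka gift") counts "fixing" and "gift" as English, i.e. 2 of 6 tokens.</p></div>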
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.3 Results of the zero-shot model on buckets</head><p>We report the results of the zero-shot model on both buckets in Table <ref type="table">7</ref>. As expected, in all cases the model performs better on B_1 than on B_2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.4 Statistical Significance Results</head><p>Hindi-English: 2.76e-6, 4.37e-43, 3.49e-4, 6.85e-12.</p><p>Tamil-English: 1.87e-41, 3.72e-16, 3.71e-3, 3.40e-7.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.5 Qualitative analysis</head><p>As discussed previously, on the bucket dominated by the low-resource language, our model is correct more often than the No-PT baseline. We focus on samples from bucket B_2 for qualitative analysis. For the sample "fixing me saja hone ka gift", the Hindi word "saja" means punishment, which carries negative sentiment, whereas the word "gift" carries positive sentiment. Thus, the contextual information in the Hindi words combined with that of the English words is necessary to make the correct prediction. For the sample "Mera bharat mahan, padhega India tabhi badhega India", the model has to identify the Hindi words "mahan" and "badhega" to make the correct prediction. We also perform qualitative analysis by comparing predictions on samples between successive iterations. In Table <ref type="table">9</ref>, we randomly choose samples that are predicted incorrectly by model m_pt but correctly by model m_1. Out of 8 samples, 6 had sentiment expressed specifically in the Hindi words. In Table <ref type="table">10</ref>, we randomly choose samples that are predicted incorrectly by model m_1 but correctly by model m_2. Out of 8 samples, 4 had sentiment expressed specifically in the Hindi words and 2 required understanding both the Hindi and English words simultaneously. The blue highlighted words are relevant to determining the sentiment of the sentence.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>https://github.com/s1998/progressiveTrainCodeSwitch</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>https://www.nltk.org/</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"><p>https://pypi.org/project/ai4bharat-transliteration/</p></note>
		</body>
		</text>
</TEI>
