<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>YOSM: A New Yoruba Sentiment Corpus for Movie Reviews</title></titleStmt>
			<publicationStmt>
				<publisher>In Proceedings of the 3rd Workshop on African Natural Language Processing, co-located with International Conference on Learning Representations (ICLR) 2022</publisher>
				<date>04/20/2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10470682</idno>
					<idno type="doi"></idno>
					
					<author>I. Shode</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Sentiment analysis is a popular text classification task in natural language processing. It involves developing algorithms or machine learning models to determine the sentiment or opinion expressed in a piece of text. The results of this task can be used by business owners and product developers to understand their consumers’ perceptions of their products. Aside from customer feedback and product/service analysis, this task can be useful for social media monitoring (Martin et al., 2021). One of the popular applications of sentiment analysis is classifying and detecting positive and negative sentiment in movie reviews. Movie reviews enable movie producers to monitor the performance of their movies (Abhishek et al., 2020) and help movie viewers decide whether a movie is good enough and worth investing time to watch (Lakshmi Devi et al., 2020). However, the task has been under-explored for African languages compared to their Western counterparts, the “high-resource languages”, which have received enormous attention due to the large amount of available textual data. African languages fall under the category of low-resource languages, which are disadvantaged because the limited availability of data gives them poor representation (Nasim & Ghani, 2020). Recently, sentiment analysis has received attention for African languages in the Twitter domain, for Nigerian (Muhammad et al., 2022) and Amharic (Yimam et al., 2020) languages. However, there is no available corpus in the movie domain. We tackle the unavailability of Yorùbá data for movie sentiment analysis by creating the first Yorùbá sentiment corpus for Nollywood movie reviews. We also develop sentiment classification models using state-of-the-art pre-trained language models such as mBERT (Devlin et al., 2019) and AfriBERTa (Ogueji et al., 2021).]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Yorùbá is the third most spoken indigenous African language <ref type="bibr">(Eberhard et al., 2020)</ref>, with over 50 million speakers. Speakers of Yorùbá can be found in the South-Western region of Nigeria and across the globe. Yorùbá is a tonal language whose alphabet comprises 25 letters. Despite its large number of speakers, Yorùbá falls under the category of low-resource languages, and few NLP datasets have been developed for it <ref type="bibr">(Adelani et al., 2021b)</ref>. Furthermore, there is no record of sentiment analysis research on Nigerian movies (i.e., Nollywood) or on Yorùbá movie reviews.</p><p>Nollywood is the home of Nigerian movies, which depict the Nigerian people and reflect the diversity of Nigerian cultures. Foster, a member of the Masterclass staff, claimed in 2022<ref type="foot">foot_0</ref> that four to five movies are released daily by Nigerian movie producers for an estimated audience of fifteen million Nigerians and five million people in other African countries. As a result, Nollywood is the second-largest movie and film industry in the world. Despite its capacity, Nollywood movie reviews are scarce.</p><p>Data: Unlike Hollywood movies, which attract hundreds of thousands of reviews across the internet, Nigerian movies have comparatively few reviews. Furthermore, there is no online platform dedicated to movie reviews originally written in Yorùbá. Most of the reviews are written in English. We collected 1,500 reviews, balanced between positive and negative. These reviews were accompanied by ratings and were sourced from three popular online movie review platforms<ref type="foot">foot_1</ref>: IMDB, Rotten Tomatoes, and Letterboxd. We also collected reviews and ratings from two indigenous Nigerian movie review websites<ref type="foot">foot_2</ref>: Cinemapointer and Nollyrated. 
Our annotation classified the reviews based on the rating that the reviewer gave the movie. We used the rating scale to separate positive from negative reviews: ratings of 0-4 were assigned to the negative (NEG) category and ratings of 7-10 to the positive (POS) category. After collecting the data, we recruited native speakers of Yorùbá who work as professional translators to manually translate the movie reviews from English to Yorùbá. Thus, we have a parallel review dataset in English and Yorùbá, with the corresponding ratings.</p><p>As an alternative when human translations are not available for training, we automatically translate the English reviews to Yorùbá using the Google Translate machine translation tool; this can be useful in scenarios where no training data exists in Yorùbá. To evaluate the quality of the automatic translation, we compute the BLEU score <ref type="bibr">(Papineni et al., 2002)</ref> between the human-translated sentences and the output of Google Translate. We obtained 3.36 BLEU, which shows that the performance of the English-Yorùbá MT model is very poor, similar to the observation of <ref type="bibr">Adelani et al. (2021a)</ref> on Google Translate across several domains. Nevertheless, we want to evaluate to what extent automatic translations can help when human translations are absent. Table <ref type="table">1</ref> shows the data sources of the curated Yorùbá movie reviews, which we named YOSM. We split YOSM into 800 reviews for training, 200 for development, and 500 for testing.</p><p>Baseline Models We fine-tune two pre-trained language models (PLMs) that have been pre-trained on the Yorùbá language: mBERT <ref type="bibr">(Devlin et al., 2019)</ref> and AfriBERTa <ref type="bibr">(Ogueji et al., 2021)</ref>. 
AfriBERTa was pre-trained exclusively on 11 African languages, while mBERT was pre-trained on 104 languages. As an additional baseline, we use a PLM that has been adapted to the Yorùbá language using language adaptive fine-tuning (LAFT), an approach that fine-tunes a PLM on monolingual text in a new language using the same masked language model objective as BERT. LAFT has been shown to improve performance on named entity recognition for Yorùbá <ref type="bibr">(Alabi et al., 2020;</ref><ref type="bibr">Adelani et al., 2021b)</ref> and to give better zero-shot cross-lingual transfer <ref type="bibr">(Pfeiffer et al., 2020)</ref>.</p><p>Transfer Learning Setting We examine four transfer learning settings: (1) imdb(en): cross-lingual transfer from a large Hollywood movie review dataset (i.e., IMDB) with 25,000 samples, with zero-shot evaluation on the YOSM test set; (2) en: cross-lingual transfer from the English Nollywood movie reviews, where the size is limited to the 800 untranslated reviews in our dataset; (3) yo:MT: training on machine translations of the 800 English Nollywood reviews into Yorùbá; and (4) en+yo:MT: training on the combination of the English Nollywood reviews and the machine-translated reviews.</p><p>Results Table <ref type="table">2</ref> shows the baseline results for the PLMs. We obtained impressive results (&gt; 83 F1) by training on our small training set (i.e., 800 reviews). AfriBERTa and mBERT+LAFT gave better results (more than 86 F1) than mBERT (83.2), since they have been trained exclusively on African languages or adapted using LAFT. For the transfer learning results, we obtained very good cross-lingual transfer of over 61 F1 in all settings. 
We find that transfer from en performs better than transfer from imdb(en), an improvement of 2.4-5.5 F1 using mBERT+LAFT or AfriBERTa, since en captures the Nollywood domain better than imdb(en), which is based on Hollywood reviews.</p><p>The best transfer approach in the absence of human-written Yorùbá reviews is to train on machine-translated reviews (yo:MT) and/or combine them with English Nollywood reviews (en+yo:MT), with performance reaching 77.9 F1. Combining English and automatically translated Yorùbá Nollywood reviews gives a small additional benefit (0.8-2.5 F1) over yo:MT alone.</p></div>
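<div xmlns="http://www.tei-c.org/ns/1.0"><p>The rating-based labelling rule and the 800/200/500 split described in the Data paragraphs can be illustrated with a minimal Python sketch. The function names rating_to_label and split_reviews are hypothetical and are not part of the released dataset or code. Returning None for ratings of 5 or 6, and slicing an already-shuffled list, are assumptions: the text specifies only the 0-4 and 7-10 ranges and the split sizes, not how intermediate ratings were handled or how the split was drawn.</p>
<eg><![CDATA[
```python
from typing import List, Optional, Sequence, Tuple

NEG_RANGE = (0, 4)   # ratings 0-4 are negative (NEG), as stated in the text
POS_RANGE = (7, 10)  # ratings 7-10 are positive (POS), as stated in the text


def rating_to_label(rating: float) -> Optional[str]:
    """Map a 0-10 review rating to "NEG", "POS", or None.

    Ratings outside both ranges (e.g. 5 or 6) return None. Leaving them
    unlabelled is an assumption of this sketch, not a documented rule.
    """
    if NEG_RANGE[0] <= rating <= NEG_RANGE[1]:
        return "NEG"
    if POS_RANGE[0] <= rating <= POS_RANGE[1]:
        return "POS"
    return None


def split_reviews(
    reviews: Sequence,
    n_train: int = 800,
    n_dev: int = 200,
    n_test: int = 500,
) -> Tuple[List, List, List]:
    """Slice an already-shuffled sequence into train/dev/test lists.

    The sizes default to the 800/200/500 split reported for YOSM; the
    ordering of the input is assumed to be random already.
    """
    expected = n_train + n_dev + n_test
    if len(reviews) != expected:
        raise ValueError(f"expected {expected} reviews, got {len(reviews)}")
    items = list(reviews)
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test
```
]]></eg>
<p>Under these assumptions, rating_to_label(3) gives "NEG" and rating_to_label(8.5) gives "POS", while a rating of 5 is left unlabelled; split_reviews applied to 1,500 items returns lists of 800, 200, and 500 items.</p></div>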
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Conclusion</head><p>In this paper, we presented YOSM, the first Yorùbá sentiment corpus for Nollywood movie reviews, which was manually translated from English Nollywood reviews. We performed experiments on this dataset using state-of-the-art pre-trained language models and transfer learning approaches, which gave impressive results. The YOSM dataset is publicly available on GitHub<ref type="foot">foot_3</ref>.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>https://www.masterclass.com/articles/nollywood-new-nigerian-cinema-explained</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>www.imdb.com, www.rottentomatoes.com, and https://letterboxd.com/</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"><p>www.cinemapointer.com and https://nollyrated.com/</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3"><p>https://github.com/IyanuSh/YOSM</p></note>
		</body>
		</text>
</TEI>
