<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Bridging Nations: Quantifying the Role of Multilinguals in Communication on Social Media</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>06/05/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10450316</idno>
					<idno type="doi">10.1609/icwsm.v17i1.22174</idno>
					<title level='j'>Proceedings of the International AAAI Conference on Web and Social Media</title>
<idno>2162-3449</idno>
<biblScope unit="volume">17</biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Julia Mendelsohn</author><author>Sayan Ghosh</author><author>David Jurgens</author><author>Ceren Budak</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Social media enables the rapid spread of many kinds of information, from pop culture memes to social movements. However, little is known about how information crosses linguistic boundaries. We apply causal inference techniques on the European Twitter network to quantify the structural role and communication influence of multilingual users in cross-lingual information exchange. Overall, multilinguals play an essential role; posting in multiple languages increases betweenness centrality by 13%, and having a multilingual network neighbor increases monolinguals’ odds of sharing domains and hashtags from another language 16-fold and 4-fold, respectively. We further show that multilinguals have a greater impact on diffusing information that is less accessible to their monolingual compatriots, such as information from far-away countries and content about regional politics, nascent social movements, and job opportunities. By highlighting information exchange across borders, this work sheds light on a crucial component of how information and ideas spread around the world.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Introduction</head><p>Social media facilitates worldwide diffusion of many forms of content, such as pop culture memes, protest movements, and misinformation <ref type="bibr">(Bruns, Highfield, and Burgess 2013;</ref><ref type="bibr">Nissenbaum and Shifman 2018;</ref><ref type="bibr">Bridgman et al. 2021)</ref>. However, most connections on social media are formed between people who share a common nationality or language <ref type="bibr">(Ugander et al. 2011;</ref><ref type="bibr">Takhteyev, Gruzd, and Wellman 2012)</ref>. How can information spread around the world when relatively few ties cross geographic and linguistic boundaries? Multilingual users are believed to be an important piece of this puzzle, but how they act as brokers in information flow across language communities remains underexplored <ref type="bibr">(Hong, Convertino, and Chi 2011)</ref>. We carry out a set of causal inference studies to quantify how multilingual users influence cross-lingual information exchange across Europe on Twitter and show that the role of multilingual users varies depending on the relationship between countries and the topic of content shared.</p><p>Figure <ref type="figure">1</ref>: Consider networks of users from pairs of countries (C x , C y ) (e.g., Germany and Turkey) with dominant languages L x and L y (e.g., German and Turkish). Here, multilingual user A posts in both languages. We use causal inference to quantify (i) the structural role and (ii) the communication influence of A in cross-lingual information exchange.</p><p>We identify multilinguals based on the languages of their authored posts and avoid making claims about users' offline language competencies, which is an ideologically and theoretically fraught notion <ref type="bibr">(Cheng et al. 2021)</ref>. 
Using this operationalization of multilingualism, prior work has found that multilingual users tend to have ties (e.g., following links) to multiple distinct language communities, suggesting that they act as bridges in online social networks <ref type="bibr">(Eleta and Golbeck 2012;</ref><ref type="bibr">Hale 2014a</ref>). We expand upon earlier population-level descriptive work by using causal inference techniques, namely propensity score stratification, to more robustly isolate the effects of individual users' multilingual behaviors on several outcomes. This approach is motivated by recent work showing propensity score stratification aligns with experimental results when measuring peer effects in link sharing behavior on social media <ref type="bibr">(Eckles and Bakshy 2020)</ref>. We carry out two analyses using this framework. First, we measure the structural role of multilinguals. Here, we quantify the extent to which these users act as bridges using betweenness centrality <ref type="bibr">(Freeman 1978)</ref>. We then measure multilingual users' communication influence. To do so, we ask how having a multilingual contact impacts the content, specifically URL domains and hashtags, that monolingual users share from other languages.</p><p>By analyzing information spread across country pairs with different dominant languages, we show how multilinguals' influence varies based on relationships between countries. <ref type="bibr">Takhteyev, Gruzd, and Wellman (2012)</ref> suggest that offline country relationships (e.g., economic agreements or migration patterns) impact transnational tie frequency online. Extending this conjecture, we hypothesize that the role of multilinguals is not uniform across all country pairs. 
Specifically, we expect multilinguals to have a bigger impact on country pairs that are more geographically distant or have weaker bonds, in which case these users would serve as gatekeepers of otherwise inaccessible information.</p><p>We additionally consider actual content, which affects the rate and shape of information diffusion <ref type="bibr">(Romero, Meeder, and Kleinberg 2011;</ref><ref type="bibr">Tsur and Rappoport 2012)</ref>. Content effects, however, cannot be captured by analyses of network structure alone. Our measures of multilinguals' communication influence overcome this limitation by enabling us to compare effects across content topics. We again hypothesize that multilingual users play a larger role in spreading content otherwise inaccessible to an international audience. For example, we expect multilinguals to have a bigger influence on spreading hashtags about local or national politics compared to widespread entertainment sensations.</p><p>Our contributions are as follows: in large-scale causal inference studies of European Twitter, we show that users who post in multiple languages play a vital structural role in, and exert communication influence on, information diffusion across languages. We compare how multilinguals' influence varies based on relationships between countries and find they have a greater effect among country pairs that are more geographically distant, especially for Western Europeans who post in Eastern European languages. We further measure how the effect of multilinguals is driven by content topic and find that they have the largest influence in spreading content about politics, developing health-related social movements, and job opportunities.</p><p>Understanding how multilinguals affect information diffusion has immense consequences for online platforms. 
For example, platforms may focus efforts on empowering multilingual users to spread information that can support knowledge-sharing, collaboration, crisis response, social progress, or other beneficial outcomes <ref type="bibr">(Eleta and Golbeck 2014)</ref>. On the other hand, they may want to discourage multilinguals from sharing dangerous content such as misinformation or abuse.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Related Work</head><p>We draw upon prior work on multilinguals' online behavior and information diffusion across social networks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Social Networks and Information Diffusion</head><p>Online social networks tend to have relatively few ties connecting distinct national and linguistic communities, leading to structural holes <ref type="bibr">(Burt 2004;</ref><ref type="bibr">Ugander et al. 2011;</ref><ref type="bibr">Hale 2012)</ref>. Spreading novel information across these communities thus relies on bridges that span structural holes <ref type="bibr">(Easley and Kleinberg 2010)</ref>. Information spreads the most quickly across long-range bridges, where the intermediary node greatly reduces the shortest path between two other nodes <ref type="bibr">(Kossinets, Kleinberg, and Watts 2008)</ref>. Bridges in a social network play a brokering role that adds to their social capital <ref type="bibr">(Burt 2004</ref>). We posit that multilinguals play an important role in cross-lingual information exchange because they serve as bridges between language communities.</p><p>Prior work measures how people influence others to propagate information <ref type="bibr">(Guille et al. 2013)</ref>. For example, people are more likely to share information their friends share, and overall, weak ties (e.g., acquaintances) are more responsible for spreading novel information than strong ties (e.g., close friends) <ref type="bibr">(Bakshy et al. 2012)</ref>. Although multilinguals are not necessarily weak ties, they similarly can act as bridges between communities. Methodologically, we draw from Eckles and <ref type="bibr">Bakshy (2020)</ref>, who use causal inference techniques, namely propensity score stratification, to measure peer influence on link-sharing behavior and show that their observational estimates are consistent with experimental results.</p><p>Understanding information diffusion and influential brokers impacts research across disciplines, including the development of activism and protest movements <ref type="bibr">(Gonz&#225;lez-Bail&#243;n et al. 
2011;</ref><ref type="bibr">Park, Lim, and Park 2015;</ref><ref type="bibr">Lee and Murdie 2020)</ref>, spread of misinformation <ref type="bibr">(Bridgman et al. 2021)</ref>, product adoption <ref type="bibr">(Talukdar, Sudhir, and Ainslie 2002)</ref>, and online abuse <ref type="bibr">(Sp&#246;rlein and Schlueter 2021)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>The Bridging Role of Multilinguals</head><p>Prior work has analyzed connections within and across language communities online <ref type="bibr">(Hale 2012,</ref><ref type="bibr">2014b;</ref><ref type="bibr">Samoilenko et al. 2016)</ref>. <ref type="bibr">Hale (2014b)</ref> argues that multilingual editors on Wikipedia are important for sharing knowledge across language communities and facilitate access to more diverse knowledge; they are more active than their monolingual counterparts overall and often write the same article across multiple language versions. However, <ref type="bibr">Kim et al. (2016)</ref> suggest that language remains a barrier because multilinguals are less engaged and less likely to edit complex content in a non-primary language. While online bloggers primarily link to content within the same language, cross-lingual links do exist, and some bloggers explicitly seek to connect distinct language communities <ref type="bibr">(Zuckerman 2008;</ref><ref type="bibr">Hale 2012)</ref>. Authors of blogs that bridge language communities often tend to be multilingual migrants or language learners <ref type="bibr">(Herring et al. 2007</ref>). On Twitter, multilinguals disseminate information across different public spheres during events such as the Arab Spring <ref type="bibr">(Bruns, Highfield, and Burgess 2013)</ref>.</p><p>Earlier work has also examined the structural role of multilinguals. <ref type="bibr">Kim et al. (2014)</ref> count edges between monolingual and multilingual "lingua" groups within multilingual regions on Twitter. They find that monolingual groups tend to be connected via multilingual groups, suggesting that multilingual users are bridges between different language communities. <ref type="bibr">Hale (2014a)</ref> similarly argues that multilinguals collectively play an important bridging role in the Twitter network. 
When all multilingual nodes are removed, the largest connected component becomes smaller and the number of small components increases, and these changes are significantly larger than if the same number of randomly selected monolinguals were removed <ref type="bibr">(Hale 2014a</ref>). <ref type="bibr">Eleta and</ref><ref type="bibr">Golbeck (2012, 2014)</ref> characterize multilingual users' ego-networks as gatekeepers (language communities connected by only a few users) and language bridges (tightly connected language groups). The authors suggest that gatekeepers are essential individuals for spreading information across linguistic, national, and cultural boundaries.</p><p>Other work has measured multilinguals' participation in information cascades (resharing chains) on Twitter. <ref type="bibr">Jin (2017)</ref> finds that multilingual behaviors of the original poster and their followers are predictive of information cascades crossing languages. <ref type="bibr">Chen et al. (2021)</ref> compare monolingual and multilingual users in cascade trees of COVID-19 information, develop measures that capture users' bridging roles based on how much they propagate information, and find that multilinguals have a bigger bridging role than monolinguals in nearly two-thirds of information cascades.</p><p>We build upon prior work in several important ways. While structural analyses suggest that multilinguals are important for cross-lingual information exchange, our causal inference studies quantify the effect of multilingual behaviors on both structural and content-sharing outcomes while accounting for possible confounds such as posting frequency and overall popularity. We further extend the evidence that online multilingual behaviors and connections vary across countries by identifying systematic variation in the effect of multilinguality across country pairs <ref type="bibr">(Kim et al. 2014)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Data</head><p>We construct an undirected network with Decahose data from 2012-2020, where edges are mutual "mentions" between users. We consider all pairs of European countries (C x , C y ) with dominant languages L x and L y , and select countries based on geography and Eurovision participation, a marker of cultural affiliation. Using location inference <ref type="bibr">(Compton, Jurgens, and Allen 2014)</ref>, we extract network subsets for all (C x , C y ) pairs where nodes are users from C x or C y . We use tweets from 2018-2020 for multilingualism and content measures due to limited text availability.</p><p>Definitions Each pair of countries (C x , C y ) is a multilingual country pair (MCP) with dominant languages L x and L y . Users who post only in L x are L x monolinguals and users who post in both L x and L y are (L x , L y ) multilinguals. Within each (C x , C y ) network we separately measure the role of (L x , L y ) multilinguals located in C x and C y , which we refer to as loci. In Figure <ref type="figure">1</ref>, (Germany, Turkey) is an MCP, and we measure the role of (German, Turkish) multilinguals within each locus Germany and Turkey.</p><p>Language identification Twitter posts present a challenge to automated language identification (LangID) due to their short length, informal style, and lack of ground truth labels <ref type="bibr">(Graham, Hale, and Gaffney 2014;</ref><ref type="bibr">Williams and Dagli 2017)</ref>. We use Twitter's built-in language detector because it is computationally efficient for a massive dataset, requires few additional resources, and is trained on in-domain data. 
<ref type="foot">1</ref> We validate this choice by comparing it to five popular LangID packages: langdetect<ref type="foot">foot_2</ref>, langid.py <ref type="bibr">(Lui and Baldwin 2012)</ref>, and CLD2<ref type="foot">foot_3</ref> use probabilistic models, while fastText <ref type="bibr">(Joulin et al. 2016</ref>) and CLD3<ref type="foot">foot_4</ref> use neural networks. As in prior LangID evaluations, we randomly sample 1K tweets from each of 32 countries, written in that country's dominant language as labeled by Twitter <ref type="bibr">(Graham, Hale, and Gaffney 2014;</ref><ref type="bibr">Lamprinidis et al. 2021)</ref>. Like <ref type="bibr">Graham, Hale, and Gaffney (2014)</ref>, we calculate intercoder agreement between all pairs of models. Table <ref type="table">S3</ref> (Supplemental Material) shows that Twitter's LangID agrees strongly with the other models, at even higher rates than they agree with each other.</p><p>Like most LangID models, Twitter's LangID has relatively low coverage of the world's languages and may struggle to distinguish similar languages <ref type="bibr">(Lui and Baldwin 2014;</ref><ref type="bibr">Williams and Dagli 2017)</ref>. We mitigate these limitations by selecting multilinguals and MCPs so that our analyses are not overly sensitive to individual prediction errors.</p><p>Identifying multilingual users Following <ref type="bibr">Eleta and Golbeck (2014)</ref>, an individual uses language L if at least 10% of their posts containing original content are written in L (i.e., excluding retweets but including quote tweets and replies). A user is monolingual if only one language passes the 10% threshold, and multilingual in the (C x , C y ) network if both L x and L y pass this threshold. Following <ref type="bibr">Kim et al. (2014)</ref>, we collect language information for all users with at least five posts in the Decahose. 
We additionally exclude users if over 20% of their tweets are in unidentified languages because our calculations for language use may be less accurate. We set these thresholds so that estimates of users' multilingualism are more reliable and robust to language prediction errors of individual tweets. We emphasize that our operationalization of multilingualism is based solely on users' expression on Twitter; we do not make claims about language knowledge or offline behavior.</p><p>Selecting network subsets for analysis From an initial set of 50 European countries, we exclude 18 after three filtering steps. First, we remove six micro-states with area under 500 km&#178; because the errors reported by Compton, Jurgens, and Allen (2014) suggest that location inference is less reliable within such small areas. Second, we restrict our analysis to countries with a single official or dominant language (i.e., used by at least 70% of the population), removing eight more countries.<ref type="foot">foot_5</ref> This step is necessary for our problem formulation, which focuses on information spread across borders. The study of users within highly-multilingual countries is beyond the scope of this paper. Third, we calculate the language distribution of tweets authored by users from each country and exclude four countries where the majority of tweets' languages are unidentified or no tweets are identified as written in the country's dominant language. See Supplemental Material Table <ref type="table">S1</ref> for the list of included and filtered countries.</p><p>Figure <ref type="figure">2</ref>: Our causal inference procedure follows established guidelines <ref type="bibr">(Ho et al. 2007;</ref><ref type="bibr">Stuart 2010)</ref>. We first estimate the propensity of each unit to receive treatment and use these scores to divide the sample into 25 strata. 
We then compare treated and untreated units within each stratum through a weighted regression to estimate the causal effect of the treatment.</p><p>We include an MCP (C x , C y ) if: (1) C x and C y have different dominant languages, and (2) the (C x , C y ) network is sufficiently large to rigorously estimate causal effects. Specifically, an MCP network is included if it has at least 100 L x and L y monolinguals, 20 (L x , L y ) multilinguals, and 100 users with at least 1 (L x , L y ) multilingual neighbor.<ref type="foot">foot_6</ref> See Supplemental Material Table <ref type="table">S2</ref> for included MCPs.</p><p>Descriptive statistics Our dataset contains information for 12.6M users from 32 countries. 8.0M users pass the thresholds for posting activity and language identifiability and are considered for analysis. Of these users, 1.1 million (13.6%) tweet in both their country's dominant language and another European language. Norway has the largest percentage of bilingual users (46.9%), while the United Kingdom has the smallest (3.3%). 232 MCP network subsets were selected based on our inclusion criteria, covering 30 European countries and 7.7 million unique users (Georgia and Moldova were considered, but no MCP network containing these countries was included). MCP networks range in size from 4.7K nodes (Latvia, Lithuania) to 3.6M nodes (United Kingdom, Turkey), with a median size of 319K nodes.</p></div>
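<div xmlns="http://www.tei-c.org/ns/1.0"><p>The user-level thresholds described above for identifying multilingual users can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the function names, the "und" label for unidentified tweets, and the simplification of applying all three thresholds to one list of per-post language labels (assumed to already exclude retweets) are assumptions made for the example.</p>
<p><![CDATA[
```python
from collections import Counter

MIN_POSTS = 5            # minimum number of posts per user (from the text)
USE_THRESHOLD = 0.10     # share of original posts needed to "use" a language
MAX_UNIDENTIFIED = 0.20  # users above this share of unidentified posts are excluded
UNIDENTIFIED = "und"     # assumed label for posts with no detected language


def languages_used(post_langs):
    """Return the set of languages a user 'uses', or None if the user is excluded."""
    n = len(post_langs)
    if n < MIN_POSTS:
        return None
    counts = Counter(post_langs)
    if counts[UNIDENTIFIED] / n > MAX_UNIDENTIFIED:
        return None
    return {lang for lang, c in counts.items()
            if lang != UNIDENTIFIED and c / n >= USE_THRESHOLD}


def classify(post_langs, lang_x, lang_y):
    """Label a user relative to the language pair (lang_x, lang_y)."""
    used = languages_used(post_langs)
    if used is None:
        return "excluded"
    if lang_x in used and lang_y in used:
        return "multilingual"
    if len(used) == 1:
        return "monolingual"
    return "other"  # e.g. uses two languages, but not this pair
```
]]></p>
<p>For instance, under these assumptions a user with eight German and two Turkish original posts counts as a (German, Turkish) multilingual, while a user with nineteen German posts and one Turkish post remains a German monolingual because Turkish falls below the 10% threshold.</p></div>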
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Problem Formulation</head><p>Adopting the Neyman-Rubin framework of potential outcomes <ref type="bibr">(Holland 1986)</ref>, we isolate the effect of multilingualism by addressing the counterfactual: how would a multilingual user u's influence be different if u were monolingual? Each study's details vary (Table <ref type="table">1</ref>), but all have the same idea: we define a set of units (users), some of whom receive a treatment indicating multilingual behavior, and we estimate an outcome variable related to influence on information exchange across languages.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Measures of Influence</head><p>Are multilinguals well-positioned in networks to spread information? In Study 1, we quantify their structural role by estimating the effect of multilingual posting on betweenness centrality. A measure of the proportion of shortest paths that must go through an intermediary node, betweenness centrality quantifies the extent to which the intermediary is an information broker <ref type="bibr">(Freeman 1978)</ref>. Studies 2 and 3 focus on multilinguals' influence on content flows across languages.</p><p>If an L x monolingual has a multilingual neighbor in the (C x , C y ) network, how much more likely are they to share L y content? We consider two forms of content: URL domains and hashtags <ref type="bibr">(Hong, Convertino, and Chi 2011)</ref>.</p><p>Betweenness centrality (Study 1) For each (C x , C y ) MCP, we consider a sample of users from locus C x who post in L x . The treatment is whether a user is multilingual in (L x , L y ), and the outcome is the user's betweenness centrality in the (C x , C y ) network, scaled by 10 6 and log-transformed.</p></div>
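<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the Study 1 outcome concrete, the following sketch computes normalized betweenness centrality on a small undirected, unweighted graph using Brandes' accumulation scheme. It is illustrative only: the adjacency-dictionary input and function names are assumptions, and because the text does not specify the exact form of the log transform, the log(1 + 10^6 b) variant in log_outcome is a guess rather than the transform used in the study.</p>
<p><![CDATA[
```python
import math
from collections import deque


def betweenness(adj):
    """Normalized betweenness centrality for an undirected, unweighted graph.

    adj maps each node to an iterable of its neighbors (edges listed both ways).
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies from the farthest nodes back to s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    n = len(adj)
    # Each unordered pair is visited from both endpoints, so dividing by
    # (n - 1)(n - 2) yields the fraction of node pairs brokered by v.
    scale = 1.0 / ((n - 1) * (n - 2)) if n > 2 else 1.0
    return {v: b * scale for v, b in bc.items()}


def log_outcome(b):
    """Assumed outcome transform: scale by 10^6, then log(1 + x)."""
    return math.log1p(1e6 * b)
```
]]></p>
<p>In a three-node path a, b, c, the middle node b lies on the only shortest path between a and c and therefore receives a normalized score of 1, while both endpoints receive 0.</p></div>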
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Domain sharing (Study 2)</head><p>We first identify URL domains associated with each language. We filter out the 200 overall most-frequent domains. Like stop word removal (Jurafsky and Martin 2000), excluded domains are largely uninformative for measuring content sharing across languages (e.g. instagram.com, unfollowspy.com). To account for sampling variability, we exclude domains that appear fewer than ten times <ref type="bibr">(Monroe, Colaresi, and Quinn 2008)</ref>. We then consider "domains associated with L x " to be the 100 domains most overrepresented in L x tweets (excluding retweets) relative to all European tweets based on the weighted log odds ratio with an informative Dirichlet prior <ref type="bibr">(Monroe, Colaresi, and Quinn 2008)</ref>. For each MCP (C x , C y ) and locus C x , the sample is the set of L x monolinguals from C x . A user u is treated if they have at least one (L x , L y ) multilingual neighbor, and the outcome is whether any of u's Decahose tweets contain at least one domain associated with L y , the language u does not use. Note that we exclude retweets for language-based measures, such as identifying multilinguals and language-specific domains/hashtags, but include retweets for information-sharing measures because they are an important component of information diffusion on Twitter.</p><p>Hashtag sharing (Study 3) We similarly identify hashtags associated with each language. Unlike domains, language-associated hashtags can change rapidly because they may refer to short-term events, such as elections, protests, or upcoming TV shows. We thus separate the Decahose data into 60 14-day intervals, and use the log-odds ratio with informative Dirichlet priors to identify H i x , the set of 100 hashtags most associated with L x in interval i, after again filtering out the 200 most-frequent hashtags (e.g. #blog, #radio) and those occurring fewer than ten times in i, and excluding retweets. 
A user u "shares" a hashtag from L x if any of u's Decahose tweets contain at least one hashtag h &#8712; H i x during i or the subsequent period i + 1 (including retweets). Resulting hashtags reflect entertainment, sports, politics, and everyday life; see Supplemental Material Table <ref type="table">S4</ref> for examples of language-specific domains and hashtags.</p></div>
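<div xmlns="http://www.tei-c.org/ns/1.0"><p>The weighted log-odds ratio with an informative Dirichlet prior (Monroe, Colaresi, and Quinn 2008) used to select language-associated domains and hashtags can be sketched as below. The sketch compares counts in a target language against counts in the remaining tweets and uses a caller-supplied background count table as the prior; how the comparison corpus and prior were actually constructed is not detailed in the text, so those choices, like the function names, are assumptions.</p>
<p><![CDATA[
```python
import math


def weighted_log_odds(target, other, prior):
    """z-scored log-odds of each item in `target` relative to `other`.

    All three arguments map items (domains or hashtags) to counts. `prior`
    supplies the informative Dirichlet pseudo-counts (background counts).
    """
    n_t = sum(target.values())
    n_o = sum(other.values())
    a0 = sum(prior.values())
    scores = {}
    for item, a in prior.items():
        if a <= 0:
            continue  # an item needs a positive pseudo-count
        y_t = target.get(item, 0)
        y_o = other.get(item, 0)
        # Difference of smoothed log-odds between the two corpora.
        delta = (math.log((y_t + a) / (n_t + a0 - y_t - a))
                 - math.log((y_o + a) / (n_o + a0 - y_o - a)))
        # Approximate variance of that difference.
        variance = 1.0 / (y_t + a) + 1.0 / (y_o + a)
        scores[item] = delta / math.sqrt(variance)
    return scores


def top_associated(target, other, prior, k=100):
    """The k items most overrepresented in `target`."""
    scores = weighted_log_odds(target, other, prior)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```
]]></p>
<p>A positive score marks an item as overrepresented in the target language; taking the 100 highest-scoring items after the frequency filters mirrors the selection step described above.</p></div>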
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Causal Inference Setup</head><p>We use propensity score stratification to estimate the causal effects of multilingual treatment variables on information diffusion outcomes <ref type="bibr">(Rosenbaum and Rubin 1983)</ref>. Our procedure, shown in Figure <ref type="figure">2</ref>, closely follows the guidelines provided by <ref type="bibr">Stuart (2010)</ref> and MatchIt <ref type="bibr">(Ho et al. 2007)</ref>.</p><p>We first fit a logistic regression model to calculate propensity scores, which represent the probability of receiving treatment as a function of the specified covariates (Rosenbaum and Rubin 1983). Covariates are shared for all studies and capture aspects of users' Twitter behavior that may affect our outcomes <ref type="bibr">(Stuart 2010)</ref>. These include verified status, network degree, follower and following counts, account age, number of tweets in the Decahose sample, "favorites" count, and post rate, all log-scaled. Covariates for each user are based on their most recent tweet in our sample.</p><p>We then conduct propensity score stratification where units are separated into 25 strata based on propensity scores and verify sufficient balance for all covariates with absolute standardized mean difference less than 0.1 <ref type="bibr">(Ho et al. 2007)</ref>. Although five strata have commonly been used in practice, more strata can yield less biased estimates for larger sample sizes <ref type="bibr">(Eckles and Bakshy 2020)</ref>. Propensity score estimation, stratification, and balance checks are performed using the MatchIt R package <ref type="bibr">(Ho et al. 2007)</ref>.</p><p>We compare treated and untreated units within each stratum by fitting a regression of each outcome on treatment status, weighted by the matching weights. 
In particular, we estimate the average treatment effect on the treated (ATT); this estimates how the outcome among treated units differs from the counterfactual scenario where they are not treated <ref type="bibr">(Stuart 2010)</ref>. For all studies, we include covariate adjustment to control for direct effects that pre-treatment covariates may have on the outcome.</p><p>Because each study defines units, treatment, and outcomes differently, the specific details of ATT estimation vary. In Study 1, we fit a linear regression after propensity score stratification to estimate the difference in (scaled and log-transformed) betweenness centrality. Here, multilingualism increases betweenness centrality if ATT &gt; 0. In Studies 2 and 3, the outcomes are binary so we use logistic regression to estimate ATT as an odds ratio, a measure used in prior work to compare domain and hashtag sharing behaviors across languages <ref type="bibr">(Hong, Convertino, and Chi 2011)</ref>. For some treated user u who is monolingual in L x , the ATT estimates the odds that u shares a domain (hashtag) associated with L y divided by the odds that u would share a domain (hashtag) associated with L y in the counterfactual scenario where u is not treated. ATT &gt; 1 indicates that having a multilingual (L x , L y ) contact increases an L x monolingual's likelihood of sharing domains (hashtags) from L y .</p><p>In each study, we estimate the ATT on aggregate data from both loci in all MCP networks in order to get a single causal estimate of the role of multilinguals in cross-lingual information exchange. We additionally estimate separate ATT values for units from each locus in each MCP network, and analyze the variation across locus-specific causal effects later. To ensure sufficient data across strata, we only estimate locus-specific causal effects for locus C x in MCP (C x , C y ) if there are at least 100 treated units. 
Furthermore, we only include locus-specific causal effects for Studies 2 and 3 if at least 100 units in the combined treated and untreated sample (all L x monolinguals) have an outcome of 1 (share at least one domain or hashtag from L y ).</p></div>
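<div xmlns="http://www.tei-c.org/ns/1.0"><p>The stratification step can be illustrated with a simplified estimator. The sketch below assumes that propensity scores have already been estimated, forms equal-sized strata from the ranked scores of all units, and averages within-stratum differences in mean outcomes weighted by the number of treated units. It is a textbook subclassification estimate of the ATT, not the weighted, covariate-adjusted regression fitted with MatchIt in the study, and the function name and stratum construction are assumptions.</p>
<p><![CDATA[
```python
def stratified_att(scores, treated, outcomes, n_strata=25):
    """Subclassification estimate of the ATT from precomputed propensity scores.

    scores:   estimated probability of treatment for each unit
    treated:  True/False treatment indicator for each unit
    outcomes: observed outcome for each unit
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    # Assign units to equal-sized strata by propensity score rank.
    strata = [[] for _ in range(n_strata)]
    for rank, i in enumerate(order):
        strata[rank * n_strata // n].append(i)

    weighted_sum = 0.0
    n_treated = 0
    for units in strata:
        t = [outcomes[i] for i in units if treated[i]]
        c = [outcomes[i] for i in units if not treated[i]]
        if not t or not c:
            continue  # no overlap in this stratum, so no comparison is possible
        diff = sum(t) / len(t) - sum(c) / len(c)
        weighted_sum += len(t) * diff  # weight by treated count for the ATT
        n_treated += len(t)
    if n_treated == 0:
        raise ValueError("no stratum contains both treated and untreated units")
    return weighted_sum / n_treated
```
]]></p>
<p>Because units in the same stratum have similar propensity scores, comparing them removes most of the confounding carried by the covariates; using more strata narrows the score range within each stratum, which is the rationale given above for using 25 strata on large samples.</p></div>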
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Overall Effects of Multilingual Behavior</head><p>All 3 studies support our hypothesis that multilinguals play an important role in cross-lingual information exchange.</p><p>Betweenness centrality (Study 1) Multilingual (L x , L y ) users have higher betweenness centrality in the (C x , C y ) network than their monolingual peers, suggesting that these users serve as local bridges and thus are well-positioned to spread novel information across the network <ref type="bibr">(Freeman 1978;</ref><ref type="bibr">Granovetter 1973)</ref>. The overall ATT is 0.034 (p &lt; 0.0001, with robust standard error estimation). In other words, multilingual posting increases the outcome of log-transformed betweenness by 0.034, which corresponds to a 13.5% increase in betweenness centrality.</p><p>URL domain sharing (Study 2) Having a multilingual (L x , L y ) friend increases the odds of a monolingual L x user sharing a domain from L y by a factor of 15.57 (p &lt; 0.0001).</p><p>For interpretability, we also estimate the average marginal effect with a marginal effects model on the logistic regression used for estimating the ATT as an odds ratio. We find that having a (L x , L y ) friend increases an L x monolingual's probability of sharing an L y domain by 20.0%.</p><p>Hashtag sharing (Study 3) Having a multilingual (L x , L y ) friend significantly increases the odds of an L x monolingual sharing a hashtag from L y by a factor of 3.98 (p &lt; 0.0001). Through estimating the average marginal effect, we find that this corresponds to an increase in the probability of sharing an L y hashtag by 32.7%. While the odds ratio for hashtag sharing is lower than for domains, the probability increase is greater because extremely low domain-sharing probabilities result in inflated odds ratios.</p><p>Table <ref type="table">2</ref> summarizes locus-specific causal effect estimates. 
Due to minimum inclusion criteria, we only estimate ATTs for a subset of all 464 loci corresponding to 232 MCPs. All three treatments increase information-sharing outcomes in about half of the loci and decrease outcomes in under 4% of loci. Locus-specific estimates reinforce that multilinguals facilitate information exchange across language boundaries. However, the distribution of positive, negative, and insignificant effects across loci as shown in Table <ref type="table">2</ref> suggests wide variation across MCPs and loci.</p></div>
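<div xmlns="http://www.tei-c.org/ns/1.0"><p>The observation that a lower odds ratio can accompany a larger probability increase follows from how odds behave at low baseline probabilities. The short example below applies the two reported odds ratios to baseline probabilities that are purely hypothetical and chosen only for illustration; it is not the marginal effects model used in the study.</p>
<p><![CDATA[
```python
def probability_after(p0, odds_ratio):
    """Probability implied by applying an odds ratio to a baseline probability."""
    if not 0.0 < p0 < 1.0:
        raise ValueError("p0 must lie strictly between 0 and 1")
    odds = odds_ratio * p0 / (1.0 - p0)
    return odds / (1.0 + odds)


# Odds ratios are taken from the text; the baseline probabilities are hypothetical.
examples = [("domains", 0.01, 15.57), ("hashtags", 0.20, 3.98)]
for label, p0, odds_ratio in examples:
    p1 = probability_after(p0, odds_ratio)
    print(f"{label}: baseline {p0:.2f} -> {p1:.3f} (increase {p1 - p0:.3f})")
```
]]></p>
<p>With these illustrative baselines, the rarer outcome shows the much larger odds ratio but the smaller absolute change in probability, which matches the pattern described for domains and hashtags.</p></div>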
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Variation across Country Pairs</head><p>All three studies show that multilingual behaviors increase cross-lingual information exchange. However, locus-specific causal estimates indicate substantial heterogeneity in the magnitude of their effects across MCPs and loci. We conduct regression analyses to characterize how geographic, demographic, economic, political, and linguistic relationships between country pairs correlate with the effects of multilingual behaviors on European Twitter.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Regression Setup</head><p>We fit linear regression models to characterize how the causal effects of multilinguals vary across countries. For a given MCP (C x , C y ) and locus C x , we define 3 dependent variables: the estimated causal effects (ATT) of multilingual behavior in each of the 3 described studies. Independent variables capture relationships between C x and C y .</p><p>Demographic variables include the population ratio of C x to C y from 2019 based on CEPII's Gravity Dataset <ref type="bibr">(Head and Mayer 2014;</ref><ref type="bibr">Conte, Cotterlaz, and Mayer 2021)</ref>. Using 2017 World Bank migration data, we consider the fraction of C x 's population born in C y , C y 's population born in C x , and each country's population who are foreign-born. <ref type="foot">7</ref></p><p>Geographic variables are the distance (in km) between C x and C y 's population centers and the time difference: the number of hours C y is ahead (further east) of C x .</p><p>Economic variables include i.) the ratio of C x to C y 's GDP per capita, ii.) whether C x and C y are in an RTA (Regional Trade Agreement, which includes the EU), and iii.) trade flow between C x and C y averaged over both countries' reports and both directions, normalized by the total population of both countries. Geographic and economic variables use 2019 data from CEPII's Gravity Dataset.</p><p>Political variables, specifically material conflict between C x and C y , are determined by querying the GDELT event database <ref type="bibr">(Leetaru and Schrodt 2013)</ref>. These include the percent of C x 's external conflict actions inflicted on C y , and C y 's external conflict actions inflicted on C x .</p><p>The last fixed effect is linguistic distance between L x and L y using Glottolog <ref type="bibr">(Hammarstr&#246;m et al. 2021)</ref>. Inspired by <ref type="bibr">Samoilenko et al. 
(2016)</ref>'s measurement of shared language families, we define a 4-level measurement of linguistic distance between L x and L y : i.) no relationship (e.g. Spanish and Hungarian), ii.) in the same primary family (e.g. German and Polish are Indo-European), iii.) in the same branch (e.g. English and Swedish are Germanic), and iv.) in the same sub-branch (e.g. Spanish and Italian are Romance). <ref type="foot">8</ref></p><p>We avoid multicollinearity issues by ensuring that every variable's variance inflation factor is under 4. We thus exclude highly correlated variables, such as EU membership and each country's population. We weight each regression model by the number of treated units from each locus. Finally, we scale all variables by z-score to facilitate direct comparisons.</p></div>
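<div xmlns="http://www.tei-c.org/ns/1.0"><p>The 4-level linguistic measure described above can be sketched as a count of shared leading classification levels. This is a minimal illustration rather than our exact implementation: the lineage tuples are simplified assumptions written by hand (not copied from Glottolog), the function name is hypothetical, and coding a higher level as "more closely related" is an arbitrary choice for this sketch.</p>
<ab><![CDATA[
```python
# Hand-written (family, branch, sub-branch) lineages. These simplified paths
# are assumptions for illustration and are NOT extracted from Glottolog.
LINEAGES = {
    "Spanish": ("Indo-European", "Italic", "Romance"),
    "Italian": ("Indo-European", "Italic", "Romance"),
    "English": ("Indo-European", "Germanic", "West Germanic"),
    "Swedish": ("Indo-European", "Germanic", "North Germanic"),
    "German": ("Indo-European", "Germanic", "West Germanic"),
    "Polish": ("Indo-European", "Balto-Slavic", "Slavic"),
    "Hungarian": ("Uralic", "Hungarian", "Hungarian"),
}

LEVEL_NAMES = ("no relationship", "same primary family", "same branch", "same sub-branch")


def relatedness_level(lang_x: str, lang_y: str) -> int:
    """Count how many leading lineage levels two languages share (0 to 3)."""
    level = 0
    for node_x, node_y in zip(LINEAGES[lang_x], LINEAGES[lang_y]):
        if node_x != node_y:
            break
        level += 1
    return level


# The four example pairs named in the text.
for pair in [("Spanish", "Hungarian"), ("German", "Polish"),
             ("English", "Swedish"), ("Spanish", "Italian")]:
    print(pair, "->", LEVEL_NAMES[relatedness_level(*pair)])
```
]]></ab></div>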
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Regression Results</head><p>Geography Multilinguals have a larger effect on cross-lingual information exchange in MCPs where C x and C y are further away from each other (Table <ref type="table">3</ref>). We visualize this pattern in Figure <ref type="figure">3</ref>, which shows the relationship between geography and effects of multilinguals on betweenness centrality and domain sharing (see Figure <ref type="figure">S4</ref> in Supplemental Material for the hashtag sharing map).</p><p>In Figure <ref type="figure">3</ref>, the effect of multilinguals in MCP (C x , C y ) and locus C x is shown as a directed edge from C y to C x . Only edges corresponding to significant estimates are drawn. Negative effects are red, positive effects are blue, and greater magnitude is represented with darker and thicker edges. For example, the dark blue edge from Russia to Spain in both maps indicates that Russian-Spanish multilinguals are especially important for bringing Russian information to Spain. In contrast, the faint edge from Portugal to Spain</p><p>We believe stronger treatment effects across longer distances are due to information accessibility. Information from faraway places is not readily accessible for monolinguals, so they may need to rely more on their multilingual friends to serve as information brokers. On the other hand, information between nearby countries such as Norway and Sweden may be more easily accessible with more channels for diffusion, possibly via more multilinguals, so users rely less on individual multilinguals to spread information.</p><p>In addition to distance, the time difference between C x and C y significantly predicts multilinguals' effects (Table <ref type="table">3</ref>). Multilinguals have the largest impact when C y is 2-3 hours ahead (i.e., further east) of C x (see Figure <ref type="figure">S1</ref> in Supplemental Material). 
In other words, multilinguals have a large impact on spreading information from Eastern European languages to Western Europe. This asymmetric East-West pattern is visible in both maps of Figure <ref type="figure">3</ref> (e.g. the many dark blue outgoing links from Russia indicate a large influence of multilingual users from other countries who post in Russian).</p><p>Why are Western European users of Eastern European languages so influential in cross-lingual information exchange? Fully answering this question is left for future work, but we speculate that it is a consequence of historical inaccessibility to information across strict Cold War-era borders. Additionally, offline connections may explain these users' online role as bridges; Eastern migrants in Western Europe have been characterized by more transnational and circular offline networks since the early 2000s <ref type="bibr">(Favell 2008)</ref>.</p><p>Demographics Multilingualism increases betweenness centrality more for users in smaller countries who post in more populous countries' languages, but this relationship is not significant for content sharing outcomes. While multilinguals in smaller countries are better positioned within MCP networks to spread novel information, they do not necessarily "import" information from the larger to smaller country.</p><p>When C y has a higher proportion of foreign-born residents, (L x , L y ) multilingualism has a greater impact on users' structural role in C x , but not communication influence. In fact, having a multilingual friend has a negative impact on sharing a domain from L y when C y has more foreign-born residents; perhaps this is because shared links from L y feature more content with multicultural or international appeal, so people rely less on multilinguals as information brokers. 
We observe no significant effects of the foreign-born population of C x on either the structural role or communication influence of multilinguals.</p><p>Economics and politics Multilinguals' effects are associated with GDP per capita: multilinguals in C x have more influence when C x is wealthier than C y (although this is not significant for domain sharing). However, government-level relationships including national trade agreements, trade flows, and political conflict are not significant predictors.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Linguistic similarity</head><p>We observe mixed results for linguistic distance. Our information accessibility hypothesis suggests (L x , L y ) multilinguals would play a larger role when L x and L y are very different because their monolingual peers would rely more heavily on them. This pattern appears to hold for communication influence outcomes but is not significant. Contrary to our hypothesis, multilinguals' structural role is amplified when L x and L y are closely related. Though we do not have a clear explanation for this pattern, it may be driven by deeper differences in the structures of MCP networks connecting countries with similar dominant languages <ref type="bibr">(Herring et al. 2007</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Variation across Topics</head><p>Different aspects of content, such as topic, framing, sentiment, and subjectivity, could amplify or hinder multilinguals' influence. Therefore, here we extend Study 3 to investigate how the role of multilinguals depends on the topic of shared hashtags using unsupervised topic modeling <ref type="bibr">(Kim et al. 2014;</ref><ref type="bibr">Jin 2017)</ref>. Specifically, for each MCP (C x , C y ) we measure how having a multilingual (L x , L y ) friend affects the odds of an L x monolingual from C x sharing a topic-specific hashtag from C y .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Measuring Multilinguals' Role across Topics</head><p>Identifying topic-specific hashtags We train a multilingual contextualized topic model (CTM) to identify topics <ref type="bibr">(Bianchi et al. 2021)</ref>. This approach feeds multilingual sentence-BERT embeddings <ref type="bibr">(Reimers and Gurevych 2019)</ref> into a variational autoencoder topic model, supporting zero-shot topic prediction for texts in languages unseen during training <ref type="bibr">(Srivastava and Sutton 2017;</ref><ref type="bibr">Bianchi et al. 2021)</ref>. Crucially, the CTM uses the same topics in all languages, making direct comparisons across languages possible. The CTM is trained for 20 epochs on a random sample of 1M English-language tweets from the European decahose data. We set the total number of topics to 50 and retain default values for all other hyperparameters. We use the trained CTM to predict topic distributions for all tweets containing any hashtag highly associated with any language in any time interval. Each tweet is assigned to the single topic with the highest probability, and each hashtag is then assigned to a single topic based on the most frequent topic of the tweets in which it appears.</p><p>We then manually inspect the ten most-frequent hashtags per topic to identify topics of interest, where hashtags are coherent, meaningful, and reflect different types of content. These 15 topics account for 52.1% of European tweets from our dataset, and 66.8% of unique hashtags associated with any language in any time interval, which rises to 82.5% when accounting for hashtag frequency (see topic distributions in Figure <ref type="figure">S3</ref> in Supplemental Material). Brief topic descriptions are in Table <ref type="table">4</ref>. To understand how content shapes multilinguals' influence, we separate the 15 topics into four macro-categories: entertainment, politics, sports, and promotion (e.g., giveaways). 
The distribution of hashtag macro-categories across languages is shown in Figure <ref type="figure">S2</ref> in the Supplemental Material.</p><p>Evaluation We evaluated our multilingual hashtag topic assignment method with intrusion tests, following common practice <ref type="bibr">(Hoyle et al. 2021)</ref>. Four hashtags were sampled from each topic t with frequency weighting, and one hashtag was sampled from one of the 49 other topics based on frequency. Annotators tried to identify which hashtag does not belong to t. We evaluated the 15 topics of interest in English, German, Spanish, Italian, and Turkish, selected based on annotators' language proficiencies. We conducted 10 intrusion tests for each topic in English, and 5 for each topic in the other languages. Three annotators completed the intrusion test for English hashtags with an average accuracy of 0.73 and inter-annotator agreement of 0.67 (Krippendorff &#945;). Among tasks where at least two annotators selected the same intruder, accuracy rose to 0.78, and further rose to 0.89 for tasks where all three annotators agreed.</p><p>For each MCP (C x , C y ) and locus C x , units are L x monolinguals from C x and the treatment is having at least one multilingual (L x , L y ) contact. For each topic t, the outcome is whether a user shares a hashtag that belongs to topic t and is associated with L y . Our analysis focuses on overall causal effect estimates for all 15 topics. We also estimate locus-specific effects for each macro-category, with a summary of results and maps in the Supplemental Material (Figure <ref type="figure">S4</ref>). </p></div>
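<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two-step hashtag assignment described in this section (each tweet takes its most probable topic, then each hashtag takes the most frequent topic among its tweets) can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name and toy data are hypothetical, the topic distributions would in practice come from the trained CTM, and the tie-breaking rule is an arbitrary choice not specified in the text.</p>
<ab><![CDATA[
```python
from collections import Counter, defaultdict


def assign_hashtag_topics(tweets):
    """Map each hashtag to a single topic index.

    tweets: iterable of (hashtags, topic_probs) pairs, where hashtags is a
    list of strings and topic_probs is a list of per-topic probabilities.
    """
    votes = defaultdict(Counter)
    for hashtags, topic_probs in tweets:
        # Step 1: assign the tweet to its single most probable topic.
        tweet_topic = max(range(len(topic_probs)), key=topic_probs.__getitem__)
        # Count each hashtag once per tweet, even if it is repeated.
        for tag in set(hashtags):
            votes[tag][tweet_topic] += 1
    # Step 2: assign each hashtag to the most frequent topic of its tweets.
    # Ties fall to the topic seen first (an assumption of this sketch).
    return {tag: counts.most_common(1)[0][0] for tag, counts in votes.items()}


# Toy input with three topics; the probabilities are made up for illustration.
toy_tweets = [
    (["#metoo"], [0.1, 0.7, 0.2]),
    (["#metoo", "#8m"], [0.2, 0.6, 0.2]),
    (["#metoo"], [0.8, 0.1, 0.1]),
    (["#8m"], [0.1, 0.8, 0.1]),
    (["#kpop"], [0.9, 0.05, 0.05]),
]
print(assign_hashtag_topics(toy_tweets))
```
]]></ab></div>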
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Estimating causal effects</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Results</head><p>Figure <ref type="figure">4</ref> shows the effect of multilinguals by hashtag topic and macro-category. Three key findings support our hypothesis that multilinguals play a greater role in spreading information that is otherwise less accessible to their peers.</p><p>First, multilinguals have a greater communication influence on the cross-lingual diffusion of political content than entertainment. Political discourse likely occurs more in regional or country-centric public spheres <ref type="bibr">(Sch&#252;nemann 2020)</ref>, so there is a greater reliance on multilingual individuals to broker political information across borders. Correspondingly, of all political topics, multilinguals play a smaller role for topics with widespread transnational popularity and awareness, such as Equality, which includes hashtags for global gender equality movements like #metoo and #8m, and International Politics, which includes country names and non-European political organizations. Contrary to most political hashtags, entertainment hashtags often reflect globally-popular phenomena (e.g., K-pop fandoms) despite being associated with specific languages, and more cross-lingual information cascades involve entertainment content <ref type="bibr">(Jin 2017)</ref>. We believe that individual multilingual contacts play a smaller role in entertainment because there are more ways for that content to spread.</p><p>Second, we argue that multilingual individuals are crucial for nascent social movements to gain global traction, but have a relatively small influence within well-established transnational movements. Figure <ref type="figure">4</ref> indicates that multilinguals have a large impact on spreading health-related information, which includes hashtags advocating for mental health awareness, COVID-related activism, and disability rights. 
In contrast to racial and gender equality movements reflected in the Equality topic, organizing for disability rights gained traction as a social movement later than for race and gender both offline and on Twitter, where the public sphere about disability is still growing <ref type="bibr">(Scotch 1988;</ref><ref type="bibr">Sarkar et al. 2021)</ref>. The initial adoption of burgeoning social movements across countries relies heavily on direct contacts, such as multilingual friends <ref type="bibr">(McAdam and Rucht 1993)</ref>. Information diffusion about more well-established social movements does not depend on multilingual bridges because it occurs via many channels of communication, including news and television media <ref type="bibr">(McAdam and Rucht 1993)</ref>.</p><p>Third, multilinguals play an especially important role in sharing information about job searches and career opportunities. This parallels <ref type="bibr">Granovetter (1973)</ref>'s argument about the strength of weak ties for job-seeking purposes. Even though monolingual users' ties with multilinguals are not necessarily weak, multilinguals similarly serve as bridges between different parts of a social network and thus facilitate access to novel information, such as job opportunities that users may not have otherwise been aware of.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Discussion</head><p>Gaining a complete picture of global information diffusion requires understanding how information crosses languages. We design three studies to investigate how multilingual users participate in this process, which we use to quantify their structural role and communication influence in information exchange across European languages on Twitter. For each pair of countries with different dominant languages, we construct networks where users are connected if they have mutually "mentioned" each other. We use these networks in Study 1 to quantify the extent to which multilinguals' positions in these networks facilitate spreading novel information, as measured by betweenness centrality. In Studies 2 and 3, we quantify how having a multilingual contact influences the odds of monolingual users posting content from the other country's dominant language.</p><p>Results from all three studies show that multilinguals play an outsize role in cross-lingual information exchange compared to their monolingual peers. Effects vary widely but systematically across country pairs and topics. We conduct regression analyses to measure how the role of multilinguals is associated with demographic, geographic, economic, political, and linguistic aspects of the relationship between country pairs. To compare multilinguals' influence across topics, we augment our study design for hashtag sharing with multilingual contextualized topic modeling.</p><p>In general, multilingual individuals have a greater influence on the spread of information that is otherwise less accessible to their monolingual peers, as they play more of a gatekeeping role. Multilinguals have a greater effect on information diffusion between dominant languages of countries that are geographically far apart, with Western European multilinguals who post in Eastern European languages having an especially big influence. 
We identify a similar pattern for topics, where multilinguals have greater influence on cross-lingual information exchange for topics discussed in more restricted public spheres: national or regionally-oriented politics over entertainment which can have international appeal, nascent health-related social movements over established racial and gender equality movements, and job opportunities previously known only to small communities.</p><p>We acknowledge that this work has important limitations. First, our studies do not account for multilinguals who use minority languages (e.g., Basque) or reside in highly multilingual countries (e.g., Switzerland). Imperfect performance of location inference and language detection also limited the set of countries and languages studied. Furthermore, we make the simplifying assumption that tweets are written in one language, which does not adequately account for code-mixing within posts and the users who engage in such practices <ref type="bibr">(Androutsopoulos 2015)</ref>. While code-mixed tweets are a relatively small percent of Twitter communication <ref type="bibr">(Rijhwani et al. 2017)</ref>, accurately recognizing these tweets at scale has proven challenging due to the absence of labeled data for training models <ref type="bibr">(Jurgens, Tsvetkov, and Jurafsky 2017)</ref>. As the performance and efficiency of language detection of code-mixed tweets improve, we anticipate that incorporating such information would be fruitful, and analyzing the relationship between code-mixing strategies and information diffusion could yield interesting theoretical insights.</p><p>To avoid making assumptions about people's offline language usage or competence, we intentionally define multilinguals based on their performance on Twitter. However, this presents a significant limitation: users who only tweet in one language but understand multiple may also play an important, and perhaps different, role in information diffusion. 
Future research could employ different methodologies to highlight these users, such as linking social media activity with survey data about users' language backgrounds.</p><p>Further research can also improve upon our study designs. We adopt a traditional causal inference setup, which considers treatment status binary to emulate randomized experiments. Thus, all of our studies involve collapsing underlying continuous variables into binary indicators. A possible next step would involve adapting our studies to account for continuous treatments; this would facilitate investigation of how cross-lingual information exchange is impacted by a user's degree of multilingualism (Study 1) or the number and/or strength of a user's ties to multilinguals (Studies 2 and 3).</p><p>Beyond addressing these limitations, there are numerous avenues for future work. For example, we adopt a microscopic perspective on information diffusion by examining how multilingualism impacts individual users' roles in information diffusion; we focus on local influence because it is more precise and less random than observations of information cascades <ref type="bibr">(Bakshy et al. 2011</ref>). Nevertheless, an interesting extension that takes a macroscopic perspective, perhaps involving simulations of cross-lingual information cascades <ref type="bibr">(Chen et al. 2021)</ref>, could help contextualize how these individual-level effects, in aggregate, shape the global flow of information. Another future direction would involve considering other forms of information that may spread via different mechanisms than URLs and hashtags, such as meme templates, images, videos, and text outside of hashtags. There is likely variation in the role of multilinguals across semantic dimensions beyond topic, such as emotional valence or misinformation. 
Finally, future research can assess the generalizability of our findings beyond the scope of European Twitter by applying our methodology to study other regions, languages, and platforms.</p></div>
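<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a concrete illustration of the network construction and binary treatment summarized in this discussion, the sketch below keeps only reciprocated mentions as undirected edges and flags users with at least one multilingual neighbor. It is a simplified sketch, not our released code: the function names, user names, and input format are hypothetical, and the inclusion thresholds used in the studies are omitted.</p>
<ab><![CDATA[
```python
def mutual_mention_edges(mentions):
    """Return undirected edges between users who have mentioned each other.

    mentions: iterable of (author, mentioned_user) pairs.
    """
    directed = {(a, b) for a, b in mentions if a != b}
    return {frozenset((a, b)) for a, b in directed if (b, a) in directed}


def has_multilingual_contact(user, edges, multilingual_users):
    """Binary treatment: True if the user has at least one multilingual neighbor."""
    for edge in edges:
        if user in edge:
            (neighbor,) = edge - {user}
            if neighbor in multilingual_users:
                return True
    return False


# Toy data (hypothetical users). "bo" is the only multilingual user.
toy_mentions = [
    ("ana", "bo"), ("bo", "ana"),   # reciprocated: forms an edge
    ("ana", "cy"),                  # one-directional: no edge
    ("dee", "bo"), ("bo", "dee"),   # reciprocated: forms an edge
]
toy_edges = mutual_mention_edges(toy_mentions)
for user in ("ana", "cy", "dee"):
    print(user, has_multilingual_contact(user, toy_edges, {"bo"}))
```
]]></ab>
<p>A continuous-treatment extension, as suggested above, could replace the boolean with a count of multilingual neighbors.</p></div>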
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Broader Impact</head><p>Understanding the role of multilinguals in information diffusion has immense consequences. Platforms like Twitter can empower multilinguals to spread information that supports positive outcomes such as knowledge-sharing, collaboration, crisis response, or social progress, thus enabling different language communities to benefit from a truly global social network <ref type="bibr">(Eleta and Golbeck 2014;</ref><ref type="bibr">Hale 2014a</ref>). Our study not only highlights this potential but also identifies how it varies across topics and with respect to the geographical, linguistic, and political relationship between the countries. For instance, our research suggests that multilinguals can be better utilized to spread political news as opposed to entertainment. We also see that their importance is more pronounced for supporting information spread across countries further away from each other. Such findings not only highlight the contexts where multilinguals already play an important role but also help us identify the barriers for cross-lingual diffusion; in such situations, platforms may benefit more from technologies such as machine translation.</p><p>Our research can also help platforms address dangerous consequences of global networks by focusing efforts on nudging multilinguals to mitigate the spread of harmful information such as misinformation, conspiracy theories, or online abuse. Past network science research shows the value of betweenness centrality in identifying nodes that can limit the spread of such information <ref type="bibr">(Golovchenko, Hartmann, and Adler-Nissen 2018)</ref>. Here, we show that multilingual nodes tend to have high betweenness centrality. 
Furthermore, our study shows that multilinguals play a particularly important role in the spread of political topics, common targets for malicious actors aiming to spread propaganda and disinformation.</p><p>Despite the potential for positive impact, we acknowledge the ethical risks of this work. Rather than stem the flow of harmful content, our work may inspire malicious agents to target and manipulate multilinguals into propagating such information. In addition, our focus on users of politically and socially-dominant language varieties and use of automated language detection excludes people whose posts contain endangered or minority languages, non-prestige dialects of dominant languages, or code-mixing. Although our work does not present direct harm to individuals, these decisions systematically exclude marginalized groups whose online behavior deserves equal consideration.</p><p>To promote transparency and future research, we publicly share data, code, and models but take steps to preserve user privacy. The datasheets shared for causal effect estimation include only variables necessary to replicate our results. We do not share user IDs, raw text, or other personally identifiable information. While the location inference tool used presents a privacy risk by inferring users' specific geocoordinates, we only store information at the country level.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Conclusion</head><p>By developing a set of causal inference studies that measure users' structural role and communication influence, we show that multilingually-posting users on European Twitter are particularly important for information diffusion across languages. These users have an especially large influence in situations where they serve more as gatekeepers in information flow, particularly in spreading information from places and topics that are otherwise inaccessible to their monolingual peers. This work is crucial for understanding how information is shared around the world, and has implications for platforms to support beneficial consequences of global social networks while mitigating potential harms. Publicly available code, models, and aggregated data can be found at: <ref type="url">https://github.com/juliamendelsohn/bridging-nations</ref>.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>Proceedings of the Seventeenth International AAAI Conference on Web and Social Media (ICWSM 2023)</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_1"><p>https://blog.twitter.com/engineering/en_us/a/2015/evaluating-language-identification-performance</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_2"><p>https://github.com/shuyo/language-detection</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_3"><p>https://github.com/CLD2Owners/cld2</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_4"><p>https://github.com/google/cld3</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_5"><p>Data for the geographic area and population-level language usage is from the CIA World Factbook.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_6"><p>While we deem such thresholds necessary for precise causal effect estimation, we acknowledge the arbitrariness of these numbers as a limitation of our study.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_7"><p>worldbank.org/en/topic/migrationremittancesdiasporaissues/brief/migration-remittances-data</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_8"><p>We do not use graph-based measurements of distance because there is wide variation across language branches' structures due to an uneven interest by linguists across languages.</p></note>
		</body>
		</text>
</TEI>
