<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>SAT-Geo: A social sensing based content-only approach to geolocating abnormal traffic events using syntax-based probabilistic learning</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>03/01/2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10340492</idno>
					<idno type="doi">10.1016/j.ipm.2021.102807</idno>
					<title level='j'>Information Processing &amp; Management</title>
<idno>0306-4573</idno>
<biblScope unit="volume">59</biblScope>
<biblScope unit="issue">2</biblScope>					

					<author>Lanyu Shang</author><author>Yang Zhang</author><author>Christina Youn</author><author>Dong Wang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Social sensing has become an emerging and pervasive sensing paradigm to collect timely observations of the physical world from human sensors. In this paper, we study the problem of geolocating abnormal traffic events using social sensing. Our goal is to infer the location (i.e., geographical coordinates) of the abnormal traffic events by exploring the location entities from the content of social media posts. Two critical challenges exist in solving our problem: i) how to accurately identify the location entities related to the abnormal traffic event from the content of social media posts? ii) how to accurately estimate the geographic coordinates of the abnormal traffic event from the set of identified location entities? To address the above challenges, we develop a Social sensing based Abnormal Traffic Geolocalization (SAT-Geo) framework to accurately estimate the geographic coordinates of abnormal traffic events by exploring the syntax-based patterns in the content of social media posts and the geographic information associated with the location entities from the social media posts. We evaluate the SAT-Geo framework on three real-world Twitter datasets collected from New York City, Los Angeles, and London. Evaluation results demonstrate]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>With the proliferation of mobile devices and ubiquitous network connections, social sensing has become an emerging and pervasive sensing paradigm to collect timely observations of the physical world from human sensors <ref type="bibr">[1]</ref>. Examples of social sensing applications include post-disaster damage assessment with social media user posts <ref type="bibr">[2]</ref>, urban environment monitoring using input from citizen scientists <ref type="bibr">[3]</ref>, and smart health condition tracing using wearable devices <ref type="bibr">[4]</ref>.</p><p>Real-time traffic monitoring is an important application of social sensing in intelligent transportation systems (ITS), where timely social media posts are collected to acquire real-time traffic situation awareness (e.g., road congestion, traffic accident) of an urban area. Compared to traditional infrastructure-based solutions (e.g., surveillance cameras, radar sensors), social sensing provides an infrastructure-free solution that is more pervasive and scalable <ref type="bibr">[5]</ref>. In this paper, we focus on the problem of identifying the geographic coordinates (i.e., latitude and longitude coordinates) of abnormal traffic events reported on social media. We refer to this problem as social sensing based abnormal traffic event geolocalization. 
The identified geographic coordinates of abnormal traffic events can be used to provide effective precautions (e.g., traffic accident alerts) and timely responses (e.g., emergency medical rescue for severe traffic accidents) for improving traffic safety and efficiency <ref type="bibr">[6]</ref>.</p><p>Many efforts have been made to study the problem of event localization using social media data <ref type="bibr">[7,</ref><ref type="bibr">8,</ref><ref type="bibr">9,</ref><ref type="bibr">10,</ref><ref type="bibr">11,</ref><ref type="bibr">12,</ref><ref type="bibr">13,</ref><ref type="bibr">14]</ref>. These solutions fall mainly into two categories: geotagging-based solutions <ref type="bibr">[7,</ref><ref type="bibr">8,</ref><ref type="bibr">15]</ref> and content-based solutions <ref type="bibr">[9,</ref><ref type="bibr">10,</ref><ref type="bibr">11,</ref><ref type="bibr">12,</ref><ref type="bibr">13,</ref><ref type="bibr">14]</ref>. However, these solutions are insufficient to fully address the problem of fine-grained abnormal traffic event geolocalization.</p><p>First, the geotagging-based solutions that leverage the geotagging information associated with social media posts (e.g., the "coordinates" field of a tweet<ref type="foot">1</ref>) often suffer from two critical limitations. On one hand, the geotagging information of social media posts is sparse due to the privacy concerns of users (e.g., fewer than 0.5% of tweets have geotags <ref type="bibr">[16]</ref>). On the other hand, the geotagging information of a social media post may not always represent the real geolocation of the reported event (e.g., a user may travel a few blocks away from the accident site before he/she finishes editing the post) <ref type="bibr">[17]</ref>. Second, existing content-based solutions are also impractical for our problem. 
This is because these solutions often require auxiliary information that is not always available (e.g., private user activities, a user's previous posts), and the inferred event locations are often inaccurate (e.g., current solutions can only reduce the average error distance to about 10 km <ref type="bibr">[15]</ref>). Therefore, geolocalizing abnormal traffic events using social sensing data feeds remains a challenging problem.</p><p>In this paper, we develop a social sensing based solution to directly infer the geographic coordinates of abnormal traffic events from the content of social media data (e.g., tweets). Our design is to first identify the location entities in the social media post (i.e., the named entities in a social media post that indicate the location of the abnormal traffic event) by exploring the syntax of the post content. We then leverage the identified location entities to accurately infer the geographic coordinates of the abnormal traffic event by investigating the geographic information of these location entities and their relations. An example of our abnormal traffic event geolocalization problem is shown in Figure <ref type="figure">1</ref>. Our goal is to infer the event geolocation (e.g., the geographic coordinates marked with the red pin in Figure <ref type="figure">1</ref>(b)) using the location entities identified in the text of the tweet (e.g., the location entities highlighted in red boxes in Figure <ref type="figure">1(a)</ref>). However, it is not a trivial task to accurately geolocate the abnormal traffic event from the content of social media posts due to two important challenges that are elaborated on below. <note place="foot" n="1"><ref type="url">https://developer.twitter.com/en/docs/tutorials/filtering-tweets-by-location.html</ref></note></p><p>Content-only Location Entity Inference. The first challenge lies in how to accurately identify the location entities related to the abnormal traffic event from the content of social media posts. 
A possible approach to address the aforementioned issue of sparse geotagging information is to infer the event location by analyzing the content of social media posts <ref type="bibr">[18]</ref>. However, the limited and unstructured content in a social media post (e.g., 280 characters in a tweet) makes the location entity inference problem challenging <ref type="bibr">[19]</ref>. For example, the essential location entity "FDR DR NB" in Figure <ref type="figure">1</ref>(a) is misidentified as "NB" (in blue box) by the state-of-the-art entity extraction method Google Named Entity Detection service<ref type="foot">foot_0</ref>, which leads to an inaccurate geolocation (i.e., the blue pin in Figure <ref type="figure">1</ref>(b)). In addition, existing solutions for event localization often require external information (e.g., using the content of abnormal traffic event posts to retrieve the geotagging information in geotagged tweets with similar content <ref type="bibr">[8]</ref>). However, such external information may not always be available. For example, our case study in New York City shows that most traffic incidents are only reported by a single tweet. Therefore, these solutions are insufficient to fundamentally address the content-only location entity inference challenge in abnormal traffic event geolocalization.</p><p>Fine-grained Geolocation Estimation. The second challenge lies in how to accurately estimate the geographic coordinates of the abnormal traffic event by leveraging the location entities identified in a social media post. 
Existing solutions for geolocation estimation often utilize a grid-based method that divides the map area of interest into a set of grids of equal size and estimates the event geolocation in terms of the grid (e.g., the center of the estimated grid) <ref type="bibr">[20,</ref><ref type="bibr">15]</ref>. However, the estimated geolocation is often coarse-grained and not precise enough for estimating the geographic coordinates of abnormal traffic events. For example, a single grid in an urban area (e.g., New York City in Figure <ref type="figure">1</ref>) can contain multiple roads and intersections.</p><p>A preliminary version of this work has been published in ASONAM 2019 <ref type="bibr">[21]</ref> to study the problem of identifying location entities for abnormal traffic event localization, which is an initial step in geolocating abnormal traffic events. This paper is a significant extension of our conference paper (i.e., SyntaxLoc) in the following aspects. First, we focus on an abnormal traffic event geolocalization problem where the goal is to infer the geographic coordinates from the fuzzy description in social media posts instead of only extracting the location entities as we studied in the conference paper (Sections 1 and 3). Second, we develop a new SAT-Geo framework to address the fine-grained geolocation estimation challenge by developing a distance-aware geolocation estimation model to accurately estimate the geographic coordinates of the abnormal traffic events (Section 4). Third, we conduct a set of new experiments to comprehensively evaluate the geolocation estimation performance of the proposed SAT-Geo framework compared to the state-of-the-art baseline methods (Section 5). Fourth, we extend the related work by reviewing recent works on intelligent transportation systems (Section 2).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Social Sensing</head><p>Social sensing has become an emerging sensing paradigm to observe the physical world by exploring the "wisdom of the crowd" on social media <ref type="bibr">[5,</ref><ref type="bibr">22]</ref>. Social sensing has been adopted in a wide range of application domains <ref type="bibr">[23,</ref><ref type="bibr">24,</ref><ref type="bibr">25,</ref><ref type="bibr">26,</ref><ref type="bibr">27,</ref><ref type="bibr">28]</ref>, including damage assessment in the aftermath of a disaster using social media data <ref type="bibr">[25]</ref>, cross-modal data fusion using crowdsourcing intelligence <ref type="bibr">[26]</ref>, and environment and urban infrastructure monitoring with inputs from citizen scientists <ref type="bibr">[28]</ref>. The problem of abnormal traffic event geolocalization remains an important challenge that has not been well addressed in social sensing applications. Specifically, the goal of abnormal traffic event geolocalization is to accurately identify the geographic coordinates of abnormal traffic events reported on social media. The identified geographic coordinates can be leveraged to provide effective precautions and timely responses for enhancing the safety and performance of traffic systems. In this paper, we develop SAT-Geo, a social sensing approach to effectively identify the location entities associated with abnormal traffic events and accurately estimate the corresponding geographic coordinates.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Location Inference in Social Sensing</head><p>Considerable effort has been devoted to the location inference problem in social sensing <ref type="bibr">[16,</ref><ref type="bibr">9,</ref><ref type="bibr">10,</ref><ref type="bibr">11,</ref><ref type="bibr">29,</ref><ref type="bibr">12,</ref><ref type="bibr">13,</ref><ref type="bibr">14]</ref>. For example, one line of work estimates event locations on social media using a combination of user profiles, post content, and geotagging information <ref type="bibr">[15]</ref>. However, the above solutions cannot be adapted to address our problem of geolocating abnormal traffic events since they either require prior knowledge or external information (e.g., users' private online activities, a complete gazetteer database), or are insufficient to perform fine-grained geolocation estimation (e.g., the average error distance is about 10-100 miles). In contrast to existing solutions, we design a novel SAT-Geo scheme that focuses on exploring the syntax patterns in the textual content of social media posts and estimates the geographic coordinates of the reported abnormal traffic event via a probabilistic learning approach.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Probabilistic Learning Technique</head><p>Our SAT-Geo framework is related to probabilistic learning methods in machine learning. Probabilistic learning has been applied to a wide range of studies, including natural language processing, computer vision, and information retrieval <ref type="bibr">[30,</ref><ref type="bibr">31,</ref><ref type="bibr">32]</ref>. For example, Li et al. developed a probabilistic image annotation framework to estimate the image-to-word correlation using multicorrelation probabilistic matrix factorization <ref type="bibr">[30]</ref>. Zettlemoyer et al. proposed a structured classification model that leverages probabilistic categorial grammars to learn the mapping from sentences to logical forms <ref type="bibr">[31]</ref>. Danelljan et al. designed a probabilistic regression approach to track the state of the target object in visual frames of video <ref type="bibr">[33]</ref>. However, none of these approaches is designed to study the syntax patterns in the short and informal text of social media posts for geolocating abnormal traffic events. In contrast, the proposed SAT-Geo framework develops a syntax-based probabilistic learning approach to explicitly explore syntax patterns of social media content and effectively identify the relevant location entities for the accurate estimation of the abnormal traffic event's geographic coordinates.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.">Intelligent Transportation Systems</head><p>Our proposed work for abnormal traffic event geolocalization is closely related to intelligent transportation systems (ITS) <ref type="bibr">[34]</ref> and can benefit many applications in ITS (e.g., improving traffic management efficiency <ref type="bibr">[35]</ref>, enhancing public transportation safety <ref type="bibr">[36]</ref>). Examples of intelligent transportation systems include traffic monitoring, traffic congestion/accident detection, and public transportation management in urban planning <ref type="bibr">[37]</ref>. For example, Barmpounakis et al. designed an urban traffic monitoring system using the sensing data collected from drones to monitor traffic congestion in the urban area <ref type="bibr">[38]</ref>. ning <ref type="bibr">[41]</ref>. To the best of our knowledge, the SAT-Geo framework is the first infrastructure-free solution to address the problem of geolocating abnormal traffic events at a fine-grained level using social media data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Problem</head><p>We present the problem of geolocating abnormal traffic events in social sensing. First of all, we define a few key terms in our problem formulation.</p><p>Definition 1. Social Media Posts (P): We define the social media posts P as a set of S social media posts (e.g., tweets) that are posted by social media users to report abnormal traffic events. Specifically, we define P as $P = \{P_1, P_2, \ldots, P_S\}$ where $P_s$, $\forall\, 1 \le s \le S$, denotes a social media post reporting an abnormal traffic event.</p><p>Definition 2. Location Entities (L): The location entities (L) are defined as a set of named entities that are associated with the geolocation of the abnormal traffic event reported in a social media post. For example, "FDR Dr NB", "49th St", and "34th St" are the location entities in the social media post shown in Figure <ref type="figure">1</ref>(a). Specifically, we define $L^s = \{L^s_1, L^s_2, \ldots, L^s_C\}$ to be the set of C location entities from the post $P_s$.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Definition 3. Event Geographic Coordinates (G): We define the event geographic coordinates (G) to be the latitude and longitude coordinates of the abnormal traffic event depicted in a social media post. In particular, we define $G^s = (g^s_{lat}, g^s_{long})$ to be the geographic coordinates of the abnormal traffic event in post $P_s$.</p><p>The goal of our problem is to precisely estimate the abnormal traffic event geolocation by accurately identifying all location entities from a social media post. We formally formulate our problem as below:</p><formula notation="tex">\arg\min_{\hat{G}^s} D(\hat{G}^s, G^s), \quad \forall\, 1 \le s \le S \qquad (1)</formula><p>where $\hat{G}^s$ and $G^s$ are the estimated and ground-truth geographic coordinates of the traffic event reported in social media post $P_s$, respectively, and D(&#8226;) is the application-specific distance measurement.</p></div>
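<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of the application-specific distance measurement D(&#8226;), the following minimal Python sketch uses the great-circle (haversine) distance between an estimated and a ground-truth coordinate pair, and averages it over a set of posts. The choice of the haversine distance, the function names haversine_km and mean_error_km, and the Earth-radius constant are assumptions made for illustration; they are not prescribed by the problem formulation.</p><p><![CDATA[
```python
import math

# Mean Earth radius in kilometres (an assumed constant for this sketch).
EARTH_RADIUS_KM = 6371.0088


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, long) pairs in degrees."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above the valid range.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


def mean_error_km(estimated, ground_truth):
    """Average distance between estimated and ground-truth coordinates."""
    pairs = list(zip(estimated, ground_truth))
    if not pairs:
        raise ValueError("at least one coordinate pair is required")
    return sum(haversine_km(e, g) for e, g in pairs) / len(pairs)
```
]]></p><p>Any other distance (e.g., a road-network distance) could be substituted for haversine_km without changing the formulation.</p></div>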
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Solution</head><p>In this section, we present the SAT-Geo framework to address the problem of geolocating abnormal traffic events using social sensing. An overview of the SAT-Geo framework is shown in Figure <ref type="figure">2</ref>.</p><p>The SPL module is designed to effectively learn the syntax-based patterns in social media posts for identifying the location entities. First, we define several key terms that will be used in this module.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Definition 4. Entity (e):</head><p>We define an entity e to be a sequence of words that belong to the same part of speech (e.g., "Conduit Ave"). In particular, the post "Accident on Conduit Ave approaching Sutter Ave" contains the entities "Accident (NOUN)", "on (ADP)", "Conduit Ave (NOUN)", "approaching (VERB)", and "Sutter Ave (NOUN)".</p><p>Definition 5. n-Syntax Pattern ($T^{(n)}$): We define an n-Syntax pattern $T^{(n)}$ to be a contiguous syntax sequence of n entities in a given social media post. For example, the 3-entity sequence "Accident+on+Conduit Ave" has a 3-syntax pattern of "NOUN+ADP+NOUN", and the sequence ["on (ADP)", "Conduit Ave (NOUN)", "approaching (VERB)"] has the 3-syntax pattern "ADP+NOUN+VERB".</p><p>Definition 6. n-Syntax Model ($M^{(n)}$): We define an n-Syntax model $M^{(n)}$ as the set of all possible n-Syntax patterns $T^{(n)}$.</p><p>Table <ref type="table">1</ref> shows a simplified example that includes a social media post and the related entities, n-Syntax patterns, and n-Syntax models as defined above. In addition, we also define two types of probabilities that will be used to extract location entities from the social media post.</p><p>Definition 7. Pattern Probability: Pattern probability represents the probability of an n-Syntax pattern $T^{(n)}$ in an n-Syntax model $M^{(n)}$ that is defined as:</p><formula notation="tex">\Pr(T^{(n)} \mid M^{(n)}) = \frac{|T^{(n)}|}{|M^{(n)}|} \qquad (2)</formula><p>where $|T^{(n)}|$ is the number of occurrences of the n-Syntax pattern $T^{(n)}$ in a given set of social media posts, and $|M^{(n)}|$ is the total number of all n-Syntax patterns.</p><p>Definition 8. Index Probability: Index probability represents the probability of a location entity index $i^{(n)}$ in an n-Syntax pattern $T^{(n)}$, which is defined as:</p><formula notation="tex">\Pr(i^{(n)} \mid T^{(n)}) = \frac{|i^{(n)}|}{|T^{(n)}|} \qquad (3)</formula><p>where $|i^{(n)}|$ is the number of location entities observed at the i-th entity of the n-Syntax pattern $T^{(n)}$.</p><p>With the key concepts defined above, our next goal is to leverage the learned pattern probability $\Pr(T^{(n)} \mid M^{(n)})$ and index probability $\Pr(i^{(n)} \mid T^{(n)})$ to effectively extract location entities from unlabeled social media posts, as described in the next subsection.</p></div>
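<div xmlns="http://www.tei-c.org/ns/1.0"><p>The pattern probability and index probability can be estimated from annotated posts by simple counting. The following Python sketch is one possible reading of these definitions, not the authors' implementation. It assumes that each post is given as a list of (text, part-of-speech tag, is-location) triples, and that the total number of n-Syntax patterns is counted as the total number of pattern occurrences in the training posts. The names learn_spl, ngrams, and EXAMPLE_POST are hypothetical.</p><p><![CDATA[
```python
from collections import Counter, defaultdict


def ngrams(seq, n):
    """Return all contiguous windows of length n from seq."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def learn_spl(posts, n):
    """Estimate pattern and index probabilities for one n-Syntax model.

    posts: list of posts; each post is a list of
           (text, pos_tag, is_location) triples (assumed input format).
    Returns (pattern_prob, index_prob) where
      pattern_prob[T]  = occurrences of T / occurrences of all patterns
      index_prob[T][i] = times position i of T is a location / occurrences of T
    """
    pattern_count = Counter()
    loc_count = defaultdict(Counter)
    for post in posts:
        for window in ngrams(post, n):
            pattern = tuple(tag for _, tag, _ in window)
            pattern_count[pattern] += 1
            for i, (_, _, is_loc) in enumerate(window):
                if is_loc:
                    loc_count[pattern][i] += 1
    total = sum(pattern_count.values())
    pattern_prob = {t: c / total for t, c in pattern_count.items()}
    index_prob = {
        t: {i: loc_count[t][i] / c for i in range(n)}
        for t, c in pattern_count.items()
    }
    return pattern_prob, index_prob


# The example post used in the text, with location entities marked True.
EXAMPLE_POST = [
    ("Accident", "NOUN", False),
    ("on", "ADP", False),
    ("Conduit Ave", "NOUN", True),
    ("approaching", "VERB", False),
    ("Sutter Ave", "NOUN", True),
]
```
]]></p><p>For instance, learn_spl([EXAMPLE_POST], 2) yields four 2-syntax patterns of equal probability, and the pattern ADP+NOUN places all of its location mass on its second entity.</p></div>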
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Probabilistic-based Entity Extraction (PEE)</head><p>The PEE module aims to effectively extract the location entities from the content of social media posts using the pattern probability and index probability learned in the SPL module. In particular, we first measure the likelihood of an entity e being a location entity from the pattern probability and index probability defined in Equation 2 and Equation 3, respectively. Formally, the likelihood of entity e being a location entity is as follows.</p><formula notation="tex">\Pr(e \in L \mid T^{(n)}) = \Pr(i^{(n)} \mid T^{(n)}) \times \Pr(T^{(n)} \mid M^{(n)}) \times \Pr(M^{(n)})</formula><p>where $\Pr(i^{(n)} \mid T^{(n)})$ and $\Pr(T^{(n)} \mid M^{(n)})$ denote the index probability and pattern probability, respectively. $\Pr(M^{(n)})$ is the weight of the n-Syntax model, which represents the importance of each n-Syntax model in extracting the location entities. $\Pr(M^{(n)})$ is often set to a small value if we do not have prior knowledge.</p><p>In addition, we note that an entity e often appears in multiple n-Syntax patterns. For example, "Conduit Ave (NOUN)" occurs in different n-Syntax patterns (i.e., 2-, 3-, and 4-syntax patterns), as shown in Table <ref type="table">1</ref>. Thus, we aggregate the likelihood of each entity over the different n-Syntax patterns in which it appears to obtain Pr(e &#8712; L).</p><p>Finally, an entity e is classified as a location entity if the likelihood Pr(e &#8712; L) is greater than a predefined threshold &#8710;<ref type="foot">4</ref>. Specifically, the classification output is "1" (i.e., true) if entity e is classified as a location entity and "0" (i.e., false) otherwise. The classified location entities are used as the input to estimate the corresponding geographic coordinates in the DGE module, which is elaborated in the next subsection.</p></div>
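<div xmlns="http://www.tei-c.org/ns/1.0"><p>The likelihood computation and thresholding described above can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: it assumes that the per-pattern likelihood is the product of the index probability, the pattern probability, and the model weight, and that aggregation over patterns is a plain sum; neither choice is confirmed by the text. The function names entity_likelihoods and classify, and the dictionary layout of the learned probabilities, are hypothetical.</p><p><![CDATA[
```python
def entity_likelihoods(tags, models, model_weight):
    """Score each entity of a post as a potential location entity.

    tags:         list of part-of-speech tags, one per entity in the post
    models:       dict n -> (pattern_prob, index_prob) learned beforehand,
                  where pattern_prob[T] is a float and index_prob[T] maps
                  a position i to a float (assumed layout)
    model_weight: dict n -> weight of the n-Syntax model
    Returns one aggregated score per entity (sum over patterns: an assumption).
    """
    scores = [0.0] * len(tags)
    for n, (pattern_prob, index_prob) in models.items():
        for start in range(len(tags) - n + 1):
            pattern = tuple(tags[start:start + n])
            if pattern not in pattern_prob:
                continue  # an unseen pattern contributes nothing
            for i in range(n):
                scores[start + i] += (
                    index_prob[pattern].get(i, 0.0)
                    * pattern_prob[pattern]
                    * model_weight[n]
                )
    return scores


def classify(scores, threshold):
    """Return 1 for entities whose score exceeds the threshold, else 0."""
    return [1 if s > threshold else 0 for s in scores]
```
]]></p><p>With a single 2-Syntax model in which ADP+NOUN and NOUN+VERB each have pattern probability 0.5, the entity "Conduit Ave" in the example post receives contributions from both patterns, whereas the other entities receive none.</p></div>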
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Distance-aware Geolocation Estimation (DGE)</head><p>The DGE module is developed to accurately estimate the geographic coordinates of the abnormal traffic event using the location entities identified in the PEE module. First, we design a point-based map representation method to accurately extract the geographic coordinates of the location entities associated with the abnormal traffic event. Current solutions for geolocation estimation often adopt a grid-based approach that divides the geographic areas of interest into grids and identifies the grid covering the abnormal traffic event <ref type="bibr">[15,</ref><ref type="bibr">8]</ref>. However, the precision of such an approach is limited by the size of the grid, which often ranges from 1 square mile (e.g., event location identification using social media data <ref type="bibr">[15]</ref>) to 1000 square miles (e.g., global localization for the origin of social media posts <ref type="bibr">[20]</ref>). In addition, the grid-based approach often uses the center of the grid to represent the estimated geographic coordinates of the identified location, which is sub-optimal for geolocating abnormal traffic events in urban areas with dense traffic, where each grid contains multiple roads and intersections. For example, there are more than 100 intersections per square mile in Manhattan, New York <ref type="bibr">[42]</ref>. <note place="foot" n="4">&#8710; is an application-specific parameter. We present a robustness study of the variation of &#8710; in the evaluation section.</note></p><p>In light of such a limitation, we design a point-based map representation approach to effectively model the geographic coordinates associated with the location entities identified in PEE. We first define the point-based map database that will be used in the DGE module to estimate the geographic coordinates of the abnormal traffic event.</p><p>Definition 9. 
Map Database (Q): We define the map database Q as a set of road entities, where each road entity is associated with a location entity identified in PEE. In particular, we formally define the road entity in the map database Q as follows.</p><p>Definition 10. Road Entity ($R_h$): We define a road entity $R_h \in Q$ to be a sequence of $K_h$ geographic points sampled from the road that is associated with a location entity in L. In particular, for a social media post $P_s$, we define $R^s = \{R^s_1, R^s_2, \ldots, R^s_C\}$ to be the set of C road entities associated with the location entities in $L^s$. Formally, each road entity is defined as $R_h = (v_h^{1}, v_h^{2}, \ldots, v_h^{K_h})$, where each $v_h^{k}$ is a sampled geographic point.</p><p>With the map database Q defined above, our goal is to find the geographic point that is closest to the abnormal traffic event location from a set of candidate geographic points sampled from the road entities corresponding to the location entities of the reported abnormal traffic event. However, it is not a trivial task to accurately infer a geographic point that is closest to the abnormal traffic event location from multiple road entities (i.e., multiple sequences of geographic points) identified in the social media post. We observe that the abnormal traffic events reported on social media often contain more than two location entities (e.g., the location entities "FDR DR NB", "49TH ST", and "34TH ST" shown in Figure <ref type="figure">3</ref>). Therefore, we assume there are at least two road entities in $R^s$; otherwise, we output the geographic point at the midpoint of the road entity as the estimated event geographic coordinates $\hat{G}^s$. However, the coordinates of the estimated geographic point cannot be automatically identified by simply finding the geographic coordinates of the intersections of road entities. This is especially true when two or more intersections exist. 
For example, as Figure <ref type="figure">3</ref> shows, there are two intersections among the road entities associated with the abnormal traffic event reported on social media (i.e., the intersection of "FDR Dr NB" and "49th St", and the intersection of "FDR Dr NB" and "34th St"). However, only the intersection of "FDR Dr NB" and "49th St" is the accurate estimation of the geographic coordinates of the reported abnormal traffic event. To address such a challenge, we design a distance-aware geolocation estimation method to accurately infer the geographic coordinates of the abnormal traffic event. We jointly consider the distance between each geographic point and its neighborhood road entities (i.e., the road entities that co-appear in the same social media post), and the syntax-based relations of road entities related to the abnormal traffic event. In particular, we define the distance-driven weight of each geographic point in the identified road entities of the abnormal traffic event.</p><p>Definition 11. Distance-driven Weight: For each road entity $R^s_i \in R^s$, the distance-driven weight $w^{k_i}$ of each geographic point $v_i^{k_i} \in R^s_i$ is defined to be inversely related to the distance between the point and its neighborhood road entities. Here, &#948;(&#8226;) is the distance function that measures the shortest Euclidean distance between a geolocation node $v_i^{k_i}$ and a road entity $R^s_j$, and &#1013; is a small constant that avoids a zero value in the denominator.</p><p>We observe that the abnormal traffic event location is often also close to the road entities that describe the location of the event in the social media post. For example, the traffic event location in Figure <ref type="figure">3</ref> (marked with a pink star) has the shortest distance to the road entities "FDR Dr" and "49th St" and is reasonably close to the road entity "34th St". 
Therefore, we compute the distance-driven weight of each geographic point to measure how close the geographic point is to the geographic coordinates of the abnormal traffic event.</p><p>In addition, we observe that the relations between the location entities are also critical in inferring the traffic event geolocation. For example, in the social media post (i.e., "Accident on Conduit Ave approaching Sutter Ave") shown in Table <ref type="table">1</ref>, we can effectively infer that the traffic event geographic coordinates belong to the road entity "Conduit Ave" according to the adposition "on" in the 2-syntax pattern "on+Conduit Ave". Therefore, we further identify the important location entities among the location entities in a tweet based on the adpositions in the syntax patterns. In particular, we add a relation indicator $\tau_i$ to the distance-driven weight $w^{k_i}$, where $\tau_i = 1$ if the location entity associated with $R^s_i$ co-appears with an adposition (ADP) in a 2-syntax or 3-syntax pattern, and $\tau_i = 0$ otherwise. In particular, we focus on the adpositions on, at, approaching, and after based on empirical observation. Finally, the geographic point with the highest distance-driven weight is output as the estimated traffic event geographic coordinates $\hat{G}^s$.</p><p>The distance-aware geolocation estimation (DGE) module is summarized in Algorithm 1. The input to the DGE module is the set of location entities $L^s$ of a social media post $P_s$ from the PEE module. The output of the DGE module is the estimated abnormal traffic event geographic coordinates $\hat{G}^s$.</p><p>Algorithm 1 Distance-aware Geolocation Estimation (DGE)</p><p>for each e in $P_s$ do<lb/>retrieve the road entity r from Q<lb/>for each $v_i^{k_i}$ in $r_i$ do<lb/>assign the geolocation node $v_i^{k_i}$ that corresponds to max($w^{k_i}$) to $\hat{G}^s$<lb/>end if<lb/>output $\hat{G}^s$ for $P_s$</p></div>
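<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the distance-aware estimation step concrete, the following Python sketch scores every sampled point of every road entity and returns the highest-weighted one. It is an illustration under several stated assumptions rather than the authors' algorithm: the weight of a point is taken to be the reciprocal of the summed shortest distances to the other road entities plus a small constant, the relation indicator is added to that weight, distances are planar Euclidean distances between sampled points, and coordinates are assumed to be already projected to a planar system. The names point_to_road and estimate_geolocation, the eps default, and the toy data are hypothetical.</p><p><![CDATA[
```python
import math


def point_to_road(p, road):
    """Shortest Euclidean distance from point p to any sampled point of road."""
    return min(math.dist(p, q) for q in road)


def estimate_geolocation(roads, tau, eps=1e-9):
    """Pick the sampled point with the highest distance-driven weight.

    roads: dict road name -> list of (x, y) points sampled along the road
    tau:   dict road name -> 1 if the road co-appears with an adposition
           such as "on" or "at", else 0 (missing names default to 0)
    """
    names = list(roads)
    if not names:
        return None
    if len(names) == 1:
        # Single road entity: fall back to the midpoint of its point sequence.
        points = roads[names[0]]
        return points[len(points) // 2]
    best_point, best_weight = None, -math.inf
    for name in names:
        others = [roads[o] for o in names if o != name]
        for p in roads[name]:
            total = sum(point_to_road(p, r) for r in others)
            # Assumed form: reciprocal distance plus additive relation indicator.
            weight = 1.0 / (total + eps) + tau.get(name, 0)
            if weight > best_weight:
                best_point, best_weight = p, weight
    return best_point


# Toy example: road A crosses road B at (0, 1); road C lies farther away.
ROADS = {
    "A": [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)],
    "B": [(-1.0, 1.0), (0.0, 1.0), (1.0, 1.0)],
    "C": [(5.0, 5.0), (6.0, 6.0)],
}
TAU = {"A": 1}
```
]]></p><p>In this toy example, estimate_geolocation(ROADS, TAU) selects the point where road A meets road B. Note that when a road crosses two other roads along a straight segment, the summed distance alone can produce ties, which is one reason the relation indicator matters in such a scheme.</p></div>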
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Summary of SAT-Geo Framework</head><p>The pseudocode of the SAT-Geo framework is summarized in Algorithm 2. The input of our SAT-Geo framework is a set of social media posts P that depict abnormal traffic events. The output of our SAT-Geo framework is the estimated geographic coordinates $\hat{G}^s$ of the abnormal traffic event reported in each social media post $P_s$.</p><p>Algorithm 2 Summary of the SAT-Geo Framework</p><p>input: a set of S social media posts P, a map database Q<lb/>output: the estimated geolocation $\hat{G}^s$ for each $P_s \in P$<lb/>compute pattern probability $\Pr(T^{(n)} \mid M^{(n)})$ and index probability $\Pr(i^{(n)} \mid T^{(n)})$ using SPL<lb/>for each $P_s$ in P do<lb/>for each e in $P_s$ do<lb/>classify e using PEE<lb/>output $\hat{G}^s$ for $P_s$<lb/>end for</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Evaluation</head><p>In this section, we evaluate the performance of the proposed SAT-Geo framework on three real-world Twitter datasets collected from three cities. In particular, we first compare the location entity identification accuracy of SAT-Geo with state-of-the-art baseline methods. We then evaluate the geolocation estimation performance of SAT-Geo. Evaluation results show that SAT-Geo achieves significant performance gains over state-of-the-art baselines in accurately identifying location entities associated with abnormal traffic events and estimating the geographic coordinates of the abnormal traffic event.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Dataset</head><p>First, we describe the real-world Twitter datasets we collected from three major cities in the world, namely New York City (NYC), Los Angeles (LA), and London. In particular, the social media posts (i.e., tweets) on abnormal traffic events are collected from Twitter using the crawler Get Old Tweets<ref type="foot">foot_2</ref> with a set of keywords and hashtags (e.g., "slow traffic", "accident", city names).</p><p>We manually select 200 tweets from each dataset for our study, and verify that each tweet contains a unique abnormal traffic event (i.e., 1 tweet per event)<ref type="foot">6</ref>.</p><p>The reported abnormal traffic events in our datasets can be mainly categorized into the following types: traffic accidents (e.g., collision, broken down vehicles), infrastructure incidents (e.g., out-of-order traffic signal, falling trees), and unusual road conditions (e.g., road closure, road construction). Each dataset is randomly split into an 80% training set and a 20% testing set. We manually annotate the location entities and the traffic event's geographic coordinates in each post to obtain the ground-truth annotations. A summary of these three datasets is reported in Table <ref type="table">2</ref>. In particular, there are 2,412 entities in the NYC dataset and 20.4% of them are location entities. The LA dataset contains 2,851 entities and 16.7% of them are location entities. The London dataset contains 2,483 entities and 19.6% of them are location entities. We observe similar syntax patterns in social media posts among different English-speaking countries (e.g., United States and United Kingdom). For example, "Accident on Grand Ave SB at CR-12" and "Collision on Greenford Road Northbound at Daryngton Drive" are reported on social media in New York and London, respectively. 
We also show the distribution of the abnormal traffic events across each studied city by presenting the heatmap of the abnormal traffic event geolocations in Figure <ref type="figure">4</ref>. Additionally, since the geotagging information associated with each tweet in our dataset is not necessarily available (due to privacy and legal concerns), we also invited independent human annotators to annotate the geographic coordinates associated with each tweet for evaluating the performance of geographic coordinates estimation. In particular, we annotate the ground-truth geographic coordinates by manually assessing the abnormal traffic event described in each tweet and finding the geographic coordinates of the traffic event location using an online map service<ref type="foot">foot_4</ref> . </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Baselines</head><p>We compare SAT-Geo with a set of state-of-the-art baseline methods in location entity identification and geolocation estimation.</p><p>&#8226; Google Named Entity Detection<ref type="foot">foot_5</ref> (GoogleNE): Google Named Entity Detection is an advanced commercial entity recognition service that extracts entities with the corresponding entity types (e.g., location entity) using a set of pre-trained natural language models.</p><p>&#8226; Stanford CoreNLP (StanfordNLP) <ref type="bibr">[12]</ref>: Stanford CoreNLP is an integrated natural language processing toolkit that can be applied to</p><p>&#8226; Spacy <ref type="bibr">[43]</ref>: Spacy is an industrial natural language processing framework that detects named entities in text documents with a set of well-trained entity recognition models.</p><p>For all the baseline methods, we use the Google Maps Geocoding API<ref type="foot">foot_6</ref> (Geocoding API) to convert the extracted location entities to the geographic coordinates of the corresponding traffic event location. In particular, we first concatenate the extracted location entities and send them to the Geocoding API. The geographic coordinates returned by the Geocoding API are used as the estimated geographic coordinates for each corresponding baseline.</p></div>
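<div xmlns="http://www.tei-c.org/ns/1.0"><p>The baseline geocoding step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the helper names build_geocoding_url and parse_first_coordinates are hypothetical, and the " and " separator used to concatenate entities is an assumption, since the concatenation format is not specified in the text. The sketch only builds the request URL and reads a response that has already been decoded from JSON; it does not issue a network call.</p>
<p><code lang="python"><![CDATA[
```python
from urllib.parse import urlencode

# Public JSON endpoint of the Google Maps Geocoding API.
GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocoding_url(location_entities, api_key):
    """Concatenate location entities into one address query and build the URL.

    Assumption (not stated in the paper): entities are joined with " and ".
    """
    cleaned = [entity.strip() for entity in location_entities if entity.strip()]
    address = " and ".join(cleaned)
    query = urlencode({"address": address, "key": api_key})
    return f"{GEOCODE_ENDPOINT}?{query}"


def parse_first_coordinates(response_json):
    """Return (lat, lng) of the first geocoding result, or None if empty."""
    results = response_json.get("results", [])
    if not results:
        return None
    location = results[0]["geometry"]["location"]
    return location["lat"], location["lng"]
```
]]></code></p>
<p>For the example tweet "Accident on Grand Ave SB at CR-12", a baseline that extracts the entities "Grand Ave SB" and "CR-12" would query the address "Grand Ave SB and CR-12" under this assumed format.</p></div>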
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Evaluation Metrics</head><p>In evaluating the performance of location entity identification (i.e., location entity vs. non-location entity), we adopt the following metrics that are commonly used for binary classification: Accuracy, Precision, Recall, and F1-score.</p><p>In evaluating the performance of geolocation estimation, we adopt the Mean Error Distance and Median Error Distance that are commonly used to evaluate the error distance in geolocation estimation <ref type="bibr">[20]</ref>. In particular, the error distance d (in miles) between the estimated and ground-truth geographic coordinates is computed using the Haversine formula <ref type="bibr">[44]</ref> as:</p><p><formula notation="TeX">d = 2r \arcsin\left(\sqrt{\sin^{2}\left(\frac{\hat{g}^{b}_{lat} - g^{b}_{lat}}{2}\right) + \cos\left(g^{b}_{lat}\right)\cos\left(\hat{g}^{b}_{lat}\right)\sin^{2}\left(\frac{\hat{g}^{b}_{long} - g^{b}_{long}}{2}\right)}\right)</formula></p><p>where r is the radius of the earth, and <formula notation="TeX">(\hat{g}^{b}_{lat}, \hat{g}^{b}_{long})</formula> and <formula notation="TeX">(g^{b}_{lat}, g^{b}_{long})</formula> are the estimated and ground-truth geographic coordinates of the abnormal traffic event reported in social media post <formula notation="TeX">S_{b}</formula>, respectively. If the ground-truth geolocation of the traffic event is a single point, we measure the error distance as the distance between the estimated and ground-truth event geographic coordinates. If the ground-truth geolocation is a road segment, we measure the error distance as the perpendicular distance (in miles) between the estimated geographic coordinates and the road segment.</p></div>
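<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of the geolocation metrics, the following minimal sketch computes the Haversine error distance for the single-point case and then the Mean and Median Error Distance over a set of posts. The function names and the earth radius of 3958.8 miles are assumptions made for this example, and coordinates are assumed to be (latitude, longitude) pairs in degrees. The road-segment case (perpendicular distance) is not covered.</p>
<p><code lang="python"><![CDATA[
```python
import math
from statistics import mean, median

# Assumed mean earth radius in miles (the value of r is not given in the text).
EARTH_RADIUS_MILES = 3958.8


def haversine_miles(estimate, truth, radius=EARTH_RADIUS_MILES):
    """Great-circle distance in miles between two (lat, long) pairs in degrees."""
    lat_est, long_est = map(math.radians, estimate)
    lat_true, long_true = map(math.radians, truth)
    a = (math.sin((lat_est - lat_true) / 2) ** 2
         + math.cos(lat_true) * math.cos(lat_est)
         * math.sin((long_est - long_true) / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * radius * math.asin(math.sqrt(min(1.0, a)))


def error_distance_summary(estimates, ground_truths):
    """Return (mean, median) error distance in miles over paired coordinates."""
    errors = [haversine_miles(e, g) for e, g in zip(estimates, ground_truths)]
    return mean(errors), median(errors)
```
]]></code></p>
<p>Reporting the median alongside the mean is useful because a few badly geocoded posts can dominate the mean error distance.</p></div>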
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.">Evaluation Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.1.">Location Entity Identification Performance</head><p>In the first set of experiments, we evaluate the performance of location entity identification. In particular, we vary the threshold &#8710; (defined in Equation <ref type="formula">6</ref>) from 0.4 to 0.6 for the SAT-Geo scheme (e.g., SAT-Geo 0.6 represents the SAT-Geo scheme with &#8710; = 0.6). The evaluation results on the NYC, LA, and London datasets are reported in Table <ref type="table">3</ref>, Table <ref type="table">4</ref>, and Table <ref type="table">5</ref>, respectively. We observe that the SAT-Geo scheme consistently outperforms all baselines under all evaluation metrics on all datasets. In particular, SAT-Geo achieves performance gains of 9.2%, 22.2%, 36.9%, and 25.7% compared to the best-performing baseline on the NYC dataset (i.e., GoogleNE) in terms of accuracy, precision, recall, and F1-score, respectively. We observe similar performance gains on the LA and London datasets. The significant performance gains achieved by SAT-Geo demonstrate the effectiveness of learning syntax patterns and extracting location entities within a principled probabilistic learning framework. We also note that SAT-Geo outperforms all baseline methods for every &#8710; value on all datasets. Such consistent performance improvements show the robustness of SAT-Geo with respect to the &#8710; parameter in the PEE module.</p><p>We also evaluate the performance of the SAT-Geo framework by varying the training set ratio from 60% to 80% for the NYC, LA, and London datasets. The results of SAT-Geo are shown in Figure <ref type="figure">5</ref>. We observe a stable performance of SAT-Geo over different sizes of the training set across all cities in our study.</p></div>
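<div xmlns="http://www.tei-c.org/ns/1.0"><p>For reference, the four binary classification metrics used in this comparison can be computed from per-entity labels as in the minimal sketch below. The function name and the label encoding (1 for a location entity, 0 otherwise) are assumptions made for illustration; the sketch does not reproduce how SAT-Geo or the baselines produce their predictions.</p>
<p><code lang="python"><![CDATA[
```python
def binary_classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for location vs. non-location labels.

    Labels are assumed to be 1 (location entity) or 0 (non-location entity).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Each ratio falls back to 0.0 when its denominator is empty.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```
]]></code></p>
<p>Because location entities make up only about a fifth of all entities in each dataset, precision, recall, and F1-score are more informative than accuracy alone.</p></div>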
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.2.">Geolocation Estimation Performance</head><p>We also study the geolocation estimation accuracy of SAT-Geo and the compared baselines. The results of the geolocation estimation performance on the NYC, LA, and London datasets are shown in Table <ref type="table">6</ref>, Table <ref type="table">7</ref>, and Table <ref type="table">8</ref>, respectively. We observe that SAT-Geo consistently outperforms all the baseline methods on all datasets. In particular, the SAT-Geo framework achieves a mean error distance of 2.26 miles (i.e., SAT-Geo 0.6) on the NYC dataset, which is 56.8% less than the mean error distance of the best-performing baseline method (i.e., GoogleNE). Similarly, the mean error distance of the SAT-Geo framework is 37.2% and 58.1% less than that of the best-performing baseline method (i.e., GoogleNE) on the LA and London datasets, respectively. In addition to the effective entity extraction in SAT-Geo, we also attribute the performance gains to the accurate distance-aware geolocation estimation that jointly models the distance between each geographic point and the road entities reported in the same post, and the different syntax-based relation between the road entities related to the</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.3.">Ablation Study for Geolocation Estimation</head><p>Finally, we carry out an ablation study to investigate the geolocation estimation effectiveness of the DGE module in the SAT-Geo framework. In particular, we consider the following variations of SAT-Geo and the baseline methods: i) with DGE: using DGE as the geolocation estimation module to estimate the traffic event geographic coordinates using location entities identified by SAT-Geo and the baseline methods; ii) without DGE: using the Google Maps Geocoding API as the geolocation estimation module to estimate the traffic event geographic coordinates using location entities identified by SAT-Geo and the baseline methods. The evaluation results are summarized in Figure <ref type="figure">6</ref> for the NYC, LA, and London datasets. We observe that SAT-Geo with DGE achieves the best performance on all three datasets in terms of the mean error distance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Discussion</head><p>In this study, we focus on estimating the abnormal traffic event geolocation associated with social media posts. In our experiments, we only use a single tweet to geolocate each abnormal traffic event due to the high manual labeling cost <ref type="bibr">[45]</ref>. However, the performance of the proposed SAT-Geo framework can be further enhanced by leveraging multiple data sources (e.g., multiple social media users reporting the same abnormal traffic event) to improve SAT-Geo's robustness against misinformation on social media <ref type="bibr">[46,</ref><ref type="bibr">47]</ref>. One critical challenge in leveraging multiple data sources to geolocate the abnormal traffic event is that the reliability of different data sources is often unknown a priori, and the tweets posted by unreliable social media users could lead to inconsistent and inaccurate geolocation results <ref type="bibr">[1]</ref>. To address this challenge, we plan to leverage estimation-theoretic truth discovery solutions <ref type="bibr">[23,</ref><ref type="bibr">48]</ref> that are designed to jointly estimate the reliability of each studied social media user as well as the credibility of their posts to help us cross-validate the geolocation results and improve the abnormal traffic location estimation accuracy.</p><p>In addition, we collect tweets from both traffic authority accounts (i.e., the Twitter accounts managed by traffic authorities to publish traffic-related information) and general Twitter user accounts. Our current framework does not explicitly explore the authoritativeness of the Twitter accounts, as we manually verify the abnormal traffic events reported in the collected tweets and choose the credible ones as the input to SAT-Geo. This is mainly due to the high labor cost of manually verifying the authoritativeness of all Twitter accounts involved in the study <ref type="bibr">[45]</ref>. 
It will also be interesting to further investigate the authoritativeness of Twitter accounts by modeling the reliability of these accounts as well as the credibility of their posts about abnormal traffic events. However, it is not a trivial task to rigorously model the reliability of different Twitter accounts in the SAT-Geo framework. The reason is that the reliability is often not known for all Twitter accounts a priori, and the Twitter accounts with unknown or uncertain reliability may report inconsistent or conflicting information about the same abnormal traffic event <ref type="bibr">[1]</ref>. To address such a challenge, we plan to utilize the estimation-theoretic methods in truth discovery <ref type="bibr">[23]</ref> [48] to jointly estimate the reliability of the studied Twitter accounts and the credibility of posts associated with these accounts to improve the geolocation estimation performance of the SAT-Geo framework. As this line of effort is beyond the scope of this paper, we plan to implement it in our future work.</p><p>We also acknowledge a limitation of our experiments: they rely on a manually identified set of tweets relevant to abnormal traffic events, which is labor-intensive and not scalable. In our future work, we plan to integrate the SAT-Geo framework with abnormal traffic event detection methods <ref type="bibr">[49,</ref><ref type="bibr">50]</ref> to automate the process of retrieving traffic-related tweets from real-time data streams. In particular, the social media posts retrieved by keywords/hashtags can be fed into a pre-trained abnormal traffic event detection model to classify whether a tweet describes an abnormal traffic event. The identified tweets will then be used as the input to our SAT-Geo framework for estimating the geolocation of abnormal traffic events. 
However, such a pre-trained abnormal traffic event detection model often requires a non-trivial amount of annotated ground-truth labels of the social media posts that report a diverse set of abnormal traffic events across different cities <ref type="bibr">[50]</ref>. We plan to implement the abnormal traffic event detection model in our future work by leveraging crowdsourcing platforms (e.g., Amazon MTurk) to collect sufficient ground-truth labels to train the detection model.</p><p>We note that the scalability of the SAT-Geo framework in the inference phase is expected to be linear in the size of the dataset. In particular, the time</p><p>In this work, we focus on the social sensing based abnormal traffic event geolocation problem in large cities with high traffic volume (e.g., New York City, Los Angeles, London). In general, our model is better suited to cities with a large size and population. This is because a large city is more likely to have a higher occurrence of different types of abnormal traffic events, and there are more active social media users in a large city to report different abnormal traffic events in time <ref type="bibr">[53]</ref>. As a result, our SAT-Geo model can be trained to detect different abnormal traffic events by leveraging the rich set of reported abnormal traffic events in the studied cities. For small or mid-sized cities, we expect the detection accuracy of our scheme to decrease because both the occurrence of abnormal traffic events and the chances of them being reported on social media decrease as the size of the city shrinks, leading to insufficient training data for our SAT-Geo model. One possible solution to address the above problem is to apply transfer learning techniques <ref type="bibr">[54]</ref> [55] to train our SAT-Geo model in a large city (e.g., NYC) and transfer the trained model to locate abnormal traffic events in a smaller city (e.g., El Paso, TX). 
However, it is challenging to effectively adapt the trained model across cities of different sizes. This is especially the case when the training data in the smaller city is sparse or unavailable <ref type="bibr">[54]</ref>. To address this challenging problem, we plan to leverage deep transfer learning techniques (e.g., adversarial transfer learning) to capture the latent features of the syntax patterns from the social media posts reported in the large city (e.g., NYC). The extracted latent features can then be applied to identify location entities in the posts reporting abnormal traffic events in the smaller city (e.g., El Paso, TX).</p><p>The training phase of our SAT-Geo framework in a new city (i.e., different from the city that SAT-Geo is trained with) will depend on the availability of the training data in the new city. If a sufficient amount of training data is available in the new city, SAT-Geo can be re-trained to achieve the desired performance. However, if the training data of the new city is sparse or insufficient, we can use the limited training data to fine-tune the SAT-Geo framework that has been pre-trained with the training data from the original city. Lastly, if the training dataset of the new city is not available at all, we can integrate SAT-Geo with the aforementioned transfer learning techniques <ref type="bibr">[54,</ref><ref type="bibr">55]</ref> to transfer the syntax pattern features learned from the original city with sufficient training data to geolocate the abnormal traffic events in the new city.</p><p>Another limitation of our work lies in the adaptability of our scheme to geolocate abnormal traffic events reported in regions where the primary language is not English (e.g., Arabic-speaking countries, Germany, Portugal, China). Our model does not directly apply to languages other than English. 
This is mainly due to the fundamental difference in grammar and syntax patterns between English and other languages <ref type="bibr">[56]</ref>. For example, descriptive adjectives are often placed after nouns in Spanish, which is the opposite of the syntax pattern in English. In our future work, we consider two possible solutions to address the above problem. The first solution is to leverage state-of-the-art machine translation models to translate the non-English posts to English and apply the SAT-Geo framework to geolocate the abnormal traffic events reported in the translated posts. Alternatively, our second solution aims to modify the n-Syntax Patterns and n-Syntax Models in the SPL module of SAT-Geo to accommodate the language-specific patterns in non-English languages <ref type="bibr">[57]</ref>. We will investigate both solutions and compare their performance on non-English case studies. In particular, we plan to further evaluate the performance of SAT-Geo in geolocating abnormal traffic events in Arabic-speaking countries (e.g., Saudi Arabia) and Portuguese-speaking countries (e.g., Portugal) in our future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusion</head><p>In this paper, we develop SAT-Geo, a syntax-based probabilistic learning approach to geolocate abnormal traffic events using social sensing. The SAT-Geo framework is designed to estimate the geographic coordinates of the abnormal traffic events from the content of social media posts. In particular, we first identify the location entities associated with the abnormal traffic event location in social media posts by developing a syntax-based probabilistic learning approach. In addition, we design a distance-aware geolocation estimation method to accurately estimate the geographic coordinates associated with the reported abnormal traffic event. We evaluate the SAT-Geo framework on three real-world Twitter datasets. Results show that our SAT-Geo framework achieves significant performance gains compared to state-of-the-art baseline methods in terms of accurately estimating the geographic coordinates of abnormal traffic events using social media data. The SAT-Geo framework can be further generalized and applied to a broader range of applications in fine-grained geolocalization using social media input (e.g., geolocating natural disasters or public safety events).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>CRediT authorship contribution statement</head><note type="other">Lanyu</note></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0"><p>https://cloud.google.com/natural-language/</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1"><p>https://cloud.google.com/natural-language/docs/analyzing-syntax</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_2"><p>https://github.com/Mottl/GetOldTweets3</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_3"><p>The number of tweets is mainly limited by the human labor of annotation.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_4"><p>https://www.google.com/maps</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_5"><p>https://cloud.google.com/natural-language/</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_6"><p>https://developers.google.com/maps/documentation/geocoding/overview</p></note>
		</body>
		</text>
</TEI>
