<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Real-Time Twitter Data Mining Approach to Infer User Perception Toward Active Mobility</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10291507</idno>
					<idno type="doi">10.1177/03611981211004966</idno>
					<title level='j'>Transportation Research Record: Journal of the Transportation Research Board</title>
					<idno type="ISSN">0361-1981</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Rezaur Rahman</author><author>Kazi Redwan Shabab</author><author>Kamol Chandra Roy</author><author>Mohamed H. Zaki</author><author>Samiul Hasan</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[This study evaluates the level of service of shared transportation facilities by mining geotagged data from social media and analyzing the perceptions of road users. An algorithm is developed adopting a text classification approach with contextual understanding to filter out relevant information related to users’ perceptions toward active mobility. A heuristic-based keyword matching approach produces about 75% of tweets that are out of context, so that approach is deemed unsuitable for information extraction from Twitter. This study implements six different text classification models and compares their performance for tweet classification. The term frequency-inverse document frequency (TF-IDF) vectorizer-based logistic regression model performed best at classifying the tweets. To select the best model, the performances of the models are compared based on precision, recall, F1 score (harmonic mean of precision and recall), and accuracy metrics. The model is applied to real-world data to filter out relevant information, and content analysis is performed to check the distribution of keywords within the filtered data. The findings from the analysis show that the proposed method can help produce more relevant information on walking and biking facilities as well as safety concerns. By analyzing the sentiments of the filtered data, the existing condition of biking and walking facilities in the DC area can be inferred. This method can be a critical part of a decision support system for understanding the qualitative level of service of existing transportation facilities.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION</head><p>Widespread use of social media platforms such as Facebook, Twitter, and Strava offers a unique opportunity to collect real-time information on existing transportation facilities in a cost-effective way. This is especially relevant because traditional data collection approaches, such as travel demand surveys and user perception surveys, are costly and time-consuming. Social networking sites allow users to share their daily activities, travel patterns, perceptions, and sentiments as small messages or posts. Such information is vital when managing infrastructure, traffic operations, and demand <ref type="bibr">(1)</ref>. It also encourages researchers and practitioners to explore the capacity of social media data in real-time traffic information sharing <ref type="bibr">(2)</ref>, travel behavior modeling <ref type="bibr">(3,</ref><ref type="bibr">4)</ref>, and qualitative service quality analysis of existing transportation facilities <ref type="bibr">(5)</ref>. In this study, we leverage social media data to create a framework for understanding road user perception towards shared active transportation facilities.</p><p>Although active mobility is one of the significant components of sustainable transportation infrastructure, traditional transportation planning focuses on improving highway conditions for car owners at the expense of safe sidewalks and bike facilities. Consequently, limited resources have been deployed in this sector to develop smart tools for understanding pedestrians' and bikers' safety and mobility concerns. In 2018, total federal funding on walking and biking was $916 million, only 2% of total federal transportation infrastructure funding <ref type="bibr">(6,</ref><ref type="bibr">7)</ref>, hardly enough to encourage more people to use active transportation modes <ref type="bibr">(6)</ref>. 
Hence, this study's main objective is to propose a low-cost solution to engage more with the population and extract information regarding their perception towards active mobility, including the condition of existing facilities and safety concerns.</p><p>Moreover, city planners and transportation managers need to know the condition of existing facilities to develop a strategic plan for facility improvements. Transportation agencies mostly rely on qualitative perception surveys and quantitative analyses such as pedestrian volume counts, numbers of bikers, and numbers of crashes to understand overall conditions for active mobility. Such data collection approaches require substantial resources, discouraging agencies from continuous real-time monitoring of shared-space transportation facilities. Hence, we need a cost-effective alternative for real-time monitoring of existing shared facilities.</p><p>In the recent past, social media platforms have gained popularity by allowing users to share their thoughts and concerns. Twitter is a notable example, with more than 330 million subscribed users. It is a microblogging service used to share views, activities, and thoughts through a 280-character message known as a 'tweet.' In the United States, Twitter is one of the most widely used social media platforms, with more than 67 million active users <ref type="bibr">(8)</ref>. Many transportation agencies (e.g., state DOTs) use Twitter to share real-time information with travelers, such as information related to traffic congestion, crashes, incidents, and planned road work. Users also share their views and concerns on existing transportation facilities via tweets. This information can be utilized to understand the overall condition of a transportation facility in a qualitative way. 
Hence, Twitter data has the potential to support decision support tools that assist transportation managers in understanding users' perception of existing facilities with respect to safety and service quality.</p><p>However, the flexibility of information sharing on social media platforms creates a significant challenge in extracting relevant information related to specific content; most of the time, these social networking sites are flooded with random information, making it challenging to extract task-specific information. In recent years, advances in natural language processing (NLP) technologies have created an opportunity to overcome these challenges and extract relevant information from social media data. Thus, this study's main objective is to develop a framework to extract information related to service quality and safety issues of biking and walking facilities from geotagged Twitter data. The study implements a systematic framework for Twitter data mining and text analysis to understand user perception towards active mobility. The research is motivated by three key prospects: 1) widespread use of social media platforms, 2) real-time data collection techniques, and 3) advances in NLP technologies.</p><p>The proposed framework includes a three-step tweet filtering process; first, we apply geolocation-based boundaries to filter out tweets for a specific region, and then we apply a heuristic-based screening technique to filter geotagged tweets based on specific keywords related to walking and biking. However, the heuristic-based screening technique fails to filter out the most relevant tweets related to a specific context, such as user perception towards walking and biking facilities; a high proportion of the tweets provide random information. To overcome this challenge, we implement a text classification method based on the tweet context to filter out the most relevant information. 
As a final step of the tweet filtering approach, we apply a text classification model to filter out the most relevant tweets on active transportation facilities. We also validate our approach by analyzing the contents of each tweet; we apply the Latent Dirichlet Allocation (LDA) model on the filtered tweets to infer a high-level summary of users' thoughts on walking and biking conditions. Finally, we conduct sentiment analysis to understand the polarity of users' sentiments for existing active transportation facilities. The framework we propose is based on real-time monitoring of the Twitter feed. Once the framework is deployed for real-world applications, all these operations can be completed in real time. If a user posts any tweet related to a walking or biking facility, the proposed algorithm will collect this information and show the tweet's polarity and content in real time.</p><p>Overall, based on our understanding of the existing literature, we anticipate that this study will make three major contributions to existing literature and practice. First, it develops a new approach to collect information on user perception towards active mobility cost-effectively; second, it demonstrates a qualitative tweet analysis technique to understand the level of service of an existing facility based on users' sentiments; third, it provides experimental evidence of the validity of the proposed method using geolocated Twitter data. Since the implemented approach identifies users' safety concerns at different locations, the proposed framework can be an alternative approach for near real-time monitoring of the qualitative level of service of existing active transportation facilities in terms of service quality and safety.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">LITERATURE REVIEW</head><p>Social media offers an open-access platform for people to instantly share their opinions about different issues (5). Information from these passive data sources establishes an alternative way to understand user perceptions towards existing transportation facilities, such as the availability and quality of sidewalks and bike lanes and related safety concerns. However, the raw data collected from Twitter are extremely noisy, flooded with random topics and similar keywords used in different contexts <ref type="bibr">(9)</ref>. Even though social media data provides a massive volume of information on user opinions, this information is meaningless unless we can extract the relevant information related to a specific topic. Traditional keyword-based filtering algorithms commonly handle text as straightforward successions of character strings; they only check whether a given set of keywords is present in a sentence, regardless of the context <ref type="bibr">(10)</ref>. These methods cannot extract context-wise information from Twitter; hence we need a robust context-wise text classification approach to overcome this challenge.</p><p>Text classification, one of the fundamental tasks in Natural Language Processing (NLP), is categorizing text according to its content. Text classification has widely been used for topic labeling, spam detection, and intent detection. Existing text classification methods can be divided into traditional machine learning approaches and deep learning approaches. Na&#239;ve Bayes <ref type="bibr">(11)</ref>, logistic regression <ref type="bibr">(12,</ref><ref type="bibr">13)</ref>, and support vector machines <ref type="bibr">(14)</ref> are the most commonly used machine learning approaches for text classification. Na&#239;ve Bayes is commonly used as a baseline for text classification since it is quick and simple to execute <ref type="bibr">(15)</ref>. 
It treats each attribute of the class as an independent element, and this assumption simplifies the classification of the text <ref type="bibr">(15)</ref>. However, when the training data is noisy and small, Bayesian learning is not practical for text classification <ref type="bibr">(16)</ref>.</p><p>Logistic regression-based multiclass text classification has shown superior performance compared to other traditional approaches <ref type="bibr">(13)</ref>. This algorithm assigns weights to each input sequence to segregate potential classes from each other <ref type="bibr">(17)</ref>. However, logistic regression assumes that all the input features in the dataset are independent, which lowers the precision of text classification for a dependent set of variables <ref type="bibr">(18)</ref>. Support Vector Classification (SVC) works well for high-dimensional features in texts <ref type="bibr">(14)</ref>; however, it takes a substantial amount of time to tune the parameters of SVC algorithms to improve precision <ref type="bibr">(19)</ref>. Several studies <ref type="bibr">(20)</ref> have also applied tree-based classifiers for text classification. Although these algorithms work well with categorical features, they are susceptible to small perturbations in the data set and suffer from overfitting issues.</p><p>In the recent past, deep learning approaches have gained more attention due to their ability to deal with high-dimensional data. Convolutional neural networks (CNN) and recurrent neural networks (RNN) are the two most commonly used deep learning methods for text classification. Although the CNN architecture was built for image processing, it has been successfully applied to text classification. However, CNNs perform poorly on long sequences of text due to a limited capacity to learn consecutive connections <ref type="bibr">(21)</ref>. 
For long sequences of text, RNN-based classification models such as Long Short-Term Memory networks (LSTM), Gated Recurrent Units (GRU), and Bidirectional LSTM (BiLSTM) have shown better performance <ref type="bibr">(18,</ref><ref type="bibr">(22)</ref><ref type="bibr">(23)</ref><ref type="bibr">(24)</ref>.</p><p>One limitation of RNN-based classification is that it becomes biased when later words are more influential than earlier ones in a sequence of text. To overcome this issue, a CNN layer is introduced into the RNN architecture <ref type="bibr">(25)</ref>; the Convolutional LSTM (ConvLSTM) model uses a CNN to extract a sequence of higher-level phrase representations, which are fed into an LSTM model. The ConvLSTM model captures local features of sentences as well as global and temporal sentence semantics.</p><p>Overall, a large number of studies have used NLP to improve the efficiency of existing text classification and content analysis methods. These approaches can help us overcome the challenges of utilizing social media data for transportation planning and traffic management. In transportation research, text classification approaches are mostly applied for traffic incident detection from social media data <ref type="bibr">(26)</ref><ref type="bibr">(27)</ref><ref type="bibr">(28)</ref>; these studies apply a binary classification approach to separate incident-related text. 
Apart from incident detection, NLP has also been applied to infer weather-related events from social media data <ref type="bibr">(29)</ref>.</p><p>Although social media platforms generate massive information on user perceptions and sentiments related to different transportation facilities, such as public transportation <ref type="bibr">(30)</ref> and shared mobility and active transportation facilities <ref type="bibr">(31)</ref>, few studies have explored the capacity of social media data for real-time monitoring of the qualitative service quality of these facilities based on users' perceptions and sentiments. Moreover, a few studies <ref type="bibr">(31,</ref><ref type="bibr">32)</ref> explored the impact of user opinions and sentiments on social media in encouraging sustainable mobility options such as biking and public transit. However, these studies are limited to social media data exploration; they do not address the challenges of collecting context-wise real-time information from social media posts.</p><p>In this study, we implement a framework based on Twitter data mining to analyze the qualitative level of service for active mobility and overcome this research gap. We implement advanced text classification approaches to extract the most relevant information from Twitter posts based on the context of the texts. We also perform sentiment analysis to represent users' polarity towards active transportation facilities. The proposed framework offers a new approach for evaluating the qualitative level of service of existing facilities for active mobility in terms of service quality and safety.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">RESEARCH METHODOLOGY</head><p>This study proposes a framework to extract information on user perceptions towards active mobility options using real-time Twitter streaming data. The methodology is shown in Figure <ref type="figure">1</ref> and consists of three steps. First, we apply a geolocation boundary to collect Twitter data for a specific zone, followed by a keyword matching based search to identify relevant tweets related to active mobility. A context-based text classification approach is then applied to prune out tweets that contain keywords closely related to walking and biking but are out of the context of mobility and transportation.</p><p>In the tweet filtering process, the keyword matching algorithm uses relevant keywords related to walking and biking (e.g., walk, bike, sidewalk) to collect active mobility related tweets. The set of keywords assigned to the algorithm controls the number of collected tweet samples closely related to active transportation modes. In some cases, we might not get enough information because some relevant keywords are missing. Therefore, it is important to ensure feedback control in the Twitter filtering process depending on the outcome (Figure 1). The feedback control process helps check whether the data collection approach is missing relevant information related to active transportation facilities.</p><p>In the next step, we adopt a topic modeling approach to perform content analysis over the filtered tweets and generate clusters of topics related to active transportation facilities; the distributed keywords inside each cluster highlight users' activities, perceptions, and concerns at a high level. By analyzing the topics and the keywords inside each topic, we can identify user concerns regarding active transportation facilities. 
Finally, we perform a sentiment analysis over the filtered tweets to understand users' polarity (a score representing whether a text is positive, neutral, or negative) towards existing walking and biking facilities. The sentiment analysis approach provides polarity metrics for users' perceptions.</p><p>The following section explains the different components of the proposed method, starting with model selection for text classification.</p></div>
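The three-step filtering pipeline described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the tweet fields (`coords`, `text`), the bounding-box keys, and the `classifier` callable are all hypothetical names standing in for the real streaming payload and trained model.

```python
def filter_tweets(stream, bbox, keywords, classifier):
    """Three-step filter from the methodology: (1) geolocation boundary,
    (2) heuristic keyword matching, (3) context-based classification.
    All field names here are illustrative, not Twitter's actual schema."""
    for tweet in stream:
        lat, lon = tweet["coords"]
        # Step 1: keep only tweets inside the geographic bounding box
        if not (bbox["south"] <= lat <= bbox["north"] and
                bbox["west"] <= lon <= bbox["east"]):
            continue
        # Step 2: heuristic keyword screening
        text = tweet["text"].lower()
        if not any(k in text for k in keywords):
            continue
        # Step 3: context classification prunes out-of-context tweets
        label = classifier(text)  # returns 'walking', 'biking', or 'other'
        if label != "other":
            yield tweet, label
```

Because each step is a generator filter, the same function works unchanged on a live stream or on a stored sample.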
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Tweet Classification</head><p>One of the critical components of the proposed framework is a model that can filter out the most relevant information from the social media platform based on the context of the tweets; the accuracy of this model determines the relevance of the collected information to a specific topic. In this study, we implement multiple text classification models and compare their classification performance. However, before model implementation, we first need to extract the feature vectors from each tweet.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Extracting Feature Vectors</head><p>We use a vectorizer <ref type="bibr">(33,</ref><ref type="bibr">34)</ref> to convert the tweet texts into a sparse matrix that consists of numbers or tokens. The size of the matrix depends on the vocabulary size; if the vocabulary size is not given, the vectorizer estimates it by analyzing the data. So, the main function of the vectorizer is to convert the tweet texts into vectorized input for the models. We also use a Term Frequency-Inverse Document Frequency (TF-IDF) <ref type="bibr">(35)</ref> score to estimate the importance of different words inside a tweet. The TF-IDF of a word increases with the word's frequency in a document but decreases if the word is present in many documents (e.g., stop words, punctuation); a high TF-IDF score of a word implies high importance within a collection of documents (tweets). The equation for calculating TF-IDF for a word i is as follows:</p><p>TF-IDF(i) = TF(i) &#215; IDF(i)</p><p>where TF(i) is the frequency of word i in a document and IDF(i) = log(N / n(i)), with N the total number of documents (tweets) and n(i) the number of documents containing word i.</p><p>We use both unigrams and bigrams of words to create feature vectors. Details about unigrams and bigrams are available in <ref type="bibr">(36,</ref><ref type="bibr">37)</ref>. Moreover, to remove the effect of total word counts in a document, we apply l2 normalization (the sum of the squared TF-IDF values equals 1 for a document).</p></div>
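As a concrete illustration of the computation above, here is a minimal pure-Python sketch of l2-normalized TF-IDF. Note one assumption: it uses the smoothed idf variant, ln((1 + N) / (1 + df)) + 1, which is the default in scikit-learn's TfidfTransformer and differs slightly from the textbook log(N / n(i)) form; the function and variable names are ours, not the authors'.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute l2-normalized TF-IDF vectors for a list of token lists.

    Uses the smoothed idf = ln((1 + N) / (1 + df)) + 1 variant
    (scikit-learn's TfidfTransformer default), then scales each
    document vector so its squared weights sum to 1."""
    n = len(docs)
    df = Counter()                      # document frequency per word
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)               # raw term frequency
        vec = {w: tf[w] * (math.log((1 + n) / (1 + df[w])) + 1) for w in tf}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        vectors.append({w: v / norm for w, v in vec.items()})
    return vectors
```

For example, a word that appears in only one tweet receives a larger weight than an equally frequent word that appears in most tweets, which is exactly the down-weighting of common words described above.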
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Model Selection for Text Classification</head><p>We adopt a multiclass classification approach to identify the tweets related to different categories of active mobility options (e.g., walking, biking). The objective is to find the best model to map the tweets into different categories based on the context. Let F denote the function that maps input tweets (X_m) into different categories (Y_m):</p><p>Y_m = F(X_m)</p><p>Here, m indicates the number of data samples; X_m is the input TF-IDF vector created in the previous step; Y_m is a vector that contains the labels (y = i) for each category of tweets, where i denotes one class out of our three classes. We consider three categories of tweets: walking related, biking related, and other random out-of-context tweets. Thus, the target vector (Y) has three labels, where y &#8712; {0: "walking", 1: "biking", 2: "other"}.</p><p>To select the best model, we check the predictive performance of each model in terms of precision, recall, F1 score, and accuracy. We generate a confusion matrix to estimate these performance measures. The confusion matrix also reveals the performance imbalance of a classifier: high accuracy for one class but low accuracy for another. Table <ref type="table">1</ref> shows the components of a confusion matrix. The rows represent the actual labels, and the columns represent the predicted labels, where positive means the presence of a particular label and negative means its absence. For a given tweet, if the actual label is negative, a negative prediction by the model is assigned as a true negative, and a positive prediction is assigned as a false positive. 
Similarly, if the actual label is positive, a positive prediction is assigned as a true positive, and a negative prediction is assigned as a false negative. Selecting the best classification model is challenging when model performance changes across metrics. In some cases, a classifier can show better accuracy, but a higher false positive rate will cause a lower precision score. In tweet classification, how accurately the model identifies true positive cases is important, so precision and recall are two essential criteria for evaluating model performance. However, we analyze all the metrics to develop an optimal model that shows consistent performance across every performance metric.</p></div>
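The per-class metrics derived from the confusion matrix can be computed directly from the label lists. The helper below is our own sketch (not the authors' code); it follows the standard definitions, with F1 as the harmonic mean of precision and recall.

```python
def classification_metrics(y_true, y_pred, labels):
    """Per-class precision, recall, and F1 plus overall accuracy.

    For class c: TP = predicted c and truly c; FP = predicted c but not
    truly c; FN = truly c but predicted otherwise. F1 is the harmonic
    mean of precision and recall."""
    per_class = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class[c] = (precision, recall, f1)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return per_class, accuracy
```

Comparing the per-class tuples side by side is what exposes the imbalance mentioned above: a model can reach high overall accuracy while one class has poor recall.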
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Topic Model for Content Analysis</head><p>To recognize the content of a posted tweet, we apply the Latent Dirichlet Allocation (LDA) or topic modeling approach (3). The topic model specifies a probabilistic procedure for generating documents; the process starts with choosing a distribution over topics, and for each word in the document, a topic is randomly selected from the chosen distribution. Finally, a word is randomly drawn from that topic <ref type="bibr">(39)</ref>. The process can infer the set of topics responsible for generating a collection of documents (i.e., tweets) by inverting the process with standard statistical techniques. The topic model has been widely used in machine learning; it has recently been used in transportation studies <ref type="bibr">(40)</ref><ref type="bibr">(41)</ref><ref type="bibr">(42)</ref><ref type="bibr">(43)</ref><ref type="bibr">(44)</ref>.</p><p>We implement the model using the gensim library <ref type="bibr">(45,</ref><ref type="bibr">46)</ref> in Python. We train the model over a sample of tweets to generate the topics and the distribution of keywords inside each topic. However, generating meaningful topics from random tweet samples is challenging; we need to decide on the optimal number of topics to prevent the repetition of similar topics and keyword distributions. To overcome this challenge, we use the coherence score metric <ref type="bibr">(47)</ref>, which measures the degree of semantic similarity (e.g., conceptually correlated words) between high-scoring words within a topic. Thus, it helps differentiate between semantically interpretable topics and topics that are artifacts of statistical inference. A higher coherence score indicates that the words in a topic are semantically related, so the topic has suitable interpretability. 
In our study, we adopt the "topic coherence pipeline" <ref type="bibr">(48,</ref><ref type="bibr">49)</ref> to estimate the aggregated coherence score for the topic model. To choose the optimal number of topics, we run the topic model with different numbers of topics and estimate the aggregated coherence score for each. Finally, we select the optimal number of topics based on the maximum coherence score.</p><p>Once we find the optimal model, we can use it for topic generation and content analysis. The distributed keywords inside each topic provide insights into user perceptions and concerns. So, rather than going over each tweet, we can understand users' opinions by interpreting the topic keywords. Moreover, we use the trained model to find tweets closely related to a particular topic (e.g., safety concerns, bike facilities).</p></div>
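To make the coherence idea concrete, here is a pure-Python sketch of one simple member of the coherence family, the UMass measure; the study relies on gensim's coherence pipeline, which is more elaborate, so both the choice of variant and the names below are our own illustration, not the authors' implementation.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """UMass topic coherence: sum over ranked word pairs (w_i before w_j)
    of log((D(w_i, w_j) + 1) / D(w_j)), where D(...) counts documents
    containing all the given words. Higher (closer to zero) means the
    topic's top words co-occur more often, i.e., a more coherent topic.
    Assumes every word in `top_words` occurs in at least one document."""
    doc_sets = [set(d) for d in docs]

    def d(*words):
        return sum(1 for s in doc_sets if all(w in s for w in words))

    return sum(math.log((d(wi, wj) + 1) / d(wj))
               for wi, wj in combinations(top_words, 2))
```

In the paper's workflow one would train LDA models with several topic counts, score each with the coherence pipeline, and keep the count that maximizes the aggregated score.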
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Sentiment Analysis</head><p>Sentiment analysis has widely been used to understand user perception of different products or facilities <ref type="bibr">(50)</ref>. In this study, we adopt a sentiment analysis approach to understand users' experience of different active transportation facilities. To analyze the sentiment of each tweet, we use Python's "vaderSentiment" library <ref type="bibr">(51)</ref>. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a pre-trained classification model that uses a rule-based approach to classify a text as positive, negative, or neutral. The model is trained on social media texts and emojis and is thus suitable for tweet analysis. We apply the pre-trained model to get the compound score (polarity of a text) for each tweet, which varies between -1 and 1. We categorize a tweet as positive if the compound score is &gt;= 0.05, neutral if the compound score is between -0.05 and 0.05, and negative if the compound score is &lt;= -0.05.</p>
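The thresholding rule above can be written as a small helper. With vaderSentiment installed, the compound score comes from `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]`; the function below (our own name, not the authors') only encodes the study's cutoffs.

```python
def label_sentiment(compound):
    """Map a VADER compound score in [-1, 1] to a sentiment label,
    using the cutoffs adopted in the study:
    >= 0.05 positive, <= -0.05 negative, otherwise neutral."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"
```

Keeping the cutoffs in one place makes it easy to test sensitivity later, e.g., widening the neutral band if too many borderline tweets are labeled positive.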
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">DATA COLLECTION AND PRE-PROCESSING</head><p>In this study, we use Twitter data collected from the Washington, DC area using Twitter's streaming API (application programming interface), given a geolocation boundary (Figure <ref type="figure">2(a)</ref>). To collect information on walking and biking, we filtered the tweets with relevant keywords (Table <ref type="table">2</ref>). In total, we collected 3,533 tweets from October 7, 2019, to November 7, 2019. We checked for duplicate entries based on tweet ids and removed all duplicate tweets; the final dataset has a total of 3,273 tweets. We also checked the number of unique users and found that the dataset has 2,307 unique users, among which about 77.6% tweeted only once between October 7, 2019 and November 7, 2019 (Figure <ref type="figure">2(b)</ref>). Only three users posted more than 30 tweets within this period, which means that the majority of users in the data sample are occasional users. Figure <ref type="figure">2(b)</ref> shows the distribution of users based on the number of tweets posted in the study period.</p><p>To create an annotated dataset, we manually labeled all 3,273 tweets. To ensure that we retrieved the right labels for the tweets, three annotators independently labeled each tweet, and we then matched the labels from the different annotators to fix the final label. Each tweet receives one label out of three possible categories: walking related, biking related, and others. Figure <ref type="figure">2(c)</ref> shows the distribution of different types of tweets. Although we apply a heuristic approach to remove irrelevant tweets, we find that only 25% of the tweets are related to walking and biking; the rest contain random posts that involve words such as walk, bike, etc.</p><p>To understand the content of the collected data, we run a generalized topic model over the collected tweet samples. 
We estimate the coherence score for different numbers of topics and determine the optimal number of topics based on the maximum value of coherence. Figure <ref type="figure">3(a)</ref> shows that the maximum coherence value (0.53) is obtained for ten topics. So, we run the final model with ten topics (Figure <ref type="figure">3(b)</ref>). From the topic analysis, we find that a few topics, such as topic #7, include keywords like "bike," "good," "ride," and "well," indicating users' perceptions of bike rides. Topic #6 also includes keywords related to bikes and bike lanes. However, most of the topics seem irrelevant to the context and provide random information. Hence, we need further cleaning of the data to reduce the flooding of random information.</p></div>
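The de-duplication and user-profiling steps described above can be sketched as follows; the dict keys (`id`, `user`) are illustrative stand-ins for Twitter's actual payload fields, and the function name is ours.

```python
from collections import Counter

def dedupe_and_profile(tweets):
    """Remove duplicate tweets by id (keeping the first occurrence) and
    count tweets per user, mirroring the study's pre-processing. Returns
    the de-duplicated list, per-user tweet counts, and the number of
    users who tweeted exactly once."""
    seen, unique = set(), []
    for t in tweets:
        if t["id"] not in seen:
            seen.add(t["id"])
            unique.append(t)
    per_user = Counter(t["user"] for t in unique)
    one_timers = sum(1 for c in per_user.values() if c == 1)
    return unique, per_user, one_timers
```

Applied to the study's sample, this is the computation behind the 3,533-to-3,273 reduction and the 77.6% one-time-user figure.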
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">RESULTS AND DISCUSSIONS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Tweet Classification</head><p>To classify the tweets, we implement six different classification models: Na&#239;ve Bayes, Logistic Regression, Support Vector Classification (SVC), Long Short-Term Memory Neural Network (LSTM), Bidirectional LSTM, and Convolutional LSTM. To train the models, we divide the data into training, test, and validation sets. We use 70% of the data to train the models and 15% of the data to test the models while tuning the parameters. The remaining 15% of the data is used for validating the proposed approach for text classification and content analysis.</p><p>Before training the models, we apply both unigram and bigram based "CountVectorizer" <ref type="bibr">(33)</ref> from the scikit-learn library <ref type="bibr">(52)</ref> to create feature vectors from the tweets; however, for this particular problem, the unigram-based vectorizer performed best. The vectorizer generates a vectorized sparse matrix that includes 7,070 features per tweet text, which means the vocabulary size of the tweet samples is 7,070. We use this vectorized sparse matrix to estimate the TF-IDF score for each word using "TfidfTransformer" <ref type="bibr">(53,</ref><ref type="bibr">54)</ref>. In our final data set, we represent the tweet text as a vectorized matrix consisting of the TF-IDF score of each word, which is then directly fed into the model as input.</p><p>In this experiment, we choose multinomial Na&#239;ve Bayes as our base model. We explored different values of the smoothing parameter alpha for the Na&#239;ve Bayes model; however, there was no significant increase in model accuracy. In the case of the logistic regression model, we do not have any open parameters to tune. The number of parameters for the logistic regression model is the same as the input size (7,070). Tuning the SVM classifier is more laborious than tuning the logistic regression model. 
For the SVM classifier, we explore three different kernels: linear, polynomial, and radial basis function. We also vary the penalty parameter from 100 to 10000. We obtain the best results for both the linear and radial basis function kernels with a penalty parameter value of 1000.</p><p>For the deep learning models, we randomly search different sets of hyperparameters to find the combination that works best. However, none of them performs very well compared with the base model. Moreover, since we generate the feature matrix using a vectorizer with TF-IDF, the feature dimension (7070) is high. This increases model complexity by increasing the number of neurons at the input layer. Hence, most of the time, these complex models overfit, performing well on the training data but not on the test data.</p><p>We apply another approach using a simple "Tokenizer" <ref type="bibr">(34)</ref> from the Keras library <ref type="bibr">(55)</ref> and a word embedding layer for the deep learning models, which significantly reduces training time while increasing model accuracy. For the embedding layer, we use a vocabulary size of 7106, obtained from the tokenizer. The length of the input features is 280, the same as the maximum allowable number of characters in a single tweet. In Table <ref type="table">3</ref>, we report the hyperparameter sets for the deep learning models that produce the best results. Details about each component of the deep learning models for text classification can be found in <ref type="bibr">(25)</ref>.</p><p>Overall, based on the performance measures, we find that the TF-IDF-based logistic regression model (test accuracy = 0.821) performs better than all the other models, including the deep learning models. As shown in Table <ref type="table">4</ref>, both the logistic regression and deep learning models perform well in terms of precision and recall. 
However, deep learning models would require more data, time, and computational power to complete the training process.</p><p>We also analyze the receiver operating characteristic (ROC) curves to understand the performance of the models in terms of true positive and false positive rates. The ROC curve is based on the macro-average of the classification for each label: we compute the ROC value independently for each class and then take the average, thereby treating all classes equally. The ROC curve plots the true positive rate against the false positive rate, whereas the area under the curve (AUC) represents a degree or measure of separability; it indicates how well the model distinguishes one class from another. The higher the AUC, the better the model performs in predicting 0 as 0 and 1 as 1. Based on the AUC values, we find that the logistic regression model (AUC = 0.890) performs better than all the other models. Figure <ref type="figure">4</ref> shows the ROC curve for the logistic regression model over the test dataset.</p></div>
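The feature-extraction and classification steps described above (unigram counts, TF-IDF weighting, logistic regression) can be sketched with scikit-learn. This is a minimal illustration under stated assumptions, not the authors' exact pipeline: the toy tweets, labels, and parameter values here are hypothetical placeholders for the labeled geotagged dataset used in the study.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; the real study uses labeled geotagged tweets.
tweets = [
    "enjoying a walk on the new sidewalk downtown",
    "the protected bike lane makes my commute much safer",
    "just walking through my to-do list today",
    "a car almost hit me while I was crossing the street",
    "rode my bike on the trail this morning",
    "random thoughts about dinner plans tonight",
]
labels = ["walking", "biking", "random", "walking", "biking", "random"]

# Unigram counts -> TF-IDF weighting -> logistic regression,
# mirroring the feature-extraction steps described in the text.
clf = Pipeline([
    ("counts", CountVectorizer(ngram_range=(1, 1))),
    ("tfidf", TfidfTransformer()),
    ("logreg", LogisticRegression(max_iter=1000)),
])
clf.fit(tweets, labels)
print(clf.predict(["nearly got hit while walking across the crosswalk"]))
```

In the study itself, the vectorizer is fit on the full tweet corpus (vocabulary size 7070) rather than a toy sample, and hyperparameters are tuned on the held-out test split.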
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Model Validation and Content Analysis</head><p>We validate our model on 573 tweets, among which 121 are related to walking and 20 to biking. To demonstrate the logistic regression model's performance, we estimate the confusion matrix over the validation data (Figure <ref type="figure">5</ref>). From the confusion matrix, we find that the model separates random tweets with 88% accuracy; the true positive rate (90%) is also very high for tweets related to biking. However, we see a higher false positive rate (about 38%) when classifying tweets related to walking. From our analysis, we find that most of the random tweets include the keywords walking or walk; however, many are out of context and unrelated to any perception of walking facilities. Such out-of-context tweets are difficult to separate from contextual tweets about walking facilities, which generates the higher false positive rate. The model's overall accuracy on the validation data set is satisfactory, with 82.7% of tweets classified correctly.</p><p>We train the topic model on the filtered data (141 samples); we run the model with different numbers of topics and check the corresponding coherence scores. We find that the coherence score is highest (0.57) for three topics (Figure <ref type="figure">6a</ref>). Figure <ref type="figure">6 (b)</ref> shows the keyword distribution for each topic. From the keyword distribution, we find that users mostly discuss whether they enjoy walking and biking. Moreover, some keywords indicate negative aspects. For example, in topic#0, the keyword "kill" is associated with other active transportation-related keywords: walk and sidewalk. It is possible that this topic is closely related to the safety concerns of existing sidewalks.</p><p>Moreover, we apply the trained model to obtain the dominant topic category (i.e., 
topic#0, topic#1, and topic#2) for each tweet. Based on that, we group the tweet samples into topic#0, topic#1, and topic#2. Figure <ref type="figure">7</ref>(a) shows six tweets with their dominant topic categories. We find that all the tweets closely related to topic#0 indicate safety issues, such as pedestrian-vehicle collisions while using the sidewalk or crossing the street. We also observe a safety concern regarding e-scooters. There are also a few tweets on the benefits of existing bike facilities, such as protected bike lanes, and on requirements for facility improvement.</p><p>We conduct further analysis to understand user perceptions of walking and biking facilities. We perform sentiment analysis on the filtered tweets to obtain each tweet's polarity: positive or negative. From the sentiment analysis, we obtain the compound score: a positive compound score indicates a positive user perception, while a negative score indicates a negative perception of the activity or facility. Figure <ref type="figure">7</ref> (b) shows the distribution of polarity scores for the classified tweets related to walking and biking. The overall distribution indicates that about 18.69% of the walking-related tweets are negative and 61.11% are neutral. This indicates that users show more positivity (38.2%) toward existing walking facilities (e.g., sidewalks, crossings) in the Washington DC area. Similarly, we observe that only 5% of the bike-related tweets are negative, which may indicate a better level of service quality for biking facilities. However, we do not have enough data samples to confirm this conclusion.</p><p>Another relevant benefit of this approach is that we can use the geolocation of the tweets to identify the area from which they were posted. The geolocations come in two forms: an exact point location and a bounding box. 
Tweets that include exact geo-coordinates are precise and indicate the actual location of an incident or facility. For tweets whose location is given as a bounding box, we calculate the center of the bounding box to determine the location. In this case, we can gauge the precision of the location by computing the diagonal distance of the bounding box: the smaller the diagonal distance, the more precise the geolocation. As shown in Figure <ref type="figure">7</ref>(a), we map all the tweets based on their geolocations. From this information, we can identify zones that are less safe or lack sufficient safety measures for walking and biking.</p></div>
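The bounding-box handling described above can be sketched as follows. This is a minimal sketch, assuming corner coordinates in degrees and using the haversine formula with a 6371 km Earth radius for the diagonal distance; the sample coordinates are hypothetical values roughly covering central Washington DC, not data from the study.

```python
import math

def bbox_center_and_diagonal(lons, lats):
    """Given bounding-box corner longitudes/latitudes (degrees), return the
    center point and the diagonal length in km. A small diagonal means the
    inferred tweet location is more precise."""
    lon_min, lon_max = min(lons), max(lons)
    lat_min, lat_max = min(lats), max(lats)
    center = ((lon_min + lon_max) / 2.0, (lat_min + lat_max) / 2.0)

    # Haversine distance between opposite corners of the box.
    phi1, phi2 = math.radians(lat_min), math.radians(lat_max)
    dphi = phi2 - phi1
    dlmb = math.radians(lon_max - lon_min)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    diagonal_km = 2 * 6371.0 * math.asin(math.sqrt(a))
    return center, diagonal_km

# Hypothetical bounding box roughly covering central Washington DC.
center, diag = bbox_center_and_diagonal([-77.12, -76.91], [38.80, 38.99])
print(center, round(diag, 1))
```

A tweet with an exact point location would bypass this step entirely; the diagonal is computed only for bounding-box geolocations to flag imprecise points on the map.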
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">CONCLUSION</head><p>Active mobility, which mostly refers to walking and biking, is a major component of sustainable transportation. The majority of the world's population relies on cycling, walking, and other forms of human-powered transport to commute to work, schools, and public transport stations <ref type="bibr">(56)</ref>. Moreover, active transportation promotes a healthy lifestyle and is one of the most affordable and practical ways to reduce CO₂ emissions <ref type="bibr">(57)</ref>. Therefore, promoting walking and biking is critical for establishing people-oriented, sustainable cities that are safe and equitable. One appropriate way of maintaining a safer environment for active mobility is the continuous monitoring of existing infrastructure. However, traditional approaches to collecting information on user satisfaction and safety concerns about active transportation modes are costly, require constant maintenance, and can be time-consuming. Addressing some of these concerns, our study leverages social media data and offers low-cost support for monitoring existing active transportation facilities. The study presents a novel and systematic framework to overcome challenges in analyzing social media data in transportation using advanced language processing tools. The proposed method can be a cost-effective alternative for understanding the qualitative level of service of different existing facilities for active mobility. This paper presented the different components of the framework, such as the tweet filtering mechanism, topic pattern identification, and user perception of facilities (e.g., protected bike lanes, sidewalks). We validated the results on real-world data from Washington DC. Some limitations would require our attention in future research. One main shortcoming of the study is the small sample (i.e., number of tweets) used to carry out the analysis; a best practice is to collect data for three months. 
In the future, we will continue collecting data to analyze a larger dataset. Another urgent concern for social media data is misinformation, or the spread of fake news; however, this is out of our research scope. A few existing studies <ref type="bibr">(58)</ref><ref type="bibr">(59)</ref><ref type="bibr">(60)</ref> have proposed methods to identify misinformation in social media posts. In our future research, we will consider these methods to filter out misinformation or fake posts. Moreover, Twitter does not reveal users' sociodemographic characteristics (e.g., age, gender). Hence, we cannot ensure a representative sample in terms of gender and age groups. However, according to a recently published statistic <ref type="bibr">(61)</ref>, the distribution of Twitter users by age group is 38% for ages 18-29, 43% for ages 30-64, and 7% for ages 65+. Thus, Twitter covers a mixture of all population groups. We believe that inferring the sociodemographic information of Twitter users and checking the Twitter dataset's representativeness against the actual population is an avenue for future research.</p><p>Another research question arising from this study is how our results compare with the actual conditions of transportation facilities. Due to limited data availability, we could not complete this analysis. However, we explored all the available active data collection methods to extract more information on active transportation facilities. Moreover, the implemented classification models need further tuning: we find that the LSTM, bidirectional LSTM, and convolutional LSTM models have reasonably high AUC values but lower accuracy values. Although we used different feature extraction methods as well as different combinations of hyperparameters, the TF-IDF-based logistic regression method outperformed the other methods. As a future research direction, we will study whether users' perceptions of bike facilities influence bike-sharing demand. 
The availability of Capital Bikeshare data would make it possible to expand on the framework presented in this paper.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>List of Figures</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>FIGURE 1. A framework to collect information on active mobility for users' perception analysis</head></div></body>
		</text>
</TEI>
