<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>SmishViz: Towards A Graph-based Visualization System for Monitoring and Characterizing Ongoing Smishing Threats</title></titleStmt>
			<publicationStmt>
				<publisher>ACM</publisher>
				<date>06/19/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10620661</idno>
					<idno type="doi">10.1145/3714393.3726499</idno>
					
					<author>Seyed Mohammad Sanjari</author><author>Ashfak Md Shibli</author><author>Maraz Mia</author><author>Maanak Gupta</author><author>Mir_Mehedi Ahsan Pritom</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[SMS phishing (aka 'smishing') threats have grown to be a serious concern for mobile users around the globe. In cases of successful smishing, attackers take advantage of users' trust through deceptive text messages to trick them into downloading malicious content, disclosing private information, or becoming victims of fraud. Current studies on smishing mostly focus on classifying smishing (or spam) messages apart from benign ones as a means of defense. However, there is no systematic study characterizing smishing threats and their landscape that would let us monitor ongoing campaigns from a bird's-eye perspective and apply effective defenses. In this paper, we propose SmishViz, a graph-based visualization system that helps defenders (i.e., analysts) characterize ongoing smishing threats in the wild and monitor connected campaigns and campaign-operations through an effective graph visualization approach integrated with a state-of-the-art open-source visualization tool. This paper also provides a case study with a real-world smishing dataset to showcase the efficacy of the SmishViz system in practical use-case scenarios. Our case study results indicate that the proposed system can help defenders track and monitor ongoing smishing campaigns, understand attackers' tactics, formulate strategic defenses, and uproot the attack operations.
CCS Concepts: • Security and privacy → Mobile and wireless security; • Human-centered computing → Visualization toolkits; Graph drawings.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>In today's digital age, smartphones have become an essential part of our daily lives. We rely on them for communication, banking, shopping, and many other online activities. Unfortunately, this reliance has also made mobile device users a prime target for cyber-criminals. SMS phishing attacks, also known as 'smishing', have emerged as a significant threat, taking advantage of users' trust through text messages to trick them into revealing sensitive personal information or installing malicious software <ref type="bibr">[1,</ref><ref type="bibr">26]</ref>.</p><p>Recent studies have also revealed that users do fall victim to these smishing attacks and interact with smishing messages quite often <ref type="bibr">[20,</ref><ref type="bibr">25]</ref>.</p><p>Smishing attacks can have serious consequences for both individuals and organizations. A successful smishing attack through a mobile device can lead to identity theft, financial losses, data breaches, and reputation damage <ref type="bibr">[4]</ref>. Reports show a sharp 1,245% increase in smishing attacks in Q1 2023 over the previous quarter <ref type="bibr">[30]</ref>. According to fraud reports by the Federal Trade Commission (FTC), during Q1 of 2024 alone, total losses of $645.7 million were reported in the United States through fraudulent phone calls and text messages <ref type="bibr">[6]</ref>. 
Attackers are constantly coming up with new stealthy strategies to exploit users' trust, making the job even more challenging for defenders <ref type="bibr">[2]</ref>.</p><p>Traditional methods for smishing detection, such as blacklists based on the contents and URLs within texts, are becoming less effective as attackers continuously adapt their techniques and tactics <ref type="bibr">[13]</ref>.</p><p>To keep up with the ever-evolving threat landscape, it is essential to develop more advanced and robust defense mechanisms, which requires a deep understanding of the characteristics and infrastructures of smishing attacks. We believe the smishing ecosystem can be better understood through comprehensive analysis and continuous monitoring of real-world smishing attacks. Within real-world smishing datasets, various themes of smishing messages can usually be identified based on textual (string-based) similarity metrics <ref type="bibr">[13,</ref><ref type="bibr">31]</ref>. However, completely dissimilar text message groups with different topic themes might still be connected through their web-entity infrastructures (e.g., website URLs and domain names associated with the SMS). These connections might unveil larger campaign-operation owners who run multiple smishing campaigns with different topic themes targeting various brands. We hypothesize that building these web-infrastructure-level connections between various message groups and visualizing them through a connected graph may help defenders uncover campaign operations and dismantle them effectively. Hence, in this paper, we propose SmishViz, a graph-based visualization system, to observe and capture smishing campaigns and campaign-operations that are operated by the same bad actor (or actors) using common underlying web-entity infrastructures and/or similar message templates. 
We also propose to continuously monitor smishing messages in real time and extract campaign-operations to proactively take defensive actions against various ongoing smishing campaigns. In this paper, we make the following four major contributions.</p><p>&#8226; First, we propose to create similar topic-themed message clusters using sentence embeddings to characterize smishing messages into various meaningful groups and investigate them further to find within-group campaigns (i.e., messages sharing high textual similarity).</p><p>&#8226; Second, we propose to group messages with semantically similar topic themes together as clusters using a pre-trained BERTopic model, and then find groups with highly similar text patterns (i.e., template-based groups) within each cluster using the Ratcliff pattern recognition algorithm, identifying them as sub-clusters that represent campaigns. These individual sub-clusters from various topic-themed clusters are then connected based on shared web-entity infrastructures to form bigger campaign-operations that are possibly coordinated by the same bad actor(s).</p><p>&#8226; Third, we propose to visually analyze the campaign-operations as connected tripartite graphs and derive data-driven insights from each campaign-operation through a web-based visualization system integrated with the state-of-the-art D3.js visualization tool.</p><p>&#8226; Fourth, we present a case study with a recently published real-world smishing dataset to characterize smishing threats and showcase the practicality of the proposed visualization system in understanding smishing campaigns and operations, which can help defenders create further preventive measures.</p><p>Paper Outline: Section 2 discusses the current literature on SMS phishing attacks and defenses. Section 3 describes the research questions (RQs) and details the functionalities of the SmishViz system modules. Section 4 presents the case study with a real-world smishing dataset to answer the RQs and show the use-case scenarios. Section 5 discusses the limitations. 
Finally, Section 6 concludes the paper.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Works</head><p>Smishing attacks are growing rapidly, which is why the field has received increased attention from the security community in recent years. In a recent notable study, Nahapetayan et al. <ref type="bibr">[18]</ref> analyzed a large amount of data from public SMS gateways to understand SMS phishing tactics and characterize the underlying infrastructures. They analyzed SMS phishing campaigns and operations, identifying patterns such as multiple attacks targeting the same receivers' cell phone numbers and the use of shortened URLs, which has strongly inspired our study. In the literature, some studies have proposed using machine learning (ML) to analyze words and links in text messages for smishing detection, but have scoped their approach around a limited set of hand-picked clues or manually selected features <ref type="bibr">[17]</ref>. In another follow-up study, Jain et al. <ref type="bibr">[13]</ref> closely analyzed the words and URLs in smishing messages to find smishing indicators. Similarly, Goel and Jain <ref type="bibr">[9]</ref> proposed an ML-based classifier framework, which identifies smishing messages based on the SMS text contents. Next, Mishra and Soni <ref type="bibr">[17]</ref> introduced a promising NLP and ML-based tool named DSmishSMS for identifying smishing messages, but the evaluation is not comprehensive due to dataset limitations. In another study, Yeboah-Boateng et al. <ref type="bibr">[34]</ref> assessed the threats of phishing, smishing, and vishing attacks against mobile devices, while Wu et al. <ref type="bibr">[33]</ref> proposed defense schemes against phishing attacks on mobile platforms. Additionally, Hossain et al. 
<ref type="bibr">[12]</ref> proposed using deep learning (DL) models, such as CNN and LSTM, to detect spam and phishing SMS. These models achieved good results, but they only considered word-frequency features of the text, such as Term Frequency-Inverse Document Frequency (TF-IDF), and did not take into account some core aspects of a smishing attack like the associated URL, brand impersonation, and sender information. Furthermore, researchers have introduced computer vision techniques to analyze the visual similarity among phishing messages <ref type="bibr">[14]</ref>, but visual similarities are not always reflected in similarities at the word or metadata level, as stated in the phishing literature <ref type="bibr">[19]</ref>. These studies highlight the sophistication of mobile-based attacks, and our research builds upon this foundation by providing a graph-based, data-driven visualization system to further investigate the underlying smishing campaigns and bigger operations.</p><p>In addition, existing literature also discusses how generative AI can be abused by attackers for malicious purposes, particularly to create new and previously unseen smishing and phishing campaigns <ref type="bibr">[11,</ref><ref type="bibr">24,</ref><ref type="bibr">28]</ref>, which indicates that content-driven defense solutions based on prior knowledge may not be enough for detecting new, deceptive AI-generated campaigns. Moreover, researchers have also proposed leveraging LLM-empowered defense mechanisms to detect smishing messages with natural-language-based reasoning <ref type="bibr">[27]</ref>. Lastly, existing literature has also explored attack vectors, users' susceptibility, and users' awareness when exposed to smishing attacks by conducting user studies, concluding that users do fall victim to these attacks at an alarming rate <ref type="bibr">[8,</ref><ref type="bibr">20]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Methodology and System Overview</head><p>In this section, we detail the driving research questions and the problem formulation, and define the integrated modules of SmishViz and their functionalities. Research Questions (RQs): We address four major research questions (RQs) in this paper.</p><p>&#8226; RQ1: Can we identify clusters of messages with high similarity in meanings and themes (e.g., topics) and automatically label them with corresponding themed topics using data-driven text analysis?</p><p>&#8226; RQ2: Can we find groups of messages within a cluster that share templates or patterns, so that they can be identified as a same-origin (i.e., created by the same bad actor) campaign?</p><p>&#8226; RQ3: Is there evidence that these same-origin campaigns may connect multiple sub-groups of messages from multiple clusters to form campaign-operations based on common infrastructures or underlying connections?</p><p>&#8226; RQ4: How can we effectively visualize the campaigns and campaign-operations to aid cyber defenders (i.e., analysts)?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Problem Formalization and Notations</head><p>The idea behind the SmishViz system is to integrate with a live SMS dataset that continuously collects data from the wild and then investigate the data to monitor ongoing smishing campaigns. In order to formally define the system, let us first define some notations and terminologies that are used in the rest of the paper:</p><p>&#8226; m_i &#8712; D_sms represents the i-th SMS in the dataset D_sms, with m_i = (t_i, s_i, w_i), where t_i denotes the text content of the i-th SMS after removing URLs or links; s_i denotes the sender information (email, phone numbers); and w_i represents the web-entity present in the i-th SMS, which can be mapped to a corresponding domain name d_i or URL string u_i.</p><p>&#8226; C_J represents the J-th cluster, generated using BERTopic, a topic modeling technique that leverages BERT embeddings and density-based clustering. Each cluster contains messages with semantically similar content, grouped together based on their contextual relationships, and is assigned a corresponding theme h_J for the whole cluster.</p><p>&#8226; SC_K &#8712; C_J represents the K-th sub-cluster of messages within cluster C_J that can be grouped together because of high pattern similarity (i.e., using a common template). The cluster C_J can also be presented as a set of sub-clusters, C_J = {SC_1, SC_2, ..., SC_K}, where the sub-clusters may vary in text patterns but have the same or similar topics. Moreover, each of these sub-clusters also represents an individual campaign that can be safely assumed to originate from the same bad actor.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>&#8226; CO_L represents the L-th campaign-operation, which connects multiple sub-clusters from the same or different clusters based on common web-entity infrastructures between them. A campaign-operation helps to identify sub-clusters that have different themes and topics but are connected through the underlying infrastructure, which indicates a common source of origin. Formally, CO_L = { SC_X &#8712; C_I &#8746; SC_Y &#8712; C_J } when there is a common web entity between messages in sub-cluster SC_X &#8712; C_I and sub-cluster SC_Y &#8712; C_J.</p><p>Figure <ref type="figure">1</ref> presents the system overview of SmishViz and its modules. First, it takes the raw bulk SMS data D_sms as input from any public smishing repository. Next, it curates each SMS m_i &#8712; D_sms by extracting (t_i, s_i, w_i) tuples. Then, it applies BERTopic <ref type="bibr">[10]</ref> embeddings to find semantic similarity among these messages and group them together into a set of clusters {C_1, C_2, ..., C_J}, where each cluster will ideally contain messages of a unique topic or theme. Next, we apply pattern matching algorithms between messages within a cluster boundary to find sub-clusters (SC_K &#8712; C_J) within a particular cluster C_J that exhibit very high similarity in patterns, meaning they are most likely using a common template. Additionally, we connect these sub-clusters from different clusters based on common web-entity infrastructures to form campaign-operations CO_L. Each campaign-operation, by our definition, can be presented as a connected graph and visualized for further analysis using state-of-the-art visualization tools. Furthermore, any campaign-operation can be identified as a set of connected smishing threats originating from the same source or bad actors. 
Here, each campaign-operation can be represented as a tripartite graph network where sub-cluster nodes are connected to message nodes on one side and web-entity nodes on the other side, as shown in Figure <ref type="figure">2</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">SMS Data Collection and Pre-Processing Module</head><p>In this module, we collect raw SMS data D_sms from any public-facing smishing dataset repository. However, raw data usually contain noise and inconsistencies that make our analysis and clustering task more challenging. To deal with these issues, we use a data pipeline that prepares each individual SMS m_i. First, we extract the tuple (t_i, s_i, w_i) from SMS m_i. Next, we further clean up the extracted message text (i.e., t_i) by removing stop words, digits, and punctuation to get a clean version of the text <ref type="bibr">[35]</ref> without losing the meaningful context. Additionally, some more technical details on cleaning URLs in raw SMS data that contain spurious space characters, introduced by the use of optical character recognition techniques, are discussed in Appendix A.</p></div>
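<div xmlns="http://www.tei-c.org/ns/1.0"><p>The curation step above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in this paper: the function name curate_sms, the URL regular expression, and the tiny stop-word list are all hypothetical, and the OCR-related URL repair of Appendix A is omitted.</p>
<ab><![CDATA[
```python
import re
import string
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "to", "is", "your", "at", "for", "of", "and", "on"}

# Assumed URL pattern: anything starting with http(s):// or www. up to the next whitespace.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def curate_sms(raw_text, sender):
    """Return an illustrative (text, sender, web_entities) tuple for one raw SMS."""
    # Web entities: keep both the URL string and its domain name (hostname).
    urls = [u.rstrip(".,;!?)") for u in URL_PATTERN.findall(raw_text)]
    web_entities = []
    for url in urls:
        full = url if "://" in url else "http://" + url
        host = urlsplit(full).hostname or ""
        web_entities.append({"url": url, "domain": host.lower()})

    # Text: remove URLs first, then digits, punctuation, and stop words.
    text = URL_PATTERN.sub(" ", raw_text).lower()
    text = text.translate(str.maketrans("", "", string.digits + string.punctuation))
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens), sender, web_entities
```
]]></ab>
<p>Removing the URLs before stripping punctuation keeps URL fragments out of the cleaned text, so that the later semantic clustering sees only the message wording while the web entities are kept separately for graph construction.</p></div>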
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Semantic Similarity-Based Clustering Module With BERTopic</head><p>Linguistic analysis of smishing texts can reveal indicative patterns and red flags associated with malicious characteristics. The main motivation behind the semantic-similarity-based clustering approach is to group together messages that cover the same thematic topics (e.g., delivery theme, account theme, lottery theme) and potentially similar types of target brands (e.g., FedEx, USPS). Moreover, we hypothesize that smishing messages originating from the same bad actors would share common textual patterns (i.e., linguistic templates) and common web-entity infrastructures. Thus, grouping the text messages based on semantic similarity would further help us uncover smishing campaigns and, later, campaign-operations that are run and operated by the same bad actors. Since BERTopic <ref type="bibr">[10]</ref> performs well at creating topics for a set of documents, we propose to use it as a clustering approach in which messages with the same topic theme are grouped together in one cluster, eventually yielding various clusters each labeled with a different theme.</p><p>Additionally, BERTopic provides a comprehensive approach to the clustering requirement through sentence embeddings instead of word embeddings. This method uses UMAP <ref type="bibr">[16]</ref> for dimensionality reduction before density-based clustering, which separates a noise cluster of messages from the established topic clusters. We provide additional details on how the BERTopic approach works and why we have preferred it in Appendix B.</p><p>Figure 2: Tripartite graph representation illustrating the connections between sub-cluster nodes, message nodes, and web-entity nodes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.1">Semantic Refinement of Clusters Through Hierarchical Clustering.</head><p>To see the semantic relationships between clusters, we propose to leverage BERTopic's visualize_hierarchy() function <ref type="bibr">[10]</ref>. This hierarchical clustering view can help us merge or split initial clusters for better interpretability, ensuring that each final cluster represents a unique topic theme and that no two clusters share the same topic theme.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.2">Sub-Cluster Formation Within Clusters.</head><p>We also propose to group messages within a cluster into multiple sub-groups based on high pattern similarity (i.e., suggesting that these messages use a common template). For this, we propose using the Ratcliff pattern recognition algorithm <ref type="bibr">[21]</ref> to compare messages from the same cluster, identify similar message patterns, and create sub-clusters of messages. Appendix C details the functionality of the Ratcliff algorithm.</p></div>
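<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a rough illustration of sub-cluster formation, the sketch below uses Python's standard difflib.SequenceMatcher, whose matching procedure is closely related to the Ratcliff/Obershelp approach, as a stand-in for the algorithm detailed in Appendix C. The greedy grouping strategy, the function names, and the 0.8 similarity threshold are assumptions made for illustration only.</p>
<ab><![CDATA[
```python
from difflib import SequenceMatcher


def pattern_similarity(a, b):
    """Similarity in [0, 1]: twice the matched characters over the total length."""
    return SequenceMatcher(None, a, b).ratio()


def form_sub_clusters(messages, threshold=0.8):
    """Greedily group the messages of one cluster into template-like sub-clusters.

    Each message joins the first sub-cluster whose representative (its first
    member) is at least `threshold` similar; otherwise it starts a new one.
    Returns a list of lists of message indices.
    """
    sub_clusters = []
    for idx, text in enumerate(messages):
        for members in sub_clusters:
            representative = messages[members[0]]
            if pattern_similarity(text, representative) >= threshold:
                members.append(idx)
                break
        else:
            # No existing sub-cluster was similar enough.
            sub_clusters.append([idx])
    return sub_clusters
```
]]></ab>
<p>Comparing against a single representative keeps the sketch short; comparing against every member, or using a different threshold, would change which messages end up sharing a sub-cluster.</p></div>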
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Campaign-Operation Graph Generation and Storage Module</head><p>This module connects multiple cohesive sub-clusters together based on common or shared web-entity infrastructures among these sub-clusters to form campaign-operations that are safely assumed to be owned and operated by the same bad actors. We envision that the campaign-operations would give defenders insights on coordinated big smishing campaigns, their topic themes, target brands, templates, and underlying web infrastructures, which can then be used to strategize defenses against these threats. Here, we frame this as a graph problem, where each campaign-operation can be represented as a connected tripartite graph network comprising the following nodes and edges:</p><p>&#8226; Node Types: Sub-cluster node [SC_k], Message node [m_i], and Web-entity node [w_j].</p><p>&#8226; Edge Types: There are two edge types: (1) edges between node SC_k and node m_i, representing messages within a particular sub-cluster; and (2) edges between SC_k and w_j, representing web-entities used as infrastructure within a sub-cluster.</p><p>As a whole, each connected graph component represents a campaign-operation run and operated by the same bad actor. An illustrative example graph is presented in Figure 2. To achieve this, we collect all the web entities linked to messages residing within a sub-cluster and store this information. Each message is assigned a corresponding sub-cluster ID along with its parent cluster ID and linked to its associated web entities. This ensures that campaign graphs are built with both cluster-level and sub-cluster-level precision. If a new message is entirely novel and uses a completely new domain infrastructure, it will not fit into any existing cluster or campaign graph. While such cases are more challenging to detect, they impose significant cost and effort on attackers to evade detection. 
Next, we generate a JSON file to encode and store the graph data structure, where each web-entity acts as a root key linking the sub-clusters in a hierarchical format. An example JSON file structure is highlighted in Appendix D. These JSON files are then imported into the visualization tool to visualize the campaign-operation graphs swiftly.</p></div>
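<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal sketch of this module, assuming a simple in-memory representation, is given below. It indexes sub-clusters under each web entity (the web entity acting as the root key) and groups sub-clusters that are linked, directly or transitively, by shared web entities. The dictionary layout, the function names, and the use of a union-find structure are hypothetical choices for illustration and need not match the JSON structure shown in Appendix D.</p>
<ab><![CDATA[
```python
import json
from collections import defaultdict


def build_entity_index(sub_clusters):
    """Map each web entity (root key) to the sub-clusters that use it."""
    index = defaultdict(list)
    for (cluster_id, sub_id), info in sub_clusters.items():
        for entity in sorted(info["web_entities"]):
            index[entity].append(
                {"cluster": cluster_id, "sub_cluster": sub_id, "messages": info["messages"]}
            )
    return dict(index)


def campaign_operations(sub_clusters):
    """Group sub-clusters connected through shared web entities (union-find)."""
    parent = {key: key for key in sub_clusters}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # web entity -> first sub-cluster observed using it
    for key, info in sub_clusters.items():
        for entity in info["web_entities"]:
            if entity in first_seen:
                parent[find(key)] = find(first_seen[entity])
            else:
                first_seen[entity] = key

    groups = defaultdict(list)
    for key in sub_clusters:
        groups[find(key)].append(key)
    # In this sketch, only groups joining two or more sub-clusters are reported.
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)


def to_json(sub_clusters):
    """Serialize the web-entity index so a front end could load it."""
    return json.dumps(build_entity_index(sub_clusters), indent=2, sort_keys=True)
```
]]></ab>
<p>Keying the stored structure by web entity makes the shared infrastructure explicit: any root key that lists more than one sub-cluster is, by itself, evidence of a link between those sub-clusters.</p></div>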
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5">Campaign-Operation Graph Visualization and Analysis Module</head><p>One of the primary goals of the SmishViz system is to uncover bigger smishing campaigns and campaign-operations that would otherwise not be exposed. To visually analyze the relationships between different message groups and their connected web infrastructures, we propose graph-based visualization with a state-of-the-art visualization tool, such as D3.js <ref type="bibr">[5]</ref>, to monitor and analyze campaign-operations. We find that the D3.js visualization tool is highly compatible with JSON-format graph data and can be easily integrated with any web application or web service. Additionally, to gain more insights on campaign-operations, defenders (i.e., analysts) can explore the following elements from the graph visualizations:</p><p>&#8226; Central web-entities are determined as the highly connected web-entities that are shared across multiple sub-clusters and/or clusters.</p><p>&#8226; The size of a campaign-operation is visible, including the total number of sub-clusters and the number of messages in each of the sub-clusters.</p><p>The visualization tool we use should also provide zoom-in/zoom-out capabilities on specific sub-clusters or web entities, so that many connected nodes can be examined effectively when one campaign-operation graph contains too many nodes. We can also filter the data by campaign topic theme to focus on specific thematic clusters, as all our clusters are labeled with specific topic themes. Moreover, during graph analysis with SmishViz, analysts can use metrics such as degree centrality, graph density, and connected components to aid their analysis. To illustrate, by computing degree centrality, analysts can find the web entities or sub-clusters that attackers rely on most, which can be critical for defensive strategies. 
Measuring graph density helps to assess the overall cohesion of a campaign-operation, showing how strongly the components are connected and providing insight into the campaign's structure. Finally, the connected-components metric can detect separate campaign-operations, which allows defenders to isolate and analyze distinct smishing campaigns. By analyzing these metrics, defenders can prioritize their investigation and mitigation efforts.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experimental Case Study: SmishViz with SmishTank</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">SmishTank Data Collection and Pre-processing</head><p>For the case study, we collected a snapshot of the Smishtank SMS dataset <ref type="bibr">[32]</ref>, denoted as D_sms, where |D_sms| = 1,062. This dataset provides a good resource of smishing messages, which is ideal for analyzing text patterns, including impersonation of brands, topic themes, and usage of web entities. The Smishtank dataset revealed the following key characteristics:</p><p>&#8226; URLs in messages: 88.6% of these messages contained URLs, many exhibiting suspicious patterns, such as random characters or unusual domain names.</p><p>&#8226; Brand impersonation: 65.91% of messages included identifiable brand names. A total of 182 unique brands were impersonated, spanning industries such as financial services (e.g., Bank of America), e-commerce platforms (e.g., Amazon), delivery services (e.g., USPS), and streaming platforms (e.g., Netflix).</p><p>&#8226; Web-entity patterns: The dataset included frequent use of URL shortener services like bit.ly and tinyurl.com, which attackers are possibly using to obscure malicious domains.</p><p>Some high-level basic statistics on the Smishtank SMS dataset are presented in Table 1. To prepare the data as input for our SmishViz system, we conduct data cleaning following the process described in Section 3.2. </p></div>
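<div xmlns="http://www.tei-c.org/ns/1.0"><p>The three graph metrics named in Section 3.5 (degree centrality, graph density, and connected components) can be made concrete with a small, dependency-free sketch. It assumes an undirected graph given as a list of edges between node labels; the function names and the toy node labels are hypothetical, and a graph library could be used instead.</p>
<ab><![CDATA[
```python
from collections import defaultdict, deque


def build_adjacency(edges):
    """Undirected adjacency sets from (u, v) edge pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def degree_centrality(adj):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(adj)
    if n > 1:
        return {node: len(nbrs) / (n - 1) for node, nbrs in adj.items()}
    return {node: 0.0 for node in adj}


def density(adj):
    """Fraction of possible edges that are present: 2m / (n (n - 1))."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0


def connected_components(adj):
    """Breadth-first search; each component is a candidate campaign-operation."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        components.append(component)
    return components
```
]]></ab>
<p>On such a graph, a web-entity node with high degree centrality is one shared by many sub-clusters, and each connected component with more than one sub-cluster corresponds to a candidate campaign-operation.</p></div>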
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Semantic Clustering of Smishtank Messages</head><p>We analyze the Smishtank dataset to create clusters based on semantic similarity in terms of meanings and topic themes. These themes may include specific topics such as a streaming service theme, account verification theme, delivery theme, promotional offer theme, and so on. First, we apply BERTopic <ref type="bibr">[10]</ref> to group messages based on their contextual semantic sentence embeddings, which resulted in 31 message groups with nuanced meanings and distinct thematic patterns. For further refinement of the clusters, we apply hierarchical clustering using BERTopic's visualize_hierarchy() function, eventually ending up with several distinct clusters and an outlier group containing noisy or irrelevant messages. Figure <ref type="figure">3</ref> highlights the hierarchical clustering with BERTopic and the semantic distances between the clusters that are used for cluster refinement. The cluster refinement process helps us merge multiple groups of messages that have very similar topic themes to eventually find unique topic clusters. For example, after the generic BERTopic clustering we have two different message groups that revolve around USPS delivery scams, which are then merged to form one cluster for the USPS delivery theme. This approach revealed 11 distinct clusters of different sizes (ranging from 30 to 223 messages), as shown in Table <ref type="table">2</ref>. Figure <ref type="figure">5</ref> shows the word distributions for each cluster (topic). This clustering approach achieved strong thematic cohesion, with an overall topic coherence score of 0.702, as calculated using the Gensim Python library <ref type="bibr">[22]</ref>. Figure <ref type="figure">4</ref> presents the topic-wise coherence score for each cluster. 
To label the clusters systematically with topic themes, we employed a multi-step process. By employing semantic clustering methods and a systematic labeling approach, we effectively answered RQ1 by identifying and grouping together messages with high similarity that share a common topic theme. The topic theme label for each of the clusters is presented in Table <ref type="table">3</ref>. Moreover, as per our sub-cluster formation process, we apply the Ratcliff pattern recognition algorithm to each of the 11 clusters (including C_outlier). The distribution of sub-clusters within these 11 message clusters is also highlighted in Table 3, which effectively answers RQ2. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Campaign-Operation Graph Generation for Smishtank</head><p>We analyze the Smishtank dataset to uncover various campaign-operations by generating campaign-operation graphs. We use sub-clusters as nodes in our tripartite graph to reveal campaign operations by examining shared web entities among the sub-clusters. To generate these graphs, we first extract all URLs from the messages within each sub-cluster and identify the corresponding web entities (e.g., domain names, full URLs, or hostnames). These web entities serve as the root nodes of our tripartite graph structure.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.1">Evidence of Same-Origin Campaign-Operations.</head><p>Our findings show that several sub-clusters from either the same or different parent clusters contain groups of messages sharing common or similar web entities, indicating they are operated by the same bad actor. For example, if we look deeper into cluster C_3, we find:</p><p>&#8226; The 'direct-capitals.com' website belongs to both SC_11 &#8712; C_3 and SC_12 &#8712; C_3.</p><p>&#8226; The 'direct-paying.com' website belongs to both SC_2 &#8712; C_3 and SC_3 &#8712; C_3.</p><p>&#8226; The 'secure-fundhub.com' website belongs to sub-clusters SC_2, SC_11, SC_13, and SC_16 within C_3, and is consistently used in financial scams.</p><p>This infrastructure-reuse tactic suggests coordination by the same bad actor(s) to run and operate smishing campaigns. We also observe examples of campaign graphs within the Smishtank dataset that connect multiple sub-clusters, such as sub-clusters SC_11 &#8712; C_3 and SC_13 &#8712; C_3 with the common web-entity 'easytransaction.com', as reported in Appendix E. Attackers also frequently used URL shortener services such as bit.ly, tinyurl.com, and rebrand.ly across clusters to obscure the actual domains and evade detection mechanisms <ref type="bibr">[7]</ref>; the distribution of the most frequent domains across clusters is presented in Table <ref type="table">4</ref>. 
The above analysis answers RQ3, as we have found campaign-operations connecting multiple sub-clusters.  <ref type="bibr">[29]</ref>. These domains often include slight variations, such as character substitutions (e.g., replacing 'o' with '0' or 'l' with '1'), additional characters, or minor alterations to the domain structure, to make them indistinguishable to a casual observer. These domains, which are frequently registered through the same registrar organization within a short period, suggest a coordinated effort to maintain consistency in campaign operations while diversifying the infrastructure. This tactic enables attackers to keep evading domain-based detection systems while continuing to target users through domains that appear legitimate. By using VIDNs, attackers expand their reach across sub-clusters and create a robust infrastructure to support their malicious smishing campaigns.</p><p>To illustrate, the URLs 'us.ps.track-pack-add.com' and 'usps.trackpkg.com' are both observed in a USPS delivery-themed SMS phishing campaign. These web-entities exploit their resemblance to legitimate USPS-related URLs, using slight variations such as the subdomain string us.ps. and the hyphenation track-pack-add to deceive users while maintaining thematic consistency. Upon analyzing their registration details, we found that both domains were registered through the same registrar organization within a two-day period, as listed in Table 5, strongly indicating a deliberate and coordinated effort for a same-origin campaign. Additionally, the domains were linked to similar smishing messages instructing recipients to track their packages by clicking on the provided URLs, which redirect them to malicious pages. These examples further highlight a coordinated effort by bad actors to evade detection by registering multiple similar domains, which supports our hypothesis of bigger campaign-operations.  
Figure <ref type="figure">7</ref> presents the top 30 domain names and reports how often each appears across messages as unique URLs. Moreover, Table <ref type="table">2</ref> shows the unique domain counts for each of the 11 clusters; we observe that clusters often use a variety of unique domains to carry out their attacks, so that they cannot be stopped by domain-name-based blocking alone. Another interesting insight is the use of WhatsApp-owned domain names such as whatsapp.com and wa.me, which create direct chat links that initiate a chat conversation. We have also observed the use of app.link, a unique mobile URL that takes users to a specific in-app page if the app is already installed on their phone, which reveals the attackers' tactic of leading users to desired mobile apps. This analysis indicates that visualizing these campaigns as connected graphs and analyzing the web entities may aid defenders in designing defenses against, and blocking, larger smishing campaigns effectively.</p></div>
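<div xmlns="http://www.tei-c.org/ns/1.0"><p>The shared-web-entity analysis described in this section can be illustrated with a minimal Python sketch. It is not the SmishViz implementation: the URL pattern, the function names extract_hostnames and shared_entities, the sub-cluster labels, and the message texts are all assumptions made for illustration, and only the domain names are taken from the case study. The sketch handles hostnames only; IP addresses and full URLs would need their own extraction step.</p>

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Assumed, simplified pattern: optional scheme, dotted hostname, optional path.
URL_PATTERN = re.compile(
    r"(?:https?://)?(?:[\w-]+\.)+[a-z]{2,}(?:/\S*)?", re.IGNORECASE
)


def extract_hostnames(message):
    """Return the set of lower-cased hostnames found in one SMS text."""
    hosts = set()
    for match in URL_PATTERN.findall(message):
        # urlparse only fills in the hostname when a scheme is present.
        url = match if "://" in match else "http://" + match
        host = urlparse(url).hostname
        if host:
            hosts.add(host.lower())
    return hosts


def shared_entities(sub_clusters):
    """Map each hostname to the sub-clusters that use it, keeping only
    hostnames that appear in two or more sub-clusters."""
    usage = defaultdict(set)
    for name, messages in sub_clusters.items():
        for message in messages:
            for host in extract_hostnames(message):
                usage[host].add(name)
    return {
        host: sorted(names) for host, names in usage.items() if len(names) >= 2
    }


# Hypothetical message texts; only the domain names come from the case study.
example = {
    "cluster3/sub2": ["Refund pending, verify at https://secure-fundhub.com/verify"],
    "cluster3/sub11": [
        "Claim your funds: secure-fundhub.com/claim",
        "Pay via direct-capitals.com",
    ],
    "cluster3/sub12": ["Loan approved, visit http://direct-capitals.com/apply"],
}

if __name__ == "__main__":
    for host, names in shared_entities(example).items():
        print(host, names)
```

<p>Each hostname returned by shared_entities would correspond to a web-entity node linking two or more sub-cluster nodes, which is the kind of connection the campaign-operation graphs are meant to surface.</p></div>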
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.2">Visualizing Connections Between Thematic Clusters and the Outlier Cluster</head><p>Thematic clusters show strong internal consistency in their web-entity usage, meaning that no URL is shared between sub-clusters of different thematic parent clusters. However, upon examining the outlier cluster, we find cases where web entities from sub-clusters within a specific thematic cluster are also re-used within one of the sub-clusters of the outlier cluster. These overlaps suggest the presence of a potentially larger campaign-operation that spans both thematic and outlier clusters. In Figure <ref type="figure">9</ref>, we present one of the connected tripartite graphs formed from connections between sub-clusters of thematic cluster 5 and sub-clusters of the outlier cluster (Outlier). This also suggests that, in a live SmishViz system, the Outlier cluster can host future thematic clusters that do not yet have enough data points to form an independent cluster. This observation indicates that attackers might be conducting operations that are not yet fully represented in the thematic clusters but are connected through the infrastructure present in the outlier cluster.</p><p>Examples of campaign-operations connecting a regular thematic cluster with the outlier cluster: we provide a couple of examples from our observations that demonstrate the possibility of larger, evolving campaigns connecting thematic clusters to messages within the outlier cluster through shared web infrastructure.</p><p>&#8226; Sub-cluster 2 of cluster 5 (USPS Delivery scams) shares the domain uspsusa-us.com with sub-cluster 102 of the Outlier cluster. 
&#8226; Sub-cluster 4 (Survey-based Reward scams) shares the IP address '107.175.219.12' with another sub-cluster within the outlier cluster.</p><p>With the above graph-based visualizations, we answer RQ4 by demonstrating how SmishViz aids defenders in uncovering and understanding interconnected smishing campaigns.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Limitations</head><p>Clustering messages and connecting them under campaign-operations play a vital role in understanding the tactics and the current landscape of smishing threats. By identifying message themes, similarities in text patterns, and shared infrastructure, analysts can prioritize their defense efforts against larger, coordinated operations and proactively detect new smishing attacks. Although the SmishViz system proposed in this paper offers useful insights, the paper has the following limitations. First, due to the dynamic threat landscape for smishing, the analysis on Smishtank may not capture all diverse attacks because the data is limited. Second, the Smishtank community is not yet large or active enough to vote on validating the smishing texts, which implies that Smishtank cannot be treated as ground-truth labeled data for smishing. Third, similarity-based message categorization may need further evaluation on different datasets, where more similarly themed messages can be captured under the same message clusters. Fourth, the system's reliance on outlier clusters to handle new or unstructured messages highlights a gap in detecting newly themed smishing campaigns. Introducing incremental clustering or anomaly detection methods in the future could enhance adaptability to evolving threats. 
Fifth, the scalability of the tripartite graph-based visualization may pose challenges as datasets grow larger, potentially impacting efficiency and clarity. This can be addressed in the future by exploring graph simplification techniques, hierarchical clustering, or dynamic filtering mechanisms to maintain the system's usability and interpretability at scale. Sixth, the proposed SmishViz system has not been evaluated on usability metrics by real analysts; future user studies can evaluate and improve the functionality of the SmishViz platform. Seventh, we have not fully highlighted the graph analysis capabilities in this paper, which can be explored further in future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>In summary, we took a close look at smishing campaigns using a real-world dataset from smishtank.com. We found patterns, themed topics, and common web infrastructure that can help identify and characterize smishing campaign-operations. Characterizing the campaign-operations helps us understand the attackers' strategies and tactics. While our case study is scoped to a specific dataset, the methods and techniques can be applied to other datasets from various sources. We also envision building a live system for tracking smishing campaigns and campaign-operations in the wild. The proposed SmishViz system should be a public-facing live system that anyone can request to integrate with a newer dataset, and that dynamically generates ongoing campaign-operation graphs for visualization and further analysis. We believe a live system can help defenders build defensive strategies against newer smishing attacks by continuously monitoring ongoing smishing campaigns. The code and relevant resources are shared in the following GitHub repository to enable the reproducibility of our results: <ref type="url">https://github.com/varnicm/SmishViz-Project/</ref>.</p><p>threshold, HDBSCAN automatically identifies flat clusters by simplifying complex hierarchies. This adaptability is particularly useful for datasets with variable message densities, enabling effective handling of both dense and sparse regions without classifying sparse regions as noise. Finally, BERTopic assigns a topic label to each cluster, summarizing its central theme. By generating topic-level representations, BERTopic links all messages in a cluster, offering an advantage over traditional methods like TF-IDF <ref type="bibr">[3]</ref>, which rely on individual message features. These components make BERTopic an effective and justified approach for our clustering needs. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. How Does Ratcliff Pattern Recognition Work?</head><p>The Ratcliff string matching algorithm calculates the similarity between two strings as twice the number of matching (overlapping) characters divided by the total number of characters in the two strings. Thus, the similarity of two messages i and j can be calculated as <formula notation="tex">\mathrm{Sim}(i, j) = \frac{2 \times \mathrm{Matching}(i, j)}{|m_i| + |m_j|}</formula>, where Matching(i, j) denotes the number of overlapping characters between messages i and j, and |m_i| and |m_j| are the lengths of the two messages in characters.</p></div>
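<div xmlns="http://www.tei-c.org/ns/1.0"><p>The similarity ratio described in Appendix C can be sketched with Python's standard library. The difflib module documents its SequenceMatcher as using a matching procedure closely related to the Ratcliff/Obershelp approach, and its ratio is defined as twice the number of matches divided by the total length of both sequences. Whether SmishViz uses difflib is an assumption here, and the function name ratcliff_similarity is hypothetical.</p>

```python
from difflib import SequenceMatcher


def ratcliff_similarity(msg_i: str, msg_j: str) -> float:
    """Return 2 * Matching(i, j) / (|m_i| + |m_j|), a value in [0, 1]."""
    # autojunk=False turns off difflib's popularity heuristic so that every
    # character of a long message takes part in the matching.
    matcher = SequenceMatcher(None, msg_i, msg_j, autojunk=False)
    # Matching(i, j): total size of the matching blocks that difflib finds.
    matching = sum(block.size for block in matcher.get_matching_blocks())
    total = len(msg_i) + len(msg_j)
    # Two empty messages are treated as identical, as difflib does.
    return 2.0 * matching / total if total else 1.0


if __name__ == "__main__":
    # "bcd" is the common block: 2 * 3 / (4 + 4)
    print(ratcliff_similarity("abcd", "bcde"))
```

<p>Calling matcher.ratio() gives the same value; the explicit sum is written out only to mirror the terms of the formula.</p></div>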
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. An Example JSON File Structure To Store Campaign-Graph</head></div></body>
		</text>
</TEI>
