<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Convolutional autoencoder-based ground motion clustering and selection</title></titleStmt>
			<publicationStmt>
				<publisher>Soil Dynamics and Earthquake Engineering</publisher>
				<date>04/01/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10573397</idno>
					<idno type="doi">10.1016/j.soildyn.2025.109240</idno>
					<title level='j'>Soil Dynamics and Earthquake Engineering</title>
<idno type="issn">0267-7261</idno>
<biblScope unit="volume">191</biblScope>
<biblScope unit="issue">C</biblScope>					

					<author>Yiming Jia</author><author>Mehrdad Sasani</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Ground motion selection has become increasingly central to the assessment of earthquake resilience. The selection of ground motion records for use in nonlinear dynamic analysis significantly affects structural response. This, in turn, will impact the outcomes of earthquake resilience analysis. This paper presents a new ground motion clustering algorithm, which can be embedded in current ground motion selection methods to properly select representative ground motion records that a structure of interest will probabilistically experience. The proposed clustering-based ground motion selection method includes four main steps: 1) leveraging domain-specific knowledge to pre-select candidate ground motions; 2) using a convolutional autoencoder to learn low-dimensional underlying characteristics of candidate ground motions’ response spectra – i.e., latent features; 3) performing k-means clustering to classify the learned latent features, equivalent to clustering the response spectra of candidate ground motions; and 4) embedding the clusters in the conditional spectra-based ground motion selection. The selected ground motions can represent a given hazard level well (by matching conditional spectra) and fully describe the complete set of candidate ground motions. Three case studies for modified, pulse-type, and non-pulse-type ground motions are designed to evaluate the performance of the proposed ground motion clustering algorithm (convolutional autoencoder + k-means). Considering the limited number of pre-selected candidate ground motions in the last two case studies, the response spectra simulation and transfer learning are used to improve the stability and reproducibility of the proposed ground motion clustering algorithm.
The results of the three case studies demonstrate that the convolutional autoencoder + k-means can 1) achieve 100% accuracy in classifying ground motion response spectra, 2) correctly determine the optimal number of clusters, and 3) outperform established clustering algorithms (i.e., autoencoder + k-means, time series k-means, spectral clustering, and k-means on ground motion influence factors). Using the proposed clustering-based ground motion selection method, an application is performed to select ground motions for a structure in San Francisco, California. The developed user-friendly codes are published for practical use.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Ground motion (GM) selection has received increasingly more attention in earthquake resilience studies in the past decades, because it provides the necessary link between seismic hazard and structural response <ref type="bibr">[1]</ref>, <ref type="bibr">[2]</ref>, <ref type="bibr">[3]</ref>, <ref type="bibr">[4]</ref>, <ref type="bibr">[5]</ref>, <ref type="bibr">[6]</ref>, <ref type="bibr">[7]</ref>, <ref type="bibr">[8]</ref>, <ref type="bibr">[9]</ref>. Selected GMs can significantly affect the nonlinear dynamic response of a structure <ref type="bibr">[2]</ref>, <ref type="bibr">[3]</ref>, which provides information used to define the engineering demand parameters upon which earthquake resilience evaluation relies (e.g., the maximum transient and permanent interstory drift indices). In general, GM selection can affect conclusions regarding performance-based assessment, such as structural safety and community resilience. Meanwhile, owing to the evolution of engineering software (e.g., OpenSees <ref type="bibr">[10]</ref>), complex structures can be modeled and analyzed nonlinearly. The computational cost of carrying out reliable nonlinear time history analyses of complex structures is high, however, and therefore necessitates limiting the number of selected GMs. Thus, methods for GM selection have two objectives: 1) selecting GMs that can represent a given seismic hazard level and 2) selecting as few GMs as possible to minimize the number of required nonlinear dynamic analyses. Spectral matching-based GM selection methods have been widely used to achieve the above-mentioned objectives. 
For community or regional resilience analysis, one approach is that the response spectra of GMs are calculated, scaled at a period of interest, and collectively matched with a target mean and variance of response spectral values of conditional spectra (CS) over a range of periods <ref type="bibr">[3]</ref>, <ref type="bibr">[4]</ref>, <ref type="bibr">[11]</ref>, <ref type="bibr">[12]</ref>, <ref type="bibr">[13]</ref>, <ref type="bibr">[14]</ref>, <ref type="bibr">[15]</ref>. In these selection methods, the candidate GMs are pre-selected from a database for a given seismological condition (e.g., magnitude, distance, rupture mechanism, site class, etc.). A set of GMs is then selected from the pool of candidate GMs to match the CS. In the methods improved by <ref type="bibr">[2]</ref>, <ref type="bibr">[7]</ref>, <ref type="bibr">[8]</ref>, <ref type="bibr">[9]</ref>, a range of scale factors is employed to remove unrealistic candidate GMs, and a portion of the selected GMs are chosen to be pulse-type since pulse-type GMs have a larger damage potential than non-pulse-type GMs.</p><p>With the existing methods, even if the response spectra of selected GMs match the CS, the selected GMs may not properly represent the candidate GMs.</p><p>One way to further improve the current CS-based GM selection methods is to classify GM response spectra (either directly or based on their underlying characteristics) and proportionally select the GMs based on the number of candidate GMs in each cluster to match the CS. In other words, the GM clustering algorithm is embedded in the GM selection process as an additional step between GM pre-selection and CS matching.</p><p>Proportionally selecting GMs from each cluster to match the CS can maintain diversity across the GMs with different underlying characteristics and consider the weights of underlying characteristics.
In doing so, the resulting selected GMs can describe the complete set of candidate GMs (which include all candidate GMs from the pre-selection). However, commonly used clustering algorithms, such as k-means <ref type="bibr">[16]</ref>, <ref type="bibr">[17]</ref>, Gaussian mixture model <ref type="bibr">[18]</ref>, and hierarchical clustering <ref type="bibr">[19]</ref>, <ref type="bibr">[20]</ref>, work properly for low-dimensional data <ref type="bibr">[21]</ref>, but may not be suitable for clustering GM response spectra (as high-dimensional time series data) <ref type="bibr">[6]</ref>, <ref type="bibr">[22]</ref>, <ref type="bibr">[23]</ref>. Spectral clustering <ref type="bibr">[24]</ref>, <ref type="bibr">[25]</ref>, <ref type="bibr">[26]</ref> can classify GM response spectra via dimensionality reduction, but its efficacy decreases when the GM response spectra are similar to each other (illustrated by the case studies in Section 3). Advances in machine learning have facilitated the development of several clustering algorithms that are explicitly designed for time series classification. Huang et al. <ref type="bibr">[27]</ref> proposed a k-means-type smooth subspace clustering algorithm to organize time series data into homogeneous groups where the similarities among time series in the same group are maximized. Building on this work, an unsupervised machine learning algorithm was developed in <ref type="bibr">[6]</ref> for sequential clustering and used to classify GM response spectra (hereinafter referred to as time series k-means). As with spectral clustering, the efficacy of time series k-means decreases when the GM response spectra are similar to each other, and the optimal number of clusters cannot be definitively identified. Bond et al. 
<ref type="bibr">[28]</ref> developed an autoencoder (AE) to uncover the machine-learned low-dimensional underlying characteristics of GM response spectra – latent features, which can be classified by commonly used clustering algorithms (e.g., k-means clustering). AE is a type of neural network intended to learn compressed representations (latent features) of unlabeled data <ref type="bibr">[29]</ref>, <ref type="bibr">[30]</ref>. It is worth noting that the AE developed by <ref type="bibr">[28]</ref> only uses fully connected layers, which may not be suitable for time series data, particularly compared to convolutional layers. It has been shown that convolutional layers can extract the strong one-dimensional temporal locality of time series by convolving a sequence of input across the entire temporal space <ref type="bibr">[31]</ref>, <ref type="bibr">[32]</ref>, <ref type="bibr">[33]</ref>, <ref type="bibr">[34]</ref>, <ref type="bibr">[35]</ref>, <ref type="bibr">[36]</ref>. The incorporation of convolutional layers in an AE, which results in a convolutional autoencoder (CAE), has been successfully employed for time series classification (e.g., <ref type="bibr">[30]</ref>, <ref type="bibr">[37]</ref>, <ref type="bibr">[38]</ref>, <ref type="bibr">[39]</ref>). This paper introduces a clustering-based GM selection method comprised of four main steps: 1) leveraging domain-specific knowledge to pre-select candidate GMs, 2) using a CAE to learn the latent features of the response spectra of candidate GMs, which effectively reduces the dimensionality of data to be classified, 3) performing k-means clustering to classify the learned latent features, which is identical to clustering the GM response spectra, and 4) finally, selecting GMs from the clusters (proportional to the number of GMs in each cluster) to match the CS. 
It is worth noting that, as the outcome of the CAE (a black box model), latent features are the machine-learned low-dimensional underlying characteristics of GM response spectra <ref type="bibr">[28]</ref>.</p><p>They can be interpreted as the extracted features, which preserve the essential patterns in the original data – GM response spectra. Although latent features do not have any physical interpretation from a seismological point of view, clustering based on latent features can yield similar results to clustering on the GM response spectra (illustrated by the case studies presented in Section 3). The GM response spectra classified in the same cluster are cohesive, compact, and close to each other. In other words, the clusters can indeed serve as indicators to show patterns in the GM response spectra. In general, the GM selection method proposed in this paper embeds the GM clustering algorithm (CAE + k-means) in the current CS-based selection protocol. The main advantage of the proposed GM selection method is that the selected GMs not only represent a given hazard level well (by matching CS) but also fully describe the complete set of candidate GMs. However, it is worth noting that the potential misrepresentation of GMs due to the scaling for CS matching is not considered in this paper.</p><p>As the key component of the proposed clustering-based GM selection, the CAE + k-means is validated through case studies. The results indicate that the CAE + k-means can accurately classify GM response spectra and correctly determine the number of clusters in GM response spectra. 
For comparison purposes, established clustering algorithms for GM response spectra (or in general for time series data), including AE + k-means <ref type="bibr">[28]</ref>, time series k-means <ref type="bibr">[6]</ref>, spectral clustering <ref type="bibr">[24]</ref>, <ref type="bibr">[25]</ref>, <ref type="bibr">[26]</ref>, and k-means on GM influence factors <ref type="bibr">[21]</ref> (magnitude and Joyner-Boore distance, Rjb, <ref type="bibr">[40]</ref>), are also evaluated. The comparison results demonstrate that the CAE + k-means outperforms the above-mentioned established clustering algorithms (see Section 3). Using the proposed clustering-based GM selection method, an application is performed for a structure in San Francisco, California. The codes, which are user-friendly (no training in machine learning is required) for practical use by practitioners in ground motion clustering and selection, will be publicly available on GitHub at <ref type="url">https://github.com/yimjia/Ground-Motion-Clustering-and-Selection</ref> after the paper is published.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Methodology</head><p>The main steps of the proposed clustering-based GM selection method are illustrated in Figure <ref type="figure">1</ref>. These steps are discussed in detail in the following subsections. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Ground motion pre-selection</head><p>In order to better represent the seismic hazard for a structure of interest, GMs are pre-selected by leveraging domain-specific knowledge. Restrictions to GM pre-selection are often imposed based on the type of expected fault, ASCE 7 site class <ref type="bibr">[41]</ref>, and range of magnitudes. For a given structure of interest, the fault type and site class can be determined by geology and seismology. Of these considerations, the range of magnitudes has been found to contribute most significantly to the seismic hazard based on analysis of deaggregated USGS hazard data <ref type="bibr">[42]</ref>. In the context of community resilience, the response spectra of GMs are calculated and often scaled to the fundamental period of the primary structure of interest <ref type="bibr">[12]</ref>. In this paper, scale factors are limited to between 0.25 and 4 based on the recommendations in <ref type="bibr">[41]</ref> and <ref type="bibr">[43]</ref>.</p><p>Furthermore, considering that pulse-type GMs have a larger damage potential than non-pulse-type GMs <ref type="bibr">[2]</ref>, <ref type="bibr">[7]</ref>, <ref type="bibr">[8]</ref>, <ref type="bibr">[9]</ref>, the pre-selection procedure is conducted for both pulse-type and non-pulse-type GMs.</p><p>Due to the above-mentioned restrictions for GM pre-selection, the number of pre-selected GMs (also known as candidate GMs) can be reduced from thousands (e.g., <ref type="bibr">[28]</ref>) to hundreds. The candidate GMs are specific to the structure of interest. From a machine learning perspective, GM pre-selection is analogous to a data cleaning process that removes irrelevant data and improves the machine learning model's performance.</p></div>
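The pre-selection step can be sketched as a simple filtering operation. In this minimal sketch the catalog arrays, the magnitude range, and all values are illustrative assumptions; only the 0.25–4 scale-factor limits come from the text:

```python
import numpy as np

# Hypothetical catalog of candidate records: moment magnitude and the
# scale factor needed to anchor each record's spectrum at the
# structure's fundamental period (values are illustrative only).
magnitude = np.array([5.8, 6.4, 7.1, 6.9, 7.8])
scale_factor = np.array([0.1, 0.8, 2.5, 5.0, 1.2])

# Pre-selection mask: keep records within an assumed magnitude range of
# interest and with scale factors limited to [0.25, 4] per the text.
mask = (magnitude >= 6.0) & (magnitude <= 7.5) & \
       (scale_factor >= 0.25) & (scale_factor <= 4.0)

candidate_idx = np.where(mask)[0]
print(candidate_idx)  # -> [1 2]
```

In practice the same masks would also encode fault type, site class, and the pulse/non-pulse split before the surviving records become the candidate GMs.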
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Convolutional autoencoder</head><p>An AE is a type of artificial neural network used to learn latent features (i.e., low-dimensional underlying characteristics) of unlabeled data <ref type="bibr">[29]</ref>, <ref type="bibr">[30]</ref>, which, in this paper, are the candidate GMs' response spectra.</p><p>Considering that GM response spectra are time series, convolutional layers, which can successfully extract the strong one-dimensional temporal locality of time series <ref type="bibr">[31]</ref>, <ref type="bibr">[32]</ref>, <ref type="bibr">[33]</ref>, <ref type="bibr">[34]</ref>, <ref type="bibr">[35]</ref>, <ref type="bibr">[36]</ref>, are included in the AE, resulting in a CAE. The schematic of the CAE in this application is shown in Figure <ref type="figure">2</ref>. In general, an AE consists of two parts: an encoder and a decoder. The encoder gradually reduces the input dimensions and compresses the input data into an encoded representation, also known as downsampling. As shown in Figure <ref type="figure">2</ref>(a), in the CAE, the encoder maps GM response spectra to latent features via convolutional, max pooling, flatten, and fully connected layers. Note that the flatten layer reshapes the outputs from the previous convolutional layers (2D) into a 1D vector using concatenation. Since this is purely a structural transformation, the flatten layer does not modify its input values; it only rearranges them into a single, longer 1D vector. To illustrate how the flatten layer operates, the second dimension of the last convolutional layer in Figure <ref type="figure">2</ref>(a) is shown using different colors (i.e., blue, grey, and cyan). The flatten layer concatenates the nodes of these different colors into a 1D vector. 
The decoder reconstructs GM response spectra from the latent features (learned by the encoder) using fully connected, reshape, and transposed convolutional layers (see Figure <ref type="figure">2(b)</ref>). The reshape layer converts a 1D vector into a 2D vector, serving as the input for the transposed convolutional layer. The different colors in Figure <ref type="figure">2</ref>(b) illustrate how the reshape layer reverses the operation of the flatten layer. In the CAE, the latent features, z, can be expressed as</p><p>z = f_E(S_a; Θ_E) (1)</p><p>where S_a are the spectral accelerations of GMs for a user-defined range of periods, f_E is the encoder mapping, and Θ_E are the parameters of the encoder. The reconstructed GM spectral accelerations, Ŝ_a, can be expressed as</p><p>Ŝ_a = f_D(z; Θ_D) (2)</p><p>where f_D is the decoder mapping and Θ_D are the parameters of the decoder. Training the CAE is equivalent to solving the optimization problem that minimizes the difference between S_a and Ŝ_a. The loss function can be written as</p><p>Loss = (1/(N_GM · N_T)) Σ_{i=1}^{N_GM} Σ_{t=1}^{N_T} (S_a(i, t) − Ŝ_a(i, t))² (3)</p><p>where N_GM is the number of GMs and N_T is the number of periods for S_a. In this paper, the CAE is trained using the Adam optimizer <ref type="bibr">[44]</ref> to minimize the loss function. The learning rate decay strategy, which has been shown to be an effective technique to improve optimization (e.g., speeding up convergence) and generalization (e.g., alleviating overfitting) <ref type="bibr">[45]</ref>, <ref type="bibr">[46]</ref>, <ref type="bibr">[47]</ref>, is used in the training process. This strategy gradually reduces the learning rate by a given factor when a plateau in model performance is detected, such as when the loss stops decreasing for a given number of training epochs <ref type="bibr">[48]</ref>.</p></div>
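As a minimal numerical illustration of the reconstruction loss in Eq. (3), the sketch below evaluates the mean squared error between an N_GM × N_T matrix of spectral accelerations and a stand-in reconstruction; the arrays are synthetic, not the paper's data:

```python
import numpy as np

# Synthetic stand-ins: S_a holds N_GM response spectra sampled at N_T
# periods; S_a_hat plays the role of the CAE reconstruction.
rng = np.random.default_rng(0)
n_gm, n_t = 4, 6
S_a = rng.random((n_gm, n_t))
S_a_hat = S_a + 0.01          # pretend reconstruction, offset by 0.01

# Eq. (3): squared differences summed over all GMs and periods,
# normalized by N_GM * N_T.
loss = np.sum((S_a - S_a_hat) ** 2) / (n_gm * n_t)
# every squared difference is (0.01)^2, so loss ≈ 1e-4
```

During training, Adam would adjust the encoder and decoder parameters (Θ_E, Θ_D) to drive this quantity toward zero.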
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In addition to the CAE architecture shown in Figure <ref type="figure">2</ref>, one can include GM intensity measures as inputs to the CAE. GM intensity measures can help the CAE reconstruct the GM response spectra because GM intensity measures may provide information related to GM response spectra and their inclusion in the CAE can increase the number of trainable parameters (which can improve the flexibility of the CAE). The included GM intensity measures can be determined based on users' domain-specific knowledge and/or experience.</p><p>Recommended GM intensity measures include peak ground responses (i.e., acceleration, velocity, and displacement), Arias intensity <ref type="bibr">[49]</ref>, incremental velocity <ref type="bibr">[50]</ref>, effective design acceleration <ref type="bibr">[51]</ref>, and significant peak ground acceleration (for pulse-type GMs only, <ref type="bibr">[52]</ref>).</p><p>The nonlinear activation function, which can improve the performance of neural networks to learn complex patterns of data, is used for each layer. Commonly used nonlinear activation functions, including hyperbolic tangent, sigmoid, rectified linear unit (ReLU <ref type="bibr">[53]</ref>), leaky ReLU <ref type="bibr">[54]</ref>, and parametric ReLU <ref type="bibr">[55]</ref>, are examined. The leaky ReLU activation function <ref type="bibr">[54]</ref>, shown in Eq. (<ref type="formula">4</ref>), results in a higher training accuracy than the other activation functions.</p><p>f(y) = y for y &gt; 0; f(y) = αy for y ≤ 0 (4)</p><p>where y is the output from the convolutional process and α is a positive constant (usually less than 1).</p></div>
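The leaky ReLU of Eq. (4) can be written in a few lines; the default α = 0.01 below is a common illustrative choice, not a value from the paper:

```python
import numpy as np

def leaky_relu(y, alpha=0.01):
    """Leaky ReLU of Eq. (4): positive values pass through unchanged,
    negative values are scaled by a small positive constant alpha (< 1),
    avoiding the zero gradient of plain ReLU for y < 0."""
    return np.where(y > 0, y, alpha * y)

out = leaky_relu(np.array([-2.0, 0.0, 3.0]))
# -> [-0.02, 0.0, 3.0]
```

Applying this elementwise after each convolutional and fully connected layer gives the CAE its nonlinearity.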
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Transfer learning</head><p>The CAE can accurately reconstruct GM response spectra. However, the parameters of the CAE are randomly initialized when training, and the learned latent features (which are from the middle hidden layer) could vary significantly depending on the initialization. This would lead to different GM clustering results and, in turn, affect the GM selection. In other words, repeatedly running this GM clustering and selection method may result in various sets of GMs, an outcome known as instability and low reproducibility. This is of particular concern for scenarios in which the sample size (e.g., 100 GMs and each with response spectra at 200 periods, resulting in 20,000 data points) is much smaller than the number of CAE parameters (e.g., &gt; 300,000). One way to alleviate this issue is by leveraging the concept of transfer learning. Transfer learning aims to improve the performance of target learners on target domains by transferring the knowledge contained in different but related source domains <ref type="bibr">[56]</ref>. It has been used successfully for classification problems (e.g., <ref type="bibr">[33]</ref>, <ref type="bibr">[56]</ref>, <ref type="bibr">[57]</ref>, <ref type="bibr">[58]</ref>, <ref type="bibr">[59]</ref>, <ref type="bibr">[60]</ref>). In this paper, the concept of transfer learning is implemented using three steps. First, for the response spectra of each candidate GM, hundreds of response spectra are simulated by including lognormally distributed noise at each period. This is a data augmentation process, which can significantly increase the sample size. Second, the CAE is trained using the simulated response spectra of candidate GMs and denoted as a pre-trained CAE. 
In this step, the large sample size can help the CAE minimize the impact of the randomness of parameter initialization, leading to uniquely determined parameters that, in turn, improve the stability and reproducibility of the GM clustering. Third, the pre-trained CAE is trained using the candidate GMs' response spectra, known as a fine-tuning process.</p><p>The flowchart of transfer learning is illustrated in Figure <ref type="figure">3</ref>. The hyperparameters of transfer learning (e.g., the number of simulated response spectra for each candidate GM, the initial learning rate, and the number of epochs for the fine-tuned CAE) are determined based on parameter studies to optimize the training accuracy and efficiency. It is worth noting that here – unlike with the general use of transfer learning, where the parameters for some hidden layers are frozen for the fine-tuning process to save time and computational resources – all the parameters are free to migrate to their appropriate values, which means no layers are frozen. This is mainly because the data size for the fine-tuning process (only candidate GMs) is much smaller than that for the pre-trained CAE (simulated GMs). There is no significant difference between the training times of the fine-tuned CAE with and without frozen layers (&lt;5%). As noted in <ref type="bibr">[61]</ref> and <ref type="bibr">[62]</ref>, fine-tuning all the layers does not drastically increase the computational cost. Due to the similarity between the response spectra of candidate GMs and their simulations, the parameters of the pre-trained CAE change only slightly in this step. As a result, the fine-tuned CAE not only captures the candidate GMs' response spectra well but also ensures a consistent pattern of learned latent features that leads to a unique GM selection. Additionally, transfer learning can help reduce overfitting, particularly in cases with a limited number of GMs. 
The CAE model is pre-trained using a dataset of simulated response spectra, allowing it to learn generalizable features of response spectra that are less prone to overfitting. As mentioned above, due to the similarity between the response spectra of candidate GMs and their simulations, the fine-tuned CAE requires only minimal adjustments to its parameters to achieve good performance. For example, in the case study presented in Section 3.2, the parameter changes are limited to a median relative difference of 0.64% between the parameters obtained from the pre-trained and fine-tuned CAEs. This parameter efficiency enhances the model's generalization capabilities, ensuring reliable performance across diverse datasets, and thereby lowering the risk of overfitting.</p></div>
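The data augmentation in the first transfer-learning step can be sketched as follows. The spectrum shape, the number of simulations, and the noise level (σ = 0.1 in log space) are illustrative assumptions; the text only specifies that lognormally distributed noise is added at each period:

```python
import numpy as np

# Synthetic stand-in for one candidate GM's response spectrum,
# sampled at n_t periods.
rng = np.random.default_rng(42)
n_sim, n_t = 200, 50
sa = np.exp(-np.linspace(0.0, 3.0, n_t))

# Simulate n_sim spectra by multiplying the spectral ordinates by
# period-wise lognormal noise (median 1, assumed sigma = 0.1 in log
# space), greatly enlarging the pre-training sample.
noise = rng.lognormal(mean=0.0, sigma=0.1, size=(n_sim, n_t))
simulated = sa * noise        # shape (200, 50)
```

The pre-trained CAE is fit on `simulated`; fine-tuning then revisits only the original candidate spectra, so its parameters move little from the pre-trained values.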
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.">K-means clustering</head><p>Using the learned latent features from the CAE (z_1, z_2, …, z_{N_GM}), k-means clustering is employed to classify these latent features into k (≤ N_GM) sets, S = {S_1, S_2, …, S_k}. This can be achieved by minimizing the sum of the distances between each GM's latent features and the corresponding cluster centroid, given by</p><p>arg min_S Σ_{c=1}^{k} Σ_{z ∈ S_c} dist(z − u_c) (5)</p><p>where dist is the squared Euclidean distance and u_c is the centroid of the cth cluster given by</p><p>u_c = (1/N_{S_c}) Σ_{z ∈ S_c} z (6)</p><p>where N_{S_c} is the number of latent features in the cth set. The k-means++ algorithm, which has been shown to improve both the speed and accuracy of k-means <ref type="bibr">[63]</ref>, is used for clustering. More details about the k-means++ algorithm can be found in <ref type="bibr">[63]</ref>.</p><p>Since k-means clustering is not able to determine the number of natural clusters in the data, it must be supplied with the number of clusters (i.e., k in Eq. (<ref type="formula">5</ref>)) <ref type="bibr">[64]</ref>. In order to estimate the number of natural clusters in the learned latent features (hereinafter referred to as the optimal number of clusters), a quantitative measure of how well-defined and distinct the clusters are is needed. 
Analogous to <ref type="bibr">[28]</ref>, <ref type="bibr">[64]</ref>, <ref type="bibr">[65]</ref>, <ref type="bibr">[66]</ref>, the silhouette score is used as a metric to determine the optimal number of clusters. For the lth latent features, z_l, the silhouette score is calculated as</p><p>s_l = (b_l − a_l) / max(a_l, b_l) (7)</p><p>where a_l is the average distance of z_l to all other latent features in the same cluster and b_l is the average distance of z_l to all the latent features in the nearest cluster. Consistent with k-means clustering, Euclidean distance is used for the silhouette score calculation. For a given set of clusters, the silhouette score is the average value of the silhouette scores of all latent features. The silhouette score varies from -1 to 1, where a larger value indicates that the clusters are further apart from each other and more clearly distinguishable.</p><p>Consequently, the k resulting in the highest silhouette score is selected as the optimal number of clusters.</p><p>Considering that the latent features learned by the CAE represent the low-dimensional underlying characteristics of GM response spectra, it is reasonable to infer that the clusters of latent features are equivalent to the clusters of GM response spectra.</p></div>
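A direct NumPy transcription of Eq. (7) is sketched below (a minimal version assuming every cluster has at least two members; the latent features in the usage example are synthetic):

```python
import numpy as np

def silhouette(z, labels):
    """Average silhouette score of Eq. (7): for each latent-feature
    vector z_l, a_l is the mean Euclidean distance to the other members
    of its own cluster and b_l is the mean distance to the members of
    the nearest other cluster."""
    z, labels = np.asarray(z, float), np.asarray(labels)
    n = len(z)
    # Pairwise Euclidean distance matrix between latent features.
    d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    scores = []
    for l in range(n):
        own = (labels == labels[l]) & (np.arange(n) != l)
        a = d[l, own].mean()
        b = min(d[l, labels == c].mean()
                for c in np.unique(labels) if c != labels[l])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two tight, well-separated 2D "latent feature" clusters -> score near 1.
score = silhouette([[0, 0], [0, 1], [10, 0], [10, 1]], [0, 0, 1, 1])
```

Running k-means for several candidate k values and keeping the k with the highest average score implements the selection rule described above.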
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.5.">Ground motion selection</head><p>GMs are selected from the clusters to match the target mean, variance, and correlations of response spectral values at a range of periods (i.e., CS in <ref type="bibr">[4]</ref>). This is achieved by three main steps: 1) develop the CS, 2) determine the desired number of pulse-type and non-pulse-type GMs, and 3) select pulse-type and non-pulse-type GMs from the corresponding clusters to match the CS.</p><p>Based on <ref type="bibr">[4]</ref>, <ref type="bibr">[12]</ref>, <ref type="bibr">[13]</ref>, <ref type="bibr">[15]</ref>, CS is expressed as the conditional mean spectrum and +/- one conditional standard deviation of logarithmic spectral acceleration (lnS_a) around the mean. The conditional mean spectrum is</p><p>μ_lnSa(T_t)|lnSa(T*) = μ_lnSa(Rup, T_t) + ρ(T_t, T*) ε(T*) σ_lnSa(Rup, T_t) (8)</p><p>where T* is the fundamental period of the structure of interest; T_t is the tth period in a user-defined range; μ_lnSa(Rup, T_t) and σ_lnSa(Rup, T_t) are the mean and standard deviation of lnS_a at period T_t, which are calculated using the weighted average of four GM prediction models <ref type="bibr">[67]</ref>, <ref type="bibr">[68]</ref>, <ref type="bibr">[69]</ref>, <ref type="bibr">[70]</ref> and the rupture scenario (Rup, defined by the earthquake's magnitude, distance, rupture mechanism, and other parameters required by the GM prediction models); ε(T*) is the number of standard deviations by which lnS_a(T*) differs from μ_lnSa(Rup, T*) <ref type="bibr">[11]</ref>, which can be expressed as</p><p>ε(T*) = (lnS_a(T*) − μ_lnSa(Rup, T*)) / σ_lnSa(Rup, T*) (9)</p><p>and ρ(T_t, T*) is the correlation coefficient between ε(T_t) and ε(T*) from <ref type="bibr">[71]</ref>. The conditional standard deviation of lnS_a is calculated as</p><p>σ_lnSa(T_t)|lnSa(T*) = σ_lnSa(Rup, T_t) √(1 − ρ²(T_t, T*)) (10)</p><p>As discussed in Section 2.1, there are two types of candidate GMs, pulse-type and non-pulse-type.</p><p>According to <ref type="bibr">[2]</ref>, the proportion of pulse-type GMs to be selected is</p><p>P_pulse = 1 / (1 + e^(−γ)) (11)</p><p>where γ = 0.905 − 0.188R + 1.337ε; ε is calculated using Eq. (<ref type="formula">9</ref>); and R is rupture distance. Similarly, after the clusters of pulse-type (or non-pulse-type) GMs are determined by the CAE + k-means, the number of GMs selected from each cluster is determined based on the proportion of the number of GMs in each cluster to the total number of GMs. Using Section 4 as an example, P_pulse is 0.269, equivalent to 5 pulse-type GMs out of 20 GMs. The CAE + k-means results in 2 clusters associated with 41 and 26 candidate pulse-type GMs. Thus, the numbers of pulse-type GMs selected from these clusters are 3 and 2, respectively.</p><p>Selecting a set of GMs from the clusters of pulse-type and non-pulse-type candidate GMs to represent a CS can result in different combinations of GMs <ref type="bibr">[4]</ref>. 
Again, using Section 4 as an example, the number of possible combinations for selecting 5 pulse-type GMs from the 2 clusters is 3,464,500, calculated as the product of 41-choose-3 and 26-choose-2. Likewise, an even larger number of combinations is possible when selecting the 15 non-pulse-type GMs. Consequently, it is unrealistic to conduct an exhaustive search through all possible combinations of pulse-type and non-pulse-type GMs. An efficient clustering-based GM selection method is therefore proposed based on <ref type="bibr">[4]</ref>, which involves the following steps:</p><p>1) Statistically generate realizations of response spectra from the CS <ref type="bibr">[4]</ref>. This is done by sampling from a multivariate normal distribution with the CS's mean vector and covariance matrix. The mean vector consists of all μ_lnSa(Rup, T_t). The covariance matrix consists of diagonal elements σ²_lnSa(Rup, T_t) and non-diagonal elements</p><p>ρ(T_t, T_u) σ_lnSa(Rup, T_t) σ_lnSa(Rup, T_u),</p><p>where ρ(T_t, T_u) is the correlation coefficient between ε(T_t) and ε(T_u) from <ref type="bibr">[71]</ref>.</p><p>2) Assign the realizations to the clusters. This is achieved by using the sum of squared errors between each realization and the mean ln Sa of each GM cluster (ε_real).
For a given realization and GM cluster, ε_real is calculated as</p><p>ε_real = Σ_t (ln Sa(T_t)_real − μ_lnSa(T_t)_c)²,</p><p>where ln Sa(T_t)_real is the ln Sa of the realization at the t-th period and μ_lnSa(T_t)_c is the mean ln Sa calculated using the GMs in the c-th cluster at the t-th period. Each realization is assigned to the cluster with the smallest value of ε_real. Note that the number of realizations assigned to each cluster is proportional to the number of GMs in that cluster.</p></div>
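Steps 1 and 2 above can be sketched in a few lines of NumPy. The period grid, CS mean vector, covariance matrix, and cluster mean spectra below are hypothetical placeholders, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
periods = np.linspace(0.1, 2.0, 20)          # illustrative period grid
mu_cs = -1.0 - 0.5 * periods                 # hypothetical CS mean of ln Sa
# Hypothetical positive-definite covariance with decaying cross-period correlation.
cov_cs = 0.04 * np.exp(-np.abs(periods[:, None] - periods[None, :]))

# Step 1: statistically generate realizations of ln Sa from the CS.
realizations = rng.multivariate_normal(mu_cs, cov_cs, size=200)

# Hypothetical mean ln Sa spectra of two GM clusters.
cluster_means = np.vstack([mu_cs + 0.2, mu_cs - 0.2])

# Step 2: assign each realization to the cluster with the smallest
# sum of squared errors (epsilon_real) to the cluster mean spectrum.
sse = ((realizations[:, None, :] - cluster_means[None, :, :]) ** 2).sum(axis=2)
assignment = sse.argmin(axis=1)
```

In practice the number of realizations routed to each cluster would additionally be constrained to be proportional to the cluster sizes, as noted above.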
<div xmlns="http://www.tei-c.org/ns/1.0"><head>3) Select a GM for each realization</head><p>For each realization, a GM is selected from the corresponding cluster. The selected GM has the smallest value of the sum of squared errors between its ln Sa and the realization's ln Sa.</p><p>4) Quantify the difference between the set of selected GMs and the CS. Analogous to <ref type="bibr">[4]</ref>, the goodness-of-fit metric between the selected GMs and the CS is proposed to be</p><p>ε_sel = Σ_t [(μ_lnSa(T_t)_set − μ_lnSa(T_t)|lnSa(T*))² + λ (σ_lnSa(T_t)_set − σ_lnSa(T_t)|lnSa(T*))²],</p><p>where μ_lnSa(T_t)_set and σ_lnSa(T_t)_set are the mean and standard deviation of ln Sa calculated using the selected GMs at the t-th period, and λ is a user-defined parameter that assigns relative importance to mismatches in the mean versus the standard deviation.</p><p>5) Repeat steps one through four 2,000 times to reduce the effect of realization randomness on the GM selection.</p><p>6) Select the set of GMs that has the smallest value of ε_sel.</p><p>It is important to note that the presented GM selection method is based on linear response spectra. As a result, two ground motions with identical linear response spectra do not necessarily lead to the same nonlinear dynamic response of a structure.</p></div>
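Steps 3 through 6 amount to a simple repeated search: pick the closest candidate spectrum per realization, score the resulting set against the target mean and standard deviation, and keep the best-scoring set. The sketch below uses synthetic candidate spectra, deterministic cluster labels, and simplified independent-normal realizations; all names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n_periods, lam = 20, 1.0
# Hypothetical candidate ln Sa spectra and their cluster labels (two clusters of 30).
candidates = rng.normal(-1.5, 0.4, size=(60, n_periods))
labels = np.repeat([0, 1], 30)
# Stand-in CS targets (mean and standard deviation per period).
mu_target = candidates.mean(axis=0)
sig_target = candidates.std(axis=0)

def select_once(realizations, realization_clusters):
    """Steps 3-4: pick the closest candidate per realization, then score the set."""
    idx = []
    for r, c in zip(realizations, realization_clusters):
        members = np.flatnonzero(labels == c)
        sse = ((candidates[members] - r) ** 2).sum(axis=1)
        idx.append(int(members[sse.argmin()]))
    sel = candidates[idx]
    # Goodness-of-fit: mean mismatch plus lambda-weighted std mismatch.
    err = np.sum((sel.mean(0) - mu_target) ** 2 + lam * (sel.std(0) - sig_target) ** 2)
    return idx, err

# Steps 5-6: repeat with fresh realizations and keep the best-scoring set.
best_idx, best_err = None, np.inf
for _ in range(50):  # 2,000 repetitions in the paper; reduced here
    reals = rng.normal(mu_target, sig_target, size=(20, n_periods))
    clusters = rng.integers(0, 2, size=20)   # proportional assignment in practice
    idx, err = select_once(reals, clusters)
    if err < best_err:
        best_idx, best_err = idx, err
```

The realizations here are drawn independently per period only for brevity; the method described above samples them jointly from the CS covariance matrix.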
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Performance assessment for convolutional autoencoder + k-means clustering</head><p>Since clustering is an unsupervised task (i.e., there are no ground truth labels against which to compare the clustering algorithm's output), measuring the quality of its results is a challenge. Nevertheless, it is essential to evaluate the performance of the proposed clustering algorithm, CAE + k-means. Therefore, a case study with known ground truth is designed and presented in Section 3.1 to evaluate the performance of CAE + k-means. Two other case studies, for pulse-type and non-pulse-type GMs, are presented in Sections 3.2 and 3.3 to evaluate the performance of CAE + k-means in practice (without known ground truth).</p><p>The candidate GMs used in these case studies are pre-selected for a structure with a fundamental period of 1.0s.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Well-defined clusters of ground motions</head><p>A set of 39 GMs is generated by relatively small perturbations of three recorded non-pulse-type GMs, in order to establish a known ground truth of three well-defined clusters. The modification is made by adding normally distributed random noise to the recorded non-pulse-type ground accelerations. The scaled ln Sa of these 39 modified GMs are calculated over the period range from 0.01s to 2.0s at 0.01s intervals (see Figure <ref type="figure">5</ref>). As shown in Figure <ref type="figure">5</ref>(a), over the period range from 0.01s to 1.0s, the scaled ln Sa fall clearly into three distinct groups; within each group, the spectra are cohesive, compact, and close to one another. The median values of the coefficient of variation for the scaled ln Sa within this period range are 0.17, 0.06, and 0.10 for the three groups. Therefore, these groups are identified as three well-defined clusters and are shown in Figure <ref type="figure">5</ref>(b) with different colors and line types. Based on parametric studies, the CAE architecture shown in Figure <ref type="figure">6</ref> with two latent features is designed. The scaled ln Sa of all 39 modified GMs are used for training. The leaky ReLU activation function <ref type="bibr">[54]</ref> with an α of 0.3 is used for all layers except the last, which uses a linear activation function. Note that because these modified GMs come from three well-defined clusters, a transfer learning process is not required in this case. In terms of CAE training, the number of epochs is set at 10,000, and the initial learning rate is set at 0.001 with a decay period of 100 epochs and a factor of 0.75.
Early stopping with a loss threshold of 0.001 is used to halt training, reduce computational costs, and mitigate overfitting.</p><p>Notably, this case study does not include a testing phase to assess generalization; therefore, overfitting is not a critical concern. The primary focus is on effectively learning the latent features of all modified GMs.</p><p>The k-means clustering is performed on the latent features learned by the designed CAE. Among the established clustering algorithms, only time series k-means fails to reach 100% accuracy, misclassifying two spectra; Figure <ref type="figure">8</ref> shows that this level of accuracy cannot be achieved by solely using time series k-means.</p><p>Figure 7. Classification results for the modified GMs: (a) silhouette score vs. number of clusters and (b) learned latent features.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Pulse-type ground motions</head><p>Based on parametric studies, this case study requires leveraging the concept of transfer learning in the CAE training to improve the stability and reproducibility of GM clustering. More specifically, transfer learning ensures that applying the CAE + k-means algorithm across different environments produces consistent classification results. To do so, simulation of GM response spectra is required. The simulation is conducted in three steps. First, the scaled ln Sa of each pulse-type GM is shifted to have ln Sa(T*) = 0. Then, the shifted scaled ln Sa of each pulse-type GM at each period is simulated by adding normally distributed random noise with a mean equal to the shifted scaled ln Sa and a coefficient of variation varying from 0.20 to 0.45 (user-defined). Periods farther from the fundamental period (1.0s) are assigned a larger coefficient of variation.
Finally, the simulated ln Sa are shifted back by adding the value of ln Sa(T*) to each spectrum. Since the simulation is performed independently at each period, it can result in unreasonably high or low values of ln Sa (e.g., spikes) that distort the spectra. Therefore, a smoothing function based on local regression <ref type="bibr">[73]</ref> is used to improve the quality of the simulated spectra. Note that this is just one approach to simulating GM response spectra. The scaled ln Sa of one pulse-type GM and its associated 160 simulated response spectra are shown in Figure <ref type="figure">10</ref>. In total, 5,280 simulated response spectra are used to pre-train the CAE model.</p><p>Figure 10. Simulated response spectra for the scaled response spectra of one pulse-type GM.</p><p>A CAE similar to the one used in the previous case study (see Figure <ref type="figure">8</ref>) is used for the pre-trained model. For a given accuracy of response spectra reconstruction, the number of latent features (the bottleneck of the CAE) can significantly affect the computational cost. More specifically, with a larger number of latent features, there are more channels through which the CAE can reconstruct response spectra, which leads to a lower computational cost. Considering the trade-off between the number of latent features and computational cost, it is necessary to determine the optimal number of latent features. The number of epochs required to reach a user-defined loss value (e.g., 0.005) is used to represent computational cost, and the relationship between the number of latent features and the number of epochs is shown in Figure <ref type="figure">11</ref>.
Consistent with the site-specific GM clustering applications in <ref type="bibr">[6]</ref> and <ref type="bibr">[28]</ref>, the elbow method is used to determine the optimal number of latent features. As shown in Figure <ref type="figure">11</ref>, two elbows are identified, one at three and one at five latent features. In this case study, the optimal number of latent features is selected to be five: compared with three latent features, using five further reduces the computational cost without significantly increasing dimensionality. Because the elbow method is heuristic, one can instead use quantitative measures, such as the silhouette score in Eq. (<ref type="formula">7</ref>), to determine the optimal number of latent features.</p><p>The fine-tuned CAE is trained using the scaled ln Sa of these 33 pulse-type GMs. It is worth noting that, compared with the pre-trained CAE, a lower initial learning rate is recommended for training the fine-tuned CAE. Based on a parametric study, the parameters of the fine-tuned CAE do not require significant adjustments to reach a high accuracy of response spectra reconstruction. For instance, the median relative difference between the parameters obtained from the pre-trained and fine-tuned CAEs is only 0.64%. Compared with the initial learning rate of 0.001 for the pre-trained CAE, the fine-tuned CAE is trained with a lower initial learning rate.</p><p>Figure 13. Quantile-quantile plot for ln Sa obtained by (a) the pre-trained CAE and (b) the fine-tuned CAE.</p><p>Established clustering algorithms are also applied for comparison purposes. The resulting clusters and accuracies are shown in Figures <ref type="figure">16(b)</ref> to 16(e). Note that transfer learning is not used for the AE + k-means algorithm <ref type="bibr">[28]</ref>, which leads to instability. Figure <ref type="figure">16</ref>(b) shows one typical classification result.
The k-means clustering algorithm is also applied to the GM influence factors; for illustration purposes, the resulting clusters are presented in terms of GM response spectra in Figure <ref type="figure">16</ref>(e). As shown in Figure <ref type="figure">16</ref>, the CAE + k-means results in the highest accuracy (100%) among all clustering algorithms. Both the AE + k-means and time series k-means methods achieve the same level of accuracy, 85%, despite producing different clusters (see Figures <ref type="figure">16(b)</ref> and 16(c)). The main reason AE + k-means cannot achieve a higher accuracy is instability, which stems from the fact that the number of parameters of the AE (188,773) is much larger than the number of data points (6,600) of these selected pulse-type GMs. Transfer learning can effectively address this issue. Time series k-means cannot classify these selected pulse-type GMs as effectively as CAE + k-means does, mainly because of its inherent characteristics: as stated in <ref type="bibr">[27]</ref>, time series k-means iteratively discovers subspaces within the entire sequence and then clusters objects based on these uncovered subspaces, rather than using the whole sequence. More specifically, time series k-means focuses on a specific range of periods rather than the entire period range. As shown in Figure <ref type="figure">16</ref>(d), spectral clustering demonstrates poor classification performance, with an accuracy of only 55%. This result indicates that spectral clustering is not a suitable clustering algorithm for these pulse-type GMs.</p><p>Additionally, k-means on the GM influence factors also shows poor classification performance (67% accuracy).
This is mainly due to the small sample size of these pulse-type GMs (33), which is the main difference between this case study and the one in <ref type="bibr">[21]</ref>.</p><p>For the clusters obtained by CAE + k-means, which achieves 100% accuracy, the statistics of magnitude and Rjb for each resulting cluster are listed in Table <ref type="table">1</ref>. The cluster IDs correspond to those shown in Figure <ref type="figure">9</ref>(b). It is worth noting that there is clear overlap between the magnitude and Rjb ranges of the two clusters. This overlap implicitly explains the poor classification performance of k-means on the GM influence factors (i.e., magnitude and Rjb).</p></div>
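The pre-train/fine-tune workflow described above can be illustrated with a deliberately simplified stand-in: a dense autoencoder (scikit-learn's MLPRegressor trained to reproduce its input, with warm_start preserving weights between fits) takes the place of the CAE, and the spectra are synthetic. All shapes, sample sizes, and learning rates below are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
n_periods = 50
base = np.sin(np.linspace(0, np.pi, n_periods))  # hypothetical spectral shape

# Pre-training set: many simulated spectra (base shape plus random noise).
simulated = base + rng.normal(0, 0.05, size=(500, n_periods))
# Fine-tuning set: a small number of "recorded" spectra.
recorded = base + rng.normal(0, 0.05, size=(30, n_periods))

# Dense autoencoder stand-in for the CAE, with a bottleneck of 5 latent features.
ae = MLPRegressor(hidden_layer_sizes=(20, 5, 20), max_iter=800,
                  learning_rate_init=1e-3, random_state=0, warm_start=True)
ae.fit(simulated, simulated)                 # pre-train on simulated spectra

# Fine-tune from the pre-trained weights with a lower learning rate,
# as recommended above for the fine-tuned model.
ae.set_params(learning_rate_init=1e-4, max_iter=200)
ae.fit(recorded, recorded)

recon = ae.predict(recorded)
```

The point of the sketch is the workflow, not the architecture: pre-training on a large simulated set stabilizes the weights, so the small recorded set only nudges them, which is what makes the clustering reproducible across runs.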
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Non-pulse-type ground motions</head><p>As another case study, a set of 36 non-pulse-type GMs that can potentially be classified into 3 clusters is manually selected. As shown in Figure <ref type="figure">17</ref>(a), the scaled ln Sa of these 36 non-pulse-type GMs fall clearly into three distinct groups over the period range from 1.5s to 2.0s. Therefore, these groups are identified as three clusters in practice and are shown in Figure <ref type="figure">17</ref>(b). The response spectra simulation, CAE architecture, and settings for the pre-trained and fine-tuned CAEs are the same as those used in Section 3.2. The pre-trained and fine-tuned CAEs are trained using 6,120 simulated response spectra and the 36 non-pulse-type GM response spectra, respectively. The resulting silhouette scores indicate that the optimal number of clusters is three (see Figure <ref type="figure">18</ref>).</p><p>Figure 18. Classification results for the selected non-pulse-type GMs: (a) silhouette score vs. number of clusters and (b) the first three principal components of the learned latent features.</p><p>Figure 19. Classification results for the selected non-pulse-type GMs obtained using CAE + k-means: (a) initial and (b) final.</p><p>Similar to Section 3.2, established clustering algorithms are also applied for comparison purposes. The resulting clusters and accuracies are shown in Figures <ref type="figure">20(b)</ref> to 20(e). As shown in Figure <ref type="figure">20</ref>, the CAE + k-means results in the highest accuracy (100%) among all clustering algorithms. The statistics of magnitude and Rjb for each resulting cluster are listed in Table <ref type="table">2</ref>; the cluster IDs correspond to those shown in Figure <ref type="figure">17</ref>(b). The time series k-means results in the second-highest accuracy, 92% (see Figure <ref type="figure">20</ref>(c)). As shown in Figure <ref type="figure">20</ref>(b), the AE + k-means achieves an accuracy of 78%, which is attributed to the instability issue discussed in Section 3.2.
Both spectral clustering and k-means on the GM influence factors result in poor classification performance, with less than 70% accuracy. These findings suggest that spectral clustering and k-means on the GM influence factors are not suitable for classifying these non-pulse-type GMs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Application</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Selecting ground motions for a structure in San Francisco, California</head><p>The CAE + k-means is applied to select 20 GMs for the structure discussed in Section 3, which has a fundamental period of 1.0s. The GM pre-selection process results in 69 pulse-type and 199 non-pulse-type candidate GMs (see Figure <ref type="figure">21</ref>). The scaled response spectra of these GMs are classified via the CAE + k-means used in Sections 3.2 and 3.3. For pre-training purposes, 10,050 and 29,850 response spectra are simulated for the selected pulse-type and non-pulse-type GMs, respectively. The relationship between the silhouette score and the number of clusters is shown in Figure <ref type="figure">22</ref>. For both the pulse-type and non-pulse-type GMs, the optimal number of clusters is two. The cluster centroids (ln Sa*, the mean value of the response spectra in each cluster at each period) are shown in Figure <ref type="figure">23</ref> and are clearly separate over the period range of 1.0s to 2.0s. The statistics of magnitude and Rjb for each resulting cluster are listed in Table <ref type="table">3</ref>.</p><p>Using Eq. (<ref type="formula">11</ref>), the number of pulse-type GMs is determined to be five. Based on the proportion of the number of GMs in each cluster, the number of GMs selected from each cluster is calculated and listed in Table <ref type="table">4</ref>. Following the GM selection method introduced in Section 2.5, the response spectra of 20 realizations are generated and assigned to the clusters. The GMs in each cluster whose response spectra are closest to the assigned realizations (resulting in the smallest ε_real) are selected. The set of selected GMs is evaluated using Eq. (<ref type="formula">13</ref>) with a λ of 1.
Repeating the process from the realization generation onward 2,000 times, the set of GMs with the smallest value of the goodness-of-fit metric is selected.</p><p>Figure 22. Silhouette score vs. number of clusters for (a) pulse-type GMs and (b) non-pulse-type GMs.</p><p>Figure 23. Mean of ln Sa of GMs in each cluster: (a) pulse-type GMs and (b) non-pulse-type GMs.</p><p>In terms of computational cost, the CAE pre-training and fine-tuning processes are implemented in the TensorFlow <ref type="bibr">[75]</ref> framework, executed on a standard PC with one NVIDIA RTX A4000 GPU, and finished in about 4 hours. The GM response spectra simulation, k-means clustering, and CS-based GM selection are performed in MATLAB <ref type="bibr">[76]</ref> on a standard PC with one Intel(R) Core(TM) i7-8650U CPU and finished in less than 0.5 hours. In total, approximately 4.5 hours is required for the proposed GM selection. Using the same computational resources, the CS-based GM selection with AE + k-means can be completed in 1 hour. With parallel computing on 8 Intel(R) Core(TM) i7-8650U CPUs, the time series k-means case takes approximately 7 hours to finish. The spectral clustering and k-means on GM influence factors cases finish the GM selection in 0.5 hours using one Intel(R) Core(TM) i7-8650U CPU. Additionally, the execution time for the CS-based GM selection without any clustering algorithm is approximately 0.3 hours on one Intel(R) Core(TM) i7-8650U CPU. The proposed clustering-based GM selection method requires less execution time only when compared with time series k-means, but for practical purposes, 4.5 hours is acceptable. The computational cost can be further reduced by training the CAEs on a more powerful GPU.</p><p>Additional applications are also conducted for different seismological conditions and structural fundamental periods; their computational costs are comparable to this application's.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Practical use of proposed ground motion clustering algorithm for ground motion selection</head><p>In order for practitioners to be able to use the proposed GM clustering algorithm for GM selection, the developed user-friendly codes will be published on GitHub at <ref type="url">https://github.com/yimjia/Ground-Motion-Clustering-and-Selection</ref>. As prerequisites, users need to have knowledge of earthquake engineering to perform the GM pre-selection (see Section 2.1) and have access to run Python and MATLAB codes. No training in machine learning is required.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusions</head><p>This paper presents a clustering-based GM selection method to select GMs that a structure of interest will probabilistically experience. First, by leveraging domain-specific knowledge, the candidate GMs are pre-selected. Then, a CAE is trained to learn the low-dimensional underlying characteristics of the candidate GMs' response spectra, also known as latent features. Next, k-means clustering is performed to classify the learned latent features, which is equivalent to grouping the response spectra of candidate GMs by similar underlying characteristics. Finally, the grouped GMs are embedded in the CS-based GM selection. The selected GMs can represent the given hazard level well (by matching the CS mean and variance) and fully describe the complete set of candidate GMs. The presented clustering-based GM selection method can be readily used by practitioners because no training in machine learning is required to perform it. The developed codes will be publicly available on GitHub at <ref type="url">https://github.com/yimjia/Ground-Motion-Clustering-and-Selection</ref> after the paper is published.</p><p>To evaluate the performance of the proposed GM clustering algorithm (CAE + k-means), case studies are designed for a structure in San Francisco, California. These case studies illustrate that the CAE + k-means can accurately classify the GM response spectra and determine the optimal number of clusters. This is achieved by 1) the CAE successfully extracting the underlying characteristics of GM response spectra as latent features, and 2) k-means clustering classifying the latent features, which is equivalent to grouping GM response spectra.
The results of the second and third case studies demonstrate that the CAE + k-means outperforms the other clustering algorithms (time series k-means and spectral clustering).</p><p>The proposed clustering-based GM selection method is applied to select 20 GMs for a structure in San Francisco, California. The candidate pulse-type and non-pulse-type GMs are each classified into two clusters. Instead of selecting GMs from one database, the 20 GMs are proportionally selected from four clusters, which allows the selected GMs to fully describe the complete set of candidate GMs. The response spectra of these 20 selected GMs match the CS well, indicating that the selected GMs can represent the given seismic hazard level. However, the proposed clustering-based GM selection method has some limitations and leaves opportunities for future improvement. One limitation is that no structural properties, other than the fundamental period of the structure, are considered when selecting GMs. Another general limitation is the potential misrepresentation of GMs due to the scaling for CS matching.</p><p>Nevertheless, the proposed GM clustering algorithm (CAE + k-means) can be embedded in other GM selection methods to allow the selected GMs to fully describe the complete set of candidate GMs. The CAE + k-means can also be applied to other engineering problems in which time series data need to be classified. The transfer learning used in the CAE can provide guidance to other studies in which the stability and reproducibility of a complex machine learning model need to be improved.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>Postdoctoral Scholar, Department of Civil and Environmental Engineering, University of California, Berkeley, CA, Email: yimjia@berkeley.edu</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>Professor, Department of Civil and Environmental Engineering, Northeastern University, Boston, MA, Email: sasani@neu.edu, Tel: 617-373-5222</p></note>
		</body>
		</text>
</TEI>
