<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>A Machine-Learning Approach for Semantically-Enriched Building-Code Sentence Generation for Automatic Semantic Analysis</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>11/09/2020</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10350179</idno>
					<idno type="doi">10.1061/9780784482865.133</idno>
					<title level='j'>Construction Research Congress 2020</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Ruichuan Zhang</author><author>Nora El-Gohary</author><author>P. Tang</author><author>D. Grau</author><author>M. El Asmar</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Existing automated code checking (ACC) systems require the extraction of requirements from regulatory textual documents into computer-processable rule representations. The information extraction (IE) processes in those ACC systems are based on either human interpretation, manual annotation, or predefined automated information extraction rules. Despite their high performance, rule-based information extraction approaches, by nature, lack sufficient scalability: the rules typically need some level of adaptation if the characteristics of the text change. Machine learning-based methods, instead of relying on hand-crafted rules, automatically capture the underlying patterns of the existing training text and have a great capability of generalizing to a variety of texts. A more scalable, machine learning-based approach is thus needed to achieve a more robust performance across different types of codes/documents for automatically generating semantically-enriched building-code sentences for the purpose of ACC. To address this need, this paper proposes a machine learning-based approach for generating semantically-enriched building-code sentences, which are annotated syntactically and semantically, for supporting IE. For improved robustness and scalability, the proposed approach uses transfer learning strategies to train deep neural network models on both general-domain and domain-specific data. The proposed approach consists of four steps: (1) data preparation and preprocessing; (2) development of a base deep neural network model for generating semantically-enriched building-code sentences; (3) model training using transfer learning strategies; and (4) model evaluation. The proposed approach was evaluated on a corpus of sentences from the 2009 International Building Code (IBC) and the Champaign 2015 IBC Amendments. The preliminary results show that the proposed approach achieved an optimal precision of 88%, recall of 86%, and F1-measure of 87%, indicating good performance.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>INTRODUCTION</head><p>Existing automated code checking (ACC) systems require the extraction of requirements from regulatory textual documents into computer-processable rule representations. The information extraction (IE) processes in those ACC systems rely on either human interpretation, manual annotation, or predefined automated information extraction rules. For example, the state-of-the-art methods for extracting ACC-related information are rule-based (e.g., <ref type="bibr">Zhang and El-Gohary 2013</ref>, <ref type="bibr">Zhou and El-Gohary 2017</ref>), which require human effort to develop rules for automatically extracting the information from the building codes. Despite their high performance, rule-based approaches, by nature, lack sufficient scalability: the rules typically need some level of adaptation if the characteristics of the text change. Machine learning-based methods, instead of relying on hand-crafted rules, automatically capture the underlying patterns of the existing training text and have a great capability of generalizing to a variety of texts. A more scalable, machine learning-based approach is thus needed to achieve a more robust performance, without requiring manual rule adaptation effort, across different types of codes/documents for automatically generating semantically-enriched building-code sentences for the purpose of ACC.</p><p>To address this need, this paper proposes a machine learning-based approach for generating such semantically-enriched building-code sentences, which are annotated with semantic information elements <ref type="bibr">(Zhang and El-Gohary 2013)</ref> and syntactic fillers, and are thus ready for computer processing and reasoning. The proposed approach uses transfer learning strategies to train deep neural network models on both general-domain and domain-specific data. 
On the one hand, general-domain data are large-scale and pattern-rich, which helps train the model to deal with different text patterns across multiple codes/documents for increased robustness and scalability; but general-domain data differ from the domain-specific data in terms of vocabulary, syntax, and semantics. On the other hand, domain-specific data (i.e., annotated building-code sentences) are the target data from the architecture/engineering/construction (AEC) domain, but they are much smaller in size and lower in syntactic and semantic richness, which would limit the robustness and scalability of the deep neural network model if they were solely used for training. The proposed approach, thus, combines the best of both worlds.</p><p>The proposed approach consists of four main steps: (1) preparation and preprocessing of training and testing data from both outside the AEC domain (i.e., the general-domain data) and within the AEC domain (i.e., the domain-specific data); (2) development of a base deep neural network model for generating semantically-enriched building-code sentences; (3) model training using different transfer learning strategies; and (4) model evaluation using precision, recall, and F1-measure.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>BACKGROUND Semantic Text Enrichment</head><p>Semantic text enrichment aims to attach computer-processable semantic information to natural language text <ref type="bibr">(Abel et al. 2011)</ref>. Compared to the original natural language text, semantically-enriched text contains highly-structured, and often domain-specific, semantic information that can be used directly by computers for semantic analysis tasks. There are many applications, tools, and platforms for creating and managing semantically-enriched texts (e.g., semantic wikis), and many research efforts have focused on automating the generation of semantically-enriched text and/or text semantic annotation (e.g., <ref type="bibr">Abel et al. 2011</ref><ref type="bibr">, Dugas et al. 2016</ref>). To address the needs of automated compliance checking of building designs, different types of building-code requirement representations have been proposed and can potentially be used as the semantic annotations in semantically-enriched building-code sentences, such as the semantic information elements <ref type="bibr">(Zhang and El-Gohary 2013)</ref> shown in Table <ref type="table">1</ref>.</p></div>
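To make the target representation concrete, the sketch below pairs one building-code-style sentence with token-level semantic annotations. The sentence, the label names, and the `enriched` helper are hypothetical illustrations loosely following the semantic information elements of Zhang and El-Gohary (2013); they are not the paper's actual annotation scheme.

```python
# Hypothetical sketch: a building-code sentence paired token-by-token with
# semantic annotations. Label names ("subject", "quantity_value", ...) are
# illustrative stand-ins for the semantic information elements; "filler"
# marks syntactic fillers.
sentence = "the minimum width of a corridor shall be 44 inches"

annotations = [
    ("the", "filler"),
    ("minimum", "comparative_relation"),
    ("width", "compliance_checking_attribute"),
    ("of", "filler"),
    ("a", "filler"),
    ("corridor", "subject"),
    ("shall", "filler"),
    ("be", "filler"),
    ("44", "quantity_value"),
    ("inches", "quantity_unit"),
]

def enriched(tokens_with_labels):
    """Render the semantically-enriched sentence as token/label pairs."""
    return " ".join(f"{tok}/{lab}" for tok, lab in tokens_with_labels)

print(enriched(annotations))
```

A semantically-enriched sentence in this form can be consumed directly by downstream rule extraction, since each word already carries its semantic role.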
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Deep Learning</head><p>Deep learning methods use computational models such as deep neural networks to learn multiple levels of information representations from large-scale data <ref type="bibr">(LeCun et al. 2015)</ref>. Deep learning methods have drastically improved the state-of-the-art performance in many domains, such as natural language processing and computer vision, while reducing or eliminating the manual feature-engineering effort required by traditional machine learning methods. Deep learning methods have been used in the AEC domain for solving computer vision problems such as construction equipment detection <ref type="bibr">(Kim et al. 2017)</ref>, activity recognition <ref type="bibr">(Luo et al. 2018)</ref>, and crack detection <ref type="bibr">(Park et al. 2019)</ref>, and text analysis problems such as building-code requirement extraction <ref type="bibr">(Zhang and El-Gohary 2019)</ref>. The most commonly used deep neural networks include convolutional neural networks <ref type="bibr">(Kim et al. 2017;</ref><ref type="bibr">Luo et al. 2018;</ref><ref type="bibr">Gulgec et al. 2019</ref>) and recurrent neural networks <ref type="bibr">(Zhang and El-Gohary 2019)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Transfer Learning</head><p>Transfer learning aims to use machine-learning models that are trained for one task and/or on data from one domain for another task and/or on data from another domain <ref type="bibr">(Shin et al. 2016)</ref>. By enabling the training of machine learning models on large-scale, pattern-rich, annotated training data from outside the target domain (e.g., the AEC domain) for solving domain-specific tasks, transfer learning techniques can improve the robustness and scalability of machine learning-based methods (e.g., <ref type="bibr">Teh et al. 2017</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>PROPOSED MACHINE LEARNING-BASED APPROACH FOR SEMANTICALLY-ENRICHED BUILDING-CODE SENTENCE GENERATION</head><p>The proposed machine learning-based approach for generating semantically-enriched building-code sentences consists of four main steps, as shown in Figure <ref type="figure">1</ref>. A bidirectional long short-term memory (LSTM) network with a conditional random field (CRF) output layer <ref type="bibr">(Huang et al. 2015)</ref> was adopted as the base model for generating semantically-enriched building-code sentences from natural language building-code sentences. The base deep neural network model consists of three main layers: the input word-embedding layer, the bidirectional LSTM layer, and the output CRF layer, as depicted in Figure <ref type="figure">3</ref>. The input word-embedding layer aimed to represent the semantics of each word as a vector of real numbers for deep neural network computation purposes. The LSTM layer aimed to learn the feature representations of each word using the input word embeddings of the current word and the context words. To improve the ability of the LSTM layer to deal with long-term syntactic dependencies in the sentences, the bidirectional architecture was used: both the forward and backward context words were considered when learning the feature representations. Finally, for each word, given the feature representations learned by the LSTM layer, the output CRF layer aimed to compute the conditional probabilities of the different types of semantic annotations (i.e., the semantic information elements and the syntactic fillers), based on which the final type of semantic annotation can be predicted. To estimate the model parameters, the cross-entropy loss was minimized. The model was implemented using Keras in Python 3, and run on top of TensorFlow.</p></div>
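As a rough illustration of how the three layers compose, the toy forward pass below maps words to vectors, builds a forward-and-backward context summary per position, and emits one score per annotation label. All vectors, label names, and scoring rules here are invented stand-ins; the actual model uses trained Keras/TensorFlow layers.

```python
# A toy forward pass mirroring the three-layer structure: word-embedding
# lookup -> a forward-and-backward context summary per position -> one
# score per annotation label. All vectors, label names, and scoring rules
# are invented stand-ins for the trained layers.
EMBED = {"corridor": [1.0, 0.0], "width": [0.0, 1.0], "shall": [0.5, 0.5]}
LABELS = ["subject", "attribute", "filler"]

def embed(tokens):
    """Input layer: map each word to its embedding vector."""
    return [EMBED.get(tok, [0.0, 0.0]) for tok in tokens]

def bidirectional_context(vectors):
    """Mock BiLSTM: concatenate running forward and backward summaries."""
    acc, fwd = [0.0, 0.0], []
    for v in vectors:
        acc = [a + x for a, x in zip(acc, v)]
        fwd.append(acc)
    acc, bwd = [0.0, 0.0], []
    for v in reversed(vectors):
        acc = [a + x for a, x in zip(acc, v)]
        bwd.append(acc)
    bwd.reverse()
    return [f + b for f, b in zip(fwd, bwd)]  # per-position features, dim 4

def label_scores(features):
    """Output-layer stand-in: a score for each label at each position."""
    return [{lab: sum(f) * (i + 1) for i, lab in enumerate(LABELS)}
            for f in features]

feats = bidirectional_context(embed(["width", "shall", "corridor"]))
scores = label_scores(feats)
print(len(feats), len(feats[0]))  # 3 positions, 4 features each
```

The real bidirectional layer learns its summaries rather than summing embeddings, but the data flow, one feature vector and one score-per-label per token, is the same.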
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Two-stage Training Strategy</head><p>In the two-stage training strategy (as illustrated in Figure <ref type="figure">4a</ref>), the deep neural network model was trained in two related stages. In the first stage, the model was trained on the general-domain data. In the second stage, the output CRF layer of the trained model (i.e., CRF 1) was replaced by a new output CRF layer (i.e., CRF 2), and the model was trained on the domain-specific data. During the training in the second stage, only the output CRF layer (CRF 2) was trainable; the input word-embedding layer and the bidirectional LSTM layer were not trainable, so the parameters of these two layers remained unchanged. For each stage, the training was stopped when the difference between the training losses of two consecutive epochs was smaller than 0.01 or the training reached 50 epochs.</p></div>
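The two-stage schedule can be sketched with a mock model whose "layers" are single numbers. The stopping rule (loss change below 0.01, or 50 epochs) follows the text; the layer names, batch data, and the stand-in update and loss are invented for illustration.

```python
# A schematic sketch of the two-stage strategy (not the authors'
# implementation): stage 1 updates all layers on general-domain batches;
# stage 2 swaps in a fresh CRF head and updates only that head on the
# domain-specific batches. The +1 "update" and the mock loss are invented.
def train_stage(params, trainable, batches, max_epochs=50, tol=0.01):
    """Run mock epochs; stop once the loss change drops below tol."""
    prev_loss = float("inf")
    for epoch in range(max_epochs):
        loss = 0.0
        for _ in batches:
            for name in trainable:
                params[name] += 1          # stand-in for a gradient update
            loss += 1.0 / (epoch + 1)      # stand-in for a decreasing loss
        if abs(prev_loss - loss) < tol:
            break
        prev_loss = loss
    return params

params = {"embedding": 0, "bilstm": 0, "crf1": 0}
# Stage 1: all layers trainable, general-domain data.
train_stage(params, ["embedding", "bilstm", "crf1"], batches=range(3))
# Stage 2: new CRF head; embedding and BiLSTM layers are frozen.
params["crf2"] = 0
frozen = (params["embedding"], params["bilstm"])
train_stage(params, ["crf2"], batches=range(3))
assert (params["embedding"], params["bilstm"]) == frozen  # unchanged
```

Freezing the shared layers in stage 2 is what makes this a feature-extraction style of transfer: only the new head adapts to the domain-specific labels.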
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Alternating Training Strategy</head><p>In the alternating training strategy (as illustrated in Figure <ref type="figure">4b</ref>), the deep neural network model was trained on the general-domain and the domain-specific training data in an alternating manner. The model had two separate output CRF layers: one used when the model was trained on the general-domain training data (i.e., CRF 1), and the other used when the model was trained on the domain-specific training data (i.e., CRF 2). In each training iteration, the model was trained on a selected batch of the general-domain data with an alternating probability p, and on a selected batch of the domain-specific training data with a probability of (1-p). Typically, the alternating probability p is a number close to 1, meaning the model is trained more frequently on the general-domain data than on the domain-specific data, to prevent overfitting on the relatively small-scale domain-specific data. The training was stopped when the difference between the training losses of two consecutive epochs on the domain-specific data was smaller than 0.01 or the training on the domain-specific data reached 50 epochs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Step 4: Model Evaluation</head><p>Given a natural language building-code sentence and a trained deep neural network model, the corresponding semantically-enriched building-code sentence was generated by searching for the optimal sequence of semantic annotations (i.e., semantic information elements and syntactic fillers) that maximizes the sum of the conditional log-likelihoods computed by the output CRF layer. 
The search was conducted using dynamic programming on the matrix of conditional probabilities, allowing the optimal sequence of semantic enrichments to be generated in polynomial rather than exponential time.</p><p>Three metrics were used to evaluate the performance of the deep neural network models for generating the semantically-enriched building-code sentences: precision (TP/(TP+FP)), recall (TP/(TP+FN)), and F1-measure (the harmonic mean of precision and recall), where, for a specific type of semantic enrichment SE, TP is the number of true positives (i.e., the number of words correctly labeled as SE), FP is the number of false positives (i.e., the number of words incorrectly labeled as SE), and FN is the number of false negatives (i.e., the number of words that should have been labeled as SE but were not) <ref type="bibr">(Zhai and Massung 2016)</ref>.</p></div>
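The dynamic-programming search and the three metrics can be sketched as follows. The two-label set ("S" for a semantic information element, "F" for a syntactic filler), the emission/transition scores, and the helper names are toy illustrations, not the paper's trained CRF.

```python
# A self-contained sketch: dynamic programming (Viterbi-style) over a
# CRF's score matrices, plus per-label precision/recall/F1. emissions[t][y]
# scores label y at position t; transitions[y0][y] scores moving y0 -> y.
# All labels and scores are toy values.
def viterbi(emissions, transitions, labels):
    """Return the label sequence maximizing the summed scores."""
    n = len(emissions)
    best = [{} for _ in range(n)]
    back = [{} for _ in range(n)]
    for y in labels:
        best[0][y] = emissions[0][y]
    for t in range(1, n):
        for y in labels:
            score, prev = max(
                (best[t - 1][y0] + transitions[y0][y], y0) for y0 in labels
            )
            best[t][y] = score + emissions[t][y]
            back[t][y] = prev
    y = max(labels, key=lambda lab: best[n - 1][lab])  # best final label
    path = [y]
    for t in range(n - 1, 0, -1):  # backtrack through the pointers
        y = back[t][y]
        path.append(y)
    return list(reversed(path))

def prf(gold, pred, label):
    """Precision, recall, and F1 for one annotation type."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(p == label and g != label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

labels = ["S", "F"]
emissions = [{"S": 2.0, "F": 0.0}, {"S": 0.0, "F": 1.0}, {"S": 1.5, "F": 0.0}]
transitions = {"S": {"S": 0.0, "F": 0.5}, "F": {"S": 0.5, "F": 0.0}}
print(viterbi(emissions, transitions, labels))  # ['S', 'F', 'S']
```

The table `best` holds the highest achievable score for each label at each position, which is why the search is polynomial (positions x labels x labels) rather than exponential in the sentence length.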
<div xmlns="http://www.tei-c.org/ns/1.0"><head>PRELIMINARY EXPERIMENTAL RESULTS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Performance of the Proposed Approach with Different Transfer Learning Strategies</head><p>The two different transfer learning strategies, the two-stage training strategy and the alternating training strategy, were tested. The deep neural network model for generating semantically-enriched building-code sentences achieved better performance when the alternating training strategy was used, outperforming the model using the two-stage training strategy by 15% in precision, 11% in recall, and 14% in F1-measure, as shown in Table <ref type="table">2</ref>. The deep neural network model using the two-stage training strategy achieved relatively low performance. This is possibly because the input word-embedding layer and the bidirectional LSTM layer of the deep neural network model were trained on the general-domain data only, and therefore might not have been able to learn representations that capture the syntactic and semantic patterns in the domain-specific data well.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Performance of the Proposed Approach Using the Alternating Training Strategy with Different Alternating Probabilities</head><p>When the alternating training strategy was used for training the deep neural network model, different alternating probabilities were tested: 90%, 92%, 95%, and 99%. The optimal performance for semantically-enriched building-code sentence generation was achieved when the alternating probability was 92%, as shown in Table <ref type="table">3</ref>. However, the differences were small, and further testing is needed to study the statistical and practical significance of these performance differences. Compared to a medium alternating probability (i.e., 92%), when the alternating probability was very high (i.e., 99%), the performance decreased by 13% in precision, 6% in recall, and 10% in F1-measure, possibly because the input word-embedding layer and the bidirectional LSTM layer of the deep neural network model were mainly trained on the general-domain data and might not have been able to learn representations that capture the syntactic and semantic patterns in the domain-specific data well. Compared to a medium alternating probability (i.e., 92%), when the alternating probability was lower (i.e., 90%), the performance started to decrease slightly. This is possibly because the deep neural network model was overly trained on the domain-specific training data, and thus overfitted to these data.</p></div>
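The role of the alternating probability can be illustrated with a small scheduling sketch: each iteration draws a general-domain batch (routed through CRF 1) with probability p, and a domain-specific batch (routed through CRF 2) otherwise. The data-source and head names are invented labels, not the authors' code.

```python
import random

# A scheduling sketch (illustrative, not the authors' code): each
# iteration routes a general-domain batch through CRF 1 with probability
# p, and a domain-specific batch through CRF 2 otherwise.
def alternating_schedule(p, iterations, rng):
    """Return the sequence of (data source, CRF head) choices."""
    schedule = []
    for _ in range(iterations):
        if rng.random() < p:
            schedule.append(("general", "crf1"))
        else:
            schedule.append(("domain", "crf2"))
    return schedule

schedule = alternating_schedule(p=0.92, iterations=1000, rng=random.Random(0))
general_share = sum(src == "general" for src, _ in schedule) / len(schedule)
print(f"general-domain share: {general_share:.2f}")  # close to 0.92
```

At p = 0.99 only about 1 iteration in 100 touches the domain-specific data, which matches the observed drop: the shared layers see the target domain too rarely; at p = 0.90 the small domain-specific set is revisited often enough to overfit.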
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Error Analysis</head><p>Two main types of errors were identified based on the experimental results. First, the proposed approach made errors when generating semantically-enriched forms of multiword expressions, which consist of multiple words but function as single syntactic and semantic units, especially those including prepositions. For example, the words in the multiword expression "means of egress" should have been annotated with a single semantic information element (a subject), but the proposed model annotated the expression with a subject, a syntactic filler, and another subject. In future work, a multiword expression list could be integrated into the proposed approach for generating semantically-enriched building-code sentences. Second, the proposed approach made errors when dealing with some compliance checking attributes. For example, "Group R-1", which denotes the first residential group in the use and occupancy classification of the IBC, was usually mistakenly annotated as part of a subject instead of a compliance checking attribute. In the future, additional input embedding layers (e.g., character embeddings) could be used to capture useful patterns beyond the syntactic and semantic ones.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>CONCLUSION</head><p>This paper proposed a machine learning-based method for generating semantically-enriched building-code sentences for semantic analysis of the code in support of automated compliance checking. First, a bidirectional LSTM-CRF model was adopted for the semantically-enriched building-code sentence generation task. Second, two transfer learning strategies were used for training the deep neural network models on both the general-domain data and the domain-specific data. Third, the proposed approach was tested on the domain-specific data. The proposed method achieved a precision of 88%, a recall of 86%, and an F1-measure of 87% when the second transfer learning strategy (the alternating training strategy) was used, indicating good semantically-enriched building-code sentence generation performance.</p><p>This paper contributes to the body of knowledge in three primary ways. First, the paper proposed a new, machine learning-based approach to generate semantically-enriched building-code sentences for facilitating the semantic analysis of the code in support of automated compliance checking. Second, the paper leveraged large-scale, pattern-rich annotated training data from outside the AEC domain by using transfer learning strategies to increase the robustness and scalability of the proposed approach. Third, the experimental results show that the transfer learning strategies and some of the hyperparameters (e.g., the alternating probability for the alternating training strategy) of the deep neural network models could contribute to the performance variations of the models.</p><p>In their future work, the authors first plan to improve the proposed approach for generating semantically-enriched building-code sentences by including more types of semantic information elements, such as restrictions and references, as semantic annotations. 
Second, the authors will explore further ways to improve the performance of the proposed approach, including testing different general-domain training data, using more domain-specific training data, exploring different transfer learning strategies (e.g., initializing the parameters of the input word-embedding layer using pretrained word embeddings), and integrating a domain ontology into the proposed approach. Third, and most importantly, the authors plan to integrate the proposed approach for semantically-enriched building-code sentence generation with machine learning-based information extraction and semantic information matching, with the aim of developing a scalable method for fully automated compliance checking.</p></div></body>
		</text>
</TEI>
