<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Automatic Extraction of Opinion-based Q&amp;A from Online Developer Chats</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10287696</idno>
					<idno type="doi"></idno>
					<title level='j'>Proceedings of the International Conference on Software Engineering</title>
<idno>1819-3781</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Preetha Chatterjee</author><author>Kostadin Damevski</author><author>Lori Pollock</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Virtual conversational assistants designed specifically for software engineers could have a huge impact on the time it takes for software engineers to get help. Research efforts are focusing on virtual assistants that support specific software development tasks such as bug repair and pair programming. In this paper, we study the use of online chat platforms as a resource towards collecting developer opinions that could potentially help in building opinion Q&A systems, as a specialized instance of virtual assistants and chatbots for software engineers. Opinion Q&A has a stronger presence in chats than in other developer communications, thus mining them can provide a valuable resource for developers in quickly getting insight about a specific development topic (e.g., What is the best Java library for parsing JSON?). We address the problem of opinion Q&A extraction by developing automatic identification of opinion-asking questions and extraction of participants' answers from public online developer chats. We evaluate our automatic approaches on chats spanning six programming communities and two platforms. Our results show that a heuristic approach to opinion-asking questions works well (.87 precision), and a deep learning approach customized to the software domain outperforms heuristics-based, machine-learning-based, and deep learning approaches for answer extraction in community question answering.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Recognizing the increasing capabilities of virtual assistants that use conversational artificial intelligence (AI) (e.g., chatbots, voice assistants), some researchers in software engineering are working towards the development of virtual assistants to help programmers. They have conducted studies to gain insights into the design of a programmer conversational agent <ref type="bibr">[1]</ref>, proposed techniques to automatically detect speech acts in conversations about bug repair to aid the assistant in mimicking different conversation types <ref type="bibr">[2]</ref>, and designed virtual assistants for API usage <ref type="bibr">[3]</ref>.</p><p>While early versions of conversational assistants were focused on short, task-oriented dialogs (e.g., playing music or asking for facts), more sophisticated virtual assistants deliver coherent and engaging interactions by understanding dialog nuances such as user intent (e.g., asking for an opinion vs. knowledge) <ref type="bibr">[4]</ref>. They integrate specialized instances dedicated to a single task, including dialog management, knowledge retrieval, opinion-mining, and question-answering <ref type="bibr">[5]</ref>. To build virtual assistants for software engineers, we need to provide similar specialized instances based on the available information from software engineers' daily conversations. Recent studies indicate that online chat services such as IRC, Slack, and Gitter are increasingly popular platforms for software engineering conversations, including both factual and opinion information sharing, and are now playing a significant role in software development activities <ref type="bibr">[6]</ref>-<ref type="bibr">[9]</ref>.
These conversations potentially provide rich data for building virtual assistants for software engineers, but little research has explored this potential.</p><p>In this paper, we leverage the availability of opinion-asking questions in developer chat platforms to explore the feasibility of building opinion-providing virtual assistants for software engineers. Opinion question answering (Opinion QA) systems <ref type="bibr">[10]</ref>-<ref type="bibr">[13]</ref> aim to find answers to subjective questions from user-generated content, such as online forums, product reviews, and discussion groups. One type of virtual assistant that can benefit from opinions is the Conversational Search Assistant (CSA) <ref type="bibr">[24]</ref>. CSAs support information seekers who struggle to form good queries for exploratory search, e.g., seeking opinions/recommendations on APIs, tools, or resources, by eliciting the actual need from the user through conversation. Studies indicate that developers conducting web searches or querying Q&amp;A sites for relevant questions often find it difficult to formulate good queries <ref type="bibr">[25]</ref>, <ref type="bibr">[26]</ref>. Wizard of Oz studies have explicitly shown the need for opinions within CSAs <ref type="bibr">[27]</ref>. A key result of our paper is the availability of opinions on chat platforms, which would enable the creation of a sizable opinion Q&amp;A corpus that could actually be used by CSAs.
The opinion Q&amp;A corpus generated from chats by our technique can be used in a few different ways to build a CSA: 1) matching queries/questions asked to the CSA with questions from the corpus and retrieving the answers; 2) summarizing related groups of opinion Q&amp;A (e.g., using a GAN) to generate an aggregate response for a specific software engineering topic.</p><p>Opinion extraction efforts in software engineering have focused on API-related opinions and developer emotions from Q&amp;A forums <ref type="bibr">[14]</ref>-<ref type="bibr">[18]</ref>, developer sentiments from commit logs <ref type="bibr">[19]</ref>, developer intentions from emails and issue reports <ref type="bibr">[20]</ref>, <ref type="bibr">[21]</ref>, and detecting software requirements and feature requests from app reviews <ref type="bibr">[22]</ref>, <ref type="bibr">[23]</ref>. These studies suggest that, beyond reducing developers' effort of manual searches on the web and facilitating information gathering, mining of opinions could help in increasing developer productivity, improving code efficiency <ref type="bibr">[16]</ref>, and building better recommendation systems <ref type="bibr">[17]</ref>.</p><p>Findings from our previous exploratory study <ref type="bibr">[9]</ref> of Slack conversations suggest that developer chats include opinion expression during human conversations. Our current study (in Section II) of 400 developer chat conversations selected from six programming communities showed that 81 (20%) of the chat conversations start with a question that asks for opinions (e.g., "Which one is the best ORM that is efficient for large datasets?", "What do you think about the Onyx platform?").
This finding shows a much higher prevalence of questions asking for opinions in chats than the 1.6% found in emails <ref type="bibr">[20]</ref> and 1.1% found in issue reports <ref type="bibr">[21]</ref>.</p><p>Thus, we investigate the problem of opinion Q&amp;A extraction from public developer chats in this paper. We decompose the problem of opinion Q&amp;A extraction into two subproblems: (1) identifying questions where the questioner asks for opinions from other chat participants (which we call posing an opinion-asking question), and (2) extracting answers to those opinion-asking questions within the containing conversation.</p><p>Researchers extracting opinions from software-related documents have focused on identifying sentences containing opinions, using lexical patterns <ref type="bibr">[20]</ref> and sentiment analysis techniques <ref type="bibr">[16]</ref>, <ref type="bibr">[17]</ref>, <ref type="bibr">[28]</ref>. However, these techniques are not directly applicable to identifying opinion-asking questions in chats for several reasons. Chat communities differ in format, with no formal structure and an informal conversation style. The natural language text in chats could follow different syntactic patterns and contain incomplete sentences <ref type="bibr">[9]</ref>, which could potentially inhibit automatic mining of opinions.</p><p>Outside the software engineering domain, researchers have addressed the problem of answer extraction from community question-answering (CQA) forums by using deep neural network models <ref type="bibr">[29]</ref>-<ref type="bibr">[31]</ref> and syntactic tree structures <ref type="bibr">[32]</ref>, <ref type="bibr">[33]</ref>. Compared to CQA forums, chats contain rapid exchanges of messages between two or more developers in short bursts <ref type="bibr">[9]</ref>.
A question asked at the start of a conversation may be followed by a series of clarification or follow-up questions and their answers, before the answers to the original question are given. Moreover, along with the answers, conversations sometimes contain noisy and unrelated information. Therefore, to determine the semantic relation between a question and an answer, understanding the context of the discussion is crucial.</p><p>We are the first to extract opinion Q&amp;A from developer chats, which could be used to support SE virtual assistants as well as chatbots, programmer API recommendation, automatic FAQ generation, and understanding of developer behavior and collaboration. Our automatic opinion Q&amp;A extraction takes a chat conversation as input and automatically identifies whether the conversation starts with an opinion-asking question, and if so, extracts one or more opinion answers from the conversation. The major contributions of this paper are:</p><p>&#8226; For opinion-asking question identification, we designed a set of heuristics, informed by the results of our preliminary chat analysis, to determine if the leading question in a chat conversation asks for opinions. &#8226; Our evaluation results show that we can automatically identify opinion-asking questions and extract their corresponding answers within a chat conversation with a precision of 0.87 and 0.77, respectively. &#8226; We publish the dataset and source code <ref type="foot">1</ref> to facilitate the replication of our study and its application in other contexts.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. OPINION-ASKING QUESTIONS IN DEVELOPER ONLINE COMMUNICATIONS</head><p>Since developer chats constitute a subclass of developer online communications, we began by investigating whether we could gain insights from work by others on analyzing opinion-asking questions in other kinds of developer online discussions (emails, issue reports, Q&amp;A forums). Emails. The most closely related work, by Di Sorbo et al. <ref type="bibr">[20]</ref>, proposed an approach to classify email sentences according to developers' intentions (feature request, opinion asking, problem discovery, solution proposal, information seeking, and information giving). Their taxonomy of intentions and associated linguistic patterns have also been applied to analyze user feedback in app reviews <ref type="bibr">[34]</ref>, <ref type="bibr">[35]</ref>.</p><p>In their taxonomy, Di Sorbo et al. define "opinion asking" as: requiring someone to explicitly express his/her point of view about something (e.g., What do you think about creating a single webpage for all the services?). They claim that sentences belonging to "opinion asking" may emphasize discussion elements useful for developers' activities, and thus make it reasonable to distinguish them from more general information requests such as "information seeking". Of their manually labelled 1077 sentences from mailing lists of Qt and Ubuntu, only 17 sentences (1.6%) were classified as "opinion asking", suggesting that opinion-asking questions are infrequent in developer emails. Issue Reports. To investigate the comprehensiveness and generalizability of Di Sorbo et al.'s taxonomy, Huang et al. <ref type="bibr">[21]</ref> manually labelled 1000 sentences from issue reports of four projects (TensorFlow, Docker, Bootstrap, VS Code) in GitHub. Consistent with Di Sorbo et al.'s findings, Huang et al. <ref type="bibr">[21]</ref> reported that only 11 (1.1%) sentences were classified as "opinion asking".
Given this low percentage and that "opinion asking" could be a sub-category of "information seeking", they merged these two categories in their study. To broaden their search for opinions, Huang et al. introduced a new category, "aspect evaluation", defined as: express opinions or evaluations on a specific aspect (e.g., "I think BS3's new theme looks good, it's a little flat style.", "But I think it's cleaner than my old test, and I prefer a non-JS solution personally."). They classified 14-20% of sentences as "aspect evaluation". Comparing the two definitions and their results, it is evident that although opinions are expressed widely in issue reports, questions asking for others' opinions are rare. Chats. Chatterjee et al.'s <ref type="bibr">[9]</ref> results showing potentially more opinions in Slack developer chats motivated us to perform a manual study to systematically analyze the occurrence of opinion-asking questions and their answers in developer chats. Dataset: To create a representative analysis dataset, we identified chat groups that primarily discuss software development topics and have a substantial number of participants. We selected three programming communities with an active presence on Slack. Within those selected communities, we focused on public channels that follow a Q&amp;A format, i.e., a conversation typically starts with a question and is followed by a discussion potentially containing multiple answers or no answers. Our analysis dataset of 400 Slack developer conversations consists of 100 conversations from the Slack Pythondev#help channel, 100 from clojurians#clojure, 100 from elmlang#beginners, and 100 from elmlang#general, all chosen randomly from the dataset released by Chatterjee et al. <ref type="bibr">[36]</ref>. Procedure: Using the definition of opinion-asking sentences proposed by Di Sorbo et al.
<ref type="bibr">[20]</ref>, two annotators (authors of this paper) independently identified conversations starting with an opinion-asking question. We also investigated whether those questions were answered by others in a conversation. The authors annotated a shared set of 400 conversations, a sample size sufficient to compute the agreement measure with high confidence <ref type="bibr">[37]</ref>. We computed Cohen's Kappa inter-rater agreement between the 2 annotators, and found an agreement of 0.74, which is considered to be sufficient (&gt; 0.6) <ref type="bibr">[38]</ref>. The annotators further discussed their annotations iteratively until all disagreements were resolved. Observations: We observed that out of our 400 developer conversations, 81 conversations (20%) start with an opinion-asking question. There are a total of 134 answers to those 81 opinion-asking questions, since each conversation could contain zero or multiple answers.</p><p>Table <ref type="table">I</ref> shows an opinion-asking question (Ques) and its answers (Ans) extracted from a conversation on the #python-dev channel. Each of the answers contains sufficient information as a standalone response, and thus could be paired with the question to form separate Q&amp;A pairs. Given that conversations are composed of a sequence of utterances by each of the people participating in the conversation in a back-and-forth manner, the Q&amp;A pairs are pairs of utterances.</p><p>Summary: Compared to other developer communications, conversations starting with opinion-asking questions in developer chats are much more frequent. Thus, chats may serve as a better resource to mine for opinion Q&amp;A systems.</p></div>
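The inter-rater agreement reported above is standard Cohen's Kappa; as a minimal sketch (our own helper, not the authors' code), it can be computed for two annotators over binary labels like so:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators label identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy example: 1 = conversation starts with an opinion-asking question.
ann1 = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
ann2 = [1, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(round(cohens_kappa(ann1, ann2), 2))
```

Values above 0.6 on such a computation are conventionally taken as sufficient agreement, which is the threshold the study applies.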
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. AUTOMATICALLY EXTRACTING OPINION Q&amp;A FROM DEVELOPER CHATS</head><p>Figure <ref type="figure">1</ref> presents an overview of our approach, ChatEO, to automatically Extract Opinion Q&amp;A from software developer Chats. Our approach takes a developer chat history as input and extracts opinion Q&amp;A pairs using three major steps:</p><p>(1) Individual conversations are extracted from the interleaved chat history using conversation disentanglement. (2) Conversations starting with an opinion-asking question are identified by applying textual heuristics. (3) The available answers (possibly multiple) to the opinion-asking question within the conversation are identified using a deep-learning-based approach.</p></div>
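The three-step pipeline can be sketched in code; every function body below is our illustrative stub under assumed data shapes (a list of utterance dicts), not the authors' implementation:

```python
def disentangle(chat_log):
    """Step 1 (stub): split an interleaved chat log into conversations.
    A real implementation would use a trained disentanglement model;
    here we assume a precomputed conversation id on each utterance."""
    convs = {}
    for utt in chat_log:
        convs.setdefault(utt["conv_id"], []).append(utt)
    return list(convs.values())

def starts_with_opinion_question(conversation):
    """Step 2 (stub): toy stand-in for the textual heuristics applied to
    the first speaker's utterance."""
    first = conversation[0]["text"].lower()
    return first.endswith("?") and any(
        w in first for w in ("best", "better", "recommend", "think", "opinion"))

def extract_answers(conversation):
    """Step 3 (stub): stand-in for the deep-learning answer extractor;
    here we simply return utterances not authored by the original asker."""
    asker = conversation[0]["author"]
    return [u for u in conversation[1:] if u["author"] != asker]

def extract_opinion_qa(chat_log):
    """End-to-end pipeline: disentangle, filter, extract."""
    pairs = []
    for conv in disentangle(chat_log):
        if starts_with_opinion_question(conv):
            pairs.append((conv[0], extract_answers(conv)))
    return pairs

log = [
    {"conv_id": 1, "author": "a", "text": "Which ORM is best for large datasets?"},
    {"conv_id": 2, "author": "b", "text": "My build fails on CI."},
    {"conv_id": 1, "author": "c", "text": "I like SQLAlchemy for that."},
]
q, answers = extract_opinion_qa(log)[0]
print(q["text"], "->", answers[0]["text"])
```

The point of the skeleton is the data flow: interleaved log in, disentangled conversations in the middle, question/answer pairs out.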
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Conversation Disentanglement</head><p>Since utterances in chats form a stream, conversations often interleave such that a single conversation thread is entangled with other conversations. Hence, to ease individual conversation analysis, we separate, or disentangle, the conversation threads in each chat log. The disentanglement problem has been widely addressed by researchers in the context of IRC and similar chat platforms <ref type="bibr">[9]</ref>, <ref type="bibr">[36]</ref>, <ref type="bibr">[39]</ref>-<ref type="bibr">[41]</ref>. In this paper, we used the best available disentanglement approaches proposed for Slack and IRC chat logs, respectively.</p><p>&#8226; Slack chat logs: We used a subset of the publicly available disentangled Slack chat dataset<ref type="foot">foot_1</ref> released by Chatterjee et al. <ref type="bibr">[36]</ref>, since their modified disentanglement algorithm customized for Slack developer chats achieved a micro-averaged F-measure of 0.80. &#8226; IRC chat logs: We used Kummerfeld et al.'s <ref type="bibr">[41]</ref> technique, a feed-forward neural network model for conversation disentanglement, trained on 77K manually annotated IRC utterances and achieving 74.9% precision and 79.7% recall.</p><p>In the disentangled conversations, each utterance contains a unique conversation id and metadata including timestamp and author information.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Opinion-Asking Question Identification</head><p>Di Sorbo et al. <ref type="bibr">[20]</ref> claim that developers tend to use recurrent linguistic patterns within discussions about development issues. Thus, using natural language parsing, they defined five linguistic patterns to identify opinion-asking questions in developer emails. First, to investigate the generalizability of Di Sorbo et al.'s linguistic patterns, we used their replication package of DECA<ref type="foot">3</ref> to identify opinion-asking questions in developer chats.
We found that, out of the 400 conversations in our manual analysis dataset, only 2 questions were identified as opinion-asking questions by DECA. Hence, we conducted a deeper analysis of the opinion-asking questions in our manual dataset from Section II, to identify additional linguistic patterns that represent opinion-asking questions in developer chats.</p><p>We followed a qualitative content analysis procedure <ref type="bibr">[42]</ref>, where the same two annotators (authors of this paper) first independently analyzed 100 developer conversations to understand the structure and characteristics of opinion-asking questions in chats. The utterances of the first speaker in each conversation were manually analyzed to categorize whether they were asking for an opinion. When an opinion-asking question was manually identified, the annotator identified the parts of the utterance that contributed to that decision and noted part-of-speech tags and recurrent keywords towards potential linguistic patterns. Consider the question in Table <ref type="table">I</ref>. First, the annotator selects the text that represents an opinion-asking question, in this case: "which one is the best ORM which is efficient for large datasets?". "ORM" is noted as a noun referring to the library, and "best" as an adjective related to the opinion about the ORM. Thus, a proposed linguistic pattern to consider is: "Which one &lt;verb to be&gt; [positive adjective] [rec target noun] &lt;verb phrase&gt;?".</p><p>Throughout the annotation process, the annotators wrote memos to facilitate the analysis, recording observations on the types of opinions asked, observed linguistic patterns of opinion-asking questions, and researcher reflections.
The two annotators then met and discussed their observations on common types of questions asking for opinions, which resulted in a set of preliminary patterns for identifying opinion-asking questions in developer chats. Using the preliminary patterns, the two authors then independently coded the rest of the opinion-asking questions in our manual analysis dataset, after which they met to further analyze their annotations and discuss disagreements. Thus, the analysis was performed in an iterative approach comprising multiple sessions, which helped in generalizing the hypotheses and revising the linguistic patterns of opinion-asking questions.</p><p>Our manual analysis findings from 400 Slack conversations showed that an opinion-asking question in a developer chat is a question occurring primarily at the beginning of a conversation that could exhibit any of these characteristics:</p><p>&#8226; Expects subjective answers (i.e., opinions) about APIs, libraries, examples, or resources, e.g., "Is this a bad style?", "What do you think?" &#8226; Asks which path to take among several paths, e.g., "Should I use X instead of Y?" &#8226; Asks for an alternative solution (other than the questioner's current solution), e.g., "Is there a better way?"</p><p>Thus, we extended Di Sorbo et al.'s linguistic pattern set for identifying opinion-asking questions by adding 10 additional linguistic patterns. Table <ref type="table">II</ref> shows the most common pattern, P ANY ADJ, in our dataset with its description and an example question. Most of the patterns utilize a combination of keywords and part-of-speech tagging. The annotators curated sets of keywords in several categories, e.g., [rec target noun], [rec verbs], and [rec positive adj], related to nouns, verbs, and adjectives, respectively. The complete set of patterns and keyword lists is available in our replication package.<note place="foot" n="3">DECA: <ref type="url">https://www.ifi.uzh.ch/en/seal/people/panichella/tools/DECA.html</ref></note></p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table II: Most Common Pattern for Opinion-asking Question Identification in Chats</head><p>Pattern code: P ANY ADJ. Description: a question starting with "any" and followed by a positive adjective and a target noun. Rule: Any [rec positive adj] [rec target noun] &lt;verb phrase&gt;. Definitions: [rec positive adj] = "good", "better", "best", "right", "optimal", ...; [rec target noun] = "project", "api", "tutorial", "library", ... Example: "Any good examples or things I need to be looking at?"</p></div>
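To illustrate how a pattern such as P ANY ADJ might be operationalized, here is a regex-based sketch with small placeholder keyword sets; the authors' full curated lists and pattern set live in their replication package, so this is only our approximation:

```python
import re

# Placeholder keyword sets; the paper's curated lists are larger.
REC_POSITIVE_ADJ = ["good", "better", "best", "right", "optimal"]
REC_TARGET_NOUN = ["project", "api", "tutorial", "library", "example", "examples"]

# "Any" + positive adjective + (optional intervening word) + target noun.
P_ANY_ADJ = re.compile(
    r"^any\s+(?:%s)\s+\w*\s*(?:%s)\b"
    % ("|".join(REC_POSITIVE_ADJ), "|".join(REC_TARGET_NOUN)),
    re.IGNORECASE)

def matches_p_any_adj(question):
    """True if the question matches the P ANY ADJ sketch above."""
    return bool(P_ANY_ADJ.search(question.strip()))

print(matches_p_any_adj("Any good examples or things I need to be looking at?"))
print(matches_p_any_adj("Any idea why this build fails?"))
```

A real implementation would combine such keyword lists with part-of-speech tags, as the paper describes, rather than relying on a surface regex alone.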
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Answer Selection from a Conversation</head><p>We build upon work by Zhou et al. <ref type="bibr">[29]</ref>, who designed R-CNN, a deep learning architecture, for answer selection in community question answering (CQA). Since R-CNN was designed for application in CQA for the non-SE domain<ref type="foot">4</ref>, we customize it for application in chats for software engineering and then compare it to the non-customized R-CNN in our evaluation. We chose to build on R-CNN because other answer extraction models <ref type="bibr">[43]</ref>-<ref type="bibr">[45]</ref> only model the semantic relevance between questions and answers. In contrast, R-CNN models the semantic links between successive candidate answers in a discussion thread, in addition to the semantic relevance between question and answer. Since developer chats often contain short and rapid exchanges of messages between participants, understanding the context of the discussion is crucial to determine the semantic relation between question and answer. Hence, we adapt R-CNN to extract the relevant answer(s) to an opinion-asking question based on the context of the discussion in a conversation.</p><p>Zhou et al. <ref type="bibr">[29]</ref> regarded the problem of answer selection as an answer sequence labeling task. First, they apply two convolutional neural networks (CNNs) to summarize the meaning of the question and a candidate answer, and then generate the joint representation of a Q&amp;A pair. The learned joint representation is then used as input to a long short-term memory (LSTM) network to learn the answer sequence of a question for labeling the matching quality of each answer.</p><p>To design ChatEO, we make the following adaptations to account for both SE domain content and the specific characteristics of software-related chats.
First, we preprocess the text, apply a software-specific word-embedding model, and use those embeddings as input to a CNN to learn the joint representation of a Q&amp;A pair. We use TextCNN <ref type="bibr">[46]</ref> since texts in chats (utterances) are much shorter than those in CQA (posts). The representations from the CNN are then passed as input to a Bidirectional LSTM (BiLSTM), instead of an LSTM, to improve prediction of the answers from a sequence of utterances in a conversation. We detail ChatEO answer extraction as follows:</p><p>1) Preprocessing: To help ChatEO with the semantics of the chat text, the textual content in the disentangled conversations is preprocessed. We replace URLs, user mentions, emojis, and code with the specific tokens 'url', 'username', 'emoji', and 'code', respectively. To handle the informal style of communication in chats, we use a manually curated set of common phrase expansions (e.g., "you've" to "you have"). We then convert the text to lowercase.</p><p>2) SE-customized Word Embeddings: Text in developer chats and other software development communication can differ from regular English text found in Wikipedia, news articles, etc. in terms of vocabulary and semantics. Hence, we trained custom GloVe vectors on the most recent Stack Overflow data dump (as of June, 2020) to more precisely capture word semantics in the context of developer communications. To train the GloVe vectors, we performed standard tokenization and preprocessing on each Stack Overflow post's title and text and trimmed extremely rarely occurring words (vocabulary minimum threshold of 100 posts; window size of 15 words). Our word embedding model thus consists of 123,995 words, where each word is represented by a 200-dimensional word vector. We applied this custom word embedding model to each word in each utterance of a conversation.</p><p>3) Convolutional Neural Networks: In natural language analysis tasks, a sentence is encoded before it is further processed in neural networks.
We leverage the sentence encoding technique from TextCNN <ref type="bibr">[46]</ref>. TextCNN, also used for other dialog analysis tasks <ref type="bibr">[47]</ref>, is a classical technique for sentence modeling which uses a shallow Convolutional Neural Network (CNN) that combines n-gram information to model a compact text representation. Since an utterance in a chat is typically short (&lt;25 words on average), we take each utterance as a sentence and apply word embedding, multiple convolution, and max-pooling operations.</p><p>The input for TextCNN is the distributed representation of an utterance, created by mapping each word index into its pre-trained embeddings. Each utterance is padded to the same length n with zero vectors. Let z_j &#8712; R^d denote the d-dimensional embedding for the j-th word in an utterance. Thus, an utterance of length n can be represented by z_{1:n} = z_1 &#8853; z_2 &#8853; ... &#8853; z_n, where &#8853; is the concatenation operator. To gather local information, convolution is achieved by applying a fixed-length sliding window (kernel) w^m &#8712; R^{h&#215;d} on each word position i, such that the n &#8722; h + 1 convolutional units in the m-th layer are generated by c^m_i = &#963;(w^m &#8226; z_{i:i+h} + b^m), i = 0, 1, ..., n &#8722; h + 1, where h is the size of the convolution kernel, &#963; is the activation function, and b^m is the bias factor for the m-th layer. The convolution layer is followed by a max-pooling layer, which can select the most effective information with the highest value. The flattened output vectors for each kernel after max-pooling are concatenated as the final output.</p><p>4) Bidirectional LSTM: The task of identifying answers in a conversation requires capturing the context and flow of information among the utterances inside a conversation. Hence, we use a Bidirectional Long Short-Term Memory (BiLSTM) network <ref type="bibr">[48]</ref>, where the utterances of a conversation are considered as sequential data.
The input to our BiLSTM is a sequence of utterance representations created by TextCNN. Variations of LSTM, widely used by researchers for answer extraction tasks <ref type="bibr">[30]</ref>, <ref type="bibr">[49]</ref>, are capable of modeling semantic links between contiguous text to perform answer sequence learning.</p><p>LSTM <ref type="bibr">[50]</ref> uses a gate mechanism to filter relevant information and capture long-term dependencies. An LSTM cell comprises an input gate (i), a forget gate (f), a cell state (c), and an output gate (o). The output of the LSTM at each time step, h_t, can be computed by the following equations: i_t = &#963;(W_i &#183; [h_{t&#8722;1}, x_t] + b_i), f_t = &#963;(W_f &#183; [h_{t&#8722;1}, x_t] + b_f), o_t = &#963;(W_o &#183; [h_{t&#8722;1}, x_t] + b_o), c&#771;_t = tanh(W_c &#183; [h_{t&#8722;1}, x_t] + b_c), c_t = f_t &#8857; c_{t&#8722;1} + i_t &#8857; c&#771;_t, h_t = o_t &#8857; tanh(c_t), where x_t is the t-th element in the input sequence; W is the weight matrix of the LSTM cells; b is the bias term; &#963; denotes the sigmoid activation function, tanh denotes the hyperbolic tangent activation function, and &#8857; denotes element-wise multiplication. A BiLSTM processes a sequence in two opposite directions (forward and backward), and generates two independent sequences of LSTM output vectors. Hence, the output of the BiLSTM at each time step is the concatenation of the two output vectors from both directions, h_t = [h&#8594;_t &#8853; h&#8592;_t], where h&#8594;_t and h&#8592;_t denote the outputs of the forward and backward LSTM layers, respectively, and &#8853; is the concatenation operator.</p></div>
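To make the gate mechanism concrete, a single LSTM time step can be written out for scalar inputs and states; the weights below are tiny illustrative values rather than learned parameters, and the function is our sketch, not the paper's implementation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step for scalar inputs/states, following the standard
    gate equations: sigmoid i/f/o gates, tanh candidate cell state,
    c_t = f*c_{t-1} + i*c~_t, and h_t = o*tanh(c_t)."""
    # Each gate sees the concatenation [h_{t-1}, x_t]; weights are
    # (weight on h_prev, weight on x_t) pairs here.
    z = {g: W[g][0] * h_prev + W[g][1] * x_t + b[g] for g in ("i", "f", "o", "c")}
    i_t = sigmoid(z["i"])          # input gate
    f_t = sigmoid(z["f"])          # forget gate
    o_t = sigmoid(z["o"])          # output gate
    c_tilde = math.tanh(z["c"])    # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t

# Tiny illustrative parameters and a short input sequence.
W = {"i": (0.1, 0.5), "f": (0.2, 0.4), "o": (0.3, 0.3), "c": (0.4, 0.2)}
b = {"i": 0.0, "f": 0.0, "o": 0.0, "c": 0.0}

h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.8]:
    h, c = lstm_step(x, h, c, W, b)
print(round(h, 4))
```

A BiLSTM simply runs two such recurrences, one over the sequence and one over its reversal, and concatenates the two outputs at each step.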
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. EVALUATION STUDY</head><p>We designed our evaluation to analyze the effectiveness of the pattern-based identification of opinion-asking questions (RQ1) and of our answer extraction technique (RQ2).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Metrics</head><p>We use measures that are widely used for evaluation in machine learning and classification. To analyze whether the automatically identified instances are indeed opinion-asking questions and their answers, we use precision, the ratio of true positives over the sum of true and false positives. To evaluate how often a technique fails to identify an opinionasking question or its answer, we use recall, the ratio of true positives over the sum of true positives and false negatives. F-measure combines these measures by harmonic mean.</p></div>
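These measures follow directly from raw prediction counts; a minimal sketch (the counts below are illustrative, not the paper's results):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F-measure (their harmonic mean) from counts of
    true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# e.g., 87 correctly identified opinion-asking questions, 13 false alarms,
# and 23 missed ones (illustrative numbers only).
p, r, f = precision_recall_f1(tp=87, fp=13, fn=23)
print(round(p, 2), round(r, 2), round(f, 2))
```

The zero-denominator guards matter in practice: a technique that predicts nothing has an undefined precision, which is conventionally reported as 0.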
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Evaluation Datasets</head><p>We established several requirements for dataset creation to reduce bias and threats to validity. To curate a representative analysis dataset, we identified chat groups that primarily discuss software development topics and had a substantial number of participants. To ensure the generalizability of our techniques, we chose two separate chat platforms, Slack and IRC, which are currently the most popular chat platforms used by software developers. We selected six popular programming communities with an active presence on Slack or IRC. We believe the communities are representative of public software-related chats in general; we observed that the structure and intent of conversations are similar across all 6 communities.</p><p>To collect conversations on Slack, we downloaded the developer chat conversations dataset released by Chatterjee et al. <ref type="bibr">[36]</ref>. To gather conversations on IRC, we scraped publicly available online chat logs <ref type="foot">5</ref>. After disentanglement, we discarded single-utterance conversations, and then created two separate evaluation datasets: one for opinion-asking question identification, and a subset containing only chats that start with an opinion-asking question for answer extraction. We created our evaluation datasets by randomly selecting a representative portion of the conversations from each of the six programming communities.</p><p>Table <ref type="table">III</ref> shows the characteristics of the collected chat logs and our evaluation datasets, where #OAConv gives the number of conversations that we identified as starting with an opinion-asking question using the heuristics described in Section III-B, per community.</p><p>The question identification evaluation dataset consists of a total of 400 conversations, 5153 utterances, and 489 users.
Our question identification technique is heuristics-based and does not require training data; thus, we randomly chose 400 of our 45K chat logs to form a reasonably sized human-constructed gold set.</p><p>The evaluation dataset for answer extraction consists of a total of 2001 conversations, 23,972 utterances, and 3160 users. Our machine-learning-based answer extraction requires conversations starting with a question, and a dataset large enough for training. Thus, 2001 conversations starting with opinion-asking questions were used.</p></div>
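The per-community random selection described above can be sketched as follows; the community names, quotas, and conversation identifiers are illustrative, not our actual counts:

```python
import random

def sample_per_community(conversations, quotas, seed=0):
    """Randomly draw a fixed quota of conversations from each community.

    conversations: list of (community, conversation_id) pairs
    quotas: dict mapping community name -> number of conversations to sample
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_community = {}
    for community, conv in conversations:
        by_community.setdefault(community, []).append(conv)
    sample = []
    for community, quota in quotas.items():
        pool = by_community.get(community, [])
        sample.extend(rng.sample(pool, min(quota, len(pool))))
    return sample

# Illustrative: draw 2 conversations each from two hypothetical communities.
convs = [("python", f"conv{i}") for i in range(10)] + \
        [("elm", f"conv{i}") for i in range(5)]
chosen = sample_per_community(convs, {"python": 2, "elm": 2})
```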
<div xmlns="http://www.tei-c.org/ns/1.0"><head>RQ1. How effective is ChatEO in identifying opinion-asking questions in developer chats?</head><p>Gold Set Creation: We recruited 2 human judges with 3+ years of experience in programming and in using both chat platforms (Slack and IRC), but no knowledge of our techniques. They were provided a set of conversations where each utterance includes: the unique conversation id, the anonymized name of the speaker, the utterance timestamp, and the utterance text. Using Di Sorbo et al.'s <ref type="bibr">[20]</ref> definition of opinion-asking questions, i.e., requiring someone to explicitly express his/her point of view about something (e.g., What do you think about creating a single webpage for all the services?), the human judges were asked to annotate only the utterances of the first speaker of each conversation with the value '1' for an opinion-asking question, or '0' otherwise.</p><p>The judges annotated a shared set of 400 conversations, in which they identified 69 instances, i.e., utterances of the first speaker of a conversation, containing opinion-asking questions (AngularJS: 10, C++: 10, OpenGL: 5, Python: 15, Clojurians: 12, Elm: 17). We computed Cohen's Kappa inter-rater agreement between the 2 judges, and found an agreement of 0.76, which is considered to be sufficient (&gt; 0.6) <ref type="bibr">[38]</ref>, while the sample size of 400 conversations is sufficient to compute the agreement measure with high confidence <ref type="bibr">[37]</ref>. The two judges then iteratively discussed and resolved all conflicts to create the final gold set. Comparison Techniques: Researchers have used sentiment analysis techniques <ref type="bibr">[16]</ref>, <ref type="bibr">[17]</ref> and lexical patterns <ref type="bibr">[20]</ref> to extract opinions from software-related documents. Thus, we selected two different types of approaches, i.e., pattern matching and sentiment analysis, as comparison techniques for ChatEO. 
We evaluated three well-known sentiment analysis techniques, SentiStrength-SE <ref type="bibr">[51]</ref>, CoreNLP <ref type="bibr">[52]</ref>, and NLTK <ref type="bibr">[53]</ref>, with their default settings. Since opinions could have positive/negative polarities, for the purpose of evaluation, we consider a leading question in a conversation identified with either positive or negative sentiment as opinion-asking. DECA <ref type="bibr">[20]</ref> is a pattern-based technique that uses Natural Language Parsing to classify the content of development emails according to their purpose. We used their tool to investigate the use of their linguistic patterns to identify opinion-asking questions in developer chats. We do not compare with Huang et al.'s CNN-based classifier of intention categories <ref type="bibr">[21]</ref>, since they merged "opinion asking" with the "information seeking" category.</p><p>Results: Table <ref type="table">IV</ref> presents precision, recall, and F-measure for ChatEO and the comparison techniques for opinion-asking question identification. With high precision, when ChatEO identifies an opinion-asking question, the chance of it being correct is higher than with the other techniques. We aim for higher precision (with possibly lower recall) in identifying opinion-asking questions, since that could potentially contribute to the next module of ChatEO, i.e., extracting answers to opinion-asking questions.</p><p>Some of the opinion-asking instances that ChatEO was not able to recognize lacked the recurrent linguistic patterns our heuristics rely on, such as "How does angular fit in with traditional MVC frameworks like .NET MVC and ruby on rails? Do people generally still use an MVC framework or just write a web api?". Some FNs also resulted from instances where the opinion-asking question was a continuation of a separate question, such as "Is there a canvas library where I can use getImageData to work with the typed array data? 
Or is this where I should use ports?".</p><p>We observe that the sentiment analysis tools show high recall at the expense of low precision, with the exception of SentiStrength-SE, which exhibits lower values for both precision and recall. The "Example FP" column in Table <ref type="table">IV</ref> indicates that sentiment analysis tools are often unable to catch the nuances of SE-specific keywords such as 'expose' and 'exception'. Another example, "What is the preferred way to distribute python programs?", which ChatEO is able to correctly identify as opinion-asking, is labeled as neutral by all the sentiment analysis tools. The same happens for the instance "how do I filter items in a list when displaying them with ngFor? Should I use a filter/pipe or should I use ngIf in the template?". ChatEO is able to recognize that this is asking for opinions on which path to take between two options, while the sentiment analysis tools classify it as neutral. Note that this only indicates that these tools are limited in the context of identifying opinion-asking questions; they could indeed be useful for other tasks (e.g., assessing developer emotions).</p><p>DECA <ref type="bibr">[20]</ref> identified only one instance as opinion-asking, which is a true positive, hence the precision of 1.00. Apart from this, it was not able to classify any other instance as opinion-asking, hence the low recall (0.01). On analyzing DECA's classification results, we observe that, out of the 69 instances in the gold set, it could not assign any intention category to 17 instances. This is possibly due to the informal communication style in chats, which is considerably different from that of emails. Since an utterance can contain more than one sentence, DECA often assigned multiple categories (e.g., information seeking, feature request, problem discovery) to an instance. The most frequent intention category observed was information seeking. 
During the development phase, we explored additional linguistic patterns, but they yielded more FPs. This is a limitation of using linguistic patterns, as they are brittle for words that carry different meanings in different contexts. ChatEO opinion-asking question identification significantly outperforms an existing pattern-based technique that was designed for emails <ref type="bibr">[20]</ref>, as well as sentiment analysis tools <ref type="bibr">[51]</ref>-<ref type="bibr">[53]</ref>, in terms of F-measure.</p></div>
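The Cohen's Kappa agreement used to validate the gold set above can be computed directly from the two judges' binary labels. A minimal sketch, with illustrative labels (1 = opinion-asking, 0 = not) rather than our actual annotations:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # p_o: observed proportion of items on which the annotators agree
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: agreement expected by chance, from each annotator's label frequencies
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])  # -> 0.5
```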
<div xmlns="http://www.tei-c.org/ns/1.0"><head>RQ2. How effective is ChatEO in identifying answer(s) to opinion-asking questions in a developer conversation?</head><p>Gold Set Creation: As in RQ1, we recruited 2 human judges with 3+ years of experience in programming and in using both chat platforms (Slack and IRC), but no knowledge of our techniques. The gold set creation for answer annotation was conducted in two phases, as follows:</p><p>&#8226; Phase-1 (Annotation): The human judges were provided a set of conversations with the following annotation instructions:</p><p>Mark each utterance in the conversation that provides information or advice (good or bad) that contributes to addressing the opinion-asking question in a way that is understandable/meaningful/interpretable when read as a standalone response to the marked opinion-asking question (i.e., the answer should provide information that is understandable without reading the entire conversation). Such utterances should not represent answer(s) to follow-up questions in a conversation. An answer to an opinion-asking question could also be a yes/no response. There could be more than one answer provided to the opinion-asking question in a conversation. &#8226; Phase-2 (Validation): The purpose of Phase-2 was two-fold:</p><p>(1) to measure the validity of the Phase-1 annotations, and (2) to evaluate whether an answer would match an opinion-asking question out of conversational context, such that the Q&amp;A pair could be useful as part of a Q&amp;A system. Therefore, for the Phase-2 annotations, we ensured that the annotators read only the provided question and answers, and not the entire conversations from which they were extracted. The Phase-1 annotations from the first annotator were used to generate a set of questions and answers, which were used for Phase-2 annotation by the second annotator, and vice versa. 
For each utterance provided as an answer to an opinion-asking question, the annotators were asked to indicate ("yes/no") whether the utterance represents an answer based on the guidelines in Phase-1. Additionally, if the annotation value was "no", the annotators were asked to state the reason.</p><p>The judges annotated a total of 2001 conversations, in which they identified a total of 2292 answers to opinion-asking questions (AngularJS: 1001, C++: 133, OpenGL: 263, Python: 165, Clojurians: 197, Elm: 533). We found that the first annotator considered 94.6% of the annotations of the second annotator as valid, while the second annotator considered 96.2% of the annotations of the first annotator as valid. We also noticed that the majority of disagreements were due to answer utterances containing incomplete or inadequate information to answer the marked opinion-asking question when removed from conversational context (e.g., "and then you can replace your calls to 'f' with 'logArgs2 f' without touching the function..."), and to human annotation errors, such as marking an utterance as an answer when it merely points to other channels.</p><p>Comparison Techniques: Since we believe this is the first effort to automatically extract answers to opinion-asking questions from developer chats, we chose to evaluate against heuristic-based and feature-based machine learning classification, as well as the original R-CNN deep learning-based technique on which we built ChatEO.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Heuristic-based (HE):</head><p>Intuitively, the answer to an opinion-asking question might be found based on its location in the conversation, the relation between its content and the question, or the presence of sentiment in the answer. We investigated each of these possibilities separately and in combination.</p><p>&#8226; Location: Based on the intuition that a question might be answered immediately after it is asked during a conversation, we compare against the naive approach of identifying the next utterance after the leading opinion-asking question as an answer. &#8226; Content: Traditional Q&amp;A systems have often aimed to extract answers based on semantic matching between questions and answers <ref type="bibr">[43]</ref>, <ref type="bibr">[54]</ref>. Thus, to model the content-based semantic relation between question and answer, we compare the average word embeddings of the question and answer texts. Using our word embedding model described in Section III-C2, we extract utterances with considerable similarity (&#8805; 0.5) to the opinion-asking question as answers. &#8226; Sentiment: Previous researchers have leveraged sentiment analysis to extract relevant answers in non-factoid Q&amp;A systems <ref type="bibr">[55]</ref>-<ref type="bibr">[58]</ref>. Thus, based on the intuition that the answer to an opinion-asking question might exhibit sentiment, we use CoreNLP <ref type="bibr">[52]</ref> to extract utterances bearing sentiment (positive or negative) as answers. We explored other sentiment analysis tools (e.g., SentiStrength-SE <ref type="bibr">[51]</ref>, NLTK <ref type="bibr">[53]</ref>); however, we do not discuss them, since they yielded inferior results.</p><p>Machine Learning-based (ML): We combine the location, content, and sentiment attributes as features of a machine learning (ML)-based classifier. 
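The content heuristic above can be sketched as follows; the toy 2-dimensional embeddings are illustrative stand-ins for our word embedding model, and the threshold of 0.5 matches the setting described above:

```python
import math

def avg_embedding(text, emb):
    """Average the vectors of in-vocabulary tokens; None if none are found."""
    vecs = [emb[t] for t in text.lower().split() if t in emb]
    if not vecs:
        return None
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(len(vecs[0]))]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def content_heuristic(question, utterances, emb, threshold=0.5):
    """Mark utterances whose average embedding is similar to the question's."""
    q = avg_embedding(question, emb)
    if q is None:
        return []
    answers = []
    for u in utterances:
        v = avg_embedding(u, emb)
        if v is not None and cosine(q, v) >= threshold:
            answers.append(u)
    return answers

# Toy embeddings for illustration only.
emb = {"json": [1.0, 0.0], "library": [0.9, 0.1], "weather": [0.0, 1.0]}
answers = content_heuristic(
    "best json library", ["try the json library", "weather is nice"], emb)
# -> ["try the json library"]
```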
We explored several popular ML algorithms (e.g., Support Vector Machines (SVM), Random Forest) using the Weka toolkit <ref type="bibr">[59]</ref>, and observed that they yielded very similar results. We report the results for SVM.</p><p>Deep Learning-based (DL): We present the results for both R-CNN and ChatEO, implemented as follows.</p><p>&#8226; RCNN: We implemented R-CNN <ref type="bibr">[29]</ref> for developer chats using the open-source neural-network library Keras <ref type="bibr">[60]</ref>. R-CNN used word embeddings pre-trained on their corpus. Similarly, we trained custom GloVe vectors on our chat corpus for our comparison. &#8226; ChatEO: We also implemented ChatEO using Keras <ref type="bibr">[60]</ref>.</p><p>We used grid search <ref type="bibr">[61]</ref> to perform hyper-parameter tuning. First, to obtain sufficient semantic information at the utterance level, we use three convolution filters of size 2, 3, and 4, with 50 (twice the average length of an utterance) feature maps for each filter. The pool sizes of the convolutions are (2,1), (2,1), and (3,1), respectively. Then, a BiLSTM layer with 400 units (200 for each direction) is used to capture the contextual information in a conversation. Finally, we use a linear layer with a sigmoid activation function to predict the probability scores of the binary classes (answer and non-answer). We use binary cross-entropy as the loss function, and the Adam optimization algorithm for gradient descent.</p><p>To avoid over-fitting, we apply a dropout <ref type="bibr">[62]</ref> of 0.5 on the TextCNN embeddings, i.e., 50% of units will be randomly omitted to prevent complex co-adaptations on the training data. From the conversations described in Table <ref type="table">III</ref>, we created a test set of 400 conversations (adhering to the commonly recognized train-test ratio of 80-20%). 
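To make the architecture description above concrete, the following Keras sketch wires together a TextCNN utterance encoder (filters of size 2, 3, 4 with 50 feature maps each) and a BiLSTM over the conversation; the padding sizes and embedding dimension are illustrative assumptions, and the pooling is simplified to 1-D relative to the (2,1)/(3,1) settings above:

```python
from tensorflow.keras import layers, models

MAX_UTTER, MAX_WORDS, EMB_DIM = 30, 25, 200  # assumed padding sizes / GloVe dim

def utterance_encoder():
    """TextCNN over one utterance: filters of size 2, 3, 4 with 50 maps each."""
    words = layers.Input(shape=(MAX_WORDS, EMB_DIM))
    branches = []
    for size, pool in [(2, 2), (3, 2), (4, 3)]:
        c = layers.Conv1D(50, size, activation="relu")(words)
        c = layers.MaxPooling1D(pool)(c)
        branches.append(layers.GlobalMaxPooling1D()(c))
    return models.Model(words, layers.Concatenate()(branches))

def build_chateo():
    conv = layers.Input(shape=(MAX_UTTER, MAX_WORDS, EMB_DIM))
    utts = layers.TimeDistributed(utterance_encoder())(conv)  # utterance embeddings
    utts = layers.Dropout(0.5)(utts)                          # 50% dropout on TextCNN output
    ctx = layers.Bidirectional(                               # 400 units total (200 each way)
        layers.LSTM(200, return_sequences=True))(utts)
    probs = layers.TimeDistributed(                           # per-utterance answer probability
        layers.Dense(1, activation="sigmoid"))(ctx)
    model = models.Model(conv, probs)
    model.compile(loss="binary_crossentropy", optimizer="adam")
    return model
```

The model maps a padded conversation tensor to one answer probability per utterance, which is how the answer/non-answer decision described above is framed.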
We ensured that it contains a similar number of instances from each programming community: 140 (70*2) conversations from AngularJS and Python, and 260 (65*4) conversations from C++, OpenGL, Clojurians, and Elm. Results: Table <ref type="table">V</ref> presents precision (P), recall (R), and F-measure (F) for ChatEO and our comparison techniques for automatic answer extraction on our held-out test set of 400 conversations, which contains a total of 499 answers. The best results for P, R, and F across all techniques are highlighted in bold.</p><p>Overall, Table <ref type="table">V</ref> shows that ChatEO achieves the highest overall precision, recall, and F-measure of all techniques, at 0.77, 0.59, and 0.67, respectively. Overall, ChatEO identified 370 answers in the test set, out of which 285 are true positives. R-CNN and the ML-based classifier perform next best, and similarly to each other in precision and F-measure. The better performance of ChatEO suggests that capturing contextual information through a BiLSTM can benefit answer extraction in chat messages, and that using a domain-specific word embedding model (trained on software-related texts), accompanied by hyper-parameter tuning, is essential to adapting deep learning models for software-related applications. Figure <ref type="figure">2</ref> shows that the performance of ChatEO is consistent (76-81% precision) across all communities except Clojurians. One possible reason is that answers in this community are often provided in the form of code snippets with little to no natural language text, which makes it difficult for ChatEO to model the semantic links.</p><p>ChatEO's overall recall of 0.59 indicates that it is difficult to identify all relevant answers in chats, even with complex dialog modeling. In fact, the recall of ChatEO is lower than that of the heuristic-based location technique for C++ and OpenGL. 
Upon deeper analysis, we observed that these two communities contain fewer answers (one answer per opinion-asking question on average) compared to the other communities, and the answer often resides in the utterance occurring immediately after the first speaker's.</p><p>The location heuristic exhibits significantly better performance than the content or sentiment heuristics. Of the 400 conversations, 278 have at least one answer occurring immediately after the utterances of the first speaker. Neither content nor sentiment is a strong indicator of answers, each with a precision of 0.07 and with F-measures of 0.09 and 0.11, respectively. These heuristics cannot distinguish the intent of a response. Consider, "Q: Hi, Clojure intermediate here. What is the best way to read big endian binary file?; R: do you need to make something that parses the binary contents, does it suffice to just get the bytes in an array?". Both content and sentiment marked this response as an answer, without being able to recognize that it is a follow-up question.</p><p>Combining the heuristics as features for SVM improved slightly over the location heuristic, with a .06 increase in precision and a .02 increase in F-measure. As expected, location as a feature shows the highest information gain. We investigated several classifier parameters (e.g., kernel and regularization parameters), but observed that the classification task was not very sensitive to parameter choices, as they had little discernible effect on the effectiveness metrics (in most cases &#8804; 0.01). Since our dataset is imbalanced, with a considerably low ratio of answers to other utterances, we explored over-sampling (SMOTE) techniques. No significant improvements occurred. 
ML-based classification may be improved with more features and feature engineering.</p><p>ChatEO answer extraction shows improvement over heuristics-based, ML-based, and existing deep learning-based <ref type="bibr">[29]</ref> techniques in terms of precision, recall, and F-measure.</p></div>
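The feature-based ML comparison described above combines the three heuristic signals per utterance. A hypothetical scikit-learn sketch (our experiments used the Weka toolkit; the feature vectors below are fabricated for illustration):

```python
from sklearn.svm import SVC

# One row per utterance:
# [is_next_after_question, embedding_similarity_to_question, bears_sentiment]
X = [
    [1, 0.62, 1], [1, 0.55, 0], [0, 0.58, 1],   # labeled answers
    [0, 0.12, 0], [0, 0.20, 1], [1, 0.15, 0],   # labeled non-answers
]
y = [1, 1, 1, 0, 0, 0]

# RBF-kernel SVM; an utterance is extracted as an answer when predicted 1.
clf = SVC(kernel="rbf").fit(X, y)
```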
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. THREATS TO VALIDITY</head><p>Construct Validity: A threat to construct validity might arise from the manual annotations used to create the gold sets. To limit this threat, we ensured that our annotators have considerable experience in programming and in using both chat platforms, and that they followed a consistent annotation procedure piloted in advance. We also observed a high Cohen's Kappa coefficient, which measures inter-rater agreement, for the opinion-asking questions. For the answer annotations, we conducted a two-phase annotation procedure to ensure the validity of the selected answers. Internal Validity: Errors in the automatically disentangled conversations could pose a threat to internal validity by causing misclassification. We mitigated this threat by having humans, without knowledge of our techniques, manually discard poorly disentangled conversations from our dataset. In all stages of the ChatEO pipeline, we aimed for higher precision over recall, as the quality of information is more important than missing instances; chat datasets are large, with many opinions, so our achieved recall is sufficient to extract a significant number of opinion Q&amp;A pairs. Other potential threats could be related to evaluation bias or errors in our scripts. To reduce these threats, we ensured that the instances in our development set do not overlap with our train or test sets. We also wrote unit tests and performed code reviews. External Validity: To ensure the generalizability of our approach, we selected the subjects of our study from the two most popular software developer chat communities, Slack and IRC. We selected statistically representative samples from six active communities, which represent a broad set of topics related to each programming language. However, our study's results may not transfer to other chat platforms or developer communications. Scaling to larger datasets might also lead to different evaluation results. 
Our technique of identifying opinion-asking questions could be made more generalizable by augmenting the set of identified patterns and vocabulary terms.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. RELATED WORK</head><p>Mining Opinions in SE. In addition to the related work discussed in Section II, significant work has focused on mining opinions from developer forums. Uddin and Khomh <ref type="bibr">[16]</ref> designed Opiner, which uses keyword-matching along with a customized Sentiment Orientation algorithm to summarize API reviews. Lin et al. <ref type="bibr">[17]</ref> used patterns to identify and classify opinions on APIs from Stack Overflow. Zhang et al. <ref type="bibr">[64]</ref> identify negative opinions about APIs in forums. Huang et al. <ref type="bibr">[65]</ref> proposed an automatic approach to distill and aggregate comparative opinions of comparable technologies from Q&amp;A websites. Ren et al. <ref type="bibr">[66]</ref> discovered and summarized controversial (criticized) answers in Stack Overflow, based on judgment, sentiment, and opinion. Novielli et al. <ref type="bibr">[18]</ref>, <ref type="bibr">[67]</ref> investigated the role of the affective lexicon in questions posted on Stack Overflow.</p><p>Researchers have also analyzed opinions in developer emails, commit logs, and app reviews. Xiong et al. <ref type="bibr">[68]</ref> studied assumptions in OSS development mailing lists. Sinha et al. <ref type="bibr">[19]</ref> analyzed developer sentiment in GitHub commit logs. Opinions in app reviews <ref type="bibr">[22]</ref>, <ref type="bibr">[23]</ref>, <ref type="bibr">[34]</ref>, <ref type="bibr">[69]</ref> have been mined to help app developers gather information about user requirements, ideas for improvements, and user sentiments about specific features. To the best of our knowledge, ours is the first work to extract opinion Q&amp;A from developer chats.</p><p>Extracting Q&amp;A from Online Communications. Outside the SE domain, researchers have proposed techniques to identify Q&amp;A pairs in online communications (e.g., Yahoo Answers). Shrestha et al. 
<ref type="bibr">[70]</ref> used machine learning approaches to automatically detect Q&amp;A pairs in emails. Cong et al. <ref type="bibr">[71]</ref> detected Q&amp;A pairs from forum threads by using Sequential Pattern Mining to detect questions, and a graph-based propagation method to detect answers in the same thread.</p><p>Recently, researchers have focused on answer selection, a major subtask of Q&amp;A extraction, which aims to select the most relevant answers from a candidate answer set. Typical approaches for answer selection model the semantic matching between questions and answers <ref type="bibr">[31]</ref>, <ref type="bibr">[43]</ref>- <ref type="bibr">[45]</ref>. These approaches have the advantage of sharing parameters, thus making the model smaller and easier to train. However, they often fail to capture the semantic correlations embedded in the response sequence of a question. To overcome such drawbacks, Zhou et al. <ref type="bibr">[29]</ref> designed a recurrent architecture that models the semantic relations between successive responses, as well as the question and answer. Xiang et al. <ref type="bibr">[49]</ref> investigated an attention mechanism and context modeling to aid the learning of deterministic information for answer selection. Wang et al. <ref type="bibr">[30]</ref> proposed a bilateral multi-perspective matching model in which Q&amp;A pairs are matched on multiple levels of granularity at each time-step. Our model follows this same line of work, capturing the contextual information of conversations when extracting answers from developer chats.</p><p>Most of these techniques for Q&amp;A extraction were designed for general online communications and not specifically for software forums. Gottipati et al. <ref type="bibr">[72]</ref> used Hidden Markov Models to infer semantic tags (e.g., question, answer, clarifying question) of posts in software forum threads. Hen&#223; et al. 
<ref type="bibr">[73]</ref> used topic modeling and text similarity measures to automatically extract FAQs from software development discussions (mailing lists, online forums).</p><p>Analyzing Developer Chats. Wood et al. <ref type="bibr">[2]</ref> created a supervised classifier to automatically detect speech acts in developer Q&amp;A bug repair conversations. Shi et al. <ref type="bibr">[47]</ref> used a deep Siamese network to identify feature requests in chat conversations. Alkadhi et al. <ref type="bibr">[74]</ref>, <ref type="bibr">[75]</ref> showed that machine learning can be leveraged to detect rationale in IRC messages. Chowdhury and Hindle <ref type="bibr">[76]</ref> exploited Stack Overflow discussions and YouTube video comments to automatically filter off-topic discussions in IRC chats. Romero et al. <ref type="bibr">[8]</ref> developed a chatbot that detects a troubleshooting question asked on Gitter and provides possible answers retrieved by querying similar Stack Overflow posts. In contrast to these works, we automatically identify opinion-asking questions and the answers provided by developers in chat forums.</p><p>Chatterjee et al.'s <ref type="bibr">[9]</ref> exploratory study on Slack developer chats suggested that developers share opinions and interesting insights on APIs, programming tools, and best practices via conversations. Other studies have focused on learning developer behaviors and how chat communities are used by development teams across the globe <ref type="bibr">[6]</ref>, <ref type="bibr">[7]</ref>, <ref type="bibr">[77]</ref>- <ref type="bibr">[81]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VII. CONCLUSIONS AND FUTURE WORK</head><p>In this paper, we present and evaluate ChatEO, which automatically identifies opinion-asking questions from chats using a pattern-based technique, and extracts participants' answers using a deep learning-based architecture. This research provides a significant contribution to using software developers' public chat forums for building opinion Q&amp;A systems, a specialized instance of virtual assistants and chatbots for software engineers. ChatEO opinion-asking question identification significantly outperforms existing sentiment analysis tools and a pattern-based technique that was designed for emails <ref type="bibr">[20]</ref>. ChatEO answer extraction shows improvement over heuristics-based, ML-based, and an existing deep learning-based technique designed for CQA <ref type="bibr">[29]</ref>. Our replication package can be used to verify our results <ref type="bibr">[url]</ref>.</p><p>Our immediate next steps focus on investigating machine learning-based techniques for opinion-asking question identification, and an attention-based LSTM network for answer extraction. We will also expand to a larger and more diverse developer chat communication dataset. The Q&amp;A pairs extracted using ChatEO could also be leveraged to generate FAQs, to provide tool support for recommendation systems, and to aid understanding of developer behavior and collaboration (asking and sharing opinions).</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>Replication Package: https://tinyurl.com/y3qth6s3</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>https://www.zenodo.org/record/3627124</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2"><p>http://alt.qcri.org/semeval2015/task3/</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_3"><p>https://echelog.com/</p></note>
		</body>
		</text>
</TEI>
