<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Analyzing Student and Instructor Comments using NLP</title></titleStmt>
			<publicationStmt>
				<publisher>ACM</publisher>
				<date>03/14/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10560542</idno>
					<idno type="doi">10.1145/3626253.3635593</idno>
					
					<author>Zack Butler</author><author>Ivona Bezáková</author><author>Shaoxuan Xu</author><author>Angelina Brilliantova</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We report on our experience using common natural language processing (NLP) tools to analyze two vastly different data sets of free-form responses collected during a study of assignments in introductory computing courses. Our first data set consists of typically short comments left by hundreds of students on assignment surveys. Our second data set consists of semi-structured individual interviews with eight instructors, each up to an hour long. We collected the data across several years as part of our investigation of the use of pencil puzzles as a context for introductory computer science. In an earlier work, we manually analyzed a fraction of the student comments (all data collected until that point), using grounded theory. The results were illuminating, but the process was very time-consuming, consisting of manual assignment of a small number of codes to each comment. In this work, we investigate the usability of common NLP tools to speed up the process for the entire data set of student comments. We also applied these tools to the instructor interviews. The NLP tools do not appear to be effective for creating the code base, but, once the code base was determined, they performed the actual coding (assignment of codes to each student comment) promisingly well. For the long-form instructor interviews, the situation was much more challenging, due to the wide-ranging nature of semi-structured interviews, interleaving discussion topics, and elements of natural speech. We report on the lessons learned while automatically analyzing these complex data sets.
CCS CONCEPTS • Social and professional topics → Computing education.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION AND BACKGROUND</head><p>Grounded theory <ref type="bibr">[3]</ref> is a well-established technique for qualitative analysis of free-form responses. It starts with the creation of a code base: a small set of strings capturing the most common themes in the data set. Then, a small number of codes are associated with each comment or a part of the free-form text (for example, a paragraph or a sentence in an interview). Once the free-form responses are coded, standard statistical approaches can be applied to study the occurrences and co-occurrences of individual codes, determining correlations between the themes and sentiments in the responses.</p><p>The manual creation of the code base (often done using triangulation, in which multiple researchers compare their proposed code bases and arrive at a mutually agreeable compromise), and, even more so, the manual assignment of codes to individual comments, paragraphs, or sentences, is very time-consuming. We investigate the use of common NLP tools to help automate the process.</p><p>We collected our data set over several years and across several institutions. We studied the use of pencil puzzles (puzzles such as sudoku that are designed to be solved by people, whether with a pencil or an app) as a context for introductory CS assignments. We initially collected data (student surveys including Likert-scale questions and open-ended comment boxes, plus grade data) at our home institution. We analyzed the Likert-scale data, and found that students reacted positively to these assignments, and that their experience was independent of their prior exposure to computing or of their gender <ref type="bibr">[1]</ref>. We also performed qualitative analysis using grounded theory on the open-ended comments <ref type="bibr">[2]</ref>. 
Notably, this work required manual coding of over 1000 individual comments.</p><p>As a follow-up to the initial study, we recruited 10 collaborating instructors (at institutions ranging from R1 schools to small liberal-arts schools to an international campus of our own university), who delivered puzzle-based assignments in 14 different course sections and allowed us to survey their students. Throughout this second study, we also conducted semi-structured interviews with the collaborating instructors and some of their teaching assistants, reflecting on how they valued the use of puzzle-based assignments and their adoption into their courses. We used grounded theory to manually code the interviews and the student comments in this larger data set. Recent advances in large language models may allow some of this coding and analysis to be automated, and since we have the human coding available for comparison, we find ourselves in a strong position to evaluate the efficacy of such approaches.</p><note>SIGCSE 2024, March 20-23, 2024, Portland, OR, USA</note></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">RELATED WORK</head><p>Natural language processing (NLP) research has achieved admirable progress in recent years. However, the use of these tools in the context of education research has been limited thus far. A survey from 2021 <ref type="bibr">[7]</ref> summarizes studies using NLP to analyze the sentiments of students' feedback. The authors identified almost a hundred publications, though only a fraction use student feedback in the form of survey responses (other sources of student feedback included blog posts, forums, and feedback provided on online education and research platforms). Almost all of this work examines only whether students' comments are generally positive or negative, but we wish to examine whether similar techniques can distill the multitude of thoughts given at the end of a survey. In addition to sentiment analysis, we found papers discussing automated approaches to obtain various statistical analyses such as identifying important keywords and correlations between them, e.g., <ref type="bibr">[5,</ref><ref type="bibr">8]</ref>, but we have not found publications aiming for (semi-)automated and more detailed analysis of student feedback.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">METHODOLOGY</head><p>The various themes expressed in student comments in our initial data set included the puzzle concept, the student's learning experience, the details of the assignments, and the overall course. In each of these areas, there were many comments both positive and negative, and our codes reflected the combination of a theme and the accompanying sentiment. Once the codebook was developed, each comment (from both the initial study and the follow-up study) was assigned between one and three codes based on its content. For the interviews in the follow-up study, we used a similar approach, though here the coding is more free-form, as large parts of the interview may be irrelevant to the study topic and remain uncoded, while some codes may apply to a long section of the interview.</p><p>To perform supervised learning of comment codes, rather than developing new models, we chose to investigate how well common existing language models would be able to learn the coding. We initially chose the BERT model <ref type="bibr">[4]</ref> as a well-known and effective deep-network model for general language understanding tasks, but also looked into less computationally intensive models such as FastText <ref type="bibr">[6]</ref>. For the interview data, we also used BERT, but this data required significant pre-processing to turn the natural human conversation into a more regular form. We also used various data augmentation techniques on both data sets to increase the size and breadth of the training data.</p><p>Finally, we investigated clustering techniques on the survey comments to see if they could generate the initial set of comment codes. Here again we chose to explore a common technique, Latent Dirichlet Allocation (LDA), which is often used for topic modelling across a document corpus. 
LDA uses a probabilistic model that describes documents as mixtures of topics, where each topic is represented as a distribution over all words. The algorithm then uses an inference process to determine both the most likely distribution of topics within each document and the most likely distribution of words within each topic.</p></div>
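<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a rough illustration of the model described at the end of the Methodology section, the generative process that LDA assumes can be written as follows. The notation is the conventional one and is an assumption made here for illustration, not something fixed by this study: d indexes documents (here, comments), n indexes word positions within a document, k indexes topics, and alpha and beta are Dirichlet hyperparameters.</p>
<p><formula notation="TeX">\begin{gathered}
\theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad \phi_k \sim \mathrm{Dirichlet}(\beta), \\
z_{d,n} \mid \theta_d \sim \mathrm{Categorical}(\theta_d), \qquad
w_{d,n} \mid z_{d,n}, \phi \sim \mathrm{Categorical}(\phi_{z_{d,n}}).
\end{gathered}</formula></p>
<p>Inference then estimates the per-document topic proportions (theta) and the per-topic word distributions (phi) from the observed words alone.</p></div>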
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">RESULTS</head><p>For the comment analysis, since each comment has up to three correct codes (as determined by the manual coding), instead of a binary accuracy measurement per comment, we measured success with precision and recall metrics. BERT was able to achieve a precision of 0.660 and recall of 0.750, and manual investigation showed that many "errors" assigned codes similar to the correct ones. (This often happens also with human coding, when different researchers pick up different nuances of the text. The final coding is then the result of a discussion between the researchers.) On the other hand, FastText was able to achieve only 0.53 precision and recall.</p><p>For the interviews, BERT performed extremely well at determining which parts of the interviews were relevant and which were not. Including the "uncoded" label, the precision of the model was 0.976 and recall was 0.966. However, if we look only at the portions of the interviews which had codes assigned in the ground truth, performance was lower, though still somewhat reasonable, with a precision of 0.410 and recall of 0.667. As such, it appears that NLP techniques can be a meaningful time saver at least in terms of locating the most relevant portions of these long interviews.</p><p>On the other hand, clustering techniques appear unable to generate a meaningful set of codes for these survey comments. When we manually examined the comments grouped together in each cluster, there appeared to be little consistency among them. We also compared the manually assigned codes within each cluster, and likewise found that there was little correspondence between them. In short, it does not appear that this type of analysis is (yet) suitable for topic discovery amongst collections of short comments.</p></div>		</body>
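		<back>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The precision and recall figures in the Results compare, for each comment, the set of predicted codes against the set of manually assigned codes. A minimal sketch of one way to compute such scores is given here. It assumes micro-averaging over all comments (the averaging scheme is not stated in the text), and the function name and code labels are hypothetical, invented for illustration rather than taken from the study's codebook.</p>
<eg><![CDATA[
```python
from typing import List, Set, Tuple


def micro_precision_recall(
    gold: List[Set[str]], predicted: List[Set[str]]
) -> Tuple[float, float]:
    """Micro-averaged precision and recall over per-comment code sets.

    gold[i] holds the manually assigned codes for comment i and
    predicted[i] holds the model's codes for the same comment.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    # A predicted code counts as correct if the human coders also assigned it.
    true_pos = sum(len(g & p) for g, p in zip(gold, predicted))
    n_predicted = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)

    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    return precision, recall


# Hypothetical labels (assumption): theme plus sentiment, 1-3 codes per comment.
gold = [
    {"puzzle_positive"},
    {"assignment_negative", "learning_positive"},
    {"course_positive"},
]
predicted = [
    {"puzzle_positive", "learning_positive"},
    {"assignment_negative"},
    {"course_positive", "course_negative"},
]

precision, recall = micro_precision_recall(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f}")
```
]]></eg>
<p>On these toy inputs, three of the five predicted codes are correct (precision 0.600) and three of the four manually assigned codes are recovered (recall 0.750). Scoring sets of codes in this way gives partial credit when only some of a comment's codes are predicted correctly, which a per-comment binary accuracy would not.</p></div>
		</back>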
		</text>
</TEI>
