Machine learning (ML)-based analysis of electroencephalograms (EEGs) is playing an important role in advancing neurological care. However, the difficulties in automatically extracting useful metadata from clinical records hinder the development of large-scale EEG-based ML models. EEG reports, which are the primary sources of metadata for EEG studies, suffer from lack of standardization. Here we propose a machine learning-based system that automatically extracts attributes detailed in the SCORE specification from unstructured, natural-language EEG reports. Specifically, our system, which jointly utilizes deep learning- and rule-based methods, identifies (1) the type of seizure observed in the recording, per physician impression; (2) whether the patient was diagnosed with epilepsy or not; (3) whether the EEG recording was normal or abnormal according to physician impression. We performed an evaluation of our system using the publicly available Temple University EEG corpus and report F1 scores of 0.93, 0.82, and 0.97 for the respective tasks.
Automatic coding of students' writing via Contrastive Representation Learning in the Wasserstein space
Qualitative analysis of verbal data is of central importance in the learning sciences. It is labor-intensive and time-consuming, however, which limits the amount of data researchers can include in studies. This work is a step towards building a statistical machine learning (ML) method that provides automated support for qualitative analyses of students' writing, here specifically to score laboratory reports in introductory biology for sophistication of argumentation and reasoning. We start with a set of lab reports from an undergraduate biology course, scored by a four-level scheme that considers the complexity of argument structure, the scope of evidence, and the care and nuance of conclusions. Using this set of labeled data, we show that a popular natural language processing pipeline, namely vector representation of words, a.k.a. word embeddings, followed by a Long Short-Term Memory (LSTM) model for capturing language generation as a state-space model, is able to quantitatively capture the scoring, with a high Quadratic Weighted Kappa (QWK) prediction score, when trained via a novel contrastive learning set-up. We show that the ML algorithm approached the inter-rater reliability of human analysis. Ultimately, we conclude that machine learning (ML) for natural language processing (NLP) holds promise for assisting learning sciences researchers in conducting qualitative studies at much larger scales than is currently possible.
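For readers unfamiliar with the agreement metric named in the abstract, the following is a minimal sketch of the standard Quadratic Weighted Kappa computation between two sets of ordinal scores. It is not the authors' code: the function name, the 0-based integer encoding of the four score levels, and the convention of returning 1.0 when expected disagreement is zero are all assumptions made for illustration.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, num_levels):
    """Standard QWK between two equal-length lists of integer scores.

    Scores are assumed (for this sketch) to be integers in
    0 .. num_levels - 1, and num_levels is assumed to be at least 2.
    """
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("score lists must be non-empty and equal in length")

    observed = Counter(zip(rater_a, rater_b))  # joint counts of (a, b) pairs
    hist_a = Counter(rater_a)                  # marginal counts for rater A
    hist_b = Counter(rater_b)                  # marginal counts for rater B
    max_dist = (num_levels - 1) ** 2

    disagreement_observed = 0.0
    disagreement_expected = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            # Quadratic penalty: grows with the squared distance between levels.
            weight = (i - j) ** 2 / max_dist
            disagreement_observed += weight * observed[(i, j)] / n
            # Expected joint frequency if the two raters were independent.
            disagreement_expected += weight * hist_a[i] * hist_b[j] / (n * n)

    if disagreement_expected == 0.0:
        # Both raters used one identical level throughout; kappa is undefined,
        # so this sketch returns 1.0 by convention (an assumption).
        return 1.0
    return 1.0 - disagreement_observed / disagreement_expected


# Hypothetical scores on a four-level scheme, encoded 0..3.
human_scores = [0, 1, 2, 3, 3, 2]
model_scores = [0, 1, 2, 3, 2, 2]
kappa = quadratic_weighted_kappa(human_scores, model_scores, num_levels=4)
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement. Because the penalty is quadratic, predictions that miss by one level are penalized far less than those that miss by three.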
- PAR ID: 10288825
- Date Published:
- Journal Name: ArXivorg
- ISSN: 2331-8422
- Page Range / eLocation ID: https://arxiv.org/abs/2011.13384
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Education researchers have proposed that qualitative and emerging computational machine learning (ML) approaches can be productively combined to advance analyses of student-generated artifacts for evidence of engagement in scientific practices. We applied such a combined approach to written arguments excerpted from university students’ biology laboratory reports. These texts are lengthy and contain multiple different features that could be attended to in analysis. We present two outcomes of this combined analysis that illustrate possible affordances of combined workflows: 1) Comparing ML and human-generated scores allowed us to identify and reanalyze mismatches, increasing our overall confidence in the coding; and 2) ML-identified word clusters allowed us to interpret the overlap in meaning between the original coding scheme and the ML predicted scores, providing insight into which features of students’ writing can be used to differentiate rote from more meaningful engagement in scientific argumentation.
-
Abstract Machine learning (ML) provides a powerful framework for the analysis of high‐dimensional datasets by modelling complex relationships, often encountered in modern data with many variables, cases and potentially non‐linear effects. The impact of ML methods on research and practical applications in the educational sciences is still limited, but continuously grows, as larger and more complex datasets become available through massive open online courses (MOOCs) and large‐scale investigations. The educational sciences are at a crucial pivot point, because of the anticipated impact ML methods hold for the field. To provide educational researchers with an elaborate introduction to the topic, we provide an instructional summary of the opportunities and challenges of ML for the educational sciences, show how a look at related disciplines can help learning from their experiences, and argue for a philosophical shift in model evaluation. We demonstrate how the overall quality of data analysis in educational research can benefit from these methods and show how ML can play a decisive role in the validation of empirical models. Specifically, we (1) provide an overview of the types of data suitable for ML and (2) give practical advice for the application of ML methods. In each section, we provide analytical examples and reproducible R code. Also, we provide an extensive Appendix on ML‐based applications for education. This instructional summary will help educational scientists and practitioners to prepare for the promises and threats that come with the shift towards digitisation and large‐scale assessment in education.
Context and implications
Rationale for this study
In 2020, the worldwide SARS‐COV‐2 pandemic forced the educational sciences to perform a rapid paradigm shift with classrooms going online around the world—a hardly novel but now strongly catalysed development.
In the context of data‐driven education, this paper demonstrates that the widespread adoption of machine learning techniques is central for the educational sciences and shows how these methods will become crucial tools in the collection and analysis of data and in concrete educational applications. Helping to leverage the opportunities and to avoid the common pitfalls of machine learning, this paper provides educators with the theoretical, conceptual and practical essentials.
Why the new findings matter
The process of teaching and learning is complex, multifaceted and dynamic. This paper contributes a seminal resource to highlight the digitisation of the educational sciences by demonstrating how new machine learning methods can be effectively and reliably used in research, education and practical application.
Implications for educational researchers and policy makers
The progressing digitisation of societies around the globe and the impact of the SARS‐COV‐2 pandemic have highlighted the vulnerabilities and shortcomings of educational systems. These developments have shown the necessity to provide effective educational processes that can support sometimes overwhelmed teachers to digitally impart knowledge on the plan of many governments and policy makers. Educational scientists, corporate partners and stakeholders can make use of machine learning techniques to develop advanced, scalable educational processes that account for individual needs of learners and that can complement and support existing learning infrastructure. The proper use of machine learning methods can contribute essential applications to the educational sciences, such as (semi‐)automated assessments, algorithmic‐grading, personalised feedback and adaptive learning approaches.
However, these promises are strongly tied to an at least basic understanding of the concepts of machine learning and a degree of data literacy, which has to become the standard in education and the educational sciences. Demonstrating both the promises and the challenges that are inherent to the collection and the analysis of large educational data with machine learning, this paper covers the essential topics that their application requires and provides easy‐to‐follow resources and code to facilitate the process of adoption.
-
Machine learning (ML) has become increasingly prevalent in various domains. However, ML algorithms sometimes produce unfair outcomes and discriminate against certain groups. Bias occurs when results produce a decision that is systematically incorrect. These biases appear at various phases of the ML pipeline, such as data collection, pre-processing, model selection, and evaluation. Bias reduction methods for ML have been suggested using a variety of techniques. These methods try to lessen bias by changing the data or the model itself, adding more fairness constraints, or both. The best technique depends on the particular context and application because each technique has advantages and disadvantages. Therefore, in this paper, we present a comprehensive survey of bias mitigation techniques in machine learning (ML) with a focus on in-depth exploration of methods, including adversarial training. We examine the diverse types of bias that can afflict ML systems, elucidate current research trends, and address future challenges. Our discussion encompasses a detailed analysis of pre-processing, in-processing, and post-processing methods, including their respective pros and cons. Moreover, we go beyond qualitative assessments by quantifying the strategies for bias reduction and providing empirical evidence and performance metrics. This paper serves as an invaluable resource for researchers, practitioners, and policymakers seeking to navigate the intricate landscape of bias in ML, offering both a profound understanding of the issue and actionable insights for responsible and effective bias mitigation.
-
Abstract Protein engineering is an emerging field in biotechnology that has the potential to revolutionize various areas, such as antibody design, drug discovery, food security, ecology, and more. However, the mutational space involved is too vast to be handled through experimental means alone. Leveraging accumulative protein databases, machine learning (ML) models, particularly those based on natural language processing (NLP), have considerably expedited protein engineering. Moreover, advances in topological data analysis (TDA) and artificial intelligence-based protein structure prediction, such as AlphaFold2, have made more powerful structure-based ML-assisted protein engineering strategies possible. This review aims to offer a comprehensive, systematic, and indispensable set of methodological components, including TDA and NLP, for protein engineering and to facilitate their future development.