Title: Design and Evaluation Challenges of Conversational Agents in Health Care and Well-being: Selective Review Study
Background

Health care and well-being are 2 main interconnected application areas of conversational agents (CAs). Research, development, and commercial implementations in this area have increased significantly, and with this growing interest, new challenges in designing and evaluating CAs have emerged.

Objective

This study aims to identify key design, development, and evaluation challenges of CAs in health care and well-being research, focusing on recent projects and their emerging challenges.

Methods

A review study was conducted with 17 invited studies, most of which were presented at the ACM (Association for Computing Machinery) CHI 2020 conference workshop on CAs for health and well-being. Eligibility criteria required that each study involve a CA applied to a health or well-being project (ongoing or recently finished). The participating studies were asked to report on their projects' design and evaluation challenges. We used thematic analysis to review the studies.

Results

The findings cover a range of topics, from primary care to caring for older adults to health coaching. We identified 4 major themes: (1) Domain Information and Integration, (2) User-System Interaction and Partnership, (3) Evaluation, and (4) Conversational Competence.

Conclusions

CAs proved their worth during the pandemic as health screening tools and are expected to stay on to further support various health care domains, especially personal health care. Growing investment in CAs also shows their value as personal assistants. Our study shows that while some challenges are shared with other CA application areas, safety and privacy remain the major challenges in the health care and well-being domains. Increased collaboration across institutions and entities may be a promising direction for addressing major challenges that would otherwise be too complex for individual projects with limited scope and budget.
Award ID(s): 2144880
NSF-PAR ID: 10404511
Journal Name: Journal of Medical Internet Research
Volume: 24
Issue: 11
ISSN: 1438-8871
Page Range / eLocation ID: e38525
Sponsoring Org: National Science Foundation
More Like this
  1. Obeid, I. (Ed.)
    The Neural Engineering Data Consortium (NEDC) is developing the Temple University Digital Pathology Corpus (TUDP), an open-source database of high-resolution images from scanned pathology samples [1], as part of its National Science Foundation-funded Major Research Instrumentation grant titled “MRI: High Performance Digital Pathology Using Big Data and Machine Learning” [2]. The long-term goal of this project is to release one million images. We have currently scanned over 100,000 images and are in the process of annotating breast tissue data for our first official corpus release, v1.0.0. This release contains 3,505 annotated images of breast tissue including 74 patients with cancerous diagnoses (out of a total of 296 patients). In this poster, we will present an analysis of this corpus and discuss the challenges we have faced in efficiently producing high-quality annotations of breast tissue. It is well known that state-of-the-art algorithms in machine learning require vast amounts of data. Fields such as speech recognition [3], image recognition [4], and text processing [5] are able to deliver impressive performance with complex deep learning models because they have developed large corpora to support training of extremely high-dimensional models (e.g., billions of parameters). Other fields that do not have access to such data resources must rely on techniques in which existing models can be adapted to new datasets [6]. A preliminary version of this breast corpus release was tested in a pilot study using a baseline machine learning system, ResNet18 [7], that leverages several open-source Python tools. The pilot corpus was divided into three sets: train, development, and evaluation. Portions of these slides were manually annotated [1] using the nine labels in Table 1 [8] to identify five to ten examples of pathological features on each slide.
Not every pathological feature is annotated, meaning excluded areas can include foci particular to these labels that are not used for training. A summary of the number of patches within each label is given in Table 2. To maintain a balanced training set, 1,000 patches of each label were used to train the machine learning model. Throughout all sets, only annotated patches were involved in model development. The performance of this model in identifying all the patches in the evaluation set can be seen in the confusion matrix of classification accuracy in Table 3. The highest-performing labels were background, 97% correct identification, and artifact, 76% correct identification. A correlation exists between labels with more than 6,000 development patches and accurate performance on the evaluation set. Additionally, these results indicated a need to further refine the annotation of invasive ductal carcinoma (“indc”), inflammation (“infl”), nonneoplastic features (“nneo”), normal (“norm”), and suspicious (“susp”). This pilot experiment motivated changes to the corpus that will be discussed in detail in this poster presentation. To increase the accuracy of the machine learning model, we modified how we addressed underperforming labels. One common source of error arose with how non-background labels were converted into patches. Large areas of background within other labels were isolated within a patch, resulting in connective tissue misrepresenting a non-background label. In response, the annotation overlay margins were revised to exclude benign connective tissue in non-background labels. Corresponding patient reports and supporting immunohistochemical stains further guided annotation reviews. The microscopic diagnoses given by the primary pathologist in these reports detail the pathological findings within each tissue site, but not within each specific slide.
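The balanced training setup described above (1,000 patches per label, drawn only from annotated patches) can be sketched as follows; the function name and data layout are illustrative assumptions, not the project's actual tooling:

```python
import random

def balanced_patch_sample(patches_by_label, n_per_label=1000, seed=0):
    """Draw an equal number of patches per label to balance the training set.

    patches_by_label: dict mapping a label name to a list of patch IDs.
    Raises if a label has too few annotated patches to meet the quota.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible draw
    balanced = {}
    for label, patches in patches_by_label.items():
        if len(patches) < n_per_label:
            raise ValueError(f"label {label!r} has only {len(patches)} patches")
        # sample without replacement so no patch is duplicated
        balanced[label] = rng.sample(patches, n_per_label)
    return balanced
```

A class with fewer than the quota would need augmentation or a lower quota; the sketch simply refuses, which makes the imbalance visible rather than hiding it.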
The microscopic diagnoses informed revisions specifically targeting annotated regions classified as cancerous, ensuring that the labels “indc” and “dcis” were used only in situations where a pathologist diagnosed them as such. Further differentiation of cancerous and precancerous labels, as well as the location of their focus on a slide, could be accomplished with supplemental immunohistochemistry (IHC)-stained slides. When distinguishing whether a focus is a nonneoplastic feature versus a cancerous growth, pathologists apply antigen-targeting stains to the tissue in question to confirm the diagnosis. For example, a nonneoplastic feature of usual ductal hyperplasia will display diffuse staining for cytokeratin 5 (CK5) and no diffuse staining for estrogen receptor (ER), while a cancerous growth of ductal carcinoma in situ will have negative or focally positive staining for CK5 and diffuse staining for ER [9]. Many tissue samples contain cancerous and non-cancerous features with morphological overlaps that cause variability between annotators. The information IHC slides provide could play an integral role in machine model pathology diagnostics. Following the revisions made on all the annotations, a second experiment was run using ResNet18. Compared to the pilot study, an increase in model prediction accuracy was seen for the labels indc, infl, nneo, norm, and null. This increase is correlated with an increase in annotated area and annotation accuracy. Model performance in identifying the suspicious label decreased by 25% due to a 57% decrease in the total annotated area described by this label. A summary of the model performance is given in Table 4, which shows the new prediction accuracy and the absolute change in error rate compared to Table 3. The breast tissue subset we are developing includes 3,505 annotated breast pathology slides from 296 patients. The average size of a scanned SVS file is 363 MB. The annotations are stored in an XML format.
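The per-label accuracies reported from the confusion matrix (eg, 97% for background, 76% for artifact) follow from the matrix diagonal divided by each row total; a minimal sketch, with an illustrative function name:

```python
def per_label_accuracy(confusion):
    """Per-label classification accuracy from a confusion matrix.

    confusion[i][j] = count of patches with true label i predicted as j.
    Returns the fraction correct for each true label: the diagonal entry
    divided by the row sum (0.0 for a label with no evaluation patches).
    """
    acc = []
    for i, row in enumerate(confusion):
        total = sum(row)
        acc.append(row[i] / total if total else 0.0)
    return acc
```

For a toy 2-label matrix [[97, 3], [24, 76]], this yields 0.97 and 0.76, matching the shape of the background/artifact figures quoted above.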
A CSV version of the annotation file is also available, which provides a flat, or simple, annotation that is easy for machine learning researchers to access and interface to their systems. Each patient is identified by an anonymized medical reference number. Within each patient’s directory, one or more sessions are identified, also anonymized to the first of the month in which the sample was taken. These sessions are broken into groupings of tissue taken on that date (in this case, breast tissue). A deidentified patient report stored as a flat text file is also available. Within these slides there are a total of 16,971 annotated regions, with an average of 4.84 annotations per slide. Among those annotations, 8,035 are non-cancerous (normal, background, null, and artifact), 6,222 are carcinogenic signs (inflammation, nonneoplastic, and suspicious), and 2,714 are cancerous labels (ductal carcinoma in situ and invasive ductal carcinoma). The individual patients are split into three sets: train, development, and evaluation. Of the 74 cancerous patients, 20 each were allotted to the development and evaluation sets, while the remaining 34 were allotted to train. The remaining 222 patients were split to preserve the overall distribution of labels within the corpus. This was done in the hope of creating control sets for comparable studies. Overall, the development and evaluation sets each have 80 patients, while the training set has 136 patients. In a related component of this project, slides from the Fox Chase Cancer Center (FCCC) Biosample Repository (https://www.foxchase.org/research/facilities/genetic-research-facilities/biosample-repository-facility) are being digitized in addition to slides provided by Temple University Hospital. These data include 18 different types of tissue, including approximately 38.5% urinary tissue and 16.5% gynecological tissue.
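The patient-level split described above (20 cancerous patients each for development and evaluation, 34 for train, with the remaining 222 patients allocated to reach 80/80/136) can be sketched as follows. The function name and the random allocation of non-cancerous patients are illustrative assumptions; the actual corpus additionally balanced the label distribution, which a plain shuffle does not guarantee:

```python
import random

def split_patients(cancerous, noncancerous, seed=0):
    """Patient-level split so no patient appears in more than one set.

    Mirrors the counts described above: 20 cancerous patients each to
    dev and eval, the remaining 34 to train; non-cancerous patients are
    shuffled and allocated to bring the sets to 80/80/136 patients.
    """
    rng = random.Random(seed)
    cancer = list(cancerous)
    other = list(noncancerous)
    rng.shuffle(cancer)
    rng.shuffle(other)
    dev = cancer[:20] + other[:60]          # 20 + 60 = 80 patients
    eval_ = cancer[20:40] + other[60:120]   # 20 + 60 = 80 patients
    train = cancer[40:] + other[120:]       # 34 + 102 = 136 patients
    return train, dev, eval_
```

Splitting at the patient level, rather than the slide or patch level, is what prevents slides from one patient leaking across train and evaluation.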
These slides and the metadata provided with them are already anonymized and include diagnoses in a spreadsheet with sample and patient ID. We plan to release over 13,000 unannotated slides from the FCCC Corpus simultaneously with v1.0.0 of TUDP. Details of this release will also be discussed in this poster. Few digitally annotated databases of pathology samples like TUDP exist due to the extensive data collection and processing required. The breast corpus subset should be released by November 2021. By December 2021 we should also release the unannotated FCCC data. We are currently annotating urinary tract data as well. We expect to release about 5,600 processed TUH slides in this subset. We have an additional 53,000 unprocessed TUH slides digitized. Corpora of this size will stimulate the development of a new generation of deep learning technology. In clinical settings where resources are limited, an assistive diagnosis model could ease pathologists’ workload and even help prioritize suspected cancerous cases. ACKNOWLEDGMENTS This material is supported by the National Science Foundation under grant nos. CNS-1726188 and 1925494. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. REFERENCES [1] N. Shawki et al., “The Temple University Digital Pathology Corpus,” in Signal Processing in Medicine and Biology: Emerging Trends in Research and Applications, 1st ed., I. Obeid, I. Selesnick, and J. Picone, Eds. New York City, New York, USA: Springer, 2020, pp. 67–104. https://www.springer.com/gp/book/9783030368432. [2] J. Picone, T. Farkas, I. Obeid, and Y. Persidsky, “MRI: High Performance Digital Pathology Using Big Data and Machine Learning.” Major Research Instrumentation (MRI), Division of Computer and Network Systems, Award No. 1726188, January 1, 2018 – December 31, 2021. https://www.isip.piconepress.com/projects/nsf_dpath/.
[3] A. Gulati et al., “Conformer: Convolution-augmented Transformer for Speech Recognition,” in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2020, pp. 5036–5040. https://doi.org/10.21437/interspeech.2020-3015. [4] C.-J. Wu et al., “Machine Learning at Facebook: Understanding Inference at the Edge,” in Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA), 2019, pp. 331–344. https://ieeexplore.ieee.org/document/8675201. [5] I. Caswell and B. Liang, “Recent Advances in Google Translate,” Google AI Blog: The latest from Google Research, 2020. [Online]. Available: https://ai.googleblog.com/2020/06/recent-advances-in-google-translate.html. [Accessed: 01-Aug-2021]. [6] V. Khalkhali, N. Shawki, V. Shah, M. Golmohammadi, I. Obeid, and J. Picone, “Low Latency Real-Time Seizure Detection Using Transfer Deep Learning,” in Proceedings of the IEEE Signal Processing in Medicine and Biology Symposium (SPMB), 2021, pp. 1–7. https://www.isip.piconepress.com/publications/conference_proceedings/2021/ieee_spmb/eeg_transfer_learning/. [7] J. Picone, T. Farkas, I. Obeid, and Y. Persidsky, “MRI: High Performance Digital Pathology Using Big Data and Machine Learning,” Philadelphia, Pennsylvania, USA, 2020. https://www.isip.piconepress.com/publications/reports/2020/nsf/mri_dpath/. [8] I. Hunt, S. Husain, J. Simons, I. Obeid, and J. Picone, “Recent Advances in the Temple University Digital Pathology Corpus,” in Proceedings of the IEEE Signal Processing in Medicine and Biology Symposium (SPMB), 2019, pp. 1–4. https://ieeexplore.ieee.org/document/9037859. [9] A. P. Martinez, C. Cohen, K. Z. Hanley, and X. (Bill) Li, “Estrogen Receptor and Cytokeratin 5 Are Reliable Markers to Separate Usual Ductal Hyperplasia From Atypical Ductal Hyperplasia and Low-Grade Ductal Carcinoma In Situ,” Arch. Pathol. Lab. Med., vol. 140, no. 7, pp. 686–689, Apr. 2016. https://doi.org/10.5858/arpa.2015-0238-OA.
  2. A Mavragani (Ed.)
    Background

    Posttraumatic stress disorder (PTSD) is a serious public health concern. However, individuals with PTSD often do not have access to adequate treatment. A conversational agent (CA) can help to bridge the treatment gap by providing interactive and timely interventions at scale. Toward this goal, we have developed PTSDialogue—a CA to support the self-management of individuals living with PTSD. PTSDialogue is designed to be highly interactive (eg, brief questions, ability to specify preferences, and quick turn-taking) and supports social presence to promote user engagement and sustain adherence. It includes a range of support features, including psychoeducation, assessment tools, and several symptom management tools.

    Objective

    This paper focuses on the preliminary evaluation of PTSDialogue from clinical experts. Given that PTSDialogue focuses on a vulnerable population, it is critical to establish its usability and acceptance with clinical experts before deployment. Expert feedback is also important to ensure user safety and effective risk management in CAs aiming to support individuals living with PTSD.

    Methods

    We conducted remote, one-on-one, semistructured interviews with clinical experts (N=10) to gather insight into the use of CAs. All participants had completed doctoral degrees and had prior experience in PTSD care. The web-based PTSDialogue prototype was then shared with each participant so that they could interact with different functionalities and features. We encouraged them to “think aloud” as they interacted with the prototype. Participants also shared their screens throughout the interaction session. A semistructured interview script was used to gather insights and feedback from the participants. The sample size is consistent with that of prior work. We analyzed interview data using a qualitative interpretivist approach, resulting in a bottom-up thematic analysis.

    Results

    Our data establish the feasibility and acceptance of PTSDialogue, a supportive tool for individuals with PTSD. Most participants agreed that PTSDialogue could be useful for supporting self-management of individuals with PTSD. We have also assessed how features, functionalities, and interactions in PTSDialogue can support different self-management needs and strategies for this population. These data were then used to identify design requirements and guidelines for a CA aiming to support individuals with PTSD. Experts specifically noted the importance of empathetic and tailored CA interactions for effective PTSD self-management. They also suggested steps to ensure safe and engaging interactions with PTSDialogue.

    Conclusions

    Based on interviews with experts, we have provided design recommendations for future CAs aiming to support vulnerable populations. The study suggests that well-designed CAs have the potential to reshape effective intervention delivery and help address the treatment gap in mental health.

     
  3. Background Over the past 2 decades, various desktop and mobile telemedicine systems have been developed to support communication and care coordination among distributed medical teams. However, in the hands-busy care environment, such technologies could become cumbersome because they require medical professionals to manually operate them. Smart glasses have been gaining momentum because of their advantages in enabling hands-free operation and see-what-I-see video-based consultation. Previous research has tested this novel technology in different health care settings. Objective The aim of this study was to review how smart glasses were designed, used, and evaluated as a telemedicine tool to support distributed care coordination and communication, as well as highlight the potential benefits and limitations regarding medical professionals’ use of smart glasses in practice. Methods We conducted a literature search in 6 databases that cover research within both health care and computer science domains. We used the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) methodology to review articles. A total of 5865 articles were retrieved and screened by 3 researchers, with 21 (0.36%) articles included for in-depth analysis. Results All of the reviewed articles (21/21, 100%) used off-the-shelf smart glass devices and videoconferencing software, which had a high level of technology readiness for real-world use and deployment in care settings. The common system features used and evaluated in these studies included video and audio streaming, annotation, augmented reality, and hands-free interactions. These studies focused on evaluating the technical feasibility, effectiveness, and user experience of smart glasses.
Although the smart glass technology has demonstrated numerous benefits and high levels of user acceptance, the reviewed studies noted a variety of barriers to successful adoption of this novel technology in actual care settings, including technical limitations, human factors and ergonomics, privacy and security issues, and organizational challenges. Conclusions User-centered system design, improved hardware performance, and software reliability are needed to realize the potential of smart glasses. More research is needed to examine and evaluate medical professionals’ needs, preferences, and perceptions, as well as elucidate how smart glasses affect the clinical workflow in complex care environments. Our findings inform the design, implementation, and evaluation of smart glasses that will improve organizational and patient outcomes. 
  4. Background

    COVID-19 has severely impacted health in vulnerable demographics. As communities transition back to in-person work, learning, and social activities, pediatric patients who are restricted to their homes due to medical conditions face unprecedented isolation. Prior to the pandemic, it was estimated that each year, over 2.5 million US children remained at home due to medical conditions. Confronting gaps in health and technical resources is central to addressing the challenges faced by children who remain at home. Having children use mobile telemedicine units (telerobots) to interact with their outside environment (eg, school and play) is increasingly recognized for its potential to support children’s development. Additionally, social telerobots are emerging as a novel form of telehealth. A social telerobot is a tele-operated unit with a mobile base, 2-way audio/video capabilities, and some semiautonomous features.

    Objective

    In this paper, we aimed to provide a critical review of studies focused on the use of social telerobots for pediatric populations.

    Methods

    To examine the evidence on telerobots as a telehealth intervention, we conducted electronic and full-text searches of private and public databases in June 2010. We included studies of the personal use of interactive telehealth technologies by pediatric patients, as well as telerobot studies that explored effects on child development. We excluded telehealth and telerobot studies with adult (aged >18 years) participants.

    Results

    In addition to telehealth and telerobot advantages, evidence from the literature suggests 3 promising robot-mediated supports that contribute to optimal child development—belonging, competence, and autonomy. These robot-mediated supports may be leveraged for improved pediatric patient socioemotional development, well-being, and quality-of-life activities that transfer traditional developmental and behavioral experiences from organic local environments to the remote child.

    Conclusions

    This review contributes to the creation of the first pediatric telehealth taxonomy of care that includes the personal use of telehealth technologies as a compelling form of telehealth care.

     
  5. Obeid, I.; Selesnick, I. (Eds.)
    The Temple University Hospital EEG Corpus (TUEG) [1] is the largest publicly available EEG corpus of its type and currently has over 5,000 subscribers (we currently average 35 new subscribers a week). Several valuable subsets of this corpus have been developed including the Temple University Hospital EEG Seizure Corpus (TUSZ) [2] and the Temple University Hospital EEG Artifact Corpus (TUAR) [3]. TUSZ contains manually annotated seizure events and has been widely used to develop seizure detection and prediction technology [4]. TUAR contains manually annotated artifacts and has been used to improve machine learning performance on seizure detection tasks [5]. In this poster, we will discuss recent improvements made to both corpora that are creating opportunities to improve machine learning performance. Two major concerns that were raised when v1.5.2 of TUSZ was released for the Neureka 2020 Epilepsy Challenge were: (1) the subjects contained in the training, development (validation) and blind evaluation sets were not mutually exclusive, and (2) high frequency seizures were not accurately annotated in all files. Regarding (1), there were 50 subjects in dev, 50 subjects in eval, and 592 subjects in train. There was one subject common to dev and eval, five subjects common to dev and train, and 13 subjects common between eval and train. Though this does not substantially influence performance for the current generation of technology, it could be a problem down the line as technology improves. Therefore, we have rebuilt the partitions of the data so that this overlap was removed. This required augmenting the evaluation and development data sets with new subjects that had not been previously annotated so that the size of these subsets remained approximately the same. Since these annotations were done by a new group of annotators, special care was taken to make sure the new annotators followed the same practices as the previous generations of annotators. 
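A release-time sanity check for the partition overlap described above (one subject shared between dev and eval, five between dev and train, 13 between eval and train in v1.5.2) can be sketched as follows; the function name is illustrative, not part of the TUSZ tooling:

```python
def partition_overlaps(train, dev, eval_):
    """Report subject IDs shared between any pair of partitions.

    Each argument is an iterable of subject IDs. A clean release
    requires all three returned sets to be empty.
    """
    train, dev, eval_ = set(train), set(dev), set(eval_)
    return {
        "dev_eval": dev & eval_,
        "dev_train": dev & train,
        "eval_train": eval_ & train,
    }
```

Running such a check before each release makes subject leakage between training and evaluation data impossible to miss, even when it is too small to affect current benchmark numbers.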
Part of our quality control process was to have the new annotators review all previous annotations. This rigorous training coupled with a strict quality control process where annotators review a significant amount of each other’s work ensured that there is high interrater agreement between the two groups (kappa statistic greater than 0.8) [6]. In the process of reviewing this data, we also decided to split long files into a series of smaller segments to facilitate processing of the data. Some subscribers found it difficult to process long files using Python code, which tends to be very memory intensive. We also found it inefficient to manipulate these long files in our annotation tool. In this release, the maximum duration of any single file is limited to 60 mins. This increased the number of edf files in the dev set from 1012 to 1832. Regarding (2), as part of discussions of several issues raised by a few subscribers, we discovered some files only had low frequency epileptiform events annotated (defined as events that ranged in frequency from 2.5 Hz to 3 Hz), while others had events annotated that contained significant frequency content above 3 Hz. Though there were not many files that had this type of activity, it was enough of a concern to necessitate reviewing the entire corpus. An example of an epileptiform seizure event with frequency content higher than 3 Hz is shown in Figure 1. Annotating these additional events slightly increased the number of seizure events. In v1.5.2, there were 673 seizures, while in v1.5.3 there are 1239 events. One of the fertile areas for technology improvements is artifact reduction. Artifacts and slowing constitute the two major error modalities in seizure detection [3]. This was a major reason we developed TUAR. It can be used to evaluate artifact detection and suppression technology as well as multimodal background models that explicitly model artifacts. 
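The interrater-agreement criterion above (kappa statistic greater than 0.8 between the two annotator groups) is Cohen's kappa, which corrects raw agreement for the agreement expected by chance; a minimal sketch for two annotators labeling the same items:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa between two annotators' labels over the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected by chance from each rater's label
    frequencies.
    """
    assert len(a) == len(b) and a, "need two equal-length, nonempty label lists"
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n           # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb[k] for k in ca) / (n * n)        # chance agreement
    if pe == 1:
        return 1.0  # both raters used a single identical label throughout
    return (po - pe) / (1 - pe)
```

Values above 0.8 are conventionally read as near-perfect agreement, which is why that threshold serves as the quality bar here.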
An issue with TUAR was the practicality of the annotation tags used when there are multiple simultaneous events. An example of such an event is shown in Figure 2. In this section of the file, there is an overlap of eye movement, electrode artifact, and muscle artifact events. We previously annotated such events using a convention that included annotating background along with any artifact that is present. The artifacts present would either be annotated with a single tag (e.g., MUSC) or a coupled artifact tag (e.g., MUSC+ELEC). When multiple channels have background, the tags become crowded and difficult to identify. This is one reason we now support a hierarchical annotation format using XML – annotations can be arbitrarily complex and support overlaps in time. Our annotators also reviewed specific eye movement artifacts (e.g., eye flutter, eyeblinks). Eye movements are often mistaken as seizures due to their similar morphology [7][8]. We have improved our understanding of ocular events and it has allowed us to annotate artifacts in the corpus more carefully. In this poster, we will present statistics on the newest releases of these corpora and discuss the impact these improvements have had on machine learning research. We will compare TUSZ v1.5.3 and TUAR v2.0.0 with previous versions of these corpora. We will release v1.5.3 of TUSZ and v2.0.0 of TUAR in Fall 2021 prior to the symposium. ACKNOWLEDGMENTS Research reported in this publication was most recently supported by the National Science Foundation’s Industrial Innovation and Partnerships (IIP) Research Experience for Undergraduates award number 1827565. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the official views of any of these organizations. REFERENCES [1] I. Obeid and J. Picone, “The Temple University Hospital EEG Data Corpus,” in Augmentation of Brain Function: Facts, Fiction and Controversy. 
Volume I: Brain-Machine Interfaces, 1st ed., vol. 10, M. A. Lebedev, Ed. Lausanne, Switzerland: Frontiers Media S.A., 2016, pp. 394–398. https://doi.org/10.3389/fnins.2016.00196. [2] V. Shah et al., “The Temple University Hospital Seizure Detection Corpus,” Frontiers in Neuroinformatics, vol. 12, pp. 1–6, 2018. https://doi.org/10.3389/fninf.2018.00083. [3] A. Hamid et al., “The Temple University Artifact Corpus: An Annotated Corpus of EEG Artifacts,” in Proceedings of the IEEE Signal Processing in Medicine and Biology Symposium (SPMB), 2020, pp. 1–3. https://ieeexplore.ieee.org/document/9353647. [4] Y. Roy, R. Iskander, and J. Picone, “The Neureka 2020 Epilepsy Challenge,” NeuroTechX, 2020. [Online]. Available: https://neureka-challenge.com/. [Accessed: 01-Dec-2021]. [5] S. Rahman, A. Hamid, D. Ochal, I. Obeid, and J. Picone, “Improving the Quality of the TUSZ Corpus,” in Proceedings of the IEEE Signal Processing in Medicine and Biology Symposium (SPMB), 2020, pp. 1–5. https://ieeexplore.ieee.org/document/9353635. [6] V. Shah, E. von Weltin, T. Ahsan, I. Obeid, and J. Picone, “On the Use of Non-Experts for Generation of High-Quality Annotations of Seizure Events,” Available: https://www.isip.piconepress.com/publications/unpublished/journals/2019/elsevier_cn/ira. [Accessed: 01-Dec-2021]. [7] D. Ochal, S. Rahman, S. Ferrell, T. Elseify, I. Obeid, and J. Picone, “The Temple University Hospital EEG Corpus: Annotation Guidelines,” Philadelphia, Pennsylvania, USA, 2020. https://www.isip.piconepress.com/publications/reports/2020/tuh_eeg/annotations/. [8] D. Strayhorn, “The Atlas of Adult Electroencephalography,” EEG Atlas Online, 2014. [Online].
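The hierarchical XML annotation format described above, in which overlapping artifact events become separate sibling elements rather than composite tags like MUSC+ELEC, can be sketched with the standard library; the element and attribute names here are illustrative assumptions, not the actual TUAR schema:

```python
import xml.etree.ElementTree as ET

def events_to_xml(events):
    """Serialize annotation events as sibling XML elements.

    events: list of (label, start_seconds, stop_seconds, channel) tuples.
    Because each event is its own element with its own time span,
    simultaneous events (e.g., muscle and electrode artifacts) simply
    overlap in time instead of requiring a coupled tag.
    """
    root = ET.Element("annotations")
    for label, start, stop, channel in events:
        ET.SubElement(root, "event", label=label,
                      start=str(start), stop=str(stop), channel=channel)
    return ET.tostring(root, encoding="unicode")
```

A consumer recovers simultaneous events by comparing start/stop intervals, which is what makes the format arbitrarily nestable and overlap-friendly compared with flat coupled tags.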