
Title: Crowd-Sourced Reliability of an Assessment of Lower Facial Aging Using a Validated Visual Scale
Background: Reliable and valid assessments of the visual endpoints of aesthetic surgery procedures are needed. Currently, most assessments are based on the opinion of patients and their plastic surgeons. The objective of this research was to analyze the reliability of crowdworkers assessing de-identified photographs using a validated scale that depicts lower facial aging. Methods: Twenty photographs of the facial nasolabial region of various non-identifiable faces, showing various degrees of facial aging, were obtained. Independent crowds of 100 crowdworkers were tasked with assessing the degree of aging using a photograph numeric scale. Independent groups of crowdworkers were surveyed at 4 different times (weekday daytime, weekday nighttime, weekend daytime, weekend nighttime), once a week for 2 weeks. Results: Crowds assessing midface region photographs had an overall correlation of R = 0.979 (weekday daytime R = 0.991; weekday nighttime R = 0.985; weekend daytime R = 0.997; weekend nighttime R = 0.985). Bland−Altman test for test-retest agreement showed a normal distribution of assessments over the various times tested, with the differences in the majority of photographs being within 1 SD of the average difference in ratings. Conclusions: Crowd assessments of facial aging in de-identified photographs displayed very strong concordance with each other, regardless of time of day or week. This shows promise toward obtaining reliable assessments of pre- and postoperative results for aesthetic surgery procedures. More work must be done to quantify the reliability of assessments for other pretreatment states or the corresponding results following treatment.
Journal Name:
Plastic and Reconstructive Surgery - Global Open
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. AI-based educational technologies may be most welcome in classrooms when they align with teachers' goals, preferences, and instructional practices. Teachers, however, have scarce time to make such customizations themselves. How might the crowd be leveraged to help time-strapped teachers? Crowdsourcing pipelines have traditionally focused on content generation. It is an open question how a pipeline might be designed so the crowd can succeed in a revision/customization task. In this paper, we explore an initial version of a teacher-guided crowdsourcing pipeline designed to improve the adaptive math hints of an AI-based tutoring system so they fit teachers' preferences, while requiring minimal expert guidance. In two experiments involving 144 math teachers and 481 crowdworkers, we found that such an expert-guided revision pipeline could save experts' time and produce better crowd-revised hints (in terms of teacher satisfaction) than two comparison conditions. The revised hints, however, did not improve on the existing hints in the AI tutor, which were carefully written but still have room for improvement and customization. Further analysis revealed that the main challenge for crowdworkers may lie in understanding teachers' brief written comments and implementing them in the form of effective edits, without introducing new problems. We also found that teachers preferred their own revisions over other sources of hints, and exhibited varying preferences for hints. Overall, the results confirm that there is a clear need for customizing hints to individual teachers' preferences. They also highlight the need for more elaborate scaffolds so the crowd can have specific knowledge of the requirements that teachers have for hints. The study represents a first exploration in the literature of how to support crowds with minimal expert guidance in revising and customizing instructional materials.
  2. Abstract

    Streams are complex systems in which biology, hydrology, and atmospheric processes are all important. Because quantifying and modeling these systems can be challenging, many teams go directly to prescribed restoration treatments and principles. Restoration on the Middle Fork of the John Day River in Oregon, USA, shows how a project that was designed according to widely accepted restoration principles may lead to outcomes contrary to one of the project's stated goals: reducing peak temperatures for endangered salmonids on the site. This study employed the most sophisticated equipment available for stream temperature monitoring, including approximately 1 million independent hourly measurements in the 2‐week period considered. These data were collected along the river channel with fiber optic–distributed temperature sensing and were used to quantify thermal dynamics. These observations were paired with a physically based stream temperature model which was then employed to predict temperature change from design alternatives. Restored‐reach impact on peak temperature was directly correlated with the air–water interfacial area and the percentage of effective shade (R² > 0.99). The increase in air–water area of the proposed design was predicted to increase daytime stream temperature by as much as 0.5°C upon completion of the work. Shade from riparian vegetation was found to potentially mitigate stream temperature increases, though only after decades of growth. A moderately dense canopy of 5 m tall trees blocking 17% of daily shortwave solar radiation is predicted to mitigate predicted temperature increases over the 1,800 m reach but also increases nighttime temperatures due to blocking of long‐wave radiation. These outcomes may not be intuitive to restoration practitioners and show how quantitative analysis can benefit the design of a project. This is significant in an area where riparian vegetation has been difficult to reestablish.
Without quantitative analysis, restoration efforts can lead to outcomes opposite to stated goals and may be costly and disruptive interventions to fragile stream systems.

  3. Crowdsourcing has been used to produce impactful and large-scale datasets for Machine Learning and Artificial Intelligence (AI), such as ImageNet, SuperGLUE, etc. Since the rise of crowdsourcing in the early 2000s, the AI community has been studying its computational, system design, and data-centric aspects from various angles. We welcome studies on developing and enhancing crowdworker-centric tools that offer task matching, requester assessment, and instruction validation, among other topics. We are also interested in exploring methods that leverage the integration of crowdworkers to improve the recognition and performance of machine learning models. Thus, we invite studies that focus on shipping active learning techniques, methods for joint learning from noisy data and from crowds, novel approaches for crowd-computer interaction, repetitive task automation, and role separation between humans and machines. Moreover, we invite works on designing and applying such techniques in various domains, including e-commerce and medicine.
  4. Li-Jessen, Nicole Yee-Key (Ed.)
    The Earable device is a behind-the-ear wearable originally developed to measure cognitive function. Since Earable measures electroencephalography (EEG), electromyography (EMG), and electrooculography (EOG), it may also have the potential to objectively quantify facial muscle and eye movement activities relevant in the assessment of neuromuscular disorders. As an initial step to developing a digital assessment in neuromuscular disorders, a pilot study was conducted to determine whether the Earable device could be utilized to objectively measure facial muscle and eye movements intended to be representative of Performance Outcome Assessments (PerfOs), with tasks designed to model clinical PerfOs, referred to as mock-PerfO activities. The specific aims of this study were: to determine whether the Earable raw EMG, EOG, and EEG signals could be processed to extract features describing these waveforms; to determine Earable feature data quality, test-retest reliability, and statistical properties; to determine whether features derived from Earable could be used to determine the difference between various facial muscle and eye movement activities; and to determine what features and feature types are important for mock-PerfO activity level classification. A total of N = 10 healthy volunteers participated in the study. Each study participant performed 16 mock-PerfO activities, including talking, chewing, swallowing, eye closure, gazing in different directions, puffing cheeks, chewing an apple, and making various facial expressions. Each activity was repeated four times in the morning and four times at night. A total of 161 summary features were extracted from the EEG, EMG, and EOG bio-sensor data. Feature vectors were used as input to machine learning models to classify the mock-PerfO activities, and model performance was evaluated on a held-out test set.
Additionally, a convolutional neural network (CNN) was used to classify low-level representations of the raw bio-sensor data for each task, and model performance was correspondingly evaluated and compared directly to feature classification performance. The model’s prediction accuracy on the Earable device’s classification ability was quantitatively assessed. Study results indicate that Earable can potentially quantify different aspects of facial and eye movements and may be used to differentiate mock-PerfO activities. Specifically, Earable was found to differentiate talking, chewing, and swallowing tasks from other tasks with observed F1 scores >0.9. While EMG features contribute to classification accuracy for all tasks, EOG features are important for classifying gaze tasks. Finally, we found that analysis with summary features outperformed a CNN for activity classification. We believe Earable may be used to measure cranial muscle activity relevant for neuromuscular disorder assessment. Classification performance of mock-PerfO activities with summary features enables a strategy for detecting disease-specific signals relative to controls, as well as the monitoring of intra-subject treatment responses. Further testing is needed to evaluate the Earable device in clinical populations and clinical development settings.
  5. This research paper describes the development of an assessment instrument for use with middle school students that provides insight into students’ interpretive understanding by looking at early indicators of developing expertise in students’ responses to solution generation, reflection, and concept demonstration tasks. We begin by detailing a synthetic assessment model that served as the theoretical basis for assessing specific thinking skills. We then describe our process of developing test items by working with a Teacher Design Team (TDT) of instructors in our partner school system to set guidelines that would better orient the assessment in that context and working within the framework of standards and disciplinary core ideas enumerated in the Next Generation Science Standards (NGSS). We next specify our process of refining the assessment from 17 items across three separate item pools to a final total of three open-response items. We then provide evidence for the validity and reliability of the assessment instrument from the standards of (1) content, (2) meaningfulness, (3) generalizability, and (4) instructional sensitivity. As part of the discussion from the standards of generalizability and instructional sensitivity, we detail a study carried out in our partner school system in the fall of 2019. The instrument was administered to students in treatment (n = 201) and non-treatment (n = 246) groups, wherein the former participated in a two-to-three-week, NGSS-aligned experimental instructional unit introducing the principles of engineering design that focused on engaging students using the Imaginative Education teaching approach. The latter group was taught using the district’s existing engineering design curriculum. Results from statistical analysis of student responses showed that the interrater reliability of the scoring procedures was good to excellent, with intra-class correlation coefficients ranging between .72 and .95.
To gauge the instructional sensitivity of the assessment instrument, a series of non-parametric comparative analyses (independent two-group Mann-Whitney tests) were carried out. These found statistically significant differences between treatment and non-treatment student responses related to the outcomes of fluency and elaboration, but not reflection. 