<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Wikipedia ORES Explorer: Visualizing Trade-offs For Designing Applications With Machine Learning API</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>06/28/2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10283256</idno>
					<idno type="doi">10.1145/3461778.3462099</idno>
					<title level='j'>DIS '21: Designing Interactive Systems Conference 2021</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Zining Ye</author><author>Xinran Yuan</author><author>Shaurya Gaur</author><author>Aaron Halfaker</author><author>Jodi Forlizzi</author><author>Haiyi Zhu</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[With the growing industry applications of Artificial Intelligence (AI) systems, pre-trained models and APIs have emerged and greatly lowered the barrier of building AI-powered products. However, novice AI application designers often struggle to recognize the inherent algorithmic trade-offs and evaluate model fairness before making informed design decisions. In this study, we examined the Objective Revision Evaluation System (ORES), a machine learning (ML) API in Wikipedia used by the community to build anti-vandalism tools. We designed an interactive visualization system to communicate model threshold trade-offs and fairness in ORES. We evaluated our system by conducting 10 in-depth interviews with potential ORES application designers. We found that our system helped application designers who have limited ML backgrounds learn about in-context ML knowledge, recognize inherent value trade-offs, and make design decisions that aligned with their goals. By demonstrating our system in a real-world domain, this paper presents a novel visualization approach to facilitate greater accessibility and human agency in AI application design.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>While developing and training machine learning (ML) models can involve complex processes, the building of machine learning powered applications is now expanding to ML novices. Machine learning capabilities are increasingly offered as industry services and are becoming accessible to people with limited ML expertise. For example, platforms such as Amazon Sagemaker Autopilot<ref type="foot">foot_0</ref> , Google Cloud AutoML<ref type="foot">foot_1</ref> , Microsoft Azure<ref type="foot">foot_2</ref> and IBM AutoAI<ref type="foot">foot_3</ref> provide automatic pipelines through which ML novices can develop machine learning models to address real-world problems. In other cases, machine learning is offered as an API that people can directly call to access pre-trained models and get prediction results.</p><p>Recent research <ref type="bibr">[35,</ref><ref type="bibr">36]</ref> investigated how application designers build ML solutions and identified two unique challenges in envisioning and prototyping with AI: the uncertainty surrounding AI's capabilities, and AI's output complexity. Specifically, there are often inherent trade-offs between different system criteria of machine learning models (e.g., accuracy, false-positive rates, false-negative rates, and disparity) and optimizing one criterion often leads to poor performance in others. Furthermore, there is an emerging body of literature demonstrating that machine learning models can have disparate performances, for example, among different social groups <ref type="bibr">[3,</ref><ref type="bibr">20,</ref><ref type="bibr">24]</ref>. This disparity can impact experiences of users, and even lead to negative societal outcomes <ref type="bibr">[21]</ref>. 
However, there is little research on how to help application designers understand the trade-offs and disparities in ML services in order to make informed decisions and achieve their design goals.</p><p>In this paper, we focus on a service that is designed to help application designers (whom we broadly define as a group of designers, engineers, and product managers who are ML novices) understand the trade-offs in ML models during the design and development process. Wikipedia has a community of application designers commonly referred to as "tool developers" or "bot developers." These application designers are members of the Wikipedia community and are often prolific editors themselves. They develop technologies that other Wikipedians use as tools to support the contribution and curation work in Wikipedia. The Objective Revision Evaluation Service (ORES) is a web service and API developed by Wikimedia's Scoring Platform Team. The service allows users to build applications to fight vandalism and review edits on Wikipedia and is available to Wikipedia application designers around the world. It uses machine learning to evaluate the quality and intention of Wikipedia edits <ref type="bibr">[16]</ref>. The ORES API takes in an edit revision ID and outputs both a "damaging" and a "good-faith" score that represent the respective likelihoods that an edit is damaging or malicious. Using ORES, application designers have the opportunity to decide how to use the prediction scores, choosing, for example, a prediction threshold for their applications. While building effective ORES tools requires deep contextual knowledge and technical expertise, it is uncommon for these application designers to have substantial backgrounds in machine learning or data engineering.</p><p>Without a thorough understanding of model trade-offs, application designers may apply ML capabilities in ways that lead to under-performing systems and potentially serious problems. 
Moreover, the ORES model tends to be more aggressive to edits made by newcomers and anonymous editors; false positive rates of edits from those two groups are significantly higher than false positive rates of edits from the experienced editors <ref type="bibr">[16,</ref><ref type="bibr">28]</ref>. Such disparities might impact the experiences of novice users and discourage their edit contributions in the Wikipedia Community. Thus, application designers should be made aware of the inherent trade-offs and model fairness in ML models in order to make informed decisions about selecting desired models, or taking action to mitigate potential problems caused by performance disparity.</p><p>In the research that follows, we present a case study that explores using interactive visualizations to communicate Wikipedia's ORES API's threshold-associated trade-offs and model fairness across different groups of editors. To surface the issues discussed above and help application designers effectively use ORES, we designed and implemented ORES Explorer, an interactive visualization website that uses a sample data set to allow people to learn about and experiment with ORES models. There are four visualizations in ORES Explorer: About ORES, Threshold Explorer, Group Disparity Visualizer, and Threshold Recommender. The goal of the visualization website is to eventually assist ORES application designers in making sensible product decisions that align with operational needs.</p><p>To evaluate the effectiveness of ORES Explorer, we conducted a series of interviews with application designers using a think-aloud protocol. During the interview, participants were given a scenario for building an ORES-based tool (either an automated bot or a human-review tool) and asked to express their opinions as they explored the visualizations. 
The goal of each interview was to see whether participants could perceive trade-offs and performance disparity in the ORES system and, eventually, make reasonable decisions based on the types of tools that they were developing. We found that 1) our visualization system improved people's understanding of the trade-offs in setting thresholds for the ORES model; 2) participants were able to select model thresholds that aligned with their design goals based on gathered information; and 3) surfacing the trade-offs and bias of the model helped participants develop more trust in the ORES AI system. This paper contributes findings on how to help designers make sense of model capabilities and limitations in order to build responsible ML applications in the context of Wikipedia. Wikipedia ORES API, however, is not an isolated case. Numerous similar Machine Learning APIs have been created to help designers build ML solutions. For example, Google's Perspective API <ref type="foot">5</ref> , which takes in any text-based conversation and outputs toxicity scores, has been used by publishers, platforms, and individuals to power a variety of different use cases of content moderation. We believe that our findings and aspects of our visualization tool are readily generalizable to other contexts.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">RELATED WORK 2.1 Machine Learning as a Design Material</head><p>The application of Artificial Intelligence (AI) has flourished in recent years. Technologies such as facial recognition, data visualization, predictive analytics, natural language processing, and deep learning are becoming more integrated into the design of essential products and services, thus creating new challenges for developers and designers of these systems. As AI/ML gains recognition as a type of design material in Human-Computer Interaction (HCI) <ref type="bibr">[32]</ref>, HCI researchers and designers have explored its potential for improving interactions with apps and software systems. Research has focused on how design practitioners work with AI, how they envision and propose new uses for AI, and how they collaborate with data scientists <ref type="bibr">[11,</ref><ref type="bibr">14,</ref><ref type="bibr">17,</ref><ref type="bibr">33,</ref><ref type="bibr">34]</ref>. Other work has focused on the difficulties and bottlenecks that designers, often ML novices, experience while working with AI. Yang et al. and Kaur et al. investigated how non-ML-experts build ML solutions using ML services and tools to address their own real-world problems <ref type="bibr">[19,</ref><ref type="bibr">36]</ref>. They identified pitfalls and unique situations that non-ML-experts experience when engaging with these services. A related study identified two sources of human-AI interaction design complexity: ML's capability uncertainty and output complexity, which respectively affect designers' understanding of the ML system and how they conceptualize the system's behaviors in order to choreograph its interactions <ref type="bibr">[35]</ref>. Other research has been conducted to understand the challenges UX designers face when working with AI <ref type="bibr">[11]</ref> <ref type="bibr">[33]</ref>. 
Findings showed that UX designers saw challenges in three main areas: envisioning what ML might be, working with ML as a design material, and ethical concerns about how to purposefully use ML <ref type="bibr">[11]</ref>. Additionally, data scientists were found to over-trust and misuse interpretability tools, and few of the participants were able to accurately describe the visualizations output by these tools <ref type="bibr">[19]</ref>. Some work has been done to help non-experts understand and use ML. The field has responded with resources, tools, and processes to support ML novices and design practitioners in designing and working with AI. These materials include curricula to improve data science skills <ref type="bibr">[Patrick Hebron. 2016]</ref>. Yang et al. <ref type="bibr">[36]</ref> provided three general principles on how to guide non-experts to easily build robust ML solutions: (1) "grounding its interaction design in nonexperts' intuitive approach to ML"; (2) "scaffolding and safeguarding a robust model building process"; (3) "supporting users of various needs and skill sets, both in terms of ML knowledge and programming skills". While we have useful guidelines for non-experts in the space to work with ML models, there is a lack of domain-specific tools. Domain-specific tools are essential in helping non-experts make algorithmic decisions during their working processes, as they could face common problems such as inherent trade-offs in implementing multiple design goals in the algorithm <ref type="bibr">[37]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Explaining AI and Machine Learning algorithms</head><p>Bringing explainability to ML models is one of the major ways to help non-experts work effectively and responsibly with those complex systems. Prior research in this space can be categorized into two approaches: explaining an algorithm's individual decisions, and explaining an algorithm's performance and outcomes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.1">Explaining individual decisions.</head><p>Many studies have explored algorithmic approaches that help humans better understand the results of ML algorithms, with the goal of evaluating the model's decisions <ref type="bibr">[15]</ref>. Common strategies include transforming complex models into simpler ones (e.g., transforming neural networks into linear models or decision trees) through global and local approximation <ref type="bibr">[10,</ref><ref type="bibr">25]</ref>, visualizing the prediction result <ref type="bibr">[7,</ref><ref type="bibr">15,</ref><ref type="bibr">26,</ref><ref type="bibr">31]</ref>, or decomposing predictions into relevance scores for model inputs <ref type="bibr">[4]</ref>. A variety of ML interpretability methods have also been tried in different real-world situations, including health care and finance. Other work explores how to communicate bias and model fairness <ref type="bibr">[8]</ref>.</p><p>The HCI community has explored interactivity and learnability for designing visualizations that better support interpretability (e.g., <ref type="bibr">[1,</ref><ref type="bibr">2,</ref><ref type="bibr">22]</ref>). Cheng et al. studied how different explanation strategies (e.g., "black-box" versus "white-box", and "interactive" versus "static") affected novice stakeholders' understandings of the model's decision in an algorithm-assisted college admission scenario. Elzen et al. developed a system for interactive construction and analysis of decision trees, which enabled domain experts to understand the inner workings of the algorithm and to apply their domain-specific knowledge to its optimization <ref type="bibr">[12]</ref>. In industry, companies such as Google and Facebook have also built visualization tools that provide support for model interpretability in running ML experiments. 
For example, Captum Insights <ref type="foot">6</ref> is an interactive visualization tool built along with the Captum interpretability library on PyTorch that helps ML engineers understand feature attribution behind individual model predictions. Although many of those tools provide effective support for interpretability in AI, most of them are applied directly to ML models and aimed at ML engineers; thus, there is a lack of support for application designers who are non-experts working with machine learning services (see Section 2.1).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.2">Explaining Performance and Outcome.</head><p>Other explainable AI research has been conducted around explaining a model's performance, trade-offs, and fairness based on model outcomes. Mitchell et al. <ref type="bibr">[23]</ref> proposed Model Cards, which accompany machine learning models that have been released into the public domain; they focus on trained model characteristics and inform users about what machine learning systems can do, related errors, and actions toward fairer and more inclusive outcomes. A recent interview study with pathologists about a diagnostic AI assistant found that users also wanted to know the design objectives of the AI systems and the "inherent trade-offs that the designers of the intelligent systems must navigate in implementing the system" <ref type="bibr">[6]</ref>. Yu et al. <ref type="bibr">[37]</ref> proposed a general two-step method to help designers and users explore algorithmic trade-offs: (1) given a set of design objectives (and corresponding system criteria), generate a family of prediction models with a wide spectrum of trade-offs; and (2) create interactive interfaces to visualize the trade-offs.</p><p>In both the popular and academic press, the potential for ML systems to amplify social inequities and unfairness is receiving increasing attention <ref type="bibr">[18]</ref>. Using loan granting scenarios, Wattenberg et al. <ref type="bibr">[29]</ref> developed an interactive interface that shows how classifiers could potentially be unfair and also demonstrated potential strategies to turn an unfair classifier into a fairer classifier. Cabrera et al. <ref type="bibr">[5]</ref> created FairVis, a mixed-initiative system that allows data scientists to explore intersectional bias in their models across both suggested and user-specified subgroups. 
Model probing platforms, such as Google's WIT, have also adopted several bias detection and mitigation features: e.g., calculating ML fairness metrics on trained models and applying fairness optimization strategies <ref type="bibr">[30]</ref>.</p><p>However, prior work in explainable AI has tended to primarily design, study, and evaluate techniques and approaches with expert and novice users; it is not clear how the methods and interfaces would help designers and developers in real-world design and development scenarios when building ML products. A number of interpretability methods and XAI design guidelines have been created to support this understanding. One research gap we have identified is exploring how visualization techniques could further help application designers and ML novices understand model trade-offs and fairness. We also see a need to directly deploy and evaluate these types of visualizations with Machine Learning APIs in a real-world context. In this paper, we address this gap through the design and evaluation of a set of interactive visualizations that assist application designers and ML novices designing with the ORES ML system in Wikipedia. Our findings also contribute to broader discussions around the use of interactive visualization in the explainable AI domain.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">ORES: MACHINE LEARNING SERVICE IN WIKIPEDIA</head><p>ORES is an algorithmic scoring service that supports real-time access to machine learning classifiers that predict useful characteristics of wiki edits by using multiple independent classifiers trained on different datasets <ref type="bibr">[16]</ref>. The service is maintained by the Scoring Platform Team at the Wikimedia Foundation<ref type="foot">foot_7</ref> , the non-profit that supports Wikipedia's technical, legal, and fiscal infrastructure. The Scoring Platform Team works with wiki communities (including various languages of Wikipedia, Wiktionary, Wikidata, and other Wikimedia supported wikis) to identify opportunities in which users can apply machine learning to support wiki work processes and they develop and evaluate classifiers in collaboration with community members.</p><p>For the purposes of this paper, we focus on two specific types of classifiers available in ORES: the damaging and goodfaith models. These classifiers use a gradient boosting strategy to learn from examples of edits as labeled by Wikipedia editors, and they are commonly used in quality control processes like counter-vandalism <ref type="bibr">[13]</ref>. The damaging model is trained to find and highlight problematic edits for review, while the goodfaith model is trained to find intentional vandalism and to help reviewers distinguish these edits from good-faith mistakes.</p><p>ORES's primary interface is a RESTful API that provides access to the models and generates a score for any edit in the history of Wikipedia as needed. The interface also provides access to the raw feature values that ORES uses to make predictions, the fitness statistics derived while testing the model, and even allows for the injection of synthetic feature values (aka counter-factuals) for inspection and experimentation. 
See <ref type="bibr">[16]</ref> for an overview of these features.</p><p>ORES is used widely by volunteer application designers and professional product teams at the Wikimedia Foundation to design intelligent user interfaces and robots for editing Wikipedia's pages directly. As of September 2020, the documentation lists over 30 different interfaces, robots, and secondary data services that use ORES <ref type="foot">8</ref>. Examples include the popular Recent Changes page <ref type="foot">9</ref> and the Huggle tool <ref type="foot">10</ref>, which allow users to review and revert recent revisions in Wikipedia articles. These tools use ORES to filter and flag suspected damaging and bad-faith edits that need more attention and allow users to revert problematic edits quickly.</p></div>
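<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the API description above concrete, the following Python sketch builds a scoring request URL and extracts the two probabilities from a response. It is an illustration only: the endpoint path and the nested JSON layout reflect our understanding of the public ORES v3 API and should be checked against the current documentation, and the helper names (build_score_url, extract_probabilities) and the sample response values are hypothetical.</p><p><![CDATA[
```python
from urllib.parse import urlencode

# Assumed base URL of the public ORES v3 scoring endpoint; verify before use.
ORES_BASE = "https://ores.wikimedia.org/v3/scores"


def build_score_url(wiki, rev_ids, models=("damaging", "goodfaith")):
    """Build a scoring URL for one wiki, a list of revision IDs, and models."""
    query = urlencode({
        "models": "|".join(models),
        "revids": "|".join(str(r) for r in rev_ids),
    })
    return f"{ORES_BASE}/{wiki}/?{query}"


def extract_probabilities(response, wiki, rev_id):
    """Return the probability of the 'true' class for each model.

    Assumes the nested layout wiki -> scores -> rev_id -> model -> score.
    """
    scores = response[wiki]["scores"][str(rev_id)]
    return {
        model: scores[model]["score"]["probability"]["true"]
        for model in ("damaging", "goodfaith")
    }


# Hypothetical response for illustration only (not real ORES output).
sample_response = {
    "enwiki": {
        "scores": {
            "123456": {
                "damaging": {"score": {
                    "prediction": False,
                    "probability": {"false": 0.92, "true": 0.08}}},
                "goodfaith": {"score": {
                    "prediction": True,
                    "probability": {"false": 0.03, "true": 0.97}}},
            }
        }
    }
}

url = build_score_url("enwiki", [123456])
probs = extract_probabilities(sample_response, "enwiki", 123456)
print(url)
print(probs)
```
]]></p><p>An application designer would then compare each probability against a threshold of their choosing, which is the decision that the rest of this paper focuses on.</p></div>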
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">METHOD &amp; DESIGN PROCESS</head><p>We adopted an iterative, user-centered design process, with three phases: User Research, Design Objectives Gathering, and Iterative Design &amp; Development.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">User Research</head><p>We conducted interviews to understand the ORES ecosystem and the existing pain points and challenges that application designers face when using the ORES API. We also aimed to identify opportunities to improve participants' experiences. We recruited six people who had previously worked with the ORES API by using the public research mailing list: wiki-research-l@lists.wikimedia.org. Our final interview sample consists of three ORES-based application designers, one ORES creator, one ORES core developer, and one researcher who developed research tools using ORES. Each interviewee was compensated with a $20 gift card for an approximately 45-minute interview. The semi-structured interview had a list of predefined questions along with open-ended discussion topics to help define the challenges that participants (or the application designers that they know) faced in the development process. Based on the interviews, we surfaced common themes, derived insights, and identified design opportunities.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Design Objectives</head><p>Analysis of our interview data led to the creation of four design objectives that outlined our goals in developing the ORES Explorer visualization system:</p><p>&#8226; Explain digestible ML concepts in the context of Wikipedia. Most participants mentioned that they did not have a Machine Learning background prior to working with ORES and thus struggled to understand ML concepts when reading the documentation. For example, when asked about background, one participant answered that "everything that I know about AI and ML is through working with ORES. What I do is that I have to look things up over and over". Although the ORES documentation does provide links to Wiki pages for most of the terminology used, it also forces application designers to constantly go back and forth between different Wiki pages. To solve this problem, ORES Explorer could provide some basic explanation of the core ML concepts within the context of Wikipedia. This explanation could help reduce the barriers and confusion for the application designers before they make important decisions using the ORES models.</p><p>&#8226; Facilitate understanding of trade-offs in threshold setting. Choosing the right threshold is the key to effectively using the ORES predictions. To do that, application designers need to understand the algorithmic trade-offs and make informed decisions based on the context. For example, a low threshold ensures that the majority of damage is caught at the cost of needing to review more edits. Conversely, a high threshold minimizes the harm of automatically reverting good edits but might let a large number of damaging edits pass by <ref type="foot">11</ref>. During our interviews, several participants expressed concerns over choosing the appropriate threshold for their tools. 
One participant, for example, found some guidance on the ORES Wiki page, but later found out that the result was not as promising and sufficient as she had expected. She wished that there were more resources for her to dig into and explore herself. We therefore wanted to provide a way for users to explore different threshold settings and intuitively understand how those settings will affect the different values they care about.</p><p>&#8226; Suggest threshold settings based on use cases and design goals. Application designers use ORES for different reasons and to build different types of tools. For example, some tools are built to flag damaging edits for human reviewers while others (automated bots) can directly revert them. The purpose of the tool can directly affect how application designers should set the thresholds. For example, a higher threshold should be used for automated bots in order to catch the most damaging edits while avoiding misclassifications of good edits. However, some application designers just blindly go with "50%" since it looks like the default option. Thus, we believe that suggested thresholds should be given as a reference based on application designers' goals and the type of tools that they are building.</p><p>&#8226; Explain model performance disparity in editor groups.</p><p>Prior work <ref type="bibr">[16,</ref><ref type="bibr">28]</ref> demonstrates that there is a disparity in ORES performance on edits from different editor groups. For example, the model is more aggressive toward edits from newcomers and anonymous editors. Thus, good edits from these two groups are more likely to be misclassified as damaging than edits from experienced, logged-in editors. During the interview, several participants mentioned that they had noticed or heard of such a disparity in their previous experience of working with ORES but had difficulty further exploring the disparity. 
For example, one participant mentioned that it would be helpful to access any information about the potential issues before using the ORES API. We would, therefore, like to provide ways for people to explore and recognize model disparity, which would be helpful in driving important product feature decisions.</p></div>
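<div xmlns="http://www.tei-c.org/ns/1.0"><p>The threshold trade-off and the group disparity described in the objectives above can be illustrated with a small Python sketch. It is not the ORES Explorer implementation: the function names and the toy (score, label) pairs are hypothetical, and we assume that an edit is flagged as damaging when its score is greater than or equal to the threshold.</p><p><![CDATA[
```python
def confusion_counts(edits, threshold):
    """Count TP, FP, TN, FN when edits scoring >= threshold are flagged."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for score, is_damaging in edits:
        flagged = score >= threshold
        if flagged and is_damaging:
            counts["tp"] += 1
        elif flagged:
            counts["fp"] += 1
        elif is_damaging:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def rates(counts):
    """Derive accuracy, false positive rate, and false negative rate."""
    total = sum(counts.values())
    negatives = counts["fp"] + counts["tn"]
    positives = counts["tp"] + counts["fn"]
    return {
        "accuracy": (counts["tp"] + counts["tn"]) / total,
        "fpr": counts["fp"] / negatives if negatives else 0.0,
        "fnr": counts["fn"] / positives if positives else 0.0,
    }


# Hypothetical (score, is_damaging) pairs per editor group; not ORES data.
edits_by_group = {
    "newcomer": [(0.9, True), (0.7, False), (0.6, False),
                 (0.4, True), (0.2, False), (0.1, False)],
    "experienced": [(0.8, True), (0.6, True), (0.3, False),
                    (0.2, False), (0.1, False), (0.05, False)],
}

threshold = 0.5
group_rates = {
    group: rates(confusion_counts(edits, threshold))
    for group, edits in edits_by_group.items()
}
# Disparity expressed as the gap in false positive rates between groups.
fpr_gap = group_rates["newcomer"]["fpr"] - group_rates["experienced"]["fpr"]
print(group_rates)
print(fpr_gap)
```
]]></p><p>In this toy data, good edits from newcomers receive higher scores than good edits from experienced editors, so the same threshold produces a higher false positive rate for newcomers. This is the kind of gap that the fourth objective aims to surface.</p></div>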
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Iterative Design and Development</head><p>We generated different design concepts that met our design objectives, ultimately developing an interactive visualization system based on feedback from informal discussion with Wikipedia domain experts. We further refined the final concept, starting with low-fidelity mock-ups and moving to high-fidelity designs, using a similar iterative process. We developed the front-end visualization website in React, iteratively coding by connecting to a sample dataset of 500 Wikipedia edits with ORES prediction scores. This process resulted in the development of a visualization tool with fully interactive features that were published as a functional website.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.1">Data.</head><p>The sample data we used in the visualizations consists of 500 Wikipedia edits randomly selected from the training data of the English ORES model. To avoid over-fitting and showing untrue performance metrics, we constructed a replica of the ORES machine learning model, using exactly the same machine learning algorithm (Gradient Boosting), label weights (10 for damaging edits, 1 for non-damaging edits), and hyper-parameters, while excluding the 500 edits we selected from the training data. Then, we applied this replicated ORES prediction model to the 500 edits on which the model was not trained, to generate out-of-sample prediction results and performance.<note place="foot" n="11"><ref type="url">https://www.mediawiki.org/wiki/ORES/Thresholds</ref></note></p><p>For the purpose of visualizing model performance across groups, we selected a balanced number of newcomer and experienced editors' edits for the 500-edit dataset. Within each editor group, we also balanced the number of damaging and non-damaging edits. We balanced the data with respect to the types of editors and the number of damaging and non-damaging edits to make the visualizations more illustrative. For example, 96% of the data in the original training dataset is labeled as non-damaging; showing the distribution with the original ratio would therefore make the damaging edits extremely difficult to identify. Having a data set that is balanced on types of editors and within-group damaging rates allows users to make clear comparisons between different editor groups, as shown in Figure <ref type="figure">4</ref>.</p></div>
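<div xmlns="http://www.tei-c.org/ns/1.0"><p>The sampling procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the record fields (rev_id, group, damaging), the helper names, the synthetic pool, and the choice of exactly 125 edits per combination of editor group and label are all hypothetical. Training the replica model itself is omitted.</p><p><![CDATA[
```python
import random
from collections import defaultdict


def balanced_sample(edits, per_cell, seed=0):
    """Draw per_cell edits for every (editor group, damaging label) cell."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for edit in edits:
        cells[(edit["group"], edit["damaging"])].append(edit)
    sample = []
    for key in sorted(cells):
        if len(cells[key]) < per_cell:
            raise ValueError(f"not enough edits in cell {key}")
        sample.extend(rng.sample(cells[key], per_cell))
    return sample


def split_holdout(edits, holdout):
    """Remove the held-out edits from the pool used to train the replica."""
    held_ids = {edit["rev_id"] for edit in holdout}
    return [edit for edit in edits if edit["rev_id"] not in held_ids]


# Synthetic labeled pool in which 4% of edits are damaging (1 in 25).
pool = [
    {"rev_id": i,
     "group": "newcomer" if i % 2 == 0 else "experienced",
     "damaging": i % 25 == 0}
    for i in range(20000)
]

holdout = balanced_sample(pool, per_cell=125)  # 4 cells x 125 = 500 edits
training = split_holdout(pool, holdout)
print(len(holdout), len(training))
```
]]></p><p>Because the held-out edits are removed before training, metrics computed on them are not inflated by the model having already seen them. The balancing step changes the class ratio, so rates shown on the sample illustrate behavior rather than estimate how often errors occur on live traffic.</p></div>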
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">ORES EXPLORER SYSTEM</head><p>ORES Explorer is an interactive visualization website that allows users to explore and understand the algorithmic trade-offs and model fairness in ORES via a sample dataset. The visualization website was designed to meet our design objectives through the combination of four individual visualizations:</p><p>&#8226; About ORES: Explaining the ORES API and the necessary ML knowledge specific to Wikipedia</p><p>&#8226; Threshold Explorer: Visualizing trade-offs in setting different performance thresholds</p><p>&#8226; Threshold Recommender: Recommending thresholds based on the application type</p><p>&#8226; Group Disparity Visualizer: Communicating model performance disparity in editor groups</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Landing Page: About ORES</head><p>About ORES serves as a landing page that provides a basic overview of the ORES system. The goal of the About ORES section is to provide the necessary knowledge for users to explore the following pages. Visualizations are used to explain concepts such as how ORES scores edits, and how ORES makes predictions based on threshold settings and the prediction confusion matrix in the context of Wikipedia.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Threshold Explorer</head><p>The Threshold Explorer is the interactive visualization that allows users to explore the impact of different thresholds on model performance. Users first choose a model and then move the top slider to see changes in the model's predictions and performance. The graph underneath the slider directly represents individual edits and their prediction results, distributed from left to right based on their prediction scores. </p></div>
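<div xmlns="http://www.tei-c.org/ns/1.0"><p>The relationship that the slider exposes can be approximated by sweeping the threshold over a set of scored edits and recomputing the metrics at each step. The Python sketch below is a hedged illustration, not the code behind the Threshold Explorer: the function name, the step size, and the toy data are hypothetical, and edits are assumed to be flagged when their score is greater than or equal to the threshold.</p><p><![CDATA[
```python
def sweep_thresholds(edits, steps=10):
    """Return one row of metrics per threshold from 0.0 to 1.0 inclusive."""
    positives = sum(1 for _, damaging in edits if damaging)
    negatives = len(edits) - positives
    rows = []
    for i in range(steps + 1):
        threshold = i / steps
        # Good edits that get flagged, and damaging edits that slip through.
        fp = sum(1 for s, d in edits if s >= threshold and not d)
        fn = sum(1 for s, d in edits if s < threshold and d)
        rows.append({
            "threshold": threshold,
            "accuracy": (len(edits) - fp - fn) / len(edits),
            "fpr": fp / negatives if negatives else 0.0,
            "fnr": fn / positives if positives else 0.0,
        })
    return rows


# Hypothetical (score, is_damaging) pairs; not ORES data.
edits = [(0.95, True), (0.85, True), (0.75, False), (0.65, True),
         (0.55, False), (0.45, True), (0.35, False), (0.25, False),
         (0.15, False), (0.05, False)]

curve = sweep_thresholds(edits)
best = max(curve, key=lambda row: row["accuracy"])
for row in curve:
    print(row)
```
]]></p><p>As the threshold rises, the false positive rate can only fall and the false negative rate can only rise, while accuracy peaks somewhere in between. Plotting these rows against the threshold gives curves of the kind a performance panel would display.</p></div>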
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">RESULTS</head><p>Our participants found ORES Explorer to be useful in educating them about ML concepts and in helping them understand trade-offs and performance disparities in ML models. We organized our findings into the following three themes, along with several key findings:</p><p>Meeting design objectives on teaching contextualized ML concepts, threshold trade-offs, and model performance disparity</p><p>&#8226; Contextualized ML concepts in the About ORES page were helpful for participants with limited ML expertise to start learning about the ORES API.</p><p>&#8226; ORES Explorer improved participants' understanding of the trade-offs in setting different ORES model thresholds and the associated impact on the tool they chose to build in the design task.</p><p>&#8226; Participants were able to use Threshold Explorer and Threshold Recommender to make decisions and evaluate trade-offs about desired thresholds that aligned with their goals.</p><p>&#8226; Although Group Disparity Visualizer helped surface the ORES model's performance disparity in different editor groups, most participants accepted the disparity as a natural occurrence and were not concerned about fairness implications in the system.</p><p>Participants' perceptions of the underlying ML system</p><p>&#8226; When balancing different performance metrics to select a model threshold, participants tended to prioritize model accuracy over other metrics.</p><p>&#8226; Overall, ORES Explorer helped participants to trust the underlying AI models more. The visualizations surfaced AI models' limitations and the resulting transparency allowed participants to see the AI models not as "black boxes," but rather as helpful and editable tools.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Effectiveness of the visualization tools for users with different backgrounds</head><p>&#8226; ORES Explorer was shown to be effective in helping participants understand the ORES ML system without Wikipedia domain background.</p><p>In the following section, we discuss our findings in detail.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.1">Assist contextual learning of ML concepts in Wikipedia</head><p>We found that learning about ML concepts in the context of Wikipedia is essential for application designers to start designing with the ORES API. Participants mentioned that, at the beginning of the design task, they were challenged to understand what different performance metrics, such as False Negatives and False Positives, meant in order to learn about the algorithmic trade-offs and model disparity. In the group outside of Wikipedia, P4 commented &#8220;I think the difficult part is to understand what exactly is a false and positive means in a certain scenario&#8221; (P4). From the group inside Wikipedia, W1 commented &#8220;I always forget the meanings of the false positive rate and false negative rate. I always have to interpret in the context of the model&#8221; (W1). Participants thought that the one-sentence explanation (underneath each concept), as well as the use of iconography (circles and crosses), were helpful in facilitating the contextual understanding of the confusion matrix. However, participants who had no prior experience with ML were sometimes confused, mixing up the meanings of false positives and false negatives. &#8220;I think decreasing false positives might be the most important . . . Actually no, sorry, let me figure out what the terms are. It always gets so confusing . . . Okay, no, yeah, I'm gonna say percentage of false negatives is more important to be lower. . .&#8221; (P5).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.2">Improve overall understanding on model threshold trade-offs</head><p>We observed that participants gradually gained knowledge about the ORES model threshold trade-offs by interacting with the Threshold Explorer. As participants moved the threshold slider and perceived changes, they intuitively understood the underlying trade-offs among key model performance metrics in the context of Wikipedia. &#8220;I see, the lower the threshold goes, a lot more edits sneak by [false positives] . . . But at the same time, that also lets more good-faith edits through without being flagged [true negatives]&#8221; (W5). Along with the threshold slider, the performance charts on the performance panel also helped surface trade-offs, specifically for accuracy, false positive rate, and false negative rate. All of the participants in our study based their decisions, at least partially, on the performance charts, and seven out of ten participants reported primarily focusing on them. As P5 commented, &#8220;false positives and false negatives are pretty clearly inversely related to each other [as] you can see from the shape of the graphs, whereas accuracy sort of has a sweet spot.&#8221; Furthermore, Threshold Explorer helped participants understand how to adjust the model threshold in order to prioritize different performance metrics, which correspond with different community values. For example, all participants increased the threshold when asked to minimize misclassifying good edits in Wikipedia. W4 indicated &#8220;I'll have to play with the false positive rate. Okay, when I move this way it's reducing good edits that are falsely identified as damaging. So then I need to move this to the right.&#8221; The model trade-offs that were shown are valuable for making model threshold decisions.</p></div>
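<div xmlns="http://www.tei-c.org/ns/1.0"><p>The inverse relationship between false positives and false negatives that participants read off the performance charts can be reproduced with a small threshold sweep. The sketch below is illustrative only: the data are hypothetical, and it assumes that an edit is flagged as damaging when its score is at or above the threshold.</p>
<p><![CDATA[
```python
from typing import List, Sequence, Tuple


def rates_at(scores: Sequence[float], labels: Sequence[bool],
             threshold: float) -> Tuple[float, float, float]:
    """Return (accuracy, false positive rate, false negative rate)."""
    tp = fp = tn = fn = 0
    for score, damaging in zip(scores, labels):
        flagged = score >= threshold  # assumed flagging rule
        if flagged and damaging:
            tp += 1
        elif flagged:
            fp += 1
        elif damaging:
            fn += 1
        else:
            tn += 1
    accuracy = (tp + tn) / len(labels)
    fpr = fp / (fp + tn) if fp + tn else 0.0
    fnr = fn / (fn + tp) if fn + tp else 0.0
    return accuracy, fpr, fnr


def sweep(scores: Sequence[float], labels: Sequence[bool],
          thresholds: Sequence[float]) -> List[Tuple[float, float, float, float]]:
    """One (threshold, accuracy, fpr, fnr) row per slider position."""
    return [(t, *rates_at(scores, labels, t)) for t in thresholds]


# Hypothetical sample edits: scores and true "damaging" labels.
scores = [0.05, 0.20, 0.35, 0.55, 0.60, 0.80, 0.90, 0.95]
labels = [False, False, True, False, True, False, True, True]

rows = sweep(scores, labels, [0.1, 0.3, 0.5, 0.7, 0.9])
for threshold, accuracy, fpr, fnr in rows:
    print(f"t={threshold:.1f}  acc={accuracy:.3f}  fpr={fpr:.2f}  fnr={fnr:.2f}")
```
]]></p>
<p>On this toy sample, the false positive rate never rises and the false negative rate never falls as the threshold increases, while accuracy moves non-monotonically between them.</p></div>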
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.3">Enable participants to select thresholds based on their design goals</head><p>The Threshold Recommender facilitated participants' decision making in selecting thresholds based on their application design goals. The average damaging threshold among the seven participants who designed review tools was 0.48 (standard deviation = 0.09), while the average damaging threshold among the three participants who designed automated bots was 0.61 (standard deviation = 0.08). Commenting on his experience choosing a threshold to build an automated bot, W2 said, &#8220;[If I choose a] small threshold we have lots of good faith and not damaging edits that are classified as damaging so that is quite annoying especially if we are making a bot that will revert automatically.&#8221; Thus, some participants found Threshold Recommender useful as it directly suggested thresholds that were aligned with their design goals. &#8220;I can just choose one type of model based on what I want&#8221; (P1). While some participants found that Threshold Recommender could be more practical and require less cognitive workload compared to Threshold Explorer, others also found it less educational in facilitating their understanding of the underlying relationships among different model performance metrics. &#8220;I think the first one (Threshold Explorer) explains the relationship better while I'm playing with the visualization and looking at the graphs on the right. I think like the first one is more educational versus this one (Threshold Recommender) is more practical&#8221; (P1).</p><p>On the other hand, the threshold suggestion from Threshold Recommender could also lead participants to question the logic behind those direct suggestions. Some participants conveyed that they found the suggestion less trustworthy due to the lack of explainability. &#8220;I don't know why there's a default suggested number... and some explanation on how the suggestions are generated will be helpful&#8221; (P2).</p></div>
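<div xmlns="http://www.tei-c.org/ns/1.0"><p>One way a goal-based suggestion could be derived is sketched below. This section does not describe how Threshold Recommender actually computes its suggestions, so everything here is an assumption for illustration: the two goal names, the idea of capping one error rate while minimizing the other, the candidate thresholds, and the sample data are all hypothetical.</p>
<p><![CDATA[
```python
from typing import Optional, Sequence, Tuple

# Hypothetical design goals; not taken from ORES Explorer.
GOALS = ("protect_good_edits", "catch_damaging_edits")


def error_rates(scores: Sequence[float], labels: Sequence[bool],
                threshold: float) -> Tuple[float, float]:
    """Return (false positive rate, false negative rate) at a threshold."""
    tp = fp = tn = fn = 0
    for score, damaging in zip(scores, labels):
        flagged = score >= threshold  # assumed flagging rule
        if flagged and damaging:
            tp += 1
        elif flagged:
            fp += 1
        elif damaging:
            fn += 1
        else:
            tn += 1
    fpr = fp / (fp + tn) if fp + tn else 0.0
    fnr = fn / (fn + tp) if fn + tp else 0.0
    return fpr, fnr


def recommend_threshold(scores: Sequence[float], labels: Sequence[bool],
                        goal: str, cap: float,
                        candidates: Sequence[float]) -> Optional[float]:
    """Cap the error rate the goal cares about, then minimize the other one.

    Ties are broken in favor of the lowest threshold. Returns None when no
    candidate satisfies the cap.
    """
    if goal not in GOALS:
        raise ValueError(f"unknown goal: {goal!r}")
    feasible = []
    for t in candidates:
        fpr, fnr = error_rates(scores, labels, t)
        capped, other = (fpr, fnr) if goal == "protect_good_edits" else (fnr, fpr)
        if capped <= cap:
            feasible.append((other, t))
    return min(feasible)[1] if feasible else None


# Hypothetical sample edits and slider positions.
scores = [0.05, 0.20, 0.35, 0.55, 0.60, 0.80, 0.90, 0.95]
labels = [False, False, True, False, True, False, True, True]
candidates = [0.1, 0.3, 0.5, 0.7, 0.9]

bot_threshold = recommend_threshold(scores, labels, "protect_good_edits", 0.25, candidates)
review_threshold = recommend_threshold(scores, labels, "catch_damaging_edits", 0.0, candidates)
print(bot_threshold, review_threshold)
```
]]></p>
<p>Exposing the cap and the rule used to break ties is one possible response to the request for an explanation of how suggestions are generated, since both directly determine which threshold is returned.</p></div>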
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.4">Participants attributed the model performance disparity to a natural occurrence</head><p>The Group Disparity Visualizer helped to surface how the model could perform differently among different editor groups. By interacting with this visualization, eight out of ten participants found it intuitive to understand the model performance disparity. &#8220;I just feel like the system treats these two user groups differently. Definitely obvious, it is more aggressive to newcomers and more gentle to the experienced edits&#8221; (P4). However, nine out of ten participants also paid more attention to the editing ability of the two editor groups than to the model's performance on the two groups.</p><p>While participants perceived the performance disparity and potential fairness problem within the model, seven out of ten participants (four out of five Wikipedia participants, and three out of five non-Wikipedia participants) attributed the performance disparity to a natural occurrence, and were, therefore, not particularly concerned about potential fairness problems. For example, P3 commented that &#8220;this makes sense, because for experienced edits, the machine already has experience looking at their edits, so it has a higher accuracy&#8221; (P3). From the group inside Wikipedia, W1 commented that &#8220;it's probably to be expected, because for experienced users, they will likely have more examples of edits to train the model... maybe it's over-fitting to experienced editors' edits, [but] I don't think it's necessarily bad because we want more edits which look like it's made by experienced users&#8221; (W1).</p></div>
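<div xmlns="http://www.tei-c.org/ns/1.0"><p>The distinction between editor performance and model performance becomes concrete when error rates are computed separately for each editor group at the same threshold. The following sketch is an illustration under stated assumptions, not the Group Disparity Visualizer itself: the group names follow the text, but the records are invented and the flagging rule (score at or above the threshold) is assumed.</p>
<p><![CDATA[
```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (editor group, prediction score, edit is truly damaging)
Record = Tuple[str, float, bool]


def rates_by_group(records: Iterable[Record],
                   threshold: float) -> Dict[str, Dict[str, float]]:
    """Compute false positive and false negative rates per editor group."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for group, score, damaging in records:
        flagged = score >= threshold  # assumed flagging rule
        # First letter: was the prediction correct? Second: was the edit flagged?
        outcome = ("t" if flagged == damaging else "f") + ("p" if flagged else "n")
        counts[group][outcome] += 1
    result = {}
    for group, c in counts.items():
        good_edits = c["fp"] + c["tn"]
        damaging_edits = c["tp"] + c["fn"]
        result[group] = {
            "fpr": c["fp"] / good_edits if good_edits else 0.0,
            "fnr": c["fn"] / damaging_edits if damaging_edits else 0.0,
        }
    return result


# Hypothetical records, invented purely for illustration.
records = [
    ("newcomer", 0.70, False), ("newcomer", 0.60, False), ("newcomer", 0.30, False),
    ("newcomer", 0.80, True), ("newcomer", 0.40, True),
    ("experienced", 0.10, False), ("experienced", 0.20, False), ("experienced", 0.55, False),
    ("experienced", 0.90, True), ("experienced", 0.65, True),
]

rates = rates_by_group(records, threshold=0.5)
for group, r in rates.items():
    print(group, r)
```
]]></p>
<p>Both invented groups contain the same share of damaging edits, so the higher false positive rate for newcomers in this toy example reflects how the scores are distributed for their good edits, not how often they make damaging ones.</p></div>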
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.5">Participants prioritized accuracy when balancing different performance metrics</head><p>Based on the understanding of model threshold trade-offs, most participants tried to strike a balance between multiple evaluation metrics for the model in the evaluation design task. At first glance, P2 noted: &#8220;Okay, I guess the goal here is to like, find the balance, where we have a good number of accuracy and like, not too low on these two as well.&#8221; Seven out of ten participants mentioned that they would keep a high accuracy before considering other factors. &#8220;I'm trying to maximize accuracy, and at the same time lower false-positive rate or false-negative rate&#8221; (P1). Some participants also decided to compromise on certain performance metrics in order to achieve particular design goals. W5 commented &#8220;It's better to have a false positive than a false negative . . . a lot of the times false negatives will sneak through and they'll sit in an article for weeks or months before somebody goes, &#8216;Hey, this doesn't sound right&#8217;, and actually goes about changing it.&#8221;</p></div>
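<div xmlns="http://www.tei-c.org/ns/1.0"><p>A short worked example shows why keeping accuracy high does not by itself guarantee that damaging edits are caught when such edits are rare. The counts below are hypothetical (1,000 edits, of which 50 are damaging) and are not drawn from ORES data.</p>
<p><![CDATA[
```python
from typing import Tuple


def accuracy_and_recall(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float]:
    """Accuracy over all edits, and recall over truly damaging edits."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, recall


# Hypothetical setting: 1,000 edits, 50 of them damaging.
# A threshold so high that nothing is ever flagged:
never_flag = accuracy_and_recall(tp=0, fp=0, tn=950, fn=50)

# A lower threshold that catches 40 of the 50 damaging edits
# at the cost of flagging 80 good edits:
catches_most = accuracy_and_recall(tp=40, fp=80, tn=870, fn=10)

print("never flag:  ", never_flag)
print("catches most:", catches_most)
```
]]></p>
<p>In this invented setting, the configuration that never flags anything has the higher accuracy even though it catches no damaging edits, which is why accuracy is usually read alongside the false positive and false negative rates.</p></div>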
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.6">Trade-offs education and transparency enhanced overall user trust</head><p>After interacting with ORES Explorer, eight out of ten participants indicated that they had gained more trust in using the machine learning model. With limited machine learning expertise, some participants mentioned that the transparency that the visualizations provided them helped them to understand how the model actually works so that they were no longer perceiving it as only a &#8220;black box&#8221; capability. &#8220;Now that I can visualize what is actually happening with this data set, I could just tell that it actually does work. And it's not just voodoo magic occurring behind the API&#8221; (W5). &#8220;I definitely feel like seeing this type of visualization helps me trust AI more in the sense that I understand how much of it is actually due to human decision making like, because the AI model is really just like complicated statistics and it's the humans who decide what matters on which Just kind of an opinion that I already have&#8221; (P5).</p><p>Participants also talked about how the experience of having guidance when making customized adjustments to the model threshold contributed to a sense of &#8220;shared responsibility&#8221; with the machine learning model. &#8220;I felt I have a shared responsibility with the model to prioritize which kind of things I will get right and which I will get wrong, because I cannot have both at the same time in this. Let's say this slide here provides something like that I can maximize getting one thing and I have to give up on the other aspect&#8221; (W1). &#8220;I like the way that it's customizable. So I feel like it's not just the AI working on itself.&#8221; (P3).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.7">The visualization tools are shown to be effective for participants without Wikipedia domain expertise</head><p>When comparing the interview sessions between the two groups of participants recruited inside and outside Wikipedia, we found no distinct difference in how the two groups perceived and interacted with ORES Explorer. Based on our recording of the thresholds that participants chose in the design task, the group of participants outside Wikipedia on average tended to choose moderately more aggressive prediction thresholds as compared to the Wikipedia participants. The Wikipedia group chose an average threshold of 0.58 (standard deviation = 0.08), while the group outside of Wikipedia chose an average threshold of 0.46 (standard deviation = 0.1). The group outside of Wikipedia also recognized how the algorithmic trade-offs would affect Wikipedia's community values. &#8220;I'm gonna say the percentage of false negatives is more important to be lower in this case (building a review tool). . . . too many false negatives will really downgrade the quality of the system for the readers&#8221; (P5). These results showed that ORES Explorer can be helpful for users from different backgrounds (editors, developers, product managers, and designers) or users with limited application domain knowledge.</p><p>In evaluating why participants outside of Wikipedia chose more aggressive prediction thresholds, one explanation is that they might tolerate more false positives (flagging good edits as damaging) since they have not done any editing work or worked with Wikipedia editors before. Participants inside Wikipedia, by comparison, tended to tolerate fewer false positives: they were more familiar with the reviewing process and expressed more concern that having more false positives would overwhelm the reviewers shouldering review work. 
W5 commented that one &#8220;difficulty with setting a threshold is that you don't want to get so many false positives that the reviewer starts getting fatigued and just will not use that tool anymore.&#8221; Further research evaluating how users' backgrounds affect the ways that they balance the algorithmic trade-offs when designing with machine learning services would be worthwhile.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9">DISCUSSION</head><p>Our research investigated two significant challenges faced by ML novice designers when designing ML applications: deciding on model thresholds, and understanding model fairness. When building most AI products with classification tasks, setting the appropriate model decision threshold is a crucial step in determining how the product will weigh false positives and false negatives, which can have lasting impacts on both user experiences and societal outcomes in some high-stakes industries. On the other hand, understanding model fairness in terms of a model's performance among different social groups is also critical in helping designers make informed decisions on model usage and in helping them design responsible product features.</p><p>While ORES Explorer was designed and evaluated in the real-world scenario of the Wikipedia ORES system, we believe that our visualization framework is generalizable to other systems and could be applied in different use cases. Our evaluation revealed that even participants without a Wikipedia background gained a similar understanding of model trade-offs as compared to Wikipedia participants. In other online communities such as Facebook and Reddit, similar content moderation models have been widely applied in order to regulate fraudulent or dangerous content. Beyond content moderation, other high-stakes AI use cases, such as criminal sentencing or disease diagnosis, are also in need of tools to help application designers and stakeholders understand model trade-offs and fairness in order to make model threshold decisions and assess model limitations. 
Interactive features from our work, including threshold sliders, confusion matrix dot maps, and group performance visualizers, would allow application designers working in these contexts to effectively explore model trade-offs and fairness, to make decisions about model thresholds, or to compare models in order to minimize ML harm. One line of future work is to build on ORES Explorer to create a model-agnostic visualization tool that allows users to visualize model value trade-offs and fairness with custom models and use cases. Further research and evaluation around a model-agnostic tool could systematically improve transparency for designers using ML models.</p><p>In order to communicate model threshold trade-offs, we explored and evaluated two specific visualization methods. Threshold Explorer visualizes the changes to a model's accuracy, false positive rate, and false negative rate when users interact with the threshold slider on a sample dataset. Threshold Recommender took a different approach, directly suggesting threshold values based on a pre-defined set of design goals. In our study, we found that the two visualizations are complementary in facilitating an application builder's understanding of model trade-offs. While some participants commented that direct recommendation (Threshold Recommender) is more practical and efficient, others mentioned that it is less educational and less transparent when compared to the interactive exploration on the sample dataset (Threshold Explorer). It would be worthwhile, in future work, to thoroughly evaluate these visualization methods, along with other visualization techniques, to provide guidelines for how those methods can be adopted in different contexts.</p><p>While our evaluation provided evidence that our visualization is effective in communicating model threshold trade-offs in ORES, there are some limitations in this work that leave room for improvement and more design exploration in the future. 
We found that it was difficult to separate the participants' superficial understanding of model fitness from suitability. For example, when setting the model threshold, seven out of ten participants chose to prioritize the accuracy measure. This is surprising, because many workflows that use ORES in Wikipedia <ref type="bibr">[27]</ref> lend themselves better to measures like precision, recall, and false positive rate. Accuracy can be a misleading and less relevant metric, due to the typically imbalanced ratio between damaging and non-damaging edits. Thus, an additional topic for future research is to explore how we could leverage design to further explain ML concepts, such as accuracy, in a more familiar way.</p><p>The present research points to the need for a broader conversation about how ML novices perceive model fairness. When interacting with the Group Disparity Visualizer, some novice participants successfully perceived that the model was &#8220;treating&#8221; edits from different editor groups differently. However, most participants believed that the model should be more aggressive to newcomers because they are more likely to make mistakes. These participants were likely conflating editor performance (the quality of the editor's edits) with model performance (the model's ability to correctly classify the edits), and were thus unable to correctly evaluate the model's fairness. In future work, we could explore how to better help people distinguish between these two issues, and how to develop empathy for the disadvantaged groups such as, in this case, the newcomer editors. For example, we could provide the model predictions for a given editor's first few edits and let application designers see how their edits would be treated from a newcomer's perspective.</p><p>One interesting finding from this work was that participants reported developing greater trust towards the ORES system after interacting with the visualizations. 
As we previously mentioned in the results, the process of enabling application designers to explore the ORES model gave them a sense of &#8220;co-creation&#8221; and &#8220;shared responsibility.&#8221; Participants also commented that the design of the visual artifacts made the complex ML concepts more &#8220;approachable&#8221; and &#8220;intuitive to process&#8221;. This opens up a new opportunity to explore how to leverage design to better establish trust in AI systems. In many other real-world scenarios, such as social media content moderation, criminal sentencing and disease diagnosis, the aspect of trust is essential to build more effective human-AI collaboration. Additionally, there is a potential to expand on the concept of &#8220;co-creation&#8221; and design in order to engage ORES users in the actual model development process. The Wikimedia Foundation has done some related work in the area of collaborative AI development. Jade<ref type="foot">foot_11</ref> is a MediaWiki extension that is designed to allow editors to annotate articles, revisions, diffs, and other Wikipedia components. It offers a simple workflow to collect human labels from the Wikipedia community and to provide data to train better versions of the ML models. In the future, we could explore ways to build on the transparency provided by our visualization tool with user feedback mechanisms that could help improve our models over time and further enhance user trust towards the overall ML system.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="10">CONCLUSION</head><p>In this study, we presented the design, development, and evaluation of ORES Explorer, an interactive visualization tool to communicate the algorithmic trade-offs and model performance disparity in ORES, the ML service for edit moderation in Wikipedia. We reported on a case study conducting in-depth interviews with current and potential application designers in and outside the Wikipedia community to evaluate our visualization. The study results demonstrated that our system helps designers and developers with limited ML knowledge develop a better understanding of the ML system and make decisions that align with their design goals. Our study points to future opportunities for the design community to explore general visualization techniques that provide greater transparency when designing with AI systems.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>Amazon Sagemaker Autopilot: https://aws.amazon.com/sagemaker/autopilot/</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>Google Cloud AutoML: https://cloud.google.com/automl</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"><p>Microsoft Azure Machine Learning: https://azure.microsoft.com/en-us/services/machine-learning/</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3"><p>IBM AutoAI: https://www.ibm.com/demos/collection/IBM-Watson-Studio-AutoAI/</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4"><p>https://www.perspectiveapi.com</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_5"><p>Prior research has demonstrated a growing need to assist application designers and ML novices in understanding trade-offs in ML systems trade-offs and in making better design decisions in<ref type="bibr">6</ref> Captum Insights: https://captum.ai/docs/captum_insights</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_6"><p>DIS '21, June 28-July 2, 2021, Virtual Event, USA Ye et al.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_7"><p>https://mediawiki.org/wiki/Wikimedia_Scoring_Platform_team</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_8"><p>https://www.mediawiki.org/wiki/ORES/Applications</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_9"><p>https://en.wikipedia.org/wiki/Special:RecentChanges</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_10"><p>https://en.wikipedia.org/wiki/Wikipedia:Huggle</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="12" xml:id="foot_11"><p>Jade: https://www.mediawiki.org/wiki/JADE</p></note>
		</body>
		</text>
</TEI>
