

Title: CounterFAccTual: How FAccT Undermines Its Organizing Principles
This essay joins recent scholarship in arguing that FAccT's fundamental framing of the potential to achieve the normative conditions for justice through bettering the design of algorithmic systems is counterproductive to achieving said justice in practice. Insofar as the FAccT community's research tends to prioritize design-stage interventions, it ignores the fact that the majority of the contextual factors that practically determine FAccT outcomes happen in the implementation and impact stages of AI/ML lifecycles. We analyze an emergent and widely-cited movement within the FAccT community for attempting to honor the centrality of contextual factors in shaping social outcomes, a set of strategies we term ‘metadata maximalism’. Symptomatic of design-centered approaches, metadata maximalism abstracts away its reliance on institutions and structures of justice that are, by every observable metric, already struggling (where not failing) to provide accessible, enforceable rights. These justice infrastructures, moreover, are currently wildly under-equipped to manage the disputes arising from digital transformation and machine learning. The political economy of AI/ML implementation provides further obstructions to realizing rights. Data and software supply chains, in tandem with intellectual property protections, introduce structural sources of opacity. Where duties of care to vulnerable persons should reign, profit incentives are given legal and regulatory primacy. Errors are inevitable and inextricable from the development of machine learning systems. In the face of these realities, FAccT programs, including metadata maximalism, tend to project their efforts in a fundamentally counter-factual universe: one in which functioning institutions and processes for due diligence in implementation and for redress of harms are working and ready to interoperate with. Unfortunately, in our world, these institutions and processes have been captured by the interests they are meant to hold accountable, intentionally hollowed-out, and/or were never designed to function in today's sociotechnical landscape. Continuing to produce (fair! accountable! transparent!) data-enabled systems that operate in high-impact areas, irrespective of this landscape's radically insufficient paths to justice, given the unavoidability of errors and/or intentional misuse in implementation, and the exhaustively-demonstrated disproportionate distribution of resulting harms onto already-marginalized communities, is a choice - a choice to be CounterFAccTual.
Award ID(s):
1828010
NSF-PAR ID:
10344140
Author(s) / Creator(s):
Date Published:
Journal Name:
FAccT '22: 2022 ACM Conference on Fairness, Accountability, and Transparency
Page Range / eLocation ID:
1982 to 1992
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In widely used sociological descriptions of how accountability is structured through institutions, an “actor” (e.g., the developer) is accountable to a “forum” (e.g., regulatory agencies) empowered to pass judgements on and demand changes from the actor or enforce sanctions. However, questions about structuring accountability persist: why and how is a forum compelled to keep making demands of the actor when such demands are called for? To whom is a forum accountable in the performance of its responsibilities, and how can its practices and decisions be contested? In the context of algorithmic accountability, we contend that a robust accountability regime requires a triadic relationship, wherein the forum is also accountable to another entity: the public(s). Typically, as is the case with environmental impact assessments, public(s) make demands upon the forum's judgements and procedures through the courts, thereby establishing a minimum standard of due diligence. However, core challenges relating to (1) lack of documentation, (2) difficulties in claiming standing, and (3) struggles around the admissibility of expert evidence on, and the achievement of consensus over, the workings of algorithmic systems in adversarial proceedings prevent the public from approaching the courts when faced with algorithmic harms. In this paper, we demonstrate that the courts are the primary route—and the primary roadblock—in the pursuit of redress for algorithmic harms. Courts often find algorithmic harms non-cognizable and rarely require developers to address material claims of harm. To address the core challenges of taking algorithms to court, we develop a relational approach to algorithmic accountability that emphasizes neither what the actors do nor the results of their actions, but rather how interlocking relationships of accountability are constituted in a triadic relationship between actors, forums, and public(s). As is the case in other regulatory domains, we believe that impact assessments (and similar accountability documentation) can provide the grounds for contestation between these parties, but only when that triad is structured such that the public(s) are able to cohere around shared experiences and interests, contest the outcomes of algorithmic systems that affect their lives, and make demands upon the other parties. Where courts now find algorithmic harms non-cognizable, an impact assessment regime can potentially create procedural rights to protect substantive rights of the public(s). This would require algorithmic accountability policies currently under consideration to provide the public(s) with adequate standing in courts, and opportunities to access and contest the actor's documentation and the forum's judgments.
  2. Obeid, I. (Ed.)
    The Neural Engineering Data Consortium (NEDC) is developing the Temple University Digital Pathology Corpus (TUDP), an open source database of high-resolution images from scanned pathology samples [1], as part of its National Science Foundation-funded Major Research Instrumentation grant titled “MRI: High Performance Digital Pathology Using Big Data and Machine Learning” [2]. The long-term goal of this project is to release one million images. We have currently scanned over 100,000 images and are in the process of annotating breast tissue data for our first official corpus release, v1.0.0. This release contains 3,505 annotated images of breast tissue including 74 patients with cancerous diagnoses (out of a total of 296 patients). In this poster, we will present an analysis of this corpus and discuss the challenges we have faced in efficiently producing high-quality annotations of breast tissue. It is well known that state-of-the-art algorithms in machine learning require vast amounts of data. Fields such as speech recognition [3], image recognition [4] and text processing [5] are able to deliver impressive performance with complex deep learning models because they have developed large corpora to support training of extremely high-dimensional models (e.g., billions of parameters). Other fields that do not have access to such data resources must rely on techniques in which existing models can be adapted to new datasets [6]. A preliminary version of this breast corpus release was tested in a pilot study using a baseline machine learning system, ResNet18 [7], that leverages several open-source Python tools. The pilot corpus was divided into three sets: train, development, and evaluation. Portions of these slides were manually annotated [1] using the nine labels in Table 1 [8] to identify five to ten examples of pathological features on each slide. Not every pathological feature is annotated, meaning excluded areas can include foci relevant to these labels that are not used for training. A summary of the number of patches within each label is given in Table 2. To maintain a balanced training set, 1,000 patches of each label were used to train the machine learning model. Throughout all sets, only annotated patches were involved in model development. The performance of this model in identifying all the patches in the evaluation set can be seen in the confusion matrix of classification accuracy in Table 3. The highest-performing labels were background (97% correct identification) and artifact (76% correct identification). A correlation exists between labels with more than 6,000 development patches and accurate performance on the evaluation set. Additionally, these results indicated a need to further refine the annotation of invasive ductal carcinoma (“indc”), inflammation (“infl”), nonneoplastic features (“nneo”), normal (“norm”) and suspicious (“susp”). This pilot experiment motivated changes to the corpus that will be discussed in detail in this poster presentation. To increase the accuracy of the machine learning model, we modified how we addressed underperforming labels. One common source of error arose from how non-background labels were converted into patches: large areas of background within other labels were isolated within a patch, resulting in connective tissue misrepresenting a non-background label. In response, the annotation overlay margins were revised to exclude benign connective tissue in non-background labels.
Corresponding patient reports and supporting immunohistochemical stains further guided annotation reviews. The microscopic diagnoses given by the primary pathologist in these reports detail the pathological findings within each tissue site, but not within each specific slide. The microscopic diagnoses informed revisions specifically targeting annotated regions classified as cancerous, ensuring that the labels “indc” and “dcis” were used only where the pathologist's microscopic diagnosis supported them. Further differentiation of cancerous and precancerous labels, as well as the location of their focus on a slide, could be accomplished with supplemental immunohistochemically (IHC) stained slides. When distinguishing whether a focus is a nonneoplastic feature versus a cancerous growth, pathologists apply antigen-targeting stains to the tissue in question to confirm the diagnosis. For example, a nonneoplastic feature of usual ductal hyperplasia will display diffuse staining for cytokeratin 5 (CK5) and no diffuse staining for estrogen receptor (ER), while a cancerous growth of ductal carcinoma in situ will have negative or focally positive staining for CK5 and diffuse staining for ER [9]. Many tissue samples contain cancerous and non-cancerous features with morphological overlaps that cause variability between annotators. The information that IHC slides provide could play an integral role in machine-model pathology diagnostics. Following the revisions made to all the annotations, a second experiment was run using ResNet18. Compared to the pilot study, an increase in model prediction accuracy was seen for the labels indc, infl, nneo, norm, and null. This increase is correlated with an increase in annotated area and annotation accuracy. Model performance in identifying the suspicious label decreased by 25% due to a 57% decrease in the total annotated area described by this label. A summary of the model performance is given in Table 4, which shows the new prediction accuracy and the absolute change in error rate compared to Table 3. The breast tissue subset we are developing includes 3,505 annotated breast pathology slides from 296 patients. The average size of a scanned SVS file is 363 MB. The annotations are stored in an XML format. A CSV version of the annotation file is also available, which provides a flat, or simple, annotation that is easy for machine learning researchers to access and interface to their systems. Each patient is identified by an anonymized medical reference number. Within each patient's directory, one or more sessions are identified, also anonymized to the first of the month in which the sample was taken. These sessions are broken into groupings of tissue taken on that date (in this case, breast tissue). A deidentified patient report stored as a flat text file is also available. Within these slides there are a total of 16,971 annotated regions, with an average of 4.84 annotations per slide. Among those annotations, 8,035 are non-cancerous (normal, background, null, and artifact), 6,222 are carcinogenic signs (inflammation, nonneoplastic, and suspicious), and 2,714 are cancerous labels (ductal carcinoma in situ and invasive ductal carcinoma). The individual patients are split into three sets: train, development, and evaluation. Of the 74 patients with cancerous diagnoses, 20 were allotted to each of the development and evaluation sets, while the remaining 34 were allotted to the training set.
The remaining 222 patients were split up to preserve the overall distribution of labels within the corpus. This was done in the hope of creating control sets for comparable studies. Overall, the development and evaluation sets each have 80 patients, while the training set has 136 patients. In a related component of this project, slides from the Fox Chase Cancer Center (FCCC) Biosample Repository (https://www.foxchase.org/research/facilities/genetic-research-facilities/biosample-repository-facility) are being digitized in addition to slides provided by Temple University Hospital. This data includes 18 different types of tissue, including approximately 38.5% urinary tissue and 16.5% gynecological tissue. These slides and the metadata provided with them are already anonymized and include diagnoses in a spreadsheet with sample and patient ID. We plan to release over 13,000 unannotated slides from the FCCC Corpus simultaneously with v1.0.0 of TUDP. Details of this release will also be discussed in this poster. Few digitally annotated databases of pathology samples like TUDP exist due to the extensive data collection and processing required. The breast corpus subset should be released by November 2021. By December 2021 we should also release the unannotated FCCC data. We are currently annotating urinary tract data as well. We expect to release about 5,600 processed TUH slides in this subset. We have an additional 53,000 unprocessed TUH slides digitized. Corpora of this size will stimulate the development of a new generation of deep learning technology. In clinical settings where resources are limited, an assistive diagnosis model could help manage pathologists' workload and even help prioritize suspected cancerous cases. ACKNOWLEDGMENTS This material is supported by the National Science Foundation under grant nos. CNS-1726188 and 1925494. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. REFERENCES [1] N. Shawki et al., “The Temple University Digital Pathology Corpus,” in Signal Processing in Medicine and Biology: Emerging Trends in Research and Applications, 1st ed., I. Obeid, I. Selesnick, and J. Picone, Eds. New York City, New York, USA: Springer, 2020, pp. 67–104. https://www.springer.com/gp/book/9783030368432. [2] J. Picone, T. Farkas, I. Obeid, and Y. Persidsky, “MRI: High Performance Digital Pathology Using Big Data and Machine Learning.” Major Research Instrumentation (MRI), Division of Computer and Network Systems, Award No. 1726188, January 1, 2018 – December 31, 2021. https://www.isip.piconepress.com/projects/nsf_dpath/. [3] A. Gulati et al., “Conformer: Convolution-augmented Transformer for Speech Recognition,” in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2020, pp. 5036–5040. https://doi.org/10.21437/interspeech.2020-3015. [4] C.-J. Wu et al., “Machine Learning at Facebook: Understanding Inference at the Edge,” in Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA), 2019, pp. 331–344. https://ieeexplore.ieee.org/document/8675201. [5] I. Caswell and B. Liang, “Recent Advances in Google Translate,” Google AI Blog: The latest from Google Research, 2020. [Online]. Available: https://ai.googleblog.com/2020/06/recent-advances-in-google-translate.html. [Accessed: 01-Aug-2021]. [6] V. Khalkhali, N. Shawki, V. Shah, M. Golmohammadi, I. Obeid, and J.
Picone, “Low Latency Real-Time Seizure Detection Using Transfer Deep Learning,” in Proceedings of the IEEE Signal Processing in Medicine and Biology Symposium (SPMB), 2021, pp. 1–7. https://www.isip.piconepress.com/publications/conference_proceedings/2021/ieee_spmb/eeg_transfer_learning/. [7] J. Picone, T. Farkas, I. Obeid, and Y. Persidsky, “MRI: High Performance Digital Pathology Using Big Data and Machine Learning,” Philadelphia, Pennsylvania, USA, 2020. https://www.isip.piconepress.com/publications/reports/2020/nsf/mri_dpath/. [8] I. Hunt, S. Husain, J. Simons, I. Obeid, and J. Picone, “Recent Advances in the Temple University Digital Pathology Corpus,” in Proceedings of the IEEE Signal Processing in Medicine and Biology Symposium (SPMB), 2019, pp. 1–4. https://ieeexplore.ieee.org/document/9037859. [9] A. P. Martinez, C. Cohen, K. Z. Hanley, and X. (Bill) Li, “Estrogen Receptor and Cytokeratin 5 Are Reliable Markers to Separate Usual Ductal Hyperplasia From Atypical Ductal Hyperplasia and Low-Grade Ductal Carcinoma In Situ,” Arch. Pathol. Lab. Med., vol. 140, no. 7, pp. 686–689, Apr. 2016. https://doi.org/10.5858/arpa.2015-0238-OA.
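To make the patch-level training workflow concrete, below is a minimal sketch of the kind of balanced ResNet18 fine-tuning the abstract describes (1,000 patches per label across the nine annotation labels). It assumes PyTorch/torchvision and a hypothetical directory of pre-extracted patch images; the folder layout, short label names, and hyperparameters are illustrative assumptions, not the NEDC pipeline itself.

```python
# Minimal sketch (not the authors' code): fine-tuning ResNet18 on pre-extracted
# patch images with the balanced 1,000-patches-per-label sampling described above.
# Directory layout, label names, and hyperparameters are assumptions.
import random
from collections import defaultdict

import torch
from torch import nn
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, models, transforms

# Hypothetical short names for the nine annotation labels.
LABELS = ["artf", "bckg", "dcis", "indc", "infl", "nneo", "norm", "null", "susp"]
PATCHES_PER_LABEL = 1000  # balanced training set, per the abstract

tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Assumes an upstream step exported patches as patches/train/<label>/<patch>.png.
full_train = datasets.ImageFolder("patches/train", transform=tfm)

# Draw at most 1,000 patch indices per label to balance the training set.
by_label = defaultdict(list)
for idx, (_, label) in enumerate(full_train.samples):
    by_label[label].append(idx)
balanced_idx = [i for idxs in by_label.values()
                for i in random.sample(idxs, min(PATCHES_PER_LABEL, len(idxs)))]
train_loader = DataLoader(Subset(full_train, balanced_idx), batch_size=32, shuffle=True)

# ResNet18 backbone with a 9-way classification head.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, len(LABELS))

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):  # illustrative epoch count
    for images, targets in train_loader:
        images, targets = images.to(device), targets.to(device)
        opt.zero_grad()
        loss = loss_fn(model(images), targets)
        loss.backward()
        opt.step()
```

Capping every label at the same number of patches mirrors the balanced training set the abstract describes and keeps frequent labels such as background from dominating the loss.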
  3. Who ensures, and by what means, that engineering education evolves to meet the ever-changing needs of our society? This and other papers presented by our research team at this conference offer our initial set of findings from an NSF-sponsored collaborative study on engineering education reform. Organized around the notion of higher education governance and the practice of educational reform, our open-ended study is based on conducting semi-structured interviews at over three dozen universities and engineering professional societies and organizations, along with a handful of scholars engaged in engineering education research. Organized as a multi-site, multi-scale study, our goal is to document differences in perspective and interest that exist across organizational levels and institutions, and to describe the coordination that occurs (or fails to occur) in engineering education given the distributed structure of the engineering profession. This paper offers for all engineering educators and administrators a qualitative and retrospective analysis of ABET EC 2000 and its implementation. The paper opens with a historical background on the Engineers Council for Professional Development (ECPD) and engineering accreditation; the rise of quantitative standards during the 1950s as a result of the push to implement an engineering science curriculum appropriate to the Cold War era; EC 2000 and its call for greater emphasis on professional skill sets amidst concerns about US manufacturing productivity and national competitiveness; the development of outcomes assessment and its implementation; and the successive negotiations about assessment practice and the training of both program evaluators and assessment coordinators for the degree programs undergoing evaluation. It was these negotiations and the evolving practice of assessment that resulted in the latest set of changes in ABET engineering accreditation criteria (“1-7” versus “a-k”). To provide insight into the origins of EC 2000, the paper describes the “Gang of Six,” a group of individuals loyal to ABET who used the pressure exerted by external organizations, along with a shared rhetoric of national competitiveness, to forge a common vision organized around the expanded emphasis on professional skill sets. It was also significant that the Gang of Six was aware that the regional accreditation agencies were already contemplating a shift towards outcomes assessment; several also had a background in industrial engineering. However, this resulted in an assessment protocol for EC 2000 that remained ambiguous about whether the stated learning outcomes (Criterion 3) were something faculty had to demonstrate for all of their students, or whether EC 2000’s main emphasis was continuous improvement. When it proved difficult to demonstrate learning outcomes on the part of all students, ABET itself began to place greater emphasis on total quality management and continuous process improvement (TQM/CPI). This gave institutions an opening to begin using increasingly limited and proximate measures for the “a-k” student outcomes as evidence of effort and improvement. In what would be described in social-scientific terms as “tactical” resistance to perceived oppressive structures, this enabled ABET coordinators and the faculty in charge of degree programs, many of whom had their own internal improvement processes, to begin referring to the a-k criteria as “difficult to achieve” and “ambiguous,” which they sometimes were.
Inconsistencies in evaluation outcomes enabled those most discontented with the a-k student outcomes to use ABET's own organizational processes to drive the latest revisions to EAC accreditation criteria, although the organization's own process for member and stakeholder input ultimately restored many of the professional skill sets found in the original EC 2000 criteria. Other refinements were also made to the standard, including a new emphasis on diversity. This said, many within our interview population believe that EC 2000 had already achieved many of the changes it set out to achieve, especially with regard to broader professional skills such as communication, teamwork, and design. Regular faculty review of curricula is now also a more routine part of the engineering education landscape. While programs vary in their engagement with ABET, many are skeptical about whether the new criteria will produce further improvements to their programs, arguing that their own internal processes are now the primary drivers for change.
  4. Instagram, one of the most popular social media platforms among youth, has recently come under scrutiny for potentially being harmful to the safety and well-being of our younger generations. Automated approaches for risk detection may be one way to help mitigate some of these risks if such algorithms are both accurate and contextual to the types of online harms youth face on social media platforms. However, the imminent switch by Instagram to end-to-end encryption for private conversations will limit the type of data that will be available to the platform to detect and mitigate such risks. In this paper, we investigate which indicators are most helpful in automatically detecting risk in Instagram private conversations, with an eye on high-level metadata, which will still be available under end-to-end encryption. Toward this end, we collected Instagram data from 172 youth (ages 13-21) and asked them to identify private message conversations that made them feel uncomfortable or unsafe. Our participants risk-flagged 28,725 conversations that contained 4,181,970 direct messages, including textual posts and images. Based on this rich and multimodal dataset, we tested multiple feature sets (metadata, linguistic cues, and image features) and trained classifiers to detect risky conversations. Overall, we found that the metadata features (e.g., conversation length, a proxy for participant engagement) were the best predictors of risky conversations. However, for distinguishing between risk types, the different linguistic and media cues were the best predictors. Based on our findings, we provide design implications for AI risk detection systems in the presence of end-to-end encryption. More broadly, our work contributes to the literature on adolescent online safety by moving toward more robust solutions for risk detection that directly take into account the lived risk experiences of youth.
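As a concrete illustration of the metadata-only setting, here is a minimal sketch of training a binary risky-conversation classifier on conversation-level metadata features such as conversation length. The CSV file, feature names, and model choice are hypothetical assumptions for illustration, not the paper's actual pipeline (which also evaluated linguistic and image features).

```python
# Minimal sketch (not the paper's pipeline): a binary classifier over
# conversation-level metadata features of the kind the abstract highlights
# (e.g., conversation length as a proxy for engagement). The CSV file,
# feature names, and model choice are illustrative assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hypothetical per-conversation table: one row per private conversation.
df = pd.read_csv("conversations_metadata.csv")
feature_cols = [
    "num_messages",             # conversation length
    "num_participants",
    "days_active",
    "median_reply_seconds",
    "share_of_image_messages",
]
X = df[feature_cols]
y = df["risk_flagged"]          # 1 if the participant flagged the conversation as unsafe

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```

Because such a model never inspects message content, it remains usable in principle even when conversations are end-to-end encrypted and only metadata is visible to the platform.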
  5. Machine learning (ML) methods already permeate environmental decision-making, from processing high-dimensional data on earth systems to monitoring compliance with environmental regulations. Of the ML techniques available to address pressing environmental problems (e.g., climate change, biodiversity loss), Reinforcement Learning (RL) may both hold the greatest promise and present the most pressing perils. This paper explores how RL-driven policy refracts existing power relations in the environmental domain while also creating unique challenges to ensuring equitable and accountable environmental decision processes. We leverage examples from RL applications to climate change mitigation and fisheries management to explore how RL technologies shift the distribution of power between resource users, governing bodies, and private industry. 