<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Credibility via Coupling: Institutions and Infrastructures in Climate Model Intercomparisons</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>12/27/2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10356858</idno>
					<idno type="doi">10.17351/ests2021.769</idno>
					<title level='j'>Engaging Science, Technology, and Society</title>
<idno type="issn">2413-8053</idno>
<biblScope unit="volume">7</biblScope>
<biblScope unit="issue">2</biblScope>					

					<author>Matthew S. Mayernik</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[This study investigates Model Intercomparison Projects (MIPs) as one example of a coordinated approach to establishing scientific credibility. MIPs originated within climate science as a method to evaluate and compare disparate climate models, but MIPs or MIP-like projects are now spreading to many scientific fields. Within climate science, MIPs have advanced knowledge of: a) the climate phenomena being modeled, and b) the building of climate models themselves. MIPs thus build scientific confidence in the climate modeling enterprise writ large, reducing questions of the credibility or reproducibility of any single model. This paper will discuss how MIPs organize people, models, and data through institution and infrastructure coupling (IIC). IIC involves establishing mechanisms and technologies for collecting, distributing, and comparing data and models (infrastructural work), alongside corresponding governance structures, rules of participation, and collaboration mechanisms that enable partners around the world to work together effectively (institutional work). Coupling these efforts involves developing formal and informal ways to standardize data and metadata, create common vocabularies, provide uniform tools and methods for evaluating resulting data, and build community around shared research topics.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Introduction</head><p>. . . there should be no such thing as a theory of how credibility is achieved, at least in the sense of one of those grand theories that would offer an adequate formula for how it is done regardless of setting and the nature of the case at hand. In any particular case the resources and tactics relevant to the achievement of credibility are likely to be very diverse, and a different array of resources and tactics is likely to bear on different types of case <ref type="bibr">(Shapin 1995, 261)</ref>.</p><p>Questions about the credibility of scientific research abound within public discourse about science. Multiple efforts have sprung up to address credibility questions related to the "reproducibility" of scholarly research. This study draws on my position within the National Center for Atmospheric Research (NCAR), through which I have engaged in ethnographic and participant observation intermittently over the past ten years in workshops, seminars, small group meetings, and informal interactions with scientists, technology experts, and data experts. I also incorporate quotes from a series of semi-structured interviews conducted in 2017 (at NCAR).</p><p>The following sections discuss how MIPs organize people, models, and data through institution and infrastructure coupling (IIC). IIC involves establishing mechanisms and technologies for collecting, distributing, and comparing data and models (infrastructural work), alongside corresponding governance structures, rules of participation, and collaboration mechanisms that enable partners around the world to work together effectively (institutional work). Coupling these efforts involves developing formal and informal ways to standardize data and metadata, create common vocabularies, provide uniform tools and methods for evaluating resulting data, and build community around shared research topics. 
I argue that this coupling of the institutional and infrastructural work is instrumental in enabling MIPs to achieve credible research outcomes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Background-Credibility and Reproducibility Overview</head><p>STS literature vividly depicts how the reproducibility, accuracy, and credibility of scientific instruments and findings depend on the movement of people with specific standing and expertise. Replications of scientific findings encounter "regress" challenges, namely, difficulties in knowing whether experimental or theoretical findings are real if they have never been encountered before <ref type="bibr">(Collins 1985;</ref><ref type="bibr">Kennefick 2007)</ref>. This can lead to debates that span years or even decades, across multiple generations of people and research studies <ref type="bibr">(Galison 1987)</ref>. Work to resolve such debates occurs at both individual and institutional levels <ref type="bibr">(Braun &amp; Kropp 2010)</ref>.</p><p>Most scholarly communities, and in particular the geosciences <ref type="bibr">(Yan et al. 2020)</ref>, do not organize themselves to achieve "reproducibility" per se. As noted in a recent report by the National Academies of Sciences, Engineering, and Medicine <ref type="bibr">(NASEM)</ref>: "A predominant focus on the replicability of individual studies is an inefficient way to assure the reliability of scientific knowledge. Rather, reviews of cumulative evidence on a subject, to assess both the overall effect size and generalizability, is often a more useful way to gain confidence in the state of scientific knowledge" <ref type="bibr">(NASEM 2019, 2)</ref>. Cumulative evidence gives credibility to research findings even if strong reproducibility is difficult or impossible to achieve.</p><p>Credibility is used here to refer to the quality of engendering trust or belief, and of being convincing or inspiring confidence. This understanding of credibility is aligned with both dictionary definitions and past sociology of science research, such as that of <ref type="bibr">Shapin (1995)</ref>. 
Organized community research endeavors often arise due to challenges in reproducing or replicating certain findings. Climate science provides an excellent case of how groups of researchers and other stakeholders design for comparison and cumulative evidence.</p><p>As recently stated by a group of climate science experts: numerical reproducibility is difficult to achieve with the computing arrays required by modern GCMs [General Circulation Models]. . . . Therefore, the focus of the discipline has not been on model run reproducibility, but rather on replication of model phenomena observed and their magnitudes, which is performed mostly in organized multi-model ensembles <ref type="bibr">(Bush et al. 2020, 10)</ref>.</p><p>MIPs are a high-profile example of a multi-model ensemble. Organized at an international scale, MIPs provide a venue through which dozens of computational modeling teams can compare, evaluate, and diagnose their models. In the literature focused on reproducibility, such as <ref type="bibr">Leonelli (2018)</ref>, computational simulation-based research projects are often grouped together under the broad notion of "computational reproducibility." <ref type="bibr">Penders et al. (2019)</ref>, in extending Leonelli's typology, list simulations as having "high" reproducibility due in part to an "absolute" level of control over the research environment.</p><p>The challenge with climate models, as noted by <ref type="bibr">Bush et al.</ref>, is that they simulate chaotic phenomena. Thus, re-running a model with the same configuration and input data may not provide the same bit-level output. Also, climate models are optimized to run on particular hardware and software systems (supercomputers) and can produce different output when run on other computational platforms <ref type="bibr">(Easterbrook 2014)</ref>. 
For example, in a workshop held in May 2020 focused on geoscience model output archiving and reproducibility (of which I was a co-convener), questions about bit-wise reproducibility were quickly set aside by participants as not useful, due to the difficulty of achieving bit-wise equivalent simulations and the low value for modelers in doing so. Instead, discussions focused on "feature reproducibility," namely, whether the same physical phenomena (or statistics about those phenomena) could be seen in subsequent simulations. Examples include whether tornadoes emerged in simulations consistently when the same atmospheric conditions were present, or whether geographic temperature trends were consistent across long-term climate simulations. Within climate model research papers and policy documents, the term "reproducibility" is most commonly used in this way, e.g., the extent to which models can reproduce observed temperature trends or particular well-known climate phenomena such as El Ni&#241;o.</p><p>The MIP approach to organizing research has more in common with <ref type="bibr">Leonelli's (2018)</ref> third type of reproducibility, "Semi-Standardized Experiments." As described by <ref type="bibr">Leonelli</ref>, in this category, "[research] methods, set-up, and materials used have been construed with ingenuity in order to yield very specific outcomes, and yet some significant parts of the set-up necessarily elude the controls set up by experimenters" <ref type="bibr">(ibid., 136)</ref>. Research in this category does not aim for direct reproducibility, but instead emphasizes other qualities, including comparability, validity, and predictability.</p><p>MIPs complement other considerations involved in assessing the reproducibility and credibility of climate models. Climate models contain multiple sources of uncertainty, and validating their outputs necessarily requires multiple approaches <ref type="bibr">(Randall et al. 2007)</ref>. Decisions within climate modeling centers about model development and assessment are influenced by the centers' objectives, conceptual assumptions, community norms, and the availability of funding and computational resources <ref type="bibr">(Morrison 2021)</ref>. Climate model results are evaluated quantitatively and qualitatively via comparisons to observations, to known physical laws (such as conservation of energy), and to prior generations of climate models <ref type="bibr">(Rood 2019)</ref>. MIPs do not encompass all sources of model variation and uncertainty. In general, any model that can meet the requirements of a given MIP can participate, resulting in "ensembles of opportunity," rather than a random or systematic sample of all possible Earth system models <ref type="bibr">(Winsberg 2018)</ref>. 
Other kinds of simulation ensembles, such as large ensembles of simulations from a single model, are better suited for studying the internal variability of the climate system or of an individual model's components <ref type="bibr">(Deser et al. 2020)</ref>. MIPs do, however, provide robust social and technical scaffolding to reduce certain kinds of variation and uncertainty, and to buttress the establishment of credibility in climate models more broadly <ref type="bibr">(Leonelli 2019)</ref>. Such social and technical scaffolding is critical to enabling climate science to meet external demands for transparency, accountability, and credibility <ref type="bibr">(Edwards 2019b;</ref><ref type="bibr">Mayernik 2019)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Institutional And Infrastructural Coupling (IIC)</head><p>Before going further into the MIP case, this section provides an outline of the key concepts of this paper, namely, infrastructures and institutions. A full review of the extensive literature that exists around each concept is beyond the scope of this article. Here, however, I discuss a few characteristics related to each concept to frame the rest of the study.</p><p>As developed by Susan Leigh Star and Karen Ruhleder <ref type="bibr">(1996)</ref>, groupings of technical systems, human practices, and organizations can be studied as infrastructure if they present certain characteristics, including that they are embedded within other social arrangements and technologies, are built upon an installed base of prior systems, and are typically invisible to the user until they break down. Infrastructures are also deeply connected with routines and habits involved in their use. Their "invisibility" often comes from the ways that habits and norms fade into the background of routine interactions with built systems <ref type="bibr">(Edwards 2019a)</ref>. "Infrastructure" is thus a concept that denotes human-built networks of technical systems that underpin distributed sets of human practices and movements of material entities <ref type="bibr">(Edwards et al. 2007)</ref>. This paper follows the recommendation of Charlotte Lee and Kjeld Schmidt (2018) to delineate the scope of the concept of "infrastructure" more precisely. Lee and Schmidt's analysis details how the gradual expansion of the concept has led to terminological vagueness and conceptual imprecision. They depict how these problems emerge from Star and Ruhleder's initial studies, and carry forward through subsequent studies of "information infrastructures" and "cyberinfrastructure." The result of this imprecision is that "the term 'infrastructure' can be used to mean just about anything. 
This licenses not just semantic drift but a conceptual landslide" <ref type="bibr">(Lee &amp; Schmidt 2018, 191-2)</ref>. The range of entities that have been characterized as "infrastructure" within various literatures is indeed remarkable, encompassing even the sky and nonhuman animals <ref type="bibr">(Hoeppe 2018;</ref><ref type="bibr">Barua 2021)</ref>. This raises the question posed by Lee and Schmidt: "Does 'infrastructure' then simply mean the infinite assortment of stuff upon which a practice, any practice, relies?" <ref type="bibr">(Lee &amp; Schmidt 2018, 192)</ref>.</p><p>To avoid these conceptual challenges, I use a narrower view of infrastructure, one of four definitions offered by Lee and Schmidt: "a technical structure or installation or material substrate ([e.g.] 'networked computing') conceived of in terms of its structure and the services it provides to some social system" <ref type="bibr">(ibid., 207)</ref>. This definition maps well to types of infrastructures that are commonly used as examples of the concept, such as the electricity grid, the US interstate highway system, the telephone networks (wired and cellular), and the internet. I thus use the term "infrastructural work" to refer to the work required by people and organizations to establish built systems as infrastructure.</p><p>In this study, I also bring in the lens of institutional theory. This body of literature provides a complementary set of terminologies and concepts that can shed light on the connections between technological systems and human processes of organization and coordination.</p><p>Institutions are generally understood to be "complex social forms that reproduce themselves, such as governments, family, human languages, universities, hospitals, business corporations, and legal systems" <ref type="bibr">(Miller 2019, n. p.)</ref>. 
They manifest as stable patterns of individual and organizational behavior that structure and legitimize actions, relationships, and understandings within specific situations. In Douglass North's (1990) metaphor, institutions are the "rules of the game" within social interactions, while individuals and organizations are the "players in the game." Institutions can be understood to be social structures, orders, or patterns that enable cooperation across formal organizations or where formal organizations are absent. I use the phrase "institutional work" to refer to the work involved in establishing stable processes and practices for coordination of heterogeneous stakeholders <ref type="bibr">(Mayernik 2016)</ref>. Bruno Latour, in his book "An Inquiry into Modes of Existence" (2013), notes the connection between the trustworthiness of science and the robustness of its institutions. In particular, he draws attention "to the institutions that would allow [truths] to maintain themselves in existence a little longer (and it is here, as we have already seen, that the notion of trust in institutions comes to the fore)" (ibid., 18-19, italics in original). Latour discusses how the validity of "truths" is buttressed by particular configurations of practices, values, and institutions. In another example, Harry Collins depicts the organizational work of gravity wave physicists as being central to the operation of their research agendas:</p><p>One thing I have discovered about physicists, or at least this group of physicists, is that they love to try to solve problems by inventing organizational structures. I have often been surprised that, when I have asked a question of a senior member of the collaboration about what the members are thinking about this or that conundrum of analysis or judgment, the reply refers to the committees or bureaucratic units they are putting together to deal with it. 
It is as though a properly designed organization can serve the same purpose as a properly designed experiment-to produce a correct answer <ref type="bibr">(Collins 2013, 81)</ref>.</p><p>My analysis is rooted in the coupling of these two conceptual frameworks. As Paul Edwards notes, infrastructures enable people to "generate, share, and maintain specific knowledge about the human and natural worlds" (2010, 17). Infrastructures shape the kinds of entities that can exist to play roles in knowledge generation, sharing, and maintenance <ref type="bibr">(Edwards et al. 2007)</ref>. Institutions, on the other hand, provide "vehicles through which the validity of new knowledge can be accredited" <ref type="bibr">(Jasanoff 2004, 39-40)</ref>.</p><p>Institutional work is also central to mediating information and knowledge exchanges at the science-policy interface <ref type="bibr">(Miller 2001)</ref>. As described by Oran Young, Paul Berkman, and Alexander Vylegzhanin in a discussion of governance of environmental systems, considering infrastructures and institutions as coupled phenomena exposes "the relationship between the design and establishment of institutions that form the core of governance systems on the one hand and the administration of these arrangements on a day-to-day basis on the other" (2020, 348). The following discussion of MIPs investigates how infrastructures and institutions support distributed scientific collaboration and data sharing via a coupling of technical systems, research coordination mechanisms, governance structures, and rules of participation.</p><p>The Atmospheric Model Intercomparison Project (AMIP) was motivated by several smaller-scale climate model comparison studies that took place in the 1970s and 1980s <ref type="bibr">(Gates 1979;</ref><ref type="bibr">Cess et al. 1989)</ref>. These early projects were focused on evaluating climate models' behavior with respect to specific phenomena, such as clouds, precipitation, or air temperature. 
They demonstrated that comparing simulations from different models helped to identify where the models agreed and diverged. AMIP formalized a process for doing intercomparisons. This included defining standard model experiments, often called "model scenarios," such as simulating the climate impacts of a 1% annual increase in the atmospheric CO2 concentration. Also standardized within AMIP were the requested model output data, the data used to initialize, compare, and validate model outputs, and the model validation procedures themselves <ref type="bibr">(Gates 1992)</ref>. The first AMIP set the stage for subsequent AMIPs, as well as the Coupled Model Intercomparison Project (CMIP), which was the next major international MIP. The first iteration of CMIP was initiated in 1996 and focused on characterizing systematic simulation errors of global coupled climate models <ref type="bibr">(Meehl et al. 1997)</ref>. Since the beginning, the CMIPs have been organized by the Working Group on Coupled Modeling (WGCM), which has operated under the joint auspices of the World Climate Research Programme (WCRP) and the Climate Variability and Predictability (CLIVAR) organization. The CMIP operations have been managed by a "CMIP Panel," which is constituted within the WGCM. The CMIP Panel is responsible for overseeing the design of the CMIP experiments and the input and output datasets, and for resolving problems that arise.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Model Intercomparison Projects (MIPs)</head><p>Many iterations of CMIP and other MIPs have taken place since the turn of the century. Figure <ref type="figure">1</ref> shows a timeline of the iterations of AMIP and CMIP since 1990. All of these MIPs themselves consisted of numerous sub-projects and/or sub-MIPs. As of 2021, CMIP6 is in process. The results from model <ref type="bibr">(Gates 1995, 8)</ref>. In later MIPs, the Program for Climate Model Diagnosis and Intercomparison (PCMDI) has been a leading player in the creation of large-scale infrastructures, specifically the Earth System Grid (ESG) and Earth System Grid Federation (ESGF), which have evolved over multiple iterations (shown in figure <ref type="figure">1</ref>) to involve dozens of international partners and global-scale computational and data systems. Thus, MIPs have been targeted at establishing the credibility of climate models within climate science research and policy-relevant science products, such as the IPCC reports <ref type="bibr">(Rood 2019)</ref>. The success of AMIP and CMIP helped to spawn other MIPs <ref type="bibr">(Gates et al. 1999)</ref>, thus providing good case studies in how "reproducibility" and its associated concepts are achieved in such coordinated projects.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Why Participate in MIPs?</head><p>Participating in MIPs, particularly the more recent generations of CMIP, places significant amounts of work on the modeling centers. The required model runs for each generation of CMIP typically take months or years to complete on supercomputers, and the data requirements are significant, both in terms of data volume and standardization, as described further in the next section. Here, however, I address the basic question: why do modeling centers participate? The motivations for participating range from scientific to social in nature, as encapsulated by the following quote from an interviewee who was involved in organizing CMIP1 and CMIP2:</p><p>The benefit. . . , and I think it became quickly apparent, is "how well is our model doing comparing with the others?" And if only because the financial sponsors, say in the US, the NSF, and whoever else is sponsoring these models, the managers in Washington would say, "Well, okay. How good is that model compared to others? Is it really state of the art?" And so to be able to give an objective answer to that question, and also in a more practical way to see where model weaknesses lie (Interview A, 2017).</p><p>The analysis of MIP model runs thus gives modelers a window into how their model compares to others. This understanding can be instrumental in improving the models themselves, as well as helping with the practical need to report to funders. This social aspect of participating in MIPs also manifests via peer pressure, as noted in the following quote from the same individual: I think it was as simple as once we got one or two [to participate], then the others didn't want to be left out. "Well, gee whiz. [Organization A &amp; B] are doing this. Shouldn't we be doing this? 
We don't even look like we're a legitimate model if we're not up there with the big boys."</p><p>This quote was made about the early CMIPs, but it holds true in later iterations. For example, a collaboration in Brazil has been working since 2008 to develop the Brazilian Earth System Model (BESM). The BESM was not a participant in CMIP5 (completed in 2013) or any earlier CMIPs, but the BESM team used the CMIP modeling scenarios and data requirements as benchmarks and standards when developing and evaluating the BESM output: ". . . we have followed the criteria for participation in phase 5 of the Coupled Model Intercomparison Project (CMIP5) protocol. . . . The atmospheric data were output at a 3-hourly frequency and later processed using the Climate Model Output Re-writer version 2 (CMOR2) software . . . to satisfy all CMIP5 output requirements" <ref type="bibr">(Nobre et al. 2013, 6717 and 6719)</ref>. A more recent paper explicitly states that "One of the fundamental aims of the BESM project is to participate in the Coupled Model Intercomparison Project's sixth phase" <ref type="bibr">(Veiga et al. 2019, 1613)</ref>.</p><p>Likewise, a 2015 paper describing the development of a climate model in India, the Indian Institute of Tropical Meteorology Earth System Model (IITM-ESM), explicitly points to the goal of contributing to the IPCC assessment reports: "The model, a successful result of Indo-U.S. collaboration, will contribute to the IPCC's Sixth Assessment Report (AR6) simulations, a first for India" <ref type="bibr">(Swapna et al. 2015, n.p., from the abstract)</ref>. The results presented in the paper also feature multiple comparisons between the outcomes of the Indian model and those demonstrated by the CMIP5 models, even though the IITM-ESM was not a participant in CMIP5.</p><p>Newly developed climate models are not only targeted toward participating in international projects like MIPs, however. 
The goals of the IITM-ESM are also explicitly targeted toward improving the representation of climate phenomena that impact India, such as monsoons. But the CMIP activities have clearly provided important benchmarks for the Brazilian and Indian modeling efforts. Notably, as of April 2021, the IITM-ESM is listed on the CMIP6 directory of data contributions, but the BESM is not <ref type="bibr">(PCMDI, 2021)</ref>.</p><p>The MIP modeling scenarios also provide important benchmarks for long-established models. One participant in many MIPs described how one modeling center used a standard AMIP model run as a way to evaluate new versions of their model.</p><p>It turns out the [modeling center] people, they still do an AMIP run every time they change the model, they run an AMIP run, a 10 or 20-year AMIP run, just to see where they stand in the fixed sea surface temperature thing. Make sure that the ocean's not running it all over. So it's a standard, you know <ref type="bibr">(Interview B, 2017)</ref>.</p><p>The AMIP protocol thus provides a well-understood and relatively simple simulation scenario that can be used to evaluate whether newly introduced changes to the model would result in unexpected simulation outcomes. Thus, the MIPs serve both scientific and social purposes for those modeling centers that participate.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Institution And Infrastructure Coupling (IIC) Within MIPs</head><p>To illustrate IIC in relation to the use of MIPs within climate sciences, I start with the following quote from <ref type="bibr">Meehl et al. (2007)</ref> about the compiled CMIP3 dataset: "This unique and valuable multimodel dataset will be maintained at PCMDI and overseen by the WGCM Climate Simulation Panel for at least the next several years" <ref type="bibr">(ibid., 1393)</ref>. This quote notes two key components of the MIP projects that will be featured in this section, namely, the data infrastructures and the institutionalized governance structures. These components and their coupling have evolved iteratively through the past few decades.</p><p>The WGCM panels have been responsible for the MIP experimental designs and overall project organization. A range of institutional structures and processes have been created over time related to governance and collaboration. This institutional work has involved establishing sets of rules related to participation and roles necessary to support the projects' goals. As an example, since the initial AMIP there has been the idea of having a core set of experiments that all participants contribute to, along with a set of focused investigations into specific phenomena. In the first phases of AMIP and CMIP, the specific focused investigations were called "diagnostic subprojects" <ref type="bibr">(Gates 1992)</ref>. In CMIP6, the most recent MIP, these have been called "endorsed MIPs" that complement the core set of experiments. 
Each "endorsed MIP" was required to submit a formal proposal that detailed the scientific goals of the effort, the model output to be generated, any unique data/metadata characteristics, and the designated co-chairs and Scientific Steering Committee <ref type="bibr">(WCRP 2014)</ref>.</p><p>The institutional work done by the CMIP Panel and associated committees also encompasses setting requirements for data and metadata. This has involved forming recommendations and requirements for the datasets used as input for the modeling experiments, and for the output data to be generated by the modeling groups for submission to the central CMIP data collection. These requirements also specify data and metadata file formats, data gridding and coordinate systems, data file organizing schemes, variable names, units, and sign conventions, and more recently, the structure and content of the citation for the output data.</p><p>For example, the CMIP5 data request included over 400 variables, spanning the atmosphere, oceans, land surface, and many other phenomena. The full specification of the CMIP5 model output request runs 133 pages of tables <ref type="bibr">(Taylor 2013)</ref>.</p><p>The infrastructural work involved in the operation of these MIPs has increased in scope over time, being both enabling and constraining on the ambitions of the projects. In the initial AMIPs and CMIP1-2, very little "infrastructure" existed per se, beyond the internet as it existed in the 1990s. Participating modeling teams were asked to send their model output data directly to PCMDI via email or file transfer protocol (FTP). In CMIP3, as the scientific ambitions grew, so did the size of the requested data, along with the concomitant requirements for data infrastructure. As depicted in figure 1, the ESG system was first used to support data collection and distribution for CMIP3. The large data volumes for CMIP3, however, precluded the data from being sent over the internet. 
Thus, the CMIP3 modeling groups "were sent hard disks and asked to copy their model data onto the disks in netCDF format and then mail the disks to PCMDI where the model data were downloaded and cataloged" <ref type="bibr">(Meehl et al. 2007, 1385)</ref>. PCMDI staff then performed the work to load the data onto the ESG for wider distribution. Starting with CMIP5, participating groups were asked to stage their data directly to the ESGF, through a distributed system of fifteen ESGF nodes operated by organizations on four continents <ref type="bibr">(Cinquini et al. 2014)</ref>. This was a major transition that was characterized as a "nightmare" by one interviewee (Interview C, 2017), due to the heavy demands put on the modeling centers, which were required to install and operate new large-scale software infrastructures for data publication, transfer, replication, authorization, and quality control.</p><p>The infrastructural work in recent MIPs has been aimed toward creating and operating global data collection, management, and distribution systems. In the earlier MIPs, where the technical ambitions for the data infrastructure were lower, significant amounts of work went into producing and distributing the input datasets, and compiling the output from the modeling teams in a common format. Over time, additional infrastructural work has taken place to create a framework for documenting models, data, and their provenance, assign persistent identifiers to data, secure data quality and integrity, enable data discovery, access, and replication, and provide software to visualize and analyze data. For CMIP3, for example, PCMDI staff wrote a software program called Climate Model Output Rewriter (CMOR) that transformed native model output into data files that met the CMIP data request. 
The stated goal of CMOR was "simply to reduce the effort required to prepare and manage MIP data" <ref type="bibr">(Taylor, Doutriaux, &amp; Peterschmitt 2006, 1)</ref>. Later, a new data quality control approach was developed for CMIP5 that formalized three "levels" of data. Within each level, a series of quality control processes and tools were invoked to check for data completeness and conformance to the stipulated standards, to establish version tracking and integrity checking across data replications, and to create persistent identifiers for the data <ref type="bibr">(Stockhause et al. 2012)</ref>.</p><p>The coupling of the institutional and infrastructural work described here is necessary for projects like MIPs to move forward amid the complex configurations of human, organizational, and technical factors at play. As one example, the metadata requirements evolved significantly from CMIP3 to CMIP5, becoming much more labor-intensive in response to concerns coming out of CMIP3 about the version control and provenance tracking for the model components <ref type="bibr">(Guilyardi et al. 2011)</ref>. One individual who was involved in the technical development of the ESG during CMIP3 indicated that these problems were anticipated, but ultimately not addressed by earlier versions of the ESG:</p><p>Within the [ESG-CET] project, when it was conceived and proposed and even reviewed, there were things that we knew were gonna be absolutely essential and that if we didn't build them in, they were going to be really hard to tack on. So one was provenance, the other was a robust handling of semantics, and both those basically got taken off the table. I believe, I don't have this directly from a DOE [Department of Energy] program manager, but we were told, DOE told us not to focus on that <ref type="bibr">(Interview D, 2017)</ref>.</p><p>Retrofitting the collection of metadata and provenance information into the CMIP and ESG workflows was indeed a challenging process. 
In my interactions with MIP participants, I have heard multiple accounts of how documenting the models and resulting output data via the process used for CMIP5 typically took multiple weeks. CMIP6 metadata requirements and infrastructural components were largely the same as for CMIP5, enabling the modeling centers to carry over lessons learned and tools built during CMIP5.</p><p>This metadata example is one illustration of how the coupling between infrastructural and institutional components of scientific work can reveal frictions. As Ville Aula (2019) notes, infrastructural and data frictions are often addressed via institutional means, whether through formal regulation or informal negotiations among stakeholders. The creation of the ESGF, around the time of CMIP5, was a lengthy and occasionally contentious process that involved negotiations among the prior ESG collaborators, with the WCRP as the orchestrating body. The initial ESG collaboration involved partners funded to investigate a specific problem (how to do high-speed transfers of large-scale data across the internet) while operating as a somewhat informal collaboration. In contrast, a new formal governance structure was set up for ESGF to establish it as an open consortium of organizations, with official documents that stipulate the roles and responsibilities of an ESGF steering committee, its executive committee, and working teams.</p><p>In another illustration of IIC, the approach to solving data infrastructure problems that cropped up in CMIP5 was the creation of the WGCM Infrastructure Panel (WIP) for CMIP6 <ref type="bibr">(Balaji et al. 2018)</ref>. The ESGF infrastructures were noted in a 2019 WGCM document as being critical to the operation of CMIP, while also being "fragile" and a "single point of failure" for the enterprise <ref type="bibr">(WCRP 2019)</ref>. A prominent illustration of this fragility came when the ESGF was brought down by a security breach in 2015. 
No data were corrupted, but the ESGF was completely offline for about six months while reengineering took place. According to multiple participants, during that time CMIP5 data users either hit a roadblock or had to find back-channel sources.</p><p>The terms of reference for the new WIP committee included eight clauses, touching on both infrastructural and institutional responsibilities, as the following excerpt demonstrates:</p><p>1. Serve the interests of the WGCM in establishing and maintaining standards and policies for sharing climate model output and derived products . . .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>4.</head><p>Review and provide guidance on requirements of the infrastructure (e.g. level of service, accessibility, level of security) . . .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>6.</head><p>Collaborate with and rely on the ideas and leadership of other groups with interests in standards and infrastructure for climate data (e.g., CMIP, obs4MIPs, CORDEX, ESGF, ES-DOC, CF conventions), with the understanding that the WGCM expects the WIP to provide oversight <ref type="bibr">(WIP 2014)</ref>.</p><p>The institutional work of the WIP was thus focused on mitigating the sources of risk that had cropped up over time as the ESG and ESGF infrastructures became indispensable to the CMIPs.</p><p>IIC is thus critical to understanding how MIP-based research achieves the goals of the recent "reproducibility" movements, even if it is not possible and/or practical to reproduce bit-for-bit the outputs of the climate models involved. The rules of participation, experimental specifications, metadata and data standards, and data delivery systems are intertwined in ensuring the comparability of the MIP-generated data and results. The credibility of the data and findings likewise emerges from the articulations between the experimental design, metadata and provenance tracing, the assignment of persistent identifiers to data, and the security features implemented within the data management and preservation systems. The credibility of both the organizations and the processes involved is critical to ensuring the trustworthiness of the data and the scientific findings from MIPs. This is nicely encapsulated by the following quote from a contributing author to the IPCC AR4 report, which was in fact included in the AR4:</p><p>It proved important in carrying out the various MIPs to standardize the model forcing parameters and the model output so that file formats, variable names, units, etc., are easily recognized by data users. 
The fact that the model results were stored separately and independently of the modeling centers, and that the analysis of the model output was performed mainly by research groups independent of the modelers, has added to confidence in the results. AMIP and CMIP opened a new era for climate modeling, setting standards of quality control, providing organizational continuity, and ensuring that results are generally reproducible (2011, 244).</p><p>Note how many different details are mentioned with regard to "ensuring that results are generally reproducible": standardizing parameters and model output files, the separation between modeling centers and the model output, standards for quality control, and organizational continuity. Coordinating and achieving these many details at international scales requires both institutional and infrastructural work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Science Case Study-Clouds And Aerosols</head><p>In the past few years, participants in CMIP6 have found that their new generation of climate models is producing future climates with higher "equilibrium climate sensitivity" (ECS) than prior models. The ECS is an important statistic for climate models because it indicates how much the planet may warm due to a doubling of atmospheric CO&#8322; compared to pre-industrial levels. This section provides insight into how the credibility of a finding like this potentially higher ECS is established.</p><p>For the climate modeling group based at the National Center for Atmospheric Research (NCAR), this higher ECS was produced by a new version of a model. Developing this new model involved nearly 300 different model runs, with varying configurations and inputs. Many of the developmental runs of this model produced surface temperature trends that did not track with the observed twentieth-century temperature increase. By iteratively running the model, the team diagnosed that the model produced different climate trends when using aerosol emissions input datasets that were produced for the CMIP6 project, versus emissions datasets produced for CMIP5. (Both sets of emissions data were distributed through the ESGF.) Further diagnosis indicated that cloud production components of the model were the primary cause of the output changes, as cloud generation is tied to the presence of aerosols within the atmosphere.</p><p>Accurately simulating feedbacks related to clouds and aerosols has been a long-term challenge for climate modelers <ref type="bibr">(Cess et al. 1989)</ref>, so this finding was not itself surprising, but the modeling team then faced questions about whether the preliminary simulations of the twentieth century were inaccurate due to problems with the model, the aerosol emissions input datasets, or both. 
After several iterations, the team found a model configuration that simulated a temperature trend close to the observations when using CMIP5 emissions input data, but not with the CMIP6 emissions. During a 2017 presentation about this investigation, one scientist observed that questions about the input data required a broader discussion within the CMIP6 project, and that data questions needed to be sorted out with the CMIP governance teams before the group could start doing their CMIP6 model runs.</p><p>The emissions data had already been re-released a few times to correct problems, once to add back in data that had been dropped due to a "limitation in the ESGF" <ref type="bibr">(Hoesly et al. 2018)</ref>, and once to correct gridding errors discovered by users (including by the NCAR modeling group). No further problems requiring new releases of the CMIP6 emissions input datasets were found, however. Thus, the NCAR modelers continued to evolve their model using these particular datasets as input. Iterative model development continued until their new model was simulating the twentieth-century climate trends acceptably while using the later releases of the CMIP6 input emissions forcing data. At this point, the team began analyzing the simulated climates in more detail, including analyzing the new model's ECS.</p><p>Subsequent analyses showed that the higher ECS was due to cloud processes. The group did not, however, adjust the model directly to affect the ECS.</p><p>As the following quote makes clear, from a publication in which the higher ECS is presented and analyzed in detail, the overall finding of a potentially increased ECS was important enough to require coordinated study:</p><p>An ECS [equilibrium climate sensitivity] of 5.3 K would lead to a high level of climate change and large impacts. It is imperative that the community work in a multimodel context to understand how plausible such a high ECS is. 
What scares us is not that [our] ECS is wrong (all models are wrong, <ref type="bibr">[Box 1976]</ref>) but that it might be right <ref type="bibr">(Gettelman et al. 2019, 8336)</ref>.</p><p>This quote closes the cited paper. Note the reference to statistician George Box and his famous statement that "all models are wrong, but some are useful" <ref type="bibr">(Box 1976)</ref>. Also note the explicit call to study this finding in a "multimodel context." Subsequent studies of CMIP6 models show that many of them are producing higher ECS numbers than were produced for any previous iteration of CMIP <ref type="bibr">(Zelinka et al. 2020)</ref>. Thus, the credibility of any one model is tied to the coordinated effort to standardize the inputs, outputs, and evaluation across many models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Generalization-Semi-Controlled Experiments</head><p>I have so far argued that MIP-based research could be characterized as a "semi-controlled experiment," following <ref type="bibr">Leonelli's (2018)</ref>  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Text Retrieval:</head><p>The Text Retrieval Conferences (TREC) have been held every year since 1992 as a venue to advance the field of information and text retrieval. The focus of TREC has been to establish a coordinated mechanism for conducting, comparing, and evaluating text retrieval experiments, with the goal of improving document search algorithms <ref type="bibr">(Harman &amp; Voorhees 2007)</ref>. Similar to MIPs, the TREC series is organized around common input datasets, experiment specifications, and evaluation methods, and the explicit goal is to conduct comparative assessments of particular text retrieval algorithms. Also similar to MIPs, the TREC activities have been sub-divided according to particular research questions, such as a "routing" task, in which participating research teams were asked to search specified collections of documents, such as news clipping services, to find relevant documents for a particular subject query. The "semi-controlled" aspect of the TREC projects is the evaluation process, which is based on human judgements of the relevance of documents for particular information retrieval tasks. The inconsistency of human relevance assessments is a systematic challenge in evaluating the effectiveness of information retrieval algorithms <ref type="bibr">(Harter 1996)</ref>. TREC addressed this factor by standardizing the relevance judgements used to evaluate the outcomes of the various TREC experimental tracks. The TREC series has been supported since the beginning by the US National Institute of Standards and Technology (NIST) and the US Defense Advanced Research Projects Agency (DARPA). NIST has been the primary site for the "infrastructural work" within TREC, in this case, compiling, standardizing, and distributing the document collections used for each TREC. 
A TREC Program Committee oversees the meeting programs, including the definition of participation roles, the organization of the experimental tracks, and the specification of their test corpora.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Ecological Observatories:</head><p>The last example of a "semi-controlled" scientific endeavor discussed here is ecological observatories like the Long Term Ecological Research (LTER) Network and the National Ecological Observatory Network (NEON), where multiple ecological sites are operated in coordination to support data collection and scientific projects that span time and space. Significant effort goes into achieving consistent, comparable, and traceable datasets from field sites that are "semi-controllable" due to unpredictable flora, fauna, and weather. Staff of such observatories use processes that are designed to enable the comparability of data over time <ref type="bibr">(Ribes &amp; Jackson 2013)</ref>. These observatories are complex to build, operate, and sustain.</p><p>Henry Loescher, Eugene Kelly, and Russ Lea (2017), writing as then-members of NEON leadership about NEON's construction, discussed the coupling of institutional and infrastructural issues explicitly:</p><p>[S]taff scientists are faced with an unfamiliar organizational structure of NEON that acts as a construction company, scientific institution, and a start-up company combined, each with its own culture, that often manifest in needs to build internal organizational function/structures for one culture. Compounding this dynamic, are the rapidly changing institutional needs, and the changing and unforeseen reporting and oversight of the sponsors themselves. The need to engage with the user community has never been greater and, at the same time, always outweighs the institutional capability to do so. 
This is not meant as an excuse, but rather a common, reoccurring reality seen by all research infrastructure during their construction (ibid., 44).</p><p>These challenges resulted in a shake-up in 2016, when the NSF abruptly changed NEON's management with the goal of speeding up construction. This changeover did result in the successful completion of the NEON data collection infrastructure, but it also involved the removal of key scientists and the dissolution of scientific advisory committees, which led to questions about the observatory within the ecological science community.</p><p>One prominent ecologist was quoted in Science magazine as posting on Twitter about NEON as follows:</p><p>"Great data, no users, no trust = failure" <ref type="bibr">(Mervis 2019, 212)</ref>. A NEON Science, Technology &amp; Education Advisory Committee has since been reconstituted, along with 25 other "technical committees" that provide NEON guidance across a range of topics, including community engagement, data standards, and soil sensors.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Conclusion</head><p>This paper argues that the credibility of scientific findings and data for projects that are focused on enabling comparison, such as MIPs in climate science, is strongly related to the coupling of robust infrastructures for data collection, curation, distribution, and preservation with robust institutions that facilitate the delineation of roles, responsibilities, rules of participation, and coordination processes. The cases depicted in this paper suggest that the trust and credibility of the data and findings from such projects come not from either infrastructures or institutions in isolation, but rather from the articulations between them. The concept of "reproducibility" is nuanced for these kinds of projects, due to the semi-controlled nature of the research.</p><p>In the case of MIP-based climate science, critical aspects of the investigation are beyond the full control of the research community: the complexity of the climate system, the non-deterministic mathematics used to simulate that system, and the portability of the computational manifestations of those mathematics.</p><p>Institution and infrastructure coupling (IIC) is introduced as a framework for understanding how scientific endeavors achieve goals that are more applicable than "reproducibility," namely the trustworthiness, credibility, and comparability of the data and findings from such projects. The cases depicted in this paper demonstrate how MIPs and other kinds of scientific endeavors can enable the generation of reliable scientific findings and data even when strict notions of "reproducibility" are not useful. MIPs enable people to perform specific kinds of scientific investigations through collaboration around common research methods, data, and evaluation procedures. 
The structure of the MIPs in climate science has been relatively robust over time and has transferred from one project to another, with some evolution of both the institutional and infrastructural components, as well as the couplings between them.</p><p>For example, problems encountered in prior MIPs, including governance challenges, technical problems, and unresolved or emergent scientific questions, tend to re-appear as the remit of a new committee, working group, or sub-project, and often as the subject of a new rule or recommendation related to participation and data contributions.</p><p>The term "coupling" within the IIC concept perhaps implies a one-to-one connection, like that between rail cars. I suggest, however, that it is more useful to think of these couplings like those between the bones of the skull, where there are seams of unique and multifaceted connections. One skull bone is only as effective as its coupling to the other bones adjacent to it. Likewise, the infrastructures that support MIPs (and other kinds of research that can be characterized as "semi-controlled experiments" <ref type="bibr">(Leonelli 2018)</ref>) are only effective to the extent that they are coupled with institutions that facilitate the credibility of the scientific effort, and vice versa.</p></div></body>
		</text>
</TEI>
