<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Designing workflows for materials characterization</title></titleStmt>
			<publicationStmt>
				<publisher>AIP Publishing</publisher>
				<date>03/01/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10500564</idno>
					<idno type="doi">10.1063/5.0169961</idno>
					<title level='j'>Applied Physics Reviews</title>
<idno type="ISSN">1931-9401</idno>
<biblScope unit="volume">11</biblScope>
<biblScope unit="issue">1</biblScope>					

					<author>Sergei V. Kalinin</author><author>Maxim Ziatdinov</author><author>Mahshid Ahmadi</author><author>Ayana Ghosh</author><author>Kevin Roccapriore</author><author>Yongtao Liu</author><author>Rama K. Vasudevan</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[<p>Experimental science is enabled by the combination of synthesis, imaging, and functional characterization organized into an evolving discovery loop. Synthesis of a new material is typically followed by a set of characterization steps aiming to provide feedback for optimization or to uncover fundamental mechanisms. However, the sequence of synthesis and characterization methods and their interpretation, or research workflow, has traditionally been driven by human intuition and is highly domain specific. Here, we explore concepts of scientific workflows that emerge at the interface between theory, characterization, and imaging. We discuss the criteria by which these workflows can be constructed for the special cases of multiresolution structural imaging and functional characterization, as a part of more general material synthesis workflows. Some considerations for theory–experiment workflows are provided. We further posit that the emergence of user facilities and cloud labs disrupts the classical progression through the ideation, orchestration, and execution stages of workflow development. To accelerate this transition, we propose a framework for workflow design, including universal hyperlanguages describing laboratory operation, ontological domain matching, reward functions and their integration between domains, and policy development for workflow optimization. These tools will enable knowledge-based workflow optimization and lateral instrumental networks; support sequential and parallel orchestration of characterization between dissimilar facilities; and empower distributed research.</p>]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Scientific progress is inherently linked to the development and utilization of progressively more complex methods for the synthesis, imaging, and functional characterization of materials, from simple human-eye examination and macroscopic property measurements to bespoke electron <ref type="bibr">1</ref> and scanning probe microscopes (SPMs), <ref type="bibr">2</ref> scattering facilities, <ref type="bibr">[3]</ref><ref type="bibr">[4]</ref><ref type="bibr">[5]</ref> and low-temperature quantum measurements. <ref type="bibr">[6]</ref><ref type="bibr">[7]</ref><ref type="bibr">[8]</ref> These imaging and characterization techniques, in turn, provide feedback for material synthesis optimization, <ref type="bibr">7</ref> enable refinement of theoretical models, <ref type="bibr">9</ref> and often lead to serendipitous discoveries. <ref type="bibr">10,</ref><ref type="bibr">11</ref> The role of tool development in science is reflected in the renowned quote from Freeman Dyson, one of the leading physicists of the 20th century: "New directions in science are launched by new tools much more often than by new concepts. The effect of a concept-driven revolution is to explain old things in new ways. The effect of a tool-driven revolution is to discover new things that have to be explained." <ref type="bibr">12</ref> Present-day material discovery, design, and optimization are based on a well-established, centuries-old paradigm of serendipitous discovery of materials with useful functionalities, followed by long and time-consuming sequential optimization of compositions and processing conditions toward target functionalities. However, this approach tends to be extremely inefficient in systems with multiple functionalities that achieve their optimal properties in different parts of multicomponent phase diagrams or synthesis parameter space. 
One class of material systems where these limitations are particularly important is the hybrid perovskites for solar cells and other optoelectronic applications. <ref type="bibr">[13]</ref><ref type="bibr">[14]</ref><ref type="bibr">[15]</ref><ref type="bibr">[16]</ref><ref type="bibr">[17]</ref> Similar challenges emerge for other multicomponent materials and devices, including Li-ion batteries, <ref type="bibr">18</ref> metallurgy, <ref type="bibr">19</ref> high-entropy alloys, <ref type="bibr">20</ref> and many functional ceramics and glasses. <ref type="bibr">21</ref> Material synthesis and characterization workflows typically emerge within specific domain areas, such as the epitaxial thin film growth community, <ref type="bibr">22</ref> Li-ion batteries, <ref type="bibr">23</ref> hybrid perovskite solar cells, <ref type="bibr">16,</ref><ref type="bibr">24</ref> crystal growth in condensed matter physics <ref type="bibr">25</ref> or radiation detectors, <ref type="bibr">26</ref> and many others. For many of these fields, the workflows are inherently familiar to any practitioner and often define the field itself. Once novel or improved characterization tools appear, the workflows adapt to balance the availability of the new tools, perceived gains in knowledge, potential for discovery, and costs in terms of time and expenses. This balancing is almost invariably based on intuitive decision making and is constrained by the availability of characterization tools, expected waiting times, and costs.</p><p>The new opportunity in the experimental domains is the rise of automated experiments (AEs), <ref type="bibr">27,</ref><ref type="bibr">28</ref> where artificial intelligence/machine learning (AI/ML) methods are used both to enable automation, reducing latency within a domain, and to guide the discovery workflow. 
The combination of these two concepts gives rise to automated laboratories for the discovery of new materials for pharmaceutical, biological, and energy applications, including solar cells. Despite some early demonstrations, this concept became mainstream only in the last five years, as a result of large-scale efforts by Cronin et al., <ref type="bibr">[29]</ref><ref type="bibr">[30]</ref><ref type="bibr">[31]</ref> Maruyama et al., <ref type="bibr">27,</ref><ref type="bibr">[32]</ref><ref type="bibr">[33]</ref><ref type="bibr">[34]</ref> Aspuru-Guzik et al., <ref type="bibr">33,</ref><ref type="bibr">35,</ref><ref type="bibr">36</ref> Abolhasani et al., <ref type="bibr">[37]</ref><ref type="bibr">[38]</ref><ref type="bibr">[39]</ref> and others, <ref type="bibr">17,</ref><ref type="bibr">[40]</ref><ref type="bibr">[41]</ref><ref type="bibr">[42]</ref><ref type="bibr">[43]</ref><ref type="bibr">[44]</ref><ref type="bibr">[45]</ref><ref type="bibr">[46]</ref><ref type="bibr">[47]</ref><ref type="bibr">[48]</ref><ref type="bibr">[49]</ref><ref type="bibr">[50]</ref><ref type="bibr">[51]</ref><ref type="bibr">[52]</ref><ref type="bibr">[53]</ref><ref type="bibr">[54]</ref><ref type="bibr">[55]</ref><ref type="bibr">[56]</ref> as well as (less advertised) efforts in big pharma. Over the last three years, efforts in small-scale laboratory-based automated experimentation via solution-synthesis robots have resulted in advances in high-throughput and combinatorial studies of hybrid perovskite materials by Ahmadi et al., <ref type="bibr">[57]</ref><ref type="bibr">[58]</ref><ref type="bibr">[59]</ref><ref type="bibr">[60]</ref> Brabec et al., <ref type="bibr">61,</ref><ref type="bibr">62</ref> and others. 
However, simple acceleration of material synthesis by the typically reported 10× and even two to four orders of magnitude <ref type="bibr">[63]</ref><ref type="bibr">[64]</ref><ref type="bibr">[65]</ref> for individual steps is insufficient compared to the vastness of the composition and processing spaces of multicomponent materials, necessitating the development of workflows that efficiently guide the synthesis and experimental protocols based on the results of previous experiments and general domain knowledge.</p><p>As an additional consideration, the emergence of new tools gives rise to new opportunities for workflow development. For scanning probe microscopy, examples include the development of single-molecule unfolding spectroscopy, which has opened a pathway to study the kinetics and thermodynamics of single-molecule reactions using benchtop tools, <ref type="bibr">66,</ref><ref type="bibr">67</ref> piezoresponse force and electrochemical strain microscopies, which have enabled quantitative studies of bias-induced phase transitions and electrochemical reactions at the single-defect level, <ref type="bibr">[68]</ref><ref type="bibr">[69]</ref><ref type="bibr">[70]</ref> and scanning tunneling microscopy for exploring quantum physics <ref type="bibr">71,</ref><ref type="bibr">72</ref> and chemical reactions <ref type="bibr">73</ref> at the single-atom level and enabling atomic fabrication. <ref type="bibr">74</ref> For electron beam methods, examples include cryo-electron microscopy, <ref type="bibr">75</ref> which enabled mapping of protein structures and hence accelerated drug discovery, and electron diffraction, <ref type="bibr">76</ref> which allowed acquisition of diffraction data from very small crystals for crystal structure determination. Examples abound in other fields. It is also important to note that many imaging tools can also be used for material synthesis, including dip-pen lithography <ref type="bibr">77</ref> and electron beam atomic manipulation. 
<ref type="bibr">[78]</ref><ref type="bibr">[79]</ref><ref type="bibr">[80]</ref> Similarly, local methods can be seamlessly combined with well-established combinatorial spread libraries <ref type="bibr">81</ref> to establish characterization feedback and can be further integrated with laboratory robotics. <ref type="bibr">82</ref> However, until now these developments have been largely ad hoc. The workflows have been developed in individual fields <ref type="bibr">83,</ref><ref type="bibr">84</ref> and have grown and evolved through multi-year community-wide processes. New techniques give rise to fundamentally new scientific opportunities, often with rapid growth, but the discovery of these opportunities is often a black swan event rather than the result of long-term community-wide planning. Most importantly, in the everyday activity of research groups across academia, government labs, and industry, the choice of measurements and characterization tools is determined by tradition far more than by planning or analysis of possible gains and costs.</p><p>Workflow development can be subdivided into several elements, including ideation, orchestration, and implementation, as defined in Table <ref type="table">I</ref>. Traditionally, all three elements are human based, and a scientific career starts with implementation and progresses to orchestration and ideation. The rapid emergence of networks of scientific user facilities and cloud laboratories disrupts this progression, allowing the implementation of workflows via computerized orchestration agents directing human and automated equipment. The development of automated labs has spurred the growth of instrument-level drivers and enterprise-level software ecosystems, allowing orchestration of the operation of multiple tools and the creation of software layers for data storage and analytics. 
<ref type="bibr">[85]</ref><ref type="bibr">[86]</ref><ref type="bibr">[87]</ref><ref type="bibr">[88]</ref> The key step now is to develop systematic ways to design, implement, and build characterization workflows and to determine the gains in terms of materials and physics discovery. <ref type="bibr">83</ref> Ideally, we want to determine the sequence of synthesis and measurements in an optimal way, balance the cost of the tools and the required characterization times against the knowledge or other gains, and use this to develop characterization workflows. We further want to adapt the workflows to the emergence of new characterization and synthesis tools, and therefore to estimate the potential benefits of their introduction.</p><p>Here, we describe scientific workflows, analyze their ideation and optimization for the specific case of characterization methods, and explore combined workflows containing characterization and synthesis components. We explore some elements of workflow building when synthesis, characterization, and theory components are present. Finally, we analyze how these components can be used to enable the next generation of scientific research, including the orchestration of geographically distributed synchronous multimodal characterization workflows, lateral instrument networks, and the emergence of distributed experimental workflows across enterprise-level and community-level facilities. To accelerate the adoption of these tools by the community, we propose a framework for workflow design, including the development of an appropriate hyperlanguage and ontological connections between domains, identification of specific hierarchical reward functions, and policy development.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. CLASSICAL SCIENTIFIC WORKFLOWS: IDEATION, ORCHESTRATION, AND IMPLEMENTATION</head><p>To illustrate the general concept of a scientific workflow, and some of the general principles of scientific workflow design, here we discuss several examples from areas with which the authors are familiar. However, similar elements and constructs can be identified across multiple other domain areas.</p><p>Shown in Fig. <ref type="figure">1</ref> is an example of workflow development for hybrid perovskite synthesis in the Ahmadi lab.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Workflow</head><p>The sequence of steps, processes, or operations performed by a human or an ML agent when working in the lab or running the microscope to complete a specific task or achieve a particular goal. It outlines the sequence of tasks, their order, and the inputs and outputs of each task.</p><p>Ideation: The process of generating and developing new ideas and approaches for workflow design and optimization. The goal of workflow ideation is to produce innovative and effective workflow designs that can achieve the target goal (reward) more effectively.</p><p>Orchestration: Management and coordination of a series of interconnected tasks or processes defined within the workflow that need to be executed in a specific order to achieve a particular goal or outcome. It involves the automation and monitoring of workflows, ensuring that each task is completed correctly and that the overall workflow runs smoothly.</p><p>Execution: Performing specific steps or tasks in the workflow.</p><p>Hyperlanguage: The language expressing the operations that can be performed in the lab or on the microscope, defining the actions to be taken and their parameters. The hyperlanguage for an ML agent needs to be expressed via an API controlling the instrument.</p><p>Reward: The perceived goal of the experiment. This can be discovery of a material with specific functionality, optimization, or deriving general knowledge. The workflow is designed to maximize the reward. Rewards in scientific workflows are often hierarchical in nature.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Policy</head><p>The set of rules that defines the actions taken depending on the state of the system. An important aspect of policies is the balance between exploration and exploitation, i.e., between exploring new regions of parameter space and maximizing a specific reward.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Value</head><p>As adapted from reinforcement learning, the expected long-term reward or return that an agent can achieve by taking a particular action in a given state of the environment. The value of an action is calculated based on the expected sum of future rewards that the agent can receive from that action and all subsequent actions that follow.</p><p>Shown in Fig. <ref type="figure">1</ref> is the material skeleton of the workflow, meaning the pathway the materials follow during the experiment. Note that a workflow can also be defined in terms of human or instrument time, depending on the specific problem to be solved. The initial points of the workflow development are the initial solutions and antisolvents. These can be used in a pipetting robot to prepare combinatorial libraries spanning the composition space of the hybrid perovskite, antisolvent compositions, etc. The optical bandgap energy, material stability, and quantum yield within the libraries are, in turn, assessed via the time evolution of photoluminescence and UV-visible absorption spectroscopies. Based on the results, specific compositions can be chosen for imaging studies with high spatial resolution. Similar solutions can be used for single-crystal growth. The crystals can be fabricated into devices for radiation sensors and explored via photo-Hall effect spectroscopy, neutron scattering, or Mössbauer spectroscopy. The crystal-based devices can, in turn, be visualized in operando via scanning probe microscopy or time-of-flight secondary ion mass spectrometry (ToF-SIMS). The same initial reagents can be used to deposit thin films. The films, in turn, can be further fabricated into device structures that can undergo physical testing. The microstructure of the films can be explored using cathodoluminescence (CL) imaging, ToF-SIMS, and scanning probe microscopy. 
Finally, devices can be characterized in operando using the same techniques as the crystals.</p><p>The characteristic aspect of this workflow is the progression from rapid, low-cost, high-throughput methods that provide limited, low-fidelity information on specific functionalities to expensive and slow characterization methods for the end materials or devices. In this process, the orchestration of the workflow includes multiple decision-making and feedback steps determining which compositions to select for complex characterization. Note that the decision making is often nonlinear and gives rise to multiple feedback cycles at different latencies and degrees of analysis. For example, the results of the photoluminescence (PL) screening can be used for composition selection for film deposition, while the CL and ToF-SIMS imaging of the films can inform composition selection for the robotic synthesis or the choice of the initial endmembers or solvents. This decision making is further informed by the general information available to the researchers and adjusted as a result of interactions with the scientific community via publications, conferences, social networks, interactions with large language models (LLMs) optimized for generating hypotheses (a scientific version of ChatGPT), and private communications. We note that very recently the first systems allowing for such development have been introduced. <ref type="bibr">[91]</ref><ref type="bibr">[92]</ref><ref type="bibr">[93]</ref> The second example, shown in Fig. <ref type="figure">2</ref>, is workflow development in imaging. <ref type="bibr">83</ref> Note that this workflow is a hierarchical <ref type="bibr">94</ref> element of the workflow in Fig. <ref type="figure">1</ref> and can apply to cathodoluminescence, ToF-SIMS, or scanning probe microscopy. [FIG. <ref type="figure">2</ref> caption: Workflow development in imaging. Here, the decision making includes the selection of specific regions for high-resolution studies and subsequently for spectroscopic probing. This process is often iterative and includes multiple overview scans, zoom-in and zoom-out stages, and human-driven spectrum acquisition and hyperspectral imaging. Note that for many instruments, the workflow will also include repetitive tuning of instrument parameters to maximize instrument performance. In this case, the workflow skeleton represents the sequence of operations performed by the microscope, and workflow orchestration is effectively a stochastic optimization process. The nature of the reward function and the values of the individual steps depend on the human operator.]</p><p>Here, the natural backbone of the process is the series of operations performed by the microscope based on input from the human operator and internal feedback. The initial state of the system comprises the chosen sample and operator knowledge. The instrument operation includes tuning of the imaging conditions and overview scanning. Based on observations, the operator makes a decision to zoom in on specific regions to explore them in detail, and potentially to zoom out and zoom in on a different region and perform spectroscopic measurements. We note that for unknown systems (for which there are no prior measurements or sufficient literature data), the selection of zoom-in regions is usually based on the educated guess of the operator and what they consider to "look interesting." Depending on the type of measurement and the instrument configuration, the operator can perform spectroscopic measurements on a grid (hyperspectral imaging). The process can also include additional microscope tuning steps, with the human operator selecting when to introduce them based on their assessment of the observed data.</p><p>There are three pertinent elements of the described process. 
The first is a clearly defined hyperlanguage that describes the human-initiated high-level operations executed sequentially (Table <ref type="table">I</ref>). This language can differ between domain areas, but for most human-operated tools the set of elementary commands is similar. The second is the workflow orchestration based on the responses generated during the experiment and evaluated by the human operator, including decisions of what region to select for scanning and spectroscopy and when to tune the instrument. Note that instrument automation makes some of these operations automatic and often prompts the operator to perform them. The third and key element is the (scalar or multi-objective) reward function. An operator interested in the mechanical properties of a material will be interested in different objects than a person exploring the emergence of ultrahigh electromechanical responses. Similarly, scientists interested in ferroelastic phenomena will choose different objects for study than those interested in flexoelectric phenomena at ferroelectric domain walls. This reward function, in turn, determines the value of the individual steps for the operator and, in this fashion, guides the workflow ideation. Finally, note that scientific workflows and rewards have a clearly hierarchical character, given that the example in Fig. <ref type="figure">2</ref> represents any of the imaging techniques used as a part of the workflow in Fig. <ref type="figure">1</ref>.</p><p>A brief examination of the workflows for synthesis and characterization illustrates several common elements. The first is the emergence of a funnel character. All samples are explored using easy characterization methods such as optical microscopy (and sometimes just the human eye) or low-resolution overview scans, and this information is used to select locations for progressively more complex or expensive methods applied to a smaller number of samples or locations. 
The second and less obvious component is the perceived reward of the experiment. In some cases, this is purely curiosity-based selection; in others, it is targeted exploration of specific aspects of microstructure or material behavior, in which case the exploration pathway is driven by the specific interest of the experimentalist. The third, implicit component is the reliance on prior knowledge in selecting objects of interest, interpreting results, and establishing the reward. Similarly, discoveries can update the knowledge base; while this process is slow for characterization experiments, it strongly affects the material synthesis workflows shown in Fig. <ref type="figure">1</ref>.</p><p>The two examples above illustrate the workflow concept, the organization of workflows based on materials or instrument time, and the complex hierarchical connections between the reward functions of the individual elements and the values of the specific steps and operations. Below, we explore the principles based on which these workflows can be designed and optimized, here for the specific case of characterization workflows as shown in Fig. <ref type="figure">2</ref>. However, we emphasize that these workflows are, in turn, defined only in the context of the general synthesis and characterization workflows shown in Fig. <ref type="figure">1</ref>. In other words, materials first need to be synthesized, and the motivation for synthesis is derived from a combination of a specific goal (e.g., a material for photovoltaics) and discovery. Characterization and microscopy aim to provide the feedback to the material synthesis workflow.</p><p>In discussing these workflows, we separate the components of imaging, spectroscopy, and theory. Here, imaging refers to the process of acquisition of spatially resolved structural and functional information. Spectroscopy refers to the process of detailed measurement at a specific location that is assumed to provide the desired information on material functionality. 
Note that the process is hierarchical, in the sense that imaging can be performed via the acquisition of spectra on a sample grid (hyperspectral imaging) if the spectroscopy is nondestructive. Due to the differences in cost and latency, this gives rise to such standard tasks as pan-sharpening. <ref type="bibr">95</ref> Finally, the third component is theoretical analysis, generally referring to the derivation of insight from observations and its use to modify the way the workflow is ideated. First, we discuss the principles of workflow development for pure imaging and spectroscopy scenarios and compare them to the (well explored) ideation of theoretical workflows.</p></div>
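The funnel progression described above (cheap screening of every sample, with progressively more expensive probes reserved for a shrinking shortlist) can be sketched in a few lines. All numbers here are hypothetical illustrations: the sample count, the noise levels standing in for measurement fidelity, and the per-sample costs are not taken from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground-truth figure of merit for 100 candidate compositions.
true_merit = rng.uniform(0.0, 1.0, size=100)

def measure(indices, noise, cost_per_sample):
    """Simulate one characterization stage: noisier methods are cheaper."""
    readings = true_merit[indices] + rng.normal(0.0, noise, size=len(indices))
    return readings, cost_per_sample * len(indices)

# Stage 1: cheap, low-fidelity screening of every sample (e.g., optical PL).
all_idx = np.arange(100)
screen, cost1 = measure(all_idx, noise=0.20, cost_per_sample=1.0)

# Stage 2: keep the 10 most promising candidates for a slower probe.
top10 = all_idx[np.argsort(screen)[-10:]]
mid, cost2 = measure(top10, noise=0.05, cost_per_sample=20.0)

# Stage 3: one expensive, high-fidelity measurement (e.g., ToF-SIMS).
best = top10[np.argmax(mid)]
final, cost3 = measure(np.array([best]), noise=0.01, cost_per_sample=500.0)

total_cost = cost1 + cost2 + cost3      # 100 + 200 + 500 = 800
exhaustive_cost = 500.0 * len(all_idx)  # 50000 if every sample got stage 3
print(best, total_cost, exhaustive_cost)
```

The orchestration question raised in the text is precisely where the fixed thresholds above (keep 10, then keep 1) come from; in a designed workflow they would be set by the reward function rather than by convention.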
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. HOMOGENEOUS WORKFLOWS</head><p>We define homogeneous (or single-element) multiresolution workflows as those emerging within a single type of characterization or theory connecting multiple length scales. <ref type="bibr">96,</ref><ref type="bibr">97</ref> Here, we consider the well-known multiscale workflows in theory and expand these concepts to characterization.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Multiscale workflows in theory</head><p>The concept of multiscale workflows is well developed in the theory domain. Over the years, the development and implementation of multiscale modeling <ref type="bibr">[98]</ref><ref type="bibr">[99]</ref><ref type="bibr">[100]</ref><ref type="bibr">[101]</ref><ref type="bibr">[102]</ref><ref type="bibr">[103]</ref><ref type="bibr">[104]</ref><ref type="bibr">[105]</ref> workflows have paved their way into multiple disciplines, ranging from solid mechanics, fluid mechanics, materials science, physics, and mathematics to biology and chemistry. Parallel computing has made it increasingly feasible to solve the more accurate and precise algorithmic formulations that these workflows require. In addition to being useful in academic research, such modeling capabilities have been adopted in industry for numerous advantages, such as cost-effective physics-based product design and assessment of product quality and performance. The primary requirement for any multiscale workflow is to lay out strategies for bridging different length scales, going from atoms to automobiles. We note that this differs from the conventional point of view followed in various disciplines, where the focus remains on solving a particular challenge with sole consideration of the pertinent length scale and associated latencies.</p><p>The two most common approaches followed in the multiscale paradigm are concurrent <ref type="bibr">106,</ref><ref type="bibr">107</ref> and hierarchical <ref type="bibr">94,</ref><ref type="bibr">108</ref> in nature. Methods to bridge between length scales vary between the two. 
In concurrent techniques, the bridging methods depend on numerical solutions, whereas hierarchical approaches rely on performing independent numerical simulations at different length scales, followed by identifying relations between parameters relevant to integrating or reconstructing material behavior at the corresponding higher layer in the ladder of length scales. The hierarchical approach is top-down in nature. Understanding microstructure-property relationships inferred from the interplay of internal state variables existing at various scales under thermodynamic constraints is a good illustration of such an approach. One example where concurrent and hierarchical schemes are practiced together <ref type="bibr">101,</ref><ref type="bibr">[109]</ref><ref type="bibr">[110]</ref><ref type="bibr">[111]</ref> is in connecting atomistic simulations with first-principles electronic structure simulations. Density functional theory (DFT) simulations performed on metals are coupled with embedded atom method (EAM) potentials within a molecular dynamics (MD) environment to model edge dislocations, with subsequent studies performed with quantum models at disparate length scales. Here, the many-body interactions are evaluated within the semiempirical formalisms of the potentials such that the results of the electronic structure computations become useful for reproducing the physical properties of many metals, defects, and impurities. In addition, it is also possible to conduct such multiscale studies on the fly, where the classical potential adapts to the local environment via dynamic force matching. 
Several machine learning frameworks have also proven useful for developing these potentials <ref type="bibr">[112]</ref><ref type="bibr">[113]</ref><ref type="bibr">[114]</ref><ref type="bibr">[115]</ref><ref type="bibr">[116]</ref><ref type="bibr">[117]</ref><ref type="bibr">[118]</ref> within an interactive suite bridging atomistic and coarse-grained descriptions, with the promise of connecting to continuum theories.</p><p>Machine learning methods have significantly accelerated the development of multiscale workflows for theory. Multiple approaches have been used, ranging from rare-event sampling in molecular dynamics (MD) simulations to machine learning-based information compression schemes. <ref type="bibr">119,</ref><ref type="bibr">120</ref> Particularly over the last five years, a number of machine learning approaches based on variational autoencoders, <ref type="bibr">121</ref> generative adversarial networks, and diffusion models have been suggested to bridge length and time scales in simulations, establish statistically significant descriptors such as order parameters, and determine their constitutive relations. It should be noted that many of these methods have also been demonstrated for information compression in experimental data such as electron and scanning probe microscopy, <ref type="bibr">[122]</ref><ref type="bibr">[123]</ref><ref type="bibr">[124]</ref> albeit with the additional requirement to account for out-of-distribution shifts due to changes in imaging conditions that are typically absent in modeling (Fig. <ref type="figure">3</ref>).</p></div>
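The hierarchical bridging idea above (independent fine-scale runs fitted to a relation that a coarser model then consumes) can be sketched as follows. The "fine-scale run" is a toy stand-in for a DFT/MD calculation, and the linear softening law, noise level, and elastic-modulus numbers are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# --- Fine scale: independent "atomistic" runs (stand-ins for DFT/MD). ---
# Hypothetical output: an elastic modulus (GPa) at a given temperature (K).
temperatures = np.linspace(100.0, 900.0, 9)

def fine_scale_run(T):
    # Toy physics: modulus softens linearly with T, plus run-to-run noise.
    return 200.0 - 0.05 * T + rng.normal(0.0, 1.0)

moduli = np.array([fine_scale_run(T) for T in temperatures])

# --- Bridging: fit a constitutive relation E(T) from the fine-scale data. ---
coeffs = np.polyfit(temperatures, moduli, deg=1)
E_of_T = np.poly1d(coeffs)

# --- Coarse scale: a continuum-style calculation consumes E(T) directly ---
# (axial stress of a bar under fixed strain at an operating temperature).
strain = 1e-3
stress_at_500K = E_of_T(500.0) * strain
print(coeffs[0], stress_at_500K)
```

The key structural point is the one made in the text: the two scales never run in the same solver; only the fitted parameters cross the boundary between them.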
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Multiresolution characterization workflows</head><p>The nature of the multiscale problem in experiment is opposite in the sense of information flow. Multiscale characterization in experiment requires solving an inverse problem, namely, optimal hierarchical experiment design. For example, given the results of macroscopic measurements or low-resolution imaging, we seek to identify potential objects of interest to be explored via high-resolution imaging probes in such a way as to gain the maximal insight into the nature and origins of the observed macroscopic behaviors.</p><p>To set this problem on a more mathematical basis, we consider it as one of optimizing a sequence of actions to maximize a cumulative reward function (Fig. <ref type="figure">4</ref>). <ref type="bibr">125</ref> The reward function can be defined in multiple contexts, from pure optimal structure discovery (we aim to characterize the structure in as much detail as possible via reduced multiscale representations) to discovery based on prior knowledge (we know what microstructural objects we are interested in). In the case of information gain, one can simply seek to minimize the uncertainty arising from models at different length scales by finding the sequence of measurements that best reduces the overall uncertainty across length scales. Alternatively, a measurement sequence can be found that minimizes the uncertainty at one desired length scale. Constructing a reward function for this can be as simple as taking the negative of the sum of the uncertainties. In a reinforcement learning (RL) framework, the actions consist of both the length scale to explore next in the workflow and the parameters within the specific experiment. For simplicity, we will ignore the latter component and consider that the only action is to choose the appropriate length scale. 
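As a minimal illustration of this formulation, the following sketch treats the vector of per-level predictive uncertainties as the state, the choice of length scale as the only action, and the negative total uncertainty as the reward. The environment dynamics and the cross-scale coupling value are invented for illustration; this is not an implementation from the reviewed work.

```python
import numpy as np

# Toy environment: the only action is which length scale (level) to measure
# next. Measuring a level halves its own uncertainty and slightly reduces
# the uncertainty at other (correlated) levels. All dynamics are assumed.
class LengthScaleEnv:
    def __init__(self, n_levels=3, coupling=0.3):
        self.n_levels = n_levels
        self.coupling = coupling          # assumed cross-scale correlation
        self.reset()

    def reset(self):
        self.uncertainty = np.ones(self.n_levels)   # start fully uncertain
        return self.uncertainty.copy()

    def step(self, level):
        for k in range(self.n_levels):
            factor = 0.5 if k == level else (1.0 - 0.5 * self.coupling)
            self.uncertainty[k] *= factor
        reward = -self.uncertainty.sum()  # reward = negative total uncertainty
        return self.uncertainty.copy(), reward

env = LengthScaleEnv()
state = env.reset()
total = 0.0
for _ in range(5):
    action = int(np.argmax(state))        # greedy: probe most uncertain level
    state, reward = env.step(action)
    total += reward
```

A trained RL policy would replace the greedy rule with one that maximizes the cumulative (possibly cost-penalized) reward.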
The task is to find a policy that will determine how best to select actions to maximize the cumulative rewards. Additionally, it is known that measurements at one length scale can be highly informative of measurements at other length scales.</p><p>For this scenario, the knowledge discovery process can be represented via a probabilistic machine learning (PML) framework based, for example, on a Gaussian process, a Bayesian neural network, or deep kernel learning (DKL) (Fig. <ref type="figure">5</ref>), with the idea that the data can be mapped from one length scale to another. Assume that we have training data X captured at the nth level, X_n. We can generate predictions on the test data X_n*, obtaining the predictive mean f_n* and the predictive variance V(X_n*).</p><p>As an example, we consider that we take a few optical images at certain positions (X_n), and then want to predict the image at unseen locations (X_n*). The predicted images are f_n*, and the predictive uncertainty is given by V(X_n*). However, we can also set up models that attempt to predict at the next (e.g., lower) length scale. By feeding in the test data X_1*, we can train a PML model to predict the function value at, e.g., the next level, i.e., X_1, where the measurement at level m provides estimates of the function value at level k along with uncertainty estimates at level k. The task is to determine which measurement level m (out of the n available levels) to choose that will minimize the uncertainty in V(·), which could be the uncertainty at one level, at all the levels, or at some combination of levels, e.g., a weighted sum. This means we need to know which V_k, or combination thereof, to use. Minimizing this uncertainty will be the objective of the policy. 
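A sketch of such a PML prediction step, with a plain Gaussian process standing in for the GP/BNN/DKL options named above (the data, kernel, and all values are illustrative assumptions, not from the reviewed experiments):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Sparse level-n measurements X_n -> y_n (synthetic stand-in data).
rng = np.random.default_rng(1)
X_n = rng.uniform(0, 10, size=(15, 1))            # measured positions
y_n = np.sin(X_n).ravel() + 0.05 * rng.standard_normal(15)

# Fit a GP surrogate; the kernel choice is an assumption for the sketch.
gp = GaussianProcessRegressor(kernel=RBF(1.0) + WhiteKernel(1e-3),
                              normalize_y=True).fit(X_n, y_n)

# Query unseen locations X_n*: predictive mean f_n* and sqrt of V(X_n*).
X_star = np.linspace(0, 10, 100).reshape(-1, 1)
f_star, std_star = gp.predict(X_star, return_std=True)
```

The predictive variance `std_star**2` plays the role of V(X_n*) in the text; a cross-scale model would be trained analogously on pairs of observations at different levels.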
After fixing the reward function, the workflow reduces to a simple reinforcement learning (RL) environment in which the action is to decide which measurement (level) to capture, which will then enable the appropriate PML(m) model to be updated. The state fed back to the agent will then be the set of predicted means, i.e., f_k(X*) for 1 ≤ k ≤ n. In the case where we do not know which V_k to use, this can be reformulated as an RL problem, with the difference that the action now selects not only the measurement level but also which uncertainty map level is used. Alternatively, one may also consider some superposition of the uncertainties from different levels. Additionally, given that different experiments will likely differ greatly in their actual cost per measured data point, this can also be factored into the reward function: penalties can be applied to actions or action sequences that use "expensive" characterization tools, for instance. Regardless, the optimization is straightforward once the reward is defined, similar to our recent work with hypothesis learning. <ref type="bibr">46,</ref><ref type="bibr">126</ref> As a practical example, we consider that we might have optical images, SEM images with some chemical maps (e.g., from energy dispersive spectroscopy, EDS), and some scanning tunneling microscopy images of a material system, such as a substrate with 2D flakes of varying chemical composition. This defines three levels, and the question is which set of experiments will be most informative. In this case, optical microscopy of similar-looking flakes is not likely to be highly correlated with local electronic structure as imaged by STM if these flakes differ in chemical composition and defect concentration. On the other hand, there will be a significant correlation with the EDS data on the same flakes. As such, after initial correlations between STM, optical, and SEM/EDS data are found, minimizing the total uncertainty may hinge more on performing a few experiments with SEM/EDS and confirming them with STM. Our problem is to determine exactly which of these measurements are most informative, and this can be done, in principle, via the PML framework described above. (FIG. <ref type="figure">4</ref>. The workflow for structural imaging. Here, the imaging studies are performed increasing the resolution of the microscope (and changing the imaging system). The decision making along the workflow includes the selection of regions for detailed studies at a progressively higher level of detail. The decision making along the workflow can be purely data driven (green lines), or incorporate decisions made with prior knowledge and informed by perceived reward.)</p></div>
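A cost-penalized reward of the kind described above could be sketched as follows. The instrument names follow the optical/SEM-EDS/STM example; the relative costs, weights, and penalty factor are invented for illustration:

```python
import numpy as np

# Assumed relative costs per measurement for the three levels of the example.
COST = {"optical": 1.0, "SEM/EDS": 5.0, "STM": 25.0}

def reward(unc_before, unc_after, instrument, weights=None, penalty=0.1):
    """Weighted drop in per-level uncertainty minus the instrument's price."""
    unc_before = np.asarray(unc_before, dtype=float)
    unc_after = np.asarray(unc_after, dtype=float)
    if weights is None:
        weights = np.ones_like(unc_before)   # which V_k to use (here: all)
    info_gain = float(weights @ (unc_before - unc_after))
    return info_gain - penalty * COST[instrument]

# Same information gain, different instruments: the cheap one wins.
r_cheap = reward([1.0, 1.0, 1.0], [0.9, 0.8, 1.0], "optical")
r_costly = reward([1.0, 1.0, 1.0], [0.9, 0.8, 1.0], "STM")
```

With such a reward, an RL agent would learn to reserve "expensive" tools (STM here) for steps where the expected uncertainty reduction justifies the cost.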
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Beyond data-driven discovery</head><p>Recent advances in imaging and characterization tools, including electron and scanning probe microscopy with the associated spectroscopies, atom probe tomography, focused x-ray scattering, nanoindentation, optical microscopy, and a gamut of electrical, mechanical, and magnetic testing methods, span a multitude of length scales and functionalities. However, this gamut of available techniques is not matched by systematic workflows that allow exploring the behaviors of interest in a systematic and unbiased way. It is by now common to complement the macroscopic probing of piezoelectric, catalytic, and electric properties with STEM or atom probe tomography studies of atomic structures, or to correlate the photovoltaic performance of polycrystalline solar cells with nanometer-resolution cathodoluminescence and chemical imaging maps.</p><p>However, it is seldom that we can say with certainty which specific type of microstructural element is most strongly associated with the functionalities of interest. In the cases where such relationships have been established, as for the role of step edges in catalysis, the dislocation theory of plasticity, or the role of domain wall dynamics in the giant electromechanical responses of piezoelectric ceramics, these discoveries required community-wide efforts involving multiple experiment and theory development cycles. At the same time, it is these insights that are most relevant for understanding the underpinning mechanisms and particularly for establishing pathways for material optimization and discovery. Only by understanding the fundamental origins of structure-property relationships can strategies for improving material performance be formulated and tested.</p><p>For example, given a polycrystalline ferroelectric material, we may seek to understand the origins of its high electromechanical response or resistive switching. 
Given a hybrid perovskite solar cell, we want to understand which microstructural elements are responsible for chemical stability, current-voltage (IV) hysteresis, or the open-circuit voltage (OCV) losses compared to the ideal values. The complexity of such analysis stems from the fact that different defect populations may ultimately be responsible for these behaviors, and hence the selection of regions of interest for detailed studies depends on the specific goals. In this sense, the problem is poorly defined: we seek to preferentially explore, via detailed high-resolution studies, objects whose identity we do not know. Hence, we use hypotheses formed from the prior body of knowledge to guide the exploration process, while maintaining room for serendipitous discoveries.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Reward design</head><p>These considerations bring us further to the concept of reward engineering. Methods such as reinforcement learning (RL) have been shown to be highly effective in simulated environments such as computer games or simulations. <ref type="bibr">[127]</ref><ref type="bibr">[128]</ref><ref type="bibr">[129]</ref> One of the key elements of RL is a reward function that is made available to the algorithm during training. However, for many real-world problems, reward functions available at the end of an experimental campaign (or after several steps) are absent; rather, the experiments are motivated by long-term objectives. Designing a reward function that adequately represents the real-world objective and does not lead to reward hacking <ref type="bibr">130,</ref><ref type="bibr">131</ref> is a challenge. Similarly, experimental results can very often contribute to multiple objectives, with fundamental scientific research being the most notable example of such activity.</p><p>As an example of such a problem, we consider climate change, a problem motivating multi-billion-dollar investments around the globe. Minimizing climate change is a very long-term objective. The lower-rank objectives are the development of solar and wind energy with the associated grid-level storage and effective energy transport methods, along with technologies for direct carbon capture. Still lower-rank objectives are the development of cheap, environmentally friendly, and stable chemistries for grid storage. None of these objectives can be translated into a reward for an experimental campaign. 
Rather, these objectives serve as a motivation for experiment planning, and the reward is often short-term battery performance or the observation of a specific mechanism in the microscope that can suggest potential ways to improve the battery materials.</p><p>We pose that the discovery of short-term rewards that can be used for hypothesis making to guide experimental research, and as reward functions to guide and ascertain the success of experimental campaigns, is the missing link required to connect ML to real-world applications. Potential pathways to address this challenge include literature mining toward building the DAGs connecting experimental outcomes (rewards) and objectives (motivation), technoeconomic analysis of the outcomes of past publications, and crowdsourcing to the community of experts (from "what would be the potential of high-temperature superconductivity to change the world" to "how does the phase separation in cuprates affect the peak effect and losses").</p><p>The key consideration for reward-driven workflow design is the capability to separate the specific objective into a probabilistic graph of short-term reward functions that can guide experiment planning and establish measures of success. Naturally, these reward functions will be probabilistic, and the value of a real-world experiment can affect (much) more than one objective. For example, the mechanisms of metal-air interactions can be used both for corrosion mitigation and for the design of metal-air batteries. An important element of this approach is that humans are part of the theory-experiment loop, and hence the structure of the rewards can be amended via human feedback on the observations (much like science works now).</p><p>Notably, LLMs such as ChatGPT are often capable of making the connection between higher- and lower-level objectives (e.g., the prompt "what should I study with a microscope to understand plasticity" gives very plausible answers). 
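The decomposition of a long-term objective into lower-rank objectives and short-term, measurable rewards can be represented as a DAG, as suggested above. A minimal sketch using Python's standard-library graphlib; all node names are illustrative placeholders following the climate-change example, not a proposed taxonomy:

```python
from graphlib import TopologicalSorter

# Each node maps to the higher-level objective(s) it serves; short-term,
# measurable rewards sit at the bottom of the hierarchy. Names are assumed.
objectives = {
    "mitigate climate change": set(),
    "grid-level storage": {"mitigate climate change"},
    "stable storage chemistry": {"grid-level storage"},
    "short-term battery cycling metric": {"stable storage chemistry"},
    "microscopy observation of degradation mechanism": {"stable storage chemistry"},
}

# static_order() walks from the long-term motivation down to the short-term
# rewards that serve it, giving a valid chain along which experimental
# outcomes can be propagated back toward the top-level objective.
order = list(TopologicalSorter(objectives).static_order())
```

Literature mining or expert crowdsourcing would populate such a graph, and edge weights (omitted here) would carry the probabilistic contribution of each reward to its parent objectives.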
Presumably, complementing LLMs with models trained on domain-specific literature can allow both the systematic development of such workflows and their integration across multiple domains following common rewards.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. MULTIRESOLUTION STRUCTURE-PROPERTY CHARACTERIZATION WORKFLOWS</head><p>It is well recognized that understanding structure-property relationships in materials requires exploring properties and functionalities on all length scales, from atomic-scale structures through the mesoscale to the global properties of the material or device. Multiresolution analysis workflows are used when, for each characterization method, we have access to image-spectroscopy pairs, and the spectroscopic data provide information that is predictive of the functionality of interest. Here, we define as the image the high-spatial-resolution, low-information-density imaging, including structural STEM images, SPM images, etc. The spectroscopy refers to information-rich local measurements that are associated with longer measurement times or lead to irreversible changes in the material structure, and can therefore be performed only in a limited number of locations. However, we implicitly assume that the spectroscopic measurements are correlated with the macroscopic functionality of interest.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Single-step direct workflow</head><p>Here, we first illustrate these concepts for a single structure-property mapping step. In this case, we can define forward and inverse experiments. The forward experiment relies on a priori defined objects of interest that can be recognized in real time, e.g., using deep convolutional networks. Here, the emergence of ensemble and iterative training methods has allowed the inevitable out-of-distribution effects (i.e., the capability of the trained network to recognize the object of interest after microscope parameters have changed) to be partially addressed. Recently, a deep residual learning framework with holistically nested edge detection (ResHedNet) was ensembled to minimize out-of-distribution drift effects in real-time SPM measurements. <ref type="bibr">132</ref> The ensembled ResHedNet was implemented on an operating SPM, where it converted the real-time SPM data stream into segmented objects of interest, e.g., ferroelastic domain wall or polycrystal grain boundary images. A pre-defined workflow then used the coordinates of the discovered objects for spectroscopic measurements. In doing so, the approach allowed a thorough investigation of the objects of interest (virtually all locations at the objects of interest) in an automated manner; in contrast, traditional manual operation allows only a limited number of locations at domain walls to be investigated. Using this approach, alternating high- and low-polarization-dynamics ferroelastic domain walls in a PbTiO3 thin film were observed <ref type="bibr">52</ref> and the behavior of grain boundary junction points in metal halide perovskites was discovered. <ref type="bibr">133</ref> </p></div>
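The forward workflow described above, ensemble consensus segmentation followed by extraction of object coordinates for spectroscopy, can be sketched in pure NumPy. The "ensemble outputs" here are synthetic stand-ins for network predictions; a real pipeline would use trained ResHedNet-style models:

```python
import numpy as np

def object_centroids(mask):
    """Label 4-connected components of a boolean mask; return centroids."""
    labels = np.zeros(mask.shape, dtype=int)
    current = 0
    for seed in zip(*np.nonzero(mask)):
        if labels[seed]:
            continue
        current += 1
        stack = [seed]
        while stack:                       # flood-fill one component
            i, j = stack.pop()
            if not (0 <= i < mask.shape[0] and 0 <= j < mask.shape[1]):
                continue
            if not mask[i, j] or labels[i, j]:
                continue
            labels[i, j] = current
            stack += [(i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)]
    return [tuple(np.mean(np.nonzero(labels == k), axis=1))
            for k in range(1, current + 1)]

# Two "ensemble members" agree on one object; majority vote over the
# averaged masks suppresses the spurious single-pixel detection.
m1 = np.zeros((8, 8)); m1[2:5, 2:5] = 1
m2 = np.zeros((8, 8)); m2[2:5, 2:5] = 1; m2[6, 6] = 1
consensus = (m1 + m2) / 2 > 0.5
coords = object_centroids(consensus)      # coordinates for spectroscopy
```

The resulting `coords` list is what a pre-defined workflow would hand to the spectroscopy routine as measurement locations.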
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Single-step inverse workflow</head><p>In the inverse experiment, the operator defines the characteristics that make a spectrum "interesting," e.g., the intensity of a specific feature, a specific aspect of the spectrum shape, or even the maximal variability of spectra within the image. In other words, each collected spectrum can be associated with a single number defining how interesting it is, either in an absolute sense or compared to previously acquired spectra. The deep kernel learning (DKL) algorithm learns which elements of the material structure maximize this reward and guides the exploration of the material surface accordingly. This DKL algorithm was recently implemented in SPM to investigate the relationship between ferroelectric domain structure and polarization dynamics, <ref type="bibr">54,</ref><ref type="bibr">134</ref> and in STEM to explore bulk and edge plasmonic modes. As shown in Fig. <ref type="figure">2</ref>, the DKL exploration process identifies the domain walls as objects of interest, and the DKL predictions indicate the high polarization dynamics of 180° domain walls. Although these findings are expected by ferroelectric experts, DKL itself did not have any prior physical knowledge and learned all that information during the experiment.</p></div>
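The inverse workflow can be sketched as an acquisition loop: each spectrum is reduced to a scalar interestingness score, a surrogate model (a plain GP here, standing in for DKL) maps structural descriptors to that score, and the next measurement location maximizes an upper-confidence-bound acquisition. The ground-truth score and all parameters below are invented for illustration:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
candidates = np.linspace(0, 1, 200).reshape(-1, 1)   # all image locations

def interestingness(x):
    """Assumed ground truth: e.g., peak intensity of the local spectrum."""
    return np.exp(-((x - 0.7) ** 2) / 0.005).ravel()

# Seed with a few randomly placed spectra, then explore actively.
X = candidates[rng.choice(len(candidates), 5, replace=False)]
y = interestingness(X)
for _ in range(10):
    gp = GaussianProcessRegressor(kernel=Matern(length_scale=0.1),
                                  alpha=1e-6, normalize_y=True).fit(X, y)
    mu, sd = gp.predict(candidates, return_std=True)
    nxt = candidates[[np.argmax(mu + 2.0 * sd)]]     # UCB acquisition
    X = np.vstack([X, nxt])
    y = np.append(y, interestingness(nxt))

best = float(X[np.argmax(y), 0])   # most "interesting" location found
```

In the DKL setting, the GP input would be a learned embedding of local image patches rather than raw coordinates, which is what lets the model generalize from structure to spectral response.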
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Multiple-step workflows</head><p>Multiple-step imaging and characterization workflows can be represented as a direct extension of single-step workflows, as shown in Fig. <ref type="figure">6</ref>. Here, structural imaging at low resolution yields the library of possible microstructural elements. We assume that the microscopic measurements, e.g., using micropatterned contact arrays and current-voltage (IV) measurements via SPM, are representative of the macroscopic properties of the system (even though the exact physical mechanisms can be different due to changes in contact conditions, confinement effects, etc.). These elements can be sampled in a statistically balanced way, e.g., based on the distributions in the latent space of the system, to give the initial information on the structure-property relationships in the system. (FIG. <ref type="figure">6</ref>. Multiple-level imaging and characterization workflow. Here, we assume that the material functionality of interest probed on the macroscopic level is controlled by the hierarchy of structural elements from macroscopic to atomic scale and that the available functional probing methods are representative of the material functionality. For example, on the mesoscale, this can be realized using the microelectrodes and the SPM-based IV measurements.)</p><p>With these, the process can be iterated, balancing structural learning and the learning of structure-property relationships on multiple length scales. For the cases when the spectroscopic measurements provide information that is a proxy for the macroscopic functionalities but does not directly reproduce them, the definition of the workflows becomes more complex and requires the incorporation of multiresolution, multifidelity measurements. 
For example, this can include the use of the easy to measure signals such as micro-Raman to identify preferential locations for expensive measurements such as scanning probe microscopy or scanning transmission electron microscopy, as has recently been explored by Kusne. <ref type="bibr">48</ref> The key emerging aspect <ref type="bibr">[135]</ref><ref type="bibr">[136]</ref><ref type="bibr">[137]</ref> will be the integration of workflows across diverse domains, including different characterization methods and synthesis. This, in turn, necessitates the employment of consistent ontologies, a critical factor for ensuring effective interdomain communication and data interoperability. Within multidisciplinary landscapes, each field typically develops its own unique lexicon, data structures, and conceptual frameworks, which can lead to significant challenges in cross-domain interactions. The disparate nature of these elements often impedes the seamless exchange and accurate interpretation of information, posing a barrier to the cohesive integration of workflows. Consistent ontologies should offer a standardized, unified schema for representing and understanding diverse data sets and inferential biases, enabling disparate systems to interact synergistically. This standardization is imperative for the accurate mapping of concepts and terminologies across different fields, ensuring not just data exchange but also its meaningful contextualization and utilization. Furthermore, the adoption of consistent ontologies facilitates scalability and adaptability within integrated systems, allowing for the incorporation of emerging knowledge and technologies.</p></div>
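The statistically balanced sampling of microstructural elements mentioned above can be sketched as clustering the latent representations of the elements and drawing measurement sites in proportion to cluster populations, so that rare motifs are not missed while prevalent ones are not oversampled. The latent point clouds below are synthetic stand-ins:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic latent space: one prevalent motif and two rarer ones (assumed).
rng = np.random.default_rng(3)
latent = np.vstack([rng.normal(0, 0.3, (80, 2)),     # prevalent motif
                    rng.normal(3, 0.3, (15, 2)),     # rarer motif
                    rng.normal(-3, 0.3, (5, 2))])    # rare motif

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(latent)
budget = 12                                          # measurement budget
picks = []
for k in range(3):
    members = np.nonzero(km.labels_ == k)[0]
    # At least one site per cluster; remainder proportional to population.
    n_k = max(1, round(budget * len(members) / len(latent)))
    picks.extend(rng.choice(members, size=min(n_k, len(members)),
                            replace=False))
```

The indices in `picks` would then be mapped back to spatial coordinates and queued for the local (e.g., SPM-based IV) measurements.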
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. SOME CONSIDERATIONS ON THEORY-EXPERIMENT-CHARACTERIZATION WORKFLOWS</head><p>Particularly interesting problems emerge at the interface between theory and experiment. The confluence of parallel computing and experimental capabilities along with theoretical simulations is necessary to develop and implement end-to-end theory-experiment-characterization workflows. It is safe to say that, within materials science challenges spanning various applications, the primary goal is to leverage existing structure-property relationships <ref type="bibr">138</ref> to propel both design and discovery. These relations can be formulated at the atomistic level, where the electronic structure of the system plays the key role in determining energetics and stability. From the mesoscale to the continuum scale, we tend to map the evolution of microstructures onto physical properties such as plasticity, damage, or failure. Standalone investigations <ref type="bibr">[139]</ref><ref type="bibr">[140]</ref><ref type="bibr">[141]</ref><ref type="bibr">[142]</ref><ref type="bibr">[143]</ref><ref type="bibr">[144]</ref><ref type="bibr">[145]</ref><ref type="bibr">[146]</ref><ref type="bibr">[147]</ref><ref type="bibr">[148]</ref> based on data from theoretical simulations <ref type="bibr">[149]</ref><ref type="bibr">[150]</ref><ref type="bibr">[151]</ref><ref type="bibr">[152]</ref><ref type="bibr">[153]</ref><ref type="bibr">[154]</ref> are well capable of elucidating the fundamental mechanisms responsible for materials characteristics, with applications in energy, catalysis, photovoltaics, and drug design, to name a few. 
<ref type="bibr">118,</ref><ref type="bibr">[155]</ref><ref type="bibr">[156]</ref><ref type="bibr">[157]</ref><ref type="bibr">[158]</ref><ref type="bibr">[159]</ref><ref type="bibr">[160]</ref><ref type="bibr">[161]</ref><ref type="bibr">[162]</ref><ref type="bibr">[163]</ref><ref type="bibr">[164]</ref><ref type="bibr">[165]</ref><ref type="bibr">[166]</ref><ref type="bibr">[167]</ref><ref type="bibr">[168]</ref><ref type="bibr">[169]</ref><ref type="bibr">[170]</ref><ref type="bibr">[171]</ref> However, the ultimate validation of such proposed mechanisms always relies on experimental observables. Hence, to fully realize the potential of workflows <ref type="bibr">161,</ref><ref type="bibr">165,</ref><ref type="bibr">168,</ref><ref type="bibr">[172]</ref><ref type="bibr">[173]</ref><ref type="bibr">[174]</ref><ref type="bibr">[175]</ref><ref type="bibr">[176]</ref> bridging instruments and theory, <ref type="bibr">174,</ref><ref type="bibr">[177]</ref><ref type="bibr">[178]</ref><ref type="bibr">[179]</ref><ref type="bibr">[180]</ref><ref type="bibr">[181]</ref><ref type="bibr">[182]</ref><ref type="bibr">[183]</ref><ref type="bibr">[184]</ref><ref type="bibr">[185]</ref><ref type="bibr">[186]</ref><ref type="bibr">[187]</ref><ref type="bibr">[188]</ref><ref type="bibr">[189]</ref><ref type="bibr">[190]</ref><ref type="bibr">[191]</ref> we must move toward theory-assisted experiments, beyond the standard perspective of matching the final outcomes of experiments and simulations. It becomes important to consider how causal structure-property relations <ref type="bibr">[192]</ref><ref type="bibr">[193]</ref><ref type="bibr">[194]</ref> may hold true at different length scales while establishing connections between experimentally controllable parameters and theoretical variables. 
Here, we consider this interaction only in the more limited context of imaging and characterization methods, but even in this case, the complexity of possible interactions is immense.</p><p>Here, we assume that we have access to the microstructure (M) at a single length scale, global property measurements (G), local property measurements (L), and a theoretical model (T). We further assume that M and G are available from the beginning of the experiment and are not updated. Comparatively, T is available in the form of an analytical or numerical model and can contain partially unknown parameters that can be updated during the experiment, and L can be performed sequentially within the known M. With this, we can define static learning problems, meaning establishing the relationships between G, M, and T after the measurements, and active learning problems, meaning the workflow design for active experiments within M. Here, we will use the arrow → to denote establishing a relationship given full batch data, and ↑ to denote an active learning workflow.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Static problems</head><p>With these building blocks, we can combinatorially define three static analysis problems. Here, MT → G is the direct calculation problem. In this case, we assume that the microstructure is known and the theory is correct, and we aim to calculate the global properties of the system. Examples of such an approach are the finite-element calculation of the mechanical properties of a composite material, the estimation of the effective transport properties of a microstructure mapped via x-ray tomography, etc.</p><p>The inverse problem is MG → T. Here, given the known microstructure and global properties, we aim to refine the theory that governs structure-property relationships in the system. Finally, GT → M defines the design problem. Given the property of interest and the theory, we aim to design a microstructure satisfying the given properties. These problems are static in nature, i.e., they do not have an iterative/active learning component. Of course, these problems become active when part of a synthesis or manufacturing workflow. Similar workflows emerge in the context of local property measurements.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Active learning problems</head><p>With these simple examples, we can define several classes of active learning problems. As mentioned above, defining active learning workflows necessitates the introduction of a reward, R, defining the discovery target. For example, M(R) ↑ L is the direct imaging problem, where we choose locations for sampling the local microstructure L given what is interesting about M. In our experience with human-driven exploration, the reward function often changes during the experiment. For example, initial measurements are done with the target of optimizing microscope performance; the experiment then proceeds to an overview image, to exploration of the most statistically prevalent regions within the image, and then to the regions that are believed to be interesting based on prior knowledge and beliefs. For example, it is natural to target dislocations for mechanical properties, or ferroelastic domain walls for understanding the origins of ultrahigh electromechanical responses.</p><p>Comparatively, L(R) ↑ M is the DKL problem, where we aim to discover locations for L given what is interesting about L and the perceived reward R. We can also envision more complex workflows, for example, ML ↑ T, meaning learning the theoretical model in an efficient manner given the microstructure and local measurements. More complex scenarios include MG(R) ↑ L, MT(R) ↑ L, and MGT(R) ↑ L, meaning that we aim to discover locations for L given what we know about the microstructure and theory and the perceived benefit for the global properties, theory, or both. 
As defined earlier, the reward can be the optimization of some characteristic or, simply, the information gain represented by the reduction in uncertainty of the DKL (or other relevant) surrogate models.</p><p>For multilevel problems, actively learning the theory at the same time will introduce additional nonstationarity into the objective, further increasing the complexity of the problem. In fact, even the simple case of co-navigation of the theoretical and experimental domains requires a detailed analysis balancing the uncertainties between the domains. <ref type="bibr">195</ref> As for single-domain workflows, the development of consistent ontologies for the description of objects (and even search spaces) is a key part of these developments.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. IMPLICATIONS</head><p>Finally, we analyze how the design and optimization of synthesis and characterization workflows can affect the structural organization of the research process. Specifically, we consider driving forces such as the increase in throughput and cost of many imaging and characterization tools, and the emergence of cloud infrastructure and edge computing that allow information processing and feedback from the cloud.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. User facilities and lateral instrumental networks</head><p>From a historical perspective, the first big changes in the scientific workflow in condensed matter physics, materials science, and biology were brought about by the emergence of large-scale tools such as synchrotrons and nuclear reactors. <ref type="bibr">196,</ref><ref type="bibr">197</ref> At that time, the concept of the user facility, in which dedicated teams develop the instruments, instrument scientists operate them, and user scientists physically visit for day- to week-long experiments on their specific materials, became the norm.</p><p>The second big change has been the emergence of user facilities, as exemplified by the Department of Energy Nanoscale Science Research Centers, and user facilities at universities. This trend is associated with the emergence of microfabrication labs as a part of exploratory research, and the rapid growth of the throughput and cost of characterization tools such as electron microscopes, scanning probes, and chemical imaging. Correspondingly, the integration of these tools within the same geographically localized facilities, which maintain the instruments, offer sample preparation facilities often shared between multiple instruments, and employ highly trained scientists capable of operating and using them, greatly increases the efficiency of use.</p><p>However, despite the drastic changes on the operational side, the mode of use of the instruments in user facilities and individual labs has remained largely the same. In all cases, the scientist runs the instrument manually, generating large volumes of data during the experiment. The data are typically analyzed after the experiment, in a process that often takes weeks to months, and are subsequently shared with the scientific community via publications. 
The latter process is typically associated with extremely large latencies, of the order of months and often years, hindering the research process. Correspondingly, until 10-15 years ago, scientific conferences were the primary means for rapid information sharing. Over the last decade, the rapid growth in popularity of preprint servers such as arXiv, ChemRxiv, and bioRxiv, as well as social networks, has greatly accelerated information sharing. Similarly, code and data sharing via platforms such as GitHub, Zenodo, and Google Colab is rapidly becoming a part of the scientific culture in many domain areas.</p><p>The rapid increase in the ease of use of cloud technologies suggests that the field is now poised for the next transition, where the operation of the tools is largely remote and the information is either directly streamed to cloud storage, streamed after acquisition subject to upload speed constraints, or compressed at the point of generation. In addition to the obvious advantages in terms of data accessibility and security, this allows the formation of lateral instrument networks, as illustrated in Fig. <ref type="figure">7</ref>. Here, multiple instruments of the same kind store their data within a community-accessible cloud space. The latter also supports the computational capabilities and code ecosystems that allow data analysis and, in turn, can be further harnessed for decision making and the automation of workflows on individual instruments, significantly accelerating the scientific process. It should be noted that data permissions will very likely depend on the study; most likely, data will not be shared universally without an embargo period or a similar mechanism.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Sequential and parallel experiment orchestration</head><p>The development of cloud connectivity for characterization tools establishes a set of novel opportunities for experimental workflows across multiple facilities. The particularly important case for the parallel experiment orchestration <ref type="bibr">198</ref> is when multiple copies of the same sample are available, e.g., combinatorial spread libraries. In these, the local concentrations and functionalities are rigidly encoded in positional descriptor, allowing matching the regions between different tools. For these structures, only a few characterization methods such as optical hyperspectral imaging or photoluminescence can be performed in parallel. For techniques such as structural characterization via focused x ray, chemical characterization via ToF-SIM, cathodoluminescence, and scanning probe microscopy, the measurements can be performed sequentially. At the same time, these techniques often give complementary information on structure, properties, and chemical composition. Correspondingly, running the automated experiment on same object and multiple systems allows to explore the material composition space combining information from multiple sources (Fig. <ref type="figure">8</ref>).</p><p>We pose that these experiments can be also performed in the statistical sense, in which the information from multiple tools is combined via partially similar channels. For example, for hybrid perovskite samples in Fig. <ref type="figure">1</ref>, the statistical properties of the microstructures can be explored via correlating the chemical or CL signals referenced to grain boundaries or other key point objects that can be identified in both methods.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Automated laboratories vs user facilities</head><p>It is important to distinguish between cloud laboratories and user facilities, as they serve different needs. The former is generally a replication of what would be found in an individual lab at a university, though with APIs to enable automated experimentation and data collection, and the possibility of enabling autonomous workflows, given the presence of these APIs. This means that individual researchers can rapidly test ideas without the need for setting up their own lab, which can dramatically improve efficiency and reduce redundancy, as individuals need not invest in their own laboratories for the purpose of completing a synthesis. This should result in a significant reduction in overall time to discovery, as it removes a significant activation barrier (lab setup).</p><p>In contrast, user facilities, especially those run by the government sponsors (e.g., light sources, synchrotrons, and NSRCs), provide capabilities that are not generally available to most individual PIs. Efforts to implement autonomous capabilities are nascent but developing. Still, the fact that skilled instrument scientists are at these facilities enables automation of different aspects of the facility, due to in-house technical knowledge that would be difficult to acquire externally, and can realistically only be shared through these outlets.</p><p>Regardless, the development of automated facilities is a prerequisite for the development of autonomous workflows where the whole setup can be optimized. It should further be noted that the optimization, e.g., of synthesis, is not always equitable with a quantitative efficiency gain. For many, if not most experimental workflows, the bottleneck lies not in time taken to perform a single measurement or experiment, but in other points (such as model generation). However, this is not to say, the automated synthesis is not valuable. 
Rather, it enables rapid iterations to converge on higher-quality solutions than can be found with human expertise alone. <ref type="bibr">199,</ref><ref type="bibr">200</ref> FIG. <ref type="figure">7</ref>. Transition from a single tool to lateral instrumental networks enabled by cloud technologies and a data analysis ecosystem. Traditional scientific research often relied on individual, isolated instruments. Each instrument was operated independently, and interaction between different instruments occurred only during post-experiment data analysis performed by humans. The transition here is from this isolated setup to a network of interconnected instruments, which allows for lateral communication and interaction, enabling various instruments to work in conjunction and share information seamlessly in real time. Cloud technology plays a key role in enabling real-time data sharing and processing: scientists can access data remotely, collaborate remotely, and utilize computational resources to analyze complex datasets. The cartoon scientist was generated by DALL·E 3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VII. REQUIREMENTS FOR IMPLEMENTATION OF WORKFLOW</head><p>The analysis above allows us to summarize the requirements for broad implementation of scientific workflow design for automated and traditional laboratories. These include the following:</p><p>(1) Development of the labs capable of orchestrating pre-defined workflows based on human and robotic agents. These can be purely human-operated lab, purely robotic lab, or human-robot lab where humans perform technical tasks. The associated need is the hyperlanguage that summarizes possible operations and provides access to control parameters. This also includes process monitoring on multiple levels-sample locations, collection of proxy signals during processing, environment monitoring, etc. Here, it is important to have access to full experimental data including both positive and negative outcomes, since purely positive data are often insufficient for optimal ML. (2) Workflow design based on AI and human decision making, meaning specific series of synthesis and characterization steps based in hyperlanguage. Since physical objects that can be only in one place in one time, workflow will have a directed graph structure (but can form quasi loops when folded on material axis, e.g., for optimization). Note that currently humans both plan workflows and execute them, but these functions can be separated. Key aspect here will be the development of consistent ontologies operating across disparate workflow elements.</p><p>(3) Defining domain-specific reward functions that guide workflow development. Why are we running experiments? Is the reward scientific discovery, optimization, or something else (curiosity or empowerment)? 
Ultimately, we should be able to quantify (in the style of the Bellman equation in reinforcement learning) <ref type="bibr">201</ref> what the benefit of a specific step in the workflow is and how it accomplishes or affects exploration and exploitation goals. (4) Integration of reward functions from dissimilar domains, since in almost all cases the total reward function will be compounded from multiple intermediate rewards. For example, how does a better microscope help us learn the physics of a specific material? Why would a specific DFT calculation help us understand experimental data? (5) Creating experimentally falsifiable hypotheses from the domain-specific body of knowledge that can be incorporated in the exploratory part of automated workflows. This is required because workflow design should ideally include a discovery component, not only optimization. Discovery is effectively extrapolation into an unknown domain, and given the very high dimensionality, the full space of possible experiments is intractable. Hypotheses provide a way to constrain the space to explore. Note that updating hypotheses based on experimental data is a well-defined problem in the Bayesian sense. (6) Hypothesis generation. In many cases, this is an extrapolation problem and will likely be human-driven (and potentially AI-assisted) for the foreseeable future. However, it will be interesting to explore whether the interpolative capabilities of models such as ChatGPT, combined with their ability to work with the full volume of text and other data available to mankind, will be sufficient for hypothesis generation.</p></div>
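The Bellman-style valuation of workflow steps mentioned in requirement (3), together with the directed-graph structure of requirement (2), can be made concrete with a toy recursion: the value of a step is its immediate reward minus its cost, plus the discounted value of the best available successor. This is a minimal sketch under assumed numbers; the step names, rewards, costs, and discount factor are all invented for illustration and carry no domain meaning.

```python
def step_value(graph, rewards, costs, step, gamma=0.9, memo=None):
    """Bellman-style value of taking `step` in a directed acyclic workflow
    graph: immediate reward minus cost, plus the discounted value of the
    best successor. `graph` maps each step to its possible next steps.
    """
    if memo is None:
        memo = {}
    if step in memo:
        return memo[step]
    # value of the best continuation; terminal steps contribute 0.0
    future = max((step_value(graph, rewards, costs, nxt, gamma, memo)
                  for nxt in graph.get(step, [])), default=0.0)
    memo[step] = rewards[step] - costs[step] + gamma * future
    return memo[step]

# hypothetical workflow: synthesis, then either a fast proxy measurement
# or a slow high-information measurement, each followed by analysis
graph = {"synthesize": ["proxy_scan", "full_spectroscopy"],
         "proxy_scan": ["analyze"], "full_spectroscopy": ["analyze"],
         "analyze": []}
rewards = {"synthesize": 0.0, "proxy_scan": 1.0,
           "full_spectroscopy": 3.0, "analyze": 2.0}
costs = {"synthesize": 0.5, "proxy_scan": 0.2,
         "full_spectroscopy": 2.0, "analyze": 0.1}
```

Comparing `step_value` for `proxy_scan` and `full_spectroscopy` then answers, in this toy setting, the question posed above: whether the information gain of a costly measurement justifies its latency and cost relative to a cheap proxy.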
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VIII. SUMMARY</head><p>Scientific research and discovery are typically organized around workflows, or sequences of the specific actions and experiments targeting specific outcomes. Until now, the workflow design has been highly domain-specific and once established, the workflows remain constant over decades. The disruption of the existing workflows or the introduction of the new ones is typically associated with the emergence of the novel experimental tools. Traditionally, ideation, orchestration, and implementation of the workflows are human-based. The advent of machine learning tools over the last three years has facilitated optimization of human-built workflows, but yet has not led to beyondhuman experimentation.</p><p>Here, we introduce simple workflows for structural characterization and show that these can be based only on the discovery or weighted by prior knowledge. We suggest several possible strategies for the characterization workflow design. We discuss the increase in complexity for combined imaging-characterization workflows and illustrate the direct and inverse step for design of such workflow.</p><p>Finally, we argue that a similar approach can be extended to combined synthesis-imaging-characterization workflows and workflows containing theory in the loop. The workflow design in this case becomes extremely complex and will weigh the latencies, costs, and expected benefits of all steps. We believe that the design of such workflows will require careful analysis of the rewards and the analysis of the value of individual steps. However, the emergence of automated experiments and labs necessitates these developments. 
Overall, these tools will enable knowledge-based workflow optimization; enable lateral instrumental networks, sequential and parallel orchestration of characterization between dissimilar facilities; and empower distributed research.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>J. N. Wilson, J. M. Frost, S. K. Wallace, and A. Walsh, "Dielectric and ferroic properties of metal halide perovskites," APL Mater. 7, 010901 (2019).</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="15" xml:id="foot_1"><p>April 2024  </p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_2"><p>16:13:56   </p></note>
		</body>
		</text>
</TEI>
