<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Assessing Multimodal Dynamics in Multi-Party Collaborative Interactions with Multi-Level Vector Autoregression</title></titleStmt>
			<publicationStmt>
				<publisher>ACM</publisher>
				<date>11/07/2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10494293</idno>
					<idno type="doi">10.1145/3536221.3556595</idno>
					<title level='j'>Proceedings of the 2022 International Conference on Multimodal Interaction</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Robert G. Moulder</author><author>Nicholas D. Duran</author><author>Sidney K. D'Mello</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Multi-level vector autoregression (mlVAR) is a recently developed dynamic network model for assessing multimodal temporal data streams derived from multiple users over time. Importantly, mlVAR facilitates investigations into highly complex collaborative interactions within a unified framework. In order to demonstrate the utility of mlVAR for understanding the temporal dynamics of multimodal multi-party (MMP) interactions, we apply it to 9 signals measured from 201 users (67 triads) who engaged in a 15-minute collaborative problem solving task. Measured signals reflect participants' affective states (positive valence and negative valence), physiological states (skin conductance and heart rate), attention (gaze fixation duration and gaze dispersion), nonverbal communication (head acceleration and facial expressiveness), and verbal communication (speech rate). Using node-level metrics of in-strength, out-strength, and synchrony, we show that mlVAR is capable of teasing apart complex role-based dynamics (controller, primary contributor, or secondary contributor) between participants. Our findings also provide evidence for a complex feedback system between individuals where internal states (i.e., skin conductance) are influenced by external signals of shared attention and communication (i.e., gaze and speech).
CCS CONCEPTS: Applied computing → Psychology.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Collaboration between individuals in specialized roles is a key component of solving complex problems at both large and small scales.</p><p>Successful large-scale humanitarian efforts may require collaboration between aid workers, local politicians, and funding sources. Product development at a moderate-sized company requires collaboration and effective communication between management, production facility personnel, development teams, and quality assurance experts. Even small-scale collaborations such as students working on a classroom project may require role assignment and work division to achieve high marks. When collaboration efforts succeed, solutions are found to problems that would otherwise have gone unsolved. When collaboration efforts fail, resources are wasted, time and money are spent, and in extreme cases people may be injured or die. Thus, understanding the mechanisms behind maintaining strong and active collaborations is imperative to increasing innovation, stoking creativity, and spurring invention. However, despite the ubiquity and impact of multi-party collaborations, they are notoriously difficult to study due to the inherent complexity of multimodal information streams interacting together as humans collaborate with other humans or machines.</p><p>This difficulty should come as no surprise to researchers, as humans differ widely in their affective, behavioral, and cognitive qualities. Additionally, these qualities are not static within individuals, but are a function of that person's current and past environment, biological state, and psychological state. Each of these qualities may directly or indirectly influence an individual's efficiency and success at solving a problem. It is also no surprise, then, that successful multi-party collaborations are difficult to maintain <ref type="bibr">[23,</ref><ref type="bibr">40]</ref>. 
Complex systems of interactions between many agents may weaken or break down entirely due to numerous factors <ref type="bibr">[38]</ref>. Because multi-party collaborative dynamics are more than the sum of the qualities of individual interacting agents, studying multimodal and multi-party (MMP) collaboration becomes increasingly difficult. With individual agents constantly creating multimodal associations with other agents (that also have unique and dynamic qualities), the complexity of studying multi-party interactions quickly grows, and more complex multimodal data streams are needed to fully represent all of the dynamics involved between and within each agent. As these data streams increase in complexity, so too must the analytic methods used to model them.</p><p>One approach to drawing meaningful inferences from highly complex systems is network analysis <ref type="bibr">[42]</ref>. Network analysis is a class of statistical and computational analyses for modeling complex systems <ref type="foot">1</ref>. Network models are powerful computational tools for simultaneously assessing relationships between large numbers of variables/features across multiple domains. In a network analytic framework, a graph is constructed representing underlying connections between units in a system. Variables in this graph are represented as nodes and the connections between those variables are represented as edges. These graphs may be directed or undirected and can estimate relationships between large amounts of data. Because of these qualities, network analytic approaches have been shown to be a useful tool for inferential analyses of complex multimodal data streams <ref type="bibr">[31,</ref><ref type="bibr">60]</ref>. 
Unlike many common statistical and machine learning techniques, network analytic approaches seek to jointly estimate the mutual influence between large numbers of variables for the purposes of statistical inference rather than predictive strength <ref type="bibr">[18]</ref>.</p><p>Recently, network-based models have been developed for studying systems that change dynamically over time. These dynamic networks are capable of modeling how the state of a variable (or set of variables) at a given time point affects itself or another variable at a future time point. This allows for the inferential modeling of simultaneous and mutually interacting temporal multimodal data streams both within and between agents, making dynamic network models a prime candidate for the study of multi-party collaboration paradigms.</p><p>We employ one such dynamic network model (multilevel vector autoregression, a.k.a. mlVAR) for modeling multimodal affective, behavioral, and cognitive data streams derived from a role-based triadic collaborative problem solving task. Specifically, we use mlVAR to model how students' emotional, physiological, attentional, verbal communication, and nonverbal communication dynamics influence each other, and are in turn influenced by each other, while solving digital physics-based puzzles in teams of three. Insights derived from these modalities show clear patterns in how students taking a lead in a group-based problem solving task influence, and are influenced by, actively engaged and less-actively engaged collaborators. 
These findings are evidence that dynamic network models such as mlVAR are appropriate and invaluable tools for modeling complex multimodal data derived from multi-party collaborations.</p><p>The successful adoption of mlVAR models into the study of MMP collaboration tasks may further the development of tools for increasing the collaborative efficiency of teams, identifying unique and important dynamic links between team members, or training artificial intelligence systems to become optimal team members when interacting with humans. Findings from the current study suggest that the multimodal dynamics of individuals in specific roles within an MMP collaborative environment have differential influences on emotional, behavioral, and cognitive signals within their team members. Thus, if altering the dynamics of an MMP collaborative environment is warranted, mlVAR yields specific targets (i.e., which team member and which data stream) on which interested parties may best focus their resources and efforts.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">Related Work</head><p>Multi-party Collaboration. Although collaboration is a powerful tool for solving problems, maintaining and managing successful collaborations between agents is difficult. Kerr and Tindale <ref type="bibr">[38]</ref> use the term process loss to describe the trend of collaboration efforts becoming less effective over time due to breakdowns in social interactions, communications, or a lack of resources necessary to maintain effective collaboration between individuals. Indeed, it appears that some process loss is inevitable as the complexity of the task and the number of agents involved in active collaboration increase exponentially <ref type="bibr">[55]</ref>.</p><p>Despite process loss, successful large-scale collaboration efforts do exist, and Barron <ref type="bibr">[7]</ref> claims that they persist due to the mutuality of exchanges, the achievement of joint attentional engagement, and the alignment of group members' goals for the problem solving process. Thus, researchers interested in bolstering the effectiveness of collaboration efforts, while staving off process loss, have focused on understanding what psychological, physiological, and structural processes differentiate successful collaboration endeavors from unsuccessful ones. Thus far, researchers have found these differences to be highly complex and dependent on many multimodal signal interactions, team makeup, and situational factors. For example, Vrzakova et al. <ref type="bibr">[62]</ref> found that perceived collaboration success in small teams of students was determined by differing patterns of speech and body movement synchronization between each student. Longer periods of silence and less movement were positively correlated with how well students performed on a shared task. Stewart et al. 
<ref type="bibr">[57]</ref> showed that team diversity on a number of metrics (e.g., personality, demographics, and prior experience) was associated with collaboration success on a shared task. Using recurrence quantification analysis, Eloy et al. <ref type="bibr">[28]</ref> demonstrated that less regularity in multimodal data streams (i.e., more novel collaboration patterns) was associated with collaboration success.</p><p>Multimodal Complexity. It is clear that maintaining successful collaboration is a difficult problem involving many multimodal processes occurring within and between collaborative agents. In an attempt to simplify this complexity, we discuss the interactions between the multimodal processes occurring within and between agents using the following categorizations: (a) influential processes, (b) influenced processes, and (c) synchronized processes. Influential processes are processes that, when they change or are changed, cause other processes to change. Influenced processes are processes that are changed when another process (or set of processes) changes. When processes are simultaneously influential on and influenced by one another, these processes are synchronized. Each of these types of multimodal processes serves an important purpose in collaborative endeavours as a means of information transfer (Figure <ref type="figure">1</ref>).</p><p>Influential and influenced processes facilitate information transfer between collaborative agents. Ochoa and Dominguez <ref type="bibr">[44]</ref> showed that automated multimodal training systems that offer immediate feedback for oral presentations had a significant positive influence on users preparing to give oral presentations to a class, such that users were perceived to have improved their oral presentation skills. Boker et al. 
<ref type="bibr">[12]</ref> demonstrated the influential processes of emotional information transfer between facial expressions in a collaborative communication task between digital avatars of pairs of real interacting agents. By dampening the emotional expressions of one digital avatar, they were able to reliably elicit stronger emotional expressions in the person controlling the second avatar. Boker et al. <ref type="bibr">[12]</ref> hypothesized this is due to a shared equilibrium state occurring during conversation which, if violated by one individual, causes a correction by the other. This finding is important to studying multimodal collaboration processes, as emotional intensity and affective state predict the willingness of agents to interact with others for long periods of time <ref type="bibr">[9]</ref>. Brennan et al. <ref type="bibr">[14]</ref> found that periods of shared gaze between pairs of individuals engaged in a problem solving task were significantly related to collaboration effectiveness. Moulder et al. <ref type="bibr">[43]</ref> argue that chaotic behaviors in head motion during collaborative communication tasks form emergent high-dimensional states that facilitate communication. Beyan et al. <ref type="bibr">[10]</ref> also argue that such emergent states are necessary for role formation in collaborative tasks, such as leader or supervisory roles. Due to the universality of influenced and influential processes in collaborative paradigms, Cooke and Gorman <ref type="bibr">[22]</ref> have developed numerous metrics for determining influence between team members, such as "dominance" and information sharing.</p><p>Synchronized states between multimodal signals also facilitate information exchange between collaborative agents, leading to successful multi-party collaboration. 
Especially in collaborative systems with human agents, synchrony is a core aspect of facilitating social and developmental processes <ref type="bibr">[30,</ref><ref type="bibr">50]</ref>. Synchronization between physiological signals of collaborators (e.g., heart rate and skin conductance) has been shown to be associated with increased success rates in collaborative problem solving paradigms <ref type="bibr">[8]</ref>. Similarly, Ashenfelter et al. <ref type="bibr">[4]</ref> found that the amount of symmetry between influential and influenced head motions in conversation undergoes periods of building and breaking, and that these periods are necessary for effective information transfer. Chikersal et al. <ref type="bibr">[19]</ref> argue that synchronization between facial expressions and skin conductance between collaborating agents is indicative of a group's capacity to perform a wide variety of tasks. Synchronization between emotional states, body motion, and attention has also been shown to facilitate effective communication and understanding between members of a group <ref type="bibr">[1,</ref><ref type="bibr">32,</ref><ref type="bibr">48]</ref>. It is of note that although synchronization is important in influencing collaborative outcomes (e.g., problem solving ability), synchronization is not always a positive influence on collaborative outcomes and may lead to worse team performance <ref type="bibr">[3]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2">Current Directions in Multimodal Multi-Party Data Analysis</head><p>Researchers are aware of the need to collect multimodal data streams from multi-party collaboration studies in order to effectively understand multi-party collaborative processes. However, until recently, the majority of research on collaboration has focused specifically on a small number of signals collected from dyads (pairs of collaborating agents). This is due to a combination of difficulties in collecting multimodal signals from multiple agents engaged in shared collaboration and a lack of computational methods geared towards high dimensional multi-party data streams.</p><p>The focus on a small number of signals is beginning to change as researchers have begun to advance computational methodologies to be able to handle high dimensional MMP data streams. For instance, Amon et al. <ref type="bibr">[3]</ref> developed a means of assessing the collaboration skills of individuals in triads (groups of three) using multidimensional recurrence quantification analysis, while Subburaj et al. <ref type="bibr">[58]</ref> used a weighting based approach to quantify collaboration performance in multi-party interactions. Researchers have also developed numerous group level synchrony metrics using complex systems methodologies <ref type="bibr">[34,</ref><ref type="bibr">52]</ref>. Some researchers have proposed generative/predictive models of MMP collaborative processes, while other researchers such as Burk et al. <ref type="bibr">[16]</ref> and Rowley <ref type="bibr">[53]</ref> argue that network analysis provides the most natural analytic framework for assessing multimodal multi-party data streams.</p><p>Generative and Predictive Models. Predictive models such as random forest algorithms and artificial neural networks are powerful tools for deriving useful insights from MMP collaborative tasks. Grafsgaard et al. 
<ref type="bibr">[33]</ref> utilized both feed-forward and long short-term memory (LSTM) artificial neural networks to learn synchronization patterns between romantic couples at greater than chance levels. Olsen et al. <ref type="bibr">[45]</ref> also utilized LSTM networks, as well as measures of entropy, to show that multimodal data feeds showed a significant improvement over singular data feeds in predicting collaborative learning outcomes. Researchers have even used predictive modeling in an MMP collaborative context to create generative models of individuals playing sports such as basketball <ref type="bibr">[36]</ref>. While these predictive methods are useful, it is difficult to derive specific information about the underlying multimodal dynamics occurring in MMP collaborative tasks due to the "black-box" nature of such methods. Other methods such as network analysis trade some predictive and generative capability for inferential information.</p><p>Network Analysis of Complex Social Systems. Network analysis is a popular tool for assessing large systems of complex relationships across multiple research fields <ref type="bibr">[2,</ref><ref type="bibr">20]</ref>. Network analysis represents variables as nodes in a large graph, with nodes connected by either directed or undirected edges. These edges are defined by the observed relationships between variables. The resulting graph can be a source of rich information about how multimodal data streams are interacting within and between individual agents <ref type="bibr">[6]</ref>, and such graphs have been used in a variety of ways to study complex social systems. For instance, Golino et al. <ref type="bibr">[31]</ref> used a dynamic network analysis approach to determine which tweets were likely from troll accounts during the 2016 US presidential election, and Ruis et al. <ref type="bibr">[54]</ref> showed that network analysis was able to distinguish error handling patterns between novice and more experienced surgeons. 
Durugbo et al. <ref type="bibr">[27]</ref> have specifically argued that network analysis is both a useful and practical solution for studying multimodal multi-party collaboration dynamics.</p><p>Indeed, network analysis has been successfully applied to studying multi-party collaboration efforts. Barab&#225;si et al. <ref type="bibr">[5]</ref> used network analysis to understand the collaboration patterns of individual scientists and how these collaboration patterns change the scientific inquiry landscape by creating both clusters of habitual collaborators and large networks spanning many labs. Ramasco et al. <ref type="bibr">[49]</ref> showed similar findings with similar scientific collaboration networks as well as movie actor collaboration networks. Network analysis has also been used to understand large-scale collaborative efforts, from numerous agents editing Wikipedia pages to collaborations between global tech companies <ref type="bibr">[13,</ref><ref type="bibr">26]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.3">Research Questions</head><p>In using the mlVAR method, we aim to better understand MMP collaboration dynamics between agents engaged in a role-based collaborative problem solving task. Specifically, we seek to use the mlVAR approach coupled with variable influence and synchrony dynamics in order to answer the following research questions: RQ1: How does multimodal information flow between individuals in different roles? RQ2: Which multimodal data streams in what context are most influenced by role? RQ3: Which multimodal data streams in what context are most influential by role? RQ4: Which multimodal data streams are most involved in synchronization between roles?</p><p>We use multimodal data streams derived from participants engaged in a triadic collaborative problem solving task to answer each research question. During this task, each participant was assigned to be either a controller (i.e., to control the game while solving a physics puzzle) or a contributor who could only give verbal assistance to the controller. The collected data streams represent five specific modalities that index key aspects of interpersonal information exchange in the context of this collaborative problem solving task: emotional information, physiological information, attentional information, nonverbal information, and verbal information. Emotional information is represented by each participant's positive and negative valence, physiological information is represented by each participant's heart rate and skin conductance, attentional information is represented by each participant's average length of gaze fixation and gaze dispersion across the computer screen, nonverbal information is represented by each participant's head acceleration and facial expressiveness, and verbal information is represented by each participant's speech rate, yielding 27 time series per triad (9 data streams &#215; 3 roles).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.4">Contribution and Novelty</head><p>To our knowledge, this is the first successful attempt to explicitly model multimodal and multi-party collaborative dynamics in a holistic network analysis framework, while taking into account triadic interactions, interaction context (i.e., role), nested data structures, and emotional, behavioral, and cognitive data streams. Unlike previous studies that have focused on studying only 1 or 2 signals at a time in the context of dyadic collaboration, we simultaneously assess 9 signals across each of 3 students engaged in triadic collaboration, yielding 27 time series per triad. We propose a novel application of mlVAR, combined with modern network quantification indices of variable influence and synchrony, for studying dynamic multimodal multi-party collaborations. The mlVAR model is well grounded in both complexity theory and graph theory, and provides many metrics for assessing dynamics both within and between collaborating agents. We focus on node in-strength, out-strength, and synchrony. Although we demonstrate the mlVAR model on triads, this approach scales to any number of multi-agent systems. Above and beyond more common network approaches, mlVAR allows for the simultaneous estimation of temporal network dynamics while accounting for the nesting of multimodal data streams within collaborating agents. The mlVAR approach, coupled with the quantification of in-strength, out-strength, and synchrony, contributes to the study of collaborative dynamics of MMP interactions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">METHOD</head><head n="2.1">Data Collection</head><p>The data used in this analysis are from a larger study on collaborative problem solving. Only variables and teams relevant to the current analysis are described below [see 57]. Figure <ref type="figure">2</ref> represents the data collection and analysis pipeline for this study.</p><p>Participants. Participants selected for this analysis were 201 students drawn from a larger data set of 288 students from 2 large public universities in the United States (average age = 21.77 years) engaged in a collaborative physics game. Of these 201 students, 57% were female, with a racial makeup of approximately 53% White, 23% Hispanic/Latino, 17% Asian, 3% Black, and 1% Native American, with 3% reporting "Other". Participants were compensated for their time with either course credit or an Amazon gift card of 50.00 USD if they completed the study.</p><p>Collaboration Task. Participants were grouped into teams of 3 based upon their scheduling availability. Each team was then asked to join an in-lab video-conferencing session (Zoom) and work on solving problems in an online Newtonian physics-based game entitled "Physics Playground" <ref type="bibr">[56]</ref> (Figure <ref type="figure">2</ref>). All teams engaged in three 15-minute blocks (a warm-up block and 2 experimental blocks). During each block of the study, students were randomly assigned to be either a controller (a person who controlled the mouse and solved the puzzle) or a contributor (a person who could give verbal thoughts and suggestions on the current level). Data for the current paper are taken exclusively from the warm-up block. Students were tasked with using basic principles of Newtonian physics (e.g., gravity, transfer of energy, and leverage/torque) to guide a ball to a goal. 
The controller's objective was to draw simple machines (e.g., levers) that would interact with the ball on screen and guide the ball to the goal, while the objective of the contributors was to offer their thoughts and suggestions to the controller. Going further, we differentiate the two contributors as the more verbally active contributor across the warm-up block (primary contributor) and the less verbally active contributor across the warm-up block (secondary contributor).</p><p>Data Collection and Processing. Collaboration is a multimodal process in which visual, auditory, behavioral, emotional, and attentional information influence collaboration dynamics in real time. In order to model this complex process, we collected nine data streams from each participant. Multimodal data streams were extracted at a per-participant level through the use of each participant's webcam, a headset microphone, an eye tracking system, and physiological sensors. All data streams were then down-sampled to a rate of 1 Hz for data analysis to correspond to the average length of an utterance. For a triad to be selected for this analysis, each member of the triad must have had observations for each time series. Triads with entirely missing time series were dropped from the analysis. In total, 67 triads (69.7% of the original data set) met our criteria for inclusion in this analysis. Remaining missingness within time series (average missingness = 6.36%) was handled with Kalman filtering.</p><p>Audio data streams were collected through each participant's headset. These data streams were then fed through IBM Watson's Speech to Text software to yield timestamped transcripts (beginning and end times) of each utterance across the 15-minute warm-up block for each participant. The counts of these utterances were then aggregated to 1 Hz to define each participant's speech rate data stream. If an utterance lasted longer than 1 second, it was assigned to the second in which it started. 
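The per-second aggregation of utterances described above can be sketched in a few lines. This is only an illustrative Python sketch and not the pipeline used in the study; the function name speech_rate_per_second, its arguments, and the example onset times are hypothetical.
<eg><![CDATA[
```python
import numpy as np


def speech_rate_per_second(onsets_s, block_length_s=900):
    """Count utterance onsets in each 1 s bin of a block (default: 15 minutes)."""
    # Each utterance is assigned to the second in which it starts,
    # regardless of how long it lasts.
    bins = np.floor(np.asarray(onsets_s, dtype=float)).astype(int)
    # Ignore onsets that fall outside the block.
    keep = np.logical_and(bins >= 0, bins <= block_length_s - 1)
    return np.bincount(bins[keep], minlength=block_length_s)


# Hypothetical utterance start times (in seconds) for one participant.
rate = speech_rate_per_second([0.4, 0.9, 2.5, 2.7, 7.1], block_length_s=10)
print(rate.tolist())  # [2, 0, 2, 0, 0, 0, 0, 1, 0, 0]
```
]]></eg>
Because each utterance is counted only in the second in which it starts, long utterances do not inflate the counts of later seconds. 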
Physiological data streams were collected through the use of Shimmer 3 GSR+ devices. The Shimmer 3 is an unobtrusive wearable device that collects both heart rate (PPG signal) and changes in skin conductance (galvanic skin response) at 51.2 Hz. The Shimmer 3 galvanic skin response sensor was placed on each participant's wrist and the heart rate monitor was placed on each participant's earlobe. After data collection, skin conductance was separated into tonic (slow moving) and phasic (fast moving) components. The current study focuses on the phasic component because the phasic component is sensitive to changes in external stimuli. The Shimmer family of devices has been validated to collect highly accurate and synchronized physiological data streams with minimal error and drift, even on moving participants <ref type="bibr">[17]</ref>. Each participant was fitted with a Shimmer 3 during data collection. Both heart rate and skin conductance data streams obtained from the Shimmer devices were down-sampled to 1 Hz for analysis using an order eight Chebyshev type I filter.</p><p>Emotional data streams (positive valence and negative valence), as well as expressiveness, were collected from videos of participants' faces recorded via webcams attached to each participant's computer. Videos were sampled at 10 Hz for the purposes of feature extraction. We utilized the Emotient video analysis software, which estimates the likelihoods of the presence of 20 action units relevant to each participant's expressions in each video <ref type="bibr">[41]</ref>. These action units were then used to assess participants' positive valence and negative valence using an algorithm developed by Cohn et al. <ref type="bibr">[21]</ref>. Expressiveness (overall activity across a given frame) was then calculated as the mean value across the action units.</p><p>Motion and attention based data streams were collected with Tobii4C eye tracking devices attached to each participant's computer. 
Tobii4C devices collected data on each participant's eye gaze and head motion sampled at 90 Hz. The Tobii4C devices collected information on the pitch, roll, and yaw of each participant's head. We used participants' visual fixation information (i.e., points where gaze is maintained on a location at a maximum of 25 pixels apart for at least 50 ms) to compute fixation dispersion (i.e., the mean Euclidean distance of each raw gaze point in a 1 s window from the centroid) and average fixation duration <ref type="bibr">[25]</ref>. Fixations longer than 1 s were trimmed to 1 s and fixations that overlapped boundaries were assigned to the majority second. Fixations were then averaged over 1 s windows. Head pitch, roll, and yaw were converted into X, Y, and Z axis displacement, then into accelerations using two steps of fourth-order central differencing. The magnitude of the X, Y, and Z accelerations was then calculated as a measure of head motion dynamics during the collaborative problem solving task and down-sampled to 1 Hz. Head acceleration magnitude has been shown to be a useful data stream for nonverbal information transfer between individuals [e.g., 11].</p></div>
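<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a rough illustration of the head motion processing described above, the following Python sketch applies a fourth-order central difference twice to displacement data, takes the magnitude of the resulting acceleration, and down-samples it to 1 Hz. It is a sketch under stated assumptions rather than the code used in the study: the function names are hypothetical, the standard five-point stencil is assumed for the fourth-order difference, and averaging over non-overlapping 1 s blocks is an assumed down-sampling rule that the text does not specify.</p>
<eg><![CDATA[
```python
import numpy as np


def central_diff4(x, dt):
    """First derivative via the fourth-order (five-point) central difference.

    The result is 4 samples shorter than x (2 are dropped at each end).
    """
    x = np.asarray(x, dtype=float)
    return (-x[4:] + 8.0 * x[3:-1] - 8.0 * x[1:-3] + x[:-4]) / (12.0 * dt)


def head_acceleration_magnitude(xyz, fs=90.0):
    """Acceleration magnitude from an (n, 3) array of X, Y, Z displacement."""
    xyz = np.asarray(xyz, dtype=float)
    dt = 1.0 / fs
    # Two differencing steps: displacement -> velocity -> acceleration.
    accel = np.column_stack(
        [central_diff4(central_diff4(xyz[:, k], dt), dt) for k in range(3)]
    )
    return np.sqrt((accel ** 2).sum(axis=1))


def block_mean(signal, fs=90):
    """Down-sample to 1 Hz by averaging 1 s blocks (an assumed rule)."""
    n_sec = len(signal) // fs
    return np.asarray(signal[: n_sec * fs]).reshape(n_sec, fs).mean(axis=1)


# Synthetic check: constant acceleration (1, 2, 2) has magnitude 3.
t = np.arange(450) / 90.0
xyz = 0.5 * np.outer(t ** 2, [1.0, 2.0, 2.0])
mag = head_acceleration_magnitude(xyz, fs=90.0)
per_second = block_mean(mag, fs=90)
```
]]></eg>
<p>Each differencing step shortens the series by four samples, so a real pipeline would need to decide how to pad or align the edges before combining this stream with the others.</p></div>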
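<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Multi-Level Vector Autoregression</head><p>Network analytic methods are capable of modeling complex relationships between multiple variables across multiple data clusters (triads in the current case). The mlVAR model is one such network analytic approach that is especially well suited for assessing temporal dynamics in multimodal multi-party data streams <ref type="bibr">[15,</ref><ref type="bibr">29]</ref>. mlVAR accomplishes this task by constructing a series of multiple linear mixed-effects models, each of which predicts a variable at a given time within a cluster, <formula notation="tex">y^{(i)}_{j}(t)</formula>, by all variables at time <formula notation="tex">t-1</formula>:</p><p><formula notation="tex">\begin{aligned} y^{(i)}_{1}(t) &amp;= (\beta_{11} + u^{(i)}_{11})\, y^{(i)}_{1}(t-1) + \cdots + (\beta_{1p} + u^{(i)}_{1p})\, y^{(i)}_{p}(t-1) + \varepsilon^{(i)}_{1}(t) \\ &amp;\;\;\vdots \\ y^{(i)}_{p}(t) &amp;= (\beta_{p1} + u^{(i)}_{p1})\, y^{(i)}_{1}(t-1) + \cdots + (\beta_{pp} + u^{(i)}_{pp})\, y^{(i)}_{p}(t-1) + \varepsilon^{(i)}_{p}(t) \end{aligned}</formula></p><p>where <formula notation="tex">y^{(i)}_{j}(t)</formula> is the value of variable <formula notation="tex">j \in [1, \dots, p]</formula> at time <formula notation="tex">t</formula> within cluster <formula notation="tex">i</formula>, <formula notation="tex">y^{(i)}_{j}(t-1)</formula> is a time-lagged (lag-1) version of <formula notation="tex">y^{(i)}_{j}(t)</formula>, <formula notation="tex">\beta_{11} \dots \beta_{pp}</formula> are fixed effects representing average associations between all <formula notation="tex">y^{(i)}_{j}(t)</formula> and <formula notation="tex">y^{(i)}_{j}(t-1)</formula>, <formula notation="tex">u^{(i)}_{11} \dots u^{(i)}_{pp}</formula> are random effects representing the cluster-level deviations from <formula notation="tex">\beta_{11} \dots \beta_{pp}</formula>, and <formula notation="tex">\varepsilon^{(i)}_{1}(t) \dots \varepsilon^{(i)}_{p}(t)</formula> are error terms. In the case of the current data, <formula notation="tex">y^{(i)}_{j}(t)</formula> represents positive valence, negative valence, expressiveness, heart rate, skin conductance, speech rate, average gaze fixation duration, gaze dispersion, and head acceleration for controllers, primary contributors, and secondary contributors, as well as a measure of all changes occurring on the screen of the physics game as performed by the controller, for a total of 28 variables per triad. All analyses were conducted with the mlVAR R package (version 0.5).</p><p>Adjacency matrices can then be constructed from <formula notation="tex">\beta_{11} \dots \beta_{pp}</formula> and <formula notation="tex">u^{(i)}_{11} \dots u^{(i)}_{pp}</formula>, representing general directed connections between nodes and cluster-specific connections between nodes, respectively:</p><p><formula notation="tex">\mathbf{B} = \begin{bmatrix} \beta_{11} &amp; \cdots &amp; \beta_{1p} \\ \vdots &amp; \ddots &amp; \vdots \\ \beta_{p1} &amp; \cdots &amp; \beta_{pp} \end{bmatrix}, \qquad \mathbf{U}_{i} = \begin{bmatrix} u^{(i)}_{11} &amp; \cdots &amp; u^{(i)}_{1p} \\ \vdots &amp; \ddots &amp; \vdots \\ u^{(i)}_{p1} &amp; \cdots &amp; u^{(i)}_{pp} \end{bmatrix}</formula></p><p>Adjacency matrices <formula notation="tex">\mathbf{B}</formula> and <formula notation="tex">\mathbf{U}_{i}</formula> form a graph that contains information on the temporal dynamics relating all <formula notation="tex">y^{(i)}_{j}(t-1)</formula> to <formula notation="tex">y^{(i)}_{j}(t)</formula>. Elements of these matrices can be read as connections going from columns to rows (e.g., <formula notation="tex">\beta_{21}</formula> represents a connection from node 1 at time <formula notation="tex">t-1</formula> to node 2 at time <formula notation="tex">t</formula>). 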
It is possible to use the values of $\mathbf{B}$ and $\mathbf{U}_i$ to calculate metrics representing variable influence and synchrony. Here we focus on measures of node degree (in-strength and out-strength) and synchrony.</p><p>The total sum of connections (or strength of connections) involving a node within a network is known as the degree of the node. Node degree in directed networks is measured by in-strength (the sum of absolute values of edges going into a node) and out-strength (the sum of absolute values of edges going out of a node) <ref type="bibr">[35]</ref>. Node synchrony is a measure of how a set of nodes jointly influence one another <ref type="bibr">[34]</ref>.</p><p>In-strength. Node in-strength represents the total magnitude of connections going from a node (or set of nodes) to a node of interest. For instance, the in-strength of node 1 from nodes 2, 3, and 4 from $\mathbf{B}$ is calculated as: $$\text{in-strength}_{1} = |\beta_{12}| + |\beta_{13}| + |\beta_{14}|$$</p><p>In-strength represents the total amount to which a node is influenced within a network. A node with higher in-strength than another node, computed over the same set of source nodes, is therefore more influenced by other nodes within the network. That is, the current dynamics of these nodes are sensitive to changes in other nodes at previous time points.</p><p>Out-strength. Node out-strength represents the total magnitude of connections going from a node of interest to a node (or set of nodes). For instance, the out-strength of node 1 to nodes 2, 3, and 4 from $\mathbf{B}$ is calculated as: $$\text{out-strength}_{1} = |\beta_{21}| + |\beta_{31}| + |\beta_{41}|$$</p><p>Out-strength represents the total amount to which a node influences other nodes within a network. A node with higher out-strength than another node, computed over the same set of target nodes, is therefore more influential to other nodes within the network. That is, the prior dynamics of these nodes change the current dynamics of other nodes.</p><p>Synchrony. Node synchrony may be thought of as the mutual influence of change between sets of nodes within a network. 
Whereas in-strength represents only the degree to which a node is influenced by changes in other nodes and out-strength represents how influential a change in a node is to other nodes, synchrony is a joint metric representing mutual changes in dynamics. Guastello and Peressini <ref type="bibr">[34]</ref> developed a method for modeling synchrony between individual nodes and groups of nodes based upon an optimal linear map of elements of an adjacency matrix.</p><p>To illustrate the calculation of node synchrony, consider the following four-node adjacency matrix, written here in terms of its elements: $$\mathbf{B} = \begin{bmatrix} \beta_{11} &amp; \beta_{12} &amp; \beta_{13} &amp; \beta_{14} \\ \beta_{21} &amp; \beta_{22} &amp; \beta_{23} &amp; \beta_{24} \\ \beta_{31} &amp; \beta_{32} &amp; \beta_{33} &amp; \beta_{34} \\ \beta_{41} &amp; \beta_{42} &amp; \beta_{43} &amp; \beta_{44} \end{bmatrix}$$</p><p>This method begins by choosing a reference node for which to calculate a synchrony score, as well as other nodes with which the reference node may synchronize. In this case we will select node 3 as the reference node and nodes 1, 2, and 4 as the nodes with which node 3 synchronizes. Two additional matrices, $\mathbf{V}$ and $\mathbf{M}$, are then formed from $\mathbf{B}$: $$\mathbf{V} = \begin{bmatrix} \beta_{13} \\ \beta_{23} \\ \beta_{43} \end{bmatrix}, \qquad \mathbf{M} = \begin{bmatrix} \beta_{11} &amp; \beta_{12} &amp; \beta_{14} \\ \beta_{21} &amp; \beta_{22} &amp; \beta_{24} \\ \beta_{41} &amp; \beta_{42} &amp; \beta_{44} \end{bmatrix}$$ Matrix $\mathbf{V}$ is defined as the elements of $\mathbf{B}$ representing node 3 going to all other nodes of interest. Matrix $\mathbf{M}$ is defined as the remaining elements of $\mathbf{B}$ after removing the row and column associated with node 3. Matrix $\mathbf{V}$ represents the influence of node 3 on all other nodes of interest at a future state. Note that the sum of the absolute values of $\mathbf{V}$ is the out-strength of node 3 to nodes 1, 2, and 4. Matrix $\mathbf{M}$ represents the joint influence of nodes 1, 2, and 4 on themselves at a future state (i.e., the dynamics of these nodes).</p><p>The synchrony between these nodes can then be computed as: $$S_E = \mathbf{V}' \mathbf{M}^{-1} \mathbf{V}$$</p><p>In this equation, $\mathbf{M}^{-1}\mathbf{V}$ yields a linear map (i.e., regression weight) from the internal dynamics represented by $\mathbf{M}$ to the external dynamics represented by $\mathbf{V}$. $\mathbf{M}^{-1}$ represents the matrix inverse of $\mathbf{M}$ and $\mathbf{V}'$ represents the matrix transpose of $\mathbf{V}$. A further premultiplication by $\mathbf{V}'$ yields a single scalar regression weight, $S_E$, mapping how much the outward dynamics of node 3 ($\mathbf{V}$) influence the association between the outward dynamics of node 3 and the internal dynamics of nodes 1, 2, and 4 ($\mathbf{M}$). 
That is, $S_E$ represents how much of the joint temporal relationship between nodes 1, 2, 3, and 4 is due to changes in node 3. Higher values of $S_E$ represent more synchrony and lower values represent less synchrony. In this example, $S_E \approx .745$.</p><p>For RQ1, mlVAR was conducted to yield an average temporal network across all triads that accounts for the inherent nesting of participants within teams in the data. Statistically significant paths were assessed at the p &lt; .01 level due to the large number of simultaneous estimations occurring within this model. For RQ2, RQ3, and RQ4, we calculated node in-strength (a measure of being influenced), out-strength (a measure of influence), and synchrony from sub-models derived from mlVAR representing the dynamics of each triad.</p><p>All analyses relevant to RQ2-RQ4 were assessed using linear mixed-effects models. These models were necessary to account for the inherent nesting of individual participant network metrics within teams. In total, six models were fit, each with either in-strength, out-strength, or synchrony as the outcome variable and either participant role (ROLE) or data stream (MODALITY) as the predictor variable. Models with participant role as a predictor take the form: $$Y = f(\mathrm{ROLE}) + u_0 + \varepsilon$$ Models with data stream as a predictor take the form: $$Y = f(\mathrm{MODALITY}) + u_0 + \varepsilon$$ where $Y$ is either the in-strength, out-strength, or synchrony score, $f(\mathrm{ROLE})$ and $f(\mathrm{MODALITY})$ are linear models of the form XB that include all main effects (3 for ROLE and 9 for MODALITY), $u_0$ represents a random intercept term per team, and $\varepsilon$ is an error term. A false discovery rate correction was then applied to control for Type-I statistical errors. We report the results of all between-role contrasts and the strongest effects shown for the between-modality contrasts.</p></div>
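<div xmlns="http://www.tei-c.org/ns/1.0"><p>The three node-level metrics defined in this section can be sketched in a few lines of plain Python. The sketch below is illustrative only: the adjacency matrix is hypothetical (it is not the worked example from the text), indices are zero-based, and the function names are assumptions rather than part of the mlVAR R package. It follows the column-to-row convention, so entry B[k][j] is the edge from node j at time t-1 to node k at time t, and it solves the linear system M x = V instead of forming the inverse of M explicitly.</p>
<p><![CDATA[
```python
def in_strength(B, node, others):
    """Sum of absolute edge weights going into `node` from the nodes in `others`."""
    return sum(abs(B[node][j]) for j in others)


def out_strength(B, node, others):
    """Sum of absolute edge weights going out of `node` to the nodes in `others`."""
    return sum(abs(B[i][node]) for i in others)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("M is singular; synchrony is undefined")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def synchrony(B, reference, others):
    """Synchrony score V' M^{-1} V for `reference` relative to `others`."""
    V = [B[i][reference] for i in others]           # edges from reference to others
    M = [[B[i][j] for j in others] for i in others]  # dynamics among the others
    return sum(v * w for v, w in zip(V, solve(M, V)))


# Hypothetical 4-node adjacency matrix (rows = targets at t, columns = sources at t-1).
B = [
    [0.50, 0.25, 0.20, 0.00],
    [0.00, 0.50, -0.10, 0.00],
    [0.10, 0.30, 0.40, -0.40],
    [0.00, 0.00, 0.30, 0.50],
]
reference, others = 2, [0, 1, 3]  # node 3 versus nodes 1, 2, and 4
scores = (
    in_strength(B, reference, others),
    out_strength(B, reference, others),
    synchrony(B, reference, others),
)
```
]]></p>
<p>For this hypothetical matrix the three scores are approximately 0.8, 0.6, and 0.3. Applying the same functions to a triad-specific matrix (fixed plus random effects) would give per-triad metrics of the kind analyzed for RQ2-RQ4.</p></div>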
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">RESULTS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Global Network</head><p>The mlVAR method estimated network graphs for individual triads (i.e., random effects), as well as a single network representing statistically significant general findings across all estimated networks (i.e., fixed effects). This single graph is shown in Figure <ref type="figure">3</ref>. Findings from the overall network answer RQ1. Although this network was complex given the large number of significant associations, a substantial amount of qualitative information can be gleaned from it. For instance, the strongest connections existed between a variable and itself at the next time point, indicating that these variables tend to greatly influence their own dynamics across time.</p><p>Second, the number of significant connections within any one team member (40 for controller, 39 for primary contributor, 40 for secondary contributor) was roughly equal to the number between any member and the other two team members (average number of connections = 35). This indicated that there was indeed information transfer between these multimodal signals occurring during collaborative interaction and that regulatory processes existed both within a single participant and between that participant and both other collaborators. Analyses relating to RQ2-RQ4 go into further detail. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Individual Triad Networks</head><p>In addition to the global model, mlVAR estimated networks similar to that of Figure <ref type="figure">3</ref> for each triad. In order to understand collaboration dynamics within and across triads for each of the multimodal signals (emotional, physiological, attentional, nonverbal, and verbal), in-strength, out-strength, and synchrony were calculated for all nodes. All analyses were conducted at the within-team level using in-strength, out-strength, and synchrony values assessed from a participant in one role to participants in other roles. For instance, controller in-strength is calculated from all edges going to controller nodes from primary and secondary contributor nodes.</p><p>Linear mixed-effects models were then used to understand how participant role influenced average in-strength, out-strength, and synchrony scores between an individual in a given role and their two collaborators. The same type of model was also used to assess how different modalities influenced average levels of in-strength, out-strength, and synchrony scores. A random effect of team was included to account for the nesting of participants within teams (Figure <ref type="figure">4</ref>).</p><p>In-Strength. Node in-strength represents how much a single data stream from a single participant was jointly influenced by all other data streams from both other participants (RQ2, Figure <ref type="figure">4</ref>-A and 4-D). Controllers were significantly more influenced (i.e., had higher average in-strength) than primary contributors (p = .001) or secondary contributors (p &lt; .001). Primary contributors were also significantly more influenced than secondary contributors (p &lt; .001). Looking more closely at individual data streams, participants in all roles were most influenced in their skin conductance (all p &lt; .001). This indicates that influence may be best observed through physiological arousal.</p><p>Out-Strength. 
Node out-strength represents how much a single data stream from a single participant was able to collectively influence all other data streams from both other participants (RQ3, Figure <ref type="figure">4-B</ref> and <ref type="figure">4-E</ref>). There were no significant role differences in out-strength (all p &gt; .836). However, at a per data stream level, both gaze dispersion and gaze fixation showed significantly larger influence on all other variables compared to all other variables within a role (all p &lt; .001). This indicates that participants' attentional information directed the behavior of other signals within other participants.</p><p>Synchrony. Node synchrony represents how much a set of relationship strengths between data signals is jointly influenced by specific nodes (RQ4, Figure <ref type="figure">4</ref>-C and 4-F). There were no significant role differences in synchrony (p &gt; .575). Between modalities, similar to out-strength, both gaze dispersion and gaze fixation showed significantly more influence on the synchronization/relationship between data signals compared to other signals (all p &lt; .001).</p></div>
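<div xmlns="http://www.tei-c.org/ns/1.0"><p>The contrasts reported above were subject to a false discovery rate correction. The text does not name the specific procedure, so the following sketch assumes the widely used Benjamini-Hochberg step-up adjustment; the function name and the p-values in the usage example are hypothetical and are not results from this study.</p>
<p><![CDATA[
```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values from four contrasts.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
```
]]></p>
<p>For these hypothetical inputs the adjusted values are approximately 0.02, 0.04, 0.04, and 0.02. A contrast is then declared significant when its adjusted value falls below the chosen false discovery rate.</p></div>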
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">DISCUSSION</head><p>Collaboration is a complex process involving dynamic multimodal multi-party interactions. We have shown that mlVAR is capable of modeling multimodal multi-party data over time. The results of mlVAR are a set of adjacency matrices that can be used to test complex hypotheses regarding both within- and between-participant dynamics. Although we have shown a specific case of mlVAR applied to triads, mlVAR can theoretically be scaled to any number of group members with any number of shared or uncommon variables. This makes mlVAR an invaluable tool for researchers interested in a complex and holistic view of multimodal collaboration processes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Main Findings</head><p>Our findings emphasize the importance of each role on the dynamics of the team. Team members are in general similar in their ability to influence and synchronize the modalities of other team members (i.e., indistinguishable out-strength and synchrony values across roles). However, the controller and primary contributors have a unique place in the team as their modalities are the most influenced by the dynamics of the team (i.e., controllers show the highest in-strength levels, followed by primary contributors, followed by secondary contributors). This makes sense in the context of the current study as controllers were the only ones able to directly affect the end result of a given physics puzzle.</p><p>At an individual modality level, we find that the internal states of participants (as measured by skin conductance) are most influenced by their team members, while participants' attention (as measured by gaze fixation and dispersion) was the most influential modality shared between participants, as well as the modality most influential to team synchrony. Interestingly, while skin conductance was highly influenced by other team members, it was not at all influential to other team members' data streams or the synchronization between those data streams. This may mean that the dynamics of collaboration have their strongest influence on physiological arousal, and that physiological arousal is a data stream only valuable in information transfer within an individual and not between individuals. Thus, it appears that a main component of team collaboration dynamics involves processes in which overt signals between individuals (e.g., attention and verbal modalities) influence the internal states of individual members. This change in the internal state of individuals may then focus the collective behavior of a team toward a common goal.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Applications and Implications</head><p>All of these findings would be difficult to ascertain outside of the mlVAR framework. If properly implemented, the mlVAR method has the potential to offer researchers unparalleled insight into the dynamics of interpersonal collaboration between multiple agents. Possible applications of the findings of the current manuscript exist for both real-time team collaboration optimization and recommender systems. Researchers such as Palau et al. <ref type="bibr">[46]</ref> have shown that behavioral dynamics estimated through network analysis can be used to create recommender systems as a means of improving collaboration between individuals or to create strategic collaborative groupings of agents <ref type="bibr">[24]</ref>. Results from an mlVAR analysis may also be used as a "team fingerprint" in identifying specific teams by their shared multimodal dynamics <ref type="bibr">[47]</ref>, or allow artificial intelligence agents to have a better understanding of human affective states in order to assess and mitigate collaboration issues <ref type="bibr">[51]</ref>. Network perturbation testing also offers valuable "what-if" scenario testing by allowing researchers to change specific parameters or data streams of a learned network model to understand the expected change in other data streams at a later time point <ref type="bibr">[39]</ref>. This would be useful in determining specific changes within a given team that might improve team performance over time or keep team performance from falling.</p></div>
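<div xmlns="http://www.tei-c.org/ns/1.0"><p>The "what-if" idea mentioned above can be illustrated with the simplest possible case: a one-step-ahead prediction from a lag-1 adjacency matrix, ignoring intercepts and error terms. This is a sketch of the general idea under those assumptions, not the perturbation procedure of the cited work; the three-node matrix, the node labels, and the function names are hypothetical.</p>
<p><![CDATA[
```python
def predict_next(B, state):
    """One-step-ahead prediction: next[k] = sum over j of B[k][j] * state[j]."""
    return [sum(b_kj * s_j for b_kj, s_j in zip(row, state)) for row in B]


def what_if(B, state, node, delta):
    """Change in every node's predicted next value when `node` is shifted by `delta`."""
    perturbed = list(state)
    perturbed[node] += delta
    baseline = predict_next(B, state)
    shifted = predict_next(B, perturbed)
    return [after - before for before, after in zip(baseline, shifted)]


# Hypothetical network: 0 = gaze dispersion, 1 = speech rate, 2 = skin conductance.
# Rows are targets at time t, columns are sources at time t-1.
B = [
    [0.6, 0.0, 0.0],
    [0.1, 0.5, 0.0],
    [0.3, 0.2, 0.4],
]
state = [1.0, 0.5, 0.2]
effect = what_if(B, state, node=0, delta=1.0)
```
]]></p>
<p>In this linear setting the effect of a unit change in one node is simply that node's column of the adjacency matrix (here approximately 0.6, 0.1, and 0.3), which is why out-strength summarizes a node's capacity to change others. Effects further into the future would require applying the prediction step repeatedly.</p></div>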
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Limitations/Future Work</head><p>Although mlVAR is a well-suited analysis for studying collaboration in an MMP framework, there are limitations to this method that researchers should be aware of. Two primary concerns are the possibility of false positives and of false negatives in the network model estimation process. As the mlVAR model is relatively new, little research has been done on the error rates and statistical power of these models. There are suggestions on minimum sample sizes and effect size calculations, but few formal analyses have been conducted to assess these statistical properties <ref type="bibr">[37,</ref><ref type="bibr">61]</ref>. It can be expected, however, that as the number of multimodal signals or the number of group members increases, more data will be necessary at both the individual level and the group level.</p><p>Additionally, although we have only discussed in-strength, out-strength, and synchrony as measures of how influenced a variable is, how influential a variable is, and how much synchrony is due to a specific variable, respectively, there are dozens more network metrics that can be calculated from mlVAR <ref type="bibr">[35]</ref>. Each metric has its own meaning and may be of differing interest to different researchers. More metrics are constantly being developed for quantifying network dynamics in order to study specific aspects of network functions and differences between networks <ref type="bibr">[59]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Concluding Remarks</head><p>Network models are an invaluable tool for inferential analysis of multimodal multi-party collaboration processes. We have demonstrated that the mlVAR model is an especially well-suited method for uncovering collaboration dynamics occurring within and between collaborative agents. The mlVAR model is uniquely able to handle this large number of variables, as well as a large number of collaborators, making it an indispensable tool for understanding complex collaboration efforts. Additionally, new network metrics are being developed yearly, each with its own specific representation of the complex dynamics occurring in temporal network models. As the mlVAR model is still actively being improved upon, we believe it will continue to increase its utility for studying multimodal multi-party processes and will become a common analysis for individuals interested in studying group dynamics such as collaboration.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>In this paper we make a clear distinction between network analysis/network models and artificial neural networks (ANNs). Artificial neural networks seek to estimate an underlying functional relationship between a set of inputs and a set of outputs, generally for the purpose of prediction. Network analytic models seek to estimate associations between all variables of interest for the purpose of statistical inference and modeling.</p></note>
		</body>
		</text>
</TEI>
