Abstract Reinforcement learning (RL), a subset of machine learning (ML), could optimize and control biomanufacturing processes, such as the production of therapeutic cells. Here, the process of CAR T-cell activation by antigen-presenting beads and their subsequent expansion is formulated in silico. The simulation is used as an environment to train RL-agents to dynamically control the number of beads in culture so as to maximize the population of robust effector cells at the end of the culture. At periodic control steps, the agent decides whether to add an increment of beads or remove all beads. The simulation is designed to operate in OpenAI Gym, enabling testing of different environments, cell types, RL-agent algorithms, and state inputs to the RL-agent. RL-agent training is demonstrated with three different algorithms (PPO, A2C, and DQN), each sampling three different state input types (tabular, image, mixed); PPO-tabular performs best for this simulation environment. Using this approach, training of the RL-agent on different cell types is demonstrated, resulting in a unique control strategy for each type. Sensitivity to input noise (sensor performance), the number of control-step interventions, and the advantages of pre-trained RL-agents are also evaluated. Overall, we present an RL framework for maximizing the population of robust effector cells in CAR T-cell therapy production.
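As a rough illustration of the control loop this abstract describes, the following is a minimal sketch of a bead-dosing environment written against the Gymnasium API and trained with Stable-Baselines3 PPO. The population dynamics, state variables, and reward below are invented placeholders, not the authors' simulator.

```python
# Minimal sketch of a bead-dosing control environment (hypothetical dynamics,
# not the authors' CAR T-cell simulator). Requires: gymnasium, stable-baselines3.
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import PPO


class BeadDosingEnv(gym.Env):
    """Toy culture: beads activate T cells, but prolonged exposure exhausts them."""

    def __init__(self, horizon=14):
        super().__init__()
        self.horizon = horizon
        # Actions: 0 = do nothing, 1 = add a bead increment, 2 = remove all beads
        self.action_space = spaces.Discrete(3)
        # Observation: [cell count, bead count, cumulative exhaustion, day]
        self.observation_space = spaces.Box(low=0.0, high=np.inf, shape=(4,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.cells, self.beads, self.exhaustion, self.day = 1.0, 0.0, 0.0, 0
        return self._obs(), {}

    def _obs(self):
        return np.array([self.cells, self.beads, self.exhaustion, self.day], dtype=np.float32)

    def step(self, action):
        if action == 1:
            self.beads += 0.5                     # incremental bead addition
        elif action == 2:
            self.beads = 0.0                      # complete bead removal
        stim = self.beads / (self.beads + 1.0)    # saturating stimulation
        self.cells *= 1.0 + 0.6 * stim            # bead-driven expansion
        self.exhaustion += 0.1 * stim             # cost of chronic stimulation
        self.day += 1
        terminated = self.day >= self.horizon
        # Terminal reward: expanded cells discounted by accumulated exhaustion
        reward = self.cells * max(0.0, 1.0 - self.exhaustion) if terminated else 0.0
        return self._obs(), reward, terminated, False, {}


if __name__ == "__main__":
    model = PPO("MlpPolicy", BeadDosingEnv(), verbose=0)
    model.learn(total_timesteps=20_000)           # train the agent on the toy environment
```

Swapping PPO for A2C or DQN, or switching the observation from a tabular vector to an image, only requires changing the policy class and observation definition, which mirrors the algorithm and state-input comparison described in the abstract.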
Reinforcement Learning for Control of Non-Markovian Cellular Population Dynamics

(This content will become publicly available on January 22, 2026.)
Many organisms and cell types, from bacteria to cancer cells, exhibit a remarkable ability to adapt to fluctuating environments. Additionally, cells can leverage memory of past environments to better survive previously encountered stressors. From a control perspective, this adaptability poses significant challenges in driving cell populations toward extinction, and is thus an open question with great clinical significance. In this work, we focus on drug dosing in cell populations exhibiting phenotypic plasticity. For specific dynamical models switching between resistant and susceptible states, exact solutions are known. However, when the underlying system parameters are unknown, and for complex memory-based systems, obtaining the optimal solution is currently intractable. To address this challenge, we apply reinforcement learning (RL) to identify informed dosing strategies to control cell populations evolving under novel non-Markovian dynamics. We find that model-free deep RL is able to recover exact solutions and control cell populations even in the presence of long-range temporal dynamics. To further test our approach in more realistic settings, we demonstrate performant RL-based control strategies in environments with dynamic memory strength.
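To make the setup concrete, here is a minimal sketch of a dosing environment with a non-Markovian resistance mechanism: the switching rate into the resistant state depends on an exponentially weighted memory of past doses. The dynamics, rates, and reward are illustrative assumptions, not the model studied in the paper.

```python
# Toy dosing environment with memory-dependent resistance (illustrative only).
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class MemoryDosingEnv(gym.Env):
    """Susceptible/resistant population; resistance induction depends on dose history."""

    def __init__(self, horizon=100, memory_decay=0.9):
        super().__init__()
        self.horizon = horizon
        self.memory_decay = memory_decay
        self.action_space = spaces.Discrete(2)    # 0 = no drug, 1 = apply drug
        # Observation: [susceptible count, resistant count]; the memory is hidden
        self.observation_space = spaces.Box(0.0, np.inf, shape=(2,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.S, self.R, self.memory, self.t = 100.0, 1.0, 0.0, 0
        return self._obs(), {}

    def _obs(self):
        return np.array([self.S, self.R], dtype=np.float32)

    def step(self, action):
        dose = float(action)
        # Exponentially weighted memory of past exposure (the non-Markovian state)
        self.memory = self.memory_decay * self.memory + (1.0 - self.memory_decay) * dose
        switch = 0.05 + 0.3 * self.memory         # memory raises induction of resistance
        S, R = self.S, self.R
        self.S = max(S + 0.10 * S - 0.8 * dose * S - switch * S, 0.0)
        self.R = max(R + 0.05 * R - 0.1 * dose * R + switch * S, 0.0)
        self.t += 1
        total = self.S + self.R
        terminated = total < 1.0 or self.t >= self.horizon
        reward = -total                           # drive the population toward extinction
        return self._obs(), reward, terminated, False, {}
```

Because the exposure memory is hidden from the observation, a model-free agent must cope with the long-range temporal dependence through the observed population trajectory alone, which is the regime the paper targets.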
- Award ID(s): 2019786
- PAR ID: 10620815
- Publisher / Repository: Open Review
- Date Published:
- Format(s): Medium: X
- Location: https://openreview.net/forum?id=dsHpulHpOK
- Sponsoring Org: National Science Foundation
More Like this
- While cancer has traditionally been considered a genetic disease, mounting evidence indicates an important role for non-genetic (epigenetic) mechanisms. Common anti-cancer drugs have recently been observed to induce the adoption of non-genetic drug-tolerant cell states, thereby accelerating the evolution of drug resistance. This confounds conventional high-dose treatment strategies aimed at maximal tumor reduction, since high doses can simultaneously promote non-genetic resistance. In this work, we study optimal dosing of anti-cancer treatment under drug-induced cell plasticity. We show that the optimal dosing strategy steers the tumor to a fixed equilibrium composition between sensitive and tolerant cells, while precisely balancing the trade-off between cell kill and tolerance induction. The optimal equilibrium strategy ranges from applying a low dose continuously to applying the maximum dose intermittently, depending on the dynamics of tolerance induction. We finally discuss how our approach can be integrated with in vitro data to derive patient-specific treatment insights. (A minimal sensitive/tolerant dosing-model sketch appears after this list.)
- (Wodarz, Dominik, Ed.) The spreading of bacterial populations is central to processes in agriculture, the environment, and medicine. However, existing models of spreading typically focus on cells in unconfined settings, despite the fact that many bacteria inhabit complex and crowded environments, such as soils, sediments, and biological tissues/gels, in which solid obstacles confine the cells and thereby strongly regulate population spreading. Here, we develop an extended version of the classic Keller-Segel model of bacterial spreading via motility that also incorporates cellular growth and division, and explicitly considers the influence of confinement in promoting both cell-solid and cell-cell collisions. Numerical simulations of this extended model demonstrate how confinement fundamentally alters the dynamics and morphology of spreading bacterial populations, in good agreement with recent experimental results. In particular, with increasing confinement, we find that cell-cell collisions increasingly hinder the initial formation and the long-time propagation speed of chemotactic pulses. Moreover, also with increasing confinement, we find that cellular growth and division plays an increasingly dominant role in driving population spreading, eventually leading to a transition from chemotactic spreading to growth-driven spreading via a slower, jammed front. This work thus provides a theoretical foundation for further investigations of the influence of confinement on bacterial spreading. More broadly, these results help to provide a framework to predict and control the dynamics of bacterial populations in complex and crowded environments. (A minimal finite-difference sketch of confined, growing chemotactic spreading appears after this list.)
- In reinforcement learning (RL), the ability to utilize prior knowledge from previously solved tasks can allow agents to quickly solve new problems. In some cases, these new problems may be approximately solved by composing the solutions of previously solved primitive tasks (task composition). Otherwise, prior knowledge can be used to adjust the reward function for a new problem, in a way that leaves the optimal policy unchanged but enables quicker learning (reward shaping). In this work, we develop a general framework for reward shaping and task composition in entropy-regularized RL. To do so, we derive an exact relation connecting the optimal soft value functions for two entropy-regularized RL problems with different reward functions and dynamics. We show how the derived relation leads to a general result for reward shaping in entropy-regularized RL. We then generalize this approach to derive an exact relation connecting optimal value functions for the composition of multiple tasks in entropy-regularized RL. We validate these theoretical contributions with experiments showing that reward shaping and task composition lead to faster learning in various settings. (A tabular soft value-iteration sketch appears after this list.)
- Partially Observable Markov Decision Processes (POMDPs) can model complex sequential decision-making problems under stochastic and uncertain environments. A main reason hindering their broad adoption in real-world applications is the unavailability of a suitable POMDP model or a simulator thereof. Available solution algorithms, such as Reinforcement Learning (RL), typically benefit from the knowledge of the transition dynamics and the observation generating process, which are often unknown and non-trivial to infer. In this work, we propose a combined framework for inference and robust solution of POMDPs via deep RL. First, all transition and observation model parameters are jointly inferred via Markov Chain Monte Carlo sampling of a hidden Markov model, which is conditioned on actions, in order to recover full posterior distributions from the available data. The POMDP with uncertain parameters is then solved via deep RL techniques with the parameter distributions incorporated into the solution via domain randomization, in order to develop solutions that are robust to model uncertainty. As a further contribution, we compare the use of Transformers and long short-term memory networks, which constitute model-free RL solutions and work directly on the observation space, with an approach termed the belief-input method, which works on the belief space by exploiting the learned POMDP model for belief inference. We apply these methods to the real-world problem of optimal maintenance planning for railway assets and compare the results with the current real-life policy. We show that the RL policy learned by the belief-input method is able to outperform the real-life policy by yielding significantly reduced life-cycle costs. (A domain-randomization sketch appears after this list.)
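For the drug-induced plasticity entry above, the following is a minimal sketch of a two-compartment sensitive/tolerant ODE model in which the dose both kills sensitive cells and induces switching into the tolerant state. The equations, rates, and dosing schedules are illustrative assumptions, not the model or the optimal policy derived in that work.

```python
# Sensitive/tolerant tumor model with dose-dependent kill and tolerance induction
# (illustrative parameters; not the cited paper's model).
import numpy as np
from scipy.integrate import solve_ivp


def dynamics(t, y, dose_fn):
    S, T = y
    u = dose_fn(t)                 # drug concentration at time t
    kill = 1.0 * u                 # dose-dependent kill of sensitive cells
    induce = 0.5 * u               # dose-dependent switching into tolerance
    revert = 0.05                  # spontaneous reversion to sensitivity
    dS = 0.3 * S - kill * S - induce * S + revert * T
    dT = 0.1 * T - 0.1 * u * T + induce * S - revert * T
    return [dS, dT]


def final_burden(dose_fn, t_end=60.0):
    sol = solve_ivp(dynamics, (0.0, t_end), [1.0, 0.01], args=(dose_fn,), max_step=0.1)
    return sol.y[0, -1] + sol.y[1, -1]


def continuous_low(t):
    return 0.4                                    # constant low dose


def intermittent_max(t):
    return 1.0 if (t % 10.0) < 3.0 else 0.0       # pulsed maximum dose


print("low continuous dose  :", final_burden(continuous_low))
print("intermittent max dose:", final_burden(intermittent_max))
```

Sweeping the induction coefficient in `dynamics` is one way to explore how the strength of tolerance induction shifts the balance between the continuous low-dose and intermittent maximum-dose schedules.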
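For the bacterial spreading entry, a one-dimensional explicit finite-difference sketch of a Keller-Segel-type model with logistic growth is shown below; confinement is crudely represented by scaling down the motility and chemotactic coefficients. All coefficients and the confinement treatment are illustrative assumptions, not the extended model of that paper.

```python
# 1D Keller-Segel-type model with logistic growth (explicit scheme, toy parameters).
import numpy as np

L, nx = 10.0, 200
dx = L / nx
dt = 0.002                      # small enough for the explicit diffusion stencil
steps = 5000

confinement = 0.5               # 1.0 = unconfined; smaller values damp motility
Db = 0.05 * confinement         # bacterial diffusivity
chi = 0.2 * confinement         # chemotactic coefficient
Dc = 0.1                        # attractant diffusivity
growth, capacity, uptake = 0.5, 1.0, 1.0

b = np.zeros(nx)
b[nx // 2 - 5: nx // 2 + 5] = 1.0   # bacteria seeded in the center
c = np.ones(nx)                      # uniform attractant/nutrient field


def lap(f):
    return (np.roll(f, 1) - 2 * f + np.roll(f, -1)) / dx**2


def grad(f):
    return (np.roll(f, -1) - np.roll(f, 1)) / (2 * dx)


for _ in range(steps):
    flux = chi * b * grad(c)                                   # chemotactic flux
    db = Db * lap(b) - grad(flux) + growth * b * (1 - b / capacity)
    dc = Dc * lap(c) - uptake * b * c
    b = np.clip(b + dt * db, 0.0, None)
    c = np.clip(c + dt * dc, 0.0, None)

print("peak density:", round(float(b.max()), 3),
      "| occupied width:", round(dx * float((b > 0.1).sum()), 2))
```

Lowering `confinement` suppresses the chemotactic contribution relative to the growth term, which is a crude stand-in for the crossover from chemotactic to growth-driven spreading described in that entry.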
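For the entropy-regularized RL entry, the small tabular routine below sketches what "optimal soft value functions" refers to: the standard log-sum-exp soft Bellman backup on a randomly generated MDP. It is generic background, not the specific shaping or composition relations derived in that paper.

```python
# Tabular soft (entropy-regularized) value iteration on a random MDP.
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, beta = 5, 3, 0.95, 5.0          # beta = inverse temperature

P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a, s'] transition probabilities
R = rng.uniform(0, 1, size=(nS, nA))           # R[s, a] rewards

V = np.zeros(nS)
for _ in range(500):
    Q = R + gamma * P @ V                      # soft Q backup, shape (nS, nA)
    V_new = (1.0 / beta) * np.log(np.exp(beta * Q).sum(axis=1))   # log-sum-exp backup
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

pi = np.exp(beta * (Q - V[:, None]))           # soft-optimal policy; rows sum to 1
print("soft values:", np.round(V, 3))
print("policy row sums:", pi.sum(axis=1))
```

Shaping or composing rewards in this setting amounts to modifying `R` and asking how the converged `V` and `pi` change, which is the kind of relation that entry derives exactly.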
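For the POMDP entry, the sketch below isolates the domain-randomization idea: a Gymnasium wrapper that resamples the environment's model parameters from a set of posterior draws at every reset, so a policy trained on the wrapped environment must be robust across the parameter uncertainty. The wrapper interface and the assumption that the inner environment exposes a `set_params` method are illustrative, not the paper's implementation.

```python
# Domain-randomization wrapper: resample model parameters from posterior draws
# at each episode reset (illustrative; assumes the inner env implements set_params()).
import numpy as np
import gymnasium as gym


class DomainRandomizationWrapper(gym.Wrapper):
    def __init__(self, env, posterior_samples):
        super().__init__(env)
        # posterior_samples: array of shape (n_draws, n_params), e.g. from MCMC
        self.posterior_samples = np.asarray(posterior_samples)

    def reset(self, seed=None, options=None):
        rng = np.random.default_rng(seed)
        draw = self.posterior_samples[rng.integers(len(self.posterior_samples))]
        self.env.unwrapped.set_params(draw)    # hypothetical hook on the inner env
        return self.env.reset(seed=seed, options=options)
```

A recurrent policy (e.g. an LSTM or Transformer) trained on such a wrapped environment corresponds to the model-free, observation-space approach described in that entry, whereas the belief-input method would additionally feed a belief state computed from the learned POMDP model.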