Title: Adaptive Discretization in Online Reinforcement Learning
Discretization-based approaches to solving online reinforcement learning problems are studied extensively on applications such as resource allocation and cache management. The two major questions in designing discretization-based algorithms are how to create the discretization and when to refine it. There are several experimental results investigating heuristic approaches to these questions but little theoretical treatment. In this paper, we provide a unified theoretical analysis of model-free and model-based, tree-based adaptive hierarchical partitioning methods for online reinforcement learning. We show how our algorithms take advantage of inherent problem structure by providing guarantees that scale with respect to the “zooming” dimension rather than the ambient dimension, an instance-dependent quantity measuring the benignness of the optimal $Q$ function. Many applications in computing systems and operations research require algorithms that compete on three facets: low sample complexity, mild storage requirements, and low computational burden for policy evaluation and training. Our algorithms are easily adapted to operating constraints, and our theory provides explicit bounds across each of the three facets. Funding: This work is supported by funding from the National Science Foundation [Grants ECCS-1847393, DMS-1839346, CCF-1948256, and CNS-1955997] and the Army Research Laboratory [Grant W911NF-17-1-0094]. Supplemental Material: The online appendix is available at https://doi.org/10.1287/opre.2022.2396 .
Award ID(s):
1955997 1948256 1847393
NSF-PAR ID:
10437343
Journal Name:
Operations Research
ISSN:
0030-364X
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
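
The abstract above poses two design questions: how to create the discretization and when to refine it. The following is a minimal sketch of one possible answer, not the paper's algorithm. It assumes a one-dimensional space [0, 1], a binary tree of intervals, and a hypothetical splitting rule (refine a cell once its visit count reaches 1/diameter^2) chosen only for illustration. The names `Cell`, `AdaptivePartition`, `observe`, and `leaves` are illustrative and do not come from the source. Value estimates and action selection are omitted.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cell:
    """One interval [lo, hi) of the partition, with a visit counter."""
    lo: float
    hi: float
    visits: int = 0
    children: List["Cell"] = field(default_factory=list)

    @property
    def diameter(self) -> float:
        return self.hi - self.lo


class AdaptivePartition:
    """Tree-based partition of [lo, hi) that refines where samples land."""

    def __init__(self, lo: float = 0.0, hi: float = 1.0) -> None:
        self.root = Cell(lo, hi)

    def find_leaf(self, x: float) -> Cell:
        # Walk down the tree to the leaf interval containing x.
        cell = self.root
        while cell.children:
            left, right = cell.children
            cell = left if x < left.hi else right
        return cell

    def observe(self, x: float) -> Cell:
        # Record a visit, then decide whether to refine (assumed rule:
        # split once visits >= (1 / diameter)^2). Returns the visited cell.
        leaf = self.find_leaf(x)
        leaf.visits += 1
        if leaf.visits >= (1.0 / leaf.diameter) ** 2:
            mid = (leaf.lo + leaf.hi) / 2.0
            # Children start with zero visits in this sketch.
            leaf.children = [Cell(leaf.lo, mid), Cell(mid, leaf.hi)]
        return leaf

    def leaves(self) -> List[Cell]:
        # Collect the current (finest) cells of the partition.
        out, stack = [], [self.root]
        while stack:
            cell = stack.pop()
            if cell.children:
                stack.extend(cell.children)
            else:
                out.append(cell)
        return out
```

Calling `observe` repeatedly on points near one location refines the partition only around that location, while rarely visited regions keep coarse cells. This is the intuition behind guarantees that depend on problem structure rather than on a uniform grid over the whole space.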
More Like this
  1. Objective Over the past decade, we developed and studied a face-to-face video-based analysis-of-practice professional development (PD) model. In a cluster randomized trial, we found that the face-to-face model enhanced elementary science teacher knowledge and practice and resulted in important improvements to student science achievement (student treatment effect, d = 0.52; Taylor et al., 2017; Roth et al., 2018). The face-to-face PD model is expensive and difficult to scale. In this paper, we present the results of a two-year design-based research study to translate the face-to-face PD into a facilitated online PD experience. The purpose is to create an effective, flexible, and cost-efficient PD model that will reach a broader audience of teachers. Perspective/Theoretical Framework The face-to-face PD model is grounded in situated cognition and cognitive apprenticeship frameworks. Teachers engage in learning science content and effective science teaching practices in the context in which they will be teaching. There are scaffolded opportunities for teachers to learn from analysis of model videos by experienced teachers, to try teaching model units, to analyze video of their own teaching efforts, and ultimately to develop their own unit, with guidance. The PD model attends to the key features of effective PD as described by Desimone (2009) and others. We adhered closely to the design principles of the face-to-face model as described by Authors, 2019. Methods We followed a design-based research approach (DBR; Cobb et al., 2003; Shavelson et al., 2003) to examine the online program components and how they promoted or interfered with the development of teachers’ knowledge and reflective practice. Of central interest was the examination of mechanisms for facilitating teacher learning (Confrey, 2006). 
To accomplish this goal, design researchers engaged in iterative cycles of problem analysis, design, implementation, examination, and redesign (Wang & Hannafin, 2005) in phase one of the project before studying its effect. Data Three small pilot groups of teachers engaged in both synchronous and asynchronous components of the larger online course, which begins with a 10-week summer course that leads into study groups of participants meeting through one academic year. We iteratively designed, tested, and revised 17 modules across three pilot versions. On average, pilot groups completed one module every two weeks. Pilot 1 began the work in May 2019, Pilot 2 began in August 2019, and Pilot 3 began in October 2019. Pilot teachers responded to surveys and took part in interviews related to the PD. The PD facilitators took extensive notes after each iteration. The development team met weekly to discuss revisions. We revised all modules between each pilot group and used what we learned to inform our development of later modules within each pilot. For example, we applied what we learned from testing Module 3 with Pilot 1 to the development of Module 3 for Pilot 2, and also applied what we learned from Module 3 with Pilot 1 to the development of Module 7 for Pilot 1. Results We found that community building required the same incremental trust-building activities that occur in face-to-face PD. Teachers began with low-risk activities and gradually engaged in activities that required greater vulnerability (sharing a video of themselves teaching a model unit for analysis and critique by the group). We also identified how to contextualize technical tools with instructional prompts to allow teachers to productively interact with one another about science ideas asynchronously. As part of that effort, we crafted crux questions to surface teachers’ confusions or challenges related to content or pedagogy. 
We called them crux questions because they revealed teachers’ uncertainty and deepened learning during the discussion. Facilitators leveraged asynchronous responses to crux questions in the synchronous sessions to push teacher thinking further than would have otherwise been possible in a 2-hour synchronous video-conference. Significance Supporting teachers with effective, flexible, and cost-efficient PD is difficult under the best of circumstances. In the era of COVID-19, online PD has taken on new urgency. NARST members will gain insight into the translation of an effective face-to-face PD model to an online environment. 
  2. We study model-free reinforcement learning (RL) algorithms for infinite-horizon average-reward Markov decision processes (MDPs), a setting that is more appropriate for applications that involve continuing operations not divided into episodes. In contrast to episodic/discounted MDPs, theoretical understanding of model-free RL algorithms is relatively inadequate for the average-reward setting. In this paper, we consider both the online setting and the setting with access to a simulator. We develop computationally efficient model-free algorithms that achieve sharper guarantees on regret/sample complexity compared with existing results. In the online setting, we design an algorithm, UCB-AVG, based on an optimistic variant of variance-reduced Q-learning. We show that UCB-AVG achieves a regret bound $\widetilde{O}(S^5A^2sp(h^*)\sqrt{T})$ after $T$ steps, where $S\times A$ is the size of the state-action space, and $sp(h^*)$ is the span of the optimal bias function. Our result provides the first computationally efficient model-free algorithm that achieves the optimal dependence in $T$ (up to log factors) for weakly communicating MDPs, a condition that is necessary for low regret. In contrast, prior results either are suboptimal in $T$ or require strong assumptions of ergodicity or uniform mixing of MDPs. In the simulator setting, we adapt the idea of UCB-AVG to develop a model-free algorithm that finds an $\epsilon$-optimal policy with sample complexity $\widetilde{O}(SAsp^2(h^*)\epsilon^{-2} + S^2Asp(h^*)\epsilon^{-1})$. This sample complexity is near-optimal for weakly communicating MDPs, in view of the minimax lower bound $\Omega(SAsp(h^*)\epsilon^{-2})$. Existing work mainly focuses on ergodic MDPs and the results typically depend on $t_{mix}$, the worst-case mixing time induced by a policy. We remark that the diameter $D$ and mixing time $t_{mix}$ are both lower bounded by $sp(h^*)$, and $t_{mix}$ can be arbitrarily large for certain MDPs. 
On the technical side, our approach integrates two key ideas: learning a $\gamma$-discounted MDP as an approximation, and leveraging reference-advantage decomposition for variance reduction in optimistic Q-learning. As recognized in prior work, a naive approximation by discounted MDPs results in suboptimal guarantees. A distinguishing feature of our method is maintaining estimates of value-difference between state pairs to provide a sharper bound on the variance of reference advantage. We also crucially use a careful choice of the discount factor $\gamma$ to balance approximation error due to discounting and the statistical learning error, and we are able to maintain a good-quality reference value function with $O(SA)$ space complexity. 
  3. Objective Over the past decade, we developed and studied a face-to-face video-based analysis-of-practice PD model. In a cluster randomized trial, we found that the face-to-face model enhanced elementary science teacher knowledge and practice, and resulted in important improvements to student science achievement (student treatment effect, d = 0.52; Taylor et al., 2017; Roth et al., 2018). The face-to-face PD model is expensive and difficult to scale. In this poster, we present the results of a two-year design-based research study to translate the face-to-face PD into a facilitated online PD experience. The purpose is to create an effective, flexible, and cost-efficient PD model that will reach a broader audience of teachers. Perspective/Theoretical Framework The face-to-face PD model is grounded in situated cognition and cognitive apprenticeship frameworks. Teachers engage in learning science content and practices in the context in which they will be teaching. In addition, there are scaffolded opportunities for teachers to learn from model videos by experienced teachers, try model units, and ultimately develop their own unit, with guidance. The PD model also attends to the key features of effective PD as described by Desimone (2009) and others. We adhered closely to the design principles of the face-to-face model as described by Roth et al., 2018. Methods We followed a design-based research approach (DBR; Cobb et al., 2003; Shavelson et al., 2003) to examine the online program components and how they promoted or interfered with the development of teachers’ knowledge and reflective practice. Of central interest was the examination of mechanisms for facilitating teacher learning (Confrey, 2006). To accomplish this goal, design researchers engaged in iterative cycles of problem analysis, design, implementation, examination, and redesign (Wang & Hannafin, 2005). Data We iteratively designed, tested, and revised 17 modules across three pilot versions. 
Three small groups of teachers engaged in both synchronous and asynchronous components of the larger online course. They responded to surveys and took part in interviews related to the PD. The PD facilitators took extensive notes after each iteration. The development team met weekly to discuss revisions. Results We found that community building required the same incremental trust-building activities that occur in face-to-face PD. Teachers began with low-risk activities and gradually engaged in activities that required greater vulnerability (sharing a video of themselves teaching a model unit for analysis and critique by the group). We also identified how to contextualize technical tools with instructional prompts to allow teachers to productively interact with one another about science ideas asynchronously. As part of that effort, we crafted crux questions to surface teachers’ confusions or challenges related to content or pedagogy. Facilitators leveraged asynchronous responses to crux questions in the synchronous sessions to push teacher thinking further than would have otherwise been possible in a 2-hour synchronous video-conference. Significance Supporting teachers with effective, flexible, and cost-efficient PD is difficult under the best of circumstances. In the era of COVID-19, online PD has taken on new urgency. AERA members will gain insight into the construction of an online PD for elementary science teachers. Full digital poster available at: https://aera21-aera.ipostersessions.com/default.aspx?s=64-5F-86-2E-15-F8-C3-C0-45-C6-A0-B7-1D-90-BE-46 
  4. Problem definition: We seek to provide an interpretable framework for segmenting users in a population for personalized decision making. Methodology/results: We propose a general methodology, market segmentation trees (MSTs), for learning market segmentations explicitly driven by identifying differences in user response patterns. To demonstrate the versatility of our methodology, we design two new specialized MST algorithms: (i) choice model trees (CMTs), which can be used to predict a user’s choice amongst multiple options, and (ii) isotonic regression trees (IRTs), which can be used to solve the bid landscape forecasting problem. We provide a theoretical analysis of the asymptotic running times of our algorithmic methods, which validates their computational tractability on large data sets. We also provide a customizable, open-source code base for training MSTs in Python that uses several strategies for scalability, including parallel processing and warm starts. Finally, we assess the practical performance of MSTs on several synthetic and real-world data sets, showing that our method reliably finds market segmentations that accurately model response behavior. Managerial implications: The standard approach to conduct market segmentation for personalized decision making is to first perform market segmentation by clustering users according to similarities in their contextual features and then fit a “response model” to each segment to model how users respond to decisions. However, this approach may not be ideal if the contextual features prominent in distinguishing clusters are not key drivers of response behavior. Our approach addresses this issue by integrating market segmentation and response modeling, which consistently leads to improvements in response prediction accuracy, thereby aiding personalization. We find that such an integrated approach can be computationally tractable and effective even on large-scale data sets. 
Moreover, MSTs are interpretable because the market segments can easily be described by a decision tree and often require only a fraction of the number of market segments generated by traditional approaches. Disclaimer: This work was done prior to Ryan McNellis joining Amazon. Funding: This work was supported by the National Science Foundation [Grants CMMI-1763000 and CMMI-1944428]. Supplemental Material: The online appendices are available at https://doi.org/10.1287/msom.2023.1195 . 
  5. Introduction and Theoretical Frameworks Our study draws upon several theoretical foundations to investigate and explain the educational experiences of Black students majoring in ME, CpE, and EE: intersectionality, critical race theory, and community cultural wealth theory. Intersectionality explains how gender operates together with race, not independently, to produce multiple, overlapping forms of discrimination and social inequality (Crenshaw, 1989; Collins, 2013). Critical race theory recognizes the unique experiences of marginalized groups and strives to identify the micro- and macro-institutional sources of discrimination and prejudice (Delgado & Stefancic, 2001). Community cultural wealth integrates an asset-based perspective to our analysis of engineering education to assist in the identification of factors that contribute to the success of engineering students (Yosso, 2005). These three theoretical frameworks are buttressed by our use of Racial Identity Theory, which expands understanding about the significance and meaning associated with students’ sense of group membership. Sellers and colleagues (1997) introduced the Multidimensional Model of Racial Identity (MMRI), in which they indicated that racial identity refers to the “significance and meaning that African Americans place on race in defining themselves” (p. 19). The development of this model was based on the reality that individuals vary greatly in the extent to which they attach meaning to being a member of the Black racial group. Sellers et al. (1997) posited that there are four components of racial identity: 1. Racial salience: “the extent to which one’s race is a relevant part of one’s self-concept at a particular moment or in a particular situation” (p. 24). 2. Racial centrality: “the extent to which a person normatively defines himself or herself with regard to race” (p. 25). 3. 
Racial regard: “a person’s affective or evaluative judgment of his or her race in terms of positive-negative valence” (p. 26). This element consists of public regard and private regard. 4. Racial ideology: “composed of the individual’s beliefs, opinions and attitudes with respect to the way he or she feels that the members of the race should act” (p. 27). The resulting 56-item inventory, the Multidimensional Inventory of Black Identity (MIBI), provides a robust measure of Black identity that can be used across multiple contexts. Research Questions Our 3-year, mixed-method study of Black students in computer (CpE), electrical (EE) and mechanical engineering (ME) aims to identify institutional policies and practices that contribute to the retention and attrition of Black students in electrical, computer, and mechanical engineering. Our four study institutions include historically Black institutions as well as predominantly white institutions, all of which are in the top 15 nationally in the number of Black engineering graduates. We are using a transformative mixed-methods design to answer the following overarching research questions: 1. Why do Black men and women choose and persist in, or leave, EE, CpE, and ME? 2. What are the academic trajectories of Black men and women in EE, CpE, and ME? 3. In what way do these pathways vary by gender or institution? 4. What institutional policies and practices promote greater retention of Black engineering students? Methods This study of Black students in CpE, EE, and ME reports initial results from in-depth interviews at one HBCU and one PWI. We asked students about a variety of topics, including their sense of belonging on campus and in the major, experiences with discrimination, the impact of race on their experiences, and experiences with microaggressions. 
For this paper, we draw on two methodological approaches that allowed us to move beyond a traditional, linear approach to in-depth interviews, allowing for more diverse experiences and narratives to emerge. First, we used an identity circle to gain a better understanding of the relative importance to the participants of racial identity, as compared to other identities. The identity circle is a series of three concentric circles, surrounding an “inner core” representing one’s “core self.” Participants were asked to place various identities from a provided list that included demographic, family-related, and school-related identities on the identity circle to reflect the relative importance of the different identities to participants’ current engineering education experiences. Second, participants were asked to complete an 8-item survey which measured the “centrality” of racial identity as defined by Sellers et al. (1997). Following Enders’ (2018) reflection on the MMRI and Nigrescence Theory, we chose to use the measure of racial centrality as it is generally more stable across situations and best “describes the place race holds in the hierarchy of identities an individual possesses and answers the question ‘How important is race to me in my life?’” (p. 518). Participants completed the MIBI items at the end of the interview to allow us to learn more about the participants’ identification with their racial group, to avoid biasing their responses to the Identity Circle, and to avoid potentially creating a stereotype threat at the beginning of the interview. This paper focuses on the results of the MIBI survey and the identity circles to investigate whether these measures were correlated. Recognizing that Blackness (race) is not monolithic, we were interested in knowing the extent to which the participants considered their Black identity as central to their engineering education experiences. 
Combined with discussion about the identity circles, this approach allowed us to learn more about how other elements of identity may shape the participants’ educational experiences and outcomes and revealed possible differences in how participants may enact various points of their identity. Findings For this paper, we focus on the results for five HBCU students and 27 PWI students who completed the MIBI and identity circle. The overall MIBI average for HBCU students was 43 (out of a possible 56) and the overall MIBI scores ranged from 36-51; the overall MIBI average for the PWI students was 40; the overall MIBI scores for the PWI students ranged from 24-51. Twenty-one students placed race in the inner circle, indicating that race was central to their identity. Five placed race on the second, middle circle; three placed race on the third, outer circle. Three students did not place race on their identity circle. For our cross-case qualitative analysis, we will choose cases across the two institutions that represent low, medium and high MIBI scores and different ranges of centrality of race to identity, as expressed in the identity circles. Our final analysis will include descriptive quotes from these in-depth interviews to further elucidate the significance of race to the participants’ identities and engineering education experiences. The results will provide context for our larger study of a total of 60 Black students in engineering at our four study institutions. Theoretically, our study represents a new application of Racial Identity Theory and will provide a unique opportunity to apply the theories of intersectionality, critical race theory, and community cultural wealth theory. Methodologically, our findings provide insights into the utility of combining our two qualitative research tools, the MIBI centrality scale and the identity circle, to better understand the influence of race on the education experiences of Black students in engineering. 