Title: Model Lakes. In EDBT 2025.
Given a set of deep learning models, it can be hard to find models appropriate to a task, understand the models, and characterize how models differ from one another. Currently, practitioners rely on manually written documentation to understand and choose models. However, not all models have complete and reliable documentation. As the number of models increases, the challenges of finding, differentiating, and understanding models become increasingly crucial. Inspired by research on data lakes, we introduce the concept of model lakes. We formalize key model lake tasks, including model attribution, versioning, search, and benchmarking, and discuss fundamental research challenges in the management of large models. We also explore what data management techniques can be brought to bear on the study of large model management.
Award ID(s):
2325632 2107248
PAR ID:
10614597
Author(s) / Creator(s):
; ;
Editor(s):
EDBT
Publisher / Repository:
OpenProceedings.org
Date Published:
Subject(s) / Keyword(s):
Data Management Database Technology
Format(s):
Medium: X
Institution:
EDBT International Conference on Extending Data Base Technology
Sponsoring Org:
National Science Foundation
More Like this
  1. There is strong agreement across the sciences that replicable workflows are needed for computational modeling. Open and replicable workflows not only strengthen public confidence in the sciences, but also result in more efficient community science. However, the massive size and complexity of geoscience simulation outputs, as well as the large cost to produce and preserve these outputs, present problems related to data storage, preservation, duplication, and replication. The simulation workflows themselves present additional challenges related to usability, understandability, documentation, and citation. These challenges make it difficult for researchers to meet the bewildering variety of data management requirements and recommendations across research funders and scientific journals. This paper introduces initial outcomes and emerging themes from the EarthCube Research Coordination Network project titled “What About Model Data? - Best Practices for Preservation and Replicability,” which is working to develop tools to assist researchers in determining what elements of geoscience modeling research should be preserved and shared to meet evolving community open science expectations. Specifically, the paper offers approaches to address the following key questions:
     • How should preservation of model software and outputs differ for projects that are oriented toward knowledge production vs. projects oriented toward data production?
     • What components of dynamical geoscience modeling research should be preserved and shared?
     • What curation support is needed to enable sharing and preservation for geoscience simulation models and their output?
     • What cultural barriers impede geoscience modelers from making progress on these topics?
  2. Comprehensive assessments of hydrological components are crucial for enhancing operational water supply simulations. However, hydrological models are often evaluated based on their surface flow simulations, while the validation of subsurface and groundwater components tends to be overlooked or not well documented. In this study, we evaluated the outputs of two hydrological models, the Large Basin Runoff Model (LBRM) and the Weather Research and Forecasting – Hydrological modeling extension package (WRF-Hydro), for potential implementation in operational water balance forecasting in the Great Lakes region. We examined the simulated hydrological variables including surface (e.g. snow water equivalent, evapotranspiration, and streamflow), subsurface (e.g. soil moisture at different layers), and groundwater components with observed or reference data from ground-based stations and remotely sensed images. The findings of this study provide valuable insights into the capabilities and limitations of each model. These findings contribute to more informed water management strategies for the Great Lakes region. 
  3. Current research and literature lack the discussion of how production automation is introduced to existing lines from the perspective of change management. This paper presents a case study conducted to understand the change management process for a large-scale automation implementation in a manufacturing environment producing highly complex products. Through a series of fifteen semi-structured interviews of eight engineers from three functional backgrounds, a process model was created to understand how the company of study introduced a new automation system into their existing production line, while also noting obstacles identified in the process. This process model illustrates the duration, sequencing, teaming, and complexity of the project. This model is compared to other change process models found in literature to understand critical elements found within change management. The process that was revealed in the case study appeared to contain some elements of a design process as compared to traditional change management processes found in literature. Finally, a collaborative resistance model is applied to the process model to identify and estimate the resistance for each task in the process. Based on the objective analysis of the collaborative situations, the areas of highest resistance are identified. By comparing the resistance model to the interview data, the results show that the resistance model does identify the challenges found in interviews. This means that the resistance model has the potential to identify obstacles within the process and open the opportunity to mitigate those challenges before they are encountered within the process. 
  4. Abstract. We develop a new large-scale hydrological and water resources model, the Community Water Model (CWatM), which can simulate hydrology both globally and regionally at different resolutions from 30 arcmin to 30 arcsec at daily time steps. CWatM is open source in the Python programming environment and has a modular structure. It uses global, freely available data in the netCDF4 file format for reading, storage, and production of data in a compact way. CWatM includes general surface and groundwater hydrological processes but also takes into account human activities, such as water use and reservoir regulation, by calculating water demands, water use, and return flows. Reservoirs and lakes are included in the model scheme. CWatM is used in the framework of the Inter-Sectoral Impact Model Intercomparison Project (ISIMIP), which compares global model outputs. The flexible model structure allows for dynamic interaction with hydro-economic and water quality models for the assessment and evaluation of water management options. Furthermore, the novelty of CWatM is its combination of state-of-the-art hydrological modeling, modular programming, an online user manual and automatic source code documentation, global and regional assessments at different spatial resolutions, and a potential community to add to, change, and expand the open-source project. CWatM also strives to build a community learning environment which is able to freely use an open-source hydrological model and flexible coupling possibilities to other sectoral models, such as energy and agriculture.
  5. Abstract. Despite the proliferation of computer‐based research on hydrology and water resources, such research is typically poorly reproducible. Published studies have low reproducibility due to incomplete availability of data and computer code, and a lack of documentation of workflow processes. This leads to a lack of transparency and efficiency because existing code can neither be quality controlled nor reused. Given the commonalities between existing process‐based hydrologic models in terms of their required input data and preprocessing steps, open sharing of code can lead to large efficiency gains for the modeling community. Here, we present a model configuration workflow that provides full reproducibility of the resulting model instantiations in a way that separates the model‐agnostic preprocessing of specific data sets from the model‐specific requirements that models impose on their input files. We use this workflow to create large‐domain (global and continental) and local configurations of the Structure for Unifying Multiple Modeling Alternatives (SUMMA) hydrologic model connected to the mizuRoute routing model. These examples show how a relatively complex model setup over a large domain can be organized in a reproducible and structured way that has the potential to accelerate advances in hydrologic modeling for the community as a whole. We provide a tentative blueprint of how community modeling initiatives can be built on top of workflows such as this. We term our workflow the “Community Workflows to Advance Reproducibility in Hydrologic Modeling” (CWARHM; pronounced “swarm”).