Title: Characteristics of a Multi-model Ensemble of Mock-Walker Simulations: Code and Derived Data
{"Abstract":["This folder contains the code needed to generate the plots and data for "Characteristics of a Multi-model Ensemble of Mock-Walker Simulations", a manuscript submitted to the Journal of Advances in Modeling Earth Systems (JAMES).\n\n \n\nAll code except for Figure 17 is run in Python.\n\n \n\nFor all codes, the variable "codeDirectory" should be set to the file location of the codes to run properly.  This is primarily to properly import code from "Utilities/"\n\n \n\nGeneral programs useful for handling RCEMIP data are given in \n\n"Utilities/".\n\n-Utilities/metFormulas.py: various meteorological formulas\n\n-Utilities/generalUtilities.py and Utilities/dsTools.py: basic processing tools\n\n-Utilities/extract*.py: imports RCEMIP data from directories specified in generalUtilities.py\n\n \n\nOrganization metric data is calculated by the codes in METRICS/A-*/.\n\nThe metric files are saved in METRICS/nc-mw/.  The current metric importation is designed to require the files in this folder rather than importing data directly from Utilities/aggRecalc/aggregation_metrics_mw.csv as several codes require timeseries of the values of the metrics.\n\n____________________________________________________________\n\n \n\nFigure Generation\n\nFigure 1: DemonstrateSST/demonstrateSST.py\n\n    -Saved as DemonstrateSST/demonstrateSST.png.\n\n \n\nFigure 2, 3, S1, S2: images/rlut_pr_mw_multimodel.py, function createImages().\n\n    -Stored in images/CRM/ and images/GCM/.\n\n \n\nFigure 4, S3-S6 [fourier, hovmoller]: ClassifyScenes/hovmollerWithClassifications.py\n\n    -Saved in ClassifyScenes/hovmollerWithPie/.\n\n    -4: 300dT1p25-crh-slice-sametime.pdf\n\n    -S3-S6: -crh-slice-sametime.pdf\n\n    A version of this code that omits the pie charts and does not require [fourier] is hovmoller/hovmoller.py.\n\n \n\nFigure 5, 8, 12, S7-S10: Metrics/multiplot.py.  \n\n    Files are saved as Metrics/Metrics-.pdf\n\n    -5 and S7-S10: SST is one of the five Simulations\n\n    -8: SST is T\n\n    -12: SST is DT\n\n \n\nFigure 6: ClassifyScenes/hovmollerClassFourierDiscretePoster.py.  \n\n    -Saved as ClassifyScenes/classFourierDemo/Plots/mw/SAM-CRM/SAM-CRM-300dT1p25-crh.png\n\n \n\nFigure 7, S11 [fourier]: ClassifyScenes/classDiscreteVsMetrics.py.  \n\n    -Saved as ClassifyScenes/metricViolinPlot/CRM/CRM_all_Lorg.pdf and ./CRM_all_Iorg.pdf.\n\n \n\nFigure 9: [fourier]: ClassifyScenes/plotPercentilesByCategory.py, method createFrequencyPieCharts().\n\n    -Saved as ClassifyScenes/pieCharts/CRM.pdf.\n\n \n\nFigure 10: [fourierContinuous]: ClassifyScenes/plotPercentilesByCategory.py, method createVarianceBarChartsPoster().\n\n    -Saved as ClassifyScenes/pieCharts/Bar-CRM-Variance.pdf.\n\n \n\nFigure 11, S12: Metrics/iorgBoxPlots.py, method changeMultiPanel().\n\n    -Saved as Metrics/BoxPlots/change/metricChangeCombined295305shared.pdf and ...all.pdf\n\n \n\nFigure 13, S13, S14: DomainStatistics/BoxPlots.py.\n\n    -13: DomainStatistics/boxplots-domainmean-T.pdf\n\n \n\nFigure 14-16, S15-S18 [isccp]: ISCCP/histograms.py, method plotHistMultimodel()\n\n    -Saved in histPlot/Multimodel-ISCCP-CRM__GCM.png\n\n    -14, S15-S18: is one of the Simulations\n\n    -15: is "changeT"\n\n    -16: is "changeDT"\n\n \n\nFigure 17: A-Statistics/plot_climatesensitivity.m.  \n\n    -This function requires some files from RCEMIP-I which are not included in this repository.\n\n    -Saved as A-Statistics/Fig_lambda_I_II.pdf.  
\n\n    -Note: this is a MATLAB code.\n\n \n\nFigure 18 [percentiles]: Percentiles/plotPercentiles.py, method plotPercentileRatioVsAgg().\n\n    -Must be run twice for the two subpanels.\n\n    -Saved in percentilePlots/percentileRatioVsAgg/mw/pr/Ichange/, with the files corresponding to input parameters.\n\n \n\nFigure S19-S22 [percentiles]: Percentiles/plotSpaceTimeCorrelations.py, method correlatePercentileVsAggChanges().\n\n    -Must be run twice for the two subpanels.\n\n    -Saved in percentilePlots/LinearArithmetic/correlatePercentileAggChange/mw/pr/Change/.\n\n \n\nFigure S23 [scaling]: Scaling/BoxPlots.py\n\n    -Saved in scalingPlots/LinearArithmetic/BoxPlots/avgAbove/separateDyn2/mw/coarse15km/shift1/\n\n        File name matches *-def-*-99.9.pdf\n\n    -Generates two needed files, 305dT1p25-295dT1p25/ for S23a and 300dT2p5-300dT0p625/ for S23b\n\n \n\nFigure S24-S27 [scaling]: Scaling/ComponentsVsAgg.py, method componentsVsAggSubplots()\n\n    -Saved at scalingPlots/LinearArithmetic/componentsVsAgg/def-shift1/mw/coarse15km//Combined/\n\n    -S24 and S26: File name contains Ichange\n\n    -S25 and S27: File name contains Lchange\n\n__________________________________________________\n\n \n\n[fourier]: This code requires the discrete Fourier classification.  These classifications are generated by ClassifyScenes/hovmollerClassFourierDiscrete.py and saved in ClassifyScenes/classDS//.\n\n[fourierContinuous]: This code requires the continuous Fourier classification.  This is calculated by ClassifyScenes/hovmollerClassFourierContinuous.py and saved in ClassifyScenes/classDSContinuous//.\n\n[hovmoller]: This code requires hovmoller data.  This data is calculated by running hovmoller/createDatasets.py, with data stored in hovmoller/2D-Timeseries//.\n\n    -These data are created and saved in the 2D-Timeseries/ folder of the repository, which must be loaded in as hovmoller/2D-Timeseries/.\n\n[isccp]: This code requires use of the ISCCP data.\n\n    -CRMs: ISCCP data calculated via the Approximate ISCCP Simulator (Stauffer and Wing 2023) within ISCCP/tau2023.py.\n\nData is saved in two separate folders, isccp_mw/ and isccp_large/, which must be loaded in as ISCCP/isccpData/mw/ and ISCCP/isccpData/large/, respectively.\n\n    -GCMs: ISCCP data was saved by three models during running (CNRM-CM6, E3SM, UKMO-GA7.1)\n\n    From the ISCCP data, histograms are calculated by ISCCP/histograms.py method generateHistograms() (for CRMs) and generateHistogramsGCM() (for GCMs).\n\n        These are saved at ISCCP/hist//.\n\n[percentiles]: This code requires precipitation percentile tables.  These are calculated by Utilities/calcPercentiles.py and saved in  Percentiles/.\n\n[scaling]: This code requires the scaling JSON files in Scaling/scalingJsons.py. This code also requires [percentiles].\n\n    -To create these files, first run Scaling/calculatePercentileProfiles-accumulate-coarsen-parallel.py. This creates the composite profiles in Scaling/scalingParallel/.\n\n    -Then, run Scaling/separateComponentsDecomposeDyn.py to create the JSONs.\n\n    -The scaling profile creation code takes a long time to run and outputs large files.  Scaling/scalingParallel.py is uploaded as a separate folder within the repository to decrease the required file size."]}  more » « less
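A minimal sketch of the "codeDirectory" convention described above. The variable name and the Utilities/ module names come from this README; the exact import mechanics inside each script are an assumption, and the path is a placeholder:

```python
# Minimal sketch (assumed, not copied from the repository) of the
# "codeDirectory" convention: point codeDirectory at the code location so
# that modules in Utilities/ can be imported.
import os
import sys

codeDirectory = "/path/to/repository"  # set to the file location of the codes
sys.path.append(os.path.join(codeDirectory, "Utilities"))

import metFormulas       # Utilities/metFormulas.py: meteorological formulas
import generalUtilities  # Utilities/generalUtilities.py: basic processing tools
import dsTools           # Utilities/dsTools.py: basic processing tools
```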
Award ID(s):
2331199
PAR ID:
10667137
Author(s) / Creator(s):
Publisher / Repository:
Zenodo
Date Published:
Format(s):
Medium: X
Right(s):
Creative Commons Attribution 4.0 International
Sponsoring Org:
National Science Foundation
More Like this
  1. This codebase contains data and code needed to generate plots for O'Donnell, G. L. and A. A. Wing (2024). Precipitation Extremes and their Modulation by Convective Organization in RCEMIP. Journal of Advances in Modeling Earth Systems, 16(11), e2024MS004535. https://doi.org/10.1029/2024MS004535
     * PrecipExtremesInRCEMIP/: Python code and most derived data, including tables of percentiles of precipitation for each model at numerous spatiotemporal scales
     * scalingProfiles_{domain}/: Profiles of temperature, humidity, and vertical velocity conditioned on extreme precipitation, saved as .nc files
     RCEMIP data can be found at http://hdl.handle.net/21.14101/d4beee8e-6996-453e-bbd1-ff53b6874c0e or at https://swiftbrowser.dkrz.de/public/dkrz_70a517a8-039d-4a1b-a30d-841923f8bc7a/RCEMIP/
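     A minimal sketch of inspecting one of the scalingProfiles_{domain}/ archives with xarray; the file name below is a placeholder and the variable names inside the .nc files are not documented in this record:

```python
# Hypothetical sketch: open one conditioned-profile file from
# scalingProfiles_{domain}/. The path is a placeholder; per the description
# above, the dataset holds temperature, humidity, and vertical-velocity
# profiles conditioned on extreme precipitation.
import xarray as xr

ds = xr.open_dataset("scalingProfiles_domain/model_profiles.nc")  # placeholder
print(ds.data_vars)  # list whatever profile variables the file actually holds
```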
  2. {"Abstract":["# DeepCaImX## Introduction#### Two-photon calcium imaging provides large-scale recordings of neuronal activities at cellular resolution. A robust, automated and high-speed pipeline to simultaneously segment the spatial footprints of neurons and extract their temporal activity traces while decontaminating them from background, noise and overlapping neurons is highly desirable to analyze calcium imaging data. In this paper, we demonstrate DeepCaImX, an end-to-end deep learning method based on an iterative shrinkage-thresholding algorithm and a long-short-term-memory neural network to achieve the above goals altogether at a very high speed and without any manually tuned hyper-parameters. DeepCaImX is a multi-task, multi-class and multi-label segmentation method composed of a compressed-sensing-inspired neural network with a recurrent layer and fully connected layers. It represents the first neural network that can simultaneously generate accurate neuronal footprints and extract clean neuronal activity traces from calcium imaging data. We trained the neural network with simulated datasets and benchmarked it against existing state-of-the-art methods with in vivo experimental data. DeepCaImX outperforms existing methods in the quality of segmentation and temporal trace extraction as well as processing speed. DeepCaImX is highly scalable and will benefit the analysis of mesoscale calcium imaging. ![alt text](https://github.com/KangningZhang/DeepCaImX/blob/main/imgs/Fig1.png)\n\n## System and Environment Requirements#### 1. Both CPU and GPU are supported to run the code of DeepCaImX. A CUDA compatible GPU is preferred. * In our demo of full-version, we use a GPU of Quadro RTX8000 48GB to accelerate the training speed.* In our demo of mini-version, at least 6 GB momory of GPU/CPU is required.#### 2. Python 3.9 and Tensorflow 2.10.0#### 3. Virtual environment: Anaconda Navigator 2.2.0#### 4. Matlab 2023a\n\n## Demo and installation#### 1 (_Optional_) GPU environment setup. We need a Nvidia parallel computing platform and programming model called _CUDA Toolkit_ and a GPU-accelerated library of primitives for deep neural networks called _CUDA Deep Neural Network library (cuDNN)_ to build up a GPU supported environment for training and testing our model. The link of CUDA installation guide is https://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/index.html and the link of cuDNN installation guide is https://docs.nvidia.com/deeplearning/cudnn/installation/overview.html. #### 2 Install Anaconda. Link of installation guide: https://docs.anaconda.com/free/anaconda/install/index.html#### 3 Launch Anaconda prompt and install Python 3.x and Tensorflow 2.9.0 as the virtual environment.#### 4 Open the virtual environment, and then  pip install mat73, opencv-python, python-time and scipy.#### 5 Download the "DeepCaImX_training_demo.ipynb" in folder "Demo (full-version)" for a full version and the simulated dataset via the google drive link. Then, create and put the training dataset in the path "./Training Dataset/". If there is a limitation on your computing resource or a quick test on our code, we highly recommand download the demo from the folder "Mini-version", which only requires around 6.3 GB momory in training. #### 6 Run: Use Anaconda to launch the virtual environment and open "DeepCaImX_training_demo.ipynb" or "DeepCaImX_testing_demo.ipynb". 
Then, please check and follow the guide of "DeepCaImX_training_demo.ipynb" or or "DeepCaImX_testing_demo.ipynb" for training or testing respectively.#### Note: Every package can be installed in a few minutes.\n\n## Run DeepCaImX#### 1. Mini-version demo* Download all the documents in the folder of "Demo (mini-version)".* Adding training and testing dataset in the sub-folder of "Training Dataset" and "Testing Dataset" separately.* (Optional) Put pretrained model in the the sub-folder of "Pretrained Model"* Using Anaconda Navigator to launch the virtual environment and opening "DeepCaImX_training_demo.ipynb" for training or "DeepCaImX_testing_demo.ipynb" for predicting.\n\n#### 2. Full-version demo* Download all the documents in the folder of "Demo (full-version)".* Adding training and testing dataset in the sub-folder of "Training Dataset" and "Testing Dataset" separately.* (Optional) Put pretrained model in the the sub-folder of "Pretrained Model"* Using Anaconda Navigator to launch the virtual environment and opening "DeepCaImX_training_demo.ipynb" for training or "DeepCaImX_testing_demo.ipynb" for predicting.\n\n## Data Tailor#### A data tailor developed by Matlab is provided to support a basic data tiling processing. In the folder of "Data Tailor", we can find a "tailor.m" script and an example "test.tiff". After running "tailor.m" by matlab, user is able to choose a "tiff" file from a GUI as loading the sample to be tiled. Settings include size of FOV, overlapping area, normalization option, name of output file and output data format. The output files can be found at local folder, which is at the same folder as the "tailor.m".\n\n## Simulated Dataset#### 1. Dataset generator (FISSA Version): The algorithm for generating simulated dataset is based on the paper of FISSA (_Keemink, S.W., Lowe, S.C., Pakan, J.M.P. et al. FISSA: A neuropil decontamination toolbox for calcium imaging signals. Sci Rep 8, 3493 (2018)_) and SimCalc repository (https://github.com/rochefort-lab/SimCalc/). For the code used to generate the simulated data, please download the documents in the folder "Simulated Dataset Generator". #### Training dataset: https://drive.google.com/file/d/1WZkIE_WA7Qw133t2KtqTESDmxMwsEkjJ/view?usp=share_link#### Testing Dataset: https://drive.google.com/file/d/1zsLH8OQ4kTV7LaqQfbPDuMDuWBcHGWcO/view?usp=share_link\n\n#### 2. Dataset generator (NAOMi Version): The algorithm for generating simulated dataset is based on the paper of NAOMi (_Song, A., Gauthier, J. L., Pillow, J. W., Tank, D. W. & Charles, A. S. Neural anatomy and optical microscopy (NAOMi) simulation for evaluating calcium imaging methods. Journal of neuroscience methods 358, 109173 (2021)_). For the code use to generate the simulated data, please go to this link: https://bitbucket.org/adamshch/naomi_sim/src/master/code/## Experimental Dataset#### We used the samples from ABO dataset:https://github.com/AllenInstitute/AllenSDK/wiki/Use-the-Allen-Brain-Observatory-%E2%80%93-Visual-Coding-on-AWS.#### The segmentation ground truth can be found in the folder "Manually Labelled ROIs". #### The segmentation ground truth of depth 175, 275, 375, 550 and 625 um are manually labeled by us. #### The code for creating ground truth of extracted traces can be found in "Prepro_Exp_Sample.ipynb" in the folder "Preprocessing of Experimental Sample"."]} 
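A quick sanity check of the environment described above; these are generic TensorFlow idioms, not code from the DeepCaImX repository:

```python
# Generic environment check (not part of DeepCaImX): confirm the Python and
# TensorFlow versions and whether a CUDA GPU is visible to TensorFlow.
import sys
import tensorflow as tf

print("Python:", sys.version.split()[0])  # requirements above state 3.9
print("TensorFlow:", tf.__version__)      # requirements above state 2.10.0
print("GPUs:", tf.config.list_physical_devices("GPU"))  # empty list = CPU-only
```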
  3. {"Abstract":["Human disturbance can have profound effects on biodiversity, including\n increasing hybridization between reproductively isolated species. One\n approach for understanding how human activity affects hybridization\n dynamics is to evaluate correlations between disturbance (e.g.,\n urbanization, temperature change) and hybridization. Because variation in\n hybridization can also arise from historical factors unrelated to recent\n human disturbance, it is essential to account for population structure to\n avoid spurious correlations. Here, we combine environmental and\n high-coverage whole-genome resequencing data to investigate how human\n disturbance and population structure affect hybridization dynamics between\n a pair of pine sawflies adapted to different pines, Neodiprion lecontei\n and Neodiprion pinetum. We find that N. lecontei and N. pinetum exhibit\n strikingly different patterns of population structure, which we\n hypothesize stems from differences in host. We also find that recent\n admixture is both asymmetric and geographically variable. Linear\n regression analyses reveal that admixture proportion is predicted by\n indirect human disturbance (i.e., climate change) and not direct human\n disturbance (e.g., urbanization) in both N. lecontei and N. pinetum.\n Lastly, in N. pinetum, we find evidence of a spurious association between\n admixture and direct human disturbance that disappears when regression\n models account for population structure via inclusion of genetic principal\n component scores as covariates. Together, our data suggest that indirect\n human disturbance and population structure both contribute to geographic\n variation in admixture between N. lecontei and N. pinetum. Our study also\n highlights the importance of adequately controlling for population\n structure when attempting to identify environmental predictors (human\n disturbance-related or not) of hybridization."],"TechnicalInfo":["# Recent climate change and historical population structure predict\n spatial patterns of admixture between two host-specialized pine sawfly\n species Dataset DOI: [10.5061/dryad.kh18932jw](10.5061/dryad.kh18932jw) ##\n Description of the data and file structure To perform all analyses, run\n scripts provided in the order indicated in the folder/script name (i.e.,\n "01*" to "07*"). NOTE: for all bash scripts (*.sh),\n you will have to modify them to run on your cluster. However, the program\n commands should work as long as you are using the same version of the\n program (program versions indicated in the Materials and Method section of\n the manuscript). ### Files and variables #### File:\n Glover_Linnen_dryad.zip **Description:**  Input files required to run\n analyses are provided: 1.\n "allsites_LecPineHetr_filtered_ExclIndvs_minQ20_AllImbHom_refilter_bcftools_norm.vcf.gz" - required for ABBA-BABA analysis ("06_ABBA-BABA" folder); vcf file contains variant and invariant sites for all *N. lecontei*, *N. pinetum*, and *N. hetricki* individuals included in analysis 2. "snps_LecPine_filtered_ExclIndvs_MAF05_minQ20_AllImbHom_refilter_pruned_bcftools_norm.vcf.gz" - required for ADMIXTURE ("03_ADMIXTURE" folder) and principal component analysis (PCA) for both species combined ("04_run_pcas.sh" script) analyses; vcf file contains linkage-pruned SNPs for all *N. lecontei* and *N. pinetum* individuals included in analyses (i.e., contaminated/putative haploid male individuals removed) 3. 
"snps_LecOnly_filtered_ExclIndvs_MAF05_minQ20_AllImbHom_refilter_pruned_bcftools_norm.vcf.gz" - required for PCA for *N. lecontei* only analysis ("04_run_pcas.sh" script"); vcf file contains linkage-pruned SNPs for all *N. lecontei* individuals included in analysis (i.e., contaminated/putative haploid male individuals removed) 4. "snps_PineOnly_noHybrid_filtered_ExclIndvs_MAF05_minQ20_AllImbHom_refilter_pruned_bcftools_norm.vcf.gz" - required for PCA for *N. pinetum *only analysis ("04_run_pcas.sh" script"); vcf file contains linkage-pruned SNPs for all *N. pinetum *individuals included in analysis (i.e., contaminated/putative haploid male individuals and F1 hybrid called *N. pinetum* at time of collection due to being collected on white pine removed) 5. "fulldata_LecPine.csv", "fulldata_LecOnly.csv", "fulldata_PineOnly.csv" - full/compiled data for all *N. lecontei* and *N. pinetum* individuals together, all *N. lecontei* individuals, and all *N. pinetum* individuals included in downstream analyses, respectively. These files are required to run the "05_plot_PCAs_ADMIXTURE.R" and "07_run_models.R" scripts. Description of columns in the *.csv files (#5) above (each row represents an individual sawfly): 1. "ind" = unique individual identifier assigned by the sequencing center (Admera Health) 2. "genPC1_*" = genetic PC1 score from PCA of both species combined ("genPC1_LecPine" in "fulldata_LecPine.csv" file), *N. lecontei* only ("genPC1_LecOnly" in "fulldata_LecOnly.csv" file), or *N. pinetum* only ("genPC1_PineOnly" in "fulldata_PineOnly.csv" file) 3. "genPC2_*" = genetic PC2 score from PCA of both species combined ("genPC2_LecPine" in "fulldata_LecPine.csv" file), *N. lecontei* only ("genPC2_LecOnly" in "fulldata_LecOnly.csv" file), or *N. pinetum* only ("genPC2_PineOnly" in "fulldata_PineOnly.csv" file) 4. "genPC3_*" = genetic PC3 score from PCA of both species combined ("genPC3_LecPine" in "fulldata_LecPine.csv" file), *N. lecontei* only ("genPC3_LecOnly" in "fulldata_LecOnly.csv" file), or *N. pinetum* only ("genPC3_PineOnly" in "fulldata_PineOnly.csv" file) 5. "AG_ID" = unique individual identifier assigned by the Linnen lab 6. "avg_perc_PP_reads_mapped" = percentage of properly paired reads mapped; averaged between the two sequencing batches 7. "MERGED_FINAL_read_count" = total read count across two sequencing batches 8. "species" = species identity 9. "city" = city individual was collected in 10. "state" = state individual was collected in 11. "area" = geographic area ("North" or "Central") that individual was assigned to based on ADMIXTURE ("03_ADMIXTURE" folder) and PCA ("04_run_pcas.sh" script) analyses 12. "latitude" = latitude individual was collected at 13. "longitude" = longitude individual was collected at 14. "host" = pine host individual was collected on 15. "stage" = life stage of individual at time of DNA extraction, either "larva" or adult female ("adultF") 16. "lcPC1" = PC1 score from PCA of land cover proportions (from analyses in "01_LandCover_data" folder) 17. "lcPC2" = PC2 score from PCA of land cover proportions (from analyses in "01_LandCover_data" folder) 18. "change_tmax" = change in average daily maximum temperature from the 1980's to the 2010's in degrees Celsius (from analyses in "02_temperature_data" folder) 19. "site_number" = arbitrary number assigned to sampling site; each unique sampling site (i.e., unique latitude/longitude GPS coordinates) was assigned an arbitrary number (1-189) 20. 
"grouped_site_number_5km" = grouped site number, where all sampling sites that were within 5km of each other were considered a “grouped site” and were assigned an arbitrary grouped site number 21. "allele_imbalance" = percentage of heterozygote calls displaying allele balance ratios less than 0.3 22. "avg_depth_coverage" = average depth coverage across all linkage-pruned SNPs 23. "missingness" = proportion of sites with a missing genotype 24. "heterozygosity" = proportion of sites with a heterozygous genotype 25. "pop" = population label for making ADMIXTURE plots ("05_plot_PCAs_ADMIXTURE.R" script); [species]_[sampling state] 26. "admix_prop_k2" = admixture proportion for K=2 (i.e., species-level admixture) 27. "lec_blue_k2" = final assignment probability for "blue" genetic group (color assigned by CLUMPAK) that corresponds to *N. lecontei* ancestry across all 100 ADMIXTURE runs for K=2 28. "pine_orange_k2" = final assignment probability for "orange" genetic group (color assigned by CLUMPAK) that corresponds to *N. pinetum* ancestry across all 100 ADMIXTURE runs for K=2 29. "lec_blue_k3" = final assignment probability for "blue" genetic group (color assigned by CLUMPAK) that corresponds to *N. lecontei* ancestry across all 100 ADMIXTURE runs for K=3 30. "pine_orange_k3" = final assignment probability for "orange" genetic group (color assigned by CLUMPAK) that corresponds to *N. pinetum* ancestry across all 100 ADMIXTURE runs for K=3 31. "lec_purple_k3" = final assignment probability for "purple" genetic group (color assigned by CLUMPAK) that corresponds to *N. lecontei* ancestry across all 100 ADMIXTURE runs for K=3 32. "lec_blue_k4" = final assignment probability for "blue" genetic group (color assigned by CLUMPAK) that corresponds to *N. lecontei* ancestry across all 100 ADMIXTURE runs for K=4 33. "pine_orange_k4" = final assignment probability for "orange" genetic group (color assigned by CLUMPAK) that corresponds to *N. pinetum* ancestry across all 100 ADMIXTURE runs for K=4 34. "lec_purple_k4" = final assignment probability for "purple" genetic group (color assigned by CLUMPAK) that corresponds to *N. lecontei* ancestry across all 100 ADMIXTURE runs for K=4 35. "lec_green_k4" = final assignment probability for "green" genetic group (color assigned by CLUMPAK) that corresponds to *N. lecontei* ancestry across all 100 ADMIXTURE runs for K=4 Perform analyses: The first folder ("01_LandCover_data") contains the R script to calculate land cover proportions within a 5km-radius buffer around each site's GPS coordinates ("get_LandCover_props.R"). The required input land cover raster files to run this script are available in the public domain at ([https://www.mrlc.gov/viewer/](https://www.mrlc.gov/viewer/)) for the United States and ([http://www.cec.org/north-american-environmental-atlas/land-cover-30m-2020/](http://www.cec.org/north-american-environmental-atlas/land-cover-30m-2020/)) for Canada. This folder also contains the input files ("for_lcPCA.csv", "latlong.csv") and script ("LandCover_PCA.R") necessary to complete the PCA of land cover proportions for each of the 189 sampling sites. Specifically, the "for_lcPCA.csv" file contains, for each sampling site, the proportions of each land cover class within a 5km radius buffer around the GPS coordinates of the sampling site; the "latlong.csv" file contains the GPS coordinates (latitude and longitude) of each sampling site. 
The second folder ("02_temperature_data") contains the input file ("LatLong.csv") and script ("get_tmax.R") necessary to extract the average of daily maximum temperature for each month within each year (1980-1989 and 2010-2019) for the latitude/longitude that each individual sawfly was collected at. The required input temperature raster files to run this script are available in the public domain at ([https://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=2131](https://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=2131)) for the United States and Canada. The third folder ("03_ADMIXTURE") contains the script ("run_admixture.sh") to perform 100 ADMIXTURE runs for each K (1-10). The output file from CLUMPAK for K=2 is also provided ("ClumppIndFile.output"). In this text file, each row represents an individual; the proportion of the individual's ancestry originating from each of the two inferred ancestral populations are provided (last two columns), from which admixture proportions can be extracted. The script to perform the next analysis (PCA) is provided ("04_run_pcas.sh"). Next, the script to plot the results from the ADMIXTURE and PCA analyses above is provided ("05_plot_PCAs_ADMIXTURE.R"). The next folder ("06_ABBA-BABA") contains the scripts (".sh") and input files (".txt" and ".nwk") to perform the ABBA-BABA analysis for the "North" and "Central" geographic regions separately. For each region, the *"*.txt" file indicates which of the four taxa each individual belongs to (i.e., P~1~, P~2~, P~3~, or Outgroup) - individuals that are in the vcf file but excluded from the analysis are indicated with an "xxx". The ".nwk" file is a text file that represents the phylogenetic tree of the four taxa above in the Newick format (i.e., uses parentheses and commas to show the relationships between taxa). Finally, the script to perform the multiple linear regression models and plot the model results is provided ("07_run_models.R"). ## Code/software All scripts to perform analyses are described above. All software and R package versions are described in the Materials and Methods section of the manuscript."]} 
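The repository implements the regression analysis in R ("07_run_models.R"); the following Python sketch only illustrates the design described above (admixture proportion modeled on disturbance measures, with genetic PC scores added to control for population structure). Column names come from the variable list above; the exact model specification used in the repository is not reproduced here and the formulas below are assumptions:

```python
# Illustrative sketch (Python, not the repository's R) of the regression design:
# admixture proportion ~ indirect disturbance (change_tmax) + direct disturbance
# (land-cover PCs), with and without genetic PC covariates. Column names follow
# the dataset documentation above; the model terms are assumptions.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("fulldata_PineOnly.csv")

# Without population-structure covariates (can yield spurious associations):
m1 = smf.ols("admix_prop_k2 ~ change_tmax + lcPC1 + lcPC2", data=df).fit()

# With genetic PC scores included to control for population structure:
m2 = smf.ols(
    "admix_prop_k2 ~ change_tmax + lcPC1 + lcPC2 "
    "+ genPC1_PineOnly + genPC2_PineOnly + genPC3_PineOnly",
    data=df,
).fit()

print(m1.summary())
print(m2.summary())
```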
  4. {"Abstract":["We use open source human gut microbiome data to learn a microbial\n “language” model by adapting techniques from Natural Language Processing\n (NLP). Our microbial “language” model is trained in a self-supervised\n fashion (i.e., without additional external labels) to capture the\n interactions among different microbial taxa and the common compositional\n patterns in microbial communities. The learned model produces\n contextualized taxon representations that allow a single microbial taxon\n to be represented differently according to the specific microbial\n environment in which it appears. The model further provides a sample\n representation by collectively interpreting different microbial taxa in\n the sample and their interactions as a whole. We demonstrate that, while\n our sample representation performs comparably to baseline models in\n in-domain prediction tasks such as predicting Irritable Bowel Disease\n (IBD) and diet patterns, it significantly outperforms them when\n generalizing to test data from independent studies, even in the presence\n of substantial distribution shifts. Through a variety of analyses, we\n further show that the pre-trained, context-sensitive embedding captures\n meaningful biological information, including taxonomic relationships,\n correlations with biological pathways, and relevance to IBD expression,\n despite the model never being explicitly exposed to such signals."],"Methods":["No additional raw data was collected for this project. All inputs\n are available publicly. American Gut Project, Halfvarson, and Schirmer raw\n data are available from the NCBI database (accession numbers PRJEB11419,\n PRJEB18471, and PRJNA398089, respectively). We used the curated data\n produced by Tataru and David, 2020."],"TechnicalInfo":["# Code and data for "Learning a deep language model for microbiomes:\n the power of large scale unlabeled microbiome data" ## Data: *\n vocab_embeddings.npy * Fixed vocabulary embeddings produced from prior\n work: [Decoding the language of microbiomes using word-embedding\n techniques, and applications in inflammatory bowel\n disease](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007859). Adapted from [here](http://files.cqls.oregonstate.edu/David_Lab/microbiome_embeddings/data/embed/). * microbiomedata.zip * Contains the labels and data for the three datasets used in this study. Specifically, it includes: * IBD_(test|train)*(512|otu).npy and IBD*(test|train)_labels.npy * halfvarson_(512_otu|otu).npy and halfvarson_IBD_labels.npy * schirmer_IBD_(512_otu|otu).npy and schirmer_IBD_labels.npy * (test|train)encodings_(512|1897).npy * The data are stored as n_samples x max_sample_size x 2 numpy arrays, containing both the vocab IDs of the taxa in the samples, as well as the abundance values for each taxa. data[:,:,0] will give the vocab IDs, and data[:,:,1] will give the abundances. * Files which mention '512' are truncated to only have up to 512 taxa in them (max_sample_size = 512). * Note that we refer to the schirmer dataset as HMP2 in the paper. * (test|train)encodings_(512|1897).npy represents the full collection of [American Gut Project](https://doi.org/10.1128%2FmSystems.00031-18) data, regardless of whether that data has IBD labels or not, split into train / test splits. * Also contains the folders fruitdata and vegdata containing fruit and vegetable data respectively, and the file README, which documents the contents of the first two folders. 
* American Gut Project, Halfvarson, and Schirmer raw data are available from the NCBI database (accession numbers PRJEB11419, PRJEB18471, and PRJNA398089, respectively). We used the curated data produced by [Tataru and David, 2020](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007859). * pretrainedmodels.zip * Contains a sequence of pretrained discriminator models across different epochs, allowing users to compute embeddings without having to pretrain models themselves. Each model is stored as a pair of a pytorch_model.bin file containing weights and a config.json file containing model config parameters. Each pair is located in its own folder whose name corresponds to epoch. E.g., "5head5layer_epoch60_disc" stores the discriminator model that were trained for 60 epochs. Model checkpoints can be loaded by providing a path to the pytorch_model.bin file in the --load_disc argument of begin.py in microbiome_transformers-master/finetune_discriminator. * ensemble.zip * Contains the result of an ensemble finetuning run, allowing users to perform interpretability / attribution experiments without having to train models themselves. Each model is similarly stored as a pytorch_model.bin file and config.json file in its own folder. E.g., the run3_epoch0_disc folder stores the model from the third finetuning run (with epoch0 reflecting that the finetuning only takes one epoch). * seqs_.07_embed.fasta * Contains the 16S sequences associated with each taxon vocabulary element of our study, originally produced by prior work: [Decoding the language of microbiomes using word-embedding techniques, and applications in inflammatory bowel disease](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007859). Also available [here](http://files.cqls.oregonstate.edu/David_Lab/microbiome_embeddings/data/embed/seqs_.07_embed.fasta). ## Code/Software: Note that the Dryad repository stores the code and software discussed here is available at [this](https://doi.org/10.5281/zenodo.13858903) site, which is linked under the "Software" tab on the current page.\\ The following software include hardcoded absolute paths to various files of interest (described above). These paths have been changed to be of the form "/path/to/file_of_interest", where the "path/to" portion must be changed to reflect the actual paths on whichever system you run these on. * Attribution_calculations.ipynb * Used to calculate per-sample model prediction scores, per-taxa attribution values (used for interpretability), as well as per-taxa averaged embeddings (used for plotting the taxa). Note the current file is set to compute attributions only for IBD, but can easily be changed for Schirmer/HMP2 and Halfvarson. * Process_Attributions_No_GPU.ipynb * Takes the per-sample prediction scores and the per-taxa attribution values (both from Attribution_calculations.ipynb) and identifies the taxa most and least associated with IBD. * assign_16S_to_phyla.R * An R script that makes phylogenetic assignments to the 16S sequences from seqs_.07_embed.fasta. Invoke with 'Rscript assign_16S_to_phyla.R' and no arguments. * run_blast_with_downloads.sh * Compares the overlap in ASVs between Halfvarson and AGP versus between HMP2 and AGP. Must have BLAST installed. BLAST parameters are set in file, via the results filtering lines ("awk '$5 < 1e-20 && $8 >= 99' | \\\\"), that set the e-value to 20^-20 and the percent similarity to 99%, with one line for each of the two pairwise comparisons. 
Simply run via "bash run_blast_with_downloads.sh". * Plot_microbiome_transformers_results.ipynb * Loads the averaged taxa embeddings (from Attribution_calculations.ipynb) and the vocabulary embeddings (from [Decoding the language of microbiomes using word-embedding techniques, and applications in inflammatory bowel disease](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007859) / vocab_embeddings.npy), as well as the taxonomic assignments (from assign_16S_to_phyla.R), and generates the various TSNE-based plots of the embedding space geometry. It also generates plots to compare the clustering quality of the averaged embeddings and the vocabulary embeddings. * DeepMicro.zip * A modified version of [DeepMicro](https://github.com/minoh0201/DeepMicro), adapted to more easily run the DeepMicro-based baselines included in our paper. Most additional functionality is described in the 'help' strings of the additional arguments and the docstrings of the functions. In particular, since our data include unlabeled samples witch nonetheless contribute to learning an embedding space, we needed to add a "--pretraining_data" argument to allow such data to be included in the self-supervised learning portion of the baselines. * "convert_data.py" under the "data" folder serves as a utility to help convert from the coordinate-list format of this study to the one-hot abundance table format expected by DeepMicro. * "get_unlabeled_pretraining_data.py" under the "data" folder processes labeled microbiome datasets (fruit, vegetable, and IBD) and extends them with unlabeled data from the American Gut Project (AGP).  * host_to_ids.py under the data/host_to_indices folder will combine metadata from err-to-qid.txt and AG_mapping.txt (both available at *[https://files.cqls.oregonstate.edu/David_Lab/microbiome_embeddings/data/AG_new](https://files.cqls.oregonstate.edu/David_Lab/microbiome_embeddings/data/AG_new)*) with the sequences in seqs_.07_embed.fasta and the numpy data files to create dictionaries that map from host ids to indices in the numpy files, then store those as pickle files. This allow for future training runs from the transformer or the baselines to block their train / validation / test splits by host id. * exps_ae.sh, exps_cae.sh, and exps_baselines.sh are shell scripts with the python commands that run the various DeepMicro-based baselines. * "display_results.py" is a helper for accumulating experimental results and displaying them in a table. * property_pathway_correlations.zip * A folder containing the required code and files to run the property and pathway correlation experiments. * property_pathway_correlations contains three subfolders: * figures: stores output figures such as the heatmap of property - pathway correlation strengths. * csvs: contains gen_hists.py, which takes the outputs of significant correlation counts / strength from metabolic_pathway_correlations.R and plots a histogram to compare the property correlations of the initial vocabulary embeddings with those of the learned embeddings. Also contains significant_correlations_tests.py, which applies non-parametric and permutation tests to statistically determine whether the learned embeddings tend to have stronger property correlations. Also reports the effect size via Cliff's Delta and Cohen's d statistics. 
* new_hists: will store the histogram generated from gen_hists.py * pathways: stores text and csv outputs, such as the correlation strengths between each property and pathway pair (property_pathway_dict_allsig.txt), the top 20 pathways associated with each property (top20Paths_per_property_(ids|names)_v2.csv), and list of which pathway is most correlated with each property (property_pathway_dict.txt). * metabolic_pathways: contains the code and data required to actually run the correlation tests. The code appears in metabolic_pathway_correlations.R, and simply runs with the command Rscript and no arguments. The data appears in the data subfolder, which itself contains three subfolders: * embed: contains embeddings to be loaded by metabolic_pathway_correlations.R, e.g., merged_sequences_embeddings.txt or glove_emb_AG_newfilter.07_100.txt. Also contains a script assemble_new_embs.py, which lets new embeddings txt files be formatted from a pytorch embeddings tensor, such as the one stored in epoch_120_IBD_avg_vocab_embeddings.pth, as well as seqs_.07_embed.txt. * AG_new/pathways: contains a bunch of files like "corr_matches_i_i+9.RDS", which store intermediate results of the permutation tests, so they don't all have to be calculated at once. Should be recomputed with each run. * pathways: mostly stores various other input and output RDS files: * corr_matches.rds : stores intermediate results of statistical significance testing with model embeddings. Recomputed each time. * corr_matches_pca.rds : stores prior result of statistical significance testing with PCA embeddings. Loaded from storage by default. * filtered_otu_pathway_table.RDS / txt : stores associations of each taxa vocab entry with metabolic pathways, filtered to exclude pathways that are no longer present in KEGG. * pathway_table.RDS : updated pathway table saved by metabolic_pathway_correlations.R each run. * pca_embedded_taxa.rds : stores PCA embeddings of all the vocab taxa entries. * microbiome_transformers.zip * A backup of our [GitHub repository](https://github.com/QuintinPope/microbiome_transformers) for the model architecture (both generator and discriminator), the pretraining processes for both, as well as the model finetuning scripts. Contains its own READMEs. * Has the code for pretraining generator models. See pretrain_generator/train_command.sh and pretrain_generator/README.MD * Has the code for using those models to pretrain discriminator models. See pretrain_discriminator/train_command.sh and pretrain_discriminator/README.MD * Has the code for finetuning those pretrained discriminator models on the classification data in our study (both within-distribution experiments and out of distribution experiments). * See finetune_discriminator/README.MD for general info on finetuning. * See finetune_discriminator/run_agp_agp_exps.sh for the commands to run the in-distribution experiments. * See finetune_discriminator/run_agp_HF_SH_cross_gen_ensemble_tests.sh to run the out of distribution experiments using an ensemble of models. * See finetune_discriminator/run_agp_HF_SH_cross_gen_val_set_tests.sh to run the out of distribution experiments without an ensemble and using a val set for stopping condition. 
## File Structures:

**microbiomedata.zip**
```
|____total_IBD_otu.npy
|____IBD_train_512.npy
|____halfvarson_IBD_labels.npy
|____IBD_train_otu.npy
|____test_encodings_512.npy
|____total_IBD_512.npy
|____train_encodings_512.npy
|____schirmer_IBD_labels.npy
|____schirmer_IBD_512_otu.npy
|____fruitdata
| |____FRUIT_FREQUENCY_all_label.npy
| |____FRUIT_FREQUENCY_otu_512.npy
| |____FRUIT_FREQUENCY_binary24_labels.npy
| |____FRUIT_FREQUENCY_all_otu.npy
| |____FRUIT_FREQUENCY_binary34_labels.npy
|____vegdata
| |____VEGETABLE_FREQUENCY_all_label.npy
| |____VEGETABLE_FREQUENCY_binary24_labels.npy
| |____VEGETABLE_FREQUENCY_otu_512.npy
| |____VEGETABLE_FREQUENCY_all_otu.npy
| |____VEGETABLE_FREQUENCY_binary34_labels.npy
|____README
|____schirmer_IBD_otu.npy
|____IBD_test_label.npy
|____IBD_test_512.npy
|____IBD_train_label.npy
|____IBD_test_otu.npy
|____test_encodings_1897.npy
|____halfvarson_otu.npy
|____halfvarson_512_otu.npy
|____total_IBD_label.npy
|____train_encodings_1897.npy
```

**pretrainedmodels.zip**
```
|____5head5layer_epoch60_disc
| |____config.json
| |____pytorch_model.bin
|____5head5layer_epoch30_disc
| |____config.json
| |____pytorch_model.bin
|____5head5layer_epoch105_disc
| |____config.json
| |____pytorch_model.bin
|____5head5layer_epoch0_disc
| |____config.json
| |____pytorch_model.bin
|____5head5layer_epoch45_disc
| |____config.json
| |____pytorch_model.bin
|____5head5layer_epoch90_disc
| |____config.json
| |____pytorch_model.bin
|____5head5layer_epoch120_disc
| |____config.json
| |____pytorch_model.bin
|____5head5layer_epoch15_disc
| |____config.json
| |____pytorch_model.bin
|____5head5layer_epoch75_disc
| |____config.json
| |____pytorch_model.bin
```

**ensemble.zip**
```
|____run4_epoch0_disc
| |____config.json
| |____pytorch_model.bin
|____run8_epoch0_disc
| |____config.json
| |____pytorch_model.bin
|____run1_epoch0_disc
| |____config.json
| |____pytorch_model.bin
|____run2_epoch0_disc
| |____config.json
| |____pytorch_model.bin
|____run10_epoch0_disc
| |____config.json
| |____pytorch_model.bin
|____run7_epoch0_disc
| |____config.json
| |____pytorch_model.bin
|____run9_epoch0_disc
| |____config.json
| |____pytorch_model.bin
|____run5_epoch0_disc
| |____config.json
| |____pytorch_model.bin
|____run6_epoch0_disc
| |____config.json
| |____pytorch_model.bin
|____run3_epoch0_disc
| |____config.json
| |____pytorch_model.bin
```

**DeepMicro.zip**
```
|____LICENSE
|____deep_env_config.yml
|____DM.py
|____exception_handle.py
|____README.md
|____exps_cae.sh
|____exps_ae.sh
|____exps_baselines.sh
|____results
| |____display_results.py
| |____plots
|____data
| |____host_to_indices
| | |____host_to_ids.py
| |____marker.zip
| |____UserLabelExample.csv
| |____convert_data.py
| |____get_unlabeled_pretraining_data.py
| |____UserDataExample.csv
| |____abundance.zip
|____DNN_models.py
```

**property_pathway_correlations.zip**
```
|____metabolic_pathways
| |____metabolic_pathway_correlations.R
| |____data
| | |____AG_new
| | | |____pathways
| | | | |____corr_matches_141_150.RDS
| | | | |____corr_matches_81_90.RDS
| | | | |____corr_matches_21_30.RDS
| | | | |____corr_matches_51_60.RDS
| | | | |____corr_matches_121_130.RDS
| | | | |____corr_matches_101_110.RDS
| | | | |____corr_matches_61_70.RDS
| | | | |____corr_matches_31_40.RDS
| | | | |____corr_matches_131_140.RDS
| | | | |____corr_matches_181_190.RDS
| | | | |____corr_matches_161_170.RDS
| | | | |____corr_matches_11_20.RDS
| | | | |____corr_matches_1_10.RDS
| | | | |____corr_matches_191_200.RDS
| | | | |____corr_matches_171_180.RDS
| | | | |____corr_matches_71_80.RDS
| | | | |____corr_matches_91_100.RDS
| | | | |____corr_matches_111_120.RDS
| | | | |____corr_matches_41_50.RDS
| | | | |____corr_matches_151_160.RDS
| | |____embed
| | | |____seqs_.07_embed.txt
| | | |____merged_sequences_embeddings.txt
| | | |____assemble_new_embs.py
| | | |____epoch_120_IBD_avg_vocab_embeddings.pth
| | | |____glove_emb_AG_newfilter.07_100.txt
| | |____pathways
| | | |____filtered_otu_pathway_table.RDS
| | | |____pca_embedded_taxa.rds
| | | |____pathway_table.RDS
| | | |____corr_matches.rds
| | | |____filtered_otu_pathway_table.txt
| | | |____corr_matches_pca.rds
|____figures
| |____csvs
| | |____significant_correlations_tests.py
| | |____gen_hists.py
| |____new_hists
|____pathways
| |____top20Paths_per_property_ids_v2.csv
| |____top20Paths_per_property_names_v2.csv
| |____property_pathway_dict_allsig.txt
| |____property_pathway_dict.txt
```

**microbiome_transformers.zip**
```
|____electra_trace.py
|____multitaskfinetune
| |____begin.py
| |____pretrain_hf.py
| |____electra_discriminator.py
| |____dataset.py
| |____startup
|____finetune_discriminator
| |____begin.py
| |____pretrain_hf.py
| |____electra_pretrain_model.py
| |____electra_discriminator.py
| |____run_agp_agp_exps.sh
| |____run_agp_HF_SH_cross_gen_val_set_tests.sh
| |____run_agp_HF_SH_cross_gen_ensemble_tests.sh
| |____hf_startup_3
| |____hf_startup_4
| |____README.MD
| |____dataset.py
| |____torch_rbf.py
|____combine_sets.py
|____pretrain_discriminator
| |____begin.py
| |____pretrain_hf.py
| |____electra_pretrain_model.py
| |____hf_startup
| |____README.MD
| |____train_command.sh
| |____dataset.py
|____benchmark_startup
|____pretrain_generator
| |____begin.py
| |____pretrain_hf.py
| |____electra_pretrain_model.py
| |____hf_startup
| |____README.MD
| |____train_command.sh
| |____dataset.py
|____README.md
|____compress_data.py
|____generate_commands.py
|____attention_benchmark
| |____begin.py
| |____pretrain_hf.py
| |____electra_discriminator.py
| |____hf_startup
| |____dataset.py
|____data_analyze.py
|____benchmarks.py
```

# Usage Instructions

Intended to cover both repeating the experiments we performed in our paper and extending our methods to new datasets:

* Prepare input data and initial embeddings
  * Vocabulary: Set the initial vocabulary size to accommodate all the unique OTUs/ASVs found in the data, plus special tokens such as mask, padding, and cls tokens.
  * Initial embeddings: Each vocabulary element (including special tokens) is assigned a unique embedding vector.
  * Input data format: Given the highly sparse nature of most microbiome samples relative to the vocabulary size, we store each sample's abundance information in coordinate-list format. I.e., a data file is a numpy array of size (n_samples, max_sample_size, 2), and each sample is stored as a (max_sample_size, 2) array (see the sketch after this item).
* Pretrain a language model on those embeddings
  * ELECTRA generators: Pretrain a sequence of generator models on unsupervised microbiome data. See pretrain_generator/train_command.sh and pretrain_generator/README.MD in microbiome_transformers.zip
  * ELECTRA discriminators: Pretrain a sequence of discriminator models on unsupervised microbiome data, using outputs from the previously trained generators to generate substitutions for the original sequences. See pretrain_discriminator/train_command.sh and pretrain_discriminator/README.MD in microbiome_transformers.zip
* Characterize the language model with the following interpretability steps:
  * Perform taxonomic assignments: Use assign_16S_to_phyla.R (or similar R code) to map your sequences to the phylogenetic hierarchy.
  * Attribution calculations: Use Attribution_calculations.ipynb to calculate per-sample model prediction scores, per-taxa attribution values (used for interpretability), as well as per-taxa averaged embeddings (used for plotting the taxa).
  * Embedding visualizations and embedding space clustering:
    * Provide Plot_microbiome_transformers_results.ipynb with the paths to your per-taxa averaged embeddings calculated above, initial vocabulary embeddings (the equivalent of vocab_embeddings.npy), and taxonomic assignments.
    * It will help generate TSNE visualizations of the two embedding spaces, as well as cross-comparisons of where taxa in one embedding space appear in the other embedding space.
    * The notebook contains preset regions for which parts of the two embedding spaces to compare (via bounding boxes with the select_by_rectangles function). These regions will likely not work for a new dataset, so you'll have to change them.
    * Finally, the notebook will also plot graphs comparing the clusterability of the data in the original two embedding spaces (non-TSNE), so as not to be fooled by the dimension reduction technique.
  * Identify high-attribution taxa:
    * Process_Attributions_No_GPU.ipynb takes the per-sample prediction scores and the per-taxa attribution values (both from Attribution_calculations.ipynb) and identifies the taxa most and least associated with IBD.
    * It also includes filtration steps for the attribution calculations (e.g., only analyze taxa that appear >= 5 times, only use attribution scores that are confident and correct, etc.), reflecting those we used in the paper.
    * The notebook will identify the taxa IDs of the top and bottom attributed taxa, then use seqs_.07_embed.fasta (or a similar taxa-ID mapping) to print the 16S sequences associated with those taxa.
  * Pathway correlations:
    * Use assemble_new_embs.py to format pytorch vocab embedding files into the format expected by metabolic_pathway_correlations.R
    * Use metabolic_pathway_correlations.R (in the metabolic_pathways folder of property_pathway_correlations.zip) to produce heatmaps of embedding dim / metabolic pathway correlation strengths, and to save a file with the statistically significant correlation data.
    * Use gen_hists.py (in the figures/csvs folder of property_pathway_correlations.zip) to generate histograms comparing the embedding dim / pathway correlation strengths of the initial fixed embeddings with those of the learned contextual embeddings.
    * Use significant_correlations_tests.py (also in the figures/csvs folder of property_pathway_correlations.zip) to apply non-parametric statistical tests to determine whether the distribution of embedding dim / pathway correlation strengths from the learned contextual embeddings is shifted right compared to those from the fixed embeddings.
* Evaluate the language model on downstream tasks
  * First, account for any patients who have multiple samples in the dataset by blocking any train / validation / test splits you perform by patient ID. Later steps will assume you have dictionaries (stored as pickle files) that map from patient ID strings (which just need to be unique per patient) to indices of the data files (i.e., you need one mapping dict per training data file). In general, the way to do this will depend on how your patient metadata is structured. You can look at host_to_ids.py (in DeepMicro.zip) to see how we combined metadata from multiple files and compared that with the different training data numpy files to produce this mapping.
  * To run experiments using our paper's transformer methods:
    * "Within distribution" evaluations: relevant commands are in finetune_discriminator/run_agp_agp_exps.sh in microbiome_transformers.zip
    * "Out of distribution" evaluations: relevant commands are in finetune_discriminator/run_agp_HF_SH_cross_gen_ensemble_tests.sh (when using an ensemble of models) and finetune_discriminator/run_agp_HF_SH_cross_gen_val_set_tests.sh (without an ensemble, using a val set for the stopping condition). Both are in microbiome_transformers.zip
    * See also finetune_discriminator/README.MD in microbiome_transformers.zip for more general information about the finetuning functionality
  * To run experiments using the DeepMicro-derived baseline methods:
    * See exps_ae.sh, exps_cae.sh, and exps_baselines.sh in DeepMicro.zip for the experiment commands (for both in-distribution and out-of-distribution experiments)
    * Also see README.md in DeepMicro.zip for more general information on using DeepMicro and our modifications to it.

## Changelog:

**01/29/2025**

Updated significant_correlations_tests.py to apply permutation testing and report Cohen's d and Cliff's Delta. Added run_blast_with_downloads.sh, which reports how many taxa in Halfvarson match any taxa in AGP and how many taxa in Schirmer match any taxa in AGP; this is a way of comparing which of Schirmer or Halfvarson is more similar to AGP in terms of the taxa present. We also slightly clarified the README's language to make it clearer where the software can be found.
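A minimal sketch of reading the coordinate-list format described above; the file name is taken from the microbiomedata.zip listing, and the interpretation of the two trailing slots follows the README text:

```python
# Sketch of loading one coordinate-list data file described above.
# Per the README: shape (n_samples, max_sample_size, 2), where slot 0 holds
# the vocab IDs of the taxa in each sample and slot 1 holds their abundances.
import numpy as np

data = np.load("IBD_train_512.npy")    # '512' files: max_sample_size = 512
vocab_ids = data[:, :, 0].astype(int)  # taxon vocab IDs per sample
abundances = data[:, :, 1]             # abundance value for each taxon

print(data.shape)                      # (n_samples, 512, 2)
print(abundances.sum(axis=1)[:5])      # total abundance of the first 5 samples
```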
  5. {"Abstract":["Body size allometry has been proposed to explain neurocranial shape\n variation across platyrrhine monkeys, including the extreme neurocranial\n narrowness (NCN) of small-bodied taxa. However, catarrhine primates do not\n align with the proposed platyrrhine allometry, suggesting that body size\n alone may not explain anthropoid NCN variation. We measured NCN in\n platyrrhine and catarrhine anthropoids, and identified inter-clade\n differences in degree and allometric scaling of NCN using GLS pANCOVA, a\n method combining generalized least squares analysis and phylogenetic\n analysis of covariance. We further performed character history analysis\n using extant anthropoid data and fossil evidence to estimate neurocranial\n shape in anthropoid ancestors. We found no unified allometry of NCN across\n platyrrhines or across anthropoids as a whole, and there was no\n significant correlation between body mass and NCN across platyrrhines or\n catarrhines. At the subfamily level, callitrichine and possibly cebine\n platyrrhines differ from other anthropoids in degree and allometric\n scaling of NCN. Character history analysis suggests that higher NCN is the\n ancestral platyrrhine condition, and broader neurocranial shape evolved\n independently in catarrhines and larger-bodied platyrrhines. Our results\n support the hypothesis that body size does not explain anthropoid NCN\n variation, and provide a foundation for future research into potential\n correlates of anthropoid NCN."],"TechnicalInfo":["# Neurocranial narrowness across nonhuman anthropoid primates Dataset DOI:\n [10.5061/dryad.r4xgxd2r0](10.5061/dryad.r4xgxd2r0) ## Description of the\n data and file structure The dataset contains a complete list of skeletal\n specimens with the linear measurement values used to calculate\n neurocranial narrowness (NCN), as well as annotated R code to replicate\n phylogenetic analyses of covariance and generalized least squares analyses\n (GLS pANCOVA). GLS pANCOVA identifies species or clades that deviate from\n expected neurocranial shape allometry. R code is also included for\n performing linear regression of brain weight residuals and neurocranial\n narrowness values. ### Code/software The .Rmd files were compiled in\n RStudio for R version 4.3.3, and require the **evomap** package (Smaers,\n J.B., and F.J. Rohlf. 2016. "Testing species’ deviations from\n allometric predictions using the phylogenetic\n regression." *Evolution* 70: 1145-1149). The phylogenetic tree\n referenced in the R code is available through the **10kTrees Project**\n (Arnold C., L. J. Matthews, and C. L. Nunn. 2010. "The *10kTrees\n Website*: A New Online Resource for Primate Phylogeny." *Evolutionary\n Anthropology* 19: 114-118). * To download **evomap** for R:\n [https://smaerslab.com/software/](https://smaerslab.com/software/) * The\n 10kTrees Project:\n [https://10ktrees.nunn-lab.org/Primates/index.html](https://10ktrees.nunn-lab.org/Primates/index.html) ### Files and variables #### File: qualaveragesfemale.csv **Description:** Species-average neurocranial narrowness values for female specimens, formatted for use in the attached R code. These values were derived from caliper measurements of skeletal specimens from the Smithsonian National Museum of Natural History and Harvard Museum of Comparative Zoology; a complete accounting of specimens can be found in the **rawdata.csv** file. 
##### Variables * **Genus.** Genus and species, formatted to match phylogenetic tree labels used in the * **Weight.** Species-average body mass in kilograms (kg), per Smith and Jungers (1997). * **NCN.** Species-average neurocranial narrowness, calculated as **NCL** divided by **NCB**. * **NCL.** Species-average neurocranial length in millimeters (mm), measured from glabella to lambda. * **NCB.** Species-average neurocranial breadth (mm), measured as biparietal breadth from euryon to euryon. #### File: pANCOVA_f.Rmd **Description:** Annotated R Markdown code for performing a phylogenetic ANCOVA to identify deviations from expected allometry of neurocranial narrowness. This analysis includes only female specimens, and requires importing the **qualaveragesfemale.csv** file. #### File: qualaveragesmale.csv **Description:** Species-average neurocranial narrowness values for male specimens, formatted for use in the attached R code. As above, these values were derived from caliper measurements of skeletal specimens from the Smithsonian National Museum of Natural History and Harvard Museum of Comparative Zoology; a complete accounting of specimens can be found in the **rawdata.csv** file. ##### Variables * **Genus.** Genus and species, formatted to match tree tip labels. * **Weight.** Species-average body mass in kilograms (kg), per Smith and Jungers (1997). * **NCN.** Species-average neurocranial narrowness, calculated as **NCL** divided by **NCB**. * **NCL.** Species-average neurocranial length in millimeters (mm), measured from glabella to lambda. * **NCB.** Species-average neurocranial breadth (mm), measured as biparietal breadth from euryon to euryon. #### File: pANCOVA_m.Rmd **Description:** Annotated R Markdown code for performing a phylogenetic ANCOVA to identify deviations from expected allometry of neurocranial narrowness. This analysis includes only male specimens, and requires importing the **qualaveragesmale.csv** file. #### File: pgls.Rmd **Description:** Annotated R Markdown code for performing a phylogenetic generalized least squares analysis of neurocranial narrowness allometry across anthropoids. Requires importing both **qualaveragesfemale.csv** and **qualaveragesmale.csv**. #### File: univariate_f.Rmd **Description:** Annotated R Markdown code for univariate phylogenetic ANCOVA of neurocranial length and breadth, where each variable is analyzed individually. Includes only female specimens, and requires importing the **qualaveragesfemale.csv** file. #### File: univariate_m.Rmd **Description:** Annotated R Markdown code for univariate phylogenetic ANCOVA of neurocranial length and breadth, where each variable is analyzed individually. Includes only male specimens, and requires importing the **qualaveragesmale.csv** file. #### File: brainweights.csv **Description:** Species-average brain weight, body weight, and neurocranial narrowness values. ##### Variables * **Species.** Genus and species. * **Brainweight.** Species-average brain weight in kilograms (kg), derived from Stephan (1981) and Boddy *et al.* (2012). * **Bodyweight.** The median value between species-average male and species-average female body mass (kg), per Smith and Jungers (1997). * **M.** Species-average male body mass (kg). * **F.** Species-average female body mass (kg). * **NCN.** Species-average neurocranial narrowness, calculated as neurocranial length divided by neurocranial breadth. * **EQ.** Encephalization quotient, calculated as per Jerison (1973). 
#### File: EQ_analysis.Rmd **Description:** Annotated R Markdown code for linear regressions of brain weight residuals and neurocranial narrowness. Requires importing the **brainweights.csv** file. #### File: rawdata.csv **Description:** Original list of specimens with taxonomic information and linear measurements, as well as body mass (where available) and notes on methods and preservation. ##### Variables * **Call num.** Specimen's museum call number, where "USNM" refers to the Smithsonian National Museum of Natural History and "MCZ" the Harvard Museum of Comparative Zoology. * **Sex.** Male (M), female (F), or unknown (?). * **Genus / sp.** Genus, species, and (when available) subspecies of each specimen. * **Recorded weight (kg).** Museum data for some specimens included the weight of the individual in kilograms. This column is left blank for specimens without recorded weights. * **Cranial length (mm).** Cranial length from glabella to lambda, in millimeters. * **Cranial breadth (mm).** Cranial breadth from euryon to euryon, in millimeters. * **CL/CB (NCN).** Neurocranial narrowness, calculated as cranial length divided by cranial breadth."]} 
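A worked illustration of the two derived quantities defined above. NCN follows the dataset's definition (neurocranial length divided by neurocranial breadth); for EQ, the common Jerison (1973) form EQ = E / (0.12 P^(2/3)), with brain mass E and body mass P in grams, is assumed, since the exact constant used in the study is not stated here:

```python
# Illustrative computation of the derived quantities defined above.
# NCN follows the dataset's definition; the EQ formula assumes Jerison's (1973)
# common form EQ = E / (0.12 * P**(2/3)) with masses in grams.

def ncn(ncl_mm: float, ncb_mm: float) -> float:
    """Neurocranial narrowness: length (glabella-lambda) / breadth (euryon-euryon)."""
    return ncl_mm / ncb_mm

def eq_jerison(brain_g: float, body_g: float) -> float:
    """Encephalization quotient: observed brain mass over Jerison's expected mass."""
    return brain_g / (0.12 * body_g ** (2.0 / 3.0))

# Example with placeholder values (not taken from the dataset); the CSVs store
# weights in kg, hence the *1000 conversions to grams.
print(round(ncn(60.0, 42.0), 3))
print(round(eq_jerison(0.07 * 1000, 3.0 * 1000), 3))
```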