Abstract

For this research project, I was assigned to investigate the genomics of the domain Archaea at the Santa Rita Experimental Range (SRER). Located in Tucson, Arizona, this field site was established so that researchers could advance their knowledge of ecosystems and apply sustainable management practices [@the_university_of_arizona_history]. Using data from the National Ecological Observatory Network (NEON) and the Integrated Microbial Genomes (IMG) database, I present a detailed analysis of Archaea at the SRER field site.

Motivating Reasons

To expand my knowledge of field-based genomics, I was assigned a specific NEON site and an associated taxonomic group to analyze. Drawing on the skills and resources I have gained in evolutionary genomics and bioinformatics, I investigate my assigned group, Archaea, at the Santa Rita Experimental Range (SRER) NEON site and assemble an in-depth analysis of the Archaea found there.

Introduction

In this course, I was assigned to investigate a specific NEON site and an associated taxonomic group. My site is the Santa Rita Experimental Range (SRER) in Arizona, for which NEON provides metagenome data on terrestrial soil microbial communities. SRER is one of the longest continuously active rangeland research facilities in the United States, and the accumulated data cover site location, domain, site chemistry, ecosystem, and more [@the_university_of_arizona_mission] [@NEON]. The group I was assigned to analyze at SRER is Archaea, a domain of single-celled organisms. Archaea are found in nearly every habitat and are also part of the human microbiome. Drawing on the metabolic, morphological, and geographical variation of these organisms at the NEON site, I provide an analysis based on the consistent patterns and characteristics I observed in the archaea [@Wiki].

Methods

Most of the data used in this research project were obtained from the NEON and IMG databases [@IMGER] [@NEON]. NEON collects data at the site through remote sensing, meteorological measurements, phenocams, soil sensor measurements, and observational sampling. Remote sensing at the field site is carried out through surveys, each of which collects lidar, spectrometer, and high-resolution RGB camera data. Meteorological measurements are collected by a flux/meteorological tower. This tower is 26 ft tall and has four measurement levels. The top of the tower extends above the vegetation canopy, which allows the sensors to capture a complete profile of atmospheric conditions from the top of the tower down to the ground. The tower also records the physical and chemical properties of the atmosphere, including wind, humidity, and net ecosystem gas exchange. Precipitation data were gathered close to the tower with a Double Fence Intercomparison Reference (DFIR). One phenocam is mounted at the bottom of the tower and one at the top. These cameras capture images from the tower every hour. The soil sensors are located near the flux tower. Measurements are taken at five soil plots: at the soil surface, Photosynthetically Active Radiation (PAR), soil heat flux, solar radiation, and throughfall are recorded, and at each plot, soil moisture, soil temperature, and CO2 concentration are measured at multiple depths in the soil.
Lastly, through observational sampling at terrestrial NEON sites, field ecologists sampled birds, plants, ground beetles, mosquitoes, small mammals, soil microbes, and ticks. These samples support analyses of DNA sequences, pathogens, soils, sediments, and biogeochemistry. After NEON collected and assembled the observed data, they were published to IMG, where I was able to analyze them individually (n.d.b). Working with the SRER data in the IMG database, RStudio, Zotero, Pavian, and Sankey plots, I characterize my assigned group, Archaea, at the Santa Rita Experimental Range (SRER) NEON site [@NEON].

Libraries

library(usethis)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(zoo)
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(knitr)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(DT)
library(ggtree)
## ggtree v3.12.0 For help: https://yulab-smu.top/treedata-book/
## 
## If you use the ggtree package suite in published research, please cite
## the appropriate paper(s):
## 
## Guangchuang Yu, David Smith, Huachen Zhu, Yi Guan, Tommy Tsan-Yuk Lam.
## ggtree: an R package for visualization and annotation of phylogenetic
## trees with their covariates and other associated data. Methods in
## Ecology and Evolution. 2017, 8(1):28-36. doi:10.1111/2041-210X.12628
## 
## Guangchuang Yu, Tommy Tsan-Yuk Lam, Huachen Zhu, Yi Guan. Two methods
## for mapping and visualizing associated data on phylogeny using ggtree.
## Molecular Biology and Evolution. 2018, 35(12):3041-3043.
## doi:10.1093/molbev/msy194
## 
## LG Wang, TTY Lam, S Xu, Z Dai, L Zhou, T Feng, P Guo, CW Dunn, BR
## Jones, T Bradley, H Zhu, Y Guan, Y Jiang, G Yu. treeio: an R package
## for phylogenetic tree input and output with richly annotated and
## associated data. Molecular Biology and Evolution. 2020, 37(2):599-603.
## doi: 10.1093/molbev/msz240
## 
## Attaching package: 'ggtree'
## The following object is masked from 'package:tidyr':
## 
##     expand
library(TDbook) 
library(ggimage)
library(rphylopic)
## You are using rphylopic v.1.4.0. Please remember to credit PhyloPic contributors (hint: `get_attribution()`) and cite rphylopic in your work (hint: `citation("rphylopic")`).
## 
## Attaching package: 'rphylopic'
## The following object is masked from 'package:ggimage':
## 
##     geom_phylopic
library(treeio)
## treeio v1.28.0 For help: https://yulab-smu.top/treedata-book/
## 
## If you use the ggtree package suite in published research, please cite
## the appropriate paper(s):
## 
## Guangchuang Yu.  Data Integration, Manipulation and Visualization of
## Phylogenetic Trees (1st edition). Chapman and Hall/CRC. 2022,
## doi:10.1201/9781003279242
library(tidytree)
## 
## Attaching package: 'tidytree'
## The following object is masked from 'package:treeio':
## 
##     getNodeNum
## The following object is masked from 'package:stats':
## 
##     filter
library(ape)
## 
## Attaching package: 'ape'
## The following objects are masked from 'package:tidytree':
## 
##     drop.tip, keep.tip
## The following object is masked from 'package:treeio':
## 
##     drop.tip
## The following object is masked from 'package:ggtree':
## 
##     rotate
## The following object is masked from 'package:dplyr':
## 
##     where
library(TreeTools)
## 
## Attaching package: 'TreeTools'
## The following object is masked from 'package:tidytree':
## 
##     MRCA
## The following object is masked from 'package:treeio':
## 
##     MRCA
## The following object is masked from 'package:ggtree':
## 
##     MRCA
library(phytools)
## Loading required package: maps
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map
## 
## Attaching package: 'phytools'
## The following object is masked from 'package:TreeTools':
## 
##     as.multiPhylo
## The following object is masked from 'package:treeio':
## 
##     read.newick
library(ggnewscale)
library(ggtreeExtra)
## ggtreeExtra v1.14.0 For help: https://yulab-smu.top/treedata-book/
## 
## If you use the ggtree package suite in published research, please cite
## the appropriate paper(s):
## 
## S Xu, Z Dai, P Guo, X Fu, S Liu, L Zhou, W Tang, T Feng, M Chen, L
## Zhan, T Wu, E Hu, Y Jiang, X Bo, G Yu. ggtreeExtra: Compact
## visualization of richly annotated phylogenetic data. Molecular Biology
## and Evolution. 2021, 38(9):4039-4042. doi: 10.1093/molbev/msab166
library(ggstar)

NEON tables

NEON MAG tables

NEON_MAGs <- read_csv("data/NEON/GOLD_Study_ID_Gs0161344_NEON_2024_4_21.csv") %>%
  # remove columns that are not needed for data analysis
  select(-c(`GOLD Study ID`, `Bin Methods`, `Created By`, `Date Added`, `Bin Lineage`)) %>%
  # create a new column with the Assembly Type
  mutate("Assembly Type" = case_when(`Genome Name` == "NEON combined assembly" ~ `Genome Name`,
                            TRUE ~ "Individual")) %>%
  mutate_at("Assembly Type", str_replace, "NEON combined assembly", "Combined") %>%
  # strip the GTDB rank prefixes (d__, p__, c__, o__, f__, g__, s__) from the lineage
  mutate(`GTDB-Tk Taxonomy Lineage` = str_remove_all(`GTDB-Tk Taxonomy Lineage`, "[dpcofgs]__")) %>%
  separate(`GTDB-Tk Taxonomy Lineage`, c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species"), ";", remove = FALSE) %>%
  # convert empty rank fields to NA
  mutate(across(c(Domain, Phylum, Class, Order, Family, Genus, Species), ~ na_if(.x, ""))) %>%
 
  # Get rid of the common string "Terrestrial soil microbial communities from "
  mutate_at("Genome Name", str_replace, "Terrestrial soil microbial communities from ", "") %>%
  # Use the first ` - ` to split the column in two
  separate(`Genome Name`, c("Site","Sample Name"), " - ") %>%
  # Get rid of the common string "-comp-1"
  mutate_at("Sample Name", str_replace, "-comp-1", "") %>%
  # separate the Sample Name into Site ID and plot info
  separate(`Sample Name`, c("Site ID","subplot.layer.date"), "_", remove = FALSE) %>%
  # separate the plot info into 3 columns
  separate(`subplot.layer.date`, c("Subplot", "Layer", "Date"), "-")
## Rows: 1754 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (8): Bin ID, Genome Name, Bin Quality, Bin Lineage, GTDB-Tk Taxonomy L...
## dbl  (10): IMG Genome ID, Bin Completeness, Bin Contamination, Total Number ...
## date  (1): Date Added
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 624 rows [1131, 1132,
## 1133, 1134, 1135, 1136, 1137, 1138, 1139, 1140, 1141, 1142, 1143, 1144, 1145,
## 1146, 1147, 1148, 1149, 1150, ...].
NEON_MAGs_bact_ind <- NEON_MAGs %>% 
  filter(Domain == "Bacteria") %>% 
  filter(`Assembly Type` == "Individual") 
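
The lineage handling in the pipeline above can be illustrated on a single string. The following is a minimal base R sketch, assuming a semicolon-separated GTDB-style lineage; the lineage string is illustrative rather than a row from the NEON table, and the helper name `split_lineage` is hypothetical. The report's own pipeline does the equivalent work with `str_replace`, `separate`, and `na_if`.

```r
# Illustrative lineage string (not a row from the NEON table)
lineage <- "d__Archaea;p__Thermoproteota;c__Nitrososphaeria;o__Nitrososphaerales;f__Nitrososphaeraceae;g__;s__"

ranks <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species")

# Hypothetical helper: split on ";", strip the rank prefixes, turn blanks into NA
split_lineage <- function(x) {
  parts <- strsplit(x, ";", fixed = TRUE)[[1]]
  parts <- sub("^[dpcofgs]__", "", trimws(parts))
  # pad to seven ranks if the lineage is truncated
  length(parts) <- length(ranks)
  parts[!is.na(parts) & parts == ""] <- NA
  setNames(parts, ranks)
}

taxa <- split_lineage(lineage)
print(taxa)
```

Ranks that GTDB-Tk could not assign (here Genus and Species) come back as `NA`, which is what lets later filters and plots treat them as missing rather than as an empty label.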

NEON metagenome table

NEON_metagenomes <- read_tsv("data/NEON/exported_img_data_Gs0161344_NEON.tsv") %>%
  select(-c(`Domain`, `Sequencing Status`, `Sequencing Center`)) %>%
  rename(`Genome Name` = `Genome Name / Sample Name`) %>%
  filter(str_detect(`Genome Name`, 're-annotation', negate = T)) %>%
  filter(str_detect(`Genome Name`, 'WREF plot', negate = T))
## Rows: 176 Columns: 46
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (18): Domain, Sequencing Status, Study Name, Genome Name / Sample Name, ...
## dbl (16): taxon_oid, IMG Genome ID, Depth In Meters, Elevation In Meters, Ge...
## lgl (12): Altitude In Meters, Chlorophyll Concentration, Longhurst Code, Lon...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
NEON_metagenomes <- NEON_metagenomes %>%
  # Get rid of the common string "Terrestrial soil microbial communities from "
  mutate_at("Genome Name", str_replace, "Terrestrial soil microbial communities from ", "") %>%
  # Use the first ` - ` to split the column in two
  separate(`Genome Name`, c("Site","Sample Name"), " - ") %>%
  # Get rid of the common string "-comp-1"
  mutate_at("Sample Name", str_replace, "-comp-1", "") %>%
  # separate the Sample Name into Site ID and plot info
  separate(`Sample Name`, c("Site ID","subplot.layer.date"), "_", remove = FALSE) %>%
  # separate the plot info into 3 columns
  separate(`subplot.layer.date`, c("Subplot", "Layer", "Date"), "-")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [53].
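
To show what the name-splitting steps above produce, here is a small base R sketch applied to one genome name. The name follows the pattern visible in the tables in this report, but the plot number and date are invented for illustration, and the intermediate variable names are hypothetical.

```r
# Hypothetical genome name in the IMG format used above; the date is invented
genome_name <- "Terrestrial soil microbial communities from Santa Rita Experimental Range, Tucson, Arizona, USA - SRER_043-M-20210630-comp-1"

# Drop the common prefix, then split the site description from the sample name
trimmed <- sub("Terrestrial soil microbial communities from ", "", genome_name, fixed = TRUE)
site_sample <- strsplit(trimmed, " - ", fixed = TRUE)[[1]]
site <- site_sample[1]
sample_name <- sub("-comp-1", "", site_sample[2], fixed = TRUE)

# "SRER_043-M-20210630" -> site ID, then subplot / layer / date
id_rest <- strsplit(sample_name, "_", fixed = TRUE)[[1]]
plot_info <- strsplit(id_rest[2], "-", fixed = TRUE)[[1]]

parsed <- data.frame(
  Site = site, `Sample Name` = sample_name, `Site ID` = id_rest[1],
  Subplot = plot_info[1], Layer = plot_info[2], Date = plot_info[3],
  check.names = FALSE
)
print(parsed)
```

A name that lacks the " - " separator has no second piece to split off, which is the situation the "Expected 2 pieces" warnings report.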

NEON chemistry data

NEON_chemistry <- read_tsv("data/NEON/neon_plot_soilChem1_metadata.tsv") %>%
  # remove -COMP from genomicsSampleID
  mutate_at("genomicsSampleID", str_replace, "-COMP", "")
## Rows: 87 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr   (5): genomicsSampleID, siteID, plotID, nlcdClass, horizon
## dbl  (11): decimalLatitude, decimalLongitude, elevation, soilTemp, d15N, org...
## date  (1): collectionDate
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
NEON_chemistry_description <- read_tsv("data/NEON/neon_soilChem1_metadata_descriptions.tsv")
## Rows: 23 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (4): fieldName, description, dataType, units
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
kable(NEON_chemistry_description)
|fieldName |description |dataType |units |
|:---|:---|:---|:---|
|siteID |NEON site code |string |NA |
|plotID |Plot identifier (NEON site code_XXX) |string |NA |
|sampleID |Identifier for sample |string |NA |
|horizon |Organic or mineral soil |string |NA |
|genomicsSampleID |Identifier for a genomics sample |string |NA |
|d15N |Measure of the ratio of 15N:14N in a sample, relative to atmospheric N2 |real |permill |
|organicd13C |Measure of the ratio of 13C:12C in soil organic carbon, relative to Vienna Pee Dee Belemnite |real |permill |
|nitrogenPercent |Percent nitrogen in a sample on a dry weight basis |real |percent |
|organicCPercent |Percent organic carbon in a sample on a dry weight basis |real |percent |
|CNratio |Ratio of carbon to nitrogen concentration in a sample on a dry weight basis |real |NA |
|nlcdClass |National Land Cover Database Vegetation Type Name |string |NA |
|subplotID |Identifier for the NEON subplot |string |NA |
|coreCoordinateX |x location of the soil core relative to the SW corner |real |meter |
|coreCoordinateY |y location of the soil core relative to the SW corner |real |meter |
|decimalLatitude |The geographic latitude (in decimal degrees, WGS84) of the geographic center of the reference area |real |decimalDegree |
|decimalLongitude |The geographic longitude (in decimal degrees, WGS84) of the geographic center of the reference area |real |decimalDegree |
|elevation |Elevation (in meters) above sea level |real |meter |
|sampleTiming |Timing of the sampling event with regard to the field season |string |NA |
|soilTemp |In-situ temperature of soil at approximately 10 cm depth |real |degree |
|sampleTopDepth |Depth below the soil surface of the top of a soil sample |real |centimeter |
|sampleBottomDepth |Depth below the soil surface of the bottom of a soil sample |real |centimeter |
|soilInWaterpH |pH value of soil measured in water solution |real |pH |
|soilInCaClpH |pH value of soil measured in calcium chloride solution |real |pH |

Join of NEON data into a single data frame

NEON_MAGs_metagenomes_chemistry <- NEON_MAGs %>%
  left_join(NEON_metagenomes, by = "Sample Name") %>%
  left_join(NEON_chemistry, by = c("Sample Name" = "genomicsSampleID")) %>%
  rename("label" = "Bin ID")
NEON_MAGs1 <- NEON_MAGs %>%
  select(`Sample Name`, `Site ID`, `GTDB-Tk Taxonomy Lineage`)
NEON_metagenomes1 <- NEON_metagenomes %>%
  select(`Sample Name`, `Site ID`, `Ecosystem Subtype`)
NEON_chemistry1 <- NEON_chemistry %>%
  select("genomicsSampleID", "siteID", "nlcdClass")
NEON1 <- NEON_MAGs %>%
  full_join(NEON_metagenomes, by = "Site ID")
## Warning in full_join(., NEON_metagenomes, by = "Site ID"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1 of `x` matches multiple rows in `y`.
## ℹ Row 1 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
NEON2 <- NEON1 %>%
  full_join(NEON_chemistry, by = c("Site ID" = "siteID"))
## Warning in full_join(., NEON_chemistry, by = c(`Site ID` = "siteID")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1 of `x` matches multiple rows in `y`.
## ℹ Row 22 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
NEON2
## # A tibble: 63,009 × 90
##    `Bin ID`      Site.x       `Sample Name.x` `Site ID` Subplot.x Layer.x Date.x
##    <chr>         <chr>        <chr>           <chr>     <chr>     <chr>   <chr> 
##  1 3300060643_14 National Gr… CLBJ_001-M-202… CLBJ      001       M       20210…
##  2 3300060643_14 National Gr… CLBJ_001-M-202… CLBJ      001       M       20210…
##  3 3300060643_14 National Gr… CLBJ_001-M-202… CLBJ      001       M       20210…
##  4 3300060643_14 National Gr… CLBJ_001-M-202… CLBJ      001       M       20210…
##  5 3300060643_14 National Gr… CLBJ_001-M-202… CLBJ      001       M       20210…
##  6 3300060643_14 National Gr… CLBJ_001-M-202… CLBJ      001       M       20210…
##  7 3300060643_14 National Gr… CLBJ_001-M-202… CLBJ      001       M       20210…
##  8 3300060643_14 National Gr… CLBJ_001-M-202… CLBJ      001       M       20210…
##  9 3300060643_14 National Gr… CLBJ_001-M-202… CLBJ      001       M       20210…
## 10 3300060643_14 National Gr… CLBJ_001-M-202… CLBJ      001       M       20210…
## # ℹ 62,999 more rows
## # ℹ 83 more variables: `IMG Genome ID.x` <dbl>, `Bin Quality` <chr>,
## #   `GTDB-Tk Taxonomy Lineage` <chr>, Domain <chr>, Phylum <chr>, Class <chr>,
## #   Order <chr>, Family <chr>, Genus <chr>, Species <chr>,
## #   `Bin Completeness` <dbl>, `Bin Contamination` <dbl>,
## #   `Total Number of Bases` <dbl>, `5s rRNA` <dbl>, `16s rRNA` <dbl>,
## #   `23s rRNA` <dbl>, `tRNA Genes` <dbl>, `Gene Count` <dbl>, …
NEON2 %>%
  filter(str_detect(`GTDB-Tk Taxonomy Lineage`, "Archaea"))
## # A tibble: 1,930 × 90
##    `Bin ID`      Site.x       `Sample Name.x` `Site ID` Subplot.x Layer.x Date.x
##    <chr>         <chr>        <chr>           <chr>     <chr>     <chr>   <chr> 
##  1 3300060650_21 Healy, Dena… HEAL_048-O-202… HEAL      048       O       20210…
##  2 3300060650_21 Healy, Dena… HEAL_048-O-202… HEAL      048       O       20210…
##  3 3300060650_21 Healy, Dena… HEAL_048-O-202… HEAL      048       O       20210…
##  4 3300060650_21 Healy, Dena… HEAL_048-O-202… HEAL      048       O       20210…
##  5 3300060650_24 Healy, Dena… HEAL_048-O-202… HEAL      048       O       20210…
##  6 3300060650_24 Healy, Dena… HEAL_048-O-202… HEAL      048       O       20210…
##  7 3300060650_24 Healy, Dena… HEAL_048-O-202… HEAL      048       O       20210…
##  8 3300060650_24 Healy, Dena… HEAL_048-O-202… HEAL      048       O       20210…
##  9 3300060650_39 Healy, Dena… HEAL_048-O-202… HEAL      048       O       20210…
## 10 3300060650_39 Healy, Dena… HEAL_048-O-202… HEAL      048       O       20210…
## # ℹ 1,920 more rows
## # ℹ 83 more variables: `IMG Genome ID.x` <dbl>, `Bin Quality` <chr>,
## #   `GTDB-Tk Taxonomy Lineage` <chr>, Domain <chr>, Phylum <chr>, Class <chr>,
## #   Order <chr>, Family <chr>, Genus <chr>, Species <chr>,
## #   `Bin Completeness` <dbl>, `Bin Contamination` <dbl>,
## #   `Total Number of Bases` <dbl>, `5s rRNA` <dbl>, `16s rRNA` <dbl>,
## #   `23s rRNA` <dbl>, `tRNA Genes` <dbl>, `Gene Count` <dbl>, …
view(NEON2)
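
The many-to-many warnings and the 63,009-row result above come from joining on `Site ID`, which is shared by many rows in each table. The sketch below reproduces the effect on invented toy tables; base R `merge()` with `all = TRUE` stands in for `full_join()` here, and all table and column names are hypothetical.

```r
# Toy stand-ins for the MAG and metagenome tables (all values invented)
mags <- data.frame(
  bin_id      = c("b1", "b2", "b3"),
  site_id     = c("SRER", "SRER", "SRER"),
  sample_name = c("SRER_043", "SRER_043", "SRER_052")
)
metagenomes <- data.frame(
  site_id     = c("SRER", "SRER"),
  sample_name = c("SRER_043", "SRER_052"),
  genome_size = c(100, 200)
)

# Joining on the site: every MAG pairs with every metagenome at that site
by_site <- merge(mags, metagenomes, by = "site_id", all = TRUE)

# Joining on the sample: each MAG pairs only with its own metagenome
by_sample <- merge(mags, metagenomes, by = "sample_name", all = TRUE)

nrow(by_site)    # 3 MAGs x 2 metagenomes = 6 rows
nrow(by_sample)  # 3 rows, one per MAG
```

Joining on the sample identifier, as the `left_join(..., by = "Sample Name")` call earlier in this section does, keeps one row per MAG, whereas the site-level join repeats each MAG once for every metagenome from the same site.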

SRER site data

Site_MAGs <- NEON_MAGs %>%
  filter(`Site ID` == "SRER")
Site_metagenomes <- NEON_metagenomes %>%
  filter(`Site ID` == "SRER")
Site_chemistry <- NEON_chemistry %>%
  filter(`siteID` == "SRER")
NEON_MAGs_HSite <- NEON_MAGs %>% 
  filter(Site == "Santa Rita Experimental Range, Tucson, Arizona, USA")%>%
  filter(`Assembly Type` == "Individual") 

Filtering for Archaea

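# Archaeal MAGs from individual assemblies at SRER
NEON_MAGs_arch_ind_SRER <- NEON_MAGs %>%
  filter(Domain == "Archaea", `Assembly Type` == "Individual", `Site ID` == "SRER")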

Full join of NEON data

NEON1 <- NEON_MAGs %>%
  full_join(NEON_metagenomes, by = "Sample Name")
NEON_full <- NEON1 %>%
  full_join(NEON_chemistry, by = c("Sample Name" = "genomicsSampleID"))
NEON2 <- Site_MAGs %>%
  full_join(Site_metagenomes, by = "Sample Name")
Site_full <- NEON2 %>%
  full_join(Site_chemistry, by = c("Sample Name" = "genomicsSampleID"))

Specific data to create various trees

NEON_MAGs_site <- NEON_MAGs %>% 
  filter(`Site` == "Santa Rita Experimental Range, Tucson, Arizona, USA")
NEON_MAGs_metagenomes_chemistry <- NEON_MAGs %>%
  left_join(NEON_metagenomes, by = "Sample Name") %>%
  left_join(NEON_chemistry, by = c("Sample Name" = "genomicsSampleID")) %>%
  rename("label" = "Bin ID")
# same table with blank-free column names for use inside aes()
NEON_MAGs_metagenomes_chemistry_noblank <- NEON_MAGs_metagenomes_chemistry %>%
  rename(
    "AssemblyType" = "Assembly Type",
    "BinCompleteness" = "Bin Completeness",
    "BinContamination" = "Bin Contamination",
    "TotalNumberofBases" = "Total Number of Bases",
    "EcosystemSubtype" = "Ecosystem Subtype"
  )
tree_arc <- read.tree("data/NEON/gtdbtk.ar53.decorated.tree")
tree_bac <- read.tree("data/NEON/gtdbtk.bac120.decorated.tree")
node_vector_arc <- c(tree_arc$tip.label, tree_arc$node.label)
grep("p__", node_vector_arc, value = TRUE)
## [1] "'1.0:p__Halobacteriota; c__Methanosarcinia; o__Methanosarcinales; f__Methanosarcinaceae; g__Methanosarcina'"                                            
## [2] "'1.0:p__Thermoplasmatota; c__E2; o__JACPAO01; f__JAHFTW01'"                                                                                             
## [3] "'1.0:p__Methanobacteriota; c__Methanobacteria; o__Methanobacteriales; f__Methanobacteriaceae; g__Methanobacterium_B; s__Methanobacterium_B sp003151535'"
## [4] "'1.0:p__Thermoproteota'"
match(grep("Archaea", node_vector_arc, value = TRUE), node_vector_arc)
## [1] 45
# Archaea sit in the archaeal tree; node 45 is the node found by the match() call above
tree_arc_preorder <- Preorder(tree_arc)
tree_Archaea <- Subtree(tree_arc_preorder, 45)
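
The node lookup used above relies on how node numbers are laid out in a `phylo` object: tips are numbered first, followed by the internal nodes, so a label's position in `c(tip.label, node.label)` is its node number. The sketch below shows that indexing on plain vectors, with invented labels standing in for a real tree.

```r
# Toy labels standing in for tree$tip.label and tree$node.label (invented)
tip_labels  <- c("bin_1", "bin_2", "bin_3")
node_labels <- c("d__Archaea", "p__Thermoproteota")

# Tips are numbered 1..Ntip and internal nodes Ntip+1 onward, so a label's
# position in the combined vector is its node number
node_vector <- c(tip_labels, node_labels)

archaea_node <- grep("Archaea", node_vector)
phylum_node  <- grep("p__Thermoproteota", node_vector, fixed = TRUE)

archaea_node  # 4
phylum_node   # 5
```

When a pattern matches exactly one label, `grep()` returns the same index as the `match(grep(..., value = TRUE), ...)` form used in this report. The node number is only meaningful for the tree whose labels were searched.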

Results

knitr::include_url("sankey-NEON_MAG_ind_pavian (2).txt.html")

Figure 1: This Sankey plot shows the individual assemblies assigned to the domain Archaea. Two archaeal phyla are present: Nitrososphaerota, the larger of the two, and Euryarchaeota.

knitr::include_url("sankey-NEON_MAG_co_pavian (4).txt.html")

Figure 2: This Sankey plot shows the combined assembly for the domain Archaea. Two archaeal phyla are present: Nitrososphaerota, the larger of the two, and Euryarchaeota.

knitr::include_url("sankey-NEON_MAG_site_pavian (2).txt.html")

Figure 3: This Sankey plot shows all phyla present at the SRER NEON site. Seven phyla are represented, which indicates that the MAGs at SRER are spread across several taxonomic groups.

Archaea

Circular tree with outer ring

ggtree(tree_Archaea, layout="circular", branch.length="none") %<+%
  NEON_MAGs_metagenomes_chemistry +
  geom_point2(mapping=aes(color=`Ecosystem Subtype`, size=`Total Number of Bases`)) +
  new_scale_fill() +
  geom_fruit(
    data=NEON_MAGs_metagenomes_chemistry_noblank,
    geom=geom_tile,
    mapping=aes(y=label, x=1, fill=AssemblyType),
    offset=0.08, # distance between external layers; default is 0.03 times the x range of the tree
    pwidth=0.25  # width of the external layer; default is 0.2 times the x range of the tree
  )
## ! The following column names/name: Site.x, Sample Name, Site ID.x, Subplot.x, Layer.x, Date.x, IMG Genome ID.x, Bin Quality, GTDB-Tk Taxonomy Lineage, Domain, Phylum, Class, Order, Family, Genus, Species, 5s rRNA, 16s rRNA, 23s rRNA, tRNA Genes, Gene Count, Scaffold Count, taxon_oid, Study Name, Site.y, Site ID.y, Subplot.y, Layer.y, Date.y, IMG Genome ID.y, GOLD Study ID, Ecosystem, Ecosystem Category, Ecosystem Type, Specific Ecosystem, Altitude In Meters, Chlorophyll Concentration, Depth In Meters, Elevation In Meters, Geographic Location, Habitat, Isolation, Isolation Country, Latitude, Longhurst Code, Longhurst Description, Longitude, Nitrate Concentration, Oxygen Concentration, pH, Pressure, Salinity, Salinity Concentration, Sample Collection Date, Sample Collection Temperature, Subsurface In Meters, Genome Size  * assembled, Gene Count  * assembled, Scaffold Count  * assembled, Genome MetaBAT Bin Count  * assembled, Genome EukCC Bin Count  * assembled, CRISPR Count  * assembled, GC Count  * assembled, GC  * assembled, Coding Base Count  * assembled, Coding Base Count %  * assembled, CDS Count  * assembled, CDS %  * assembled, siteID, plotID, nlcdClass, decimalLatitude, decimalLongitude, elevation, collectionDate, horizon, soilTemp, d15N, organicd13C, nitrogenPercent, organicCPercent, CNratio, soilInWaterpH, soilInCaClpH are/is the same to tree data, the tree data column names are : label, y, angle, Site.x, Sample Name, Site ID.x, Subplot.x, Layer.x, Date.x, IMG Genome ID.x, Bin Quality, GTDB-Tk Taxonomy Lineage, Domain, Phylum, Class, Order, Family, Genus, Species, Bin Completeness, Bin Contamination, Total Number of Bases, 5s rRNA, 16s rRNA, 23s rRNA, tRNA Genes, Gene Count, Scaffold Count, Assembly Type, taxon_oid, Study Name, Site.y, Site ID.y, Subplot.y, Layer.y, Date.y, IMG Genome ID.y, GOLD Study ID, Ecosystem, Ecosystem Category, Ecosystem Subtype, Ecosystem Type, Specific Ecosystem, Altitude In Meters, Chlorophyll Concentration, Depth In Meters, 
Elevation In Meters, Geographic Location, Habitat, Isolation, Isolation Country, Latitude, Longhurst Code, Longhurst Description, Longitude, Nitrate Concentration, Oxygen Concentration, pH, Pressure, Salinity, Salinity Concentration, Sample Collection Date, Sample Collection Temperature, Subsurface In Meters, Genome Size  * assembled, Gene Count  * assembled, Scaffold Count  * assembled, Genome MetaBAT Bin Count  * assembled, Genome EukCC Bin Count  * assembled, CRISPR Count  * assembled, GC Count  * assembled, GC  * assembled, Coding Base Count  * assembled, Coding Base Count %  * assembled, CDS Count  * assembled, CDS %  * assembled, siteID, plotID, nlcdClass, decimalLatitude, decimalLongitude, elevation, collectionDate, horizon, soilTemp, d15N, organicd13C, nitrogenPercent, organicCPercent, CNratio, soilInWaterpH, soilInCaClpH.
## Warning: Removed 46 rows containing missing values or values outside the scale range
## (`geom_point_g_gtree()`).

Figure 4: This circular phylogenetic tree shows the MAG lineages assigned to Archaea. Point size indicates the total number of bases in each lineage, which ranges from about 2e+06 to 4e+06 base pairs, and point color indicates the ecosystem subtype. The outer ring shows whether each lineage is part of a combined assembly or an individual assembly.

ggtree(tree_bac, layout="circular", branch.length="none") +
  geom_hilight(node=1712, fill="red", alpha=.6) +
  geom_cladelab(node=1712, label="Halobacteriota", align=TRUE, hjust=-0.1, offset = 0, textcolor='red', barcolor='red') +
  geom_hilight(node=1789, fill="darkgreen", alpha=.6) +
  geom_cladelab(node=1789, label="Thermoplasmatota", align=TRUE, vjust=-0.4, offset = 0, textcolor='darkgreen', barcolor='darkgreen') +
  geom_hilight(node=2673, fill="darkorange", alpha=.6) +
  geom_cladelab(node=2673, label="Methanobacteriota", align=TRUE, hjust=1.1, offset = 0, textcolor='darkorange', barcolor='darkorange') +
  geom_hilight(node=3047, fill="purple", alpha=.6) +
  geom_cladelab(node=3047, label="Thermoproteota", align=TRUE, hjust=-0.1, offset = 0, textcolor='purple', barcolor='purple')

Figure 5: This circular phylogenetic tree highlights four taxonomic groups of the domain Archaea at the NEON site: the phyla Halobacteriota, Thermoplasmatota, Methanobacteriota, and Thermoproteota.

Phylum MAG counts and dispersion at SRER

NEON_MAGs_HSite %>% 
ggplot(aes(x = Phylum)) +
  geom_bar() +
  labs(title = 'Overall Phylum MAG Counts and Dispersion at SRER')+
  coord_flip()

Figure 6: This bar plot gives an overview of the MAG counts per phylum and their dispersion at SRER. The seven phyla shown are Thermoproteota, Pseudomonadota, Gemmatimonadota, Desulfobacterota_B, Chloroflexota, Actinomycetota, and Acidobacteriota.

Soil in water pH vs. Soil in CaCl pH

ggplotly(
ggplot(data= Site_chemistry ,aes(x = soilInWaterpH , y = soilInCaClpH)) +
geom_point(aes(color= genomicsSampleID))+
  labs(title = "Soil in CaCl pH vs. Soil in Water pH")
)

Figure 7: This scatterplot shows the relationship between soil-in-water pH and soil-in-CaCl pH at various locations at the NEON site. Each colored dot corresponds to the genomic sample ID of a soil sample taken at the site, for which the pH in both water and CaCl was recorded. The values appear to cluster moderately in the top left corner of the scatterplot, which may indicate favorability for nutrient uptake by plants when the soil pH in CaCl and in water are both at higher, more basic values. Specifically, genomicsSampleIDs SRER_047, SRER_052, SRER_043, and SRER_053 fall in this cluster.
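
The strength of the relationship described in Figure 7 could also be checked numerically rather than read off the plot. The following is a minimal sketch, assuming `Site_chemistry` has numeric columns `soilInWaterpH` and `soilInCaClpH` as in the plotting code above. The helper `ph_correlation` and the small `toy_chemistry` data frame are hypothetical and do not contain SRER measurements.

```r
# Hypothetical helper: Pearson correlation between the two pH measurements.
# Rows with a missing value in either column are dropped first.
ph_correlation <- function(chem) {
  complete <- chem[!is.na(chem$soilInWaterpH) & !is.na(chem$soilInCaClpH), ]
  cor(complete$soilInWaterpH, complete$soilInCaClpH, method = "pearson")
}

# Made-up placeholder values, for illustration only (not SRER data).
toy_chemistry <- data.frame(
  genomicsSampleID = c("toy_1", "toy_2", "toy_3", "toy_4"),
  soilInWaterpH    = c(6.0, 6.5, 7.0, NA),
  soilInCaClpH     = c(5.5, 6.0, 6.5, 7.0)
)

r <- ph_correlation(toy_chemistry)
```

With the real data the call would be `ph_correlation(Site_chemistry)`. A coefficient close to 1 would support reading the two pH measurements as moving together, while a value near 0 would suggest that the apparent cluster is not a linear trend.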

Nitrogen % vs. Organic C %

ggplotly(
ggplot(data= Site_chemistry ,aes(x = organicCPercent , y = nitrogenPercent)) +
geom_point(aes(color= genomicsSampleID))+
  labs(title = "Organic C % vs. Nitrogen % in Soil")
)

Figure 8: This scatterplot shows the relationship between organic carbon percentage and nitrogen percentage at the site. Each colored dot gives the organic carbon and nitrogen percentages recorded for one soil sample collected at the field site. The values appear to cluster faintly toward the bottom left corner of the plot. This may indicate favorability for nutrient uptake by plants at SRER at organic carbon values of 0.470-0.697% and nitrogen values of 0.047-0.073%. Specifically, genomicsSampleIDs SRER_006, SRER_043, SRER_053, SRER_047, and SRER_052 fall in this cluster.

Soil Temperature vs. Ecosystem Subtype of Archaea

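NEON2 %>%   
# color must map to the Phylum column; the quoted string 'Phylum' would give every point the same color
ggplot(aes(x = fct_infreq(`Ecosystem Subtype`), y = soilTemp, color = Phylum)) +
  geom_point() +
  coord_flip()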
## Warning: Removed 1812 rows containing missing values or values outside the scale range
## (`geom_point()`).

Figure 9: This scatterplot displays the soil temperatures observed for archaea in each ecosystem subtype in the NEON data. Archaea were observed in 8 ecosystem subtypes: boreal forest/taiga, desert, tropical forest, shrubland, wetlands, tundra, grasslands, and temperate forest.

Soil Temperature at Each Subplot Sample

Site_chemistry %>%   
ggplot(aes(x = fct_infreq(`plotID`), y = soilTemp)) +
  geom_point(aes(color= genomicsSampleID))+
  theme(axis.text.x = element_text(angle=45, vjust=1, hjust=1))

Figure 10: This scatterplot illustrates the soil temperatures of the samples collected at each indicated subplot in SRER. SRER_053 has the lowest soil temperature, approximately 24.45 degrees. All of the other subplots have soil temperatures between approximately 25.85 and 27.10 degrees.

Soil temperatures for each sample at all sites

NEON2 %>%   
ggplot(aes(x = fct_infreq(`Site ID`), y = soilTemp)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle=45, vjust=1, hjust=1))
## Warning: Removed 1812 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Figure 11: This boxplot shows the soil temperatures collected at each NEON site. At each site, soil samples were gathered and their temperatures recorded. The highest documented soil temperature, approximately 27 degrees, was at the GUAN site. SRER had the second highest soil temperature, approximately 23 degrees.
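
The site ranking read from the boxplot could be cross-checked with a summary table. This is a minimal sketch, assuming `NEON2` has a `Site ID` column and a numeric `soilTemp` column as in the plotting code above. The helper `median_soil_temp` and the `toy_samples` values are hypothetical and are not NEON measurements.

```r
library(dplyr)

# Hypothetical helper: median soil temperature per site, highest first.
# Samples without a recorded temperature are dropped before summarising.
median_soil_temp <- function(samples) {
  samples %>%
    filter(!is.na(soilTemp)) %>%
    group_by(`Site ID`) %>%
    summarise(median_temp = median(soilTemp), n_samples = n(), .groups = "drop") %>%
    arrange(desc(median_temp))
}

# Made-up placeholder values, for illustration only.
toy_samples <- tibble(
  `Site ID` = c("S1", "S1", "S1", "S2", "S2", "S3"),
  soilTemp  = c(10, 12, 14, 20, 22, NA)
)

ranked <- median_soil_temp(toy_samples)
```

Calling `median_soil_temp(NEON2)` would list the sites in order of median soil temperature, so the positions of GUAN and SRER can be read directly instead of being estimated from the boxes.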

Novel Bacteria at Each Site

NEON_MAGs_bact_ind %>%
filter(is.na(Class) | is.na(Order) | is.na(Domain) | is.na(Phylum) | is.na(Family) | is.na(Genus)) %>%
ggplot(aes(x = fct_infreq(`Site`))) +
geom_bar() +
coord_flip() +
labs(title = 'Novel Bacteria at each Site') 

Figure 12: This bar graph displays the count of novel bacteria collected at each NEON site. The National Grasslands LBJ site in Texas, USA has the largest number, approximately 50 novel bacteria. SRER has significantly fewer, approximately 15. Although SRER has a small count of novel bacteria, it is comparable to some of the other NEON sites.
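
The counts behind Figure 12 can be tabulated instead of estimated from bar heights. This is a minimal sketch, assuming that, as in the filter above, a MAG counts as novel when at least one taxonomic rank is `NA`. The helper `count_novel_by_site` and the `toy_mags` table are hypothetical.

```r
library(dplyr)

# Hypothetical helper: number of MAGs per site with at least one
# unassigned (NA) taxonomic rank, largest count first.
count_novel_by_site <- function(mags) {
  mags %>%
    filter(if_any(c(Domain, Phylum, Class, Order, Family, Genus), is.na)) %>%
    count(Site, name = "novel_mags", sort = TRUE)
}

# Made-up placeholder taxonomy, for illustration only.
toy_mags <- tibble(
  Site   = c("Site A", "Site A", "Site A", "Site B", "Site B"),
  Domain = "Bacteria",
  Phylum = "p1",
  Class  = "c1",
  Order  = "o1",
  Family = c("f1", NA, "f1", "f1", "f1"),
  Genus  = c(NA, NA, "g1", NA, "g1")
)

novel_counts <- count_novel_by_site(toy_mags)
```

Applied to the real table, `count_novel_by_site(NEON_MAGs_bact_ind)` would return one row per site with its novel MAG count.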

Overall MAG Counts at each site, by phylum

NEON_MAGs_bact_ind %>%
ggplot(aes(x = fct_rev(fct_infreq(Site)), fill = Phylum)) +
geom_bar() +
coord_flip() +
labs(title = 'Total MAGs at Each Site')

Figure 13: This bar plot illustrates the overall MAG counts at each NEON site, broken down by phylum. The National Grasslands LBJ site in Texas, USA has the largest overall MAG count of all the sites. At this site, Acidobacteriota has the largest MAG count, approximately 260 MAGs. The smallest phylum at most sites is Verrucomicrobiota. My NEON site, SRER, has the smallest overall MAG count compared with the other 12 NEON sites.

Total MAGs at each site, by phylum

NEON_MAGs_bact_ind %>% 
ggplot(aes(x = Phylum)) +
  geom_bar(position = position_dodge2(width = 0.9, preserve = "single")) +
  coord_flip() +
  facet_wrap(vars(`Site ID`), scales = "free", ncol = 2) +
  labs(title = 'Total MAGs at Each Site, by Phylum')

Figure 14: These 13 bar plots show the total MAGs counted at each NEON site, by phylum. Each panel gives the number of MAGs per phylum at one site.

Overview of phylum MAG counts collected and phylum dispersions across all NEON sites

NEON_MAGs %>% 
ggplot(aes(x = Phylum)) +
  geom_bar() +
  labs(title = 'Overall Phylum MAG Counts and Dispersion Across all NEON sites')+
  coord_flip()

Figure 15: This bar graph gives an overview of the MAG counts per phylum and their dispersion across all NEON sites. Actinomycetota has the largest count, approximately 660 MAGs across all sites. Pseudomonadota follows with approximately 375 MAGs and Acidobacteriota with approximately 310 MAGs across all sites.

Overview of class MAG counts collected and class dispersions across all NEON sites

NEON_MAGs %>% 
ggplot(aes(x = Class)) +
  geom_bar() +
  labs(title = 'Overall Class MAG Counts and Dispersion Across all NEON sites')+
  coord_flip()

Figure 16: This bar graph gives an overview of the MAG counts per class and their dispersion across all NEON sites. Thermoleophilia has the largest count, approximately 290 MAGs across all sites. Alphaproteobacteria follows with approximately 250 MAGs and Actinomycetia with approximately 240 MAGs across all sites.

Discussion

For my research, I was assigned to provide a comprehensive analysis of the Santa Rita Experimental Range (SRER) NEON site using several data sets. Specifically, the NEON metagenome data from a variety of sites allowed me to analyze each site's findings on ecosystems and environments, soil chemistry, and more. Together with the NEON MAG data, this allowed me to compare SRER with the other NEON sites. The circular phylogenetic tree in Figure 5 shows four taxonomic groups of the domain Archaea at the NEON site: the phyla Halobacteriota, Thermoplasmatota, Methanobacteriota, and Thermoproteota. The Sankey plots in Figures 1, 2, and 3 give a visual representation of the individual, combined, and site assemblies of Archaea. Figures 1 and 2 show the two largest groups of Archaea, Nitrososphaerota and Euryarchaeota. Meanwhile, Figure 3 displays all phyla present at SRER. There are 7 phyla, which indicates that the taxa at SRER are dispersed across several groups. Correspondingly, the phylogenetic tree of SRER further demonstrates the multitude of lineages and their dispersion (Figure 4). In this phylogenetic breakdown, the families identified as part of Archaea are HRBIN12 (subplot 052), Pseudonocardiaceae (subplot 005 and subplot 006), Rubrobacteraceae (subplot 047), UBA5704 (subplot 043), WHSQ01 (subplot 043), GWC2-71-9 (subplot 053), Xanthobacteraceae (subplot 004), and Xanthomonadaceae (subplot 043). More specifically, these families all fall under the four highlighted phyla of the domain Archaea at the NEON site.
Comparatively, Figure 7 displays the relationship between soil-in-water pH and soil-in-CaCl pH at various sample locations at SRER. The genomicsSampleIDs shown in this figure coincide with the subplot numbers associated with the families listed above. Subplots 043, 047, 052, and 053 all sit at high, basic pH levels, which suggests that these conditions favor the four highlighted archaeal phyla at this site. Similarly, Figure 8 shows the relationship between organic carbon percentage and nitrogen percentage at these subplots. Although this scatterplot does not show a cluster large enough to estimate a favored value, subplots 006, 043, 047, 052, and 053 all sit at low nitrogen and organic carbon percentages. The soil temperatures of the 7 highlighted subplots at SRER are compared in Figure 10. Subplot 053 has the lowest soil temperature, approximately 24.45 degrees. All of the other subplots (004, 005, 006, 043, 047, and 052) have soil temperatures between approximately 25.85 and 27.10 degrees.

When analyzing archaea across all the observed NEON sites, one method I used was to plot each site's soil temperatures as a boxplot (Figure 11). The highest soil temperature was documented at the GUAN site, with an average of approximately 27 degrees. SRER had the second highest soil temperature, approximately 23 degrees, compared with the other 12 NEON sites. Another method was to count the novel bacteria at each site (Figure 12). The National Grasslands LBJ site in Texas, USA has approximately 50 novel bacteria, the largest number of all the examined sites. Meanwhile, SRER has significantly fewer, approximately 15. Although SRER has a small count of novel bacteria, it is comparable to some of the other NEON sites. Comparably, Figure 13 illustrates the overall MAG counts at each NEON site by phylum. Acidobacteriota was identified as the largest phylum at all NEON sites. The National Grasslands LBJ site in Texas, USA has the largest overall MAG count of all the sites, and Acidobacteriota has the largest MAG count there, approximately 260 MAGs. The smallest phylum at most sites is Verrucomicrobiota. My observed NEON site, SRER, has the smallest overall MAG count compared with the other 12 NEON sites. The count for Acidobacteriota at SRER is approximately 30 MAGs, and SRER does not exhibit any Verrucomicrobiota. On a broader scale, Figure 14 breaks down the MAG counts by phylum separately for each site.
Each soil sample's temperature was also analyzed across all of the NEON sites. Correspondingly, Figures 15 and 16 show the overall phylum and class MAG counts and their dispersion across all NEON sites. Figure 15 gives an overview of the MAG counts per phylum. Actinomycetota has the largest count, approximately 660 MAGs across all sites. The second and third largest are Pseudomonadota with approximately 375 MAGs and Acidobacteriota with approximately 310 MAGs. Figure 16, on the other hand, gives an overview of the MAG counts per class across all NEON sites. Thermoleophilia has the largest count, approximately 290 MAGs. Alphaproteobacteria has the second largest count, approximately 250 MAGs, and Actinomycetia the third largest, approximately 240 MAGs.

Conclusion

Through my analysis of archaea at my assigned NEON site, the Santa Rita Experimental Range (SRER), I am able to demonstrate a comprehensive understanding of the presented data. With data from the NEON and IMG databases, and with the skills and resources I have gathered in evolutionary genomics and bioinformatics, I was able to learn more about my assigned domain, Archaea. The plots show where the four highlighted archaeal phyla reside and what may affect their functionality. The four phyla examined are Halobacteriota, Thermoplasmatota, Methanobacteriota, and Thermoproteota. Based on my findings, archaea appear to be better suited to environments with higher soil pH values in both CaCl and water. Specifically, the families Rubrobacteraceae, HRBIN12, UBA5704, WHSQ01, and GWC2-71-9 are all associated with high, basic soil pH values in CaCl and water. Additionally, archaea were associated with soil containing a lower percentage of organic carbon and nitrogen, which indicates the organisms' specificity for environments with these characteristics. In particular, the families Pseudonocardiaceae, UBA5704, WHSQ01, GWC2-71-9, Rubrobacteraceae, and HRBIN12 are associated with soil with a lower percentage of organic carbon and nitrogen. These findings also allowed for comparison with the other phyla identified at SRER, as well as with the other 12 examined NEON sites. To conclude, by examining the provided databases, I was able to demonstrate the relationship between a multitude of organisms and their ecosystems across a variety of NEON sites. Through the observation, analysis, and comparison of these findings, I was able to contribute to the genomic findings on archaea and on the SRER NEON site.
As the collection and analysis of organisms and the environment at the SRER site is ongoing, these findings can help researchers develop a more comprehensive understanding of archaea and the site.

References

“Archaea.” Wikipedia, Wikimedia Foundation, 12 May 2024, en.wikipedia.org/wiki/Archaea.

“Santa Rita Experimental Range Neon / SRER.” Santa Rita Experimental Range NEON | NSF NEON | Open Data to Understand Our Ecosystems, www.neonscience.org/field-sites/srer. Accessed 20 May 2024.

“IMG/ER.” DOE Joint Genome Institute, n.d., img.jgi.doe.gov/cgi-bin/mer/main.cgi.

“History.” Santa Rita Experimental Range, 27 Apr. 2022, cales.arizona.edu/srer/content/history.

“Mission and Opportunities.” Santa Rita Experimental Range, 14 June 2022, cales.arizona.edu/srer/content/mission-and-opportunities.