To process NEON soil microbe marker gene sequence data into ASV tables.

neonMicrobe is a suite of functions for downloading, pre-processing, and assembling heterogeneous data around the NEON soil microbe marker gene sequence data. To do so, neonMicrobe downloads NEON data products from the NEON Data API and processes sequences using the DADA2 workflow. In the future, neonMicrobe will offer a processing-batch infrastructure to encourage explicit versioning of processed data.
Please cite this package by citing the associated methods paper:
Qin, C., Bartelme, R., Chung, Y. A., Fairbanks, D., Lin, Y., Liptzin, D., Muscarella, C., Naithani, K., Peay, K., Pellitier, P., St. Rose, A., Werbin, Z., & Zhu, K. (2021). From DNA sequences to microbial ecology: Wrangling NEON soil microbe data with the neonMicrobe R package. Ecosphere, 12(11). https://doi.org/10.1002/ecs2.3842
The development version of neonMicrobe can be installed directly from this GitHub repo using this code:
install.packages("devtools")
devtools::install_github("claraqin/neonMicrobe")
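After installation, attach the package in the usual way before calling any of the functions below:

library(neonMicrobe)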
In addition to the R package dependencies which are installed alongside neonMicrobe, users may also need to complete the following requirements before using some functions in neonMicrobe:
1. Taxonomic reference datasets (e.g. SILVA, UNITE), which should be placed in the data/tax_ref subdirectory that is created after you run makeDataDirectories() (see "Input data" below).
2. cutadapt. Installation instructions can be found here. Once installed, you can tell neonMicrobe where to look for it by specifying the cutadapt_path argument each time you use the trimPrimerITS function. For an example, see the "Process 16S Sequences" vignette, the "Process ITS Sequences" vignette, or the sketch below.
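As a small illustration (not the package's documented workflow), one way to supply the cutadapt_path argument is to locate the executable with base R and pass the result along to trimPrimerITS(); the exact set of other arguments is described in the vignettes, not here:

# Locate the cutadapt executable on the system PATH (returns "" if not found)
cutadapt_path <- Sys.which("cutadapt")

# Pass this value via the cutadapt_path argument when calling trimPrimerITS();
# see the "Process ITS Sequences" vignette for the full set of arguments.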
The following R script makes use of neonMicrobe to create ASV tables for 16S sequences collected from three NEON sites in the Great Plains:
Analyze NEON Great Plains 16S Sequences
Tutorials for neonMicrobe are available in the vignettes directory, and some are also linked here:

1. Download NEON Data: how to download the NEON data inputs, with help from the neonUtilities R package.
2. Process 16S Sequences and Process ITS Sequences: how to process raw marker gene sequences into ASV tables using the dada2 R package. The dada denoising algorithm partitions reads into amplicon sequence variants (ASVs), which are finer in resolution than OTUs.

The Download NEON Data vignette demonstrates how to download NEON data, optionally writing to the file system. By default, the input data is downloaded into the following structure, which is created in the working directory after running makeDataDirectories():
The tree structure in the upper-left represents the data directory structure constructed within the project root directory. Red dotted lines represent explicit linkages between NEON data products via shared data fields. (a) Sequence metadata is downloaded from NEON data product DP1.10108.001 (Soil microbe marker gene sequences) using the downloadSequenceMetadata() function. (b) Raw microbe marker gene sequence data is downloaded from NEON based on the sequence metadata using the downloadRawSequenceData() function. (c) Soil physical and chemical data is downloaded from NEON data product DP1.10086.001 using the downloadRawSoilData() function. (d) Taxonomic reference datasets (e.g. SILVA, UNITE) are added separately by the user.
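To make those steps concrete, here is a hedged sketch of the download workflow using the functions named above. The function names come from this README, but the argument names and values (sites, targetGene, the "KONZ" site code) are illustrative assumptions; check each function's help page for the actual parameters.

# Create the input data directory structure in the working directory
makeDataDirectories()

# (a) Sequence metadata from DP1.10108.001
# NOTE: argument names and values below are assumptions for illustration only
meta <- downloadSequenceMetadata(sites = "KONZ", targetGene = "16S")

# (b) Raw marker gene sequence files corresponding to that metadata
# (passing the metadata object directly is also an assumption)
downloadRawSequenceData(meta)

# (c) Soil physical and chemical data from DP1.10086.001
soils <- downloadRawSoilData(sites = "KONZ")

# (d) Taxonomic reference datasets (e.g. SILVA, UNITE) must be added
#     manually to data/tax_ref by the user.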
The Process (16S/ITS) Sequences and Add Environmental Variables to 16S Data vignettes demonstrate how to process the NEON data inputs into useful sample-abundance tables with accompanying environmental data.
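For orientation, the core of that processing follows the standard dada2 workflow. The sketch below uses plain dada2 calls on hypothetical filtered forward reads rather than neonMicrobe's own wrapper functions, and the file paths and reference-database filename are placeholders:

library(dada2)

# Hypothetical paths to primer-trimmed, quality-filtered forward reads
filt_fwd <- list.files("outputs/mid_process/16S", pattern = "_R1.*fastq", full.names = TRUE)

# Learn run-specific error rates, then denoise reads into ASVs
err_fwd  <- learnErrors(filt_fwd, multithread = TRUE)
dada_fwd <- dada(filt_fwd, err = err_fwd, multithread = TRUE)

# Build a sample-by-ASV abundance table and remove chimeras
seqtab <- makeSequenceTable(dada_fwd)
seqtab <- removeBimeraDenovo(seqtab, method = "consensus", multithread = TRUE)

# Assign taxonomy against a reference placed in data/tax_ref (filename is a placeholder)
taxa <- assignTaxonomy(seqtab, "data/tax_ref/silva_train_set.fa.gz", multithread = TRUE)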
By default, output data from neonMicrobe is written to the outputs/ directory.
─ outputs
├── mid_process
│ ├── 16S
│ └── ITS
└── track_reads
├── 16S
└── ITS
The mid_process/ subdirectory contains files that are in the middle of being processed, such as fastq files that have been trimmed or filtered, and sequencing run-specific ASV tables that have not yet been joined together. Once the desired outputs have been created, you may choose to clear the contents of mid_process/, or leave them in place to retrace your processing steps.
The track_reads/ subdirectory contains tables tracking the number of reads remaining at each step in the pipeline, from the "raw" sequence files downloaded from NEON to the ASV table. These tables can be useful for pinpointing steps and samples for which an unusual number of reads were lost.
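As a small, hypothetical example of that kind of check (the file name and column names below are assumptions about the table layout, not a documented format):

# Hypothetical: inspect a read-tracking table written to outputs/track_reads/16S
track <- read.csv("outputs/track_reads/16S/track_reads_example.csv")

# Fraction of raw reads retained at the final step, per sample
# (column names here are assumptions about the table layout)
retained <- track$nonchim / track$raw_reads
head(track[order(retained), ])  # samples with the greatest read loss first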
(Coming soon: When the processing batch feature is released, the default outputs directory will be switched to batch_outputs. More on this later!)