Microbiome sequence analysis workshop

About the data set

In this workshop, we will be analyzing the bacterial communities found on leaves of sugar maple seedlings of different provenances planted at sites across eastern North America.

These data are taken from the article:

De Bellis, Laforest-Lapointe, Solarik, Gravel, and Kembel. 2022. Regional variation drives differences in microbial communities associated with sugar maple across a latitudinal range. Ecology (published online ahead of print). doi:10.1002/ecy.3727

For this workshop, we will work with a subset of samples from six of the nine sites sampled for the article. These samples are located in three different biomes (temperate, mixed, and boreal forests). At each site, several sugar maple seedlings were planted and harvested, and we collected bacterial DNA from the leaves. Each sample thus represents the bacterial communities on the leaves of a single sugar maple seedling. For each seedling, we have associated metadata on the provenance of the seed, the site where the seed was planted, and the biome/stand type where the seed was planted. We’ll talk more about these samples and the biological hypotheses we want to test in day 2 of the workshop.

Data file setup and download

To follow along with the code, you will need to download the raw data (FASTQ sequence files) and the SILVA taxonomy files, and place a copy of these data sets in the working directory from which you are running the code.

The data files we will use for this workshop are available for download at the following URL: https://figshare.com/articles/dataset/Data_files_for_BIOS2_Microbiome_Analysis_Workshop/19763077.

The sequence data in the folder rawdata-Qleaf_BACT consist of 74 compressed FASTQ files (2 files for each of 36 samples). These files were provided by the UQAM CERMO-FC Genomics Platform where the samples were sequenced in a multiplexed run on an Illumina MiSeq. The samples were demultiplexed into separate files based on the barcode associated with each sample, so there is a separate pair of files for each sample. One file labelled R2 contains the forward read for the sample, the other file labelled R1 contains the reverse read for the sample.

If your samples were not automatically demultiplexed by the sequencing centre, you will instead have just two files which are normally labelled R2 for forward reads and R1 for reverse reads, and you should also receive information about the sequence barcode associated with each sample. You will need to demultiplex this file yourself by using the unique barcode associated with each sample to split the single large file containing all samples together into a separate file for each sample. It is easier to not have to do this step yourself, so I recommend asking your sequencing centre how they will provide the data and request that the data be provided demultiplexed with separate files for each sample. If you do need to do the multiplexing yourself, there are numerous tools available to carry out demultplexing, including QIIME, mothur, or the standalone idemp program.

After downloading the compressed file “rawdata-Qleaf_BACT.zip” (the file is around 150MB in size), you will need to unzip the file into a folder on your computer. I suggest you place this folder in the working directory from which you are running the code for the workshop.

The SILVA taxonomy reference database files we will use for this workshop are available for download at the same URL where we found the sequence data: https://figshare.com/articles/dataset/Data_files_for_BIOS2_Microbiome_Analysis_Workshop/19763077

After downloading the compressed file “SILVA138.1.zip” (the file is around 200MB in size), you will need to unzip the file into a folder on your computer. As for the raw data, I suggest you place this folder in the working directory from which you are running the code for the workshop.

This workshop uses the most recent version of the SILVA taxonomy currently available, which is version 138.1. There are up-to-date DADA2 formatted versions of the SILVA database along with other databases available at the DADA2 website.

Install the DADA2 package

You will need to install the most recent version of the DADA2 R package. At the time of writing this workshop, this is DADA2 version 1.24.0. There are instructions for installing the DADA2 package here. The command below works to install DADA2 version 1.24.0 using R version 4.2.0:

# install dada2 if needed
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("dada2", version = "3.15")

Sequence analysis with DADA2

This workshop is based directly on the excellent DADA2 tutorial. I highly recommend following the DADA2 tutorial when working with your own data, it is constantly updated to incorporate the latest version of the analyses available in the package and has helpful tips for analysing your own data. The DADA2 tutorial goes into more depth about each of the steps in this analysis pipeline, and includes useful advice about each step of the pipeline. In this workshop, we have adapted the tutorial materials to run through the basic steps of sequence analysis to go from sequence files to an ecological matrix of ASV abundances in samples, and the taxonomic annotation of those ASVs.

We begin by loading the DADA2 package.

Load package

# load library, check version
library(dada2); packageVersion("dada2")
## [1] '1.24.0'

Set file paths

We need to set file paths so DADA2 knows where to find the raw data FASTQ files. I recommend creating a RStudio project for this workshop and placing a copy of the folder with the raw data FASTQ files in the working directory; by default in RStudio the working directory is the folder where you created the project. The exact paths below will need to be changed depending on where you placed the raw data files.

# set path to raw data FASTQ files
# here we assume the raw data are in a folder named "rawdata-Qleaf_BACT"
# in the current working directory (the directory where we created our
# RStudio project for this workshop, by default)
path <- "rawdata-Qleaf_BACT/"
# list the files found in the path - this should include 74 files with
# names like '2017-11-miseq-trimmed_R1.fastq_BC.C2.3X2.fastq.gz"
list.files(path)
##  [1] "2017-11-miseq-trimmed_R1.fastq_AUC.C1.2X1.fastq.gz" 
##  [2] "2017-11-miseq-trimmed_R1.fastq_AUC.C1.3.fastq.gz"   
##  [3] "2017-11-miseq-trimmed_R1.fastq_AUC.N1.1.fastq.gz"   
##  [4] "2017-11-miseq-trimmed_R1.fastq_AUC.N1.3.fastq.gz"   
##  [5] "2017-11-miseq-trimmed_R1.fastq_AUC.N2.1.X1.fastq.gz"
##  [6] "2017-11-miseq-trimmed_R1.fastq_AUC.N2.1X2.fastq.gz" 
##  [7] "2017-11-miseq-trimmed_R1.fastq_AUC.N2.2.fastq.gz"   
##  [8] "2017-11-miseq-trimmed_R1.fastq_AUC.S1.1.X2.fastq.gz"
##  [9] "2017-11-miseq-trimmed_R1.fastq_AUC.S1X1.fastq.gz"   
## [10] "2017-11-miseq-trimmed_R1.fastq_AUC.S2.2.fastq.gz"   
## [11] "2017-11-miseq-trimmed_R1.fastq_BC.C1.2.fastq.gz"    
## [12] "2017-11-miseq-trimmed_R1.fastq_BC.C2.1X1.fastq.gz"  
## [13] "2017-11-miseq-trimmed_R1.fastq_BC.C2.3X2.fastq.gz"  
## [14] "2017-11-miseq-trimmed_R1.fastq_BC.C2.3X2b.fastq.gz" 
## [15] "2017-11-miseq-trimmed_R1.fastq_BC.N1.3X2.fastq.gz"  
## [16] "2017-11-miseq-trimmed_R1.fastq_BC.N1.fastq.gz"      
## [17] "2017-11-miseq-trimmed_R1.fastq_BC.N2.1.X3.fastq.gz" 
## [18] "2017-11-miseq-trimmed_R1.fastq_BC.N2.1X1.fastq.gz"  
## [19] "2017-11-miseq-trimmed_R1.fastq_BC.N2.1X2.fastq.gz"  
## [20] "2017-11-miseq-trimmed_R1.fastq_PAR.C1.1.fastq.gz"   
## [21] "2017-11-miseq-trimmed_R1.fastq_PAR.C1.2.fastq.gz"   
## [22] "2017-11-miseq-trimmed_R1.fastq_PAR.C2.3.fastq.gz"   
## [23] "2017-11-miseq-trimmed_R1.fastq_PAR.N1.3.fastq.gz"   
## [24] "2017-11-miseq-trimmed_R1.fastq_QLbactNeg19.fastq.gz"
## [25] "2017-11-miseq-trimmed_R1.fastq_QLbactNeg20.fastq.gz"
## [26] "2017-11-miseq-trimmed_R1.fastq_S.M.C1.3X2.fastq.gz" 
## [27] "2017-11-miseq-trimmed_R1.fastq_S.M.C2.3.fastq.gz"   
## [28] "2017-11-miseq-trimmed_R1.fastq_S.M.C2.3X2.fastq.gz" 
## [29] "2017-11-miseq-trimmed_R1.fastq_S.M.S1.1X1.fastq.gz" 
## [30] "2017-11-miseq-trimmed_R1.fastq_SF.C1.3.fastq.gz"    
## [31] "2017-11-miseq-trimmed_R1.fastq_SF.C2.2.fastq.gz"    
## [32] "2017-11-miseq-trimmed_R1.fastq_SF.N2.3.fastq.gz"    
## [33] "2017-11-miseq-trimmed_R1.fastq_SF.S1.3.fastq.gz"    
## [34] "2017-11-miseq-trimmed_R1.fastq_T2.C1.1.fastq.gz"    
## [35] "2017-11-miseq-trimmed_R1.fastq_T2.C2.3.fastq.gz"    
## [36] "2017-11-miseq-trimmed_R1.fastq_T2.N1.1.fastq.gz"    
## [37] "2017-11-miseq-trimmed_R1.fastq_T2.S2.1.fastq.gz"    
## [38] "2017-11-miseq-trimmed_R2.fastq_AUC.C1.2X1.fastq.gz" 
## [39] "2017-11-miseq-trimmed_R2.fastq_AUC.C1.3.fastq.gz"   
## [40] "2017-11-miseq-trimmed_R2.fastq_AUC.N1.1.fastq.gz"   
## [41] "2017-11-miseq-trimmed_R2.fastq_AUC.N1.3.fastq.gz"   
## [42] "2017-11-miseq-trimmed_R2.fastq_AUC.N2.1.X1.fastq.gz"
## [43] "2017-11-miseq-trimmed_R2.fastq_AUC.N2.1X2.fastq.gz" 
## [44] "2017-11-miseq-trimmed_R2.fastq_AUC.N2.2.fastq.gz"   
## [45] "2017-11-miseq-trimmed_R2.fastq_AUC.S1.1.X2.fastq.gz"
## [46] "2017-11-miseq-trimmed_R2.fastq_AUC.S1X1.fastq.gz"   
## [47] "2017-11-miseq-trimmed_R2.fastq_AUC.S2.2.fastq.gz"   
## [48] "2017-11-miseq-trimmed_R2.fastq_BC.C1.2.fastq.gz"    
## [49] "2017-11-miseq-trimmed_R2.fastq_BC.C2.1X1.fastq.gz"  
## [50] "2017-11-miseq-trimmed_R2.fastq_BC.C2.3X2.fastq.gz"  
## [51] "2017-11-miseq-trimmed_R2.fastq_BC.C2.3X2b.fastq.gz" 
## [52] "2017-11-miseq-trimmed_R2.fastq_BC.N1.3X2.fastq.gz"  
## [53] "2017-11-miseq-trimmed_R2.fastq_BC.N1.fastq.gz"      
## [54] "2017-11-miseq-trimmed_R2.fastq_BC.N2.1.X3.fastq.gz" 
## [55] "2017-11-miseq-trimmed_R2.fastq_BC.N2.1X1.fastq.gz"  
## [56] "2017-11-miseq-trimmed_R2.fastq_BC.N2.1X2.fastq.gz"  
## [57] "2017-11-miseq-trimmed_R2.fastq_PAR.C1.1.fastq.gz"   
## [58] "2017-11-miseq-trimmed_R2.fastq_PAR.C1.2.fastq.gz"   
## [59] "2017-11-miseq-trimmed_R2.fastq_PAR.C2.3.fastq.gz"   
## [60] "2017-11-miseq-trimmed_R2.fastq_PAR.N1.3.fastq.gz"   
## [61] "2017-11-miseq-trimmed_R2.fastq_QLbactNeg19.fastq.gz"
## [62] "2017-11-miseq-trimmed_R2.fastq_QLbactNeg20.fastq.gz"
## [63] "2017-11-miseq-trimmed_R2.fastq_S.M.C1.3X2.fastq.gz" 
## [64] "2017-11-miseq-trimmed_R2.fastq_S.M.C2.3.fastq.gz"   
## [65] "2017-11-miseq-trimmed_R2.fastq_S.M.C2.3X2.fastq.gz" 
## [66] "2017-11-miseq-trimmed_R2.fastq_S.M.S1.1X1.fastq.gz" 
## [67] "2017-11-miseq-trimmed_R2.fastq_SF.C1.3.fastq.gz"    
## [68] "2017-11-miseq-trimmed_R2.fastq_SF.C2.2.fastq.gz"    
## [69] "2017-11-miseq-trimmed_R2.fastq_SF.N2.3.fastq.gz"    
## [70] "2017-11-miseq-trimmed_R2.fastq_SF.S1.3.fastq.gz"    
## [71] "2017-11-miseq-trimmed_R2.fastq_T2.C1.1.fastq.gz"    
## [72] "2017-11-miseq-trimmed_R2.fastq_T2.C2.3.fastq.gz"    
## [73] "2017-11-miseq-trimmed_R2.fastq_T2.N1.1.fastq.gz"    
## [74] "2017-11-miseq-trimmed_R2.fastq_T2.S2.1.fastq.gz"

Find data files and extract sample names

Now that we have the list of raw data FASTQ files, we need to process these file names to identify forward and reverse reads. For each sequence, the sequencer will create two files labelled R2 and R1. One of these files contains the forward reads, the other the reverse reads. Which file is which will depend on the protocol used to create sequencing libraries. In our case, the R2 files contain the forward reads, and the R1 files contain the reverse reads.

# Identify forward and reverse reads
# Filenames containing R2 are forward reads, R1 are reverse reads
fnFs <- sort(list.files(path, pattern="R2", full.names = TRUE))
fnRs <- sort(list.files(path, pattern="R1", full.names = TRUE))

We also need to process the filenames to extract and clean up the sample names.

# FASTQ filenames have format:
# BLABLA_R1.fastq_SAMPLENAME.fastq.gz
# Extract sample names from filenames
# Split filename at "_" character, the sample name is in the third chunk
sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 3)
# Remove extra text from the end of filenames
sample.names <- gsub(".fastq.gz", "", sample.names)
# Look at sample names after processing
sample.names
##  [1] "AUC.C1.2X1"  "AUC.C1.3"    "AUC.N1.1"    "AUC.N1.3"    "AUC.N2.1.X1"
##  [6] "AUC.N2.1X2"  "AUC.N2.2"    "AUC.S1.1.X2" "AUC.S1X1"    "AUC.S2.2"   
## [11] "BC.C1.2"     "BC.C2.1X1"   "BC.C2.3X2"   "BC.C2.3X2b"  "BC.N1.3X2"  
## [16] "BC.N1"       "BC.N2.1.X3"  "BC.N2.1X1"   "BC.N2.1X2"   "PAR.C1.1"   
## [21] "PAR.C1.2"    "PAR.C2.3"    "PAR.N1.3"    "QLbactNeg19" "QLbactNeg20"
## [26] "S.M.C1.3X2"  "S.M.C2.3"    "S.M.C2.3X2"  "S.M.S1.1X1"  "SF.C1.3"    
## [31] "SF.C2.2"     "SF.N2.3"     "SF.S1.3"     "T2.C1.1"     "T2.C2.3"    
## [36] "T2.N1.1"     "T2.S2.1"

We now have the list of filenames for the forward and reverse reads, the list of sample names, and we know the sample name associated with each file.

Look at sequence quality profiles

The first step of working with our raw data is to look at the sequence quality profiles. These raw data FASTQ files were generated by sequencing a multiplexed library using the Illumina MiSeq sequencer. For this study, we used the Illumina V3 2x300nt sequencing chemistry, which sequences the first 300 nucleotides at the start (forward read) and end (reverse read) of each amplicon.

There is a quality score associated with each nucleotide known as the Phred quality score. This score indicates how confident we are in the identity of that nucleotide. Normally, the quality score should be above 30 (= 1 in 1000 chance of an incorrect base call) for much of the read, but it will begin to drop towards the end of the read. We want to identify where the quality begins to drop so we can trim out the low quality nucleotides.

We will visualize the forward and reverse read quality scores separately. Here, we begin by looking at the quality profile of forward reads for the first 9 samples.

# Plot quality profile of forward reads
plotQualityProfile(fnFs[1:9])
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

We can see that the quality is quite good, but around 200-220 nucleotides it begins to decrease. It’s not a sharp drop (a “crash”) but the median quality starts to decline around 200 nucleotides and dips below a median quality of around 20 by around 230-240 nucleotides.

Now let’s look at the reverse reads.

# Plot quality profile of reverse reads
plotQualityProfile(fnRs[1:9])
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

These are similar to the forward reads, but the quality stays high for longer. This is common (the R1 read tends to be of high quality than the R2). Here, the quality stays quite high almost all the way to the end of the read, although the median quality does begin to decline starting around 230-240 nucleotides.

Trimming and quality control

Now that we have visualized the quality of the reads, we need to make some decisions about where to trim the reads.

It is very important to trim barcode and primer nucleotides from the reads as they can cause problems when identifying ASVs. Our reads contain the 12 nucleotide barcode sequence at the beginning of each read that will need to be trimmed. In our case, we used the primers 799F and 1115R to amplify the bacterial 16S rRNA gene. These primers are 18 nucleotides (799F) and 16 nucleotides (1115R) in length, and are found after the barcode sequence at the beginning of each read. Sometimes FASTQ files will be supplied with the barcodes and/or primers already removed from the sequences; you will need to check or ask the sequencing centre about this for your own data. In our case we will need to remove the forward barcode + primer nucleotides (30 nucleotides) and the reverse barcode + primer nucleotides (28 nucleotides) from the beginning of each read.

We also need to trim the reads to remove low-quality nucleotides from the end of the reads. As we saw above, the quality of both the forward and reverse reads are quite good, but forward read quality did begin to decline around 200 nucleotides and reverse read quality declines slightly beginning around 230-240 nucleotides.

An important consideration when deciding where to trim our reads is that we want to be able to merge together the forward and reverse reads later in the pipeline. To do this, we need the reads to overlap with one another. To figure out whether our forward and reverse reads will overlap after trimming, we need to know the length of the amplicon that we sequenced. In our case, the 799F-1115R primers create an amplicon that is 316 nucleotides in length. Thus, we need to make sure that the total length of the forward and reverse reads that remains after trimming is longer than 316 nucleotides in order to have some overlapping nucleotides that can be used to merge the reads together. If the forward and reverse reads cannot be merged by overlap after trimming, it will not be possible to create a merged read that covers the full length of the amplicon. This is not a fatal problem but it can reduce accuracy of ASV binning (since we are missing data for a portion of the amplicon) and it makes taxonomic annotation more challenging. The overlapping portion of the reads also serves to increase confidence in the quality of ASVs, since for the overlapping portion we have two estimates of the nucleotide at each position which serves to reduce error and increase confidence in the base call at that position.

This is an important consideration when choosing what primers to use and what sequencing technology to use. You should aim to ensure that the total length of your forward and reverse reads is enough to cover the length of the region targeted by your primers, with as much overlap as possible to make merging the reads together easier. In our case, with an amplicon length of 312 nucleotides, and forward/reverse reads of around 300 nucleotides each, we will be sure to have lots of overlap, so we don’t have to worry too much if we need to trim our reads.

Now that we know the number of nucleotides to trim from the start and end of each sequence, we can now filter and trim our reads. Here, we will trim out barcode and primer nucleotides from the start of reads and low-quality nucleotides from the end of reads. We also filter out reads with low quality after trimming, as well as reads that contain N (unknown) nucleotides and reads that map to the phiX genome (a viral genome that is often added to sequencing runs to increase library quality).

This will create filtered versions of the FASTQ files in a new folder named dada2-filtered.

# set filtered file folder paths and filenames
filt_path <- file.path(getwd(), "dada2-filtered")
filtFs <- file.path(filt_path, paste0(sample.names, "_F_filt.fastq.gz"))
filtRs <- file.path(filt_path, paste0(sample.names, "_R_filt.fastq.gz"))

# Filter and trim reads
# multithread argument = number of cores to use (set TRUE to use all)
# note need to trim out primers at the start of the reads
# and trim the end of the reads when the quality crashes
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, trimLeft = c(30,28),
                     truncLen=c(200,230), maxN=0, maxEE=2,
                     truncQ=2, rm.phix=TRUE, compress=TRUE, 
                     multithread=TRUE, verbose=TRUE)

Inferring ASVs

Now that we have our trimmed quality-control sequences, we can infer the identity and abundance of ASVs in our sequence files.

The first step of this process is to learn the error rates of our sequences. DADA2 uses information about error rates to infer the identity of ASVs in the samples.

# Learn error rates
errF <- learnErrors(filtFs, multithread=TRUE)
## 103149200 total bases in 606760 reads from 32 samples will be used for learning the error rates.
errR <- learnErrors(filtRs, multithread=TRUE)
## 101062620 total bases in 500310 reads from 22 samples will be used for learning the error rates.

Once we have learned the error rates, we do some bookkeeping to reduce the time required to infer ASVs by removing replicate sequences from each sample. A given FASTQ file might contain large numbers of identical sequences. We don’t need to analyse each of these sequences separately; we’ll dereplicate the file by removing duplicates of identical sequences. This keeps track of the number of duplicates so that at the end of our process we know the actual abundance of each of these identical sequences.

# Dereplicate the filtered fastq files
derepFs <- derepFastq(filtFs)
derepRs <- derepFastq(filtRs)
# Name the derep-class objects by the sample names
names(derepFs) <- sample.names
names(derepRs) <- sample.names

Now we are ready to use the DADA2 algorithm to identify ASVs in our samples. DADA2 uses a model of the error rate in sequences to identify the unique amplicon sequence variants (ASVs) that are present in our samples.

When carrying out ASV binning and chimera identification, we have the option to pool together samples (the pool argument). There are different strategies for pooling samples together. It is quicker to bin ASVs on a per-sample basis, but this might miss extremely rare ASVs that occur with low abundance in multiple samples. By pooling samples together, we can increase our ability to accurately identify these rare ASVs. The downside is that pooling samples together takes more time since we not have to analyse a larger number of sequences at the same time. The “pseudo-pooling” option strikes a balance between these options and is the approach we’ll use today. There is an in-depth explanation of the different pooling options at the DADA2 website.

# Infer the sequence variants in each sample
dadaFs <- dada(derepFs, err=errF, pool="pseudo", multithread=TRUE)
## Sample 1 - 43893 reads in 4133 unique sequences.
## Sample 2 - 1535 reads in 279 unique sequences.
## Sample 3 - 342 reads in 80 unique sequences.
## Sample 4 - 36815 reads in 4992 unique sequences.
## Sample 5 - 6901 reads in 796 unique sequences.
## Sample 6 - 69130 reads in 6170 unique sequences.
## Sample 7 - 13883 reads in 1640 unique sequences.
## Sample 8 - 35769 reads in 2874 unique sequences.
## Sample 9 - 35178 reads in 3504 unique sequences.
## Sample 10 - 49487 reads in 5858 unique sequences.
## Sample 11 - 15594 reads in 1607 unique sequences.
## Sample 12 - 28849 reads in 2366 unique sequences.
## Sample 13 - 529 reads in 110 unique sequences.
## Sample 14 - 29475 reads in 2956 unique sequences.
## Sample 15 - 22328 reads in 1714 unique sequences.
## Sample 16 - 17615 reads in 2171 unique sequences.
## Sample 17 - 24895 reads in 3116 unique sequences.
## Sample 18 - 31171 reads in 2974 unique sequences.
## Sample 19 - 11127 reads in 1205 unique sequences.
## Sample 20 - 20 reads in 16 unique sequences.
## Sample 21 - 255 reads in 57 unique sequences.
## Sample 22 - 25519 reads in 2519 unique sequences.
## Sample 23 - 7057 reads in 645 unique sequences.
## Sample 24 - 29 reads in 11 unique sequences.
## Sample 25 - 86 reads in 23 unique sequences.
## Sample 26 - 11570 reads in 1413 unique sequences.
## Sample 27 - 21806 reads in 2227 unique sequences.
## Sample 28 - 8836 reads in 1092 unique sequences.
## Sample 29 - 500 reads in 108 unique sequences.
## Sample 30 - 19733 reads in 2783 unique sequences.
## Sample 31 - 347 reads in 94 unique sequences.
## Sample 32 - 36486 reads in 5477 unique sequences.
## Sample 33 - 22106 reads in 2394 unique sequences.
## Sample 34 - 24618 reads in 2881 unique sequences.
## Sample 35 - 608 reads in 124 unique sequences.
## Sample 36 - 25074 reads in 3998 unique sequences.
## Sample 37 - 22197 reads in 2188 unique sequences.
## 
##    selfConsist step 2
dadaRs <- dada(derepRs, err=errR, pool="pseudo", multithread=TRUE)
## Sample 1 - 43893 reads in 3352 unique sequences.
## Sample 2 - 1535 reads in 196 unique sequences.
## Sample 3 - 342 reads in 71 unique sequences.
## Sample 4 - 36815 reads in 4930 unique sequences.
## Sample 5 - 6901 reads in 557 unique sequences.
## Sample 6 - 69130 reads in 5237 unique sequences.
## Sample 7 - 13883 reads in 1296 unique sequences.
## Sample 8 - 35769 reads in 2357 unique sequences.
## Sample 9 - 35178 reads in 2961 unique sequences.
## Sample 10 - 49487 reads in 4856 unique sequences.
## Sample 11 - 15594 reads in 1250 unique sequences.
## Sample 12 - 28849 reads in 2007 unique sequences.
## Sample 13 - 529 reads in 83 unique sequences.
## Sample 14 - 29475 reads in 2315 unique sequences.
## Sample 15 - 22328 reads in 1413 unique sequences.
## Sample 16 - 17615 reads in 1728 unique sequences.
## Sample 17 - 24895 reads in 2801 unique sequences.
## Sample 18 - 31171 reads in 2593 unique sequences.
## Sample 19 - 11127 reads in 1113 unique sequences.
## Sample 20 - 20 reads in 16 unique sequences.
## Sample 21 - 255 reads in 48 unique sequences.
## Sample 22 - 25519 reads in 2316 unique sequences.
## Sample 23 - 7057 reads in 571 unique sequences.
## Sample 24 - 29 reads in 8 unique sequences.
## Sample 25 - 86 reads in 18 unique sequences.
## Sample 26 - 11570 reads in 1256 unique sequences.
## Sample 27 - 21806 reads in 1916 unique sequences.
## Sample 28 - 8836 reads in 1008 unique sequences.
## Sample 29 - 500 reads in 96 unique sequences.
## Sample 30 - 19733 reads in 2245 unique sequences.
## Sample 31 - 347 reads in 87 unique sequences.
## Sample 32 - 36486 reads in 5014 unique sequences.
## Sample 33 - 22106 reads in 2037 unique sequences.
## Sample 34 - 24618 reads in 2683 unique sequences.
## Sample 35 - 608 reads in 83 unique sequences.
## Sample 36 - 25074 reads in 3689 unique sequences.
## Sample 37 - 22197 reads in 1876 unique sequences.
## 
##    selfConsist step 2

Merge denoised forward and reverse reads

Now that we have identified ASVs from our sequences, we can merge together the forward and reverse reads into a single sequence that will cover the full length of our amplicon. This approach works by looking for matching nucleotides in the overlapping portion of the forward and reverse reads.

As noted above, if your forward and reverse reads do not overlap, they cannot be merged. It is still possible to merge them by concatenating together the forward and reverse reads, but this approach makes it more difficult to do taxonomic annotation.

# Merge the denoised forward and reverse reads:
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs)

Construct sequence table

We have identified ASVs and merged the forward and reverse reads into a single ASV. We can now construct the sequence table, which is a matrix of ASV abundances in each sample.

# construct sequence table
seqtab <- makeSequenceTable(mergers)
# look at dimension of sequence table (# samples, # ASVs)
dim(seqtab)
## [1]   37 2540
# look at length of ASVs
table(nchar(getSequences(seqtab)))
## 
## 295 296 297 298 299 300 301 302 303 304 305 306 307 308 312 318 
##  12 113  73  99  82 458 264 697 484 109 140   1   5   1   1   1

Remove chimeras

The final step of processing the sequence files involves identifying and removing chimeric sequences. These are sequences that contain portions of two distinct organisms. Chimeric sequences can arise at different stages in the process of sequencing samples including during PCR and during sequencing.

# remove chimeras
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)
## Identified 1425 bimeras out of 2540 input sequences.
# look at dimension of sequence table (# samples, # ASVs) after chimera removal
dim(seqtab.nochim)
## [1]   37 1115
# what % of reads remain after chimera removal?
sum(seqtab.nochim)/sum(seqtab)
## [1] 0.9836811

Tracking read fate through the pipeline

Now we have completed the processing of our files. We can track how many reads per sample remain after the different steps of the process.

# summary - track reads through the pipeline
getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(mergers, getN), rowSums(seqtab), rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoised", "merged", "tabled", "nonchim")
rownames(track) <- sample.names
head(track)
##             input filtered denoised merged tabled nonchim
## AUC.C1.2X1  45852    43893    43836  43297  43297   43268
## AUC.C1.3     1605     1535     1535   1532   1532    1522
## AUC.N1.1      362      342      332    324    324     324
## AUC.N1.3    38442    36815    36731  35069  35069   33707
## AUC.N2.1.X1  7193     6901     6899   6857   6857    6854
## AUC.N2.1X2  72307    69130    69116  68251  68251   68226

Taxonomic annotation

Now we will annotate the taxonomic identity of each ASV by comparing its sequence with a reference database. Our data are bacterial 16S rRNA sequences so we’ll use the SILVA database for this taxonomic annotation. A version of the SILVA database formatted for use with DADA2 is available at the DADA2 website, along with other databases for different barcode genes.

Higher-level taxonomic annotation

For taxonomic annotation at ranks from domains to genera, DADA2 uses the RDP Naive Bayesian Classifier algorithm. This algorithm compares each ASV sequence to the reference database. If the bootstrap confidence in the taxonomic annotation at each rank is 50% or greater, we assign that annotation to the ASV at that rank. If an ASV cannot be confidently identified at a given rank, there will be a NA value in the taxonomy table.

Note that this step can be time consuming, especially if you have a large number of ASVs to identify. Running this annotation step on a computer with numerous cores/threads will speed up the process.

# identify taxonomy
taxa <- assignTaxonomy(seqtab.nochim, "SILVA138.1/silva_nr99_v138.1_train_set.fa.gz", multithread=TRUE, tryRC=TRUE)

Species-level taxonomic annotation

We use the RDP Naive Bayesian Classifier algorithm for taxonomic annotations at ranks from domains to genera. This allows for annotations even if there is not an exact match between ASV sequences and sequences in the reference database. At the species level, we need a different strategy, since we do not want to identify an ASV as belonging to a species unless its sequence matches that species exactly. Note that this approach means that we cannot carry out species-level taxonomic annotation if we concatenated our sequences rather than merging them together into a single sequence.

This step can also be somewhat time consuming depending on the number of ASVs to be annotated.

# exact species matching (won't work if concatenating sequences)
taxa.sp <- addSpecies(taxa,  "SILVA138.1/silva_species_assignment_v138.1.fa.gz", allowMultiple = TRUE, tryRC = TRUE)

Now that we have the taxonomic annotations, we can inspect them. We create a special version of the taxonomy table with the ASV names removed (the name of each ASV is its full sequence, so the names are very long).

# inspect the taxonomic assignments
taxa.print <- taxa.sp # Removing sequence rownames for display only
rownames(taxa.print) <- NULL
head(taxa.print, n=20)
##       Kingdom    Phylum             Class                 Order             
##  [1,] "Bacteria" "Proteobacteria"   "Alphaproteobacteria" "Rhizobiales"     
##  [2,] "Bacteria" "Cyanobacteria"    "Cyanobacteriia"      "Chloroplast"     
##  [3,] "Bacteria" "Bacteroidota"     "Bacteroidia"         "Cytophagales"    
##  [4,] "Bacteria" "Proteobacteria"   "Alphaproteobacteria" "Sphingomonadales"
##  [5,] "Bacteria" "Proteobacteria"   "Alphaproteobacteria" "Sphingomonadales"
##  [6,] "Bacteria" "Proteobacteria"   "Alphaproteobacteria" "Rhizobiales"     
##  [7,] "Bacteria" "Actinobacteriota" "Actinobacteria"      "Micrococcales"   
##  [8,] "Bacteria" "Proteobacteria"   "Alphaproteobacteria" "Sphingomonadales"
##  [9,] "Bacteria" "Actinobacteriota" "Actinobacteria"      "Micrococcales"   
## [10,] "Bacteria" "Proteobacteria"   "Alphaproteobacteria" "Rhizobiales"     
## [11,] "Bacteria" "Proteobacteria"   "Alphaproteobacteria" "Rhizobiales"     
## [12,] "Bacteria" "Proteobacteria"   "Alphaproteobacteria" "Rhizobiales"     
## [13,] "Bacteria" "Proteobacteria"   "Alphaproteobacteria" "Rhizobiales"     
## [14,] "Bacteria" "Proteobacteria"   "Alphaproteobacteria" "Sphingomonadales"
## [15,] "Bacteria" "Proteobacteria"   "Alphaproteobacteria" "Sphingomonadales"
## [16,] "Bacteria" "Proteobacteria"   "Alphaproteobacteria" "Sphingomonadales"
## [17,] "Bacteria" "Proteobacteria"   "Gammaproteobacteria" "Burkholderiales" 
## [18,] "Bacteria" "Proteobacteria"   "Gammaproteobacteria" "Burkholderiales" 
## [19,] "Bacteria" "Proteobacteria"   "Alphaproteobacteria" "Sphingomonadales"
## [20,] "Bacteria" "Proteobacteria"   "Alphaproteobacteria" "Rhizobiales"     
##       Family              Genus                           
##  [1,] "Beijerinckiaceae"  "Methylobacterium-Methylorubrum"
##  [2,] NA                  NA                              
##  [3,] "Hymenobacteraceae" "Hymenobacter"                  
##  [4,] "Sphingomonadaceae" "Sphingomonas"                  
##  [5,] "Sphingomonadaceae" "Sphingomonas"                  
##  [6,] "Beijerinckiaceae"  "Methylobacterium-Methylorubrum"
##  [7,] "Microbacteriaceae" "Amnibacterium"                 
##  [8,] "Sphingomonadaceae" "Sphingomonas"                  
##  [9,] "Microbacteriaceae" "Frondihabitans"                
## [10,] "Beijerinckiaceae"  "Methylobacterium-Methylorubrum"
## [11,] "Beijerinckiaceae"  "Methylobacterium-Methylorubrum"
## [12,] "Beijerinckiaceae"  "1174-901-12"                   
## [13,] "Beijerinckiaceae"  "Methylobacterium-Methylorubrum"
## [14,] "Sphingomonadaceae" "Sphingomonas"                  
## [15,] "Sphingomonadaceae" "Sphingomonas"                  
## [16,] "Sphingomonadaceae" "Sphingomonas"                  
## [17,] "Oxalobacteraceae"  "Massilia"                      
## [18,] "Oxalobacteraceae"  "Massilia"                      
## [19,] "Sphingomonadaceae" "Sphingomonas"                  
## [20,] "Beijerinckiaceae"  "Methylobacterium-Methylorubrum"
##       Species                                        
##  [1,] NA                                             
##  [2,] NA                                             
##  [3,] "arcticus/bucti/perfusus"                      
##  [4,] "aerolata/faeni"                               
##  [5,] NA                                             
##  [6,] NA                                             
##  [7,] NA                                             
##  [8,] "ginsenosidivorax"                             
##  [9,] "australicus/cladoniiphilus/peucedani/sucicola"
## [10,] NA                                             
## [11,] NA                                             
## [12,] NA                                             
## [13,] NA                                             
## [14,] "faeni"                                        
## [15,] "echinoides/glacialis/mucosissima/rhizogenes"  
## [16,] NA                                             
## [17,] NA                                             
## [18,] "aurea/oculi"                                  
## [19,] NA                                             
## [20,] NA

Final steps - clean up and save data objects and workspace

We have now completed our analysis of the sequence files. We have created an ASV table seqtab.nochim that contains the error-corrected non-chimeric ASV sequences and their abundances in each sample. For each ASV we also have the taxonomic annotations of the ASV in the object taxa.sp. We will save these data objects to the file system, along with a copy of the entire workspace containing all the output of the analyses so far. We will use these files to continue our analyses in the next part of the workshop.

## Save sequence table and taxonomic annotations to individual files
saveRDS(seqtab.nochim, "seqtab.nochim.rds")
saveRDS(taxa.sp, "taxa.sp.rds")
## Save entire R workspace to file
save.image("Microbiome-sequence-analysis-workspace.RData")