Microbiome ecological analysis workshop

In this workshop, we will use the output of a DADA2 analysis of raw sequence data to explore the importance of different factors that influence the diversity and composition of microbial communities on leaves.

About the data set

In this workshop, we will be analyzing the bacterial communities found on leaves of sugar maple seedlings of different provenances planted at sites across eastern North America.

These data are taken from the article:

De Bellis, Laforest-Lapointe, Solarik, Gravel, and Kembel. 2022. Regional variation drives differences in microbial communities associated with sugar maple across a latitudinal range. Ecology (published online ahead of print). doi:10.1002/ecy.3727

For this workshop, we will work with a subset of samples from six of the nine sites sampled for the article. These samples are located in three different biomes (temperate, mixed, and boreal forests). At each site, several sugar maple seedlings were planted and harvested, and we collected bacterial DNA from the leaves. Each sample thus represents the bacterial communities on the leaves of a single sugar maple seedling. For each seedling, we have associated metadata on the provenance of the seed, the site where the seed was planted, and the biome/stand type where the seed was planted.

Install and load required packages and data

For this part of the workshop we will need to install several R packages. The commands below should work to install these packages, you should make sure you have the latest version of R installed (version 4.2.0 at the time of writing this document).

# install packages from CRAN
install.packages(pkgs = c("Rcpp", "RcppArmadillo", "picante", "ggpubr", "pheatmap"), dependencies = TRUE)
# install packages from Bioconductor
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("dada2", version = "3.15", update = FALSE)
# if the dada2 install returns a warning for BiocParallel, install from binary using this command:
# BiocManager::install("BiocParallel", version = "3.15", type="binary", update = FALSE)
BiocManager::install("DESeq2")
BiocManager::install("phyloseq")
BiocManager::install("ANCOMBC")

To begin, we will load the required packages.

### load packages
library(picante)
library(vegan)
library(ggplot2)
library(reshape2)
library(ggpubr)
library(DESeq2)
library(pheatmap)
library(ANCOMBC)
library(phyloseq)

Load DADA2 results

We will continue our analyses picking up where we left off at the end of day 1 of the workshop. We used the DADA2 package to identify ASVs and their abundances in samples, and to annotate ASVs taxonomically by comparing them to the SILVA rRNA database. We saved these data objects as files. You will need to place a copy of the files “seqtab.nochim.rds” and “taxa.sp.rds” in the working directory, and then we load them into our R workspace.

# load sequence table (nonchimeric ASV abundances in samples)
seqtab.nochim <- readRDS("seqtab.nochim.rds")
# load taxonomic annotations (taxonomic ID of each ASV)
taxa.sp <- readRDS("taxa.sp.rds")

Load metadata

The metadata file we will use for this workshop is available for download at the following URL: https://figshare.com/articles/dataset/Data_files_for_BIOS2_Microbiome_Analysis_Workshop/19763077.

You will need to download the file “metadata-Qleaf_BACT.csv” and place a copy of this file in the working directory. Then we can load the metadata.

# load metadata
metadata <- read.csv("metadata-Qleaf_BACT.csv", row.names = 1)
# inspect metadata
head(metadata)

##             Description    SampleType StandType TransplantedSite
## ASH.S2.2       ASH.S2.2 Leaf.Bacteria    Boreal    Ashuapmushuan
## ASH.N1.1.X1 ASH.N1.1.X1 Leaf.Bacteria    Boreal    Ashuapmushuan
## ASH.N1.1.X3 ASH.N1.1.X3 Leaf.Bacteria    Boreal    Ashuapmushuan
## ASH.N1.1X3   ASH.N1.1X3 Leaf.Bacteria    Boreal    Ashuapmushuan
## ASH.C1.1       ASH.C1.1 Leaf.Bacteria    Boreal    Ashuapmushuan
## ASH.C1.2       ASH.C1.2 Leaf.Bacteria    Boreal    Ashuapmushuan
##             SeedSourceRegion SeedSourceOrigin TransplantedSiteLat
## ASH.S2.2               South         Kentucky               48.81
## ASH.N1.1.X1            North        Montmagny               48.81
## ASH.N1.1.X3            North        Montmagny               48.81
## ASH.N1.1X3             North        Montmagny               48.81
## ASH.C1.1             Central     Pennsylvania               48.81
## ASH.C1.2             Central     Pennsylvania               48.81
##             TransplantedSiteLon SeedSourceOriginLat SeedSourceOriginLon
## ASH.S2.2                 -72.77               38.26              -84.95
## ASH.N1.1.X1              -72.77               46.95              -70.46
## ASH.N1.1.X3              -72.77               46.95              -70.46
## ASH.N1.1X3               -72.77               46.95              -70.46
## ASH.C1.1                 -72.77               41.13              -77.62
## ASH.C1.2                 -72.77               41.13              -77.62

The metadata contains information about each sample, including the following columns:

Description: the name of the sample (these match the names used on the sequence files)
SampleType: the samples we are working with are all leaf bacteria (we collected other types of data in the original study, but here we focus just on leaf bacteria)
StandType: the stand/biome type of the site where the seedling were planted. Either boreal, mixed, or temperate forest.
TransplantedSite: the site where the seedling was planted
SeedSourceRegion: the region from which the seed was collected
SeedSourceOrigin: the site from which the seed was collected
TransplantedSiteLat and TransplantedSiteLon: the latitude and longitude of the site where the seedling was transplanted

Cleaning and summarizing DADA2 results

Our first steps will be to do some cleaning up of the DADA2 results to make it easier to work with them. We create community and taxonomy objects that we will work with.

# community object from nonchimeric ASV sequence table
comm <- seqtab.nochim
# taxonomy object - make sure order of ASVs match between community and taxonomy
taxo <- taxa.sp[colnames(comm),]
# keep metadata only for community samples
metadata <- metadata[rownames(comm),]

We will also rename the ASVs in our community and taxonomy data objects. By default, ASVs are named by their full sequence. This makes it hard to look at these objects since the names are too long to display easily. We will replace the names of ASVs by “ASV_XXX” where XXX is a number from 1 to the number of observed ASVs.

# ASV names are their full sequences by default
# This makes it very hard to look at them (the names are too long)
# Store ASV sequence information and rename ASVs as "ASV_XXX"
ASV.sequence.info <- data.frame(ASV.name=paste0('ASV_',1:dim(comm)[2]),
                                ASV.sequence=colnames(comm))
colnames(comm) <- ASV.sequence.info$ASV.name
rownames(taxo) <- ASV.sequence.info$ASV.name

Exploring community data

Now we can begin to look at our data in more depth.

# community - number of samples (rows) and ASVs (columns)
dim(comm)

## [1]   37 1115

We have 37 samples and 1115 ASVs. If we look at our data objects, we can see the type of data in each of the different objects.

# look at community data object
head(comm)[,1:6]

##             ASV_1 ASV_2 ASV_3 ASV_4 ASV_5 ASV_6
## AUC.C1.2X1  10335   158  3993  4092   584  3225
## AUC.C1.3      123    98    15     0     3     3
## AUC.N1.1       84    17    16     5     0     3
## AUC.N1.3     4738   113   628  3978     0   329
## AUC.N2.1.X1  1871  1853  1431    20    58     2
## AUC.N2.1X2  19648  4200  6465   191  7221  2054

# look at taxonomy object
head(taxo)

##       Kingdom    Phylum           Class                 Order             
## ASV_1 "Bacteria" "Proteobacteria" "Alphaproteobacteria" "Rhizobiales"     
## ASV_2 "Bacteria" "Cyanobacteria"  "Cyanobacteriia"      "Chloroplast"     
## ASV_3 "Bacteria" "Bacteroidota"   "Bacteroidia"         "Cytophagales"    
## ASV_4 "Bacteria" "Proteobacteria" "Alphaproteobacteria" "Sphingomonadales"
## ASV_5 "Bacteria" "Proteobacteria" "Alphaproteobacteria" "Sphingomonadales"
## ASV_6 "Bacteria" "Proteobacteria" "Alphaproteobacteria" "Rhizobiales"     
##       Family              Genus                           
## ASV_1 "Beijerinckiaceae"  "Methylobacterium-Methylorubrum"
## ASV_2 NA                  NA                              
## ASV_3 "Hymenobacteraceae" "Hymenobacter"                  
## ASV_4 "Sphingomonadaceae" "Sphingomonas"                  
## ASV_5 "Sphingomonadaceae" "Sphingomonas"                  
## ASV_6 "Beijerinckiaceae"  "Methylobacterium-Methylorubrum"
##       Species                  
## ASV_1 NA                       
## ASV_2 NA                       
## ASV_3 "arcticus/bucti/perfusus"
## ASV_4 "aerolata/faeni"         
## ASV_5 NA                       
## ASV_6 NA

# look at metadata object
head(metadata)

##             Description    SampleType StandType TransplantedSite
## AUC.C1.2X1   AUC.C1.2X1 Leaf.Bacteria     Mixed          Auclair
## AUC.C1.3       AUC.C1.3 Leaf.Bacteria     Mixed          Auclair
## AUC.N1.1       AUC.N1.1 Leaf.Bacteria     Mixed          Auclair
## AUC.N1.3       AUC.N1.3 Leaf.Bacteria     Mixed          Auclair
## AUC.N2.1.X1 AUC.N2.1.X1 Leaf.Bacteria     Mixed          Auclair
## AUC.N2.1X2   AUC.N2.1X2 Leaf.Bacteria     Mixed          Auclair
##             SeedSourceRegion SeedSourceOrigin TransplantedSiteLat
## AUC.C1.2X1           Central     Pennsylvania               47.75
## AUC.C1.3             Central     Pennsylvania               47.75
## AUC.N1.1               North        Montmagny               47.75
## AUC.N1.3               North        Montmagny               47.75
## AUC.N2.1.X1            North   Rivere-du-Loup               47.75
## AUC.N2.1X2             North   Rivere-du-Loup               47.75
##             TransplantedSiteLon SeedSourceOriginLat SeedSourceOriginLon
## AUC.C1.2X1               -68.08               41.13              -77.62
## AUC.C1.3                 -68.08               41.13              -77.62
## AUC.N1.1                 -68.08               46.95              -70.46
## AUC.N1.3                 -68.08               46.95              -70.46
## AUC.N2.1.X1              -68.08               47.73              -69.48
## AUC.N2.1X2               -68.08               47.73              -69.48

Subset community to remove host DNA

Before proceeding we need to remove host DNA and other contaminants and low-quality sequences. We should do this before rarefaction and subsetting samples since it could change the number of sequences per sample if there is a lot of non-target DNA. We will carry this step out first so that subsequent analyses don’t include these non-bacterial ASVs in any of the analyses.

A reminder that primer choice and the habitat you are studying will have a big effect on the amount of non-target DNA that is in your samples. Many commonly used primers (for example, the widely used Earth Microbiome Project 16S rRNA primers 515F–806R which target the V4 region of the 16S gene) will amplify not only free living bacteria, but also DNA from host cells including chloroplasts and mitochondria. If you are working with host-associated microbiomes you may want to consider using a different primer that will not amplify host DNA. For example, when quantifying plant-associated microbiomes we commonly use the 16S primer 799F-1115R which targets the V5 region of the 16S gene and does not amplify chloroplasts or mitochondria.

Because we used the 799F-1115R primers for this study, we do not expect to find much host DNA, but we should check anyways, since a few host DNA sequences might have been amplified anyways. This is a very important step of the data analysis process since host DNA is not part of the microbial community and should not be included in data analyses. Otherwise, the amount of host DNA in a sample will lead to changes in the abundance of ASVs that are not actually part of the bacterial community being targeted.

When using a primer that targets bacteria, host DNA and DNA of non-target organisms in the samples will typically show up ASVs annotated as belonging to a non-Bacteria domain (Archaea or Eukaryota), or the taxonomic order is “Chloroplast”, or the taxonomic family is “Mitochondria”.

First we check whether any of our sequence are classified into a domain other than the bacteria.

# How many ASVs are classified to different domains?
table(taxo[,"Kingdom"])

## 
## Bacteria 
##     1115

All of the ASVs are classified as belonging to the domain Bacteria. Now let’s check for chloroplasts and mitochondria.

# How many ASVs are classified as Chloroplasts?
table(taxo[,"Order"])["Chloroplast"]

## Chloroplast 
##          12

# How many ASVs are classified as Mitochondria?
table(taxo[,"Family"])["Mitochondria"]

## Mitochondria 
##            2

There are only a few ASVs classified as chloroplasts and mitochondria. This is expected since we used a primer that should exclude this host DNA. We will remove these ASVs before continuing our analyses.

# Remove chloroplast and mitochondria ASVs from taxonomy table
taxo <- subset(taxo, taxo[,"Order"]!="Chloroplast" & taxo[,"Family"]!="Mitochondria")
# Remove these ASVs from the community matrix as well
comm <- comm[,rownames(taxo)]

If you find a large number of ASVs classified as non-Bacteria, or as chloroplasts or mitochondria, you will need to remove them from your data set prior to further analysis. Sometimes this can mean that many of your samples will contain few sequences after removing the host DNA and non-bacterial ASVs. If this is the case, you can consider using a primer that will not amplify host DNA, or sequencing your samples more deeply so that enough bacterial DNA will remain after removing the host DNA. However, this latter approach will waste a lot of your sequences, and in some cases if your samples contain mostly host DNA it will be difficult to obtain enough bacterial sequences to carry out ecological analyses.

Summary statistics

To begin with, we can calculate summary statistics for our community and taxonomy data. We mentioned earlier that the number of sequences per sample (library size) is normalized, meaning we aimed to sequence approximately the same number of reads per sample.

# number of reads per sample
rowSums(comm)

##  AUC.C1.2X1    AUC.C1.3    AUC.N1.1    AUC.N1.3 AUC.N2.1.X1  AUC.N2.1X2 
##       43110        1305         307       33460        4966       63971 
##    AUC.N2.2 AUC.S1.1.X2    AUC.S1X1    AUC.S2.2     BC.C1.2   BC.C2.1X1 
##       13344       19423       33599       45852        8667       25329 
##   BC.C2.3X2  BC.C2.3X2b   BC.N1.3X2       BC.N1  BC.N2.1.X3   BC.N2.1X1 
##         271       24130        7507       15011       22156       20475 
##   BC.N2.1X2    PAR.C1.1    PAR.C1.2    PAR.C2.3    PAR.N1.3 QLbactNeg19 
##       10486          13         161       23407        6921          27 
## QLbactNeg20  S.M.C1.3X2    S.M.C2.3  S.M.C2.3X2  S.M.S1.1X1     SF.C1.3 
##          84       10137       21017        8403         113       19081 
##     SF.C2.2     SF.N2.3     SF.S1.3     T2.C1.1     T2.C2.3     T2.N1.1 
##         273       31340       19664       23530         602       20712 
##     T2.S2.1 
##       21663

# visualize log10 number of reads per sample
hist(log10(rowSums(comm)))

We can see that most samples have at least 4000-10000 sequences/sample. But there are some samples with fewer reads. This includes both the negative control samples as well as some of the ‘real’ samples. We will need to remove samples with too few reads - we’ll return to this shortly.

We can also look at the number of sequences per ASV.

# log10 of number of reads per ASV
hist(log10(colSums(comm)))

We can see that ASV abundances follow a roughly lognormal distribution, meaning that there are a few highly abundant ASVs, but most ASVs are rare. Most ASVs are represented by fewer than 100 sequences (10^2).

Remove negative controls and samples with low-quality data

One of the first steps of analysing ASV data sets is to remove control samples, as well as samples with too few sequences. We will also need to normalize our data to take into account the fact that samples differ in the number of sequences they contain.

Rarefaction curves

We will visualize the relationship between the number of sequences in each sample (“sample size” in the figures below) versus the number of ASVs they contain (“species” in the figures below). These rarefaction curves make it clear why we need to take the number of sequences per sample into account when we analyse sample diversity.

First we plot a rarefaction curve for all samples. Each curve shows how the number of ASVs per sample changes with number of sequences in a particular sample.

rarecurve(comm, step=200, label=TRUE)

For these rarefaction curves, the key thing to look for is where ASV richness (number of ASVs per sample) reaches a plateau relative to the number of reads per sample. It does look like the samples we can see have reached a plateau of ASV richness. But it’s hard to see the curve for many of the samples because of the x-axis limits - a few samples with a lot of sequences make it hard to see the less diverse samples. Let’s zoom in to look just at the range from 0 to 5000 sequences/sample.

rarecurve(comm, step=200, label=TRUE, xlim=c(0,5000))

Now we can see that the vast majority of samples reach a plateau in their ASV richness by around 3000-5000 sequences/sample. There are a few samples with very low numbers of both sequences and ASVs. We will now look at the composition of the communities in these different samples.

Visualize community composition

Another way to quickly visualize the composition of all of our samples is to do an ordination to visualize similarity in composition among samples. Here we will visualize a principal components analysis on Hellinger-transformed ASV abundance data. Normally we should normalize the data before doing any diversity analysis but here we just want a quick look at whether some samples have very different composition from the rest. We do a PCA analysis and then plot the ordination results, overlaying confidence ellipses around the samples from the different transplanted stand types/biomes.

# PCA on Hellinger-transformed community data
comm.pca <- prcomp(decostand(comm,"hellinger"))
# plot ordination results
ordiplot(comm.pca, display="sites", type="text",cex=0.5)
ordiellipse(comm.pca, metadata$StandType, label=TRUE,cex=0.8)

There are two things to note in this ordination figure. First, we can see that along the first axis there is a separation between the negative control samples and a few other samples that cluster with the negative controls, versus all the other samples. Second, we can see that the composition of the communities seems to differ among stand type/biome along the second axis, varying from boreal to mixed to temperate forests. We’ll return to this observation later, but for now the important thing is that we can see that there are some samples that are more similar to the negative control samples than they are to the other ‘real’ samples. Let’s visualize the number of sequences per sample (library size) mapped onto these ordination axes.

# Surface plot - plot number of sequences/sample (library size)
ordisurf(comm.pca, rowSums(comm), bubble=TRUE, cex=2,
         main="Library size (sequences/sample)")

We can see that the common characteristic of the samples that cluster on the right side of the ordination plot, next to the negative control samples, is that these samples contain very few sequences.

There are several reasons why some samples contain few sequences, but in general this is an indication that these samples contain data of low quality and should be excluded from further analysis. This can be caused by low microbial biomass in the original samples, with some problem with the extraction or amplification of microbial DNA in these samples, or simply by chance. Regardless, the fact that the composition of the microbial communities in these samples is more similar to the negative controls than to the other ‘real’ samples means we need to get rid of these samples.

If many of your samples contain few sequences, this could indicate some problem with the protocol being used for sample processing and sequencing, or the normalization step of library preparation. It could also indicate low microbial biomass. In the data we are analysing here, I suspect that the problem with these samples was low microbial biomass. The maple seedlings from which we extracted microbial DNA were sometimes small with only 1-2 tiny leaves, which doesn’t contain enough bacterial cells to reliably extract and sequence their DNA. When microbial biomass in a sample is small, the potential importance of contaminant DNA increases. Regardless of the cause, we will have to exclude these samples.

Deciding what number of sequences to use as a cutoff when getting rid of samples with a low number of sequences will depend on the nature of your data set and your biological questions. You should always aim to ensure that you have enough reads per sample to reach a plateau in the rarefaction curve. Beyond that, you may want set a cutoff for a minimum number of reads per sample that ensures that you keep enough samples to have sufficient replicates in your different treatments.

Check negative controls

We sequenced negative controls in this study to ensure that our samples were not contaminated and that the sequences we obtained come from bacteria on leaves and not from some other step of the sample processing such as from the DNA extraction kits or during PCR. As we saw above, the inclusion of these negative control samples is useful because we can look at their composition to identify ‘real’ samples that might contain contaminants. As a final step, let’s look at the negative control samples to see what ASVs they contain.

# Abundance of ASVs in negative controls
comm['QLbactNeg19',][comm['QLbactNeg19',]>0]

##   ASV_1  ASV_44  ASV_77 ASV_138 
##       1      11      11       4

comm['QLbactNeg20',][comm['QLbactNeg20',]>0]

##    ASV_8   ASV_30   ASV_77   ASV_91  ASV_418  ASV_625 ASV_1002 
##        1        1        2        1       72        3        4

Our negative controls look clean - they contain only a few sequences (fewer than 100) belonging to a small number of ASVs. This is the type of result we are looking for, and we will remove these negative controls shortly when we remove low quality samples with small numbers of sequences. We can also look at the taxonomic identity of the ASVs that are present in the negative controls.

# Taxonomic identity of ASVs present in negative controls
taxo[names(comm['QLbactNeg19',][comm['QLbactNeg19',]>0]),]

##         Kingdom    Phylum           Class                 Order             
## ASV_1   "Bacteria" "Proteobacteria" "Alphaproteobacteria" "Rhizobiales"     
## ASV_44  "Bacteria" "Proteobacteria" "Gammaproteobacteria" "Enterobacterales"
## ASV_77  "Bacteria" "Firmicutes"     "Bacilli"             "Lactobacillales" 
## ASV_138 "Bacteria" "Proteobacteria" "Gammaproteobacteria" "Enterobacterales"
##         Family             Genus                           
## ASV_1   "Beijerinckiaceae" "Methylobacterium-Methylorubrum"
## ASV_44  "Shewanellaceae"   "Shewanella"                    
## ASV_77  "Streptococcaceae" "Streptococcus"                 
## ASV_138 "Shewanellaceae"   "Shewanella"                    
##         Species                                               
## ASV_1   NA                                                    
## ASV_44  "algae/haliotis"                                      
## ASV_77  "cristatus/gordonii/ictaluri/mitis/porcorum/sanguinis"
## ASV_138 NA

taxo[names(comm['QLbactNeg20',][comm['QLbactNeg20',]>0]),]

##          Kingdom    Phylum           Class                 Order             
## ASV_8    "Bacteria" "Proteobacteria" "Alphaproteobacteria" "Sphingomonadales"
## ASV_30   "Bacteria" "Bacteroidota"   "Bacteroidia"         "Cytophagales"    
## ASV_77   "Bacteria" "Firmicutes"     "Bacilli"             "Lactobacillales" 
## ASV_91   "Bacteria" "Proteobacteria" "Alphaproteobacteria" "Acetobacterales" 
## ASV_418  "Bacteria" "Proteobacteria" "Gammaproteobacteria" "Xanthomonadales" 
## ASV_625  "Bacteria" "Firmicutes"     "Bacilli"             "Lactobacillales" 
## ASV_1002 "Bacteria" "Firmicutes"     "Bacilli"             "Staphylococcales"
##          Family               Genus           
## ASV_8    "Sphingomonadaceae"  "Sphingomonas"  
## ASV_30   "Hymenobacteraceae"  "Hymenobacter"  
## ASV_77   "Streptococcaceae"   "Streptococcus" 
## ASV_91   "Acetobacteraceae"   "Roseomonas"    
## ASV_418  "Rhodanobacteraceae" "Fulvimonas"    
## ASV_625  "Streptococcaceae"   "Streptococcus" 
## ASV_1002 "Staphylococcaceae"  "Staphylococcus"
##          Species                                                                               
## ASV_8    "ginsenosidivorax"                                                                    
## ASV_30   NA                                                                                    
## ASV_77   "cristatus/gordonii/ictaluri/mitis/porcorum/sanguinis"                                
## ASV_91   NA                                                                                    
## ASV_418  NA                                                                                    
## ASV_625  "anginosus/mitis/oralis/pneumoniae/pseudopneumoniae"                                  
## ASV_1002 "aureus/capitis/caprae/epidermidis/haemolyticus/saccharolyticus/saprophyticus/warneri"

Subset community to remove low sequence number samples

Now that we have done some exploratory data analyses, we will take a subset of our communities remove the negative controls as well as samples with a low number of sequences.

When we looked at the rarefaction curves, it looked like most samples reached a plateau in number of ASVs per sequence by around 3000-5000 sequences/sample. When we looked at the distribution of number of sequences per sample, we saw that most samples contained at least 4000-5000 sequences. The samples that had fewer than that number of sequences were the ones that clustered with the negative control samples to the right of the ordination figure. Thus, we will set a cutoff of 4000 sequences per sample as a minimum to include a sample in further analysis. This will remove both the negative control samples and the samples of questionable quality that contained very few sequences.

# what is the dimension of the full community data set
dim(comm)

## [1]   37 1029

# take subset of communities with at least 4000 sequences
comm.sub <- comm[rowSums(comm)>=4000,]
# also take subset of ASVs present in the remaining samples
comm.sub <- comm.sub[,apply(comm.sub,2,sum)>0]
# what is the dimension of the subset community data set?
dim(comm.sub)

## [1]   27 1011

# subset metadata and taxonomy to match
metadata.sub <- metadata[rownames(comm.sub),]
taxo.sub <- taxo[colnames(comm.sub),]
# descriptive stats for samples and ASVs
# number of sequences per sample
hist(rowSums(comm.sub))

# log10 number of sequences per ASV
hist(log10(colSums(comm.sub)))

PCA on subset data

Before we normalize the data, let’s do the same PCA analysis on Hellinger-transformed ASV abundances, this time only for the samples that remain after getting rid of those with fewer than 5000 sequences/sample.

comm.sub.pca <- prcomp(decostand(comm.sub,"hellinger"))
# plot ordination results
ordiplot(comm.sub.pca, display="sites", type="text",cex=0.5)
ordiellipse(comm.sub.pca, metadata.sub$StandType, label=TRUE)

This looks much better, there are no longer negative controls or ‘outlier’ samples that are highly distinct compositionally from the other sampes. We can still clearly see the gradient in composition from boreal-mixed-temperate stand types, which is now falling along the first two axes of the ordination. As mentioned previously, we now need to normalize the data to account for the remaining variation in sequence depth per sample.

Data normalization for diversity analysis

Data normalization for the analysis of microbiome data is a subject of great debate in the scientific literature. Different normalization approaches have been suggested to control for the compositional nature of microbiome data, to maximize statistical power, to account for variation in sequencing depth among samples, and to meet the assumptions of different statistical analysis methods.

I recommend rarefaction as an essential data normalization approach that is needed when analysing the ecological diversity based on amplicon sequencing data sets. Because samples differ in sequencing depth, and this sequencing depth variation is due to the way we prepare libraries for sequencing and not any biological attribute of samples, we need to control for this variation. Rarefaction involves randomly selecting the same number of sequences per sample so that we can compare diversity metrics among samples.

The simplest justification for rarefaction is that that most measures of diversity are sensitive to sampling intensity. For example, taxa richness increases as we increase the number of individuals we sample from a community. If we take the same community and sample 10 individuals from it, and then sample 1000 individuals, we will obviously find more species when we sample more individuals. This is the fundamental problem that rarefaction addresses - we want to separate variation in diversity caused by some biological process from variation in diversity that is due to variation in library size.

Simulation and empirical studies have shown that rarefaction outperforms other normalization approaches when samples differ in sequencing depth (e.g. Weiss et al. 2017. Normalization and microbial differential abundance strategies depend upon data characteristics. Microbiome 5:27. https://doi.org/10.1186/s40168-017-0237-y).

Dr. Pat Schloss has an excellent series of videos on rarefaction as part of his Riffomonas project where he demonstrates very clearly why rarefaction is necessary when calculating diversity from sequencing data sets.

Rarefaction

To rarefy our data, we will randomly select the same number of sequences per sample for all of our samples. Because we already excluded samples containing fewer than 4000 sequences, we know that we can rarefy the remaining samples to at least 4000 sequences. Let’s look at the rarefaction curve for our subset of samples.

# What is the smallest number of sequences/sample for subset of samples?
min(rowSums(comm.sub))

## [1] 4966

# Rarefaction curve for subset of samples
rarecurve(comm.sub, step=200, label=FALSE, xlim=c(0,8000))

We can see that nearly all samples have reached a plateau of ASV richness by around 3000-4000 sequences/sample. The sample with the smallest number of sequences remaining contains 4966 sequences. We will randomly rarefy our data to this number of sequences. This way, we can feel confident that our samples contain enough sequences to adequately quantify the diversity of ASVs present in the sample even after rarefaction.

set.seed(0)
# Randomly rarefy samples
comm_rarfy <- rrarefy(comm.sub, sample = min(rowSums(comm.sub)))
# Remove any ASVs whose abundance is 0 after rarefaction
comm_rarfy <- comm_rarfy[,colSums(comm_rarfy)>1]
# Match ASV taxonomy to rarefied community
taxo_rarfy <- taxo.sub[colnames(comm_rarfy),]

Check the effect of rarefaction

Let’s check to see what effect rarefaction had on the ASV richness of samples.

# Rarefaction is not expected to have a major influence on the diversity (as rarefaction curves have reached the plateau).
richness_raw <- rowSums((comm.sub>0)*1)
richness_rarfy <- rowSums((comm_rarfy>0)*1)
plot(richness_rarfy~richness_raw,xlab='number of ASVs in raw data',ylab='number of ASVs in rarefied data')

This supports the idea that rarefaction is not expected to have a major influence on diversity, since we saw in the rarefaction curves that richness reaches a plateau below the threshold number of sequences that we used for rarefaction. However, now that we have normalized the number of reads per sample, any differences in diversity among samples are not due to potential differences in library size. We are ready to begin our diversity analyses. We will use the rarefied community data set for these analyses.

Community analysis

Before we dive into analysing the diversity of leaf bacterial communities on sugar maples, this is a good time to return to the biological questions we posed in the article from which the data are taken.

Biological questions

In the study for which these data were collected, we looked at different taxonomic groups (bacteria, fungi, mycorrhizal fungi) living on sugar maple leaves and roots. Here we will concentrate just on leaf bacteria.

Our aim was to assess the relative influence of host genotype (seed provenance) and environment (transplanted site and region) in structuring the microbiome of sugar maple seedlings. When analyzing the data, let’s keep focused on these biological questions to help guide our decisions about how to analyze the data. We can take different perspectives to respond to this question by looking at different aspects of diversity. And beyond specific tests of our biological hypotheses, it is also useful to carry out basic descriptive analysis of community structure in order to determine what organisms were living in our samples.

In the metadata, the variables StandType and TransplantedSite represent the region and site into which the seedlings were planted (environment), and the variables SeedSourceRegion and SeedSourceOrigin represent the region and site from which the seeds were collected (genotype).

Visualize taxonomic composition of communities

A fundamental question we can address using microbiome data is simply ‘who is there’? What are the abundant taxa in different samples?

Phylum-level taxonomic composition of samples

Remember that for each ASV, we have taxonomic annotations at different ranks. Here we’ll look at the relative abundance of bacterial phyla in each sample. We can repeat these analyses at different taxonomic ranks, but as we go to finer ranks there will be more ASVs with missing data because we could not confidently determine their taxonomic annotation, so there will be more unidentified taxa. Nearly all ASVs have an annotation at the phylum level. We first manipulate the ASV data to create a new data object of phylum abundances, and then we can visualize those abundances.

# community data aggregation at a taxonomic level. e.g. phylum 
# take the sum of each phylum in each sample
taxa.agg <- aggregate(t(comm_rarfy),by=list(taxo.sub[colnames(comm_rarfy),2]),FUN=sum)
# clean up resulting object
rownames(taxa.agg) <- taxa.agg$Group.1
phylum_data <- t(taxa.agg[,-1])
# convert abundances to relative abundances
phylum_data <- phylum_data/rowSums(phylum_data)
# remove rare phyla
phylum_data <- phylum_data[,colSums(phylum_data)>0.01]
# now reshape phylum data to long format
phylum_data <- reshape2::melt(phylum_data)
# rename columns
colnames(phylum_data)[1:2] <- c('Samples','Bacteria_phylum')
# now we can plot phylum relative abundance per sample
ggplot(phylum_data, aes(Samples, weight = value, fill = Bacteria_phylum)) +
  geom_bar(color = "black", width = .7, position = 'fill') +
  labs( y = 'Relative abundance (%)') +
  scale_y_continuous(expand = c(0,0)) +
  scale_fill_viridis_d(direction = -1L) +
  theme_classic() +
  coord_flip()

Phylum-level composition for each transplant site

It is also useful to see how taxonomic composition varies with respect to different variables we measured. For example, how does phylum-level taxonomic composition differ among different transplant sites?

# aggregate average phylum abundances per transplanted site
phylum_data_agg <- aggregate(phylum_data$value,by=list(metadata.sub[phylum_data$Samples,]$TransplantedSite,phylum_data$Bacteria_phylum),FUN=mean)
# rename columns
colnames(phylum_data_agg) <- c('TransplantedSite','Bacteria_phylum','value')
# now we can plot phylum abundance by transplant site
ggplot(phylum_data_agg, aes(TransplantedSite, weight =value, fill = Bacteria_phylum)) +
  geom_bar(color = "black", width = .7, position = 'fill') +
  labs( y = 'Relative abundance (%)') +
  scale_y_continuous(expand = c(0,0)) +
  scale_fill_viridis_d(direction = -1L) +
  theme_classic()

Community diversity (alpha diversity)

To look at how diversity differed among samples as a function of the different variables we are interested in, we’ll begin by looking at the alpha-diversity, or within-community diversity. Two commonly used measures of alpha diversity are ASV richness (the number of ASVs present in the sample), and the Shannon index, also sometimes referred to as Shannon diversity or Shannon diversity index, which measures both the number of taxa in the sample (ASV richness) as well as the equitability of their abundances (evenness).

Calculate ASV richness and Shannon diversity of bacterial community

# calculate ASV richness
# calculate Shannon index
Shannon <- diversity(comm_rarfy)
hist(richness_rarfy)

hist(Shannon)

Compare bacterial diversity among categories

We can ask how diversity differs with respect to seedling genotype and environment. Here, let’s ask specifically whether leaf bacterial alpha diversity differs between stand types. See Figure 2 of the article by De Bellis et al. 2022 for a comparable analysis.

# create data frame to hold alpha diversity values by stand type
div_standtype <- data.frame(richness=richness_rarfy, Shannon=Shannon, standtype=metadata.sub$StandType)
# set up analysis to compare diversity among different pairs of stand types
my_comparisons <- list( c("Mixed", "Boreal"), c("Mixed", "Temperate"), c("Boreal", "Temperate") )
# plot ASV richness as a function of stand type
ggboxplot(div_standtype, x = "standtype", y = "richness", hide.ns=F,
          color = "standtype", palette = "jco",add = "jitter") + 
  stat_compare_means(comparisons = my_comparisons,method = "t.test")

# plot Shannon index as a function of stand type
ggboxplot(div_standtype, x = "standtype", y = "Shannon",hide.ns=F,
          color = "standtype", palette = "jco",add = "jitter")+ 
  stat_compare_means(comparisons = my_comparisons,method = "t.test")

These figures indicate that ASV richness does not differ significantly among pairs of stand types. However, the Shannon index does differ somewhat between temperate and boreal stands, where the Shannon index tends to be lower in boreal stands than in temperate stands. These analyses are limited somewhat by the sample size that remains after removing low quality samples; in the original article (De Bellis et al 2022) we compared more sites which increased sample size enough that we found a significant (P<0.05) difference in the Shannon index between temperate and boreal stand types.

Bacterial diversity related to numeric variables

We can also ask if bacterial diversity differs with respect to numeric variables, for example with respect to latitude. To ask how richness varies with latitude, we will use a generalized linear model to take into account that richness counts are count data and thus are more likely to follow a Poisson distribution than a normal distribution.

# plot richness versus latitude
plot(richness_rarfy~metadata.sub$TransplantedSiteLat,xlab='Latitude',ylab='ASV richness')
# add best fit line
abline(glm(richness_rarfy~metadata.sub$TransplantedSiteLat), family="poisson")

# generalized linear model of richness vs. latitude
summary(glm(richness_rarfy~metadata.sub$TransplantedSiteLat, family="poisson"))

## 
## Call:
## glm(formula = richness_rarfy ~ metadata.sub$TransplantedSiteLat, 
##     family = "poisson")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.3986  -2.7521  -0.0885   2.0005   8.7112  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       8.31484    0.59676  13.933  < 2e-16 ***
## metadata.sub$TransplantedSiteLat -0.07704    0.01249  -6.169 6.85e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 421.11  on 26  degrees of freedom
## Residual deviance: 383.49  on 25  degrees of freedom
## AIC: 559.93
## 
## Number of Fisher Scoring iterations: 4

ASV richness of leaf bacteria decreases with increasing latitude.

What about the Shannon index? Here we’ll fit a linear model since we expect the data to be normally distributed.

# plot Shannon index versus latitude
plot(Shannon~metadata.sub$TransplantedSiteLat,xlab='Latitude',ylab='Shannon diversity')
# add best fit line
abline(lm(Shannon~metadata.sub$TransplantedSiteLat))

# linear model of Shannon index vs. latitude
summary(lm(Shannon~metadata.sub$TransplantedSiteLat))

## 
## Call:
## lm(formula = Shannon ~ metadata.sub$TransplantedSiteLat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.57308 -0.29404  0.08498  0.36877  1.11969 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                       7.87635    3.62424   2.173   0.0394 *
## metadata.sub$TransplantedSiteLat -0.10313    0.07556  -1.365   0.1845  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5867 on 25 degrees of freedom
## Multiple R-squared:  0.06935,    Adjusted R-squared:  0.03212 
## F-statistic: 1.863 on 1 and 25 DF,  p-value: 0.1845

The Shannon index of leaf bacteria decreases with increasing latitude, but the relationship is not statistically significant.

Community composition (beta diversity)

We can now look at the beta-diversity of the samples, which measures the between-community differences in composition.

Beta diversity approaches to studying composition involve calculating a measure of the compositional difference between pairs of communities (= beta diversity, dissimilarity, or distance). Then these distance metrics can be analysed using approaches such as ordination (to summarize the overall trends in community composition among samples) or PERMANOVA (to test whether groups of samples differ significantly in their composition).

There are a huge number of different beta diversity/dissimilarity metrics that have been used in ecology. There is an ongoing debate about the relative advantages and disadvantages of these beta diversity measures and approaches. There are several different beta diversity metrics and ordination approaches that work well based on simulation studies and empirical analysis. For those wanting to learn more about multivariate analysis and beta diversity approaches for the analysis of ecological communities, the book “Numerical Ecology with R” by Borcard, Gillet and Legendre is an excellent reference. Dr. Pierre Legendre also maintains a website with links to several excellent courses that serve as a complete introduction to multivariate methods in ecology.

In this workshop we will focus on a few different beta diversity metrics and ordination approaches that have been shown to perform well in theory and practise when applied to ecological community data. How should you choose which of these approaches to your own data? This is not an easy question to answer, as long as you are using a method that works well with ecological community data, there are different reasons to prefer one method over another. Ultimately, it can be useful to try different approaches and see if you obtain similar results with your data.

Ordination - Principal Components Analysis (PCA)

Principal components analysis (PCA) is a commonly used approach to analysing multivariate data. PCA uses eigenanalysis of Euclidean distances among samples computed from ASV abundance data to identify the axes of correlated variation that explain the most possible variation in your multivariate data. These axes correspond to major gradients of changes in community composition. We typically focus on the first few axes of the PCA ordination, since these axes should capture the majority of the variation in community composition among samples.

It’s important to note that PCA should never be used to analyze untranformed ecological community data. PCA is based on the Euclidean distance among samples. Ecological community data violate many of the assumptions of PCA - recall that there are many zeroes in our matrix of ASV abundances, and the distribution of abundance values among ASVs is very non-normal. Legendre and Gallagher (2001) showed that several transformations including the Chord and Hellinger transformation allow ecological community data matrices to be analyzed using PCA. Here we will use the Hellinger transformation to transform ASV abundances prior to analyzing them with a PCA ordination.

# Create Hellinger-transformed version of rarefied community data
comm_hel <- decostand(comm_rarfy,method='hellinger')
# PCA analysis of Hellinger-transformed community data
comm_hel_PCA <- prcomp(comm_hel)
# Summarize variance in beta diversity explained by PCA axes
summary(comm_hel_PCA)

## Importance of components:
##                           PC1     PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     0.2589 0.20937 0.1953 0.17243 0.16137 0.15434 0.14364
## Proportion of Variance 0.1484 0.09702 0.0844 0.06581 0.05764 0.05272 0.04567
## Cumulative Proportion  0.1484 0.24539 0.3298 0.39559 0.45323 0.50595 0.55162
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.13885 0.13302 0.12862 0.12292 0.12069 0.11398 0.11101
## Proportion of Variance 0.04267 0.03916 0.03661 0.03344 0.03224 0.02875 0.02728
## Cumulative Proportion  0.59429 0.63345 0.67007 0.70351 0.73575 0.76450 0.79178
##                           PC15    PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.10556 0.10227 0.09821 0.09428 0.09310 0.08979 0.08871
## Proportion of Variance 0.02466 0.02315 0.02135 0.01967 0.01919 0.01784 0.01742
## Cumulative Proportion  0.81644 0.83959 0.86094 0.88061 0.89980 0.91764 0.93506
##                           PC22    PC23    PC24    PC25    PC26      PC27
## Standard deviation     0.08379 0.08119 0.07633 0.07143 0.06929 2.048e-16
## Proportion of Variance 0.01554 0.01459 0.01289 0.01129 0.01063 0.000e+00
## Cumulative Proportion  0.95060 0.96518 0.97808 0.98937 1.00000 1.000e+00

# Plot PCA results
ordiplot(comm_hel_PCA, display = 'sites', type = 'text',cex=0.8, main="PCA on Hellinger transformed data")
# Add ellipses around samples from different stand types
ordiellipse(comm_hel_PCA, metadata.sub$StandType, label=TRUE)

The PCA ordination diagram indicates the overall compositional similarity of samples. Samples that are close together in the ordination space contain similar ASVs with similar abundances. We can see that the gradient in community composition among different stand types is visible along the first two axes. Samples from each stand type tend to contain compositionally similar leaf bacterial communities.

Recall that when we were first looking at the composition of communities in our samples (prior to subsetting and rarefaction), we obtained a similar looking result. As noted, because we used a rarefaction threshold that was sufficient for the rarefaction curves of the samples to reach a plateau in ASV richness, we do not expect major differences between the analysis of rarefied versus non-rarefied data. However, by analyzing the rarefied data we are now confident that the differences in composition among the samples are due to true differences in community composition and not due to differences in library size among samples.

Ordination - Non-metric Multidimensional Scaling (NMDS)

Another commonly used ordination approach in ecology is non-metric multidimensional scaling (NMDS). NMDS can be used with any beta diversity distance metric. Unlike PCA, NMDS is not based on eigenanalysis of the distance metric. Rather, NMDS uses an algorithm to find an ordination of samples in a few dimensions that represents as best as possible the rank transformed pairwise distances among samples measured with the original distance metric. When carrying out a NMDS analysis, rather than obtaining many PCA axes, the user specifies how many axes to identify. NMDS analysis is based on a heuristic algorithm that may give slightly different results when run multiple times on the same data, whereas PCA has a unique analytical solution.

NMDS can be used with any distance metric. Commonly in microbial ecology it is used with the Bray-Curtis distance. The Bray-Curtis distance, like the Hellinger distance, has been shown to perform well compared with other distance measures (Faith et al. 1987, Gallagher and Legendre 2001).

Here we’ll calculate a NMDS ordination using Bray-Curtis distance.

# NMDS ordination based on Bray-Curtis distances
comm_NMDS <- metaMDS(comm_hel, distance="bray", trace=FALSE)
ordiplot(comm_NMDS, cex = 0.5, type = 'text', display='sites')
ordiellipse(comm_NMDS, metadata.sub$StandType, label=TRUE)
# overlay the direction of latitude effect on bacteria community composition
ef <- envfit(comm_NMDS, metadata.sub$TransplantedSiteLat)
rownames(ef$vectors$arrows)='Latitude'
plot(ef)

Here we can see that the NMDS ordination based on Bray-Curtis distances among samples looks quite similar to the PCA ordination, with community composition differing among different stand types. We have also added an arrow indicating the correlation between latitude and the ordination axes, which also supports the idea that communities vary as we move from boreal stands in the north to temperate stands in the south.

PERMANOVA

Ordination analyses allow us to visualize how the composition of communities differs among samples and how it relates to different variables in a qualitative way. For example, we can see from the ordination diagrams above that it seems that community composition differs among stand types. We can statistically test whether stand types differ in composition using permutational multivariate analysis of variance (PERMANOVA). A PERMANOVA works in a way similar to an ANOVA, but with multivariate compositional data. PERMANOVA tests indicate whether community composition (beta diversity) differs among groups of samples.

Here we can first use a PERMANOVA to test whether community composition differs among stand types. As with ordination methods, a PERMANOVA can be run on any distance metric. Let’s try computing a PERMANOVA using both Hellinger distance (used for the PCA) and Bray-Curtis distance (used for the NMDS).

# set number of permutations for PERMANOVA
perm <- how(nperm = 999)
# PERMANOVA on Hellinger-transformed Euclidean distances
adonis2(formula = dist(comm_hel) ~ StandType, data = metadata.sub, permutations = perm)

## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
## 
## adonis2(formula = dist(comm_hel) ~ StandType, data = metadata.sub, permutations = perm)
##           Df SumOfSqs      R2      F Pr(>F)    
## StandType  2   1.8343 0.15614 2.2204  0.001 ***
## Residual  24   9.9129 0.84386                  
## Total     26  11.7472 1.00000                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# PERMANOVA on Bray-Curtis distances
adonis2(formula = vegdist(comm_rarfy, method="bray") ~ StandType, data = metadata.sub, permutations = perm)

## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
## 
## adonis2(formula = vegdist(comm_rarfy, method = "bray") ~ StandType, data = metadata.sub, permutations = perm)
##           Df SumOfSqs     R2      F Pr(>F)    
## StandType  2   0.9707 0.1783 2.6039  0.001 ***
## Residual  24   4.4734 0.8217                  
## Total     26   5.4441 1.0000                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Both of these analyses indicate that community composition differs significantly among stand types, explaining around 16-18% of variation in the distance metric.

PERMANOVA can also be used to test more complex experimental designs. Here, let’s replicate the analyses by De Bellis et al. (2022). We used PERMANOVA on Bray-Curtis distances to measure the relative importance of transplanted region and site (environment) versus origin region and site (genotype) in determining the composition of leaf bacterial communities. Site is nested within region for both variable types, which we represent using “/” in the formula (region/site).

permanova.bc <- adonis2(formula = vegdist(comm_rarfy, method="bray") ~ StandType/TransplantedSite * SeedSourceRegion/SeedSourceOrigin, data = metadata.sub, permutations = perm)
print(permanova.bc)

## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
## 
## adonis2(formula = vegdist(comm_rarfy, method = "bray") ~ StandType/TransplantedSite * SeedSourceRegion/SeedSourceOrigin, data = metadata.sub, permutations = perm)
##                                                              Df SumOfSqs
## StandType                                                     2   0.9707
## SeedSourceRegion                                              2   0.2316
## StandType:TransplantedSite                                    3   0.6361
## StandType:SeedSourceRegion                                    3   0.5523
## StandType:TransplantedSite:SeedSourceRegion                   3   0.7417
## StandType:TransplantedSite:SeedSourceRegion:SeedSourceOrigin  5   1.1111
## Residual                                                      8   1.2006
## Total                                                        26   5.4441
##                                                                   R2      F
## StandType                                                    0.17830 3.2342
## SeedSourceRegion                                             0.04253 0.7715
## StandType:TransplantedSite                                   0.11685 1.4130
## StandType:SeedSourceRegion                                   0.10145 1.2268
## StandType:TransplantedSite:SeedSourceRegion                  0.13624 1.6475
## StandType:TransplantedSite:SeedSourceRegion:SeedSourceOrigin 0.20409 1.4808
## Residual                                                     0.22053       
## Total                                                        1.00000       
##                                                              Pr(>F)    
## StandType                                                     0.001 ***
## SeedSourceRegion                                              0.820    
## StandType:TransplantedSite                                    0.059 .  
## StandType:SeedSourceRegion                                    0.190    
## StandType:TransplantedSite:SeedSourceRegion                   0.015 *  
## StandType:TransplantedSite:SeedSourceRegion:SeedSourceOrigin  0.029 *  
## Residual                                                               
## Total                                                                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This analysis shows that, as mentioned by De Bellis et al. (2022), stand type of planting was the most important variable explaining variation in leaf bacterial community composition. There was a marginally significant variation among transplanted sites. There were also significant interactions among transplant region and site as a function of seed source region and seed source site of origin, indicating that the leaf bacterial communities of some genotypes respond differently to different environments. Taken together, this analysis indicates that environment (transplant region and site) have a greater impact on the composition of leaf bacterial communities than genotype, although there is also an interaction between genotype and environment.

Differentially abundant taxa

By analyzing alpha and beta diversity, we have determined that some groups of samples differ in their diversity and composition. But measures of diversity look at the entire community. We may also be interested in knowing which individual taxa differ in abundance among groups of samples. To address this question we can use differential abundance analysis. There are many methods that can be used for differential abundance analysis, and as for other ecological analysis there is active debate about which methods should be used to detect differences in taxa abundances between groups of samples.

Differential abundance analysis allows you to identify differentially abundant taxa between two groups. There are many methods for this type of analysis such as ALDEx2, ANCOM-BC, and DESeq2. Several recent articles have compared the performance of these differential abundance approaches when applied to microbiome data (Calgaro et al. (2020), Nearing et al. (2022)), finding that some methods perform better than others, but also finding that different methods are sensitive to different aspects of data structure and with differing statistical power and sensitivity. Here we will compare two different methods for detecting differentially abundant taxa.

Having seen a clear difference in leaf bacterial community composition between temperate and boreal forest stand types, we may ask which bacterial ASVs are differentially abundant between temperate versus boreal forest stand types.

DESeq2

The DESeq2 approach (Love et al. 2014. Genome Biol 15:550) models taxa abundances using a negative binomial distribution to detect taxa that are differentially abundant between groups. This approach was initially developed for gene expression analyses, but it is commonly used in the analysis of differential taxa abundances in microbial ecology.

# conduct DESeq analysis of ASV abundances between temperate and boreal stand types
# We add a pseudocount (+1) to each abundance value to avoid zeroes
countdata <- data.frame(ASV=colnames(comm_rarfy),t(comm_rarfy+1))
metas <- metadata.sub
metas$StandType <- relevel(as.factor(metas$StandType),ref='Boreal')
dds <- DESeqDataSetFromMatrix(countData=countdata, 
                              colData=metas,
                              design=~StandType, tidy = TRUE)

## converting counts to integer mode

dds2 <- DESeq(dds, fitType="local", quiet=TRUE)
res <- results(dds2, name="StandType_Temperate_vs_Boreal")
#coef of differently abundant ASV between temperate and boreal forest
head(res[order(res$padj),])

## log2 fold change (MLE): StandType Temperate vs Boreal 
## Wald test p-value: StandType Temperate vs Boreal 
## DataFrame with 6 rows and 6 columns
##          baseMean log2FoldChange     lfcSE      stat      pvalue        padj
##         <numeric>      <numeric> <numeric> <numeric>   <numeric>   <numeric>
## ASV_8   102.05960        6.51020  1.064569   6.11534 9.63495e-10 3.54566e-07
## ASV_37   15.81910        5.31101  0.912448   5.82062 5.86295e-09 1.07878e-06
## ASV_26   29.14083        6.04225  1.086781   5.55977 2.70128e-08 3.31357e-06
## ASV_22   36.39515        5.44804  1.002427   5.43485 5.48424e-08 5.04550e-06
## ASV_94    8.10968        4.53432  0.852386   5.31956 1.04017e-07 7.65567e-06
## ASV_118   7.15579        4.68465  0.945276   4.95586 7.20115e-07 4.41671e-05

#Number of significantly different ASVs (adjusted_p_value<0.05)
dim(res[is.na(res$padj)==F&res$padj<0.05,])[1]

## [1] 59

sig_des <- rownames(res[is.na(res$padj)==F&res$padj<0.05,])
# show the distribution of these marker ASV in a heatmap
metatoshow <- subset(metas,!metas$StandType%in%'Mixed')
datatoshow <- comm_rarfy[rownames(metatoshow),rownames(res[order(res$padj),])[1:20]]
annotation_col = data.frame(standtype = as.factor(metatoshow$StandType))
rownames(annotation_col)=rownames(datatoshow)
pheatmap(t(datatoshow),scale= "row", annotation_col = annotation_col,cluster_cols = F)

ANCOM-BC

The ANCOM-BC approach (Lin and Peddada. 2020. Nat. Comm. 35:14) detects differentially abundant taxa by analysis of compositions of microbiomes with bias correction. Nearing et al. (2022) found this method to be among the best-performing methods for detection of differentially abundant taxa.

# conduct ANCOM-BC analysis of differentially abundant taxa between boreal/temperate stand types
ASV <- otu_table(countdata[,-1],taxa_are_rows<-T)
TAX <- tax_table(as.matrix(taxo.sub))
META <- sample_data(metas)
phyloseqobj = phyloseq(ASV, TAX, META)
ancob <- ancombc(phyloseq=phyloseqobj,formula='StandType',p_adj_method = "holm", zero_cut = 0.90, lib_cut = 1000,
                                 group = 'StandType', struc_zero = TRUE, neg_lb = F,
                                 tol = 1e-5, max_iter = 100, conserve = TRUE,
                                 alpha = 0.05, global = TRUE)
res_ancob <- ancob$res
#show coefficients of differential abundance test between temperate vs. boreal
coef_ancob <- data.frame(beta=res_ancob$beta$StandTypeTemperate,se=res_ancob$se$StandTypeTemperate,W=res_ancob$W$StandTypeTemperate,
           p_val=res_ancob$p_val$StandTypeTemperate,p_adj=res_ancob$q_val$StandTypeTemperate,sign_dif=res_ancob$diff_abn$StandTypeTemperate,
           row.names = rownames(res_ancob$beta))
coef_ancob <- coef_ancob[order(coef_ancob$p_adj),]
head(coef_ancob)

##              beta        se         W        p_val        p_adj sign_dif
## ASV_44  -2.120139 0.3618189 -5.859670 4.637886e-09 3.752050e-06     TRUE
## ASV_22   2.603931 0.5096734  5.109020 3.238345e-07 2.616583e-04     TRUE
## ASV_101 -1.875018 0.4065007 -4.612584 3.976946e-06 3.209395e-03     TRUE
## ASV_37   2.438281 0.5580338  4.369415 1.245799e-05 1.004114e-02     TRUE
## ASV_11   2.171955 0.5294593  4.102214 4.092164e-05 3.294192e-02     TRUE
## ASV_69  -1.743506 0.4439932 -3.926876 8.605617e-05 6.918916e-02    FALSE

#Number of significantly different ASVs (adjusted p_value<0.05)
dim(coef_ancob[coef_ancob$p_adj<0.05,])

## [1] 5 6

sig_ancob <- rownames(coef_ancob[1:dim(coef_ancob[coef_ancob$p_adj<0.05,])[1],])
# show the significant ASVs
metatoshow <- subset(metas,!metas$StandType%in%'Mixed')
datatoshow <- comm_rarfy[rownames(metatoshow),sig_ancob]
annotation_col = data.frame(standtype = as.factor(metatoshow$StandType))
rownames(annotation_col)=rownames(datatoshow)
pheatmap(t(datatoshow),scale= "row", annotation_col = annotation_col,cluster_cols = F)

Comparing differentially abundant taxa among methods

In a comparison between boreal and temperate forest stand types, ANCOM-BC and DESeq2 detected different numbers of differentially abundant ASVs. As noted earlier, there is still debate about which methods perform best for differentially abundant taxa detection. While DESeq2 and ANCOM identified differentially abundant taxa that are potentially unique to each method, there may be some ASVs that were detected as differentially abundant using both methods. This type of cross-method comparison can help us to identify statistically robust results - if taxa are differentially abundant enough to be detected by multiple methods, they should differ strongly in their abundance between groups. Let’s look at the distribution of these differentially abundant ASVs identified by both methods.

# identify significantly differentially abundant ASVs according to both methods
sig_both <- sig_des[sig_des%in%sig_ancob]
sig_both

## [1] "ASV_11"  "ASV_22"  "ASV_37"  "ASV_44"  "ASV_101"

# show the distribution of these differentially abundant ASVs
metatoshow <- subset(metas,!metas$StandType%in%'Mixed')
datatoshow <- comm_rarfy[rownames(metatoshow),sig_both]
annotation_col = data.frame(standtype = as.factor(metatoshow$StandType))
rownames(annotation_col)=rownames(datatoshow)
pheatmap(t(datatoshow),scale= "row", annotation_col = annotation_col,cluster_cols = F)

We can also inspect the taxonomic annotations of these differentially abundant ASVs.

taxo_rarfy[sig_both,]

##         Kingdom    Phylum           Class                 Order             
## ASV_11  "Bacteria" "Proteobacteria" "Alphaproteobacteria" "Rhizobiales"     
## ASV_22  "Bacteria" "Bacteroidota"   "Bacteroidia"         "Cytophagales"    
## ASV_37  "Bacteria" "Proteobacteria" "Alphaproteobacteria" "Sphingomonadales"
## ASV_44  "Bacteria" "Proteobacteria" "Gammaproteobacteria" "Enterobacterales"
## ASV_101 "Bacteria" "Proteobacteria" "Alphaproteobacteria" "Rhizobiales"     
##         Family              Genus                            Species         
## ASV_11  "Beijerinckiaceae"  "Methylobacterium-Methylorubrum" NA              
## ASV_22  "Hymenobacteraceae" "Hymenobacter"                   NA              
## ASV_37  "Sphingomonadaceae" "Sphingomonas"                   NA              
## ASV_44  "Shewanellaceae"    "Shewanella"                     "algae/haliotis"
## ASV_101 "Beijerinckiaceae"  "Methylocella"                   NA

Final steps - clean up and save data objects and workspace

We have now completed our ecological analyses of leaf bacterial communities. You may want to save the R workspace containing all the different data objects so that you can reload it in the future without having to re-run the analyses.

## Save entire R workspace to file
save.image("Microbiome-ecological-analysis-workspace.RData")

Microbiome ecological analysis workshop

Steve Kembel, Zihui Wang, and Salix Dubois

May 2022

Microbiome ecological analysis workshop

About the data set

Install and load required packages and data

Load DADA2 results

Load metadata

Cleaning and summarizing DADA2 results

Exploring community data

Subset community to remove host DNA

Summary statistics

Remove negative controls and samples with low-quality data

Rarefaction curves

Visualize community composition

Check negative controls

Subset community to remove low sequence number samples

PCA on subset data

Data normalization for diversity analysis

Rarefaction

Check the effect of rarefaction

Community analysis

Biological questions

Visualize taxonomic composition of communities

Phylum-level taxonomic composition of samples

Phylum-level composition for each transplant site

Community diversity (alpha diversity)

Calculate ASV richness and Shannon diversity of bacterial community

Compare bacterial diversity among categories

Community composition (beta diversity)

Ordination - Principal Components Analysis (PCA)

Ordination - Non-metric Multidimensional Scaling (NMDS)

PERMANOVA

Differentially abundant taxa

DESeq2

ANCOM-BC

Comparing differentially abundant taxa among methods

Final steps - clean up and save data objects and workspace