Data overview

Welcome

This tutorial is developed as a part developed for NCBI codeathon (see details) a hackathon-style event focused on rapid innovation. While we encourage you to explore and adapt this code, please be aware that NCBI does not provide ongoing support for it. For general questions about NCBI software and tools, please visit: NCBI Contact Page.

The tutorial provides the framework developed by integrating R/Bioconductor packages to analyse Antimicrobial Resistance Gene (ARG) data derived from metagenome sequence data. For the purpose of this tutorial, we are using Lee et al. (2023) dataset for analysis in R/Bioc. See data summary. The data is pre-formated into TreeSummarizedExperiment object which can be downloaded as R object (.rds) file from here.

Note: If you are interested know about TreeSummarizedExperiments

Install and load required packages

packages <- c('mia', 'miaViz', 'scater', 'ComplexHeatmap', 'pheatmap', 'dplyr')

# Get packages that are already installed installed
packages_already_installed <- packages[ packages %in% installed.packages() ]

# Get packages that need to be installed
packages_need_to_install <- setdiff( packages, packages_already_installed )

# Loads BiocManager into the session. Install it if it not already installed.
if( !require("BiocManager") ){
    install.packages("BiocManager")
    library("BiocManager")
}

# If there are packages that need to be installed, installs them with BiocManager
if( length(packages_need_to_install) > 0 ) {
   install(packages_need_to_install, ask = FALSE)
}

# load packages
lapply(packages, require, character.only = TRUE)
[[1]]
[1] TRUE

[[2]]
[1] TRUE

[[3]]
[1] TRUE

[[4]]
[1] TRUE

[[5]]
[1] TRUE

[[6]]
[1] TRUE

Load and explore data

tse <- load('../data/Lee2023.rda')
tse <- Lee2023
tse
class: TreeSummarizedExperiment 
dim: 4182 5195 
metadata(1): Country
assays(1): relabundance
rownames(4182): 362 363 ... 23727 23733
rowData names(13): sgbid sgbname ... assigned_pathogen_name
  assigned_pathogen_taxid
colnames(5195): AsnicarF_2017__MV_FEM1_t1Q14
  AsnicarF_2017__MV_FEM2_t1Q14 ... ZellerG_2014__CCMD74930188ST-21-0
  ZellerG_2014__CCMD79987997ST-21-0
colData names(22): Tier_1.Exclusion_before_analysis
  Tier_2.Recover_all_40_SCGs ... antibiotic_exposure_status_descriptive
  antibiotic_current_use_binary
reducedDimNames(0):
mainExpName: NULL
altExpNames(2): read assembly
rowLinks: NULL
rowTree: NULL
colLinks: NULL
colTree: NULL

The data is of class TreeSummarizedExperiment (TSE) with following data containers:

  • Assays: A two-dimensional matrix with abundance data. Columns represent samples and row represents features (microbial taxa, in this case).

  • rowData: This is data about the features present in rows of assays. In this case we have taxonomic classification of our microbial taxa.

  • colData: This is data about the samples

  • Alternative experiments (altExp): Any alternative counts table or experiments are stored in this slot. In our case we have added another TSE object with abudance of antibiotic resistance genes in our samples. In our case there are two alternative experiments namely, ‘read’ and ‘assembly’. We will be focussing on ‘read’ based experiment.

Note: Read more about data containers on our book, Orchestrating Microbiome Analysis with Bioconductor (Lahti et al. 2021).

In this tutorial we will be focussing on analysing the AMR data, present in altExp container.

altExp(tse, 'read')
class: TreeSummarizedExperiment 
dim: 360 5195 
metadata(0):
assays(1): abundances
rownames(360): AAC(1)-I/AAC(3)-Ia AAC(3)-IIa/IIc ... dfrG/K rob
rowData names(2): antibiotic_class resistance_mechanism
colnames(5195): AsnicarF_2017__MV_FEM1_t1Q14
  AsnicarF_2017__MV_FEM2_t1Q14 ... ZellerG_2014__CCMD74930188ST-21-0
  ZellerG_2014__CCMD79987997ST-21-0
colData names(0):
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
rowLinks: NULL
rowTree: NULL
colLinks: NULL
colTree: NULL

As this is also an TSE object, it has the sample containers like assays, rowData, colData, etc which can be accessed by different functions.

Assay: contains the abundances of AMR genes in samples

head(assay(altExp(tse, 'read'))[,1:3])
                   AsnicarF_2017__MV_FEM1_t1Q14 AsnicarF_2017__MV_FEM2_t1Q14
AAC(1)-I/AAC(3)-Ia                            0                            0
AAC(3)-IIa/IIc                                0                            0
AAC(3)-IV                                     0                            0
AAC(3)-VIa                                    0                            0
AAC(6)-clusterB                               0                            0
AAC(6)-clusterC                               0                            0
                   AsnicarF_2017__MV_FEM3_t1Q14
AAC(1)-I/AAC(3)-Ia                            0
AAC(3)-IIa/IIc                                0
AAC(3)-IV                                     0
AAC(3)-VIa                                    0
AAC(6)-clusterB                               0
AAC(6)-clusterC                               0

colData: contains information about samples

head(colData(altExp(tse, 'read')))
DataFrame with 6 rows and 0 columns

It does not have any colData so lets add colData from our original TSE object

colData(altExp(tse, 'read')) <- colData(tse)
head(colData(altExp(tse, 'read'))[,5:9])
DataFrame with 6 rows and 5 columns
                                     Study     Sample_ID Sample_assembly_ID
                               <character>   <character>        <character>
AsnicarF_2017__MV_FEM1_t1Q14 AsnicarF_2017 MV_FEM1_t1Q14      MV_FEM1_t1Q14
AsnicarF_2017__MV_FEM2_t1Q14 AsnicarF_2017 MV_FEM2_t1Q14      MV_FEM2_t1Q14
AsnicarF_2017__MV_FEM3_t1Q14 AsnicarF_2017 MV_FEM3_t1Q14      MV_FEM3_t1Q14
AsnicarF_2017__MV_FEM4_t1Q14 AsnicarF_2017 MV_FEM4_t1Q14      MV_FEM4_t1Q14
AsnicarF_2017__MV_FEM4_t2Q15 AsnicarF_2017 MV_FEM4_t2Q15      MV_FEM4_t2Q15
AsnicarF_2017__MV_FEM5_t1Q14 AsnicarF_2017 MV_FEM5_t1Q14      MV_FEM5_t1Q14
                              NumReads   NumBases
                             <integer>  <numeric>
AsnicarF_2017__MV_FEM1_t1Q14  27277064 2722301417
AsnicarF_2017__MV_FEM2_t1Q14  19883080 1983449695
AsnicarF_2017__MV_FEM3_t1Q14  28539448 2847418433
AsnicarF_2017__MV_FEM4_t1Q14  19038206 1899312825
AsnicarF_2017__MV_FEM4_t2Q15  50648184 5024125151
AsnicarF_2017__MV_FEM5_t1Q14  23822026 2376720104

rowData: contains information about AMRs (features in row of assay)

head(rowData(altExp(tse, 'read')))
DataFrame with 6 rows and 2 columns
                   antibiotic_class resistance_mechanism
                        <character>          <character>
AAC(1)-I/AAC(3)-Ia   Aminoglycoside         Inactivation
AAC(3)-IIa/IIc       Aminoglycoside         Inactivation
AAC(3)-IV            Aminoglycoside         Inactivation
AAC(3)-VIa           Aminoglycoside         Inactivation
AAC(6)-clusterB      Aminoglycoside         Inactivation
AAC(6)-clusterC      Aminoglycoside         Inactivation

Lets try to make some plots and explore the quality of data.

Abundance

First, lets transform the data into relative abundance

altExp(tse, 'read') <- transformAssay(altExp(tse, 'read'),assay.type = 'abundances', method = 'relabundance')

Now, lets plot the jitter plot based on relative abundance data, similar to the one presented at (Salosensaari et al. 2021) (Supplementary Fig.1), can be visualized as follows:

plotAbundanceDensity(
  altExp(tse, 'read'), layout = "jitter",
  assay.type = "relabundance",
  n = 20, point_size=1, point.shape=19,
  point.alpha=0.1) +
  scale_x_log10(label=scales::percent)
Warning in scale_x_log10(label = scales::percent): log-10 transformation
introduced infinite values.

The relative abundance values for the top-10 AMR genes can be visualized as a density plot over a log-scaled axis, with “Westernized” variable indicated by colors:

plotAbundanceDensity(
  altExp(tse, 'read'), layout = "density",
  assay.type = "relabundance",
  n = 10, colour.by="Westernized",
  point.alpha=1/10) +
  scale_x_log10()
Warning in scale_x_log10(): log-10 transformation introduced infinite values.
Warning: Removed 8606 rows containing non-finite outside the scale range
(`stat_density()`).

This plot gives an idea about the abundances of some AMRs like tetQ are more in westernized population.

Lets agglomerate data based on anitbiotic class present in rowData and see the same plots at antibiotics class level:

# Agglomerate
altExp(tse, 'read_abclass') <- agglomerateByVariable(
                                  x = altExp(tse, 'read'),
                                  by='rows',
                                  rowData(altExp(tse,'read'))$antibiotic_class
                                  )
# plot
plotAbundanceDensity(
  altExp(tse, 'read_abclass'), layout = "density",
  assay.type = "relabundance",
  n = 10, colour.by="Westernized",
  point.alpha=1/10) +
  scale_x_log10()
Warning in scale_x_log10(): log-10 transformation introduced infinite values.
Warning: Removed 12041 rows containing non-finite outside the scale range
(`stat_density()`).

Diversity and Similarity

Alpha diversity

Alpha diversity can be estimated with addAlpha() wrapper function that interact with other packages implementing the calculation, such as vegan (Oksanen et al. 2020)

These functions calculate the given indices, and add them to the colData slot of the SummarizedExperiment object with the given name.

# calculate observed alpha diversity and add it to colData slot
altExp(tse, 'read') <- addAlpha(
  altExp(tse, 'read'), assay.type = "abundances", index = "shannon_diversity", name = "shannon_diversity",
  detection = 10)

# plot
plotColData(
    altExp(tse, 'read'),
    "shannon_diversity",
    "Disease",
    colour_by = "Disease") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab('Host status')

Unsupervised ordinations

Unsupervised ordination methods analyze variation in the data without additional information on covariates or other supervision of the model. Among the different approaches, Multi-Dimensional Scaling (MDS) and non-metric MDS (NMDS) can be regarded as the standard. They are jointly referred to as PCoA.

# Run PCoA on relabundance assay with Bray-Curtis distances
altExp(tse, 'read') <- runMDS(
    altExp(tse, 'read'),
    FUN = vegan::vegdist,
    method = "bray",
    assay.type = "relabundance",
    name = "MDS_bray")

Sample dissimilarity can be visualized on a lower-dimensional display (typically 2D) using the plotReducedDim() function from the scater package. This also provides tools to incorporate additional information encoded by color, shape, size and other aesthetics.

# Create ggplot object
p <- plotReducedDim(altExp(tse, 'read'), "MDS_bray", colour_by = "Disease")

# Calculate explained variance
e <- attr(reducedDim(altExp(tse, 'read'), "MDS_bray"), "eig")
rel_eig <- e / sum(e[e > 0])

# Add explained variance for each axis
p <- p + labs(
    x = paste("PCoA 1 (", round(100 * rel_eig[[1]], 1), "%", ")", sep = ""),
    y = paste("PCoA 2 (", round(100 * rel_eig[[2]], 1), "%", ")", sep = "")
    )

p

A few combinations of beta diversity metrics and assay types are typically used. For instance, Bray-Curtis dissimilarity and Euclidean distance are often applied to the relative abundance and the clr assays, respectively. Besides beta diversity metric and assay type, the PCoA algorithm is also a variable that should be considered. For more informatition read Unsupervised ordination.

Heatmaps

# tranform assay with CLR
altExp(tse, "read_abclass") <- transformAssay(altExp(tse, "read_abclass"), MARGIN = "samples", method = "clr", assay.type = "relabundance", pseudocount=TRUE)
A pseudocount of 1.12661374804138e-05 was applied.
altExp(tse, "read_abclass") <- transformAssay(altExp(tse, "read_abclass"), MARGIN = "features", method = "standardize", assay.type = "clr", pseudocount=FALSE, name="clr_z")

# store abudances into matrix
mat <- assay(altExp(tse, "read_abclass"), "clr_z")

# prepare column annotation data
col_ann <- data.frame(
  Westernized = colData(altExp(tse, 'read_abclass'))[colnames(mat),'Westernized'],
  Gender = colData(altExp(tse, 'read_abclass'))[colnames(mat),'Gender'],
  Disease = colData(altExp(tse, 'read_abclass'))[colnames(mat),'Disease']
)
rownames(col_ann) <- rownames(colData(altExp(tse, 'read_abclass'))[colnames(mat),])

pheatmap(mat,annotation_col = col_ann,show_colnames = F)

Antibiotic Susceptibility (NDARO)

National Database of Antibiotic Resistant Organisms (NDARO) is a collaborative, cross-agency, centralized hub for researchers to access AMR data to facilitate real-time surveillance of pathogenic organisms. NDARO hosts Antibiotic Susceptibility Test (AST) browser with data on laboratory typing and antibioti ausceptibility from clinical and environmental isolates. The data provides a good opportunity to study the coverage of antibiotic resistant/susceptible organisms in human microbiome context.

Here, we will integrate the AST data with our metagenome data analysis.

df.asts <- read.table(file = gzfile('../data/asts.tsv.gz'),header = T,sep = '\t')

Let’s explore Clostridioides difficile infection (CDI) infection samples from our metagenomic data.

# Subset CDI samples
tse.clean <- tse[,!is.na(colData(tse)$Disease)]
tse.cdi <- tse.clean[,colData(tse.clean)$Disease == 'CDI']

# take list of species present in our samples
tse.taxa.list <- rowData(tse.cdi)$Species
# clean species names
tse.taxa.list <- tse.taxa.list %>% gsub('s__','',.) %>% gsub('_', ' ', .)
# add cleaned specie names in TSE
rowData(tse.cdi)$Species_cleaned <- tse.taxa.list

# subset tse for species present in AST and our samples
tse.sub <- tse.cdi[tse.taxa.list %in% df.asts$Scientific.name,]

# create IDs to species names list
id2species <- rowData(tse.sub)$Species
names(id2species) <- as.numeric(rownames(rowData(tse.sub)))

# check prevalence of species
# get prevalent
prev.df <- getPrevalence(tse.sub,assay.type='relabundance', detection = 1/100, sort = TRUE, as.relative = TRUE)

prev.df <- data.frame(
  Prevalence = prev.df,
  Species = id2species[names(prev.df)]
)

head(prev.df, 5)
     Prevalence                     Species
4285  0.2307692 s__Clostridioides_difficile
9699  0.1538462   s__Pseudomonas_aeruginosa
9675  0.1538462       s__Pseudomonas_putida
9674  0.1538462       s__Pseudomonas_putida
9672  0.1538462       s__Pseudomonas_putida

As we can see, Clostridioides difficile, Pseudomonas aeruginosa and Pseudomonas putida are prevalent in our samples with CDI.

Let’s now explore the data available in AST browser about these species.

df.asts.sub <- df.asts[df.asts$Organism.group %in% c('Clostridioides difficile', 'Pseudomonas aeruginosa', 'Pseudomonas putida'),]

# Isolation types
ggplot(data = df.asts.sub,mapping = aes(x = Isolation.type, fill = Organism.group)) + geom_histogram(stat="count", position = 'dodge') + theme_bw()
Warning in geom_histogram(stat = "count", position = "dodge"): Ignoring unknown
parameters: `binwidth`, `bins`, and `pad`

The plot suggests that Pseudomonas aeruginosa has more evidences from clinical studies about the antibiotics resistant/susceptibility. Let’s now explore different antibiotics options available for Pseudomonas aeruginosa.

df.asts.sub.pa <- df.asts.sub[df.asts.sub$Organism.group == 'Pseudomonas aeruginosa',]

length(unique(df.asts.sub.pa$Antibiotic))
[1] 73
# select top 5 highly studied antibiotics
top5 <- sort(table(df.asts.sub.pa$Antibiotic),decreasing = T)[1:5]
top5

          ciprofloxacin              gentamicin              tobramycin 
                    548                     532                     520 
               cefepime piperacillin-tazobactam 
                    514                     463 
# % evidences covered by top5 antibiotics
(sum(top5)/dim(df.asts.sub.pa)[1]) * 100
[1] 40.7302

Let’s try to explore the phenotypes of this anitbiotics for Pseudomonas aeruginosa:

df.asts.sub.pa.top5 <- df.asts.sub.pa[df.asts.sub.pa$Antibiotic %in% names(top5),]

ggplot(data = df.asts.sub.pa.top5,mapping = aes(x = Antibiotic, fill = Antibiotic)) + geom_histogram(stat = 'count') + facet_wrap(~Resistance.phenotype) + theme_bw() + theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
`binwidth`, `bins`, and `pad`

The analysis suggests that, Pseudomonas aeruginosa is highly resistant to ciprofloxacin and succeptible to cefepime. But there are certain strains which can be both resistant/succeptible to both the anitbiotics.