R package for annotation of glycans in mass spectrometry data.
This is a simple tutorial for the annotation of glycans in a single LC-MS/MS file in R using XCMS and GlycoAnnotateR. For a more detailed tutorial, including correspondence analysis and isotope detection, see this tutorial.
The data used in this tutorial is an enzyme digest of fucoidan, a sulfated fucan from brown algae. The data can be downloaded from the MassIVE server (f.MSV000095410/peak/MS31_20240618_newMpyrifera_digest_SIMddMS2_2.5uL_85%EtOH_11.mzML).
library(GlycoAnnotateR)
library(tidyverse)
library(xcms)
library(ggplot2)
library(ggrepel)
Now we import the mzML file as an OnDiskMSnExp object.
#read in data
data <- readMSData(files = 'MS31_20240618_newMpyrifera_digest_SIMddMS2_2.5uL_85%EtOH_11.mzML',
mode = 'onDisk')
Peak picking is done with the CentWave algorithm, for details see Tautenhahn et al., 2008.
Important to note: each dataset will require parameters appropriate for your data! If no peaks are picked, then you will not be able to annotate any peaks! See this tutorial for pick peaking with CentWave.
#create parameter object
cwp <- CentWaveParam(ppm = 10, peakwidth = c(10, 60), integrate = 2,
snthresh = 1)
#pick peaks
pks <- findChromPeaks(data, cwp)
Peak picking should always be optimised and checked extensively, as this will have a massive impact on the downstream data interpretation. Here, we will check an example chromatogram for a hexasulfated trifucan oligomer We will quickly use glycoPredict from GlycoAnnotateR to get the correct m/z value.
#calculate m/z values
#first make parameter object
param = glycoPredictParam(dp = c(3,3), #degree of polymerisation
polarity = 'neg', #polarity
adducts = 'H', # adducts
modifications = c('sulfate', #modifications
'deoxy'),
double_sulfate = T, #allow double sulfation
nmod_max = 3) #allow 3 modifications per monomer i.e. 2 sulfates, 1 deoxy
#make table
masses_trifucan <- glycoPredict(param)
#inspect table
#there are 82 possible ions
head(masses_trifucan)
## # A tibble: 6 × 10
## dp mass formula `IUPAC name` glyconnect_id glytoucan_id ion mz
## <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 3 456. C18H32O13 DeoxyHex3 none none [M-H… 455.
## 2 3 472. C18H32O14 Hex1 DeoxyHex2 none none [M-H… 471.
## 3 3 488. C18H32O15 Hex2 DeoxyHex1 420 G60232GL [M-H… 487.
## 4 3 504. C18H32O16 Hex3 234 G99375FM [M-H… 503.
## 5 3 536. C18H32O16S DeoxyHex3 Sulfa… none none [M-H… 535.
## 6 3 552. C18H32O17S Hex1 DeoxyHex2 … none none [M-H… 551.
## # ℹ 2 more variables: ion_formula <chr>, charge <chr>
#get mz value for [M-2H]-2 adduct
mz_trifucan = masses_trifucan$mz[masses_trifucan$`IUPAC name` == 'DeoxyHex3 Sulfate6'&
masses_trifucan$ion == '[M-2H]-2']
#extract chromatogram +- 4 ppm
chr_trifucan <- chromatogram(pks,
mz = c(mz_trifucan-ppm_to_mz(mz_trifucan, 4),
mz_trifucan+ppm_to_mz(mz_trifucan, 4)))
The picked peak is indicated by the shaded grey area in the extracted ion chromatogram.
plot(chr_trifucan[[1]])
We will now move forward with extracting and annotating our peak
table table to identify putative oligosaccharide ions. Here, we will
first create the glycoPredictParam
object, then make a
table of potential oligosaccharide compositions. We will filter this for
charges of -1 and -2, and then use it to annotate the peak table. The
charge filtering may not be appropriate depending on your instrument and
dataset.
We will extract a dataframe with the picked chromatographic peaks
#extract table and make dataframe
pks_df <- chromPeaks(pks) %>% as.data.frame()
#inspect table
#54 peaks were identified in this example
str(pks_df)
## 'data.frame': 54 obs. of 11 variables:
## $ mz : num 243 243 243 243 243 ...
## $ mzmin : num 243 243 243 243 243 ...
## $ mzmax : num 243 243 243 243 243 ...
## $ rt : num 39 38.8 39 40.4 42.7 ...
## $ rtmin : num 36.4 36.4 36.4 36.5 36.8 ...
## $ rtmax : num 41.9 45 45 43.5 48.8 ...
## $ into : num 100630 345758 345758 280727 462989 ...
## $ intb : num 100316 343758 343758 276292 461956 ...
## $ maxo : num 29487 91344 91344 63715 88028 ...
## $ sn : num 47 71 71 28 105 ...
## $ sample: num 1 1 1 1 1 1 1 1 1 1 ...
head(pks_df)
## mz mzmin mzmax rt rtmin rtmax into intb
## CP01 243.0652 243.0635 243.0666 38.95620 36.39610 41.92055 100630.0 100315.5
## CP02 243.1792 243.1787 243.1795 38.82145 36.39610 45.01966 345758.0 343758.5
## CP03 243.1792 243.1785 243.1795 38.95620 36.39610 45.01966 345758.0 343758.5
## CP04 243.0873 243.0868 243.0878 40.43844 36.53084 43.53754 280726.8 276292.2
## CP05 242.9601 242.9598 242.9606 42.72905 36.80034 48.79251 462988.8 461956.4
## CP06 242.9603 242.9598 242.9610 52.00861 47.17552 62.14229 415453.0 414178.8
## maxo sn sample
## CP01 29486.51 47 1
## CP02 91343.73 71 1
## CP03 91343.73 71 1
## CP04 63714.87 28 1
## CP05 88027.51 105 1
## CP06 43843.45 52 1
Create the parameter object for calculation of theoretical compositions.
#create the glycoPredictParam object
gpp <- glycoPredictParam(
#degree of polymerisation range
dp = c(1,10),
#ionisation polarity
polarity = "neg",
#ionisation type (affects adduct types)
ion_type = "ESI",
#scan range during MS1 acquisition
scan_range = c(175,1400),
#expected modifications
modifications = c('sulfate', 'deoxy'),
#double sulfation is possible
double_sulfate = T,
#maximum average number of modifications per monomer
nmod_max = 3,
#adducts
adducts = "H")
gpp
## An object of class "glycoPredictParam"
## Slot "dp":
## [1] 1 10
##
## Slot "polarity":
## [1] "neg"
##
## Slot "scan_range":
## [1] 175 1400
##
## Slot "pent_option":
## [1] FALSE
##
## Slot "modifications":
## [1] "sulfate" "deoxy"
##
## Slot "nmod_max":
## [1] 3
##
## Slot "double_sulfate":
## [1] TRUE
##
## Slot "label":
## [1] "none"
##
## Slot "ion_type":
## [1] "ESI"
##
## Slot "format":
## [1] "long"
##
## Slot "adducts":
## [1] "H"
##
## Slot "naming":
## [1] "IUPAC"
##
## Slot "glycan_linkage":
## [1] "none"
##
## Slot "modification_limits":
## [1] "none"
We can now use parameters defined in the
glycoPredictParam
object to ‘predict’ (or rather calculate)
all possible glycans. You can directly run glycoAnnotate
to
annotate your peak list using the parameters, which will call
glycoPredict
in the background, but here we will
demonstrate the output of glycoPredict
by running it
separately, as we want to introduce an addition filtering for charge.
You can see that the charge filtered table contains ~1000 ions, while
the original table contains >6000.
#create a table of theoretical glycans and their ions
pred_df <- glycoPredict(param = gpp)
#inspect table
str(pred_df)
## tibble [6,276 × 10] (S3: tbl_df/tbl/data.frame)
## $ dp : num [1:6276] 1 1 1 1 1 2 2 2 2 2 ...
## $ mass : num [1:6276] 180 244 260 324 340 ...
## $ formula : chr [1:6276] "C6H12O6" "C6H12O8S" "C6H12O9S" "C6H12O11S2" ...
## $ IUPAC name : chr [1:6276] "Hex1" "DeoxyHex1 Sulfate1" "Hex1 Sulfate1" "DeoxyHex1 Sulfate2" ...
## $ glyconnect_id: chr [1:6276] "228" "none" "none" "none" ...
## $ glytoucan_id : chr [1:6276] "G49874UX" "none" "none" "none" ...
## $ ion : chr [1:6276] "[M-H]-" "[M-H]-" "[M-H]-" "[M-H]-" ...
## $ mz : num [1:6276] 179 243 259 323 339 ...
## $ ion_formula : chr [1:6276] "C6H11O6" "C6H11O8S" "C6H11O9S" "C6H11O11S2" ...
## $ charge : chr [1:6276] "-1" "-1" "-1" "-1" ...
#charge filter
pred_df_filt <- pred_df %>% filter(charge %in% c('-1', '-2'))
#inspect filtered table
str(pred_df_filt)
## tibble [995 × 10] (S3: tbl_df/tbl/data.frame)
## $ dp : num [1:995] 1 1 1 1 1 2 2 2 2 2 ...
## $ mass : num [1:995] 180 244 260 324 340 ...
## $ formula : chr [1:995] "C6H12O6" "C6H12O8S" "C6H12O9S" "C6H12O11S2" ...
## $ IUPAC name : chr [1:995] "Hex1" "DeoxyHex1 Sulfate1" "Hex1 Sulfate1" "DeoxyHex1 Sulfate2" ...
## $ glyconnect_id: chr [1:995] "228" "none" "none" "none" ...
## $ glytoucan_id : chr [1:995] "G49874UX" "none" "none" "none" ...
## $ ion : chr [1:995] "[M-H]-" "[M-H]-" "[M-H]-" "[M-H]-" ...
## $ mz : num [1:995] 179 243 259 323 339 ...
## $ ion_formula : chr [1:995] "C6H11O6" "C6H11O8S" "C6H11O9S" "C6H11O11S2" ...
## $ charge : chr [1:995] "-1" "-1" "-1" "-1" ...
We will use the charge filtered table to annotate our peaks at 3.5
ppm error. If you want to ‘collapse’ annotations in cases where there
are multiple annotations per peak (and your peaks are therefore
duplicated), use the collapse = TRUE
option.
#annotate peaks
pks_df_annotated <- glycoAnnotate(data = pks_df,
pred_table = pred_df_filt,
error = 3.5)
## Starting annotation with predictions against data
#keep only annotated peaks in the table
pks_df_onlyannotated <- pks_df_annotated %>% drop_na(dp)
#inspect annotated peaks
#11 sulfated fucan oligosaccharides were annotated
pks_df_onlyannotated
## dp mass formula IUPAC name glyconnect_id glytoucan_id
## 1 1 244.0253 C6H12O8S DeoxyHex1 Sulfate1 none none
## 2 2 470.0400 C12H22O15S2 DeoxyHex2 Sulfate2 none none
## 3 2 629.9536 C12H22O21S4 DeoxyHex2 Sulfate4 none none
## 4 2 629.9536 C12H22O21S4 DeoxyHex2 Sulfate4 none none
## 5 2 629.9536 C12H22O21S4 DeoxyHex2 Sulfate4 none none
## 6 3 776.0116 C18H32O25S4 DeoxyHex3 Sulfate4 none none
## 7 3 855.9684 C18H32O28S5 DeoxyHex3 Sulfate5 none none
## 8 3 855.9684 C18H32O28S5 DeoxyHex3 Sulfate5 none none
## 9 4 1002.0263 C24H42O32S5 DeoxyHex4 Sulfate5 none none
## 10 4 1081.9831 C24H42O35S6 DeoxyHex4 Sulfate6 none none
## 11 3 935.9252 C18H32O31S6 DeoxyHex3 Sulfate6 none none
## ion mz_pred ion_formula charge mz mzmin mzmax rt
## 1 [M-H]- 243.0180 C6H11O8S -1 243.0178 243.0176 243.0179 59.18481
## 2 [M-2H]-2 234.0127 C12H20O15S2 -2 234.0124 234.0114 234.0132 86.78763
## 3 [M-2H]-2 313.9695 C12H20O21S4 -2 313.9688 313.9686 313.9690 406.92737
## 4 [M-2H]-2 313.9695 C12H20O21S4 -2 313.9688 313.9686 313.9690 405.62887
## 5 [M-2H]-2 313.9695 C12H20O21S4 -2 313.9688 313.9686 313.9690 406.60312
## 6 [M-2H]-2 386.9985 C18H30O25S4 -2 386.9984 386.9975 386.9996 471.60668
## 7 [M-2H]-2 426.9769 C18H30O28S5 -2 426.9768 426.9751 426.9791 588.57740
## 8 [M-2H]-2 426.9769 C18H30O28S5 -2 426.9768 426.9760 426.9777 639.01284
## 9 [M-2H]-2 500.0059 C24H40O32S5 -2 500.0063 500.0047 500.0078 678.05256
## 10 [M-2H]-2 539.9843 C24H40O35S6 -2 539.9852 539.9846 539.9859 798.39528
## 11 [M-2H]-2 466.9553 C18H30O31S6 -2 466.9552 466.9537 466.9563 837.27402
## rtmin rtmax into intb maxo sn sample
## 1 48.92726 66.35989 6848117.8 6847884.8 1215479.88 4292 1
## 2 71.87560 94.05483 520024.9 511757.4 46581.35 24 1
## 3 387.41287 420.44965 3044450.5 3014906.9 233104.44 68 1
## 4 387.41287 420.44965 3044450.5 3014906.9 233104.44 68 1
## 5 387.41287 420.44965 3044450.5 3014906.9 233104.44 68 1
## 6 459.93640 483.94111 272196.5 179606.8 30914.22 5 1
## 7 562.01033 623.75556 4376145.6 4337223.1 211686.11 78 1
## 8 623.75556 650.73222 1246159.6 1229240.2 81602.15 30 1
## 9 654.02082 696.80118 1628744.5 1614033.3 92367.78 69 1
## 10 768.32796 816.37926 5721691.4 5678748.6 411292.03 141 1
## 11 825.11310 853.21116 1084009.9 1079160.8 117393.16 114 1
## mass_error_ppm
## 1 0.8001308
## 2 1.3653844
## 3 2.2901486
## 4 2.2933373
## 5 2.2868096
## 6 0.2535610
## 7 0.3315664
## 8 0.2715139
## 9 0.9307090
## 10 1.6383194
## 11 0.2080966
We will use the glycoMS2Extract
function from
GlycoAnnotateR to extract the spectra. This function is basically a
wrapper for XCMS functions for MS2 spectra extraction, with a layer to
only extract those spectra associated with annotated peaks or
features
ms2 <- glycoMS2Extract(
#object with MS1 and MS2 data
data_ms2 = data,
#processed MS1 data object (must have peaks)
data_features = pks,
#annotated feature table
#must contain an 'mz' and 'rt' column
annotations = pks_df_onlyannotated,
#processing level here is 'peaks' (would be features after correspondence)
processing_level = 'peaks')
#see how many spectra were extracted in total (here = 961)
length(ms2@backend@spectraData@rownames)
## [1] 961
We will use the functions from Spectra
to average the
MS2 spectra.
#combine peaks with spectra
ms2_combine <- Spectra::combinePeaks(ms2, tolerance = 0.005, ppm = 3)
#combine spectra by peak within each file across scans
ms2_mean <- Spectra::combineSpectra(ms2_combine, f = ms2$peak_index,
tolerance = 0.005, ppm = 3,
peaks = 'intersect', minProp = 0.2,
intensityFun = max)
## Backend of the input object is read-only, will change that to an 'MsBackendMemory'
#see how many spectra there are after averaging - 11
length(ms2_mean)
## [1] 11
Now we will convert MS2 data into a dataframe format so its easier to work with.
#build empty df
ms2_df <- data.frame(precursorMz = as.numeric(),
rt = as.numeric(),
mz = as.numeric(),
intensity = as.numeric(),
peak_id = as.numeric())
#fill in df
for (i in 1:length(ms2_mean)){
mz = sprintf("%.4f",mz(ms2_mean)[i] %>% unlist()) %>%
as.numeric()
intensity = intensity(ms2_mean)[i] %>% unlist()
intensity = (intensity / max(intensity)) * 100 #normalise intensity with respect to maximum
rt = rep(rtime(ms2_mean)[i], length(mz))
precursorMz = rep(precursorMz(ms2_mean)[i], length(mz))
peak_id = rep(ms2_mean$peak_id[i], length(mz))
temp <- data.frame(precursorMz = precursorMz,
rt = rt, mz = mz, intensity = intensity,
peak_id = peak_id)
ms2_df <- rbind(temp, ms2_df)
}
#inspect dataframe
head(ms2_df)
## precursorMz rt mz intensity peak_id
## 1 466.9558 825.7071 67.6220 49.40483 CP54
## 2 466.9558 825.7071 96.9588 8.52363 CP54
## 3 466.9558 825.7071 97.2079 31.28682 CP54
## 4 466.9558 825.7071 293.1770 10.59588 CP54
## 5 466.9558 825.7071 307.0408 25.18584 CP54
## 6 466.9558 825.7071 347.0196 100.00000 CP54
We will begin by annotating the precursors.
ms2_df_annotP <- glycoAnnotate(data = ms2_df,
#this should generally be higher than for MS1
error = 5,
#table for annotation
pred_table = pred_df_filt,
#name of mz column to annotate)
mz_column = 'precursorMz')
## Starting annotation with predictions against data
#make column named 'annotation' with name, ion and dp pasted together
#separated by a colon
ms2_df_annotP <- ms2_df_annotP %>%
mutate(annotations = paste(`IUPAC name`, `ion`, dp, sep = ':'))
All precursors should be assigned an annotation (as we only extracted MS2 spectra for annotated precursors).
#check all precursors annotated (statement should be FLASE)
any(is.na(ms2_df_annotP$dp))
## [1] FALSE
Now we can annotate the fragment ions in MS2 spectra. The
glycoPredictParam
parameters automatically change depending
on the precursor annotation. Only fragments resulting from glycosidic
bond breakage, dehydration, or loss of modified groups
(e.g. desulfation) can currently be annotated by GlycoAnnotateR.
#make a vector of the precursor annotations
#the vector should be as long as the number of peaks / the length
#of the ms2 object after averaging (i.e. 11 in this case)
ms2_df_annotP_unique <- ms2_df_annotP %>%
distinct(precursorMz, annotations, rt, peak_id)
#rename annotations column to precursorAnnotations
names(ms2_df_annotP)[names(ms2_df_annotP) == 'annotations'] <- 'precursorAnnotations'
#remove other columns added during annotation
ms2_df_annotP <- ms2_df_annotP %>%
select(all_of(c(names(ms2_df), 'precursorAnnotations')))
#annotate the fragment ions based on the precursor annotations
ms2_df_annotF <- glycoMS2Annotate(
#vector of the precursor annotations
precursorAnnotations = ms2_df_annotP_unique$annotations,
#separator in between name, ion and dp of precursor annotations
precursorAnnotations_sep = ':',
#ms2spectra dataframe with annotated precursors
ms2spectra = as.data.frame(ms2_df_annotP),
#ion type
ion_type = 'ESI',
#error
error = 5, error_units = 'ppm',
#allow double sulfate and dehydrations and 3 annotations per monomer
double_sulfate = T, nmod_max = 3, dehydrations = T)
#ions with multiple annotations will automatically be 'collapsed'
#in the 'annotations' column
#ions with one annotation will not be collapsed, so we will now
#coalesce these columns
#and also annotate sulfate
ms2_df_annotF <- ms2_df_annotF %>%
mutate(annotations = case_when(is.na(annotations) & !is.na(ion)~
paste(`IUPAC name`, ion, sep = ':'),
is.na(annotations) & is.na(ion) ~ NA,
TRUE ~ annotations),
annotations = case_when(mz > 96.95 & mz < 97 ~ 'HSO4-',
TRUE ~ annotations))
Annotated spectra can be plotted with ggplot.
#plot spectra with annotated fragments
#you can play around with how you plot this
peakIDs <- ms2_df_annotF$peak_id %>% unique()
for(i in 1:length(peakIDs)){
pkid <- peakIDs[i]
df <- ms2_df_annotF %>%
#filter for peak id
filter(peak_id == !!pkid) %>%
#filter for 1% relative abundance
filter(intensity > 1) %>%
#add new line break where there are commas for prettier plotting
mutate(annotations = gsub('\\,', ',\n', annotations))
p <- ggplot(df) +
#plot lines
geom_segment(aes(x = mz, xend = mz, y = 0, yend= intensity)) +
#add annotations
geom_text_repel(aes(label = annotations, x = mz, y = intensity),
size = 2, hjust = 0.5, vjust = 0.5, nudge_y = 1,
segment.size = 1, segment.colour = 'lightblue',
segment.linetype = 'dashed', min.segment.length = 0.2) +
#add mz values
geom_text(mapping = aes(x = mz, y = intensity + 0.5, label = round(mz, 4)),
size = 2) +
#set labels
labs(x = expression(italic("m/z")), y = "Normalised intensity (a.u.)",
title = paste0(df$precursorAnnotations[1], ' at ',
round(df$rt[1],1), ' sec, precursor m/z=',
round(df$precursorMz[1], 4)))+
theme_minimal()+
theme(axis.title = element_text(colour = 'black', size = 8, face = 'bold'),
axis.text = element_text(colour = 'black', size = 6),
plot.title =element_text(colour = 'black', size = 8, face = 'bold',
hjust = 0.5),
panel.border = element_rect(fill = NA, colour = 'black'),
panel.grid = element_blank())
print(p)
}