GlycoAnnotateR

Logo

R package for annotation of glycans in mass spectrometry data.

View the Project on GitHub margotbligh/GlycoAnnotateR

1 Introduction

This is a tutorial for the annotation of glycans in LC-MS data in R using XCMS, CAMERA and GlycoAnnotateR. Other data processing pipelines (e.g. mzMine) can be used but are not covered in this tutorial. For the annotation step all that is needed is a dataframe with one row per m/z. This tutorial assumes a basic understanding of the XMCS processing pipeline - for a great explanation of LC-MS data processing with XCMS, check out this tutorial.

This tutorial will use example data, which you can find here on Github under ‘inst/extdata’. The mzML files should be manually downloaded. The data consists of three mzML files. All contain HILIC-ESI-Orbitrap MS data. The same mixture of commercial oligosaccharide standards was run three times in negative mode: once with full MS only and twice with DDA-MS/MS.

We will first set up the environment by loading the required packages. XCMS is used for the data processing (peak picking and grouping, chromatogram extraction etc) and CAMERA is used for isotope detection. XCMS and CAMERA can be installed from BioConductor using BiocManager.

2 Load required packages

library(GlycoAnnotateR)
library(dplyr)
library(xcms)
library(CAMERA)
library(ggplot2)
library(ggrepel)
library(scales)
library(stringr)

3 Pre-processing the data

3.1 Import and prepare the data

3.1.1 Import data

Now we import the mzML data files as an OnDiskMSnExp object (all_data) with an associated phenodata (i.e. metadata) table. It is important that you download the example files manually from Github, they can be found at ‘inst/extdata’. The data files should be 180-220 MB - if they are smaller than this, they have been downloaded incorrectly. Download them from Github manually by clicking on the individual files and then pressing the download button.

#get filepaths of mzML files
#change as necessary to match your directory structure
fp <- dir('../inst/extdata', pattern = 'mzML$', full.names = T)

#make a metadata df
pd <- data.frame(sample_name = gsub(".mzML|.*1[78]_", "", basename(fp)),
                 frag = str_split_i(basename(fp), "_", 4),
                 sample_type = str_split_i(basename(fp), "_", 3))

#print the metadata df                 
pd
##       sample_name frag sample_type
## 1 stds_DDA_neg_06  DDA        stds
## 2 stds_MS1_neg_05  MS1        stds
## 3 stds_neg_DDA_04  neg        stds
#read in data with metadata df as phenodata
all_data <- readMSData(files = fp, 
                       pdata = new("NAnnotatedDataFrame", pd),
                       mode = 'onDisk')

3.1.2 Filter by retention time

We will now filter the data to only retain scans between 1 minute (sample injection) and 30 minutes (solvent B ramped up to re-condition the column). This is not strictly necessary but increases the speed of processing by dropping unneeded data.

#filter retention time
all_data <- filterRt(all_data, rt = c(1*60, 30*60))

3.1.3 Split data

We will now split the data in two. One object has only MS1 scan events (data_ms1) and one object has all scan events (MS1 and MS2). CAMERA peak grouping by retention time for isotope picking requires objects with only MS1 scans.

#make MS1 only data object
data_ms1 <- all_data[all_data@featureData@data$msLevel == 1]
data_ms2 <- all_data

3.2 Peak picking, refining and grouping

3.2.1 Peak picking

Peak picking is done with the CentWave algorithm, for details see Tautenhahn et al., 2008. CentWave parameters should be optimised for each dataset - this can be done with the IPO package.

We will only perform peak picking on the MS1 object. Later, MS2 spectra associated with MS1 features annotated as glycans will be extracted, using the MS1 peaks.

#create parameter object
cwp <- CentWaveParam(ppm = 4, peakwidth = c(10, 60))
#pick peaks 
data_ms1_pks <- findChromPeaks(data_ms1, cwp)

Peak picking should always be optimised and checked extensively, as this will have a massive impact on the downstream data interpretation. Here, we will check an example chromatogram for a dihexose [M+Cl]- adduct. We will quickly use glycoPredict from GlycoAnnotateR to get the correct m/z value.

#calculate m/z value for dihexose Cl adduct
df_dihexose <- glycoPredict(param = glycoPredictParam(dp = c(2,2),
                                                       polarity = 'neg',
                                                       adducts = 'Cl'))

#extract chromatogram +- 4 ppm
chr_dihexose_pk <- chromatogram(data_ms1_pks,
                                mz = c(df_dihexose$mz-ppm_to_mz(df_dihexose$mz, 4), 
                                       df_dihexose$mz+ppm_to_mz(df_dihexose$mz, 4)))

Picked peaks are indicated by the red rectangles in the chromatograms. We can see that the peak was picked relatively nicely for the first sample. For the second sample, only ~half of the peak is picked. For the third sample, there are two peaks picked for the same peak. We will refine the peaks in the next steps.

plot(chr_dihexose_pk[[1]], peakType = 'rectangle', peakCol = 'darkred', peakBg = NA)
plot(chr_dihexose_pk[[2]], peakType = 'rectangle', peakCol = 'darkred', peakBg = NA)
plot(chr_dihexose_pk[[3]], peakType = 'rectangle', peakCol = 'darkred', peakBg = NA)

In the XIC version the defined peak is shown in the lower panel, this demonstrates how the peaks are defined in the mz-rt dimensions.

data_ms1_pks %>% 
  filterRt(rt = c(350, 550)) %>% 
  filterMz(mz = c(df_dihexose$mz-ppm_to_mz(df_dihexose$mz, 6), 
                  df_dihexose$mz+ppm_to_mz(df_dihexose$mz, 6))) %>% 
  plot(type = "XIC")

3.2.2 Peak merging

We will merge the split peaks using the ‘refineChromPeaks’ function of XCMS. Again, the parameters used will vary for each dataset.

#create parameter object
mpp <- MergeNeighboringPeaksParam(expandRt = 10, minProp = 0.5)
#merge peaks
data_ms1_pks_mg <- refineChromPeaks(data_ms1_pks, mpp)
## Merging reduced 2364 chromPeaks to 1619.

We can see in the chromatograms after peak merging that there is a single dihexose peak picked for each sample. The width of the peak captured in each sample is slightly different. We will not worry about that too much now, as the peaks will be grouped into features in the next step.

#extract dihexose chromatogram +- 4 ppm
chr_dihexose_pk_mg <- chromatogram(data_ms1_pks_mg,
                                mz = c(df_dihexose$mz-ppm_to_mz(df_dihexose$mz, 4), 
                                       df_dihexose$mz+ppm_to_mz(df_dihexose$mz, 4)))
#plot chromatogram for each sample
plot(chr_dihexose_pk_mg[[1]], peakType = 'rectangle', peakCol = 'darkred', peakBg = NA)
plot(chr_dihexose_pk_mg[[2]], peakType = 'rectangle', peakCol = 'darkred', peakBg = NA)
plot(chr_dihexose_pk_mg[[3]], peakType = 'rectangle', peakCol = 'darkred', peakBg = NA)

3.2.3 Peak grouping (correspondence)

We will now group the peaks across samples by density into features. In this example there will only be one sample group, as the data contains the same sample run three times. We will double check the parameters first and then group the peaks. Only peaks in 2 of the three runs will be grouped into features.

#set the parameters for peak grouping
pdp <- PeakDensityParam(sampleGroups = data_ms1_pks_mg$sample_type,
                        minFraction = 2/3, bw = 15, binSize = 0.01)
#check the parameters using the dehexose chromatogram
plotChromPeakDensity(chr_dihexose_pk_mg, 
                     param = pdp,
                     peakPch = 16)

#group peaks
data_ms1_pks_mg_gp <- groupChromPeaks(data_ms1_pks_mg, pdp)

3.2.4 Gap filling

The final step in our pre-processing workflow is gap filling. This aims to ‘rescue’ peaks that were not identified in some samples, but which were identified in other samples and have been grouped into a feature. So, for example, there are tetrahexose [M-H]- peaks in samples 1 and 2, but there is not peak in the third sample.

#calculate m/z value for deprotonated tetrahexose 
df_tetrahexose <- glycoPredict(param = glycoPredictParam(dp = c(4,4),
                                                         polarity = 'neg',
                                                         adducts = 'H'))

#extract chromatogram +- 4 ppm
chr_tetrahexose_pk <- chromatogram(data_ms1_pks_mg_gp,
                                mz = c(df_tetrahexose$mz-ppm_to_mz(df_tetrahexose$mz, 4), 
                                       df_tetrahexose$mz+ppm_to_mz(df_tetrahexose$mz, 4)))
#plot chromatogram for each sample
plot(chr_tetrahexose_pk[[1]])
plot(chr_tetrahexose_pk[[2]])
plot(chr_tetrahexose_pk[[3]])

We apply gap filling using the default parameters, and then re-extract the chromatograms, making sure that filled peaks are included. We can now see that there is a peak in all three samples.

#gap filling
data_ms1_pks_mg_gp_fill <- fillChromPeaks(data_ms1_pks_mg_gp)
## Defining peak areas for filling-in .... OK
## Start integrating peak areas from original files
#extract chromatogram +- 4 ppm
chr_tetrahexose_pkfill <- chromatogram(data_ms1_pks_mg_gp_fill,
                                mz = c(df_tetrahexose$mz-ppm_to_mz(df_tetrahexose$mz, 4), 
                                       df_tetrahexose$mz+ppm_to_mz(df_tetrahexose$mz, 4)),
                                filled = T) #make sure filled is TRUE to include gap filling

#plot chromatograms
plot(chr_tetrahexose_pkfill[[1]])
plot(chr_tetrahexose_pkfill[[2]])
plot(chr_tetrahexose_pkfill[[3]])

4 Isotope detection

We will now use CAMERA for detection of isotopes. We first need to convert our XCMSnExp object to an xcmsSet object, which can then be used to construct an xsAnnotate object.

#creat xcmsSet object
xset <- as(data_ms1_pks_mg_gp_fill, "xcmsSet")
#set sample names
sampnames(xset) <- pData(data_ms1_pks_mg_gp_fill)$sample_name
#set sample classes/groups
sampclass(xset) <- pData(data_ms1_pks_mg_gp_fill)$sample_type

Now we create the xsAnnotate object and group peaks by retention time. This allows us to find isotopes. We use getPeakList to generate a feature table containing the isotope information. Isotope identification is important for many reasons, e.g. to ensure we are not just annotating isotopes, and to try to resolve ambiguous annotations (when there are multiple annotations per peak). Features are filtered here so that at least 1 sample has an integrated peak area >5e5, this step is optional and heavily depends on your samples and system.

#build xsAnnotate object
an <- xsAnnotate(xset)
#group peaks by retention time (full width half maximum)
an <- groupFWHM(an, perfwhm = 0.8)
## Start grouping after retention time.
## Created 23 pseudospectra.
#find isotopes
an <- findIsotopes(an, ppm = 3)
## Generating peak matrix!
## Run isotope peak annotation
##  % finished: 10  20  30  60  70  80  90  100  
## Found isotopes: 93
#get feature list and filter features by intensity values to remove those with only values below 5e5
pl <-getPeaklist(an) %>% 
  filter(if_any(contains("stds"), ~ .x > 5e5))

5 Annotating the data

5.1 Making the parameter object

We will now move forward with annotating our feature table to identify putative oligosaccharide ions. Here, we will first create the glycoPredictParam object, and then annotate the data. These two steps can be done in one, but we will first look at and understand the glycoPredictParam object.

#create the glycoPredictParam object
gpp <- glycoPredictParam(
  #degree of polymerisation range, here monomer to tetramer 
  dp = c(1,4), 
  #ionisation polarity
  polarity = "neg", 
  #ionisation type (affects adduct types)
  ion_type = "ESI", 
  #scan range during acquisition (here for MS1 scans)
  scan_range = c(175,1400),
  #expected modifications
  modifications = c("alditol", 
                    "sulfate",
                    "omethyl",
                    "anhydrobridge"),
  #maximum average number of modifications per monomer
  nmod_max = 1,
  #adducts
  adducts = c("H", "Cl"))
gpp
## An object of class "glycoPredictParam"
## Slot "dp":
## [1] 1 4
## 
## Slot "polarity":
## [1] "neg"
## 
## Slot "scan_range":
## [1]  175 1400
## 
## Slot "pent_option":
## [1] FALSE
## 
## Slot "modifications":
## [1] "alditol"       "sulfate"       "omethyl"       "anhydrobridge"
## 
## Slot "nmod_max":
## [1] 1
## 
## Slot "double_sulfate":
## [1] FALSE
## 
## Slot "label":
## [1] "none"
## 
## Slot "ion_type":
## [1] "ESI"
## 
## Slot "format":
## [1] "long"
## 
## Slot "adducts":
## [1] "H"  "Cl"
## 
## Slot "naming":
## [1] "IUPAC"
## 
## Slot "glycan_linkage":
## [1] "none"
## 
## Slot "modification_limits":
## [1] "none"

5.2 Calculating compositions

We can now use parameters defined in the glycoPredictParam object to ‘predict’ (or rather calculate) all possible glycans. You can directly run glycoAnnotate to annotate your peak list using the parameters, which will call glycoPredict in the background, but here we will demonstrate the output of glycoPredict by running it separately.

#create a table of theoretical glycans and their ions
pred_df <- glycoPredict(param = gpp)

By default we have selected IUPAC naming and long output format, so glycan compositions are named with IUPAC abbreviations and there is one row per adduct/ion. 309 adducts from 134 possible compositions are calculated based on the constraints of the input parameters.

#print the number of adducts and compositions
print(paste('There are', nrow(pred_df), 
            'adducts/ions from', 
            length(unique(pred_df$`IUPAC name`)),
            'different compositions'))
## [1] "There are 309 adducts/ions from 134 different compositions"

There are 8 columns in the output:

  • dp = the degree of polymerisation (oligomer length)
  • mass = the monisotopic exact mass of the neutral oligosaccharide
  • formula = the sum formula of the neutral oligosaccharide
  • IUPAC name = the name of the oligosaacharide, using common abbreviations from IUPAC (see here) for a summary of the naming
  • ion = the adduct of the ion
  • mz = the mass-to-charge ratio of the ion (m/z)
  • ion_formula = the sum formula of the ion formed by the oligosaccharide
  • charge = the charge state of the adduct
#inspect the output
head(pred_df)
## # A tibble: 6 × 10
##      dp  mass formula `IUPAC name`        glyconnect_id glytoucan_id ion      mz
##   <dbl> <dbl> <chr>   <chr>               <chr>         <chr>        <chr> <dbl>
## 1     1  162. C6H10O5 Hex1 AnhydroBridge1 none          none         [M+C…  197.
## 2     1  180. C6H12O6 Hex1                228           G49874UX     [M-H…  179.
## 3     1  180. C6H12O6 Hex1                228           G49874UX     [M+C…  215.
## 4     1  182. C6H14O6 Alditol Hex1        none          none         [M-H…  181.
## 5     1  182. C6H14O6 Alditol Hex1        none          none         [M+C…  217.
## 6     1  194. C7H14O6 Hex1 O-Methyl1      none          none         [M-H…  193.
## # ℹ 2 more variables: ion_formula <chr>, charge <chr>

5.3 Annotate features

We will now annotate the feature list generated after isotope detection using the parameters we defined. Because our feature list has minimum and maximum m/z values, we can use those values for annotation against theoretical values (the range for each theoretical values is created as the value +/- the error in ppm or Da). This is done by checking for overlap of two ranges. For peak/feature lists with only an m/z value (mean or otherwise defined), or if you would like to only use this value, we can supply just the m/z column name. This will annotate by checking if the value is within the range of the theoretical values. The below figure summarises the two types of annotation possible:

We will try both types of annotation here to see how they work and how the output differs. We will choose NOT to collapse the annotations (i.e. if a feature receives multiple annotations, there will be one row per annotation) at this point so that we can look at all of the output information. Collapsing of annotations (so that there is one row per feature, and multiple annotations for the same feature are comma separated) can be done during this step or later.

5.3.1 Overlap of two ranges

Note that here we are supplying again the glycoPredictParam object, to demonstrate that annotation can be a single step process. However, given that we have already run glycoPredict, we could alternatively have supplied pred_table = pred_df and not provide a value for param.

#annotation is performed
pl_annot_overlap2ranges <- glycoAnnotate(
  #feature or peak list to be annotated
  data = pl,
  #glycoPredictParam object 
  param = gpp,
  #error unit and value
  error = 2, error_units = 'ppm',
  #should annotations be collapse
  collapse = F,
  #column containing m/z values
  mz_column = 'mz',
  #columns containing minimum and maximimum mz values
  mzmin_column = 'mzmin', mzmax_column = 'mzmax')

We can see that the output contains the same columns as the input (feature table), but now also with 7 additional columns - these come from the output of glycoPredict. The column containing the theoretical m/z values is now named ‘mz_pred’. The ‘mz’ column contains the m/z values of the data. The output contains 380 rows - 1 more than the input (381). This means that some features received multiple annotations. 12 features in total received at least one annotation.

#see output
head(pl_annot_overlap2ranges)
##   dp mass formula IUPAC name glyconnect_id glytoucan_id  ion mz_pred
## 1 NA   NA    <NA>       <NA>          <NA>         <NA> <NA>      NA
## 2 NA   NA    <NA>       <NA>          <NA>         <NA> <NA>      NA
## 3 NA   NA    <NA>       <NA>          <NA>         <NA> <NA>      NA
## 4 NA   NA    <NA>       <NA>          <NA>         <NA> <NA>      NA
## 5 NA   NA    <NA>       <NA>          <NA>         <NA> <NA>      NA
## 6 NA   NA    <NA>       <NA>          <NA>         <NA> <NA>      NA
##   ion_formula charge       mz    mzmin    mzmax        rt     rtmin     rtmax
## 1        <NA>   <NA> 177.9024 177.9024 177.9025  76.88945  76.81742  77.46704
## 2        <NA>   <NA> 178.9024 178.9024 178.9025  77.23567  77.00429  77.46704
## 3        <NA>   <NA> 179.0010 179.0009 179.0010 152.75758 151.66567 153.84949
## 4        <NA>   <NA> 179.0554 179.0554 179.0555 207.17708 205.28354 209.07061
## 5        <NA>   <NA> 180.8590 180.8589 180.8591  76.63093  76.24304  76.88945
## 6        <NA>   <NA> 181.0168 181.0168 181.0168 175.67673 174.38789 176.96557
##   npeaks stds ms_level stds_DDA_neg_06 stds_MS1_neg_05 stds_neg_DDA_04 isotopes
## 1      3    3        1       6024463.8       6333026.4       1657617.9         
## 2      2    2        1        804640.0        880007.8        151261.6         
## 3      2    2        1       1747128.3       1201828.6       1609318.9         
## 4      2    2        1       8438035.8       6371174.9       8543615.4         
## 5      3    3        1        290237.4        517346.0        281231.3         
## 6      2    2        1        771094.0       1140759.7       1234216.8         
##   adduct pcgroup mass_error_ppm
## 1              6             NA
## 2              6             NA
## 3              4             NA
## 4             19             NA
## 5              6             NA
## 6              4             NA

#print dimensions of input
dim(pl)
## [1] 380  15

#print dimensions of output
dim(pl_annot_overlap2ranges)
## [1] 381  26

#print number of features annotated
pl_annot_overlap2ranges %>% 
  filter(!is.na(dp)) %>% 
  distinct(mz) %>% 
  nrow()
## [1] 12

5.3.2 Value within a range

For comparison, we will repeat the annotation but now only providing the m/z column, and therefore matching by the value within a range method.

#annotation is performed
pl_annot_valueInRange <- glycoAnnotate(
  #feature or peak list to be annotated
  data = pl,
  #glycoPredictParam object 
  param = gpp,
  #error unit and value
  error = 3, error_units = 'ppm',
  #should annotations be collapse
  collapse = F,
  #column containing m/z values
  mz_column = 'mz')

Now 15 features have been annotated, and there is again one feature with multiple annotations.

#see output
head(pl_annot_valueInRange)
##   dp mass formula IUPAC name glyconnect_id glytoucan_id  ion mz_pred
## 1 NA   NA    <NA>       <NA>          <NA>         <NA> <NA>      NA
## 2 NA   NA    <NA>       <NA>          <NA>         <NA> <NA>      NA
## 3 NA   NA    <NA>       <NA>          <NA>         <NA> <NA>      NA
## 4 NA   NA    <NA>       <NA>          <NA>         <NA> <NA>      NA
## 5 NA   NA    <NA>       <NA>          <NA>         <NA> <NA>      NA
## 6 NA   NA    <NA>       <NA>          <NA>         <NA> <NA>      NA
##   ion_formula charge       mz    mzmin    mzmax        rt     rtmin     rtmax
## 1        <NA>   <NA> 177.9024 177.9024 177.9025  76.88945  76.81742  77.46704
## 2        <NA>   <NA> 178.9024 178.9024 178.9025  77.23567  77.00429  77.46704
## 3        <NA>   <NA> 179.0010 179.0009 179.0010 152.75758 151.66567 153.84949
## 4        <NA>   <NA> 179.0554 179.0554 179.0555 207.17708 205.28354 209.07061
## 5        <NA>   <NA> 180.8590 180.8589 180.8591  76.63093  76.24304  76.88945
## 6        <NA>   <NA> 181.0168 181.0168 181.0168 175.67673 174.38789 176.96557
##   npeaks stds ms_level stds_DDA_neg_06 stds_MS1_neg_05 stds_neg_DDA_04 isotopes
## 1      3    3        1       6024463.8       6333026.4       1657617.9         
## 2      2    2        1        804640.0        880007.8        151261.6         
## 3      2    2        1       1747128.3       1201828.6       1609318.9         
## 4      2    2        1       8438035.8       6371174.9       8543615.4         
## 5      3    3        1        290237.4        517346.0        281231.3         
## 6      2    2        1        771094.0       1140759.7       1234216.8         
##   adduct pcgroup mass_error_ppm
## 1              6             NA
## 2              6             NA
## 3              4             NA
## 4             19             NA
## 5              6             NA
## 6              4             NA

#print dimensions of input
dim(pl)
## [1] 380  15

#print dimensions of output
dim(pl_annot_valueInRange)
## [1] 381  26

#print number of features annotated
pl_annot_valueInRange %>% 
  filter(!is.na(dp)) %>% 
  distinct(mz) %>% 
  nrow()
## [1] 15

In this tutorial we will proceed with the annotations from within a range matching, but it is open to the user to decide which is more appropriate to their data type.

5.4 Collapse annotations

We will now collapse the annotations, so that there is one row per feature. We will collapse the columns with the IUPAC name and ion.

#define columns to collapse
collapse = c('IUPAC name', 'ion')
#define columns to keep, uncollapsed
noncollapse = names(pl)[names(pl) %in% names(pl_annot_valueInRange)]

#perform collapsing
pl_annot_c <- glycoAnnotationsCollapse(pl_annot_valueInRange,
                                       collapse_columns = collapse,
                                       noncollapse_columns = noncollapse)
## Collapsing annotations

The output now contains the same number of rows as the feature table, with one additional column called ‘annotations’ that contains the IUPAC name and ion separated by a colon. Where multiple annotations were assigned they are comma separated.

#inspect output
head(pl_annot_c)
## # A tibble: 6 × 16
##      mz mzmin mzmax    rt rtmin rtmax npeaks  stds ms_level stds_DDA_neg_06
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>    <dbl>           <dbl>
## 1  178.  178.  178.  76.9  76.8  77.5      3     3        1        6024464.
## 2  179.  179.  179.  77.2  77.0  77.5      2     2        1         804640.
## 3  179.  179.  179. 153.  152.  154.       2     2        1        1747128.
## 4  179.  179.  179. 207.  205.  209.       2     2        1        8438036.
## 5  181.  181.  181.  76.6  76.2  76.9      3     3        1         290237.
## 6  181.  181.  181. 176.  174.  177.       2     2        1         771094.
## # ℹ 6 more variables: stds_MS1_neg_05 <dbl>, stds_neg_DDA_04 <dbl>,
## #   isotopes <chr>, adduct <chr>, pcgroup <chr>, annotations <chr>
#print output dimensions
dim(pl_annot_c)
## [1] 380  16

We will now filter the data to only retain annotated features. We can see that there is one feature (m/z = 403.0546, rt = 137 s) that has two annotations: ‘Hex4 AnhydroBridge1 Sulfate2:[M-2H]-2’ and ‘Hex2 AnhydroBridge1 Sulfate1:[M-H]-’.

pl_annotonly <- pl_annot_c %>%  filter(annotations != 'NA:NA')
pl_annotonly[, c('mz', 'rt', 'annotations')]
## # A tibble: 15 × 3
##       mz     rt annotations                                                     
##    <dbl>  <dbl> <chr>                                                           
##  1  215.   76.1 Hex1:[M+Cl]-                                                    
##  2  215.  211.  Hex1:[M+Cl]-                                                    
##  3  217.  195.  Alditol Hex1:[M+Cl]-                                            
##  4  229.   83.9 Hex1 O-Methyl1:[M+Cl]-                                          
##  5  259.  111.  Hex1 Sulfate1:[M-H]-                                            
##  6  259.  187.  Hex1 Sulfate1:[M-H]-                                            
##  7  259.   66.3 Hex1 Sulfate1:[M-H]-                                            
##  8  322.  422.  Hex3 AnhydroBridge1 Sulfate2:[M-2H]-2                           
##  9  341.  465.  Hex2:[M-H]-                                                     
## 10  377.  461.  Hex2:[M+Cl]-                                                    
## 11  394.  487.  Hex4 AnhydroBridge2 Sulfate2:[M-2H]-2                           
## 12  403.  137.  Hex2 AnhydroBridge1 Sulfate1:[M-H]-, Hex4 AnhydroBridge1 Sulfat…
## 13  503.  778.  Hex3:[M-H]-                                                     
## 14  565.  467.  Hex3 AnhydroBridge1 Sulfate1:[M-H]-                             
## 15  665. 1059.  Hex4:[M-H]-

5.5 Inspect isotopes for ambiguous annotations

We will see if we can use the detected isotopes to choose one of the two annotations as the more likely one. The annotated feature table is filtered to only the pcgroup of the feature with two annotations, and within that group only features detected as isotopes by CAMERA. The ‘pcgroup’ represents features grouped together based on retention time.

#get the pc group of the feature with two annotations
pc = pl_annotonly$pcgroup[grepl(',', pl_annotonly$annotations)]

#filter the annotated feature table to only that pcgroup
#and peaks with detected isotopes
iso_check <- pl_annot_valueInRange %>% 
  filter(pcgroup %in% pc & isotopes != '')

There are a couple of things we can notice in the filtered df:

  • The ppm mass error differences of the two annotations are the same to the displayed decimal places (meaning this cannot differentiate the annotations)

  • There are M+1 (m/z 404.0576) and M+2 (m/z 405.0532) isotopes detected for molecule 42 (and the annotated feature was detected as the M isotope for molecule 42)

  • Each isotope has a mass increased of ~1 Da, indicating that the ion has a single charge, supporting the annotation of Hex2 AnhydroBridge1 Sulfate1 [M-H]-

#inspect the dataframe
iso_check
##   dp     mass     formula                   IUPAC name glyconnect_id
## 1  2 404.0625  C12H20O13S Hex2 AnhydroBridge1 Sulfate1          none
## 2  4 808.1249 C24H40O26S2 Hex4 AnhydroBridge1 Sulfate2          none
## 3 NA       NA        <NA>                         <NA>          <NA>
## 4 NA       NA        <NA>                         <NA>          <NA>
## 5 NA       NA        <NA>                         <NA>          <NA>
## 6 NA       NA        <NA>                         <NA>          <NA>
## 7 NA       NA        <NA>                         <NA>          <NA>
## 8 NA       NA        <NA>                         <NA>          <NA>
##   glytoucan_id      ion  mz_pred ion_formula charge       mz    mzmin    mzmax
## 1         none   [M-H]- 403.0552  C12H19O13S     -1 403.0546 403.0545 403.0547
## 2         none [M-2H]-2 403.0552 C24H38O26S2     -2 403.0546 403.0545 403.0547
## 3         <NA>     <NA>       NA        <NA>   <NA> 404.0576 404.0575 404.0577
## 4         <NA>     <NA>       NA        <NA>   <NA> 405.0532 405.0530 405.0532
## 5         <NA>     <NA>       NA        <NA>   <NA> 409.0651 409.0650 409.0652
## 6         <NA>     <NA>       NA        <NA>   <NA> 409.5664 409.5663 409.5664
## 7         <NA>     <NA>       NA        <NA>   <NA> 534.0383 534.0380 534.0385
## 8         <NA>     <NA>       NA        <NA>   <NA> 534.5387 534.5386 534.5388
##         rt    rtmin    rtmax npeaks stds ms_level stds_DDA_neg_06
## 1 137.3406 103.3629 137.8245      4    3        1    4648649274.0
## 2 137.3406 103.3629 137.8245      4    3        1    4648649274.0
## 3 137.7311 136.9500 138.4741      3    3        1     665967193.7
## 4 137.3406 103.3629 138.4741      4    3        1     257376019.0
## 5 146.0255 144.8354 146.3422      3    3        1       5151095.3
## 6 146.0255 144.8354 146.3422      3    3        1       1281893.0
## 7 146.0255 144.8354 146.3422      3    3        1        589825.1
## 8 145.5888 144.8354 146.3422      2    2        1        237786.7
##   stds_MS1_neg_05 stds_neg_DDA_04    isotopes adduct pcgroup mass_error_ppm
## 1    5597347650.0    1844509591.2    [42][M]+              1       1.359202
## 2    5597347650.0    1844509591.2    [42][M]+              1       1.359202
## 3     814585650.1     599524041.3  [42][M+1]+              1             NA
## 4     315649346.7     101921960.1  [42][M+2]+              1             NA
## 5       8612458.4       6422631.7   [43][M]2+              1             NA
## 6       2237035.7       1670463.0 [43][M+1]2+              1             NA
## 7       1795711.1       1322100.2   [58][M]2+              1             NA
## 8        819922.7        552496.5 [58][M+1]2+              1             NA

We can double check the isotope assignment by plotting the mass spectrum for one sample.

#filter data 
m1 <- all_data %>% 
  filterFile(which(pd$frag == 'MS1')) %>%  #filter for sample 2 
  filterRt(rt = c(min(iso_check$rtmin[1:4]), 
                  max(iso_check$rtmax[1:4]))) %>% #filter by rt
  filterMz(mz = c(min(iso_check$mzmin[1:4]-0.5), 
                  max(iso_check$mzmax[1:4]+0.5))) #filter by mz

#average spectra
m1_avg <- combineSpectra(m1, 
                         fcol = 'fileIdx',
                         intensityFun = mean,
                         ppm = 3)

#normalise with respect to ion with highest intensity
m1_avg_norm <- normalise(m1_avg, method = 'max')

#make dataframe from spectrum
#add M, M1 and M2 annotations
m1_avg_norm_df <- data.frame(mz = m1_avg_norm[[1]]@mz,
                             int = m1_avg_norm[[1]]@intensity) %>% 
  mutate(isotope = case_when(mz > 403.054 & mz < 403.058 ~ 'M',
                             mz > 404.054 & mz < 404.059 ~ 'M+1',
                             mz > 405.053 & mz < 405.06 ~ 'M+2',))

#plot spectrum with annotated isotopes
ggplot(m1_avg_norm_df)+
  geom_segment(aes(x = mz, xend = mz, y = 0, yend= int)) +
  geom_text(aes(x = mz, y = int+0.1, label = isotope)) +
  labs(x = expression(italic('m/z')), y = "Normalised intensity")+
  scale_x_continuous(breaks = seq(402.2, 405.2, 0.2)) +
  theme_light()+
  theme(axis.title = element_text(colour = 'black', size = 10, face = 'bold'),
        axis.text =  element_text(colour = 'black', size = 10),
        axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Removed 33 rows containing missing values or values outside the scale range
## (`geom_text()`).

We can be relatively confident in the annotation of Hex2 AnhydroBridge1 Sulfate1 [M-H]- over Hex4 AnhydroBridge1 Sulfate2:[M-2H]-2, so we will assign the annotation as such.

#assign the singly charged annotation
pl_annotonly$annotations[pl_annotonly$pcgroup == pc] <- 'Hex2 Anhydro1 Sulfate1:[M-H]-'

5.6 Extract and plot chromatograms of annotated features

It is often interesting to look at the chromatograms of the annotated features, to verify the peak processing and see how the retention times differ across annotations (and if that makes sense, e.g. are higher DP oligosaccharides eluting later than shorter DP. oligosaccharides), and if we have multiple peaks (i.e. isomers) for some annotations. We can use the featureChromatograms function from XCMS to extract the chromatograms, and then extract the information from those objects.

#get the definitions of all features
featDef <- featureDefinitions(data_ms1_pks_mg_gp_fill)
#get in the indices of the annotated features
featIndices = which(featDef$mzmed %in% pl_annotonly$mz)
#extract the chromatograms of the annotated features, expand rt by 2 min
featChr <- featureChromatograms(data_ms1_pks_mg_gp_fill, 
                                expandRt = 120,
                                features = featIndices)

#reorder annotation table to be the same order as the chromatograms
pl_annotonly_reordered <- pl_annotonly[match(featDef$mzmed[featIndices],
                                             pl_annotonly$mz),]

#make empty df
featChr_df <- data.frame(sample_name = as.character(),
                         frag = as.character(),
                         annotations = as.character(),
                         mz = as.numeric(),
                         rt = as.numeric(),
                         rt_med = as.numeric(),
                         intensity = as.numeric())

#fill in df with chromatogram info
for(i in 1:length(featIndices)){
  mz = pl_annotonly_reordered$mz[i]
  rt_med = pl_annotonly_reordered$rt[i]
  annotations = pl_annotonly_reordered$annotations[i]
  for(j in 1:nrow(pd)){
    chr <- featChr[i,j]
    rt <- chr@rtime
    intensity <- chr@intensity
    tmp <- data.frame(sample_name = pd$sample_name[j],
                      frag = pd$frag[j],
                      annotations = annotations,
                      mz = mz,
                      rt = rt,
                      rt_med = rt_med,
                      intensity = intensity)
    featChr_df <- rbind(featChr_df, tmp)
  }
}

#make annotations into a factor, ordered by rt
featChr_df$annotations <- factor(featChr_df$annotations, 
                                 levels = pl_annotonly_reordered %>% 
                                   arrange(rt) %>% 
                                   select(annotations) %>% 
                                   unlist() %>% as.vector() %>% 
                                   unique())

Chromatograms can then be plotted with ggplot. We can see that some features likely derive from in-source fragmentation e.g. some of the monosulfated hexose peaks could be from the sulfated kappa carrageenan oligosaccharides (e.g. ‘Hex2 AnhydroBridge1 Sulfate1’).

#get the definitions of all features
featDef <- featureDefinitions(data_ms1_pks_mg_gp_fill)
#get in the indices of the annotated features
featIndices = which(featDef$mzmed %in% pl_annotonly$mz)
#extract the chromatograms of the annotated features, expand rt by 2 min
featChr <- featureChromatograms(data_ms1_pks_mg_gp_fill, 
                                expandRt = 120,
                                features = featIndices)

#reorder annotation table to be the same order as the chromatograms
pl_annotonly_reordered <- pl_annotonly[match(featDef$mzmed[featIndices],
                                             pl_annotonly$mz),]

#make empty df
featChr_df <- data.frame(sample_name = as.character(),
                         frag = as.character(),
                         annotations = as.character(),
                         mz = as.numeric(),
                         rt = as.numeric(),
                         rt_med = as.numeric(),
                         intensity = as.numeric())

#fill in df with chromatogram info
for(i in 1:length(featIndices)){
  mz = pl_annotonly_reordered$mz[i]
  rt_med = pl_annotonly_reordered$rt[i]
  annotations = pl_annotonly_reordered$annotations[i]
  for(j in 1:nrow(pd)){
    chr <- featChr[i,j]
    rt <- chr@rtime
    intensity <- chr@intensity
    tmp <- data.frame(sample_name = pd$sample_name[j],
                      frag = pd$frag[j],
                      annotations = annotations,
                      mz = mz,
                      rt = rt,
                      rt_med = rt_med,
                      intensity = intensity)
    featChr_df <- rbind(featChr_df, tmp)
  }
}

#make annotations into a factor, ordered by rt
featChr_df$annotations <- factor(featChr_df$annotations, 
                                 levels = pl_annotonly_reordered %>% 
                                   arrange(rt) %>% 
                                   select(annotations) %>% 
                                   unlist() %>% as.vector() %>% 
                                   unique())

ggplot(featChr_df, 
       aes(x = rt, y = intensity, group = rt_med)) +
  geom_line() +
  facet_grid(cols = vars(sample_name), 
             rows = vars(annotations), scales = "free") +
  labs(y = "Intensity (a.u.)", x = "Retention time (s)") +
  scale_y_continuous(labels = scientific) +
  theme_light()+
  theme(axis.title = element_text(colour = 'black', size = 10, 
                                  face = 'bold'),
        axis.text =  element_text(colour = 'black', size = 5),
        strip.background = element_blank(),
        strip.text.y = element_text(colour = 'black', size = 5, 
                                    angle = 0, hjust = 0),
        strip.text.x = element_text(colour = 'black', size = 5))
## Warning: Removed 1867 rows containing missing values or values outside the scale range
## (`geom_line()`).

6 Extract and annotate MS2 spectra associated with annotations

6.1 Extract MS2 spectra

We will use the glycoMS2Extract function from GlycoAnnotateR to extract the spectra. This function is basically a wrapper for XCMS functions for MS2 spectra extraction, with a layer to only extract those spectra associated with annotated features.

The function expects that you have split your MS1 and MS2 data, like we did at the start. However, it should still work even if you didn’t split your MS1 and MS2 data - in this case, you can simply provide the same data object for data_ms2 and data_features.

ms2 <- glycoMS2Extract(
  #object with MS1 and MS2 data 
  data_ms2 = data_ms2, 
  #processed MS1 data object (must have features)
  data_features = data_ms1_pks_mg_gp_fill, 
  #annotated feature table
  #must contain an 'mz' and 'rt' column for the features!
  annotations = pl_annotonly)
## Warning: No MS level 2 spectra present in file
## M31_20230717_stds_MS1_neg_05.mzML

We can see that the output is an MSpectra object which contains 248 MS2 spectra.

class(ms2)
## [1] "MSpectra"
## attr(,"package")
## [1] "MSnbase"
length(ms2@listData)
## [1] 248

6.2 Process MS2 spectra

We will use the functions from MSnbase to clean, average and normalise the MS2 spectra.

6.2.1 Clean MS2 spectra

By ‘cleaning’, we remove zero intensity masses.

#remove zero intensity masses
ms2@listData <- lapply(ms2, clean, all = TRUE)

6.2.2 Average MS2 spectra

We average all MS2 spectra that correspond to the same (chromatographic) peak in the same sample (i.e. file). There are a number of different functions which can be used to combine the spectra

#combine spectra by peak within each file
ms2_avg <- MSnbase::combineSpectra(ms2,
                                   fcol = 'peak_id',
                                   method = meanMzInts,
                                   mzd = 0.001)

The number of spectra per file is reduced to 14-15, from 119-129.

#print the number of spectra per file before and after averaging
table(fromFile(ms2))
## 
##   1   3 
## 119 129
table(fromFile(ms2_avg))
## 
##  1  3 
## 14 15

6.2.3 Normalise intensities

Intensities are normalised with respect to the maximum intensity.

#normalise with respect to ion with highest intensity
ms2_avg@listData <- lapply(ms2_avg, 
                           normalise, 
                           method = "max")

6.2.4 Convert to dataframe

Now we will convert the data into a dataframe format so its easier to work with.

#build empty df
ms2_df <- data.frame(sample.name = as.character(),
                     sample.number = as.numeric(),
                     precursorMz = as.numeric(),
                     rt = as.numeric(),
                     mz = as.numeric(),
                     intensity = as.numeric())

for (i in 1:length(ms2_avg)){
  mz = sprintf("%.4f",ms2_avg[[i]]@mz)
  intensity = ms2_avg[[i]]@intensity * 100
  rt = rep(sprintf("%.4f", ms2_avg[[i]]@rt / 60),
           length(mz))
  precursorMz = rep(sprintf("%.4f",ms2_avg[[i]]@precursorMz),
                    length(mz))
  sample.number = rep(fromFile(ms2_avg[[i]]),
                      length(mz))
  sample.name = rep(data_ms1_pks_mg_gp_fill$sample_name[fromFile(ms2_avg[[i]])],
                    length(mz))
  temp <- data.frame(sample.number = sample.number,
                     sample.name = sample.name,
                     precursorMz = precursorMz,
                     rt = rt,
                     mz = mz,
                     intensity = intensity)
  ms2_df <- rbind(temp, ms2_df)
}

#make sure mz columns are numeric
ms2_df$mz <- as.numeric(ms2_df$mz)
ms2_df$precursorMz <- as.numeric(ms2_df$precursorMz)

6.3 Annotate MS2 spectra

6.3.1 Annotate precursors

We will begin by annotating the precursors.

ms2_df_annotP <- glycoAnnotate(ms2_df,
                               error = 4,
                               param = gpp, #param object
                               mz_column = 'precursorMz', #name of mz column
                               collapse = T, #we will collapse annotations...
                               #...on these columns
                               collapse_columns = c('IUPAC name', 'ion', 'dp')) 

All precursors should be assigned an annotation (as we only extracted MS2 spectra for annotated precursors).

#check all precursors annotated
any(is.na(ms2_df_annotP$annotations))
## [1] FALSE

6.3.2 Annotate fragments

Now we can annotate the MS2 spectra. We change the glycoPredictParam parameters depending on the precursor annotation. Only fragments resulting from glycosidic bond breakage, dehydration, or loss of modified groups (e.g. desulfation) can be annotated by GlycoAnnotateR.

#make a vector of the precursor annotations
annotP <- ms2_df_annotP$annotations %>% unique()

#rename annotations column to precursorAnnotations
names(ms2_df_annotP)[names(ms2_df_annotP) == 'annotations'] <- 'precursorAnnotations'

#annotate the fragment ions based on the precursor annotations
ms2_df_annotF <- glycoMS2Annotate(
  #vector of the precursor annotations
  precursorAnnotations = annotP,
  #separator in between name, ion and dp of precursor annotations
  precursorAnnotations_sep = ':',
  #ms2spectra dataframe with annotated precursors
  ms2spectra = as.data.frame(ms2_df_annotP),
  #ion type
  ion_type = 'ESI',
  #error
  error = 4, error_units = 'ppm')

#collapse the fragment ion annotations
ms2_annotF_c <- glycoAnnotationsCollapse(
  annotated_data = ms2_df_annotF,
  collapse_columns = c('IUPAC name', 'ion'),
  noncollapse_columns = names(ms2_df_annotP))

#no collapsing necessary

Annotated spectra can be plotted with ggplot. We can see that the fragmentation energy was too low for many of the ions to generate a good fragmentation pattern, except for the kappa-carrageenan oligosaccharides (e.g. Hex4 AnhydroBridge2 Sulfate2 [M-2H]-2, precursor m/z 394.0496.

#plot spectra with annotated fragments
#you can play around with how you plot this, many ways
precursorMzs <- ms2_df_annotF$precursorMz %>% unique()
for(i in 1:length(precursorMzs)){
  pmz <- precursorMzs[i]
  df <- ms2_df_annotF %>% 
    #filter for precursor Mz
    filter(precursorMz == !!pmz)  %>% 
    #filter for 1% relative abundance
    filter(intensity > 1)
  
  
  
  p <- ggplot(df) +
    #plot lines
    geom_segment(aes(x = mz, xend = mz, y = 0, yend= intensity)) +
    #add mz values for annotated ions
    geom_text(data = df %>% 
                filter(!is.na(annotations)),
              mapping = aes(x = mz, y = intensity + 0.5, 
                            label = round(mz, 4))) +
    #add annotations
    geom_text_repel(aes(label = annotations, x = mz, y = intensity),
                    size = 2, hjust = 0.5, vjust = 0.5,
                    colour = '#f49e4c', segment.linetype = 'dashed') +
    #set labels
    labs(x = expression(italic("m/z")), y = "Normalised intensity (a.u.)",
         title = df$precursorAnnotations %>% unique())+
    theme_light()+
    theme(axis.title = element_text(colour = 'black', size = 8, family = 'Arial',
                                    face = 'bold'),
          axis.text =  element_text(colour = 'black', size = 6, family = 'Arial'),
          plot.title =element_text(colour = 'black', size = 8, family = 'Arial',
                                    face = 'bold'))
  print(p)
}