mzmatch Excel Template Tutorial

Installation & Requirements
  Installation
    The template may be used to process mzmatch output text files without additional installations or add-ins.
    Microsoft Excel 2007 is required (2003 is not sufficient; 2010 has not been tested).
  Requirements for full function
    R Statistical Software: for mzmatch pre-processing
    R packages: XCMS (BioC), mzmatch.r (Rforge), rJava and XML (CRAN)
    R package rcdk: for FormulaGenerator
    Firefox or Internet Explorer: for hyperlinks to online databases
    Thermo Xcalibur: for EIC lookup
    ReAdW: for conversion of .raw to .mzxml files
  If you wish to use the R and Xcalibur links:
    Open the template and update cells D44 and D45 (on the Settings sheet) to the relevant paths on your computer.

Data Pre-processing
  Step 1 - Setup
    Open mzmatch_template.xltm and Save As yourfile.xlsm (macro-enabled workbook).
    Go to the Settings sheet.
    Update cells D44 and D45 (on the Settings sheet) to the relevant paths on your computer.
  Step 2 - Convert RAW files to centroided mzxml files
    Save a copy of ReAdW.exe into the folder with your RAW data.
    Click "Convert RAW to mzxml files" to run the conversion.

Data Pre-processing
  Step 3
    If the files are from an Exactive, split them into Pos and Neg using the blue button.
  Step 4
    Select positive or negative mode in cell K1 (only process one mode at a time).
    For each polarity: sort the replicate .mzxml files into folders according to their experimental groups (sets).
    Check the blue-shaded settings for the mass, RT and Relatedpeaks windows, and the RSD filter (xcms parameters can be changed in the macro).
    Run xcms/mzmatch with the purple Combined button.
    NOTE: Files must be sorted into sets (folders) for the RSD filter to run.
    NOTE: If xcms crashes in negative mode, try selecting the mzdata alt method in cell K2.
    The mzmatch output files will be saved in the folder with your files.
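The RSD filter depends on the set folders because relative standard deviation is calculated across the replicates within a set. The sketch below illustrates the idea in Python. It is not the mzmatch implementation: the function names, the 0.5 threshold and the rule "keep a peak if any one set is reproducible" are assumptions made for illustration.

```python
from statistics import mean, stdev

def rsd(intensities):
    """Relative standard deviation (SD / mean) of replicate intensities."""
    m = mean(intensities)
    if m == 0:
        return float("inf")
    return stdev(intensities) / m

def passes_rsd_filter(sets, max_rsd=0.5):
    """Assumed rule: keep the peak if at least one set has RSD <= max_rsd.

    `sets` maps a set (folder) name to the replicate intensities of one peak.
    Sets with fewer than two replicates cannot give an RSD and are skipped.
    """
    return any(len(v) > 1 and rsd(v) <= max_rsd for v in sets.values())

# Hypothetical peak measured in two sets of three replicates each.
peak = {
    "control": [1000.0, 1100.0, 900.0],
    "treated": [200.0, 2500.0, 40.0],
}
print({name: round(rsd(values), 3) for name, values in peak.items()})
print(passes_rsd_filter(peak))
```

Without the folder structure there is no way to tell which files are replicates of each other, so no per-set RSD can be computed.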

Peak data import
  Step 1
    Import the mzmatch output file combined_related.txt using the big red button (Settings sheet).
    Manually check that replicate samples are in adjacent columns (if not, get cutting and pasting!).
  Step 2
    On the Settings sheet, enter the number of replicates in each set (column F).
    NOTE: if you have named samples with set prefixes, the next green button will do this for you.
    Choose the Set-Type for each set using the drop-down options in column C.
    NOTE: hover the mouse over cell C8 for more information.

Update metabolite DB
  Step 1
    Externally, prepare a list of actual retention times for authentic standards analysed under your current chromatographic conditions.
    Any Excel-readable file with name, RT and mass (optional) in columns can be imported directly. ToxID is good for this.
  Step 2
    NOTE: names must exactly match those in the DB (except that "," can be replaced by "_").
    Select the Rtcalculator sheet.
    Enter the dead-volume time for your chromatographic column (cell O9).
    Scroll to the right and manually update expected retention times for given Pathways, Maps and Properties (if known).
    (optional) Enter metabolite names and RTs for authentic standards in columns A:B and W:X.
    NOTE: These can be entered automatically from an external excel/tsv/csv file in Step 3.

Update metabolite DB
  Step 3
    Run the "Update Retention Times in DB" macro from either the Settings or the Rtcalculator sheet.
    If the prediction model looks good (i.e. r² > 0.6), agree to update the RTs in the DB; otherwise try altering the variables (cells E1:J1) to suit your chromatography, and re-run the macro.
  Step 4 (optional)
    If you have a species-specific database (e.g. from MetaCyc or KEGG), enter these annotations in column G (PreferredDB) of the DB sheet.
    NOTE: This can be simplified by matching database identifiers using Excel's VLOOKUP function.
    Select the entire database and Custom Sort: sort by searchmass (ascending), then by PreferredDB (ascending), to ensure annotated metabolites are at the top of the list for each group of isomers.
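As a rough guide to what the r² > 0.6 check in Step 3 means, the sketch below computes a coefficient of determination between observed retention times of standards and the retention times a model predicted for them. The template may calculate its r² differently (for example as a squared correlation from a fitted trendline), and the retention times here are invented for illustration.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical retention times (minutes) for four authentic standards.
observed_rt = [5.0, 8.0, 12.0, 15.0]
predicted_rt = [6.0, 7.5, 11.0, 16.0]

r2 = r_squared(observed_rt, predicted_rt)
print(f"r2 = {r2:.3f}; acceptable: {r2 > 0.6}")
```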

Run Metabolite Identification
  Step 1
    On the Settings sheet, check that the settings in columns F and I are suitable.
    The most commonly changed settings are:
      Identification RT windows (F3 and F4) and mass window (F6)
      RT window for duplicate peaks (I9)
      MaxIntensity cutoff (I10)
    Select the adducts (cells K15:K21) that you wish to include in the identification search.
  Step 2
    Click "Run Identification Macro" on the Settings sheet.
    This could take from 2 to 20 minutes.
    Save the file as soon as the macro has finished.

Metabolite Identification: Process
  Metabolite Identification Macro
    This macro annotates information to every peak in the alldata sheet.
    Upon completion, all basepeaks are copied to the allbasepeaks sheet.
    All identifications with confidence < 5 are copied to the notlikely sheet.
    All identifications with confidence >= 5 are copied to the identification sheet.
    The identification sheet is then checked for duplicates and shoulder peaks, and these are moved to the notlikely sheet.
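The confidence routing described above can be pictured with a short Python sketch. This is only an illustration of the threshold rule (confidence >= 5 goes to the identification sheet, anything lower to notlikely); the record layout and names are hypothetical, and the duplicate and shoulder-peak checks performed by the macro are not reproduced.

```python
def route_peaks(peaks, threshold=5):
    """Split annotated peaks into (identification, notlikely) by confidence."""
    identification, notlikely = [], []
    for peak in peaks:
        if peak["confidence"] >= threshold:
            identification.append(peak)
        else:
            notlikely.append(peak)
    return identification, notlikely

# Hypothetical annotated peaks.
peaks = [
    {"name": "metabolite A", "confidence": 8},
    {"name": "metabolite B", "confidence": 5},
    {"name": "metabolite C", "confidence": 4},
]
identified, rejected = route_peaks(peaks)
print([p["name"] for p in identified])
print([p["name"] for p in rejected])
```

Peaks that land in the rejected list with a confidence just under the threshold are the ones worth reviewing by hand.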

Metabolite Identification: Process
  Peak Information columns
    A: Neutral exact mass (from mzmatch)
    B: Retention time (from mzmatch) in minutes
    C: Formula from DB with closest match to mass (if within ppm window)
    D: Number of isomers in DB with this exact formula
    E: Metabolite name: best match from DB for this mass and RT
    F: Confidence level according to parameters on the Settings sheet
    G: Records whether the metabolite is in a preferred database (from DB)
    H: Map: the general area of metabolism for this metabolite (usually from KEGG)
       NOTE: column H can be changed by choosing a different header in cell H1
    I: Mass error (in ppm) from nearest match in DB (if within 2 x ppm window)
    J: RT error relative to authentic standard (white) or predicted RT (grey) as % of RT
    K: altppm: mass error for the next closest mass in the DB (if within ppm window)
    L: Sig: records which sample sets are significant (peaks > blank and RSD < window)
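The mass errors in columns I and K are expressed in parts per million. The sketch below shows the usual ppm calculation in Python; the function names, the example masses and the 3 ppm window are illustrative assumptions, not values taken from the template.

```python
def ppm_error(measured_mass, theoretical_mass):
    """Signed mass error in parts per million."""
    return (measured_mass - theoretical_mass) / theoretical_mass * 1e6

def within_window(measured_mass, theoretical_mass, window_ppm):
    """True if the absolute ppm error is inside the given window."""
    return abs(ppm_error(measured_mass, theoretical_mass)) <= window_ppm

# Hypothetical example: a measured neutral mass against a database mass.
measured = 180.06350
database = 180.06339
print(f"{ppm_error(measured, database):.2f} ppm")
print(within_window(measured, database, window_ppm=3.0))
```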

Metabolite Identification: Process
  Peak Information cont.
    M: BP: basepeak for that peak
    N: Mzdiff: mass difference between this peak and the basepeak
       For basepeaks this column records common adducts/fragments/isotopes that were found
    O: Relationship to the basepeak (according to mzmatch)
    P: addfrag: common adduct, fragment or neutral loss
    Q: % error of C13-isotope intensity from theoretical
    R: % error of isotope intensity from theoretical (for Cl, S, N, O or H)
    S: RSD for QC samples (or for Treatment if no QC)
    T: Minimum RSD for all included sample sets
    U: Maximum intensity from all included sets
    V: Relation.id (from mzmatch)
    W: Peak intensity ratio for mean of treatments vs mean of controls
    X: P-value for unpaired t-test between treatments and controls
    Y: Adduct of formula match to mass (i.e. H, Na, double-charge, etc.)
    Z: Polarity
    AA: Number of detected peaks in included sets
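To illustrate what a value in column Q represents, the sketch below estimates the theoretical intensity of the single-13C isotope peak relative to the monoisotopic peak and reports the observed deviation as a percentage. It assumes a natural 13C abundance of about 1.07% and considers carbon only; the template's own calculation may differ, and the intensities are invented.

```python
# Assumed natural abundances: 13C ~ 1.07 %, 12C ~ 98.93 %.
C13_TO_C12_RATIO = 0.0107 / 0.9893

def theoretical_c13_ratio(n_carbons):
    """Expected (single-13C peak) / (monoisotopic peak) intensity ratio."""
    return n_carbons * C13_TO_C12_RATIO

def c13_percent_error(mono_intensity, c13_intensity, n_carbons):
    """Percentage deviation of the observed isotope ratio from theory."""
    observed = c13_intensity / mono_intensity
    expected = theoretical_c13_ratio(n_carbons)
    return (observed - expected) / expected * 100.0

# Hypothetical peak pair for a six-carbon formula.
error = c13_percent_error(mono_intensity=1_000_000.0,
                          c13_intensity=65_000.0,
                          n_carbons=6)
print(f"expected ratio: {theoretical_c13_ratio(6):.4f}")
print(f"error: {error:.2f} %")
```

A large error suggests that the assigned formula has the wrong number of carbons, or that the isotope peak is weak or interfered with.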

Re-calibrate mass accuracy
  Step 1
    On the Settings or Identification sheet, click the "ppm check" button.
    If the polynomial curve looks like a good fit, agree to re-calibrate the masses; otherwise, investigate the mass calibration manually.
  Step 2
    Sort the identification sheet by ppm error (use the blue sort button).
    Remove metabolites with large errors (>1.5 ppm) by cut/paste to the notlikely sheet.
    NOTE: it is easiest to manually annotate all mis-annotated peaks (in column F), re-sort, and move them all at once.
    NOTE: delete rows that have been removed (even if they appear empty) to speed up processing.
    Double-check the altppm column for alternate identifications before you remove peaks.

Manual Data Filtration
  Step 1 - Recover false rejections
    Go to the notlikely sheet and check for false rejections, particularly those with a confidence of 4 (technical judgement required).
    Cut/paste false rejections onto the identification sheet.
  Step 2 - Manual filtration
    On the Identification sheet, check for false positives and move them to the notlikely sheet by cut/paste, or with the "remove row" button.
    Press the colouring button to make interpretation easier.
    Press the hyperlink button to activate weblinks.
    Use the Sort functions, info-boxes, graphs and hyperlinks to assist (columns B, D, K, L, W).
  Step 3 - Manual identification
    On the Identification sheet, check for duplicate identifications, and choose alternative isomers where appropriate.

Manual Data Filtration
  Manual Filtration: suggested process
    1. Related peaks (mass difference, neutral loss)
    2. Retention time limits (min, max, % error)
    3. Adduct likelihood (2+ or Na+)
    4. Isomers (split peaks, duplicate identifications)
    5. Isotopic abundance (C13 isotope, other unique isotopes)
    6. Peak shape (check chromatogram if codadw < 0.95)
    7. Biological likelihood (related pathways, common contaminants)

Biological data analysis
  Step 1
    If you have Exactive pos/neg data, run the "Combine Pos/Neg" function after processing each set individually.
  Step 2
    Run the Intensity comparison macro from the Identification sheet or the Settings sheet by clicking "Compare All Sets".
    This calculates the mean and SD for each set and compares each set to the designated control group (relative intensity and t-test).
  Step 3
    In the Comparison sheet, sort the data by your column of interest:
      Relative intensity vs control
      P-value (t-test) vs control
      Metabolite Map or KEGG Pathway
    Use the buttons at the top to plot graphs or export to motif/metexplore.
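The per-set statistics described in Step 2 can be sketched in Python as follows. This is an illustration, not the macro itself: it assumes an unpaired, equal-variance (Student's) t-test, which may not be the variant the template uses, and it stops at the t statistic and degrees of freedom. Converting these to a P-value needs a t-distribution, for example via scipy.stats.ttest_ind if SciPy is available. The intensities are invented.

```python
from math import sqrt
from statistics import mean, stdev

def compare_to_control(treatment, control):
    """Mean, SD, relative intensity and Student's t statistic vs control."""
    m_t, m_c = mean(treatment), mean(control)
    s_t, s_c = stdev(treatment), stdev(control)
    n_t, n_c = len(treatment), len(control)
    # Pooled variance assumes both groups share the same variance.
    pooled_var = ((n_t - 1) * s_t ** 2 + (n_c - 1) * s_c ** 2) / (n_t + n_c - 2)
    t_stat = (m_t - m_c) / sqrt(pooled_var * (1 / n_t + 1 / n_c))
    return {
        "mean": m_t,
        "sd": s_t,
        "relative_intensity": m_t / m_c,
        "t_statistic": t_stat,
        "degrees_of_freedom": n_t + n_c - 2,
    }

# Hypothetical intensities for one metabolite, three replicates per set.
result = compare_to_control(treatment=[200.0, 220.0, 180.0],
                            control=[100.0, 110.0, 90.0])
print(result)
```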

Multivariate analysis
  Step 1
    This template does not incorporate functionality for multivariate analysis. Use the light blue Export button to export either allbasepeaks or Identifications to Metaboanalyst, or to R/Matlab/etc., for further analysis.
  Step 2
    If you wish to analyse all basepeaks, run the "assign Basepeaks" macro to help with annotation.
  Step 3
    Unidentified masses can be investigated by clicking the empty formula (column C) cell; this will run FormulaGenerator in R.

Other Features
  Additional Macros:
    Isotope Search: for untargeted metabolic labelling studies
      C13, N15 and O18 supported
    Combine Datasets: combines negative and positive data (from same column)
    Formula Generator: identifies formulae for unknown masses (uses rcdk)
      Checks validity of formulae against Fiehn's Golden Rules

Other Features
  Additional Functions (Excel formulas):
    FormulaMatch: looks up a mass in the database
    ExactMass: calculates the exact mass of a formula
    PPMcalc: calculates the mass error from a given mass or formula
    IsotopeAbundance: calculates the theoretical isotopic abundance for a given atom in a formula
    FormulaValid: checks formula validity against 5 Golden Rules
    AtomCount: returns the number of specified atoms in a formula
    Pos: calculates the positive charge at a given pH (given the number of cations and the basic pKa values)
    Neg: calculates the negative charge at a given pH (given the number of anions and the acidic pKa values)
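For readers who want to check values outside Excel, the sketch below shows one way an atom count and a monoisotopic exact mass could be computed from a simple formula string in Python. It is a hypothetical re-implementation for illustration only: it does not reflect the signatures or behaviour of the template's AtomCount and ExactMass functions, it handles only plain formulas without brackets or charges, and the element table is limited to six elements.

```python
import re

# Monoisotopic masses (Da) of the most abundant isotopes; limited table.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.0030740,
    "O": 15.9949146,
    "P": 30.9737620,
    "S": 31.9720707,
}

def atom_count(formula):
    """Return {element: count} for a simple formula such as 'C6H12O6'."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts

def exact_mass(formula):
    """Monoisotopic mass; raises KeyError for elements not in the table."""
    return sum(MONOISOTOPIC[el] * n for el, n in atom_count(formula).items())

print(atom_count("C6H12O6"))
print(f"{exact_mass('C6H12O6'):.5f}")
```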

FAQ
  WHERE TO START... which sheet?
    All automated functions can be run from the Settings sheet. After automated filtration and identification you can do manual curation on the identification sheet, including the mass re-calibration.
    Additional metabolites can be retrieved from the notlikely (or allbasepeaks) sheet simply by using the cut/paste functions in Excel; it is recommended to cut/paste whole rows rather than individual cells.
    The easiest approach for meaningful biochemical analysis is to run the "Compare all" function and sort the Comparison sheet according to your interests.
    Additional columns (e.g. stats, normalised intensities, other information) can always be added to the right of the existing data without affecting macro performance.
  POLARITY:
    The polarity is automatically corrected by mzmatch.r during the peak picking process, and all masses that appear in the template are corrected neutral masses.
    Ensure that you set the correct 'polarity' option on the 'settings' sheet before running anything.
    The polarity setting is also used for combining positive and negative mode data, and for the quicklink to Xcalibur qualbrowser EICs (i.e. it determines whether to add or subtract a proton to get from the neutral mass back to m/z).
    Note: Due to the automatic polarity correction by mzmatch, the masses of cations in the database have been corrected by one proton (e.g. the mass of choline in the DB is 103, rather than its actual mass of 104).
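The proton correction mentioned above can be written out as a small Python sketch. It assumes singly charged ions formed by gaining or losing one proton and uses a proton mass of 1.007276 Da; the function name is hypothetical and other adducts are ignored.

```python
PROTON_MASS = 1.007276  # Da

def neutral_to_mz(neutral_mass, polarity):
    """Convert a corrected neutral mass back to the m/z of [M+H]+ or [M-H]-."""
    if polarity == "pos":
        return neutral_mass + PROTON_MASS
    if polarity == "neg":
        return neutral_mass - PROTON_MASS
    raise ValueError("polarity must be 'pos' or 'neg'")

# Choline is stored in the DB at about 103.0997 (one proton less than the cation),
# so positive mode returns the m/z at which the cation is actually observed.
print(f"{neutral_to_mz(103.0997, 'pos'):.4f}")
```

This is why the polarity setting matters for the EIC quicklink: the same neutral mass maps to two different m/z values depending on the mode.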

FAQ
  WHICH FILE TO USE FOR THE RETENTION TIME UPDATER?
    You need to manually generate a list of retention times for authentic standards under the current LC conditions. The simplest way is to use ToxID (or similar); otherwise do it manually from the raw data.
    The retention time updater has been tested on ToxID .csv output files. However, it should work for any Excel-readable file that has a column for metabolite names and a column for retention times.
    (Note: the metabolite name must be identical to the name in the database. The only exception is that an underscore "_" may be used in place of a comma "," to avoid issues with .csv files.)
  IF IT RUNS SLOWLY?
    The peak-picking process in XCMS is quite slow; it can be left to run overnight if you have many samples.
    The speed of the mzmatch.r functions and Excel macros will depend on the number of samples, the number of detected peaks, and your computer speed.
    Speed can be improved by applying tighter filters earlier in the process (e.g. peak picking parameters and RSD filter); however, this may cause the loss of some peaks of interest.
    Visualisation of results in Excel can be slow if there are many active formulas. Try turning automatic calculation off, de-activating hyperlinks, or running the "Trim file size" macro.

Any Further Questions/Ideas
  mzmatch information available at: Mzmatch.sourceforge.net
  XCMS information available at: metlin.scripps.edu/xcms/
  Information about this mzmatch template available directly from:
    Dr Darren Creek
    University of Glasgow
    darrencreek@gmail.com
    Darren.creek@glasgow.ac.uk