The document discusses data standards for MS (mass spectrometry) and NMR (nuclear magnetic resonance) in metabolomics. It begins with an overview of the metabolomics workflow and data processing pipelines. It then discusses several existing data standards including netCDF and mzML, providing examples of how they encode metadata and spectral data. The document notes tools for working with these formats like OpenMS and ProteoWizard, and libraries for parsing the formats in different programming languages. Finally, it discusses challenges in data preprocessing and peak picking for different MS instruments and formats.
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
1. MS (and NMR) data standards in Metabolomics
why, how and some caveats
Steffen Neumann
Leibniz Institute of Plant Biochemistry
ScienceCampus Halle (WCH)
June 23, 2014
S. Neumann (IPB-Halle.DE) (Raw) data standards in metabolomics June 23, 2014
2. Metabolomics – The Pipeline
3. IPB machine Park
Data processing from:
LC-QqTOF-MS: QStar Pulsar i, microTOF Q
Bruker Apex (FTICR)
HCT Ultra (IT-MS, CID+ETD)
Reflex III (MALDI-TOF)
Thermo Finnigan: Quantum Ultra AM, LCQ Deca XP
4. netCDF: Grandfather is still alive
netCDF as file format, ANDI-MS as content specification
fine for GC/MS and simple LC/MS
widely supported in software and programming languages
no mix of MS and MS/MS
very poor metadata
Defined in Standard: “ASTM E1947 – 98(2009) Standard Specification for Analytical Data Interchange Protocol for Chromatographic Data”
available for only $42
12. www.openms.de
Originally for MS-based Proteomics
Reads mzData, mzXML, mzML
NetCDF (Not on 64bit!)
FileInfo, FileConverter, FileFilter, ...
plus Calibration, Merge, NoiseFilter, . . .
TOPPView Viewer and GUI
⇒ Very useful for preprocessing
M. Sturm, A. Bertsch, C. Gröpl, A. Hildebrandt, R. Hussong, E. Lange, N. Pfeifer, O. Schulz-Trieglaff, A. Zerck,
K. Reinert, O. Kohlbacher, 2008. OpenMS – an Open-Source Software Framework for Mass Spectrometry
BMC Bioinformatics doi:10.1186/1471-2105-9-163.
13. http://proteowizard.sourceforge.net/
Originally for MS-based Proteomics
cross-platform (MSVC on Windows, gcc on Linux, XCode on OSX)
open source (Apache v2)
Formats supported on all platforms: mzML, mzXML, MGF
Formats supported on Windows with vendor libraries installed:
Thermo RAW, Waters RAW, Bruker FID/YEP/BAF
msconvert: conversion tool.
msdiff: validation of conversion/preprocessing
msaccess: command-line access to binary data and metadata,
EICs & pseudo-2D gel image creation
SeeMS: interactive viewer for mass spec data files (Windows only)
Chambers, Maclean, Burke, Amodei, Ruderman, Neumann, Gatto, Fischer, Pratt, Egertson, Hoff, Kessner,
Tasman, Shulman, Frewen, Baker, Brusniak, Paulse, Creasy, Flashner, Kani, Moulding, Seymour, Nuwaysir,
Lefebvre, Kuhlmann, Roark, Rainer, Gerd, Hemenway, Huhmer, Langridge, Eckels, Connolly, Stearns,
Deutsch, Katz, Agus, MacCoss, Tabb, Mallick. A cross-platform toolkit for mass spectrometry and proteomics.
14. Converters: Notes
https://xcmsonline.scripps.edu/docs/fileformats.html
Bruker:
Calibration requires setting a specific Registry Key:
HKEY_CURRENT_USER\Software\Bruker Daltonik\CompassXport
UseRecalibratedSpectra=1
Waters:
No support for calibration in Waters DLL used by msconvert
DataBridge writes netCDF only, and writes calibrated data
Ancient massWolf requires a full MassLynx installation; it uses
calibrated data, but intermingles LockMass scans
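The converter notes above mostly come down to how msconvert is called. A minimal sketch that only assembles such a command line and does not run it: `build_msconvert_cmd`, the file name `sample.d` and the output directory are hypothetical, while `--mzML`, `--zlib`, `--filter` and `-o` are assumed to behave as in the ProteoWizard documentation. Vendor input still needs Windows with the vendor libraries installed.

```python
import shlex


def build_msconvert_cmd(raw_file, outdir, centroid=False, compress=True):
    """Assemble an msconvert command line (hypothetical helper, nothing is executed)."""
    cmd = ["msconvert", raw_file, "--mzML", "-o", outdir]
    if compress:
        cmd.append("--zlib")  # zlib-compress the binary data arrays
    if centroid:
        # prefer vendor peak picking on all MS levels (needs vendor libraries)
        cmd += ["--filter", "peakPicking true 1-"]
    return cmd


cmd = build_msconvert_cmd("sample.d", "converted", centroid=True)
print(shlex.join(cmd))
```

Keeping the command as a list makes it safe to hand to `subprocess.run` later without shell quoting problems.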
15. Plumbing: libraries for mzML
pymzML (Python) http://pymzml.github.io/
jmzML (Java) https://code.google.com/p/jmzml/
OpenMS (C++) https://www.openms.de/
Proteowizard (C++) http://proteowizard.sourceforge.net/
mzR (R/Bioconductor) http://www.bioconductor.org/packages/release/bioc/html/mzR.html
. . . and many more!
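Whatever the language, these libraries end up doing the same plumbing for the spectral data: an mzML binary data array is base64 text, optionally zlib-compressed, holding 32- or 64-bit little-endian floats. A minimal standard-library sketch of that step; the function names are hypothetical and reading precision and compression from the CV terms of the file is left out.

```python
import base64
import struct
import zlib


def decode_binary_array(text, bits=64, compressed=True):
    """Decode one base64 (optionally zlib-compressed) float array."""
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    fmt = "d" if bits == 64 else "f"
    count = len(raw) // struct.calcsize(fmt)
    return list(struct.unpack("<%d%s" % (count, fmt), raw))


def encode_binary_array(values, bits=64, compressed=True):
    """Inverse of decode_binary_array, used here to build test data."""
    fmt = "d" if bits == 64 else "f"
    raw = struct.pack("<%d%s" % (len(values), fmt), *values)
    if compressed:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


mz = [380.0, 381.25, 382.5]
text = encode_binary_array(mz)
print(decode_binary_array(text))
```

In practice one of the libraries listed above is the better choice, since they also handle the metadata, indexing and the other formats.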
16. MS and Metabolomics in BioC
Collection of biology-related R packages
Started back in 2002
Current release: >500 packages!
Package          Maintainer                Title
mzR              Gatto, me, Fischer        parser for netCDF, mzXML, mzData and mzML
xcms             Ralf Tautenhahn           LC/MS and GC/MS Data Analysis
MassSpecWavelet  Pan Du                    Mass spectrum processing by wavelet-based algorithms
CAMERA           Carsten Kuhl              Collection of Annotation related MEthods for mass spectRometry dAta
Rdisop           Steffen Neumann           Decomposition of Isotopic Patterns
MSnbase          Laurent Gatto             Base Functions and Classes for MS-based Proteomics
iontree          Mingshu Cao               Data management and analysis of ion trees from ion-trap MS
rpubchem         Rajarshi Guha             Interface to the PubChem Collection
KEGGSOAP         R. Gentleman              client interface to the KEGG SOAP server
apComplex        D. Scholtens              Estimate protein complex membership using AP-MS protein data
PROcess          X. Li                     Ciphergen SELDI-TOF Processing
simulatorAPMS    Tony Chiang               Computationally simulates the AP-MS technology
TargetSearch     Cuadros-Inostroza et al.  Analysis of GC-MS metabolite profiling data
flagme           Mark Robinson             Analysis of Metabolomics GC/MS Data
17. LC-MS Data preprocessing with XCMS
www.bioconductor.org
Import: netCDF, mzXML, mzData, mzML
Peak detection
Peak alignment
Peak integration
“Differential” metabolites
Compatible with all MS instruments at the IPB
Lange, Tautenhahn, Neumann, Gröpl. Critical assessment of alignment procedures for LC-MS proteomics and
metabolomics measurements. BMC Bioinformatics (2008)
18. FTICR Peak Picking
Bioconductor package “MassSpecWavelet”
Integration into XCMS:
Same annotation and identification
Same statistics
(Same database schema)
[Figure: a) MS raw spectrum (intensity vs. m/z value, 380 to 384); b) CWT coefficients (CWT coefficient scale vs. m/z value); c) Identified peaks with SNR > 3 (intensity vs. m/z value)]
Student project by Sebastian Wolf & Michael Gerlich. Du, Kibbe, Lin: Peak Detection of Mass Spectrometry Spectrum by Continuous Wavelet Transform based Pattern Matching, Bioinformatics (2008)
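To make the three panels (raw spectrum, CWT coefficients, peaks with SNR > 3) concrete, here is a much simplified sketch in plain Python. It is not the MassSpecWavelet algorithm, which follows ridge lines across many scales; this version uses a single Mexican-hat scale, and the median-based noise estimate, the function names and the synthetic spectrum are assumptions made for illustration.

```python
import math
import statistics


def ricker(a):
    """Mexican-hat (Ricker) wavelet of integer width `a`, sampled on 10*a + 1 points."""
    half = 5 * a
    norm = 2.0 / (math.sqrt(3.0 * a) * math.pi ** 0.25)
    return [norm * (1.0 - (x / a) ** 2) * math.exp(-(x * x) / (2.0 * a * a))
            for x in range(-half, half + 1)]


def cwt_row(signal, a):
    """CWT coefficients at one scale by direct (zero-padded) convolution."""
    wavelet = ricker(a)
    half = len(wavelet) // 2
    row = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(wavelet):
            k = i + j - half
            if 0 <= k < len(signal):
                acc += signal[k] * w
        row.append(acc)
    return row


def pick_peaks(signal, a=3, snr=3.0):
    """Indices of local CWT maxima whose coefficient exceeds snr * noise."""
    row = cwt_row(signal, a)
    # assumed noise model: median absolute coefficient at this scale
    noise = statistics.median([abs(c) for c in row]) or 1e-12
    return [i for i in range(1, len(row) - 1)
            if row[i - 1] < row[i] >= row[i + 1] and row[i] / noise > snr]


# Synthetic spectrum: one Gaussian peak at index 50 on a small deterministic ripple.
spectrum = [100.0 * math.exp(-((i - 50) ** 2) / 18.0) + 0.1 * ((i * 37) % 7 - 3)
            for i in range(101)]
peaks = pick_peaks(spectrum)
print(peaks)
```

Because the wavelet has zero mean, a constant baseline drops out of the coefficients, which is one reason the approach suits FTICR spectra without prior baseline correction.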
19. Plumbing: mzR for MS raw data
New in BioC 2.10 (Oct 2011)
Joint work Fischer/Gatto/Neumann
Conglomerate of former XCMS code, ISB Ramp,
Proteowizard via Rcpp
Read netCDF, mzXML, mzData, mzML (mz5 soon ?)
Read mzIdentML mzQuantML one day ?
To become the affyIO of MS data ?!
GSoC project 2014 to improve mzR
mzR
mzRramp
mzRpwiz
mzRnetCDF
20. imzML: imaging mass spectrometry in mzML
Huge data files, complex access patterns
imzML: same ol' mzML, but the data that mzML stores as base64
goes into a 2nd (binary) data file
Some new CV terms
faster access
7/8 space reduction
lossless mzML ↔ imzML
http://www.imzml.org
⇒ Open MS imaging software!
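The "2nd data file" idea can be sketched with the standard library: the XML part only needs an offset and a length per array, and the values sit as raw binary in a companion file that can be read with a single seek. The file name, the float32 little-endian layout and both helper names are assumptions for illustration; a real imzML file declares data types and offsets through CV terms, and the leading UUID of the binary file is not modelled here.

```python
import os
import struct
import tempfile


def write_arrays(path, arrays):
    """Write float32 arrays back to back; return (offset, length) per array."""
    index = []
    with open(path, "wb") as fh:
        for values in arrays:
            index.append((fh.tell(), len(values)))
            fh.write(struct.pack("<%df" % len(values), *values))
    return index


def read_array(path, offset, length):
    """Random access: seek to one array without reading the rest of the file."""
    with open(path, "rb") as fh:
        fh.seek(offset)
        return list(struct.unpack("<%df" % length, fh.read(4 * length)))


with tempfile.TemporaryDirectory() as tmp:
    ibd = os.path.join(tmp, "example.ibd")
    index = write_arrays(ibd, [[1.0, 2.0, 3.0], [10.5, 20.5]])
    second = read_array(ibd, *index[1])
print(index, second)  # [(0, 3), (12, 2)] [10.5, 20.5]
```

This is where the faster access comes from: reading one pixel's spectrum no longer requires parsing or base64-decoding anything before it.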
Schramm T, Hester A, Klinkert I, Both J-P, Heeren RMA, Brunelle A, Laprévote O, Desbenoit N, Robbe M-F,
Stoeckli M, Spengler B, Römpp A (2012) imzML - A common data format for the flexible exchange and
processing of mass spectrometry imaging data. J. of Proteomics, doi:10.1016/j.jprot.2012.07.026
21. mz5: netCDF meets mzML
Convert from XML to HDF5
HDF5: big cousin of netCDF
Pros:
size reduction 54%
read/write speed 3–4-fold
Fully implemented in pwiz
HDF5 API for most languages
Cons:
Not human-readable
Kills emacs and wordpad
mz5: Space- and Time-efficient Storage of Mass Spectrometry Data Sets
M. Wilhelm, M. Kirchner, J. Steen, H Steen, MCP 10.1074/mcp.O111.011379
22. Focus of standards in NMR
D2.6 Metadata: ISAtab
D2.4 Raw data: nmrML
Metabolite Identification: mzTab
Metabolite Quantification: mzTab
23. Capture NMR raw data (equivalent to mzML)
Ingredients for nmrML standard:
● XML Schema and controlled vocabulary (CV)
● Examples, converters and validation suite
● COSMOS partners involved: IPB, EMBL-EBI, UB2, UBHAM, UOXF, IMPERIAL, MRC, Mike Wilson (Canada), Matthias Klein (D), Ian Lewis (US)
New format: nmrML
D2.4 Raw data: nmrML
24. github.com as development platform
● Web site with content management: http://nmrml.org/
● Version control system, issue tracker, activity statistics
● Free for open source projects
nmrML infrastructure
D2.4 Raw data: nmrML
25. nmrML Ontology
● Controlled vocabulary developed as OWL ontology
● Based on earlier work by MSI, D. Rubtsov and J. Cruz
● ISAtab can leverage ontologies
● With semantic web / RDF / SPARQL in mind for later deliverables
D2.4 Raw data: nmrML
26. The need for an open nmr standard
nmrML: an XML-based open standard for
NMR data storage and exchange
NMR data is currently accumulating in local data silos, hindering distribution and secondary data usage. Cross platform NMR data access, integration and
comparison is hindered by incompatible vendor formats and the lack of a robust vendor-agnostic NMR data standard. Data in proprietary data formats
ages fast, posing the danger of irreproducible data from older studies. An open vendor-neutral storage standard is needed as long-term archival format,
if emerging metabolomics repositories are to capture data from all vendor formats in a persistent way, yet supporting the dynamics in this field.
Parsers
To ease format conversions we deliver parsers for Bruker and Varian data formats, which can be
incorporated into open NMR processing and analysis software.
Although raw data capture is already well covered, the XSD and CV will be expanded to better
cover processed data and quantification data. Our standard is accepted by major
open source NMR data processing tools and will serve the MetaboLights repository as a
stable storage format.
Daniel Schober 1, Michael Wilson2, Daniel Jacob3, Annick Moing3, Catherine Deborde3, Luis de Figueiredo4, Kenneth Haug4,
Philippe Rocca-Serra5, John Easton6, Christian Ludwig7, Antonio Rosato8, David Wishart2, Christoph Steinbeck4, Reza Salek4, Steffen Neumann1
1Leibniz Institute of Plant Biochemistry, Dept. of Stress and Developmental Biology, Weinberg 3, 06120 Halle, Germany
2Department of Computing/Biological Sciences, University of Alberta, Edmonton, Canada
3INRA, Univ. Bordeaux, Metabolome Facility of Bordeaux Functional Genomics Center, 71 av Edouard Bourlaux, F-33140 Villenave d’Ornon, France
4European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
5University of Oxford, e-Research Centre, 7 Keble Road, Oxford, OX1 3QG, UK
6School of Electronic, Electrical and Computer Engineering, University of Birmingham, Edgbaston, Birmingham, B15 2TT, UK
7School of Cancer Sciences, University of Birmingham, Edgbaston, Birmingham, B15 2TT, UK
8Magnetic Resonance Center (CERM), University of Florence, 50019 Sesto Fiorentino (FI), Italy
nmrML XML schema excerpt nmrML data example nmrML use cases
The COordination of Standards in MetabOlomicS (COSMOS) EU consortium has teamed up with the
Metabolomics Standards Initiative (MSI) to create an open exchange and storage format for NMR
data. We largely follow design principles already established in the Proteomics Standards Initiative
(PSI) for the mzML data standard for mass spectrometry. The standard is composed of an XML
schema (nmrML.xsd) and an accompanying controlled vocabulary (nmrCV.owl). This split keeps the
schema robust and easy to update: the more variable and dynamic descriptors live in the
vocabulary, which is referenced from within an nmrML file.
•Website: http://www.nmrML.org
•Github: https://github.com/nmrML/nmrML
•nmrML validator: http://msbi.ipb-halle.de/nmrML/index.php
•Cosmos: http://www.cosmos-fp7.eu/
•Email: info@nmrml.org
•Google Group: https://groups.google.com/forum/?hl=en#!forum/nmrml/join
Data from a paper: Farag, M., Porzel, A., Schmidt, J. & Wessjohann, L. Metabolite profiling and
fingerprinting of commercial cultivars of Humulus lupulus L. (hop) - a comparison of MS and
NMR methods in metabolomics, Metabolomics 8, 492-507, (2012)
<nmrML xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://nmrml.org/schema ../../../xml-schemata/nmrML.xsd"
  xmlns="http://nmrml.org/schema" version="1.0.0">
  <cvList count="2">
    <cv fullName="nmrML Controlled Vocabulary" version="0.0.1" id="NMRCV"
      URI="http://www.nmrml.org/nmrml-cv.0.0.1.owl"/>
    <cv fullName="Unit Ontology" version="3.2.0" id="UO"
      URI="http://unit-ontology.googlecode.com/svn/trunk/uo.owl/"/>
  </cvList>
  <contactList>
    <contact id="ID004" fullname="Lutger A. Wessjohann" email="Ludger.Wessjohann [a] ipb-halle.de"/>
    <contact id="ID044" fullname="Mohamed A. Farag" email="mfarag73 [a] yahoo.com"/>
  </contactList>
  <sourceFileList count="2">
    <sourceFile sha1="fd99c095046e2356c7d31154d45353fa79cbc844"
      location="file:///Users/mike/Projects/nmrML/nmrML/examples/IPB_HopExample/FIDs/FAM013_AHTM.PROTON_04.fid/procpar"
      id="SOURCE_FILE_0" name="procpar">
      <cvTerm cvRef="NMRCV" accession="NMR:1400297" name="Varian VNMR Format"/>
      <cvTerm cvRef="NMRCV" accession="NMR:1002006" name="acquisition parameter file"/>
    </sourceFile>
    <sourceFile sha1="e4ffeb41da28b1e9017e72819252ec6d78f8179f"
      location="file:///Users/mike/Projects/nmrML/nmrML/examples/IPB_HopExample/FIDs/FAM013_AHTM.PROTON_04.fid/fid"
      id="SOURCE_FILE_1" name="fid">
      <cvTerm cvRef="NMRCV" accession="NMR:1400297" name="Varian VNMR Format"/>
      <cvTerm cvRef="NMRCV" accession="NMR:1400119" name="FID file"/>
    </sourceFile>
  </sourceFileList>
  <softwareList count="1">
    <software cvRef="NMRCV" accession="NMR:1000277" name="VnmrJ software" version="2.2C"
      id="SOFTWARE_1"/>
  </softwareList>
  <instrumentConfigurationList count="4">
    <instrumentConfiguration id="INST_CONFIG_1">
      <cvTerm cvRef="NMRCV" accession="NMR:1400234" name="Varian NMR instrument"/>
      <cvTerm cvRef="NMRCV" accession="NMR:1000235" name="Varian probe"/>
      <cvTerm cvRef="NMRCV" accession="NMR:1400234" name="Varian NMR instrument"/>
      <cvTerm cvRef="NMRCV" accession="NMR:1000236" name="5mm HCN probe"/>
    </instrumentConfiguration>
  </instrumentConfigurationList>
  <acquisition>
    <acquisition1D>
      <acquisitionParameterSet numberOfScans="160" numberOfSteadyStateScans="0">
        <sampleAcquisitionTemperature unitName="kelvin" unitCvRef="UO" value="299.15"
          unitAccession="UO:0000012"/>
        <spinningRate unitName="hertz" unitCvRef="UO" value="0" unitAccession="UO:0000106"/>
        <relaxationDelay unitName="second" unitCvRef="UO" value="22.2737024"
          unitAccession="UO:0000010"/>
        <pulseSequence/>
        <DirectDimensionParameterSet numberOfDataPoints="65536" decoupled="false">
          <acquisitionNucleus cvRef="NMRCV" accession="NMR:1400151" name="1H"/>
          <gammaB1PulseFieldStrength unitName="hertz" unitCvRef="UO" value="34482.7586207"
            unitAccession="UO:0000106"/>
          <irradiationFrequency unitName="hertz" unitCvRef="UO" value="599.8311617"
            unitAccession="UO:0000106"/>
        </DirectDimensionParameterSet>
      </acquisitionParameterSet>
      <fidData byteFormat="Complex128" encodedLength="324160"
        compressed="true">eJwMl4dfzl8Ux7U3lYZKy0qiomQ […]</fidData>
    </acquisition1D>
  </acquisition>
</nmrML>
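Reading such a file needs little more than an XML parser. A minimal sketch with the Python standard library, using a cut-down document with the same element names as the example: the interpretation of `Complex128` as interleaved real/imaginary little-endian 64-bit floats, zlib-compressed and base64-encoded, is an assumption to be checked against the nmrML schema documentation, and `encode_fid` / `decode_fid` are hypothetical helpers.

```python
import base64
import struct
import xml.etree.ElementTree as ET
import zlib

NS = {"n": "http://nmrml.org/schema"}


def encode_fid(points):
    """Pack complex points as float64 pairs, then zlib + base64 (assumed layout)."""
    flat = [part for p in points for part in (p.real, p.imag)]
    raw = struct.pack("<%dd" % len(flat), *flat)
    return base64.b64encode(zlib.compress(raw)).decode("ascii")


def decode_fid(text, compressed=True):
    """Inverse of encode_fid: base64 -> (zlib) -> list of complex numbers."""
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    flat = struct.unpack("<%dd" % (len(raw) // 8), raw)
    return [complex(re, im) for re, im in zip(flat[0::2], flat[1::2])]


doc = """<nmrML xmlns="http://nmrml.org/schema" version="1.0.0">
  <acquisition><acquisition1D>
    <acquisitionParameterSet numberOfScans="160" numberOfSteadyStateScans="0">
      <sampleAcquisitionTemperature unitName="kelvin" value="299.15"/>
    </acquisitionParameterSet>
    <fidData byteFormat="Complex128" compressed="true">%s</fidData>
  </acquisition1D></acquisition>
</nmrML>""" % encode_fid([1 + 2j, -0.5 + 0.25j])

root = ET.fromstring(doc)
params = root.find(".//n:acquisitionParameterSet", NS)
temp = root.find(".//n:sampleAcquisitionTemperature", NS)
fid_node = root.find(".//n:fidData", NS)
fid = decode_fid(fid_node.text, fid_node.get("compressed") == "true")
print(params.get("numberOfScans"), temp.get("value"), fid)
```

Note that every lookup has to carry the `http://nmrml.org/schema` namespace, since the example declares it as the default namespace.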
MetaboLights
The nmrML setup
Validators
We also deliver a content validator which checks that a data file is syntactically well-formed and sufficiently complete, and that
minimal information requirements such as the Core Information for Metabolomics Reporting (CIMR) are met.
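The layered checks of such a validator can be sketched as follows. This is not the nmrML validator itself, which works from the XSD and a rule set; the `REQUIRED` list and the three layers below are hypothetical stand-ins chosen to show the idea with the standard library only.

```python
import xml.etree.ElementTree as ET

NS = "{http://nmrml.org/schema}"
REQUIRED = ["cvList", "acquisition"]  # hypothetical minimal rule set


def validate(text):
    """Return a list of problems; an empty list means all layers passed."""
    try:
        root = ET.fromstring(text)  # layer 1: well-formed XML
    except ET.ParseError as err:
        return ["not well-formed: %s" % err]
    problems = []
    for name in REQUIRED:  # layer 2: required top-level elements present
        if root.find(NS + name) is None:
            problems.append("missing element: " + name)
    declared = {cv.get("id") for cv in root.iter(NS + "cv")}
    for term in root.iter(NS + "cvTerm"):  # layer 3: CV references resolve
        if term.get("cvRef") not in declared:
            problems.append("unknown cvRef: %s" % term.get("cvRef"))
    return problems


good = ('<nmrML xmlns="http://nmrml.org/schema"><cvList><cv id="NMRCV"/></cvList>'
        '<acquisition><cvTerm cvRef="NMRCV"/></acquisition></nmrML>')
bad = ('<nmrML xmlns="http://nmrml.org/schema">'
       '<acquisition><cvTerm cvRef="UO"/></acquisition></nmrML>')
print(validate(good), validate(bad))
```

Stopping at the first layer when the XML is not well-formed mirrors the onion picture: the outer checks only make sense once the inner ones pass.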
Outlook Project resources
nmrML setup
•MetaboLights: http://www.ebi.ac.uk/metabolights/
•MSI: http://msi-workgroups.sourceforge.net/
•CIMR-MI: http://mibbi.sourceforge.net/projects/CIMR.shtml
[Figures: Validation Layer Onion; Validation webservice & result; Validation rules (html)]
27. My pleas for the future
. . . to the vendors:
Please start (or continue!) to support Open Data formats
. . . to the computational mass spec community:
Please use (and improve!) joint data I/O libraries
. . . to YOU (the users):
Please start (or continue!) to REQUEST
open formats when inviting to bid for a new instrument