Mass Spectrometry Databases and Data Format of Spectral Multifiles: Difference between pages

__FORCETOC__
Spectral multifiles combine multiple spectra in a single file. These files are stored in a Matlab™-specific data format and contain the spectral data as well as the respective metadata. Spectral multifiles can be loaded in Matlab by entering the following command:


== Introduction ==
>> load('ecoli-filelist-oct16.muf','-mat')


Library-based MS approaches for microbial identification require labeled sets of microbial mass spectra. Starting with version 0.82 MicrobeMS can deal with experimental MALDI-TOF or LC-MS&sup1; mass spectra and the respective MS databases. <br>
This command will open ''ecoli-filelist-oct16.muf'', an example multifile containing 16 individual MALDI-TOF mass spectra acquired from five different strains of ''E. coli''. The file ''ecoli-filelist-oct16.muf'' can be downloaded [http://wiki.microbe-ms.com/upload/ecoli-filelist-oct16.muf: '''here''']. If loading was successful, you will have access to a new Matlab variable ''spec'' (structure array). Details of the structure of ''spec'' are described next.<br> &nbsp; <br>
The RKI databases of microbial MALDI-TOF mass spectra contain mass spectral entries from highly pathogenic (biosafety level 3, BSL-3) bacteria such as ''Bacillus anthracis'', ''Yersinia pestis'', ''Burkholderia mallei'', ''Burkholderia pseudomallei'' and ''Francisella tularensis'' as well as a selection of MALDI-TOF mass spectra from their close and more distant relatives. The RKI mass spectral databases can be used as a reference for the diagnostics of BSL-3 bacteria using proprietary and free software packages for MALDI-TOF MS-based microbial identification. The databases are distributed as zip archives and contain the original mass spectra in their native data format (Bruker Daltonics). The MALDI-TOF MS databases will be updated on a regular basis.<br>
The LC-MS&sup1; database is an ''in silico'' database compiled from UniProt Knowledgebase (UniProtKB/Swiss-Prot and TrEMBL) resources; for details see below.


== MALDI-TOF MS databases ==


The different versions of the RKI biosafety level 3 (BSL-3) MALDI-TOF MS database can be downloaded from the following locations:
'''Fields of the structure array ''spec''''':


  1. [https://zenodo.org/record/7702375 Zenodo database version 4] (20230306):
{| class="wikitable" width=1100
    Lasch P, St&auml;mmler M & Schneider A, (2023). Version 4 (20230306) of the
!width=100| Fields
    MALDI-TOF Mass Spectrometry Database for Identification and Classification of Highly
!width=600| Description
    Pathogenic Microorganisms from the Robert Koch-Institute (RKI).
!width=100| Data type
    Zenodo. [https://zenodo.org/record/7702375 https://zenodo.org/record/7702375]
!width=300|
    Version Mar 06, 2023, creative commons CC BY-NC-SA 4.0 license
|-
| org
| original mass spectra [2 x n array], n: number of data points
| float32
| rowspan="35" style="background: #ffffff;" valign="top" | [[File:Multifile-format-spec-struc.jpg|250px|thumb|center|Matlab screenshot - format of a spectral multifile (*.muf) demonstrating the general structure of the structure array 'spec'. In this example the metadata of spectrum #17 are shown. Spectrum #17 is a data base spectrum which has been created from 8 individual mass spectra (cf. spec(1,17).dbs)]]
|-
| pre
| pre-processed spectra [2 x n array], n: number of data points
| float32
|-
| nam
| spectra id
| string
|-
| gen
| genus information
| string
|-
| spe
| species info
| string
|-
| str
| strain info
| string
|-
| typ
| type
| string
|-
| uid
| taxonomy identification number for species as used by the NCBI (see [https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi])
| integer
|-
| uie
| taxonomy identification number for strains used by the NCBI (see [https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi])
| integer
|-
| gti
| cultivation conditions: growth time
| string
|-
| tem
| cultivation conditions: cultivation temperature
| string
|-
| air
| cultivation conditions: cultivation under aerobic or anaerobic conditions
| string
|-
| med
| cultivation conditions: cultivation medium
| string
|-
| spo
| spore formers (YES or NO)
| string
|-
| con
| sample concentration
| string
|-
| trt
| sample treatment
| string
|-
| ext
| extra information
| string
|-
| las
| laser parameters (power, diameter, frequency, etc.)
| string
|-
| cal
| calibration info
| string
|-
| met
| measurement method
| string
|-
| cus
| customer info
| string
|-
| tim
| date and time of measurement
| string
|-
| pth
| path to spectrum
| string
|-
| pik
| [[#peak table format|peak table]], an array of dimension [4 x npeaks], npeaks: number of peaks
| float32
|-
| cls
| class assignment (valid values are 0,1,2,3 and 4)
| float32
|-
| lms
| MALDI-TOF or LC-MS spectrum? (valid values are 0 [MALDI] and 1 [LC-MS])
| float32
|-
| lst
| formatted text containing the peak table
| char array
|-
| seq
| sequence of pre-processing steps
| string
|-
| smo
| the number of smoothing points (Savitzky-Golay smoothing)
| float32
|-
| bas
| number of intervals used for baseline correction
| float32
|-
| nrm
| normalization parameter (Yes:1, No:0)
| float32
|-
| clb
| calibration parameters (see below for details)
| float32
|-
| red
| data reduction factor (spectral binning)
| string
|-
| cut
| cut in the spectral domain
| string
|-
| mod
| original data modified by cut or red (Yes:1, No:0)
| float32
|-
| prm
| parameters of peak detection
| string
|-
| ccl
| [[#structure array ccl|calibration information]] (see below)
| structure array
|-
| dbs
| [[#structure array dbs|data base spectrum]] (Yes:1, No:0)
| structure array
|-
| avr
| [[#structure array avr|average spectrum]] (Yes:1, No:0)
| structure array
|}


  2. [https://doi.org/10.5281/zenodo.1880975 Zenodo database version 3] (20181130):
    Lasch P, St&auml;mmler M & Schneider A, (2018). Version 3 (20181130) of the
    MALDI-TOF Mass Spectrometry Database for Identification and Classification of Highly
    Pathogenic Microorganisms from the Robert Koch-Institute (RKI).
    Zenodo. [https://doi.org/10.5281/zenodo.1880975 https://doi.org/10.5281/zenodo.1880975]
    Version Nov 30, 2018, creative commons CC BY-NC-SA 4.0 license


  3. [https://www.microbe-ms.com/microbe-ms/refdata/3745_Zenodo_v1.pdf Zenodo database version 2] (20170523):
<span class="mw-headline" id="peak table format">'''Format of peak tables''' (spec.pik):</span>
    Lasch P, St&auml;mmler M & Schneider A, (2017). Version 2 (20170523) of the
    MALDI-TOF Mass Spectrometry Database for Identification and Classification of Highly
    Pathogenic Microorganisms from the Robert Koch-Institute (RKI).
    Zenodo. [http://doi.org/10.5281/zenodo.582602 http://doi.org/10.5281/zenodo.582602]
    Version May 23, 2017, creative commons CC BY-NC-SA 4.0 license


  4. [https://www.microbe-ms.com/microbe-ms/refdata/3280_Zenodo_v1.pdf Zenodo database version 1] (20161027):
    Lasch P, St&auml;mmler M & Schneider A, (2016). A MALDI-TOF Mass Spectrometry
    Database for Identification and Classification of Highly Pathogenic Microorganisms from
    the Robert Koch-Institute (RKI). Zenodo. [http://doi.org/10.5281/zenodo.163517 http://doi.org/10.5281/zenodo.163517]
    Version October 27, 2016, creative commons CC BY-NC-SA 4.0 license


== LC-MS&sup1; databases ==
{| class="wikitable" width=800
!width=100| Fields
!width=700| Description
|-
| spec.pik(1,:) <br> &nbsp; <br>
| m/z positions of the peaks in the peak table <br> &nbsp; <br>
|-
| spec.pik(2,:) <br> &nbsp; <br>
| absolute intensities of these peaks <br> &nbsp; <br>
|-
| spec.pik(3,:) <br> &nbsp; <br>
| weighting factors (the sum of these factors equals 100) <br> &nbsp; <br>
|-
| spec.pik(4,:) <br> &nbsp; <br>
| for single spectra (i.e. neither database nor average spectra): baseline-corrected absolute intensities of the peaks; for average or database spectra: the relative peak frequency
|}


The original concept of microbial identification, i.e. MALDI-TOF MS of cultivated microbial cells followed by spectral distance-based comparison with the entries of a microbial spectral library, has been adapted for LC-MS&sup1;-based microbial identification; see this '''preprint''': Lasch P, Schneider A, Blumenscheit C and Doellinger J, [https://doi.org/10.1101/870089 ''Identification of Microorganisms by Liquid Chromatography-Mass Spectrometry (LC-MS&sup1;) and in silico Peptide Mass Data''], bioRxiv (Dec 10, '''2018'''), doi:10.1101/870089.<br>


  1. Lasch P, Schneider A, Blumenscheit C, Doellinger J. (2019). In silico Database for
<span class="mw-headline" id="structure array ccl">'''Calibration Information''' (spec.ccl):</span>
    Identification of Microorganisms by Liquid Chromatography-Mass Spectrometry (LC-MS&sup1;).
    Zenodo. [https://doi.org/10.5281/zenodo.3573996 https://doi.org/10.5281/zenodo.3573996]
    Version December 13, 2019, creative commons CC BY-NC-SA 4.0 license


Details can be found here: [[Identification Analysis by Means of LC-MS&sup1; and ''in silico'' Databases|Identification analysis by means of LC-MS&sup1; and ''in silico'' databases]]
{| class="wikitable" width=1100
!width=100| Fields
!width=600| Description
!width=100| Type
!width=300|
|-
| cl1
| calibration constant 1
| float32
| rowspan="15" style="background: #ffffff;" valign="top" | [[File:Array-spec-ccl.jpg|250px|thumb|center|Matlab screenshot - format of structure array spec.ccl containing the calibration info, such as calibration constants, delay time, number of spectra data points, etc. for spectrum #1.]]
|-
| cl2
| calibration constant 2
| float32
|-
| cl3
| calibration constant 3
| float32
|-
| del
| delay time [ns]
| float32
|-
| npt
| number of data points
| float32
|-
| res
| time resolution [ns]
| float32
|-
| ncl
| calibration info required to store the spectrum in a Bruker-specific data format
| string
|-
| ncr
| calibration info required to store the spectrum in a Bruker-specific data format
| string
|-
| bid
| hardware id of the spectrum
| string
|-
| org
| manufacturer info
| string
|-
| tfu
| manufacturer info
| string
|-
| tfu
| software info, required for compatibility issues
| string
|-
| spm
| type of instrumentation
| string
|-
| stp
| type of measurement (should be 'TOF')
| string
|-
| acq
| path to the original spectrum
| string
|}
 
 
 
 
<span class="mw-headline" id="structure array dbs">'''Data Base Spectrum''' (spec.dbs):</span>
 
A [[Create database spectra|database spectrum]] is usually created from many (>3) individual mass spectra. The structure array ''spec.dbs'' contains information (metadata, peak tables) on the mass spectra used to produce the given database spectrum. Details of the structure of ''spec.dbs'' are given in the table below.
 
{| class="wikitable" width=1100
!width=100| Fields
!width=600| Description
!width=100| Type
!width=300|
|-
| mem
| string defining if the current spectrum is a data base spectrum (1) or not (0)
| string
| rowspan="5" style="background: #ffffff;" valign="top" |[[File:Array-spec-dbs.jpg|250px|thumb|center|Matlab screenshot - format of structure array spec.dbs. spec(1,17).dbs(1,1) contains information on mass spectrum #1 (such as the id, taxonomic information, peak tables and the respective peak detection parameters), which was used together with others to obtain data base spectrum #17.]]
|-
| ids
| id of the individual mass spectrum used to create the data base spectrum
| string
|-
| tax
| taxonomic info of the source spectrum
| string
|-
| pik
| peak table of the source spectrum
| float32
|-
| prm
| parameters of peak detection
| string
|}
 
 
<span class="mw-headline" id="structure array avr">'''Average Spectrum''' (spec.avr):</span>
 
An [[Averaging Mass Spectra|average spectrum]] is usually created from many (>3) individual mass spectra. The structure array ''spec.avr'' contains information (metadata, peak tables) on the mass spectra used to produce the given average spectrum. Details of the structure of ''spec.avr'' are given in the table below.
 
{| class="wikitable" width=1100
!width=100| Fields
!width=600| Description
!width=100| Type
!width=300|
|-
| mem
| string defining if the current spectrum is a data base spectrum (1) or not (0)
| string
| rowspan="5" style="background: #ffffff;" valign="top" |[[File:Array-spec-avr.jpg|250px|thumb|center|Matlab screenshot - format of structure array spec.avr. spec(1,18).avr(1,1) contains information on mass spectrum #1 (such as the id, taxonomic information, peak tables and the respective peak detection parameters), which was used together with others to obtain average spectrum #18.]]
|-
| ids
| id of the individual mass spectrum used to create the average spectrum
| string
|-
| tax
| taxonomic info of the source spectrum
| string
|-
| pik
| peak table of the source spectrum
| float32
|-
| prm
| parameters of peak detection
| string
|}

Latest revision as of 17:32, 21 March 2023

Spectral multifiles combine multiple spectra in a single file. These files are stored in a Matlab™-specific data format and contain the spectral data as well as the respective metadata. Spectral multifiles can be loaded in Matlab by entering the following command:

>> load('ecoli-filelist-oct16.muf','-mat')

This command will open ecoli-filelist-oct16.muf, an example multifile containing 16 individual MALDI-TOF mass spectra acquired from five different strains of E. coli. The file ecoli-filelist-oct16.muf can be downloaded here. If loading was successful, you will have access to a new Matlab variable spec (structure array). Details of the structure of spec are described next.
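
As a minimal sketch of what can be done right after loading, the snippet below counts the spectra and lists the available fields. To stay runnable without the example download, it first writes a tiny stand-in multifile; the temporary file name and the three demo records are invented for illustration, and real multifiles contain many more fields.

```matlab
% Build a tiny stand-in for a multifile so the sketch runs without the
% example download (hypothetical demo records, not real data).
spec = struct('nam', {'demo-1', 'demo-2'}, ...
              'gen', {'Escherichia', 'Escherichia'}, ...
              'spe', {'coli', 'coli'});
tmpfile = [tempname '.muf'];      % hypothetical temporary file name
save(tmpfile, 'spec', '-mat');    % MAT-file content under a .muf name
clear spec

% '-mat' makes load treat the file as a MAT-file despite its extension.
load(tmpfile, '-mat');

nspec  = numel(spec);             % number of spectra in the multifile
fields = fieldnames(spec);        % field names shared by all spectra
fprintf('%d spectra, %d fields each\n', nspec, numel(fields));
for k = 1:nspec
    fprintf('%2d: %s (%s %s)\n', k, spec(k).nam, spec(k).gen, spec(k).spe);
end
delete(tmpfile);                  % remove the temporary stand-in file
```

With a real multifile, only the load call and the lines after it are needed.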
 


Fields of the structure array spec:

Fields Description Data type
org original mass spectra [2 x n array], n: number of data points float32
Figure: Matlab screenshot - format of a spectral multifile (*.muf) demonstrating the general structure of the structure array 'spec'. In this example the metadata of spectrum #17 are shown. Spectrum #17 is a data base spectrum which has been created from 8 individual mass spectra (cf. spec(1,17).dbs)
pre pre-processed spectra [2 x n array], n: number of data points float32
nam spectra id string
gen genus information string
spe species info string
str strain info string
typ type string
uid taxonomy identification number for species as used by the NCBI (see [1]) integer
uie taxonomy identification number for strains used by the NCBI (see [2]) integer
gti cultivation conditions: growth time string
tem cultivation conditions: cultivation temperature string
air cultivation conditions: cultivation under aerobic or anaerobic conditions string
med cultivation conditions: cultivation medium string
spo spore formers (YES or NO) string
con sample concentration string
trt sample treatment string
ext extra information string
las laser parameters (power, diameter, frequency, etc.) string
cal calibration info string
met measurement method string
cus customer info string
tim date and time of measurement string
pth path to spectrum string
pik peak table, an array of dimension [4 x npeaks], npeaks: number of peaks float32
cls class assignment (valid values are 0,1,2,3 and 4) float32
lms MALDI-TOF or LC-MS spectrum? (valid values are 0 [MALDI] and 1 [LC-MS]) float32
lst formatted text containing the peak table char array
seq sequence of pre-processing steps string
smo the number of smoothing points (Savitzky-Golay smoothing) float32
bas number of intervals used for baseline correction float32
nrm normalization parameter (Yes:1, No:0) float32
clb calibration parameters (see below for details) float32
red data reduction factor (spectral binning) string
cut cut in the spectral domain string
mod original data modified by cut or red (Yes:1, No:0) float32
prm parameters of peak detection string
ccl calibration information (see below) structure array
dbs data base spectrum (Yes:1, No:0) structure array
avr average spectrum (Yes:1, No:0) structure array
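
Because spec is an ordinary structure array, the metadata fields listed above can be used to select spectra with logical masks. The following sketch works on a synthetic stand-in for a loaded multifile; the field names come from the table, while all values are invented for illustration.

```matlab
% Synthetic stand-in for a loaded multifile (invented values).
spec = struct( ...
    'nam', {'s1', 's2', 's3'}, ...
    'gen', {'Escherichia', 'Bacillus', 'Escherichia'}, ...
    'spe', {'coli', 'cereus', 'coli'}, ...
    'lms', {single(0), single(0), single(1)}, ...
    'pre', {single(rand(2, 100)), single(rand(2, 80)), single(rand(2, 50))});

% Logical masks over the structure array
isEcoli = strcmp({spec.gen}, 'Escherichia') & strcmp({spec.spe}, 'coli');
isMaldi = [spec.lms] == 0;        % 0: MALDI-TOF, 1: LC-MS (see table)

sel = spec(isEcoli & isMaldi);    % sub-array holding the matching spectra
for k = 1:numel(sel)
    npts = size(sel(k).pre, 2);   % pre is a [2 x n] array
    fprintf('%s: %d data points\n', sel(k).nam, npts);
end
```

The same pattern applies to any of the string or numeric fields, for example cls or med.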


Format of peak tables (spec.pik):


Fields Description
spec.pik(1,:) m/z positions of the peaks in the peak table
spec.pik(2,:) absolute intensities of these peaks
spec.pik(3,:) weighting factors (the sum of these factors equals 100)
spec.pik(4,:) for single spectra (i.e. neither database nor average spectra): baseline-corrected absolute intensities of the peaks; for average or database spectra: the relative peak frequency
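
The row layout of the peak table can be illustrated with a short sketch that extracts the m/z positions and weighting factors, checks that the weights sum to 100, and ranks the peaks by weight. The peak values are invented, and the tolerance used in the check is an arbitrary choice for this example.

```matlab
% Invented peak table of a single spectrum: 4 rows x npeaks columns.
pik = single([4365.2 5096.8 6255.4 9742.1; ... row 1: m/z positions
              1200    800   2500    500;   ... row 2: absolute intensities
                24     16     50     10;   ... row 3: weighting factors
              1100    750   2400    450]);   % row 4: baseline-corrected intensities

mz      = pik(1, :);
weights = pik(3, :);

% The weighting factors are documented to sum to 100.
assert(abs(sum(weights) - 100) < 1e-3, 'weights do not sum to 100');

% Rank peaks by weight and report the three heaviest ones.
[~, order] = sort(weights, 'descend');
top = order(1:min(3, numel(order)));
for k = top
    fprintf('m/z %8.1f  weight %5.1f\n', mz(k), weights(k));
end
```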


Calibration Information (spec.ccl):

Fields Description Type
cl1 calibration constant 1 float32
Figure: Matlab screenshot - format of structure array spec.ccl containing the calibration info, such as calibration constants, delay time, number of spectra data points, etc. for spectrum #1.
cl2 calibration constant 2 float32
cl3 calibration constant 3 float32
del delay time [ns] float32
npt number of data points float32
res time resolution [ns] float32
ncl calibration info required to store the spectrum in a Bruker-specific data format string
ncr calibration info required to store the spectrum in a Bruker-specific data format string
bid hardware id of the spectrum string
org manufacturer info string
tfu manufacturer info string
tfu software info, required for compatibility issues string
spm type of instrumentation string
stp type of measurement (should be 'TOF') string
acq path to the original spectrum string
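
The fields del, npt and res are enough to sketch a flight-time axis for a spectrum. The snippet below assumes evenly spaced sampling that starts at the delay time; this relation is an assumption made for illustration and is not stated on this page. The conversion from flight time to m/z via cl1 to cl3 is deliberately left out, because its formula is not given here. All numeric values are invented.

```matlab
% Invented calibration record; field names follow the table above.
ccl = struct('del', single(19000), ...   % delay time [ns]
             'npt', single(5), ...       % number of data points
             'res', single(2), ...       % time resolution [ns]
             'stp', 'TOF');              % type of measurement

if ~strcmp(ccl.stp, 'TOF')
    warning('unexpected measurement type: %s', ccl.stp);
end

% Assumed relation: t(i) = del + i * res for i = 0 .. npt-1
npt = double(ccl.npt);
tof = double(ccl.del) + (0:npt-1) * double(ccl.res);   % flight times [ns]
fprintf('time axis: %g ns to %g ns, %d points\n', tof(1), tof(end), npt);
```

The values are converted to double before the arithmetic so that the axis is not limited to single precision.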



Data Base Spectrum (spec.dbs):

A database spectrum is usually created from many (>3) individual mass spectra. The structure array spec.dbs contains information (metadata, peak tables) on the mass spectra used to produce the given database spectrum. Details of the structure of spec.dbs are given in the table below.

Fields Description Type
mem string defining if the current spectrum is a data base spectrum (1) or not (0) string
Figure: Matlab screenshot - format of structure array spec.dbs. spec(1,17).dbs(1,1) contains information on mass spectrum #1 (such as the id, taxonomic information, peak tables and the respective peak detection parameters), which was used together with others to obtain data base spectrum #17.
ids id of the individual mass spectrum used to create the data base spectrum string
tax taxonomic info of the source spectrum string
pik peak table of the source spectrum float32
prm parameters of peak detection string


Average Spectrum (spec.avr):

An average spectrum is usually created from many (>3) individual mass spectra. The structure array spec.avr contains information (metadata, peak tables) on the mass spectra used to produce the given average spectrum. Details of the structure of spec.avr are given in the table below.

Fields Description Type
mem string defining if the current spectrum is a data base spectrum (1) or not (0) string
Figure: Matlab screenshot - format of structure array spec.avr. spec(1,18).avr(1,1) contains information on mass spectrum #1 (such as the id, taxonomic information, peak tables and the respective peak detection parameters), which was used together with others to obtain average spectrum #18.
ids id of the individual mass spectrum used to create the average spectrum string
tax taxonomic info of the source spectrum string
pik peak table of the source spectrum float32
prm parameters of peak detection string
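
Since spec.dbs and spec.avr share the same fields, one loop can list the source spectra behind either kind of spectrum. The sketch below uses an invented database spectrum with four source entries; the mem and prm fields are omitted for brevity, and how empty or absent entries appear in real multifiles is an assumption here.

```matlab
% Invented source-spectrum entries (mem and prm omitted for brevity).
src = struct( ...
    'ids', {'run-01', 'run-02', 'run-03', 'run-04'}, ...
    'tax', {'Escherichia coli', 'Escherichia coli', ...
            'Escherichia coli', 'Escherichia coli'}, ...
    'pik', {single(rand(4, 12)), single(rand(4, 9)), ...
            single(rand(4, 15)), single(rand(4, 11))});

spec = struct('nam', 'db-demo');  % hypothetical database spectrum
spec.dbs = src;                   % the same layout applies to spec.avr

for name = {'dbs', 'avr'}
    f = name{1};
    if ~isfield(spec, f) || isempty(spec.(f))
        continue                  % field absent or no source spectra recorded
    end
    members = spec.(f);
    % Each source peak table is [4 x npeaks]; count its columns.
    npeaks = arrayfun(@(m) size(m.pik, 2), members);
    fprintf('%s: %d source spectra\n', f, numel(members));
    for k = 1:numel(members)
        fprintf('  %s (%s): %d peaks\n', ...
                members(k).ids, members(k).tax, npeaks(k));
    end
end
```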