M Nafati, M Samson, B Rossi
data reduction, fuzzy logic thresholding, mass spectrometry (MS), multi-resolution, proteomics
M Nafati, M Samson, B Rossi. Multi-scale Data Reduction Algorithm of Proteomic Mass Spectrum. The Internet Journal of Academic Physician Assistants. 2005 Volume 5 Number 1.
Proteomics is a field that makes it possible to connect the genome sequence with cellular behaviour. Proteomic analysis proceeds in several stages: sample preparation, protein separation, analysis by mass spectrometry, preprocessing (data mining) and querying of databases. Mass spectrometry measures the masses of peptides (typically obtained by tryptic digestion). These masses are then compared with the theoretical masses stored in databases in order to identify the protein. Electronic and chemical noise is often the source of misidentification. In this article, we propose an original data reduction algorithm that removes the spectrum baseline, then removes parasitic mass peaks and amplifies the useful ones. The algorithm combines the dyadic multi-resolution technique (biorthogonal decomposition/reconstruction) with fuzzy logic thresholding. To evaluate the quality of the algorithm, we compare its results with those obtained using the data reduction software of a MALDI-TOF (Matrix-Assisted Laser Desorption/Ionization Time-of-Flight) spectrometer.
Proteomic analysis is done primarily with two-dimensional electrophoresis (2-DE) coupled with mass spectrometry (MS). The first technique, assisted by proteomic imaging, locates the candidate proteins for mass spectrometry analysis. Comparing the measured mass spectra with the theoretical spectra in a database then identifies the proteins of interest in terms of peptides or amino acids [6,10].
In this paper, we propose a robust data reduction algorithm for mass spectra based on the multi-resolution technique [2,6,7,9] and fuzzy set theory [3,4,11,12,13]. The idea is to separate the mass peaks into groups of dyadic sub-bands and then threshold the high-frequency sub-bands. The optimal threshold is computed by minimizing the fuzzy Shannon entropy. The result is then amplified adaptively. At the end of the process, the mass spectrum is reconstructed and corrected by removing the baseline signal.
Currently, the most common method of identifying proteins is to first digest the proteins enzymatically, then determine the masses of the resulting peptides by peak detection on a MALDI-TOF spectrum, and finally use the peptide mass fingerprint to search protein sequences. The theoretical protein retained is the one that gives the maximum coverage. Clearly, this result depends mainly on the quality of the mass spectrum. Data reduction is therefore a crucial stage, since the presence of parasitic (electronic and/or chemical) peaks or the absence of useful mass peaks distorts the protein identification. Because sometimes only a few experimental peptide masses in the fingerprint match the theoretical masses in a database, failing to detect a single peak can hinder the correct identification of a protein. The standard data reduction software provided with the MALDI spectrometer (DataExplorer Voyager) is often suboptimal and non-adaptive. It performs the following steps: denoising, baseline correction, thresholding, peak detection and protein identification. Our data reduction algorithm aims to optimize the denoising and baseline correction steps and to improve the signal-to-noise ratio (SNR) adaptively.
Objective data reduction algorithm
The global architecture of the proposed algorithm is:
Step 1: Dyadic sub-band decomposition
Step 2: Optimal thresholding of the high frequencies (HF)
Step 3: Enhancement of the thresholded HF
Step 4: Spectrum reconstruction
Step 5: Optimal baseline correction
Step 6: Peak detection
Step 7: Protein identification
The dyadic sub-band decomposition is made in the following way:
The difficulty of denoising lies in the fact that noise is present in both the upper and lower sub-bands. This is the case for MALDI spectra, which contain chemical noise in addition to electronic noise. This is why each sub-band at a given level is decomposed into an HF and an LF sub-band [2,6,7,8].
At each pyramid level, the high-frequency sub-band is thresholded by minimizing the fuzzy Shannon entropy, and the spectrum is then reconstructed. Clearly, the decomposition/reconstruction process must be perfect (lossless). To meet this requirement, we have chosen a biorthogonal filter bank.
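A minimal sketch of one way to obtain a perfectly reconstructing biorthogonal decomposition in which both the LF and the HF sub-band are split again at every level. The text does not say which biorthogonal filters were used, so the LeGall 5/3 lifting scheme, the symmetric edge handling and all function names below are assumptions chosen for illustration only.

```python
def analysis_53(x):
    """One level of the LeGall 5/3 biorthogonal lifting transform (assumed filter).

    Returns (lf, hf). The signal length must be even.
    """
    if len(x) % 2:
        raise ValueError("signal length must be even")
    s = [float(v) for v in x[0::2]]  # even samples, become the LF sub-band
    d = [float(v) for v in x[1::2]]  # odd samples, become the HF sub-band
    n = len(d)
    # Predict: subtract from each odd sample the mean of its even neighbours.
    for i in range(n):
        right = s[i + 1] if i + 1 < n else s[i]  # symmetric edge extension
        d[i] -= 0.5 * (s[i] + right)
    # Update: correct the even samples with the neighbouring details.
    for i in range(n):
        left = d[i - 1] if i > 0 else d[i]
        s[i] += 0.25 * (left + d[i])
    return s, d


def synthesis_53(lf, hf):
    """Inverse of analysis_53: undo the update step, then the predict step."""
    s, d = list(lf), list(hf)
    n = len(d)
    for i in range(n):
        left = d[i - 1] if i > 0 else d[i]
        s[i] -= 0.25 * (left + d[i])
    for i in range(n):
        right = s[i + 1] if i + 1 < n else s[i]
        d[i] += 0.5 * (s[i] + right)
    x = [0.0] * (2 * n)
    x[0::2] = s
    x[1::2] = d
    return x


def packet_decompose(x, levels):
    """Split every sub-band (LF and HF) at each level."""
    bands = [list(x)]
    for _ in range(levels):
        split = []
        for band in bands:
            lf, hf = analysis_53(band)
            split.extend([lf, hf])
        bands = split
    return bands


def packet_reconstruct(bands):
    """Merge adjacent (LF, HF) pairs until a single signal remains."""
    bands = list(bands)
    while len(bands) > 1:
        bands = [synthesis_53(bands[i], bands[i + 1])
                 for i in range(0, len(bands), 2)]
    return bands[0]
```

With `levels = 2`, a spectrum whose length is a multiple of 4 yields four sub-bands. Any thresholding or enhancement would be applied to the sub-bands between `packet_decompose` and `packet_reconstruct`; with no modification, the reconstruction returns the input up to floating-point rounding.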
The optimal threshold is computed by first defining a membership function:
where t is a given threshold level, C is a constant equal to the difference between the maximum (fmax) and minimum (fmin) high-frequency values, µ0 and µ1 are the mean values of the upper and lower classes, and h is the histogram.
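The formula itself did not survive in this copy. The symbols defined above match the membership function commonly used in fuzzy thresholding (the Huang and Wang formulation), so one plausible reconstruction is given here. It assumes that µ0 is the mean of the class at or below t and µ1 the mean of the class above t; the original may differ in notation or in that assignment.

```latex
\mu_X(x) =
\begin{cases}
\dfrac{1}{1 + \lvert x - \mu_0(t) \rvert / C}, & x \le t, \\[2ex]
\dfrac{1}{1 + \lvert x - \mu_1(t) \rvert / C}, & x > t,
\end{cases}
\qquad C = f_{\max} - f_{\min},

\mu_0(t) = \frac{\sum_{g = f_{\min}}^{t} g \, h(g)}{\sum_{g = f_{\min}}^{t} h(g)},
\qquad
\mu_1(t) = \frac{\sum_{g = t+1}^{f_{\max}} g \, h(g)}{\sum_{g = t+1}^{f_{\max}} h(g)}.
```

Under this form, the membership value always lies between 1/2 and 1, since the distance of a value from its class mean never exceeds C.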
The second step is to define a measure of fuzziness at a given threshold:
The Shannon entropy of the entire spectrum is:
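The entropy formula is missing from this copy. The sketch below shows how such a measure and the threshold search could be computed. It assumes the Huang and Wang form of the membership function (1 / (1 + |x - class mean| / C)) and a Shannon entropy normalized by N ln 2, and it works directly on the sample values instead of a histogram. The function names are hypothetical.

```python
import math


def shannon(mu):
    """Shannon function S(mu) = -mu ln(mu) - (1 - mu) ln(1 - mu), with S(0) = S(1) = 0."""
    if mu <= 0.0 or mu >= 1.0:
        return 0.0
    return -mu * math.log(mu) - (1.0 - mu) * math.log(1.0 - mu)


def fuzzy_entropy(values, t):
    """Fuzziness E(t) of the two-class split of `values` at threshold t (assumed form)."""
    lower = [v for v in values if v <= t]
    upper = [v for v in values if v > t]
    if not lower or not upper:
        raise ValueError("threshold must leave both classes non-empty")
    c = max(values) - min(values)   # constant C = fmax - fmin (non-zero here)
    mu0 = sum(lower) / len(lower)   # mean of the class at or below t (assumption)
    mu1 = sum(upper) / len(upper)   # mean of the class above t (assumption)
    total = 0.0
    for v in values:
        centre = mu0 if v <= t else mu1
        membership = 1.0 / (1.0 + abs(v - centre) / c)
        total += shannon(membership)
    return total / (len(values) * math.log(2))  # normalized to [0, 1]


def optimal_threshold(values):
    """Return the sample value t that minimizes fuzzy_entropy(values, t)."""
    candidates = sorted(set(values))[:-1]  # drop the max so the upper class is non-empty
    if not candidates:
        return values[0]                   # all values identical: nothing to separate
    return min(candidates, key=lambda t: fuzzy_entropy(values, t))
```

For example, `optimal_threshold([0.1, 0.2, 0.15, 0.12, 5.0, 5.2, 4.9])` selects the largest value of the low cluster, so small coefficients fall on one side of the threshold and large ones on the other. Whether the search runs on signed coefficients or on their magnitudes is not stated in the text; passing magnitudes is one reasonable choice.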
The optimal threshold is the value of t that minimizes E(t). The useful high frequencies are then amplified by a factor G such that:
where σtotal is the standard deviation (std) of the HF sub-band and σlocal is the std of the current window of the HF sub-band. After reconstruction, the spectrum baseline is removed following the approach of Golotvin.
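The expression for G is not reproduced in this copy, so the sketch below only computes the two statistics it depends on and leaves their combination to a caller-supplied function. The centred sliding window, the population standard deviation and the names `local_and_total_std`, `amplify` and `gain_fn` are all assumptions made for illustration.

```python
import statistics


def local_and_total_std(hf, window):
    """Return (sigma_total, sigma_local) for an HF sub-band.

    sigma_local[i] is the std of a window of about `window` samples centred on i,
    truncated at the edges of the sub-band.
    """
    sigma_total = statistics.pstdev(hf)
    half = window // 2
    sigma_local = []
    for i in range(len(hf)):
        lo = max(0, i - half)
        hi = min(len(hf), i + half + 1)
        sigma_local.append(statistics.pstdev(hf[lo:hi]))
    return sigma_total, sigma_local


def amplify(hf, window, gain_fn):
    """Scale each coefficient by gain_fn(sigma_total, sigma_local[i]).

    gain_fn is a placeholder for the paper's factor G, which is not given here.
    """
    sigma_total, sigma_local = local_and_total_std(hf, window)
    return [c * gain_fn(sigma_total, s) for c, s in zip(hf, sigma_local)]
```

A window that contains a peak has a local std well above the sub-band std, which is the kind of contrast an adaptive gain can exploit.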
DataExplorer reduction software Results
The raw mass spectrum given in Fig. 1 is that of a known rat protein, identified as “Acyl-CoA dehydrogenase”.
Preprocessing this spectrum with the DataExplorer software leads to the following result:
Comparing the masses found with the theoretical masses in the SwissProt database leads to the protein identification given in Fig. 4.
The protein candidate is identified with a score of 3.586 × 10^4, a coverage (cov) of 36% and a mass precision of 29.7 ppm.
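For readers unfamiliar with these figures of merit, the sketch below shows how a mass error in ppm and a coverage percentage are usually computed. It assumes that coverage means the share of the protein sequence spanned by matched peptides; the function names and the numbers in the usage note are illustrative and are not taken from the paper's data.

```python
def ppm_error(measured_mass, theoretical_mass):
    """Relative mass error in parts per million."""
    return (measured_mass - theoretical_mass) / theoretical_mass * 1e6


def sequence_coverage(matched_ranges, protein_length):
    """Percentage of residues covered by matched peptides.

    matched_ranges holds (start, end) residue positions, 1-based and inclusive;
    overlapping peptides are counted once.
    """
    covered = set()
    for start, end in matched_ranges:
        covered.update(range(start, end + 1))
    return 100.0 * len(covered) / protein_length
```

A peptide measured at 1000.03 Da against a theoretical mass of 1000.00 Da is off by about 30 ppm, and two overlapping peptides spanning residues 1 to 20 of a 50-residue protein give a coverage of 40%.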
Our Data Reduction algorithm Results
The obtained preprocessed mass spectrum is:
The optimal threshold is computed block by block, with a block size of 40. The optimal threshold values (one per block) calculated for the HF sub-band at level one are given in the following figure:
The matching result between experimental and theoretical masses is given in Fig.8. The database used is SwissProt.
The protein candidate is identified with a score of
Protein identification and characterization is one of the most essential tasks in proteome research. Precise determination of the peptide masses in the spectra and a highly discriminating mass comparison algorithm are therefore the keys to accurate protein identification. We have developed a precise and objective preprocessing algorithm. In threshold-based peak detection, it is often considered preferable to be only slightly selective in the choice of peaks in the mass spectrum, in order to avoid losing apparently spurious peaks that might eventually prove useful. Our preprocessing algorithm is designed to be more selective and precise in determining the mass peaks. Indeed, multiscale fuzzy thresholding proves to be an objective tool for peak selection. The results obtained confirm this: the score and the coverage (cov) are improved significantly. Introducing the fuzzy Shannon entropy into the multiscale framework is therefore a promising idea.