Multi-scale Data Reduction Algorithm of Proteomic Mass Spectrum

M.  Nafati; M.  Samson; B.  Rossi

Keywords

data reduction, fuzzy logic thresholding, mass spectrometry (MS), multi-resolution, proteomics

Citation

M Nafati, M Samson, B Rossi. Multi-scale Data Reduction Algorithm of Proteomic Mass Spectrum. The Internet Journal of Academic Physician Assistants. 2005 Volume 5 Number 1.

Abstract

Proteomics is the field that connects the genome sequence to cellular behaviour. A proteomic analysis proceeds in several stages: sample preparation, protein separation, analysis by mass spectrometry, preprocessing (data mining) and interrogation of the databases. Mass spectrometry measures the masses of peptides (typically obtained by tryptic digestion). These masses are then compared with the theoretical masses held in databases in order to identify the protein. Electronic and chemical noise is often the source of misidentification. In this article, we propose an original data reduction algorithm that aims to remove the spectrum baseline, remove parasitic mass peaks and amplify the useful ones. The algorithm uses the dyadic multi-resolution technique (biorthogonal decomposition/reconstruction) coupled with fuzzy logic thresholding. To evaluate the quality of this algorithm, we compare its results with those obtained using the data reduction software of a MALDI-TOF (Matrix-Assisted Laser Desorption/Ionization Time-of-Flight) spectrometer.

Introduction

Proteomic analysis is carried out primarily by two-dimensional electrophoresis (2-DE) coupled with mass spectrometry (MS). The first technique, assisted by proteomic imaging, locates the candidate proteins for mass spectrometry analysis. Comparing the measured mass spectra with the theoretical ones held in a database then identifies the proteins of interest in terms of peptides or amino acids [₆,₁₀].

In this paper, we propose a robust data reduction algorithm for mass spectra based on the multi-resolution technique [₂,₆,₇,₉] and fuzzy set theory [₃,₄,₁₁,₁₂,₁₃]. The idea is to separate the mass peaks into groups of dyadic sub-bands and then to threshold the high-frequency sub-bands. The optimal threshold is computed by minimizing the fuzzy Shannon entropy. The result is then amplified adaptively. At the end of the process, the mass spectrum is reconstructed and corrected by removing the baseline signal [₁].

Problem Formulation

Currently, the most common way to identify proteins is first to digest them enzymatically, then to determine the masses of the resulting peptides by peak detection on a MALDI-TOF spectrum [₁₄], and finally to use the peptide mass fingerprint to search protein sequence databases. The theoretical protein retained is the one that gives the maximum rate of covering. This result clearly depends mainly on the quality of the mass spectrum. Data reduction is therefore an essential stage, since the presence of parasitic (electronic and/or chemical) peaks, or the absence of useful mass peaks, distorts the protein identification. Because sometimes only a few experimental peptide masses in the fingerprint match the theoretical masses in a database, failing to detect a single peak can prevent the correct identification of a protein. The standard data reduction software (DataExplorer Voyager) provided with the MALDI spectrometer is often suboptimal and non-adaptive. It carries out the following processes: denoising, baseline correction, thresholding, peak detection and protein identification. Our data reduction algorithm aims to optimize the denoising and baseline correction processes and to improve the signal-to-noise ratio (SNR) adaptively.
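The matching step described above can be illustrated with a minimal sketch. It assumes that each theoretical peptide is given as a (mass, start, end) triple locating it in the protein sequence, that a match means a mass error within a chosen ppm tolerance, and that the rate of covering is the fraction of residues covered by matched peptides. The function names, the tolerance and all numbers are hypothetical; they are not taken from this work or from any identification software.

```python
def ppm_error(m_exp, m_theo):
    """Signed mass error in parts per million."""
    return 1e6 * (m_exp - m_theo) / m_theo


def match_peptides(exp_masses, theo_peptides, tol_ppm):
    """Return the theoretical peptides hit by at least one experimental mass."""
    matched = []
    for mass, start, end in theo_peptides:
        if any(abs(ppm_error(m, mass)) <= tol_ppm for m in exp_masses):
            matched.append((mass, start, end))
    return matched


def coverage(matched, protein_length):
    """Fraction of residues covered by at least one matched peptide."""
    covered = set()
    for _, start, end in matched:
        covered.update(range(start, end))  # residues start .. end-1
    return len(covered) / protein_length


# Hypothetical data: three theoretical peptides of a 100-residue protein
# and three detected peaks (one of them a parasitic peak).
theo = [(1000.0, 0, 10), (1500.0, 10, 25), (2000.0, 40, 60)]
exp = [1000.02, 2000.05, 1750.0]

hits = match_peptides(exp, theo, tol_ppm=50.0)
cov = coverage(hits, protein_length=100)
print(len(hits), cov)
```

In this toy case a single missed peak (the peptide at 1500.0) already removes 15 residues from the covering, which is why the quality of the peak list matters so much.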

Objective data reduction algorithm

The global architecture of the proposed algorithm is:

Step 1: Dyadic sub-band decomposition
Step 2: Optimal thresholding of the high frequencies (HF)
Step 3: Enhancement of the thresholded HF
Step 4: Spectrum reconstruction
Step 5: Optimal baseline correction
Step 6: Peak detection
Step 7: Protein identification

The dyadic sub-band decomposition is made in the following way:

Figure 1

Figure 1: sub-band decomposition.

The difficulty of denoising lies in the fact that noise is present in both the upper and the lower sub-bands. This is the case for MALDI spectra: in addition to electronic noise, one finds chemical noise. This is why each sub-band at a given level is decomposed into HF and LF sub-bands [₂,₆,₇,₈].

At each pyramid level, the high-frequency sub-band is thresholded by minimizing the fuzzy Shannon entropy. The spectrum is then reconstructed. Clearly, the decomposition/reconstruction process must give perfect reconstruction. To meet this requirement, we have chosen a biorthogonal filter bank.

The optimal threshold is computed by first defining a membership function:
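A minimal sketch of a perfectly reconstructing dyadic split is given below. It does not reproduce the filter bank used in this work, whose coefficients are not listed; it assumes instead the LeGall 5/3 biorthogonal wavelet in its lifting form, chosen only because it is short to write and exactly invertible. The function names are hypothetical, and the signal length is assumed to stay even at every level.

```python
import numpy as np


def split_53(x):
    """One analysis level: return (LF, HF) halves of an even-length signal."""
    x = np.asarray(x, dtype=float)
    s, d = x[0::2].copy(), x[1::2].copy()
    # Predict step: odd samples minus the mean of their even neighbours.
    d -= 0.5 * (s + np.append(s[1:], s[-1]))
    # Update step: even samples corrected with the neighbouring details.
    s += 0.25 * (np.insert(d[:-1], 0, d[0]) + d)
    return s, d


def merge_53(s, d):
    """One synthesis level: undo split_53 exactly (lifting steps reversed)."""
    s = s - 0.25 * (np.insert(d[:-1], 0, d[0]) + d)
    d = d + 0.5 * (s + np.append(s[1:], s[-1]))
    x = np.empty(s.size + d.size)
    x[0::2], x[1::2] = s, d
    return x


def packet_split(x, levels):
    """Split every sub-band again at each level (both LF and HF branches)."""
    bands = [np.asarray(x, dtype=float)]
    for _ in range(levels):
        bands = [half for band in bands for half in split_53(band)]
    return bands


def packet_merge(bands):
    """Rebuild the signal by merging sibling sub-bands level by level."""
    bands = list(bands)
    while len(bands) > 1:
        bands = [merge_53(bands[i], bands[i + 1])
                 for i in range(0, len(bands), 2)]
    return bands[0]


# Hypothetical test signal: a slow drift plus one narrow peak.
signal = np.sin(np.linspace(0.0, 6.0, 64))
signal[20] += 3.0
bands = packet_split(signal, levels=3)
rebuilt = packet_merge(bands)
print(len(bands), np.allclose(rebuilt, signal))
```

Because each lifting step is undone by the same expression with the opposite sign, reconstruction is exact whatever boundary extension is used, which is the property required of the filter bank here.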

Figure 2

With

Figure 3

where t is a given threshold level, C is a constant equal to the difference between the maximum (fmax) and minimum (fmin) high-frequency values, µ₀ and µ₁ are the mean values of the upper and lower classes, and h is the histogram.

The second step is to measure the fuzziness at a given threshold t. One way of measuring fuzziness is based on the Shannon entropy [₃,₄,₅,₁₁]:
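The membership function itself appears only as an image. One common form that matches the symbols defined above is the Huang and Wang style membership used in fuzzy thresholding. It is given here as an assumption, not as the expression used in this work. In particular, assigning µ₀ to the values at or below t and µ₁ to the values above t is a choice made for this sketch.

```latex
\[
\mu_t(x) =
\begin{cases}
\dfrac{1}{1 + |x - \mu_0(t)| / C}, & x \le t, \\[2ex]
\dfrac{1}{1 + |x - \mu_1(t)| / C}, & x > t,
\end{cases}
\qquad C = f_{\max} - f_{\min},
\]
\[
\mu_0(t) = \frac{\sum_{g \le t} g\, h(g)}{\sum_{g \le t} h(g)}, \qquad
\mu_1(t) = \frac{\sum_{g > t} g\, h(g)}{\sum_{g > t} h(g)}.
\]
```

With this form, |x − µ| never exceeds C, so the membership always lies between 1/2 and 1: a value close to its class mean belongs to that class almost fully.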

Figure 4

The Shannon entropy of the entire spectrum is:

Figure 5
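The threshold search can be sketched as follows. Since the equations are available only as images, the sketch assumes a Huang and Wang style formulation: membership 1/(1 + |g − µ|/C) with µ the mean of the class that g falls in, Shannon function S(µ) = −µ ln µ − (1 − µ) ln(1 − µ), and E(t) = (1/(N ln 2)) Σ S(µ_t(g)) h(g). The function name, the number of histogram bins and the test data are hypothetical.

```python
import numpy as np


def fuzzy_entropy_threshold(values, n_bins=64):
    """Return (t, E(t)) for the threshold that minimizes the fuzzy entropy."""
    values = np.asarray(values, dtype=float)
    c = values.max() - values.min()           # C = fmax - fmin
    if c == 0.0:
        raise ValueError("all values are equal; no threshold exists")
    h, edges = np.histogram(values, bins=n_bins)
    g = 0.5 * (edges[:-1] + edges[1:])         # bin centres
    idx = np.arange(n_bins)
    best_t, best_e = None, np.inf
    for k in range(n_bins - 1):                # threshold after bin k
        n0, n1 = h[:k + 1].sum(), h[k + 1:].sum()
        if n0 == 0 or n1 == 0:
            continue                           # one class would be empty
        mu0 = (g[:k + 1] * h[:k + 1]).sum() / n0
        mu1 = (g[k + 1:] * h[k + 1:]).sum() / n1
        ref = np.where(idx <= k, mu0, mu1)     # class mean for each bin
        m = 1.0 / (1.0 + np.abs(g - ref) / c)  # membership in [1/2, 1]
        m = np.clip(m, 1e-12, 1.0 - 1e-12)     # avoid log(0)
        s = -m * np.log(m) - (1.0 - m) * np.log(1.0 - m)
        e = (s * h).sum() / (values.size * np.log(2.0))
        if e < best_e:
            best_t, best_e = edges[k + 1], e
    return best_t, best_e


# Hypothetical magnitudes: many small (noise-like) values and a few large ones.
mags = np.concatenate([np.linspace(0.0, 1.0, 200), np.linspace(9.0, 10.0, 20)])
t_opt, e_opt = fuzzy_entropy_threshold(mags)
print(t_opt, e_opt)
```

On such clearly bimodal data the selected threshold falls in the empty gap between the two groups, since any threshold cutting through a group leaves some values far from their class mean and raises the entropy.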

The optimal threshold is the value of t that minimizes E(t). The useful high frequencies are then amplified by a factor G such that:

Figure 6

where σ_total is the standard deviation (std) of the HF sub-band and σ_local is the std of the current window of the HF sub-band. After reconstruction, the baseline of the spectrum is removed following the concept described by Golotvin [₁]: for the i-th point, the minimal and maximal values among N neighbouring points are found, and if their difference does not exceed the noise std multiplied by a chosen factor n (Y_max − Y_min ≤ n·σ_noise), the i-th point is considered to belong to the baseline.
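The baseline test can be sketched as below. The window test follows the rule stated above; the rest is an assumption made for illustration: the window is centred on the i-th point, the noise std is supplied by the caller, and the baseline between detected points is rebuilt by linear interpolation, which is a simplification and not necessarily the procedure of [₁]. All names and numbers are hypothetical.

```python
import numpy as np


def baseline_mask(y, window, n_factor, sigma_noise):
    """Flag point i as baseline when max - min over its window is small."""
    y = np.asarray(y, dtype=float)
    half = window // 2                        # effective window: 2*half + 1
    padded = np.pad(y, half, mode="edge")     # repeat end values at the edges
    mask = np.empty(y.size, dtype=bool)
    for i in range(y.size):
        w = padded[i:i + 2 * half + 1]
        mask[i] = (w.max() - w.min()) <= n_factor * sigma_noise
    return mask


def remove_baseline(y, mask):
    """Subtract a baseline interpolated linearly between baseline points."""
    y = np.asarray(y, dtype=float)
    idx = np.flatnonzero(mask)
    if idx.size == 0:
        return y.copy()                       # nothing recognised as baseline
    baseline = np.interp(np.arange(y.size), idx, y[idx])
    return y - baseline


# Hypothetical spectrum: a linear drift carrying one sharp peak at index 50.
y = 0.01 * np.arange(100.0)
y[50] += 5.0
mask = baseline_mask(y, window=5, n_factor=3.0, sigma_noise=0.05)
corrected = remove_baseline(y, mask)
print(corrected[50], np.abs(corrected[mask]).max())
```

The points whose window contains the peak fail the test and are bridged by interpolation, so the drift is removed while the peak height is preserved.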

Algorithm Principle

Figure 7

Results

DataExplorer reduction software Results

The raw mass spectrum shown in Fig. 2 is that of a known rat protein, identified as “Acyl-CoA dehydrogenase”.

Figure 8

Figure 2: Raw spectrum of the rat protein “Acyl-CoA dehydrogenase”.

Preprocessing this spectrum with the DataExplorer software gives the following result:

Figure 9

Figure 3: mass peak results obtained with DataExplorer.

Comparing the detected masses with the theoretical masses in the SwissProt database leads to the protein identification given in Fig. 4.

Figure 10

Figure 4: The MsFit software identification results.

The candidate protein is identified with a score of 3.586 × 10⁴, a coverage rate (cov) of 36% and a mass precision of 29.7 ppm.

Our Data Reduction algorithm Results

The obtained preprocessed mass spectrum is:

Figure 11

Figure 5: Mass spectrum obtained with our algorithm.

The optimal threshold is computed block by block, with a block size of 40. The optimal threshold values (one per block) calculated for the HF sub-band at level one are given in the following figure:

Figure 12

Figure 6: HF sub-band threshold values at level one.

Figure 13

Figure 7: Amplification gain corresponding to Fig.5.
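The block-by-block processing described above can be sketched as follows. The sketch assumes hard thresholding (coefficients whose magnitude does not exceed the block threshold are set to zero), which is not stated in the text, and takes the threshold rule as a parameter. The rule used in the example, half of the largest magnitude in the block, is only a placeholder and is not the fuzzy entropy criterion.

```python
import numpy as np


def blockwise_hard_threshold(hf, block_size, threshold_fn):
    """Threshold an HF sub-band block by block; return (result, thresholds)."""
    hf = np.asarray(hf, dtype=float)
    out = hf.copy()
    thresholds = []
    for start in range(0, hf.size, block_size):
        block = hf[start:start + block_size]
        t = threshold_fn(np.abs(block))        # one threshold per block
        thresholds.append(t)
        keep = np.abs(block) > t
        out[start:start + block_size] = np.where(keep, block, 0.0)
    return out, np.array(thresholds)


# Hypothetical HF sub-band: two blocks of 40 coefficients, each made of
# small noise-like values plus one large coefficient.
hf = np.empty(80)
hf[:40], hf[40:] = 0.1, -0.2
hf[5], hf[60] = 4.0, -3.0


def placeholder_rule(magnitudes):
    return 0.5 * magnitudes.max()


cleaned, block_thresholds = blockwise_hard_threshold(hf, 40, placeholder_rule)
print(block_thresholds, np.flatnonzero(cleaned))
```

Computing one threshold per block lets the selection follow local changes in the noise level along the mass axis, which a single global threshold cannot do.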

The matching result between experimental and theoretical masses is given in Fig.8. The database used is SwissProt.

Figure 14

Figure 8: Theoretical masses matched in SwissProt DataBase.

The candidate protein is identified with a score of 1.63 × 10⁹, a coverage rate (cov) of 51% and a mass precision of 22.4 ppm.
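Taking the coverage and mass precision reported for the two runs at face value, the gain can be quantified with a short calculation. The derived figures below are computed here from the reported values and are not stated in the original results.

```latex
\[
\Delta\mathrm{cov} = 51\% - 36\% = 15 \text{ percentage points}, \qquad
\frac{51}{36} \approx 1.42,
\]
\[
\frac{29.7 - 22.4}{29.7} = \frac{7.3}{29.7} \approx 0.246
\quad (\text{mass error about } 24.6\% \text{ lower}).
\]
```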

Conclusion

Protein identification and characterization is one of the most essential tasks in proteome research. Precise determination of the peptide masses in the spectra and a highly discriminating mass comparison algorithm are therefore the keys to accurate protein identification. We have developed a precise and objective preprocessing algorithm. With the threshold analysis usually associated with peak detection, it often proves preferable to be only slightly selective when choosing peaks in the mass spectrum, so as not to lose apparently spurious peaks that may eventually turn out to be useful. Our preprocessing algorithm is designed to be more selective and precise in determining the mass peaks. Indeed, multi-scale fuzzy thresholding proves to be an objective tool for peak selection. The results obtained confirm this: the score and the coverage rate are significantly improved, with cov rising from 36% to 51%. Introducing the fuzzy Shannon entropy into the multi-scale framework is therefore a promising idea.

References

1. S. Golotvin, A. Williams, Improved Baseline Correction of FT NMR Spectra, NMR Newsletter, Advanced Chemistry Development, 1999.
2. S. Grace Chang, B. Yu, M. Vetterli, Adaptive Wavelet Thresholding for Image Denoising and Compression, IEEE Transactions on Image Processing, Vol. 9, No. 9, 2000, pp. 1532-1546.
3. H. Haussecker, H. R. Tizhoosh, Fuzzy Image Processing, in Handbook of Computer Vision and Applications, edited by B. Jähne, H. Haussecker, and P. Geissler, Academic Press, 1999.
4. E. D. Jansing, T. A. Albert, D. L. Chenoweth, Two Dimensional Entropy Segmentation, Pattern Recognition Letters, Vol. 20, 1999, pp. 329-336.
5. A. Lindegren, Analysis of Proteomic Patterns for Detection of Prostate Cancer, Master Thesis, 2004.
6. P. Lio, Wavelets in Bioinformatics and Computational Biology: State of Art and Perspectives, Bioinformatics, Vol. 19, 2003, pp. 2-9.
7. B. Liu, Y. Sera, N. Matsubara, K. Otsuka, S. Terabe, Signal Denoising by Wavelets for Microchip Electrophoresis, Chromatography, Tokyo, Society for Chromatographic Sciences, Vol. 23, Part Supp, 2002, pp. 59-60.
8. D.I. Malyarenko, W.E. Cooke, B.L. Adam, G. Malik, H. Chen, E.R. Tracy, M.W. Trosset, M. Sasinowski, O.J. Semmes, D.M. Manos, Enhancement of Sensitivity and Resolution of Surface-Enhanced Laser Desorption/Ionization Time-of-Flight Mass Spectrometric Records for Serum Peptides Using Time-Series Analysis Techniques, Proteomics and Protein Markers, Clinical Chemistry, 2005, pp. 65-74.
9. N. Nafati, Synthèse itérative et simultanée de banc de filtres biorthogonaux de reconstruction parfaite avec des critères adaptés aux applications de codage de la parole et de l'image, GRETSI, Grenoble, France, September 1997, pp. 1085-1088.
10. P. Nugues, Interprétation de Gels d'Electrophorèses 2D, Thèse de Doctorat, Université de Nancy, 1989.
11. T. D. Pham, A New Approach for Calculating Implications of Fuzzy Rules, IEEE International Conference on Artificial Intelligence Systems, 2002, pp. 71.
12. J. Polec, J. Pavlovieova, T. Karlubikova, Application of Shape-Independent Orthogonal Transform for Image Interpolation, Radio-Engineering, Vol. 11, No. 1, April 2002.
13. E. G. Sánchez, Y.A. Dimitriadis, M. Sanchez-Reyes Mas, P. S. García, J.M. Cano Izquierdo, J. Lopez Coronado, On-Line Character Analysis and Recognition With Fuzzy Neural Networks, Intelligent Automation and Soft Computing, Vol. 7, No. 3, 1998, pp. 161-162.
14. R. Zenobi, R. Knochenmuss, Ion Formation in MALDI Mass Spectrometry, Mass Spectrometry Reviews, Vol. 17, 1998, pp. 337-366.

ISPUB.com (Internet Scientific Publications)
