The Internet Journal of Forensic Science, Volume 2, Number 1

Original Article

Extended Study of Pitch Shifted Speech by Preserving Tempo: An Experimental Study

S Choudhury, C Singh, M Thakar

Keywords

frequency domain, pitch shift, speech characteristics

Citation

S Choudhury, C Singh, M Thakar. Extended Study of Pitch Shifted Speech by Preserving Tempo: An Experimental Study. The Internet Journal of Forensic Science. 2006 Volume 2 Number 1.

Abstract

The overall pitch of a recorded speech sample can be altered with the pitch shift techniques made available by advances in digital technology. The effect of time domain pitch shift on speech characteristics has previously been studied using time warping. In the present work, the effect of frequency domain pitch shift with preserved tempo is studied using the speech exemplars of 15 speakers at stretch ratios of 90, 95, 105 and 110, compared with the original speech exemplar. The effects of frequency domain pitch shift on F1, F2, F3, nasal formant frequencies, duration of word segment and mean period are analyzed with respect to the overall shift in the mean F0. The change in pitch due to stretching is found to be independent of the position of F1, F2 and F3. However, the change in the values of F1, F2, F3 and mean period for a speaker is linear.

 

Note: The paper was presented at XVI All India Forensic Science Conference 2004, Hyderabad, India and appeared in the Proceedings.

Introduction

A change in overall pitch results in a change in the speech characteristics, which makes speaker identification a challenging task for the forensic expert [1,2,3,4,5]. Automatic systems for speaker identification based on pitch detection techniques suffer from a similar problem [6,7,8].

The shift in pitch may be circumstantial or intentional. Recording speech on a low-grade recorder, recording at an incorrect speed due to a low battery or poor power supply, malfunction of the tape recorder and similar causes lead to pitch change. Similarly, the difference between the standards used for film and for video creates problems when converting from one format to the other: since all the images are displayed, the change of frame rate induces a pitch change in the sound. Another example is fitting video footage or speech to a fixed length of time. These are all circumstantial. The effect of a change in the playback speed of an analog recorder on authenticity examination has been discussed [9]. In certain situations, factors like tape stretch can also contribute to pitch shift and timing errors, which are significant in contrast to the NAB & DIN specifications as described by McKnight [10].

Advances in digital audio processing have contributed a wide range of tools for shaping audio data, and computer-based tools have made it possible to alter data in a desired manner. The methods used work in the time domain, the frequency domain or the time-frequency domain. Time domain methods use the autocorrelation technique, while frequency domain methods use the phase-vocoder technique, based on the concept of analysis, transformation and/or synthesis applied to the original sound. Time-frequency domain methods are based on constant bandwidth and modification of phase.

The effect of time warping on speech characteristics has been studied [11] and its impact on speaker identification has been discussed. The present extended study considers the speech characteristics resulting from the frequency domain pitch shift technique with preserved tempo.
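As an illustration of the autocorrelation technique mentioned for time domain methods, the following is a minimal sketch of how the pitch period of a voiced frame can be located from the lag at which the signal best matches a delayed copy of itself. It is not the tool used in this study. The function name, the default search range of 80 to 400 Hz and the synthetic 147 Hz test tone are all assumptions chosen for illustration; only the 22050 Hz sampling rate is taken from the study.

```python
import math

def estimate_f0_autocorr(samples, sample_rate, f0_min=80.0, f0_max=400.0):
    """Estimate F0 (Hz) of one frame from the best normalized autocorrelation lag.

    The search range (f0_min, f0_max) is an assumed default, not a value
    from the study. A range wider than one octave can yield octave errors.
    """
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(samples) - 2)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        head = samples[:len(samples) - lag]   # frame without its last `lag` samples
        tail = samples[lag:]                  # same frame delayed by `lag` samples
        num = sum(a * b for a, b in zip(head, tail))
        den = math.sqrt(sum(a * a for a in head) * sum(b * b for b in tail))
        score = num / den if den > 0 else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

# Synthetic test tone (hypothetical, not study data): 147 Hz sampled at 22050 Hz.
sr = 22050
tone = [math.sin(2 * math.pi * 147.0 * t / sr) for t in range(2048)]
f0 = estimate_f0_autocorr(tone, sr)
```

Normalizing each lag by the energy of both segments keeps the score from favoring short lags simply because they overlap more samples.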

Methodology & Experimentation

Selection of Speech Material

A text containing vowels and nasals is prepared in Hindi. A total of 15 speakers, both male and female, in the age group of 25-45 are selected and asked to read the text. Two utterances of each speaker are recorded on a semiprofessional analog tape recorder. These samples are digitized at a sampling rate of 22050 Hz using 16-bit quantization in mono mode. The sentence of interest, “Das din tak banirahi”, is chosen from the whole text and extracted from either the first or the second utterance of each speaker, whichever is more clearly spoken.

Exemplars are prepared by subjecting these samples to constant stretch ratios of 90, 95, 105 and 110 while preserving tempo. A splicing frequency of 50 Hz with an overlap of 30% is used for the stretch ratio of 90, 49 Hz with 29% for the stretch ratio of 95, and 47 Hz with 28% for both the 105 and 110 stretch ratios. These exemplars are analyzed in the Computerized Speech Laboratory (4003B). The following are measured: mean fundamental frequency (F0); first (F1), second (F2) and third (F3) formant frequencies at a particular location (/dΛs/, /bΛni/); duration of the word segment (/din/) and its number of periods; and nasal formant frequencies (/din/). The words /dΛs/ and /bΛni/ are chosen to study the vowel characteristics with a fricative and with nasals.
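To give a sense of scale for the splicing settings, the sketch below converts a splicing frequency and an overlap percentage into sample counts at the 22050 Hz sampling rate. It assumes that the splicing frequency is the number of splices per second, so that one segment lasts 1/frequency seconds, and that the overlap is a percentage of that segment. Both interpretations, and the function name, are assumptions; the software's own definitions may differ.

```python
def splice_parameters(sample_rate_hz, splicing_freq_hz, overlap_percent):
    """Return (segment, overlap, hop) lengths in samples.

    Assumes one segment spans 1/splicing_freq_hz seconds and that the
    overlap is expressed as a percentage of the segment length.
    """
    segment = round(sample_rate_hz / splicing_freq_hz)
    overlap = round(segment * overlap_percent / 100)
    return segment, overlap, segment - overlap

# Settings reported for the stretch ratio of 90: 50 Hz splicing, 30% overlap.
segment, overlap, hop = splice_parameters(22050, 50, 30)
print(segment, overlap, hop)
```

Under these assumptions a 50 Hz splicing frequency corresponds to 20 ms segments, which is of the same order as a few pitch periods of adult speech.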

Results And Discussion

Fig.-1 shows the first formant frequency (F1), second formant frequency (F2), third formant frequency (F3) at /dΛs/ for the speaker (S7) having minimum value of mean F0.

Figure 1: Formant frequencies at /dΛs/ for the speaker (S7) having minimum mean F0

The variation of F2 and F3 is more than twice the variation of F1 as the stretch ratio changes from 90 through 110. Stretching an exemplar with a ratio of 90 or 95 either adds periods or reduces the duration of each period in the syllable of a word, using a complex algorithm, to increase the overall pitch. As seen in the waveform, the extra periods are inserted at the center of the syllable and are the mean of the preceding and following periods. Similarly, stretch ratios of 105 or 110 either remove periods or elongate the existing periods of the syllable, thereby lowering the overall pitch. The removal of periods causes a loss of formant information, and a shift in the formants is observed. Addition or deletion of periods in the syllable results in a decrease or increase in the silence region respectively, even though the total duration of the exemplar is constant. The introduction or removal of periods takes place in such a way that the mean period decreases linearly for stretch ratios below 100 and increases for stretch ratios above 100. In the case of time warping, pitch changes by elongating or compressing the whole sample in time, but the number of periods in the syllable remains unchanged.
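The period insertion described above can be pictured with a toy sketch. It assumes the syllable has already been cut into pitch periods of equal length, and it inserts at the center a new period equal to the sample-wise mean of its two neighbours. The second function assumes that the mean period scales nominally as stretch ratio / 100. Both functions and their names are illustrative assumptions; the actual algorithm is more complex and is not documented here.

```python
def insert_interpolated_period(periods):
    """Insert, at the centre, a period that is the mean of its neighbours.

    `periods` is a list of equal-length sample lists (an assumption made
    for simplicity; real pitch periods vary in length).
    """
    mid = len(periods) // 2
    previous, following = periods[mid - 1], periods[mid]
    new_period = [(a + b) / 2 for a, b in zip(previous, following)]
    return periods[:mid] + [new_period] + periods[mid:]

def nominal_mean_period(original_period_ms, stretch_ratio):
    """Nominal mean period, assuming it scales as stretch_ratio / 100."""
    return original_period_ms * stretch_ratio / 100

# Toy values, not measurements from the study.
stretched = insert_interpolated_period([[0, 2], [4, 6]])
print(stretched)                       # three periods instead of two
print(nominal_mean_period(8.0, 90))    # shorter period, hence higher pitch
```

Because the inserted period is an average, it adds no new spectral detail, whereas a removed period cannot be recovered, which is consistent with the loss of formant information noted for ratios above 100.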

The variation of F1, F2 and F3 at /bΛni/ for the speaker (S9) having the maximum value of mean F0 is shown in Fig.-2. As for the other speakers, the variation in F1 is found to be smaller than the variation in F2 and F3 for the word /bΛni/. The change in the values of F1, F2 and F3 due to stretching is found to be linear for all the speakers.
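One way to check a linearity claim of this kind is a least-squares line fit of a formant frequency against stretch ratio, with the coefficient of determination as a measure of fit. The sketch below is a generic fit in plain Python, not the procedure used in the study. The function name is hypothetical, and the sample values are synthetic placeholders that lie exactly on a line; they are not measurements.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept; returns (slope, intercept, r2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return slope, intercept, r2

# Synthetic placeholder values (Hz), exactly linear by construction.
ratios = [90, 95, 100, 105, 110]
formant_hz = [1050.0, 1025.0, 1000.0, 975.0, 950.0]
slope, intercept, r2 = fit_line(ratios, formant_hz)
```

With measured data, an r2 close to 1 for each speaker would support the linear behaviour, and the slope would express how many hertz the formant moves per unit of stretch ratio.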

Figure 2: Formant frequencies at /bΛni/ for the speaker (S9) having maximum mean F0

The formant frequencies change in the same way in other regions as well. No noticeable difference is observed in the fricative region /s/ in the wideband spectrogram.

Nasal formant frequencies measured at /din/, shown in Table-1, are found to vary in the same way as the formant frequencies at /dΛs/ or /bΛni/ for the corresponding speaker. The variation of N2 is more prominent than that of N1, which indicates that the higher formant frequencies are more affected when a change of pitch is carried out by preserving tempo.

Table 1

Fig.-3 (a) shows the percent variation of F1, F2 and F3 with respect to mean F0 at /bΛni/ and Fig.-3 (b) shows the corresponding variation at /dΛs/, both for a stretch ratio of 110. These plots indicate that the percentage decrease of F1, F2 and F3 is not the same for all the speakers.
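The percent variation plotted in these figures can be sketched as the change of each formant relative to its value in the original exemplar. That definition is an assumption made here, since the exact formula is not stated. The function names and the formant values are hypothetical placeholders, not measurements from the study.

```python
def percent_variation(original_hz, shifted_hz):
    """Percent change of a frequency relative to its original value (assumed definition)."""
    return 100 * (shifted_hz - original_hz) / original_hz

def formant_variations(original, shifted):
    """Apply percent_variation to each formant in two {name: Hz} dictionaries."""
    return {name: percent_variation(original[name], shifted[name]) for name in original}

# Hypothetical placeholder values (Hz) for one speaker.
original = {"F1": 500.0, "F2": 1500.0, "F3": 2500.0}
shifted = {"F1": 450.0, "F2": 1350.0, "F3": 2250.0}
variations = formant_variations(original, shifted)
```

Expressing the change as a percentage makes speakers with different starting formant values comparable on one plot.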

Figure 3 (a): Percent variation of F1, F2 & F3 with respect to Mean F0 at /bΛni/ for stretch ratio of 110

Figure 3 (b): Percent variation of F1, F2 & F3 with respect to Mean F0 at /dΛs/ for stretch ratio of 110

Fig.-4 (a) shows the percent variation of F1, F2 & F3 with mean F0 for a stretch ratio of 105 at /bΛni/ for the speakers S9, S14, S10, S11, S6, S5 and S4 respectively. The percent variation of F1, F2 & F3 with mean F0 for a stretch ratio of 105 at /dΛs/ for the speakers S2, S3, S15, S13, S12, S8 and S7 respectively is shown in Fig.-4 (b). These two plots indicate that the percent variation in the values of F1, F2 & F3 is independent of their initial values in the original exemplar.

Figure 4 (a): Percent variation of F1, F2 & F3 with Mean F0 for stretch ratio of 105 at /bΛni/ for the speakers S9, S14, S10, S11, S6, S5, S4 respectively

Figure 4 (b): Percent variation of F1, F2 & F3 with Mean F0 for stretch ratio of 105 at /dΛs/ for the speakers S2, S3, S15, S13, S12, S8, S7 respectively

Conclusion

Changing the overall pitch while preserving tempo affects the higher formant frequencies more than the lower formants, with a linear change in the measurable speech parameters. The amount of change in the values of F1, F2 & F3 is found to be different for each speaker. By reversing the change in the formant frequencies, the altered speech samples could be brought close to the original. However, the information contained in the removed periods of the speech sample is lost when moving from higher to lower pitch. Even so, the sample could suitably be used for speaker identification purposes, as the characteristics pertaining to speaker dependent feature parameters are found to be preserved in the process.

References

1. Mead KO. Identification of speakers from fundamental frequency contours in conversational speech. Joint Speech Research Unit 1974; Report No. 1002.
2. Stevens SS and Volkmann J. The relation of pitch to frequency: A revised scale. American Journal of Psychology 1940; 53: 329-353.
3. Jassem W. Pitch and compass of the speaking voice. Journal of the International Phonetic Association 1971; 1: 59-68.
4. Steffen-Batog M, Jassem W and Gruszka-Koscielak H. Statistical distributions of short term F0 values as a personal voice characteristic. In: W. Jassem (ed.) Speech analysis and synthesis, Warsaw: Polish Academy of Science; 1970: Vol. 2.
5. Horii Y. Some statistical characteristics of voice fundamental frequency. Journal of Speech and Hearing Research 1975; 18 (1): 192-201.
6. Atal BS. Automatic speaker recognition based on pitch contours. Journal of the Acoustical Society of America 1972; 52: 1687-1697.
7. Green N. Automatic speaker recognition using pitch measurements in conversational speech. Joint Speech Research Unit 1972; Report No. 1000.
8. Lummis RC. Speaker verification by computer using speech intensity for temporal registration. IEEE Transactions on Audio and Electroacoustics 1973; AU-21 (2): 80-89.
9. Koenig BE. Measurement of recorder speed changes in authenticity examinations. Crime Laboratory Digest 1987; 14 (4): 140-152.
10. McKnight JG. Speed, pitch and timing errors in tape recording and reproducing. Journal of the Audio Engineering Society 1968; 16: 266-274.
11. Singh CP, Manisha K and Choudhury SK. Study of speech characteristics due to pitch shift by time warping method and its impact on forensic speaker identification. Proceedings of the XV All India Forensic Science Conference 2004: 56.

Author Information

S. K. Choudhury
Central Forensic Science Laboratory

C. P. Singh
Central Forensic Science Laboratory

M. K. Thakar
Department of Forensic Science, Punjabi University
