Electrolarynx Voice Recognition Utilizing Pulse Coupled Neural Network

Laryngectomized patients cannot speak normally because their vocal cords have been removed. The easiest option for such patients to speak again is electrolarynx speech. The electrolarynx is placed on the lower chin, and vibration of the neck while speaking is used to produce sound. Meanwhile, voice recognition technology has been growing very rapidly, and it is expected that this technology can also be used by laryngectomized patients who use an electrolarynx. This paper describes a system for electrolarynx speech recognition. The two main parts of the system are feature extraction and pattern recognition. A Pulse Coupled Neural Network (PCNN) is used to extract the features and characteristics of electrolarynx speech; the β parameter of the PCNN was also varied. A Multi Layer Perceptron (MLP) is used to recognize the sound patterns. Two kinds of recognition are conducted in this paper: speech recognition and speaker recognition. Speech recognition recognizes a specific utterance from any speaker, while speaker recognition recognizes a specific utterance from a specific person. The system ran well. Electrolarynx speech recognition was tested by distinguishing "A" from "not A" utterances; the results showed 94.4% validation. Electrolarynx speaker recognition was tested by recognizing the "saya" utterance from several different speakers; the results showed 92.2% validation. The best β parameter of the PCNN for electrolarynx recognition is 3.

Keywords: Electrolarynx speech recognition, Pulse Coupled Neural Network (PCNN), Multi Layer Perceptron (MLP)


I. INTRODUCTION
More than 8,900 people in the United States are diagnosed with laryngeal cancer every year [1]. The average number of laryngeal cancer patients in RSCM is 25 per year [2]. The exact cause of cancer of the larynx is still unknown, but several factors are closely related to the occurrence of laryngeal malignancy: cigarettes, alcohol, and radioactive rays.
Ostomy is a type of surgery to make a hole (stoma) in a particular part of the body. Laryngectomy is an example of ostomy. It is an operation performed on patients whose cancer of the larynx (throat) has reached an advanced stage. As a result of this operation, the patients are no longer able to breathe through the nose, but through a stoma (a hole in the patient's neck) [3]. The human voice is produced by the combination of the lungs, the throat valve (epiglottis) with the vocal cords, and articulation in the oral cavity (mouth) and the nasal cavity (nose) [4]. Removal of the larynx therefore also removes the human voice; after the surgery, the patient can no longer speak as before. There are several ways to make laryngectomized patients able to talk again. The easiest is electrolarynx speech, i.e. speaking with the help of an electrolarynx tool. This tool is placed on the lower chin, and vibration of the neck while speaking is used to produce sound. However, this sound has poor quality and is often not understandable [3].
Meanwhile, research in speech recognition and its applications is growing rapidly, and many applications of speech recognition have been introduced. According to Bahoura, the combination of MFCC and GMM is the best method for respiratory sound recognition.
In addition to the methods presented by Bahoura, there is another feature extraction method that is widely used for image processing: the Pulse Coupled Neural Network (PCNN).
PCNN is a binary model. Although initially this method was popular only for image feature extraction, some researchers have since developed it for voice recognition. Sugiyama [9] used PCNN for pattern recognition.
In this paper, the electrolarynx speech recognition system consists of a Fast Fourier Transform (FFT), a Pulse Coupled Neural Network (PCNN), and a Multi Layer Perceptron (MLP). A block diagram of this process is shown in Fig. 1.
The electrolarynx speech signal is converted to the frequency domain by the Fast Fourier Transform (FFT). This is important because the frequency domain gives a clearer view for observation and manipulation than the time domain. The output of the Fast Fourier Transform is fed into the PCNN to obtain unique characteristics of the speech.

II. PULSE COUPLED NEURAL NETWORK

The PCNN is a pair of single-layer, two-dimensional neural networks connected laterally. In pulse coupled neuron models, the inputs and outputs are given by short pulse sequences generated over time.
The structure of the Pulse Coupled Neural Network used in this research can be seen in Fig. 2.
The input unit has two parts, namely the linking input and the feeding input. The feeding input is the primary input from the neuron's receptive area [8]. In this research, the output signals from the FFT are fed to this feeding input, but they must be normalized first (Equation (1)).
On the other hand, the linking input receives feedback signals from the output Y(n-1). These signals are biased and then multiplied together (Equation (2)).
The input values F_ij and L_ij are modulated in the linking part of the neuron. This process generates the neuron's internal activity U_ij.
If the internal activity is greater than the dynamic threshold θ_ij, the neuron generates output pulses; otherwise, the output is zero.
The PCNN equations of the system (the standard pulse coupled neuron model) can be written as follows:

F_ij[n] = e^(-αF) F_ij[n-1] + V_F Σ M_ijkl Y_kl[n-1] + S_ij
L_ij[n] = e^(-αL) L_ij[n-1] + V_L Σ W_ijkl Y_kl[n-1]
U_ij[n] = F_ij[n] (1 + β L_ij[n])
Y_ij[n] = 1 if U_ij[n] > θ_ij[n], else 0
θ_ij[n] = e^(-αθ) θ_ij[n-1] + V_θ Y_ij[n]

where S_ij is the normalized stimulus (the FFT output), M and W are the feeding and linking synaptic weights, V_F, V_L, and V_θ are magnitude scalings, and αF, αL, and αθ are decay constants.
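As a rough illustration, the PCNN update loop can be sketched in Python over a one-dimensional input (a normalized spectrum). All decay constants, gains, the linking kernel, and the iteration count below are illustrative assumptions, not values from this paper:

```python
import numpy as np

def pcnn(stimulus, beta=3.0, n_iter=10,
         alpha_f=0.1, alpha_l=1.0, alpha_t=0.5,
         v_f=0.5, v_l=0.2, v_t=20.0):
    """One-dimensional PCNN over a normalized spectrum.

    `stimulus` plays the role of S (the normalized FFT output);
    all decay/gain constants here are illustrative assumptions.
    """
    n = stimulus.shape[0]
    F = np.zeros(n)          # feeding input
    L = np.zeros(n)          # linking input
    Y = np.zeros(n)          # binary output pulses
    theta = np.ones(n)       # dynamic threshold
    # simple 1-D linking kernel: each neuron sees its two neighbours
    kernel = np.array([0.5, 0.0, 0.5])
    outputs = []
    for _ in range(n_iter):
        link = np.convolve(Y, kernel, mode="same")
        F = np.exp(-alpha_f) * F + v_f * link + stimulus
        L = np.exp(-alpha_l) * L + v_l * link
        U = F * (1.0 + beta * L)            # internal activity
        Y = (U > theta).astype(float)       # pulse if above threshold
        theta = np.exp(-alpha_t) * theta + v_t * Y
        outputs.append(Y.copy())
    return np.array(outputs)

# toy stimulus standing in for a normalized 256-point FFT output
spectrum = np.random.rand(256)
pulses = pcnn(spectrum, beta=3.0)
print(pulses.shape)   # (10, 256): one binary pulse train per iteration
```

The stacked binary pulse trains (one row per iteration) form the feature vector that would be passed on to the classifier.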

III. METHOD AND MATERIAL
Two kinds of recognition are conducted in this paper: speech recognition and speaker recognition. In speech recognition, a specific utterance from any speaker is recognized, while other utterances are not.
Meanwhile, in speaker recognition, a specific utterance from a specific person is recognized. The system does not recognize other utterances from the same person, nor the same or different utterances from other people. First, electrolarynx speech recognition was conducted. There are 50 samples: 28 "A" vowels and 22 "not A" utterances from different electrolarynx speakers. The electrolarynx speech samples from the database were divided into training sets and testing sets.

A. Electro Larynx Speech Recognition
The recorded electrolarynx speech was sampled at a sampling frequency of 44,100 Hz with 16-bit resolution. It is assumed that the frequency of human voice signals is 300-3400 Hz. The sampling process must meet the Nyquist criterion, which states that the sampling frequency must be at least twice the highest input frequency; thus, 44,100 Hz sampling fulfills it. The sampled signal is then converted from the time domain into the frequency domain using a 512-point Fast Fourier Transform (FFT). Because the FFT of a real signal is symmetric, only half of the output (256 points) is taken. All electrolarynx voice signals (for training or testing in the MLP) are processed by this FFT. One of the FFT output signals can be seen in Fig. 3.
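The front end described above (44,100 Hz sampling, a 512-point FFT, keeping 256 points) might be sketched as follows. The exact normalization of Equation (1) is not reproduced here, so a simple max-normalization stands in for it:

```python
import numpy as np

FS = 44100          # sampling frequency from the paper
N_FFT = 512         # 512-point FFT

def fft_features(signal):
    """Return the 256-point magnitude spectrum of one frame.

    The signal is zero-padded or truncated to 512 samples; only the
    first half of the FFT is kept because the spectrum of a real
    signal is symmetric. Scaling to [0, 1] is an assumption standing
    in for Equation (1).
    """
    frame = np.zeros(N_FFT)
    n = min(len(signal), N_FFT)
    frame[:n] = signal[:n]
    spectrum = np.abs(np.fft.fft(frame))[:N_FFT // 2]
    return spectrum / (spectrum.max() + 1e-12)   # normalize to [0, 1]

# 10 ms of a synthetic 1 kHz tone standing in for electrolarynx speech
t = np.arange(0, 0.01, 1 / FS)
features = fft_features(np.sin(2 * np.pi * 1000 * t))
print(features.shape)   # (256,)
```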
Furthermore, the output of the PCNN is fed into the MLP, which has three layers: 256 neurons in the input layer, 10 in the hidden layer, and 1 in the output layer.
The system was trained on the training set and met the goal at the 359th iteration. After training, the system was tested. The result shows that the system can recognize 47 of the 50 electrolarynx speech samples (94%); only 3 samples were not recognized correctly.
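A minimal sketch of such a 256-10-1 MLP trained with backpropagation is shown below. The weight initialization, learning rate, iteration count, and the synthetic training data are all illustrative assumptions; in the actual system the inputs would be PCNN feature vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

# 256-10-1 architecture as described above; hyperparameters assumed
W1 = rng.normal(0, 0.1, (256, 10)); b1 = np.zeros(10)
W2 = rng.normal(0, 0.1, (10, 1));   b2 = np.zeros(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(X):
    h = sigmoid(X @ W1 + b1)         # hidden layer, 10 neurons
    return h, sigmoid(h @ W2 + b2)   # output layer, 1 neuron

# two easily separable synthetic classes standing in for the
# "A" / "not A" PCNN feature vectors
X = np.vstack([rng.random((10, 256)) * 0.4,
               rng.random((10, 256)) * 0.4 + 0.6])
y = np.array([0.0] * 10 + [1.0] * 10).reshape(-1, 1)

lr = 0.5
for _ in range(1000):                        # plain gradient descent
    h, out = forward(X)
    d_out = (out - y) * out * (1 - out)      # squared-error gradient
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out / len(X); b2 -= lr * d_out.mean(axis=0)
    W1 -= lr * X.T @ d_h / len(X);   b1 -= lr * d_h.mean(axis=0)

_, out = forward(X)
accuracy = float(((out > 0.5) == y).mean())  # training-set accuracy
print(accuracy)
```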

B. Electrolarynx Speaker Recognition
This section shows that the system can also recognize the electrolarynx speaker correctly. There are 28 "saya" electrolarynx utterances from different speakers. These electrolarynx voice samples were divided into training and testing sets.
These signals were then processed in the same way as in the electrolarynx speech recognition above, with the same PCNN parameters. Some PCNN outputs of "saya" electrolarynx speech used for the training set can be seen in Fig. 5. The system was trained on the training set and met the goal at the 250th iteration. After the training process, the system was tested. The testing result shows that the system is able to recognize 26 of the 28 electrolarynx speech samples, i.e. 92.2% validation; only 2 samples were not recognized correctly.

C. β Parameter of PCNN
As mentioned above, the output of the PCNN is greatly affected by its parameters. The most significant parameter is β, the linking parameter. A different β gives a different weight to the linking channel in the internal activity, and in the end this affects the PCNN output. In the experiments above, β = 3 was used.
This section shows what happens when β is varied. Fig. 6 shows the graph of the PCNN output for several values of β.
These PCNN outputs (with different β values) were then used for electrolarynx speaker recognition as above. The results of the test can be seen in Table 1: "True" means the sample is recognized, and "False" means it is not. From the table, it can be seen that PCNN with β = 3 gives the best result, with validation up to 92.2%.
Decreasing the β value decreases the validation, and increasing β decreases it as well. Therefore, the best value of β in this research is 3.
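The role of β can be seen directly from the internal activity equation U = F(1 + βL): a larger β amplifies the linking contribution and pushes U past the threshold sooner. A tiny numeric sketch, with purely illustrative values of F, L, and θ (not from the experiments):

```python
# illustrative feeding, linking, and threshold values (not from the paper)
F, L, theta = 0.8, 0.5, 1.5

for beta in (1, 3, 5):
    U = F * (1 + beta * L)     # internal activity
    fires = U > theta          # neuron pulses if U exceeds the threshold
    print(beta, round(U, 2), fires)
```

With these numbers, the neuron stays silent at β = 1 but fires at β = 3 and β = 5, illustrating why the choice of β changes which spectral components appear in the PCNN output pattern.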