In this seminar we discuss the topic of speech output. We explain the synthesis of speech and sound on a computer, the speech unit, and the sound card as clearly as possible, so that the reader can understand what speech output really is and how it is used. To make the seminar more engaging, we have included several pictures showing the individual parts of the speech-output chain, as well as a few tables. The next paragraph briefly outlines the content of the topic.
Speech output in brief:
One of the basic assumptions in speech signal processing is that speech can be modeled as the output of a linear, time-varying system whose properties change slowly over time. This leads to the basic principle of speech analysis: if sufficiently short segments of the speech signal are observed, each segment can be effectively modeled as the output of a linear, time-invariant system excited by either quasi-periodic impulses or random noise. The problem of speech analysis is then to determine the parameters of this speech model and how they change over time. Since the output (the speech signal) is the convolution of the excitation with the impulse response of the linear, time-invariant system, the problem can also be viewed as the separation of convolved components, known as deconvolution. Deconvolution can be approached through short-time Fourier analysis, as explained below.
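The deconvolution idea above can be sketched with the real cepstrum: taking the logarithm of the short-time magnitude spectrum turns the convolution of excitation and vocal-tract response into a sum, which can then be separated by quefrency. The following is a minimal NumPy sketch, not a real analysis system; the toy frame, the pitch period of 64 samples, and the decaying-exponential "vocal tract" filter are all illustrative assumptions:

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum: inverse FFT of the log magnitude spectrum.

    In the cepstral domain the convolution of excitation and
    vocal-tract impulse response becomes a sum, so the two
    components can be separated by a simple quefrency cut-off.
    """
    spectrum = np.fft.fft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # avoid log(0)
    return np.fft.ifft(log_mag).real

# Toy frame: quasi-periodic excitation convolved with a short impulse response
excitation = np.zeros(256)
excitation[::64] = 1.0                            # impulse train, period = 64
impulse_response = np.exp(-np.arange(16) / 4.0)   # decaying "vocal tract" filter
frame = np.convolve(excitation, impulse_response)[:256]

c = real_cepstrum(frame)
# Low quefrencies ≈ vocal tract; a peak near quefrency 64 reveals the pitch period
```

The excitation shows up as a sharp cepstral peak at the pitch period, while the smooth filter contributes only to the low-quefrency region, which is what makes the two separable.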
Synthesis of sound
In the broader sense, sound is any mechanical wave that propagates through a medium, regardless of whether it falls within the range of human hearing; in the narrow sense, only waves within that range are considered sound. The human hearing range lies between 20 Hz and 20 kHz. Waves with frequencies below 20 Hz are called infrasound, while waves above 20 kHz are called ultrasound.
Before considering simple and complex sounds, one must first define the difference between such sounds and noise. As we already know, any sound signal can be represented as a sum of sinusoidal signals of different frequencies and amplitudes. This rule is called Fourier's theorem, and its application is called Fourier synthesis. Accordingly, each sound signal has its fundamental frequency, spectrum, intensity, and timbre (color). Sound differs from noise in that noise is a stochastic signal.
A simple sound is one whose spectrum consists of only one frequency, called the fundamental frequency, while a complex sound consists of the fundamental frequency and an arbitrary number of its harmonics.
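The distinction between a simple and a complex sound can be illustrated with Fourier synthesis: summing the fundamental and a few of its harmonics produces a complex tone whose spectrum shows a peak at each component. A small NumPy sketch, where the sample rate, fundamental frequency, and harmonic amplitudes are arbitrary choices for illustration:

```python
import numpy as np

fs = 8000                 # sample rate in Hz (arbitrary for this sketch)
f0 = 200                  # fundamental frequency in Hz
t = np.arange(fs) / fs    # one second of samples

# Simple sound: a single sinusoid at the fundamental frequency
simple = np.sin(2 * np.pi * f0 * t)

# Complex sound: fundamental plus its first few harmonics (Fourier synthesis)
amps = [1.0, 0.5, 0.25, 0.125]
complex_tone = sum(a * np.sin(2 * np.pi * f0 * (k + 1) * t)
                   for k, a in enumerate(amps))

spectrum = np.abs(np.fft.rfft(complex_tone))
peaks = np.argsort(spectrum)[-4:]   # bins of the 4 strongest components
# With 1 s of signal, bin index equals frequency in Hz: 200, 400, 600, 800
```

The spectrum of `simple` has a single peak at 200 Hz, while `complex_tone` has peaks at the fundamental and each harmonic.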
Types of synthesis:
1. Additive Synthesis
2. Subtractive Synthesis
3. Granular Synthesis
4. Amplitude Modulation
5. Ring Modulation
6. Frequency Modulation
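Two of the listed techniques, frequency modulation and ring modulation, can each be written in a few lines. A hedged NumPy sketch follows; the carrier, modulator, and modulation-index values are arbitrary examples, not standard presets:

```python
import numpy as np

def fm_tone(fc, fm, beta, fs=8000, dur=0.5):
    """Frequency modulation (FM) synthesis:
    y(t) = sin(2*pi*fc*t + beta*sin(2*pi*fm*t)).
    The modulation index beta controls how many audible
    sidebands appear around the carrier fc, spaced by fm.
    """
    t = np.arange(int(fs * dur)) / fs
    return np.sin(2 * np.pi * fc * t + beta * np.sin(2 * np.pi * fm * t))

def ring_mod(carrier, modulator):
    """Ring modulation: plain sample-by-sample multiplication of two
    signals, producing sum and difference frequencies."""
    return carrier * modulator

tone = fm_tone(fc=440, fm=110, beta=2.0)
```

Ring-modulating two sinusoids at 300 Hz and 100 Hz, for example, yields components at the difference (200 Hz) and sum (400 Hz) frequencies, with neither original frequency present in the output.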
Speech synthesis is the operation of converting written input into speech output. The input can be in graphemic, orthographic, or phonetic form, depending on the source. Simply put, speech synthesis is the artificial generation of human speech.
Systems used for this are called speech synthesizers and can be implemented as software or hardware.
Speech synthesis creates, from input information in textual form, a speech signal that is intelligible to a human listener.
Speech synthesis is often referred to as text-to-speech (TTS), since such systems translate text into speech.
There are several algorithms for speech synthesis, and the choice depends on the task at hand. The simplest approach is to record a person speaking the desired expressions, but this yields only a limited inventory of phrases and sentences. The quality depends on the recording conditions.
More flexible, but of lower quality, are algorithms that divide speech into smaller units. The most commonly used unit is the phoneme, the smallest linguistic unit; depending on the language, Western European languages have about 35–50 phonemes. The difficulty lies in combining phonemes, because fluent speech requires smooth transitions between the units. Intelligibility is therefore lower, but the memory requirements are small.
One solution to this problem is the use of diphones. Instead of cutting at phoneme boundaries, the cut is made in the middle of each phoneme, which leaves the transitions intact. This gives about 400 elements, and the quality improves.
The longer the units, the more elements are needed; memory requirements grow, but so does quality. Other widely used units are half-syllables, syllables, words, or combinations of these.
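Concatenative synthesis of the kind described above can be sketched by joining short recorded units with a crossfade that smooths the transitions between elements. The snippet below is a toy illustration only; the sine-wave "units", overlap length, and sample rate are assumptions, and real diphone systems additionally match pitch and timing at the joins:

```python
import numpy as np

def crossfade_concat(units, overlap):
    """Concatenate recorded units, blending each join with a linear
    crossfade over `overlap` samples to smooth the transition."""
    out = units[0].astype(float)
    fade_in = np.linspace(0.0, 1.0, overlap)
    fade_out = 1.0 - fade_in
    for unit in units[1:]:
        unit = unit.astype(float)
        # Weighted blend of the tail of the output and the head of the unit
        blended = out[-overlap:] * fade_out + unit[:overlap] * fade_in
        out = np.concatenate([out[:-overlap], blended, unit[overlap:]])
    return out

# Two toy "units": short sine snippets standing in for recorded diphones
fs = 8000
t = np.arange(400) / fs
u1 = np.sin(2 * np.pi * 220 * t)
u2 = np.sin(2 * np.pi * 330 * t)
speech = crossfade_concat([u1, u2], overlap=80)
```

Each join consumes `overlap` samples, so two 400-sample units with an 80-sample crossfade give 720 output samples; the blend keeps the waveform continuous where a hard cut would produce an audible click.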
Areas of Speech Synthesis
According to World Health Organization (WHO) data, there are 40 to 45 million blind people and some 124 million people with low vision worldwide.
Blindness is a severe disability. A person receives about 90% of information from the environment through sight. That is why the everyday activities that sighted people take for granted (moving about, doing housework, personal hygiene, etc.) demand great effort from a blind person. Blind people also face major obstacles in education and employment: inaccessible information, underestimation of their abilities, disapproval, and a lack of understanding from those around them.
By applying modern technology to the development of aids for blind people, the quality of their lives can be greatly improved, enabling them to become equal members of society.
The emergence of speech technologies, especially text-based speech synthesizers, is of the utmost importance for people with visual impairment, but it also has far wider significance. For blind people, speech technology makes it possible to be more independent, to receive a more equitable education, to train for many new jobs, and to participate more fully in social life and work.
Speech synthesizers are accompanied by software packages known in the professional, i.e. computing, literature as screen readers.
A screen reader is a software suite that, with the help of a speech synthesizer, turns all commands and visual elements into sound.
In the early days of screen readers and text-based speech synthesizers, speech was reproduced through the modest built-in speakers of the computers of the time, resulting in very poor and almost incomprehensible pronunciation. For this reason, hardware speech units (hardware speech synthesizers) were manufactured that connected to the computer's serial port and spoke exclusively English. The quality of pronunciation varied, from robotic voices at the borderline of comprehensibility to almost human-sounding speech.
All of these software programs worked very well with any voice unit that spoke English, but there was growing interest among users outside the English-speaking world in producing similar software and hardware solutions.
In the early 1990s, Dolphin Computer Access marketed its screen reader called HAL and, in parallel, produced a hardware speech unit called Apollo. Used together, HAL and Apollo supported multiple languages.
In 1993, in cooperation with the Association for the Promotion of the Education of Blind and Partially Sighted Persons in Zagreb, Dolphin Computer Access developed a chip for the pronunciation of the Croatian-Serbian language, so the HAL screen reader and the Apollo speech unit could serve a large number of blind users from the ex-YU area.