VQT Frequently Asked Questions
What does PESQ mean and what are its characteristics?
Perceptual Evaluation of Speech Quality (PESQ)
Modern communications networks include elements (bad coding, error-prone channels and voice activity detection) that cannot reliably be assessed by such conventional engineering metrics as signal-to-noise ratio. One way to measure customers' perception of the quality of these systems is to conduct a subjective test involving panels of human subjects. However, these tests are expensive and unsuitable for such applications as real-time monitoring.
PESQ provides an objective measure that predicts the results of subjective listening tests on telephony systems. To measure speech quality, PESQ uses a sensory model to compare the original, unprocessed signal with the degraded version at the output of the communications system.
The result of comparing the reference and degraded signals is a quality score. This score is analogous to the subjective Mean Opinion Score (MOS) obtained from panel tests conducted according to ITU-T P.800. PESQ incorporates many new developments that distinguish it from earlier models for assessing codecs. These innovations allow PESQ to be used with confidence to assess end-to-end speech quality as well as the effect of such individual elements as codecs.
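The raw PESQ score is commonly converted to a listening-quality MOS scale. GL's own PESQ LQ mapping is not reproduced here, but the standardized ITU-T P.862.1 mapping (a logistic function from the raw P.862 score to MOS-LQO) illustrates the idea; a minimal sketch in Python:

```python
import math

def pesq_to_mos_lqo(raw_pesq):
    """Map a raw ITU-T P.862 PESQ score to MOS-LQO per ITU-T P.862.1.

    The logistic mapping compresses the raw score into the familiar
    1-to-5 listening-quality range (asymptotes at 0.999 and 4.999).
    """
    return 0.999 + 4.0 / (1.0 + math.exp(-1.4945 * raw_pesq + 4.6607))
```

Note that this is the ITU-T standardized mapping, not GL's revised PESQ LQ score; the two serve the same purpose of improving correlation with subjective listening tests.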
In addition to the standard PESQ score, the GL VQT also provides the PESQ LQ score. This revised score exhibits better correlation with subjective listening-quality test scores.
What are the background and qualities of the GL-supplied Reference Files?
Considerable effort has been devoted to designing test signals with speech-like properties for evaluating the performance of network equipment. Although single-frequency tones are effective for simple, linear analogue network elements, they are inappropriate for testing the complex devices that have become widespread in the last twenty years. Most approaches to the test-signal problem have simply bolted together a number of signal processes: spectral shaping, noise, temporal envelope structure, variable power spectral density, probabilistic amplitude distributions and so on. Even these rather complicated signals failed to stimulate non-linear devices, such as low bit rate codecs, in a speech-like way.
Markov Spherically Invariant Random Processes (M-SIRP) signals are perhaps the most sophisticated test signals devised to date. However, they are inadequate: their short uniform segments do not reproduce all the required speech properties and introduce non-speech properties at segment boundaries. The best signal for testing the effectiveness of a speech transmission path must be speech itself; a Phonytalk(TM) test signal contains examples of many different properties of speech delivered in a convenient package.
The Derivation of the Artificial Speech Test Signal
Speech consists of a finite number of sounds that can be grouped phonemically. Each phoneme is subject to wide speaker-dependent variation and is affected by its association with other phonemes and the context in which it is produced. The best way to test the effectiveness of a speech transmission path is with real speech. For a test to be rigorous enough to represent all naturally occurring speech, however, a large number of speakers and a large amount of speech material are required. Many sounds would be repeated, which represents redundancy in a test stimulus for objective measurements.
One aspect of speech transmission that is affected by the presence of non-linear processes is speech level. Speech level errors lead to quality degradation. Low bit rate codecs can produce erroneous results with the same sound at different speech levels. The use of tones and complex test signals will not necessarily produce a result representative of in-service operation. The best test signal is real speech. Unfortunately the very large number of different sounds in natural language renders the test process unwieldy. A representative subset of sounds would make the process more efficient.
A large corpus of conversational speech material was phonemically transcribed, and groups of phonemes were categorized according to their acoustic characteristics. The relative frequencies of occurrence, transitional probabilities and so on were extracted from the corpus. The acoustic groupings might, for example, subdivide vowel sounds into front, middle, back, round, short, long and so on. The speech sounds can then be formed into linguistically legal sequences that respect the measured transitional probabilities. These sequences, which are statistically representative of thousands of other related sounds, last about 25 seconds.

This phonemic string can be applied to a speech synthesizer to produce a sequence of sounds: a linguistically motivated test signal. The Laureate Speech Synthesizer draws on a database containing a very large number of very short speech sounds. Diphone and triphone concatenation provides the closest possible match between the symbol stream and the collection of speech sounds the synthesizer uses to produce the desired output. The linguistically motivated test signal produced by Laureate has been transcribed by phoneticians to confirm that it matches the desired symbol stream. The resulting signal is therefore representative of the full range of sounds and transitions that occur in natural conversational speech, yet is of a practical duration for testing.
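As a toy illustration of the transitional-probability step described above (the symbol inventory and probabilities here are invented for the example; the real signal uses probabilities extracted from the transcribed corpus and a far richer phoneme set), a first-order Markov chain over coarse phoneme classes can be sampled like this:

```python
import random

# Illustrative (NOT corpus-derived) first-order transitional
# probabilities between a few coarse phoneme classes.
TRANSITIONS = {
    "V_front": {"C_stop": 0.5, "C_fric": 0.3, "V_back": 0.2},
    "V_back":  {"C_stop": 0.4, "C_fric": 0.4, "V_front": 0.2},
    "C_stop":  {"V_front": 0.6, "V_back": 0.4},
    "C_fric":  {"V_front": 0.5, "V_back": 0.5},
}

def generate_sequence(start, length, seed=0):
    """Sample a symbol stream whose successive transitions follow the
    chain's probabilities, i.e. a linguistically legal sequence."""
    rng = random.Random(seed)
    seq = [start]
    for _ in range(length - 1):
        symbols, weights = zip(*TRANSITIONS[seq[-1]].items())
        seq.append(rng.choices(symbols, weights=weights)[0])
    return seq
```

Such a symbol stream would then be rendered to audio by a concatenative synthesizer; that rendering step is beyond this sketch.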
Can I create and use my own Reference Files?
GL advises customers to use the supplied Reference Files, since these have been properly created and reflect the full range of sounds and transitions that occur in natural conversational speech; however, customers may create their own. Please follow the recommendations below when creating custom Reference Files.
The communication network may treat speech and silence differently, and often behaves in a way that depends on the signals passing through it. In designing a test signal it is essential to consider the following factors:
- Temporal structure: Test signals should include speech bursts separated by silent periods. To be representative of natural pauses in speech, speech bursts should normally be 1-3 seconds in duration. To exercise certain types of voice activity detection, silent periods should be at least 200 ms in duration. As a guide, speech should be active between 40% and 80% of the time.
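The temporal-structure guidance above can be checked programmatically. The sketch below assumes a frame-level voice-activity mask is already available (the 10 ms frame length and the helper name are illustrative, not part of any GL API):

```python
def check_temporal_structure(active, frame_ms=10):
    """Validate a frame-level VAD mask against the guidance above:
    bursts of 1-3 s, inter-burst silences of at least 200 ms, and
    overall activity between 40% and 80%.

    `active` is a list of booleans, one per frame.
    """
    # Collapse the mask into [is_active, duration_ms] runs.
    runs = []
    for flag in active:
        if runs and runs[-1][0] == flag:
            runs[-1][1] += frame_ms
        else:
            runs.append([flag, frame_ms])
    activity = sum(d for f, d in runs if f) / (len(active) * frame_ms)
    bursts_ok = all(1000 <= d <= 3000 for f, d in runs if f)
    # Ignore leading/trailing silence when checking minimum gaps.
    inner = runs[1:-1] if len(runs) > 2 else []
    gaps_ok = all(d >= 200 for f, d in inner if not f)
    return bursts_ok and gaps_ok and 0.40 <= activity <= 0.80
```

A simple energy threshold is often enough to produce the mask for clean studio recordings; for noisier material a proper VAD is preferable.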
- Level and frequency content: In digital speech files, a typical level is -26 dBov. Signals injected into the network should normally be at the appropriate calibrated level, which may vary depending on national standards and the impedance of the circuit.
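As a rough illustration of level calibration, the sketch below measures and scales a 16-bit signal to a target dBov using plain RMS over the whole file. Note this is a simplification: formal calibration uses the active speech level per ITU-T P.56, which excludes silent periods from the measurement.

```python
import math

FULL_SCALE = 32768.0  # 16-bit digital overload point

def measure_dbov(samples):
    """RMS level of a signal in dB relative to digital overload (dBov)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms / FULL_SCALE)

def scale_to_dbov(samples, target_dbov=-26.0):
    """Scale a signal so its whole-file RMS level hits the target dBov."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    gain = 10.0 ** (target_dbov / 20.0) * FULL_SCALE / rms
    return [s * gain for s in samples]
```

For a speech file with pauses, whole-file RMS reads several dB below the active speech level, so the two methods give different gains.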
- Source Material: Natural recorded speech or the Artificial Speech Test Stimulus (ASTS) may be used as test signals. Natural speech recordings should contain a representative and balanced range of parts of speech, and if different recordings are concatenated, the joins must be made only in silent periods to avoid discontinuities. Signals that are not speech-like should not be used with PAMS.
- Duration of an individual recording: PAMS is optimized for recordings 8 s in duration containing 4 s of active speech. As a guide, the minimum length for a measurement to give a representative PAMS score is about 6 s, containing at least 3 s of active speech. Recordings 16 s or longer in duration should be split into shorter sections, each processed separately through PAMS.
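The duration rules above can be sketched as a small helper. The thresholds follow the text; the function itself is illustrative, not a GL utility, and each returned section would still need to satisfy the minimum active-speech requirement:

```python
def split_recording(samples, rate, max_s=16.0, target_s=8.0):
    """Pass short recordings through unchanged; split recordings of
    16 s or longer into roughly 8 s sections for separate processing."""
    if len(samples) < max_s * rate:
        return [samples]
    n = int(target_s * rate)
    return [samples[i:i + n] for i in range(0, len(samples), n)]
```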
- Reference Signal: The reference should be distortion-free. Certain types of pre-processing make little practical difference to a PAMS score, while other types may significantly affect the signal's quality. Various types of noise may be added to evaluate the system's performance in transmitting noisy speech.
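To evaluate performance on noisy speech, noise can be mixed into the source signal at a controlled signal-to-noise ratio before injection. The sketch below uses white Gaussian noise for simplicity; real tests often use more representative noise types such as babble, Hoth, or car noise:

```python
import math
import random

def add_noise_at_snr(speech, snr_db, seed=0):
    """Mix white Gaussian noise into a clean signal at a target SNR (dB).

    The noise is scaled so that signal power / noise power equals
    10 ** (snr_db / 10) exactly for this realization.
    """
    rng = random.Random(seed)
    sig_power = sum(s * s for s in speech) / len(speech)
    noise = [rng.gauss(0.0, 1.0) for _ in speech]
    noise_power = sum(n * n for n in noise) / len(noise)
    gain = math.sqrt(sig_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

Only the degraded-side input is prepared this way; the distortion-free reference used for scoring stays clean.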