The GL Voice Quality Testing (VQT) product uses several industry-standard algorithms to measure the speech quality of a transmitted voice file. VQT compares the original, unprocessed signal with the degraded version using the POLQA (ITU-T P.863), PESQ (ITU-T P.862 and P.862.1), PAMS and PSQM/PSQM+ (ITU-T P.861) voice quality test methods; PAMS reports its results on the opinion scales defined in ITU-T P.800. The GL VQT can be installed and operated on a stand-alone system or reside as an optional feature on other GL products. GL's VQuad™ also supports the POLQA (ITU-T P.863) standard for voice quality analysis.
- Perceptual Objective Listening Quality Analysis (POLQA)
- Operations Performed by POLQA
- Results Provided by POLQA
- Perceptual Evaluation of Speech Quality (PESQ)
- Operations Performed by PESQ
- Results Provided by PESQ
- Perceptual Analysis Measurement System (PAMS)
- Operations Performed by PAMS
- Results Provided by PAMS
- Perceptual Speech Quality Measure (PSQM)
- ITU-T P.56 Measurements
POLQA versus PESQ Comparison
|POLQA|PESQ|
|---|---|
|Adopted in 2011 as ITU-T P.863|Adopted in 2001 as ITU-T P.862|
|Suitable for 3G and 4G networks, VoIP networks and NGN networks delivering HD-quality voice services such as "wideband" (7 kHz) and "super-wideband" (14 kHz) telephone calls.|Suitable for G.711 A-law and µ-law and for narrowband (300 to 3400 Hz) voice. Also supports wideband (7 kHz frequency range) through ITU-T P.862.2.|
|POLQA works quickly and accurately. It is superior to earlier standards and overcomes the known issues and limitations of PESQ.|PESQ-based measurements will remain an industry standard for several years, partly for reasons of backward compatibility.|
Perceptual Evaluation of Speech Quality (PESQ)
Modern communications networks include elements (bad coding, error-prone channels and voice activity detection) that cannot
reliably be assessed by such conventional engineering metrics as signal-to-noise ratio. One way to measure customers' perception
of the quality of these systems is to conduct a subjective test involving panels of human subjects. However, these tests are
expensive and unsuitable for such applications as real-time monitoring.
PESQ provides an objective measure that predicts the results of subjective listening tests on telephony systems. To measure
speech quality, PESQ uses a sensory model to compare the original, unprocessed signal with the degraded version at the output
of the communications system. The result of comparing the reference and degraded signals is a quality score. This score is
analogous to the subjective Mean Opinion Score (MOS) measured using panel tests according to ITU-T P.800.
PESQ incorporates many new developments that distinguish it from earlier models for assessing codecs. These innovations
allow PESQ to be used with confidence to assess end-to-end speech quality as well as the effect of such individual elements as codecs.
In addition to the standard PESQ score, the GL VQT also provides the PESQ-LQ and PESQ-LQO (P.862.1) scores. These revised scores
exhibit better correlation with subjective listening quality test scores.
Operations Performed by PESQ
The processing carried out by the PESQ algorithm includes the following stages.
In order to compare the signals, the reference speech signal and the degraded signal should be at the same, constant power
level. This is necessary because the reference signal does not have to be at a defined level and because the gain of the system
under test is unknown before testing. PESQ assumes that the subjective listening level is a constant 79 dB SPL at the ear reference
point [ITU-T P.830, section 8.1.2]. A gain is applied to both the reference and degraded signals to bring them to this level.
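The level alignment step can be illustrated with a minimal sketch. It assumes floating-point samples and uses an arbitrary digital RMS target (`TARGET_RMS`, a hypothetical value) to stand in for the 79 dB SPL listening level; the helper names are illustrative and the real algorithm measures level on filtered speech rather than on the raw samples.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def align_level(samples, target_rms):
    """Scale samples so that their RMS equals target_rms (hypothetical helper)."""
    current = rms(samples)
    if current == 0.0:
        return list(samples)  # pure silence: nothing to scale
    gain = target_rms / current
    return [s * gain for s in samples]

# Assumed target: an arbitrary digital RMS standing in for the listening level.
TARGET_RMS = 0.1

# A 440 Hz tone as the reference, and the same tone after an unknown system gain.
reference = [0.5 * math.sin(2 * math.pi * 440 * n / 8000) for n in range(8000)]
degraded = [0.3 * s for s in reference]

ref_aligned = align_level(reference, TARGET_RMS)
deg_aligned = align_level(degraded, TARGET_RMS)
```

After this step both signals sit at the same power level, so the unknown gain of the system under test no longer affects the comparison.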
Analog connections often introduce some degree of filtering. PESQ models the receive path of the telephone
handset using an input filter, which takes account of the electrical and acoustic components of the handset. The filter
used is similar to the standard "modified IRS receive characteristic" [ITU-T P.830]. Filtering is generally accepted to have less effect
on quality than coding distortions do. PESQ compensates for any filtering that has taken place in the network.
The system under test may include a delay, which may be variable. In order to compare the reference and degraded signals,
they need to be lined up with each other. PESQ applies voice activity detection to the signals to identify those parts of the signal
that are speech, ignoring noise. The PESQ time offset measurements do not take account of the delay of the test equipment
generating or recording the signal. This means that a time offset reported by PESQ on a file collected will be dependent upon
the way in which the test process is executed.
- First, PESQ aligns the overall speech signals (utterances). An utterance is a continuous speech burst identified by the voice
activity detector that does not contain pauses longer than a pre-determined threshold (200 ms). This process detects delay over
major sections of the degraded signal compared to the reference signal.
- Second, PESQ aligns overlapping sections of the speech (frames). This process detects delay that is variable over the length
of an utterance, as this can be significant in packet-based networks.
- The third stage does not occur immediately after the second stage, but is performed after the auditory transform has been
calculated. The third stage realigns "bad intervals" (sections of the speech with very large disturbance), and improves the model's
accuracy with a small number of files where delay changes are not correctly identified by the initial time alignment process.
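The first alignment stage can be sketched as a search for the lag that maximizes the cross-correlation between the two signals. This is a simplified illustration only: it correlates raw samples over a single utterance, and the function name and test signal are assumptions rather than part of the PESQ specification.

```python
import random

def estimate_delay(reference, degraded, max_lag):
    """Return the lag (in samples) at which degraded best matches reference.

    A positive result means the degraded signal is delayed relative to the
    reference. Plain cross-correlation over lags in [-max_lag, max_lag].
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, r in enumerate(reference):
            m = n + lag
            if 0 <= m < len(degraded):
                score += r * degraded[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

# Hypothetical test signal: a noise burst, attenuated and delayed by 25 samples.
rng = random.Random(0)
reference = [rng.uniform(-1.0, 1.0) for _ in range(400)]
degraded = [0.0] * 25 + [0.8 * s for s in reference]

delay = estimate_delay(reference, degraded, max_lag=50)
```

Running the same search on short overlapping frames instead of whole utterances gives the variable delay described in the second stage.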
In order to compare the reference and degraded signals, taking account of how a listener would have heard them, each is
passed through an auditory transform that mimics certain key properties of human hearing. This gives a representation in time
and frequency of the perceived loudness of the signal, known as the sensation surface.
Part of the auditory transformation equalizes certain processes that have little subjective effect. First, the transfer function
of the system is estimated, and is used to equalize the reference to the degraded in the auditory transform domain.
This takes account of filtering in analog components of the network such as telephone handsets. Second, the frame-by-frame
amplitude gain of the system is estimated and used to equalize the auditory transform of the degraded file to the reference.
In both cases the equalization is partial - large amounts of filtering or gain variation are not cancelled, and therefore result in
errors being measured.
The difference between the sensation surfaces for the reference and degraded files is known as the error surface; this
shows any audible differences introduced by the system under test. The error surface is analyzed by a process that takes
account of masking, the effect whereby small distortions in a signal are inaudible in the presence of loud signals.
From the positive and negative errors, two disturbance parameters are calculated. They are calculated as non-linear
averages over specific areas of the error surface. These disturbance parameters are:
- The absolute (symmetric) disturbance - a measure of absolute audible error
- The additive (asymmetric) disturbance - a measure of audible errors that are significantly louder than the reference
This analysis gives two error parameters that summarize the amount of each type of audible error. Finally, the error
parameters are converted to a quality score, which is a linear combination of the average symmetric disturbance value
and the average asymmetric disturbance value.
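As an illustration of this final step, the sketch below combines the two average disturbances into a raw score. The coefficients are those published in ITU-T P.862 and are not stated in this document; the function name is hypothetical.

```python
def pesq_raw_score(d_sym, d_asym):
    """Linear combination of the average symmetric and asymmetric disturbances.

    Coefficients taken from ITU-T P.862 (an addition, not from this document).
    """
    return 4.5 - 0.1 * d_sym - 0.0309 * d_asym

# No audible disturbance gives the maximum raw score.
perfect = pesq_raw_score(0.0, 0.0)

# Hypothetical disturbance values for a noticeably degraded file.
degraded_score = pesq_raw_score(10.0, 20.0)
```

Because the asymmetric disturbance has the smaller coefficient, a given amount of additive distortion lowers the score less than the same amount of symmetric disturbance.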
Results Provided by PESQ
PESQ returns a quality score, known as the PESQ score, which conforms to ITU-T P.862. The PESQ score lies on a scale from
-0.5 to 4.5, though in most cases it is between 1 and 4.5. The PESQ score correlates with subjective quality scores. However,
the PESQ score tends to be optimistic for poor quality speech and pessimistic for good quality speech. Alternative mappings
of the PESQ score have been developed that exhibit a better correlation with subjective test scores. These are referred to
as the PESQ-LQ and PESQ-LQO scores.
PESQ-LQ scores are closer to the listening quality subjective opinion scale, which is standard in the industry and is
defined in ITU-T P.800. Listening quality scores lie between 1 and 5. PESQ-LQ scores lie between 1.0 and 4.5 because
4.5 is usually the maximum obtained in a subjective test.
Listening Quality Scale:
|Score|Quality of the speech|
|---|---|
|5|Excellent|
|4|Good|
|3|Fair|
|2|Poor|
|1|Bad|
The score gives a measure of customers' perception of quality. The highest score, 4.5, means that no distortion is measured.
As the amount of distortion increases the quality falls.
The aim of a separate recommendation ITU-T P.862.1 is to provide a single mapping from raw P.862 score to the Listening
Quality Objective Mean Opinion Score (LQO-MOS). This latest ITU standard improves on the original PESQ (P.862) by correlating
better to subjective test results.
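The mapping from raw score to MOS-LQO can be sketched as follows. The logistic function and its constants are those published in ITU-T P.862.1, not values given in this document, and the function name is hypothetical.

```python
import math

def pesq_to_mos_lqo(raw_score):
    """Map a raw P.862 score to MOS-LQO using the P.862.1 logistic function."""
    return 0.999 + (4.999 - 0.999) / (1.0 + math.exp(-1.4945 * raw_score + 4.6607))

# The ends of the raw PESQ scale (-0.5 and 4.5) map to roughly 1.02 and 4.55.
lowest = pesq_to_mos_lqo(-0.5)
highest = pesq_to_mos_lqo(4.5)
```

The S-shaped curve compresses the raw score at both ends of the scale, which is the mechanism by which the mapped score tracks subjective results more closely.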
Typical PESQ Score Comparisons
Based on simulations and real measurements, the table below represents the results of a number of typical networks and
codecs with no errors or packet loss. In addition, it gives the scores that can be expected in some mobile network conditions
where errors are significant.
|Clean ISDN network
|Analog network (G.711)
|G.728 codec (16kbit/s)
|G.729 codec (8kbit/s)
|G.723.1 codec (6.3kbit/s)
|GSM EFR codec (12.2kbit/s)
|GSM FR codec (13kbit/s)
|GSM-EFR mobile network in typical operating range||3.6 to 3.1||3.6 to 2.9||3.7 to 3.0
|GSM-EFR mobile network in very poor conditions
For more details, please visit the PESQ Measurement webpage.
Perceptual Analysis Measurement System (PAMS)
Traditionally, the only way to measure customers' perception of the quality of modern communications was to conduct a
subjective test, but these tests are expensive and unsuitable for applications such as real-time monitoring. PAMS provides
an objective measure that predicts the results of subjective listening tests on a telephony system. To measure speech
quality, PAMS uses a sensory model to compare the original, unprocessed signal with the degraded version at the output
of the communications system. PAMS parameterizes different classes of errors and maps them to predictions of subjective
listening quality and listening effort. The mappings are calibrated using a large database of subjective tests. Other
diagnostics are also returned.
PAMS incorporates many new developments that distinguish it from earlier codec assessment models such as those
given in ITU-T P.861. These innovations allow PAMS to be used with confidence to assess end-to-end speech quality as
well as the effect of individual elements such as codecs.
Operations Performed by PAMS
The processing carried out by the PAMS algorithm includes the following stages.
PAMS is a listening model and has no knowledge of the delay of the system. In order to compare the reference and
degraded signals, however, they need to be lined up with each other. This enables the analysis to cancel any bulk delay
and also most delay changes that might be caused by, for example, packet-based transmission.
Analog connections often introduce some degree of filtering. PAMS identifies any filtering that has taken place in
the network and cancels its effect.
In order to compare the reference and degraded signals in a meaningful way, they are passed through an auditory
transform that mimics certain key properties of human hearing.
This analysis gives a number of error parameters that summarize the amount of each type of audible error.
Finally the error parameters are mapped onto predictions of perceived listening quality and listening effort. These
mappings are calculated and verified using a very large database of subjective tests to ensure that PAMS is able to
predict quality for a wide range of distortion types.
Results Provided by PAMS
PAMS returns quality scores on two different opinion scales, listening quality and listening effort. These scales are
standard and are defined in ITU-T Rec. P.800. Both Listening Quality and Listening Effort use a range between 1 and 5
and are usually quoted to two decimal places.
Listening Quality Scale:
|Score|Quality of the speech|
|---|---|
|5|Excellent|
|4|Good|
|3|Fair|
|2|Poor|
|1|Bad|
Listening Effort Scale:
|Score|Effort required to understand the meaning of sentences|
|---|---|
|5|Complete relaxation possible; no effort required|
|4|Attention necessary; no appreciable effort required|
|3|Moderate effort required|
|2|Considerable effort required|
|1|No meaning understood with any feasible effort|
The scores give a measure of customers' perception of quality. A PAMS score of 5 indicates that no distortion is measured. As the
amount of distortion increases, the quality falls. Because they relate to different aspects of subjectivity, the listening effort and
listening quality scores normally differ if there is perceived distortion; the listening effort score is usually higher than the listening quality score.
Perceptual Speech Quality Measure (PSQM)
Subjective quality assessment of speech codecs is one of the key technologies in designing digital telecommunication networks.
ITU-T Recommendation P.830 defines subjective testing methodologies for speech codecs. Because subjective quality assessment is
time-consuming and expensive, it is desirable to develop an objective quality assessment methodology that estimates the
subjective quality of speech codecs with less subjective testing.
The most widely used objective speech quality measure for demonstrating the performance of speech codecs is the Signal-to-Noise
Ratio (SNR = S/N). However, the SNR does not adequately predict subjective quality for modern network
components. This is especially true for recent low bit-rate codecs. Therefore, a variety of more sophisticated objective quality
measures, such as the LPC Cepstrum Distance Measure, Information Index (II), Coherence Function (CHF), Expert Pattern
Recognition (EPR), and Perceptual Speech Quality Measure (PSQM), were developed. The performance of these measures, in terms
of their ability to give accurate estimates of subjective quality, has been investigated in the ITU-T since the 1980s.
After careful comparisons among these objective quality measures, it was concluded that the PSQM best correlated with the
subjective quality of coded speech.
Results Provided by PSQM
The VQT performs the PSQM measurement if the algorithm is licensed and the option is selected. The implementation of PSQM
is based upon ITU-T Rec. P.861. The algorithm and functionality are described in P.861 and are not repeated here.
The mapping of the PSQM value to Mean Opinion Score (MOS) is described in P.861. A PSQM score of 0 is equivalent to excellent
and 6.5 to very poor on the Listening Quality Scale defined in ITU-T Rec. P.800. For simplicity we suggest a linear re-scaling, i.e. MOS
Listening Quality = 5 - (4 * PSQM / 6.5). Other mappings may be more appropriate. Both scoring ranges are available within the
GL VQT application.
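The suggested linear re-scaling is simple enough to show directly. In this sketch the function name is hypothetical, and clamping the input to the 0 to 6.5 range is an added assumption so that the result always stays between 1 and 5.

```python
def psqm_to_mos(psqm):
    """Linear re-scaling: MOS Listening Quality = 5 - (4 * PSQM / 6.5).

    The input is clamped to [0, 6.5] (an assumption, not stated in the text).
    """
    clamped = min(max(psqm, 0.0), 6.5)
    return 5.0 - 4.0 * clamped / 6.5

best = psqm_to_mos(0.0)      # PSQM 0 is "excellent"
middle = psqm_to_mos(3.25)   # halfway along the PSQM range
worst = psqm_to_mos(6.5)     # PSQM 6.5 is "very poor"
```

Note that the two scales run in opposite directions: a lower PSQM value means better quality, while a higher MOS does.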
ITU-T P.56 Measurements
The VQT always performs the ITU-T P.56 algorithm (Recommendation P.56, Method B) on the reference data and the degraded
data and calculates the mean active speech level, activity factor and peak value for each input.
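To make the three reported quantities concrete, the sketch below computes simplified versions of them. It is not the P.56 Method B procedure: it assumes samples normalized to the range -1 to 1 and marks a sample as active with a fixed per-sample threshold (`threshold_db`, a hypothetical parameter), whereas Method B derives activity from a smoothed envelope.

```python
import math

def speech_level_metrics(samples, threshold_db=-40.0):
    """Simplified level metrics; NOT the ITU-T P.56 Method B procedure.

    Returns (active level in dB re full scale, activity factor, peak value).
    A sample counts as active when its magnitude exceeds threshold_db
    relative to full scale (an assumed, fixed threshold).
    """
    threshold = 10 ** (threshold_db / 20)
    active = [s for s in samples if abs(s) > threshold]
    peak = max(abs(s) for s in samples)
    activity_factor = len(active) / len(samples)
    if active:
        mean_power = sum(s * s for s in active) / len(active)
        active_level_db = 10 * math.log10(mean_power)
    else:
        active_level_db = float("-inf")
    return active_level_db, activity_factor, peak

# Hypothetical input: half a file of constant-magnitude signal, half silence.
samples = [0.5 if n % 2 == 0 else -0.5 for n in range(500)] + [0.0] * 500
level_db, activity, peak = speech_level_metrics(samples)
```

Measuring the level over active samples only keeps long pauses from lowering the reported speech level.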
Perceptual Objective Listening Quality Analysis (POLQA)
POLQA, the successor of PESQ (ITU-T P.862) analysis, is the next generation voice quality testing standard for fixed, mobile and IP-based networks. Based on ITU standard ITU-T P.863, POLQA supports the latest HD-quality speech coding and network transport technology, with higher accuracy for 3G, 4G/LTE and VoIP networks.
The POLQA algorithm handles higher-bandwidth audio signals. It supports measurements in narrowband (NB, 300-3400 Hz) and adds significant new capabilities for wideband (WB, 100-7000 Hz) and super-wideband (SWB, 50-14000 Hz) signals, which are commonly found in VoIP and next-generation mobile networks.
Further improvements of this algorithm include the handling of signals with many delay variations and support for assessment of speech signals recorded acoustically by HATS (head and torso simulator).
Similar to PESQ, POLQA is a Full Reference (FR) algorithm that rates a degraded or processed speech signal in relation to the original (reference) signal. POLQA analyzes the degraded speech signal sample by sample after temporal alignment with the reference test signal, and perceptual differences between the two signals are scored as distortions. The psycho-acoustic model is based on models of human perception similar to those of PESQ. The signals are analyzed in the frequency domain (in critical bands) after masking functions have been applied, and unmasked differences between the two signal representations are counted as distortions. Finally, the accumulated distortions in the speech file are mapped onto a 1 to 5 quality scale in accordance with MOS (Mean Opinion Score) tests. FR measurements deliver the highest accuracy and repeatability but can only be applied for dedicated tests in live networks.
Using its send/record voice functionality, the VQT software can analyze the recorded speech signals with POLQA. The POLQA analysis results include POLQA MOS, E-Model, Signal Level, SNR, and Jitter.
Operations Performed by POLQA
The basic concepts of the temporal alignment are:
- To split the signals into equidistant pairs of frames and to calculate a delay for each frame pair.
- Whenever possible, the matching counterparts of the degraded signal sections are searched for in the reference signal and not vice versa.
- Stepwise refinement of the delay per frame to avoid long search ranges (long search ranges require high computing power and are critical in combination with time scaled input signals).
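The concepts above can be sketched in a few lines. This is an illustration under simplifying assumptions, not the P.863 procedure: frames do not overlap, matching uses plain cross-correlation, and "stepwise refinement" is reduced to one wide search for the first frame followed by narrow searches centered on the previous frame's delay. All names and the test signal are hypothetical.

```python
import random

def frame_delays(reference, degraded, frame_len, coarse_range, fine_range):
    """Estimate one delay per degraded frame by searching in the reference."""

    def best_lag(start, center, search):
        # Try lags around `center`; degraded[n] is compared with reference[n - lag].
        best, best_score = center, float("-inf")
        for lag in range(center - search, center + search + 1):
            score = 0.0
            for i in range(frame_len):
                ref_index = start + i - lag
                if 0 <= ref_index < len(reference):
                    score += degraded[start + i] * reference[ref_index]
            if score > best_score:
                best, best_score = lag, score
        return best

    delays = []
    for start in range(0, len(degraded) - frame_len + 1, frame_len):
        if not delays:
            delays.append(best_lag(start, 0, coarse_range))          # wide first search
        else:
            delays.append(best_lag(start, delays[-1], fine_range))   # narrow follow-up
    return delays

# Hypothetical signal: delay of 10 samples that jumps to 15 halfway through.
rng = random.Random(1)
reference = [rng.uniform(-1.0, 1.0) for _ in range(1200)]
degraded = [0.0] * 10 + reference[:600] + [0.0] * 5 + reference[600:]

delays = frame_delays(reference, degraded, frame_len=200, coarse_range=20, fine_range=8)
```

Keeping the follow-up search narrow is what limits the computing cost; the trade-off is that a delay jump larger than `fine_range` would be missed in this simplified form.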
Sample rate estimation
The sample rate ratio detection is required to compensate for perceptually irrelevant differences in the play-out speed of the reference and the degraded signals.
The detection of this effect as implemented in the ITU-T P.863 algorithm is based on the delay per frame vector and the detected active sections of the speech signals, as determined by the temporal alignment.
If the detected sample rate ratio deviates from 1 by more than 0.01 (a difference of more than 1%), the signal with the higher sample rate is down-sampled and the entire processing restarts from the beginning. This happens at most once, to avoid excessive looping for signals whose sample rate ratio cannot be determined reliably.
Even if the sample rate determination cannot be made with perfect accuracy, e.g., in case of signals with additional variable delay, the detected sample rate ratio is still accurate enough to bring the signals back to the safe operating range of the temporal alignment.
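One way to picture how a delay-per-frame vector reveals a play-out speed difference is a straight-line fit: a constant speed mismatch makes the delay drift linearly along the file. The sketch below assumes exactly that model and interprets the 0.01 threshold as the allowed deviation of the ratio from 1; both are assumptions, and the function names are hypothetical.

```python
def estimate_rate_ratio(frame_starts, frame_delays):
    """Least-squares slope of delay versus position, returned as a rate ratio.

    Assumption: a constant speed mismatch makes the per-frame delay drift
    linearly, so ratio = 1 + slope (delay and position both in samples).
    """
    n = len(frame_starts)
    mean_x = sum(frame_starts) / n
    mean_y = sum(frame_delays) / n
    sxx = sum((x - mean_x) ** 2 for x in frame_starts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(frame_starts, frame_delays))
    return 1.0 + sxy / sxx

def needs_resampling(ratio, tolerance=0.01):
    """True when the ratio deviates from 1 by more than the tolerance."""
    return abs(ratio - 1.0) > tolerance

# Hypothetical delay vector: 40 samples of fixed delay plus a 2% drift.
frame_starts = [i * 256 for i in range(50)]
drifting = [40 + 0.02 * x for x in frame_starts]
steady = [40.0 for _ in frame_starts]

drift_ratio = estimate_rate_ratio(frame_starts, drifting)
steady_ratio = estimate_rate_ratio(frame_starts, steady)
```

A fixed bulk delay only shifts the fitted line up or down, so it does not affect the slope and therefore does not trigger resampling.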
The ITU-T P.863 algorithm is designed to take into account the impact of the playback level on the perceived quality prediction in super wideband mode; the playback level is calculated relative to a nominal level of –26 dBov, which represents 73 dB(A) SPL in dichotic presentation.
In narrowband operational mode the ITU-T P.863 algorithm is designed to determine the listening speech quality at a constant listening level of 79 dB(A) SPL.
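The relationship between the digital level and the playback level in super wideband mode is plain arithmetic: every dB above or below the nominal –26 dBov shifts the playback level by the same amount from 73 dB(A) SPL. The sketch assumes samples normalized so that full scale is ±1.0 and computes the level over the whole signal rather than over active speech; the function names are hypothetical.

```python
import math

NOMINAL_DBOV = -26.0   # nominal digital level from the text
NOMINAL_SPL = 73.0     # playback level it represents, in dB(A) SPL

def level_dbov(samples):
    """Mean power in dBov, assuming full scale is +/-1.0."""
    power = sum(s * s for s in samples) / len(samples)
    if power == 0.0:
        return float("-inf")
    return 10 * math.log10(power)

def playback_spl(samples):
    """Playback level implied by the digital level of the samples."""
    return NOMINAL_SPL + (level_dbov(samples) - NOMINAL_DBOV)

# A constant-magnitude signal exactly at the nominal level, and one 10 dB lower.
amplitude = 10 ** (NOMINAL_DBOV / 20)
nominal = [amplitude, -amplitude] * 4000
quieter = [s * 10 ** (-10 / 20) for s in nominal]

nominal_spl = playback_spl(nominal)
quieter_spl = playback_spl(quieter)
```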
Frequency response and time alignment
The ITU-T P.863 algorithm can operate in two modes: narrowband mode and super wideband mode. In narrowband mode, both the reference and degraded signals are pre-filtered with an IRS receive filter, representing a listening situation in which subjects judge the quality of the speech signals over an IRS receive handset in monotic mode or over an IRS receive headset in monotic mode. In super wideband mode, neither the reference nor the degraded signal is filtered, representing a listening situation in which subjects judge the quality of the speech signals over a diffuse field equalized headset in dichotic mode.
Results Provided by POLQA
The principal result of POLQA is the MOS-LQO, which directly expresses the voice quality on the MOS scale. It is important to understand and consider the two different operational modes supported by the ITU-T P.863 algorithm:
- super wideband mode for listening over super wideband headphones;
- narrow-band mode for listening over loosely coupled IRS type handsets.
In super wideband mode the impact of the playback level is modeled, and the default calibration factor (C) of 2.8 has to be used in combination with the standard –26 dBov scaling for playback levels of 73 dB(A) SPL (dichotic). Playback levels down to 53 dB(A) SPL and up to 78 dB(A) SPL may be used, and MOS-LQO scores should be reported in the format MOS-LQOsw (dB level). In narrowband mode only the playback level of 79 dB(A) SPL (monotic) is supported. Narrowband mode MOS scores are referred to as MOS-LQO.
The maximum ITU-T P.863 MOS-LQO score is 4.5 in narrow-band while in super wideband mode this point lies at 4.75. Under some circumstances, when the reference signal contains noise or when the voice timbre is distorted, a transparent chain will not provide the maximum MOS score of 4.5 in narrowband mode or 4.75 in super wideband mode.
The table below compares PESQ and POLQA MOS scores:
G.107 R-Factor / Ie Value
POLQA also provides a mapping of the MOS-LQO score to the scale used by ITU-T G.107 (the E-model). The resulting parameter is equivalent to an Ie value; many people also refer to it as an R-factor. The scale ranges from 0 (bad) to 100 (best). All values below 60 indicate unacceptable quality.
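For orientation, the sketch below shows the R-to-MOS conversion published in ITU-T G.107 and inverts it numerically to go from a MOS value back to the R scale. Whether POLQA implementations use exactly this inversion is an assumption; the function names and the bisection approach are illustrative only.

```python
def r_to_mos(r):
    """ITU-T G.107 conversion from the R scale to an estimated MOS."""
    if r <= 0.0:
        return 1.0
    if r >= 100.0:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6

def mos_to_r(mos, lo=6.5, hi=100.0):
    """Invert r_to_mos by bisection over R = 6.5..100, where it is monotonic.

    MOS values outside the reachable range are clamped (an assumption).
    """
    mos = min(max(mos, r_to_mos(lo)), r_to_mos(hi))
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if r_to_mos(mid) < mos:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# R = 60, the acceptability limit quoted above, corresponds to a MOS of 3.1.
mos_at_limit = r_to_mos(60.0)
r_at_limit = mos_to_r(3.1)
```

Under this conversion, the 60-point limit on the R scale lines up with a MOS of about 3.1, slightly above "fair" on the listening quality scale.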
Analog equipment in particular modifies the level of the speech signal. High attenuation generally leads to a worse perception of voice quality. In contrast to PESQ, POLQA weights this as a degradation of the signal. Knowing the value of the attenuation is also important for optimizing the overall system design. Attention should be paid to signals that show either a negative attenuation or an attenuation larger than approximately 10 dB. In the first case, the signal was amplified instead of attenuated, which may lead to clipping during transmission. In the second case, quantization noise may become an important source of degradation if low-level analog signals are converted to the digital domain and subsequently amplified there. Depending on the test setup, both cases may be acceptable and intended, but this has to be decided case by case.
In order to calculate the attenuation, POLQA computes P.56-like active speech levels of the reference and the degraded signals in dB. The level of the reference signal minus the level of the degraded signal is then used as the attenuation, so that an amplified signal yields a negative value.
Level and Background Noise Measurements
In transmission systems it is frequently important to know the exact levels of the signals. For VoIP systems and voice activity detection (VAD) in particular, it is also important to know the signal level during silent intervals as well as during active speech, and the received background noise must not exceed a certain limit. Levels can be measured in dB, to relate them directly to a sound pressure or electrical level, or as loudness levels.
Signal to Noise Ratio (SNR)
POLQA calculates the SNR for the reference and the degraded signal independently. Both the noise level and the signal level are calculated by the VAD that POLQA uses for the temporal alignment.
Active Speech Ratio (ASR)
ASR is calculated by POLQA from the output of the Voice Activity Detection (VAD) that is part of the temporal alignment. The ASR is the ratio between the duration of active speech and the overall signal length.
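Both quantities follow directly from a speech/non-speech decision per sample. The sketch below assumes such a decision is already available as a boolean mask (`vad_mask`, a hypothetical input) and does not reproduce the VAD itself; the function name is illustrative.

```python
import math

def snr_and_asr(samples, vad_mask):
    """Return (SNR in dB, active speech ratio) from samples and a VAD mask.

    vad_mask holds True where speech is active; it is assumed to come from
    a voice activity detector that is not shown here.
    """
    speech = [s for s, active in zip(samples, vad_mask) if active]
    noise = [s for s, active in zip(samples, vad_mask) if not active]
    asr = len(speech) / len(samples)
    if not speech or not noise:
        return float("nan"), asr   # SNR undefined without both parts
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(s * s for s in noise) / len(noise)
    if noise_power == 0.0:
        return float("inf"), asr
    return 10 * math.log10(speech_power / noise_power), asr

# Hypothetical signal: 600 active samples at magnitude 0.5, 400 noise samples at 0.005.
samples = [0.5 if n % 2 == 0 else -0.5 for n in range(600)]
samples += [0.005 if n % 2 == 0 else -0.005 for n in range(400)]
vad_mask = [True] * 600 + [False] * 400

snr_db, asr = snr_and_asr(samples, vad_mask)
```

Applying the same function to the reference and to the degraded signal separately gives the two independent SNR values that the text describes.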
For more details and screenshots, please visit POLQA using VQuad™ - VQT.