
Description:
The GL Voice Quality Testing (VQT) product utilizes several industry standard ITU algorithms in order to measure the speech
quality of a transmitted voice file. VQT compares the original unprocessed signal with the degraded version using PESQ
(ITU-T P.862+P.862.1), PAMS (ITU-T P.800) and PSQM/PSQM+ (ITU-T P.861). The GL VQT can either be installed and operated
on a stand-alone system or reside, as an optional feature, on other GL products.
- Perceptual Evaluation of Speech Quality (PESQ)
- Operations Performed by PESQ
- Results Provided by PESQ
- Perceptual Analysis Measurement System (PAMS)
- Operations Performed by PAMS
- Results provided by PAMS
- Perceptual Speech Quality Measure (PSQM)
- Results Provided by PSQM
- ITU-T P.56 Measurements
Perceptual Evaluation of Speech Quality (PESQ)
Modern communications networks include elements (bad coding, error-prone channels and voice activity detection) that cannot
reliably be assessed by such conventional engineering metrics as signal-to-noise ratio. One way to measure customers' perception
of the quality of these systems is to conduct a subjective test involving panels of human subjects. However, these tests are
expensive and unsuitable for such applications as real-time monitoring.
PESQ provides an objective measure that predicts the results of subjective listening tests on telephony systems. To measure
speech quality, PESQ uses a sensory model to compare the original, unprocessed signal with the degraded version at the output
of the communications system. The result of comparing the reference and degraded signals is a quality score. This score is
analogous to the subjective Mean Opinion Score (MOS) measured using panel tests according to ITU-T P.800.
PESQ incorporates many new developments that distinguish it from earlier models for assessing codecs. These innovations
allow PESQ to be used with confidence to assess end-to-end speech quality as well as the effect of such individual elements as codecs.
In addition to the standard PESQ score, the GL VQT also provides the PESQ-LQ and PESQ-LQO (P.862.1) scores. These revised scores
exhibit better correlation to subjective listening quality test scores.
Operations Performed by PESQ
The processing carried out by the PESQ algorithm includes the following stages.
Level Alignment
In order to compare the signals, the reference speech signal and the degraded signal should be at the same, constant power
level. This is necessary because the reference signal does not have to be at a defined level and because the gain of the system
under test is unknown before testing. PESQ assumes that the subjective listening level is a constant 79 dB SPL at the ear reference
point [ITU-T P.830, section 8.1.2]. A gain is applied to both the reference and degraded signals to bring them to this level.
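The idea of bringing both signals to a common power level can be sketched as follows. This is a minimal illustration only: the function names and the target RMS value are assumptions chosen for the example, and the calibration that P.862 actually uses is not reproduced here.

```python
import math

def rms(samples):
    """Root-mean-square level of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def align_level(samples, target_rms=0.1):
    """Scale samples so their RMS equals target_rms (a hypothetical target)."""
    level = rms(samples)
    if level == 0.0:
        return list(samples)  # pure silence: no gain can be derived
    gain = target_rms / level
    return [s * gain for s in samples]

# Reference and degraded signals recorded at very different levels.
reference = [0.5, -0.5, 0.25, -0.25]
degraded = [0.05, -0.04, 0.03, -0.02]

ref_aligned = align_level(reference)
deg_aligned = align_level(degraded)
```

After this step both signals have the same RMS level, so later comparisons are not dominated by an unknown system gain.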
Input Filtering
PESQ models the receive path of the telephone handset using an input filter, which takes account of the effect of the
electrical and acoustic components of the handset. The filter used is similar to the standard "modified IRS receive
characteristic" [ITU-T P.830]. Analog connections in the network often introduce some degree of filtering of their own. It is
generally accepted that such filtering has less effect on quality than coding distortions do, and PESQ compensates for any
filtering that has taken place in the network.
Time Alignment
The system under test may include a delay, which may be variable. In order to compare the reference and degraded signals,
they need to be lined up with each other. PESQ applies voice activity detection to the signals to identify those parts of the signal
that are speech, ignoring noise. The PESQ time offset measurements do not take account of the delay of the test equipment
generating or recording the signal. This means that a time offset reported by PESQ on a file collected will be dependent upon
the way in which the test process is executed.
- First, PESQ aligns the overall speech signals (utterances). An utterance is a continuous speech burst identified by the voice
activity detector that does not contain pauses longer than a pre-determined threshold (200ms). This process detects delay over
major sections of the degraded signal compared to the reference signal.
- Second, PESQ aligns overlapping sections of the speech (frames). This process detects delay that is variable over the length
of an utterance, as this can be significant in packet-based networks.
- The third stage does not occur immediately after the second stage, but is performed after the auditory transform has been
calculated. The third stage realigns "bad intervals" (sections of the speech with very large disturbance), and improves the model's
accuracy with a small number of files where delay changes are not correctly identified by the initial time alignment process.
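The first alignment stage amounts to finding the delay at which the degraded signal best matches the reference. The sketch below uses a plain cross-correlation search to show the principle. It is not the PESQ alignment procedure, which works per utterance and per frame; the function name and the `max_lag` parameter are assumptions made for the example.

```python
def estimate_delay(reference, degraded, max_lag):
    """Return the lag (in samples) at which degraded best matches reference.

    A positive lag means the degraded signal is delayed relative to
    the reference.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, r in enumerate(reference):
            j = i + lag
            if 0 <= j < len(degraded):
                score += r * degraded[j]  # correlation at this lag
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

# The same short burst, arriving three samples later in the degraded signal.
reference = [0.0, 0.0, 1.0, 0.5, -0.5, 0.0, 0.0, 0.0, 0.0, 0.0]
degraded = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.5, -0.5, 0.0, 0.0]

delay = estimate_delay(reference, degraded, max_lag=5)
```

Running the same search on shorter, overlapping sections would reveal delay that varies within an utterance, which is the purpose of the second stage described above.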
Auditory Transform
In order to compare the reference and degraded signals, taking account of how a listener would have heard them, each is
passed through an auditory transform that mimics certain key properties of human hearing. This gives a representation in time
and frequency of the perceived loudness of the signal, known as the sensation surface.
Equalization
Part of the auditory transformation equalizes certain processes that have little subjective effect. First, the transfer function
of the system is estimated, and is used to equalize the reference to the degraded in the auditory transform domain.
This takes account of filtering in analogue components of the network such as telephone handsets. Second, the frame-by-frame
amplitude gain of the system is estimated and used to equalize the auditory transform of the degraded file to the reference.
In both cases the equalization is partial - large amounts of filtering or gain variation are not cancelled, and therefore result in
errors being measured.
Disturbance Processing
The difference between the sensation surfaces for the reference and degraded files is known as the error surface; this
shows any audible differences introduced by the system under test. The error surface is analyzed by a process that takes
account of masking, the effect by which small distortions in a signal are inaudible in the presence of loud signals.
From the positive and negative errors, two disturbance parameters are calculated. They are calculated as non-linear
averages over specific areas of the error surface. These disturbance parameters are:
- The absolute (symmetric) disturbance - a measure of absolute audible error
- The additive (asymmetric) disturbance - a measure of audible errors that are significantly louder than the reference
This analysis gives two error parameters that summarize the amount of each type of audible error. Finally, the error
parameters are converted to a quality score, which is a linear combination of the average symmetric disturbance value
and the average asymmetric disturbance value.
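The final linear combination can be sketched as below. The coefficients are those commonly quoted from ITU-T P.862 and should be checked against the Recommendation before use; the function name is an assumption made for the example, and the disturbance values are illustrative inputs, not measurements.

```python
def pesq_raw_score(d_sym, d_asym):
    """Combine average disturbances into a raw PESQ score.

    d_sym  -- average symmetric (absolute) disturbance
    d_asym -- average asymmetric (additive) disturbance
    Coefficients as commonly quoted from ITU-T P.862 (verify before use).
    """
    return 4.5 - 0.1 * d_sym - 0.0309 * d_asym

no_distortion = pesq_raw_score(0.0, 0.0)   # both disturbances zero
some_distortion = pesq_raw_score(5.0, 10.0)
```

With both disturbances at zero the score reaches its maximum of 4.5, and it falls as either disturbance grows.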
Results Provided by PESQ
PESQ (P.862)
PESQ returns a quality score, known as the PESQ score, which conforms to ITU-T P.862. The PESQ score lies on a scale from
-0.5 to 4.5, though in most cases it is between 1 and 4.5. The PESQ score correlates with subjective quality scores. However,
it tends to be optimistic for poor quality speech and pessimistic for good quality speech. Alternative mappings
of the PESQ score have been developed that exhibit a better correlation to subjective test scores. These are referred to
as the PESQ-LQ and PESQ-LQO scores.
PESQ-LQ
PESQ-LQ scores are closer to the listening quality subjective opinion scale, which is standard in the industry and is
defined in ITU-T P.800. Listening quality scores lie between 1 and 5. PESQ-LQ scores lie between 1.0 and 4.5. This is
because 4.5 is usually the maximum obtained in a subjective test.
Listening Quality Scale:
| Score | Quality of the speech |
| 5 | Excellent |
| 4 | Good |
| 3 | Fair |
| 2 | Poor |
| 1 | Bad |
The score gives a measure of customers' perception of quality. The highest score, 4.5, means that no distortion is measured.
As the amount of distortion increases the quality falls.
PESQ-LQO (P.862.1)
A separate Recommendation, ITU-T P.862.1, provides a single mapping from the raw P.862 score to the Listening
Quality Objective Mean Opinion Score (MOS-LQO). This mapping improves on the original PESQ (P.862) score by correlating
better to subjective test results.
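The mapping from a raw P.862 score to the listening quality objective score is a logistic function. The sketch below uses the constants commonly quoted from ITU-T P.862.1, which should be checked against the Recommendation; the function name is an assumption made for the example.

```python
import math

def pesq_to_mos_lqo(raw_score):
    """Map a raw P.862 score (-0.5 to 4.5) to a P.862.1 LQO score.

    Constants as commonly quoted from ITU-T P.862.1 (verify before use).
    """
    return 0.999 + (4.999 - 0.999) / (1.0 + math.exp(-1.4945 * raw_score + 4.6607))

best = pesq_to_mos_lqo(4.5)    # upper end of the raw scale
worst = pesq_to_mos_lqo(-0.5)  # lower end of the raw scale
```

The curve pulls low raw scores further down and pushes high raw scores slightly up, which is consistent with the raw score being optimistic for poor quality speech and pessimistic for good quality speech.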
Typical PESQ Score Comparisons
Based on simulations and real measurements, the table below represents the results of a number of typical networks and
codecs with no errors or packet loss. In addition, it gives the scores that can be expected in some mobile network conditions
where errors are significant.
| Network Condition | PESQ | PESQ-LQ | PESQ-LQO |
| Clean ISDN network | 4.3 | 4.4 | 4.4 |
| Analog network (G.711) | 4.1 | 4.2 | 4.2 |
| G.728 codec (16kbit/s) | 3.8 | 3.9 | 3.9 |
| G.729 codec (8kbit/s) | 3.6 | 3.7 | 3.7 |
| G.723.1 codec (6.3kbit/s) | 3.5 | 3.4 | 3.5 |
| GSM EFR codec (12.2kbit/s) | 3.9 | 4.0 | 4.0 |
| GSM FR codec (13kbit/s) | 3.5 | 3.5 | 3.5 |
| GSM-EFR mobile network in typical operating range | 3.6 to 3.1 | 3.6 to 2.9 | 3.7 to 3.0 |
| GSM-EFR mobile network in very poor conditions | 2.2 | 1.6 | 1.8 |
Perceptual Analysis Measurement System (PAMS)
Traditionally the only way to measure customers' perception of the quality of modern communications was to conduct a
subjective test, but these tests are expensive and unsuitable for applications such as real-time monitoring. PAMS provides
an objective measure that predicts the results of subjective listening tests on a telephony system. To measure speech
quality, PAMS uses a sensory model to compare the original, unprocessed signal with the degraded version at the output
of the communications system. PAMS parameterizes different classes of errors and maps them to predictions of subjective
listening quality and listening effort. The mappings are calibrated using a large database of subjective tests. Other
diagnostics are also returned.
PAMS incorporates many new developments that distinguish it from earlier codec assessment models such as those
given in ITU-T P.861. These innovations allow PAMS to be used with confidence to assess end-to-end speech quality as
well as the effect of individual elements such as codecs.
Operations Performed by PAMS
The processing carried out by the PAMS algorithm includes the following stages.
Time Alignment
PAMS is a listening model and has no knowledge of the delay of the system. In order to compare the reference and
degraded signals, however, they need to be lined up with each other. This enables the analysis to cancel any bulk delay
and also most delay changes that might be caused by, for example, packet-based transmission.
Equalization
Analogue connections often introduce some degree of filtering. PAMS identifies any filtering that has taken place in
the network and cancels its effect.
Auditory transform
In order to compare the reference and degraded signals in a meaningful way, they are passed through an auditory
transform that mimics certain key properties of human hearing.
Error parameterization
This analysis gives a number of error parameters that summarize the amount of each type of audible error.
Regression
Finally the error parameters are mapped onto predictions of perceived listening quality and listening effort. These
mappings are calculated and verified using a very large database of subjective tests to ensure that PAMS is able to
predict quality for a wide range of distortion types.
Results provided by PAMS
PAMS returns quality scores on two different opinion scales, listening quality and listening effort. These scales are
standard and are defined in [ITU-T Rec. P.800]. Both Listening Quality and Listening Effort utilize a range between 1 and 5
and are usually quoted to two decimal places.
Listening Quality Scale:
| Score | Quality of the speech |
| 5 | Excellent |
| 4 | Good |
| 3 | Fair |
| 2 | Poor |
| 1 | Bad |
Listening Effort Scale:
| Score | Effort required to understand the meaning of sentences |
| 5 | Complete relaxation possible; no effort required |
| 4 | Attention necessary; no appreciable effort required |
| 3 | Moderate effort required |
| 2 | Considerable effort required |
| 1 | No meaning understood with any feasible effort |
The scores give a measure of customers' perception of quality. A PAMS score of 5 indicates that no distortion is measured. As the
amount of distortion increases, the quality falls. Because they relate to different aspects of subjectivity, the listening effort and
listening quality scores are normally different if there is perceived distortion; listening effort is usually higher than listening quality.
Perceptual Speech Quality Measure (PSQM)
Subjective quality assessment of speech codecs is one of the key technologies in designing digital telecommunication networks.
ITU-T Recommendation P.830 defines subjective testing methodologies for speech codecs. Since subjective quality assessment is
time-consuming and expensive, it is desirable to develop an objective quality assessment methodology that estimates the
subjective quality of speech codecs with less subjective testing.
The most widely used objective speech quality measure demonstrating the performance of speech codecs is the Signal-to-Noise
Ratio (SNR = S/N). However, the SNR does not adequately predict subjective quality for modern network
components. This is especially true for recent low bit-rate codecs. Therefore, a variety of more sophisticated objective quality
measures, such as the LPC Cepstrum Distance Measure, Information Index (II), Coherence Function (CHF), Expert Pattern
Recognition (EPR), and Perceptual Speech Quality Measure (PSQM) were developed. The performance of these systems, in terms
of ability to give accurate estimates of subjective quality, has been investigated in ITU-T since the 1980s.
After careful comparisons among these objective quality measures, it was concluded that the PSQM best correlated with the
subjective quality of coded speech.
Results Provided by PSQM
The VQT performs the PSQM measurement if the algorithm is licensed and the option is selected. The implementation of PSQM
is based upon ITU-T Rec. P.861. The algorithm and functionality are described in P.861 and not repeated here.
The mapping of the PSQM value to Mean Opinion Score (MOS) is described in P.861. A PSQM score of 0 corresponds to excellent
quality and 6.5 to very poor quality on the Listening Quality Scale defined in ITU-T Rec. P.800. For simplicity we suggest a
linear re-scaling, i.e. MOS Listening Quality = 5 - (4 * PSQM / 6.5). Other mappings may be more appropriate. Both scoring ranges are available within the
GL VQT application.
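The suggested linear re-scaling can be written as a small function. This is a sketch of the formula quoted above, not of the GL VQT implementation; the function name and the clamping of out-of-range inputs to the 0 to 6.5 range are assumptions made for the example.

```python
def psqm_to_mos(psqm):
    """Linear re-scaling of a PSQM value to the listening quality scale.

    PSQM 0 maps to MOS 5 (excellent) and PSQM 6.5 maps to MOS 1 (bad).
    Inputs outside 0..6.5 are clamped (an assumption for this sketch).
    """
    psqm = min(max(psqm, 0.0), 6.5)
    return 5.0 - (4.0 * psqm / 6.5)

excellent = psqm_to_mos(0.0)
midpoint = psqm_to_mos(3.25)
very_poor = psqm_to_mos(6.5)
```

Note that the scales run in opposite directions: a lower PSQM value means better quality, whereas a higher MOS means better quality.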
ITU-T P.56 Measurements
The VQT always performs the ITU-T P.56 algorithm (ITU-T Recommendation P.56, Method B) on the reference data and degraded
data and calculates mean active speech level, activity factor and peak value for each input.
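The relationship between the three reported quantities can be illustrated with a deliberately simplified sketch. It is not P.56 Method B, which derives activity from a smoothed envelope with a hangover time; here a sample simply counts as active when its magnitude exceeds a fixed threshold. The function name and the threshold value are assumptions made for the example.

```python
import math

def speech_stats(samples, threshold=0.01):
    """Return (active_level_db, activity_factor, peak) for float samples.

    Simplified illustration only: activity is decided per sample against
    a hypothetical fixed threshold, unlike P.56 Method B.
    """
    n = len(samples)
    active = sum(1 for s in samples if abs(s) > threshold)
    energy = sum(s * s for s in samples)
    peak = max(abs(s) for s in samples)
    activity_factor = active / n
    if active == 0 or energy == 0.0:
        return float("-inf"), activity_factor, peak
    # Total energy divided by active time gives the mean active power.
    active_level_db = 10.0 * math.log10(energy / active)
    return active_level_db, activity_factor, peak

# Half of this signal is silence, half is a burst at amplitude 0.5.
level, activity, peak = speech_stats([0.0, 0.0, 0.5, -0.5, 0.5, -0.5, 0.0, 0.0])
```

Because the energy is averaged over active time only, the active speech level does not drop when silence is added around the speech, which is why it is preferred over a plain long-term average.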