GL Communications Inc.
 
 
 

ITU Algorithms





Description

The GL Voice Quality Testing (VQT) product uses several industry-standard ITU algorithms to measure the speech quality of a transmitted voice file. VQT compares the original unprocessed signal with the degraded version using the POLQA (ITU-T P.863), PESQ (ITU-T P.862 and P.862.1), PAMS (scored on the ITU-T P.800 opinion scales) and PSQM/PSQM+ (ITU-T P.861) voice quality test methods. The GL VQT can either be installed and operated on a stand-alone system or reside as an optional feature on other GL products. GL's VQuad™ also supports the POLQA (ITU-T P.863) standard for voice quality analysis.

  • Perceptual Objective Listening Quality Analysis (POLQA)
    • Operations Performed by POLQA
    • Results Provided by POLQA

  • Perceptual Evaluation of Speech Quality (PESQ)
    • Operations Performed by PESQ
    • Results Provided by PESQ

  • Perceptual Analysis Measurement System (PAMS)
    • Operations Performed by PAMS
    • Results provided by PAMS

  • Perceptual Speech Quality Measure (PSQM)
    • Results Provided by PSQM

  • ITU-T P.56 Measurements

POLQA versus PESQ Comparison

POLQA

  • Adopted in 2011 as ITU-T P.863.
  • Suitable for 3G and 4G networks, VoIP networks and NGN networks delivering HD-quality voice services such as "wideband" and "super-wideband" telephone calls (7 kHz and 14 kHz frequency range).
  • POLQA works quickly and accurately. It is superior to existing standards and has overcome all known issues and limitations of PESQ.

PESQ

  • Adopted in 2001 as ITU-T P.862.
  • Suitable for G.711 A-law and u-law, and for low-bandwidth voice (300 to 3400 Hz). Also supports WB (7 kHz frequency range) using PESQ ITU-T P.862.2.
  • PESQ-based measurements will still be considered an industry standard for several years, also for reasons of backward compatibility.


Perceptual Evaluation of Speech Quality (PESQ)

Modern communications networks include elements (bad coding, error-prone channels and voice activity detection) that cannot reliably be assessed by such conventional engineering metrics as signal-to-noise ratio. One way to measure customers' perception of the quality of these systems is to conduct a subjective test involving panels of human subjects. However, these tests are expensive and unsuitable for such applications as real-time monitoring.

PESQ provides an objective measure that predicts the results of subjective listening tests on telephony systems. To measure speech quality, PESQ uses a sensory model to compare the original, unprocessed signal with the degraded version at the output of the communications system. The result of comparing the reference and degraded signals is a quality score. This score is analogous to the subjective Mean Opinion Score (MOS) measured using panel tests according to ITU-T P.800.

PESQ incorporates many new developments that distinguish it from earlier models for assessing codecs. These innovations allow PESQ to be used with confidence to assess end-to-end speech quality as well as the effect of such individual elements as codecs.

In addition to the standard PESQ score, the GL VQT also provides the PESQ-LQ and PESQ-LQO (P.862.1) scores. These revised scores exhibit better correlation with subjective listening quality test scores.


Operations Performed by PESQ

The processing carried out by the PESQ algorithm includes the following stages.

Level Alignment

In order to compare the signals, the reference speech signal and the degraded signal should be at the same, constant power level. This is necessary because the reference signal does not have to be at a defined level and because the gain of the system under test is unknown before testing. PESQ assumes that the subjective listening level is a constant 79dB SPL at the ear reference point [ITU-T P.830, section 8.1.2]. A gain is applied to both the reference and degraded signals to bring them to this level.
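
The idea of bringing both signals to a common level can be illustrated with a minimal sketch. It assumes floating-point samples and an arbitrary target RMS value (`target_rms` is a hypothetical parameter, not the calibrated level that corresponds to 79 dB SPL), and it measures power over the whole signal rather than over a band-limited version as a real implementation would.

```python
import math

def rms(samples):
    """Root-mean-square level of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def align_level(samples, target_rms=0.1):
    """Scale samples so that their RMS equals target_rms (hypothetical target)."""
    current = rms(samples)
    if current == 0.0:
        return list(samples)  # pure silence: there is nothing to scale
    gain = target_rms / current
    return [s * gain for s in samples]

# One second of a 440 Hz tone at 8 kHz, and a copy passed through an unknown gain.
reference = [0.5 * math.sin(2 * math.pi * 440 * n / 8000) for n in range(8000)]
degraded = [0.25 * s for s in reference]

ref_aligned = align_level(reference)
deg_aligned = align_level(degraded)
print(rms(ref_aligned), rms(deg_aligned))
```

After alignment both signals have the same level, so the unknown gain of the system under test no longer affects the comparison.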

Input Filtering

Analog connections often introduce some degree of filtering. For example, PESQ models the receive path of the telephone handset using an input filter. This takes account of the effect of the electrical and acoustic components of the handset. The filter used is similar to the standard "modified IRS receive characteristic" [ITU-T P.830]. It is generally accepted that this has less effect on quality than coding distortions do. PESQ compensates for any filtering that has taken place in the network.

Time Alignment

The system under test may include a delay, which may be variable. In order to compare the reference and degraded signals, they need to be lined up with each other. PESQ applies voice activity detection to the signals to identify those parts of the signal that are speech, ignoring noise. The PESQ time offset measurements do not take account of the delay of the test equipment generating or recording the signal. This means that a time offset reported by PESQ on a file collected will be dependent upon the way in which the test process is executed.

  • First, PESQ aligns the overall speech signals (utterances). An utterance is a continuous speech burst identified by the voice activity detector that does not contain pauses longer than a pre-determined threshold (200ms). This process detects delay over major sections of the degraded signal compared to the reference signal.
  • Second, PESQ aligns overlapping sections of the speech (frames). This process detects delay that is variable over the length of an utterance, as this can be significant in packet-based networks.
  • The third stage does not occur immediately after the second stage, but is performed after the auditory transform has been calculated. The third stage realigns "bad intervals" (sections of the speech with very large disturbance), and improves the model's accuracy with a small number of files where delay changes are not correctly identified by the initial time alignment process.
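
The first alignment stage can be pictured as a search for the lag that maximizes the cross-correlation between the two signals. The sketch below is a simplified illustration only: it correlates raw samples over a fixed lag range, whereas PESQ works on speech envelopes and per-utterance sections. The function name and the `max_lag` parameter are assumptions made for this example.

```python
def estimate_delay(reference, degraded, max_lag):
    """Return the lag (in samples) at which degraded best matches reference.

    A positive lag means the degraded signal is delayed relative to the reference.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, r in enumerate(reference):
            j = i + lag
            if 0 <= j < len(degraded):
                score += r * degraded[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

# A short burst surrounded by silence, and a copy delayed by 7 samples.
reference = [0.0] * 20 + [1.0, -0.5, 0.8, -0.3, 0.6] + [0.0] * 20
degraded = [0.0] * 7 + reference[:-7]

print(estimate_delay(reference, degraded, max_lag=15))  # prints 7
```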

Auditory Transform

In order to compare the reference and degraded signals, taking account of how a listener would have heard them, each is passed through an auditory transform that mimics certain key properties of human hearing. This gives a representation in time and frequency of the perceived loudness of the signal, known as the sensation surface.

Equalization

Part of the auditory transformation equalizes certain processes that have little subjective effect. First, the transfer function of the system is estimated and used to equalize the reference to the degraded signal in the auditory transform domain. This takes account of filtering in analogue components of the network such as telephone handsets. Second, the frame-by-frame amplitude gain of the system is estimated and used to equalize the auditory transform of the degraded file to the reference. In both cases the equalization is partial: large amounts of filtering or gain variation are not cancelled, and therefore result in errors being measured.

Disturbance Processing

The difference between the sensation surfaces for the reference and degraded files is known as the error surface; this shows any audible differences introduced by the system under test. The error surface is analyzed by a process that takes account of the effect that small distortions in a signal are inaudible in the presence of loud signals (masking).

From the positive and negative errors, two disturbance parameters are calculated. They are calculated as non-linear averages over specific areas of the error surface. These disturbance parameters are:

  • The absolute (symmetric) disturbance - a measure of absolute audible error
  • The additive (asymmetric) disturbance - a measure of audible errors that are significantly louder than the reference

This analysis gives two error parameters that summarize the amount of each type of audible error. Finally, the error parameters are converted to a quality score, which is a linear combination of the average symmetric disturbance value and the average asymmetric disturbance value.
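
As an illustration of this last step, the sketch below applies the linear combination using the coefficients published in ITU-T P.862 (0.1 for the symmetric and 0.0309 for the asymmetric disturbance). The function name and the example disturbance values are made up for illustration; the coefficients should be checked against the Recommendation before relying on them.

```python
def pesq_raw_score(d_sym, d_asym):
    """Combine the average symmetric and asymmetric disturbances linearly.

    Coefficients as published in ITU-T P.862; a score of 4.5 means that
    no disturbance was measured.
    """
    return 4.5 - 0.1 * d_sym - 0.0309 * d_asym

# Hypothetical disturbance values, chosen only to show the arithmetic.
print(pesq_raw_score(0.0, 0.0))
print(pesq_raw_score(5.0, 10.0))
```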


Results Provided by PESQ

PESQ (P.862)

PESQ returns a quality score, known as the PESQ score, which conforms to ITU-T P.862. The PESQ score lies on a scale from -0.5 to 4.5, though in most cases it is between 1 and 4.5. The PESQ score correlates with subjective quality scores. However, it tends to be optimistic for poor quality speech and pessimistic for good quality speech. Alternative mappings of the PESQ score have been developed that exhibit a better correlation with subjective test scores. These are referred to as the PESQ-LQ and PESQ-LQO scores.

PESQ-LQ

PESQ-LQ scores are closer to the listening quality subjective opinion scale, which is standard in the industry and is defined in ITU-T P.800. Listening quality scores lie between 1 and 5. PESQ-LQ scores lie between 1.0 and 4.5, because 4.5 is usually the maximum obtained in a subjective test.

Listening Quality Scale:
Score Quality of the speech
5 Excellent
4 Good
3 Fair
2 Poor
1 Bad

The score gives a measure of customers' perception of quality. The highest score, 4.5, means that no distortion is measured. As the amount of distortion increases the quality falls.

PESQ-LQO (P.862.1)

The aim of a separate recommendation ITU-T P.862.1 is to provide a single mapping from raw P.862 score to the Listening Quality Objective Mean Opinion Score (LQO-MOS). This latest ITU standard improves on the original PESQ (P.862) by correlating better to subjective test results.
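
For reference, the mapping can be sketched as the logistic function published in ITU-T P.862.1. The function name is an assumption for this example, and the constants should be verified against the Recommendation. Under this mapping, raw scores of 4.3, 3.6 and 2.2 round to 4.4, 3.7 and 1.8, in line with the PESQ-LQO column of the comparison table in this section.

```python
import math

def pesq_to_mos_lqo(raw):
    """Map a raw P.862 score to MOS-LQO using the P.862.1 logistic function."""
    return 0.999 + (4.999 - 0.999) / (1.0 + math.exp(-1.4945 * raw + 4.6607))

for raw in (4.3, 3.6, 2.2):
    print(raw, round(pesq_to_mos_lqo(raw), 1))
```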

Typical PESQ Score Comparisons

Based on simulations and real measurements, the table below represents the results of a number of typical networks and codecs with no errors or packet loss. In addition, it gives the scores that can be expected in some mobile network conditions where errors are significant.

Network Condition                                   PESQ        PESQ-LQ     PESQ-LQO
Clean ISDN network                                  4.3         4.4         4.4
Analog network (G.711)                              4.1         4.2         4.2
G.728 codec (16 kbit/s)                             3.8         3.9         3.9
G.729 codec (8 kbit/s)                              3.6         3.7         3.7
G.723.1 codec (6.3 kbit/s)                          3.5         3.4         3.5
GSM EFR codec (12.2 kbit/s)                         3.9         4.0         4.0
GSM FR codec (13 kbit/s)                            3.5         3.5         3.5
GSM-EFR mobile network in typical operating range   3.6 to 3.1  3.6 to 2.9  3.7 to 3.0
GSM-EFR mobile network in very poor conditions      2.2         1.6         1.8


For more details, please visit PESQ Measurement webpage


Perceptual Analysis Measurement System (PAMS)

Traditionally, the only way to measure customers' perception of the quality of modern communications was to conduct a subjective test, but these tests are expensive and unsuitable for applications such as real-time monitoring. PAMS provides an objective measure that predicts the results of subjective listening tests on a telephony system. To measure speech quality, PAMS uses a sensory model to compare the original, unprocessed signal with the degraded version at the output of the communications system. PAMS parameterizes different classes of errors and maps them to predictions of subjective listening quality and listening effort. The mappings are calibrated using a large database of subjective tests. Other diagnostics are also returned.

PAMS incorporates many new developments that distinguish it from earlier codec assessment models such as those given in ITU-T P.861. These innovations allow PAMS to be used with confidence to assess end-to-end speech quality as well as the effect of individual elements such as codecs.


Operations Performed by PAMS

The processing carried out by the PAMS algorithm includes the following stages.

Time Alignment

PAMS is a listening model and has no knowledge of the delay of the system. In order to compare the reference and degraded signals, however, they need to be lined up with each other. This enables the analysis to cancel any bulk delay and also most delay changes that might be caused by, for example, packet-based transmission.

Equalization

Analogue connections often introduce some degree of filtering. PAMS identifies any filtering that has taken place in the network and cancels its effect.

Auditory transform

In order to compare the reference and degraded signals in a meaningful way, they are passed through an auditory transform that mimics certain key properties of human hearing.

Error parameterization

This analysis gives a number of error parameters that summarize the amount of each type of audible error.

Regression

Finally the error parameters are mapped onto predictions of perceived listening quality and listening effort. These mappings are calculated and verified using a very large database of subjective tests to ensure that PAMS is able to predict quality for a wide range of distortion types.


Results provided by PAMS

PAMS returns quality scores on two different opinion scales, listening quality and listening effort. These scales are standard and are defined in [ITU rec. P.800]. Both Listening Quality and Listening Effort utilize a range between 1 and 5 and are usually quoted to two decimal places.

Listening Quality Scale:
Score Quality of the speech
5 Excellent
4 Good
3 Fair
2 Poor
1 Bad

Listening Effort Scale:
Score Effort required to understand the meaning of sentences
5 Complete relaxation possible; no effort required
4 Attention necessary; no appreciable effort required
3 Moderate effort required
2 Considerable effort required
1 No meaning understood with any feasible effort

The scores give a measure of customers' perception of quality. A PAMS score of 5 indicates that no distortion is measured. As the amount of distortion increases, the quality falls. Because they relate to different aspects of subjectivity, the listening effort and listening quality scores are normally different if there is perceived distortion; listening effort is usually higher than listening quality.


Perceptual Speech Quality Measure (PSQM)

Subjective quality assessment of speech codecs is one of the key technologies in designing digital telecommunication networks. ITU Recommendation P.830 defines subjective testing methodologies for speech codecs. Since subjective quality assessment is time-consuming and expensive, it is desirable to develop an objective quality assessment methodology that estimates the subjective quality of speech codecs with less subjective testing.

The most widely used objective speech quality measure demonstrating the performance of speech codecs is the Signal-to-Noise Ratio (SNR = S/N). However, it is pointed out that the SNR does not adequately predict subjective quality for modern network components. This is especially true for recent low bit-rate codecs. Therefore, a variety of more sophisticated objective quality measures, such as the LPC Cepstrum Distance Measure, Information Index (II), Coherence Function (CHF), Expert Pattern Recognition (EPR), and Perceptual Speech Quality Measure (PSQM) were developed. The performance of these systems, in terms of ability to give accurate estimates of subjective quality, has been investigated in ITU-T since the 1980s.

After careful comparisons among these objective quality measures, it was concluded that the PSQM best correlated with the subjective quality of coded speech.


Results Provided by PSQM

The VQT performs the PSQM measurement if the algorithm is licensed and the option is selected. The implementation of PSQM is based upon ITU-T Rec. P.861. The algorithm and functionality are described in P.861 and are not repeated here.

The mapping of the PSQM value to Mean Opinion Score (MOS) is described in P.861. PSQM score 0 is equivalent to excellent and 6.5 is very poor on the Listening Quality Scale defined in ITU-T Rec. P.800. For simplicity we suggest a linear re-scaling i.e. MOS Listening Quality = 5 - (4 * PSQM/6.5). Other mappings may be more appropriate. Both scoring ranges are available within the GL VQT application.
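
The suggested linear re-scaling can be written out as a short sketch. The function name is hypothetical, and clamping the input to the 0 to 6.5 range is an assumption added here so that the result always stays on the 1 to 5 scale.

```python
def psqm_to_mos(psqm, psqm_max=6.5):
    """Linear re-scaling: MOS Listening Quality = 5 - (4 * PSQM / 6.5).

    Clamping the PSQM value to [0, psqm_max] is an assumption of this sketch.
    """
    clamped = min(max(psqm, 0.0), psqm_max)
    return 5.0 - 4.0 * clamped / psqm_max

print(psqm_to_mos(0.0))   # best PSQM value maps to the top of the scale
print(psqm_to_mos(3.25))  # midpoint of the PSQM range
print(psqm_to_mos(6.5))   # worst PSQM value maps to the bottom of the scale
```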


ITU-T P.56 Measurements

The VQT always performs the ITU P.56 algorithm (ITU recommendation P.56, Method B) on the reference data and degraded data and calculates mean active speech level, activity factor and peak value for each input.


Perceptual Objective Listening Quality Analysis (POLQA)

POLQA, the successor of PESQ (ITU-T P.862) analysis, is the next generation voice quality testing standard for fixed, mobile and IP-based networks. Based on ITU standard ITU-T P.863, POLQA supports the latest HD-quality speech coding and network transport technology, with higher accuracy for 3G, 4G/LTE and VoIP networks.

The POLQA algorithm handles higher-bandwidth audio signals. It supports measurements in the narrowband (NB, 300-3400 Hz) and adds significant new capabilities for wideband (WB, 100-7000 Hz) and super-wideband (SWB, 50-14000 Hz) signals, which are commonly found in VoIP and next generation mobile networks.

Further improvements of this algorithm include the handling of signals with many delay variations and support for assessment of speech signals recorded acoustically by HATS (head and torso simulator).

Similar to PESQ, POLQA is a Full Reference (FR) algorithm that rates a degraded or processed speech signal in relation to the original (reference) signal. POLQA analyzes the degraded speech signal sample-by-sample after a temporal alignment of the reference test signal. Perceptual differences between both signals are scored as differences. The perceptual psycho-acoustic model is based on similar models of human perception. Basically, the signals are analyzed in the frequency domain (in critical bands) after applying masking functions. Unmasked differences between the two signal representations will be counted as distortions. Finally, the accumulated distortions in the speech file are mapped into a 1 to 5 quality scale in accordance with MOS (Mean Opinion Score) tests. FR measurements deliver the highest accuracy and repeatability but can only be applied for dedicated tests in live networks.

Using the send/record voice functionality, the VQT software (POLQA) can analyze the recorded speech signals. The POLQA analysis results include POLQA MOS, E-Model, Signal Level, SNR, and Jitter.


Operations performed by POLQA

Temporal alignment

The basic concepts of the temporal alignment are:

  • To split the signals into equidistant pairs of frames and to calculate a delay for each frame pair.
  • Whenever possible, the matching counterparts of the degraded signal sections are searched for in the reference signal and not vice versa.
  • Stepwise refinement of the delay per frame to avoid long search ranges (long search ranges require high computing power and are critical in combination with time scaled input signals).
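
The per-frame delay idea can be sketched as follows. This is a simplified illustration, not the P.863 procedure: each degraded frame is searched for in the reference over a small fixed range, and the best match is taken as the one with the smallest squared difference, a stand-in for the correlation-based, stepwise-refined search described above. All names and parameters are assumptions made for the example.

```python
import random

def frame_delays(reference, degraded, frame_len, search):
    """For each degraded frame, find the delay of its best match in the reference."""
    delays = []
    for start in range(0, len(degraded) - frame_len + 1, frame_len):
        frame = degraded[start:start + frame_len]
        best_delay, best_error = 0, float("inf")
        for delay in range(-search, search + 1):
            ref_start = start - delay
            if ref_start < 0 or ref_start + frame_len > len(reference):
                continue  # candidate window falls outside the reference
            window = reference[ref_start:ref_start + frame_len]
            error = sum((f - r) ** 2 for f, r in zip(frame, window))
            if error < best_error:
                best_delay, best_error = delay, error
        delays.append(best_delay)
    return delays

rng = random.Random(1)
reference = [rng.uniform(-1.0, 1.0) for _ in range(256)]

def delayed_sample(n):
    """Delay of 3 samples in the first half, 5 samples in the second half."""
    d = 3 if n < 128 else 5
    return reference[n - d] if n - d >= 0 else 0.0

degraded = [delayed_sample(n) for n in range(256)]
delays = frame_delays(reference, degraded, frame_len=32, search=8)
print(delays)
```

The first frame contains padded samples with no exact counterpart in the reference, so only the later frames are expected to report the delays of 3 and 5 samples.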

Sample rate estimation

Sample rate ratio detection is required to compensate for perceptually irrelevant differences in the play-out speed of the reference and the degraded signal.

The detection of this effect as implemented in the ITU-T P.863 algorithm is based on the delay per frame vector and the detected active sections of the speech signals, as determined by the temporal alignment.

Resample

If the detected sample rate ratio is larger than 0.01, the signal with the higher sample rate is downsampled and the entire processing starts again from the beginning. This happens at most once, to avoid excessive looping for signals whose sample rate ratio cannot be determined reliably.

Even if the sample rate determination cannot be made with perfect accuracy, e.g., in case of signals with additional variable delay, the detected sample rate ratio is still accurate enough to bring the signals back to the safe operating range of the temporal alignment.


Level alignment

The ITU-T P.863 algorithm is designed to take into account the impact of the play back level for the perceived quality prediction in super wideband mode; the playback level is calculated relative to a nominal level of –26 dBov, which represents 73 dB(A) SPL in dichotic presentation.

In narrowband operational mode the ITU-T P.863 algorithm is designed to determine the listening speech quality at a constant listening level of 79 dB(A) SPL.


Frequency response and time alignment

The ITU-T P.863 algorithm can operate in two modes: narrowband mode and super wideband mode. In the narrowband mode, both the reference and degraded signals are pre-filtered with an IRS receive filter, representing a listening situation in which subjects judge the quality of the speech signals over an IRS receive handset or headset in monotic mode. In the super wideband mode, neither the reference nor the degraded signal is filtered, representing a listening situation in which subjects judge the quality of the speech signals over a diffuse field equalized headset in dichotic mode.


Results Provided by POLQA

Perceptual Results

MOS-LQO

The most important result of POLQA is the MOS-LQO. It directly expresses the voice quality on the MOS scale. It is important to understand and consider the two different operational modes supported by the ITU-T P.863 algorithm:

  • super wideband mode for listening over super wideband headphones;
  • narrow-band mode for listening over loosely coupled IRS type handsets.

In the super wideband mode the impact of play back level is modeled and the default calibration factor (C) of 2.8 has to be used in combination with the standard –26 dBov scaling for play back levels of 73 dB(A) SPL (dichotic). Play back levels down to 53 dB(A) SPL and up to 78 dB(A) SPL may be used and MOS-LQO scores should be reported in the format MOS-LQOsw (dB level). In narrowband mode only the play back level of 79dB(A) SPL (monotic) is supported. Narrowband mode MOS scores are referred to as MOS-LQO.

The maximum ITU-T P.863 MOS-LQO score is 4.5 in narrow-band while in super wideband mode this point lies at 4.75. Under some circumstances, when the reference signal contains noise or when the voice timbre is distorted, a transparent chain will not provide the maximum MOS score of 4.5 in narrowband mode or 4.75 in super wideband mode.

The table below compares PESQ and POLQA MOS score ranges:

Mode  P.862.1/2 MOSmin  P.862.1/2 MOSmax  POLQA MOSmin  POLQA MOSmax
NB    1                 4.5               1             4.5
WB    1                 4.5               -             -
SWB   -                 -                 1             4.75

G.107 R-Factor / Ie Value

POLQA also provides a mapping of the MOS-LQO score to the scale used by G.107 (e-model). The resulting parameter is equivalent to an Ie – Value. Many people also refer to it as an R-factor. The scale ranges from 0 (bad) up to 100 (best). All values below 60 indicate unacceptable quality.
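
The exact mapping used by POLQA is not described here. As a rough illustration of how a MOS value relates to the G.107 scale, the sketch below uses the R-to-MOS conversion published in ITU-T G.107 Annex B and inverts it numerically by bisection. The function names, the search interval of 6.5 to 100, and the use of this formula as a stand-in are all assumptions; note also that this formula tops out at a MOS of 4.5.

```python
def r_to_mos(r):
    """Estimated MOS for a transmission rating factor R (ITU-T G.107 Annex B)."""
    if r <= 0.0:
        return 1.0
    if r >= 100.0:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6

def mos_to_r(mos, tol=1e-6):
    """Invert r_to_mos by bisection on the interval where it increases."""
    lo, hi = 6.5, 100.0
    mos = min(max(mos, r_to_mos(lo)), r_to_mos(hi))  # keep the target in range
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if r_to_mos(mid) < mos:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

print(r_to_mos(60.0))
print(mos_to_r(3.1))
```

In this conversion an R value of 60, the acceptability threshold mentioned above, corresponds to a MOS of about 3.1.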

Non-Perceptual Results

Attenuation

Analog equipment in particular modifies the level of the speech signal. A high attenuation generally leads to a worse perception of voice. In contrast to PESQ, POLQA does weigh this as a degradation of the signal. Knowing the value of the attenuation is also important for optimizing the overall system design. Attention should be paid to signals which show either a negative attenuation or an attenuation larger than approximately 10 dB. In the first case, the signal was amplified instead of attenuated, which may eventually lead to level clipping during transmission. In the second case, quantization noise may become an important source of degradation if low-level analog signals are converted to the digital domain and subsequently amplified there. Depending on the test setup, both cases may be acceptable and intended, but this has to be decided on a case-by-case basis.

In order to calculate the attenuation, POLQA computes P.56-like active speech levels of the reference and the degraded signal in dB. The difference between the level of the reference signal and the level of the degraded signal is then used as the attenuation, so that a negative value indicates that the signal was amplified.
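
A minimal sketch of a level-difference calculation is shown below. It is an illustration under stated assumptions: the level is the mean power of the whole signal rather than a P.56 active speech level, and the sign is chosen so that a positive result means the degraded signal is quieter than the reference, which matches the discussion of negative attenuation above. The function names are hypothetical.

```python
import math

def level_db(samples):
    """Mean power of the signal in dB (relative to full scale 1.0)."""
    power = sum(s * s for s in samples) / len(samples)
    if power == 0.0:
        return float("-inf")  # digital silence
    return 10.0 * math.log10(power)

def attenuation_db(reference, degraded):
    """Positive when the degraded signal is quieter than the reference."""
    return level_db(reference) - level_db(degraded)

reference = [0.5 * math.sin(2 * math.pi * 440 * n / 8000) for n in range(8000)]
quieter = [0.5 * s for s in reference]  # halved amplitude
louder = [2.0 * s for s in reference]   # doubled amplitude

print(attenuation_db(reference, quieter))
print(attenuation_db(reference, louder))
```

Halving the amplitude gives an attenuation of about 6 dB, while doubling it gives a negative value of the same size.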

Level and Background Noise Measurements

In transmission systems it is frequently important to know the exact levels of the signals. For VoIP systems and voice activity detection (VAD) in particular, it also becomes important to know the signal level during the silent intervals as well as during active speech. It is important that the received background noise does not exceed a certain limit. Levels can be measured in dB if you want to relate the level directly to a sound pressure or electrical level, or as loudness levels.

Signal to Noise Ratio (SNR)

POLQA calculates the SNR for the reference and the degraded signal independently. The noise as well as the signal level is calculated by the VAD which POLQA uses for the temporal alignment.

Active Speech Ratio (ASR)

ASR is calculated by POLQA based on the information calculated by the Voice Activity Detection (VAD) which is part of the temporal alignment. The ASR defines the ratio between active speech and the overall signal length.
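
The ratio itself is simple to compute once each frame has been classified as active or silent. The sketch below uses a fixed frame-power threshold as a stand-in for the VAD; the frame length, the threshold value and the function name are assumptions for illustration and do not reflect the VAD used inside POLQA.

```python
import math

def active_speech_ratio(samples, frame_len=160, threshold=0.01):
    """Fraction of frames whose mean power exceeds threshold (hypothetical VAD)."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    if not frames:
        return 0.0
    active = sum(1 for f in frames if sum(s * s for s in f) / len(f) > threshold)
    return active / len(frames)

# 0.6 s of tone followed by 0.4 s of silence at 8 kHz (20 ms frames).
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 8000) for n in range(4800)]
signal = tone + [0.0] * 3200

print(active_speech_ratio(signal))
```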

For more details and screenshots, please visit POLQA using VQuad™ - VQT.


Buyer's Guide

Item No. Item Description
  VQT
VQT002 Voice Quality Testing (PESQ only)
VQT005 Voice Quality Testing (POLQA) for VQuad
VQT006 VQT w/ POLQA Server License
  VQuad Network Options
VQT010 VQuad Software (Stand Alone)
VQT013 VQuad with SIP (VoIP) Call Control
VQT015 VQuad with T1 E1 Call Control
  VQuad Miscellaneous
VQT030 Network Command Center (Multi-Node Command and Control Center for VQuad Systems)
VQT251 Dual UTA HD – Next generation Dual UTA with FXO Wideband support
VQT252 Dual UTA HD – Bluetooth Option
VBA032 Near Real-time Voice-band Analyzer

* Specifications are subject to change without notice.
