Acoustic Unit Discovery / Speech Representation Learning

The goal of acoustic unit discovery is to learn representations (embeddings) of speech sounds that retain linguistically relevant information and discard linguistically irrelevant acoustic information, such as speaker voice characteristics or recording conditions (additive noise, reverberation, etc.).

In text-based systems, such representations are phonemes (as defined by a pronunciation dictionary) or characters. Here, the representations are latent and may take any form (dense vectors for each frame, probabilistic codes, discrete codes, etc.) as long as they can be aligned with the original signal (for instance, one vector of values every 10 ms).
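As a concrete illustration, a frame-level representation might look like the following minimal sketch. All names, shapes, and the 10 ms frame-rate convention are illustrative assumptions, not a prescribed format:

```python
import numpy as np

# A hypothetical frame-level representation: one D-dimensional vector per
# 10 ms of signal. Values here are random placeholders.
frame_rate_hz = 100           # 1 frame every 10 ms
duration_s = 2.0              # a 2-second utterance
dim = 50                      # embedding dimension

n_frames = int(duration_s * frame_rate_hz)
features = np.random.randn(n_frames, dim)        # shape (200, 50)

# Row i is aligned with the signal at time i / frame_rate_hz seconds,
# so every frame can be mapped back to a position in the waveform.
frame_times = np.arange(n_frames) / frame_rate_hz
```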

To evaluate these representations, we take the view that, while they may not correspond one-to-one to linguistically interpretable units (phonemes, phonetic features, syllables, etc.), and may not even be discrete, they should at least support the same key function: phonemic contrast. Phonemes are the smallest units of speech whose substitution can change the meaning of a word (e.g., /bit/ versus /but/). We therefore require representations to distinguish pairs of phonemes while ignoring non-linguistic variation. Discriminability is computed by running an ABX discrimination test (Schatz, 2016).
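To make the test concrete, here is a minimal sketch of a single ABX comparison: given two tokens A and B from different phoneme categories and a test token X from the same category as A, the test is passed if X is closer to A than to B. This is not the official evaluation code; the cosine frame distance, the DTW path normalization, and all names are illustrative assumptions:

```python
import numpy as np

def cosine_distance_matrix(a, b):
    """Pairwise cosine distances between frames of a (Ta, D) and b (Tb, D)."""
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a_n @ b_n.T

def dtw_distance(a, b):
    """Average frame distance along the best DTW alignment path."""
    d = cosine_distance_matrix(a, b)
    Ta, Tb = d.shape
    acc = np.full((Ta, Tb), np.inf)
    for i in range(Ta):
        for j in range(Tb):
            if i == 0 and j == 0:
                acc[0, 0] = d[0, 0]
                continue
            best = min(
                acc[i - 1, j] if i > 0 else np.inf,
                acc[i, j - 1] if j > 0 else np.inf,
                acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
            )
            acc[i, j] = d[i, j] + best
    return acc[-1, -1] / (Ta + Tb)  # rough path-length normalization

def abx_score(A, B, X):
    """1.0 if X is closer to A than to B, 0.0 if closer to B, 0.5 on ties."""
    d_ax, d_bx = dtw_distance(A, X), dtw_distance(B, X)
    if d_ax < d_bx:
        return 1.0
    if d_ax > d_bx:
        return 0.0
    return 0.5

# Toy usage: A and X are tokens of the same (synthetic) category, B of another.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 8))
X = A + 0.1 * rng.normal(size=(20, 8))
B = rng.normal(size=(25, 8))
print(abx_score(A, B, X))  # expected: 1.0
```

In an actual evaluation, this score is averaged over many (A, B, X) triples sampled so that non-linguistic factors (e.g., speaker) are controlled, yielding an ABX discriminability (or error) rate.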

For more information about the evaluation, see Metrics explained.

Cited

Schatz, T. (2016). ABX-discriminability measures and applications (PhD thesis). Paris 6.