Spoken Language Modeling

Task & Goal

Spoken language modeling is the task of learning a language model directly from audio. In Task 4, we do not presuppose the input to the language model: it could be discrete or continuous representations from Task 1, or word-level representations from Task 2, so long as these representations are learned without supervision from text or other labels.

The task can be understood as modeling the probability distribution of spoken utterances in an unknown language. Here, the evaluation problem is severe. Language models trained on text are typically evaluated by their perplexity over a test corpus, or by fine-tuning on downstream tasks. As discussed above, the ZRC series focuses evaluation on zero-shot tasks that require no training, which rules out a fine-tuning evaluation. As for perplexity, in text-based systems it is derived from the conditional probability distribution of the next token given the preceding sequence of tokens.
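For reference, the perplexity of a text language model over a held-out sequence of tokens $w_1, \dots, w_N$ is defined directly from these conditional probabilities:

$$
\mathrm{PPL}(w_1, \dots, w_N) \;=\; \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \log P\!\left(w_i \mid w_1, \dots, w_{i-1}\right)\right)
$$

It is this quantity that becomes hard to define and to compare when the token inventory is itself learned from the audio, as discussed next.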

In speech-based systems that use discrete pseudo-text units, the number of such units is a latent variable, making perplexities difficult to compare across models. The problem is worse for systems that do not use discrete representations at all, where the estimation of the conditional probabilities themselves becomes model dependent. Instead of computing an average perplexity across a corpus, the ZRC therefore uses a contrastive approach: a “surprisal score” (negative log probability, or simply the value of the loss function itself) is computed for minimal pairs of utterances, one legal, the other illegal. Accuracy is then computed by counting how often the surprisal score is lower for the legal utterance than for the illegal one.
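As a concrete illustration, the accuracy over a set of minimal pairs can be computed as below. This is a minimal sketch, not the challenge's reference implementation; `surprisal` stands in for whatever per-utterance score a participant submits (e.g. the value of the loss function).

```python
from typing import Callable, Sequence, Tuple

def contrastive_accuracy(
    pairs: Sequence[Tuple[str, str]],   # (legal_utterance, illegal_utterance) audio file paths
    surprisal: Callable[[str], float],  # per-utterance score, e.g. the model's loss / -log prob
) -> float:
    """Fraction of minimal pairs where the legal member receives the lower surprisal.

    Ties are counted as half a point in this sketch, so a constant scorer sits at chance (0.5).
    """
    correct = 0.0
    for legal, illegal in pairs:
        s_legal, s_illegal = surprisal(legal), surprisal(illegal)
        if s_legal < s_illegal:
            correct += 1.0
        elif s_legal == s_illegal:
            correct += 0.5
    return correct / len(pairs)
```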

This logic is used to probe the lexical level (words versus nonwords) and the syntactic level (grammatical versus ungrammatical sentences), using stimuli constructed with the Wuggy nonword generator (Keuleers & Brysbaert, 2010) and derived from the BLIMP dataset (Warstadt et al., 2019) (see Nguyen et al. (2020) and Supplementary Section S4).

Language models are also sometimes evaluated by probe tasks, which investigate the nature of the representations computed in the hidden units. The ZRC adapts a semantic similarity test previously used for evaluating text-based word embeddings (Chung & Glass, 2018) to correlate the similarity of systems’ representations of words with human similarity judgments. This measures the extent to which the model extracts lexical semantic knowledge.

The first round of submissions was documented in 2021 (Dunbar et al., 2021); a second round was opened as a NeurIPS 2021 challenge, including a visually-grounded training option. Briefly, this modified scenario expands the range of data that models can be trained on to include multimodal datasets (such as speech and images, or speech and video). The rationale is that young children learn in a multimodal, multisensory environment rather than by just listening. Earlier models of word discovery and representation learning demonstrated the feasibility of such multimodal training (Harwath et al., 2016; Arandjelovic & Zisserman, 2017; Chrupała et al., 2017).

Following Alishahi et al. (2021), Task 4 was expanded to include “visually-grounded” training. Participants were to indicate the dataset they used. For comparability with non-grounded systems, however, systems were only tested with speech-only inputs. Here, we present for the first time the results of these latest submissions to Task 4 (see Leaderboard Page).

Like the baseline models, the systems of Gan21, Ngu21a,d, Bha21a,b, and Gao21a-c take the approach of training acoustic units and then building a language model on their outputs. The distinction between high-budget and low-budget systems is made on the basis of the number of GPU hours needed to train the language model. Gao21a-c apply segmentation and pooling to reduce the temporal resolution of the units, while Bha21a,b use Segmental CPC to learn units and segmentation jointly. Ngu21a,d are technical improvements on the previous best system, Baseline4-lg. Ngu21b,c, on the other hand, are HuBERT systems, trained end-to-end on a masked language modelling task.

The systems of Pen21 and Lee21a,b are visually grounded: both start from acoustic units trained on parallel speech–image data (picture captions). One difference between the two models is the type of training: Pen21 trains end-to-end on a masked language modelling objective, while Lee21a,b use the pre-trained features as input to a small BERT.

Task 4 is clearly in its very early stages (despite the excellent ABX performance of the units used in systems up to now). Even at this stage, however, after only one year’s worth of submissions, spoken language modelling has shown improvement on the spot-the-word task (moving from the best speech-based baseline’s 75% accuracy up to 80%) and on the syntactic judgment task (from 56% to 60% accuracy). The approach so far has been simple: high-quality units and a powerful language model. In the baseline models, as in most submissions, these components were trained separately; newer models like HuBERT (Hsu et al., 2021) learn them jointly. The two approaches are currently tied for the top position on the leaderboard (Ngu21a,d, CPC units fed to a large BERT model, and Ngu21b,c, HuBERT systems). As for capturing word semantics, the Fast-VGS+ system of Pen21 stands out as a serious competitor. This visually-grounded system takes advantage of spoken image caption data during training.

Metrics

Lexicon: the sWUGGY spot-the-word metrics

In this task, the models are presented with a pair of spoken tokens: an existing word and a matched nonword. Participants provide a number (probability or pseudo-probability) associated with each acoustic token, and models are evaluated on their average accuracy of word-nonword classification based on this number (chance level: 0.5). The sWUGGY test and development sets consist of 20,000 and 5,000 pairs respectively, with the existing words being part of the LibriSpeech train vocabulary. We also prepared additional OOV-sWUGGY test and development sets consisting of 20,000 and 5,000 pairs respectively, with existing words which do not appear in the LibriSpeech training set. The nonwords are produced with WUGGY (Keuleers & Brysbaert, 2010), which generates, for a given word, a list of candidate nonwords best matched in phonotactics and syllabic structure; we additionally filtered these for pronounceability using G2P, and for having on average the same unigram and bigram phoneme frequencies as words. Stimuli were produced with the Google Speech API.
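Participants are free to compute this number however they like. For a unit-based pipeline like the baselines (acoustic units plus a language model), one natural choice is to sum the unit log-probabilities under the language model. The sketch below illustrates that choice under stated assumptions; `encoder` and `unit_lm` are hypothetical stand-ins, and this is not the official scoring code.

```python
import torch

def pseudo_logprob(waveform: torch.Tensor, encoder, unit_lm) -> float:
    """Pseudo log-probability of an utterance as the summed log-probability of its units.

    Assumptions: `encoder` maps a waveform to a 1-D tensor of discrete unit ids, and
    `unit_lm` maps a unit sequence to next-unit logits of shape (1, T-1, vocab). Both are
    placeholders for whatever quantizer and autoregressive LM a participant uses; only the
    returned scalar is submitted for evaluation.
    """
    with torch.no_grad():
        units = encoder(waveform)                  # (T,) pseudo-text unit ids
        logits = unit_lm(units[:-1].unsqueeze(0))  # predict unit t+1 from units up to t
        logp = torch.log_softmax(logits, dim=-1)
        target = units[1:].unsqueeze(0).unsqueeze(-1)
        return logp.gather(-1, target).sum().item()
```

A pair is then counted correct when the existing word obtains a higher pseudo log-probability than its matched nonword.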

Syntax: the sBLIMP acceptability metrics

This part of the benchmark is adapted from BLIMP (Warstadt et al., 2019), a set of linguistic minimal pairs of matched grammatical and ungrammatical sentences. As with sWUGGY, the task is to decide which of the two is grammatical based on the probability of the sentence. The test and dev sets contain 63,000 and 6,300 sentence pairs respectively, with no sentence pair overlap. Stimuli were filtered to contain only LibriSpeech vocabulary and to have natural prosodic contours, and were synthesised as above.

Lexical Semantics: the sSIMI similarity metrics

Here, as in Chung & Glass (2018), the task is to compute the similarity of the representations of pairs of words and compare it to human similarity judgements. As for the ABX task, participants provide embeddings for input tokens as well as a distance with which to compute similarity. By default, we provide the cosine distance computed over pooled embeddings (with mean, max or min pooling). We used a set of 13 existing semantic similarity and relatedness tests. The similarity-based datasets include WordSim-353 (Yang & Powers, 2006), WordSim-353-SIM (Agirre et al., 2009), mc-30 (Miller & Charles, 1991), rg-65 (Rubenstein & Goodenough, 1965), Rare-Word (or rw) (Luong et al., 2013), simLex999 (Hill et al., 2015), simverb-3500 (Gerz et al., 2016), verb-143 (Baker et al., 2014), and YP-130 (Yang & Powers, 2006); the relatedness-based datasets include MEN (Bruni et al., 2012), Wordsim-353-REL (Agirre et al., 2009), mturk-287 (Radinsky et al., 2011), and mturk-771 (Halawi et al., 2012). All scores were normalised on a 0-10 scale, and pairs within a same dataset containing the same words in a different order were averaged. Pairs containing a word absent from the LibriSpeech train set (Panayotov et al., 2015) were discarded. We selected the mturk-771 dataset as the development set and used the other 12 datasets as test sets, making sure that no pair from the development set was present in any of the test sets.

We then created two subsets of audio files: one synthetic (using the Google API), and one natural, obtained by retrieving the audio extracts from LibriSpeech corresponding to each word, following the process presented in [48]. In the natural subset, each word can appear in multiple tokens, providing phonetic diversity; duplicated scores are averaged in the analysis step. The natural subset is smaller than its synthesised counterpart, as we had to discard pairs from the test and dev sets which were not present in the LibriSpeech test and dev sets respectively. The synthesised subset is composed of 9,744 and 705 word pairs for the test and dev sets respectively, and the LibriSpeech subset is composed of 3,753 and 309 pairs for the test and dev sets.
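A minimal sketch of the scoring described above is given below, assuming a hypothetical `embed` function that returns frame-level features for one spoken token of a word. Mean pooling and cosine similarity follow the defaults mentioned earlier; the use of a Spearman rank correlation against the human judgments is an assumption of this sketch, and the real evaluation additionally averages scores over duplicate audio tokens.

```python
import numpy as np
from scipy.stats import spearmanr

def ssimi_correlation(pairs, human_scores, embed):
    """Rank-correlate model similarities with human judgments for (word_a, word_b) pairs.

    `embed(word)` is a hypothetical helper returning a (frames, dim) array of features
    for one spoken token of the word; human_scores are the 0-10 normalised judgments.
    """
    def pooled(word):
        vec = embed(word).mean(axis=0)               # mean pooling over frames
        return vec / (np.linalg.norm(vec) + 1e-12)   # unit-normalise for cosine similarity

    model_sims = [float(pooled(a) @ pooled(b)) for a, b in pairs]
    rho, _ = spearmanr(model_sims, human_scores)
    return rho
```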

Prosody: the ProsAudit metrics

The ProsAudit task evaluates the prosodic knowledge of a model at the structural level (that is, prosodic information related to the structure of words and sentences). As with sWUGGY and sBLIMP, the models are presented with a pair of spoken utterances: one containing a break at a “natural” position and one with a break at an “unnatural” position. Participants provide a number (probability or pseudo-probability) associated with each utterance, and models are evaluated on their average accuracy of natural-unnatural classification based on this number (chance level: 0.5).

The task contains two subtasks. The protosyntax subtask tests the model’s ability to identify strong versus weak prosodic boundaries. The lexical subtask tests the model’s ability to distinguish between pauses inserted between words and pauses inserted within words, and therefore requires some lexical knowledge of word boundaries. The utterances are extracted from the Boston University corpus (Ostendorf et al., 1995), and pauses are artificially inserted at different positions in both conditions.

For more details on the task, please refer to Seyssel et al. (2023).

The resources (dataset and code) associated with the ProsAudit benchmark are licensed under a CC BY-NC license. Users shall give appropriate credit to the authors of the work (Seyssel et al., 2023) and of the BU corpus on which the dataset is based (Ostendorf, Mari, Patti Price, and Stefanie Shattuck-Hufnagel. Boston University Radio Speech Corpus LDC96S36. Web Download. Philadelphia: Linguistic Data Consortium, 1996).

Bibliography

$^*$The full bibliography can be found here

Agirre, Alfonseca, Hall, Kravalova, Pasca & Soroa (2009). A study on similarity and relatedness using distributional and wordnet-based approaches.
Alishahi, Chrupała, Cristia, Dupoux, Higy, Lavechin, Räsänen & Yu (2021). ZR-2021VG: Zero-resource speech challenge, visually-grounded language modelling track. arXiv preprint arXiv:2107.06546.
Arandjelovic & Zisserman (2017). Look, listen and learn.
Baker, Reichart & Korhonen (2014). An unsupervised model for instance level subcategorization acquisition.
Bruni, Boleda, Baroni & Tran (2012). Distributional semantics in technicolor.
Chrupała, Gelderloos & Alishahi (2017). Representations of language in a model of visually grounded speech signal.
Chung & Glass (2018). Speech2vec: A sequence-to-sequence framework for learning word embeddings from speech. arXiv preprint arXiv:1803.08976.
Gerz, Vulić, Hill, Reichart & Korhonen (2016). Simverb-3500: A large-scale evaluation set of verb similarity. arXiv preprint arXiv:1608.00869.
Halawi, Dror, Gabrilovich & Koren (2012). Large-scale learning of word relatedness with constraints.
Harwath, Torralba & Glass (2016). Unsupervised learning of spoken language with visual context.
Hill, Reichart & Korhonen (2015). Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4). 665–695.
Hsu, Bolte, Tsai, Lakhotia, Salakhutdinov & Mohamed (2021). HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29. 3451–3460.
Keuleers & Brysbaert (2010). Wuggy: A multilingual pseudoword generator. Behavior Research Methods, 42(3). 627–633.
Luong, Socher & Manning (2013). Better word representations with recursive neural networks for morphology.
Miller & Charles (1991). Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1). 1–28.
Nguyen, Seyssel, Rozé, Rivière, Kharitonov, Baevski, Dunbar & Dupoux (2020). The zero resource speech benchmark 2021: Metrics and baselines for unsupervised spoken language modeling. arXiv preprint arXiv:2011.11588.
Ostendorf, Price & Shattuck-Hufnagel (1995). The Boston University radio news corpus. Linguistic Data Consortium. 1–19.
Panayotov, Chen, Povey & Khudanpur (2015). Librispeech: An ASR corpus based on public domain audio books. IEEE.
Radinsky, Agichtein, Gabrilovich & Markovitch (2011). A word at a time: Computing word relatedness using temporal semantic analysis.
Rubenstein & Goodenough (1965). Contextual correlates of synonymy. Communications of the ACM, 8(10). 627–633.
Seyssel, Lavechin, Titeux, Thomas, Virlet, Santos Revilla, Wisniewski, Ludusan & Dupoux (2023). ProsAudit, a prosodic benchmark for self-supervised speech models.
Warstadt, Parrish, Liu, Mohananey, Peng, Wang & Bowman (2019). BLiMP: A benchmark of linguistic minimal pairs for English. arXiv preprint arXiv:1912.00582.
Yang & Powers (2006). Verb similarity on the taxonomy of WordNet. Masaryk University.