ZeroSpeech 2021

Speech-based language modelling

ZeroSpeech 2021 is a challenge aimed at Spoken Language Modelling. This task consists in learning language models directly from raw audio in an unknown language, without any annotation or text.

This page describes the core benchmarks, definitions, and constraints for speech-based language modelling as laid out in the ZeroSpeech 2021 challenge.

The information on this page is common to both Track 1 and Track 2.

For details specific to visually-grounded language modelling (Track 2), see the page about Track 2: Visually-grounded speech-based language modelling.


The goal of the ZeroSpeech 2021 challenge is to build language models without using any textual input.

Like ordinary language models, the trained models are intended to assign scores (probabilities or pseudo-probabilities) to novel utterances, assessing whether they are possible or likely utterances in the training language. Unlike text-based language models, the input to the trained models is a speech utterance, not a written utterance.

In Track 1, speech is also the only input to training (see the page about Track 2 for information about visually-grounded spoken-language model training).

Training might, for example, consist of acoustic unit discovery, to learn discrete units (pseudo-text), followed by language model training applied to the pseudo-text. Another approach might learn a sequential predictive model end-to-end without explicit discrete units.


Evaluation is done through a suite of four black-box, zero-shot metrics, intended to probe different aspects of a trained model. Rather than evaluating perplexity (which is not feasible when using speech), the metrics probe what models know about the language. They evaluate four different linguistic levels of knowledge: phonetics, lexicon, syntax and semantics.

  1. Phonetics (Libri-light ABX). This metric assesses whether the phonemes of the language are correctly distinguished by the model. If the phonemes have well separated representations, this metric (the ABX error) will be close to 0 (5-10% error corresponds to good separation; 20%-30%: some signal, but not very good separation – for example, MFCC representations; 50%: chance level).

  2. Lexicon (sWUGGY spot-the-word). This metric assesses whether the model knows the words of the language. Models are presented with a pair: an existing word (like ‘brick’) and a non-word (like ‘blick’). The spot-the-word accuracy measures how often the model correctly assigns a higher probability to the real word than to the corresponding non-word. Character-based LMs easily reach 95% on this task; chance is 50%.

  3. Syntactic level (sBLIMP syntactic acceptability). This metric assesses whether the model has knowledge of the syntactic structure of the language. Models are presented with a pair: a grammatical sentence (like ‘the dogs sleep’) and an ungrammatical counterpart (‘the dog sleep’, for example). The syntactic acceptability accuracy measures how often the model correctly assigns a higher probability to the grammatical sentence than to the ungrammatical one, across a variety of syntactic phenomena. Character-based LMs trained on the LibriSpeech annotations are around 68% correct; humans and large LMs are in the 80-90% range. Chance is 50%.

  4. Semantic level (sSIMI similarity score). This metric assesses whether the model has knowledge of the lexical semantics of the language. The task is to compute the similarity of the representations of pairs of words. The score is a correlation coefficient with respect to human similarity judgements. If the model perfectly predicts human judgements, the score will be 100%. Random models will have a score of 0%. Large wordpart-based LMs are around 30%, while typical static word vectors are around 50% [4].

Metric 1 is intended to measure the ability of the model to make the right phonetic distinctions to code the language, which will presumably be somewhat predictive of the quality of language modelling.

Metrics 2 through 4 measure the language model itself. These different aspects of the model may be thought of as being useful for different downstream tasks. Accurate lexical and syntactic knowledge will presumably be very important for applying language models to low-resource ASR decoding or other tasks in which what is said is of primary importance, while accurate syntactic and semantic knowledge will presumably be of greater importance for tasks such as speech-based translation or language understanding in which what is meant is critical.

Submitters are allowed to select a different representation from their model for each of the metrics. For example, phonetics may be captured by an earlier layer of an end-to-end model than lexical semantics. These representations can be selected with the help of the dev kit: participants are provided with the scripts to run all four metrics on the dev set. To obtain results on the test set, they must submit their output files to the website.

Distance-based metrics

Metrics 1 (ABX) and 4 (sSIMI) require that participants first extract an embedding for each test input, and specify a (pseudo) distance to compare two such embeddings. As test inputs may not have the same length (different numbers of speech frames), this requires two decisions: picking a frame-wise distance (for instance the cosine distance, the angular distance, the KL-divergence, etc.) and a pooling method.

The ABX metric computes, for two speech categories A and B (e.g., bit vs. bet), the probability that two sounds belonging to the same category are closer to one another than two sounds belonging to different categories. The score is symmetrized, aggregated across all minimal pairs of triphones like bit and bet (where the change occurs only in the middle phoneme), and turned into a percent error. This score is computed both within speaker (a, b and x are spoken by the same speaker) and across speakers (a and b are spoken by the same speaker, and x by a different speaker).

The sSIMI metric computes the similarity for a pair of words x and y. Spearman’s rank correlation between these similarities and human reference judgements is then calculated across all word pairs. This is done both for natural productions of the words, extracted from a corpus, and for synthesized items, in order to address the potential difficulties in changing domain.

For the ABX metric, it is customary to average along a DTW realignment path to deal with the issue of pooling. The evaluation for Metric 1 does not support other methods, but allows the user to specify a frame-wise distance (Euclidean distance, angular distance, KL-divergence, or symmetrized KL-divergence).
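As a rough illustration of the DTW-averaged distance and the ABX decision, here is a minimal pure-Python sketch. It is not the challenge's evaluation code: the function names and the toy two-dimensional "features" are hypothetical, the frame-wise distance is assumed to be the angular distance, and ties, symmetrization, and aggregation over minimal pairs and speakers are left out.

```python
import math

def angular_distance(u, v):
    """Angle (in radians) between two frame vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point drift.
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def dtw_average_distance(x, y, frame_dist=angular_distance):
    """Mean frame-wise distance along a minimum-cost DTW path between x and y."""
    n, m = len(x), len(y)
    # cost[i][j]: minimal accumulated distance aligning x[:i+1] with y[:j+1]
    # steps[i][j]: number of cells on the path that achieves it
    cost = [[0.0] * m for _ in range(n)]
    steps = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = frame_dist(x[i], y[j])
            if i == 0 and j == 0:
                cost[i][j], steps[i][j] = d, 1
                continue
            candidates = []
            if i > 0:
                candidates.append((cost[i - 1][j], steps[i - 1][j]))
            if j > 0:
                candidates.append((cost[i][j - 1], steps[i][j - 1]))
            if i > 0 and j > 0:
                candidates.append((cost[i - 1][j - 1], steps[i - 1][j - 1]))
            best_cost, best_steps = min(candidates)
            cost[i][j] = best_cost + d
            steps[i][j] = best_steps + 1
    return cost[-1][-1] / steps[-1][-1]

def abx_correct(a, b, x):
    """True if x (same category as a) is closer to a than to b."""
    return dtw_average_distance(a, x) < dtw_average_distance(b, x)

def abx_error(triples):
    """Percent of (a, b, x) triples in which x is not closer to a than to b."""
    wrong = sum(0 if abx_correct(a, b, x) else 1 for a, b, x in triples)
    return 100.0 * wrong / len(triples)

# Toy sequences of different lengths (hypothetical values, not real features).
a = [[1.0, 0.0], [1.0, 0.1]]
b = [[0.0, 1.0], [0.1, 1.0]]
x = [[1.0, 0.05], [1.0, 0.0], [1.0, 0.1]]  # same "category" as a
print(abx_correct(a, b, x))
```

Because the distance is averaged along the alignment path, sequences with different numbers of frames can be compared without any explicit pooling step.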

For the sSIMI metric, it will likely make more sense to use pooling on the representations, before calculating any distances, rather than by applying DTW. The evaluation for Metric 4 does not support other approaches. The user selects a pooling method (min, max, mean, sum, last, or lastlast, i.e., second-last) and a distance (any metric supported by scipy.spatial.distance.cdist). Distances will be converted into similarities by taking the negation.
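The pooling, distance, negation, and correlation steps can be sketched as follows. This is an illustrative pure-Python sketch rather than the official scoring script: it assumes mean pooling and the cosine distance (one of the metrics that `scipy.spatial.distance.cdist` supports, re-implemented by hand here), and the function names, toy features, and human scores are all made up.

```python
import math

def mean_pool(frames):
    """Average a list of frame vectors into a single vector."""
    dim = len(frames[0])
    return [sum(f[k] for f in frames) / len(frames) for k in range(dim)]

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def similarity(frames_x, frames_y):
    """Pool each word, take the distance, and negate it to get a similarity."""
    return -cosine_distance(mean_pool(frames_x), mean_pool(frames_y))

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((p - mx) * (q - my) for p, q in zip(rx, ry))
    sx = math.sqrt(sum((p - mx) ** 2 for p in rx))
    sy = math.sqrt(sum((q - my) ** 2 for q in ry))
    return cov / (sx * sy)

# Three hypothetical word pairs: (frames of x, frames of y), with made-up human ratings.
word_pairs = [
    ([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.1]]),
    ([[1.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]]),
    ([[1.0, 0.0]], [[0.0, 1.0]]),
]
human_scores = [9.0, 5.0, 1.0]

model_scores = [similarity(fx, fy) for fx, fy in word_pairs]
score = 100.0 * spearman(model_scores, human_scores)
print(f"sSIMI-style score on toy data: {score:.1f}")
```

Since Spearman's correlation depends only on rank order, negating the distances is enough to turn them into usable similarities; no further rescaling is needed.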

We provide scripts to compute these distances and poolings given an embedding for each input file, and participants can use the dev set to select the best embedding, pooling, and distance in their system. However, in their submissions, participants submit only the features, along with their choice of pooling and similarity/distance; our remote evaluation takes care of these calculations.

For further details, see Nguyen, et al. (2020) [2]. Please cite this paper when using the benchmark and/or the baseline systems.

Scoring-based metrics

Metrics 2 and 3 require a (pseudo) probability to be associated with each input. It is up to the participants to provide such a number (it can be any positive or negative float, hence ‘pseudo’); in our baseline, we compute it by applying various masks to the input and computing the probability of the BERT reconstruction of the pseudo-text hidden behind the mask.

The sWUGGY metric presents models with an existing word (like ‘brick’) and a non-word (like ‘blick’). The model must assign a score to both. The score for the real word should be numerically larger than the score for the non-word in order to be scored correct. The nonwords are matched for phonotactics and syllable structure, and they are designed not to be predictable from unigram or bigram statistics. They are synthesized using TTS. The overall score is the percent accuracy over all pairs.
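The pairwise scoring rule can be sketched in a few lines. This is only an illustration: the function name and the scores are hypothetical, and ties are counted as incorrect here, which is an assumption rather than something the challenge description specifies.

```python
def spot_the_word_accuracy(pairs):
    """Percent of (word_score, nonword_score) pairs where the real word scores higher."""
    pairs = list(pairs)
    correct = sum(1 for word_score, nonword_score in pairs if word_score > nonword_score)
    return 100.0 * correct / len(pairs)

# Made-up pseudo-log-probabilities for four (word, non-word) pairs.
scores = [(-12.3, -15.8), (-9.1, -8.7), (-20.0, -25.5), (-7.2, -7.9)]
accuracy = spot_the_word_accuracy(scores)
print(f"{accuracy:.1f}%")  # 75.0%
```

Only the ordering of the two scores within a pair matters, which is why any real-valued pseudo-probability can be submitted.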

The sBLIMP metric presents models with a pair of sentences: a grammatical sentence (like ‘the dogs sleep’) and an ungrammatical counterpart (‘the dog sleep’, for example). The model must provide a score for each, and the item is scored as correct if the score for the grammatical sentence is numerically larger than the score for the ungrammatical one. The dataset tests a variety of different phenomena (subject-verb agreement, negative polarity items, filler-gap dependencies, and so on). The overall score is an average of the scores across all the broad categories of syntactic knowledge, each of which is in turn an average over the percent accuracies for several narrower categories. sBLIMP is a speech version of the BLiMP benchmark [3].
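The two-level averaging can be illustrated with a short sketch. The category names and accuracies below are invented for the example, and the function is a hypothetical helper, not part of the evaluation kit.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)

def sblimp_score(accuracies):
    """Average over broad categories of the average over their narrow categories.

    accuracies: {broad_category: {narrow_category: percent_accuracy}}
    """
    return mean(mean(narrow.values()) for narrow in accuracies.values())

# Hypothetical percent accuracies, grouped by broad and narrow category.
toy_accuracies = {
    "subject_verb_agreement": {"regular": 60.0, "irregular": 70.0},
    "npi_licensing": {"only": 40.0},
    "filler_gap": {"wh_questions": 55.0, "wh_island": 45.0, "wh_vs_that": 50.0},
}

overall = sblimp_score(toy_accuracies)
flat = mean(acc for narrow in toy_accuracies.values() for acc in narrow.values())
print(f"average of broad-category averages: {overall:.2f}")
print(f"flat average over narrow categories: {flat:.2f}")
```

The two numbers differ because averaging per broad category first gives each broad category equal weight, regardless of how many narrow categories it contains.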



All of our baselines follow the same pipeline: a CPC encoder [1], which computes a continuous frame-level representation; a discretized (but still frame-level) representation, obtained by applying $k$-means to the CPC representations, which yields a discrete “pseudo-text”; and a language model (either LSTM or BERT) trained on that pseudo-text.
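The discretization step can be sketched as a nearest-centroid assignment. This is a toy illustration, not the baseline code: the centroids are assumed to have been learned by $k$-means beforehand, the two-dimensional vectors stand in for real CPC features, and the function names are hypothetical.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def quantize(frames, centroids):
    """Map each frame vector to the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(frame, centroids[k]))
        for frame in frames
    ]

# Hypothetical centroids (as if produced by k-means with k = 3) and toy frames.
centroids = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
frames = [[0.1, 0.0], [0.9, 1.1], [0.8, 0.9], [0.1, 0.9]]

units = quantize(frames, centroids)
pseudo_text = " ".join(str(u) for u in units)
print(pseudo_text)  # 0 1 1 2
```

The resulting sequence of unit indices plays the role of text: a language model can then be trained on it as it would be on character or word tokens.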


Training datasets

While we recognize that different models may be of radically different sizes (see Budget below), and thus may necessitate different training sets, for the sake of comparability, we recommend the use of the following data sets, on which our baselines were trained:

  • LibriSpeech 960: The baseline acoustic modelling (CPC) was trained on the audio from the entire LibriSpeech train set (100+360+500 hours). The baseline speech-based language models (LSTM, low-budget and high-budget BERT), as well as the topline text-based BERT, were also trained on this data set.
  • LibriSpeech 100: The baseline acoustic unit clustering ($k$-means) was trained on the audio from the 100-hour LibriSpeech train set (100).

Most systems submitted up to now have used one or a combination of these data sets. Submissions are required to declare the data sets on which they trained.


Budget

In order to take into account the computing resources of participants, we distinguish submissions by the amount of GPU budget used for training the language modelling component. We focus on the language modelling component because we expect that the training time of systems will typically be dominated by language modelling. Submissions must include their GPU budget in hours.

As discussed above, our baseline systems are trained first by doing discrete unit discovery, but we do not include this training time in the GPU budget.

  • CPC: two to three days to train on eight 16 GB GPUs
  • Clustering and quantizing: 2 x 12h on one GPU

For the language modelling component, we provide two versions.

  • Low-budget models: LSTM (22M params) and BERT (28M params): each takes 60h on one GPU, thus corresponds to a budget of 60 GPU hours
  • High-budget model: BERT-base (90M params): takes 48h on 32 GPUs, thus has a budget of 1536 GPU hours
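The budget figures above are simply wall-clock training hours multiplied by the number of GPUs, as this small sketch shows (the helper name is hypothetical):

```python
def gpu_hours(wall_clock_hours, num_gpus):
    """GPU budget: training time in hours multiplied by the number of GPUs used."""
    return wall_clock_hours * num_gpus

low_budget = gpu_hours(60, 1)     # LSTM or small BERT: 60 GPU hours
high_budget = gpu_hours(48, 32)   # BERT-base: 1536 GPU hours
print(low_budget, high_budget)
```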

We take “high”-budget systems to be those which would be very difficult to develop on a reasonable timeline with only a small number of GPUs. Participants do not need to classify themselves into “low” or “high”-budget (only the number of GPU hours need be specified at submission). Models trained completely end-to-end, with no language modelling component clearly distinguished, should include their whole budget.


For further details, see Nguyen, et al. (2020) [2]. Please cite this paper when using the benchmark and/or the baseline systems.

author = "Nguyen, Tu Anh and de Seyssel, Maureen and Rozé, Patricia and Rivière, Morgane and Kharitonov, Evgeny and Baevski, Alexei and Dunbar, Ewan and Dupoux, Emmanuel",
title = "{The Zero Resource Speech Benchmark 2021: Metrics and baselines for unsupervised spoken language modeling}",
booktitle = "Self-Supervised Learning for Speech and Audio Processing Workshop @ NeurIPS",


The ZeroSpeech 2021 Benchmark has been funded by a Facebook gift, a CIFAR grant (Learning in Minds and Brains), and grants from the Agence Nationale de la Recherche (ANR-17-EURE-0017 Frontcog, ANR-10-IDEX-0001-02 PSL*, ANR-19-P3IA-0001 PRAIRIE 3IA Institute) given to E. Dupoux in his EHESS role.

The ZeroSpeech 2021 challenge is hosted on Codalab, an open-source web-based platform for machine learning competitions.



[1] Oord, A. V. D., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

[2] Nguyen, T.A., de Seyssel, M., Rozé, P., Rivière, M., Kharitonov, E., Baevski, A., Dunbar, E., & Dupoux, E. (2020). The Zero Resource Speech Benchmark 2021: Metrics and baselines for unsupervised spoken language modeling. arXiv preprint: arXiv:2011.11588.

Challenge Organizing Committee

  • Emmanuel Dupoux (Organizer)

    Researcher, EHESS / Cognitive Machine Learning / Facebook, emmanuel.dupoux at

  • Ewan Dunbar (Organizer)

    Assistant Professor, University of Toronto, ewan.dunbar at

  • Mathieu Bernard (Website & Submission)

    Engineer, INRIA, Paris, mathieu.a.bernard at

  • Nicolas Hamilakis (Website & Submission)

    Engineer, ENS, Paris, nicolas.hamilakis at

  • Maureen de Seyssel (Datasets & Metrics)

    PhD student, INRIA, Paris, maureen.deseyssel at

  • Tu Anh Nguyen (Baselines)

    PhD student, INRIA/Facebook, Paris, nguyentuanh208 at