Visually-grounded language modelling

Questions or comments? Contact zerospeech2021 [at] gmail [dot] com.

ZeroSpeech 2021 is a challenge aimed at Spoken Language Modelling. This task consists in learning language models directly from raw audio in an unknown language, without any annotation or text.

For general information about both Track 1 and Track 2, including the core benchmarks, definitions, and constraints for speech-based language modelling as laid out in the ZeroSpeech 2021 challenge, please see the general information page about Speech-based language modelling.

The information on this page is specific to Track 2: Visually-grounded language modelling.


The goal of the ZeroSpeech 2021 challenge is to build language models without using any textual input. In Track 2: Visually-grounded language modelling, participants are asked to incorporate the visual modality in their pipeline. A pipeline may include audio-only models as long as it also includes at least one model that uses the visual modality. The visual modality can take the form of static images or videos, whether or not they are parallel/paired with the speech. The goal of this track is to better understand the impact of visual grounding on the learned representations and to improve the quality of these representations.

No linguistic supervision is allowed. You can use any automatically derived information, as long as the system used to generate it was not trained with linguistic labels (e.g., speaker diarization or denoising are allowed, speaker or language identification is allowed, speech recognition with a language model is not allowed). Explicitly, training must be done in an unsupervised fashion within the linguistic domain; i.e., no linguistic labels are allowed, including those generated by ASR systems (since these have themselves been trained using labels). The one exception to the general rule of no supervision is that visual features may be pre-trained with categorical object labels. However, we categorically forbid the use of captions, and we caution against using this exception to allow linguistic supervision to seep in. If you are uncertain, please contact the organizers.


See Evaluation for Speech-based language modelling.

Evaluation is EXACTLY THE SAME for Track 1 and Track 2. Only speech is used in the test items, as the goal of Track 2 is to assess how and whether learning with access to visual information can improve speech-based language modelling (rather than to use speech to improve a task on images or videos). Thus, as in Track 1, all models MUST be able to output representations for novel speech-only items.
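The speech-only requirement can be illustrated with a minimal sketch. Everything here is hypothetical: the class `ToyVisuallyGroundedModel`, its linear projections, and the feature dimensions are assumptions made for illustration and are not the baseline architecture. The point is only that the speech branch can be called on its own at test time, while the image branch is needed during training alone.

```python
import numpy as np


class ToyVisuallyGroundedModel:
    """Hypothetical two-branch model: linear projections into a shared space."""

    def __init__(self, speech_dim, image_dim, embed_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w_speech = rng.normal(size=(speech_dim, embed_dim))
        self.w_image = rng.normal(size=(image_dim, embed_dim))

    def encode_speech(self, frames):
        # frames: (n_frames, speech_dim) -> (n_frames, embed_dim)
        return frames @ self.w_speech

    def encode_image(self, image_features):
        # image_features: (image_dim,) -> (embed_dim,)
        return image_features @ self.w_image

    def similarity(self, frames, image_features):
        # Training-time only: cosine similarity between the mean-pooled
        # speech embedding and the image embedding.
        s = self.encode_speech(frames).mean(axis=0)
        v = self.encode_image(image_features)
        return float(s @ v / (np.linalg.norm(s) * np.linalg.norm(v)))


model = ToyVisuallyGroundedModel(speech_dim=13, image_dim=64, embed_dim=32)

# At test time only speech is available: no image is passed anywhere.
test_frames = np.random.default_rng(1).normal(size=(50, 13))
representations = model.encode_speech(test_frames)
```

A model whose forward pass cannot run without an image would not satisfy the evaluation constraint, whatever its training setup.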


Multimodal baseline systems

  • Acoustic model: CPC small

  • VG model

  • Language model: BERT small, BERT large

All of our baselines follow the same pipeline:

  1. A visually grounded (VG) model outputs a continuous frame-level representation.

  2. K-means is applied to the VG model's representation, yielding a discretized (but still frame-level) representation: a discrete “pseudo-text”.

  3. Language models are trained on the pseudo-text derived from clustering the learned representation.

We propose a first version of this baseline in which the VG model is trained from MFCCs. In a second version, the VG model is trained from pretrained CPC representations.
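The discretization step of this pipeline can be sketched as follows. This is a minimal illustration, not the baseline code: random vectors stand in for the VG model's frame-level output, the k-means routine is a plain NumPy implementation of Lloyd's algorithm, and the feature dimension (16) and number of clusters (8) are arbitrary assumptions.

```python
import numpy as np


def squared_distances(features, centroids):
    # Squared Euclidean distance from every frame to every centroid.
    diff = features[:, None, :] - centroids[None, :, :]
    return (diff ** 2).sum(axis=-1)


def kmeans(features, n_clusters, n_iters=20, seed=0):
    """Plain Lloyd's k-means; returns the centroids."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen frames.
    idx = rng.choice(len(features), n_clusters, replace=False)
    centroids = features[idx].copy()
    for _ in range(n_iters):
        labels = squared_distances(features, centroids).argmin(axis=1)
        for k in range(n_clusters):
            members = features[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    return centroids


def quantize(features, centroids):
    """Map each frame to the index of its nearest centroid."""
    return squared_distances(features, centroids).argmin(axis=1)


# Stand-in for continuous frame-level VG representations (assumption).
rng = np.random.default_rng(0)
train_frames = rng.normal(size=(500, 16))
centroids = kmeans(train_frames, n_clusters=8)

# One unseen utterance of 40 frames becomes a sequence of 40 discrete units.
utterance = rng.normal(size=(40, 16))
units = quantize(utterance, centroids)
pseudo_text = " ".join(str(u) for u in units)
```

The resulting `pseudo_text` is one cluster index per frame, which is the kind of discrete sequence a language model can then be trained on.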

Training datasets

Submissions may use different types of data, corresponding to different conditions. For example, many current visually-grounded systems are trained on images paired with speech labels. Although this is a “high-resource” training setting, and does not correspond to naturalistic, spontaneous speech, participants may find it useful to explore a “best-case” scenario. Other participants may wish to move to naturalistic videos, in which the speech is only very indirectly related to the image, or even to completely do away with parallel data. While this makes it difficult to compare models, we recognize that different models may have very different needs, and be pursuing very different scientific ideas. Submissions are, however, required to declare the data sets on which they trained.

The training sets we used to develop our baseline systems are shown in the table below.

Training sets used in the multimodal baseline systems

  • COCO images 2014 (VG model)

  • Librispeech 100h

  • Librispeech 960h

In order to take into account the computing resources of participants, we distinguish submissions by the amount of GPU budget used for training the whole pipeline.

  • CPC (optional): 48h to 72h to train on eight GPUs (we used pretrained CPC models taken from Track 1)

  • VG: 10h to 72h to train on one GPU depending on whether the input is MFCCs or CPC representations

  • Clustering and quantizing: 2 x 12h on one GPU

  • Language modelling: either 60h on one GPU for LSTM (22M params) and BERT (28M params), or 5h on 32 GPUs (160 GPU hours) for BERT-large (90M params)

We consider “high”-budget systems to be those which would be very difficult to develop on a reasonable timeline with only a small number of GPUs. Participants do not need to classify themselves into “low” or “high”-budget (only the number of GPU hours need be specified at submission).
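As a rough guide to tallying GPU hours for a submission, the sketch below multiplies wall-clock time by the number of GPUs for each stage and sums the results. The helper `gpu_hours` and the two stage combinations are illustrative assumptions, not official budgets for the baselines. It also assumes that "2 x 12h" means 24 GPU hours in total and that the 60h language-modelling figure is counted once.

```python
def gpu_hours(wall_clock_hours, n_gpus=1):
    """GPU hours = wall-clock training time multiplied by the number of GPUs."""
    return wall_clock_hours * n_gpus


# Illustrative smaller pipeline: no CPC training, shortest VG time, small LM.
small_pipeline = {
    "vg": gpu_hours(10),
    "clustering_and_quantizing": gpu_hours(2 * 12),
    "language_model": gpu_hours(60),
}

# Illustrative larger pipeline: CPC at its lower bound, longest VG time,
# and BERT-large trained for 5h on 32 GPUs.
large_pipeline = {
    "cpc": gpu_hours(48, n_gpus=8),
    "vg": gpu_hours(72),
    "clustering_and_quantizing": gpu_hours(2 * 12),
    "language_model": gpu_hours(5, n_gpus=32),
}

total_small = sum(small_pipeline.values())
total_large = sum(large_pipeline.values())
print(total_small, total_large)
```

Under these assumptions the two totals come to 94 and 640 GPU hours respectively; the number reported at submission is a single total of this kind.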

Baseline system

You can find a more in-depth explanation, as well as the code required to run the visually-grounded baseline, in our git repository:


Alejandrina Cristia was supported by Agence Nationale de la Recherche (ANR-17-CE28-0007 LangAge, ANR-16-DATA-0004 ACLEW, ANR-14-CE30-0003 MechELex, ANR-17-EURE-0017); and the J. S. McDonnell Foundation Understanding Human Cognition Scholar Award.

Okko Räsänen was supported by Academy of Finland grant no. 314602.

Bertrand Higy was supported by NWO/E-Science Center grant no. 027.018.G03.

The ZeroSpeech 2021 challenge is hosted on Codalab, an open-source web-based platform for machine learning competitions.



To be announced

Track 2 (Multimodal) Organizing Committee

  • Alejandrina Cristia

    Laboratoire de Sciences Cognitives et Psycholinguistique, ENS, Paris, France

  • Okko Räsänen

    Unit of Computing Sciences, Tampere University, Finland

  • Bertrand Higy

    Dept. of Cognitive Science and AI, Tilburg University, Netherlands

  • Marvin Lavechin

    Cognitive Machine Learning / Facebook AI Research, ENS/INRIA, Paris, France

  • Grzegorz Chrupała

    Dept. of Cognitive Science and AI, Tilburg University, Netherlands

  • Afra Alishahi

    Dept. of Cognitive Science and AI, Tilburg University, Netherlands

  • Chen Yu

    Dept. of Psychology, University of Texas at Austin