General presentation

Traditional speech and language technologies are trained on massive amounts of text and/or expert knowledge. This is not sustainable: the majority of the world’s languages do not have reliable textual or expert resources. Even in high-resource languages, there is a large domain mismatch between the oral and written uses of language.

But infants learn to speak their native language, spontaneously, from raw sensory input, without supervision from text or linguists. It should be possible to do the same in machines!

The ultimate goal of the “Zero Resource Speech Challenge”[1] is to learn an end-to-end Spoken Dialog (SD) system, in an unknown language, from scratch, using only the raw sensory information available to an early language learner.

The Zero Resource Speech Challenge addresses an ambitious fundamental scientific question for Artificial Intelligence: how can a system autonomously acquire language? Once solved, it can help with three practical applications:

  • Systems constructed with zero expert resources could provide services for millions of users of ‘low-resource’ languages (keyword search, document classification, etc.)

  • Zero resource technologies could help the language documentation effort for the growing number of disappearing languages (tools for automatically discovering/annotating linguistic units).

  • Zero Resource Speech technologies provide predictive models of language development in typical or atypical settings (dyslexia, autism, etc.).

The Zero Resource Challenge series is constructed to progress incrementally towards this goal by proposing achievable but progressively harder objectives, building and open-sourcing along the way the core technological components needed for an autonomous SD system (see Figure 1).


Figure 1. General outline of a spoken dialogue system, and positioning of the 5 ZR Challenges so far.

Weakly supervised and unsupervised learning are tricky to evaluate. We use two kinds of evaluations: (1) Unit testing: each core component is evaluated with a specific set of metrics inspired by psychometrics and linguistics. These tests do not guarantee that an entire system will work well, but they are useful for debugging and understanding the systems. (2) Application testing: as the challenges aggregate more components, it becomes possible to construct useful applications (e.g., Task 1b is low-bitrate speech resynthesis), making more standard evaluation techniques applicable.
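For concreteness, one widely used unit test of this kind is ABX phone discriminability: given learned speech embeddings, we ask whether a token X is closer to a token A of the same category (e.g., the same phone) than to a token B of a different category. The sketch below is illustrative only, not the official evaluation code; it assumes each token is a NumPy array of frame embeddings, compares sequences with a simple cosine-based DTW, and the function names are hypothetical:

```python
import numpy as np

def dtw_distance(x, y):
    """Cosine-based DTW distance between two (frames, dims) embedding sequences."""
    nx, ny = len(x), len(y)
    # Pairwise cosine distances between all frames of x and y.
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    yn = y / np.linalg.norm(y, axis=1, keepdims=True)
    d = 1.0 - xn @ yn.T
    # Standard dynamic-programming alignment.
    acc = np.full((nx + 1, ny + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, nx + 1):
        for j in range(1, ny + 1):
            acc[i, j] = d[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    return acc[nx, ny] / (nx + ny)  # length-normalized

def abx_error_rate(triplets):
    """Fraction of (A, B, X) triplets where X (same category as A)
    is closer to B than to A (ties counted as errors, for simplicity)."""
    errors = sum(dtw_distance(x, a) >= dtw_distance(x, b) for a, b, x in triplets)
    return errors / len(triplets)
```

An error rate near 0 means the embeddings separate the categories well; 0.5 is chance level. The official benchmark refines this simple scheme (e.g., averaging over speakers and contexts).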

So far, five ZeroSpeech challenges have been organized: in 2015, 2017, 2019, 2020, and most recently 2021. Please click on the corresponding tab for more information.




[1] “Zero resource” refers to zero linguistic expertise (e.g., orthographic/linguistic transcriptions), not zero information besides audio (visual input, limited human feedback, etc.).