The Zero Resource Speech Challenge 2020
Tasks & Goals
Summary
ZeroSpeech 2020 is a consolidating challenge in which participants submit systems to the ZeroSpeech 2017 tasks (Track 1 or Track 2) or to the ZeroSpeech 2019 task. Participants are particularly encouraged to submit to multiple tracks/challenges (for example, unit discovery evaluated on both the 2017 and 2019 evaluations, or unit discovery used as a basis for spoken term discovery).
For general background information, see the ZeroSpeech 2017 and ZeroSpeech 2019 main pages. Changes have been made to the challenge for the 2020 edition; please read the Instructions attentively for detailed information.
- Datasets: The 2020 edition reuses the training datasets for the 2017 and 2019 challenges. The test datasets have changed to include additional files.
- Baseline and topline: The baseline and topline reference systems will not change, and will be exactly those used in the 2017 and 2019 challenges.
- Evaluation metrics: The evaluation has undergone an overhaul, which fixes bugs, inconsistencies, and problems of speed. The bugs and inconsistencies in the 2017 Track 2 task evaluation tool had an impact on the scores. All of the Track 2 metrics should be expected to change somewhat, with the exception of npairs and nwords. See Track 1 for information on the 2017 Track 1 task evaluation, and see Updated 2017 Track 2 task evaluation below for information on the updated Track 2 task evaluation.
- Submission format: The submission format for the 2017 Track 1 task has changed. See the instructions for detailed information.
- Software: Software is provided for validating submissions and for running the evaluation on the development languages, for all three tasks. It is provided as a Python 3 (conda) package. See the instructions for detailed information. No baseline or topline systems are included in the package (the Docker image containing the baseline system for the 2019 challenge remains available from the ZeroSpeech 2019 site).
Timeline
This challenge has been accepted as an Interspeech 2020 special session to be held during the conference. Due to the time needed for us to run the human evaluations on the resynthesized waveforms, we will require that these waveforms be submitted two weeks before the Interspeech official abstract submission deadline.
| Date | Milestone |
|---|---|
| Feb 7, 2020 | Release of competition materials |
| March 2, 2020 | Challenge opens on Codalab |
| April 24, 2020 | Challenge submission deadline |
| May 01, 2020 | Leaderboard published on zerospeech.com |
| May 08, 2020 | Interspeech deadline |
| July 24, 2020 | Paper acceptance/rejection notification |
| Oct. 26-29, 2020 | Interspeech Conference |
Updated 2017 Track 2 task evaluation
All of our metrics assume a time-aligned transcription, where $T_{i,j}$ is the (phoneme) transcription corresponding to the speech fragment designated by the pair of indices $\langle i,j \rangle$ (i.e., the speech fragment between frames $i$ and $j$). If the left or right edge of the fragment contains part of a phoneme, that phoneme is included in the transcription if the fragment covers more than 30ms of it or, for phonemes shorter than 60ms, more than 50% of its duration.
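To make this inclusion rule concrete, here is a minimal Python sketch of one reading of the rule (the function name and time representation are ours, not part of the official evaluation package):

```python
def include_edge_phoneme(phone_start, phone_end, frag_start, frag_end):
    """Decide whether a phoneme partially covered by a fragment edge is
    included in the fragment's transcription (all times in seconds).

    The phoneme is included if the fragment covers more than 30 ms of it,
    or, for phonemes shorter than 60 ms, more than 50% of its duration.
    """
    overlap = min(phone_end, frag_end) - max(phone_start, frag_start)
    if overlap <= 0:
        return False
    duration = phone_end - phone_start
    if duration < 0.060:
        return overlap > 0.5 * duration
    return overlap > 0.030
```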
We first define the set related to the output of the discovery algorithm:
- $C_{disc}$ : the set of discovered clusters (a cluster being a set of fragments grouped together).
From this, we can derive:
- $F_{disc}$ : the set of discovered fragments, $F_{disc} = \{ f | f \in c , c \in C_{disc} \}$
- $P_{disc}$ : the set of non-overlapping discovered pairs (two fragments $a$ and $b$ overlap if they share more than half of their temporal extension), $P_{disc} = \{ \{a,b\} | a \in c, b \in c, \neg \textrm{overlap}(a,b), c \in C_{disc} \}$ (see the sketch after this list)
- $P_{disc^*}$ : the set of pairwise substring completions of $P_{disc}$ , which means that we compute all of the possible minimal path realignments of the two strings, and extract all of the substring pairs along the path (e.g., for fragment pair $\langle abcd, efg \rangle$ : $\langle abc, efg \rangle$ , $\langle ab, ef \rangle$ , $\langle bc, fg \rangle$ , $\langle bcd, efg \rangle$ , etc.)
- $B_{disc}$ : the set of discovered fragment boundaries (boundaries are defined in terms of $i$, the index of the nearest phoneme boundary in the transcription if it is less than 30ms away or, for phonemes shorter than 60ms, if more than 50% of its duration is covered by a fragment associated with the boundary, and -1 (wrong boundary) otherwise)
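To make the derived sets concrete, here is a minimal sketch in Python; the fragment representation and the reading of "more than half of their temporal extension" as more than half of the shorter fragment's duration are our assumptions, not the official implementation:

```python
from itertools import combinations

def overlap(a, b):
    """True if two fragments, given as (file, start, end), share more than
    half of the shorter fragment's duration (our interpretation)."""
    if a[0] != b[0]:
        return False
    shared = min(a[2], b[2]) - max(a[1], b[1])
    if shared <= 0:
        return False
    return shared > 0.5 * min(a[2] - a[1], b[2] - b[1])

def derive_sets(C_disc):
    """C_disc: iterable of clusters, each cluster an iterable of fragments."""
    F_disc = {f for c in C_disc for f in c}
    P_disc = {frozenset((a, b))            # unordered pair {a, b}
              for c in C_disc
              for a, b in combinations(c, 2)
              if not overlap(a, b)}
    return F_disc, P_disc

clusters = [{("utt1", 0.0, 0.5), ("utt2", 1.0, 1.4)},
            {("utt1", 2.0, 2.3), ("utt1", 2.1, 2.4)}]  # last two overlap
F_disc, P_disc = derive_sets(clusters)
print(len(F_disc), len(P_disc))  # 4 fragments, 1 non-overlapping pair
```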
Next, we define the gold sets:
- $F_{all}$ : the set of all possible fragments of size between 3 and 20 phonemes in the corpus.
- $P_{all}$ : the set of all possible non-overlapping matching fragment pairs, $P_{all}=\{ \{a,b \}\in F_{all} \times F_{all} | T_{a} = T_{b}, \neg \textrm{overlap}(a,b)\}$ .
- $F_{goldLex}$ : the set of fragments corresponding to the corpus transcribed at the word level (gold transcription).
- $P_{goldLex}$ : the set of matching fragment pairs from $F_{goldLex}$ .
- $B_{gold}$ : the set of boundaries in the parsed corpus.
Most of our measures are defined in terms of precision, recall and F-score. Precision is the probability that an element in a discovered set of entities belongs to the gold set, and recall the probability that a gold entity belongs to the discovered set. The F-score is the harmonic mean of precision and recall.
- $Precision_{disc,gold} = | disc \cap gold | / | disc |$
- $Recall_{disc,gold} = | disc \cap gold | / | gold |$
- $F\textrm{-}score_{disc,gold} = 2 / (1/Precision_{disc,gold} + 1/Recall_{disc,gold})$
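As an illustration, these three quantities reduce to simple set operations. The following is a minimal sketch (not the official evaluation code), where the set elements stand for fragments, pairs or boundaries, depending on the metric:

```python
def precision_recall_fscore(disc, gold):
    """Precision, recall and F-score of a discovered set against a gold set.

    Both arguments are Python sets of hashable elements (fragments, pairs
    or boundaries, depending on the metric being computed).
    """
    hits = len(disc & gold)
    precision = hits / len(disc) if disc else 0.0
    recall = hits / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, fscore

# Example: 2 of 3 discovered items are gold, 2 of 4 gold items are found
print(precision_recall_fscore({1, 2, 3}, {2, 3, 4, 5}))  # (0.666..., 0.5, 0.571...)
```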
Matching Quality
Many spoken term discovery systems incorporate a step whereby fragments of speech are realigned and compared. Matching quality measures the accuracy of this process; here, we use the NED and coverage metrics to evaluate it.
NED and coverage are quick to compute and give a qualitative estimate of the matching step. NED is the Normalised Edit Distance; it is equal to zero when a pair of fragments have exactly the same transcription, and 1 when they differ in all phonemes. Coverage is the fraction of the corpus covered by the discovered matching pairs.
Formally:
- $\textrm{NED} = \frac{1}{|P_{disc}|} \sum_{\{x,y\} \in P_{disc}} \textrm{ned}(x,y)$ , where $\textrm{ned}(\langle i,j \rangle, \langle k,l \rangle) = \textrm{Levenshtein}(T_{i,j}, T_{k,l}) / \max(|T_{i,j}|, |T_{k,l}|)$
- $\textrm{Coverage} = |\textrm{cov}(P_{disc})| / |\textrm{cov}(P_{all})|$ , where $\textrm{cov}(P)$ is the set of frames covered by the fragments occurring in the pairs of $P$
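For concreteness, here is a minimal sketch of how the normalised edit distance of a single pair could be computed, assuming the transcriptions are given as lists of phoneme symbols (an illustration, not the official tool):

```python
def levenshtein(a, b):
    """Edit distance between two phoneme sequences (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def ned(trans_a, trans_b):
    """Normalised edit distance: 0 for identical transcriptions, 1 when
    the two transcriptions differ in all phonemes."""
    return levenshtein(trans_a, trans_b) / max(len(trans_a), len(trans_b))

print(ned(list("abcd"), list("abce")))  # 0.25
```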
Clustering Quality
Clustering quality is evaluated using two metrics. The first (Grouping precision, recall and F-score) computes the intrinsic quality of the clusters in terms of their phonetic composition. This score is equivalent to the purity and inverse purity scores used for evaluating clustering. Like the Matching metric, it is computed over pairs, but contrary to the Matching metric, it focuses on the covered part of the corpus.
where:
Note: The original version of the evaluation software contained an incorrect implementation of the function $match(\cdot,\cdot)$ which did not count all matches; this is now fixed.
The second metric (Type precision, recall and F-score) takes the true lexicon as the gold cluster set and is therefore much more demanding. Indeed, a system could have very pure clusters, but could systematically missegment words. Since a discovered cluster could have several transcriptions, we use all of them (rather than using some kind of centroid).
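As a rough illustration only, assuming (for the sake of the example) that the Type scores reduce to an unweighted set comparison between the transcription types of the discovered clusters and the gold lexicon, which may differ from the official weighting:

```python
# Hypothetical transcription types (tuples of phonemes) of the discovered
# clusters; every transcription of a cluster is kept, no centroid is chosen.
discovered_types = {("dh", "ax"), ("k", "ae", "t"), ("k", "ae", "t", "s")}
# Hypothetical gold lexicon, also given as phoneme-level word types.
gold_lexicon = {("dh", "ax"), ("k", "ae", "t"), ("d", "ao", "g")}

hits = len(discovered_types & gold_lexicon)
type_precision = hits / len(discovered_types)   # 2/3
type_recall = hits / len(gold_lexicon)          # 2/3
type_fscore = 2 * type_precision * type_recall / (type_precision + type_recall)
print(type_precision, type_recall, type_fscore)
```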
Parsing Quality
Parsing quality is evaluated using two metrics. The first (Token precision, recall and F-score) evaluates how many of the word tokens were correctly segmented ( $X = F_{disc}$ , $Y = F_{goldLex}$ ). The second (Boundary precision, recall and F-score) evaluates how many of the gold word boundaries were found ( $X = B_{disc}$ , $Y = B_{gold}$ ). These two metrics are usually correlated, but researchers typically use the first. We provide the Boundary metrics for completeness, and also to enable system diagnostics.
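As an illustration with made-up data (the interval representation below is ours, not the official submission format; the official tool applies the transcription rules described above), both scores are again set comparisons:

```python
# Fragments and gold word tokens as (utterance_id, start, end) intervals,
# boundaries as (utterance_id, time) points.
F_disc = {("utt1", 0.00, 0.52), ("utt1", 0.52, 0.90), ("utt2", 0.10, 0.47)}
F_goldLex = {("utt1", 0.00, 0.52), ("utt1", 0.52, 1.10), ("utt2", 0.10, 0.47)}
B_disc = {("utt1", 0.00), ("utt1", 0.52), ("utt1", 0.90)}
B_gold = {("utt1", 0.00), ("utt1", 0.52), ("utt1", 1.10)}

token_hits = len(F_disc & F_goldLex)
print(token_hits / len(F_disc), token_hits / len(F_goldLex))      # Token precision, recall

boundary_hits = len(B_disc & B_gold)
print(boundary_hits / len(B_disc), boundary_hits / len(B_gold))   # Boundary precision, recall
```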
Challenge Organizing Committee
- Ewan Dunbar (Organizer): Researcher, Université de Paris / Cognitive Machine Learning, ewan.dunbar at univ-paris-diderot.fr
- Emmanuel Dupoux (Coordination): Researcher, EHESS / Cognitive Machine Learning / Facebook, emmanuel.dupoux at gmail.com
- Mathieu Bernard (Website & Submission): Engineer, INRIA, Paris, mathieu.a.bernard at inria.fr
- Julien Karadayi (Website & Submission): Engineer, ENS, Paris, julien.karadayi at gmail.com
Scientific committee
- Laurent Besacier
  - LIG, Univ. Grenoble Alpes, France
  - Automatic speech recognition, processing low-resourced languages, acoustic modeling, speech data collection, machine-assisted language documentation
  - email: laurent.besacier at imag.fr, https://cv.archives-ouvertes.fr/laurent-besacier
- Alan W. Black
  - CMU, USA
  - Speech synthesis, speech processing
  - email: awb at cs.cmu.edu, http://www.cs.cmu.edu/~awb
- Ewan Dunbar
  - Université de Paris / Cognitive Machine Learning
  - Speech perception & processing, computational phonology
  - email: ewan.dunbar at univ-paris-diderot.fr, http://www.linguist.univ-paris-diderot.fr/~edunbar
- Emmanuel Dupoux
  - Ecole des Hautes Etudes en Sciences Sociales / Cognitive Machine Learning / Facebook AI
  - Computational modeling of language acquisition, psycholinguistics, unsupervised learning of linguistic units
  - email: emmanuel.dupoux at gmail.com, http://www.lscp.net/persons/dupoux
- Lucas Ondel
  - University of Brno
  - Speech technology, unsupervised learning of linguistic units
  - email: iondel at fit.vutbr.cz
- Sakriani Sakti
  - Nara Institute of Science and Technology (NAIST)
  - Speech technology, low-resource languages, speech translation, spoken dialog systems
  - email: ssakti at is.naist.jp, http://isw3.naist.jp/~ssakti
Acknowledgments
The ZeroSpeech 2020 challenge is hosted on Codalab, an open-source web-based platform for machine learning competitions.
References
- Dunbar, E., Algayres, R., Karadayi, J., Bernard, M., Benjumea, J., Cao, X.-N., Miskic, L., Dugrain, C., Ondel, L., Black, A. W., Besacier, L., Sakti, S., Dupoux, E. (2019). The Zero Resource Speech Challenge 2019: TTS without T. In INTERSPEECH-2019.
- The references for the 2017 challenge are here.
- The references for the 2019 challenge are here.