ZeroSpeech 2020 is a consolidating challenge in which participants submit systems to the ZeroSpeech 2017 tasks (Track 1 or Track 2) or to the ZeroSpeech 2019 task. Participants are particularly encouraged to submit to multiple tracks/challenges (for example, unit discovery evaluated on both the 2017 and 2019 evaluations, or unit discovery used as a basis for spoken term discovery).

For general background information see the ZeroSpeech 2017 and
ZeroSpeech 2019 main pages. **Changes have been made to the
challenge for the 2020 edition.** Please attentively read
Instructions for detailed information.

**Datasets** The 2020 edition reuses the training datasets of the 2017 and 2019 challenges. The test datasets have **changed** to include additional files.

**Baseline and topline** The baseline and topline reference systems will not change, and will be exactly those used in the 2017 and 2019 challenges.

**Evaluation metrics** The evaluation has undergone an overhaul, which fixes bugs, inconsistencies, and problems of speed. The bugs and inconsistencies in the 2017 Track 2 task evaluation tool had an impact on the scores. All of the Track 2 metrics should be expected to change somewhat, with the exception of `npairs` and `nwords`. See Track 1 for information on the 2017 Track 1 task evaluation, and see Updated 2017 Track 2 task evaluation below for information on the **updated** Track 2 task evaluation.

**Submission format** The submission format for the 2017 Track 1 task has **changed**. See Instructions for detailed information.

**Software** Software is provided for validating submissions and for running the evaluation on the development languages, for all three tasks. It is provided as a Python 3 (conda) package. See Instructions for detailed information. No baseline or topline systems are included in the package (the Docker image containing the baseline system for the 2019 challenge remains available from the ZeroSpeech 2019 site).

This challenge has been accepted as an Interspeech 2020 special session
to be held during the conference. Due to the time needed for us to run the human
evaluations on the resynthesized waveforms, we will require that these waveforms
be submitted **two weeks** before the Interspeech official abstract submission
deadline.

| Date | Event |
|---|---|
| Feb 7, 2020 | Release of competition materials |
| March 2, 2020 | Challenge opens on Codalab |
| April 24, 2020 | Challenge submission deadline |
| May 01, 2020 | Leaderboard published on zerospeech.com |
| May 08, 2020 | Interspeech deadline |
| July 24, 2020 | Paper acceptance/rejection notification |
| Oct. 26-29, 2020 | Interspeech Conference |

All of our metrics assume a time aligned transcription, where
\(T_{i,j}\) is the (phoneme) transcription corresponding to the
speech fragment designated by the pair of indices \(\langle i,j
\rangle\) (i.e., the speech fragment between frame *i* and *j*). If the
left or right edge of the fragment contains part of a phoneme, that
phoneme is included in the transcription if it corresponds to more
than 30ms or, for phonemes shorter than 60ms, to more than 50% of
its duration.

Note

The original version of the evaluation software contained an incorrect implementation of the 30ms/50% rule which affected many of the scores.
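To make the rule concrete, here is a minimal sketch of how a fragment's transcription could be computed from a phone-level alignment. The alignment format and the function name `transcribe_fragment` are illustrative assumptions, not the official evaluation code.

```python
# Minimal sketch of the 30 ms / 50% inclusion rule described above.
# Assumes a phone-level alignment as (phoneme, onset, offset) tuples in seconds;
# `transcribe_fragment` is an illustrative name, not the official evaluation API.

def transcribe_fragment(alignment, start, end):
    """Return the phoneme transcription T_{i,j} of the fragment [start, end]."""
    transcription = []
    for phone, onset, offset in alignment:
        overlap = min(end, offset) - max(start, onset)
        if overlap <= 0:
            continue  # phoneme lies entirely outside the fragment
        duration = offset - onset
        # A partially covered phoneme is kept if more than 30 ms of it falls
        # inside the fragment, or (for phonemes shorter than 60 ms) if more
        # than half of its duration falls inside the fragment.
        if overlap > 0.03 or (duration < 0.06 and overlap > 0.5 * duration):
            transcription.append(phone)
    return tuple(transcription)
```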

We first define the set related to the output of the discovery algorithm:

- \(C_{disc}\): the set of discovered clusters (a cluster being a set of fragments grouped together).

From these, we can derive:

- \(F_{disc}\): the set of discovered fragments, \(F_{disc} = \{ f | f \in c , c \in C_{disc} \}\)
- \(P_{disc}\): the set of non overlapping discovered pairs (two fragments *a* and *b* *overlap* if they share more than half of their temporal extension), \(P_{disc} = \{ \{a,b\} | a \in c, b \in c, \neg \textrm{overlap}(a,b), c \in C_{disc} \}\)
- \(P_{disc^*}\): the set of pairwise substring completions of \(P_{disc}\), which means that we compute all of the possible minimal path realignments of the two strings, and extract all of the substring pairs along the path (e.g., for the fragment pair \(\langle abcd, efg \rangle\): \(\langle abc, efg \rangle\), \(\langle ab,ef \rangle\), \(\langle bc, fg \rangle\), \(\langle bcd, efg \rangle\), etc.)
- \(B_{disc}\): the set of discovered fragment boundaries (boundaries are defined in terms of *i*, the index of the nearest phoneme boundary in the transcription if it is less than 30ms away or, for phonemes shorter than 60ms, if more than 50% of its duration is covered by a fragment associated with the boundary, and -1 [wrong boundary] otherwise)

Note

The original version of the evaluation did not implement the 50% rule for the association of boundaries with phonemes, but did use the rule for finding the transcription associated with a fragment. The updated evaluation changes the boundary association rule to make the measures consistent.

Next, we define the gold sets:

- \(F_{all}\): the set of all possible fragments of size between 3 and 20 phonemes in the corpus.
- \(P_{all}\): the set of all possible non overlapping matching fragment pairs. \(P_{all}=\{ \{a,b \}\in F_{all} \times F_{all} | T_{a} = T_{b}, \neg \textrm{overlap}(a,b)\}\).
- \(F_{goldLex}\): the set of fragments corresponding to the corpus transcribed at the word level (gold transcription).
- \(P_{goldLex}\): the set of matching fragment pairs from \(F_{goldLex}\).
- \(B_{gold}\): the set of boundaries in the parsed corpus.

Most of our measures are defined in terms of *precision*, *recall* and
*F-score*. *Precision* is the probability that an element in a
discovered set of entities belongs to the gold set, and *recall* the
probability that a gold entity belongs to the discovered set. The
*F-score* is the harmonic mean of *precision* and *recall*.

- \(\textrm{Precision}_{disc,gold} = | disc \cap gold | / | disc |\)
- \(\textrm{Recall}_{disc,gold} = | disc \cap gold | / | gold |\)
- \(\textrm{F-score}_{disc,gold} = 2 / (1/\textrm{Precision}_{disc,gold} + 1/\textrm{Recall}_{disc,gold})\)
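These set-based definitions translate directly into code. The sketch below assumes the discovered and gold entities are hashable items collected into Python sets; this is a simplification for illustration, not the actual evaluation implementation.

```python
# Generic set precision / recall / F-score, following the definitions above.

def precision_recall_fscore(disc, gold):
    """Return (precision, recall, F-score) for a discovered set vs. a gold set."""
    hits = len(disc & gold)
    precision = hits / len(disc) if disc else 0.0
    recall = hits / len(gold) if gold else 0.0
    fscore = (2 * precision * recall / (precision + recall)
              if (precision + recall) > 0 else 0.0)
    return precision, recall, fscore
```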

Many spoken term discovery systems incorporate a step whereby
fragments of speech are realigned and compared. Matching quality
measures the accuracy of this process. Here, we use the *NED/coverage*
metrics for evaluating that.

*NED* and *coverage* are quick to compute and give a qualitative
estimate of the matching step. *NED* is the Normalised Edit Distance;
it is equal to zero when a pair of fragments have exactly the same
transcription, and 1 when they differ in all phonemes. *Coverage* is
the fraction of the corpus covered by matching pairs that has been
discovered.

\[\begin{split}\textrm{NED} &= \sum_{\langle x, y\rangle \in P_{disc}}
\frac{\textrm{ned}(x, y)}{|P_{disc}|} \\
\textrm{Coverage} &= \frac{|\textrm{cover}(P_{disc})|}{|\textrm{cover}(P_{all})|}\end{split}\]

where

\[\begin{split}\textrm{ned}(\langle i, j \rangle, \langle k, l \rangle) &=
\frac{\textrm{Levenshtein}(T_{i,j}, T_{k,l})}{\textrm{max}(j-i+1,l-k+1)} \\
\textrm{cover}(P) &= \bigcup_{\langle i, j \rangle \in \textrm{flat}(P)}[i, j] \\
\textrm{flat}(P) &= \{p|\exists q:\{p,q\}\in P\}\end{split}\]

Note

The original version of the evaluation software sometimes output coverage scores greater than 100%. This was due to the incorrect implementation of the 30ms/50% rule, which is now fixed.
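As an unofficial sketch, NED and coverage could be computed as follows, assuming fragments are (start, end) phoneme-index pairs, `transcribe` maps an index pair to its transcription \(T_{i,j}\) (e.g. the rule sketched earlier, with the alignment fixed), and `levenshtein` is any standard edit-distance implementation (not shown).

```python
# Unofficial sketch of NED and coverage over discovered pairs P_disc.
# Fragments are (start, end) index pairs; `transcribe(i, j)` returns T_{i,j};
# `levenshtein` is any standard edit-distance function.

def ned(pairs, transcribe, levenshtein):
    """Mean normalised edit distance over the pairs in P_disc."""
    if not pairs:
        return 0.0
    total = 0.0
    for (i, j), (k, l) in pairs:
        distance = levenshtein(transcribe(i, j), transcribe(k, l))
        total += distance / max(j - i + 1, l - k + 1)  # normalise by the longer fragment
    return total / len(pairs)

def coverage(disc_pairs, all_pairs):
    """|cover(P_disc)| / |cover(P_all)|, where cover() is a union of index intervals."""
    def cover(pairs):
        covered = set()
        for fragments in pairs:
            for i, j in fragments:
                covered.update(range(i, j + 1))
        return covered
    return len(cover(disc_pairs)) / len(cover(all_pairs))
```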

Clustering quality is evaluated using two metrics. The first metric (Grouping precision, recall and F-score) computes the intrinsic quality of the clusters in terms of their phonetic composition. This score is equivalent to the purity and inverse purity scores used for evaluating clustering. Like the Matching score, it is computed over pairs, but contrary to the Matching score, it focuses on the covered part of the corpus.

\[\begin{split}\textrm{Grouping precision} &= \sum_{t\in\textrm{types}(\textrm{flat}(P_{clus}))}
freq(t, P_{clus})
\frac{|\textrm{match}(t, P_{clus} \cap P_{goldclus})|}{|\textrm{match}(t, P_{clus})|} \\
\textrm{Grouping recall} &= \sum_{t\in\textrm{types}(\textrm{flat}(P_{goldclus}))}
freq(t, P_{goldclus})
\frac{|\textrm{match}(t, P_{clus} \cap P_{goldclus})|}{|\textrm{match}(t, P_{goldclus})|}\end{split}\]

where

\[\begin{split}P_{clus} &= \{\langle \langle i, j\rangle , \langle k, l \rangle\rangle
| &\exists c\in C_{disc},\langle i, j\rangle\in c \wedge \langle k, l\rangle\in c\} \\
P_{goldclus} &= \{\langle \langle i, j\rangle , \langle k, l \rangle\rangle
| &\exists c_1,c_2\in C_{disc}:\langle i, j\rangle\in c_1 \wedge \langle k, l\rangle\in c_2 \\
&& \wedge T_{i,j}=T_{k,l} \wedge [i,j] \cap [k,l] = \varnothing \}\end{split}\]

Note

The original version of the evaluation software contained an incorrect implementation of the function \(match(\cdot,\cdot)\), which did not count all matches; this is now fixed.

The second metric (Type precision, recall and F-score) takes as the gold cluster set the true lexicon and is therefore much more demanding. Indeed, a system could have very pure clusters, but could systematically missegment words. Since a discovered cluster could have several transcriptions, we use all of them (rather than using some kind of centroid).

\[\begin{split}\textrm{Type precision} &= \frac{|\textrm{types}(F_{disc}) \cap \textrm{types}(F_{goldLex})|}
{|\textrm{types}(F_{disc})|} \\
\textrm{Type recall} &= \frac{|\textrm{types}(F_{disc}) \cap \textrm{types}(F_{goldLex})|}
{|\textrm{types}(F_{goldLex})|} \\\end{split}\]

Note

The original version of the evaluation did not implement the 50% phoneme-boundary association rule for extracting the set of discovered types, which contradicted the documentation. The updated evaluation changes this.

Note

The original version of the evaluation counted the number of discovered types incorrectly. The updated evaluation changes this.
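Under the same simplifications, the Type scores reduce to the generic set precision/recall above, applied to the sets of distinct transcriptions (types); `transcribe` is again the illustrative helper assumed in the earlier sketches.

```python
# Unofficial sketch: Type precision / recall / F-score as a comparison of
# the sets of distinct transcriptions (types).

def type_scores(f_disc, f_gold_lex, transcribe):
    disc_types = {transcribe(*fragment) for fragment in f_disc}
    gold_types = {transcribe(*fragment) for fragment in f_gold_lex}
    return precision_recall_fscore(disc_types, gold_types)
```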

Parsing quality is evaluated using two metrics. The first one (Token precision, recall and F-score) evaluates how many of the word tokens were correctly segmented (\(X = F_{disc}\), \(Y = F_{goldLex}\)). The second one (Boundary precision, recall and F-score) evaluates how many of the gold word boundaries were found (\(X = B_{disc}\), \(Y = B_{gold}\)). These two metrics are correlated, but researchers typically report the first. We provide the Boundary metrics for completeness, and also to enable system diagnostics.

\[\begin{split}\textrm{Token precision} &= \frac{|F_{disc}\cap F_{goldLex}|}{|F_{disc}|} \\
\textrm{Token recall} &= \frac{|F_{disc}\cap F_{goldLex}|}{|F_{goldLex}|} \\
\textrm{Boundary precision} &= \frac{|B_{disc}\cap B_{gold}|}{|B_{disc}|} \\
\textrm{Boundary recall} &= \frac{|B_{disc}\cap B_{gold}|}{|B_{gold}|}\end{split}\]

Note

The original version of the evaluation did not implement the 50% phoneme-boundary association rule for extracting the set of discovered tokens, which contradicted the documentation. The updated evaluation changes this.
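The Token and Boundary scores follow the same pattern: plain set precision/recall reusing the generic helper above, assuming fragments and boundaries are represented as hashable tuples (e.g. (file, start, end) for fragments and (file, index) for boundaries). This is, again, only a sketch of the definitions, not the official implementation.

```python
# Unofficial sketch: Token and Boundary scores as plain set comparisons.

def token_scores(f_disc, f_gold_lex):
    return precision_recall_fscore(set(f_disc), set(f_gold_lex))

def boundary_scores(b_disc, b_gold):
    return precision_recall_fscore(set(b_disc), set(b_gold))
```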

**Ewan Dunbar (Organizer)** Researcher, Université de Paris / Cognitive Machine Learning, ewan.dunbar at univ-paris-diderot.fr

**Emmanuel Dupoux (Coordination)** Researcher, EHESS / Cognitive Machine Learning / Facebook, emmanuel.dupoux at gmail.com

**Mathieu Bernard (Website & Submission)** Engineer, INRIA, Paris, mathieu.a.bernard at inria.fr

**Julien Karadayi (Website & Submission)** Engineer, ENS, Paris, julien.karadayi at gmail.com

**Laurent Besacier**
- LIG, Univ. Grenoble Alpes, France
- Automatic speech recognition, processing low-resourced languages, acoustic modeling, speech data collection, machine-assisted language documentation
- email: laurent.besacier at imag.fr, https://cv.archives-ouvertes.fr/laurent-besacier

**Alan W. Black**
- CMU, USA
- Speech synthesis, speech processing
- email: awb at cs.cmu.edu, http://www.cs.cmu.edu/~awb

**Ewan Dunbar**
- Université de Paris / Cognitive Machine Learning
- Speech perception & processing, Computational Phonology
- email: ewan.dunbar at univ-paris-diderot.fr, http://www.linguist.univ-paris-diderot.fr/~edunbar

**Emmanuel Dupoux**
- Ecole des Hautes Etudes en Sciences Sociales / Cognitive Machine Learning / Facebook AI
- Computational modeling of language acquisition, psycholinguistics, unsupervised learning of linguistic units
- email: emmanuel.dupoux at gmail.com, http://www.lscp.net/persons/dupoux

**Lucas Ondel**
- University of Brno
- Speech technology, unsupervised learning of linguistic units
- email: iondel at fit.vutbr.cz

**Sakriani Sakti**
- Nara Institute of Science and Technology (NAIST)
- Speech technology, low-resource languages, speech translation, spoken dialog systems
- email: ssakti at is.naist.jp, http://isw3.naist.jp/~ssakti

The ZeroSpeech2020 challenge is hosted on Codalab, an open-source web-based platform for machine learning competitions.

- Dunbar, E., Algayres, R., Karadayi, J., Bernard, M., Benjumea, J., Cao, X.-N., Miskic, L., Dugrain, C., Ondel, L., Black, A. W., Besacier, L., Sakti, S., Dupoux, E. (2019). The Zero Resource Speech Challenge 2019: TTS without T. In INTERSPEECH-2019.
- The references for the 2017 challenge are here.
- The references for the 2019 challenge are here.