The Spoken Term Discovery task is still very challenging and has not received the same attention as Acoustic Unit Discovery. One major finding across the three ZRC editions that featured this task is a tradeoff between finding many words and ensuring that the discovered words are accurate. The quality of the set of words that are treated as matches/repetitions by the system, as measured by the normalized edit distance (NED), will necessarily be better if systems avoid committing to dubious word candidates in the first place; however, the more candidates are discarded, the smaller the fraction of the corpus that receives an analysis (lower coverage) and the fewer gold word boundaries are found (lowering boundary F-scores as well). The tradeoff between term quality and coverage is shown in Figure 3.
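As a rough illustration of the quality metric, NED can be sketched as the mean Levenshtein distance between the phoneme transcriptions of each discovered pair, normalized by the longer transcription's length. This is a simplified sketch (the official evaluation handles pair extraction from clusters and transcription alignment in more detail); the function names are ours, not from any ZRC toolkit.

```python
def edit_distance(a, b):
    # Classic single-row dynamic-programming Levenshtein distance
    # over two phoneme sequences (any sequences of hashable items).
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,                      # deletion
                dp[j - 1] + 1,                  # insertion
                prev + (a[i - 1] != b[j - 1]),  # substitution (free if equal)
            )
    return dp[n]

def ned(pairs):
    # Mean normalized edit distance over discovered match pairs.
    # Each pair is a (phoneme sequence, phoneme sequence) tuple;
    # 0.0 means every pair is an exact repetition, 1.0 means no overlap.
    return sum(
        edit_distance(a, b) / max(len(a), len(b)) for a, b in pairs
    ) / len(pairs)
```

A conservative system that only pairs near-identical fragments drives this score toward 0, at the cost of leaving most of the corpus unanalyzed.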
Figure 4 focuses on the segmentation task and displays the Token F-score for each of the submitted systems, compared to a topline unigram adaptor grammar segmentation system trained on the corresponding text (phonemized text with the blank spaces between words removed). All the high-coverage, segmentation-oriented models are on the left and all the low-NED, matching-first models on the right. The segmentation-oriented models are more likely to do well on this metric, which assesses how many of the true word tokens were correctly segmented. Included here are two new models, Kam22 and Alg22, which do not even attempt to build a lexicon of types.
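The Token F-score can be sketched as follows: a token is counted as correct only if both of its boundaries match the gold segmentation, and F-score is the harmonic mean of token precision and recall. This is a minimal sketch assuming boundaries are given as integer positions within a single utterance; the helper name is ours.

```python
def token_fscore(gold_boundaries, pred_boundaries):
    # A token is the span between two consecutive boundaries; a predicted
    # token counts as a hit only if both its start and end match gold.
    def spans(bounds):
        b = sorted(set(bounds))
        return {(b[i], b[i + 1]) for i in range(len(b) - 1)}

    gold, pred = spans(gold_boundaries), spans(pred_boundaries)
    hits = len(gold & pred)
    precision = hits / len(pred)
    recall = hits / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, with gold boundaries [0, 3, 5, 8] (tokens 0-3, 3-5, 5-8) and predicted boundaries [0, 3, 8], only the token 0-3 is exactly recovered, giving precision 1/2, recall 1/3, and F-score 0.4. This explains why matching-first systems, which segment only the fragments they are confident about, score poorly here despite their low NED.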