Task and intended goal
This challenge targets the unsupervised discovery of linguistic
units from raw speech in an unknown language, focusing on two levels
of linguistic structure: subword units and word units.
Psycholinguistic evidence shows that infants complete the learning of
subword units and start to construct a recognition lexicon within the
first year of life without any access to orthographic or phonetic
labels, although they may have multimodal input and proprioceptive
feedback (babbling), which is not modeled in this challenge. Here, we
set up the rather extreme situation where linguistic units have to be
learned from audio only. These two levels have already been
investigated in previous work (see [1-7] and [8-11], respectively),
but the performance of the different systems has not yet been compared
using common evaluation metrics and datasets. In the first track, we
use a psychophysically inspired evaluation task (minimal pair ABX
discrimination), and, in the second, metrics inspired by the ones used
in NLP word segmentation applications (segmentation and token
F-scores).
- Track 1: unsupervised subword modeling. The aim in this task is
to construct a representation of speech sounds which is robust to
within- and between-talker variation and supports word
identification. The metric we will use is the ABX discriminability
between phonemic minimal pairs (see [12,13]). The ABX
discriminability between the minimal pair “beg” and “bag” is defined
as the probability that A and X are closer than B and X, where A and
X are tokens of “beg”, and B a token of “bag” (or vice versa),
distance being defined as the DTW divergence of the representations
of the tokens. Our global ABX discriminability score aggregates over
the entire set of minimal pairs like “beg”-“bag” to be found in the
corpus. We analyze separately the effects of within- and
between-talker variation (a toy implementation is sketched after
this list).
- Track 2: spoken term discovery. The aim in this task is the
unsupervised discovery of “words” defined as recurring speech
fragments. The systems should take raw speech as input and output a
list of speech fragments (timestamps referring to the original audio
file) together with a discrete label for category membership. The
evaluation will use the suite of F-score metrics described in [14],
which enables detailed assessment of the different components of a
spoken term discovery pipeline (matching, clustering, segmentation,
parsing) and so will support a direct comparison with NLP models of
unsupervised word segmentation (a toy token F-score is sketched
below).
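To make the Track 1 metric concrete, here is a minimal sketch of the
ABX computation for a single minimal pair, written in Python with
numpy. The details are illustrative assumptions, not the exact
choices of the official evaluation software: the cosine frame
distance, the normalization of the DTW cost by n + m, and the scoring
of only one triplet direction are all simplifications.

```python
import numpy as np

def dtw_distance(a, b):
    """DTW divergence between two (n_frames, n_dims) feature arrays,
    using a cosine frame distance and normalizing the accumulated
    cost by n + m (an illustrative choice)."""
    def frame_dist(x, y):
        denom = np.linalg.norm(x) * np.linalg.norm(y) + 1e-12
        return 1.0 - np.dot(x, y) / denom
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(a[i - 1], b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return cost[n, m] / (n + m)

def abx_discriminability(beg_tokens, bag_tokens):
    """Fraction of (A, B, X) triplets in which X, a 'beg' token, is
    closer to A (another 'beg' token) than to B (a 'bag' token).
    Only one direction is scored here; the full metric also swaps
    the roles of the two categories and averages."""
    correct, total = 0, 0
    for i, x in enumerate(beg_tokens):
        for j, a in enumerate(beg_tokens):
            if i == j:
                continue
            for b in bag_tokens:
                correct += dtw_distance(a, x) < dtw_distance(b, x)
                total += 1
    return correct / total
```

The global score would then average such pair-level scores over all
minimal pairs in the corpus, computed either within or across
talkers.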
You can find more details on these two tracks in the relevant tabs
(Track 1 and Track 2).
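To give a flavor of the Track 2 metrics, below is a toy token
F-score. The (file_id, start, end) fragment format and the
exact-boundary matching rule are simplifying assumptions; the
toolbox of [14] implements the full suite (matching, clustering,
segmentation, and parsing scores) with more permissive token
matching.

```python
def token_fscore(discovered, gold):
    """discovered, gold: sets of (file_id, start, end) word tokens.
    A discovered token counts as a hit only on an exact boundary
    match (a simplification of the real toolbox's matching)."""
    hits = len(discovered & gold)
    precision = hits / len(discovered) if discovered else 0.0
    recall = hits / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# One of the two discovered fragments matches a gold word token,
# so precision = recall = 0.5 and F = 0.5.
discovered = {("f1", 1.20, 1.55), ("f1", 2.00, 2.40)}
gold = {("f1", 1.20, 1.55), ("f1", 3.10, 3.50)}
print(token_fscore(discovered, gold))  # 0.5
```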
This challenge first appeared as a special session at Interspeech 2015
(September 6-10, 2015, Dresden). It uses only open access materials
and remains permanently open: participants can register, download the
materials, and try to beat the best systems at any time, without
restrictions. See below for registration.
Data and Sharing
To encourage teams from both ASR and non-ASR communities to
participate in these tracks, all of the resources for this challenge
(software for evaluation and baselines, datasets) are free and open
source. We strongly encourage participants to make their systems
available in an open source fashion. This is not only good scientific
practice (it enables verification and replication); we also believe
it will encourage the growth of this domain by facilitating the
emergence of new teams and participants.
Data for the challenge is drawn from two languages: an English
dataset which is nevertheless treated as a zero resource language
(meaning that no pretraining on other English data is allowed), and a
low resource language, Xitsonga. The data is made available in three
sets:
- the sample set (2 speakers, 40 min each, English) is provided for
anyone to download (see Getting started) together with
the evaluation software.
- the English test dataset (casual conversations, 12 speakers, 16-30
min each, total 5h)
- the Xitsonga test dataset (read speech, 24 speakers, 2-29 minutes
each, total 2h30min).
To get these datasets, see Registration below. All datasets
have been prepared in the following way:
- the original recordings were segmented into short files that
contain only ‘clean speech’, i.e., no overlaps, pauses, or nonspeech
noises, and only the speech of a single speaker.
- the file names contain a talker ID. We kept this information
because infants arguably have access to it when they learn their
language, and because it is relatively easy to recover anyway. The
proposed systems can therefore openly use it (see the sketch after
this list).
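For illustration, a system might group the files by talker before
applying speaker-dependent normalization. The sketch below assumes a
hypothetical file naming convention (a leading underscore-delimited
talker field); check the downloaded data for the actual format.

```python
from collections import defaultdict
from pathlib import Path

def group_by_talker(wav_dir):
    """Group .wav files by talker ID. Parsing the leading
    underscore-delimited field of the file name (e.g.
    's01_utt003.wav' -> 's01') is a hypothetical convention,
    not the challenge's documented scheme."""
    by_talker = defaultdict(list)
    for wav in sorted(Path(wav_dir).glob("*.wav")):
        by_talker[wav.stem.split("_")[0]].append(wav)
    return by_talker
```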
Ground rules
This challenge is primarily driven by a scientific question: how
could an infant or a system learn language(s) in an unsupervised
fashion? We therefore expect that submissions will
emphasize novel and interesting ideas (as opposed to trying to get
the best result through all possible means). Since we provide the
evaluation software, there is the distinct possibility that it can be
used to optimize system parameters according to the particular corpus
at hand. Doing this would blur the comparison between competing ideas
and architectures, especially if this information is not disclosed. We
therefore kindly ask participants to disclose, whenever they publish
their work, whether and how they have used the evaluation software to
tune particular system parameters.
Similarly, competitors should disclose the type of information
they have used for training their systems. In order to compare
systems, we will distinguish those that use absolutely no training to
derive the speech features (bare signal processing systems), systems
that use unsupervised training on the provided datasets (unsupervised
systems), and systems that use supervised training on some other
languages or mixtures of languages (transfer systems). Training
features or models on another English dataset is prohibited, except
for baseline comparisons.
Registration
As said above, the challenge remains open and participants can
compete and try to beat the current best system at any time. The only
requirement is that the results be sent to the organizers so that we
can update the results page.
To register, send an email to zerospeech2015@gmail.com and follow the
instructions in this GitHub repository. If you encounter a problem,
please send us an email (zerospeech2015@gmail.com).
You can try out your systems without registering by downloading the
starter kit (see Getting started).
Organizers
Challenge Organization
- Xavier Anguera (Telefonica)
- Emmanuel Dupoux (Ecole des Hautes Etudes en Sciences Sociales, Paris)
- Aren Jansen (Johns Hopkins University, Baltimore)
- Maarten Versteegh (ENS, Paris)
Track 1
- Thomas Schatz (ENS, Paris)
- Roland Thiollière (EHESS, Paris)
Track 2
- Bogdan Ludusan (EHESS, Paris)
- Maarten Versteegh (ENS, Paris)
References
Subword units/embeddings
- [1] Badino, L., Canevari, C., Fadiga, L., & Metta, G. (2014). An
  auto-encoder based approach to unsupervised learning of subword
  units. In ICASSP.
- [2] Huijbregts, M., McLaren, M., & van Leeuwen, D. (2011).
  Unsupervised acoustic sub-word unit detection for query-by-example
  spoken term detection. In ICASSP (pp. 4436-4439).
- [3] Jansen, A., Thomas, S., & Hermansky, H. (2013). Weak top-down
  constraints for unsupervised acoustic model training. In ICASSP
  (pp. 8091-8095).
- [4] Lee, C., & Glass, J. (2012). A nonparametric Bayesian approach
  to acoustic model discovery. In Proceedings of the 50th Annual
  Meeting of the Association for Computational Linguistics: Long
  Papers, Volume 1 (pp. 40-49).
- [5] Varadarajan, B., Khudanpur, S., & Dupoux, E. (2008).
  Unsupervised learning of acoustic subword units. In Proceedings of
  ACL-08: HLT (pp. 165-168).
- [6] Synnaeve, G., Schatz, T., & Dupoux, E. (2014). Phonetics
  embedding learning with side information. In IEEE SLT.
- [7] Siu, M., Gish, H., Chan, A., Belfield, W., & Lowe, S. (2014).
  Unsupervised training of an HMM-based self-organizing unit
  recognizer with applications to topic classification and keyword
  discovery. Computer Speech & Language, 28(1), 210-223.
Spoken term discovery
- [8] Jansen, A., & Van Durme, B. (2011). Efficient spoken term
  discovery using randomized algorithms. In IEEE ASRU Workshop
  (pp. 401-406).
- [9] Muscariello, A., Gravier, G., & Bimbot, F. (2012). Unsupervised
  motif acquisition in speech via seeded discovery and template
  matching combination. IEEE Transactions on Audio, Speech, and
  Language Processing, 20(7), 2031-2044.
- [10] Park, A. S., & Glass, J. R. (2008). Unsupervised pattern
  discovery in speech. IEEE Transactions on Audio, Speech, and
  Language Processing, 16(1), 186-197.
- [11] Zhang, Y., & Glass, J. R. (2010). Towards multi-speaker
  unsupervised speech pattern discovery. In ICASSP (pp. 4366-4369).
Evaluation metrics
- [12] Schatz, T., Peddinti, V., Cao, X.-N., Bach, F., Hermansky, H.,
  & Dupoux, E. (2014). Evaluating speech features with the
  Minimal-Pair ABX task (II): Resistance to noise. In Interspeech.
- [13] Schatz, T., Peddinti, V., Bach, F., Jansen, A., Hermansky, H.,
  & Dupoux, E. (2013). Evaluating speech features with the
  Minimal-Pair ABX task: Analysis of the classical MFC/PLP pipeline.
  In Interspeech (pp. 1781-1785).
- [14] Ludusan, B., Versteegh, M., Jansen, A., Gravier, G., Cao,
  X.-N., Johnson, M., & Dupoux, E. (2014). Bridging the gap between
  speech technology and natural language processing: an evaluation
  toolbox for term discovery systems. In Proceedings of LREC.