Unsupervised subword modeling consists in providing speech features that highlight linguistically relevant properties of the speech signal (phoneme structure) and downplay linguistically irrelevant ones (speaker identity, emotion, channel, etc.). Several approaches have been used, depending on the kind of model and the type and amount of data used. Here is a non-exhaustive list:

- Pure signal processing, which may or may not be inspired by human physiology or psychophysics, derives speech features that are particularly relevant or robust for speech recognition (e.g., Mel-filterbank, MFCC, PLP or RASTA-PLP coefficients).
- Unsupervised clustering at the frame level using GMMs (Varadarajan et al., 2008; Huijbregts et al., 2011; Jansen et al., 2013).
- Unsupervised segmentation and clustering of frame sequences to recover discrete phone models, using GMM-HMMs with an architecture similar to typical supervised GMM-HMM systems (e.g., Varadarajan et al., 2008; Lee & Glass, 2012; Siu et al., 2014). The output of these systems can take many different formats, such as a transcription in discrete categories, lattices, or posteriorgrams.
- Unsupervised or weakly supervised learning of a frame-level embedding using DNNs (Badino et al., 2014; Synnaeve et al., 2014).
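As an illustration of the first family of approaches, here is a minimal numpy sketch of log Mel-filterbank features. The window, hop, and filter settings are illustrative defaults, not challenge requirements.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=40, n_fft=512, sr=16000):
    # Triangular filters with centers evenly spaced on the Mel scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def log_mel(signal, sr=16000, n_fft=512, hop=160, n_filters=40):
    # Frame the signal, window it, take the power spectrum, and
    # pool energies through the triangular Mel filters.
    frames = np.lib.stride_tricks.sliding_window_view(signal, n_fft)[::hop]
    power = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** 2
    return np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
```

Appending deltas and delta-deltas, or cepstral transforms (MFCC), are common refinements on top of this representation.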

Other models are possible, of course, and in the present challenge there is only one requirement: participants’ models should provide a frame-by-frame transcription of the test dataset in terms of their representation (a vector of continuous or discrete values). Note that a timestamp should be attributed to each frame and that irregular spacing of the frames in time is possible (see the format descriptions in the github repo).
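The exact file formats are specified in the github repo; purely as an illustration of attaching a timestamp to each frame, one could write one row per frame, with the frame's center time followed by its feature values. The 10 ms hop and 12.5 ms offset below are illustrative values, not challenge requirements.

```python
import numpy as np

def frame_times(n_frames, hop=0.010, offset=0.0125):
    # Center time of each analysis frame, assuming a fixed hop
    # (irregularly spaced frames would list their own timestamps).
    return offset + hop * np.arange(n_frames)

def save_features(path, feats, hop=0.010):
    # One row per frame: timestamp, then the feature vector.
    t = frame_times(len(feats), hop)
    np.savetxt(path, np.column_stack([t, feats]), fmt="%.6f")
```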

Typically, unsupervised subword models are evaluated using a decoding approach: train a classifier (linear, SVM, etc.) to decode the representation into phonemes and evaluate the decoding against a gold transcription. A major problem with such an approach is that representations that are easily separable on the basis of labeled examples can be totally indiscriminable in the absence of those labels. This means that defects in a representation that would be fatal if it were used as part of a zero-resource system can be unduly corrected by an evaluation metric based on supervised classifiers. Another problem is that the final score is a compound of the quality of the representation and the quality of the decoder. Since representations vary in number of dimensions, sparsity, and other statistical properties, it is unclear how a single decoder could be appropriate for all of the above models.

Here, we will use a minimal pair ABX task (Schatz et al 2013; 2014), which does not require any training at all, but only requires that each representation be provided with a frame-wise distance metric. The logic behind the ABX test is that it tests the quality of the representation, irrespective of its format. It can be applied equally well to discrete, probabilistic or continuous representations, and in particular enables one to compare one’s features to the baseline MFCC representations. Here, these tests will be run on tokens belonging to small files of controlled sizes (1s, 10s, 30s), and on both old and new speakers.
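As a sketch of what such a distance might look like, here is a simple DTW alignment cost built on a frame-level cosine distance. The normalization by path endpoints (`n + m`) is an illustrative choice, not necessarily the challenge's exact implementation.

```python
import numpy as np

def cosine_dist(u, v):
    # 1 - cosine similarity between two frame vectors.
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def dtw_divergence(X, Y, frame_dist=cosine_dist):
    # Dynamic time warping: minimum cumulative frame distance over all
    # monotonic alignments of the two frame sequences, length-normalized.
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = frame_dist(X[i - 1], Y[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)
```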

The ABX task is inspired by match-to-sample tasks used in human
psychophysics and is a simple way to measure discriminability between
two sound categories (where the sounds *A* and *B* belong to different
categories \(\mathbf{x}\) and \(\mathbf{y}\), respectively,
and the task is to decide whether the sound *X* belongs to one or the
other). Specifically, we define the *ABX-discriminability* of category
\(\mathbf{x}\) from category \(\mathbf{y}\) as the probability
that *A* and *X* are closer together than *B* and *X* according to some
distance *d* over the (model-dependent) space of featural
representations for these sounds when *A* and *X* are from category
\(\mathbf{x}\) and *B* is from category \(\mathbf{y}\). Given
a set of sounds \(S(\mathbf{x})\) from category \(\mathbf{x}\)
and a set of sounds \(S(\mathbf{y})\) from category
\(\mathbf{y}\), we estimate this probability using the following
formula:

\[\hat{\theta}(\mathbf{x}, \mathbf{y}) := \frac{1}{m(m-1)n}
\sum_{a\in S(\mathbf{x})} \sum_{b\in S(\mathbf{y})}
\sum_{x\in S(\mathbf{x}) \setminus \{a\}}
(\mathbb{1}_{d(a,x)<d(b,x)} + \frac{1}{2}\mathbb{1}_{d(a,x)=d(b,x)})\]

where \(m\) and \(n\) are the numbers of sounds in \(S(\mathbf{x})\) and \(S(\mathbf{y})\), respectively, and \(\mathbb{1}\) denotes the indicator function. The notion of ABX discriminability defined above is asymmetric in the two categories. We obtain a symmetric measure by taking the average of the ABX discriminability of \(\mathbf{x}\) from \(\mathbf{y}\) and of \(\mathbf{y}\) from \(\mathbf{x}\). Note that we do not require \(d\) to be a metric in the mathematical sense. The default distances provided in this challenge are based on DTW divergences, with the underlying frame-to-frame distance being either the cosine distance or the KL divergence. For most systems (signal processing, embeddings) the cosine distance usually gives good results, while for others (posteriorgrams) the KL distance is more appropriate. Contestants can experiment with their own distance functions if they wish (more details on the github page), as long as the distance was not obtained through supervised training.
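The estimator \(\hat{\theta}\) can be implemented directly; here is a minimal sketch that takes two lists of sound representations and any distance function `d` (a naive triple loop, not the challenge's optimized scoring code):

```python
def abx_discriminability(S_x, S_y, d):
    # Empirical ABX discriminability of category x from category y:
    # fraction of triples (a, b, x), with a and x drawn from S_x (a != x)
    # and b from S_y, such that d(a, x) < d(b, x); ties count for 1/2.
    m, n = len(S_x), len(S_y)
    total = 0.0
    for i, a in enumerate(S_x):
        for b in S_y:
            for j, x in enumerate(S_x):
                if j == i:
                    continue
                dax, dbx = d(a, x), d(b, x)
                total += 1.0 if dax < dbx else (0.5 if dax == dbx else 0.0)
    return total / (m * (m - 1) * n)

def symmetric_abx(S_x, S_y, d):
    # Symmetrize by averaging the two directional discriminabilities.
    return 0.5 * (abx_discriminability(S_x, S_y, d)
                  + abx_discriminability(S_y, S_x, d))
```

A score of 0.5 corresponds to chance (the representation does not separate the two categories at all), and 1.0 to perfect discriminability.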

As categories, we chose minimal pairs (e.g., “beg” vs. “bag”), because they represent the smallest difference in speech sounds that makes a semantic difference, and therefore the hardest problem that a speech recognizer may want to solve. Since there are typically not enough word minimal pairs in a small corpus for this kind of analysis, we use triphone minimal pairs, i.e., sequences of three phonemes that differ in the middle sound (e.g., “beg”–“bag”, “api”–“ati”, etc.). Our compound measure sums over all such minimal pairs found in the corpus in a structured manner (see below for more details).

In the within-talker version of the task, all of the triphones belong to the same speaker, e.g., \(A=\textrm{beg}_{T1}\), \(B=\textrm{bag}_{T1}\), \(X=\textrm{bag}'_{T1}\). The scores for a given minimal pair are first averaged across all of the speakers for which this minimal pair can be measured. The resulting scores are then averaged over all found contexts for a given pair of central phones (e.g., for the pair /a/-/e/, scores for contexts such as b_g, r_d, f_s, or any other context present in the database are averaged). Finally, the scores for every pair of central phones are averaged to yield the reported within-talker ABX discriminability.
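The hierarchical averaging just described can be sketched as follows; the score dictionary keys (speaker, context, central phone pair) are a hypothetical layout, not the challenge's actual file schema:

```python
from collections import defaultdict

def aggregate_within_talker(scores):
    # scores: {(speaker, context, (phone1, phone2)): abx_score}.
    # Step 1: average over speakers for each (context, phone pair).
    by_pair_context = defaultdict(list)
    for (spk, ctx, pair), s in scores.items():
        by_pair_context[(ctx, pair)].append(s)
    # Step 2: average over contexts for each central phone pair.
    by_pair = defaultdict(list)
    for (ctx, pair), vals in by_pair_context.items():
        by_pair[pair].append(sum(vals) / len(vals))
    # Step 3: average over central phone pairs.
    pair_scores = [sum(v) / len(v) for v in by_pair.values()]
    return sum(pair_scores) / len(pair_scores)
```

Averaging in this order prevents frequent speakers, contexts, or phone pairs from dominating the compound score.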

For the across-talker version, A and B belong to the same speaker, and X to a different one, e.g., \(A=\textrm{beg}_{T1}\), \(B=\textrm{bag}_{T1}\), \(X=\textrm{bag}_{T2}\). The scores for a given minimal pair are first averaged across all of the pairs of speakers for which this contrast can be made. As for the within-talker measure, the resulting scores are then averaged over all contexts for each possible pair of central phones, and finally over all pairs of central phones.

In addition to these two compound ABX scores, we also provide a CSV file with the detailed results listed for each minimal pair and each talker. This enables participants to evaluate how stable their scores are across linguistic contrasts and/or talkers.