The columns can be sorted by clicking the |sortable| icon in each column header. A detailed view of the results is available by clicking the icon in each row.
The columns are interpreted as follows (see Evaluation metrics for details):
Phonetic (across and within)
- ABX error rate on embeddings
- Scale is $[0, 1]$, lower is better
Lexical and Syntactic
- Mean correct / incorrect classification accuracy
- Scale is $[0, 1]$, higher is better
- For Lexical the all column is the mean accuracy over five frequency bins (based on raw frequency counts in LibriSpeech-960: OOV; 1-5; 6-20; 21-100; 101+), and the in vocab. column leaves out the OOV category. Only the all column was published in the Interspeech summary paper.
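As a rough illustration of how the two Lexical columns relate, the sketch below averages per-bin accuracies with and without the OOV bin. It assumes each frequency bin counts equally in the mean. The function names and the accuracy values are hypothetical and are not leaderboard results.

```python
from statistics import mean

# Hypothetical per-bin accuracies keyed by the frequency bins listed above.
# These numbers are made up for illustration only.
bin_accuracy = {
    "OOV": 0.60,
    "1-5": 0.70,
    "6-20": 0.80,
    "21-100": 0.85,
    "101+": 0.90,
}


def lexical_all(acc):
    """Mean accuracy over all five frequency bins (assumed unweighted)."""
    return mean(acc.values())


def lexical_in_vocab(acc):
    """Same mean, leaving out the OOV bin."""
    return mean(value for name, value in acc.items() if name != "OOV")


if __name__ == "__main__":
    print(f"all:       {lexical_all(bin_accuracy):.4f}")
    print(f"in vocab.: {lexical_in_vocab(bin_accuracy):.4f}")
```

With these made-up values the in vocab. score is higher than the all score, because the OOV bin is the weakest one in the example.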
Semantic
- Human judgement correlation coefficient ($\times 100$)
- Scale is $[-100, 100]$, far from 0 is better
- Mean score across all datasets
- Semantic (Weighted): Same as Semantic with mean score weighted by the number of pairs in each dataset. Only the unweighted (Semantic) columns were published in the Interspeech summary paper.
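The difference between the Semantic and Semantic (Weighted) columns can be sketched as follows. This is a minimal illustration, assuming each dataset contributes one correlation score (already multiplied by 100) and a pair count. The function name and the input values are hypothetical.

```python
def semantic_scores(datasets):
    """Return (unweighted_mean, weighted_mean) over per-dataset scores.

    `datasets` is a list of (score, n_pairs) tuples, where `score` is the
    correlation coefficient times 100 and `n_pairs` is the number of word
    pairs in that dataset. The layout is an assumption for this sketch.
    """
    if not datasets:
        raise ValueError("at least one dataset is required")
    total_pairs = sum(n_pairs for _, n_pairs in datasets)
    if total_pairs <= 0:
        raise ValueError("total number of pairs must be positive")

    # Semantic: every dataset counts equally.
    unweighted = sum(score for score, _ in datasets) / len(datasets)
    # Semantic (Weighted): each dataset counts in proportion to its pairs.
    weighted = sum(score * n_pairs for score, n_pairs in datasets) / total_pairs
    return unweighted, weighted


if __name__ == "__main__":
    # Made-up scores and pair counts, for illustration only.
    example = [(10.0, 100), (20.0, 300)]
    unweighted, weighted = semantic_scores(example)
    print(f"Semantic:            {unweighted:.2f}")
    print(f"Semantic (Weighted): {weighted:.2f}")
```

In this example the larger dataset pulls the weighted mean toward its own score, while the unweighted mean treats both datasets equally.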
|Phonetic (Within)||Phonetic (Across)||Lexical||Syntactic||Semantic||Semantic (Weighted)|