Since 2015, several approaches have been taken to Task 1, and although performance is improving, there is still a lot to be done (see the Leaderboard for more detailed results).
More recently, Hallap et al. (2022) examined in detail whether systems learn context-dependent allophone representations or something more like context-independent phoneme representations. This evaluation is now available in the ABX-LS benchmark (see below for detailed results).
The results, shown in Figure 2, demonstrate that current systems perform much worse on ABX tests that do not control for phonological context (e.g., comparing the centre phone of the word cat /kæt/ with the centre phone of the word dog /dɔɡ/), indicated in orange in the graph, than on tests where the context is controlled (e.g., comparing the centre phone of cat versus cot /kɔt/), indicated in purple. The error rate increases roughly fourfold in some cases. This is a much greater penalty than is seen for within- versus across-speaker (triangle versus circle) or for the clean versus other subsets of LibriSpeech (solid versus dotted). This suggests that context-independence of the learned units is still relatively poor.
AS: Across Speaker
WS: Within Speaker
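
To make the difference between the two conditions concrete, here is a minimal toy sketch of an ABX error computation with and without context control. It is not the benchmark's implementation: the function name `abx_error`, the token layout, the 2-D vectors, and the use of plain Euclidean distance on one vector per phone are all assumptions made for illustration. The toy vectors are built so that the surrounding context shifts a representation more than the phone identity does, which is the situation the results above point to.

```python
import math
from itertools import product

def euclidean(u, v):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def abx_error(tokens, same_context):
    """Toy ABX error rate (hypothetical helper, not the benchmark code).

    tokens: list of (phone, context, vector) triples.
    A and X share a phone category, B belongs to a different one.
    A triple counts as an error when X is closer to B than to A
    (ties count as half an error).
    If same_context is True, only triples where A, B and X all share
    the same context are scored; otherwise context is left free.
    """
    errors, total = 0.0, 0
    indices = range(len(tokens))
    for i, j, k in product(indices, repeat=3):
        a, b, x = tokens[i], tokens[j], tokens[k]
        if i == k or a[0] != x[0] or b[0] == a[0]:
            continue  # need distinct A and X of one phone, B of another
        if same_context and not (a[1] == b[1] == x[1]):
            continue  # context-controlled condition
        d_ax = euclidean(a[2], x[2])
        d_bx = euclidean(b[2], x[2])
        if d_ax > d_bx:
            errors += 1.0
        elif d_ax == d_bx:
            errors += 0.5
        total += 1
    return errors / total if total else float("nan")

# Made-up data: "ae" and "ao" stand for the two vowels, "k_t" and "d_g"
# for the surrounding contexts. The context offset (2.0 on the second
# axis) is deliberately larger than the phone offset (1.0 on the first).
tokens = [
    ("ae", "k_t", (0.0, 0.0)), ("ae", "k_t", (0.1, 0.0)),
    ("ao", "k_t", (1.0, 0.0)), ("ao", "k_t", (1.1, 0.0)),
    ("ae", "d_g", (0.0, 2.0)), ("ae", "d_g", (0.1, 2.0)),
    ("ao", "d_g", (1.0, 2.0)), ("ao", "d_g", (1.1, 2.0)),
]

within = abx_error(tokens, same_context=True)
any_ctx = abx_error(tokens, same_context=False)
print(f"context controlled: {within:.3f}")
print(f"context free:       {any_ctx:.3f}")
```

On this toy data the context-controlled error is zero, while the uncontrolled error is not, because a token of the wrong phone in the same context can lie closer to X than a token of the right phone in a different context. Real evaluations score sequences of frame-level features rather than single vectors, so this sketch only illustrates the logic of the comparison.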