
Zerospeech 2021

Instructions

Summary

Software

The Zero Resource Speech Challenge 2021 software is a Python 3 package that works on any recent Linux or macOS distribution. It provides two command-line tools:

zerospeech2021-validate is used to validate a submission to ensure that it is complete and in the correct format before submitting.
zerospeech2021-evaluate is used to run the evaluation on the development sets.
TBA: zerospeech2021-upload is used to upload a submission to our servers for evaluation.

See https://github.com/zerospeech/zerospeech2021 for installation and usage instructions.

Evaluation Dataset

The dataset is released under a Creative Commons 4.0 licence. Download it from zr2021/eval_dataset.
It is made up of four parts (phonetic, lexical, syntactic and semantic), each divided into dev and test subsets.
The wav files have randomized names of the form beKZpnGdzo.wav. The gold files (in either .csv or .item format) are provided for the dev sets only, so that the evaluation can be run on them.
Once uncompressed, the dataset is a directory with the following structure:

    README.md
    phonetic/
        test-clean/*.wav
        test-other/*.wav
        dev-clean/*.wav
        dev-other/*.wav
        dev-clean.item
        dev-other.item
    lexical/
        test/*.wav
        dev/*.wav
        dev/gold.csv
    syntactic/
        test/*.wav
        dev/*.wav
        dev/gold.csv
    semantic/
        test/librispeech/*.wav
        test/synthetic/*.wav
        dev/librispeech/*.wav
        dev/synthetic/*.wav
        dev/gold.csv
        dev/pairs.csv

Submission format

Warning: The submission will be invalidated if any extra file or directory is present, or if any required file or directory is missing.

The files should be organized in a ZIP archive with the following content:

   meta.yaml
   code/ (optional, see below)
   phonetic/
     {dev-clean,dev-other}/*.txt
     {test-clean,test-other}/*.txt
   lexical/
     dev.txt
     test.txt
   syntactic/
     dev.txt
     test.txt
   semantic/
     dev/{librispeech,synthetic}/*.txt
     test/{librispeech,synthetic}/*.txt
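
As an illustration of the layout above, the following is a minimal sketch of how a submission directory could be packed into a ZIP archive using only the Python standard library. The function name make_submission_zip and its arguments are hypothetical and are not part of the challenge software. It only places meta.yaml at the archive root and preserves relative paths; it does not check that the content is complete or valid.

```python
import zipfile
from pathlib import Path


def make_submission_zip(submission_dir, zip_path):
    """Pack a submission directory into a ZIP, keeping paths relative to its root.

    Hypothetical helper: it does NOT validate the submission content.
    """
    root = Path(submission_dir)
    zip_path = Path(zip_path)
    if not (root / "meta.yaml").is_file():
        raise FileNotFoundError("meta.yaml is missing from the submission root")

    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*")):
            # Skip directories, and the archive itself if it lives inside root.
            if not path.is_file() or path.resolve() == zip_path.resolve():
                continue
            # Archive names look like 'meta.yaml' or 'lexical/dev.txt'.
            archive.write(path, path.relative_to(root).as_posix())
    return zip_path
```

For example, make_submission_zip("/path/to/submission", "/path/to/submission.zip") would produce an archive whose top level contains meta.yaml and the four task folders, provided the directory contains nothing else.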

/meta.yaml

The meta.yaml file must contain the following entries (order does not matter):

   author: <str>
     authors of the submission
   affiliation: <str>
     affiliation of the authors (university or company)
   description: <str>
     description of the submitted system
   open_source: <bool>
     true or false, if true you must provide a 'code' folder with source code
     for the submitted system
   train_set: <str>
     description of the train set used (which subset of LibriSpeech or
     libri-light, along with VAD or not, ...)
   gpu_budget: <float>
     number of hours * GPU used for training
   parameters:
     phonetic:
       metric: <str>
         The metric to use for phonetic evaluation, must be 'euclidean',
         'cosine', 'kl' or 'kl_symmetric'. **WARNING** the 'cosine' metric
         here refers to an angular distance as in the usual ABX evaluation.
       frame_shift: <float>
         Shift (in s) between two features frames
     semantic:
       metric: <str>
         The metric to use for semantic evaluation. May be any metric
         supported by scipy.spatial.distance.cdist.
       pooling: <str>
         The pooling method to use for semantic evaluation, must be 'min',
         'max', 'mean', 'sum', 'last' or 'lastlast'.
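
The entries above can be assembled by hand in a text editor. As a sketch, the snippet below renders them from Python using only the standard library. The function render_meta_yaml is hypothetical and all values passed to it are placeholders. Strings are quoted with json.dumps, relying on the fact that JSON strings are valid YAML double-quoted scalars. Only the two value lists stated above (phonetic metric and pooling) are checked; the semantic metric is passed through unchecked.

```python
import json

# Allowed values as listed in the meta.yaml description above.
PHONETIC_METRICS = {"euclidean", "cosine", "kl", "kl_symmetric"}
POOLING_METHODS = {"min", "max", "mean", "sum", "last", "lastlast"}


def render_meta_yaml(*, author, affiliation, description, open_source,
                     train_set, gpu_budget, phonetic_metric, frame_shift,
                     semantic_metric, pooling):
    """Return the text of a meta.yaml file (hypothetical helper)."""
    if phonetic_metric not in PHONETIC_METRICS:
        raise ValueError(f"unsupported phonetic metric: {phonetic_metric!r}")
    if pooling not in POOLING_METHODS:
        raise ValueError(f"unsupported pooling method: {pooling!r}")

    quote = json.dumps  # JSON strings are valid YAML double-quoted scalars
    lines = [
        f"author: {quote(author)}",
        f"affiliation: {quote(affiliation)}",
        f"description: {quote(description)}",
        f"open_source: {'true' if open_source else 'false'}",
        f"train_set: {quote(train_set)}",
        f"gpu_budget: {float(gpu_budget)}",
        "parameters:",
        "  phonetic:",
        f"    metric: {quote(phonetic_metric)}",
        f"    frame_shift: {float(frame_shift)}",
        "  semantic:",
        f"    metric: {quote(semantic_metric)}",
        f"    pooling: {quote(pooling)}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(render_meta_yaml(
        author="Jane Doe", affiliation="Example University",
        description="example system", open_source=False,
        train_set="example training subset", gpu_budget=72,
        phonetic_metric="cosine", frame_shift=0.01,
        semantic_metric="cosine", pooling="max"), end="")
```

Whatever way the file is produced, zerospeech2021-validate remains the reference check for its content.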

/code

The code directory must be submitted only if the open_source flag is set to true in meta.yaml. It can contain a full working source tree or a README file with a permanent link to download your code (on github for instance).

You are strongly encouraged to submit your code. Participants who submit their code in this way will be awarded an OPEN SCIENCE badge that will appear on the challenge leaderboard.

/phonetic

The phonetic folder of the submission must contain the following subdirectories: dev-clean, dev-other, test-clean and test-other.

Each .wav file in the dataset must have its corresponding .txt file in the submission under the same directory structure. For example the dataset file /path/to/dataset/phonetic/dev-clean/1272-128104-0000.wav must have its submitted file /path/to/submission/phonetic/dev-clean/1272-128104-0000.txt.
Each .txt file encodes a single 2D numpy array of floats, each line encoding one features frame. For example:

     42.286527175400906 -107.68503050450957 59.79000088588511 -113.85831030071697
     0.7872647311548775 45.33505222077471 -8.468742865224545 0
     328.05422046327067 -4.495454384937348 241.186547397405 40.16161685378687

The number of columns (the features dimension) must be constant across the files. The number of lines depends on the speech sample duration.
The frame shift (the shift between two successive frames) must be given in meta.yaml along with the metric used for evaluation of those features.
Each array must contain at least 2 frames (i.e. each file must have at least 2 lines).
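
The constraints above (one frame per line, a constant number of columns, at least 2 frames) can be sketched as a small writer function. The name write_features is hypothetical and the sketch uses plain Python lists; if your features are already in a numpy array, numpy.savetxt produces the same kind of whitespace-separated text file.

```python
from pathlib import Path


def write_features(path, frames, min_frames=2):
    """Write a 2D feature matrix as text, one frame per line (hypothetical helper).

    `frames` is a sequence of equal-length sequences of numbers.
    """
    rows = [[float(value) for value in frame] for frame in frames]
    if len(rows) < min_frames:
        raise ValueError(f"expected at least {min_frames} frames, got {len(rows)}")
    dim = len(rows[0])
    if dim == 0 or any(len(row) != dim for row in rows):
        raise ValueError("all frames must share the same non-zero dimension")

    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as handle:
        for row in rows:
            # repr() keeps enough digits for the float to round-trip exactly.
            handle.write(" ".join(repr(value) for value in row) + "\n")
    return dim
```

The function returns the feature dimension so that a caller can verify it stays constant across all the files of a submission.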

/lexical and /syntactic

The /lexical and /syntactic folders of the submission must each contain the two files dev.txt and test.txt. Each *.wav file in the dataset must have a corresponding line in either dev.txt or test.txt, giving the file name without its extension followed by its pseudo-probability (order does not matter). For example, if the dev dataset contains:

   /path/to/dataset/lexical/dev
   ├── aAAfmkmQpVz.wav
   ├── AaaggUZsvkR.wav
   ├── aAakhKfuvQI.wav
   ├── aAaOswLeeBL.wav
   ├── AaasVuoMJnS.wav

The submitted file dev.txt must contain entries like:

   aAAfmkmQpVz -313.37445068359375
   AaaggUZsvkR -447.8950500488281
   aAakhKfuvQI -383.8902587890625
   aAaOswLeeBL -430.2048645019531
   AaasVuoMJnS -356.9426574707031
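
A file in this format could be produced as sketched below, assuming the scores are held in a dictionary mapping each file name to its pseudo-probability. Both function names are hypothetical. The second function lists the dataset files that have no entry yet, which is a quick way to catch an incomplete dev.txt or test.txt before running the validator.

```python
from pathlib import Path


def _stem(name):
    """Drop a trailing '.wav' so that entries match the format shown above."""
    return name[:-4] if name.endswith(".wav") else name


def write_scores(path, scores):
    """Write '<name> <score>' lines; `scores` maps file names to floats."""
    with Path(path).open("w") as handle:
        for name, score in scores.items():
            handle.write(f"{_stem(name)} {float(score)!r}\n")


def missing_entries(wav_dir, scores):
    """Return the sorted names of .wav files in `wav_dir` that have no score."""
    scored = {_stem(name) for name in scores}
    return sorted(p.stem for p in Path(wav_dir).glob("*.wav") if p.stem not in scored)
```

With this sketch, an empty list returned by missing_entries("/path/to/dataset/lexical/dev", scores) means every dev file has a score.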

/semantic

The semantic folder of the submission must contain the following subdirectories: dev/synthetic, dev/librispeech, test/synthetic and test/librispeech.

Each .wav file in the dataset must have its corresponding .txt file in the submission under the same directory structure. For example the dataset file /path/to/dataset/semantic/dev/synthetic/aAbcsWWKCz.wav must have its submitted file /path/to/submission/semantic/dev/synthetic/aAbcsWWKCz.txt.
Each .txt file encodes a single 2D numpy array of floats, each line encoding one features frame. For example:

     42.286527175400906 -107.68503050450957 59.79000088588511 -113.85831030071697
     0.7872647311548775 45.33505222077471 -8.468742865224545 0
     328.05422046327067 -4.495454384937348 241.186547397405 40.16161685378687

The number of columns (the features dimension) must be constant across the files. The number of lines depends on the speech sample duration.
The metric and pooling method used for evaluation must be specified in meta.yaml.
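
The one-to-one correspondence between dataset .wav files and submitted .txt files, required here and for the phonetic task, can be checked locally with a few lines of Python. The sketch below uses a hypothetical function compare_layout, meant to be called on a task folder such as semantic/ or phonetic/ in both trees. It is not suited to the lexical and syntactic folders, which contain a single dev.txt and test.txt, and it does not replace zerospeech2021-validate.

```python
from pathlib import Path


def compare_layout(dataset_task_dir, submission_task_dir):
    """Compare .wav files in the dataset with .txt files in the submission.

    Returns (missing, extra): relative .txt paths expected but absent from the
    submission, and .txt paths present without a matching .wav file.
    """
    dataset_root = Path(dataset_task_dir)
    submission_root = Path(submission_task_dir)

    expected = {
        wav.relative_to(dataset_root).with_suffix(".txt").as_posix()
        for wav in dataset_root.rglob("*.wav")
    }
    found = {
        txt.relative_to(submission_root).as_posix()
        for txt in submission_root.rglob("*.txt")
    }
    return sorted(expected - found), sorted(found - expected)
```

Both returned lists should be empty for a complete task folder, since any missing or extra file invalidates the submission.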

Validation

The zerospeech2021-validate program (provided by the challenge software described above) is executed automatically upon submission. It verifies that all required files exist and conform to the required format. If the check fails, your submission will be rejected by Codalab.

You are strongly advised to run the validation program on your own before making your submission. To apply the script that will be run when you make your submission, run:

$ zerospeech2021-validate <dataset> <submission> [--njobs <int>]

where <dataset> is the path to the challenge dataset and <submission> is the path to your submission (either a zip archive ready for submission or a directory containing all the required files). The --njobs parameter specifies the number of CPU cores to use for phonetic and semantic evaluation.

Here is an example output:

$ zerospeech2021-validate /path/to/dataset /path/to/submission -j8
    Prepare input...
     > dataset: /path/to/dataset
     > submission: /path/to/submission
    Validating root folder...
     > meta.yaml
     > root folder
     > code folder detected: submission will be manually inspected to ensure it is open source
    Validating phonetic...
     > phonetic/dev
     > phonetic/test
    Validating lexical...
     > lexical/dev
     > lexical/test
    Validating syntactic...
     > syntactic/dev
     > syntactic/test
    Validating semantic...
     > semantic/dev/synthetic
     > semantic/dev/librispeech
     > semantic/test/synthetic
     > semantic/test/librispeech
    Success!

Evaluation

Once your submission passes the validation, you can use the zerospeech2021-evaluate program to get the scores on the development datasets:

$ zerospeech2021-evaluate <dataset> <submission> -o <output_directory> [--njobs <int>]

where <dataset> and <submission> are as for validation, and <output_directory> is the folder in which the results are stored as .csv files. The parameters required to evaluate the phonetic and semantic tasks are read from <submission>/meta.yaml.

Note: The evaluation of the lexical and syntactic parts is cheap and runs on a single CPU core. The semantic evaluation runs on several CPU cores, as controlled by the --njobs parameter. The phonetic part is computed on GPU using pytorch (falling back to CPU if no GPU is available).

The evaluation process will write the following files:

/path/to/output_directory/
   ├── score_lexical_dev_by_frequency.csv
   ├── score_lexical_dev_by_length.csv
   ├── score_lexical_dev_by_pair.csv
   ├── score_phonetic.csv
   ├── score_semantic_dev_correlation.csv
   ├── score_semantic_dev_pairs.csv
   ├── score_syntactic_dev_by_pair.csv
   └── score_syntactic_dev_by_type.csv

Submission

Troubleshooting

If you are experiencing any issues related to the software, please open an issue on github: https://github.com/bootphon/zerospeech2021/issues.
For any other issue, please contact us at issue@zerospeech.com.