# ASR-NL-benchmark
## Description
ASR-NL-benchmark is a Python package to evaluate and compare the performance of speech-to-text for the Dutch language. Universities and Dutch media companies joined forces to develop this package, which makes it easier to compare the performance of various open-source or commercial speech-to-text solutions on Dutch broadcast media. The package wraps around the well-known sclite tool (part of [SCTK](https://github.com/usnistgov/SCTK)), which has been used for decades in the speech-to-text benchmark evaluations organised by NIST in the US. Furthermore, the package contains several preprocessing files and connectors to databases.

## How to use
### How to: Create a reference file
Reference files can be created using tooling such as:

- [ELAN](https://archive.mpi.nl/tla/elan/download)

<!-- A full annotation protocol can be found [here](https://github.com/opensource-spraakherkenning-nl/ASR-NL-benchmark/issues/7). -->

Please check the guidelines for the reference file in the section below.


### How to: Install
- Install [Docker](https://www.docker.com/products/docker-desktop/)
- Pull the Docker image: <code>docker pull asrnlbenchmark/asr-nl-benchmark</code>
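
To check that the image was pulled correctly, you can list it locally; this is just an optional sanity check using standard Docker commands:

```sh
# Optional: the pulled image should appear in the local image list.
docker images asrnlbenchmark/asr-nl-benchmark
```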

### How to: Run using the command line only


In order to run the benchmarking tool over a (set of) local hyp and ref file(s), we need Docker to mount the local directory where the input files are located. The output files of the benchmarking tool will appear in the same folder.

The following line runs the benchmarking tool over a local hyp and ref file. Use the absolute file path as the value for the variable `SOURCE`. For `HYPFILENAME`, use the filename of the hyp file, and for `REFFILENAME`, the name of the ref file.

`HYPFILENAME` and `REFFILENAME` can also be the names of the folders containing the *hypfiles* and *reffiles* respectively. **Make sure** to create a folder named `results` in the `SOURCE` folder before running the command below:

- <code> docker run -it --mount type=bind,source=SOURCE,target=/input asrnlbenchmark/asr-nl-benchmark:latest python ASR_NL_benchmark -hyp HYPFILENAME ctm -ref REFFILENAME stm </code>

The results (in .dtl, .prf, .spk, and .csv format) can be found inside the `results` folder in the local `SOURCE` location (see above).
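
As a concrete illustration of the command above, suppose the hyp and ref files live in a local folder such as `/home/user/benchmark_data` (a made-up path; the file names below are placeholders too):

```sh
# Illustrative sketch only: substitute your own absolute path and file names.
mkdir -p /home/user/benchmark_data/results   # the results folder must exist (see above)

docker run -it --mount type=bind,source=/home/user/benchmark_data,target=/input \
  asrnlbenchmark/asr-nl-benchmark:latest \
  python ASR_NL_benchmark -hyp hyp_file.ctm ctm -ref ref_file.stm stm
```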

### How to: Use the Interface

In order to open the User Interface, run a command similar to the one above, but now with the optional argument `-interactive` set to `True`:

- <code> docker run -it --mount type=bind,source=SOURCE,target=/input asrnlbenchmark/asr-nl-benchmark:latest python ASR_NL_benchmark -interactive True </code>

Use a web browser to access the UI by navigating to http://localhost:5000.

Within the "Select folder" tab, enter the paths to the hypothesis and reference files:

- Enter the path of the hyp file or the path to a folder containing a set of hyp files (e.g. "hyp_file.ctm" or "hyp_folder")
- Enter the path of the ref file or the path to a folder containing a set of ref files (e.g. "ref_file.stm" or "ref_folder")
- Click "Submit"

A progress bar will appear. As soon as the benchmarking is ready, you will be forwarded to the results. The results (in .dtl, .prf, .spk, and .csv format) can be found inside the `results` folder in the local `SOURCE` location (see above).

There is a visual bug where the page appears blank when you are forwarded to the results page after benchmarking is complete. To fix it, refresh the page.


### How to: Interpret the results
The final results are saved in .csv format inside the `results` folder in the local `SOURCE` location (see above). Those results are based upon the .dtl and .spk output files as generated by sclite.

#### The different output files
- .dtl files - Detailed overall report as returned by sclite
- .prf files - Detailed report including string alignments between hypothesis and reference as returned by sclite
- .spk files - Report with scoring for a speaker as returned by sclite
- .csv files - Overall results of the benchmarking as shown in the interface
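
For a quick look at the overall numbers without opening a spreadsheet, the .csv file can be inspected from the command line; the file name below is hypothetical, since the actual name depends on your input files:

```sh
# Hypothetical file name: check the results folder for the actual one.
column -s, -t < results/benchmark_results.csv | less -S
```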


## More about the pipeline
### Normalization
Manual transcripts (used as reference files) sometimes contain abbreviations (e.g. "'n" instead of "een"), symbols (e.g. "&" instead of "en") and numbers ("4" instead of "vier"); that is, they often contain the written form of the words rather than the words as spoken. Since we don't want to penalize the speech-to-text tooling or algorithm for such differences, we normalize both the reference and hypothesis files.

Normalization replacements:

- Symbols:
  - '%' => "procent"
  - '°' => "graden"
  - '&' => "en"
  - '€' => "euro"
- Double spaces:
  - '__' => '_'
- Numbers (e.g.):
  - 4 => "vier"
  - 4.5 => "vier punt vijf"
  - 4,3 => "vier komma drie"
- Combinations (e.g.):
  - 12,3% => "twaalf komma drie procent"
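
As a rough illustration of the symbol and double-space rules above (not the package's actual implementation, which also spells out numbers), a sed one-liner with the same substitutions behaves like this:

```sh
# Sketch only: the real normalization runs inside the package and also
# converts digits such as "4" to words such as "vier".
echo "Dat kost 12,3% van 4 €  per maand" \
  | sed -e 's/%/ procent/g' \
        -e 's/°/ graden/g' \
        -e 's/&/ en/g' \
        -e 's/€/ euro/g' \
        -e 's/  */ /g'
# -> "Dat kost 12,3 procent van 4 euro per maand" (digits left as-is here)
```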

### Variations
In order to deal with spelling variations, this tool applies a `variations.glm` file to the reference and hypothesis files. This .glm file contains a list of words with their spelling variations and can be found [here](https://github.com/opensource-spraakherkenning-nl/ASR_NL_benchmark/blob/main/ASR_NL_benchmark/variations.glm). Whereas the normalization step is typically rule-based, the variations are not. Therefore, we invite you to add your own adjustments to the .glm file and to create a pull request with the desired additions.
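
If you want to see which variations are already covered before proposing additions, one option is to print the file from the container; note that the path inside the image is an assumption based on the repository layout linked above:

```sh
# Assumption: the variations file sits at ASR_NL_benchmark/variations.glm
# inside the image, mirroring the repository layout.
docker run --rm asrnlbenchmark/asr-nl-benchmark:latest \
  cat ASR_NL_benchmark/variations.glm | head -n 20
```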


## Guidelines
In order for the benchmarking tool to match the reference and hypothesis files,
2. In case you are using subcategories (See Benchmarking subcategories).

### Benchmarking subcategories
[PLACEHOLDER]

Example:

Without subcategories:
- program_1.stm
- program_1.ctm
### Reference file
The reference file is used as the ground truth. To get the best results, the ref
In order to create those reference files, we suggest using a transcription tool like [transcriber](http://trans.sourceforge.net/en/usermanUS.php).

#### Segment Time Mark (STM)
The Segment Time Mark files, to be used as reference files, consist of a concatenation of time-marked text segment records. Those segments are separated by a new line and follow the format:

`File_id Channel Speaker_id Begin_Time End_Time <Label> Transcript`

To comment out a line, start the line with ';;'

##### Example STM
```
;; Some information you want to comment out like a description
;; More information you want to include and comment out
;; like the name of the transcriber, the version or explanation of labels
Your_favorite_tv_show_2021_S1_E1 Speaker_01_Female_Native A 0.000 1.527 <o, f1, female> The first line
Your_favorite_tv_show_2021_S1_E1 Speaker_01_Female_Native A 1.530 2.127 <o, f1, male> The second text segment
```


### Hypothesis file
To get the best results, the hypothesis file (i.e. the output of a speech recognition system) should be:
- utf-8 encoded

#### CTM Format
The Time Marked Conversation files, to be used as hypothesis files, consist of a concatenation of time-marked word records. Those records are separated by a new line and follow the format:

`File_id Channel Begin_time Duration Word Confidence`

To comment out a line, start the line with ';;'

##### Example CTM

```
;; Some information you want to comment out like a description
;; More information you want to include and comment out
Your_favorite_tv_show_2021_S1_E1 A 0.000 0.482 The 0.95
Your_favorite_tv_show_2021_S1_E1 A 0.496 0.281 first 0.98
Your_favorite_tv_show_2021_S1_E1 A 1.216 0.311 line 0.88
```
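
To sanity-check a local setup end to end, you can create a toy reference/hypothesis pair along the lines of the examples above and run the benchmark on it; the `/tmp/bench` path and file names are made up, and the docker command follows the command-line section earlier in this README:

```sh
# Minimal sketch: toy ref/hyp pair for a quick end-to-end test.
mkdir -p /tmp/bench/results

cat > /tmp/bench/ref_file.stm << 'EOF'
;; toy reference file
Your_favorite_tv_show_2021_S1_E1 Speaker_01_Female_Native A 0.000 1.527 <o, f1, female> The first line
EOF

cat > /tmp/bench/hyp_file.ctm << 'EOF'
;; toy hypothesis file
Your_favorite_tv_show_2021_S1_E1 A 0.000 0.482 The 0.95
Your_favorite_tv_show_2021_S1_E1 A 0.496 0.281 first 0.98
Your_favorite_tv_show_2021_S1_E1 A 1.216 0.311 line 0.88
EOF

docker run -it --mount type=bind,source=/tmp/bench,target=/input \
  asrnlbenchmark/asr-nl-benchmark:latest \
  python ASR_NL_benchmark -hyp hyp_file.ctm ctm -ref ref_file.stm stm
```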

## Related Documentation
- [sclite documentation](https://github.com/usnistgov/SCTK/blob/master/doc/sclite.htm)