COD Dataset 🐟

The released dataset comprises manually generated, localised and cross-lingually aligned task-oriented dialogue (TOD) data in Arabic, Indonesian, Russian and Swahili, as well as the corresponding English data from the Schema-Guided Dialogue (SGD) dataset, which served as the source of dialogue prompts. For details of our prompt-based, language-specific dialogue generation method, please see our paper.

Baseline code will be released shortly.

Languages

ISO 639-1	Name	Family	Area¹	Script
ar	Arabic	Afro-Asiatic	Northern Africa/Western Asia	Arabic
id	Indonesian	Austronesian	Southeastern Asia	Latin
ru	Russian	Indo-European	Eastern Europe	Cyrillic
sw	Swahili	Niger-Congo	Eastern Africa	Latin

¹ According to the United Nations geoscheme.

Files

README.md
ar_dev.json
ar_test.json
en_dev.json
en_test.json
id_dev.json
id_test.json
ru_dev.json
ru_test.json
sw_dev.json
sw_test.json
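
The file names listed above follow a `<language code>_<split>.json` pattern, with `dev` and `test` splits for each of the five languages. As a minimal sketch, one way to load a split in Python might look like the following. The helper names `split_path` and `load_split` and the `root` parameter are hypothetical, introduced only for illustration. The internal JSON schema is not described in this README, so the sketch simply returns whatever `json.load` parses and makes no assumption about its structure.

```python
import json
from pathlib import Path

# Language codes and splits taken from the file names in this repository.
LANGUAGES = ("ar", "en", "id", "ru", "sw")
SPLITS = ("dev", "test")


def split_path(lang, split, root="."):
    """Build the path to one released file, e.g. <root>/ar_dev.json.

    `root` is an assumption: the directory holding a local copy of the files.
    """
    if lang not in LANGUAGES:
        raise ValueError(f"unknown language code: {lang!r}")
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    return Path(root) / f"{lang}_{split}.json"


def load_split(lang, split, root="."):
    """Parse one split and return the decoded JSON object unchanged."""
    # UTF-8 is specified explicitly because the data includes Arabic and
    # Cyrillic text, and the platform default encoding may differ.
    with split_path(lang, split, root).open(encoding="utf-8") as f:
        return json.load(f)
```

For instance, `load_split("sw", "dev", root="COD")` would read `COD/sw_dev.json`, assuming the repository has been copied into a local directory named `COD`. Inspecting the type and top-level keys of the returned object is a reasonable first step before writing any schema-specific code.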

Repository: cambridgeltl/COD
