Generic CLDF dataset derived from the DoReCo core corpus

CLDF Metadata: Generic-metadata.json

property	value
dc:bibliographicCitation	Seifart, Frank, Ludger Paschen & Matthew Stave (eds.). 2022. Language Documentation Reference Corpus (DoReCo) 1.2. Berlin & Lyon: Leibniz-Zentrum Allgemeine Sprachwissenschaft & laboratoire Dynamique Du Langage (UMR5596, CNRS & Université Lyon 2). DOI:10.34847/nkl.7cbfq779
dc:conformsTo	CLDF Generic
dc:identifier	https://doreco.huma-num.fr/
dc:license	CC-BY
dcat:accessURL	https://github.com/cldf-datasets/doreco/
prov:wasDerivedFrom	Anal DoReCo dataset; Yali (Apahapsili) DoReCo dataset; Arapaho DoReCo dataset; Baïnounk Gubëeher DoReCo dataset; Beja DoReCo dataset; Bora DoReCo dataset; Cashinahua DoReCo dataset; Dolgan DoReCo dataset; Evenki DoReCo dataset; Goemai DoReCo dataset; Gorwaa DoReCo dataset; Jahai DoReCo dataset; Jejuan DoReCo dataset; Kakabe DoReCo dataset; Kamas DoReCo dataset; Tabaq (Karko) DoReCo dataset; Komnzo DoReCo dataset; Lower Sorbian DoReCo dataset; Movima DoReCo dataset; Dalabon DoReCo dataset; Nǁng DoReCo dataset; Northern Kurdish (Kurmanji) DoReCo dataset; Fanbyak DoReCo dataset; Pnar DoReCo dataset; Daakie DoReCo dataset; Resígaro DoReCo dataset; Ruuli DoReCo dataset; Sadu DoReCo dataset; Sanzhi Dargwa DoReCo dataset; Savosavo DoReCo dataset; Nafsan (South Efate) DoReCo dataset; English (Southern England) DoReCo dataset; French (Swiss) DoReCo dataset; Sümi DoReCo dataset; Svan DoReCo dataset; Tabasaran DoReCo dataset; Teop DoReCo dataset; Texistepec Popoluca DoReCo dataset; Mojeño Trinitario DoReCo dataset; Asimjeeg Datooga DoReCo dataset; Vera'a DoReCo dataset; Yongning Na DoReCo dataset; cldf-datasets/doreco/ v1.2-4-g270e584; Glottolog v4.8
prov:wasGeneratedBy	python: 3.10.12; python-packages: requirements.txt
rdf:ID	doreco
rdf:type	http://www.w3.org/ns/dcat#Distribution

Table media.csv

This table lists audio files which have been transcribed/annotated for DoReCo corpora. Note that only downloadable files are listed.

property	value
dc:conformsTo	CLDF MediaTable
dc:extent	583

Columns

Name/Property	Datatype	Description
ID	`string` Regex: `[a-zA-Z0-9_\-]+`	Primary key
Name	`string`
Media_Type	`string` Regex: `[^/]+/.+`
Download_URL	`anyURI`
`rec_date`	`string` Regex: `[0-9]{4}(-[0-9]{2})?(-[0-9]{2})?`	Date of recording. See also rec_date_assignment_certain.
`rec_date_assignment_certain`	`string` Valid choices: `certain` `approximate`
`genre`	`string` Regex: `traditional narrative|personal narrative|…`
`genre_stim`	`string`
`gloss`	`string`	Information on whether/how the audio has been annotated with glosses.
`transl`	list of `string` (separated by `/`)	Information on meta languages for which the annotations contain translations.
`sound_quality`	`string` Valid choices: `good` `medium` `bad` `middle`
`background_noise`	`string` Valid choices: `punctual` `none` `constant` `medium`
Corpus_ID	`string`	References contributions.csv::ID
Glottocode	`string`	References languages.csv::ID
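
The pattern constraints on `rec_date` and `Media_Type` can be checked with the Python standard library. The following is a minimal sketch, not part of the dataset's own tooling: the function name `check_media_row` is hypothetical, and treating an empty `rec_date` as acceptable is an assumption.

```python
import re

# Patterns copied from the media.csv column specifications.
REC_DATE = re.compile(r"[0-9]{4}(-[0-9]{2})?(-[0-9]{2})?")
MEDIA_TYPE = re.compile(r"[^/]+/.+")


def check_media_row(row):
    """Return the names of columns whose values violate their pattern."""
    problems = []
    # Assumption: an empty rec_date means "not recorded" and is not an error.
    if row.get("rec_date") and not REC_DATE.fullmatch(row["rec_date"]):
        problems.append("rec_date")
    # The whole value must match, so fullmatch is used rather than search.
    if not MEDIA_TYPE.fullmatch(row.get("Media_Type", "")):
        problems.append("Media_Type")
    return problems
```

A row read with `csv.DictReader` can be passed in directly; an empty list means both values conform.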

Table examples.csv

Some corpora in DoReCo contain data annotated with glosses. Such glossed data is extracted into a CLDF ExampleTable.

property	value
dc:conformsTo	CLDF ExampleTable
dc:extent	94672

Columns

Name/Property	Datatype	Description
ID	`string` Regex: `[a-zA-Z0-9_\-]+`	Primary key
Language_ID	`string`	References languages.csv::ID
Primary_Text	`string`	The example text in the source language.
Analyzed_Word	list of `string` (separated by `\t`)	The sequence of words of the primary text to be aligned with glosses
Gloss	list of `string` (separated by `\t`)	The sequence of glosses aligned with the words of the primary text
Translated_Text	`string`	The translation of the example text in a meta language
Meta_Language_ID	`string`	The language of the translated text. References languages.csv::ID
LGR_Conformance	`string` Valid choices: `WORD_ALIGNED` `MORPHEME_ALIGNED`	The level of conformance of the example with the Leipzig Glossing Rules
Comment	`string`
File_ID	`string`	Link to the audio file to which start and end markers pertain. References media.csv::ID
`start`	`decimal`	Start of the word in the linked sound file in (floating point) seconds.
`end`	`decimal`	End of the word in the linked sound file in (floating point) seconds.
`duration`	`decimal`	Duration of the word in the linked sound file in (floating point) seconds.
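
The alignment between `Analyzed_Word` and `Gloss` can be illustrated with a short sketch. It assumes the list separator is a tab character, which is the CLDF default for these two ExampleTable columns; the function name `align_example` and the sample values are made up for illustration and are not taken from the corpus.

```python
def align_example(analyzed_word, gloss, sep="\t"):
    """Pair each word of an example with the gloss in the same position."""
    words = analyzed_word.split(sep)
    glosses = gloss.split(sep)
    # Aligned examples must have exactly one gloss per word.
    if len(words) != len(glosses):
        raise ValueError(f"{len(words)} words but {len(glosses)} glosses")
    return list(zip(words, glosses))


# Hypothetical raw cell values as they would appear in examples.csv.
pairs = align_example("ni-ka\tsoma", "1SG-PST\tread")
```

For rows with `LGR_Conformance` set to `MORPHEME_ALIGNED`, each word/gloss pair could be split further on the morpheme boundary markers.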

Table languages.csv

property	value
dc:conformsTo	CLDF LanguageTable
dc:extent	42

Columns

Name/Property	Datatype	Description
ID	`string` Regex: `[a-zA-Z0-9_\-]+`	Primary key
Name	`string`
Macroarea	`string`
Latitude	`decimal` ≥ -90 ≤ 90
Longitude	`decimal` ≥ -180 ≤ 180
Glottocode	`string` Regex: `[a-z0-9]{4}[1-9][0-9]{3}`
ISO639P3code	`string` Regex: `[a-z]{3}`
Source	`string`	References sources.bib::BibTeX-key
`Family`	`string`

Table contributions.csv

Each DoReCo language corpus is listed as a separate contribution in this table.

property	value
dc:conformsTo	CLDF ContributionTable
dc:extent	42

Columns

Name/Property	Datatype	Description
ID	`string` Regex: `[a-zA-Z0-9_\-]+`	Primary key. References languages.csv::ID
Name	`string`
Description	`string`
Contributor	`string`
Citation	`string`
`Archive`	`string`
`Archive_link`	`string`
`AnnotationLicense`	`string` Regex: `CC BY|CC BY-NC-SA|…`
`AudioLicense`	`string`
`DOI`	`string`

Table parameters.csv

The ParameterTable lists IPA phones which appear in the DoReCo corpus (if a correspondence to the X-SAMPA representation could be determined). If possible, IPA phones are linked to CLTS' BIPA representation, giving access to the CLTS feature system.

property	value
dc:conformsTo	CLDF ParameterTable
dc:extent	335

Columns

Name/Property	Datatype	Description
ID	`string` Regex: `[a-zA-Z0-9_\-]+`	Primary key
Name	`string`	The IPA representation of the sound.
Description	`string`
ColumnSpec	`json`
CLTS_ID	`string`	CLTS ID of the sound, i.e. the underscore-separated ordered list of features of the sound.
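
Because `CLTS_ID` is an underscore-separated ordered list of features, the features can be recovered by splitting on the underscore. A minimal sketch; the function name `clts_features` is hypothetical and the sample ID in the usage note is an assumed illustration, not a value checked against the table.

```python
def clts_features(clts_id):
    """Split a CLTS_ID into its ordered list of features.

    Rows without a CLTS link have an empty CLTS_ID and yield an empty list.
    """
    return clts_id.split("_") if clts_id else []
```

For example, `clts_features("voiced_bilabial_stop_consonant")` yields four features in their original order.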

Table speakers.csv

property	value
dc:extent	289

Columns

Name/Property	Datatype	Description
ID	`string`	Primary key
Language_ID	`string`	References languages.csv::ID
`age`	`integer`	Speaker age. See also age_assignment_certain.
`age_assignment_certain`	`string` Valid choices: `certain` `approximate`
`sex`	`string` Valid choices: `m` `f`	Speaker sex.

Table phones.csv

This table lists individual, time-aligned phones.

property	value
dc:extent	1863702

Columns

Name/Property	Datatype	Description
ph_ID	`string`	Primary key
ph	`string`	See the description of the Token_Type column.
IPA	`string`	Link to the corresponding IPA phoneme, with details given in the ParameterTable. References parameters.csv::ID
`u_ID`	`string`	Utterance ID. Utterances are words/phones delimited by silent pauses.
`Token_Type`	`string` Valid choices: `label` `pause` `xsampa`	Not all rows in this table correspond to actual phones. If a row does, its Token_Type is `xsampa` and the `ph` column holds the X-SAMPA representation of the phone; otherwise the row is a `pause` or a `label`. A label consists of two opening brackets, the label proper, a closing bracket, the optional content, and another closing bracket, e.g. `<<ui>word>`. A label may also appear on its own if the content is not known, e.g. `<<ui>>`. Valid proper labels are: `fp` (filled pause), `fs` (false start), `pr` (prolongation), `fm` (foreign material), `sg` (singing), `bc` (backchannel), `id` (ideophone), `on` (onomatopoeic), `wip` (word-internal pause), `ui` (unidentifiable). Silent pauses are marked by a special symbol, `<p:>`. The location of silent pauses is manually checked by the DoReCo team, while the symbol itself is inserted by the WebMAUS service. Unlike labels, the `<p:>` symbol has only one of each bracket, and no other content may be included in it.
`start`	`decimal`	Start of the phone in the linked sound file in (floating point) seconds.
`end`	`decimal`	End of the phone in the linked sound file in (floating point) seconds.
`duration`	`decimal`	Duration of the phone in the linked sound file in (floating point) seconds.
`wd_ID`	`string`	Link to corresponding word. References words.csv::wd_ID
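
The bracket conventions described for the `ph` column can be parsed with a regular expression. The sketch below is one possible reading of that description, not the DoReCo team's own parser: the function name `classify_token` is hypothetical, and restricting label names to lowercase letters is an assumption based on the listed labels.

```python
import re

# <<label>content>: two opening brackets, the label proper, a closing bracket,
# optional content, and another closing bracket.
LABEL = re.compile(r"<<(?P<label>[a-z]+)>(?P<content>[^<>]*)>")
PAUSE = "<p:>"


def classify_token(ph):
    """Return (token_type, label, content) for a value of the ph column."""
    if ph == PAUSE:
        return ("pause", None, None)
    match = LABEL.fullmatch(ph)
    if match:
        # Empty content (e.g. <<ui>>) is reported as None.
        return ("label", match.group("label"), match.group("content") or None)
    # Anything else is taken to be an X-SAMPA phone.
    return ("xsampa", None, None)
```

In practice the `Token_Type` column already states the type of each row, so a function like this is mainly useful for extracting the label and its content.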

Table words.csv

property	value
dc:extent	896664

Columns

Name/Property	Datatype	Description
wd_ID	`string`	Primary key
wd	`string`	The word form transcribed into orthography.
Language_ID	`string`	References languages.csv::ID
File_ID	`string`	Link to the audio file to which start and end markers pertain. References media.csv::ID
`Speaker_ID`	`string`	References speakers.csv::ID
`start`	`decimal`	Start of the word in the linked sound file in (floating point) seconds.
`end`	`decimal`	End of the word in the linked sound file in (floating point) seconds.
`duration`	`decimal`	Duration of the word in the linked sound file in (floating point) seconds.
`ref`	`string`
Example_ID	`string`	Words that appear in glossed utterances are linked to an Example. References examples.csv::ID
`mb`	list of `string` (separated by )
`ps`	list of `string` (separated by )
`gl`	list of `string` (separated by )
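
Since each row of phones.csv links to a word through `wd_ID`, phones can be grouped per word with the standard `csv` module. The sketch below runs on tiny made-up excerpts that contain only the columns it uses; the function name `phones_by_word` and all sample values are hypothetical.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal

# Made-up excerpts, not real corpus data.
WORDS_CSV = """wd_ID,wd,start,end
w1,hello,0.50,0.90
"""
PHONES_CSV = """ph_ID,ph,Token_Type,start,end,wd_ID
p1,h,xsampa,0.50,0.60,w1
p2,@,xsampa,0.60,0.70,w1
p3,<p:>,pause,0.90,1.20,
"""


def phones_by_word(words_file, phones_file):
    """Map each wd_ID to (word form, [(phone, duration), ...])."""
    words = {row["wd_ID"]: row["wd"] for row in csv.DictReader(words_file)}
    grouped = defaultdict(list)
    for row in csv.DictReader(phones_file):
        # Skip pauses and labels, and rows that are not linked to a known word.
        if row["Token_Type"] == "xsampa" and row["wd_ID"] in words:
            duration = Decimal(row["end"]) - Decimal(row["start"])
            grouped[row["wd_ID"]].append((row["ph"], duration))
    return {wd_id: (words[wd_id], phones) for wd_id, phones in grouped.items()}


result = phones_by_word(io.StringIO(WORDS_CSV), io.StringIO(PHONES_CSV))
```

`Decimal` is used instead of `float` to mirror the `decimal` datatype of the time columns. With the real files, open file handles can be passed in place of the `io.StringIO` objects; given the table sizes stated above, streaming phones.csv row by row as done here avoids holding it in memory.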

Table glosses.csv

Gloss abbreviations used for glosses in a corpus.

property	value
dc:extent	2053

Columns

Name/Property	Datatype	Description
ID	`string`	Primary key
Gloss	`string`
`LGR`	`boolean`	Flag, signaling whether a gloss abbreviation is a standard, Leipzig-Glossing-Rules abbreviation.
`Meaning`	`string`
Glottocode	`string`	References languages.csv::ID
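
The `LGR` flag makes it possible to list, per corpus, the gloss abbreviations that are not standard Leipzig Glossing Rules abbreviations. A minimal sketch on made-up rows; it assumes booleans are serialized as `true`/`false`, and the function name `nonstandard_glosses`, the sample glosses, and the sample Glottocode are hypothetical.

```python
import csv
import io

# Made-up excerpt, not real corpus data.
GLOSSES_CSV = """ID,Gloss,LGR,Meaning,Glottocode
g1,PST,true,past,abcd1234
g2,XYZ,false,made-up marker,abcd1234
"""


def nonstandard_glosses(glosses_file):
    """Map each Glottocode to its gloss abbreviations not flagged as LGR."""
    result = {}
    for row in csv.DictReader(glosses_file):
        # Assumption: the boolean column is written as "true" or "false".
        if row["LGR"].lower() != "true":
            result.setdefault(row["Glottocode"], []).append(row["Gloss"])
    return result


custom = nonstandard_glosses(io.StringIO(GLOSSES_CSV))
```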
