Generic CLDF dataset derived from the DoReCo core corpus

CLDF Metadata: Generic-metadata.json

Sources: sources.bib

property value
dc:bibliographicCitation Seifart, Frank, Ludger Paschen & Matthew Stave (eds.). 2022. Language Documentation Reference Corpus (DoReCo) 1.2. Berlin & Lyon: Leibniz-Zentrum Allgemeine Sprachwissenschaft & laboratoire Dynamique Du Langage (UMR5596, CNRS & Université Lyon 2). DOI:10.34847/nkl.7cbfq779
dc:conformsTo CLDF Generic
dc:license CC-BY
prov:wasDerivedFrom
  1. Anal DoReCo dataset
  2. Yali (Apahapsili) DoReCo dataset
  3. Arapaho DoReCo dataset
  4. Baïnounk Gubëeher DoReCo dataset
  5. Beja DoReCo dataset
  6. Bora DoReCo dataset
  7. Cashinahua DoReCo dataset
  8. Dolgan DoReCo dataset
  9. Evenki DoReCo dataset
  10. Goemai DoReCo dataset
  11. Gorwaa DoReCo dataset
  12. Jahai DoReCo dataset
  13. Jejuan DoReCo dataset
  14. Kakabe DoReCo dataset
  15. Kamas DoReCo dataset
  16. Tabaq (Karko) DoReCo dataset
  17. Komnzo DoReCo dataset
  18. Lower Sorbian DoReCo dataset
  19. Movima DoReCo dataset
  20. Dalabon DoReCo dataset
  21. Nǁng DoReCo dataset
  22. Northern Kurdish (Kurmanji) DoReCo dataset
  23. Fanbyak DoReCo dataset
  24. Pnar DoReCo dataset
  25. Daakie DoReCo dataset
  26. Resígaro DoReCo dataset
  27. Ruuli DoReCo dataset
  28. Sadu DoReCo dataset
  29. Sanzhi Dargwa DoReCo dataset
  30. Savosavo DoReCo dataset
  31. Nafsan (South Efate) DoReCo dataset
  32. English (Southern England) DoReCo dataset
  33. French (Swiss) DoReCo dataset
  34. Sümi DoReCo dataset
  35. Svan DoReCo dataset
  36. Tabasaran DoReCo dataset
  37. Teop DoReCo dataset
  38. Texistepec Popoluca DoReCo dataset
  39. Mojeño Trinitario DoReCo dataset
  40. Asimjeeg Datooga DoReCo dataset
  41. Vera'a DoReCo dataset
  42. Yongning Na DoReCo dataset
  43. cldf-datasets/doreco/ v1.2-4-g270e584
  44. Glottolog v4.8
prov:wasGeneratedBy
  1. python: 3.10.12
  2. python-packages: requirements.txt
rdf:ID doreco

Table media.csv

This table lists audio files which have been transcribed/annotated for DoReCo corpora. Note that only downloadable files are listed.

property value
dc:conformsTo CLDF MediaTable
dc:extent 583


Name/Property Datatype Description
ID string
Regex: [a-zA-Z0-9_\-]+
Primary key
Name string
Media_Type string
Regex: [^/]+/.+
Download_URL anyURI
rec_date string
Regex: [0-9]{4}(-[0-9]{2})?(-[0-9]{2})?
Date of recording. See also rec_date_assignment_certain.
rec_date_assignment_certain string
Valid choices:
certain approximate
genre string
Regex: `traditional narrative
personal narrative
genre_stim string
gloss string Information on whether/how the audio has been annotated with glosses.
transl list of string (separated by /) Information on meta languages for which the annotations contain translations.
sound_quality string
Valid choices:
good medium bad middle
background_noise string
Valid choices:
punctual none constant medium
Corpus_ID string References contributions.csv::ID
Glottocode string References languages.csv::ID
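
The patterns listed for ID, Media_Type and rec_date can be checked row by row. The following is a minimal sketch, not the dataset's own validation code: the function name `validate_media_row` and the sample values are hypothetical, and it assumes the patterns must match the whole value.

```python
import re

# Patterns copied from the column descriptions of media.csv.
ID_PATTERN = re.compile(r"[a-zA-Z0-9_\-]+")
MEDIA_TYPE_PATTERN = re.compile(r"[^/]+/.+")
REC_DATE_PATTERN = re.compile(r"[0-9]{4}(-[0-9]{2})?(-[0-9]{2})?")

CHECKS = {
    "ID": ID_PATTERN,
    "Media_Type": MEDIA_TYPE_PATTERN,
    "rec_date": REC_DATE_PATTERN,
}


def validate_media_row(row):
    """Return the names of columns whose values violate the documented patterns.

    Empty or missing values are skipped (assumption: they mean "no value").
    """
    problems = []
    for column, pattern in CHECKS.items():
        value = row.get(column, "")
        if value and not pattern.fullmatch(value):
            problems.append(column)
    return problems


# Hypothetical row: a year-month date is allowed by the rec_date pattern.
print(validate_media_row(
    {"ID": "rec_01", "Media_Type": "audio/x-wav", "rec_date": "2015-07"}
))
```

Because the day and month parts of rec_date are optional, values such as `2015`, `2015-07` and `2015-07-21` all pass this check.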

Table examples.csv

Some corpora in DoReCo contain data annotated with glosses. Such glossed data is extracted into a CLDF ExampleTable.

property value
dc:conformsTo CLDF ExampleTable
dc:extent 94672


Name/Property Datatype Description
ID string
Regex: [a-zA-Z0-9_\-]+
Primary key
Language_ID string References languages.csv::ID
Primary_Text string The example text in the source language.
Analyzed_Word list of string (separated by \t) The sequence of words of the primary text to be aligned with glosses
Gloss list of string (separated by \t) The sequence of glosses aligned with the words of the primary text
Translated_Text string The translation of the example text in a meta language
Meta_Language_ID string References the language of the translated text
References languages.csv::ID
LGR_Conformance string
Valid choices:
The level of conformance of the example with the Leipzig Glossing Rules
Comment string
File_ID string Link to the audio file to which start and end markers pertain.
References media.csv::ID
start decimal Start of the word in the linked sound file in (floating point) seconds.
end decimal End of the word in the linked sound file in (floating point) seconds.
duration decimal Duration of the word in the linked sound file in (floating point) seconds.
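
Since Analyzed_Word and Gloss are parallel lists, they can be paired position by position. The sketch below illustrates this under two assumptions: the raw CSV cell uses a tab as the list separator, and the function name `align_glosses` and the sample strings are hypothetical.

```python
def align_glosses(analyzed_word, gloss, sep="\t"):
    """Pair each word of Analyzed_Word with the gloss at the same position.

    `sep` is the list separator of the raw CSV cell (assumed to be a tab).
    Raises ValueError if the two lists differ in length.
    """
    words = analyzed_word.split(sep) if analyzed_word else []
    glosses = gloss.split(sep) if gloss else []
    if len(words) != len(glosses):
        raise ValueError(
            f"{len(words)} words cannot be aligned with {len(glosses)} glosses"
        )
    return list(zip(words, glosses))


# Hypothetical example row, not taken from the corpus.
for word, gloss in align_glosses("ni-na\tkula", "1SG-PRS\teat"):
    print(word, gloss)
```

Raising on a length mismatch, rather than silently truncating with `zip`, makes misaligned examples visible.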

Table languages.csv

property value
dc:conformsTo CLDF LanguageTable
dc:extent 42


Name/Property Datatype Description
ID string
Regex: [a-zA-Z0-9_\-]+
Primary key
Name string
Macroarea string
Latitude decimal
≥ -90
≤ 90
Longitude decimal
≥ -180
≤ 180
Glottocode string
Regex: [a-z0-9]{4}[1-9][0-9]{3}
ISO639P3code string
Regex: [a-z]{3}
Source string References sources.bib::BibTeX-key
Family string
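
The numeric bounds on Latitude and Longitude and the Glottocode pattern can be combined into one check. This is a sketch under stated assumptions: `check_language` is a hypothetical helper, and the Glottocode used in the usage line is a made-up value that only illustrates the pattern.

```python
import re

# Pattern copied from the Glottocode column description.
GLOTTOCODE_PATTERN = re.compile(r"[a-z0-9]{4}[1-9][0-9]{3}")


def check_language(glottocode, latitude, longitude):
    """Return True if all three values satisfy the documented constraints."""
    return (
        GLOTTOCODE_PATTERN.fullmatch(glottocode) is not None
        and -90 <= latitude <= 90
        and -180 <= longitude <= 180
    )


# "abcd1234" is a made-up code used only to exercise the pattern.
print(check_language("abcd1234", 10.0, 20.0))
```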

Table contributions.csv

Each DoReCo language corpus is listed as a separate contribution in this table.

property value
dc:conformsTo CLDF ContributionTable
dc:extent 42


Name/Property Datatype Description
ID string
Regex: [a-zA-Z0-9_\-]+
Primary key
References languages.csv::ID
Name string
Description string
Contributor string
Citation string
Archive string
Archive_link string
AnnotationLicense string
Regex: `CC BY
AudioLicense string
DOI string

Table parameters.csv

The ParameterTable lists IPA phones which appear in the DoReCo corpus (if a correspondence to the X-SAMPA representation could be determined). If possible, IPA phones are linked to CLTS' BIPA representation, giving access to the CLTS feature system.

property value
dc:conformsTo CLDF ParameterTable
dc:extent 335


Name/Property Datatype Description
ID string
Regex: [a-zA-Z0-9_\-]+
Primary key
Name string The IPA representation of the sound.
Description string
ColumnSpec json
CLTS_ID string CLTS ID of the sound, i.e. the underscore-separated ordered list of features of the sound.
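
Because a CLTS_ID is an underscore-separated, ordered list of features, the features can be recovered with a plain split. A minimal sketch follows; `clts_features` is a hypothetical helper and the ID in the usage line is illustrative rather than taken from parameters.csv.

```python
def clts_features(clts_id):
    """Split a CLTS_ID into its ordered list of feature names.

    An empty value (no CLTS correspondence) yields an empty list.
    """
    return clts_id.split("_") if clts_id else []


# Illustrative ID, assumed to follow the underscore-separated format.
print(clts_features("voiced_bilabial_stop_consonant"))
```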

Table speakers.csv

property value
dc:extent 289


Name/Property Datatype Description
ID string Primary key
Language_ID string References languages.csv::ID
age integer Speaker age. See also age_assignment_certain.
age_assignment_certain string
Valid choices:
certain approximate
sex string
Valid choices:
m f
Speaker sex.

Table phones.csv

This table lists individual, time-aligned phones.

property value
dc:extent 1863702


Name/Property Datatype Description
ph_ID string Primary key
ph string See the description of the Token_Type column.
IPA string Link to corresponding IPA phoneme, with details given in ParameterTable
References parameters.csv::ID
u_ID string Utterance ID. Utterances are words/phones delimited by silent pauses.
Token_Type string
Valid choices:
label pause xsampa
Not all rows in this table correspond to actual phones. If a row does, the Token_Type is 'xsampa' and the ph column holds the X-SAMPA representation of the phone; otherwise it is a 'pause' or a 'label'.
Labels consist of two opening brackets, the label proper, a closing bracket, the content (optional), and another closing bracket, e.g. <<ui>word>. Labels may also appear on their own if the content is not known, e.g. <<ui>>. Valid proper labels are:
- fp: Filled pause
- fs: False start
- pr: Prolongation
- fm: Foreign material
- sg: Singing
- bc: Backchannel
- id: Ideophone
- on: Onomatopoeic
- wip: Word-internal pause
- ui: Unidentifiable
Silent pauses are marked by a special symbol, <p:>. The location of silent pauses is manually checked by the DoReCo team, while the symbol itself is inserted by the WebMAUS service. Unlike labels, the <p:> symbol has only one of each bracket, and no other content may be included in it.
start decimal Start of the phone in the linked sound file in (floating point) seconds.
end decimal End of the phone in the linked sound file in (floating point) seconds.
duration decimal Duration of the phone in the linked sound file in (floating point) seconds.
wd_ID string Link to corresponding word.
References words.csv::wd_ID
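
The label and pause syntax described for the Token_Type column can be recognised with a small parser. The sketch below is a heuristic illustration, not the DoReCo pipeline: in the dataset the Token_Type column already carries this classification, and the names `parse_ph`, `LABEL_PATTERN` and `SILENT_PAUSE` are hypothetical.

```python
import re

# Proper labels as listed in the Token_Type description.
LABELS = {
    "fp": "Filled pause",
    "fs": "False start",
    "pr": "Prolongation",
    "fm": "Foreign material",
    "sg": "Singing",
    "bc": "Backchannel",
    "id": "Ideophone",
    "on": "Onomatopoeic",
    "wip": "Word-internal pause",
    "ui": "Unidentifiable",
}

# Two opening brackets, the label proper, a closing bracket,
# optional content, and another closing bracket.
LABEL_PATTERN = re.compile(r"<<(?P<label>[a-z]+)>(?P<content>[^<>]*)>")
SILENT_PAUSE = "<p:>"


def parse_ph(ph):
    """Return (token_type, label, content) for a value of the ph column."""
    if ph == SILENT_PAUSE:
        return ("pause", None, None)
    match = LABEL_PATTERN.fullmatch(ph)
    if match and match.group("label") in LABELS:
        # Content is optional; an empty content is reported as None.
        return ("label", match.group("label"), match.group("content") or None)
    # Anything else is treated as an X-SAMPA phone (assumption of this sketch).
    return ("xsampa", None, ph)


for value in ["<p:>", "<<ui>word>", "<<ui>>", "a"]:
    print(value, parse_ph(value))
```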

Table words.csv

property value
dc:extent 896664


Name/Property Datatype Description
wd_ID string Primary key
wd string The word form transcribed into orthography.
Language_ID string References languages.csv::ID
File_ID string Link to the audio file to which start and end markers pertain.
References media.csv::ID
Speaker_ID string References speakers.csv::ID
start decimal Start of the word in the linked sound file in (floating point) seconds.
end decimal End of the word in the linked sound file in (floating point) seconds.
duration decimal Duration of the word in the linked sound file in (floating point) seconds.
ref string
Example_ID string Words that appear in glossed utterances are linked to an Example.
References examples.csv::ID
mb list of string (separated by )
ps list of string (separated by )
gl list of string (separated by )
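
Phones are linked to words through wd_ID, so the phone sequence of each word can be rebuilt by grouping. The following sketch uses only the standard library; the inline CSV snippets are invented sample rows with a reduced set of columns, and `phones_by_word` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows; the real tables have more columns and rows.
PHONES_CSV = """ph_ID,ph,Token_Type,start,end,wd_ID
p1,h,xsampa,0.10,0.15,w1
p2,a,xsampa,0.15,0.30,w1
p3,<p:>,pause,0.30,0.80,w2
"""

WORDS_CSV = """wd_ID,wd,start,end
w1,ha,0.10,0.30
w2,<p:>,0.30,0.80
"""


def phones_by_word(phones_file, words_file):
    """Map each wd_ID to (word form, list of X-SAMPA phones in file order).

    Rows whose Token_Type is not 'xsampa' (pauses, labels) are skipped.
    """
    words = {row["wd_ID"]: row["wd"] for row in csv.DictReader(words_file)}
    grouped = defaultdict(list)
    for row in csv.DictReader(phones_file):
        if row["Token_Type"] == "xsampa":
            grouped[row["wd_ID"]].append(row["ph"])
    return {wd_id: (words[wd_id], phones) for wd_id, phones in grouped.items()}


result = phones_by_word(io.StringIO(PHONES_CSV), io.StringIO(WORDS_CSV))
print(result)
```

For the full tables (about 1.9 million phone rows), the same join is probably better done with a dataframe library or a SQLite database, but the key relationship is the same.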

Table glosses.csv

Gloss abbreviations used for glosses in a corpus.

property value
dc:extent 2053


Name/Property Datatype Description
ID string Primary key
Gloss string
LGR boolean Flag, signaling whether a gloss abbreviation is a standard, Leipzig-Glossing-Rules abbreviation.
Meaning string
Glottocode string References languages.csv::ID