CLDF Metadata: Generic-metadata.json
Sources: sources.bib
Table media.csv
This table lists audio files which have been transcribed/annotated for DoReCo corpora. Note that only downloadable files are listed.
property | value |
---|---|
dc:conformsTo | CLDF MediaTable |
dc:extent | 583 |
Name/Property | Datatype | Description |
---|---|---|
ID | string Regex: `[a-zA-Z0-9_\-]+` | Primary key |
Name | string | |
Media_Type | string Regex: `[^/]+/.+` | |
Download_URL | anyURI | |
rec_date | string Regex: `[0-9]{4}(-[0-9]{2})?(-[0-9]{2})?` | Date of recording. See also rec_date_assignment_certain. |
rec_date_assignment_certain | string Valid choices: `certain`, `approximate` | |
genre | string Regex: `traditional narrative\|personal narrative\|…` | |
genre_stim | string | |
gloss | string | Information on whether/how the audio has been annotated with glosses. |
transl | list of string (separated by / ) | Information on meta languages for which the annotations contain translations. |
sound_quality | string Valid choices: `good`, `medium`, `bad`, `middle` | |
background_noise | string Valid choices: `punctual`, `none`, `constant`, `medium` | |
Corpus_ID | string | References contributions.csv::ID |
Glottocode | string | References languages.csv::ID |
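
The regex constraints listed for media.csv can be checked row by row. The following is a minimal sketch, not part of the dataset tooling: the function name `invalid_media_columns` is hypothetical, the sample values in the usage note are invented, and it assumes that each pattern must match the whole cell value and that empty cells count as missing.

```python
import re

# Patterns copied from the media.csv column descriptions above.
MEDIA_PATTERNS = {
    "ID": re.compile(r"[a-zA-Z0-9_\-]+"),
    "Media_Type": re.compile(r"[^/]+/.+"),
    "rec_date": re.compile(r"[0-9]{4}(-[0-9]{2})?(-[0-9]{2})?"),
}


def invalid_media_columns(row):
    """Return the names of columns whose non-empty value fails its pattern."""
    bad = []
    for column, pattern in MEDIA_PATTERNS.items():
        value = row.get(column, "")
        # Assumption: an empty cell is a missing value and is not validated.
        if value and pattern.fullmatch(value) is None:
            bad.append(column)
    return bad
```

For instance, a row with `rec_date` set to `2010-05` passes, while `2010-5` is reported because the month must have two digits.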
Table examples.csv
Some corpora in DoReCo contain data annotated with glosses. Such glossed data is extracted into a CLDF ExampleTable.
property | value |
---|---|
dc:conformsTo | CLDF ExampleTable |
dc:extent | 94672 |
Name/Property | Datatype | Description |
---|---|---|
ID | string Regex: `[a-zA-Z0-9_\-]+` | Primary key |
Language_ID | string | References languages.csv::ID |
Primary_Text | string | The example text in the source language. |
Analyzed_Word | list of string (separated by ) | The sequence of words of the primary text to be aligned with glosses. |
Gloss | list of string (separated by ) | The sequence of glosses aligned with the words of the primary text. |
Translated_Text | string | The translation of the example text in a meta language. |
Meta_Language_ID | string | The language of the translated text. References languages.csv::ID |
LGR_Conformance | string Valid choices: `WORD_ALIGNED`, `MORPHEME_ALIGNED` | The level of conformance of the example with the Leipzig Glossing Rules. |
Comment | string | |
File_ID | string | Link to the audio file to which start and end markers pertain. References media.csv::ID |
start | decimal | Start of the word in the linked sound file in (floating point) seconds. |
end | decimal | End of the word in the linked sound file in (floating point) seconds. |
duration | decimal | Duration of the word in the linked sound file in (floating point) seconds. |
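
Since Analyzed_Word and Gloss are parallel lists, a consumer will usually want to pair them up and notice when their lengths differ. The sketch below assumes both columns have already been split into Python lists; the helper name `align_glosses` and the sample tokens are hypothetical and not taken from the corpus.

```python
def align_glosses(analyzed_words, glosses):
    """Pair each analyzed word with its gloss; both arguments are lists of strings."""
    if len(analyzed_words) != len(glosses):
        raise ValueError(
            f"{len(analyzed_words)} words but {len(glosses)} glosses"
        )
    return list(zip(analyzed_words, glosses))


# Invented placeholder forms, used only to show the shape of the result.
pairs = align_glosses(["ka-ta", "mi"], ["PST-go", "1SG"])
```

Raising on a length mismatch rather than truncating silently is a design choice of this sketch; `zip` alone would drop the unmatched tail.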
Table languages.csv
property | value |
---|---|
dc:conformsTo | CLDF LanguageTable |
dc:extent | 42 |
Name/Property | Datatype | Description |
---|---|---|
ID | string Regex: `[a-zA-Z0-9_\-]+` | Primary key |
Name | string | |
Macroarea | string | |
Latitude | decimal ≥ -90 ≤ 90 | |
Longitude | decimal ≥ -180 ≤ 180 | |
Glottocode | string Regex: `[a-z0-9]{4}[1-9][0-9]{3}` | |
ISO639P3code | string Regex: `[a-z]{3}` | |
Source | string | References sources.bib::BibTeX-key |
Family | string | |
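
The Glottocode pattern and the coordinate bounds in languages.csv can be checked in a few lines. This is an illustrative sketch only: `check_language` is a hypothetical helper, the codes in the usage note are made up, and the pattern is assumed to apply to the whole value.

```python
import re
from decimal import Decimal

# Pattern and bounds copied from the languages.csv column descriptions above.
GLOTTOCODE = re.compile(r"[a-z0-9]{4}[1-9][0-9]{3}")
BOUNDS = (("Latitude", -90, 90), ("Longitude", -180, 180))


def check_language(row):
    """Return the names of columns that violate their declared constraint."""
    problems = []
    code = row.get("Glottocode")
    if code and GLOTTOCODE.fullmatch(code) is None:
        problems.append("Glottocode")
    for column, low, high in BOUNDS:
        value = row.get(column)
        # Empty cells are skipped; non-empty cells are parsed as decimals.
        if value and not (low <= Decimal(value) <= high):
            problems.append(column)
    return problems
```

A made-up code such as `abcd1234` has the right shape, whereas `abcd0234` is rejected because the fifth character must be a digit from 1 to 9.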
Table contributions.csv
Each DoReCo language corpus is listed as a separate contribution in this table.
property | value |
---|---|
dc:conformsTo | CLDF ContributionTable |
dc:extent | 42 |
Name/Property | Datatype | Description |
---|---|---|
ID | string Regex: `[a-zA-Z0-9_\-]+` | Primary key. References languages.csv::ID |
Name | string | |
Description | string | |
Contributor | string | |
Citation | string | |
Archive | string | |
Archive_link | string | |
AnnotationLicense | string Regex: `CC BY\|CC BY-NC-SA\|…` | |
AudioLicense | string | |
DOI | string | |
Table parameters.csv
The ParameterTable lists the IPA phones that appear in the DoReCo corpus (where a correspondence to the X-SAMPA representation could be determined). Where possible, IPA phones are linked to their CLTS BIPA representation, which gives access to the CLTS feature system.
property | value |
---|---|
dc:conformsTo | CLDF ParameterTable |
dc:extent | 335 |
Name/Property | Datatype | Description |
---|---|---|
ID | string Regex: `[a-zA-Z0-9_\-]+` | Primary key |
Name | string | The IPA representation of the sound. |
Description | string | |
ColumnSpec | json | |
CLTS_ID | string | CLTS ID of the sound, i.e. the underscore-separated ordered list of features of the sound. |
Table speakers.csv
property | value |
---|---|
dc:extent | 289 |
Name/Property | Datatype | Description |
---|---|---|
ID | string | Primary key |
Language_ID | string | References languages.csv::ID |
age | integer | Speaker age. See also age_assignment_certain. |
age_assignment_certain | string Valid choices: `certain`, `approximate` | |
sex | string Valid choices: `m`, `f` | Speaker sex. |
Table phones.csv
This table lists individual, time-aligned phones.
property | value |
---|---|
dc:extent | 1863702 |
Name/Property | Datatype | Description |
---|---|---|
ph_ID | string | Primary key |
ph | string | See the description of the Token_Type column. |
IPA | string | Link to the corresponding IPA phone, with details given in the ParameterTable. References parameters.csv::ID |
u_ID | string | Utterance ID. Utterances are sequences of words/phones delimited by silent pauses. |
Token_Type | string Valid choices: `label`, `pause`, `xsampa` | Not all rows in this table correspond to actual phones. If a row does, the Token_Type is 'xsampa' and the ph column holds the X-SAMPA representation of the phone; otherwise it is a 'pause' or a 'label'. Labels consist of two opening brackets, the label proper, a closing bracket, the content (optional), and another closing bracket, e.g. `<<ui>word>`. Labels may also appear on their own if the content is not known, e.g. `<<ui>>`. Valid proper labels are: fp (filled pause), fs (false start), pr (prolongation), fm (foreign material), sg (singing), bc (backchannel), id (ideophone), on (onomatopoeic), wip (word-internal pause), ui (unidentifiable). Silent pauses are marked by a special symbol, `<p:>`. The location of silent pauses is manually checked by the DoReCo team, while the symbol itself is inserted by the WebMAUS service. Unlike labels, the `<p:>` symbol has only one of each bracket, and no other content may be included in it. |
start | decimal | Start of the phone in the linked sound file in (floating point) seconds. |
end | decimal | End of the phone in the linked sound file in (floating point) seconds. |
duration | decimal | Duration of the phone in the linked sound file in (floating point) seconds. |
wd_ID | string | Link to corresponding word. References words.csv::wd_ID |
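
The label and pause notation described for the Token_Type column can be taken apart with a small regular expression. The sketch below is an assumption-laden illustration, not the DoReCo implementation: `parse_ph` is a hypothetical helper, the pattern assumes that label content contains no angle brackets, and in practice the Token_Type column already records the type, so re-deriving it from ph is only useful as a cross-check.

```python
import re

# Two opening brackets, the label proper, a closing bracket,
# optional content, and a final closing bracket, e.g. <<ui>word>.
LABEL = re.compile(r"<<(?P<label>[a-z]+)>(?P<content>[^<>]*)>")
PAUSE = "<p:>"

# Label names as listed in the Token_Type description above.
LABEL_NAMES = {
    "fp": "Filled pause", "fs": "False start", "pr": "Prolongation",
    "fm": "Foreign material", "sg": "Singing", "bc": "Backchannel",
    "id": "Ideophone", "on": "Onomatopoeic", "wip": "Word-internal pause",
    "ui": "Unidentifiable",
}


def parse_ph(ph):
    """Return (token_type, label, content) for a value of the ph column."""
    if ph == PAUSE:
        return ("pause", None, None)
    match = LABEL.fullmatch(ph)
    if match and match["label"] in LABEL_NAMES:
        # An empty content part means the label appears on its own.
        return ("label", match["label"], match["content"] or None)
    # Anything else is treated as an X-SAMPA phone (an assumption).
    return ("xsampa", None, ph)
```

With this helper, `<<ui>word>` yields the label `ui` with content `word`, and `<<ui>>` yields the same label with no content.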
Table words.csv
property | value |
---|---|
dc:extent | 896664 |
Name/Property | Datatype | Description |
---|---|---|
wd_ID | string | Primary key |
wd | string | The word form transcribed into orthography. |
Language_ID | string | References languages.csv::ID |
File_ID | string | Link to the audio file to which start and end markers pertain. References media.csv::ID |
Speaker_ID | string | References speakers.csv::ID |
start | decimal | Start of the word in the linked sound file in (floating point) seconds. |
end | decimal | End of the word in the linked sound file in (floating point) seconds. |
duration | decimal | Duration of the word in the linked sound file in (floating point) seconds. |
ref | string | |
Example_ID | string | Words that appear in glossed utterances are linked to an Example. References examples.csv::ID |
mb | list of string (separated by ) | |
ps | list of string (separated by ) | |
gl | list of string (separated by ) | |
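
Because every row of phones.csv points to a word through wd_ID, the phones of each word can be collected with the standard `csv` module. The following sketch uses a tiny in-memory sample in place of the real file; the rows, the IDs, and the helper name `phones_by_word` are invented for illustration, and only the columns needed here are included.

```python
import csv
import io
from collections import defaultdict

# Invented sample with a subset of the phones.csv columns.
PHONES = """ph_ID,ph,Token_Type,wd_ID
p1,k,xsampa,w1
p2,a,xsampa,w1
p3,<p:>,pause,w2
p4,t,xsampa,w3
"""


def phones_by_word(phones_file):
    """Map each wd_ID to the list of its X-SAMPA phones, in file order."""
    grouped = defaultdict(list)
    for row in csv.DictReader(phones_file):
        # Pauses and labels are not actual phones, so they are skipped.
        if row["Token_Type"] == "xsampa":
            grouped[row["wd_ID"]].append(row["ph"])
    return dict(grouped)


words_to_phones = phones_by_word(io.StringIO(PHONES))
```

For the real table, the same function could be given an open file handle for phones.csv instead of the `io.StringIO` wrapper.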
Table glosses.csv
Gloss abbreviations used for glosses in a corpus.
property | value |
---|---|
dc:extent | 2053 |
Name/Property | Datatype | Description |
---|---|---|
ID | string | Primary key |
Gloss | string | |
LGR | boolean | Flag, signaling whether a gloss abbreviation is a standard, Leipzig-Glossing-Rules abbreviation. |
Meaning | string | |
Glottocode | string | References languages.csv::ID |