Before you use the DoReCo CLDF data you should read through the data model description at cldf/README.md.
This dataset is derived from the DoReCo data as follows:
- DoReCo data is limited to the core data with annotations released under a license without ND (no derivatives) clause (because we add annotations which would violate this license).
- Morpheme-aligned data is converted into IGT instances in an ExampleTable.
- Two minor issues with the DoReCo data are fixed, namely a handful of phones in the Evenki corpus being linked to two words, and phones/words in the Yucatec corpus having a typo in the speaker reference.
- Links from phones to IPA symbols, and eventually to CLTS sounds, are added, based on the orthography profile in `etc/orthography.tsv`.
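An orthography profile is essentially a lookup table from grapheme sequences (here X-SAMPA symbols) to IPA. As a rough sketch of how such a mapping is applied (the symbols and mappings below are illustrative only, not the actual content of etc/orthography.tsv):

```python
# Sketch of an orthography-profile-style lookup, mapping X-SAMPA symbols
# to IPA. The mappings below are made up for illustration.
PROFILE = {
    "S": "ʃ",    # X-SAMPA S -> IPA esh
    "tS": "tʃ",  # affricate
    "a": "a",
}

def to_ipa(xsampa_tokens):
    """Map a list of X-SAMPA tokens to IPA, keeping unknown tokens as-is."""
    return [PROFILE.get(t, t) for t in xsampa_tokens]

print(to_ipa(["tS", "a", "S"]))  # ['tʃ', 'a', 'ʃ']
```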
Note that the `cldfbench.Dataset` implementation in the Python module `cldfbench_doreco.py` provides functionality to also
- download the ND-licensed data (which is appropriate for analysis, but not for re-distribution),
- download the audio files on which the DoReCo data is based.
To do so,
- install the required Python packages (preferably in a fresh virtual environment) via `pip install -e .`,
- download (and unpack) the CLTS v2.2.0 data from DOI: 10.5281/zenodo.5583682,
- download (and unpack) the Glottolog v4.8 data from DOI: 10.5281/zenodo.7398962,
- then run `cldfbench download cldfbench_doreco.py` and answer appropriately when prompted.

The CLDF data can then be created by running
`cldfbench makecldf cldfbench_doreco.py --glottolog PATH/TO/glottolog-4.8/`
Due to the size of the DoReCo corpus - ~2,000,000 annotated phones (if ND-licensed data is included) - analysing the data is made a lot easier (and quicker) when it is accessed via SQL[^1] from the CLDF SQLite database.
Create the SQLite database by running
cldf createdb cldf/Generic-metadata.json doreco.sqlite
An entity-relationship diagram visualizing the schema of the resulting database looks as follows:
Notes:
- CLDF's `ParameterTable` stores metadata about sounds, linked from `phones.csv`, if `Token_Type` is `xsampa` and an IPA sound corresponding to the X-SAMPA symbol could be determined.
- The database has 4 non-CLDF-standard tables:
  - `glosses.csv`, listing gloss abbreviations used in IGT examples,
  - `speakers.csv`, providing metadata about (core) speakers, linked from `words.csv` if available,
  - `words.csv`, listing the time-aligned words in the corpus,
  - `phones.csv`, listing the time-aligned phones in the corpus.

These non-CLDF-standard tables are named after the corresponding filename. Thus, to prevent the `.` in the name from confusing SQLite, the names must always be quoted, i.e. wrapped in quotes.
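The effect of the quoting can be sketched with Python's built-in sqlite3 module and a toy table (the real tables live in doreco.sqlite):

```python
import sqlite3

# A toy table whose name contains a dot, like the non-CLDF tables
# in the DoReCo database.
db = sqlite3.connect(":memory:")
db.execute('CREATE TABLE "phones.csv" (cldf_id TEXT)')
db.execute('INSERT INTO "phones.csv" VALUES (?)', ("p1",))

# Unquoted, SQLite parses phones.csv as <schema>.<table> and fails:
try:
    db.execute("SELECT count(*) FROM phones.csv")
except sqlite3.OperationalError as e:
    print("unquoted fails:", e)

# Quoted (double quotes, backticks or single quotes all work):
print(db.execute('SELECT count(*) FROM "phones.csv"').fetchone()[0])  # 1
```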
Time-aligned phones and words make up the core contribution of the DoReCo dataset. This data is stored in the `phones.csv` and `words.csv` tables, respectively, and can be queried in a straightforward way.
Retrieving particular phones, e.g. word-initial ones, is best done using somewhat advanced SQL constructs like window functions, though. An SQL query to retrieve word-initial phones could look as follows:
SELECT s.* FROM (
SELECT
p.*,
row_number() OVER (PARTITION BY wd_ID ORDER BY cldf_id) rownum
FROM
'phones.csv' AS p
) AS s
WHERE
s.rownum = 1 AND s.token_type = 'xsampa';
Now, since SQLite supports views, we can hide this complexity by creating a corresponding view:
CREATE VIEW word_initials AS
SELECT s.* FROM (
SELECT
p.*,
row_number() OVER (PARTITION BY wd_ID ORDER BY cldf_id) rownum
FROM
'phones.csv' AS p
) AS s
WHERE
s.rownum = 1 AND s.token_type = 'xsampa';
and then use this view just like the `phones.csv` table:
SELECT COUNT(*) FROM word_initials;
Note that views are regular SQL schema objects, and thus persisted in the database, i.e. they will survive closing the database connection. Views can be deleted by running a `DROP VIEW <name>` query.
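This persistence can be verified with a quick sketch, using Python's sqlite3 with a throwaway database file and made-up table/view names:

```python
import os
import sqlite3
import tempfile

# Create a database file, define a view, and close the connection.
path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")
db = sqlite3.connect(path)
db.execute("CREATE TABLE t (n INTEGER)")
db.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
db.execute("CREATE VIEW big AS SELECT n FROM t WHERE n > 1")
db.commit()
db.close()

# Reconnect: the view survived closing the connection.
db = sqlite3.connect(path)
print(db.execute("SELECT count(*) FROM big").fetchone()[0])  # 2
db.execute("DROP VIEW big")  # ... and can be removed again
```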
Some useful views are defined in etc/views.sql, and can be "installed" in the database by running
sqlite3 -echo doreco.sqlite < etc/views.sql
The default representation for phones in the DoReCo corpus is X-SAMPA, but the CLDF dataset adds IPA representations as well as mappings to CLTS sounds. Thus, information about features of phones can be inferred from the `cldf_cltsReference` column, which stores IDs from https://raw.githubusercontent.com/cldf-clts/clts/v2.2.0/data/sounds.tsv
E.g. the following query counts word-initial consonants and vowels:
SELECT
sound_class,
count(cldf_id) AS word_initial_instances
FROM (
SELECT
wi.cldf_id,
CASE WHEN ipa.cldf_cltsReference LIKE '%_consonant' THEN 'consonant' ELSE 'vowel' END sound_class
FROM
word_initials AS wi,
parametertable AS ipa
WHERE
wi.cldf_parameterReference = ipa.cldf_id
) AS s
GROUP BY sound_class;
This will output a result similar to:

| sound_class | word_initial_instances |
| --- | --- |
| consonant | 313569 |
| vowel | 80820 |
Some kinds of analysis make most sense on the utterance level, e.g. computing speech rate. (Utterances are defined as any chunk of speech delimited by silent pauses.) To make this possible, `phones.csv` contains a column `u_id` which allows for aggregating or partitioning phone data by utterance.
Computing speech rate per utterance as phones per second can be done with the following query:
SELECT
p.u_id AS u_id,
count(p.cldf_id)/sum(p.duration) AS speech_rate
FROM
'phones.csv' AS p
GROUP BY p.u_id;
Again, this could be stored as a view (e.g. `utterances` in etc/views.sql), and joined as needed for speech-rate-sensitive analysis.
With a view `utterance_initials` (see etc/views.sql) we could compute average speech rates per language as follows:
SELECT
w.cldf_languagereference,
AVG(u.speech_rate) AS sr
FROM
utterance_initials AS ui,
'words.csv' AS w,
utterances AS u
WHERE
u.u_id = ui.u_id AND ui.wd_id = w.cldf_id
GROUP BY w.cldf_languagereference
ORDER BY sr;
After storing this query in a file `sr_by_lang.sql` we can run it and pipe the output to a tool such as termgraph to get a quick overview:
$ sqlite3 -csv doreco.sqlite < sr_by_lang.sql | termgraph
kama1351: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.50
nngg1234: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 9.43
lowe1385: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 9.49
trin1278: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 10.15
yong1270: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 10.21
resi1247: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 10.45
sadu1234: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 10.54
arap1274: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 10.92
tsim1256: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 11.40
even1259: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 11.56
sanz1248: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 11.57
beja1238: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 11.58
orko1234: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 11.59
jeha1242: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 11.62
bora1263: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 11.72
sout3282: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 11.76
svan1243: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 11.86
dolg1241: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 11.89
goem1240: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 11.97
movi1243: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 12.03
cash1254: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 12.05
anal1239: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 12.18
kaka1265: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 12.20
stan1290: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 12.30
teop1238: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 12.35
taba1259: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 12.41
kark1256: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 12.49
pnar1238: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 12.54
ngal1292: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 12.78
sout2856: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 12.84
ruul1235: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 12.99
komn1238: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 13.03
jeju1234: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 13.08
savo1255: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 13.12
port1286: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 13.30
sumi1235: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 13.43
apah1238: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 13.81
goro1270: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 14.08
nort2641: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 14.11
bain1259: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 14.27
texi1237: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 14.41
vera1241: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 14.85
Speaker information is available from the `speakers.csv` table, which can be joined to phones via `words.csv`. E.g. the following query counts phones per speaker sex:
SELECT
s.sex AS sex,
count(p.cldf_id) AS num_phones
FROM
'phones.csv' as p,
'words.csv' as w,
'speakers.csv' as s
WHERE
p.wd_id = w.cldf_id AND w.speaker_id = s.cldf_id
GROUP BY s.sex;
| sex | num_phones |
| --- | --- |
| f | 721358 |
| m | 1142344 |
Correlations between speaker metadata and IPA phone metadata can be assessed by joining the `ParameterTable` as well:
SELECT
s.sex AS sex,
ipa.cldf_name AS IPA,
count(p.cldf_id) AS num
FROM
'phones.csv' AS p,
'words.csv' AS w,
'speakers.csv' AS s,
parametertable AS ipa
WHERE
p.wd_id = w.cldf_id AND
w.speaker_id = s.cldf_id AND
p.cldf_parameterReference = ipa.cldf_id AND
ipa.cldf_cltsreference LIKE '%breathy%'
GROUP BY p.cldf_name, s.sex;
lending superficial support to the hypothesis of "breathiness as a feminine voice characteristic" (DOI: 10.1016/j.jvoice.2007.08.002):
| sex | IPA | num |
| --- | --- | --- |
| f | lʱ | 31 |
| m | lʱ | 4 |
| f | nʱ | 16 |
| m | nʱ | 7 |
Although the main focus of the DoReCo dataset is the time-aligned phones, the data also contains many glossed example sentences. These are available in the CLDF ExampleTable.
So, for example, if we are interested in particular phenomena, we can first inspect the gloss labels used for a language, and the number of examples containing each label:
SELECT
g.cldf_name AS label,
count(DISTINCT e.cldf_id) AS freq
FROM
'glosses.csv' AS g,
exampletable AS e
WHERE
e.cldf_gloss LIKE '%' || g.cldf_name || '%' AND
g.cldf_languagereference = 'sout3282' AND
e.cldf_languagereference = 'sout3282'
GROUP BY g.cldf_name
ORDER BY freq
LIMIT 3;
Notes:
- We use SQLite's string concatenation operator `||` to string together a suitable argument for the `LIKE` operator, making sure we match glosses containing a given label (e.g. `HORT`).
- To make sure the `LIKE` operator matches in a case-sensitive way, you may have to issue a `PRAGMA` statement to force the correct behaviour: `PRAGMA case_sensitive_like=ON;`
| label | freq |
| --- | --- |
| SUBJ | 1 |
| HORT | 3 |
| SUPR | 7 |
And now we can list the aligned analyzed words and gloss lines of the matching IGT examples:
SELECT
cldf_analyzedword || char(10) || cldf_gloss || char(10)
FROM
exampletable
WHERE
cldf_gloss LIKE '%HORT%' AND cldf_languagereference = 'sout3282' ;
Note: We use SQLite's `char` function to insert newline characters in the output to "align" the IGTs.
**** and he says **** put tha' pony in **** he says and **** in the cart **** and **** let's try him **** he says **** I want that for Tom-Smith atFaversham **** if it suit-s him
**** and 3SG.M say.PRS.3SG **** put.IMP DIST.SG pony in **** 3SG.M say.PRS.3SG and **** in the cart **** and **** HORT try.INF 3SG.M.OBL **** 3SG.M say.PRS.3SG **** 1SG want.PRS DIST.SG for Tom-Smith at Faversham **** if 3SG.N suit-PRS.3SG 3SG.M.OBL
**** well **** put him in **** let's try him
**** well **** put.IMP 3SG.M.OBL in **** HORT try.INF 3SG.M.OBL
[INT] **** let's leave that for another occasion shall-we
[INT] **** HORT leave.INF DIST.SG for another occasion shall-2PL
The DoReCo CLDF dataset also includes information about the publicly available audio files underlying the corpora. The `start` and `end` columns in `words.csv` and `phones.csv` provide time stamps relative to the linked files.
Thus, the following query retrieves the longest word in the sout3282 corpus, and its position in the audio file:
sqlite>
SELECT
w.cldf_id,
w.cldf_exampleReference,
w.cldf_name,
count(p.cldf_id) AS wl,
f.cldf_downloadUrl,
w.start,
w.end
FROM
'phones.csv' AS p,
'words.csv' AS w,
mediatable AS f
WHERE
p.wd_id = w.cldf_id AND
w.cldf_languageReference = 'sout3282' AND
w.cldf_mediaReference = f.cldf_id
GROUP BY w.cldf_id
ORDER BY wl DESC
LIMIT 1;
sout3282_w020797|sout3282-1322|agricultural|12|https://api.nakala.fr/data/10.34847%2Fnkl.dcfbh2yw/830afcf32230aa859b39a92137edbdb18c5b174f|1950.493|1951.383
Plugging this data into HTML as follows
<html>
<body>
<p>agricultural</p>
<audio preload="auto"
controls="controls"
src="https://api.nakala.fr/data/10.34847%2Fnkl.dcfbh2yw/830afcf32230aa859b39a92137edbdb18c5b174f#t=1950.493,1951.383">
</audio>
</body>
</html>
and saving the file as `agricultural.html`, you get a light-weight corpus viewer: open the file in your browser and click the play button to listen to the word.
IGT examples are linked to audio files as well. So we can do the same for the glossed example linked to the word agricultural:
SELECT
e.cldf_primaryText,
'<table><tr>' || char(10) || '<td>' || replace(e.cldf_analyzedword, char(9), '</td><td>') || '</td></tr>' || char(10) || '<tr><td>' || replace(e.cldf_gloss, char(9), '</td><td>') || '</td></tr></table>',
e.cldf_translatedText,
f.cldf_downloadUrl,
e.start,
e.end
FROM
exampletable AS e,
mediatable AS f
WHERE
e.cldf_mediareference = f.cldf_id AND e.cldf_id = 'sout3282-1322';
Note: While somewhat cumbersome, basic HTML can be strung together in SQL. In the above code we use the SQLite operator `||` to concatenate strings, the `char` function to express the "special" characters `\n` (newline, `char(10)`) and `\t` (tab, `char(9)`), and the `replace` function to turn the tab-separated aligned words into HTML table cells.
Again, we can plug the result
He knew all the -- Well, they kne-- the farms -- the bank managers them days, in the agricultural, knew as much about a farm as the farmer did, pretty well.|
<table><tr>
<td>he</td><td>knew</td><td>all</td><td>****</td><td><p:></td><td>well</td><td>they</td><td>kne</td><td>****</td><td>farm-s</td><td>the</td><td>****</td><td>bank-manager</td><td>them</td><td>day-s</td><td><p:></td><td>in</td><td>the</td><td>agricultural</td><td><p:></td><td>knew</td><td>as</td><td>much</td><td>about</td><td>a</td><td>farm</td><td>as</td><td>the</td><td>farmer</td><td>did</td><td>pretty</td><td>well</td></tr>
<tr><td>3SG.M</td><td>know.PST</td><td>all</td><td>****</td><td><p:></td><td>well</td><td>3PL</td><td>NC</td><td>****</td><td>farm-PL</td><td>the</td><td>****</td><td>bank_manager-PL</td><td>DIST.PL</td><td>day-PL</td><td><p:></td><td>in</td><td>the</td><td>agricultural</td><td><p:></td><td>know.PST</td><td>as</td><td>much</td><td>about</td><td>a</td><td>farm</td><td>as</td><td>the</td><td>farmer</td><td>do.PST</td><td>pretty</td><td>well</td></tr></table>
The bank managers in those days, in the agricultural, knew as much about a farm as the farmer did, pretty well.
https://api.nakala.fr/data/10.34847%2Fnkl.dcfbh2yw/830afcf32230aa859b39a92137edbdb18c5b174f
1944.132
1955.501
into an HTML file to get a (very) basic corpus viewer:
<html>
<body>
<p><i>
He knew all the -- Well, they kne-- the farms -- the bank managers them days, in the agricultural, knew as much about a farm as the farmer did, pretty well.
</i></p>
<table><tr>
<td>he</td><td>knew</td><td>all</td><td>****</td><td><p:></td><td>well</td><td>they</td><td>kne</td><td>****</td><td>farm-s</td><td>the</td><td>****</td><td>bank-manager</td><td>them</td><td>day-s</td><td><p:></td><td>in</td><td>the</td><td>agricultural</td><td><p:></td><td>knew</td><td>as</td><td>much</td><td>about</td><td>a</td><td>farm</td><td>as</td><td>the</td><td>farmer</td><td>did</td><td>pretty</td><td>well</td></tr>
<tr><td>3SG.M</td><td>know.PST</td><td>all</td><td>****</td><td><p:></td><td>well</td><td>3PL</td><td>NC</td><td>****</td><td>farm-PL</td><td>the</td><td>****</td><td>bank_manager-PL</td><td>DIST.PL</td><td>day-PL</td><td><p:></td><td>in</td><td>the</td><td>agricultural</td><td><p:></td><td>know.PST</td><td>as</td><td>much</td><td>about</td><td>a</td><td>farm</td><td>as</td><td>the</td><td>farmer</td><td>do.PST</td><td>pretty</td><td>well</td></tr></table>
<p>
The bank managers in those days, in the agricultural, knew as much about a farm as the farmer did, pretty well.
</p>
<audio id="audio2"
preload="auto"
controls="controls"
src="https://api.nakala.fr/data/10.34847%2Fnkl.dcfbh2yw/830afcf32230aa859b39a92137edbdb18c5b174f#t=1944.132,1955.501">
</audio>
</body>
</html>
To make it easier to run parametrized queries (aka queries with placeholders), this dataset - when installed via `pip install -e .` - registers a cldfbench command `query`. So, with a query
SELECT count(*) from `phones.csv` where duration > ?;
in a file called `phones_by_duration.sql`, you can run the parametrized query from the command line:
$ cldfbench doreco.query phones_by_duration.sql 0.7
count(*)
----------
60218
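Under the hood this uses SQLite's standard `?` placeholder mechanism, which binds the command line argument at query execution time. A sketch of the same idea with Python's built-in sqlite3 and a toy table (column names modelled on, but much simpler than, the real schema):

```python
import sqlite3

# Toy stand-in for the phones table, with invented durations.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE `phones.csv` (cldf_id TEXT, duration REAL)")
db.executemany(
    "INSERT INTO `phones.csv` VALUES (?, ?)",
    [("p1", 0.1), ("p2", 0.8), ("p3", 1.2)],
)

# The placeholder is bound at execution time, so the same SQL can be
# reused with different thresholds - just as doreco.query reuses the
# SQL file with different command line arguments.
sql = "SELECT count(*) FROM `phones.csv` WHERE duration > ?"
for threshold in (0.05, 0.7):
    print(threshold, db.execute(sql, (threshold,)).fetchone()[0])
# 0.05 3
# 0.7 2
```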
Since phones are mapped to BIPA sounds listed in `ParameterTable`, they can be filtered based on feature values, because the column `ParameterTable.cldf_cltsReference` contains the sound name, which is a `_`-delimited concatenation of feature values for the sound.
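Splitting such a sound name on `_` yields the individual feature values; the sound name used below is shaped like a CLTS name but chosen for illustration:

```python
# Sketch: CLTS sound names are '_'-delimited feature value strings
# ending in the sound class. The example name is illustrative.
def features(clts_name):
    """Split a CLTS sound name into (sound class, feature values)."""
    *values, sound_class = clts_name.split("_")
    return sound_class, values

print(features("voiced_bilabial_plosive_consonant"))
# ('consonant', ['voiced', 'bilabial', 'plosive'])
```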
Thus, for example, selecting pulmonic consonants can be done with SQL as follows:
SELECT * FROM
`phones.csv` AS phone,
parametertable as sound
WHERE
phone.cldf_parameterReference = sound.cldf_id AND
sound.cldf_cltsReference LIKE '%_consonant' AND
sound.cldf_cltsReference NOT LIKE '%click%' AND
sound.cldf_cltsReference NOT LIKE '%implosive%' AND
sound.cldf_cltsReference NOT LIKE '%ejective%'
;
If you are accessing the DoReCo SQLite data from R, you can make use of the math extensions available with RSQLite to push the "heavy lifting" when filtering outliers down to the database.
Thus, filtering out phones that are unusually long for the respective speaker (indicating incorrect annotation) can be done by running SQL as follows
> sql = "SELECT
count(*)
FROM
`phones.csv` AS phone,
`words.csv` AS word
LEFT JOIN
(
SELECT
w.speaker_id, avg(p.duration) + 3 * stdev(p.duration) AS threshold
FROM
`phones.csv` AS p,
`words.csv` AS w
WHERE
p.cldf_parameterreference IS NOT NULL AND
p.wd_id = w.cldf_id
GROUP BY w.speaker_id
) AS t
ON
word.speaker_id = t.speaker_id
WHERE
phone.cldf_parameterReference IS NOT NULL AND
phone.wd_id = word.cldf_id AND
phone.duration < t.threshold
;"
via RSQLite:
> library(DBI)
> library(RSQLite)
> db <- dbConnect(RSQLite::SQLite(), "doreco.sqlite")
> RSQLite::initExtension(db)
> dbGetQuery(db, sql)
count(*)
1 1681433
Note that the SQLite library used in `cldfbench doreco.query` also supports math functions as well as `stdev` as an aggregate function. Thus, the above query can also be run by saving the SQL as `query.sql` and running
cldfbench doreco.query query.sql
count(*)
----------
1681433
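The logic of the query - a per-speaker cutoff of avg(duration) + 3 * stdev(duration) - can be sketched in plain Python; the speaker IDs and durations below are invented:

```python
import statistics
from collections import defaultdict

# Invented (speaker_id, phone duration) pairs standing in for the
# joined phones.csv / words.csv rows of the query above.
phones = (
    [("s1", 0.1)] * 11 + [("s1", 2.0)]  # one clearly overlong phone
    + [("s2", 0.2), ("s2", 0.22), ("s2", 0.21)]
)

# Per-speaker cutoff: avg(duration) + 3 * stdev(duration), mirroring
# the subquery that computes t.threshold.
by_speaker = defaultdict(list)
for speaker, duration in phones:
    by_speaker[speaker].append(duration)
threshold = {
    speaker: statistics.mean(ds) + 3 * statistics.stdev(ds)
    for speaker, ds in by_speaker.items()
}

# Keep only phones shorter than their speaker's cutoff.
kept = [(s, d) for s, d in phones if d < threshold[s]]
print(len(kept), "of", len(phones))  # 14 of 15 (the 2.0s phone is dropped)
```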
Footnotes

[^1]: For a short overview of SQL and how to access SQL databases (and links to further reading), see https://github.com/dlce-eva/dlce-eva/blob/main/doc/sql.md