WIP: Add rules for Swedish #115

andersjohansson · 2020-07-06T16:59:39Z

Initial rules for extracting Swedish. Seems to give reasonable output already, albeit with unusual words here and there that could very well be filtered out with a blocklist of uncommon words.

Let’s try to generate one:
/action blocklist sv 80

MichaelKohler · 2020-07-06T17:03:47Z

Seems it doesn't work in the initial comment, since that doesn't trigger an "issue comment created" event. Should work here though:

/action blocklist sv 80

github-actions · 2020-07-06T17:04:24Z

Job started: https://github.com/Common-Voice/cv-sentence-extractor/actions/runs/159592047

andersjohansson · 2020-07-06T17:10:14Z

One issue that I did think about is that 14-word sentences in Swedish can get pretty long, since Swedish, like German, writes compounds as single words, which makes for fairly long words.
One example from my extraction: “Kulturantropologer undersöker de processer som producerar, upprätthåller och förändrar kulturella beteendemönster, samhällsstrukturer och meningssystem.”
That's 48 syllables.

Would it be reasonable to use a lower max word limit?
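
To make the trade-off concrete, here is a minimal Python sketch that checks a sentence against both a word limit and a letter limit. It is not how the extractor's rules work; the names `MAX_WORDS`, `MAX_LETTERS`, `split_words` and `is_short_enough` and the limit of 100 letters are assumptions made up for illustration.

```python
# Hypothetical limits for illustration; not values from the extractor's rules.
MAX_WORDS = 14
MAX_LETTERS = 100

PUNCTUATION = ".,;:!?\"'“”()"

def split_words(sentence):
    """Split on whitespace and strip punctuation from both ends of each word."""
    stripped = (word.strip(PUNCTUATION) for word in sentence.split())
    return [word for word in stripped if word]

def is_short_enough(sentence, max_words=MAX_WORDS, max_letters=MAX_LETTERS):
    """Accept a sentence only if it passes both the word and the letter limit."""
    words = split_words(sentence)
    letters = sum(len(word) for word in words)
    return len(words) <= max_words and letters <= max_letters

example = (
    "Kulturantropologer undersöker de processer som producerar, "
    "upprätthåller och förändrar kulturella beteendemönster, "
    "samhällsstrukturer och meningssystem."
)
print(len(split_words(example)))  # 14 words, so a word limit alone accepts it
print(is_short_enough(example))   # False: the letter limit rejects it
```

A letter count is only a rough stand-in for speaking time, but it does catch sentences that stay within 14 words while being made up of long compounds.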

This particular sentence would probably be filtered out by a blocklist of words used fewer than 80 times ("meningssystem" occurs 3 times when I ripgrep through the wikiextracted text), but some potentially very long sentences could still be built from fairly common compound words.
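
For illustration, the idea of a frequency-based blocklist can be sketched in Python as below. This is an assumption about the general technique, not the code the blocklist job actually runs; `count_words`, `build_blocklist` and `contains_blocked_word` are hypothetical names, and `min_count=80` mirrors the threshold discussed in this thread.

```python
from collections import Counter

PUNCTUATION = ".,;:!?\"'“”()"

def count_words(sentences):
    """Count lowercased words across all sentences, ignoring punctuation."""
    counts = Counter()
    for sentence in sentences:
        for word in sentence.lower().split():
            word = word.strip(PUNCTUATION)
            if word:
                counts[word] += 1
    return counts

def build_blocklist(sentences, min_count=80):
    """Return every word that occurs fewer than min_count times, sorted."""
    counts = count_words(sentences)
    return sorted(word for word, n in counts.items() if n < min_count)

def contains_blocked_word(sentence, blocklist):
    """True if any word of the sentence is on the blocklist."""
    blocked = set(blocklist)
    words = (word.strip(PUNCTUATION) for word in sentence.lower().split())
    return any(word in blocked for word in words)
```

With a scheme like this, a word such as "meningssystem" that occurs 3 times in the corpus ends up on the blocklist, so any sentence containing it is rejected. Long but common compounds stay off the list, which is why the blocklist alone does not bound sentence length.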

MichaelKohler · 2020-07-06T17:23:30Z

That's definitely something to keep in mind while reviewing. How long does it take to say that sentence?

andersjohansson · 2020-07-06T17:58:15Z

About 8 seconds when I time myself. I think it would be about the same for most people, provided they don't stumble over the words, which is quite possible when reading a sentence like that for the first time. I'll keep that in mind for reviewing. Should the goal be sentences that are fairly straightforward for most people to say, and not too long?

MichaelKohler · 2020-07-06T18:12:37Z

I'd say around 8 seconds is fine. However, not all sentences should be that long, as recording might get quite exhausting after some time.

github-actions · 2020-07-06T20:13:46Z

Job finished: https://github.com/Common-Voice/cv-sentence-extractor/actions/runs/159592047
Don't forget to download the artifacts.

MichaelKohler · 2020-07-11T17:42:56Z

@andersjohansson you'll find the blocklist at the top right of the following page, which is the same link posted in the previous comment: https://github.com/Common-Voice/cv-sentence-extractor/actions/runs/159592047

Anything I could help you with?

andersjohansson · 2020-07-11T18:03:42Z

That’s great! I’m away from my computer for a few weeks now so won’t be able to take it forward for a while.

andersjohansson · 2020-07-27T16:24:23Z

The sample extraction seems to result in an empty file? Could it be that all sentences are rejected for some reason or is there some other problem?

One problem that I have noted with Swedish Wikipedia is that it contains a massive amount of bot-articles by lsjbot (https://en.wikipedia.org/wiki/Lsjbot). This is fine for Wikipedia but very few of these articles contain suitable sample sentences.
Some examples:
https://sv.wikipedia.org/wiki/Hillaby
https://sv.wikipedia.org/wiki/Cyperus_pacificus

A lot of the words from these articles also contribute to the massive list of unusual words to block. Would it be possible to exclude these bot-articles in some way before extracting stuff?

MichaelKohler · 2020-07-28T17:14:25Z

> The sample extraction seems to result in an empty file? Could it be that all sentences are rejected for some reason or is there some other problem?

There seems to have been an error downloading the WikiExtractor script. I've manually restarted the job, let's see if that helps.

> A lot of the words from these articles also contribute to the massive list of unusual words to block. Would it be possible to exclude these bot-articles in some way before extracting stuff?

I thought there was a discussion about that somewhere, but I can't find it. As far as I remember, this is not possible because the output of the WikiExtractor script doesn't include author information.

MichaelKohler · 2020-07-28T17:28:32Z

> There seems to have been an error downloading the WikiExtractor script. I've manually restarted the job, let's see if that helps.

Looks like it doesn't. Will have a look tomorrow.

MichaelKohler · 2020-07-29T18:22:45Z

/action blocklist sv 80

(ignore the output, this is for testing only)

github-actions · 2020-07-29T18:23:23Z

Job started: https://github.com/Common-Voice/cv-sentence-extractor/actions/runs/187520977

MichaelKohler · 2020-07-29T18:30:47Z

@andersjohansson I think I have fixed the issue for now. If you merge master into your branch and push it, it should generate a new sample output.

github-actions · 2020-07-29T21:16:22Z

Job finished: https://github.com/Common-Voice/cv-sentence-extractor/actions/runs/187520977
Don't forget to download the artifacts.

MichaelKohler added the waiting on feedback label Jul 14, 2020

andersjohansson force-pushed the swedish branch from cfbd8e0 to 4aeedf9 July 27, 2020 15:49

common-voice deleted a comment from github-actions bot Jul 29, 2020

andersjohansson added 3 commits July 31, 2020 12:44

Add rules for Swedish

a9dc649

Add generated Swedish blocklist

429e2b4

Clean up and make Swedish abbreviation check more efficient

13b735b

andersjohansson force-pushed the swedish branch from 4aeedf9 to 13b735b July 31, 2020 10:45

MichaelKohler marked this pull request as draft September 1, 2020 16:10

MichaelKohler changed the base branch from master to main October 27, 2020 17:41
