WIP: Add rules for Swedish #115

Draft
wants to merge 3 commits into
base: main

Conversation

andersjohansson
Contributor

Initial rules for extracting Swedish. Seems to give reasonable output already, albeit with unusual words here and there that could very well be filtered out with a blocklist of uncommon words.

Let’s try to generate one:
/action blocklist sv 80

@MichaelKohler
Member

This doesn't seem to work in the initial comment, as the pull request description doesn't count as a newly created issue comment. It should work here though:

/action blocklist sv 80

@github-actions

github-actions bot commented Jul 6, 2020

@andersjohansson
Contributor Author

One issue that I did think about is that 14-word sentences in Swedish can tend to be pretty long, as Swedish, like German, writes compound words together, which gives fairly long words.
One example from my extraction: “Kulturantropologer undersöker de processer som producerar, upprätthåller och förändrar kulturella beteendemönster, samhällsstrukturer och meningssystem.”
That's 48 syllables.

Would it be reasonable to use a lower max word limit?
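
To make the question concrete, here is a minimal Python sketch of a length check that caps characters as well as words. It is only an illustration, not how the extractor implements its rules: the function name `within_limits` and the 100-character cap are hypothetical, and only the 14-word figure comes from the discussion above.

```python
def within_limits(sentence: str, max_words: int = 14, max_chars: int = 100) -> bool:
    """Accept a sentence only if it stays under both the word cap and the character cap.

    max_chars is a hypothetical extra limit; the value 100 is an assumption.
    """
    words = sentence.split()
    return len(words) <= max_words and len(sentence) <= max_chars


# The example sentence from the extraction: 14 words, but far longer than 100 characters.
example = (
    "Kulturantropologer undersöker de processer som producerar, "
    "upprätthåller och förändrar kulturella beteendemönster, "
    "samhällsstrukturer och meningssystem."
)

print(len(example.split()), within_limits(example))
```

A word cap alone lets this sentence through, while a character cap (or simply a lower word cap) would reject it. Which of the two is the better fit for Swedish is exactly the open question.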

The particular example sentence would probably be filtered out with a "used more than 80 times" blocklist ("meningssystem" is used 3 times when I ripgrep through the wikiextracted text), but some potentially very long sentences could be constructed from pretty common compound words.
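
As a rough illustration of the frequency idea, the following Python sketch counts word occurrences and blocks every word seen fewer than a threshold number of times. It is a sketch under assumptions, not the project's blocklist script: the helper names `build_blocklist` and `passes_blocklist` are hypothetical, and whether the real threshold is "fewer than 80" or "80 or fewer" is not stated in this thread.

```python
import re
from collections import Counter

# Runs of Unicode letters (no digits or underscores), so å, ä and ö are kept.
WORD_RE = re.compile(r"[^\W\d_]+")


def build_blocklist(sentences, min_count=80):
    """Return the set of lowercased words that occur fewer than min_count times."""
    counts = Counter(
        word.lower() for sentence in sentences for word in WORD_RE.findall(sentence)
    )
    return {word for word, count in counts.items() if count < min_count}


def passes_blocklist(sentence, blocklist):
    """True if the sentence contains no blocked word."""
    return not any(word.lower() in blocklist for word in WORD_RE.findall(sentence))


# Tiny hypothetical corpus with a threshold of 2 instead of 80.
corpus = ["hej världen", "hej igen"]
blocklist = build_blocklist(corpus, min_count=2)
print(sorted(blocklist), passes_blocklist("hej hej", blocklist))
```

With such a filter a rare compound like "meningssystem" would reject the whole sentence, but a long sentence built only from frequent compounds would still pass.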

@MichaelKohler
Member

That's definitely something to keep in mind while reviewing. How long does it take to say that sentence?

@andersjohansson
Contributor Author

About 8 seconds, timing myself. I think it would be something like that for most people, provided they don't stumble on the words, which is quite possible when reading a sentence like that for the first time. I'll keep that in mind for reviewing. Should the goal be for sentences to be fairly straightforward to say for most people, and not too long?

@MichaelKohler
Member

I'd say around 8 seconds is fine. However, not all sentences should be that long, as it might get quite exhausting after recording for some time.

@github-actions

github-actions bot commented Jul 6, 2020

Job finished: https://github.com/Common-Voice/cv-sentence-extractor/actions/runs/159592047
Don't forget to download the artifacts.

@MichaelKohler
Member

@andersjohansson you'll find the blocklist at the top right of the following link as posted by the previous comment: https://github.com/Common-Voice/cv-sentence-extractor/actions/runs/159592047

Anything I could help you with?

@andersjohansson
Contributor Author

andersjohansson commented Jul 11, 2020 via email

@andersjohansson
Contributor Author

The sample extraction seems to result in an empty file? Could it be that all sentences are rejected for some reason or is there some other problem?

One problem that I have noted with Swedish Wikipedia is that it contains a massive amount of bot-articles by lsjbot (https://en.wikipedia.org/wiki/Lsjbot). This is fine for Wikipedia but very few of these articles contain suitable sample sentences.
Some examples:
https://sv.wikipedia.org/wiki/Hillaby
https://sv.wikipedia.org/wiki/Cyperus_pacificus

A lot of the words from these articles also contribute to the massive list of unusual words to block. Would it be possible to exclude these bot-articles in some way before extracting stuff?

@MichaelKohler
Member

> The sample extraction seems to result in an empty file? Could it be that all sentences are rejected for some reason or is there some other problem?

There seems to have been an error downloading the WikiExtractor script. I've manually restarted the job, let's see if that helps.

> A lot of the words from these articles also contribute to the massive list of unusual words to block. Would it be possible to exclude these bot-articles in some way before extracting stuff?

I thought there was a discussion about that somewhere, but I can't find it. As far as I remember, this is not possible, as we don't get author information in the output of the WikiExtractor script.

@MichaelKohler
Member

> There seems to have been an error downloading the WikiExtractor script. I've manually restarted the job, let's see if that helps.

Looks like it doesn't. Will have a look tomorrow.

@common-voice common-voice deleted a comment from github-actions bot Jul 29, 2020
@common-voice common-voice deleted a comment from github-actions bot Jul 29, 2020
@MichaelKohler
Member

/action blocklist sv 80

(ignore the output, this is for testing only)

@github-actions

@MichaelKohler
Member

@andersjohansson I think I have fixed the issue for now. If you merge master into your branch and push it, it should generate a new sample output.

@github-actions

Job finished: https://github.com/Common-Voice/cv-sentence-extractor/actions/runs/187520977
Don't forget to download the artifacts.

@MichaelKohler MichaelKohler marked this pull request as draft September 1, 2020 16:10
@MichaelKohler MichaelKohler changed the base branch from master to main October 27, 2020 17:41
2 participants