Stop words

Giannis Daras edited this page Jul 25, 2018 · 1 revision

Stop-words

Welcome to the stop-words wiki page. This page explains how the Greek stop-words list is produced.

In computing, stop words are words that are filtered out before or after the processing of natural language data (text).[1] Though "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and not all tools use such a list. Some tools specifically avoid removing stop words in order to support phrase search.
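As a rough illustration of what "filtering out" means in practice, the sketch below removes stop words from a tokenized Greek sentence. The five-word set and the helper name `remove_stop_words` are assumptions made for this example only; they are not the final list described on this page.

```python
# Illustrative only: a tiny hand-picked set, not the final list from this wiki.
STOP_WORDS = {"και", "το", "η", "ο", "να"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop-word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = "ο σκύλος και η γάτα".split()
print(remove_stop_words(tokens))  # ['σκύλος', 'γάτα']
```

A set is used so that each membership check is constant time, which matters once the list grows to a few hundred words.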

Counting frequencies from Wikipedia dump

  1. First, get a dump of the Greek Wikipedia:

    wget https://dumps.wikimedia.org/elwiki/latest/elwiki-latest-pages-articles.xml.bz2

  2. Next, count word frequencies in the Wikipedia dump and save the 300 most frequent words in a file, following this format. The excerpt below shows the counting step:

    from collections import Counter
    from gensim.corpora import WikiCorpus

    # dictionary={} skips building a gensim vocabulary, which is not needed here.
    # lemmatize=False applies to gensim 3.x; the argument was removed in gensim 4.0.
    wiki = WikiCorpus("elwiki-latest-pages-articles.xml.bz2", lemmatize=False, dictionary={})

    # Stream the articles instead of materializing them in a list; the dump is large.
    words = Counter()
    for tokens in wiki.get_texts():
        words.update(tokens)

The full script can be found here.

Note: A file with frequencies of Greek words can be found here. The first column contains the number of occurrences of the word, the second column the number of documents in which the word occurs, and the third column the word itself.
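The three-column layout described in the note can be produced with a sketch like the one below. It is not the linked script: the function names, the single-space separator, and the toy documents are assumptions for illustration. Any iterable of token lists, such as the output of `wiki.get_texts()`, could be passed in place of the toy data.

```python
from collections import Counter


def count_frequencies(documents):
    """Count total occurrences and document frequency for every token.

    `documents` is any iterable of token lists.
    """
    occurrences = Counter()
    doc_counts = Counter()
    for tokens in documents:
        occurrences.update(tokens)
        doc_counts.update(set(tokens))  # each word counts once per document
    return occurrences, doc_counts


def format_top_words(occurrences, doc_counts, n=300):
    """Return 'occurrences documents word' lines for the n most frequent words."""
    return [
        f"{count} {doc_counts[word]} {word}"
        for word, count in occurrences.most_common(n)
    ]


# Toy documents standing in for the Wikipedia dump.
docs = [["και", "το", "και"], ["το", "σπίτι"]]
occ, dfs = count_frequencies(docs)
for line in format_top_words(occ, dfs, n=2):
    print(line)
```

Keeping the document count next to the raw count helps spot words that are frequent only because a few long articles repeat them.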

Adding words from other sources

The list extracted from Wikipedia alone is not sufficient, because it misses many personal forms, which can be useful stop-word additions for some applications.

For that reason, we added some words from the Open Subtitles list of words with their frequencies. The list can be found here.
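A minimal sketch of reading such a frequency list follows. It assumes each line holds a word and an integer count separated by whitespace; the actual layout of the linked Open Subtitles list should be checked before relying on this. The function names and sample lines are hypothetical.

```python
def parse_frequency_lines(lines):
    """Parse 'word count' lines into (word, count) pairs, skipping malformed lines."""
    pairs = []
    for line in lines:
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            pairs.append((parts[0], int(parts[1])))
    return pairs


def top_words(lines, n):
    """Return the n words with the highest counts, most frequent first."""
    pairs = parse_frequency_lines(lines)
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in pairs[:n]]


# Made-up sample lines; the real list would be read from a file.
sample = ["είναι 120", "εγώ 95", "broken line here", "μου 110"]
print(top_words(sample, 2))  # ['είναι', 'μου']
```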

Cross validation

The most frequent words from the Wikipedia dump list and the Open Subtitles list were concatenated, and the result was checked manually to ensure the quality of the stop-words list.
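The concatenation step could look like the sketch below, which produces a candidate list for manual review. Lowercasing and dropping duplicates are assumptions added here for convenience; the page only states that the lists were concatenated. The name `merge_candidates` is hypothetical.

```python
def merge_candidates(*word_lists):
    """Concatenate word lists, keeping the first occurrence of each word."""
    seen = set()
    merged = []
    for words in word_lists:
        for word in words:
            key = word.strip().lower()
            if key and key not in seen:  # skip blanks and repeats
                seen.add(key)
                merged.append(key)
    return merged


wikipedia_top = ["και", "το", "να"]
subtitles_top = ["το", "Εγώ", "μου", ""]
print(merge_candidates(wikipedia_top, subtitles_top))
```

Preserving the original order keeps the Wikipedia words first, which makes it easier for a reviewer to see what the second source contributed.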

Final list

The final stop-words list can be found here.
