Adapt a lemmatizer for POS tagging of the Kazakh language

Project Overview

This project adapts a lemmatizer based on a Random Forest classifier to predict Part-of-Speech (POS) tags for Kazakh words. The objective was to build an efficient model that accurately identifies the grammatical category of Kazakh words, which is crucial for many natural language processing tasks. I then compared the results with those obtained on a corpus of Turkish tokens (a language with the same roots as Kazakh) and on a corpus of English tokens.

Content

The notebook main.ipynb displays results tables for the three languages. Running it also saves plots to the graphs folder. To keep the notebook clean and readable, all custom functions are defined and commented in functions.py.

Analysis & conclusion (TO DO)

Data Source

Sources of the three datasets:
Kazakh corpus: https://github.com/nlacslab/kazdet/blob/master/data/kdt-NLANU-0.01.connlu.txt.7z
English corpus: https://github.com/UniversalDependencies/UD_English-EWT/blob/master/en_ewt-ud-dev.conllu
Turkish corpus: https://github.com/UniversalDependencies/UD_Turkish-Kenet/blob/master/tr_kenet-ud-dev.conllu
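
The English and Turkish files are in CoNLL-U format, so word/tag pairs can be read with a few lines of Python. The sketch below is only an illustration, not the code in functions.py: the function name `read_conllu_tokens` and the sample sentence are hypothetical, and it assumes the standard ten-column, tab-separated CoNLL-U layout. The Kazakh file is distributed as a 7z archive, so it has to be extracted first, and its layout may differ.

```python
from typing import Iterable, List, Tuple


def read_conllu_tokens(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Collect (form, upos) pairs from CoNLL-U formatted lines (hypothetical helper)."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        # Assumption: standard CoNLL-U rows have exactly ten columns.
        if len(columns) != 10:
            continue
        token_id, form, upos = columns[0], columns[1], columns[3]
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in token_id or "." in token_id:
            continue
        pairs.append((form, upos))
    return pairs


# Made-up sample sentence, used only to show the expected input shape.
sample = [
    "# text = Dogs bark.",
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_",
    "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\t_",
    "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_",
    "",
]
pairs = read_conllu_tokens(sample)
print(pairs)
```

To read a real file, pass an open file handle instead of the sample list, for example `read_conllu_tokens(open("en_ewt-ud-dev.conllu", encoding="utf-8"))`.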

Methodology

The algorithm used in this project is the Extra Trees classifier, run with the scikit-learn library: https://scikit-learn.org/dev/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html
The methodology is adapted from this paper: https://www.scielo.org.mx/scielo.php?pid=S1405-55462020000301353&script=sci_arttext&tlng=en
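
As a rough illustration of how an Extra Trees classifier can map words to POS tags, here is a minimal sketch. It is not the feature set used in this project or in the original paper: the character-suffix features, the helper `suffix_features`, the toy word list, and the hyperparameters are all assumptions chosen to keep the example short.

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline


def suffix_features(word, max_len=3):
    """Describe a word by its length and its last 1..max_len characters (assumed features)."""
    lowered = word.lower()
    features = {"length": len(lowered)}
    for n in range(1, max_len + 1):
        if len(lowered) >= n:
            features[f"suffix_{n}"] = lowered[-n:]
    return features


# Toy training data, made up for this sketch; a real run would use corpus tokens.
words = ["running", "walking", "houses", "cats", "quickly", "slowly"]
tags = ["VERB", "VERB", "NOUN", "NOUN", "ADV", "ADV"]

# DictVectorizer one-hot encodes the string features and keeps numeric ones as-is.
model = make_pipeline(
    DictVectorizer(),
    ExtraTreesClassifier(n_estimators=50, random_state=0),
)
model.fit([suffix_features(w) for w in words], tags)

# Predict tags for unseen words; results on such a tiny sample are not meaningful.
predicted = model.predict([suffix_features(w) for w in ["jumping", "dogs"]])
print(list(predicted))
```

Suffix features are a natural starting point for an agglutinative language like Kazakh, where much of the grammatical information sits in word endings, but the choice here is only illustrative.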


Olivierjaylet/Adapt-a-lemmatizer-for-POS-tagging-of-Kazakh-language
