This project explores the steps involved in Text Categorization and performs a comparative evaluation of a variety of vectorization and prediction techniques from Sklearn.
The major steps of Text Categorization are:
- Preprocessing raw text into feature vectors of various sizes (500, 1000, 2000)
- Training and testing models
- Evaluating the models
Two vectorization techniques are compared, as sketched after this list:
- Bag of Words (Sklearn's CountVectorizer)
- TF-IDF (Sklearn's TfidfVectorizer)
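Below is a minimal sketch of the two vectorizers. The sample documents are placeholders, not the project's actual data, and `max_features` mirrors the 500/1000/2000 vector sizes listed above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Placeholder documents standing in for the preprocessed corpus text.
docs = [
    "grain exports rise as prices fall",
    "oil prices climb on supply fears",
    "grain prices rise despite record exports",
]

bow = CountVectorizer(max_features=1000, stop_words="english")
tfidf = TfidfVectorizer(max_features=1000, stop_words="english")

X_bow = bow.fit_transform(docs)      # sparse matrix of raw term counts
X_tfidf = tfidf.fit_transform(docs)  # counts reweighted by inverse document frequency
print(X_bow.shape, X_tfidf.shape)
```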
The following classifiers are evaluated (see the training sketch after this list):
- Sklearn's Logistic Regression Classifier
- Sklearn's Linear Support Vector Classifier
- Sklearn's Passive Aggressive Classifier
- Sklearn's SGD Classifier with an elastic net penalty
- Sklearn's Random Forest Classifier
- Sklearn's Perceptron
- Sklearn's K-Nearest Neighbors Classifier with 10 neighbors
- Sklearn's Multi-Layer Perceptron Classifier
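A minimal sketch of the train/test loop over a few of the listed classifiers. Here `make_classification` is a synthetic stand-in for the vectorized corpus; the real pipeline fits each model on the matrices produced by the vectorization step.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import (LogisticRegression,
                                  PassiveAggressiveClassifier, SGDClassifier)
from sklearn.svm import LinearSVC

# Synthetic stand-in for the vectorized corpus: 300 documents, 100 features.
X, y = make_classification(n_samples=300, n_features=100, n_informative=20,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "LinearSVC": LinearSVC(),
    "PassiveAggressive": PassiveAggressiveClassifier(),
    "SGD (elastic net)": SGDClassifier(penalty="elasticnet"),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))  # mean accuracy on the held-out split
```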
Each model is scored with the following metrics (see the sketch after this list):
- Accuracy
- Precision (micro & macro)
- Recall (micro & macro)
- F1 (micro & macro)
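A minimal sketch of how these metrics are computed with `sklearn.metrics`; the label arrays below are placeholders standing in for real model output.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 2, 2, 1, 0, 1]  # placeholder ground-truth labels
y_pred = [0, 2, 2, 2, 0, 0, 1]  # placeholder predictions

print("accuracy:", accuracy_score(y_true, y_pred))
for avg in ("micro", "macro"):
    print(avg, "precision:", precision_score(y_true, y_pred, average=avg, zero_division=0))
    print(avg, "recall:", recall_score(y_true, y_pred, average=avg, zero_division=0))
    print(avg, "F1:", f1_score(y_true, y_pred, average=avg, zero_division=0))
```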
- Use the package, dependency, and environment management system conda to install all dependencies:

  ```bash
  conda env create -f environment.yml
  conda activate tcc_env
  ```
- Download the necessary corpora:

  ```python
  import nltk
  nltk.download('reuters')
  nltk.download('stopwords')
  ```
- Run the pipeline:

  ```bash
  python run.py
  ```
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.